November 3, 2016

Overview

picture of chess pieces

  • What is data?
  • What is a database?
  • Hands-on examples
    • JSON and Map/Reduce
    • sqlite3
    • R with rsql
  • Conclusion

What is "Data"

Let's orient the epistemology

The Word "Data"

Plautus

  • data (plural), datum (singular)
  • etymology
    • borrowed from Latin
    • noun: gift, present
  • English language definitions
    • a measurement
    • a fact from direct observation

Quia non suppetunt dictis data.

Because his gifts do not match his words.

— Plautus, Asinaria 56 (c. 254 BCE – 184 BCE)

Looking at the World

Microscope

  • Observations, measurements, values, results
    • specific to your domain of inquiry
    • data is a general concept
  • data may be recorded or encoded
    • making marks on paper
    • arranging numbers into tables
    • digital encoding
  • data are often the basis for scientific inquiry

Table Topology

Table of Bushel Weights

  • Table: a 2-D matrix with rows, columns, and maybe row/column names
  • Which way is this table oriented?
    • wide: variable in its own column, case/id/event on its own row.
    • long: every data point on its own row. Common columns: id, variable, value. (e.g. time series data.)
    • pivot: some mixture of raw and summary data. Common for data portals and Excel.
  • Transpose: rotate table so columns become rows

Information: a Property of our Universe

Maxwell's Demon

  • Information is physical, just like energy and mass.
    • Maxwell's demon (1872): information reduces entropy
  • Shannon information (1948)
    • entropy is unpredictability
    • information is knowledge about an unpredictable event
  • binary: 0 or 1 ("base-2 number system")
  • bit: the amount information a binary value can contain

Binary Encodings

ASCII

  • byte: 8 bits of information
  • encoding: use information (bytes) to transmit a message (data)
  • ASCII (1960): American Standard Code for Information Interchange; a 1-byte encoding
  • UTF-8 (1993): Unicode Transformation Format; n-byte encoding

Summary: What is "Data?"

punch card

  • data are specific to your research
    • literally: a gift
  • encode your data as digital information
    • possibly with expensive measurement hardware
  • an information system stores and retrieves your data
    • maybe you have access to a digital computer? That would be a good information system.
    • punch cards will work, but I recommend a Solid State Drive (SSD)

Databases

a brief discussion of several ways to store information

Introducing the Database

IBM 2311 memory unit

  • database: a type of information system
    • can be customized for different sorts of work
    • usually has optimizations for "scaling up"
  • a database is software running on hardware
    • download it, run it, use it
  • sometimes a database is embedded inside software
    • e.g. Firefox and Chrome store browser history in a database

Types of Databases

IBM 704 at Langley

  • flat files (1960s)
    • Tab delimited, CSV, fixed width
  • relational (1970s)
    • sqlite3, PostgreSQL, MySQL, MS SQL Server
  • binary "blob" data objects (1980s)
    • Excel, SPSS data file, R environment image
  • key/value storage (2000s)
    • Bigtable, Memcache, Redis, HTML5 localStorage
  • graph (2000s)
    • Neo4J, OrientDB
  • document (2010s)
    • JSON, CouchDB, MongoDB, Hadoop HDFS

Flat Files

csv

  • Essentially just a table (matrix)
  • Tab Delimited
    • ASCII encoding, character #9 is a delimiter
    • literally how a printer would print tables
  • CSV (Comma Separated Values)
    • Excel Dialect
      • quote is "
      • double-quote (i.e. "" for literal quote in field)
      • CRLF
  • Fixed Width: each column is exactly n characters

Binary "blob" Data objects

excel hex dump

  • You need a special program to use these files
    • still basically just a table (like flat files)
    • files usually include data and metadata
  • examples of programs that use blobs
    • Excel
      • includes expressions in addition to data
      • some might argue Excel is the most prolific database software used today
    • SPSS data files
    • R environment image

key/value storage

key value

  • BigTable
  • HTML5 localStorage
  • Redis
  • Memcache

relational

relational database schema

  • Sqlite3 (weakest, easiest)
  • PostgreSQL (recommended)
  • MySQL
  • MS SQL Server

document

JSON

  • JSON
    • data is JavaScript
    • object types: string, integer, float, array, dictionary
    • moving beyond matrix
  • CouchDB
  • MongoDB
  • Hadoop (HDFS with Map/Reduce)

graph

Graph: Alice and Bob are on the Chess team

  • Neo4J
  • OrientDB

Summary: Databases

Hands on: MapReduce

Map and Reduce

Map Reduce

  • functions that take functions as parameters
  • map(f): \(f(x) = x^2\)
    • values: [1, 2, 3]
    • create map of [1, 2, 3] with \(f(x) = x^2\)
      • \(map_1 = 1^2 = 1\)
      • \(map_2 = 2^2 = 4\)
      • \(map_3 = 3^2 = 9\)
    • result: [1, 4, 9]
  • reduce(f): \(sum_i = map_i + sum_{i-1}\)
    • reduce result [1, 4, 9] with \(f = sum\)
    • \(sum_1 = 1 + 0\)
    • \(sum_2 = 4 + 1\)
    • \(sum_3 = 9 + 5\)
    • \(sum = 14\)

Try it yourself

Example: gh-impact

Load a JSON file

get_json("/assets/ghimpact-b3.json");
// wait

View the usernames

Object.keys(df);
df["mattn"]["s"];
// Array.proto.map, Array.proto.reduce
// the scores come from the data frame keys using this map function
var scores = Object.keys(df).map( function(key) { return( df[key]["s"] ) } );
var average = scores.reduce( function(sum, val) { return( sum + val ) }, initialValue=0 ) / scores.length;
Math.max.apply(Math, scores);

Statistics with MapReduce

Mean (average), expressed as a mathematical function:

\(\mu = \frac{1}{n}\displaystyle\sum_{i=1}^{n} x_i\)

MapReduce Equivalent:

  • map: \(map_i = x_{i}\) (retrieve the value)
  • reduce (accumulate):
    • \(sum_{0} = map_{0}\)
    • \(sum_{i} = map_{i} + sum_{i-1}\)
  • \(mean = \frac{1}{n} sum\)

Summary: MapReduce

  • map and reduce functions can be broken down to run on lots of computers at once
  • Many statistical functions can be represented in terms of MapReduce
  • Databases that support MapReduce theoretically provide good scalability
    • Hadoop
    • CouchDB
  • Often uses javascript
  • for some questions, using lots of computers in parallel might be the only feasible way to finish a computation in time

Hands on: R and SQL

Plan

  • get a data set from R: mtcars
  • select from it using SQL
  • export it as CSV to filesystem
  • in sqlite, import CSV and create database
  • use rsqlite connection to select direct from database

Hands on: Python and SQL

Plan

  • open sqlite3 database
  • send a query
  • parse and print the result

Hands on: sqlite3

Plan

  • launch sqlite3 with a DB file
  • help for sqlite
  • how to read errors
  • how to search for more information
  • import and export CSV
  • SELECT count() WHERE
  • SELECT … WHERE
  • SELECT … JOIN
  • SELECT FROM
  • SELECT AS
  • INSERT …
  • UPDATE … WHERE
  • ALTER …
  • DELETE … WHERE

Conclusion

Thank you