Computing with large data sets

(1)

Computing with

large

data sets

Richard Bonneau, spring 2009

mini-Lecture 13(week 7):

(2)

different reasons for using databases

v22.0480: computing with data, Richard Bonneau Lecture 13

There are a multitude of reasons for using transactional databases when programing with data.

1. the word data is in the word database

2. Reducing the active memory needed to carry out an

operation or search over a large dataset (look for best cor over a matrix with 1,000,000 rows given a single row)

3. Organize multiple interlnked datatypes and use SQL sysntax to conviniently construct queries, relying on SQL to organize data under the hood.

4. Share complex data with another program / language / multiple threads.

(3)

splitting up large memory opperations

What system memmory would be required to

run lars( y ~ X ) if we had 1,000,000 observations and 2,000 predictors?

What system memory do we need to cluster 500,000 genetic changes mesured for 20,000 individuals?

Assuming there are many fewer classes / clusters / model-components than observations we can use a block aproach , where a small fraction of the data is needed at any given time.

Several methods for dividing opperations of large matricies or datasets exist and databases help us optimise and structure our codes access to arbitrary subsets of the data, flushing what we arn’t currently looking at from active memory.

(4)

what are transactional databases

We will use SQL type DBMS in this class ( a subset of transactional databases) Transactional databases are ACID

- Atomic : all of a transaction is completed OR none

- Consistent: all completed transactions leave DB in a state compliant with rules - Isolated: you can’t see results until transaction is complete

- Durable: is transaction complete then the change persists even if

program crashes after, system goes down, etc. (within reason i.e. no lightning strike clause)

(5)

relational databases

SQL is the most common type of relational database management system.

MySQL, oracle, SQLite, etc. are all SQL relational databases “relational” refers to the grouping of entries in the DB

by conditional statements on their attributes. return all rows with attribute.x = x

return parts of rows with attribute.y > y.thresh examples below and in the exercise.

(6)

relational databases databases

Advantages of SQL like DBMS:

Code and style of code are roughly compatible across many systems. Methods for porting and converting from one system to another exist Or could easily be built.

(7)

a sigle page SQL tutorial and links

We’ll use R’s functions to create tables.

Many good tutorials and brief docs exist: http://www.sqlcourse.com/

http://www.mysql.com/ http://www.sqlite.org/ select * from USArrests

select Murder from USArrests

select row_names, Murder from USArrests where Murder < 10.0 insert into employee

(first, last, age, address, city, state)

values ('Rich', 'Bonneau', 33, '3 washington sq.', 'New York', 'NY')

delete from employee

where lastname = 'Gentlemen' SELECT id, firstn, lastn, title, salary FROM employee_info

WHERE salary >= 55000.00 AND title = 'Saucier' SELECT g.id, g.mirror, g.diam, e.voltage

FROM geom_table as g, elec_measures as e WHERE g.id = e.id and g.mirrortype = ‘inside’ ORDER BY g.diam

(8)

databases in R

Core connection to DB - DBI.

http://cran.r-project.org/web/packages/DBI/vignettes/DBI.pdf

DBI requires a driver for the specific DBMS system used

http://cran.r-project.org/web/packages/RMySQL/index.html http://cran.r-project.org/web/packages/RSQLite/index.html

(9)

SQLite

v22.0480: computing with data, Richard Bonneau

SQLite is (according to its website the most widely used DBMS ... I’m not sure I buy that ... but it is certainly very handy)

SQLite is a self contained DBMS that is contained in a single C library (a single file that can be compiled and linked

as part of nearly any program)

It uses local files instead of remote connections.

It has many disadvantages when databases are very large,

need to support many connections or threads, and is not beefy enough for lots of tasks (thus the Lite)

It has dynamic typing (columns in tables are typed element-wise)... this drives lots of people nuts.

We are using is because our main interest is breaking up operations and organizing data within a single thread, and because the principles translate to MySQL, etc.

(10)

SQLite in R

require( DBI ) ## specific DBMS

require( RSQLite )

## could be: Berkeley DB, MySQL, Oracle, ODBC, PostgreSQL ## we choose : SQLite because we're slackers!

# create a SQLite instance and create one connection.

m <- dbDriver("SQLite")

# initialize a new database to a tempfile and copy some data.frame # from the base package into it

tfile <- tempfile()

con <- dbConnect(m, dbname = tfile)

data(USArrests)

dbWriteTable(con, "USArrests", USArrests)

require( lattice )

data( barley )

(11)

DBI -> RSQLite -> SQLite

The rest of the commands will be DBI and DBI wrapping SQL.

SQLite stuff handled in a mostly

silent way by the DBI connection to RSQLite to SQLite database (just a file)

(12)

making a dataframe into a table

require( lattice )

data( barley )

dbWriteTable(con, "barley", barley)

rs <- dbSendQuery(con, "select * from USArrests")

d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows

dbHasCompleted(rs)

fetch( rs, n = 1)

d2 <- fetch(rs, n = -1) # extract all remaining data

dbHasCompleted(rs)

dbClearResult(rs)

dbListTables(con)

rs <- dbSendQuery(con, "select Murder from USArrests")

fetch( rs )

dbHasCompleted(rs)

dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ",

" Murder from USArrests where Murder < 10.0" ))

fetch( rs , n = 10) ## get first 10 fetch( rs , n = -1) ## get rest

dbHasCompleted(rs)

dbClearResult(rs)

dbListTables(con)

(13)

R slices in SQL

rs <- dbSendQuery(con, "select * from USArrests")

d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows dbHasCompleted(rs) ## returns if rs has un-fetched records left fetch( rs, n = 10)[1:2, 2:3]

fetch( rs, n = 1)

d2 <- fetch(rs, n = -1) # extract all remaining data dbHasCompleted(rs)

dbClearResult(rs) dbListTables(con) # clean up

rs <- dbSendQuery(con, "select Murder from USArrests") fetch( rs )

dbHasCompleted(rs) dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ", " Murder from USArrests where Murder < 10.0" )) fetch( rs , n = 10) dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile)

(14)

R slices in SQL

require( DBI )

require( RSQLite )

load("baa.ratios.rda")

## stay away from dots when using SQL !!!

rownames( ratios ) <- gsub( "\\.", "\\_", rownames( ratios ) )

colnames( ratios ) <- gsub( "\\.", "\\_", colnames( ratios ) )

mm <- dbDriver("SQLite")

con <- dbConnect(mm, dbname = tfile)

sql.file <- "ba.ratios.sqlite"

## not legal name

## dbWriteTable(con, "ba_ratios", as.data.frame(ratios) )

dbWriteTable(con, "ba_ratios", data.frame( ratios ) ) ## colnames of dataframe are col names of table

rs <- dbSendQuery(con, "select * from ba_ratios")

d1 <- fetch(rs, n = 10)

dbHasCompleted(rs)

dbClearResult(rs)

col.names <- colnames( d1 )

rm( d1 )

## getting gene names in table

rs <- dbSendQuery( con, "select row_names from ba_ratios") row.names <- fetch( rs , n = -1)

dbHasCompleted(rs)

(15)

R slices in SQL

### slicing out rows

par( mfrow = c( 2, 1) )

genes.selected <- c(2,4,45)

matplot( t(ratios[ genes.selected, ]) , type = "b", main = "using R matrix") rs <- dbSendQuery( con, paste("select * from ba_ratios where row_names in ( \'", paste( rownames( ratios )[genes.selected], collapse = "\',\'"), "\' )" , sep = "" ) )

d1 <-fetch( rs , n = -1)

matplot( t(d1[, -1]), type = "b", main = "sliced from SQLite db" ) ## -1 gets rid of row_names col

dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile) 1111₁₁11 1111₁₁ 11111 11111 11₁₁ 1 1 1 11 11₁₁₁₁₁₁1₁₁ 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1 using R matrix t(ratios[genes.selected, ]) 222 22222222 222₂₂ 222 222₂2 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333₃₃₃ 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333 1111₁₁11 1111₁₁ 11111 11111 11₁₁ 1 1 1 11 11₁₁₁₁₁₁1₁₁ 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1

sliced from SQLite db

t(d1[, − 1]) 222 22222222 222₂₂ 222 222₂2 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333₃₃₃ 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333

(16)

reading and assignment

SQL, SQLite, RSQLite, MySQL doc and tutorials.

non-graded Assignment:

1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)

2. rm( ratios ) ; gc()

3. Redo the cor.explore function using your SQLite db ... never have more than 20 rows in active memory.

4. make and fill a new table that stores for each gene the names of the genes with correlation > 0.75

I nees a volunteer to:

1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)

2. quite out of R, keeping the SQLite database on disk. 2. write a python program that then accesses the

saved SQLite DB and given a gene-name outputs the row of ratios for that gene.