Computing with
large
data sets
Richard Bonneau, spring 2009
mini-Lecture 13(week 7):
different reasons for using databases
v22.0480: computing with data, Richard Bonneau Lecture 13
There are a multitude of reasons for using transactional databases when programing with data.
1. the word data is in the word database
2. Reducing the active memory needed to carry out an
operation or search over a large dataset (look for best cor over a matrix with 1,000,000 rows given a single row)
3. Organize multiple interlnked datatypes and use SQL sysntax to conviniently construct queries, relying on SQL to organize data under the hood.
4. Share complex data with another program / language / multiple threads.
splitting up large memory opperations
v22.0480: computing with data, Richard Bonneau Lecture 13
What system memmory would be required to
run lars( y ~ X ) if we had 1,000,000 observations and 2,000 predictors?
What system memory do we need to cluster 500,000 genetic changes mesured for 20,000 individuals?
Assuming there are many fewer classes / clusters / model-components than observations we can use a block aproach , where a small fraction of the data is needed at any given time.
Several methods for dividing opperations of large matricies or datasets exist and databases help us optimise and structure our codes access to arbitrary subsets of the data, flushing what we arn’t currently looking at from active memory.
what are transactional databases
v22.0480: computing with data, Richard Bonneau Lecture 13
We will use SQL type DBMS in this class ( a subset of transactional databases) Transactional databases are ACID
- Atomic : all of a transaction is completed OR none
- Consistent: all completed transactions leave DB in a state compliant with rules - Isolated: you can’t see results until transaction is complete
- Durable: is transaction complete then the change persists even if
program crashes after, system goes down, etc. (within reason i.e. no lightning strike clause)
relational databases
v22.0480: computing with data, Richard Bonneau Lecture 13
SQL is the most common type of relational database management system.
MySQL, oracle, SQLite, etc. are all SQL relational databases “relational” refers to the grouping of entries in the DB
by conditional statements on their attributes. return all rows with attribute.x = x
return parts of rows with attribute.y > y.thresh examples below and in the exercise.
relational databases databases
v22.0480: computing with data, Richard Bonneau Lecture 13
Advantages of SQL like DBMS:
Code and style of code are roughly compatible across many systems. Methods for porting and converting from one system to another exist Or could easily be built.
a sigle page SQL tutorial and links
v22.0480: computing with data, Richard Bonneau Lecture 13
We’ll use R’s functions to create tables.
Many good tutorials and brief docs exist: http://www.sqlcourse.com/
http://www.mysql.com/ http://www.sqlite.org/ select * from USArrests
select Murder from USArrests
select row_names, Murder from USArrests where Murder < 10.0 insert into employee
(first, last, age, address, city, state)
values ('Rich', 'Bonneau', 33, '3 washington sq.', 'New York', 'NY')
delete from employee
where lastname = 'Gentlemen' SELECT id, firstn, lastn, title, salary FROM employee_info
WHERE salary >= 55000.00 AND title = 'Saucier' SELECT g.id, g.mirror, g.diam, e.voltage
FROM geom_table as g, elec_measures as e WHERE g.id = e.id and g.mirrortype = ‘inside’ ORDER BY g.diam
databases in R
v22.0480: computing with data, Richard Bonneau Lecture 13
Core connection to DB - DBI.
http://cran.r-project.org/web/packages/DBI/vignettes/DBI.pdf
DBI requires a driver for the specific DBMS system used
http://cran.r-project.org/web/packages/RMySQL/index.html http://cran.r-project.org/web/packages/RSQLite/index.html
SQLite
v22.0480: computing with data, Richard Bonneau
SQLite is (according to its website the most widely used DBMS ... I’m not sure I buy that ... but it is certainly very handy)
SQLite is a self contained DBMS that is contained in a single C library (a single file that can be compiled and linked
as part of nearly any program)
It uses local files instead of remote connections.
It has many disadvantages when databases are very large,
need to support many connections or threads, and is not beefy enough for lots of tasks (thus the Lite)
It has dynamic typing (columns in tables are typed element-wise)... this drives lots of people nuts.
We are using is because our main interest is breaking up operations and organizing data within a single thread, and because the principles translate to MySQL, etc.
SQLite in R
v22.0480: computing with data, Richard Bonneau Lecture 13
require( DBI ) ## specific DBMS
require( RSQLite )
## could be: Berkeley DB, MySQL, Oracle, ODBC, PostgreSQL ## we choose : SQLite because we're slackers!
# create a SQLite instance and create one connection.
m <- dbDriver("SQLite")
# initialize a new database to a tempfile and copy some data.frame # from the base package into it
tfile <- tempfile()
con <- dbConnect(m, dbname = tfile)
data(USArrests)
dbWriteTable(con, "USArrests", USArrests)
require( lattice )
data( barley )
DBI -> RSQLite -> SQLite
v22.0480: computing with data, Richard Bonneau
The rest of the commands will be DBI and DBI wrapping SQL.
SQLite stuff handled in a mostly
silent way by the DBI connection to RSQLite to SQLite database (just a file)
making a dataframe into a table
v22.0480: computing with data, Richard Bonneau Lecture 13
require( lattice )
data( barley )
dbWriteTable(con, "barley", barley)
rs <- dbSendQuery(con, "select * from USArrests")
d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows
dbHasCompleted(rs)
fetch( rs, n = 1)
d2 <- fetch(rs, n = -1) # extract all remaining data
dbHasCompleted(rs)
dbClearResult(rs)
dbListTables(con)
rs <- dbSendQuery(con, "select Murder from USArrests")
fetch( rs )
dbHasCompleted(rs)
dbClearResult(rs)
rs <- dbSendQuery(con, paste("select row_names, ",
" Murder from USArrests where Murder < 10.0" ))
fetch( rs , n = 10) ## get first 10 fetch( rs , n = -1) ## get rest
dbHasCompleted(rs)
dbClearResult(rs)
dbListTables(con)
R slices in SQL
v22.0480: computing with data, Richard Bonneau Lecture 13
rs <- dbSendQuery(con, "select * from USArrests")
d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows dbHasCompleted(rs) ## returns if rs has un-fetched records left fetch( rs, n = 10)[1:2, 2:3]
fetch( rs, n = 1)
d2 <- fetch(rs, n = -1) # extract all remaining data dbHasCompleted(rs)
dbClearResult(rs) dbListTables(con) # clean up
rs <- dbSendQuery(con, "select Murder from USArrests") fetch( rs )
dbHasCompleted(rs) dbClearResult(rs)
rs <- dbSendQuery(con, paste("select row_names, ", " Murder from USArrests where Murder < 10.0" )) fetch( rs , n = 10) dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile)
R slices in SQL
v22.0480: computing with data, Richard Bonneau Lecture 13
require( DBI )
require( RSQLite )
load("baa.ratios.rda")
## stay away from dots when using SQL !!!
rownames( ratios ) <- gsub( "\\.", "\\_", rownames( ratios ) )
colnames( ratios ) <- gsub( "\\.", "\\_", colnames( ratios ) )
mm <- dbDriver("SQLite")
con <- dbConnect(mm, dbname = tfile)
sql.file <- "ba.ratios.sqlite"
## not legal name
## dbWriteTable(con, "ba_ratios", as.data.frame(ratios) )
dbWriteTable(con, "ba_ratios", data.frame( ratios ) ) ## colnames of dataframe are col names of table
rs <- dbSendQuery(con, "select * from ba_ratios")
d1 <- fetch(rs, n = 10)
dbHasCompleted(rs)
dbClearResult(rs)
col.names <- colnames( d1 )
rm( d1 )
## getting gene names in table
rs <- dbSendQuery( con, "select row_names from ba_ratios") row.names <- fetch( rs , n = -1)
dbHasCompleted(rs)
R slices in SQL
v22.0480: computing with data, Richard Bonneau Lecture 13
### slicing out rows
par( mfrow = c( 2, 1) )
genes.selected <- c(2,4,45)
matplot( t(ratios[ genes.selected, ]) , type = "b", main = "using R matrix") rs <- dbSendQuery( con, paste("select * from ba_ratios where row_names in ( \'", paste( rownames( ratios )[genes.selected], collapse = "\',\'"), "\' )" , sep = "" ) )
d1 <-fetch( rs , n = -1)
matplot( t(d1[, -1]), type = "b", main = "sliced from SQLite db" ) ## -1 gets rid of row_names col
dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile) 11111111 111111 11111 11111 1111 1 1 1 11 11111111111 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1 using R matrix t(ratios[genes.selected, ]) 222 22222222 22222 222 22222 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333333 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333 11111111 111111 11111 11111 1111 1 1 1 11 11111111111 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1
sliced from SQLite db
t(d1[, − 1]) 222 22222222 22222 222 22222 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333333 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333
reading and assignment
v22.0480: computing with data, Richard Bonneau
SQL, SQLite, RSQLite, MySQL doc and tutorials.
non-graded Assignment:
1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)
2. rm( ratios ) ; gc()
3. Redo the cor.explore function using your SQLite db ... never have more than 20 rows in active memory.
4. make and fill a new table that stores for each gene the names of the genes with correlation > 0.75
I nees a volunteer to:
1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)
2. quite out of R, keeping the SQLite database on disk. 2. write a python program that then accesses the
saved SQLite DB and given a gene-name outputs the row of ratios for that gene.