• No results found

Computing with large data sets

N/A
N/A
Protected

Academic year: 2021

Share "Computing with large data sets"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Computing with

large

data sets

Richard Bonneau, spring 2009

mini-Lecture 13(week 7):

(2)

different reasons for using databases

v22.0480: computing with data, Richard Bonneau Lecture 13

There are a multitude of reasons for using transactional databases when programing with data.

1. the word data is in the word database

2. Reducing the active memory needed to carry out an

operation or search over a large dataset (look for best cor over a matrix with 1,000,000 rows given a single row)

3. Organize multiple interlnked datatypes and use SQL sysntax to conviniently construct queries, relying on SQL to organize data under the hood.

4. Share complex data with another program / language / multiple threads.

(3)

splitting up large memory opperations

v22.0480: computing with data, Richard Bonneau Lecture 13

What system memmory would be required to

run lars( y ~ X ) if we had 1,000,000 observations and 2,000 predictors?

What system memory do we need to cluster 500,000 genetic changes mesured for 20,000 individuals?

Assuming there are many fewer classes / clusters / model-components than observations we can use a block aproach , where a small fraction of the data is needed at any given time.

Several methods for dividing opperations of large matricies or datasets exist and databases help us optimise and structure our codes access to arbitrary subsets of the data, flushing what we arn’t currently looking at from active memory.

(4)

what are transactional databases

v22.0480: computing with data, Richard Bonneau Lecture 13

We will use SQL type DBMS in this class ( a subset of transactional databases) Transactional databases are ACID

- Atomic : all of a transaction is completed OR none

- Consistent: all completed transactions leave DB in a state compliant with rules - Isolated: you can’t see results until transaction is complete

- Durable: is transaction complete then the change persists even if

program crashes after, system goes down, etc. (within reason i.e. no lightning strike clause)

(5)

relational databases

v22.0480: computing with data, Richard Bonneau Lecture 13

SQL is the most common type of relational database management system.

MySQL, oracle, SQLite, etc. are all SQL relational databases “relational” refers to the grouping of entries in the DB

by conditional statements on their attributes. return all rows with attribute.x = x

return parts of rows with attribute.y > y.thresh examples below and in the exercise.

(6)

relational databases databases

v22.0480: computing with data, Richard Bonneau Lecture 13

Advantages of SQL like DBMS:

Code and style of code are roughly compatible across many systems. Methods for porting and converting from one system to another exist Or could easily be built.

(7)

a sigle page SQL tutorial and links

v22.0480: computing with data, Richard Bonneau Lecture 13

We’ll use R’s functions to create tables.

Many good tutorials and brief docs exist: http://www.sqlcourse.com/

http://www.mysql.com/ http://www.sqlite.org/ select * from USArrests

select Murder from USArrests

select row_names, Murder from USArrests where Murder < 10.0 insert into employee

(first, last, age, address, city, state)

values ('Rich', 'Bonneau', 33, '3 washington sq.', 'New York', 'NY')

delete from employee

where lastname = 'Gentlemen' SELECT id, firstn, lastn, title, salary FROM employee_info

WHERE salary >= 55000.00 AND title = 'Saucier' SELECT g.id, g.mirror, g.diam, e.voltage

FROM geom_table as g, elec_measures as e WHERE g.id = e.id and g.mirrortype = ‘inside’ ORDER BY g.diam

(8)

databases in R

v22.0480: computing with data, Richard Bonneau Lecture 13

Core connection to DB - DBI.

http://cran.r-project.org/web/packages/DBI/vignettes/DBI.pdf

DBI requires a driver for the specific DBMS system used

http://cran.r-project.org/web/packages/RMySQL/index.html http://cran.r-project.org/web/packages/RSQLite/index.html

(9)

SQLite

v22.0480: computing with data, Richard Bonneau

SQLite is (according to its website the most widely used DBMS ... I’m not sure I buy that ... but it is certainly very handy)

SQLite is a self contained DBMS that is contained in a single C library (a single file that can be compiled and linked

as part of nearly any program)

It uses local files instead of remote connections.

It has many disadvantages when databases are very large,

need to support many connections or threads, and is not beefy enough for lots of tasks (thus the Lite)

It has dynamic typing (columns in tables are typed element-wise)... this drives lots of people nuts.

We are using is because our main interest is breaking up operations and organizing data within a single thread, and because the principles translate to MySQL, etc.

(10)

SQLite in R

v22.0480: computing with data, Richard Bonneau Lecture 13

require( DBI ) ## specific DBMS

require( RSQLite )

## could be: Berkeley DB, MySQL, Oracle, ODBC, PostgreSQL ## we choose : SQLite because we're slackers!

# create a SQLite instance and create one connection.

m <- dbDriver("SQLite")

# initialize a new database to a tempfile and copy some data.frame # from the base package into it

tfile <- tempfile()

con <- dbConnect(m, dbname = tfile)

data(USArrests)

dbWriteTable(con, "USArrests", USArrests)

require( lattice )

data( barley )

(11)

DBI -> RSQLite -> SQLite

v22.0480: computing with data, Richard Bonneau

The rest of the commands will be DBI and DBI wrapping SQL.

SQLite stuff handled in a mostly

silent way by the DBI connection to RSQLite to SQLite database (just a file)

(12)

making a dataframe into a table

v22.0480: computing with data, Richard Bonneau Lecture 13

require( lattice )

data( barley )

dbWriteTable(con, "barley", barley)

rs <- dbSendQuery(con, "select * from USArrests")

d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows

dbHasCompleted(rs)

fetch( rs, n = 1)

d2 <- fetch(rs, n = -1) # extract all remaining data

dbHasCompleted(rs)

dbClearResult(rs)

dbListTables(con)

rs <- dbSendQuery(con, "select Murder from USArrests")

fetch( rs )

dbHasCompleted(rs)

dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ",

" Murder from USArrests where Murder < 10.0" ))

fetch( rs , n = 10) ## get first 10 fetch( rs , n = -1) ## get rest

dbHasCompleted(rs)

dbClearResult(rs)

dbListTables(con)

(13)

R slices in SQL

v22.0480: computing with data, Richard Bonneau Lecture 13

rs <- dbSendQuery(con, "select * from USArrests")

d1 <- fetch(rs, n = 10) # extract data in chunks of 10 rows dbHasCompleted(rs) ## returns if rs has un-fetched records left fetch( rs, n = 10)[1:2, 2:3]

fetch( rs, n = 1)

d2 <- fetch(rs, n = -1) # extract all remaining data dbHasCompleted(rs)

dbClearResult(rs) dbListTables(con) # clean up

rs <- dbSendQuery(con, "select Murder from USArrests") fetch( rs )

dbHasCompleted(rs) dbClearResult(rs)

rs <- dbSendQuery(con, paste("select row_names, ", " Murder from USArrests where Murder < 10.0" )) fetch( rs , n = 10) dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile)

(14)

R slices in SQL

v22.0480: computing with data, Richard Bonneau Lecture 13

require( DBI )

require( RSQLite )

load("baa.ratios.rda")

## stay away from dots when using SQL !!!

rownames( ratios ) <- gsub( "\\.", "\\_", rownames( ratios ) )

colnames( ratios ) <- gsub( "\\.", "\\_", colnames( ratios ) )

mm <- dbDriver("SQLite")

con <- dbConnect(mm, dbname = tfile)

sql.file <- "ba.ratios.sqlite"

## not legal name

## dbWriteTable(con, "ba_ratios", as.data.frame(ratios) )

dbWriteTable(con, "ba_ratios", data.frame( ratios ) ) ## colnames of dataframe are col names of table

rs <- dbSendQuery(con, "select * from ba_ratios")

d1 <- fetch(rs, n = 10)

dbHasCompleted(rs)

dbClearResult(rs)

col.names <- colnames( d1 )

rm( d1 )

## getting gene names in table

rs <- dbSendQuery( con, "select row_names from ba_ratios") row.names <- fetch( rs , n = -1)

dbHasCompleted(rs)

(15)

R slices in SQL

v22.0480: computing with data, Richard Bonneau Lecture 13

### slicing out rows

par( mfrow = c( 2, 1) )

genes.selected <- c(2,4,45)

matplot( t(ratios[ genes.selected, ]) , type = "b", main = "using R matrix") rs <- dbSendQuery( con, paste("select * from ba_ratios where row_names in ( \'", paste( rownames( ratios )[genes.selected], collapse = "\',\'"), "\' )" , sep = "" ) )

d1 <-fetch( rs , n = -1)

matplot( t(d1[, -1]), type = "b", main = "sliced from SQLite db" ) ## -1 gets rid of row_names col

dbHasCompleted(rs) dbClearResult(rs) dbListTables(con) dbDisconnect(con) file.info(tfile) file.remove(tfile) 11111111 111111 11111 11111 1111 1 1 1 11 11111111111 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1 using R matrix t(ratios[genes.selected, ]) 222 22222222 22222 222 22222 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333333 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333 11111111 111111 11111 11111 1111 1 1 1 11 11111111111 1 1 11111 0 10 20 30 40 50 − 2 − 1 0 1

sliced from SQLite db

t(d1[, − 1]) 222 22222222 22222 222 22222 2222 2 2 2 22 22 2 2 2 2 2 22 2 22 2 2 2 222 333 333 3 3 333 333 3 3 333333333 33 3 3 3 3 3 3 3 3 33 3 3 3 3 3333 3 3 3 333

(16)

reading and assignment

v22.0480: computing with data, Richard Bonneau

SQL, SQLite, RSQLite, MySQL doc and tutorials.

non-graded Assignment:

1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)

2. rm( ratios ) ; gc()

3. Redo the cor.explore function using your SQLite db ... never have more than 20 rows in active memory.

4. make and fill a new table that stores for each gene the names of the genes with correlation > 0.75

I nees a volunteer to:

1. create a SQLite database and make a new table holding ratios (from baa.ratios.rda)

2. quite out of R, keeping the SQLite database on disk. 2. write a python program that then accesses the

saved SQLite DB and given a gene-name outputs the row of ratios for that gene.

References

Related documents

In addition, polyvinyl chloride (PVC)- supported palladium species have also been reported as catalyst in the Sonogashira reaction, as exemplified in the

Leaves persistent, imparipinnate, coriaceous, leaflets 3(-5)-foliolate; petiole 20-40mm long, petiolules of lateral leaflets, subsessile sometimes up to 10mm long, those of

The resulting system functions reliably up to the local clock frequency of 220MHz (according to SPICE simulation) — the maximum clock rate is limited by the ring oscillator, not

University performs its activities according to the Law on Higher Education, based on the Act on fulfillment of conditions for starting the academic activities number 07.2-4777/07

For a long time Western scholars did not study the actual death rites of ordinary people in their own societies (Grimes, 2000), and although in recent years there has been

On the other hand, in terms of banking sector stability, we find that, while increased concentration again appears to have had virtually no influence on bank insolvency risk,

This chapter presents a basic of the methodology so-called an asymptotic expansion approach, and applies this method to approximation of prices of currency options with a libor