Statistical Analysis of Gene Expression Data With Oracle & R (- data mining)

(1)

Statistical Analysis of Gene

Expression Data With

Oracle & “R”

(- data mining)

Patrick E. Hoffman Sc.D. Senior Principal Analytical

Consultant

(2)

Agenda (Oracle & R Analysis)

y Tools

y Loading Data

y Statistical Analysis

– Most Important Genes

y MDL – Correlations – t statistics, etc – R Visualizations y Predictive models? y Other?

(3)

Analysis Tools

y Tools For DB * PLSQL development

– ODM (Oracle Data Miner)

– TOAD (From Quest)

– Jdeveloper (Free from Oracle) – does Java & plsql

development

– Enterprise Manager Console ( Oracle client)

y Managing the whole DB

– SQLPlus (command line sql or plsql)

y R Project

– Open source clone of S-Plus

(4)

The data

y

Affy gene expression (old AML/ALL)

y

7129 genes

y

72 patients (combined train/test)

Can be expanded to the new chips (50,000

genes)

(5)

(6)

Loading Data into Oracle DB?

From flat file?

y

**SQL*Loader**

y

Oracle Warehouse Builder

–

**Production (auto sql*ldr)**

–

All types of data

y

Oracle Data Miner (ODM)

–

Quick & Easy

(7)

Screen Shots ODM’r

to load csv file

(8)

(9)

(10)

(11)

(12)

(13)

(14)

Load Data From R to DB

y

Set up ODBC driver (in Windows)

y

Download RODBC pkg (from cran)

y

Load CSV file

(15)

R code – send csv to db

# you must install(download) the RODBC first # r script to load a csv file to Oracle db

# install packages(RODBC to connect to a database) from CRAN (menu option) library(RODBC)

# use standard microsoft odbc connection

# set up a dsn name to connect to correct service

# setup the channel to database, ODBC must be set up to connect to Oracle DB chan1 <- odbcConnect(dsn="")

odbcGetInfo(chan1) #make sure channel is ok

# load in a csv file

filename <- "D:\\clients\\affy\\gs_EXPRESSION.csv" csv <- read.csv(filename)

sqlDrop(chan1, "GS_EXPRESSION", errors = TRUE) #drop the old table # load the csv file to a table in the Oracle db

sqlSave(chan1, csv, tablename = "GS_EXPRESSION",

(16)

7000 genes?

1000 column DB Limit?

Gone!

y

Transactional Format

y

or

y

Nested Columns

Convert flat file table to one of

these formats

(17)

Analysis

y

Pl/Sql Package affy

– affy_to_trans Flat to Transactional

– affy_ai Calculate MDL Attrib. Import.

– trans_to_affy Convert top gene to flat form.

– corr_genes Correlate genes to genes

– corr_cases Correlate cases to cases

– corr_target Correlate genes to target

– T – statistic Calculate t-statistics

(18)

Convert gene expression table

to transactional format

y

Load data into table GS_EXPRESSION

y

Use Affy package

y

exec affy.affy_to_trans('GS_EXPRESSION');

OUTPUT

- will be in the table AFFY_TRANS

(19)

(20)

(21)

(22)

(23)

(24)

(25)

(26)

Attribute Importance

Target is ALL or AML

y

This is Minimum Distance Length (MDL)

algorithm

(27)

(28)

Top Genes by MDL

(no normalization)

(29)

Classification & Clustering (ODM)

y

Transactional or Flat tables

y

SVM, Naïve Bayes, Adaptive Bayes

y

Java, Plsql, API

y

Java based GUI

y

Advanced K-means, Orthogonal Clustering

(30)

10g

Statistics & SQL Analytics

y Ranking functions

–rank, dense_rank, cume_dist, percent_rank,

ntile

y Window Aggregate functions

(moving and cumulative)

–Avg, sum, min, max, count, variance, stddev,

first_value, last_value

y LAG/LEAD functions

–Direct inter-row reference using offsets

y Reporting Aggregate functions

–Sum, avg, min, max, variance, stddev, count,

ratio_to_report

y Statistical Aggregates

–Correlation, linear regression family, covariance

y Linear regression

–Fitting of an ordinary-least-squares regression

line to a set of number pairs.

–Frequently combined with the COVAR_POP,

COVAR_SAMP, and CORR functions.

y Descriptive Statistics

– average, standard deviation, variance, min, max, median

(via percentile_count), mode, group-by & roll-up

– DBMS_STAT_FUNCS: summarizes numerical columns

of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values

y Correlations

– Pearson’s correlation coefficients, Spearman's and

Kendall's (both nonparametric).

y Cross Tabs

– Enhanced with % statistics: chi squared, phi coefficient,

Cramer's V, contingency coefficient, Cohen's kappa

y Hypothesis Testing

– Student t-test , F-test, Binomial test, Wilcoxon Signed

Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA

y Distribution Fitting

– Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

y Pareto Analysis (documented) – 80:20 rule, cumulative results table

(31)

Analytical Functions

with TX format

y

Correlating cases with cases

y

Correlating genes with genes

y

Correlating genes with TARGET (all/aml)

y

T- statistics

(32)

Correlating Cases

-- put caseid’s to correlate in table aftran2 create view aftran2 as

select distinct(caseid) from affy_trans;

create table pcor1 as

select a.caseid p1, b.caseid p2, corr(a.attr_value, b.attr_value) corr

from affy_trans a, affy_trans b where a.caseid < b.caseid and

a.attrib = b.attrib and

a.caseid in (select * from aftran2) group by a.caseid, b.caseid

having corr(a.attr_value, b.attr_value) > .93 or corr(a.attr_value, b.attr_value) < -.93;

(33)

(34)

Correlating Genes

select a.attrib p1, b.attrib g2, corr(a.attr_value, b.attr_value) corr

from affy_trans a, affy_trans b where a.attrib < b.attrib and

a.seqnum = b.seqnum and

a.attrib = 'X95735_at‘ --- zyxin group by a.attrib, b.attrib

(35)

208 genes correlate with zyxin

across 72 patients

(36)

Correlating Genes with TARGET

select a.attrib p1, b.attrib g2, corr(a.attr_value, b.attr_value) corr

from affy_trans a, affy_trans b where a.attrib < b.attrib and

a.seqnum = b.seqnum and

a.attrib = ‘TARGET‘ --- AML/ALL

group by a.attrib, b.attrib

(37)

(38)

R Correlations

library(RODBC)

chan1 <- odbcConnect(dsn="")

odbcGetInfo(chan1) #make sure channel is ok

# get gene expression data from db, but drop gene name

gs1 <- sqlQuery(chan1, query = "create table gs1 as SELECT * FROM GS_EXPRESSION")

gs1 <- sqlQuery(chan1, query = "alter table gs1 drop column gene") gs <- sqlQuery(chan1, query = "SELECT * FROM GS1")

# get correlation matrix

cm <- cor(gs,use="pairwise.complete.obs") # Write the new table to a file.

write.table( cm, file="D:\\clients\\affy\\gs_cm.csv", append = FALSE, quote = FALSE, sep = ",", eol = "\n", na = "", dec = ".",

row.names = T, col.names = T )

(39)

(40)

(41)

Other Statistics

y

Corr_s

- Spearman’s rho correlation coef.

y

Corr_k

- Kendall's tau-b correlation coef.

y

stats_t_test_indep - equal variance

y

stats_t_test_indepu - unequal variance

(42)

T statistics for each gene (tx format)

Create table t_stats as

SELECT a.attrib, count(*) cnt,avg(a.attr_value) avg_atr, avg(b.attr_value) avg_trg,

STATS_T_TEST_INDEP(b.attr_value, a.attr_value, 'STATISTIC') t_observed,

STATS_T_TEST_INDEP(b.attr_value, a.attr_value)*7130

two_sided_p_value

FROM affy_trans A, affy_trans b

WHERE a.attrib in (select * from aftran1) and b.attrib = 'TARGET' and

a.caseid = b.caseid group by a.attrib;

(43)

(44)

F-distribution one way ANOVA

drop table anova;

create table anova as

SELECT a.attrib, count(*) cnt,avg(a.attr_value) avg_atr, avg(b.attr_value) avg_trg,

STATS_ONE_WAY_ANOVA(b.attr_value, a.attr_value, 'F_RATIO') f_ratio,

STATS_ONE_WAY_ANOVA(b.attr_value, a.attr_value, 'SIG')*7130 p_value

FROM affy_trans A, affy_trans b

WHERE a.attrib in (select * from aftran1) and b.attrib = 'TARGET' and

a.caseid = b.caseid group by a.attrib;

(45)

(46)

R for Plotting and Visualization

#### get data and plot all variables

a1 <- sqlQuery(chan1, query = "select * from anova")

plot(a1)

(47)

(48)

Histograms of Expression Distribution

#Generate Histograms of gene expression cases, with labels and cut offs pam1 = 40 #Bars

pam2 = 5000 #Ceiling pam3 = -1000 #floor wx = 1000 #width hx = 714 #floor

N <- ncol(csv) # gene expression data is in csv N = 3 # do only 3 histograms

R <- nrow(csv)

nom<- attr(csv, "names") par(mfrow = c(N-1,1)) b <- 0:pam1 c = pam1/(pam2 - pam3) b <- b/c b <- b + pam3 for( num in 2:N) { h1 <- csv[,num] h1[ h1>pam2] <- pam2 h1[ h1<pam3] <- pam3

h<- hist(h1,nclass=pam1,breaks=b, main = paste("Histogram of" , nom[num],"clamp at",pam2,pam3), xlab = nom[num], col=5)

(49)

(50)

Histogram – no labels

N <- ncol(csv) # number of columns

pam1 = 40 #Bars

pam2 = 5000 #Ceiling pam3 = -1000 #floor

R <- nrow(csv)

#make the break points b <- 0:pam1

c = pam1/(pam2 - pam3) b <- b/c

b <- b + pam3

par(mfrow = c(N,1),mar=c(0,0,0,0)) # space on graph for( num in 2:N) {

h1 <- csv[,num]

h1[ h1>pam2] <- pam2 h1[ h1<pam3] <- pam3

h<- hist(h1,breaks=b, main = "", xlab = "",axes=F, col=5) }

(51)

(52)

(53)

More?

y

Many other applications of R

y

Bioconductor

y

Many other applications of Oracle

y

Other code is available

(54)

Life Sciences DM Workshop

A one day onsite technical session educating organizations on how to leverage one of their most valuable assets to provide insight in the operations of their business, the behavioral patterns of their customers and hidden relationships found deep within corporate data that can have direct impact to the bottom line.

Life Sciences DM Blueprint

A documented technical roadmap providing the organization with the strategy to integrate and deploy Life Sciences technology. This includes recommendations based on feedback from the Life Sciences workshop focusing on source data preparation, mining methodologies and supporting architecture.

Life Sciences DM Insight

A five day onsite engagement focused on providing a detailed analysis of the business problem, data preparation, model build and analysis and knowledge

deployment extending the analysis of the Life Sciences workshop culminating with a technical roadmap with a strategy to integrate and deploy Life Sciences technology.

Life Sciences DMQuickstart

A thirty day engagement focused on taking a business problem and transforming into a Life Sciences solution. This includes transforming the business problem, preparing e data, creation of the mining model and knowledge deployment. Upon completion, results will be delivered mapped to the initial business problem.

Life Sciences DM Services

A series of custom services focused on delivering Life Sciences methodologies and solutions to provide insight in the operations of their business, the behavioral patterns of their customers and hidden relationships found deep within corporate data that can have direct impact to the bottom line.

(55)

Life Science Informatics experience

y

Gene expression analysis

y

Sequence Analysis (blast exon/intron

prediction)

y

Clinical/Medical data analysis

y

QSAR/Cheminformatics

– Isis,Molconz, Predictive Tox

y

Animal Studies

y

Protein analysis (arrays, Mass spec)

(56)

Data Mining & Informatics Services

Contact Richard Solari 508-477-5765

630-561-9950

[email protected]

Contact Patrick Hoffman 781-744-0783

617-755-6740