Software Tools for Scientific Data Analysis and Visualization: R and Bioconductor. Stowers Science Club


Earl F. Glynn

Scientific Programmer, Bioinformatics


Topics

• What is “R”?

• What is “Bioconductor”?

• Pros and Cons

• How to get R and Bioconductor?

• Why should a biologist care?


What is “R”?

•“Calculator” and Statistical Analysis Tool

large number of built-in statistical and math functions

•Exploratory Data Analysis Tool

descriptive statistics/graphics

•Graphics and Data Visualization Tool

high-quality, customizable graphics

•Huge Library of Specialty “Packages”

a growing number specifically for microarray data analysis

•Statistical Computing Language/Environment

R Example

R as a graphing “calculator”

Temperature dependence of equilibrium constant

> T <- c(478, 533, 588, 643)
> Keq <- c(210, 73, 31, 16)
> plot(1/T, log(Keq), main = "log(Keq) Vs 1/T")

[Figure: scatter plot "log(Keq) Vs 1/T"; x-axis 1/T, y-axis log(Keq)]

R Example

> fit <- lm(log(Keq) ~ I(1/T))
> fit

Call:
lm(formula = log(Keq) ~ I(1/T))

Coefficients:
(Intercept)       I(1/T)
     -4.726     4810.111

> abline(fit, col="red")
> coefficients(fit)
(Intercept)      I(1/T)
  -4.726161 4810.111351
> summary(fit)

[Figure: scatter plot "log(Keq) Vs 1/T" with the fitted line in red; x-axis 1/T, y-axis log(Keq)]
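The fit above can be checked without any plotting. The following is a minimal sketch that reuses the slide's data and model; the variable names `a`, `b`, `r2`, and `Keq_550`, and the prediction temperature of 550, are illustrative choices and not part of the original example.

```r
# Data from the slide. Note that T here shadows R's shorthand for TRUE.
T <- c(478, 533, 588, 643)
Keq <- c(210, 73, 31, 16)

# Fit log(Keq) = a + b/T; I() makes 1/T plain arithmetic inside the formula
fit <- lm(log(Keq) ~ I(1/T))

a <- unname(coef(fit)[1])        # intercept
b <- unname(coef(fit)[2])        # slope with respect to 1/T
r2 <- summary(fit)$r.squared     # fraction of variance explained

# Predict Keq at a temperature not in the data (550 is an arbitrary choice)
Keq_550 <- exp(predict(fit, newdata = data.frame(T = 550)))

print(c(intercept = a, slope = b, r.squared = r2))
print(Keq_550)
```

The intercept and slope should agree with the `coefficients(fit)` output shown on the slide, and the predicted value should fall between the measured values at the neighboring temperatures.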

R Example

Simple Model: log(Keq) = a + b/T

> fit <- lm(log(Keq) ~ I(1/T))

> T <- c(200, 225, 273, 300, 325)
> Keq <- exp(1.2 + 0.3/T - 1.25*log(T) + 0.01*T - 0.0003*T^2)
> fit <- lm(log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
> fit

Call:
lm(formula = log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))

Coefficients:
(Intercept)       I(1/T)    I(log(T))            T       I(T^2)
     1.2000       0.3000      -1.2500       0.0100      -0.0003


R Example

Exploratory Data Analysis

•Use descriptive statistics to see the “big picture” and examine data quality prior to formal analysis

•Need techniques that are robust to outliers

§Measures of Center:

- Mean (normal distribution)

- Median (skewed distribution)

§Measures of Spread:

- Standard Deviation (SD) (appropriate with Mean)

standardize: (X - mean(X)) / sd(X)

- Median Absolute Deviation (MAD) (appropriate with Median)

standardize: (X - median(X)) / mad(X)
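The two standardizations can be compared on data with one gross outlier. This is a minimal sketch, not part of the original slides: the data vector is Tukey's 13 values used later in this talk, and the appended value 1000 is an invented outlier for illustration.

```r
# Tukey's 13 values plus one artificial outlier (1000 is made up for this sketch)
x <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)
x_out <- c(x, 1000)

# Classical standardization: the outlier inflates both mean and SD
z_classic <- (x_out - mean(x_out)) / sd(x_out)

# Robust standardization: median and MAD are barely moved by the outlier.
# mad() scales by 1.4826 by default so it is comparable to SD for normal data.
z_robust <- (x_out - median(x_out)) / mad(x_out)

print(round(z_classic, 2))
print(round(z_robust, 2))
```

With the classical score the outlier partly hides itself because it inflates the SD; with the robust score it stands out much more clearly.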

R Example

Tukey’s “Five Number” Summary

Data: 65 29 73 57 37 22 79 48 41 12 84 45 7

Source: John W. Tukey, Exploratory Data Analysis, 1977.

> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> fivenum(x)
[1]  7 29 45 65 84

The five values are Min, Q1, Median, Q3, Max.

Interquartile Range (IQR): Q1 to Q3
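As a small sketch of how the IQR follows from the five-number summary, the snippet below labels the `fivenum` output and compares the hinge-based range with R's built-in `IQR`. The variable names are illustrative. The two definitions agree for this data set, but they can differ slightly for other sample sizes because `fivenum` uses Tukey's hinges while `IQR` uses interpolated quantiles.

```r
x <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)

# Label the five numbers for readability
five <- fivenum(x)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")

# IQR from Tukey's hinges versus the quantile-based built-in
iqr_hinges <- unname(five["Q3"] - five["Q1"])
iqr_builtin <- IQR(x)

print(five)
print(c(hinges = iqr_hinges, builtin = iqr_builtin))
```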

R Example

Visualize “five-number summary” with a boxplot:

Minimum, Quartile 1, Median, Quartile 3, Maximum

> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> boxplot(x, main="Five Number Summary")

[Figure: boxplot titled "Five Number Summary", annotated with Min, Q1, Median, Q3, Max, and IQR]

“box and whisker” plot or simply a “boxplot”
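The numbers a boxplot draws can be inspected directly with `boxplot.stats` from the grDevices package. The sketch below is illustrative: the extra value 200 is invented to show how a point beyond 1.5 times the box length from the box is reported as an outlier instead of stretching the whisker.

```r
x <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)

# For this data nothing lies beyond the whisker limits,
# so the whisker ends coincide with the minimum and maximum
s <- boxplot.stats(x)        # default coef = 1.5
print(s$stats)               # lower whisker, Q1, median, Q3, upper whisker
print(s$out)                 # points drawn individually as outliers

# Add one artificial extreme value (200 is made up for this sketch)
s2 <- boxplot.stats(c(x, 200))
print(s2$stats)
print(s2$out)
```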

R Example

> RawData <- read.csv("Complete_Dataset.csv", as.is=TRUE)
> Expression <- log2(data.matrix(RawData[,2:ncol(RawData)]))
> boxplot(data.frame(Expression),
    main="Bozdech 'Complete' Plasmodium Dataset",
    las=VERTICAL<-3, cex.axis=0.7,
    ylab="Log2 Expression Ratio")

[Figure: boxplots titled "Bozdech 'Complete' Plasmodium Dataset"; x-axis time points TP1 to TP48, y-axis Log2 Expression Ratio]


R Example

> # Use Bioconductor package
> library(arrayMagic)
> plot.imageMatrix(
    Expression, yLabels="", main="Log2 Gene Expression


R Example

Statistical Analysis:

Evaluate Gene Expression for Periodicity

> ShowSingleOligoProfileByName("i3518_1")

[Figure, four panels for i3518_1 (N = 46): Expression vs Time [hours]; "Time Interval Variability" (Frequency vs log10(delta T)); "Lomb-Scargle Periodogram" (Normalized Power Spectral Density vs Frequency [1/hour], with significance levels from p = 0.05 to p = 1e-06); "Peak Significance" (Probability vs Frequency [1/hour])]

Period at Peak = 45.7 hours

p = 1.48e-008 at Peak


R Example

Evaluate Gene Expression for Periodicity

> ShowSingleOligoProfileByName("j167_5")

[Figure, four panels for j167_5 (N = 35): Expression vs Time [hours]; "Time Interval Variability" (Frequency vs log10(delta T)); "Lomb-Scargle Periodogram" (Normalized Power Spectral Density vs Frequency [1/hour], with significance levels from p = 0.05 to p = 1e-06); "Peak Significance" (Probability vs Frequency [1/hour])]

Period at Peak = 17.8 hours

p = 0.998 at Peak


R Example

Multiple Hypothesis Testing

[Figure: "Multiple Testing Correction Methods" (using R's p.adjust methods); Log10(p) vs Rank Order of Sorted p Values; curves for bonferroni, holm, hochberg, fdr, and none; reference line at α = 0.0001]

fdr = Benjamini and Hochberg’s “False Discovery Rate” Method

p.adjust function in R “stats” package
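A minimal sketch of the comparison behind the figure, using `p.adjust` from the stats package. The seven p-values are invented for illustration; the method names are the ones listed on the slide.

```r
# Hypothetical raw p-values, already sorted (made up for this sketch)
p <- c(0.0001, 0.001, 0.01, 0.02, 0.04, 0.2, 0.5)

methods <- c("bonferroni", "holm", "hochberg", "fdr", "none")

# One column of adjusted p-values per correction method
adjusted <- sapply(methods, function(m) p.adjust(p, method = m))

print(signif(adjusted, 3))

# Number of tests still significant at alpha = 0.05 under each method
print(colSums(adjusted < 0.05))
```

For any set of p-values the adjusted values are ordered bonferroni ≥ holm ≥ hochberg ≥ fdr ≥ none, which is why the curves in the figure do not cross.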


R Example

Logic Regression

Where L₁ and L₂ are Boolean expressions. Each L can be represented by logic tree.

Logic Tree:

L = (B ∧ C) ∨ A

Ruczinski, et al. (2003), Logic Regression, Journal of Computational and Graphical Statistics, 12(3), 475-511.

R Package: LogicReg
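To make the logic tree concrete, the Boolean expression L = (B ∧ C) ∨ A can be evaluated over every combination of its inputs in base R. This sketch only evaluates the expression; it does not use the logic regression package or fit any model, and the data frame name `truth` is illustrative.

```r
# All 8 combinations of the three binary predictors
truth <- expand.grid(A = c(FALSE, TRUE),
                     B = c(FALSE, TRUE),
                     C = c(FALSE, TRUE))

# Evaluate the logic tree L = (B AND C) OR A row by row
truth$L <- with(truth, (B & C) | A)

print(truth)
```

L is TRUE in five of the eight rows: the four rows where A is TRUE, plus the one row where A is FALSE but both B and C are TRUE.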


R Example

Data Visualization

> library(scatterplot3d)
> example(scatterplot3d)

[Figure: "scatterplot3d - 5", Volume vs Girth and Height]

> example(layout)

[Figure: output of the layout example]

> set.seed(19)
> library(MASS)
> x <- rnorm(50)
> y <- rnorm(50)
> d <- kde2d(x,y)
> image(d, col=terrain.colors(50))
> contour(d,add=T)

[Figure: 2-D kernel density estimate shown as a color image with contour lines]

> set.seed(19)
> hist(rnorm(100),freq=F)
> curve(dnorm(x), add=T, col="blue")

[Figure: "Histogram of rnorm(100)" with the normal density curve in blue]

> set.seed(19)
> x <- matrix(rnorm(200),10,20)
> heatmap(x)

[Figure: heatmap of a 10 x 20 random matrix]


R Example

Data Visualization

Customized Graphics

# plot with error bars
x <- c(1,2,3,4,5)
y <- c(15,9,NA,19,22)
error <- c(3, 4, 1, 2.5, 0.5)
plot(x,y, col="red", type="o",
     main="R plot", xaxt="n", xlab="specimen",
     ylab="concentration", ylim=c(0,25))
delta <- 0.01 * diff(par("usr")[1:2])
segments(x, y-error, x, y+error)
segments(x-delta, y-error, x+delta, y-error)
segments(x-delta, y+error, x+delta, y+error)
names <- c("ABC","12345","Missing","VeryLongName","XYZ")
text(x, par("usr")[3] - 0.01*diff(par("usr")[3:4]),
     srt=30, adj=1, labels=names, xpd=TRUE)
mtext("subtitle")

[Figure: "R plot" with subtitle; concentration vs specimen, error bars, and rotated x-axis labels ABC, 12345, Missing, VeryLongName, XYZ]


R Example

Data Visualization

Graphics Notes

•R creates graphics as PostScript, PDF, or in a variety of other formats.

•In Windows, copy and paste graphics as “metafile” to Word, PowerPoint, or other programs.

•In Windows, enable “History, Recording” in graphics window: Use PageUp/PageDown to step through graphics.

•In Word, save as “Web page, filtered” to make web page
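Writing a graphic to a file follows the same pattern for every format: open a device, draw, close the device. The sketch below uses the `pdf` device; the file name, the page size, and the plotted data are arbitrary choices for illustration.

```r
# Write to a temporary directory so the sketch leaves no files behind
out_file <- file.path(tempdir(), "example_plot.pdf")

# Open the PDF device (width and height are in inches)
pdf(out_file, width = 6, height = 4)

# Any plotting commands issued now are drawn into the file
plot(1:10, (1:10)^2, type = "b", main = "Example",
     xlab = "x", ylab = "x squared")

# Closing the device finishes writing the file
dev.off()

print(file.exists(out_file))
```

Swapping `pdf(...)` for `postscript(...)` or `png(...)` should give the other formats with the same drawing code, although each device takes its own size arguments.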


~500 R Packages

http://cran.r-project.org/src/contrib/PACKAGES.html

Most packages deal with data analysis, statistics, and visualization. Caution: Software quality varies. Validate first!


What is “Bioconductor”?

•Open Source Software for Bioinformatics

•Started in Fall 2001 at Harvard

•First Bioconductor Release in May 2002

•~100 R Packages

•Software categories:

-Analysis (e.g., “limma” linear models for microarrays)

-Annotation (e.g., “Data packages”)

-Database Interaction

-Graphics & User Interface (e.g., “limmaGUI”)

-Graphs

-Pre-processing

-Ontologies (tools for working with gene ontologies)

•Web: www.bioconductor.org


Bioconductor Example

Limma: linear models for microarrays

library(limma)

# Adapted from ?contrasts.fit

# Simulate gene expression data: 6 microarrays and 20000 genes
# with one gene differentially expressed in first 3 arrays.

# contrasts.fit: Given a linear model fit to microarray data,
# compute estimated coefficients and standard errors for a
# given set of contrasts.

set.seed(71)
M <- matrix(rnorm(20000*6,sd=0.3),20000,6)
M[1,1:3] <- M[1,1:3] + 2

# design matrix corresponds to oneway layout,
# columns are orthogonal
design <- cbind(First3Arrays=c(1,1,1,0,0,0),
                Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M,design=design)

# Would like to consider original two estimates plus
# difference between first 3 and last 3 arrays
contrast.matrix <- cbind(First3=c(1,0),Last3=c(0,1),
                         "Last3-First3"=c(-1,1))
fit2 <- contrasts.fit(fit,contrast.matrix)
fit2 <- eBayes(fit2)

# large values of fit2$t indicate differential expression
results <- classifyTestsF(fit2)
vennDiagram(vennCounts(results))

[Figure: Venn diagram with circles First3, Last3, and Last3-First3; region counts shown: 19805, 23, 53, 14, 72, 19, 13, 1]


R/Bioconductor

Pros

•Powerful analysis tools

•Command line processing; Batch processing

•Graphics rich software

•Several revisions/year

•Fast (most tasks)

•Free and open source: UNIX/Windows/Apple

•Strong user community

•Help via mailing list

Cons

•Can be quirky

•No “GUI”: Difficult to interact with data

•Graphics poor documentation

•Several revisions/year

•Slow (processing huge datasets)

•“Correct” way to ask

“One of the most intimidating things about R is the seeming endlessness of it.”

Paul E. Johnson, KU Political Science Dept, R-Help, 9 May 2000

www.ku.edu/~pauljohn/R/Rtips.html


How to get R and Bioconductor?


http://bioinfo/software/R.htm



Resources

R for Bioinformatics

Nov 2005?

SummeR Sessions?

Comprehensive R Archive Network (CRAN) http://cran.r-project.org


Why should a biologist care?

•Excel has many limitations.

•R can serve as powerful graphing “calculator.”

•R can easily work with vectors and matrices with microarray data.

•State of the art analysis software often introduced in published papers using R.
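As a small illustration of the vector and matrix point, the sketch below treats a simulated expression-ratio matrix (genes in rows, arrays in columns) as a single object. All data, dimensions, and names here are invented for the example and do not come from a real microarray experiment.

```r
set.seed(1)

# Simulated expression ratios: 5 genes x 3 arrays (made-up data)
ratios <- matrix(2^rnorm(5 * 3, sd = 0.5), nrow = 5, ncol = 3,
                 dimnames = list(paste0("gene", 1:5), paste0("TP", 1:3)))

# One call transforms the whole matrix
log_ratios <- log2(ratios)

# Median of each array (column), then subtract it from that column
array_medians <- apply(log_ratios, 2, median)
centered <- sweep(log_ratios, 2, array_medians)

# Average centered log ratio for each gene (row)
gene_means <- rowMeans(centered)

print(round(centered, 3))
print(round(gene_means, 3))
```

No explicit loops are needed: `log2`, `apply`, `sweep`, and `rowMeans` each operate on the whole matrix at once.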


Acknowledgements

Bioinformatics

Arcady Mushegian Director

Amy Ubben Admin

Support

Mike Coleman Scientific Programmer

Malcolm Cook Database Applications

Dan Thomasset UNIX Admin (IT)

Research

Jie Chen Visiting Scientist

Frank Emmert-Streib

Galina Glasko

Manisha Goel

Piotr Kozbial

Jing Liu


Acknowledgements

Microarrays

Pourquié Lab

Chris Seidel & Karen Zueckert-Gaudenz

Mary-Lee Dequeant & Olivier Pourquié