1
R and Bioconductor:
Software Tools for Scientific Data Analysis and Visualization
Stowers Science Club
Earl F. Glynn
Scientific Programmer, Bioinformatics
2
Topics
• What is “R”?
• What is “Bioconductor”?
• Pros and Cons
• How to get R and Bioconductor?
• Why should a biologist care?
3
What is “R”?
•“Calculator” and Statistical Analysis Tool
large number of built-in statistical and math functions
•Exploratory Data Analysis Tool
descriptive statistics/graphics
•Graphics and Data Visualization Tool
high-quality, customizable graphics
•Huge Library of Specialty “Packages”
a growing number specifically for microarray data analysis
•Statistical Computing Language/Environment
4
R Example
R as a graphing “calculator”
Temperature dependence of equilibrium constant
> T <- c(478, 533, 588, 643)
> Keq <- c(210, 73, 31, 16)
> plot(1/T, log(Keq), main = "log(Keq) Vs 1/T")
[Figure: scatter plot "log(Keq) Vs 1/T"; x-axis 1/T, y-axis log(Keq)]
5
R Example
R as a graphing “calculator”
> fit <- lm(log(Keq) ~ I(1/T))
> fit
Call:
lm(formula = log(Keq) ~ I(1/T))
Coefficients:
(Intercept)       I(1/T)
     -4.726     4810.111
> abline(fit, col="red")
> coefficients(fit)
(Intercept)      I(1/T)
  -4.726161 4810.111351
> summary(fit)
[Figure: scatter plot "log(Keq) Vs 1/T" with the fitted line added in red; x-axis 1/T, y-axis log(Keq)]
6
R Example
R as a graphing “calculator”
Simple Model: log(Keq) = a + b/T
> fit <- lm(log(Keq) ~ I(1/T))
> T <- c(200, 225, 273, 300, 325)
> Keq <- exp(1.2 + 0.3/T - 1.25*log(T) + 0.01*T - 0.0003*T^2)
> fit <- lm(log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
> fit
Call:
lm(formula = log(Keq) ~ I(1/T) + I(log(T)) + T + I(T^2))
Coefficients:
(Intercept)       I(1/T)    I(log(T))            T       I(T^2)
     1.2000       0.3000      -1.2500       0.0100      -0.0003
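The simple model can also be used for prediction once it is fitted. The following is a minimal sketch, assuming the four data points from the first "calculator" slide. The names `temp_K`, `simple_fit`, and `Keq_600`, and the choice of 600 K, are illustrative and not from the original slides. The temperature vector is named `temp_K` rather than `T` so that R's built-in shorthand `T` for `TRUE` is not masked.

```r
# Data from the slide: temperature and equilibrium constant
temp_K <- c(478, 533, 588, 643)
Keq    <- c(210, 73, 31, 16)

# Simple model: log(Keq) = a + b/T
simple_fit <- lm(log(Keq) ~ I(1/temp_K))
a <- unname(coef(simple_fit)[1])   # intercept
b <- unname(coef(simple_fit)[2])   # slope with respect to 1/T

# Predict Keq at an illustrative temperature (600 K is an assumption)
Keq_600 <- exp(a + b / 600)
print(c(a = a, b = b, Keq_600 = Keq_600))
```

Computing the prediction directly from `a` and `b` keeps the link to the model formula visible; `predict()` on the fitted object would be an alternative.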
7
R Example
Exploratory Data Analysis
•Use descriptive statistics to see the “big picture” and examine data quality prior to formal analysis
•Need techniques that are robust to outliers
§Measures of Center:
- Mean (normal distribution)
- Median (skewed distribution)
§Measures of Spread:
- Standard Deviation (SD) (appropriate with Mean)
standardize: (X - mean(X)) / sd(X)
- Median Absolute Deviation (MAD) (appropriate with Median)
standardize: (X - median(X)) / mad(X)
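The effect of an outlier on these measures can be seen in a few lines. This is a minimal sketch, assuming Tukey's 13 example values plus one artificial outlier; the value 1000 and the variable names are illustrative assumptions. Note that R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default so that it is comparable to the SD for normal data.

```r
# Tukey's example values, plus one artificial outlier (1000 is an assumption)
x     <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)
x_out <- c(x, 1000)

# Center: the mean moves a lot, the median barely changes
center <- rbind(mean   = c(clean = mean(x),   with_outlier = mean(x_out)),
                median = c(clean = median(x), with_outlier = median(x_out)))
print(center)

# Spread: the SD inflates with the outlier, the MAD stays stable
spread <- rbind(sd  = c(clean = sd(x),  with_outlier = sd(x_out)),
                mad = c(clean = mad(x), with_outlier = mad(x_out)))
print(spread)

# Standardize both ways; the outlier stands out far more on the robust scale
z_classic <- (x_out - mean(x_out))   / sd(x_out)
z_robust  <- (x_out - median(x_out)) / mad(x_out)
print(round(rbind(z_classic, z_robust), 2))
```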
8
R Example
Exploratory Data Analysis
Tukey’s “Five Number” Summary
Data: 65 29 73 57 37 22 79 48 41 12 84 45 7
Source: John W. Tukey, Exploratory Data Analysis, 1977.
> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> fivenum(x)
[1]  7 29 45 65 84
(Min, Q1, Median, Q3, Max; the Interquartile Range (IQR) spans Q1 to Q3)
9
R Example
Exploratory Data Analysis
Visualize “five-number summary” with a boxplot:
Minimum, Quartile 1, Median, Quartile 3, Maximum
> x <- c(79,73,7,12,29,22,65,84,45,41,48,57,37)
> boxplot(x, main="Five Number Summary")
[Figure: boxplot titled "Five Number Summary", annotated with Min, Q1, Median, Q3, Max, and IQR]
“box and whisker” plot or simply a “boxplot”
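The numbers behind the boxplot can be inspected without drawing it. This is a minimal sketch using the same data; the variable names `five`, `iqr_hinges`, and `bp` are illustrative. For this data set the boxplot statistics coincide with `fivenum(x)` because no point lies beyond the whisker limits; that need not hold for other data.

```r
x <- c(79, 73, 7, 12, 29, 22, 65, 84, 45, 41, 48, 57, 37)

# Five-number summary with readable labels
five <- fivenum(x)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
print(five)

# Spread of the middle half of the data (distance between the hinges)
iqr_hinges <- unname(five["Q3"] - five["Q1"])
print(iqr_hinges)

# Statistics boxplot() would draw, computed without plotting
bp <- boxplot(x, plot = FALSE)
print(bp$stats)   # whisker ends, hinges, and median
print(bp$out)     # points flagged as outliers (none for this data)
```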
10
R Example
Exploratory Data Analysis
> RawData <- read.csv("Complete_Dataset.csv", as.is=TRUE)
> Expression <- log2(data.matrix(RawData[,2:ncol(RawData)]))
> boxplot(data.frame(Expression),
    main="Bozdech 'Complete' Plasmodium Dataset",
    las=3,          # 3 = axis labels always vertical
    cex.axis=0.7,
    ylab="Log2 Expression Ratio")
[Figure: boxplots of Log2 Expression Ratio for time points TP1 through TP48, titled "Bozdech 'Complete' Plasmodium Dataset"]
11
R Example
Exploratory Data Analysis
> # Use Bioconductor package
> library(arrayMagic)
> plot.imageMatrix(Expression, yLabels="", main="Log2 Gene Expression
12
R Example
Statistical Analysis:
Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("i3518_1")
[Figure, four panels for i3518_1 (N = 46): Expression vs Time [hours]; "Time Interval Variability" (Frequency vs log10(delta T)); "Lomb-Scargle Periodogram" (Normalized Power Spectral Density vs Frequency [1/hour], with significance levels from p = 0.05 to p = 1e-06); "Peak Significance" (Probability vs Frequency [1/hour])]
Period at Peak = 45.7 hours
p = 1.48e-008 at Peak
13
R Example
Statistical Analysis:
Evaluate Gene Expression for Periodicity
> ShowSingleOligoProfileByName("j167_5")
[Figure, four panels for j167_5 (N = 35): Expression vs Time [hours]; "Time Interval Variability" (Frequency vs log10(delta T)); "Lomb-Scargle Periodogram" (Normalized Power Spectral Density vs Frequency [1/hour], with significance levels from p = 0.05 to p = 1e-06); "Peak Significance" (Probability vs Frequency [1/hour])]
Period at Peak = 17.8 hours
p = 0.998 at Peak
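The periodograms on these two slides come from the author's own function, whose code is not shown. As a rough illustration of the idea only, here is a minimal sketch of the standard normalized Lomb-Scargle periodogram for unevenly sampled data, applied to a synthetic profile. The function `lomb_scargle`, the simulated 20-hour period, the noise level, and the frequency grid are all assumptions, and the significance calculation shown on the slides is not reproduced.

```r
# Normalized Lomb-Scargle periodogram for unevenly sampled data.
# times: sample times, y: measurements, freqs: trial frequencies (cycles per time unit)
lomb_scargle <- function(times, y, freqs) {
  y0 <- y - mean(y)
  s2 <- var(y)
  sapply(freqs, function(f) {
    w   <- 2 * pi * f
    # The time offset tau makes the sine and cosine terms orthogonal
    tau <- atan2(sum(sin(2 * w * times)), sum(cos(2 * w * times))) / (2 * w)
    cw  <- cos(w * (times - tau))
    sw  <- sin(w * (times - tau))
    (sum(y0 * cw)^2 / sum(cw^2) + sum(y0 * sw)^2 / sum(sw^2)) / (2 * s2)
  })
}

# Synthetic expression profile (assumed values, not the slide's data):
# 46 unevenly spaced samples over 48 hours with a 20-hour period plus noise
set.seed(1)
t_hours <- sort(runif(46, 0, 48))
expr    <- sin(2 * pi * t_hours / 20) + rnorm(46, sd = 0.2)

freqs     <- seq(0.005, 0.2, by = 0.001)   # trial frequencies [1/hour]
power     <- lomb_scargle(t_hours, expr, freqs)
peak_freq <- freqs[which.max(power)]
cat("Period at peak:", round(1 / peak_freq, 1), "hours\n")
```

Unlike a Fourier transform, this approach does not require evenly spaced time points, which is why it suits time courses with missing or irregular samples.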
14
R Example
Statistical Analysis:
Multiple Hypothesis Testing
[Figure: "Multiple Testing Correction Methods" (using R's p.adjust methods); Log10(p) vs Rank Order of Sorted p Values, with curves for bonferroni, holm, hochberg, fdr, and none, and a reference level at α = 0.0001]
fdr = Benjamini and Hochberg’s “False Discovery Rate” Method
p.adjust function in R “stats” package
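A small example makes the differences between the correction methods concrete. This is a minimal sketch using `p.adjust` from the "stats" package; the seven p-values and the 0.05 cutoff are made up for illustration and are not taken from the slide's data.

```r
# Illustrative raw p-values (assumed, not from the slide)
p <- c(0.00001, 0.0002, 0.003, 0.01, 0.04, 0.2, 0.6)

# Apply each correction method named on the slide
methods  <- c("bonferroni", "holm", "hochberg", "fdr", "none")
adjusted <- sapply(methods, function(m) p.adjust(p, method = m))
print(signif(adjusted, 3))

# Number of tests called significant at an illustrative cutoff
alpha <- 0.05
print(colSums(adjusted <= alpha))
```

For any given test the adjusted p-values are ordered bonferroni ≥ holm ≥ hochberg ≥ fdr ≥ none, so the more conservative methods never call more tests significant than the less conservative ones.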
15
R Example
Statistical Analysis:
Logic Regression
Where L1 and L2 are Boolean expressions. Each L can be represented by a logic tree.
Logic Tree:
L = (B ∧ C) ∨ A
Ruczinski, et al. (2003), Logic Regression, Journal of Computational and Graphical Statistics, 12(3), 475-511.
R Package: LogicReg
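The logic tree above is an ordinary Boolean expression, so it can be evaluated directly in R. This is a minimal sketch that builds the full truth table for L = (B ∧ C) ∨ A; it uses only base R and does not call the LogicReg package. The name `truth_table` is illustrative.

```r
# All eight combinations of the binary predictors A, B, C
truth_table <- expand.grid(A = c(FALSE, TRUE),
                           B = c(FALSE, TRUE),
                           C = c(FALSE, TRUE))

# Evaluate the logic tree: L = (B and C) or A
truth_table$L <- (truth_table$B & truth_table$C) | truth_table$A
print(truth_table)
```

L is TRUE whenever A is TRUE, and otherwise only when both B and C are TRUE.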
16
R Example
Data Visualization
> library(scatterplot3d)
> example(scatterplot3d)
[Figure: 3D scatter plot of Volume vs Girth and Height, from the scatterplot3d examples]
> example(layout)
[Figure: multi-panel plot arrangements from the layout examples]
> set.seed(19)
> library(MASS)
> x <- rnorm(50)
> y <- rnorm(50)
> d <- kde2d(x,y)
> image(d, col=terrain.colors(50))
> contour(d, add=TRUE)
[Figure: 2D kernel density estimate shown as an image with contour lines]
> set.seed(19)
> hist(rnorm(100), freq=FALSE)
> curve(dnorm(x), add=TRUE, col="blue")
[Figure: "Histogram of rnorm(100)" with the normal density curve in blue]
> set.seed(19)
> x <- matrix(rnorm(200),10,20)
> heatmap(x)
[Figure: heatmap of a 10 x 20 random matrix with row and column dendrograms]
17
R Example
Data Visualization
# plot with error bars
x <- c(1,2,3,4,5)
y <- c(15,9,NA,19,22)
error <- c(3, 4, 1, 2.5, 0.5)
plot(x,y, col="red", type="o",
     main="R plot", xaxt="n", xlab="specimen",
     ylab="concentration", ylim=c(0,25))
# half-width of the error-bar caps: 1% of the x-axis range
delta <- 0.01 * diff(par("usr")[1:2])
segments(x, y-error, x, y+error)
segments(x-delta, y-error, x+delta, y-error)
segments(x-delta, y+error, x+delta, y+error)
names <- c("ABC","12345","Missing","VeryLongName","XYZ")
# rotated x-axis labels placed just below the plot region
text(x, par("usr")[3] - 0.01*diff(par("usr")[3:4]),
     srt=30, adj=1, labels=names, xpd=TRUE)
mtext("subtitle")
[Figure: "R plot" with subtitle; concentration vs specimen, red points and lines with error bars, rotated x-axis labels ABC, 12345, Missing, VeryLongName, XYZ]
Customized Graphics
18
R Example
Data Visualization
Graphics Notes
•R creates graphics as postscript, pdf, or in a variety of other formats.
•In Windows, copy and paste graphics as “metafile” to Word, PowerPoint, or other programs.
•In Windows, enable “History, Recording” in graphics window: Use PageUp/PageDown to step through graphics.
•In Word, save as “Web page, filtered” to make web page
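Writing a plot to a file follows the same pattern for every format: open a graphics device, draw, and close the device. This is a minimal sketch that saves the earlier log(Keq) plot as a PDF; the file name `keq_plot.pdf`, the page size, and the use of a temporary directory are illustrative assumptions. Base R provides `postscript()` and `png()` devices that are used the same way.

```r
# Data from the "graphing calculator" example
temp_K <- c(478, 533, 588, 643)
Keq    <- c(210, 73, 31, 16)

# Open a PDF device (file name and size are illustrative)
out_file <- file.path(tempdir(), "keq_plot.pdf")
pdf(out_file, width = 6, height = 4)

# Everything drawn now goes to the file instead of the screen
plot(1/temp_K, log(Keq), main = "log(Keq) Vs 1/T",
     xlab = "1/T", ylab = "log(Keq)")

# Close the device to finish writing the file
invisible(dev.off())
cat("Plot written to", out_file, "\n")
```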
19
~500 R Packages
http://cran.r-project.org/src/contrib/PACKAGES.html
Most packages deal with data analysis, statistics, and visualization. Caution: Software quality varies. Validate first!
20
What is “Bioconductor”?
•Open Source Software for Bioinformatics
•Started in Fall 2001 at Harvard
•First Bioconductor Release in May 2002
•~100 R Packages
•Software categories:
-Analysis (e.g., “limma” linear models for microarrays)
-Annotation (e.g., “Data packages”)
-Database Interaction
-Graphics & User Interface (e.g., “limmaGUI”)
-Graphs
-Pre-processing
-Ontologies (tools for working with gene ontologies)
•Web: www.bioconductor.org
21
Bioconductor Example
Limma: linear models for microarrays
library(limma)
# Adapted from ?contrasts.fit
# Simulate gene expression data: 6 microarrays and 20000 genes
# with one gene differentially expressed in first 3 arrays.
# contrasts.fit: Given a linear model fit to microarray data,
# compute estimated coefficients and standard errors for a
# given set of contrasts.
set.seed(71)
M <- matrix(rnorm(20000*6,sd=0.3),20000,6)
M[1,1:3] <- M[1,1:3] + 2
# design matrix corresponds to oneway layout,
# columns are orthogonal
design <- cbind(First3Arrays=c(1,1,1,0,0,0),
                Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M,design=design)
# Would like to consider original two estimates plus
# difference between first 3 and last 3 arrays
contrast.matrix <- cbind(First3=c(1,0),Last3=c(0,1),
                         "Last3-First3"=c(-1,1))
fit2 <- contrasts.fit(fit,contrast.matrix)
fit2 <- eBayes(fit2)
# large values of fit2$t indicate differential expression
results <- classifyTestsF(fit2)
vennDiagram(vennCounts(results))
[Figure: Venn diagram for First3, Last3, and Last3-First3; counts shown: 19805, 23, 53, 14, 72, 19, 13, 1]
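To see what the design and contrast matrices compute, the coefficient estimates can be reproduced with base R alone. This is a minimal sketch, not limma's implementation: it uses ordinary least squares on a smaller simulated matrix (100 genes is an assumption to keep it quick) and omits the standard errors and moderated t-statistics that `eBayes` provides. The names `n_genes`, `coefs`, and `est` are illustrative.

```r
set.seed(71)
n_genes <- 100                       # illustrative size; the slide uses 20000
M <- matrix(rnorm(n_genes * 6, sd = 0.3), n_genes, 6)
M[1, 1:3] <- M[1, 1:3] + 2           # gene 1 is raised in the first 3 arrays

# One-way layout: one column per group of arrays
design <- cbind(First3Arrays = c(1, 1, 1, 0, 0, 0),
                Last3Arrays  = c(0, 0, 0, 1, 1, 1))

# Least-squares coefficients for all genes at once: (X'X)^-1 X'y
# For this design they are simply the two group means of each gene.
coefs <- t(solve(crossprod(design), crossprod(design, t(M))))

# Contrasts: the two group estimates plus their difference
contrast.matrix <- cbind(First3 = c(1, 0), Last3 = c(0, 1),
                         "Last3-First3" = c(-1, 1))
est <- coefs %*% contrast.matrix
print(round(head(est, 3), 3))
```

Gene 1 should show a "Last3-First3" estimate near -2, while the other genes scatter around zero.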
22
R/Bioconductor
Pros
•Powerful analysis tools
•Command line processing; batch processing
•Graphics-rich software
•Several revisions/year
•Fast (most tasks)
•Free and open source: UNIX/Windows/Apple
•Strong user community
•Help via mailing list
Cons
•Can be quirky
•No “GUI”: difficult to interact with data
•Graphics-poor documentation
•Several revisions/year
•Slow (processing huge datasets)
•“Correct” way to ask
“One of the most intimidating things about R is the seeming endlessness of it.”
Paul E. Johnson, KU Political Science Dept, R-Help, 9 May 2000 www.ku.edu/~pauljohn/R/Rtips.html
23
How to get R and Bioconductor?
24
How to get R and Bioconductor?
http://bioinfo/software/R.htm
25
Resources
R for Bioinformatics
Nov 2005?
SummeR Sessions?
Comprehensive R Archive Network (CRAN) http://cran.r-project.org
26
Why should a biologist care?
•Excel has many limitations.
•R can serve as a powerful graphing “calculator.”
•R can easily work with vectors and matrices of microarray data.
•State-of-the-art analysis software is often introduced in published papers using R.
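The point about vectors and matrices can be illustrated briefly. This is a minimal sketch on a tiny simulated matrix of expression ratios; the data, the dimensions, and the names `ratios`, `log_ratios`, `gene_means`, and `centered` are all assumptions for illustration.

```r
# Simulated expression ratios: 5 genes (rows) x 4 arrays (columns)
set.seed(42)
ratios <- matrix(2^rnorm(5 * 4), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("array", 1:4)))

# One call transforms the whole matrix; no loops or cell formulas needed
log_ratios <- log2(ratios)

# Per-gene averages across arrays
gene_means <- rowMeans(log_ratios)
print(round(gene_means, 2))

# Center each array by subtracting its median log ratio
centered <- sweep(log_ratios, 2, apply(log_ratios, 2, median))
print(round(centered, 2))
```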
27
Acknowledgements
Bioinformatics
Arcady Mushegian Director
Amy Ubben Admin
Support
Mike Coleman Scientific Programmer
Malcolm Cook Database Applications
Dan Thomasset UNIX Admin (IT)
Research
Jie Chen Visiting Scientist
Frank Emmert-Streib
Galina Glasko
Manisha Goel
Piotr Kozbial
Jing Liu
28
Acknowledgements
Microarrays
Pourquié Lab
Chris Seidel & Karen Zueckert-Gaudenz
Mary-Lee Dequeant & Olivier Pourquié