Visualization and Clustering with High-dimensional Genomic Data
Zhenqiu Liu, PhD
Agenda
• Big data and their visualization
• PCA and MDS for data visualization
• Clustering and data mining
• Data integration: An example
Introduction
• Visualization is the use of graphical techniques to communicate information and support reasoning or analysis
• Two types of visualization:
– Scientific Visualization
– Information Visualization
Visualization
• Scientific Visualization
● Graphical representations of the results of mathematical models, computational data, and simulations.
● Involves research in computer graphics, statistics, image processing, high performance computing, and other areas
● It's not just a pretty picture or animation
• Information Visualization:
– The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.
• Visualization is not only about looking at a pretty picture
– It is about understanding the data
A Key Question
How do we convert abstract information into a visual representation while still preserving the underlying meaning and at the same time providing new insight?
LifeLines
• Visualization of computerised medical records
• For a patient
– Horizontal lines (time lines) represent medical problems, hospitalization and medications
– Icons on these lines represent events such as tests and physician consultations
Types of Visualization
(Kosslyn 89)
• Graphs
• Charts
• Maps
• Diagrams
Common Graph Types
[Figure: example graphs plotting # of accesses against URL, days, length of access, and length of page]
When to use which type?
• Line graph
– x-axis requires quantitative variable
– Variables have continuous values
– Ordering among ordinals
• Bar graph
– comparison of relative point values
• Scatter plot
– convey overall impression of relationship between two variables
• Pie Chart?
Growth Chart of GEO (RNA etc.)
The Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples, 16 billion individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000
GenBank growth chart (DNA sequences)
There are 126 billion bases in 135 million sequence records in the traditional GenBank divisions and 191 billion bases in 62 million sequence records in the
Big Omics Data
• A lot of genes and samples; heterogeneous data structures and data types.
• Big data collection vs. big data objects
• Big data collection: aggregates of many data sets (multi-source, multi-disciplinary, heterogeneous, and maybe distributed)
• Big data objects: single object too large
– For main memory
– For local disk
Basic Types of Omics Data
• Nominal (qualitative)
– No inherent order: SNP, sequencing, ...
• Ordinal (qualitative)
– Ordered, but not at measurable intervals
– first, second, third, …
– Clinical phenotypes (e.g. cancer stages)
• Quantitative
– List of integers or reals
– Gene expression, protein expression
Dimension Reduction
• High-dimensional data points are difficult to visualize
• Always good to plot data in 2D-3D
– Easier to detect or confirm the relationship among data points
– Catch stupid mistakes (e.g. in clustering)
• Two ways to reduce:
– By genes: some experiments are similar or have little information
– By experiments: some genes are similar or have little information
Principal Component Analysis
• Optimal linear transformation that chooses a new coordinate system for the data set, projecting the data onto new axes (the principal components) ordered by the variance they capture
• Components are orthogonal (mutually uncorrelated)
• A few PCs may capture most of the variation in the original data
PCA
[Figure: data in x, y, z coordinates with principal directions v1 and v2]
Dimension Reduction (PCA)
Principal components pick out the directions in the data that capture the greatest variability.
[Figure: data cloud with New Axis 1 and New Axis 2]
The first new axis is placed through the data so as to explain the greatest proportion of the variance in the data (most important).
The second new axis is orthogonal to the first and explains the next largest amount of variance.
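The properties described above (axes ordered by variance explained, orthogonal directions, uncorrelated components) can be checked on a small example. This is a minimal sketch using `prcomp` from the stats package; the simulated matrix and the gene names `g1` to `g6` are made up for illustration and are not data from the lecture.

```r
# Simulated expression-like matrix: 50 samples (rows) x 6 genes (columns).
# The values are illustrative only.
set.seed(1)
trend <- rnorm(50)                       # one shared "trend" across samples
x <- cbind(
  g1 = trend + rnorm(50, sd = 0.3),
  g2 = trend + rnorm(50, sd = 0.3),
  g3 = -trend + rnorm(50, sd = 0.3),
  g4 = rnorm(50),
  g5 = rnorm(50),
  g6 = rnorm(50)
)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each new axis, in decreasing order
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The new axes (columns of the rotation matrix) are orthonormal
ortho <- crossprod(pca$rotation)

# The projected coordinates (scores) are mutually uncorrelated
score_cor <- cor(pca$x)
```

Here `ortho` and `score_cor` should both be identity matrices up to rounding error, which is one way to see that each new axis adds information the earlier ones did not capture.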
Representing data in a reduced space
[Figure: data plotted on New Axis 1 and New Axis 2]
Typical Analysis
[Figure: plot of eigenvalues]
• PCA analysis
• Plot of eigenvalues, select number of components
• Plot PC1 v PC2, etc.
Interpreting a PCA
• Each axis represents a different “trend” or set of profiles
• The further from the origin, the greater the loading/contribution (i.e. higher expression)
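The typical analysis steps above (plot the eigenvalues, select a number of components, plot PC1 v PC2, inspect the loadings) might look as follows in R. This is a sketch on simulated data with two invented trends; the 80% cumulative-variance cutoff is an arbitrary assumption, not a rule from the lecture.

```r
set.seed(2)
n <- 40
trend1 <- rnorm(n)
trend2 <- rnorm(n)
# Simulated variables: three follow trend1, two follow trend2, one is noise
x <- cbind(
  a1 = trend1 + rnorm(n, sd = 0.2),
  a2 = trend1 + rnorm(n, sd = 0.2),
  a3 = trend1 + rnorm(n, sd = 0.2),
  b1 = trend2 + rnorm(n, sd = 0.2),
  b2 = trend2 + rnorm(n, sd = 0.2),
  noise = rnorm(n)
)
pca <- prcomp(x, scale. = TRUE)

# Step 1: plot the eigenvalues (variances of the components)
eig <- pca$sdev^2
screeplot(pca, type = "lines")

# Step 2: select the number of components (arbitrary 80% cutoff)
cum_var <- cumsum(eig) / sum(eig)
k <- which(cum_var >= 0.8)[1]

# Step 3: plot PC1 v PC2
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2")

# Loadings: variables far from the origin contribute most to an axis
loadings <- pca$rotation[, 1:2]
print(round(loadings, 2))
```

In the printed loadings, variables that share a trend should have loadings of similar size on the same axis, which is the sense in which each axis represents a set of profiles.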
Multidimensional scaling (MDS)
• MDS deals with the following problem: for a set of observed similarities (or distances) between every pair of N items, find a representation of the items in few dimensions such that the similarity (distance) structure nearly matches the original similarities (or distances).
• The numerical measure of how close the original distances and the distances at lower
MDS
1. MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to best preserve the distance structure after the mapping.
2. The original data can be in a high-dimensional or even non-metric space. The method cares only about the distance (dissimilarity) structure.
3. It can be shown that the results of PCA are exactly those of classical MDS if the distances calculated from the data matrix are Euclidean.
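Point 3 can be verified numerically. The sketch below, on a simulated data matrix, compares the first two PCA scores from `prcomp` with the classical MDS coordinates from `cmdscale` applied to Euclidean distances. Axes are only defined up to sign, so absolute values are compared. The last lines use 1 minus correlation as one possible non-Euclidean dissimilarity; that choice is an assumption for illustration.

```r
set.seed(3)
x <- matrix(rnorm(30 * 5), nrow = 30)    # 30 subjects in 5 dimensions (simulated)

pca_scores <- prcomp(x)$x[, 1:2]         # PCA on the data matrix
mds_coords <- cmdscale(dist(x), k = 2)   # classical MDS on Euclidean distances

# Compare coordinates, ignoring the arbitrary sign of each axis
max_diff <- max(abs(abs(pca_scores) - abs(mds_coords)))
print(max_diff)

# MDS only needs a dissimilarity structure, e.g. 1 - correlation between subjects
d_cor <- as.dist(1 - cor(t(x)))
mds_cor <- cmdscale(d_cor, k = 2)
```

`max_diff` should be zero up to floating-point error, while `mds_cor` gives a different configuration because it starts from a different dissimilarity.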
PCA vs. MDS
• Input data
– PCA: Data matrix (S subjects in G dimensions)
– MDS: Dissimilarity structure (distance between any pair of subjects)
• Method
– PCA: “Project” subjects to a low-dimensional space and preserve as large a variance as possible
– MDS: Find a low-dimensional space that best keeps the dissimilarity structure
• Restrictions
– PCA: Data have to be in Euclidean space
– MDS: Flexible to any data structure as long as the dissimilarity structure can be defined
• Pros and cons
– PCA: The PCs can be further used for modeling in downstream analyses. If a new subject is added, it can be similarly projected.
– MDS: Flexibility and visualization. But if a new subject is added, it can’t be shown in an existing MDS solution.
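The difference in handling a new subject can be sketched as follows. With PCA the new subject is centered with the training means and projected onto the existing axes (`predict` on a `prcomp` object does this). For classical MDS, this sketch simply recomputes the configuration with the new subject included. The data and the names `train` and `new_subject` are hypothetical.

```r
set.seed(4)
train <- matrix(rnorm(20 * 4), nrow = 20,
                dimnames = list(NULL, paste0("gene", 1:4)))
new_subject <- matrix(rnorm(4), nrow = 1,
                      dimnames = list(NULL, paste0("gene", 1:4)))

pca <- prcomp(train, center = TRUE, scale. = FALSE)

# PCA: project the new subject onto the existing principal axes
new_scores <- predict(pca, newdata = new_subject)

# The same projection written out by hand: center, then rotate
by_hand <- sweep(new_subject, 2, pca$center) %*% pca$rotation

# Classical MDS: the solution is tied to the distance matrix, so here the
# whole configuration is recomputed with the new subject included
mds_all <- cmdscale(dist(rbind(train, new_subject)), k = 2)
```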
PCA application: genomic study
• Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences, which can cause spurious associations in disease studies.
• PCA could be used to infer underlying
Figure 2
Nature Genetics 38, 904-909 (2006)
Principal components analysis corrects for stratification in genome-wide association studies
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reic
Chao Tian, Peter K. Gregersen and Michael F. Seldin. (2008) Accounting for ancestry: population substructure and
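A simulated illustration of how PCA can reveal ancestry differences in genotype data. It is a sketch under invented assumptions (two populations, 200 SNPs, hypothetical allele frequency ranges) and is not the procedure of the papers cited above.

```r
set.seed(5)
n_per_pop <- 50
n_snp <- 200
# Hypothetical allele frequencies that differ systematically between ancestries
freq_a <- runif(n_snp, 0.1, 0.4)
freq_b <- runif(n_snp, 0.6, 0.9)

# Genotypes coded as 0/1/2 copies of an allele; rows = subjects, columns = SNPs
sim_geno <- function(n, freq) {
  sapply(freq, function(p) rbinom(n, size = 2, prob = p))
}
geno <- rbind(sim_geno(n_per_pop, freq_a), sim_geno(n_per_pop, freq_b))
pop <- rep(c("A", "B"), each = n_per_pop)

# Drop SNPs with zero variance before scaling
geno <- geno[, apply(geno, 2, var) > 0]
pc <- prcomp(geno, scale. = TRUE)

# Mean PC1 score per simulated population
pc1_by_pop <- tapply(pc$x[, 1], pop, mean)
print(pc1_by_pop)
```

In this simulation PC1 separates the two populations, so the leading components could serve as ancestry covariates in an association model.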
Software for dimension reduction & visualization
PCA in R:
– prcomp(stats): Principal Components Analysis (preferred)
– princomp(stats): Principal Components Analysis
– screeplot(stats): Screeplot of PCA Results
PCA in IMSL (a commercial C library)
MDS in R:
– isoMDS(MASS): Kruskal's Non-metric Multidimensional Scaling
– cmdscale(stats): Classical (Metric) Multidimensional Scaling
– sammon(MASS): Sammon's Non-Linear Mapping
MDS: Various software and resources about MDS
http://www.granular.com/MDS/
Heatmap visualization:
Visualization vs. Analysis?
• Applications to data mining and data discovery.
– Visualization tools are helpful for exploring hunches and presenting results
• Examples: scatterplots
– They are the WRONG primary tool when the goal is to find a good
Data Mining and Machine Learning
• Machine learning and data mining can be used to recognize or classify complex items (objects, situations, etc.), to predict future data or events, and to explore the structure in the data.
• On the boundary of Computer Science and Statistics.
Why Data Mining?
• A lot of data
• Data is noisy
• No clear biological theory
• Large number of features (genes)
• Complex relationships
Unsupervised Learning
Unsupervised learning attempts to discover interesting structure in the available data
Data mining, Clustering
Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all
Example 2: in medicine, identify patient subtypes based on their omics profiles
Supervised Learning
[Diagram: observations from an unknown system form a training dataset; an ML algorithm fits a model, which predicts the property of interest for a new observation]
Classification
• Biologists are estimated to produce 25 000 000 000 000 000 bytes of data each year (roughly 35 million CD-ROMs).
• How do we learn something from this data?
• Find patterns/structure in the data.
– Use cluster analysis
• Definition: Clustering is the process of grouping several objects into a number of groups, or clusters.
• Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters.
Basic principles of clustering
Aim: to group observations or variables that are “similar” based on predefined criteria.
Issues:
– Which genes / genomic technology to use?
– Which similarity or dissimilarity measure?
– Which method to use to join the clusters/observations?
– Which clustering algorithm?
– How to validate the resulting clusters?
[Workflow diagram with the steps: omics data; for each gene, calculate a summary statistic and/or adjusted p-value; set of candidate DE genes; clustering of genes (similarity metric, clustering algorithm); descriptive interpretation; biological verification]
Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects
• Two main classes of metric:
– Correlation coefficients (similarity)
• Compares shape of expression curves
– Kernel matrix (e.g. string kernel)
– Distance metrics (dissimilarity)
• City Block (Manhattan) distance
• Euclidean distance
• Pearson Correlation Coefficient (centered correlation)
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{S_x}\right) \left(\frac{y_i - \bar{y}}{S_y}\right) $$
S_x = standard deviation of x, S_y = standard deviation of y
– Correlation is a measure between -1 and 1
• Others include Spearman’s and Kendall’s
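The Pearson correlation can be computed directly from its definition (mean product of standardized values, using the sample standard deviation) and compared with the built-in `cor`. This is a small sketch; the two expression vectors are made-up values.

```r
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.7)   # made-up expression values for gene X
y <- c(0.9, 2.0, 3.5, 4.1, 5.6, 6.0)   # made-up expression values for gene Y
n <- length(x)

# Pearson correlation from the definition
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Built-in equivalents, including the rank-based alternatives
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")
```

Because both vectors here increase together, the rank-based coefficients equal 1 even though the Pearson coefficient is slightly below 1.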
Distance metrics
• City Block (Manhattan) distance:
– Sum of absolute differences across dimensions
– Less sensitive to outliers
– Diamond shaped clusters
$$ d(X, Y) = \sum_{i=1}^{n} |x_i - y_i| $$
• Euclidean distance:
– Most commonly used distance
– Sphere shaped cluster
– Corresponds to the geometric distance in the multidimensional space
$$ d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$
where gene X = (x1,…,xn) and gene Y = (y1,…,yn)
[Figure: genes X and Y plotted against Condition 1 and Condition 2]
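Both distances are easy to compute by hand and with `dist`, which works on the rows of a matrix. A short sketch with made-up values for two genes across four conditions:

```r
X <- c(2, 4, 6, 8)   # gene X across four conditions (made-up values)
Y <- c(1, 4, 9, 4)   # gene Y across the same conditions

manhattan <- sum(abs(X - Y))          # 1 + 0 + 3 + 4 = 8
euclidean <- sqrt(sum((X - Y)^2))     # sqrt(1 + 0 + 9 + 16) = sqrt(26)

# The same distances via dist()
m <- rbind(X, Y)
d_man <- dist(m, method = "manhattan")
d_euc <- dist(m, method = "euclidean")
```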
Euclidean vs Correlation (I)
• Euclidean distance
• Correlation
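The two measures can disagree about which profiles are "similar". In this sketch with invented profiles, `g2` has the same shape as `g1` at ten times the magnitude, while `g3` has a similar magnitude but a different shape.

```r
g1 <- c(1, 2, 3, 2, 1)            # made-up expression profile
g2 <- 10 * g1                     # same shape, ten times the magnitude
g3 <- c(1.5, 1.5, 2.5, 2.5, 1)    # similar magnitude, different shape

# Euclidean distance compares magnitudes
d_12 <- sqrt(sum((g1 - g2)^2))
d_13 <- sqrt(sum((g1 - g3)^2))

# Correlation compares the shape of the curves
r_12 <- cor(g1, g2)
r_13 <- cor(g1, g3)
```

By Euclidean distance `g3` is the closer neighbour of `g1`, whereas by correlation `g2` is, so the choice of measure should follow from whether magnitude or shape matters for the question at hand.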
Clustering algorithms
• Clustering algorithms come in two basic flavors: hierarchical and partitioning
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
• The tree can be built in two distinct ways
– bottom-up: agglomerative clustering (usually used).
– top-down: divisive clustering.
[Figure: agglomerative clustering of five points in two-dimensional space, showing the clusters {1,5}, {3,4}, {1,2,5}, and {1,2,3,4,5}]
Relationships between these pairwise distances: Clustering Algorithms
• Different algorithms
– Bottom-up or top-down
– Popular hierarchical bottom-up clustering method
– The distance between a cluster and the remaining clusters can be measured using the minimum (single linkage), maximum (complete linkage), or average (average linkage) distance.
Comparison of Linkage Methods
[Figure: dendrograms from single, average, and complete linkage]
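A minimal sketch of agglomerative clustering with the three linkage choices, using `hclust` and `cutree` from the stats package. The two groups of points are simulated; nothing here comes from the lecture's data.

```r
set.seed(6)
# Two simulated groups of points in two-dimensional space
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))
d <- dist(pts)

# Agglomerative clustering with three between-cluster distances
hc_single   <- hclust(d, method = "single")    # minimum distance
hc_complete <- hclust(d, method = "complete")  # maximum distance
hc_average  <- hclust(d, method = "average")   # average distance

# Draw the dendrogram for one of them
plot(hc_average, main = "Average linkage")

# Cutting the tree at some level gives a partition for any k
groups_k2 <- cutree(hc_average, k = 2)
groups_k3 <- cutree(hc_average, k = 3)
print(table(groups_k2))
```

The tree is built once; `cutree` can then be called with any k, which is why the number of clusters need not be fixed in advance.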
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within-cluster sums of squares. Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.
[Figures: partitioning examples with K = 2 and K = 4]
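k-means is one common partitioning method of this kind. The sketch below runs `kmeans` on simulated points with K = 2 and K = 4 and reports the criterion being minimized, the total within-cluster sum of squares. The data and the choice `nstart = 10` are assumptions for illustration.

```r
set.seed(7)
# Two simulated groups of 20 points each in two dimensions
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))

# nstart > 1 repeats the iterative reallocation from several random starts
km2 <- kmeans(pts, centers = 2, nstart = 10)
km4 <- kmeans(pts, centers = 4, nstart = 10)

# Total within-cluster sum of squares for each k
wss <- c(k2 = km2$tot.withinss, k4 = km4$tot.withinss)
print(wss)
```

The within-cluster sum of squares always drops as k grows, so it cannot be used on its own to pick k; it only compares solutions with the same k.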
Cluster Analysis
R functions: dist(), hclust(), heatmap()
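The three functions chain together naturally. In this sketch on a simulated gene-by-sample matrix, `dist` and `hclust` are first called directly, and then passed to `heatmap` through its `distfun` and `hclustfun` arguments so that the same choices are used for the row and column dendrograms. The block of co-expressed genes is invented.

```r
set.seed(8)
# Simulated matrix: 12 genes (rows) x 8 samples (columns)
expr <- matrix(rnorm(12 * 8), nrow = 12,
               dimnames = list(paste0("gene", 1:12), paste0("s", 1:8)))
expr[1:6, 1:4] <- expr[1:6, 1:4] + 3     # a block of co-expressed genes

d_genes  <- dist(expr)                            # distances between genes
hc_genes <- hclust(d_genes, method = "average")   # tree of genes

# heatmap() clusters rows and columns internally; distfun and hclustfun
# make the distance and linkage choices explicit
hm <- heatmap(expr,
              distfun   = dist,
              hclustfun = function(d) hclust(d, method = "average"),
              scale     = "row")
```

The returned list holds the row and column orderings (`hm$rowInd`, `hm$colInd`) used to draw the image.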
Classification and Prediction
[Diagram: a learning set (data with known classes) is passed to a classification technique to produce a classification rule (discrimination); the rule gives a class assignment for data with unknown classes (prediction)]
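The learning-set, rule, and assignment steps can be sketched with a very simple classifier. The nearest-centroid rule below is chosen only because it fits in a few lines; it is an illustrative assumption, not the classification technique used in the lecture. The data, class labels, and the helper `classify` are hypothetical.

```r
set.seed(9)
# Learning set: 30 samples x 5 features with known classes (simulated)
n <- 15
learn_x <- rbind(matrix(rnorm(n * 5, mean = 0), ncol = 5),
                 matrix(rnorm(n * 5, mean = 3), ncol = 5))
learn_y <- factor(rep(c("good", "bad"), each = n))

# Classification rule: one centroid (mean profile) per class
centroids <- rowsum(learn_x, learn_y) / as.vector(table(learn_y))

# Class assignment: pick the class whose centroid is closest
classify <- function(newx, centroids) {
  d <- apply(centroids, 1, function(ctr) sqrt(sum((newx - ctr)^2)))
  names(which.min(d))
}

new_sample <- rep(3, 5)                    # made-up observation, class unknown
predicted <- classify(new_sample, centroids)
print(predicted)
```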
Classification in Bioinformatics
• Computational diagnostics: early cancer detection
• Tumor biomarker discovery
• Protein folding prediction
• Protein-protein binding sites prediction
• Gene function prediction
[Figure: a learning set of arrays (objects) with gene expression feature vectors and predefined classes of clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs) gives a classification rule, which assigns a new array to a class (e.g. good prognosis, metastasis > 5 yrs)]
Reference: L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Clinical Problem: Upfront Therapy - Chemo or Surgery?
[Diagram of treatment paths: optimal cytoreduction, chemotherapy, remission; chemo, interval cytoreduction, remission; suboptimal debulking]
• Standard primary treatment for ovarian cancer is surgery followed by chemotherapy
• Rationale for surgery as the primary treatment is to remove as much tumor as possible
• Leaving tumor nodules larger than 1 cm (defined as suboptimal debulking) is associated with reduced chemosensitivity and poor survival
• If the tumor cannot be effectively removed by surgery, the patient is first treated with chemotherapy to partially shrink the tumor and then by surgery
• Surgeons cannot predict whether surgery will be effective or not
• The effect in individual patients is highly divergent depending on the biology of their disease
• Biomarkers may help physicians decide which patients should undergo surgery and which should be treated with
Clinical Decision-making Based on Serum Biomarkers
[Diagram: a serum assay (low levels vs. high levels) directs patients to optimal cytoreduction, chemotherapy, remission or to chemo, interval cytoreduction, remission]
Overview: Horizontal vs Vertical Integration
• Merge multiple data matrices
Why Horizontal Integration?
• Tremendous amount of public data
• Individual studies usually have moderate sample sizes
– Low statistical power
– Inconsistent conclusions
• Theoretically, the sample size required increases exponentially with the number of variables
• Combining multiple studies (meta-analysis) is a practical
Identifying Markers with Horizontal Integration
• Three gene expression datasets:
– TCGA gene expression data: n = 522 samples, m = 13238 genes with 401/121 suboptimal/optimal debulking
– GSE26712 gene expression data: n = 185 samples, m = 13238 genes with 95/90 suboptimal/optimal debulking
– GSE9891 (Tothill) gene expression data: n = 248 samples and m = 22635 genes with 164/84 suboptimal/optimal debulking
Build networks and identify genes with the first two datasets; validate the identified markers with the third (Tothill) dataset and a PPI database
• Normalize the data
• Screen the differentially expressed genes with p < 0.05
• Find common DE genes
• Construct common and differential networks
• Select markers with both differential networks and genes
• External validation
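The screening and intersection steps can be sketched on simulated data. The code below screens each of two hypothetical studies with a per-gene two-sample t-test at p < 0.05 and keeps the genes found in both. The choice of t-test, the sample sizes, and the helpers `sim_dataset` and `screen_de` are all assumptions for illustration; normalization and network construction are omitted.

```r
set.seed(10)
n_genes <- 100
genes <- paste0("gene", 1:n_genes)

# Simulate one study: genes 1-10 are shifted in the suboptimal group
sim_dataset <- function(n_sub, n_opt) {
  x <- matrix(rnorm(n_genes * (n_sub + n_opt)), nrow = n_genes,
              dimnames = list(genes, NULL))
  x[1:10, 1:n_sub] <- x[1:10, 1:n_sub] + 2
  list(x = x, group = rep(c("suboptimal", "optimal"), c(n_sub, n_opt)))
}
study1 <- sim_dataset(40, 30)
study2 <- sim_dataset(35, 35)

# Per-gene two-sample t-test, screened at p < alpha
screen_de <- function(study, alpha = 0.05) {
  p <- apply(study$x, 1, function(g) t.test(g ~ study$group)$p.value)
  names(p)[p < alpha]
}

# Common DE genes across the two studies
common_de <- intersect(screen_de(study1), screen_de(study2))
print(common_de)
```

Requiring a gene to pass in both studies removes most of the false positives that an unadjusted p < 0.05 screen produces in a single study.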
Why Networks?
• Genes may interact with each other and function together
• Network structures may vary under different clinical conditions
• Reproducibility of sub-networks is higher than that of individual genes
• Permutation test is usually used for detecting the difference in correlation structure
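A permutation test for a difference in correlation between two conditions can be sketched for a single gene pair. The data are simulated (correlated under condition A, uncorrelated under B), and the test statistic, the absolute difference of the two correlations, is one reasonable choice rather than the one used in the study.

```r
set.seed(11)
n <- 40
# Simulated pair of genes: correlated under condition A, uncorrelated under B
a1 <- rnorm(n)
a2 <- a1 + rnorm(n, sd = 0.3)
b1 <- rnorm(n)
b2 <- rnorm(n)
g1 <- c(a1, b1)
g2 <- c(a2, b2)
cond <- rep(c("A", "B"), each = n)

# Test statistic: absolute difference in correlation between the conditions
cor_diff <- function(labels) {
  abs(cor(g1[labels == "A"], g2[labels == "A"]) -
      cor(g1[labels == "B"], g2[labels == "B"]))
}
observed <- cor_diff(cond)

# Null distribution: shuffle the condition labels and recompute the statistic
n_perm <- 2000
perm <- replicate(n_perm, cor_diff(sample(cond)))
p_value <- (sum(perm >= observed) + 1) / (n_perm + 1)
print(p_value)
```

Adding 1 to the numerator and denominator keeps the estimated p-value strictly positive, which matters when none of the permuted statistics reaches the observed one.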
Left: Suboptimal Debulking Associated Cluster (Network module) with B from Our Data. Right: PPI Network from National Database
Some Observations
• Potential markers are highly reproducible: 21/22 genes come up in the independent Tothill data.
• In contrast, FABP4 and ADH1B, identified with TCGA and Tothill data by the MD Anderson group, are not significant in the third data set
• Most genes are quite significant with very small P-values
• COL11A1 achieves the smallest P-value (< 3e-9) using the external validation data. Further studies are ongoing by my collaborator (Dr. Sandra
• Liu Z, Beach JA, Agadjanian H, Jia D, Aspuria PJ, Karlan BY, and Orsulic S (2015), Suboptimal cytoreduction in ovarian carcinoma is associated with molecular pathways characteristic of increased stromal activation, Gynecologic Oncology, S0090-8258(15)30117-7.
Five websites that all biologists should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
A few more resources to be aware of
• Human Genome Working Draft
– http://genome.ucsc.edu/
• TIGR (The Institute for Genomics Research)
– http://www.tigr.org/
• Celera
– http://www.celera.com/
• (Model) Organism specific information:
– Yeast: http://genome-www.stanford.edu/Saccharomyces/
– Arabidopsis: http://www.tair.org/
– Mouse: http://www.jax.org/
– Fruitfly: http://www.fruitfly.org/
– Nematode: http://www.wormbase.org/
• Nucleic Acids Research Database Issue
Challenges
• Confusing choice of tools
• Developed independently
• Written by and for nerds
Outline
• What is R
• What is Bioconductor
• Getting and using Bioconductor
• Overview of Bioconductor packages
• Demo
R
• R is a language and environment for statistical computing and graphics.
• What sorts of things is R good at?
– There are very many statistical algorithms
– There are very many machine learning algorithms
– Visualization
Goals of Bioconductor
• Provide access to powerful statistical and graphical methods for the analysis of genomic data.
• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data.
• Allow the rapid development of extensible, interoperable, and scalable software.
• Promote high-quality documentation and reproducible research.
• Provide training in computational and statistical
Installation
1. Main R software: download from CRAN (cran.r-project.org), use latest release, now 1.8.0.
2. Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.3.
Available for Linux/Unix, Windows, and Mac OS.
Documentation and help
• R manuals and tutorials: available from the R website or on-line in an R session.
• R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.
> help.start()
> help(lm)
> ?hclust
> apropos("mean")
> example(hclust)
> demo()
> demo(image)
R cluster analysis packages
• cclust: convex clustering methods.
• class: self-organizing maps (SOM).
• cluster:
– AGglomerative NESting (agnes),
– Clustering LARge Applications (clara),
– DIvisive ANAlysis (diana),
– Fuzzy Analysis (fanny),
– MONothetic Analysis (mona),
– Partitioning Around Medoids (pam).
• e1071:
– fuzzy C-means clustering (cmeans),
– bagged clustering (bclust).
• flexmix: flexible mixture modeling.
• fpc: fixed point clusters, clusterwise regression and discriminant plots.
• GeneSOM: self-organizing maps.
• mclust, mclust98: model-based cluster analysis.
• mva (merged into stats in later R releases):
– hierarchical clustering (hclust),
– k-means (kmeans).
Hierarchical clustering
hclust function from
Heatmaps
References
• R: www.r-project.org, cran.r-project.org
– software (CRAN);
– documentation;
– newsletter: R News;
– mailing list.
• Bioconductor: www.bioconductor.org
– software, data, and documentation (vignettes);
– training materials from short courses;
Conclusions
• Visualization
• PCA and MDS visualization
• Clustering
• Classification
• Bioinformatics resources
• R and Bioconductor