Visualization and Clustering with High-dimensional Genomic Data

(1)

Visualization and Clustering with High-dimensional Genomic Data

Zhenqiu Liu, PhD

(2)

Agenda

• Big data and their visualization

• PCA and MDS for data visualization

• Clustering and data mining

• Data integration: An example

(3)

Introduction

• Visualization is the use of graphical techniques to communicate information and support reasoning or analysis

• Two types of Visualization:

– Scientific Visualization

– Information Visualization

(4)

Visualization

• Scientific Visualization

– Graphical representations of the results of mathematical models, computational data, and simulations.

– Involves research in computer graphics, statistics, image processing, high performance computing, and other areas.

– It's not just a pretty picture or animation.

• Information Visualization

– The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.

• Visualization is not only about looking at a pretty picture; it is about understanding the data.

(5)

A Key Question

How do we convert abstract information into a visual representation while still preserving the underlying meaning and at the same time providing new insight?

(6)

(7)

LifeLines

• Visualization of computerised medical records

• For a patient

– Horizontal lines (time lines) represent medical problems, hospitalization and medications

– Icons on these lines represent events such as tests and physician consultations

(8)

Types of Visualization

(Kosslyn 89)

• Graphs

• Charts

• Maps

• Diagrams

(9)

Common Graph Types

[Figure: examples of common graph types. Axis labels: length of page, length of access, URL, # of accesses, days.]

(10)

When to use which type?

• Line graph

– x-axis requires quantitative variable

– Variables have continuous values

– Ordering among ordinals

• Bar graph

– comparison of relative point values

• Scatter plot

– convey overall impression of relationship between two variables

• Pie Chart?

(11)

Growth Chart of GEO (RNA etc.)

The Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples, 16 billion individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000

(12)

GenBank growth chart (DNA sequences)

There are 126 billion bases in 135 million sequence records in the traditional GenBank divisions and 191 billion bases in 62 million sequence records in the

(13)

Big Omics Data

• A lot of genes and samples, heterogeneous data structure and data type.

• Big data collection vs. big data objects

• Big data collection: aggregates of many data sets (multi‐source, multi‐disciplinary, heterogeneous, and maybe distributed)

• Big data objects: single object too large

– For main memory

– For local disk

(14)

Basic Types of Omics Data

• Nominal (qualitative)

– No inherent order: SNP, sequencing, ...

• Ordinal (qualitative)

– Ordered, but not at measurable intervals

– first, second, third, …

– Clinical phenotypes (e.g. cancer stages)

• Quantitative

– List of integers or reals

– Gene expression, protein expression

(15)

Dimension Reduction

• High dimensional data points are difficult to visualize

• Always good to plot data in 2D‐3D

– Easier to detect or confirm the relationship among data points

– Catch stupid mistakes (e.g. in clustering)

• Two ways to reduce:

– By genes: some experiments are similar or have little information

– By experiments: some genes are similar or have little

(16)

Agenda

• Big data and their visualization

• PCA and MDS for data visualization

• Clustering and data mining

• Data integration: An example

(17)

Principal Component Analysis

• Optimal linear transformation that chooses a new coordinate system for the data set that maximizes the variance by projecting the data on to new axes in order of the principal components

• Components are orthogonal (mutually uncorrelated)

• Few PCs may capture most variation in original data
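The properties listed on this slide can be checked numerically. The following is a minimal sketch in R using `prcomp` from the stats package; the matrix `x`, its size, and the injected correlation are made-up illustration data, not anything from the lecture.

```r
# Simulated expression-like matrix: 50 samples (rows) x 10 genes (columns).
# The values are random and only serve to illustrate the call.
set.seed(1)
x <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.1)   # make two genes strongly correlated

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Loadings (the new axes) are orthonormal: t(V) %*% V is the identity matrix.
V <- pca$rotation
orthogonality_error <- max(abs(crossprod(V) - diag(ncol(V))))

# Scores (samples in the new coordinate system) are mutually uncorrelated.
score_cor <- cor(pca$x)
max_offdiag_cor <- max(abs(score_cor[upper.tri(score_cor)]))

# Component variances come out in decreasing order.
pc_var <- pca$sdev^2
```

Here `orthogonality_error` and `max_offdiag_cor` should both be zero up to rounding, and `pc_var` is sorted from the largest to the smallest component.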

(18)

PCA

[Figure: principal component directions v₁ and v₂]

(19)

Dimension Reduction (PCA)

Principal components pick out the directions in the data that capture the greatest variability.

[Figure: data plotted on the original axes x, y, z, with New Axis 1 and New Axis 2 overlaid]

(20)

Representing data in a reduced space

The first new axis is projected through the data so as to explain the greatest proportion of the variance in the data (most important).

The second new axis is orthogonal to the first and explains the next largest amount of variance.

[Figure: New Axis 1 and New Axis 2]

(21)

Typical Analysis

• Run the PCA analysis

• Plot the eigenvalues and select the number of components

• Plot PC1 vs. PC2, etc.

[Figure: plot of eigenvalues]
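The three steps of a typical analysis might look as follows in R. This is only a sketch: the built-in `USArrests` data set stands in for an expression matrix, and the 80% cutoff for choosing the number of components is an arbitrary assumption for illustration.

```r
# Built-in USArrests data stands in for an expression matrix here.
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues of the correlation matrix are the squared component SDs.
eig <- pca$sdev^2
prop_var <- eig / sum(eig)
cum_var <- cumsum(prop_var)

# Scree plot of the eigenvalues, used to choose the number of components.
screeplot(pca, type = "lines", main = "Scree plot")

# Smallest number of components explaining at least 80% of the variance
# (the 80% cutoff is an arbitrary choice for illustration).
n_keep <- which(cum_var >= 0.80)[1]

# Samples plotted on the first two principal components.
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

Other rules for picking the number of components (for example looking for an "elbow" in the scree plot) are equally common; the cumulative-variance rule is used here only because it is easy to state in code.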

(22)

Interpreting a PCA

Each axis represents a different “trend” or set of profiles.

The further from the origin, the greater the loading/contribution (i.e. higher expression).

(23)

(24)

Multidimensional scaling (MDS)

• MDS deals with the following problem: for a set of observed similarities (or distances) between every pair of N items, find a representation of the items in a few dimensions such that the similarity (distance) structure nearly matches the structure of the original similarities (or distances).

• The numerical measure of how close the original distances and the distances at lower

(25)

(26)

MDS

1. MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to best preserve the distance structure after the mapping.

2. The original data can lie in a high-dimensional or even non-metric space. The method only cares about the distance (dissimilarity) structure.

3. It could be shown that the results of PCA are exactly those of classical MDS if the distances calculated from the data matrix are Euclidean.
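Point 3 can be illustrated numerically. The sketch below, on simulated data (the matrix `x` is made up), compares the first two PCA scores from `prcomp` with the two-dimensional classical MDS solution from `cmdscale` applied to Euclidean distances. PCA is run with centering only, since scaling the variables would change the distances.

```r
set.seed(2)
x <- matrix(rnorm(30 * 5), nrow = 30)   # 30 simulated subjects, 5 variables

# PCA scores on the first two components (centering only, no scaling).
pc_scores <- prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:2]

# Classical MDS on the Euclidean distances between the same subjects.
mds_coords <- cmdscale(dist(x, method = "euclidean"), k = 2)

# Each axis is only defined up to sign, so compare absolute values.
max_diff <- max(abs(abs(pc_scores) - abs(mds_coords)))
```

`max_diff` should be zero up to rounding error. With a non-Euclidean dissimilarity (for example `method = "manhattan"`), the two solutions would generally differ.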

(27)

PCA vs. MDS

Input data

– PCA: Data matrix (S subjects in G dimensions)

– MDS: Dissimilarity structure (distance between any pair of subjects)

Method

– PCA: “Project” subjects to a low-dimensional space and preserve as much variance as possible

– MDS: Find a low-dimensional space that best keeps the dissimilarity structure

Restrictions

– PCA: Data have to be in Euclidean space

– MDS: Flexible to any data structure as long as the dissimilarity structure can be defined

Pros and cons

– PCA: The PCs can be further used to model in downstream analyses. If a new subject is added, it can be similarly projected.

– MDS: Flexibility and visualization. But if a new subject is added, it can’t be shown in an existing MDS solution.

(28)

PCA application: genomic study

• Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences, which can cause spurious associations in disease studies.

• PCA could be used to infer underlying

(29)

Figure 2

Nature Genetics 38, 904-909 (2006)

Principal components analysis corrects for stratification in genome-wide association studies

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reic

(30)

Chao Tian, Peter K. Gregersen and Michael F. Seldin. (2008) Accounting for ancestry: population substructure and

(31)

Software for dimension reduction & visualization

PCA in R:

– prcomp(stats): Principal Components Analysis (preferred)

– princomp(stats): Principal Components Analysis

– screeplot(stats): Screeplot of PCA Results

PCA in IMSL (a commercial C library)

MDS in R:

– isoMDS(MASS): Kruskal's Non-metric Multidimensional Scaling

– cmdscale(stats): Classical (Metric) Multidimensional Scaling

– sammon(MASS): Sammon's Non-Linear Mapping

MDS: Various software and resources about MDS

http://www.granular.com/MDS/

Heatmap visualization:

(32)

Agenda

• Big data and their visualization

• PCA and MDS for data visualization

• Clustering and data mining

• Data integration: An example

(33)

Visualization vs. Analysis?

• Applications to data mining and data discovery.

– Visualization tools are helpful for exploring hunches and presenting results

• Examples: scatterplots

– They are the WRONG primary tool when the goal is to find a good

(34)

Data Mining and Machine Learning

• Machine learning and data mining can be used to recognize or classify complex items (objects, situations, etc.), to predict future data or events, and to explore the structure in the data.

• On the boundary of Computer Science and Statistics.

(35)

Why Data Mining?

• A lot of data

• Data is noisy

• No clear biological theory

• Large number of features (genes)

• Complex relationships

(36)

Unsupervised Learning

Unsupervised learning attempts to discover interesting structure in the available data.

Data mining, Clustering

Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.

– Tailor-made for each person: too expensive

– One-size-fits-all: does not fit all

Example 2: in medicine, identifying patient subtypes based on their omics profiles.

(37)

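Supervised Learning

[Diagram: observations of an unknown system, together with the property of interest, form the training dataset; an ML algorithm learns a model from it; the model gives a prediction of the property for a new observation.]

Classification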

(38)

• Biologists are estimated to produce 25,000,000,000,000,000 bytes of data each year (roughly 35 million CD-ROMs).

• How do we learn something from this data?

• Find patterns/structure in the data.

Use cluster analysis

(39)

• Definition: Clustering is the process of grouping several objects into a number of groups, or clusters.

• Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters.

(40)

Basic principles of clustering

Aim: to group observations or variables that are “similar” based on predefined criteria.

Issues:

– Which genes / genomic technology to use?

– Which similarity or dissimilarity measure?

– Which method to use to join the clusters/observations?

– Which clustering algorithm?

– How to validate the resulting clusters?

(41)

[Workflow diagram, with boxes: Omics Data; for each gene, calculate a summary statistic and/or adjusted p-values; set of candidate DE genes; clustering of genes (similarity metrics, clustering algorithm); descriptive interpretation; biological verification.]

(42)

Which similarity or dissimilarity measure?

• A metric is a measure of the similarity or dissimilarity between two data objects

• Two main classes of metric:

– Correlation coefficients (similarity)

• Compares shape of expression curves

– Kernel matrix (e.g. string kernel)

– Distance metrics (dissimilarity)

• City Block (Manhattan) distance

• Euclidean distance

(43)

• Pearson Correlation Coefficient (centered correlation)

Correlation (a measure between -1 and 1):

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) $$

S_x = standard deviation of x, S_y = standard deviation of y

• Others include Spearman’s ρ and Kendall’s τ
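As a quick check of the Pearson coefficient as the average product of standardized values, the following sketch computes it by hand and compares it with R's built-in `cor`. The two vectors are made-up expression values for two hypothetical genes.

```r
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)   # made-up expression values for gene x
y <- c(1.0, 2.2, 0.8, 3.9, 3.1)   # made-up expression values for gene y
n <- length(x)

# Sum of products of standardized values, divided by n - 1.
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Built-in equivalents; Spearman and Kendall are rank-based alternatives.
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")
```

`r_manual` and `r_pearson` agree up to rounding. Because the two vectors happen to have identical rank orderings, the rank-based coefficients are both 1 even though the Pearson value is below 1.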

(44)

Distance metrics

• City Block (Manhattan) distance:

– Sum of absolute differences across dimensions

– Less sensitive to outliers

– Diamond shaped clusters

$$ d(X, Y) = \sum_{i} |x_i - y_i| $$

• Euclidean distance:

– Most commonly used distance

– Sphere shaped cluster

– Corresponds to the geometric distance in the multidimensional space

$$ d(X, Y) = \sqrt{\sum_{i} (x_i - y_i)^2} $$

where gene X = (x₁,…,x_n) and gene Y = (y₁,…,y_n)

[Figure: genes X and Y plotted against Condition 1 and Condition 2, one panel per distance]
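The two distances can be worked through on a small example. This sketch uses two made-up profiles `X` and `Y` over four conditions and compares the hand computation with R's `dist`.

```r
X <- c(1, 4, 2, 8)   # made-up profile of gene X across 4 conditions
Y <- c(2, 2, 5, 3)   # made-up profile of gene Y

manhattan_manual <- sum(abs(X - Y))          # 1 + 2 + 3 + 5 = 11
euclidean_manual <- sqrt(sum((X - Y)^2))     # sqrt(1 + 4 + 9 + 25) = sqrt(39)

# dist() computes distances between the rows of a matrix.
m <- rbind(X, Y)
manhattan_dist <- as.numeric(dist(m, method = "manhattan"))
euclidean_dist <- as.numeric(dist(m, method = "euclidean"))
```

The largest single difference (5) contributes 25 of the 39 under the square in the Euclidean distance but only 5 of the 11 in the Manhattan distance, which is why the latter is described as less sensitive to outliers.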

(45)

Euclidean vs Correlation (I)

• Euclidean distance

• Correlation
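The contrast between the two measures can be seen on a toy example. In this sketch the three gene profiles are made up: `g2` has the same shape as `g1` at ten times the magnitude, while `g3` has a similar magnitude to `g1` but a flat, unrelated shape. Using 1 minus the Pearson correlation as the dissimilarity is one common choice, assumed here for illustration.

```r
# Three made-up gene profiles over 5 conditions.
g1 <- c(1, 2, 3, 4, 5)
g2 <- g1 * 10                    # same shape as g1, much larger magnitude
g3 <- c(3, 3.2, 2.9, 3.1, 3)     # similar magnitude to g1, different shape
genes <- rbind(g1 = g1, g2 = g2, g3 = g3)

# Euclidean distance: g1 is closer to g3 than to g2.
d_euc <- as.matrix(dist(genes))

# Correlation-based dissimilarity (1 - Pearson r): g1 is closest to g2.
d_cor <- 1 - cor(t(genes))
```

Euclidean distance groups genes by absolute expression level, whereas the correlation-based measure groups genes whose profiles rise and fall together regardless of level.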

(46)

Clustering algorithms

• Clustering algorithms come in two basic flavors: hierarchical and partitioning

(47)

Hierarchical methods

• Hierarchical clustering methods produce a tree or dendrogram.

• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.

• The tree can be built in two distinct ways

– bottom-up: agglomerative clustering (usually used).

– top-down: divisive clustering.
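A minimal sketch of agglomerative clustering in R, assuming simulated two-dimensional data with three well-separated groups (the data and the choice of average linkage are illustrative only). It shows how one tree yields a partition for each k via `cutree`.

```r
set.seed(3)
# Simulated data: 3 groups of 10 points each in 2 dimensions.
x <- rbind(matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))

# Agglomerative (bottom-up) clustering on Euclidean distances.
tree <- hclust(dist(x), method = "average")

# Cutting the same tree at different levels gives a partition for each k.
parts <- cutree(tree, k = 2:4)
table(parts[, "3"])   # cluster sizes for k = 3

plot(tree, labels = FALSE, main = "Dendrogram")
```

Divisive (top-down) clustering is available as `diana` in the cluster package, which is listed later in these slides.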

(48)

[Figure: agglomerative clustering of five points in two-dimensional space. Merges shown in the dendrogram: {1,5}, {1,2,5}, {3,4}, {1,2,3,4,5}.]

(49)

Relationships between these pairwise distances: Clustering Algorithms

• Different algorithms

– Bottom-up or top-down

– Popular hierarchical bottom-up clustering method

– The distance between a cluster and the remaining clusters can be measured using the minimum (single linkage), maximum (complete linkage) or average (average linkage) distance.

(50)

Comparison of Linkage Methods

[Figure: dendrograms obtained with single, average, and complete linkage]
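To compare linkage methods on the same data, one can pass different `method` values to `hclust`. This is a sketch on simulated points; the data are made up and the side-by-side plotting layout is just one convenient choice.

```r
set.seed(4)
x <- matrix(rnorm(40), ncol = 2)   # 20 simulated points
d <- dist(x)

# Same distances, three definitions of between-cluster distance:
# single = minimum, complete = maximum, average = mean pairwise distance.
trees <- lapply(c(single = "single", average = "average", complete = "complete"),
                function(m) hclust(d, method = m))

# Height of the final merge under each linkage.
top_heights <- sapply(trees, function(tr) max(tr$height))

op <- par(mfrow = c(1, 3))
for (m in names(trees)) plot(trees[[m]], main = m, xlab = "", sub = "")
par(op)
```

Single linkage tends to produce long chained clusters, while complete linkage favours compact ones, so the tree shapes usually differ visibly even on a small data set.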

(51)

Partitioning methods

• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares.

Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.
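k-means is the most familiar partitioning method and minimizes the within-cluster sum of squares mentioned above. The sketch below runs `kmeans` with k = 2 and k = 4 on simulated data with four well-separated groups; the data, the group centers, and `nstart = 20` are illustrative assumptions.

```r
set.seed(5)
# Simulated data: 4 well-separated groups of 25 points in 2 dimensions.
centers <- rbind(c(0, 0), c(0, 8), c(8, 0), c(8, 8))
x <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(25, centers[i, 1], 0.5), rnorm(25, centers[i, 2], 0.5))))

# k must be chosen in advance; nstart repeats the random initialisation.
fit2 <- kmeans(x, centers = 2, nstart = 20)
fit4 <- kmeans(x, centers = 4, nstart = 20)

# Criterion being minimised: total within-cluster sum of squares.
c(k2 = fit2$tot.withinss, k4 = fit4$tot.withinss)

# Every observation is assigned to exactly one cluster.
table(fit4$cluster)
```

The within-cluster sum of squares always decreases as k grows, so it cannot be used on its own to choose k; it is shown here only to make the optimization criterion concrete.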

(52)

K = 2

(53)

K = 4

(54)

Cluster Analysis

dist() hclust() heatmap()
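The three functions named on this slide chain together naturally. The following sketch uses a simulated gene-by-sample matrix (made-up data with one shifted block) and passes explicit `distfun` and `hclustfun` arguments to `heatmap` so that the distance and linkage choices are visible.

```r
set.seed(6)
# Simulated matrix: 20 genes (rows) x 8 samples (columns).
expr <- matrix(rnorm(20 * 8), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("s", 1:8)))
expr[1:10, 1:4] <- expr[1:10, 1:4] + 3   # a block of co-expressed genes

# Step by step: distances between genes, then a tree, then a cut.
gene_dist <- dist(expr, method = "euclidean")
gene_tree <- hclust(gene_dist, method = "average")
gene_groups <- cutree(gene_tree, k = 2)

# heatmap() runs the same two steps on rows and columns and draws the result.
heatmap(expr,
        distfun = function(m) dist(m, method = "euclidean"),
        hclustfun = function(d) hclust(d, method = "average"),
        scale = "row")
```

Swapping `distfun` for a correlation-based dissimilarity, such as `function(m) as.dist(1 - cor(t(m)))`, is a common variation when the shape of the profile matters more than its level.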

(55)

(56)

Classification and Prediction

[Diagram: a learning set (data with known classes) is passed to a classification technique, which produces a classification rule (discrimination); the rule is applied to data with unknown classes to give a class assignment (prediction).]

(57)

Classification in Bioinformatics

• Computational diagnostic: early cancer detection

• Tumor biomarker discovery

• Protein folding prediction

• Protein-protein binding sites prediction

• Gene function prediction

(58)

[Diagram: objects = arrays; feature vectors = gene expression; predefined classes = clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs). The learning set yields a classification rule, which assigns a new array to a class, e.g. good prognosis (metastasis > 5 yrs).]

Reference: L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

(59)

Agenda

• Data integration: An example

(60)

Clinical Problem: Upfront Therapy - Chemo or Surgery?

[Diagram: two treatment paths. Path 1: optimal cytoreduction, chemotherapy, remission. Path 2: chemo, interval cytoreduction, remission.]

(61)

Suboptimal Debulking

• Standard primary treatment for ovarian cancer is surgery followed by chemotherapy

• Rationale for surgery as the primary treatment is to remove as much tumor as possible

• Leaving tumor nodules larger than 1 cm (defined as suboptimal debulking) is associated with reduced chemosensitivity and poor survival

• If the tumor cannot be effectively removed by surgery, the patient is first treated with chemotherapy to partially shrink the tumor and then by surgery

• Surgeons cannot predict whether surgery will be effective or not

• The effect in individual patients is highly divergent depending on the biology of their disease

• Biomarkers may help physicians decide which patients should undergo surgery and which should be treated with

(62)

Clinical Decision-making Based on Serum Biomarkers

[Diagram: a serum assay (low levels vs. high levels) guides the choice between two treatment paths. Path 1: optimal cytoreduction, chemotherapy, remission. Path 2: chemo, interval cytoreduction, remission.]

(63)

Overview: Horizontal vs Vertical Integration

• Merge multiple data matrices

(64)

Why Horizontal Integration?

• Tremendous amount of public data

• Individual study usually contains moderate sample size

– Low statistical power

– Inconsistent conclusions

• Theoretically, the sample size required increases exponentially with the number of variables

• Combining multiple studies (meta‐analysis) is a practical

(65)

Identifying Markers with Horizontal Integration

• Three gene expression datasets:

– TCGA gene expression data: n = 522 samples, m = 13238 genes, with 401/121 suboptimal/optimal debulking

– GSE26712 gene expression data: n = 185 samples, m = 13238 genes, with 95/90 suboptimal/optimal debulking

– GSE9891 (Tothill) gene expression data: n = 248 samples and m = 22635 genes, with 164/84

(66)

Build networks and identify genes with the first two datasets

Validate the identified markers with the third (Tothill) data and PPI database

• Normalize the data

• Screen the differentially expressed (DE) genes with p < 0.05

• Find common DE genes

• Construct common and differentiated networks

• Select markers with both Diff net and genes

• External
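The screening and intersection steps could be sketched as follows. Everything here is hypothetical: the two studies are simulated, the helper functions `simulate_study` and `de_genes` are invented for illustration, and a per-gene two-sample t-test is assumed as the screening test because the slides do not say which test was used. Normalization and network construction are not shown.

```r
set.seed(7)
n_genes <- 200
genes <- paste0("g", seq_len(n_genes))

# Simulate one study: genes 1-20 are truly shifted between the two groups.
simulate_study <- function(n1, n2) {
  x <- matrix(rnorm(n_genes * (n1 + n2)), nrow = n_genes,
              dimnames = list(genes, NULL))
  x[1:20, seq_len(n1)] <- x[1:20, seq_len(n1)] + 1.5
  list(expr = x, group = rep(c("suboptimal", "optimal"), c(n1, n2)))
}

# Per-gene two-sample t-test; keep genes with p < 0.05.
de_genes <- function(study) {
  p <- apply(study$expr, 1, function(v)
    t.test(v[study$group == "suboptimal"], v[study$group == "optimal"])$p.value)
  names(p)[p < 0.05]
}

study1 <- simulate_study(40, 30)
study2 <- simulate_study(25, 25)

# Genes called DE in both studies.
common_de <- intersect(de_genes(study1), de_genes(study2))
```

Requiring a gene to pass the threshold in both studies removes most false positives from either study alone, since an unadjusted p < 0.05 screen on 180 null genes would pass several by chance in each study.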

(67)

Why Networks?

• Genes may interact with each other and function together

• Network structures may vary under different clinical conditions

• Reproducibility of sub‐networks is higher than that of individual genes

• Permutation test is usually used for detecting the difference in correlation structure
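A permutation test for a difference in correlation between two clinical groups might be sketched like this for a single gene pair. The data are simulated, the absolute difference in Pearson correlation is an assumed test statistic, and 2000 permutations is an arbitrary choice; the slides do not specify these details.

```r
set.seed(8)
n <- 40
# Simulated pair of genes: correlated in group A, independent in group B.
a1 <- rnorm(n); a2 <- a1 + rnorm(n, sd = 0.5)
b1 <- rnorm(n); b2 <- rnorm(n)

g1 <- c(a1, b1); g2 <- c(a2, b2)
group <- rep(c("A", "B"), each = n)

# Test statistic: absolute difference between the two within-group correlations.
cor_diff <- function(labels) {
  abs(cor(g1[labels == "A"], g2[labels == "A"]) -
      cor(g1[labels == "B"], g2[labels == "B"]))
}

observed <- cor_diff(group)

# Null distribution: shuffle group labels, keeping each sample's gene pair intact.
n_perm <- 2000
null_diffs <- replicate(n_perm, cor_diff(sample(group)))

# Permutation p-value with the usual +1 correction.
p_value <- (sum(null_diffs >= observed) + 1) / (n_perm + 1)
```

For a whole network the same label-shuffling scheme applies, with the statistic replaced by a summary over all gene pairs in the module.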

(68)

Left: Suboptimal Debulking Associated Cluster (Network module) with B from Our Data. Right: PPI Network from National Database

(69)

(70)

Some Observations

• Potential markers are highly reproducible: 21/22 genes come up in the independent Tothill data.

• In contrast, FABP4 and ADH1B, identified with the TCGA and Tothill data by the MD Anderson group, are not significant in the third data set.

• Most genes are quite significant, with very small P-values.

• COL11A1 achieves the smallest P-value (< 3e-9) using the external validation data. Further studies are ongoing by my collaborator (Dr. Sandra

(71)

• Liu Z, Beach JA, Agadjanian H, Jia D, Aspuria PJ, Karlan BY, and Orsulic S (2015), Suboptimal cytoreduction in ovarian carcinoma is associated with molecular pathways characteristic of increased stromal activation, Gynecologic Oncology, S0090-8258(15)30117-7.

(72)

Agenda

• Data integration: An example

(73)

Five websites that all biologists should know

• NCBI (The National Center for Biotechnology Information)

– http://www.ncbi.nlm.nih.gov/

• EBI (The European Bioinformatics Institute)

– http://www.ebi.ac.uk/

• The Canadian Bioinformatics Resource

– http://www.cbr.nrc.ca/

• SwissProt/ExPASy (Swiss Bioinformatics Resource)

– http://expasy.cbr.nrc.ca/sprot/

• PDB (The Protein Databank)

(74)

A few more resources to be aware of

• Human Genome Working Draft

– http://genome.ucsc.edu/

• TIGR (The Institute for Genomic Research)

– http://www.tigr.org/

• Celera

– http://www.celera.com/

• (Model) Organism specific information:

– Yeast: http://genome-www.stanford.edu/Saccharomyces/

– Arabidopsis: http://www.tair.org/

– Mouse: http://www.jax.org/

– Fruitfly: http://www.fruitfly.org/

– Nematode: http://www.wormbase.org/

• Nucleic Acids Research Database Issue

(75)

Challenges

• Confusing choice of tools

• Developed independently

• Written by and for nerds

(76)

Outline

• What is R

• What is Bioconductor

• Getting and using Bioconductor

• Overview of Bioconductor packages

• Demo

(77)

R

• R is a language and environment for statistical computing and graphics.

• What sorts of things is R good at?

– there are very many statistical algorithms

– there are very many machine learning algorithms

– visualization

(78)

Goals of Bioconductor

• Provide access to powerful statistical and graphical methods for the analysis of genomic data.

• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data.

• Allow the rapid development of extensible, interoperable, and scalable software.

• Promote high-quality documentation and reproducible research.

• Provide training in computational and statistical

(79)

(80)

Installation

1. Main R software: download from CRAN (cran.r-project.org), use latest release, now 1.8.0.

2. Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.3.

Available for Linux/Unix, Windows, and Mac OS.

(81)

Documentation and help

• R manuals and tutorials: available from the R website or on-line in an R session.

• R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.

> help.start()
> help(lm)
> ?hclust
> apropos("mean")
> example(hclust)
> demo()
> demo(image)

(82)

R cluster analysis packages

• cclust: convex clustering methods.

• class: self-organizing maps (SOM).

• cluster:

– AGglomerative NESting (agnes),

– Clustering LARge Applications (clara),

– DIvisive ANAlysis (diana),

– Fuzzy Analysis (fanny),

– MONothetic Analysis (mona),

– Partitioning Around Medoids (pam).

• e1071:

– fuzzy C-means clustering (cmeans),

– bagged clustering (bclust).

• flexmix: flexible mixture modeling.

• fpc: fixed point clusters, clusterwise regression and discriminant plots.

• GeneSOM: self-organizing maps.

• mclust, mclust98: model-based cluster analysis.

• mva:

– hierarchical clustering (hclust),

– k-means (kmeans).

(83)

Hierarchical clustering

hclust function from

(84)

Heatmaps

(85)

References

• R: www.r-project.org, cran.r-project.org

– software (CRAN);

– documentation;

– newsletter: R News;

– mailing list.

• Bioconductor: www.bioconductor.org

– software, data, and documentation (vignettes);

– training materials from short courses;

(86)

Conclusions

• Visualization

• PCA and MDS visualization

• Clustering

• Classification

• Bioinformatics resources

• R and Bioconductor