(1)

Visualization and Clustering with High-dimensional Genomic Data

Zhenqiu Liu, PhD

(2)

Agenda

• Big data and their visualization

• PCA and MDS for data visualization

• Clustering and data mining

• Data integration: An example

(3)


Introduction

• Visualization is the use of graphical techniques to communicate information and support reasoning or analysis

• Two types of Visualization:

– Scientific Visualization

– Information Visualization

(4)

Visualization

• Scientific Visualization

– Graphical representations of the results of mathematical models, computational data, and simulations.

– Involves research in computer graphics, statistics, image processing, high-performance computing, and other areas.

– It's not just a pretty picture or animation.

• Information Visualization:

– The use of computer-supported, interactive, visual representations of abstract data to amplify cognition.

Visualization is more than looking at a pretty picture; the aim is an understanding of the data.

(5)

A Key Question

How do we convert abstract information into a visual representation while still preserving the underlying meaning and, at the same time, providing new insight?

(6)


(7)


LifeLines

• Visualization of computerised medical records

• For a patient

– Horizontal lines (time lines) represent medical problems, hospitalization and medications

– Icons on these lines represent events such as tests and physician consultations

(8)

Types of Visualization

(Kosslyn 89)

• Graphs

• Charts

• Maps

• Diagrams


(9)

Common Graph Types

[Figure: common graph types, with axes including length of page, length of access, URL, days, and # of accesses]

(10)

When to use which type?

• Line graph

– x-axis requires quantitative variable

– Variables have continuous values

– Ordering among ordinals

• Bar graph

– comparison of relative point values

• Scatter plot

– convey overall impression of relationship between two variables

• Pie Chart?

(11)

Growth Chart of GEO (RNA etc.)

The Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples and 16 billion individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000

(12)

GenBank growth chart (DNA sequences)

There are 126 billion bases in 135 million sequence records in the traditional GenBank divisions and 191 billion bases in 62 million sequence records in the

(13)

Big Omics Data

A lot of genes and samples, with heterogeneous data structures and data types.

Big data collection vs. big data objects

• Big data collection: aggregates of many data sets (multi-source, multi-disciplinary, heterogeneous, and maybe distributed)

• Big data objects: single object too large

– For main memory

– For local disk

(14)

Basic Types of Omics Data

• Nominal (qualitative)

– No inherent order: SNP, sequencing, ...

• Ordinal (qualitative)

– Ordered, but not at measurable intervals

– first, second, third, …

– Clinical phenotypes (e.g. cancer stages)

• Quantitative

– List of integers or reals

– Gene expression, protein expression

(15)

Dimension Reduction

• High dimensional data points are difficult to visualize

• Always good to plot data in 2D-3D

– Easier to detect or confirm the relationship among data points

– Catch stupid mistakes (e.g. in clustering)

• Two ways to reduce:

– By genes: some experiments are similar or have little information

– By experiments: some genes are similar or have little

(16)


(17)

Principal Component Analysis

• Optimal linear transformation that chooses a new coordinate system for the data set that maximizes the variance by projecting the data on to new axes in order of the principal components

• Components are orthogonal (mutually uncorrelated)

• Few PCs may capture most variation in original data
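
The properties listed on this slide can be checked on a small example. The following is a minimal sketch in plain Python, not the deck's own code: the toy data, the function names `covariance` and `leading_eigenpairs`, and the use of power iteration with deflation to find the eigenvectors of the covariance matrix are all assumptions made for illustration.

```python
import math
import random

def covariance(rows):
    """Return (sample covariance matrix, centered rows); one row per sample."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered

def leading_eigenpairs(sym, k, iters=500, seed=0):
    """Top-k (eigenvalue, unit eigenvector) pairs of a symmetric positive
    semi-definite matrix, by power iteration with deflation."""
    rng = random.Random(seed)
    p = len(sym)
    mat = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(mat[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:      # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * mat[a][b] * v[b] for a in range(p) for b in range(p))
        pairs.append((lam, v))
        for a in range(p):        # deflate: remove the component just found
            for b in range(p):
                mat[a][b] -= lam * v[a] * v[b]
    return pairs

# Toy data (made up): two variables that rise together, roughly y = 2x.
rows = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]
cov, centered = covariance(rows)
pairs = leading_eigenpairs(cov, k=2)
total_var = sum(cov[j][j] for j in range(len(cov)))
explained = [lam / total_var for lam, _ in pairs]
# Scores: coordinates of each sample on the new axes (PC1, PC2).
scores = [[sum(c[j] * v[j] for j in range(len(c))) for _, v in pairs]
          for c in centered]
```

Here `explained` holds the share of total variance carried by each component, and the two eigenvectors in `pairs` come out orthogonal, which matches the bullets above. On real expression data a library routine would normally be used instead of hand-written power iteration.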

(18)

PCA

(19)

Dimension Reduction (PCA)

Principal components pick out the directions in the data that capture the greatest variability.

[Figure: data plotted on the original x, y, z axes, with New Axis 1 and New Axis 2 overlaid]

(20)

Representing data in a reduced space

The first new axis is projected through the data so as to explain the greatest proportion of the variance in the data (most important).

The second new axis is orthogonal to the first and explains the next largest amount of variance.

[Figure: New Axis 1 and New Axis 2]

(21)

Typical Analysis

PCA analysis: plot of eigenvalues, select the number of components. Plot PC1 v PC2, etc.

[Figure: plot of eigenvalues]

(22)

Interpreting a PCA

• Each axis represents a different “trend” or set of profiles

• The further from the origin, the greater the loading/contribution (i.e. higher expression)

(23)
(24)

Multidimensional scaling (MDS)

MDS deals with the following problem: for a set of observed similarities (or distances) between every pair of N items, find a representation of the items in few dimensions such that the similarity (distance) structure nearly matches the original similarities (or distances).

The numerical measure of how close the original distances and the distances at lower

(25)
(26)

MDS

1. MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to best preserve the distance structure after the mapping.

2. The original data can lie in a high-dimensional or even non-metric space. The method only cares about the distance (dissimilarity) structure.

3. It can be shown that the results of PCA are exactly those of classical MDS if the distances calculated from the data matrix are Euclidean.
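
A small worked example may make the mapping concrete. The sketch below implements classical (metric) MDS in plain Python under stated assumptions: the input distances are Euclidean, the names `classical_mds` and `leading_eigenpairs` are invented for this illustration, and eigenvectors are found by simple power iteration rather than a library routine.

```python
import math
import random

def leading_eigenpairs(sym, k, iters=500, seed=0):
    """Top-k (eigenvalue, unit eigenvector) pairs of a symmetric positive
    semi-definite matrix, by power iteration with deflation."""
    rng = random.Random(seed)
    p = len(sym)
    mat = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(mat[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * mat[a][b] * v[b] for a in range(p) for b in range(p))
        pairs.append((lam, v))
        for a in range(p):
            for b in range(p):
                mat[a][b] -= lam * v[a] * v[b]
    return pairs

def classical_mds(dist, k=2):
    """Embed n items in k dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]
    pairs = leading_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(n)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy input (made up): distances among the corners of a 3 x 4 rectangle.
corners = [(0, 0), (3, 0), (0, 4), (3, 4)]
dist = [[euclidean(a, b) for b in corners] for a in corners]
coords = classical_mds(dist, k=2)
recovered = [[euclidean(a, b) for b in coords] for a in coords]
```

Only the distance matrix `dist` is passed to `classical_mds`; the original coordinates are never seen, yet the pairwise distances among the recovered `coords` agree with the input. The embedding itself is determined only up to rotation and reflection.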

(27)

PCA vs. MDS

• Input data

– PCA: data matrix (S subjects in G dimensions)

– MDS: dissimilarity structure (distance between any pair of subjects)

• Method

– PCA: “project” subjects to a low-dimensional space and preserve as much variance as possible

– MDS: find a low-dimensional space that best keeps the dissimilarity structure

• Restrictions

– PCA: data have to be in Euclidean space

– MDS: flexible to any data structure as long as the dissimilarity structure can be defined

• Pros and cons

– PCA: the PCs can be further used for modeling in downstream analyses. If a new subject is added, it can be similarly projected.

– MDS: flexibility and visualization. But if a new subject is added, it can’t be shown in an existing MDS solution.

(28)

PCA application: genomic study

Population stratification: allele frequency differences between cases and controls due to systematic ancestry differences, which can cause spurious associations in disease studies.

PCA could be used to infer underlying

(29)

Figure 2

Nature Genetics 38, 904-909 (2006)

Principal components analysis corrects for stratification in genome-wide association studies

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reic

(30)

Chao Tian, Peter K. Gregersen and Michael F. Seldin. (2008) Accounting for ancestry: population substructure and

(31)

Software for dimension reduction & visualization

PCA in R:

– prcomp (stats): Principal Components Analysis (preferred)

– princomp (stats): Principal Components Analysis

– screeplot (stats): Screeplot of PCA Results

PCA in IMSL (a commercial C library)

MDS in R:

– isoMDS (MASS): Kruskal's Non-metric Multidimensional Scaling

– cmdscale (stats): Classical (Metric) Multidimensional Scaling

– sammon (MASS): Sammon's Non-Linear Mapping

MDS: Various software and resources about MDS

http://www.granular.com/MDS/

Heatmap visualization:

(32)


(33)

Visualization vs. Analysis?

• Applications to data mining and data discovery.

– Visualization tools are helpful for exploring hunches and presenting results

• Examples: scatterplots

– They are the WRONG primary tool when the goal is to find a good

(34)

Data Mining and Machine Learning

• Machine learning and data mining can be used to recognize or classify complex items (objects, situations, etc.), to predict future data or events, and to explore the structure in the data.

• On the boundary of Computer Science and Statistics.

(35)

Why Data Mining?

• A lot of data

• Data is noisy

• No clear biological theory

• Large number of features (genes)

• Complex relationships

(36)


Unsupervised Learning

Unsupervised learning attempts to discover interesting structure in the available data

Data mining, Clustering

Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.

– Tailor-made for each person: too expensive

– One-size-fits-all: does not fit all

Example 2: in medicine, identifying patient subtypes based on their omics profiles

(37)

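Supervised Learning

[Diagram: an unknown system produces observations and a property of interest. A training dataset is fed to an ML algorithm, which produces a model; given a new observation, the model outputs a prediction of the unknown property.]

Classification
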
(38)

• Biologists are estimated to produce 25,000,000,000,000,000 bytes of data each year (about 35 billion CD-ROMs).

• How do we learn something from this data?

• Find patterns/structure in the data. Use cluster analysis.

(39)

• Definition: Clustering is the process of grouping several objects into a number of groups, or clusters.

• Goal: Objects in the same cluster are more similar to one another than they are to objects in other clusters.

(40)

Basic principles of clustering

Aim: to group observations or variables that are “similar” based on predefined criteria.

Issues:

– Which genes / genomic technology to use?

– Which similarity or dissimilarity measure?

– Which method to use to join the clusters/observations?

– Which clustering algorithm?

– How to validate the resulting clusters?

(41)

Clustering

[Diagram: analysis workflow. Omics data; for each gene, calculate summary statistics and/or adjusted p-values; set of candidate DE genes; clustering of genes (similarity metrics, clustering algorithm); descriptive interpretation; biological verification.]

(42)


Which similarity or dissimilarity measure?

• A metric is a measure of the similarity or dissimilarity between two data objects

• Two main classes of metric:

– Correlation coefficients (similarity)

• Compares shape of expression curves

– Kernel matrix (e.g. string kernel)

– Distance metrics (dissimilarity)

• City Block (Manhattan) distance

• Euclidean distance

(43)

Correlation (a measure between -1 and 1)

• Pearson Correlation Coefficient (centered correlation)

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) $$

S_x = standard deviation of x, S_y = standard deviation of y

• Others include Spearman’s ρ and Kendall’s τ
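
As a quick numerical check of the Pearson coefficient defined on this slide, here is a minimal plain-Python sketch. The function name `pearson` and the toy vectors are assumptions chosen for illustration; the sample standard deviations use the n - 1 denominator.

```python
import math

def pearson(x, y):
    """Pearson correlation: mean product of standardized deviations,
    using the sample (n - 1) standard deviations S_x and S_y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)

# Made-up profiles: one rises with x, the other falls as x rises.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

A perfectly increasing linear relationship gives a value at the upper end of the range and a perfectly decreasing one gives the lower end, consistent with the stated bounds of -1 and 1.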

(44)

Distance metrics

• City Block (Manhattan) distance:

– Sum of absolute differences across dimensions

– Less sensitive to outliers

– Diamond shaped clusters

$$ d(X, Y) = \sum_{i} |x_i - y_i| $$

• Euclidean distance:

– Most commonly used distance

– Sphere shaped cluster

– Corresponds to the geometric distance in the multidimensional space

$$ d(X, Y) = \sqrt{\sum_{i} (x_i - y_i)^2} $$

where gene X = (x1,…,xn) and gene Y = (y1,…,yn)

[Figure: genes X and Y plotted against Condition 1 and Condition 2, one panel per distance]
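
The two distances on this slide are short enough to write out directly. This is a minimal sketch with hypothetical function names and a made-up pair of two-condition genes, intended only to show the arithmetic.

```python
import math

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical genes measured under two conditions.
gene_x = (0.0, 0.0)
gene_y = (3.0, 4.0)
d_manhattan = manhattan(gene_x, gene_y)   # 3 + 4
d_euclidean = euclidean(gene_x, gene_y)   # sqrt(9 + 16)
```

For these two points the city-block distance is the sum of the two legs, while the Euclidean distance is the length of the hypotenuse.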

(45)


Euclidean vs Correlation (I)

• Euclidean distance

• Correlation
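
The contrast between the two measures can be shown numerically. In the sketch below (toy profiles and function names are assumptions), gene A and gene B have the same shape at different scales, while gene C has the opposite shape but similar magnitude to A.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up expression profiles across four conditions.
gene_a = [1, 2, 3, 4]
gene_b = [10, 20, 30, 40]   # same shape as A, ten times the magnitude
gene_c = [4, 3, 2, 1]       # opposite shape, same magnitude as A

dist_ab, dist_ac = euclidean(gene_a, gene_b), euclidean(gene_a, gene_c)
corr_ab, corr_ac = pearson(gene_a, gene_b), pearson(gene_a, gene_c)
```

Euclidean distance places A closer to C (similar magnitude), whereas correlation treats A and B as identical in shape and A and C as opposites. Which behaviour is wanted depends on whether magnitude or profile shape matters for the question at hand.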

(46)

Clustering algorithms

• Clustering algorithms come in two basic flavors

(47)

Hierarchical methods

• Hierarchical clustering methods produce a tree or dendrogram.

• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.

• The tree can be built in two distinct ways

– bottom-up: agglomerative clustering (usually used).

– top-down: divisive clustering.

(48)

[Figure: agglomerative clustering illustrated on five points (1-5) in two-dimensional space, with merged groups {1,5}, {1,2,5}, {3,4}, and {1,2,3,4,5}]

(49)

Relationships between these pairwise distances: Clustering Algorithms

• Different algorithms

– Bottom-up or top-down

– Popular hierarchical bottom-up clustering method

– The distance between a cluster and the remaining clusters can be measured using the minimum, maximum, or average distance.
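
The bottom-up procedure and the three cluster-to-cluster distances can be sketched in a few lines. This is an illustrative, unoptimized plain-Python version under stated assumptions: the names `agglomerate` and `LINKAGE`, the Euclidean point distance, and the toy points are all invented for this example, and "single", "complete", and "average" are the usual names for the minimum, maximum, and average rules.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cluster-to-cluster distance as a summary of all pairwise point distances.
LINKAGE = {
    "single": min,                            # minimum distance
    "complete": max,                          # maximum distance
    "average": lambda ds: sum(ds) / len(ds),  # average distance
}

def agglomerate(points, k, linkage="average"):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until only k clusters remain."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [euclidean(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy points (made up): three near the origin, two near (5, 5).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (0.5, 0.5)]
result = {name: agglomerate(points, 2, name) for name in LINKAGE}
```

Stopping at k clusters corresponds to cutting the dendrogram at one level. On well-separated toy data like this the three linkage rules agree; on elongated or noisy clusters they can give quite different trees.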

(50)

Comparison of Linkage Methods

Single Average Complete

(51)

Partitioning methods

• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within-cluster sums of squares.

Ideally, dissimilarity between clusters will be maximized while it is minimized within clusters.
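
One common partitioning method that fits this description is k-means. The slide does not name a specific algorithm here, so the following is a sketch of that one technique under stated assumptions: the function name `kmeans`, the random choice of initial centers, and the toy points are illustrative.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Alternate between assigning points to the nearest center and moving
    each center to the mean of its points, until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k distinct starting centers
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:             # criterion met: no reallocation
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                      # keep the old center if empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Toy data (made up): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

Each reassignment step can only lower the within-cluster sum of squares, which is why the loop terminates. The result can depend on the starting centers, so in practice several random starts are usually compared.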

(52)


K = 2

(53)


K = 4

(54)

Cluster Analysis

dist() hclust() heatmap()

(55)
(56)

Classification and Prediction

[Diagram: a learning set (data with known classes) is passed to a classification technique, which yields a classification rule (discrimination); the rule is applied to data with unknown classes to give a class assignment (prediction)]
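
The flow from learning set to classification rule to class assignment can be illustrated with a very simple rule. The sketch below uses a nearest-centroid classifier, which is one possible classification technique chosen here for brevity and not one named in the slides; the function names, class labels, and data are made up.

```python
def fit_centroids(samples, labels):
    """Learn the rule: one mean vector (centroid) per known class."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}

def predict(centroids, x):
    """Apply the rule: assign x to the class with the nearest centroid."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

# Learning set (made up): feature vectors with known classes.
learning_x = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
learning_y = ["good", "good", "good", "poor", "poor", "poor"]
rule = fit_centroids(learning_x, learning_y)

# Data with unknown classes receive a class assignment.
pred_a = predict(rule, (2, 2))
pred_b = predict(rule, (7, 9))
```

`fit_centroids` plays the role of discrimination (building the rule from labelled data) and `predict` plays the role of prediction (assigning a class to a new observation).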

(57)

Classification in Bioinformatics

• Computational diagnostic: early cancer detection

• Tumor biomarker discovery

• Protein folding prediction

• Protein-protein binding sites prediction

• Gene function prediction

(58)

[Diagram: objects (arrays) are represented by feature vectors (gene expression) with predefined classes (clinical outcome): bad prognosis, recurrence < 5 yrs; good prognosis, recurrence > 5 yrs. The learning set yields a classification rule that is applied to a new array (e.g. good prognosis, metastasis > 5).]

Reference: L van’t Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

(59)


(60)

Clinical Problem: Upfront Therapy - Chemo or Surgery?

[Diagram: two treatment paths. Optimal cytoreduction, then chemotherapy, then remission; or chemo, then interval cytoreduction, then remission.]

(61)

Suboptimal Debulking

• Standard primary treatment for ovarian cancer is surgery followed by chemotherapy

• Rationale for surgery as the primary treatment is to remove as much tumor as possible

• Leaving tumor nodules larger than 1 cm (defined as suboptimal debulking) is associated with reduced chemosensitivity and poor survival

• If the tumor cannot be effectively removed by surgery, the patient is first treated with chemotherapy to partially shrink the tumor and then by surgery

• Surgeons cannot predict whether surgery will be effective or not

• The effect in individual patients is highly divergent depending on the biology of their disease

• Biomarkers may help physicians decide which patients should undergo surgery and which should be treated with

(62)

Clinical Decision-making Based on Serum Biomarkers

[Diagram: the two treatment paths (optimal cytoreduction, chemotherapy, remission; chemo, interval cytoreduction, remission), with a serum assay (low levels vs. high levels) selecting between them]

(63)

Overview: Horizontal vs Vertical Integration

• Merge multiple data matrices 

(64)

Why Horizontal Integration?

• Tremendous amount of public data

• An individual study usually has a moderate sample size

– Low statistical power

– Inconsistent conclusions

• Theoretically, the sample size required increases exponentially with the number of variables

• Combining multiple studies (meta-analysis) is a practical

(65)

Identifying Markers with Horizontal Integration

Three gene expression datasets:

– TCGA gene expression data: n = 522 samples, m = 13238 genes, with 401/121 suboptimal/optimal debulking

– GSE26712 gene expression data: n = 185 samples, m = 13238 genes, with 95/90 suboptimal/optimal debulking

– GSE9891 (Tothill) gene expression data: n = 248 samples and m = 22635 genes, with 164/84

(66)

Build networks and identify genes with the first two datasets

• Normalize the data

• Screen the differentially expressed (DE) genes with p < 0.05

• Find common DE genes

• Construct common and differentiated networks

• Select markers with both Diff net and genes

Validate the identified markers with the third (Tothill) data and PPI database

• External
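
The screening and "find common DE genes" steps amount to a threshold followed by a set intersection. The sketch below assumes per-gene p-values have already been computed for each study; the gene names, p-values, and the helper `de_genes` are hypothetical and are not taken from the datasets described in the slides.

```python
def de_genes(pvalues, alpha=0.05):
    """Genes whose p-value falls below the screening threshold."""
    return {gene for gene, p in pvalues.items() if p < alpha}

# Hypothetical per-gene p-values from two studies (made-up names and numbers).
study_1 = {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.03, "GENE_D": 0.04}
study_2 = {"GENE_A": 0.010, "GENE_B": 0.01, "GENE_C": 0.04, "GENE_D": 0.30}

# Common DE genes: significant in both studies.
common = de_genes(study_1) & de_genes(study_2)
```

Only genes passing the threshold in both studies survive, which is what makes the candidate list more reproducible than a list from either study alone.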

(67)

Why Networks?

• Genes may interact with each other and function together

• Network structures may vary under different clinical conditions

• Reproducibility of subnetworks is higher than that of individual genes

• A permutation test is usually used for detecting the difference in correlation structure
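
A permutation test for a difference in correlation between two clinical groups can be sketched as follows. This is an illustrative version for a single gene pair, with assumed names (`permutation_test`, `corr_of`) and made-up data; the slides do not specify the test statistic, so the absolute difference in Pearson correlation is an assumption.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def corr_of(pairs):
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

def permutation_test(group_1, group_2, n_perm=2000, seed=0):
    """Each group is a list of (gene 1, gene 2) expression pairs, one per
    sample. Group labels are shuffled to build the null distribution of
    the absolute difference in correlation."""
    observed = abs(corr_of(group_1) - corr_of(group_2))
    pooled = list(group_1) + list(group_2)
    n_1 = len(group_1)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(corr_of(pooled[:n_1]) - corr_of(pooled[n_1:]))
        if stat >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Made-up data: the two genes rise together in group 1 and move in
# opposite directions in group 2.
group_1 = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0)]
group_2 = [(1, 5.9), (2, 5.2), (3, 3.9), (4, 3.1), (5, 2.0), (6, 0.9)]
observed, p_value = permutation_test(group_1, group_2)
```

Shuffling the group labels breaks any real link between clinical condition and correlation, so a small `p_value` indicates that the observed difference is unlikely under that null. Extending this to whole networks would require a statistic defined over many gene pairs and a correction for multiple testing.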

(68)

Left: Suboptimal Debulking Associated Cluster (Network module) with B from Our Data. Right: PPI Network from National Database

(69)
(70)

Some Observations

• Potential markers are highly reproducible: 21/22 genes come up in the independent Tothill data. In contrast, FABP4 and ADH1B, identified with TCGA and Tothill data by the MD Anderson group, are not significant in the third dataset

• Most genes are quite significant, with very small P-values

• COL11A1 achieves the smallest P-value (< 3e-9) using the external validation data. Further studies are ongoing by my collaborator (Dr. Sandra

(71)

• Liu Z, Beach JA, Agadjanian H, Jia D, Aspuria PJ, Karlan BY, and Orsulic S (2015), Suboptimal cytoreduction in ovarian carcinoma is associated with molecular pathways characteristic of increased stromal activation, Gynecologic Oncology, S0090-8258(15)30117-7.

(72)


(73)

Five websites that all biologists should know

• NCBI (The National Center for Biotechnology Information)

http://www.ncbi.nlm.nih.gov/

• EBI (The European Bioinformatics Institute)

http://www.ebi.ac.uk/

• The Canadian Bioinformatics Resource

http://www.cbr.nrc.ca/

• SwissProt/ExPASy (Swiss Bioinformatics Resource)

http://expasy.cbr.nrc.ca/sprot/

• PDB (The Protein Databank)

(74)

A few more resources to be aware of

• Human Genome Working Draft

http://genome.ucsc.edu/

• TIGR (The Institute for Genomic Research)

http://www.tigr.org/

• Celera

http://www.celera.com/

• (Model) Organism specific information:

Yeast: http://genome-www.stanford.edu/Saccharomyces/

Arabidopsis: http://www.tair.org/

Mouse: http://www.jax.org/

Fruitfly: http://www.fruitfly.org/

Nematode: http://www.wormbase.org/

• Nucleic Acids Research Database Issue

(75)

Challenges

• Confusing choice of tools

• Developed independently

• Written by and for nerds

(76)

Outline

• What is R

• What is Bioconductor

• Getting and using Bioconductor

• Overview of Bioconductor packages

• Demo

(77)

R

• R is a language and environment for statistical computing and graphics.

• What sorts of things is R good at?

– there are very many statistical algorithms

– there are very many machine learning algorithms

– visualization

(78)

Goals of Bioconductor

• Provide access to powerful statistical and graphical methods for the analysis of genomic data.

• Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data.

• Allow the rapid development of extensible, interoperable, and scalable software.

• Promote high-quality documentation and reproducible research.

• Provide training in computational and statistical

(79)
(80)

Installation

1. Main R software: download from CRAN (cran.r-project.org), use latest release, now 1.8.0.

2. Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.3.

Available for Linux/Unix, Windows, and Mac OS.

(81)

Documentation and help

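• R manuals and tutorials: available from the R website or on-line in an R session.

• R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.

> help.start()       # open the HTML help system
> help(lm)           # help page for lm
> ?hclust            # shorthand for help(hclust)
> apropos("mean")    # list objects whose names contain "mean"
> example(hclust)    # run the examples from the hclust help page
> demo()             # list available demos
> demo(image)        # run the image demo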

(82)

R cluster analysis packages

cclust: convex clustering methods.

class: self-organizing maps (SOM).

cluster:

– AGglomerative NESting (agnes),

– Clustering LARge Applications (clara),

– DIvisive ANAlysis (diana),

– Fuzzy Analysis (fanny),

– MONothetic Analysis (mona),

– Partitioning Around Medoids (pam).

e1071:

– fuzzy C-means clustering (cmeans),

– bagged clustering (bclust).

flexmix: flexible mixture modeling.

fpc: fixed point clusters, clusterwise regression and discriminant plots.

GeneSOM: self-organizing maps.

mclust, mclust98: model-based cluster analysis.

mva:

– hierarchical clustering (hclust),

– k-means (kmeans).

(83)

Hierarchical clustering

hclust function from

(84)

Heatmaps

(85)

References

• R: www.r-project.org, cran.r-project.org

– software (CRAN);

– documentation;

– newsletter: R News;

– mailing list.

• Bioconductor: www.bioconductor.org

– software, data, and documentation (vignettes);

– training materials from short courses;

(86)

Conclusions

• Visualization

• PCA and MDS visualization

• Clustering

• Classification

• Bioinformatics resources

• R and Bioconductor
