BIN 504 Probabilistic and Statistical Modeling for Bioinformatics

(1)

BIN 504

Probabilistic and Statistical Modeling for Bioinformatics

Fall 2010-2011

Tolga Can (Office: BMB B-109) e-mail: [email protected]

(2)

Goals of the course

• The primary objective of this course is to expose students to fundamentals of statistical and probabilistic techniques used for bioinformatics problems.

• Learn and use R effectively

(3)

Course outline

• Probability and statistics review (2 weeks)

• Introduction to R (1 week)

• Statistical quality control of high-throughput biological data (1 week)

• Statistical testing and significance (1 week)

• Statistical models, inference (1 week)

• Statistical techniques for better experimental design (1 week)

• Statistical resampling (1 week)

(4)

Course outline

• Clustering techniques for large scale biological data (1 week)

• Classification techniques for large scale biological data (1 week)

• Analysis and visualization of high dimensional data (1 week)

• Statistical network analysis (1 week)

• Statistical challenges in genome-wide association studies (1 week)

(5)

Grading

• Midterm exam - 30%

• Final exam - 30%

• Assignments - 40%

(6)

Course Web Site

– Syllabus

– Lecture slides and reading materials

– Assignments

– Announcements

(7)

Textbook

• Statistical Bioinformatics for Biomedical and Life Science Researchers by Jae K. Lee

• Lecture slides are adapted from Prof. Jae K. Lee’s Statistical Bioinformatics courses:

– http://geossdev.med.virginia.edu/research/teaching/teaching.html

(8)

Characteristics of biological data

• Variable and noisy ⇒ requires probabilistic understanding

– Biological variability: treatment, time, individual variability, etc.

– Experimental variability: experimental condition, time, sample preparation, experimenter variability

• Need quality control and prescreening

(9)

High-throughput Biological Data

• Information-rich; may contain more information than the researchers’ investigation goals require

• Therefore, most data are made public so that other researchers can discover new relationships within the data

(10)

Challenges in Biological Data Analysis

• Multiple-comparisons issue

• High-dimensional biological data

• Small n and large p problem

• Computational limitations

• Noisy high-throughput data

• Integration of multiple heterogeneous data sources

(11)

Challenges: Multiple-comparisons issue

• High number of false positives compared to small number of true positives

• Even a 1% false positive rate for identifying differentially expressed genes will identify 200 false positive genes among 20K genes. If in that condition only 20 genes are truly involved in the process, the difficulty is to find the correct 20 among 220 genes.
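The arithmetic behind this example can be checked with a short sketch. It uses the figures from the slide and assumes, as the slide's rounding does, that the 1% rate applies to all 20K genes and that all 20 truly involved genes are detected.

```python
from fractions import Fraction

n_genes = 20_000                          # genes tested (from the slide)
false_positive_rate = Fraction(1, 100)    # 1% false positive rate (from the slide)
n_true = 20                               # truly involved genes (from the slide)

# Assumption: the rate is applied to all 20K genes, matching the slide's rounding.
expected_false_positives = n_genes * false_positive_rate

# Assumption: every truly involved gene is flagged (perfect sensitivity).
n_flagged = expected_false_positives + n_true
fraction_true = n_true / n_flagged

print(expected_false_positives)   # 200
print(n_flagged)                  # 220
print(fraction_true)              # 1/11
```

Under these assumptions only about 9% of the flagged genes are real findings, which is why corrections for multiple comparisons matter.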

(12)

Challenges: High-dimensional Data

• Curse of dimensionality: as the dimensionality of the data increases, all data points become nearly equidistant ⇒ very hard to discriminate between two different classes of data

• Effective dimensionality reduction techniques are needed
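The equidistance effect can be illustrated with a small simulation. This is only a sketch under assumptions not in the slides: points are drawn uniformly from the unit hypercube, distance is Euclidean, and `relative_contrast` is a hypothetical helper that measures how much the farthest neighbour differs from the nearest one.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """Return (max - min) / min over distances from the first point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

# Compare low- and high-dimensional data (dimensions chosen for illustration).
contrast_by_dim = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrast_by_dim.items():
    print(dim, round(contrast, 2))
```

The contrast shrinks as the dimension grows: in high dimensions the nearest and farthest points are almost the same distance away.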

(13)

Challenges: small n large p problem

• Number of samples: n

• Number of parameters to predict: p

• In general, statistical techniques require n>>p

• However, in biological data, often p>n

• Special statistical analysis tools are needed that can provide high specificity and high sensitivity even when p>n

(14)

Challenges: noisy high throughput data

• Various sources of variation and error in biological experiments

• Very difficult to control every parameter during an experiment

• Reducing and decomposing various sources of error is necessary

• Quality control of initial data will help

(15)

Challenges: Integration of multiple heterogeneous data sources

• Biological data come in many different forms.

• Data may provide redundant information

• Redundancy is beneficial for reducing high false-positive rates

• Techniques that can integrate heterogeneous data are needed

(16)

Probability and Statistics Review

• Probability and conditional probability

• Random variables

• Distributions

– Discrete and continuous distributions

• Joint and marginal distribution

• Multivariate distributions

• Sampling distributions

(17)

Probability

• Studies uncertainty

• A random experiment

– An experiment/observation which does not have a certain outcome before it is conducted

• Examples

– Tossing a coin

– Observing the life time of a light bulb

– The grade you are going to get from this course

– Others?

(18)

Sample space

• The set of all possible outcomes of a random experiment is called the sample space

– Tossing a coin:

• Sample space = {H, T}

– Tossing two coins:

• Sample space = {HH, HT, TH, TT}

– Lifetime of a light bulb:

• Sample space = [0,+∞)

(19)

Event

• Any collection of possible outcomes of an experiment

– Any subset of the sample space

• Examples:

– Experiment: tossing two coins. Event: obtaining exactly one head. {HT, TH} ⊂ {HH, HT, TH, TT}

– Experiment: lifetime of light bulb. Event: light bulb does not last more than a month.

(20)

Event algebra

• Union of two events: same as set union

– A ∪ B = {x : x ∈ A or x ∈ B}

• Intersection of two events: same as set intersection

– A ∩ B = {x : x ∈ A and x ∈ B}

• Complementation: same as in sets

– A^c = {x : x ∉ A}

• Disjoint sets: If A and B have no outcomes in common (A ∩ B = ∅), they are called disjoint or mutually exclusive
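Because events are sets, these operations map directly onto a set type. A minimal sketch using the two-coin sample space follows; the events `B` and `C` are hypothetical and chosen only for illustration.

```python
sample_space = {"HH", "HT", "TH", "TT"}
A = {"HT", "TH"}   # exactly one head
B = {"HH", "HT"}   # first coin shows a head (hypothetical event)
C = {"TT"}         # no heads (hypothetical event)

union = A | B                     # outcomes in A or B
intersection = A & B              # outcomes in both A and B
complement_A = sample_space - A   # outcomes not in A
are_disjoint = A.isdisjoint(C)    # True when A and C share no outcomes

print(sorted(union))         # ['HH', 'HT', 'TH']
print(sorted(intersection))  # ['HT']
print(sorted(complement_A))  # ['HH', 'TT']
print(are_disjoint)          # True
```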

(21)

Probability

• Assignment of a real number to an event

– The relative frequency of occurrence of an event in a large number of experiments

• P(A)

• Axioms:

– P(A) ≥ 0

– P(S) = 1

– If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B)

(22)

Example

• Experiment:

– Tossing two coins

– A = {obtaining exactly one head}

– P(A) = ?
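One way to answer this is to enumerate the sample space. The sketch below assumes both coins are fair, so that all four outcomes are equally likely; the slide does not state this explicitly.

```python
from fractions import Fraction
from itertools import product

# Assumption: fair coins, so each of the four outcomes has probability 1/4.
sample_space = ["".join(toss) for toss in product("HT", repeat=2)]
A = [outcome for outcome in sample_space if outcome.count("H") == 1]

p_A = Fraction(len(A), len(sample_space))
print(sample_space)  # ['HH', 'HT', 'TH', 'TT']
print(p_A)           # 1/2
```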

(23)

Conditional Probability

• Updating of the sample space based on new information

• Consider two events A and B. Suppose that the event B has occurred. This information will

change the probability of event A.

• P(A|B) denotes the conditional probability of event A given that B has occurred.

(24)

Conditional probability

• If A and B are events in S and P(B) > 0, then the conditional probability of A given B is defined as:

– P(A|B) = P(A ∩ B)/P(B)

• Example: tossing a fair die.

– A = {the number on the die is even}

– B = {the number on the die is less than 4}
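The die example can be worked through by counting outcomes. This is a minimal sketch; `prob` is a hypothetical helper that relies on the six faces being equally likely.

```python
from fractions import Fraction

S = set(range(1, 7))               # faces of a fair die
A = {x for x in S if x % 2 == 0}   # the number is even
B = {x for x in S if x < 4}        # the number is less than 4

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)  # 1/3
```

Knowing that B occurred shrinks the sample space to {1, 2, 3}, of which only 2 is even.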

(25)

Independence

• If P(A|B) = P(A), we say that event A is independent of event B

• Consequence:

– if two events A and B are independent, then P(A ∩ B) = P(A)P(B)

• P(B|A)=P(B) also holds in this case.

• A and B are mutually independent

(26)

Independence

• Example: tossing a fair die.

– A = {the number on the die is even}

– B = {the number on the die > 2}

– P(A|B) = ?

– P(B|A) = ?

– P(A) = ?

– P(B) = ?
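The four quantities can be computed by counting outcomes, as in the sketch below. `prob` is a hypothetical helper that assumes equally likely faces.

```python
from fractions import Fraction

S = set(range(1, 7))               # faces of a fair die
A = {x for x in S if x % 2 == 0}   # even: {2, 4, 6}
B = {x for x in S if x > 2}        # greater than 2: {3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

p_A, p_B = prob(A), prob(B)
p_A_given_B = prob(A & B) / p_B
p_B_given_A = prob(A & B) / p_A
independent = prob(A & B) == p_A * p_B

print(p_A_given_B, p_A)   # 1/2 1/2
print(p_B_given_A, p_B)   # 2/3 2/3
print(independent)        # True
```

Since P(A|B) = P(A) and P(B|A) = P(B), the two events are independent.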

(27)

Bayes’s Theorem

• Using the conditional probability formula we may write:

– P(A|B) = P(A ∩ B)/P(B)

– P(B|A) = P(A ∩ B)/P(A)

– ⇒ P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

– ⇒ P(B|A) = P(A|B)P(B)/P(A)

• This is known as Bayes’s theorem

(28)

Bayes’s rule

• Let B₁, B₂, B₃, …, B_k be a partition of the sample space, so the B_i are mutually disjoint. Let A be any event. Then for each i = 1, 2, …, k

$$P(B_i \mid A) = \frac{P(A \cap B_i)}{P(A)} = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}$$

(29)

Example

• A novel diagnosis array is 95% effective in detecting a certain disease when it is present. The test also has a 1% false positive rate. If 0.5% of the population has the disease, what is the probability that a person with a positive test result actually has the disease?

(30)

Solution

• A = {a person’s test result is positive}

• B = {a person has the disease}

• P(B) = 0.005, P(A|B) = 0.95, P(A|B^c) = 0.01

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)} = \frac{(0.95)(0.005)}{(0.95)(0.005) + (0.01)(0.995)} = \frac{0.00475}{0.0147} \approx 0.323$$
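The numbers in this example can be checked with a few lines of code. This is a sketch that plugs the stated rates into Bayes's rule with the two-set partition {B, B^c}; the variable names are illustrative.

```python
p_disease = 0.005             # P(B): prevalence
p_pos_given_disease = 0.95    # P(A|B): detection rate when the disease is present
p_pos_given_healthy = 0.01    # P(A|B^c): false positive rate

# Total probability of a positive test, summing over the partition {B, B^c}.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes's rule: probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.323
```

Even with a 95% detection rate, fewer than one in three positive results corresponds to a real case, because the disease is rare.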

(31)

Random Variables

• A random variable (r.v.) associates a unique numerical value with each outcome in the sample space. It is a real-valued function from a sample space S into the real numbers.

• Similar to events, it is denoted by an uppercase letter (e.g., X or Y), and a particular value taken by a r.v. is denoted by the corresponding lowercase letter (e.g., x or y).

(32)

Examples

• Toss three coins. X = number of heads

• Pick a student from Informatics Institute. X = age of the student

• Observe lifetime of a light bulb. X = lifetime in minutes

• X may be discrete or continuous
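For the three-coin example, the probability of each value of X can be tabulated by enumerating the sample space. The sketch below assumes fair coins, so all eight outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption: fair coins, so each of the 8 outcomes has probability 1/8.
outcomes = list(product("HT", repeat=3))

# X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```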

(33)

Discrete random variables

• If the set of values a r.v. can take is finite or countably infinite, then the r.v. is discrete

• Suppose that we can calculate P(X = x) for every value of x. The collection of these probabilities can be viewed as a function of x. The probability mass function (p.m.f.) of a discrete r.v. is given by f_X(x) = P(X = x) for all x

(34)

Continuous random variables

• A r.v. X is continuous if it can take any value from one or more intervals of real numbers.

• We cannot use a p.m.f. because P(X = x) = 0 for every x, since there are uncountably many possible outcomes. Instead we use a probability density function (p.d.f.), f(x), such that areas under the curve represent probabilities

(35)

Continuous random variable

• The p.d.f. f_X(x) of a continuous r.v. X is the function that satisfies

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for all } x$$

where F_X(x) is the cumulative distribution function (c.d.f.) of the r.v. X, defined by

$$F_X(x) = P(X \le x)$$

(36)

Expected Value

• Expected value or the mean of a discrete r.v.

$$E(X) = \sum_{x} x f(x) = x_1 f(x_1) + x_2 f(x_2) + \cdots$$

• Expected value of a continuous r.v.

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

(37)

Expected Value

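• Expected value of a function of X

$$E[g(X)] = \begin{cases} \sum_{x} g(x) f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x) f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$

• For example if g(X) = X²

$$E[X^2] = \begin{cases} \sum_{x} x^2 f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} x^2 f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$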

(40)

Variance and Standard Deviation

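• Variance of a r.v. is denoted by Var(X) or σ²

$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2], \quad \text{where } \mu = E(X)$$

• The standard deviation is the square root of the variance

$$\sigma = \sqrt{\mathrm{Var}(X)}$$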

(41)

Example

• Experiment: two fair dice are tossed

• What is the expected value of the r.v. X?

X = sum of two dice

• What is the variance?
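Both questions can be answered by enumerating the 36 equally likely rolls and applying the definitions of expected value and variance for a discrete r.v. A minimal sketch:

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # all 36 outcomes of two fair dice
p = Fraction(1, len(rolls))                    # each outcome has probability 1/36

# E(X) = sum over outcomes of x * probability, with X = sum of the two dice.
mean = sum((a + b) * p for a, b in rolls)

# Var(X) = E[(X - E(X))^2]
variance = sum(((a + b) - mean) ** 2 * p for a, b in rolls)

print(mean, variance)  # 7 35/6
```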

(42)

Example

• Let X be a continuous r.v. with p.d.f. f(x) = ½ for 0 < x < 2 (and 0 elsewhere).

• What is the expected value of X?

• What is the variance of X?
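One way to work this example is to apply the definitions directly, integrating over the interval (0, 2) where the density is nonzero. The derivation below is a sketch of that calculation.

```latex
\begin{align*}
E(X) &= \int_{0}^{2} x \cdot \tfrac{1}{2}\,dx
      = \left[ \tfrac{x^{2}}{4} \right]_{0}^{2} = 1 \\
\mathrm{Var}(X) &= E\!\left[(X - 1)^{2}\right]
      = \int_{0}^{2} (x - 1)^{2} \cdot \tfrac{1}{2}\,dx
      = \left[ \tfrac{(x - 1)^{3}}{6} \right]_{0}^{2}
      = \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{3}
\end{align*}
```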

(43)

Covariance

• A measure of strength of a relationship between two random variables

• E.g., X = height of a person p, Y = weight of the same person

• X and Y are paired random variables

• Is there a relationship between them?
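A small numerical illustration may help here. The slide does not give a formula, so the sketch below assumes the usual definition Cov(X, Y) = E[(X − E(X))(Y − E(Y))] and estimates it with the sample covariance (dividing by n − 1). The height and weight values are made up for illustration only.

```python
# Hypothetical paired measurements, invented for this example.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 60, 66, 72, 77]

n = len(heights_cm)
mean_h = sum(heights_cm) / n
mean_w = sum(weights_kg) / n

# Sample covariance: average product of paired deviations (n - 1 denominator).
sample_cov = sum((h - mean_h) * (w - mean_w)
                 for h, w in zip(heights_cm, weights_kg)) / (n - 1)
print(sample_cov)  # 70.0
```

A positive value indicates that taller people in this toy sample tend to be heavier; the magnitude depends on the units, which is what motivates a standardized measure.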

(44)

Correlation

• The correlation (or correlation coefficient) is simply the covariance standardized to the range [−1, 1]


Linearity of Expected Values

$$E(c_1 X_1 + c_2 X_2) = c_1 E(X_1) + c_2 E(X_2)$$

Expected value of a product

$$E(XY) = \iint x\,y\,f(x, y)\,dx\,dy$$

where f(x, y) is the joint p.d.f. of X and Y


Correlation

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$