BIN 504
Probabilistic and Statistical Modeling for Bioinformatics
Fall 2010-2011
Tolga Can (Office: BMB B-109)
Goals of the course
• The primary objective of this course is to expose students to the fundamentals of the
statistical and probabilistic techniques used for bioinformatics problems.
• Learn and use R effectively
Course outline
• Probability and statistics review (2 weeks)
• Introduction to R (1 week)
• Statistical quality control of high-throughput biological data (1 week)
• Statistical testing and significance (1 week)
• Statistical models, inference (1 week)
• Statistical techniques for better experimental design (1 week)
• Statistical resampling (1 week)
• Clustering techniques for large scale biological data (1 week)
• Classification techniques for large scale biological data (1 week)
• Analysis and visualization of high dimensional data (1 week)
• Statistical network analysis (1 week)
• Statistical challenges in genome-wide association studies (1 week)
Grading
• Midterm exam - 30%
• Final exam - 30%
• Assignments - 40%
Course Web Site
– Syllabus
– Lecture slides and reading materials
– Assignments
– Announcements
Textbook
• Statistical Bioinformatics for Biomedical and Life Science Researchers by Jae K. Lee
• Lecture slides are adapted from Prof. Jae K. Lee’s Statistical Bioinformatics courses:
– http://geossdev.med.virginia.edu/research/teaching/teaching.html
Characteristics of biological data
• Variable and noisy, so a probabilistic understanding is needed
– Biological variability: treatment, time, individual variability, etc.
– Experimental variability: experimental condition, time, sample preparation, experimenter variability
• Need quality control and prescreening
High-throughput Biological Data
• Information-rich: the data may contain more information than the researchers’ own investigation goals require
• Therefore most data are made public so that other researchers can discover new relationships within the data
Challenges in Biological Data Analysis
• Multiple-comparisons issue
• High-dimensional biological data
• Small n and large p problem
• Computational limitations
• Noisy high-throughput data
• Integration of multiple heterogeneous data sources
Challenges: Multiple-comparisons issue
• A high number of false positives compared to a small number of true positives
• Even a 1% false positive rate for identifying differentially expressed genes will
yield about 200 false positive genes among 20K genes. If only 20 genes are truly
involved in the process under that condition, the difficulty is to find the
correct 20 among the 220 flagged genes.
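The arithmetic on this slide can be reproduced with a short Python sketch. The function name `multiple_testing_summary` is hypothetical, and the sketch follows the slide's simplifications: the false positive rate is applied to all 20K tested genes, and every truly involved gene is assumed to be detected.

```python
def multiple_testing_summary(n_genes, n_true, false_positive_rate):
    """Reproduce the slide's back-of-the-envelope multiple-comparisons count."""
    # Assumption (as on the slide): the rate is applied to all tested genes.
    false_positives = n_genes * false_positive_rate
    # Assumption: all truly involved genes are detected.
    flagged = false_positives + n_true
    # Fraction of flagged genes that are truly involved.
    precision = n_true / flagged
    return false_positives, flagged, precision


false_positives, flagged, precision = multiple_testing_summary(20_000, 20, 0.01)
print(f"expected false positives: {false_positives:.0f}")
print(f"flagged genes in total:   {flagged:.0f}")
print(f"fraction that are real:   {precision:.3f}")
```

Under these assumptions fewer than one in ten flagged genes is a true positive, which is why multiple-comparison corrections are needed.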
Challenges: High-dimensional Data
• Curse of dimensionality: as the dimensionality of the data increases, all data points become
nearly equidistant, so it becomes impossible to discriminate between two different classes of data
• Effective dimensionality reduction techniques are needed
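The "everything becomes equidistant" effect can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: points are drawn uniformly from the unit hypercube, distances are Euclidean, and the hypothetical helper `distance_contrast` reports how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between farthest and nearest neighbour of a random query."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    # A value near 0 means all points are almost equally far from the query.
    return (max(dists) - min(dists)) / min(dists)


low_dim = distance_contrast(2)
high_dim = distance_contrast(1000)
print(f"contrast in    2 dimensions: {low_dim:.2f}")
print(f"contrast in 1000 dimensions: {high_dim:.2f}")
```

The contrast shrinks sharply as the dimension grows, so nearest and farthest neighbours become hard to tell apart.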
Challenges: small n large p problem
• Number of samples: n
• Number of parameters to predict: p
• In general, statistical techniques require n>>p
• However, in biological data, often p>n
• Special statistical analysis tools are needed that can provide high specificity and high sensitivity even when p>n
Challenges: noisy high throughput data
• Various sources of variation and error in biological experiments
• Very difficult to control every parameter during an experiment
• Reducing and decomposing various sources of error is necessary
• Quality control of initial data will help
Challenges: Integration of multiple heterogeneous data sources
• Biological data come in many different forms.
• Data may provide redundant information
• Redundancy is beneficial for reducing high false-positive rates
• Techniques that can integrate heterogeneous data are needed
Probability and Statistics Review
• Probability and conditional probability
• Random variables
• Distributions
– Discrete and continuous distributions
• Joint and marginal distribution
• Multivariate distributions
• Sampling distributions
Probability
• Studies uncertainty
• A random experiment
– An experiment/observation which does not have a certain outcome before it is conducted
• Examples
– Tossing a coin
– Observing the life time of a light bulb
– The grade you are going to get from this course
– Others?
Sample space
• The set of all possible outcomes of a random experiment is called the sample space
– Tossing a coin:
• Sample space = {H, T}
– Tossing two coins:
• Sample space = {HH, HT, TH, TT}
– Lifetime of a light bulb:
• Sample space = [0,+∞)
Event
• Any collection of possible outcomes of an experiment
– Any subset of the sample space
• Examples:
– Experiment: tossing two coins. Event: obtaining exactly one head. {HT, TH} ⊂ {HH, HT, TH, TT}
– Experiment: lifetime of light bulb. Event: light bulb does not last more than a month.
Event algebra
• Union of two events: same as set union
– A ∪ B = {x : x ∈ A or x ∈ B}
• Intersection of two events: same as set intersection
– A ∩ B = {x : x ∈ A and x ∈ B}
• Complementation: same as in sets
– A^c = {x : x ∉ A}
• Disjoint sets: If A and B have no outcomes in common, then A ∩ B = ∅ and the events are called disjoint (mutually exclusive)
Probability
• Assignment of a real number to an event
– The relative frequency of occurrence of an event in a large number of experiments
• P(A)
• Axioms:
– P(A) ≥ 0
– P(S) = 1, where S is the sample space
– If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B)
Example
• Experiment:
– Tossing two coins
– A = {obtaining exactly one head}
– P(A) = ?
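One way to answer this is to enumerate the sample space and count. The sketch below assumes both coins are fair, so the four outcomes are equally likely (the slide does not state this explicitly); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: HH, HT, TH, TT.
sample_space = ["".join(toss) for toss in product("HT", repeat=2)]

# Event A: exactly one head.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]

# Assumption: fair coins, so every outcome has probability 1/4.
p_a = Fraction(len(event_a), len(sample_space))
print(sample_space, event_a, p_a)
```

With equally likely outcomes, P(A) is the number of outcomes in A divided by the size of the sample space, here 2/4 = 1/2.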
Conditional Probability
• Updating of the sample space based on new information
• Consider two events A and B. Suppose that the event B has occurred. This information will
change the probability of event A.
• P(A|B) denotes the conditional probability of event A given that B has occurred.
Conditional probability
• If A and B are events in S and P(B) > 0, then the conditional probability of A given B is defined as:
– P(A|B) = P(A ∩ B)/P(B)
• Example: tossing a fair die.
– A = {the number on the die is even}
– B = {the number on the die < 4}
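The die example can be worked by counting outcomes. The helpers `prob` and `cond_prob` below are hypothetical names for this sketch, which relies on the die being fair so that every face has probability 1/6.

```python
from fractions import Fraction


def prob(event, space):
    """P(event) when all outcomes in `space` are equally likely."""
    return Fraction(len(event & space), len(space))


def cond_prob(a, b, space):
    """P(a | b) = P(a and b) / P(b); requires P(b) > 0."""
    return prob(a & b, space) / prob(b, space)


die = set(range(1, 7))
A = {x for x in die if x % 2 == 0}  # even number: {2, 4, 6}
B = {x for x in die if x < 4}       # less than 4: {1, 2, 3}

p_a_given_b = cond_prob(A, B, die)
print(p_a_given_b)
```

Knowing that B occurred shrinks the sample space to {1, 2, 3}, of which only 2 is even, so P(A|B) = 1/3 compared with P(A) = 1/2.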
Independence
• If P(A|B) = P(A), we say that event A is independent of event B
• Consequence:
– if two events A and B are independent, then P(A ∩ B) = P(A)P(B)
• P(B|A) = P(B) also holds in this case.
• A and B are then called mutually independent
Independence
• Example: tossing a fair die.
– A = {the number on the die is even}
– B = {the number on the die > 2}
– P(A|B) = ?
– P(B|A) = ?
– P(A) = ?
– P(B) = ?
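The four quantities can be obtained by enumeration, and the product rule then confirms independence. This is a minimal sketch for a fair die; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction

die = set(range(1, 7))
A = {x for x in die if x % 2 == 0}  # even number: {2, 4, 6}
B = {x for x in die if x > 2}       # greater than 2: {3, 4, 5, 6}


def prob(event):
    """P(event) for a fair die (equally likely faces)."""
    return Fraction(len(event), len(die))


p_a, p_b, p_ab = prob(A), prob(B), prob(A & B)
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

# Independence check: P(A and B) == P(A) * P(B).
independent = p_ab == p_a * p_b
print(p_a, p_b, p_a_given_b, p_b_given_a, independent)
```

Here P(A|B) = P(A) = 1/2 and P(B|A) = P(B) = 2/3, so learning that B occurred does not change the probability of A.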
Bayes’s Theorem
• Using the conditional probability formula we may write:
– P(A|B) = P(A ∩ B)/P(B)
– P(B|A) = P(A ∩ B)/P(A)
– P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
– P(B|A) = P(A|B)P(B)/P(A)
• This is known as Bayes’s theorem
Bayes’s rule
• Let B1, B2, B3, …, Bk be a partition of the sample space: the Bi are mutually
disjoint and their union is the whole sample space. Let A be any event. Then for each i = 1, 2, …, k
$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{P(A)} = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}$$
Example
• A novel diagnosis array is 95% effective in detecting a certain disease when it is present.
The test also has a 1% false positive rate. If
0.5% of the population has the disease, what is the probability a person with a positive test
result actually has the disease?
Solution
• A = {a person’s test result is positive}
• B = {a person has the disease}
• P(B) = 0.005, P(A|B) = 0.95, P(A|B^c) = 0.01
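$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)}$$
$$= \frac{0.95 \times 0.005}{0.95 \times 0.005 + 0.01 \times 0.995} = \frac{0.00475}{0.0147} \approx 0.323$$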
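The same calculation can be checked numerically. The function name `posterior` and its parameter names are hypothetical; the inputs are the three probabilities given in the solution.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes's rule with a two-set partition."""
    true_pos = sensitivity * prior                   # P(A|B) P(B)
    false_pos = false_positive_rate * (1.0 - prior)  # P(A|B^c) P(B^c)
    return true_pos / (true_pos + false_pos)


p_disease_given_positive = posterior(prior=0.005, sensitivity=0.95,
                                     false_positive_rate=0.01)
print(f"{p_disease_given_positive:.3f}")
```

Even with a 95% detection rate, only about a third of positive results correspond to real cases, because the disease is rare and false positives from the healthy majority dominate.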
Random Variables
• A random variable (r.v.) associates a unique numerical value with each outcome in the
sample space. It is a real-valued function from a sample space S into real numbers.
• Similar to events it is denoted by an uppercase letter (e.g., X or Y) and a particular value
taken by a r.v. is denoted by the corresponding lowercase letter (e.g., x or y).
Examples
• Toss three coins. X = number of heads
• Pick a student from the Informatics Institute. X = age of the student
• Observe the lifetime of a light bulb. X = lifetime in minutes
• X may be discrete or continuous
Discrete random variables
• If the set of values a r.v. can take is finite or countably infinite, then the r.v. is discrete
• Suppose that we can calculate P(X = x) for every value of x. The collection of these
probabilities can be viewed as a function of x.
• The probability mass function (p.m.f.) of a discrete r.v. is given by f(x) = P(X = x) for all x
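As an illustration, the p.m.f. of X = number of heads in three coin tosses (one of the examples above) can be tabulated by enumeration. The sketch assumes fair coins, so each of the eight outcomes has probability 1/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give each value of X = number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# p.m.f.: f(x) = P(X = x) for each value x that X can take.
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

The probabilities are non-negative and sum to 1, as any p.m.f. must.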
Continuous random variables
• A r.v. X is continuous if it can take any value from one or more intervals of real numbers.
• We cannot use a p.m.f. because P(X = x) = 0 for every x, since there are uncountably many possible outcomes.
• Instead we use a probability density function (p.d.f.), f(x), such that areas under the curve represent probabilities
Continuous random variable
• The p.d.f. f_X(x) of a continuous r.v. X is the function that satisfies
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for all } x$$
where F_X(x) = P(X ≤ x) is the cumulative distribution function (c.d.f.) of the r.v. X
Expected Value
• Expected value or the mean of a discrete r.v.
$$E(X) = \sum_{x} x\,f(x) = x_1 f(x_1) + x_2 f(x_2) + \cdots$$
• Expected value of a continuous r.v.
$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
Expected Value
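• Expected value of a function of X
$$E[g(X)] = \begin{cases} \sum_{x} g(x)\,f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x)\,f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$
• For example, if g(X) = X^2
$$E[X^2] = \sum_{x} x^2 f(x) \quad \text{if } X \text{ is discrete}$$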
Linearity of Expected Values
• For any two random variables X1 and X2 and any constants c1, c2
$$E(c_1 X_1 + c_2 X_2) = c_1 E(X_1) + c_2 E(X_2)$$
Expected value of a product
• In general E(XY) is not equal to E(X)E(Y)
$$E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\,y\,j(x, y)\,dx\,dy$$
• where j(x,y) is the joint distribution, which is NOT equal to f(x)g(y) if X and Y are not independent
Variance and Standard Deviation
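• Variance of a r.v. is denoted by Var(X) or σ²
$$\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2] = E(X^2) - \mu^2, \qquad \sigma = \sqrt{\mathrm{Var}(X)}$$
where μ = E(X) and σ is the standard deviation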
Example
• Experiment: two fair dice are tossed
– X = sum of the two dice
• What is the expected value of the r.v. X?
• What is the variance?
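Both quantities follow directly from the definitions by enumerating the 36 equally likely outcomes. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# X = sum of two fair dice; each of the 36 ordered pairs has probability 1/36.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
p = Fraction(1, len(sums))

# E(X) = sum of x * P(outcome) over all outcomes.
mean = sum(x * p for x in sums)

# Var(X) = E[(X - mu)^2].
variance = sum((x - mean) ** 2 * p for x in sums)

print(mean, variance)
```

The mean is 7 and the variance is 35/6, about 5.83. Since the dice are independent, this is twice the variance of a single die (35/12).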
Example
• Let X be a continuous r.v. with p.d.f. f(x) = 1/2 for 0 < x < 2.
• What is the expected value of X?
• What is the variance of X?
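A worked derivation, applying the definitions of expected value and variance to this density (taking f(x) = 0 outside the interval 0 < x < 2):

```latex
\begin{align*}
E(X)   &= \int_0^2 x \cdot \tfrac{1}{2}\,dx = \left[\tfrac{x^2}{4}\right]_0^2 = 1 \\
E(X^2) &= \int_0^2 x^2 \cdot \tfrac{1}{2}\,dx = \left[\tfrac{x^3}{6}\right]_0^2 = \tfrac{4}{3} \\
\mathrm{Var}(X) &= E(X^2) - [E(X)]^2 = \tfrac{4}{3} - 1 = \tfrac{1}{3}
\end{align*}
```

So the mean sits at the midpoint of the interval, as expected for a uniform density.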
Covariance
• A measure of strength of a relationship between two random variables
• E.g., X = height of a person p, Y = weight of the same person
• X and Y are paired random variables
• Is there a relationship between them?
Correlation
• The correlation (or correlation coefficient) is simply the covariance standardized to the
range of [-1,1]
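For reference, the standard definitions can be written as follows. The symbols μ_X, μ_Y, σ_X, σ_Y and ρ are notation assumed here for the means, standard deviations and correlation; they do not appear on the slides above.

```latex
\begin{align*}
\mathrm{Cov}(X, Y) &= E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)\,E(Y) \\
\rho(X, Y) &= \frac{\mathrm{Cov}(X, Y)}{\sigma_X\,\sigma_Y}, \qquad -1 \le \rho(X, Y) \le 1
\end{align*}
```

Dividing by the two standard deviations removes the units of X and Y, which is what makes correlations comparable across different pairs of variables.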