
BIN 504

Probabilistic and Statistical Modeling for Bioinformatics

Fall 2010-2011

Tolga Can (Office: BMB B-109) e-mail: [email protected]

(2)

Goals of the course

• The primary objective of this course is to expose students to the fundamentals of statistical and probabilistic techniques used for bioinformatics problems.

• Learn and use R effectively

(3)

Course outline

• Probability and statistics review (2 weeks)

• Introduction to R (1 week)

• Statistical quality control of high-throughput biological data (1 week)

• Statistical testing and significance (1 week)

• Statistical models, inference (1 week)

• Statistical techniques for better experimental design (1 week)

• Statistical resampling (1 week)

(4)

Course outline

• Clustering techniques for large scale biological data (1 week)

• Classification techniques for large scale biological data (1 week)

• Analysis and visualization of high dimensional data (1 week)

• Statistical network analysis (1 week)

• Statistical challenges in genome-wide association studies (1 week)

(5)

Grading

• Midterm exam - 30%

• Final exam - 30%

• Assignments - 40%

(6)

Course Web Site

– Syllabus

– Lecture slides and reading materials

– Assignments

– Announcements

(7)

Textbook

• Statistical Bioinformatics for Biomedical and Life Science Researchers by Jae K. Lee

• Lecture slides are adapted from Prof. Jae K. Lee’s Statistical Bioinformatics courses:

– http://geossdev.med.virginia.edu/research/teaching/teaching.html

(8)

Characteristics of biological data

• Variable and noisy, so a probabilistic understanding is needed

– Biological variability: treatment, time, individual variability, etc.

– Experimental variability: experimental condition, time, sample preparation, experimenter variability

• Need quality control and prescreening

(9)

High-throughput Biological Data

• Information rich and may contain more information than the researchers’ investigation goals

• Therefore most data is made public for other researchers to discover new relationships within data

(10)

Challenges in Biological Data Analysis

• Multiple-comparisons issue

• High-dimensional biological data

• Small n and large p problem

• Computational limitations

• Noisy high-throughput data

• Integration of multiple heterogeneous data sources

(11)

Challenges: Multiple-comparisons issue

• High number of false positives compared to small number of true positives

• Even a 1% false positive rate for identifying differentially expressed genes will identify 200 false positive genes among 20K genes. If only 20 genes are truly involved in the process under that condition, the difficulty is to find the correct 20 among 220 genes.
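The arithmetic in this example can be checked with a short script. This is a minimal sketch: the variable names are illustrative, it follows the slide's approximation of applying the false positive rate to all 20K genes, and it assumes that every truly involved gene is detected.

```python
# Expected false positives when testing many genes at a fixed false positive rate.
n_genes = 20_000            # genes tested (from the slide)
false_positive_rate = 0.01  # 1% false positive rate (from the slide)
n_true = 20                 # genes truly involved (from the slide)

# Assumption: every truly involved gene is detected (100% sensitivity).
expected_false_positives = round(n_genes * false_positive_rate)
n_flagged = expected_false_positives + n_true
fraction_true = n_true / n_flagged

print(expected_false_positives)  # 200
print(n_flagged)                 # 220
print(round(fraction_true, 3))   # 0.091
```

Under these assumptions, fewer than one in ten flagged genes is a true positive.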

(12)

Challenges: High-dimensional Data

• Curse of dimensionality: as the dimensionality of the data increases, all data points become nearly equidistant, which makes it very hard to discriminate between two different classes of data

• Effective dimensionality reduction techniques are needed
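The distance-concentration effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original slides: points are drawn uniformly from the unit cube, distances are Euclidean, and the helper name `relative_contrast` is hypothetical. It reports how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min of the Euclidean distances from one random
    query point to n_points random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

# The contrast shrinks as the dimension grows: nearest and farthest
# neighbours end up at almost the same distance.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means that nearest-neighbour style comparisons carry little information, which is one motivation for dimensionality reduction.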

(13)

Challenges: small n large p problem

• Number of samples: n

• Number of parameters to predict: p

• In general, statistical techniques require n>>p

• However, in biological data, often p>n

• Special statistical analysis tools are needed that can provide high specificity and high sensitivity even when p>n

(14)

Challenges: noisy high throughput data

• Various sources of variation and error in biological experiments

• Very difficult to control every parameter during an experiment

• Reducing and decomposing various sources of error is necessary

• Quality control of initial data will help

(15)

Challenges: Integration of multiple heterogeneous data sources

• Biological data come in many different forms.

• Data may provide redundant information

• Redundancy is beneficial for reducing high false-positive rates

• Techniques that can integrate heterogeneous data are needed

(16)

Probability and Statistics Review

• Probability and conditional probability

• Random variables

• Distributions

– Discrete and continuous distributions

• Joint and marginal distribution

• Multivariate distributions

• Sampling distributions

(17)

Probability

• Studies uncertainty

• A random experiment

– An experiment/observation which does not have a certain outcome before it is conducted

• Examples

– Tossing a coin

– Observing the life time of a light bulb

– The grade you are going to get from this course

– Others?

(18)

Sample space

• The set of all possible outcomes of a random experiment is called the sample space

– Tossing a coin:

• Sample space = {H, T}

– Tossing two coins:

• Sample space = {HH, HT, TH, TT}

– Lifetime of a light bulb:

• Sample space = [0,+∞)

(19)

Event

• Any collection of possible outcomes of an experiment

– Any subset of the sample space

• Examples:

– Experiment: tossing two coins. Event: obtaining exactly one head. {HT, TH} ⊂ {HH, HT, TH, TT}

– Experiment: lifetime of light bulb. Event: light bulb does not last more than a month.

(20)

Event algebra

• Union of two events: same as set union

– A ∪ B = {x : x ∈ A or x ∈ B}

• Intersection of two events: same as set intersection

– A ∩ B = {x : x ∈ A and x ∈ B}

• Complementation: same as in sets

– Aᶜ = {x : x ∉ A}

• Disjoint sets: if A and B have no outcomes in common (A ∩ B = ∅), they are disjoint, or mutually exclusive
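Because events are sets, these operations map directly onto Python's built-in `set` type. The sketch below uses the two-coin sample space from the earlier slide; events `B` and `C` are illustrative choices that do not appear in the original.

```python
# Sample space for tossing two coins.
S = {"HH", "HT", "TH", "TT"}

A = {"HT", "TH"}   # exactly one head
B = {"HH", "HT"}   # first coin shows heads (illustrative event)
C = {"TT"}         # no heads (illustrative event)

union = A | B                # outcomes in A or B
intersection = A & B         # outcomes in both A and B
complement_A = S - A         # outcomes of S that are not in A
disjoint = A.isdisjoint(C)   # A and C share no outcomes

print(sorted(union))         # ['HH', 'HT', 'TH']
print(sorted(intersection))  # ['HT']
print(sorted(complement_A))  # ['HH', 'TT']
print(disjoint)              # True
```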

(21)

Probability

• Assignment of a real number to an event

– The relative frequency of occurrence of an event in a large number of experiments

• P(A)

• Axioms:

– P(A) ≥ 0

– P(S) = 1, where S is the sample space

– If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B)

(22)

Example

• Experiment:

– Tossing two coins

– A = {obtaining exactly one head}

– P(A) = ?
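One way to answer this is to enumerate the sample space and count. The sketch below assumes both coins are fair, so that all four outcomes are equally likely; the slide does not state this explicitly.

```python
from fractions import Fraction
from itertools import product

# All outcomes of tossing two coins (assumed fair, hence equally likely).
sample_space = ["".join(p) for p in product("HT", repeat=2)]

# Event A: exactly one head.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]

p_a = Fraction(len(event_a), len(sample_space))
print(sample_space)  # ['HH', 'HT', 'TH', 'TT']
print(event_a)       # ['HT', 'TH']
print(p_a)           # 1/2
```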

(23)

Conditional Probability

• Updating of the sample space based on new information

• Consider two events A and B. Suppose that the event B has occurred. This information may change the probability of event A.

• P(A|B) denotes the conditional probability of event A given that B has occurred.

(24)

Conditional probability

• If A and B are events in S and P(B) > 0, then the conditional probability of A given B is defined as:

– P(A|B) = P(A ∩ B)/P(B)

• Example: tossing a fair die.

– A = {the number on the die is even}

– B = {the number on the die < 4}
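For the fair-die example, P(A|B) can be computed directly from the definition. This is a minimal sketch; the helper `prob` is a hypothetical name and relies on all six faces being equally likely.

```python
from fractions import Fraction

S = set(range(1, 7))               # faces of a fair die
A = {x for x in S if x % 2 == 0}   # the number is even
B = {x for x in S if x < 4}        # the number is less than 4

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

# Definition: P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 1/3
```

Knowing that the number is below 4 lowers the probability of an even number from 1/2 to 1/3, because only one of the three remaining outcomes is even.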

(25)

Independence

• If P(A|B) = P(A), we say that event A is independent of event B

• Multiplication rule:

– if two events A and B are independent, then P(A ∩ B) = P(A)P(B)

• P(B|A) = P(B) also holds in this case.

• A and B are then called mutually independent

(26)

Independence

• Example: tossing a fair die.

– A = {the number on the die is even}

– B = {the number on the die > 2}

– P(A|B) = ?

– P(B|A) = ?

– P(A) = ?

– P(B) = ?
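The four probabilities can be worked out by enumeration. The sketch below assumes equally likely faces and uses a hypothetical helper `prob`; it also checks the product rule P(A ∩ B) = P(A)P(B).

```python
from fractions import Fraction

S = set(range(1, 7))               # faces of a fair die
A = {x for x in S if x % 2 == 0}   # even: {2, 4, 6}
B = {x for x in S if x > 2}        # greater than 2: {3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

p_a = prob(A)
p_b = prob(B)
p_ab = prob(A & B)
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

print(p_a, p_b)                  # 1/2 2/3
print(p_a_given_b, p_b_given_a)  # 1/2 2/3
print(p_ab == p_a * p_b)         # True
```

Since P(A|B) = P(A) and P(B|A) = P(B), the two events are independent in this example.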

(27)

Bayes’s Theorem

• Using the conditional probability formula we may write:

– P(A|B) = P(A ∩ B)/P(B)

– P(B|A) = P(A ∩ B)/P(A)

– ⇒ P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

– ⇒ P(B|A) = P(A|B)P(B)/P(A)

• This is known as Bayes’s theorem

(28)

Bayes’s rule

• Let B1, B2, B3, …, Bk be a partition of the sample space; the Bi are mutually disjoint. Let A be any event. Then for each i = 1, 2, …, k

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{P(A)} = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}$$

(29)

Example

• A novel diagnosis array is 95% effective in detecting a certain disease when it is present. The test also has a 1% false positive rate. If 0.5% of the population has the disease, what is the probability that a person with a positive test result actually has the disease?

(30)

Solution

• A = {a person’s test result is positive}

• B = {a person has the disease}

• P(B) = 0.005, P(A|Bᶜ) = 0.01, P(A|B) = 0.95

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)} = \frac{0.95 \times 0.005}{0.95 \times 0.005 + 0.01 \times 0.995} = \frac{0.00475}{0.0147} \approx 0.323$$
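The same calculation can be written as a small function, which makes it easy to see how the answer changes with the prevalence. This is a sketch; the function name `posterior` and its parameter names are illustrative and not taken from the slides.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes's rule with the partition
    {disease, no disease}."""
    # Total probability of a positive test: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.005, sensitivity=0.95, false_positive_rate=0.01)
print(round(p, 3))  # 0.323
```

Even with a 95% detection rate, only about a third of positive results correspond to actual disease, because the disease is rare in the population.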

(31)

Random Variables

• A random variable (r.v.) associates a unique numerical value with each outcome in the sample space. It is a real-valued function from a sample space S into the real numbers.

• Similar to events, it is denoted by an uppercase letter (e.g., X or Y), and a particular value taken by a r.v. is denoted by the corresponding lowercase letter (e.g., x or y).

(32)

Examples

• Toss three coins. X = number of heads

• Pick a student from the Informatics Institute. X = age of the student

• Observe the lifetime of a light bulb. X = lifetime in minutes

• X may be discrete or continuous

(33)

Discrete random variables

• If the set of values a r.v. can take is finite or countably infinite, then the r.v. is discrete

• Suppose that we can calculate P(X = x) for every value of x. The collection of these probabilities can be viewed as a function of x. The probability mass function (p.m.f.) of a discrete r.v. is given by f(x) = P(X = x) for all x
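As an illustration, the p.m.f. of X = number of heads in three coin tosses can be tabulated by enumerating the sample space. The sketch assumes fair coins, so each of the eight outcomes has probability 1/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 outcomes of tossing three coins (assumed fair, hence equally likely).
outcomes = list(product("HT", repeat=3))

# X maps each outcome to its number of heads; count how often each value occurs.
counts = Counter(outcome.count("H") for outcome in outcomes)

# p.m.f.: f(x) = P(X = x)
pmf = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in pmf.items():
    print(x, p)
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```

The values of a p.m.f. are non-negative and sum to 1.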

(34)

Continuous random variables

• A r.v. X is continuous if it can take any value from one or more intervals of real numbers.

• We cannot use a p.m.f. because P(X = x) = 0, since there are infinitely many possible outcomes. Instead we use a probability density function (p.d.f.), f(x), such that the areas under the curve represent probabilities

(35)

Continuous random variable

• The p.d.f. f(x) of a continuous r.v. X is the function that satisfies

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for all } x$$

where F_X(x) is the cumulative distribution function (c.d.f.) of the r.v. X, defined by F_X(x) = P(X ≤ x)

(36)

Expected Value

• Expected value or the mean of a discrete r.v.

$$E(X) = \sum_{x} x f(x) = x_1 f(x_1) + x_2 f(x_2) + \dots$$

• Expected value of a continuous r.v.

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

(37)

Expected Value

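• Expected value of a function of X

$$E[g(X)] = \begin{cases} \sum_{x} g(x) f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x) f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$

• For example, if g(X) = X²

$$E[X^2] = \sum_{x} x^2 f(x) \quad \text{if } X \text{ is discrete}$$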

(38)

Linearity of Expected Values

• For any two random variables X1 and X2 and any constants c1, c2

$$E(c_1 X_1 + c_2 X_2) = c_1 E(X_1) + c_2 E(X_2)$$

(39)

Expected value of a product

• In general E(XY) is not equal to E(X)E(Y)

$$E(XY) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y \, j(x, y)\,dx\,dy$$

• where j(x,y) is the joint distribution, which is NOT equal to f(x)g(y) if X and Y are not independent
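A discrete analogue makes the point concrete. The sketch below is an illustration that is not in the original slides: it compares E(XY) with E(X)E(Y) for two independent fair dice and for the fully dependent case Y = X, using sums in place of the integrals.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)                 # probability of each face of a fair die

e_x = sum(x * p for x in faces)    # E(X) = 7/2

# Independent dice: the joint p.m.f. factorises, j(x, y) = f(x) g(y) = 1/36.
e_xy_independent = sum(x * y * p * p for x, y in product(faces, repeat=2))

# Fully dependent case Y = X: the joint p.m.f. puts mass 1/6 on each pair (x, x).
e_xy_dependent = sum(x * x * p for x in faces)

print(e_x * e_x)          # 49/4
print(e_xy_independent)   # 49/4
print(e_xy_dependent)     # 91/6
```

With independence the product rule holds; with Y = X the expectation of the product is larger than the product of the expectations.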

(40)

Variance and Standard Deviation

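• Variance of a r.v. is denoted by Var(X) or σ²; the standard deviation σ is its square root

$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2] = E(X^2) - \mu^2, \quad \text{where } \mu = E(X)$$

$$\sigma = \sqrt{\mathrm{Var}(X)}$$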

(41)

Example

• Experiment: two fair dice are tossed

• What is the expected value of the r.v. X?

X = sum of two dice

• What is the variance?
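Both quantities can be computed exactly by building the p.m.f. of the sum. This is a minimal sketch that relies on the 36 ordered outcomes of two fair dice being equally likely.

```python
from fractions import Fraction
from itertools import product

# p.m.f. of X = sum of two fair dice, built from the 36 equally likely outcomes.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    total = d1 + d2
    pmf[total] = pmf.get(total, Fraction(0)) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())               # E(X)
second_moment = sum(x * x * p for x, p in pmf.items())  # E(X^2)
variance = second_moment - mean ** 2                    # Var(X) = E(X^2) - [E(X)]^2

print(mean)      # 7
print(variance)  # 35/6
```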

(42)

Example

• Let X be a continuous r.v. with p.d.f. f(x) = ½ for 0 < x < 2 (and f(x) = 0 otherwise).

• What is the expected value of X?

• What is the variance of X?
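A worked solution, using E(X) = ∫ x f(x) dx and Var(X) = E(X²) − [E(X)]²; this is a sketch of the calculation rather than part of the original slides:

$$E(X) = \int_{0}^{2} x \cdot \tfrac{1}{2}\,dx = \left[\tfrac{x^2}{4}\right]_{0}^{2} = 1$$

$$E(X^2) = \int_{0}^{2} x^2 \cdot \tfrac{1}{2}\,dx = \left[\tfrac{x^3}{6}\right]_{0}^{2} = \tfrac{4}{3}$$

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \tfrac{4}{3} - 1 = \tfrac{1}{3}$$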

(43)

Covariance

• A measure of strength of a relationship between two random variables

• E.g., X = height of a person p, Y = weight of the same person

• X and Y are paired random variables

• Is there a relationship between them?

(44)

Correlation

• The correlation (or correlation coefficient) is simply the covariance standardized to the range [-1, 1]

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
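The sketch below estimates covariance and correlation from paired observations. It assumes the standard definition Cov(X, Y) = E[(X − E(X))(Y − E(Y))], which the slides do not spell out, and the height and weight values are made-up numbers used only for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the square root of the product of the variances.
    Note that covariance(xs, xs) is the variance of xs."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical paired measurements (not real data).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]

print(covariance(heights_cm, weights_kg))
print(round(correlation(heights_cm, weights_kg), 3))
```

Unlike the covariance, the correlation does not depend on the measurement units: expressing heights in metres changes the covariance but leaves the correlation unchanged.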
