• No results found

Learning from Diversity

N/A
N/A
Protected

Academic year: 2021

Share "Learning from Diversity"

Copied!
20
0
0

Loading.... (view fulltext now)

Full text

(1)

Learning from Diversity

Epitope Prediction with Sequence and Structure Features using an Ensemble of Support Vector Machines

Rob Patro and Carl Kingsford

Center for Bioinformatics and Computational Biology University of Maryland

Nov. 16, 2010

(2)

Overview

Challenge: epitope-antibody recognition Solution: ensemble of support vector machines

I Trained with probabilistic extension

I Variety of feature classes : physicochemical properties, string kernels, structure

I Performance of individual methods and ensemble

(3)

Problem Overview

The Challenge

©http://visualscience.ru, 2010

}

Binding Site

{

Binding Site

?

Binding with linear epitopes

“Simpler” sequence → affinity relation

The Details

Measure binding affinity aff(pi) ∈ [0, 65536]

C+= {pi| aff(pi) ∈ [10000, 65536]}

6, 841 binders

C= {pi | aff(pi) ∈ [0, 1000]}

20, 437 non-binders

Learn a function to predict binding f : P → [0, 1]

f (pi) ≥ 0.5 =⇒ p ∈ C+ f (pi) < 0.5 =⇒ p ∈ C

(4)

System Overview

?

C -

f

0

f

1 . . .

f

M

0

} }

0.5 1.0

C+

Individual classifiers trained on various features

Decision Trees, Boosted / Bagged / Random Forests, Naive Bayes, Logistic Regression, Maximum Entropy Classification, (Balanced) Winnow Classifiers, etc.

Support Vector Machines (SVM)

Aggregate scores of classifiers

Produces prediction for binding class

Unlikely Binder Likely Binder

(5)

Probabilistic SVMs

Ideally we want a confidence in each prediction (Platt:1999)

For each prediction, we obtain a posterior probability

Allows ranking of predictions by posterior

Aids in classifier combination

C -

0

} }

0.5 1.0

C+

(6)

Combining Predictions

Probabilistic SVMs trained on various features using various kernels, with various parameters Combined by weighted voting:

Weight of the j classifier based on cross-validation

performance

th

Normalizing constant

th

Posterior of positive class label

for j classifier

(7)

Choosing Features

To train SVMs we translate each peptide pi into a feature vector xi Good features are essential

Good features should Be discriminative

Lead to class separability Be efficient to compute

Real features

Capture partial information Separate data subsets Are often complementary

Consider many useful features → predictive power

(8)

Which Features?

ILAMRSHYPF

Sequence Features

Physico|Biochemical Features

Structure Features

k-spectrum kernel mismatch kernel substitution kernel string subsequence kernel sparse spatial sample kernel

BLOSUM encoding AAIndex encoding Local composition

Peptide/Structure shape complementarity

(9)

Sequence Features (String Kernels)

String kernels assign a similarity to a pair of strings K-spectrum kernel

Consider all (K ) k-mers that occur in the training set Encode each peptide as a vector v ∈ RK

vj =

(1 if p contains the jth k-mer 0 otherwise

or vj can encode the frequency of the jth k-mer in the peptide.

Other string kernels: Mismatch kernel, Substitution kernel,

Restricted gappy kernel, String subsequence kernel, Sparse Spatial Sample (SSS) kernel

(10)

Compositional Features

Consider physicochemical properties of each peptide sequence

Hydropathy, Antigenicity, Structure preference etc.

Average property over entire peptide

Map each peptide to a scalar v ∈ R

A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V 1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -.6 -0.8 -0.7 -0.9 -1.3 4.2

4.5 3.8 1.8 1.9 -4.5 -0.8 -3.2 -1.3 -0.6 2.8

Amino Acid Hydropathy

0.44

}

A R N D C Q E G H I

. . .

(11)

Local Compositional Features

Physicochemical features can be useful but are global Epitope is only a subset of the peptide

A/L R/K N/M D/F C/P Q/S E/T G/W H/Y I/V 1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5 3.8 -3.9 1.9 2.8 -.6 -0.8 -0.7 -0.9 -1.3 4.2

4.5 3.8 1.8 1.9 -4.5 -0.8 -3.2 -1.3 -0.6 2.8

Amino Acid Hydropathy

3.36 2.5 -0.26 0.3

. . . . . .

A R N D C Q E G H I

. . .

Consider a sliding window of a given length w Move window along the peptide from left to right Average values over window

Concatenate output to represent the peptide

(12)

Orthogonal Encoding

Orthonormal representation proposed by Qian1988

C E H R

Orthogonal Encoding Matrix

1

1 1

1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Map each amino acid aj ∈ pi to a 20 long bit-vector vj xi = v0v1. . . vk−1 for an amino acid of length k

(13)

Property Encoding

Orthogonality is not actually important in our application

C E H R

9 -1 -1 . . . -4 0 0 . . . -3 -1 0 . . . -3 -1 -1

BLOSUM 62

Replace the indicator vector by something more informative e.g. a row from a BLOSUM or PAM matrix

(14)

AAIndex Encoding

The Amino Acid Index (AAIndex) (Kawashima2008) compiles a growing list of different phyiscochemical and biochemical properties of amino acids . . . 544 to date!

C E H R

Property Matrix

0 1 . . . K

Is it possible to make use of all this information?

Use non-linear factor matrix of AAIndex (Nanni2010)

(15)

Structural Features

Consider how well IgG and a peptide “fit” together.

Poor Shape Complementarity

score

frequency

Good Shape Complementarity

score

frequency

Experimentally measured IgG conformation

Approximate native peptide conformation Choose most common sidechain positions Relax energy

Compute 2000 "best" dockings

using ZDock (Chen, Li, and Wang 2003)

Feature vector given by histogram of docking scores

(16)

Results Table

vs. ensemble

Features AUROC AUPR ∆AUROC ∆AUPR

k-spectrum 0.85 0.70 -0.043 -0.072

Sparse Spatial Sample 0.87 0.73 -0.023 -0.042 Nonlinear Fisher Mat. 0.86 0.69 -0.024 -0.082 Statistical Analysis Mat. 0.85 0.67 -0.025 -0.102

BLOSUM Encoding 0.86 0.70 -0.024 -0.072

Local Composition 0.88 0.74 -0.013 -0.032

Structure 0.74 0.53 -0.153 -0.242

ensemble 0.893 0.772

2ndPlace 0.892 0.766 -0.001 -0.006

3rdPlace 0.864 0.691 -0.029 -0.081

4thPlace 0.855 0.689 -0.038 -0.083

(17)

Performance Curves ROC

(18)

Performance Curves P/R

(19)

Conclusions

I Many good features exist

I They capture some non-overlapping information

I Ensemble solutions, used properly, are effective

I Structure features are hard to compute

I Much room for improvement here

I Simple features should not be discounted

I The local composition feature was the best single classifier

I We didn’t encounter anyone using this in the literature!

(20)

Thanks

Funding:

NIH grant 1R21AI085376 and NSF grant 0849899 to C.K.

For many interesting and useful conversations:

Geet Duggal, Darya Filippova, Justin Malin, Guillaume Marçais, Saket Navlakha, Emre Sefer

References

Related documents

As part of the project, we have been annotating a corpus with the syntactic, seman- tic and discourse information that is needed for dif- ferent subtasks of NP realization,

Impacts of Recent Submarine Cable Cuts on Middle East Internet.. Doug Madory, Renesys Corp

At present the work is focussed mainly with our roles being to advise on the development of a food security plan through our participation on the advisory committee. We meet

Hidden communication can be enabled by employing steganographic methods applied to the users’ voice that is carried inside the RTP packets’ payload, by utilising so- called

Six obso- lete and modern cotton genotypes (Table 1) were grown at two different plant population densities in any one year throughout the duration of the study.. The

You can access the SEEK function by tabbing the cursor to the Search submenu on the Menu bar and pressing ENTER. Security Extract Engine by Key - Advanced

On arrival to a hospital a patient passes through entry physical examinations, which consists of a number of tests.. The database should contain information about

Since Fiji’s independence from the United Kingdom in 1970 Pitcairn has been the responsibility of a Governor residing in New Zealand (the convention being that Pitcairn’s