Genomic, Proteomic and Transcriptomic Lab
High Performance Computing and Networking Institute, National Research Council, Italy
Mathematical Models of Supervised Learning and their Application to Medical Diagnosis
Mario Rosario Guarracino
January 9, 2007
Acknowledgements
X Prof. Franco Giannessi – U. of Pisa,
X Prof. Panos Pardalos – CAO UFL,
X Onur Seref – CAO UFL,
X Claudio Cifarelli – HP.
Agenda
X Mathematical models of supervised learning
X Purpose of incremental learning
X Subset selection algorithm
X Initial points selection
X Accuracy results
X Conclusion and future work
Introduction
X Supervised learning refers to the capability of a system to learn from examples (training set).
X The trained system is able to provide an answer (output) for each new question (input).
X Supervised means the desired output for the training set is provided by an external teacher.
X Binary classification is among the most successful methods for supervised learning.
Applications
X Many applications in biology and medicine:
– Tissues that are prone to cancer can be detected with high accuracy.
– Identification of new genes or isoforms of gene expressions in large datasets.
– New DNA sequences or proteins can be tracked down to their origins.
– Analysis and reduction of data spatiality and principal characteristics for drug design.
Problem characteristics
X Data produced in biomedical applications will increase exponentially in the coming years.
X Gene expression data contain tens of thousands of characteristics.
X In genomic/proteomic applications, data are often updated, which poses problems for the training step.
X Current classification methods can over-fit the problem, providing models that do not generalize well.
[Figure: point sets A and B.]
Linear discriminant planes
X Consider a binary classification task with points in two linearly separable sets.
– There exists a plane that classifies all points in the two sets
X There are infinitely many planes that correctly classify the training data.
SVM classification
X A different approach, yielding the same solution, is to maximize the margin between support planes
– Support planes leave all points of a class on one side
X Support planes are pushed apart until they “bump” into a small set of data points (support vectors).
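The idea that many planes separate the data, while only some leave a wide margin, can be illustrated with a small sketch. The point sets, the three candidate planes and the helper names below are illustrative assumptions, not data from the talk; the code only compares hand-picked candidates and does not solve the SVM optimization problem.

```python
import math

def signed_distance(x, w, gamma):
    """Signed distance of point x from the plane x'w = gamma."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(xi * wi for xi, wi in zip(x, w)) - gamma) / norm

def margin(A, B, w, gamma):
    """Smallest point-to-plane distance, or None if the plane does not
    leave all of A on the positive side and all of B on the negative side."""
    da = [signed_distance(a, w, gamma) for a in A]
    db = [signed_distance(b, w, gamma) for b in B]
    if min(da) <= 0 or max(db) >= 0:
        return None
    return min(min(da), -max(db))

# Hypothetical, linearly separable toy data.
A = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
B = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Three hand-picked planes (w, gamma); all of them separate A from B.
candidates = {
    "x1 + x2 = 2.5": ((1.0, 1.0), 2.5),
    "x1 = 1.5": ((1.0, 0.0), 1.5),
    "x1 + x2 = 3.5": ((1.0, 1.0), 3.5),
}
margins = {name: margin(A, B, w, g) for name, (w, g) in candidates.items()}
best = max(margins, key=margins.get)  # candidate with the widest margin
```

All three candidates classify the toy data correctly, but their margins differ; an SVM would search over all planes for the widest one rather than over a fixed list.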
[Figure: point sets A and B.]
SVM classification
X Support Vector Machines are the state of the art among existing classification methods.
X Their robustness is due to their strong foundations in statistical learning theory.
X The training relies on the optimization of a convex quadratic cost function, for which many methods are available.
– Available software includes SVMlight and LIBSVM.
X These techniques can be extended to nonlinear discrimination by embedding the data in a nonlinear space using kernel functions.
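As a rough illustration of a kernel embedding, the sketch below builds a Gaussian kernel matrix between a set of points and a set of reference points. The particular form exp(-||x - y||^2 / sigma), the function names and the sample points are assumptions made for the example; other scalings of the Gaussian kernel are also in common use.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma) (assumed scaling)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / sigma)

def kernel_matrix(X, C, sigma=1.0):
    """Rows: points of X; columns: reference points of C."""
    return [[gaussian_kernel(x, c, sigma) for c in C] for x in X]

# Hypothetical points; using X as its own reference set gives a square matrix.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, X)
```

Each point is then represented by its row of kernel values, so a linear method applied to the rows of K yields a nonlinear surface in the original space.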
A different religion
X The binary classification problem can be formulated as a generalized eigenvalue problem (GEPSVM).
X Find the plane $x'w_1 = \gamma_1$ closest to A and farthest from B:
[Figure: point sets A and B.]
O. Mangasarian et al., (2006) IEEE Trans. PAMI
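For reference, the generalized eigenvalue formulation of the cited GEPSVM work can be sketched as follows. Here the rows of the matrices A and B are the points of the two classes, e is a vector of ones, and z collects the unknowns; the symbol z and the omission of any regularization term are choices made for this sketch.

```latex
\min_{(w,\gamma)\neq 0} \frac{\|Aw - e\gamma\|^2}{\|Bw - e\gamma\|^2}
\qquad\Longleftrightarrow\qquad
\min_{z \neq 0} \frac{z'Gz}{z'Hz},
\quad z = \begin{bmatrix} w \\ \gamma \end{bmatrix},
\quad G = [A\ \ {-e}]'[A\ \ {-e}],
\quad H = [B\ \ {-e}]'[B\ \ {-e}].
```

The stationary points of this quotient satisfy $Gz = \lambda Hz$: the eigenvector of the smallest eigenvalue gives the plane closest to A and farthest from B, and the eigenvector of the largest eigenvalue gives the plane with the roles of A and B exchanged.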
ReGEC technique
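Let $[w_1\ \gamma_1]$ and $[w_m\ \gamma_m]$ be the eigenvectors associated with the minimum and maximum eigenvalues of $Gx = \lambda Hx$:
X $a \in A \Leftrightarrow a$ is closer to $x'w_1 - \gamma_1 = 0$ than to $x'w_m - \gamma_m = 0$,
X $b \in B \Leftrightarrow b$ is closer to $x'w_m - \gamma_m = 0$ than to $x'w_1 - \gamma_1 = 0$.
M.R. Guarracino et al., (2007) OMS.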
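The classification rule above can be tried on toy data with the following NumPy sketch. It is a simplified illustration, not the ReGEC implementation: the small multiple of the identity added to G and H (`delta`) is an assumed stand-in for the regularization used by ReGEC, and the function names and data are hypothetical.

```python
import numpy as np

def eigen_planes(A, B, delta=1e-3):
    """Planes (w, gamma) from the min and max eigenvalues of G z = lambda H z."""
    GA = np.hstack([A, -np.ones((A.shape[0], 1))])  # [A  -e]
    GB = np.hstack([B, -np.ones((B.shape[0], 1))])  # [B  -e]
    n = GA.shape[1]
    G = GA.T @ GA + delta * np.eye(n)  # assumed identity regularization
    H = GB.T @ GB + delta * np.eye(n)
    # Eigenvectors of H^{-1} G solve the generalized problem G z = lambda H z.
    vals, vecs = np.linalg.eig(np.linalg.solve(H, G))
    vals, vecs = vals.real, vecs.real
    z1 = vecs[:, np.argmin(vals)]  # plane close to A, far from B
    zm = vecs[:, np.argmax(vals)]  # plane close to B, far from A
    return (z1[:-1], z1[-1]), (zm[:-1], zm[-1])

def plane_distance(X, plane):
    """Distance of each row of X from the plane x'w - gamma = 0."""
    w, gamma = plane
    return np.abs(X @ w - gamma) / np.linalg.norm(w)

def predict(X, planes):
    """Assign each point to the class whose plane is closer."""
    d1 = plane_distance(X, planes[0])
    dm = plane_distance(X, planes[1])
    return np.where(d1 <= dm, "A", "B").tolist()

# Hypothetical data: A lies near the line x2 = 0, B near the line x2 = 3.
A = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0]])
B = np.array([[0.0, 3.0], [1.0, 3.1], [2.0, 2.9], [3.0, 3.0]])
planes = eigen_planes(A, B)
labels = predict(np.array([[1.5, 0.2], [1.5, 2.8]]), planes)
```

Only one eigenvalue problem has to be solved to obtain both planes, which is the point of the generalized eigenvalue formulation.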
Nonlinear classification
X When classes cannot be linearly separated, nonlinear discrimination is needed.
X Classification surfaces can be very tangled.
X This model accurately describes original data, but does not generalize to new data (over-fitting).
[Figure: nonlinear classification surface on a two-dimensional dataset.]
How to solve the problem?
Incremental classification
X A possible solution is to find a small and robust subset of the training set that provides comparable accuracy results.
X A smaller set of points:
– reduces the probability of over-fitting the problem,
– is computationally more efficient in predicting new points.
X As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated with respect to the small subset.
I-ReGEC: Incremental learning algorithm
Γ_0 = C \ C_0
{M_0, Acc_0} = Classify(C; C_0)
k = 1
while |Γ_k| > 0 do
    x_k = x : max_{x ∈ {M_k ∩ Γ_{k-1}}} { dist(x, P_class(x)) }
    {M_k, Acc_k} = Classify(C; {C_{k-1} ∪ {x_k}})
    if Acc_k > Acc_{k-1} then
        C_k = C_{k-1} ∪ {x_k}
        k = k + 1
    end if
    Γ_k = Γ_{k-1} \ {x_k}
end while
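The control flow of the algorithm can be sketched in Python as below. To keep the example self-contained, a nearest-centroid classifier stands in for ReGEC and the distance to the class centroid stands in for dist(x, P_class(x)); both substitutions, the function names and the toy data are assumptions made for illustration. The loop also stops when no misclassified candidate is left, a termination detail the pseudocode leaves implicit.

```python
import math

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def classify(data, subset):
    """Stand-in for Classify(C; C_k): train on `subset`, evaluate on `data`.
    Returns (misclassified points M, accuracy Acc, class centroids)."""
    labels = sorted({y for _, y in subset})
    cents = {lab: centroid([x for x, y in subset if y == lab]) for lab in labels}
    wrong = [(x, y) for x, y in data
             if min(cents, key=lambda c: math.dist(x, cents[c])) != y]
    return wrong, 1 - len(wrong) / len(data), cents

def incremental_subset(data, initial):
    chosen = list(initial)                             # C_0
    remaining = [p for p in data if p not in chosen]   # Gamma_0 = C \ C_0
    wrong, acc, cents = classify(data, chosen)
    while remaining:
        candidates = [p for p in wrong if p in remaining]  # M ∩ Gamma
        if not candidates:
            break
        # Misclassified point farthest from its own class model.
        x_k = max(candidates, key=lambda p: math.dist(p[0], cents[p[1]]))
        new_wrong, new_acc, new_cents = classify(data, chosen + [x_k])
        if new_acc > acc:                  # keep x_k only if accuracy improves
            chosen.append(x_k)
            wrong, acc, cents = new_wrong, new_acc, new_cents
        remaining.remove(x_k)              # x_k is never examined again
    return chosen, acc

# Hypothetical data: (point, label) pairs with labels in {1, -1}.
data = [((0.0, 0.0), 1), ((1.0, 0.5), 1), ((0.5, 1.0), 1),
        ((5.0, 5.0), -1), ((4.0, 4.0), -1), ((3.0, 3.5), -1), ((2.0, 2.5), -1)]
initial = [((0.0, 0.0), 1), ((5.0, 5.0), -1)]
chosen, acc = incremental_subset(data, initial)
```

On this toy set one point is misclassified by the initial model; adding it raises the accuracy, so it joins the subset and the loop ends with three of the seven points selected.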
I-ReGEC overfitting
[Figure: classification surfaces. Left: ReGEC, accuracy = 84.44. Right: I-ReGEC, accuracy = 85.49.]
X When the ReGEC algorithm is trained on all points, surfaces are affected by noisy points (left).
X I-ReGEC achieves clearly defined boundaries, preserving accuracy (right).
Less than 5% of points needed for training!
Initial points selection
X Unsupervised clustering techniques can be adapted to select initial points.
X We compare the classification obtained with k randomly selected starting points for each class with that obtained with k points determined by the k-means method.
X Results show higher classification accuracy and a more consistent representation of the training set when the k-means method is used instead of random selection.
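A minimal sketch of k-means based selection for one class is given below. Two details are assumptions made for the example: the centers are initialized with the first k points so that the run is deterministic, and each final center is replaced by the nearest training point so that the starting points belong to the training set. The function names and data are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; centers start at the first k points (assumption)."""
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

def initial_points(points, k):
    """For each k-means center, pick the closest actual training point."""
    return [min(points, key=lambda p: math.dist(p, c)) for c in kmeans(points, k)]

# Hypothetical points of a single class forming two groups.
class_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
starts = initial_points(class_points, 2)
```

Running the same selection separately on each class gives k starting points per class, one from each region the class occupies, which random selection does not guarantee.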
Initial points selection
X Starting points $C_i$ chosen:
– randomly (top),
– k-means (bottom).
X For each kernel produced by $C_i$, a set of evenly distributed points x is classified. The procedure is repeated 100 times.
X Let $y_i \in \{1, -1\}$ be the classification based on $C_i$.
X $y = |\sum_i y_i|$ estimates the probability that x is classified in one class.
random: acc = 84.5, std = 0.05
k-means: acc = 85.5, std = 0.01
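The agreement measure can be computed as in the short sketch below. The normalization by the number of runs and the conversion to a majority fraction are additions made for illustration; the slide itself only defines y as the absolute value of the sum of the labels.

```python
def agreement(labels):
    """labels: classifications y_i in {1, -1} of one point x over repeated runs.
    Returns (|sum y_i|, the same value divided by the number of runs,
    and the fraction of runs that voted for the majority class)."""
    total = abs(sum(labels))
    normalised = total / len(labels)
    majority_fraction = (1 + normalised) / 2
    return total, normalised, majority_fraction

# Hypothetical outcome of 100 runs: 90 votes for one class, 10 for the other.
votes = [1] * 90 + [-1] * 10
y, y_norm, majority = agreement(votes)
```

A value of y close to the number of runs means the point is classified consistently regardless of the starting points, while a value close to zero means the runs disagree.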
Initial points selection
random: acc = 72.1, std = 1.45
k-means: acc = 97.6, std = 0.04
Initial point selection
X Effect of increasing initial points k with k-means on Chessboard dataset.
X The graph shows the classification accuracy versus the total number of initial points 2k from both classes.
X This result empirically shows that there is a minimum k, for which maximum accuracy is reached.
Initial point selection
X Bottom figure shows k vs. the number of additional points included in the incremental dataset.
[Figures: classification accuracy (top, 0.5 to 1) and number of additional points (bottom, 0 to 12), both with a horizontal axis from 10 to 100.]
Dataset reduction
X Experiments on real and synthetic datasets confirm training data reduction.
Dataset        I-ReGEC chunk   I-ReGEC % of train
Banana         15.7            3.92
German         29.09           4.15
Diabetis       16.63           3.55
Haberman       7.59            2.76
Bupa           15.28           4.92
Votes          25.9            6.62
WPBC           4.215           4.25
Thyroid        12.40           8.85
Flare-solar    9.67            1.45
Accuracy results
X Classification accuracy with incremental techniques compares well with standard methods.
Dataset        train   ReGEC acc   I-ReGEC chunk   I-ReGEC k   I-ReGEC acc   SVM acc
Banana         400     84.44       15.70           5           85.49         89.15
German         700     70.26       29.09           8           73.5          75.66
Diabetis       468     74.56       16.63           5           74.13         76.21
Haberman       275     73.26       7.59            2           73.45         71.70
Bupa           310     59.03       15.28           4           63.94         69.90
Votes          391     95.09       25.90           10          93.41         95.60
WPBC           99      58.36       42.15           2           60.27         63.60
Thyroid        140     92.76       12.40           5           94.01         95.20
Flare-solar    666     58.23       9.67            3           65.11         65.80
Positive results
X Incremental learning, in conjunction with ReGEC, reduces the size of training sets.
X Accuracy results compare well with those obtained using all training points.
X Classification surfaces can be generalized.
Ongoing research
X Microarray technology can scan expression levels of tens of thousands of genes to classify patients into different groups.
X For example, it is possible to classify types of cancers with respect to the patterns of gene activity in the tumor cells.
X Standard methods fail to identify the groups of genes responsible for the classification.
Examples of microarray analysis
X Breast cancer: BRCA1 vs. BRCA2 and sporadic mutations,
– I. Hedenfalk et al., NEJM, 2001.
X Prostate cancer: prediction of patient outcome after prostatectomy,
– D. Singh et al., Cancer Cell, 2002.
X Malignant gliomas survival: gene expression vs. histological classification,
– C. Nutt et al., Cancer Res., 2003.
X Clinical outcome of breast cancer,
– L. van’t Veer et al., Nature, 2002.
X Recurrence of hepatocellular carcinoma after curative resection,
– N. Iizuka et al., Lancet, 2003.
X Tumor vs. normal colon tissues,
– U. Alon et al., PNAS, 1999.
X Acute Myeloid vs. Lymphoblastic Leukemia,
– T. Golub et al., Science, 1999.
Feature selection techniques
X Standard methods need long and memory-intensive computations.
– PCA, SVD, ICA,…
X Statistical techniques are much faster, but can produce low accuracy results.
– FDA, LDA,…
X Need for hybrid techniques that can take advantage of both approaches.
ILDC-ReGEC
X Simultaneous incremental learning and decremented characterization make it possible to acquire knowledge about gene grouping during the classification process.
X This technique relies on standard statistical indexes (mean µ and standard deviation σ).
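To give a feel for how class means and standard deviations can rank features, the sketch below scores each feature by |µ_A − µ_B| / (σ_A + σ_B). This signal-to-noise style score is an assumed, commonly used example and is not claimed to be the index used by ILDC-ReGEC; the function names and data are hypothetical.

```python
import statistics

def feature_score(values_a, values_b):
    """|mean_A - mean_B| / (std_A + std_B) for one feature (illustrative index)."""
    mu_a, mu_b = statistics.mean(values_a), statistics.mean(values_b)
    sd_a, sd_b = statistics.pstdev(values_a), statistics.pstdev(values_b)
    spread = sd_a + sd_b
    return abs(mu_a - mu_b) / spread if spread > 0 else 0.0

def rank_features(samples_a, samples_b):
    """samples_*: one row per sample, one column per feature.
    Returns feature indices ordered from most to least discriminating."""
    scores = [feature_score(col_a, col_b)
              for col_a, col_b in zip(zip(*samples_a), zip(*samples_b))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Hypothetical expression values: feature 0 separates the classes, feature 1 does not.
samples_a = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
samples_b = [[7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
ranking = rank_features(samples_a, samples_b)
```

Features with low scores are candidates for removal, which is the sense in which the set of characteristics shrinks while the training subset grows.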
ILDC-ReGEC: Golub dataset
X About 100 genes out of 7129 are responsible for the discrimination between
– Acute Myeloid Leukemia (AML), and
– Acute Lymphoblastic Leukemia (ALL).
X Selected genes are in agreement with previous studies.
X Fewer than 10 patients, out of 72, are needed for training.
– Classification accuracy: 96.86%
[Figure panels: ALL, AML.]
ILDC-ReGEC: Golub dataset
X Different techniques agree on the misclassified patient!
[Figure label: misclassified patient.]
Gene expression analysis
X ILDC-ReGEC
– Incremental classification with feature selection for microarray datasets.
X Few experiments and genes are selected as important for discrimination.
Dataset      size          chunk   % of train   features   % of features
H-BRCA1      22 x 3226     6.11    30.55        49.85      1.55
H-BRCA2      22 x 3226     4.28    21.40        56.48      1.75
H-Sporadic   22 x 3226     6.80    34.00        57.15      1.77
Singh        136 x 12600   6.87    5.63         288.23     2.29
Nutt         50 x 12625    8.29    18.42        211.66     1.68
Vantveer     98 x 24188    8.10    9.31         474.35     1.96
Iizuka       60 x 7129     20.14   37.30        122.63     1.72
Alon         62 x 2000     5.43    9.70         32.43      1.62
Golub        72 x 7129     7.25    11.15        95.39      1.34
ILDC-ReGEC: gene expression analysis
Dataset (size)           LLS SVM   KLS SVM   UPCA FDA   SPCA FDA   LUPCA FDA   LSPCA FDA   KUPCA FDA   KUPCA FDA   ILDC ReGEC
H-BRCA1 (22 x 3226)      75.00     72.62     77.38      75.00      76.19       69.05       66.67       52.38       80.00
H-BRCA2 (22 x 3226)      84.52     77.38     72.62      79.76      69.05       72.62       64.29       63.10       85.00
H-Sporadic (22 x 3226)   73.81     78.57     69.05      75.00      70.24       79.76       69.05       69.05       77.00
Singh (136 x 12600)      91.20     90.48     n.a.       n.a.       88.74       84.85       n.a.        n.a.        77.86
Nutt (50 x 12625)        72.22     74.60     n.a.       n.a.       67.46       67.46       n.a.        n.a.        76.60
Vantveer (98 x 24188)    66.86     66.86     n.a.       n.a.       65.33       64.57       n.a.        n.a.        68.00
Iizuka (60 x 7129)       67.10     61.90     n.a.       n.a.       66.67       61.90       n.a.        n.a.        69.00
Alon (62 x 2000)         91.27     82.14     90.08      89.68      90.08       84.52       90.87       81.75       83.50
Golub (72 x 7129)        96.83     93.65     93.25      93.25      94.44       90.08       92.06       88.10       96.86
Conclusions
X ReGEC is a competitive classification method.
X Incremental learning reduces redundancy in training sets and can help avoid over-fitting.
X The subset selection algorithm provides a constructive way to reduce complexity in kernel-based classification algorithms.
X The initial points selection strategy can help in finding regions where knowledge is missing.
X I-ReGEC can be a starting point to explore very large problems.