Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

(1)

Genomic, Proteomic and Transcriptomic Lab

High Performance Computing and Networking Institute National Research Council, Italy

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Mario Rosario Guarracino

January 9, 2007

(2)

Acknowledgements

X prof. Franco Giannessi – U. of Pisa,

X prof. Panos Pardalos – CAO UFL,

X Onur Seref – CAO UFL,

X Claudio Cifarelli – HP.

(3)

Agenda

XMathematical models of supervised learning

XPurpose of incremental learning

XSubset selection algorithm

XInitial points selection

XAccuracy results

XConclusion and future work

(4)

Introduction

X Supervised learning refers to the capability of a system to learn from examples (training set).

X The trained system is able to provide an answer (output) for each new question (input).

X Supervised means the desired output for the training set is provided by an external teacher.

X Binary classification is among the most successful methods for supervised learning.

(5)

Applications

X

Many applications in biology and medicine:

Tissues that are prone to cancer can be detected with high accuracy.

Identification of new genes or isoforms of gene expressions in large datasets.

New DNA sequences or proteins can be tracked down to their origins.

Analysis and reduction of data spatiality and principal characteristics for drug design.

(6)

Problem characteristics

X Data produced in biomedical application will exponentially increase in the next years.

X Gene expression data contain tens of thousand characteristics.

X In genomic/proteomic application, data are often updated, which poses problems to the training step.

X Current classification methods can over-fit the problem, providing models that do not generalize well.

(7)

A A ^B ^B

Linear discriminant planes

X Consider a binary classification task with points in two linearly separable sets.

– There exists a plane that classifies all points in the two sets

X There are infinitely many planes that correctly classify the training data.

(8)

SVM classification

X A different approach, yielding the same solution, is to maximize the margin between support planes

– Support planes leave all points of a class on one side

X Support planes are pushed apart until they “bump” into a small set of data points (support vectors).

A A ^B ^B

(9)

SVM classification

X Support Vector Machines are the state of the art for the existing classification methods.

X Their robustness is due to the strong fundamentals of statistical learning theory.

X The training relies on optimization of a quadratic convex cost function, for which many methods are available.

– Available software includes SVM-Lite and LIBSVM.

X These techniques can be extended to the nonlinear

discrimination, embedding the data in a nonlinear space using kernel functions.

(10)

A different religion

X Binary classification problem can be formulated as a generalized eigenvalue problem (GEPSVM).

X Find x’w₁=γ₁ the closer to A and the farther from B:

A A ^B ^B

O. Mangasarian et al., (2006) IEEE Trans. PAMI

(11)

ReGEC technique

Let

[w

₁

γ

₁

]

^and

[w

_m

γ

_m

]

- γ

₁

=0

.

M.R. Guarracino et al., (2007) OMS.

(12)

Nonlinear classification

X When classes cannot be linearly separated, nonlinear discrimination is needed.

X Classification surfaces can be very tangled.

X This model accurately describes original data, but does not generalize to new data (over-fitting).

−2 −1 0 1 2

.5 2 .5 1 .5 0 .5 1 .5 2 .5

(13)

How to solve the problem?

(14)

Incremental classification

X A possible solution is to find a small and robust subset of the training set that provides comparable accuracy results.

X A smaller set of points:

– reduces the probability of over-fitting the problem,

– is computationally more efficient in predicting new points.

X As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated with respect to the small subset.

(15)

I-ReGEC: Incremental learning algorithm

_k

| > 0 do

5: x

_k

= x : max

_x _∈ _{Mk _∩ _Γ

k-1}

{ dist(x, P

_class(x)

) }

then

8: C

_k

= C

_k-1

∪ {x

_k

}

9: k = k + 1

10: end if

11: Γ

_k

= Γ

_k-1

\ {x

_k

}

12: end while

(16)

I-ReGEC overfitting

ReGEC accuracy=84.44 I-ReGEC accuracy=85.49

X When ReGEC algorithm is trained on all points, surfaces are affected by noisy points (left).

X I-ReGEC achieves clearly defined boundaries, preserving accuracy (right).

 Less then 5% of points needed for training!

−2 −1 0 1 2

.5 2 .5 1 .5 0 .5 1 .5 2 .5

−2 −1 0 1 2

.5 2 .5 1 .5 0 .5 1 .5 2 .5

(17)

Initial points selection

X Unsupervised clustering techniques can be adapted to select initial points.

X We compare the classification obtained with k randomly selected starting points for each class, and k points

determined by k-means method.

X Results show higher classification accuracy and a more

consistent representation of the training set, when k-means method is used instead of random selection.

(18)

Initial points selection

X Starting points C_i chosen:

randomly (top),

k-means (bottom).

X For each kernel produced by C_i, a set of evenly distributed points x is classified.

The procedure is repeated 100 times.

X Let y_i ∈ {1; -1} be the

classification based on C_i.

X y = |∑ y_i| estimates the

probability x is classified in one class.

random acc=84.5 std = 0.05

k-means acc=85.5 std = 0.01

(19)

Initial points selection

X Starting points C_i chosen:

randomly (top),

k-means (bottom).

X For each kernel produced by C_i, a set of evenly distributed points x is classified.

The procedure is repeated 100 times.

X Let y_i ∈ {1; -1} be the

classification based on C_i.

X y = |∑ y_i| estimates the

probability x is classified in one class.

random acc=72.1std = 1.45

k-means acc=97.6 std = 0.04

(20)

10 20 30 40 50 60 70 80 90 100 0.5

0.6 0.7 0.8 0.9 1

Initial point selection

X Effect of increasing initial points k with k-means on Chessboard dataset.

X The graph shows the classification accuracy versus the total number of initial points 2k from both classes.

X This result empirically shows that there is a minimum k, for which maximum accuracy is reached.

(21)

Initial point selection

X Bottom figure shows k vs. the number of additional points included in the incremental dataset.

10 20 30 40 50 60 70 80 90 100

0.5 0.6 0.7 0.8 0.9 1

10 20 30 40 50 60 70 80 90 100

0 2 4 6 8 10 12

(22)

Dataset reduction

X Experiments on real and synthetic datasets confirm training data reduction.

1.45 9.67

Flare-solar

8.85 12.40

Thyroid

4.25 4.215

WPBC

6.62 25.9

Votes

4.92 15.28

Bupa

2.76 7.59

Haberman

3.55 16.63

Diabetis

4.15 29.09

German

3.92 15.7

Banana

% of train chunk

Dataset

I-ReGEC

(23)

Accuracy results

X Classification accuracy with incremental techniques well compare with standard methods

65.80 65.11

3 9.67

58.23 666

Flare- solar

95.20 94.01

5 12.40

92.76 140

Thyroid

63.60 60.27

2 42.15

58.36 99

WPBC

95.60 93.41

10 25.90

95.09 391

Votes

69.90 63.94

4 15.28

59.03 310

Bupa

71.70 73.45

2 7.59

73.26 275

Haberman

76.21 74.13

5 16.63

74.56 468

Diabetis

75.66 73.5

8 29.09

70.26 700

German

89.15 85.49

5 15.70

84.44 400

Banana

acc acc

k chunk

acc train

Dataset

SVM I-ReGEC

ReGEC

(24)

Positive results

X Incremental learning, in conjunction with ReGEC, reduces training sets

dimension.

X Accuracy results well compare with those obtained selecting all training points.

X Classification surfaces can be generalized.

(25)

Ongoing research

X Microarray technology can scan expression levels of tens of

thousands of genes to classify patients in different groups.

X For example, it is possible to classify types of cancers with respect to the patterns of gene activity in the tumor cells.

X Standard methods fail to derive grouping of genes responsible of classification.

(26)

Examples of microarray analysis

X Breast cancer: BRCA1 vs. BRCA2 and sporadic mutations,

– I. Hedenfalk et al, NEJM, 2001.

X Prostate cancer: prediction of patient outcome after prostatectomy,

– Singh D. et al, Cancer Cell, 2002.

X Malignant gliomas survival: gene expression vs. histological classification,

– C. Nutt et al, Cancer Res., 2003.

X Clinical outcome of breast cancer,

– L. van’t Veer et al, Nature, 2002.

X Recurrence of hepatocellaur carcinoma after curative resection,

– N. Iizuka et al, Lancet, 2003.

X Tumor vs. normal colon tissues,

– A. Alon et al, PNAS, 1999.

X Acute Myeloid vs. Lymphoblastic Leukemia,

– T. Golub et al, Science, 1999.

(27)

Feature selection techniques

X

Standard methods need long and memory intensive computations.

– PCA, SVD, ICA,…

X

Statistical techniques are much faster, but can produce low accuracy results.

– FDA, LDA,…

X

Need for hybrid techniques that can take advantage of

both approaches.

(28)

ILDC-ReGEC

X

Simultaneous incremental learning and decremented characterization permit to acquire knowledge on gene grouping during the classification process.

X

This technique relies on standard statistical indexes

(mean µ and standard deviation σ):

(29)

ILDC-ReGEC: Golub dataset

X About 100 genes out of 7129

responsible of discrimination

– Acute Myeloid Leukemia, and – Acute Lymphoblastic Leukemia.

X Selected genes in agreement with previous studies.

X Less then 10 patients, out of 72, needed for training.

– Classification accuracy: 96.86%

AllAll AMLAML

(30)

ILDC-ReGEC: Golub dataset

X Different techniques agree on the miss-classified patient!

Missclassified patient

(31)

Gene expression analysis

X ILDC-ReGEC

– Incremental classification with feature selection for microarray datasets.

X Few

experiments and genes selected as important for discrimination.

1.34 95.39

11.15

Golub 7.25

1.62 32.43

9.70

Alon 5.43

62 x 2000

1.72 122.63

37.30 20.14

Iizuka

60 x 7129

1.96 474.35

9.31

Vantveer 8.10

98 x 24188

1.68 211.66

18.42

Nutt 8.29

50 x 12625

2.29 288.23

5.63

Singh 6.87

136 x 12600

1.77 57.15

34.00

H-Sporadic 6.80

22 x 3226

1.75 56.48

21.40

H-BRCA2 4.28

22 x 3226

1.55 49.85

30.55

H-BRCA1 6.11

22 x 3226

% of features features

% of train chunk

Dataset

(32)

ILDC-ReGEC: gene expression analysis

96.86 96.86 88.10

92.06 90.08

94.44 93.25

93.25 93.65

96.83 Golub

83.50 81.75

90.87 84.52

90.08 89.68

90.08 82.14

91.27 91.27 Alon

62 x 2000

69.00 69.00 n.a.

n.a.

61.90 66.67

n.a.

61.90 67.10

Iizuka

60 x 7129

68.00 68.00 n.a.

n.a.

64.57 65.33

n.a.

66.86 66.86

Vantveer

98 x 24188

76.60 76.60 n.a.

n.a.

67.46 67.46

n.a.

74.60 72.22

Nutt

50 x 12625

77.86 n.a.

n.a.

84.85 88.74

n.a.

90.48 91.20

91.20 Singh

136 x 12600

77.00 69.05

69.05 79.76

79.76 70.24

75.00 69.05

78.57 73.81

H-Sporadic

22 x 3226

85.00 85.00 63.10

64.29 72.62

69.05 79.76

72.62 77.38

84.52 H-BRCA2

22 x 3226

80.00 80.00 52.38

66.67 69.05

76.19 75.00

77.38 72.62

75.00 H-BRCA1

22 x 3226

ILDC ReGEC KUPCA

FDA KUPCA

FDA LSPCA

FDA LUPCA

FDA SPCA

FDA UPCA

FDA KLS

SVM LLS

Dataset SVM

(33)

Conclusions

X ReGEC is a competitive classification method.

X Incremental learning reduces redundancy in training sets and can help avoiding over-fitting.

X Subset selection algorithm provides a constructive way to reduce complexity in kernel based classification algorithms.

X Initial points selection strategy can help in finding regions where knowledge is missing.

X IReGEC can be a starting point to explore very large problems.

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis