Fakultät Physik
Experimentelle Physik V
Monday Morning
Data Mining
Tim Ruhe
Outline:
- data mining
- IceCube
1. Find a representation of the data
2. Find a good algorithm
IceCube in a nutshell:
- completed in December 2010
- located at the geographic South Pole
- 5160 Digital Optical Modules on 86 strings
- instrumented volume of 1 km³
- subdetectors DeepCore and IceTop
IceCube in a nutshell:
- Detection principle: Cherenkov light
- Look for events of the form:
$\nu + X \rightarrow e, \mu, \tau$
IceCube: Scientific goals
- detection of astrophysical neutrinos
- atmospheric neutrino energy spectrum
- neutrino oscillations
- cosmic-ray (CR) anisotropy
Data Mining in IceCube:
- approx. 2600 reconstructed attributes
- Data and MC do not necessarily agree
- signal/background ratio ~ 10⁻³
This makes IceCube interesting for studies within the scope of machine learning
Make sure you understand your input:
Attributes can be:
- nominal: green, blue, red, yellow
- ordinal: cool, mild, hot (cool < mild < hot)
- numerical: 1, 2, 3, 4, ...
Labels can be:
- polynominal: red, green, yellow, blue
- binominal: signal, background
Data Preprocessing: Preselection of parameters
1. Check for consistency (data vs. signal MC vs. background MC)
2. Check for missing values (NaNs, infs). How to handle the NaNs? (see next slide)
3. Eliminate the "obvious" (azimuth angle, timing information, ...)
4. Eliminate highly correlated and constant parameters
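Step 4 of the preselection can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the attribute names, the toy values, the use of the Pearson coefficient as the correlation measure, and the threshold `max_corr=0.95` are all hypothetical and not taken from the analysis.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def preselect(table, max_corr=0.95):
    """Drop constant attributes and attributes highly correlated with a kept one.

    `table` maps attribute name -> list of values (hypothetical layout).
    """
    kept = []
    for name, column in table.items():
        if max(column) == min(column):
            continue  # constant attribute carries no information
        if any(abs(pearson(column, table[k])) > max_corr for k in kept):
            continue  # redundant with an attribute that was already kept
        kept.append(name)
    return kept

# Toy attributes (hypothetical names and values)
table = {
    "energy": [1.0, 2.0, 3.0, 4.0],
    "energy_copy": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "energy"
    "flag": [1.0, 1.0, 1.0, 1.0],         # constant
    "zenith": [0.3, 0.1, 0.4, 0.2],
}
print(preselect(table))  # ['energy', 'zenith']
```

Which of two correlated attributes survives depends on the iteration order here; a real analysis would probably pick the better-understood one.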
Data and MC preprocessing: How to handle NaNs?
Several possibilities:
- Exclude attributes that exceed a certain number of NaNs
- Replace by:
  - minimum
  - maximum
  - average
  - nothing at all
  - (median...)
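The options above could look roughly like this for a single attribute column. It is a minimal sketch using only the standard library; the function names and the toy column are assumptions made for illustration.

```python
import math
from statistics import mean, median

def nan_fraction(values):
    """Fraction of entries that are NaN or infinite."""
    return sum(not math.isfinite(v) for v in values) / len(values)

def replace_nans(values, strategy="mean"):
    """Return a copy of `values` with NaN/inf replaced by min, max, mean or median."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        raise ValueError("attribute has no finite values")
    fill = {"min": min, "max": max, "mean": mean, "median": median}[strategy](finite)
    return [v if math.isfinite(v) else fill for v in values]

column = [1.0, float("nan"), 3.0, float("inf"), 5.0]
print(nan_fraction(column))            # 0.4
print(replace_nans(column, "median"))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```

An attribute whose `nan_fraction` exceeds a chosen limit would be excluded instead of repaired.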
Data and MC preprocessing: Feature Selection
1. Forward Selection
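- start with an empty selection
- add each unused attribute in turn
- estimate the performance for each added attribute
- keep the attribute giving the largest increase in performance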
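A greedy forward selection can be sketched as follows. The `score` callback stands in for whatever performance estimate is used (for example a cross-validated learner); it, the toy scoring function, and the stopping rule (stop when no attribute improves the score) are assumptions of this sketch.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedy forward selection; `score(subset)` returns a performance estimate."""
    selected = []
    remaining = list(attributes)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each unused attribute and estimate the performance
        trial = {a: score(selected + [a]) for a in remaining}
        best_attr = max(trial, key=trial.get)
        if trial[best_attr] <= best_score:
            break  # assumed stopping rule: no attribute improves the score
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = trial[best_attr]
    return selected

# Toy score (hypothetical): each attribute has a fixed usefulness,
# and every attribute costs a small penalty.
usefulness = {"a": 0.5, "b": 0.3, "c": 0.0}

def toy_score(subset):
    return sum(usefulness[x] for x in subset) - 0.05 * len(subset)

print(forward_selection(["a", "b", "c"], toy_score))  # ['a', 'b']
```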
Data and MC preprocessing: Feature Selection
2. Backward Elimination
- start with the full set of attributes
- remove each of the attributes in turn
- estimate the performance for each removed attribute
- the attribute giving the least decrease in performance is removed
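The elimination loop can be sketched in the same spirit. Again, `score` is a placeholder for a performance estimate, and the stopping rule (stop as soon as every removal lowers the score) is an assumption; the RapidMiner operator may use different criteria.

```python
def backward_elimination(attributes, score, min_features=1):
    """Greedy backward elimination; `score(subset)` returns a performance estimate."""
    selected = list(attributes)
    current = score(selected)
    while len(selected) > min_features:
        # Performance after removing each attribute in turn
        trial = {a: score([x for x in selected if x != a]) for a in selected}
        victim = max(trial, key=trial.get)  # removal with the least decrease
        if trial[victim] < current:
            break  # assumed stopping rule: every removal hurts
        selected.remove(victim)
        current = trial[victim]
    return selected

# Toy score (hypothetical): fixed usefulness per attribute minus a size penalty
usefulness = {"a": 0.5, "b": 0.3, "c": 0.0}

def toy_score(subset):
    return sum(usefulness[x] for x in subset) - 0.05 * len(subset)

print(backward_elimination(["a", "b", "c"], toy_score))  # ['a', 'b']
```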
Backward Elimination in RapidMiner:
Data and MC preprocessing: Feature Selection
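3. Minimum Redundancy Maximum Relevance (MRMR)
iteratively add features with biggest relevance and least redundancy
Quality criterion Q:
$Q = R(x, y) - \frac{1}{j} \sum_{x' \in F_j} D(x, x')$
(R: relevance of feature x for the label y; D: redundancy between x and an already selected feature x'; F_j: set of the j features selected so far)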
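The iterative MRMR selection can be illustrated with a small sketch. It assumes the absolute Pearson correlation as both the relevance measure R and the redundancy measure D, which is one common choice but not necessarily the one used in the analysis; the feature names and values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mrmr(features, label, k):
    """Select k feature names, maximising Q = R(x, y) - mean of D(x, x') over selected x'."""
    selected = []
    remaining = list(features)

    def quality(name):
        relevance = abs(pearson(features[name], label))
        if not selected:
            return relevance
        redundancy = sum(abs(pearson(features[name], features[s]))
                         for s in selected) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=quality)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): "a_dup" is largely redundant with "a"
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a":     [0, 0, 0, 1, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 1, 1, 1, 1, 0],
    "b":     [0, 1, 0, 0, 1, 1, 0, 1],
}
print(mrmr(features, label, 2))  # ['a', 'b']
```

Although "a_dup" and "b" are equally relevant in this toy example, "b" is chosen second because it is less correlated with the already selected "a".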
MRMR in RapidMiner:
Evaluating the Stability of the Parameter Selection:
- Data and MC are subject to a certain variance
Stability of the MRMR Selection:
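Jaccard Index:
$J = \frac{|A \cap B|}{|A \cup B|}$
Kuncheva's Index:
$I_C(A, B) = \frac{rn - k^2}{k(n - k)}$
(A, B: two selected feature subsets of size k; r = |A ∩ B|; n: total number of features)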
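Both stability indices are easy to compute for two feature selections. The sketch below assumes that A and B are sets of feature names of equal size k drawn from n available features; the example sets are invented.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n):
    """Kuncheva's index (r*n - k^2) / (k*(n - k)) for two subsets of equal size k."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both subsets must have the same size k")
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))

# Two hypothetical selections of k = 4 out of n = 10 features
A = {"x1", "x2", "x3", "x4"}
B = {"x1", "x2", "x3", "x5"}
print(jaccard(A, B))                  # 0.6
print(round(kuncheva(A, B, n=10), 3)) # 0.583
```

Unlike the Jaccard index, Kuncheva's index corrects for the overlap expected by chance, so it can become negative for nearly disjoint selections.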
Learners:
1. Decision Trees
2. Naive Bayes
3. k-Nearest Neighbours
4. Random Forests
A bit more technically speaking:
- set of vectors x = (x1, x2, ..., xn); xi = attribute (attributes = features, variables, parameters)
- labels y1, y2, ..., yn; labels = target class
Naive Bayes:
- based on Bayes' theorem:
$\Pr[H \mid E] = \frac{\Pr[E \mid H] \times \Pr[H]}{\Pr[E]}$
Naive Bayes:
Play?
outlook = sunny, temperature = cool,
humidity = high, windy = true
Naive Bayes:
$\Pr[\text{yes} \mid E] = \frac{2/9 \times 3/9 \times 3/9 \times 3/9 \times 9/14}{\Pr[E]} = \frac{0.0053}{\Pr[E]}$
$\Pr[\text{no} \mid E] = \frac{0.0206}{\Pr[E]}$
Normalising so that both probabilities sum to 1 eliminates Pr[E]:
$\Pr[\text{yes} \mid E] = 0.205$
$\Pr[\text{no} \mid E] = 0.795$
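The numbers above can be reproduced with exact fractions. The factors for "yes" are the ones shown on the slide; the factors for "no" (3/5, 1/5, 4/5, 3/5 and the prior 5/14) are an assumption taken from the standard 14-example weather data set, which reproduces the quoted 0.0206.

```python
from fractions import Fraction as F
from math import prod

# Pr[E_i | yes] for outlook=sunny, temperature=cool, humidity=high, windy=true
likelihood_yes = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
prior_yes = F(9, 14)

# Assumed values for "no" from the standard weather data set
likelihood_no = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]
prior_no = F(5, 14)

# Numerators of Bayes' theorem (Pr[E] is the same for both classes)
score_yes = prod(likelihood_yes) * prior_yes
score_no = prod(likelihood_no) * prior_no
print(round(float(score_yes), 4), round(float(score_no), 4))  # 0.0053 0.0206

# Normalisation removes the unknown Pr[E]
total = score_yes + score_no
p_yes = float(score_yes / total)
p_no = float(score_no / total)
print(round(p_yes, 3), round(p_no, 3))  # 0.205 0.795
```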
Naive Bayes: What if Pr[E_i|yes] = 0?
Let's assume we do not have positive examples for outlook = rainy. Adding 1 to each count (Laplace estimator) avoids zero probabilities:
$\Pr[\text{sunny} \mid \text{yes}] = 4/9 \rightarrow 5/12$
$\Pr[\text{overcast} \mid \text{yes}] = 5/9 \rightarrow 6/12$
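The correction can be written as a small helper. This sketch assumes the counts 4, 5 and 0 for sunny, overcast and rainy among the 9 "yes" examples, as implied by the fractions above; the function name and the `alpha` parameter are illustrative.

```python
from fractions import Fraction

def laplace(counts, alpha=1):
    """Add `alpha` to every count before normalising (Laplace estimator)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: Fraction(count + alpha, total) for value, count in counts.items()}

# Counts of outlook values among the "yes" examples (rainy never occurs)
counts = {"sunny": 4, "overcast": 5, "rainy": 0}
smoothed = laplace(counts)
for value, probability in smoothed.items():
    print(value, probability)
```

With this estimator the unseen value gets a small non-zero probability instead of forcing the whole product to zero.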
k-Nearest Neighbours (k-NN)
- memory-based classifier
- no explicit training phase (lazy learner); still supervised, since the neighbours carry labels
- find the k neighbours closest to x and classify by majority vote
- all features should be normalized
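A k-NN classifier with min-max normalisation fits in a few lines. This is a sketch assuming Euclidean distance and invented two-attribute examples; other distance measures or normalisations are equally possible.

```python
import math
from collections import Counter

def fit_min_max(rows):
    """Return a function that scales each attribute to [0, 1] based on `rows`."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        return tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                     for v, lo, hi in zip(row, lows, highs))
    return scale

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training examples closest to `query`."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy examples (hypothetical): two attributes on very different scales
train_x = [(1.0, 100.0), (1.2, 120.0), (3.0, 900.0), (3.2, 950.0)]
train_y = ["background", "background", "signal", "signal"]

scale = fit_min_max(train_x)
scaled_train = [scale(row) for row in train_x]
prediction = knn_predict(scaled_train, train_y, scale((2.9, 880.0)), k=3)
print(prediction)  # signal
```

Without the scaling step the second attribute would dominate the distance simply because its numerical range is larger.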
Random Forests:
- ensemble of decision trees
- developed by Leo Breiman (2001)
- no boosting between individual trees
- events are classified by individual trees
- uses average for final classification
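The averaging step can be illustrated on its own. In this sketch the "trees" are hypothetical stand-ins (simple one-attribute cuts written as lambdas) rather than trained decision trees, and the attribute names are invented; only the aggregation of the individual votes is shown.

```python
def forest_score(trees, event):
    """Average the 0/1 votes of all trees for one event."""
    votes = [tree(event) for tree in trees]
    return sum(votes) / len(votes)

# Hypothetical stand-ins for trained trees: each returns 1 (signal) or 0 (background)
trees = [
    lambda e: int(e["energy"] > 2.0),
    lambda e: int(e["zenith"] > 1.6),
    lambda e: int(e["n_hits"] > 20),
    lambda e: int(e["energy"] > 3.5),
]

event = {"energy": 3.0, "zenith": 1.8, "n_hits": 35}
score = forest_score(trees, event)
print(score)  # 0.75
```

The resulting score lies between 0 and 1, so a cut on it trades signal efficiency against background rejection.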
Random Forests: Output
Boosting:
- uses an ensemble of weak classifiers (decision trees)
- weights are increased for misclassified events
- weighted vote is applied
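One concrete boosting scheme with these properties is AdaBoost. The sketch below uses one-dimensional decision stumps as weak classifiers and labels in {-1, +1}; the choice of AdaBoost, the stump learner, and the toy data are assumptions, since the slides do not specify which boosting variant was used.

```python
import math

def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            predictions = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, predictions, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Train an ensemble of weighted stumps; labels must be -1 or +1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified events get larger weights, correct ones smaller weights
        weights = [w * math.exp(-alpha * y * (polarity if x >= threshold else -polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    vote = sum(alpha * (polarity if x >= threshold else -polarity)
               for alpha, threshold, polarity in ensemble)
    return 1 if vote >= 0 else -1

# Toy data that no single stump can separate
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

Each stump alone misclassifies at least two of the six events, but the weighted vote of three stumps classifies all of them correctly.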
Cross validated predictions:
Cut     Nugen        Corsika     Sum
0.990   4817 ± 44    93 ± 38     4910 ± 58
0.992   4633 ± 43    80 ± 30     4633 ± 52
0.994   4414 ± 41    57 ± 30     4414 ± 51
0.996   4122 ± 32    49 ± 26     4122 ± 41
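The splitting behind a k-fold cross validation can be sketched as follows. The `evaluate` callback is a hypothetical placeholder for training a learner on the training indices and scoring it on the test indices; the fold construction and the fixed seed are choices made for this illustration.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train, test) index lists; every example is tested exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(n_examples, k, evaluate):
    """Return mean and standard deviation of `evaluate(train, test)` over the folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_examples, k)]
    return mean(scores), stdev(scores)

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```

Repeating the training on every fold gives both an average prediction and a spread across folds, without setting aside a separate test sample.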
Cross Validation for a limited number of examples?
Change the Scaling of the Corsika:
Cut     Nugen        Corsika     Sum          Data
0.990   4817 ± 44    93 ± 38     4910 ± 58    4988
0.992   4633 ± 43    80 ± 30     4633 ± 52    4757
0.994   4414 ± 41    57 ± 30     4414 ± 51    4476
0.996   4122 ± 32    49 ± 26     4122 ± 41    4134
0.998   3695 ± 46    18 ± 17     3695 ± 49    3638
k-means Clustering:
- pick the means at random
- calculate the distance of the examples to each mean
- assign each example to the cluster of the nearest mean
- recalculate the mean of each cluster
- reiterate until the means no longer change
Significantly faster than hierarchical clustering
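The k-means loop described above can be written compactly. This sketch assumes Euclidean distance, initial means drawn at random from the examples, and invented two-dimensional points; an empty cluster simply keeps its previous mean.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples of floats) into k clusters; returns (means, clusters)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # pick the initial means at random
    clusters = []
    for _ in range(max_iter):
        # Assign every example to the cluster of the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Recalculate the mean of each cluster
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:
            break  # means no longer change
        means = new_means
    return means, clusters

# Two well-separated toy groups (hypothetical values)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, clusters = k_means(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

Each iteration only needs the distances between the examples and the k means, which is why the method scales better than hierarchical clustering with its pairwise distances.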
Summary:
- IceCube is interesting for detailed studies in machine learning
- studies can be carried out using RapidMiner
- MRMR for Feature Selection
- Simple learners are good for benchmarks
- Cross Validation is good for you!