Monday Morning Data Mining

(1)

Fakultät Physik

Experimentelle Physik V

Monday Morning

Data Mining

Tim Ruhe

(2)

Outline:

- data mining
- IceCube

(7)

1. Find a representation of the data
2. Find a good algorithm

(8)

IceCube in a nutshell:

- completed in December 2010
- located at the geographic South Pole
- 5160 Digital Optical Modules on 86 strings
- instrumented volume of 1 km³
- subdetectors DeepCore and IceTop

(9)

IceCube in a nutshell:

- Detection principle: Cherenkov light

- Look for events of the form:

$$\nu + X \rightarrow e, \mu, \tau$$

(10)

IceCube: Scientific goals

- detection of astrophysical neutrinos
- atmospheric neutrino energy spectrum
- neutrino oscillations
- cosmic-ray (CR) anisotropy

(18)

Data Mining in IceCube:

- approx. 2600 reconstructed attributes
- data and Monte Carlo (MC) simulation do not necessarily agree
- signal/background ratio ~ $10^{-3}$

This makes IceCube interesting for studies within the scope of machine learning.

(20)

Make sure you understand your input:

Attributes can be:

- nominal: green, blue, red, yellow
- ordinal: cool, mild, hot (cool < mild < hot)
- numerical: 1, 2, 3, 4, ...

Labels can be:

- polynominal: red, green, yellow, blue
- binominal: signal, background

(21)

Data Preprocessing: Preselection of parameters

1. Check for consistency (data vs. signal MC vs. background MC)
2. Check for missing values (NaNs, infs). How to handle the NaNs? (see next slide)
3. Eliminate the "obvious" (azimuth angle, timing information, ...)
4. Eliminate highly correlated and constant parameters
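
Step 4 can be sketched in a few lines. This is a minimal illustration, not the procedure used in the analysis: the attribute names, the toy values, and the correlation threshold of 0.95 are all assumptions made for the example.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long, non-constant numeric columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def preselect(columns, max_corr=0.95):
    """Names of attributes kept after dropping constant and highly correlated ones."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue  # constant attribute carries no information
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # nearly duplicates an attribute that is already kept
        kept.append(name)
    return kept

# Hypothetical attributes, for illustration only.
columns = {
    "n_hits": [10.0, 20.0, 30.0, 40.0],
    "n_hits_scaled": [1.0, 2.0, 3.0, 4.0],  # perfectly correlated with n_hits
    "run_flag": [1.0, 1.0, 1.0, 1.0],       # constant
    "zenith": [0.3, 1.2, 0.7, 0.9],
}
print(preselect(columns))
```

Which member of a correlated pair survives depends on the column order here; a real preselection would probably want a more deliberate rule.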

(22)

Data and MC preprocessing: How to handle nans?

Several possibilities:

- Exclude attributes that exceed a certain number of NaNs
- Replace by:
  - minimum
  - maximum
  - average
  - nothing at all
  - (median ...)
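
The options listed above can be sketched as follows. The function names and the example column are hypothetical; the exclusion threshold is left to the caller because the slides do not state one.

```python
import math
from statistics import mean, median

def nan_fraction(values):
    """Fraction of entries in a column that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)

def replace_nans(values, strategy="average"):
    """Replace NaNs by the minimum, maximum, average or median of the valid entries."""
    if strategy == "nothing":
        return list(values)  # leave the NaNs in place
    valid = [v for v in values if not math.isnan(v)]
    fill_functions = {"minimum": min, "maximum": max, "average": mean, "median": median}
    fill = fill_functions[strategy](valid)
    return [fill if math.isnan(v) else v for v in values]

column = [1.0, float("nan"), 3.0, 8.0]
print(nan_fraction(column))             # share of missing values in this column
print(replace_nans(column, "average"))
print(replace_nans(column, "median"))
```

An attribute would be excluded when `nan_fraction` exceeds whatever limit the analysis chooses.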

(23)

Data and MC preprocessing: Feature Selection

1. Forward Selection

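- start with an empty selection
- add each unused attribute in turn
- estimate the performance for each added attribute
- the attribute giving the largest increase in performance is kept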
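
A minimal sketch of forward selection, assuming a caller-supplied `performance` function that scores a list of attributes (in practice this would be something like the cross-validated accuracy of a learner). The toy scoring function, the attribute names, and the stopping rule (stop when no candidate improves the score) are assumptions made for the example.

```python
def forward_selection(attributes, performance, max_features):
    """Greedily add the attribute that improves the estimated performance most."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        # Estimate the performance of the current selection plus each candidate.
        score, best = max((performance(selected + [a]), a) for a in candidates)
        if score <= best_score:
            break  # no candidate improves the performance any further
        selected.append(best)
        best_score = score
    return selected

# Toy performance estimate: per-attribute gain minus a small cost per attribute.
gain = {"energy": 0.5, "zenith": 0.3, "azimuth": 0.0, "n_hits": 0.1}

def toy_performance(subset):
    return sum(gain[a] for a in subset) - 0.05 * len(subset)

print(forward_selection(list(gain), toy_performance, max_features=4))
```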

(24)

Data and MC preprocessing: Feature Selection

2. Backward Elimination

- start with the full set of attributes
- remove each of the attributes in turn
- estimate the performance for each removed attribute
- the attribute giving the least decrease in performance is removed
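
Backward elimination can be sketched in the same spirit. Again `performance` is a caller-supplied scoring function, and the toy scores and attribute names are hypothetical. Stopping as soon as every removal lowers the score is one possible choice, not something the slides prescribe.

```python
def backward_elimination(attributes, performance, min_features=1):
    """Greedily remove the attribute whose removal costs the least performance."""
    selected = list(attributes)
    best_score = performance(selected)
    while len(selected) > min_features:
        # Estimate the performance with each attribute removed in turn.
        score, to_remove = max(
            (performance([a for a in selected if a != r]), r) for r in selected
        )
        if score < best_score:
            break  # every possible removal hurts, so stop here
        selected.remove(to_remove)
        best_score = score
    return selected

# Toy performance estimate: per-attribute gain minus a small cost per attribute.
gain = {"energy": 0.5, "zenith": 0.3, "azimuth": 0.0, "n_hits": 0.1}

def toy_performance(subset):
    return sum(gain[a] for a in subset) - 0.05 * len(subset)

print(backward_elimination(list(gain), toy_performance))
```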

(25)

Backward Elimination in RapidMiner:

(26)

Data and MC preprocessing: Feature Selection

3. Minimum Redundancy Maximum Relevance (MRMR)

iteratively add features with biggest relevance and least redundancy

Quality criterion Q:

$$Q = R(x, y) - \frac{1}{j} \sum_{x' \in F_j} D(x, x')$$
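
A minimal sketch of the iterative MRMR selection, assuming the quality criterion is relevance to the label minus the mean redundancy with the features already selected. Both relevance and redundancy are taken here as the absolute Pearson correlation, which is an assumption; the slides do not say which measures are used. The feature names and values are made up.

```python
import math

def abs_corr(a, b):
    """Absolute Pearson correlation of two non-constant numeric columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return abs(cov) / math.sqrt(var_a * var_b)

def mrmr(features, label, k):
    """Select k feature names (k <= number of features) by greedy MRMR."""
    selected = []
    while len(selected) < k:
        def quality(name):
            relevance = abs_corr(features[name], label)
            if not selected:
                return relevance  # first pick: relevance only
            redundancy = sum(abs_corr(features[name], features[s]) for s in selected)
            return relevance - redundancy / len(selected)
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=quality))
    return selected

label = [0, 0, 0, 1, 1, 1]
features = {
    "a": [1, 2, 3, 4, 5, 6],       # highly relevant
    "a_copy": [1, 2, 3, 4, 5, 7],  # relevant, but almost a duplicate of "a"
    "b": [0, 1, 0, 1, 0, 1],       # less relevant, but barely redundant
}
print(mrmr(features, label, 2))
```

In this toy example the near-duplicate is passed over in favour of the less relevant but less redundant feature, which is the behaviour the criterion is designed to produce.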

(27)

MRMR in RapidMiner:

(28)

Evaluating the Stability of the Parameter Selection:

- Data and MC are subject to a certain variance

(29)

Stability of the MRMR Selection:

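Jaccard index:

$$J = \frac{|A \cap B|}{|A \cup B|}$$

Kuncheva's index:

$$I_C(A, B) = \frac{rn - k^2}{k(n - k)}$$

with $r = |A \cap B|$, $k = |A| = |B|$, and $n$ the total number of attributes.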
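
Both stability indices are easy to compute for two selections A and B of equal size. The sketch below assumes the standard definitions, with r the size of the overlap, k the size of each selection, and n the total number of attributes. The attribute names and n = 10 are made up for the example.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n):
    """Kuncheva's index for two selections of equal size k out of n attributes."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n:
        raise ValueError("selections must have equal size k with 0 < k < n")
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))

# Two hypothetical selections from repeated runs on different subsamples.
run_a = ["energy", "zenith", "n_hits"]
run_b = ["energy", "zenith", "track_length"]
n_total = 10

print(jaccard(run_a, run_b))
print(kuncheva(run_a, run_b, n_total))
```

Unlike the Jaccard index, Kuncheva's index corrects for the overlap expected by chance, so two random selections score close to zero.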

(32)

Learners:

1. Decision Trees
2. Naive Bayes
3. k-Nearest Neighbours
4. Random Forests

(33)

A bit more technically speaking:

- set of vectors x = (x_1, x_2, ..., x_n); x_i = attribute (attributes = features, variables, parameters)
- labels y_1, y_2, ..., y_n; label = target class

(36)

Naive Bayes:

- based on Bayes' theorem:

$$\Pr[H \mid E] = \frac{\Pr[E \mid H] \times \Pr[H]}{\Pr[E]}$$

(38)

Naive Bayes:

Play?

outlook = sunny, temperature = cool,

humidity = high, windy = true

(39)

Naive Bayes:

$$\Pr[\text{yes} \mid E] = \frac{2/9 \times 3/9 \times 3/9 \times 3/9 \times 9/14}{\Pr[E]}$$

(40)

Naive Bayes:

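$$\Pr[\text{yes} \mid E] = \frac{2/9 \times 3/9 \times 3/9 \times 3/9 \times 9/14}{\Pr[E]}$$

Likelihoods before normalization by $\Pr[E]$:

$$\Pr[\text{yes} \mid E] \propto 0.0053, \qquad \Pr[\text{no} \mid E] \propto 0.0206$$

After normalization:

$$\Pr[\text{yes} \mid E] = 0.205, \qquad \Pr[\text{no} \mid E] = 0.795$$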
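
The arithmetic of this example can be checked with exact fractions. The factors for "yes" are the ones shown on the slide. The factors for "no" (3/5, 1/5, 4/5, 3/5 and the prior 5/14) are not shown there; they are assumed here from the standard textbook weather data set and reproduce the quoted 0.0206.

```python
from fractions import Fraction as F

# Pr[E_i | yes] for outlook, temperature, humidity, windy, times the prior Pr[yes].
likelihood_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)

# Assumed factors for "no" (not on the slide), times the prior Pr[no].
likelihood_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

# Pr[E] cancels once the two likelihoods are normalized to sum to one.
evidence = likelihood_yes + likelihood_no
p_yes = likelihood_yes / evidence
p_no = likelihood_no / evidence

print(round(float(likelihood_yes), 4), round(float(likelihood_no), 4))
print(round(float(p_yes), 3), round(float(p_no), 3))
```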

(41)

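Naive Bayes: What if $\Pr[E_i \mid \text{yes}] = 0$?

Let's assume we do not have positive examples for outlook = rainy:

$$\Pr[\text{sunny} \mid \text{yes}] = 4/9, \qquad \Pr[\text{overcast} \mid \text{yes}] = 5/9$$

After adding one to the count of each outlook value (Laplace estimator):

$$\Pr[\text{sunny} \mid \text{yes}] = 5/12, \qquad \Pr[\text{overcast} \mid \text{yes}] = 6/12$$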
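
The numbers 4/9 to 5/12 and 5/9 to 6/12 are consistent with adding one to the count of each of the three outlook values, which is commonly called the Laplace estimator. A minimal sketch under that assumption, with a hypothetical function name:

```python
from fractions import Fraction

def laplace_estimate(counts):
    """Add one to every count so that no conditional probability is exactly zero."""
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(count + 1, total) for value, count in counts.items()}

# Outlook counts among the "yes" examples, with no positive example for rainy.
counts_yes = {"sunny": 4, "overcast": 5, "rainy": 0}
smoothed = laplace_estimate(counts_yes)
print(smoothed)
```

Rainy now gets probability 1/12 instead of 0, so a single unseen attribute value no longer forces the whole product to zero.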

(42)

k-Nearest Neighbours (k-NN)

- memory based classifier

- lazy learner: no explicit training phase, but labelled examples are still required (supervised)

- find the k neighbours closest to x and classify by majority vote

- all features should be normalized
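
A minimal k-NN sketch, including the normalization step. Min-max scaling and Euclidean distance are assumptions; other normalizations and metrics are equally possible. The two features and their values are made up.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Return a function that min-max scales a row using the training ranges."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # avoid division by zero
    def scale(row):
        return [(v - lo) / span for v, lo, span in zip(row, lows, spans)]
    return scale

def knn_classify(train_x, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical examples with two features on very different scales.
rows = [(1.0, 100.0), (1.2, 120.0), (0.9, 90.0), (3.0, 900.0), (3.2, 950.0), (2.9, 880.0)]
labels = ["background"] * 3 + ["signal"] * 3

scale = fit_scaler(rows)
train_x = [scale(r) for r in rows]
print(knn_classify(train_x, labels, scale((3.1, 910.0))))
```

Without the scaling, the second feature would dominate the distance simply because its numerical range is larger.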

(43)

Random Forests:

- ensemble of decision trees

- developed by Leo Breiman (2001)
- no boosting between individual trees
- events are classified by individual trees
- uses average for final classification
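
The final averaging step can be illustrated without training anything. In the sketch below the "trees" are hand-written stand-ins that return 1 (signal) or 0 (background); in a real forest each tree would be trained on its own random sample of examples and attributes. All names and thresholds are hypothetical.

```python
def forest_score(trees, event):
    """Average of the individual tree votes, a number between 0 and 1."""
    votes = [tree(event) for tree in trees]
    return sum(votes) / len(votes)

# Hand-written stand-ins for trained decision trees.
trees = [
    lambda e: 1 if e["zenith"] > 1.6 else 0,
    lambda e: 1 if e["n_hits"] < 50 else 0,
    lambda e: 1 if e["zenith"] > 1.8 and e["n_hits"] < 80 else 0,
    lambda e: 1 if e["energy"] > 3.0 else 0,
]

event = {"zenith": 2.0, "n_hits": 40, "energy": 2.5}
print(forest_score(trees, event))
```

A cut on this averaged score then decides which events are kept as signal.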

(44)

Random Forests: Output

(47)

Boosting:

- uses an ensemble of weak classifiers (decision trees)
- weights are increased for misclassified events
- weighted vote is applied
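
The slides do not name a specific boosting algorithm. The sketch below uses an AdaBoost-style update as one common example, with labels and predictions encoded as +1 and -1. It assumes the weights sum to one and that the weighted error lies strictly between 0 and 1.

```python
import math

def boost_round(weights, predictions, labels):
    """One AdaBoost-style round: classifier weight and re-weighted events."""
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified events gain weight, correctly classified events lose weight.
    updated = [
        w * math.exp(alpha if p != y else -alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

def weighted_vote(alphas, votes):
    """Final classification: sign of the alpha-weighted sum of the votes."""
    return 1 if sum(a * v for a, v in zip(alphas, votes)) >= 0 else -1

labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]  # the second event is misclassified
alpha, new_weights = boost_round([0.25] * 4, predictions, labels)
print(alpha, new_weights)
print(weighted_vote([0.55, 0.3, 0.2], [1, -1, -1]))
```

After this round the single misclassified event carries half of the total weight, so the next weak classifier is pushed to get it right.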

(55)

Cross validated predictions:

Cut     Nugen       Corsika    Sum
0.990   4817 ± 44   93 ± 38    4910 ± 58
0.992   4633 ± 43   80 ± 30    4633 ± 52
0.994   4414 ± 41   57 ± 30    4414 ± 51
0.996   4122 ± 32   49 ± 26    4122 ± 41
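
A minimal sketch of the index bookkeeping behind a k-fold cross validation. It only shows how examples are divided into training and test sets; how the quoted numbers and uncertainties were actually obtained is not stated on the slides. The function name and the interleaved fold assignment are assumptions.

```python
def k_fold_indices(n_examples, k):
    """Yield (train, test) index lists; every example is tested exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in k_fold_indices(10, 5):
    # A learner would be trained on `train` and evaluated on `test` here.
    print(test, len(train))
```

Repeating the evaluation over all folds gives one result per fold, from which a mean and a spread can be quoted.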

(56)

Cross Validation for a limited number of examples?

(58)

Change the Scaling of the Corsika:

(60)

Cut     Nugen       Corsika    Sum         Data
0.990   4817 ± 44   93 ± 38    4910 ± 58   4988
0.992   4633 ± 43   80 ± 30    4633 ± 52   4757
0.994   4414 ± 41   57 ± 30    4414 ± 51   4476
0.996   4122 ± 32   49 ± 26    4122 ± 41   4134
0.998   3695 ± 46   18 ± 17    3695 ± 49   3638

(64)

k-means Clustering:

- pick the initial means at random
- calculate the distance of the examples to each mean
- assign each example to the cluster of its nearest mean
- recalculate the mean of each cluster
- iterate until the means no longer change

Significantly faster than hierarchical clustering
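
The k-means steps listed above translate almost line by line into code. This is a sketch under simple assumptions: Euclidean distance, initial means drawn from the examples themselves, and an empty cluster keeping its previous mean. The optional `initial_means` argument exists only to make the example reproducible.

```python
import math
import random

def k_means(points, k, initial_means=None, seed=0, max_iter=100):
    """Return (means, clusters) after iterating until the means stop changing."""
    if initial_means is not None:
        means = [tuple(m) for m in initial_means]
    else:
        means = random.Random(seed).sample(points, k)  # pick means at random
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        new_means = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:
            break  # converged: the means did not change
        means = new_means
    return means, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = k_means(points, 2, initial_means=[(0, 0), (0, 1)])
print(means)
```

Even with both initial means placed in the same group, the iteration separates the two groups in this example. In general the result depends on the starting means, so several random restarts are common.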

(66)

Summary:

- IceCube is interesting for detailed studies in machine learning

- studies can be carried out using RapidMiner
- MRMR for feature selection
- simple learners are good for benchmarks
- cross validation is good for you!
