Data analysis in Particle Physics
$ whoami
• Lukasz (Luke) Kreczko – Particle Physicist
• Graduated in Physics from the University of Hamburg in 2009
• 2009 – 2013 PhD in Particle Physics at the University of Bristol
• Currently Computing Research Assistant at the
Outline
• Data taking at the Compact Muon Solenoid (CMS) experiment
• Data format (and distribution)
• Data analysis procedure
What is CERN?
• Conseil Européen pour la Recherche Nucléaire – aka European Laboratory for Particle Physics
• Between Geneva and the Jura mountains, straddling the Swiss-French border
• Founded in 1954 by an international treaty
• Our business is fundamental particles and how our universe works
– What is the origin of mass? We are a step closer with the Higgs!
– What is 96% of the universe made of? We only see 4%!
– Why isn’t there anti-matter in the universe?
Large Hadron Collider
Mankind’s biggest machine (27 km circumference)
Hotter than the centre of the sun: collisions are 100,000 times hotter
Colder than deep space: (super) liquid helium cooling at 1.9 K (-271 °C)
The experiment: a big digital camera
40 million “pictures” per second
Each “picture” around 1 MB!
The data: a structured mess
What do we do?
Experiment
Local computing farm
CERN data centre
Globally distributed data centres
My computer
Paper
The experiment - CMS
• Input from LHC
• 40 million collisions per second
• 40 terabytes per second
• Hardware trigger (L1)
• Low resolution
• Makes decision in 3 microseconds
• Reduces output to 100
High Level Trigger
• Input from experiment
• 100,000 collisions per second
• Software trigger (HLT)
• “poor man’s” reconstruction
• High resolution
• Writes around 700 Hz (700 MB/s) in ROOT data format
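The trigger numbers quoted on these slides can be cross-checked with a few lines of arithmetic. The sketch below assumes a flat event size of 1 MB (the "around 1 MB" figure for one "picture") and takes the L1 output rate to be the 100,000 collisions per second listed as the HLT input; the function and variable names are made up for illustration.

```python
# Rates quoted on the slides; the event size is the "around 1 MB" figure.
EVENT_SIZE_MB = 1.0

# (stage name, events per second leaving that stage)
stages = [
    ("LHC collisions", 40_000_000),
    ("L1 hardware trigger", 100_000),
    ("High Level Trigger", 700),
]


def bandwidth_mb_per_s(rate_hz, event_size_mb=EVENT_SIZE_MB):
    """Data rate in MB/s for a given event rate."""
    return rate_hz * event_size_mb


def rejection_factors(stages):
    """Factor by which each stage reduces the rate of the previous one."""
    return [
        (name, prev_rate / rate)
        for (_, prev_rate), (name, rate) in zip(stages, stages[1:])
    ]


for name, rate in stages:
    print(f"{name:22s} {rate:>12,d} Hz {bandwidth_mb_per_s(rate):>14,.0f} MB/s")
for name, factor in rejection_factors(stages):
    print(f"{name} keeps roughly 1 event in {factor:,.0f}")
```

With these inputs the first stage corresponds to 40 TB/s and the last to 700 MB/s, matching the figures on the slides.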
(Event) Reconstruction
• http://en.wikipedia.org/wiki/Event_reconstruction
• Reading the detector information and bundling it into particles
– Detector response from different detector regions helps to identify particles
– In addition, algorithms look for specific particle behaviour (e.g. a b-quark travels half a millimetre before decaying) and use it to identify particles
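The idea of combining responses from different detector regions can be sketched as a small rule table. This is a heavily simplified toy, not the CMS reconstruction: the Candidate fields, the rules and the 0.1 mm decay-length threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    has_track: bool      # signal in the tracker (charged particle)
    ecal_energy: float   # GeV in the electromagnetic calorimeter
    hcal_energy: float   # GeV in the hadronic calorimeter
    muon_hits: int       # hits in the outer muon chambers


def identify(c: Candidate) -> str:
    """Toy particle identification from which sub-detectors responded."""
    if c.has_track and c.muon_hits > 0:
        return "muon"
    if c.ecal_energy > 0 and c.hcal_energy == 0:
        return "electron" if c.has_track else "photon"
    if c.hcal_energy > 0:
        return "charged hadron" if c.has_track else "neutral hadron"
    return "unknown"


def looks_like_b_jet(decay_length_mm: float, min_length_mm: float = 0.1) -> bool:
    """Flag a jet whose decay vertex is displaced from the collision point.

    The threshold is a hypothetical value, not a CMS working point.
    """
    return decay_length_mm >= min_length_mm


muon = Candidate(has_track=True, ecal_energy=0.3, hcal_energy=0.5, muon_hits=4)
photon = Candidate(has_track=False, ecal_energy=60.0, hcal_energy=0.0, muon_hits=0)
print(identify(muon), identify(photon), looks_like_b_jet(0.5))
```

Real algorithms work with uncertainties and likelihoods rather than hard yes/no rules; the sketch only shows the principle.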
ROOT
• ROOT (http://root.cern.ch, http://root.cern.ch/git/root.git)
• Developed in 1995
• ROOT is a lot of things: http://root.cern.ch/drupal/content/about
• Most used features (subjective):
ROOT
• Also has a C/C++ interpreter (CINT)
• Blessing and curse
– ask any student which one is more accurate
• 177 PB of LHC data stored in ROOT format
• “ROOT – The Next Generation”: https://indico.cern.ch/conferenceTimeTable.py?
ROOT data format
• http://root.cern.ch/drupal/content/root-files-1
• Binary storage for C++ objects
– Serialisation via TObject class
– Supports partial reads (i.e. subset of objects)
– Objects grouped by event (e.g. file.GetEvent(10).electron.at(0).energy())
– Supports read-ahead (tuneable parameter for
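Why partial reads are cheap can be illustrated with a toy store that serialises each kind of object into its own buffer. This is not the ROOT file format and uses no ROOT API; the helper names (write_events, read_branches), the branch names and the numbers are invented, and pickle stands in for ROOT's own serialisation.

```python
import pickle


def write_events(events, branch_names):
    """Serialise each branch (one kind of object) into its own buffer."""
    buffers = {}
    for name in branch_names:
        column = [event[name] for event in events]
        buffers[name] = pickle.dumps(column)
    return buffers


def read_branches(buffers, wanted):
    """Deserialise only the requested branches, regrouped by event."""
    columns = {name: pickle.loads(buffers[name]) for name in wanted}
    n_events = len(next(iter(columns.values())))
    return [{name: columns[name][i] for name in wanted} for i in range(n_events)]


# Made-up events: lists of electron energies and jet momenta in GeV.
events = [
    {"electron_energy": [45.2, 12.1], "jet_pt": [89.0, 85.3]},
    {"electron_energy": [], "jet_pt": [90.5]},
    {"electron_energy": [33.0], "jet_pt": [84.1, 20.0, 15.5]},
]
buffers = write_events(events, ["electron_energy", "jet_pt"])

# Only the electron buffer is decoded; the jet buffer is never touched.
partial = read_branches(buffers, ["electron_energy"])
print(partial[0]["electron_energy"][0])
```

The access pattern in the last line mirrors the event-first style of file.GetEvent(10).electron.at(0).energy(): pick an event, then an object, then a property.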
CERN T0 – data reconstruction
• Input:
• 300-350 collisions per second
• Rest is done when machine is shut down
• Reconstruction
• Connecting the dots
Analysing all data
• CMS records 10,000 terabytes of data every year (around 70 years of full HD movies)
• + same amount of simulation
• To analyse this on a single computer would
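The last bullet is cut off in this copy of the slides, so the figure it quoted is not known. As a rough feel for the scale, the sketch below estimates how long one pass over a year of data plus simulation would take. The 100 MB/s sustained read rate and the 1000-worker farm are assumptions, not numbers from the talk.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def processing_time_years(volume_tb, read_mb_per_s, n_workers=1):
    """Years needed to read volume_tb once at the given per-worker rate."""
    volume_mb = volume_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    seconds = volume_mb / (read_mb_per_s * n_workers)
    return seconds / SECONDS_PER_YEAR


DATA_TB = 10_000             # from the slide: recorded per year
SIMULATION_TB = 10_000       # from the slide: "same amount of simulation"
ASSUMED_READ_MB_PER_S = 100  # assumption, not from the slides

total_tb = DATA_TB + SIMULATION_TB
single = processing_time_years(total_tb, ASSUMED_READ_MB_PER_S)
farm = processing_time_years(total_tb, ASSUMED_READ_MB_PER_S, n_workers=1000)
print(f"one machine: {single:.1f} years")
print(f"1000 workers: {farm * 365.25:.1f} days")
```

Under these assumptions a single machine needs several years just to read the data once, which is the motivation for spreading the work over many machines.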
The LHC grid
• Distributing on a global scale
• This is where the analysis
The data: a much nicer picture
[Figure: CMS event display, Run 163583, Event 26579562. Muon: pT = 71.5 GeV/c, η = -0.82. Missing ET: 22.3 GeV. Jets: pT = 89.0 GeV/c, η = 2.14; pT = 85.3 GeV/c, η = 2.02; pT = 90.5 GeV/c, η = -1.40; pT = 84.1 GeV/c, η = -2.24. m(F) = 1.2 TeV/c^2]
The goal: extend our knowledge
Billions of [events like the one above] + simulation
[Figure: CMS diphoton invariant mass spectrum. x-axis: m_γγ (GeV), 110 to 150. y-axis: S/(S+B) weighted events / 1.5 GeV, 0 to 1500. Legend: Data, S+B Fit, B Fit Component, ±1σ, ±2σ. √s = 7 TeV, L = 5.1 fb^-1; √s = 8 TeV, L = 5.3 fb^-1. Inset: unweighted events / 1.5 GeV]
Analysis
Data preparation
Data reduction
Event selection
Histogramming
Corrections: applying the newest knowledge about the experiment
Filtering: we know more or less what we are looking for
Selection: very refined selection to increase signal purity (usually a tiny effect compared to backgrounds)
Analysis: apply algorithms (produce derived data)
Histograms: data reduction
Rinse & repeat
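The chain of corrections, filtering, selection and histogramming described above can be sketched as a sequence of small generator stages. Everything specific here is an assumption for illustration: the event fields (mass, n_photons), the 1.02 calibration factor, the cut values and the 10 GeV bin width are made up and are not the analysis shown in the plot.

```python
def correct(events, energy_scale=1.02):
    """Corrections: apply a (hypothetical) calibration factor."""
    for e in events:
        yield {**e, "mass": e["mass"] * energy_scale}


def loose_filter(events, min_photons=2):
    """Filtering: keep only events that could contain what we look for."""
    for e in events:
        if e["n_photons"] >= min_photons:
            yield e


def select(events, low=110.0, high=150.0):
    """Selection: a tighter cut on the quantity of interest."""
    for e in events:
        if low <= e["mass"] < high:
            yield e


def histogram(events, low=110.0, high=150.0, bin_width=10.0):
    """Histograms: reduce the surviving events to a few bin counts."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for e in events:
        index = int((e["mass"] - low) // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Made-up input events (masses in GeV).
raw_events = [
    {"mass": 100.0, "n_photons": 2},
    {"mass": 120.0, "n_photons": 2},
    {"mass": 124.0, "n_photons": 2},
    {"mass": 126.0, "n_photons": 1},
    {"mass": 160.0, "n_photons": 2},
    {"mass": 140.0, "n_photons": 3},
]
counts = histogram(select(loose_filter(correct(raw_events))))
print(counts)
```

Because each stage is a generator, events stream through one at a time, and "rinse & repeat" amounts to re-running the chain with a different calibration or different cuts.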
Analysis in Big data terms
Data preparation
Data reduction
Event selection
Histogramming
MAP
MAP
REDUCE
REDUCE
LHC Grid
Usually local site
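The map/reduce reading of the analysis can be sketched as follows. Which step carries which label is not recoverable from the slide text, so this is only one possible reading, with made-up masses and an arbitrary bin width: each chunk of events (for example, one file processed on one grid node) is mapped to a partial histogram, and the partial histograms are then reduced by adding them bin by bin.

```python
from collections import Counter
from functools import reduce

BIN_WIDTH = 10.0  # GeV, arbitrary choice for the illustration


def map_chunk(masses):
    """MAP: turn one chunk of events into a partial histogram."""
    return Counter(int(m // BIN_WIDTH) * BIN_WIDTH for m in masses)


def merge(left, right):
    """REDUCE: add two partial histograms bin by bin."""
    return left + right


# Made-up masses in GeV, split into three chunks as if spread over three jobs.
chunks = [
    [112.0, 125.3, 126.1],
    [124.8, 131.0],
    [125.9, 147.2, 118.4],
]
partial_histograms = map(map_chunk, chunks)
total = reduce(merge, partial_histograms, Counter())

for bin_start in sorted(total):
    print(f"{bin_start:6.1f} GeV: {total[bin_start]}")
```

The map step needs no communication between chunks, so it scales out across sites. The merge is cheap because histograms are small, which fits the note that the final steps usually run on a local site.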
Summary
• The data from the experiments are reduced before storing them to disk/tape
• All data is stored in ROOT format: either as classes or as basic data types
• Heavy workflows are performed on the LHC grid, frequent and fast work usually on local servers
• The final result is a histogram (or table) and is a huge reduction step from the input (20 PB -> 100 MB)
Questions?
Thank you for listening.
ROOT and CMS
• https://indico.cern.ch/getFile.py/access?contribId=16&resId=0&materialId=slides&confId=217511
What do we do?
Experiment
Local computing farm
CERN data centre
Globally distributed data centres
My computer
Paper
online
offline