CERN IT Department CH-1211 Genève 23 Switzerland
www.cern.ch/it
HEP computing and
Grid computing
&
Big Data
Massimo Lamanna/CERN
IT department - Data Storage Services group
May 11
th2014
CERN IT Department CH-1211 Genève 23 Switzerland
www.cern.ch/it
My objectives for today!
Why High-Energy Physics (HEP) needs computing?
Computing: CPU cycles and Data Storage
How can we deliver enough computing to HEP?
Lots of CPU, disks and tapes!
Any relevance beyond HEP?
Who is CERN IT? Staff, fellows and students!
And of coursetrying to answer all your questions!
What are all these computers/disks/networks... for?
3
Theory and
(Detector) simulation
Data analysis and
Results interpretation
Data acquisition and
From Physics to Raw Data
4
Basic physics Fragmentation,
Decay Interaction with detector material Multiple scattering, interactions Detector response Noise, pile-up, cross-talk, inefficiency, ambiguity, resolution, response function, alignment, temperature 2037 2446 1733 1699 4003 3611 952 1328 2132 1870 2093 3271 4732 1102 2491 3216 2421 1211 2319 2133 3451 1942 1121 3429 3742 1288 2343 7142 Raw data (Bytes) Read-out addresses, ADC, TDC values, Bit patterns
e+,q
e-,q
f
f
Z0
_
_
From Raw Data to Physics
5
e+,q
e-,q
f
f
Z0
Basic physics Results Fragmentation, Decay Physics analysis Interaction with detector material Pattern, recognition, Particle identification Detector response apply calibration, alignment, 2037 2446 1733 1699 4003 3611 952 1328 2132 1870 2093 3271 4732 1102 2491 3216 2421 1211 2319 2133 3451 1942 1121 3429 3742 1288 2343 7142 Raw data Convert to physics quantitiesReconstruction
Simulation (Monte-Carlo)
Analysis
_
_
(Innovation in) Computing in High-Energy Physics
•
Demanding science
Demanding computing
•
Power usage
•
Innovation
–
Web invention
–
Grid computing (LHC Computing Grid)
7
Invented by/in/for the HEP community
Now at the hearth of our lives
WLCG (Worldwide LHC Computing Grid)
• The Worldwide LHC Computing Grid (WLCG) collaboration was set up in 2002 for the Large Hadron Collider (LHC) physics programme at CERN. It links up national and international grid infrastructures.
• WLCG mission is to provide global computing resources to store, distribute and analyse the ~25 Petabytes (25 million Gigabytes) of data annually generated by the Large Hadron Collider (LHC) at CERN
• Challenging points:
– Large data volumes
– Large computing needs (simulation and data reconstruction and analysis)
– Very large community (including large number of computer centres and other institutes not necessary part of a LHC experiment)
– Aiming to total:CERN resource ratio of the order of 5:1 or more (20%)
• Historical notes:
– Computing activities in the late 1990 (e.g. Monarc project: http://monarc.web.cern.ch/MONARC/ )
– “Grid” concepts (I. Foster and K. Kesselman) matched with our approach
– Distributed approach “matches” our distributed user communities
– Participation to multiscience Grid projects (notably EGEE 2004-2010)
• WLCG now operates a distributed infrastructure with 200+ major cooperating computer centres worldwide and provides the foundation for all data processing activities. WLCG delivers access to LHC data and computing power for extensive data analysis to virtually every LHC physicists.
• Several tens of PB per year of data files are collected and distributed: experimental data (collected at rates up to several GB/s) supplemented by derived data (processing and filtering) and simulation data (generated on the same infrastructure). Depending on data type and usage, files are replicated on disk farms and tape repositories to ensure
Typical data rates 100-1000 MB/s aver multiple 10 Gb fibres.
Large (and final) filtering takes place at the experiment (trigger filter farms). All data are then kept (RAW data)
Bd 513 9 AMS ALICE ATLAS LHCb CMS COMPASS
10
FR-CCIN2P3
DE-KIT
NL-T1
ES-PIC
IT-INFN-CNAF
NDGF
US-T1-BNL
CA-TRIUMF
TW-ASGC
UK-T1-RAL
12 Tier1SK-KISTI
US-FNAL
Intercontinental links
(
data and jobs
)
AP area: Taipei (Tier1), Tokyo, Beijing, Seul, Melbourne, Mumbai, …
North America: BNL and Fermilab (US Tier1), Victoria (Canada Tier1) and
many Tier2 like Stanford, MIT, Wisconsin, Argonne, …
South America: several Tier2s Africa: few sites as the South
Africa Tier2s
•
Data distribution (e.g. CERN distributes RAW data to Tier1s)
•
12 centres on-line
•
Tier1s being built in Moscow area (@Dubna and @Kurchatov Institute)
•
(Large) processing campaigns on preplaced data (e.g. Reprocessing)
•
Download small sample for local analysis? … but
•
Scale out with grid jobs (user executable dispatched where data are)
•
Using file parallelism (data set
list of files
list of independent jobs)
•
Federating storages
Grid at work (computing power)
CPU contributions across the worldwide
LHC Computing Grid
See next slides: ~80,000 cores
CMS processing: wall
clock consumption
•
Tier0 and Tier1 processing
–
Top/Bottom: particle
collisions off/o
–
Sizeable even if no data
taking: continuous
reprocessing
•
Reconstruction activities
–
RAW
reconstructed
objects
–
Organised processing
–
Output for physicists
analysis
•
They can access RAW data if
needed
•
Final analysis more efficient
on the files containing the
“reconstructed objects”
The grid never sleeps…
Data on the Grid (CMS experiment)
Much more data than
RAW!
•
Primary data from
collisions
•
+ their replicas
•
Different levels of
reconstructed and
filtered data sets
•
+ their replicas
70 PB
LHC run 1 (2012) data (transfers)
•
Data taking is part of the full
picture
•
Primary data
•
Replication, multi-step
analysis and simulation
“generates” more data
•
And network traffic (data
placement)
Data handling is an operational
challenge
Source destination correlation matrix
Analysis flow (user view)
18
Raw data
(Exp. Data)
Production System Production System User analysisSimulation
Data selection (quality, type, config…)But how this is done
in practice
? Of course we need CPUs, disks, networks etc..
The problem is orchestrating hardware resources, software and humans :)
“All” data are stored in files (aggregated as “datasets” = collections of files). Only a small
fraction of data in real DBs (e.g calibrations). This is one characteristics of HEP computing.
19
Central services
Regional services
Tier2/3 (data on disk + CPU)
Execution site
2011 data; quality; run conditions… to be analyse it with myprog.exe [f1,f2,…fn]
Federation/location: fjis in [site1,site2,…sitem] Executed myprog.exe(/storage/data/2011/run4/fj)
In general is from [site1,site2,…sitem] but not always(flexibility, robustness, efficiency)
Toy model of user analysis:
20
CMS user analysis
as May the 11th
Close collaboration with CERN IT
COMPASS was the first/among the first to:
●
Migration to Linux PC farms
●
CHEP2000
●
CCF at CERN
●
ACID in Trieste
●
Use of Objectivity/DB at the 10
2TB scale
●
First production user of CASTOR
●
Deliver and operate CORAL (C++ framework)
for data processing
●
And the algorithmic parts, notably the
RICH reconstruction
CCF at CERN
ACID in Trieste (INFN)
Still in use
Still in use
Eventually decommissioned
Design
Value (35 MB/s)
5thof May 2014 CASTOR statistics (tape files only)
Total size: 87.337 PB
Total number of files: 262.601 M Top-6 size (by group)
30.436 PB with group ATLAS (id 1307) 17.266 PB with group CMS (id 1399) 13.888 PB with group COMPASS (id 1665)
9.616 PB with group ALICE (id 1395) 6.585 PB with group LHCb (id 1470) 1.839 PB with group zf (id 1018)
23
CERN Computer Centre
CERN computer centre:
• Built in the 70s on the CERN site (Meyrin-Geneva) • ~3000 m2(3 main machine rooms)
• 3.5 MW for equipment • Est. PUE ~ 1.6
New extension:
• Located at Wigner (Budapest) • ~1000 m2
• 2.7 MW for equipment
• Connected to the Geneva CC with 2x100Gb links (21 and 24 ms RTT)
• Another room ~20 ms away…
CASTOR and EOS
CASTOR and EOS are using the same commodity HW
•
RAID-1 for CASTOR
•
2 copies in the mirror (same box, multiple copies possible)
•
JBOD with RAIN for EOS
•
enforce replicas to be on different disk servers
•
tunable number (usually 2 replicas)
•
Erasure coding
•
“Arbitrary” configurable durability
•
Disaster recovery: cross-site replication
CASTOR and EOS:
developed at CERN
(IT department)
Workhorses for
physics data
(archival,
reconstruction,
analysis and
distribution)
2526
Data durability
CERN
Tier1s
Simplified schema: 1 copy + 1 “distributed” copy (2 tape copies)
Scalability + data durability (disaster recovery)
27
Data durability (2)
MGM
NS + Stager
D
t ~ several 10
3s
D
t ~ several 10
3s
Single disk failure
: probability p;
healing time:
D
t
Double failure
: p
2effect; overlapping
failure during healing time
D
t;
∩
(disk
x,disk
y)
Single disk failure
: probability p;
Double failure
: p
2effect; overlapping
failure during healing time
D
t;
∩
(disk
x,disk
y)
Data & Storage Services
Total installed capacity: 25PB ~900 disk servers
Total installed disk capacity: 13.0PB ~600 disk servers
Mainly used as disk front-end
(includes LHC data repository)
Disk-only data (analysis)
CERN data management (physics data)
Unique function across LCG:
• Primary repository for LHC data (raw)
• Source of prompt reconstruction
• Distribution point for LHC data (raw)
28
“Low” numbers (maintenance due to LHC technical stop)
Availability and I/O performances
Data taken driven
Analysis driven
pp 2012
pA 2013
29
HW types
Different generations of PC boxes with 20-200 TB disks space
Multiple copy
• Geolocalisation
• 1 in Meyrin (CERN Geneva), 1 in
Budapest (Wigner)
• Whenever possible
Multiple MGM
deployment
• Complex deployment to guarantee low
latency
• Xroot protocol: redirect clients
• Single master MGM
• Most operations can be done on R/O
MGM
• Transparent redirection
CERN Disk/Tape Storage Management @ storage-day.ch
CASTOR (tape archive)
31
Data:
•
~90 PB of data on tape; 250 M files
•
Up to 4.5 PB new data per month
•
Over 10GB/s (R+W) peaks
Infrastructure:
•
~ 52K tapes (1TB, 4TB, 5TB, 8TB)
•
9 Robotic libraries (IBM and Oracle)
Other main services:
•
TSM
Internal/beta services:
•
Hadoop (*)
•
CERNBOX (cloud sync)
(*) Initiated by a Technical Student (S. Russo) from TS+UD uni (ATLAS experiment)
CERNBOX to enter beta soon
CERNBOX provides a cloud synchronisation service for all CERN users between personal devices (laptops, smartphone, tablets) and a centrally-managed data storage.
Coherent data handling (with our other services): SLA, ACL, curation... CERNBOX is built on top of the ownCloud software. CERNBOX
supports access via web browsers, desktop clients (Linux, Mac and Windows) and mobile-device applications (Android and iOS).
http://owncloud.com
Ceph Architecture and Use-Cases
• Open-software storage
solution
• Scalable from 10s to
10,000s of machines,
from Terabytes to
Exabytes (10
3PB)
• Fault tolerant with no
SPOF, uses commodity
hardware
• Self-managing,
self-healing
Deploying a 3-PB prototype
• Fully puppetized deployment
• Automated machine
commissioning and maintenance
• Add a server to the hostgroup (osd, mon, radosgw)
• OSD disks are detected, formatted, prepared, auth’d
• Also after disk replacement
• Auto-generated ceph.conf
• Last step is manual/controlled: service ceph start
• Mcollective for bulk operations on
the servers
• Ceph rpm upgrades • daemon restarts
# df -h /mnt/ceph
Filesystem Size Used Avail Use% Mounted on
Beyond HEP
Ground floor exposition
Sharing experience with other activities (e.g. Biology)
Planning strategic partnership with other sciences (HelixNebula)
Collaboration with technology providers (OpenLab)
The 3 Vs in HEP
Volume
• For sure at somewhere between 10 and 100 PB you are getting big
Variety
• # of different technical solutions, operational procedures, users and workflows?
• # of different types (>3? >30?)
Velocity
• Dvolume/Dt in 1 PB/month (disk installation) • Last 2 years of EOS@CERN
• Dvolume/Dt in 4 PB/month (recording, processing and redistributing)
• LHC running (end of pp Run1) But do not forget:
• Max time you can be down! • Max integrated time in a month! • # of human interventions per day!
...and we want to read high volume of data at high velocity in a variety of ways
User program
Output data
∀ event in 2012 CMS AOD …
Our “Big Data”
Physics Data on CASTOR/EOS
●
LHC experiments produce ~10GB/s 25PB/year
User Data on AFS & DFS
●
Home directories for 30k users
●
Physics analysis development
●
Project spaces for applications
Service Data on AFS/NFS/Filers
●
Databases, admin applications
Tape archival with CASTOR/TSM
●
Physics data (RAW and derived/additional)
●
Desktop/Server backups
Data growth
●
2 PB/month over the last 2 years (raw disk
space)
●
4 PB/month new data at the end of LHC phase 1
● Factor of 2 more at the beginning of phase 2?
Service Size Files
AFS 290TB 2.3B
CASTOR 89.0PB 325M
Staff, fellows and students!
Xav ie r Co rt ad a (w ith th e p arti ci pati o n o f ph ysic ist P et e Ma rko w itz ), "I n s earc h o f th e H igg s bo so n: H -> ZZ" , d ig ital art , 2013. 40•
Staff, fellow and students
– cern.ch/jobs– Many opportunities for students!
• Physicists & Engineers (incl. Computer science) – Challenging but exciting!
• Summer Student (Jan/Feb 8-12 weeks in summer)
• OpenLab Summer Students (Jan/Feb 8-12 weeks in summer)
• Technical Students (2 times per year 6-12 months)
But finally why so much data?
Ldt
Luminosity
It is a measurement of the potential statistics accumulated in the collisions over time (here it accounts for the 2011 and 2012 data taking)
Think to the exposure time of a camera when taking pictures Events with 2 Zs decaying in 2-lepton pairs (here e+e-and m+m- )
But finally why so much data?
Ldt
Luminosity
It is a measurement of the potential statistics accumulated in the collisions over time (here it accounts for the 2011 and 2012 data taking)
Think to the exposure time of a camera when taking pictures Events with 2 Zs decaying in 2-lepton pairs (here e+e-and m+m- )