HEP computing and Grid computing & Big Data

(1)

CERN IT Department CH-1211 Genève 23 Switzerland

www.cern.ch/it

HEP computing and Grid computing & Big Data

Massimo Lamanna/CERN

IT department - Data Storage Services group

May 11th, 2014

(2)


My objectives for today!

Why does High-Energy Physics (HEP) need computing?

Computing: CPU cycles and Data Storage

How can we deliver enough computing to HEP?

Lots of CPU, disks and tapes!

Any relevance beyond HEP?

Who is CERN IT? Staff, fellows and students!

And of course trying to answer all your questions!

(3)

What are all these computers/disks/networks... for?

Theory and (detector) simulation

Data analysis and results interpretation

Data acquisition and

(4)

From Physics to Raw Data

[Diagram: e+ e- (or q q̄) → Z0 → f f̄]

Basic physics

Fragmentation, decay

Interaction with detector material: multiple scattering, interactions

Detector response: noise, pile-up, cross-talk, inefficiency, ambiguity, resolution, response function, alignment, temperature

Raw data (bytes): read-out addresses, ADC, TDC values, bit patterns

Example raw data: 2037 2446 1733 1699 4003 3611 952 1328 2132 1870 2093 3271 4732 1102 2491 3216 2421 1211 2319 2133 3451 1942 1121 3429 3742 1288 2343 7142

(5)

From Raw Data to Physics

[Diagram: e+ e- (or q q̄) → Z0 → f f̄]

Raw data: convert to physics quantities

Example raw data: 2037 2446 1733 1699 4003 3611 952 1328 2132 1870 2093 3271 4732 1102 2491 3216 2421 1211 2319 2133 3451 1942 1121 3429 3742 1288 2343 7142

Detector response: apply calibration, alignment

Interaction with detector material: pattern recognition, particle identification

Fragmentation, decay: physics analysis

Basic physics: results

Processing steps: Reconstruction, Simulation (Monte-Carlo), Analysis

(6)

(Innovation in) Computing in High-Energy Physics

• Demanding science → demanding computing

• Power usage

• Innovation

– Web invention

– Grid computing (LHC Computing Grid)

(7)


Invented by/in/for the HEP community

Now at the heart of our lives

(8)

WLCG (Worldwide LHC Computing Grid)

• The Worldwide LHC Computing Grid (WLCG) collaboration was set up in 2002 for the Large Hadron Collider (LHC) physics programme at CERN. It links up national and international grid infrastructures.

• The WLCG mission is to provide global computing resources to store, distribute and analyse the ~25 Petabytes (25 million Gigabytes) of data generated annually by the Large Hadron Collider (LHC) at CERN.

• Challenging points:

– Large data volumes

– Large computing needs (simulation and data reconstruction and analysis)

– Very large community (including a large number of computer centres and other institutes not necessarily part of an LHC experiment)

– Aiming at a total:CERN resource ratio of the order of 5:1 or more (CERN share of about 20% or less)

• Historical notes:

– Computing activities in the late 1990s (e.g. MONARC project: http://monarc.web.cern.ch/MONARC/ )

– “Grid” concepts (I. Foster and C. Kesselman) matched our approach

– Distributed approach “matches” our distributed user communities

– Participation in multi-science Grid projects (notably EGEE 2004-2010)

• WLCG now operates a distributed infrastructure with 200+ major cooperating computer centres worldwide and provides the foundation for all data processing activities. WLCG delivers access to LHC data and computing power for extensive data analysis to virtually every LHC physicist.

• Several tens of PB per year of data files are collected and distributed: experimental data (collected at rates up to several GB/s) supplemented by derived data (processing and filtering) and simulation data (generated on the same infrastructure). Depending on data type and usage, files are replicated on disk farms and tape repositories to ensure

(9)

Typical data rates: 100-1000 MB/s over multiple 10 Gb/s fibres.

Large (and final) filtering takes place at the experiment (trigger filter farms). All data are then kept (RAW data)
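As a rough cross-check of the rates quoted above, the following sketch converts a 10 Gb/s link to MB/s and estimates the daily volume at 100 and 1000 MB/s. It assumes decimal units (1 GB = 1000 MB), constant rates and no protocol overhead; the function names are illustrative only.

```python
# Unit arithmetic for the quoted data rates (decimal units, no protocol overhead).

def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to MB/s (8 bits per byte)."""
    return gbit_per_s * 1000 / 8


def tb_per_day(mbyte_per_s: float) -> float:
    """Volume in TB accumulated in 24 hours at a constant rate given in MB/s."""
    return mbyte_per_s * 86_400 / 1_000_000


fibre = gbit_to_mbyte_per_s(10)  # capacity of one 10 Gb/s fibre in MB/s

for rate in (100, 1000):  # MB/s, the range quoted on the slide
    print(f"{rate} MB/s: {tb_per_day(rate):.2f} TB/day, "
          f"{rate / fibre:.0%} of one 10 Gb/s fibre")
```

Under these assumptions a single experiment at the top of the range fills most of one fibre, which is consistent with the slide's mention of multiple fibres.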

[Diagram: data from the experiments (AMS, ALICE, ATLAS, LHCb, CMS, COMPASS) flowing to Bd 513]

(10)

12 Tier1:

FR-CCIN2P3

DE-KIT

NL-T1

ES-PIC

IT-INFN-CNAF

NDGF

US-T1-BNL

CA-TRIUMF

TW-ASGC

UK-T1-RAL

SK-KISTI

US-FNAL

(11)

Intercontinental links (data and jobs)

AP area: Taipei (Tier1), Tokyo, Beijing, Seoul, Melbourne, Mumbai, …

North America: BNL and Fermilab (US Tier1), Victoria (Canada Tier1) and many Tier2s like Stanford, MIT, Wisconsin, Argonne, …

South America: several Tier2s

Africa: a few sites, such as the South Africa Tier2s

(12)

• Data distribution (e.g. CERN distributes RAW data to Tier1s)

• 12 centres on-line

• Tier1s being built in Moscow area (@Dubna and @Kurchatov Institute)

• (Large) processing campaigns on preplaced data (e.g. Reprocessing)

• Download small sample for local analysis? … but

• Scale out with grid jobs (user executable dispatched where data are)

• Using file parallelism (data set → list of files → list of independent jobs)

• Federating storages
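The file parallelism mentioned above (data set → list of files → list of independent jobs) can be sketched as follows. This is a minimal illustration, not the splitting logic of any actual grid workload manager; the file names and the `files_per_job` parameter are hypothetical.

```python
from typing import Iterator, List


def split_into_jobs(files: List[str], files_per_job: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of the file list; each chunk is one independent job."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    for start in range(0, len(files), files_per_job):
        yield files[start:start + files_per_job]


# Hypothetical data set of 10 files (placeholder names).
dataset = [f"file_{i:03d}.root" for i in range(10)]

jobs = list(split_into_jobs(dataset, files_per_job=4))
for job_id, job_files in enumerate(jobs):
    print(job_id, job_files)
```

Because the jobs share no state, each one can be dispatched to whichever site holds its input files.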

(13)

Grid at work (computing power)

CPU contributions across the Worldwide LHC Computing Grid

See next slides: ~80,000 cores

(14)

CMS processing: wall clock consumption

• Tier0 and Tier1 processing

– Top/Bottom: particle collisions off/o

– Sizeable even if no data taking: continuous reprocessing

• Reconstruction activities

– RAW → reconstructed objects

– Organised processing

– Output for physicists' analysis

• They can access RAW data if needed

• Final analysis more efficient on the files containing the “reconstructed objects”

The grid never sleeps…

(15)

Data on the Grid (CMS experiment)

Much more data than RAW!

• Primary data from collisions

• + their replicas

• Different levels of reconstructed and filtered data sets

• + their replicas

70 PB

(16)

LHC run 1 (2012) data (transfers)

• Data taking is part of the full picture

• Primary data

• Replication, multi-step analysis and simulation “generate” more data

• And network traffic (data placement)

(17)

Data handling is an operational challenge

Source destination correlation matrix

(18)

Analysis flow (user view)

[Diagram labels: Raw data (exp. data); Simulation; Production System (twice); Data selection (quality, type, config…); User analysis]

But how is this done in practice? Of course we need CPUs, disks, networks, etc.

The problem is orchestrating hardware resources, software and humans :)

“All” data are stored in files (aggregated as “datasets” = collections of files). Only a small fraction of the data is in real DBs (e.g. calibrations). This is one characteristic of HEP computing.

(19)

Toy model of user analysis:

Central services

Regional services

Tier2/3 (data on disk + CPU)

Execution site

Selection: 2011 data; quality; run conditions… to be analysed with myprog.exe → [f₁, f₂, … f_n]

Federation/location: f_j is in [site₁, site₂, … site_m]

Execution: myprog.exe(/storage/data/2011/run4/f_j)

In general the execution site is one of [site₁, site₂, … site_m], but not always (flexibility, robustness, efficiency)
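The toy model can be sketched in a few lines: a catalogue maps each file to the sites holding a replica, and each job is sent to one of those sites. The catalogue contents, the site names and the "fewest jobs so far" rule are assumptions made for illustration; real brokering also takes flexibility, robustness and efficiency into account, as noted above.

```python
from typing import Dict, List

# Hypothetical replica catalogue: file name -> sites holding a copy.
catalogue: Dict[str, List[str]] = {
    "f1": ["site1", "site2"],
    "f2": ["site2"],
    "f3": ["site3", "site1"],
}


def choose_site(replicas: List[str], busy: Dict[str, int]) -> str:
    """Pick the replica site with the fewest jobs assigned so far (ties: first listed)."""
    return min(replicas, key=lambda site: busy.get(site, 0))


def dispatch(files: List[str]) -> Dict[str, str]:
    """Assign each file to an execution site that holds a replica of it."""
    busy: Dict[str, int] = {}
    plan: Dict[str, str] = {}
    for name in files:
        site = choose_site(catalogue[name], busy)
        plan[name] = site
        busy[site] = busy.get(site, 0) + 1
    return plan


plan = dispatch(["f1", "f2", "f3"])
for name, site in plan.items():
    print(f"myprog.exe on {name} runs at {site}")
```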

(20)

CMS user analysis

as of May 11th

(21)

Close collaboration with CERN IT

COMPASS was the first/among the first to:

● Migration to Linux PC farms

● CHEP2000

● CCF at CERN

● ACID in Trieste

● Use of Objectivity/DB at the 10² TB scale

● First production user of CASTOR

● Deliver and operate CORAL (C++ framework) for data processing

● And the algorithmic parts, notably the RICH reconstruction

CCF at CERN

ACID in Trieste (INFN)

Still in use

Eventually decommissioned

(22)

Design value (35 MB/s)

5th of May 2014 CASTOR statistics (tape files only)

Total size: 87.337 PB

Total number of files: 262.601 M

Top-6 size (by group):

30.436 PB with group ATLAS (id 1307)

17.266 PB with group CMS (id 1399)

13.888 PB with group COMPASS (id 1665)

9.616 PB with group ALICE (id 1395)

6.585 PB with group LHCb (id 1470)

1.839 PB with group zf (id 1018)
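A short calculation on the figures above gives the average tape file size and the share of each group. It uses only the numbers quoted on the slide and assumes decimal units (1 PB = 10⁹ MB); variable names are illustrative.

```python
# CASTOR tape statistics quoted on the slide (5 May 2014).
total_pb = 87.337
total_files_millions = 262.601
top6_pb = {
    "ATLAS": 30.436,
    "CMS": 17.266,
    "COMPASS": 13.888,
    "ALICE": 9.616,
    "LHCb": 6.585,
    "zf": 1.839,
}

# Average file size in MB (1 PB = 1e9 MB, decimal units).
avg_file_mb = total_pb * 1e9 / (total_files_millions * 1e6)
print(f"average file size: {avg_file_mb:.0f} MB")

# Share of the total volume held by each of the top-6 groups.
for group, size_pb in top6_pb.items():
    print(f"{group:8s} {size_pb / total_pb:6.1%}")

top6_share = sum(top6_pb.values()) / total_pb
print(f"top-6 groups together: {top6_share:.1%}")
```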

(23)


(24)

CERN Computer Centre

CERN computer centre:

• Built in the 70s on the CERN site (Meyrin-Geneva)

• ~3000 m² (3 main machine rooms)

• 3.5 MW for equipment

• Est. PUE ~ 1.6

New extension:

• Located at Wigner (Budapest)

• ~1000 m²

• 2.7 MW for equipment

• Connected to the Geneva CC with 2x100 Gb/s links (21 and 24 ms RTT)

• Another room ~20 ms away…

(25)

CASTOR and EOS

CASTOR and EOS are using the same commodity HW

• RAID-1 for CASTOR

• 2 copies in the mirror (same box, multiple copies possible)

• JBOD with RAIN for EOS

• enforce replicas to be on different disk servers

• tunable number (usually 2 replicas)

• Erasure coding

• “Arbitrary” configurable durability

• Disaster recovery: cross-site replication
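To illustrate the idea behind erasure coding, here is a minimal single-parity sketch: k data chunks plus one XOR parity chunk, from which any one lost chunk can be rebuilt. This is the simplest possible code (RAID-5-like) and is an assumption chosen for illustration; it is not the scheme actually used by EOS, and the chunk contents are placeholders.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(chunks: List[bytes]) -> bytes:
    """XOR of all chunks; applied to the data chunks it yields the parity chunk."""
    return reduce(xor_bytes, chunks)


# Three hypothetical data chunks, each meant for a different disk server.
data = [b"RAW0", b"RAW1", b"RAW2"]
p = parity(data)

# Suppose the server holding data[1] is lost: XOR the survivors with the parity.
recovered = parity([data[0], data[2], p])
print(recovered)

# Storage overhead: 2 full replicas cost 2x the data, 3 data + 1 parity cost 4/3x.
overhead_replicas = 2 / 1
overhead_parity = (len(data) + 1) / len(data)
print(overhead_replicas, round(overhead_parity, 2))
```

Codes with more parity chunks tolerate more simultaneous losses, which is what makes the durability configurable.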

CASTOR and EOS: developed at CERN (IT department)

Workhorses for physics data (archival, reconstruction, analysis and distribution)

(26)


Data durability

CERN

Tier1s

Simplified schema: 1 copy + 1 “distributed” copy (2 tape copies)

Scalability + data durability (disaster recovery)

(27)

Data durability (2)

MGM; NS + Stager

Δt ~ several 10³ s

Single disk failure: probability p; healing time: Δt

Double failure: p² effect; overlapping failure during healing time Δt; ∩(disk_x, disk_y)
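The p² effect can be made concrete with a first-order estimate. The sketch below assumes independent disk failures at a constant rate, two replicas per file, and a hypothetical annual failure rate of 2% per disk; only the healing time of a few 10³ s comes from the slide. Function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def p_fail_in_window(afr: float, window_s: float) -> float:
    """Probability that one given disk fails within a short window (small-rate approximation)."""
    return afr * window_s / SECONDS_PER_YEAR


def p_double_failure_per_year(afr: float, healing_s: float) -> float:
    """Approximate yearly probability of losing both replicas of a file.

    Either disk can fail first (factor 2, probability afr per year each);
    data are lost only if the other disk also fails before healing completes.
    """
    return 2 * afr * p_fail_in_window(afr, healing_s)


AFR = 0.02         # hypothetical annual failure rate of one disk
HEALING_S = 5_000  # healing time, "several 10^3 s" as on the slide

single = AFR
double = p_double_failure_per_year(AFR, HEALING_S)
print(f"single disk failure per year:         {single:.2e}")
print(f"double failure per replica pair/year: {double:.2e}")
```

In this model the loss probability scales with p² and linearly with Δt, so shortening the healing time is as effective as buying more reliable disks.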

(28)

Data & Storage Services

Total installed capacity: 25PB ~900 disk servers

Total installed disk capacity: 13.0PB ~600 disk servers

Mainly used as disk front-end

(includes LHC data repository)

Disk-only data (analysis)

CERN data management (physics data)

Unique function across LCG:

• Primary repository for LHC data (raw)

• Source of prompt reconstruction

• Distribution point for LHC data (raw)


“Low” numbers (maintenance due to LHC technical stop)

(29)

Availability and I/O performance

Data-taking driven

Analysis driven

pp 2012

pA 2013

HW types

Different generations of PC boxes with 20-200 TB of disk space

(30)

Multiple copy

• Geolocalisation

• 1 in Meyrin (CERN Geneva), 1 in Budapest (Wigner)

• Whenever possible

Multiple MGM deployment

• Complex deployment to guarantee low latency

• Xroot protocol: redirect clients

• Single master MGM

• Most operations can be done on R/O MGM

• Transparent redirection

(31)


CASTOR (tape archive)


Data:

• ~90 PB of data on tape; 250 M files

• Up to 4.5 PB new data per month

• Over 10GB/s (R+W) peaks

Infrastructure:

• ~ 52K tapes (1TB, 4TB, 5TB, 8TB)

• 9 Robotic libraries (IBM and Oracle)

(32)

Other main services:

• TSM

Internal/beta services:

• Hadoop (*)

• CERNBOX (cloud sync)

(*) Initiated by a Technical Student (S. Russo) from TS+UD uni (ATLAS experiment)

(33)

CERNBOX to enter beta soon

CERNBOX provides a cloud synchronisation service for all CERN users between personal devices (laptops, smartphones, tablets) and a centrally-managed data storage.

Coherent data handling (with our other services): SLA, ACL, curation...

CERNBOX is built on top of the ownCloud software. CERNBOX supports access via web browsers, desktop clients (Linux, Mac and Windows) and mobile-device applications (Android and iOS).

http://owncloud.com

(34)

Ceph Architecture and Use-Cases

• Open-source storage solution

• Scalable from 10s to 10,000s of machines, from Terabytes to Exabytes (10³ PB)

• Fault tolerant with no SPOF, uses commodity hardware

• Self-managing, self-healing

(35)

Deploying a 3-PB prototype

• Fully puppetized deployment

• Automated machine commissioning and maintenance

• Add a server to the hostgroup (osd, mon, radosgw)

• OSD disks are detected, formatted, prepared, auth’d

• Also after disk replacement

• Auto-generated ceph.conf

• Last step is manual/controlled: service ceph start

• Mcollective for bulk operations on the servers

• Ceph rpm upgrades

• daemon restarts

# df -h /mnt/ceph

Filesystem Size Used Avail Use% Mounted on

(36)

Beyond HEP

Ground floor exposition

Sharing experience with other activities (e.g. Biology)

Planning strategic partnership with other sciences (HelixNebula)

Collaboration with technology providers (OpenLab)

(37)

(38)

The 3 Vs in HEP

Volume

• Somewhere between 10 and 100 PB you are certainly getting big

Variety

• # of different technical solutions, operational procedures, users and workflows?

• # of different types (>3? >30?)

Velocity

• Δvolume/Δt ~ 1 PB/month (disk installation)

• Last 2 years of EOS@CERN

• Δvolume/Δt ~ 4 PB/month (recording, processing and redistributing)

• LHC running (end of pp Run1)

But do not forget:

• Max time you can be down!

• Max integrated time in a month!

• # of human interventions per day!

...and we want to read high volume of data at high velocity in a variety of ways 

User program

Output data

∀ event in 2012 CMS AOD …

(39)

Our “Big Data”

Physics Data on CASTOR/EOS

● LHC experiments produce ~10 GB/s, 25 PB/year

User Data on AFS & DFS

● Home directories for 30k users

● Physics analysis development

● Project spaces for applications

Service Data on AFS/NFS/Filers

● Databases, admin applications

Tape archival with CASTOR/TSM

● Physics data (RAW and derived/additional)

● Desktop/Server backups

Data growth

● 2 PB/month over the last 2 years (raw disk space)

● 4 PB/month new data at the end of LHC phase 1

● Factor of 2 more at the beginning of phase 2?
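To put the growth figures above on the same scale as the ~10 GB/s quoted for the experiments, the sketch below converts monthly and yearly volumes into average sustained rates. It assumes decimal units (1 PB = 10⁶ GB), a 30-day month and a 365-day year; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def average_gb_per_s(petabytes: float, days: float) -> float:
    """Average rate in GB/s needed to accumulate `petabytes` over `days` (1 PB = 1e6 GB)."""
    return petabytes * 1e6 / (days * SECONDS_PER_DAY)


monthly = average_gb_per_s(4, 30)    # 4 PB/month of new data
yearly = average_gb_per_s(25, 365)   # 25 PB/year

print(f"4 PB/month is about {monthly:.2f} GB/s on average")
print(f"25 PB/year is about {yearly:.2f} GB/s on average")
```

Both averages come out well below 10 GB/s, which suggests that the ~10 GB/s figure is a peak rate rather than a sustained one.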

Service Size Files

AFS 290TB 2.3B

CASTOR 89.0PB 325M

(40)

Staff, fellows and students!

Xavier Cortada (with the participation of physicist Pete Markowitz), "In search of the Higgs boson: H -> ZZ", digital art, 2013.

• Staff, fellows and students

– cern.ch/jobs

– Many opportunities for students!

• Physicists & Engineers (incl. Computer science)

– Challenging but exciting!

• Summer Student (Jan/Feb 8-12 weeks in summer)

• OpenLab Summer Students (Jan/Feb 8-12 weeks in summer)

• Technical Students (2 times per year 6-12 months)

(41)

(42)

But finally why so much data?

∫L dt: Luminosity

It is a measurement of the potential statistics accumulated in the collisions over time (here it accounts for the 2011 and 2012 data taking)

Think of the exposure time of a camera when taking pictures

Events with 2 Zs decaying into two lepton pairs (here e⁺e⁻ and μ⁺μ⁻)
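The link between integrated luminosity and the amount of data can be written as a standard relation: the expected number of events of a given process is its cross-section times the integrated luminosity. The numbers in the worked example are hypothetical and chosen only to show the arithmetic; detector acceptance and efficiency are ignored.

```latex
% Expected number of events for a process with cross-section \sigma
N_{\text{events}} = \sigma \times \int \mathcal{L}\,\mathrm{d}t

% Worked example with hypothetical values:
% \sigma = 10\ \text{fb}, \quad \int \mathcal{L}\,\mathrm{d}t = 25\ \text{fb}^{-1}
N_{\text{events}} = 10\ \text{fb} \times 25\ \text{fb}^{-1} = 250
```

Rare processes have small cross-sections, so collecting enough events requires a large integrated luminosity, and therefore a large volume of recorded collisions.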

(43)
