CCRC08 post-mortem
LHCb activities at PIC
G. Merino
LHCb Computing
• Main user analysis supported at CERN + 6 Tier-1s
• Tier-2s essentially Monte Carlo production facilities
CCRC08: Planned tasks
• May activities: maintain equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency
• Raw data distribution from pit → T0 centre
• Raw data distribution from T0 → T1 centres
– Use of FTS - T1D0
• Reconstruction of raw data at CERN & T1 centres
– RAW (T1D0) → rDST (T1D0)
• Stripping of data at CERN & T1 centres
– RAW & rDST (T1D0) → DST (T1D1)
• Distribution of DST data to all other centres
Activities across the sites
• Planned breakdown of processing activities (CPU needs) prior to CCRC08

Site          Fraction (%)
CERN          14
FZK           11
IN2P3         25
CNAF           9
NIKHEF/SARA   26
PIC            4
RAL           11
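To make the planned shares concrete, the following minimal sketch turns the percentages above into an expected number of jobs per site. The total of 41,200 jobs is taken from the reconstruction slide later in this deck; applying the CPU shares directly to job counts is an assumption made for illustration only.

```python
# Planned CPU shares per site, in percent (values from the slide above).
PLANNED_FRACTION_PCT = {
    "CERN": 14, "FZK": 11, "IN2P3": 25, "CNAF": 9,
    "NIKHEF/SARA": 26, "PIC": 4, "RAL": 11,
}

def expected_jobs(total_jobs, fractions_pct):
    """Split total_jobs across sites in proportion to the planned shares."""
    total_pct = sum(fractions_pct.values())
    if total_pct != 100:
        raise ValueError(f"shares add up to {total_pct}%, expected 100%")
    return {site: total_jobs * pct / 100 for site, pct in fractions_pct.items()}

# Assumption: job counts scale with the planned CPU share.
for site, jobs in expected_jobs(41_200, PLANNED_FRACTION_PCT).items():
    print(f"{site:12s} {jobs:8.0f}")
```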
Tier 0 → Tier 1
• FTS from CERN to Tier-1 centres
– Transfer of RAW will only occur once data has migrated to tape & checksum is verified
– Rate out of CERN ~35 MB/s averaged over the period
– Peak rate far in excess of requirement
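As a rough illustration of what the ~35 MB/s average implies, the sketch below converts a sustained rate into a daily volume. Decimal units (1 TB = 10^6 MB) are an assumption; the slide does not state which convention it uses.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_volume_tb(rate_mb_per_s):
    """Volume moved in one day at a sustained rate, in decimal TB."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000

# Average rate out of CERN quoted on the slide.
print(f"{daily_volume_tb(35):.2f} TB/day")
```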
Tier 0 → Tier 1
• To first order all transfers eventually succeeded
– Plot shows efficiency on 1st attempt…
[Plot: transfer efficiency on first attempt. Annotations: issue with UK certificates; restart of IN2P3 SRM endpoint; CERN outage; CERN SRM endpoint problems]
Reconstruction
• Used SRM 2.2
– LHCb space tokens are:
• LHCb_RAW (T1D0); LHCb_RDST (T1D0)
• Data shares need to be preserved
– Important for resource planning
• Input 1 RAW file & output 1 rDST file (1.6 GB)
• Reduced number of events per reconstruction job from 50k to 25k (job ~12 hour duration on 2.8 kSI2k machine)
– In order to fit within the available queues
– Need to get queues at all sites that match our processing time
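The trade-off between events per job and queue length can be sketched as follows. The reference point (25k events in ~12 hours on a 2.8 kSI2k machine) comes from the slide; the assumption that job duration scales linearly with the number of events and inversely with machine power is ours and is only approximate.

```python
# Reference point from the slide: 25k events take ~12 h on a 2.8 kSI2k machine.
REF_EVENTS = 25_000
REF_HOURS = 12.0
REF_POWER_KSI2K = 2.8

def job_hours(events, power_ksi2k=REF_POWER_KSI2K):
    """Estimated job duration, assuming linear scaling (an approximation)."""
    return REF_HOURS * (events / REF_EVENTS) * (REF_POWER_KSI2K / power_ksi2k)

def max_events(queue_limit_hours, power_ksi2k=REF_POWER_KSI2K):
    """Largest number of events that fits in a queue of the given length."""
    scale = (queue_limit_hours / REF_HOURS) * (power_ksi2k / REF_POWER_KSI2K)
    return int(REF_EVENTS * scale)

print(job_hours(50_000))  # duration of the original 50k-event jobs
print(max_events(12))     # events that fit in a 12-hour queue
```

Under these assumptions the original 50k-event jobs would need about twice the queue length of the 25k-event jobs, which is why the job size was halved.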
Reconstruction
• After data transfer, file should be online, as job is submitted immediately
– NOTE: in principle only LHCb has this requirement of “online reconstruction”
• Reco jobs will read the input data from the T1D0 write buffer
• Just in case… LHCb pre-stages files (srm_bringonline) & then checks on the status of the file (srm_ls) before submitting pilot job via GFAL
– Pre-stage should ensure access availability from cache
– Only issue at NL-T1 with reporting of file status
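The pre-stage-then-check pattern described above can be sketched as a small polling loop. This is not the LHCb or GFAL implementation: `bring_online` and `is_online` are hypothetical callables standing in for whatever wraps srm_bringonline and srm_ls, and the polling parameters are illustrative.

```python
import time

def prestage_and_wait(surls, bring_online, is_online,
                      poll_seconds=30.0, max_polls=20, sleep=time.sleep):
    """Request staging for every file, then poll until files are online.

    bring_online(surl) and is_online(surl) are hypothetical callables
    supplied by the caller. Returns the files found online, in input order.
    """
    surls = list(surls)
    for surl in surls:
        bring_online(surl)          # ask the storage system to stage the file

    pending = set(surls)
    for attempt in range(max_polls):
        # Keep only the files that are still not reported as online.
        pending = {s for s in pending if not is_online(s)}
        if not pending:
            break
        if attempt < max_polls - 1:
            sleep(poll_seconds)     # wait before asking again

    # Only these files are safe to hand to a job.
    return [s for s in surls if s not in pending]
```

A job would then be submitted only for the files returned. Files still pending after the last poll are left out, so they are not handed to a job that would stall waiting on tape.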
Reconstruction
• 41.2k reconstruction jobs submitted
• 27.6k jobs proceeded to done state
• Done/created ~67%

Site     Submitted jobs   Done jobs     Done/Sub
NIKHEF   10.3k (26%)      2.3k (6%)     23%
PIC      1.8k (4%)        1.6k (4%)     89% ☺
RAL      4.7k (11%)       3.5k (8%)     74%
CERN     6.1k (14%)       5.3k (13%)    86%
CNAF     3.9k (9%)        2.8k (7%)     72%
GridKa   4.1k (11%)       3.1k (7%)     76%
IN2P3    10.3k (25%)      6.1k (14%)    56%
Reconstruction
• 27.6k reconstruction jobs in done state
– 21.2k jobs processed 25k events
– Done/25k events ~77%
• 3.0k jobs failed to upload rDST to local SE
– Only 1 attempt before trying Failover
– Failover/25k events ~13%

Site     25k events     Fail upload   Success/Created
NIKHEF   1.2k (53%)     0.9k (70%)    4%
PIC      1.6k (99%)     0.0k (0%)     89% ☺
RAL      3.1k (89%)     0.0k (1%)     68%
CERN     5.2k (100%)    0.7k (14%)    76%
CNAF     2.6k (95%)     0.0k (1%)     67%
GridKa   3.0k (99%)     0.7k (22%)    58%
IN2P3    5.1k (90%)     0.7k (14%)    43%
Human error at PIC:
a WN with misconfigured networking acted as a black hole from 24 to 27 May (ticket-4386)
Reconstruction
CPU efficiency: ratio of CPU time to wall-clock time on running jobs
CNAF: more jobs than cores on a WN …
IN2P3 & RAL: problems reading input data
Reconstruction
CPU efficiency: ratio of CPU time to wall-clock time on running jobs
PIC: the most CPU-efficient T1 ☺
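CPU efficiency is conventionally computed as CPU time divided by wall-clock time, so a job that spends its time waiting for input data or for a free core scores well below 1. The sketch below shows the calculation, plus the ceiling imposed when a worker node runs more jobs than it has cores. The job and core counts in the example are hypothetical, not measurements from CCRC08.

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """CPU time divided by wall-clock time for one job."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds

def max_efficiency_when_oversubscribed(jobs, cores):
    """Upper bound on per-job efficiency when jobs share the cores evenly."""
    if jobs <= 0 or cores <= 0:
        raise ValueError("jobs and cores must be positive")
    return min(1.0, cores / jobs)

# Hypothetical job: 9 h of CPU time in 12 h of wall-clock time.
print(cpu_efficiency(9 * 3600, 12 * 3600))
# Hypothetical worker node: 10 jobs running on 8 cores.
print(max_efficiency_when_oversubscribed(10, 8))
```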
dCache Observations
• Official LCG recommendation - 1.8.0-15p3
• LHCb ran smoothly at half of T1 dCache sites
– PIC OK - version 1.8.0-12p6 (dcap)
– GridKa OK - version 1.8.0-15p2 (dcap)
– IN2P3 - problematic - version 1.8.0-12p6 (gsidcap)
• Seg faults - needed to ship version of GFAL to run
• Could explain CGSI-gSOAP problem????
– NL-T1 - problematic (gsidcap)
• Many versions during CCRC to solve number of issues
• 1.8.0-14 -> 1.8.0-15p3 -> 1.8.0-15p4
Databases
• Conditions DB used at CERN & Tier-1 centres
– No replication tests of conditions DB Pit ↔Tier-0 (and beyond)
– Switched to using Conditions DB 15th May for reconstruction
• LFC
– Use “streaming” to populate the read-only instance at T1 from CERN
– Problem with CERN instance revealed local instances not being used by LHCb!
Stripping
• Stripping on rDST files
– 1 rDST file & associated RAW file
– Space tokens: LHCb_RAW & LHCb_rDST
• DST files & ETC produced during the process stored locally on T1D1 (add storage class)
– Space tokens: LHCb_M-DST
• DST & ETC file then distributed to all other computing centres on T0D1 (except CERN T1D1)
Stripping
• 31.8k stripping jobs were submitted
• 9.3k jobs ran to “Done”
• Major issues with LHCb book-keeping

Site                         Subm    Done
CERN                         2.4k    2.3k
CNAF                         2.3k    2.0k
GridKa                       2.0k    2.0k
IN2P3                        4.5k    0.2k
NIKHEF                       0.3k    <0.1k
PIC                          1.1k    1.1k
RAL                          2.2k    1.6k
Failed to resolve datasets   17.0k
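The stripping numbers can be cross-checked with a few lines of arithmetic: the per-site submissions plus the jobs that failed to resolve their datasets should add up to the quoted total. This sketch uses only the figures on the slide, in thousands of jobs; the rounding to one decimal mirrors the precision of the slide.

```python
# Submitted stripping jobs per site, in thousands (from the slide).
SUBMITTED_K = {
    "CERN": 2.4, "CNAF": 2.3, "GridKa": 2.0, "IN2P3": 4.5,
    "NIKHEF": 0.3, "PIC": 1.1, "RAL": 2.2,
}
FAILED_TO_RESOLVE_K = 17.0   # jobs that could not resolve their datasets
TOTAL_SUBMITTED_K = 31.8     # total quoted on the slide
DONE_K = 9.3                 # jobs that ran to "Done"

def accounted_total_k():
    """Jobs accounted for by the per-site rows plus the unresolved ones."""
    return round(sum(SUBMITTED_K.values()) + FAILED_TO_RESOLVE_K, 1)

print(accounted_total_k(), "k accounted for, versus", TOTAL_SUBMITTED_K, "k quoted")
print(f"Done fraction: {DONE_K / TOTAL_SUBMITTED_K:.0%}")
```

The two totals agree, which indicates that the 17.0k jobs failing to resolve datasets were never assigned to a site.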
Stripping: T1-T1 transfers
[Plots: T1-T1 transfers, panels for CNAF, PIC, GridKa and RAL. Annotations: initial problems uploading to M-DST token at PIC; catch-up OK once solved]
Conclusions
• Despite being the smallest LHCb Tier-1, PIC delivered the highest quality of service in CCRC08
• The following Tier-1 processes were tested
– Reception of data from CERN
– Reconstruction
– Stripping and sending of DST to other Tier-1s
• Results at PIC were positive
– Reception of data from CERN (~5 MB/s)
– Reading of data from WNs (dcap): OK
– Demonstrated DST replication to other Tier-1s at a higher rate than required (catch-up)
• The exercise was also useful for LHCb to identify the weak points of its DIRAC Grid infrastructure