ATLAS Software and Computing Week April 4-8, 2011 General News


April 19, 2011

• Refactor requests for resources (originally requested in 2010) by expected running conditions (running in 2012 with shutdown in 2013)

– 20% more CPU required at T0 which will be redeployed from CAF

– No extra CPU for T2 in 2012 but 11% more disk requested

– T1 CPU up 15%, disk and tape OK

• Expect Group and User activities to increase as the experiment matures

• Change in data distribution model:

– 1 disk copy of RAW at T1’s

– Rolling buffer of 10% of ESD at T1’s + small streams

– 10 copies of AOD over the clouds (3 copies of previous version)

– 2 copies of DESD at T1, 4 at T2

– Larger pileup means ~double the event size

– Work done already to reduce the reconstruction time

• New version of Oracle (11g) to be introduced at end of 2011 (somewhat disruptive move from current 9g)


• Break original cloud and DDM model as network architectures have evolved away from the original MONARC model

• Allow DDM to transfer data freely between all sites

• T2Ds (D = directly connected) are an attempt to break the cloud boundaries where it makes sense

• In the new model, every T1 is topologically close to the other T1s, its T2s and all the T2Ds

• Requirements for being/becoming a T2D

– Need to satisfy transfer metrics with all T1s

• Strictly quantifiable via SONAR tests

– Need to provide a certain level of commitment and reliability

• Not quantifiable at the moment, up to ATLAS central ops to decide

Data Movement and Cloud Model

Metric                         File size   Thresholds
Number of files transferred    SMALL       ≤3           4           ≥5
Number of files transferred    MEDIUM      ≤2           3           ≥4
Avg(Byterate)+StD(Byterate)    SMALL       <0.05 MB/s   <0.1 MB/s   ≥0.1 MB/s
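The threshold table above can be read as a three-way classification per metric. The following is a minimal sketch in Python, not the actual SONAR tooling. It assumes the three threshold columns are ordered from worst to best and simply numbers them 0, 1, 2, since the column labels are not given here. The function names are hypothetical.

```python
def files_level(n_files, size_class):
    """Return 0, 1 or 2: the threshold column that n_files falls in.

    Assumption: columns are ordered worst (0) to best (2).
    """
    # (upper bound of first column, value of middle column) per file-size class
    bounds = {"SMALL": (3, 4), "MEDIUM": (2, 3)}
    low, mid = bounds[size_class]
    if n_files <= low:
        return 0
    if n_files == mid:
        return 1
    return 2


def byterate_level(avg_plus_std_mb_s):
    """Classify Avg(Byterate)+StD(Byterate) for SMALL files, in MB/s."""
    if avg_plus_std_mb_s < 0.05:
        return 0
    if avg_plus_std_mb_s < 0.1:
        return 1
    return 2
```

For example, `files_level(4, "SMALL")` lands in the middle column, while `byterate_level(0.1)` lands in the last one.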


• Many T2Ds commissioned in less than 2 months

– Represent over 80% of ATLAS resources

• But many clouds have no T2Ds and fewer than 50% of T2s represented

• LHCONE should address some of these issues

• Proposal that only T2Ds participate in multicloud production activities and even host GROUPDISK

• T2Ds could become primary replica sites, especially for GROUPDISK replication

• Strongly urged to get as many T2Ds as possible in a cloud, but it is up to squads and sites to do the commissioning

• perfSONAR and independent FTS tests at US sites have been immensely helpful in commissioning and monitoring

• Won’t discuss LHCONE here but refer you to slides

• Intercloud transfers may need direct T2-T2 – new feature in DDM based on measurements


• 15.6.X.Y – last of 2010 analyses

• 16.0.X.Y – recommended for all pp analyses

• 16.2.X.Y – HI reprocessing

• 16.6.X.Y (latest 16.6.3.4) – for 2011 MC and data

• Release 17 due out at the beginning of May, to be validated in July for the September reprocessing

– Hope for significant cleanup (compilation warnings, Savannah bugs, etc.)

• 64-bit builds maybe in 16.6.3.Y

– CPU performance gain but 1.5x more memory (reco ~3.5 GB)

• Cleanup of releases needed before deployment on CVMFS

• People needed to help the effort (2-3 people at 30-50% FTE)


• MC10a production (for 2011 data)

– Starting now

– Reprocess (digi+reco) all MC10 HITS (16.6.3.Y)

– 800M events, ~6 weeks to reprocess

• MC11 Geant4 in validation now (16.6.X.Y)

– All 800M samples redone from scratch

– ~3 months

– HITS go to TAPE until reconstruction in summer with rel 17.

• Multi-cloud working reasonably well with a few mostly local concerns

– Monitoring, foreign T2s stealing jobs

• Clouds tend to empty at the same time – this is good
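The production timescales quoted above imply an average grid-wide throughput. A small worked example in Python, assuming "~3 months" means 90 days and treating the 800M MC11 samples as 800M events (both are assumptions made here for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(n_events, n_days):
    """Return (events per day, events per second) averaged over n_days."""
    per_day = n_events / n_days
    return per_day, per_day / SECONDS_PER_DAY


# MC10a reprocessing: 800M events in ~6 weeks
mc10a_per_day, mc10a_hz = average_rate(800e6, 6 * 7)
# MC11 from scratch: 800M events in ~3 months (assumed to be 90 days)
mc11_per_day, mc11_hz = average_rate(800e6, 90)

print(f"MC10a: {mc10a_per_day / 1e6:.1f}M events/day, {mc10a_hz:.0f} Hz")
print(f"MC11:  {mc11_per_day / 1e6:.1f}M events/day, {mc11_hz:.0f} Hz")
```

Under these assumptions the MC10a reprocessing averages about 19M events per day (about 220 Hz) and the MC11 simulation about 9M events per day (about 100 Hz).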

MC Production

ATLAS-Canada Tier-1/2 Computing Meeting Leslie Groer


• NoSQL database (e.g. MongoDB, Cassandra, HBase)

• Cloud Computing

• xRootD Federation and File Level Caching

• Event Data Caching

• Tier-3 Monitoring

• CVMFS

• Multicores

– AthenaMP – reduce memory footprint

– Reco prototype working in 16.5.0.3 with caveats

• Network and Transfer Monitoring

• New ATLAS Development Meeting, Wednesdays at 17:00 CET

– Open meeting with people invited from outside of ATLAS

– Discuss future computing projects

– [email protected]


• Dashboard

– e.g. Historical Views for Job accounting, DDM Dashboard 2.0, Global Job monitoring

– Some still in validation so report features and bugs

• ADC Monitoring

• ATLAS Grid Information System development:

• Discussions on converging monitoring tools within Dashboard framework

• Some useful new pages

– Data Distribution (T0T1, T1T1)

http://panda.cern.ch:25880/server/pandamon/query?mode=listCR

• Site Status Board

• or shifter view

• Autopyfactory monitoring

• Talk about implementing more stringent criteria on site availability to avoid flip-flopping (e.g. from CMS: site OK >5 days in last 7 or in last 2 days)
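The CMS-style criterion mentioned above can be sketched as follows. This is one possible reading of the rule, not an agreed ATLAS definition: a site counts as available if it was OK on more than 5 of the last 7 days, or OK on both of the last 2 days. The function name and the input format (one boolean per day, oldest first) are hypothetical.

```python
def site_is_available(daily_ok):
    """Apply a damped availability rule to avoid flip-flopping.

    daily_ok: list of booleans, oldest first, one entry per day.
    """
    last7 = daily_ok[-7:]
    last2 = daily_ok[-2:]
    ok_most_of_week = sum(last7) > 5
    ok_last_two_days = len(last2) == 2 and all(last2)
    return ok_most_of_week or ok_last_two_days
```

With this reading, a single bad day does not blacklist a site that was good for the rest of the week, and a site that has recovered is readmitted after two good days.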

ATLAS Monitoring and Reporting


• 50 PB, 2.5M datasets, 185M files, 800 end-points and growing

• 14M/0.6M reads/writes per day (162/7 Hz), 500k deletes/day

• New accounting service dq2-list-accounting

– E.g. Count all data10 datasets at CERN by datatype:

– dq2-list-accounting project=data10* location=cern* datatype

• Automatic notification to squads of suspicious files and tagging datasets

• Multi-hop/direct transfer services being worked on for T2Ds

• Dropping Python 2.3/2.4 support – currently 2.5/2.6

• dq2 Stable release 0.1.36 (tested with Python 2.6)

– A few name changes to maintain compatibility, e.g. dq2-put2 replaces old dq2-put

– Python files now in /opt/dq2/lib

• Working on DQ3 code-named “Rucio”

– More hooks for group datasets, symlinks, multiple replicas, cloud computing, etc.
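The daily DDM operation counts quoted above convert to average rates as follows. The sketch assumes the figures are 14M reads and 0.6M writes per day, which is the reading that reproduces the quoted 162/7 Hz.

```python
SECONDS_PER_DAY = 86_400


def per_second(per_day):
    """Convert a daily count into an average rate in Hz."""
    return per_day / SECONDS_PER_DAY


reads_hz = per_second(14e6)     # assumed: 14M reads per day
writes_hz = per_second(0.6e6)   # assumed: 0.6M writes per day
deletes_hz = per_second(500e3)  # 500k deletes per day

print(f"reads {reads_hz:.0f} Hz, writes {writes_hz:.0f} Hz, deletes {deletes_hz:.1f} Hz")
```

The same conversion puts the 500k deletes per day at just under 6 Hz on average.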


• Distribute read-only binaries (immutable, sha-1 hashed)

• Files/meta-data downloaded on demand and locally cached (FUSE)

• Self-contained (e.g. /cvmfs/atlas.cern.ch/)

• Local load-balanced squids with fail-over capability

• In deployment at a few sites already (RAL, QMUL, Wuppertal)

• V2.0 end of May

• Recommended for grid sites: easy to install and no local repository whose space or performance must be managed, but a squid proxy is needed

• Caveat: needs a local disk cache (at least 8 GB recommended), or can use shared memory

• KV works transparently

• Installation DB recognizes CVMFS site and removes concurrent job restrictions

• VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw

• Athena software will be installed in $VO_ATLAS_SW_DIR/software/<version>
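As an illustration of the layout above, a job wrapper could resolve the release directory from the environment roughly as follows. This is only a sketch: the helper name is hypothetical, the fallback to the CVMFS path is an assumption, and the version string is used as the directory name exactly as the `<version>` placeholder suggests.

```python
import os
import posixpath

# Value quoted on the slide for CVMFS sites
DEFAULT_SW_DIR = "/cvmfs/atlas.cern.ch/repo/sw"


def athena_release_dir(version, environ=None):
    """Return $VO_ATLAS_SW_DIR/software/<version> for a given release."""
    environ = os.environ if environ is None else environ
    # Assumption: fall back to the CVMFS location if the variable is unset
    sw_dir = environ.get("VO_ATLAS_SW_DIR", DEFAULT_SW_DIR)
    return posixpath.join(sw_dir, "software", version)
```

For example, `athena_release_dir("16.6.3.4", {})` gives `/cvmfs/atlas.cern.ch/repo/sw/software/16.6.3.4`.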

Cern VM File System (CVMFS)

• /cvmfs/atlas.cern.ch: Production Software

– 5 release managers, 24 releases SLC4 + 31 releases SLC5

– 590 GB, 11 million files, 16 million entries (shadow)

– 85 GB and 1.5 million files (repository)

• /cvmfs/atlas-condb.cern.ch: ATLAS Condition Flat Files

– Release manager machine hosted by CERN IT

– Automatic update several times a day

– 30 GB, 110000 files, 7000 directories, 3000 symlinks (shadow tree)

– 30 GB, 70000 files (repository)

– Only a fraction of all conditions data

• /cvmfs/atlas-nightlies.cern.ch: ATLAS Nightlies

– To be done


• Merging of output files for analysis jobs

• Event picking in PD2P

• Optimize rebrokerage of jobs (general agreement to reduce from 3 days to 1 day for analysis jobs)

• LFC registration by panda server

• Pilot changes to support new functionality

• T1T1 PD2P for secondary copies based on usage – Split dataset containers based on MOU shares

• Further copies made to T2s based on job backlogs (log10 function) • Algorithm will be tuned with experience

– Maybe add popularity and history (needs additional tables in

db)
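The slides say only that further T2 copies depend on a log10 function of the job backlog. One way such a rule could look is sketched below. The exact form, the rounding down and the cap are all assumptions made for illustration, and the function name is hypothetical; this is not the PanDA implementation.

```python
import math


def extra_t2_copies(waiting_jobs, max_copies=5):
    """Hypothetical rule: one extra T2 copy per decade of waiting jobs."""
    if waiting_jobs < 1:
        return 0
    # Assumption: round down and cap at max_copies
    return min(max_copies, int(math.log10(waiting_jobs)))
```

A logarithmic rule of this kind grows slowly: going from 10 to 1000 waiting jobs adds two copies rather than a hundredfold increase, which keeps replication from running away on a single hot dataset.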


Backups


• Revise model due to 200 Hz → 400 Hz trigger rate

• No long term storage of ESDs on disk for physics streams

– Muon, JetTauEtmiss, Egamma, MinBias

• Do not store any ESD on tape

– This removes the possibility of reprocessing from ESD

• Provide 2 replicas of ESD from all physics streams for ~10% of the data

– Last 2 months ‘rolling buffer’ for Tier0 produced ESDs

– Specific data period corresponding to ~10% of the data for ESDs from reprocessing

• Provide 2 replicas of ESD for some small streams

– CosmicCalo, ZeroBias, Standby & express

• Reduce the ESD size by ~30% by dropping unused or redundant information

• Provide 1 copy of RAW data on disk for the physics stream for data taken in the last year

– In addition to copy of RAW on tape

– Compress RAW data on disk (can achieve a factor of ~2)
