Experience of Data Transfer to the Tier-1 from a DiRAC Perspective
Lydia Heck
Institute for Computational Cosmology
Manager of the DiRAC-2 Data Centric
Talk layout
● Introduction to DiRAC
● The DiRAC computing systems
● What is DiRAC?
● What type of science is done on the DiRAC facility?
● Why do we need to copy data to RAL?
● Copying data to RAL – network requirements
● Collaboration between DiRAC and RAL to produce the archive
● Setting up the archiving tools
● Archiving
● Open issues
Introduction to DiRAC
● DiRAC (Distributed Research utilising Advanced Computing) was established in 2009 with DiRAC-1
● Supports research in theoretical astronomy, particle physics and nuclear physics
● Funded by STFC with infrastructure money allocated from the Department for Business, Innovation and Skills (BIS)
● The running costs, such as staff costs and
Introduction to DiRAC, cont’d
● 2009 – DiRAC-1
– 8 installations across the UK, of which COSMA-4 at the ICC in Durham is one. Still a loose federation.
● 2011/2012 – DiRAC-2
– major funding of £15M for e-Infrastructure
– bidding to host: 5 installations identified, judged by peers
– successful bidders faced scrutiny and interview by representatives of BIS to see if we could deliver by a tight deadline
Introduction to DiRAC, cont’d
● DiRAC has a full management structure.
● Computing time on the DiRAC facility is allocated through a peer-reviewed procedure.
● Current director: Dr Jeremy Yates, UCL
● Current technical director: Prof Peter Boyle,
The DiRAC computing systems
● Blue Gene – Edinburgh
● Cosmos – Cambridge
● Complexity – Leicester
● Data Centric – Durham
● Data Analytic – Cambridge
The Blue Gene @ DiRAC
● Edinburgh – IBM Blue Gene
– 98304 cores
– 1 Pbyte of GPFS storage
– designed around
COSMA @ DiRAC (Data Centric)
● Durham – Data Centric system – IBM iDataPlex
– 6720 Intel Sandy Bridge cores
– 53.8 TB of RAM
– FDR10 InfiniBand 2:1 blocking
– 2.5 Pbyte of GPFS storage (2.2 Pbyte used!)
Complexity @ DiRAC
● Leicester Complexity – HP system
– 4352 Intel Sandy Bridge cores
– 30 Tbyte of RAM
– FDR 1:1 non-blocking
– 0.8 Pbyte of Panasas
Cosmos @ DiRAC (SMP)
● Cambridge COSMOS – SGI shared memory system
– 1856 Intel Sandy Bridge cores
– 31 Intel Xeon Phi co-processors
– 14.8 Tbyte of RAM
– 146 Tbyte of storage
HPCS @ DiRAC (Data Analytic)
● Cambridge Data Analytic – Dell
– 4800 Intel Sandy Bridge cores
– 19.2 Tbyte of RAM
– FDR InfiniBand 1:1 non-blocking
What is DiRAC
● A national service run, managed and allocated by the scientists who do the science, funded by BIS and STFC
● The systems are built around and for the applications with which the science is done.
● We do not rival a facility like ARCHER, as we do not aspire to run a general national service.
● DiRAC is classed as a major research facility by
What is DiRAC, cont’d
● Long projects with a significant number of CPU hours, typically allocated for 3 years on a specific system. Examples for 2012 – 2015:
– Cosmos-dp002: ~20M CPU hours on Cambridge Cosmos
– Virgo-dp004: 63M CPU hours on Durham DC
– UK-MHD-dp010: 40.5M CPU hours on Durham DC
– UK-QCD-dp008: ~700M CPU hours on Edinburgh BG
– Exeter-dp005: ~15M CPU hours on Leicester Complexity
What type of science is done on DiRAC?
● For the highlights of science carried out on the DiRAC facility please see: http://www.dirac.ac.uk/science.html
● Specific example: large-scale structure calculations with the Eagle run
– 4096 cores
– ~8 GB RAM/core
– 47 days = 4,620,288 CPU hours
– 200 TB of data
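The CPU-hour figure for the Eagle run follows directly from the core count and the run length. A minimal sketch of that arithmetic, assuming all 4096 cores were busy for the full 47 days (the function name is illustrative only):

```python
def cpu_hours(cores: int, days: int) -> int:
    """CPU hours = number of cores x wall-clock hours."""
    return cores * days * 24


# Figures taken from the slide: 4096 cores for 47 days.
eagle_cpu_hours = cpu_hours(4096, 47)
print(f"{eagle_cpu_hours:,} CPU hours")  # prints 4,620,288 CPU hours
```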
Why do we need to copy data (to RAL)?
● Original plan: each research project should make provisions for storing its research data
– requires additional storage resource at researchers’ home institutions
– not enough provision; will require additional funds
– data creation considerably above expectation
Why do we need to copy data (to RAL) ?
● Research data must now be shared with, and available to, interested parties
● Installing DiRAC’s own archive requires funds, and currently there is no budget.
● We needed to get started:
– Jeremy Yates negotiated access to the RAL archive system
● Acquire expertise
● Identify bottlenecks and technical challenges
– submitted 2,000,000 files and created an issue at the file servers
● How can we collaborate and make use of previous experience?
● AND: copy data!
Copying data to RAL – network requirements
● network bandwidth – situation for Durham
– now:
● currently possible: 300-400 Mbyte/sec
● required investment and collaboration from DU CIS
● upgrade to 6 Gbit/sec to JANET – Sep 2014
● will be 10 Gbit/sec by end of 2015 – infrastructure already installed
– past:
Copying data to RAL – network requirements
● network bandwidth – situation for Durham
– investment to bypass the external campus firewall: two new routers (~£80k), configured for throughput with a minimal ACL, enough to safeguard the site
– deploying internal firewalls – part of the new security infrastructure, essential for such a venture
– Security now relies on front-end system of Durham
Copying data to RAL – network requirements
● Result for COSMA and GridPP in Durham
– guaranteed 2-3 Gbit/sec with bursts of up to 3-4 Gbit/sec (3 Gbit/sec outside of term time)
– pushed the network performance for Durham GridPP from bottom 3 in the country to top 5 of the UK GridPP sites
– achieves up to 300 – 400 Mbyte/sec throughput to
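The slides quote rates both in Mbyte/sec and in Gbit/sec, so a quick conversion helps to compare them. The sketch below assumes decimal units (1 Gbit = 1000 Mbit, 1 TB = 1,000,000 Mbyte) and a sustained rate with no protocol overhead. The 200 TB volume is the Eagle run output quoted earlier, and the function names are illustrative only.

```python
def mbyte_to_gbit(mbyte_per_sec: float) -> float:
    """Convert Mbyte/sec to Gbit/sec (8 bits per byte, decimal prefixes)."""
    return mbyte_per_sec * 8 / 1000


def transfer_days(terabytes: float, mbyte_per_sec: float) -> float:
    """Days needed to move a data volume at a constant sustained rate."""
    seconds = terabytes * 1_000_000 / mbyte_per_sec
    return seconds / 86_400


for rate in (300, 400):
    gbit = mbyte_to_gbit(rate)
    days = transfer_days(200, rate)
    print(f"{rate} Mbyte/sec = {gbit:.1f} Gbit/sec; 200 TB takes {days:.1f} days")
```

Under these assumptions 300-400 Mbyte/sec corresponds to 2.4-3.2 Gbit/sec, which is in line with the guaranteed and burst rates listed above.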
Collaboration between DiRAC and GridPP/RAL
● Durham Institute for Computational Cosmology (ICC) volunteered to be the prototype installation
● Huge thanks to Jens Jensen and Brian Davies: many emails were exchanged, many questions asked and many answers given.
● Resulting document: “Setting up a system for data archiving using FTS3” by Lydia Heck, Jens Jensen and Brian Davies
Setting up the archiving tools
● Identify appropriate hardware – could mean extra expense:
– need freedom to modify and experiment with it – cannot have HPC users logged in and working!
– free to apply the very latest security updates
– requires optimal connection to storage -
Setting up the archiving tools
● Create an interface to access the file/archiving service at RAL using the GridPP tools
– gridftp – Globus Toolkit – also provides Globus Connect
– Trust anchors (egi-trustanchors)
– voms tools (emi3-xxx)
Archiving?
● long-lived voms proxy?
– myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation
● How to create a proxy and delegation that last weeks, even months? – still an issue
● grid-proxy-init; fts-transfer-delegation
– grid-proxy-init -valid HH:MM
– fts-transfer-delegation -e time-in-seconds
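The two commands take the lifetime in different units: HH:MM for the proxy and seconds for the delegation. A small helper can keep the two consistent. This is only a sketch: the option names are copied from the slide (written with plain ASCII hyphens), the function name is hypothetical, and whether the services accept lifetimes of weeks is exactly the open issue noted above.

```python
def proxy_arguments(days: int) -> tuple[str, str]:
    """Build matching command lines for a proxy and a delegation of equal lifetime.

    The strings are only constructed here, not executed.
    """
    hours = days * 24
    seconds = hours * 3600
    grid_cmd = f"grid-proxy-init -valid {hours}:00"         # HH:MM, as on the slide
    fts_cmd = f"fts-transfer-delegation -e {seconds}"       # seconds, as on the slide
    return grid_cmd, fts_cmd


for command in proxy_arguments(14):  # two weeks, an arbitrary example value
    print(command)
```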
Archiving
● Large files – optimal throughput limited by network bandwidth
● Many small files – limited by latency; using the ‘-r’ flag to fts-transfer-submit to re-use the connection
● Transferred:
– ~40 Tbytes since 20 August
– ~2M files
– challenge to FTS service at RAL
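Why many small files are latency-bound can be seen with a simple cost model: total time is roughly a fixed per-file overhead times the number of files, plus the data volume divided by the bandwidth. The sketch below uses the ~40 Tbytes and ~2M files from the slide. The 0.5-second per-file overhead is a purely hypothetical value chosen for illustration, and decimal units are assumed.

```python
def estimated_seconds(n_files: int, total_mbytes: float,
                      mbyte_per_sec: float, per_file_overhead_s: float) -> float:
    """Simple model: per-file latency cost plus bandwidth-limited streaming time."""
    return n_files * per_file_overhead_s + total_mbytes / mbyte_per_sec


n_files = 2_000_000           # ~2M files (from the slide)
total_mbytes = 40_000_000     # ~40 Tbytes (from the slide), decimal units
average_mbytes = total_mbytes / n_files

streaming_only = estimated_seconds(n_files, total_mbytes, 400, 0.0)
with_overhead = estimated_seconds(n_files, total_mbytes, 400, 0.5)  # hypothetical overhead

print(f"average file size: {average_mbytes:.0f} Mbyte")
print(f"bandwidth only:    {streaming_only / 86_400:.1f} days")
print(f"with 0.5 s/file:   {with_overhead / 86_400:.1f} days")
```

With an average file size of only about 20 Mbyte, even a modest per-file overhead dominates the total, which is why re-using the connection matters.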
Open issues
● ownership and permissions are not preserved
● depends on a single admin to carry it out
● what happens when the content of directories changes? – completely new archive sessions?
● tries to archive all the files again but then ‘fails’ as the file already exists – should be more like rsync
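One way to get rsync-like behaviour is to decide on the client side which files are new or changed before submitting anything. The sketch below compares size and modification time, the same quick check rsync uses by default, against a locally kept manifest of what was already archived. The manifest and the function name are hypothetical and not part of the FTS tools.

```python
import os
import tempfile


def files_to_archive(root: str, manifest: dict[str, tuple[int, int]]) -> list[str]:
    """Return relative paths under root that are missing from the manifest
    or whose (size, mtime) differs from the recorded value."""
    pending = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            info = os.stat(path)
            if manifest.get(rel) != (info.st_size, int(info.st_mtime)):
                pending.append(rel)
    return sorted(pending)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "a.txt")
    with open(target, "w") as handle:
        handle.write("snapshot")
    first_pass = files_to_archive(root, {})          # nothing archived yet
    info = os.stat(target)
    recorded = {"a.txt": (info.st_size, int(info.st_mtime))}
    second_pass = files_to_archive(root, recorded)   # file is unchanged

print(first_pass, second_pass)
```

Only the files returned by such a check would then be handed to the transfer service, so unchanged files never trigger a "file already exists" failure.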
Conclusions
● With the right network speed we can archive the DiRAC data to RAL.
● The documentation has to be completed and shared with the system managers at the other DiRAC sites
● Each DiRAC site will have its own dirac0X account
● Start with and keep on archiving
● Collaboration between DiRAC and GridPP/RAL DOES work!