Brij Kishor Jashal
Email – brij.jashal@tifr.res.in
GARUDA - NKN Partner's Meet 2015
Outline:
• Scale of LHC computing (as an example of a Big Data network)

• CMS is designed to observe a billion (1x10^9) collisions/sec
• Data rate out of the detector is more than 1,000,000 gigabytes/sec (1 PB/s)
• 50 Gb/s, 7x24, is distributed to physics groups around the world
Higgs event in CMS: 2012
Nobel Prize in Physics 2013
Made possible by distributed computing
1 Higgs event out of 10^12 proton-proton collisions
Computing characteristics at LHC
Large numbers of independent events (millions/sec) – "job granularity"
Large data sets – mostly read-only
Modest I/O rates – a few MB/sec per processor
Modest floating-point requirement – HEP-SPEC06 performance (which matches batch jobs, ~10%)
Computation and storage needs cannot be met at a single site.
Therefore
Scale of LHC computing
• Connected by High speed wide area networks
• Linking more than 300 computer centers
• Providing > 340,000 cores
• To more than 2000 (active) users
Data growth prediction: LHC data volume is predicted to grow 10-fold over the next 10 years
4.7 billion events processed in 2 years (analysis, simulation, and HC tests)
Dedicated 4G link with best effort up to 10G (to be upgraded)
Data transfers to / from T2_IN_TIFR, 2013-08-01 00:00 to 2015-08-01 00:00 UTC:
Downloads – 928 TB, Uploads – 763 TB
India's R&E network (NKN)
Scalable and reliable
Connected organizations – 1070 and increasing
Ultra-high-speed core, starting with multiple 2.5/10 Gbps and progressively moving towards 40/100 Gbps
The core is complemented with a distribution layer
MPLS L3 VPN services within the country to many organizations / projects, some of them are…
• GARUDA
• Indian WLCG partner institutes (14 and increasing) (upgrading from 2G to 10G to TIFR)
• DAE L2 VPN
Providing important international connectivity via TEIN4 (Trans-Eurasia Information Network)
Expanding connectivity to Internet2 and other R&E networks in the pipeline
Data-intensive science (HEP as a prototype)
What made this possible – hardware and software development at all levels:
The underlying network (general)
Optical signal transport
Networking devices
Data transport (TCP, a "fragile workhorse")
Operating system and middleware evolution
Data organization (movement and management systems)
Security mechanisms
Network monitoring and testing
=> Continuous movement of data at ~150 Gb/s across three continents
Optical network technology
Dense wavelength division multiplexing (DWDM): 100 Gb/s per wave (optical channel) using "coherent" optical processing
Transport uses dual-polarization quadrature phase-shift keying (DP-QPSK) technology with coherent detection
• DP – two independent optical signals on the same frequency, one per polarization
• QPSK – encodes data by changing the signal phase relative to the optical carrier, reducing the symbol rate by half (twice the data per symbol)
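As a rough back-of-the-envelope check (ignoring FEC and framing overhead, which push the real baud rate somewhat higher), the required symbol rate for a 100 Gb/s DP-QPSK wave works out as follows:

# Rough estimate of the symbol rate for a 100 Gb/s DP-QPSK optical channel.
# Assumes no FEC/framing overhead; real systems run at a somewhat higher baud rate.
line_rate_bps = 100e9                               # 100 Gb/s per wave
bits_per_symbol = 2 * 2                             # QPSK (2 bits/symbol) x 2 polarizations
symbol_rate_baud = line_rate_bps / bits_per_symbol  # 25 Gbaud
print(f"Required symbol rate: {symbol_rate_baud / 1e9:.0f} Gbaud")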
Data transport
• TCP remains the workhorse of the internet, including for data-intensive science
• Reliability in the DNA of TCP
• Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network
Because TCP is very sensitive to packet loss (due to bit errors):
a single bit error can cause the loss of a 1-9 kB packet (depending on the MTU size), significantly reducing throughput
• Reason? Slow-start and congestion-avoidance algorithms added to TCP: packet loss is seen by TCP's congestion control as congestion
Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of packet loss is minimal
On a 10 Gb/s WAN path the impact of even a very low packet loss rate is enormous (~80x throughput reduction from TIFR to FNAL, where the latency is about 270 ms)
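One way to see why long RTTs amplify the cost of loss is the well-known Mathis model for loss-limited TCP throughput, roughly MSS / (RTT * sqrt(loss)). A minimal sketch with illustrative (not measured) numbers:

# Loss-limited TCP throughput estimate using the Mathis model:
#   throughput ~ MSS / (RTT * sqrt(loss_rate))
# The loss rate below is an assumption for illustration, not a measurement.
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate))

mss = 1460                        # bytes, typical for a 1500-byte MTU
loss = 1e-5                       # assumed packet loss rate
for name, rtt in [("LAN, 1 ms RTT", 0.001), ("TIFR-FNAL WAN, 270 ms RTT", 0.270)]:
    gbps = mathis_throughput_bps(mss, rtt, loss) / 1e9
    print(f"{name}: ~{gbps:.3f} Gb/s loss-limited throughput")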
Modern TCP stack
• Modern TCP stack – the kernel implementation of the TCP protocol
• Important to reduce sensitivity to packet loss while still providing congestion avoidance:
a mechanism that more quickly ramps back up to full speed after an error forces a reset to low bandwidth
• Linux supports pluggable congestion control algorithms. To get a list of the congestion control algorithms available in your kernel (kernel 2.6.20+), run the command below:
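On a Linux host the standard query is:

sysctl net.ipv4.tcp_available_congestion_control
# typically lists e.g. cubic and reno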
"TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
Optimal TCP buffer size is critical for your end-to-end path (RTT)
• Default TCP buffer sizes (on all operating systems) are typically too small for high-speed networks – typically 64 KB
Historically, TCP window size and parameters were host-global; now they can be a function of the application or the path to the destination
Auto-tuning TCP with pre-configured limits helps, but is still not optimal. Why? Because the upper limits are not adequate
Example –
10 Gbps flow with 100 ms latency (typical across India) => 128 MB of buffering
GridFTP would employ 2-8 parallel streams to do this efficiently
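The 128 MB figure is just the bandwidth-delay product (BDP) of the path, and splitting the transfer across N streams divides the per-stream buffer requirement; a minimal sketch of the arithmetic:

# Bandwidth-delay product (BDP): the buffering needed to keep a path full.
bandwidth_bps = 10e9              # 10 Gbps flow
rtt_s = 0.100                     # 100 ms round-trip latency
bdp_bytes = bandwidth_bps * rtt_s / 8
print(f"BDP: ~{bdp_bytes / 1e6:.0f} MB of buffering")   # ~125 MB, usually rounded up to 128 MB
# With N parallel streams (as GridFTP does), each stream needs only about BDP/N of buffer.
for n in (2, 4, 8):
    print(f"{n} streams: ~{bdp_bytes / n / 1e6:.0f} MB per stream")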
• BIC reaches maximum throughput much faster than older algorithms
• From Linux 2.6.19, CUBIC is the default (a refined version of BIC)
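A sketch of selecting the congestion control algorithm for a single socket (assuming Linux and Python 3.6+, which exposes TCP_CONGESTION); system-wide selection is normally done via the net.ipv4.tcp_congestion_control sysctl:

# Select the congestion control algorithm for one socket (Linux only).
# The name must appear in net.ipv4.tcp_available_congestion_control.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # e.g. b'cubic'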
http://www.slac.stanford.edu/~ytl/thesis.pdf
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
Throughput out to ~9000 km on a 10 Gb/s network: 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size
Recently there have been a number of enhancements to the Linux TCP code that can potentially have a large impact on TCP throughput. A number of these enhancements are described in this talk:
• David Miller, NetDev keynote, Feb 2015 (slides)
The most important changes include:
• per-flow pacing
• dynamic TSO sizing (TCP segmentation offload (TSO) is a hardware-assisted technique for improving the performance of outbound data streams)
• CDG congestion control (coming soon) (CAIA Delay-Gradient): https://lwn.net/Articles/645015/
If you want to see if these changes help in your environment, you can try the latest 'stable' kernel from kernel.org.
Data transfer tools
Parallelism is the key in data transfer tools
• Multiple parallel connections are always better than one connection
• Because the OS sustains multiple threads better than a single long-lived thread
Tools should be latency tolerant
• Especially for the WAN
• Tools like SCP/SFTP and the HPSS mover protocols assume latency typical of a LAN environment
Disk performance
• To get more than 500 Mb/s, storage solutions with multiple disks are required (RAID arrays and clusters)
Sample results on a typical network with RTT = 53 ms and network capacity = 10Gbps.
Tool Throughput
SCP 140 Mbps
Patched SCP (HPN) 1.2 Gbps
FTP 1.4 Gbps
GridFTP, 4 streams 5.4 Gbps
GridFTP, 8 streams 6.6 Gbps
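A minimal sketch of the multi-stream idea behind the GridFTP numbers above: split a transfer into byte ranges and fetch each range on its own TCP connection. The URL, file size, and stream count are illustrative placeholders, not part of any real tool:

# Illustrative parallel-stream download: N byte ranges on N TCP connections.
import concurrent.futures
import urllib.request

URL = "http://example.org/dataset.bin"   # hypothetical source supporting Range requests
STREAMS = 4                              # cf. GridFTP with 4 streams above
SIZE = 4 * 1024**3                       # assumed file size: 4 GiB

def fetch_range(start, end):
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()               # each worker rides its own TCP connection

chunk = SIZE // STREAMS
ranges = [(i * chunk, min((i + 1) * chunk, SIZE) - 1) for i in range(STREAMS)]
with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
    data = b"".join(pool.map(lambda r: fetch_range(*r), ranges))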
Globus GridFTP is the basis of most modern high-performance data movement systems.
• The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
Many other things to take care of:
Large MTUs (several issues)
• Jumbo frames (9000 bytes) on NKN
• CPU overhead is reduced significantly and efficiency is improved (rough numbers sketched after this list)
NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
Other OS tuning knobs
Many firewalls can't handle >1 Gb/s flows
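Rough arithmetic behind the jumbo-frame CPU-overhead point (illustrative, assuming a fully loaded 10 Gb/s link):

# Packets per second needed to fill a 10 Gb/s link at two MTU sizes.
# Fewer packets means fewer interrupts and less per-packet kernel work.
line_rate_bps = 10e9
for mtu_bytes in (1500, 9000):
    pps = line_rate_bps / (mtu_bytes * 8)
    print(f"MTU {mtu_bytes}: ~{pps / 1e3:.0f}k packets/sec")
# ~833k pkt/s at MTU 1500 vs ~139k pkt/s with 9000-byte jumbo frames (about 6x fewer)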
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
Network monitoring and testing
Composite view of the health of LHC peers from PerfSONAR at TIFR
PerfSONAR: a community effort to standardize measurement, distributed as a bundled package
PerfSONAR provides a standardized way to:
• Test, measure, and export
• Catalogue
• Access performance data from many different network domains
Deployed extensively throughout WLCG
Data organization (movement and management system)
Optimized networks and resources.
Advanced data transfer services and tools
GridFTP on RAID arrays
Standard access mechanisms for users and their analysis frameworks
Services like "any data, any time, anywhere" (~xrootd)
Universal data catalogue of distributed research data