Brij Kishor Jashal
Email – brij.jashal@tifr.res.in
GARUDA - NKN Partner's Meet 2015
Outline:
• Scale of LHC computing (as an example of a Big Data network)

• CMS is designed to observe a billion (1x10^9) collisions/sec
• Data rate out of the detector is more than 1,000,000 gigabytes/sec (1 PB/s)
• 50 Gb/s, 7x24, is distributed to physics groups around the world
Higgs event in CMS: 2012
Nobel Prize in Physics 2013
Made possible by distributed computing
1 Higgs event out of 10^12 proton-proton collisions
Computing characteristics at LHC
Large numbers of independent events (millions/sec) – "job granularity"
Large data sets – mostly read-only
Modest I/O rates – a few MB/sec per processor
Modest floating-point requirement – HEP-SPEC06 performance (which matches batch jobs, ~10%)
Computation and storage needs cannot be met at a single site.
Therefore
Scale of LHC computing
• Connected by High speed wide area networks
• Linking more than 300 computer centers
• Providing > 340,000 cores
• To more than 2000 (active) users
Data growth prediction: LHC data volume is predicted to grow 10-fold over the next 10 years
4.7 billion events processed in 2 years (analysis, simulation, and HC tests)
Dedicated 4G link with best effort up to 10G (to be upgraded)
Data transfers to / from T2_IN_TIFR, 2013-08-01 00:00 to 2015-08-01 00:00 UTC:
Downloads – 928 TB, Uploads – 763 TB
India's R&E network (NKN)
Scalable and reliable
Connected organizations – 1070 and increasing
Ultra-high-speed core, starting with multiple 2.5/10 Gbps and progressively moving towards 40/100 Gbps
The core is complemented with a distribution layer
MPLS L3 VPN services within the country to many organizations / projects, some of them are…
• GARUDA
• Indian WLCG partner institutes (14 and increasing) (upgrading from 2G to 10G to TIFR)
• DAE L2 VPN
Providing important international connectivity via TEIN4 (Trans-Eurasia Information Network)
Expanding connectivity to Internet2 and other R&E networks in the pipeline
Data-intensive science (HEP as a prototype)
What made this possible – hardware and software development at all levels:
The underlying network (general)
Optical signal transport
Networking devices
Data transport (TCP, a "fragile workhorse")
Operating system and middleware evolution
Data organization (movement and management systems)
Security mechanisms
Network monitoring and testing
=> Continuous movement of data at ~150 Gb/s across three continents
Optical network technology
Dense wavelength division multiplexing (DWDM): 100 Gb/s per wave (optical channel) using "coherent" optical processing
Transport uses dual-polarization quadrature phase-shift keying (DP-QPSK) technology with coherent detection
• DP – two independent optical signals on the same frequency, one per polarization
• QPSK – encodes data by changing the signal phase relative to the optical carrier, reducing the symbol rate by half (twice the data per symbol)
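As a rough back-of-the-envelope check (ignoring FEC and framing overhead, which push the real baud rate somewhat higher), the required symbol rate for a 100 Gb/s DP-QPSK wave works out as follows:

# Rough estimate of the symbol rate for a 100 Gb/s DP-QPSK optical channel.
# Assumes no FEC/framing overhead; real systems run at a somewhat higher baud rate.
line_rate_bps = 100e9                               # 100 Gb/s per wave
bits_per_symbol = 2 * 2                             # QPSK (2 bits/symbol) x 2 polarizations
symbol_rate_baud = line_rate_bps / bits_per_symbol  # 25 Gbaud
print(f"Required symbol rate: {symbol_rate_baud / 1e9:.0f} Gbaud")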
Data transport
• TCP remains the workhorse of the internet, including for data-intensive science
• Reliability in the DNA of TCP
• Using TCP to support the sustained, long distance, high data-rate flows of data-intensive science requires an error-free network
Because TCP is very sensitive to packet loss (due to bit errors):
a single bit error can cause the loss of a 1-9 kB packet (depending on the MTU size), significantly reducing throughput
• Reason? Slow-start and congestion-avoidance algorithms added to TCP: packet loss is seen by TCP's congestion control as congestion
Impact of packet loss on TCP
On a 10 Gb/s LAN path the impact of packet loss is minimal
On a 10 Gb/s WAN path the impact of even a very low packet loss rate is enormous (~80x throughput reduction from TIFR to FNAL, where the latency is about 270 ms)
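One way to see why long RTTs amplify the cost of loss is the well-known Mathis model for loss-limited TCP throughput, roughly MSS / (RTT * sqrt(loss)). A minimal sketch with illustrative (not measured) numbers:

# Loss-limited TCP throughput estimate using the Mathis model:
#   throughput ~ MSS / (RTT * sqrt(loss_rate))
# The loss rate below is an assumption for illustration, not a measurement.
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    return (mss_bytes * 8) / (rtt_s * sqrt(loss_rate))

mss = 1460                        # bytes, typical for a 1500-byte MTU
loss = 1e-5                       # assumed packet loss rate
for name, rtt in [("LAN, 1 ms RTT", 0.001), ("TIFR-FNAL WAN, 270 ms RTT", 0.270)]:
    gbps = mathis_throughput_bps(mss, rtt, loss) / 1e9
    print(f"{name}: ~{gbps:.3f} Gb/s loss-limited throughput")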
Modern TCP stack
• Modern TCP stack – the kernel implementation of the TCP protocol
• Important to reduce sensitivity to packet loss while still providing congestion avoidance:
a mechanism that more quickly ramps back up to full speed after an error forces a reset to low bandwidth
• Linux supports pluggable congestion control algorithms. To get a list of the congestion control algorithms available in your kernel (kernel 2.6.20+), run the command below:
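On a Linux host the standard query is:

sysctl net.ipv4.tcp_available_congestion_control
# typically lists e.g. cubic and reno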
"TCP tuning" commonly refers to the proper configuration of TCP windowing buffers for the path length
Optimal TCP buffer size is critical for your end-to-end path (RTT)
• Default TCP buffer sizes (on all operating systems) are typically too small for high-speed networks – typically 64 KB
Historically, TCP window size and parameters were host-global; now they can be a function of the application or the path to the destination
Auto-tuning TCP with pre-configured limits helps, but is still not optimal. Why? Because the upper limits are not adequate
Example –
10 Gbps flow with 100 ms latency (typical across India) => 128 MB of buffering
GridFTP would employ 2-8 parallel streams to do this efficiently
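The 128 MB figure is just the bandwidth-delay product (BDP) of the path, and splitting the transfer across N streams divides the per-stream buffer requirement; a minimal sketch of the arithmetic:

# Bandwidth-delay product (BDP): the buffering needed to keep a path full.
bandwidth_bps = 10e9              # 10 Gbps flow
rtt_s = 0.100                     # 100 ms round-trip latency
bdp_bytes = bandwidth_bps * rtt_s / 8
print(f"BDP: ~{bdp_bytes / 1e6:.0f} MB of buffering")   # ~125 MB, usually rounded up to 128 MB
# With N parallel streams (as GridFTP does), each stream needs only about BDP/N of buffer.
for n in (2, 4, 8):
    print(f"{n} streams: ~{bdp_bytes / n / 1e6:.0f} MB per stream")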
• BIC reaches maximum throughput much faster than older algorithms
• From Linux 2.6.19, CUBIC is the default (a refined version of BIC)
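A sketch of selecting the congestion control algorithm for a single socket (assuming Linux and Python 3.6+, which exposes TCP_CONGESTION); system-wide selection is normally done via the net.ipv4.tcp_congestion_control sysctl:

# Select the congestion control algorithm for one socket (Linux only).
# The name must appear in net.ipv4.tcp_available_congestion_control.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))  # e.g. b'cubic'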
http://www.slac.stanford.edu/~ytl/thesis.pdf
Even modern TCP stacks are only of some help in the face of packet loss on a long-path, high-speed network
Throughput out to ~9000 km on a 10 Gb/s network: 32 MB (auto-tuned) vs. 64 MB (hand-tuned) TCP window size
Recently there have been a number of enhancements to the Linux TCP code that can potentially have a large impact on TCP throughput. A number of these enhancements are described in this talk:
• David Miller, NetDev keynote, Feb 2015 (slides)
The most important changes include:
• per-flow pacing
• dynamic TSO sizing (TCP segmentation offload (TSO) is a hardware-assisted technique for improving the performance of outbound data streams)
• CDG congestion control (coming soon) (CAIA Delay-Gradient): https://lwn.net/Articles/645015/
If you want to see if these changes help in your environment, you can try the latest 'stable' kernel from kernel.org.
Data transfer tools
Parallelism is the key in data transfer tools
• Multiple parallel connections are always better than one connection
• Because the OS sustains multiple threads better than a single long-lived thread
Tools should be latency tolerant
• Especially for the WAN
• Tools like SCP/SFTP and the HPSS mover protocols assume latency typical of a LAN environment
Disk performance
• To get more than 500 Mb/s, storage solutions with multiple disks are required (RAID arrays and clusters)
Sample results on a typical network with RTT = 53 ms and network capacity = 10Gbps.
Tool Throughput
SCP 140 Mbps
Patched SCP (HPN) 1.2 Gbps
FTP 1.4 Gbps
GridFTP, 4 streams 5.4 Gbps
GridFTP, 8 streams 6.6 Gbps
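A minimal sketch of the multi-stream idea behind the GridFTP numbers above: split a transfer into byte ranges and fetch each range on its own TCP connection. The URL, file size, and stream count are illustrative placeholders, not part of any real tool:

# Illustrative parallel-stream download: N byte ranges on N TCP connections.
import concurrent.futures
import urllib.request

URL = "http://example.org/dataset.bin"   # hypothetical source supporting Range requests
STREAMS = 4                              # cf. GridFTP with 4 streams above
SIZE = 4 * 1024**3                       # assumed file size: 4 GiB

def fetch_range(start, end):
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()               # each worker rides its own TCP connection

chunk = SIZE // STREAMS
ranges = [(i * chunk, min((i + 1) * chunk, SIZE) - 1) for i in range(STREAMS)]
with concurrent.futures.ThreadPoolExecutor(max_workers=STREAMS) as pool:
    data = b"".join(pool.map(lambda r: fetch_range(*r), ranges))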
Globus GridFTP is the basis of most modern high-performance data movement systems.
• The newer Globus Online incorporates all of these plus small-file support, pipelining, automatic error recovery, third-party transfers, etc.
Many other things to take care of:
Large MTUs (several issues)
• Jumbo frames (9000 bytes) on NKN
• CPU overhead is reduced significantly and efficiency is improved (rough numbers sketched after this list)
NIC tuning
• Defaults are usually fine for 1GE, but 10GE often requires additional tuning
Other OS tuning knobs
Many firewalls can't handle >1 Gb/s flows
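Rough arithmetic behind the jumbo-frame CPU-overhead point (illustrative, assuming a fully loaded 10 Gb/s link):

# Packets per second needed to fill a 10 Gb/s link at two MTU sizes.
# Fewer packets means fewer interrupts and less per-packet kernel work.
line_rate_bps = 10e9
for mtu_bytes in (1500, 9000):
    pps = line_rate_bps / (mtu_bytes * 8)
    print(f"MTU {mtu_bytes}: ~{pps / 1e3:.0f}k packets/sec")
# ~833k pkt/s at MTU 1500 vs ~139k pkt/s with 9000-byte jumbo frames (about 6x fewer)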
The only way to keep multi-domain, international-scale networks error-free is to test and monitor continuously, end-to-end, to detect soft errors and facilitate their isolation and correction
Network monitoring and testing
Composite view of the health of LHC peers from PerfSONAR at TIFR
PerfSONAR: a community effort to standardize measurement, distributed as a bundled package
PerfSONAR provides a standardized way to:
• Test, measure, and export
• Catalogue
• Access performance data from many different network domains
Deployed extensively throughout WLCG
Data organization (movement and management system)
Optimized networks and resources.
Advanced data transfer services and tools
GridFTP on RAID arrays
Standard access mechanisms for users and their analysis frameworks
Services like "any data, any time, anywhere" (~xrootd)
Universal data catalogue of distributed research data