PUBLIC Page 1 of 36
Document Filename: BGII-DSA2-9-v0_1-FinalReport_IMCSUL.doc
Partner(s): KTH, EENet, VU, RTU, UIIP NASB
Lead Partner: IMCS UL
Document classification: PUBLIC
This document gives an overview of network resource provisioning in the Baltic countries and Belarus during the BalticGrid-II project. The deliverable focuses on issues identified by the SA2 activity, TCP test measurements and the methodology used to fine-tune TCP stack parameters on BG-II clusters, and compares data transfer speeds before and after the tuning. Network performance monitoring carried out by the SA2 team and the data collected during the project are also described and analysed in more detail.
Document review and moderation
Name Partner Date Signature
Released for moderation to Approved for delivery by
Version Date Summary of changes Author
0.1 18/02/2010 Plan and structure of the deliverable
0.5 9/03/2010 First draft Katrina Sataki
0.6 1/04/2010 Latest monitoring data added Martins Libins
0.7 12/04/2010 Latest TCP test results added Edgars Znots
1.0 12/04/2010 Finalised version Katrina Sataki, Baiba Kaskina, Guntis Barzdins
1.1 15/04/2010 Reviewed version Marcin Radecki
Contents
1. INTRODUCTION
1.1. PURPOSE
1.2. APPLICATION AREA
1.3. REFERENCES
1.4. TERMINOLOGY
2. NETWORK RESOURCE PROVISIONING IN BALTICGRID-II
2.1. OVERVIEW
2.2. ORGANISATIONAL STRUCTURE
2.2.1. Operational principles of CNCC
2.2.2. Operational principles of NNCCs
2.3. SERVICE LEVEL AGREEMENTS
3. NETWORK MONITORING
3.1. MONITORING PORTAL
3.2. EXPANSION OF THE BALTICGRID NETWORK
3.3. MONITORING DATA
3.4. CASE STUDY
4. TCP PERFORMANCE TESTS
4.1. MEASUREMENT LABORATORY
4.2. COLLECTED DATA
5. HIGHSPEED TCP AND SCALABLE TCP PERFORMANCE TESTS IN LABORATORY ENVIRONMENT
5.1.1. Test laboratory: high-performance network simulator
5.1.2. HighSpeed TCP
5.1.3. Scalable TCP
5.1.4. Test data comparison and analysis
6. TCP PERFORMANCE TUNING IN THE BALTICGRID-II NETWORK
6.1. TCP TUNING IN LABORATORY ENVIRONMENT
6.1.1. Baseline measurements
6.1.2. Performance of ScientificLinux in comparison with baseline measurements
6.1.3. TCP tuning results on production sites
6.2. TEST DATA COMPARISON AND ANALYSIS
6.3. TCP THROUGHPUT TESTS BETWEEN SITES
1. INTRODUCTION
1.1. PURPOSE
The purpose of this document is to give an overview of the network provisioning and the work done by the SA2 activity during the BalticGrid-II project. The document contains all essential information concerning network connectivity, network performance and the traffic monitoring activity of the project.
1.2. APPLICATION AREA
This document is intended as a summary of the work carried out by the SA2 activity during the BalticGrid-II project. It outlines the organisational model for cooperation between the network coordination centres in the partnering countries and gives an overview of the methodology used to measure network throughput and to test various TCP stack parameters in order to fine-tune the performance of the BalticGrid-II sites. It also analyses the network monitoring data gathered during the project.
1.3. REFERENCES
http://www.geant.net GÉANT Public Portal
Information about the current project GN3 and previous project GN2
h_High-Performance_GridPP_Feb04.ppt High Performance Networking for ALL
Presentation by Robin Tasker, DataTAG Project, RIPE-47, Amsterdam, 29 January 2004
1.4. TERMINOLOGY
AIRT Application for Incident Response Teams (by SurfNET)
BDP Bandwidth Delay Product
BIC Binary Increase Congestion control
CCA Congestion Control Algorithm
CMS Compact Muon Solenoid Experiment
CNCC Central Network Coordination Centre
Gbps Gigabits per second
GÉANT European Academic Network
LAN Local Area Network
LBE Less than Best Effort
LHC Large Hadron Collider
LTS Long Term Support
Mbps Megabits per second
NNCC National Network Coordination Centre
NREN National Research and Education Network
QoS Quality of Service
Reno TCP network congestion avoidance algorithm
RTT Round-trip Time
SA2 Network Provisioning activity of the BalticGrid-II project
SE Storage Element
SLA Service Level Agreement
SRM Storage Resource Management
TCP Transmission Control Protocol
WN Worker Node
2. NETWORK RESOURCE PROVISIONING IN BALTICGRID-II
2.1. OVERVIEW
The overall strategy and objective of the SA2 activity of the BalticGrid-II project was to ensure reliable network connectivity for Grid infrastructure in the Baltic countries and Belarus. During the first phase of the project – BalticGrid – it was identified that transfer of large data volumes is the bottleneck of the network resources available to Grid centres in the partnering countries. Therefore the main tasks for the SA2 activity were:
1) coordinate network services in all partnering countries, monitor network performance and its adherence to the Service Level Agreements (SLA) concluded between BalticGrid-II and NRENs;
2) ensure optimal network resources for applications that require transfer of large data volumes;
3) provide efficient handling of security incidents in close cooperation with SA1;
4) coordinate the work of SA2 with other projects.
In order to implement the objectives of the activity, a sustainable organisational model that would continue collaboration between national Grid initiatives after the end of the project was proposed and introduced in all the partnering countries. The Central Network Coordination Centre (CNCC) was established to work as a multi-domain umbrella for different entities and to initiate collaboration between the National Network Coordination Centres (NNCC) in the partnering countries.
CNCC also coordinated the transfer of knowledge accumulated during the BalticGrid project to the partners in Belarus by helping them to establish reliable network connectivity and by monitoring their network interface and clusters.
During the project CNCC monitored the network and its adherence to the signed SLAs (established already during the BalticGrid project and updated for the BalticGrid-II project) and advocated the interests of the Project and of Grid users with the National Research and Education Networks providing the network infrastructure.
To increase Grid usability for applications that require the transfer of large data volumes, the main emphasis was on performing network throughput tests, proposing solutions for network enhancement, and implementing and testing HighSpeed TCP and Scalable TCP, modifications of the standard TCP implementation intended to dramatically improve TCP performance in high-speed wide area networks. During the project an alternative solution was proposed and TCP stack tuning parameters were introduced. Results from the TCP tuning were compared with those of the HighSpeed TCP and Scalable TCP tests.
To increase the level of trust in Grid technologies and usage, SA2 carried out an extended risk analysis for the BalticGrid-II network. Based on this analysis, the Grid Acceptable Use Policy and Security Policy were developed and implemented. Finally, the Incident Handling and Response Policy was implemented and security incident handling was carried out by BalticGrid Incident Response Teams (BG-IRT) within each NNCC. Thus, the SA2 activity acted as the network-based second line of defence after SA1, which concentrated on resource centre management: SA2 solved incidents arising on the network and in this way increased users’ confidence.
2.2. ORGANISATIONAL STRUCTURE
Central Network Coordination Centre, established by the IMCS UL as a coordinating body for all NNCCs, had the following responsibilities:
• Since CNCC is a coordinating body, its main responsibility was to supervise, coordinate and lead all networking activities of the project. The main objective was to focus on the procedures of mutual collaboration between NNCCs in a way to ensure sustainability of the effective communication and problem solving even after the end of the project. One of the major tasks was to provide, test and implement all necessary tools to monitor network and storage elements 24x7 and elaborate procedures to identify and prevent network disruptions or other problems, as well as to alert about the possible bottlenecks in due time. This task required also study and research of observed network problems;
• CNCC was also a body that represented SA2 in the work with other activities within the project, as well as coordinated the work of SA2 with other projects and teams outside the consortium;
• In a case of observed network problems, possible security incidents or other events it was the responsibility of CNCC to notify all other NNCCs and control the measures taken by NNCC staff to prevent possible network failures;
• CNCC continued the work started during the BalticGrid project by ensuring adherence to the concluded SLAs, and represented the project in negotiations with Internet service providers (the National Research and Education Networks of the partnering countries);
• CNCC tested and approbated new solutions for full utilisation of available network bandwidth;
• CNCC was responsible for the development and implementation of security policy and handling of security incidents. These policies were developed and discussed between all interested parties, e.g., NNCCs, SA1, NRENs, users;
• It was also necessary to transfer knowledge accumulated during the BalticGrid project in respect of network resource provisioning to the new partners;
• CNCC maintained and developed the BalticGrid monitoring portal.
CNCC was established by IMCS UL, involving network experts of the Latvian NREN SigmaNet as well as IMCS UL Grid experts.
National Network Coordination Centres, established in Lithuania, Latvia, Estonia, and Belarus, had the following responsibilities during the project and will continue their work after the project to ensure sustainability of the Grid resources in the Baltics and Belarus:
• monitoring network and storage elements 24x7 in their respective countries;
• studying and identifying observed network problems, and notifying other NNCC teams about the status of the network, the problems identified and the solutions tested and implemented. This ensures that knowledge gathered during the project is distributed between all partners involved;
• notifying local Grid users about security incidents and the solutions implemented, as well as educating users about network security and the precautions each user has to take;
• communication with service providers about daily operations;
• testing and approbation of new solutions for non-standard HighSpeed or Scalable TCP data transfer and full utilisation of available network bandwidth;
• development and implementation of security policy and handling of security incidents according to CNCC directives.
The following partners are responsible for daily operations of NNCC:
• Lithuania – VU;
• Latvia – IMCS UL;
• Estonia – EENet;
• Belarus – UIIP NASB.
Within each NNCC, an incident response team (IRT) of at least 2 people was established as a separate group with the following responsibilities:
• actively participate in the process of policy development carried out by CNCC, implement the policy in their institutions, periodically review the provisions of the policy and suggest changes in order to ensure flexible and operational procedure that serve the needs of the user community;
• deal with security incidents, solve them and inform local community and other NNCCs about the incidents, using AIRT software;
• analyse security incidents and suggest security improvements.
Each partner responsible for an NNCC has assigned the incident response tasks to skilled professionals able to perform them to a high standard.
The following people have been assigned:
• Lithuania:
NNCC – Eduardas Kutka; Algimantas Juozapavicius.
IRT – Eduardas Kutka; Algimantas Juozapavicius.
• Latvia:
NNCC – Guntis Barzdins; Martins Libins; Edgars Znots.
IRT – Baiba Kaskina; Martins Libins.
• Estonia:
NNCC – Hardi Teder; Tõnu Raitviir.
IRT – Hardi Teder; Tõnu Raitviir.
• Belarus:
NNCC – Sergey Aneichik; Alexandr Lavrinenko; Andrey Volkov; Oleg Nosilovsky; Oleg Moiseichuk; Yury Zemtsov; Pavel Prokoshin.
IRT – Andrey Volkov; Oleg Nosilovsky; Oleg Moiseichuk.
The essential benefit of involving these experts was that they were familiar with the Grid environment and its demands as well as with the networking infrastructure in the country and the usual practice of security teams. Only together can these skills ensure successful Grid security incident handling.
2.2.1. Operational principles of CNCC
CNCC has been established at the IMCS UL. CNCC started its operation in July 2008. The main principles of CNCC operation were to:
1. Ensure effective collaboration between NNCCs by developing framework procedures and providing necessary facilities;
2. Foster information exchange between NNCC teams;
3. Maintain the monitoring portal for the global BalticGrid network resource availability;
4. Represent BalticGrid interests in collaboration with other projects, e.g., GN2 (since April 2009 GN3), EGEE;
5. Monitor SLA adherence and inform NNCC teams about the identified problems, as well as negotiate possible solutions with the respective NRENs.
2.2.2. Operational principles of NNCCs
NNCCs have been established in Lithuania, Latvia, Estonia, and Belarus and the main principles of their operation were to:
1. Ensure network resource provisioning in its respective country;
2. Monitor all parts of its network via the BalticGrid network monitoring portal;
3. Inform other team members and CNCC team about observed network anomalies and steps taken to solve identified problems;
4. Inform Grid users about steps taken to solve identified network problems.
In order to perform those operations, all partners received information about the BalticGrid monitoring portal and were familiarised with the BalticGrid network structure.
2.3. SERVICE LEVEL AGREEMENTS
The Service Level Agreement (SLA) of the BalticGrid-II project consists of two parts:
1) General provisions – this part sets out the substantive clauses of the cooperation between BASNET, the National Research and Education Network of Belarus, and the BalticGrid-II project. These provisions also serve as guidelines for interpreting the specific provisions of the Agreement;
2) Specific provisions - technical service parameters which can be offered and/or ordered.
General provisions of the SLA contain Administrative Level Objects (ALO) that include general information related to parties and the agreement itself:
• requisites of the Parties;
• purpose of the Agreement (includes the clause about the chosen SLA type);
• responsibilities of the parties: who is responsible for what, who are the contact persons, what are the expected reaction times of helpdesks etc.;
• modification and termination of the Agreement.
This part serves as a legal basis of the cooperation.
Specific provisions of the SLA contain Service Level Objects (SLO), i.e., regardless of the SLA type chosen, the Specific provisions list the actual technical service parameters which can be offered and/or ordered.
The Agreement was signed by the parties on 30 October 2008 during the first All-Hands Meeting of the BalticGrid-II project held in Minsk.
3. NETWORK MONITORING
3.1. MONITORING PORTAL
Information on the network infrastructure and available monitoring data is collected on-line at http://gridimon.balticgrid.eu. The first page shows the geographical position of the Baltic countries and Belarus.
This portal provides a convenient centralised view on the essential historic and real-time BalticGrid network parameters, thus serving as an excellent troubleshooting and SLA adherence monitoring portal.
Fig. 1 Homepage of gridimon.balticgrid.org
3.2. EXPANSION OF THE BALTICGRID NETWORK
When Belarus was added to the BalticGrid infrastructure, a detailed map of the Belarusian Grid network was developed, and it was kept up to date during the project.
Fig. 2 Belarusian Grid network infrastructure view
By clicking on each link it is possible to view the available graphical information.
Grid network infrastructure of BASNET was successfully included into the infrastructure of BalticGrid-II.
Service Level Agreement between BalticGrid-II and BASNET was signed. Belarusian Grid network diagram was successfully added to the BalticGrid-II network monitoring portal at http://gridimon.balticgrid.eu. It allows monitoring of SLA adherence.
Last year’s upgrades of the Belarusian network connectivity to the European Academic Network GÉANT2 and participation of BASNET in the GN3 project in the status of associate member ensured the sustainability of the network and enough resources for academic and, particularly, Grid users.
3.3. MONITORING DATA
The structure of the monitoring portal http://gridimon.balticgrid.eu among other features allows observing the relation between Link Load and Latency. The graphs below show monitoring data of Estonia, Latvia, Lithuania, and Belarus.
The SA2 team monitored the most significant parameters: link load, latency, jitter and packet loss. The portal http://gridimon.balticgrid.eu monitors each member of the BalticGrid-II project separately and presents the collected statistics graphically.
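As an illustration of how these four parameters relate to raw measurements, the following Python sketch derives them from a series of ping probes and an interface octet counter. It is not the portal's implementation: the function names, the input format and the jitter definition (mean absolute difference between consecutive round-trip times) are assumptions made for this example.

```python
from statistics import mean
from typing import Optional, Sequence


def summarise_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarise one series of ping probes; None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("no probes supplied")
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    latency = mean(answered) if answered else None
    # Jitter is taken here as the mean absolute difference between
    # consecutive answered probes (one common definition among several).
    diffs = [abs(b - a) for a, b in zip(answered, answered[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"latency_ms": latency, "jitter_ms": jitter, "loss_pct": loss_pct}


def link_load_pct(octets_delta: int, interval_s: float, capacity_bps: float) -> float:
    """Average link load over one polling interval, as a percentage of capacity."""
    return 100.0 * (octets_delta * 8 / interval_s) / capacity_bps


if __name__ == "__main__":
    # Five probes, one of them lost.
    print(summarise_probes([10.0, 12.0, None, 11.0, 10.0]))
    # 37.5 MB transferred in a 5-minute interval on a 1Gbps link.
    print(link_load_pct(37_500_000, 300, 1e9))
```

Plotting latency and link load from the same polling intervals side by side is what makes their relation visible in the graphs that follow.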
The graphs below (see Fig. 3 to Fig. 10) show statistics for the primary GÉANT channel of EENet, LITNET, SigmaNet and BASNET. These graphs cover the last month (March 2010) and show no outages or congestion in the networking infrastructure of BalticGrid-II.
Fig. 3. EENet GÉANT link utilisation
Fig. 4. EENet GÉANT SLA (rping)
Fig. 6. LITNET GÉANT SLA (rping)
Fig. 7. SigmaNet GÉANT link utilisation
Fig. 8. SigmaNet GÉANT SLA (rping)
Fig. 10. BASNET GÉANT SLA (rping)
Statistics of the primary GÉANT channel of EENet, LITNET, SigmaNet, and BASNET for the last year are shown in Fig. 11 to Fig. 18.
Fig. 11. EENet GÉANT link utilisation
Fig. 12. EENet GÉANT SLA (rping)
Fig. 14. LITNET GÉANT SLA (rping)
Fig. 15. SigmaNet GÉANT link utilisation
Fig. 17. BASNET GÉANT link utilisation
Fig. 18. BASNET GÉANT SLA (rping)
3.4. CASE STUDY
The monitoring portal gridimon.balticgrid.org shows statistics on the utilisation of the network infrastructure of BalticGrid-II.
These measurements are made in real time and shown in the graphs. A good example of how the SLA monitoring works is the effect of a networking problem on the SLA parameters. There was an outage in the link between Riga (LV) and Tallinn (EE) due to a fibre cut in the Televork network. As a result, all traffic between SigmaNet and EENet was re-routed over alternative paths: traffic from SigmaNet to EENet (and vice versa) went through Lithuania.
Average latency increased about sixfold, from 10ms to 60ms (see Fig. 19 and Fig. 20). However, this did not affect the quality of the network significantly, and thanks to the fast re-routing end users did not even notice it. This was a temporary problem: once the Televork fibre was fixed, all SLA parameters returned to their previous values (see the ticket below).
Fig. 19. SLA monitoring – case study
Fig. 20. SLA monitoring – case study-II
DANTE TICKET: 4476
Type: Dashboard Alarm
Status: Closed
Description: [rig-tal] EE-LV link down
Location A: Riga, LV
Location B: Tallinn, EE
Incident Start: 16/11/2009 08:03:10 (UTC)
Incident Resolved: 16/11/2009 19:37:20 (UTC)
Ticket Open: 16/11/2009 08:17:47 (UTC)
Ticket Close: 17/11/2009 19:37:28 (UTC)
EE-LV IP link down but traffic will be re-routed over alternative paths
HISTORY
Latest update: RESOLVED
Outage due to fibre cut in Televork network. Link has been up and stable since 13:25:46 UTC. Link will be monitored for the next 24 hours.
HENRY.AGBENU 16/11/2009 19:41:03 UTC
--- HISTORY UPDATE
Our logs show that the STM 64 link (id: DANTE/10GBWL/TALL-RIGA/001) between Tallinn and Riga is down since 08:10:58 UTC. Circuit provider has been contacted.
AKILESWARAN.RADHAKRISHNAN 16/11/2009 08:27:20 UTC
--- HISTORY UPDATE
Link is up and stable since 13:25:46 UTC. We are waiting for an RFO from the provider, Televork.
4. TCP PERFORMANCE TESTS
TCP is the primary transport protocol used by the vast majority of services included in the EGEE Grid middleware deployed in the BalticGrid-II project. Although multi-session TCP file transfer protocols such as GridFTP have been devised, their actual use is rather limited because of the sophisticated setup and tuning they require. BalticGrid-II took the initiative to investigate this problem in detail and to try to find a solution. A solution has indeed been found, deployed in BalticGrid, and confirmed by tests. It turned out to be different from what was envisioned: instead of switching to HighSpeed TCP or Scalable TCP in BalticGrid, the actual solution was found in TCP buffer optimization, which considerably increases the achievable throughput of a single TCP connection while maintaining compatibility with existing TCP protocol implementations.
Already in the first phase of the BalticGrid project it was correctly identified that the true “bottleneck” for file transfers in the grid infrastructure is not the GÉANT network itself (most BalticGrid resource centres have 1Gbps connectivity to GÉANT), but rather poor TCP protocol performance, which limits individual single-session file transfers to merely 60-250Mbps. During the first year of the project three breakthroughs made it possible to identify the cause of the low TCP performance and to propose a solution. During the second year, additional tests and TCP protocol fine-tuning were carried out in a test-lab environment and later implemented in pilot setups. After the results were verified in the pilot implementation, the devised TCP tuning was implemented in the BalticGrid-II production network and resource centres.
The three breakthroughs are the following:
The first was the creation of a reliable gigabit test-lab for TCP performance measurement at variable RTT delays. This turned out to be a non-trivial task, because many off-the-shelf delay simulation techniques either produce results inconsistent with real networks or cannot delay gigabit traffic flows without distortion. An example of such a non-conforming approach is the Linux traffic control ('tc') tools, which were tried initially. The working gigabit traffic delaying solution turned out to be the FreeBSD firewall ('ipfw') configured as a Layer 2 bridge rather than as a Layer 3 router (the Linux approach). The test results achieved with this solution were compared with real international gigabit network tests and matched them. This made it possible to reliably test a large number of TCP implementations and configurations in a lab environment.
The second was tracing the TCP performance variations to the Linux kernel version. The initial theory was that the better than average TCP performance in some BalticGrid-II clusters was due to Intel® I/O Acceleration Technology (Intel® IOAT), coincidentally present in one of the high-performing BalticGrid-II nodes. Thorough tests in the test-lab instead linked the variations to the different versions of the Linux kernel installed on different BalticGrid-II clusters. By changing only the Linux kernel version it was possible to re-create in the lab both the low TCP performance characteristic of most BalticGrid-II clusters and the high TCP performance observed in some BalticGrid-II nodes.
The last breakthrough came from a careful study of the Linux TCP stack and its tuning parameters, as well as close inspection of the TCP setup in BG-II clusters. It was discovered in the test lab that the default TCP stack for Scientific Linux has very small RX/TX buffer sizes (maximum TCP window), which play a crucial role in TCP performance. It was found that as few as 4 configuration lines for the TCP stack, inserted in the system configuration file, resolve the TCP performance bottleneck up to gigabit speeds. Later it was confirmed that the KTH clusters, which achieved exceptionally good TCP performance despite using a low-performing Linux kernel version, also have almost identical lines for improved TCP performance. The KTH clusters had these lines inserted a while back, a practice largely forgotten by the general community. The following 4 lines must be put in the /etc/sysctl.conf file to improve TCP performance:
net.core.rmem_max = 8738000
net.core.wmem_max = 8738000
net.ipv4.tcp_rmem = 8192 4369000 8738000
net.ipv4.tcp_wmem = 8192 4369000 8738000
These lines relate to minimal, default and maximum allocatable TCP window sizes, and the following tests on other BG-II nodes with various Linux kernel versions have confirmed that the same 4 TCP configuration lines indeed resolve the TCP performance bottleneck up to gigabit speeds.
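The link between buffer size and throughput follows from the bandwidth-delay product (BDP): to keep a path full, a sender must have at least bandwidth × RTT bytes in flight. The Python sketch below works through that arithmetic for the values quoted above. It treats the whole socket buffer as usable TCP window, which is a simplifying assumption (the kernel reserves part of the buffer for bookkeeping), and the function names are illustrative.

```python
def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to fill a path of the given speed and RTT."""
    return bandwidth_bps / 8 * rtt_ms / 1000


def max_rtt_ms(window_bytes: float, bandwidth_bps: float) -> float:
    """Largest RTT at which one window per round trip still fills the path."""
    return window_bytes * 8 / bandwidth_bps * 1000


if __name__ == "__main__":
    gigabit = 1e9
    # Upper end of the typical inter-site latency range in BalticGrid.
    print(f"BDP at 1Gbps, 35ms RTT: {bdp_bytes(gigabit, 35):.0f} bytes")
    # Maximum buffer size from the four sysctl lines.
    print(f"RTT covered by 8738000 bytes at 1Gbps: {max_rtt_ms(8738000, gigabit):.1f} ms")
```

Under these assumptions a 1Gbps path at 35ms RTT needs about 4.4MB in flight, close to the default value in the lines above, and the 8738000-byte maximum covers round-trip times up to roughly 70ms.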
4.1. MEASUREMENT LABORATORY
To investigate the possible influence of network latency or the server TCP/IP stack on data transfer speeds, SA2 created a test laboratory where network throughput measurements could be made. The measurement laboratory made it possible to simulate network latency and, optionally, jitter and packet loss. It was decided to place an intermediate server between the two servers exchanging data. FreeBSD was chosen as the alternative OS for this server, since FreeBSD has tools with functionality similar to that of Linux 'tc': the 'dummynet' network simulator module, managed through the 'ipfw' tools.
FreeBSD 'dummynet' works on the Ethernet frame level, thus it acts more as a passive switch. This implementation consumes considerably less CPU power on the host that acts as latency creator (network simulator), thus the results are consistent.
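For orientation, the Python sketch below generates the kind of 'ipfw' commands typically used to send traffic through a 'dummynet' pipe with a fixed delay and an optional packet loss rate. The rule and pipe numbers are arbitrary, the helper name is hypothetical, and these are not the exact commands used in the SA2 laboratory; the bridge setup itself is not shown. How often a packet passes the pipe depends on the bridge and firewall settings, so the resulting RTT should be checked with ping before any measurement.

```python
from typing import List


def dummynet_commands(one_way_delay_ms: int, loss_rate: float = 0.0,
                      pipe: int = 1, rule: int = 100) -> List[str]:
    """Build ipfw commands that pass all IP traffic through one dummynet pipe.

    one_way_delay_ms is the delay added each time a packet crosses the pipe;
    loss_rate is a probability between 0 and 1 (dummynet's 'plr' option).
    """
    if one_way_delay_ms < 0 or not 0.0 <= loss_rate <= 1.0:
        raise ValueError("delay must be >= 0 and loss_rate within [0, 1]")
    config = f"ipfw pipe {pipe} config delay {one_way_delay_ms}ms"
    if loss_rate > 0:
        config += f" plr {loss_rate}"
    return [f"ipfw add {rule} pipe {pipe} ip from any to any", config]


if __name__ == "__main__":
    # 10ms per crossing, i.e. about 20ms added RTT if each direction
    # crosses the pipe exactly once (an assumption, see above).
    for command in dummynet_commands(10):
        print(command)
```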
4.2. COLLECTED DATA
After performing various tests using FreeBSD server with 'dummynet' tools between two Linux servers it was found that Scientific Linux 4, if used with default kernel and TCP/IP stack configuration, performs very poorly in network environments with latencies (round trip time, RTT) above 1ms. Results were compared with other Linux kernels and distributions, as well as with other TCP/IP congestion control algorithms.
Preliminary results are shown in Fig. 21. The graph shows TCP bandwidth for various Linux kernels and distributions depending on network latency (RTT) and TCP send/receive buffer sizes. As can be seen, the default Scientific Linux 4 installation underperforms as soon as network latency becomes greater than 1-2ms. Typical network latency between sites in BalticGrid is 8-35ms.
After tuning basic TCP/IP stack parameters, Scientific Linux 4 with default Linux kernel 2.6.9 with traditional BIC or Reno congestion control algorithm performs up to 10-20 times faster in latency range of 6-40ms, the important latency range for the BalticGrid-II applications.
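The shape of these results can be approximated with a simple ceiling: a single TCP session cannot deliver more than one window per round trip, nor more than the link capacity. The Python sketch below tabulates that ceiling for a 1Gbps link. The tuned window is the maximum from the configuration lines in this report; the small window of 128 KiB is a hypothetical stand-in for an untuned system, chosen only for illustration and not a measured Scientific Linux default.

```python
TUNED_WINDOW = 8738000      # maximum buffer size from the sysctl lines in this report
SMALL_WINDOW = 128 * 1024   # hypothetical untuned maximum, for illustration only


def ceiling_mbps(window_bytes: int, rtt_ms: float, link_mbps: float = 1000.0) -> float:
    """Upper bound on single-session throughput: one window per round trip,
    capped by the link capacity."""
    window_limit = window_bytes * 8 / (rtt_ms / 1000.0) / 1e6
    return min(link_mbps, window_limit)


if __name__ == "__main__":
    print("RTT ms   small window   tuned window   (Mbps)")
    for rtt in (1, 2, 6, 8, 20, 35, 40):
        print(f"{rtt:6d} {ceiling_mbps(SMALL_WINDOW, rtt):14.1f} "
              f"{ceiling_mbps(TUNED_WINDOW, rtt):14.1f}")
```

The model ignores packet loss, congestion control dynamics and host limits, so it reproduces only the qualitative picture: the small window stops filling the link once RTT exceeds about 1ms, while the tuned window keeps the ceiling at link speed across the whole 6-40ms range.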
Fig. 21. TCP bandwidth for various Linux kernels
Fig. 22 illustrates the TCP performance in BalticGrid-II at the beginning and at the end of the project.
Fig. 22. Single-session TCP performance of some BalticGrid clusters before and after TCP tuning (as measured from site in Latvia)
This figure clearly shows that the proposed methods have indeed improved single-session TCP performance in BalticGrid many times over. Effectively, the TCP performance “bottleneck” has been removed and transfer rates are now limited primarily by the actual network capacity, which for the majority of the BalticGrid resource centres is 1Gbps.
Not all BalticGrid resource centres have been able to achieve close to gigabit TCP performance, as seen in Fig. 22. However, the limited performance at these sites is due purely to limited network capacity: all affected sites have only 100Mbps GÉANT connectivity, and the Belarus resource centre has even less capacity towards GÉANT (as seen in Fig. 23).
5. HIGHSPEED TCP AND SCALABLE TCP PERFORMANCE TESTS IN LABORATORY ENVIRONMENT
5.1.1. Test laboratory: high-performance network simulator
The achieved TCP performance improvement throughout the BalticGrid network was made possible solely due to the early decision made by the SA2 team to employ a powerful network simulator for TCP performance testing rather than attempt to carry out the tests on a live pan-European network (the success of this strategy later was confirmed by the successful deployment in the live network). The lack of full control over test-conditions in live network and inability to exactly duplicate earlier measurements made scientific argument impossible due to endless guesswork about side-effects during each measurement.
Meanwhile, creating a reliable network simulator for gigabit speeds and variable pan-European RTT latencies was a non-trivial task due to two factors:
• Other teams around the world use expensive bundles of optical fibre thousands of kilometres long to reliably simulate the performance of a real network (e.g., the WAN in Lab (WiL) Project at Caltech has a 2400+km long-haul fibre optic test bed, designed specifically to aid FAST TCP research and provide a TCP benchmarking facility).
• The standard network latency simulation software “tc” (traffic control) included with the Linux operating system produced highly inaccurate simulation results, inconsistent with real networks, and was thus useless for TCP performance investigation. These poor results cast doubt on whether a PC-based simulator can reliably simulate a gigabit network at significant latencies at all.
Despite this discouraging set of obvious choices, the BalticGrid team was able to come up with a working gigabit traffic delaying solution. The breakthrough was found in the FreeBSD firewall ('ipfw') configured as a Layer 2 bridge rather than as a Layer 3 router (the Linux approach). The test results achieved with this solution were compared with real international gigabit network tests and matched them. Moreover, the Layer 2 implementation of the network simulator made it fully agnostic towards the kind of traffic to be delayed and made it possible to reliably simulate the effects of multiple concurrent TCP sessions as well. This turned out to be crucial for selecting a TCP implementation appropriate for the BalticGrid network and GÉANT in general, because the LBE (“Less than Best Effort”) service, which was crucial for non-disruptive use of “aggressive” TCP variations such as HighSpeed TCP and Scalable TCP, had been discontinued in the GÉANT network. This achievement finally made it possible to reliably test a large number of TCP implementations and configurations in a lab environment and to proceed with deployment only after thorough studies.
The test laboratory setup and TCP performance test results have been reported in the TERENA Networking Conference 2009 (http://tnc2009.terena.org/).
5.1.2. HighSpeed TCP
Although HighSpeed TCP was measured to be capable of providing consistently high throughput at various network latencies, it also showed unacceptable tendency to degrade performance of regular TCP sessions for other network users as illustrated in Fig. 24 .
Fig. 24. HighSpeed TCP degrades performance of regular TCP in the same network
5.1.3. Scalable TCP
Although Scalable TCP was measured to be capable of providing consistently high throughput at various network latencies, it also showed an unacceptable tendency to degrade the performance of regular TCP sessions for other network users, as illustrated in Fig. 25.
5.1.4. Test data comparison and analysis
The above results are consistent with conclusions from other teams and confirm that both HighSpeed TCP and Scalable TCP can be used non-disruptively only on private networks, as long as competing TCP versions in the network experience no packet loss, or in conjunction with some sort of traffic-management solution, such as the LBE (Less than Best Effort) traffic class that was previously available in the GÉANT network but was discontinued in 2008.
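To illustrate why these variants are considered “aggressive”, the following minimal sketch compares how many round trips a sender needs to regain its pre-loss congestion window after a single loss. It is not taken from the BalticGrid measurements: the Scalable TCP constants (a = 0.01, b = 0.125) are the values commonly quoted for the Scalable TCP proposal, the 1500-byte packet size and the 1 Gbit/s, 50 ms path are illustrative assumptions, and the function names are hypothetical.

```python
import math


def reno_recovery_rtts(cwnd_packets: int) -> int:
    """Round trips for standard (Reno-style) TCP to regrow to cwnd_packets.

    Assumes the window is halved on loss and then grows by one packet per RTT.
    """
    return cwnd_packets - cwnd_packets // 2


def scalable_recovery_rtts(a: float = 0.01, b: float = 0.125) -> int:
    """Round trips for Scalable TCP to regrow to its pre-loss window.

    Assumes the window is cut by fraction b on loss and then grows by a per
    acknowledged packet, i.e. by a factor of (1 + a) per RTT.
    """
    return math.ceil(math.log(1 / (1 - b)) / math.log(1 + a))


# Assumed example path: 1 Gbit/s, 50 ms RTT, 1500-byte packets.
RTT_S = 0.050
cwnd = int(1e9 * RTT_S / (1500 * 8))  # packets needed to fill the path

reno = reno_recovery_rtts(cwnd)
scalable = scalable_recovery_rtts()
print(f"window to fill the path: {cwnd} packets")
print(f"standard TCP recovery:   {reno} RTTs ({reno * RTT_S:.0f} s)")
print(f"Scalable TCP recovery:   {scalable} RTTs ({scalable * RTT_S:.1f} s)")
```

Under these assumptions a standard TCP session needs on the order of two thousand round trips to recover, whereas Scalable TCP needs about fourteen regardless of window size, which is one way to see why it can crowd out regular TCP sessions sharing the same path.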
The deployment of HighSpeed TCP or Scalable TCP in BalticGrid network as a cure against the low TCP protocol performance (which was already then correctly identified as the actual network performance “bottleneck”) was envisioned in 2007 during planning stages of BalticGrid-II project when LBE service was still available in GÉANT network. LBE service discontinuation by GÉANT has been correctly identified as one of the risks for SA2 activity.
Fortunately, the powerful network testing laboratory created by the SA2 team made it possible to design a different, less obvious, but much more elegant solution for the TCP performance “bottleneck” problem.
6. TCP PERFORMANCE TUNING IN THE BALTICGRID-II NETWORK
6.1. TCP TUNING IN LABORATORY ENVIRONMENT
6.1.1. Baseline measurements
To begin mitigation of TCP performance bottlenecks between BalticGrid sites, first a set of baseline TCP performance measurements had to be taken. This would allow later comparison of TCP performance between different Linux distributions present at BalticGrid sites, as well as at different parameter setups and network latencies.
Ubuntu 8.04 LTS (kernel 2.6.24) was chosen for baseline measurements, because at the time this distribution had the latest stable kernel, thus representing state-of-the-art for Linux TCP stack performance. Fig. 26 shows baseline TCP performance measurements for Ubuntu 8.04 at different TCP setups – default setup, setup with TCP buffers increased to 8MB, and setup with TCP buffers increased to 16MB.
As can be seen from the graph above, the default TCP parameters for Ubuntu 8.04, especially the TCP buffer size, are not optimal for high latency networks. Increasing TCP buffers to at least 8MB significantly improves single session TCP throughput for the network latency range of 10-50ms. This range represents typical latencies in WANs; most latencies measured between BalticGrid sites are in the range of 8-35ms. It seems, though, that doubling the TCP buffer size again for Ubuntu 8.04 LTS does not necessarily improve TCP performance further. On the contrary, throughput seems to degrade. Performing these baseline measurements gave several important results:
1. Measurements showed that our network delay simulator works and has an effect on TCP throughput
2. Even state-of-the-art Linux often has less than optimal TCP stack parameters for achieving high throughput
3. Even a procedure as simple as increasing TCP buffers by changing a few system configuration lines can have a considerable effect on overall throughput
4. More is not always better – although in theory a linear increase of TCP buffers should be reflected in proportionally higher TCP throughput on high-latency networks, the Linux TCP stack itself may be unable to scale with increased buffers
Given the results and conclusions mentioned above, it was decided that the existing network simulator (FreeBSD 7.1, ipfw with “dummynet”, 1000Hz kernel) is suitable for further measurement of the Linux distributions and configurations used in BalticGrid.
Also, a good baseline for comparing further TCP tuning results was set, using Ubuntu 8.04 LTS distribution with TCP buffers set to 8MB.
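The link between latency and the required buffer size can be sketched with the bandwidth-delay product (BDP): a single TCP session can keep a path full only if its buffer holds at least bandwidth × round-trip time. The snippet below is an illustrative calculation only, assuming a 1 Gbit/s path and treating the simulated latency as the round-trip time; the function name is hypothetical.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a path of this bandwidth full."""
    return bandwidth_bps * rtt_ms / 1000 / 8


LINK_BPS = 1_000_000_000  # assumed 1 Gbit/s path

# Latencies chosen to span the ranges discussed in the text.
for rtt_ms in (1, 8, 35, 50):
    mb = bdp_bytes(LINK_BPS, rtt_ms) / 1e6
    print(f"RTT {rtt_ms:>2} ms -> BDP {mb:.2f} MB")
```

Under these assumptions the BDP at 50 ms is 6.25 MB, which is consistent with 8MB buffers being sufficient for the 10-50ms latency range.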
6.1.2. Performance of ScientificLinux in comparison with baseline measurements
After correct operation of the FreeBSD based network simulator was verified and baseline performance measured for Ubuntu 8.04 LTS, it was possible to measure the performance of ScientificLinux and compare it with the baseline TCP throughput. Fig. 27 shows TCP throughput for ScientificLinux in comparison with the baseline measurements. ScientificLinux is the primary operating system for gLite middleware and as such is used by virtually all sites.
As it can be seen by the TCP throughput trend of ScientificLinux default configuration, the TCP stack is configured in such a manner that single session TCP throughput drops from near gigabit speed to less than 500Mbit/s as soon as network latency becomes greater than 1ms, and to less than 100Mbit/s as soon as latency increases above 5ms.
These results were at first thought to be incorrect and caused by a malfunctioning network latency simulator or by other equipment or setup details, but repeated measurements with the same hardware and different Linux kernel versions and TCP stack parameters verified that the data acquired by these measurements is indeed correct, and that ScientificLinux significantly underperforms with regard to TCP throughput. More specifically, it seems that ScientificLinux has fixed and very small TCP buffers, whereas all major distributions usually use TCP buffer autotuning, increasing and decreasing the buffer size for optimal performance and resource usage. The effect, thus, is that for all network connections with latencies above 1ms (essentially all non-LAN connections) ScientificLinux is unusable for high-speed single-session TCP data transfer.
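The shape of this throughput drop is what one would expect from a fixed window: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window size divided by RTT. The sketch below illustrates that bound. The 64 KiB window is an assumption chosen for illustration, not a value reported by the measurements, and the function name is hypothetical.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float,
                        link_mbps: float = 1000.0) -> float:
    """Upper bound on single-session TCP throughput in Mbit/s.

    One window per round trip, capped at the link speed.
    """
    per_rtt_mbps = window_bytes * 8 * 1000 / rtt_ms / 1e6
    return min(link_mbps, per_rtt_mbps)


WINDOW = 64 * 1024  # assumed fixed window in bytes, for illustration only

for rtt_ms in (0.1, 1, 5, 35):
    bound = window_limited_mbps(WINDOW, rtt_ms)
    print(f"RTT {rtt_ms:>4} ms -> at most {bound:.1f} Mbit/s")
```

With this assumed window the bound falls to roughly 500 Mbit/s at 1 ms and roughly 100 Mbit/s at 5 ms, which is of the same order as the measured ScientificLinux behaviour.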
After identifying poor TCP throughput in case of ScientificLinux, the system parameters were adjusted to enable larger TCP buffers. After tests with several different buffer sizes it was concluded that, just like in case of Ubuntu 8.04 LTS, optimum upper values for TCP buffers that still provide TCP throughput increase in ScientificLinux 4.7 are 8MB. As it can be seen in the Figure 4-2, both
setups of Ubuntu 8.04 LTS and ScientificLinux 4.7 achieve almost identical throughput results across the latency range, although ScientificLinux uses a much older kernel (2.6.9) with the BIC congestion control algorithm, whereas Ubuntu has the latest stable kernel 2.6.24 and uses the Reno congestion control algorithm. These measurements provide some interesting results and conclusions:
1) Default ScientificLinux 4.7 TCP configuration underperforms for non-LAN scenarios
2) Bottleneck for TCP throughput in ScientificLinux is caused by small fixed TCP buffers
3) This bottleneck can be removed by simply increasing TCP buffers
4) It is possible to achieve baseline TCP throughput performance for ScientificLinux
5) Newer Linux kernels do not necessarily have a better TCP stack; they are just (usually) configured with better (but not optimal) parameters in comparison to ScientificLinux
6) Usage of different congestion control algorithms (Reno vs. BIC) does not play any significant role in single-session TCP throughput for packet-loss free networks
6.1.3. TCP tuning results on production sites
After all the simulations and tuning of TCP performance carried out in the test laboratory, it was concluded that most BalticGrid sites are using the default TCP stack configuration provided by the corresponding distribution maintainers (Scientific Linux, RedHat, SUSE, etc.) or by the YAIM configuration tool used for installing and configuring specific gLite middleware components on the hosts. These defaults are not appropriate for networks with a high BDP (Bandwidth Delay Product), and it is necessary to increase the available buffers to compensate for the larger BDP.
As was later discovered, the YAIM configuration tool (a tool to configure the middleware developed by the EGEE project) does increase the TCP buffer size up to 2MB for Storage Element (SE) type hosts, but not for Worker Node (WN) hosts. This decision to tune TCP buffers for SE hosts and leave the default underperforming configuration for WN hosts seems to be a deliberate design choice; it does not cause, and in fact prevents, network performance bottlenecks for CMS/Atlas jobs. The CMS/Atlas research workflow implies that before the calculations, data is first transferred from Tier1 to Tier2 sites. These transfers are performed between Storage Elements at Tier1 and Tier2 sites, thus only the TCP buffers for these nodes have to be tuned. Worker Nodes at these Tier2 centers that perform calculations then fetch data from their respective local Storage Element, and since this transfer takes place in a LAN with RTT less than 1ms, the default TCP buffer configuration works as required. Moreover, the deliberate decision not to tune TCP parameters at Worker Nodes ensures that no other job at the site can cause network throughput problems for CMS/Atlas jobs if an LHC-unrelated job is also running at the site and trying to download a large amount of data to Worker Nodes from distant Storage Elements.
# TCP buffer sizes configured by YAIM on Storage Elements
net.ipv4.tcp_rmem = 131072 1048576 2097152
net.ipv4.tcp_wmem = 131072 1048576 2097152
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
But, although this design decision ensures that SE-to-SE data transfer for LHC experiments at Tier2 sites would not be limited by running jobs that transfer data from non-local SEs, this model is not suited to the workflows of most other applications and users in BalticGrid-II. The majority of sites in BalticGrid-II are not performing heavy calculations within LHC experiments, and the majority of CPU hours and submitted jobs come from LHC-unrelated research. These users rely on LFC and SRM services to choose the closest Storage Element for initial data storage (upload) before the execution of computing jobs. But since the jobs are then distributed across the whole grid infrastructure, the resulting computing jobs again use the LFC and SRM services to download the input data to the Worker Node from a Storage Element that is no longer the local site SE. The exact same process repeats upon job exit, when the job output (results) is stored on the job-local SE and then has to be downloaded over a great distance from the user's User Interface. Theoretically, all grid applications, if following the CMS/Atlas job design, should always perform an additional data transfer step by transferring data from the user-local Storage Element to the job-local Storage Element, and only then download the data over the LAN from the job-local SE to the WN. Unfortunately, since most grid users are unaware of the intricate inner design and limitations of the grid infrastructure, this two-stage data transfer process is never practiced. Also, in several workflows a direct single stage transfer is necessary due to data transfer between services other than SRM or GridFTP.
Thus, since the workflow for most BalticGrid-II users requires or prefers single stage data transfer, but at the same time there are sites, namely T2_Estonia at KBFI, that do require network throughput privilege for LHC experiments, a decision was made that each site can choose either to leave its hosts
tuned for LHC experiment workflows, or to apply the TCP parameter tuning created by IMCS UL to enable more efficient single stage data transfer directly between Worker Nodes and non-local Storage Elements.
Below is the tuned Linux TCP parameter set for those sites that choose to optimize data transfer for single stage workflows. Just by setting these parameters on Worker Nodes (optionally also on other host types for improved input/output sandbox transfer, etc.) it is possible to significantly increase single-session TCP performance. The best performance is achieved when the hosts at both data transfer endpoints use these settings. Also, this configuration does not change the CCA (Congestion Control Algorithm) or other parameters influencing the operation of the TCP stack, thus these improvements are equally effective on the various Linux distributions and CCAs (Reno/BIC/CUBIC/Vegas) used in BalticGrid and provide maximum compatibility with already deployed applications and systems.
# TCP buffer sizes configured according to IMCS UL recommendations
net.core.rmem_max = 8738000
net.core.wmem_max = 8738000
net.ipv4.tcp_rmem = 8192 4369000 8738000
net.ipv4.tcp_wmem = 8192 4369000 8738000
All that was necessary for the BalticGrid site administrators, after the TCP stack improvement instructions were posted, was to add these four lines to the /etc/sysctl.conf configuration file on Linux machines. Additionally, the command “sysctl -p” made it possible to apply the updated configuration without restarting hosts. Most site administrators were able to apply these improvements within one day, and so far there have been no negative side-effects on network operation after the tuning of the TCP stack for these hosts. Such an elegant and easily deployable solution for site administrators was possible due to proper evaluation, testing and configuration of the TCP stack in the test laboratory to find the most effective, yet compatible and problem-free, TCP configuration suitable for all sites.
Fig. 28 illustrates TCP tuning results on actual BalticGrid production sites.
As it can be seen from Figure 28, most sites experience major improvement in TCP performance once the optimal TCP stack configuration is applied. During the implementation of TCP stack improvements in BalticGrid sites it was noted that there are cases when parameter values specified in system default configuration file /etc/sysctl.conf are overridden by distribution specific configuration tool that stores system configuration values in some additional database. Thus site administrators must be aware of procedures and their order during system start-up for these TCP parameters to take effect and not to be overridden.
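One way to detect such overrides is to compare the values that are actually live in the kernel with the recommended ones after start-up. The following is a minimal sketch, assuming a Linux host where sysctl keys are exposed under /proc/sys with dots replaced by slashes; the function names are hypothetical and the script is not part of the BalticGrid tooling.

```python
# Recommended values as given in this document (IMCS UL recommendations).
RECOMMENDED = {
    "net.core.rmem_max": "8738000",
    "net.core.wmem_max": "8738000",
    "net.ipv4.tcp_rmem": "8192 4369000 8738000",
    "net.ipv4.tcp_wmem": "8192 4369000 8738000",
}


def read_live(key: str) -> str:
    """Read the live value of a sysctl key from /proc/sys (Linux only)."""
    path = "/proc/sys/" + key.replace(".", "/")
    with open(path) as handle:
        # The kernel separates multiple values with tabs; normalise to spaces.
        return " ".join(handle.read().split())


def find_mismatches(read=read_live, expected=RECOMMENDED) -> dict:
    """Return {key: live_value} for every key that differs from `expected`.

    A live value of None means the key could not be read at all.
    """
    mismatches = {}
    for key, wanted in expected.items():
        try:
            live = read(key)
        except OSError:
            live = None
        if live != wanted:
            mismatches[key] = live
    return mismatches


if __name__ == "__main__":
    for key, live in find_mismatches().items():
        print(f"{key}: live value {live!r}, expected {RECOMMENDED[key]!r}")
```

Running such a check after the host has fully booted, rather than immediately after “sysctl -p”, would show whether a later start-up step has replaced the tuned values.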
6.2. TEST DATA COMPARISON AND ANALYSIS
After TCP throughput tests were carried out both in the test laboratory and on production sites that had implemented the recommended increase of TCP buffers, it was possible to compare simulated tests with real life measurements. Figure 29 shows a comparison of laboratory tests and tests on production sites. The continuous lines show laboratory measurements under simulated network latency for the default ScientificLinux setup as well as for the setup with increased TCP buffers. As already described, the default configuration underperforms significantly, while the configuration with 8MB buffers reaches the baseline measurements achieved with Ubuntu 8.04 LTS. The graph shows that the results of un-tuned production sites correlate very well with the test laboratory measurements of un-tuned ScientificLinux. The results after the TCP buffer increase also show that the majority of sites correlate closely with the predicted results, although the achieved TCP throughput for production sites is lower due to other network traffic on the measured network path. This shows that the test laboratory is capable of simulating and replicating real world conditions, and that results obtained in the test laboratory are reproducible and usable in real world applications for production sites. Thus, not only have these tests been of practical benefit in increasing single-session TCP throughput in BalticGrid, but they have also validated the approach of using FreeBSD and its network utilities to simulate real world network conditions in a laboratory environment, a capability that classically required expensive and specialized equipment.
Fig. 29. Comparison of test laboratory and production site measurements
Fig. 29 also shows how well performance data obtained from real Iperf measurements correlate with results obtained in dedicated test laboratory using network simulator.
Also, comparison and analysis of the measurements for the laboratory setup and production sites has not revealed any negative side-effects from using the techniques described and recommended in this document. HighSpeed TCP and Scalable TCP are only feasible on networks where all participants use these exotic TCP implementations, or where these more aggressive implementations are used on Less than Best Effort (LBE) network services to protect other classic TCP Reno/CUBIC connections from congestion collapse. The technique proposed here, by contrast, keeps all existing TCP implementations already in use and just improves their performance, maintaining interoperability and compatibility without restricting the diversity of TCP implementations available for use. This approach is thus recommended for other organizations and cases where improvement of TCP performance is desirable without great effort or high risk of negative side-effects.
6.3. TCP THROUGHPUT TESTS BETWEEN SITES
After the test data comparison and analysis described in Section 6.2, several sites were still identified as not reaching the baseline TCP performance expected after the TCP parameter tuning. At the initial stage all throughput tests were carried out only in the direction from all sites to the main Iperf throughput testing server at the IMCS UL site. To investigate the causes and extent of the identified bottlenecks further, full or semi-full cross-site tests involving all paths between sites (full mesh tests) were necessary. Results from these tests would indicate the locality (city wide, country wide, region wide) of the bottleneck for
each site. This would ensure more accurate indications of where TCP bottlenecks originate – site TCP configuration, limited bandwidth routers between sites and GÉANT uplinks, or oversaturated national/regional GÉANT links.
TCP throughput tests were carried out between 12 sites and 9 Iperf servers, resulting in measurements for 108 distinct network paths. These results were able to convince network administrators of several sites that network performance bottlenecks are not caused by IMCS UL Iperf server being deployed in oversaturated network and producing overly slow measurements. Also, cross site measurements unveiled whether throughput bottleneck is unidirectional or bidirectional, as well as being present over all tested paths (regional and national bottleneck), or only on paths between sites in different countries (national network capacity sufficient, bottleneck in regional infrastructure).
Table 1. TCP throughput (Mbps) between sites
Table 1 shows collected data from TCP cross-site throughput tests. Rows represent source site of the generated traffic, and columns represent destination site. A target of 300Mbps was set as a minimal threshold for TCP throughput improvements to be considered as successful. If this target is met, the TCP throughput measurement is depicted in green. In case the measurement fails to achieve target performance, it is depicted yellow. In case it also fails to achieve performance threshold of 100Mbps, it is depicted red, indicating major performance bottleneck or lack of network capacity.
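The colour coding described above can be expressed as a small helper. This is an illustrative sketch only: the function name is hypothetical, and treating a value exactly equal to a threshold as meeting it is an assumption. The example values are taken from Table 1.

```python
TARGET_MBPS = 300   # target for a successful TCP throughput improvement
MINIMUM_MBPS = 100  # below this, a major bottleneck or lack of capacity


def classify(throughput_mbps: float) -> str:
    """Map a measured TCP throughput to the colour used in Table 1."""
    if throughput_mbps >= TARGET_MBPS:
        return "green"
    if throughput_mbps >= MINIMUM_MBPS:
        return "yellow"
    return "red"


# Example measurements from Table 1 (source -> destination, Mbps).
examples = {
    "RTUETF -> RTUETF": 942,
    "RTUETF -> EENet": 178,
    "BASNET -> IMCS UL": 6.57,
}
for path, mbps in examples.items():
    print(f"{path}: {mbps} Mbps -> {classify(mbps)}")
```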
Analysis of the results in Table 1 indicates that all sites with properly configured, 1Gbps capable network equipment that did tune TCP parameters can reach TCP throughput many times higher than the previous limit of 160Mbps. These sites include T2_Estonia, VU-MIF-LCG2, IMCS UL (both CEs) and PSNC. These sites pass the target throughput both as source and destination sites of the data transfer.
As can be identified from the color coding, there are three sites that show a disproportionate difference in achievable TCP throughput depending on the direction of data transfer. The RTUETF site is capable of sending data efficiently only to its site-local Iperf server and to the IMCS UL Iperf server (city-local connection). At the same time, the RTUETF site receives data very efficiently from most other sites. This shows that the RTUETF site has no incoming traffic bottleneck, but has a city (MAN) area bottleneck. After further investigation this bottleneck was identified as a problem in the central RTUETF core router that sends traffic to either GÉANT or a commercial ISP. At the time of writing this deliverable the problem was still unresolved,
Source/Dest      ITPA  IFJ-PAN-BG  KTU-ELEN-LCG2  PSNC  T2_Estonia  RTUETF  VU-MIF-LCG2  EENet  IMCS UL
RTUETF           72.6  65.4        93.2           66.5  19.3        942     130          178    574
VU-MIF-LCG2      92.8  164         94.2           208   7.92        706     931          730    841
T2_Estonia       15.7  176         91             50    301         846     801          866    867
VGTU-TEST-gLite  92.3  37.1        93.7           57    3.52        70.7    745          41.3   72.2
VGTU-gLite       92.2  93.9        93.7           91.9  7.9         94.4    94.5         92.9   95.2
EENet            34.8  94.1        92.3           78.3  29          93.4    94.6         93.3   94.2
IFJ-PAN-BG       24    735         88.2           253   7.73        149     194          120    205
KTU-ELEN-LCG2    48.2  77.2        928            85.4  3.4         93.1    81.7         45.2   91.3
IMCSUL(ce01)     84.7  128         93.8           181   31.4        735     428          370    950
IMCSUL(ce02)     73.9  175         93.9           144   38.2        795     771          692    950
PSNC             54.1  251         92.5           942   21.7        413     599          333    455
BASNET           6.38  7.06        6.4            6.81  6.23        5.83    6.84         4.95   6.57
partially due to ISP owning the core router and not being fast enough in performing troubleshooting and customer support.
Also, T2_Estonia site shows disproportionate results. While reaching throughput target as data transfer source in most scenarios, data transfer to T2_Estonia is extremely slow. This seems to be consequence of T2_Estonia being primarily targeted as computational site for LHC experiments and thus having heavily loaded network due to large amount of experiment data transfer.
Third significant disproportionality identifiable in the results is for EENet site. Data can be transferred with a very high throughput to EENet Iperf server, but data cannot be transferred fast from EENet Worker Nodes. This is caused by EENet not having 1Gbps network connections to all site grid resources. All Worker Nodes at EENet are connected using only 100Mbps Ethernet connection, whereas Iperf server was connected to one of available 1Gbps links. Thus the network throughput bottleneck for EENet is identified as lack of network capacity in LAN.
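The three directional asymmetries discussed above can also be seen by comparing, for each site, the median throughput of its row in Table 1 (site as source) with the median of its column (site as destination). The sketch below uses the values from Table 1. The choice of the median as the summary statistic and the function name are assumptions made for illustration, and the site-local measurement is included in both medians.

```python
from statistics import median

# Destination Iperf servers, in the column order of Table 1.
DESTS = ["ITPA", "IFJ-PAN-BG", "KTU-ELEN-LCG2", "PSNC", "T2_Estonia",
         "RTUETF", "VU-MIF-LCG2", "EENet", "IMCS UL"]

# Throughput in Mbps copied from Table 1; each row is one source site.
TABLE = {
    "RTUETF":          [72.6, 65.4, 93.2, 66.5, 19.3, 942, 130, 178, 574],
    "VU-MIF-LCG2":     [92.8, 164, 94.2, 208, 7.92, 706, 931, 730, 841],
    "T2_Estonia":      [15.7, 176, 91, 50, 301, 846, 801, 866, 867],
    "VGTU-TEST-gLite": [92.3, 37.1, 93.7, 57, 3.52, 70.7, 745, 41.3, 72.2],
    "VGTU-gLite":      [92.2, 93.9, 93.7, 91.9, 7.9, 94.4, 94.5, 92.9, 95.2],
    "EENet":           [34.8, 94.1, 92.3, 78.3, 29, 93.4, 94.6, 93.3, 94.2],
    "IFJ-PAN-BG":      [24, 735, 88.2, 253, 7.73, 149, 194, 120, 205],
    "KTU-ELEN-LCG2":   [48.2, 77.2, 928, 85.4, 3.4, 93.1, 81.7, 45.2, 91.3],
    "IMCSUL(ce01)":    [84.7, 128, 93.8, 181, 31.4, 735, 428, 370, 950],
    "IMCSUL(ce02)":    [73.9, 175, 93.9, 144, 38.2, 795, 771, 692, 950],
    "PSNC":            [54.1, 251, 92.5, 942, 21.7, 413, 599, 333, 455],
    "BASNET":          [6.38, 7.06, 6.4, 6.81, 6.23, 5.83, 6.84, 4.95, 6.57],
}


def send_receive_medians(site: str):
    """Return (median Mbps as source, median Mbps as destination) for a site.

    Only works for sites that appear both as a row and as a column.
    """
    as_source = median(TABLE[site])
    column = DESTS.index(site)
    as_destination = median(row[column] for row in TABLE.values())
    return as_source, as_destination


for site in ("RTUETF", "T2_Estonia", "EENet"):
    sending, receiving = send_receive_medians(site)
    print(f"{site}: median as source {sending:.1f} Mbps, "
          f"as destination {receiving:.1f} Mbps")
```

For RTUETF and EENet the median as destination is higher than the median as source, while for T2_Estonia it is the other way round, matching the direction of each asymmetry described in the text.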
There are several sites that also do not reach the TCP throughput target, but have much simpler reasons for these limitations. As can be seen from Table 1, the BASNET site in Belarus is not able to reach even 10Mbps TCP throughput to any of the deployed Iperf servers. This indicates a fully saturated network link from the Belarus NREN to GÉANT. This conclusion is also supported by examining the GÉANT connection infrastructure and monitoring data at the Gridimon monitoring site operated by IMCS UL. The IFJ-PAN-BG site is limited by router performance that peaks at around 270-300Mbps. The router is a desktop class PC running the OpenBSD operating system. Tests show that it is possible to send and receive data at high throughput within the IFJ-PAN-BG LAN, but external data transfers are not able to reach or exceed the target throughput.
KTU-ELEN-LCG2, completely opposite to EENet, has 1Gbps local LAN infrastructure, but is limited by 100Mbps external connection to GÉANT. Therefore all TCP throughput tests, except LAN tests, to or from KTU-ELEN-LCG2 fail to reach 100Mbps threshold.
VGTU-TEST-gLite has identical situation to KTU-ELEN-LCG2.
As it can be seen from the results, all sites that have 1Gbps transparent network infrastructure, correctly configured routing and TCP parameter optimizations on hosts achieve significant TCP throughput performance. And, most importantly, all performance bottlenecks have been identified and explained. Common reasons for remaining network performance bottlenecks are lack of network capacity in LAN or connection to GÉANT (usage of 100Mbps instead of 1Gbps Ethernet), oversaturated link to GÉANT (Belarus), misconfigured (RTUETF) or performance-limited (IFJ-PAN-BG) network routing equipment.
The SA2 activity has met all objectives according to the Description of Work. National Network Coordination Centres were established in each country to form the basis for successful coordination of activities even after the end of the project. Network infrastructure and proposed solutions for improvement of TCP performance are helping to maintain high quality network capacity for grid users in Estonia, Latvia, Lithuania, and Belarus.
The described TCP performance “bottleneck” removal solution has been successfully tested not only in the lab, but also deployed in BalticGrid. The designed solution exceeds the initial expectations in two aspects:
• It is much simpler to deploy than initially envisioned HighSpeed TCP or Scalable TCP, but achieves the same close to Gigabit single-session TCP performance.
• It does not degrade TCP performance for other network users and thus can be safely deployed in general purpose multi-user networks such as GÉANT (no need for LBE or other QoS services).
Since the proposed solution has no (observed) negative side effects, it is recommended for general deployment also outside BalticGrid.
BG-IRT team has monitored the security of the BalticGrid-II infrastructure. In case of international security incidents affecting the EGEE infrastructure, BG-IRT administrators are following the case and applying necessary patches or remedies.