4.3 Scalability and Evaluation
5.1.1 Components of High Performance Data Transfer
5.1.1.1 Storage to Memory
In order for data transfer to occur, the data must be sourced from and saved to some location. In some applications, it might be generated and consumed entirely within the applications on the client and server. In others though, it must be retrieved from some storage location. The manner in which data is accessed must be carefully examined in order to avoid unintended bottlenecks. Components to consider include:
• storage device bandwidth • storage controller bandwidth
• Non-Uniform Memory Access (NUMA) • bus and input/output (I/O) bandwidth
Data storage has notoriously been one of the slowest components of computers. Modern architectures can be equipped with a variety of high speed storage controllers such as serial advanced technology attachment (SATA), serial attached small computer system interface (SAS), and redundant array of independent disks (RAID). When used with slower spinning disks, the disks are the I/O bottleneck,
Figure 5.1.1: The components of a data transfer. When moving data between components, bottle- necks must be considered.
sustaining average data transfer rates of around 500Mbps, as compared to the ability of their con- troller. SATA III for example is capable of 6Gbps per port and an aggregate controller bandwidth of approximately 10-11Gbps depending on the controller implementation. When coupled with modern high-speed solid state drives, the bottleneck shifts from the disks themselves to the disk controller. With high performance SSDs capable of read and write speeds in excess of 4Gbps, just a few SSDs is plenty to saturate the disk controller.
The use of multiple disk controllers is the next logical step to eliminate the bottleneck of a single controller, but even in modern architectures, many cannot support multiple disk controllers on a single CPU socket due to available peripheral component interconnect (PCI) lanes or constraints in how the motherboard has wired PCI lanes. With NUMA architectures, multiple disk controllers can be installed on different CPU sockets to further increase throughput reading from and writing to disk. However, this imposes design constraints on the data transfer application, forcing it to be NUMA-aware and access data from the local disks/controller(s) only else incur a local, inter- CPU bus usage penalty – for Intel CPUs referred to as direct media interface (DMI) or quick path interconnect (QPI); HyperTransport for AMD. Furthermore, given this data is either from or destined to the network, the NIC it is from or is en route to should also be on the same CPU socket as the corresponding disk controller, otherwise, the local bus must be used to transfer the data to the remote CPU, imposing another potential bottleneck.
Many state of the art commercial solutions exist to alleviate this storage headache for consumers. Such solutions consist of multiple storage nodes that run in parallel, storing and rotating data through a hierarchy of storage technologies – solid state disks (SSDs) for offload at speed parity from the network, then cheaper and higher-capacity spinning disks for long-term storage. Multiple storage nodes are used in parallel to eliminate bottlenecks present in a single-node design described above and to scale to larger storage demands as data sizes increase and network speeds increase.
The design of storage access for a particular data transfer will of course depend on the data transfer rates desired. A single machine could be used with a single disk controller and SSDs for sub-10Gbps data transfers, multiple disk controllers and a NUMA-aware application could be used for 10Gbps+ data transfers, or a more advanced storage solution could be designed for a general purpose application. Such solutions could make use of state-of-the-art commercial techniques described above or a distributed parallel file system like OrangeFS [80]. Parallel file systems like OrangeFS present a network file system abstractly as a mount point on the local file system of a client. The client simply reads from and writes to this mount point through its filesystem. The OrangeFS daemon directs I/O through this mount point to remote nodes where the data is physically stored. To be effective, the additional network I/O used by OrangeFS must not contend with that used by the client’s network card to receive or transmit the data. A second NIC might be necessary depending on the specific deployment scenario.
Despite being a complex and important part of the data transfer problem, the storage access problem presented here is in fact orthogonal to the network data transfer problem, and must be addressed independently for any complete end-to-end data transfer. Without adequate consideration, it can easily become the bottleneck, thwarting any efforts made in the network to further increase performance.
5.1.1.2 Memory to Network
Achieving high throughput data transfer from memory to the network is the goal of the memory to network component. Potential bottlenecks to consider include:
• sockets and kernel • network interface card • bus and I/O bandwidth
From memory, the application needs to write the data to a socket in order to have the kernel facilitate the data transfer to the network. To reduce latency, the application should minimize data copies from memory to memory. A memory-mapped buffer can be used so that the write system call does not need to perform any copies from user-space to kernel-space. However, this has the drawback of added buffer management complexity, since there is not a way to know for sure when the TCP state machine has consumed and successfully relayed the entire buffer to the remote host. Without memory-mapped buffers, when a write is performed to the socket, the kernel will copy the written data into buffers of the TCP state machine. From there, the data will be relayed by TCP from the kernel to the NIC via direct memory access (DMA) if possible or via CPU if DMA from memory to the NIC is not supported.
Data moving from memory to the NIC must traverse the PCI bus on which the NIC is connected. To reduce latency and avoid bottlenecks, the NIC should be connected to PCI lanes wired to the same CPU on which the application is running. Otherwise, a penalty due to CPU local bus usage will occur. This is the same penalty as previously mentioned for storage to memory considerations. This means that the NIC and storage controller should be wired to the same CPU socket and that the application should be pinned to a CPU core on the same socket in order to reduce latency and avoid bottlenecks.
5.1.1.3 End-to-End Data Transport
Facilitating the transfer of data from one user application to another is the goal of end-to- end data transport. To accomplish end-to-end data transport, the user has a choice as to which technology or application to use. Choices include edge based data transfer tools such as Aspera or GridFTP, parallel TCP, or even homegrown solutions. End user intervention to facilitate such edge initiated solutions introduces deployment, management, and usage hurdles, hindering their widespread adoption [44]. As such, many users choose to use regular TCP to achieve end-to-end data transport.
Tools such as Aspera and GridFTP accomplish high performance data transfer through thought- ful use of transport protocols. Due to the TCP windowing problem over large delay-bandwidth- product networks, GridFTP makes use of multiple TCP connections in parallel. This has a couple of advantages. First, the impact of network congestion and resulting packet loss is lessened. Only the TCP connection(s) that sustained loss will suffer decreased window sizes. The other TCP con-
nections will be unaware and will continue at their current window size as if packet loss did not occur. Another advantage is that it allows GridFTP to work around the OS’s likely low, default TCP buffer limits, which if data loss did not occur on a single TCP connection, would impose a data transfer bottleneck in the form of an insufficiently large TCP window.
GridFTP also has a UDP operating mode (via UDT), similar to Aspera. Both use UDP combined with slim TCP control connections to reliably transfer data. UDP is not a windowed transport protocol, so it does not suffer from the same limitations as TCP. It will naturally transfer data as rapidly as possible to the event the application can produce and read data and that the host operating system can send and receive the data on the network.
In common to each of these techniques is the attempt to fill the pipe with data. TCP-based techniques address the pitfalls present in the TCP protocol to make better use of available bandwidth and keep sustained throughput high, even with data loss. UDP-based techniques naturally keep the pipe full and have been augmented to be reliable and ordered, necessary for general purpose data transfer.