Hardware Considerations - High-level load balancing for web services

Ideally, an examination of wide-area Internet traffic would make use of actual infrastructure, software and hardware. However, because of the potentially costly and time-consuming work associated with assembling equipment, locations and approvals, this idea was abandoned in favour of a network environment using virtual machines with emulated WAN behaviour.

An alternative approach to WAN studies would be off-line trace-driven analysis, in which existing data dumps are obtained from sources with access to backbone routers. Two institutions that provide this kind of packet trace are CAIDA¹ and NLANR.² This kind of analysis is indeed useful for certain applications, but because the traces are anonymized (i.e. all addresses are removed and replaced by non-routable ones, such as addresses in 10.0.0.0/8), it is impossible to determine node proximity and route paths of individual packets. Also, a trace is observed from a single vantage point, and therefore does not provide any insight into actual end-to-end latency properties.

5.2.1 Virtual Network with Xen

The test environment is implemented in a virtual network using Xen, a virtual machine monitor for the x86 architecture. [42] A total of 10 virtual machines (VMs) run on top of a high-end, consumer-grade server (see figure 5.2). Compared to other virtualisation engines, Xen comes close to native performance, and clearly outperforms alternatives like VMware and User-Mode Linux. This makes Xen a suitable choice for virtual networks, especially because of its performance isolation features, which mean that individual VMs operate without being adversely affected by the load of others. When running the experiments, the web servers will sustain a moderate to high load. Performance isolation will ensure that the client machines are not influenced significantly by the servers, and can run the load-generating applications and measurement tools unaffected.

The host machine is a dual AMD Opteron 242 1.6GHz system, with 2GB RAM and a Samsung SP0411N ATA 40GB disk drive. Each virtual machine is equipped with its own disk image that is mapped to a virtual disk drive within the VM, and a limited amount of RAM: a 500 MB disk drive and 64 MB RAM for each host, except the DNS server, which has 128 MB RAM. Configuration of the system was carried out using the tool MLN, an easy-to-use front-end to Xen and User-Mode Linux. [43] The relevant configuration files can be found in appendix A.1.

¹ CAIDA Anonymized OC48 Data, available at http://www.caida.org/analysis/measurement/oc48_data_request.xml

² The National Laboratory for Applied Network Research publish extensive traces at

[Figure 5.2 diagram. Labels: Hardware layer; Virtual switch; Host system; Guest system 1, Guest system 2, ..., Guest system n.]

Figure 5.2: Xen – conceptual design. The host system acts as a bridge between the actual hardware and the guest operating systems. Dashed lines represent virtual network connections.

Performance of the host system's disk drive is reasonably high, yielding a buffered disk read rate of 52.4 ± 0.3 MB/sec (15 samples), as reported by the hdparm tool. However, when running the same test with 30 samples on a virtual machine, the results were quite different: for each consecutive run, the reported buffered disk read speed steadily increased from 33 MB/sec up to around 132 MB/sec, where it flattened out after 28 iterations. This is clearly indicative of caching behaviour in Xen that appears to undermine the cache flushing mechanism that hdparm uses before each run. Any rigorous analysis of these numbers is irrelevant, however, since disk I/O performance on the virtual machines lies outside the scope of our examination. That said, the numbers show that disk throughput cannot be considered a bottleneck in the setup.

CPU and RAM resources are more central, and play the main part in determining VM performance. Since the virtual machines essentially share access to the real hardware, the host system should not only have sufficient CPU time and RAM for all machines; it should also be able to handle a high rate of hardware interrupts, that is, requests to and from the hardware devices to ask for CPU time. An interrupt triggers a context switch in the CPU, and by monitoring the context switch counter on the host system, it is possible to determine whether the sheer rate of switches represents a bottleneck in the experiments.
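The context switch rate that vmstat reports can also be derived by hand from the kernel's cumulative counter. The following is a minimal sketch, assuming a Linux host that exposes a `ctxt` line in `/proc/stat`; the function names are illustrative and not part of the original setup.

```python
import time


def parse_ctxt(stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat content."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switch_rate(before: int, after: int, interval_s: float) -> float:
    """Context switches per second between two counter readings."""
    return (after - before) / interval_s


def sample_rate(interval_s: float = 1.0, path: str = "/proc/stat") -> float:
    """Read the counter twice, interval_s seconds apart, and return the rate."""
    with open(path) as f:
        first = parse_ctxt(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_ctxt(f.read())
    return switch_rate(first, second, interval_s)
```

Calling `sample_rate()` repeatedly while an experiment runs gives a time series that can be compared against the rate of an idle system.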

In a network performance test, where a 250 MB file was fetched over HTTP 15 times, the network throughput was measured at 433.6 ± 1.6 Mbps, or 54.2 ± 0.2 MB/sec. This is similar to the disk performance measured earlier. On the host system, the context switching rate reported by vmstat peaked at around 18600 cs/sec, with a corresponding CPU utilisation of 40 percent. For the disk measurement, we found a peak rate of approximately 4400 cs/sec, with a CPU load of 4 percent, but 40 percent waiting time, evidently spent waiting for disk I/O. During a flood ping between two virtual machines, the host system observed a sustained rate of 12000 cs/sec and 15 percent CPU load. By comparison, an idle system generates around 20 cs/sec.

A qualitative conclusion to draw from these somewhat informal measurements is that the disk and network throughput of both the host system and the virtual machines is far greater than needed. The requirements on CPU and RAM resources are somewhat harder to determine and will be tested later. However, it is safe to say that the current system forms a solid basis for the simulations.

5.2.2 Emulating WANs – NetEm and IProute

Originally inspired by the need for a simulation framework for protocol development, NetEm has made its way to the mainstream Linux kernel tree. [44] Built together with the existing mechanisms for QoS and differentiated services available in the kernel, it provides users with a very flexible kernel-level implementation of WAN emulation, including delay, packet loss, duplication and reordering, rate control and non-FIFO queueing. One of the main highlights of NetEm is its protocol independence, which allows the use of unmodified user-space networking software.

NetEm is built into the feature-rich Linux QoS skeleton, which offers a wide array of possibilities for queueing, queue manipulation and packet classification. Options and control over these features are handled by the iproute2 user-space software, which ships with a toolset for controlling all aspects of networking in Linux; everything from link speed and interface addressing, to routing and traffic control. With iproute2, the traffic control options are managed through the tc command line program. For example, a command like

# tc qdisc add dev eth0 root netem delay 200ms 50ms 25%

would set the eth0 qdisc up with a delay of 200 ± 50 ms, with a 25 percent correlation factor, which means that the next random element depends 25 percent on the previous one. If not specified, the elements are generated according to a uniform random distribution. The qdisc keyword is short for queueing discipline.

A timer limitation in the kernel has an adverse effect on the operation of NetEm. The kernel setting for timer frequency can take three values: 100, 250 and 1000Hz. By default, the 2.6 kernel series uses a 100Hz timer. A consequence is that NetEm can only deal with a time granularity of 10 ms, i.e. a packet can only be delayed in steps of 10 ms. Fortunately, setting the frequency to 1000Hz reduces the granularity to 1 ms, which is sufficient for our purpose. The host system and all virtual machines run a 2.6 series kernel with 1000Hz timer frequency.
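The relationship between timer frequency and delay granularity is simple arithmetic: one tick lasts 1000/HZ milliseconds. The sketch below illustrates this under the assumption that a requested delay is rounded up to the next whole tick; the rounding direction is an assumption made for illustration, not taken from the NetEm source, and the function names are hypothetical.

```python
import math


def granularity_ms(hz: int) -> float:
    """Length of one kernel timer tick in milliseconds."""
    return 1000.0 / hz


def quantise_delay_ms(delay_ms: float, hz: int) -> float:
    """Round a requested delay up to a whole number of ticks (assumed rounding)."""
    tick = granularity_ms(hz)
    return math.ceil(delay_ms / tick) * tick


for hz in (100, 250, 1000):
    print(hz, "Hz:", granularity_ms(hz), "ms per tick,",
          "203 ms request becomes", quantise_delay_ms(203, hz), "ms")
```

With a 100Hz timer a 203 ms request can only be honoured on a 10 ms grid, whereas at 1000Hz it is representable exactly.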

5.2.2.1 Delay Distributions

With NetEm, it is possible to specify the distribution that the delay should follow, with parameters for mean and standard deviation. Four different types are bundled with the software package: the default uniform random, and table-based distributions for normal, Pareto and a normal/Pareto mixture. To assess the different possibilities, a small experiment was conducted where a client machine was set up to use a scheduler with 300 ± 80 ms latency. A 25 percent correlation factor was used for the uniform random distribution; the factor has no effect on the table-based distributions. Data was gathered by sending 17700 ping requests to the DNS node. The command lines for the tc program were as follows:

(a) # tc qdisc add dev eth0 root netem delay 300ms 80ms 25%
(b) # tc qdisc add dev eth0 root netem delay 300ms 80ms normal
(c) # tc qdisc add dev eth0 root netem delay 300ms 80ms pareto
(d) # tc qdisc add dev eth0 root netem delay 300ms 80ms paretonormal

Normalised results from the tests are shown in figure 5.3. Plots (a) through (c) are shaped according to the parameters specified on the command line. The normal/Pareto mixture is shown in plot (d). The shape can be identified as a normal distribution up to around x = 300 ms, where the Pareto form is incorporated to model the elongated tail. Note how the distribution does not form a heavy tail, since it appears to have a well-defined mean around 300 ms.

Note the small spike present around x = 600 ms in plots (c) and (d). It seems to be caused by the Pareto generator, since there is no similar spike in the Gaussian plot. The spike is located at approximately twice the mean of 300 ms, which initially led us to suspect that ARP requests were to blame: every now and then, hosts refresh their ARP cache by asking for the physical address of a local IP address. These requests are also delayed through the scheduler, and may cause a delay for ping packets that are sent upon ARP cache expiry. Close examination of the packet traces showed that this was not the case. Even though ARP traffic was present, it did not appear to cause any additional delay, and there was no packet loss in the entire trace.

[Figure 5.3 plots. Panels: (a) Uniform random distribution; (b) Normal distribution; (c) Pareto distribution; (d) Pareto-Normal distribution. Horizontal axis: round-trip time (ms), 0 to 600. Vertical axis: probability density P(x), 0 to 0.16.]

Figure 5.3: NetEm delay distributions. These are the four basic delay distributions offered by the NetEm software. The plots are normalised experimental data, based on approximately 17700 ping round-trip times for each distribution, grouped into 7 ms deltas.
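The binning behind plots of this kind can be sketched as follows. This is a minimal illustration, assuming that the round-trip times are available as a list of millisecond values and that normalisation means dividing each bin count by the total number of samples; the text does not state which normalisation was actually used, and the function name is hypothetical.

```python
from collections import Counter


def normalised_histogram(rtts_ms, bin_width_ms=7.0):
    """Group round-trip times into fixed-width bins and return relative frequencies.

    Keys are the lower edge of each bin in ms, values are the fraction of
    samples that fell into that bin (assumed normalisation).
    """
    counts = Counter(int(rtt // bin_width_ms) for rtt in rtts_ms)
    total = len(rtts_ms)
    return {index * bin_width_ms: n / total
            for index, n in sorted(counts.items())}


# Six made-up samples, for illustration only (not data from the experiment).
example = normalised_histogram([0, 1, 6.9, 7, 14, 300])
```

Because the values are relative frequencies, they sum to one over all bins, which makes distributions with different sample counts directly comparable.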

[Figure 5.4 plot. Series: Pareto; Pareto-normal. Horizontal axis: sample no., 0 to 4000. Vertical axis: table value, -10000 to 30000. Inset: samples 4050 to 4100, table values 29000 to 33000.]

Figure 5.4: Pareto Generator Phenomenon. This is a plot of the inverse cumulative Pareto and Pareto-Normal tables used by NetEm. 4096 samples are generated, and we can see that the Pareto table overflows at sample 4053, flattening out abruptly. This also affects the Pareto-Normal distribution, where we see a marked indentation in the same area.

Further analysis led to the source of the problem: the NetEm Pareto generator produces signed 16 bit integers within a scaled inverse of the cumulative distribution function (see figure 5.4). Towards the end of the process, the generated numbers tend to overflow the data type, so the author of the software has decided to clamp all overflows to the maximum value, that is, every number above 2^15 − 1 = 32,767 is set to 32,767. This increases the occurrence, and therefore also the probability, of this maximum. Despite this slight anomaly, the Pareto generator was left unchanged throughout the experiments.
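The effect of clamping on the distribution can be illustrated with a small sketch. It assumes plain saturation to the signed 16 bit range; clamping at the lower bound is added here for symmetry and is an assumption, as are the function names and the example values, none of which come from the NetEm source.

```python
INT16_MAX = 2**15 - 1   # 32,767
INT16_MIN = -2**15      # -32,768


def saturate_int16(value: int) -> int:
    """Clamp a value into the signed 16 bit range instead of letting it wrap."""
    return max(INT16_MIN, min(INT16_MAX, value))


def fraction_at_max(values) -> float:
    """Fraction of table entries that end up at the maximum after clamping."""
    clamped = [saturate_int16(v) for v in values]
    return clamped.count(INT16_MAX) / len(clamped)


# Made-up table tail: the last three entries exceed or reach the maximum.
tail = [10, 32767, 40000, 50000]
print(fraction_at_max(tail))
```

Every entry that would have been spread over the region above 32,767 collapses onto a single value, which is why that value shows up as a spike in the measured delays.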
