Linux TCP Stack Performance Comparison and Analysis on SMP systems


Shourya P. Bhattacharya

Indian Institute of Technology, Bombay

Kanwal Rekhi School of Information Technology

Powai, Mumbai - 400076


Abstract

1. Introduction

With the phenomenal growth of networked applications and their users, there is a compelling need for very high speed and highly scalable communications software. A networked application can be made high speed and scalable by using a variety of approaches. We can aim at the application layer first, achieving speed-up by improving the internal algorithms, and scale-up by having multiple replicas of the application and balancing the load between these replicas. However, at the heart of any application written for use over a network is the lower layer protocol stack implementation of the underlying operating system. For any application to run at high speed and achieve high scalability, the protocol stack implementation that it uses must also be able to keep up, and not become a bottleneck. Thus it is important to study and understand the performance of the protocol stack implementation which the application will use.

Recent trends in technology show that although the raw transmission speeds used in networks are increasing rapidly, the rate of advancement of the microprocessor technology that needs to drive the network has slowed down over the last couple of years. Gigabit Ethernet networks have now become commonplace, while 10 Gbps Ethernet is rapidly making its presence felt [12]. On the other hand, ... [some data about processor speeds] ... Hence major microprocessor manufacturers such as Intel and AMD are exploring newer processor architectures. Intel was the first to introduce simultaneous multithreading, or Hyper-Threading [5], in their Pentium 4 processors, which enabled a single Pentium 4 processor to appear as two logical processors. More recently both AMD and Intel have launched their dual core processors [??]. Dual core processors, as the name suggests, package two processing cores on a single chip, effectively behaving like an SMP (Symmetric MultiProcessor) system. In the future we can expect CPUs with more cores to be introduced.

The end result of this is that network protocol processing overheads have risen sharply in comparison with the time spent in packet transmission. Although the network layer processing is highly optimized in routers, which are special-purpose, in application servers (i.e. “hosts”) this protocol processing has to happen on the general purpose hardware that the server runs on. When load on the server increases, protocol processing overheads can start to dominate, and can degrade the useful throughput of the application.

Several approaches have been employed to speed up and scale up the protocol stack implementations. In the context of TCP/IP, offloading the protocol processing to dedicated hardware, instead of carrying it out on the general purpose CPU, has been proposed. This essentially enables the CPU to be dedicated to application processing. In either case, i.e. with or without offloading, parallel processing architectures can be exploited for speeding up protocol processing.

Whether the purpose is to determine the TCP/IP components that have the highest processing requirements, or to determine how the implementation scales to SMP architectures, a careful performance study of the TCP/IP stack implementation of the operating system in question is a must. In this paper, we discuss the results of such a study done for the Linux Operating System. The Linux OS has shot up in popularity recently, and is now used even by large-scale system operators [??]. Specifically, we have focused on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4.

Although performance studies of [...] have been done, this is the first time a thorough comparison of Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have studied and compared the performance of these two Linux versions along various metrics: bulk data throughput, connection throughput and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.

In almost all the tests, kernel 2.6 performed better than kernel 2.4. It was observed that kernel 2.6 sustained higher data transfer rates and a much higher number of simultaneous connections than kernel 2.4. The higher sustained data transfer rates in kernel 2.6 were attributed to its more efficient “copy” routines, while the much higher number of simultaneous connections in kernel 2.6 was a result of its superior O(1) scheduler. Our experiments with kernel 2.6 on an SMP architecture show degraded performance when the processing of a single connection is spread out over multiple processors, thus verifying the superiority of the “Processor per Connection” [2] model for protocol processing in SMP systems. The kernel profiling results also revealed that interrupt processing, the device driver code and the copy routines take up a significant amount of the total network processing time.

This paper is organised as follows. Section 3 discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section 4 describes the benchmarking experiments performed, the hardware setup, their results and implications. In section 5 we discuss the kernel profiling results obtained with OProfile [8] and the inferences drawn from them. In section 6 we conclude our observations and suggest future work.

2. Speeding up protocol processing

Improving and speeding up the network protocol processing stack has become an area of active research interest. In this section we look at a couple of ways in which the problem of improving protocol processing speed has been approached. Firstly, attempts have been made to offload the protocol processing from the host system to the NIC, which has dedicated hardware for it. Secondly, efforts have been made to parallelise the protocol stack processing to make use of multiple processors. We take a brief look at both these approaches. It must be noted that these approaches are not mutually exclusive.

2.1. TCP offloading

TCP offload has been a well-debated topic over the last decade [11]. The typical benefits of TCP/IP offloading include reduction of host CPU requirements for stack processing and checksumming, fewer interrupts to the host CPU, fewer bytes copied over the system bus and offloading of computationally expensive features such as encryption to specialised hardware on the NIC. In spite of these benefits, there has been some criticism of TCP offload. The primary argument against it has been that TCP protocol processing is not in itself a very expensive operation, and that the rapid increase of CPU processing power will nullify the cost effectiveness of offloading TCP to the NIC. However, given the changing dynamics of technological developments, TCP offloading has become extremely relevant [6].

Our experimental results in section 5 show that significant protocol processing time is spent in buffer copying, checksumming and interrupt handling. TCP offloading will be able to free the host CPU from these overheads and hence will be extremely useful.

2.2. Parallel Protocol Processing Approaches

The layered architecture of the network stack makes it difficult to parallelise it efficiently. A lot of work has been done [2, 1, 7] in this field to effectively parallelise the protocol stack. We discuss some of the well known paradigms.

• Processor per Message is a parallelising paradigm in which each processor executes the whole protocol stack for one message. With this approach heavily used connections can be served efficiently, since several processors can service different messages of the same connection. However, this also implies that the connection state has to be shared between the processors servicing the packets, which can lead to synchronisation problems.

• Processor per Connection lets one processor handle all the messages belonging to a particular connection. This approach works well in SMP systems as it can make optimal use of the processor cache. However, it suffers from fragmentation of resources.

• Processor per Protocol is another approach, in which each layer of the protocol stack is processed by a particular processor. Messages may be shuffled from one processor to another as they move up or down the protocol stack. One limitation of this approach is that the messages cannot be cached efficiently as they are shuffled from one processor to the other [2].

• Processor per Task is a parallelising technique in which each processor performs a specific task or function within a protocol, or a task common to more than one protocol. This paradigm, in theory, tries to reduce the processing time or latency of a message. The main disadvantage of this approach is that both protocol state and messages must be shared between processors. It also suffers from poor caching, as in the case of the processor per protocol paradigm.

It has been shown by Schmidt and Suda [10] that “processor per message” and “processor per connection” generally work better than the other two paradigms in a shared memory multiprocessor system. This happens because of the high cost of context switches while crossing protocol layers. Our results have shown that even the “processor per message” paradigm has a detrimental effect on TCP performance. This happens because TCP is a connection oriented protocol and needs to store the connection state in memory. Distributing the processing of packets belonging to a single connection across different processors leads to frequent cache invalidations in all the processors. Hence only the “processor per connection” paradigm is optimally suited for new generation processors having multiple cores and simultaneous multithreading.
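The difference between the “processor per message” and “processor per connection” paradigms can be sketched as two dispatch policies. The following Python sketch is illustrative only: the function names, the round-robin policy, the CRC-based hash and the example addresses are assumptions made for this example and are not taken from the Linux kernel or from the cited work.

```python
import zlib
from itertools import count

NUM_CPUS = 2  # assumed dual-processor system

_rr = count()  # round-robin counter used by the per-message policy


def cpu_per_message(conn):
    """Processor per message: any CPU may handle any packet (round robin here)."""
    return next(_rr) % NUM_CPUS


def cpu_per_connection(conn):
    """Processor per connection: hash the 4-tuple so one CPU owns the flow."""
    key = "{}:{}-{}:{}".format(*conn).encode()
    return zlib.crc32(key) % NUM_CPUS


def cpus_touching_state(policy, conn, packets=8):
    """Return the set of CPUs that would touch this connection's state."""
    return {policy(conn) for _ in range(packets)}


conn = ("10.0.0.1", 40000, "10.0.0.2", 80)  # hypothetical connection
print(cpus_touching_state(cpu_per_message, conn))     # both CPUs share the state
print(cpus_touching_state(cpu_per_connection, conn))  # exactly one CPU
```

With the per-message policy the connection state is touched by every processor, which is the situation that leads to the cache invalidations described above; with the per-connection policy it stays on one processor.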

3. Improvements in Linux kernel 2.6

The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4 with many improvements and was supposed to be faster and more responsive than the earlier kernel. In this section we discuss some of the improvements and changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem. Some of the major improvements in kernel 2.6 include minimal use of the Big Kernel Lock (BKL), an improved interrupt mechanism i.e. NAPI, more efficient block copy routines and a vastly improved O(1) scheduler. The kernel 2.6 TCP stack also includes new congestion control and recovery algorithms which are not available in the kernel 2.4 TCP stack.


3.1. Minimal use of Big Kernel Lock

The BKL is a global kernel lock which allows only one processor to be running kernel code at any given time, to make the kernel safe for concurrent access from multiple CPUs. The BKL is essentially a spinlock, but with a couple of interesting properties:

• The BKL can be taken recursively; therefore two consecutive requests for it will not deadlock the process.

• Code holding the BKL can sleep and even enter the scheduler while holding the lock. The lock is released while the given thread sleeps, and re-acquired upon awakening.

The BKL makes SMP Linux possible, but it doesn't scale very well, hence there is a continuous effort to avoid the BKL and use more fine grained locks instead. Although kernel 2.6 is still not completely free of the BKL, its usage has been greatly reduced. The kernel 2.6 networking stack has only one reference to the BKL.

3.2. New API - NAPI

One of the most significant changes in the kernel 2.6 network stack is the addition of NAPI (“New API”), which is designed to improve the performance of high-speed networking. The basic principles of NAPI are:

• Interrupt mitigation. High-speed networking can create thousands of interrupts per second, which can lead to an interrupt livelock. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

• Packet throttling. When the system is overwhelmed and must drop packets, it's better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adapter itself, before the kernel sees them at all.
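The interrupt mitigation idea can be illustrated with a toy model. The sketch below is not kernel code and does not use any real NAPI interface: the function names, the per-tick arrival lists and the polling budget of 16 are assumptions chosen for illustration, and the receive queue is unbounded, so packet throttling is not modelled.

```python
from collections import deque


def per_packet_interrupts(arrivals):
    """Classic model: every received packet raises one interrupt."""
    return sum(arrivals)


def napi_interrupts(arrivals, budget=16):
    """NAPI-style model: one interrupt schedules polling; further receive
    interrupts stay disabled until a poll drains the queue."""
    queue = deque()
    interrupts = 0
    polling = False
    for burst in arrivals:          # one entry per time tick
        if burst:
            queue.extend(range(burst))
            if not polling:         # interrupts enabled: take one, then mask
                interrupts += 1
                polling = True
        if polling:
            for _ in range(min(budget, len(queue))):
                queue.popleft()     # process up to `budget` packets per poll
            if not queue:           # queue drained: re-enable interrupts
                polling = False
    return interrupts


light = [1, 0, 1, 0]   # sparse traffic: behaves like per-packet interrupts
heavy = [64] * 10      # sustained overload: stays in polling mode
print(per_packet_interrupts(light), napi_interrupts(light))
print(per_packet_interrupts(heavy), napi_interrupts(heavy))
```

Under light load the two models raise the same number of interrupts, while under sustained load the polling model takes a single interrupt and then keeps polling.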

3.3. Efficient copy routines

The Linux kernel maintains separate address spaces for the kernel and user processes for protection against misbehaving programs. Due to the two separate address spaces, when a packet is sent or received over the network, an additional step of copying the network buffer from the user space to the kernel space or vice versa is required. As a result the efficiency of the kernel copy routine used has a profound impact on the overall network performance.

Kernel 2.6 copy routines have been optimised for the x86 architecture: they use hand-unrolled loops [3, 9] with integer registers instead of the less efficient “movsd” instruction used in kernel 2.4.

3.4. Scheduling Algorithm

The Linux kernel 2.6 scheduler is probably the single most significant change made in the kernel. The kernel 2.6 scheduler was written completely from scratch to overcome some of the limitations of the kernel 2.4 scheduler. The kernel 2.4 scheduler, while being widely used and quite reliable, had several undesirable characteristics. The biggest flaw of the kernel 2.4 scheduler was that it contained O(n) algorithms, where “n” is the number of processes in the system, and hence was not scalable.

The new scheduler in kernel 2.6, on the other hand, does not contain any algorithms that run in worse than O(1) time. This is extremely important in applications like web servers as it allows them to handle a large number of concurrent connections without dropping requests.

The Linux kernel 2.4 scheduling algorithm divides time into “epochs”, which are periods of time during which every task is allowed to use up its timeslice. Timeslices need to be computed for all tasks in the system when epochs begin, which means that the scheduler's algorithm for timeslice calculation runs in O(n) time, since it must iterate over every task. On the other hand, kernel 2.6 uses more sophisticated algorithms which make sure that there is no point at which all tasks need new timeslices calculated for them at the same time.
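The contrast between the two approaches can be sketched in a few lines. This is a simplified model, not the kernel implementation: the names `recalc_epoch_24` and `RunQueue26`, the task dictionaries and the timeslice formula are hypothetical, and the two priority arrays with a bitmap and 140 priority levels follow the usual description of the 2.6 O(1) scheduler rather than anything stated in this paper.

```python
from collections import deque

NUM_PRIO = 140  # assumed number of priority levels


def recalc_epoch_24(tasks):
    """2.4-style epoch rollover: visit every task to assign a new timeslice.
    Returns the number of tasks visited, which grows with len(tasks)."""
    visited = 0
    for task in tasks:
        task["slice"] = task["slice"] // 2 + task["base"]
        visited += 1
    return visited


class RunQueue26:
    """2.6-style runqueue sketch: two priority arrays that are swapped when
    the active one empties, so no loop over all tasks is needed."""

    def __init__(self):
        self.active = [deque() for _ in range(NUM_PRIO)]
        self.expired = [deque() for _ in range(NUM_PRIO)]
        self.active_bitmap = 0   # bit p set <=> active[p] is non-empty
        self.expired_bitmap = 0

    def enqueue(self, task):
        self.active[task["prio"]].append(task)
        self.active_bitmap |= 1 << task["prio"]

    def pick_next(self):
        if self.active_bitmap == 0:      # active array exhausted: swap arrays
            self.active, self.expired = self.expired, self.active
            self.active_bitmap, self.expired_bitmap = self.expired_bitmap, 0
        if self.active_bitmap == 0:
            return None
        # Lowest set bit = best priority; stands in for a find-first-set.
        prio = (self.active_bitmap & -self.active_bitmap).bit_length() - 1
        task = self.active[prio].popleft()
        if not self.active[prio]:
            self.active_bitmap &= ~(1 << prio)
        return task

    def expire(self, task):
        """Timeslice used up: refill this one task and park it in `expired`."""
        task["slice"] = task["base"]
        self.expired[task["prio"]].append(task)
        self.expired_bitmap |= 1 << task["prio"]
```

In this model each task gets its new timeslice individually when it expires, so the cost of picking the next task does not depend on how many tasks exist, whereas the epoch rollover visits every task.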

4. High Level Performance comparison tests

In this section we discuss the benchmarking experiments performed for a comparative study of kernel 2.4 and 2.6. Each subsection describes the type of tests performed, the experimental setup and the results obtained.

4.1. Web server performance

HTTP performance tests were done on kernel 2.4 and 2.6 to get a comparative picture of both kernels' efficiency in a real world scenario. The freely available httperf [4] tool was used for load generation and the performance of the Apache web server was tested on both the kernels. The following two changes were made to the default Apache configuration file:

1. MaxClients was set to the maximum value of 4096

2. MaxRequestsPerChild was set to zero (unlimited)

The Apache web server was run on a dual processor 3.2 GHz Xeon system. Separate tests were done on both kernel 2.4 and 2.6 with SMP enabled and disabled. The httperf load generator was run on two separate client machines. This was necessary as a single client was unable to generate enough requests to saturate the server. To reliably test the maximum number of simultaneous connections that the web server could handle, the clients were made to request a static text page of only 6 bytes in size. This ensured that the 100Mbps network bandwidth did not become a bottleneck. The following command line instruction was used for the httperf experiments.

$httperf --hog --port 80 --uri /small.html --server (IP) --rate=(R) --num-conn (R×15) --timeout 5
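A sweep over request rates can be scripted around this command line. The sketch below only builds and prints the commands and does not run httperf; the helper name `httperf_command`, the placeholder server address 192.0.2.10 and the 500 to 4000 step range (read off the axes of the figures) are assumptions made for illustration.

```python
import shlex


def httperf_command(server_ip, rate, uri="/small.html", port=80, timeout=5):
    """Build the httperf argument list for one target request rate."""
    return [
        "httperf", "--hog",
        "--port", str(port),
        "--uri", uri,
        "--server", server_ip,
        f"--rate={rate}",
        "--num-conn", str(rate * 15),  # R x 15 connections, as in the text
        "--timeout", str(timeout),
    ]


for rate in range(500, 4001, 500):  # assumed sweep range
    print(shlex.join(httperf_command("192.0.2.10", rate)))
```

Keeping the number of connections at fifteen times the rate means that each run lasts roughly the same time regardless of the rate being tested.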

The maximum client connection request rate sustained by the server, the response time for the requests, the connection establishment time and the number of errors reported by the two kernels are shown in Figures 1, 2, 3 and 4 respectively.

Figure 1. Request rate sustained by kernel 2.4 and 2.6 (x-axis: number of client requests per second; y-axis: server sustained request rate; series: Kernel 2.6 UNI, Kernel 2.4 UNI)

Figure 2. Response time comparisons of kernel 2.4 and 2.6 (x-axis: number of requests per second; y-axis: response time in ms; series: Kernel 2.6 UNI, Kernel 2.4 UNI)

These graphs clearly show that kernel 2.6 performance is much better than that of kernel 2.4.

Figure 3. Connection time comparisons of kernel 2.4 and 2.6 (x-axis: number of requests per second; y-axis: connection duration in ms; series: Kernel 2.6 UNI, Kernel 2.4 UNI)

Figure 4. Time Out error comparison of kernel 2.4 and 2.6 (x-axis: number of requests per second; y-axis: number of errors; series: Kernel 2.6 UNI, Kernel 2.4 UNI)

Kernel 2.4 struggled to handle more than 2800 simultaneous connections and started reporting errors beyond that point. Its connection time and response time also started rising sharply. In contrast, kernel 2.6 could easily handle 4000 simultaneous connections and there was no sign of any increase in connection time or response time, suggesting that kernel 2.6 would be able to handle an even higher number of simultaneous connections than could be tested. These results again showcase the superiority of the kernel 2.6 scheduler.

4.2. Connection throughput

In this experiment we have compared the maximum throughput of multiple simultaneous client connections in kernel 2.4 and 2.6. The tests were done on a single processor P4 machine which acted as a server by listening on many ports for connections. Clients repeatedly connected to and disconnected from the server without transmitting any data. The server forked a child process for each new connection request.

Two experiments were done with slight variations. In the first experiment the server was made to listen on a fixed number (N) of ports. Exactly N clients were then started which repeatedly connected to and disconnected from each of the open ports on the server. The CPU utilisation of the server and the throughput as seen by the clients were measured. In the second experiment, the server, instead of listening on exactly N ports, was listening on M ports such that M >> N. In our experiments M was fixed at 300 while N was varied from 10 to 100.

During both the experiments the network I/O was monitored to make sure that the network bandwidth was not a bottleneck. Results obtained from the experiments are shown in Figure 5 and Figure 6. It must be noted that new connections on any given active port were made in sequential order, after the connection prior to it had terminated. Hence 100% CPU utilisation could only be achieved with multiple active ports simultaneously attempting connection setup and teardown.

Table 1 displays the average CPU time per connection on the two kernels. This gives us the time the server spent in servicing a single connection, i.e. the total time taken for accept() and close().

The observations from this experiment reveal a very interesting story. The first thing that is evident from Figure 5 and Table 1 is that the processing overhead in kernel 2.6 for connection setup and teardown is higher than that of kernel 2.4. This is attributed to the fact that the kernel 2.6 socket code contains many security hooks. For example, the socket system call in kernel 2.6 has the additional overhead of calling the functions security_socket_create() and security_socket_post_create(), which results in higher CPU utilisation.

Figure 5. Connection throughput comparison with varying number of active connections (x-axis: active connections; y-axis: connections per second; series: Kernel 2.6, Kernel 2.4)

              CPU Time/conn. (µs)
Kernel 2.4    95.02
Kernel 2.6    105.30

Table 1. Average CPU time spent per connection in microseconds.
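As a rough reading aid for Table 1, the per-connection CPU times can be converted into an upper bound on connection throughput. The sketch below assumes that the server CPU is the only bottleneck and is 100% busy; the helper name is hypothetical and the two timings are the values from Table 1.

```python
cpu_time_us = {"Kernel 2.4": 95.02, "Kernel 2.6": 105.30}  # values from Table 1


def max_connections_per_second(us_per_connection):
    """Connections per second that one fully busy CPU could sustain."""
    return 1_000_000 / us_per_connection


for kernel, us in cpu_time_us.items():
    print(f"{kernel}: {max_connections_per_second(us):.0f} connections/s")

# Relative extra CPU cost per connection of kernel 2.6 over kernel 2.4
overhead = cpu_time_us["Kernel 2.6"] / cpu_time_us["Kernel 2.4"] - 1
print(f"Kernel 2.6 extra cost per connection: {overhead:.1%}")
```

Under these assumptions the extra cost per connection in kernel 2.6 comes out at roughly 11%.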

Figure 6 reveals another very interesting fact. When there exists a large number of open ports, the connection setup/teardown throughput in kernel 2.4 lags behind that of kernel 2.6. This clearly demonstrates the superiority of the kernel 2.6 scheduler. The kernel 2.4 scheduler has to cycle through all the processes listening on the open ports in the system, irrespective of whether they are active or not. On the other hand, the kernel 2.6 scheduler is unaffected by the number of open ports in the system and its performance is comparable to that shown in Figure 5.

Figure 6. Measured throughput on the server with 300 open ports and varying number of active connections (x-axis: active connections; y-axis: connections per second; series: Kernel 2.6, Kernel 2.4)

Thus the two conclusions we can draw from this experiment are:

• The per connection processing cost in kernel 2.6 is slightly higher than that of kernel 2.4.

• The kernel 2.6 scheduler is vastly superior to the kernel 2.4 scheduler and to a large degree compensates for the higher per connection processing cost in kernel 2.6.

4.3. Socket system calls

This was a high level test, intended to compare the performance of network related system calls. The tests were run on stock kernel-2.4.20 and kernel-2.6.3. The test environment consisted of a Pentium 4 machine with 256MB RAM. All non-essential services were turned off. The tests were run with the highest priority set and with no other network activity. Custom TCP server and client programs were written and both programs were run on the same machine, connecting over the loopback interface. The time taken by the socket system calls was measured in both the server and the client programs. strace was used for the measurements.
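A measurement of this kind can be approximated from user space. The sketch below is not the custom server and client used in this study, and it times the calls with a clock in the calling process instead of strace, so the figures include interpreter overhead; the helper names, the backlog of 16 and the 20 repetitions are assumptions.

```python
import socket
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed microseconds)."""
    start = time.perf_counter_ns()
    result = fn(*args)
    return result, (time.perf_counter_ns() - start) / 1000.0


def measure_once():
    """Time socket(), bind(), listen() and connect() over loopback once."""
    times = {}
    server, times["socket"] = timed(socket.socket, socket.AF_INET, socket.SOCK_STREAM)
    _, times["bind"] = timed(server.bind, ("127.0.0.1", 0))  # port 0: any free port
    _, times["listen"] = timed(server.listen, 16)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    _, times["connect"] = timed(client.connect, server.getsockname())
    client.close()
    server.close()
    return times


runs = [measure_once() for _ in range(20)]
for call in ("socket", "bind", "listen", "connect"):
    mean = sum(r[call] for r in runs) / len(runs)
    print(f"{call}(): {mean:.2f} us")
```

Averaging over several runs, as done here, smooths out the variation of individual calls; absolute values from such a script are not comparable with the strace figures reported in this section.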

The results obtained are shown in Table 2. The observed socket system call times in both the kernels are very comparable. There is not much difference in the bind() and socket() system call overheads between the two kernels, but the listen() and connect() system calls are a little cheaper in kernel 2.6.

Kernel ⇒     2.4.20    2.6.3
socket()     18.05     20.56
bind()        2.91      3.37
listen()     32.37     25.97
connect()    98.97     89.19

Table 2. Average time spent in each system call. Values in microseconds.

4.4. Bulk data transfer performance on SMP system

Bulk data transfer throughput is one of the most important parameters for measuring the TCP stack's performance. To measure the maximum data transfer throughput the freely available tool iperf was used. iperf is a tool to measure maximum TCP bandwidth, allowing the tuning of various parameters. iperf can report bandwidth, delay jitter and datagram loss.

We used a system with dual 3.2 GHz Xeon processors as our test machine. The Xeon processors were HyperThreading (HT) [5] capable and we performed different sets of tests with HT both enabled and disabled. Kernels 2.4 and 2.6 were compiled with SMP support and their performance was compared. The tests were also repeated without SMP support on both the kernels to obtain a reference point for comparison. The experiments were run over the loopback interface as the physical 100Mbps network interface was too slow for testing the CPU. Since the experiments were done over the loopback interface, the cost of interrupt handling and the efficiency of the device driver code did not come into the picture. The iperf server and client were invoked with the following command line arguments respectively.

$iperf -s -p 9999

$iperf -c localhost -p 9999 -w 255K -t 30

The tests were run for a duration of 30 seconds with the TCP window size set to the maximum of 255KB. The buffer size was set to the default value of 8KB. This experiment yielded some surprising results which provided significant insight into the SMP scalability issue of the kernel TCP stack. The results are shown in Figure 7.

As one might expect, in Uni-Processor mode kernel 2.6 was considerably faster than kernel 2.4, but the surprising result here is that of kernel 2.6 in SMP mode. The observed throughput in SMP mode oscillated between 3.4 Gbits/sec and 7.8 Gbits/sec. Such a large variation cannot be attributed to random errors. On the other hand, in the case of SMP kernel 2.4, the throughput rises marginally to 5.1 Gbits/sec from 4.6 Gbits/sec.

The higher data throughput of kernel 2.6 in Uni-Processor mode is due to its more efficient copy routines, as discussed in section 3.

The variability in SMP kernel 2.6 could be attributed to “cache bouncing”. In kernel 2.6, because of its better scheduling logic and smaller kernel locks, packet processing can be distributed over all available processors. In our tests, iperf creates a single TCP connection and sends data over that connection, but if incoming packets of a connection are processed on different CPUs it leads to frequent cache misses, as packets belonging to the same connection are not able to reuse the TCP state information optimally. This results in poorer performance in comparison to the Uni-Processor kernel, because in a Uni-Processor kernel the entire processing is done on a single CPU, which can take advantage of the TCP state information present in its cache, resulting in fewer misses.

This also explains the fluctuating high performance (~7.5 Gbits/sec) on the 2.6 SMP kernel. Since the Intel Xeon processors are hyper-threaded, the SMP scheduler would randomly schedule the packet processing on 2 logical processors of the same physical processor. In such a situation there will not be any cache penalty, as the logical processors have access to the same cache. To verify this, HT was disabled and the SMP kernel tests repeated. This stopped the performance oscillations and the 2.6 SMP kernel consistently gave a throughput of 3.4 - 3.5 Gbits/sec. The results from the kernel profiling tests in section 5 also confirm the fact that the 2.6 SMP kernel is spending excessive amounts of time in its copy routines.

Figure 7. Graphical representation of the data transfer rates achieved in different test cases
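The cache bouncing explanation can be illustrated with a toy model that counts how often the state of a single connection has to move between caches. This is a sketch under stated assumptions and not a measurement: the logical CPU numbering, the cache labels, the uniformly random choice of CPU and the function name are all hypothetical.

```python
import random


def state_migrations(cpu_sequence, cache_of):
    """Count how often the connection state must move to a different cache.

    cpu_sequence: logical CPU that processes each successive packet.
    cache_of:     maps a logical CPU to the cache it uses.
    """
    migrations = 0
    holder = None
    for cpu in cpu_sequence:
        cache = cache_of[cpu]
        if holder is not None and cache != holder:
            migrations += 1
        holder = cache
    return migrations


rng = random.Random(0)   # fixed seed so the run is repeatable
packets = 10_000

# Assumed layout: logical CPUs 0/1 share cache A, logical CPUs 2/3 share cache B.
cache_of = {0: "A", 1: "A", 2: "B", 3: "B"}

uni = [0] * packets
siblings = [rng.choice([0, 1]) for _ in range(packets)]        # same package
cross_package = [rng.choice([0, 2]) for _ in range(packets)]   # different packages

for name, seq in [("uni", uni), ("HT siblings", siblings), ("two packages", cross_package)]:
    print(name, state_migrations(seq, cache_of))
```

In this model the single CPU case and the case of two logical processors sharing a cache never move the connection state, while alternating between physical processors moves it on about half of the packets.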

Further tests were done by transmitting data over two TCP connections instead of one, to check the SMP performance of kernel 2.4 and 2.6. Figure 7 clearly shows that data transfer throughput in SMP kernel 2.6 is much better with two TCP connections on a dual CPU system. The increase in data throughput on kernel 2.6 in SMP mode on a dual processor system with two TCP connections is nearly 180% when compared with the data throughput of a single TCP connection on a Uni-Processor kernel. On the other hand, kernel 2.4 shows a speedup of only 147%.

To verify that the observed behaviour of kernel 2.6 in SMP mode with a single TCP connection is an anomaly, we did more tests with multiple simultaneous TCP connections. The results are shown in Figure 8. It clearly shows that the low performance of kernel 2.6 in SMP mode with a single TCP connection is indeed an anomaly. When the number of simultaneous TCP connections is increased, kernel 2.6 gives excellent performance.

Figure 8. Data throughput with varying number of TCP connections (x-axis: number of TCP connections; y-axis: data throughput in Gbps; series: Kernel 2.4-uni, Kernel 2.4-smp, Kernel 2.6-uni, Kernel 2.6-smp)

A few other observations can be made from Figure 8. In Uni-Processor kernel 2.4 there is practically no change in the data throughput with an increasing number of simultaneous TCP connections. On the other hand, in SMP kernel 2.4, the data throughput rises initially with two simultaneous connections but drops slightly as the number of parallel TCP connections is increased. Since the test machine had only two physical processors, this implies that kernel 2.4 SMP incurs some small penalties while multiplexing multiple TCP streams on a physical processor.

5. Kernel profiling results

The processing of TCP packets involves interaction with a large number of subsystems within the kernel. Any attempt to optimise and improve the packet processing time cannot be successful unless all these factors are considered. To identify these overheads, we profiled both the Linux kernels using OProfile [8]. It is a statistical profiler that uses hardware performance counters available on modern processors to collect information on executing processes. The profiling results also provide valuable insight into, and a concrete explanation of, the anomalous behaviour observed in section 4.4 for SMP kernel 2.6 processing a single TCP connection on a dual CPU system.

5.1. Breakup of TCP packet processing overheads

The breakup of TCP packet processing overheads is shown in Table 3. It lists the kernel functions that took more than 1% of the overall TCP packet processing time. The boomerang_interrupt function is the interrupt service routine for the 3COM 3c59x series NIC, which was used in our experiments. The other boomerang_* functions are also part of the NIC driver and are involved in packet transmission and reception. copy_from_user_ll copies a block of memory from user space to kernel space. csum_partial is the kernel checksumming routine.

Thus we can see that the NIC driver code, interrupt processing, buffer copying and checksumming are the most CPU intensive operations during TCP packet processing. In comparison, TCP functions take up only a small part of the overall CPU time. These data make a strong case for TCP offloading, which could potentially save 30-40% of CPU time.
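To see how the listed functions add up by category, the percentages of Table 3 can be tallied with a short script. The rows below are copied from Table 3; the grouping of functions into categories is an assumption made for this sketch (in particular, counting issue_and_wait as NIC driver code), and the script only sums the listed rows.

```python
# (samples, percent, function) rows copied from Table 3
rows = [
    (24551, 12.0273, "boomerang_interrupt"),
    (15615, 7.6496, "boomerang_start_xmit"),
    (14559, 7.1323, "copy_from_user_ll"),
    (12037, 5.8968, "issue_and_wait"),
    (8904, 4.3620, "csum_partial"),
    (6811, 3.3366, "mark_offset_tsc"),
    (5442, 2.6660, "ipt_do_table"),
    (5389, 2.6400, "csum_partial"),
    (4913, 2.4068, "boomerang_rx"),
    (4806, 2.3544, "ipt_do_table"),
    (3654, 1.7901, "tcp_sendmsg"),
    (3426, 1.6784, "irq_entries_start"),
    (2832, 1.3874, "default_idle"),
    (2382, 1.1669, "skb_release_data"),
    (2052, 1.0053, "ip_queue_xmit"),
    (2039, 0.9989, "tcp_v4_rcv"),
    (2013, 0.9862, "timer_interrupt"),
]


def category(name):
    """Assumed grouping of function names into coarse categories."""
    if name.startswith("boomerang_") or name == "issue_and_wait":
        return "NIC driver"
    if name.startswith("copy_"):
        return "buffer copy"
    if name.startswith("csum_"):
        return "checksum"
    return "other"


totals = {}
for _samples, percent, name in rows:
    cat = category(name)
    totals[cat] = totals.get(cat, 0.0) + percent

for cat, percent in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{cat:12s} {percent:6.2f}%")
```

The driver, copy and checksum categories are the parts of the work that an offload-capable NIC could in principle take over, which is why they are separated from the remaining functions here.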

5.2. Analysis of kernel 2.6 SMP anomaly

In section 4.4 we observed that there was a sharp drop in the performance of SMP kernel 2.6 when a single TCP connection was set up on a dual CPU system, but as the number of TCP flows was increased to 2 and more, kernel 2.6 performed extremely well.

To analyse this anomalous behaviour, we re-ran the data throughput experiments for kernel 2.6 in both SMP and Uni-Processor mode, and profiled the kernel during that period. In these experiments, each TCP connection sent and received exactly 2GB of data. This allowed us to directly compare the samples collected in both the situations.

CPU Samples   %         Function Name
24551         12.0273   boomerang_interrupt
15615          7.6496   boomerang_start_xmit
14559          7.1323   copy_from_user_ll
12037          5.8968   issue_and_wait
 8904          4.3620   csum_partial
 6811          3.3366   mark_offset_tsc
 5442          2.6660   ipt_do_table
 5389          2.6400   csum_partial
 4913          2.4068   boomerang_rx
 4806          2.3544   ipt_do_table
 3654          1.7901   tcp_sendmsg
 3426          1.6784   irq_entries_start
 2832          1.3874   default_idle
 2382          1.1669   skb_release_data
 2052          1.0053   ip_queue_xmit
 2039          0.9989   tcp_v4_rcv
 2013          0.9862   timer_interrupt

Table 3. Breakup of TCP packet processing overheads in the kernel

CPU Samples   %         Function Name
122519        13.1518   copy_from_user_ll
 94653        10.1605   copy_to_user_ll
 45455         4.8794   system_call
 41397         4.4438   (no symbols)
 35921         3.8559   schedule
 35829         3.8461   tcp_sendmsg
 31186         3.3477   switch_to

Table 5. TCP packet processing overheads in Kernel 2.6 UNI with a single TCP connection


CPU 0 Samples   %         CPU 1 Samples   %         Total     %         Function Name
373138          28.5411   417998          31.7169   791136    30.1354   copy_from_user_ll
293169          22.4243   264998          20.1076   558167    21.2613   copy_to_user_ll
 74732           5.7162    82923           6.2921   157655     6.0053   tcp_sendmsg
 26537           2.0298    24047           1.8246    54153     2.0628   schedule
 25327           1.9372    28826           2.1873    50584     1.9268   (no symbols)
 23410           1.7906    23275           1.7661    46685     1.7783   system_call
 21441           1.64      21757           1.6509    43198     1.6455   tcp_v4_rcv

Table 4. TCP packet processing costs in Kernel 2.6 SMP with a single TCP connection

CPU 0 Samples   %         CPU 1 Samples   %         Total    %         Function Name
129034          11.215    127949          11.1399   256983   11.1775   copy_from_user_ll
121712          10.5786   127182          11.0731   248894   10.8256   copy_to_user_ll
 59891           5.2054    56946           4.958    116837    5.0818   schedule
 47992           4.1712    46888           4.0823    94880    4.1268   tcp_sendmsg
 44767           3.8909    46023           4.007     90790    3.9489   system_call
 32413           2.8172    30421           2.6486    62834    2.733    switch_to
 26090           2.2676    25822           2.2482    51912    2.2579   tcp_v4_rcv

Table 6. TCP packet processing costs in Kernel 2.6 SMP with two TCP connections

The most striking fact emerging from Tables 4 and 6 is the large increase in time spent in the kernel copy routines. The functions copy_from_user_ll() and copy_to_user_ll() copy buffers from user space to kernel space and from kernel space to user space, respectively. The time spent in these two functions rises very sharply in the SMP kernel with a single TCP connection: more than 50% of the total time is spent in them. Such a sharp increase in the cost of the copy routines can be attributed to a high processor cache miss rate. To verify this, the copy_from_user_ll() and copy_to_user_ll() routines were analysed further, and it was found that more than 95% of the time in these routines was spent on the assembly instruction

repz movsl %ds:(%esi),%es:(%edi)

The above instruction copies data between the memory locations pointed to by the registers in a loop. The performance of the movsl instruction is heavily dependent on processor data cache hits and misses. The significantly higher number of clock cycles required by the movsl instruction in the SMP kernel 2.6 case, for copying the same amount of data, can only be explained by an increase in processor data cache misses.
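Because every connection moved the same 2 GB, the raw sample counts of the copy routines can be compared across configurations. The Python sketch below does that arithmetic with the totals from Tables 4, 5 and 6. The dictionary layout and helper names are illustrative choices, and the two-connection figure is divided by two on the assumption that both connections contribute equally.

```python
# Total samples for the two copy routines, copied from Tables 4, 5 and 6
# (for the SMP runs, the "Total" column summed over both CPUs).
# Each TCP connection transferred the same 2 GB of data.
COPY_SAMPLES = {
    "uni_1conn": {"copy_from_user_ll": 122519, "copy_to_user_ll": 94653,
                  "connections": 1},
    "smp_1conn": {"copy_from_user_ll": 791136, "copy_to_user_ll": 558167,
                  "connections": 1},
    "smp_2conn": {"copy_from_user_ll": 256983, "copy_to_user_ll": 248894,
                  "connections": 2},
}


def copy_samples_per_connection(config):
    """Copy-routine samples per connection for one configuration.

    Assumes every connection in a run accounts for an equal share of
    the samples (an assumption, not something the profile reports).
    """
    entry = COPY_SAMPLES[config]
    total = entry["copy_from_user_ll"] + entry["copy_to_user_ll"]
    return total / entry["connections"]


def copy_cost_ratio(config, baseline):
    """Ratio of per-connection copy samples between two configurations."""
    return (copy_samples_per_connection(config)
            / copy_samples_per_connection(baseline))


for baseline in ("uni_1conn", "smp_2conn"):
    ratio = copy_cost_ratio("smp_1conn", baseline)
    print(f"smp_1conn vs {baseline}: {ratio:.2f}x copy samples per connection")
```

Under these assumptions, the single-connection SMP run accumulates roughly six times as many copy-routine samples as the uniprocessor run, and roughly five times as many per connection as the two-connection SMP run. This is consistent with the copy routines being the source of the anomaly.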

6. Conclusion

References

[1] M. Björkman and P. Gunningberg. Locking effects in multiprocessor implementations of protocols. In Conference proceedings on Communications architectures, protocols and applications, pages 74–83. ACM Press, 1993.

[2] M. Björkman and P. Gunningberg. Performance modeling of multiprocessor implementations of protocols. IEEE/ACM Trans. Netw., 6(3):262–273, 1998.

[3] J. W. Davidson and S. Jinturkar. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In MICRO 28: Proceedings of the 28th annual international symposium on Microarchitecture, pages 125–132. IEEE Computer Society Press, 1995.

[4] A Tool for Measuring Web Server Performance. World Wide Web, http://www.hpl.hp.com/personal/David_Mosberger/httperf.html.

[5] D. Marr, F. Binns, D. Hill, G. Hinton, and D. Koufaty. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal, http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p15_authors.htm, 2002.

[6] J. C. Mogul. TCP offload is a dumb idea whose time has come. In Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, May 2003.

[7] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. F. Towsley. Performance issues in parallelized network protocols. In Operating Systems Design and Implementation, pages 125–137, 1994.

[8] OProfile profiling system for Linux 2.2/2.4/2.6. World Wide Web, http://oprofile.sourceforge.net.

[9] V. S. Pai and S. Adve. Code transformations to improve memory parallelism. In MICRO 32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, pages 147–155. IEEE Computer Society, 1999.

[10] D. C. Schmidt and T. Suda. Measuring the impact of alternative parallel process architecture on communication subsystem performance. In Protocols for High-Speed Networks IV, pages 123–138. Chapman & Hall, Ltd., 1995.

[11] P. Shivam and J. S. Chase. On the elusive benefits of protocol offload. In Proceedings of the ACM SIGCOMM workshop on Network-I/O convergence, pages 179–184. ACM Press, 2003.

[12] H. Xie, L. Zhao, and L. Bhuyan. Architectural analysis and instruction-set optimization for design of network protocol processors. In Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign & system synthesis, pages 225–230. ACM Press, 2003.