Part IV: Network Layer

Chapter 14. The Internet Protocol

14.2 Implementing the Internet Protocol

This section explains the architecture of the IP instance in the Linux kernel. We will use the path a packet takes across the IP layer to introduce the basic properties of the Internet Protocol. We assume that this is a normal IP packet without special properties, to ensure that our explanations will be clear and easy to understand. All special functions of the Internet Protocol, such as fragmenting and reassembling, source routing, multicasting, and so on, will be described in the next chapters. The objective of this section is to introduce the fundamental operation of the IP implementation in Linux, to be able to better understand more complex parts later on. This section also serves as an entry point into the other chapters of this book, because each packet passes the IP layer, where it can take a particular path (e.g., across a firewall or a tunnel). It is necessary to understand how the Internet Protocol is implemented in the Linux kernel to understand later chapters.

An IP packet can enter the IP instance in three different places:

• Packets arriving in a computer over a network adapter are stored in the input queue of the respective CPU, as described in Chapter 6. Once the data-link layer has determined the layer-3 protocol (which is ETH_P_IP in this case), the packets are passed to the ip_rcv() function. The path these packets take will be described in Section 14.2.1.

• The second entry point for IP packets is at the interface to the transport protocols. These are packets used by TCP, UDP, and other protocols that use the IP protocol. They use the ip_queue_xmit() function to pack a transport-layer PDU into an IP packet and send it. Other functions are available to generate IP packets at the boundary with the transport layer. These functions and the operation of ip_queue_xmit() will be described in Section 14.2.2.

• With the third option, the IP layer generates IP packets itself, on the Internet Protocol's initiative. These are mainly new multicast packets, new fragments of a large packet, and ICMP or IGMP packets that don't include a special payload. Such packets are created by specific methods (e.g., icmp_send()). (See Section 14.4.)

Once a packet (or socket buffer) has entered the IP layer, there are several options for how it can exit. We generally distinguish two different roles a computer can assume with regard to the Internet Protocol, where the first case is a special case of the second:

• End system: A Linux computer is normally configured as an end system; it is used as a workstation or server, primarily running user applications or providing application services. A Web server and a network printer are likewise nothing but end systems (with regard to the IP layer). The basic property of end systems is that they do not forward IP packets. An end system can therefore often be recognized by the fact that it has only one network adapter. However, even a system that has several network interfaces can be configured as a host, if packet forwarding is disabled.

• Router: A router passes IP packets arriving in a network adapter to a second network adapter. This means that a router has several network adapters and forwards packets between these interfaces. When packets arrive in a router, there are generally two options: the router can deliver them locally (i.e., pass them to the transport layer) or it can forward them. The first case is identical with the procedure for packets arriving in an end system, where packets are always delivered locally. Consequently, a router can be thought of as a generalization of an end system, with the additional capability of forwarding packets. In contrast to end systems, routers generally do not run applications, so that packets can be forwarded as fast as possible.

Linux lets you enable and disable the packet-forwarding mechanism at runtime, provided that forwarding support was integrated when the kernel was built. The directory /proc/sys/net/ipv4/ includes a virtual file, ip_forward. You will see in Appendix B.3 that there is a way to change system settings from within the proc directory. If a 0 is written to this file, then packet forwarding is disabled. To activate IP packet forwarding, you can use the command echo '1' > /proc/sys/net/ipv4/ip_forward.
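The same switch can also be set from a program. The following is a minimal user-space sketch, not taken from the kernel sources: the helper names set_ip_forward() and get_ip_forward() are hypothetical, and the path is passed as a parameter so that the code can be tried on an ordinary file. On a real system the path would be /proc/sys/net/ipv4/ip_forward, and writing to it requires root privileges.

```c
#include <stdio.h>

/* Hypothetical helper: write "1" or "0" to the forwarding control file.
 * Returns 0 on success and -1 on any I/O error. */
int set_ip_forward(const char *path, int enable)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read the integer stored in the control file.
 * Returns that value, or -1 if the file cannot be read or parsed. */
int get_ip_forward(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling set_ip_forward("/proc/sys/net/ipv4/ip_forward", 1) would correspond to the echo command shown above.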

Figure 14-4 shows the path an IP packet takes across the Internet Protocol implementation in Linux. The gray ovals represent invoked functions, and the rectangles show the position of the netfilter hooks in the Internet Protocol.

Figure 14-4. Architecture of the Internet Protocol implementation in Linux.

The following sections describe different paths a packet can take across the IP implementation in the Linux kernel. We begin with incoming packets, which have to be either forwarded or delivered locally. The next section describes how packets are passed from the transport layer to IP.

14.2.1 The Path of Incoming IP Packets

Chapter 6 introduced the path of an incoming packet up to the boundary of layer 3. Once the NET_RX tasklet has removed a packet from the input queue, net_rx_action() chooses the appropriate layer-3 protocol. For IP packets, the ip_rcv() function is invoked on the basis of the identifier in the Ethernet protocol field (ETH_P_IP) or from appropriate fields of other MAC transmission protocols.

ip_rcv() net/ipv4/ip_input.c

ip_rcv(skb, dev, pkt_type) does some work for the IP protocol. First, the function rejects packets not addressed to the local computer. For example, promiscuous mode allows a network device to accept packets actually addressed to another computer. The lower layers mark such packets by their packet type (skb->pkt_type == PACKET_OTHERHOST), and ip_rcv() discards them.

Subsequently, the basic correctness criteria of a packet are checked:

• Does the packet have at least the size of an IP header?

• Is this IP Version 4?

• Is the checksum correct?

• Does the packet have a wrong length?

If the actual packet size does not match the information maintained in the socket buffer (skb->len), then the current packet data range is adapted by skb_trim(skb, iph->tot_len). (See Section 4.1.) Once the packet has passed these checks, the netfilter hook NF_IP_PRE_ROUTING is invoked.
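The checks listed above can be illustrated with a small user-space sketch that works on a raw byte buffer. It is not the kernel's ip_rcv() code: the function names ip_header_checksum() and ip_sanity_check() are hypothetical, and the sketch only assumes the standard IPv4 header layout (version and header length in the first byte, total length in bytes 2 and 3, header checksum in bytes 10 and 11).

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over an IPv4 header of `len` bytes (len is even).
 * Computed over a header whose checksum field is already correct, the
 * result is 0. */
uint16_t ip_header_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Returns the datagram length taken from the header if all checks pass,
 * or -1 if the packet has to be discarded. The caller would then trim
 * its buffer to the returned length. */
int ip_sanity_check(const uint8_t *pkt, size_t buf_len)
{
    if (buf_len < 20)                      /* at least the size of an IP header? */
        return -1;

    unsigned version = pkt[0] >> 4;
    unsigned ihl = pkt[0] & 0x0F;          /* header length in 32-bit words */
    if (version != 4 || ihl < 5)           /* IP Version 4? */
        return -1;

    size_t hdr_len = (size_t)ihl * 4;
    if (buf_len < hdr_len)
        return -1;
    if (ip_header_checksum(pkt, hdr_len) != 0)   /* checksum correct? */
        return -1;

    size_t tot_len = (size_t)((pkt[2] << 8) | pkt[3]);
    if (tot_len < hdr_len || tot_len > buf_len)  /* wrong length? */
        return -1;

    return (int)tot_len;
}
```

A buffer that is longer than the length stated in the header (for example, because of link-layer padding) passes the checks; the returned value tells the caller how far to trim it.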

Netfilter allows you to extend the procedure of various protocols by specific functions, if desired. Netfilter hooks always reside in strategic points of certain protocols and are used, for example, for firewall, QoS, and address-translation functions. These examples will be discussed in later chapters. A netfilter hook is invoked by a macro, and the function following the handling of the netfilter extension is passed to this macro in the form of a function pointer. If netfilter was not configured, then the macro ensures that there is a direct jump to this follow-up function. We can see in Figure 14-4 that the procedure continues with ip_rcv_finish(skb).
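The idea of passing the follow-up function to the hook can be sketched in plain C. This is an illustration of the pattern only, not the kernel's netfilter macro: run_hook(), the verdict values, and the packet structure are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical packet stand-in for this sketch. */
struct packet_sketch {
    int id;
    int finished;   /* set to 1 once the follow-up function has run */
};

enum verdict { VERDICT_DROP = 0, VERDICT_ACCEPT = 1 };

typedef enum verdict (*hook_fn)(struct packet_sketch *pkt);
typedef int (*okfn_t)(struct packet_sketch *pkt);

/* If an extension is registered, ask it for a verdict first. If none is
 * registered (hook == NULL), jump directly to the follow-up function. */
int run_hook(hook_fn hook, struct packet_sketch *pkt, okfn_t okfn)
{
    if (hook != NULL && hook(pkt) == VERDICT_DROP)
        return -1;          /* the extension discarded the packet */
    return okfn(pkt);       /* continue with normal protocol processing */
}

/* Follow-up function, playing the role of ip_rcv_finish(). */
int rcv_finish_sketch(struct packet_sketch *pkt)
{
    pkt->finished = 1;
    return 0;
}

/* Example extension: drops packets with an odd id. */
enum verdict drop_odd(struct packet_sketch *pkt)
{
    return (pkt->id % 2) ? VERDICT_DROP : VERDICT_ACCEPT;
}
```

Because the continuation is a parameter, the protocol code itself does not need to know whether any extension is present.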

ip_rcv_finish() net/ipv4/ip_input.c

The function ip_route_input() is invoked within ip_rcv_finish(skb) to determine the route of a packet. The skb->dst pointer of the socket buffer is set to an entry in the routing cache, which stores not only the destination on the IP level, but also a pointer to an entry in the hard header cache (cache for layer-2 frame packet headers), if present. If ip_route_input() cannot find a route, then the packet is discarded.

In the next step, ip_rcv_finish() checks whether the IP packet header includes options. If this is the case, then the options are analyzed, and an ip_options structure is created. All options set are stored in this structure in an efficient form. Section 14.3 describes how IP options are handled.

Finally in ip_rcv_finish(), the procedure of the IP protocol reaches the junction between packets addressed to the local computer and packets to be forwarded. The information about the further path of an IP packet is stored in the routing entry skb->dst. Notice that a trick often used in the Linux kernel is used here. Instead of storing a variable value that selects among different functions, we simply store a pointer to the function to be used. This saves us an if or switch instruction for each decision of how the program should continue. In the example used here, the pointer skb->dst->input() points to the function that should be used to handle a packet further:

• ip_local_deliver() is entered in the case of unicast and multicast packets that should be delivered to the local computer.

• ip_forward() handles all unicast packets that should be forwarded.

• ip_mr_input() is used for multicast packets that should be forwarded.
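The function-pointer trick can be shown in a few lines of C. The sketch below uses simplified stand-in types (struct dst_sketch, struct skb_sketch) and hypothetical handler names; it only mirrors the idea that the routing step stores the handler once and the receive path calls it without any further decision.

```c
enum path { PATH_NONE, PATH_LOCAL, PATH_FORWARD, PATH_MC_FORWARD };

struct skb_sketch;

/* Routing entry: carries the handler chosen during route lookup. */
struct dst_sketch {
    int (*input)(struct skb_sketch *skb);
};

struct skb_sketch {
    struct dst_sketch *dst;
    enum path taken;        /* records which handler ran (illustration only) */
};

int local_deliver_sketch(struct skb_sketch *skb) { skb->taken = PATH_LOCAL; return 0; }
int forward_sketch(struct skb_sketch *skb) { skb->taken = PATH_FORWARD; return 0; }
int mr_input_sketch(struct skb_sketch *skb) { skb->taken = PATH_MC_FORWARD; return 0; }

/* The decision is made once, when the route is determined ... */
void route_input_sketch(struct dst_sketch *dst, int is_local, int is_multicast)
{
    if (is_local)
        dst->input = local_deliver_sketch;   /* unicast or multicast, local */
    else if (is_multicast)
        dst->input = mr_input_sketch;        /* multicast to be forwarded */
    else
        dst->input = forward_sketch;         /* unicast to be forwarded */
}

/* ... so the receive path needs no if or switch of its own. */
int rcv_finish_sketch(struct skb_sketch *skb)
{
    return skb->dst->input(skb);
}
```

The cost of the decision is paid during routing; every later stage simply follows the stored pointer.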

We can see from the above discussion that a packet can take different paths. The following section describes how packets to be forwarded are handled (skb->dst->input = ip_forward). Subsequently, we will see how skb->dst->input = ip_local_deliver handles packets to be delivered locally.

Forwarding Packets

If a computer has several network adapters, and if IP packet forwarding is enabled (/proc/sys/net/ipv4/ip_forward = 1), then packets addressed to other computers are handled by the ip_forward() function. This function does all the work necessary for forwarding a packet. The most important task, routing, was already done in ip_rcv_finish(), because it is necessary to be able to discover whether the packet is to be delivered locally or has to be forwarded.

ip_forward() net/ipv4/ip_forward.c

The primary task of ip_forward(skb) is to process a few conditions of the Internet Protocol (e.g., a packet's lifetime) and packet options. First, packets not marked with pkt_type == PACKET_HOST are discarded. Next, the remaining lifetime of the packet is checked. If the value in its TTL field is 1 (before it is decremented), then the packet is discarded. RFC 791 specifies that, if such an action occurs, an ICMP packet has to be returned to the sender to inform the latter (ICMP_TIME_EXCEEDED).

Once a redirect message has been checked, if applicable, the socket buffer is checked to see if there is sufficient memory for the headroom. This means that the function skb_cow(skb, headroom) is used to check whether there is still sufficient space for the MAC header of the output network device (out_dev->hard_header_len). If this is not the case, then skb_realloc_headroom() creates sufficient space. Subsequently, the TTL field of the IP packet is decremented by one.
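The lifetime check and the headroom check can be sketched in user space as follows. The buffer structure and the functions ensure_headroom() and forward_step() are hypothetical simplifications and do not reproduce skb_cow() or skb_realloc_headroom(). The sketch treats a TTL of 0 like a TTL of 1, and it leaves out the header-checksum update that is needed after the TTL field has changed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer: `headroom` free bytes, then `len` bytes of packet data. */
struct buf_sketch {
    unsigned char *head;   /* start of the allocation (from malloc) */
    size_t headroom;       /* free bytes in front of the packet data */
    size_t len;            /* bytes of packet data */
};

enum fwd_result { FWD_OK, FWD_TIME_EXCEEDED, FWD_NO_MEMORY };

/* Make sure at least `needed` bytes are free in front of the packet data.
 * If not, move the data into a new, larger allocation. */
int ensure_headroom(struct buf_sketch *b, size_t needed)
{
    if (b->headroom >= needed)
        return 0;

    unsigned char *n = malloc(needed + b->len);
    if (n == NULL)
        return -1;
    memcpy(n + needed, b->head + b->headroom, b->len);
    free(b->head);
    b->head = n;
    b->headroom = needed;
    return 0;
}

/* One forwarding step: lifetime check, headroom check, TTL decrement. */
enum fwd_result forward_step(struct buf_sketch *b, uint8_t *ttl, size_t l2_hdr_len)
{
    if (*ttl <= 1)
        return FWD_TIME_EXCEEDED;   /* caller would send ICMP_TIME_EXCEEDED */
    if (ensure_headroom(b, l2_hdr_len) != 0)
        return FWD_NO_MEMORY;
    *ttl -= 1;
    return FWD_OK;
}
```

Checking the TTL before touching the buffer means that an expired packet never causes a reallocation.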

When the actual packet length (including the MAC header) is known, it is checked whether the packet really fits into the frame format of the new output network device. If it is too long (skb->len > mtu), and if no fragmenting is allowed because the Don't-Fragment bit is set in the IP header, then the packet is discarded, and the ICMP message ICMP_FRAG_NEEDED is transmitted to the sender. In any case, the packet is not fragmented yet; fragmenting is delayed. The early test for such cases prevents potential Don't-Fragment candidates from running through the entire IP protocol-handling process, only to be dropped eventually.
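The early test can be written as a small decision function. This is a sketch with hypothetical names (check_mtu(), IP_DF_FLAG, and the result values); it only assumes that the Don't-Fragment bit is bit 0x4000 of the 16-bit flags and fragment-offset field in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Don't-Fragment bit in the 16-bit flags/fragment-offset field (host order). */
#define IP_DF_FLAG 0x4000

enum mtu_result {
    MTU_FITS,               /* packet can be sent as it is */
    MTU_FRAGMENT_LATER,     /* too long, but may be fragmented further down */
    MTU_DROP_FRAG_NEEDED    /* too long and DF set: drop, report ICMP_FRAG_NEEDED */
};

enum mtu_result check_mtu(size_t pkt_len, size_t mtu, uint16_t frag_off)
{
    if (pkt_len <= mtu)
        return MTU_FITS;
    if (frag_off & IP_DF_FLAG)
        return MTU_DROP_FRAG_NEEDED;
    return MTU_FRAGMENT_LATER;   /* fragmenting itself is delayed */
}
```

For example, a 1500-byte packet with the Don't-Fragment bit set that is routed to a device with an MTU of 1400 yields MTU_DROP_FRAG_NEEDED, whereas the same packet without the bit yields MTU_FRAGMENT_LATER.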

ip_forward_finish() net/ipv4/ip_forward.c

We can see in Figure 14-4 that the ip_forward() function is split into two parts by a netfilter hook. Once the NF_IP_FORWARD hook has been processed, the procedure continues with ip_forward_finish(). This function actually has very little functionality (unless FASTROUTE is enabled). Once the IP options, if used, have been processed in ip_forward_options(), the ip_send() function is invoked to check whether the packet has to be fragmented and to fragment it, if necessary. (See Section 14.2.3.)

ip_send() include/net/ip.h

ip_send(skb) decides whether the packet should be passed to ip_finish_output() immediately or ip_fragment() should first adapt it to the appropriate layer-2 frame size. (See Section 14.2.3.)

ip_finish_output() net/ipv4/ip_output.c

ip_finish_output(skb) initiates the last tasks of the Internet Protocol. First, the skb->dev pointer is set to the output network device dev, and the layer-2 packet type is set to ETH_P_IP. Subsequently, the netfilter hook NF_IP_POST_ROUTING is processed. The exact operation of netfilter and the set of different hooking points within the Internet Protocol are described in Section 19.3. As usual for netfilter hooks, processing then continues with a follow-up function, in this case the inline function ip_finish_output2().

ip_finish_output2() net/ipv4/ip_output.c

At this point, the packet leaves the Internet Protocol, and the Address Resolution Protocol (ARP) is used, if necessary. Chapter 15 describes the Address Resolution Protocol. For now, it is sufficient to understand the following:

• If the routing entry used (skb->dst) already includes a reference to the layer-2 header cache (dst->hh), then the layer-2 packet header is copied directly into the packet-data space of the socket buffer, in front of the IP packet header. The output() function used here is dev_queue_xmit(), which is invoked if the entry in the hard header cache is valid. dev_queue_xmit() ensures that the socket buffer is sent immediately over the network device, dev.

• If there is no entry in the hard header cache yet, then the corresponding address-resolution routine is invoked, which is normally the function neigh_resolve_output().
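The two cases can be sketched with a simplified frame buffer. The types and the function finish_output_sketch() are hypothetical, and the 14-byte header length is an assumption that corresponds to an Ethernet header; the sketch only shows the cached header being copied in front of the IP header, or the fallback to address resolution.

```c
#include <stddef.h>
#include <string.h>

#define L2_HDR_LEN 14   /* assumed layer-2 header length (Ethernet) */

/* Hypothetical hard header cache entry: a ready-made layer-2 header. */
struct hh_cache_sketch {
    unsigned char hdr[L2_HDR_LEN];
};

/* Hypothetical frame buffer: packet data starts at buf + data_off. */
struct frame_sketch {
    unsigned char buf[64];
    size_t data_off;    /* free space in front of the IP header */
    size_t len;         /* current length of the packet data */
};

enum out_path { OUT_CACHED_HEADER, OUT_NEEDS_RESOLUTION, OUT_NO_HEADROOM };

enum out_path finish_output_sketch(struct frame_sketch *f,
                                   const struct hh_cache_sketch *hh)
{
    if (hh == NULL)
        return OUT_NEEDS_RESOLUTION;    /* address resolution has to run first */
    if (f->data_off < L2_HDR_LEN)
        return OUT_NO_HEADROOM;

    /* Prepend the cached layer-2 header in front of the IP header. */
    f->data_off -= L2_HDR_LEN;
    f->len += L2_HDR_LEN;
    memcpy(f->buf + f->data_off, hh->hdr, L2_HDR_LEN);
    return OUT_CACHED_HEADER;           /* frame is ready for the device queue */
}
```

With a valid cache entry, building the frame costs a single memory copy, which is why the cached path is the fast one.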

The procedure described above is optimized so that a packet without special options can pass the router quickly. Nevertheless, the description showed where processing branches to the corresponding handling routines (e.g., netfilter, multicasting, ICMP, fragmenting, or IP packet options).

Delivering Packets Locally

The previous section described the route a packet travels when it has to be forwarded. If ip_route_input() determines that the packet is addressed to the local computer, then processing branches to ip_local_deliver() rather than to ip_forward(). This section describes the path of packets to be delivered locally.

At this point, too, instead of using a conditional if instruction to distinguish the two options, a pointer (skb->dst->input()) is used, which points to ip_local_deliver() in this case. At the end of ip_rcv_finish(), the procedure continues with the packet's local delivery.

ip_local_deliver() net/ipv4/ip_input.c

The first (and only) task of ip_local_deliver(skb) is to reassemble fragmented packets, using ip_defrag(). Section 14.2.3 describes in detail how packets are fragmented and defragmented. For now, it is sufficient to understand that all fragments of a packet are collected over a certain period of time, until all fragments of an IP datagram have arrived, so that they can be passed upwards as a whole.

Subsequently, as at the other junctions, a netfilter hook (NF_IP_LOCAL_IN) is called before the procedure continues with the ip_local_deliver_finish() function.

ip_local_deliver_finish() net/ipv4/ip_input.c

The packet has now reached the end of the Internet Protocol processing. It is checked to see whether the packet is intended for a RAW-IP socket; otherwise, the transport protocol has to be determined for further processing (demultiplexing).

All transport protocols are managed in the ipprot hash table on the IP layer in Linux. Here, at the end of the IP processing, a dedicated data structure is used instead of simple sequences of queries and commands. The reason lies mainly in the nature of the Internet Protocol: unless a packet includes special options, IP processing is very simple, and so IP is efficient and easy to implement. Only the complexity of IP packet options normally necessitates more complex programming methods. The hash value in the ipprot hash table is calculated from the protocol ID of the IP header by a bitwise AND with (MAX_INET_PROTOS - 1), that is, the protocol ID modulo MAX_INET_PROTOS. The hash table is organized so that there are no collisions. If a new transport protocol ever has to be integrated, then the assignment in the hash table should be checked. If the corresponding transport protocol can be found, then the appropriate handling routine (handler) of the protocol is invoked. The following handling routines are most common:

• tcp_v4_rcv(): Transmission Control Protocol (TCP)

• udp_rcv(): User Datagram Protocol (UDP)

• icmp_rcv(): Internet Control Message Protocol (ICMP)

• igmp_rcv(): Internet Group Management Protocol (IGMP)

If no transport protocol can be found, then the packet either is passed to a RAW socket (if there is one) or it is dropped and an ICMP Destination Unreachable message is returned to the sender.
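The demultiplexing step can be sketched as a small table of handlers indexed by a hash of the protocol ID. This is an illustration, not the kernel's data structure: the table layout, register_proto(), and deliver() are hypothetical, the table size of 32 is an assumption, and the hash is computed by masking the protocol ID with (MAX_INET_PROTOS - 1), which requires the table size to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INET_PROTOS 32   /* assumed table size; must be a power of two */

typedef int (*proto_handler)(const uint8_t *payload, size_t len);

struct proto_entry {
    uint8_t protocol;        /* protocol ID from the IP header */
    proto_handler handler;   /* NULL if the slot is unused */
};

static struct proto_entry proto_table[MAX_INET_PROTOS];

static unsigned proto_hash(uint8_t protocol)
{
    return protocol & (MAX_INET_PROTOS - 1);
}

/* Register a handler; fails with -1 if the slot is already taken, which is
 * the collision that has to be checked when a new protocol is added. */
int register_proto(uint8_t protocol, proto_handler h)
{
    struct proto_entry *e = &proto_table[proto_hash(protocol)];

    if (e->handler != NULL)
        return -1;
    e->protocol = protocol;
    e->handler = h;
    return 0;
}

/* Returns the handler's result, or -1 if no handler is registered; the
 * caller would then try a RAW socket or report Destination Unreachable. */
int deliver(uint8_t protocol, const uint8_t *payload, size_t len)
{
    const struct proto_entry *e = &proto_table[proto_hash(protocol)];

    if (e->handler == NULL || e->protocol != protocol)
        return -1;
    return e->handler(payload, len);
}

/* Stand-in handlers that simply return their protocol number. */
int tcp_rcv_sketch(const uint8_t *p, size_t n)  { (void)p; (void)n; return 6; }
int udp_rcv_sketch(const uint8_t *p, size_t n)  { (void)p; (void)n; return 17; }
int icmp_rcv_sketch(const uint8_t *p, size_t n) { (void)p; (void)n; return 1; }
```

With this table size, the protocol IDs 1 (ICMP), 2 (IGMP), 6 (TCP), and 17 (UDP) all fall into different slots, whereas a protocol with ID 38 would collide with TCP, because 38 and 6 share the same five low-order bits.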

The chapters dealing with the TCP and UDP transport protocols describe how a packet is further handled in the transport layer. Chapter 17 describes IGMP packets, and ICMP packets are discussed in Section 14.4. The following section describes the path a packet takes as it passes from the transport layer to the Internet Protocol for transmission.

14.2.2 Transport-Layer Packets

Packets created locally and passed from the transport layer to the Internet Protocol are handled in a way totally separate from the procedures introduced so far. (See Figure 14-4.) First of all, there is not just one single function available to the transport layer, but several, including ip_queue_xmit() and ip_build_and_send_pkt(). Each of these functions is specialized and optimized for a specific use.

This section considers only the ip_queue_xmit() function, because this is the one normally used for data packets; ip_build_and_send_pkt() is used for SYN or ACK packets that do not transport payload.

Figure 14-5. Hash table used to multiplex transport protocols.

ip_queue_xmit() net/ipv4/ip_output.c

At the beginning, ip_queue_xmit(skb) checks whether the socket structure (sk->dst) includes a pointer to an entry in the routing cache and, if so, whether this pointer is actually valid. The route for a packet is stored in the skb->sk socket structure, because all packets of a socket go to the same destination. Storing a reference means that expensive searches for routes can be avoided. If no route is present yet (e.g., when the first packet of a socket is ready), then the ip_route_output() function is used to choose a route. Once this route has been entered in the routing cache, its use counter is incremented to ensure that the route is not inadvertently deleted as long as there is still a socket buffer referencing it.
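The effect of caching the route in the socket can be sketched as follows. The structures and functions are hypothetical simplifications: route_lookup_sketch() stands in for the expensive route search, a single static route replaces the routing cache, and the counter slow_lookups exists only to make the saved work visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing cache entry. */
struct route_sketch {
    uint32_t daddr;
    int refcnt;      /* use counter: entry must not be freed while > 0 */
    int obsolete;    /* set when the cached route is no longer valid */
};

/* Hypothetical socket: remembers the route of its last packet. */
struct sock_sketch {
    uint32_t daddr;
    struct route_sketch *cached;
};

static int slow_lookups;              /* counts expensive route searches */
static struct route_sketch the_route; /* stands in for the routing cache */

/* Stand-in for the expensive route search. */
static struct route_sketch *route_lookup_sketch(uint32_t daddr)
{
    slow_lookups++;
    the_route.daddr = daddr;
    the_route.obsolete = 0;
    return &the_route;
}

struct route_sketch *sock_get_route(struct sock_sketch *sk)
{
    if (sk->cached != NULL && !sk->cached->obsolete)
        return sk->cached;            /* fast path: reuse the stored route */

    if (sk->cached != NULL)
        sk->cached->refcnt--;         /* release the stale reference */

    sk->cached = route_lookup_sketch(sk->daddr);
    sk->cached->refcnt++;             /* keep the entry alive while in use */
    return sk->cached;
}
```

Only the first packet of a socket, or the first packet after a route has become invalid, pays for the search; all other packets reuse the stored pointer.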

Subsequently, the fields of the IP packet are filled (version, header length, TOS field, fragment offset, TTL, addresses, and protocol). Next, ip_options_build() handles options, if present, and the netfilter hook NF_IP_LOCAL_OUT is invoked.

ip_queue_xmit2() net/ipv4/ip_output.c

The next function, ip_queue_xmit2(skb), which follows the netfilter hook NF_IP_LOCAL_OUT, sets the output network device as specified in the routing cache entry. Now it is necessary to check once more how much headroom is available in the socket buffer, although the buffer reservation is already complete. Also, it is necessary to learn the network device used and its MTU size. Unfortunately, it can happen that a socket buffer was created for the device dev1 (with mtu1), but the route has changed in the meantime, and the packet is sent over device dev2 with a smaller MTU. This means that, infrequently, the available headroom has to be increased. Subsequently, the packet is checked for fragmentation, and the checksum is computed (ip_send_check(iph)).

Subsequently, the packet created locally crosses the path for forwarding packets. The function pointer dst->output(), which is set during the routing process, causes the ip_output() function to be invoked, which executes the last steps in the Internet Protocol, primarily guiding the packet across the netfilter hook NF_IP_POST_ROUTING.

14.2.3 Fragmenting Packets

The Internet Protocol has to be capable of adapting the size of IP packets to the respective network type in order to be able to send IP datagrams over any type of network. Each network has a maximum