Underlay and Network Equipment Problems

A.7.1 Faulty Hardware

When experimenting with SDN and OpenFlow on a network, the first thing questioned when something goes awry is the controller. This is a reasonable assumption, since controllers are implemented in software and are inherently more prone to error than traditional hardware switching devices. This is especially true during the development and debugging phase of an SDN project. However, as was learned through the development of SOS, traditional network equipment can fail as well. It is very important to keep this in mind when debugging challenging SDN problems.

This is best explained through a short narration. One day SOS was running well over an AL2S deployment; the next day, it was no longer working. The symptoms were:

• A trunk port disappeared on a physical switch’s VLAN-delineated OpenFlow instance. This was easily fixed.

• All frames of any size and layer 3/layer 4 type traverse the AL2S link in one direction.

• In the other direction:

  – Pings with a payload over 18 bytes are lost; those with less than 18 bytes succeed.

  – TCP packets with the don’t fragment (DF) flag set, even ACKs, are lost.

  – UDP and TCP packets without the DF flag set are truncated by 192 bytes, regardless of actual payload size.

• No errors or drops are reported on any switches, OpenFlow or non-OpenFlow.
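A size-dependent, one-directional loss like the one in the list above can be pinned down mechanically rather than by hand. The following is a minimal Python sketch, not part of SOS, that binary-searches for the largest payload that still gets through. It assumes loss is monotonic in size (once a size fails, every larger size fails). The function names, the 1472-byte upper bound, the stand-in `faulty_path` (which treats exactly 18 bytes as passing), and the use of Linux iputils `ping` flags (`-c`, `-W`, `-s`, `-M do`) are all assumptions made for illustration.

```python
import subprocess
from typing import Callable


def largest_passing_payload(probe: Callable[[int], bool],
                            lo: int = 0, hi: int = 1472) -> int:
    """Return the largest size in [lo, hi] for which probe(size) succeeds.

    Returns -1 if even the smallest size fails. Assumes that once a size
    fails, every larger size fails too.
    """
    if not probe(lo):
        return -1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid       # mid passes: the threshold is at or above mid
        else:
            hi = mid - 1   # mid fails: the threshold is below mid
    return lo


def ping_probe(host: str, df: bool = True) -> Callable[[int], bool]:
    """Build a probe that sends one ICMP echo with Linux iputils ping (assumed)."""
    def probe(size: int) -> bool:
        cmd = ["ping", "-c", "1", "-W", "1", "-s", str(size)]
        if df:
            cmd += ["-M", "do"]  # set the don't fragment bit
        cmd.append(host)
        return subprocess.run(cmd, capture_output=True).returncode == 0
    return probe


# Hypothetical stand-in for a path that loses every payload above 18 bytes.
def faulty_path(size: int) -> bool:
    return size <= 18


print(largest_passing_payload(faulty_path))  # prints 18
```

Running the search from each endpoint in turn, for example with `largest_passing_payload(ping_probe("remote-host"))` where `remote-host` is a placeholder, would show whether the threshold exists in only one direction, which points away from the controller and toward the underlay.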

Unfortunately, the timing was a week or two after rolling out a new code revision in the controller, so all energy was channeled into debugging the controller. The first symptom, as noted, was easily fixed; it was caused by the combination of a power failure and a configuration change that had not been saved permanently on the switch. The other problems, though, were not as easy to solve. In fact, after two weeks of discussion with IT at Clemson University and the other endpoint at the University of Utah, the root cause was at last determined to be a module failure on the main 100Gbps aggregation services router at Utah. Overnight maintenance of the router solved all problems.

Although this issue was ultimately unrelated to any SDN aspect of SOS, it is a valuable experience worth sharing, since precious time was devoted to trying to find a problem in the controller and no consideration was initially given to a non-OpenFlow component failure in the network. A similar, less time-consuming situation occurred at the Clemson side of the network a year prior, where one of the 10Gbps twinax cables connecting an SOS agent to the physical OpenFlow switch failed. The moral of these stories is to always explore each possibility and never underestimate or discount anything, especially in the physical network layer on which SDN depends for its operation.

A.7.2 Buggy Software

Similar to the faulty hardware case above, it cannot be assumed that the software chosen to implement an SDN is perfect. Consider the release cycle for product hardware and software, network switches in particular. When a new switch is released, it consists of the switching hardware along with a bundled OS called the firmware. Historically, each hardware platform has many firmware revisions that are released for it over its lifetime. Each subsequent firmware release might add new features driven by consumer demand. However, an often overlooked purpose of firmware revisions is to provide a mechanism for the company to address bugs discovered in the field, in real-world consumer deployments. Unfortunate as it may be, there was a bug that plagued SOS deployments, on and off, for years. In the end, it turned out to be caused by a firmware bug in our chosen Dell OpenFlow switches running their Force10 OS (FTOS). Since then, a ticket was opened and resolved with Dell to patch the issue for future firmware releases.

The symptoms of the problem were straightforward: ping (and other IP traffic) would sometimes not be forwarded across the network. The most troubling part of this statement is “sometimes”. A bug that is inconsistent or not easily reproducible is very difficult to solve in many cases.

A.7.2.1 Increase Available Information

To discover the root cause of the problem, control plane monitoring was performed at the Dell switches themselves and at the controller, along with data plane monitoring via tcpdump at the edge hosts. This allowed us to gather information about how the switch thought it was communicating with the controller, how the controller thought it was communicating with the switch, and what packets the hosts were generating and receiving. One would think that the controller and switch debug logs would indicate the same information, but depending on where in each component the logging is implemented, events might occur that are not captured by one log but are visible in the other.
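One way to make use of several vantage points is to merge their records into a single timeline and ask which vantage point never saw a given packet. The sketch below illustrates the idea in Python. It is not the tooling used for SOS: the `Event` record, the `packet_id` key, the vantage names, and the sample records are all hypothetical, and it assumes the device clocks are synchronized closely enough for timestamps to be comparable.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True, order=True)
class Event:
    ts: float        # seconds; assumes clock-synchronized devices
    vantage: str     # e.g. "switch", "controller", or "host"
    packet_id: str   # hypothetical key identifying one packet across logs
    what: str


def merge_timelines(*logs: Iterable[Event]) -> List[Event]:
    """Merge per-vantage logs (each already sorted by time) into one timeline."""
    return list(heapq.merge(*logs))


def blind_spots(timeline: Iterable[Event], packet_id: str,
                expected: Set[str]) -> Set[str]:
    """Return the vantage points that never logged the given packet."""
    seen = {e.vantage for e in timeline if e.packet_id == packet_id}
    return expected - seen


# Made-up sample records, not data from the SOS deployment.
host_log = [Event(1.000, "host", "arp-1", "ARP request transmitted")]
switch_log = [Event(1.002, "switch", "arp-1", "packet-in sent")]
controller_log = [
    Event(1.004, "controller", "arp-1", "packet-in received"),
    Event(1.005, "controller", "arp-1", "flood packet-out sent"),
]

timeline = merge_timelines(switch_log, controller_log, host_log)
for e in timeline:
    print(f"{e.ts:.3f} {e.vantage:<10} {e.what}")

missing = blind_spots(timeline, "arp-1",
                      {"host", "switch", "controller", "peer-switch"})
print(missing)
```

A vantage point reported by `blind_spots` is only a lead, not a verdict: as noted above, a log can be silent simply because of where in the component the logging hook sits.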

A.7.2.2 Narrow Down the Problem

In an SDN, flows can be installed proactively or reactively. In the case of the former, all flows are pre-installed in the data plane, reducing the number of packet-in messages the controller must process. Installing flows proactively is a systematic way to eliminate the data plane as the source of the problem and possibly indicate a control plane problem.

For Dell OpenFlow switches, ARP requests and replies cannot be processed in the data plane with flows, nor can the ARP opcode be matched by a flow. Thus, to effectively use them in a proactive setup, IP flows can be pushed to the switches, and a packet-in followed by a flooded packet-out can be performed for all ARP packets to achieve host-to-host IP connectivity. Alternatively, static ARP entries can be installed on the hosts, as discussed later in this section.
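The proactive arrangement just described can be sketched as follows. This is an illustrative Python model only: it does not use any real controller's API, the flow entries are plain dictionaries whose field names are modelled loosely on OpenFlow match fields, and the switch identifiers, port numbers, and IP addresses in the example are hypothetical.

```python
from typing import Dict, List, Tuple

ETH_TYPE_IPV4 = 0x0800
ETH_TYPE_ARP = 0x0806


def proactive_ip_flows(path: List[Tuple[str, int, int]],
                       src_ip: str, dst_ip: str) -> List[Dict]:
    """Build one IPv4 flow per hop for a single direction of traffic.

    Each hop in `path` is (switch_id, in_port, out_port).
    """
    return [
        {
            "switch": switch_id,
            "match": {"in_port": in_port, "eth_type": ETH_TYPE_IPV4,
                      "ipv4_src": src_ip, "ipv4_dst": dst_ip},
            "actions": [("output", out_port)],
        }
        for switch_id, in_port, out_port in path
    ]


def on_packet_in(eth_type: int) -> str:
    """Decide what the controller does with a packet-in in this setup."""
    if eth_type == ETH_TYPE_ARP:
        # ARP cannot be handled by flows here, so the controller floods it.
        return "packet-out: FLOOD"
    # IP traffic is expected to match a pre-installed flow instead.
    return "ignore"


# Hypothetical two-switch path between two hosts.
path = [("s1", 1, 2), ("s2", 3, 4)]
flows = proactive_ip_flows(path, "10.0.0.1", "10.0.0.2")
for flow in flows:
    print(flow)
print(on_packet_in(ETH_TYPE_ARP))
```

The return direction would need a second set of flows built from the reversed path with the addresses swapped. With the IP flows in place, any IP packet-in that still reaches the controller, or any IP flow whose counters never increase, is a useful signal in its own right.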

never matched an IP packet from the ping. This indicated that the problem might be related to ARP, as further evidenced by continuous ARP requests in the source host’s tcpdump. Analyzing the ARP packet-in and packet-out messages revealed that the first hop Dell switch sent an ARP request packet-in, and the controller received the packet-in, generated a flood packet-out, and relayed it to the switch, where it was also seen. The switch then flooded the packet-out, or so it claimed. The adjacent hosts on this switch saw the flooded ARP request packet in their tcpdump logs, indicating the flood was a success. Furthermore, adjacent Dell and other switches received the ARP request and generated packet-in messages of their own in a workflow similar to that described above.

However, there was one adjacent Dell switch en route to the intended destination host of the ARP request that did not appear to see the ARP request packet after the first hop switch flooded it. This indicated that (1) the ARP request never made it to the switch, meaning the link itself dropped the packet, (2) the receiving switch decided to not send the ARP request as a packet-in to the controller, or (3) the sending switch did not actually send the ARP request out the port of the link between the two switches.

Although it was a possibility, (1) was quickly ruled out as the source of the problem, since the LLDP packets used by the controller to discover links were consistently sent as packet-outs, propagated across the link, and received as packet-ins at the adjacent switch. To examine (2) and (3), further digging was required. For (2), the receiving switch logs did not indicate that the packet-in was received. However, there is no indication of where in the packet processing pipeline of the switch’s OpenFlow agent these messages are logged. So, we could not rule out the possibility that the receiving switch was dropping the ARP requests before it could generate a packet-in. For (3), switch logs indicated that the flooded ARP packet-out was sent out the port toward the adjacent switch. But, for the same reasons cited for (2), we could not assume this log was accurate. The most certain way to find out was to place a tap in the link to monitor the packets that traversed it. This revealed that (3) was actually the culprit: contrary to the sending switch logs, the ARP request packet-out was not actually being sent on the link.
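The elimination above can be written down as a small decision procedure. The Python sketch below is only a restatement of that reasoning; the hypothesis labels and function name are made up for illustration. It follows the text in treating reliable LLDP delivery as sufficient to clear the link, and in trusting only the tap, not the switch logs, for what was actually on the wire.

```python
from typing import Set

LINK_DROP = "(1) link dropped the ARP request"
RECEIVER_DROP = "(2) receiver did not generate a packet-in"
SENDER_SILENT = "(3) sender never put the ARP request on the link"


def remaining_hypotheses(lldp_crosses_link: bool, tap_saw_arp: bool) -> Set[str]:
    """Return the hypotheses still consistent with the two observations."""
    candidates = {LINK_DROP, RECEIVER_DROP, SENDER_SILENT}
    if lldp_crosses_link:
        # Packet-outs carrying LLDP cross reliably, so the link itself works.
        candidates.discard(LINK_DROP)
    if tap_saw_arp:
        # The frame was on the wire, so the sender did transmit it.
        candidates.discard(SENDER_SILENT)
    else:
        # The frame never reached the wire; nothing downstream can be blamed.
        candidates &= {SENDER_SILENT}
    return candidates


# Observations reported in the text: LLDP crossed the link, the tap saw no ARP.
print(remaining_hypotheses(lldp_crosses_link=True, tap_saw_arp=False))
```

Note that the switch logs do not appear as an input at all, which mirrors the point made above: they could not be trusted to settle either (2) or (3).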

This posed the questions: Why was LLDP (also from a packet-out on the source switch) being sent out the link, but ARP was not? Were there other types of packets that could be sent as packet-outs, such as IPv4? To answer these questions, static ARP entries were installed in the hosts involved in the ping, and the controller was changed back to reactive flow installation (to force IPv4 packet-in and packet-out messages). This test resulted in a ping that worked, and the IPv4 packets were sent from the packet-out messages.

However, recall that ARP request packet-outs were seen on some of the devices adjacent to the switch. The question then became, what was special about this port or link where the ARP request packet-out did not get sent but the IPv4 and LLDP did? To answer this, the switch’s OpenFlow instance config was consulted, and it revealed that this particular link was a port channel link, while the others that worked were 10GbE, 40GbE, and 100GbE. Given that there was no indication in the customer support documentation that port channel links were not compatible with OpenFlow, a support ticket was opened with Dell, and after half a year of additional debugging, they were able to reproduce the problem and formulate a solution in the form of a firmware update.

A.7.2.3 Lessons Learned

This bug was hard to track down for three reasons: (1) We consistently and incorrectly assumed it was a software bug in our controller or in our flows. (2) It did not occur on all Dell OpenFlow instances on a given switch, although it was eventually determined to impact every OpenFlow instance fairly consistently except OpenFlow instance #2 (where it still arose, but infrequently). This instance #2 data was ultimately valuable information that led Dell to be able to reproduce the issue and determine the bug fix. (3) Occasionally, every few minutes, the port channel link would send the ARP request packet-out, and the ping would then work. But after the ARP cache entries expired, the bug would surface again. This generated much churn in the network in the form of device relearning on the controller and flow removal and reinstallation, made it very difficult to find the source of the problem, and further led us astray toward the controller as the source of the problem.

In short, the lesson learned was that one cannot rely on third-party software to work, even paid solutions from a reputable company. Bugs and mistakes happen in the real world, and the investigation of the underlying SDN-enabling technology should always be a “checkbox item” to consider when debugging SDN and general networking problems. It’s worth repeating that Dell did come through with a solution to the problem, and we extend great thanks to their support engineers for quickly releasing a patch.

In document Data Movement Challenges and Solutions with Software Defined Networking (Page 187-191)