Distributed Control Plane and Load Balancing

The naive SDN concept proposed a single, centralized controller managing a number of switches. A single controller for the whole network, while novel, was highly impractical due to efficiency, scalability and availability issues [126]. It also represented a single point of failure, since all forwarding decisions depended on the controller [127]. With this in mind, several proposals were put forth which maintained the logically centralized nature of the controller but distributed the control plane for better resilience. While this approach increased the number of controllers, making it harder to exhaust the available resources, it also brought new challenges: controllers need to coordinate their actions to avoid issuing conflicting commands to switches, out-of-sync network views and race conditions within the network.

To avoid the scenario in which the single controller becomes overloaded or fails, crippling the network, Fonseca et al. [128] add a second controller which takes control of the network if the primary controller fails. The system replicates every update made to the primary controller onto the secondary, ensuring that both maintain the same view of the network state and policies, so that the transition is smooth when the primary fails. The Kandoo system [129] adds several more controllers but aims at increasing scalability without distributing routing. Kandoo employs a root controller, which continues to maintain the global view of the network, and deploys a set of local controllers near the switches. These handle network events close to the switches to take some of the load off the root controller. The local controllers do not have a global view of the network and as such cannot handle routing. They receive all the network updates sent by switches and pass on any relevant ones to the root controller. Instead of a master-slave architecture, Tootoonchian and Ganjali [130] move to a more resilient peer-to-peer architecture with HyperFlow. HyperFlow describes a control plane with several controllers distributed around the network. Each switch is connected to the controller nearest to it, and the network can hold as many controllers as necessary. Each controller can read from and write to all switches (including ones it is not directly connected to) by sending messages to the other controllers.
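The primary/secondary replication idea described above can be illustrated with a minimal sketch. It is not the implementation from [128]: the class names, the dictionary used as the network view, and the synchronous copy of each update are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """A controller holding its own copy of the network view (hypothetical)."""
    name: str
    alive: bool = True
    view: dict = field(default_factory=dict)  # e.g. link state or policies


class PrimaryBackupPlane:
    """Two controllers: every update to the primary is copied to the backup."""

    def __init__(self, primary: Controller, backup: Controller):
        self.primary = primary
        self.backup = backup

    def apply_update(self, key: str, value: str) -> None:
        # Write to the primary, then replicate the same update so that both
        # controllers hold an identical view of network state and policies.
        self.primary.view[key] = value
        self.backup.view[key] = value

    def active(self) -> Controller:
        # Fail over to the backup only when the primary is down.
        return self.primary if self.primary.alive else self.backup


plane = PrimaryBackupPlane(Controller("c1"), Controller("c2"))
plane.apply_update("link:s1-s2", "up")
plane.apply_update("policy:vlan10", "allow")
plane.primary.alive = False  # simulate failure of the primary controller
print(plane.active().name, plane.active().view)
```

Because the secondary already holds every update, it can take over without rebuilding the network view. A real system would also have to detect the failure and cope with updates lost in transit, which this sketch ignores.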

In addition to these solutions, several open-source controllers sought to address this problem, providing built-in mechanisms for distribution and load balancing. OpenDaylight [37], ONOS [131] and ONIX [132] all offer intrinsically distributed control planes. ONIX provides a distributed view of the network to each controller instance by partitioning the Network Information Base (NIB) and assigning responsibility for a portion of the NIB to each controller instance. ONOS and OpenDaylight similarly provide a distributed (but logically centralized) global network view to all their applications and controller instances, which is regularly updated by the instances in the cluster. ONOS also provides load balancing mechanisms that ensure the switches are fairly shared among the controllers in the cluster.
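As a rough illustration of what sharing switches fairly among the controller instances of a cluster can mean, the sketch below assigns switches round-robin so that no instance manages more than one switch beyond any other. This is an assumed, simplified policy and not the actual ONOS algorithm; the function name and the switch and controller identifiers are hypothetical.

```python
from collections import Counter


def assign_masters(switches: list[str], controllers: list[str]) -> dict[str, str]:
    """Spread switches over controllers so that per-controller counts differ
    by at most one (simple round-robin, an illustrative choice)."""
    if not controllers:
        raise ValueError("at least one controller instance is required")
    return {
        switch: controllers[i % len(controllers)]
        for i, switch in enumerate(sorted(switches))
    }


switches = [f"s{i}" for i in range(1, 8)]   # seven hypothetical switches
controllers = ["ctl-a", "ctl-b", "ctl-c"]   # three hypothetical instances
mapping = assign_masters(switches, controllers)
load = Counter(mapping.values())
print(dict(load))
```

Counting switches is the crudest measure of fairness. A policy that weighs each switch by the traffic it generates would be closer to balancing actual controller load.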

While these works focus on distribution of the control plane, none of them explicitly considers security. Some researchers have since taken the distribution a step further by taking into account the actions of a malicious adversary who specifically aims to overload the controller in order to cripple the network. [133] proposes a system that mitigates controller overload caused by attackers by employing a pool of controllers and switching to idle ones when one becomes overloaded. When a controller receives flow requests at a rate that exceeds a certain threshold, the defence system instructs the switch to select a new controller from the pool while attempting to filter out the attack packets. Similarly, [134] proposes a method for distributing controller load by allowing dynamic mapping between switches and controllers. Each switch has several controllers connected to it in a master-slave configuration. If the master controller is overloaded, it tells another controller to take over as master. Rather than relying on a middlebox defence system, [135] uses a master controller which is selected from among all the controllers present. The role of the master controller is to monitor the load on every other controller and switch. If a failure or traffic change is detected, the master controller re-organises the switch-controller mappings to better balance the load.
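The threshold-triggered switchover described for the controller pool can be sketched as follows. This is a simplified illustration and not the mechanism of [133] itself: the `PooledController` type, the request-rate field, the choice of the least-loaded pool member as the new controller, and the numeric threshold are all assumptions, and packet filtering is left out.

```python
from dataclasses import dataclass


@dataclass
class PooledController:
    name: str
    request_rate: float = 0.0  # flow requests per second currently observed


def select_controller(current: PooledController,
                      pool: list[PooledController],
                      threshold: float) -> PooledController:
    """Keep the current controller unless its request rate exceeds the
    threshold; otherwise move to the least-loaded controller in the pool."""
    if current.request_rate <= threshold:
        return current
    candidates = [c for c in pool if c is not current]
    if not candidates:
        return current  # nowhere to move to
    idlest = min(candidates, key=lambda c: c.request_rate)
    # Switch only if the target is itself below the threshold.
    return idlest if idlest.request_rate <= threshold else current


pool = [
    PooledController("c1", 1200.0),  # under attack (hypothetical rates)
    PooledController("c2", 50.0),
    PooledController("c3", 10.0),
]
chosen = select_controller(pool[0], pool, threshold=1000.0)
print(chosen.name)
```

In this sketch the decision is made outside the overloaded controller, which matters because a controller under attack may not have spare capacity to make it.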

The systems carry out their load balancing very differently, however. In [134], the controller under attack must tell another controller to take over. This handover takes six round-trip messages to complete, which may be impractical under an attack that congests the controller-switch communication channel. Additionally, unless CPU resources are reserved for the defence system in the controller, the attack may hit the controller so hard that it is unable to notify another controller about the problem before it goes down. By contrast, [133] has a dedicated monitor for the controller resources and sends only two messages in the event of an attack (one to start filtering and one to switch controllers), so the mitigation is more likely to succeed. It is nevertheless still vulnerable, though less so, to messages going undelivered because of congestion in the channel. Instead of involving the switch in the reassignments, remapping in [135] takes place entirely within the control plane, carried out by the master controller, and by that virtue it may be the best of the methods discussed here.
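To make the idea of remapping inside the control plane more concrete, the sketch below shows one way a master controller could recompute switch-controller mappings from measured per-switch load. The greedy heuristic (heaviest switch first, onto the currently least-loaded controller) is an assumption chosen for brevity and is not claimed to be the algorithm of [135]; all names and load figures are hypothetical.

```python
import heapq


def rebalance(switch_load: dict[str, float], controllers: list[str]) -> dict[str, str]:
    """Greedy remapping: place the heaviest switches first, each on the
    controller that currently carries the least total load."""
    if not controllers:
        raise ValueError("no controllers available")
    heap = [(0.0, name) for name in controllers]  # (total load, controller)
    heapq.heapify(heap)
    mapping = {}
    ordered = sorted(switch_load.items(), key=lambda kv: (-kv[1], kv[0]))
    for switch, load in ordered:
        total, name = heapq.heappop(heap)
        mapping[switch] = name
        heapq.heappush(heap, (total + load, name))
    return mapping


def controller_totals(switch_load: dict[str, float],
                      mapping: dict[str, str]) -> dict[str, float]:
    """Sum the load each controller carries under a given mapping."""
    totals: dict[str, float] = {}
    for switch, name in mapping.items():
        totals[name] = totals.get(name, 0.0) + switch_load[switch]
    return totals


# Hypothetical loads in flow requests per second; s1 is the hot switch.
switch_load = {"s1": 900.0, "s2": 100.0, "s3": 100.0, "s4": 100.0}
before = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2"}
after = rebalance(switch_load, ["c1", "c2"])
print(controller_totals(switch_load, before))
print(controller_totals(switch_load, after))
```

A failed controller can be handled in the same way by leaving it out of the `controllers` list before recomputing. The switches take no part in the decision, which is the property the comparison above singles out.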
