Cognitive Model-Based Autonomic Fault Management in SDN


Cognitive Model-Based Autonomic

Fault Management in SDN

Sungsu Kim (김성수)

Division of Electrical and Computer Engineering

(Computer Science and Engineering)

Pohang University of Science and Technology

2013


A dissertation submitted to the faculty of the Pohang University of Science and Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Division of Electrical and Computer Engineering (Computer Science and Engineering).

Pohang, Korea

June 27, 2013

Approved by


The undersigned have examined this dissertation and hereby certify that it is worthy of acceptance for a doctoral degree from POSTECH.

June 27, 2013

Committee Chair: James Won-Ki Hong
Member: Hwangjun Song
Member: Jae-Hyoung Yoo
Member: Chansu Yu


Division of Electrical and Computer Engineering (Computer Science and Engineering), 2013, 84P, Advisors: James Won-Ki Hong. Text in English.

ABSTRACT

In the past decades, network technologies have rapidly evolved with respect to their conception, coverage, capacities and complexities. Advanced technologies, such as software agents and policy-based network management, have been developed for more efficient network management. However, the development of network management technologies has not kept pace with the rapid evolution of the network technologies.

Autonomic network management, which helps the network itself detect, diagnose and repair failures by adaptive configuration changes based on the context, is a solution approach for advanced network management. Autonomic network management is, first and foremost, a method to manage complexity. Autonomic network management systems can lead to important business advantages, such as reduction of operational expenditure through task automation.

Unfortunately, autonomic network management is still an immature technology that poses many challenges to be solved before successful deployment. To be a truly autonomic management system, not just an automated system, the system should have enough capacity to adapt itself to changing situations. Autonomic network management systems have relied on control loops to change their behaviors based on observations of situations. Existing control loops provide only a single routine to handle situations, regardless of the nature of problems. Although most


how to deal with unexpected situations which are not defined by policies.

In this thesis, we propose a novel autonomic network management architecture, called CogMan, which is based on a cognitive model for efficient problem resolution and accurate decision making even in unexpected situations. The cognitive control loop of CogMan provides reactive, deliberative, and reflective loops for managing systems based on an analysis of the current status. To validate the proposed architecture and control loop, we apply them to fault management in Software Defined Networks (SDN), as well as legacy networks. We also propose a Fast Flow Setup (FFS) algorithm for fast failure recovery in SDN. Finally, this thesis presents the results of experiments that evaluate how the proposed architecture and algorithm manage various failure situations.

CONTENTS

I INTRODUCTION
1.1 Background
1.2 Problem Statements
1.3 Research Approach
1.4 Thesis Organization

II RELATED WORK
2.1 Network Management Approaches
2.2 Management Information Models for Knowledge Representation
2.3 Control Loops for Autonomic Systems
2.4 Fault Management
2.4.1 Failure Recovery in SDN
2.4.2 Alarm Correlation Techniques
2.5 Related Projects
2.5.1 Knowledge Plane and 4D
2.5.2 AutoI
2.5.3 4WARD
2.5.4 FAME
2.6 Chapter Summary

3.2 Knowledge Representation
3.3 The Cognitive Control Loop
3.4 Hierarchical Autonomic Management Architecture

IV Fault Management in SDN
4.1 Introduction
4.2 Detailed Fault Management Processes
4.3 Prototype Implementation
4.4 Evaluation

V Autonomic Fault Diagnosis based on Association Rule Mining
5.1 Introduction
5.2 Multiple Control Flows based on the Priorities of Alarms
5.3 Association Rule Mining
5.4 Determination of Alarm Priorities
5.5 Evaluation

VI CONCLUSIONS AND FUTURE WORK
6.1 Summary
6.2 Contributions
6.3 Future Work
6.4 Chapter Summary

REFERENCES

LIST OF FIGURES

II.1 Manager-agent model
II.2 Policy-based network management framework
II.3 P2P based network management architecture
II.4 IBM autonomic control loop
II.5 Boyd’s Observe-Orient-Decide-Act loop
II.6 FOCALE autonomic architecture
II.7 The CHIP architecture
II.8 OpenFlow group table
II.9 Path protection
III.1 Conceptual architecture of CogMan
III.2 Conceptual model of the cognitive control loop
III.3 The cognitive control loops
III.4 Block diagram of autonomic element
III.5 Example of autonomic element hierarchy
IV.1 Reactive, deliberative, and reflective cases for managing failures
IV.2 Example of a causality graph
IV.3 Example of restoration
IV.4 Fields used for FFS
IV.5 Example of FFS
IV.6 A CogMan prototype: fault management
IV.7 Recovery time of UDP flow when a single failure occurs
recovery (multiple failures)
IV.10 Number of required packet exchanges for flow setup
V.1 Multiple control flows based on the control loop
V.2 Example of alarm generalization
V.3 Example of a dependency model
V.4 Ontological model for alarms and priority
V.5 Experimental Topology
V.6 Synthetically Generated Alarms from Each Device
V.7 Clustered Set of Alarms

LIST OF TABLES

V.1 Alarm transaction data set
V.2 Alarms and detected association rules

LIST OF ALGORITHMS

1 Normalize (observed)
2 Compare (clusterSet)
3 FastFlowSetup(packet)


Chapter I

INTRODUCTION

This chapter provides a brief introduction to autonomic network management. We list the main problems in current autonomic network management and outline our proposed approaches.

1.1 Background

Network technologies have rapidly evolved with respect to their conception, coverage, capacities and complexities. The traditional data networks have been replaced by more elaborate ones with emerging new network technologies and applications with more complex requirements, such as Voice over IP (VoIP), real-time video streaming, and online game services. However, network management technologies have not caught up with the rapid evolution of the network technologies.

In the early days, network management systems depended on Command Line Interfaces (CLI), which enabled network administrators to query and manually reconfigure network elements. For efficient network management, advanced technologies, such as software agents [1] and policy-based network management [2], have been developed. Although the new technologies for network management provide partial automation, decision-making in network management still lacks understanding of the situation and adaptability because of its embedded static nature. In addition, traditional management approaches cannot handle the ever-increasing complexity of managing a large number of devices that use sophisticated and heterogeneous technologies.

Autonomic network management, which helps the network itself to detect, diagnose and repair failures by adaptive configuration changes based on the context, is a solution approach for the aforementioned problems [3]. Autonomic network management is, first and foremost, a way to manage complexity. If autonomic network management systems can perform manual, time-consuming tasks, then these actions will save operators and administrators valuable time, enabling them to perform higher-level cognitive network functions, such as network planning and optimization. They will also lead to important business advantages, such as reduction of operational expenditure through task automation.

Autonomic network management is inspired by the human nervous system, which manages different body processes without conscious effort [4]. Autonomic network management approaches have become more decentralized and less deterministic, as they apply the mechanisms of biological systems [5]. Unfortunately, autonomic network management is still immature, and many challenges must be resolved before successful deployment. These challenges include the representation of semantics and knowledge, bio-inspired behavior models, and control loops for management (i.e., configuration, healing, protection, and self-optimization) [6].

To be a truly autonomic network management system, not just an automated network management system, the system should be able to adapt to changing situations. Autonomic network management approaches broadly rely on control loops that let a network change its behavior by itself based on observations of its current state. Existing control loops provide only limited functionality for unexpected situations, as they resolve problems based on predefined policies. For successful deployment of autonomic network management systems, there has been a call for more sophisticated and advanced methodologies for improving situation-awareness and adaptability to changing environments.

1.2 Problem Statements

Autonomic network management systems broadly rely on feedback or control loops to adapt their behaviors based on observations of the current state. IBM proposed the Monitor-Analyze-Plan-Execute (MAPE) control loop [7] and Strassner proposed the FOCALE control loop [8]. These control loops provide functionalities such as self-configuration, self-healing, self-optimization, and self-protection. Existing control loops define building blocks and control logic to maintain a managed system in a desired state. They provide a single routine for control and heavily depend on policies specified by administrators to handle situations. Situations that have no matching policy, therefore, cannot be handled by the control loop logic, so the management system is unable to resolve unanticipated problems. To achieve a satisfactory level of adaptability, control loops should be able to handle situations for which the system has no predefined plans.

Autonomic network management is often incorrectly characterized as one or more of the following self-functions: self-configuration, -healing, -optimization, and -protection. These are the benefits of autonomics, not defining features. In order to implement any of these four functions, the system must first be able to know itself and its environment – what its capabilities are, and what is demanded of it. This is called self-knowledge, and is the basis for Situated Autonomic Communications [9]. Specifically, the ability to learn and incorporate knowledge at runtime is crucial to harnessing the power of emergent behavior for enabling components and systems to dynamically self-organize [10].

There have been many autonomic network management architectures and research projects, but only a few have validated their methods by resolving real network management problems better than existing methods. Therefore, in this thesis, we validate our proposed autonomic network management concept by applying it to fault management in Software-Defined Networks (SDN) [11], as well as traditional networks. Although we also apply the proposed autonomic management architecture to manage faults in traditional networks, we focus on the SDN case in this thesis. Since SDN separates the control and data planes, it provides more controllability to administrators, and an efficient decision-making process becomes more important. The Open Networking Foundation (ONF) [12] defines OpenFlow [13] as the first standard communications interface defined between the control and forwarding layers of an SDN architecture. Failure recovery in SDN differs in several aspects from that of traditional networks.


Restoration and protection are well-known failure recovery mechanisms in OpenFlow networks [14]. As the controller redirects affected flows one by one in restoration, it takes a long time for recovery if many flows are affected. Protection sets a backup path in advance, like Multiprotocol Label Switching (MPLS) path protection. This mechanism provides fast failure recovery that meets a failure recovery requirement of carrier grade networks, 50 ms [15]. However, there are cases that protection cannot cover, such as multiple failures that affect both working and backup paths. When there is no available backup path, restoration is the only possible method but its recovery time is relatively long.

As we have shown in the aforementioned failure management case, network problems cannot be resolved by a single solution. Based on analysis of the current situation, the management system should be able to provide the most appropriate solution. In this thesis, we concentrate on the following key questions.

• What are the limitations of the current network management approaches?

• What features of autonomic network management should be investigated to handle situations adaptively?

• How can we design a scalable management architecture to support a growing number of users, devices, and services?

• Are the existing failure recovery mechanisms in SDN sufficient for managing various failure scenarios?

• How can we manage multiple failures as well as a single failure in SDN?


1.3 Research Approach

In this section, we explain our solutions to the problems mentioned in Section 1.2 and the research methodologies. To make control loops more adaptable, we apply a human cognition model to our proposed control loop. We propose and design a new autonomic network management architecture called Cognitive network Management (CogMan), which is based on the proposed control loop. We also present how our proposed architecture and methods are applied to fault management in SDN. We propose a Fast Flow Setup (FFS) algorithm for fast failure recovery in SDN.

Pertaining to these questions, the following items are the goals of this thesis.

• This thesis will state the main problems with current network management that our approach solves.

• This thesis will propose a cognitive control loop which employs a model of human intelligence for adaptive response to changes in the managing network.

• This thesis will propose a hierarchical autonomic management architecture to manage a large number of managed resources with a number of managing nodes.

• This thesis will investigate existing failure recovery mechanisms in SDN.

• This thesis will propose a failure recovery method to recover from multiple failures in SDN.

• This thesis will validate the proposed architecture and method by implementing a prototype system and applying it to SDN fault management.


1.4 Thesis Organization

The organization of this thesis is as follows. Chapter II describes autonomic network management and failure recovery in SDN as related work. In Chapter III, we describe the cognitive control loop and the management architecture of CogMan. In Chapter IV, we present detailed processes for managing faults in SDN, including the proposed algorithm, and then show the experimental results for evaluation of the proposed methods. In Chapter V, we present fault diagnosis that applies the proposed architecture, along with simulation results for validating the proposed methods. Finally, Chapter VI concludes the thesis with a summary, contributions, and possible future work.


Chapter II

RELATED WORK

In this chapter, we discuss network management approaches from the traditional manager-agent model to peer-to-peer (P2P)-based network management models. We then present management information models for managing heterogeneous devices and compare existing control loops with the proposed cognitive control loop. We then describe fault management in SDN, as well as in traditional networks. Finally, network management programs and projects are presented.

2.1 Network Management Approaches

Traditionally, various network management techniques, such as Simple Network Management Protocol (SNMP) [16] and Telecommunication Management Network (TMN) [17], have been used to manage communication networks. However, SNMP does not provide any liaison between business requirements and technology. Also, the SNMP architecture is not able to reconfigure managed elements automatically. TMN is based on an object-oriented approach and its protocol stack is comprehensive, but this makes the protocol stack heavy and adds complexity. TMN agents also have no intelligence to make important decisions. Web-based technologies such as Web-Based Enterprise Management (WBEM) were developed [18]. Both traditional and current distributed technologies face interoperability issues due to different standards and uncommon models. WBEM is not only a solution to persisting interoperability issues, but it also enhances management capabilities by abstraction and decomposition of business processes and services. However, even if WBEM helps to solve interoperability problems, it does not have intelligent decision-making capability.

Figure II.1: Manager-agent model

Figure II.1 describes the manager-agent model, which defines the principles of operation for a management solution. Managed resources are modeled as Managed Objects (MOs). SNMP is a typical example of the manager-agent model. While SNMP is popular and has been used across most of the industry, it struggles to meet emerging requirements of network management, including interoperability among heterogeneous networks, QoS-guaranteed services, and management of an increasing number of network devices and users.

Figure II.2: Policy-based network management framework

The Internet Engineering Task Force (IETF) [19] provided the policy-based network management framework [20], as shown in Figure II.2. The framework is composed of three parts: a dedicated policy repository, Policy Decision Points (PDPs), and Policy Enforcement Points (PEPs). The policy repository stores policies, and PDPs distribute the policies to PEPs. The framework is distributed by design, which becomes important as the size of the managed network increases. The management functions are distributed to PDPs to avoid heavy management traffic.

Figure II.3: P2P based network management architecture

P2P networks are overlay networks composed of nodes located at the edges of the Internet. Some researchers have employed the P2P model for network management. P2P-based network management improves on traditional network management through, for example, load balancing of management tasks across management nodes, cooperative management, and increased reliability. Figure II.3 presents a P2P-based network management architecture proposed in [21]. A Top-level Manager (TLM) is a peer that is in charge of both reacting to requests from human operators and communicating with other management peers to accomplish a particular management task. A Mid-level Manager (MLM) is a peer that is responsible for reacting to requests from the TLM and other MLMs. As shown in Figure II.3, many management nodes with management functionalities are distributed. Therefore, management tasks are distributed to MLM or TLM nodes. Furthermore, the number of managed devices per MLM can be controlled to support scalability. However, P2P-based network management also incurs additional costs. In terms of network usage, the P2P-based approach consumes more bandwidth because it exchanges more information than traditional approaches. Also, there is additional overhead for managing management nodes, such as TLMs and MLMs. Our approach focuses on minimizing information exchange between management nodes and supporting scalability.

2.2 Management Information Models for Knowledge Representation

An autonomic principle has been applied to network management in order to overcome drawbacks of the traditional network management techniques. An autonomic principle is a way to reduce complexity [22]. If an autonomic system can perform manual, time-consuming tasks (such as device configuration changes in response to simple problems), then these actions will save operators and administrators valuable time, enabling them to perform higher-level cognitive network functions, such as network planning and optimization. Representation of knowledge is a fundamental part of an autonomic principle. Traditional network management architectures, such as SNMP, only describe low-level monitoring data. Since vendors organize their tables and write their Management Information Bases (MIBs) very differently, it is difficult to find the data being searched for. An information model is necessary to solve the interoperability problem of disparate data produced by different vendors. We use information and ontological modeling to capture knowledge relating to network capabilities, business goals and policies. This enables reasoning and learning for enhancing and evolving knowledge. Information models such as CIM [23], SID [24], and DEN-ng [25] are conceptual information models for describing computing and business entities in Internet, enterprise and service provider environments. We use DEN-ng to represent knowledge in CogMan for several reasons. DEN-ng has a well-developed context model, whereas the CIM and SID do not. Context is critical for selecting the policies that apply when adapting to changes in environmental conditions. Also, the CIM and the SID do not provide any mechanisms to orchestrate behavior, such as a finite state machine, which DEN-ng provides.

2.3 Control Loops for Autonomic Systems

IBM was a pioneer of autonomic management of IT resources and defined the MAPE control loop [7]. As shown in Figure II.4, sensors and effectors get data from and provide commands to both the entity being managed and other Autonomic Managers. The Autonomic Manager is an implementation that automates a management function. The knowledge source implements a repository that provides access to knowledge according to the interfaces of the Autonomic Manager. Boyd’s Observe-Orient-Decide-Act (OODA) loop [26] was originally conceived for military strategy, but has had multiple commercial applications as well. It describes the process of decision-making as a recurring cycle of four phases: observe, orient, decide, and act, as shown in Figure II.5. Unlike the MAPE loop, this is not a sequential loop. First, stopping observation while the analysis is continuing is not effective in terms of performance. Second, a balance must be maintained between delaying decisions and performing more accurate analysis that eliminates the need to revisit previously made decisions.

Figure II.4: IBM autonomic control loop

environment, and determines the effect of the changes on the currently active set of business policies. As shown in Figure II.6, the FOCALE control loops operate as follows. Sensor data is retrieved from the managed resource (e.g., a router) and fed to a model-based translation process, which translates vendor- and device-specific data into a normalized form in XML using the DEN-ng information model and ontologies as reference data. This is then analyzed to determine the current state of the managed entity. The current state is compared to the desired state from the appropriate Finite State Machines (FSMs). If no problems are detected, the system continues using the maintenance loop; otherwise, the reconfiguration loop is used so that the services and resources provided can adapt to these new needs.
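The choice between the maintenance loop and the reconfiguration loop described above can be sketched as follows. This is only an illustration under stated assumptions, not the FOCALE implementation: the `FIELD_MAP` table, the `normalize` and `select_loop` functions, and the vendor-specific keys are hypothetical, and a plain dictionary stands in for the DEN-ng-based translation and the FSM-derived desired state.

```python
from typing import Dict

# Hypothetical mapping from vendor-specific keys to one normalized field name.
# FOCALE performs this translation with the DEN-ng model and ontologies;
# a lookup table is used here only for illustration.
FIELD_MAP: Dict[str, str] = {
    "ifOperStatus": "link_state",
    "port-status": "link_state",
}


def normalize(raw: Dict[str, str]) -> Dict[str, str]:
    """Translate vendor-specific sensor data into a common, lower-case form."""
    return {FIELD_MAP.get(key, key): value.lower() for key, value in raw.items()}


def select_loop(current: Dict[str, str], desired: Dict[str, str]) -> str:
    """Compare the current state with the desired state and pick a loop."""
    mismatches = {key for key, value in desired.items() if current.get(key) != value}
    return "maintenance" if not mismatches else "reconfiguration"


# Desired state, assumed here to come from the appropriate finite state machine.
DESIRED_STATE = {"link_state": "up"}

print(select_loop(normalize({"ifOperStatus": "UP"}), DESIRED_STATE))
print(select_loop(normalize({"port-status": "Down"}), DESIRED_STATE))
```

Both readings arrive under different vendor keys but are compared against the same desired state after normalization, which is the point of the translation step.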

Figure II.5: Boyd’s Observe-Orient-Decide-Act loop

There are some problems with the above control loop architectures. First, the control loops are gated by the monitoring function. Hence, if too much data floods the system, the performance of the rest of the system suffers, even if the monitored data is not relevant. Second, all cases are processed using the same routine, so the priority or characteristics of a network problem are not considered. Third, the loops cannot reason about what actions to take if they encounter a situation that has not been anticipated.

Figure II.6: FOCALE autonomic architecture

Minsky [27] developed a model of human intelligence which is built using simple processes, called agents, which interact according to three layers, called reactive, deliberative, and reflective. A cognitive architecture [28], which is for building a system that attempts to encompass the full range and magic of human cognition, can be a good reference for making a network system more intelligent and effective. This is shown in Figure II.7. Reactive processes respond immediately upon receiving an appropriate external stimulus. Deliberative processes receive data from and can send “commands” to the reactive processes. Reflective processes supervise the interaction between the deliberative and reactive processes. By using a human cognition model, the system is able to handle situations that have never been anticipated and solve network problems adaptively.

2.4 Fault Management

In this section, we first describe existing failure recovery mechanisms in SDN. We then present existing alarm correlation techniques for fault diagnosis. Alarm correlation techniques can be used for SDN fault management, as well as traditional networks.

2.4.1 Failure Recovery in SDN

Figure II.7: The CHIP architecture

Fault management of OpenFlow networks differs in a number of aspects from that of traditional networks. In traditional networks, distributed networking devices, such as routers, reconstruct routing paths with changed topology information and update routing tables when a link fails. In OpenFlow networks, however, routing decisions are made by a centralized controller. The controller detects topology changes, decides on a route to send packets, and sets up flow tables of switches along the route.

There are two well-known failure recovery mechanisms in OpenFlow networks, restoration and protection [14]. Restoration matches well with the OpenFlow principle that a network is controlled by a centralized controller. If a failure occurs, the controller calculates alternative paths for every affected flow and sets up new flow entries in switches along the new path. Therefore, the time taken to recover from a failure with the restoration mechanism is proportional to the number of affected flows and the length of the new path. Considering that the failure recovery requirement in a carrier-grade network is 50 ms, the failure recovery time of the restoration mechanism is not acceptable for the provision of services on the network [14].
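The scaling behavior of restoration can be illustrated with a small sketch. It is not the controller logic evaluated in this thesis: the four-switch topology, the function names, and the use of breadth-first search are assumptions made for the example, and the count of installed flow entries is only a rough proxy for recovery time.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Link = FrozenSet[str]

# Hypothetical topology: two disjoint routes from switch A to switch E.
ADJ: Dict[str, List[str]] = {
    "A": ["B", "D"],
    "B": ["A", "E"],
    "D": ["A", "E"],
    "E": ["B", "D"],
}
FAILED: Set[Link] = {frozenset(("A", "B"))}


def shortest_path(adj: Dict[str, List[str]], src: str, dst: str,
                  failed: Set[Link]) -> Optional[List[str]]:
    """Breadth-first search for a shortest path that avoids failed links."""
    parents: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parents and frozenset((node, nxt)) not in failed:
                parents[nxt] = node
                queue.append(nxt)
    return None


def restoration_cost(adj: Dict[str, List[str]], affected: List[Tuple[str, str]],
                     failed: Set[Link]) -> int:
    """Count flow entries the controller installs to reroute all affected flows."""
    installs = 0
    for src, dst in affected:
        path = shortest_path(adj, src, dst, failed)
        if path is not None:
            installs += len(path)  # one flow entry per switch on the new path
    return installs


print(restoration_cost(ADJ, [("A", "E")] * 10, FAILED))  # 10 flows x 3 switches = 30
```

Doubling either the number of affected flows or the length of the alternative path doubles the number of entries to install, which is why restoration struggles to meet a 50 ms budget when many flows are affected.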

Figure II.8: OpenFlow group table

[Figure: a switch with a secure channel to the controller, flow tables 0 to n holding flow entries, and a group table whose group entries hold a group ID, a group type, and action buckets]

To solve this problem, OpenFlow 1.1 [29] introduces a fast failover group entry, which provides fast rerouting. Protection can be implemented using a group table, which is described in Figure II.8. The group table consists of group entries that contain action buckets. When a packet arrives at a switch, a matching flow entry is examined first. If there is a corresponding group entry, the packet is redirected to it. A group entry with the fast failover group type is used for path protection and has a set of action buckets. The first action bucket is used when the switch is operating under normal conditions, and the other action buckets can be used when the switch output port of the first action bucket is unavailable.
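The bucket selection rule of a fast failover group entry can be sketched as below. This is a simplified model rather than switch code: the `ActionBucket` class and `select_bucket` function are hypothetical names, each bucket is assumed to watch a single port, and the set of live ports is passed in directly.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ActionBucket:
    watch_port: int  # port whose liveness decides whether this bucket is usable
    out_port: int    # port the packet is forwarded to when the bucket is chosen


def select_bucket(buckets: List[ActionBucket],
                  live_ports: Set[int]) -> Optional[ActionBucket]:
    """Return the first bucket whose watched port is live, or None if none is."""
    for bucket in buckets:
        if bucket.watch_port in live_ports:
            return bucket
    return None


# Group entry with a working port (1) and a failover port (2).
GROUP = [ActionBucket(watch_port=1, out_port=1), ActionBucket(watch_port=2, out_port=2)]

print(select_bucket(GROUP, {1, 2}))  # normal operation: first bucket
print(select_bucket(GROUP, {2}))     # port 1 down: second bucket
print(select_bucket(GROUP, set()))   # no live port: None
```

Because the decision uses only local port state, the switch can redirect traffic without contacting the controller.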

Figure II.9: Path protection

[Figure: switches A, B, C, D, and E with hosts a and b; original path <ABE>, new path <ADE>. Flow table of switch A: src a, dst b, outport 1, failover port 2. Flow table of switch D: src a, dst b, outport 1, failover port none.]

Figure II.9 shows path protection using a group table. When a new packet with source node a and destination node b arrives at switch A, the controller sets the working path <ABE> by installing normal flow entries in switches A, B, and E. In addition to the original working path, the protected path <ADE> is installed in switches D and E. For switch A, a failover type group entry is added to redirect flows when port 1 is unavailable. As a result, switches B and D each have a normal flow entry, switch A has a normal flow entry and a group entry for protection, and switch E has two normal flow entries. Protection provides faster failure recovery than path restoration since switches can handle failures by themselves without the controller. However, if multiple failures occur and both working and backup paths are affected, the affected flows cannot be redirected to the backup paths. For example, in Figure II.9, packets cannot be delivered from host a to host b if link(A, B) and link(A, D) fail simultaneously.
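The single-failure and double-failure cases of this example can be traced with a short simulation. It is a sketch under simplifying assumptions: next-hop switch names stand in for output ports, the `TABLES` dictionary and the `forward` function are hypothetical, and hosts a and b are taken to be attached to switches A and E.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Link = FrozenSet[str]

# Next hops per switch as (primary, backup); None means no failover entry.
# Mirrors the example: working path <ABE>, protected path <ADE>.
TABLES: Dict[str, Tuple[str, Optional[str]]] = {
    "A": ("B", "D"),
    "B": ("E", None),
    "D": ("E", None),
}


def link(u: str, v: str) -> Link:
    """Undirected link identifier."""
    return frozenset((u, v))


def forward(failed: Set[Link], src: str = "A", dst: str = "E") -> Optional[List[str]]:
    """Follow the installed entries hop by hop; return the path or None if dropped."""
    path = [src]
    node = src
    while node != dst:
        primary, backup = TABLES[node]
        if link(node, primary) not in failed:
            node = primary
        elif backup is not None and link(node, backup) not in failed:
            node = backup
        else:
            return None  # neither the working nor the failover port is usable
        path.append(node)
    return path


print(forward(set()))                             # ['A', 'B', 'E']
print(forward({link("A", "B")}))                  # ['A', 'D', 'E']
print(forward({link("A", "B"), link("A", "D")}))  # None
```

The last call reproduces the case in which protection alone cannot recover the flow, because both the working and the backup port of switch A are down.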

We propose the FFS algorithm, which enables recovery from failures in situations where no backup path has been prepared. FFS provides fast flow redirection by reducing packet exchanges between a controller and switches. Our proposed control loop uses different loops based on the situation. In a single failure scenario, we use the reactive or deliberative loops with protection, while multiple failure scenarios are handled by the reflective loop, which employs the proposed FFS algorithm.

2.4.2 Alarm Correlation Techniques

There are four alarm correlation approaches: rule-based alarm correlation [30], codebook-based alarm correlation [31], case-based alarm correlation, and mining-based alarm correlation [32, 33]. However, rule-based, codebook-based, and case-based approaches are highly dependent on the expert knowledge of skilled operators. In particular, it is not easy for them to reflect dynamically changing network conditions, such as those in wireless or overlay environments, because rules or dependency models are made manually based on the assumption that the network is mostly stable. Mining-based alarm correlation is able to detect the cause-and-effect relationships between alarms. However, it is hard to detect relationships in a short period of time because of its long processing time. Our method uses both rule-based and mining-based approaches: efficiency comes from the rule-based approach, and dynamically changing relationships are detected by the mining-based approach.
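To make the mining-based approach concrete, the sketch below derives pairwise association rules from alarm transactions using the standard support and confidence measures. It is a generic illustration, not the method developed later in this thesis: the alarm names, the grouping of alarms into transactions, the thresholds, and the `mine_pair_rules` function are all assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def mine_pair_rules(transactions: List[Set[str]], min_support: float,
                    min_confidence: float) -> List[Rule]:
    """Find rules X -> Y between single alarms that meet both thresholds."""
    n = len(transactions)
    counts: Dict[FrozenSet[str], int] = {}
    # Count how often each alarm and each alarm pair occurs together.
    for alarms in transactions:
        for size in (1, 2):
            for itemset in combinations(sorted(alarms), size):
                key = frozenset(itemset)
                counts[key] = counts.get(key, 0) + 1
    rules: List[Rule] = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) != 2 or support < min_support:
            continue
        first, second = sorted(itemset)
        for antecedent, consequent in ((first, second), (second, first)):
            confidence = count / counts[frozenset([antecedent])]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


# Hypothetical alarm transactions, e.g. alarms observed in the same time window.
TRANSACTIONS = [
    {"LINK_DOWN", "BGP_DOWN"},
    {"LINK_DOWN", "BGP_DOWN", "HIGH_CPU"},
    {"LINK_DOWN"},
    {"HIGH_CPU"},
]

print(mine_pair_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.8))
# [('BGP_DOWN', 'LINK_DOWN', 0.5, 1.0)]
```

Every pass over the transactions recounts all pairs, which hints at why mining-based correlation is slow on large alarm volumes and benefits from being combined with precomputed rules.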


2.5 Related Projects

In this section, we briefly describe related projects and programs for management of networks.

2.5.1 Knowledge Plane and 4D

The motivation for the Knowledge Plane [34] was the call for a “new architecture” that would maintain the strengths of the current Internet architecture while addressing its weaknesses. This is evidenced by the vast number of extensions being made to the Internet, most of which are not architectural in nature, but rather point solutions to specific problems that use the Internet (e.g., NATs, VPNs, and firewalls) and/or apply only to one managed network within the Internet (e.g., QoS on one autonomous system). However, the Knowledge Plane did not specify how to work with heterogeneous technologies and devices; no advances in building a common model or in mapping from vendor-specific data to common normalized data were defined. In addition, by combining the management and control planes, the Knowledge Plane fails to consider heterogeneous networks (e.g., wired and wireless networks) that use different control planes and mechanisms.

The Data, Discovery, Dissemination, and Decision Plane (4D) [35] is a clean-slate design that changes the IP control plane in order to better control network-wide objectives. As the name implies, it consists of four separate planes. More importantly, this addresses one of the key shortcomings of the current Internet design, which is the inability to separate the different layers that perform different functions. In contrast, our approach (1) provides more sophisticated learning and reasoning algorithms, (2) emphasizes context-awareness, so that the resources and services provided at any given time are determined by the current context as well as applicable business goals and user needs, (3) explicitly defines and uses a mechanism to translate between different vendor-specific management data, and (4) maintains compatibility with the current Internet. Note, however, that some of the ideas of 4D, and their protocol implementation, can be accommodated by our approach.

2.5.2 AutoI

The goal of the Autonomic Internet (AutoI) project [36] is to develop a self-managing virtual resource overlay that can span heterogeneous networks and supports service mobility, security, QoS, and reliability under the concept of five planes: the Orchestration Plane, Service Enablers Plane, Knowledge Plane, Management Plane, and Virtualization Plane. However, it is doubtful that the Orchestration Plane accomplishes its goal as intended. The Orchestration Plane consists of Distributed Orchestration Components (DOCs), each of which controls a single orchestration domain and enables the AMSs of the domains to communicate and cooperate with each other. AutoI describes the interfaces between DOCs and AMSs, but does not specify how they control and manage network resources to meet the overall goals and policies of heterogeneous networks. For example, algorithms for inferring and resolving policy conflicts should be devised. In our CogMan architecture, we use the Policy Continuum [2] to develop mappings between policy rules that correspond to each constituency. This enables context-aware policies to be used to orchestrate behavior for business goal, system level, and other forms of interactions.


2.5.3 4WARD

The 4WARD project [37] has developed a clean-slate network management approach called In-network Management (INM), which is aimed at the management of large, dynamic networks, where a low rate of interaction between an outside management entity and the network will be required. The idea of INM is that management tasks are delegated from management stations outside the network to a self-organizing management plane inside the managed system. In other words, INM takes the view that the network element itself is the best entity for understanding how it should be managed; hence, management of the network must be made intrinsic to the operation of the network. This is enabled through decentralization, self-organization, embedding of functionality, and autonomy. Under this paradigm, the managed system executes local or global functions, and reports the results of its actions to an outside management system or triggers exceptions if intervention from outside is needed. However, its information model is not a true information model, but is rather a simple data-oriented mapping of objects. Hence, it does not solve the interoperability problem. We do not believe that all devices have the ability to manage themselves; while a router or switch has the physical space to contain such functionality and the logical computing means to support it, this approach is in general not applicable to the vast majority of end nodes that exist, especially those that are mobile-centric and have limited resources.


2.5.4 FAME

The FAME project [38] addresses the federation of communication networks, as well as autonomic network management. Current research into autonomic communications and autonomic network management emphasizes the automation of decision-making to reduce the operational costs of service providers. However, without addressing the highly dynamic and federated nature of modern service provisioning, this research runs the risk of simply replacing today’s network management silos with autonomic network silos. The FAME project structured its research into three strands: (1) Federated Communications Service Management, (2) Service Monitoring and Configuration, and (3) Network Infrastructure Coordinated Self-Management. The federation of heterogeneous networks requires harmonizing different management data from different networks, such as the resources being managed, the context used in communication services, and the policies used to express governance. The approach and the methodology used by FAME are similar to ours. FAME uses information models and ontologies extracted from DEN-ng [25] for the semantic mapping of management data, and FOCALE [8] is used for its architecture. FAME concentrates on federation, whereas we focus more on cognitive control and management of virtualized resources and cloud environments. We use a new FOCALE cognition model for collecting data, managing, and decision making, which enables our system to learn about its own behavior and respond effectively to context changes.


2.6 Chapter Summary

In this chapter, we briefly introduced network management approaches. We also presented information models for knowledge representation in autonomic network management. We then presented control loops and compared them with the proposed cognitive control loop. As we validated our proposed architecture by applying it to fault management, we briefly described fault management techniques in SDN and traditional networks. Finally, we reviewed related research projects and programs for network management.


Chapter III

Cognitive Network Management Architecture

In this chapter, we present an autonomic network management architecture based on a cognitive control loop. First, the overall architecture of CogMan is described. Second, we propose a cognitive control loop for elaborate processing of network problems. Finally, we describe a hierarchical management architecture to provide scalability.

3.1 Overall Architecture of CogMan

Figure III.1 depicts the conceptual architecture of CogMan, which incorporates a cognitive control loop enabled by a model and ontology [39]. The AM gathers device-specific data from managed resources. The collected data are fed to the Model-based translation process, which translates vendor-specific data into vendor-neutral data using the DEN-ng information model. By this process, CogMan is able to manage network devices regardless of their data models or languages.

Figure III.1: Conceptual architecture of CogMan
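The translation step can be illustrated with a minimal sketch. The vendor names, field names, and mapping tables below are hypothetical and are not taken from DEN-ng; the sketch only shows the idea of mapping differently shaped vendor-specific records onto one vendor-neutral form.

```python
# Hypothetical per-vendor field mappings (not part of DEN-ng).
FIELD_MAP = {
    "vendor_a": {"ifOperStatus": "port_status", "ifIndex": "port_id"},
    "vendor_b": {"linkState": "port_status", "portNo": "port_id"},
}

# Hypothetical per-vendor value mappings for the normalized "port_status" field.
STATUS_MAP = {
    "vendor_a": {1: "up", 2: "down"},
    "vendor_b": {"UP": "up", "DOWN": "down"},
}


def to_vendor_neutral(vendor, record):
    """Translate one vendor-specific record into the vendor-neutral form."""
    neutral = {}
    for field, value in record.items():
        name = FIELD_MAP[vendor].get(field)
        if name is None:
            continue  # field has no vendor-neutral counterpart in this sketch
        if name == "port_status":
            value = STATUS_MAP[vendor][value]
        neutral[name] = value
    return neutral


# Two differently shaped records describe the same event.
a = to_vendor_neutral("vendor_a", {"ifIndex": 2, "ifOperStatus": 2})
b = to_vendor_neutral("vendor_b", {"portNo": 2, "linkState": "DOWN"})
```

Both records normalize to the same dictionary, so later stages of the loop do not need to know which vendor produced the data.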

The AM infers the impact of the information. For example, a single link failure means that users cannot get the level of QoS specified in their Service Level Agreement (SLA). It is therefore classified as a well-known problem that should be processed immediately. The priority and characteristics of the problem affect the decision-making process within the cognitive control loop. When the cognitive control loop deals with a problem, the reactive loop is used if the problem can be processed quickly. For example, a path protection mechanism in MPLS enables switches to change their paths without the intervention of a human operator. The deliberative loop is used when the problem is anticipated but detailed actions should be organized. The reflective loop is used if the problem is not anticipated by the administrators or is difficult to solve by a standard method. It examines how the goals are affected by the problem. Decision-making in the control loops is influenced by policies, which are sent by a policy manager. Human operators give business goals, which are translated into policies in the policy manager via the user interface. Human operators can also create policies directly. A set of appropriate actions from the cognitive control loop, which is composed of vendor-neutral commands, is translated into vendor-specific commands.

The AM can manage a single network device or a number of network devices. The deployment policy determines how many network devices can be managed by a single AM. This means that a number of AMs may coexist in the managed network. To support scalability, an architecture that describes how to distribute AMs in the managed network is needed. We use a hierarchical management architecture for cooperation between AMs.

CogMan has three important parts: (1) knowledge representation and organization, (2) the cognitive control loop, and (3) a hierarchical architecture for AMs.

3.2 Knowledge Representation

Knowledge is critical to our approach for two different reasons. First, without a single definition of knowledge, it is impossible to solve the interoperability problem that prevents disparate data produced by different vendors from being shared and reused. This problem is exacerbated by the multiple conflicting standards that are present in the telecommunications world as well as by artifacts such as private MIBs. Second, network management has been limited in its power by having to deal with data, instead of having access to machine-readable semantics. For example, dropped data packets can be easily recovered, whereas dropped control packets could make a problem much worse. Hence, our approach relies on formal models to represent knowledge.

An information model is a representation of the characteristics and behavior of a component or system independent of vendor, platform, language, and repository. In contrast, a data model is an implementation of the characteristics and behavior of a component or system using a particular vendor, platform, language, and/or repository. We use a DEN-ng information model [25] and a DENON-ng ontology model in order to translate between the different data models (e.g., a directory and a relational database) used by different applications. Knowledge is defined once in DEN-ng information model, and then bound to one or more specific forms via one or more specific data models by ontological reasoning used to ascertain and discover inconsistencies and incompatibilities between data models.

While models are important to represent knowledge, they are not sufficient for the implementation of the proposed architecture. They do not contain the constructs necessary to support the definition of knowledge or reasoning about knowledge, which requires formal semantic definitions. We augment the knowledge contained in models with a set of ontologies. An ontology provides mechanisms to establish semantic relationships between information. In addition, we gain significant benefits by combining modeled and ontological data. Doing so enables us to reason from facts; more importantly, it forms the foundation for mapping between different schemata. It can also be used to strengthen the semantics being inferred. For example, if a hypothesis is formed, we can use additional semantic relationships to gather additional information to help prove or disprove the hypothesis. This in effect provides a self-check of the consistency and integrity of the information that has been collected. This is especially important when multiple data from different sources need to be integrated, since they can have different qualities (e.g., accuracy, certainty, and freshness) as well as different formats. Hence, semantic relatedness enables us to solve two important problems: (1) the harmonization of different qualities of data, and (2) the alignment of different ontologies to facilitate extraction of diverse data. This in turn increases the accuracy and reliability of the proofs that use the ontological data. Our current implementation uses a programmable threshold for the former, which enables us to weight the contribution of different data sources; for the latter, we use standard and custom semantic equivalence relationships.
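One way a programmable threshold could weight data sources is sketched below. The weighting rule (the product of accuracy, certainty, and freshness scores) and the function name are assumptions made for illustration only; the text does not specify the actual rule used in the implementation.

```python
def harmonize(readings, threshold=0.5):
    """Combine readings of one quantity from several sources.

    Each reading is (value, accuracy, certainty, freshness), with the three
    quality scores in [0, 1]. The weight of a reading is assumed to be the
    product of its scores; readings below the threshold are ignored.
    """
    weighted = []
    for value, accuracy, certainty, freshness in readings:
        weight = accuracy * certainty * freshness
        if weight >= threshold:
            weighted.append((value, weight))
    if not weighted:
        return None  # no source is trustworthy enough
    total = sum(w for _, w in weighted)
    return sum(v * w for v, w in weighted) / total


readings = [
    (100.0, 1.0, 1.0, 1.0),  # weight 1.0
    (200.0, 1.0, 1.0, 0.5),  # weight 0.5
    (900.0, 0.5, 0.5, 1.0),  # weight 0.25, below the default threshold
]
result = harmonize(readings)
```

Raising the threshold makes the system rely on fewer, higher-quality sources; lowering it lets more sources contribute with proportionally smaller weights.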

3.3 The Cognitive Control Loop

CogMan is built in accordance with the FOCALE [8] control loop and is self-governing; that is, the system senses changes in itself and its environment, and determines the effect of the changes on the currently active set of business policies [39, 40]. To strengthen self-awareness, the cognitive control loop employs a model of human intelligence that is built using simple processes, which interact according to three layers: reactive, deliberative, and reflective [27][28].


Figure III.2: Conceptual model of the cognitive control loop (functions: Observe, Normalize, Compare, Plan, Decide, Act, Learn, Reason; layers: Reactive, Deliberative, Reflective; acting on the Managed Element(s))

The system is able to recognize when an event or a set of events has been encountered before. On the basis of the history and features of events, the reactive process makes a decision and executes it without any planning. This reactive mechanism enables much of the computationally intensive portions of the control loop to be bypassed, producing “shortcuts” that result in a decision being made immediately without time-consuming higher-level process involvement. The deliberative loop is the same as the original FOCALE control loop process, which takes the Observe-Normalize-Compare-Plan-Decide-Act path. It uses long-term memory to store the methods that were used to satisfy goals on a context-specific basis. The reflective loop examines the various conclusions made by the set of deliberative loops being used, and tries to predict the best set of actions that will maximize the goals being addressed by the system. This process uses semantic analysis to understand why a particular context was entered and why a context change occurred, to help predict how to change contexts more easily and efficiently in the future. These results are also stored in long-term memory, so that the system better understands contextual changes to its reasoning in order to aid debugging.

Figure III.3: The cognitive control loops: (a) inner control loop; (b) outer control loop. Both panels show the Observe, Normalize, Compare, Decide, Act, Learn, and Reason functions with Context and Policy inputs across the Reactive, Deliberative, and Reflective layers.

The cognitive control loop makes a fundamental change to the FOCALE control loops, as shown in Figure III.3. Instead of using maintenance and reconfiguration control loops, the cognitive control loop uses a set of outer control loops that affect the set of inner control loops. The outer control loops are used for “large-scale” adjustment of functionality by reacting to context changes. The inner control loops are used for more granular adjustment of functionality within a particular context. In addition, both the outer and inner control loops use reactive, deliberative, and reflective reasoning, as appropriate.

Two important changes have been made to the inner loops. First, the Decide function has been explicitly unbundled from the Act function. This enables additional machine-based learning and reasoning processes to participate in the determination of which actions should be taken. Second, the new cognition model has changed the old FOCALE control loops, which were both reflective in nature, to the three control loops shown in Figure III.2, which are reactive, deliberative, and reflective. Figure III.3 shows the new outer loops of the cognitive control loop. The reactive loop is used to react to context changes that can be handled by network elements themselves; the deliberative loop is used when a context change is known to have occurred, but its details are not sufficiently well understood to take an action; and the reflective loop is used to better understand how context changes are affecting the goals of the AM.
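The choice among the three loops can be sketched as a simple dispatch. The function name and its two inputs (a store of previously handled events and a set of anticipated event types) are hypothetical simplifications of the state and context that CogMan actually consults.

```python
from enum import Enum


class Loop(Enum):
    REACTIVE = "reactive"
    DELIBERATIVE = "deliberative"
    REFLECTIVE = "reflective"


def select_loop(event, known_responses, anticipated):
    """Pick a loop for an event.

    known_responses: dict mapping previously seen events to a stored action.
    anticipated:     set of event types the administrators planned for.
    """
    if event in known_responses:
        return Loop.REACTIVE      # seen before: reuse the stored action
    if event in anticipated:
        return Loop.DELIBERATIVE  # expected, but actions must be planned
    return Loop.REFLECTIVE        # unexpected: reason about affected goals


known = {"single_link_down": "switch_to_backup_path"}
planned = {"long_lived_flow_on_backup"}
```

For example, `select_loop("single_link_down", known, planned)` takes the reactive shortcut, while an event found in neither collection falls through to the reflective loop.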

3.4 Hierarchical Autonomic Management Architecture

As the number of network devices increases, centralized management is viable for neither the current Internet nor the Future Internet. We propose a hierarchical autonomic management architecture to support scalable network management. We specify the functionalities of a management element and how they can be organized hierarchically [41]. An Autonomic Element (AE) is an abstraction that allows the cognitive control loop to provide distributed functions such as communication, learning, reasoning, and management. AEs are grouped together in cooperating communities, or clusters. An AE is either a managed entity itself or a parent AE, which manages other child AEs or managed entities. In the future, network devices may have more intelligence for managing themselves, so routers or switches can be AEs. However, existing network devices have no functionality for managing themselves, so an AE is an agent that manages network devices. Figure III.4 shows a block diagram of an AE. It is composed of an Autonomic Manager (AM), a Context Manager (CM), a Policy Manager (PM), and a Knowledge Base (KB). An AE can manage a child AE or Managed Resource (MR) as a Management Entity (ME). When an AE manages an MR, a Model Based Translation Layer (MBTL) translates vendor-specific data and commands to vendor-neutral data and commands (or vice versa) because MRs are devices from various vendors.

Figure III.4: Block diagram of autonomic element

Translated vendor-neutral data is sent to a CM. A CM collects data from an ME and stores it in a context repository. A CM detects changes in the network, in user needs, or even in the business; these context changes in turn activate an associated set of policies that define the functionality the autonomic manager should govern. Other AEs access context information via a context interface (CI). A PM stores available policies in a policy repository, and other AEs configure policies via a policy interface. An AM decides appropriate actions for the current context using a state manager, action manager, learner, and reasoner based on the FOCALE cognition control loops. The decided action is applied to the ME. CogMan selects one of the reactive, deliberative, and reflective control loops based on the current state and collected context. A system built in accordance with FOCALE is self-governing, in that the system senses changes in itself and its environment, and determines the effect of the changes on the currently active set of business policies.

Figure III.5: Example of autonomic element hierarchy

CogMan uses a hierarchical autonomic network management architecture. With complex requirements for network management and the increasing number of devices, network management has become more complex. Therefore, it is hard to manage a network with a single network management server. Multiple nodes will be distributed in the network for managing current networks as well as future networks, and more nodes will be needed for management as the size of a network domain grows. CogMan adopts a hierarchy for two reasons. First, a hierarchy distributes the processing load of management nodes across hierarchical levels. Second, a hierarchy enables efficient exchange of management information. Consider a pure peer-to-peer network management architecture. Management nodes exchange management information between every pair of cooperating nodes, which can create overhead in terms of network bandwidth. A hierarchy of managing nodes reduces the overhead of exchanging management information because managing nodes send management information only to the designated node [41].
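The difference in exchange overhead can be made concrete with a small count. The sketch assumes the worst case for the peer-to-peer architecture (every pair of nodes cooperates) and a tree-shaped hierarchy in which each node reports to exactly one designated parent; both function names are illustrative.

```python
def pairwise_exchanges(n):
    """Number of node pairs that exchange information among n peer nodes."""
    return n * (n - 1) // 2


def hierarchical_exchanges(n):
    """Number of child-to-parent relationships in a tree of n nodes."""
    return max(n - 1, 0)


for n in (10, 100, 1000):
    print(n, pairwise_exchanges(n), hierarchical_exchanges(n))
```

The pairwise count grows quadratically with the number of management nodes, whereas the hierarchical count grows linearly.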


Figure III.5 shows the structure of a hierarchical AE and gives an example of how the hierarchy can be mapped to the physical infrastructure. The Autonomic Network Domain Manager (ANDM), Autonomic Network Manager (ANM), and Autonomic Element Manager (AEM) are all AEs. We designed this hierarchy based on the DEN-ng information model, which describes a network management hierarchy. A parent AE governs the management decisions of its child AEs, such as the coverage of monitoring and the performance metrics to be measured. However, child AEs also have their own autonomic decision-making capabilities. The cognition control loop serves as the main decision engine of an AE, and monitoring data is sent by child AEs or managed entities. A number of AEs should be distributed in the network. If there is only one AE in the network, the architecture is the same as a traditional client-server model and suffers from problems such as a single point of failure, because all the management information must be delivered to a single AE.

3.5 Chapter Summary

In this chapter, we presented the proposed architecture. We also described the cognitive control loop, which provides three different loops for handling different situations. Finally, we described the hierarchical architecture of CogMan for scalable network management.


Chapter IV

Fault Management in SDN

In this chapter, we present how CogMan is applied to manage faults in SDN. Since OpenFlow [13] is a representative protocol that implements the SDN architecture, we implemented our proposed methods in OpenFlow networks to validate them. First, we describe detailed processes for managing faults in SDN. We then present our experimental setup and emulation environments to validate the proposed methods. Finally, we show that our methods recover from failures faster and reduce overhead of the management system.

4.1 Introduction

In order to validate our control loop, we apply it to managing failures in SDN. Management and control complexity is high in traditional networks due to the tightly coupled control and data planes. In traditional networks, the actual recovery of running flows after a link failure, such as the changing of routes, is done by distributed routing protocols on networking devices. It is difficult for network operators to become involved in real-time control. On the other hand, new networking architectures, such as SDN and OpenFlow [13], separate the data and control planes for packet networks, and an OpenFlow controller implements the control and management plane functions of traditional switches or routers. Consequently, interactions between control and management functions are more significant, and our control loop plays an important role in both control and management.

Failure recovery in OpenFlow networks differs in several respects from that in traditional networks. Restoration and protection are well-known failure recovery mechanisms in OpenFlow networks [14]. Because the controller redirects flows one by one in restoration, recovery takes a long time. Protection sets up a backup path in advance, as in Multiprotocol Label Switching (MPLS) path protection. This mechanism provides fast failure recovery that meets the 50 ms failure recovery requirement of carrier-grade networks [15]. However, there are scenarios that protection cannot cover, such as multiple failures that affect both the working and backup paths. When no backup path is available, restoration is the only possible method, but its recovery time is relatively long. We propose a Fast Flow Setup (FFS) algorithm to handle failures that protection cannot cover. By reducing packet exchanges between switches and a controller, FFS redirects flows faster than restoration. To validate the cognitive control loop and the FFS algorithm, we present details of the processes used by the cognitive control loops to manage failures in OpenFlow networks. We then evaluate the performance of our control loop by conducting failure recovery experiments in an OpenFlow testbed. In a multiple failure scenario, the recovery time of the proposed method proved to be approximately 45% shorter than that of existing methods.

Reactive case
• A failure successfully recovered by path protection
• Affected flows are short-duration flows
• Controller involvement is unnecessary

Deliberative case
• A failure successfully recovered by path protection
• Affected flows are long-duration flows
• The controller should change paths to an optimized path based on the characteristics of the flows
• E.g., high-bandwidth real-time video traffic

Reflective case
• Multiple failures, in which redirected paths are also affected by a failure
• The protection mechanism is not able to recover from the failures
• The controller should find another available path and set up the affected flows on the new path

Figure IV.1: Reactive, deliberative, and reflective cases for managing failures

4.2 Detailed Fault Management Processes

As outlined in Section 2.4, protection is efficient for recovery from a single failure, but it cannot recover from failures if backup paths are not available. This means that a single solution cannot solve all failure situations in OpenFlow networks. Different failure situations require different ways of managing failures. As the MAPE and FOCALE control loops provide only a single loop for control, they have no inherent functionality to provide different solutions for different situations. The cognitive control loop analyzes failures and handles them using an appropriate technique.

When failures occur in a managed OpenFlow network, the cognitive control loop first analyzes failures to decide on an appropriate loop for the failures. As shown in Figure IV.1, the cognitive control loop classifies failure cases as reactive, deliberative, or reflective cases in order to manage failures using the most appropriate loop and method. In our approach, a basic mechanism for managing failures is protection. Thus, during flow setup, a controller sets working and backup paths for every flow. Protection guarantees recovery from a single failure in a required time interval, as every flow has one backup path.

A single failure case is classified as a reactive or a deliberative case. Both cases have in common the fact that the failure can be recovered by protection. However, as discussed in [43], short-duration and long-lived flows should be treated differently. A short-duration flow disappears quickly, so redirection to an optimal path is not necessary. A long-lived flow, such as a high-bandwidth real-time flow, needs an optimal path to meet its requirements. As available bandwidth changes dynamically over time, the deliberative loop finds the optimal path and redirects the affected long-lived flows from the temporary backup path. If no backup path is available for failures, the reflective loop handles the failure cases. So far, restoration has been the only possible method if protection does not work. However, restoration takes a long time [14], which makes it unsuitable for service provision. We propose the FFS algorithm for fast failure recovery. The reflective loop exploits FFS to process multiple failure situations that cannot be handled by the reactive and deliberative loops. In this section, we present in detail the processes used by the cognitive control loop to manage failures in OpenFlow networks.

(1) The Observe process

When the management system, which exploits the cognitive control loop, receives alarm messages from a managed network element, the Observe process extracts information about the alarm messages. An instance of the alarm is represented as a tuple of attribute values. An alarm is defined formally as follows:

A := {an, rt, rid, ta}

where an is the alarm name, rt is the resource type, rid is the ID of the resource, and ta is the time at which the alarm occurred. The values of the attributes are set in this process and the alarm instance is sent to the Normalize process.
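The alarm tuple maps naturally onto a small record type. In the sketch below the attribute names follow the definition above, while the layout of the raw alarm message (a dictionary with `name`, `resource_type`, `resource_id`, and `timestamp` keys) is a hypothetical input format chosen for illustration.

```python
from typing import NamedTuple


class Alarm(NamedTuple):
    an: str    # alarm name
    rt: str    # resource type
    rid: str   # ID of the resource
    ta: float  # time at which the alarm occurred


def observe(message):
    """Extract an alarm instance from a raw alarm message (hypothetical layout)."""
    return Alarm(
        an=message["name"],
        rt=message["resource_type"],
        rid=message["resource_id"],
        ta=float(message["timestamp"]),
    )


alarm = observe({"name": "PortDown", "resource_type": "port",
                 "resource_id": "C:1", "timestamp": "12.5"})
```

An immutable tuple keeps alarm instances hashable, which is convenient when they are later grouped into clusters.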

Algorithm 1 Normalize(observed)

Input: observed // set of alarm instances
Output: clusterSet // set of alarm clusters
R = {} // alarms related to a
C = {} // existing alarm clusters related to a
for a ∈ observed do
    R = relatedAlarms(a)
    C = getClusters(timegap, ta)
    if ∃ c ∈ C containing r ∈ R then
        add a to the cluster c
        add c to clusterSet
    else
        create a new cluster nc and add a to nc
        add nc to clusterSet
    end if
end for
return clusterSet

(2) The Normalize process

In the Normalize process, alarm instances are correlated temporally and spatially. As shown in Algorithm 1, alarm instances received from the Observe process are correlated based on correlation rules. R and C are sets associated with a. Correlation rules for a and existing alarm clusters that have time gaps less than timegap are extracted from the knowledge base. If a certain cluster c contains alarms r, which are related to a, a is added to cluster c and c is added to clusterSet. If there is no cluster associated with the alarm instance, a new alarm cluster nc is created and the alarm instance a is added to nc. All of the alarms from the Observe process are processed and clusterSet is returned. For example, assume that alarms {1, 2, 3, 4} are collected from the network topology in Figure IV.2. Each alarm is examined and clustered with related alarms. Assume that {2} is processed first and there is no cluster related to {2}. Then, a new cluster c1 is created to which {2} is added. {1}, {3}, and {4} are added to c1 one by one due to the relationships between them, as shown in the causality graph. We assume that {2, 4} are symptoms and {1, 3} are the root cause problem according to the causality graph; consequently, the root cause flags of {1, 3} are set to “true” and the root cause flags of {2, 4} are set to “secondary”. For alarm clustering, any correlation algorithm can be used. For example, we used codebook-based alarm correlation for clustering [31]. The alarm cluster c1 is then sent to the Compare component.
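The clustering procedure of Algorithm 1 can be sketched in runnable form as follows. This is a minimal sketch under stated assumptions: alarms are reduced to (name, ta) pairs, the correlation rules are a plain dictionary standing in for the knowledge base, and a cluster's time gap is taken as the distance from the new alarm to the closest alarm already in the cluster. The relationships among alarms 1 to 4 used in the example are hypothetical.

```python
def normalize(observed, rules, timegap, clusters=None):
    """Cluster alarms following the structure of Algorithm 1.

    observed: list of (name, ta) alarm instances.
    rules:    dict mapping an alarm name to the set of related alarm names.
    timegap:  maximum time distance between an alarm and a cluster.
    clusters: existing clusters, each a list of (name, ta) tuples.
    """
    clusters = [] if clusters is None else clusters
    cluster_set = []
    for alarm in observed:
        name, ta = alarm
        related = rules.get(name, set())                       # R
        candidates = [c for c in clusters                      # C
                      if c and min(abs(ta - t) for _, t in c) < timegap]
        target = next((c for c in candidates
                       if any(n in related for n, _ in c)), None)
        if target is None:
            target = []                                        # new cluster nc
            clusters.append(target)
        target.append(alarm)
        if not any(target is c for c in cluster_set):
            cluster_set.append(target)
    return cluster_set


# Hypothetical rules: every alarm is related to the three others.
names = {"1", "2", "3", "4"}
rules = {n: names - {n} for n in names}
observed = [("2", 0.0), ("1", 0.1), ("3", 0.2), ("4", 0.3)]
result = normalize(observed, rules, timegap=1.0)
```

With alarm 2 processed first, a new cluster is created for it and alarms 1, 3, and 4 join that cluster in turn, mirroring the example in the text. Root cause flagging is left out of the sketch.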

(3) The Compare process

As shown in Algorithm 2, the Compare process receives alarm clusters. This use case focuses on alarms caused by link down or port down. Since the alarms are clustered in the Normalize process, every cluster already has information that identifies the root cause of the alarm. Based on the alarm information, the current situation is classified as either a reactive, deliberative, or reflective case, as shown in Figure IV.1. If a single failure occurs or backup paths are available for the affected flows, protection handles this situation. The controller can detect whether protection works by examining the paths of the affected flows. If the affected flows are short-duration flows, we classify the case as a reactive case. However, if the affected flows are long-lived flows, it is considered to be a deliberative case. The deliberative loop finds the best path for the flows and redirects the flows to that path. To find the path, the Plan & Decide process is called. If both working and backup paths are unavailable, the Compare process classifies this case as a reflective case. Since the failures cannot be recovered by switches, the reasoning process implemented in the controller handles this case.

Figure IV.2: Example of a causality graph (example topology with switches A to E; working path A-B-C, backup path A-D-E-C; alarms shown: Port Down C: port #1, Port Down B: port #2, Ping test failed AC, Ping test failed AE, VoIP packet loss)
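The classification rule described above can be sketched as a small function. The inputs are simplified stand-ins for what the controller derives from the alarm clusters and flow tables, and classifying each affected flow separately is an assumption of this sketch.

```python
def classify(num_failures, protection_failed, flows_long_lived):
    """Classify a failure situation following Figure IV.1.

    num_failures:      number of root-cause failures in the alarm clusters.
    protection_failed: True if working and backup paths are both unavailable.
    flows_long_lived:  dict mapping each affected flow ID to True (long-lived)
                       or False (short-duration).
    Returns a dict mapping each flow ID to its case.
    """
    if num_failures == 1 or not protection_failed:
        # Protection has already redirected the flows; only long-lived
        # flows need further planning by the controller.
        return {f: "deliberative" if long_lived else "reactive"
                for f, long_lived in flows_long_lived.items()}
    # No usable backup path: the controller must reason about a new one.
    return {f: "reflective" for f in flows_long_lived}


flows = {"voip-1": True, "dns-7": False}
single = classify(1, False, flows)
multiple = classify(2, True, flows)
```

In the single-failure case the short-duration flow needs no controller involvement, whereas in the multiple-failure case every affected flow is handed to the reflective loop.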

Algorithm 2 Compare(clusterSet)

Input: clusterSet // set of alarm clusters
Output: none
if numOfFailure(clusterSet) == 1 || isPathProtectionFailed() == false then
    flowEntries = getAffectedFlowEntries(clusterSet)
    for f ∈ flowEntries do
        if isLongLivedFlow(f) == true then
            PlanAndDecide(f, c) // deliberative case
        end if
    end for
else
    Reasoning(clusterSet) // reflective case
end if

(4) The Plan & Decide process

A single failure is successfully recovered from by switches, without any intervention from the controller. Despite successful redirection of flows, long-lived flows need additional actions. The Plan & Decide process finds the best path that meets the requirements of the affected long-lived flow and sets up flow entries to switches in the path.
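The text does not state how the best path is computed. As one possible reading, the sketch below assumes that "best" means the fewest-hop path whose links all have at least the required available bandwidth, found with a breadth-first search. The topology reuses the working and backup paths of Figure IV.2; the bandwidth figures are hypothetical.

```python
from collections import deque


def find_path(links, src, dst, required_bw):
    """Return a fewest-hop path from src to dst using only links whose
    available bandwidth is at least required_bw, or None if there is none.

    links: dict mapping (u, v) to available bandwidth; links are bidirectional.
    """
    neighbours = {}
    for (u, v), bw in links.items():
        if bw >= required_bw:
            neighbours.setdefault(u, []).append(v)
            neighbours.setdefault(v, []).append(u)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


links = {("A", "B"): 100, ("B", "C"): 10,
         ("A", "D"): 100, ("D", "E"): 100, ("E", "C"): 100}
```

With these values, a flow needing 50 units cannot use link B-C and is routed along A-D-E-C, while a flow needing only 5 units takes the shorter A-B-C path.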

(5) The Act process

In the Act process, the controller receives high-level actions from the Plan & Decide or Reasoning processes, translates them into OpenFlow-specific commands, and sends these to the target switches. For example, flow entry installation, modification, and other OpenFlow commands for configuration actions are sent to the target switches.
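The translation performed by the Act process can be sketched as follows. The command dictionaries are a hypothetical, simplified representation of flow-entry messages; they do not correspond to the API of any particular OpenFlow controller library, and the match field and port numbers are illustrative.

```python
def to_flow_mods(match, path, out_ports):
    """Translate a high-level 'route this flow along path' action into one
    flow-entry command per switch.

    match:     dict describing the flow (hypothetical match fields).
    path:      list of switch IDs, in forwarding order.
    out_ports: dict mapping each switch ID to its output port on the path.
    """
    return [
        {"switch": sw, "command": "add", "match": dict(match),
         "actions": [{"type": "output", "port": out_ports[sw]}]}
        for sw in path
    ]


mods = to_flow_mods({"ip_dst": "10.0.0.2"}, ["A", "D", "E", "C"],
                    {"A": 3, "D": 2, "E": 2, "C": 1})
```

Each element of `mods` would then be encoded as a protocol message and sent to its target switch.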

(6) The FFS process

The FFS process is for failures that cannot be recovered by a predefined plan. Although protection enables recovery from a single failure within the required time interval, recovery is not possible if multiple failures occur and affect both the working and protected paths. It is of course possible to recover from multiple failures if the controller sets more than one protected


[Figure: switches A to E and the controller, with the working path, the backup path, and a third path marked. The controller performs the following steps.]

1. The controller obtains the path <ACDE> for the affected flow

2. The controller sends a flow entry to switch A (output action: port 3)

3. The controller sends a flow entry to switch C (output action: port 2)

4. The controller sends a flow entry to switch D (output action: port 1)

5. The controller sends a flow entry to switch C (output action: port 2)

Figure IV.3: Example of restoration

path. However, the overhead for setting up the flow is high and an error may occur when setting backup paths. Currently, restoration can be used if protection is not available, but the failure recovery time of restoration is very long and is therefore not acceptable for the provision of network services [14]. Therefore, we need a better method to recover quickly from failures in conditions that have not been prepared for.

We propose the FFS algorithm to handle failures quickly when there is no available backup path. The main idea underlying FFS is to reduce the amount of communication between the controller and switches by implanting path information in an entry switch when the controller sets up a flow. As shown in Figure IV.3, if working and backup paths have failed, restoration can be used to recover from the failures. In restoration, the controller sets up a new path for every affected flow one by one. The number of packet exchanges needed to redirect a single flow is the same as the length of a new path. N extra packet exchanges are required to set up an N-switch unidirectional path in


restoration. Therefore, if the number of affected flows is M, M·N packet exchanges are required.

[Figure: the flow entry format in a flow table (Match Fields, Counters, Instructions); the output action, struct ofp_action_output { uint16_t type; uint16_t len; uint32_t port; uint16_t max_len; uint8_t pad[6]; };, carried in the OpenFlow message from the controller to the switch, with pad[0] as the flag and pad[1] to pad[5] as output port numbers for switches C, D, and E; and the IP options field (flag, hop[0] to hop[5]), which switch C consumes and updates for switches D and E.]

Figure IV.4: Fields used for FFS
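The message counts for restoration discussed above can be checked with a small calculation. The sketch below is illustrative only: the function names and the sample values of M and N are assumptions, and the one-message-per-flow figure reflects the stated aim of FFS to implant the whole path in the entry switch with a single message.

```python
def restoration_exchanges(num_flows: int, path_len: int) -> int:
    """Restoration: one flow entry message per switch on the new path, per flow (M * N)."""
    return num_flows * path_len


def entry_switch_only_exchanges(num_flows: int) -> int:
    """One message per flow, sent to the entry switch only (the aim of FFS)."""
    return num_flows


if __name__ == "__main__":
    m, n = 100, 4  # hypothetical: 100 affected flows, 4-switch replacement path
    print(restoration_exchanges(m, n))     # M * N messages
    print(entry_switch_only_exchanges(m))  # M messages
```

With these sample values the controller would send four times as many messages under restoration, and the gap widens linearly with the path length N.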

Excessive packet exchanges not only burden the controller but also add latency. Since each network has a single controller, the controller sends flow entry setup messages in succession. If the time for sending a flow entry is T seconds, a single flow redirection takes at least N·T seconds. The FFS algorithm requires only one packet exchange for flow setup, so it reduces the overhead of the controller and increases the speed of flow redirection.

To implant path information in a switch, we use a field that is unused in a normal OpenFlow message. We will now briefly describe how a controller puts path information in a switch. As shown in Figure IV.4, a flow entry consists of match fields, counters, and instructions. Match fields are used to match flows by source address, destination address, and so on. The counters field indicates the number of forwarded packets that belong to the flow. The instructions field specifies


the actions to be executed when a packet matches the entry. These instructions result in changes to the packet, action set, or pipeline processing. The output action is a type of instruction that makes a switch forward matching packets to a designated output port. We use the pad field of the output action, which exists only to align the output action to 64 bits and is therefore otherwise unused. pad[0] is a flag that indicates whether this output action contains a path for FFS. A value of “1” means that the output action has a path for FFS. Path information is encoded in pad[1]-pad[5]. Due to restrictions on header size, the switch ID is not included in the encoded path; only the output port on each switch is specified.

Since we use the pad field of the output action, which consists of five elements in addition to pad[0], the maximum length of the path that can be represented in our algorithm is six hops. We believe that six hops are enough to support a path change in a small domain network. We can also increase the available length if necessary. For example, if we put two output port numbers into one pad element, FFS can support a 12-hop maximum path. However, in this case, the maximum port number we would be able to represent would be reduced to 2^4. As shown in Figure IV.4, the output port number is set from pad[5] to pad[1] in reverse order. The value “2” of pad[5] means that the matching flow should be forwarded to port 2 on switch C.
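The pad-field encoding described above can be sketched with Python's standard struct module. This is a hedged illustration rather than the dissertation's code: the helper names are hypothetical, the value 0 is assumed to represent the NULL slots of Figure IV.4, and the port field is assumed to carry the entry switch's own output port. The field layout follows the ofp_action_output structure shown in Figure IV.4.

```python
import struct
from typing import List

OFPAT_OUTPUT = 0        # output action type in the OpenFlow specification
ACTION_OUTPUT_LEN = 16  # type + len + port + max_len + pad[6], in bytes
EMPTY = 0               # assumption: 0 stands in for the NULL slots in Figure IV.4
MAX_ENCODED_HOPS = 5    # pad[1] to pad[5]


def encode_ffs_pad(downstream_ports: List[int]) -> List[int]:
    """Place output ports of downstream switches into pad[5]..pad[1], first hop first."""
    if len(downstream_ports) > MAX_ENCODED_HOPS:
        raise ValueError("at most five output ports fit in pad[1]-pad[5]")
    if any(not 1 <= port <= 255 for port in downstream_ports):
        raise ValueError("each port must fit in one non-zero byte")
    pad = [1] + [EMPTY] * MAX_ENCODED_HOPS  # pad[0] = 1 marks an FFS path
    for hop, port in enumerate(downstream_ports):
        pad[5 - hop] = port                 # reverse order: first hop in pad[5]
    return pad


def decode_ffs_pad(pad: List[int]) -> List[int]:
    """Recover the ports in hop order; return an empty list if the flag is not set."""
    if pad[0] != 1:
        return []
    ports: List[int] = []
    for index in range(5, 0, -1):
        if pad[index] == EMPTY:
            break
        ports.append(pad[index])
    return ports


def pack_output_action(port: int, max_len: int, pad: List[int]) -> bytes:
    """Serialize the output action in network byte order (16 bytes in total)."""
    return struct.pack("!HHIH6B", OFPAT_OUTPUT, ACTION_OUTPUT_LEN, port, max_len, *pad)
```

Encoding the ports for switches C, D, and E from Figure IV.4 as encode_ffs_pad([2, 1, 1]) places 2 in pad[5], matching the statement that pad[5] holds the output port on switch C.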

Algorithm ?? explains in detail the process used when each switch receives a new packet. When a packet that has no matching flow entry arrives, the switch examines its IP option field first. If the flag is set to “1”, the switch obtains an