A Graph-based Proactive Fault Identification
Approach in Computer Networks
Yijiao Yu, Qin Liu and Liansheng Tan
*Department of Computer Science, Central China Normal University, Wuhan 430079 PR China
E-mail: {yjyu, liuqin, l.tan}@mail.ccnu.edu.cn
Abstract—In large-scale computer networks, isolating the primary failure source is a challenging task. This paper presents a proactive network fault diagnosis approach based on graph theory. Unlike other approaches, the manager of the network management system actively checks the status of the managed devices rather than passively receiving messages from them. The salient feature of this approach is that the possible failure sources, including the real one, can be computed precisely and quickly without any historical alarm information or strict assumptions. The approach keeps processing complexity low by making full use of matrix and Boolean operations. To test and evaluate the proposed algorithm, it is implemented in Java and tested in a real large network environment. The experimental results show that the approach is both efficient and scalable for fault identification in large-scale computer networks.
Keywords— connectivity; connectivity structure; graph theory; network management; fault identification; Simple Network Management Protocol; Internet Control Message Protocol.
I. INTRODUCTION
With the emergence of various network applications, users demand better quality of service (QoS). One of the most crucial issues is maintaining network availability and reliability. Unfortunately, current networks cannot always satisfy these requirements when exceptional events occur, for example when power is shut down in a network center or a fiber is broken by mechanical failure. Because of the complexity of the network, it is not easy for network operators alone to deal with network faults in a short time. Furthermore, the Internet is growing so quickly that the difficulty of network fault management increases sharply. To shorten the fault response time and relieve network operators, automated fault management is a desirable goal of Network Management System (NMS) implementation for large-scale networks.
The process of network fault management is divided into three stages: alarm correlation, fault identification and repair. In the first step, the manager should detect and know the status of all the managed objects. When exceptions happen in the network, the manager should report and fix them as soon as possible. There are two key issues in fault identification: where is the first failure source, and how many kinds of possible faults can lead to the observed fault results? The goal of an automated fault identification algorithm is to solve these issues without human intervention.
* Corresponding author. E-mail address: [email protected]. Post address: Department of Computer Science, Central China Normal University,
In recent research, many fault identification algorithms have been proposed, such as coding-based schemes [1]-[2], proactive network [3], Petri nets [4]-[5], alarm correlation [6]-[7], expert systems and artificial neural networks [8]-[9], active networks and mobile agents [10]-[12] and network dependencies [13]. They perform well in some special network environments, but their generality could be improved. Expert systems and artificial neural networks need a great deal of historical alarm records, and the reasoning process is time-consuming. According to experience from artificial intelligence, it is hard to guarantee the reasoning results. Active networks and mobile agents are future network technologies, but security is the biggest obstacle to their application in current networks. In addition, most routers and switches do not support mobile code at present. The alarm correlation method and other methods are based on the premise that the network manager is able to get trap messages from faulty agents. In practice, if some links are broken or devices are shut down, traps cannot be sent or transmitted to the manager. Therefore, those approaches will not work well when the basic premise is not satisfied. Moreover, some methods are sensitive to the network topology, yet computer network environments change frequently.
It is necessary to relax the strict limitations on the assumptions, improve the accuracy of fault identification and speed up the decision process. We proposed a proactive fault identification approach based on graph theory in [14]. In this paper, we focus on the implementation of the approach. Our major concerns are summarized in the following four questions:
1. How does the manager acquire the fault information as soon as possible?
2. Where are the possible failure sources located, and how can they be identified?
3. How many faults can cause the fault indications, and which one is the most likely?
4. How can the fault indications be derived from a given status of the fault elements, and how can we decide whether they are consistent with the observed ones?
The rest of this paper is organized as follows. Section II presents the network modeling. Section III describes the network fault identification approach in detail. Section IV studies four classical network fault cases. The automated fault indication analysis algorithm is illustrated in section V. The approach has been implemented in Java and tested in a real large network environment; its performance is shown in section VI. Finally, some conclusions and possible evolutions of the approach are drawn.
II. NETWORK MODELING
A computer network is a large and complicated system that includes both hardware and software, so the NMS is always under development to satisfy the requirements of users. In recent years, self-management and organization of large-scale networks have become more and more necessary [16]-[18]. From the network service providers' point of view, the maintenance of hardware is a vital task. A common NMS focuses on switching elements, channels and some key servers, such as WWW and FTP servers. Both switching devices and servers have at least one IP address or domain name, and they support many network management protocols. The manager running in the NMS can communicate with these managed objects in different manners, such as sending an Internet Control Message Protocol (ICMP) request to a specified destination, or a Simple Network Management Protocol (SNMP) request.
Each managed device has at least one unique IP address, and the manager cares about whether it is active. In our view, network fault management can be treated as a connectivity problem. Therefore, these devices correspond to vertices in a graph. Amplifiers and repeaters are placed on channels to keep the signal from attenuating. Those network elements are not considered because they do not support network management protocols and cannot be discovered by the manager. Many kinds of physical channels exist in networks; optical fiber, cable and wireless links are commonly employed. Since each of them is a medium joining a pair of managed devices, they are represented as edges in a graph. Because almost all networks are full duplex in the sense of connectivity, it is natural to treat edges as bidirectional and the network graph as an undirected graph.
SNMP and ICMP are two elementary and popular network management protocols. With the Ping command, the manager can test the connectivity and active status of every vertex in the network. However, the “ping” command provides no information about the path from the manager to the node. Even the “traceroute” command reports only one path between the given pair of vertices. We suppose that a software fault occurs when the destination vertex responds to the Ping command but behaves abnormally for the Get primitive in SNMP, and that a fatal fault occurs when the Ping command ends with a timeout.
The manager has two ways to detect the status of a vertex. The first is a proactive manner, in which it automatically sends special packets to the managed vertex and waits for the acknowledgements. The second is simply to wait for traps from the agents in the managed devices. However, traps will not be forwarded to the manager when serious faults happen. Furthermore, an agent sends a trap only when one of seven special events occurs, and most trap messages describe only isolated information about a node. Hence, the manager should adopt a proactive manner to check the status of managed vertices.
When a link is broken, the switching devices connected to it still exist, but the pair of vertices can no longer communicate with each other via it. When a switch is abnormal, the channels incident with it become unavailable, but the device still exists in the network. Because most switching devices have multiple interfaces, a switch fault can be subdivided into a global fault and a local fault. A global fault means that all interfaces are abnormal, and a local fault means that only some of them fail. In this paper, a global fault maps to removing a vertex and a local fault maps to removing an edge. The removal of a vertex and the removal of an edge are the elementary operations on the matrices that represent the computer network. The definitions of the removal operations are as follows, and the detailed matrix operations will be presented in section III.
Def. 1. The collection of all managed vertices in graph is managed vertices set (MVS).
Def. 2. The removal of a vertex is to erase all edges incident with the vertex, but the vertex still exists.
Def. 3. The removal of an edge is simply to erase the edge.
There are a great many indicators of network faults, such as no acknowledgement from the destination vertex, a high packet loss ratio and a large transmission delay. From the viewpoint of graph theory, the obvious and direct consequence of serious network faults is that the whole network changes from one connected component into multiple connected components. Connectivity and the reachability matrix are basic notions and tools in graph research, and they are imported here into network fault management. According to connectivity analysis in graph theory, the vertex on which the manager of the NMS runs is in one and only one connected component of the graph. That is to say, vertices in other connected components are not accessible to the manager.
As the above analysis shows, the connectivity structure of the managed network is necessary information for our graph-based approach. In fact, the dynamic and real-time discovery of the network topology is itself a complicated topic in network research, and in this paper we do not focus on how to obtain it in detail. A basic and important assumption of our approach is that the fault identification algorithm can get the physical connectivity structure from other models of the network management system (NMS), such as the configuration model. In a commercial network management environment, the network operator knows clearly how many devices and links have been deployed and where they are. When the NMS starts, the static connectivity information is imported into the NMS through the configuration model. Furthermore, most configuration models have an automatic topology discovery function, which can provide the dynamic connectivity information.
III. THE FAULT IDENTIFICATION APPROACH
The graph-based approach mainly works on the connectivity analysis between pairs of vertices, where one vertex of each pair is always the node running the NMS. Before introducing our approach, some necessary definitions are given.
Def. 4. A vertex is reachable only when the manager communicates with it successfully within the defined number of attempts; otherwise it is unreachable.
Def. 5. The set of all reachable vertices in the network is named the reachable vertices set (RVS). Correspondingly, the set of unreachable vertices is the unreachable vertices set (UVS).
Def. 6. Suppose that $G = \langle V, E \rangle$ is an undirected and simple graph where $|V| = n$ and $v_1, v_2, \ldots, v_n \in V$. The adjacency matrix $A$ of $G$, with respect to this listing of the vertices, is the $n \times n$ zero-one matrix with 1 as its $(i, j)$th entry when $v_i$ and $v_j$ are adjacent, and 0 as its $(i, j)$th entry when they are not adjacent. In other words, if its adjacency matrix is $A = [a_{ij}]$, then
$$
a_{ij} = \begin{cases} 1 & \text{if } \{v_i, v_j\} \text{ is an edge of } G \\ 0 & \text{otherwise} \end{cases}
$$
Def. 7. Suppose that $G = \langle V, E \rangle$ is an undirected and simple graph where $|V| = n$, $|E| = m$, $v_1, v_2, \ldots, v_n \in V$ and $e_1, e_2, \ldots, e_m \in E$. The incidence matrix $M$ of $G$, with respect to this listing of the vertices, is the $n \times m$ zero-one matrix with 1 as its $(i, j)$th entry when $v_i$ and $e_j$ are incident, and 0 as its $(i, j)$th entry when they are not incident. In other words, if its incidence matrix is $M = [m_{ij}]$, then
$$
m_{ij} = \begin{cases} 1 & \text{when edge } e_j \text{ is incident with } v_i \\ 0 & \text{otherwise} \end{cases}
$$
Figure 1. The topology of a network.
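Figure 1 is an undirected graph of a network. The adjacency matrix A for this graph, with rows and columns ordered $v_0, v_1, \ldots, v_9$, is
$$
A = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}
$$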
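The incidence matrix I for this graph, with rows ordered $v_0, v_1, \ldots, v_9$ and columns ordered $e_0, e_1, \ldots, e_{13}$, is
$$
I = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}
$$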
The adjacency matrix and incidence matrix will be used frequently in section IV.
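The two matrices carry the same connectivity information, so one can be derived from the other. The following is a minimal sketch, not the authors' implementation, that rebuilds the adjacency matrix of Figure 1 from its incidence matrix; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

/** Sketch: derive the adjacency matrix of Figure 1 from its incidence matrix. */
final class IncidenceToAdjacency {
    // Incidence matrix of Figure 1: rows v0..v9, columns e0..e13.
    static final int[][] INCIDENCE = {
        {1,0,0,0,0,0,0,0,0,0,0,0,0,1}, // v0
        {1,1,1,0,0,0,0,0,0,0,0,1,0,0}, // v1
        {0,0,1,0,1,1,0,0,0,0,1,0,0,0}, // v2
        {0,1,0,1,1,0,0,0,0,0,0,0,0,0}, // v3
        {0,0,0,1,0,0,1,1,0,0,0,0,0,0}, // v4
        {0,0,0,0,0,0,0,1,0,0,0,0,0,0}, // v5
        {0,0,0,0,0,1,1,0,1,0,0,0,0,0}, // v6
        {0,0,0,0,0,0,0,0,1,1,0,0,0,0}, // v7
        {0,0,0,0,0,0,0,0,0,1,1,1,1,0}, // v8
        {0,0,0,0,0,0,0,0,0,0,0,0,1,1}  // v9
    };

    /** Each column holds exactly two ones: the end vertices of that edge. */
    static int[][] adjacencyFromIncidence(int[][] incidence) {
        int n = incidence.length;
        int[][] adjacency = new int[n][n];
        for (int e = 0; e < incidence[0].length; e++) {
            int first = -1;
            for (int v = 0; v < n; v++) {
                if (incidence[v][e] != 1) continue;
                if (first < 0) {
                    first = v;                 // first end vertex of edge e
                } else {
                    adjacency[first][v] = 1;   // second end vertex found
                    adjacency[v][first] = 1;   // undirected graph: symmetric entry
                }
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        for (int[] row : adjacencyFromIncidence(INCIDENCE)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Keeping only the incidence matrix as the source of truth is one possible design choice; the paper itself stores both matrices.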
Step1: RVS is set to null and UVS is set to MVS;
Step2: Both PFES and PFVS are set to null;
Step3: The manager proactively tests the status of all managed objects, and all vertices are divided into RVS and UVS according to the test results;
Step4: Scan the incidence matrix and compute the PFES of the network;
Step5: Scan PFES and UVS; compute PFVS;
Step6: Identify possible fault locations;
Step7: Reason about possible faults;
Step8: Repair the fault and test the effects;
Step9: Decide whether the algorithm should stop or go back to Step3.
Def. 8. Suppose that $G = \langle V, E \rangle$ is an undirected and simple graph where $|V| = n$ and $v_1, v_2, \ldots, v_n \in V$. The reachability matrix $R$ of $G$, with respect to this listing of the vertices, is the $n \times n$ zero-one matrix with 1 as its $(i, j)$th entry when $v_i$ can reach $v_j$, and 0 as its $(i, j)$th entry otherwise. In other words, if its reachability matrix is $R = [r_{ij}]$, then
$$
r_{ij} = \begin{cases} 1 & \text{if } v_i \text{ can reach } v_j \\ 0 & \text{otherwise} \end{cases}
$$
All the elements in the reachability matrix for Figure 1 are one when there is no fault in computer network.
Def. 9. An edge is a possible fault edge (PFE) if one end vertex of the edge is in RVS and the other end vertex is in UVS.
Def. 10. The collection of all possible fault edges is possible fault edges set (PFES).
Figure 2. A fault case of the computer network.
Figure 2 shows a case of network failure, in which a cross marks a fault. Suppose that the manager process always runs on v0 in this paper. The current status of all vertices can be represented with a Status Vector, denoted by SV in this paper. Every element of SV is either one or zero. When SV[i] (0≤i≤n−1) is equal to 1, vi can be accessed by v0. With the zero-one one-dimensional array SV, both RVS and UVS are represented in the computer algorithm.
Due to the edge faults, it is clear that there are two connected components in Figure 2. The SV of Figure 2, indexed by v0, v1, …, v9, is
SV = [1 1 1 1 0 0 1 1 1 1]
According to definition 9, e3 and e6 are members of PFES. Like SV, PFES is realized as a zero-one vector, in which 0 marks an edge that belongs to PFES. Hence, the PFES of Figure 2, indexed by e0, e1, …, e13, is
PFES = [1 1 1 0 1 1 0 1 1 1 1 1 1 1]
Def. 11. A vertex that is a member of UVS is a possible fault vertex (PFV) when one of its incident edges is an element of PFES.
In Figure 2, the unreachable vertex v4 is a PFV because its incident edges e3 and e6 are members of PFES.
Def. 12. The collection of all possible fault vertices is possible fault vertices set (PFVS).
The PFVS of Figure 2, indexed by v0, v1, …, v9, is
PFVS = [1 1 1 1 0 1 1 1 1 1]
Comparing the PFVS with SV, it is easy to see that the number of PFVs is no more than the number of unreachable vertices.
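Definitions 9 to 12 can be checked mechanically on the example of Figure 2. The sketch below is an illustration under stated assumptions rather than the authors' code: it stores the graph as a hypothetical edge list read off the incidence matrix of Figure 1, and it keeps the paper's encoding in which 0 marks membership in PFES or PFVS.

```java
import java.util.Arrays;

/** Sketch: compute PFES and PFVS from a status vector for the graph of Figure 1. */
final class PossibleFaultSets {
    // End vertices of e0..e13, read off the incidence matrix of Figure 1.
    static final int[][] EDGES = {
        {0,1},{1,3},{1,2},{3,4},{2,3},{2,6},{4,6},
        {4,5},{6,7},{7,8},{2,8},{1,8},{8,9},{0,9}
    };

    /** PFES as a zero-one vector: 0 marks a possible fault edge (Def. 9). */
    static int[] computePfes(int[] sv) {
        int[] pfes = new int[EDGES.length];
        for (int e = 0; e < EDGES.length; e++) {
            pfes[e] = (sv[EDGES[e][0]] != sv[EDGES[e][1]]) ? 0 : 1;
        }
        return pfes;
    }

    /** PFVS as a zero-one vector: 0 marks a possible fault vertex (Def. 11). */
    static int[] computePfvs(int[] sv, int[] pfes) {
        int[] pfvs = new int[sv.length];
        Arrays.fill(pfvs, 1);
        for (int e = 0; e < EDGES.length; e++) {
            if (pfes[e] != 0) continue;        // only possible fault edges matter
            for (int v : EDGES[e]) {
                if (sv[v] == 0) pfvs[v] = 0;   // unreachable end vertex of a PFE
            }
        }
        return pfvs;
    }

    public static void main(String[] args) {
        int[] sv = {1,1,1,1,0,0,1,1,1,1};      // status vector of Figure 2
        int[] pfes = computePfes(sv);
        System.out.println("PFES = " + Arrays.toString(pfes));
        System.out.println("PFVS = " + Arrays.toString(computePfvs(sv, pfes)));
    }
}
```

For the status vector of Figure 2, the two vectors agree with the PFES and PFVS given in the text: only e3 and e6 are marked as possible fault edges, and only v4 as a possible fault vertex.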
With these definitions, the novel fault identification approach can be described informally. In general, the approach is subdivided into nine steps, as described in Figure 3. Several theorems about the correctness of this approach are proved in reference [15]. In this paper we focus on the detailed algorithms that realize each step.
Figure 3. The approach based on graph theory.
The fault identification process is executed on three occasions. First, the network manager executes it periodically at a fixed interval. Second, it is called when the network status becomes worse. Third, the network operator starts it through the human-machine interface of the NMS.
When the NMS starts, the fault identification model gets the physical connectivity structure of the managed network from the configuration model. According to the connectivity structure, the fault identification model creates the adjacency matrix and incidence matrix of the managed network. Usually, the NMS runs for a very long time. If the connectivity structure changes, the configuration model sends the latest connectivity structure to the fault identification model to refresh the two matrices.
A basic question of our approach is how the manager measures the current status of every vertex. Because of the popularity of SNMP and ICMP, they are selected to measure the status of vertices in our experiments. Get and Ping are the frequently used primitive and command of those protocols, respectively. The "ping" command is executed only to test whether a node is active or not. The active status of all managed nodes is vital to our approach. If the test communication between the manager node and the managed node gets the expected reply in time, the current status of this node is regarded as "reachable".

    // Marks every edge whose end vertices differ in status as a PFE (PFES[i] = 0).
    void identifyPFES() {
        int[] v = new int[2];
        int k;
        for (int i = 0; i < numOfEdges; i++) {
            k = 0;
            // Find the two vertices incident with edge i.
            for (int j = 0; j < numOfVertices; j++) {
                if (incidenceMatrix[j][i] == 1) {
                    v[k] = j;
                    k++;
                }
            }
            if (SV[v[0]] != SV[v[1]])
                PFES[i] = 0;
            else
                PFES[i] = 1;
        }
    }

    // Marks every unreachable vertex incident with a PFE as a PFV (PFVS[j] = 0).
    void identifyPFVS() {
        for (int i = 0; i < numOfEdges; i++) {
            if (PFES[i] == 0) { // A PFE is encoded as 0 in PFES.
                // Find the vertices incident with the PFE.
                for (int j = 0; j < numOfVertices; j++)
                    if ((incidenceMatrix[j][i] == 1) && (SV[j] == 0))
                        PFVS[j] = 0;
            }
        }
    }
Because the network is unstable and the overhead of network management should be kept as low as possible, the manager is advised to check the status of a managed vertex several times. If the previous check fails, the manager tries again until the number of checks exceeds the given limit or the manager receives a valid acknowledgement. If the manager gets the acknowledgement in time, the node is marked as active and the entry of this vertex in SV is 1. Otherwise the measured vertex is marked as unreachable and its entry in SV is 0.
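The retry rule can be sketched as follows. This is only an illustration: the `Probe` interface and all names are hypothetical, and `InetAddress.isReachable` is used as one possible stand-in for the Ping test (the JDK documents that it tries an ICMP echo request when privileges allow and otherwise falls back to a TCP connection attempt), so it is not necessarily the measurement used in the paper's experiments.

```java
import java.net.InetAddress;

/** Sketch: derive one SV entry from a bounded number of reachability checks. */
final class StatusProbe {
    /** Hypothetical abstraction of a single check; not a name from the paper. */
    interface Probe {
        boolean reachable(String host) throws Exception;
    }

    /** One possible probe, based on the standard InetAddress.isReachable call. */
    static Probe javaReachability(int timeoutMillis) {
        return host -> InetAddress.getByName(host).isReachable(timeoutMillis);
    }

    /** Returns 1 if any of maxTries checks succeeds, otherwise 0. */
    static int status(Probe probe, String host, int maxTries) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            try {
                if (probe.reachable(host)) return 1;   // acknowledgement arrived in time
            } catch (Exception e) {
                // A failed lookup or I/O error counts as a failed check.
            }
        }
        return 0;                                      // every check failed: unreachable
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Probe flaky = host -> ++calls[0] >= 3;         // stub: succeeds on the third check
        System.out.println(status(flaky, "v7", 5));    // no network traffic is generated
    }
}
```

Separating the probe from the retry loop keeps the loop testable without network access; a real deployment would pass `javaReachability(timeout)` or an SNMP-based probe instead of the stub.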
Guaranteeing that the status information of network elements is up to date is an engineering problem. For every measurement, the manager unavoidably spends some time waiting for the acknowledgement. If the measurements are executed one after another, the total waiting time is the sum of the waiting times of all tests. In a large-scale network this cannot be ignored, because the number of managed nodes is so large. Furthermore, the transmission delay of long-distance connections aggravates the situation. A new problem then emerges: the status of the first vertex may have changed by the time the last one has been measured.
To avoid long measurement times and keep the status table up to date, parallel measurement is suggested in practice. Both C++ and Java support multi-threading, and it is frequently employed. Every thread is a computer resource, and creating or killing a thread costs some time. To reduce the system cost of thread operations, a thread pool is applied in our emulation system. Every thread is assigned a measurement task, and the tasks are executed concurrently. A measurement task obtains a thread instance from the thread pool before it starts and returns it after the measurement. Experimental results show that the thread pool saves the time of creating threads during the algorithm execution.
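A pooled, parallel status scan can be sketched with the standard `java.util.concurrent` executor framework. The paper does not say which pooling mechanism its emulation system uses, so the fixed-size pool, the `IntPredicate` check and all names below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

/** Sketch: fill the status vector with one measurement task per vertex. */
final class ParallelStatusScan {
    static int[] scan(int numOfVertices, IntPredicate check, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        int[] sv = new int[numOfVertices];             // default 0 = unreachable
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int v = 0; v < numOfVertices; v++) {
                final int id = v;
                tasks.add(() -> check.test(id) ? 1 : 0);
            }
            // invokeAll waits for all tasks and keeps the order of the task list.
            List<Future<Integer>> results = pool.invokeAll(tasks);
            for (int v = 0; v < numOfVertices; v++) {
                try {
                    sv[v] = results.get(v).get();
                } catch (ExecutionException e) {
                    sv[v] = 0;                         // a failed measurement counts as unreachable
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // keep the interrupt flag for the caller
        } finally {
            pool.shutdown();
        }
        return sv;
    }

    public static void main(String[] args) {
        // Stub check that reproduces Figure 2: v4 and v5 do not answer.
        int[] sv = scan(10, v -> v != 4 && v != 5, 4);
        System.out.println("SV = " + Arrays.toString(sv));
    }
}
```

In a long-running NMS the pool would normally be created once and reused across scans, which is where the saving in thread creation time comes from; it is created per call here only to keep the sketch self-contained.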
The next step is to get the possible fault edges. This can be computed precisely with the algorithm illustrated in Figure 4. Most of the elementary operations in this algorithm are queries of the status vector, denoted by SV in the pseudocode, and of the incidence matrix. The result for a given input is deterministic. Furthermore, the computing speed is acceptable in practice.
Figure 4. Algorithm of computing PFES.
Why do we define PFES, and what is it used for? An important goal is to identify the fault sources as soon as possible. For instance, in the network state of Figure 2, v4 and v5 are inaccessible and three edges are incident with v4 and v5, so why do only e3 and e6 belong to PFES? From the manager's point of view, the connectivity between v4 and v5 is not known. If the fault comes from edges, the manager can affirm that v4 and v5 are unreachable if and only if e3 and e6 are both broken. On the other hand, without adequate evidence, it is impossible to judge whether e7 is normal or not.
In Step5, PFVS is computed from the results of PFES. The detailed algorithm is listed in Figure 5. Like the number of possible fault edges, the number of possible fault vertices is reduced compared with the number of inaccessible vertices. The interpretation is that if v4 shuts down, both v4 and v5 are unreachable; on the other hand, when v5 fails, v4 can still run well. For example, a faulty hub in an office does not mean that all the computers in the office are shut down. All the possible fault edges and vertices are called possible fault elements in this paper.
Figure 5. Algorithm of computing PFVS.
As the example of Figure 2 shows, the definitions of PFES and PFVS help to isolate the possible fault elements and cut down the fault source reasoning time. This feature will be discussed in the remaining sections. Because Step 6 is complicated, it will be discussed separately in section V.
IV. FAULT CASES STUDYING
Although fault cases differ between network management environments, in our view they can be classified into four types. The fault cases are closely related to two notions in graph theory: the number of connected components and the degree of a vertex. Different fault cases are identified in different ways, which are illustrated as follows.
A. A Vertex is Inaccessible Whose Degree is Larger Than 1
In Figure 6, the vertex v7, whose degree is 2, is inaccessible. Routers and switches connect different sub-networks, so more than one fiber or cable is connected to them. This fault case therefore represents a single inaccessible switch or router in a computer network.
Figure 6. The first fault case.
Step3: SV = [1 1 1 1 1 1 1 0 1 1].
Step4: There is only one element in UVS. Scan the row of v7 in the incidence matrix, and we can see that e8 and e9 are possible fault edges.
PFES = [1 1 1 1 1 1 1 1 0 0 1 1 1 1]
Step5: Scan the incidence matrix and conclude that
PFVS = [1 1 1 1 1 1 1 0 1 1];
Step6: There are two elements in PFES and one element in PFVS, so the number of possible fault elements is three and the total number of combinations of possible fault elements is eight.
TABLE I. FAULT ELEMENTS COMBINATIONS AND EFFECTS
v7   e8   e9   consistency
0    0    0    T
0    0    1    T
0    1    0    T
0    1    1    T
1    0    0    T
1    0    1    F
1    1    0    F
1    1    1    F
Table I shows the three possible fault elements, their combinations and the effects. In the first three columns, 1 means that the network element is normal and 0 means that it is abnormal. For the combination of possible fault elements given in the first three columns, the value of "consistency" is T (True) if the node is isolated according to the connectivity analysis in graph theory. On the contrary, it is F (False) if the node is reachable from the manager node in the connectivity analysis.
For example, the second row in Table I (the first row of data, 0 0 0) represents a special combination of possible fault elements, in which v7 shuts down and both e8 and e9 are broken. Obviously, v7 is unreachable from the viewpoint of connectivity analysis in graph theory. Moreover, the reachability test of the NMS shows that v7 is also unreachable. The connectivity analysis in graph theory is consistent with the engineering test, so the value of “consistency” is set to T. Another example concerns F. When only e8 is broken, v7 is reachable in the connectivity analysis. However, v7 is not reachable in the test. The two results are not consistent; therefore the value is F.
Only the combinations of fault elements whose consistency value is T are possible fault reasons and need to be considered further. If the value of “consistency” is T, the values of the first three columns are interpreted as a possible fault source.
Because the algorithm that computes “consistency” is complicated, its detailed procedure will be introduced in section V and section VI. In this section, we use the results directly.
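To make the meaning of the consistency column concrete, the following sketch enumerates the eight combinations of Table I and compares the status vector predicted by connectivity analysis with the observed one. It is an illustration under stated assumptions, not the procedure of sections V and VI: it uses a breadth-first search instead of a reachability matrix, a hypothetical edge list read off Figure 1, and the manager on v0.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/** Sketch: recompute the consistency column of Table I by connectivity analysis. */
final class ConsistencyTable {
    static final int N = 10;
    // End vertices of e0..e13, read off the incidence matrix of Figure 1.
    static final int[][] EDGES = {
        {0,1},{1,3},{1,2},{3,4},{2,3},{2,6},{4,6},
        {4,5},{6,7},{7,8},{2,8},{1,8},{8,9},{0,9}
    };

    /** Status vector seen from the manager when only the flagged elements work. */
    static int[] statusVector(int manager, boolean[] vertexUp, boolean[] edgeUp) {
        int[] sv = new int[N];
        if (!vertexUp[manager]) return sv;
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        sv[manager] = 1;
        queue.add(manager);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int e = 0; e < EDGES.length; e++) {
                if (!edgeUp[e]) continue;              // removed edge (Def. 3)
                int w = EDGES[e][0] == u ? EDGES[e][1]
                      : EDGES[e][1] == u ? EDGES[e][0] : -1;
                // A failed vertex loses all incident edges (Def. 2), so it is never entered.
                if (w >= 0 && vertexUp[w] && sv[w] == 0) {
                    sv[w] = 1;
                    queue.add(w);
                }
            }
        }
        return sv;
    }

    /** True when the combination (1 = normal, 0 = failed) reproduces the observed SV. */
    static boolean consistent(int v7, int e8, int e9, int[] observedSv) {
        boolean[] vertexUp = new boolean[N];
        boolean[] edgeUp = new boolean[EDGES.length];
        Arrays.fill(vertexUp, true);
        Arrays.fill(edgeUp, true);
        vertexUp[7] = (v7 == 1);
        edgeUp[8] = (e8 == 1);
        edgeUp[9] = (e9 == 1);
        return Arrays.equals(statusVector(0, vertexUp, edgeUp), observedSv);
    }

    public static void main(String[] args) {
        int[] observed = {1,1,1,1,1,1,1,0,1,1};        // SV of the first fault case
        System.out.println("v7 e8 e9 consistency");
        for (int c = 0; c < 8; c++) {
            int v7 = (c >> 2) & 1, e8 = (c >> 1) & 1, e9 = c & 1;
            System.out.println(v7 + "  " + e8 + "  " + e9 + "  "
                    + (consistent(v7, e8, e9, observed) ? "T" : "F"));
        }
    }
}
```

The loop visits the combinations in the same order as the rows of Table I, so the printed column can be compared with the table row by row.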
Step7: From the fourth column of Table I, it is clear that five possible fault reasons can result in the unreachable v7. The current task is to output all the possible reasons and judge which one is the most likely. In our experiment, Bayesian decision is used to sort the possible fault reasons. For the Bayesian decision, the fault probabilities of vertices and edges must be obtained first.
In practice, the failure probability of switch vertices is not easy to define, because there is no unified failure probability for all switches and links in different areas and environments. Even in a shared environment, the probabilities of different devices are not equal. An accurate probability can be defined based on historical fault statistics. For these reasons, we do not focus on an accurate way to define the probabilities. We are more interested in the method of computing and sorting fault probabilities. The specific probability of a device does not influence our fault identification approach.
Assumptions about fault occurring probabilities are listed:
a) Probability of fault event in switch is $P_s$;
b) Probability of fault event in link is $P_c$;
c) $P_c \le P_s$;
The simple four rules used in this paper are based on our past network management experience in China. They hold in our network management system, but we cannot assure that they fit other networks well.
Compute the probability of the five possible fault reasons.
Reason1: $P_0 = P_s \times P_c \times P_c$
Reason2: $P_1 = P_s \times P_c \times (1 - P_c)$
Reason3: $P_2 = P_s \times (1 - P_c) \times P_c$
Reason4: $P_3 = P_s \times (1 - P_c) \times (1 - P_c)$
Reason5: $P_4 = (1 - P_s) \times P_c \times P_c$
Because $P_3$ is the largest, failure in v7 is the most likely fault source and will be output first.
Step8: Send repair commands to network operator to check the power or other conditions in v7.
Step9: If the repair feedback information shows the judgment is not correct, select Reason5, which has the second largest probability, and so on. The algorithm runs repeatedly until all vertices are reachable.
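The probability computation of Step7 can be sketched as a product over the three possible fault elements. The numeric values of Ps and Pc below are assumptions chosen only to satisfy Pc ≤ Ps; they are not measurements from the paper, and the class and method names are hypothetical.

```java
/** Sketch: probability of each consistent combination of {v7, e8, e9} in Table I. */
final class FaultReasonRanking {
    /** state[i] is 0 for a failed element and 1 for a normal one, as in Table I. */
    static double probability(int[] state, double ps, double pc) {
        double p = (state[0] == 0) ? ps : 1 - ps;      // switch v7
        p *= (state[1] == 0) ? pc : 1 - pc;            // link e8
        p *= (state[2] == 0) ? pc : 1 - pc;            // link e9
        return p;
    }

    /** Index of the most probable reason among the given combinations. */
    static int mostLikely(int[][] reasons, double ps, double pc) {
        int best = 0;
        for (int i = 1; i < reasons.length; i++) {
            if (probability(reasons[i], ps, pc) > probability(reasons[best], ps, pc)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The five rows of Table I whose consistency is T (Reason1..Reason5).
        int[][] reasons = {{0,0,0},{0,0,1},{0,1,0},{0,1,1},{1,0,0}};
        double ps = 0.05, pc = 0.01;                   // assumed values with pc <= ps
        int best = mostLikely(reasons, ps, pc);
        System.out.printf("Reason%d, P = %.6f%n", best + 1,
                probability(reasons[best], ps, pc));
    }
}
```

With these assumed values the sketch selects Reason4, the case in which only v7 fails, in line with the conclusion of Step7.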
B. A Vertex of Degree 1 is Unreachable
Figure 7 shows the second fault case, in which v5, whose degree is just one, is inaccessible. In computer networks, most servers are located in one network, i.e., only one cable or optical fiber is connected to each of them. This case can therefore represent an inaccessible WWW, FTP or E-MAIL server.
Figure 7. The second fault case.
Step3: SV = [1 1 1 1 1 0 1 1 1 1];
Step4: PFES = [1 1 1 1 1 1 1 0 1 1 1 1 1 1];
Step5: PFVS = [1 1 1 1 1 0 1 1 1 1];
Step6: All the possible fault elements in this fault case and the possible fault reasons are listed in Table II.
TABLE II. FAULT ELEMENTS COMBINATIONS AND FAULT EFFECTS.
v5   e7   consistency
0    0    T
0    1    T
1    0    T
1    1    F
Step7: Compute the probability of the three possible fault reasons.
Reason1: $P_0 = P_s \times P_c$
Reason2: $P_1 = P_s \times (1 - P_c)$
Reason3: $P_2 = (1 - P_s) \times P_c$
Based on the assumptions above, $P_1$ and $P_2$ are obviously larger than $P_0$. Let us compare $P_1$ and $P_2$.
$$
P_1 - P_2 = P_s \times (1 - P_c) - (1 - P_s) \times P_c = P_s - P_s P_c - P_c + P_s P_c = P_s - P_c \ge 0
$$
Since $P_1$ is the largest, the vertex v5 is the first output of the fault identification algorithm.
Step 8: Send the repair command to fix the network.
C. Multiple Vertices in a Connected Component are Inaccessible
Figure 8 shows the third case, in which many managed devices become inaccessible at the same time and all the unreachable vertices are in the same connected component. This case is frequent in real network environments. If a boundary router fails or the fibers connected to a boundary router are broken, all computers in the sub-network become inaccessible.
Figure 8. The third fault case.
Step3: SV = [1 1 1 1 0 0 1 1 1 1];
Step4: PFES = [1 1 1 0 1 1 0 1 1 1 1 1 1 1];
Step5: PFVS = [1 1 1 1 0 1 1 1 1 1];
Step6: Three possible fault elements, namely e3, e6 and v4, are in Figure 8;
Step7: Similar to the fault source sorting in the first fault case, the fault of vertex v4 is listed first;
Step8: The operation is similar to the first two cases;
Step9: Test the reachable status of the vertices in UVS. When all vertices are accessible, the algorithm stops.
In this case, we see that the number of possible fault sources does not increase quickly although more than one vertex is unreachable. If e3 and e6 are normal, the network is recovered completely after repairing the fault in v4. Otherwise the fault fixing process can be subdivided into two steps. The advantages of this way are that it decreases the complexity of each decision and improves the fixing rate and the correctness.
D. Multiple Vertices Are Inaccessible Which are in Multiple Connected Components
Figure 9 shows that v4, v5 and v7 are abnormal, which is an example of the fourth fault case. In computer networks, this case can be interpreted as several isolated sub-networks being inaccessible at the same time.
    // Removal of a vertex (Def. 2): erase all incident edges; the vertex remains.
    void removalOfAVertex(int vertexId) {
        for (int i = 0; i < numOfVertices; i++) {
            adjacencyMatrix[i][vertexId] = 0;
            adjacencyMatrix[vertexId][i] = 0;
        }
    }

    // Removal of an edge (Def. 3): erase the entry between its two end vertices.
    void removalOfAnEdge(int edgeId) {
        int j = 0;
        int[] v = new int[2];
        // Find the two end vertices of the edge in the incidence matrix.
        for (int i = 0; i < numOfVertices; i++) {
            if (incidenceMatrix[i][edgeId] == 1) {
                v[j] = i;
                j++;
            }
        }
        // Clear the entries only after both end vertices are known.
        adjacencyMatrix[v[0]][v[1]] = 0;
        adjacencyMatrix[v[1]][v[0]] = 0;
    }
Figure 9. The fourth fault case.
Step3: SV = [1 1 1 1 0 0 1 0 1 1];
Step4: PFES = [1 1 1 0 1 1 0 1 0 0 1 1 1 1];
Step5: PFVS = [1 1 1 1 0 1 1 0 1 1].
The next steps are the same as those discussed in the first and second cases.
V. AUTOMATED FAULT EFFECT ANALYSIS ALGORITHM
In section III, the possible fault edges and vertices of a fault effect have been identified. If either all the possible fault edges or all the possible fault vertices are removed, the connectivity of the graph changes. Different combinations of possible fault elements can lead to the same fault indication. For example, in Figure 2, if e3 and v4 fail at the same time, the fault indication is also that v4 and v5 are unreachable. This suggests that identifying the possible fault factors is only the preliminary step of fault reason analysis. The further work is to list the possible fault reasons, as in Table I and Table II of section IV. Some algorithms compare probabilities, and some look for similarity between the current fault case and historical fault records. In contrast, we care about the connectivity of the graph. With this method, all the possible fault reasons are listed precisely.
The discussion of the algorithm is divided into three parts. First, algorithms for computing the reachability matrix are analyzed and compared, and the effect of each combination of fault elements is computed. Second, we discuss how to judge whether the indication of a combination of possible fault elements is the same as the test results. Finally, we describe how all the combinations are generated in software.
A. Possible Fault Elements Combination and its Effect
In graph theory, the reachability matrix can be computed as $R = A + A^2 + A^3 + \dots + A^n$, where A is the adjacency matrix of the graph. The time complexity of this algorithm is $O(n^4)$ [14].
The Warshall algorithm is an efficient method for computing the transitive closure of a relation [15]. It has a worst-case complexity of $O(n^3)$, where n is the number of vertices of the graph. Because the reachability matrix corresponds to the transitive closure of the adjacency relation, we propose using the Warshall algorithm to compute it [14]. It is asymptotically faster than the first method. Since computing the reachability matrix is a frequent operation in our fault identification approach, selecting the Warshall algorithm speeds up the whole fault reasoning process.
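As an illustration, the Warshall computation on a 0/1 adjacency matrix can be sketched in Java as follows. This is a minimal sketch and not the code of our system: the class name WarshallSketch and the method name reachability are hypothetical, and the matrix is assumed to be a square integer array holding only 0 and 1.

```java
import java.util.Arrays;

/** Sketch: reachability matrix via Warshall's algorithm on a 0/1 adjacency matrix. */
final class WarshallSketch {

    /**
     * Returns a new matrix r in which r[i][j] == 1 when a path of length >= 1
     * leads from vertex i to vertex j. The input matrix is not modified.
     */
    static int[][] reachability(int[][] adjacency) {
        int n = adjacency.length;
        int[][] r = new int[n][];
        for (int i = 0; i < n; i++) {
            r[i] = Arrays.copyOf(adjacency[i], n);
        }
        // For each intermediate vertex k, every row i that reaches k
        // absorbs (Boolean OR) everything that k reaches.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                if (r[i][k] == 1) {
                    for (int j = 0; j < n; j++) {
                        r[i][j] |= r[k][j];
                    }
                }
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // A path graph 0 - 1 - 2 plus an isolated vertex 3.
        int[][] a = {
            {0, 1, 0, 0},
            {1, 0, 1, 0},
            {0, 1, 0, 0},
            {0, 0, 0, 0}
        };
        for (int[] row : reachability(a)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The three nested loops make the $O(n^3)$ bound visible, and the inner loop uses only Boolean OR operations on matrix rows.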
The removal of a vertex and the removal of an edge were defined in section II. Because all the information about the network is stored in the adjacency matrix and the incidence matrix in the software implementation, the detailed removal operations are described in Figure 10. By querying PFES and PFVS and removing vertices and edges as in Figure 10, a new adjacency matrix, denoted by A′, is produced. The reachability matrix of the current network can then be computed from A′ with the Warshall algorithm.
Figure 10. The removal algorithms of a vertex and an edge.
Figure 11 gives an example of this computation.
Figure 11. The vertex and the edge have been removed: (a) the complete sub-graph; (b) the current network state.
Figure 11 (a) is the complete computer network, which is a sub-graph of Figure 1. Figure 11 (b) is the current network state, in which v8 and v2 are inaccessible to the manager located in v0. Suppose that this fault is due to a global fault in v8 and a simultaneous breakdown of e2.
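Suppose that the vertices of Figure 11 (b) are listed in the arbitrary order v0, v1, v2, v8 and v9. The adjacency matrix A and the matrices A′ obtained after each removal are

$$
A = \begin{pmatrix} 0&1&0&0&1 \\ 1&0&1&1&0 \\ 0&1&0&1&0 \\ 0&1&1&0&1 \\ 1&0&0&1&0 \end{pmatrix}
\overset{\text{delete } v_8}{\Longrightarrow}
A' = \begin{pmatrix} 0&1&0&0&1 \\ 1&0&1&0&0 \\ 0&1&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \end{pmatrix}
\overset{\text{delete } e_2}{\Longrightarrow}
A' = \begin{pmatrix} 0&1&0&0&1 \\ 1&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \end{pmatrix}
$$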
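The reachability matrix of Figure 11 (b) is then computed with the Warshall algorithm. The detailed process is

$$
A' \overset{(1)}{\Longrightarrow}
A' = \begin{pmatrix} 0&1&0&0&1 \\ 1&1&0&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \end{pmatrix}
\overset{(2)}{\Longrightarrow}
A' = \begin{pmatrix} 0&1&0&0&1 \\ 1&1&0&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&1&0&0&1 \end{pmatrix}
$$

$$
\overset{(3)}{\Longrightarrow}
A' = \begin{pmatrix} 1&1&0&0&1 \\ 1&1&0&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&1&0&0&1 \end{pmatrix}
\overset{(4)}{\Longrightarrow}
R = A'
$$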
(1) Add the 1st row to the 2nd row;
(2) Add the 1st row to the 5th row;
(3) Add the 2nd row to the 1st row;
(4) R is the reachability matrix.
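The worked example above can be reproduced with a short Java sketch. It is only an illustration under stated assumptions: indices 0 to 4 stand for v0, v1, v2, v8 and v9; e2 is taken to join v1 and v2, as the change between the two matrices A′ implies; and the edge is removed by its end vertices through the hypothetical helper removeEdgeBetween, because the incidence matrix of this sub-graph is not listed here. The class and method names are hypothetical.

```java
/** Sketch: removal of v8 and e2 in Figure 11, followed by Warshall's algorithm. */
final class RemovalExample {

    /** Clears the row and the column of the removed vertex. */
    static void removeVertex(int[][] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i][v] = 0;
            a[v][i] = 0;
        }
    }

    /** Hypothetical helper: removes the edge joining vertices u and v. */
    static void removeEdgeBetween(int[][] a, int u, int v) {
        a[u][v] = 0;
        a[v][u] = 0;
    }

    /** Warshall's algorithm, applied in place to a 0/1 matrix. */
    static void warshall(int[][] a) {
        int n = a.length;
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                if (a[i][k] == 1) {
                    for (int j = 0; j < n; j++) {
                        a[i][j] |= a[k][j];
                    }
                }
            }
        }
    }

    /** Rebuilds the example; indices 0..4 stand for v0, v1, v2, v8, v9. */
    static int[][] figure11Reachability() {
        int[][] a = {
            {0, 1, 0, 0, 1},
            {1, 0, 1, 1, 0},
            {0, 1, 0, 1, 0},
            {0, 1, 1, 0, 1},
            {1, 0, 0, 1, 0}
        };
        removeVertex(a, 3);          // global fault in v8
        removeEdgeBetween(a, 1, 2);  // e2, assumed to join v1 and v2
        warshall(a);                 // a now holds the reachability matrix R
        return a;
    }
}
```

Calling figure11Reachability() yields a matrix whose rows for v0, v1 and v9 are 1 1 0 0 1 and whose rows for v2 and v8 are all zero, in agreement with R above.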
However, it is not easy to draw conclusions about the fault effect directly from the reachability matrix R alone.
B. Judging Whether Fault Effects are Consistent
By inspecting the reachability matrix, we can see whether at least one path exists between two given vertices. The core idea of our fault identification approach is the connectivity analysis of the graph, and the reachability matrix carries that information implicitly. We need a way to turn this implicit information into an explicit form that can be judged by rules. Matrix transformation is a common operation in mathematics, and we use it in our algorithm: after the transformation, characteristic sub-matrices appear. We have found such transformation rules for the reachability matrix.
To automate fault effect analysis, two temporary matrices, RTM and CTM, are introduced in Figure 12. RTM stands for “Row Transformation Matrix” and CTM stands for “Column Transformation Matrix”. In graph theory, two different vertices are mutually reachable if and only if they are in the same connected component. The automated fault effect judgment algorithm, illustrated in Figure 12, is based on this principle.
Figure 12. The matrix transform algorithm.
Transforming the reachability matrix R with the algorithm in Figure 12 yields the CTM, in which the vertices of the graph are listed as the reachable vertices set followed by the unreachable vertices set.
To show this transformation clearly, the computing procedure for Figure 11 (b) is given below.
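$$
R = \begin{pmatrix} 1&1&0&0&1 \\ 1&1&0&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&1&0&0&1 \end{pmatrix}
\overset{(1)}{\Longrightarrow}
RTM = \begin{pmatrix} 1&1&0&0&1 \\ 1&1&0&0&1 \\ 1&1&0&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix}
\overset{(2)}{\Longrightarrow}
CTM = \begin{pmatrix} 1&1&1&0&0 \\ 1&1&1&0&0 \\ 1&1&1&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix}
$$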
(1) Copy the 1st, 2nd and 5th rows of R to the 1st, 2nd and 3rd rows of RTM;
(2) Copy the 1st, 2nd and 5th columns of RTM to the 1st, 2nd and 3rd columns of CTM.
Finally, the sequence of vertices of Figure 11 in CTM is v0, v1, v9, followed by v8, v2.
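The row and column transformation can be sketched in Java as below. This is a hedged illustration of the algorithm of Figure 12, not the original implementation: the class ConnectivityTransform, the methods newPositions and toCtm, and the boolean array accessible (the measured status of each vertex) are hypothetical names. Accessible vertices are placed from the top and inaccessible vertices from the bottom, as in Figure 12.

```java
/** Sketch: build CTM from the reachability matrix R and the measured vertex status. */
final class ConnectivityTransform {

    /** New index of each vertex: accessible ones fill from 0 upward, the others from n-1 downward. */
    static int[] newPositions(boolean[] accessible) {
        int n = accessible.length;
        int[] pos = new int[n];
        int j = 0;
        int k = n - 1;
        for (int i = 0; i < n; i++) {
            pos[i] = accessible[i] ? j++ : k--;
        }
        return pos;
    }

    /** Applies the row transformation (R to RTM) and then the column transformation (RTM to CTM). */
    static int[][] toCtm(int[][] r, boolean[] accessible) {
        int n = r.length;
        int[] pos = newPositions(accessible);
        int[][] rtm = new int[n][];
        for (int i = 0; i < n; i++) {
            rtm[pos[i]] = r[i].clone();          // copy the ith row of R
        }
        int[][] ctm = new int[n][n];
        for (int row = 0; row < n; row++) {
            for (int i = 0; i < n; i++) {
                ctm[row][pos[i]] = rtm[row][i];  // copy the ith column of RTM
            }
        }
        return ctm;
    }
}
```

For the example above, with the vertices listed as v0, v1, v2, v8, v9 and only v0, v1 and v9 accessible, newPositions returns 0, 1, 4, 3, 2, which corresponds to the vertex sequence v0, v1, v9, v8, v2 in CTM.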
Suppose the number of elements of RVS is V1 (V1 ≤ n). CTM can be divided into four sub-matrices, each with its own physical meaning, which helps to judge the consistency of the fault effect. If the fault effect obtained from connectivity analysis is consistent with the measurement results, the CTM must satisfy all the rules below.
Create matrices RTM and CTM.
for (i = 0, j = 0, k = numOfVertices - 1; i < numOfVertices; i++) {
    if (vi is accessible) {
        copy the ith row of R to the jth row of RTM; j++;
    } else {
        copy the ith row of R to the kth row of RTM; k--;
    }
}
for (i = 0, j = 0, k = numOfVertices - 1; i < numOfVertices; i++) {
    if (vi is accessible) {
        copy the ith column of RTM to the jth column of CTM; j++;
    } else {
        copy the ith column of RTM to the kth column of CTM; k--;
    }
}
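void listFaultReasonsTable(int[] elements) {
    int x = elements.length;              // number of possible fault elements
    int max = 1 << x;                     // 2^x combinations
    int number, i, j, k;
    int[][] list = new int[max][x + 1];
    for (number = 0; number < max; number++) {
        i = number; j = x - 1; k = 0;
        while (j >= 0) {
            list[number][k] = i / (1 << j);   // i DIV 2^j
            i = i % (1 << j);                 // i MOD 2^j
            j--; k++;
        }
    }
}

$$
CTM = \begin{pmatrix} 1&1&1&0&0 \\ 1&1&1&0&0 \\ 1&1&1&0&0 \\ 0&0&0&0&1 \\ 0&0&0&1&0 \end{pmatrix}
$$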
The top-left sub-matrix (V1×V1) represents the reachability between any pair of vertices among the accessible vertices, so all of its elements must be 1.
The bottom-left and top-right sub-matrices have the same meaning: they represent the reachability between RVS and UVS. The elements of these sub-matrices must all be zero. Otherwise, some vertex would belong to both UVS and RVS, which is not permitted.
The above case, in which v8 and e2 are the fault sources at the same time, is therefore a possible fault reason when v2 and v8 are inaccessible to the manager.
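On the other hand, we test another combination of possible fault elements. For example, when only v8 is faulty, the computing process is as follows.

$$
A = \begin{pmatrix} 0&1&0&0&1 \\ 1&0&1&1&0 \\ 0&1&0&1&0 \\ 0&1&1&0&1 \\ 1&0&0&1&0 \end{pmatrix}
\overset{(1)}{\Longrightarrow}
A' = \begin{pmatrix} 0&1&0&0&1 \\ 1&0&1&0&0 \\ 0&1&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \end{pmatrix}
$$

$$
\overset{(2)}{\Longrightarrow}
R = \begin{pmatrix} 1&1&1&0&1 \\ 1&1&1&0&1 \\ 1&1&1&0&1 \\ 0&0&0&0&0 \\ 1&1&1&0&1 \end{pmatrix}
\overset{(3)}{\Longrightarrow}
RTM = \begin{pmatrix} 1&1&1&0&1 \\ 1&1&1&0&1 \\ 1&1&1&0&1 \\ 1&1&1&0&1 \\ 0&0&0&0&0 \end{pmatrix}
\overset{(4)}{\Longrightarrow}
CTM = \begin{pmatrix} 1&1&1&1&0 \\ 1&1&1&1&0 \\ 1&1&1&1&0 \\ 1&1&1&1&0 \\ 0&0&0&0&0 \end{pmatrix}
$$

(1) Remove v8;
(2) Compute the reachability matrix with the Warshall algorithm;
(3) Transform R to get RTM;
(4) Transform RTM to get CTM.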
Obviously, the effect of this combination is not consistent with the measurement result, because the reachability matrix shows that v2 is connected with v0. There may also be connections among the inaccessible vertices, which the manager cannot observe. The best way is to ignore the information in the bottom-right sub-matrix. If there are connections among inaccessible vertices, the consequence is that repairing one vertex makes several vertices reachable again.
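The consistency rules can be expressed compactly in code. The following Java sketch is an illustration under assumptions: the class ConsistencyCheck and the method isConsistent are hypothetical names, ctm is a square 0/1 matrix whose first v1 rows and columns belong to RVS, and the bottom-right sub-matrix is ignored as suggested above.

```java
/** Sketch: check a CTM against the sub-matrix rules for a given size v1 of RVS. */
final class ConsistencyCheck {

    static boolean isConsistent(int[][] ctm, int v1) {
        int n = ctm.length;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                boolean rowReachable = i < v1;
                boolean colReachable = j < v1;
                if (rowReachable && colReachable) {
                    // Top-left sub-matrix: every pair of accessible vertices is connected.
                    if (ctm[i][j] != 1) {
                        return false;
                    }
                } else if (rowReachable != colReachable) {
                    // Top-right and bottom-left: no path between RVS and UVS.
                    if (ctm[i][j] != 0) {
                        return false;
                    }
                }
                // Bottom-right sub-matrix: not observable by the manager, ignored.
            }
        }
        return true;
    }
}
```

With v1 = 3, the CTM obtained when v8 and e2 are both removed passes this check, whereas the CTM obtained when only v8 is removed fails it because its top-right sub-matrix contains a 1.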
C. Algorithm for Computing Fault Elements Combination
The number of possible fault elements is a variable, which changes over time even in the same network. Every possible fault element has two statuses, normal and abnormal. There are therefore $2^x$ combinations of possible fault elements, where x is the number of possible fault elements of the network, but the number of possible fault reasons is not equal to it. A flexible algorithm that produces the $2^x$ zero-one sequences quickly is the key to computing the fault reasons.
To solve this, we first consider a similar problem: obtaining the binary representation of an integer. DIV and MOD are two useful operations here. As an example, we transform a decimal number into its binary format step by step. The input is (11)d. The steps with DIV and MOD are shown in Figure 13.
Figure 13. The process of changing (11)d to (1011)b.
Figure 13 suggests a solution for our task. Because the number of possible fault elements is known, the total number of combinations can be computed, and each combination can be assigned a number to identify it. The task is to code this number in a way that carries physical meaning and is convenient for further processing. Each combination can be coded as a binary sequence of length x, in which every bit indicates whether the corresponding possible fault element is normal.
With this coding method, all the combinations can be represented. The algorithm that codes the combinations of fault elements is illustrated in Figure 14; it is similar to the integer format transformation algorithm.
Figure 14. The algorithm to get possible fault lists.
With the algorithms of section V, the two tables can be computed automatically.
number = 11, number DIV 8 = 1, number MOD 8 = 3;
number = 3, number DIV 4 = 0, number MOD 4 = 3;
number = 3, number DIV 2 = 1, number MOD 2 = 1;
number = 1, number DIV 1 = 1, number MOD 1 = 0.
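The DIV and MOD steps above, and their use for listing all combinations, can be sketched in Java as follows. This is a minimal illustration: the class CombinationSketch and its methods toBits and allCombinations are hypothetical names, x is assumed to be smaller than 31 so that $2^x$ fits in an int, and which bit value stands for "normal" is left to the caller.

```java
/** Sketch: code each combination of x possible fault elements as an x-bit sequence. */
final class CombinationSketch {

    /** Binary code of number, most significant bit first; requires 0 <= number < 2^x. */
    static int[] toBits(int number, int x) {
        int[] bits = new int[x];
        int rest = number;
        for (int j = x - 1, k = 0; j >= 0; j--, k++) {
            int weight = 1 << j;      // 2^j
            bits[k] = rest / weight;  // DIV
            rest = rest % weight;     // MOD
        }
        return bits;
    }

    /** All 2^x zero-one sequences of length x, in increasing numeric order. */
    static int[][] allCombinations(int x) {
        int max = 1 << x;
        int[][] list = new int[max][];
        for (int number = 0; number < max; number++) {
            list[number] = toBits(number, x);
        }
        return list;
    }
}
```

For instance, toBits(11, 4) follows exactly the four steps listed above and returns the sequence 1, 0, 1, 1.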
VI. EMULATIONS IN A LARGE-SCALE NETWORK
Several algorithms are proposed in this paper; they should be coded and applied in a real network management system. Furthermore, we want to evaluate this fault identification approach with actual experimental results. In our view, the performance of fault identification includes at least two factors: the reasoning speed, and whether the approach gives the actual fault reason for the network. We do not consider the total recovery time of the network, because different fault sources need different repair times. We aim for a short reasoning time.
We carry out an experiment with this fault identification approach in a real large-scale network management environment. The experimental setup is as follows.
Step 1: Code these algorithms and embed the fault identification model into a network management system;
Step 2: After the NMS starts, the fault identification model obtains the complete, static topology information from the configuration models of the NMS;
Step 3: The current (dynamic) status table of all managed nodes is created by the status testing process;
Step 4: With the three necessary structures (adjacency matrix, incidence matrix and status vector), the fault identification model is triggered, and all possible fault reasons are listed and sorted;
Step 5: Fix the network according to the list of possible fault reasons.
Figure 15. Map of an actual large-scale network.
The connectivity structure of all the managed network equipment is illustrated in Figure 15, which is the backbone of a telecommunication company in Hubei province, China. There are 53 managed devices in total, denoted by triangles in the map, which are routers or switches. The black lines denote the links, most of which are optical fibers. The network topology is so complicated that it cannot easily be classified as a regular topology. Since every vertex lies on at least one ring and its degree is larger than two, the second fault case cannot be tested in our evaluation experiments. Because of the irregular topology of this network, event correlation technology and reasoning are not conveniently applied. In contrast, the graph-based approach is not sensitive to the topology and can be employed more easily.
Java is one of the most portable programming languages; its code runs on Windows, Linux, Unix and other operating systems, and it has become popular in network management system design and implementation in recent years [18]-[19]. Although Windows is the most popular operating system, Unix and Linux are more frequently used on telecommunication servers. To improve software reusability, the Java Development Kit (JDK 1.4.1) is adopted as the development platform. With the Runtime class in Java, the Ping command can be conveniently integrated into the NMS to obtain the state measurement results. In the emulation tests, the classical types of fault effects listed in section IV, except the second case, are tested.
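The status test through the Runtime class can be sketched as follows. This is an illustration in current Java syntax, not the code of our NMS. It assumes that a ping executable is on the path and that an exit code of 0 means a reply was received, which holds for common Unix ping implementations but is not guaranteed on every platform. The class StatusProbe and its methods are hypothetical names; Runtime.exec(String[]) and Process.waitFor() are standard Java calls.

```java
import java.io.IOException;
import java.util.function.Predicate;

/** Sketch: derive the status vector SV from one ping per managed device. */
final class StatusProbe {

    /** Builds the ping command line; "-n" is the Windows count flag, "-c" the Unix one. */
    static String[] pingCommand(String host, boolean windows) {
        return new String[] {"ping", windows ? "-n" : "-c", "1", host};
    }

    /** Runs one ping and treats exit code 0 as "accessible" (an assumption, see above). */
    static boolean ping(String host) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        try {
            Process p = Runtime.getRuntime().exec(pingCommand(host, windows));
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false;                        // ping could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
            return false;
        }
    }

    /** Status vector: 1 for an accessible device, 0 otherwise. */
    static int[] statusVector(String[] hosts, Predicate<String> probe) {
        int[] sv = new int[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            sv[i] = probe.test(hosts[i]) ? 1 : 0;
        }
        return sv;
    }
}
```

A call such as StatusProbe.statusVector(hosts, StatusProbe::ping) would then fill the status vector for a list of device addresses; passing the probe as a parameter also allows the vector to be built from recorded measurements.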
Figure 16. The time consumption of reasoning.
Two aspects of the algorithms are considered carefully, namely the time of fault isolation and the correctness of the fault source. Figure 16 shows how the reasoning time (step 3 to step 6) consumption (from step 4 to step 7) changes with the number of inaccessible vertices in one connected component. The manager runs on a personal computer with a Pentium II 500 processor and 128M memory. The figure indicates that the time consumption is nearly constant, about 400 ms, even as the number of faults increases linearly. That is to say, the time cost remains stable when the approach is employed in a larger-scale network. This feature supports the real-time response requirement of an NMS, and the scalability to all kinds of networks is also acceptable. Compared with other fault isolation technologies, the approach based on graph theory is not only fast but also easy to implement. Moreover, the reasoning results are more reliable and more helpful for repair.
To get an objective evaluation of the approach in this paper, comparisons should be made with other fault identification methods. Because we do not have the same test cases as other papers, we do not compare our work directly with other research results in this paper, to avoid drawing a subjective conclusion. The reasoning speed in Figure 16 is acceptable both in research and in engineering. In addition, all the possible fault reasons are listed and sorted by probability after reasoning; that is, the actual fault reason is found with this fault identification approach. Therefore, we think the performance of fault identification is demonstrated by the emulation results.
VII. CONCLUSIONS
The fault identification and isolation approach based on graph theory has several advantages compared with other methods. First, it works in a network of any topology, whereas reasoning methods that depend on event correlation have to examine the network topology to derive relationships among the vertices and traps. Second, the reasoning about fault reasons becomes simpler and can be executed conveniently by computer. There are case-based, rule-based and artificial neural network based ways to find the fault reasons; each of them tries to find or compare the similarity between the new fault case and the historical fault records. Those methods face a dilemma between correctness and reasoning speed, which our method resolves. Third, the greatest improvement is that the algorithm always finds the most likely fault reason in a short time. Finally, most of the operations in this approach are matrix and Boolean operations, which can be executed quickly by computer. From the evaluation example in this paper, it is clear that the time complexity of the approach is acceptable.
While this paper assumes that fault events are equally probable, we do not believe this assumption to be entirely desirable. It should be replaced with actual measured probabilities in future work, because the probabilities affect the order in which fault reasons are sorted. It has been indicated that Bayesian decision cannot respond quickly to small-probability events; future research should focus on integrating Bayesian decision and case-based reasoning in the fault source reasoning system. In addition, if abundant historical fault records are available, they should be studied further to mine helpful rules. Furthermore, the basic assumption that the connectivity structure of the network can be obtained from the topology model should be relaxed in the future, because network topology changes dramatically [20]. Current connectivity discovery techniques should be integrated into the fault identification work.
Automated fault identification and isolation are still a challenge, because the information collected by the manager is limited while the possible fault reasons are many. How to find the actual fault reason is still an open problem in network management. In this paper, a set of the most likely reasons, which includes the actual one, is found. This is an improvement, but not a complete solution. We believe that fault isolation based on graph theory will lead to simple, scalable network management solutions.
ACKNOWLEDGEMENTS
This research has been partially supported by National Natural Science Foundation of China under Grant No. 60174043, by the Key Project of Natural Science Foundation of Hubei Province in China under Grant No. 2002AB025, and by the Natural Science Foundation of Central China Normal University under Grant No. 500502.
REFERENCES
[1] Chi-Chun Lo, Shing-Hong Chen, and Bon-Yeh Lin, “Coding-based Schemes for Fault Identification in Communication Networks,” International Journal of Network Management, pp. 157-164, May-June 2000.
[2] Masum Hasan, Binay Sugla, and Ramesh Viswanathan, “A Conceptual Framework for Network Management Event Correlation and Filtering Systems,” Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management (IM), 1999.
[3] Cynthia Hood and Chuanyi Ji, “Proactive Network Fault Detection,” IEEE Transactions on Reliability, vol. 46, no. 3, pp. 333-341, September 1997.
[4] A. Aghasaryan, E. Fabre, A. Benveniste, R. Boubour, and C. Jard, “A Petri net approach to fault detection and diagnosis in distributed systems,” 36th IEEE Conference on Decision and Control (CDC), San Diego, IEEE Control Systems Society, pp. 726-731, December 1997.
[5] Renée Boubour and Claude Jard, “Fault Detection in Telecommunication Networks Based on Petri Net Representation of Alarm Propagation,” Lecture Notes in Computer Science, vol. 1248: 18th International Conference on Application and Theory of Petri Nets, Toulouse, France, pp. 367-386, June 1997.
[6] A. Bouloutas, S. Calo, and A. Finkel, “Alarm Correlation and Fault Identification in Communication Networks,” IEEE Transactions on Communications, vol. 42, pp. 523-533, 1994.
[7] C. S. Chao, D. L. Yang, and A. C. Liu, “An Automated Fault Diagnosis System Using Hierarchical Reasoning and Alarm Correlation,” Journal of Network and Systems Management, vol. 9, no. 2, pp. 183-202, June 2001.
[8] E. A. Mohamed and N. D. Rao, “Artificial neural network based fault diagnostic system for electric power distribution feeders,” Electric Power System Research, vol. 35, pp. 1-10, 1995.
[9] Hongjun Li, John S. Baras, and George Mykoniatis, “An Automated, Distributed, Intelligent Fault Management System for Communication Networks,” Technical Report, TR 99-57, University of Maryland, 1999.
[10] Beverly Schwartz, Alden W. Jackson, W. Timothy Strayer, Wenyi Zhou, R. Dennis Rockwell, and Craig Partridge, “Smart Packets: Applying Active Networks to Network Management,” ACM Transactions on Computer Systems, vol. 18, no. 1, pp. 67-88, February 2000.
[11] T. White, A. Bieszczad, and B. Pagurek, “Distributed Fault Location in Networks Using Mobile Agents,” in Proceedings of the 3rd International Workshop on Agents in Telecommunication Applications IATA'98, Paris, France, July 1998.
[12] Andrzej Bieszczad, Bernard Pagurek, and Tony White, “Mobile agents for network management,” IEEE Communications Surveys, September 1998.
[13] Irene Katzela and Mischa Schwartz, “Schemes for Fault Identification in Communication Networks,” IEEE/ACM Transactions on Networking, vol. 3, no. 6, pp. 753-764, December 1995.
[14] Yijiao Yu, Qin Liu, Liansheng Tan, and Debao Xiao, “A Novel Automated Fault Identification Approach in Computer Networks Based on Graph Theory,” in Proceedings of 2003 International Conference on Communication Technology, Beijing, pp. 167-173, April 9-11, 2003.
[15] Stephen Warshall, “A Theorem on Boolean Matrices,” Journal of the ACM, vol. 9, no. 1, pp. 11-12, January 1962.
[16] C. S. Chao, D. L. Yang, and A. C. Liu, “A LAN Fault Diagnosis System,” Computer Communications, vol. 24, no. 14, pp. 1439-1451, September 2001.
[17] Anoop Reddy, Deborah Estrin, and Ramesh Govindan, “Large-Scale Fault Isolation,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 733-743, May 2000.
[18] Y. Yemini, A. V. Konstantinou, and D. Florissi, “NESTOR: An Architecture for Network Self-Management and Organization,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 75-76, May 2000.
[19] L. Andrey, O. Festor, E. Nataf, and R. State, “JTMN: A Java-Based TMN Development and Experimentation Environment,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 5, pp. 66-67, May 2000.
[20] V. Paxson, “End-to-End Routing Behavior in the Internet,” IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 601-615, October 1997.
Yijiao Yu is now a Teaching Assistant at Central China Normal University, PR China. Mr. Yu received his MSc and Bachelor degrees in computer science from the Department of Computer Science, Central China Normal University, in 2002 and 1999 respectively. His research interests focus on computer networks and artificial intelligence.
Qin Liu received her MSc and Bachelor degrees in computer science from the Department of Computer Science, Central China Normal University, PR China. She is interested in computer network congestion control and traffic modeling.
Liansheng Tan is now a Full Professor and Head of the Department of Computer Science, Central China Normal University, PR China. Professor Tan received his Ph.D. degree from Loughborough University in the UK in 1999. He did research on computer communication networks in the School of Information Technology and Engineering at the University of Ottawa, Ontario, Canada, as a postdoctoral research fellow and a visiting research scientist in 2001. He has published over fifty refereed papers. His research interests are in modeling, analysis and performance evaluation of computer communication networks, their protocols, services and interconnection architectures. These include multimedia networks, local area networks, metropolitan area networks, broadband networks and switching architectures for congestion control. Professor Tan is also interested in queuing theory, simulations, computational algorithms and their applications in high-speed computer communication networks.