Failure Behavior Analysis for Reliable Distributed Embedded Systems


Mario Trapp, Bernd Schürmann, Torsten Tetteroo {trapp | schuerma | tetteroo}@informatik.uni-kl.de Department of Computer Science, University of Kaiserslautern

Abstract

Failure behavior analysis is an important phase in developing large distributed embedded systems with weak safety requirements that degrade gracefully in case of failures. Today, the analysis is usually done with standard methods like FTA and FMEA, which consider only the existence of faults. Gradations of errors are not regarded, so these methods give only a very coarse approximation of the system behavior. In contrast, our advanced failure behavior analysis yields more sophisticated and graded results.

We obtain comprehensive results by assigning a quality description to all the information in a system and extending the pure information flow to an information quality flow that also models the system failure behavior. We model this information quality flow by object-oriented hierarchical petri nets. Large parts of these nets can be generated automatically from the existing behavioral system structure. A net simulator enables us to perform all the sophisticated analyses we need to examine the failure behavior.

1 Introduction

Failure analysis is one of the main aspects in developing reliable embedded systems. Knowing the effects of faults on the system behavior enables the developer to strengthen the weak points and to prevent the system from failing in the case of faults. In safety critical systems, e.g. drive-by-wire systems, failures must be prevented, which is why redundant components or functionality are added.

Our research group focuses on the development of large embedded systems with weak safety requirements, like building automation systems. Preventing all possible failures is not necessary in these systems and would, of course, be too expensive. Therefore, our approach follows the idea of graceful degradation. We "allow" certain failures; however, they must be controllable in order to obtain guaranteed and predictable gradations of the system functionality. These gradations can be seen as different failure modes of the system.

We use an object-oriented approach to model the system. Figure 1 shows the simplified data model used in our development process as a UML class diagram [1]. The system behavior is realized by tasks. Tasks are related to other tasks by the information they interchange. Each task is realized in a system component, which can aggregate other subcomponents, creating a structural hierarchy. One appropriate requirements analysis process producing data which can be mapped to that simplified data model is, for example, the requirements engineering process for large distributed reactive systems developed in our research group [2]. It has to be mentioned that the real data model is more complex than shown in figure 1, but it can be transformed to the simplified model.

To realize a controllable and predictable gradation of the system functionality, it is necessary to analyze which faults affect the system behavior. A fault causes a gradation of functionality, switching the system to a specific failure mode. Depending on that mode, a succeeding fault causes another failure mode. Therefore, the analysis must be able to handle successive faults and their chronological order.

Furthermore, it is important to analyze in detail how faults affect the system.

In section 2, we give an overview of related failure analysis and verification techniques. Then, we describe our analysis approach in sections 3 and 4.

2 Related Work

There are many techniques and methods to analyze and to verify systems, so this section can only give an incomplete overview. The techniques and methods can be divided into two groups: on the one hand, there are formal verification techniques like model checking and Markov chains; on the other hand, there are semi-formal methods like fault tree analysis and failure mode and effects analysis.

Fault Tree Analysis (FTA) is a top-down approach: It starts with system failure situations which must be avoided, and analyzes how these failures can be caused by faults of subsystems or system components. The faults are combined by boolean expressions like and, or, not, etc., producing boolean equations as the result of FTA [3]. Usually, FTA does not consider different modes of the system except success and failure, and it considers neither multiple faults nor their chronological order.

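Fig. 1. Simplified data model (UML class diagram with the classes System, Component, and Task; associations: a task sends information to other tasks, a task is realized in a component, a component aggregates subcomponents, and the system comprises its system components and tasks).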


Failure Mode and Effects Analysis (FMEA) is another well known analysis technique. It starts with a possible fault of a component and analyzes bottom-up how higher-level functionality and the system's functionality are affected. The results of FMEA are listed informally in large tables describing, for example, the component failure, the effects on the system, the criticality of the failure, etc. Usually, FMEA oversimplifies a system into two modes, success and failure, and does not consider different modes which could represent gradations of system functionality and performance [4]. Multiple faults are usually not considered, either [5]. PRICE AND TAYLOR extended FMEA to analyze and report the most likely multiple simultaneous failure combinations, but they do not handle the chronological order of faults [5]. YANG AND KAPUR introduced a customer driven reliability to FMEA as a quality over time. They consider different performance levels of a product which are degraded over time [4].

Model Checking is a formal verification technique used to automatically check systems which have a finite state space. A system is modelled as a finite automaton and elementary propositions are mapped to each state, which are fulfilled in these states. Then, this model of the system is used to automatically prove whether or not the specification is satisfied. Such a specification is given as a set of properties, usually expressed in temporal logic [6].

Markov Chain Analysis is a technique to examine stochastic systems. The system is modelled with a finite set of states and transitions between the states. Each transition from state s_i to state s_j is labeled with p_ij, expressing the probability that the transition will be executed in the next step. This model is called a (discrete-time) Markov chain, which is interpreted at discrete time steps. Continuous-time Markov chains are interpreted over continuous time. Their transitions are labeled with the rate of the exponential probability distribution. It should be noted that Markov chains are memoryless: The probabilities of the transitions depend only on the current state. Neither the previous states nor the time the system has spent in the current state influence these probabilities. Markov chains can be analyzed using numerical or analytical solutions. For example, the evolution of the model up to a given point in time or the long-run average behavior of the system can be studied [7].
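As a minimal illustration of the discrete-time case, the following sketch evolves a state distribution step by step. The three states (ok, degraded, failed) and all transition probabilities are hypothetical values chosen for the example, not data from any cited work.

```python
def step(distribution, P):
    """One discrete time step: new_j = sum_i distribution_i * p_ij."""
    n = len(P)
    return [sum(distribution[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical transition matrix; row i holds the probabilities p_ij.
# States: 0 = ok, 1 = degraded, 2 = failed (absorbing).
P = [
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]

dist = [1.0, 0.0, 0.0]  # the system starts in state "ok"
for _ in range(10):
    # Memoryless: the next distribution depends only on the current one.
    dist = step(dist, P)

print("distribution after 10 steps:", dist)
```

The loop corresponds to studying the evolution of the model up to a given point in time; the long-run behavior would be obtained by iterating until the distribution no longer changes.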

Adding elementary propositions to the states enables the usage of Model Checking with Markov chains. For example, the Erlangen-Twente Markov Chain Model Checker (E MC²) is a Model Checker for discrete-time and continuous-time Markov chains [8].

3 Failure Behavior Analysis

3.1 Motivation

Usually, FTA and FMEA only consider a fault as existing or non-existing. Gradations of errors are not regarded. This is, however, only a very coarse approximation of the system behavior in the case of a failure. A major disadvantage of FTA and FMEA is that they are completely static, i.e. as soon as an output error depends on the current characteristics of an input error, FTA and FMEA cannot be used. The following example illustrates this.

Let us regard three components used to control a radiator temperature. The first component is a temperature sensor. The sensor value is smoothed in a special moving average filter. Finally, the smoothed value is sent to the third component, a hysteresis switch which opens or closes the radiator valve, respectively. A new temperature value is sampled every 20 seconds and the moving average filter uses the last 6 values to calculate a new mean temperature. Due to the special implementation of the moving average filter, the hysteresis switch gets a new value every 2 minutes.

FTA examines how the hysteresis switch can be influenced. As it uses the value of the moving average filter as input, it is of course influenced by a fault of the latter. In turn, the output value of the moving average filter depends on the value of the temperature sensor. In consequence, it is concluded that a fault of the temperature sensor causes a failure in the hysteresis switch. A similar result would be obtained by an FMEA.

Now, let us assume that a valve movement temporarily disturbs the temperature sensor and causes a relative error of 10%. Ten seconds are needed to open or close the valve. That means, for a period of 10 seconds the temperature value might have a relative error of 10%. Now, it must be examined whether or not it is necessary to use a better and thus more expensive temperature sensor. For that, FTA as well as FMEA are of no help: they have only shown that the hysteresis switch depends on the temperature sensor. Whether and how a relative error of 10% of the temperature value that persists for 10 seconds influences the hysteresis switch cannot be determined with them.

Although it would be possible to manually calculate the influence on the hysteresis switch in that simple example, this is not possible if the interdependencies of large (distributed) systems must be considered. Therefore, a failure behavior analysis is required that is capable of treating the dynamic dependencies of the output values on the error of the input values as well as the timing aspects.

The treatment of dynamic dependencies poses two major problems. First, it must be possible to describe faults and errors, thus to use more gradations than only valid and invalid. Second, it is necessary to describe the dependency of an output value of a component on the error characteristics of a current input value. For the description of faults and the definition of the dependencies it must be possible to consider timing aspects. These additional features are provided by the failure behavior analysis we will introduce in this paper. In comparison with FTA and FMEA, our failure behavior analysis enables the analyst to obtain much more sophisticated information about the failure behavior of a large and possibly distributed system.

3.2 Overview of the Analysis

The advanced analysis of the system behavior in the case of a failure shall deliver sophisticated and graded analysis results. We do not only want to know which components are influenced by certain errors, but also how and by which kind of errors they are influenced.


We obtain such analysis results by assigning a quality description to all the information transformed and transported in a system. Instead of regarding the pure information flow in a system, which characterizes the functional behavior, we regard the flow of the information qualities, which also defines the system behavior in the case of a failure. In the following, this will be explained in more detail.

As it has been mentioned earlier, we assume that the overall functionality of a system is decomposed into various tasks, whereby the relations between interdependent tasks are known. The system functionality is defined by the collaboration of the single tasks. The stand-alone functionality of a single task is usually quite simple. Due to the communication between tasks, however, their functionalities are composed to an arbitrarily complex, superordinated system functionality. That means, at the lowest level the system functionality is defined by task functionalities defining the possibly state dependent transformation of input information to output information. To obtain the superordinated functionality, the communication between tasks, i.e. the transportation of information, must be regarded.

We want to apply the same concept to the failure behavior analysis. Instead of regarding the transformation and transportation of information, we are interested in the quality of the information. For that, we will first introduce a construct that enables us to describe such a quality. Then, we will define how these qualities are transformed by a single task. We describe this with petri nets. In the last step, we will replace the transportation of information by the transportation of their qualities in the task collaboration network. For that, we use hierarchical petri nets.

The system developer needs only to define the mapping of the information qualities at the task level. The superordinated petri net defining the collaboration of the single tasks can be generated, automatically. This leads to a high scalability of our approach.

In section 3.3 we will give a precise definition of the term information. After that, we will introduce the concept of the information quality in section 3.4. Finally, in section 3.5 we will show how petri nets can be used to define the information quality flow.

3.3 Information

Any kind of data that enters or leaves the system, or is moved or stored in the system is called information. This is the only meaning of the term information in this paper, independent of any other existing definitions.

The information concept is illustrated in figure 2. A sensor value enters the system and is stored in attribute A1 of component A. This information is moved to component B, whereby information can be moved using method calls or by sending signals. Then, it is sent to an actuator.

Although the sensor value has not been stored in the system, so far, it is called information since it is data that enters the system. The same applies to data sent to an actuator.

3.4 Information Quality

We assign a quality attribute to each information. For example, the validity of an information or its relative error could be used. We do this by using a class representing the information quality. This class may have an arbitrary number of attributes describing the quality of an information. Of course, all capabilities of the flexible object-oriented approach can be utilized. A UML representation of the information quality class is shown in figure 3.

The attribute validity defines whether or not the information is valid, at all.

The attribute relativeError represents the relative error of a continuous information, e.g. a temperature value. As we will see in the application examples, in most cases it is sufficient only to have a relative error and no absolute error.

As our approach ought to be applicable at various stages of the development process, we do not assume an executable specification of the functional behavior. Therefore, it is not always possible to examine the propagation of absolute errors of continuous values.

However, it is possible and even necessary to regard the absolute values of discrete information. Therefore, the attributes nominalValue and currentValue are used. If, for example, the on/off-switch of a cruise control is defective, it makes a crucial difference whether the control should be switched on or off. Therefore, considering the nominal value is necessary to describe how a component is influenced. If we regard a component that de-/activates an alarm system which has the possible states "deactivated", "activated", and "alarmOn", then besides the nominal value (e.g. the alarm system should be deactivated) the current value must be given (e.g. due to a failure the alarm system might be activated).

The attributes MTTC (Maximum Time To Correction) and MTTO (Minimal Time To Occurrence) are necessary to describe and to examine transient errors. A software or hardware fault leads to an error only if the affected component is used. If, for example, a faulty sensor value is corrected before it has been read, it will not result in an error. Therefore, it is obviously necessary to regard the time that is required at maximum to correct a fault. This time is represented by the attribute MTTC. Furthermore, it must be defined how long a value remains correct at minimum. This time is represented by the attribute MTTO.

Fig. 2. Sample information flow (Sensor, Component A with Attribute A1, Component B with Attribute B1, Actuator).

Fig. 3. Information quality class (each Information is associated with an InformationQuality that has the attributes validity, relativeError, MTTC, MTTO, nominalValue, currentValue, and propertyList).

We want to use the following example to illustrate the meaning and the usage of these attributes. Let us regard a micro controller reading the values of a sensor. We assume that a watchdog timer is used to reinitialize the micro controller if it has stalled, so that in the case of a failure the micro controller is guaranteed to be working again after at most t₁ seconds, i.e. the MTTC is t₁ seconds.

Further, we assume that the micro controller has been tested intensively, so that it can be guaranteed that it takes at least t₂ seconds before a failure occurs after the micro controller has been reinitialized, i.e. the MTTO is t₂ seconds. In addition, our analysis can also be used to determine the required values of t₁ and t₂ as constraints at a very early stage of the development process.

The last attribute we want to provide is the propertyList.

All characteristics of an information quality that are not required to determine whether and how a component is influenced, can be represented by a property. For example, it might be possible to use a textual representation to describe the influence of a component failure on the energy consumption of a system. Therefore, in each component traversed by an information quality due to the error propagation, a new property can be attached to the traversing information quality. Although the influences informally described in these properties are not used to determine the failure behavior of depending components, at the end of the analysis the analyst obtains a detailed textual description of all relevant influences on arbitrary concerns, i.e. a kind of report is generated automatically.

Due to the usage of an information quality class, our approach is very flexible and extendable, as it is possible to modify the existing or to add new attributes and methods to the class to adapt the concept to various domains and kinds of application.
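A minimal sketch of such an information quality class is given below. The attribute names follow figure 3; the Python naming, the types, the default values, and the helper method add_property are assumptions made for illustration and are not prescribed by the approach.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InformationQuality:
    validity: bool = True                 # is the information valid at all?
    relative_error: float = 0.0           # relative error of a continuous value
    mttc: float = 0.0                     # Maximum Time To Correction, in seconds
    mtto: float = float("inf")            # Minimal Time To Occurrence, in seconds
    nominal_value: Optional[str] = None   # intended value of a discrete information
    current_value: Optional[str] = None   # actual value of a discrete information
    property_list: List[str] = field(default_factory=list)

    def add_property(self, text: str) -> None:
        """Attach an informal note while the quality traverses a component."""
        self.property_list.append(text)

# Alarm system example: it should be deactivated but is activated due to a failure.
alarm = InformationQuality(validity=False, nominal_value="deactivated",
                           current_value="activated", mttc=5.0)
alarm.add_property("alarm control: system armed although it should be off")
print(alarm)
```

Domain-specific attributes could be added by subclassing, which mirrors the extensibility argument made above.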

3.5 Information Quality Flow

Errors can be described with information qualities. Now, it is necessary to describe the propagation of errors. Therefore, we regard the information quality flow instead of the information flow of a system, as shown in figure 4.

A functional behavior specification defines the information flow, i.e. it specifies how an information is modified by a single task. Further, the system structure defines the interdependencies between tasks. A failure behavior specification basically uses the same flow structure; however, it defines how the quality of an output information of the task is influenced by the quality of the input information. We use hierarchical reference nets [9] to describe this aspect.

Reference nets are object-oriented petri nets which support predicates. These nets enable us to have tokens represent the information qualities and we can assign conditions to transitions.

Modelling the failure behavior of a single task

Depending on the qualities of the input information, a task is set to a certain failure mode. This dependency is described in a so-called guard condition that is assigned to the failure mode. If a guard condition of a failure mode is true, the task is set to that failure mode. In consequence, the guard conditions of all failure modes of a task must be mutually exclusive. Each possible failure mode is represented by a place FM j in the petri net modelling the task. The corresponding guard condition is assigned to the transition leading to that place. The qualities of the output information are defined depending on the current failure mode and the input information qualities of a task. This is realized by assigning an action to a transition leading to the task output place, whereby the necessary input information qualities are available at the transition in the form of tokens moved from the input places to the output places. The resulting general structure of a task petri net is shown in figure 5.

When the petri net is to be evaluated, the tokens referencing information quality objects are moved to the input places, whereby a separate place exists for each input information quality. According to the guard conditions, the tokens are moved to the place representing the appropriate failure mode (FM j). In the following transition, new tokens representing the qualities of each output information are created. For each failure mode a different dependency of the output qualities on the input qualities may be defined.

In section 4 this concept will be shown in an example.
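The guard and action structure of a single task can be approximated in ordinary code as follows. This is a simplified sketch without petri net semantics: the class names FailureMode and TaskModel are hypothetical, qualities are plain dictionaries, and the example task (one input b, one output c, validity only) is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Quality = Dict[str, object]      # attribute name -> value
Qualities = Dict[str, Quality]   # information name -> quality

@dataclass
class FailureMode:
    name: str
    guard: Callable[[Qualities], bool]        # selects this failure mode
    action: Callable[[Qualities], Qualities]  # maps input to output qualities

class TaskModel:
    def __init__(self, name: str, modes: List[FailureMode]):
        self.name = name
        self.modes = modes

    def evaluate(self, inputs: Qualities) -> Tuple[str, Qualities]:
        enabled = [m for m in self.modes if m.guard(inputs)]
        # Guards must be mutually exclusive, so exactly one mode may be enabled.
        if len(enabled) != 1:
            raise ValueError(f"{self.name}: guards must select exactly one failure mode")
        mode = enabled[0]
        return mode.name, mode.action(inputs)

def copy_validity(q: Qualities) -> Qualities:
    return {"c": {"valid": q["b"]["valid"]}}

task_b = TaskModel("Task B", [
    FailureMode("Normal", lambda q: bool(q["b"]["valid"]), copy_validity),
    FailureMode("Failure", lambda q: not q["b"]["valid"], copy_validity),
])

print(task_b.evaluate({"b": {"valid": False}}))
```

Raising an error when zero or several guards hold makes a violation of the mutual exclusion requirement visible instead of silently picking one mode.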

Chronological order of faults

One disadvantage of the data flow principle of petri nets is that the token representing the current failure mode is removed from the failure mode place FM j right away, because the action transitions are enabled immediately after the corresponding failure mode place has been marked. We therefore create an additional token that is tagged with the name of the current failure mode and move it to an additional place called currentFailureMode. This way, we can preserve the current failure mode. In consequence, the current failure mode can also be considered in the guard conditions. That means, the chronological order of faults and all preceding events can influence the failure behavior.

Fig. 4. Sample information quality flow (Sensor, Component A with Attribute A1, Component B with Attribute B1, Actuator; the transported objects are InformationQuality instances with Validity and PropertyList).

Fig. 5. General petri net structure for the definition of the information quality mapping within a single task (input places Input 1 ... Input m; guard transitions Guard 1 ... Guard n leading to the failure mode places FM 1 ... FM n; action transitions Action 1 ... Action n leading to the output places Output 1 ... Output o).

Although the current failure mode is preserved, it is not possible to figure out the failure history after an analysis. We therefore further extend the petri net by a history place (fig. 6) to log this history. The history place can contain an arbitrary number of tokens. Each of these tokens is tagged with the name of the corresponding failure mode and a time stamp (NOW). As every change of the failure mode is represented by a token on the history place, the complete failure behavior history of the task during the analysis is logged.
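The interplay of the currentFailureMode place and the history place can be mimicked by a small state holder. The sketch below is an illustration only: the class FailureModeTracker, the mode names Degraded and Failed, and the transition rule in next_mode are hypothetical and merely show a guard that depends on the current failure mode.

```python
class FailureModeTracker:
    """Keeps the current failure mode and logs every change with a time stamp."""

    def __init__(self):
        self.current = None   # plays the role of the currentFailureMode place
        self.history = []     # plays the role of the history place

    def enter(self, mode, now):
        # Only changes of the failure mode are logged.
        if mode != self.current:
            self.current = mode
            self.history.append((mode, now))

def next_mode(current, fault_event):
    """Hypothetical guards that depend on the current failure mode."""
    if not fault_event:
        return current
    return {"Normal": "Degraded", "Degraded": "Failed"}.get(current, current)

tracker = FailureModeTracker()
tracker.enter("Normal", now=0)
for t, fault_event in enumerate([False, True, False, True], start=1):
    tracker.enter(next_mode(tracker.current, fault_event), now=t)

print(tracker.current, tracker.history)
```

Because the second fault is evaluated against the mode reached by the first one, the chronological order of the faults determines the final mode, and the history list records when each change happened.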

Modelling the interdependencies of tasks

So far, we have shown how the transformation of the information qualities by a single task can be modelled. Now, it is necessary to describe the interconnection of the tasks with petri nets. Utilizing the hierarchy concept provided by petri nets, we can automatically generate a superordinated petri net representing the interdependencies of the tasks, as these interdependencies are given by the system structure (figure 1). The general principle of the hierarchy support of reference nets is illustrated in figure 7.

Transitions in the superordinated net are uniquely connected with transitions in the sub nets (in figure 7 these connections are illustrated with dashed arcs). As soon as transition 1a is enabled, transition 1b is fired, too. Token b is injected into the sub net. The token on the place Failure Modes in the superordinated net is a reference to the sub net. After the sub net has been executed, transition 2a is enabled and, in consequence, transition 2b is fired, too. Token c is then passed out to the superordinated net and is propagated to task C.
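The idea of deriving the superordinated level from the system structure can be illustrated without a petri net simulator. The following sketch assumes an acyclic task structure and evaluates the tasks in dependency order; the function run_collaboration, the dictionary layout, and the validity-only task model are assumptions for illustration, and feedback loops or timing would still require the net simulation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_collaboration(task_models, predecessors, external_inputs):
    """Propagate information qualities along the task structure.

    task_models: task name -> function(list of input qualities) -> (mode, output quality)
    predecessors: task name -> list of tasks whose output it receives
    external_inputs: task name -> list of qualities entering from outside
    """
    modes, outputs = {}, {}
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [outputs[p] for p in predecessors.get(name, ())]
        inputs += external_inputs.get(name, [])
        modes[name], outputs[name] = task_models[name](inputs)
    return modes, outputs

def validity_task(inputs):
    # Same mapping as in figure 7: the output is valid only if the inputs are valid.
    valid = all(q["valid"] for q in inputs)
    return ("Normal" if valid else "Failure"), {"valid": valid}

predecessors = {"Task A": [], "Task B": ["Task A"], "Task C": ["Task B"]}
models = {name: validity_task for name in predecessors}
modes, outputs = run_collaboration(models, predecessors,
                                   {"Task A": [{"valid": False}]})
print(modes)
```

Only the per-task mapping is written by hand here; the evaluation order follows from the structure alone, which is the property that makes automatic generation of the superordinated net possible.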

4 Application Example

In the following, an example will be introduced that demonstrates the application of the analysis. Thereby, the main aspect is to show that the concepts introduced in the preceding sections are sufficient to obtain sophisticated analysis results.

In the example, we will regard a pressure control. The focus lies on data processing. Especially, the usage of the relative error attribute will be shown.

4.1 Pressure Control

Obviously, it is possible to describe the propagation of a relative error through data processing components completely mathematically. However, relative errors seem to be problematic if absolute values must be considered. Therefore, data processing components whose error propagation behavior can be described mathematically are combined with a failure behavior specification that uses absolute values in the guards. Thus, the example is used to demonstrate that it is sufficient to use the relative error, although absolute values are required to decide whether a guard is fulfilled. Especially if the MTTC is considered as well, a wide range of failure cases can be examined and sophisticated results can be obtained.

4.1.1 Requirements. The main task of the control system is to control the pressure of an oxygen tank that belongs to a chemical plant. The target pressure is 500 bar. A pressure between 480 and 520 bar is optimal. If the pressure is between 470 and 480 bar or 520 and 530 bar, respectively, the efficiency of the depending machines is decreased. As soon as the pressure gets higher than 530 bar the situation becomes dangerous. If the pressure is lower than 470 bar the machines are stalled. This is also illustrated in figure 8.
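These requirements can be written down as a small classification function. The function name and the handling of the exact boundary values (for example, whether 480 bar still counts as optimal) are assumptions, since the requirements do not state them.

```python
def plant_state(pressure_bar):
    """Classify the tank pressure according to the requirements of section 4.1.1."""
    if 480 <= pressure_bar <= 520:
        return "optimal operation"
    if 470 <= pressure_bar <= 530:
        return "reduced efficiency"
    # Outside 470..530 bar: too low stalls the machines, too high is dangerous.
    return "machine stall" if pressure_bar < 470 else "danger"

for pressure in (500, 475, 525, 460, 540):
    print(pressure, plant_state(pressure))
```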

Fig. 6. All changes of the current failure mode during the analysis are logged in a history place (each guard transition Guard 1 ... Guard n adds a token ["FM i", NOW] to the History place).

Fig. 7. General principle of hierarchy in reference nets (superordinated net: Task A, Task B with the place Failure Modes, Task C, and the transitions 1a and 2a; sub net: transitions 1b and 2b, failure mode Normal with guard b.valid, failure mode Failure with guard !b.valid, and action c.valid=b.valid).

Fig. 8. Requirements on tank pressure.

Pressure/bar          Plant state
480-520               optimal operation
470-480 or 520-530    reduced efficiency / should be avoided
<470 or >530          machine stall, danger / must be avoided

4.1.2 System Design. The chosen structure of the pressure control system is illustrated in figure 9. Basically, three pressure sensors at different places in the pressure tank are available. An average filter calculates the actual overall pressure. In order to suppress temporal fluctuations, a moving average filter is additionally applied in the next step.

After that, a hysteresis switch decides when the valve has to be opened or closed.

The formulae defining the digital filters are shown in figure 10. The average filter weights all of the three pressure values equally to calculate the actual overall pressure value. The moving average filter uses the last five actual pressure values to calculate the smoothed average pressure value, whereby again all values are weighted equally.

The hysteresis loop of the switch is shown in figure 11.

Although a pressure between 480 and 520 bar is optimal, the threshold values of the hysteresis have been set to 490 and 510 bar in order to have a 10 bar reserve at both ends.

As the target pressure is 500 bar, a tolerance interval of +/- 10 bar has been established, before the valve is opened or closed, respectively.
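The behavior of the hysteresis switch can be sketched as follows. The class name, the initial valve state, and the convention that the valve is opened above the upper threshold and closed below the lower threshold are assumptions; the latter is consistent with the statement that the valve must be opened when the pressure becomes too high.

```python
class HysteresisSwitch:
    """Two-point switch with thresholds at 490 and 510 bar (assumed polarity)."""

    def __init__(self, lower=490.0, upper=510.0, valve_open=False):
        self.lower = lower
        self.upper = upper
        self.valve_open = valve_open

    def update(self, smoothed_pressure):
        if smoothed_pressure > self.upper:
            self.valve_open = True    # pressure too high: open the valve
        elif smoothed_pressure < self.lower:
            self.valve_open = False   # pressure too low: close the valve
        # Between the thresholds the previous command is kept.
        return "open" if self.valve_open else "close"

switch = HysteresisSwitch()
commands = [switch.update(ap) for ap in (500, 511, 505, 489, 500)]
print(commands)
```

The value 505 yields "open" after 511 but 500 yields "close" after 489, which is the memory effect of the hysteresis loop.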

4.1.3 Failure Behavior. In the next step, the failure behavior must be defined. Therefore, the failure modes of the three components, which can be considered as tasks of the system, are specified. As explained earlier, we use extended petri nets to describe the failure behavior.

Average Filter

The failure behavior model of the average filter is shown in figure 12. The general structure illustrated in figure 5 has been used to model the failure behavior of that task. The qualities of the three pressure values p1, p2, and p3 are considered as input. If any of these values has a relative error different from zero, the task is set to the failure mode Error; otherwise the task remains in the mode Normal. If the task is in the normal mode, the output value has no error; therefore, the error attribute is set to 0 (the setting of the attribute validity has been neglected to keep the example simple). In the failure mode Error, the relative error of the output pressure is calculated using the relative errors of the input values, as specified by the formula shown in figure 13. The current failure mode is represented by an additional token on the place currentFailureMode, as mentioned earlier. The transitions outside the box are required as interface and will be connected to transitions in the superordinated net, as explained in section 3.5.
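Expressed as plain code, the guards and actions of the average filter look as follows. The function name and the representation of the result as a (failure mode, relative error) pair are assumptions for illustration; the petri net itself is not reproduced.

```python
def average_filter_quality(e1, e2, e3):
    """Failure mode and output error of the average filter.

    e1, e2, e3 are the relative errors of the pressure values p1, p2, p3.
    """
    errors = (e1, e2, e3)
    if any(e > 0 for e in errors):          # guard of the failure mode Error
        return "Error", sum(errors) / 3     # mean of the input errors
    return "Normal", 0.0                    # guard of the mode Normal

print(average_filter_quality(0.03, 0.0, 0.0))
print(average_filter_quality(0.0, 0.0, 0.0))
```

A single faulty sensor with a relative error of 3% therefore contributes about 1% to the relative error of the averaged pressure.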

Moving Average Filter

The moving average filter uses the last five values of the average filter, whereby a new value is sampled every 2 milliseconds. In general, the error propagation can be specified depending on the number of regarded samples N and the sample period T, as shown in figure 14. The quotient of the MTTC over the period T defines how many samples can be influenced by a fault. All other samples have no error (it is assumed that MTTO ≫ N·T). Therefore, the ratio between the number of influenced samples and the number of all samples defines the weight w of the relative error. The corresponding petri net for the filter is shown in figure 15. Again, the failure modes Normal and Error are defined. The normal mode is valid when the input is correct; otherwise, the task is set to the error mode. In the normal mode, again, the error attribute of the output quality is set to zero. In the error mode, the relative error of the output value is calculated according to the formula shown in figure 14. To express a persistent error with the MTTC, the latter is set to the maximal possible integer value (MAXINT). If an error is persistent, all samples are faulty; therefore, ap must have the same relative error as p. This requirement is met if the MTTC is set to MAXINT, as in that case the weight w always evaluates to 1, i.e. the error of p is assigned to the error attribute of ap.

Fig. 9. Structure of pressure control system (Pressure Sensor 1, Pressure Sensor 2, and Pressure Sensor 3 feed the Average Filter, followed by the Moving Average Filter, the Hysteresis Switch, and the Valve).

Fig. 10. Formulae describing the digital filters.

p_i[k]: pressure i at time k
p[k]: average pressure at time k
ap[k]: smoothed average pressure at time k

\[ p[k] = \frac{1}{3}\left(p_1[k] + p_2[k] + p_3[k]\right) \quad \text{(average filter)} \]

\[ ap[k] = \frac{1}{5}\sum_{i=0}^{4} p[k-i] \quad \text{(moving average filter)} \]

Fig. 11. Hysteresis loop used for the hysteresis switch (switch command open/close over the pressure in bar, with the thresholds 490 and 510 bar around 500 bar).

Fig. 12. Petri net specifying the failure behavior of the average filter (input places p1, p2, p3; interface transitions enter, exit, and getState; failure mode places Normal and Error; place currentFailureMode).

Guards:
g1 : p1.error>0 or p2.error>0 or p3.error>0 (failure mode Error)
g2 : p1.error==0 and p2.error==0 and p3.error==0 (failure mode Normal)

Actions:
a1 : p.error = 1/3 (p1.error + p2.error + p3.error)
a2 : p.error = 0

Fig. 13. Formula specifying error propagation of the average filter.

\[ p.\mathit{error} = \frac{1}{3}\left(p_1.\mathit{error} + p_2.\mathit{error} + p_3.\mathit{error}\right) \]

Fig. 14. Formula specifying error propagation of a moving average filter.

\[ w = \min\left(1,\ \frac{1}{N}\cdot\frac{p.\mathit{MTTC}}{T}\right), \qquad ap.\mathit{error} = w \cdot p.\mathit{error}, \qquad \text{assuming } \mathit{MTTO} \gg N \cdot T \]
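The error propagation of the moving average filter, including the persistent case, can be sketched in a few lines. The function name, the use of milliseconds as the common time unit, and the use of sys.maxsize as a stand-in for MAXINT are assumptions; the guard is taken to be a zero input error, following the description above.

```python
import sys

MAXINT = sys.maxsize  # stand-in for MAXINT, used to mark a persistent error

def moving_average_error(p_error, p_mttc, n_samples=5, period=2):
    """Failure mode and relative error of ap for a given error and MTTC of p.

    p_mttc and period must use the same time unit (milliseconds here).
    """
    if p_error == 0:
        return "Normal", 0.0
    # Share of the N samples that can be affected while the fault persists.
    w = min(1.0, (p_mttc / period) / n_samples)
    return "Error", w * p_error

print(moving_average_error(0.05, 4))        # transient fault, corrected within 4 ms
print(moving_average_error(0.05, MAXINT))   # persistent fault
```

With an MTTC of 4 ms, two of the five samples can be affected, so the weight is 0.4; with MAXINT the weight saturates at 1 and the error of p is passed on unchanged.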

Hysteresis Switch

Before we define the petri net for the hysteresis switch, we examine its failure behavior. The hysteresis loop is illustrated in figure 11 and the consequences of too low or too high a pressure are shown in figure 8. Besides the normal operation, wrong pressure values can result in a problematic or dangerous situation, respectively. Obviously, it is reasonable to define the three failure modes Normal, Problem, and Danger. Now, we must examine which errors result in which failure mode.

At first sight, it seems quite simple to define guards like p.currentValue > 520. However, our approach is applicable at early stages of the development process; therefore, we do not assume that absolute values or absolute errors are available. For this reason, we regard the threshold values and use the available relative errors to obtain the absolute values. This is sufficient, as it is only necessary to consider the worst case. We must regard both threshold values and, since the relative error does not express whether the faulty value is higher or lower than the actual value, both cases must be covered. As an example, we calculate the maximal relative error that is allowed for the upper threshold value while remaining in the normal mode.

The upper threshold value is 510 bar. Actually, it is only necessary to open the valve when the pressure is higher than 520 bar. If the faulty value of ap is lower than the actual pressure in the tank, it must nevertheless be ensured that the valve is opened at the latest when the pressure exceeds 520 bar. That means, if the pressure is 520 bar, the current, faulty value of ap must be at least 510 bar.

According to the upper formula shown in figure 16, the relative error must therefore be lower than or equal to 1.9%.

If the value of ap is higher than the actual pressure, the valve might be opened too early. We demand that the pressure must be at least 480 bar before the valve is opened. The corresponding maximal relative error is calculated using the lower formula of figure 16.
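The two bounds of figure 16 follow from solving each inequality for x. The short Python sketch below reproduces that arithmetic; the function names are hypothetical and only the values 480, 510, and 520 bar come from the text.

```python
def max_error_reading_low(limit, threshold):
    """Largest x with limit * (1 - x) >= threshold (faulty value below the actual one)."""
    return 1.0 - threshold / limit


def max_error_reading_high(limit, threshold):
    """Largest x with limit * (1 + x) <= threshold (faulty value above the actual one)."""
    return threshold / limit - 1.0


# Upper threshold of 510 bar: the valve must open by 520 bar and not before 480 bar.
low = max_error_reading_low(520.0, 510.0)    # about 0.0192, given as 1.9% in figure 16
high = max_error_reading_high(480.0, 510.0)  # 0.0625, i.e. 6.25%
```

Because the relative error carries no sign, the smaller of the two bounds is the one that limits the normal mode for this threshold.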

The remaining maximal relative errors can be calculated in the same way. The petri net representing the resulting failure behavior of the hysteresis switch is shown in figure 17. As the output of this task is a direct system output (the command for the valve), it is only necessary to outreach the failure mode to obtain the influence on the system behavior.

4.1.4 Interdependencies of tasks. So far, the failure behavior of three single tasks has been described. Now, it is necessary to combine the single petri nets in a superordinate petri net to obtain the system failure behavior.

The overall petri net specifying the failure behavior of the pressure control system is shown in figure 18. Mainly the system structure has been rebuilt. First, an instance of the petri net representing the average filter is created and three qualities of the pressure values can be injected, as it has been explained in section 3.5. The current failure mode and the quality of the average pressure value are requested from the sub net describing the average filter. The quality of the average pressure value is used as input for an instance of the petri net specifying the failure behavior of the moving average filter. The output of this sub net is its current failure mode and the quality of the smoothed average pressure value, which, in turn, is used as input for the hysteresis switch. As the output of the hysteresis switch is a direct system output, it is reasonable not to use an information quality as output, but only the current failure mode.
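The chaining described above can be mimicked with ordinary function composition. The following Python sketch is not the petri net itself. It only mirrors the guards and actions of figures 12, 15, and 17. The `Quality` class and all function names are hypothetical, and taking the longest MTTC of the inputs as the MTTC of the averaged value is an assumption, since the propagation of the MTTC through the average filter is not specified here.

```python
import math
from dataclasses import dataclass


@dataclass
class Quality:
    error: float  # relative error in percent
    mttc: float   # persistence of the error in milliseconds


def average_filter(p1, p2, p3):
    inputs = (p1, p2, p3)
    if all(q.error == 0 for q in inputs):
        return "Normal", Quality(0.0, 0.0)
    error = sum(q.error for q in inputs) / 3.0
    mttc = max(q.mttc for q in inputs)  # assumption: worst-case persistence
    return "Error", Quality(error, mttc)


def moving_average_filter(p, period=2.0, n_samples=5):
    if p.error == 0:
        return "Normal", Quality(0.0, 0.0)
    weight = min(1.0, math.floor(p.mttc / period) / n_samples)
    return "Error", Quality(weight * p.error, p.mttc)


def hysteresis_switch(ap):
    # direct system output: only the failure mode is returned
    if ap.error <= 1.9:
        return "Normal"
    if ap.error <= 3.8:
        return "Problem"
    return "Danger"


def pressure_control(p1, p2, p3):
    mode_avg, p = average_filter(p1, p2, p3)
    mode_mov, ap = moving_average_filter(p)
    return mode_avg, mode_mov, hysteresis_switch(ap)
```

Calling `pressure_control` with three `Quality` objects corresponds to placing three tokens on the input places and reading the three failure modes after the simulation step.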

An analysis can be started by simulating the petri net. Errors can be injected using the injection transitions. We create an information quality object, which is referenced by a token, and put this token on the place leading to the respective transition. These information quality tokens can either be defined and placed manually on the input places, or automatically by separate software. The injection and the propagation of the information qualities are then done automatically by a petri net simulator [10]. Placing the tokens manually has the disadvantage that new information quality tokens must be created and placed for each analysis run. Furthermore, it is not possible to change the injections during an analysis. However, it is one major advantage of our approach that the basic concepts allow various errors to be simulated and corrected during an analysis run; even the order of faults can be examined. For these reasons, it is reasonable to use separate software. Then, it is possible to define a complete analysis scenario in advance, and the software creates and places the corresponding information quality tokens in the appropriate order. Moreover, a user interface can be provided that enables the analyst to set new information qualities dynamically during the analysis.

Fig. 15. Petri net specifying the failure behavior of the moving average filter. (Input p; output ap; failure modes Normal and Error; interface transitions enter, exit, and getState, which returns CurrentFailureMode.)

Guards:
g1 : p.error>0
g2 : p.error==0

Actions:
a1 : ap.error = min(1, 1/5 * floor(p.MTTC / 2)) * p.error
a2 : ap.error = 0

Fig. 16. Maximal relative error for the upper threshold value.

Value of ap is lower than the actual pressure:

$$520\,(1 - x) \ge 510 \;\Rightarrow\; x \le 0.019 = 1.9\%$$

Value of ap is higher than the actual pressure:

$$480\,(1 + x) \le 510 \;\Rightarrow\; x \le 0.0625 = 6.25\%$$

Fig. 17. Petri net specifying the failure behavior of the hysteresis switch. (Input ap; interface transitions exit and getState, which returns currentFailureMode.)

Guards:
Normal : ap.error<=1.9
Problem : (ap.error>1.9) and (ap.error<=3.8)
Danger : ap.error>3.8

4.1.5 Analysis. For the purposes of this paper, it is not necessary to distinguish between manual and automatic creation and placing of information quality tokens. For that reason, in the following analyses the information quality tokens are assumed to be already placed.

First, we will examine the permanent fault of a single sensor. As a second scenario we will examine the influence of a transient disturbance on the system behavior.

Permanent Sensor Fault

The first analysis answers the question of whether it is necessary to provide redundant pressure sensors. Therefore, we regard the permanent fault of a single pressure sensor and examine whether the two remaining sensors are sufficient to maintain correct system behavior. In addition to the functionalities of the tasks introduced in the previous sections, we assume that an error detection is implemented which detects a missing sensor value. In that case, the average filter only uses the two remaining values. The error detection also considers a sensor defective if its value deviates by 15% from the values of the other two sensors. A detected failure of a single sensor has no effect on the system behavior. That is exactly what a simple analysis would yield, too. However, we want to examine the influence of a relative sensor error of 14%, as this error will not be recognized by the error detection.

In advance, a token representing an information quality object with the error attribute set to 14% was placed on the input place of p1. As the fault is considered to be permanent, the MTTC was set to MAXINT. The error attribute of the two remaining information qualities was set to 0. When the petri net simulation is started, the three information qualities are injected into an instance of the petri net defining the failure behavior of the average filter. After that petri net has been evaluated, the token representing the information quality of p is outreached in addition to the current failure mode. The relative error of p is 4.66%, as the relative error of 14% has been reduced by the average filter. The token representing the information quality of p is then injected into the petri net of the moving average filter. As the MTTC is assumed to be infinite, the moving average filter cannot reduce the error. For that reason, the error attribute of the outreached information quality of ap is still 4.66%. This quality is then injected into the sub net of the hysteresis switch. As the relative error of ap is higher than 3.8%, a dangerous situation arises. The analysis steps are illustrated in figure 19. Obviously, the error detection is not sufficient. Even if a deviation of only 6% instead of 15% were allowed before an error is detected, a problematic situation would be the consequence. Although we admit that it is possible to calculate a correct deviation rate for that simple example manually, this is impossible for large, complex, and distributed systems. Our analysis, in contrast, is intended for and particularly suited to the analysis of such systems.
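For this small example the manual calculation mentioned above fits in one function. The Python sketch below is a simplification under stated assumptions: exactly one of three sensors is permanently off by an undetected deviation, the other two are error-free, and the limits 1.9% and 3.8% are taken from the guards in figure 17. The function name is hypothetical.

```python
NORMAL_LIMIT = 1.9   # percent, guard of the Normal mode in figure 17
PROBLEM_LIMIT = 3.8  # percent, guard of the Problem mode in figure 17


def mode_for_permanent_fault(deviation):
    """Failure mode when one of three sensors is permanently off by `deviation` percent."""
    # The average filter divides the single error by three; a persistent error
    # (MTTC = MAXINT) gets weight 1 in the moving average filter and is not reduced.
    ap_error = deviation / 3.0
    if ap_error <= NORMAL_LIMIT:
        return "Normal"
    if ap_error <= PROBLEM_LIMIT:
        return "Problem"
    return "Danger"
```

Under these assumptions the normal mode is kept only while the undetected deviation stays at or below 3 × 1.9% = 5.7%, which is why a detection limit of 6% still leads to a problematic situation.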

Transient Sensor Fault

Now, we want to consider the following scenario: The sensors can be influenced by electromagnetic radiation. We assume that the electromagnetic radiation does not persist longer than 4 milliseconds in the tank. Measurements showed that in the case of electromagnetic radiation, the value of sensor p1 has a relative error of 15%, the value of p2 has a relative error of 6%, and sensor p3 is not influenced. Now, we want to examine whether it is necessary to improve the sensors or to suppress the radiation.

The corresponding analysis is illustrated in figure 20. Two information qualities for the values of p1 and p2 are created with the corresponding relative errors. The MTTC is set to 4 milliseconds. The MTTO is neglected, as we assume that the frequency of occurrence of the disturbance is very low. After the average filter, the actual overall pressure value p has a relative error of 7%. The disturbance persists for 4 milliseconds, i.e. in the worst case 2 of 5 samples have a relative error of 7%. In consequence, the relative error of the smoothed pressure value ap is reduced to 2.8%. That means the efficiency of the depending machines is reduced (see figure 17). It is obviously necessary to improve the behavior.

Fig. 18. Petri net defining system failure behavior of pressure control. The petri net can be analyzed by putting tokens, which reference the corresponding information qualities, on the input places. Qualities of all information in the system can be injected into arbitrary parts of the system. A petri net simulator handles the injection into the subnets and the propagation between depending nets. (Input places p1, p2, p3 with injection transitions; sub nets for the average filter, the moving average filter, and the hysteresis switch, connected via p and ap; each sub net outputs its failure mode.)

Fig. 19. Analysis of the influences of a permanent sensor fault.

Created tokens:
p1.valid=false; p1.error = 14%; p1.MTTC=MAXINT;
p2.valid=true;
p3.valid=true;

Results: p.error = 4.66%; ap.error = 4.66%; failure modes "Error" and, for the hysteresis switch, "Danger".

In the next scenario, we assume that the engineers of the plant propose two possible improvements. First, it is possible to reduce the maximal time of persistence of the radiation from 4 to 3.5 milliseconds. Second, it is possible to shield the sensors; in consequence, the relative errors could be reduced to 12% and 4.8%, respectively, i.e. a reduction of the disturbance by 20%.

If we use these values as input for the analysis, we obtain the result that reducing the duration of the disturbance to 3.5 milliseconds is sufficient to maintain the normal mode. A persistence of the radiation of less than 4 milliseconds means that at most one sample is influenced. For that reason, the moving average filter compensates for the relative error.

Shielding the sensors as assumed above, however, is not sufficient. Despite the reduced relative errors, the overall pressure still has a relative error of 5.6%. The moving average filter only reduces the error to 2.24%, which is not sufficient to remain in the normal mode.
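The three transient scenarios can be compared side by side with a compact Python sketch. It is an illustration only: the function names and the scenario labels are hypothetical, and the sampling period of 2 ms and the window of 5 samples are taken from action a1 in figure 15.

```python
import math


def hysteresis_mode(ap_error):
    """Failure mode of the hysteresis switch, using the limits from figure 17."""
    if ap_error <= 1.9:
        return "Normal"
    if ap_error <= 3.8:
        return "Problem"
    return "Danger"


def analyse(sensor_errors, mttc_ms, period_ms=2.0, n_samples=5):
    """Return (p.error, ap.error, failure mode) for relative sensor errors in percent."""
    p_error = sum(sensor_errors) / len(sensor_errors)  # average filter
    weight = min(1.0, math.floor(mttc_ms / period_ms) / n_samples)
    ap_error = weight * p_error                        # moving average filter
    return p_error, ap_error, hysteresis_mode(ap_error)


scenarios = {
    "measured disturbance": ([15.0, 6.0, 0.0], 4.0),
    "shorter disturbance": ([15.0, 6.0, 0.0], 3.5),
    "shielded sensors": ([12.0, 4.8, 0.0], 4.0),
}

for name, (errors, mttc) in scenarios.items():
    p_error, ap_error, mode = analyse(errors, mttc)
    print(f"{name}: p.error={p_error:.2f}% ap.error={ap_error:.2f}% -> {mode}")
```

Shortening the disturbance changes the number of affected samples from two to one, whereas shielding only scales the error by 0.8. That difference is why only the first measure brings ap.error below the 1.9% limit.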

5 Conclusion: Analyzing large distributed systems

In this paper, we introduced a flexible, scalable, and extendable approach for failure behavior analysis. In comparison to existing analyses, like FTA or FMEA, our analysis yields more sophisticated results which enable the analyst to understand the system behavior in the case of faults.

In the application example, we demonstrated the applicability of our analysis. However, we limited the complexity of the example to prevent going beyond the scope of a paper.

Our research focuses on large, distributed embedded systems. Therefore, our analysis has been developed for those systems. Mastering the complexity of those systems is one major problem. For that reason, the scalability of our approach is of crucial importance. A further essential aspect is the possibility to automatically generate major parts of the petri nets. For example, it is even possible to generate a simple failure behavior that sets the validity attribute of the output information qualities to false if any input quality is invalid. In this way, results similar to those obtained by FMEA are yielded automatically without any additional effort of the analyst. Although the generated petri nets define only a very coarse approximation of the actual failure behavior, it is, in contrast to a common FMEA, possible to examine the effects of fault combinations very easily or even automatically.

A further aspect that is important for the analysis of large systems is reuse. If tasks or components are reused in other projects, the petri nets defining their failure behavior can be reused, too.

When distributed systems are to be analyzed, our approach has two further advantages. First, the partitioning of the system is supported, as one can examine which tasks should be assigned to which partition so that a failure of one partition has the least influence on the overall system behavior. Second, the effects of missing or delayed information interchanged between system partitions can be analyzed. The delay of a piece of information can be assigned to its quality, and the effects on single tasks can be modelled explicitly with petri nets. The effect on the overall system behavior is obtained automatically, as the error propagation is provided by the petri net simulator.

6 References

[1] G. Booch, I. Jacobson, J. Rumbaugh, “The Unified Modelling Language User Guide”, Addison Wesley Longman, Reading, MA, 1999

[2] A. Metzger, S. Queins, “A Reuse- and Prototyping-based Approach for the Specification of Building Automation Systems”, OMER-2 Workshop, Hersching, Germany, 2001

[3] IEC 61025 (1990-10), “Fault Tree Analysis”, International Electrotechnical Commission, Geneva, Switzerland, 1990

[4] K. Yang, C. K. Kapur, “Customer Driven Reliability: Integration Of QFD And Robust Design”, Proceedings IEEE Annual Reliability and Maintainability Symposium, 1997

[5] C. J. Price, N. S. Taylor, “FMEA For Multiple Failures”, Proceedings IEEE Annual Reliability and Maintainability Symposium, 1998

[6] B. Berard, M. Bidoit, A. Finkel, F. Laroussinie, A. Petit, L. Petrucci, Ph. Schnoebelen, P. McKenzie, “Systems and Software Verification”, Springer Verlag, Berlin, 2001

[7] H. Hermanns, “Construction and Verification of Performance and Reliability Models”, in Bulletin of the European Association for Theoretical Computer Science (EATCS), 2001

[8] H. Hermanns, J.P. Katoen, J. Meyer-Kayser and M. Siegle, “A Markov chain model checker”, Proceedings of Sixth International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Springer Verlag, Berlin, 2001

[9] Olaf Kummer, “Simulating Synchronous Channels and Net Instances”, 5. Workshop on Algorithms and Tools for Petri Nets, 1998

[10] Olaf Kummer, Frank Wienberg, “RENEW - The Reference Net Workshop”, Petri Net Newsletter, No. 56, 1999.

Fig. 20. Analysis of the influences of a transient sensor fault.

Created tokens:
p1.valid=false; p1.error = 15%; p1.MTTC=4;
p2.valid=false; p2.error = 6%; p2.MTTC=4;
p3.valid=true;

Results: p.error = 7%; ap.error = 2.8%; failure modes "Error", "Error" and, for the hysteresis switch, "Problem".