Autonomous. decentralisedsystems.

(1)

Redirectors

Roger Mateer and Yinong Chen Highly Dependable Systems Research Programme

Department of Computer Science University of the Witwatersrand, Johannesburg

South Africa

1999{08{27

Abstract

This paper introduces the dependable decentralised system architecture developed by the PHDS group. It describes the virtual redirector utilised by this architecture and how a decentralised network layer rewall application is implemented on it.

Dependability of the design is achieved through the use of fault-tolerant protocols. The virtual redirector is implemented by a dynamic hashing algorithm, which uses load information and the fault/working status of nodes in the system. The focus of this paper is the implementation of a testbed which is used to compare the performance and availability of the decentralised rewall against a monolithic rewall with equivalent functionality.

Keywords:

dependable computing, internet, security, rewall, fault-tolerant protocol

Category:

research paper, student paper

4 The Experimental Testbed

14

4.1 Overview . . . 14

4.2 The Decentralised Firewall Node . . . 14

4.3 The Monolithic Firewall Node . . . 14

4.4 The Trac Generation Node . . . 15

4.5 The Packet Trace Recording Node . . . 16

4.6 The Logging Node . . . 16

5 Summary and Future Work

17

1 Introduction

With the rapidly increasing demand on computing power and communication speed of their computer systems, organisations are forced to upgrade their systems every few years. They can either replace their systems with more powerful ones (usually centralised systems) or expand their systems by adding more computers to their distributed systems. The advantage of using the centralised approach is the simplicity at the hardware and the system software levels. One can simply buy the latest most powerful system to replace the old one. The advantage of using a scalable decentralised system which allows gradual adding of components for upgrading is at the application level: it is often more cost-eective for existing applications not to have to be ported to an entirely new system, which is often a painful and error-prone practice. Besides performance upgrading, dependability is also a very important issue. Sophisticated fault-tolerant computing techniques can be implemented in a decentralised system to ensure correct operation of the system even if some components are faulty.

As organisations have become increasingly dependent on the correct operation of computer systems and their legacy applications, decentralised systems have become more and more widely used in industry.

Figure 1 shows the Connect Control component attached to Firewall-1 from Check Point Software Technologies [www.checkpoint.com], where a scalable number of computers are used to handle the internet trac in parallel. Load balance and fault-tolerance are implemented among these computers. Figure 2 shows the congurable trac servers from Inktomi that can be deployed in networks from a small point of presence (POP) to a data center to a backbone peering point [www.inktomi.com], depending on the number of redundant servers used. Both systems in gure 1 and gure 2 use the decentralised approach.

Figure 1: Redundant Connect Control Component

Hardware implemented fault-tolerant mechanisms can be deployed in a centralised computer sys-tem or in a node of a decentralised syssys-tem. Software implemented fault-tolerant mechanisms can be more conveniently inserted into a decentralised system: for example, a layer of fault-tolerant protocols can easily be added into its communication system. It has been shown that fault-tolerant protocols oer a cost-eective way to implement highly dependable computing in o-the-shelf decentralised systems [Che93, Che98, FNP+95, OSJ95, SLR98].

Fault-tolerant protocols are a layer of software that handles fault-detection and masking [Che93, FNP+95], redundant data consistency and agreement [Ech89], synchronisation among computers or

processes in a computer [OSJ95], reconguration of system resources when some components are detected to be faulty [Che98, SLR98], and so on.

In the last few years, the Programme for Highly Dependable Systems at the University of the Witwatersrand has designed an ethernet-based decentralised system (the PHDS system) to provide

(3)

Figure 2: Redundant Trac Server

dependable internet service through a virtual service redirector [Che98]. This paper mainly discusses the further development of a testbed and testing of the PHDS system against a monolithic system in terms of performance and availability.

In the next section, the architecture of the PHDS system and the virtual service redirector are introduced. Section 3 describes the architecture of the scalable network-layer packet ltering rewall to be used in forthcoming experimental research in the PHDS group. Section 4 outlines design issues surrounding the components of the experimental testbed to be used in this forthcoming research. Section 5 concludes the paper.

2 The PHDS Architecture

This section gives an overview of the autonomous decentralised PHDS system architecture and the virtual service redirector which is built on it.

2.1 Autonomous Decentralised Systems

The autonomous decentralised system is a relatively new system architecture which can be categorised as being between a distributed system and a computer network, as gure 3 illustrates.

A distributed system is a system with multiple autonomous computers managed by a coherent operating system. The main purpose is to increase the performance/cost ratio of the system. The multiple computers are transparent to users. A user uses a distributed system like a simplex computer. It is up to the operating system to select the best computer, nd and transport all the input les to that computer, and put the results in the appropriate place. A computer network also contains multiple autonomous computers. These computer are managed by independent and possibly dierent types of operating system. The main purpose of a network is for resource sharing. To use another computer, a user has to explicitly login to that computer by providing the address, a login name and a password. The user of a network also has to nd the locations of input les and decide where to put the results.

The main features of an autonomous decentralised system are [Che99, Iha98]

Autonomous controllability like a network.

{

Each node manages itself using an independent operating system.

{

Failure of a node won't stop the other nodes.

Autonomous coordination like a distributed system.

{

Nodes have similar hardware and software conguration.

{

Nodes work together to solve coherent and interactive problems.

(4)

coordination autonomy Distributed System with a coherent OS Autonomous Decentralised System Computer Network

Figure 3: Relationship between the three types of system

During the last six years, autonomous decentralised systems have advanced, along with the rapid development of other related technologies such as mobile agents, distributed systems and the internet, to provide solutions for control and information systems.

Many applications of autonomous decentralised systems are safety-critical. An example is the ATOS (Autonmous Decentralised Transport Operation System) developed by Hitachi and the Japanese Railway Company. It controls 17 train lines with 250 stations in the Tokyo metropolitan area which are utilised by 13 million passengers per day [KM99]. Safety, reliability, availability, fault-tolerance and real-time responses are primary design goals. Autonomous agent technology has been suggested as one possible solution to these [Cov99].

An autonomous agent is a system situated within part of an environment that senses that envi-ronment and acts on it, in pursuit of its own agenda and with the intention of aecting its future sensory perceptions. Autonomous agents can be divided into two dierent types, namely intelligent agents and mobile agents [Cov99], of which mobile agents could be directly applied in autonomous decentralised systems.

Mobile agent technology concentrates on providing concepts and mechanisms to migrate the agent's code and execution state to dierent parts of a system, where it could execute to achieve its coded objectives. Through the encapsulation of behaviour denitions (the semantics of the agent itself) within the migrating agent, a mobile agent can support the delegation of a large grain operation and help to

reduce the number of remote interactions in a decentralised environment;

enable dynamic and on-demand delegation/downloading of software and functionality only when

and where it is needed | achieving higher decentralization by bringing the client to the server resource; and

enhance the autonomy of messages exchanged between distributed software components.

Another project related to the PHDS virtual redirector investigates the use of mobile agent tech-niques to implement some of the system's tasks. For example, mobile agents may replace the current reconguration protocol. When a computer fails, the critical tasks on the computer have to be moved to another computer. Currently, we have all the tasks stored on all computers. The reconguration is implemented by sending current states of a task from a replica of the failed node to another node. The node then activates the same task with the received states. With the mobile agent technique, each node only store tasks allocated to it. When a node fails, a replica of the failed node can migrate the task and its current states to another node. Through a special code the receiving node can recognise that an incoming packet is a mobile agent. It can activate the packet as a task. The implementation of mobile agents in the PHDS system is beyond the scope of this paper.

2.2 The PHDS System for Internet Service Redirection

To improve dependability, many service providers have used redundant servers as backup spares in their systems. Current implementations of redundant servers are

User Conguration. The redundant servers with dierent IP addresses are all connected to

the network. The users have to put the IP addresses of the primary and backup servers in their client software conguration. When the primary server is not available, the client software looks for a backup server's IP address. The drawback of this implementation is that the service provider has to rely on users to setup their conguration correctly.

Manual Switch. The primary and backup servers have the same IP address. When the primary

server is down, a backup spare can be manually switched on to replace the primary one.

Hardware service redirector. The IP address of the server is in fact for the redirector that

redirects each client request to one of the servers available, as shown in gure 4. Their price is about 10 times as expensive as a typical server, which is not acceptable to many service

(5)

providers. Another drawback is that there is still a single point of failure: the failure of the redirector will result in the unavailability of the entire service.

Software service redirector (e.g., the ConnectControl in Firewall-1 from Check Point Software

Technologies, as shown in gure 1). The principle is same as the hardware redirector, except that it is implemented by software. It is running on a single machine and there is still a single point of failure in the system.

The aim of virtual director implemented on the PHDS system is to explore an alternative way to implement the service redirection that can overcome the problems that current techniques encounter. The objectives of the research are to achieve low cost and high availability and reliability by using a decentralised system with the following features

automatic fault detection, diagnosis, and system reconguration among the redundant servers, eliminating or reducing the probability of the single point of failure in the system,

exible conguration to cope with dierent kinds of applications and dierent levels of

depend-ability requirements, and

extensible performance and load balance.

Figure 4: Redundant System with a Physical Redirector

Figure 5: The PHDS System Overview

The PHDS system is shown in gure 5. A number of machines are connected to three networks, where one is used to connect to the outside computer network (the Internet), one is used to connect to the intranet, and one is used for the communication among the machines in the PHDS system.

Theoretically, we can have any number of replicated computers, depending on the level of fault-tolerance and parallelism required. We have three replicated computers and a number of fault-tolerant

(6)

protocols in our current experimental system, resulting in a system able to tolerate a single fault. Each replicated computer is organised in a hierarchy of layers.

At the application layer, a rewall is implemented which uses the virtual service redirector. The core of the virtual redirector is the hashing algorithm which distributes incoming requests (load) evenly among the available computers.

Fault-tolerant protocols are below the layer of the virtual redirector. The fault-tolerant protocols detect any possible faults in the system and inform the hashing algorithm that incoming requests should not be forwarded to nodes which have been found to be faulty.

Below the fault-tolerant protocols lie a communication layer and a fault injection layer. The communication layer provides three types of communication service, which will be explained in sec-tion 3.5. The fault injector, which can generate various kinds of fault condisec-tions, is used only in the testing period.

In the following sections, the experimental system for the virtual service redirector and the testbed of the system will be discussed.

3 The Scalable PHDS Firewall Architecture

3.1 Overview

This section describes a revised architecture of the PHDS system, which will be used in a forthcoming experimental project. It is a variant of the generic architecture described above, which has been modied specically to suit the application of network-layer packet ltering and to facilitate easy scaling.

Figure 6 illustrates in overview the interactions that take place between the components of the decentralised rewall node. Figure 7 illustrates a similar overview of a comparable monolithic rewall node (the purpose of which node is explained in section 4). The various processes shown are described in detail in the subsections below, but a summary of what each of them involves is presented here for convenience:

Heartbeat inputrefers to the receipt by the heartbeat protocol entity of the current hashing and

packet lter congurations from the reconguration protocol entity (which maintains them), and the current packet lter load from the packet lter.

Heartbeat exchangeis the broadcast by the heartbeat protocol entity on a local node of this

infor-mation through the inter-host communication mechanism, and the receipt of such inforinfor-mation from its peers on other nodes.

Heartbeat result is a summary of this information which is presented to the reconguration

protocol entity, in order to allow it to update the hashing conguration, and to determine when it is appropriate to apply a hashing conguration update to the hashing protocol entity on the local node.

Update of the hashing protocol conguration is the application of a updated conguration to

hashing protocol entity on the local node, which occurs when the new conguration is deemed to be consistently available across all operational nodes. This type of event is logged every time it occurs.

Access control list update exchange is the broadcast and receipt of a new access control list,

which is initiated whenever the conguration is changed by the administrator, or a new node appears.

As with update of the hashing protocol conguration, update of the packet lter conguration

is the application of the access control list update to the packet lter by the reconguration protocol entity on a local node (and the logging of this event), which also only occurs when the reconguration protocol deems that it can be consistently updated across all operational nodes.

Event logging is simply the use of the inter-host communication mechanism by the logging

protocol entity to send logs of these events to logging nodes, as they occur.

Hashed packet ltering is only used by the decentralised rewall node. It involves IP packets:

{

being captured by the service-specic communication mechanism,

{

being handed to the hashing protocol entity, which determines whether or not to hand them to the packet lter for processing,

{

possibly being handed to the packet lter for ltering, which determines whether to accept or reject the packet, and

{

if accepted, being handed back to the service-specic communication mechanism and re-broadcast on the opposite interface medium.

(7)

heartbeat input event logging heartbeat exchange PACKET FILTER SERVICE-SPECIFIC COMMUNICATION

access control list update exchange INTER-HOST COMMUNCATION HEARTBEAT PROTOCOL ENTITY RECONFIGURATION PROTOCOL ENTITY HASHING PROTOCOL ENTITY LOGGING PROTOCOL ENTITY configuration update of the hashing

update of the packet filter configuration

hashed packet filtering

heartbeat result

NODE OF THE DECENTRALISED FIREWALL

Figure 6: Communication between the components of a single Node of the Decentralised Firewall Simple packet ltering is only used by the monolithic rewall node. It works similarly to hashed

packet ltering, except that incoming packets do not get ltered through the hashing protocol entity (since the monolithic rewall node doesn't have one), but instead get passed directly to the packet lter for processing.

3.2 The Packet Filter

The packet lter lters IP packets according to a specied security policy.

For each IP packet it receives, it must decide whether or not that packet meets the security criteria stipulated by its current Access Control List (ACL) conguration. This decision must be reached by considering only the contents of the current packet, and must take into account the IP header information, and the TCP/UDP header information where it is available. It must also accept atomic updates to its conguration from the reconguration protocol.

Packet ltering can be done either using a nave or Binary Decision Diagram (BDD) algorithm, and either concurrently or sequentially. The best feasible implementation appears to be the concurrent nave algorithm.

Concurrent packet ltering involves using a xed number (> 1) of execution contexts to lter

packets, rather than just one. The advantages of concurrent over sequential packet ltering are that the latency of packet ltering can often be reduced, and that it is possible to provide a natural

(8)

event logging INTER-HOST COMMUNCATION PACKET FILTER SERVICE-SPECIFIC COMMUNICATION RECONFIGURATION PROTOCOL ENTITY LOGGING PROTOCOL ENTITY update of the packet filter configuration

MONOLITHIC FIREWALL NODE

simple packet filtering

Figure 7: Communication between components of the Monolithic Firewall Node

accurate measure of the packet lter load (namely, the number of execution contexts busy at the time of measurement).

The nave algorithm involves iterating through a list of packet ltering rules, until one is found which matches the current packet. At this point the action associated with that rule (accept or reject) is taken. The matching is done on the basis of the contents of the packet's network layer header, and the transport layer header in its payload, if any.

The network layer header contains the source and destination IP networks in address/netmask form (which, among other things can be used to determine whether the packet is inward or out-ward bound) and the transport layer protocol. The transport layer header contains the source and destination ports and the connection ags (in the case of TCP).

The BDD algorithm potentially performs better than the nave algorithm due to its virtual elim-ination of rule redundancy, but it is rather more complex to implement. See [HFH98] for a detailed description of a similar algorithm.

3.3 The Fault-Tolerant Protocols

The fault-tolerant protocols allow the decentralised rewall to continue to function in the presence of predened classes of component failure.

(9)

failures on the ability of the decentralised rewall to provide the same rewall service functionality as a comparable fault-free monolithic rewall would. These protocols were mentioned in section 2, and are used in a slightly modied form in the forthcoming experimental system.

3.3.1 The Heartbeat Protocol

The heartbeat protocol maintains (as accurate and consistent as possible an approximation to) an up-to-date picture of the set of operational nodes.

Status datagrams must regularly be exchanged between heartbeat protocol peers on nodes, and this information must be passed to the reconguration protocol on a node as soon as it is available. The presence of a status datagram indicates that its associated node is operational, and its content provides a measure of the packet lter load, as well as the conguration checksums of the hashing protocol and the packet lter on that node.

Status datagrams are broadcast at regular time intervals, using timely datagram broadcast inter-host communication (which see). They are collected from peers until a timeout occurs, and the resulting picture is summarised and given to the reconguration protocol.

The purpose of the checksums on the hashing and ACLcongurations is to have a chance of de-tecting conguration inconsistencies. Detection of such inconsistency could either trigger rebroadcast of the aected conguration information, in order to correct congurations of inconsistent nodes, or voluntary shutdown of a node which detects that it is inconsistent too often.

3.3.2 The Hashing Protocol

In general, the aim of the hashing protocol is to distribute the system workload evenly over the available nodes.

Any node has small but non-negligible probabilities of failing to perform a given task at all, and of failing to perform it correctly if it does. In order to provide some method of specifying required quality of service, a given task could be allocated a criticality value, which the hashing protocol could use to determine the number of replicated nodes to which to allocate the task1.

In our current implementation of the rewall application, the hashing protocol is designed to allocate normal tasks to one node and critical tasks to two nodes, and a comparison protocol (see section 3.3.3 for details) is used to detect the inconsistent outputs generated by single faults.

The hashing protocol distributes the packet ltering workload over (as accurate as possible an approximation to) the set of currently operational nodes.

It must use the current hashing conguration to decide, for each incoming IP packet, whether it should be passed to the packet lter on the same node, and pass exactly those packets which should be passed to the packet lter. It must also accept updates to the hashing conguration from the reconguration protocol and apply them atomically.

Hashing may be implemented using (amongst other possibilities) a deterministic algorithm, a randomized algorithm, or a dynamic load-balancing algorithm.

Deterministic hashing involves allocating each node a set of sequence number equivalence classes modulo a multiple of the number of operational nodes (or something to that eect). The collection of these sets of equivalence classes should ideally be disjoint and cover all sequence numbers, but it is sucient for there to be little overlap and almost complete covering.

Randomized hashing (explored in [Pih98, Mat98]) involves allocating to each node a probability with which it will process any given packet. This should be done in such a way that if nodes independently elect whether or not to process packets according to their assigned probabilities, that both the average rate at which packets are dropped and the total amount of redundant work done by the collection of all nodes are minimised. Clearly, a tradeo needs to be made between these two conicting goals.

Dynamic load-balancing, which is implemented in the current PHDS system [Pat98], uses load information to allocate packets to the least loaded nodes rst, in order to minimise the greatest load on any node.

An example should serve to illustrate these algorithms:

The deterministic algorithm on three nodes would behave, for the given packet trace, as illustrated in table 1.

Given the same packet trace, and the depicted hashing decisions (which are the result of each node randomly deciding to process each packet with probability 1/3), the randomized algorithm on three nodes would behave as illustrated in table 2. Notice the rather high packet drop rate (packets 3, 5, 9 are dropped) for this initially seemingly reasonable allocation probability. Notice

1Where nodes can fail to perform allocated tasks, the redundancy which this introduces can be used to yield higher

probabilities of more critical tasks being performed by at least one node, and where node failures can introduce erroneous results, it can be used by some error-detecting or error-correcting scheme to increase the probability of more important tasks being performed correctly.

(10)

Table 1: Example of the Behaviour of the Deterministic Hashing Algorithm

time packet \n" is being processed packets allocated to node

A B C 0 1 1 5 1 2 1 2 6 1 2 3 1 2 3 13 1 2 3 4 1 4 2 3 14 1 2 3 4 5 1 4 2 5 3 15 3 4 5 4 5 3 16 3 4 6 4 3 6 18 4 6 4 6 19 4 7 4 7 24 7 7 28 7 8 7 8 29 7 8 7 8 32 7 9 7 9 35 9 9 42 9 9

Table 2: Example of the Behaviour of the Randomized Hashing Algorithm

is the packet

time packet \n" is being processed given to node packets allocated to node

A? B? C? A B C

0 1 no yes yes 1 1

5 1 2 yes yes yes 2 1 2 1 2

6 1 2 3 no no no 2 1 2 1 2 13 1 2 3 4 yes no no 2 4 1 2 1 2 14 1 2 3 4 5 no no no 2 4 1 2 1 2 15 3 4 5 4 16 3 4 6 yes no no 4 6 18 4 6 4 6 19 4 7 yes no no 4 7 24 7 7 28 7 8 no yes no 7 8 29 7 8 7 8 32 7 9 no no no 7 35 9 42 9

Table 3: Example of the Behaviour of the Dynamic Load-Balancing Hashing Algorithm

time packet \n" is being processed packets allocated to node

A B C 0 1 1 5 1 2 1 2 6 1 2 3 1 2 3 13 1 2 3 4 1 4 2 3 14 1 2 3 4 5 1 4 2 5 3 15 3 4 5 4 5 3 16 3 4 6 4 6 3 18 4 6 4 6 19 4 7 4 7 24 7 7 28 7 8 8 7 29 7 8 8 7 32 7 9 9 7 35 9 9 42 9 9

(11)

also that packets are sometimes allocated to multiple nodes, although this probability was chosen with the intent of allocating each packet to one node. It will require experimentation to see whether these disadvantages of randomised hashing are adequately oset by its simplicity (and thus possible performance) advantages.

Finally, given the same packet trace once again, and assuming the load on a given lter at a given time simply to be the number of packets being processed at that time, and that ties are broken by allocating work to the lowest numbered node, the dynamic load-balancing algorithm on three nodes would behave as illustrated in table 3.

It is also possible to combine dynamic load-balancing with the other two algorithms, in order to enhance their performance adaptively. This is done by adjusting the number of equivalence classes or the probabilities assigned to each node, based on the current packet lter load on that node.

These algorithms are all feasible to implement, although the dynamic load-balancing modication of the deterministic algorithm is rather complicated.

3.3.3 The Comparison Protocol

The aim of the comparison protocol is to attempt to detect faults which do not cause a node to fail, but cause it to operate incorrectly.

A typical fault-tolerant technique is Triple Modular Redundancy (TMR). If this technique is used, each message will be checked by three rewall nodes. The decisions of the three nodes will be voted on by a majority voter. In order to tolerate possible faults during voting, a voting protocol with three copies of voting entities running on the three nodes is needed [Che93].

Although the TMR implementation can meet the availability and fail-safe requirements of the system, a simpler design with duplications on triple redundant hardware could also meet the same requirements. Figure 8 illustrates the scheme.

Figure 8: Duplicate Implementation of a Fault-tolerant Firewall

As in the TMR scheme, there are three rewall nodes, but only two of the three check a given packet. A comparison protocol has replaced the voting protocol. In this scheme, each packet is distributed to two of the three rewalls and each rewall sends its decision to one of the other two comparison entities.

For example, packet 1 arrives, and is distributed to rewall 1 and rewall 2. The decisions (accept or reject) of the two rewalls are forwarded to comparison entities 1 and 2. Packet 2, in turn, is distributed to rewall 2 and rewall 3, and the resulting decisions are then forwarded to comparison entities 2 and 3. Table 4 shows an example of a set of comparison results.

For an arriving packet, comparison output depends on rewall outputs and possible faults. The nal decision whether to accept a packet (let it go through the rewall) or not, depends on both the rewall outputs and the comparison results. According to the safety requirement, a packet can be accepted if and only if the two rewalls that check the packet accept the packet, and both comparison entities reach agreements. Table 5 summarises the input-output correspondence based on the safety requirement. It can be seen that the nal decision of the comparison protocol will be equal to the decision of either of the two rewalls if and only if comparison entities give an \agree" output. Otherwise, the packet will be rejected.

(12)

Table 4: Example Packet Distribution and Syndrome Table

Packet 1 Packet 2 Packet 3 ... Packet N Fault frequency Diagnosis

Firewall 1 comparison 1 disagree agree ... disagree

f

1

=m

1 faulty

Firewall 1 comparison 2 agree agree ...

f

2

=m

2 fault-free

Firewall 1 comparison 3 agree disagree ... agree

f

3

=m

3 fault-free

Decision refuse accept refuse ... refuse

Table 5: The nal decision depends on faults, rewall outputs and comparison results. fault during rewall i rewall j comparison i comparison j decision

none accept accept agree agree accept

comparison i accept accept disagree agree reject

comparison j accept accept agree disagree reject

rewall i reject accept disagree disagree reject

rewall j accept reject disagree disagree reject

rewall i & comparison i reject accept agree disagree reject rewall j & comparison j accept reject disagree agree reject

It increases the throughput of the system by 50%;

When a fault occurs, the third computer is used as a backup spare. As a result of a fault, the

system doesn't fail. Instead, it loses the 50% extra throughput.

In summary, the comparison protocol should check packet lter decisions for consistency between pairs, and block actions resulting from inconsistent decisions.

The heartbeat protocol and the comparison protocol are used to detect whether any nodes are faulty. The reconguration protocol then informs the hashing protocol that no tasks should be allocated to those nodes.

Even if individual nodes are assumed to fail silently (that is, they perform any given task correctly or not at all), the comparison protocol is still needed. This is because the unavoidably unreliable propagation of conguration updates could still cause conguration inconsistency across nodes, and thus dierences in the treatment of a given task, depending on which node is selected to process it.

The best way to minimise the probability of occurrence of this problem is to increase the reli-ability of the conguration update communication mechanism, and to incorporate a mechanism to synchronise the actual application of these updates.

The latter task is performed by the reconguration protocol, described next.

3.3.4 The Reconguration Protocol

The reconguration protocol keeps the congurations of nodes up-to-date.

It must use the information provided by the heartbeat and comparison protocols to update the congurations of the packet lter and hashing protocol on the local node in response to events which require their congurations to change.

It receives information from the heartbeat protocol about which nodes are operational, and what their status is (namely, their hashing and ACL conguration checksums, and packet lter load).

It is also used to distribute ACL updates, both when the ACL is changed by the administrator on any operational node, and when new nodes appear, and have to be incorporated into the collection of operational nodes.

Checksum information for the hashing protocol and packet lter on a local node is maintained by the reconguration protocol. It is given to the heartbeat protocol to broadcast, and checksums on broadcast status datagrams returned to the reconguration protocol are checked for consistency before conguration updates are applied by the reconguration protocol to the hashing protocol or the packet lter.

(13)

3.4 The Logging Protocol

The logging protocol supports recording of the rewall's internal behaviour during operation. This is not merely a convenience added for the sake of experiment; it is actually an important component of a rewall system, because security policies typically have to adapt to changing requirements and logging information is an important resource in assessing what needs to change and how.

The logging protocol must support recording of packet ltering decisions and recongurations of all kinds.

Logging information must be sent to a single lesystem by all nodes, since otherwise failure of nodes could lead to missing (or, at best, inconvenient to access) logs.

This can either be done by having all nodes write log entries to a lesystem NFS-mounted on all of them from a logging node, or by sending broadcast messages from all rewall nodes to logging nodes.

The advantage of NFS is that it is simpler to implement than broadcasting of log entries. The advantage of broadcasting of log entries is that multiple logging nodes can be set up to remove the single point of failure in the logging node without incurring any performance cost (since logging nodes would simply listen for broadcasts and record what they hear). A more important advantage of broadcasting of log entries, for the purpose of analysing the results of experimentation, is that the trac patterns are better understood and more predictable than those of NFS.

Consequently, although more work is involved with setting up broadcast and capture of log entries, this is the approach that will be taken.

3.5 Communication System

The communication system provides support for various communication requirements in executing tasks.

3.5.1 Service-specic Communication

Service-specic communication supports the communication between the service running on the PHDS architecture and its clients.

In the case of the network layer rewall service, this involves listening for and rebroadcasting of IP packets.

IP packets will be listened for using the packet capture library (libpcap) designed for and dis-tributed with tcpdump. Unfortunately, libpcap can not also be used for IP packet rebroadcast, since tcpdump never needs to do that. So, this will have to be done using the raw socket interface in our experiment.

3.5.2 Inter-protocol Communication

Inter-protocol communication supports the exchange of information between the concurrently exe-cuting representatives of dierent protocols on a single node.

It must support reliable message passing and timely datagram passing between these modules. Reliable message passing involves (as close as possible to) guaranteed delivery of messages, without regard to how long it might take to achieve. Timely datagram passing involves best-eort delivery of datagrams within a strict time period, whereafter the delivery system should give up because the information is outdated.

As of Linux kernel 2.2.x, libpthread implements POSIX-compliant kernel threads (execution con-texts which share memory and are scheduled by the kernel) reasonably well and the kernel is reentrant. Shared memory communication thus becomes possible between smoothly scheduled concurrent tasks without the need for the application programmer to worry about reentrance problems for system calls, or about blocking communication hanging supposedly unrelated threads because of an incor-rectly implemented primitive user-level scheduling policy.

So, kernel threads will be used to implement modules, and inter-protocol communication will make use of the available shared memory to implement reliable message and timely datagram passing. This should be much easier to do than it would have been with the 2.0.x kernel, where sockets would need to be used between processes which could not share memory.

3.5.3 Inter-host Communication

Inter-host communication supports the exchange of information between the peers representing a single protocol on dierent nodes.

It must be able to support the reliable distribution of arbitrary-sized data from some node to all nodes. It must also be able to support the distribution of xed-sized data from some node to all nodes under time constraints (giving up once the deadline passes).

(14)

The basic idea here is to implement an unreliable datagram broadcast protocol, and to build the required protocols on top of that. There are two seemingly equally feasible ways of implementing the underlying protocol:

Firstly, the mechanism for service-specic communication could be used. The packet capture library could be used to listen on all nodes for IP datagrams broadcast from some node on a given medium through the raw socket interface. Alternatively, some node could send UDP datagrams to the broadcast address for the given medium and all nodes could wait to receive broadcasts.

Further investigation needs to be done to see which approach is better.

The higher-level protocols actually required of inter-host communication will then be implemented using a simple redundancy scheme | simply broadcast each datagram a number of times.

If we assume a uniform distribution of broadcasts received, with a constant probability of receipt, then it is easy to see that at the cost of extra network trac, we can make the probability of delivery of (at least one copy of) a broadcast as high as we like. However, this is probably a rather nave assumption, since datagram delivery even over the loopback interface is not uniformly distributed, so experimentation needs to be done to see the level of reliability that can actually be achieved using such a scheme.

It may be possible to achieve greater reliability using a more complex scheme involving requests for broadcast or potentially acknowledgements, but such schemes have two signicant disadvantages: Firstly, such protocols add complexity to the program, which makes its correctness harder to verify and its performance and reliability harder to analyse. Secondly, given that a simple redundancy scheme can potentially produce arbitrarily high reliability values at performance cost, and given the fact that the network layer is assumed to be unreliable, there is no point in adding complexity to the program when doing so might make it more prone to becoming totally unusable in the event of some unexpected failure.

Timely datagram broadcast can be implemented by using the replicated broadcast scheme, fo-cussing on ensuring that a best eort to deliver is made, but only for a xed time period.

Reliable message broadcast can be achieved by splitting an arbitrary-sized message up into a sequence of xed-sized datagrams, each of which is broadcast reliably and in its turn. Reliable datagram broadcast can be achieved by a similar method to timely datagram broadcast | also using the replicated broadcast scheme, but focussing on ensuring that a specied level of reliability of delivery is achieved, however long that may take.

4 The Experimental Testbed

4.1 Overview

The experimental testbed is a practical system set up to test hypotheses about the decentralised rewall's behaviour in as realistic a manner as possible.

The testbed system has ve types of node (essentially, a program running on a machine, and performing a specic task in an experiment). They are: the decentralised rewall node, the monolithic rewall node, the trac generation node, the packet trace recording node, and the logging node, and their relationships in the testbed are illustrated in gure 9.

The contents of the bold box are the components of the rewall being tested, and the surroundings are the testbed environment. The rewall nodes could potentially be a single monolithic rewall node, a collection of decentralised rewall nodes, or both. The rewall logging node is actually both part of the rewall system (since logging is a necessary function of a rewall), and part of the testbed environment (since it records the internal rewall activity).

As an example, gure 9 illustrates the case where both a collection of decentralised rewall nodes (denoted by \D"), and a single monolithic rewall node (denoted by \M") appear. This, however, may not necessarily be the best way to perform the experiments, because of the unrealistic congestion caused by the replicated trac that would be produced.

4.2 The Decentralised Firewall Node

These nodes are used in a collection to construct the decentralised rewall, in order to test hypotheses about its behaviour. Section 3 was dedicated to describing the components of this node in detail.

4.3 The Monolithic Firewall Node

The monolithic rewall is intended to function comparably to the decentralised rewall node, in order that the two architectures may be compared fairly.

It must operate as a network layer lter in the same way as the decentralised rewall system does. This is most easily accomplished by using a simplied version of the decentralised rewall node.

(15)

Packet trace recorders Firewall logging node

Internet representatives (traffic generators)

Firewall nodes

Firewall-internal shared medium Internet-firewall interface

shared medium

Intranet representatives (traffic generators) Intranet-firewall interface

shared medium

D

M

Figure 9: The Firewall & Environment Testbed

In particular, the hashing and heartbeat protocols are entirely absent, and the reconguration protocol performs ACL conguration updates only on the local packet lter. Everything else remains the same as in the decentralised rewall node.

4.4 The Trac Generation Node

The collection of trac generation nodes generates trac to exercise the rewall systems being tested. The members of this collection must communicate with each other using transport layer interac-tions that are in some way a `realistic' representation of typical real-world communication activity. This communication must also be of a congurable intensity, so that the rewall can be tested under a range of network trac loads.

The trac generators can either use real packet trace data or synthetically generated trac. Both approaches present fair implementation challenges.

On the one hand, using real packet trace data entails using the output tcpdump generates when listening on a shared medium between a real intranet and the rest of the internet.

It is not good enough simply to play such data back as it was recorded, because most events in such a trace are causally related to others in the trace, and the way they are time distributed depends on the way the transport-layer protocol behaved in the situation in which the recording was made. However, this project aims to study the eect of dierent network-layer lters on the behaviour of the transport-layer protocol, so the times at which these events occur (and indeed, if they occur at all) should rather depend on the situation in which the trac is played back.

It is the best we can do to assume that those events which do not depend on others in the trace are truly independent events and that their time distribution is realistic, even though this may not

(16)

be strictly true for a number of reasons.

Further, in order to study the behaviour of a transport layer protocol, that protocol actually has to be used, so the output recorded by tcpdump also has to be translated back into the transport layer operations that caused it. These operations then actually need to be performed between com-municating trac generation nodes | the independent operations at the times recorded, and the dependent operations if and when their respective causes occur.

The network trac load of such trac can be congured by changing the time-scale on which the independent events occur.

On the other hand, using synthetically generated trac also has its share of problems, related in particular to deciding how to use standard techniques suciently realistically for the purposes of this project.

The two relevant techniques are the generation of Poisson distributed random trac and self-similarly distributed random trac [WTSW97]. Both techniques use an abstraction of network communication in which there is a collection of sources on a shared medium, each of which randomly emits or doesn't emit a packet at every point in time, and in which these emissions are assumed to be mutually independent.

The Poisson trac model (in common use until recently) assumes that all sources emit packets with xed nite variance and at the same average rate, which rate is the single parameter which expresses the network trac load.

The self-similar trac model (a recent challenger to Poisson distribution, which provides a more accurate t to empirical data) assumes that:

sources emit packets in a strictly alternating on-o fashion;

on-sequences follow the same random length distribution across all sources; o sequences also follow the same random length distribution across all sources; on and o sequences may follow dierent distributions; and

both the on and o sequence distributions exhibit innite variability, so that there is a

non-negligible probability of packet trains and of silences of arbitrary lengths being emitted by each source.

According to [WTSW97], these conditions are sucient for the aggregate trac to exhibit self-similarity, or to look similar on all scales. This is in contrast to Poisson distributed trac, where the aggregate result looks smooth on suciently large scales.

The main problem with either of these approaches lies in deciding how to apply correctly a model which uses independently emitted packets, into a situation which could use a connection-oriented transport layer protocol (and therefore dependencies between packet emissions).

One possibility is to use the synthetic generation model for the independent events (essentially, the connection requests), and some distribution for the trac that passes over a connection. But this constitutes a new model, which would have to be validated against empirical data.

Consequently, with all its attendent diculties, the use of real packet trace data appears to be the more feasible approach to take.

4.5 The Packet Trace Recording Node

The packet trace recording node logs eects of the behaviour of the rewall being tested on the trac on a particular shared medium.

It must record sucient information about the network layer activity on the shared medium to be able to deduce the eect the rewall has on the behaviour of the transport layer protocols.

If it is possible to specify that tcpdump record activity at the network layer, even when IP packets can be reassembled into higher-layer packets, then the packet recording node could simply be a tcpdump process running on a machine with a harddrive large enough to record the needed data.

Failing that, the packet capture library could be used to record the activity at the appropriate level.

4.6 The Logging Node

The logging node logs the internal behaviour of the rewall being tested.

It must record sucient information about the internal activities of the rewall to allow reasons to be ascribed to the externally observed behaviours.

In particular, it should record packet lter and hashing protocol recongurations, as well as packet lter decisions.

Since it has been decided to do logging using inter-host communication, the logging node must use whichever method is used to listen to inter-host communication broadcasts, and it must simply record everything it receives.

(17)

5 Summary and Future Work

We still need to complete the implementation of the testbed and collect the data for analysis and evaluation. If we can show that the virtual redirector and the decentralised rewall can signicantly improve the availabilty of the system, further experiment in industry will be conducted. Formal spec-ication and verspec-ication of the protocols in the system are being conducted. We are also investigating the use of mobile agents to replace some parts of our system.

References

[Che93] Yinong Chen. Testing and evaluating fault-tolerant protocols by deterministic fault injection. InVDI Series, number 260 in 10. VDI Verlag, Dusseldorf, 1993.

[Che98] Yinong Chen. A redundant virtual service redirector for computer networks. In Di-gest of Fast Abstracts: IEEE 28th Annual International Symposium on Fault-Tolerant Computing (FTCS-28), pages 27{28, Munich, January 1998. IEEE.

[Che99] Yinong Chen. Autonomous decentralised systems and their applications. Elektron Jour-nal of SA-IEE, pages 52{55, May 1999.

[Cov99] S. Covaci. Autonomous agent technology. In 4th International symposium on Au-tonomous decentralised systems, pages 255{257, Tokyo, March 1999.

[Ech89] K. Echtle. Distance agreement protocols. InIEEE 19th Annual International Symposium on Fault-Tolerant Computing (FTCS-19), pages 191{198, Chicago, June 1989.

[FNP+95] J-C Fabre, V. Nicomette, T. Perennou, R.J. Stroud, and Z. Wu. Implementing fault

tolerant applications using reective object-oriented programming. In IEEE 25th An-nual International Symposium on Fault-Tolerant Computing (FTCS-25), pages 489{498, Pasadena, June 1995.

[HFH98] Scott Hazelhurst, Anton Fatti, and Andrew Henwood. Binary Decision Diagram Repre-sentations of Firewall and Router Access Lists. Technical Report TR-Wits-CS-1998-3, University of the Witwatersrand, Johannesburg, South Africa, October 1998.

[Iha98] H. Ihara. More dependable systems in computing - autonomous decentralised systems. In Proceedings of the 1998 IFIP International Workshop on Depend-able Computing and its Applications, pages 127{137, Johannesburg, January 1998. http://www.cs.wits.ac.za/research/workshop/ip98.html.

[KM99] I. Kaji and K. Mori. Atomicity of transaction processing in heterogeneous autonomous decentralised systems. In 4th International symposium on Autonomous decentralised systems, pages 150{157, Tokyo, March 1999.

[Mat98] Roger Mateer. Comparison Protocol in a Fault-Tolerant Firewall. Masters Dependabil-ity project report, Department of Computer Science, UniversDependabil-ity of the Witwatersrand, Johannesburg, South Africa, September 1998.

[OSJ95] A. Olson, K.G. Shin, and B.J. Jambor. Fault tolerant clock synchronization for dis-tributed systems using continuous synchronization messages. InIEEE 25th Annual Inter-national Symposium on Fault-Tolerant Computing (FTCS-25), pages 154{163, Pasadena, June 1995.

[Pat98] Zunaid Patel. Load balancing and Fault Tolerance on a Dependable Virtual Service Redi-rector. Honours research report, Department of Computer Science, University of the Witwatersrand, Johannesburg, South Africa, October 1998.

[Pih98] Pekka Pihlajasaari.A Distributed Coherent Cache. Masters Dependability project report, Department of Computer Science, University of the Witwatersrand, Johannesburg, South Africa, September 1998.

[SLR98] A. Singhai, S-B Lim, and S.R. Radia. The sunscalr framework for internet servers. In

IEEE 28th Annual International Symposium on Fault-Tolerant Computing (FTCS-28), pages 108{117, Munich, June 1998.

[WTSW97] Walter Willinger, Murad S. Taqqu, Robert Sherman, and Daniel V. Wilson. Self-similarity through high-variability: Statistical analysis of Ethernet LAN trac at the source level. IEEE/ACM Transactions on Networking, 5(1):71{86, February 1997.

(18)

Glossary

ACL

Access Control List

BDD

Binary Decision Diagram

IP

Internet Protocol

NFS

Network File System

PHDS

Program for Highly Dependable Systems

TCP

Transmission Control Protocol

TMR

Triple Modular Redundancy

Autonomous. decentralisedsystems.

Abstract

Contents

4 The Experimental Testbed

2 The PHDS Architecture

2.2 The PHDS System for Internet Service Redirection

3 The Scalable PHDS Firewall Architecture

NODE OF THE DECENTRALISED FIREWALL

3.2 The Packet Filter

MONOLITHIC FIREWALL NODE

3.3 The Fault-Tolerant Protocols

3.3.2 The Hashing Protocol

3.3.3 The Comparison Protocol

3.3.4 The Reconguration Protocol

3.4 The Logging Protocol

3.5 Communication System

3.5.2 Inter-protocol Communication

3.5.3 Inter-host Communication

4 The Experimental Testbed

4.2 The Decentralised Firewall Node

4.4 The Trac Generation Node

4.5 The Packet Trace Recording Node

4.6 The Logging Node

5 Summary and Future Work

Glossary