Lightweight Protocols for Distributed Systems


Thesis for Submission for PhD

J. Crowcroft

Department of Computer Science

ProQuest Number: 10105643


Published by ProQuest LLC (2016). Copyright of the Dissertation is held by the Author.


CONTENTS

1. Chapter 1: Distributed Systems Protocol Architectures ... 7
1.1 Users ... 7
1.2 The Operating System ... 7
1.3 The Wire ... 7
1.4 Switching Methodologies ... 8
1.5 Packets on the network ... 10
1.6 Operating System Facilities ... 10
1.7 Software for Intermediate Devices in a Network ... 11
1.8 Distributed Systems Protocol Architecture ... 11
1.9 Transactions ... 12
1.10 Sequence and Time ... 12
1.11 Management ... 13
1.12 Related work ... 13

2. Chapter 2: Protocol Implementation Design Methodology ... 14
2.1 Some agreed formats in protocols ... 14
2.2 Some Rules for Protocols ... 15
2.3 What is Concurrency? ... 17
2.4 Interleaving ... 17
2.5 Problems with Concurrency ... 17
2.6 Consumers, Producers and Critical Regions ... 19
2.7 The Internet Architecture ... 23
2.8 Protocol State and Parameterization ... 25
2.9 Different Architectures ... 27
2.10 Performance Limitations of OSI ... 29
2.11 Traditional Communications Requirements ... 29
2.12 Service interfaces to Transport protocols ... 30
2.13 Non blocking and Asynchronous interfaces ... 32
2.14 Transport-to-network Level ... 34
2.15 Network-to-Link or Device Level ... 34
2.16 ISO Transport Classes ... 34
2.17 Network Errors ... 35
2.18 What must a Transport Protocol do? ... 35
2.19 What else do you need in a Protocol? ... 36
2.20 Related Work ... 36
2.21 Performance Related Work ... 37

3. Chapter 3: A Sequential Exchange Protocol ... 38
3.1 Introduction ... 38
3.2 The ESP Protocol ... 39
3.3 The Message Transport Sublayer ... 40
3.4 Pairing Messages for Request/Response ... 41
3.5 Service Interface ... 43
3.6 Examples of Operation of Request/Reply ... 44
3.7 Performance and Possible Enhancements ... 48
3.8 Forwarding Calls and Multicast ... 48
3.9 Packet Layout ... 48
3.10 The ESP message sublayer state table ... 52
3.11 Default timer and counter values ... 53
3.12 A Note on Bit Ordering ... 53
3.13 Crash and Reboot Protocol Interaction ... 53
3.14 Related Protocols ... 54

4. Chapter 4: A Multicast Transport Protocol ... 55
4.1 Introduction ... 55
4.2 Multicast Semantics ... 56
4.3 N-Reliable Maximal Service ... 56
4.4 Underlying Network Service ... 57
4.5 The Messaging Service Interface ... 57
4.6 Message Primitives ... 58
4.7 Buffering and Markers in the Interface ... 58
4.8 Failure Modes ... 58
4.9 Messaging Protocol Operation ... 59
4.10 Acknowledgement Scheme for Multipacket Messages ... 59
4.11 Retransmission Schemes ... 59
4.12 Duplicate Filtering ... 60
4.13 Window Scheme ... 60
4.14 The Implosion Problem ... 60
4.15 Scheduling replies ... 62
4.16 Request and Response ... 62
4.17 Voting Functions ... 63
4.18 Operation of Request/Response using Message Level ... 63
4.19 Failure Modes and Cancelled Requests ... 63
4.20 Implementation ... 63
4.21 Analysis of Implosion for Single LAN ... 64
4.22 Differing Background Loads ... 66
4.23 Potential Hardware Support ... 66
4.24 Protocol State Required ... 67
4.25 Alternative Approaches - TCP Extension to One-to-Many Multicast ... 71
4.26 Conclusions ... 73

5. Chapter 5: Transport Protocol Performance ... 74
5.1 Introduction ... 74
5.2 IP/ICMP traffic generation ... 77
5.3 TCP traffic generation ... 78
5.4 Intrusive TCP Monitoring and Control ... 78
5.5 Passive TCP Monitoring ... 81
5.6 Setting up TCP experiments ... 82
5.7 Analysing the TCP Experimental Output ... 83
5.8 Measurements on SATNET ... 84
5.9 Background and Motivation ... 85
5.10 Choice of Transport Protocol to Measure ... 85
5.11 Characterising Performance ... 85
5.12 What Parameters and Algorithms? ... 85
5.13 Naive Throughput and Delay Calculations ... 86
5.14 Maximum Segment Size Choice ... 88
5.15 Retransmit Timers and Round Trip Time Estimators ... 89
5.16 Windows, Retransmission and Congestion ... 90
5.17 Traffic generators ... 91
5.18 Intrusive Instrumentation ... 91
5.19 External Observation ... 91
5.20 The Measurement Environment ... 92
5.21 Discussion of RTTS and Windows ... 93
5.23 Conclusions ... 96

6. Chapter 6: Managing the Interconnection of LANs ... 97
6.1 Introduction ... 97

6.2 Protocol Dependencies ... 97
6.3 Address Resolution and Mapping ... 97
6.4 Fault Isolation ... 99
6.5 Routing and Loop Resolution in Internetwork routers ... 99
6.6 Remote access to Management Information ... 101
6.7 Performance Considerations ... 101
6.8 ISDN and Bridges/Connectionless Routers ... 102

7. Chapter 7: LAN Access Control at the MAC level ... 104
7.1 Introduction ... 104
7.2 Physical Attacks ... 104
7.3 LAN Interconnection ... 105
7.4 Access Control Filters ... 105
7.5 General Filters ... 106
7.6 Remote Management ... 106
7.7 Denial of Service ... 106
7.8 Access Control ... 107
7.9 Management ... 107
7.10 Routing ... 107
7.11 Policy Routing Spanning Tree ... 108
7.12 Costs of Policies ... 111
7.13 Related Work ... 111

8. Chapter 8: Conclusions and The Future ... 112
8.1 Conclusions ... 112
8.2 Future Work ... 115

9. References ... 121
References ... 121

LIST OF FIGURES

Figure 1. Switched Network ... 9
Figure 2. The Spectrum of WAN and LAN Characteristics ... 9
Figure 3. Operating System Structure - Process View ... 11
Figure 4. Example Structure of Internet Datagram ... 14
Figure 5. Transmission Order of Bytes ... 15
Figure 6. Significance of Bits ... 15
Figure 7. An example of an invented protocol behaviour ... 16
Figure 8. Deadlock ... 18
Figure 9. Livelock ... 18
Figure 10. Transport Entity 1 ... 20
Figure 11. Transport Entity 2 ... 21
Figure 12. Network Entity 1 ... 21
Figure 13. Network Entity 1 ... 22
Figure 14. Flow of Information between Protocol Entities ... 23
Figure 15. A Fictitious Internet ... 24
Figure 16. Connection Phases ... 26
Figure 17. Timers and Bandwidth*Delay - Pipesize ... 26
Figure 18. Windows and Delay/Bandwidth Product ... 27
Figure 19. Synchronous I/O ... 32
Figure 20. Non-Blocking I/O ... 33
Figure 21. Asynchronous I/O ... 33
Figure 22. ESP - Protocol Stack ... 39
Figure 23. Simple Exchange ... 44
Figure 24. Sequence of Exchanges ... 44
Figure 25. Exchange with Answer Acknowledgement ... 44
Figure 26. Exchange with Call and Answer Acknowledgement ... 45
Figure 27. Large Request and Response with Explicit Acks ... 46
Figure 28. Large Request and Response with Implicit Acks ... 47
Figure 29. Data and Acknowledgement Packet ... 49
Figure 30. ESP Flag Bits ... 49
Figure 31. ESP Error Packet ... 50
Figure 32. ESP Error report codes ... 51

Figure 33. ESP Ping Packet ... 51
Figure 34. ESP Message Sub-layer State Table ... 52
Figure 35. ESP calls and reply nesting rules ... 54
Figure 36. Multicast Response Implosion ... 61
Figure 37. Definitive Multicast Transport Protocol Specification ... 69
Figure 38. Definitive Multicast Transport Packet Specification ... 70
Figure 39. Event Trace Measurement Points ... 74
Figure 40. TCP Timer Events ... 75
Figure 41. TCP Protocol Control Block (TCPCB) ... 76
Figure 42. RTT Smoothing Algorithm Specification ... 80
Figure 43. Typical TCP Trace Output ... 82
Figure 44. Sample TCP Monitoring Output ... 83
Figure 45. TCP Behaviour with misbehaving Window ... 84
Figure 46. TCP Behaviour with Slow Start and Congestion Avoidance ... 84
Figure 47. Theoretical SATNET Throughput ... 87
Figure 48. SIMP Queue Length and Service Time Analysis ... 87
Figure 49. TCP Measurement Environment Topology ... 92
Figure 50. Smoothed Round Trip Time Results ... 93
Figure 51. MSS Selection Results ... 94
Figure 52. Two TCP Connections Sharing Bandwidth ... 96
Figure 53. Example 1 LAN Topology ... 109
Figure 54. Example 2 LAN Topology ... 110
Figure 55. Example 3 LAN Topology ... 110
Figure 56. Computer-Computer Communications Model ... 112
Figure 57. Communications Functional Building Block Model ... 112
Figure 58. Packet Video Protocol Stack ... 118
Figure 59. Packet Audio Protocol Stack ... 119

LIST OF TABLES

TABLE 1. Range of Network Uses ... 7
TABLE 2. Range of Network Technology ... 7
TABLE 3. Packet Type and Event Key ... 16
TABLE 4. Clark's Five Layer Model ... 29
TABLE 5. ISO Transport Protocol Classes ... 34
TABLE 6. Packet Size changes through Layering ... 36
TABLE 7. ESP Default Timer Settings ... 53
TABLE 8. ICMP Statistics ... 78
TABLE 9. TCP Traffic tool Example Measurement Output ... 78
TABLE 10. SATNET SIMP Queue Length and Wait Times ... 88
TABLE 11. Theoretical Packets Sent versus Segment Size ... 89
TABLE 12. Theoretical Packets Sent versus Segment Size ... 89

ABSTRACT

A methodology and architecture have been developed that contrast sharply with common interpretations of the Open Systems Interconnection concept of Layering. The thesis is that while many functionalities are required to support the wide range of services in a Distributed System, no more than two concurrent layers of protocol are required below the application. The rest of the functions required may be specified and implemented as in-line code, or procedure calls. The proliferation of similar functions such as multiplexing and error recovery, through more than one or two layers in the OSI model, contradicts many basic rules of modularisation, and leads to inherent inefficiency and lack of robustness. Furthermore, it makes the implementation of Open Systems in hardware almost intractable. A key design issue is how to construct the automatic adaptation mechanisms that will allow a protocol to match the service requirements of the user.

The arguments in this thesis are based on the experience of the design and implementation of three major classes of protocol for distributed systems: a transaction protocol (ESP), a reliable multicast protocol (MSP), and a Stream Protocol (TCP). A number of performance measurements of these protocols, including the latter, the DoD Transmission Control Protocol, were carried out to support the thesis.

Acknowledgements

I would like to thank everyone in my family who encouraged me to finally do some work.¹

Thanks are also due to Ben Bacarisse, Mark Riddoch, Karen Paliwoda, Steve Easterbrook, members of the SATNET² measurements Task Force, Peter Kirstein, Soren Sorenson, Peter Lloyd and Robert Cole for assistance at various points during this research.³

Thesis: Aims and Goals

The aim of this research was to develop protocols that are designed and engineered for scalable distributed systems support. To this end, it was a goal to produce an efficient transport protocol for one-to-one and one-to-many Remote Procedure Call support that can also be used for conventional (connection-oriented) applications. An important goal is that we understand limits to the performance of such protocols in the Local and Wide Area.

Sub-goals included: possible hardware support for transport protocols; link and network level security implications of non trivial size distributed systems; some understanding of possible formal specification of aspects of the protocols developed.

The work described in this thesis was carried out between 1985 and 1988. The author believes that despite the time elapsed since then and the presentation of this work, many of the findings are still highly relevant, especially the protocol designs and performance results.

1. Thanks especially to the members of the IAB's End-to-End Research Group for the alternative title - the Unbearable Lightness of Protocols (Craig Partridge after Milan Kundera).

2. This work was supported in part by DARPA under Contract Number N00014-86-0092. Any views expressed in this thesis are those of the author alone.

3. Some inspiration was drawn from the writings of Myles na Gopaleen concerning the Research Bureau in the Irish Times.

Contents: Organisation and Layout

The thesis is presented in 8 chapters. The core of the work is presented in chapters three, four and five. Here we present a simple transport protocol for RPC support, extensions for one-to-many conversations, and performance measurements of a similar protocol. This meets the primary aim of the thesis. The requirement for such protocols has long been identified, and references are given throughout the text in support of this.

The outline of all the chapters is as follows:

1. Distributed Systems Protocol Architectures

In this chapter, we review the services provided by the transmission network and those required by the user. We then review protocols in distributed systems, and finally present our Distributed Systems Protocol Architecture.

2. Protocol Implementation Design Methodology

In this chapter we review Protocol Design Methodologies, and examine relative layering strategies and overheads. We compare the layering overheads versus flexibility and examine Finite State Machine Specifications and direct implementation from them. We look at mapping End to End protocols to Bounded Buffers.

3. Sequential Exchange Protocol

In this chapter we present the design of the Sequential Exchange Protocol, a Transaction Protocol for RPC. We look at the need for Presentation Level Integration, the need to support recursive calls, and then look beyond simple RPC at Messaging and Streaming.

This protocol is an alternative to VMTP and can be used as a suitable improvement on the pseudo transport protocol used on UDP for support of NFS/Sun RPC. It is intended primarily as a transaction protocol for RPC, something which has been identified as missing for a decade. We look at how there is a need for Presentation Level Integration. The protocol also supports recursive calls. This chapter also examines what is needed beyond RPC, such as the Messaging/Streaming used for supporting networked window systems.

4. Multicast.

This chapter presents the design of a multicast protocol, which is an enhancement to the transaction protocol presented in Chapter 3. The aim of the protocol is to provide a service equivalent to a sequence of reliable sequential unicasts between a client and a number of servers, whilst using the broadcast nature of some networks to reduce both the order of the number of packets transmitted and the overall time needed to collect replies. Appendix 1 introduces a Distributed Conferencing Program that can make use of multicast, and Appendix 2 outlines possible OSI protocol support for the windowing system used in the conferencing application.

5. Protocol Performance.

This chapter describes the tools and measurement mechanisms used to investigate the performance of the Transmission Control Protocol (TCP) over the Atlantic Packet Satellite Network. One of the main tools used was developed by Stephen Easterbrook at UCL, while the other was provided by Van Jacobson of LBL.

We take a pragmatic look at retransmission schemes, flow control schemes over Datagram Networks, Congestion Control Issues and Adaptation in Protocols.

Appendix 6 describes some further details of transport protocol implementation and specification techniques. We look into problems of Concurrency, Robustness and Efficiency, Presentation Interaction, and Buffer Ownership. Appendix 8 covers possible protocol hardware support, including parallel processing, pros and cons of Front-End Processors, support for rate control, support for sequencing/resequencing and support for packetisation and presentation parsing.

As transmission speeds increase, more stress is being placed on protocol implementations. Here we explore which salient functions can be moved into, or supported by, specialist hardware. Since we are arguing for a 2 or 3 process implementation of the communications stack, parallel processing has limited scope in any given conversation. However, for large hosts, there may be many conversations in progress simultaneously. There are also a number of specialist tasks that may be done such as in-line presentation encoding/decoding, checksumming and encryption/decryption. These could all be done in a pipeline of processors. In intermediate systems in the network, there are possible benefits in using a processor per interface.

6. Managing the Interconnection of LANs

In this chapter we present some of the techniques for building extended LANs that have been used to build metropolitan scale distributed systems without needing Metropolitan Area Networks. It is important to understand the impact of these techniques on packet loss, reordering and duplication so that we can build robust transport protocols.

This approach to scaling networks may seem to be bottom up, compared with the RACE programme approach in designing a pan-European network from scratch. It does have major advantages in being based in high speed technology that is available now, so that performance of real systems can be studied rather than merely modelled. We look at routing and performance problems.

This chapter continues on to examine the interconnection of LANs with "MAC" (or link/frame level) Bridges, and Internetwork level relays ("Gateways" or "Routers" in DoD Terminology). The following issues are mainly addressed in the context of the DoD Protocol suite and interconnected Ethernets: managing interconnected LANs, implications for Bridges, Routers and Relays of high performance, localisation of traffic versus the cost of routing and addressing protocols.

Appendix 3 describes the application of these ideas in the context of a university or campus system configuration. Appendix 4 describes a possible distributed application of some of these ideas. Appendix 5 describes the configuration of the Alvey Admiral network, which was the testbed for some of this research. Appendix 7 describes some further measurement experiments and results carried out on another such configuration, on the UCL Campus network.

7. LAN Access Control at the MAC level

Chapter seven includes a description of some extended network security features that we have devised for extended LANs. One criticism often levelled at network architectures is that security is added as an afterthought. In the process of this work, we show that at least some levels of security (in the area of access control on a per host or subnetwork basis) can be provided at a reasonable level using fairly modest mechanisms.

8. Conclusions and Further Study

1. Chapter 1: Distributed Systems Protocol Architectures

In this chapter, we review the services provided by the wire and those required by the user. This simple three layer approach is an attempt to expose the basic parameters in the protocol design space. We avoid reproducing a description of the Open Systems Interconnection Architecture as this is well understood and described well elsewhere. In contrast, we review protocols designed for distributed systems from the research community during the 1980s. Much of the material presented in this and the next chapter consists of descriptions of tools and building blocks which can be put together in a variety of ways. It is how these components go together in distributed systems now that may be different from the peer to peer communication models of traditional communications.

When considering why we need specialised communications software, and how it will differ from existing applications and systems software, we must consider the hierarchy, consisting of:

• What the User wants to communicate.

• What the Operating System provides for Communication Support.

• What the Wire provides in terms of basic communication.

1.1 Users

By users, we mean, ultimately, human computer-users, but also include a wide range of application programs. The kind of things users wish to communicate cover a wide spectrum:

Traditional      Contemporary                     Modern
Analogue Voice   Terminal Access                  Window Based Graphics Terminal
Telex            File Transfer (lumpy or whole)   Shared Files/File Access
TV               Electronic Mail                  Conferencing/Multimedia Mail
Mail Order       Remote Job Entry                 Distributed Execution
                                                  Load Sharing/Replicated Services

TABLE 1. Range of Network Uses

1.2 The Operating System

The operating system provides generic software for communication, whereas the user provides anything more specific to the task in hand. Generic communication software is partly what this thesis is all about.

1.3 The Wire

The Wire is in reality a wide range of services, ranging from a modem with an acoustic coupler, attached to a conventional telephone, right through to a High Speed digital packet switched network. This is illustrated in Table 1 and Figure 2.

The range of Wide Area (National and International) networks extends a long way, as exemplified in Table 2:

             Phone            Trunks         Packet Switched
Numbers      600 Million      10,000s        100s
Technology   Modems/PCs       Modem/Muxes    DTEs/Gateways
Usage        French Minitel   Banks etc      Research

TABLE 2. Range of Network Technology

1.4 Switching Methodologies

In switched digital networks, the switching nodes can operate in three different ways (refer also to Figure 1):

• Circuit Switching

The switches on a route establish a physical circuit between two hosts for the duration of the communication session. This is how the telephone system operates.

• Message Switching

The sending host parcels up data into long messages which are transferred step by step over each link between the switches and delivered to the recipient.

• Packet Switching

As message switching, but the parcels are short, so that the switching nodes can store numbers of them in memory.

Packet switching is favoured for computer communication for several reasons. It makes good use of the links: between bursts of data between one pair of hosts, other hosts may use the same links and switches. Delays due to storing short packets in switches are short compared to storing long messages. Switches need less memory/buffer space to hold packets for forwarding than for messages. Hardware to implement packet switches is now relatively cheap. Switches and links may break during a communication session, but alternative routes can be found.

Packet switched networks have been in use for over 25 years. Early examples are the US ARPANET and the UK National Physical Laboratory Network. The ARPANET evolved into the Internet, which by the end of 1992 connected well over 1 million hosts worldwide. At the same time, the PTTs provide an international packet switched network. In the UK this is BT's PSS (Packet Switched Stream) Service.
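The delay advantage of short packets over long messages can be illustrated with a little arithmetic. The following is a minimal sketch, not taken from the thesis: it assumes equal-rate links, ignores headers, propagation delay and queueing, and treats every packet as full size. The function names and the example figures are purely illustrative.

```python
def message_switching_delay(message_bits, hops, link_rate_bps):
    """Each switch stores the whole message before forwarding it,
    so the full transmission time is paid once per hop."""
    return hops * message_bits / link_rate_bps


def packet_switching_delay(message_bits, packet_bits, hops, link_rate_bps):
    """Packets are pipelined: the first packet pays one packet time per hop,
    and each remaining packet arrives one packet time after the previous one."""
    packets = -(-message_bits // packet_bits)  # ceiling division
    packet_time = packet_bits / link_rate_bps
    return (hops + packets - 1) * packet_time


if __name__ == "__main__":
    # Hypothetical figures: an 80,000 bit message over 4 hops of 64 kbit/s.
    print(message_switching_delay(80_000, 4, 64_000))
    print(packet_switching_delay(80_000, 1_000, 4, 64_000))
```

Under these assumptions the whole-message case pays the full transmission time on every hop, whereas with packets only the first packet pays the per-hop cost; this is the sense in which storing short packets costs less delay.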

There has been much recent renewal of the debate concerning the efficacy of hop-by-hop flow control versus end-to-end flow control as bandwidth sharing and flow quarantining mechanisms. Although this is largely beyond the scope of this thesis, it is touched on in chapter 5.

[Diagram: sources (S) and destinations (D) connected through switches (X).]

Figure 1. Switched Network

The range of network characteristics is wide:

[Plot: link data rate (bits/sec, from 1K to 1000M) against geographic coverage (kilometres, from .001 to 10000), showing regions for computer buses, local & metro networks, packet satellite and wide area terrestrial networks.]

Figure 2. The Spectrum of WAN and LAN Characteristics

1.5 Packets on the network

Packets are placed on the network. What is their fate? Even if the network is essentially a reliable bit pipe, it is possible for it to break, independent of the end hosts communicating. Practice shows that it is in fact far more likely to break than, for instance, the terminal interface device on a mainframe or a disk controller. If the network is packet switched, then other things can go wrong too in a variety of subtle ways.

Thus, depending on the type of network, a variety of disasters can befall a packet:

• The network may not know how to get this packet to its correct destination.
• The packet can just get lost due to lack of memory or a temporary fault in some component.
• A packet can be corrupted by electrical (or other) noise on a wire, or by faulty software.
• Packets can be delivered to the remote end in a different order than they were sent.
• The same packet could be delivered multiple times to a destination.
• Packets could be delivered by the network to a destination host faster than the host can deal with them.
• Packets may even arrive that were not actually sent by anyone (invented by the network).
• The network may support many different optimal packet sizes at different places.
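Several of the failure modes listed above (loss, corruption, duplication and reordering) are easy to reproduce in a test harness. The following is a minimal sketch, assuming packets are non-empty byte strings; the function name, the probabilities and the choice to corrupt only the first byte are illustrative assumptions, not part of the thesis.

```python
import random


def unreliable_network(packets, rng, p_loss=0.1, p_dup=0.1, p_corrupt=0.1):
    """Return the packets as a hostile datagram network might deliver them.

    `packets` is a list of non-empty bytes objects and `rng` is a
    random.Random instance, so that a run can be repeated from a seed.
    """
    delivered = []
    for pkt in packets:
        if rng.random() < p_loss:
            continue                                 # packet silently lost
        if rng.random() < p_corrupt:
            pkt = bytes([pkt[0] ^ 0xFF]) + pkt[1:]   # invert the first byte
        delivered.append(pkt)
        if rng.random() < p_dup:
            delivered.append(pkt)                    # delivered twice
    rng.shuffle(delivered)                           # arbitrary reordering
    return delivered


if __name__ == "__main__":
    sent = [bytes([i]) + b"payload" for i in range(10)]
    for pkt in unreliable_network(sent, random.Random(1)):
        print(pkt)
```

A transport protocol under test can be fed the output of such a function to check that its sequencing, duplicate filtering and checksum logic recover the original stream.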

The consequences are many:

• There must be mechanisms for hosts to indicate where they wish to communicate, just as file systems allow us to indicate which file we want to read or write.
• There may be a single wire leading out of a host, which then splits in some way to go to multiple destinations. This means we need to have some kind of addressing mechanism, so that we can multiplex communication between many hosts on a network.
• The user needs some level of reliability. This can be as weak as a statistical guarantee: digital voice or video may only require some percentage of data per second to be delivered, from which a reasonable result can be constructed by the receiver. [For instance, humans only need 30% of the sound of someone's voice to be able to understand 99% of what is being said.]
• The user may need to restrict the rate at which data arrives. Real-time applications like voice and video require fairly exact rates of arrival. Print servers may support much lower data rates than typical modern networks.
• The user needs to be able to subdivide ("fragment" or "segment") and re-assemble the data into appropriate packet sizes. This may further involve a mechanism for discovering the appropriate size, or rely on intermediate systems performing this function. It is inefficient to do this in too many places.
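The last consequence, fragmentation and re-assembly, can be sketched as follows. This is an illustrative sketch only: the fragment layout (byte offset, a "more fragments" flag, payload) is an assumption loosely modelled on datagram fragmentation in general, and the function names are hypothetical.

```python
def fragment(data, max_payload):
    """Split `data` into (offset, more_fragments, payload) tuples."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    fragments = []
    for offset in range(0, len(data), max_payload):
        payload = data[offset:offset + max_payload]
        more = offset + max_payload < len(data)
        fragments.append((offset, more, payload))
    return fragments


def reassemble(fragments):
    """Rebuild the data from fragments that may arrive out of order or
    duplicated. Returns None while any fragment is still missing."""
    by_offset = {}
    total = None
    for offset, more, payload in fragments:
        by_offset[offset] = payload          # a duplicate simply overwrites
        if not more:
            total = offset + len(payload)    # the last fragment fixes the length
    if total is None:
        return None                          # last fragment not yet seen
    data = bytearray()
    while len(data) < total:
        chunk = by_offset.get(len(data))
        if not chunk:
            return None                      # gap in the middle
        data += chunk
    return bytes(data)


if __name__ == "__main__":
    pieces = fragment(bytes(range(50)), 16)
    print(reassemble(list(reversed(pieces))) == bytes(range(50)))
```

The sketch keeps re-assembly at the receiving end only, in line with the remark that it is inefficient to perform this function in too many places.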

The network must also be used responsibly and must protect itself against malicious use. Since it is often one of the more expensive resources an organisation can possess, it must also be used efficiently. In summary then, a Packet Switched Network is analogous to a huge statistical multiplexor.

1.6 Operating System Facilities

Most O/S support is for some form of communication and sharing of devices, and so has some facilities for building protocols. However, conventional device support usually assumes that devices are accessed over a local bus, and so are almost error free and with delays similar to memory access. Also, sharing of devices is achieved by access being only via a device driver, and is therefore easily achieved. The system safely separates users from each other, and protects devices from aberrant behaviour of users' processes by restricting access through system calls.

[Diagram: User Space processes above the Kernel; labelled components are Monitor, Device Drivers, Shared Memory, Msg Passing, ksched and the Standard Device Interface.]

Figure 3. Operating System Structure - Process View

Figure 3 illustrates the layering and process model of a typical uniprocessor operating system of the 1980s such as 4.3BSD Unix. Within the Kernel, or in a Real Time System, the structuring is rather different, with closer cooperation between processes or threads of control, less protection, and the assumption that the programmer is more expert. Processes run in two main types of context (or address space) with memory management units controlled by the operating system to protect processes' memory from random, erroneous access from other processes. These are "User Space" and "Kernel Space" (in Unix terminology). There are then three levels of scheduling: dispatching of interrupt service routines; scheduling of kernel tasks (e.g. file system tasks, or network and transport protocol tasks); scheduling of User Processes. More recently, this division of scheduling classes has gone further to include "threads", which operate within a single address space, within a single "user space" process. Systems such as Amoeba, MACH, Chorus and Windows-NT provide this abstraction.

1.7 Software for Intermediate Devices in a Network

In a packet switched network, and in store and forward networks, we have special purpose boxes, variously called Bridges, Switches, Gateways and Relays. These will normally run real-time operating systems, and perform many of the same functions as general purpose end hosts, except for the upper levels which may not be required. There is often special hardware support to facilitate communications, e.g.:

• Intelligent Interfaces, including Content Addressable Memories, for de-multiplexing packets based on fields within a packet (addresses).

• Intelligent or Complex Buses (e.g. binary tree buses or even no buses).

• Specialised use of Memory Management Units (MMU).

Depending on the architecture of the network, and the choice of end host protocols, these intermediate nodes support different complexities of protocols, usually ranging from connectionless to connection oriented.

1.8 Distributed Systems Protocol Architecture

The choice of where low level protocol functionality should go is dictated to a large extent by the distribution of problems and performance in the underlying interconnection fabric. Factors include where the errors occur and where the delays are incurred, be it in transmission or queueing in bridges or routers. We argue that the current trend in technology has moved the engineering parameters towards connectionless networking, and pure end-to-end detection and recovery mechanisms.

In the workstation arena, the Internet Protocol Architecture has arisen as the main structure for communications for distributed systems. In recent years, however, some shortcomings of this architecture have been pointed out. Clark and Tennenhouse propose Integrated Layer Processing in their paper, and Tennenhouse again calls for a reduction in layers. Throughout this thesis we will explore the implications of these ideas in implementation.

Above an all pervasive Internetwork layer, traditional applications such as remote terminal access, file transfer and so forth have been built on the service provided by reliable stream protocols such as TCP. This service has proved invaluable in simplifying the communications model for these applications, but has proved less appropriate to communication patterns involved in distributed computing. We now turn to the transport protocol requirements for these users.

1.9 Transactions

Cheriton describes a transaction protocol with multicast facilities tuned for page level file access, but also suitable for wider area distributed system transactions. Watson describes a transaction protocol that ingeniously avoids problems with rapid system crash and reboots, using time changes rather than sequence or epoch numbers. Extending this work, Chesson describes a protocol very similar to VMTP specially tuned for high performance interworking from graphics workstations to supercomputers. The protocol was also designed for VLSI implementation.

None of these transaction protocols support Call Back, where the server of an RPC request (or descendants of the current call) may make a call back to the originating client, within the same conversation. This is despite the fact that this is a perfectly natural way for any program to behave (mutual recursion) whether distributed or not. Randell describes this in detail in an early paper on the Newcastle Connection, the first distributed Unix with RPC.

In chapter 3 we will describe a transaction transport protocol that supports callback. In chapter 4, we describe extensions to support multicast.

1.9.1 The Limit of Transactions

The sequential exchange paradigm leads to unacceptable latency for some applications. It becomes necessary to provide a reliable message streaming facility (relaxing the strong synchronisation between the application interface send/receive pair of operations, and the corresponding exchange of messages with the server application). The X Protocol is an example of just such a system. VMTP and the Sequential Exchange Protocol both provide this functionality, although many other transaction protocols do not. Another approach is to provide fine grained efficient concurrency in servers and clients. This has become more widespread in systems recently.

The protocols described in chapters 3 and 4 provide streaming of calls to improve application performance over that constrained by the latency of the network and turnaround time, when blocking awaiting the return from a conventional non-streamed RPC.
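The gain from streaming can be illustrated with a deliberately simplified cost model. The sketch below is in Python and is not taken from the protocols of chapters 3 and 4; the function names, the single round trip charged to a streamed batch, and the figures in the usage note are all assumptions made for illustration.

```python
def blocking_time_ms(calls, round_trip_ms, service_ms):
    # Each call waits for its own reply before the next call is issued.
    return calls * (round_trip_ms + service_ms)


def streamed_time_ms(calls, round_trip_ms, service_ms):
    # Calls are issued back to back, so (in this simplified model) the
    # network round trip is paid once and only server time accumulates.
    return round_trip_ms + calls * service_ms
```

With 10 calls, a 100 msec round trip and 5 msecs of server time per call, this model gives 1050 msecs for blocking calls against 150 msecs when the calls are streamed.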

1.10 Sequence and Time

Lamport describes an algorithm for global agreement on sequencing. Based on these arguments Birman describes ABCAST, a protocol providing totally ordered broadcast.
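A minimal sketch of logical clocks in the style usually attributed to Lamport may help fix the idea of agreed sequencing. It is written in Python, the class and function names are hypothetical, and it is not claimed to reproduce the cited algorithm or ABCAST in detail.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # On receipt, jump past the timestamp carried by the message.
        self.time = max(self.time, sender_time) + 1
        return self.time


def total_order(events):
    # events are (timestamp, process_id, label) tuples; ties between equal
    # timestamps are broken by process id so every site sorts identically.
    return sorted(events, key=lambda e: (e[0], e[1]))


a, b = LamportClock(), LamportClock()
t_send = a.tick()            # process A sends a message
t_recv = b.receive(t_send)   # process B receives it, later in logical time
```

Because every site applies the same tie-break, all sites that see the same set of timestamped events arrive at the same sequence.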

Mills describes a protocol and algorithms for synchronising clocks in a distributed system. Liskov et al have combined Mills' work with Birman's to describe an algorithm for minimising handshaking in messaging systems in a distributed system with synchronised clocks as described in Mills.

These mechanisms permit less handshaking within distributed programs, since any ambiguity in ordering is removed by enforcing a global view of time, and thus of messages. There is a natural tradeoff between the extent to which this is enforced and the amount of concurrency in the system, but at the gain of significant simplicity for the programmer.^7

1.11 Management

A Distributed System is not useful outside a research laboratory without management. In chapter 5 we look at performance management, and in appendix 8 suggest hardware support for performance enhancement. In chapters 6 and 7 we look at interconnection, routing management, and security. In appendix 9 we examine some formal aspects of a system that may in the future allow us to prove that we meet contractual obligations in terms of correctness, at least in terms of service.

1.12 Related work

There are many background references on Architectures.^8 For instance, Tanenbaum and Halsall [Hal88a] describe the OSI and XNS architectures. See also the RFCs by Dave Clark and Comer [Ste91a] for the DARPA Internetwork Architecture.

Distributed System Designers' views of relevance include: Wide Area Communications [lhnX3a, B«,g*Ua, Cel83a, C»l83a], distributed systems [Cyp78a, DEC82a, ECM82a, Fri85a, (,iix5a, Sch85a, IEE83a] and more general communication architectures [Wec80a, Zim80a]. The architectures in these references are remarkably similar in that the layering and location of function or modularisation is largely similar; the majority are aligned with respect to choosing a connectionless networking layer.

More recent work, for instance the thesis by Lixia Zhang on Architectures, where a rate based network architecture is outlined, and the XTP work, merge more of the functions of the network and the transport layers, and also move more flow policing functionality into the network, while stopping short of distributing full connection state in switches.

7. Note that there is an implication for the security of a distributed system whose correct or performant operation depends on synchronised clocks. The synchronisation subsystem must be secure.

8. Webster's defines this as:

2. Chapter 2: Protocol Implementation Design Methodology

In this chapter we review lower level Protocol Design Methodologies, and examine relative layering strategies and overheads. We describe some of the commonly accepted building blocks for protocols, ranging from the specification of formats for packets including header control information, to the elements of procedure or operational components of a protocol. We look at building blocks both for correct protocol operation, and for efficient provision of communication, both for the user and for the (usually shared) transmission network.

What is a protocol? First and foremost, it is a set of agreed rules on what is meant by each unit of data exchanged - i.e. it is the encoding or agreed meaning for bits/bytes/words exchanged. It is also a prescription of what procedures to adopt in each circumstance. Examples of such rules for format include the use of English in international diplomacy, the use of particular dictionaries for Scrabble and the use of the ASCII code for characters stored in computers. In Computer Communications Networks, for reliability, flow control, resource sharing, etc, we need protocols with complex procedures. Some of these building blocks are described below.

2.1 Some agreed formats in protocols

The DoD Internet Datagram protocol is a format for a packet, used commonly on LANs (Ethernet) and WANs (the Internet). At the time of writing, the Internet interconnects over one million host computers, and can be regarded as the most successful network, and employs one of the most successful protocol suites in existence. The IP Packet Format is shown in Figure 4.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver= 4 |IHL= 5 |Type of Service|        Total Length = 21      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Identification = 111     |Flg=0|   Fragment Offset = 0   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Time = 123  |  Protocol = 1 |        header checksum        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         source address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      destination address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     data      |
+-+-+-+-+-+-+-+-+

Figure 4. Example Structure of Internet Datagram
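The field values in Figure 4 can be packed into the 20 byte header with a few lines of code. The following Python sketch is illustrative only: the function name is hypothetical, the Type of Service value and both addresses are placeholders (the figure gives no values for them), and the checksum is left at zero.

```python
import struct


def build_example_header():
    version, ihl = 4, 5            # IHL counts 32-bit words: 5 words = 20 bytes
    tos = 0                        # placeholder; not given in the figure
    total_length = 21              # 20 byte header plus 1 byte of data
    identification = 111
    flags, frag_offset = 0, 0
    ttl = 123
    protocol = 1
    checksum = 0                   # not computed in this sketch
    src = bytes(4)                 # placeholder addresses
    dst = bytes(4)
    # "!" selects network (big-endian) byte order for every field.
    return struct.pack(
        "!BBHHHBBH4s4s",
        (version << 4) | ihl,
        tos,
        total_length,
        identification,
        (flags << 13) | frag_offset,
        ttl,
        protocol,
        checksum,
        src,
        dst,
    )


header = build_example_header()
```

Version and header length share the first octet, and the 3 flag bits share a 16-bit word with the 13-bit fragment offset, which is why both pairs are combined by shifting before packing.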

The order of bits and bytes must be specified unambiguously too, as shown in Figures 5 and 6:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       1       |       2       |       3       |       4       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       5       |       6       |       7       |       8       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       9       |      10       |      11       |      12       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 5. Transmission Order of Bytes

 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|1 0 1 0 1 0 1 0|
+-+-+-+-+-+-+-+-+

Figure 6. Significance of Bits

A long running (but pointless) debate over the merits of big-endian versus little-endian bytes and words is described well in Cohen.
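The practical consequence of byte order is easy to show. This Python sketch packs the same 32-bit value in both orders; the variable names are illustrative, and reading bit 0 of Figure 6 as the most significant bit is an assumption based on the usual Internet convention.

```python
import struct

value = 0x01020304

# Network order sends the most significant byte first, as in Figure 5.
network_order = struct.pack("!I", value)

# A little-endian host stores the same value least significant byte first.
little_endian = struct.pack("<I", value)

# Assuming bit 0 is the most significant bit, the pattern 1 0 1 0 1 0 1 0
# of Figure 6 is read as a plain binary number.
octet = int("10101010", 2)
```

A protocol implementation therefore converts explicitly at the boundary rather than copying host memory onto the wire.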

2.2 Some Rules for Protocols

Like a human conversation, a protocol is a set of agreed rules for structuring communication between parties. To make conversations possible, there must be some rules or protocols. Examples of such structures are: question and answer; the monologue (broadcast); the conversation; a cry for help (broadcast); an order and an acknowledgement; the use of announcement of identity of caller/callee in a telephone call; the agreement to abide by an arbiter in a cricket match; the handshake at the end of a business deal. There are many others.

Each end point of a communication session is autonomous, that is each can decide to stop or start talking at any time, or to interrupt another party. This involves concurrency, and is what makes the design and implementation of protocols difficult; the software to implement any protocol of significant interest involves concurrency. It is inherent in communication that there is true interleaving of events (programs managing each end of a conversation). There are non-deterministic events (users may type at a keyboard at any time, as someone on the telephone may speak at any moment). This means that polling implementations are not often feasible: the number of possible events that must be polled for and the time to execute a polling loop are prohibitive, and differing event streams happen at different, varying rates. This leads to event driven programming, and other ways of structuring communicating systems, often involving concurrency. We discuss this further below.

[Time-sequence diagram; packet labels in order of appearance: CR, CC, DT, DT, DR, DC]

Figure 7. An example of an invented protocol behaviour

CR  Connection Request
CC  Connection Confirm
CI  Connection Indication
DT  Data
AK  Acknowledgement
DR  Disconnect Request
DC  Disconnect Confirm
DI  Disconnect Indication

TABLE 3. Packet Type and Event Key
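The rules of procedure behind an exchange such as that of Figure 7 are commonly written down as a state table. The Python sketch below is one hypothetical reading of the invented protocol for a single end point: the state names and the transition table are assumptions, and only the packet types come from Table 3.

```python
# (current state, packet type) -> next state
TRANSITIONS = {
    ("closed", "CR"): "connecting",
    ("connecting", "CC"): "open",
    ("open", "DT"): "open",
    ("open", "AK"): "open",
    ("open", "DR"): "closing",
    ("closing", "DC"): "closed",
}


def run(events, state="closed"):
    """Apply a sequence of packet types, rejecting any that break the rules."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event} not allowed in state {state}")
        state = TRANSITIONS[key]
    return state
```

Feeding the sequence CR, CC, DT, DT, DR, DC through `run` returns the end point to the closed state, while a DT before any connection is rejected.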

2.3 What is Concurrency?

In the software world, we commonly find three types of concurrency:

1. Fake concurrency provided by a multiprogramming environment or operating system. This exists down to the level of Interrupts and up through System Calls to concurrent programming environments.

2. Real concurrency in a tightly coupled multi-processor environment. This does not interest us here, as communication in such an environment is generally provided by hardware, and such devices are special purpose (e.g. graph reduction machines for functional languages, Digital Array Processors etc).

3. Real concurrency in a Loosely Coupled Distributed system.

2.4 Interleaving

Because of the unpredictable nature of the real world input to a system - the non-deterministic order of events - it is far easier to model the system behaviour as a set of (cooperating) processes/tasks/threads, each of which deals with an appropriate stream of events. This can then be implemented using co-routines, or multi-processing. We then have to view the system as if these tasks actually run concurrently (at the same time).

What actually happens in a single system is that they are interleaved. This means that at any point in execution, one task could stop, then another run. There will be some indivisible (atomic) unit of execution that cannot be suspended. In pre-emptive operating systems, the atomic unit may be as small as a single instruction. In Run to Completion real-time systems, this unit will be decided by the programmer. In a distributed system, there is real concurrency whether we want it or not. This brings special problems - for instance partial failure of a system is possible.

2.5 Problems with Concurrency

The first and last cases of concurrent systems in the previous section concern us. What are the special programming problems in such systems?

Firstly, whether we have an interleaved system, or a real distributed system, we have to protect Critical Regions of code. These are sections of code that may alter the value of Shared Variables. Because of concurrency, it is possible for incorrect values to arise, if the sequence of operations on these variables is not carefully controlled. For instance, the arbitrary interleaving of two threads of control that both carry out a fetch, modify and store sequence of operations on the same variable results in a high probability that only one of the modifications is made permanent. Secondly, we have to control ownership of resources. A number of incorrect behaviours can result from incorrect sequences of resource allocation, including deadlock, livelock and problems with fair allocation of resources.
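The lost modification can be reproduced deterministically by writing out the interleaving by hand. In this Python sketch the two tasks, the step names and the schedules are illustrative assumptions; each task increments a shared variable through separate fetch, modify and store steps.

```python
def run_schedule(schedule):
    """Run the fetch/modify/store steps of tasks A and B in the given order."""
    shared = {"x": 0}
    local = {"A": None, "B": None}   # each task's private copy
    for task, step in schedule:
        if step == "fetch":
            local[task] = shared["x"]
        elif step == "modify":
            local[task] += 1
        elif step == "store":
            shared["x"] = local[task]
    return shared["x"]


# The critical region is respected: A finishes before B starts.
serial = [("A", "fetch"), ("A", "modify"), ("A", "store"),
          ("B", "fetch"), ("B", "modify"), ("B", "store")]

# Both tasks fetch the old value before either stores.
interleaved = [("A", "fetch"), ("B", "fetch"), ("A", "modify"),
               ("B", "modify"), ("A", "store"), ("B", "store")]
```

The serial schedule leaves the variable at 2, while the interleaved one leaves it at 1: one of the two modifications has been lost.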

[Diagram: one process "Has Tape" and "Needs Printer"; the other "Has Printer" and "Needs Tape"]

Figure 8. Deadlock

[Diagram: processes alternate between "Has R1" and "Needs R1", while another process that "Needs R1" still has not got R1]

Figure 9. Livelock
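The tape and printer situation of Figure 8 can be detected mechanically by looking for a cycle among processes that wait for one another. The Python sketch below is illustrative; the function name, the process names P1 and P2, and the restriction to one awaited resource per process are assumptions.

```python
def find_deadlock(holds, needs):
    """holds maps resource -> owning process; needs maps process -> resource
    it is waiting for. Returns the processes in a wait cycle, or None."""
    # Derive who waits for whom: a process waits for the owner of what it needs.
    waits_for = {p: holds[r] for p, r in needs.items() if r in holds}
    for start in waits_for:
        seen = [start]
        current = waits_for[start]
        # Follow the chain of waits until it ends or revisits a process.
        while current in waits_for and current not in seen:
            seen.append(current)
            current = waits_for[current]
        if current == start:
            return seen
    return None


holds = {"tape": "P1", "printer": "P2"}
needs = {"P1": "printer", "P2": "tape"}
cycle = find_deadlock(holds, needs)
```

Here each process waits for the other, so both are reported; if P2 did not need the tape, the chain of waits would end and no deadlock would be found.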

2.5.1 Fairness:

The presence of a resource greedy process can mean other processes do not get access to a resource as often as necessary. An example of this often occurs in operating systems which over-prioritise a tape drive, so that terminal access is radically slowed down when the tape is running.

Andrews defines 3 levels of fairness:

1. Unconditional Fairness

Every unconditional atomic action (or task) is eventually scheduled.

The networking analogy is that data is always eventually transferred, no matter how little the end user retries.

2. Weak Fairness

Unconditionally fair, and conditional atomic actions (tasks) are eventually executed if their guard becomes true and stays true. The networking analogy is that data is always eventually transferred, provided that once the end user has started to try to send data, they persist.

3. Strong Fairness

Unconditionally fair, and conditional atomic actions (tasks) are eventually executed if their guard is infinitely often true (i.e. true infinitely often in the history of a non-terminating task). The networking analogy is that data is always eventually transferred, so long as having started to try to send data, the end user persists, every now and then.

The model of windowing that TCP adopts (described in chapter 5) achieves a form of weak fair-sharing, and is in part a consequence of these considerations.

2.5.2 Formal Systems

Formal systems now exist for defining and checking the behaviour of such systems of distributed concurrent tasks - for instance Milner's CCS, Hoare's CSP, and the OSI LOTOS.

Hoare's CSP is based on techniques of program design and testing that derive from weakest pre-condition approaches, finding loop invariants and minimising interference between concurrent modules, so that safe sequential programming techniques may be used as far as possible, and the combinatorial explosion of states of concurrent programs is slowed.

One application of these techniques has been the ISO layer approach for specifying protocols. One of the dangerous consequences of this particular approach is that it is often taken literally in implementations. As we shall see, this can lead to very poor performance. A serious limitation of these approaches has been that although a formal specification can lead to the elimination of safety or liveness problems, the problems of timeliness and fairness are far harder to chase down. Indeed, it is often the case that hidden events are introduced to simplify a definition (by regarding these events as outside the scope of the system being specified). This very act can lead to lack of fairness.

2.6 Consumers, Producers and Critical Regions

Using the paradigm of layering for modularisation, we have a natural model for building communications protocols where a service is provided by a layer to a layer above. We have seen that there are only 2 external interfaces to a protocol layer - lower layer input events, and upper layer request events. Corresponding to these, there are requests from this layer to the layer below, and indications (events delivered to them) from this layer to the layer above. There are also internal events (e.g. timers) and control events (e.g. network resets, user aborts etc).

We have modelled a layer as a set of cooperating tasks, processes or threads of control, which are essentially concurrent. Although this concurrency is usually fictional (i.e. interleaved execution by a single processor, rather than multiprocessor), it is convenient for designing and implementing protocols - a thread for each stream of events is easier to program than a calculated polling loop, with the fraction of time almost impossible to calculate for each event stream (to take just one poor alternative approach). Having chosen to implement most protocols as a set of cooperating concurrent threads of control, we are left with the problem that the cooperation is often by means of shared memory. This leads to one main problem, and a set of design

• Each thread within a layer dealing with events from the layer below is a consumer of those events, and generally produces events for the layer above as a result.

• Thus through many layers we have a chain of producers from top down to the bottom, and another chain of producers from bottom to top. Each chain of producers is matched by a chain of consumers.

• All of this is to ensure lack of deadlock, although it cannot ensure fairness (lack of livelock etc).

• It is intuitively obvious that the system makes flow control reasonably easy to implement.

This software structure is illustrated in Figures 10 through 13. The flow of data and control through the system is shown in Figure 14.

1/2 of Transport Protocol - TS1

Consumer of User events. Producer of Network requests.

for(ever)
{
    sleep(tsdu-q-event);

    prev = spl-ts();
    tsdu-pkt = deq(tsdu-q);
    spl-restore(prev);

    -- make up some tpdu (transport pkts)
    -- do all the rtx q work and set timers
    -- if this is that service/kind of transport protocol
    -- (and network service is not enough to give us the
    -- service we are asked for)

    prev = spl-ns();
    enqueue(nsdu-q, tpdu-pkt);
    spl-restore(prev);

    wakeup(nsdu-q-event);
}

Figure 10. Transport Entity 1

2/2 of Transport Protocol - TS2

Consumer of Network input events. Producer of User events.

for(ever)
{
    sleep(tpdu-q-event);

    prev = spl-ts();
    tpdu-pkt = deq(tpdu-q);
    spl-restore(prev);

    -- decode tpdu
    -- queue for user
    -- generate acks (with appropriate window updates)

    prev = spl-ts();
    enqueue(tsdu-q, tsdu-pkt);
    spl-restore(prev);
}

Figure 11. Transport Entity 2

1/2 of Network Protocol - NS1

Consumer of Transport events. Producer of Link requests.

for(ever)
{
    sleep(nsdu-q-event);

    prev = spl-ns();
    nsdu-pkt = deq(nsdu-q);
    spl-restore(prev);

    -- make up some npdu (network pkts)
    -- do all the rtx q work and set timers
    -- if this is that kind of Network protocol

    prev = spl-ls();
    enqueue(lsdu-q, npdu-pkt);
    spl-restore(prev);
}

Figure 12. Network Entity 1

2/2 of Network Protocol - NS2

Consumer of Link input events. Producer of Transport events.

for(ever)
{
    sleep(npdu-q-event);

    prev = spl-ns();
    npdu-pkt = deq(npdu-q);
    spl-restore(prev);

    -- decode npdu, and demultiplex to appropriate
    -- transport entity

    prev = spl-ts();
    enqueue(tpdu-q, tpdu-pkt);
    spl-restore(prev);

    wakeup(tpdu-q-event);
}

Figure 13. Network Entity 2

Some definitions:

• sleep - waits (suspends this task/process/thread) till some other task wakes us up with the appropriate event.

• wakeup - wakes up any task sleeping for this event.

• spl-? - set priority level to exclude any tasks from waking up who need this level of access. This can be implemented using a semaphore, or else the entire protocol layer might be a monitor.

• spl-x - restore all previous priority levels.

• que - add a buffer/packet to a queue for another process to...

• dequeue - remove from a queue some buffer.

• -- is a comment

Corresponding to Figure 10, there is a consumer of network events (TS2) and producer of transport service events to the user. For Figure 12, there is a consumer of link events, producing network input to the transport task (NS2).

Thus in a link plus transport protocol, for a full system from one host (user) to another, we have a chain of production and consumption which contains at least 8 tasks!
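The downward half of this chain can be tried out with ordinary threads and queues. The Python sketch below is an illustration rather than the implementation used in this thesis: the function names, the one byte "headers" and the STOP sentinel are assumptions, and the standard library queue stands in for the sleep, wakeup and priority level operations, since it already provides blocking and mutual exclusion.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def transport_entity(tsdu_q, nsdu_q):
    # Consumer of user events, producer of network requests.
    while True:
        tsdu = tsdu_q.get()          # blocks until the user queues something
        if tsdu is STOP:
            nsdu_q.put(STOP)
            return
        nsdu_q.put(b"T" + tsdu)      # make up a transport PDU


def network_entity(nsdu_q, lsdu_q):
    # Consumer of transport events, producer of link requests.
    while True:
        nsdu = nsdu_q.get()
        if nsdu is STOP:
            lsdu_q.put(STOP)
            return
        lsdu_q.put(b"N" + nsdu)      # make up a network PDU


def send_all(messages):
    tsdu_q, nsdu_q, lsdu_q = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=transport_entity, args=(tsdu_q, nsdu_q)),
        threading.Thread(target=network_entity, args=(nsdu_q, lsdu_q)),
    ]
    for w in workers:
        w.start()
    for m in messages:
        tsdu_q.put(m)
    tsdu_q.put(STOP)
    for w in workers:
        w.join()
    out = []
    while True:
        item = lsdu_q.get()
        if item is STOP:
            return out
        out.append(item)
```

Each message crosses two queues and two thread switches on the way down, which gives a feel for the per-layer cost discussed in this section.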

Another aspect of communication which does not quite fit this paradigm as presented is that of fragmentation and reassembly. When sending a protocol data unit through a layer, it may be that a layer above has chosen a convenient unit size which is not convenient for the layer below. Some might construe this as poor design, but it is a common consequence of protocol implementation when based on a layered approach. This then requires the sender to break down N+1 layer Data Units into several N-layer PDUs. These must also be reassembled in a receiver or intermediate node. Which is the better choice is not clear. How to apply queueing, pacing or other techniques to fragments of a PDU is also not obvious.
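A minimal sketch of fragmentation and reassembly follows, in Python. The fragment layout (index, total count, data) and the function names are assumptions made for illustration; real protocols carry offsets and identifiers in their own header formats.

```python
def fragment(pdu, max_size):
    """Split one (N+1)-layer data unit into N-layer fragments."""
    # Ceiling division; an empty unit still travels as a single fragment.
    total = (len(pdu) + max_size - 1) // max_size or 1
    return [
        (index, total, pdu[index * max_size:(index + 1) * max_size])
        for index in range(total)
    ]


def reassemble(fragments):
    """Rebuild the data unit; returns None until every fragment is present."""
    if not fragments:
        return None
    total = fragments[0][1]
    by_index = {index: data for index, _, data in fragments}
    if len(by_index) < total:
        return None
    return b"".join(by_index[i] for i in range(total))


frags = fragment(b"abcdefghij", 4)
```

Keying the fragments by index lets the receiver tolerate reordering and duplicates, but it must hold every fragment until the last one arrives, which is one reason the choice of reassembly point matters.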

  User a                      User b
    |                           |
 TS2  TS1                    TS2  TS1
 NS2  NS1                    NS2  NS1
    |                           |
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Figure 14. Flow of Information between Protocol Entities

The model presented above and in Figures 10 through 14 shows exactly why the number of concurrent layers implemented at each end should be minimised. Both the Sequenced Exchange Protocol and Multicast Transport Protocol were designed to run in such a system, with a maximum of 3 concurrent layers at each end. The cost in terms of latency and CPU usage associated with the context switch implied by the mutual exclusion above is only required for the network to transport level, and transport to application. Any other layer processing (e.g. presentation coding/decoding) should be achieved inline. A connection-oriented network service requires at least one more process layer, and yet (as argued earlier) does not obviate the need for a powerful transport protocol.

2.7 The Internet Architecture

The DoD DARPA Internet Architecture is an architecture based on two principles that differ in emphasis from the ISO OSI model:

• Fault Resilience built into the Network

• End-to-End fate sharing protocols.

The unit of interconnection is the network, rather than the single link. Fault resilience is achieved by choosing to support only the lowest common denominator of network services.

The Internet Datagram Protocol allows the interconnection of heterogeneous networks by datagram switching gateways (sub-level 3c in the OSI model), as illustrated in Figure 15. These can be relatively simple devices, and only need support enough of the functionality of the networks they connect to be able to send and receive datagrams on each network. The gateways that interconnect networks [Sun75a] route datagrams dynamically depending on available paths and network loads.

[Diagram: several WANs and LANs interconnected by gateways]

Figure 15. A Fictitious Internet

Datagrams support simple best attempt packet delivery. This means that we need only to be able to send and receive packets, deal with addresses and checksum headers. Only error detection is done, and packets in error (either checksums fail, or mis-directed) are discarded. It is up to higher level protocols to deal with problems. Datagram protocols provide some further functions including fragmentation, to deal with networks that support varying maximum packet sizes; time to live specification, to enable the network to discard packets too old; header checksums, to prevent corrupt packets unnecessarily congesting gateways.
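The header checksum is cheap enough for a gateway to verify on every packet. The Python sketch below assumes the checksum meant is the standard Internet one (the 16-bit ones' complement of the ones' complement sum of the header words); the function names and the example header bytes are illustrative.

```python
def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return ~total & 0xFFFF


def header_is_valid(header):
    # A header whose checksum field is filled in sums to zero.
    return internet_checksum(header) == 0


# Example 20 byte header with its checksum field (bytes 10-11) set to zero.
example = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
filled = example[:10] + internet_checksum(example).to_bytes(2, "big") + example[12:]
```

A gateway that finds `header_is_valid` false simply discards the packet, leaving recovery to the end-to-end protocols above.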

There is usually one IP implementation (process/task/thread) per machine, and it is usually built in the kernel of the operating system. This may talk to several subnet or local network interfaces, which are usually implemented in drivers. Above the IP layer, there may be several transport protocols, each of which exists once per machine. These must then multiplex between users/applications.

network service: the actual operating parameters can vary by orders of magnitude during a single session. For example, typical implementations can tolerate delay variations from 2 msecs to 20 secs, throughput in the range from 1bps to 20 Mbps and packet loss rates from 0 to 50%.

The Transmission Control Protocol provides a reliable stream of bytes between two processes on different machines. It relies only on IP for packet delivery, and can deal with packets lost, out of order or duplicated. TCP provides the following protocol functionality:

• End-to-end addressing using Ports.

• Full Duplex connections, using a single packet format, with control and data indicated in a single 20 byte header.

• Error detection using low cost checksums.

• Recovery using an Acknowledgement and Retransmission scheme.

• Duplicate detection and segment ordering using sequence numbers.

• Flow control using Dynamic Windowing.
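Duplicate detection and ordering by sequence number can be sketched in a few lines. The Python below is a simplification and not TCP itself: it numbers whole segments rather than bytes, and the class and method names are hypothetical.

```python
class Receiver:
    """Delivers segments in order and drops duplicates, by sequence number."""

    def __init__(self):
        self.next_seq = 0      # next sequence number expected
        self.pending = {}      # out-of-order segments held back
        self.delivered = []    # data handed to the user, in order

    def on_segment(self, seq, data):
        if seq < self.next_seq or seq in self.pending:
            return self.next_seq            # duplicate: ignore, re-acknowledge
        self.pending[seq] = data
        # Release every segment that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq                # cumulative acknowledgement


r = Receiver()
acks = [r.on_segment(seq, data)
        for seq, data in [(1, "b"), (0, "a"), (1, "b"), (2, "c")]]
```

The returned value acts as a cumulative acknowledgement: a sender that keeps seeing the same value knows which segment to retransmit.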

2.7.1 Management

The Internet Architecture has a rich naming architecture with several levels. The Address Resolution Protocol (ARP) is used for IP address to local network address mapping. The ICMP is used for error reporting and diagnostics as well as subnet configuration request and response messages. A plethora of routing information protocols including GGP/EGP/IGP/BGP is used for gateway to gateway exchanges to build up a distributed topological view. The Domain Name Scheme is used for a service name (string) to IP address mapping.

In contrast to management "in the large" that is found in the IEEE and ISO models of network management, these functions are embedded in the layers that their functions support. The SNMP protocol is used for monolithic network management access to Internet objects described in the Management Information Base objects in a remote system. On the other hand, the embedded approach is exemplified by the inclusion of dynamic or adaptive protocols such as ARP for distributed address resolution and ICMP for route dissemination and traffic source quenching.

In chapters six and seven, this simplifying approach is used in looking at LAN interconnection and some possible embedded security mechanisms.

2.8 Protocol State and Parameterization

Figure 16. Connection Phases

[Diagram: Satellite Path, 64kbps, 2 secs; Terrestrial Link, 2Mbps, 100 msecs]

Figure 18. Windows and Delay/Bandwidth Product

The bandwidth/delay product in modern networks is such that to achieve reasonable throughput, outstanding packets must be permitted. This trend is increasing, militating against stop-and-go style protocols (at any level, be it transport, or application or elsewhere). Windows may be used to specify a receive buffering, and with or without explicit "Receiver Ready/Receiver Not Ready" style messages, can flow control a sender. However, a window is required to make good use of modern networks. Consider two typical paths with the characteristics shown in Figure 18, with a terrestrial megastream path and a satellite kilostream path, both supporting 1000 byte packets. A "stop and go" protocol would lead to possible throughput:

Terrestrial: 1000 bytes in 100 msecs = 80Kbps.

Satellite: 1000 bytes in 2 secs = 4Kbps.

The solution is to use dynamic windowing. The reason static windows are not sufficient in a datagram network is that the path may change. This is studied further in chapter five.
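The same arithmetic also gives the window needed to fill each path. In this Python sketch the function names are illustrative, and the delay figures of Figure 18 are treated as the full turnaround time per packet, as in the throughput figures above.

```python
def stop_and_go_bps(packet_bytes, delay_ms):
    # One packet is sent per turnaround period.
    return packet_bytes * 8 * 1000 // delay_ms


def window_packets(link_bps, delay_ms, packet_bytes):
    # Bits that fit "in flight" on the path, rounded up to whole packets.
    bits_in_flight = link_bps * delay_ms // 1000
    packet_bits = packet_bytes * 8
    return (bits_in_flight + packet_bits - 1) // packet_bits


terrestrial_window = window_packets(2_000_000, 100, 1000)
satellite_window = window_packets(64_000, 2000, 1000)
```

Under these assumptions the terrestrial path needs 25 outstanding packets and the satellite path 16, so a window fixed for one path is wrong for the other, which is the argument for adjusting the window dynamically.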

2.9 Different Architectures

The OSI and DoD architectures are designed specifically to support traditional applications such as Remote Terminal Access, File Transfer, Electronic Mail and Remote Job Entry. Modern applications in Distributed Systems require a different range of protocols. For instance, File Access can be built with a general facility for remote or distributed execution. Applications like this have different communication requirements which can be characterised as requiring the following features:

• Request/Response call oriented

• Low latency per call

• Calls possibly streamed

• Programming Language integration

We discuss below approaches that have emerged to provide this additional functionality.

2.9.1 Berkeley/Sun and First Order Distributed Computing

In the early 1980s, DARPA funded the Department of Electronic and Electrical Engineering at the University of California, Berkeley, to implement the Internet Protocols. The implementation team went further and provided a general purpose communications programming environment based on sockets, which is in some sense a First Order distributed computing system.