A computing structure for data acquisition in high energy physics

(1)

A COMPUTING STRUCTURE FOR DATA ACQUISITION _IN _HIGH _ENERGY _PHYSfCS--

by

GARRY ALEXANDER LESTER

Submitted for _the Degree _of

Doctor _of _Philosophy

at the

University

_{of Salford}

in _the

Department _of Electronic _{and Electrical} Engineering

1988

MCIMLXXXVIII

(2)

(3)

My theory that I have follows _the lines _that

I am about to relate

(4)

Contents

ACKNOWLEDGMENTS _xiv

ABSTRACT _xv

CHAPTER 11

HISTORICAL REVIEW 1

1.1 History _of Computing I

1.1.1 Mechanical Calculating Engines 1

1.1.2 Electro-Mechanical Computing Machines 3

1.1.3 Electronic Computers 3

1.1.4 Mechanisms For Introducing Parallelism 4

Multiprocessors 5

Multifunction Processors 5

Processor Arrays ₅

Pipeline Processing 6

1.1.5 _Scalar _{and Vector} _Processors ₆

1.1.5.1 _Scalar _Processors 7

1.1.5.2 Vector Processors 8

1.1.6 _Processor Arrays 9

1.1.7 Array Processors 10

1.1.8 Orthogonal _{and Associative} Processors 10

1.2 Classification Schemes ₁₂

(5)

CONTENTS V

1.3 Current Computer Research 16

1.3.1 Electron ic Parallel Processors 16

1.3.1.1 Direct Networks 17

1.3.1.2 Indirect Networks 18

1.3.1.3 Bus Systems 19

1.3.1.4 Cellular Array Processors 20 1.3.1.5 Systolic _{and Wavefront} Arrays 21 1.3.1.6 Data Driven _and Demand Driven

Computing 22

1.3.1.7 Other Schemes 24

1.3.2 _Digital Optical Computers 24

1.3.3 Biologic _al Computers 27

1.4 Computers in Experimental High Energy Physics 27

CHAPTER 2 30

2 COMPUTING REQUIREMENT 30

2.1 Introduction ₃₀

2.2 _Present _Data _Collection _System ₃₀

2.2.1 _Experiments _{and Detectors} ₃₃

2.2.2 _{The Event} _Manager 34

2.2.2.1 Operating Features 34

2.2.2.2 Singles Mode 34

2.2.2.3 Biparameter Interface 35

2.2.2.4 Multiparameter Event Handling 36

(6)

CONTENTS

2.2.3 Limitations _{and Bottlenecks} 37

2.2.3.1 Current Data Rates 38

2.3 Design Aims 38

2.4 Processing by Look-Up Table 39

2.4.1 Estimation _{of Memory} Requirements 39

2.5 Analytical Solution 40

2.5.1 Serial Nature _of Data Taking 40

2.5.2 Parallel Requirement _of Processing 41

2.5.3 Event Nature _{of Data} 41

2.5.4 Possible Granularity _of Parallelism 41

vi

CHAPTER 3 42

3 _THEORETICAL CONSIDERATIONS 42

3.1 Introduction 42

3.2 Processor Criteria 42

3.3 Assumptions _{on the} _{Form of the Data} 45

3.4 Possible Structures 45

3.4.1 Bus Connected Structures 45

3.4.2 Direct Network Computers 46

3.4.2.1 Usage Pattern 47

3.4.2.2 Tree Structures 47

3.4.2.3 Pyramid Structures 48

(7)

CONTENTS _vii

3.4.2.4 Ring Structure 49

3.4.2.5 Cylindrical Structures 49

3.4.2.5.1 Feed Mechanism 50 3.4.2.6 Distinct Node Flow Models 51

3.4.2.6.1 Single Ring 53

3.4.2.6.2 Cylinder 61

3.4.2.7 Homogeneous Flow Model 72

3.4.2.7.1 Cylinder 72

3.5 Relationship To Other Topologies 79

3.6 Relationship to Other Systems 80

3.7 Fault Tolerant Nature _of _{a Cylinder} 81

3.7.1 Path Redundancy 81

3.7.2 Link Failure 82

3.7.2.1 Distinct Node Case 82

3.7.2.2 Homogeneous Node Case 82

3.7.3 Processor Failure 83

3.7.3.1 Data Sink Failure 83

3.7.3.2 Data Source Failure 83

3.7.3.3 Faulty Processing Failure 84

3.7.4 No-Action Resilience to Faults 85

3.7.5 On-Line Replacement _of Processors 86

3.7.5.1 Reprogramming 86

3.7.5.2 Hardware Requirement 87

3.8 Distributed Depth First Search 88

3.8.1 Development _of the DDFS Algorithm 89

(8)

CONTENTS _viii

3.8.2 Circuits _{and Self-Loops} ₉₁

3.8.3 _{Use of Multi-Programming} ₉₁

3.8.4 _{Use of DDFS for} _testing ₉₂

CHAPTER 4 ₉₄

4 _SIMULATION _{OF PROCESSING STRUCTURES} ₉₄

4.1 _Introduction ₉₄

4.2 _Processing _Element _Model ₉₄

4.3 _{Communication} _Hardware _Model ₉₅

4.4 _Iterative _Nature _of _the _Simulation ₉₅

4.5 _{Communication} ₉₆

4.6 _Data _Input _Mechanism ₉₆

4.7 _Computation ₉₈

4.8 _Performance _Indicators ₉₉

4.9 _Startup _Effects ₉₉

4.10 _Preferential _{Communication} _Algorithms ₁₀₁

4.11 _Distinct _{Node Simulations} ₁₀₁

4.11.1 _Data _Routing _Information ₁₀₂

4.11.2 _Ring _Simulation ₁₀₂

4.11.2.1 _Algorithms _Investigated ₁₀₂

(9)

CONTENTS _ix

4.11.2.2 _Performance _of _the _Algorithms ₁₀₄ 4.11.2.2.1 _{Communication}

Bound ₁₀₅

4.11.2.2.2 _Intermediate _Data

Types ₁₀₉

4.11.2.2.3 _Computation _Bound ₁₁₄

4.11.3 _Cylinder _Simulation ₁₁₆

4.11.3.1 _Algorithms _Investigated ₁₁₆ 4.11.3.2 _Performance _of _the _Algorithms ₁₁₉

4.11.3.2.1 _{Communication}

Bound 119

4.11.3.3 _Larger _Data _Types ₁₂₂

4.12 _Homogeneous _Processing _Simulation ₁₃₈

4.12.1 _Ring _Simulation ₁₃₈

4.12.1.1 _Algorithms _investigated ₁₃₈ 4.12.1.2 _Performance _of _the _Algorithms ₁₃₉

4.12.2 _Cylinder _Simulation ₁₄₃

4.12.2.1 _Algorithms _Investigated ₁₄₃ 4.12.2.2 _Performance _of _the _Algorithms ₁₄₅

4.12.2.2.1 _Processing

Throughput 145

4.12.2.2.2 _Fault _Tolerance ₁₅₈

CHAPTER 5

5 PRACTICAL METHODOLOGY

169

(10)

CONTENTS

5.1 Introduction 169

5.2 Hardware Overview 169

5.2.1 Processor Memory Configuration 170

5.2.2 Communication Protocol Selection 171

5.2.3 Additional Input/Output Facilities 171

5.2.4 Power Supply _and Physical Support 172

5.2.5 External Communication Interface 172

5.3 Interface Board Design 172

5.3.1 Serial Communication Protocol 172

5.3.2 Availability _of Suitable Devices 174

5.3.3 Transmit Receive Synchronisation 174

5.3.4 Receive Circuit 176

5.3.5 Transmit Circuit 177

5.3.6 Interface to RS-232 C 178

5.3.6.1 Parallel Bus _simulation 179

5.3.6.2 Control Logic 179

5.3.6.3 Clock Generation 181

5.4 Processor Board Design 181

5.4.1 Link Status Latches 182

5.4.2 PIO Addressing 184

5.4.3 Link Addressing 184

5.4.4 Memory Decoding 185

5.4.5 Clock Generation 186

5.5 Hardware Construction 187

(11)

CONTENTS _xi

5.6 Software Overview 188

5.6.1 Programming Languages 188

5.6.2 Process Scheduling Mechanism 190

5.6.3 Communication Data Structures 192

5.6.4 Communication Functions 193

5.6.5 Inter-Process Communication Addressing 194

5.6.6 Link Table Structure 194

CHAPTER 6 197

6 PRACTICAL IMPLEMENTATION 197

6.1 Introduction 197

6.2 Programming Methods 197

6.2.1 Circuit Switching 197

6.2.1.1 Storage Requirements 198

6.2.1.2 Programming Time 198

6.2.2 Identical Mutual Programming 199

6.2.2.1 Storage Requirements 200

6.2.2.2 Programming Time 200

6.2.2.3 Simultaneous Programming 201 6.2.2.4 Error Correction Mechanisms 202 6.2.2.5 Programming Algorithm Adopted 203 6.2.2.6 Differing Processor Functions 206

6.3 Data Processing 207

6.3.1 Data Format 208

(12)

CONTENTS _xii

6.3.2 Data Supply 209

6.3.3 Data Display 209

6.3.4 Processing Behaviour 210

6.3.4.1 Distribution Preservation 210

6.3.4.2 Processing Speedup 211

6.3.4.2.1 Ring 211

6.3.4.2.2 Column 213

6.3.4.2.3 ILI Structures 215

6.3.4.3 Fault Tolerance 216

6.3.4.4 Lost Events 217

6.4 Distributed Depth First Search Algorithm 217

6.4.1 Initial Implementation 217

6.4.2 Multiprocessing Implementation 222

CHAPTER 7 226

7 EVALUATION AND CONCLUSIONS 226

7.1 Summary ₂₂₆

7.2 Evaluation ₂₂₇

7.3 Further Research 228

7.3.1 Processor Selection 228

7.3.2 Node Structure 229

7.3.3 Data Supply 229

7.3.4 Data Retrieval 230

7.3.5 Programming Error Correction ₂₃₀

(13)

CONTENTS _xiii

7.3.6 Algorithm Development 230

7.4 Final Implementation 231

7.5 Conclusions 231

REFERENCES 233

APPENDICES Vol II

(14)

xiv

ACKNOWLEDGMENTS

I wish to thank _{my supervisor} Dr CA Owen for _allowing _me the freedom _to _approach _this _research in _{a largely} independant

way whilst _providing _constructive _advice _when this _{was sought.}

I would also wish to thank Dr ECG

_{Owen and his colleagues}

_at

Daresbury _Laboratory for _opening _{up this} _avenue _of _research _to me. Finally I _would _acknowledge Dr R Vaughan-Williams for

having

_written

_such

_{a splendid}

_set

_of

_symphonies

_which

_have

(15)

xv

ABSTRACT

A _review _of _the development _of _parallel _computing is presented, followed by a summary of currently recognised types of parallel computer and a brief summary of some applications

of parallel computing in the field of high energy physics.

The _computing _requirement _at _the data _acquisition _stage of a particular set of high energy physics experiments is detailed, _with _reference _to _the _computing _system _currently in use. The requirement for a parallel processor to process the

data from _these _experiments is _established _and _a _possible computing structure put forward.

The _topology _proposed _consists _of _a _set _of _rings _of processors stacked to give a cylindrical arrangement, an

analytical approach is used to verify the suitability and extensibility of the suggested scheme. Using simulation

results the behaviour of rings _and _cylinders of processors

using different algorithms for the _movement _of data _within the system _and different _patterns _of data input is _presented _and discussed.

Practical hardware _and _software details for _processing equipment capable of supporting such a structure as presented here is _given, _various _algorithms for _use _with _this _equipment,

(16)

xvi

Appendices _of _{constructional} _information _and _all _program

(17)

1

CHAPTER I

I _HISTORICAL _REVIEW

1.1 _History _of _Computing

1.1.1 _Mechanical _Calculating _Engines

The _earliest _automatic _computing _engines _appeared in europe in the early seventeenth century, and were purely

mechanical in operation. According to Hayes[641 the earliest

of these was one designed and built in 1623 by Wilhelm Schikhard, _of far _more influence however _{was the} _machine built

by _Blaise _Pascal in 1642 _though _even _this later _machine _was only capable of addition and subtraction. The next notable

event was the construction in about 1671 by Gotfried Leibniz of a calculator capable of addition and subtraction in the manner of Pascal's machine and also multiplication _and division.

These _machines _were largely _regarded _as _academic _curiosities

until mechanical _calculators _were _exploited _commercially in the 19th _century.

The _next _major figure _was _Charles _Babbage, he _proposed two _mechanical _computers, _the _now famous Difference Engine

(begun in 1823) _and _Analytical _Engine (conceived in 1834). Neither _machine _was _completed for _{a variety} _of _reasons but

probably principally because both machines were very adventurous and required rather better mechanical _engineering

(18)

HISTORICAL _REVIEW ₂

The difference _engine _of Babbage _was _conceived for _the purpose of automatic computation of _astronomical tables _using only addition and subtraction by the method of finite

differences, _hence _its _name. _Babbage _had _been _greatly impressed by _the _number _of _errors in _manually _computed _tables and his difference _engine was intended to not only calculate

the _entries _of _a table but to transfer these _results immediately _to _{an engravers} _plate _using _steel _punches.

The Analytical Engine _was _conceived _as _performing _any mathematical operation _{automatically} _and the final design

exhibited many of the main _elements of the electronic computers

built _over _{a century} _later. _{The analytical} _engine _{was to} have two main parts: the store, a memory unit consisting of sets of counter wheels, corresponding to the cathode ray tubes, mercury delay _lines _{And bistable} _electronic _circuits _used _{as memory} in electronic computers, and the mill, corresponding to a modern arithmetic logic unit being capable of performing the four basic _arithmetic _functions.

Babbage

_proposed

_to

_control

_the

_machine

_using

_punched

cards

not

unlike

those

developed

for

_use

_with

the

Jacquard

loom.

There

_were

_to

_{be two sets}

_of

_cards,

_operation

_cards,

used

to

control

the

operation

of

the mill

and variable

cards,

used

to select

the memory locations

to be used as the sources

of

operands

_{and the}

destination

of

results

for

_{a particular}

operation.

One

of

the

most

significant

contributions

of

Babbage's

_proposals

_{was a mechanism}

_{to allow}

_the

_sequences

_of

operations

to

be

_changed

_{automatically,}

in

_modern

_terms

(19)

HISTORICAL _REVIEW ₃

The _earliest known _reference _to _parallelism _in _computer design _is _thought _to _be _in _{a Sketch} _of _the _Analytical _Engine Invented by _Charles _Babbage by _General _LF _Menabrea _in ₁₈₄₂ according to Hockney _{and Jesshope[681.} Though _parallelism _was not incorporated in Babbage's _analytical _engine _{he was clearly}

aware of the _possibility _of its _application _to improve performance _over _{a century} before the technology _{was available}

to _make its large _scale _application _possible. _Though _neither of Babbage's _engines _were _constructed _mechanical four function

calculators were in widespread _use from _early in the _nineteenth century until _electronic _calculators became _widely _available cheaply in the latter half _of the 20th _century.

1.1.2 _{Electro-Mechanical} _Computing _Machines

In _the 1930s _and _1940s _]Konrad _Zuse _built _several

electro-mechanical _computers, _apparently _unaware _of _the _work _of Babbage, _Z3 built _around ₁₉₄₁ _is _believed _to _be _the _first

operational _general _purpose _program _controlled _computer, however _this _work _was _interrupted _by _the

second _world _war _and had _little _influence _on _later

machines. A more influential

machine

_{was the Harvard}

Mark I (originally

_called

_{the Automatic}

Sequence

Controlled

_Calculator)

_proposed

_in

₁₉₃₇

_by

_Howard

Aiken,

_which

_{was completed}

_{by IBM in 1944.}

1.1.3 _Electronic _Computers

The first _purely _electronic, _as _opposed _to

(20)

HISTORICAL _REVIEW

probably being ENIAC (Electronic Numerical Integrator and Calculator) built _at _the _University _of _{Pennsylvania.} _Though it weighed 30 tons _and contained 18,000 electronic valves it was

around 1000 times faster than the Harvard Mark I, taking 3 mS for _{a 10 digit} _{multiplication.}

The first _machines _employing the _stored _program _concept in _which _program _and data _reside _together in _the _{same memory} unit appeared at the end of the 1940s, with EDSAC at the University _of Cambridge _and EDVAC _at the University of

Pennsylvania.

It

is

interesting

_that

_at

_this

_time

_the

_possibility

_of

using

serial

arithmetic,

with

the

consequent

reduction

in

hardware

_required,

_{was regarded}

_{as an advantage}

_{of the speed of}

electronic

components

over

mechanical

machines

which

had

performed

their

arithmetic

in parallel.

1.1.4 _Mechanisms

_{For Introducing}

_Parallelism

Despite _the _speed _advantages _of _electronic _computers _over mechanical _and _{electro-mechanical} _machines _greater _processing

rates were soon required. Though some improvement in processing speed _{was achieved} through improvements in the speed

of electronic components major improvements were achieved through _parallelism; the fundamental techniques for introducing

(21)

HISTORICAL _REVIEW ₅

There _are _several _techniques for introducing _parallelism as listed by Sharp[1361 these _are; Multiprocessors,

Multifunction Processors, Array _Processors _and _Pipeline

processing. Each of these techniques involves replication of some or _all of the computer architecture at varying levels.

Multiprocessors

Multiprocessors involve duplication _of _the _entire

computer, computers making up the machine using some communication scheme to coordinate their actions. A variety of communication schemes are the subject of current research;

shared _memory, _shared bus _{and direct} interprocessor _connections

for _examples. _The _major _problems _with _{multiprocessors} _are those _of _{communication} _and _{synchronisation} _of the _processors

and of decomposition of the task into suitable units to be performed in parallel. Multiprocessors are dealt with further

in _the _survey _of _currently _researched _processor _types in section 1.3.

Multifunction _Processors

If _units _within _a _processor _are _replicated _such _as floating _point _processors, _then it is _possible _to _perform

several operations simultaneously within a single cpu.

Processor Arrays

The development

_of

_processor

_arrays

is

dealt

_with

_fully

in

_section

1.1.6. _In

_{a processor}

_array

_{a large}

_number

_of

(22)

HISTORICAL _REVIEW ₆

simultaneously but on different _sets _of data. Usually _some mechanism for data _exchange between the _processors is _provided.

Though _such _machines _can _achieve high _processing _rates _for _the large _number _of _problems _that _can _{be reformulated} for _such _a processing scheme they are not _efficiently _applied to _all

problems. The term array processor frequently used for such computers is not used consistently as is illustrated in section

1.1.7.

Pipeline Processing

Pipeline _processing is _applicable _when long _sequences _Of instructions _are _to _be _repeated _for _different data. A production line type _of _approach _can be _used to overlap an

instructions _execution _with _the _execution _of _the _preceding

instruction _{on the} following item _of data _in _the _sequence. It is _{common to} _combine _pipeline _processing _with _{one or more} _other techniques for introducing _parallelism.

1.1.5 Scalar _{and Vector} _Processors

There

_are

_several

_techniques

_that

_have

been

_used

_to

achieve

speed

up

through

the

implementation

of

parallelism.

The

development

_of

_parallelism

_as

_applied

_to

Itraditional9

architectures

is

presented.

This

account

is based on those by

Hockney

_{and Jesshope[681}

_{and Sharp[1361.}

All _parallelism _requires _some _additional _hardware,

usually in the form _of _replication _of _some _part _of _the

(23)

HISTORICAL REVIEW

implemented.

7

The _pilot ACE _and the _commercial _machine derived from this, the English Electric DEUCE, first _constructed in 1951 had several features to permit parallel operations to be carried

out. The input/output devices (a card reader and card punch) could operate in parallel with the rest of the machine and also

instructions _were _available to _operate _on _all of the words in one of the eleven mercury delay lines with a single

instruction, _what _would _now be _classed _{as vector} instructions.

During the 1950s _several important advances were

introduced, _represented by _the IBM _commercial _machines of the period which incorporated bit-parallel arithmetic, and later

1/0 _channels _which _were in _essence dedicated 1/0 processors, these _were the _earliest _{multiprocessors.}

At _about this time _{some consideration} _{was given} to large scale multiprocessor designs, however the programming of such systems has several problems and most of the commercial

development _was _towards introducing _parallelism into _scalar computers (computers operating on data items comprising single values only) to achieve higher computing speeds.

1.1.5.1 Scalar Processors

Multiple functional _units _allowing _arithmetic _operations

to be performed in _parallel _{and pipelining} where stages of an operations execution are overlapped with later stages in a previous operations execution (usually _employed in the

(24)

HISTORICAL REVIEW 8

computer architectures of the 1960s. A modern example of a processor employing multiple functional units is the CYBERPLUS[721.

Multiple functional _units _were _usually _of _arithmetic units, registers _{and memory,} to make use of these in parallel

some degree of lookahead was required to determine which operations could be performed in parallel. This lookahead approach allowed the overlapping of instruction decoding, address calculation and fetching of operands using a pipelining

technique. A good _{representative} _example of a machine using such techniques is the IBM 360/91.

1.1.5.2 Vector Processors

The logical development from _scalar _machines using pipelining techniques to achieve high computing speeds was the

construction of vector processors using pipelining techniques.

Vector _machines _operate _on _vectors (an _ordered _group _of numerical values) as a basic unit of data. The most famous of

these _are the CRAY _machines, _the CRAY-1 having _regularly achieved 130 MFlops/sec _{on appropriate} problems.

This is _one _of _the _main _points _about departures from

scalar

machines

in

that

the problems

must be suitable

for

the

machine

architecture;

vector

machines

exhibit

little,

if

any

improvement

_over

_purely

_scalar

_processors

_when

_performing

purely

scalar

computations.

This

dependence

upon

problem

suitability

is

clearly

demonstrated

by

the

benchmarks

_of

_a

CYBER 205 vector

processor

_{and a CYBERPLUS scalar}

_processor

_on

(25)

HISTORICAL REVIEW 9

testing _{and branching} instructions[72).

1.1.6 Processor Arrays

In 1962 _{a paper} _of Slotnick et al described the SOLOMON computer, SOLOMON standing for Simultaneous Operation Linked

Ordinal MOdular Network. This was one of the earliest

references to the concept of'an array of processors, each with some memory, and all under the control of a central control

stream. Though never built as originally proposed several

important _machines developed from this concept, such as the ILLIAC IV _and ICL DAP[491. The ILLIAC IV was not a success,

costing four times its contract figure and never coming within an order of magnitude of its proposed performance, when finally operational in 1975, it was however a very influential machine.

ILLIAC IV _was, like the engines of Babbage a century before,

too _ambitious for the technology available at the time. The ICL _{DAP was} _commercially viable however, the first one being installed in 1980, this _machine had in its production form an array of 64 x 64 processors, each with 4096 bits of memory and

capable of bit-serial arithmetic on the values held in this memory, 4096 such calculations being performed in parallel. In

common with the processors of SOLOMON and ILLIAC IV the processors of the ICL DAP had connections with their nearest

neighbours, in an array pattern from which this genre of machines get their name. Though these machines used physical

hardware _connections to their _nearest _neighbours _{an alternative}

is _to _use _conceptual links, data being _passed _via _common memory. In this case there is no definite structure to the

(26)

HISTORICAL _REVIEW 10

termed _{unstructured,} the term _ensemble _was _used for _such _an arrangement. A notable example of such a machine is the Burroughs Parallel Element Processor Ensemble PEPE developed in

the _mid 1970s _which had 288 _processing _elements, _each containing three processors (one each for input of radar signals, processing of data and output of control signals)

controlled by three control units for the three types of processor within each processing element. When necessary

communication between the processing elements took place via the _memories _of the _control _units.

1.1.7 Array Processors

Many _special _purpose _computers have been produced for processing large amounts of data, usually in the form of

arrays, these tend to be referred to by the generic name array processors though their architecture does not necessarily

consist of an array of processors. A good example of such processors are the special purpose devices for Fast Fourier

Transform (FFT) _and _similar _algorithms frequently _used in signal processing _{applications.} A list of the attributes

required of a subset of array processors, peripheral array processors, has been suggested by Karplus[821 though the

generality of the term array processor is acknowledged.

1.1.8 Orthogonal _{and Associative} Processors

(27)

HISTORICAL _REVIEW ₁₁

orthogonal _computer described by Shooman in _parallel _computing with vertical data _a 'horizontal _unit' for _word _serial/bit

parallel operations _and _{a 'vertical} _unit' for bit _serial/word

parallel operations were both provided allowing the most appropriate mechanism of referencing the data to be used.

The notion

of testing

all

words

in parallel

leads

to the

idea

_of

_associative

_processing

_{and content-addressable}

_memory

in which items are referenced

by a match between the data and a

given

bit

pattern

or mask rather

than

by the

address

of

its

location

in

_memory.

_It

is

_usual

_{to provide}

both

_associative

and address

reference

in a processing

scheme though

in a purely

associative

memory there

is no facility

to address

data

by its

position

in store.

A series of commercial machines under the name of OMEN (Orthogonal Mini EmbedmeNt) _were _produced in _the _early 1970s,

these _used _{a PDP-11} _as the horizontal _arithmetic _unit _and _an array of 64 processing _elements _as the _associative _vertical

arithmetic unit.

Several

_machines

_based

_around

_the

_orthogonal

_computer

concept

have been built

_{and machines}

_along

these

lines

_{are the}

subject

_{of considerable}

_present

_research.

Recently _{a large} _amount _of interest has been _shown in multiprocessor _computers, _most _commercial _machines _use _only _a

small number of processors; the CRAY X-MP, regarded by many as a state of the _art multiprocessor _{supercomputer} has _{a maximum} of only four processors. Some _academic _{multiprocessors} _use

rather more processing _elements _{and these} _are dealt _with _in _the

(28)

HISTORICAL _REVIEW ₁₂

survey _{of multiprocessor} types in _section 1.3.

1.2 Classification _Schemes

Several _schemes have been _proposed _for _the _{classification}

of multiprocessor _computers, however _most _of these _are inadequate _to _classify _the _wide _variety _of _machine _types presently recognised.

A

_natural

_{classification}

is

by

_the

level

_at

_which

parallelism

is

implemented

within

the

computer,

Hockney

and

Jesshope[681

divide

_this

_{up into}

four

levels:

1-

_{Job Level}

2- Program Level

3- Instruction Level

4- Arithmetic _{and Bit} _Level

at

the job

level

separate

jobs

or large

sections

thereof

are

regarded

as the

units

to be executed

in

_parallel.

Since

Jobs

_{are usually}

_independant

_these

_{can be executed}

_{in parallel}

without

communication

_{or determinacy}

_problems

_arising.

At _the _program _level _sections _of _{a program} _that _{do not} exhibit data dependencies _may be _executed in _parallel on separate _processors, _{a good} _example _of this _are loop _constructs

which do not require data exchanges between iterations of the loop.

Pipelining

_of

_the

_stages

_of

_instruction

_fetch

_and

(29)

HISTORICAL _REVIEW ₁₃

At _the _lowest _level _parallelism _between _the

operations _on elements _of instruction _operands is _possible, _{e. g.} _bit parallel _arithmetic _{or vector} _arithmetic.

The _machine _to be described _in _this _thesis _would _reside at the top most level _of this _scheme, the duplication _of the program being regarded _{as separate} jobs _using different data.

A very simple taxonomy was presented by Crenshaw in _a NATO _conference _paper[481 in _which _computer _systems _are

regarded as either 'federated' or 'integrated'. In this scheme a federated system is one consisting of several computers each performing a particular task _and communicating with the other

processors through 1/0 channels. The computers making up a federated _computer _{may themselves} be integrated _computers _and need not be identical. An integrated system is one in which unrelated tasks _are _performed in _a multiprogrammed fashion within _{a single} _computer. This _computer _{may be a monoprocessor}

or _a _{multiprocessor} _sharing _common _main _storage. The key feature _of _an _integrated _computer _system _is _the _single _job queue _{and operating} _system.

In _this _terminology _the _computing _structure _presented _in chapter 3 would be regarded _{as a federated} _computer _{made up of} a collection _of integrated _computers (since it is likely _that the _nodes _would be capable _{of multiprogramming).}

One of the _most _widely _quoted _{classifications} _is _that _of Flynn[52,68,136,1421. _Rather _than _describing _the

architecture

of the computer Flynn's taxonomy _relates _the _instructions _and the data being _processed. _Flynn _identified _four

(30)

HISTORICAL _REVIEW ₁₄

relationships between the instruction _stream(s) _and data stream(s):

SISD - Single Instruction _stream, Single Data _stream

SIMD - Single Instruction _stream, Multiple Data _stream MISD - Multiple Instruction _stream, Single Data stream

MIMD - Multiple Instruction _stream, Multiple Data stream

the first _of these represents the serial computer architecture. The SIMD architecture is one in which a single

instruction _operates _{on multiple} data, _{a good} example being a vector instruction. The MISD case is on first inspection

meaningless since it implies that multiple operations are being

performed on a single data item simultaneously, however in a later _paper Flynn[511 _suggest that special streaming

techniques, _such _as the pipeline where different instructions

are applied to the data stream as it passes through the machine, are included in this group. The final group, the MIMD machines, includes all multiprocessor configurations, the lack

of distinction between different types of multiprocessor being the _main deficiency _of Flynn's taxonomy. The system presented in _chapter 3 falls in _this _{MIMD category.}

A

_{classification}

based

_on

the

_organisation

_of

the

computer

from

its

constituent

parts

was

provided

by

Shore[138,68,1361

in 1973.

Shore's

_{classification}

_provides

_six

cases

of processor:

Word serial, Bit _parallel Word parallel, Bit _serial

(31)

HISTORICAL _REVIEW

15

IV _{- Unconnected} _array V- _Connected _array

VI _{- Logic} in _memory _array

type I is _the _traditional _serial _computer _architecture

and type II is _{a bit} _slice _processor. _{The orthogonal} _computer, type III is _effectively _{a combination} _of _types _I _an _II _being able to access data in two _{perpendicular} directions. Type IV is _an _array _of _unconnected _processors _under _the _control _of _a single _control _unit, type V is _similar _except that the processors are connected to permit communication. Type V, the final _type _consists _of _memory _with _processing _units _distributed

throughout it _as in _associative _processors. _This

classification does not _adequately _cover loosely _coupled processors such _{as direct} _connection _network _{computers[871} _of which the structure presented in _chapter 3 is _{an example,} these

being _regarded _{as multicomputers} _rather _than _{a multiprocessor.}

Hockney _and Jesshope[681 _propose _{a structural} _notation not unlike that used by chemists to indicate _chemical formulae

as a means of _expressing _computer _{architecture.} This _provides a comprehensive _scheme for description _of _computer

architectures _which is then _used in the _presentation _of _a computer taxonomy _{as a set} _of decision trees. Zakharov[1611 is critical _of this _scheme _{as being} _too _cumbersome _to _{be useful,}

also, _as in the _scheme _of Shore, _network _computers _are _not included _in _the _{classification.}

Sharp[1361 _provides _four _cases

(32)

HISTORICAL REVIEW 16

SES - Single Processor, Scalar Data SEA - Single Processor, Array Data

MES - Multiple Processors, Scalar Data MEA - Multiple Processors, Array Data

the

last

two of these

being

subdivisions

of Flynn9s

MIMD

group.

Other

_schemes

have

been

proposed

by

both

Kuck

and

Schwartz,

the

_scheme

presented

by

Kuck

is

an

extension

of

Flynn's taxonomy by the addition of an execution stream providing for 16 system types in all and that of Schwartz

provides

a taxonomic

table

based on 55 designs.

None _of the _schemes _are entirely satisfactory, since, as Hockney _{and Jesshope} state, it is quite possible for computers

to have

characteristics

which

belong

in more than one section

of the classification.

1.3 Current Computer Research

1.3.1 Electronic Parallel Processors

A large _number _of _parallel processing architectures are

the

_subject

_of

_ongoing

research,

most

of

these

involve

some

form

_of

_{multiprocessor.}

Several

authors

have

surveyed

the

machines

and

architectures

investigated[135,127,104,2,142,82,651- Implementation of entire machines in VLSI has been the subject of considerable

research

and an influential

factor

in

the

selection

of

many

architectures

such as repeated

processor

and switch

units

which

(33)

HISTORICAL _REVIEW

architecture[140], others[1331.

17

VLSI _array _{processors[1461} _and _various

1.3.1.1 _Direct _Networks

In

direct

_networks

_{a number}

_of

_processors

(nodes),

_each

with

its

own independant

memory are

connected

to

some (or

in

the case of full

connection,

all)

of the other

processors

which

comprise

the

system

by

a dedicated

communication

mechanism

(links).

_These

_networks

_are

_static

_since

the

_connection

pattern

is

unchanging

unlike

indirect

networks

described

in

section

1.3.1.2. There

is

an

enormous

variety

in

the

connection

schemes

currently

being

considered,

the

schemes

usually

exhibit

regularity

and

some

of

the

more

commonly

investigated

_schemes

_are

_the

_ring

(an extension

_{of which,}

the

cylinder,

is used as the connection

scheme in chapter

3),

tree,

mesh

(toroidal

mesh),

hypercube

and

shuffle-exchange

networks[133,501.

Such networks

are

easily

constructed

from

identical

_processing

_nodes,

_such

_{as the}

DIRMU multiprocessor

node[621

or

the

MDP[331

or

the

TRANSPUTER[11,102,156,148,155,78,1011,

though

in such a case it

is

desirable

for

_all

_nodes

_{to have}

_the

_{same degree}

(number

_of

links)

_{and for}

_this

_{to remain}

_constant

_regardless

_of

_the

_size

to which

the machine

is

expanded.

This

increase

in the degree

of nodes with

size

is

one of

the main disadvantages

of one of

the

_most

_widely

investigated

_topologies,

_the

hypercube[113,147,1151.

_{A closely}

_related

_architecture

_which

overcomes

this

difficulty

_are

_the

_cube

_connected

_cycles

_and

extended

_{cube connected}

_cycles

_{topologies[133,50],}

_{in which the}

processors

_at

the

_vertices

_of

_the

_hypercube

_are

replaced

by

(34)

HISTORICAL _REVIEW

18

cycles (rings) _of _processors; _such _{a topology} _can _{be realised} with _processors _of degree _three.

Completely _connected _machines _are _rarely _encountered

owing to the high degree _required _of the _processors _making _up such a system for large numbers of processors. The HPDM[911 has five _{TRANSPUTERS completely} _connected _and _in _addition _uses shared memory to _communicate with a similar number of more conventional CLIPPER processors which are connected together

with a parallel bus. The HPDM may be connected to other HPDMS by an ETHERNET or X. 25 network.

1.3.1.2 Indirect Networks

Indirect _{or dynamic} _networks fall into _{one of} three _major classes; single stage, multistage and crossbar, Feng[501 gives a description of such networks and the uses to which they are put. Single stage networks are also called recirculating

networks since data may have to be recirculated several times before it _reaches its destination. _They _are _used in _cycling machines[1321, the GF11 supercomputer[17] uses a3 stage network in _{a cycling} _scheme. Multistage _and _crossbar _networks

are frequently _used to _connect _processors to memory units, the interconnection _network _permitting _access _to _any _of _the _memory units by any of the processor _units.

The Remps _machine[73] _users _{a global} _network _to _allow

memory

_sharing

between

_processors

but

in

_addition

_uses

_two

other

_networks,

_{one to allow}

_{1/0 communication}

_mapping

_{and the}

second,

_a

_one

_sided

_network,

_to

_allow

_{interprocessor}

(35)

HISTORICAL _REVIEW

19

One _sided _networks, _referred _to _occasionally _as _full switches, _allow _any _of the _processors _attached _to _communicate with _any _of the _other _processors _attached, _unlike _two _sided networks _which _allow _{any of} the inputs _{to. connect} _to _{any of} _the

outputs, Almasi[2] _refers to these _as the 'boudoir' _and Idancehall' _arrangements _{respectively.} _The _IBM _RP3[731 is _a current example of the application of a one sided network to interprocess _{communication.}

1.3.1.3 Bus Systems

Bus _systems _use _single _or _multiple _parallel busses to

permit

data

exchange

between

several

units

connected

to the bus

in _a _random _access fashion. This f ree interconnection _of devices _as _compared _with _say direct _network _machines _readily

allows

many conceptual

interconnection

schemes to be mapped on

to

the

_parallel

bus.

The critical

_component

is

largely

the

bandwidth

_of

_the

bus,

limited

by

_the

_number

_of

_physical

connections possible. The general any to any connection

possibility of the bus was one of the main reasons for its adoption to connect the vector processors of the MU6V[751.

This _general _mapping _allows bus _connections _to be applied in _a variety of environments such as Message-Passing[126] and Data Flow[1501. _This _argument _also _applies _to _other _globally _shared medium _{communications} _such _as those _used in Local Area

Communications _{as in} _the _{TUMULT ring[1301.}

(36)

HISTORICAL _REVIEW

20

connected by other busses, _or, _{as in} _the _case _of _MuTeam[301 _by serial _asynchronous links. _The _Synapse _N+1 _system[1121, _in common with many systems, _uses _one _or _more _global buses (the synapse _expansion bus in the _case _of _synapse) _to _which _are connected lower level buses. The TAMIPS _{multiprocessor[} 1511 has _up _to _eight _processors _connected _to _its _local _bus, _this multiprocessor _can then be attached to the _multibus (IEEE-796)

to _provide for _{expansibility.} The Flex/32[1001 _{multicomputer}

uses a local bus with several processors connected, the processors may connect to other groups of processors or input/output devices _to _provide for _{expansibility.} _There _are several busses in use, VME, Multibus (as used in the Sequoia computer[981), Futurebus and others. FERMTOR[1251 uses local buses in _{a ring,} _the buses _making _up _the _ring being _connected by _station latches _which deal _with data _transfers between devices _not _connected _to _the _{same bus.} _{By varying} _the _number of buses and number of devices connected to each bus this architecture may be 'tuned' to the data access pattern.

1.3.1.4 Cellular

Array

Processors

The _array _processor (SIMD) _architecture in _which _arrays of _very _simple _processors _can _communicate _with _adjacent

processors is _of _considerable interest _especially _with _{a view} towards incorporating _large _numbers _of _processors _onto _{a single}

VLSI device. _An _example _of _such _{a VLSI} _device _is _the _ITT CAP-II _chip[107] _which _has _a4X4 _array _of _{16 processors}

each working _with 16 bit _words. _Mishin _and _{Sedukhin[1061} _discuss

the behaviour _of _such _an _adjacent

communication _cellular computer system _{and its} _performance _for

(37)

HISTORICAL _REVIEW ₂₁

Control for _such _{an array} _of _processors is _generated from one central control mechanism with little independant activity

by _the _individual _processors. _It _is _possible _to _transfer _some of the _centralised control to the individual processors but

still maintain an overall central control. In such cases the SIMD _nature _of the _architecture becomes less clear as the control tends towards completely independant operation as in MIMD architectures.

1.3.1.5 Systolic _{and Wavefront} Arrays

The _systolic _array is _{a possible} example of a halfway

house between SIMD and MIMD operation, there being an array of processors, each performing computations under a global scheme