UNIT - VII

SIMD Computer Organizations

Vector processing can also be implemented on SIMD computers. There are two implementation models of SIMD computers, distinguished by their memory distribution and addressing schemes.

SIMD computers generally have a single control unit and distributed memories, although some use associative memories. Instructions are decoded by the array control unit, which broadcasts them to the processing elements (PEs). The PEs are generally passive: their ALUs simply execute the instructions received from the array control unit. All PEs run in lockstep at the clock speed and are synchronized by the array controller.

The following are the two basic models of SIMD computers.

1. Distributed memory model.

2. Shared memory model.

1. Distributed memory model: This model consists of an array of PEs, each attached to its own local memory, and therefore exhibits spatial parallelism. All the PEs and local memories are controlled by the array controller. The following diagram shows the distributed memory model.

In this model, programs and data are loaded into the control memory through the host computer. Each instruction is sent to the control unit for decoding. If it is a scalar or program control operation, it is passed to the scalar processor and executed there. If the decoded instruction is a vector operation, it is broadcast to all the PEs for parallel execution.

The data required to execute a vector instruction are partitioned and distributed to all the local memories through the vector data bus. The PEs are interconnected by a data-routing network, which is used for inter-PE data communication.

This network operates under program control through the control unit.
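The instruction flow described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of any real machine: the class names `PE` and `ArrayControlUnit`, the instruction tuples, and the opcodes `sset`, `vadd`, and `vmul` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    """One processing element with its own local memory (hypothetical model)."""
    local: dict = field(default_factory=dict)

    def execute(self, op, dst, a, b):
        # Every PE applies the same operation to its own local operands.
        if op == "vadd":
            self.local[dst] = self.local[a] + self.local[b]
        elif op == "vmul":
            self.local[dst] = self.local[a] * self.local[b]
        else:
            raise ValueError(f"unknown vector op {op!r}")


class ArrayControlUnit:
    def __init__(self, n_pes):
        self.pes = [PE() for _ in range(n_pes)]
        self.scalars = {}  # registers of the scalar processor

    def distribute(self, name, vector):
        # Partition a vector so that element i lands in the local memory of PE i.
        assert len(vector) == len(self.pes)
        for pe, value in zip(self.pes, vector):
            pe.local[name] = value

    def gather(self, name):
        return [pe.local[name] for pe in self.pes]

    def issue(self, instr):
        op, *args = instr
        if op == "sset":
            # Scalar operation: handled by the scalar processor.
            reg, value = args
            self.scalars[reg] = value
        else:
            # Vector operation: broadcast to all PEs (conceptually simultaneous).
            for pe in self.pes:
                pe.execute(op, *args)


cu = ArrayControlUnit(4)
cu.distribute("x", [1, 2, 3, 4])
cu.distribute("y", [10, 20, 30, 40])
program = [("sset", "r0", 7), ("vadd", "z", "x", "y"), ("vmul", "w", "z", "x")]
for instr in program:
    cu.issue(instr)
print(cu.gather("z"))  # [11, 22, 33, 44]
print(cu.gather("w"))  # [11, 44, 99, 176]
```

The loop over `self.pes` stands in for the hardware broadcast; on a real SIMD array all PEs would perform the operation in the same clock step.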

2. Shared memory model: In this model, the memories are shared among the PEs. That is, all the PEs in the SIMD computer simultaneously use the memory modules, which are connected to them in shared fashion by the alignment network. Here too, the alignment network is controlled by the control unit. The following diagram shows the shared memory model.

The Burroughs Scientific Processor (BSP) adopted the shared memory model, with n = 16 PEs connected to m = 17 shared memory modules through a 16 x 17 alignment network. The alignment network must be properly designed to avoid memory access conflicts.

Depending on the interconnection structure of the PEs, SIMD computers are further classified as:

1. Bit-slice PE computers.

2. Word-parallel computers.
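Returning to the BSP figures quoted above (n = 16 PEs, m = 17 memory modules), the following sketch suggests why an odd, prime number of modules helps avoid access conflicts. It assumes the simplest possible placement, address modulo m, which is an assumption made for illustration; the notes do not describe the actual BSP address mapping. The function names are hypothetical.

```python
N_PES = 16       # n in the text
N_MODULES = 17   # m in the text


def modules_touched(base, stride, n_pes=N_PES, n_modules=N_MODULES):
    """Memory module hit by each PE when PE k reads address base + k*stride."""
    return [(base + k * stride) % n_modules for k in range(n_pes)]


def conflict_free(base, stride, n_modules=N_MODULES):
    """True if no two PEs address the same module in one access step."""
    hits = modules_touched(base, stride, n_modules=n_modules)
    return len(set(hits)) == len(hits)


# With 17 modules, every stride from 1 to 16 is conflict free.
print(all(conflict_free(0, s) for s in range(1, 17)))  # True

# With 16 modules, an even stride makes several PEs collide.
print(conflict_free(0, 2, n_modules=16))  # False
```

Under this assumed mapping, only strides that are multiples of 17 cause conflicts, so rows, columns, and diagonals of an array stored with ordinary strides can be fetched by all 16 PEs at once.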

SIMD Instructions: SIMD computers can execute arithmetic, logic, and data-routing instructions, and can also perform masking operations on vector operands. In bit-slice SIMD computers, the vector operands are represented in binary format. In word-parallel SIMD computers, the vector operands are represented as 4- or 8-byte numerical values.

All SIMD instructions use vector operands of length n, where n is the number of PEs in the SIMD computer. Instruction execution in SIMD computers is similar to that in pipelined vector processors, except that SIMD computers exploit spatial parallelism rather than the temporal parallelism of a pipeline.

The data-routing instructions include permutations, broadcasts, multicasts, and various rotate and shift operations. Masking operations are used to enable or disable a subset of PEs at any given moment.
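The effect of masking and of a few data-routing instructions on a vector of length n can be sketched as follows. Each list index plays the role of one PE. The function names are hypothetical, and the convention that a disabled PE keeps its first operand unchanged is an assumption made for this example.

```python
def masked_add(a, b, mask):
    """Element-wise add; PE i updates its value only if mask[i] is set."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]


def rotate(v, k):
    """Data routing: PE i receives the value held by PE (i - k) mod n."""
    n = len(v)
    return [v[(i - k) % n] for i in range(n)]


def broadcast(v, src):
    """Data routing: every PE receives the value held by PE src."""
    return [v[src]] * len(v)


def permute(v, dest):
    """Data routing: the value in PE i is sent to PE dest[i]."""
    out = [None] * len(v)
    for i, d in enumerate(dest):
        out[d] = v[i]
    return out


a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
print(masked_add(a, b, [1, 0, 1, 0]))  # [11, 2, 13, 4]
print(rotate(a, 1))                    # [4, 1, 2, 3]
print(broadcast(a, 2))                 # [3, 3, 3, 3]
print(permute(a, [3, 2, 1, 0]))        # [4, 3, 2, 1]
```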

Host and I/O: To handle I/O activities in SIMD computers, a special control memory is placed between the host computer and the array control unit. This is a staging memory that holds programs and data. The partitioned data sets are distributed to the local memories before program execution starts. The host computer manages mass storage and the graphic display of computational results.

The CM-2 Architecture:

The Connection Machine CM-2 is a fine-grain supercomputer built by Thinking Machines Corporation. It consists of thousands of bit-slice processing elements (PEs) and achieves a speed of 10 Gflops.

Program Execution Paradigm: Program execution on the CM-2 is divided into a front-end part and a back-end part. Execution starts on the front end, which breaks instructions down into microinstructions and broadcasts them to all the processors in the processing array. The back end contains the processing array, where the actual parallel data operations are performed.

The front end and back end are interconnected in three ways: by a broadcast bus, a global combining bus, and a scalar memory bus. Data and microinstructions are broadcast to all PEs through the broadcast bus. The global combining bus collects the processing results from the processing array and sends them to the front end. The scalar memory bus reads or writes 32 bits of data at a time to and from the memories attached to the data processors. The following diagram represents the architecture of the CM-2 machine.

The CM-2 architecture has the following three main components.

1. Processing array.

2. Processing nodes.

3. Hypercube routers.

Processing array: The CM-2 was a back-end machine for parallel data computation. The processing array contained from 4K to 64K bit-slice processors, all controlled by a sequencer. The sequencer decodes the parallel instructions received from the front end and then broadcasts nanoinstructions to the processing array (the back end).

All the processors in the CM-2 access their memories simultaneously and operate in lockstep. Interprocessor communication in the CM-2 is carried out by the router, the NEWS grid, or the scanning mechanism.

Processing nodes: Each CM-2 processing node contains 32 bit-slice data processors together with memory, a floating-point chip, and an optional floating-point accelerator. Each data processor has a 3-input, 2-output bit-slice ALU and associated latches for the memory interface. The ALU can perform bit-serial full-adder and Boolean logic operations. The following diagram shows the CM-2 processing nodes.

Each processor chip contains 16 processors, and processor chips are paired to access a group of memory chips. The memory data path is 22 bits wide (16 data bits and 6 ECC bits), and the 18-bit memory address can access 256K memory words (512K bytes of data). The floating-point chip performs 32-bit floating-point operations.
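The memory figures quoted above follow from simple arithmetic, which the short check below reproduces. The constant names are chosen for this example only, and the 16 + 6 total assumes that the data path width is just the sum of the data and ECC bits.

```python
ADDRESS_BITS = 18  # width of the memory address
DATA_BITS = 16     # data bits per memory word
ECC_BITS = 6       # error-correcting code bits per memory word

words = 2 ** ADDRESS_BITS            # number of addressable memory words
data_bytes = words * DATA_BITS // 8  # ECC bits are not counted as data
path_bits = DATA_BITS + ECC_BITS     # total width of the memory data path

print(words // 1024, "K words")       # 256 K words
print(data_bytes // 1024, "K bytes")  # 512 K bytes
print(path_bits, "bit data path")     # 22 bit data path
```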

Hypercube routers: The CM-2 has special hardware for interprocessor communication. All the router nodes are wired together to form a Boolean n-cube. Each node is connected to 12 other router nodes, including its paired node.
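In a Boolean n-cube, two nodes are linked exactly when their binary addresses differ in one bit, which is why each node has n direct neighbors. The sketch below illustrates this for n = 12, the value implied by the 12 neighbors mentioned above. The function names are hypothetical, and the hop count shown is the usual hypercube shortest-path distance rather than a statement about the CM-2 routing algorithm.

```python
DIM = 12  # dimension of the Boolean n-cube (12 neighbors per node)


def neighbors(node, dim=DIM):
    """Addresses that differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hops(src, dst):
    """Minimum number of links between two nodes (Hamming distance)."""
    return bin(src ^ dst).count("1")


print(len(neighbors(0)))                     # 12
print(neighbors(5, dim=4))                   # [4, 7, 1, 13]
print(hops(0b000000000000, 0b111111111111))  # 12
```

With 12 address bits there are 2**12 = 4096 nodes, and no two of them are more than 12 links apart.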

The NEWS Grid: Each processor chip in the CM-2 has 16 physical processors, which can be arranged as an 8 x 2, 1 x 16, 4 x 4, 4 x 2 x 2, or 2 x 2 x 2 x 2 grid.

In addition, 64 virtual processors can be assigned to each physical processor in a grid. The NEWS grid is named for the fact that each processor has north, east, west, and south neighbors in the various grid configurations. Furthermore, a subset of the wires can be used to connect the 2¹² nodes as a two-dimensional grid of any shape.

To connect the grid inside each chip with the global grid, all the processor chips are themselves arranged in a NEWS grid. The NEWS grid gives flexible interconnections among the processors and routes data efficiently on a dedicated grid configuration.
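One standard way to lay a two-dimensional grid over a subset of hypercube wires is a Gray-code embedding, sketched below for a 64 x 64 grid on 2**12 nodes. The notes do not say which embedding the CM-2 used, so the Gray-code mapping, the 64 x 64 shape, and the wraparound at the grid edges are all assumptions made for illustration; the function names are hypothetical.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def grid_to_node(row, col, col_bits=6):
    """Map a grid position to a 12-bit hypercube address (assumed embedding)."""
    return (gray(row) << col_bits) | gray(col)


def news_neighbors(row, col, rows=64, cols=64):
    """North, east, west, and south neighbors, assuming wraparound edges."""
    return {
        "N": ((row - 1) % rows, col),
        "E": (row, (col + 1) % cols),
        "W": (row, (col - 1) % cols),
        "S": ((row + 1) % rows, col),
    }


def hamming(a, b):
    return bin(a ^ b).count("1")


# Check that every NEWS neighbor is exactly one hypercube link away.
ok = all(
    hamming(grid_to_node(r, c), grid_to_node(nr, nc)) == 1
    for r in range(64)
    for c in range(64)
    for nr, nc in news_neighbors(r, c).values()
)
print(ok)  # True
```

Because consecutive Gray codes differ in one bit, each grid step uses a single hypercube wire, so grid communication needs no intermediate routing hops under this mapping.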

Scanning & Spread Mechanisms: The CM-2 has special hardware support for scanning and spreading instructions and data across the NEWS grids. This hardware makes fast parallel combining or spreading of data possible throughout the entire processor array.

Scanning on NEWS grids combines communication and computation. A scan proceeds simultaneously along every row of a grid in a particular dimension, performing arithmetic or logic operations as it goes. Scanning can also be extended to all the elements of the processing array.

The spread mechanism sends a computed value to all the other processors across the chip.
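The difference between scanning and spreading can be shown on a small grid. In this sketch each inner list is one grid row, a scan leaves a running result in every processor, and a spread leaves the same combined value in every processor of the row. The function names are hypothetical, and the code runs sequentially, whereas the hardware handles all rows at once.

```python
from itertools import accumulate
import operator


def scan_rows(grid, op=operator.add):
    """Running combination along each row (one grid dimension)."""
    return [list(accumulate(row, op)) for row in grid]


def scan_cols(grid, op=operator.add):
    """Running combination along each column (the other dimension)."""
    cols = [list(accumulate(col, op)) for col in zip(*grid)]
    return [list(row) for row in zip(*cols)]


def spread_rows(grid, op=operator.add):
    """Combine each row and send the result to every processor in that row."""
    out = []
    for row in grid:
        total = row[0]
        for value in row[1:]:
            total = op(total, value)
        out.append([total] * len(row))
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
print(scan_rows(grid))         # [[1, 3, 6], [4, 9, 15]]
print(scan_cols(grid))         # [[1, 2, 3], [5, 7, 9]]
print(spread_rows(grid, max))  # [[3, 3, 3], [6, 6, 6]]
```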

I/O and Data Vault: The CM-2 provides 2 to 16 high-speed I/O channels for data and image I/O operations. The peripheral devices attached to the I/O channels include a data vault, a disk-based mass storage system for storing program files and large databases.

Major Applications: The CM-2 was used in massively parallel processing and grand challenge applications. These include document retrieval using relevance feedback, memory-based reasoning, medical diagnostic systems, and bulk processing of natural language. Other applications of the CM-2 are VLSI circuit analysis, computational fluid dynamics, signal/image/vision processing and integration, neural network simulation and dynamic programming, context-free parsing, ray-tracing graphics, and computational geometry problems.

A Synchronized MIMD Machine:

The CM-2 suffered from a rigid SIMD architecture and could perform only limited general-purpose operations. To overcome these problems, its designers developed the CM-5, which has a universal architecture and combines the good features of both SIMD and MIMD machines.

An MIMD machine is good at independent branching but bad at synchronization and communication. An SIMD machine, on the other hand, is good at synchronization and communication but poor at branching. The CM-5 was therefore developed with a synchronized MIMD structure to support parallel computation.
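The idea of synchronized MIMD execution can be illustrated with ordinary threads: each node follows its own branch, then all nodes meet at a synchronization point before they combine results. This is only an analogy under stated assumptions. Python's `threading.Barrier` stands in for hardware synchronization, and the node function, the data, and the branch rule are invented for the example.

```python
import threading

N = 4
data = [3, 8, 5, 10]
partial = [None] * N
totals = [None] * N
barrier = threading.Barrier(N)


def node(i):
    # MIMD phase: each node takes its own branch, independently of the others.
    if data[i] % 2 == 0:
        partial[i] = data[i] // 2
    else:
        partial[i] = 3 * data[i] + 1
    # Synchronization point: no node continues until all have arrived.
    barrier.wait()
    # After the barrier every node can safely read all partial results.
    totals[i] = sum(partial)


threads = [threading.Thread(target=node, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(partial)  # [10, 4, 16, 5]
print(totals)   # [35, 35, 35, 35]
```

Without the barrier, a fast node could read `partial` before a slower node had written its entry; the barrier removes that hazard while leaving the branching phase fully independent.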

Building blocks: The CM-5 contains 32 to 16,384 processing nodes, each of which has a 32-MHz SPARC processor, 32 Mbytes of memory, and a 128-Mflops vector processing unit able to perform 64-bit floating-point and integer operations.

The CM-5 also has a number of control processors; the number varies with the configuration. Each control processor is configured with memory and disk according to need. Input and output are provided via high-bandwidth I/O interfaces to mass secondary storage and high-performance networks. Additional low-speed I/O is provided by Ethernet connections to the control processors.

Network functions: In the CM-5 architecture all the building blocks are interconnected by three networks: a data network, a control network, and a diagnostic network. The data network provides high-performance, point-to-point data communication between processing nodes. The control network provides cooperative operations among the processors, such as broadcast, synchronization, and scans, as well as system management functions. The diagnostic network allows back-door access to all system hardware in order to test system integrity and to detect and isolate errors. The data and control networks are connected to the processing nodes, control processors, and I/O channels via network interfaces. The following diagram shows the network architecture of the CM-5 machine.

The networks in the CM-5 do not depend on a specific type of processor, so when a new processor technology is introduced, it can easily be added to the CM-5 architecture.

System operations: In operation, the CM-5 is divided into a number of user partitions. Each partition has a control processor, a number of processing nodes, and portions of the data and control networks. The following diagram shows the user partitions of the CM-5 architecture.

The control processor within each user partition acts as the partition manager. Each user program executes on a single partition but can exchange data with processors in other partitions.

System functions are classified as privileged or nonprivileged. Nonprivileged functions, such as access to the data and control networks, execute directly without system calls, which reduces the burden on the operating system.

Privileged functions, such as diagnostic network access and sharing of I/O resources, are executed through system calls.

Some control processors in CM-5 are used to manage I/O devices and interfaces.

Finally, the CM-5 offers hardware modularity, distributed control, latency tolerance, and user abstraction; together these features lead to scalable computing.

Control processors and processing nodes:

A basic control processor has a RISC microprocessor (CPU), a memory unit, an I/O system with local disk and Ethernet connections, and a CM-5 network interface. It is equivalent to a standard off-the-shelf workstation. The following diagram shows the general structure of a control processor in the CM-5 machine.

The network interface connects the control processor to the rest of the system. The control processors run a UNIX-based operating system to manage the parallel processing resources. Some control processors manage the computational resources in user partitions, and others manage I/O resources. In effect, each control processor specializes in a particular management function.

Processing nodes: The following diagram shows the structure of the processing nodes in the CM-5 machine.

Each processing node is generally a SPARC processor with a memory system, comprising a memory controller and 8, 16, or 32 Mbytes of DRAM, and a 64-bit-wide internal bus.

The SPARC processor has a multiwindow feature that supports fast context switching. This makes it possible to use the processing nodes dynamically in different user partitions at different times. As an extra feature, vector units can be added between the memory and the system bus. Each vector unit has a dedicated 72-bit path to its memory, providing a maximum bandwidth of 128 Mbytes/s per vector unit. The following diagram shows processing nodes with vector units.

The vector units execute vector instructions issued by the scalar processor and perform all the functions of a memory controller, including generation of the error-correcting code (ECC). Each vector unit has a vector instruction decoder, a pipelined ALU, and 64 registers. A vector instruction can be issued to a specific vector unit, to a pair of units, or broadcast to all units at once. The vector units are what allow each processing node to reach its maximum performance. The following diagram shows the functional architecture of the vector units.

Finally, SPARC processors were used to implement both the control processors and the processing nodes. As processor technology improves, other processors can be used instead, because the network architecture was designed to be independent of processor technology.

Interprocessor communications:

Both the CM-2 and the CM-5 have high-speed scanning and spreading mechanisms for interprocessor communication. These mechanisms are further refined and grouped into four kinds of interprocessor communication operations:

1. Replication

2. Reduction

3. Permutation

4. Parallel prefix

These operations can be applied to both regular and irregular data sets, such as vectors, matrices, multidimensional arrays, variable-length vectors, linked lists, and irregular patterns.

Replication: In a broadcast operation, a single value is replicated into many copies and distributed to all processors. The following diagram shows the replication operations used in interprocessor communication.

Part B shows the spreading of a column vector into all the columns of a matrix. Part C shows the expansion of a short vector into a long vector, and part D shows completely irregular duplication.

Replication plays a key role in matrix arithmetic and vector processing. It is carried out through the control network using four kinds of broadcast: user broadcast, supervisor broadcast, interrupt broadcast, and utility broadcast.
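The kinds of replication described above can be written as small list operations. This is a sequential sketch of the data movement only; the function names are hypothetical, and the cyclic rule used to expand a short vector is an assumption, since the notes do not specify how the expansion is done.

```python
def broadcast(value, n):
    """Replicate a single value to n processors."""
    return [value] * n


def spread_column(column, n_cols):
    """Copy a column vector into every column of a matrix."""
    return [[x] * n_cols for x in column]


def expand(short, length):
    """Expand a short vector into a long one by cyclic repetition (assumed rule)."""
    return [short[i % len(short)] for i in range(length)]


def replicate_irregular(values, counts):
    """Irregular duplication: values[i] is copied counts[i] times."""
    out = []
    for v, c in zip(values, counts):
        out.extend([v] * c)
    return out


print(broadcast(7, 4))                                  # [7, 7, 7, 7]
print(spread_column([1, 2, 3], 2))                      # [[1, 1], [2, 2], [3, 3]]
print(expand([1, 2], 5))                                # [1, 2, 1, 2, 1]
print(replicate_irregular(["a", "b", "c"], [2, 0, 3]))  # ['a', 'a', 'c', 'c', 'c']
```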

Reduction: This operation is generally treated as the opposite of replication.

Reduction is implemented by fast scanning. The following diagram shows various types of reduction.

Part A shows global reduction, which produces the sum of the vector components. Part B shows row/column reduction, which produces a sum for each row or each column. Part C shows the reduction of variable-length vectors stored as chunks of a long vector, and part D shows reduction on irregular sets.

Reduction functions include maximum, minimum, average, dot product, sum, logical AND, and logical OR. Fast scanning and combining are required to implement reduction.

Here we consider four types of combining operations: reduction, forward scan, backward scan, and router-done.
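The reduction patterns described above can be sketched with Python's `functools.reduce`. The function names are hypothetical, the code is sequential rather than parallel, and the chunk lengths passed to `segmented_reduce` are assumed to be positive.

```python
from functools import reduce
import operator


def global_reduce(v, op=operator.add):
    """Combine all components of a vector into one value."""
    return reduce(op, v)


def row_reduce(m, op=operator.add):
    """One result per row of a matrix."""
    return [reduce(op, row) for row in m]


def col_reduce(m, op=operator.add):
    """One result per column of a matrix."""
    return [reduce(op, col) for col in zip(*m)]


def segmented_reduce(v, lengths, op=operator.add):
    """Reduce consecutive chunks of v whose sizes are given by lengths."""
    out, start = [], 0
    for n in lengths:
        out.append(reduce(op, v[start:start + n]))
        start += n
    return out


def dot(a, b):
    """Dot product as an element-wise multiply followed by a sum reduction."""
    return global_reduce([x * y for x, y in zip(a, b)])


m = [[1, 2, 3], [4, 5, 6]]
print(global_reduce([1, 2, 3, 4]))                    # 10
print(global_reduce([1, 2, 3, 4], max))               # 4
print(row_reduce(m))                                  # [6, 15]
print(col_reduce(m))                                  # [5, 7, 9]
print(segmented_reduce([1, 2, 3, 4, 5, 6], [1, 2, 3]))  # [1, 5, 15]
print(dot([1, 2, 3], [4, 5, 6]))                      # 32
```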

Permutation: Parallel computations often depend on permutation operations, which rapidly exchange data among the processing nodes. The following diagram shows four cases of permutation.

Permutation operations are used in matrix transposition, vector reversal, shifting of a multidimensional grid, and FFT butterfly operations.
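Each of the uses listed above is a fixed rearrangement of data among nodes, as the following sketch shows. The function names are hypothetical. The grid shift assumes wraparound at the edges, and the butterfly exchange uses the common pairing of node i with node i XOR 2**stage, which requires a vector whose length is a power of two.

```python
def transpose(m):
    """Matrix transpose: element (r, c) moves to position (c, r)."""
    return [list(col) for col in zip(*m)]


def reverse(v):
    """Vector reversal: node i receives the value from node n - 1 - i."""
    n = len(v)
    return [v[n - 1 - i] for i in range(n)]


def shift_grid(m, dr, dc):
    """Cyclic shift of a 2-D grid by dr rows and dc columns (wraparound assumed)."""
    rows, cols = len(m), len(m[0])
    return [[m[(r - dr) % rows][(c - dc) % cols] for c in range(cols)]
            for r in range(rows)]


def butterfly(v, stage):
    """FFT-style exchange: node i swaps data with node i XOR 2**stage."""
    return [v[i ^ (1 << stage)] for i in range(len(v))]


print(transpose([[1, 2, 3], [4, 5, 6]]))       # [[1, 4], [2, 5], [3, 6]]
print(reverse([1, 2, 3, 4]))                   # [4, 3, 2, 1]
print(shift_grid([[1, 2], [3, 4]], 0, 1))      # [[2, 1], [4, 3]]
print(butterfly([0, 1, 2, 3, 4, 5, 6, 7], 1))  # [2, 3, 0, 1, 6, 7, 4, 5]
```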

Parallel prefix: This operation is supported by the control network. A parallel prefix operation delivers to the i-th processor the result of applying one of five reduction operations to the values in the preceding i-1 processors, in linear order. The following diagram shows parallel prefix operations.

Part A shows a one-dimensional sum-prefix. Part B shows the two-dimensional row/column sum-prefix, which is performed using the forward scanning mechanism. Part C shows one-dimensional prefix sums computed independently on sections of a long vector. Part D shows forward scanning along linked lists to produce prefix sums as outputs.
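A prefix operation over n values can be completed in about log2(n) lockstep rounds. The sketch below uses the well-known Hillis-Steele scheme to show this; it is a swapped-in textbook algorithm, not a description of how the CM-5 control network computes scans. The function names are hypothetical. `prefix_scan` includes each processor's own value, while `exclusive_scan` matches the definition given earlier, in which processor i receives the combination of the values that precede it.

```python
import operator


def prefix_scan(v, op=operator.add):
    """Inclusive prefix in about log2(n) rounds (Hillis-Steele scheme)."""
    out = list(v)
    step = 1
    while step < len(out):
        # One round: every position i >= step combines with position i - step.
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def exclusive_scan(v, identity=0, op=operator.add):
    """Position i gets the combination of the values strictly before it."""
    return [identity] + prefix_scan(v, op)[:-1]


def segmented_scan(v, lengths, op=operator.add):
    """Independent prefix on consecutive sections of a long vector."""
    out, start = [], 0
    for n in lengths:
        out.extend(prefix_scan(v[start:start + n], op))
        start += n
    return out


print(prefix_scan([1, 2, 3, 4, 5]))                 # [1, 3, 6, 10, 15]
print(exclusive_scan([1, 2, 3, 4, 5]))              # [0, 1, 3, 6, 10]
print(prefix_scan([3, 1, 4, 1, 5], max))            # [3, 3, 4, 4, 5]
print(segmented_scan([1, 2, 3, 4, 5, 6], [2, 4]))   # [1, 3, 3, 7, 12, 18]
```

Each round of `prefix_scan` corresponds to one step in which all processors act at once, so a vector of 5 values needs 3 rounds instead of 4 sequential additions per result.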

End of UNIT - VII