Shared Memory Multiprocessor Architecture Concerns

Given its parallelism model, the KAP preprocessor requires operating system and hardware support for efficient parallel execution. There are three areas of concern: thread creation and scheduling, synchronization between threads, and data caching and system bus bandwidth.

Function                                       X3H5 Directives

To specify regions of parallel execution       C*KAP* PARALLEL REGION
                                               C*KAP* END PARALLEL REGION

To specify parallel loops                      C*KAP* PARALLEL DO
                                               C*KAP* END PARALLEL DO

To specify synchronized sections of code       C*KAP* BARRIER
such that all processors synchronize

To specify that all processors execute         C*KAP* CRITICAL SECTION
sequentially                                   C*KAP* END CRITICAL SECTION

To specify that only the first processor       C*KAP* ONE PROCESSOR SECTION
executes                                       C*KAP* END ONE PROCESSOR SECTION

The KAP Parallelizer for DEC Fortran and DEC C Programs

Table 2 KAP SMP Support Library

C Entry Point Name    Fortran Name  Function                             OSF/1 DECthreads Subroutines Used

__kmp_enter_csec      mppecs        To enter a critical section          pthread_mutex_lock
__kmp_exit_csec       mppxcs        To exit a critical section           pthread_mutex_unlock
__kmp_fork            mppfrk        To fork to several threads           pthread_attr_create, pthread_create
__kmp_fork_active     mppfkd        To inquire if already parallel       (none)
__kmp_end             mppend        To join threads                      pthread_join, pthread_detach
__kmp_enter_onepsec   mppbop        To enter a single processor section  pthread_mutex_lock, pthread_mutex_unlock
__kmp_exit_onepsec    mppeop        To exit a single processor section   pthread_mutex_lock, pthread_mutex_unlock
__kmp_barrier         mppbar        To execute a barrier wait            pthread_mutex_lock, pthread_cond_wait,
                                                                         pthread_mutex_unlock
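The pairing in Table 2 between fork/join, critical sections, and the underlying thread calls can be sketched in C as follows. This is a minimal illustration, not the KAP library's code: every `demo_` name is hypothetical, the cap of 16 threads is arbitrary, and the sketch uses the modern POSIX form of `pthread_create` with default attributes rather than the `pthread_attr_create` call listed for DECthreads.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the enter/exit critical section entry points. */
static pthread_mutex_t demo_csec_lock = PTHREAD_MUTEX_INITIALIZER;
static long demo_shared_total;

static void demo_enter_csec(void) { pthread_mutex_lock(&demo_csec_lock); }
static void demo_exit_csec(void)  { pthread_mutex_unlock(&demo_csec_lock); }

/* Each worker adds to the shared total, one thread at a time. */
static void *demo_worker(void *arg)
{
    long iters = *(const long *)arg;
    for (long i = 0; i < iters; i++) {
        demo_enter_csec();
        demo_shared_total++;
        demo_exit_csec();
    }
    return NULL;
}

/* Fork up to 16 workers, join them all, and return the final total. */
long demo_fork_join(int nthreads, long iters)
{
    pthread_t tid[16];
    int started = 0;

    if (nthreads > 16)
        nthreads = 16;
    demo_shared_total = 0;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tid[t], NULL, demo_worker, &iters) != 0)
            break;                      /* stop forking on failure */
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);     /* the join step */
    return demo_shared_total;
}
```

Calling `demo_fork_join(4, 1000)` runs four workers; because every increment happens inside the critical section, no update is lost.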

Thread Creation and Scheduling

Thread creation is the most expensive operation. The X3H5 standard minimizes the need for creating threads through the use of parallel regions. The SMP support library goes further by reusing threads from one parallel region to the next. The SMP support library examines the value of an environment variable to determine how many threads to use. The appropriate scheduling of threads onto hardware processors is extremely important for efficient execution. The support library relies on the DECthreads implementation to achieve this. For the most efficient operation, the library should schedule at most one thread per processor.
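Reading the thread count from the environment might look like the sketch below. The text does not name the variable the library consults, so the caller supplies the name here, and both function names are hypothetical. Unset, empty, non-numeric, or non-positive values fall back to a caller-chosen default.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a thread count from text; return fallback if it is not a
   positive decimal integer that fits in an int. */
int demo_parse_thread_count(const char *text, int fallback)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1 || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Look up the named environment variable (the name is an assumption
   supplied by the caller) and parse it as a thread count. */
int demo_thread_count_from_env(const char *var_name, int fallback)
{
    return demo_parse_thread_count(getenv(var_name), fallback);
}
```

A caller would typically pass the number of processors as the fallback, in keeping with the guideline of at most one thread per processor.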

Synchronization between Threads

In the KAP model of parallelism, threads can synchronize at

• A point where loop iterations are scheduled

• A point where data passes between iterations (for collection of local reduction variables only)

• A barrier point leaving a work-sharing construct

• Single processor sections
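A barrier built from a mutex and a condition variable, the combination Table 2 lists for the barrier wait, can be sketched as follows. The `demo_` names and the generation counter are illustrative assumptions, not the library's implementation, and the sketch adds a `pthread_cond_broadcast` call to wake the waiters, which the table does not list.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int nthreads;          /* threads expected at the barrier */
    int waiting;           /* threads that have arrived so far */
    unsigned generation;   /* incremented each time the barrier opens */
} demo_barrier;

void demo_barrier_init(demo_barrier *b, int nthreads)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->nthreads = nthreads;
    b->waiting = 0;
    b->generation = 0;
}

void demo_barrier_wait(demo_barrier *b)
{
    pthread_mutex_lock(&b->lock);
    unsigned gen = b->generation;
    if (++b->waiting == b->nthreads) {
        b->waiting = 0;                 /* last arrival opens the barrier */
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (gen == b->generation)    /* guards against spurious wakeups */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

typedef struct { demo_barrier *b; int id; int n; int *arrived; int ok; } demo_arg;

static void *demo_worker(void *p)
{
    demo_arg *a = p;
    a->arrived[a->id] = 1;              /* work done before the barrier */
    demo_barrier_wait(a->b);
    a->ok = 1;                          /* after it, all marks must be visible */
    for (int i = 0; i < a->n; i++)
        if (!a->arrived[i])
            a->ok = 0;
    return NULL;
}

/* Return 1 if every thread saw every other thread's mark after the barrier. */
int demo_barrier_check(int nthreads)
{
    enum { MAX_THREADS = 16 };
    pthread_t tid[MAX_THREADS];
    demo_arg args[MAX_THREADS];
    int arrived[MAX_THREADS] = {0};
    demo_barrier b;
    int all_ok = 1;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return 0;
    demo_barrier_init(&b, nthreads);
    for (int t = 0; t < nthreads; t++) {
        args[t] = (demo_arg){ &b, t, nthreads, arrived, 0 };
        if (pthread_create(&tid[t], NULL, demo_worker, &args[t]) != 0)
            abort();                    /* sketch only: treat failure as fatal */
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        all_ok = all_ok && args[t].ok;
    }
    pthread_cond_destroy(&b.all_here);
    pthread_mutex_destroy(&b.lock);
    return all_ok;
}
```

The generation counter lets the same barrier be reused for successive work-sharing constructs without a late waker confusing one barrier episode with the next.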

Two versions of the SMP support library have been developed: one with spin locks for a single-user environment and the second with mutex locks for a multiuser environment. Either library works in either environment; however, using the spin lock version in a single-user environment yields the most efficient parallelism.

Digital Technical Journal Vol. 6 No. 3 Summer 1994

Using spin locks in a multiuser environment may waste processor cycles when there are other users who could use them. Using mutex locks for a single-user environment creates unnecessary operating system overhead. In practice, however, a system may shift from single-user to multiuser and back again in the course of a single run of a large program. Therefore, KAP supports all lock-environment combinations.
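The trade-off between the two lock styles can be illustrated with a small sketch. It uses C11 atomics for the spin lock, which postdate this article, so it is an assumption about how such a lock could be written today rather than a description of the KAP library; all `demo_` names are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag     demo_spin  = ATOMIC_FLAG_INIT;
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static long            demo_counter;

/* Spin lock: busy-waits on the processor, never enters the OS. */
static void demo_spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&demo_spin, memory_order_acquire))
        ;   /* keep retrying until the flag was previously clear */
}

static void demo_spin_unlock(void)
{
    atomic_flag_clear_explicit(&demo_spin, memory_order_release);
}

typedef struct { long iters; int use_spin; } demo_job;

static void *demo_worker(void *p)
{
    const demo_job *job = p;
    for (long i = 0; i < job->iters; i++) {
        if (job->use_spin) demo_spin_lock();
        else               pthread_mutex_lock(&demo_mutex);
        demo_counter++;
        if (job->use_spin) demo_spin_unlock();
        else               pthread_mutex_unlock(&demo_mutex);
    }
    return NULL;
}

/* Run up to 16 workers with the chosen lock style; return the final count. */
long demo_locked_count(int nthreads, long iters, int use_spin)
{
    pthread_t tid[16];
    demo_job job = { iters, use_spin };
    int started = 0;

    if (nthreads > 16)
        nthreads = 16;
    demo_counter = 0;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tid[t], NULL, demo_worker, &job) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return demo_counter;
}
```

Both variants produce the same count. They differ in what a blocked thread does: the spin version keeps its processor busy, while the mutex version may ask the operating system to suspend the thread.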

Data Caching and System Bus Bandwidth

Multiprocessor Alpha systems support coherent caches between processors. To use these caches efficiently, as a policy, KAP localizes data as much as possible, keeping repeated references within the same processor. Localizing data reduces the load on the system bus and reduces the chances of cache thrashing.

When all the processors simultaneously request data from the memory, system bus bandwidth can limit SMP performance. If optimizations enhance cache locality, less system bus bandwidth is used, and therefore SMP performance is less likely to be limited.
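One common way to keep repeated references within the same processor is to give each thread a single contiguous block of loop iterations, so that each thread touches its own region of an array. The sketch below shows that partitioning. It is an assumption chosen for illustration, not a description of how KAP itself assigns iterations, and the function name is hypothetical.

```c
#include <stddef.h>

/* Compute the half-open iteration range [*lo, *hi) for thread tid when
   n iterations are split into nthreads contiguous blocks. The first
   (n % nthreads) threads each receive one extra iteration.
   Requires nthreads > 0 and tid < nthreads. */
void demo_chunk_bounds(size_t n, size_t nthreads, size_t tid,
                       size_t *lo, size_t *hi)
{
    size_t base  = n / nthreads;
    size_t extra = n % nthreads;

    *lo = tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);
}
```

With 10 iterations and 4 threads, the ranges are [0, 3), [3, 6), [6, 8), and [8, 10). Each thread then walks adjacent elements, which tends to keep its working set in its own cache and off the system bus.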

KAP Technology

This section covers the issues of data dependence analysis, preprocessor architecture, and the selection of loops to parallelize.

Scientific Computing Optimizations for Alpha

Data Dependence Analysis: The Kernel
