[Figure 9  Vector Multiply Unit. Recoverable labels: FINAL PRODUCT (to DIVU); VML_RESULT[63:0] to VREG]

similar to the process used for double-precision multiplication. However, in single-precision multiplication, only one multiplier chip is needed to produce the result, and the pack chips do not need to sum the partial product. Integer multiplication is slightly different from floating point multiplication because it does not need to be accumulated or rounded. Thus, the correct product is produced by one multiplier. The result bypasses the accumulation and rounding logic and proceeds directly into the packing logic to be sent to the vector register unit.
The exponent handling for both multiplication and division is performed by the same logic on the packing chips. Depending on the instruction being executed, the exponents are either added (multiplication) or subtracted (division). The result of this operation is then piped to the next stage, and the position of the hidden bit is determined. If the fractional portion of the data must be shifted to ensure the hidden bit is in the correct position, the exponent is then incremented or decremented accordingly. The normalize count (i.e., shift count) is used to select the correct final exponent. Overflow and underflow exceptions can be detected and reported only after the final exponent is selected. If an exception is detected, then a reserved operand is written to the appropriate vector register element. The first stage of the exponent logic also checks for divide by zero and reserved operand exceptions.

Division

Vector division is a variable-cycle function. The number of cycles depends on the format of the operands. The custom divider is capable of producing six quotient bits per cycle. Therefore, F_floating point division is performed in 7 cycles, G_floating point in 12 cycles, and D_floating point in 13 cycles. Because of the variable number of cycles in a divide instruction, no other instruction can execute in the V-box while a divide is in process. Also, because of the iterative nature of division (i.e., one division must be completed before another can be started), the instruction cannot be pipelined.
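The quoted cycle counts are consistent with a simple back-of-envelope model. The sketch below assumes that the quotient needs as many bits as the format's fraction including the hidden bit (24 for F_floating, 53 for G_floating, 56 for D_floating) and that a fixed three cycles of overhead surround the six-bits-per-cycle iteration. The three-cycle overhead is an assumption fitted to the published totals, not a figure stated in the text, and the names used are illustrative only.

```python
import math

QUOTIENT_BITS_PER_CYCLE = 6   # stated in the text
ASSUMED_OVERHEAD_CYCLES = 3   # assumption: fitted to the quoted totals

# Fraction widths of the VAX floating point formats, hidden bit included.
FRACTION_BITS = {"F_floating": 24, "G_floating": 53, "D_floating": 56}


def divide_cycles(fmt):
    """Estimated cycles for one divide under the assumptions above."""
    iterations = math.ceil(FRACTION_BITS[fmt] / QUOTIENT_BITS_PER_CYCLE)
    return iterations + ASSUMED_OVERHEAD_CYCLES


for fmt in FRACTION_BITS:
    print(fmt, divide_cycles(fmt))
```

Under these assumptions the model reproduces the 7, 12, and 13 cycles given above; it says nothing about how the real overhead cycles are spent.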
As a vector divide instruction executes, two 64-bit elements are received from the vector register unit each cycle and are latched in the divide unpack chip. The elements are unpacked, and the fractional portion of the elements is sent to the custom divider in 32-bit slices. The exponent portion is sent to the shared exponent logic on the packing chips, as described in the Multiplication section. During this cycle, time-critical values, such as complemented element values and first-cycle quotient bits, are calculated and forwarded to the custom divider.
When the divider receives the data, it uses an iterative algorithm to produce six quotient bits per cycle. The quotient bits produced are then sent to the packing chips, which may have to increment the quotient, depending on the value of subsequent quotient bits. The divider instructs the quotient accumulation logic whether or not incrementing is necessary. The partial quotient, once decided, is held in a bank of latches until all the quotient bits are received. When the entire quotient is available, the result is rounded, normalized, and packed by using the same logic path as multiplication. A multiplexer switches this packing logic between the multiplication and division logic.
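To illustrate what producing six quotient bits per iteration means, here is a minimal sketch of digit-by-digit division with 6-bit quotient groups. It selects each group exactly, so it never needs the later increment described above; the hardware's digit selection and correction scheme is not specified in the text, and this sketch does not claim to reproduce it. The function name and interface are hypothetical.

```python
def divide_six_bits_per_step(dividend, divisor, steps):
    """Return (quotient, remainder) of (dividend << 6*steps) / divisor.

    One 6-bit quotient group is generated per loop iteration and
    appended to the accumulated partial quotient.
    """
    if divisor <= 0 or not 0 <= dividend < divisor:
        raise ValueError("expects 0 <= dividend < divisor")
    remainder = dividend
    quotient = 0
    for _ in range(steps):
        remainder <<= 6                     # bring in six more bit positions
        group = remainder // divisor        # 0..63, since remainder < 64 * divisor
        remainder -= group * divisor
        quotient = (quotient << 6) | group  # accumulate the partial quotient
    return quotient, remainder


q, r = divide_six_bits_per_step(3, 7, 4)    # 24 quotient bits in 4 steps
print(q == (3 << 24) // 7, r == (3 << 24) % 7)
```

Four iterations yield 24 quotient bits, which is the fraction width of F_floating including the hidden bit.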
Performance Characteristics
As of this writing, testing of the vector performance of the VAX 9000 system has only just begun. However, some preliminary results are presented in Table 3. We expect that these results will improve as testing continues and more code is optimized to take advantage of the chaining and overlapping provided by the V-box.

Table 3  VAX 9000 Model 210 Preliminary Performance
         Double-precision MFLOPS, Uniprocessor

                             Size          Vector
Peak rate                    NA            125
LFK (Geometric mean)         441           13.2
LFK (Arithmetic average)     441           20.6
LINPACK                      1000²         80
FFT                          4096          26
Convolution                  150 X 1500    99.15
Matrix multiply              64²           111.36

Digital Technical Journal Vol. 2 No. 4 Fall 1990
Vector Processing on the VAX 9000 System

Chaining and Overlapping

Because of the design of the vector register unit, the V-box can concurrently execute a vector add-class instruction, vector multiply instruction, and vector memory instruction. Unlike the VAX 6000 Model 400 system, vector register conflicts between these instructions have little effect on overlapping. With the VAX 9000 system, a conflict delays the execution of the subsequent vector instruction by one or two cycles at most.
However, the overlapping behavior of the V-box is sensitive to the issue order of vector instructions. If two vector instructions executed by the same V-box unit are issued one after the other, the second instruction is delayed until the V-box unit has finished executing the first. In addition, vector instructions issued after a vector memory instruction or divide instruction do not begin execution until the previous instruction completes. A general rule in scheduling code for the VAX 9000 V-box is to generate, whenever possible, instruction triples, where the first two instructions are a vector add-class and vector multiply instruction and the last instruction is a vector memory or vector divide instruction. Failing that, at least one vector add-class or vector multiply instruction should be issued before a vector memory or vector divide instruction.

The following code examples demonstrate the usage of the VAX vector instruction set and the overlapping behavior of the VAX 9000 V-box. (Note: It should be assumed in the examples that all arrays are 8-byte double precision.)
In the following DAXPY inner loop example, the first two VLDQ instructions do not overlap. However, the VSMULD, VVADDD, and VSTQ instructions do overlap.

    DO i = 1,64
      DY(i) = DY(i) + DA * DX(i)
    enddo

vectorizes as:

    VLDQ     DX, #8, V0      ; Load vector DX
    VLDQ/M   DY, #8, V2      ; Load vector DY
                             ; with modify intent
    VSMULD   DA, V0, V1      ; V1 = DA * DX
    VVADDD   V1, V2, V3      ; V3 = V1 + DY
    VSTQ     V3, DY, #8      ; Store vector DY
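The two issue-order rules stated before these examples (a busy unit serializes its instructions, and nothing issued directly after a memory or divide instruction starts until that instruction completes) can be captured in a toy timing model. The sketch below is only that: the function names, the unit labels, the uniform 64-cycle duration, and the one-cycle issue spacing are all assumptions made for illustration, and pipeline latency, register conflicts, and the longer divide time are ignored.

```python
BLOCKING_UNITS = {"mem", "div"}   # followers wait for these to complete


def schedule(instrs, duration=64):
    """instrs: list of (name, unit). Returns a list of (name, start, finish)."""
    unit_free = {}    # cycle at which each unit finishes its last instruction
    timeline = []
    prev = None       # (unit, start, finish) of the previously issued instruction
    for name, unit in instrs:
        start = 0
        if prev is not None:
            prev_unit, prev_start, prev_finish = prev
            # In-order issue, assumed to be one cycle apart ...
            start = prev_start + 1
            # ... unless the previous instruction was a memory or divide op.
            if prev_unit in BLOCKING_UNITS:
                start = prev_finish
        # A unit executes one instruction at a time.
        start = max(start, unit_free.get(unit, 0))
        finish = start + duration
        unit_free[unit] = finish
        timeline.append((name, start, finish))
        prev = (unit, start, finish)
    return timeline


def overlapping_pairs(timeline):
    """Adjacent instruction pairs whose execution intervals overlap."""
    return [(a[0], b[0]) for a, b in zip(timeline, timeline[1:]) if b[1] < a[2]]


daxpy = [("VLDQ", "mem"), ("VLDQ/M", "mem"), ("VSMULD", "mul"),
         ("VVADDD", "add"), ("VSTQ", "mem")]
print(overlapping_pairs(schedule(daxpy)))
# [('VSMULD', 'VVADDD'), ('VVADDD', 'VSTQ')]
```

For the DAXPY sequence the model reports the same behavior as the text: the two loads run back to back, while the multiply, add, and store overlap.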
The first two VLDQ instructions do not overlap in the following MERGE example:

    DO i = 1,64
      a(i) = b(i) - c(i)
      if (a(i) .gt. 0) then
        b(i) = a(i)
      else
        b(i) = c(i)
      endif
    enddo

vectorizes as:

    VLDQ     b, #8, V0       ; Load vector b
    VLDQ     c, #8, V1       ; Load vector c
    VVSUBD   V0, V1, V2      ; b - c
    VSTQ     V2, a, #8       ; Store vector a
    VSLSSD   #^X0, V2        ; Test a(*) and set mask
                             ; in VMR. (VSCMP
                             ; pseudo-op doing Less
                             ; Than Signed test)
    VVMERGE  V1, V2, V0      ; Merge a and c into b
                             ; using mask in VMR
    VSTQ     V0, b, #8       ; Store vector b

However, the VVSUBD instruction does overlap with the VSTQ instruction. Both the VSLSSD (VSCMP) and VVMERGE instructions are executed by the vector add unit. Therefore, these two instructions do not overlap. However, the VVMERGE instruction does overlap with the VSTQ instruction.
In an IF-THEN-ELSE example, such as the following,

    DO i = 1,64
      if (a(i) .gt. 0) then
        b(i) = c(i)
      else
        b(i) = c(i) / a(i)
      endif
    enddo

vectorizes as:

    VLDQ      a, #8, V0      ; Load vector a
    VSLSSD    #^X0, V0       ; Test a(*) and set mask
                             ; in VMR. (VSCMP
                             ; pseudo-op doing Less
                             ; Than Signed test)
    VLDQ      c, #8, V1      ; Load vector c
    VVDIVD/0  V1, V0, V2     ; Masked divide of c by a
                             ; for VMR(i) = 0
    VSTQ/1    V1, b, #8      ; Store "then" part of b(*)
    VSTQ/0    V2, b, #8      ; Store "else" part of b(*)

Nothing overlaps the first VLDQ instruction, but the VSLSSD instruction does overlap the second VLDQ instruction. Nothing can overlap with the VVDIVD instruction. Thus, the VSTQ instruction does not begin execution until the VVDIVD instruction completes. The remaining VSTQ instruction waits for the first VSTQ instruction to complete.
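The masked execution in this listing can be mimicked element by element. The sketch below is a plain-Python model of the semantics only, not of timing; it assumes that the compare sets a mask bit where 0 < a(i), which matches the Fortran condition and the "/1" store of the "then" part. The function name is hypothetical.

```python
def masked_if_then_else(a, c):
    """Element-wise model of the masked divide and the two masked stores."""
    n = len(a)
    # Compare step: assumed to set VMR(i) = 1 where 0 < a(i).
    vmr = [1 if 0 < a[i] else 0 for i in range(n)]
    # Masked divide (/0): only elements whose mask bit is 0 are computed.
    v2 = [c[i] / a[i] if vmr[i] == 0 else None for i in range(n)]
    b = [None] * n
    # Masked store (/1): "then" part, b(i) = c(i).
    for i in range(n):
        if vmr[i] == 1:
            b[i] = c[i]
    # Masked store (/0): "else" part, b(i) = c(i) / a(i).
    for i in range(n):
        if vmr[i] == 0:
            b[i] = v2[i]
    return b


print(masked_if_then_else([2.0, -4.0, 1.0, -0.5], [8.0, 8.0, 3.0, 1.0]))
# [8.0, -2.0, 3.0, -2.0]
```

Because the divide is masked, elements that take the "then" branch are never divided, which is why the two masked stores together cover every element exactly once.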
In the following scatter-gather example, none of the instructions is overlapped.

    DO i = 1,64
      if (a(i) .eq. 0) then
        b(i) = c(i) / d(i)
      endif
    enddo

vectorizes as:

    VLDQ     a, #8, V0       ; Load vector a
    VSEQLD   #^X0, V0        ; Test a(*) for zero and
                             ; set mask. (VSCMP pseudo-
                             ; op doing Equal test)
    IOTA     #8, V1          ; Make compressed
                             ; vector of offsets,
                             ; write size of vector
                             ; to VCR
    MFVCR    R0              ; Move VCR into R0
                             ; (MFVP pseudo-op)
    MTVLR    R0              ; Load new VLR value
                             ; (MTVP pseudo-op)
    VGATHQ   c, V1, V2       ; Gather vector c
                             ; using offsets in V1
    VGATHQ   d, V1, V3       ; Gather vector d
                             ; using offsets in V1
    VVDIVD   V2, V3, V4      ; Divide c by d
    VSCATQ   V4, b, V1       ; Scatter vector b using
                             ; offsets in V1
It should be noted in this example that the VSEQLD and the IOTA instructions do not overlap. This lack of overlap occurs because the IOTA instruction is actually done with microcode on the E-box, and the IOTA instruction cannot begin execution until the VSEQLD instruction has computed all the new vector mask register bits. The vector register access instructions (MFVCR and MTVLR) take only a few cycles and do not significantly affect the overlapping of other vector instructions.
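The compress, gather, and scatter steps of this example can be sketched in plain Python. The model below follows the comments in the listing: the offsets are byte offsets built with a stride of 8, only elements whose mask bit is set contribute an offset, and the count of offsets becomes the new vector length. The function names are hypothetical, and the code models semantics only.

```python
def iota(stride, vmr):
    """Compressed vector of offsets for set mask bits, plus its length."""
    offsets = [i * stride for i, bit in enumerate(vmr) if bit]
    return offsets, len(offsets)


def scatter_gather_divide(a, b, c, d, stride=8):
    """Model of: if (a(i) .eq. 0) then b(i) = c(i) / d(i)."""
    vmr = [1 if x == 0 else 0 for x in a]             # compare-for-equal step
    offsets, vlr = iota(stride, vmr)                  # offsets and new length
    v2 = [c[off // stride] for off in offsets[:vlr]]  # gather c
    v3 = [d[off // stride] for off in offsets[:vlr]]  # gather d
    v4 = [x / y for x, y in zip(v2, v3)]              # divide c by d
    out = list(b)
    for off, value in zip(offsets, v4):               # scatter into b
        out[off // stride] = value
    return out


a = [0, 1, 0, 5]
b = [9.0, 9.0, 9.0, 9.0]
c = [6.0, 1.0, 9.0, 1.0]
d = [3.0, 1.0, 3.0, 0.0]
print(scatter_gather_divide(a, b, c, d))
# [2.0, 9.0, 3.0, 9.0]
```

Only the selected elements are gathered and divided, so the zero in d at an unselected position is never used as a divisor.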
Summary
By taking advantage of key features of the VAX vector architecture, such as instruction overlapping, imprecise exceptions, and asynchronous interaction with the scalar processor, the vector processor of the VAX 9000 system provides supercomputing performance for computationally intensive applications. Through the use of barber poling, the vector processor can overlap two vector arithmetic instructions with one memory instruction to deliver a peak double-precision performance of 125 MFLOPS.
Acknowledgments
The authors wish to acknowledge the technical contributions of the following individuals to the VAX vector architecture and the VAX 9000 V-box design: Wayne Cardoza, Dave Cutler, Tryggve Fossum, Rich Grove, Kevin Harris, Steve Hobbs, Brian Koblenz, Dwight Manley, Dave Orbits, Bob Supnik, Mike Tehranian, Cheryl Wiecek, and Rich Witek.
James B. McElroy    Frank J. Swiatowiec