Floating point addition involves a series of steps.
1. The exponents are subtracted to determine the shift amount necessary to align the fractions.
2. The fraction operand with the smaller exponent is shifted into alignment and added or subtracted.
3. The result is shifted back to the normalized form (1/2 ≤ result < 1.0). Normalization shifting is accompanied by exponent adjustment.
4. The result is rounded and checked for overflow or underflow conditions.
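The four steps above can be sketched in software as follows. This is only an illustrative model, not the CFPA data path: the `(sign, exponent, fraction)` tuple layout, the 8-bit fraction, the three guard bits, and the exponent limits `EXP_MIN` and `EXP_MAX` are all assumptions chosen for readability.

```python
GUARD = 3                      # assumed number of extra low-order bits kept while aligning
EXP_MIN, EXP_MAX = -127, 127   # hypothetical exponent limits for the range check


def fp_add(a, b, frac_bits=8):
    """Add two toy values (sign, exp, frac), value = sign * frac / 2**frac_bits * 2**exp."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    # Step 1: subtract the exponents to find the alignment shift.
    if ea < eb:
        (sa, ea, fa), (sb, eb, fb) = (sb, eb, fb), (sa, ea, fa)
    shift = ea - eb
    # Step 2: shift the fraction with the smaller exponent right, then add or subtract.
    fa <<= GUARD
    fb <<= GUARD
    sticky = 1 if fb & ((1 << shift) - 1) else 0   # remember any bits shifted out
    fb = (fb >> shift) | sticky
    total = sa * fa + sb * fb
    if total == 0:
        return (1, 0, 0)
    sign, mag, exp = (-1 if total < 0 else 1), abs(total), ea
    # Step 3: shift back to normalized form (1/2 <= fraction < 1), adjusting the exponent.
    top = frac_bits + GUARD
    while mag >= (1 << top):           # carry out of the fraction: shift right
        mag = (mag >> 1) | (mag & 1)
        exp += 1
    while mag < (1 << (top - 1)):      # leading zeros: shift left
        mag <<= 1
        exp -= 1
    # Step 4: round on the guard bits, then check the exponent range.
    mag = (mag + (1 << (GUARD - 1))) >> GUARD
    if mag == (1 << frac_bits):        # rounding carried out of the fraction
        mag >>= 1
        exp += 1
    if exp > EXP_MAX:
        raise OverflowError("exponent overflow")
    if exp < EXP_MIN:
        raise ArithmeticError("exponent underflow")
    return (sign, exp, mag)


def to_float(x, frac_bits=8):
    """Convert a toy (sign, exp, frac) value to a Python float for inspection."""
    sign, exp, frac = x
    return sign * (frac / (1 << frac_bits)) * 2.0 ** exp
```

For example, `fp_add((1, 1, 192), (1, 0, 128))` adds 1.5 and 0.5; the sum carries out of the fraction, so the normalization step shifts right by one bit and increments the exponent.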
Typically, the shifting operations and their control consume large amounts of chip area and potentially a large portion of the total calculation time. An analysis of these operations was used to guide trade-offs in the design of the CFPA.¹ It was noted that although large shifts are sometimes necessary to compute the final result, their frequency of occurrence is very small. Furthermore, a small shifter capable of covering the vast majority of cases in a single operation provides the benefit of a small control circuit that can be more easily optimized for speed. It was decided that the speed and area advantages gained by designing for the most frequently occurring cases provided the best solution under project constraints.
Specifically, a small shifter that is capable of left-four or right-seven bit shifts proved to have adequate range for most alignment and normalization shifts. In up to 80 percent of the cases, additional cycles are not needed for alignment shifting. Larger alignment shifting utilizes the multiplier array for a shift capability of 16 bits per cycle. The array minimizes the worst-case shift time without requiring a large shifter. Although it rarely requires additional cycles, normalization shifting may cause a longer latency. Additional cycles, however, are not necessary for normalization in 93 percent of the cases.
Digital Technical Journal No. 7 August 1988
To reduce the shifter control complexity, a modified ALU calculates the absolute value of the exponent difference. The modified ALU does not require additional calculation time to accomplish this calculation. The absolute value result simplifies control logic to enable the alignment shifter to complete in the next clock phase. Only one additional generate term is needed to enable two carry chains executing simultaneously; one calculates A minus B, the other B minus A. The most significant bit (MSB) of the first carry chain determines the sign of the operation. To produce the absolute value or positive result, the MSB of the first carry chain is used to select the final output from the two carry chains. In addition, the MSB is used to select the fraction requiring alignment.
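A software analogy of the dual carry chain selection may make the idea concrete. The sketch below is a behavioral model only: the function name, the 8-bit exponent width, and the use of one extra bit to hold the sign of the difference are assumptions, and the two subtractions that run simultaneously in hardware are simply computed one after the other.

```python
def abs_exponent_difference(a, b, width=8):
    """Return (|a - b|, msb) for unsigned exponents a and b of the given width.

    msb is the sign bit of A minus B: 1 means A < B, so in this model
    A's fraction is the one that needs the alignment shift.
    """
    mask = (1 << (width + 1)) - 1      # one extra bit so the sign is representable
    a_minus_b = (a - b) & mask         # first carry chain
    b_minus_a = (b - a) & mask         # second carry chain
    msb = a_minus_b >> width           # MSB of the first chain gives the sign
    result = b_minus_a if msb else a_minus_b   # select the positive result
    return result, msb
```

With exponents 125 and 130, the first chain yields a negative value (MSB set), so the output of the second chain, 5, is selected as the shift amount.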
The CFPA completes addition or subtraction operations in three cycles for most cases. This minimum execution time is exceeded for only 25 percent of all addition or subtraction operations, almost all of which require only one additional cycle.
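The 25 percent figure is roughly what the alignment and normalization percentages quoted earlier would give if the two were combined. The short calculation below is a plausibility check under two assumptions that the text does not state: that the alignment and normalization cases are independent, and that every operation exceeding the minimum needs exactly one extra cycle.

```python
P_ALIGN_NO_EXTRA = 0.80   # from the text: no extra alignment cycles in up to 80 percent of cases
P_NORM_NO_EXTRA = 0.93    # from the text: no extra normalization cycles in 93 percent of cases


def fraction_exceeding_minimum(p_align=P_ALIGN_NO_EXTRA, p_norm=P_NORM_NO_EXTRA):
    """Fraction of operations needing at least one extra cycle.

    ASSUMPTION: alignment and normalization outcomes are independent.
    """
    return 1.0 - p_align * p_norm


def average_add_cycles(base_cycles=3, p_exceed=0.25, extra_cycles=1):
    """Average cycle count, ASSUMING each exceeding operation costs exactly one extra cycle."""
    return base_cycles + p_exceed * extra_cycles
```

Under these assumptions about 26 percent of operations exceed three cycles, and the average is about 3.25 cycles per operation.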
The major improvement over the MicroVAX II FPU in the addition/subtraction algorithm is the elimination of no-operation cycles necessary for control evaluation preceding the alignment and normalization steps. The resultant reduction as compared to the MicroVAX II FPU is from eight cycles to three for both single- and double-precision additions/subtractions in the actual floating point unit calculations (3 to 1 at 90 ns, 3.3 to 1 at 80 ns).
The overall performance gain in equivalent cycles is 20 to 8 for single-precision (2.8 to 1 at 90 ns, 3.1 to 1 at 80 ns) and 26 to 11 for double-precision addition/subtraction (2.6 to 1 at 90 ns, 3.0 to 1 at 80 ns).
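The quoted speed ratios combine a cycle count reduction with a shorter cycle time. The sketch below reproduces them under one assumption that is not stated in this section: a 100-ns cycle for the MicroVAX II FPU, a value chosen here only because it is consistent with the ratios given above.

```python
MICROVAX_II_CYCLE_NS = 100.0   # ASSUMPTION: inferred from the quoted ratios, not given in the text


def speedup(old_cycles, new_cycles, new_cycle_ns, old_cycle_ns=MICROVAX_II_CYCLE_NS):
    """Ratio of old execution time to new execution time."""
    return (old_cycles * old_cycle_ns) / (new_cycles * new_cycle_ns)


if __name__ == "__main__":
    for label, old, new in [("FPU add/sub", 8, 3),
                            ("single-precision overall", 20, 8),
                            ("double-precision overall", 26, 11)]:
        for cycle_ns in (90.0, 80.0):
            print(f"{label}: {speedup(old, new, cycle_ns):.2f} to 1 at {cycle_ns:.0f} ns")
```

The same function applies to any of the cycle-count comparisons in this paper, provided the assumed MicroVAX II cycle time holds.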
Division
Floating point division consists of division of the fraction or mantissa and subtraction of the exponents. Division presents a more intractable problem than multiplication when designing for high-speed performance. The difficulty arises because the partial remainder at each step must be examined before the next operation can be determined. Various algorithms have been proposed to reduce the number of arithmetic steps, but no single solution seems to optimize both performance and size constraints.
The CFPA uses a method of division that offers an improvement over single-bit division algorithms, which perform an arithmetic operation to produce a single quotient bit per step. The
method calls for shifting over, or normalizing, multiple leading bits when the partial remainder is small. A partial remainder with multiple leading ones indicates a small negative remainder, whereas leading zeros indicate a small positive remainder. Multiple quotient bits can be determined for cycles in which the magnitude of the partial remainder is small. Shift operations replace arithmetic operations on unnormalized remainders, reducing the number of ALU cycles needed to develop the final quotient. This method of division is called normalizing, nonrestoring division and is also used in the MicroVAX FPU. The difference between the two implementations is in the normalization shift range provided for partial remainder and quotient development.
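A small software model can illustrate how shifting over leading bits reduces the number of ALU cycles. The sketch below is a textbook-style normalizing, nonrestoring recurrence written with exact fractions; it is not the CFPA logic. The function names, the cycle accounting (one shift of up to `shift_range` positions followed by at most one add or subtract per cycle), and the operand ranges are assumptions, and the model makes no claim to reproduce measured hardware rates.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def is_normalized(r):
    """True when the remainder has no redundant leading zeros or ones."""
    return r >= HALF or r < -HALF


def normalizing_divide(dividend, divisor, n_bits, shift_range):
    """Return (quotient, cycles) for 0 <= dividend < 1 and 1/2 <= divisor < 1.

    The quotient approximates dividend / divisor to within 2**-(n_bits - 1).
    """
    assert HALF <= divisor < 1 and 0 <= dividend < 1
    r = Fraction(dividend) / 2          # scale so that |r| <= divisor initially
    q = Fraction(0)
    bits = 0
    cycles = 0
    while bits < n_bits:
        cycles += 1
        shifted = 0
        # Shift over leading zeros (small positive) or ones (small negative).
        # Each position shifted is one quotient digit; digits are 0 unless
        # an add or subtract follows the shift.
        while shifted < shift_range and bits < n_bits:
            r *= 2
            bits += 1
            shifted += 1
            if is_normalized(r):
                break
        weight = Fraction(1, 2 ** bits)
        if r >= HALF:                   # normalized positive: subtract the divisor
            r -= divisor
            q += weight
        elif r < -HALF:                 # normalized negative: add the divisor
            r += divisor
            q -= weight
    return 2 * q, cycles                # undo the initial scaling of the dividend
```

Calling `normalizing_divide(Fraction(3, 4), Fraction(5, 8), 24, 1)` uses one cycle per quotient bit, while a larger `shift_range` yields the same quotient in fewer cycles whenever the partial remainder has leading zeros or ones to skip.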
Of course, this algorithm is quite data sensitive. A division that results in a partial remainder of all ones or all zeros can be completed in a minimum amount of time; whereas, if a string of alternating ones and zeros is produced at each ALU operation, the process degenerates to a one-bit-per-cycle pace. The observed average rate for an algorithm that allows unlimited shift range is 2.66 bits per cycle. Unfortunately, the shift range chosen implies a control structure directly between the shift and ALU operations. The time between these operations is critically important to the overall cycle of the chip. We chose 4 bits as the left shift range for the CFPA to reap the maximum benefit from the technique without introducing inordinately difficult control paths between the shift and ALU operations. This amounts to an increase of 2 bits of shift range over the MicroVAX FPU. Correspondingly, the average number of quotient bits developed each cycle increased from 1.5 to 2.4. Expanding the shifter beyond a range of 4 for this method provides a diminishing improvement, as shown in Table 2.
Table 2  Average Quotient Bits per Cycle

Shifter Range    2      4      6      8      Unlimited
Average Speed    1.5    2.39   2.54   2.64   2.66
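The diminishing improvement can be read directly from the differences between adjacent entries of Table 2. The snippet below only does that arithmetic on the published values; the helper names are illustrative.

```python
# (shifter range, average quotient bits per cycle), copied from Table 2
TABLE_2 = [(2, 1.5), (4, 2.39), (6, 2.54), (8, 2.64)]
UNLIMITED_RATE = 2.66


def marginal_gains(rows):
    """Gain in bits per cycle for each step up in shifter range."""
    return [(r2, round(s2 - s1, 2)) for (_, s1), (r2, s2) in zip(rows, rows[1:])]


def share_of_unlimited(rate, unlimited=UNLIMITED_RATE):
    """Fraction of the unlimited-range rate achieved by a given rate."""
    return rate / unlimited
```

Going from a range of 2 to 4 adds 0.89 bits per cycle, whereas going from 4 to 6 adds only 0.15; a range of 4 already achieves roughly 90 percent of the unlimited rate.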
Development of the CVAX Floating Point Chip
Increasing the number of quotient bits developed per cycle from 1.5 to 2.4 results in increased speeds in the CFPA divide loop relative to the MicroVAX FPU: 1.8 times greater for 90-ns cycles, and 2.0 times greater for 80-ns cycles. The overhead cycles involved in setting up the divide sequence and normalizing the quotient are reduced from 7 to 2. As a result, the CFPA requires fewer cycles for division than the MicroVAX II FPU. Including the processor-to-FPU interface cycles, the number of cycles for single-precision division is reduced from 37 to 18 cycles (2.3 at 90 ns, 2.6 at 80 ns); for D_floating double-precision division, 61 to 35 (1.94 at 90 ns, 2.2 at 80 ns).
Comparatively, this method of division is very efficient, especially when we consider the small amount of control circuitry and data path area required. Designers can increase performance additionally by using algorithms that employ multiples of the divisor, or by implementing a divider array structure. The use of multiples of the divisor requires both additional registers to hold the multiples (3/4, 1, 5/2) and further expansion of the left shift capability to take advantage of the longer normalizations created by this approach (3.6 bits per cycle with left shift range expanded to 6). In addition, the control logic required to support the selection of the proper multiple is more complex and would be much more difficult to implement in the constrained cycle time. The other alternative of executing the divide step in an array structure for performance capable of 3 to 4 quotient bits per cycle involves an even greater cost in hardware and is not consistent with the project goals.
Integer division does not automatically benefit from hardware devoted to floating point division. Since floating point division relies on the normalization of the operands, integer division must either convert operands to the normalized form or accept a slower one-bit-per-cycle algorithm. The CFPA design for integer division normalizes both the divisor and dividend in order to use the 2.4-bit-per-cycle divide algorithm. Normalization of the divisor and dividend proceeds at 5 bits per cycle. The number of quotient bits needed to complete the integer division operation is determined by the difference between the normalization shift amounts of the divisor and dividend. Consequently, integer divides are typically executed at
2.5 bits per cycle as compared to 1 bit per cycle on the MicroVAX FPU.
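The relationship between the normalization shift amounts and the number of quotient bits can be sketched as follows. This is an illustration for positive integers only; the 32-bit operand width, the function names, and the extra bit added to the difference (so that the result is a safe upper bound) are assumptions rather than details given in the text.

```python
def normalization_shift(value, width=32):
    """Left shift needed to move the leading one of a positive integer to the top bit."""
    assert 0 < value < (1 << width)
    return width - value.bit_length()


def quotient_bits_needed(dividend, divisor, width=32):
    """Upper bound on the number of integer quotient bits.

    Derived from the difference between the two normalization shift amounts;
    the '+ 1' is an assumption that covers the case in which the normalized
    dividend is not smaller than the normalized divisor.
    """
    difference = normalization_shift(divisor, width) - normalization_shift(dividend, width)
    return max(difference + 1, 0)
```

For 1000 divided by 7, the divisor needs 7 more positions of normalization shift than the dividend, so at most 8 quotient bits are required; the true quotient, 142, fits in 8 bits.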