Figure 9
RAM Clock Distribution
Digital Technical Journal Vol. 8 No. 4 1996
series-damping resistors in each cache data line, as shown in Figure 10. Automatic component placement machines and the availability of resistors in small packages made mounting 288 resistors on the module a painless task, and the payoff was huge: nearly perfect signals even in the presence of spurious data transitions caused by the microprocessor's architectural features and RAM characteristics. Figure 11 illustrates the handling of some of the more difficult waveforms.
Performance Features
This section discusses the performance of the AlphaServer 4100 system derived from the physical aspects of the CPU module design and the effects of the duplicate TAG store.
Physical Aspects of the Design
As previously mentioned, the synchronous cache was chosen primarily for performance reasons. The architecture of the Alpha 21164 microprocessor is such that its data bus is used for transfers to and from main memory (fills and writes) as well as its B-cache. As system cycle times decrease, it becomes a challenge to manage memory transactions without requiring wait cycles using asynchronous cache RAM devices. For example, a transfer from the B-cache to main memory (victim transaction) has the following delay components:
1. The microprocessor drives the address off-chip.
2. The address is fanned out to the RAM devices.
3. The RAMs retrieve data.
4. The RAMs drive data to the bus interface device.
5. The bus interface device requires a setup time.
Worst-case delay values for the above items might be the following:
1. 2.6 nanoseconds
2. 5.0 nanoseconds
3. 9.0 nanoseconds
4. 2.0 nanoseconds
5. 1.0 nanoseconds
Total: 19.6 nanoseconds
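The worst-case budget above can be checked with a few lines of arithmetic. The following Python sketch sums the five example delay components and reports how many whole system cycles a single asynchronous read would occupy at a given cycle time. The function names and the whole-cycle rounding model are illustrative assumptions; only the delay values come from the list above.

```python
import math

# Worst-case delay components from the list above, in nanoseconds.
DELAYS_NS = {
    "address driven off-chip": 2.6,
    "address fan-out to RAMs": 5.0,
    "RAM data retrieval": 9.0,
    "RAM drives data to bus interface": 2.0,
    "bus interface setup time": 1.0,
}

def total_delay_ns(delays):
    """Sum the delay components of one asynchronous cache read."""
    return round(sum(delays.values()), 1)

def cycles_per_read(delay_ns, cycle_ns):
    """Whole system cycles needed to cover delay_ns (illustrative model)."""
    return math.ceil(delay_ns / cycle_ns)

total = total_delay_ns(DELAYS_NS)
print(f"total worst-case delay: {total} ns")
print(f"cycles per read at 20 ns: {cycles_per_read(total, 20.0)}")
print(f"cycles per read at 15 ns: {cycles_per_read(total, 15.0)}")
```

Under this model a read fits in one cycle only while the cycle time stays at or above the 19.6-nanosecond total; any shorter cycle forces a second cycle per read.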
Figure 10
RAM Driving the Microprocessor and Transceiver through
Figure 11
Handling of Difficult Waveforms
Thus, for system cycle times that are significantly shorter than 20 nanoseconds, it becomes impossible to access the RAM without using multiple cycles per read operation, and since the full transfer involving memory comprises four of these operations, the penalty mounts considerably. Due to pipelining, the synchronous cache enables this type of read operation to occur at a rate of one per system cycle, which is 15 nanoseconds in the AlphaServer 4100 system, greatly increasing the bandwidth for data transfers to and from memory. Since the synchronous RAM is a pipeline stage, rather than a delay element, the window of valid data available to be captured at the bus interface is large. By driving the RAMs with a delayed copy of the system clock, delay components 1 and 2 are hidden, allowing faster cycling of the B-cache.
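A rough comparison makes the benefit concrete. The Python sketch below estimates the time to move one four-read transfer under two simplified models: an asynchronous cache in which each read occupies a whole number of cycles, and a pipelined synchronous cache that delivers one read per cycle after an initial latency. The 19.6-nanosecond access path and the 15-nanosecond cycle come from the text; the one-cycle pipeline fill latency and the function names are assumptions made for illustration, not figures from the design.

```python
import math

CYCLE_NS = 15.0          # AlphaServer 4100 system cycle time (from the text)
ASYNC_ACCESS_NS = 19.6   # worst-case asynchronous read path (from the text)
READS_PER_TRANSFER = 4   # a full memory transfer comprises four reads

def async_transfer_ns(reads, access_ns, cycle_ns):
    """Each read occupies a whole number of cycles; reads do not overlap."""
    cycles_per_read = math.ceil(access_ns / cycle_ns)
    return reads * cycles_per_read * cycle_ns

def pipelined_transfer_ns(reads, cycle_ns, fill_cycles=1):
    """One read completes per cycle after fill_cycles of pipeline latency.

    fill_cycles is an assumed value for illustration only.
    """
    return (fill_cycles + reads) * cycle_ns

async_ns = async_transfer_ns(READS_PER_TRANSFER, ASYNC_ACCESS_NS, CYCLE_NS)
sync_ns = pipelined_transfer_ns(READS_PER_TRANSFER, CYCLE_NS)
print(f"asynchronous: {async_ns} ns, pipelined: {sync_ns} ns")
```

The exact numbers depend on the assumed fill latency, but the shape of the result does not: the asynchronous model pays the multi-cycle penalty on every read, while the pipelined model pays its latency once per transfer.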
When an asynchronous cache communicates with the system bus, all data read out from the cache must be synchronized with the bus clock, which can add as many as two clock cycles to the transaction. The synchronous B-cache avoids this performance penalty by cycling at the same rate as the system bus.2
In addition, the choice of synchronous RAMs provides a strategic benefit; other microprocessor vendors are moving toward synchronous caches. For example, numerous Intel Pentium microprocessor-based systems employ pipeline-burst, module-level caches using synchronous RAM devices. The popularity of these systems has a large bearing on the RAM industry.9 It is in DIGITAL's best interest to follow the synchronous RAM trend of the industry, even for Alpha-based systems, since the vendor base will be larger. These vendors will also be likely to put their efforts into improving the speeds and densities of the best-selling synchronous RAM products, which will facilitate improving the cache performance in future variants of the processor modules.
Effect of Duplicate Tag Store (DTAG)
As mentioned previously, the DTAG provides a mechanism to filter irrelevant bus transactions from the Alpha 21164 microprocessor. In addition, it provides an opportunity to speed up memory writes by the I/O bridge when they modify an amount of data that is smaller than the cache block size of 64 bytes (partial block writes).
DATA LINE SCALE: 1.00 VOLT/DIVISION, OFFSET 2.000 VOLTS, INPUT DC 50 OHMS; TIME BASE SCALE: 10.0 NANOSECONDS/DIVISION
The AlphaServer 4100 I/O subsystem consists of a PCI motherboard and a bridge. The PCI motherboard accepts I/O adapters such as network interfaces, disk controllers, or video controllers. The bridge provides the interface between PCI devices and between the CPUs and system memory. The I/O bridge reads and writes memory in much the same way as the CPUs, but special extensions are built into the system bus protocol to handle the requirements of the I/O bridge.
Typically, writes by the I/O bridge that are smaller than the cache block size require a read-modify-write sequence on the system bus to merge the new data with data from main memory or a processor's cache. The AlphaServer 4100 memory system typically transfers data in 64-byte blocks; however, it has the ability to accept writes to aligned 16-byte locations when the I/O bridge is sourcing the data. When such a partial block write occurs, the processor module checks the DTAG to determine if the address hits in the Alpha 21164 cache hierarchy. If it misses, the partial write is permitted to complete unhindered. If there is a hit, and the processor module contains the most recently modified copy of the data, the I/O bridge is alerted to replay the partial write as a read-modify-write sequence. This feature enhances DMA write performance for transfers smaller than 64 bytes since most of these references do not hit in the processor cache.
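The decision the processor module makes on a partial block write can be sketched as follows. This Python model is a simplification under stated assumptions: the DTAG is represented as a dictionary from block number to a "modified" flag, and the names `DuplicateTagStore`, `handle_partial_write`, and `Action` are hypothetical, not taken from the hardware design. The text does not say what happens on a hit to unmodified data, so the sketch reports that case separately rather than guessing.

```python
from enum import Enum

BLOCK_BYTES = 64      # cache block size (from the text)
PARTIAL_BYTES = 16    # aligned partial-write granularity (from the text)

class Action(Enum):
    COMPLETE = "partial write completes unhindered"
    REPLAY_AS_RMW = "I/O bridge replays as read-modify-write"
    UNSPECIFIED = "hit on unmodified data; not described in the text"

class DuplicateTagStore:
    """Hypothetical DTAG model: block number -> True if modified in cache."""

    def __init__(self):
        self._blocks = {}

    def record(self, address, modified):
        self._blocks[address // BLOCK_BYTES] = modified

    def lookup(self, address):
        """Return None on a miss, else the block's modified flag."""
        return self._blocks.get(address // BLOCK_BYTES)

def handle_partial_write(dtag, address):
    """Decide how a 16-byte partial block write from the I/O bridge proceeds."""
    if address % PARTIAL_BYTES != 0:
        raise ValueError("partial writes must be 16-byte aligned")
    state = dtag.lookup(address)
    if state is None:
        return Action.COMPLETE          # miss: no cached copy to merge with
    if state:
        return Action.REPLAY_AS_RMW     # hit on the most recently modified copy
    return Action.UNSPECIFIED           # hit on clean data: outside the text

dtag = DuplicateTagStore()
dtag.record(0x1000, modified=True)
print(handle_partial_write(dtag, 0x1010))  # same 64-byte block as 0x1000
print(handle_partial_write(dtag, 0x2000))  # block not present in the DTAG
```

Because most DMA references miss in the processor cache, the common path in this model is the first branch, which avoids the read-modify-write sequence entirely.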
Conclusions
The synchronous B-cache allows the CPU modules to provide high performance with a simple architecture, achieving the price and performance goals of the AlphaServer 4100 system. The AlphaServer 4100 CPU design team pioneered the use of synchronous RAMs in an Alpha microprocessor-based system design, and the knowledge gained in bringing a design from conception to volume shipment will benefit future upgrades in the AlphaServer 4100 server family, as well as products in other platforms.
Acknowledgments
The development of this processor module would not have been possible without the support of numerous individuals. Rick Hetherington performed early conceptual design and built the project team. Pete Bannon implemented the synchronous RAM support features in the CPU design. Ed Rozman championed the use of random testing techniques. Norm Plante's skill and patience in implementing the often tedious PC layout requirements contributed in no small measure to the project's success. Many others contributed to firmware design, system testing, and performance analysis, and their contributions are gratefully acknowledged. Special thanks must go to Darrel Donaldson for supporting this project throughout the entire development cycle.
References
1. DIGITAL AlphaServer Family DIGITAL UNIX Performance Flash (Maynard, Mass.: Digital Equipment Corporation, 1996), http://www.europe.digital.com/info/performance/sys/unix-svr-flash-9.abs.html.
2. Z. Cvetanovic and D. Donaldson, "AlphaServer 4100 Performance Characterization," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 3-20.
3. G. Herdeg, "Design and Implementation of the AlphaServer 4100 CPU and Memory Architecture," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 48-60.
4. S. Duncan, C. Keefer, and T. McLaughlin, "High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System," Digital Technical Journal, vol. 8, no. 4 (1996, this issue): 61-75.
5. "Microprocessor Report," MicroDesign Resources, vol. 8, no. 15 (1994).
6. IBM Personal Computer Power Series 800 Performance (Armonk, N.Y.: International Business Machines Corporation, 1995), http://ike.engr.washington.edu/news/whitep/ps-perf.html.
7. L. Saunders and Y. Trivedi, "Testbench Tutorial," Integrated System Design, vol. 7 (April and May 1995).
8. DIGITAL Semiconductor 21164 (366 MHz Through 433 MHz) Alpha Microprocessor Hardware Reference Manual (Hudson, Mass.: Digital Equipment Corporation, 1996).
9. J. Handy, "Synchronous SRAM Roundup," Dataquest (September 11, 1995).
General Reference
R. Sites, ed., Alpha Architecture Reference Manual (Burlington, Mass.: Digital Press, 1992).
Biographies
Maurice B. Steinman
Maurice Steinman is a hardware principal engineer in the Server Product Development Group and was the leader of the CPU design team for the DIGITAL AlphaServer 4100 system. In previous projects, he was one of the designers of the AlphaServer 8400 CPU module and a designer of the cache control subsystem for the VAX 9000 computer system. Maurice received a B.S. in computer and systems engineering from Rensselaer Polytechnic Institute in 1986. He was awarded two patents related to cache control and coherence and has two patents pending.
George J. Harris
George Harris was responsible for the signal integrity and cache design of the CPU module in the AlphaServer 4100 series. He joined DIGITAL in 1981 and is a hardware principal engineer in the Server Product Development Group. Before joining DIGITAL, he designed digital circuits at the computer divisions of Honeywell, RCA, and Ferranti. He also designed computer-assisted medical monitoring systems using PDP-11 computers for the American Optical Division of Warner Lambert. He received a master's degree in electronic communications from McGill University, Montreal, Quebec, and was awarded ten patents relating to computer-assisted medical monitoring and one patent related to work at DIGITAL in the area of circuit design.
Andrej Kocev
Andrej Kocev joined DIGITAL in 1994 after receiving a B.S. in computer science from Rensselaer Polytechnic Institute. He is a senior hardware engineer in the Server Product Development Group and a member of the CPU verification team. He designed the logic verification software described in this paper.
Virginia C. Lamere
Virginia Lamere is a hardware principal engineer in the Server Product Development Group and was responsible for CPU module design in the DIGITAL AlphaServer 4100 series. Ginny was a member of the verification teams for the AlphaServer 8400 and AlphaServer 2000 CPU modules. Prior to those projects, she contributed to the design of the floating-point processor on the VAX 8600 and the execution unit on the VAX 9000 computer system. She received a B.S. in electrical engineering and computer science from Princeton University in 1981. Ginny was awarded two patents in the area of the execution unit design and is a co-author of the paper "Floating Point Processor for the VAX 8600" published in this Journal.
Roger D. Pannell
Roger Pannell was the leader of the VCTY ASIC design team for the AlphaServer 4100 system. He is a hardware principal engineer in the Server Product Development Group. Roger has worked on several projects since joining Digital in 1977. Most recently, he has been a module/ASIC designer on the AlphaServer 8400 and VAX 7000 I/O port modules and a bus-to-bus I/O bridge. Roger received a B.S. in electronic engineering technology from the University of Lowell.