and Memory Architecture
The DIGITAL AlphaServer 4100 system is Digital Equipment Corporation's newest four-processor midrange server product. The server design is based on the Alpha 21164 CPU, DIGITAL's latest 64-bit microprocessor, operating at speeds of up to 400 megahertz and beyond. The memory architecture was designed to interconnect up to four Alpha 21164 CPU chips and up to four 64-bit PCI bus bridges (the AlphaServer 4100 supports up to two buses) to as much as 8 gigabytes of main memory. The performance goal for the AlphaServer 4100 memory interconnect was to deliver a four-multiprocessor server with the lowest memory latency and highest memory bandwidth in the industry by the end of June 1996. These goals were met by the time the AlphaServer 4100 system was introduced in May 1996. The memory interconnect design enables the server system to achieve a minimum memory latency of 120 nanoseconds and a maximum memory bandwidth of 1 gigabyte per second by using off-the-shelf data path and address components and programmable logic between the CPU and the main memory, which is based on the new synchronous dynamic random-access memory technology.
Digital Technical Journal Vol. 8 No. 4 1996
Glenn A. Herdeg
The DIGITAL AlphaServer 4100 system is a symmetric multiprocessing (SMP) midrange server that supports up to four Alpha 21164 microprocessors. A single Alpha 21164 CPU chip may simultaneously issue multiple external accesses to main memory. The AlphaServer 4100 memory interconnect was designed to maximize this multiple-issue feature of the Alpha 21164 CPU chip and to take advantage of the performance benefits of the new family of memory chips called synchronous dynamic random-access memories (SDRAMs). To meet the best-in-industry latency and bandwidth performance goals, DIGITAL developed a simple memory interconnect architecture that combines the existing Alpha 21164 CPU memory interface with the industry-standard SDRAM interface.
Throughout this paper the term latency refers to the time required to return data from the memory chips to the CPU chips; the lower the latency, the better the performance. The AlphaServer 4100 system achieves a minimum latency of 120 nanoseconds (ns) from the time the address appears at the pins of the Alpha 21164 CPU to the time the CPU internally receives the corresponding data from any address in main memory. The term bandwidth refers to the amount of data, i.e., the number of bytes, transferred between the memory chips and the CPU chips per unit of time; the higher the bandwidth, the better the performance. The AlphaServer 4100 delivers a maximum memory bandwidth of 1 gigabyte per second (GB/s).
Before introducing the DIGITAL AlphaServer 4100 product in May 1996, the development team conducted an extensive performance comparison of the top servers in the industry. The benchmark tests showed that the AlphaServer 4100 delivered the lowest memory latency and the highest McCalpin memory bandwidth of all the two- to four-processor systems in the industry. A companion paper in this issue of the Journal, "AlphaServer 4100 Performance Characterization," contains the comparative information.1
This paper focuses on the architecture and design of the three core modules that were developed concurrently to optimize the performance of the entire memory architecture. These three modules (the motherboard, the synchronous memory module, and the no-external-cache processor module) are shown in Figure 1.
Motherboard
The motherboard contains connectors for up to four processor modules, up to four memory module pairs, up to two I/O interface modules (four peripheral component interconnect [PCI] bus bridge chips total), memory address multiplexers/drivers, and logic for memory control and arbitration. All control logic on the motherboard is implemented using simple 5-ns 28-pin programmable array logic (PAL) devices and more complex 90-megahertz (MHz) 44-pin programmable logic devices (PLDs) clocked at 66 MHz. Several motherboards have been produced to support various numbers of processor modules, memory modules, and I/O interface modules. The AlphaServer 4100 supports one to four processor modules, one to four memory module pairs (8-GB maximum memory), and one I/O interface module (up to two PCI buses).
Synchronous Memory Module
The synchronous memory modules are custom-designed, 72-bit-wide plug-in cards installed in pairs to cover the full width of the 144-bit memory data bus. Synchronous memory modules that provide 32 megabytes (MB) to 256 MB per pair were designed using 16-megabit (Mb) SDRAM chips. These memory modules contain nine, eighteen, thirty-six, or seventy-two 100-MHz SDRAM chips clocked at 66 MHz, four 18-bit clocked data transceivers, address fan-out buffers, and control provided by 5-ns 28-pin PALs. To increase the maximum amount of memory in the system, a family of plug-in compatible memory modules was designed, providing up to 2 GB per pair using 64-Mb extended data out dynamic random-access memory (EDO DRAM) chips. These modules contain 72 or 144 EDO DRAM chips controlled by two custom application-specific integrated circuits (ASICs) providing data multiplexing and control, four 18-bit clocked data transceivers, and address fan-out buffers. Consequently, the AlphaServer 4100 memory architecture provides main memory capacities of 32 MB to 8 GB with a minimum latency of 120 ns to any address. This paper concentrates on the implementation of the synchronous memory modules, although the EDO memory modules are functionally compatible. The reconfigurability description later in this paper provides more details of the implementation of the EDO memory modules.
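The module capacities quoted above can be cross-checked with simple arithmetic. The following sketch is illustrative only. It assumes that 8 of every 9 stored bits hold data and the ninth holds check bits (64 data bits per 72-bit module); this split is not stated in the text above, but it is the ratio that reproduces the quoted 32-MB, 256-MB, and 2-GB figures. The function and constant names are hypothetical.

```python
# Cross-check of the memory capacities quoted in the text.
# ASSUMPTION (not stated in the text): 8 of every 9 stored bits are data
# and the ninth is a check bit, i.e. 64 data bits per 72-bit module.

DATA_BITS_PER_9 = 8     # assumed data bits out of every 9 stored bits
MODULES_PER_PAIR = 2    # modules are installed in pairs (from the text)
MAX_PAIRS = 4           # up to four memory module pairs (from the text)


def pair_capacity_mb(chips_per_module: int, chip_megabits: int) -> int:
    """Usable data capacity of one module pair, in megabytes."""
    raw_megabits = chips_per_module * chip_megabits * MODULES_PER_PAIR
    data_megabits = raw_megabits * DATA_BITS_PER_9 // 9
    return data_megabits // 8  # 8 bits per byte


if __name__ == "__main__":
    # Synchronous modules built from 16-Mb SDRAM chips.
    for chips in (9, 18, 36, 72):
        print(f"SDRAM {chips:3d} chips/module: {pair_capacity_mb(chips, 16)} MB per pair")
    # EDO modules built from 64-Mb EDO DRAM chips.
    for chips in (72, 144):
        print(f"EDO   {chips:3d} chips/module: {pair_capacity_mb(chips, 64)} MB per pair")
    print(f"Maximum system memory: {MAX_PAIRS * pair_capacity_mb(144, 64)} MB")
```

Under the stated assumption, four pairs of the largest EDO modules give 8,192 MB, which matches the 8-GB maximum quoted for the AlphaServer 4100.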
No-External-Cache Processor Module
The no-external-cache processor module is a plug-in card with a 144-bit memory interface that contains one Alpha 21164 CPU chip, eight 18-bit clocked data transceivers, four 12-bit bidirectional address latches, and control provided by 5-ns 28-pin PALs and 90-MHz 44-pin PLDs clocked at 66 MHz. The Alpha 21164 CPU chip is programmed to operate at a synchronous memory interface cycle time of 66 MHz (15 ns) to match the speed of the SDRAM chips on the memory modules. Although there are no external cache random-access memory (RAM) chips on the module, the Alpha 21164 itself contains two levels of on-chip caches: a primary 8-kilobyte (KB) data cache and a primary 8-KB instruction cache, and a second-level 96-KB three-way set-associative data and instruction cache. The no-external-cache processor module was designed to take advantage of the multiple-issue feature of the Alpha 21164 CPU. By keeping the latency to main memory low and by issuing multiple references from the Alpha 21164 CPU to main memory at the same time to increase memory bandwidth, the performance of many applications actually exceeds the performance of a processor module with a third-level external cache. Numerous applications perform better, however, with a large on-board cache. For this reason, the AlphaServer 4100 offers several variants of plug-in compatible processor modules containing a 2-MB, 4-MB, or greater module-level cache. The paper "The AlphaServer 4100 Cached Processor Module Architecture and Design," which appears in this issue of the Journal, contains more related information.
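As a plausibility check on the figures quoted so far, the 120-ns minimum latency and the 1-GB/s maximum bandwidth can be related to the 15-ns memory interface cycle. The sketch below is not the paper's own derivation. It assumes that 128 of the 144 bus bits carry data (the rest being check bits) and that at peak one full-width transfer completes every cycle; neither assumption is stated in the text above. All names are illustrative.

```python
# Relating the quoted latency and bandwidth to the 15-ns bus cycle.
# ASSUMPTIONS (not stated in the text): 128 of the 144 bus bits are data,
# and at peak one full-width transfer completes on every bus cycle.

CYCLE_NS = 15            # memory interface cycle time (from the text)
MIN_LATENCY_NS = 120     # minimum memory latency (from the text)
ASSUMED_DATA_BITS = 128  # assumed data portion of the 144-bit bus


def latency_in_cycles(latency_ns: int, cycle_ns: int) -> float:
    """Express a latency as a number of bus cycles."""
    return latency_ns / cycle_ns


def peak_bandwidth_bytes_per_s(data_bits: int, cycle_ns: int) -> float:
    """Peak bytes per second if one transfer completes every cycle."""
    bytes_per_transfer = data_bits // 8
    return bytes_per_transfer * 1_000_000_000 / cycle_ns


if __name__ == "__main__":
    cycles = latency_in_cycles(MIN_LATENCY_NS, CYCLE_NS)
    peak = peak_bandwidth_bytes_per_s(ASSUMED_DATA_BITS, CYCLE_NS)
    print(f"Minimum latency: {cycles:.0f} bus cycles")
    print(f"Peak bandwidth:  {peak / 1e9:.2f} GB/s")
```

Under these assumptions the minimum latency corresponds to eight bus cycles, and the peak transfer rate comes out slightly above 1 GB/s, in line with the quoted maximum.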
The three components of the core module set were designed concurrently to address five issues:
1. Simple design
2. Quick design time
3. Low memory latency
4. High memory bandwidth
5. Reconfigurability
Simple Design
The Alpha 21164 CPU chip is based on a reduced instruction set computing (RISC) architecture, which has a small, simple set of instructions operating as fast as possible. AlphaServer 4100 designers set the same goal of simplicity for the rest of the server system.
The AlphaServer 4100 interconnect between the CPU and main memory was optimized for the Alpha 21164 chip and the SDRAM chip. To keep the design simple, only off-the-shelf data path and address components and reprogrammable control logic devices were placed between the Alpha 21164 and SDRAM
Figure 1