The ADU CPU module consists of a single CPU chip, a 256-kilobyte (KB) secondary cache, and an interface to the system bus. All CPU modules in the system are identical. The CPU modules are not self-sufficient; they must be initialized by the console workstation before the CPU can be enabled.
The CPU module contains extensive test access logic that allows other bus agents to read and write most of the module's internal state. We implemented this logic because we knew these modules would be used to debug CPU chips. Test access logic would help us determine the cause of a CPU chip malfunction and would make it possible for us to introduce errors into the secondary cache to test the error detection and correction capabilities of the CPU chip. This logic was used to perform almost all initialization of the CPU module and was also used to troubleshoot CPU modules after they were fabricated.
The central feature of the CPU module (shown in Figure 5) is the secondary cache, built using 16K by 4 BiCMOS static RAMs. Each of the 16K half blocks in the data store is 156 bits wide (4 longwords of data, each protected by 7 ECC bits). Each of the 8K entries in the tag store is an 18-bit address (protected by parity) and a 3-bit control field (valid/shared/dirty, also protected by parity). In addition, a secondary cache duplicate tag store, consisting of an 18-bit address and a valid bit (protected by parity), is used as a hint to speed processing of reads and writes encountered on the system bus. Finally, a CPU chip data cache duplicate tag store (protected by parity) functions as an invalidation filter and selects between update and invalidation strategies.
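As a rough consistency check of the cache geometry quoted above, the figures can be recomputed in a few lines. This is only a sketch: the function names are illustrative, the RAM count assumes that each 4-bit slice of a half block comes from one 16K by 4 RAM, and the check-bit calculation uses the standard Hamming bound for a single-error-correcting, double-error-detecting code rather than any detail of the ECC actually used by the CPU chip.

```python
# Figures quoted in the text.
HALF_BLOCKS = 16 * 1024          # entries in the data store
BLOCKS = 8 * 1024                # entries in the tag store
LONGWORDS_PER_HALF_BLOCK = 4
LONGWORD_BITS = 32               # an Alpha longword is 32 bits
ECC_BITS_PER_LONGWORD = 7
RAM_WIDTH_BITS = 4               # 16K x 4 static RAMs


def half_block_width_bits():
    """Width of one data store entry, data plus ECC."""
    return LONGWORDS_PER_HALF_BLOCK * (LONGWORD_BITS + ECC_BITS_PER_LONGWORD)


def data_capacity_bytes():
    """Data capacity of the cache, excluding ECC bits."""
    return HALF_BLOCKS * LONGWORDS_PER_HALF_BLOCK * LONGWORD_BITS // 8


def data_store_ram_count():
    """Assumption: one 16K x 4 RAM per 4-bit slice of a half block."""
    return half_block_width_bits() // RAM_WIDTH_BITS


def sec_ded_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


if __name__ == "__main__":
    print(half_block_width_bits())            # 156
    print(data_capacity_bytes() // 1024)      # 256
    print(data_store_ram_count())             # 39
    print(sec_ded_check_bits(LONGWORD_BITS))  # 7
```

The 8K tag entries and 16K half blocks are consistent with a block made of two half blocks, that is, 32 bytes per tag entry.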
The system bus interface watches for reads and writes on the bus, and looks up each address in the secondary cache. On read hits, it asserts B-shared on the bus, and, if the block is dirty in the secondary cache, it asserts B-dirty and supplies read data to the bus. On write hits, it selects between the invalidate and update strategies, modifies the control field in the secondary cache tag store appropriately, and, if the update strategy is selected, it accepts data from the system bus.
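The snooping behavior described above can be sketched as a small decision function. The class and function names are hypothetical, and the text does not spell out how the control field changes on a write hit; the sketch assumes that invalidation clears the valid bit and that an update leaves the entry valid and marked shared.

```python
from dataclasses import dataclass


@dataclass
class TagEntry:
    """Control field of one secondary cache tag store entry."""
    valid: bool = False
    shared: bool = False
    dirty: bool = False


@dataclass
class SnoopResponse:
    b_shared: bool = False      # B-shared asserted on the bus
    b_dirty: bool = False       # B-dirty asserted on the bus
    supply_data: bool = False   # this module drives read data onto the bus
    accept_data: bool = False   # this module takes write data from the bus


def snoop(entry, is_write, use_update):
    """Respond to a bus read or write whose address maps to `entry`."""
    resp = SnoopResponse()
    if not entry.valid:
        return resp                      # miss: nothing to do
    if not is_write:
        resp.b_shared = True             # read hit
        if entry.dirty:
            resp.b_dirty = True
            resp.supply_data = True
        return resp
    if use_update:
        resp.accept_data = True          # write hit, update strategy
        entry.shared = True              # assumption, not stated in the text
    else:
        entry.valid = False              # write hit, invalidate strategy (assumption)
    return resp
```

For example, `snoop(TagEntry(valid=True, dirty=True), is_write=False, use_update=False)` asserts both bus signals and supplies the data, while a write hit with `use_update=False` simply drops the block.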
Unlike most bus devices, the CPU module's system bus interface must accept a new address every five cycles. To do this, it is implemented as two independent finite state machines connected together in a pipelined fashion.
The tag state machine, which operates during bus cycles 1 through 5, watches for addresses, performs all tag store reads (in bus cycle 4, just in time to assert B-shared and B-dirty in bus cycle 5), and performs any needed tag store writes (in bus cycle 5). If the tag state machine determines that bus data must be supplied or accepted, it enables the data state machine, and, at the same time, begins processing the next bus request.
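The overlap between the two state machines can be illustrated with a short calculation. This is a sketch only: the names are hypothetical, the tag machine is taken to cover cycles 1 through 5 as stated above, and the data machine is assumed to cover the following five cycles (6 through 10) of the same request.

```python
ADDRESS_INTERVAL = 5           # a new address may arrive every five bus cycles
TAG_CYCLES = range(1, 6)       # tag state machine: cycles 1 through 5 of a request
DATA_CYCLES = range(6, 11)     # data state machine: cycles 6 through 10 (assumed)


def busy_cycles(request_index, phase):
    """Absolute bus cycles in which `phase` of the given request is active."""
    offset = request_index * ADDRESS_INTERVAL
    return {offset + cycle for cycle in phase}


if __name__ == "__main__":
    # The data phase of request 0 runs alongside the tag phase of request 1.
    print(sorted(busy_cycles(0, DATA_CYCLES)))
    print(sorted(busy_cycles(1, TAG_CYCLES)))
```

Because each machine is busy for exactly five cycles per request, neither machine ever sees two requests at once, which is what lets the interface keep up with back-to-back addresses.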
The data state machine, which operates during bus cycles 6 through 10, moves data to and from the bus and handles the reading and writing of the secondary cache data store. The highly pipelined nature of the system bus makes reading and writing the data store somewhat tricky. Figure 6a shows a write hit that has selected the update strategy immediately followed by a read hit that must supply data to the bus. High performance mandates the use of clocked transceivers, which means the secondary cache data store must read one cycle ahead of the bus and must write one cycle behind the bus, resulting in a conflict in bus cycle 11. However, the bus transfers data in a fixed order, so the read will always access quadword 0 of the block, and the write will always access quadword 3 of the block. By implementing the data store as two 64-bit-wide banks, it is possible to handle these back-to-back transactions without creating any special cases, as shown in Figure 6b. This example is typical of the style of design used in the ADU, which eliminates extra mechanisms wherever possible.
Figure 5 CPU Module
The CPU interface handles the arbitration for the secondary cache and generates the necessary reads and writes on the system bus when the CPU secondary cache misses.
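The back-to-back write and read case discussed above can be reproduced with a small model. It is a sketch under stated assumptions: the names are hypothetical, the four quadwords of a block are assumed to cross the bus in cycles 7 through 10 of each transaction (which matches the W10 and R7 labels in the Figure 6 caption), and the read is assumed to start five cycles after the write.

```python
ADDRESS_INTERVAL = 5       # the read follows the write by five bus cycles
FIRST_DATA_CYCLE = 7       # assumption: quadwords 0..3 are on the bus in cycles 7..10
QUADWORDS_PER_BLOCK = 4


def cache_accesses(start, is_write):
    """Map each absolute bus cycle to the quadword the data store handles in it."""
    skew = 1 if is_write else -1   # write one cycle behind the bus, read one ahead
    return {
        start + FIRST_DATA_CYCLE + q + skew: q
        for q in range(QUADWORDS_PER_BLOCK)
    }


def conflicts(write_start, read_start, banks):
    """Cycles in which the write and the read need the same bank."""
    writes = cache_accesses(write_start, is_write=True)
    reads = cache_accesses(read_start, is_write=False)
    shared_cycles = sorted(writes.keys() & reads.keys())
    return [c for c in shared_cycles if writes[c] % banks == reads[c] % banks]


if __name__ == "__main__":
    print(conflicts(0, ADDRESS_INTERVAL, banks=1))   # [11]
    print(conflicts(0, ADDRESS_INTERVAL, banks=2))   # []
```

With a single bank, the write of quadword 3 and the read of quadword 0 collide in cycle 11. With even and odd banks, quadword 3 goes to the odd bank and quadword 0 to the even bank, so the collision disappears without any special-case logic.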
The CPU chip is supplied with a clock that is not related to the system clock in frequency or phase. This factor made it easier to use both the 100-MHz frequency of the DC227 prototype chip and the 200-MHz frequency of the DECchip 21064 CPU. It also allowed us to vary the operating frequency during CPU chip debugging. However, the data buses connecting the CPU chip to the rest of the CPU module must cross a clock-domain boundary. Perhaps more significant, the secondary cache tag and data stores have two asynchronous sources of control, since the CPU chip contains an integrated secondary cache controller.
Figure 6 (timing diagrams a and b): bus cycles 0 through 15 run across the top, with rows for the write cycle (W0 through W10), the read cycle (R0 through R10), and the cache accesses (a single CACHE row in 6a; CACHE EVEN and CACHE ODD rows in 6b).
The bidirectional data bus of the CPU chip is converted into the unidirectional data buses used by the rest of the CPU module by transparent cutoff latches. These latches, which are located in a ring surrounding the CPU, also convert the quasi-ECL levels generated by the CPU chip into true ECL levels for the rest of the CPU module. These latches are normally held open, so the CPU chip is, in effect, connected directly to the secondary cache tag and data RAMs. Control signals from the CPU chip's integrated secondary cache controller are simply ORed into the appropriate secondary cache RAM drivers.
These latches are also used to pass data across the two-clock-domain boundary. Normally all latches are open. On reads, logic in the CPU chip clock domain closes all the latches and sends a read request into the bus clock domain. Logic in the bus clock domain obtains the data, writes both the secondary cache and the read latches, and sends an acknowledgment back into the CPU chip clock domain. Logic in the CPU chip clock domain accepts the first half-block of the data, opens the first read latch, accepts the second half-line of the data, and opens all remaining latches. Writes are similar. Logic in the CPU chip clock domain writes the first half-line into the write latch, makes the second half-line valid (behind the latch), and sends a write request into the bus clock domain. Logic in the bus clock domain accepts the first half-line of data, opens the write latch, accepts the second half-block of data, and sends an acknowledgment back into the CPU chip clock domain.
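The ordering of the read sequence described above can be captured in a toy model. This is a sketch, not the module's logic: the class and method names are hypothetical, the synchronizers are reduced to two boolean flags, and the write of the fetched data into the secondary cache is left out.

```python
class ReadPath:
    """Hypothetical model of the read latches and the request/acknowledge flags."""

    def __init__(self):
        self.first_latch_open = True     # normally all latches are open
        self.other_latches_open = True
        self.latched = [None, None]      # the two halves of the block held in the latches
        self.request = False             # CPU chip clock domain -> bus clock domain
        self.acknowledge = False         # bus clock domain -> CPU chip clock domain

    def cpu_start_read(self):
        # CPU chip clock domain: close all latches, then send the read request.
        self.first_latch_open = False
        self.other_latches_open = False
        self.request = True

    def bus_service_read(self, first_half, second_half):
        # Bus clock domain: obtain the data, fill the read latches, acknowledge.
        if not self.request:
            raise RuntimeError("no read request pending")
        self.latched = [first_half, second_half]
        self.request = False
        self.acknowledge = True

    def cpu_finish_read(self):
        # CPU chip clock domain: take the first half, open the first latch,
        # take the second half, then open all remaining latches.
        if not self.acknowledge:
            raise RuntimeError("data not yet acknowledged")
        first = self.latched[0]
        self.first_latch_open = True
        second = self.latched[1]
        self.other_latches_open = True
        self.acknowledge = False
        return first, second
```

The point of the model is that data only moves while the receiving side knows the latches are stable: the CPU side closes them before asking, and reopens them only after the acknowledgment arrives.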
Logic in the CPU chip clock domain controls all latches. Only two signals pass through synchronizers: a single request signal passes from the CPU chip clock domain to the bus clock domain, and a single acknowledge signal passes from the bus clock domain to the CPU chip clock domain.
Figure 6 CPU Timing
Figure 6a shows a conflict for access to the secondary cache RAMs caused by back-to-back cycles. In the marked cycle, the cache writes the bus data that arrived in cycle W10, but it also needs to read data to supply it during cycle R7.
Figure 6b shows how this conflict can be resolved by treating the cache as two independent banks (even and odd).
The secondary cache arbitration scheme is unconventional because the system bus has no stall mechanism. If a read or a write appears on the system bus, the bus interface must have unconditional access to the secondary cache; it cannot wait for the CPU to finish its current cycle. In fact, the bus interface cannot detect if a cycle is in progress in the CPU chip's integrated cache controller.
Nevertheless, all events in the system bus interface occur at fixed times with respect to bus arbitration cycles. As a result, the system bus interface can supply a busy signal to the CPU interface, which allows it to predict the bus interface's use of the secondary cache in the immediate future. The CPU interface, therefore, waits until the secondary cache can be accessed without conflict and then performs its cycle without additional checking. This waiting is performed by the CPU chip's integrated secondary cache controller for some cycles, and by logic in the CPU interface running in the bus clock domain for other cycles. To reduce latency, the CPU reads the secondary cache while waiting, and ignores the data if it is not yet valid.
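The idea of arbitrating by prediction can be sketched as a search for the first window in which the bus interface is known not to need the cache. The function name and the representation of the busy signal as a set of future cycle numbers are assumptions made for illustration; the text only states that the busy signal lets the CPU interface predict the bus interface's near-term use of the cache.

```python
def first_conflict_free_cycle(busy_cycles, now, length):
    """Return the earliest cycle >= now at which a CPU access of `length`
    cycles fits without touching any cycle the bus interface has claimed."""
    start = now
    while any((start + i) in busy_cycles for i in range(length)):
        start += 1
    return start


if __name__ == "__main__":
    # Hypothetical schedule: the bus interface will use the cache in cycles 8..11.
    claimed = {8, 9, 10, 11}
    print(first_conflict_free_cycle(claimed, now=6, length=3))   # 12
    print(first_conflict_free_cycle(claimed, now=6, length=2))   # 6
```

Once the start cycle is chosen, the access proceeds with no further checks, which is why the bus interface never needs a stall mechanism.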
All operations use ownership of the system bus as an interlock. For example, if the CPU writes to a location in the secondary cache that is marked as shared, the CPU interface acquires the system bus, and then updates the secondary cache at the same time as it broadcasts the write. This does not eliminate all race conditions; in particular, it allows a dirty secondary cache block to be invalidated by a system bus write while the CPU interface is waiting to acquire the bus to write the block to memory. This is easily handled, however, by having the CPU interface generate a signal (always_update) that insists that the system bus interface select the update strategy.
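The effect of always_update on the race described above can be shown with a minimal model. The function, the dictionary fields, and the whole-block data replacement are illustrative assumptions; the only behavior taken from the text is that always_update forces the update strategy regardless of what the bus interface would otherwise choose.

```python
def bus_write_hit(block, new_data, filter_choice, always_update):
    """Apply a system bus write that hits `block` in the secondary cache.

    `filter_choice` is the strategy the bus interface would pick on its own
    ("update" or "invalidate"); `always_update` overrides it.
    """
    strategy = "update" if always_update else filter_choice
    if strategy == "update":
        block["data"] = new_data     # simplification: the write replaces the data
    else:
        block["valid"] = False       # the block, dirty or not, is dropped
    return strategy


if __name__ == "__main__":
    # A dirty block that the CPU interface is still waiting to write to memory.
    pending = {"valid": True, "dirty": True, "data": "old"}
    bus_write_hit(pending, "new", filter_choice="invalidate", always_update=True)
    print(pending)   # the block is still valid and now holds the written data
```

Without the override, the same call would clear the valid bit while the write-back was still pending, which is the race the signal exists to prevent.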
The combination of arbitration by predicting future events and the use of the system bus as an interlock makes the CPU module's control logic extremely simple. The bus interface and the CPU interface have no knowledge of one another beyond the busy and always_update signals. Since no complicated interactions between the CPU and the bus exist, no time-consuming simulations of the interactions needed to be performed, and we had none of the difficult-to-track-down bugs that are usually associated with multiprocessor systems.
The CPU module contains a number of control registers. The bus cycles that read and write these registers are processed by the system bus interface as ordinary, but somewhat degenerate, cases. The local CPU accesses its local registers over the system bus, using ordinary system bus reads and writes, so no special logic is needed to resolve race conditions.
Digital Technical Journal Vol. 4 No. 4 Special Issue 1992
The Alpha Demonstration Unit
To keep pace with our schedule, we arranged for most of the system to be debugged before the CPU chip arrived. By using a suitably wired integrated circuit test clip, we could place commands onto the CPU chip's command bus and verify the control signals with an oscilloscope. The results of these tests left us fairly confident that the system worked before the first chip arrived.
We resumed testing the CPU module after the CPU chip was installed. We placed short (three to five instructions) programs into main memory, enabled the CPU chip for a short time, then inspected the secondary cache (using the CPU module's test access logic) to examine the results.
Eventually we connected an external pulse generator to the CPU chip's clock and an external power supply to the CPU chip. These modifications permitted us to vary both the operating frequency and the operating voltage of the CPU chip. By using a pulse generator and a power supply that could be remotely controlled by another computer, we were able to write simple programs that could run CPU chip diagnostics, without manual intervention, over a wide range of operating conditions. This greatly simplified the task of collecting the raw data needed by the chip designers to