Pausible Clocking: A First Step Toward Heterogeneous Systems


Kenneth Y. Yun

Ryan P. Donohue

Department of Electrical and Computer Engineering

University of California, San Diego

9500 Gilman Drive

La Jolla, CA 92093-0407

[email protected]

Abstract

This paper describes a novel communication scheme, which is guaranteed to be free of synchronization failures, amongst multiple synchronous modules operating independently. In this scheme, communication between every pair of modules is done through an asynchronous FIFO channel; communication between a module and the FIFO is done using request/acknowledge handshaking. Synchronization of handshaking signals to the local module clock is done in an unconventional way [17, 15, 3, 12, 5]: the local clock, built out of a ring oscillator, is paused or stretched, if necessary, to ensure that the handshaking signal satisfies setup and hold time constraints with respect to the local clock. We constructed a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220MHz (according to SPICE simulation); the maximum clock rate is limited by the ring oscillator, not the pausible clocking control. Preliminary test results indicate that the fabricated chips operate correctly as simulated.

1. Introduction

The next generation digital VLSI systems will necessarily be based on system-on-chip concepts, in order to satisfy unrelenting demands for higher performance and also to accommodate smaller packaging and low power requirements for mobile applications. These on-chip systems will consist of multiple independently synchronized domains: some may be clocked domains, such as synchronous processor cores, while others may be clockless (asynchronous) domains, such as peripheral controllers. These chip designs will resemble today’s complex board-level designs. On-chip modules will be held together by interface glue logic which must facilitate high speed communication between synchronous modules operating at different clock frequencies, between synchronous and asynchronous modules, and between asynchronous modules. The key difference between today’s board-level designs and the future system-on-chip designs is the speed at which communications take place.

This research was supported in part by a gift from Intel Corporation.

A first step toward such heterogeneous system design on chip is a reliable high-speed communication scheme among multiple synchronous modules operating independently. We examined a variety of communication schemes that attempt to mitigate synchronization failure without sacrificing communication throughput. They generally fall into one of two categories: (1) brute-force synchronization of communication signals to each module’s free-running clock with an acceptable level of synchronization failure; (2) adjustment of each synchronous module’s local clock, when necessary, to avoid synchronization failure.

The first category includes methods such as the well-known double-latching scheme and a natural extension of the double-latching scheme called pipeline synchronization [16]. These methods reduce the probability of synchronization failure to an acceptable level by repeatedly resynchronizing communication signals with back-to-back latches. These methods are simple and inexpensive to implement, but a major drawback is the latency of communication.
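The residual failure rate of such brute-force synchronizers is commonly estimated with the standard first-order metastability model, MTBF = exp(t_r/τ) / (T_w · f_clk · f_data). The sketch below evaluates that textbook formula; it is not derived in this paper, and every parameter value used here is an illustrative assumption.

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Mean time between synchronization failures, in seconds.

    t_resolve: time allowed for metastability to resolve (s)
    tau:       regeneration time constant of the latch (s)
    t_window:  metastability (aperture) window of the latch (s)
    f_clk:     sampling clock frequency (Hz)
    f_data:    rate of asynchronous input transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)

# Illustrative, assumed values (not taken from the paper).
one_stage = synchronizer_mtbf(4e-9, 0.2e-9, 0.1e-9, 200e6, 50e6)
# A second latch adds roughly one more clock period (5 ns) of resolution time.
two_stage = synchronizer_mtbf(9e-9, 0.2e-9, 0.1e-9, 200e6, 50e6)
```

Under this model each additional latch multiplies the MTBF by exp(T/τ) but also adds one clock period of latency, which is the drawback noted above.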

The methods in the second category [17, 15, 3, 12, 5] generally rely on stopping or stretching each synchronous module’s local clock to guarantee that communication signals never violate setup and hold time constraints with respect to the local clock. Although these methods are robust and do not incur long communication latency, they involve designing a special clocking circuit, unfamiliar to most designers. The simplest example of this scheme one can conceive is a synchronous module communicating with an asynchronous peripheral. In this system, the synchronous module latches the handshaking signals from the asynchronous module by stopping or stretching its own clock, when necessary.

In this paper, we describe a general method of communication between two synchronous modules operating independently, i.e., at different clock frequencies or phases, based on the pausible clocking scheme as shown in figure 1. Synchronous modules communicate with each other via an asynchronous FIFO used as a communication channel. The interfaces between the synchronous modules and the FIFO are pausible clocking control (PCC) circuits, i.e., the handshaking signals from the FIFO are sampled by the pausible clock of each synchronous module. Although self-timed FIFOs have been used for communication between synchronous modules elsewhere [6], they have not been utilized in communication between synchronous modules operating independently at different clock frequencies.

Figure 1: Two synchronous modules communicating via an asynchronous FIFO channel. (a) One way communication; (b) Bidirectional communication.

In order to validate this scheme, we implemented a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220MHz; the upper bound on the local clock rate is due to the ring oscillator, not the pausible clocking control.

The rest of this paper is organized as follows: section 2 reviews the mutual exclusion element and the arbiter, key components used in the PCC circuit; section 3 describes the design and implementation of the PCC unit; section 4 describes several system configurations using this scheme and their limitations; section 5 describes the experimental results; section 6 concludes the paper with some remarks on future system design.

2. Background: Mutual Exclusion and Arbiter

In this section, we briefly review the concepts of the mutual exclusion element and the arbiter. A mutual exclusion element [15, 14] is a circuit (see figure 2a) that allows one request to pass through at a time on a first come first serve basis. When two inputs arrive simultaneously, it selects one to pass through arbitrarily. An arbiter [10, 9, 4, 14, 8, 13] is a circuit that propagates one request at a time (as does the ME) but also acknowledges the requesters with grant signals. The circuit used in our design is shown in figure 2b. The symbol “C” represents a C-element, a self-timed latch which raises its output when both inputs become high, lowers its output when both inputs become low, and keeps the old value if the inputs have different polarities. The detailed behavior of both circuits is explained clearly in many texts and journals, so we will not elaborate here.

Figure 2: (a) CMOS mutual exclusion element circuit; (b) Arbiter circuit using ME.
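The behavior of the two components described above can be captured in a small behavioral model. The following is a minimal sketch, not the transistor-level circuits of figure 2: the class names, the `update` interface, and the `tie_winner` parameter (which stands in for the arbitrary choice on simultaneous arrival) are all assumptions made for illustration.

```python
class CElement:
    """Two-input Muller C-element with a stored output."""
    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both high -> 1, both low -> 0
            self.out = a
        return self.out     # inputs differ: hold the old value


class MutexElement:
    """Idealized mutual exclusion element: at most one grant is high."""
    def __init__(self):
        self.g1 = 0
        self.g2 = 0

    def update(self, r1, r2, tie_winner=1):
        # Release a grant once its request is withdrawn.
        if not r1:
            self.g1 = 0
        if not r2:
            self.g2 = 0
        # Issue a grant only if no grant is currently held.
        if not (self.g1 or self.g2):
            if r1 and r2:
                # Simultaneous arrival: the real circuit picks arbitrarily.
                self.g1, self.g2 = (1, 0) if tie_winner == 1 else (0, 1)
            elif r1:
                self.g1 = 1
            elif r2:
                self.g2 = 1
        return self.g1, self.g2
```

The model is untimed, so it shows only the first come first serve ordering and the persistence of a grant, not the resolution delay of the real circuit.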

However, an interesting characteristic of the ME should be noted [2, 7]. The closer the arrival times of the rising transitions of the two inputs are, the longer it takes for the internal analog difference circuit to resolve the metastability [2, 11]; hence the latency becomes longer. In order to use the ME circuit effectively in our design, we simulated our ME design in 1.2 µm CMOS with SPICE. All PMOS transistors in our design have W/L = 12/2, and all NMOS transistors have W/L = 6/2. The mean latency from input to output versus the difference in input arrival times is shown in figure 3. The rise time of both inputs was set to 1ns for the simulation, which is typical for 1.2 µm technology.
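The trend described above, with latency growing as the input separation shrinks, is often approximated by a first-order regeneration model in which the extra delay grows logarithmically as the separation approaches zero. The sketch below uses that generic model, not the SPICE results of this paper; the function name and all three default parameter values are assumptions chosen only to make the shape visible.

```python
import math

def me_latency(dt, t_nominal=1.0e-9, tau=0.15e-9, t_crit=0.5e-9):
    """First-order estimate of ME input-to-output latency in seconds.

    dt:        difference between the two input arrival times (s)
    t_nominal: latency when the inputs are well separated (assumed)
    tau:       regeneration time constant of the cross-coupled pair (assumed)
    t_crit:    separation below which metastability adds delay (assumed)
    """
    dt = abs(dt)
    if dt >= t_crit:
        return t_nominal
    if dt == 0.0:
        return math.inf  # ideal model: a perfect tie never resolves
    return t_nominal + tau * math.log(t_crit / dt)
```

In a physical circuit noise breaks a perfect tie, so the infinite value at zero separation is a property of the idealized model only.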

3. Design of Pausible Clocking Control

The pausible clocking control is a scheme to avoid synchronization failure by adjusting the local clock. A synchronization failure at the module interface occurs when the arrival times of an external signal transition and a sampling edge of the clock are indistinguishable by the sampling latch at the module boundary. In our scheme, the synchronization failure is circumvented by pausing or stretching the local module clock when necessary.

Figure 3: Mutual exclusion element mean latency versus difference in input arrival times. (Axes: latency in ns versus difference in input arrival times in ns.)

Figure 4: Receiver PCC for one way communication.

A block diagram of the receiver PCC is shown in figure 4. This scheme uses a mutual exclusion element (ME) to force the temporal separation of the sampling edges of the clock and external signal transitions. Because MEs require that requesters competing for shared resources be persistent, the clock input to the ME must be “stretched” when it loses the arbitration. A ring oscillator is used instead of a crystal oscillator in order to be able to adjust the duration of off-phases of the clock. The local module clock, sysclk, is a buffered version of one of the outputs of the ME. It normally has a 50% duty cycle, except when the clock input loses the arbitration, in which case the off-phase of the clock is stretched.

As shown in figure 5, one-way communication, for which the synchronous module is the receiver, is straightforward. A request event from the FIFO is forwarded to the mutual exclusion element (ME) via the asynchronous finite state machine (AFSM).

If rclk is low when R1 rises, then the ME immediately raises G1, which prompts the AFSM to generate an event on SRρ. This event is effectively synchronized to sysclk, i.e., guaranteed not to induce a synchronization failure when sampled by the FSM, under a reasonable timing assumption as described below. Note that rclk may rise before the ME lowers G1, but the ME will not allow sysclk to rise until G1 becomes low.

Figure 5: PCC timing (one way communication).

On the other hand, if rclk is already high when R1 rises, the assertion of G1 is stalled until rclk is lowered. As soon as rclk falls, the ME raises G1 and the AFSM generates an event on SRρ.

Rclk may actually rise at about the same time R1 rises. In such situations (the situations in which the temporal separation of R1+ and rclk+ becomes blurred) the ME simply “tosses a coin” to determine which signal to service first. If rclk+ wins the “coin toss”, then sysclk rises first and G1 remains low until sysclk falls (which happens shortly after rclk falls). On the other hand, if R1+ wins, then the ME raises G1 first and blocks sysclk from rising. In order to prevent sysclk from stalling indefinitely (until the next toggling of the request, Rρ), the AFSM lowers R1 immediately after G1 rises, which in turn causes G1 to fall, allowing sysclk to rise.

The PCC does not differentiate rising edges of Rρ from falling edges: both edges enable R1 to be asserted and G1 to be asserted as a result. In fact, the AFSM effectively performs a two-phase to four-phase conversion from Rρ to R1 and a four-phase to two-phase conversion from G1 to SRρ. This conversion is independent of whether a two-phase or four-phase communication protocol is used between the FIFO and the synchronous module. It is merely done so that both edges of SRρ are synchronized to sysclk.
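The two conversions described above can be illustrated on event sequences. The sketch below is an untimed abstraction under assumed names: `two_to_four_phase` models Rρ to R1 (every transition becomes a rise followed by a fall), and `four_to_two_phase` models G1 to SRρ (every rising event toggles the output).

```python
def two_to_four_phase(levels):
    """Turn each transition of a two-phase signal into a (rise, fall) pulse.

    levels: successive sampled levels of the two-phase request (e.g. R_rho).
    Returns the four-phase event list, one ("+", "-") pair per transition.
    """
    events = []
    for prev, cur in zip(levels, levels[1:]):
        if prev != cur:              # rising or falling edge, treated alike
            events += ["+", "-"]     # return-to-zero pulse on R1
    return events


def four_to_two_phase(events, initial=0):
    """Toggle a two-phase output (e.g. SR_rho) on every rising event."""
    level, out = initial, []
    for e in events:
        if e == "+":                 # only G1+ triggers a transition
            level ^= 1
            out.append(level)
    return out
```

Feeding the levels `[0, 1, 1, 0]` through the first function yields two pulses, and passing those pulses through the second yields the output levels `[1, 0]`, so each input edge reappears as one output edge.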

[Figure 6: signal transitions on Rρ, R1, G1, SRρ, Aρ, and sysclk.]

In order for the synchronous FSM that generates Aρ to recognize the change in SRρ, we need to ensure that SRρ satisfies setup and hold time constraints with respect to sysclk. As illustrated in figure 6, in order to recognize SRρ+ (SRρ−)¹ reliably, the path from G1+ to SRρ+ (SRρ−) must be shorter than the path from G1+ to R1− to G1− to sysclk+ by at least the data setup time for the FSM latches. This is easily satisfied because the G1+ to SRρ+ (SRρ−) delay is a simple generalized C-element delay (transitions on SRρ are directly triggered by G1+), which is much less than the delay from G1+ to R1− to G1− to sysclk+.

Figure 7: PCC asynchronous finite state machine specification and implementation.
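The setup requirement on SRρ stated above can be written as a single inequality. The delay symbols below are shorthand introduced here for illustration and do not appear in the original text; each t with a subscript denotes the delay along the named path segment.

```latex
t_{G_1+ \,\to\, SR_\rho\pm} + t_{\mathrm{setup}}
  \;\le\;
t_{G_1+ \,\to\, R_1-} + t_{R_1- \,\to\, G_1-} + t_{G_1- \,\to\, \mathrm{sysclk}+}
```

The left side is a single generalized C-element delay plus the latch setup time, while the right side passes through the AFSM, the ME, and the clock buffer, which is why the text describes the constraint as easily satisfied.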

The asynchronous finite state machine (see figure 7) is specified in burst-mode [19, 18] and synthesized using the 3D-gC synthesis tool [20]. This burst-mode state machine has two inputs (Rρ, G1) and two outputs (R1, SRρ). In state 0, when Rρ rises, the machine raises R1 and goes to state 1. In state 1, the machine waits for G1 to rise; when it does, the machine lowers R1 and raises SRρ concurrently and goes to state 2. When G1 falls in state 2, the machine transitions to state 3. The machine transitions through states 4 and 5 and back to 0, as Rρ− triggers a sequence of signal transitions ending with G1−.

Figure 8: PCC for bidirectional communication.
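The burst-mode specification of the receiver AFSM described above can be written as a transition table. This is a hedged sketch: states 0 to 2 follow the text directly, while the entries for states 3 to 5 are assumed to mirror them for the falling edge of the request, as the text implies. The signal names `Rp` and `SRp` follow the labels in the simulation traces, and `run_afsm` is a hypothetical helper.

```python
# (state, input event) -> (next state, output events)
# States 0 to 2 follow the text; states 3 to 5 are assumed to mirror them.
AFSM = {
    (0, "Rp+"): (1, ["R1+"]),
    (1, "G1+"): (2, ["R1-", "SRp+"]),
    (2, "G1-"): (3, []),
    (3, "Rp-"): (4, ["R1+"]),
    (4, "G1+"): (5, ["R1-", "SRp-"]),
    (5, "G1-"): (0, []),
}

def run_afsm(inputs, state=0):
    """Feed input events to the machine; return (final state, outputs)."""
    outputs = []
    for event in inputs:
        if (state, event) not in AFSM:
            raise ValueError(f"unexpected {event} in state {state}")
        state, emitted = AFSM[(state, event)]
        outputs.extend(emitted)
    return state, outputs
```

Running one full request cycle, `run_afsm(["Rp+", "G1+", "G1-", "Rp-", "G1+", "G1-"])`, returns to state 0 and emits one R1 pulse and one SRp transition per request edge.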

¹ We use a+ and a− to denote rising and falling transitions of a, respectively.

For bidirectional communication, the synchronous module must interface with two FIFOs as shown in figure 1b. As illustrated in figure 8, two handshaking signals must be synchronized to the local clock: the request from the sending FIFO and the acknowledge from the receiving FIFO. In order to simplify the interface to the ME, our design uses an arbiter to select just one external signal to pass through at a time. Synchronization of the handshaking signals is done in the same way as for one-way communication.

Figure 9: (a) A heterogeneous message-passing multiprocessor using PCCs (ring configuration with synchronous nodes only); (b) A heterogeneous system with a mixture of asynchronous and synchronous modules (heterogeneous ring configuration).

4. System Configurations and Limitations

Using the pausible clocking scheme, it is conceivable that one can construct a heterogeneous multi-processor system with point-to-point links between every pair of nodes. Each link is a bidirectional FIFO as shown in figure 1. However, as fanouts from and fanins to each node increase, the arbiter block becomes larger making the system impractical.

However, we assert that it is possible to construct a ring configuration as shown in figure 9a, similar to the systems proposed in the Scalable Coherent Interface (SCI) specification [1]. In this structure, messages are always transmitted to one side and received from the other side, so that only one level of arbitration is required. A major advantage of our ring configuration over other proposed systems, such as the SCI system, is that it is a truly heterogeneous system with each node operating at its own speed. Another typical system configuration would be a mixture of asynchronous and synchronous modules as shown in figure 9b.
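The scaling argument above can be made concrete by counting the handshaking signals each node must pass through its arbiter. The sketch below assumes, as in the bidirectional PCC, one incoming request and one incoming acknowledge per bidirectional link; the function name and the topology labels are hypothetical.

```python
def handshakes_to_synchronize(num_nodes, topology):
    """Count handshaking signals each node must synchronize to its clock.

    Assumes one incoming request and one incoming acknowledge per
    bidirectional link, as in the bidirectional PCC described earlier.
    """
    if topology == "point-to-point":
        links_per_node = num_nodes - 1   # a link to every other node
        return 2 * links_per_node
    if topology == "ring":
        return 2                         # receive from one side, send to the other
    raise ValueError(f"unknown topology: {topology}")
```

With eight nodes, the point-to-point arrangement needs a 14-input arbitration per node, whereas the ring needs two regardless of the number of nodes.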

Figure 10: One way test configuration.

Systems-on-chip should be designed with as many reusable components as possible. Standard modules, such as CPU cores, should be reused with little or no modification, because these modules are highly optimized for performance and sensitive to timing variation. For the systems proposed in this paper, ideally, the pausible clocking control circuit should simply replace a portion of the system clock generation unit. However, for state-of-the-art microprocessors, the system clock is produced by a phase locked loop (PLL). We cannot adjust the phase of the output of a PLL instantaneously in an analog fashion, as required in our pausible clocking control. Thus a ring oscillator should be used in place of a PLL, but then we lose control of the nominal frequency. Tuning the ring oscillator frequency would require more control pins and hence would be more expensive. (However, the ring oscillator in this case does have the advantage that its frequency drift closely tracks that of the logic components on the chip, e.g., if logic components slow down due to an increase in operating temperature, then so does the ring oscillator.) Furthermore, anything in the clock path designed to generate multiple frequencies and/or to minimize jitter creates a problem for pausible clocking.

5. Experimental Results

We performed extensive SPICE simulations of two PCC modules connected via an asynchronous FIFO as shown in figure 10, after backannotating the layout parasitics into the schematic using Mentor Graphics Accusim. We varied the depth of the FIFO between 1 and 4. The performance of the PCC appears to be independent of the depth of the FIFO. The FIFO is included for the generic purpose of “smoothing” the bursty data transfer between two modules operating at different clock rates, not to enhance the performance of the PCCs.

The first timing trace (figure 11) shows a receiver module operating at 217MHz. The first event on Rρ (a rising transition) is acknowledged normally without pausing sysclk. The second event (a falling transition) causes sysclk to be paused for about 1.8ns.
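To put the reported pause in proportion, the arithmetic below relates it to the nominal clock period. It assumes a 50% duty cycle and that the whole pause extends one off-phase, as the design description suggests; the variable names are introduced here for illustration.

```python
f_sysclk_hz = 217e6          # receiver clock frequency from the simulation
pause_s = 1.8e-9             # reported clock pause

period_s = 1.0 / f_sysclk_hz             # about 4.61 ns
half_period_s = period_s / 2.0           # nominal off-phase, about 2.30 ns
pause_fraction = pause_s / period_s      # about 0.39 of a period
stretched_off_phase_s = half_period_s + pause_s   # about 4.10 ns
```

Under these assumptions the paused cycle is roughly 39% longer than a nominal one, with the off-phase growing from about 2.3ns to about 4.1ns.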

Figure 11: PCC simulation trace illustrating a clock pause (sysclk = 217 MHz). Signals shown versus time: Rp, R1, G1, SRp, Ap, rclk, and sysclk.

The second timing trace (figure 12) shows a simulation result of one-way communication between two modules operating at different clock frequencies. In this simulation, the sender module operates at 135MHz and the receiver at 217MHz. The sender FSM is simply a rising edge-triggered flip-flop followed by an inverter. At the first rising edge of sysclk2 after the system reset signal turns off, the sender FSM generates a request to the FIFO by raising Rs. The FIFO responds immediately by raising As. The sender FSM samples the “synchronized” version of this signal (SAs) at the next rising edge of sysclk2 and lowers Rs. As long as the FIFO responds to the sender’s request signal immediately, there is no jitter (pausing) on sysclk2 because As is synchronized to sysclk2. The receiver FSM is a rising edge-triggered flip-flop. When a request from the sender reaches the receiver through the FIFO, it is acknowledged at the next rising edge of sysclk1. Because the request signal toggles independently of sysclk1, sysclk1 pauses occasionally to synchronize the FIFO request. Because the receiver operates at a higher clock frequency than the sender, the FIFO never fills up, so the sender never slows down in this simulation.
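The observation that the FIFO never fills when the receiver is faster can be checked with a simple queue model. The sketch below is an idealization and not the SPICE test bed: it assumes one enqueue per sender clock cycle and at most one dequeue per receiver clock cycle, and it ignores clock pauses and FIFO latency. The function name is hypothetical.

```python
def peak_fifo_occupancy(f_sender_hz, f_receiver_hz, num_items):
    """Peak number of queued items for an idealized one-way transfer.

    Assumes the sender enqueues one item per sender clock cycle and the
    receiver dequeues at most one item per receiver clock cycle; clock
    pauses and FIFO latency are ignored.
    """
    t_send = 1.0 / f_sender_hz
    t_recv = 1.0 / f_receiver_hz
    # Merge both clocks' edges in time order; at a tie, enqueue first.
    events = [((i + 1) * t_send, 0) for i in range(num_items)]
    n_recv = int(num_items * t_send / t_recv) + 2
    events += [((j + 1) * t_recv, 1) for j in range(n_recv)]
    occupancy = peak = 0
    for _, kind in sorted(events):
        if kind == 0:
            occupancy += 1
            peak = max(peak, occupancy)
        elif occupancy > 0:
            occupancy -= 1
    return peak
```

With a 135MHz sender and a 217MHz receiver the peak occupancy stays at one item, which is consistent with the result being independent of FIFO depth; swapping the two frequencies makes the queue grow steadily instead.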

Figure 12: One way communication simulation trace (sysclk1 = 217 MHz; sysclk2 = 135 MHz). Signals shown versus time: Rp, Ap, sysclk1, Rs, As, and sysclk2.

6. Conclusion

We presented a new communication scheme, which is based on the pausible clocking scheme, for multiple synchronous modules operating independently. In order to prove its feasibility, we constructed a test bed consisting of two synchronous modules with the pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 µm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220MHz (according to SPICE simulation); the maximum clock rate is limited by the ring oscillator, not the pausible clocking control. At the time of publication, preliminary test results indicate that the fabricated chips operate correctly as simulated.

In the future, we plan to investigate a larger system, a heterogeneous ring configuration with a mixture of synchronous and asynchronous modules. In addition, we will investigate a new oscillator design (other than simple ring oscillator designs).

Acknowledgment

The authors would like to thank Charles Dike of Intel Corporation for pointing out real-world problems associated with implementing pausible clocking control for microprocessor cores.

References

[1] IEEE Standard 1596-1992. Scalable coherent interface (SCI).

[2] T. J. Chaney and C. E. Molnar. Anomalous behavior of synchronizer and arbiter circuits. IEEE Transactions on Computers, C-22(4):421–422, April 1973.

[3] Daniel M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, October 1984.

[4] P. Corsini. Speed-independent asynchronous arbiter. IEE Journal on Computers and Digital Techniques, 2(5):221–222, October 1979.

[5] G. Gopalakrishnan and L. Josephson. Towards amalgamating the synchronous and asynchronous styles. In TAU-93.

[6] Mark R. Greenstreet. Implementing a STARI chip. In Proc. International Conf. Computer Design (ICCD), pages 38–43. IEEE Computer Society Press, October 1995.

[7] Lindsay Kleeman. Service and Metastability Performance of Arbiters. PhD thesis, Dept. of Electrical and Computer Eng., Univ. of Newcastle, Australia, August 1986.

[8] Alain J. Martin. On Seitz’s arbiter. Technical Report 5212:TR:86, Caltech Computer Science, 1986.

[9] R. C. Pearce, J. A. Field, and W. D. Little. Asynchronous arbiter module. IEEE Transactions on Computers, 24:931–932, September 1975.

[10] W. W. Plummer. Asynchronous arbiters. IEEE Transactions on Computers, 21(1):37–42, January 1972.

[11] Fred U. Rosenberger and Charles E. Molnar. Comments on ‘metastability of CMOS latch/flip-flop’. IEEE Journal of Solid-State Circuits, 27(1):128–130, January 1992. Reply by Robert W. Dutton pages 131–132 of same issue.

[12] Fred U. Rosenberger, Charles E. Molnar, Thomas J. Chaney, and Ting-Pien Fang. Q-modules: Internally clocked delay-insensitive modules. IEEE Transactions on Computers, C-37(9):1005–1018, September 1988.

[13] T. Sakurai. Optimization of CMOS arbiter and synchronizer circuits with submicron MOSFETs. IEEE Journal of Solid-State Circuits, 23(4):901–906, August 1988.

[14] Charles L. Seitz. Ideas about arbiters. Lambda, 1(1, First Quarter):10–14, 1980.

[15] Charles L. Seitz. System timing. In Carver A. Mead and Lynn A. Conway, editors, Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.

[16] Jakov N. Seizovic. Pipeline synchronization. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 87–96, November 1994.

[17] M. J. Stucki and J. R. Cox Jr. Synchronization strategies. In Charles L. Seitz, editor, Proceedings of the First Caltech Conference on Very Large Scale Integration, pages 375–393, 1979.

[18] K. Y. Yun. Synthesis of Asynchronous Controllers for Heterogeneous Systems. PhD thesis, Stanford University, August 1994. Technical Report CSL-TR-94-644.

[19] K. Y. Yun, D. L. Dill, and S. M. Nowick. Synthesis of 3D asynchronous state machines. In Proc. International Conf. Computer Design (ICCD), pages 346–350. IEEE Computer Society Press, October 1992.

[20] Kenneth Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements. To appear in EURODAC-96.