DESIGN OF CRITICAL DATA PATH USING ASYNCHRONOUS DOMINO LOGIC PIPELINE

(1)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 315

DESIGN OF CRITICAL DATA PATH USING ASYNCHRONOUS DOMINO LOGIC PIPELINE

1S.Immanuel Prabaharan, ²A.Samsu Nighar, ³R.Sarathkumar Kithiyon, ⁴C.Karuppasami

1,2Assistant Professor, ^3,4Student, Department of Electronics and Communication Engineering, Chandy College of Engineering,Tuticorin, Tamil Nadu, India

1[email protected]²[email protected]³[email protected]⁴[email protected]

Abstract- This paper presents a high-throughput and ultralow- power asynchronous domino logic pipeline design method, targeting to latch-free and extremely fine-grain or gate-level design. The data paths are composed of a mixture of dual-rail and single-rail domino gates. Dual-rail domino gates are limited to construct a stable critical data path. Based on this critical data path, the handshake circuits are greatly simplified, which offers the pipeline high throughput as well as low power consumption.

Moreover, the stable critical data path enables the adoption of single-rail domino gates in the noncritical data paths. This further saves a lot of power by reducing the overhead of logic circuits. An 8 x 8 array style multiplier is used for evaluating the proposed pipeline method. Compared with a bundled-data asynchronous domino logic pipeline, the proposed pipeline, respectively, saves up to 60.2% and 24.5% of energy in the best case and the worst case when processing different data patterns.

Index Terms— Asynchronous pipeline, critical data path, dual- rail domino gate, single-rail domino gate.

I. INTRODUCTION

During the last decade, there has been a revival inresearch on asynchronous technology. Along with thecontinued CMOS technology scaling, VLSI systems becomemore and more complex. The physical design issues, such asglobal clock tree synthesis and top-level timing optimization,become serious problems. Even if technology scaling offersmore integrationpossibilities, modularity and scalability aredifficult to be realized at the physical level. Asynchronousdesign is considered as a promising solution for dealing withthese issues that relate to the global clock, because it uses localhandshake instead of externally supplied global clock [1]–[4].The attractive properties are listed as follows:

1) Low power consumption 2) High operating speed

3) Noclock distribution and clock skew problem 4) Better composability and modularity

5) Less emission of electromagnetic noise

6) Robustness toward variations in supply voltage, temperature, and fabrication process parameters.

In asynchronous design, the choice of handshake proto-cols affects the circuit implementation (area, speed, power,robustness, etc.). The four-phase bundled-data protocol andthe four-phase dual-rail protocol are two popular protocolsthat are used in most practical asynchronous circuits. The four- phase bundled-data protocol design most closely resembles thedesign of synchronous circuits.

Handshake circuits generatelocal clock pulses and use delay matching to indicate validsignal. It normally leads to the most efficient circuits due tothe extensive use of timing assumptions. On the other hand,the four-phase dual-rail protocol design is implemented in anelaborate way that the handshake signal is combinedwith thedual-rail encoding of data. Handshake circuits are aware of thearrival of valid data by detecting the encoded handshake signal,which allows correct operation in the presence of arbitrarydata path delays. This feature is very useful for dealing withdata path delay variations in advanced VLSI systems, such asasynchronous field-programmable gate arrays (FPGAs) [5]–[7] and system-on-chip [3], [4].

However, such attractive feature is realized at the expense of encoding and detection overheads.These overheads cause low circuit efficiency and restrict theapplication area of the four-phase dual-rail protocol design.

This paper presents a novel design method of asynchronousdomino logic pipeline, which focuses on improving the circuitefficiency and making asynchronous domino logic pipelinedesign more practical for a wide range of applications. Thenovel design method combinesthe benefits of the four- phasedual-rail protocol and the four-phase bundled- data protocol,which achieves an area-efficient and

(2)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 316 ultralow-power asynchronous domino logic

pipeline.

Asynchronous domino logic pipeline is an interestingpipeline style that can entirely avoid explicit storage elementsbetween stages by exploiting the implicit latching functionalityof domino logic gates. The latchless feature provides thebenefits of reduced critical delays, smaller silicon area, andlower power consumption.

However, asynchronous domino logic pipeline has a common problem that dual-rail domino logic has to be used to compose the domino data path. Single-rail domino logic cannot be used because it would break the domino data path since only noninverting logic can be implemented [13].

As aresult, the domino data path has a dual-rail encoding overhead that consumes a lot of silicon area and power consumption. Such overhead almost cancels out the area and power benefits provided by the latchless feature.

Another problem is the overhead of handshake control logic.Conventional designs of asynchronous domino logic pipelinebased on the four-phase dual-rail protocol rely on domino datapath to transfer data and encoded handshake signal, and usecompletion detectors to detect and collect the handshake signalthroughout the entire data paths [8]–[11]. Such design methodis very robust for delay variations in data paths. However, itcauses a serious detection overhead. The detection overheadis growing with the width of data paths, which impedes itsapplication in the design of a large function block with aconsiderable data path width. On the other hand, asynchronousdomino logic pipeline based on the four-phase bundled- dataprotocol avoids the detection overhead by implementing asingle extra bundling signal, to match the worst case blockdelay, which serves as a completion signal. The problem is thatthis design method completely loses the good properties in thefour-phase dual-rail protocol design. Besides, it does not solvethe dual-rail encoding overhead problem in data paths [12].

In this paper, our proposed pipeline reduces both thedual-rail encoding overhead in data paths and the detectionoverhead in handshake control logic by designing basedon a constructed critical data path. A stable critical datapath is constructed

using redesigned dual-rail domino gates.By detecting the stable critical data path, a 1-bit completiondetector is enough to get the correct handshake signal regardless of the data path width.

Such design does not only greatlyreduce the detection overhead but also partially maintains thegood properties in the fourphase dual rail protocol design.Moreover, the stable critical data path serves as a matchingdelay to solve the dual-rail encoding overhead problem in datapaths. With the help of the redesigned dual-rail domino gates,single-rail domino logic is successfully applied in noncriticaldata paths. As a result, the proposed asynchronous dominologic pipeline has a small overhead in both handshake controllogic and function block logic, which greatly improves thecircuit efficiency. According to the design feature, we namethe proposed pipeline as asynchronous pipeline based onconstructed critical data path (APCDP).

This paper is organized as follows. Section II introducesthe background of asynchronous domino logic pipeline. PS0is introduced to demonstrate the advantages and problemsof asynchronous domino logic pipeline based on dual- railprotocol. Several related designs are also simply introduced.Section III focuses on the introduction of the proposed pipelinedesign method.

Synchronizing

Fig. 1. Block diagram of PS0

(3)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 317 II. BACKGROUND

PS0 is a well-known implementation style of asynchronousdomino logic pipeline based on dual-rail protocol [8]. It is an important foundation for most later proposed styles. Sinceour proposed pipeline is also based on PS0, we will begin byreviewing PS0 pipeline style, and then simply introducing twoother advanced styles: 1) a timingrobust style called prechargehalf-buffer [9] and 2) a high-throughput style called lookaheadpipeline [11]. Finally, we summary the delay assumptions of these pipelines and give our delay assumption in the proposeddesign.

A. PS0

1)Four-Phase Dual-Rail Protocol:

PS0 is designed basedon the four-phase dual-rail protocol. Fig. 3 shows an exampleof data transfer based on the four-phase dual-rail protocol,and Table I shows the code table of the four-phase dual-railencoding. The four-phase dual- rail encoding encodes a requestsignal into the data signal using two wires,(ῳ_t_ῳ_f). The data value 0 is encoded as (0,1), and value 1 is encoded as (1,0);the spacer is encoded asis(0,0); (1,1) is not used. Whentransferring the valid data, a spacer is inserted between them.A receiver can easily obtain the valid data by monitoring thetwo wires. This protocol is very robust since a sender and areceiver can communicate reliably regardless of delays in thecombinational logic block and wires between them. The dual-rail encoded data path is known as the delay-insensitive datapath.

2) Structure of PS0:

Fig. 1 shows a block diagram of PS0.In PS0, each pipeline stage is composed of a function block and a completion detector. Each function block is implemented using dual-rail domino logic. Each completion detector generates a local handshake signal to control the flow of data through the pipeline. The handshake signal is transferred to the precharge/evaluation control port of the previous pipeline stage

Fig. 2 shows an example of the dual-rail dominoANDgateand the 2-bit completion detector.

A two-inputNORgate servesas the 1-bit completion detector to generate a bit done signalby monitoring the outputs of dual-rail domino gate. To builda 2-bit completion detector, C-element is needed to combinethe bit done signals. A full completion detector is formed bycombining all bit done signals from the entire data paths witha tree of C-elements, as shown in Fig. 1.

3) Protocol of PS0:

The protocol of PS0 is quite simple.F(N) is precharged when F(N+1) finishes evaluation.F(N) evaluates when F(N+1) finishes its reset, or precharge. In Fig. 1, if we observe a single data flow through an initially empty pipeline in which every pipeline stage is in evaluation phase, the complete cycle of events is as follows.

F1 evaluates and data flow to F2.

1) F2 evaluates and data flow to F3. F2’s completion detector detects completion of evaluation and sends aprecharge signal to F1.

2) F1 precharges and F3 evaluates. F3’s completion detector detects completion of evaluation and sends aprecharge signal to F2.

3) F2 precharges. F2’s completion detector detects thecompletion of precharge and sends an evaluation signal(enable signal) to F1. The evaluation signal enables F1to evaluate new data once again.

(4)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 318 There are three evaluations, two completion

detections, andone precharge in the complete cycle for a pipeline stage. Thepipeline cycle timeTcycleis

T_cycle=3t_Eval+2t_CD+tPrech (1)

WheretEvalandtPrechare the evaluation and precharge timesfor each stage andtCDis the delay through each completiondetector.

4) Overhead Problems

There are mainly two overheadproblems that prohibit the widespread use of PS0, the detectionoverhead in handshake control logic and the dual-rail encodingoverhead in function block logic. A ripple carry adder, shownin Fig. 4, is used as an example to clarify these overheadproblems.

Fig. 4. Pipelined 4-bit ripple carry adder.

Fig. 5. Block diagram of PCHB.

The detection overhead is caused by the full completiondetectors that are used to deal with data path delay variationsby detecting the entire data paths. The overhead greatly affectsthe pipeline speed and power consumption. The most seriousproblem is that the detection overhead is growing the width of data paths. In a 4-bit ripple carry adder, the width of datapaths is between 8 and 5 bits. The detection overheads of8–5-bit completion detectors might be acceptable in practicaldesign. However, in 32-bit ripple carry adder design, the widthof data paths is at least 33

bits. The overhead of 33-bitcompletion detector is so large that PS0 is hardly applicablein such situation. Even the detection time can be reduced bypartitioning wide data path into several data streams [11], thedetection power is not reduced.

The dual-rail encoding overhead is caused by dual-raildomino logic that is used for not only implementing logicfunction but also storing data between pipeline stages. Becausethere are no explicit storage elements (latches or registers),a lot of dual-rail domino buffers have to be added to levelizeeach stage. The added dual-rail domino buffers consume alot of silicon area and power. In a 4-bit ripple carry adder,18 dual-rail domino buffer gates are added, which almostcancel out the benefit of removing explicit storage elements.

B. OTHER ADVANCED PIPELINES 1) Precharge Half-Buffer Pipeline:

Fig. 5 shows a blockdiagram of precharge half-buffer pipeline (PCHB). PCHB is atiming- robust pipeline style that uses quasi-delay- insensitivecontrol circuits [9]. Two completion detectors in a PCHBstage: one on the input side (Di)and one on the outputside (D0). The complete cycle of events for a PCHB stage isquite similar to that of PS0, except that a PCHB stage verifies its input bits. Because of the input completion detector (D_i),a PCHB stage does not start evaluation until all input bits arevalid. This design absorbs skew across individual bits in thedata paths. Although this design makes PCHB more timingrobust, it causes a two-time overhead in handshake controllogic compared with PS0. Besides, PCHB has the same dual rail encoding overhead as PS0.

(5)

Fig. 6. Block diagrams of LP2/2. (a)LP2/2 based on dual-rail protocol.(b) LP2/2-SR

2) LP2/2:

LP2/2 is a high-throughput pipeline style [11],which has both dual-rail protocol design and bundled-dataprotocol design. Fig. 6(a) shows the block diagram of LP2/2 based on thedual-rail protocol. LP2/2 improves the throughput of PS0 byoptimizing the sequential of handshake events.

However, theydo not solve the overhead problems in handshake control logicand function block logic.

The handshake speed is acceleratedby employing asymmetric completion detectors placed aheadof function blocks. Although this pipeline structure reducesthe handshake cycle time, the asymmetric completion detectorsstill consume a lot of power since they have to detect the entiredata paths.

Fig. 6(b) shows the block diagram of LP2/2 based onthe bundled-data protocol (LP2/2-SR).

LP2/2-SR avoids thedetection overhead problem by implementing a single extrabundling signal. The bundling signal serves as a completionsignal, which matches the worstcase delay in function blocks.Although such design reduces the power consumption inhandshake control logic, the overhead problem in functionblock logic remains unsolved since dual-rail domino logic stillhas to be used to compose the domino data path [12].

C. DELAY ASSUMPTIONS

PCHB is a very robust pipeline that requires no delayassumptions or calculations by designer.

However, the robustness of circuits comes at the

expense of performance. Thecomplex handshake circuits slow down the handshake speedand consume more power.PS0 is designed for fast handshake speed by seeking todesign control circuits that are always correct for commonconditions. It is based on a delay assumption that each pipelinestage’s predecessor precharges no slower than the stage’ssuccessor evaluates.

Fig. 7. Block diagram of APCDP.

Based on PS0, LP2/2 makes two more aggressive delayassumptions: first, each pipeline stage evaluates no slower thanits completion detects plus the stage’s successor precharges.Second, each pipeline stage completion detects plus its predecessor precharges no slower than the stage evaluates plus itssuccessor completion detects.

Although these delay assumptions improve the pipeline throughput, they sacrifice thepipeline robustness.Our proposed pipeline is based on PS0, but makes adifferent delay assumptionfrom LP2/2.

We assume that, inthe evaluation of domino gates, an-stack pull-down pathcauses larger delay than am-stack pull-down path (nis largerthanm). Based on this delay assumption, we construct a stablecritical data path for improving the circuit efficiency in bothhandshake control logic and function block logic. Because thisdelay assumption is made on gate level instead of handshakestructure level, our proposed pipeline has, structurally, thesame robustness as PS0.

III. ASYNCHRONOUS PIPELINE BASED ON CONSTRUCTED CRITICAL DATA PATH A. OVERVIEW

Fig. 7 shows the block diagram of the proposed asynchronous pipeline (APCDP). The pipeline is designed based on astable critical data path that is constructed using a special dual rail logic. The critical data path transfers a data signal

(6)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 320 and anencoded handshake signal. Noncritical data

paths, composedof single-rail logic, only transfer data signal. A staticNORgate detects the dual-rail critical data path and generates atotal done signal for each pipeline stage. The outputs ofNORgates are connected to the precharge ports of their previousstages

APCDP has the same protocol as PS0. The differenceis that a total done signal is generated by detecting onlythe critical data path instead of the entire data paths. Suchdesign method has two merits. First, the completion detectoris simplified to a singleNORgate, and the detection overheadis not growing with the data path width. Second, the overheadof function block logic is reduced by applying single-raillogic in noncritical data paths.

As a result, APDCP has asmall overhead in both handshake control logic and functionblock logic, which greatly improves the throughput and powerconsumption.

APCDP is more familiar to bundled-data asynchronouspipeline because the critical data path essentially works as a matching delay, which controls the correct data transfer in noncritical data paths.

TABL E II STATES OF PULL–

DOWNTRANSISTORPATHSONDIFFERENTDATAPATTERNS

Compared with conventional bundled data design using a separate matching delay to match theworst case delay in function blocks, the proposed designmethod reuses the existing function block logic to provide thematching delay. Such design has many advantages. First, thematching delay is accurate. The matching delay in APCDP isexactly the same worst case delay in function blocks. Second,the matching delay is robust for delay variations. Dual-railcritical data path supplies delay-insensitive property, which canbe self- adaptive to delay variations in function blocks.

Third,the handshake control logic is efficient in

area and power. Thehandshake control logic is implemented by reusing the existingfunction block logic.

Finding a stable critical data path in function blocks isvery important in the proposed design. The problem is thatit is difficult to get a stable critical data path using traditionallogic gates. Traditional logic gates have the gate-delay data- dependence problem the gatedelay is dependent on inputdata patterns. For example, the ripple carry adder in Fig.

4.The ripple carry path seems to be the stable critical data path.However, actually, the critical data path varies according todifferent input data patterns.

Because of the gate-delay data dependence problem, the carry function gate can be triggeredearly by the input bits (anandbn) regardless of the carry bit.Since the input bit travels faster in the buffer path than thecarry bit in the ripple carry path, it cannot guarantee that thecritical transition signal always presents on the ripple carrypath.

Adding delay elements is an intuitive way to construct astable critical data path. However, this method needs complextiming analysis and would cause huge overhead of delayelements. This paper introducesan efficient solution that usesSLGs to construct the critical data path. The SLGs solvethe gate-delay data-dependence problem by making sure thatSLGs cannot start evaluation until all valid data arrive [14].This feature does not only help to construct a stable criticaldata path but also enable the adoption of single-rail dominologic in the noncritical data paths. As a result, the proposeddesign is significantly area and power efficient.

B. LOGIC GATES

VLSI circuits, it is difficult to get a stable critical data path using traditional logic gates due to the gate-delay data dependence problem. Fig. 2(a) shows a traditional dual rail domino AND gate. The true side of logic is implemented by out_t=a_t·b_tand the false side by out_f=a_f+b_f.

(7)

Fig. 8. Synchronizing AND gate and the truth table of dual-rail AND logic.

Fig. 9. SynchronizingANDgate with a latch function and the table of latchstates.

Table II shows the states of pull-down transistorpaths on different data patterns. In traditional dual-rail dominoANDgate, there are three transistor paths: 1)[a_t,b_t];2)[a_f]; and 3)[b_f].First of all, these paths have differentnumber of transistors at the sequential

position. When theyturn ON,

respectively,[a_f]and[b_f]cause less delays than[a_t,b_t]. Moreover, when the data pattern is(0,1,0,1),[a_f]and[b_f]will be bothON, which leads to a muchquicker signal transfer.As a result, the gate delay has alarge variation depending on different data patterns. To solvethe gate-delay data- dependence problem, SLG and SLGL areintroduced [14].

1) Synchronizing Logic Gates:

SLGs are dual-rail dominogates that have no gate- delay data-dependence problem. Fig. 8shows the synchronizingANDgate and the truth table of dual railANDlogic. The principle is that, in the pull- down network,there is exactly one path activated according to one datapattern, and the stack of all possible paths is kept constant atthe sequential

position. Compared with the traditional design,the false side logic expression is changed to out_f=a_t·b_f+a_f·(b_t+b_f). Table II shows that there are fourtransistor paths:

1)[a_t,b_t];2)[a_t,b_f];3)[a_f,b_t];and[a_f,b_f].Ever y path has two transistors at the sequentialposition, and there is only one path turnsONcorrespondingto an input data pattern. As a result, the gate delay becomesindependent on different data patterns.

This kind of gates isnamed as SLGs because they can synchronize their inputs.The SLGs verify that all data signal transitions have arrivedon their inputs before changing their outputs.

The characteristics of SLGs are listed as follows.

1) An SLG has a certain number, inputs’

number, of transistors in pull-down transistor paths at the sequential position.

2) An SLG has no gate-delay data-dependence problem. Its gate delay mainly relates to the inputs number.

3) An SLG can synchronize its inputs. The SLG cannot start evaluation until all valid data arrive.

2) Synchronizing Logic Gates With a Latch Function:

Basedon the characteristics of SLGs, SLGLs are extended. Fig. 9shows synchronizingANDgate with a latch function and thetable of latch states. An SLGL has an enable port(en_t,en_f),which controls the opaque and transparent state of the SLGL.The principle is that SLGLs cannot start evaluation withoutthe presence of the enable signal.

Same as the dual-railANDlogic, all traditional dual-raildomino logic can be redesigned to become an SLG or anSLGL. The critical data path in dual- rail asynchronous pipelinecan be easily constructed using SLGs and SLGLs.

C. STRUCTURE OF APCDP

Fig. 10 shows the structure of APCDP. The solid arrowrepresents a constructed critical data path (dual-rail data path),the dotted arrow represents the noncritical data paths (single-rail data paths), and the dashed arrow represents the output ofsingle-rail to dual-rail encoding converter.

In each pipeline stage, a staticNORgate is used as1-bit completion detector to generate a total done signal forthe entire data paths by detecting the

(8)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 322 constructed criticaldata path. Driving buffers

deliver each total done signal to the precharge/evaluation control port of the previous stage.

Since the completion detector only detects the constructedcritical data path, the noncritical data paths do not have to

transfer encoded handshake signal anymore.

Therefore, single rail domino gates are used in the noncritical data path tosave logic overhead.

Encoding converter is used to bridgethe connection between single-rail domino gate and dual raildomino gate.

1) Construction of the Critical Data Path:It is difficult toconstruct a stable critical data path using traditional logic gatesfor their gate-delay data-dependence problem. The criticalsignal transition varies from one data path to others accordingto different input data patterns. Since

SLGs have solved thegate-delay data-dependence problem, a stable critical data pathcan be easily constructed by the following steps:

Fig. 10. Structure of APCDP 1) finding a gate (named as Lin gate) that has

the largest number of inputs in each pipeline stage;

2) changing these Lin gates to SLGs;

3) linking SLGs together to form a stable critical data path.

The basic idea of finding the critical signal transition is thatembedding an SLG in each pipeline stage and making the SLGto be the last gate to start and finish evaluation. First of all, theembedded SLG has the largest gate delay in a pipeline stage.The reasons are as follows.

1) The SLG has the largest stack in the pull- down network compared with other gates.

2) The SLG has only one pull-down transistor path activated for each input data pattern.

Then, if all gates evaluate at the same time or the SLG is thelast gate to start evaluation in the pipeline stage, the criticalsignal transition would present on the output of the SLG.

In practice, making all gates evaluate at the same time isdifficult, especially without the help of intermediate latches orregisters. Therefore, we make the SLG become the last gate tostart evaluation by linking each pipeline stage’s SLG together.In the first pipeline stage, the critical signal transition is onthe output of the SLG because all

gates evaluate at the sametime for the input control of latches or registers. After linkingeach pipeline stage’s SLG together, the SLG in the followingpipeline stage would be the last gate to start evaluation since italways waits for the critical signal transition from the previousSLG. As a result, the linked SLG data path becomes a stablecritical data path.

Linking each pipeline stage’s SLG together is partially donein the process of selecting Lingate in each pipeline stage.When searching L_ingate, there might be more than one option.It is best to select the Lingate that is originally linked to theLingate in the following pipeline stage. After changing theseLingates to SLGs, SLGs are naturally linked.

For example,the linkage between Stage1 and Stage2 in Fig.10. However, ifwe cannot find the linked L_ingates in neighbour stages, SLGLneeds to be used to solve the linking problem. The linkagebetween Stage2 and Stage3 is in such situation. The linkageis established by connecting the output of SLG in Stage2 andthe enable port of SLGL in Stage3.

2) Encoding Conversion:Since the completion detectordetects only the constructed critical data path, the noncriticaldata paths do not have to transfer encoded handshake signalanymore. The

(9)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 323 logic overhead in the noncritical data paths canbe

reduced using single-rail domino gates instead of dual-raildomino gates. However, single-rail domino gate and dual-raildomino gate use different encoding schemes. It has encodingcompatibility problem when a single-rail domino gate connectsto a dual-rail domino gate. Encoding converter needs to bedesigned to solve the problem.

Fig. 11.Encoding converters. (a) Intuitive design.

(b) Proposed design.

TABLE III

TRUTH TABLE OF ENCODING CONVERTER

Fig. 11 shows two implementations of encoding converter.Table III shows the truth table. In precharge phase𝑃𝐶 = 0, encoding converter outputs a dual-rail data0 (out,𝑜𝑢𝑡 ) = 0,1 .In evaluation phase𝑃𝐶 = 1, if the input is a single-rail data0in=0, the converter keeps the dual-rail data0.

If the input is a single rail data1 in=1, the converter outputs a dual-rail data1(𝑜𝑢𝑡, 𝑜𝑢𝑡 ) = 1,0 .Since single-rail encoding only has twostates that, respectively, represent data0 and data1, there is noother state that can be converted to spacer(𝑜𝑢𝑡, 𝑜𝑢𝑡 ) = 0,0 .The disappearance of spacer violates the four-phase dual-railprotocol, which would cause data transfer error.

Fig. 10 shows two examples that encoding converters areused to bridge the connection

between single-rail domino gateand dual-rail domino gate. Focusing on the encoding converterin Stage2, when Stage2 enters the precharge phase, the SLGoutputs a spacer, but the converter outputs a invalid data0.This invalid data0 cannot be absorbed by the SLGL in Stage3since the spacer impedes its evaluation. However, when Stage2enters the evaluation phase, it has a risk that the invaliddata0 might be erroneously absorbed if the output of the SLGbecomes valid earlier than the output of the converter. Theearlier arrived valid data from the SLG trigger the SLGL tostart evaluation and absorb the invalid data0. To avoid thisproblem, the encoding converter needs to satisfy a timingconstraint that the output of the converter should become validearlier than the output of SLG.

In other words, the constructedcritical data path should be robust.

In addition to protect data transfer error by enhancing therobustness of the critical data path, we can also improve theconversion speed of the encoding converter. Interestingly, wedo not have to care about the conversion from the single-rail data0 in=0 to the dual-rail data0(𝑜𝑢𝑡, 𝑜𝑢𝑡 ) = 0,1 . Because the converter initially outputs a dual-rail data0, thisconversion can be considered infinite fast. We only have tofocus on improving the conversion from the single-rail data1in=1 to the dual-rail data1 (𝑜𝑢𝑡, 𝑜𝑢𝑡 ) = 1,0 .Fig. 11(b)shows the proposed design of the encoding converter.

Whenthe converter enters the evaluation phase, the input in=1 canimmediately pull down𝑜𝑢𝑡 . No matter the output 𝑜𝑢𝑡, 𝑜𝑢𝑡 isa instant spacer (0, 0) or the valid data1 (1, 0), it effectivelyprotects the data transfer error. On the other hand, the intuitivedesign in Fig. 11(a) has a longer signal transition delay. 𝑜𝑢𝑡 cannot be pulled down until out becomes 1. It has a higherpossibility of causing data transfer error than the proposedencoding converter.

D. Robustness Analysis

APCDP has pipeline failure in the situation that a pipelinestage does not finish evaluating before its previous stage startprecharge. In such situation, the pipeline stage cannot correctlyfinish evaluating because the precharge of its previous pipelinestage removes the valid data from the inputs. To avoid thispipeline failure, APCDP needs to satisfy an assumption that,in a pipeline stage,

(10)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 324 none of the other bits across the entiredata paths is

slower than the detected bit by more than thedelay through a staticNORgate and the drive buffer chainfollowing it. The robustness of APCDP is analyzed based onthis assumption.

1) Robustness of the Pipeline Structure:According to thepipeline structure of APCDP, the hold timeT_holdof valid dataon the inputs of each pipeline stage is

Thold=tSLG_Eval+tNOR+tBuf+tSLG_Prech(2)

where tSLG_Evalis the evaluation time for the SLG in a pipelinestage andtSLG_Prechis the precharge time for the SLG in theprevious pipeline stage.tNOR+tBufis the delay through theNORgate and the drive buffer.

The pipeline structure of APCDP is quite robust sincethe hold timeTholdsupplies sufficient time margins. In theconstruction of the critical data path, we introduced that theSLG is embedded as the last gate to finish evaluation in eachpipeline stage.

There are even some gates that are slowerthan the SLG because of delay variations in practice,

T_holdsuppliest_NOR+t_Buf+t_{SLG_Prech}time margins for pipelinefailure protection. We believe that these time margins aresufficient for dealing with delay variations in practical design.However, for safety, we supply several enhance measurementsfor the constructed critical data path in the following section.

2) Robustness of the Critical Data Path:We first use themethod of logical effort [19] to analyze the robustness of theconstructed critical data path.

Then, we discuss how to furtherenhance the robustness of the constructed critical path.

The method of logical effort is an easy way to estimate delayin CMOS circuit. In the method, modelling delay of a logicgate isolates the effects of a particular fabrication process byexpressing all delays in terms of a basic delay unit particular tothat process. The delay incurred by a logic gate is comprised oftwo components: 1) a fixed part called the parasitic delaypand 2) a part that is proportional to the load on the gate’soutput, called the effort delayf. This effort delay dependson the load and the properties of the logic gate driving theload. There are two related terms for these effects: the

logicaleffortgcaptures the effect of the logic gate’s topology on itsability to produce output current, while the electrical efforthdescribes how the electrical environment of the logic gateaffects performance and how the size of the transistors in thegate determines its load-driving capability.

Electronic effort isalso called fanout by many CMOS designer. As a result, thedelay of a logic gate is expressed as

𝑑𝑒𝑙𝑎𝑦 = 𝑓 + 𝑝 = 𝑔𝑕 + 𝑝 = 𝑔 ^𝐶^𝑖𝑛

𝐶_𝑜𝑢𝑡 + 𝑃 (3) whereCoutis the capacitance that loads the output of the logicgate andCinis the capacitance presented by the input terminalof the logic gate.

In each pipeline stage of APCDP, the SLG/SLGL has alarger gate delay than other gates according to the methodof logic effort. First, the SLG/SLGL has more complicatedtopology than other gates in the pull-down network. It slightlyincreases the parasitic delaypand the logical effect g. Second,the output of SLG/SLGL is connected to a staticNORgate andthe SLG/SLGL in the next stage. Compared with the outputsof other gates, the SLG/SLGL has a larger fanoutCout,whichincreases the electrical efforth. As a result, the SLG/SLGLhas a larger gate delay than traditional logic gates even theyhave same number of inputs. When linking all SLGs/SLGLstogether, these imposed delays increase the robustness of theconstructed critical data path.

In practice, the robustness of the constructed critical path isaffected by delay variations. As a matter of fact, it is a commonproblem in VLSI circuit design, same as the robustness of aclock signal in synchronous design and a match delay line inbundled-data asynchronous design [2]. As we all know, thesedesigns all suffer from delay variations.

To resist the influenceof delay variations, synchronous design enlarges the cycle timeof a clock signal to get some margin. On the other hand,bundled-data asynchronous design adds extra delay marginon the matching delay line to match the worst case delay incombinational logic block.

Same like these solutions, the delayvariations problem in the proposed design can be solved byenlarging delay margin on the constructed critical data path.We supply four measures to enlarge the delay margin, whichare listed as follows:

(11)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 325 1) Improving the noncritical paths delay;

2) Adding delay elements on the critical path.

3) Sizing the pull-down transistors or the staticinverters of SLGs and SLGLs to increase gate delays;

4) Applying a low priority in circuit layout for the constructed critical path;

These measures have different impacts on the performanceof circuits. It is better to choose a proper measure or multiple measures according to the practical design requirements.In measure 1), reducing the transistor size of pull-downnetwork or the static inverters can increase the gate delaysof SLGs and SLGLs. The increased delay slightly slowsdown pipeline speed, but smaller transistor size helps insaving power and silicon area. Measure 2) is layout optimization. Although this measure slightly decreases pipelinespeed, it does not impose the extra overhead of transistors.Measure 3) does not degrade pipeline speed. The problem isthat improving the noncritical path delay would increase thepower consumption. Measure 4) is an intuitive way to enhancethe critical data path. The drawback is the extra overhead oftransistors.

In addition, the use of domino logic introduces many designrisks because it is very sensitive to noise, circuit, and layouttopologies [20]. The solutions to alleviate these problems arenot in the scope of this paper. We only recommend to limitthe largest stack of domino logic when designing APCDP.The limitation depends on processing technology, practicalproblems, and actual conditions. Smaller stack design helpsin alleviating the noise problems and increasing the pipelinespeed. The tradeoff is the increased number of pipeline stages.More pipeline stages means larger silicon area and higherpower consumption. On the other hand, larger stack designdeteriorates the noise problems and the pipeline speed. Themerit is the reduced number of pipeline stages. Less pipelinestages helps in saving silicon area and power consumption.

Fig. 12. (a) Fork structure. (b) Join structure.

E. Extension to Complex Structures

The previous sections just analyzed the linear pipelinestructure. For more complex data paths, forks and joins areneeded [2]. Fig. 12 shows fork structure and join structure inAPCDP.In fork structure, the outputs of function block A are splitto connect with function blocks B and C. C-element is usedto collect the handshake signal from A’s successors. Theconstruction of the critical data path in fork structure is sameas that of described in the linear structure. The problem isthat the data paths from A to B and C are more complexthan the linear structure. The complex data paths cause largedelay variations, which affect the correctness of the criticaldata paths at the inputs of B and C. Pipeline failure happenswhen B and C do not finish their evaluations before A finishesits precharge. The delivery time of the precharge signal fromB and C to A is that

Tdel=TNOR+TC+Tbuffer (4)

whereTNOR,TC,andTbufferare delay time of a staticNORgate, a C-element, and the buffer gate. If the delay variationson the data paths are smaller thanTdel, no pipeline failurehappens.

In join structure, the outputs of function block A and Bmerge together at function block C, which requires sending anacknowledge signal from C to all its predecessors. In functionblock C, the critical data paths from function block A and Bneed to simultaneously connect to an SLG/SLGL. The designprocess is similar to that of described in the linear structure.The problem in join structure is that the acknowledge signalnetworks at function block B and C are more complex thanthe linear structure.

(12)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 326 Pipeline failure happens when A and Bdo not

completely finish their precharge process before Centers the next evaluation phase, which means that C wouldmistakenly absorb old data from A or B. The delivery time ofthe precharge signal from C to A and B is that

Tdel=TNOR+Tbuffer(5)

According to the handshake protocol, the time for C to enterthe next evaluation phase is that

TnextEval=2TEval+2TNOR+2Tbuffer(6) whereTEvalis the evaluation time for a function block.Therefore, the margin time is that

Tmargin=TnextEval –

Tdel=2TEval+TNOR+Tbuffer (7) If the delay variations on the acknowledge signal networks aresmaller thanTmargin, no pipeline failure happens.

IV. EVALUATION

This section presents the evaluation results of APCDP.An 8×8 array style multiplier is chosen as the test case,which is, respectively, designed using the proposed APCDP,LP2/2-SR [11], classic synchronous pipeline (Sync), and Sync-CG [15], [16]. Conventional dual-rail asynchronous pipelinesare not selected as the evaluation counterparts because they arehardly applicable in the design of large function block (suchas the 8×8 array style multiplier).

A. Experiment Setup

Four 8×8 array style multipliers are designed usingHSPICE in a 65-nm design technology. All designs are simulated at 1.2 V normal supply voltage, 85 °C temperature, anda normal process corner.

TABL E IV

Evaluation Results of8×8ArrayStyleMultipliers

Fig. 13.Energy consumption per cycle for processing different data patterns.

LP2/2-SR is chosen as a representative of latchlesspipeline for comparing with APCDP.

LP2/2-SR was proposed along with LP2/2, which is significantly morearea and energy efficient. It was reported that LP2/2-SR had about 60% smaller area and 55% lower energyconsumption than LP2/2 in first-input first-output (FIFO)design [11]. Besides, LP2/2-SR resembles the design of latchless synchronous pipeline. The difference is that latchlesssynchronous pipeline uses a complex multiphase clockinginstead of bundled-data handshaking. Because of their similarity, the performance of LP2/2-SR can be used as a referencefor comparing APCDP with latchless synchronous pipeline.

An 8×8 array style multiplier is extremely fine-grain orgate-level pipelined using LP2/2-SR as well as APCDP. Thedepth of each pipeline stage isonly one domino logic gate, andthere are no explicit storage elements between stages.

Theircomparisons show not only the circuit efficiency of APCDPbut also the merits of dual-rail asynchronous design.To make a further comparison, Sync and Sync-CG are alsodesigned.

They are used as comparison references since theydo not belong to latchless pipeline. The 8×8 array stylemultiplier is divided into five pipeline stages. The logic blocksare composed of static logic gates, and D flip-flops are usedas intermediate storage elements between pipeline stages.

Thesequential clock gating in Sync-CG is designed using six Dflip-flops, which realizes fine-grain

(13)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 327 clock gating that is similarto the handshake

behaviours in APCDP and LP2/2-SR.

Fig. 14. Performances of power consumption (all designs operate at 3.6Gdataset/s)

B. Results

Table IV shows the evaluation results of the 8×8 arraystyle multiplier. The performances of throughput are evaluatedwithout considering design margins, which are ideal results.The results show that APCDP has high throughput, the smallest transistor count, and the lowest forward latency in alldesigns. Fig. 13 shows the energy consumption for processingdifferent data patterns. Fig. 14 shows the performances ofpower consumption under different workloads. The resultsshow that APCDP is the most power efficient design.

1) Transistor Count:Table IV shows that APCDP,respectively, reduces the transistor count by 25.3% and 18.1%compared with LP2/2-SR and Sync. APCDP uses a mixtureof dual-rail domino logic and single-rail domino logic. Thesingle-rail domino logic gates in the noncritical data pathssave a lot of the transistor count. Although the SLGs andSLGLs in APCDP consume more transistors than traditionaldual-rail domino gates, they are in a small quantity (only 56,used in the critical data path), which have small impact onthe transistor count.

2) Latency:APCDP and LP22-SR have about one-thirdlower forward latency than Sync and Sync-CG. This is becauselatchless design has no sequential overhead (no registersor latches) on its forward path. Compared with LP22-SR,APCDP has a little larger latency. The latency is sacrificedfor constructing the stable critical data path.

Fortunately, thisdegradation is not serious.

3) Throughput:The performances of throughput are evaluated without considering design margins, which are all idealresults from the schematic simulations.The results show that LP2/2- SR has the best throughput performance. This benefits from the bundled-data asynchronousdesign of LP2/2-SR. Traditional dual-rail domino data paths inLP2/2-SR actually have better signal transition speed than thedata paths composed of SLGs/SLGLs. Bundled-data designcan exploit this benefit to increase the pipeline throughput.

However, delay margins need to be added in practicalbundled-data design, which would decrease the performanceof throughput.

4) Power:The energy consumption of VLSI circuits relatesto the toggling rate in data paths. In APCDP, the adoption ofsingle-rail domino gates in the noncritical data paths saves notonly silicon area by reducing transistor count but also energyconsumption by reducing the toggling rate.Fig. 13 shows the energy consumption per cycle forprocessing different data patterns. Each energy consumption is an average value calculated from 100 cycles. Theresults show that APCDP consumes much less energy thanLP2/2-SR. The adoption of single-rail domino gates in APCDPreduces the toggle rate in data paths since single-rail dominologic does not toggle when transferring low-voltage signal.Besides, the toggling rate relates to the injected data patterns.Therefore, the energy consumption of APCDP varies a lotaccording to different data patterns. On the other hand, theenergy consumption of LP2/2-SR remains almost constant.LP2/2-SR’s full dual-rail domino data paths have almost aconstant toggling rate regardless of the injected data patterns.Compared with LP2/2-SR, APCDP saves up to 60.2% ofenergy in the best case when processing data pattern ff*00(hexadecimal digit) and 24.5% of energy in the worst casewhen processing data pattern ff*ff.

Furthermore, the power performance with different workloads is evaluated. Fig. 14 shows the performance of powerconsumption when all designs operate at 3.6G dataset/s. Thesolid lines show the power consumption when the injecteddata patterns are recurring between ff*ffff*00. The dottedlines shows the power consumption when the

(14)

S.Immanuel Prabaharan, A.Samsu Nighar, R.Sarathkumar Kithiyon and C.Karuppasami ijesird, Vol. III, Issue VI, December 2016/ 328 injected datapatterns are recurring between

ff*0fff*00. The workloadrefers to the rate of the number of active-state cycles to thetotal number of cycles. In our case, the workload is calculatedbased on a period of consecutive data injection cycles (active state cycles) following consecutive empty cycles. Fig. 15shows the workload definition. The workload is calculatedasN/(N+M),whereN

is the number of consecutive datainjection cycles andM

is the number of consecutive emptycycles.

C. Further Discussion

The above evaluation is based on schematic simulation.To apply APCDP to large function modules, design automationis an important issue that has to be solved. Design automationfor asynchronous circuits is a very large research domain.It is difficult to fully cover this topic. The interested reader isreferred to [21]. Here, we just simply address several specificdesign automation issues in APCDP.One issue is the design automation for constructing astable critical data path. Based on the proposed constructionmethod, design automation flow can be easily developed.

First,the gate with the largest number of inputs in each stage andconnection of gates between stages is identified. Then, theproper gates using SLGs or SLGLs are replaced according tothe gate connection information and optimization methods.

Another issue is the design automation for layout. Standardplacement and routing (P&R) tools tend to reduce the worstcase delay in function blocks by optimizing circuit layout.Such optimization is not suitable for APCDP since it woulddegrade the robustness of the constructed critical data path. Toavoid this problem, a specific P&R method has to be developedto increase the delay on the constructed path.

The last issue is timing verification. There are almost no EDA support for verifying domino circuits because chargesharing and uncertainty about worst case delay makes statictiming analysis (STA) very complex [21]. In APCDP, theknown critical data path and the dual-rail handshake controllogic help in easing the timing analysis

problem. After creatinga high-quality domino gate library, it is possible to apply STAfor verifying the timing constraints in APCDP.

V. CONCLUSION

This paper introduced a novel design method of asynchronous domino logic pipeline. The pipeline is realized based ona constructed critical data path. The design method greatlyreduces the overhead of handshake control logic as well asfunction block logic, which not only increases the pipelinethroughput but also decreases the power consumption. Theevaluation results show that the proposed design has betterperformance than a bundled-data asynchronous domino logicpipeline (LP2/2-SR). It is even comparable with a synchronouspipeline with sequential clock gating.

REFERENCES

[1] A. J. Martin and M. Nystrom, “Asynchronous techniques for system- on-chip design,” Proc. IEEE , vol. 94, no. 6, pp. 1089–1120, Jun. 2006.

[2] J. Teifel and R. Manohar, “An as ynchronous dataflow FPGA architecture,” IEEE Trans. Comput., vol. 53, no.

11, pp. 1376–1392, Nov. 2004.

[3] M. Hariyama, S. Ishihara, and M.Kameyama, “Evaluation of a fieldprogrammable VLSI based on an asynchronous bit-serial architecture,” IEICE Trans. Electron. , vol. E91- C, no. 9, pp. 1419–1426, Sep. 2008.

[4] Z. Xia, S. Ishihara, M. Hariyama, and M. Kameyama,

“Design of high-performance asynchronous pipeline using synchronizing logicgates,”IEICE Trans. Electron., vol.

E95-C, no. 8, pp. 1434–1443,Aug. 2012

[5] I. Sutherland, B. Sproull, and D. Harris,Logical Effort: Designing Fast CMOS Circuits. San Mateo, CA, USA: Morgan Kaufmann, 1999

[6] A. Taubin, J. Cortadella, L. Lavagno, A. Kondratyev, and A. Peeters, “Design automation of real-life a synchronous devices and systems,” Found. Trends Electron. Des.

Autom, vol. 2, no. 1, pp. 1–133,2007.

[7] S. M. Nowick and M. Singh, “High-performance asynchronous pipelines an overview,” IEEE Des. Test Comput. , vol. 28, no. 5, pp. 8–22, Sep./Oct. 2011.

[8] M. Singh, J. A. Tierno, A. Rylyakov, S. Rylov, and S. M.

Nowick, “Anadaptively pipelined mixed synchronous- asynchronous digital FIR filter chip operating at 1.3 gigahertz,” IEEE Trans. Very Large Scale Integr.(VLSI) Syst., vol. 18, no. 7, pp. 1043–1056, Jul. 2010.

[9] Z. Xia, S. Ishihara, M. Hariyama, and M. Kameyama,

“Design of high-performance asynchronous pipeline using synchronizing logic gates,” IEICE Trans. Electron. , vol.

E95-C, no. 8, pp. 1434–1443, Aug. 2012.

[10] A. Taubin, J. Cortadella, L. Lavagno, A. Kondratyev, and A.

Peeters, “Design automation of real-life asynchronous devices and systems,” Found. Trends Electron. Des. Autom, vol. 2, no. 1, pp.

1–133,2007.