ALU Design - 8. 16- Bit RISC Processor Design for Convolution Application Using Verilog HDL

Arithmetic Addition

The first operation we designed the ALU to do was addition. We decided to implement this using a ripple carry adder. The advantages of a ripple carry adder are the simplicity of its logic and the ease of extending its logic more places. It's disadvantage is that it's slow. Adders like the carry look ahead adder are much faster. For our design, though, we believe that the ID and MEM stages will be taking up enough time such that the speed differences of the two are not a huge concern. Seeing as it is the first instruction, we gave it the OpCode

0000000010

Our adder diagram

The last zero will be made apparent in the next section.

Subtraction

The next operation we designed was the awkward stepbrother of addition: subtraction. Rather than implementing a separate adder dedicated solely to doing addition, we decided to modify our existing adder to do both.

Subtraction is simple to addition of one number and the two's complement of another number.

That is, the second input is inverted and then a one is added to it. So, we have

A-B=A+(!B+1)

Because it is addition, we can rearrange it to be

A-B=(A+!B)+1

Now it is easy to see that the subtraction of A and B is the addition of A and the inversion of B plus 1. So, we place a multiplexor in front of the second input of the adder: one choice is to chose the signal unmodified; the other is to choose the inverted part of B. Then to add one, we set the carry in bit of the adder high. So, in the interest of saving a little bit of logic, we tied the carry in signal of the ALU and the deciding input of the multiplexor to the same bit ... (insert dramatic overture here) ... the last one. So, the subtraction gets the OpCode:

0000000011

Addition and subtraction combination circuit.

From here on, we will refer to the last bit as SIG0 because it sounds ominous.

Set Less Than SLT

The purpose of the set less than operation is te determine whether the first input of the ALU is less than the second, or A<?B. So, if the statement is true, A<B -> A-B<0. So, if A-B is negative, then SLT is true. As it turns out, the last sum bit of the adder is the sign of the result. So, if we tie the least significant bit of the SLT operation to the last bit of the adder and then the rest to ground (becauseSLT only outputs 1 or 0) we have given the ability to compare to values to our adder. We gave this operation the OpCode:

0000100001

The addition of SLT to our previous circuit

SLTU

There arises a slight bit of more difficulty in implementing unsigned set less than. Let us discuss the issues of simply implementing the exact same logic used in the SLT operation for the SLTU operation by looking at a 4-bit example. Suppose you have two binary numbers: 1100 and 0011.

In unsigned decimal, those numbers are 12 and 3. Obviously 12<3 is false. If we implement it with the same logic as above, though, it becomes:

1100-0011 = 1100 + 1101 = (1)001

The addition of SLTU to our previous circuit resulting in the complete adder component.

This would imply that 12 is less than 3. This is a result of the fact that 1100 is actually -4.

So, -4<3 is true.

The solution to this problem actually implements only a handful more gates. If we extend our adder to 33-bits, 1100 can be explicitly 01100 which is 12, thus getting rid of it's ambiguity with

-4. Also, since this 33rd gate is only used in SLTU instructions, we can tie it's other input to 1.

So, now we have

01100-10011 = 01100+11101 = (0)1001

This is obviously the result we need. The savvy reader will note that this is only an XOR between 0, 1 and the 32nd carry bit. The savvier readier will note that that is just inversion of the 32nd carry bit. We gave this operation the OpCode:

0001000001

Logic OR

The thing to know about the OR function is that just like the AND function the OR function is a bitwise function that will OR each individual bit. In order to have the ALU handle this we used 32 or gates in parallel for each individual bit. The OR operation gets the OpCode:

00000101

OR gate AND

The thing to know about the and function inside the ALU is that it is a bitwise AND meaning that it takes each bit and AND's with the same bit from another input. In order to accomplish this with inside of a 32-bit adder would mean using 32 and gates all in parallel with each other as shown below. The AND operation gets the OpCode:

0000001001

AND Gate

NOR

The thing to know about the NOR function inside the ALU is that the NOR function is a bitwise function where each and every bit will be NOR'd. The ALU has 32 NOR gates that are going to handle this function.The NOR operation gets the OpCode:

0000010001

NOR Gate Shifting

The MIPS processor implements only one type of shift: logical. After shifting bits in the necessary direction, the output bits that did not receive a value from the input are given the value 0.

SLL

To shift left, each bit in the output determines it's value by a multiplexor that chooses between one of

the 32 input bits and 0's. In order to simplify the logic, we have a block whose job it is to change the input 5-bit shamt to a one-hot encoded shamt. This reduces the amount of time it takes to process the shift command. Like shifting left in base10 is multiplying by ten , shifting left in base2 is multiplication. A word of caution, though: using this method to multiply large numbers may result in incorrect answers as bits important to the value of the number in the most significant portion of the word may be cut off.

We gave this the OpCode:

0010000000

Shift left logical circuit using one hot encoding

SRL

Shifting right is the same as above, but with the bits chosen in reverse.

Like dividing by ten in base₁₀ is simply a shift right of all the digits, a shit right in base₂ is a division by two. But a word of caution: this operation puts zeros in place of bits not defined by the input.

Thus, using this method to divide negative numbers will change the value and proper steps must be take to fix that.

This has the OpCode:

0100000000

Shift right logical circuit using one hot encoding.

LUI

When loading an immediate into the upper 16 bits of a register, the sign extended immediate needs to be shifted left 16 times. So, that is exactly what we designed. We decided to implement an explicit shift left 16 module to reduce the amount of logic we needed. This has the OpCode:

1000000000

Our load upper immediate circuit

ALU Datapath

Immediates

Immediate Implementation

Immediates are handled outside of the ALU. Rather than creating separate functions for immediate operations, we placed a MUX in front of the second operand of the ALU that would choose between the immediate field and register value using the signal ALUSrc to choose between the two.

An ALU with a multiplexor on the input to allow working with immediate.

Immediate Datapath

Not Zero?

The real question for this section is, "Why not just have a line that is high when the result of the ALU is 0?" The response, to find out if the result is not zero, we need only have a 32-bit wide OR gate. If any of the bits is not zero, it will be high. On the other hand, if we wanted a ZERO?

line, we would have to NOT this gate. Our result allows us to leave out this extra gate.

CHAPTER 3

The trend in the recent past shows the RISC processors clearly outsmarting the earlier CISC processor architectures. The reasons have been the advantages, such as its simple, flexible and fixed instruction format and hardwired control logic, which paves for higher clock speed, by eliminating the need for microprogramming. The combined advantages of high speed, low power, area efficient and operation-specific design possibilities have made the RISC processor ubiquitous.

The main feature of the RISC processor is its ability to support single cycle operation, meaning that the instruction is fetched from the instruction memory at the maximum speed of the

memory. RISC processors in general, are designed to achieve this by pipelining, where there is a possibility of stalling of clock cycles due to wrong instruction fetch when jump type instructions are encountered. This reduces the efficiency of the processors. This paper describes a RISC

architecture in which, single cycle operation is obtained without using a pipelined design. It averts possible stalling of clock cycles in effect [1]-[3].

The development of CMOS technology provides very high density and high performance

integrated circuits. The performance provided by the existing devices has created a never-ending greed for increasingly better performing devices.

This predicts the use of a whole RISC processor as a basic device by the year 2020. However, as the density of IC increases, the power consumption becomes a major threatening issue along with the complexity of the circuits. Hence, it becomes necessary to implement less complex, low power processor designs. Energy recovery is proving to be a promising approach

for the design of low power VLSI circuits. In recent years, studies on adiabatic computing have grown for low power systems and several adiabatic logic families have been proposed [4]-[6]. In this regard, we have utilized 2N-2N2P quasi-adiabatic logic in the design of incrementer circuits and carry select adder circuits along with architectural changes, in order to prove the power efficiency of the proposed structures with those found in the literatures.

Program counter is one of the most complex building blocks of the processor design. It performs mainly two operations, namely, incrementing and loading. In order to address this issue, the present work establishes a novel design of an incrementer structure [7]. It is realized using the quasiadiabatic 2N-2N2P logic structure. The structure incurs 17 times reduction in power, while comparing against conventional CMOS counterpart. A speed increase of about 25% and a marginal amount of 8% area saving are achieved.

The second part of this work concentrates on the complexity reduction in ALU by optimizing the design of arithmetic circuits. The previous works in literature focus on energy efficient

arithmetic circuits. In order to increase the operating speed and power efficiency of the

processor, we have come out with a novel design of a carry select adder structure [8]. This has also been compared with the conventional adders to validate the design in terms of powerdelay product.

In order to employ the processor for signal processing applications, we have integrated a modified Wallace tree multiplier that uses compressor circuits to achieve low power,

high speed operation [9], in the ALU. Reference [10] suggests that it is possible to achieve the high speed, low power and area efficient operations by reducing the stronger operations such as multiplication, at the cost of increasing the weaker operations such as addition.

Fig.1 16-bit Non-Pipelined RISC Processor

Convolution is an important signal processing application which is used in filter design. Many algorithms have been proposed in order to achieve an optimized performance of the filters by optimizing the convolution design. Modified Winograd algorithm is notable among them. They require 4 multiplication operations only, when compared to 6 for a 3*2 normal convolution

methodology with an additional rise in the number of adders to 9 from 4. In this work, we have designed and developed a 16-bit single cycle non-pipelined RISC processor. In order to

improve the performance, modification on incrementer circuit and carry select adder circuit have been done and modified structure has been integrated into the design and the performance is validated. A multiplier structure has been developed and modified Winograd algorithm is executed in order to validate our claim.

DESIGN OF 16-BIT RISC CPU

In document 8. 16- Bit RISC Processor Design for Convolution Application Using Verilog HDL (Page 24-39)