Design of High Performance and Low Power Simultaneous Multi-threaded Processor

(1)

ISSN: 2088-8708  423

Design of High Performance and Low Power Simultaneous Multi-threaded Processor

Krishan Arora*, Paramveer Singh Gill*, Parul Mehra**

* Assistant Professor Departement of Electronics and Electrical Engineering, Lovely ProfessionalUniversity, Punjab

**Phd Scholar, Departement of Commerce, CMJ University, Meghalaya

Article Info ABSTRACT

Article history:

Received Mar 31, 2013 Revised May 14, 2013 Accepted May 27, 2013

In this paper, we present the design of a High Performance Multi-Threaded Processor. Processing of high quality images is inevitable in applications such as, HD TV, Gaming Multimedia, etc. which require a great processing power with low power consumption. This can be achived with multi-threaded processors which optimally utilises the Functional Units (Fus). The speed of processing is as good as multi-core processors with lesser area. A conflict resolver (CR) is designed for scheduling the instructions, which involves allocation of Fu. The data move instructions are in majority in any of the programs; the corresponding logic blocks are replicated and speed of execution is further improved. We illustrated for two-threaded processorHowever, it is possible to extend the design for any number of threads by suitably redesigning the CR, and also replicate Transfer Logic and CPU Registers.

Keyword:

Muti-threaded Prosessor Hardware thraeding SMT

Instruction scheduling Single Threaded Processor

Corresponding Author:

Krishan Arora Assistant Professor

Departement of Electronics and Electrical Engineering, Lovely Professional University,Punjab

Email: [email protected]

1. INTRODUCTION

The extreme developments in entertainment, gaming, medical imaging and HDTV along with electronics systems. Electronic systems design is getting more and more complex. The system requires very high speed processing, at the same time power consumption is also stretching its limitations. To meet this contradicting requirements the solution is in the efficient and effective embedded processor design. One of the promising methods is “Simultaneous multithreading” (SMT), which takes super-threading to the next level in the high performance processor design. It is super-threading without the restriction, i.e. all the instructions issued by the front end on each clock are from the same thread. In SMT, instructions are executing on different threads simultaneously. So we get better utilization of functional units. The issue of scheduling is properly managed by a special hardware unit called Conflict analyzer, which takes part of the opcode, i.e. 3 most significant bits.

Although SMT might seem like a pretty large departure from the kind of conventional, process- switching multithreading done on a single-threaded CPU, it actually doesn't add too much complexity to the hardware. This is done by dividing up the processor's architectural resources into two types: Replicated and Shared.

2. RESEARCH METHOD

The potential for achieving a significant increase in throughput on a superscalar by using simultaneous multithreading (SMT) was first demonstrated in 1995 by Tullsen of University of washington

(2)

IJECE 424 [1]. S supers multit the sa parall own i execu thread freque multit multit been u hardw amon thread system multi- multi- mode base p and e simpl data f cycle also g the m

3. SIN 1. It 2. Al 3. It 4. It 5. It 6. Th

4. IM switch hardw Share



E Vol. 3, No.

MT is a techn scalar’s funct threading pipe ame program.

lelism in a des independent r ution resource ded performan ency. In [5] r threaded arch threading whi used to impro ware-software g threads. Sp d-level paralle m cannot guar -core, multithr -threaded exec ling multi-thr

Our work platform for i executes that i e hardware, w from different to machine cy gets reduced a methodology ca

NGLE THRE We first im can execute 4 ll the instructi can communi has 256 bytes can point to 2 he maximum C

MPLEMENTA Although hing multithre ware. This is d ed. Let's take a

3, June 2013:

nique that allo tional units.

eline to increa In [4] packe sign that supp register file a es and memo nce, so that reexamine the hitecture. In [6 ich can efficie ove single thre approach to peculative mu

elism specula rantee that the readed archite cution traces t eaded process is different fr implementing in single mac which consum pins so we do ycle. That circ and execution an be extended

EADED PRO mplemented th 45 Arithmetic,

ions are taking icate with 2K s internal Cach 2K external m

Clock Frequen

Figure 1

ATION OF M SMT might eading done o done by divid a look at whic

423–428 ws multiple in

Then the Dy ase processor

t processor ut ports 256 simu and executes

ry ports with a design is a e issue of non 6] presents a ently handle l ead performan

speculative ultithreading i

atively, i.e., e e parallelism ecture and the to drive the si sors and the ne from above in g Multi-Thread

chine cycle i.e me very less po o not need any cuit is compos speed increa d to more than

OCESSOR AR he Single Thre , Logical, Data g only 2 clock

IO devices.

he memory.

emory.

ncy is 264MH

1. Architectura

MULTI-THR seem like a on a single-th ding up the pr ch resources fa

ndependent th ynamic Multi utilization, ex tilizes multipl ultaneous thre instructions f h other thread

achieved that ndeterministic

cache-based larger threads nce. In [7] pre

multithreadin ncreases sing executing cod exists. In [8]

e usage of an imulation. In ew functional two ways: (1 ding, It fetch e. it takes on ower and Silic y internal circu sed of comple

sed by many n two threads.

RCHITECTU eaded Process a transfer and k periods to ex

Hz.

al Diagram of

READED PRO pretty large d hreaded CPU, rocessor's arch all into which

hreads or prog ithreading Pr xcept that the le multithread eads in eight p formatted to a ds. The proce t is realizable c processors, architecture . Speculative esents the Mit ng, even in th gle-threaded a de in parallel, describes a tr application m [9] explains a lity Model to s 1) A new Sing the op-code ly two clock con area. It is uit which is re ex FSMs, so by

folds. Also it .

URE

sor which has d Looping Inst

xecute.

f Single Threa

OCESSOR departure from

it actually do hitectural reso categories for

rams to issue rocessor [3] p

threads are c ded processing processing eng a general purp essor is optim e in terms of

simplifying p support for S

multithreadin tosis framewo he presence o application pe

even when ransaction-lev model that gen a conceptual a support multit gle Treaded P and data sim periods. (2) B s because we equire to redef

y eliminating t is proved for

following fea tructions.

aded Processor

m the kind o oesn't add too ources into tw

r my Multi-th

ISSN:2 multiple instr presents a si created dynam g engines to s gines. Each th

pose ISA, wh mized to sacr

f silicon area previous desig Speculative Si ng is a techniq ork, which is of frequent d erformance by the compiler vel simulation nerates synthet architecture de threaded proce Processor is in multaneously a By virtue of are fetching o fining pins fro that our switc rm the initial

atures:

r

of convention o much compl wo types: Rep readed proces

2088-8708 ructions to a imultaneous mically from support this hread has its hile sharing rifice single a and clock gns using a imultaneous que that has a combined dependences y exploiting

or runtime model of a tic single or eveloped for

essors.

ntroduced as and decodes this we got op-code and om machine ching power studies that

al, process- lexity to the plicated, and

ssor:

(3)

Fus.(i Instru Proce

4.1. C Instru Funct instru instru increm Threa utilizi chart

Regis most o other

Design of H We imple i.e. in our pr uction Decode essor is shown

Conflict Resol Conflict R uction that wh tional Unit us uction on the r uction then co ment and do ad-1 will exec ing functional

in the Figure

We replica ster-Memory,

of instructions thread for TL

High Perform

 All G

 PC a

 Retu

 Flag

 Tran trans

emented Two rocessor there er and Confli n in Figure 2.

Fig

lver

Resolver is no hich functiona sed in the nex rest of Functio onflict resolve

not allow its cute normally l units efficien

3.

Table 1. R

ated the Trans Register-I/Os s are related to L related Instru

mance and Low Replicated General Purpos and IR urn Stack predic g Register

nsfer logic to im sfer instructions

Thread proce e is only one ict Resolver

gure 2. Archit

othing but a s al unit will be xt Instruction onal Units. If er issue a sign

s instruction y). The same ntly without an

Representation

Case No OP

1 0

2 0

3 0

4 1

5 1

6 1

sfer Logic (TL etc) .It means o data transfer uctions, so the

w Power Simu e Registers ctor

mplement the da s

essor which c e ALU, one efficiently ut

tecture of Mul

imple decode e used by thr by Thread-1 in the case bo nal Stall=1, w to execute w cycle is repe ny error in th

of conflicting

P12 OP11 OP 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1

L) (which take s each thread r, so by provid e execution wi

ultaneous Mult ata



can run two T Multiplier, o ilizes the Fus

lti-Threaded P

er which ident reads in the n

for Thread-1a oth threads req which will sto while do noth eat again and he output. All t

g cases and th

P10 Conflictin 1 ALU 0 Divider 1 Barrel Shif 0 Multiplier 0 Cache 1 IOB

es care of all d has its own co ding each thre ill be fast and

ti-threaded Pr Shared Cache Memory Execution Uni Instruction De Functional Un IO pins

Threads of a p one Divide, o s. The arhite

Processor

tify by the M next Instructio and allow the quires same fu op the program hing with Thr d again so by the above is s

e respective F

ng FU

fter

data transfer b opy of TL. It i ead its own co

smooth.

rocessor (Kris ry

its ecoder nits

program on o one Barrel sh cture of Mul

MSBs of op-co on. So it will e Thread-2 to functional unit

m counter of read-1(i.e. Ins y this conflict shown clearly

FUs

between Regis is because in a opy of TL it w

shan Arora) one copy of hifter). The lti-Threaded

odes of next reserve the execute its t in the next thread-2 to struction of resolver is by the flow

ster-2, a program will not stall

(4)

IJECE 426

5. R consid Proce

Figur

20 40 60 80 100 120 140



E Vol. 3, No.

RESULTS AN A total of dered to be r essors are com

re 4. Area com Regi

0 0 0 0 0 0 0 0

LUT

Single Thread

3, June 2013:

ND ANALYS f three Archit replicated and mpared on Area

mparison in te isters and Gat

Slice Slice F/F

ded Dual-C U C Prog

Co of

Allo bot

423–428 Figure 3. F

SIS

tecture (Viz.

d Multi-Threa a, Power and

erms LUT, Sli e count

e Slice Reg.

G

Core Start

Update Program Counter1 (PC1) &

gram Counter 2(P

ompare MSB [12:

opcode1 & opcod

Is they are same

ow The Instructio th Threads to exe

Flow Chart - C

Single Threa aded, in whic Throughput b

ce, F/F,

GC x 100

&

PC2)

:10]

de2

on on cute

case Yes

Conflict Revol

aded, Multi-C ch partially re bases and resu

Figure

0 5 10 15 20

Leakag

mW

Sing Mu Stall Thread2 (i Instruction on e

stop PC2 to In

Execute Instr Thread

Update P Identify the con

1 2 3 4

lver

Core, in which eplicated and ults are shown

5. Power cons

ge Internal P

gle Threaded lti-Threaded .e. not allow execute and ncrement)

ruction on d1

PC1 nflicting case

6 4 5

ISSN:2

h all the reso d others were

below:

sumption in m

Net Sw ower

Dual Core

2088-8708

ources were shared) of

mW

wiching

(5)

Design of High Performance and Low Power Simultaneous Multi-threaded Processor (Krishan Arora) So as we can see from the above graphs that the gate count and power consumption of Dual core processor is just double as that of Single Threaded Processor. But there is lot of area and power saving in case of Multithreaded Processor as compare to Dual Core Processor. Its is mainly because in case of Multi- Threaded the all functional units are not replicated but they are shared, while in case of Dual-Core processor whole the architecture get replicated.

Figure 6. Performance with respect to dependencies among the threads

Here we compared performance of Single-Threaded Processor and Multi-Threaded Processor on the bases of Instructions executed per clock cycle for 50 Instructions program per Thread (i.e. total of 100 instructions). For 0 and 2% conflicts our Multi-Threaded Processor performs 197% compare to Single- Threaded Processor (i.e. nearly like two Single-Threaded Processors). As percentage of conflicts increases the performance of Multi-Threaded processor starts decreasing nearly linearly. Since for 50 instructions per thread the maximum possible conflicts can be 94%. At 94% conflict Multi-Threaded will perform 104.63%.But as we know most of the Instructions in a program is Data-transfer and Jump, so in most of cases the conflict will be will be varies from 30-60% and correspondingly our Multi-Threaded processor performance varies from 156.15-126.88% compare to Single-Threaded Processor.

6. EXPERIMENTATION

The processor is modelled using Verilog HDL and its functionality is verified by using ModelSim-XE III 6.2g. It is then synthesized with Xilinx ISE-9.1i. Finally the design is mapped on Vertex5-XC5VLX110T platform.

7. FUTURE SCOPE

The presented Multi-Threaded processor architecture can be extended to any number of threads by suitably redesign the CR, also replicate transfer logic and CPU Registers as many as threads.The usual limitation on the number of threades is number of functional units used in the design. Incase the number of threads exceeds the number of functional units, the threades has to wait more to get perticular functional unit to execuete its instruction, so the perforemence of Multi-Threaded processor will degrade greatly . But there is also scope we can either go for combination of Multi-Core and Multi-threaded processor, in which there will be multi-cores of processor on single chip and each core will have perticular number of threads.

REFERENCES

[1] Dean M. Tullsen, Susan J Eggers, Henry M Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism.

ICSA'95, Santa Margherita Ligure Italy. 1995: 392–403.

[2] Susan J Eggers, Joel S Emer, Henry M Levy, Jack L Lo, Rebecca L Stamm, Dean M Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. Journal of IEEE Micro. 1997: 12-17.

[3] Haitham Akkary, Michael A Driscoll. A Dynamic Multithreading Processor. 31^st Annual ACM-IEEE International Symposium on Microarchitecture. 1998: 226-236.

0 0.2 0.4 0.6 0.8 1 1.2

0 10 20 30 40 50 60 70 80 90 100 Nos. of Instructions executed per clock cycle

% Conflicts

Single-Threaded Multi-Threaded

(6)

 ISSN:2088-8708

IJECE Vol. 3, No. 3, June 2013: 423–428 428

[4] O’Melveny, Myers LLP, Kayamba, Inc. A Massively Multithreaded Packet Processor. Presented at NP2: Workshop on Network Processors, held in conjunction with The 9th International Symposium on High-Performance Computer Architecture, Anaheim, California. 2003: 1-11.

[5] P Leadbitter, D Page, NP Smart. Nondeterministic Multithreading. IEEE Transactions on Computers. 2007; 56(7):

992-998.

[6] Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu and Pen-Chung Yew. Supporting Speculative Multithreading on Simultaneous Multithreaded Processor. Y Roberts et al. (Eds.): HiPC. 2006: 148-158.

[7] Carlos Madriles, Carlos Garcıá-Quinõnes, Jesu´s Sańchez, Pedro Marcuello, Dean M. Tullsen. Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices. IEEE Transactions on Parallel and Distributed Syatems. 2008; 19(7): 914-925.

[8] Nicholas Ma, Naraig Manjikian, Subramania Sudharsanan. Modeling and Simulation of Multicore multithreaded Processor Architecture in System C. CCECE. 2008: 1155-1160.

[9] David Burgart. Modeling Multi-Threaded Processors. White paper TeamQuest. 2008: 1-10.

BIOGRAPHIES OF AUTHORS

Krishan Arora is presently working as Assistant Professor in the Department of Electronics and Electrical Engineering in Lovely Professional University, Phagwara (Punjab). He has completed his B.Tech (Electrical & Electronics Engg.) from Punjab Technical University,Punjab and M.Tech (Electrical Engg.) from Punjab Technical University, Punjab. His area of research is Power System, Electrical Machines and Power Electronics.

Paramveer Singh Gill is presently working as Assistant Professor in the Department of Electronics and Electrical Engineering in Lovely Professional University, Phagwara (Punjab).

He has completed his B.Tech (Electronics & Communication Engg.) from Punjab Technical University,Punjab and M.Tech (VLSI Design) from VIT Vellore. His area of research is Processor Design.

Parul Mehra is Phd Scholar from Department of Commerce in CMJ University, Meghalaya. She has completed her Master of Commerce from Hindu College, Amritsar and Bachelors of Education with specialization in Commerce and Economics from Khalsa college of Education Amritsar. She has cleared UGC-NET with specialisation in Commerce in DEC.2011 in First Attempt.