ISSN: 2088-8708 423
Design of High Performance and Low Power Simultaneous Multi-threaded Processor
Krishan Arora*, Paramveer Singh Gill*, Parul Mehra**
* Assistant Professor Departement of Electronics and Electrical Engineering, Lovely ProfessionalUniversity, Punjab
**Phd Scholar, Departement of Commerce, CMJ University, Meghalaya
Article Info ABSTRACT
Article history:
Received Mar 31, 2013 Revised May 14, 2013 Accepted May 27, 2013
In this paper, we present the design of a High Performance Multi-Threaded Processor. Processing of high quality images is inevitable in applications such as, HD TV, Gaming Multimedia, etc. which require a great processing power with low power consumption. This can be achived with multi-threaded processors which optimally utilises the Functional Units (Fus). The speed of processing is as good as multi-core processors with lesser area. A conflict resolver (CR) is designed for scheduling the instructions, which involves allocation of Fu. The data move instructions are in majority in any of the programs; the corresponding logic blocks are replicated and speed of execution is further improved. We illustrated for two-threaded processorHowever, it is possible to extend the design for any number of threads by suitably redesigning the CR, and also replicate Transfer Logic and CPU Registers.
Keyword:
Muti-threaded Prosessor Hardware thraeding SMT
Instruction scheduling Single Threaded Processor
Copyright © 2013 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Krishan Arora Assistant Professor
Departement of Electronics and Electrical Engineering, Lovely Professional University,Punjab
Email: [email protected]
1. INTRODUCTION
The extreme developments in entertainment, gaming, medical imaging and HDTV along with electronics systems. Electronic systems design is getting more and more complex. The system requires very high speed processing, at the same time power consumption is also stretching its limitations. To meet this contradicting requirements the solution is in the efficient and effective embedded processor design. One of the promising methods is “Simultaneous multithreading” (SMT), which takes super-threading to the next level in the high performance processor design. It is super-threading without the restriction, i.e. all the instructions issued by the front end on each clock are from the same thread. In SMT, instructions are executing on different threads simultaneously. So we get better utilization of functional units. The issue of scheduling is properly managed by a special hardware unit called Conflict analyzer, which takes part of the opcode, i.e. 3 most significant bits.
Although SMT might seem like a pretty large departure from the kind of conventional, process- switching multithreading done on a single-threaded CPU, it actually doesn't add too much complexity to the hardware. This is done by dividing up the processor's architectural resources into two types: Replicated and Shared.
2. RESEARCH METHOD
The potential for achieving a significant increase in throughput on a superscalar by using simultaneous multithreading (SMT) was first demonstrated in 1995 by Tullsen of University of washington
IJECE 424 [1]. S supers multit the sa parall own i execu thread freque multit multit been u hardw amon thread system multi- multi- mode base p and e simpl data f cycle also g the m
3. SIN 1. It 2. Al 3. It 4. It 5. It 6. Th
4. IM switch hardw Share
E Vol. 3, No.
MT is a techn scalar’s funct threading pipe ame program.
lelism in a des independent r ution resource ded performan ency. In [5] r threaded arch threading whi used to impro ware-software g threads. Sp d-level paralle m cannot guar -core, multithr -threaded exec ling multi-thr
Our work platform for i executes that i e hardware, w from different to machine cy gets reduced a methodology ca
NGLE THRE We first im can execute 4 ll the instructi can communi has 256 bytes can point to 2 he maximum C
MPLEMENTA Although hing multithre ware. This is d ed. Let's take a
3, June 2013:
nique that allo tional units.
eline to increa In [4] packe sign that supp register file a es and memo nce, so that reexamine the hitecture. In [6 ich can efficie ove single thre approach to peculative mu
elism specula rantee that the readed archite cution traces t eaded process is different fr implementing in single mac which consum pins so we do ycle. That circ and execution an be extended
EADED PRO mplemented th 45 Arithmetic,
ions are taking icate with 2K s internal Cach 2K external m
Clock Frequen
Figure 1
ATION OF M SMT might eading done o done by divid a look at whic
423–428 ws multiple in
Then the Dy ase processor
t processor ut ports 256 simu and executes
ry ports with a design is a e issue of non 6] presents a ently handle l ead performan
speculative ultithreading i
atively, i.e., e e parallelism ecture and the to drive the si sors and the ne from above in g Multi-Thread
chine cycle i.e me very less po o not need any cuit is compos speed increa d to more than
OCESSOR AR he Single Thre , Logical, Data g only 2 clock
IO devices.
he memory.
emory.
ncy is 264MH
1. Architectura
MULTI-THR seem like a on a single-th ding up the pr ch resources fa
ndependent th ynamic Multi utilization, ex tilizes multipl ultaneous thre instructions f h other thread
achieved that ndeterministic
cache-based larger threads nce. In [7] pre
multithreadin ncreases sing executing cod exists. In [8]
e usage of an imulation. In ew functional two ways: (1 ding, It fetch e. it takes on ower and Silic y internal circu sed of comple
sed by many n two threads.
RCHITECTU eaded Process a transfer and k periods to ex
Hz.
al Diagram of
READED PRO pretty large d hreaded CPU, rocessor's arch all into which
hreads or prog ithreading Pr xcept that the le multithread eads in eight p formatted to a ds. The proce t is realizable c processors, architecture . Speculative esents the Mit ng, even in th gle-threaded a de in parallel, describes a tr application m [9] explains a lity Model to s 1) A new Sing the op-code ly two clock con area. It is uit which is re ex FSMs, so by
folds. Also it .
URE
sor which has d Looping Inst
xecute.
f Single Threa
OCESSOR departure from
it actually do hitectural reso categories for
rams to issue rocessor [3] p
threads are c ded processing processing eng a general purp essor is optim e in terms of
simplifying p support for S
multithreadin tosis framewo he presence o application pe
even when ransaction-lev model that gen a conceptual a support multit gle Treaded P and data sim periods. (2) B s because we equire to redef
y eliminating t is proved for
following fea tructions.
aded Processor
m the kind o oesn't add too ources into tw
r my Multi-th
ISSN:2 multiple instr presents a si created dynam g engines to s gines. Each th
pose ISA, wh mized to sacr
f silicon area previous desig Speculative Si ng is a techniq ork, which is of frequent d erformance by the compiler vel simulation nerates synthet architecture de threaded proce Processor is in multaneously a By virtue of are fetching o fining pins fro that our switc rm the initial
atures:
r
of convention o much compl wo types: Rep readed proces
2088-8708 ructions to a imultaneous mically from support this hread has its hile sharing rifice single a and clock gns using a imultaneous que that has a combined dependences y exploiting
or runtime model of a tic single or eveloped for
essors.
ntroduced as and decodes this we got op-code and om machine ching power studies that
al, process- lexity to the plicated, and
ssor:
Fus.(i Instru Proce
4.1. C Instru Funct instru instru increm Threa utilizi chart
Regis most o other
Design of H We imple i.e. in our pr uction Decode essor is shown
Conflict Resol Conflict R uction that wh tional Unit us uction on the r uction then co ment and do ad-1 will exec ing functional
in the Figure
We replica ster-Memory,
of instructions thread for TL
High Perform
All G
PC a
Retu
Flag
Tran trans
emented Two rocessor there er and Confli n in Figure 2.
Fig
lver
Resolver is no hich functiona sed in the nex rest of Functio onflict resolve
not allow its cute normally l units efficien
3.
Table 1. R
ated the Trans Register-I/Os s are related to L related Instru
mance and Low Replicated General Purpos and IR urn Stack predic g Register
nsfer logic to im sfer instructions
Thread proce e is only one ict Resolver
gure 2. Archit
othing but a s al unit will be xt Instruction onal Units. If er issue a sign
s instruction y). The same ntly without an
Representation
Case No OP
1 0
2 0
3 0
4 1
5 1
6 1
sfer Logic (TL etc) .It means o data transfer uctions, so the
w Power Simu e Registers ctor
mplement the da s
essor which c e ALU, one efficiently ut
tecture of Mul
imple decode e used by thr by Thread-1 in the case bo nal Stall=1, w to execute w cycle is repe ny error in th
of conflicting
P12 OP11 OP 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1
L) (which take s each thread r, so by provid e execution wi
ultaneous Mult ata
can run two T Multiplier, o ilizes the Fus
lti-Threaded P
er which ident reads in the n
for Thread-1a oth threads req which will sto while do noth eat again and he output. All t
g cases and th
P10 Conflictin 1 ALU 0 Divider 1 Barrel Shif 0 Multiplier 0 Cache 1 IOB
es care of all d has its own co ding each thre ill be fast and
ti-threaded Pr Shared Cache Memory Execution Uni Instruction De Functional Un IO pins
Threads of a p one Divide, o s. The arhite
Processor
tify by the M next Instructio and allow the quires same fu op the program hing with Thr d again so by the above is s
e respective F
ng FU
fter
data transfer b opy of TL. It i ead its own co
smooth.
rocessor (Kris ry
its ecoder nits
program on o one Barrel sh cture of Mul
MSBs of op-co on. So it will e Thread-2 to functional unit
m counter of read-1(i.e. Ins y this conflict shown clearly
FUs
between Regis is because in a opy of TL it w
shan Arora) one copy of hifter). The lti-Threaded
odes of next reserve the execute its t in the next thread-2 to struction of resolver is by the flow
ster-2, a program will not stall
IJECE 426
5. R consid Proce
Figur
20 40 60 80 100 120 140
E Vol. 3, No.
RESULTS AN A total of dered to be r essors are com
re 4. Area com Regi
0 0 0 0 0 0 0 0
LUT
Single Thread
3, June 2013:
ND ANALYS f three Archit replicated and mpared on Area
mparison in te isters and Gat
Slice Slice F/F
ded Dual-C U C Prog
Co of
Allo bot
423–428 Figure 3. F
SIS
tecture (Viz.
d Multi-Threa a, Power and
erms LUT, Sli e count
e Slice Reg.
G
Core Start
Update Program Counter1 (PC1) &
gram Counter 2(P
ompare MSB [12:
opcode1 & opcod
Is they are same
ow The Instructio th Threads to exe
Flow Chart - C
Single Threa aded, in whic Throughput b
ce, F/F,
GC x 100
&
PC2)
:10]
de2
on on cute
case Yes
Conflict Revol
aded, Multi-C ch partially re bases and resu
Figure
0 5 10 15 20
Leakag
mW
Sing Mu Stall Thread2 (i Instruction on e
stop PC2 to In
Execute Instr Thread
Update P Identify the con
1 2 3 4
lver
Core, in which eplicated and ults are shown
5. Power cons
ge Internal P
gle Threaded lti-Threaded .e. not allow execute and ncrement)
ruction on d1
PC1 nflicting case
6 4 5
ISSN:2
h all the reso d others were
below:
sumption in m
Net Sw ower
Dual Core
2088-8708
ources were shared) of
mW
wiching
Design of High Performance and Low Power Simultaneous Multi-threaded Processor (Krishan Arora) So as we can see from the above graphs that the gate count and power consumption of Dual core processor is just double as that of Single Threaded Processor. But there is lot of area and power saving in case of Multithreaded Processor as compare to Dual Core Processor. Its is mainly because in case of Multi- Threaded the all functional units are not replicated but they are shared, while in case of Dual-Core processor whole the architecture get replicated.
Figure 6. Performance with respect to dependencies among the threads
Here we compared performance of Single-Threaded Processor and Multi-Threaded Processor on the bases of Instructions executed per clock cycle for 50 Instructions program per Thread (i.e. total of 100 instructions). For 0 and 2% conflicts our Multi-Threaded Processor performs 197% compare to Single- Threaded Processor (i.e. nearly like two Single-Threaded Processors). As percentage of conflicts increases the performance of Multi-Threaded processor starts decreasing nearly linearly. Since for 50 instructions per thread the maximum possible conflicts can be 94%. At 94% conflict Multi-Threaded will perform 104.63%.But as we know most of the Instructions in a program is Data-transfer and Jump, so in most of cases the conflict will be will be varies from 30-60% and correspondingly our Multi-Threaded processor performance varies from 156.15-126.88% compare to Single-Threaded Processor.
6. EXPERIMENTATION
The processor is modelled using Verilog HDL and its functionality is verified by using ModelSim-XE III 6.2g. It is then synthesized with Xilinx ISE-9.1i. Finally the design is mapped on Vertex5-XC5VLX110T platform.
7. FUTURE SCOPE
The presented Multi-Threaded processor architecture can be extended to any number of threads by suitably redesign the CR, also replicate transfer logic and CPU Registers as many as threads.The usual limitation on the number of threades is number of functional units used in the design. Incase the number of threads exceeds the number of functional units, the threades has to wait more to get perticular functional unit to execuete its instruction, so the perforemence of Multi-Threaded processor will degrade greatly . But there is also scope we can either go for combination of Multi-Core and Multi-threaded processor, in which there will be multi-cores of processor on single chip and each core will have perticular number of threads.
REFERENCES
[1] Dean M. Tullsen, Susan J Eggers, Henry M Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism.
ICSA'95, Santa Margherita Ligure Italy. 1995: 392–403.
[2] Susan J Eggers, Joel S Emer, Henry M Levy, Jack L Lo, Rebecca L Stamm, Dean M Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. Journal of IEEE Micro. 1997: 12-17.
[3] Haitham Akkary, Michael A Driscoll. A Dynamic Multithreading Processor. 31st Annual ACM-IEEE International Symposium on Microarchitecture. 1998: 226-236.
0 0.2 0.4 0.6 0.8 1 1.2
0 10 20 30 40 50 60 70 80 90 100 Nos. of Instructions executed per clock cycle
% Conflicts
Single-Threaded Multi-Threaded
ISSN:2088-8708
IJECE Vol. 3, No. 3, June 2013: 423–428 428
[4] O’Melveny, Myers LLP, Kayamba, Inc. A Massively Multithreaded Packet Processor. Presented at NP2: Workshop on Network Processors, held in conjunction with The 9th International Symposium on High-Performance Computer Architecture, Anaheim, California. 2003: 1-11.
[5] P Leadbitter, D Page, NP Smart. Nondeterministic Multithreading. IEEE Transactions on Computers. 2007; 56(7):
992-998.
[6] Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu and Pen-Chung Yew. Supporting Speculative Multithreading on Simultaneous Multithreaded Processor. Y Roberts et al. (Eds.): HiPC. 2006: 148-158.
[7] Carlos Madriles, Carlos Garcı´a-Quin˜ones, Jesu´s Sa´nchez, Pedro Marcuello, Dean M. Tullsen. Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices. IEEE Transactions on Parallel and Distributed Syatems. 2008; 19(7): 914-925.
[8] Nicholas Ma, Naraig Manjikian, Subramania Sudharsanan. Modeling and Simulation of Multicore multithreaded Processor Architecture in System C. CCECE. 2008: 1155-1160.
[9] David Burgart. Modeling Multi-Threaded Processors. White paper TeamQuest. 2008: 1-10.
BIOGRAPHIES OF AUTHORS
Krishan Arora is presently working as Assistant Professor in the Department of Electronics and Electrical Engineering in Lovely Professional University, Phagwara (Punjab). He has completed his B.Tech (Electrical & Electronics Engg.) from Punjab Technical University,Punjab and M.Tech (Electrical Engg.) from Punjab Technical University, Punjab. His area of research is Power System, Electrical Machines and Power Electronics.
Paramveer Singh Gill is presently working as Assistant Professor in the Department of Electronics and Electrical Engineering in Lovely Professional University, Phagwara (Punjab).
He has completed his B.Tech (Electronics & Communication Engg.) from Punjab Technical University,Punjab and M.Tech (VLSI Design) from VIT Vellore. His area of research is Processor Design.
Parul Mehra is Phd Scholar from Department of Commerce in CMJ University, Meghalaya. She has completed her Master of Commerce from Hindu College, Amritsar and Bachelors of Education with specialization in Commerce and Economics from Khalsa college of Education Amritsar. She has cleared UGC-NET with specialisation in Commerce in DEC.2011 in First Attempt.