
ECE 2300 Digital Logic & Computer Organization. Advanced Topics


Fall 2016

Advanced Topics ECE 2300

Digital Logic & Computer Organization


Lecture 27:

Parallelism: Making our Processor Fast

• Processor architects improve performance through hardware that exploits the different types of parallelism within computer programs

• Instruction-Level Parallelism (ILP)

– Parallelism within a sequential program

• Thread-Level Parallelism (TLP)

– Parallelism among different threads

• Data-Level Parallelism (DLP)

– Parallelism among the data within a sequential program


Instruction-Level Parallelism (ILP)

• Refers to the parallelism found within a sequential program

• Consider the ILP in this program segment

ADD R1,R2,R3
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3
LW R2,0(R3)

• Superscalar pipelines exploit ILP by duplicating the pipeline hardware
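A minimal Python sketch, not part of the original slides, that checks the program segment above for read-after-write (RAW) dependences. The tuple encoding and the helper name `raw_hazards` are assumptions made for illustration. The check ignores write-after-read ordering (for example, LW writing R2 after earlier instructions have read it), which an in-order pipeline preserves naturally.

```python
# Each instruction: (mnemonic, destination register, source registers).
# The register lists are transcribed from the six-instruction segment.
PROGRAM = [
    ("ADD",  "R1", ("R2", "R3")),
    ("OR",   "R4", ("R4", "R3")),
    ("SUB",  "R5", ("R2", "R3")),
    ("AND",  "R6", ("R6", "R2")),
    ("ADDI", "R7", ("R7",)),
    ("LW",   "R2", ("R3",)),
]


def raw_hazards(program):
    """Return (producer, consumer) index pairs where a later instruction
    reads a register that an earlier instruction writes."""
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        for j in range(i + 1, len(program)):
            if dest in program[j][2]:
                hazards.append((i, j))
            if program[j][1] == dest:
                break  # dest is overwritten; later readers depend on j instead
    return hazards


print(raw_hazards(PROGRAM))  # [] : no instruction waits on an earlier result
```

An empty list means that, as far as RAW dependences go, any two of these instructions could occupy the two pipelines in the same cycle.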

Two-Way Superscalar Pipeline

[Figure: two-way superscalar pipeline. Stages IF, ID, EX, MEM, WB, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; components IM, Reg, two ALUs, DM, Reg.]


Instruction Sequence on 2W SS

[Figure sequence, one slide per cycle: the instructions below enter the two-way pipeline (IM, Reg, ALU, DM, Reg) in program-order pairs (ADD and OR, then SUB and AND, then ADDI and LW) and advance one stage per cycle until all six have drained.]

ADD R1,R2,R3
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3
LW R2,0(R3)


Limitations Due to Data Dependences

• Consider this program sequence

ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R2,R3
AND R6,R6,R5

• The ADD and the OR, and the SUB and the AND, cannot execute at the same time

– OR reads R1, which ADD writes; AND reads R5, which SUB writes

– Hardware would detect these dependences and issue the instructions (send them to EX) one by one

• Addressed by hardware that performs out-of-order execution
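A small Python sketch of the in-order behavior described on this slide. It is a simplified model, not the course's hardware: it assumes instructions arrive in aligned pairs and that a pair issues together only when the second instruction does not read the first one's destination. The name `in_order_issue_schedule` and the tuple encoding are hypothetical.

```python
# (mnemonic, destination register, source registers)
DEPENDENT = [
    ("ADD", "R1", ("R2", "R3")),
    ("OR",  "R4", ("R1", "R3")),
    ("SUB", "R5", ("R2", "R3")),
    ("AND", "R6", ("R6", "R5")),
]


def in_order_issue_schedule(program):
    """Group a program into issue cycles for a 2-wide in-order pipeline.

    Assumption: aligned pairs; a pair is split into two cycles when the
    second instruction reads the register the first one writes.
    """
    schedule = []
    for k in range(0, len(program), 2):
        pair = program[k:k + 2]
        if len(pair) == 2 and pair[0][1] in pair[1][2]:
            schedule.append([pair[0][0]])
            schedule.append([pair[1][0]])
        else:
            schedule.append([ins[0] for ins in pair])
    return schedule


print(in_order_issue_schedule(DEPENDENT))
# [['ADD'], ['OR'], ['SUB'], ['AND']]
```

Under this model the four instructions take four issue cycles, so the second pipeline does no useful work.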


Out-of-Order Execution

• Processor hardware executes instructions out of the original program order

• Key component is an issue queue that tracks the availability of source operands

ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R2,R3
AND R6,R6,R5


Issue Queue

• Hardware structure where instructions wait until source operands can be bypassed

[Figure: pipeline diagram showing IF, ID, the Issue Queue, the Reg File, EX, and MEM.]


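Issue Queue Structure (Single Issue)

[Figure: each issue queue entry holds the RT and RS fields, each with a Rdy bit and =? comparators, plus the RD, FS, and IMM fields. The comparators match against the RD of the issued instruction. Selection logic takes the Rdy bits from all queue entries and picks the selected queue entry, which is sent to EX (if selected) together with [RT] and [RS].]

ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R2,R3
AND R6,R6,R5
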

Example Dual Issue Queue Operation

ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R2,R3
AND R6,R6,R5

Cycle 1

Src1  Rdy  Src2  Rdy  RD  FS   IMM  Action
2     Y    3     Y    1   ADD  X    to EX with [R2], [R3]
1     N    3     Y    4   OR   X    waits; holds [R3]
2     Y    3     Y    5   SUB  X    to EX with [R2], [R3]
6     Y    5     N    6   AND  X    waits; holds [R6]

Cycle 2

Src1  Rdy  Src2  Rdy  RD  FS   IMM  Action
1     Y    3     Y    4   OR   X    to EX with [R3]
6     Y    5     Y    6   AND  X    to EX with [R6]
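The wakeup-and-select behavior in this example can be approximated with a short Python sketch. It is an illustrative model under stated assumptions, not the slide's circuit: a source counts as ready when no older instruction still waiting in the queue writes it, results are bypassed to instructions issued in the following cycle, and only read-after-write dependences are tracked. The name `out_of_order_schedule` is hypothetical.

```python
# (mnemonic, destination register, source registers)
PROGRAM = [
    ("ADD", "R1", ("R2", "R3")),
    ("OR",  "R4", ("R1", "R3")),
    ("SUB", "R5", ("R2", "R3")),
    ("AND", "R6", ("R6", "R5")),
]


def out_of_order_schedule(program, width=2):
    """Return the mnemonics issued in each cycle by a `width`-wide issue queue."""
    queue = list(program)  # oldest entry first
    schedule = []
    while queue:
        selected = []  # indices of the entries picked this cycle
        for i, (_, _, srcs) in enumerate(queue):
            # Registers that older, not yet completed entries will write.
            waiting_on = {dest for _, dest, _ in queue[:i]}
            if len(selected) < width and waiting_on.isdisjoint(srcs):
                selected.append(i)
        schedule.append([queue[i][0] for i in selected])
        queue = [ins for i, ins in enumerate(queue) if i not in selected]
    return schedule


print(out_of_order_schedule(PROGRAM))
# [['ADD', 'SUB'], ['OR', 'AND']]
```

The oldest entry never waits on anything in this model, so at least one instruction issues per cycle and the loop always terminates.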


A More Troublesome Piece of Code

• Now consider this program sequence

Loop  ADD R1,R2,R3
      OR R4,R1,R3
      SUB R5,R4,R3
      AND R1,R6,R5
      BEQ R2,R1,Loop

• Superscalar pipeline would send instructions one by one through EX, MEM, and WB

– 1 ALU, 1 memory port, and 1 RF port would sit idle, perhaps through 1000 loop iterations!

• But what if we had a second program to run?
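One way to quantify why this loop is troublesome is to compare the instruction count with the length of the longest dependence chain. The Python sketch below is an illustration under assumptions: it tracks only read-after-write dependences within one loop iteration, and the names `dataflow_depth` and `available_ilp` are hypothetical.

```python
# (mnemonic, destination register or None, source registers)
LOOP_BODY = [
    ("ADD", "R1", ("R2", "R3")),
    ("OR",  "R4", ("R1", "R3")),
    ("SUB", "R5", ("R4", "R3")),
    ("AND", "R1", ("R6", "R5")),
    ("BEQ", None, ("R2", "R1")),
]

INDEPENDENT = [
    ("ADD",  "R1", ("R2", "R3")),
    ("OR",   "R4", ("R4", "R3")),
    ("SUB",  "R5", ("R2", "R3")),
    ("AND",  "R6", ("R6", "R2")),
    ("ADDI", "R7", ("R7",)),
    ("LW",   "R2", ("R3",)),
]


def dataflow_depth(program):
    """Length of the longest chain of read-after-write dependences."""
    writer_depth = {}  # register -> chain depth of its most recent writer
    longest = 0
    for _, dest, srcs in program:
        depth = 1 + max((writer_depth.get(r, 0) for r in srcs), default=0)
        if dest is not None:
            writer_depth[dest] = depth
        longest = max(longest, depth)
    return longest


def available_ilp(program):
    """Instructions per step if every dependence chain ran in parallel."""
    return len(program) / dataflow_depth(program)


print(available_ilp(INDEPENDENT))  # 6.0
print(available_ilp(LOOP_BODY))    # 1.0
```

A value of 1.0 means each instruction needs the result of the one before it, so neither a wider pipeline nor out-of-order issue can overlap them.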


Multiprogramming and Multithreading

• Multiprogramming is the sharing of the computer hardware among multiple active programs

• Programs may be time multiplexed, in which they take turns using the processor

• Or they can use the hardware at the same time!

– Using multiple cores (next time)

– Using the same core at the same time

• For multiple simultaneously running programs, or multiple threads of the same program


Thread-Level Parallelism (TLP)

• Refers to the parallelism among different threads

– A thread is a path of execution within a program

– Each uses its own registers but they share memory

• Consider two threads that we want to run

Thread 1

Loop  ADD R1,R2,R3
      OR R4,R1,R3
      SUB R5,R4,R3
      AND R1,R6,R5
      BEQ R2,R1,Loop

Thread 2

LW R7,0(R1)
ADD R4,R7,R2
SUBI R5,R4,1
SW R5,0(R1)
ADDI R1,R1,1

• We can run them on separate cores (later), or create a superscalar pipeline that can run them both at the same time


Two-Way Multithreaded Pipeline

[Figure: two-way multithreaded pipeline. Stages IF, ID, EX, MEM, WB with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; two PCs, IM, per-thread register files, two ALUs, and DM.]

• Two threads share many of the pipeline resources

• Each thread has its own PC and registers

• This is one example of multithreading, called Simultaneous Multithreading (SMT)


Example Multithreaded Operation

[Figure sequence, one slide per cycle: each cycle the pipeline takes one instruction from each thread (ADD with ADDI, then OR with ADD, then SUB with SUBI), and the pairs advance one stage per cycle until they drain.]

Thread 1

ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R4,R3

Thread 2

ADDI R7,R1,0
ADD R4,R7,R2
SUBI R5,R4,1

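A minimal Python sketch of the interleaving in this example. It assumes, for illustration only, that each thread issues at most one instruction per cycle in program order (each instruction depends on the previous one in its own thread) and that the pipeline accepts up to two instructions per cycle. The name `smt_schedule` is hypothetical.

```python
THREAD_1 = ["ADD R1,R2,R3", "OR R4,R1,R3", "SUB R5,R4,R3"]
THREAD_2 = ["ADDI R7,R1,0", "ADD R4,R7,R2", "SUBI R5,R4,1"]


def smt_schedule(threads, width=2):
    """Return, per cycle, the (thread number, instruction) pairs issued.

    Assumption: at most one instruction per thread per cycle, in order,
    and at most `width` instructions per cycle in total.
    """
    next_index = [0] * len(threads)
    schedule = []
    while any(next_index[t] < len(threads[t]) for t in range(len(threads))):
        cycle = []
        for t, thread in enumerate(threads):
            if len(cycle) < width and next_index[t] < len(thread):
                cycle.append((t + 1, thread[next_index[t]]))
                next_index[t] += 1
        schedule.append(cycle)
    return schedule


together = smt_schedule([THREAD_1, THREAD_2])
one_after_another = len(smt_schedule([THREAD_1])) + len(smt_schedule([THREAD_2]))
print(len(together))        # 3
print(one_after_another)    # 6
```

Because the two threads use separate registers, the second issue slot that a single dependent thread leaves empty is filled by the other thread.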

Data-Level Parallelism

• Consider the following C code

char a[4], b[4], c[4];

for (int i = 0; i < 4; i++)
    a[i] = b[i] + c[i];

• Arrays a, b, and c contain four 8-bit elements

– a[0], a[1], a[2], a[3] for array a

• Same operation is done for each data element

• Can replace the 4 add operations in the loop above by 1 SIMD add instruction


SIMD Instructions

• Single Instruction, Multiple Data

• Special instructions for vector data (arrays)

• Identical operation is performed on each of the corresponding data elements

• Data elements are stored contiguously

– 1 load (store) can read (write) all the elements at once

– Register file is wide enough to hold all the elements in one register


Implementing the for Loop Using SIMD

char a[4], b[4], c[4];

for (int i = 0; i < 4; i++)
    a[i] = b[i] + c[i];

; load b and c from memory

LW R0, 0(R4) ; R4 points to b

LW R1, 0(R5) ; R5 points to c

; vector add

ADD.V R2, R0, R1 ; 1 inst does 4 adds!

; store result

SW R2, 0(R3) ; R3 points to a

Assumes 32-bit rather than 16-bit registers,
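The effect of the four-instruction sequence can be imitated in Python by packing four bytes into one 32-bit integer. This is a sketch under assumptions: little-endian byte order, unsigned 8-bit elements that wrap on overflow, and a software masking trick standing in for the vector add. The name `add_v` and the sample data are hypothetical.

```python
import struct

LANE_HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LANE_LOW_BITS = 0x7F7F7F7F   # low 7 bits of each 8-bit lane


def add_v(x, y):
    """Add four packed 8-bit lanes; no carry crosses a lane boundary."""
    low = (x & LANE_LOW_BITS) + (y & LANE_LOW_BITS)  # cannot overflow a lane
    # Restore each lane's top bit: sum bit 7 = x7 XOR y7 XOR carry into bit 7.
    return (low ^ ((x ^ y) & LANE_HIGH_BITS)) & 0xFFFFFFFF


b = bytes([1, 2, 250, 127])
c = bytes([10, 20, 10, 1])

r0 = struct.unpack("<I", b)[0]   # one load reads all of b (like LW R0, 0(R4))
r1 = struct.unpack("<I", c)[0]   # one load reads all of c (like LW R1, 0(R5))
r2 = add_v(r0, r1)               # one add handles four elements (like ADD.V)
a = struct.pack("<I", r2)        # one store writes all of a (like SW R2, 0(R3))

print(list(a))  # [11, 22, 4, 128]
```

The third element shows the lane isolation: 250 + 10 wraps to 4 inside its own byte without disturbing the neighboring element.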


Adding Support for ADD.V to the ALU

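[Figure: the five-stage pipeline (IF, ID, EX, MEM, WB, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) in which the ALU is built from 8-bit ALU slices joined by three Carry Logic (CL) blocks.]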

Carry Logic propagates carry for ADD but not for ADD.V
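The role of the carry logic can be modeled in a few lines of Python. This is a behavioral sketch, assuming a 32-bit ALU made of four 8-bit slices; the function name `alu_add` and the `vector` flag are hypothetical stand-ins for the control signal that distinguishes ADD from ADD.V.

```python
def alu_add(x, y, vector):
    """32-bit add built from four 8-bit slices.

    The carry logic between slices passes the carry along for ADD
    (vector=False) and forces it to 0 for ADD.V (vector=True).
    """
    result = 0
    carry = 0
    for lane in range(4):
        shift = 8 * lane
        s = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF) + carry
        result |= (s & 0xFF) << shift
        carry = 0 if vector else s >> 8  # carry out of the last slice is dropped
    return result


print(hex(alu_add(0x000000FF, 0x00000001, vector=False)))  # 0x100
print(hex(alu_add(0x000000FF, 0x00000001, vector=True)))   # 0x0
```

The same adder slices serve both instructions; only the gating of the carry between slices differs.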


Example ARM SIMD Instruction

VADD.I16 Q0, Q1, Q2
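For reference, this instruction adds the eight 16-bit integer lanes of Q1 and Q2 and writes the lane-wise sums to Q0, with each lane wrapping on overflow. The Python sketch below models that behavior on plain lists; the function name `vadd_i16` and the sample values are hypothetical.

```python
def vadd_i16(qn, qm):
    """Lane-wise add of two 128-bit registers, each given as eight 16-bit lanes."""
    assert len(qn) == len(qm) == 8
    return [(x + y) & 0xFFFF for x, y in zip(qn, qm)]


q1 = [1, 2, 3, 4, 5, 6, 7, 0xFFFF]
q2 = [10, 20, 30, 40, 50, 60, 70, 1]
q0 = vadd_i16(q1, q2)
print(q0)  # [11, 22, 33, 44, 55, 66, 77, 0]
```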


ILP, TLP, and DLP

• Many processors exploit all three

– Best performance/watt achieved with each in moderation rather than one/two to the extreme

• ILP

– Typically a 2- to 6-way superscalar pipeline

– Performance improvement tapers off with wider pipelines while power may increase significantly

• TLP

– Support for 2 threads may require a small amount of additional hardware over a single-threaded SS pipeline

– May improve HW efficiency compared to SS alone

• DLP (via SIMD)

– Many applications (graphics, video and audio processing, etc.) make this worthwhile


Next Time

More Advanced Topics Case Study
