CSEE W4824 Computer Architecture Fall 2012


CSEE W4824 – Computer Architecture

Fall 2012

Luca Carloni

Department of Computer Science Columbia University in the City of New York

http://www.cs.columbia.edu/~cs4824/

Lecture 2

Performance Metrics and Quantitative

Principles of Computer Design

Announcements: CS Distinguished Lecture

Wed, Oct. 12th, 11:00 am - Davis Auditorium

• “What Should a Well-informed Person Know about Computers?”

• Brian Kernighan (Princeton Univ.)

– His book with Dennis Ritchie, the creator of the C programming language, is considered “the bible of C”

– At Bell Labs he contributed to the development of Unix, working with the Unix creators K. Thompson and D. Ritchie

– He is also a coauthor of the widely used AWK and AMPL programming languages, and of the EQN and PIC typesetting languages

– In collaboration with Shen Lin he devised well-known heuristics for two important NP-complete optimization problems:

• graph partitioning


Computer Architects and Quantitative Approach

• Design ideas and trade-offs are tested by using tools in order to estimate the impact on performance, power and cost (an iterative process)

– analytical reasoning and fundamental design principles

– equations for basic metrics

• cost, performance, power…

– simulations at various levels

• system level, ISA, micro-architecture, memory, RTL, gate, circuit level

– benchmark programs representing typical workloads

How to Define Performance?

Airplane          Passenger Capacity   Cruising Range (miles)   Cruising Speed (m.p.h.)   Passenger Throughput (passengers x m.p.h.)
Boeing 777        375                  4630                     610                       228,750
Boeing 747        470                  4150                     610                       286,700
Concorde          132                  4000                     1350                      178,200
Douglas DC-8-50   146                  8720                     544                       79,424


Two Key Performance Metrics

Airplane     DC to Paris   Speed      Passengers   Throughput (passengers x mph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200

• Time to run the task

– execution time, response time, elapsed time, latency

• Tasks per time unit

– execution rate, bandwidth, throughput
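The two metrics can rank the same machines differently. The sketch below recomputes both from the figures in the table above; the dictionary layout and the function name are illustrative choices, not part of the lecture.

```python
# Figures taken from the "Two Key Performance Metrics" table;
# the data layout and names are illustrative assumptions.
planes = {
    "Boeing 747": {"hours_dc_to_paris": 6.5, "speed_mph": 610, "passengers": 470},
    "Concorde": {"hours_dc_to_paris": 3.0, "speed_mph": 1350, "passengers": 132},
}


def throughput(plane):
    """Passenger throughput in passengers x mph."""
    return plane["passengers"] * plane["speed_mph"]


# Latency view: the shortest time to complete one trip wins.
fastest = min(planes, key=lambda name: planes[name]["hours_dc_to_paris"])

# Throughput view: the most passengers x mph wins.
highest_throughput = max(planes, key=lambda name: throughput(planes[name]))

print("lowest latency:    ", fastest)
print("highest throughput:", highest_throughput)
```

Which plane is "better" depends on whether the focus is a single trip or the total number of passengers moved.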

Latency vs. Throughput

• Latency

– “real” time necessary to complete a task

– important when the focus is on a single task

• a computer user who is working with a single application

• a critical task of a real-time embedded system

• Throughput (aka Bandwidth)

– number of tasks completed per unit of time

– a metric independent from the exact number of executed tasks

– important when the focus is on running many tasks

• a manager of a large data-processing center is interested


Latency lags Bandwidth

• Bandwidth has outpaced latency across the main computer technologies

• “There is an old network saying: Bandwidth

problems can be cured with money. Latency problems are harder because the speed of light is fixed—you can’t bribe God.”

[Anonymous]

Latency and Throughput – The Classic 5-Stage Pipeline

• Pipelining

– increases the instruction throughput

• number of instructions completed per unit of time

– but does not reduce (in fact, it usually slightly increases) the execution time of an individual instruction


Performance Metrics

• Machine X is n times faster than machine Y

$$ n = \frac{\text{executionTime}(Y)}{\text{executionTime}(X)} = \frac{\text{performance}(X)}{\text{performance}(Y)} $$

• Performance and execution time are reciprocal

– improve performance → increase performance

– improve execution time → decrease execution time

• Example

– executionTime(Y) = 4.8, executionTime(X) = 3.6

• n = 1.33, i.e. X is 33% faster than Y
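As a quick check of the example above, the ratio can be computed directly. This is a minimal sketch; the function name is an illustrative choice.

```python
def times_faster(exec_time_y, exec_time_x):
    """Return n such that machine X is n times faster than machine Y."""
    return exec_time_y / exec_time_x


n = times_faster(exec_time_y=4.8, exec_time_x=3.6)
print(f"n = {n:.2f}")  # n = 1.33
```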

“Make the Common Case Fast”

• “the most important, pervasive, and simple principle of computer design”

– in making a design trade-off…

• favor the frequent case rather than infrequent case

– when determining how to allocate resources…

• favor the frequent event rather than the rare event

– when optimizing the design of a module…

• target the average functional behavior

• …besides, the frequent case is often simpler

1. How to determine what the frequent case is?

2. How to determine the amount of the possible performance gain in making the frequent case faster?


Simulation and Simulation Levels

• ISA (functional) simulator

– execute program & get ISA-level statistics

• frequency of instructions

• Memory simulator

– ISA simulator is run together with a model of the memory systems

• get cache hit/miss rates, study memory hierarchy options

• Full performance simulator

– adds a detailed performance model to a functional simulator

• model all interactions, stalls, (mis)-speculations

• generate accurate statistics

Simulation Tradeoffs

• ISA simulator

– 10x slower than the real processor

– 10-100x faster than a detailed performance simulator

• Key points

– use the right level of simulation to answer a specific question

• e.g., ISA simulator to get instruction mix statistics

– use fast, idealized models for non-critical components

• e.g., assume a perfect main memory for applications that present an optimal cache hit ratio

– simulation is a powerful tool for architectural explorations, but analytical reasoning should always be applied before starting long simulations


Benchmark Suites

• Sets of programs to simulate typical workloads

• Several types

– real software applications (GCC, Word,…)

• most accurate but typically longer to process

• portability problems (OS/compiler dependencies), GUI

– kernels (Livermore Loops, Linpack,…)

• small, key pieces taken from real programs

• limited picture, but good to isolate the performance of individual features of a machine

– synthetic benchmarks (Whetstone, Dhrystone,…)

• try to match the average frequency of operations on operands of a real program

– may easily mislead compiler and hardware designers

Amdahl’s Law

[Figure: original execution time split into the unimproved part and the improved part, compared with the new execution time of the improved part]

• What is the overall speedup after improving a component x of a system?

$$ \text{speedup} = \frac{\text{originalExecutionTime}}{\text{newExecutionTime}} = \frac{\text{newPerformance}}{\text{originalPerformance}} $$

• If component x is improved by Sx and component x affects a fraction Fx of the overall execution time then

$$ \text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}} $$

Amdahl’s Law - Example

• If we optimize the module for the floating-point instructions by a factor of 2, but the system will normally run programs with only 20% of floating-point instructions, then the speedup is only

$$ \text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}} = \frac{1}{(1 - 0.2) + \frac{0.2}{2}} = \frac{1}{0.9} = 1.111 $$
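The floating-point example can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated on the slide; the function and parameter names are illustrative.

```python
def amdahl_speedup(fraction, component_speedup):
    """Overall speedup when a fraction Fx of the execution time is sped up by Sx."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Floating-point module made 2x faster, used 20% of the time.
print(round(amdahl_speedup(fraction=0.2, component_speedup=2.0), 3))  # 1.111
```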

Amdahl’s Law - Example

[Figure: Speedup vs. Optimized Fraction]

Amdahl’s Law and the Law of Diminishing Returns

• the closer Fx is to 1, the closer the overall speedup is to Sx…

– i.e. [make common case fast]

• however, as Sx → ∞, speedup → 1 / (1 - Fx)

– i.e., once Fx/Sx is small with respect to (1 - Fx) the price/performance ratio falls rapidly as Sx is increased

• the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added
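The diminishing returns can be made concrete by repeatedly doubling Sx for a fixed Fx and watching the extra speedup shrink. The sketch below assumes Fx = 0.5 purely for illustration; the function names are hypothetical.

```python
def amdahl_speedup(fx, sx):
    """Overall speedup when a fraction fx of the time is sped up by sx."""
    return 1.0 / ((1.0 - fx) + fx / sx)


def speedup_limit(fx):
    """Upper bound on the speedup as sx grows without bound (requires fx < 1)."""
    return 1.0 / (1.0 - fx)


fx = 0.5  # illustrative value, not from the lecture
previous = 1.0  # speedup with no improvement at all
for sx in (2, 4, 8, 16, 32):
    current = amdahl_speedup(fx, sx)
    # "gain" is the extra speedup bought by the latest doubling of sx
    print(f"Sx={sx:>2}  speedup={current:.3f}  gain={current - previous:.3f}")
    previous = current

print(f"limit = {speedup_limit(fx):.3f}")
```

Each doubling of Sx buys less than the previous one, and the speedup never reaches the limit 1 / (1 - Fx).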

Amdahl’s Law - Reference

• Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS ‘67

• Amdahl’s Law - special case of parallelization

– if F is the fraction of a calculation that can be parallelized and (1 - F) is the fraction that is sequential (i.e. cannot benefit from parallelization) then Amdahl’s Law gives the maximum speedup that can be achieved by using N processors as

$$ \text{speedup} = \frac{1}{(1 - F) + \frac{F}{N}} $$

• Example

– if F is only 90%, the calculation can be sped up by only a maximum of a factor of 10, no matter how many processors are used

– key to parallel computing is to increase F

• but there is also Gustafson’s Law…
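For the parallel case, a short sketch shows how the speedup with F = 90% approaches, but never reaches, a factor of 10 as N grows. The processor counts are arbitrary illustrative values.

```python
def parallel_speedup(f, n):
    """Amdahl speedup for a parallelizable fraction f run on n processors."""
    return 1.0 / ((1.0 - f) + f / n)


for n in (1, 10, 100, 1000, 1_000_000):
    print(f"N={n:>9}  speedup={parallel_speedup(0.9, n):.3f}")
```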


Principle of Locality

• Temporal Locality

– a resource that is referenced at one point in time will be referenced again sometime in the near future

• Spatial Locality

– the likelihood of referencing a resource is higher if a resource near it was just referenced

• 90/10 Locality Rule of Thumb

– a program spends 90% of its execution time in only 10% of its code

• hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past

• this is a consequence of how we program and how we store data in memory

Principle of Locality - Example

• Cache Memory

– directly exploits temporal locality by providing faster access to a smaller subset of the main memory, which contains a copy of recently used data

– but not all data in the cache are necessarily spatially close in the main memory…

– …still, when a cache miss occurs a fixed-size block of contiguous memory cells is retrieved from the main memory based on the principle of spatial locality
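The effect of both kinds of locality on a cache can be illustrated with a toy simulation. The sketch below assumes a direct-mapped cache with 64 lines and 8-word blocks; the slide does not specify an organization, so these parameters, the access patterns, and the function name are all illustrative.

```python
def hit_rate(addresses, num_blocks=64, block_size=8):
    """Hit rate of a toy direct-mapped cache over a sequence of word addresses."""
    if not addresses:
        return 0.0
    tags = [None] * num_blocks  # memory block currently held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size  # memory block that contains this word
        line = block % num_blocks   # direct-mapped placement
        if tags[line] == block:
            hits += 1
        else:
            tags[line] = block      # miss: the whole contiguous block is fetched
    return hits / len(addresses)


sequential = list(range(4096))               # spatial locality: neighbours follow each other
repeated = [i % 16 for i in range(4096)]     # temporal locality: the same 16 words are reused
strided = [i * 8 * 64 for i in range(4096)]  # no locality: every access maps to line 0 with a new block

for name, trace in (("sequential", sequential), ("repeated", repeated), ("strided", strided)):
    print(f"{name:>10}: hit rate = {hit_rate(trace):.3f}")
```

The sequential trace benefits only from block fetches (spatial locality), the repeated trace from reuse (temporal locality), and the strided trace from neither.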


CPU Time

• CPU Time

– user CPU Time

• spent in the user program

– system CPU Time

• spent in the OS performing tasks required by the program • harder to measure and to compare across architectures

– CPU performance = user CPU time on an unloaded system

– most computers run with a single clock signal (strictly synchronous design) whose discrete time events are called cycles, periods, or ticks

• a µP with a 1ns clock period runs at a clock frequency of 1GHz…

CPU Time = (Clock Cycles for a Program) x (Clock Cycle Time) = (Clock Cycles for a Program) / (Clock Frequency)
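The two equivalent forms of the CPU Time equation can be checked numerically. In this sketch the cycle count is a hypothetical value chosen for illustration, and the function names are not from the lecture.

```python
def clock_frequency_hz(clock_cycle_time_s):
    """Clock frequency is the reciprocal of the clock cycle time."""
    return 1.0 / clock_cycle_time_s


def cpu_time_from_period(clock_cycles, clock_cycle_time_s):
    """CPU Time = clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_time_s


def cpu_time_from_frequency(clock_cycles, frequency_hz):
    """CPU Time = clock cycles / clock frequency."""
    return clock_cycles / frequency_hz


cct = 1e-9              # 1 ns clock period, as in the slide
cycles = 2_000_000_000  # hypothetical cycle count for some program
freq = clock_frequency_hz(cct)

print(f"frequency = {freq / 1e9:.1f} GHz")
print(f"CPU time  = {cpu_time_from_period(cycles, cct):.2f} s")
print(f"CPU time  = {cpu_time_from_frequency(cycles, freq):.2f} s")
```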

CPU Time – Three Main Factors

CPU Time = (Clock Cycles for a Program) x CCT

• IC = instruction count

– number of instructions executed for a program

• CPI = clock cycles per instruction = (Clock Cycles for a Program) / IC

– average number of clock cycles per instruction of a program

– its reciprocal is IPC = instructions per clock cycle

CPU Time = IC x CPI x CCT

• CPU Time equally depends on these three factors

• a 10% improvement in any of these leads to a 10% improvement in CPU time
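Because CPU Time is a plain product of the three factors, reducing any one of them by 10% reduces CPU Time by 10%. The sketch below checks this with hypothetical baseline values (the instruction count in particular is made up for illustration).

```python
def cpu_time(ic, cpi, cct):
    """CPU Time = instruction count x cycles per instruction x clock cycle time."""
    return ic * cpi * cct


# Hypothetical baseline: 1 million instructions, CPI of 4.3, 2 ns clock cycle.
IC, CPI, CCT = 1_000_000, 4.3, 2e-9
base = cpu_time(IC, CPI, CCT)

# Reduce one factor at a time by 10% and compare against the baseline.
improved = {
    "IC": cpu_time(0.9 * IC, CPI, CCT),
    "CPI": cpu_time(IC, 0.9 * CPI, CCT),
    "CCT": cpu_time(IC, CPI, 0.9 * CCT),
}

for factor, t in improved.items():
    print(f"10% lower {factor}: CPU time ratio = {t / base:.2f}")
```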


CPU Time - Dependencies

CPU Time = IC x CPI x CCT

[Table: which of IC, CPI, and CCT are affected by each of Program, Compiler, ISA, HW organization, and HW technology]

• some interdependencies, but many techniques improve a single factor

Improving Performance by Exploiting Parallelism

• at the system level

– use multiple processors, multiple disks

• scalability is key to adaptively distribute workload in server apps

• at the single microprocessor level

– exploit instruction level parallelism (ILP)

• e.g., pipelining overlaps the execution of instructions to reduce the overall program CPU Time

– reduces CPI by overlapping instructions in time

– possible because many subsequent instructions are independent

• e.g. parallel computation

– reduces CPI by overlapping instructions in space

– duplicate hardware modules such as ALUs

• at the circuit level

– carry-lookahead adders speed up sums


CPU Time – broken down per instruction

CPU Time = IC x CPI x CCT

$$ \text{CPI} = \frac{\sum_i (IC_i \times CPI_i)}{IC} = \sum_i (IF_i \times CPI_i) $$

$$ \text{CPU Time} = \sum_i (IC_i \times CPI_i) \times CCT $$

• frequent instructions have larger contributions on CPI

• CPI should be measured to include pipeline/memory effects

– it is not sufficient to calculate it from the reference manual table

• NOTE: it is ok to compare two designs based only on CPI (or IPC) only if IC and CCT are the same!

Example:

Average Instruction Execution Time

• Assuming a simple un-pipelined processor with CCT = 2ns

Operation   IFi   CPIi   IFi x CPIi   (% Time)
ALU         0.5   4      2            46
Load        0.2   5      1            23
Store       0.1   5      0.5          12
Branch      0.2   4      0.8          19

• CPI = Σi (IFi x CPIi) = 4.3

• Average instruction execution time = CPI x CCT = 4.3 x 2ns = 8.6ns
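The weighted-CPI calculation in the table can be reproduced as follows. This is a minimal sketch using the instruction mix from the slide; the variable names are illustrative, and the computed time shares only approximate the rounded (% Time) column.

```python
# Instruction mix from the table: operation -> (frequency IFi, cycles CPIi).
mix = {
    "ALU": (0.5, 4),
    "Load": (0.2, 5),
    "Store": (0.1, 5),
    "Branch": (0.2, 4),
}
CCT_NS = 2.0  # clock cycle time of the un-pipelined processor, in ns

# CPI is the frequency-weighted average of the per-class CPIs.
cpi = sum(freq * cycles for freq, cycles in mix.values())
avg_instruction_time_ns = cpi * CCT_NS

# Share of execution time spent in each instruction class.
time_share = {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

print(f"CPI = {cpi:.1f}")
print(f"average instruction time = {avg_instruction_time_ns:.1f} ns")
for op, share in time_share.items():
    print(f"{op:>6}: {100 * share:.1f}% of time")
```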


Example:

Speedup From 5-stage Pipelining

• Assumption

– after pipelining the slowest stage forces an effective clock period equal to

(CCT + clockOverhead) = (2 + 0.2)ns

• Question

– What is the speedup from pipelining?

$$ \text{speedup} = \frac{(\text{Average Instruction Time})_{\text{unpipelined}}}{(\text{Average Instruction Time})_{\text{pipelined}}} = \frac{8.6}{2.2} = 3.9 $$
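The speedup figure can be recomputed from the earlier numbers. The sketch assumes, as the 2.2 ns denominator implies, that the pipelined processor completes one instruction per effective clock period (an ideal CPI of 1); the function name is illustrative.

```python
def pipelining_speedup(unpipelined_cpi, cct_ns, clock_overhead_ns):
    """Ratio of average instruction times before and after pipelining."""
    unpipelined_time = unpipelined_cpi * cct_ns
    # Assumption: the pipeline completes one instruction per effective clock period.
    pipelined_time = 1.0 * (cct_ns + clock_overhead_ns)
    return unpipelined_time / pipelined_time


speedup = pipelining_speedup(unpipelined_cpi=4.3, cct_ns=2.0, clock_overhead_ns=0.2)
print(f"speedup = {speedup:.1f}")  # speedup = 3.9
```

The speedup stays below the pipeline depth of 5 because the unpipelined CPI is 4.3 rather than 5 and because of the added clock overhead.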

Another Key Metric: Power Dissipation

[Source: K. Asanovic – MIT ]

• Energy

– measured in Joules

• Power

– rate of energy consumption

• [Watts = Joules/sec]

– instantaneous power P = V * I

• voltage drop across a component times the current flowing through it

• Example

– system A

• higher peak power

• lower total energy

– system B

• lower peak power

• higher total energy


Power Consumption of CMOS Transistors

• Dynamic Power

– traditionally dominant component

– dissipated when a transistor switches (i.e. data dependent)

• Static Power

– becoming more important with transistor scaling

– due to “leakage current” that flows even if there is no switching activity

– proportional to the number of transistors on the chip

• Challenges

– power is the key limitation to chip design

• distribute power on-chip

• remove heat

• prevent hot spots

• low power design (clock gating, DVFS)

Example: Dynamic Power Consumption

• Assume a 0.25µm CMOS chip with a voltage supply Vdd=2.5V, clock frequency F=500MHz, and average load capacitance of CL=15fF/gate (assuming a fan-out of 4)

• What is the power consumption per gate?

• Approximately, Pavg=50µW

• For a design with 1 million gates, assuming that a transition occurs at every clock edge, this would result in an average power consumption of ~50W!

• In reality, not all gates on the chip switch at the full rate of 500MHz. The actual activity is substantially lower
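The per-gate figure follows from the usual dynamic power relation P = C x V² x F. The sketch below plugs in the slide's values; the `activity` parameter is an added assumption used to model gates that switch less often than every clock edge, and the 10% value is purely illustrative.

```python
def dynamic_power_w(load_capacitance_f, vdd_v, frequency_hz, activity=1.0):
    """Dynamic power P = activity x C x Vdd^2 x F, in watts.

    `activity` is an assumed scaling factor (1.0 = a transition at every clock edge).
    """
    return activity * load_capacitance_f * vdd_v ** 2 * frequency_hz


per_gate = dynamic_power_w(15e-15, 2.5, 500e6)  # 15 fF, 2.5 V, 500 MHz
chip_full_activity = per_gate * 1_000_000        # 1 million gates
chip_low_activity = dynamic_power_w(15e-15, 2.5, 500e6, activity=0.1) * 1_000_000

print(f"per gate:           {per_gate * 1e6:.1f} uW")
print(f"1M gates, full:     {chip_full_activity:.1f} W")
print(f"1M gates, 10% act.: {chip_low_activity:.1f} W")
```

The exact result is slightly below the rounded 50µW and ~50W quoted on the slide.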


Dynamic Voltage Frequency Scaling

• DVFS is a low-power design technique that is

becoming pervasive in modern processors

• Example:

– If the voltage and frequency of a processing core are both reduced by 15% what would be the impact on dynamic power?

$$ \text{Power Save} = \frac{P_{new}}{P_{old}} = \frac{C \times (V \times 0.85)^2 \times (F \times 0.85)}{C \times V^2 \times F} = 0.85^3 = 0.61 $$

• Pnew is 64% more power efficient than Pold (Pold/Pnew = 1/0.61 ≈ 1.64)
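The DVFS ratio can be checked numerically. This sketch assumes the dynamic power model P = C x V² x F used on the slide, so the capacitance cancels; the function name is illustrative.

```python
def dvfs_power_ratio(voltage_scale, frequency_scale):
    """Pnew / Pold under P = C x V^2 x F when V and F are scaled (C cancels out)."""
    return voltage_scale ** 2 * frequency_scale


ratio = dvfs_power_ratio(voltage_scale=0.85, frequency_scale=0.85)
print(f"Pnew / Pold = {ratio:.2f}")       # Pnew / Pold = 0.61
print(f"power saved = {1 - ratio:.0%}")
```

A 15% cut in both voltage and frequency saves well over a third of the dynamic power because voltage enters the formula squared.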

Assigned Readings

• Computer Architecture: A Quantitative Approach, by John Hennessy (Stanford University) and Dave Patterson (UC Berkeley), Fifth Edition, 2012, Morgan Kaufmann (Elsevier)

• Read Sections 1.8-1.12