CSEE W4824 – Computer Architecture
Fall 2012
Luca Carloni
Department of Computer Science Columbia University in the City of New York
http://www.cs.columbia.edu/~cs4824/
Lecture 2
Performance Metrics and Quantitative
Principles of Computer Design
Announcements: CS Distinguished Lecture
Wed, Oct. 12th, 11:00 am - Davis Auditorium
• “What Should a Well-informed Person Know about Computers?”
• Brian Kernighan (Princeton Univ.)
– His book with Dennis Ritchie, the creator of the C programming language, is considered “the bible of C”
– At Bell Labs contributed to the development of Unix working with the Unix creators K. Thompson and D. Ritchie
– He is also a coauthor of the widely used AWK and AMPL programming languages, and of the EQN and PIC typesetting languages
– In collaboration with Shen Lin he devised well-known heuristics for two important NP-complete optimization problems:
• graph partitioning
CSEE 4824 – Fall 2012 - Lecture 2 Page 5 Luca Carloni – Columbia University
Computer Architects and
Quantitative Approach
• Design ideas and trade-offs are tested by using
tools in order to estimate the impact on
performance, power and cost (an iterative process)
– analytical reasoning and fundamental design principles – equations for basic metrics
• cost, performance, power…
– simulations at various levels
• system level, ISA, micro-architecture, memory , RTL, gate, circuit level
– benchmark programs representing typical workloads
How to Define Performance?
Airplane          Passenger  Cruising Range  Cruising Speed  Passenger Throughput
                  Capacity   (miles)         (m.p.h.)        (passengers x m.p.h.)
Boeing 777        375        4630            610             228,750
Boeing 747        470        4150            610             286,700
Concorde          132        4000            1350            178,200
Douglas DC-8-50   146        8720            544             79,424
Two Key Performance Metrics
Airplane     DC to Paris  Speed     Passengers  Throughput (passengers x mph)
Boeing 747   6.5 hours    610 mph   470         286,700
Concorde     3 hours      1350 mph  132         178,200
• Time to run the task
– execution time, response time, elapsed time, latency
• Tasks per time unit
– execution rate, bandwidth, throughput
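The two metrics can be checked against the airplane figures above with a minimal Python sketch. The dictionary layout and the function name are illustrative assumptions, not part of the lecture.

```python
# Latency vs. throughput for the DC-to-Paris example.
# The numbers come from the slide; the names are illustrative.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def passenger_throughput(plane):
    """Passengers x mph, the throughput metric used on the slide."""
    return plane["passengers"] * plane["mph"]


# Lowest latency: the plane with the shortest trip time.
fastest = min(planes, key=lambda name: planes[name]["hours"])
# Highest throughput: the plane that moves the most passenger-miles per hour.
highest = max(planes, key=lambda name: passenger_throughput(planes[name]))

print("lowest latency:", fastest)
print("highest throughput:", highest)
```

The Concorde wins on latency while the Boeing 747 wins on throughput, which is why the two metrics have to be kept apart.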
Latency vs. Throughput
• Latency
– “real” time necessary to complete a task
– important when the focus is on a single task
• a computer user who is working with a single application
• a critical task of a real-time embedded system
• Throughput (aka Bandwidth)
– number of tasks completed per unit of time
– a metric independent from the exact number of executed tasks
– important when the focus is on running many tasks
• a manager of a large data-processing center is interested
Latency lags Bandwidth
• Bandwidth has outpaced latency across the main computer technologies
• “There is an old network saying: Bandwidth
problems can be cured with money. Latency problems are harder because the speed of light is fixed—you can’t bribe God.”
[Anonymous]
Latency and Throughput –
The Classic 5-Stage Pipeline
• Pipelining
– increases the instruction throughput
• number of instructions completed per unit of time
– but does not reduce (in fact, it usually slightly increases) the execution time of an individual instruction
Performance Metrics
• Machine X is n times faster than machine Y
$$n = \frac{\text{executionTime}(Y)}{\text{executionTime}(X)} = \frac{\text{performance}(X)}{\text{performance}(Y)}$$
• Performance and execution time are reciprocal
– to improve performance means to increase performance
– to improve execution time means to decrease execution time
• Example
– executionTime(Y) = 4.8, executionTime(X) = 3.6
• n = 4.8 / 3.6 = 1.33, i.e. X is 33% faster than Y
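The ratio in the example can be reproduced with a short Python sketch. The function name `times_faster` is a hypothetical helper chosen for illustration.

```python
def times_faster(exec_time_y, exec_time_x):
    """Return n such that machine X is n times faster than machine Y."""
    return exec_time_y / exec_time_x


# Execution times from the example (units cancel out in the ratio).
n = times_faster(4.8, 3.6)
print(f"n = {n:.2f}")  # n = 1.33
```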
“Make the Common Case Fast”
• “the most important, pervasive, and simple
principle of computer design”
– in making a design trade-off…
• favor the frequent case rather than infrequent case
– when determining how to allocate resources…
• favor the frequent event rather than the rare event
– when optimizing the design of a module…
• target the average functional behavior
• …besides, the frequent case is often simpler
1. How to determine what the frequent case is?
2. How to determine the amount of the possible
performance gain in making the frequent case
faster ?
Simulation and Simulation Levels
• ISA (functional) simulator
– execute program & get ISA-level statistics
• frequency of instructions
• Memory simulator
– ISA simulator is run together with a model of the memory systems
• get cache hit/miss rates, study memory hierarchy options
• Full performance simulator
– adds a detailed performance model to a functional simulator
• model all interactions, stalls, (mis)-speculations
• generate accurate statistics
Simulation Tradeoffs
• ISA simulator
– 10x slower than the real processor
– 10-100x faster than a detailed performance simulator
• Key points
– use the right level of simulation to answer a specific question
• e.g., ISA simulator to get instruction mix statistics
– use fast, idealized models for non-critical components
• e.g., assume a perfect main memory for applications that present an optimal cache hit ratio
– simulation is a powerful tool for architectural
explorations, but analytical reasoning should always be applied before starting long simulations
Benchmark Suites
• Sets of programs to simulate typical workloads
• Several types
– real software applications (GCC, Word,…)
• most accurate but typically longer to process
• portability problems (OS/compiler dependencies), GUI
– kernels (Livermore Loops, Linpack,…)
• small, key pieces taken from real programs
• limited picture, but good to isolate the performance of individual features of a machine
– synthetic benchmarks (Whetstone, Dhrystone,…)
• try to match the average frequency of operations on operands of a real program
– may easily mislead compiler and hardware designers
Amdahl’s Law
• What is the overall speedup after improving a component x of a system?
$$\text{speedup} = \frac{\text{originalExecutionTime}}{\text{newExecutionTime}} = \frac{\text{newPerformance}}{\text{originalPerformance}}$$
• If component x is improved by a factor Sx and component x affects a fraction Fx of the overall execution time, then
$$\text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}}$$
– (1 – Fx) is the original execution time of the unimproved part
– Fx / Sx is the new execution time of the improved part
Amdahl’s Law - Example
• If we optimize the module for the floating-point instructions by a factor of 2 (Sx = 2), but the system will normally run programs with only 20% of floating-point instructions (Fx = 0.2), then the speedup is only
$$\text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}} = \frac{1}{(1 - 0.2) + \frac{0.2}{2}} = \frac{1}{0.9} = 1.111$$
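The floating-point example can be checked with a minimal Python sketch of Amdahl's Law. The function and parameter names are illustrative.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Floating-point module: Fx = 0.2, Sx = 2 (values from the example).
speedup = amdahl_speedup(0.2, 2.0)
print(f"speedup = {speedup:.3f}")  # speedup = 1.111
```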
Amdahl’s Law - Example
[Figure: Speedup vs. Optimized Fraction]
Amdahl’s Law and the Law of Diminishing Returns
• the closer Fx is to 1, the closer the overall speedup is to Sx…
– i.e. [make common case fast]
• however, as Sx → ∞, speedup → 1 / (1 - Fx)
– i.e., once Fx/Sx is small with respect to (1-Fx) the price/performance ratio falls rapidly as Sx is increased
• the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added
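The diminishing returns can be made visible with a small Python sketch that doubles Sx repeatedly for a fixed Fx. The choice Fx = 0.2 reuses the floating-point example, and the list of Sx values is an illustrative assumption.

```python
def amdahl_speedup(fx, sx):
    """Overall speedup when a fraction fx of the time is sped up by sx."""
    return 1.0 / ((1.0 - fx) + fx / sx)


fx = 0.2                     # fraction affected by the improvement
factors = [2, 4, 8, 16, 32]  # illustrative values of Sx
speedups = [amdahl_speedup(fx, sx) for sx in factors]
# Extra speedup bought by each doubling of Sx.
gains = [b - a for a, b in zip(speedups, speedups[1:])]
# Upper bound reached as Sx grows without limit.
limit = 1.0 / (1.0 - fx)

for sx, s in zip(factors, speedups):
    print(f"Sx = {sx:2d}  speedup = {s:.3f}")
print(f"limit as Sx grows: {limit:.3f}")
```

Each doubling of Sx buys less than the previous one, and no value of Sx pushes the overall speedup past 1 / (1 - Fx).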
Amdahl’s Law - Reference
• Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS ’67
• Amdahl’s Law - special case of parallelization
– if F is the fraction of a calculation that can be parallelized and (1-F) is the fraction that is sequential (i.e. cannot benefit from parallelization) then Amdahl’s Law gives the maximum speedup that can be achieved by using N processors as
$$\text{speedup} = \frac{1}{(1 - F) + \frac{F}{N}}$$
• Example
– if F is only 90%, the calculation can be sped up by at most a factor of 10, no matter how many processors are used
– the key to parallel computing is to increase F
• but there is also Gustafson’s Law…
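The parallel form of the law and the factor-of-10 bound for F = 90% can be sketched in Python as follows. The function name and the processor counts are illustrative.

```python
def parallel_speedup(f, n):
    """Amdahl speedup with parallel fraction f running on n processors."""
    return 1.0 / ((1.0 - f) + f / n)


f = 0.9  # parallelizable fraction from the example
for n in (1, 10, 100, 1000):
    print(f"N = {n:4d}  speedup = {parallel_speedup(f, n):.2f}")

# Bound reached as N grows without limit: only the sequential part remains.
max_speedup = 1.0 / (1.0 - f)
print(f"upper bound = {max_speedup:.1f}")
```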
Principle of Locality
• Temporal Locality
– a resource that is referenced at one point in time will be referenced again sometime in the near future
• Spatial Locality
– the likelihood of referencing a resource is higher if a resource near it was just referenced
• 90/10 Locality Rule of Thumb
– a program spends 90% of its execution time in only
10% of its code
• hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past
• this is a consequence of how we program and we store the data in the memory
Principle of Locality - Example
• Cache Memory
– directly exploits temporal locality by providing faster access to a smaller subset of the main memory, which contains a copy of recently used data
– but not all data in the cache are necessarily spatially close in the main memory…
– …still, when a cache miss occurs a fixed-size block of contiguous memory cells is retrieved from the main memory based on the principle of spatial locality
CPU Time
• CPU Time
– user CPU Time
• spent in the user program
– system CPU Time
• spent in the OS performing tasks required by the program
• harder to measure and to compare across architectures
– CPU performance = user CPU time on an unloaded system
– most computers run with a single clock signal (strictly synchronous design) whose discrete time events are called cycles, periods, or ticks
• a µP with a 1ns clock period runs at 1GHz of clock frequency…
CPU Time = (Clock Cycles for a Program) x (Clock Cycle Time) =
= (Clock Cycles for a Program) / (Clock Frequency)
CPU Time – Three Main Factors
CPU Time = (Clock Cycles for a Program) x CCT
• CCT = clock cycle time
• IC = instruction count
– number of instructions executed for a program
• CPI = clock cycles per instruction = (Clock Cycles for a Program) / IC
– average number of clock cycles per instruction of a program
– its reciprocal is IPC = instructions per clock cycle
CPU Time = IC x CPI x CCT
• CPU Time equally depends on these three factors
• a 10% improvement in any of these leads to a 10% improvement in CPU time
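The equal weight of the three factors can be illustrated with a minimal Python sketch. The instruction count, CPI, and CCT values below are hypothetical and only serve to exercise the formula.

```python
def cpu_time_ns(ic, cpi, cct_ns):
    """CPU Time = IC x CPI x CCT, with CCT given in nanoseconds."""
    return ic * cpi * cct_ns


# Hypothetical baseline: 1 million instructions, CPI of 4.3, 2 ns clock.
ic, cpi, cct_ns = 1_000_000, 4.3, 2.0
base = cpu_time_ns(ic, cpi, cct_ns)

# Reduce each factor by 10% in turn and compare against the baseline.
ratios = {
    "IC": cpu_time_ns(ic * 0.9, cpi, cct_ns) / base,
    "CPI": cpu_time_ns(ic, cpi * 0.9, cct_ns) / base,
    "CCT": cpu_time_ns(ic, cpi, cct_ns * 0.9) / base,
}
for factor, ratio in ratios.items():
    print(f"10% lower {factor}: CPU time x {ratio:.2f}")
```

Because the three factors are multiplied, lowering any one of them by 10% lowers CPU time by the same 10%, provided the other two stay unchanged.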
CPU Time - Dependencies
CPU Time = IC x CPI x CCT
[Table: IC, CPI, and CCT (columns) versus Program, Compiler, ISA, HW organization, and HW technology (rows)]
• some interdependencies, but many techniques improve a single factor
Improving Performance by Exploiting
Parallelism
• at the system level
– use multiple processors, multiple disks
• scalability is key to adaptively distribute workload in server apps
• at the single microprocessor level
– exploit instruction level parallelism (ILP)
• e.g., pipelining overlaps the execution of instructions to reduce the overall program CPU Time
– reduces CPI by overlapping instructions in time
– possible because many subsequent instructions are independent
• e.g. parallel computation
– reduces CPI by overlapping instructions in space
– duplicate hardware modules such as ALUs
• at the circuit level
– carry-lookahead adders speed-up sums
CPU Time – broken down per instruction
CPU Time = IC x CPI x CCT
$$\text{CPI} = \frac{\sum_i (IC_i \times CPI_i)}{IC} = \sum_i (IF_i \times CPI_i)$$
where IC_i and CPI_i are the instruction count and the CPI of instruction class i, and IF_i = IC_i / IC is its frequency
• frequent instructions have larger contributions on CPI
• CPI should be measured to include pipeline/memory effects
– it is not sufficient to calculate it from the reference manual table
• NOTE: it is ok to compare two designs based on CPI (or IPC) only if IC and CCT are the same!
$$\text{CPU Time} = \sum_i (IC_i \times CPI_i) \times CCT$$
Example:
Average Instruction Execution Time
• Assuming a simple un-pipelined processor with CCT = 2ns
Operation  IFi   CPIi  IFi x CPIi  (% Time)
ALU        0.5   4     2.0         46
Load       0.2   5     1.0         23
Store      0.1   5     0.5         12
Branch     0.2   4     0.8         19
$$\text{CPI} = \sum_i (IF_i \times CPI_i) = 4.3$$
• Average instruction execution time = CPI x CCT = 4.3 x 2ns = 8.6ns
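The weighted CPI in the table can be recomputed with a short Python sketch. The dictionary `mix` simply encodes the table rows, and its name is an illustrative choice.

```python
# Instruction mix from the table: operation -> (frequency IF_i, CPI_i).
mix = {
    "ALU": (0.5, 4),
    "Load": (0.2, 5),
    "Store": (0.1, 5),
    "Branch": (0.2, 4),
}
cct_ns = 2.0  # clock cycle time of the un-pipelined processor

# CPI is the frequency-weighted sum of the per-class CPIs.
cpi = sum(freq * cycles for freq, cycles in mix.values())
avg_instr_time_ns = cpi * cct_ns

for op, (freq, cycles) in mix.items():
    share = 100.0 * freq * cycles / cpi
    print(f"{op:6s} contributes {freq * cycles:.1f} cycles ({share:.1f}% of time)")
print(f"CPI = {cpi:.1f}, average instruction time = {avg_instr_time_ns:.1f} ns")
```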
Example:
Speedup From 5-stage Pipelining
• Assumption
– after pipelining the slowest stage forces an effective clock period equal to
(CCT + clockOverhead) = (2 + 0.2)ns
• Question
– What is the speedup from pipelining?
$$\text{speedup} = \frac{(\text{Average Instruction Time})_{\text{unpipelined}}}{(\text{Average Instruction Time})_{\text{pipelined}}} = \frac{8.6}{2.2} = 3.9$$
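The pipelining speedup follows directly from the two average instruction times, as the Python sketch below shows. It assumes, as the slide's 2.2 ns figure implies, that the pipelined processor ideally completes one instruction per clock cycle; the variable names are illustrative.

```python
cct_ns = 2.0             # un-pipelined clock cycle time
clock_overhead_ns = 0.2  # overhead added by pipelining
cpi_unpipelined = 4.3    # weighted CPI of the un-pipelined processor

# Un-pipelined: every instruction takes CPI cycles of CCT each.
unpipelined_ns = cpi_unpipelined * cct_ns
# Pipelined (assumption: ideal CPI of 1): one instruction per stretched cycle.
pipelined_ns = cct_ns + clock_overhead_ns

speedup = unpipelined_ns / pipelined_ns
print(f"speedup = {speedup:.1f}")  # speedup = 3.9
```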
Another Key Metric: Power Dissipation
[Source: K. Asanovic – MIT ]
• Energy
– measured in Joules
• Power
– rate of energy consumption
• [Watts = Joules/sec]
– instantaneous power P = V * I
• voltage drop across a component times the current flowing through it
• Example
– system A
• higher peak power
• lower total energy
– system B
• lower peak power
• higher total energy
Power Consumption of CMOS Transistors
• Dynamic Power
– traditionally dominant component
– dissipated when transistor switches (i.e. data dependent)
• Static Power
– becoming more important with transistor scaling
– due to “leakage current” that flows even if there is no switching activity
– proportional to the number of transistors on the chip
• Challenges
– power is the key limitation to chip design
• distribute power on-chip
• remove heat
• prevent hot spots
• low power design (clock gating, DVFS)
Example: Dynamic Power Consumption
• Assume a 0.25µm CMOS chip with a voltage supply Vdd = 2.5V, clock frequency F = 500MHz, and average load capacitance of CL = 15fF/gate (assuming a fan-out of 4)
• What is the power consumption per gate?
• Approximately, Pavg = CL x Vdd^2 x F ≈ 50µW
• For a design with 1 million gates, assuming that a transition occurs at every clock edge, this would result in an average power consumption of ~50W!
• In reality, not all gates on the chip switch at the full rate of 500MHz. The actual activity is substantially lower
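The per-gate and per-chip figures can be reproduced with a minimal Python sketch, assuming the usual dynamic power relation P = C x V^2 x F. The `activity` parameter is a hypothetical addition that models gates switching less often than every clock edge.

```python
def dynamic_power_w(c_load_f, vdd_v, freq_hz, activity=1.0):
    """Dynamic power in watts: activity x C x Vdd^2 x F.

    `activity` is a hypothetical scaling factor; 1.0 means the gate
    switches at the full clock rate, as the example assumes.
    """
    return activity * c_load_f * vdd_v ** 2 * freq_hz


# Values from the example: 15 fF per gate, 2.5 V supply, 500 MHz clock.
per_gate_w = dynamic_power_w(15e-15, 2.5, 500e6)
chip_w = per_gate_w * 1_000_000  # 1 million gates, all switching

print(f"per gate: {per_gate_w * 1e6:.1f} uW")
print(f"1M gates: {chip_w:.1f} W")
```

The exact product is a little under the rounded 50µW and 50W quoted on the slide, and a lower `activity` value scales both figures down proportionally.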
Dynamic Voltage Frequency Scaling
• DVFS is a low-power design technique that is
becoming pervasive in modern processors
• Example:
– If the voltage and frequency of a processing core are both reduced by 15% what would be the impact on dynamic power?
$$\frac{P_{new}}{P_{old}} = \frac{C \times (V \times 0.85)^2 \times (F \times 0.85)}{C \times V^2 \times F} = 0.85^3 = 0.61$$
• Pnew is about 0.61 x Pold, i.e. dynamic power drops by roughly 39% (Pold / Pnew ≈ 1.6)
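The DVFS ratio can be checked with a short Python sketch, assuming dynamic power proportional to C x V^2 x F so that the capacitance cancels out. The function name is illustrative.

```python
def dvfs_power_ratio(v_scale, f_scale):
    """Pnew / Pold when voltage and frequency are scaled by the given factors."""
    return v_scale ** 2 * f_scale


# Voltage and frequency both reduced by 15%.
ratio = dvfs_power_ratio(0.85, 0.85)
print(f"Pnew / Pold = {ratio:.2f}")      # Pnew / Pold = 0.61
print(f"power saved = {1 - ratio:.0%}")  # power saved = 39%
```

Voltage enters squared, so lowering it together with the frequency saves far more than the 15% that frequency scaling alone would give.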
Assigned Readings
• Computer Architecture – A Quantitative Approach
by John Hennessy – Stanford University, Dave Patterson – UC Berkeley
Fifth Edition - 2012, Morgan Kaufmann (Elsevier)
• Read Sections 1.8-1.12