Measuring Performance Throughput - Brown_unc_0153D

To understand throughput and performance bottlenecks, we first need to be able to measure performance accurately. The GPU platform provides machine counters, from which metrics for speedup and throughput can be derived. In my case studies, I concentrate on three throughput metrics: instruction, I/O, and data throughput. Other performance metrics (such as total cycles, speedup, and work and depth analysis) will also be used in my case studies but will not be the focus.

To understand which bottlenecks affect performance, programmers start with certain known values, then take experimental measurements to derive metrics that provide insight. Values known by programmers include the input size (n) and output size (m). The CUDA profiler or hardware timers record useful measurements such as timings (time), number of parallel cores (p), and total threads (t). NVIDIA GPUs also provide various machine counters for profiling performance, including instructions issued (II) and the average instructions retired per machine cycle (IPC). From these basic values and measurements, I compute derived metrics such as total cycles, TC = II/IPC, which measures the total number of machine cycles needed to complete a section of code, an algorithm, or an entire program.
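As a minimal sketch of the derived metric TC = II/IPC, the snippet below computes total cycles from two counter readings. The function name and the counter values are hypothetical and chosen only for illustration; they are not measurements from the case studies.

```python
def total_cycles(instructions_issued: float, ipc: float) -> float:
    """Derived metric TC = II / IPC (total machine cycles)."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return instructions_issued / ipc


# Hypothetical counter readings, for illustration only.
ii = 1_200_000   # instructions issued (II)
ipc = 1.5        # average instructions retired per machine cycle (IPC)

tc = total_cycles(ii, ipc)
print(f"TC = {tc:.0f} cycles")
```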

3.1.1 Throughput Metrics

In my case studies, I consistently use three throughput metrics to gauge algorithmic performance. Instruction throughput (MI/s or GI/s, meaning mega- or giga- instructions executed per second, over all threads) tracks algorithmic performance. I/O throughput (MB/s or GB/s, meaning mega-bytes or giga-bytes transferred per second) tracks memory transfer performance. Data throughput (M*/s or G*/s, meaning mega-units or giga-units handled per second) tracks algorithmic performance in data units most germane to the problem space. Note that each of the three is a simple ratio of basic measurements.
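Because each metric is a simple ratio, all three can be computed with one helper. The sketch below assumes a hypothetical kernel that reads and writes one 4-byte value per input element; the timing, instruction count, and element size are invented for illustration and are not taken from the case studies.

```python
def throughput(amount: float, seconds: float) -> float:
    """Generic throughput ratio: units handled per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return amount / seconds


# Hypothetical measurements, for illustration only.
n = 1 << 24                  # input size: 16,777,216 elements
elapsed_s = 0.008            # measured time in seconds
instructions = 4.0e8         # instructions executed over all threads
bytes_moved = n * 4 * 2      # assumed: read + write one 4-byte value per element

instr_gi_per_s = throughput(instructions, elapsed_s) / 1e9   # GI/s
io_gb_per_s = throughput(bytes_moved, elapsed_s) / 1e9       # GB/s
data_g_per_s = throughput(n, elapsed_s) / 1e9                # G elements/s

print(f"instruction throughput: {instr_gi_per_s:.2f} GI/s")
print(f"I/O throughput:         {io_gb_per_s:.2f} GB/s")
print(f"data throughput:        {data_g_per_s:.2f} G elements/s")
```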

A typical throughput graph, as seen in Figure 3.1, plots throughput on the 𝑦-axis as a function of input size (𝑛) on the 𝑥-axis (usually in log-scale).

Like Figure 3.1, most GPU throughput graphs have sigmoidal (“S” shaped) curves for increasing values of input sizes ($n$). The throughput performance curve starts off flat for small input sizes ($n \le 10^3$), grows rapidly for medium input sizes ($10^3 < n \le 10^6$), and then levels off at some fraction of the hardware’s peak throughput for large input sizes ($10^6 < n$). For small input sizes, there is not enough data to use the parallel hardware efficiently or to amortize the heavy GPU kernel launch costs. For large data sizes, there is enough data to parallelize work across tens of thousands of threads, and the initial kernel launch costs are amortized across millions of data elements, decreasing launch costs to a negligible fraction of total performance. For medium data sizes, throughput performance transitions from the inefficient to the efficient case as the input size increases.
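One way to see why the curve is S-shaped on a log-scale axis is a toy cost model in which running time is a fixed launch overhead plus a per-element cost at peak rate. This model, the 10-microsecond overhead, and the peak rate of 30 G elements/s are all assumptions made for illustration; they are not measurements or the model used in the case studies.

```python
def modeled_throughput(n: float, launch_overhead_s: float, peak_rate: float) -> float:
    """Toy model: time = fixed launch overhead + n / peak_rate."""
    return n / (launch_overhead_s + n / peak_rate)


LAUNCH_OVERHEAD_S = 10e-6   # assumed fixed kernel launch cost (seconds)
PEAK_RATE = 30e9            # assumed peak rate (elements per second)

# Fraction of peak reached at each power of ten of the input size.
for exponent in range(1, 10):
    n = 10 ** exponent
    fraction = modeled_throughput(n, LAUNCH_OVERHEAD_S, PEAK_RATE) / PEAK_RATE
    print(f"n = 10^{exponent}: {fraction:7.2%} of peak")
```

Under this model the fraction of peak is 1 / (1 + c/n) with c = overhead × peak rate, which is a logistic function of log n. Half of peak is reached at n = c (300,000 elements for the assumed values), which falls inside the medium range described above.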

3.1.2 Parallel Speedup and Work and Depth Analysis

In addition to throughput metrics, there are two other traditional notions of performance for parallel computation that I use in my case studies: parallel speedup and work and depth analysis (Hennessy and Patterson, 2010). Let me introduce them by analogy to their serial equivalents: serial speedup and asymptotic runtime analysis.

The concept of speedup, $S = \frac{Time_{old}}{Time_{new}}$, allows us to compare the performance of two similar programs, algorithms, or sections of code using simple timings, with one timing for the old code and one timing for the new code. Serial speedup (SS) is the ratio of the time it takes to complete a task using a baseline program to the time it takes using an improved program, $SS = \frac{Time_{baseline}}{Time_{improved}}$. Parallel speedup (PS) is the ratio of the time to complete a task using a serial computation versus the time to complete the same task using a parallel computation, $PS = \frac{Time_{serial}}{Time_{parallel}}$.

[Figure 3.1: plot of I/O throughput (GB/s, y-axis ticks 0 to 40) against input size n (x-axis ticks 2^13 to 2^28).]

Often a serial algorithm cannot be fully parallelized due to unavoidable dependencies between sections of code. Amdahl’s Law (Amdahl, 1967) predicts the theoretical maximum parallel speedup on p processors for a specific problem for which a fraction of the program $\alpha \in [0,1]$ is inherently sequential and the rest of the program $(1 - \alpha)$ is parallelizable: $S(\alpha, p) = \frac{1}{\alpha + \frac{1 - \alpha}{p}}$. Even given an infinite number of parallel processors, the speedup cannot exceed $\frac{1}{\alpha}$. In other words, total performance is constrained by the serial portion of the program. Amdahl’s Law, which can be derived from parallel speedup by normalizing serial time to one, suggests that programmers focused on latency can solve a fixed-sized problem in the shortest period of time by removing serial constraints.
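The saturation predicted by Amdahl’s Law is easy to see numerically. The sketch below evaluates the formula for a hypothetical serial fraction of 5%; the function name and the chosen processor counts are illustrative assumptions.

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Amdahl's Law: S(alpha, p) = 1 / (alpha + (1 - alpha) / p)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (alpha + (1.0 - alpha) / p)


alpha = 0.05  # hypothetical: 5% of the program is inherently sequential

for p in (1, 8, 64, 512, 4096):
    print(f"p = {p:5d}: speedup = {amdahl_speedup(alpha, p):6.2f}")

# Upper bound as p grows without limit.
print(f"limit 1/alpha = {1.0 / alpha:.2f}")
```

However many processors are added, the printed speedups stay below 1/alpha, which is the bound stated above.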

As a counterpoint, Gustafson observes that end-users often exploit the maximum computing power available to them to solve ever larger problems over some practical time period (minutes, hours, days). Gustafson’s Law (Gustafson, 1988), $S(\alpha, p) = \alpha + (1 - \alpha)p$, suggests that programmers focused on throughput issues can push more work through the system using massive parallelism. Gustafson’s Law can be derived from scaled parallel speedup by normalizing parallel time to one.
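The two normalizations can be written out side by side. The short derivation below is a sketch; it assumes, as is standard, that $\alpha$ is measured as a fraction of the serial run for Amdahl’s Law and as a fraction of the parallel run for Gustafson’s Law.

```latex
% Amdahl: normalize the serial time to one (fixed problem size).
\begin{align*}
  Time_{serial}   &= \alpha + (1 - \alpha) = 1, \\
  Time_{parallel} &= \alpha + \frac{1 - \alpha}{p}, \\
  S(\alpha, p)    &= \frac{Time_{serial}}{Time_{parallel}}
                   = \frac{1}{\alpha + \frac{1 - \alpha}{p}}.
\end{align*}

% Gustafson: normalize the parallel time to one (problem size scales with p).
\begin{align*}
  Time_{parallel} &= \alpha + (1 - \alpha) = 1, \\
  Time_{serial}   &= \alpha + (1 - \alpha)\,p, \\
  S(\alpha, p)    &= \frac{Time_{serial}}{Time_{parallel}}
                   = \alpha + (1 - \alpha)\,p.
\end{align*}
```

In the second case the parallel portion would take $p$ times longer on a single processor, so the scaled speedup grows linearly in $p$ instead of saturating at $1/\alpha$.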

Asymptotic analysis, also known as “Big ‘O’ notation”, shows how the running time (or resource usage) of an algorithm grows with the input size (n). The growth-rate function is reported using asymptotic notation to suppress implementation-dependent constants and to simplify expressions. If these hidden constants are reasonable, then $O(\log n) \ll O(n) \ll O(n \log n) \ll O(n^2)$. In words, a logarithmic growth-rate algorithm is preferred to a linear growth-rate algorithm, a linear growth-rate algorithm is preferred to a log-linear growth-rate algorithm, and a log-linear growth-rate algorithm is preferred to a quadratic growth-rate algorithm.

Work + Depth Analysis: For a parallel computation, work efficiency, W(n), is defined as the total number of instructions executed across all parallel cores. The ratio of work to the number of cores (p) gives the ideal running time under linear speedup, $\frac{W(n)}{p}$, of a parallel computation. However, there are usually dependent steps, either inherent in the algorithm itself or arising from the coordination required between parallel cores, which limit this ideal speedup. The longest dependent chain of executed steps on any core is defined as an algorithm’s depth efficiency, D(n). Figure 3.2 illustrates work and depth for summation, in both serial and parallel forms.
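The summation example can be reproduced by counting additions (work) and dependent steps (depth). The sketch below simulates the parallel case sequentially as a pairwise tree reduction, where each pass of the loop stands in for one parallel step; the function names are hypothetical and the code is not a GPU implementation.

```python
def serial_sum(values):
    """Left-to-right sum: every add depends on the previous one."""
    total = values[0]
    work = depth = 0
    for v in values[1:]:
        total += v
        work += 1    # one add
        depth += 1   # one more link in the dependency chain
    return total, work, depth


def tree_sum(values):
    """Pairwise tree reduction: adds within a level are independent."""
    level = list(values)
    work = depth = 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        work += len(pairs)          # adds performed at this level
        if len(level) % 2:          # carry an unpaired element forward
            pairs.append(level[-1])
        level = pairs
        depth += 1                  # one parallel step per level
    return level[0], work, depth


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(serial_sum(data))  # (36, 7, 7): work 7 adds, depth 7 steps
print(tree_sum(data))    # (36, 7, 3): work 7 adds, depth 3 steps
```

Both forms perform the same work of 7 adds, but the tree form shortens the longest dependent chain from 7 steps to 3.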

Work efficiency and depth efficiency are related by Brent’s Theorem (Gustafson, 2011), which says that any parallel algorithm runs in $\frac{W(n)}{p} + D(n)$ time; in fact, $\max\left(\frac{W(n)}{p}, D(n)\right) \le Time \le \frac{W(n)}{p} + D(n)$. Four useful implications from Brent’s Theorem are:

• A parallel algorithm cannot run faster than its depth efficiency, D(n).

• A parallel algorithm that requires coordination across p cores cannot run faster than O(log p).

• It is inefficient to use more parallel cores than an algorithm needs. The concept of parallelism, $\frac{W(n)}{D(n)}$, roughly captures how many parallel cores an algorithm can efficiently use.

• If the number of processors and the input size are fixed or known ahead of time, then try to do at least D(n) work on each processor for the best parallel efficiency.

Figure 3.2: An example of work and depth efficiency for the simple summation of 8 values for both the serial and parallel cases. Serial Sum: Work: 7 adds, Depth: 7 steps. Parallel Sum: Work: 7 adds, Depth: 3 steps.
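Brent’s bounds can be evaluated directly for the 8-value summation (W = 7, D = 3). The sketch below uses hypothetical function names and treats time in units of one add; it illustrates the inequality and is not a performance model from the case studies.

```python
def brent_bounds(work: float, depth: float, p: int):
    """Return (lower, upper) with max(W/p, D) <= Time <= W/p + D."""
    if p < 1:
        raise ValueError("p must be at least 1")
    ideal = work / p
    return max(ideal, depth), ideal + depth


def parallelism(work: float, depth: float) -> float:
    """W/D: roughly how many cores the algorithm can use efficiently."""
    return work / depth


W, D = 7, 3  # parallel sum of 8 values: 7 adds, 3 dependent steps

for p in (1, 2, 4, 8):
    lower, upper = brent_bounds(W, D, p)
    print(f"p = {p}: {lower:.3f} <= Time <= {upper:.3f}")

print(f"parallelism W/D = {parallelism(W, D):.2f}")
```

Once p exceeds W/D (about 2.33 here), the lower bound stays pinned at D = 3, so additional cores only shrink the upper bound toward the depth.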
