Background - Performance, Power Modeling and Optimization for High-Performance Computing Systems

6.2.1 Baseline CUDA and Fermi Architecture

CUDA is a parallel computing architecture developed by Nvidia [48]. It abstracts the thread-level parallelism of the GPU into a hierarchy of threads (grids of blocks of warps of threads) [49]. These threads are then mapped onto a hierarchy of hardware resources. The basic unit of execution flow, the warp, contains 32 threads that execute the same instruction based on the single instruction, multiple thread (SIMT) paradigm.

Figure 6.1 illustrates the detailed microarchitecture of the warp scheduler and SIMT pipelines inside a CUDA streaming multiprocessor (SM). Each SM features two warp schedulers and two dispatch units, with all the warps evenly divided between them according to the parity of the warp ID, as shown in the box marked "Scheduler". Each warp scheduler can function independently, without dependency checking across the schedulers. Each SM also contains 32 streaming processors (SP) divided evenly into 2 pipelines, 4 special function units (SFU), and 16 load/store units (MEM), as shown in the box marked "SIMT Pipelines". Because each pipeline (excluding SFU) has 16 execution units while a warp contains 32 threads, it takes at least 2 cycles for an instruction to be issued to the pipeline. As a result, the dual warp schedulers run at half of the pipeline frequency, issuing a maximum of one instruction every cycle.
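The parity-based warp split and the 2-cycle issue time described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the simulator's code; the function names are hypothetical, and applying the same arithmetic to the 4-unit SFU pipeline is an assumption that the text does not state explicitly.

```python
import math

WARP_SIZE = 32            # threads per warp (from the text)
LANES_PER_PIPELINE = 16   # execution units in SP1, SP2 and MEM (from the text)


def scheduler_for_warp(warp_id: int) -> str:
    """Assign a warp to one of the two schedulers by the parity of its ID."""
    return "even" if warp_id % 2 == 0 else "odd"


def issue_cycles(lanes: int = LANES_PER_PIPELINE) -> int:
    """Minimum pipeline cycles needed to issue one warp instruction
    onto a pipeline that has `lanes` execution units."""
    return math.ceil(WARP_SIZE / lanes)


def warps_in_block(threads_per_block: int) -> int:
    """Number of warps needed to hold a thread block (last warp may be partial)."""
    return math.ceil(threads_per_block / WARP_SIZE)


if __name__ == "__main__":
    print([scheduler_for_warp(w) for w in range(4)])
    print(issue_cycles(16))      # SP or MEM pipeline
    print(issue_cycles(4))       # SFU, assuming the same arithmetic applies
    print(warps_in_block(1536))  # maximum resident threads per core in Table 6.2
```

Under these assumptions, a 16-wide pipeline needs 2 cycles per warp instruction, which is the reason the text gives for running the schedulers at half the pipeline frequency.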

The warp scheduler maintains the status of warps on a per-cycle basis. As shown in Figure 6.1, the warp status in the scheduler can take one of three values. A warp is inactive with control hazards when its next instruction is not yet stored in the instruction buffer and thus cannot be issued immediately. This scenario occurs only when the instruction is a branch or function call; in both cases, it is observed that the probability that a warp turns inactive due to control hazards, P_inactive_control, remains quite stable and can be considered a kernel-dependent constant. A warp is inactive with data hazards when its next instruction has a data dependency on a previous instruction that still resides in the pipelines. An active warp has no data dependency issues and is ready to be issued immediately.
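The three warp states can be expressed as a small classification routine. This is an illustrative sketch only: the names `WarpStatus` and `classify_warp` are hypothetical, the data hazard is modeled as a simple read-after-write check on register names, and checking the control hazard first is an assumption rather than something the text specifies.

```python
from enum import Enum
from typing import Set


class WarpStatus(Enum):
    ACTIVE = "active"
    INACTIVE_CONTROL = "inactive (control hazard)"
    INACTIVE_DATA = "inactive (data hazard)"


def classify_warp(next_instr_buffered: bool,
                  in_flight_dest_regs: Set[str],
                  next_src_regs: Set[str]) -> WarpStatus:
    """Classify a warp for the current cycle.

    next_instr_buffered: whether the next instruction is already in the
        instruction buffer (False after a branch or function call).
    in_flight_dest_regs: registers written by instructions still in the pipelines.
    next_src_regs: registers read by the warp's next instruction.
    """
    if not next_instr_buffered:
        return WarpStatus.INACTIVE_CONTROL
    if in_flight_dest_regs & next_src_regs:
        # The next instruction reads a value that has not been written back yet.
        return WarpStatus.INACTIVE_DATA
    return WarpStatus.ACTIVE


if __name__ == "__main__":
    print(classify_warp(False, set(), {"r1"}))
    print(classify_warp(True, {"r1"}, {"r1", "r2"}))
    print(classify_warp(True, {"r3"}, {"r1", "r2"}))
```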

Figure 6.1: Microarchitecture of a GPU core in Fermi GTX 480. (Figure not reproduced. Labeled components: the dual-issue scheduler with odd and even warp pools, active and inactive warps, and two dispatch units; the issue queue; the SIMT pipelines SP1 and SP2 (16 units, 13 stages each), SFU (4 units, 13–25 stages), and MEM (16 LD/ST units, 2+ stages); writeback; the interconnect (ICNT); and the shared memory, data L1 cache, texture cache, and constant cache.)

The scheduler picks an active warp from its own active warp pool in a loosely round-robin fashion, sends the warp to its dedicated SIMT pipeline, and updates the warp status and data dependencies. While inside the dedicated SIMT pipeline, the instructions are sent into an operand buffer while waiting for all the input registers to be acquired. Once all inputs are ready, the operand buffer issues the instructions to the execution pipeline in first-in-first-out order. For each arithmetic SIMT pipeline, there are over 20 pipeline stages [50]. Considering the extra stalls caused by the dispatch unit and potential register bank conflicts, a significant number of warps is needed to avoid stalls in the arithmetic pipelines, and particularly in the even more time-consuming MEM pipeline. If no active warp is available, or the warp is issued to another SIMT pipeline, a stall occurs and a bubble is inserted into the SIMT pipeline. At the writeback stage, the instruction is considered finished and the warp status is updated.
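The issue behavior described above (pick an active warp in round-robin order, insert a bubble when none is available) can be sketched as follows. This is a simplified illustration under stated assumptions: "loosely round-robin" is interpreted here as plain round-robin that skips inactive warps, only one scheduler and one pipeline are modeled, and the function names are hypothetical.

```python
from typing import List, Optional


def pick_next_warp(active: List[bool], last_issued: int) -> Optional[int]:
    """Return the index of the next active warp after `last_issued`,
    searching in round-robin order, or None if no warp is active."""
    n = len(active)
    for offset in range(1, n + 1):
        candidate = (last_issued + offset) % n
        if active[candidate]:
            return candidate
    return None


def simulate_issue(active_per_cycle: List[List[bool]]) -> List[Optional[int]]:
    """Replay a per-cycle trace of warp activity.

    Each entry of the result is the warp issued in that cycle,
    or None when the pipeline receives a bubble (stall)."""
    issued: List[Optional[int]] = []
    last = -1  # so that warp 0 is considered first
    for active in active_per_cycle:
        choice = pick_next_warp(active, last)
        issued.append(choice)
        if choice is not None:
            last = choice
    return issued


if __name__ == "__main__":
    trace = [
        [True, True, False],
        [True, True, False],
        [False, False, False],  # every warp is inactive: a bubble is inserted
        [True, False, True],
    ]
    print(simulate_issue(trace))
```

The fraction of `None` entries in the result gives the stall rate of the modeled pipeline, which is why a larger pool of active warps reduces bubbles.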

6.2.2 Workload and Metrics

Application Suite:

We perform evaluations using the Parboil benchmark suite [51], which contains a wide range of GPGPU applications optimized for the CUDA architecture, as shown in Table 6.1, including biomolecular simulation, fluid dynamics, image processing, astronomy, and dense and sparse linear algebra.

Each application consists of one or more kernels. We observed that even kernels from the same application can exhibit different characteristics. We pick kernels based on their weight (the ratio between kernel execution time and whole application time) in each application and perform evaluations on both GTX 480 hardware and GPGPU-Sim (version 3.2.0) [52]. We model our baseline architecture after Fermi GTX 480 [53] with the configuration shown in Table 6.2.

Table 6.1: List of GPGPU kernels.

Bench.   Abbr.  Kernel                       Weight   Avg. Kernel Cycles  Invocations  Avg. Launch Overhead (µs)
bfs      BFS    BFS in GPU kernel            100%     22                  1            -
cutcp    CUT    cuda cutoff potential        99.90%   5                   26           71
histo    HIS    histo main kernel            51.30%   0.3                 10000        3
lbm      LBM    performStreamCollide kernel  100%     3                   1            -
mri-q    MRI    ComputeQ GPU                 99.60%   4                   2            73
sad      SAD    mb sad calc                  52.50%   18                  1            -
sgemm    SGE    mysgemmNT                    100%     3                   1            -
spmv     SPM    spmv jds                     99.90%   0.4                 50           3.3
stencil  STE    block2D hybrid coarsen x     99.80%   2                   100          5.2
tpacf    TPA    gen hists                    100%     7                   1            -

Table 6.2: GPGPU-Sim Configuration for Baseline Architecture (Fermi GTX 480).

GPU Config.       15 GPU cores, 2.0 Compute Capability
Frequency         1400MHz Core, 700MHz ICNT, 924MHz DDR5
GPU Core Config.  SIMT Width: 16 (SP1, SP2 and MEM), 4 (SFU)
Resources/Core    Max. 1536 Threads, Max. 8 CTAs, 48KB Shared Memory, 32768 Registers
Caches/Core       16KB, 128B line, 4-way, 64 MSHR L1 Data Cache
                  12KB, 128B line, 24-way Texture Cache
                  8KB, 64B line, 2-way Constant Cache
Unified L2 Cache  768KB, 128B line, 16-way, 256 MSHR
Scheduling        GTO (Greedy-then-Oldest Scheduling)
Interconnect      2D mesh (5x5, 15 cores + 6 Memory Controllers)
DRAM Model        FR-FCFS, 6 MC, Burst Length 8, Bus Width 8B/MC, Total 384 bits
GDDR5 Timing      924MHz, 16 Banks, tCCD=2, tRRD=6, tRCD=12, tRAS=28, tRP=12, tRC=40, tCL=12, tWL=4, tCDLR=5, tWR=12, tnbkgrp=4, tCCDL=4, tRTPL=2

Table 6.1 shows kernel performance characteristics captured through hardware profiling. The Invocations column indicates how many times the kernel is launched in the application. Among the kernels we evaluated, different invocations of the same kernel exhibit similar functional unit utilization. For simulation simplicity, if a kernel has hundreds of invocations, we repeatedly simulate the kernel with the same input set.

In addition, for kernels with more than one invocation, we measured the kernel launch overhead, that is, the gap between when the previous kernel finishes and when a new kernel launches. Note that this does not include memory copy time. The overhead is typically on the order of microseconds, but when kernels are very short it can have a significant performance impact, since the launch overhead becomes larger relative to the kernel execution time. For example, in HIS, the kernel launch overhead is over 1% of the kernel execution time.
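The relative cost of the launch overhead can be estimated from Table 6.1. The sketch below rests on two assumptions that the text does not state explicitly: that the "Avg. Kernel Cycles" column is in millions of core cycles, and that the 1400 MHz core clock from Table 6.2 converts cycles to time. The function name is hypothetical.

```python
CORE_CLOCK_HZ = 1400e6  # core frequency from Table 6.2


def launch_overhead_ratio(kernel_mcycles: float,
                          overhead_us: float,
                          clock_hz: float = CORE_CLOCK_HZ) -> float:
    """Launch overhead as a fraction of kernel execution time.

    kernel_mcycles: average kernel length, ASSUMED to be in millions of cycles.
    overhead_us: average launch overhead in microseconds (Table 6.1).
    """
    kernel_time_us = kernel_mcycles * 1e6 / clock_hz * 1e6
    return overhead_us / kernel_time_us


if __name__ == "__main__":
    # (avg. kernel cycles, avg. launch overhead in µs) taken from Table 6.1
    kernels = {"HIS": (0.3, 3.0), "SPM": (0.4, 3.3), "STE": (2.0, 5.2)}
    for name, (mcycles, overhead) in kernels.items():
        ratio = launch_overhead_ratio(mcycles, overhead)
        print(f"{name}: {ratio:.2%} of kernel execution time")
```

Under these assumptions HIS comes out at about 1.4%, which is consistent with the "over 1%" figure quoted above.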

Evaluation Metrics:

We use SM IPC, the average number of instructions issued per cycle in one SM, as a performance metric. More specifically, SM IPC in this dissertation stands for the average number of instructions issued from the warp schedulers per cycle, which relates directly to pipeline utilization. For the rest of this dissertation, IPC indicates SM IPC. The average number of cycles per instruction, CPI per warp, is also used to investigate the stalls each warp suffers for various reasons. Note that, in theory, given the number of warps and the CPI per warp, SM IPC should equal the number of warps divided by the CPI per warp.
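The relation between the two metrics can be written as a one-line function. This is a minimal sketch with a hypothetical function name; the CPI value in the example is made up for illustration, while 48 warps corresponds to the 1536-thread limit per core in Table 6.2 divided by the 32-thread warp size.

```python
def sm_ipc(num_warps: int, cpi_per_warp: float) -> float:
    """Theoretical SM IPC: number of warps divided by the CPI per warp."""
    if cpi_per_warp <= 0:
        raise ValueError("CPI per warp must be positive")
    return num_warps / cpi_per_warp


if __name__ == "__main__":
    # 48 resident warps, each issuing one instruction every 60 cycles
    # on average (hypothetical CPI value).
    print(sm_ipc(48, 60.0))
```

Read the other way, a fixed CPI per warp means IPC grows linearly with the number of resident warps until the issue limit of the schedulers is reached.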
