
9.3 Analytical Performance Model


Note that $N_{active,k,p}$ is the number of active warps from kernel $k$ whose next instruction is to be issued in pipeline $p$, counted right before the warp scheduler issues the next instruction. We refer to the total number of active warps as $N_{active,k} = \sum_{p \in \{SP, MEM, SFU\}} N_{active,k,p}$. $N_{active,k}$ reflects how aggressively one kernel can utilize the SIMT pipelines during contention. Therefore, as shown in Equation 9.3, if we can find $N_{active,k}$ for each pipeline, we can derive $IPC_k$. Conversely, $N_{active,k}$ can be derived by subtracting $N_{inact,k}$ from $N_k$, and $N_{inact,k}$ can in turn be derived from $IPC_k$, as $IPC_k$ has a direct impact on how many warps become inactive due to data hazards.

9.3.2 Mean Value Based Performance Model

The goal of the performance model is to predict the steady-state $IPC_k$ and $N_{active,k}$.

Figure 9.3 illustrates the performance model for a given kernel pair with detailed warp status distribution in steady state. The figure shows a warp status distribution during execution from an individual kernel's perspective. The kernel's warps consist of inactive warps with control hazards, inactive warps with data hazards, and active warps. In our presentation below, we

[Figure 9.4: Determining throughput for mixed kernels: (a) pipeline constrained scenario; (b) parallelism constrained scenario. Each panel shows a dispatch unit with the average number of active warps of k1 and k2, and the utilization of the SP, MEM, and SFU pipelines split into utilization by k1, utilization by k2, and idle.]

This model integrates two sub-models:

• The IPC model, described in subsection 9.3.3, estimates the steady-state IPC for the two kernels as a function of the number of active warps of each kernel.

• The Active Warps model, described in subsection 9.3.4, derives $N_{active,k_1}$ and $N_{active,k_2}$ as a function of $IPC_k$.

The two sub-models together form a simple nonlinear equation that does not admit a closed-form solution but can be solved iteratively in a few iterations using a standard root-finding method.
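The coupling between the two sub-models can be illustrated with a short sketch. It uses a damped fixed-point iteration, which is only one simple stand-in for the "standard root-finding method" mentioned above. The two toy closures, `N_TOTAL`, `STALL`, and the damping value are hypothetical placeholders chosen to exercise the solver; they are not the models of subsections 9.3.3 and 9.3.4.

```python
def solve_steady_state(ipc_model, active_warps_model, n_active_init,
                       damping=0.1, tol=1e-9, max_iter=1000):
    """Alternate between the two sub-models until N_active stops changing.

    ipc_model:          maps {kernel: N_active} -> {kernel: IPC}
    active_warps_model: maps {kernel: IPC}      -> {kernel: N_active}
    """
    n_active = dict(n_active_init)
    for _ in range(max_iter):
        ipc = ipc_model(n_active)
        target = active_warps_model(ipc)
        # Damped update: move only part of the way toward the new estimate.
        updated = {k: (1.0 - damping) * n_active[k] + damping * target[k]
                   for k in n_active}
        if max(abs(updated[k] - n_active[k]) for k in n_active) < tol:
            return ipc_model(updated), updated
        n_active = updated
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical toy sub-models, NOT the ones derived in this chapter.
N_TOTAL = {"k1": 8.0, "k2": 4.0}   # assumed total warps per kernel
STALL = {"k1": 10.0, "k2": 2.0}    # assumed inactive warps per unit of IPC


def toy_ipc_model(n_active):
    # Toy assumption: each kernel issues at most one instruction per cycle.
    return {k: min(n, 1.0) for k, n in n_active.items()}


def toy_active_warps_model(ipc):
    # Toy assumption: N_active = N_total - N_inactive, N_inactive linear in IPC.
    return {k: N_TOTAL[k] - STALL[k] * ipc[k] for k in ipc}


ipc, n_active = solve_steady_state(toy_ipc_model, toy_active_warps_model, N_TOTAL)
print(ipc, n_active)
```

The damping factor matters: if the active-warps estimate reacts strongly to IPC, an undamped update can oscillate, so a small step is the safer default in this sketch.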

9.3.3 IPC Model

We now develop an analytical model that calculates $IPC_{k_1}$ and $IPC_{k_2}$ from the number of active warps of each kernel. To explain how performance is determined with multiple active warps in the scheduler, we consider two scenarios, shown in Figure 9.4. For both scenarios, the figure shows two kernels being co-issued to an SM by a single dispatch unit.¹ The numbers within the scheduler box indicate the average number of active warps in the steady state for each of the three pipelines. We use $P_{k,p}$ to denote the probability that an instruction from kernel $k$ will be issued to pipeline $p$. This quantity reflects the degree of balance in the way kernel $k$ utilizes the available pipelines, and can be obtained by profiling the kernel and its instruction mix. Individual pipeline utilization is shown to the right of the dispatch unit, with a solid bar representing utilization by kernel $k_1$, a hashed bar representing utilization by kernel $k_2$, and an empty bar representing idleness in the pipeline.

¹ Recall that an SM has two dispatch units, each of which can feed an SP pipeline (exclusively) and the MEM and SFU pipelines (on a shared basis). The figure shows these three pipelines (SP, SFU, MEM) being fed by a single dispatch unit.
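As a minimal sketch of how the issue probabilities might be estimated from a profiled instruction mix, the snippet below normalizes per-pipeline instruction counts. The function name and the counts are hypothetical; the text does not prescribe a particular profiling procedure.

```python
def issue_probabilities(instruction_counts):
    """Estimate P_{k,p} for one kernel as the fraction of its profiled
    instructions that were issued to each pipeline."""
    total = sum(instruction_counts.values())
    if total == 0:
        raise ValueError("instruction mix is empty")
    return {p: count / total for p, count in instruction_counts.items()}


# Hypothetical profiled instruction counts for a single kernel.
counts_k1 = {"SP": 600, "MEM": 300, "SFU": 100}
p_k1 = issue_probabilities(counts_k1)
print(p_k1)
```

By construction the estimated probabilities for a kernel sum to one across the three pipelines.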

Determining Shader IPC

When multiple kernels are running concurrently in an SM, as shown in Equation 9.3, the mixed kernels are in one of two scenarios.

Parallelism constrained scenario: When there are not enough warps to keep any of the pipelines fully utilized, the SM suffers extra stalls, as illustrated in Figure 9.4(b). Many factors can lead to insufficient active warps, including low occupancy due to large resource requirements such as registers, shared memory, or grid size, or frequent thread block synchronizations that lead to invalid warps. In general, we say the SM is parallelism constrained, as the SM cannot provide sufficient active warps to keep at least one of the pipelines fully utilized. In the steady state, an equilibrium is reached where a warp is issued as soon as it turns active. Hence, for all $p \in \{SP, MEM, SFU\}$, we have:

$$IPC_{k,p} = N_{active,k,p}, \quad \forall k \in \{k_1, k_2\}. \tag{9.4}$$

From the definition of $P_{k,p}$, it can be assumed that $N_{active,k,p} = N_{active,k} \times P_{k,p}$, where $N_{active,k}$ is the number of active warps in kernel $k$. Considering that none of the pipelines is fully utilized, the following condition must be satisfied for all $p \in \{SP, MEM, SFU\}$:

$$N_{active,k_1} \times P_{k_1,p} \times C_{k_1,p} + N_{active,k_2} \times P_{k_2,p} \times C_{k_2,p} < 1 \tag{9.5}$$
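The boundary test of Equation 9.5 and the parallelism constrained result of Equation 9.4 can be sketched as follows. All numeric inputs are hypothetical. In particular, $C_{k,p}$ is the coefficient appearing in Equation 9.5, whose definition lies outside this passage, so it is simply set to 1.0 here as an assumption.

```python
PIPELINES = ("SP", "MEM", "SFU")


def pipeline_load(n_active, p_issue, c_coeff, pipeline):
    """Left-hand side of Equation 9.5 for one pipeline, summed over kernels."""
    return sum(
        n_active[k] * p_issue[k][pipeline] * c_coeff[k][pipeline]
        for k in n_active
    )


def is_parallelism_constrained(n_active, p_issue, c_coeff):
    """True when Equation 9.5 holds for every pipeline."""
    return all(
        pipeline_load(n_active, p_issue, c_coeff, p) < 1.0 for p in PIPELINES
    )


def parallelism_constrained_ipc(n_active, p_issue):
    """Equation 9.4 with N_active,k,p = N_active,k * P_k,p."""
    return {
        k: {p: n_active[k] * p_issue[k][p] for p in PIPELINES}
        for k in n_active
    }


# Hypothetical inputs, chosen only to exercise the functions.
n_active = {"k1": 0.5, "k2": 0.25}
p_issue = {
    "k1": {"SP": 0.5, "MEM": 0.25, "SFU": 0.25},
    "k2": {"SP": 0.5, "MEM": 0.5, "SFU": 0.0},
}
c_coeff = {k: {p: 1.0 for p in PIPELINES} for k in n_active}  # assumed values

if is_parallelism_constrained(n_active, p_issue, c_coeff):
    ipc = parallelism_constrained_ipc(n_active, p_issue)
    print(ipc["k1"]["SP"], ipc["k2"]["SP"])
```

With these inputs every pipeline load stays below one, so the sketch takes the parallelism constrained branch; raising the active warp counts until any load reaches one would move the kernel pair across the boundary.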

Pipeline constrained scenario: In this scenario, the SM has sufficient active warps ready to issue from the scheduler; however, one of the pipelines is fully utilized and becomes the performance bottleneck of the SM. Figure 9.4(a) illustrates a fully utilized SP pipeline.

When pipeline $p$ is fully utilized, not all the active warps can be issued immediately, so $IPC_{k,p}$ is smaller than $N_{active,k,p}$. Combined with Equation 9.6, this means the left side of Equation 9.5 must be greater than one when the shader is in the pipeline constrained scenario. Therefore, Equation 9.5 can be used as the boundary between the two scenarios.

When pipeline constrained, the warp scheduling policy determines how the pipeline utilization breaks down between the two kernels in steady state. In this chapter, we assume the scheduler uses a loosely round-robin scheduling policy. As each active warp is equally likely to be issued in a round-robin warp scheduler, it is reasonable to assume that when pipeline $p$ is fully utilized, the IPC of kernel $k$ with respect to pipeline $p$ ($IPC_{k,p}$) is proportional to the number of active warps of the kernel ($N_{active,k,p}$). Furthermore, as only $p$ is fully utilized, Equation 9.4 also holds for the rest of the pipelines. Therefore, we have:

$$\frac{IPC_{k_1,p}}{IPC_{k_2,p}} = \frac{N_{active,k_1,p}}{N_{active,k_2,p}} = \frac{N_{active,k_1} - IPC_{k_1} \times (1 - P_{k_1,p})}{N_{active,k_2} - IPC_{k_2} \times (1 - P_{k_2,p})} \tag{9.7}$$
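One way to read the second equality in Equation 9.7 is spelled out below. It assumes that the per-pipeline IPC of a kernel splits according to its instruction mix, $IPC_{k,q} = IPC_k \times P_{k,q}$, and that the probabilities $P_{k,q}$ sum to one over the three pipelines; the text implies both but states neither explicitly.

```latex
\begin{aligned}
N_{active,k,p}
  &= N_{active,k} - \sum_{q \neq p} N_{active,k,q}
     && \text{(definition of } N_{active,k}\text{)} \\
  &= N_{active,k} - \sum_{q \neq p} IPC_{k,q}
     && \text{(Equation 9.4 for the pipelines } q \neq p\text{)} \\
  &= N_{active,k} - IPC_k \sum_{q \neq p} P_{k,q}
     && \text{(assumed: } IPC_{k,q} = IPC_k \times P_{k,q}\text{)} \\
  &= N_{active,k} - IPC_k \times (1 - P_{k,p})
     && \text{(assumed: } \textstyle\sum_{q} P_{k,q} = 1\text{)}
\end{aligned}
```

Applying this to both $k_1$ and $k_2$ and taking the ratio gives the right-hand side of Equation 9.7.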

From the assumption that kernel behavior is stable over its execution, a kernel that reaches the steady state must either be limited by parallelism constraints or pipeline constraints; one of the constraints is always dominant over the other when determining the performance of the shader. This is also verified through our experiments: each benchmark suffers primarily from either single pipeline congestion or from insufficient warps, with the effect of other factors such as branch divergence being less than 5% for GPU applications.

Therefore, with Equation 9.5 serving as the boundary condition between the two scenarios, the performance of an individual kernel, $IPC_k$, can be derived from Equations 9.4, 9.6, and 9.7 as a piecewise linear function of $N_{active,k}$.