Formal Framework and Definitions - SIMD@OpenMP : a programming model approach to leverage SIMD

In this section, we describe a framework to formally introduce the concepts of overlap group, overlapped vector memory load and other definitions of our proposal. We apply our formalization to the Moving Average code shown in Figure 5.1 and to the Jacobi code shown in Figure 5.4a to offer a practical point of view.

5.3. Formal Framework and Definitions 111

 1 for (i=1; i<=sizex-2; i++){
 2 #pragma omp simd aligned(u+1,utmp+1:64) suitable(sizey:16) reduction(+:sum)
 3   for (j=0; j<=sizey-3; j++){
 4     float tmp = 0.25f * (u[i*sizey+j] + u[i*sizey+j+2] +     /* left + right + */
 5                 u[(i-1)*sizey+j+1] + u[(i+1)*sizey+j+1]);    /* top + bottom */
 6     float diff = tmp - u[i*sizey+j+1];                       /* center */
 7     utmp[i*sizey + j+1] = tmp;
 8     sum += diff * diff;
 9   }
10 }

(a) Scalar code with a standalone SIMD directive on the inner loop. The lower bound of the inner loop has been normalized to 0

 1 for (i=1; i<=sizex-2; i++){
 2   floatVF vsum = vpromotionVF(0.0f);
 3   for (j=0; j<=sizey-VF-2; j+=VF){
 4     floatVF tmp = vpromotionVF(0.25f) *
 5       (unaligned_vloadVF(&u[i*sizey+j]) + unaligned_vloadVF(&u[i*sizey+j+2]) +
 6        aligned_vloadVF(&u[(i-1)*sizey+j+1]) +
 7        aligned_vloadVF(&u[(i+1)*sizey+j+1]));
 8     floatVF diff = tmp - aligned_vloadVF(&u[i*sizey+j+1]);
 9     aligned_vstoreVF(&utmp[i*sizey+j+1], tmp);
10     vsum += diff * diff;
11   }
12   sum += vhorizontal_reductionVF(vsum);
13 }

(b) Vector pseudo-code after the inner loop vectorization. VF is the vectorization factor

Figure 5.4: Code of a 2D Jacobi solver for heat diffusion

5.3.1 Preliminaries

We use $a[i]$ to denote a scalar access to the $i$-th element of the array $a$, and $a[l : u]$ to designate the set of scalar accesses to $a[j]$ for all $j$ such that $l \le j \le u$. Let $K$ be a scalar loop defined as:

$$K : IV_K = \langle LB_K, UB_K, ST_K, S_K \rangle \tag{5.1}$$

where $IV_K$, $LB_K$, $UB_K$, $ST_K$ and $S_K$ are the induction variable, the inclusive lower and upper bounds, the step and the scalar statements of $K$, respectively. Let us assume that the loop has been normalized, i.e., $LB_K = 0$ and $ST_K = 1$, and that array accesses have been linearized [81].

We define $V$ as the vectorized version of $K$ following a strip-mining/unroll-and-jam vectorization approach, such as the one introduced in Chapter 3, such that:

$$V : IV_V = \langle LB_V, UB_V, ST_V, S_V \rangle \tag{5.2}$$

where $LB_V = 0$, $ST_V = VF_V$ (the vectorization factor, i.e., the number of scalar iterations of $K$ computed within each vector iteration of $V$), $UB_V = UB_K - VF_V + 1$ and $S_V$ is the set of statements obtained from the vectorization of $S_K$.
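The relation between the bounds of the scalar loop and those of its vectorized version can be checked with a small C sketch. It is only an illustration under stated assumptions: the trip count of the scalar loop (its inclusive upper bound plus one) is assumed to be a multiple of the vectorization factor, so no epilogue loop is modeled, and the helper names `vector_upper_bound` and `covered_iterations` are hypothetical, not part of the proposal.

```c
#include <assert.h>

/* Inclusive upper bound of the vector loop V derived from the normalized
 * scalar loop K (LB_K = 0, ST_K = 1): UB_V = UB_K - VF_V + 1. */
int vector_upper_bound(int ub_k, int vf)
{
    assert(vf > 0);
    return ub_k - vf + 1;
}

/* Counts the scalar iterations of K executed by V, where the vector
 * iteration starting at iv covers the scalar iterations iv .. iv + vf - 1.
 * Assumption: (ub_k + 1) is a multiple of vf, so no epilogue is needed. */
int covered_iterations(int ub_k, int vf)
{
    int count = 0;
    for (int iv = 0; iv <= vector_upper_bound(ub_k, vf); iv += vf)
        count += vf;
    return count;
}
```

For a scalar loop with inclusive upper bound 15 and a vectorization factor of 4, the vector loop runs with induction values 0, 4, 8 and 12, and covers all 16 scalar iterations.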

Definition 1. (Stride-one Vector Memory Access). $\overrightarrow{a[e]}$ denotes the stride-one vector memory access, henceforth svma, of a scalar access $a[e]$ where $e$ is of the form $e = IV_K + c$ and $c$ is invariant in $K$. This kind of vector access performs the set of scalar memory accesses $a[e : e + VF_V - 1]$ in a vector way.

A stride-one vector memory load, denoted vl, is an svma that reads data from memory. From now on, we limit our description to vls, although it can be easily extended to stride-one vector memory stores. However, our proposal has limited applicability to stores in real applications.

In the Moving Average example, the outer loop in Figure 5.1a is $K : i = \langle 0, N - \mathit{points} - 1, 1, \{\text{lines 3-10}\} \rangle$, the outer loop in Figure 5.1b is $V : i = \langle 0, N - \mathit{points} - VF, VF, \{\text{lines 2-10}\} \rangle$ and a[i+j] is a vl. In the Jacobi 2D example, the inner loop in Figure 5.4a is $K : j = \langle 0, \mathit{sizey} - 3, 1, \{\text{lines 3-9}\} \rangle$, the inner loop in Figure 5.4b is $V : j = \langle 0, \mathit{sizey} - VF - 2, VF, \{\text{lines 3-11}\} \rangle$, the accesses to array u are vls and the access to array utmp is a stride-one vector store.

5.3.2 Overlap Relations

The following definitions are used to establish an equivalence relation among the vls of each particular array in $V$.

Definition 2. We say that two vls $\overrightarrow{a[e_1]}$ and $\overrightarrow{a[e_2]}$ overlap, denoted $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_2]}$, if and only if $\{a[i] \mid i \in e_1 \ldots e_1 + VF_V - 1\} \cap \{a[j] \mid j \in e_2 \ldots e_2 + VF_V - 1\} \neq \emptyset$.
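As an illustration of Definition 2, the overlap test reduces to an intersection test between two closed index intervals of the same length. The following C sketch assumes that both vls access the same array, that each is identified by the index of its first element, and that both use the same vectorization factor; the function name `vls_overlap` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A stride-one vector load a[e : e + vf - 1] is identified here by the
 * index e of its first element (in elements of array a). */
bool vls_overlap(long e1, long e2, int vf)
{
    assert(vf > 0);
    long last1 = e1 + vf - 1;   /* last element read by the first vl  */
    long last2 = e2 + vf - 1;   /* last element read by the second vl */
    /* Two closed intervals intersect iff each one starts no later than
     * the other one ends. */
    return e1 <= last2 && e2 <= last1;
}
```

With a vectorization factor of 4, the vls starting at elements 0 and 3 overlap (they share element 3), whereas those starting at elements 0 and 4 do not. Under these assumptions the test is equivalent to checking that the two start indices differ by less than the vectorization factor.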

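Definition 3. We say that two vls $\overrightarrow{a[e_1]}$ and $\overrightarrow{a[e_2]}$ are transitively-overlapped, denoted $\overrightarrow{a[e_1]} \sqcap^* \overrightarrow{a[e_2]}$, if and only if $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_2]}$ or there exists $\overrightarrow{a[e_3]}$ such that $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_3]}$ and $\overrightarrow{a[e_3]} \sqcap^* \overrightarrow{a[e_2]}$.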

It is important to note that some SIMD architectures with alignment constraints need two aligned vl instructions to perform one unaligned vl. Therefore, the effective aligned vls must be considered in those architectures when evaluating whether two vls overlap.
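One possible reading of this remark is sketched below in C. The sketch assumes that element 0 of the array is vector-aligned, that start indices are non-negative, and that an unaligned vl is realized as the two aligned vls covering it; the function names `effective_aligned_loads` and `effective_overlap` are hypothetical, and the way a real architecture combines the two aligned loads is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Writes into out[] the first element index of each aligned load needed
 * to cover a[e : e + vf - 1] and returns how many there are (1 or 2).
 * Assumptions: e >= 0 and element 0 of the array is vector-aligned. */
int effective_aligned_loads(long e, int vf, long out[2])
{
    assert(vf > 0 && e >= 0);
    long first = (e / vf) * vf;   /* aligned block containing a[e] */
    out[0] = first;
    if (e % vf == 0)
        return 1;                 /* already aligned: a single load */
    out[1] = first + vf;          /* unaligned: the next block is needed too */
    return 2;
}

/* Overlap test on the effective aligned loads: two aligned blocks of the
 * same size share elements only if they are the same block. */
bool effective_overlap(long e1, long e2, int vf)
{
    long b1[2], b2[2];
    int n1 = effective_aligned_loads(e1, vf, b1);
    int n2 = effective_aligned_loads(e2, vf, b2);
    for (int i = 0; i < n1; i++)
        for (int j = 0; j < n2; j++)
            if (b1[i] == b2[j])
                return true;
    return false;
}
```

With a vectorization factor of 4, the unaligned vls starting at elements 1 and 5 read disjoint elements (1 to 4 and 5 to 8), yet both need the aligned block starting at element 4, so their effective aligned vls overlap under this model.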

5.3.3 Overlap Groups

We describe several sets of vls that are necessary to define the concept of an overlap group and its construction.

Given loop $V$, we define the set of loops $\mathcal{L}_V = \{V\} \cup \{\text{all loops in } S_V\}$. Then, for a loop $L \in \mathcal{L}_V$, we define:

$$A_{L,a} = \{\text{all vls on the array } a \text{ in the body of } L\} \tag{5.3}$$

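Moreover, the set of vls on the array $a$ directly nested in the loop $L$ is defined as follows:

$$D_{L,a} = A_{L,a} - \bigcup_{L' \in \mathcal{L}_L \setminus \{L\}} A_{L',a} \tag{5.4}$$

where $\mathcal{L}_L$ is the set formed by $L$ and all the loops nested in it.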


In the Moving Average example, let $L_o$ and $L_i$ denote the outer loop and the inner loop from Figure 5.1b, respectively. Then, $\mathcal{L}_{L_o} = \{L_o, L_i\}$ and $\mathcal{L}_{L_i} = \{L_i\}$. Focusing on the array $a$, the sets $A_{L_o,a}$, $A_{L_i,a}$, $D_{L_o,a}$ and $D_{L_i,a}$ are defined as:

$$A_{L_o,a} = A_{L_i,a} = \{\overrightarrow{\mathtt{a[i+j]}}\}$$

$$D_{L_o,a} = A_{L_o,a} - A_{L_i,a} = \emptyset \qquad D_{L_i,a} = A_{L_i,a}$$

In the Jacobi 2D example, the inner loop is the only loop affected by vectorization. Let $L_i$ denote the inner loop from Figure 5.4b, so that $\mathcal{L}_{L_i} = \{L_i\}$. Hence, targeting only the vls on array $u$, $A_{L_i,u}$ and $D_{L_i,u}$ are defined as:

$$A_{L_i,u} = \{\overrightarrow{\mathtt{u[i*sizey+j]}},\ \overrightarrow{\mathtt{u[i*sizey+j+1]}},\ \overrightarrow{\mathtt{u[i*sizey+j+2]}},\ \overrightarrow{\mathtt{u[(i-1)*sizey+j+1]}},\ \overrightarrow{\mathtt{u[(i+1)*sizey+j+1]}}\}$$

$$D_{L_i,u} = A_{L_i,u}$$

Definition 4. (Overlap Group). An overlap group $O_{L,a}$ is an equivalence class of the equivalence relation $\sqcap^*$ on $D_{L,a}$.

The $i$-th overlap group is denoted by $O_{L,a,i}$. We call $P_{L,a}$ the quotient set of $D_{L,a}$ by the equivalence relation $\sqcap^*$.
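The construction of the overlap groups can be illustrated by merging every pair of directly overlapping vls with a union-find structure, which yields the transitive closure of the overlap relation. The C sketch below is one possible way to do this, not necessarily the one used in our implementation: it works on concrete start indices rather than symbolic expressions, and `MAX_VLS`, `find_root` and `overlap_groups` are hypothetical names.

```c
#include <assert.h>

#define MAX_VLS 32   /* arbitrary capacity chosen for this sketch */

/* Union-find lookup with path halving. */
int find_root(int parent[], int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Partitions n vls, given by their first element index e[k], into overlap
 * groups. On return, group[k] holds the group index of the k-th vl, with
 * indices assigned in order of first appearance. Returns the group count. */
int overlap_groups(const long e[], int n, int vf, int group[])
{
    assert(n >= 0 && n <= MAX_VLS && vf > 0);
    int parent[MAX_VLS];
    int label[MAX_VLS];
    for (int k = 0; k < n; k++) {
        parent[k] = k;
        label[k] = -1;
    }
    /* Merge every directly overlapping pair (start indices closer than vf);
     * transitivity then follows from the union-find structure. */
    for (int a = 0; a < n; a++)
        for (int b = a + 1; b < n; b++) {
            long d = e[a] - e[b];
            if (d < vf && -d < vf)
                parent[find_root(parent, a)] = find_root(parent, b);
        }
    int count = 0;
    for (int k = 0; k < n; k++) {
        int r = find_root(parent, k);
        if (label[r] < 0)
            label[r] = count++;
        group[k] = label[r];
    }
    return count;
}
```

As a usage example, taking the five vls on array u of the Jacobi code with the assumed concrete values sizey = 64, i = 1, j = 0 and a vectorization factor of 8 gives the start indices 64, 65, 66, 1 and 129. The sketch places the first three in one group and each of the remaining two in its own group, for three groups in total.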

Depending on how vls overlap in the group, an overlap group can be of one or both of the following non-disjoint kinds:

intra: $|O_{L,a,i}| > 1$, i.e., there exist at least two vls in $O_{L,a,i}$ that overlap in the same iteration.

inter: There exist $w$ and $w'$ in $O_{L,a,i}$ that overlap across iterations, i.e., $w \sqcap (w' / IV_L \mapsto IV_L + ST_L)$, where $w' / IV_L \mapsto IV_L + ST_L$ denotes $w'$ with $IV_L$ replaced by $IV_L + ST_L$. Note that $w$ and $w'$ may be the same vl.

If the overlap group does not satisfy any of these properties, its kind is unknown.
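The classification above can be sketched in C as follows. The sketch assumes that every vl of the group is identified by its first element index in the current iteration and that advancing the induction variable of the loop by its step shifts that index by the same amount (the access is stride-one in that induction variable). The names `KIND_UNKNOWN`, `KIND_INTRA`, `KIND_INTER` and `overlap_group_kind` are hypothetical.

```c
#include <assert.h>

enum { KIND_UNKNOWN = 0, KIND_INTRA = 1, KIND_INTER = 2 };

/* Returns a bit mask with the kinds of an overlap group whose n vls start
 * at the element indices e[0..n-1]. step is the step of the loop L that
 * contains the group. */
int overlap_group_kind(const long e[], int n, int vf, int step)
{
    assert(n >= 1 && vf > 0 && step > 0);
    int kind = KIND_UNKNOWN;
    /* intra: the group holds at least two vls. */
    if (n > 1)
        kind |= KIND_INTRA;
    /* inter: some w overlaps some w2 taken one iteration of L later,
     * i.e., with its start index shifted by step. w and w2 may coincide. */
    for (int w = 0; w < n; w++)
        for (int w2 = 0; w2 < n; w2++) {
            long d = e[w] - (e[w2] + step);
            if (d < vf && -d < vf)
                kind |= KIND_INTER;
        }
    return kind;
}
```

For instance, with a vectorization factor of 8 and a loop step of 8, a group with start indices 0, 1 and 2 is both intra and inter, while a group with a single vl has unknown kind. The same single vl becomes inter when the loop step is 1, because consecutive iterations then read elements in common.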

Definition 5. (Overlapped Vector Memory Load). A vl in $O_{L,a,i}$ is an overlapped vector memory load, denoted ovl, if and only if $O_{L,a,i}$ is of kind inter and/or intra.

The length of each overlap group is computed as:

$$\mathrm{length}(O_{L,a,i}) = M - m + 1 \tag{5.5}$$

where

$$m = \min\{\, l \mid \forall r \in O_{L,a,i},\ r = a[l : u] \,\}$$

$$M = \max\{\, u \mid \forall r \in O_{L,a,i},\ r = a[l : u] \,\}$$
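The length computation can be illustrated with a short C sketch. It assumes that each vl of the group is given by its first element index and reads a contiguous run of elements whose size is the vectorization factor; the function name `overlap_group_length` is hypothetical.

```c
#include <assert.h>

/* Length of an overlap group: number of elements between the lowest and
 * the highest element read by any vl of the group, both included. */
long overlap_group_length(const long e[], int n, int vf)
{
    assert(n >= 1 && vf > 0);
    long m = e[0];             /* lowest first element (m in the text)  */
    long M = e[0] + vf - 1;    /* highest last element (M in the text)  */
    for (int k = 1; k < n; k++) {
        if (e[k] < m)
            m = e[k];
        if (e[k] + vf - 1 > M)
            M = e[k] + vf - 1;
    }
    return M - m + 1;
}
```

With a vectorization factor of 8, a group whose vls start at elements 0, 1 and 2 spans elements 0 to 9 and therefore has length 10, i.e., the vectorization factor plus 2. A group with a single vl has a length equal to the vectorization factor.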

Again, in the Moving Average example, we apply the transitively-overlapped relation to the only non-empty set, $D_{L_i,a}$. It results in one inter overlap group, $O_{L_i,a,0} = \{\overrightarrow{\mathtt{a[i+j]}}\}$. In the Jacobi 2D example, we apply the transitively-overlapped relation to the set $D_{L_i,u}$, which generates three overlap groups:

$$O_{L_i,u,0} = \{\overrightarrow{\mathtt{u[i*sizey+j]}},\ \overrightarrow{\mathtt{u[i*sizey+j+1]}},\ \overrightarrow{\mathtt{u[i*sizey+j+2]}}\}$$

$$O_{L_i,u,1} = \{\overrightarrow{\mathtt{u[(i-1)*sizey+j+1]}}\}$$

$$O_{L_i,u,2} = \{\overrightarrow{\mathtt{u[(i+1)*sizey+j+1]}}\}$$

where the group $O_{L_i,u,0}$ is of both intra and inter kind, and groups $O_{L_i,u,1}$ and $O_{L_i,u,2}$ have unknown kind because their cardinality is 1 (no intra kind) and their single vl does not overlap across iterations (no inter kind). The cardinalities and lengths of the resulting overlap groups are:

$$|O_{L_i,u,0}| = 3 \qquad |O_{L_i,u,1}| = 1 \qquad |O_{L_i,u,2}| = 1$$

$$\mathrm{length}(O_{L_i,u,0}) = VF + 2$$

$$\mathrm{length}(O_{L_i,u,1}) = \mathrm{length}(O_{L_i,u,2}) = VF$$

Since $O_{L_i,u,1}$ and $O_{L_i,u,2}$ have unknown kind, neither overlap group contains ovls.
