In this section, we describe a framework that formally introduces the concepts of overlap group and overlapped vector memory load, together with the other definitions used in our proposal. To offer a practical point of view, we apply the formalization to the Moving Average code shown in Figure 5.1 and to the Jacobi code shown in Figure 5.4a.
5.3. Formal Framework and Definitions 111
1 for (i=1; i<=sizex-2; i++){
2 #pragma omp simd aligned(u+1,utmp+1:64) suitable(sizey:16) reduction(+:sum)
3 for (j=0; j<=sizey-3; j++){
4 float tmp = 0.25f * (u[i*sizey+j] + u[i*sizey+j+2] + /*left + right +*/
5 u[(i-1)*sizey+j+1] + u[(i+1)*sizey+j+1]); /*top + bottom */
6 float diff = tmp - u[i*sizey+j+1]; /*center*/
7 utmp[i*sizey + j+1] = tmp;
8 sum += diff * diff;
9 }
10 }
(a) Scalar code with a standalone SIMD directive on the inner loop. The lower bound of the inner loop has been normalized to 0
1 for (i=1; i<=sizex-2; i++){
2 floatVF vsum = vpromotionVF(0.0f);
3 for (j=0; j<=sizey-VF-2; j+=VF){
4 floatVF tmp = vpromotionVF(0.25f) *
5 (unaligned_vloadVF(&u[i*sizey+j]) + unaligned_vloadVF(&u[i*sizey+j+2]) +
6 aligned_vloadVF(&u[(i-1)*sizey+j+1]) +
7 aligned_vloadVF(&u[(i+1)*sizey+j+1]));
8 floatVF diff = tmp - aligned_vloadVF(&u[i*sizey+j+1]);
9 aligned_vstoreVF(&utmp[i*sizey+j+1], tmp);
10 vsum += diff * diff;
11 }
12 sum += vhorizontal_reductionVF(vsum);
13 }
(b) Vector pseudo-code after the inner loop vectorization. VF is the vectorization factor
Figure 5.4: Code of a 2D Jacobi solver for heat diffusion
5.3.1 Preliminaries
We use a[i] to denote a scalar access to the i-th element of the array a, and a[l : u] to designate the set of scalar accesses to a[j], $\forall j,\ l \le j \le u$. Let K be a scalar loop defined as:
$$K : IV_K = \langle LB_K, UB_K, ST_K, S_K \rangle \qquad (5.1)$$
where $IV_K$, $LB_K$, $UB_K$, $ST_K$ and $S_K$ are the induction variable, the inclusive lower and upper bounds, the step and the scalar statements of K, respectively. Let us assume that the loop has been normalized, i.e., $LB_K = 0$ and $ST_K = 1$, and that array accesses have been linearized [81].
We define V as the vectorized version of K following a strip-mining/unroll-and-jam vectorization approach, such as the one introduced in Chapter 3, such that:
$$V : IV_V = \langle LB_V, UB_V, ST_V, S_V \rangle \qquad (5.2)$$
where $LB_V = LB_K$, $ST_V = VF_V$ (the vectorization factor, i.e., the number of scalar iterations of K computed within each vector iteration of V), $UB_V = UB_K - VF_V + 1$ and $S_V$ is the set of statements obtained from the vectorization of $S_K$.
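As a rough illustration of (5.2), the following C sketch derives the bounds of V from those of a normalized K and counts the resulting vector iterations. The type and function names are hypothetical. Running the scalar iterations that V does not cover in a separate epilogue is an assumption of this sketch; the section does not describe it.

```c
#include <assert.h>

/* Hypothetical descriptor of a loop: inclusive bounds and step. */
typedef struct {
    long lb;  /* inclusive lower bound */
    long ub;  /* inclusive upper bound */
    long st;  /* step */
} loop_bounds;

/* Bounds of V for a normalized K (lb = 0, st = 1) and vectorization
   factor vf: LB_V = LB_K, UB_V = UB_K - vf + 1, ST_V = vf. */
loop_bounds vector_loop_bounds(loop_bounds k, long vf)
{
    assert(k.lb == 0 && k.st == 1 && vf > 0);
    loop_bounds v = { k.lb, k.ub - vf + 1, vf };
    return v;
}

/* Number of iterations executed by a loop with inclusive bounds. */
long iteration_count(loop_bounds l)
{
    assert(l.st > 0);
    if (l.ub < l.lb)
        return 0;
    return (l.ub - l.lb) / l.st + 1;
}

/* Scalar iterations of K not covered by V (assumed to run in an epilogue). */
long leftover_iterations(loop_bounds k, long vf)
{
    loop_bounds v = vector_loop_bounds(k, vf);
    return iteration_count(k) - iteration_count(v) * vf;
}
```

For instance, with sizey = 32 and VF = 8, the inner Jacobi loop K = ⟨0, 29, 1⟩ yields V = ⟨0, 22, 8⟩, which matches the bound sizey − VF − 2 used in Figure 5.4b.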
Definition 1. (Stride-one Vector Memory Access). $\overrightarrow{a[e]}$ denotes the stride-one vector memory access, henceforth svma, of a scalar access a[e] where e is of the form $e = IV_K + c$ and c is invariant in K. This kind of vector access performs the set of scalar memory accesses $a[e : e + VF_V - 1]$ in a vector way.
A stride-one vector memory load, denoted vl, is an svma that reads data from memory. From now on, we limit our description to vls, although it can be easily extended to stride-one vector memory stores. However, our proposal has limited applicability to stores in real applications.
In the Moving Average example, the outer loop in Figure 5.1a is $K : i = \langle 0, N - points - 1, 1, \{\text{lines } 3\text{-}10\} \rangle$, the outer loop in Figure 5.1b is $V : i = \langle 0, N - points - VF, VF, \{\text{lines } 2\text{-}10\} \rangle$ and $\overrightarrow{a[i+j]}$ is a vl. In the Jacobi 2D example, the inner loop in Figure 5.4a is $K : j = \langle 0, sizey - 3, 1, \{\text{lines } 3\text{-}9\} \rangle$, the inner loop in Figure 5.4b is $V : j = \langle 0, sizey - VF - 2, VF, \{\text{lines } 3\text{-}11\} \rangle$, the accesses to array u are vls and the access to array utmp is a stride-one vector memory store.
5.3.2 Overlap Relations
The following definitions are used to establish an equivalence relation among vls of each particular array in V .
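Definition 2. We say that two vls $\overrightarrow{a[e_1]}$ and $\overrightarrow{a[e_2]}$ overlap, denoted $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_2]}$, if and only if $\{a[i] \mid i \in e_1 \ldots e_1 + VF_V - 1\} \cap \{a[j] \mid j \in e_2 \ldots e_2 + VF_V - 1\} \neq \emptyset$.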
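The overlap test of Definition 2 reduces to an interval intersection. The sketch below assumes that both vls access the same array and that their first element indices have been evaluated to concrete values for the same iteration. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A vl is modelled by the index of its first element (the value of e);
   it covers the inclusive range [e, e + vf - 1]. Two vls overlap when
   these two ranges share at least one element. */
bool vls_overlap(long e1, long e2, long vf)
{
    assert(vf > 0);
    long lo = e1 > e2 ? e1 : e2;              /* larger first element */
    long hi = (e1 < e2 ? e1 : e2) + vf - 1;   /* smaller last element */
    return lo <= hi;                          /* non-empty intersection */
}
```

With VF = 8, two vls starting at elements 0 and 2 overlap, whereas vls starting at elements 0 and 8 do not, since the first one ends at element 7.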
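Definition 3. We say that two vls $\overrightarrow{a[e_1]}$ and $\overrightarrow{a[e_2]}$ are transitively-overlapped, denoted $\overrightarrow{a[e_1]} \sqcap^* \overrightarrow{a[e_2]}$, if and only if $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_2]}$ or there exists $\overrightarrow{a[e_3]}$ such that $\overrightarrow{a[e_1]} \sqcap \overrightarrow{a[e_3]}$ and $\overrightarrow{a[e_3]} \sqcap^* \overrightarrow{a[e_2]}$.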
It is important to note that some SIMD architectures with alignment constraints need two aligned vl instructions to perform one unaligned vl. On those architectures, the effective aligned vls must be considered when evaluating whether two vls overlap.
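One possible way to account for the effective aligned vls is sketched below. It assumes that the array base is aligned to the vector size, that aligned vls start at element indices that are multiples of VF, and that indices are non-negative. These assumptions, and the names used, are ours and do not describe any particular architecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Inclusive element range touched by the aligned vls that implement a vl. */
typedef struct {
    long first;
    long last;
} elem_range;

/* A vl starting at element e touches one aligned block when e is a
   multiple of vf, and two consecutive aligned blocks otherwise. */
elem_range effective_range(long e, long vf)
{
    assert(vf > 0 && e >= 0);
    elem_range r;
    r.first = (e / vf) * vf;                        /* start of first block */
    r.last  = ((e + vf - 1) / vf) * vf + vf - 1;    /* end of last block */
    return r;
}

/* Overlap test evaluated on the effective aligned ranges. */
bool effective_vls_overlap(long e1, long e2, long vf)
{
    elem_range a = effective_range(e1, vf);
    elem_range b = effective_range(e2, vf);
    return a.first <= b.last && b.first <= a.last;
}
```

Under these assumptions and with VF = 8, the vls starting at elements 1 and 9 do not overlap according to Definition 2, yet their effective aligned ranges [0, 15] and [8, 23] share the aligned block [8, 15].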
5.3.3 Overlap Groups
We describe several sets of vls that are necessary to define the concept of an overlap group and its construction.
Given loop V, we define the set of loops $\mathcal{L}_V = \{V\} \cup \{\text{all loops in } S_V\}$. Then, for a loop $L \in \mathcal{L}_V$, we define:
$$A_{L,a} = \{\text{all vls on the array } a \text{ in the body of } L\} \qquad (5.3)$$
Moreover, the set of vls on the array a directly nested in the loop L is defined as follows:
In the Moving Average example, let $L_o$ and $L_i$ denote the outer loop and the inner loop from Figure 5.1b, respectively. Then, $\mathcal{L}_{L_o} = \{L_o, L_i\}$ and $\mathcal{L}_{L_i} = \{L_i\}$. Focusing on the array a, the sets $A_{L_o,a}$, $A_{L_i,a}$, $D_{L_o,a}$ and $D_{L_i,a}$ are defined as:
$$A_{L_o,a} = A_{L_i,a} = \{\overrightarrow{a[i+j]}\}$$
$$D_{L_o,a} = A_{L_o,a} - A_{L_i,a} = \emptyset \qquad D_{L_i,a} = A_{L_i,a}$$
In the Jacobi 2D example, the inner loop is the only loop affected by vectorization. Let $L_i$ denote the inner loop from Figure 5.4b, so that $\mathcal{L}_{L_i} = \{L_i\}$. Targeting only the vls on array u, $A_{L_i,u}$ and $D_{L_i,u}$ are defined as:
$$A_{L_i,u} = \{\overrightarrow{\texttt{u[i*sizey+j]}},\ \overrightarrow{\texttt{u[i*sizey+j+1]}},\ \overrightarrow{\texttt{u[i*sizey+j+2]}},\ \overrightarrow{\texttt{u[(i-1)*sizey+j+1]}},\ \overrightarrow{\texttt{u[(i+1)*sizey+j+1]}}\}$$
$$D_{L_i,u} = A_{L_i,u}$$
Definition 4. (Overlap Group). An overlap group $O_{L,a}$ is an equivalence class of the equivalence relation $\sqcap^*$ on $D_{L,a}$.
The i-th overlap group is denoted by $O_{L,a,i}$. We denote by $P_{L,a}$ the quotient set of $D_{L,a}$ by $\sqcap^*$.
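A minimal sketch of how the quotient set could be computed is given below, using a union-find structure over the vls of one array. It assumes that each vl is represented by the concrete index of its first element for a fixed iteration. The capacity MAX_VLS and all names are hypothetical choices made for this sketch.

```c
#include <assert.h>
#include <stdlib.h>  /* labs */

#define MAX_VLS 64   /* arbitrary capacity chosen for this sketch */

/* Returns the representative of x, shortening the path on the way. */
static int find_root(int *parent, int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Partitions n vls on one array into overlap groups. e[k] is the first
   element index of the k-th vl; group[k] receives its group number,
   assigned in order of first appearance. Returns the number of groups. */
int build_overlap_groups(const long *e, int n, long vf, int *group)
{
    assert(n >= 0 && n <= MAX_VLS && vf > 0);
    int parent[MAX_VLS];
    int id_of_root[MAX_VLS];

    for (int k = 0; k < n; k++) {
        parent[k] = k;
        id_of_root[k] = -1;
    }

    /* Merge every pair of directly overlapping vls; the union-find
       structure provides the transitive closure. */
    for (int a = 0; a < n; a++) {
        for (int b = a + 1; b < n; b++) {
            if (labs(e[a] - e[b]) < vf) {
                int ra = find_root(parent, a);
                int rb = find_root(parent, b);
                if (ra != rb)
                    parent[rb] = ra;
            }
        }
    }

    /* Number the equivalence classes. */
    int ngroups = 0;
    for (int k = 0; k < n; k++) {
        int r = find_root(parent, k);
        if (id_of_root[r] < 0)
            id_of_root[r] = ngroups++;
        group[k] = id_of_root[r];
    }
    return ngroups;
}
```

For the Jacobi vls on u, taking sizey = 32, VF = 8, i = 1 and j = 0 gives the first element indices {32, 33, 34, 1, 65}, which this sketch partitions into three groups: {32, 33, 34}, {1} and {65}.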
Depending on how vls overlap in the group, an overlap group can be of one or both of the following non-disjoint kinds:
intra: $|O_{L,a,i}| > 1$, i.e., there exist at least two vls in $O_{L,a,i}$ that overlap in the same iteration.
inter: There exist $w$ and $w'$ in $O_{L,a,i}$ that overlap across iterations, i.e., $w \sqcap (w'/IV_L \mapsto IV_L + ST_L)$. Note that $w$ and $w'$ may be the same vl.
If the overlap group does not satisfy any of these properties, its kind is unknown.
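The two kinds can be checked mechanically. The sketch below takes the vls of one overlap group, each represented by the concrete index of its first element, and models the substitution of $IV_L$ by $IV_L + ST_L$ as adding the step to that index. This presumes that the induction variable appears with coefficient one in the index, as in Definition 1. The names and the flag encoding are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>  /* labs */

/* Kinds are encoded as bit flags because a group can be of both kinds. */
enum { KIND_UNKNOWN = 0, KIND_INTRA = 1, KIND_INTER = 2 };

/* e[k] is the first element index of the k-th vl of the group, vf the
   vectorization factor and st the step of the loop L. */
int overlap_group_kind(const long *e, int n, long vf, long st)
{
    assert(n > 0 && vf > 0 && st > 0);
    int kind = KIND_UNKNOWN;

    /* intra: the group holds more than one vl. */
    if (n > 1)
        kind |= KIND_INTRA;

    /* inter: some w overlaps some w2 taken one iteration later
       (w and w2 may be the same vl). */
    for (int w = 0; w < n; w++)
        for (int w2 = 0; w2 < n; w2++)
            if (labs(e[w] - (e[w2] + st)) < vf)
                kind |= KIND_INTER;

    return kind;
}
```

With VF = 8 and a step of 8, a group with first element indices {32, 33, 34} is both intra and inter, while a group holding a single vl is of unknown kind, because shifting the vl by the step moves it past its own last element.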
Definition 5. (Overlapped Vector Memory Load). A vl in $O_{L,a,i}$ is an overlapped vector memory load, denoted ovl, if and only if $O_{L,a,i}$ is of kind inter and/or intra.
The length of each overlap group is computed as:
$$\mathrm{length}(O_{L,a,i}) = M - m + 1 \qquad (5.5)$$
where
$$m = \min\,\{\, l \mid \forall r \in O_{L,a,i},\ r = a[l : u] \,\}$$
$$M = \max\,\{\, u \mid \forall r \in O_{L,a,i},\ r = a[l : u] \,\}$$
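Equation (5.5) can be evaluated with a single pass over the vls of the group, as in the sketch below. Each vl is again represented by the concrete index of its first element, and the function name is hypothetical.

```c
#include <assert.h>

/* Length of an overlap group: e[k] is the first element index of the
   k-th vl, which covers the inclusive range [e[k], e[k] + vf - 1]. */
long overlap_group_length(const long *e, int n, long vf)
{
    assert(n > 0 && vf > 0);
    long m = e[0];            /* smallest first element */
    long M = e[0] + vf - 1;   /* largest last element */
    for (int k = 1; k < n; k++) {
        if (e[k] < m)
            m = e[k];
        if (e[k] + vf - 1 > M)
            M = e[k] + vf - 1;
    }
    return M - m + 1;
}
```

With VF = 8, a group whose vls start at elements 32, 33 and 34 spans elements 32 to 41 and therefore has length 10, i.e., VF + 2.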
Again, in the Moving Average example, we apply the transitively-overlapped relation to the only non-empty set, $D_{L_i,a}$. It results in one inter overlap group. In the Jacobi 2D example, we apply the transitively-overlapped relation to the set $D_{L_i,u}$, which generates three overlap groups:
$$O_{L_i,u,0} = \{\overrightarrow{\texttt{u[i*sizey+j]}},\ \overrightarrow{\texttt{u[i*sizey+j+1]}},\ \overrightarrow{\texttt{u[i*sizey+j+2]}}\}$$
$$O_{L_i,u,1} = \{\overrightarrow{\texttt{u[(i-1)*sizey+j+1]}}\}$$
$$O_{L_i,u,2} = \{\overrightarrow{\texttt{u[(i+1)*sizey+j+1]}}\}$$
where the group $O_{L_i,u,0}$ is of both intra and inter kind, and the groups $O_{L_i,u,1}$ and $O_{L_i,u,2}$ are of unknown kind because their cardinality is 1 (no intra kind) and their single vl does not overlap across iterations (no inter kind). The cardinalities and lengths of the resulting overlap groups are:
$$|O_{L_i,u,0}| = 3 \qquad |O_{L_i,u,1}| = 1 \qquad |O_{L_i,u,2}| = 1$$
$$\mathrm{length}(O_{L_i,u,0}) = VF + 2$$
$$\mathrm{length}(O_{L_i,u,1}) = \mathrm{length}(O_{L_i,u,2}) = VF$$
Since $O_{L_i,u,1}$ and $O_{L_i,u,2}$ are of unknown kind, neither overlap group contains ovls.
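The kinds and lengths stated for the Jacobi 2D example can be checked by expanding the element ranges. The derivation below is a sketch that writes $b = i \cdot sizey + j$ for brevity (a symbol introduced only here) and assumes $VF \ge 2$.

```latex
% Group O_{L_i,u,0}: inter kind, taking w = u[b+2] and w' = u[b].
\begin{align*}
  w &= u[\,b+2 : b+VF+1\,] \\
  w'/j \mapsto j+VF &= u[\,b+VF : b+2VF-1\,] \\
  w \cap (w'/j \mapsto j+VF) &= \{\,u[b+VF],\ u[b+VF+1]\,\} \neq \emptyset
\end{align*}

% Group O_{L_i,u,0}: length from (5.5), with m = b and M = b+VF+1.
\begin{align*}
  \mathrm{length}(O_{L_i,u,0}) &= (b+VF+1) - b + 1 = VF + 2
\end{align*}

% Group O_{L_i,u,1}: no inter kind, with b' = (i-1) \cdot sizey + j.
\begin{align*}
  w &= u[\,b'+1 : b'+VF\,] \\
  w/j \mapsto j+VF &= u[\,b'+VF+1 : b'+2VF\,] \\
  w \cap (w/j \mapsto j+VF) &= \emptyset
\end{align*}
```

The same argument applies to $O_{L_i,u,2}$ with $b' = (i+1) \cdot sizey + j$.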