Rule-Based Program Transformation for Hybrid Architectures

CSW Workshop: Towards Portable Libraries for Hybrid Systems

(1)

M. Carro¹,², S. Tamarit², G. Vigueras¹, J. Mariño²

(2)

General Observation

• Cost: develop and maintain code adequate for hybrid architecture(s).

Every sub-platform has different approaches and needs.

• Libraries: unified API, code adapted / optimized for some architecture.

• But:

1 Optimizations across library boundaries difficult.

2 Maintaining code for several platforms costly.

3 Porting among platforms costly.

• What is the right balance?

• We take the extreme position:

Optimize all you can (source code needed). As automatically as possible.

(3)

Overview

• Source-to-source transformation of procedural code.

• Semantics-preserving.

• Modifying certain non-functional characteristics: speed, number of (FP) instructions, cache hits, communication, placement of kernels, . . .

• Performed by applying transformation rules.

• Transformation rules:

• Match fragments to transform.

• Identify what they should be transformed into.

• Generate new code when certain conditions are met.

• Conditions captured with annotations:

• Algorithmic structure of code.

(4)

Transformation Scheme

[Figure: Initial code → Preparation → Ready code → Translation → Translated code for GPGPU (OpenCL), OpenMP, MPI, FPGA (MaxJ)]

(5)

Transformation Scheme

Preparation (initial code → ready code):

• Generic C → generic C.

• Structural changes.

• Platform-independent.

• With a code shape adequate for some target platform.

(6)

Transformation Scheme

Translation (ready code → translated code):

• Rewrites generic code to different target architectures.

• Uses procedural-level information.

(7)

Running Example: RGB Filter

(8)

Running Example: RGB Filter

void kernelRedFilter(...) {
    for (j = 0; j < width; j += 3)
        // Generate one row of the output image
}

void rgbImageFilter(...) {
    for (i = 0; i < height; i++)
        kernelRedFilter(...);
    // Same for green and blue
}

int main(...) {
    // Read image
    rgbImageFilter(...);
    // Write images
}
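The elided bodies can be filled in as a minimal sketch under two assumptions that the slides do not state: the image is stored as interleaved 8-bit RGB (so a row has `rawWidth` = 3 × pixels bytes, a multiple of 3), and the red filter keeps the R channel while zeroing G and B. The parameter lists are likewise assumptions; only the call shape `kernelRedFilter(image, i, rawWidth, redImage)` appears in the slides.

```c
#include <stddef.h>

/* Assumed layout: interleaved RGB, one byte per channel,
   rawWidth = 3 * (pixels per row). Not specified in the slides. */
void kernelRedFilter(const unsigned char *image, int row, int rawWidth,
                     unsigned char *redImage)
{
    size_t base = (size_t)row * (size_t)rawWidth;
    for (int j = 0; j < rawWidth; j += 3) {        /* one pixel per step */
        redImage[base + j]     = image[base + j];  /* keep R */
        redImage[base + j + 1] = 0;                /* drop G */
        redImage[base + j + 2] = 0;                /* drop B */
    }
}

/* One call per row; rows are independent of each other. */
void rgbImageFilter(const unsigned char *image, int height, int rawWidth,
                    unsigned char *redImage)
{
    for (int i = 0; i < height; i++)
        kernelRedFilter(image, i, rawWidth, redImage);
    /* Green and blue outputs would be produced the same way. */
}
```

The step of 3 in the inner loop is what later makes iteration step normalization relevant.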

(9)

Annotated Code

#pragma polca kernel opencl

#pragma polca map image redImage

for (i = 0; i < height; i++)

kernelRedFilter(image, i, rawWidth, redImage);

• Hint: farming out to a GPGPU.

• Information regarding target architecture: either from programmer or from analysis of architecture / task graph.

• Annotations: what does the code block do?

(10)

Information from Annotations

What map tells us

#pragma polca map image redImage

for (i = 0; i < height; i++)

kernelRedFilter(image, i, rawWidth, redImage);

• Input image, output redImage.

• For every image[i], redImage[i] is produced using only image[i].

• No global variables, no dependencies across iterations.

• Computation of redImage[i] enclosed inside kernelRedFilter().

⇒ Pragma as summary of simpler properties.
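One consequence of these properties is that the iterations may run in any order (or concurrently) without changing the result. The sketch below illustrates this with a hypothetical `rowKernel` standing in for `kernelRedFilter()`; the array sizes and the arithmetic inside the kernel are made up for the example.

```c
#include <string.h>

enum { ROWS = 4, COLS = 3 };

/* Hypothetical kernel: output row i depends only on input row i,
   touches no globals, and carries no state between calls. */
static void rowKernel(const int in[ROWS][COLS], int i, int out[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        out[i][j] = 2 * in[i][j] + 1;
}

/* Iterations in increasing order, as in the annotated loop. */
void mapForward(const int in[ROWS][COLS], int out[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        rowKernel(in, i, out);
}

/* Same iterations in reverse order; valid only because of the map properties. */
void mapBackward(const int in[ROWS][COLS], int out[ROWS][COLS])
{
    for (int i = ROWS - 1; i >= 0; i--)
        rowKernel(in, i, out);
}

/* Returns 1 when both schedules produce identical output. */
int sameResult(const int in[ROWS][COLS])
{
    int a[ROWS][COLS], b[ROWS][COLS];
    mapForward(in, a);
    mapBackward(in, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

A loop whose body read `out[i - 1]` would break this equivalence, which is the kind of dependency the `map` annotation rules out.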

(11)

www.imdea.org

Transformation rules

• Match input code, check conditions, generate output.

• Focused onproceduralproperties / structures.

(Simplified) Transformation Rule: Simplify Loops

one_iteration_active {
  pattern: {
    for (cexpr(i) = cexpr(ini); cexpr(i) < cexpr(n); cexpr(i)++)
      if (cexpr(i) == cexpr(other))
        cstmt(one_stat);
  }
  generate: {
    subs(cstmt(one_stat), cexpr(i), cexpr(other));
  }
}


(13)

Transformation rules

(Simplified) Transformation Rule: Simplify Loops

Applying one_iteration_active to an example:

for (i = 0; i < N; i++)
  if (i == k)
    c = v[i];

⇓

c = v[k];
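The before and after shapes of this rule can be written as plain C functions. This is a sketch with made-up function names; it also assumes that `k` lies inside the loop range, a condition the simplified rule does not show but which seems necessary, since the original loop does nothing when `k` is out of range. `selectGuarded` makes that assumption explicit.

```c
/* Original shape matched by the rule: only iteration i == k does work. */
int selectWithLoop(const int *v, int n, int k, int fallback)
{
    int c = fallback;
    for (int i = 0; i < n; i++)
        if (i == k)
            c = v[i];
    return c;
}

/* Shape after the rewrite: the loop is gone and i is replaced by k.
   Assumes 0 <= k < n; otherwise this reads out of bounds. */
int selectDirect(const int *v, int k)
{
    return v[k];
}

/* Variant equivalent to the loop for every k, in range or not. */
int selectGuarded(const int *v, int n, int k, int fallback)
{
    return (k >= 0 && k < n) ? v[k] : fallback;
}
```

The rewrite turns O(n) loop iterations into a single access, which is where the saving comes from.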

(14)

Rule Chaining

• Rule chaining: result of rule can be input for another rule.

• Generation of OpenCL in example: loop fusion (2 times) + inlining (3 times) + loop fusion (2 times) + iteration step normalization + loop collapse

• Search space.

• Use metrics on non-functional properties to drive search.

• Target-dependent.
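Two of the steps named in the chain, iteration step normalization and loop collapse, can be illustrated on a small nested loop. This is a hand-written sketch on a hypothetical loop body, not output of the tool; the sizes `H` and `W` are chosen for the example.

```c
enum { H = 3, W = 6 };   /* W is assumed to be a multiple of 3 */

/* Starting shape: outer loop over rows, inner loop with step 3. */
void original(const int *in, int *out)
{
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j += 3)
            out[i * W + j] = in[i * W + j] + 1;
}

/* After iteration step normalization: the inner loop counts 0, 1, 2, ... */
void normalized(const int *in, int *out)
{
    for (int i = 0; i < H; i++)
        for (int jj = 0; jj < W / 3; jj++)
            out[i * W + 3 * jj] = in[i * W + 3 * jj] + 1;
}

/* After loop collapse: a single loop over all (i, jj) pairs. */
void collapsed(const int *in, int *out)
{
    const int inner = W / 3;
    for (int t = 0; t < H * inner; t++) {
        int i = t / inner;    /* recover the row index */
        int jj = t % inner;   /* recover the normalized column index */
        out[i * W + 3 * jj] = in[i * W + 3 * jj] + 1;
    }
}
```

The collapsed form has one flat index `t`, which maps naturally onto a one-dimensional range of GPGPU work items.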

(15)

Infrastructure and Tool

• Current choice: Haskell + cpp.

• Previously: Clang; proved to be tedious and error prone.

• Transformations had to be written in C++.

• AST not designed for source-to-source program transformation.

• The tool:

• Reads external rules in a C-like language (STML).

• Tool parametric to the rules: new, tailored rules can be developed.

• Rules can include conditions which are either functional pragmas or procedural pragmas.

(16)

Demo available — Contact us

Step by step transformation from C code to OpenCL.

• RGB filter C code → ready C code.

• Ready C code → OpenCL, OpenMP, MPI, MaxJ.

• CPU → GPGPU: ≈100× speedup in computation time, but data transfer dominates in this case.

Other examples

• Reduction of complexity order O(n³) → O(n²) [→ O(1)] for matrix multiplication given properties of one input matrix.

• Transformation of c = a·v + b·v into c = k·v, where k = a + b; includes loop fusion and code hoisting; saves run time and FP operations.

• Transformation of original C code into ready C code for MaxJ: non-trivial transformations.
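The second example can be sketched in C as follows. The function names and the exact starting code are assumptions (the slides give only the formula); the naive version uses two loops so that both loop fusion and hoisting of `a + b` have something to act on.

```c
/* Before: two scaled copies of v are computed and added, in two loops. */
void scaleAddNaive(const double *v, int n, double a, double b, double *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a * v[i];            /* first loop: a·v */
    for (int i = 0; i < n; i++)
        c[i] = c[i] + b * v[i];     /* second loop: add b·v */
}

/* After: loops fused and a + b hoisted out of the loop as k. */
void scaleAddFused(const double *v, int n, double a, double b, double *c)
{
    const double k = a + b;         /* computed once */
    for (int i = 0; i < n; i++)
        c[i] = k * v[i];
}
```

Per element, the naive version performs two multiplications and one addition, while the fused version performs one multiplication. Because floating-point arithmetic is not exactly distributive, the two versions may differ in the last bits for some inputs, so whether this counts as semantics-preserving depends on the tolerance accepted for FP results.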

(17)

Future Steps

• Apply to more complex code and transformation rules.

• Perform realistic tests of said rules on actual architectures (ongoing).

• Improve description and handling of code blocks to handle more code structures.

• Improve handling of different compilation targets.

• Interface with external tools to reduce annotations.

• Would probably need feedback to / from the user.


(20)

Translation

• Code generated with adequate shape + annotations.

• E.g., shape for a GPGPU different from shape for OMP.

• OMP: split initial image among available threads.

• GPGPU / OpenCL: every loop iteration goes to one task.

• Annotations include, e.g.,

• Explicit independence between iterations or code fragments.

• Information about loop splitting.
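The difference between the two shapes can be sketched in plain C, without any OpenMP or OpenCL runtime. All names here are hypothetical: `processRow` stands in for a kernel such as `kernelRedFilter`, `tid`/`nthreads` model an OpenMP thread and team size, and `gid` plays the role that `get_global_id(0)` has inside an OpenCL kernel.

```c
/* Hypothetical per-row work; stands in for a real kernel. */
static void processRow(const int *in, int *out, int row, int width)
{
    for (int j = 0; j < width; j++)
        out[row * width + j] = in[row * width + j] * 2;
}

/* OpenMP-like shape: rows are split into contiguous chunks, one per thread.
   Each call handles the chunk of thread `tid` out of `nthreads`. */
void chunkShape(const int *in, int *out, int height, int width,
                int tid, int nthreads)
{
    int chunk = (height + nthreads - 1) / nthreads;   /* ceiling division */
    int first = tid * chunk;
    int last  = first + chunk < height ? first + chunk : height;
    for (int i = first; i < last; i++)
        processRow(in, out, i, width);
}

/* OpenCL-like shape: no loop over rows; one task per iteration,
   identified by `gid`. */
void taskShape(const int *in, int *out, int width, int gid)
{
    processRow(in, out, gid, width);
}
```

Calling `chunkShape` once per thread id, or `taskShape` once per row index, covers the same rows; only the granularity of the work units differs.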

(21)

Loop Fusion Transformation Rule (Simplified)

for_loop_fusion_pragma {
  pattern: {
    cstmts(ini);
    #pragma polca def a
    #pragma polca map inputa outputa
    for (cexpr(i) = cexpr(init); cexpr(i) < cexpr(n); cexpr(modi)) {
      cstmts(bodyFOR1);
    }
    cstmts(mid);
    #pragma polca def b
    #pragma polca map inputb outputb
    for (cexpr(j) = cexpr(init); cexpr(j) < cexpr(n); cexpr(modj)) {
      cstmts(bodyFOR2);
    }
    cstmts(fin);
  }
  condition: {
    no_reads(cexpr(i), cstmts(mid));
    no_reads(cexpr(j), cstmts(fin));
  }
  generate: {
    cstmts(ini);
    cstmts(mid);
    #pragma polca same_properties a
    #pragma polca same_properties b
    for (cexpr(i) = cexpr(init); cexpr(i) < cexpr(n); cexpr(modi)) {
      cstmts(bodyFOR1);
      subs(cstmts(bodyFOR2), cexpr(j), cexpr(i));
    }
    cstmts(fin);
  }
}
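A concrete instance of this rule, written by hand as a sketch: the function names, loop bodies, and the `mid` statement are made up, and the output arrays are assumed not to alias each other or the input. The example picks a `mid` that is independent of both loops; the conditions shown in the simplified rule check only reads of the loop variables, so any further independence presumably comes from the `map` annotations.

```c
/* Before fusion: two map-style loops with the same bounds,
   separated by a statement that reads neither loop variable. */
void beforeFusion(const int *in, int n, int *outA, int *outB, int *count)
{
    for (int i = 0; i < n; i++)
        outA[i] = in[i] + 1;          /* bodyFOR1 */
    *count = n;                       /* mid */
    for (int j = 0; j < n; j++)
        outB[j] = in[j] * 2;          /* bodyFOR2 */
}

/* After fusion: mid is moved before the loop and bodyFOR2 runs
   with j replaced by i. */
void afterFusion(const int *in, int n, int *outA, int *outB, int *count)
{
    *count = n;                       /* mid */
    for (int i = 0; i < n; i++) {
        outA[i] = in[i] + 1;          /* bodyFOR1 */
        outB[i] = in[i] * 2;          /* subs(bodyFOR2, j, i) */
    }
}
```

The fused version traverses `in` once instead of twice, which is the kind of non-functional gain (here, memory traffic) that the metrics-driven search would look for.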

(22)

Salvador Tamarit, Guillermo Vigueras, Manuel Carro, and Julio Mariño. A Haskell Implementation of a Rule-Based Program Transformation for C Programs. In Enrico Pontelli and Tran Cao Son, editors, International Symposium on Practical Aspects of Declarative Languages, number 9131 in LNCS, page TBD. Springer-Verlag, June 2015.

Guillermo Vigueras, Salvador Tamarit, Manuel Carro, and Julio Mariño. Towards a Rule-Based Approach to Generate High-Performance Scientific Code. Poster presented at the 2015 HiPEAC Conference, Amsterdam, January 2015.