M. Carro1,2, S. Tamarit2, G. Vigueras1, J. Mariño2
General Observation
• Cost: developing and maintaining code adequate for hybrid architecture(s).
Every sub-platform has different approaches / needs.
• Libraries: unified API, code adapted / optimized for some architecture.
• But:
1 Optimizations across library boundaries are difficult.
2 Maintaining code for several platforms is costly.
3 Porting among platforms is costly.
• What is the right balance?
• We take the extreme position:
Optimize all you can (source code needed). As automatically as possible.
Overview
• Source-to-source transformation of procedural code.
• Semantics-preserving.
• Modifying certain non-functional characteristics: speed, number of (FP) instructions, cache hits, communication, placement of kernels, ...
• Performed by applying transformation rules.
• Transformation rules:
• Match fragments to transform.
• Identify what they should be transformed into.
• Generate new code when certain conditions are met.
• Conditions captured with annotations:
• Algorithmic structure of code.
Transformation Scheme
[Figure: Initial code → Preparation → Ready code → Translation → Translated code for GPGPU (OpenCL), OpenMP, MPI, FPGA (MaxJ)]
Preparation:
• Generic C → generic C.
• Structural changes.
• Platform-independent.
• With a code shape adequate for some target platform.
Translation:
• Rewrites generic code to different target architectures.
• Uses procedural-level information.
Running Example: RGB Filter
void kernelRedFilter(...) {
  for (j = 0; j < width; j += 3)
    // Generate one row of the output image
}
void rgbImageFilter(...) {
  for (i = 0; i < height; i++)
    kernelRedFilter(...);
  // Same for green and blue
}
int main(...) {
  // Read image
  rgbImageFilter(...);
  // Write images
}
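A minimal runnable sketch of the red part of the running example. It assumes an interleaved 8-bit RGB image where each row holds rawWidth = 3 × (pixels per row) bytes, and a filter that keeps the red component while zeroing green and blue. The pixel layout, the effect of the filter, and the name `redImageFilter` are assumptions made for illustration; only the call shape `kernelRedFilter(image, i, rawWidth, redImage)` comes from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: row-major, interleaved R,G,B bytes. */
void kernelRedFilter(const unsigned char *image, int i, int rawWidth,
                     unsigned char *redImage)
{
    /* Generate row i of the output image, one pixel (3 bytes) per step. */
    for (int j = 0; j < rawWidth; j += 3) {
        size_t p = (size_t)i * (size_t)rawWidth + (size_t)j;
        redImage[p]     = image[p]; /* keep red   */
        redImage[p + 1] = 0;        /* drop green */
        redImage[p + 2] = 0;        /* drop blue  */
    }
}

/* Hypothetical driver for the red channel only. */
void redImageFilter(const unsigned char *image, int height, int rawWidth,
                    unsigned char *redImage)
{
    assert(height >= 0 && rawWidth >= 0 && rawWidth % 3 == 0);
    for (int i = 0; i < height; i++)
        kernelRedFilter(image, i, rawWidth, redImage);
}
```

Each call to `kernelRedFilter` reads and writes only row i, which is the property the later annotations rely on.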
Annotated Code
#pragma polca kernel opencl
#pragma polca map image redImage
for (i = 0; i < height; i++)
kernelRedFilter(image, i, rawWidth, redImage);
• Hint: farming out to a GPGPU.
• Information regarding target architecture: either from programmer or from analysis of architecture / task graph.
• Annotations: what does the code block do?
Information from Annotations
What map tells us
#pragma polca map image redImage
for (i = 0; i < height; i++)
kernelRedFilter(image, i, rawWidth, redImage);
• Input image, output redImage.
• For every image[i], redImage[i] is produced using only image[i].
• No global variables, no dependencies across iterations.
• Computation of redImage[i] enclosed inside kernelRedFilter().
⇒ Pragma as summary of simpler properties.
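One consequence of these properties is that the iterations can be executed in any order. The sketch below illustrates this with a hypothetical per-element kernel (`kernelElem`, not from the slides) that reads only in[i] and writes only out[i]; reverse order stands in for an arbitrary schedule.

```c
#include <assert.h>

/* Hypothetical kernel: reads only in[i], writes only out[i],
   touches no global state. */
static void kernelElem(const int *in, int i, int *out)
{
    out[i] = 2 * in[i] + 1;
}

/* Original order, as in the annotated loop. */
void mapForward(const int *in, int n, int *out)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        kernelElem(in, i, out);
}

/* Same iterations in reverse order; under the map properties
   the result must be identical. */
void mapReverse(const int *in, int n, int *out)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--)
        kernelElem(in, i, out);
}
```

If the kernel read a neighbouring element of the output or updated a global, the two orders could disagree, and the map annotation would not be justified.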
www.imdea.org
Transformation rules
• Match input code, check conditions, generate output.
• Focused on procedural properties / structures.
(Simplified) Transformation Rule: Simplify Loops
one_iteration_active {
  pattern: {
    for (cexpr(i) = cexpr(ini); cexpr(i) < cexpr(n); cexpr(i)++)
      if (cexpr(i) == cexpr(other))
        cstmt(one_stat);
  }
  generate: {
    subs(cstmt(one_stat), cexpr(i), cexpr(other));
  }
}
for (i = 0; i < N; i++)
  if (i == k)
    c = v[i];
⇓
c = v[k];
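The before/after pair of this rule can be checked by hand with a small C sketch. The function names are hypothetical, and the range check in the second function is an assumed side condition that the simplified rule does not spell out.

```c
#include <assert.h>

/* Before: only the iteration with i == k does any work. */
int selectBefore(const int *v, int n, int k, int c)
{
    for (int i = 0; i < n; i++)
        if (i == k)
            c = v[i];
    return c;
}

/* After: the loop body instantiated with i := k.
   Assumed side condition (not part of the simplified rule): 0 <= k < n. */
int selectAfter(const int *v, int n, int k, int c)
{
    assert(0 <= k && k < n);
    c = v[k];
    return c;
}
```

When k lies outside the loop range, the original loop leaves c untouched while `v[k]` would read out of bounds, so a complete version of the rule presumably needs a condition of this kind.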
Rule Chaining
• Rule chaining: result of rule can be input for another rule.
• Generation of OpenCL in example: loop fusion (2 times) + inlining (3 times) + loop fusion (2 times) + iteration step normalization + loop collapse
• Many rule sequences are applicable: a search space.
• Use metrics on non-functional properties to drive search.
• Target-dependent.
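Two of the steps named in the chain, iteration step normalization and loop collapse, can be illustrated by hand on a loop nest shaped like the RGB filter. This is a sketch under assumptions: the function names are hypothetical, rawWidth is taken to be a multiple of 3, and the code the tool actually produces may look different.

```c
#include <assert.h>

/* Original shape: outer loop over rows, inner loop with step 3. */
void extractRedNested(const unsigned char *image, int height, int rawWidth,
                      unsigned char *red)
{
    assert(height >= 0 && rawWidth >= 0 && rawWidth % 3 == 0);
    for (int i = 0; i < height; i++)
        for (int j = 0; j < rawWidth; j += 3)
            red[i * (rawWidth / 3) + j / 3] = image[i * rawWidth + j];
}

/* After iteration step normalization: the inner loop counts by 1. */
void extractRedNormalized(const unsigned char *image, int height,
                          int rawWidth, unsigned char *red)
{
    assert(height >= 0 && rawWidth >= 0 && rawWidth % 3 == 0);
    int width = rawWidth / 3;
    for (int i = 0; i < height; i++)
        for (int jj = 0; jj < width; jj++)
            red[i * width + jj] = image[i * rawWidth + 3 * jj];
}

/* After loop collapse: one loop over all (i, jj) pairs. */
void extractRedCollapsed(const unsigned char *image, int height,
                         int rawWidth, unsigned char *red)
{
    assert(height >= 0 && rawWidth >= 0 && rawWidth % 3 == 0);
    int width = rawWidth / 3;
    for (int t = 0; t < height * width; t++) {
        int i = t / width, jj = t % width;
        red[t] = image[i * rawWidth + 3 * jj];
    }
}
```

The collapsed form has a single index t, which matches the "one loop iteration per task" shape mentioned for GPGPU targets.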
Infrastructure and Tool
• Current choice: Haskell + cpp.
• Previously: Clang; proved to be tedious and error-prone.
• Transformations had to be written in C++.
• AST not designed for source-to-source program transformation.
• The tool:
• Reads external rules in a C-like language (STML).
• Is parametric to the rules; new, tailored rules can be developed.
• Rules can include conditions, which are either functional pragmas or procedural pragmas.
Demo available: contact us
Step-by-step transformation from C code to OpenCL.
• RGB filter C code → ready C code.
• Ready C code → OpenCL, OpenMP, MPI, MaxJ.
• CPU → GPGPU: ≈100× speedup in computation time, but data transfer dominates in this case.
Other examples
• Reduction of complexity order O(n³) → O(n²) [→ O(1)] for matrix multiplication, given properties of one input matrix.
• Transformation of c = a·v + b·v into c = k·v, where k = a + b; includes loop fusion and code hoisting; saves run time and FP operations.
• Transformation of original C code into ready C code for MaxJ; non-trivial transformations.
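The second example can be sketched in C as follows. The starting code is an assumption (two separate loops over a vector v of length n); the slides give only the algebraic form. The two versions agree over the reals, but in floating point they may differ by rounding, because multiplication does not distribute exactly over addition.

```c
#include <assert.h>

/* Before: two loops, 2 multiplications + 1 addition per element. */
void combineBefore(double a, double b, const double *v, int n, double *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        c[i] = a * v[i];
    for (int i = 0; i < n; i++)
        c[i] += b * v[i];
}

/* After loop fusion, factoring out v[i], and hoisting k = a + b:
   1 multiplication per element, 1 addition in total. */
void combineAfter(double a, double b, const double *v, int n, double *c)
{
    assert(n >= 0);
    double k = a + b;
    for (int i = 0; i < n; i++)
        c[i] = k * v[i];
}
```

For n elements the floating-point operation count drops from 3n to n + 1.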
Future Steps
• Apply to more complex code and transformation rules.
• Perform realistic tests of said rules on actual architectures (ongoing).
• Improve description and handling of code blocks to handle more code structures.
• Improve handling of different compilation targets.
• Interface with external tools to reduce annotations.
• Would probably need feedback to / from the user.
Translation
• Code generated with adequate shape + annotations.
• E.g., shape for a GPGPU different from shape for OMP.
• OMP: split initial image among available threads.
• GPGPU / OpenCL: every loop iteration goes to one task.
• Annotations include, e.g.,
• Explicit independence between iterations or code fragments.
• Information about loop splitting.
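The OMP-oriented shape, where the image is split among the available threads, might look like the sketch below. Everything here is assumed for illustration: `chunkBounds` and `processRowsSplit` are hypothetical names, the row kernel is a stand-in, and the code is sequential C that only shows the loop structure a threaded translation could distribute.

```c
#include <assert.h>

/* Rows [*lo, *hi) assigned to worker t out of nworkers.
   The first (height % nworkers) workers get one extra row. */
void chunkBounds(int height, int nworkers, int t, int *lo, int *hi)
{
    assert(height >= 0 && nworkers > 0 && 0 <= t && t < nworkers);
    int base = height / nworkers, extra = height % nworkers;
    *lo = t * base + (t < extra ? t : extra);
    *hi = *lo + base + (t < extra ? 1 : 0);
}

/* Split shape: one outer iteration per worker, each handling
   a contiguous block of rows. */
void processRowsSplit(const int *in, int height, int nworkers, int *out)
{
    for (int t = 0; t < nworkers; t++) {   /* candidate: one thread each */
        int lo, hi;
        chunkBounds(height, nworkers, t, &lo, &hi);
        for (int i = lo; i < hi; i++)
            out[i] = in[i] + 1;            /* stand-in for the row kernel */
    }
}
```

For a GPGPU target the corresponding shape would instead keep a single flat loop whose iterations each become one task.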
Loop Fusion Transformation Rule (Simplified)
for_loop_fusion_pragma {
  pattern: {
    cstmts(ini);
    #pragma polca def a
    #pragma polca map inputa outputa
    for (cexpr(i) = cexpr(init); cexpr(i) < cexpr(n); cexpr(modi)) {
      cstmts(bodyFOR1);
    }
    cstmts(mid);
    #pragma polca def b
    #pragma polca map inputb outputb
    for (cexpr(j) = cexpr(init); cexpr(j) < cexpr(n); cexpr(modj)) {
      cstmts(bodyFOR2);
    }
    cstmts(fin);
  }
  condition: {
    no_reads(cexpr(i), cstmts(mid));
    no_reads(cexpr(j), cstmts(fin));
  }
  generate: {
    cstmts(ini);
    cstmts(mid);
    #pragma polca same_properties a
    #pragma polca same_properties b
    for (cexpr(i) = cexpr(init); cexpr(i) < cexpr(n); cexpr(modi)) {
      cstmts(bodyFOR1);
      subs(cstmts(bodyFOR2), cexpr(j), cexpr(i));
    }
    cstmts(fin);
  }
}
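A hand-written instance of the rule, as a sketch. The function names and loop bodies are hypothetical; the statement playing the role of mid is chosen to be independent of both loops, which is stronger than the conditions listed in the simplified rule.

```c
#include <assert.h>

/* Before: two map-style loops over the same range, separated by
   a statement (mid) that does not read the loop index. */
void twoLoops(const int *in, int n, int *a, int *b, int *count)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        a[i] = in[i] * 2;          /* bodyFOR1 */
    *count = n;                    /* mid */
    for (int j = 0; j < n; j++)
        b[j] = a[j] + 1;           /* bodyFOR2 */
}

/* After: mid moved before the loop, bodyFOR2 executed with j := i. */
void fusedLoop(const int *in, int n, int *a, int *b, int *count)
{
    assert(n >= 0);
    *count = n;                    /* mid */
    for (int i = 0; i < n; i++) {
        a[i] = in[i] * 2;          /* bodyFOR1 */
        b[i] = a[i] + 1;           /* bodyFOR2 with j := i */
    }
}
```

Fusion is valid here because bodyFOR2 at index j reads only a[j], which the fused iteration i = j has already produced.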
Salvador Tamarit, Guillermo Vigueras, Manuel Carro, and Julio Mariño. A Haskell Implementation of a Rule-Based Program Transformation for C Programs. In Enrico Pontelli and Tran Cao Son, editors, International Symposium on Practical Aspects of Declarative Languages, number 9131 in LNCS, page TBD. Springer-Verlag, June 2015.
Guillermo Vigueras, Salvador Tamarit, Manuel Carro, and Julio Mariño. Towards a Rule-Based Approach to Generate High-Performance Scientific Code. Poster presented at the 2015 HiPEAC Conference, Amsterdam, January 2015.