Code generation under Control
Rencontres sur la compilation / Saint Hippolyte
Henri-Pierre Charles CEA Laboratoire LaSTRE / Grenoble
12 December 2011
Introduction
Presentation
Henri-Pierre Charles, two-line CV:
2010- : CEA/DRT/DACLE/LIST/LaSTRE, CRI PILSI context, Gières
1993-2010: assistant professor at Université de Versailles Saint-Quentin-en-Yvelines, PRiSM laboratory, IUT de Vélizy
Keywords:
Architecture, HPC, compiler backend, parallelism (ILP, multimedia, caches)
6809, 68000, i860, TriMedia, Itanium, Power, CELL, ARM, MEPHISTO, others
GCC, LLVM, FFTW, H264, Spiral, ATLAS, MESA3D, others
3D image reconstruction, Z-buffer, video compression, FFTW, QCD
Introduction
CEA / CRI PILSI
CEA: Commissariat à l'Énergie Atomique et aux Énergies Alternatives
DAM: Direction des Applications Militaires
DEN: Direction de l'Énergie Nucléaire
DRT: Direction de la Recherche Technologique
DSM: Direction des Sciences de la Matière
DSV: Direction des Sciences du Vivant
LIST: Laboratoire Intégration des Systèmes et des Technologies (Saclay)
LETI: Laboratoire Électronique et de Technologie de l'Information (Grenoble)
LITEN: Laboratoire Innovation pour les Technologies des Énergies Nouvelles et les nanomatériaux
LaSTRE: Laboratoire Système Temps Réel (Saclay / Gières)
LIALP: Laboratoire Infrastructure et Atelier Logiciel pour Puces
Introduction
Presentation of LaSTRE
Laboratoire Systèmes Temps Réel. Head: Vincent DAVID
OASIS: multi-scale time-triggered architecture (the system is measured at its own rhythm); temporal consistency of exchanged data
PharOS: same concepts, specialized for the automotive context
Embedded systems, multiprocessors
MPPA: high-productivity parallel programming model for embedded HPC (MPPA project)
Low-level code optimization: dynamic code generation, low-level optimization, multimedia applications
Motivation Context
Objective?
Be home as fast as possible, safely
Constraints: speed limits
Constraints: “real” speed limits
Constraints: gas consumption
Motivation Context
Classical Compilation Chain
[Figure: classical compilation chain. Code forms: source code, intermediate code, assembly code, binary code, runnable code, plus data. Actors: programmer (starting from an idea and an algorithm), compiler, assembler, loader, system, user.]
Compilation objectives
Translate source code into a semantically equivalent binary
Assume “successive refinement”
Optimize for efficiency / parallelism: reduce cycle count
A performance defect is now a “bug” (not only in RT systems)
“Performance counter in the loop”
Motivation Context
Ask the program!
What makes the speed of this program vary?
    int i;
    for (i = 0; i < N; ++i)
    {
        int j;
        dest[i] = 0;
        for (j = 0; j < N; ++j)
            dest[i] += src[j] * m[i][j];
    }
Compiler, data size, target processor, instruction set, available parallelism, data type, memory location, operating system, ...
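To experiment with the factors listed above, the kernel can be written as a self-contained function. This is only a minimal sketch: the slide does not give the element type or the memory layout, so `float` elements and a row-major n x n matrix are assumptions, and the name `matvec` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* dest = m * src, where m is an n x n matrix stored row by row.
 * Assumptions (not from the slides): float elements, row-major layout. */
void matvec(size_t n, const float *m, const float *src, float *dest)
{
    assert(m != NULL && src != NULL && dest != NULL);
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;                 /* accumulate one output element */
        for (size_t j = 0; j < n; ++j)
            acc += src[j] * m[i * n + j]; /* row i, column j */
        dest[i] = acc;
    }
}
```

Changing the element type, the value of n, or where the arrays live in memory changes which optimizations pay off, which is the point the slide makes.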
Motivation Context
Data Size Matters
Loop size (value of N):
10^1  Multimedia kernel: full loop unrolling, instruction scheduling, memory cache access, ...
10^2 / 10^3  Scientific code: loop unrolling, loop conversion, data prefetching
10^6  Multimedia stream: multithreading
10^10 and more  High-level parallelism: MPI / Grid / Cloud, ...
N is generally a parameter known only at run time, so profiling and iterative compilation do not help.
Compilation strategies are complex and application-domain specific.
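Since N is only known at run time, one static workaround is to ship several variants of the kernel and select one per call. The sketch below illustrates this under stated assumptions: the fully unrolled variant for n == 4 is an arbitrary choice, the function names are hypothetical, and `float` with row-major storage is assumed as before. It is not the approach of the talk, which generates the variant dynamically instead of choosing among precompiled ones.

```c
#include <assert.h>
#include <stddef.h>

/* Generic variant: works for any n, leaves unrolling to the compiler. */
void matvec_generic(size_t n, const float *m, const float *src, float *dest)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j)
            acc += m[i * n + j] * src[j];
        dest[i] = acc;
    }
}

/* Small-size variant: inner loop fully unrolled for n == 4. */
void matvec_4(const float *m, const float *src, float *dest)
{
    for (size_t i = 0; i < 4; ++i) {
        const float *row = m + 4 * i;
        dest[i] = row[0] * src[0] + row[1] * src[1]
                + row[2] * src[2] + row[3] * src[3];
    }
}

/* Run-time selection on the value of n. */
void matvec_dispatch(size_t n, const float *m, const float *src, float *dest)
{
    assert(m != NULL && src != NULL && dest != NULL);
    if (n == 4)
        matvec_4(m, src, dest);
    else
        matvec_generic(n, m, src, dest);
}
```

The drawback is visible even in this small case: every size class needs its own precompiled variant, and the set of useful variants depends on the target and the application domain.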
Architecture
Architecture GENEPY
Architecture
Mephisto operator
Dynamic compilation
Compilette at work
[Figure: the compilation chain with a compilette added. Code forms: source code, intermediate code, assembly code, binary code, runnable code, plus data. Actors: programmer (starting from an idea and an algorithm), compiler, assembler, loader, system, user, compilette.]
Compilette:
Algorithmic optimizer
Parameter
Code generation
Data driven (size, alignment, values)
Energy driven (ISA selection, vectorization)
Speed driven (ISA selection, vectorization quality)
Network topology driven
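The "data driven (values)" case can be illustrated without emitting machine code. The sketch below is a portable analogy, not a compilette: a real compilette writes instructions for the target, whereas this code inspects one matrix row at run time and builds a compact plan that keeps only the non-zero coefficients. All names (`term_t`, `row_plan_t`, `row_plan_build`, `row_plan_apply`, `row_plan_free`) are hypothetical and are not part of deGoal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One multiply-accumulate that survives specialization. */
typedef struct { size_t j; float coeff; } term_t;

/* A row specialized on its values: only non-zero terms remain. */
typedef struct { size_t count; term_t *terms; } row_plan_t;

/* Run-time step: inspect the values and build the plan.
 * Returns 0 on success, -1 if allocation fails. */
int row_plan_build(row_plan_t *plan, const float *row, size_t n)
{
    assert(plan != NULL && row != NULL);
    plan->count = 0;
    plan->terms = malloc((n ? n : 1) * sizeof *plan->terms);
    if (plan->terms == NULL)
        return -1;
    for (size_t j = 0; j < n; ++j) {
        if (row[j] != 0.0f) {             /* zero terms are dropped */
            plan->terms[plan->count].j = j;
            plan->terms[plan->count].coeff = row[j];
            plan->count++;
        }
    }
    return 0;
}

/* Execution step: run only the operations kept in the plan. */
float row_plan_apply(const row_plan_t *plan, const float *src)
{
    float acc = 0.0f;
    for (size_t k = 0; k < plan->count; ++k)
        acc += plan->terms[k].coeff * src[plan->terms[k].j];
    return acc;
}

void row_plan_free(row_plan_t *plan)
{
    free(plan->terms);
    plan->terms = NULL;
    plan->count = 0;
}
```

The plan is built once and applied many times, so the cost of inspecting the data is amortized. A generated code sequence would go further by removing the plan traversal itself.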
Dynamic compilation
deGoal: a tool for dynamic code generation
deGoal: a tool for compilette generation
Generate a code generator
Virtual portable instruction set (register-based, data types)
Optimization at compile time and run time
Faster than any compiler code generator
No intermediate representation
Algorithmic level
Bottom-up approach
Targets: ARM, GENEPY, XP70V3/4, GPU, K1, ...
Memory footprint: a few KB
Dynamic compilation
FP7 H4H
FP7 H4H: High Performance for Heterogeneous Architecture
GPU JIT for Scilab
Generate NVIDIA PTX assembly language dynamically
Embed the code generator in Scilab
Optimized data movement
Linear algebra context
Dynamic compilation
FP7 Touchmore
FP7 Touchmore: dynamic code generation
Dynamic code generation for MPSoC
GENEPY tile (DSP Mephisto + MIPS)
Generate code for MIPS or Mephisto
Multimedia applications (MP3 / MP4)
Dynamic compilation
Smecy
FP7 Smecy
Target: P2012 MPSoC / XP70 processor
Matrix x matrix dynamic generation
“Perfect hash” dynamic generator
Dynamic compilation
Related work
JIT compilation (Java, LLVM, CUDA): intermediate representation, heavyweight code generators (code footprint and time)
Python, Perl, PHP: too high level, glue languages
FFTW, Spiral: code generators, dynamic configuration
ATLAS: compile-time tuning
Dynamic compilation
Conclusion
Dynamic code generation is THE challenge (JIT, JavaScript, emulation, multicore simulation, ...)
A lot of work remains: power characterization
MPSoC and HPC systems share some problems: multiple cores, power consumption control, ...
The parameters that control code generation are numerous and hard to manage