Application Analysis - Programming heterogeneous MPSoCs : tool flows to close the software prod

This section details the main components of the analysis phase in Figure 5.1. Tracing is discussed in Section 5.2.1 and model construction in Section 5.2.2. Section 5.2.3 revisits the problem of performance estimation in the context of the sequential flow. Finally, Section 5.2.4 describes the results exported by the graph analysis component.

5.2.1 Application Tracing

The tracing process takes a sequential application (and its input) and produces a trace file. This process is similar to the original MAPS tracing process in [46]. It is therefore only briefly reviewed. Application tracing starts by instrumenting the application IR. This is done by control flow and memory instrumentation passes.

Algorithm 5.2 Memory Instrumentation.

     1: procedure DDFAinst(IR = (S_stmt, S_f))
     2:   InitGlobalVars(IR)
     3:   for f ∈ S_f do InitLocalVars(f)
     4:   end for
     5:   for s ∈ S_stmt do
     6:     if s = load ∨ store then
     7:       i ← GetAccessInfo(s), InsertBefore(IR, _DDFA_TraceMem(i), s)
     8:     end if
     9:   end for
    10:   return IR
    11: end procedure

Algorithm 5.1 shows the pseudocode of the control flow instrumentation pass. It receives as input the application IR (see Definition 2.19) and returns an instrumented version of it. The functions First and Last in Line 3 return the first and the last statements from the set of function statements (by construction, every function has a single entry and exit point in the flow). The functions InsertBefore and InsertAfter insert call statements into the IR before and after a given statement. For every function in the IR, the instrumentor inserts a call to _FT_EnterFunction before its first statement and a call to _FT_ExitFunction after its last statement (Lines 7–8). In the case of the main function, calls to _MT_Init and _MT_Exit are added as well. The entrance to every basic block is also instrumented by adding a call to _BBT_EnterBB before every basic block leader (Line 9). Finally, every function call is instrumented by adding a call to _FT_StCallingIRStm before every call statement.
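The control flow instrumentation rules described above can be sketched on a toy IR. This is only an illustration under stated assumptions, not the LLVM pass of the thesis: the IR is modeled as a Python dictionary of basic blocks holding statement strings, basic block identifiers are simply per-function indices, and the relative order of the hooks inserted at a function entry is a guess. Only the hook names (_FT_EnterFunction, _FT_ExitFunction, _MT_Init, _MT_Exit, _BBT_EnterBB, _FT_StCallingIRStm) come from the text.

```python
from typing import Dict, List

# Toy IR (an assumption for illustration): function name -> list of basic
# blocks, each basic block being a non-empty list of statement strings.
ToyIR = Dict[str, List[List[str]]]


def instrument_control_flow(ir: ToyIR) -> ToyIR:
    """Return an instrumented copy of `ir`; the input is left untouched."""
    out: ToyIR = {}
    for name, blocks in ir.items():
        new_blocks = []
        for bb_id, block in enumerate(blocks):
            # Hook placed before the basic block leader.
            new_block = [f"call _BBT_EnterBB({bb_id})"]
            for stmt in block:
                if stmt.startswith("call "):
                    # Record the calling statement before every call.
                    new_block.append(f"call _FT_StCallingIRStm({stmt!r})")
                new_block.append(stmt)
            new_blocks.append(new_block)
        # Hooks before the first and after the last statement of the function.
        new_blocks[0].insert(0, f"call _FT_EnterFunction({name!r})")
        new_blocks[-1].append(f"call _FT_ExitFunction({name!r})")
        if name == "main":
            # main additionally starts and stops the tracing runtime.
            new_blocks[0].insert(0, "call _MT_Init()")
            new_blocks[-1].append("call _MT_Exit()")
        out[name] = new_blocks
    return out


example_ir = {"main": [["x = 0", "call foo"], ["ret"]], "foo": [["ret"]]}
instrumented = instrument_control_flow(example_ir)
```

Returning a copy instead of mutating the input keeps the sketch easy to test; a real pass would typically modify the IR in place.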

The pseudocode for the memory instrumentation pass is shown in Algorithm 5.2. The function InitGlobalVars inserts instrumentation calls for every global variable defined in the application. Similarly, the function InitLocalVars instruments the local variables of every function. The code in Lines 5–9 instruments all memory access instructions (load or store). Since LLVM does not model registers at the bytecode level, accesses to local variables are also implemented by pointer dereferencing. For this reason, the function GetAccessInfo first analyzes the type of access to distinguish among ordinary local variables, arrays and true pointer accesses. For example, array accesses are characterized by the getelementptr instruction mentioned before. For an array access, the offset that is accessed is passed to the instrumentation function _DDFA_TraceMem.
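The loop in Lines 5–9 of Algorithm 5.2 can be sketched as follows. The statement records, their field names (`op`, `var`, `addr`, `via`, `index`, `elem_size`) and the returned access descriptor are hypothetical stand-ins for LLVM instructions. The variable initialization steps (InitGlobalVars, InitLocalVars) are omitted, and the classification rule is a simplification of what GetAccessInfo is described to do.

```python
from typing import Any, Dict, List

Stmt = Dict[str, Any]  # hypothetical statement record of a toy IR


def get_access_info(stmt: Stmt) -> Dict[str, Any]:
    """Classify a load/store as a scalar, array or pointer access."""
    info = {
        "stmt": stmt["id"],
        "rw": "r" if stmt["op"] == "load" else "w",
        "kind": "scalar",
        "name": stmt["var"],
        "offset": 0,
    }
    addr = stmt.get("addr")
    if addr is not None and addr["via"] == "getelementptr":
        # Array access: the accessed offset is passed on to the tracer.
        info["kind"] = "array"
        info["offset"] = addr["index"] * addr["elem_size"]
    elif addr is not None and addr["via"] == "pointer":
        info["kind"] = "pointer"
    return info


def instrument_memory(stmts: List[Stmt]) -> List[Stmt]:
    """Insert a _DDFA_TraceMem call before every load or store."""
    out: List[Stmt] = []
    for s in stmts:
        if s["op"] in ("load", "store"):
            out.append({"op": "call", "callee": "_DDFA_TraceMem",
                        "arg": get_access_info(s)})
        out.append(s)
    return out


# Read of A[5] with 4-byte elements, followed by a return.
example = [
    {"id": 5, "op": "load", "var": "A",
     "addr": {"via": "getelementptr", "index": 5, "elem_size": 4}},
    {"id": 6, "op": "ret", "var": None},
]
instrumented = instrument_memory(example)
```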

The main instrumentation functions are listed in Table A.1. These functions are all implemented in the uTracer runtime library, which is linked to the binary to obtain the instrumented host executable (see Figure 5.1a). Running this executable finally produces the trace file. Figure 5.2 shows an example of a trace file for a simple application. For the sake of clarity, instead of using the IR (ex.bc), the control flow is annotated on the original C code in Figure 5.2a, with the basic block identifiers generated during tracing. The trace file in Figure 5.2b first shows that the main function started executing after being called from the dummy statement s0 (s:0 in Line 1). Thereafter, BB2 was entered, during which function foo was called from statement s16 (Lines 2–3). Within foo, BB1 is executed (Line 4), which contains an access to array A. The access information in Line 5 specifies that: (1) the statement causing the access is s5, (2) it is a read access ('r'), (3) the array is local ('l'), (4) it is in the stack of function foo, (5) when foo is called from call site 1, (6) the array name is A, (7) its base type is 8, which stands for int, and (8) the offset of the access is 20, i.e., element A[5] of the array. Compare this information with the read accesses to the global array A within the for loop in Line 9 and Line 11 of the trace file. Notice also that when foo is called for the second time, the call site information is the only thing that changes (see Line 15). The call site information added to the memory instrumentation enables context-sensitive DFA.

a) Application code (ex.c):

    int A[10];

    int foo() { int A[10]; return A[5]; }

    int main() {
      int s = 0, i;
      A[8] = foo();
      for (i = 0; i < 2; i++) {
        s += A[i * 4];
      }
      A[2] = foo();
      return 0;
    }

b) Trace file:

     1:s:0:enter:main:ex.bc
     2:2
     3:s:16:enter:foo:ex.bc
     4:1
     5:m:5|r||l|foo|1|A|8|20
     6:exit:foo:ex.bc
     7:m:17|w||g|A|8|32
     8:3, 4
     9:m:27|r||g|A|8|0
    10:5, 3, 4
    11:m:27|r||g|A|8|16
    12:5, 3, 6
    13:s:36:enter:foo:ex.bc
    14:1
    15:m:5|r||l|foo|2|A|8|20
    16:exit:foo:ex.bc
    17:m:37|w||g|A|8|8
    18:exit:main:ex.bc

Figure 5.2: Trace file example. a) Application code (ex.c). b) Trace file.
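To make the layout of the memory records in Figure 5.2b concrete, the following sketch decodes one record into named fields. The field layout is inferred only from the two record shapes visible in the figure (local and global accesses). The class and function names are hypothetical, the meaning of the empty third field is not given in the text and is ignored, and the leading `N:` numbers in the figure are assumed to be figure line numbers rather than part of the record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemAccess:
    stmt: int                  # identifier of the accessing statement
    is_write: bool             # 'w' -> True, 'r' -> False
    scope: str                 # 'l' (local) or 'g' (global)
    function: Optional[str]    # owning function, local accesses only
    call_site: Optional[int]   # call site of that function, local only
    name: str                  # variable name
    base_type: int             # type code, e.g. 8 for int in the example
    offset: int                # accessed offset


def parse_mem_record(line: str) -> MemAccess:
    """Decode a record such as 'm:5|r||l|foo|1|A|8|20'."""
    tag, _, payload = line.strip().partition(":")
    if tag != "m":
        raise ValueError(f"not a memory record: {line!r}")
    fields = payload.split("|")
    stmt, rw, _unused, scope = fields[:4]
    if scope == "l":
        function, call_site, name, base_type, offset = fields[4:]
        return MemAccess(int(stmt), rw == "w", scope, function,
                         int(call_site), name, int(base_type), int(offset))
    if scope == "g":
        name, base_type, offset = fields[4:]
        return MemAccess(int(stmt), rw == "w", scope, None, None,
                         name, int(base_type), int(offset))
    raise ValueError(f"unknown scope {scope!r} in {line!r}")


local_read = parse_mem_record("m:5|r||l|foo|1|A|8|20")
global_write = parse_mem_record("m:17|w||g|A|8|32")
```

Keeping the call site as a separate field is what allows two accesses to the same local array to be told apart by calling context, as in Lines 5 and 15 of the trace.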

5.2.2 Building the Application Model

After obtaining the execution trace, the actual model construction starts. A detailed flow of this process is shown in Figure 5.3. As inputs it takes the trace file, the line information generated during pre-processing and the annotated bytecode of the application. Recall that this annotated version includes unique identifiers assigned to statements and basic blocks during instrumentation.

The first component of the flow in Figure 5.3 performs standard static LLVM analysis, which includes, among others, control and data flow analysis as well as call graph generation. The second component is a plugin added to LLVM that changes the CDFG representation to a Dependence Flow Graph (DFG) representation¹. The DFGs of all the functions in the IR are built using the technique described in [122]. In addition to normal CDFG nodes, DFGs have switch and merge nodes. These nodes are used to add control information to data dependencies. As a consequence, data dependency information (def-use chains) is more explicit in a DFG, which eases the later clustering and code generation process. Additionally, the DFGs generated by the second component include loop entry and exit nodes. These nodes are used to collect dependency information and to annotate the results of loop analysis for partitioning.

The third component in Figure 5.3 is in charge of parsing the execution trace and producing the sequential profile ($\pi^A : SE^A \to \mathbb{N}$ in Definition 2.28). Building the control flow profile, i.e., for elements in $(S^A_{stmt} \cup BB^A \cup E^A_c) \subset SE^A$, is a straightforward process that consists in counting appearances of basic block identifiers in the trace. The same holds for function profiling, i.e., for elements in $S^A_f \subset SE^A$. Additionally, the calling statements exported in the trace make it possible to distinguish different call sites and to add profiling information to the edges in the CG, i.e., for elements in $E^A_{cg} \subset SE^A$.
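The counting described above can be sketched for the trace of Figure 5.2b. This is a minimal illustration, assuming the record shapes visible in that figure (`s:<stmt>:enter:<function>:<module>`, `exit:<function>:<module>`, comma-separated basic block identifiers, and `m:` memory records). The function name and the choice of keys are hypothetical; keying basic blocks by (function, identifier) is a conservative choice that works whether or not identifiers are unique across functions.

```python
from collections import Counter
from typing import Iterable, Tuple


def build_profile(trace_lines: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Count basic block executions, function calls and call graph edges."""
    bb_count: Counter = Counter()   # (function, bb id) -> executions
    fn_count: Counter = Counter()   # function -> calls
    cg_count: Counter = Counter()   # (caller, call stmt, callee) -> calls
    stack = []                      # currently active functions
    for raw in trace_lines:
        line = raw.strip()
        if not line or line.startswith("m:"):
            continue  # memory records feed the data flow analysis instead
        if line.startswith("s:"):
            _, stmt, _, callee, _ = line.split(":")
            caller = stack[-1] if stack else None
            fn_count[callee] += 1
            cg_count[(caller, int(stmt), callee)] += 1
            stack.append(callee)
        elif line.startswith("exit:"):
            stack.pop()
        else:
            for bb in line.split(","):
                bb_count[(stack[-1], int(bb))] += 1
    return bb_count, fn_count, cg_count


TRACE = """s:0:enter:main:ex.bc
2
s:16:enter:foo:ex.bc
1
m:5|r||l|foo|1|A|8|20
exit:foo:ex.bc
m:17|w||g|A|8|32
3, 4
m:27|r||g|A|8|0
5, 3, 4
m:27|r||g|A|8|16
5, 3, 6
s:36:enter:foo:ex.bc
1
m:5|r||l|foo|2|A|8|20
exit:foo:ex.bc
m:37|w||g|A|8|8
exit:main:ex.bc""".splitlines()

bb_count, fn_count, cg_count = build_profile(TRACE)
```

For this trace, the two calls to foo end up on two distinct call graph edges (call statements 16 and 36), which is the call site distinction mentioned in the text.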

The memory information in the trace is used to extend the static data flow analysis with dynamic information. In other words, it is used to complete the set of data edges $E^A_d$ and define the values $\pi^A(e)\ \forall e \in E^A_d \subset SE^A$. The way the dynamic data flow information is collected from the trace is described in [46].

¹ Unless stated otherwise, in the rest of this thesis, DFG stands for a function's Dependence Flow Graph and not for a Data Flow Graph (see Definition 2.24).

[Figure 5.3: Flow for building an application sequential model. Inputs: trace, line info and annotated bytecode. Stages: LLVM analysis (IR, CG, alias analysis, loop info, dominators/postdominators); LLVM plug-in (Dependence Flow Graph); trace parsing, collection of information at C statement level and graph annotation; merging of static and dynamic information and graph export. Output: graphs (XML).]

The last component in the flow makes sure that the final DFGs are consistent by merging static and dynamic information and resolving potential conflicts. Conflicts appear when a dependency is not recognized by the static analysis. New edges coming from the dynamic analysis may invalidate initial static edges. Consider as an example the code in Listing 5.1. If static analysis happens to miss the definition of variable a in Line 5, a data flow edge would be created from the statement in Line 4 to the statement in Line 6. The dynamic information would add an edge between the statements in Lines 5 and 6, which invalidates the previous static edge. Such conflicts appear because the static analysis is not carried out in a conservative fashion. A conservative static flow analysis would add many may-dependencies that would clutter the DFGs, hampering parallelism extraction.

    1 int foo() {
    2   int a;
    3   int *pa = &a;
    4   a = 0;      // Definition observed by static DFA
    5   *pa = 1;    // Definition observed by dynamic DFA
    6   a += 42;    // Use variable a
    7   return a;
    8 }

Listing 5.1: Conflicting static and dynamic DFA.
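The conflict in Listing 5.1 can be reproduced with a small merging sketch. The text does not spell out the resolution policy, so the rule used here is an assumption: dynamic edges are always kept, and a static def-use edge is dropped when the trace observed the same use of the same variable being fed only by other definitions. Edges are modeled as hypothetical (definition line, use line, variable) tuples.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int, str]  # (definition line, use line, variable)


def merge_edges(static_edges: Iterable[Edge],
                dynamic_edges: Iterable[Edge]) -> Set[Edge]:
    """Merge static and dynamic data flow edges (assumed policy, see above)."""
    dynamic = set(dynamic_edges)
    # Definitions actually observed for every (use, variable) pair.
    observed: Dict[Tuple[int, str], Set[int]] = {}
    for d, u, v in dynamic:
        observed.setdefault((u, v), set()).add(d)
    merged = set(dynamic)
    for d, u, v in static_edges:
        defs = observed.get((u, v))
        # Keep static edges for uses that never ran, or that the trace confirms.
        if defs is None or d in defs:
            merged.add((d, u, v))
    return merged


# Listing 5.1: static DFA misses the definition through *pa in Line 5.
static = {(4, 6, "a"), (6, 7, "a")}
dynamic = {(5, 6, "a"), (6, 7, "a")}
merged = merge_edges(static, dynamic)
```

Under this policy the static edge from Line 4 to Line 6 disappears, while static edges in code that the trace never exercised are preserved.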

5.2.3 Sequential Performance Estimation Revisited

The different methods of performance estimation used in this thesis were presented in Section 2.2.2. In the sequential flow, fine-grained performance estimation is needed, at the level of basic blocks or even statements. For this reason, some of the methods become impractical. An annotation-based approach, for example, would require the programmer to specify the execution time of every basic block for each processor type. Instrumentation at the basic block level would introduce too much overhead for simulation-based and measurement-based approaches. This could be circumvented by only executing the portions of code that are proposed for partitioning. Notice, however, that this would require the partitioning process to include several execution rounds (on the simulator or on the target architecture). The turnaround time of these two approaches would therefore be prohibitively large.

For the aforementioned reasons, the graph analysis component of the sequential flow (see Figure 5.1) only uses table-based and emulation-based estimation approaches. The table-based approach works as discussed in Section 2.4.1.4, with the IR operations defined by the LLVM basic instructions.

Figure 5.4: Sample analysis results in the MAPS IDE.

The table-based approach was extended in order to cope with non-scalar architectures, e.g., VLIW. The extension consists in multiplying the result of the estimation for a basic block by a constant factor. Consider a basic block $BB$ with DFG $DFG_{BB}$ to be executed on an architecture with $k$ parallel functional units. Let $t_{ASAP}$ be the time reported by an ASAP schedule of $DFG_{BB}$ and $\hat{t}_{seq}$ the time of sequential execution, as defined in Section 2.4.1.4, $\hat{t}_{seq} = \sum_{s \in BB} \zeta^{PT}_{tb}(s)$. The cost estimation for the target architecture is then given by

$$\zeta^{PT,st}_{tb}(BB) = \max\left(\frac{1}{k}, \frac{t_{ASAP}}{\hat{t}_{seq}}\right) \cdot \hat{t}_{seq} \qquad (5.1)$$
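Equation 5.1 can be evaluated directly once the per-statement table costs and the ASAP time are known. The sketch below assumes both are given as inputs; the function name and the treatment of an empty basic block are illustrative choices, not part of the original flow.

```python
from typing import Sequence


def vliw_block_cost(stmt_costs: Sequence[float], t_asap: float, k: int) -> float:
    """Evaluate Equation 5.1: max(1/k, t_asap / t_seq) * t_seq."""
    if k < 1:
        raise ValueError("k must be at least 1")
    t_seq = sum(stmt_costs)  # sequential estimate: sum of statement costs
    if t_seq == 0:
        return 0.0           # empty block (illustrative convention)
    return max(1.0 / k, t_asap / t_seq) * t_seq


# Four statements costing 1, 1, 2 and 2 cycles, so t_seq = 6.
dependency_bound = vliw_block_cost([1, 1, 2, 2], t_asap=3, k=4)  # 3.0
resource_bound = vliw_block_cost([1, 1, 2, 2], t_asap=1, k=4)    # 1.5
scalar = vliw_block_cost([1, 1, 2, 2], t_asap=3, k=1)            # 6.0
```

The factor is equivalent to taking the larger of $t_{ASAP}$ and $\hat{t}_{seq}/k$: the estimate can be no better than the dependency-limited ASAP time and no better than a perfect distribution over the $k$ functional units. For $k = 1$ it falls back to the sequential estimate whenever $t_{ASAP} \le \hat{t}_{seq}$.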

5.2.4 Analysis Results

The last component of the analysis phase traverses the application graphs and exports execution statistics. This includes a line profile, a list of hot spots and annotated graphs for visualization. The line profile shows which lines in the source code are executed the most. The list of hot spots helps the programmer to identify where to focus the analysis. In addition to the execution statistics, the graph analysis component collects information at the level of loops and performs an early analysis of potential parallelism patterns (TLP, DLP or PLP).

An example of the analysis results, as seen in the MAPS IDE, is shown in Figure 5.4. The original code corresponds to the small application from Figure 5.2a. The code and the line profile can be visualized in a normal C source editor, as shown on the left-hand side of Figure 5.4 (see the percentages to the right of the line numbers). The information collected for the loop in the main function is shown in a yellow box. The CG of the application is shown in the middle of the figure, with two edges representing the two call sites of function foo. The right-hand side of the figure shows a portion of the DFG of the main function. It is possible to identify the loop entry node, a portion of the for condition and the switch node corresponding to the control flow introduced by the loop.

Graph analysis collects information at the level of loops and functions, which is important for the parallelism extraction and the code generation phases. In the case of loops, the following annotations are added to the loop entry nodes:

1. Loop type: This field indicates whether or not the loop is well structured, that is, whether it has only one entry, one exit point and only one backward edge, i.e., no break, continue or return statements.

2. Induction variables: This field indicates which variables are used to control the iterations of the loop. Loop-carried dependencies due to these variables are therefore ignored during parallelism extraction.


In document Programming heterogeneous MPSoCs : tool flows to close the software productivity gap (Page 86-91)