procedures are interleaved, but there also exist models for which our treatment here is insufficient
5.3 Control Flow Graphs
It is convenient and intuitive to construct models whose states are closely related to locations in program source code. In general, we will associate an abstract state with a whole region (that is, a set of locations) in a program. We know that program source code is finite, so a model that associates a finite amount of information with each of a finite number of program points or regions will also be finite.
Control flow of a single procedure or method can be represented as an intraprocedural control flow graph, often abbreviated as control flow graph or CFG. The intraprocedural control flow graph is a directed graph in which nodes represent regions of the source code and directed edges represent the possibility that program execution proceeds from the end of one region directly to the beginning of another, either through sequential execution or by a branch. Figure 5.2 illustrates the representation of typical control flow constructs in a control flow graph.
Figure 5.2: Building blocks for constructing intraprocedural control flow graphs. Other control constructs are represented analogously. For example, the for construct of C, C++, and Java is represented as if the initialization part appeared before a while loop, with the increment part at the end of the while loop body.
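The for-to-while correspondence described in the caption of Figure 5.2 can be sketched in Java as follows. This is a minimal illustration, and the class and method names are hypothetical. Under the convention above, both methods would be modeled by the same control flow graph.

```java
// Illustrative sketch only: ForAsWhile and its methods are hypothetical names.
final class ForAsWhile {
    // The loop as it is written in source code.
    static int countWithFor(String s, char target) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    // The same loop as the control flow graph models it: the initialization
    // comes before a while loop, and the increment ends the loop body.
    static int countWithWhile(String s, char target) {
        int count = 0;
        int i = 0;                      // initialization part
        while (i < s.length()) {        // loop test
            if (s.charAt(i) == target) {
                count++;
            }
            i++;                        // increment part, last in the body
        }
        return count;
    }
}
```

The rewriting is this direct only when the loop body contains no continue statement, because continue in a for loop still executes the increment part.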
In terms of program execution, we can say that a control flow graph model retains some information about the program counter (the address of the next instruction to be executed), and elides other information about program execution (e.g., the values of variables). Since information that determines the outcome of conditional branches is elided, the control flow graph represents not only possible program paths but also some paths that cannot be executed. This corresponds to the introduction of nondeterminism illustrated in Figure 5.1.
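A small Java sketch may make the point about unexecutable paths concrete. It is an assumption-laden illustration rather than an example from the text: the class and method names are hypothetical. A control flow graph of trace has four paths through its two branches (AC, AD, BC, BD), but because both tests depend on the same unchanged value, only two of them can ever be executed.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: InfeasiblePaths and its methods are hypothetical names.
final class InfeasiblePaths {
    // Records which branch is taken at each of two consecutive tests on x.
    static String trace(int x) {
        StringBuilder taken = new StringBuilder();
        if (x > 0) { taken.append('A'); } else { taken.append('B'); }
        if (x > 0) { taken.append('C'); } else { taken.append('D'); }
        return taken.toString();
    }

    // Collects the distinct branch sequences observed for x in [lo, hi].
    static Set<String> observedPaths(int lo, int hi) {
        Set<String> paths = new TreeSet<>();
        for (int x = lo; x <= hi; x++) {
            paths.add(trace(x));
        }
        return paths;
    }
}
```

Calling observedPaths over any range that includes both positive and nonpositive values yields only AC and BD. The paths AD and BC exist in the graph because the model has elided the value of x.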
The nodes in a control flow graph could represent individual program statements, or even individual machine operations, but it is desirable to make the graph model as compact and simple as possible. Usually, therefore, nodes in a control flow graph model of a program represent not a single point but rather a basic block, a maximal program region with a single entry and single exit point.
A basic block typically coalesces adjacent, sequential statements of source code, but in some cases a single syntactic program statement is broken across basic blocks to model control flow within the statement. Figures 5.3 and 5.4 illustrate construction of a control flow graph from a Java method. Note that a sequence of two statements within the loop has been collapsed into a single basic block, but the for statement and the complex predicate in the if statement have been broken across basic blocks to model their internal flow of control.
/**
 * Remove/collapse multiple newline characters.
 *
 * @param String string to collapse newlines in.
 * @return String
 */
public static String collapseNewlines(String argStr)
{
    char last = argStr.charAt(0);
    StringBuffer argBuf = new StringBuffer();

    for (int cIdx=0; cIdx < argStr.length(); cIdx++)
    {
        char ch = argStr.charAt(cIdx);
        if (ch != '\n' || last != '\n')
        {
            argBuf.append(ch);
            last = ch;
        }
    }

    return argBuf.toString();
}
Figure 5.3: A Java method to collapse adjacent newline characters, from the StringUtilities class of the Velocity project of the open source Apache project. (c) 2001 Apache Software Foundation, used with permission.
Figure 5.4: A control flow graph corresponding to the Java method in Figure 5.3. The for statement and the predicate of the if statement have internal control flow branches, so those statements are broken across basic blocks.
Some analysis algorithms are simplified by introducing a distinguished node to represent procedure entry and another to represent procedure exit. When these distinguished start and end nodes are used in a CFG, a directed edge leads from the start node to the node representing the first executable block, and a directed edge leads from each procedure exit (e.g., each return statement and the last sequential block in the procedure) to the distinguished end node. Our practice will be to draw a start node identified with the procedure or method signature, and to leave the end node implicit.
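One way to hold such a graph in memory is a successor map with explicit entry and exit nodes. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation. Block labels b1 through b8 follow Figure 5.5, but the assignment of statements to blocks and the targets of the branches out of b4 and b5 are our reading of the method in Figure 5.3. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: CollapseNewlinesCfg is a hypothetical name, and the
// edge set is an assumed reading of collapseNewlines, not a copy of Figure 5.4.
final class CollapseNewlinesCfg {
    // Maps each node to the nodes that may execute immediately after it.
    static Map<String, List<String>> build() {
        Map<String, List<String>> succ = new LinkedHashMap<>();
        succ.put("entry", Arrays.asList("b1"));
        succ.put("b1", Arrays.asList("b2"));       // last = ...; argBuf = ...
        succ.put("b2", Arrays.asList("b3"));       // cIdx = 0
        succ.put("b3", Arrays.asList("b4", "b8")); // cIdx < argStr.length()
        succ.put("b4", Arrays.asList("b6", "b5")); // ch = ...; ch != '\n'
        succ.put("b5", Arrays.asList("b6", "b7")); // last != '\n'
        succ.put("b6", Arrays.asList("b7"));       // append(ch); last = ch
        succ.put("b7", Arrays.asList("b3"));       // cIdx++
        succ.put("b8", Arrays.asList("exit"));     // return argBuf.toString()
        succ.put("exit", Collections.<String>emptyList());
        return succ;
    }

    // Worklist traversal: every node reachable from start along directed edges.
    static Set<String> reachableFrom(Map<String, List<String>> succ, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (seen.add(node)) {
                for (String next : succ.get(node)) {
                    work.push(next);
                }
            }
        }
        return seen;
    }
}
```

With a single entry and a single exit node, an analysis can start every traversal from one place and check one node for termination, which is the simplification the distinguished nodes are meant to provide.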
The intraprocedural control flow graph may be used directly to define thoroughness criteria for testing (see Chapters 9 and 12). Often the control flow graph is used to define another model, which in turn is used to define a thoroughness criterion. For example, some criteria are defined by reference to linear code sequences and jumps (LCSAJs), which are essentially subpaths of the control flow graph from one branch to another. Figure 5.5 shows the LCSAJs derived from the control flow graph of Figure 5.4.
From    Sequence of Basic Blocks    To
entry   b1 b2 b3                    jX
entry   b1 b2 b3 b4                 jT
entry   b1 b2 b3 b4 b5              jE
entry   b1 b2 b3 b4 b5 b6 b7        jL
jX      b8                          return
jL      b3 b4                       jT
jL      b3 b4 b5                    jE
jL      b3 b4 b5 b6 b7              jL
Figure 5.5: Linear code sequences and jumps (LCSAJs) corresponding to the Java method in Figure 5.3 and the control flow graph in Figure 5.4. Note that proceeding to the next sequential basic block is not considered a "jump" for purposes of identifying LCSAJs.
For use in analysis, the control flow graph is usually augmented with other information. For example, the data flow models described in the next chapter are constructed using a CFG model augmented with information about the variables accessed and modified by each program statement.
Not all control flow is represented explicitly in program text. For example, if an empty string is passed to the collapseNewlines method of Figure 5.3, the exception java.lang.StringIndexOutOfBoundsException will be thrown by String.charAt, and execution of the method will be terminated. This could be represented in the CFG as a directed edge to an exit node. However, if one includes such implicit control flow edges for every possible exception (for example, an edge from each reference that might lead to a null pointer exception), the CFG becomes rather unwieldy.
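The implicit exit can be observed directly. The sketch below copies collapseNewlines from Figure 5.3 into a hypothetical wrapper class and adds a hypothetical helper, terminatesAbnormally, that reports whether a call leaves the method through the exception rather than through the return statement. The helper catches IndexOutOfBoundsException, the documented type for String.charAt, of which StringIndexOutOfBoundsException is a subclass.

```java
// Illustrative sketch only: ImplicitExit and terminatesAbnormally are hypothetical
// names; collapseNewlines is reproduced from Figure 5.3.
final class ImplicitExit {
    public static String collapseNewlines(String argStr) {
        char last = argStr.charAt(0);   // throws if argStr is empty
        StringBuffer argBuf = new StringBuffer();
        for (int cIdx = 0; cIdx < argStr.length(); cIdx++) {
            char ch = argStr.charAt(cIdx);
            if (ch != '\n' || last != '\n') {
                argBuf.append(ch);
                last = ch;
            }
        }
        return argBuf.toString();
    }

    // True if the call exits via the implicit exception edge instead of return.
    static boolean terminatesAbnormally(String input) {
        try {
            collapseNewlines(input);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

For the empty string the helper returns true, and for any nonempty string it returns false. The exceptional exit is therefore a real execution path, yet it has no corresponding edge in a CFG built only from explicit control flow.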
More fundamentally, it may not be simple or even possible to determine which of the implicit control flow edges can actually be executed. We can reason about the call to argStr.charAt(cIdx) within the body of the for loop and determine that cIdx must always be within bounds, but we cannot reasonably expect an automated tool for extracting control flow graphs to perform such inferences. Whether to include some or all implicit control flow edges in a CFG representation therefore involves a trade-off between possibly omitting some execution paths or representing many spurious paths. Which is preferable depends on the uses to which the CFG representation will be put.
Even the representation of explicit control flow may differ depending on the uses to which a model is put. In Figure 5.4, the for statement has been broken into its constituent parts (initialization, comparison, and increment for next iteration), each of which appears at a different point in the control flow. For some kinds of analysis, this breakdown would serve no useful purpose. Similarly, a complex conditional expression in Java or C is executed by "short-circuit" evaluation, so the single expression i > 0 && i < 10 can be broken across two basic blocks (the second test is not executed if the first evaluates to false). If this fine level of execution detail is not relevant to an analysis, we may choose to ignore short-circuit evaluation and treat the entire conditional expression as if it were fully evaluated.