5.4 Extracting Assumptions
5.4.1 Abstract Access Paths
Fundamental to our analyses is our approach for representing the memory locations that a statement may reference. Our approach is based on length-limited access path based analyses (e.g., [51]) and parameterized analyses (e.g., [54]). We combine these approaches
1 c l a s s N o d e { 2 N o d e n e x t ; 3 D a t a d a t a ; 4 ... 5 } 6 7 c l a s s . . . { 8 v o i d m ( N o d e n , D a t a d ){ 9 n . n e x t . n e x t . d a t a = d 10 } 11 } 1 // J i m p l e r e p r e s e n t a t i o n : 2 3 c l a s s . . . { 4 v o i d m ( N o d e n , D a t a d ){ 5 ... 6 t = n . n e x t ; 7 t = t . n e x t ; 8 t . d a t a = d ; 9 } 10 }
Figure 5.4: Example in Java and Jimple to Demonstrate Naming of Access Paths and adapt them to our setting in which the analysis distinguishes between unit data and environment data to precisely characterize points-to information for the former, but not the latter.
The goal of our analysis is different from the traditional goal of points-to analyses. In particular, we do not use analysis results to determine potential aliasing relationships and therefore do not require a canonical representation of points-to information. Our analyses are used to safely represent, at each program point, the state of the program, which maps variables to their values; values can be heap locations, a special value null denoting a null pointer, or scalar values (e.g., integers, reals, etc.).
An important part of an access path based points-to analysis is coming up with a scheme for access paths used to name heap objects. To give the intuition behind our access paths names, let us look at a simple example shown in Figure5.4; the right side shows the example in Java, the left side shows the code using the Jimple representation. Class Node shown in the example serves as a list where Data objects may be stored. Suppose, types Node and Data are unit types and suppose method m is an environment method under analysis. Let us trace what happens to unit data being passed to the method m.
Tracing Assignments Through the Concrete Heap For demonstration purposes, suppose that before the first assignment of method m (in Jimple) is executed, the heap
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ n.next.next
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ n.next
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ n.next.next
Figure 5.5: Tracing Assignments Through the Concrete Heap
looks as shown in the upper left picture of Figure 5.5. Then, the next three pictures show the heap after executing the first, second, and third assignments. It is easy to see that at the end of the method execution the variable t points to a heap node that may be named n.next.next. However, in general, chains of references through the heap may be infinite and cannot be statically bound to heap objects. One solution to this problem is to use a k-limiting bound on the length of access paths. Let us trace the example using an analysis with 1 as a k-limiting constant.
Tracing Assignments Using 1-Limited Analysis Figure 5.6 (left) shows the heap after execution of the first assignment. The picture is the same as before except now we are using dashed arrows rather than solid ones to show relationships between reference variables and heap objects. This is done to emphasize that according to the analysis results
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ n.next
d n
next
data
t = n.next; t = t.next; t.data = d;
t
Data Node
t→ chooseN ode
Figure 5.6: Tracing Assignments Using 1-Limited Analysis
the variable t may point to a node namedn.next. After the second assignment is executed, t may point to the node that can be named n.next.next, beyond k-limit. Usually, once the k-limiting bound is reached, the analyses loose some or all information about heap locations being described. However, in Java, we can exploit the type information and trace that after execution of the second assignment, the variable t may point to a heap object of type Node (not to any heap object). The right side of Figure 5.6 shows this by having t point to any heap object (shaded) of type Node. We use a special name chooseN ode to denote a set of all
heap objects of type Node at a given program state.
Note that this analysis yields very imprecise results: after the execution of the third statement, the data field of every shaded box may point to d. To improve the precision of this analysis, we track reachability of heap objects.
Tracing Assignments Using 1-Limited Analysis with Reachability Figure 5.7
shows steps for the analysis improved with reachability tracing. The heap is shown as before after the execution of the first assignment. However, after the execution of the sec- ond assignment, this analysis can distinguish between all heap objects of type Node from the heap objects of type Node that are also reachable from the node n.next through heap references. The shaded objects on the right side of Figure 5.7 show the set of Node objects that t may point to after the execution of the second assignment. We use a special name
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ n.next
d n
next
data
t = n.next; t = t.next; t.data = d;
t t→ reachUnitN ode(n.next)
Figure 5.7: Tracing Assignments Using 1-Limited Analysis with Reachability reachUnitN ode(n.next) to denote this set.
This example gives the intuition behind exploiting the type and reachability information for heap objects. Note that the access paths explained in the example started with n, the name of the formal parameter. In general, access paths may start with a name for an object of unit type that an environment method may modify. There are several categories of such objects: globals, parameters, and newly created objects. It is convenient to set up a symbolic location for each of them: global locations, denoted by C.f , parameter locations, denoted by pi, with p0 representing this, and newly allocated objects, denoted by newc,s,
for all instances of c created by statement s.
Note that object may flow to the environment through the return value of called methods. For each object returned from a method, we can trace whether the object is a parameter, global, or newly created object. The above symbolic locations are called roots. The naming scheme for roots guarantees uniqueness of their names within a method.
Formal Discussion
More formally, our points-to representation captures information relative to a given method and is parameterized by a root symbol that represents memory locations. There are three kinds of memory locations that may serve as a root: public static fields of classes (denoted C.f ), method parameters (denoted p), and newly allocated data (denotednewc for objects
of type C allocated at statement s). New locations are modeled as per-allocator summary locations which are supported by the environment code generation described in Section 5.5. Our representation makes use of operations that denote sets of heap-allocated objects in a given state. One can access the set of all allocated instances of class C (denoted choosec), and the set of allocated instances of class C that are reachable from memory location l via paths through the heap that only reference unit data (denoted reachUnitc(l)).
A memory location holds a value that is a scalar, a null, or a heap object. To repre- sent non-scalar memory locations tracked by the points-to analysis, we denote them by an abstract access path π∈ AbsP ath that may be defined by the following regular expression:
π = [(C.f | pi | newc,s)fU0−k(reachUnitc)?]| choosec | null
An access path is either choosec expression, a null, or a length-limited path that starts at a root symbol, consists of 0 to k dereferences of field accessors of unit type, and is optionally terminated in a reachable expression. Alternatively, we use notation reachUnitc(π) (where the parameter is understood to be the path prefix) to denote (π)reachUnitc. We refer to a prefix of a path with j field dereferences as π[j]. The semantics of a path are defined relative to a program state. Paths represent field accesses that are type correct in the sense that π[j] with type C can only be extended to length π[j + 1] by a field f where class(f ) is C. Length-limited paths end in either the location referred to by the field access sequence or a reachUnitc(π[k]) expression. In the former case, the access path represents instances of class C that are reachable via the chain of field dereferences denoted byπ in state s. In the latter case, the access path represents instances of class C that are reachable via a chain of field dereferences through unit data from any of the memory locations denoted by π[k] in state s. Note that a variable, l ∈ V ar, of a reference type may point to a set of locations, Π∈ P(AbsP ath), whose types are assignment-compatible (i.e., π ∈ Π ⇒ type(π) ≤ type(l), where ≤ is the sub-typing relation).
Our abstract access paths provide a different degree of precision compared to traditional k-limited access path based representations (e.g., [51]) in that they are well-typed and they
are able to represent heap reachability relationships between locations.
A pair of abstract access paths is ordered (≤) based on the containment order of the sets of memory locations denoted by the pair. According to the semantics described above the order is:
∀j≤i π[i]≤ reachUnittype(π[i])(π[j])
∀c,i reachUnitc(π[i])≤ choosec
This ordering is lifted to sets of symbolic locations as follows: (∀a∈S∃b∈S0a≤ b) → S ≤ S0
An abstract access path can be extended by a field dereference (denoted π.f ) using rules that distinguish unit locations from environment locations. If π.f refers to an environment location, the analysis does not keep track of it:
type(f )6∈ U =⇒ π.f = choosetype(f )
Otherwise, if π.f refers to a unit location, we must consider the structure of π: type(f )∈ U =⇒ π.f = (root)fi Uf if π = (root)fUi ∧ i < k reachUnittype(f )(π) if π = (root)fUk reachUnittype(f )(π 0) if π = reachUnitc(π0)
choosetype(f ) if π = choosec
An abstract access path rooted in a formal parameter π(p) can be prefixed (denoted π(π0/p)) by substituting the name of formal parameter p, used in defining access path, with
type and length appropriate path prefix π0. If a prefix operation causes the sequence of field
dereferences to exceed k then the extension operator is applied for each field dereference beyondk. The intuition here is that we calculate a symbolic analysis summary for a method m and then use the prefixing operator (denoted m(¯π/¯p)) to determine the effects at a call site by substituting the actual parameters, represented by ¯π, for the symbols representing the formal parameters, ¯p, of m.