Quantified Invariants of Linear Data Structures
5.1 Quantified Data Automata
Before we introduce QDAs, we first need to define the kind of input on which these automata operate. To this end, we introduce three types of words: data words, valuation words, and symbolic words.
We model lists and arrays (and also finite sets of lists and arrays) that contain data over some (infinite) data domain D as data words. Intuitively, data words encode the structure of the array or list, the data values stored in the cells, and also the (finite) set of pointer variables used by the given program. Thereby, each symbol of a data word corresponds to a cell of the data structure in the order of their occurrence. To define data words formally, we fix a finite (potentially empty) set of pointer variables
PV = {p1, . . . , pr}.
Definition. (Data word). Let PV be a finite set of pointer variables, Σ = 2^PV, and D a (potentially infinite) data domain (i.e., a set of data values). A data word over PV and D is a word u ∈ (Σ × D)∗ where each p ∈ PV occurs exactly once in the first component of u (i.e., for each u = a1 … an and p ∈ PV, there exists precisely one j ∈ {1, …, n} such that p occurs in the first component of aj).
The empty set in the first component of a data word corresponds to a blank-symbol, which we denote by the symbol b. Such a blank-symbol indicates that no pointer variable points to the corresponding cell.
Let us fix a finite, nonempty set Y = {y1, …, yk} of universally quantified variables.
Intuitively, the automata we build do not read data words directly but valuation words, which make the universal quantification explicit. More precisely, a valuation word is a data word extended by an additional component, called a valuation of Y, that encodes the cells the variables from Y reference (similar to the Σ-component of data words). The variables from Y are then quantified universally in the semantics of the automaton model (as explained later in this section).
Definition. (Valuation word). A valuation word is a word v ∈ (Σ × (Y ∪ {−}) × D)∗ where v projected to the first and third components forms a data word and where each y ∈ Y occurs in the second component of v precisely once.
We use the symbol “−” for positions at which no variable of Y occurs. Note that the choice of the alphabet forces the variables of Y to occur at pairwise different positions, since each position carries at most one variable.
A valuation word defines a data word along with a valuation of Y. The data word corresponding to a valuation word v is the word dw(v) ∈ (Σ × D)∗ obtained by projecting v to its first and third components.
In later parts of this chapter, we use a third type of words, which we call symbolic words. In contrast to data and valuation words, symbolic words capture the structure of a list or array but do not contain data.
Definition. (Symbolic word). Let Σ = 2^PV and Π = Σ × (Y ∪ {−}). A symbolic word is a word w ∈ Π∗ where each p ∈ PV occurs precisely once in the first component of w and each y ∈ Y occurs precisely once in the second component of w.
We denote the symbol in Π representing that neither a pointer variable nor a universally quantified variable occurs by b = (b, −). Analogously to the data word dw(v), the symbolic word corresponding to a valuation word v is the word sw(v) ∈ Π∗ obtained by projecting v to its first two components.
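The three kinds of words and the projections dw and sw can be sketched in a few lines of Python. This is only an illustration under assumed encodings that do not come from the text: a pointer set is a `frozenset` of names, the symbol “−” is `None`, data values are integers, and the helper `is_valuation_word` is a hypothetical name.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# Assumed encodings (not from the text): pointer sets are frozensets of names,
# the symbol "−" is None, and the data domain D is the integers.
PointerSet = FrozenSet[str]
DataWord = List[Tuple[PointerSet, int]]
ValuationWord = List[Tuple[PointerSet, Optional[str], int]]
SymbolicWord = List[Tuple[PointerSet, Optional[str]]]

BLANK: PointerSet = frozenset()  # the blank symbol b (no pointer variable)


def dw(v: ValuationWord) -> DataWord:
    """Project a valuation word to its first and third components."""
    return [(ptrs, d) for ptrs, _, d in v]


def sw(v: ValuationWord) -> SymbolicWord:
    """Project a valuation word to its first two components."""
    return [(ptrs, y) for ptrs, y, _ in v]


def is_valuation_word(v: ValuationWord, pv: Set[str], ys: Set[str]) -> bool:
    """Check that each p in PV and each y in Y occurs precisely once."""
    pointer_hits = [p for ptrs, _, _ in v for p in ptrs]
    variable_hits = [y for _, y, _ in v if y is not None]
    return sorted(pointer_hits) == sorted(pv) and sorted(variable_hits) == sorted(ys)


# A four-cell word with PV = {head, i, j, tail} and Y = {y1, y2}.
v: ValuationWord = [
    (frozenset({"head", "i"}), None, 2),
    (BLANK, "y1", 13),
    (frozenset({"j"}), None, 42),
    (frozenset({"tail"}), "y2", 17),
]
```

Encoding a position as a triple with a single optional variable mirrors the alphabet Σ × (Y ∪ {−}) × D, so two variables of Y can never share a position.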
Example. Consider the valuation word shown in the figure below. Besides the valuation word itself, the figure illustrates which components of the valuation word constitute the corresponding data word and which the corresponding symbolic word. Note that the figure depicts symbols of the word as column-vectors for the sake of readability, whereas we usually use row-vectors in depictions of QDAs.
PV:   {head, i}    b     {j}    {tail}
Y:        −        y1     −       y2
D:        2        13     42      17
Figure: A valuation word together with a depiction of which components constitute the corresponding data word and symbolic word. Each column is one symbol of the valuation word; the data word consists of the PV and D rows, and the symbolic word consists of the PV and Y rows.

To express properties on the data, we fix a set of constants, functions, and relations over the data domain. We assume that the quantifier-free first-order theory over this domain is decidable. We encourage the reader to keep in mind the theory of integers with constants 0, 1, etc., addition, and the usual relations ≤, <, =, etc. as a standard example of such a domain.
A quantified data automaton uses a finite set F of data formulas over the atoms d(y1), …, d(yk) that refer to the data values at the cells referenced by the variables y1, …, yk. Moreover, we assume that F forms a (semi-)lattice ℱ = (F, ⊑, ⊔, false, true) where ⊑ is the partial-order relation over F, ⊔ is the least upper bound, and false and true are formulas required to be in F that correspond to the bottom and top elements of the lattice, respectively. Furthermore, we assume that whenever α ⊑ β, then α → β. Finally, we require formulas in the lattice to be pairwise inequivalent.
One obtains an example of such a formula lattice over the data domain of integers by taking a set of representatives of all inequivalent Boolean formulas over the atoms involving no constants, defining α ⊑ β if and only if α → β, and taking the least upper bound of two formulas as their disjunction. Such a lattice is of size doubly exponential in the number of variables and, consequently, unsuitable in practice. Thus, one might want to use a different, coarser lattice, such as the Cartesian lattice.
The Cartesian lattice is formed over a set of atomic formulas and consists of conjunctions of literals (atoms or negations of atoms). The least upper bound of two formulas is the conjunction of those literals that occur in both formulas; for example, if the set of atomic formulas is {ϕ1, …, ϕ4}, α = ϕ1 ∧ ¬ϕ2 ∧ ϕ3, and β = ϕ2 ∧ ϕ3 ∧ ϕ4, then α ⊔ β = ϕ3 because this is the only literal that occurs in both α and β. For the ordering, we define α ⊑ β if and only if all literals appearing in β also appear in α. Note that the size of a Cartesian lattice is only exponential in the number of literals.
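The join and the ordering of the Cartesian lattice are simple set operations, as the following sketch illustrates. The encoding is an assumption made for this example only: a formula is a `frozenset` of literals read as a conjunction, the empty set stands for true, and false is modeled by the sentinel `None`.

```python
from typing import FrozenSet, Optional, Tuple

# Assumed encoding (not from the text): a literal is (atom name, polarity),
# a formula is the set of its literals, and None stands for the bottom
# element false.
Literal = Tuple[str, bool]
Formula = Optional[FrozenSet[Literal]]

TRUE: Formula = frozenset()  # the empty conjunction
FALSE: Formula = None


def join(alpha: Formula, beta: Formula) -> Formula:
    """Least upper bound: keep exactly the literals occurring in both formulas."""
    if alpha is None:
        return beta
    if beta is None:
        return alpha
    return alpha & beta


def leq(alpha: Formula, beta: Formula) -> bool:
    """alpha ⊑ beta iff all literals appearing in beta also appear in alpha."""
    if alpha is None:
        return True
    if beta is None:
        return False
    return beta <= alpha


# The example from the text: alpha = ϕ1 ∧ ¬ϕ2 ∧ ϕ3 and beta = ϕ2 ∧ ϕ3 ∧ ϕ4.
alpha: Formula = frozenset({("phi1", True), ("phi2", False), ("phi3", True)})
beta: Formula = frozenset({("phi2", True), ("phi3", True), ("phi4", True)})
```

Here `join(alpha, beta)` contains only the literal ϕ3, matching the example above. Dropping literals weakens a conjunction, so α ⊑ β implies α → β in this encoding, as the lattice assumptions require.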
We have now introduced all necessary concepts and are ready to define the automaton model.
Definition. (Quantified data automaton). Let PV be a finite set of pointer variables, D a data domain, Y a finite, nonempty set of universally quantified variables, and ℱ a formula lattice over a finite set F of formulas. A quantified data automaton (QDA) is a tuple A = (Q, Π, q0, δ, f) where Q is a finite, nonempty set of states, Π = Σ × (Y ∪ {−}) is the input alphabet, q0 ∈ Q is the initial state, δ : Q × Π → Q is the (partial) transition function, and f : Q → F is the final-evaluation function, which maps each state to a data formula.
Intuitively, a QDA is a register automaton that is equipped with a register for each universally quantified variable y ∈ Y. A QDA reads a valuation word, stores the data located at the positions referenced by the variables in Y, and checks whether the formula decorating the state finally reached holds for the data in the registers. It accepts a data word u ∈ (Σ × D)∗ if it accepts all possible valuation words that extend u with a valuation of Y.
Before we define the semantics of QDAs formally, let us briefly comment on why we allow QDAs to access data solely at cells referenced by universally quantified variables. There are two reasons for this decision. First, granting access also to the data at other cells (particularly those referenced by pointer variables) introduces subtle but serious problems, which we want to avoid. Second, most (natural) invariants do not express properties of the data at cells referenced by pointer variables; due to the unbounded nature of arrays and lists, invariants typically state conditions that need to be satisfied by all (or arbitrary) cells and can be expressed solely by relating the data at cells referenced by universally quantified variables. In addition, the formula lattice is smaller (as it contains fewer atoms), which is advantageous in the context of learning QDAs. Though the limited access to the data restricts the invariants expressible by QDAs, our experiments show that this is not a concern in applications.
Let us now formalize the semantics of QDAs. Given a QDA A = (Q, Π, q0, δ, f), a configuration of A is a pair (q, r) where q ∈ Q is a state and r : Y → D is a partial variable assignment that assigns values of the data domain to universally quantified variables. The initial configuration is (q0, r0) where the domain of r0 is empty.
The run of A on a valuation word v = (a1, y1, d1) … (an, yn, dn) ∈ (Σ × (Y ∪ {−}) × D)∗ is a sequence (q0, r0), …, (qn, rn) of configurations that satisfies

$$\delta(q_{i-1}, (a_i, y_i)) = q_i \quad\text{and}\quad r_i = \begin{cases} r_{i-1}\{y_i \leftarrow d_i\} & \text{if } y_i \in Y; \\ r_{i-1} & \text{if } y_i = -; \end{cases}$$

for all i ∈ {1, …, n} (recall that $r_{i-1}\{y_i \leftarrow d_i\}$ corresponds to the mapping $r_{i-1}$ in which $y_i$ is mapped to $d_i$). As in the case of DFAs, we use $A \colon (q_0, r_0) \xrightarrow{v} (q_n, r_n)$ as a shorthand for this run.
The QDA A accepts a valuation word v if $A \colon (q_0, r_0) \xrightarrow{v} (q, r)$ where (q0, r0) is the initial configuration and r ⊨ f(q); that is, after reading the valuation word, the data stored in the registers satisfies the formula annotating the state finally reached. The language L_val(A) is the set of valuation words accepted by A.
The QDA A accepts a data word u ∈ (Σ × D)∗ if A accepts all valuation words v with dw(v) = u. The language L_dat(A) is the set of data words accepted by A.
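The semantics above can be made concrete with a small interpreter. The following is a minimal sketch under several assumptions that are not part of the text: data formulas are represented as Python predicates over the registers instead of lattice elements, the symbol “−” is `None`, and the example automaton `sorted_qda` (with PV empty and Y = {y1, y2}) is a hypothetical QDA for sortedness. Acceptance of a data word is checked naively by enumerating every valuation of Y.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Symbol = Tuple[FrozenSet[str], Optional[str]]  # element of Π; None encodes "−"
Registers = Dict[str, int]                     # partial assignment r : Y → D
Formula = Callable[[Registers], bool]          # stand-in for a data formula
ValuationWord = List[Tuple[FrozenSet[str], Optional[str], int]]
DataWord = List[Tuple[FrozenSet[str], int]]


class QDA:
    def __init__(self, q0: str, delta: Dict[Tuple[str, Symbol], str],
                 f: Dict[str, Formula], ys: List[str]) -> None:
        self.q0, self.delta, self.f, self.ys = q0, delta, f, ys

    def run(self, v: ValuationWord) -> Optional[Tuple[str, Registers]]:
        """Return the final configuration, or None if a transition is missing."""
        q: Optional[str] = self.q0
        r: Registers = {}                      # the domain of r0 is empty
        for ptrs, y, d in v:
            q = self.delta.get((q, (ptrs, y)))
            if q is None:
                return None
            if y is not None:
                r = {**r, y: d}                # r{y ← d}
        return q, r

    def accepts_valuation_word(self, v: ValuationWord) -> bool:
        config = self.run(v)
        return config is not None and self.f[config[0]](config[1])

    def accepts_data_word(self, u: DataWord) -> bool:
        """Accept u iff every valuation word v with dw(v) = u is accepted."""
        for positions in permutations(range(len(u)), len(self.ys)):
            place = dict(zip(positions, self.ys))
            v = [(ptrs, place.get(i), d) for i, (ptrs, d) in enumerate(u)]
            if not self.accepts_valuation_word(v):
                return False
        return True


B: FrozenSet[str] = frozenset()  # blank pointer set; PV is empty in this example
sorted_qda = QDA(
    q0="s0",
    delta={
        ("s0", (B, None)): "s0", ("s0", (B, "y1")): "s1", ("s0", (B, "y2")): "t",
        ("s1", (B, None)): "s1", ("s1", (B, "y2")): "s2",
        ("s2", (B, None)): "s2",
        ("t", (B, None)): "t", ("t", (B, "y1")): "t",
    },
    f={
        "s0": lambda r: False,                 # y1 and y2 not yet read
        "s1": lambda r: False,                 # y2 not yet read
        "s2": lambda r: r["y1"] <= r["y2"],    # d(y1) ≤ d(y2)
        "t": lambda r: True,                   # y2 occurred before y1
    },
    ys=["y1", "y2"],
)
```

With these definitions, `sorted_qda.accepts_data_word` holds exactly for data words whose values are non-decreasing from left to right. The enumeration visits all injective placements of Y on the positions of u, so it is meant for illustration rather than efficiency.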
To ease working with QDAs and to obtain the intended semantics, we assume throughout this chapter that each QDA satisfies two further constraints:
• Each QDA verifies that its input satisfies the constraints on the number of occurrences of variables from PV and Y. All inputs violating these constraints (i.e., all inputs that are not valuation words) either do not admit a run due to missing transitions or lead to a dedicated state labeled with the data formula false. This property implies that the states of a QDA are “typed” with the set of variables that have been read so far. As a consequence, cycles in the transition structure of a QDA can only be labeled with b-symbols. Note that this assumption is no restriction because both the language of valuation words and the language of data words are defined in terms of words that satisfy the correct occurrence of variables from PV and Y.
• Each QDA verifies that the universally quantified variables occur in its input in the same fixed order, say y1 ≺ · · · ≺ yk. All valuation words violating this order lead to a dedicated state labeled with the data formula true (i.e., all such valuation words are accepted). The rationale behind this assumption is the following: since the variables y ∈ Y are universally quantified, it is sufficient to check a property with respect to a fixed order, and a different order should not change the accepted language of data words.
Although this assumption is a restriction in general, each QDA can be transformed into one that accepts the same data language and respects the predetermined variable ordering if the formula lattice is closed under conjunction. The idea for such a construction is to use a subset construction that follows all paths that only differ in the order of Y. For each state in a set of states reached in this way, one remembers in which order the variables in Y have occurred. At the final states, one uses the conjunction of all formulas in the set with the appropriate renaming of the variables in Y. Due to the universal semantics of QDAs, this results in a QDA that accepts the same data language as the original automaton. Since most natural formula lattices, such as the full lattice and the Cartesian lattice (which we use in this chapter), are closed under conjunction, we can without loss of generality assume that each QDA respects a fixed ordering of the universally quantified variables.
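The two constraints amount to a three-way classification of words over Π, which the following sketch spells out. The function name `classify` and its return labels are hypothetical and chosen only for this illustration; as before, the symbol “−” is encoded as `None`.

```python
from typing import FrozenSet, List, Optional, Sequence, Set, Tuple

Symbol = Tuple[FrozenSet[str], Optional[str]]  # element of Π; None encodes "−"


def classify(word: List[Symbol], pv: Set[str], y_order: Sequence[str]) -> str:
    """Classify a word over Π with respect to the two constraints.

    'invalid'     : some variable of PV or Y does not occur precisely once
                    (no run, or a state labeled false)
    'wrong-order' : occurrences are correct, but Y is not read in the
                    fixed order (a state labeled true)
    'ordered'     : the word respects both constraints, so the formula of
                    the state finally reached decides acceptance
    """
    seen_pointers = [p for ptrs, _ in word for p in ptrs]
    seen_variables = [y for _, y in word if y is not None]
    if sorted(seen_pointers) != sorted(pv) or sorted(seen_variables) != sorted(y_order):
        return "invalid"
    if seen_variables != list(y_order):
        return "wrong-order"
    return "ordered"


B: FrozenSet[str] = frozenset()
HEAD: FrozenSet[str] = frozenset({"head"})
```

For example, with PV = {head} and the order y1 ≺ y2, the word (head, −)(b, y1)(b, y2) is classified as ordered, whereas swapping y1 and y2 yields a wrong-order word that every QDA satisfying the second constraint accepts regardless of the data.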