2.8.1 From S-FA/S-FT to S-EFA/S-EFT
The concept of automata with predicates instead of concrete symbols was first mentioned by Watson [Wat99] and was first discussed by Noord et al. [vNG01] in the context of natural language processing. Symbolic finite automata have been further studied in the context of automata minimization [DV14].
Symbolic finite transducers (S-FTs) were originally introduced by Hooimeijer et al. [HLM+11] with a focus on security analysis of sanitizers. The formal foundations and the theoretical analysis of the underlying S-FT algorithms, in particular an algorithm for one-equality of S-FTs modulo a decidable background theory, are studied by Veanes et al. [VHL+12]. The same work defined Symbolic Transducers (S-Ts). Full equivalence of finite state transducers is undecidable [Gri68], and already so for very restricted fragments [Iba78]. Equivalence of single-valued finite state transducers is decidable [Sch75] and the result has been extended to the finite-valued case [CK86, Web93]. Symbolic finite transducers have been extended to tree structures [DVLM14]. When reasoning about string coders, the use of symbolic alphabets is only beneficial when adding look-ahead, and S-EFTs are strictly more succinct than S-FTs.
An algorithm for checking whether a predicate is monadic (finite disjunction of Cartesian predicates) has been proposed by Veanes et al. [VBNB14]. This algorithm can be used to improve the expressiveness of BEX. The register elimination algorithm presented in Section 2.6.2.2 has also been extended to larger classes of S-Ts [VMML15]. In the future, we plan to extend BEX to support such an algorithm in order to support richer analysis.
2.8.2 Models that allow binary predicates
In recent years there has been considerable interest in automata that accept words over infinite alphabets [Seg06, KF94]. In this line of work, symbols can only be compared using equality and arbitrary predicates are not allowed, making the proposed models incomparable to those analyzed in this chapter. Here, we focus on proving negative and positive properties of S-EFAs and S-EFTs over arbitrary decidable Boolean algebras. While we do not investigate specific theories, it would be interesting to understand whether the properties we discussed hold when considering an alphabet theory that only supports equality.
Symbolic finite transducers with look-back k (k-SLTs) have a sliding window of size k that allows them to refer to the k−1 previous characters [BB13]. We showed that k-SLTs are more expressive than S-EFTs and that, contrary to the claim by Botinčan et al. [BB13], k-SLTs are not closed under composition, and equivalence of k-SLTs is undecidable. Streaming string transducers [AC11] provide another recent symbolic extension of finite transducers where the label theories are restricted to be total orders in order to maintain decidability of equivalence [AC11]. Streaming transducers are largely orthogonal to S-FTs or the extension of S-EFTs. For example, streaming transducers do not allow arithmetic, but can reverse the input, which is not possible with S-EFTs.
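The sliding-window idea behind look-back can be illustrated with a minimal Python sketch. It is not the k-SLT definition from [BB13]: it models only a single-state transducer, and the names `run_lookback` and `delta` are hypothetical. At each position the rule sees the current symbol together with at most k−1 previous symbols and emits a list of output symbols.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical rule type: maps the current window (oldest symbol first,
# current symbol last) to the symbols emitted at this position.
Rule = Callable[[Tuple[int, ...]], List[int]]

def run_lookback(k: int, rule: Rule, word: Sequence[int]) -> List[int]:
    """Run a single-state transducer whose rule sees a window of size at most k."""
    out: List[int] = []
    for i in range(len(word)):
        # The window holds the current symbol and up to k-1 previous ones.
        window = tuple(word[max(0, i - k + 1): i + 1])
        out.extend(rule(window))
    return out

def delta(window: Tuple[int, ...]) -> List[int]:
    """Look-back 2 rule: emit the difference to the previous symbol."""
    if len(window) == 1:
        return [window[0]]  # first symbol has no predecessor
    return [window[-1] - window[-2]]

encoded = run_lookback(2, delta, [3, 5, 4])
```

The rule relates adjacent input positions through arithmetic, which is the kind of binary relation between symbols that this section is concerned with.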
2.8.3 Models over finite alphabets
Extended Finite Automata (XFA) are introduced in [SEJK08] for network packet inspection. XFAs are a succinct representation of DFAs that use registers to store and inspect values. History-based finite automata [KCTV07] are another extension of DFAs, introduced in the context of network intrusion detection, that use a single register (bit-vector) to keep track of the symbols read so far. In both models the register is used together with the input character to determine when a transition is enabled. Both these models focus on succinctness and they differ from S-EFAs in two ways: 1) they only support finite alphabets; and 2) they can relate symbols at arbitrary positions, while S-EFAs can only relate adjacent positions. We have not investigated the application of S-EFAs to network packet inspection and network intrusion detection, but we think that S-EFAs can help achieve a further level of succinctness in these domains.
Extended Top-Down Tree Transducers [MGHK09] (ETTTs) are commonly used in natural language processing. ETTTs also allow finite look-ahead on transformation from trees to trees, but only support finite alphabets. The special case in which the input is a string (unary tree) is equivalent to S-EFTs over finite alphabets. This chapter focuses on S-EFTs over any decidable theory. We leave as future work extending S-EFTs to tree transformations.
Chapter 3
FAST: a transducer-based language for manipulating trees
“but most of all, samy is my hero”
—Samy Kamkar, The Samy Worm
3.1 Introduction
Trees are ubiquitous data structures and are used in a variety of applications in software engineering. For example, XML, HTML, and JSON are tree formats, compilers operate over an abstract syntax tree, and in natural language processing sentences are structured as trees. Providing better support for manipulating trees is clearly beneficial. In this chapter we focus on using variants of tree automata and tree transducers to design a language, FAST, that can analyze practical tree-manipulating programs.
3.1.1 Limitations of current models
Tree automata are used in a variety of applications in software engineering, from analysis of XML programs [HP03] to type-checking [Sei94]. Tree transducers extend tree automata to model functions from trees to trees, and appear in fields such as natural language processing [MGHK09, PS12, MK08] and XML transformations [MBPS05]. While these formalisms are of immense practical use, they suffer from a major drawback: in the most common forms they can only handle finite alphabets. Moreover, in practice existing models do not scale well even for finite but large alphabets.
In order to overcome this limitation, symbolic tree automata (S-TAs) and symbolic tree transducers (S-TTs) [VB12b, FV14] support complex alphabets by allowing transitions to be labeled with formulas in a specified alphabet theory. While the concept is straightforward, traditional algorithms for deciding composition, equivalence, and other properties of finite automata and transducers do not immediately generalize. A notable example is the one we discussed in Chapter 2, where we showed that, while allowing finite state automata transitions to read subsequent inputs does not add expressiveness, in the symbolic case this extension makes most problems, such as checking equivalence, undecidable.
Symbolic tree automata still enjoy the closure and decidability properties of tree automata [VB12b] under the assumption that the alphabet theory forms a decidable Boolean algebra (i.e., closed under Boolean operations). In particular, S-TAs are closed under Boolean operations, and enjoy decidable emptiness and equivalence.
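The following Python sketch illustrates predicate-labeled tree automata and closure under intersection. It is an illustration under simplifying assumptions rather than the construction of [VB12b]: guards are plain Python callables over integer labels instead of formulas handled by a solver, and the names `Tree`, `accepts`, `product`, `POS`, and `EVEN` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

@dataclass(frozen=True)
class Tree:
    label: int
    children: Tuple["Tree", ...] = ()

# A rule (guard, child_states) of state q accepts a node if its label
# satisfies guard and its i-th child is accepted from child_states[i].
Rule = Tuple[Callable[[int], bool], Tuple[Hashable, ...]]
Automaton = Dict[Hashable, List[Rule]]

def accepts(ta: Automaton, state: Hashable, t: Tree) -> bool:
    """Top-down acceptance check; a state without rules accepts nothing."""
    return any(
        len(qs) == len(t.children)
        and guard(t.label)
        and all(accepts(ta, q, c) for q, c in zip(qs, t.children))
        for guard, qs in ta.get(state, [])
    )

def product(a: Automaton, b: Automaton) -> Automaton:
    """Intersection: state (p, q) accepts the trees accepted by both p and q."""
    out: Automaton = {}
    for p, rules_p in a.items():
        for q, rules_q in b.items():
            out[(p, q)] = [
                # The conjunction of guards stays inside the alphabet theory.
                (lambda x, g1=g1, g2=g2: g1(x) and g2(x), tuple(zip(qs1, qs2)))
                for g1, qs1 in rules_p
                for g2, qs2 in rules_q
                if len(qs1) == len(qs2)
            ]
    return out

# Trees (leaves or binary nodes) whose labels are all positive / all even.
POS: Automaton = {"pos": [(lambda x: x > 0, ()), (lambda x: x > 0, ("pos", "pos"))]}
EVEN: Automaton = {"even": [(lambda x: x % 2 == 0, ()), (lambda x: x % 2 == 0, ("even", "even"))]}
BOTH = product(POS, EVEN)
```

In a symbolic setting the conjoined guard would additionally be checked for satisfiability, so that rules with unsatisfiable guards can be discarded; the sketch omits this step because callables cannot be decided.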
A symbolic tree transducer (S-TT) traverses the input tree in a top-down fashion, processes one node at a time, and produces an output tree. This simple model can capture several natural language transformations and simple recursive programs that operate over trees. However, in most useful cases S-TTs are not closed under sequential composition [FV14]. This is a practical limitation when verifying even simple program properties.
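A top-down traversal that processes one node at a time can be sketched in Python as follows. This is a simplified, deterministic illustration (the first matching rule is applied), not the formal S-TT definition; the names `transduce`, `node`, and `CLAMP` are hypothetical, and guards are Python callables over integer labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Tree:
    label: int
    children: Tuple["Tree", ...] = ()

# A rule is (guard, arity, build). build(label, rec) constructs the output,
# where rec(state, i) transduces the i-th child starting from `state`.
Recurse = Callable[[str, int], Tree]
Rule = Tuple[Callable[[int], bool], int, Callable[[int, Recurse], Tree]]

def transduce(rules: Dict[str, List[Rule]], state: str, t: Tree) -> Tree:
    """Apply the first rule whose arity and guard match the current node."""
    for guard, arity, build in rules.get(state, []):
        if arity == len(t.children) and guard(t.label):
            return build(t.label, lambda q, i: transduce(rules, q, t.children[i]))
    raise ValueError(f"no rule of state {state!r} applies to label {t.label}")

def node(relabel: Callable[[int], int], n: int) -> Callable[[int, Recurse], Tree]:
    """Output builder: relabel the node and transduce all n children from 'q'."""
    return lambda x, rec: Tree(relabel(x), tuple(rec("q", i) for i in range(n)))

# Replace every negative label by 0 in trees built from leaves and binary nodes.
CLAMP: Dict[str, List[Rule]] = {
    "q": [(lambda x: x < 0, n, node(lambda x: 0, n)) for n in (0, 2)]
       + [(lambda x: x >= 0, n, node(lambda x: x, n)) for n in (0, 2)]
}

result = transduce(CLAMP, "q", Tree(-1, (Tree(5), Tree(-7))))
```

Each rule inspects only the label and arity of the current node, so a decision at a node cannot depend on the contents of its subtrees.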
3.1.2 Contributions
To overcome the limitations we just discussed, we introduce symbolic tree transducers with regular look-ahead (S-TTRs), which extend S-TTs with regular look-ahead [Eng77], that is, the capability of checking whether the subtrees of each processed node belong to some regular tree languages. S-TTRs support complex, potentially infinite alphabets, and we show that S-TTRs are closed under composition in most practical scenarios: two S-TTRs A and B can be composed into a single S-TTR A◦B if either A is single-valued (for every input it produces at most one output), or B is linear (it traverses each node in the tree at most once). Remarkably, the algorithm works for any decidable alphabet theory that forms an effective Boolean algebra.
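Regular look-ahead can be sketched in Python by letting each transducer rule name, for every child, a tree-automaton state from which that child must be accepted. This is an illustrative simplification, not the S-TTR definition of this chapter: rules are tried in order and the first match wins, guards are Python callables, and the names `accepts`, `transduce`, `POS`, and `PRUNE` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Tree:
    label: int
    children: Tuple["Tree", ...] = ()

def accepts(ta, state, t):
    """Look-ahead automaton: rules are (guard, child_states) per state."""
    return any(
        len(qs) == len(t.children) and guard(t.label)
        and all(accepts(ta, q, c) for q, c in zip(qs, t.children))
        for guard, qs in ta.get(state, [])
    )

def transduce(ta, rules, state, t):
    """Rules are (guard, lookahead, build); lookahead[i] is an automaton
    state the i-th child must be accepted from, or None for no constraint."""
    for guard, lookahead, build in rules.get(state, []):
        if (len(lookahead) == len(t.children) and guard(t.label)
                and all(la is None or accepts(ta, la, c)
                        for la, c in zip(lookahead, t.children))):
            return build(t, lambda q, i: transduce(ta, rules, q, t.children[i]))
    raise ValueError(f"no rule of state {state!r} applies to label {t.label}")

# Look-ahead language: trees whose labels are all positive.
POS = {"pos": [(lambda x: x > 0, ()), (lambda x: x > 0, ("pos", "pos"))]}

# Keep the left subtree only if all of its labels are positive;
# otherwise replace it with the leaf 0.
PRUNE = {"q": [
    (lambda x: True, (), lambda t, rec: Tree(t.label)),
    (lambda x: True, ("pos", None),
     lambda t, rec: Tree(t.label, (rec("q", 0), rec("q", 1)))),
    (lambda x: True, (None, None),
     lambda t, rec: Tree(t.label, (Tree(0), rec("q", 1)))),
]}

pruned = transduce(POS, PRUNE, "q",
                   Tree(1, (Tree(2, (Tree(3), Tree(-4))), Tree(5))))
```

The decision at the root depends on a label deep inside the left subtree, which a rule that inspects only the current node cannot express.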
We present FAST, a frontend programming language for S-TAs and S-TTRs. FAST (Functional abstraction of symbolic transducers) is a functional language that integrates symbolic tree automata and transducers with Z3 [DMB08], a state-of-the-art solver capable of supporting complex theories that range from data-types to non-linear real arithmetic.
We use FAST to model several real-world scenarios and analysis problems: we demonstrate applications to HTML sanitization, interference checking of augmented reality applications submitted to an app store, deforestation in functional language compilation, and analysis of functional programs over trees. We also sketch how FAST can capture simple CSS analysis tasks. All these problems require the use of symbolic alphabets.
This chapter is structured as follows:
• Section 3.2 introduces the language FAST using an example in the context of HTML sanitization;
• Section 3.3 presents a theory of symbolic tree transducers with regular look-ahead (S-TTR) and FAST, a transducer-based language founded on the theory of S-TTRs;
• Section 3.4 discusses a new algorithm for composing S-TTRs, along with a proof of correctness;
• Section 3.5 presents five concrete applications of FAST and shows how the closure under composition of S-TTR can be beneficial in practical settings;
• Section 3.6 compares FAST with previous domain specific languages for tree manipulation.