State Complexity of Regular Tree Languages for Tree Matching

(1)

Vol. 27, No. 8 (2016) 965–979 c

World Scientific Publishing Company DOI: 10.1142/S0129054116500398

State Complexity of Regular Tree Languages for Tree Matching∗

Sang-Ki Ko

Department of Computer Science, University of Liverpool Ashton Street, Liverpool, L69 3BX, United Kingdom

[email protected] Ha-Rim Lee†_{and Yo-Sub Han}‡

Department of Computer Science, Yonsei University 50 Yonsei-Ro, Seodaemun-Gu, Seoul 120-749, Republic of Korea

†_{[email protected]} ‡_{[email protected]}

Received 6 November 2014 Accepted 3 May 2016 Communicated by Arto K. Salomaa

We study the state complexity of regular tree languages for tree matching problem. Given a tree t and a set of pattern trees L, we can decide whether or not there exists a subtree occurrence of trees in L from the tree t by considering the new language L′_{which accepts}

all trees containing trees in L as subtrees. We consider the case when we are given a set of pattern trees as a regular tree language and investigate the state complexity. Based on the sequential and parallel tree concatenation, we define three types of tree languages for deciding the existence of different types of subtree occurrences. We also study the deterministic top-down state complexity of path-closed languages for the same problem. Keywords: State complexity; tree matching; regular tree languages; path-closed languages.

1. Introduction

State complexity is one of the most interesting topics in automata and formal language theory [10, 12, 26, 27]. The state complexity of finite automata has been studied since the 60’s [13, 16, 17]. Maslov [15] initiated the problem of finding the operational state complexity and Yu et al. [27] investigated the state complexity for basic operations. Later, Yu and his co-authors [4,7,24,25] initiated the study on the state complexity of combined operations such as star-of-union, star-of-intersection ∗_{A preliminary version appeared in Proceedings of the 16th International Workshop on}

Descrip-tional Complexity of Formal Systems, DCFS 2014, LNCS 8614, 246–257, Springer-Verlag, 2014.

‡_{Corresponding author.}

(2)

and so on. Moreover, Yu and his co-authors studied the state complexity of com-bined Boolean operations including multiple unions and multiple intersections [4–6]. Recently, the state complexity problem has been extended to regular tree lan-guages. Regular tree languages and tree automata theory provide a formal frame-work for XML schema languages such as XML DTD, XML Schema, and Relax NG [18]. XML schema languages can process a set of XML documents by specify-ing the structural properties. Martens and Niehren [14] considered the problem of efficiently minimizing unranked tree automata. Piao and Salomaa [20,21] considered the state complexity between different models of unranked tree automata. They also investigated the state complexity of concatenation [23] and star [22] for regular tree languages. Two of the authors studied the state complexity of subtree-free regular tree languages, which are a proper subclass of regular tree languages [11].

Since a regular tree language is a set of trees, it is suitable for representing a set of structural documents such as XML documents, web documents, or RNA secondary structures. This implies that a regular tree language can be used as a theoretical toolbox for processing of the structured documents. When it comes to the string case, many researchers often use regular languages to process a set of strings efficiently. Consider the case that we have a set of strings which is a regular language L. Now we want to find any occurrence of strings in L from a text T . The most common way is to construct an FA A that accepts a regular language Σ∗_{L [3].} Then, we read T using A and check whether or not A reaches a final state. When A reaches a final state, we find that there is an occurrence of a matching string of L in T. We extend this approach to the tree matching problem [9]. First, we formally define the tree matching problem to be the problem of finding subtree occurrences of a tree in L from a set of trees T . Since a tree can be processed in a bottom-up or a top-down fashion, we need to consider different types of tree languages for the tree matching problem.

Here we consider three types of tree substructures called a subtree, a topmost subtreeand an internal subtree. Given a tree language L, we construct three types of tree languages recognizing trees which contain the trees in L as subtrees, topmost subtrees and internal subtrees. Note that these tree languages can be used for the tree matching problem as we have used Σ∗_{L for the string pattern matching} problem. In particular, we tackle the deterministic state complexity of regular tree languages and path-closed languages. Interestingly, the tree language consisting of trees that have a subtree belonging to a path-closed language language need not be path-closed and therefore cannot recognized by deterministic top-down tree automata (DTTAs).

We give basic notations and definitions in Sec. 2. We define the three types of tree languages for tree matching in Sec. 3. We present the results on the state complexity of regular tree languages and path-closed languages in Secs. 4 and 5, and conclude the paper in Sec. 6.

(3)

2. Preliminaries

We briefly recall definitions and properties of finite tree automata and regular tree languages. We refer the reader to the books [2,8] for more details on tree automata. A ranked alphabet Σ is a finite set of characters and we denote the set of elements of rank m by Σm⊆ Σ for m ≥ 0. The set FΣ consists of Σ-labeled trees, where a node labeled by σ ∈ Σm always has m children. We use FΣ to denote a set of trees over Σ that is the smallest set S satisfying the following condition: if m ≥ 0, σ ∈ Σmand t1, . . . , tm∈ S, then σ(t1, . . . , tm) ∈ S. Let t(u ← s) be the tree obtained from a tree t by replacing the subtree at a node u of t with a tree s. The notation is extended for a set U of nodes of t and S ⊆ FΣ: t(U ← S) is the set of trees obtained from t by replacing the subtree at each node of U by some tree in S. A nondeterministic bottom-up tree automaton (NBTA) is specified by a tu-ple A = (Σ, Q, Qf, g), where Σ is a ranked alphabet, Q is a finite set of states, Qf ⊆ Q is a set of final states and g associates each σ ∈ Σm to a mapping σg : Qm → 2Q, where m ≥ 0. Assume A has a transition σg(q1, . . . , qm) = P . In this case, A moves to the set P of states by reading a sequence q1, . . . , qm of states and a character σ of rank m. We say that each element of P is a target state of the sequence q1, . . . , qmof states. For each tree t = σ(t1, . . . , tm) ∈ FΣ, we define inductively the set tg⊆ Q by setting q ∈ tg if and only if there exist qi∈ (ti)g, for 1 ≤ i ≤ m, such that q ∈ σg(q1, . . . , qm). Intuitively, tg consists of the states of Q that A may reach by reading t. Thus, the tree language accepted by A is defined as follows: L(A) = {t ∈ FΣ | tg∩ Qf 6= ∅}. The automaton A is a deterministic bottom-up tree automaton(DBTA) if, for each σ ∈ Σm, where m ≥ 0, σgis a partial function Qm_{→ Q.}

A nondeterministic top-down tree automaton (NTTA) is specified by a tuple A = (Σ, Q, Q0, g), where Σ is a ranked alphabet, Q is a finite set of states, Q0⊆ Q is a set of initial states, and g associates each σ ∈ Σm, m ≥ 0, a mapping σg: Q → 2Q

m . As a convention, we denote the m-tuples q1, . . . , qmby [q1, . . . , qm]. A top-down tree automaton A is deterministic if Q0is a singleton set and for all q ∈ Q, σ ∈ Σm, and m ≥ 1, σg is a partial function Q → Qm.

The nondeterministic (bottom-up or top-down) and deterministic bottom-up tree automata accept the family of regular tree languages whereas the deterministic top-down tree automata accept a proper subfamily of regular tree languages — path-closed languages [2, 8].

3. Tree Languages for Tree Pattern Matching

Pattern matching is the problem of finding occurrences of a pattern in a text. The regular expression matching problem is defined as follows: given a pattern regular expression E and an input text T , we want to identify all substrings of T that are in L(E) [3]. Similar to the regular expression matching problem, we consider a pattern given as a set of trees with a tree automaton (TA) A for the tree pattern matching problem. Here we construct a new TA A′ _{from A as we put Σ}∗ _{to the pattern}

(4)

language L(A) to make the new FA A′ _{to simulate all the possible prefixes before} simulating the matching substrings of L(A) [1]. This means that, given A, we need to construct a new TA A′ _{that simulates all the possible prefixes before simulating} the matching subtrees of L(A). Since a tree can be processed in a bottom-up way with a bottom-up TA or a top-down way with a top-down TA, we need to consider three types of tree languages for the tree pattern matching problem.

First we define three different tree substructures. If a tree t′ _{consists of a node} in a tree t and all of its descendants, we call t′_{a subtree of t. If a tree t}′_{is a subtree} of t, then we call t a supertree of t′_{. We also define the topmost subtree of a tree t} as a tree consisting of a set of nodes in t including the root node such that from any node in the set, there exists a path to the root node through the nodes in the set. An internal subtree of a tree t can be defined as a topmost subtree of a subtree of t. We give graphical examples for the definitions in Fig. 1.

t

(a) A subtree t

t

(b) A topmost subtree t

t

(c) An internal subtree t Fig. 1. We define three types of subtrees called a subtree, a topmost subtree and an internal subtree. These figures depict the examples.

Given a tree t and a regular tree language L, we first compute a new regular tree language L′ _{that accepts all possible supertrees of trees in L. Then, we decide} whether or not a tree in L occurs as a subtree of the given tree t by deciding t ∈ L′_. Recall that we build a new FA that accepts Σ∗_{L, which is a concatenation of a} universal language Σ∗ _{and a given language L, for matching a language L of string} patterns. For tree pattern matching problem, we need to consider how to define the concatenation of trees properly. Recently, Piao and Salomaa [23] studied the state complexity of the concatenation of regular tree languages. They defined the sequential σ-concatenation and parallel σ-concatenation where the substitutions can occur at σ-labeled leaves.

We consider a more generalized operation that allows substitution to occur at all leaves regardless of labels. We denote the set of leaves of a tree t by leaf(t). Then, for T1 ⊆ FΣ and t2 ∈ FΣ, we define the sequential concatenation of T1 and t2to be

T1·st2= {t2(u ← t1) | u ∈ leaf(t2), t1∈ T1}.

In other words, T1 ·s t2 is a set of trees obtained from t2 by replacing a leaf with a tree in T1. We extend the sequential concatenation operation to the tree

(5)

languages T1, T2⊆ FΣas follows: T1·sT2=

[

t2∈T2

T1·st2. The parallel concatenation of T1 and t2is

T1·pt2= {t2(leaf(t2) ← t1) | t1∈ T1}.

Thus, T1·pt2 is a set of trees obtained from t2 by replacing all leaves with a tree in T1. We can also extend the parallel concatenation to tree languages. Note that we can say that a tree t2is a topmost subtree of t1 if t1∈ FΣ·pt2.

L FΣ FΣ FΣFΣ FΣ (a) L ·sF_Σ L FΣ FΣ FΣFΣFΣ (b) FΣ·pL L FΣ FΣ FΣFΣ FΣ FΣ FΣ FΣFΣFΣ (c) FΣ·pL·sFΣ

Fig. 2. Three types of tree languages for the tree pattern matching problem.

Relying on the sequential and parallel tree concatenations, we construct three types of tree languages from a regular tree language L for the tree pattern matching problem. See Fig. 2. Given a tree language L,

(1) L ·s_F

Σis a set of trees where a tree in L occurs as a subtree of each tree in the set,

(2) FΣ·pL is a set of trees where a tree in L occurs as a topmost subtree of each tree in The set, and

(3) FΣ·pL ·sFΣ is a set of trees where a tree in L occurs as an internal subtree of each tree in the set.

Notice that a leaf node of a tree can be replaced with any other nodes for the topmost subtree occurrence and the internal subtree occurrence.

4. State Complexity of DBTAs

First we study the state complexity of FΣ·pL which can be used for finding subtree occurrences of a tree in L.

Lemma 1. Given a DBTA A = (Σ, Q, QF, g) with n states for a regular tree language L, 2n−k _{states are sufficient for recognizing}_F

(6)

Proof. Without loss of generality, we assume QF ∩ {σg | σ ∈ Σ0} = ∅ because otherwise FΣ·pL(A) = FΣ. We present an upper bound construction of a DBTA B for FΣ·pL(A). We define B = (Σ, Q′, Q′F, g′), where

Q′ = {X ∪ {σg| σ ∈ Σ0} | X ∈ 2Q\{σg|σ∈Σ0}}, Q′F = {q ∈ Q′ | q ∩ QF 6= ∅}, and the transitions of g′ _{are defined as follows:}

For τ ∈ Σ0, τg′ = {σg| σ ∈ Σ0}. For τ ∈ Σm, m ≥ 1, and P1, P2, . . . , Pm∈ Q′, τg′(P₁, P₂, . . . , Pm) = τg(P1, P2, . . . , Pm) ∪ {σg| σ ∈ Σ0}.

Now we explain how B recognizes the tree language FΣ·pL. Note that we define every target state of g′ _{to be the union of the set of states reachable by g and the} set of states reachable by reading leaf nodes. Since every target state of g′ _{is not} empty, a new DBTA B is complete although A may not be complete. Note that {σg | σ ∈ Σ0} is a set of states that are reachable by reading a leaf node. This implies that all states of B contain the states in {σg| σ ∈ Σ0}. After reading any tree in FΣ, the state of B contains {σg| σ ∈ Σ0} by the construction. Therefore, B can start a simulation of a tree in L(A) after reading any trees in FΣby regarding the trees as leaf nodes. Since FΣ·pL(A) is a set of trees where all the leaf nodes of each tree can be substituted by any trees in FΣ, B accepts FΣ·pL(A).

The upper bound in Lemma 1 is reachable when a DBTA accepts a set of unary trees. If a DBTA accepts a set of unary trees, then we can regard the DBTA as a DFA with multiple initial states. Since the upper bound reaches the maximum when k = 1, we consider the state complexity of catenation of L and Σ∗_{. Let L be} a regular language whose state complexity is n. Then, the state complexity of Σ∗_L is 2n−1 _{[27] which is the same as the bound in Lemma 1. Furthermore, we show} that the upper bound is tight for any 1 ≤ k ≤ n.

Let Σ = Σ0∪ Σ1, where Σ0 = {σ1, σ2, . . . , σk} and Σ1 = {a, b}. We define a DBTA C1= (Σ, QC1, QC1,F, gC1), where QC1= {0, 1, . . . , n − 1}, QC1,F = {n − 1} and the transition function gC1 is defined by setting:

• (σi)gC1 = i − 1 (1 ≤ i ≤ k), • ag_C1(i) = i + 1 mod n, • bg_C1(i) = i (0 ≤ i < k),

• bgC1(i) = i + 1 mod n (k ≤ i < n).

Based on the construction of the proof of Lemma 1, we construct a DBTA D1 = (Σ, QD1, QD1,F, gD1) recognizing FΣ·

p_L(C

1), where QD1 = {P | {0, 1, . . . , k − 1} ⊆ P, P ⊆ QC1}, QD1,F = {P | P ∈ QD1, P ∩ QC1,F 6= ∅}, and the transition function gD1 is defined as follows:

• (σi)g_D1 = {0, 1, . . . , k − 1} (0 ≤ i ≤ k), • agD1(P ) = agC1(P ) ∪ {0, 1, . . . , k − 1}, • bg_D1(P ) = bg_C1(P ) ∪ {0, 1, . . . , k − 1}.

(7)

Notice that L(D1) = FΣ·pL(C1). In the following lemma, we establish that D1 is a minimal DBTA by showing that all states of D1 are reachable and pairwise inequivalent.

Lemma 2. All states of D1 are reachable and pairwise inequivalent.

Proof. First, we prove the reachability of all states of D1. Note that each state of D1is a set of states in C1. By the construction, the size of a state P in QD1satisfies k ≤ |P | ≤ n since {0, 1, . . . , k − 1} ⊆ P . Using the induction on |P |, we show that all states of D1 are reachable.

• Basis: We have a state {0, 1, . . . , k − 1} of size k that is reachable by reading a leaf node.

• Inductive Hypothesis: Assuming that all states P are reachable for |P | ≤ x, we will show that any state P′_{is reachable when |P}′_{| = x+ 1. Let P}′_{= {0, 1, . . . , k −} 1, qk, qk+1, . . . , qx} be a state of size x + 1. The state P′ is reachable from a state {0, 1, . . . , k − 1, qk+1− qk+ k − 1, . . . , qx− qk+ k − 1} by reading a sequence of unary symbols abqk−k. Therefore, all states are reachable by induction.

Next we prove that all states of D1 are pairwise inequivalent. Pick any two distinct states P1 and P2. Assume p ∈ P1\ P2. (The other possibility is completely symmetric.) After reading a sequence of unary symbols an−p−1_{, a final state is} reached from state P1whereas P2 reaches a non-final state. Therefore, all states of D1 are pairwise inequivalent.

Since we have shown that there exists a corresponding lower bound for the upper bound, the bound is tight.

Theorem 3. Given a DBTA A with n states for a regular tree language L, 2n−k _{states are necessary and sufficient in the worst-case for the minimal DBTA} of FΣ·pL if |{σg| σ ∈ Σ0}| = k.

Now we consider L ·s_F

Σ— a tree language consists of all trees that have trees in L as subtrees. In other words, for any tree t in L, we have all possible supertrees of t in L′_{. Given a regular tree language L, it is known that L ·}s_F

Σis also a regular tree language [23]. We study the state complexity of L ·s_F

Σ.

Lemma 4. Given a DBTA A = (Σ, Q, QF, g) with n states for a regular tree language L, n + 1 states are sufficient for recognizing L ·s_F

Σ.

Proof. We construct a new DBTA B = (Σ, Q′_{, Q}′

F, g′) for L ·sFΣ, where Q′ = Q ∪ {qnew}, Q′F = QF, and the transition function g′ is defined as follows:

For τ ∈ Σ0,

τg′ =

τg if τg is defined, qnew otherwise.

(8)

For τ ∈ Σm, m ≥ 1, q1, q2, . . . , qm∈ Q′, and qf ∈ Q′F, τg′(q₁, q₂, . . . , qm) =        τg(q1, q2, . . . , qm) if τg(q1, q2, . . . , qm) is defined and {q1, q2, . . . , qm} ∩ Qf = ∅, qf if {q1, q2, . . . , qm} ∩ Qf 6= ∅, qnew otherwise.

Now we explain how B accepts a set of all trees that are supertrees of trees in L. We define the transition function g′ _{to be complete by setting the target state of} the undefined transition as a new state qnew. Then, B moves to qnew by reading trees in the complement of L and moves to one of its final states by reading trees in L. Assume that B accepts a tree in L and arrives at a final state qf. After then, B stays in qf by reading any sequence of states including qf. This implies that B accepts all supertrees of trees in L(A).

We cannot reach the upper bound n + 1 with any DFA in this case since the state complexity of LΣ∗ _{is n, which is the same as that of L, even for incomplete} DFAs. Thus, we show that there exists a lower bound DBTA of n + 1 states for accepting L ·s_F

Σ, where the state complexity of L is n to prove the tightness of the upper bound.

Let Σ = Σ0∪ Σ1∪ Σ2, where Σ0 = {c}, Σ1 = {a} and Σ2= {b}. We define a DBTA C2= (Σ, QC2, QC2,F, gC2), where QC2 = {0, 1, . . . , n − 1}, QC2,F = {n − 1}, and the transition function gC2 is defined by setting:

• cg_C2 = 0,

• agC2(i) = bgC2(i, i) = i + 1 mod n.

All transitions of gC2 not listed above are undefined. Based on the construction of the proof of Lemma 4, we construct a DBTA D2= (Σ, QD2, QD2,F, gD2) recognizing L(C2) ·sFΣ, where QD2 = QC2∪ {n}, QD2,F = QC2,F and the transition function gD2 is defined as follows:

• cgD2 = 0,

• agD2(i) = bgD2(i, i) = i + 1 (0 ≤ i ≤ n − 2),

• ag_D2(n − 1) = bg_D2(n − 1, i) = bg_D2(i, n − 1) = n − 1 (0 ≤ i ≤ n − 1), • agD2(n) = bgD2(i, j) = n (i 6= j, i 6= n − 1, j 6= n − 1).

Notice that L(D2) = L(C2) ·sFΣ. In the following lemma, we establish that D2 is a minimal DBTA by showing that all states in QD2 are reachable and pairwise inequivalent.

Proof. First, we prove the reachability of all states of D2. It is easy to verify that the state i (0 ≤ i ≤ n−1) is reachable from the state cg_C2 = 0 by reading a sequence of unary symbols ai_{. Then, the state n is reachable by reading a binary symbol b} with two states i and j (0 ≤ i, j ≤ n − 2, i 6= j) since bg_D2(i, j) = n by construction.

(9)

We prove that all states are pairwise inequivalent. We consider two distinct states i and j such that i < j. There are two possible cases:

• 0 ≤ i < j < n: From the state j, we arrive at a final state n − 1 by reading a sequence of unary symbols an−1−j_{. However, the state i arrives at n − 1 − j + i} by reading the same sequence and the state n − 1 − j + i is not final.

• 0 ≤ i < n and j = n: From the state i, we arrive at a final state by reading a sequence of unary symbols an−1−i_{whereas the state j stays at the state n, which} is not final.

We have shown that all states are pairwise inequivalent in all possible cases. Based on Lemma 4 and Lemma 5, we establish the following statement. Theorem 6. Given a DBTA A with n states for a regular tree languages L, n + 1 states are necessary and sufficient in the worst-case for the minimal DBTA of L ·s_F

Σ.

We lastly consider the state complexity of FΣ·pL ·sFΣ. Note that the sequential catenation of trees is not associative whereas the parallel catenation of trees is associative. That means that there exist trees t1, t2and t3such that (t1·st2)·st3and t1·s(t2·st3) do not coincide. This also applies to the catenation of tree languages and thus, leads to (L1·sL2)·sL36= L1·s(L2·sL3) for some regular tree languages L1, L2, and L3. However, for the case when L1 and L3 are FΣ,

(FΣ·sL2) ·sFΣ= FΣ·s(L2·sFΣ)

holds. Thus, we simply denote the language by FΣ·pL·sFΣinstead of (FΣ·sL2)·sFΣ or FΣ·s(L2·sFΣ). Now we tackle the state complexity of FΣ·pL ·sFΣ.

Lemma 7. Given a DBTA A = (Σ, Q, QF, g) with n states for a regular tree language L, 2n−t−k_{+ 1 states are sufficient for recognizing F}

Σ·pL ·sFΣif|QF| = t and |{σg| σ ∈ Σ0}| = k.

Proof. Without loss of generality, we assume QF ∩ {σg | σ ∈ Σ0} = ∅ because otherwise FΣ·pL(A) ·sFΣ= FΣ. We give an upper bound construction of DBTA B that recognizes FΣ·pL(A) ·sFΣ. We define B = (Σ, Q′, Q′F, g′), where

Q′= {X ∪ {σg| σ ∈ Σ0} | X ∈ 2Q\(QF∪{σg|σ∈Σ0})} ∪ {QF}, Q′F = {QF}, and the transitions of g′ _{are defined as follows:}

For τ ∈ Σ0, τg′ = {σg| σ ∈ Σ0}. For τ ∈ Σm, m ≥ 1, and P1, P2, . . . , Pm∈ Q′,

τg′(P1, P2, . . . , Pm) =    τg(P1, P2, . . . , Pm) if Sm_i=1Pi∩ QF = ∅ ∪{σg| σ ∈ Σ0} and τg(P1, P2, . . . , Pm) ∩ QF = ∅, QF otherwise.

(10)

Now we explain how B recognizes FΣ·pL(A) ·sFΣ. Note that we define every target state of g′ _{to be the union of the set of states reachable by g and the set of} states reachable by reading leaf nodes. This implies that B can start simulation of a tree in L after reading any trees in FΣ. Therefore, we know that B arrives at a final state of A by reading any trees in FΣ·pL(A). By the construction, B moves to QF which is the single final state of B by reading trees in FΣ·pL(A). After then, B stays in QF by reading any sequence of states including QF. This implies that B accepts all possible supertrees of trees in FΣ·pL(A). Since a set of all supertrees of trees in FΣ·pL(A) is FΣ·pL(A) ·sFΣ, B accepts FΣ·pL(A) ·sFΣ.

Next we present a lower bound example that reaches the upper bound 2n−t−k_+1. Let Σ = Σ0∪ Σ1, where Σ0 = {σ1, σ2, . . . , σk} and Σ1 = {a, b, c}. We define a DBTA C3 = (Σ, QC3, QC3,F, gC3), where QC3 = {0, 1, . . . , n − 1}, QC3,F = {n − t, n − t + 1, . . . , n − 1} and the transition function gC3 is defined by setting: • (σi)gC3 = i − 1 (1 ≤ i ≤ k),

• ag_C3(i) = i + 1 mod n, • bgC3(i) = i (0 ≤ i ≤ k),

• bg_C3(i) = i + 1 mod n (k ≤ i < n),

• cgC3(i) = i + 1 mod n if i 6= n − t − 1, cgC3(n − t − 1) = 0.

Based on the construction in the proof of Lemma 7, we construct a DBTA D3 = (Σ, QD3, QD3,F, gD3) recognizing FΣ·

p_L(C

3) ·sFΣ, where QD3 = {P | {0, 1, . . . , k − 1} ⊆ P, P ⊆ QC3\ QC3,F}, QD3,F = {QC3,F}, and the transition function gD3 is defined as follows: • (σi)gD3 = {0, 1, . . . , k − 1}, • ag_D3(P ) = ag_C3(P ) ∪ {0, 1, . . . , k − 1} if ag_C3(P ) ∩ QC3,F = ∅ and P ∩ QC3,F = ∅, • agD3(P ) = {QC3,F} if agC3(P ) ∩ QC3,F 6= ∅, • bg_D3(P ) = bg_C3(P ) ∪ {0, 1, . . . , k − 1} if bg_C3(P ) ∩ QC3,F = ∅ and P ∩ QC3,F = ∅, • bgD3(P ) = {QC3,F} if bg_C3(P ) ∩ QC3,F 6= ∅, • cg_D3(P ) = cg_C3(P ) ∪ {0, 1, . . . , k − 1} if cg_C3(P ) ∩ QC3,F = ∅ and P ∩ QC3,F = ∅, • agD3({QC3,F}) = bg_D3({QC3,F}) = cg_D3({QC3,F}) = {QC3,F}.

Notice that L(D3) = FΣ·pL(C3)·sFΣ. In the following lemma, we establish that D3 is a minimal DBTA by showing that all states in QD3 are reachable and pairwise inequivalent.

Proof. We prove the reachability of all non-final states of D3 using induction on the size of P . Note that any non-final state P ∈ QD3 satisfies k ≤ |P | ≤ m − t because QC3,F ∩ P = ∅ and {σc | σ ∈ Σ0} ⊆ P by the construction. A state {0, 1, . . . , k − 1} of size k is reachable by reading a leaf node. Assume that all states P is reachable for |P | ≤ x. Then, we show that any state P′_{of size x+1 is reachable.}

(11)

Let P′ _{= {0, 1, . . . , k − 1, q}

k, qk+1, . . . , qx} be a state of size x + 1. Then, the state P′ _{is reached from a state {0, 1, . . . , k − 1, q}

k+1− qk+ k − 1, . . . , qx− qk+ k − 1} after reading a sequence of unary symbols abqk−k_{. From the induction, it is easy to} verify that all states except QC3,F are reachable. Furthermore, the only final state QC3,F is reachable from a non-final state {0, 1, . . . , n − t − 1} by reading a unary symbol a.

Next we prove that all states of D3 are pairwise inequivalent. Pick any two distinct states P1and P2. Assume p ∈ P1\ P2. (The other possibility is symmetric.) From P1, a final state is reached by reading a sequence of unary symbols cn−t−1−pa whereas P2 does not reach a final state. Therefore, any two states in QD3 are pairwise inequivalent.

Theorem 9. Given a DBTA A = (Σ, Q, QF, g) with n states for a regular tree language L, 2n−t−k+ 1 states are necessary and sufficient in the worst-case for the minimal DBTA of FΣ·pL ·sFΣ if|QF| = t and |{σg| σ ∈ Σ0}| = k.

5. State Complexity of DTTAs

It is well known that every NBTA can be converted into an equivalent NTTA [2,8]. On the other hand, not all regular tree languages are recognized by DTTAs. In other words, a class of regular tree languages accepted by DTTAs is a proper subclass of regular tree languages accepted by NBTAs or NTTAs. Note that DTTAs recognize exactly the class of path-closed languages that is a proper subclass of regular tree languages [2,8]. This leads us to study the state complexity of path-closed languages for tree matching — the state complexity of DTTAs.

We again consider three types of tree languages FΣ·pL, L·sFΣ, and FΣ·pL·sFΣ, where L is a tree language. However, given a path-closed language L, L ·s_F

Σand FΣ·pL ·sFΣ are not necessarily path-closed languages. Nivat and Podelski [19] argued that path-closed languages can be characterized by a property called the subtree exchange property as follows:

Proposition 10 (Nivat and Podelski [19]). A regular tree language L is path-closed if and only if, for everyt ∈ L and every node u ∈ t, if t(u ← a(t1, . . . , tm)) ∈ L and t(u ← a(s1, . . . , sm)) ∈ L, then t(u ← a(t1, . . . , si, . . . , tm)) ∈ L for each i = 1, . . . , m.

Using the subtree exchange property, we prove that given a tree language L, L ·s_F

Σ and FΣ·pL ·sFΣ are not path-closed languages.

Proposition 11. There exists a path-closed language L such that L ·s_F

ΣorFΣ·p L ·s_F

Σ is not a path-closed language.

Proof. First we show that there exists a path-closed language L such that L ·s_F Σ is not a path-closed language. Let Σ = Σ2 ∪ Σ0, where Σ2 = {b}, and Σ0 = {a, c}. A singleton language L contains a single-node tree c, namely L = {c}. It is

(12)

straightforward to verify that FΣ contains every binary tree where leaf nodes are labeled by a or c, and non-leaf nodes are labeled by b.

Then, L ·s_F

Σis a set of binary trees where every tree contains at least one leaf labeled by c. Therefore, b(a, c) ∈ L ·s_F

Σ, b(c, a) ∈ L ·sFΣ, and b(a, a) /∈ L ·sFΣ hold. However, if L ·s_F

Σis path-closed, b(a, a) should exist in L ·sFΣby the subtree exchange property. This implies that L ·s_F

Σ is not a path-closed language. Now let us prove that there exists a path-closed language L such that FΣ·pL·sFΣ is not a path-closed language. Let Σ = Σ2∪ Σ0, where Σ2= {a, b}, and Σ0= {c}. A singleton language L contains a tree a(c, c), namely L = {a(c, c)}. It is easy to verify that FΣcontains every binary tree where all leaf nodes are labeled by c and non-leaf nodes are labeled by a or b.

Then, FΣ·pL ·sFΣ is a set of binary trees where every tree contains at least one non-leaf node labeled by a. Therefore, b(a(c, c), c) ∈ FΣ·pL ·sFΣ, b(c, a(c, c)) ∈ FΣ·pL ·sFΣ, and b(c, c) /∈ FΣ·pL ·sFΣ. However, due to the subtree exchange property, b(c, c) should be in FΣ·pL ·sFΣ if the language FΣ·pL ·sFΣ is path-closed. This means that FΣ·pL ·sFΣ is not a path-closed language.

We define the deterministic top-down state complexity of a path-closed lan-guage L to be the number of states that are necessary and sufficient in the worst-case for the minimal DTTA recognizing L.

Theorem 12. Given a DTTA A = (Σ, Q, Q0, g) with n states for a path-closed language L, n states are necessary and sufficient in the worst-case for the minimal DTTA of FΣ·pL.

Proof. We construct a new DTTA B = (Σ, Q′_{, Q}′

0, g′) for FΣ·pL, where Q′ = Q, Q′

0= Q0, and the transition function g′ is defined as follows: For τ ∈ Σm, m ≥ 0 and q ∈ Q′, we define

τg′(q) =      τg(q) if σg(q) 6= λ for any σ ∈ Σ0, [q, q, . . . , q] | {z } m times otherwise.

Now we explain how B simulates FΣ·pL with n states. Since trees in FΣ·pL have the same topmost parts with trees in L and leaves can be substituted with any tree in FΣ, B simulates from the same initial state with A. Let us assume that a state q ∈ Q′ _{may end the top-down computation with generating a leaf node} since σg(q) = λ. Once B arrives at q, the new transition function g′ continues the computation by reading a non-leaf label of rank m and generating a sequence [q, q, . . . , q] of states whose length is m. This makes a new DTTA B to generate any subtree in FΣat the point where the computation may end with generating leaves and, thus, recognize the language FΣ·pL.

It is easy to see that n states are necessary to recognize FΣ·pL. Consider a path-closed language of unary trees whose state complexity correspond to that of regular

(13)

string languages. Since the state complexity of LΣ∗ _{is n if the state complexity of} L is n, this case can be a lower bound for the path-closed language FΣ·pL. 6. Conclusions

We have considered the state complexity of three types of tree languages FΣ·pL, L ·s_F

Σ, and FΣ ·p L ·sFΣ for tree pattern matching problem. Motivated from tree pattern matching problem, we have investigate the state complexity of these languages when they are described by DBTAs and DTTAs. Table 1 summarizes the established results. Especially, we have shown that L ·s_F

Σand FΣ·pL ·sFΣare not recognizable by DTTAs even when L is a path-closed language since they are not necessarily path-closed languages. We have shown that L ·s_F

Σ and FΣ·pL ·sFΣ need not be path-closed and therefore cannot recognized by DTTAs.

Table 1. A summary table for the state complexity of DBTAs and DTTAs for the tree languages FΣ·pL, L ·sFΣ, and FΣ·pL·sFΣ.

languages state complexity of DBTAs state complexity of DTTAs

F_Σ·pL ₂n−k n

L_·sF_Σ n_{+ 1} _{not recognizable}

F_Σ_·pL_·sF_Σ ₂n−t−k_{+ 1} _{not recognizable}

A possible future direction is to investigate the descriptional complexity of un-ranked tree automata, which are a more generalized model than tree automata over ranked alphabet, for recognizing L ·s_F

Σ and FΣ·pL ·sFΣ. Acknowledgments

Ko was partially supported by EPSRC grant “Reachability problems for words, matrices and maps” (EP/M00077X/1), and Lee and Han were sup-ported by the Basic Science Research Program through NRF funded by MEST (2015R1D1A1A01060097) and the Yonsei University Future-leading Re-search Initiative of 2015.

References

[1] A. V. Aho. Handbook of theoretical computer science (Vol. A). chapter Algorithms for Finding Patterns in Strings, pages 255–300. 1990.

[2] H. Comon, M. Dauchet, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi.

Tree Automata Techniques and Applications. 2007. Electronic book available at http://www.tata.gforge.inria.fr.

[3] M. Crochemore and C. Hancart. Handbook of formal languages, Vol. 2. chapter Automata for Matching Patterns, pages 399–462. 1997.

[4] Z. ´Esik, Y. Gao, G. Liu, and S. Yu. Estimation of state complexity of combined operations. Theoretical Computer Science, 410(35):3272–3280, 2009.

[5] Y. Gao and L. Kari. State complexity of star of union and square of union on k regular languages. Theoretical Computer Science, 499(0):38–50, 2013.

(14)

[6] Y. Gao, L. Kari, and S. Yu. State complexity of union and intersection of square and reversal on k regular languages. Theoretical Computer Science, 454:164–171, 2012. [7] Y. Gao, K. Salomaa, and S. Yu. The state complexity of two combined operations:

Star of catenation and star of reversal. Fundamenta Informaticae, 83(1-2):75–89, 2008.

[8] F. G´ecseg and M. Steinby. Tree languages. In G. Rozenberg and A. Salomaa, editors,

Handbook of Formal Languages, Vol.3: Beyond Words, pages 1–68. Springer-Verlag New York, Inc., 1997.

[9] C. M. Hoffmann and M. J. O’Donnell. Pattern matching in trees. Journal of the

ACM, 29(1):68–95, 1982.

[10] M. Holzer, M. Kutrib, and K. Meckel. Nondeterministic state complexity of star-free languages. In Proceedings of the 16th International Conference of Implementation

and Application of Automata, pages 178–189. 2011.

[11] S.-K. Ko, H.-S. Eom, and Y.-S. Han. Operational state complexity of subtree-free regular tree languages. International Journal of Foundations of Computer Science, In press.

[12] M. Kutrib, G. Pighizzini, and G. Pighizzini. Recent trends in descriptional complexity of formal languages. Bulletin of the EATCS, 111, 2013.

[13] O. Lupanov. A comparison of two types of finite automata. Problemy Kibernet, 9:321– 326, 1963.

[14] W. Martens and J. Niehren. On the minimization of XML schemas and tree automata for unranked trees. Journal of Computer System Sciences, 73(4):550–583, 2007. [15] A. Maslov. Estimates of the number of states of finite automata. Soviet Mathematics

Doklady, 11:1373–1375, 1970.

[16] A. R. Meyer and M. J. Fischer. Economy of description by automata, grammars, and formal systems. In Proceedings of the 12th Annual Symposium on Switching and

Automata Theory, pages 188–191. IEEE Computer Society, 1971.

[17] F. Moore. On the bounds for state-set size in the proofs of equivalence between deterministic, nondeterministic, and two-way finite automata. IEEE Transactions

on Computers, C-20(10):1211–1214, 1971.

[18] F. Neven. Automata theory for XML researchers. ACM SIGMOD Record, 31(3):39– 46, 2002.

[19] M. Nivat and A. Podelski. Minimal ascending and descending tree automata. SIAM

Journal on Computing, 26(1):39–58, 1997.

[20] X. Piao and K. Salomaa. State trade-offs in unranked tree automata. In Proceedings

of the13th International Workshop on Descriptional Complexity of Formal Systems, pages 261–274, 2011.

[21] X. Piao and K. Salomaa. Transformations between different models of unranked bottom-up tree automata. Fundamenta Informaticae, 109(4):405–424, 2011. [22] X. Piao and K. Salomaa. State complexity of Kleene-star operations on trees. In

Proceedings of the 2012 International Conference on Theoretical Computer Science:

Computation, Physics and Beyond, pages 388–402, 2012.

[23] X. Piao and K. Salomaa. State complexity of the concatenation of regular tree lan-guages. Theoretical Computer Science, 429:273–281, 2012.

[24] A. Salomaa, K. Salomaa, and S. Yu. State complexity of combined operations.

The-oretical Computer Science, 383(2-3):140–152, 2007.

[25] K. Salomaa and S. Yu. On the state complexity of combined operations and their estimation. International Journal of Foundations of Computer Science, 18:683–698, 2007.

(15)

[26] J. Shallit. A Second Course in Formal Languages and Automata Theory. Cambridge University Press, New York, NY, USA, 1 edition, 2008.

[27] S. Yu, Q. Zhuang, and K. Salomaa. The state complexities of some basic operations on regular languages. Theoretical Computer Science, 125(2):315–328, 1994.