A Model Query Language - Pattern Search - Querying Large Collections of Semistructured Data

3.4 Pattern Search

3.4.1 A Model Query Language

In this section we propose a query language for pattern search and an algorithm for matching and looking up a query.

A query is expressed as a pattern consisting of a mathematical expression augmented with wild cards, optional parts, and constraints in the form of where clauses. A query matches an expression (Algorithm 4-Line 6) as follows. A wild card represents a slot that matches any subtree of the appropriate type: [Vi] matches any variable, [Ni] matches any number, [Oi] matches any operator, and [Ei] matches any expression. A wild card’s index i is an optional natural number; if two or more wild cards share the same type and index, they must match identical subtrees. Wild cards with no index are unconstrained.

Example 3. The query $x^{[N1]} - y^{[N1]}$ matches $x^2 - y^2$ and $x^5 - y^5$ but not $x^2 - y^3$, whereas either of the queries $x^{[N1]} - y^{[N2]}$ or $x^{[N]} - y^{[N]}$ matches all three.

Optional parts are enclosed in braces; a matching expression may contain them or omit them. Other regular expression operators, such as disjunction or repetition, may be defined in the same way.

Example 4. $x^2\{+[N]\}$ matches $x^2$ and $x^2 + 1$ but not $x^2 + y$ or $x^2 - 1$.
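The behaviour of an optional part can be illustrated with a small backtracking match over a sequence of sibling items. This is only a sketch under simplifying assumptions: it works on flat token lists rather than MathML trees, an optional group is written as a nested Python list, and the names `item_matches` and `match_sequence` are hypothetical rather than taken from the text.

```python
from typing import Sequence, Union

# A pattern item is a literal token, the number wild card "[N]",
# or an optional group written as a nested list (an assumption of this sketch).
PatternItem = Union[str, list]


def item_matches(item: str, token: str) -> bool:
    """Match one non-optional pattern item against one token."""
    if item == "[N]":
        # Accept (possibly negative) integer literals only.
        return token.lstrip("-").isdigit()
    return item == token


def match_sequence(pattern: Sequence[PatternItem], tokens: Sequence[str]) -> bool:
    """Return True if the whole token sequence matches the whole pattern."""
    if not pattern:
        return not tokens
    head, rest = pattern[0], list(pattern[1:])
    if isinstance(head, list):
        # Optional group: first try with the group present, then with it omitted.
        return match_sequence(list(head) + rest, tokens) or match_sequence(rest, tokens)
    return (
        bool(tokens)
        and item_matches(head, tokens[0])
        and match_sequence(rest, list(tokens[1:]))
    )


# The pattern of Example 4, with x^2 treated as a single token.
example_4 = ["x^2", ["+", "[N]"]]
```

With this encoding, `match_sequence(example_4, ["x^2", "+", "1"])` succeeds because the optional group is taken, while `["x^2", "+", "y"]` fails on both branches.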

Constraints can be specified for wild cards in a query using a “where” clause, as follows:

• Number wild cards can be constrained to a specific range or to a domain, which can be specified using a context-free grammar.

• Variable wild cards can be constrained to a restricted set of possible names.

• Operator wild cards can be constrained to a restricted set of operators.

• Expression wild cards can be constrained to contain a given subexpression, which can in turn include further wild cards and constraints.

Example 5.

• Query “$[E]^2\ [O1]\ 3$ where $O1 \in \{+, -\}$” matches $x^2 + 3$ and $(x + 1)^2 - 3$ but not $x^2 \times 3$.

• Query “$x^{[N1]}$ where $1 \le N1 \le 5$” matches $x^2$ but not $x^9$ or $x^{-1}$.

• Query “$[E1] - 2$ where E1 contains $x^2$” matches $x^2 - 2$ and $\log(x^2 + 3y) - 2$ but not $x - 2$ or $y^2 - 2$.

• Query “[E1] where E1 contains log₂([V ])” matches all expressions that include a base 2 logarithm of a variable.

• Query “$\sqrt{[E1]}$ where E1 similar to $\sin(x)$” matches both $\sqrt{\sin(x)}$ and $\sqrt{\sin(x + 1)}$ (ranking the first expression higher) but not $\sqrt{\sin(x)} + 1$.
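The range and set constraints in these examples can be viewed as predicates applied to the value bound to a wild card. The following sketch assumes that bindings are available as a mapping from wild card names to the matched text; the class names `NumberRange` and `OneOf` and the function `satisfies` are hypothetical. Grammar-defined domains, containment constraints, and similarity constraints are not covered.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass(frozen=True)
class NumberRange:
    """Constraint of the form low <= N <= high on a number wild card."""
    low: float
    high: float

    def accepts(self, value: str) -> bool:
        try:
            return self.low <= float(value) <= self.high
        except ValueError:
            return False  # the bound text is not a number


@dataclass(frozen=True)
class OneOf:
    """Constraint restricting a variable or operator wild card to a fixed set."""
    allowed: FrozenSet[str]

    def accepts(self, value: str) -> bool:
        return value in self.allowed


Constraint = Union[NumberRange, OneOf]


def satisfies(bindings: Dict[str, str], where: Dict[str, Constraint]) -> bool:
    """True if every constrained wild card is bound to an accepted value."""
    return all(
        name in bindings and rule.accepts(bindings[name])
        for name, rule in where.items()
    )


# "where 1 <= N1 <= 5" and "where O1 in {+, -}" from Example 5.
range_clause = {"N1": NumberRange(1, 5)}
operator_clause = {"O1": OneOf(frozenset({"+", "-"}))}
```

For instance, `satisfies({"N1": "2"}, range_clause)` holds, whereas binding N1 to 9 or to -1 violates the range.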

In our experiments we assume that a pattern does not contain a similarity constraint. Otherwise, pattern search would be a generalized form of similarity search, which would make the two approaches hard to compare. Moreover, ranking documents with respect to a pattern query that contains multiple similarity constraints is a complex problem that should be addressed after the more basic problem of capturing the similarity of two math expressions (discussed in this paper) is addressed. We leave this problem to future work.

Pattern Matching Algorithm

In this section we describe how a query is matched against a math expression. As mentioned, a query, like an expression, is represented as a MathML tree, with some extra tags that represent wild cards and regular expression operators such as optionals. We represent wild cards by a special node with tag “<wild>”. We also mark regular expression operators with special tags or flags. For instance, we mark an optional subtree with a special flag stored in its root. An example is shown in Figure 3.4.

Figure 3.4: A modified query tree representing {2}[E1]4.
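To make the tree representation concrete, the following sketch models a query node with a “<wild>” tag and an optional flag at the root of an optional subtree. The field names (`optional`, `wild_type`, `wild_index`), the attribute spelling in the rendered output, and the choice to wrap an optional part in its own `<mrow>` are assumptions made for illustration; the text does not specify the encoding at this level of detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    tag: str                              # MathML tag such as "msup", "mi", "mn", "mo", or "wild"
    text: str = ""                        # leaf content, e.g. "x" or "2"
    children: List["Node"] = field(default_factory=list)
    optional: bool = False                # flag stored at the root of an optional subtree
    wild_type: Optional[str] = None       # "V", "N", "O", or "E" for a wild card node
    wild_index: Optional[int] = None      # None for an unindexed wild card


def wild(wild_type: str, index: Optional[int] = None) -> Node:
    """Build a wild card node (hypothetical helper)."""
    return Node("wild", wild_type=wild_type, wild_index=index)


def render(node: Node) -> str:
    """Serialise a query tree to an XML-like string for inspection."""
    attrs = ""
    if node.wild_type is not None:
        attrs += f' type="{node.wild_type}"'
    if node.wild_index is not None:
        attrs += f' index="{node.wild_index}"'
    if node.optional:
        attrs += ' optional="true"'
    if not node.children and not node.text:
        return f"<{node.tag}{attrs}/>"
    inner = node.text + "".join(render(child) for child in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


# The query of Example 4: x^2 followed by the optional part "+ [N]".
query = Node("mrow", children=[
    Node("msup", children=[Node("mi", "x"), Node("mn", "2")]),
    Node("mrow", optional=True, children=[Node("mo", "+"), wild("N")]),
])
```

Calling `render(query)` shows the flag sitting on the root of the optional subtree and the wild card appearing as an ordinary leaf with a special tag.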

In the rest of this section we assume a pattern does not contain a similarity constraint (e.g. $[E]^2$ where [E] is similar to $\sin(x)$). Hence, whereas the algorithm described in Section 3.3.1 assigns each document a score that represents how well it matches the query, in this approach a document either matches a query or it does not. We extend this approach to handle similarity constraints in Section 5.3.3.

To match a query against an expression, we first compare their root tags. If the root of the query is not a wild card and its tag matches that of the expression’s root, we parse the children of the expression’s root against the sequence of children of the query root, taking the regular expression operators into account. We then recursively match the subtrees that correspond to pairs of nodes matched by the parser.

If the root of the query, Q, is a wild card, we evaluate the match against the expression, E, as follows. First, if the wild card has an index, e.g. [E2] or [V3], we determine whether it has already been bound to a subtree by a match made in another part of the query. If an expression, E0, is already bound to the wild card represented by Q, then Q matches E only if E and E0 are equal, i.e. have the same signatures. If no expression has previously been bound to Q, we compare the types of values at the roots. For example, if Q is a number wild card and E is not a number (its root’s label is not “<mn>”), then the result is false. Similarly, variable, operator, and expression wild cards must match variables, operators, and expressions, respectively. If the types agree and there are no constraints on the wild card, we return true.
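The wild card step can be sketched as follows. The sketch makes several assumptions: trees are nested tuples whose first element is the tag, plain tuple equality stands in for the signature comparison, an indexed wild card is bound the first time it matches, and an expression wild card accepts any subtree. The “<mn>” and “<mo>” tags come from the text, “<mi>” is the standard MathML identifier tag, and the name `match_wild` is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Root tag required by each typed wild card; "E" is absent because an
# expression wild card is assumed to accept a subtree with any root.
REQUIRED_TAG = {"N": "mn", "V": "mi", "O": "mo"}

Bindings = Dict[Tuple[str, int], tuple]


def match_wild(wild_type: str, index: Optional[int], expr: tuple,
               bindings: Bindings) -> bool:
    """Match an unconstrained wild card against the subtree `expr`."""
    if index is not None and (wild_type, index) in bindings:
        # Already bound elsewhere in the query: the subtrees must be identical.
        return bindings[(wild_type, index)] == expr
    required = REQUIRED_TAG.get(wild_type)
    if required is not None and expr[0] != required:
        return False  # type mismatch at the root
    if index is not None:
        bindings[(wild_type, index)] = expr  # bind on first successful match
    return True


def exponents_match(indices: Tuple[Optional[int], Optional[int]],
                    exponents: Tuple[str, str]) -> bool:
    """Match two number wild cards against two exponents, as in Example 3."""
    bindings: Bindings = {}
    return all(
        match_wild("N", index, ("mn", value), bindings)
        for index, value in zip(indices, exponents)
    )
```

Under these assumptions, `exponents_match((1, 1), ("2", "3"))` fails because [N1] is already bound to 2 when the second exponent is reached, while `exponents_match((1, 2), ("2", "3"))` and `exponents_match((None, None), ("2", "3"))` both succeed.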

Assume Q is an operator wild card constrained to belong to a specific set of operators, S. It matches E only if E.root is “<mo>” and the label of its child is in S. Similarly, if Q is a number or variable wild card, we check E against the constraint. If Q is an expression wild card with the constraint that E should contain Q0, we match the subtrees of E against Q0, returning true as soon as a match is found and false if no subtree matches. Matching an expression containment constraint is detailed in Algorithm 5.

Algorithm 5 submatch(Q0, E)

Input: query Q0 and expression E.
Output: true if Q0 matches a subtree of E, and false otherwise.

if match(Q0, E) then
    return true
end if
for i := 1 to C_E do
    if submatch(Q0, E[i]) then
        return true
    end if
end for
return false
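Algorithm 5 translates almost directly into Python. In the sketch below, trees are nested tuples whose first element is the tag, and `match` is a deliberately simplified stand-in for the full matching procedure: it supports only exact structural matches and an unindexed expression wild card written `("wild", "E")`. Both the tuple encoding and the helper `children` are assumptions of this sketch.

```python
from typing import List


def children(expr: tuple) -> List[tuple]:
    """Child subtrees of a node; leaves such as ("mn", "2") have none."""
    return [part for part in expr[1:] if isinstance(part, tuple)]


def match(query: tuple, expr: tuple) -> bool:
    """Simplified stand-in for the full match procedure."""
    if query == ("wild", "E"):
        return True  # an expression wild card accepts any subtree
    if query[0] != expr[0]:
        return False
    q_kids, e_kids = children(query), children(expr)
    if not q_kids and not e_kids:
        return query == expr  # two leaves: compare their text
    return len(q_kids) == len(e_kids) and all(
        match(q, e) for q, e in zip(q_kids, e_kids)
    )


def submatch(query: tuple, expr: tuple) -> bool:
    """True if the query matches expr itself or any subtree of expr."""
    if match(query, expr):
        return True
    return any(submatch(query, child) for child in children(expr))


# x^2 and log(x^2 + 3y), using a hypothetical <mrow> grouping.
x_squared = ("msup", ("mi", "x"), ("mn", "2"))
log_expr = ("mrow", ("mi", "log"),
            ("mrow", x_squared, ("mo", "+"), ("mrow", ("mn", "3"), ("mi", "y"))))
```

Here `submatch(x_squared, log_expr)` succeeds because the recursion reaches the nested occurrence of x^2, while `submatch(x_squared, ("msup", ("mi", "y"), ("mn", "2")))` fails, which agrees with the containment example for “[E1] − 2”.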
