In Section 3.4 we described a pattern matching strategy for math retrieval. In this section we propose an efficient algorithm to process queries represented in the form of a pattern. Our algorithm produces the same result as the pattern search algorithm described in Section 3.4.1, with a much lower query processing time.
4.4.1 Transforming Expressions
The aim of transforming an expression is to modify it so it is independent of details such as number values or variable and operator names. This will allow us to efficiently look up candidate expressions with respect to patterns that contain wildcards.
For an expression E, we create an enhanced expression, H(E), as follows. Leaves represent literal values such as numbers, variables, and operators (Figure 4.3-A). Hence, we remove all leaves of the corresponding tree of E. We also remove attributes that are not mathematically significant such as font sizes or white space. After these two steps, we obtain a tree that is independent of the specified details (Figure 4.3-B).
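The transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression is given as Presentation MathML, the leaf values are the text content of the token elements, and the set `INSIGNIFICANT_ATTRS` is a hypothetical stand-in for whichever attributes are judged not mathematically significant. The function name `transform` mirrors the name used in Algorithm 11, but the body is our own sketch.

```python
import xml.etree.ElementTree as ET

# Assumed list of attributes that carry no mathematical meaning (hypothetical).
INSIGNIFICANT_ATTRS = {"mathsize", "fontsize"}

def transform(node: ET.Element) -> ET.Element:
    """Return a copy of `node` without leaf values or insignificant attributes."""
    kept = {k: v for k, v in node.attrib.items() if k not in INSIGNIFICANT_ATTRS}
    # Text and tail are not copied: this drops the literal values
    # (numbers, variable names, operators) and any white space.
    stripped = ET.Element(node.tag, kept)
    for child in node:
        stripped.append(transform(child))
    return stripped

# (x + 1)^2 as in Figure 4.3-A, with one presentational attribute added.
original = ET.fromstring(
    "<math><msup>"
    "<mfenced open='['><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow></mfenced>"
    "<mn mathsize='small'>2</mn>"
    "</msup></math>"
)
enhanced = transform(original)
print(ET.tostring(enhanced, encoding="unicode"))
```

The result keeps only the tag structure, so two expressions that differ in variable names or number values map to the same enhanced tree.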
4.4.2 Building the Index
To build the index, we first create a pseudo-document for each expression in the collection as follows.
Consider an expression E in document d. We transform E as described in the previous section. For each node N in the transformed expression, we calculate the signature of the subtree rooted at N (Equation 2.1). We also consider the path from the root of E to N, and calculate its signature. The signature of a path is calculated in the same way as that of a tree (treating the path as a tree that consists of a single path).
Figure 4.3: A) The original expression tree for (x + 1)^2. B) The transformed expression.
The pseudo-document for E consists of a set of terms, where each term is the signature of a subtree or a path in its enhanced form, with no duplicates. The header of the pseudo-document also contains a pointer to E and a pointer to the page that contains E.
Similar to text documents, we build an inverted index on the terms. Using this index, we can efficiently retrieve expressions with respect to their contained terms using standard text-retrieval techniques. A summary of the steps is presented in Algorithm 11.
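The index construction can be illustrated with a short Python sketch. Several parts are assumptions rather than the method defined here: Equation 2.1 is not reproduced, so `sig` uses a truncated SHA-1 digest of a canonical string as a placeholder signature; the input trees are assumed to be already transformed; and the identifiers `e1`, `e2`, `page-A`, and `page-B` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def sig(key: str) -> str:
    # Placeholder for the signature of Equation 2.1 (assumption): truncated SHA-1.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def subtree_key(node: ET.Element) -> str:
    # Canonical string of the subtree rooted at `node`, e.g. "mrow(mi(),mo(),mn())".
    return f"{node.tag}({','.join(subtree_key(c) for c in node)})"

def path_key(tags) -> str:
    # A path is treated as a tree that is a single chain, e.g. "math(msup())".
    key = ""
    for tag in reversed(tags):
        key = f"{tag}({key})"
    return key

def terms(root: ET.Element) -> set:
    """Terms of one pseudo-document: subtree and root-path signatures, no duplicates."""
    out = set()
    def visit(node, prefix):
        path = prefix + [node.tag]
        out.add(sig(subtree_key(node)))
        out.add(sig(path_key(path)))
        for child in node:
            visit(child, path)
    visit(root, [])
    return out

def build_index(collection):
    """collection: iterable of (expression id, page id, transformed tree)."""
    index = defaultdict(set)   # term -> ids of expressions containing it
    headers = {}               # expression id -> page id (pseudo-document header)
    for expr_id, page_id, root in collection:
        headers[expr_id] = page_id
        for term in terms(root):
            index[term].add(expr_id)
    return index, headers

e1 = ET.fromstring("<math><msup><mfenced><mrow><mi/><mo/><mn/></mrow></mfenced><mn/></msup></math>")
e2 = ET.fromstring("<math><mi/></math>")
index, headers = build_index([("e1", "page-A", e1), ("e2", "page-B", e2)])
fenced = e1.find("msup/mfenced")
print(sorted(index[sig(subtree_key(fenced))]))  # expressions containing the fenced subtree
print(sorted(index[sig("mi()")]))               # expressions containing an <mi> leaf
```

Using a Python set per pseudo-document makes the explicit duplicate checks of Algorithm 11 unnecessary; a production index would more likely rely on an existing text-retrieval engine than on an in-memory dictionary.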
4.4.3 Processing a Pattern Query
Recall that a pattern query is a math expression with some further information represented in the form of wildcards and regular expression operators such as repetition or optional operators. Our optimization uses the index to efficiently filter out expressions that cannot match the query because they do not contain specific parts. This results in a set of potential matches (candidates) that are processed further to check whether they actually match the query. In the remainder of this section we elaborate on this idea.
Given a pattern query, we first transform it to obtain its enhanced form by removing the leaves of its tree, as explained in Section 4.4.1. An example is shown in Figure 4.4. This results in a tree that may contain wildcards or regular expression operators such as disjunctive, optional, and repetition operators.
Algorithm 11 Building the Index For Optimum Pattern Query Processing
Input: collection C of expressions.
Output: an index to facilitate processing pattern queries.
Let ind be an empty inverted index
for each expression E ∈ C do
    Let pd be an empty pseudo-document
    pd.page ← pointer to the page that contains E
    pd.expression ← pointer to E
    E′ ← transform(E)
    for each node N of E′ do
        sSig ← the signature of the subtree rooted at N
        if pd does not contain sSig then
            Add sSig to pd
        end if
        pSig ← the signature of the path from E′.root to N
        if pd does not contain pSig then
            Add pSig to pd
        end if
    end for
    Add pd to ind
end for
Figure 4.4: A) The original tree for the pattern (x + 1)^[N1]. B) The transformed pattern.
A node is a pseudo node if it represents a wildcard or is associated with a regular expression operator such as a disjunction or repetition operator; otherwise it is a constant node. A subtree is a maximal-constant subtree if:
1. It consists of constant nodes only.
2. None of the immediate subtrees of the root’s ancestors is constant.
3. None of its ancestors is a pseudo node.
An example of a maximal-constant subtree is circled with a solid line in Figure 4.4-B. A path in the tree is a maximal-constant path if:
1. It starts from the root.
2. It consists of constant nodes only.
3. No longer constant path exists that contains all of its nodes (i.e., it cannot be extended).
From the last property we can conclude that a maximal-constant path ends with a leaf node or the parent of a pseudo node. An example of a maximal-constant path is circled with a dashed line in Figure 4.4-B.
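The two definitions can be made concrete with a small Python sketch. It rests on assumptions that are ours: pseudo nodes are encoded as elements whose tag is in the hypothetical set `PSEUDO_TAGS`, and the maximality condition for subtrees is read as "no constant subtree rooted at an ancestor exists". The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

PSEUDO_TAGS = {"wild"}  # assumed encoding of wildcard / regular-expression nodes

def is_pseudo(node: ET.Element) -> bool:
    return node.tag in PSEUDO_TAGS

def is_constant_tree(node: ET.Element) -> bool:
    # True if the subtree rooted at `node` contains no pseudo node.
    return all(not is_pseudo(n) for n in node.iter())

def maximal_constant_subtrees(root: ET.Element) -> list:
    found = []
    def visit(node):
        if is_pseudo(node):
            return                # anything below would have a pseudo ancestor
        if is_constant_tree(node):
            found.append(node)    # top-most constant subtree on this branch
            return
        for child in node:
            visit(child)
    visit(root)
    return found

def maximal_constant_paths(root: ET.Element) -> list:
    paths = []
    def extend(node, prefix):
        if is_pseudo(node):
            return
        path = prefix + [node.tag]
        constant_children = [c for c in node if not is_pseudo(c)]
        if not constant_children:
            paths.append(path)    # leaf, or a node whose children are all pseudo
        for child in constant_children:
            extend(child, path)
    extend(root, [])
    return paths

# Transformed pattern of Figure 4.4-B.
pattern = ET.fromstring(
    "<math><msup><mfenced><mrow><mi/><mo/><mn/></mrow></mfenced><wild/></msup></math>"
)
print([t.tag for t in maximal_constant_subtrees(pattern)])
print(["/".join(p) for p in maximal_constant_paths(pattern)])
```

On this pattern the sketch reports the subtree rooted at `<mfenced>` and the three root paths ending at `<mi>`, `<mo>`, and `<mn>`, which agrees with the regions circled in Figure 4.4-B.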
We next form a token query that consists of a collection of tokens. Each token is the signature of a maximal-constant path or a maximal-constant subtree. For each such path or subtree, we calculate its signature and add it to the token query if it has not already been added (to avoid duplicates).
After the token query is formed, we use the index to retrieve expressions that contain such tokens using a standard keyword search algorithm. Each retrieved expression is a candidate that should be processed further to check if it matches the pattern query.
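The filter-then-verify step can be sketched as follows. The sketch assumes that the keyword search is conjunctive (a candidate must contain every token, as in Example 9) and that an empty token query leaves every expression as a candidate. The callback `matches` is a hypothetical stand-in for Algorithm 6, and the token strings and identifiers are toy values rather than real signatures.

```python
from typing import Callable, Dict, Iterable, List, Set

Index = Dict[str, Set[str]]  # token -> ids of expressions whose pseudo-document contains it

def candidate_ids(index: Index, token_query: Set[str], all_ids: Iterable[str]) -> Set[str]:
    """Expressions containing every token (assumed conjunctive keyword search)."""
    if not token_query:
        return set(all_ids)       # no constant part: nothing can be filtered out
    postings = [index.get(tok, set()) for tok in token_query]
    return set.intersection(*postings)

def optimized_pattern_search(
    index: Index,
    page_of: Dict[str, str],
    token_query: Set[str],
    matches: Callable[[str], bool],
) -> List[str]:
    pages: List[str] = []
    for expr_id in sorted(candidate_ids(index, token_query, page_of)):
        # Only candidates reach the expensive exact matcher.
        if matches(expr_id) and page_of[expr_id] not in pages:
            pages.append(page_of[expr_id])
    return pages

# Toy index with readable tokens in place of signatures.
index = {
    "subtree:mfenced": {"e1", "e2"},
    "path:math/msup/mfenced/mrow/mi": {"e1", "e2", "e3"},
    "path:math/msup/mfenced/mrow/mo": {"e1", "e2"},
}
page_of = {"e1": "page-A", "e2": "page-B", "e3": "page-C"}
token_query = {"subtree:mfenced", "path:math/msup/mfenced/mrow/mi"}

print(sorted(candidate_ids(index, token_query, page_of)))
print(optimized_pattern_search(index, page_of, token_query, lambda e: e == "e1"))
```

In this toy run `e3` is discarded by the index lookup alone, so the exact matcher is invoked only for `e1` and `e2`.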
Algorithm 12 optimizedPatternSearch(Q)
Input: Query, Q.
Output: a list of documents containing expressions that match Q.
Modify the query
T ← the tree representing the modified query
tokens ← an empty set of tokens
for each maximal-constant subtree M of T do
    sig ← the signature of M
    Add sig to tokens
end for
for each maximal-constant path P in T do
    sig ← the signature of P
    Add sig to tokens
end for
candidateExprs ← textSearch(tokens)
res ← an empty list of documents
for entry ent in candidateExprs do
    E ← the expression stored in ent
    if match(E, Q) [Algorithm 6] then
        Add the document associated to ent to res
    end if
end for
return res
                           Wikipedia   DLMF      Combined
Number of pages            44,368      1,550     45,918
Number of expressions      611,210     252,148   863,358
Average expression size    28.3        17.6      25.2
Maximum expression size    578         223       578
Table 4.1: Dataset statistics
Example 9. Assume the query is (x + 1)^[N1] (Figure 4.4-A). The modified query is shown in Figure 4.4-B. The modified tree contains only one maximal-constant subtree (the subtree rooted at the node with tag “<mfenced>”). There are three maximal-constant paths: from the root (with tag “<math>”) to the nodes with tags “<mi>”, “<mo>”, and “<mn>”. Hence, the token query contains four tokens: the signature of the subtree and the signatures of the three paths. The pseudo-document for expression E = (x + 1)^2 (Figure 4.3) contains all the tokens, and hence E is returned as a candidate expression to be matched against the query.
Once the list of candidate expressions is obtained, we use Algorithm 6 to match each candidate against the query. We return the documents that contain matching expressions as the search results.