Our framework has to maintain three auxiliary data structures: the PCS, the PCR-PCI table, and the token-frequency table. All of them are kept memory-resident during similarity join evaluation and are incrementally updated as the database state changes. We identify that a node is covered by the clustering process by checking whether the path from the root to the node (including the node itself) contains at least one element whose label equals the target label. At the time of writing, the update functionality of our framework is not yet fully implemented in XTC. Nevertheless, we provide here a detailed discussion of the engineering issues around the implementation of this feature. In the following, we first present the general approach for building and maintaining these auxiliary data structures; afterwards, we discuss how the maintenance cost can be reduced.
6.5.1 PCS and PCR-PCI Table
The PCS and the PCR-PCI table are built at the end of the path clustering process. Both structures have a reasonably small memory footprint: the size of the PCS was already discussed in Chapter 4; the PCR-PCI table requires 4 bytes per entry (two bytes each for PCR and PCI) where the number of entries corresponds to the number of distinct paths in the dataset. Modifications on the PS, i.e., insertion or deletion of path classes, have to be propagated to these structures. For the PCR-PCI table, this task is trivial: we only need to remove or insert a table entry. However, the PCS maintenance demands more explanation.
Let us first explain the handling of deletions using an example. Consider the deletion of the path class /hospital/exam/study/patient/mother, whose PCR is 13 and whose associated PCI is 4 (see Figure 6.3). We remove the entry ⟨13, 4⟩ from the PCR-PCI table but keep the PCI value. We then access the catalog to obtain the target label used in the clustering process (i.e., exam), remove from the path all labels from the root up to the target label (i.e., hospital and exam), and derive the path profile {study ◦ 1, patient ◦ 1, mother ◦ 1}, which is used to access the corresponding inverted lists in the PCS. For convenience, the inverted lists from Figure 4.3(b) on Page 96 are repeated in Figure 6.7 together with the path cluster whose PCI is 4. For the token study ◦ 1, we scan the associated inverted list until we find the entry ⟨PCI : 4, levels : {1}⟩. Because study ◦ 1 appears (starting from the target label) at the first level in the deleted path class, we remove this entry from the inverted list. Proceeding similarly, we scan the inverted list associated with the token patient ◦ 1 and find the entry ⟨PCI : 4, levels : {1, 2}⟩; then we remove the second value of the field levels, i.e., 2. The last inverted list is associated with the token mother ◦ 1 and contains the record ⟨PCI : 4, levels : {3}⟩. This time, however, we cannot remove the record, because the cluster prototype contains another path class in which the label mother appears at level 3. To know whether we can update or remove a record from an inverted list, we need to know how many path classes "contain" the corresponding token at a given level. To this end, we store this count together with each level value in the inverted lists. The new entry layout is ⟨PCI, levels : |PCR|⟩, which requires two bytes for each level entry. In the inverted list associated with the mother ◦ 1 token, the entry is ⟨PCI : 4, levels : {3 : 2}⟩. Hence, instead of deleting the entry, we update the |PCR| value from 2 to 1.
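The deletion procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PCS is modeled as a nested dictionary (token → PCI → level → |PCR| count), the function names `add_path_class` and `delete_path_class` are hypothetical, tokens are plain labels without the occurrence suffix, and the second path class in the usage note is invented to reproduce the counts of the example. It does not reflect the XTC implementation.

```python
def add_path_class(pcs, pci, tokens):
    """Register one path class; tokens[i] is the token at level i + 1,
    counted from the first label below the target label."""
    for level, token in enumerate(tokens, start=1):
        levels = pcs.setdefault(token, {}).setdefault(pci, {})
        # The stored value plays the role of |PCR| for this level.
        levels[level] = levels.get(level, 0) + 1


def delete_path_class(pcs, pci, tokens):
    """Reverse add_path_class: decrement |PCR| and drop whatever becomes empty."""
    for level, token in enumerate(tokens, start=1):
        levels = pcs[token][pci]
        levels[level] -= 1
        if levels[level] == 0:
            del levels[level]      # no path class left at this level
        if not levels:
            del pcs[token][pci]    # the entry for this PCI is empty
        if not pcs[token]:
            del pcs[token]         # the inverted list is empty
```

For instance, after adding `["study", "patient", "mother"]` and a second, hypothetical path class `["patient", "relative", "mother"]` to cluster 4, deleting the first one removes the `study` list, leaves only level 1 in the `patient` entry, and lowers the count of `mother` at level 3 from 2 to 1.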
As already described in Chapter 4, insertion of a new path class triggers a match of this path class against the PCS. If no cluster prototype is returned whose similarity to the new path class reaches the specified threshold, a new cluster is created and the path class is assigned to it. Otherwise, we assign the path class to the most similar cluster prototype. In both cases, we have to add a new entry to the PCR-PCI table and perform the appropriate operations to update the PCS, i.e., update the |PCR| information, insert level information or entries, or create new inverted lists.
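The control flow of an insertion may be pictured as follows. The sketch is deliberately simplified and rests on several assumptions: the similarity function of Chapter 4 is replaced by a plain Jaccard coefficient as a placeholder, a cluster prototype is modeled as the set union of the profiles assigned to it, and all names (`jaccard`, `assign_cluster`, `prototypes`, `pcr_pci`) are hypothetical.

```python
def jaccard(a, b):
    """Placeholder similarity; the actual function is the one of Chapter 4."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def assign_cluster(prototypes, pcr_pci, pcr, profile, threshold, sim=jaccard):
    """Assign a new path class (identified by pcr) to a cluster, return its PCI."""
    best_pci, best_sim = None, -1.0
    for pci, prototype in prototypes.items():
        s = sim(profile, prototype)
        if s > best_sim:
            best_pci, best_sim = pci, s
    if best_pci is None or best_sim < threshold:
        # No prototype reaches the threshold: open a new cluster.
        best_pci = max(prototypes, default=0) + 1
        prototypes[best_pci] = set()
    prototypes[best_pci] |= set(profile)  # assumed prototype update
    pcr_pci[pcr] = best_pci               # new PCR-PCI table entry
    return best_pci
```

In both branches the PCR-PCI table receives a new entry; the corresponding PCS update (counts, levels, entries, or new inverted lists) would follow the returned PCI.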
6.5.2 Token-Frequency Table
We build the token-frequency table by performing a single sweep over the database, typically right after the clustering process. During the scan, we also collect path cluster statistics, in particular, statistics about the string length of text nodes appearing in the cluster, which are stored in the catalog.
As distinct PC_t sets yield different token sets, we need a mechanism to provide token-frequency information for arbitrary EDS queries. In principle, we adopt the simple solution of generating all possible tokens, i.e., we generate tokens for PC = PC_t (i.e., PC_s = ∅) and PC = PC_s (i.e., PC_t = ∅). For PCI-based profiles, we use a slightly adapted version of Algorithm 6.2, where lines 4–5 and line 7 are executed for all input nodes. For epq-gram profiles, we execute both Algorithm 6.1 and the original pq-gram algorithm and take the duplicate-free union of the resulting profiles. Note that we have to build a token-frequency table for each tokenization method as well as for different parameterizations thereof (e.g., q-gram size).
Despite the large number of tokens, the frequency table is still small enough to typically fit into main memory. For example, using Nasa and SwissProt, we generated 37 K and 19 K distinct tokens, respectively (TLC similarity function, q-gram of size 2). The frequency table requires 8 bytes per entry (four bytes each for the hashed token and its frequency); thus, only 36 KB are sufficient to keep the frequencies of all Nasa tokens memory-resident, while SwissProt needs 19 KB. In the rare cases where the available memory is insufficient, we can restrict the size of the token-frequency table to at most E entries. In this approach, multiple tokens collapse into a single entry and their frequencies are summed up. Such collisions negatively affect accuracy, because incorrect token weights are computed, and efficiency, because rare tokens may be shifted away from the prefix positions due to their incorrectly increased frequency value. We have not measured the impact of size-constrained tables on our algorithms. In our experiments, we assume that the token-frequency table fits entirely into main memory.
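A size-constrained table can be illustrated with the following sketch. How tokens are mapped to the E available entries is not fixed in the text; folding the hashed token with a modulo operation is our assumption here, and the class name `BoundedFrequencyTable` is hypothetical.

```python
class BoundedFrequencyTable:
    """Token-frequency table restricted to at most max_entries counters."""

    def __init__(self, max_entries):
        self.counts = [0] * max_entries

    def _slot(self, hashed_token):
        # Assumed folding scheme: distinct tokens may share a slot.
        return hashed_token % len(self.counts)

    def add(self, hashed_token, n=1):
        self.counts[self._slot(hashed_token)] += n

    def frequency(self, hashed_token):
        # Returns the summed frequency of all tokens collapsed into the slot.
        return self.counts[self._slot(hashed_token)]
```

With E = 4, the hashed tokens 1 and 5 fall into the same entry, so each of them is reported with the summed frequency; this is precisely the overestimation that distorts token weights and prefix ordering.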
Note that we already admit some inaccuracy in the similarity results by hashing token values. In case of collisions, distinct tokens receive the same hash value. Such collisions not only incorrectly increase token-frequency values but also set-overlap results, because different tokens are matched. Fortunately, collisions can be neglected in practice. Using the Karp-Rabin fingerprint function [KR87], we measured less than 0.01% of collisions in all datasets. More importantly, we have not observed any significant degradation of the accuracy results. Besides reducing the size of the token-frequency table, token representation as integer values significantly improves the overall efficiency of similarity join processing.
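For illustration, a fingerprint in the style of Karp and Rabin can be computed as a polynomial over the bytes of the token, reduced modulo a prime. The sketch below is not the function used in our experiments: the base, the fixed modulus (the largest prime below 2^32, chosen so that the result fits into four bytes), and the name `fingerprint` are assumptions made for the example.

```python
BASE = 256
MODULUS = 4294967291  # largest prime below 2**32 (assumed parameter)


def fingerprint(token: str) -> int:
    """Map a token to an integer in [0, MODULUS) via polynomial hashing."""
    h = 0
    for byte in token.encode("utf-8"):
        h = (h * BASE + byte) % MODULUS
    return h
```

Two distinct tokens collide whenever their polynomials agree modulo the prime; such tokens would share a frequency entry and be counted as matches in set-overlap computations.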
For PCI-based tokens, updating the token-frequency table after data changes is easy. When structure nodes, content nodes, or both are deleted, we only need to generate the tokens for the deleted data, probe the token-frequency table, and decrease the corresponding frequencies by one; tokens whose frequency drops to zero are removed from the table. In case of insertions, we generate the tokens for the new data and increment their frequencies accordingly, or add an entry to the token-frequency table for new tokens. For epq-gram tokens, incremental updates are more complicated because the underlying sibling ordering imposes more data dependency on the token generation process. We have not yet investigated whether the techniques presented in [ABG06] for incremental updates of pq-gram profiles can be adapted for epq-grams. Hence, we currently apply the profile generation algorithm to the whole tree, i.e., to both the old and the new version, to update the token-frequency table.
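For PCI-based tokens, the update logic amounts to the following sketch. It assumes that the tokens of the inserted or deleted data have already been generated and hashed; the table is modeled with Python's `collections.Counter`, and the function names are hypothetical.

```python
from collections import Counter


def apply_insert(table, tokens):
    """Increment the frequency of each token; unseen tokens get a new entry."""
    for token in tokens:
        table[token] += 1  # Counter yields 0 for missing keys


def apply_delete(table, tokens):
    """Decrement the frequency of each token and drop entries that reach zero."""
    for token in tokens:
        table[token] -= 1
        if table[token] <= 0:
            del table[token]
```

Both operations touch only the entries of the affected tokens, which is why the incremental update is cheap compared with regenerating the profile of the whole tree.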
6.5.3 Reducing Maintenance Cost
We now discuss ways to improve maintenance efficiency and reduce the storage space of the auxiliary data structures. The first approach consists of adopting lazy update propagation [ABG06, HKS09]. As already mentioned, under this scheme, updates are buffered and propagated incrementally to similarity indexes at fixed time intervals. We can straightforwardly implement this strategy in our framework. The drawback of this approach is that inaccurate results may be reported when similarity operations are invoked between two propagation points. We also need extra space to store the log of data modifications.
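The trade-off of lazy propagation can be made concrete with a small sketch. It is an assumption-laden illustration rather than the scheme of [ABG06, HKS09]: the class `LazyFrequencyTable` is hypothetical, updates are logged as (token, delta) pairs, and the timer that would call `flush` at fixed intervals is omitted.

```python
class LazyFrequencyTable:
    """Frequency table whose updates are buffered in a log until flushed."""

    def __init__(self):
        self.table = {}
        self.log = []  # extra space needed for pending modifications

    def record(self, token, delta):
        """Buffer an update (+1 for an insertion, -1 for a deletion)."""
        self.log.append((token, delta))

    def frequency(self, token):
        """Read the table; the value is stale while the log is non-empty."""
        return self.table.get(token, 0)

    def flush(self):
        """Propagate all buffered updates, e.g., when the interval expires."""
        for token, delta in self.log:
            new_value = self.table.get(token, 0) + delta
            if new_value > 0:
                self.table[token] = new_value
            else:
                self.table.pop(token, None)
        self.log.clear()
```

A lookup issued between `record` and `flush` still sees the old frequency, which corresponds to the inaccuracy discussed above.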
The next strategy both saves storage space and improves performance. We have already argued that the choice of the textual elements that compose the content part is driven by their semantic properties, whose value depends on the underlying application, domain, and data characteristics. In our framework, we provide the user with extensive flexibility in specifying such textual elements. Nevertheless, there exist commonplace assumptions that allow us to make this task semi-automatic. First, long strings are typically deemed inappropriate for EM, because their “semantic ambit” is too large and there is too much leeway for deviations. Short strings are preferable because they are often used to name or concisely describe essential information about an entity. Second, text nodes appearing under infrequent paths correspond to fields exhibiting a high incidence of NULL values in the relational context. While there are many conceivable interpretations for the absence of explicit values, the fact that a field is commonly left unspecified suggests that this field is unsuitable for distinguishing entities from one another in the context of the entire database (of course, it might still be useful among those entities in which the field value is given). Moreover, for PCI-based profiles, tokens containing textual information inherit the frequency of the corresponding path cluster, i.e., their frequency in the data collection is less than or equal to that of their path cluster. Hence, tokens derived from rare PCIs are assigned very high IDF weights. As a result, entities represented by profiles containing such tokens are likely to always yield low similarity results when compared with any other entity whose profile does not contain these tokens.
Given the considerations above, an intuitive strategy is to remove from the PCS those path clusters that exhibit low frequency or contain paths leading to long strings. By not representing these path clusters in the PCS, we prevent them from being used to compose the textual representation of entities, i.e., such path clusters are fixed in PC_s; henceforth, these path clusters are denoted by PC_s*. Thus, the sizes of the PCS and the token-frequency table are substantially reduced. The savings are particularly dramatic for the token-frequency table, because long strings are responsible for a large portion of the distinct tokens. Moreover, path classes with low support in the database are exactly those that cause the bulk of updates in the PS and, consequently, in the PCS. Thus, the PCS maintenance workload is also greatly reduced.
The set of path clusters to be removed from the PCS is identified during the building phase of the token-frequency table. For each token, we store in a temporary table its PCI and a flag signaling whether or not it contains textual information. As before, we also collect path cluster statistics; at the end of the scan process, this information is used to select the set to be removed. As we create the token-frequency table from the temporary table, we leave out all textual tokens associated with a PCI in PC_s*. Finally, we remove from the PCS all entries related to elements of PC_s*. Note that we do not remove entries associated with PCIs in PC_s* from the PCR-PCI table, as they are still necessary to produce (PCI-based) structural tokens. In addition, we use the PCR-PCI table to identify whether or not a PCR is related to a path cluster in PC_s*. For this task, we use the most significant bit in the two-byte PCI representation to indicate whether or not a PCI is an element of PC_s*.
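The use of the most significant bit can be sketched as follows. The constant and function names are hypothetical, and the little-endian byte order chosen for the four-byte table entry is an assumption; only the two-byte width of PCR and PCI and the role of the top bit are taken from the text.

```python
import struct

EXCLUDED_FLAG = 0x8000  # most significant bit of the two-byte PCI
PCI_MASK = 0x7FFF       # remaining 15 bits carry the PCI itself


def mark_excluded(pci):
    """Flag a PCI as an element of PC_s*."""
    return pci | EXCLUDED_FLAG


def is_excluded(stored_pci):
    """Test whether a stored PCI belongs to PC_s*."""
    return bool(stored_pci & EXCLUDED_FLAG)


def pci_value(stored_pci):
    """Recover the PCI without the flag bit."""
    return stored_pci & PCI_MASK


def pack_entry(pcr, stored_pci):
    """Serialize one PCR-PCI table entry into four bytes (two per field)."""
    return struct.pack("<HH", pcr, stored_pci)
```

Note that reserving one bit leaves 15 bits for the PCI, i.e., at most 32,768 distinct path clusters under this encoding.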
The parameters used for determining the elements of PC_s* are the percentage of trees in which any element of a path cluster appears and the percentage of strings with size larger than a given value. For example, we can assign to PC_s* path clusters that appear in less than 75% of the trees in the collection or have more than 3% of text nodes with size larger than 100. In this regard, lazy update propagation is appropriate for PCS maintenance. Periodically, we can check for path clusters whose frequency has risen above or fallen below the specified threshold and update the PCS accordingly. Currently, we do not keep track of the percentage of long strings in a path cluster after having built the token-frequency table; supporting this feature is part of future work.
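The selection rule can be expressed as a simple filter over the collected path cluster statistics. The layout of the statistics (a dictionary per PCI with the keys `trees`, `text_nodes`, and `long_strings`) and the function name `select_excluded` are assumptions made for this sketch; the default thresholds are the example values given above.

```python
def select_excluded(stats, num_trees, min_tree_ratio=0.75, max_long_ratio=0.03):
    """Return the PCIs assigned to PC_s*.

    stats maps each PCI to a dict with the number of trees containing the
    cluster, its number of text nodes, and the number of those text nodes
    whose string exceeds the length limit (counted during the scan).
    """
    excluded = set()
    for pci, s in stats.items():
        tree_ratio = s["trees"] / num_trees
        long_ratio = s["long_strings"] / s["text_nodes"] if s["text_nodes"] else 0.0
        if tree_ratio < min_tree_ratio or long_ratio > max_long_ratio:
            excluded.add(pci)
    return excluded
```

Re-running this filter periodically on refreshed frequency statistics would yield the clusters that have crossed the frequency threshold in either direction.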