3.2 Structural Similarity Based on Path Clustering

3.2.3 XML Representation Based on Path Clustering

We now present our approach to generating compact surrogates for the structure of XML trees. In a pre-processing step, the WPS similarity function is employed in a clustering algorithm to group “duplicate paths” in a PS. In a sense, we can view this step as an EM subprocess where the entities to be matched are path classes. We then (conceptually) merge the path duplicates by assigning the same identifier to path classes belonging to the same group. We call these identifiers Path Cluster Identifiers (PCIs). The paths in a PS are then annotated with the corresponding PCIs (again, conceptually). Finally, the PCI-equipped PS is used to support the generation of structural token sets.

Figure 3.3 depicts the process of generating PCI-based representations of XML trees. Given a PS of an XML collection, we start by specifying a target label tgl, corresponding to the entity to be matched (e.g., exam in Figure 2.4). Let $P_{tgl}$ be the set of all path classes relative to tgl, i.e., (partial) paths in the structural summary having tgl as root label. In case of nested occurrences of tgl, we consider only the paths rooted by the topmost occurrence. We then apply a self-similarity join on $P_{tgl}$ with predicate $WPS(p_1, p_2) \geq \tau$. We use the output of the similarity join to construct a proximity matrix containing all pairwise similarity values of the path classes in $P_{tgl}$ (for pairs not satisfying the similarity join predicate, we assign a similarity value of 0). This similarity matrix is the input to a clustering method that generates a set of path clusters (partitions). In this thesis, we use the UPGMA agglomerative hierarchical clustering method with a user-specified threshold as the cutting point in the dendrogram [JD88].
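The construction of the proximity matrix from the thresholded self-join can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `proximity_matrix` and `label_jaccard` are hypothetical names, `label_jaccard` is only a stand-in for WPS (which is defined elsewhere), the example paths are invented, and setting the diagonal to 1 is an assumption.

```python
from itertools import combinations


def label_jaccard(p, q):
    """Stand-in similarity (NOT WPS): Jaccard overlap of the label sets of two paths."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b)


def proximity_matrix(paths, sim, tau):
    """Symmetric matrix of pairwise similarities; pairs below tau are stored as 0."""
    n = len(paths)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # assumption: every path class is maximally similar to itself
    for i, j in combinations(range(n), 2):
        s = sim(paths[i], paths[j])
        if s >= tau:  # the similarity join predicate
            matrix[i][j] = matrix[j][i] = s
    return matrix


# Hypothetical path classes, written as tuples of element labels.
paths = [
    ("exam", "patient", "name"),
    ("exam", "study", "patient", "name"),
    ("exam", "study", "id"),
]
matrix = proximity_matrix(paths, label_jaccard, tau=0.5)
```

Only the first two paths pass the join predicate in this toy input, so all other off-diagonal entries remain 0.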

We denote by $PC_{tgl}^{\theta}$ the set of path clusters generated from $P_{tgl}$ at cutting threshold $\theta$. All path clusters are numbered with integer values, i.e., PCIs. Finally, we annotate the path classes $p \in PS$ with the corresponding PCI. For ease of notation, let $i$ be the PCI of a path cluster $pc_i \in PC_{tgl}^{\theta}$. Figure 3.4 shows the PS of the document illustrated in Figure 2.4 equipped with PCIs, for tgl = exam, $\theta = 0.6$, and a decay rate of $\beta = -0.1$ for LWS. The values in the box on the left are the similarity values at which the clusters were formed.
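The clustering and numbering step can be sketched as below. This is a simplified average-linkage (UPGMA) procedure over a similarity matrix, written for clarity rather than efficiency. The names `upgma_clusters` and `assign_pcis` are hypothetical, the input matrix is invented, and two choices are assumptions: merging stops once the best average similarity drops below θ (which, for UPGMA, should correspond to cutting the dendrogram at θ), and PCIs are numbered in an arbitrary but deterministic order.

```python
from itertools import combinations


def upgma_clusters(sim, theta):
    """Agglomerative clustering with average linkage on a similarity matrix.

    Repeatedly merges the two clusters with the highest average pairwise
    similarity and stops when that value falls below theta.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # UPGMA linkage: mean similarity over all cross-cluster pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < theta:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters


def assign_pcis(clusters):
    """Number the clusters 1..m and map each path-class index to its PCI."""
    ordered = sorted(sorted(c) for c in clusters)
    return {member: pci
            for pci, cluster in enumerate(ordered, start=1)
            for member in cluster}


# Invented proximity matrix for four path classes (0 = pair failed the join).
sim = [
    [1.0, 0.9, 0.7, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.7, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
pcis = assign_pcis(upgma_clusters(sim, theta=0.6))
```

In this toy input, path classes 0 and 1 merge at 0.9, but path class 2 stays separate at θ = 0.6 because its average similarity to the merged cluster is (0.7 + 0.0) / 2 = 0.35.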

After having equipped the PS with PCIs, we can easily derive a structural tokenization function. For this purpose, we decompose a tree into a set of paths and, for each path p, the corresponding PCI can be obtained from the annotated PS and used as a structural token. As a consequence, the structure of the tree is represented as a set of PCI tokens, where each token denotes the appearance of a path instance related to a path cluster. We denote by pcl this tokenization function based on path clusters, which is formally defined as follows.

Definition 3.3 (pcl Tokenization Function). Let $PC = \{pc_1, \ldots, pc_n\}$ be a set of path clusters. Given a path instance $p$, we say that $p \in pc$ iff $class(p) \in pc$. Let $T$ be a tree and $rlt(T) = \{p_1, \ldots, p_n\}$ be its set of root-to-leaf paths. The pcl tokenization function generates a profile from $T$ as follows:

\[ pcl(T) = \{\, i_1, \ldots, i_n : p_k \in pc_{i_k},\ 1 \leq k \leq n \,\} \]

As usual, we generally denote by PCL a similarity function of the class defined by $\langle pcl, \cdot, \cdot \rangle$.
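A minimal sketch of the pcl tokenization follows. It assumes that the class of a path instance is simply its sequence of element labels and that the PCI annotation is available as a dictionary; the function name `pcl_profile`, the mapping `pci_of_class`, and all paths are hypothetical.

```python
def pcl_profile(root_to_leaf_paths, pci_of_class):
    """Return the set of PCIs of all root-to-leaf paths of a tree."""
    return {pci_of_class[path] for path in root_to_leaf_paths}


# Hypothetical PCI annotation: path class (tuple of labels) -> PCI.
pci_of_class = {
    ("exam", "patient", "name"): 1,
    ("exam", "study", "patient", "name"): 1,
    ("exam", "study", "id"): 2,
    ("exam", "id"): 2,
}

# Hypothetical tree, given by its root-to-leaf paths.
tree_paths = [
    ("exam", "patient", "name"),
    ("exam", "study", "patient", "name"),
    ("exam", "id"),
]
profile = pcl_profile(tree_paths, pci_of_class)
```

Because the profile is a set, the two paths that fall into cluster 1 contribute a single token.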

Example 3.4. Consider subtrees a) and b) in Figure 2.4. The pcl profiles of subtrees a) and b) according to the PCI-annotated PS in Figure 3.4 are both {1, 2, 3, 4}. Thus, the similarity value of PCL(a, b) according to any set-overlap-based similarity function is maximum, i.e., 1.

In the example above, note that subtrees a) and b) have maximum similarity even though they have no path in common. This observation highlights a salient feature of our PCI-based representation: equality matching of single tokens incorporates similarity matching of whole paths for free. The actual path comparison is done only once, during the clustering process, thereby avoiding repeated path similarity computations when evaluating the similarity join operation.
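As a worked instance of the value stated in Example 3.4, assume the Jaccard coefficient as the set-overlap-based function (one common choice; the text does not fix a particular function):

\[ J(pcl(a), pcl(b)) = \frac{|\{1,2,3,4\} \cap \{1,2,3,4\}|}{|\{1,2,3,4\} \cup \{1,2,3,4\}|} = \frac{4}{4} = 1 \]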

At a high level of abstraction, we can interpret our approach as a hash-based similarity matching method [Ste07]. Specifically, the generation of PCIs from path sets can be viewed as a distance-preserving hashing function: similar paths are mapped to the same integer values. Such hash functions form the basis of the widely used LSH algorithm for probabilistic similarity search [GIM99]. Drawing a parallel, LSH employs an embed-project-hash paradigm, whereas here we follow an EM-based approach for hashing tree paths.

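[Figure 3.4 content: PS tree with node labels exam, patient, description, name, mother, study, id, and relatives; path classes annotated with PCIs 1 to 4. Box: PCIs 1, 2, 3, 4 with cluster-formation similarity values 0.633, 0.745, 0.745, 0.745 (listed in that order), shown relative to θ.]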

Figure 3.4: PS equipped with PCIs

Finally, because the PS is a tree structure, the matching of paths in the PS can be done efficiently by standard tree traversal algorithms. Besides supporting query formulation and optimization, the PS has also been exploited for designing a space-economic storage model for XTC [HMS07]. In Chapter 6, we describe how we take advantage of this storage model to obtain PCI values without even having to access the PS structure.
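The tree-traversal lookup mentioned above can be illustrated with a small sketch. It models the PS as a label-indexed tree whose nodes carry an optional PCI annotation and resolves a path by walking from the root. The class `PSNode`, the function `lookup_pci`, and the example summary are hypothetical; this does not reflect the XTC storage model described in Chapter 6.

```python
class PSNode:
    """Node of a path summary: one element label, optionally annotated with a PCI."""

    def __init__(self, label, pci=None):
        self.label = label
        self.pci = pci
        self.children = {}

    def add_child(self, label, pci=None):
        node = PSNode(label, pci)
        self.children[label] = node
        return node


def lookup_pci(root, path):
    """Walk the PS along the labels of `path`; return the PCI found there, or None."""
    if not path or path[0] != root.label:
        return None
    node = root
    for label in path[1:]:
        node = node.children.get(label)
        if node is None:  # the path has no class in this summary
            return None
    return node.pci


# Hypothetical PCI-annotated summary.
ps_root = PSNode("exam")
patient = ps_root.add_child("patient")
patient.add_child("name", pci=1)
study = ps_root.add_child("study")
study.add_child("id", pci=2)

name_pci = lookup_pci(ps_root, ["exam", "patient", "name"])
```

Each lookup costs one dictionary access per path step, so it is linear in the path length.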