Support Effective Browsing II – Topic Maps on Plain Text
7.2 Topic Map Construction
We formally defined multi-resolution topic maps in the previous chapter. In practice, a topic map can be constructed based on different data sources. The most promising and challenging data source is a plain text collection, which consists only of a set of documents.
Such text collections are very common and usually very large. Searching for information in such a collection is difficult because it offers no structure for a user to browse. Building a topic map over a plain text collection is therefore challenging but desirable.
7.2.1 Desired Properties
A topic map should capture the global information structure of a text collection and support effective browsing. There are multiple ways to construct such a topic map.
In this section, we enumerate several desired properties of a topic map with the goal of supporting browsing.
• Property 1: Node label predictiveness. Each node in the map should have a meaningful, content-bearing label which clearly indicates the documents within the corresponding topic region.
• Property 2: Document coverage. Each level of the topic map should comprehensively cover the whole document collection.
• Property 3: Containment relationship. The hierarchy in a multi-resolution topic map should have clear containment relationships, so that users can easily understand the hierarchical structure.
• Property 4: Reachability. A good topic map lets a user reach needed information flexibly. For example, there should be multiple possible paths to reach information elsewhere in the map.
There are several ways to construct a topic map over a document collection. The most intuitive is to use a document clustering algorithm to organize documents into a hierarchy and build the topic map on that hierarchy. However, this method has a notable deficiency: it is hard to give a meaningful label to a cluster of documents, because most document clustering algorithms are polythetic and group documents into a cluster based on multiple keywords. A succinct cluster label can be uninformative or may represent only part of the documents in the cluster. In a topic map with a hierarchical structure, it is even harder to give discriminative labels to nodes at different levels. This conflicts with Property 1.
7.2.2 A Boolean Keyword Expression Approach
Considering all the desired properties, we propose a keyword-based approach to building a topic map. Some previous work, such as [57], has proposed building a keyword hierarchy to organize search results. That work determines the containment/subsumption relationship between two keywords from their co-occurrence relationship. However, such an approach does not satisfy Property 3: a keyword can be determined to subsume another keyword, but the topic regions defined by the two keywords do not have a clear containment relationship.
In the remainder of this section, we describe our method for constructing a topic map.
Map Node Definition
To capture the global information structure, we first need to identify content-bearing keywords which are spread over the whole collection. Nouns are generally informative, so we use nouns and frequent noun patterns as our map node candidates. Specifically, we define three types of nodes as follows:
• Single-keyword node: A single-keyword node contains only a single noun. For example, “flight” is a node. Such a node defines a topic region consisting of all the documents which contain “flight.”
• Multi-keyword AND node: A multi-keyword AND node is a frequent noun pattern which consists of several nouns joined by a logical AND relation. We use the sign “+” to connect the keywords in a node. For example, “flight+passengers” is a node of this type. It defines a topic region consisting of the documents which contain both “flight” AND “passengers.”
• Multi-keyword OR node: A multi-keyword OR node consists of several keywords joined by a logical OR relation. We use the sign “|” to connect the keywords in a node. For example, “flight|plane” is a node of this type. It defines a topic region consisting of the documents which contain either “flight” OR “plane.”
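All of these node definitions help satisfy Property 1 (node label predictiveness), because the label of a node states exactly which keywords a document must contain to fall within its topic region.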
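As a minimal sketch of how the three node types map to topic regions, the following Python snippet evaluates a node label against a toy collection. The collection, the helper name region, and the representation of a document as a set of nouns are illustrative assumptions, not part of the original method description.

```python
# Toy collection: document id -> set of nouns (illustrative data only).
docs = {
    1: {"flight", "passengers", "airport"},
    2: {"flight", "delay"},
    3: {"plane", "crash"},
    4: {"population", "city"},
}

def region(label, docs):
    """Return the ids of the documents in the topic region of a node label.

    "a+b" is read as an AND node, "a|b" as an OR node, and a bare word as a
    single-keyword node. Labels mixing "+" and "|" are not handled here.
    """
    if "+" in label:
        words = label.split("+")
        return {d for d, nouns in docs.items() if all(w in nouns for w in words)}
    if "|" in label:
        words = label.split("|")
        return {d for d, nouns in docs.items() if any(w in nouns for w in words)}
    return {d for d, nouns in docs.items() if label in nouns}

flight = region("flight", docs)                          # {1, 2}
flight_and_passengers = region("flight+passengers", docs)  # {1}
flight_or_plane = region("flight|plane", docs)           # {1, 2, 3}
```

On this toy data the AND region is a subset of the single-keyword region, which in turn is a subset of the OR region.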
Identify Map Nodes
Given the above definitions, all nouns and their combinations are potential map nodes. However, some keyword combinations, such as “flight+population,” are not meaningful and may correspond to an empty topic region. Here, we describe how to identify meaningful map nodes. Given a keyword w, we use D(w) to denote the set of documents which contain w and |D(w)| to denote the size of D(w).
Single-keyword node. Any noun can be used as such a node. To be discriminative, the noun should be neither too popular (e.g., “date” in news articles) nor too rare.
We use two parameters maxDF and minDF to filter out those undesired nouns. That is, we only keep nouns which satisfy
N1 = {w : w is a noun, minDF ≤ |D(w)| ≤ maxDF }.
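The document-frequency filter can be sketched in a few lines of Python. The snippet assumes that nouns have already been extracted from each document (part-of-speech tagging is not shown); the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each noun w to D(w), the set of ids of documents containing w."""
    index = defaultdict(set)
    for doc_id, nouns in docs.items():
        for w in nouns:
            index[w].add(doc_id)
    return dict(index)

def single_keyword_nodes(index, min_df, max_df):
    """Return N1: nouns whose document frequency lies in [min_df, max_df]."""
    return {w for w, d_w in index.items() if min_df <= len(d_w) <= max_df}

# Toy collection (illustrative): "date" is too popular, "zeppelin" too rare.
docs = {
    1: {"date", "flight"},
    2: {"date", "flight", "plane"},
    3: {"date", "plane"},
    4: {"date", "zeppelin"},
}
index = build_index(docs)
n1 = single_keyword_nodes(index, min_df=2, max_df=3)  # {"flight", "plane"}
```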
Multi-keyword AND node. We use a frequent pattern mining approach to identify this type of node [26]. Given a pattern p of length k, we generate patterns of length k + 1 as follows:
1. Identify the set of documents which contain p from the document collection C. We denote it as Dp.
2. For each w ∈ N1, we compute the support and confidence of the pattern p + w as $\mathrm{Supp} = |D_p \cap D(w)|$ and $\mathrm{Conf} = \frac{|D_p \cap D(w)|}{|D_p|}$.
3. If Supp ≥ minSupp and Conf ≥ minConf, we keep the pattern p + w.
We iterate over all the length-k patterns to generate the set of length-(k + 1) patterns. In particular, when p is a single-keyword pattern, i.e., p ∈ N1, the above procedure generates length-2 patterns. The two thresholds minSupp and minConf ensure that only meaningful patterns are kept.
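One level of this pattern growth can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation: a pattern is represented as a frozenset of keywords, so the same keyword set reached from different parents is stored once and kept if any parent passes both thresholds. The function name and the toy index are hypothetical.

```python
def extend_patterns(patterns, index, n1, min_supp, min_conf):
    """Grow every length-k pattern by one keyword from N1.

    patterns: dict mapping frozenset of keywords -> D_p (set of doc ids)
    index:    dict mapping keyword w -> D(w)
    Returns a dict of the kept length-(k+1) patterns and their document sets.
    """
    longer = {}
    for p, d_p in patterns.items():
        for w in n1:
            if w in p:
                continue  # adding a keyword already in p does not lengthen it
            joint = d_p & index[w]                    # D_p ∩ D(w)
            supp = len(joint)
            conf = supp / len(d_p) if d_p else 0.0    # |D_p ∩ D(w)| / |D_p|
            if supp >= min_supp and conf >= min_conf:
                longer[p | {w}] = joint
    return longer

# Toy inverted index (illustrative data only).
index = {
    "flight": {1, 2, 3, 4},
    "passengers": {1, 2, 3},
    "population": {5, 6},
}
n1 = set(index)
level1 = {frozenset({w}): index[w] for w in n1}
level2 = extend_patterns(level1, index, n1, min_supp=2, min_conf=0.5)
level3 = extend_patterns(level2, index, n1, min_supp=2, min_conf=0.5)
```

Here level2 keeps only the pattern "flight+passengers" with documents {1, 2, 3}; "flight+population" has zero support and is dropped, and no length-3 pattern survives.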
Multi-keyword OR node. We identify OR nodes using an algorithm similar to keyword clustering, based on a modified star clustering algorithm [3].
1. Order all the terms in N1 by decreasing document frequency to obtain a ranked list R.
2. Select the next unmarked element w in R and find its nearest neighbors by computing
$\mathrm{sim}(w, x) = \frac{|D(w) \cap D(x)|}{|D(w) \cup D(x)|}, \quad \forall x \in R.$
3. w and all x with sim(w, x) ≥ σ together make up an OR node. Mark all the keywords in this node.
4. Repeat steps 2 and 3 until there are no unmarked elements in R.
The parameter σ controls the coherence of each cluster of keywords.
Once we have a set of OR nodes, the above procedure can be applied recursively to produce another layer of OR nodes on top of the current ones.
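The clustering steps above can be sketched as a single pass over the ranked list. Two choices in this sketch are assumptions, since the text leaves them open: only unmarked keywords may join a new node, so each keyword ends up in exactly one node, and ties in document frequency are broken alphabetically. A keyword without sufficiently similar neighbors is returned as a singleton. The recursive construction of further layers is not shown.

```python
def jaccard(a, b):
    """Overlap of two document sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def or_nodes(index, sigma):
    """Group keywords into OR nodes with a star-clustering-style pass.

    index: dict mapping keyword w -> D(w); sigma: similarity threshold.
    Returns a list of keyword lists, one per node, centre keyword first.
    """
    # Step 1: rank by decreasing document frequency (ties: alphabetical).
    ranked = sorted(index, key=lambda w: (-len(index[w]), w))
    marked = set()
    nodes = []
    for w in ranked:                      # Step 2: next unmarked keyword
        if w in marked:
            continue
        members = [w] + [
            x for x in ranked
            if x != w and x not in marked
            and jaccard(index[w], index[x]) >= sigma
        ]
        marked.update(members)            # Step 3: mark the whole node
        nodes.append(members)
    return nodes                          # Step 4: loop ends when all marked

# Toy inverted index (illustrative data only).
index = {
    "flight": {1, 2, 3, 4},
    "plane": {2, 3, 4, 5},
    "airport": {1, 2},
    "city": {7, 8},
}
nodes = or_nodes(index, sigma=0.55)
labels = ["|".join(members) for members in nodes]
# labels == ["flight|plane", "airport", "city"]
```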
Vertical and Horizontal Links
After we have a set of map nodes, we build a topic map by adding vertical and horizontal links among them.
The definitions of our map nodes give a clear hierarchical structure: multi-keyword OR nodes subsume single-keyword nodes, and single-keyword nodes contain multi-keyword AND nodes. For example, “flight|plane” is a parent of “flight,” which is a parent of “flight+passengers.” The topic regions defined by these nodes have a clear containment relationship, which satisfies Property 3.
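As a small illustration of how vertical links follow directly from the node labels, the sketch below derives (parent, child) pairs from lists of OR and AND nodes. The function name and the input representation are assumptions. The text does not say whether a longer AND node should link to shorter AND nodes, so this sketch links every AND node only to its single keywords.

```python
def vertical_links(or_nodes, and_nodes):
    """Derive (parent, child) label pairs from node definitions.

    or_nodes / and_nodes: lists of keyword tuples.
    An OR node is the parent of each of its keywords; each keyword is a
    parent of every AND node that contains it.
    """
    links = []
    for members in or_nodes:
        for w in members:
            links.append(("|".join(members), w))
    for members in and_nodes:
        for w in members:
            links.append((w, "+".join(members)))
    return links

links = vertical_links(
    or_nodes=[("flight", "plane")],
    and_nodes=[("flight", "passengers")],
)
# links == [("flight|plane", "flight"), ("flight|plane", "plane"),
#           ("flight", "flight+passengers"), ("passengers", "flight+passengers")]
```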
We build the horizontal links as follows. Given a node v, we use D(v) to denote its topic region, i.e., the set of documents which contain node v. Given two nodes u and v at the same level, we compute their closeness based on the overlap of their topic regions:
$\mathrm{sim}(u, v) = \frac{|D(u) \cap D(v)|}{|D(u) \cup D(v)|}.$
If sim(u, v) ≥ δ, we add a horizontal link between u and v.
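The horizontal-link rule can be sketched as a pairwise comparison of the topic regions at one level. The function name, the dictionary representation of a level, and the toy data are illustrative assumptions. Comparing all pairs is quadratic in the number of nodes, which is acceptable for a sketch but may need pruning on a large map.

```python
from itertools import combinations

def horizontal_links(regions, delta):
    """Link every pair of same-level nodes whose regions overlap enough.

    regions: dict mapping node label -> D(node), a set of document ids.
    Returns a list of (u, v) label pairs with sim(u, v) >= delta.
    """
    links = []
    for u, v in combinations(sorted(regions), 2):
        a, b = regions[u], regions[v]
        union = a | b
        sim = len(a & b) / len(union) if union else 0.0
        if sim >= delta:
            links.append((u, v))
    return links

# Toy level of single-keyword nodes (illustrative data only).
regions = {
    "flight": {1, 2, 3, 4},
    "airport": {1, 2, 3},
    "population": {5, 6},
}
links = horizontal_links(regions, delta=0.5)  # [("airport", "flight")]
```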
Our proposed method satisfies all the desired properties listed before. Property 1 and Property 3 have been shown above. Our map also satisfies Property 2 and Property 4 well. Almost every document contains nouns, so our map nodes ensure high document coverage. Because our map contains both horizontal and vertical links, a topic region can be reached through multiple noun sequences.