Early Findings and Contributions
Algorithm 9 Fixed Lexical Chain (FXLC) Algorithm.
6.3.1 Semantic Extraction Based on Lexical Chains
As explained in Algorithm 5, during the pre-processing step we keep only the nouns in each document that have a synset match in WordNet (WN). After pre-processing the data (e.g., lowercasing and stopword removal), our corpus has a total of approximately 216K words, of which 68K (nouns) have a match in WN, as Table 6.4 shows.
For our synset experiments, the number of synsets in our term/document matrix ranges between 1,284 and 7,490 elements. In addition, the documents considered in these experiments have, on average, 7,200 words each, which can produce a considerably large dataset to process.
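The filtering step described above can be illustrated with a minimal sketch. It is not Algorithm 5 itself: the function name `keep_wordnet_nouns`, the toy stopword list, and the toy noun lexicon are hypothetical stand-ins, and the WordNet lookup is abstracted as a callable so that the example runs without external data. With NLTK installed, a lookup based on `wordnet.synsets(word, pos=wordnet.NOUN)` could plausibly play that role.

```python
# Toy stopword list (assumption; the source does not specify the list used).
STOPWORDS = {"the", "a", "is", "and", "of", "in"}


def keep_wordnet_nouns(text, has_noun_synset, stopwords=STOPWORDS):
    """Lowercase the text, drop stopwords, and keep only the tokens for
    which `has_noun_synset` reports a noun synset match."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [t for t in tokens
            if t and t not in stopwords and has_noun_synset(t)]


# Hypothetical stand-in for a WordNet noun lookup.
TOY_NOUN_LEXICON = {"dog", "ball", "computer"}

kept = keep_wordnet_nouns(
    "The dog chased a ball and barked loudly.",
    TOY_NOUN_LEXICON.__contains__,
)
```

Passing the lookup as a parameter keeps the sketch independent of any particular WordNet interface.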
After all datasets are properly cleaned, we extract the BSID representation (Section 6.2.1), which is used as a base for all our lexical chain scenarios (FLC, FXLC, and F2F). Once all flexible lexical chains are extracted from the documents, they are used to map fixed lexical chain structures and to create the corresponding semantic vector representations. We also derive FXLC directly from the BSID vectors, using a fixed chain size, as shown in Section 6.2.4.
In our experiments, we validate our various approaches in a clustering task, using 256 bins for our synset-based techniques. As mentioned previously, we have documents from 3 major categories, so we perform a variant of K-Means clustering for K = 3 and evaluate the resulting clusters using both the Adjusted Rand Index and the Mean Individual Silhouette values. The former metric measures the similarity between two clusterings: we compare the derived clusters to the ground truth clusters, consisting of all the dog documents, all the computer documents, and all the sport documents. The latter metric measures how well formed the clusters are, that is, whether documents in the same cluster are close together while documents in different clusters are far apart.
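Both evaluation metrics can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation used in the experiments: the function names are hypothetical, and the use of cosine distance inside the silhouette is an assumption, since the text does not state which distance the silhouette values are based on.

```python
from collections import Counter
from math import comb, sqrt


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same documents."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of documents grouped together in both, in truth, in prediction.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = in_true * in_pred / total_pairs
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0
    return (both - expected) / (maximum - expected)


def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance; vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_silhouette(vectors, labels, distance=cosine_distance):
    """Mean of the individual silhouette values; needs at least 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(distance(vectors[i], vectors[j]) for j in own) / len(own)
        b = min(
            sum(distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

The Adjusted Rand Index depends only on which documents are grouped together, so a clustering that matches the ground truth up to a renaming of the cluster labels still scores 1.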
We use spherical K-Means clustering [70], as this technique considers cosine distance [63, 154] rather than Euclidean distance. To validate the proposed algorithms, we also design, implement, and extend traditional approaches for document similarity: BOW with all words (except common stopwords) in the documents (Bag-of-Words-Raw (BOWR)), BOW with only the nouns matched in WordNet (Bag-of-Words-WordNet (BOWN)), BOW with the first synset match (commonly used) in WordNet (Bag-of-Words-Synsets (BOWS)), and BOW with the BSID (Bag-of-Words-Best (BOWB)) extracted from the BSD. Since the traditional approaches are variations of counts, only one bin is considered for these histograms.
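To make the difference from standard K-Means concrete, the following is a minimal sketch of spherical K-Means, not the implementation from [70]. All names are hypothetical, and the initial centroids are passed in explicitly (an assumption made to keep the example deterministic). Documents and centroids are projected onto the unit sphere, so assigning a document to the centroid with the largest dot product is equivalent to choosing the smallest cosine distance.

```python
from math import sqrt


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, initial_centroids, max_iter=100):
    """Cluster by cosine similarity; returns (labels, unit-length centroids)."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: most similar centroid by cosine similarity.
        new_labels = [
            max(range(len(centroids)), key=lambda k: dot(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: re-normalized sum of the members of each cluster.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                summed = [sum(column) for column in zip(*members)]
                if any(summed):
                    centroids[k] = normalize(summed)
    return labels, centroids


# Toy term-count vectors (illustrative data, not from the corpus).
docs = [[5, 1, 0], [10, 2, 0], [0, 4, 1], [0, 8, 3], [1, 0, 6], [0, 1, 9]]
labels, centroids = spherical_kmeans(docs, [docs[0], docs[2], docs[4]])
```

In the toy data, the second document is the first one with every count doubled. Under cosine distance the two are identical, which is why this variant suits documents of different lengths.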
Table 6.5 provides a summary of all experiments performed, while Figure 6.5 shows a scatter plot of these results. These results show that several variants of our general approach outperform the traditional ones, and that four of our approaches stand out. Considering the results presented in Figure 6.5, some observations can be made:
Table 6.5: Experiments using lexical chains algorithms and traditional approaches.
Label  Algorithm                    Adjusted Rand Index  Mean Individual Silhouette
A      Pure Flex–Method III         1                    0.1908
B      Pure Flex–Method II          1                    0.1775
C      BOW-N–Nouns in WordNet       1                    0.1757
D      BOW-B–Best Synsets           1                    0.1686
E      Flex-2-Fixed–Method I        0.8981704            0.3964
F      Flex-2-Fixed–Method III      0.8981704            0.3878
G      BOW-R–Raw Words              0.8981704            0.1591
H      Flex-2-Fixed–Method II       0.8066667            0.3578
I      BOW-S–WordNet First Synset   0.6671449            0.1542
J      Pure Flex–Method I           0.6590742            0.1826
K      Pure Fixed–Method I          0.6044735            0.2137
L      Pure Fixed–Method III        0.5165853            0.2734
M      Pure Fixed–Method III        0.40252              0.2743
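• Three of our techniques produce a perfect clustering (Adjusted Rand Index of 1; labels A, B, and D in Table 6.5). Two of these perfect clusterings use FLC (considering their variations), while the third perfect clustering results from the proposed methodology of finding the best synset representation for a document (Section 6.2.1);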
• The only perfect clustering on the Pareto front (that is, not dominated by another result) is the one using the third approach in Section 6.2.5 (III) for extracting flexible chains;
• The clustering with the maximum silhouette value results from applying our first approach in Section 6.2.5 (I) to our technique for extracting F2F chains. This clustering is also on the Pareto front; and
• The only clusterings on the Pareto front result from our techniques.
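The Pareto front observations can be checked directly against the values in Table 6.5. The sketch below assumes the usual definition for two objectives that are both maximized: a result dominates another if it is at least as good on both metrics and strictly better on at least one. The function names are hypothetical; the data are copied from the table.

```python
# (Adjusted Rand Index, Mean Individual Silhouette) per label, from Table 6.5.
RESULTS = {
    "A": (1.0, 0.1908),
    "B": (1.0, 0.1775),
    "C": (1.0, 0.1757),
    "D": (1.0, 0.1686),
    "E": (0.8981704, 0.3964),
    "F": (0.8981704, 0.3878),
    "G": (0.8981704, 0.1591),
    "H": (0.8066667, 0.3578),
    "I": (0.6671449, 0.1542),
    "J": (0.6590742, 0.1826),
    "K": (0.6044735, 0.2137),
    "L": (0.5165853, 0.2734),
    "M": (0.40252, 0.2743),
}


def dominates(p, q):
    """True if p is no worse than q on both metrics and better on one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])


def pareto_front(results):
    """Labels of the results that no other result dominates."""
    return sorted(
        label
        for label, point in results.items()
        if not any(
            dominates(other, point)
            for other_label, other in results.items()
            if other_label != label
        )
    )


front = pareto_front(RESULTS)
```

With these values the front consists of A (Pure Flex–Method III) and E (Flex-2-Fixed–Method I), in line with the observations above.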
[Figure 6.5: scatter plot of the techniques labeled A to M; x-axis: Adjusted Rand Index, y-axis: Mean Individual Silhouette.]
Figure 6.5: Scatter plot between mean individual silhouette and Adjusted Rand Index for proposed techniques (Table 6.5 data).
In this experiment, we explore how extracted semantic features can aid in document retrieval tasks. Furthermore, we present several contributions on how these features can be extracted to form more robust lexical chains. First, we explore and extend the notion of WSD and how to represent words, considering the effect of their immediate neighbors on their meaning (BSID). Second, we propose three new algorithms: (a) a new methodology to create variable-length semantic chains (FLC), (b) an algorithm to derive fixed lexical structures (FXLC) directly from semantic representations, and (c) a new approach to transform variable-length semantic chains into fixed parameterized structures (F2F). Third, we provide different alternatives to construct the semantic dispersion over a document (Section 6.2.5). To establish a baseline, we compare the proposed approaches with traditional ones, such as BOW and a few of its variations. Our findings show that several of our approaches achieve superior results.