Early Findings and Contributions
Algorithm 9 Fixed Lexical Chain (FXLC) Algorithm.
6.3.3 Keyconcept Extraction through Lexical Chains
Another contribution of our research addresses the keyword extraction problem. We explore the use of the BSD (Section 6.2.1) and FLC (Section 6.2.2) algorithms to suggest keywords for documents based on their semantic features. In these experiments, we use the POC dataset discussed in Section 6.3. We reduce our corpus to the synsets produced using BSD and FLC. Next, we rank the synsets by multiplying their counts by the weight obtained through tf-idf (Section 6.2.6). Table 6.11 shows the number of synsets obtained per document in each category. It also shows the average number of unique synsets per document category.
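The ranking step above can be sketched as follows. This is a minimal illustration and not our actual implementation: the function name `rank_synsets`, the input layout (a dictionary from document ID to a list of synset IDs), and the particular tf-idf variant (relative term frequency times the natural logarithm of N/df) are all assumptions made for the example; the weighting actually used is the one described in Section 6.2.6.

```python
import math
from collections import Counter


def rank_synsets(docs, top_k=5):
    """Rank the synsets of each document by count * tf-idf.

    docs: dict mapping a document ID to the list of synset IDs found in it
          (hypothetical layout, assumed for this sketch).
    Returns a dict mapping each document ID to its top_k (synset, score) pairs.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(synsets) for doc_id, synsets in docs.items()}

    # Document frequency: in how many documents each synset occurs.
    df = Counter()
    for doc_counts in counts.values():
        df.update(doc_counts.keys())

    ranked = {}
    for doc_id, doc_counts in counts.items():
        total = sum(doc_counts.values())
        scores = {}
        for synset, n in doc_counts.items():
            tf = n / total                       # assumed tf variant
            idf = math.log(n_docs / df[synset])  # assumed idf variant
            scores[synset] = n * tf * idf        # count times tf-idf weight
        # Sort by decreasing score; break ties by synset ID for determinism.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[doc_id] = ordered[:top_k]
    return ranked


if __name__ == "__main__":
    # Toy input with made-up synset IDs, for illustration only.
    toy = {"Doc_A": ["s1", "s1", "s2"], "Doc_B": ["s2", "s3"]}
    for doc_id, top in rank_synsets(toy).items():
        print(doc_id, top)
```

Under this variant, a synset that occurs in every document receives an idf of zero and therefore falls to the bottom of the ranking, regardless of how often it occurs.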
It is evident that the number of synsets obtained through BSD is far greater than the number produced using FLC. The reason is that in the former we analyze separate words, considering only the concepts within their immediate surroundings (predecessor and successor). In the latter, the chains track the continuity of broader ideas, so the groups of synsets are clustered into a more common concept, which, in turn, is represented by a single synset. In other words, BSD produces fine-grained results, with one synset per word that exists in WordNet, while FLC provides general abstractions, reducing the total number of synsets in a document.

Table 6.11: Distribution of synsets obtained through BSD and FLC.

                   BSD                          FLC
Docs      Dogs  Computers  Sports      Dogs  Computers  Sports
Doc_01     627       1458    1546       152        364     368
Doc_02     461        526    1477        91        113     365
Doc_03    1229        382     924       281         95     214
Doc_04     687        793    1007       161        185     260
Doc_05     535       1276    1136       126        318     268
Doc_06     608       1336     346       143        379      82
Doc_07     693        578     911       156        119     226
Doc_08     978        501     161       229         96     396
Doc_09    1285       1205    1473       335        301     382
Doc_10     677       1453    1018       161        349     242
Average    778        951    1000       184        232     280
Once the keyword candidates are obtained (in decreasing order of their tf-idf values), we evaluate the top five through a survey that compares them with the document categories defined in each Wikipedia article (ground truth). These categories are provided by MediaWiki⁴, which automatically generates a listing that assigns a given page to a subject area; the listing can be found at the bottom of every Wikipedia article.
The main objective of this experiment is to evaluate whether the keywords obtained through our proposed techniques can represent the essential concepts in a given article. Thus, we create a scale with scores ranging from 1 (strongly disagree) to 5 (strongly agree) to assess the quality of the suggested keywords. This experiment is performed through a survey answered by 8 members of our department (1 Full Professor, 4 Ph.D. candidates, 2 Ph.D. students, and 1 M.S. student).
Tables 6.12 and 6.13 show a sample of how the participants correlate the categories in MediaWiki (Table 6.12) and the ones suggested through our algorithms (Table 6.13), respectively. Participants are asked to compare and score the values for matching documents in both groups, the ground truth and the suggested keywords. The exercise in our survey is done for each document (30 Wikipedia articles), considering the first 5 synsets for each of the BSD and FLC approaches. A total of 300 records (30 documents, 5 synsets, 2 approaches) are assessed per participant, which yields 2,400 final evaluations over the 8 participants in the entire survey.

Table 6.12: MediaWiki categories sample.

Document  Key 01                Key 02         Key 03               Key 04                   Key 05
Doc_X     Computer hardware     Electronics
Doc_Y     Computer programming  Computers
Doc_Z     Linux                 1991 software  Computing platforms  Cross-platform software  Finnish inventions
Table 6.13: Suggested synsets sample.

Synset ID        Type                                           Doc
SID-06321054-N   intensifier,intensive                          Doc_X
SID-06566077-N   software,software_program,computer_software    Doc_X
SID-00928947-N   programming,programing,computer_programming    Doc_X
SID-05650820-N   language,speech                                Doc_X
SID-05996646-N   discipline,subject,subject_area,subject_field  Doc_X
SID-05847438-N   algorithm,algorithmic_rule,algorithmic_program Doc_X
SID-03642806-N   laptop,laptop_computer                         Doc_Y
SID-03699975-N   machine                                        Doc_Y
SID-07739125-N   apple                                          Doc_Y
SID-09840217-N   baron,big_businessman,business_leader,king     Doc_Y
SID-13645010-N   horsepower,HP,H.P.                             Doc_Y
SID-13699442-N   dram,                                          Doc_Z
SID-09752246-N   Aries,Ram                                      Doc_Z
SID-13912992-N   constriction,bottleneck,chokepoint             Doc_Z
SID-03744276-N   memory,computer_memory,storage                 Doc_Z
SID-06198876-N   devices                                        Doc_Z
Figure 6.9: Keyword survey average rankings. (Series: BSD and FLC; groups: Dogs, Computers, Sports, and All Documents; score axis from 0 to 4.5.)

For each document in our corpus and each of its keywords, we compute the mean rating over all survey participants. Since each document has 5 keywords, we then take a weighted average of the 5 mean ratings, where the weights correspond to the tf-idf values associated with each of the five keywords.
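The per-document score described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the layout of the ratings (one list of keyword ratings per participant) are hypothetical and chosen only for the example.

```python
def mean_rating_per_keyword(ratings):
    """Average the ratings of all participants for each keyword.

    ratings: list with one entry per participant; each entry is the list of
             scores (1 to 5) that participant gave to the document's keywords.
    """
    n_participants = len(ratings)
    n_keywords = len(ratings[0])
    if any(len(row) != n_keywords for row in ratings):
        raise ValueError("every participant must rate the same keywords")
    return [
        sum(row[k] for row in ratings) / n_participants
        for k in range(n_keywords)
    ]


def weighted_document_score(mean_ratings, tfidf_weights):
    """Weighted average of the mean ratings, using tf-idf values as weights."""
    if len(mean_ratings) != len(tfidf_weights):
        raise ValueError("one tf-idf weight is needed per keyword")
    total_weight = sum(tfidf_weights)
    if total_weight == 0:
        raise ValueError("tf-idf weights must not sum to zero")
    weighted_sum = sum(r * w for r, w in zip(mean_ratings, tfidf_weights))
    return weighted_sum / total_weight
```

For instance, with two participants rating two keywords as [4, 2] and [5, 3], the mean ratings are 4.5 and 2.5; with tf-idf weights of 3 and 1, the document score is (4.5 * 3 + 2.5 * 1) / 4 = 4.0. Keywords with a larger tf-idf value therefore contribute more to the score of their document.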
In Figure 6.9, we show the average of the results over each of the three categories of our corpus (i.e., documents about dogs, computers, and sports), so that we can capture a more robust outcome from our participants. In Figure 6.10, we show the correlation between the individual tf-idf values of each proposed keyword and the average score given by the human reviewers. The correlation is higher for the BSD approach than for the FLC approach, and the average correlation is high.
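A correlation of this kind can be computed as in the sketch below. The text does not state which correlation coefficient is used, so the choice of the Pearson coefficient here is an assumption, as are the function name `pearson` and the example inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers.

    In this setting, xs would hold the tf-idf values of the suggested
    keywords and ys the mean reviewer scores of the same keywords
    (assumed pairing, for illustration).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would indicate that keywords with higher tf-idf values tend to receive higher scores from the reviewers, while a value close to 0 would indicate no linear relationship between the two.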
The BSD algorithm captures the main concepts better than the FLC algorithm for all document categories. As mentioned previously, BSD provides more detailed synsets, which results in more precise keywords. FLC, on the other hand, provides more general abstractions, which do not perform as well. An example is document D6, the sixth document in the Dogs category: the BSD method suggests the words kennel and doghouse, while the FLC approach suggests the words building and edifice.
Figure 6.10: Keyword survey average correlations with keyword strengths. (Series: BSD and FLC; groups: Dogs, Computers, Sports, and All Documents; correlation axis from 0 to 1.)

In this experiment, we explore how semantic features can be extracted and used for keyword suggestion. Instead of relying solely on syntax analysis, statistical approaches, or annotated corpora, we apply multiple techniques that use semantic representation to recommend possible keywords. This representation is obtained through two of our proposed techniques, BSD and FLC. Both consider the context surrounding each word to better represent the meaning underlying the text itself. While the former provides a more detailed extraction of synsets, the latter captures more general aspects discussed in the corpus.