3.7 Experimental evaluation
3.7.1 Evaluating the domain model
In this section, we address research question RQ 1.1, which concerns the construction of a domain model. The task of constructing a domain model is cast as the task of ranking candidate words selected from a domain-specific corpus using various scoring functions. We present a set of experiments for the intrinsic evaluation of a domain model, using a manually constructed gold standard dataset. As benchmarks we use two methods proposed for term extraction and a method used for constructing concept hierarchies. Additionally, several other benchmarks based on probabilistic topic modelling are discussed and compared with our approach. Results are evaluated in the Computer Science domain using the gold standard dataset constructed in Section 3.6.2.1. The Krapivin corpus [KAM09], described in Section 3.6.2.2, was used to extract a domain model for Computer Science.
[Plot: F-score (y-axis) against portion of ranked list in number of candidates (x-axis); series: DC, TermExtractor, Subsumption, NCV weight]
Figure 3.2: Methods for extracting a domain model
The first two benchmarks are the contrastive approach used in TermExtractor and the simpler frequency-based method used by NC-value to select context words. Both approaches are described in more detail in Section 2.2.1. Furthermore,
[Plot: F-score (y-axis) against portion of ranked list in number of candidates (x-axis); series: LDA5, DC, LDA75, hLDA]
Figure 3.3: Comparison of domain modelling with topic modelling approaches
we consider a statistical method used for the construction of subsumption hierarchies in document browsing [SC99]. In our implementation, context is defined as a window of 5 words. All nouns that appear in at least one quarter of the documents are considered as candidate words for the domain model, except those that appear in a stopword list. We used a widely used stopword list that contains 429 words1.
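The candidate selection step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `select_candidates`, the input format (documents as lists of `(token, tag)` pairs) and the use of Penn Treebank noun tags (those starting with `NN`) are all assumptions made for the example.

```python
from collections import Counter


def select_candidates(docs, stopwords, min_doc_fraction=0.25):
    """Return nouns that occur in at least `min_doc_fraction` of the documents.

    docs: list of documents, each a list of (token, pos_tag) pairs
          (assumed input format; tagging is done beforehand).
    stopwords: set of lower-cased words to exclude.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each noun once per document (document frequency).
        nouns = {tok.lower() for tok, tag in doc if tag.startswith("NN")}
        doc_freq.update(nouns)

    threshold = min_doc_fraction * len(docs)
    return sorted(
        word
        for word, df in doc_freq.items()
        if df >= threshold and word not in stopwords
    )


# Toy corpus of four already-tagged documents (hypothetical data).
toy_docs = [
    [("Algorithm", "NN"), ("runs", "VBZ"), ("time", "NN")],
    [("algorithm", "NN"), ("uses", "VBZ"), ("time", "NN")],
    [("algorithm", "NN"), ("runs", "VBZ")],
    [("system", "NN"), ("runs", "VBZ")],
]
toy_stopwords = {"time"}

candidates = select_candidates(toy_docs, toy_stopwords)
```

With the default threshold of one quarter, a noun needs to occur in only one of the four toy documents, so both "algorithm" and "system" are kept while "time" is removed as a stopword. Raising `min_doc_fraction` to 0.5 keeps only "algorithm".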
The results of this experiment are shown in Figure 3.2, where performance is measured in terms of F-score at the top N results (F@N), which is described in Section 3.6.1.1. Several conclusions can be drawn from this experiment. First, the methods that analyse the context of top-ranked terms (i.e., our domain coherence measure, DC; the weight used for context words in NC-value, wNCV; and the Subsumption approach) perform better than the contrastive measure used in TermExtractor, with statistically significant gains. Second, our domain coherence method outperforms the simpler frequency-based weight used in NC-value and the subsumption score, although these differences are not statistically significant. As expected, the words ranked highly by TermExtractor are too specific for a generic domain model.
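As a reminder of how a ranked list is scored against the gold standard, the sketch below computes an F-score at cut-off N. The exact definition is the one given in Section 3.6.1.1; here we assume the usual harmonic mean of precision and recall, with precision taken over the N evaluated positions. The function name `f_score_at_n` and the toy data are hypothetical.

```python
def f_score_at_n(ranked, gold, n):
    """F-score of the top-n entries of `ranked` against the set `gold`.

    Assumption: precision is hits / n and recall is hits / len(gold).
    """
    if n <= 0 or not gold:
        return 0.0
    hits = sum(1 for word in ranked[:n] if word in gold)
    if hits == 0:
        return 0.0
    precision = hits / n
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranked list and gold standard.
ranked_words = ["system", "algorithm", "foo", "method"]
gold_words = {"algorithm", "method", "model"}

scores = {n: f_score_at_n(ranked_words, gold_words, n) for n in (2, 4)}
```

Curves such as those in Figure 3.2 are obtained by evaluating a function of this kind at increasing cut-off points for each ranking method.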
3. DOMAIN ADAPTIVE EXPERTISE TOPIC EXTRACTION THROUGH DOMAIN MODELLING
Comparison with latent semantics
Because a domain model is a semantic group of words, it has many similarities with topics extracted using topic modelling. Therefore, a natural question is whether techniques developed to discover latent semantics in a domain corpus are suitable to identify a domain model. When compared to topic modelling, a domain model provides less structure, as it identifies only a single topic for the whole domain corpus. A popular probabilistic approach to topic modelling, Latent Dirichlet Allocation (LDA) [BNJ03], is used in our experiments.
Topics are modelled as probability distributions over words, so each topic can be seen as a list of words ranked by their probability of occurrence. To make the approaches comparable, for each topic we select the 20 words with the highest probability of occurrence. In topic modelling, the generality of topics depends on the number of extracted topics: the larger the number of topics, the more specific the thematic information they represent. As we are interested in identifying general words, we experiment with different settings for the number of extracted topics (i.e., 5, 10, 25, 50, 75, 100, 150, and 200 topics).
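Reading a topic as a ranked word list can be illustrated with a short sketch. It assumes each topic is available as a mapping from word to probability; the function name `top_words_per_topic` and the toy distributions are hypothetical, and ties are broken alphabetically only to make the output deterministic.

```python
def top_words_per_topic(topic_word_probs, k=20):
    """For each topic, return its k most probable words, best first.

    topic_word_probs: list of dicts mapping word -> probability
                      (assumed representation of the topic model output).
    """
    return [
        [word for word, _ in sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for dist in topic_word_probs
    ]


# Two toy topics (hypothetical probabilities).
toy_topics = [
    {"network": 0.30, "protocol": 0.25, "packet": 0.20, "node": 0.25},
    {"query": 0.50, "database": 0.30, "index": 0.20},
]

top_two = top_words_per_topic(toy_topics, k=2)
```

In the experiments the cut-off is 20 words per topic rather than the 2 used in this toy call.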
Figure 3.3 compares the results obtained by our domain modelling approach, DC, with the best performing LDA settings, which were found when extracting 5 (LDA5) and 75 topics (LDA75). At most 20 words are analysed for each topic, and a total of 400 words were considered in our evaluation. The LDA5 benchmark is an exception: with 5 topics and at most 20 words per topic, only 100 words are evaluated. Beyond this point LDA5 provides no further candidates and therefore matches no additional gold standard words, so its F-score decreases steadily for larger cut-off points.
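The steady decrease for LDA5 can be made explicit with a short derivation. This is a sketch under the assumption that F@N is the harmonic mean of precision and recall with precision computed over all N positions; the symbols h (number of gold standard words matched within the top N) and G (size of the gold standard) are introduced here for illustration only.

```latex
\begin{align*}
P@N &= \frac{h}{N}, \qquad R@N = \frac{h}{G}, \\
F@N &= \frac{2 \cdot P@N \cdot R@N}{P@N + R@N} = \frac{2h}{N + G}.
\end{align*}
```

Once the 100 available LDA5 words are exhausted, h stays fixed while N keeps growing, so under this assumption F@N falls monotonically for every cut-off point above 100.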
Another solution for identifying topics based on their generality is to use a method that learns topic hierarchies, such as the hierarchical LDA approach (hLDA) described in [BGJT04]. Default settings are used for hLDA, extracting a three-level hierarchy of 20 supertopics and 30 subtopics, all of which are considered in our evaluation. In this case, only the best ten words are considered for each topic. In our experiments, the implementation available in Mallet [McC02] is used for both LDA and hLDA.
A manual analysis of the results shows that the probability of occurrence of a word in a latent topic is not related to the generality of the word. General words
are combined with more specific words to form a topic. Domain modelling, on the other hand, is more successful at bringing general words to the top of the list, as can be seen in Figure 3.3. The LDA approach outperforms the hLDA approach, but both underperform compared to domain modelling.
The experiments presented in this section answer research question RQ 1.1, which concerns effective ways to automatically construct a domain model. Our approach based on domain coherence is more effective for constructing a domain model than the term extraction, subsumption, and topic modelling approaches it was compared against.