3.3 Constructing domain models using domain coherence

Given a domain corpus, representative words of the domain can be selected using a single-word term extraction technique. Several assumptions are made to identify words that are used to construct a domain model from a domain corpus. The first three assumptions are used for candidate word selection, while the fourth assumption is used to filter the candidate words:

1. Length: Only single-word terms are considered, as longer terms tend to be more specific;

2. Distribution: Candidate words should have a high distribution in a domain corpus (the word should appear in at least one quarter of the documents in the corpus);

3. Saliency: Candidate words should be content bearing (i.e., nouns, verbs, adjectives);

4. Semantic Relatedness: A term is more general if it is semantically related to a large number of domain-specific terms.
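The first three assumptions can be sketched as a simple candidate filter. This is a minimal illustration, not the implementation used in this work: the function name select_candidates, the coarse tag names NOUN, VERB and ADJ, and the input layout (documents as lists of (word, tag) pairs) are all assumptions made for the example. Only the one-quarter document threshold comes from the text.

```python
from collections import Counter

# Assumed coarse part-of-speech tags for content-bearing words (assumption 3).
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def select_candidates(tagged_docs, min_doc_ratio=0.25):
    """Return single words that are content bearing and appear in at least
    min_doc_ratio of the documents (assumptions 1-3).

    tagged_docs: list of documents, each a list of (word, pos_tag) pairs.
    """
    doc_freq = Counter()
    for doc in tagged_docs:
        # Count each word at most once per document (document frequency).
        seen = {word.lower() for word, pos in doc if pos in CONTENT_POS}
        doc_freq.update(seen)
    threshold = min_doc_ratio * len(tagged_docs)
    return sorted(w for w, df in doc_freq.items() if df >= threshold)


# Toy corpus of five tagged documents (invented for illustration).
docs = [
    [("system", "NOUN"), ("the", "DET"), ("runs", "VERB")],
    [("system", "NOUN"), ("smoothness", "NOUN")],
    [("system", "NOUN"), ("of", "ADP")],
    [("algorithm", "NOUN"), ("runs", "VERB")],
    [("system", "NOUN")],
]
candidates = select_candidates(docs)
print(candidates)
```

In this toy corpus a word must occur in at least 1.25 documents, so words seen in a single document are dropped and function words are excluded by their tag.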

The distribution assumption implies that rare terms are more specific, similar to the frequency-based measure of tag generality used in [BKH+11]. This might not always be the case. For example, a simple search with a search engine shows that artefact and silverware are used more rarely than the term spoon, although the first two concepts are more generic. However, in this work we are interested in extracting basic-level categories as theorised in psychology [Haj13]. A basic-level category is the preferred level of naming, that is, the taxonomical level at which categories are most cognitively efficient. For example, a dime is always called a dime and not metal object, 1952 dime or 10 cents. A counterexample can be found for the length assumption as well, as the longer term inorganic matter is more general than the single word knife. In this case, however, we would simply consider as a candidate the single word matter, which is more generic than the compound term. Length and frequency of occurrence are proposed as general criteria for identifying basic-level categories in [Gre05].

A possible solution for building a domain model is to use a standard termhood measure for single-word terms and select the top ranked candidate words. However, most approaches for extracting single-word terms make use of contrastive corpora, favouring specific words that are rarely used outside of the domain. An alternative solution is to use coherence, interpreted as semantic relatedness, which has been shown to play an important role in the task of keyphrase extraction [Tur03]. We generalise Turney’s measure to quantify the coherence of a term in a domain, instead of the coherence of a

3. DOMAIN ADAPTIVE EXPERTISE TOPIC EXTRACTION THROUGH DOMAIN MODELLING

keyphrase in a document. This is done by computing the coherence of a term with the domain model, and not the coherence of pairs of terms. Because the words from the domain model are specifically selected to have high frequency, we can rely on statistics from the domain corpus alone and we do not require any external corpora.

Similar to this previous work, we choose Pointwise Mutual Information (PMI) as a measure of semantic relatedness. This measure was shown to outperform other coherence measures when applied to the task of measuring topic coherence [NLGB10]. First, we extract multi-word terms using a standard term extraction technique such as the one presented in Section 3.4.2, then we use the top ranked terms to rank the candidate words for the domain model using the following scoring function that measures domain coherence:

\[
s(\theta) = \sum_{\sigma \in \Omega} \mathrm{PMI}(\theta, \sigma) \tag{3.2}
\]

Substituting the definition of PMI, this can be rewritten as:

\[
s(\theta) = \sum_{\sigma \in \Omega} \log \left( \frac{P(\theta, \sigma)}{P(\theta) \cdot P(\sigma)} \right) \tag{3.3}
\]

where θ is a word considered as a candidate for the domain model, σ is a multi-word term, Ω is the set of extracted terms, P(θ, σ) is the probability that the word θ appears in the context of the term σ, P(θ) is the probability of appearance of θ, and P(σ) is the probability of appearance of σ. In this work context is defined as a window of words. The set Ω contains the best terms extracted by our basic term extraction method described in Section 3.4.2, but any other term extraction method can be applied in this step. Only a relatively small number of automatically extracted terms are used at this stage, because span searches are relatively expensive to compute. In our experiments we considered the top 200 ranked terms.
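The scoring function can be sketched in a few lines, under some loudly stated assumptions. The text does not specify how the probabilities are estimated, so this sketch assumes simple counts over a single token stream divided by the number of tokens, counts a co-occurrence when the word falls within w tokens on either side of a term occurrence, and treats a word and term that never co-occur as contributing 0 to the sum (to avoid log 0). The names find_term, pmi and domain_coherence are hypothetical.

```python
import math


def find_term(tokens, term):
    """Start indices where the token sequence `term` occurs in `tokens`."""
    k = len(term)
    return [i for i in range(len(tokens) - k + 1) if tokens[i:i + k] == term]


def pmi(word, term, tokens, w):
    """PMI between a word and a multi-word term, with context defined as a
    window of w tokens on each side of the term (assumed estimation)."""
    n = len(tokens)
    word_count = tokens.count(word)
    starts = find_term(tokens, term)
    if word_count == 0 or not starts:
        return 0.0
    k = len(term)
    joint = 0
    for i in starts:
        window = tokens[max(0, i - w):i] + tokens[i + k:i + k + w]
        if word in window:
            joint += 1
    if joint == 0:
        return 0.0  # assumption: no co-occurrence contributes nothing
    p_joint = joint / n
    p_word = word_count / n
    p_term = len(starts) / n
    return math.log(p_joint / (p_word * p_term))


def domain_coherence(word, terms, tokens, w):
    """s(theta): sum of PMI(theta, sigma) over all terms sigma."""
    return sum(pmi(word, term, tokens, w) for term in terms)


# Toy token stream and term list (invented for illustration).
tokens = ("the algorithm for linear programming is an algorithm "
          "used in data mining").split()
terms = [["linear", "programming"], ["data", "mining"]]
score = domain_coherence("algorithm", terms, tokens, w=3)
print(score)
```

Scanning the token stream once per term is what makes this step expensive on a real corpus, which is consistent with restricting Ω to a short list of top ranked terms.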

Algorithm 1 shows how domain coherence can be used to construct a domain model from a short list of automatically extracted terms. Context is defined as a window of words of predefined size. First, we consider as candidates those content-bearing words (i.e., nouns, verbs, adjectives) that have high distribution, that is, words mentioned in a considerable proportion of the documents in the corpus. The filtering step discards words that are mentioned in only a small number of documents. Then, each word is scored by its domain coherence, which is measured against the provided list of terms. The domain model is finally constructed by selecting the top ranked words based on their domain coherence.

Algorithm 1: The algorithm that constructs a domain model using a list of automatically extracted terms.

input : Window size w
        List of words words of size n
        List of top ranked terms terms of size t
        Size of the domain model d

output: Domain model domainModel

1 words ← filterWords(words);
2 for i ← 0 to n - 1 do
3     s ← 0;
4     for j ← 0 to t - 1 do
5         s ← s + PMI(words[i], terms[j], w);
6     end
7     scores[i] ← s;
8 end
9 domainModel ← selectTop(words, scores, d);
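As a rough illustration, Algorithm 1 could be transcribed into Python as follows. This is a sketch under assumptions rather than the implementation used in this work: the PMI function and the filtering predicate are passed in as parameters (pmi and keep are hypothetical names), and the PMI values in the usage example are invented.

```python
def build_domain_model(words, terms, w, d, pmi, keep=lambda word: True):
    """Sketch of Algorithm 1.

    words: candidate words; terms: top ranked terms; w: window size;
    d: size of the domain model; pmi(word, term, w): relatedness function;
    keep(word): filtering predicate standing in for filterWords.
    """
    # Line 1: discard candidate words that fail the filter.
    words = [word for word in words if keep(word)]
    # Lines 2-8: score each word by its summed PMI with every term.
    scores = [sum(pmi(word, term, w) for term in terms) for word in words]
    # Line 9: keep the d words with the highest domain coherence.
    ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:d]]


# Invented PMI values, used only to exercise the function.
toy_pmi = {
    ("algorithm", "linear programming"): 1.5,
    ("algorithm", "data mining"): 1.0,
    ("formula", "linear programming"): 0.6,
    ("formula", "data mining"): 0.1,
    ("system", "linear programming"): 0.4,
    ("system", "data mining"): 1.9,
}

model = build_domain_model(
    words=["algorithm", "formula", "system", "smoothness"],
    terms=["linear programming", "data mining"],
    w=5,
    d=2,
    pmi=lambda word, term, w: toy_pmi.get((word, term), 0.0),
    keep=lambda word: word != "smoothness",
)
print(model)
```

Passing pmi as a parameter keeps the sketch independent of how the probabilities are estimated from the corpus.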

For example, let us assume that an automatic method for term extraction selects the following best ranked terms from a corpus in Computer Science: linear programming, programming language, and data mining. Let us also assume that the nouns algorithm, formula, software, system, and smoothness are the content-bearing words extracted from the domain corpus. In the filtering step (line 1 in the algorithm), the word smoothness is discarded because fewer than 10% of the documents in the corpus mention it. This threshold can be adjusted depending on the corpus size and on the size of the resulting domain model. Domain coherence is then computed for each remaining word by summing its PMI with each of the three considered terms (lines 3-7). In this example, the normalised values for domain coherence are 0.95 for algorithm, 0.35 for formula, 0.68 for software, and 0.91 for system. If we choose to construct a domain model of 3 words, the domain model will be a vector with 3 elements: algorithm, software, and system. If we instead choose to construct a domain model of 2 words, the domain model will be composed of the words algorithm and system.
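The final selection step of this example can be checked with a short sketch. The scores are the normalised values quoted above; the helper name select_top is hypothetical and simply mirrors selectTop in Algorithm 1.

```python
# Normalised domain coherence values from the worked example.
scores = {"algorithm": 0.95, "formula": 0.35, "software": 0.68, "system": 0.91}


def select_top(scores, d):
    """Return the d words with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:d]


top3 = select_top(scores, 3)
top2 = select_top(scores, 2)
print(top3)
print(top2)
```

The ranking places system above software, so the 3-word model contains algorithm, system and software, and the 2-word model contains algorithm and system.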

A small sample of words from domain models built for Computer Science, Agriculture and the Biomedical domain, using our domain coherence method, is presented in Table 3.1. Additionally, a sample list of top ranked words for a domain model in the Computer Science domain is provided in Appendix 2.

Computer Science    Biomedicine    Food and Agriculture
development         mechanism      control
software            evidence       farm
framework           antibody       supply
information         molecule       food
system              system         forest

Table 3.1: Domain models extracted for different knowledge areas

We observe that a domain model contains several words that are unlikely to be used in other domains, for example software from the Computer Science domain model, or farm from the Agriculture domain model. These domain models also contain more generic words, for instance control or evidence, that are likely to appear often in many different domains. Some of the words appear in more than one model, for example system, which appears in both the Computer Science model and the Biomedicine model, although it refers to slightly different concepts in each. All of these words are likely to be used in general language. Therefore, measures based on contrastive corpora are unsuitable for extracting them, as they favour words that are often used in a domain but rarely used outside it.