
Chapter 4 Data sets and human performance

4.1.1 Vocabularies and corpora

The domain independence of the proposed method is demonstrated using three vocabularies: the agricultural thesaurus Agrovoc (1992), the Medical Subject Headings (2005), and the High Energy Physics thesaurus.1 Each is encoded using the SKOS (Simple Knowledge Organization System) format,2 an RDF-based format for defining terms and semantic relations between them that also supports multilingual thesauri. SKOS is well established and the number of knowledge structures available in this format continues to grow. With this unified encoding, any vocabulary can be easily plugged into Maui.
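To make the SKOS encoding concrete, the following is a minimal sketch of reading preferred labels, alternative labels and semantic relations from an RDF/XML fragment with Python's standard library. It is not Maui's loader: the example.org URIs and the `load_concepts` helper are assumptions made for illustration, and real SKOS files may serialize concepts differently (for example as `rdf:Description` elements with an `rdf:type`).

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the RDF and SKOS specifications.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
RDF = "{%s}" % NS["rdf"]
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: the example.org URIs are invented for this sketch.
SKOS_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/c/epidermis">
    <skos:prefLabel xml:lang="en">Epidermis</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">Epiderme</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/c/plant-tissues"/>
    <skos:related rdf:resource="http://example.org/c/peel"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/c/plant-tissues">
    <skos:prefLabel xml:lang="en">Plant tissues</skos:prefLabel>
    <skos:narrower rdf:resource="http://example.org/c/epidermis"/>
  </skos:Concept>
</rdf:RDF>"""


def load_concepts(xml_text):
    """Return a dict mapping each concept URI to its labels and relations."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for node in root.findall("skos:Concept", NS):
        entry = {"pref": {}, "alt": [], "broader": [], "narrower": [], "related": []}
        # One preferred label (descriptor) per language.
        for label in node.findall("skos:prefLabel", NS):
            entry["pref"][label.get(XML_LANG)] = label.text
        # Any number of alternative labels (non-descriptors).
        for label in node.findall("skos:altLabel", NS):
            entry["alt"].append((label.get(XML_LANG), label.text))
        # Semantic relations point at other concepts by URI.
        for relation in ("broader", "narrower", "related"):
            for link in node.findall("skos:" + relation, NS):
                entry[relation].append(link.get(RDF + "resource"))
        concepts[node.get(RDF + "about")] = entry
    return concepts


concepts = load_concepts(SKOS_XML)
for uri, entry in concepts.items():
    print(uri, entry["pref"], entry["broader"])
```

Because every vocabulary is reduced to the same label and relation fields, code written against this structure does not need to know which thesaurus it came from.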

Agrovoc thesaurus

Agrovoc is a multilingual thesaurus covering agriculture, forestry, fisheries, food and related domains (e.g. environment). It has been developed by the UN Food and Agriculture Organization (FAO), which maintains a large and well-used online document repository (1M hits per month).3 Professional indexers at FAO manually assign terms from Agrovoc to all documents in this repository.

The English Agrovoc defines over 28,000 concepts. As the example in Figure 4.1 illustrates, each has one preferred term (descriptor), and many have several alternative versions (non-descriptors), resulting in a total size of around 40,000 terms. The vocabulary has been translated into 23 languages. The first three rows of Table 4.1 below list the sizes of the English, French and Spanish versions used here.

1 http://www-library.desy.de/schlagw2.html
2 http://www.w3.org/2004/02/skos/
3 http://www.fao.org/documents/

The concepts in Agrovoc are interconnected by 83,000 semantic relations of three types: related terms (RT, is-associated-with), which is a bi-directional relation, and broader (BT, has-parent) and narrower terms (NT, has-child), which are inverses of each other. The BT and NT links build a hierarchical structure that has seven specificity levels.
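The relationship between BT and NT links, and the notion of a specificity level, can be sketched as follows. The BT chain is the one shown for Epidermis in Figure 4.1. Treating Plant anatomy as a top-level term, and taking the shortest path to the top when a term has several parents, are assumptions made for this illustration only.

```python
from collections import defaultdict

# BT (has-parent) links from the Epidermis entry in Figure 4.1.
# Assumption: "Plant anatomy" has no parent in this toy fragment.
broader = {
    "Epidermis": ["Plant tissues"],
    "Plant tissues": ["Plant anatomy"],
}


def invert(links):
    """Derive NT (has-child) links as the inverse of the BT links."""
    narrower = defaultdict(list)
    for child, parents in links.items():
        for parent in parents:
            narrower[parent].append(child)
    return dict(narrower)


def level(term, links):
    """Specificity level: 1 for a term without parents, otherwise one more
    than its shallowest parent. Assumes the hierarchy contains no cycles."""
    parents = links.get(term, [])
    if not parents:
        return 1
    return 1 + min(level(parent, links) for parent in parents)


narrower = invert(broader)
print(narrower)
print(level("Epidermis", broader))
```

Storing only one direction and deriving the other keeps the two link types consistent by construction.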

Note that all thesauri used in this thesis are domain specific and rarely contain ambiguous terms. In Agrovoc, such terms are marked by brackets, e.g. Vanilla (genus) and Vanilla (spice). Where necessary, a scope note describes the intended meaning. There are only 400 (less than 1.5%) such ambiguous terms among 28,170 English descriptors. A similar picture is observed for French and Spanish (see the last column in Table 4.1).

Agricultural document collections

The first corpus comprises 780 full-text documents selected randomly from the FAO’s repository, referred to below as FAO-780. The documents average 30,800 words (a total of 24 million), ranging from 1,200 to 257,000 words. The FAO indexers have assigned an average of 8 Agrovoc descriptors to each document, ranging from 2 to 23. This total of 6,225 term assignments includes 2,187 different terms.

Figure 4.1 (example Agrovoc entry):

English Descriptor    Epidermis
Scope Note            Of plants; for the epidermis of animals use SKIN
Broader Terms         BT1 Plant tissues
                      BT2 Plant anatomy
Narrower Terms        NT1 Plant cuticle
                      NT2 Plant hairs
                      NT3 Root hairs
                      NT2 Stomata
Related Term          RT Peel
French Descriptor     Epiderme
Spanish Descriptor    Epidermis

Terms appearing in FAO-780’s topic sets cover only 8% of Agrovoc’s descriptors. This means that classification-based approaches described in Section 3.1.1 would create models for only this tiny subset of the thesaurus, and other terms would never be assigned to new documents. Maui learns the properties of typical topics, as opposed to properties of specific topics, and can potentially assign any vocabulary term to a new document, regardless of whether it ever appeared as a topic in the training set.
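The FAO-780 figures quoted above can be reproduced with a few lines of arithmetic. This is only a consistency check on the numbers stated in the text (780 documents, 30,800 words on average, 6225 assignments, 2187 distinct terms, 28,170 English descriptors); the variable names are illustrative.

```python
# Figures quoted in the text for FAO-780 and the English Agrovoc.
documents = 780
avg_words = 30_800
assignments = 6_225
distinct_terms = 2_187
descriptors = 28_170

total_words = documents * avg_words           # about 24 million words
avg_topics = assignments / documents          # about 8 descriptors per document
coverage = distinct_terms / descriptors       # share of descriptors ever used as a topic

print(f"total words:        {total_words:,}")
print(f"topics per document: {avg_topics:.2f}")
print(f"descriptor coverage: {coverage:.1%}")
```

The coverage value is what limits a classification-based approach: it can only ever predict the terms that occur in its training topics.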

The second corpus with 30 new agricultural documents (FAO-30) is used to determine the inter-indexer consistency of professional indexers. Each document has been independently indexed by 6 people, with an average of 10.4 Agrovoc terms per set, ranging from 4 to 52 terms. This dataset has been created at the FAO specifically for the experiments in this thesis. Section 4.1.2 analyzes the indexing consistency of these professionals, and Section 7.2.5 investigates the consistency of the algorithm with the indexers.

To test Maui’s language independence, French and Spanish collections were extracted from the FAO repository. Manually indexed documents in languages other than English are rare at FAO; thus these collections are rather small: 47 Spanish documents, averaging 42,500 words; and 60 French documents, averaging 22,400 words. The documents had been indexed with English terms, which we mapped to the equivalent Spanish and French terms using Agrovoc. The resulting sets contained 2 to 35 topics each, an average of 10.2 topics for Spanish and 11.4 for French documents.
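Mapping English topics to their French or Spanish equivalents amounts to a lookup through the shared concept. The sketch below assumes a dictionary of per-language descriptors keyed by a concept identifier; the identifier `c_epidermis` and the function `translate_topics` are hypothetical, and the labels are those shown in Figure 4.1. It is not the procedure actually used to build the collections.

```python
# Per-concept descriptors in each language (labels from Figure 4.1).
# The concept identifier is invented for this illustration.
labels = {
    "c_epidermis": {"en": "Epidermis", "fr": "Epiderme", "es": "Epidermis"},
}


def translate_topics(topics_en, labels, target):
    """Map English descriptors to the target language via their concept.

    Returns the translated topics and the English topics that could not
    be mapped (unknown term, or no descriptor in the target language).
    """
    by_english = {entry["en"].lower(): cid for cid, entry in labels.items()}
    translated, unmapped = [], []
    for topic in topics_en:
        cid = by_english.get(topic.lower())
        if cid is not None and target in labels[cid]:
            translated.append(labels[cid][target])
        else:
            unmapped.append(topic)
    return translated, unmapped


print(translate_topics(["Epidermis", "Unknown term"], labels, "fr"))
```

Returning the unmapped topics separately makes it easy to see how many assignments are lost when a vocabulary version lacks a descriptor.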

Medical Subject Headings thesaurus

The Medical Subject Headings (MeSH) thesaurus was discussed in Section 2.1. The U.S. National Library of Medicine (NLM) developed this vocabulary for indexing the PubMed repository. The SKOS version was provided by van Assem et al. (2006).4

Table 4.1

                   Total terms   Descriptors   Non-descriptors   Ambiguous
English Agrovoc         38,200        28,170            10,030         400
French Agrovoc          37,350        28,160             9,190         440
Spanish Agrovoc         40,640        28,160            12,480         620
MeSH                   141,220        23,890           117,330         380
HEP                     16,460        16,000               460          15

MeSH contains 24,000 concepts organized into a hierarchy via 32,000 BT/NT links. Descriptors in MeSH are called subject headings and are usually accompanied by several non-descriptors (entry terms). Whereas Agrovoc only defines synonymous non-descriptors, MeSH also includes spelling and formatting variants, resulting in a total of 141,000 terms (see Table 4.1). This much larger vocabulary tests not only Maui’s domain independence but also its scalability.

The experimental corpus, provided by the NLM Indexing Initiative (Aronson et al., 2000; Gay et al., 2005; see Section 3.1.2), consists of 500 documents. This collection, NLM-500, is heterogeneous, with lengths varying from 440 to 24,500 words (4,500 on average) and the number of assigned topics ranging from 2 to 30 (15 on average).

High Energy Physics thesaurus

For the physics domain, the Deutsches Elektronen-Synchrotron developed the High Energy Physics (HEP) thesaurus. The European Organization for Nuclear Research uses it for indexing the contents of the CERN Document Server.5 This thesaurus is the smallest of the three used in this thesis, listing 16,000 concepts and only a few non-descriptors. Besides 500 broader, narrower and related links, HEP defines a semantic relation called Composite/CompositeOf. For example, the concept Einstein equation: solution has two CompositeOf relations: Einstein equation and Solution. In total, 15,300 such links are defined.

The experimental corpus (CERN-290) comprises 290 random documents from the CERN Document Server, each on average 6,300 words long. The topic sets contain 7 terms on average.

Size statistics

Table 4.1 summarizes the thesauri described in this section. Their sizes range from 16,460 (HEP) to 141,220 terms (MeSH). Some define a wide range of non-descriptors (MeSH); others only a few (HEP). Little ambiguity was observed: less than 2% in each case. Most ambiguous terms have just two meanings each.

4 http://thesauri.cs.vu.nl/
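The size figures in Table 4.1 can be cross-checked with a short script. The sketch below only re-uses the numbers from the table; it confirms that each total is the sum of descriptors and non-descriptors, finds the smallest and largest vocabulary, and computes the number of non-descriptors per descriptor as one possible way to quantify how many alternative terms each thesaurus defines. The dictionary layout and names are illustrative.

```python
# Table 4.1: (total terms, descriptors, non-descriptors, ambiguous).
table = {
    "English Agrovoc": (38_200, 28_170, 10_030, 400),
    "French Agrovoc": (37_350, 28_160, 9_190, 440),
    "Spanish Agrovoc": (40_640, 28_160, 12_480, 620),
    "MeSH": (141_220, 23_890, 117_330, 380),
    "HEP": (16_460, 16_000, 460, 15),
}

# Every total should equal descriptors plus non-descriptors.
consistent = all(d + nd == total for total, d, nd, _ in table.values())

# Smallest and largest vocabulary by total number of terms.
smallest = min(table, key=lambda name: table[name][0])
largest = max(table, key=lambda name: table[name][0])

# Non-descriptors per descriptor for each vocabulary.
ratios = {name: nd / d for name, (_, d, nd, _) in table.items()}

print("rows consistent:", consistent)
print("smallest:", smallest, "largest:", largest)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} non-descriptors per descriptor")
```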

Figure 4.2 compares the distribution of term lengths (in words) in each thesaurus and in the topic sets manually assigned to the corresponding collections. Agricultural terms are significantly shorter than both medical and physics ones. Two-word terms make up the majority of terms in all thesauri and corpora, and across all corpora indexers tend to choose the shorter terms in the vocabulary as topics. In NLM-500, single words are much more common than in MeSH, whereas in CERN-290 two-word terms are the most popular topics. Although descriptors with 4 or more words are more common in the HEP thesaurus than in any other vocabulary, they are rarely chosen as topics. The maximum term length in the vocabularies ranges from 7 (Agrovoc) up to 15 words (MeSH). The maximum length in corpora is 5 words in FAO-780 and CERN-290 and 6 words in NLM-500.
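A distribution of term lengths of the kind compared in Figure 4.2 can be computed as sketched below. The sample is just the handful of Agrovoc terms that appear in Figure 4.1, not a real vocabulary or corpus, and counting words by splitting on whitespace is an assumption about how length is measured.

```python
from collections import Counter


def length_distribution(terms):
    """Return {length in words: fraction of terms with that length}."""
    counts = Counter(len(term.split()) for term in terms)
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


# Toy sample: the terms shown in the Epidermis entry of Figure 4.1.
sample = [
    "Epidermis", "Plant tissues", "Plant anatomy", "Plant cuticle",
    "Plant hairs", "Root hairs", "Stomata", "Peel",
]

distribution = length_distribution(sample)
print(distribution)
print("maximum length:", max(distribution))
```

Running the same function once over a thesaurus and once over the topics assigned to its collection would give the two distributions to compare.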

In document Human-competitive automatic topic indexing (Page 77-81)