4.4 Publishing a Domain-specific Thesaurus with Linked Data

4.4.2 Modelling the Thesaurus for the Social Sciences (TheSoz) Using

The transformation of a thesaurus into the SKOS format has been split into three steps, following the structured method introduced in [vAMMS06]: (1) analysis of the structure, the extent and the complexity of the thesaurus, including contained terms and relations between terms, (2) a mapping of all detected terms and relations to adequate SKOS classes and properties and (3) the technical conversion of the thesaurus according to the defined mapping. This method aims to ensure the quality and utility of the resulting conversion and focuses on two goals: the interoperability and completeness of the converted thesauri.

Thesauri are essentially collections of terms that stand in specific relations to each other, typically hierarchical or associative ones. While thesauri are term-centric instruments that have traditionally been designed and maintained in libraries over decades, the central design aspect of SKOS is a concept-centric view of the thesaurus or classification system. In SKOS, a concept (skos:Concept) represents a unit of meaning rather than a single term, so each concept can hold more than one label. Exactly one label (per language) serves as the preferred label (skos:prefLabel), while additional labels can be included as alternative or hidden labels (skos:altLabel and skos:hiddenLabel). These labels describe, e.g. synonyms, additional expressions or unusual spellings of the preferred term, which are indicated as non-preferred terms in thesauri.

This design difference becomes an obstacle when a thesaurus records multiple relationships between preferred and non-preferred terms for what is a single concept. These relationships have to be consolidated and rearranged as properties of one SKOS concept. Figure 4.4 presents a concept-based term relation in SKOS, where a single concept consists of three terms. One of the terms is the preferred term that is recommended for use; the other two are non-preferred terms. Most traditional thesauri have no such concept-based representation. Instead, the terms and their type (preferred or non-preferred) are organized via direct relations between the terms, e.g. USE and USED FOR (see Figure 4.4).
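The consolidation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout and the second non-preferred term ‘price setting’ are hypothetical; only the pair ‘pricing’ / ‘pricing policy’ is taken from this chapter.

```python
from collections import defaultdict


def group_terms_into_concepts(use_relations):
    """Collapse term-level USE relations into concept-level label sets.

    use_relations: iterable of (non_preferred_term, preferred_term) pairs,
    read as "for non_preferred_term, USE preferred_term".
    Returns a dict mapping each preferred term to a record with one
    prefLabel and a sorted list of altLabels.
    """
    alt_labels = defaultdict(set)
    for non_preferred, preferred in use_relations:
        alt_labels[preferred].add(non_preferred)
    return {
        preferred: {"prefLabel": preferred, "altLabel": sorted(alts)}
        for preferred, alts in alt_labels.items()
    }


# "price setting" is a hypothetical second non-preferred term.
relations = [("pricing", "pricing policy"), ("price setting", "pricing policy")]
concepts = group_terms_into_concepts(relations)
print(concepts["pricing policy"])
```

The term-to-term links disappear in this representation; what remains is one concept record with a preferred label and its alternatives, which is the information that plain skos:prefLabel and skos:altLabel can carry.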


| URI | Definition |
|---|---|
| skos:Concept | The main class of SKOS. It defines a semantic concept, thought or idea. |
| skos:ConceptScheme | Instances of skos:Concept can be organized in a skos:ConceptScheme, which is an aggregation of multiple concepts. There can be relations between multiple skos:Concept instances inside one skos:ConceptScheme or between different skos:ConceptScheme instances. Regarding the representation of Knowledge Organization Systems, a skos:ConceptScheme typically comprises one terminology system or thesaurus. |
| skos:inScheme | Describes the relation of a skos:Concept to a skos:ConceptScheme. |
| skos:prefLabel | Describes the preferred lexical term for a specific skos:Concept. Each skos:Concept can hold only one skos:prefLabel (per language). |
| skos:altLabel | Alternative labels can be defined in order to provide alternative or non-preferred terms of a skos:Concept. |
| skos:hiddenLabel | Describes, e.g. unusual spellings or terms for a skos:prefLabel. |
| skos:notation | Connects a skos:Concept with entries from a notation system. |
| skos:note | Enables the inclusion of notes associated with a skos:Concept. Sub-properties of skos:note include, e.g. skos:editorialNote, skos:example, skos:definition and skos:scopeNote. |
| skos:broader, skos:narrower and skos:related | These semantic relations can be defined between instances of skos:Concept. They enable the construction of hierarchies of concepts. |
| skos:broadMatch, skos:closeMatch, skos:exactMatch, skos:narrowMatch and skos:relatedMatch | The mapping properties of SKOS are similar to the semantic relations described above. In contrast to them, the mapping properties are applied between skos:Concept instances of different skos:ConceptScheme instances. |

Table 4.4: Overview of SKOS classes and properties.

[Figure 4.4 (image not recovered): “pricing” USE “pricing policy”; “pricing policy” UF “pricing”.]

4 Publication of Linked Open Social Science Data

| URI | Definition |
|---|---|
| skosxl:Label | This class contains lexical entities in a plain literal format. |
| skosxl:literalForm | This property describes the precise literal of a skosxl:Label. There can only be one skosxl:literalForm per skosxl:Label instance. This restriction also applies to different language versions of a lexical entity, each of which has to be modelled as a separate skosxl:Label. |
| skosxl:prefLabel, skosxl:altLabel and skosxl:hiddenLabel | These properties are treated analogously to skos:prefLabel, skos:altLabel and skos:hiddenLabel and refer to a skosxl:Label instance, which contains the stated label as skosxl:literalForm. |
| skosxl:labelRelation | Enables the modelling of specific relationships between skosxl:Label instances. This property allows the representation of complex relations between literals and terms. |

Table 4.5: Classes and properties of SKOS-XL.
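To make the difference to plain SKOS labels concrete, the following Python sketch builds the triples that attach a label to a concept through a skosxl:Label resource. It is a minimal illustration, not the conversion used for the TheSoz: the helper name, the tuple-based triple representation and the example.org URIs are assumptions; the SKOS-XL and RDF namespaces are the standard W3C ones.

```python
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LABEL_KINDS = ("prefLabel", "altLabel", "hiddenLabel")


def xl_label_triples(concept_uri, label_uri, text, lang, kind="prefLabel"):
    """Return the three triples that attach one SKOS-XL label to a concept.

    The label is a resource of its own (label_uri), so other statements,
    such as label relations, can later refer to it.
    """
    if kind not in LABEL_KINDS:
        raise ValueError(f"unknown label kind: {kind!r}")
    return [
        (concept_uri, SKOSXL + kind, label_uri),
        (label_uri, RDF_TYPE, SKOSXL + "Label"),
        # Exactly one literal form per label; (text, lang) stands in for
        # a language-tagged literal.
        (label_uri, SKOSXL + "literalForm", (text, lang)),
    ]


# Hypothetical URIs, used only for this example.
concept = "http://example.org/concept/1"
triples = (
    xl_label_triples(concept, "http://example.org/term/1", "pricing policy", "en")
    + xl_label_triples(concept, "http://example.org/term/2", "pricing", "en", kind="altLabel")
)
for triple in triples:
    print(triple)
```

Because every label has its own URI, a relation such as skosxl:labelRelation can link the two labels directly, which is not possible when labels are plain literals.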

Thesaurus Analysis

The basis for a transformation of a thesaurus to the SKOS format is a detailed analysis of the thesaurus. Attention is paid not only to terms and the existing associative and hierarchical relations between them, but also to the general structure and design of the thesaurus, e.g. the existence of additional classification systems or how far the examined thesaurus conforms to established ISO norms. The Thesaurus for the Social Sciences contains about 12,000 keywords, of which more than 8,000 are Descriptors (authorized keywords) and about 4,000 are Non-Descriptors. The relationships between these keywords are expressed as broader, narrower or related terms, and there are also USE (see Figure 4.4 above) and USE COMBINATION (see Figure 4.5 below) relations and their counterparts (USED FOR and USED FOR COMBINATION). Additionally, a classification hierarchy is provided and each thesaurus term is assigned to one or more classification terms.

The TheSoz contains a special type of non-descriptor called AD (Alternative Non-Descriptor) that deviates from the international standard norms for thesauri. An alternative non-descriptor in the TheSoz is used to describe ambiguities in relations between terms. Such terms hold more than one USE and/or USE COMBINATION relation at the same time. There are about 216 such AD terms in the TheSoz.

Figure 4.5: Example of a USE COMBINATION relationship.

Example: The term ‘committee’, which is classified as an AD term, holds USE relations to the preferred terms ‘working group’, ‘parliamentary committee’, ‘Wirtschaftsausschuss’ (no English translation available; means the ‘Standing Committee on Industry and Trade’) and ‘advisory panel’ at the same time. Additionally, it contains the USE COMBINATION relation to the combined use of the terms ‘product’ and ‘quality’. Terms of the type AD describe generic and ambiguous terms that have different concrete meanings in specialized sub-contexts. This is expressed through the use of more than one USE and/or USE COMBINATION relation for only one term. In this case, it means that the term ‘committee’ is semantically so general and ambiguous that it is recommended to use a more precise term to describe the intended semantics. Figure 4.6 depicts this example of an AD term.

Figure 4.6: Example of an Alternative Non-Descriptor (AD).

An alternative way to represent ambiguities is to transform the AD term into multiple non-descriptors that are extended with specific context information either in the label itself (e.g. ‘committee (working group)’) or in a note, e.g. ‘used in the context of a working group’. This solution, however, prevents the ambiguity from being detected and processed automatically because the terms would be identified as different ones.
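The defining property of an AD term, holding more than one USE and/or USE COMBINATION relation, is easy to check mechanically once the relations are available as data. The Python sketch below illustrates this with the examples from this chapter; the tuple layout and the function name are assumptions and do not reflect the actual source format of the TheSoz.

```python
from collections import Counter


def find_alternative_non_descriptors(relations):
    """Return the terms that hold more than one USE / USE COMBINATION relation.

    relations: iterable of (term, relation_type, targets), where targets is a
    tuple with one preferred term (USE) or several (USE COMBINATION).
    """
    counts = Counter(term for term, _relation_type, _targets in relations)
    return sorted(term for term, n in counts.items() if n > 1)


relations = [
    ("committee", "USE", ("working group",)),
    ("committee", "USE", ("parliamentary committee",)),
    ("committee", "USE", ("Wirtschaftsausschuss",)),
    ("committee", "USE", ("advisory panel",)),
    ("committee", "USE COMBINATION", ("product", "quality")),
    ("pricing", "USE", ("pricing policy",)),
    ("university ranking", "USE COMBINATION", ("university", "ranking")),
]
print(find_alternative_non_descriptors(relations))
```

If ‘committee’ were instead split into several context-specific non-descriptors, each would carry a single relation and this check would no longer find the ambiguity, which is the drawback noted above.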

Mapping to SKOS

For most of the thesaurus items, i.e. terms and relations, adequate SKOS properties and classes can be identified easily because TheSoz conforms broadly to the standard norms for thesauri. Problems occur when mapping special data items and/or relations that do not conform to thesaurus standards, such as the AD terms of the TheSoz described in the previous subsection. Since SKOS is based on RDF, however, additional relations can be defined without much effort. A precise mapping to SKOS is therefore more complex than a simple mapping [ZS09a]. In order to respect the concept-based structure of SKOS without risking the loss of relevant relations between preferred and non-preferred terms, classes and properties of SKOS-XL have been used [MZS10a]. SKOS-XL has been used for the same reason in the conversion of the EUROVOC thesaurus [De 09]. SKOS-XL was developed explicitly for representing lexical information and makes it possible to model relations between multiple terms inside one SKOS concept. These label relations allow custom relations to be defined between lexical labels, such as simple equivalence relationships like USE or compound equivalence relationships like USE COMBINATION and their counterparts, which are necessary components of the TheSoz.

Table 4.6 presents the mapping from terms and relations of the TheSoz to adequate SKOS classes and properties. As described above, custom classes and properties have been defined in order to represent additional semantics as well as complex relations of the TheSoz.

Extensions had to be defined in order to represent complex and relevant relations in the TheSoz correctly. They are described in detail using RDF Schema in order to ensure further processing and interoperability with other data sets on the web. Table 4.7 provides an overview of the SKOS extensions defined for the TheSoz.

Figure 4.7 outlines the USE/USED FOR term relations within a concept, where the term ‘pricing policy’ is the preferred one and is recommended for use (see the thesoz:use and thesoz:usedFor relations in the figure) instead of the non-preferred term (depicted as well, as skosxl:altLabel). These relations could also be modelled by only using skos:altLabel and skos:prefLabel, but the addition of custom relations provides more semantic information about the relationship than skosxl:prefLabel and skosxl:altLabel alone. Furthermore, it builds the basis for distinguishing USE and USED FOR relations from USE COMBINATION and USED FOR COMBINATION relations, which are introduced below. Distinguishing between these relations is necessary because a single term of the TheSoz can hold several such relations at once (see the example of the AD term). A representation of the relations USE COMBINATION and USED FOR COMBINATION is depicted in Figure 4.8 below.

In this example, the term ‘university ranking’ is a non-preferred term in the TheSoz, i.e. it is not recommended to use this term. Instead, it is advised that a combination of the terms ‘university’ and ‘ranking’ be used to index documents because both are preferred terms. In order to represent this special relationship in SKOS, the property skosxl:labelRelation is used to define custom semantic relations. Thus, the term ‘university ranking’ gets the relation thesoz:compoundNonPreferredTerm and the two preferred terms are extended by the relation thesoz:preferredTermComponent.


| Thesaurus Element | Description | SKOS Class / Property |
|---|---|---|
| DD | Descriptor | skosxl:prefLabel |
| ND | Non-Descriptor | skosxl:altLabel |
| AD | Alternative Non-Descriptor | skosxl:altLabel |
| NT | Narrower Term | skos:narrower |
| BT | Broader Term | skos:broader |
| RT | Related Term | skos:related |
| USE | Use (Example: For X, use Y) | thesoz:use in conjunction with the class thesoz:EquivalenceRelationship |
| UF | Used For (Example: Y is used for X) | thesoz:usedFor in conjunction with the class thesoz:EquivalenceRelationship |
| USK | Use Combination (Example: For X, use Y and Z in combination) | thesoz:compoundNonPreferredTerm and thesoz:preferredTermComponent in conjunction with the class thesoz:CompoundEquivalence |
| UFK | Used For Combination (Example: Use Y in combination with Z for X) | thesoz:compoundNonPreferredTerm and thesoz:preferredTermComponent in conjunction with the class thesoz:CompoundEquivalence |
| translation | Translation of the terms via language tags | thesoz:hasTranslation and thesoz:isTranslationOf |
| scope | Scope Notes | skos:scopeNote |
| notationcode | Numerical code of the systematic classification, to which terms are assigned | skos:notation |

Table 4.6: Mapping of TheSoz elements to classes and properties of SKOS.

All three relations point to an instance of a new class, thesoz:CompoundEquivalence. In contrast to the USE relation in the former example (see Figure 4.7), it is now explicit that both preferred terms have to be used together, whereas this information is omitted in a single USE relation.
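A compound equivalence of this kind can be sketched as a small set of triples. The Python example below is an illustration under several assumptions: the direction of the properties (from the thesoz:CompoundEquivalence node to the labels, with thesoz:isPartOfCompoundEquivalence pointing back) is a guess, since the text and the extension overview leave it open; the example.org URIs and the helper name are hypothetical. Only the property and class names and the extension namespace are taken from this chapter.

```python
THESOZ = "http://lod.gesis.org/thesoz/ext/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def compound_equivalence_triples(node, non_preferred, components):
    """Build triples for one USE COMBINATION relation.

    node: URI of the thesoz:CompoundEquivalence resource
    non_preferred: URI of the non-preferred label
    components: URIs of the preferred labels that must be used together
    Assumed direction: node -> label, plus a back-link from each label.
    """
    if len(components) < 2:
        raise ValueError("a compound equivalence needs at least two components")
    back_link = THESOZ + "isPartOfCompoundEquivalence"
    triples = [
        (node, RDF_TYPE, THESOZ + "CompoundEquivalence"),
        (node, THESOZ + "compoundNonPreferredTerm", non_preferred),
        (non_preferred, back_link, node),
    ]
    for component in components:
        triples.append((node, THESOZ + "preferredTermComponent", component))
        triples.append((component, back_link, node))
    return triples


# Hypothetical label URIs for 'university ranking', 'university' and 'ranking'.
triples = compound_equivalence_triples(
    "http://example.org/compound/1",
    "http://example.org/term/university-ranking",
    ["http://example.org/term/university", "http://example.org/term/ranking"],
)
print(len(triples))
```

Grouping both components under one node is what records that they belong together; two independent USE relations from ‘university ranking’ would not carry that information.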

Furthermore, the label relations of SKOS-XL allow a consistent and correct representation of the alternative non-descriptors of the TheSoz, where a non-preferred term holds relations to multiple preferred terms. A modelling example of such a term is depicted below in Figure 4.9. The term ‘committee’, which is classified as an AD term, holds USE relations to the preferred terms ‘working group’, ‘parliamentary committee’, ‘Wirtschaftsausschuss’ (i.e. ‘Standing Committee on Industry and Trade’) and ‘advisory panel’ at the same time. Additionally, it contains the USE COMBINATION relation to the combined use of the terms ‘product’ and ‘quality’. These multiple and complex relations cannot be modelled without custom relations based on SKOS-XL because the ambiguities of the term would be lost.

| Extension | Description |
|---|---|
| thesoz:Classification | Element of the classification hierarchy of the TheSoz, which is defined as a subclass of skos:Concept. |
| thesoz:Descriptor | Descriptors of the TheSoz represented as a concept and defined as a subclass of skos:Concept. |
| thesoz:EquivalenceRelationship | An equivalence relationship between two terms, where the terms are assigned via the thesoz:use and thesoz:usedFor properties. This is a subclass of skosxl:Label. |
| thesoz:CompoundEquivalence | A compound equivalence between terms, used for constructing USE COMBINATION and USED FOR COMBINATION relations between terms. The non-preferred term is assigned via the compoundNonPreferredTerm property. The preferred terms are modelled via the preferredTermComponent property. This is a subclass of skosxl:Label. |
| thesoz:use | Use relation, which is defined as a subproperty of skosxl:labelRelation. |
| thesoz:usedFor | Used for relation, which is defined as a subproperty of skosxl:labelRelation. |
| thesoz:hasTranslation | Relation between different languages of a term, which is defined as a subproperty of skosxl:labelRelation. |
| thesoz:isTranslationOf | Inverse property of thesoz:hasTranslation. |
| thesoz:preferredTermComponent | A preferred term as a component for a USE COMBINATION or USED FOR COMBINATION relation. This property is defined as a subproperty of skosxl:labelRelation. |
| thesoz:compoundNonPreferredTerm | The non-preferred term as a component for a USE COMBINATION or USED FOR COMBINATION relation. This property is defined as a subproperty of skosxl:labelRelation. |
| thesoz:isPartOfEquivalenceRelationship | This property serves as counterpart for thesoz:use and thesoz:usedFor. |
| thesoz:isPartOfCompoundEquivalence | This property serves as counterpart for thesoz:preferredTermComponent and thesoz:compoundNonPreferredTerm. |

Table 4.7: SKOS extensions for TheSoz.

Figure 4.7: Example of a USE/USED FOR relation using SKOS extensions.

The classification hierarchy of the TheSoz can be mapped to SKOS without further problems (see Table 4.6). In order to distinguish between concepts of terms and concepts of classification terms, the classes thesoz:Descriptor and thesoz:Classification have been defined as subclasses of skos:Concept. The numerical code of each classification term, which appears in the URI as well as in skos:notation, is the same code that is included in the skos:notation of each concept containing descriptors (see Table 4.6). By referencing these notation codes via URIs, a connection between descriptors and their corresponding classification terms is established. Each classification element holds all associated thesaurus terms as narrower concepts via the property skos:narrower. Conversely, each thesaurus term holds the associated classification element as a broader concept via the property skos:broader. Multiple assignments are possible, i.e. a term can be assigned to multiple classification elements. In addition, the hierarchy of the classification terms themselves, which is denoted as child and parent relations in the source, is modelled by skos:narrower and skos:broader relations as well.
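The linking between descriptor concepts and classification elements described above can be sketched as follows. The Python example assumes that each concept is available together with its list of notation codes; the function name, the input layout, the concept URI and the codes ‘1.2.03’ and ‘2.1.10’ are hypothetical. The classification base URI is the one given for the TheSoz in this chapter, and the SKOS namespace is the standard one.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
CLASSIFICATION_BASE = "http://lod.gesis.org/thesoz/classification/"


def classification_links(concept_notations):
    """Link descriptor concepts to classification elements in both directions.

    concept_notations: dict mapping a concept URI to the list of
    classification codes found in its skos:notation values.
    """
    triples = []
    for concept, codes in concept_notations.items():
        for code in codes:
            classification = CLASSIFICATION_BASE + code
            # The classification element is the broader concept ...
            triples.append((concept, SKOS + "broader", classification))
            # ... and holds the descriptor concept as a narrower one.
            triples.append((classification, SKOS + "narrower", concept))
    return triples


# Hypothetical concept URI and codes; one term assigned to two elements.
links = classification_links(
    {"http://example.org/concept/1": ["1.2.03", "2.1.10"]}
)
for triple in links:
    print(triple)
```

A term assigned to several classification elements simply yields one pair of triples per assignment.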

Figure 4.8: Example for a USE COMBINATION relation using SKOS extensions.

Technical Conversion

Based on the defined mapping, the conversion program has been developed. This is typically a script that is executed on the thesaurus in question. In the case of the TheSoz, the technical conversion is carried out by XSL transformations. The original digital format of the TheSoz, which was already encoded in XML, was converted to SKOS in RDF/XML format. The use of XSLT makes it easy to adjust or extend the mapping in case of later revisions or to implement additional mappings. In addition to the mapping, each defined concept as well as each term received its own URI, which provides a persistent and unique identification. This is very important for reuse and for links on the web, e.g. links from and to other data sets. All URIs are defined in the context path http://lod.gesis.org/thesoz/, which serves as the base URI. The URI has been chosen according to the naming conventions of GESIS web addresses and in order to leave room for the publication of further data sets as Linked Data. The namespace of the custom classes and properties is defined at http://lod.gesis.org/thesoz/ext/ and is abbreviated by the prefix thesoz. The SKOS version of the thesaurus contains three types of URIs: one for the terms, i.e. the descriptors and non-descriptors, one for the concepts summarizing descriptors and non-descriptors, and one for the labels of the classification hierarchy.

• URI scheme for Descriptors: http://lod.gesis.org/thesoz/concept/########

• URI scheme for Terms: http://lod.gesis.org/thesoz/term/########

• URI scheme for Classification Terms: http://lod.gesis.org/thesoz/classification/#.#.##

Figure 4.9: Example of an AD term using SKOS extensions.

This makes it easy to distinguish between descriptors and classification terms from the concept URI alone. After the transformation process, the resulting SKOS version of the TheSoz has been tested and validated by various established validation services for RDF and SKOS. It is available via a SPARQL endpoint, as an HTML representation and as a dump file in the RDF/XML and RDF/Turtle formats.
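Because the three URI schemes differ in their first path segment, the kind of a resource can be read off its URI. The following Python sketch shows one way to do this; the function name, the returned category strings and the identifier ‘10000001’ are hypothetical, while the base URI and the path segments are the ones listed above.

```python
BASE = "http://lod.gesis.org/thesoz/"

# First path segment after the base URI -> kind of resource.
KINDS = {
    "concept": "descriptor concept",
    "term": "term",
    "classification": "classification term",
    "ext": "extension vocabulary",
}


def uri_kind(uri):
    """Classify a TheSoz URI by the path segment that follows the base URI."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a TheSoz URI: {uri}")
    segment = uri[len(BASE):].split("/", 1)[0]
    if segment not in KINDS:
        raise ValueError(f"unknown URI scheme: {segment!r}")
    return KINDS[segment]


# The identifiers below are made up for illustration.
print(uri_kind(BASE + "concept/10000001"))
print(uri_kind(BASE + "classification/1.2.03"))
```

No lookup in the data set is needed for this distinction, which is convenient when following links from other data sets.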

4.4.3 Establishing Links to Other Thesauri Using Semi-automatically Link