6.2 Material and Methods

6.2.3 Feature Engineering

Similar to the existing approaches for ontology evolution prediction (see Section 6.1), our model takes into account intrinsic and structural information extracted from the concept and its neighbourhood (indicated by “I” and “S” in Table 6.2 respectively) and Web information obtained after querying data portals such as PubMed (“E” in Table 6.2). However, our approach also deals with temporal information obtained by analysing the history of the considered ontology (“T” in Table 6.2), as well as semantic information obtained from the UMLS (“SW” in Table 6.2). There are 17 features defined in Table 6.2 that can be grouped as follows:

• Temporal features, included in the evaluation after observing their impact on semantic annotations [Cardoso et al., 2016]. We noticed that, if a concept is part of an unstable region, i.e. directly surrounded by concepts that change frequently over time, this concept is also more likely to change. We therefore want to verify whether this feature also plays a role in the identification of concepts requiring revision. According to our formalism, given a concept ct, the two features dealing with temporal aspects form the Temp(ct) context.

• Background knowledge (BK), materialized by external ontologies. This information is potentially relevant for ontology maintenance tasks [Sabou et al., 2008]. In this sense, we used background knowledge to generate new features, evaluating their relevance for our identification model. For instance, we evaluated whether high similarity between the analysed concept and the siblings of matched concepts from other ontologies would indicate a trend for evolution. In this work, similarity was obtained by measuring the cosine similarity between the attribute values of ct and those of the corresponding concepts in the background knowledge. The set of nine features dealing with background knowledge (“E” and “SW” in Table 6.2) made up the Rel(ct) context.

• Structural information, represented by the Struct(ct) context, denoted characteristics linked to the description of ct, like the number of attributes of ct, as well as semantic information about the super, subconcepts and siblings of ct. These features are labelled with “S” and “I” in Table 6.2.
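The similarity features built from background knowledge can be illustrated with a minimal sketch. The text specifies cosine similarity between attribute values but not how values are turned into vectors, so the lowercase bag-of-words tokenisation below is an assumption, as are the function names `cosine` and `similarity_features` and the choice to average over the same value pairs used for the maximum.

```python
import math
import re
from collections import Counter
from itertools import product


def tokens(text):
    """Bag-of-words vector of a string (assumed tokenisation: lowercase alphanumerics)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two attribute values seen as token-count vectors."""
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty value shares nothing with any other value
    return dot / (norm_a * norm_b)


def similarity_features(concept_values, bk_values):
    """Return (max, average) cosine similarity over the Cartesian product
    of the concept's attribute values and those of its BK counterparts."""
    scores = [cosine(a, b) for a, b in product(concept_values, bk_values)]
    if not scores:
        return 0.0, 0.0  # assumption: no matched concept yields zero similarity
    return max(scores), sum(scores) / len(scores)
```

Passing the attribute values of the superconcepts, siblings or subconcepts of the matched concept as `bk_values` would give the remaining maximum-similarity features in the same way.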

Feature                       Description

(I) Num att(ct)               The total number of distinct attributes of ct.
(I) Att length(ct)            Sum of the length of each attribute value of ct.
(S) dir children(ct)          Number of direct sub-concepts of ct.
(S) all children(ct)          Number of all subsumed concepts (direct and inferred) of ct.
(S) siblings(ct)              Number of concepts that share at least one super concept with ct.
(S) isLeaf(ct)                Gives an indication if ct has no subconcept.
(T) Region stability(ct)      Coefficient measuring the stability of the neighbourhood of ct.
                              The neighbourhood includes the superconcepts, subconcepts, and
                              siblings of ct. The coefficient is obtained by dividing the
                              number of concepts in the neighbourhood of ct that have evolved
                              in the last version by the total number of concepts of the
                              neighbourhood.
(T) Last evolution(ct)        Indicates how many releases have been published since the last
                              evolution of ct.
(SW) Similarity max(ct)       Max cosine similarity between attribute values with its
                              equivalents in BK. It is obtained by computing the Cartesian
                              product between the set of attribute values of ct and the set of
                              attribute values of equivalent concepts of ct defined in the UMLS.
(SW) Similarity average(ct)   Average of cosine similarity between attribute values of
                              equivalent concepts of ct in BK.
(SW) Max simSup(ct)           Max cosine similarity between attribute values with the
                              superconcept of ct in BK.
(SW) Max simSib(ct)           Max cosine similarity between attribute values with the sibling
                              concepts of ct in BK.
(SW) Max simSub(ct)           Max cosine similarity between attribute values with the
                              subconcepts of ct in BK.
(E) PubArtT(ct)               Number of PubMed articles citing label(ct) in the previous release.
(E) PubArtT1(ct)              Number of PubMed articles citing label(ct) in the current release.
(E) DiffArt(ct)               The absolute difference in number of PubMed articles citing
                              label(ct) between both releases.
(E) DiffArtRatio(ct)          The difference in the ratio of the number of PubMed articles.

Table 6.2: List of pre-selected features
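As an illustration of the structural (“S”) features, the sketch below derives them from a map of direct super-concepts. The representation (`parents` as a dictionary of sets), the function name and the output keys are hypothetical, and the “inferred” sub-concepts are approximated here by the transitive closure of the direct sub-concept relation, which may differ from what a reasoner would infer.

```python
def structural_features(concept, parents):
    """Compute the structural features of `concept`.

    parents: dict mapping each concept to the set of its direct super-concepts
             (hypothetical input format, not taken from the source).
    """
    # Invert the map to obtain the direct sub-concepts of every concept.
    children = {}
    for child, supers in parents.items():
        for sup in supers:
            children.setdefault(sup, set()).add(child)

    direct = children.get(concept, set())

    # All subsumed concepts: depth-first walk over the sub-concept relation.
    descendants, stack = set(), list(direct)
    while stack:
        node = stack.pop()
        if node not in descendants:
            descendants.add(node)
            stack.extend(children.get(node, ()))

    # Siblings: concepts sharing at least one super-concept with `concept`.
    siblings = set()
    for sup in parents.get(concept, ()):
        siblings |= children.get(sup, set())
    siblings.discard(concept)

    return {
        "dir_children": len(direct),
        "all_children": len(descendants),
        "siblings": len(siblings),
        "isLeaf": not direct,
    }


# Toy hierarchy with placeholder names: A > {B, C}, B > D, D > E
toy = {"B": {"A"}, "C": {"A"}, "D": {"B"}, "E": {"D"}}
features_b = structural_features("B", toy)
```

In this toy hierarchy, B has one direct sub-concept (D), two subsumed concepts (D and E) and one sibling (C), so it is not a leaf.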

The dataset mentioned in Section 6.2.2 was generated according to these features. That is, for each concept ct considered, we computed the value of the corresponding features (see Table 6.2). An illustrative example of how the value of each feature was assigned is shown with the concept number 171.0 (Malignant neoplasm of connective and other soft tissue of head, face, and neck), from the ICD-9-CM version 2012. This concept has five attributes, including one “Abbreviation in any source vocabulary” (Mal neo soft tissue head), one “Metathesaurus preferred term” (Malignant neoplasm of connective and other soft tissue of the head, face, and neck), and three “Metathesaurus entry terms” (Malignant neoplasm of cartilage of the ear; Malignant neoplasm of cartilage of the eyelid; and Malignant neoplasm of connective and other soft tissue of the head, face and neck). The total length of these five attributes is equal to 258 characters (Att length=258). The last time that this concept evolved was in 2008, i.e. four years earlier (Last evolution=4). We also observed that none of the neighbours (siblings, super, subconcepts) evolved in 2012 (Region stability=0). When we compared the concept label with that of the neighbourhood (for instance, with the superconcept “Malignant neoplasm of connective and other soft tissue”), using the cosine similarity measure, we obtained a value close to 0 (Similarity max=0). Another observation is that this term did not appear in any publications between 2011 and 2012 (PubArtT=0).
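The two temporal values in this example can be reproduced with a small sketch. The function names and input formats are hypothetical, the neighbour identifiers are placeholders, and the release list assumes one release per year between 2008 and 2012, which matches the “four years earlier” reading of the example but is not stated as a general rule.

```python
def region_stability(neighbours, evolved_in_last_version):
    """Share of the neighbourhood (super-, sub-concepts and siblings)
    that evolved in the last version."""
    neighbours = set(neighbours)
    if not neighbours:
        return 0.0  # assumption: an empty neighbourhood counts as unchanged
    evolved = neighbours & set(evolved_in_last_version)
    return len(evolved) / len(neighbours)


def last_evolution(releases, last_changed_release):
    """Number of releases published since the concept last evolved.

    releases: chronologically ordered release identifiers, latest last.
    """
    return len(releases) - 1 - releases.index(last_changed_release)


# Values mirroring the example of concept 171.0 (neighbour names are placeholders).
releases = [2008, 2009, 2010, 2011, 2012]          # assumed yearly releases
neighbourhood = ["super_1", "sibling_1", "sibling_2", "sub_1"]

stability = region_stability(neighbourhood, evolved_in_last_version=[])
since_change = last_evolution(releases, last_changed_release=2008)
```

With no neighbour evolving in 2012 the coefficient is 0, and with the last change in 2008 the release count is 4, in line with the values reported above.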

6.3 Results

In this section, we present the experimental results that we obtained for the selection of the classifier as well as for the relevance of the features in the identification of concepts that need revision. We further discuss our results with respect to the recommendation of the type of the revision and compare our work with [Tsatsaronis et al., 2013].