Enriched Study Selection Process - Knowledge extraction from unstructured data and classificati


5.1.2 Enriched Study Selection Process

Our idea is to create W0, the set of the most interesting papers according to the domain of interest, gathered from a large set of unread papers W. To obtain it, we use existing Semantic Web technologies and text mining techniques, in the context of the Linked Data approach. The process we describe here is a supervised iterative process built on top of the following assumption: W ≠ ∅ (as a result of the applied search strategy) and I ≠ ∅ at the beginning (some relevant papers are already known when the systematic review starts).

I0 construction

The initial set of sources contained in I is named I0, and it is composed of primary sources already classified as relevant for the systematic review. Building it is the first step of our process, and it is needed to start the iterative part of the algorithm. I0 can be built in two different ways. The first way is to ask researchers to use their previous knowledge to indicate the best-known and most fundamental papers in the field of interest. This strategy relies on the fact that systematic reviews are often undertaken by researchers who are experts in the field. The second way is to explore a portion of the search space using the basic process, e.g. searching digital libraries or selecting the issues of one or more given journals. The relevant papers found in this portion form I0, and the enriched process is used to explore the remaining search space.

Model building

The second step of our approach consists in automatically computing a model M from I0, the pool of interesting papers chosen to initialize the model. The idea is to build a bag of words model starting from the primary studies in I0. The bag of words model is a representation of the text as an unordered and weighted collection of words, together with their frequency of occurrence, disregarding grammar and word order. For each primary study, we consider only the abstract and the introduction. According to [26], terms that appear at the start or at the end of a document are more significant. We exclude the title and the conclusions, using just the abstract and the introduction; the rationale is to validate our approach also for situations in which less information is available. Finally, we perform stop word elimination and stemming, using the Porter algorithm [74]. The model built in this way is used to train the Naive Bayes classifier, which computes the weight of each word according to the normalized TF-IDF approach [50].
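The model-building step can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the stop-word list is a tiny illustrative one, `crude_stem` is a rough suffix stripper standing in for the Porter algorithm [74], the weighting is one common TF-IDF variant (raw term frequency times log inverse document frequency, L2-normalized) that may differ from the scheme of [50], and the two texts are invented placeholders.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "we"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for the Porter algorithm, not an implementation of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count term occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


def tf_idf(bags):
    """Return one {term: weight} dict per document: tf * log(N / df), L2-normalized."""
    n_docs = len(bags)
    # Iterating a Counter yields its keys, so each term is counted once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    weighted = []
    for bag in bags:
        weights = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        weighted.append({t: w / norm for t, w in weights.items()})
    return weighted


# Hypothetical abstract-plus-introduction texts for two papers in I0.
i0_texts = [
    "We review testing of linked data services.",
    "Linked data enriches the selection of primary studies.",
]
bags = [bag_of_words(t) for t in i0_texts]
weights = tf_idf(bags)
```

In this toy corpus the terms shared by both papers receive weight zero, which shows why a very small I0 gives the inverse document frequency little to work with.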

Linked Data enrichment of papers

As described above, the main actor of this process is DBpedia, a Resource Description Framework (RDF) repository where information stored in Wikipedia is represented as structured data. Let wi be a paper in W: each wi is processed to get a set of named entities N which summarizes wi. Named entities are essentially univocally defined information units, normally described by a set of properties. Formally, a named entity is a phrase that contains the names of people,


organizations, locations, times and quantities, and it refers to exactly one or multiple identical, real or abstract text concept instances, as introduced by [69]. This operation is done using an NLP extractor, which computes contextualized entities using NLP algorithms; the results are disambiguated using the Web of Data [94]. After that, we link each ni ∈ N to the corresponding DBpedia resource (when it is available). Then, from this resource we collect all the words contained in the description field (abstract property) and add them to the bag of words natively extracted from the paper wi. We call this the enriching process, and the resulting paper is named w+i. Finally, w+i is compared with the trained model M by means of the Naive Bayes classifier, which is described below.
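The enriching process can be sketched as follows, under strong simplifying assumptions: the named entities of a paper are taken as given (no NLP extractor is run), and the dictionary `DBPEDIA_ABSTRACTS` is a hypothetical local stand-in for DBpedia lookups whose texts are placeholders, not real abstracts. The function names are ours and serve only for illustration.

```python
from collections import Counter

# Hypothetical stand-in for DBpedia: entity label -> text of its abstract property.
# The texts are placeholders; a real pipeline would resolve each entity to a DBpedia resource.
DBPEDIA_ABSTRACTS = {
    "Semantic Web": "The Semantic Web is an extension of the World Wide Web through standards",
    "Naive Bayes": "Naive Bayes classifiers are a family of probabilistic classifiers",
}


def tokenize(text):
    """Split text into lowercase words (stop-word removal and stemming omitted for brevity)."""
    return text.lower().split()


def enrich(paper_bag, named_entities, abstracts=DBPEDIA_ABSTRACTS):
    """Return w+i: the paper's bag plus the words of each linked entity's abstract."""
    enriched = Counter(paper_bag)  # copy, so the original bag is left untouched
    for entity in named_entities:
        abstract = abstracts.get(entity)  # entities without a resource are skipped
        if abstract is not None:
            enriched.update(tokenize(abstract))
    return enriched


w_i = Counter(tokenize("a classifier for the semantic web"))
n = ["Semantic Web", "Unknown Entity"]  # named entities of w_i, assumed already extracted
w_plus_i = enrich(w_i, n)
```

Entities with no matching resource leave the bag unchanged, mirroring the "when it is available" condition above.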

The use of NLP in SLRs is not new, but the way in which we apply it is. Indeed, [25] exploited NLP to automatically index relevant studies in an SLR in order to prioritize the papers to analyze. In our approach, instead, we use it to overcome the limitation introduced by analyzing a reduced set of words, by linking other related information that already exists in the LOD cloud.

Naive Bayes model classification

We use a Multinomial Naive Bayes (MNB) classifier [77] and implement the TF-IDF weight normalization. According to [50], this implementation outperforms the CNB used by [63]. We use the classifier to compare w+i with the learnt model, and we determine whether the conditional probability that w+i belongs to I (from which M derives) is significant. Adopting Boolean algebra, we assume that all papers that do not belong to I belong to E. The comparison is done for each w+i ∈ W: papers with P[w+i ∈ I] ≥ threshold are moved to W0 and will be manually analyzed by researchers. Conversely, all papers with P[w+i ∈ I] < threshold remain in W.
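A minimal sketch of the thresholded classification is given below. It is an illustration under stated assumptions rather than the classifier used in this work: it uses raw word counts with Laplace smoothing instead of the normalized TF-IDF weights, it assumes that at least one example of both I and E is available for training, and the bags, the paper names and the threshold value of 0.5 are invented for the example.

```python
import math
from collections import Counter


def train_mnb(included, excluded):
    """Fit log priors and Laplace-smoothed log word likelihoods for classes I and E."""
    vocab = {t for bag in included + excluded for t in bag}
    model = {"vocab": vocab}
    total_docs = len(included) + len(excluded)
    for label, bags in (("I", included), ("E", excluded)):
        counts = Counter()
        for bag in bags:
            counts.update(bag)  # adds the word counts of each training paper
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(bags) / total_docs),
            "log_like": {t: math.log((counts[t] + 1) / (total + len(vocab))) for t in vocab},
        }
    return model


def prob_included(model, bag):
    """Return P[bag in I] by normalizing the two class scores; unseen words are ignored."""
    scores = {}
    for label in ("I", "E"):
        cls = model[label]
        scores[label] = cls["log_prior"] + sum(
            n * cls["log_like"][t] for t, n in bag.items() if t in model["vocab"]
        )
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp = {k: math.exp(v - top) for k, v in scores.items()}
    return exp["I"] / (exp["I"] + exp["E"])


# Hypothetical training bags and unread (already enriched) papers.
model = train_mnb(
    included=[Counter(linked=2, data=2, review=1)],
    excluded=[Counter(protein=2, cell=1)],
)
threshold = 0.5
W = {"w1": Counter(linked=1, data=1), "w2": Counter(protein=1, cell=2)}
W0 = [name for name, bag in W.items() if prob_included(model, bag) >= threshold]
```

Here only `w1` is moved to W0, while `w2` stays in W.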

Iteration

As described in the previous section, papers with P[w+i ∈ I] ≥ threshold are moved to W0 to be manually processed, whilst the remaining ones stay in W. Some of the papers moved to W0 will likely pass the subsequent manual selections and go to I, while the others will go to E. If I is modified, M becomes obsolete: it is then necessary to rebuild it and repeat the classification step for all papers w+i ∈ W. Again, if P[w+i ∈ I] ≥ threshold, w+i is moved to W0 to be manually analyzed. If no w+i goes to W0, i.e. W0 = ∅ after a classification, the iteration stops. Papers that remain in W after the last iteration are finally discarded and not considered by researchers; the exclusion of those papers represents the reduction in workload for the human researchers. As a result of the iteration process, at each iteration the model is progressively tailored to the domain of interest, which makes it possible to refine the selection process.

Algorithm 1 Enriched selection process algorithm

  Define I0
  Init I with I0
  repeat
    Train classifier with I
    Extract model M
    for all wi in W do
      Enrich wi obtaining w+i
      Compare w+i with model M:
      if P[w+i in I] ≥ threshold then
        move wi to W0
      end if
    end for
    for all w0i ∈ W0 do
      Manually read title and abstract
      (w0i ∈ I) ? move w0i to C : discard w0i
    end for
    for all ci ∈ C do
      Manually read full paper
      (ci ∈ I) ? move ci to I : move ci to E
    end for
  until C = ∅
  Discard ∀ wi ∈ W

We provide in Algorithm 1 the synopsis of the whole study selection process proposed in this paper, and in Figure 5.2 its complementary graphical representation. Comparing this picture with Figure 5.1, which represents the selection process provided by the guidelines [52], we observe that the original process is not changed, but we add a selection of primary studies that recommends papers similar to the model at each iteration. We also report in Figure 5.2 the steps of the new process described in this section: the use of a bag of words model (b) derived from I0 or I (a), the enrichment of papers through Linked Data (c) and the comparison with the model M by means of the Naive Bayes classifier (d). For the sake of simplicity, we do not represent the iteration.
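The control flow of the iterative process can be sketched in Python as follows. This is an illustration only: the three callables are hypothetical stand-ins (`classify` replaces model training, enrichment and the threshold test; `screen` and `read_full` replace the two manual decisions), the paper titles are invented, and the loop stops as soon as a classification pass moves nothing to W0, as stated in the text.

```python
def enriched_selection(i0, w, classify, screen, read_full):
    """Run the iterative selection and return (I, E, discarded).

    classify(I, paper) -> bool : stands in for enrichment + P[w+i in I] >= threshold
    screen(paper) -> bool      : manual title/abstract decision
    read_full(paper) -> bool   : manual full-text decision
    """
    included, excluded = set(i0), set()
    remaining, discarded = set(w), set()
    while True:
        # Classify against the current I and pull out the papers that look relevant (W0).
        w0 = {p for p in remaining if classify(included, p)}
        if not w0:  # nothing moved to W0: the iteration stops
            break
        remaining -= w0
        # Title/abstract screening: survivors become candidates C, the rest are discarded.
        candidates = {p for p in w0 if screen(p)}
        discarded |= w0 - candidates
        # Full-text reading: each candidate ends up in I or in E.
        newly_included = {c for c in candidates if read_full(c)}
        included |= newly_included
        excluded |= candidates - newly_included
    discarded |= remaining  # papers never selected are dropped unread
    return included, excluded, discarded


def shares_word(included, paper):
    """Toy relevance test: the paper shares at least one word with some paper in I."""
    words = set(paper.split())
    return any(words & set(known.split()) for known in included)


i0 = {"linked data review"}
w = {
    "linked open data",
    "open government portals",
    "open source licensing",
    "protein folding",
    "data mining of cells",
}
relevant = {"linked open data", "open government portals"}
included, excluded, discarded = enriched_selection(
    i0, w, shares_word, screen=lambda p: "cells" not in p, read_full=lambda p: p in relevant
)
```

In this toy run, "open government portals" is reached only in the second pass, after "linked open data" has entered I: this is the progressive tailoring of the model described above. "protein folding" is never read by anyone.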

In document Knowledge extraction from unstructured data and classification through distributed ontologies (Page 86-89)