Automatic Extraction of Concepts of the Request Submitted to the IRS Based on Ontology


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 3, Issue 8, August 2013)


Abdelbaki Issam¹, Benlahmar El habib², Labriji Elhoussin³, Rachik Zineb⁴

Faculty of Sciences Ben M’SIK, University Hassan II – Mohammadia, Casablanca, Morocco

Abstract— Information retrieval systems (IRS) are an indispensable tool for providing users with the information they want, and the query, made up of one or more terms, is the user's only way to express a need. To satisfy this need, the IRS must extract the semantics of the user's request in order to interpret it better. Indeed, for a given query, an IRS returns the same result list to users with different needs. For example, for the request "Apple", some users are looking for results about the brand, while others are looking for results about the fruit. However, acquiring or extracting these descriptors, called concepts, is still a major open problem. Several studies have been carried out to deduce the required information. In this paper we propose a method for extracting conceptual information from the user query, based on the domain ontology ODP (Open Directory Project).

Keywords—Ontology, Personalization, Information Retrieval System, request, Semantic.

I. INTRODUCTION

The availability of information on the internet, the variety of data transfer methods and advances in storage technologies have led to a significant increase in the number of digital documents returned by an IRS for a user's request. Faced with the large number of search tools available, users struggle to express their needs.

On the other hand, document authors and users use a variety of words to express the same concept, and they can also use the same word to express different concepts. The purpose of using an ontology in an IR process is to address these problems and to return results based on semantic concepts instead of terms. The aim is to exploit the richness of the ontology's structure so that the user receives documents containing not only synonyms of the searched terms but also more general and similar terms.

The ontology plays an important role here: a well-expressed need allows the IRS to generate an appropriate answer. However, acquiring or extracting these descriptors, called concepts, remains a recent and crucial problem. In this paper we present our approach to the semantic representation of the user query.

The next section gives an overview of the most widely used concept extraction methods. The following section presents the architecture of our approach and its main components, and the final section concludes and outlines our perspectives.

II. THE MAIN METHODS USED

In this section we present the main methods used to extract semantic fields from a corpus, which will also be used to extract concepts from the user query. When done manually, this extraction is carried out by a domain expert and turns out to be very expensive. To extract the descriptors of a field automatically or semi-automatically, a corpus of expertise must be used, and natural language processing techniques are then applied to it. Finally, to extract all the descriptors of a field automatically, it is essential to use a corpus that covers the entire field. In the literature, work on extracting concepts from textual corpora follows two approaches, namely statistical analysis and linguistic analysis (CLAVEAU [14]). Statistical analysis is based on the study of the contexts of use and the distribution of terms in documents. Linguistic analysis exploits linguistic knowledge, such as the morphological or syntactic structure of terms. Other works combine the two in what is called a "hybrid" or "mixed" approach.

A. Linguistic analysis

Linguistic analysis relies on techniques based on knowledge of the language and of the internal structure of terms (word formation, morphosyntactic composition, ...).


B. Statistical analysis

Statistical analysis methods rely on quantitative techniques and have the advantage of requiring no linguistic knowledge. These methods assume that candidate terms which are more frequent in a domain than in general usage carry more weight for that domain. (Cohen [7]) uses a statistical log-likelihood method based on the frequency of n-grams (subsequences of n elements taken from a sequence of characters) in a document and their frequency in other documents. (Daille, Gaussier and Langé [8]) use a modified version of the log-likelihood to calculate the strength of association between two words. Other measures have been designed specifically for term extraction, including the C-value (Frantzi, Ananiadou and Tsujii [9]).
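To make the idea of "more frequent in a domain than in general usage" concrete, the sketch below scores a term with a commonly used simplified form of the log-likelihood ratio for comparing one term's frequency in two corpora. It is an illustration only, not the exact measure of [7] or [8]; the function name and its parameters are assumptions introduced here.

```python
import math


def log_likelihood(freq_domain, size_domain, freq_general, size_general):
    """Log-likelihood score of one term observed in two corpora.

    freq_* are the term's occurrence counts, size_* the corpus sizes in
    tokens. All names are hypothetical; this is the two-cell simplification
    often used for corpus comparison, not a specific published system.
    """
    total_freq = freq_domain + freq_general
    total_size = size_domain + size_general
    # Expected counts if the term were equally likely in both corpora.
    expected_domain = size_domain * total_freq / total_size
    expected_general = size_general * total_freq / total_size

    score = 0.0
    for observed, expected in (
        (freq_domain, expected_domain),
        (freq_general, expected_general),
    ):
        if observed > 0:  # by convention, 0 * log(0) contributes nothing
            score += observed * math.log(observed / expected)
    return 2.0 * score
```

A term with the same relative frequency in both corpora scores zero, and the score grows as the term becomes more specific to one of them.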

C. Hybrid methods

In hybrid models, also called mixed, statistical approaches and linguistic approaches are coupled. The order in which this association is performed varies from one system to another.

Indeed, in some systems the results of a linguistic analysis are validated and filtered by a statistical analysis, while in other systems the results of the statistical analysis are validated by a linguistic analysis. (DAILLE [15]) identifies noun phrases that describe a compound term using automata.

Then, statistical techniques are used to determine the degree of association between the words of the compound terms extracted in the first stage. These statistical calculations rely on a reference corpus and a list of valid words. According to (DUNNING [10]), the log-likelihood measure seems best suited to representing the relationship between candidate terms obtained with linguistic filters that are more or less representative of the field. (SMADJA [16]) first uses statistical techniques based on the mutual information between words (a measure of the statistical dependence between words) and then, in a second stage, uses linguistic techniques to identify pairs of highly associated words. Starting from a tagged corpus, the tool identifies strongly associated pairs of words, again using mutual information.
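As an illustration of a statistical association measure between words, the sketch below computes pointwise mutual information for adjacent word pairs in a token list. This is one common formulation, assumed here for illustration; it is not claimed to be the exact measure used in [16], and the function name is hypothetical.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Return {(left, right): PMI} for every adjacent word pair in tokens.

    PMI = log2( P(left, right) / (P(left) * P(right)) ), with probabilities
    estimated by simple relative frequencies (an assumption of this sketch).
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (left, right), count in pair_counts.items():
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores
```

Pairs that occur together more often than their individual frequencies would predict receive a higher score, which is what makes them candidates for compound terms.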

III. IMPLICIT EXTRACTION OF THE CONCEPTS OF THE REQUEST

The purpose of using an ontology in an IR process is to resolve the ambiguity and redundancy of words. Indeed, a concept refers to a particular meaning of a given word.

When we say that two words are similar, we mean that the concepts related to them are similar.

Similarity is an inverse function of distance: the more similar two words are, the closer they are. Our goal is to extract from a text all the concepts that characterize it. For example, for the query "java course", we obtain the concepts: course, java, Development, Computer. We are thus able to extract the terms that best describe the user's request, with the help of natural language processing tools, including linguistic and symbolic analysis. We first present the reference ontology used, then the architecture of our approach, and finally detail each of its components.

A. The reference ontology

We chose the domain ontology ODP (Open Directory Project) as our reference; it is our source of semantic knowledge when extracting the concepts of the request. The semantic categories of the ontology are linked by "is a" relationships. Each ODP category is a concept and represents an area of interest of a user. ODP editors have manually associated each concept with a set of web pages whose content matches the semantics of the category. The ODP data are distributed as two RDF files: the first contains the tree structure of the ontology and the second lists the web pages associated with each category. Each category is represented by a title and a description outlining the content of the associated pages; the description also contains the titles and descriptions of its subcategories.
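The category tree described above can be pictured with a small in-memory structure. The sketch below is only an illustration: the class name, its fields and the sample categories in the usage note are assumptions, and it does not parse the actual ODP RDF files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    """One ontology category: a title, a description and "is a" links."""

    title: str
    description: str
    # Parent and children are excluded from repr/compare to avoid cycles.
    parent: Optional["Category"] = field(default=None, repr=False, compare=False)
    children: List["Category"] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, title, description):
        """Create a subcategory that "is a" specialization of this one."""
        child = Category(title, description, parent=self)
        self.children.append(child)
        return child

    def ancestors(self):
        """Titles of the more general categories, nearest first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.title)
            node = node.parent
        return chain
```

For example, with a hypothetical root `Computers`, a child `Programming` and a grandchild `Java`, calling `ancestors()` on `Java` walks the "is a" links upward through `Programming` and then `Computers`.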

Figure I shows an excerpt from the architecture of the ODP.

FIGURE I. EXTRACT FROM THE ARCHITECTURE OF THE ODP


B. Architecture

In this section we present the architecture of our method. It has two main phases. The first represents the ODP categories as vectors; the same processing is then applied to generate the vector representing the user request. The second computes similarities in order to derive the concepts of the query.

C. Phase of representation of categories

Our objective is to represent each semantic category of the ODP using the vector model. The resulting vector consists of the relevant terms of the category, extracted from its title and description.

For this, we proceed as follows.

First, the title and description are concatenated into a document Di, to which we apply morphological processing using a natural language processing (NLP) tool. We chose TreeTagger as our morphosyntactic analyzer. It is distributed freely for research purposes.

It is a tool for annotating text with information considered relevant. It was developed by Helmut Schmid in the "TC" project at ICLUS (Institute for Computational Linguistics of the University of Stuttgart). TreeTagger can tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese and Old French texts. It can be adapted to other languages if a lexicon and a manually tagged corpus are available.

Finally, it can be customized to suit our needs.

According to our needs, we proceed as follows:

- Elimination: remove insignificant words (in English: the, and, or, ...), known as "stop words".

- Recomposition: identify the compound words.

- Lexical analysis: reduce the words to a morphological base form (conjugation, gender, number).

- Lemmatization: group words that share the same origin.
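The steps above can be sketched in a few lines. This is a toy illustration, not the TreeTagger pipeline: the stop-word set, the compound table and the lemma table are tiny hypothetical stand-ins for what the tagger provides, and lexical analysis and lemmatization are folded into a single dictionary lookup.

```python
import re

# Hypothetical, deliberately tiny resources used only for this sketch.
STOP_WORDS = {"the", "and", "or", "of", "a", "an", "in", "to", "for"}
COMPOUNDS = {("information", "retrieval"): "information_retrieval"}
LEMMAS = {"courses": "course", "languages": "language", "programs": "program"}


def preprocess(text):
    """Return the significant terms of a text as a list of base forms."""
    tokens = re.findall(r"[a-z]+", text.lower())

    # Elimination: drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]

    # Recomposition: merge adjacent tokens that form a known compound.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            merged.append(COMPOUNDS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # Lexical analysis and lemmatization: map each token to its base form.
    return [LEMMAS.get(t, t) for t in merged]
```

The same function would be applied both to each category document Di and to the user request, so that the two vectors are built from comparable terms.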

Thus, each category Ci is represented by a vector Vi in the vector model; this vector contains the terms with the highest weights. The weight wij of term Tj in category Ci is calculated as follows:

$$ w_{ij} = p_{ij} \cdot \log\left(\frac{N}{N_i}\right) \quad [17] $$

- pij: the degree of representativeness of the term Tj in Di

- N: the number of sub-categories

- Ni: the number of sub-categories containing the term Tj
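A minimal sketch of this weighting follows. Several points are assumptions made for illustration because the text does not fix them: pij is taken to be the relative frequency of Tj in Di, the logarithm is natural, terms that appear in no sub-category are skipped, and only the `top_k` heaviest terms are kept. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def category_vector(doc_tokens, subcategory_docs, top_k=5):
    """Build the weighted term vector of one category.

    doc_tokens: preprocessed terms of the category document Di.
    subcategory_docs: one set of terms per sub-category.
    """
    n = len(subcategory_docs)          # N: number of sub-categories
    counts = Counter(doc_tokens)
    total = sum(counts.values())

    weights = {}
    for term, count in counts.items():
        # Ni: number of sub-categories containing the term.
        n_i = sum(1 for doc in subcategory_docs if term in doc)
        if n_i == 0:
            continue                   # assumption: weight undefined, skip
        p_ij = count / total           # assumption: relative frequency
        weights[term] = p_ij * math.log(n / n_i)

    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```

With this formula, a term present in every sub-category gets a weight of zero, so the vector is dominated by terms that are both frequent in Di and specific to a few sub-categories.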

D. Phase of extraction of the concepts

After representing each semantic category of the ODP in the vector model, we extract the concepts of the request by measuring the vector similarity between the vectors representing the ODP categories, denoted V(Ci), and the vector representing the request, denoted V(R).

The request is represented by the vector of its significant terms, which are obtained by applying the same processing as in the previous phase. Using the cosine similarity measure, the proximity of a request R to a category Ci is given by:

$$ \mathrm{sim}(R, C_i) = \cos\bigl(V(R), V(C_i)\bigr) = \frac{V(R) \cdot V(C_i)}{\lVert V(R) \rVert \, \lVert V(C_i) \rVert} $$


Mathematically, two vectors are considered similar when the cosine of the angle they form is greater than 0.8. Finally, the concepts whose representative vectors are most similar to the request vector are taken as the concepts of the query.
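The selection step can be sketched as follows, with vectors stored as term-to-weight dictionaries. The function names and the dictionary representation are assumptions made for this illustration; only the cosine measure and the 0.8 threshold come from the text.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def extract_concepts(request_vec, category_vecs, threshold=0.8):
    """Return (category, similarity) pairs above the threshold, best first."""
    scored = (
        (name, cosine(request_vec, vec)) for name, vec in category_vecs.items()
    )
    kept = [(name, score) for name, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Keeping the threshold as a parameter makes it easy to observe how the number of concepts attached to a request changes when it is relaxed or tightened.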

Example

Consider the previously mentioned query "java course". Suppose that TreeTagger generated the following vector for the request:

- V (R) = {Development, java}

We calculate the similarity of this vector with that of a given, already generated concept Ci.

- V (Ci) = {Development, java, Computer, software}

Computing the cosine similarity between the two vectors gives a value below 0.8, so we deduce that this concept is not a concept of the query.
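As a worked illustration, and assuming that every term in both vectors carries a unit weight (the weights are not given in the example, so this is only an assumption), the cosine can be computed by hand:

```latex
\cos\bigl(V(R), V(C_i)\bigr)
  = \frac{V(R) \cdot V(C_i)}{\lVert V(R) \rVert \, \lVert V(C_i) \rVert}
  = \frac{2}{\sqrt{2} \cdot \sqrt{4}}
  = \frac{1}{\sqrt{2}} \approx 0.707 < 0.8
```

The two shared terms (Development, java) give a dot product of 2, and the norms are √2 and √4. With the actual term weights the value would differ, but under this assumption it is consistent with the conclusion drawn in the example.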

IV. EVALUATION

To validate our proposal, we conducted experiments to evaluate the impact of using concepts instead of words during the learning phase of the user profile, which is used during the result classification phase of our meta-search engine.

We used two measures as basic indicators of effectiveness. The first is recall, the ratio between the number of relevant documents found during a search and the total number of relevant documents in the collection. The second is precision, the ratio between the number of relevant documents found during a search and the total number of documents retrieved in response to the query.
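The two indicators can be written directly from their definitions. The sketch below is illustrative; the function name and the use of document identifiers are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the system.
    relevant: identifiers of the documents judged relevant in the collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if four documents are returned, two of them are relevant, and the collection holds three relevant documents in total, precision is 2/4 and recall is 2/3.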

A. TREC collection

Since there is currently no standard evaluation framework for models of personalized information access, we propose an evaluation framework based on the TREC (Text Retrieval Conference) collections. TREC is a conference of American origin whose purpose is to allow the performance of information retrieval systems to be compared on large volumes of data. It brings together the designers of full-text information retrieval tools and software.

It has become a reference and an international standard in the evaluation of information retrieval.

We chose to evaluate our model on the TREC collection from NIST (disks 4 and 5), which contains 741,670 documents.

B. Learning phase

As a first step, we need to populate a new concept knowledge base. For this, we ran the first 10,000 queries to build the knowledge base.

C. Experimental results

We measured our approach using the same ranking algorithm. Table II shows the Precision and Recall results for the two knowledge bases (terms and concepts).

TABLE II

EVALUATION OF RECALL AND PRECISION

Number of requests   Precision             Recall
                     Term      Concept     Term      Concept
100                  0.8402    0.8793      2.1562    2.1820
200                  0.8511    0.8765      2.1537    2.1793
300                  0.8497    0.8793      2.1596    2.1795
400                  0.8596    0.8765      2.1593    2.1795
500                  0.8580    0.8810      2.1580    2.1810

The first tests presented in this table are very encouraging. The comparison of our approach with existing ones shows that it is competitive.

V. CONCLUSION AND PERSPECTIVES

In this paper we presented a method for extracting the concepts of a request based on a reference ontology. The latter is represented by vectors of weighted terms, which allows information retrieval systems to process the user query semantically. We argue that the ontology is not the core of the search process but reinforces it whenever possible. The difficulty of indexing documents with an ontology is that the ontology should cover all concepts, so it must be scalable to accommodate the emergence of new concepts. We therefore plan to enrich the ontology continuously.


REFERENCES

[1] I. Abdelbaki, Z. Rachik, E. Ben lahmar, E. Labriji, Int. J. Computer Technology & Applications, Vol 4 (3), 2013, 414-418.

[2] I. Abdelbaki, E. Ben lahmar, E. Labriji, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol 4 (2), 2013, 194-198.

[3] Greengrass, E. 2000. "Information Retrieval: A Survey".

[4] Auger, P., Drouin, P. & Auger, A. (1996). Filtact©: un automate d'extraction des termes complexes. Terminologies nouvelles, (15), p. 48-49.

[5] Lemay, C., L'Homme, M.-C. & Drouin, P. (2005). Two Methods for Extracting "Specific" Single-Word Terms from Specialized Corpora. International Journal of Corpus Linguistics, 10(2), p. 227-255.

[6] Sanderson, M. 1994. "Word sense disambiguation and information retrieval". In SIGIR 1994, Proceedings of the 17th Annual ACM SIGIR Conference on Research & Development in Information Retrieval, p. 142-151.

[7] Cohen, J. D. 1995. "Highlights: Language- and Domain-Independent Automatic Indexing Terms for Abstracting". Journal of the American Society for Information Science 46 (3): 162-174.

[8] Daille, B., É. Gaussier & J.-M. Langé. 1994. "Towards Automatic Extraction of Monolingual and Bilingual Terminology". Coling '94. Proceedings of the Fifteenth International Conference on Computational Linguistics: 515-521.

[9] Frantzi, K. T., S. Ananiadou & J. Tsujii. 1998. "The C-value/NC-value Method of Automatic Recognition for Multi-word Terms". Proceedings of ECDL '98: 585-604.

[10] Dunning, T. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence". Computational Linguistics 19 (1): 61-74.

[11] Justeson, J. S. & S. M. Katz. 1995. "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text". Natural Language Engineering 1 (1): 9-27.

[12] Bourigault, D. 1992. "Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases". Coling '92. Proceedings of the Fourteenth International Conference on Computational Linguistics: 977-981.

[13] BAZIZ M. (2005). Indexation conceptuelle/sémantique guidée par ontologie pour la recherche d'information. Thèse de Doctorat en informatique effectuée à l'Institut de Recherche en Informatique de Toulouse (IRIT).

[14] CLAVEAU V. (2003). Acquisition automatique de lexiques sémantiques pour la recherche d'information. Thèse de doctorat, Université de Rennes 1.

[15] DAILLE B. (1994). Approche mixte pour l'extraction de terminologie : statistique lexicale et filtres linguistiques. Rapport interne, Université de Paris 7. Thèse de Doctorat en Informatique Fondamentale.

[16] SMADJA F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), pp. 143-177.