• No results found

EFFICIENT PREDICTION OF RESOLVING DIFFICULT KEYWORD QUERIES OVER SEARCH ENGINES USING ONTOLOGY

N/A
N/A
Protected

Academic year: 2022

Share "EFFICIENT PREDICTION OF RESOLVING DIFFICULT KEYWORD QUERIES OVER SEARCH ENGINES USING ONTOLOGY"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

1586 | P a g e

EFFICIENT PREDICTION OF RESOLVING

DIFFICULT KEYWORD QUERIES OVER SEARCH ENGINES USING ONTOLOGY

Priyanka.S

1

, Mannar Mannan.J

2

1,2 Department of Information Technology(IT), Anna University Regional Centre, TamilNadu,( India)

ABSRACT

Keyword query uses a search engine which provides the user friendly environment, but suffer poor ranking quality of documents with a low precision or recall. The hard query or poor query will decrease the document ranking performance. It is important to identify the query for enhancing the user satisfaction. We propose, ontology based keyword query over search engine for considering semantic structure and the concepts. The search is mainly based on keyword expansion and not just the keyword search. The level of accuracy will be enhanced since the query is analyzed semantically. The results are finally ranked using IR-Style ranking algorithm and optimized for providing the relevant links. The links are obtained by using API’s. The final results obtained here will be accurate enough to satisfy the request made by the user query.

Keywords: Ontology, Semantic, IR-Style Ranking, Precision, Recall

I. INTODUCTION

For a given query entered by the user, search for a target web page in most search engines is based on the keyword-based searches and popularity based ranking. Although the results obtained is good enough, not all the search results are relevant to the given query. Most search engines respond to the user queries by providing a list of documents which are relevant to the user query. However, many Information Retrieval (IR) system suffer from variance in performance, such that the quality of the results is poor for some queries. In order to handle the queries properly the “difficult” queries should be identified with the by IR. Precision and recall are generally defined in the terms of a set of retrieved documents and a set of relevant documents. Set of retrieved documents are the list of documents generated by the a web search engine for a query. Set of relevant documents are the list of documents on the internet that are relevant for a certain topic. Precision takes all retrieved documents into account. Recall ininformation retrieval is the fraction of the documents that are relevant to the query that are obtained .Query performance can be predicted by using the two types of statistics such as pre-retrieval statistics and post retrieval statistics. Pre-retrieval statistics includes the query length or query term idf values without any document retrieval of the query. Post-retrieval statistics mainly includes atleast one retrieval steps where the documents are been ranked. The meta-data of the documents will provide the additional information for estimation of the query. The quality of the results is quantified by estimating the query difficulty. The agreement between the top results of the full query and its sub-queries helps in query estimation. It is important to identify the query for enhancing the user satisfaction.

(2)

1587 | P a g e In this paper we propose, ontology based keyword query over search engine for considering semantic structure and the concepts. Semantic Web is the representation of knowledge which enhance search mechanisms with the use of Ontologies. Ontology is a general description of all concepts as well as their relationship. It havebeen shown a beneficial for representing domain knowledge ,and are quickly becoming the backbone of the Semantic Web. In the context of the semantic web, ontologies improve the exploration of the Web resources by adding a consensual knowledge. One of the major advantages of using ontology is mainly for the “reuse” of knowledge.

The ontology-based search is combined with information retrieval which will add semantic meaning to documents. The keywords are mapped into the ontology and then classified within certain domains and make the semantic search engine more flexible and autonomous to construct ontologies. Thus adding a semantic dimension to the Web using ontology to solve many problems

II RELATED WORKS

The query performance is predicted using discounted cumulative gain(DCG) which establishes the measure of retrieval effectiveness that uses graded relevance assessment and usefulness of a document based on its position in the result [6]. Answers to the query are modeled as rooted trees in which the tuples match the individual keywords in the query. A user can get information by typing a few keywords in which the hyperlinks are linked with it and interacting with control on the displayed results [5].

The state-of-the-art IR ranking methods to ranking heterogeneous joined results of database tuples. This ranking method takes into consideration of completeness and size of a result [2]. Statistics have been categorized into two types for predicting query performance properties as pre-retrieval statistics and post- retrieval statistics [3].

A baseline keyword search interface is used to map keywords in a query to ordering clauses. It improves the quality of keyword to predicate mapping by aggregating measured correlation over query pairs [1]. The methods based on the concept of clarity score in which the users are generally interested in only few topics and they are distinguished from other documents which are collected.. The difficulty of the query is predicted more accurately than pre-retrieval based methods for text documents [7]. Some methods use machine learning techniques for learning the query difficulty and predict their hardness [2].

IR-Style ranking algorithm was tried for vector space model and it delivers larger MAP value than PRMS. All state-of-the-art predictors on the text mainly achieves linear /non-linear correlation in the workload. The efficiency of robustness calculation for different parts of the corruption and re-ranking process are done by combining both QAO-Approx and SGS-Approx.[5].

A threshold approach is applied to the KQI to categorize the input query as “easy” or “hard”[5].The prediction performance varies with the query popularity and the combination of features plays a major role in the prediction task [6]. The keyword++ consistently improves both baseline interface of the query-portal and keyword-and use different mechanisms for relevant results [1].

2.1. Ontology-Based Semantic Search

 Interface Engine adds all properties that can be deduced / induced from the semantic annotations and the ontology.

(3)

1588 | P a g e

 The Query Evaluator collects the results and re-transforms them into a single answer which is returned to the user.

 Semantic annotations here assumes to the standard web pages and to objects on standard web pages. It automatically learned from the Web pages and the objects to be annotated and extracted from web pages via user-defined rules (i.e., mapping web pages/objects to an ontological knowledge base).

Fig 1.Ontology-based semanticsearch

 Ontology is mainly used for knowledge representation, information modeling and a formal explicit specification of a shared conceptionalization. Ontologies specify the meaning of annotations. New terms can be formed by combining existing .Meaning of such terms is formallyspecified and can combine/relate terms in multiple ontologies.

 Semantic web mainly used for the development of resource description language(s)and includes classes, subclasses and its properties.

Fig .2 Example of semantic web

 Resources used in semantic webs are globally identified by URI’s. They are generally extensible and relational.

(4)

1589 | P a g e

 Links are identified by URI’s.

 Users can exchange knowledge effectively

 More processable information is available through machines.

III. SYSTEM ARCHITECTURE

The user query should be parsed so that the stop words are removed from the query. All the keywords are considered as concepts. Keywords are expanded in the ontology construction. Ontology is a formal explicit specification of a shared conceptualization. It provides a common understanding of a term and also its relationship with other terms. Thus a hierarchy can be formed with the related terms.

Fig .3 Architecture 3.1 Detailed Description

3.1.1 Ontology Construction : Knowledge Base

ONTOLOGY which forms the knowledge base for this project that is constructed based on the concepts related to the domain. Ontology development can be divided into two main phases: specification and conceptualization.

Specification describes the information knowledge about the domain. Conceptualization describes the structure of this knowledge in which it has to be agreed on by domain experts. Furthermore ontology can be modularized through relations and attributes which involves conceptual aspects

3.1.2 User Input

The user of the system enters a query. The expected output for this query is semantically relevant web links. All the irrelevant links are filtered out and only the relevant links are obtained.

3.1.3 Removal of Stopwords

All the stopwords are removed from the user query. Only keywords (concepts)are obtained and it is matched with the ontology construction

3.1.4 Finding The Difficult Keyword Query

This process is mainly used to find the given query is difficult keyword query based on the given properties.

3.1.5 Ontology Construction And Extraction User

query

Concepts

Ontology

Search

Ranking Document

(5)

1590 | P a g e In this process the ontology is constructed using swoogle in which the keywords are expanded in it. The user query is matched with the keywords which are constructed in the ontology to get a set of more related keywords from it. Finally we get a collection of words which are semantically related and domain specific keywords.

Fig 4 Swoogle architecture

Swoogle architecture mainly consists of four major components such as SWD discovery, metadata creation, data analysis and interface.

3.1.6 Formation of Refined Query

Here the query formation occur with the collected words. The refined queries obtained are given as input to the search engine and fetches more semantically related web links. These queries are sent to search API for fetching relevant web links to the user query.

3.1.7 Ranking of Web Pages

The web links obtained through the API are now filtered and ranked to make it more refined. The irrelevant web links to the user query are all filtered out. Only the relevant links are obtained and documents are collected which relates the user query.

IV IR-STYLE RANKING ALGORITHM

In general we go for ranking mainly due to the too many (thousands of) results match a keyword query and users are interested in a few (<10) results. IR-Style mainly includes the following subsets as:IR-style relevance are mainly based on term frequencies, proximities, position etc and link analysis information. It mainly consider the number of occurrences of a term in a document (raw) term frequency(tf).

ntf = 1 + log(tf) ,if tf> 0

ntf = 0 ,otherwise (1)

The proposed algorithm is designed to access the information stored in the ontology to provide the relevant information to the user. It includes the following steps:

 By using the OWL (Ontology Web Language) the whole information about the domain is stored in ontology in structured form.

 The user query is send through the user interface.

 The algorithm with the help of semantic search with respect to the ontology computesby matching concepts and relationships between concepts.

 Finally the result set that which matches the user query are displayed.

(6)

1591 | P a g e

Fig. 5 Workflow for Query Definition

V.CONCLUSION

With the increasing amount of data being stored on the Web every day, Semantic Search would ensure the provision of useful and relevantresult to the user. This is more user-friendlyby providing more relevant data from ontology. Understanding the user’s query and providing just the content required is the main objective of this application. The results obtained here will be enhanced since the query is analyzed semantically. The results are ranked using the IR-Style ranking algorithm and optimized to provide the relevant links. The links are all obtained by using API’s.

VI. ACKNOWLEDGMENT

The authors would like to thank the Department of Information Technology in Regional Centre of Anna University ,Coimbatore for their valuable guidance and support.

REFERENCES

[1] Shiwen Cheng, ArashTermehchy,andVagelisHristidis,”Efficient Prediction of Difficult Keyword Queries over Database”,2014 vol.26, No.6,JUNE 2014 10.1109/TKDE.2013.140.

[2] V.Ganti, Y.He, and D.Xin, “Keyword++: framework to improve keyword search over entitydatabase”,inproc.VLDB Endowment, Singapore, Sept.2010, vol.3, no. 1-2, pp.711-722.

[3] G.Bhalotia,C.Nakhe, S.Sudarshan,”Keyword searching and browsing in database using BANKS”, in proc.

18th ICDE, San Jose, CA,USA, 2002, pp.431-440.

(7)

1592 | P a g e [4] K.Collins-Thompson, and P.N.Bennett, ”Predicting query performance via classification,”in Proc.32nd

ECIR, Milton Keynes,U.K.,2010,PP.140-152.

[5] S.Cheng, A.Termehchy, and V.Hristidis, “Predicting the effectiveness of keyword queries on database”,in proc. 21stInt.CIKM, Maui,HI,2012,PP.1213-1222.

[6] V.Hristidis, L.Gravano, and Y.Papakonstant, ”Efficient IRstyle keyword search over relational database”, in proc.29th VLDB Conf.,Berlin,Germany,2003,pp.850-851.

[7] S.C.Townsend, Y.Zhou, and B.Croft, “Predicting query performance”, in Proc. SIGIR02, Tampere, Finland,pp.299-306.

[8] E.Yom-Tov, S.Fine, D.Carmel, and A.Darlow, ”Learning to estimate query difficulty: Including applications to missing content detection and distributed information retrieval”, in proc.28thAnnu. Int . ACM SIGIR Conf. Research Development Information Retrieval, Salvador, Brazil, 2005,pp.512-519.

References

Related documents

Pour plate culture, the accepted method of detecting urinary tract infection, is too time consuming for use as a mass..

ABSTRACT: In this paper we analyse the general bulk service queueing model to find the mean queue length   L q , probability that the system is in vacation(P v ) and the

Marine Corps, United States; Elizabeth Kristjansson, School of Psychology, Faculty of Social Sciences, University of Ottawa, Canada; David Moher, Ottawa Hospital Research

Synonyms of explicit concepts, as well as other concepts that are relevant to the information need, but are not explicitly mentioned in the query (henceforth referred

Investigation of the selectivity of monofilament gill nets used in carp fishing ( Cyprinus carpio Linnaeus, 1758) in Lake Bey ş ehir.. Investigation of the selectivity

With a detection rate of 35 %, our study showed that the prevalence of nonalcoholic fatty pancreas is high among adult patients underwent routine medical check-up.. recent

Adjustment variables included age at diabetes diagnosis, cardiovascular disease status, Chronic Disease Score (quartile), renal disease, dialysis, physician office visits

Therefore, curcumin is used either alone or in combination in targeting various types of cancers such as multiple myelomas, pancreatic, lung, breast, oral, prostate, and