International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 3, Issue 2, February 2013)

A Search Engine Modification for Medical Content Search

J.S. Raikwal¹, Dr. Kanak Saxena²

¹Department of Information Technology, I.E.T., D.A.V.V., Indore, India
²Department of Computer Application, S.A.T.I., Vidisha, Indore, India

Abstract: A large amount and variety of information is available on the World Wide Web (Web), and it keeps growing in an unstructured manner. This makes navigation and information retrieval difficult and sometimes frustrating. Data mining and web mining are applications that help to manage data and uncover the knowledge hidden in it.

Web search engines let users search a variety of data on the web, but they are less effective than they could be because of the proliferation of documents and the hundreds of links returned for a user query. Users may become lost or frustrated because of unintuitive navigation and because they cannot judge the relevance of links from their semantic meaning. Many of the links may be unrelated to what the user wants. In this paper we examine the search engine and modify the listing of its results, so that the user can search medical content in the most relevant way. This is sometimes known as search re-ranking.

Gathering useful and interesting information from the Web, or discovering knowledge from hypertext data, is a problem. It may be solved by implementing measures that make Web information understandable to a Web search engine or other types of software. In this paper we describe a classification of web search data according to the user's needs in the medical domain, applied on top of a web search engine. Classification involves sorting, indexing, combining different results and categorizing the output data. The paper also compares traditional search and the improved search on the basis of performance.

Keywords: WWW, web search engine, sorting, indexing query.

I. INTRODUCTION

A web search engine is designed to search for data on the World Wide Web. The search results are commonly presented as a list and are frequently called hits. The information may consist of web pages, images, documents and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or as a mixture of algorithmic and human input [1].

When a user searches for content on the web, the listed results sometimes mix content from different domains that happens to match the query words.

For example, the term "Neural Network" is used in engineering as well as in the medical domain. A search engine therefore emits combined results, and the user needs to evaluate each link individually, which is frustrating. Consequently, a domain-specific search engine is required that returns results according to the user's interest.

In this paper we design and implement a medical search engine that is able to extract accurate information related to the medical domain. An implementation of the basic architecture is proposed and possible future improvements are suggested.

The task is divided into the following steps:

1. Survey and information gathering related to web search engines

2. Search application and information collection that helps to design a web search engine as well as local data collection

3. System proposal

4. System architecture and its implementation

5. Result calculation and its analysis

II. BACKGROUND

A. How web search engines work

Web search engines work by storing information about many web pages, which they extract from the HTML of the pages themselves. These pages are retrieved by a web crawler, sometimes also known as a web spider [2]. The contents of each page are then analyzed to determine how it should be indexed. Data about web pages is stored in an index database so that it can be used in later queries. A query can be as short as a single word. The purpose of an index is to allow information to be found as quickly as possible [1, 3, 4].

This problem might be considered to be a mild form of link rot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment, since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere [5]. In summary, retrieval and search start with collecting web documents, then caching them, and finally indexing them so that they can be fetched accurately and efficiently.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.

There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work.
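The inverted index mentioned above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the whitespace tokenization and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "neural network training for image recognition",
    2: "neural pathways in the human brain",
    3: "network routing protocols",
}
index = build_inverted_index(docs)
```

A query for "neural" alone matches documents from both domains, which is the ambiguity the paper sets out to reduce, while "neural network" narrows the match to a single document.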

B. Problem formulation

After studying the systems proposed by various researchers, we conclude the following about web search engines:

1. The work belongs to the domain of web content mining, also known as text mining.

2. A major problem is resolved by helping the user to write the query.

3. User profile information can be used to make the search more specific.

4. A text mining algorithm can be used to find the strings that exactly match the user's requirement.

5. The system should adopt the more relevant user queries and keep a record of previous search results.

6. The search is made across all domains that are available to search.

7. Domain-specific knowledge should be collected and stored to make the search more reliable, effective and efficient.

III. PROPOSED SYSTEM

A. Execution Strategy

To overcome the identified problems, we propose the following steps for implementing an efficient search engine:

1. Work with a traditional web search API; we used the Google API.

2. Write a web service that provides auto-complete query extension, which helps to improve query results.

3. At registration time, collect information about the user's working domain.

4. Use the KNN text mining algorithm to find correctly matched results.

5. Design a database that holds the queries most frequently searched by different users.

6. Store frequently searched results to reduce the work done by the system.

7. Order the different search results presented to the user by their relevance.

B. System Architecture

Figure 1. System Architecture

The system architecture for effective medical domain-specific search is shown in Figure 1. The complete system consists of small sub-systems that are fully functional units, grouped together to form the whole.

To understand the complete scenario of a medical web search engine with a medical knowledge database, consider the following cases.

Case 1: Suppose a user wants results related to the keyword "Neural". This keyword is well known in the medical as well as the engineering domain. The web search engine returns various results related to both domains and may omit some results from either domain that are specifically needed. In this case, the user has to go through the results line by line to find the one most relevant to the need.

Case 2: If the search engine knows the user's working profile and tries to list results for that specific domain, then evaluating the results and selecting the best-matched text is easier than in the previous case.

IV. SYSTEM PROCESSING

How can a simple interface for search results be built?

To answer this question, the system follows these steps:

User query: Most internet users, experienced as well as new, search using keywords. The keywords represent the text to be found in internet documents, videos, news or other content. A user may make mistakes when writing the query, including spelling mistakes.

Query helper: This is a list of relevant words or strings representing the most popular previous searches; the user can select one of the suggested strings to run the search. The list is generated from a separate database managed by the system, using Ajax and a web service.

User query database: This database holds the web search queries previously used by different users. They are offered as suggestions while the user writes a query. An internal mechanism checks whether the user query is already in the database. If not, it is added to the database; otherwise its popularity index is incremented.
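The interplay between the query helper and the user query database can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the paper's Ajax and web service implementation: the `QueryLog` class, its in-memory dictionary and the prefix-matching rule are all hypothetical stand-ins.

```python
class QueryLog:
    """In-memory stand-in for the user query database (hypothetical schema)."""

    def __init__(self):
        self.popularity = {}  # query text -> number of times it was submitted

    def record(self, query):
        """Add an unseen query, or increment the popularity of a known one."""
        key = query.strip().lower()
        self.popularity[key] = self.popularity.get(key, 0) + 1

    def suggest(self, prefix, limit=5):
        """Return the most popular stored queries that start with the prefix."""
        p = prefix.strip().lower()
        matches = [q for q in self.popularity if q.startswith(p)]
        # Most popular first; ties are broken alphabetically for stable output.
        matches.sort(key=lambda q: (-self.popularity[q], q))
        return matches[:limit]

log = QueryLog()
for q in ["neural network", "neural disorder", "neural disorder", "nephrology"]:
    log.record(q)
```

Calling `log.suggest("neu")` ranks "neural disorder" ahead of "neural network" because it was recorded twice, which mirrors the popularity index described above.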

User profile analysis: The query is now ready to be submitted to the web search API, but first the system slightly modifies it by appending the suffix "in" followed by the profile keywords. This may help the search engine return data from the specific domain. In addition, several keywords that have the same meaning, or that are related to each other within a domain, may be introduced one by one.
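The query modification described in this step can be sketched in a few lines of Python. The function name `expand_query` and the example keywords are assumptions for illustration; the paper does not specify how the profile keywords are stored or normalized.

```python
def expand_query(query, profile_keywords):
    """Return one rewritten query per profile keyword: '<query> in <keyword>'."""
    base = " ".join(query.split())  # collapse stray whitespace in the user query
    seen, rewritten = set(), []
    for keyword in profile_keywords:
        keyword = keyword.strip().lower()
        if keyword and keyword not in seen:  # skip blanks and repeated keywords
            seen.add(keyword)
            rewritten.append(f"{base} in {keyword}")
    return rewritten
```

For a user with a medical profile, `expand_query("neural", ["medical", "medicine"])` yields "neural in medical" and "neural in medicine", which can then be submitted to the search API one by one.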

Google API: The modified user query is submitted to the API to obtain search results. The results may be web pages, images, videos or news, and they are listed in an intermediate form. Control is then transferred to the next sub-system, the result optimizer.

Result optimizer: This is the filtering stage of the search system. It filters out unwanted results using the similar-words database and the well-known KNN text mining algorithm. Two different sub-systems are therefore active at the same time to optimize the final results.
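The paper does not give the details of its KNN filter, so the following Python sketch shows only one plausible reading: result snippets are compared with labeled example snippets by cosine similarity over bag-of-words vectors, and a snippet is kept when the majority label of its k nearest neighbors matches the user's domain. The labeled examples, the choice of k = 3 and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercase, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_label(text, labeled, k=3):
    """Majority label among the k labeled snippets most similar to text."""
    query = vectorize(text)
    ranked = sorted(labeled, key=lambda item: cosine(query, vectorize(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled snippets standing in for the domain knowledge database.
LABELED = [
    ("neural disorder symptoms and patient treatment", "medical"),
    ("brain nerve disease diagnosis in patient", "medical"),
    ("clinical treatment of nerve damage", "medical"),
    ("neural network training with backpropagation", "engineering"),
    ("deep learning network for image classification", "engineering"),
    ("network layer weights and training algorithm", "engineering"),
]

def filter_results(snippets, domain, k=3):
    """Keep only snippets whose predicted domain matches the user's profile."""
    return [s for s in snippets if knn_label(s, LABELED, k) == domain]
```

With a medical profile, a snippet about treating a neural disorder is kept, while a snippet about training a neural network is filtered out. A production system would need a larger labeled set and better tokenization than this sketch assumes.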

User insight watcher: This component is active while the result optimizer runs and keeps watch on the content that needs to be added to the database of search results.

Results database: The selected results are written to the database for future use.

Result mapper: In this phase the evaluated results are produced in the required format and listed by category on a result page. The evaluation process is based on the KNN text mining algorithm.

Common result mapping: The newly found results are listed under the web, image, video or other categories together with the previously stored results, so that the user can choose the required result format.

V. IMPLEMENTATION

After defining the goals and challenges involved in designing an efficient medical search engine, we found that implementing such a system requires considerable effort and programming skill. The system is developed using the Visual Studio .NET IDE, because it provides a rich class library for the implementation and readily supports the new Google API [6].

The code below shows the basic Google API call used in our application. Before it runs, we create a web service that is bound to the text box of the web application; this web service searches the database for the text typed into the text box [7].

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public List<SearchType> Search()
{
    // {0} = URL-encoded query, {1} = index of the first result on the page.
    const string urlTemplate =
        "http://ajax.googleapis.com/ajax/services/search/web" +
        "?v=1.0&rsz=large&safe=active&q={0}&start={1}";

    var resultsList = new List<SearchType>();

    // One request per page of results; offsets advance in steps of 8.
    int[] offsets = { 0, 8, 16, 24, 32, 40, 48 };

    using (var client = new WebClient())
    {
        foreach (var offset in offsets)
        {
            var searchUrl = new Uri(string.Format(
                urlTemplate, Uri.EscapeDataString(SearchExpression), offset));
            var page = client.DownloadString(searchUrl);
            var o = (JObject)JsonConvert.DeserializeObject(page);

            // Project each JSON result into the application's SearchType.
            var resultsQuery =
                from result in o["responseData"]["results"].Children()
                select new SearchType
                {
                    url = result.Value<string>("url"),
                    title = result.Value<string>("title"),
                    content = result.Value<string>("content"),
                    engine = this.Engine
                };

            resultsList.AddRange(resultsQuery);
        }
    }
    return resultsList;
}

To search the database, a simple SQL query fetches the data. If the text entered by the user is found in the database, the results are returned directly from the database; otherwise the query is sent to the web search and the data returned for it is presented to the user. The data listing in the data table is shown in Figure 2.

Figure 2. Data in data table
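The database-first lookup described above can be sketched with Python's built-in `sqlite3` module. The table layout, the function names and the caller-supplied `web_search` function are assumptions for illustration; the paper's actual schema and SQL query are not given.

```python
import sqlite3

def make_store():
    """Create an in-memory results table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (query TEXT, url TEXT, title TEXT)")
    return conn

def lookup(conn, query, web_search):
    """Return stored rows for the query, falling back to web_search on a miss."""
    rows = conn.execute(
        "SELECT url, title FROM results WHERE query = ?", (query,)
    ).fetchall()
    if rows:
        return rows, "database"
    # Cache miss: ask the caller-supplied search function for (url, title) pairs
    # and store them so that the next identical query is served locally.
    fetched = web_search(query)
    conn.executemany(
        "INSERT INTO results (query, url, title) VALUES (?, ?, ?)",
        [(query, url, title) for url, title in fetched],
    )
    return fetched, "web"
```

The second element of the return value only reports where the rows came from; the parameterized query keeps user input out of the SQL text.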

For user profiling, the system provides a simple user registration process, shown in Figure 3, where the user enters some basic details and the working domain in which data should be searched. The user must enter the domain of interest and a security PIN, which is required to change the domain later.

Figure 3. Registration for user profiling

The user insight watcher works in the background to optimize the stored data and reduce its complexity. From usage and a small survey, we found that heavy traffic arises when a new research topic is introduced, but demand for that topic declines over time. It is therefore necessary to clean up unwanted data to reduce the complexity of the database.

The result optimizer is called when a new query is introduced. If the user types a misspelled word, the system would otherwise produce wrong output, so the results must be optimized with similar or related words. The KNN algorithm is used to find the best-fitting words. For example, the query "imple" does not produce correct results on its own, so the results are extended by introducing similar or matching words such as "simple", "implies", and others.
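One way to read "KNN for best-fit words" is a nearest-neighbor search over a vocabulary using edit distance. The paper does not state its distance measure, so the Levenshtein distance, the vocabulary and the function names in this Python sketch are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def nearest_words(word, vocabulary, k=2):
    """Return the k vocabulary words closest to word by edit distance."""
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(vocabulary, key=lambda w: (edit_distance(word, w), w))[:k]

# Hypothetical vocabulary standing in for the similar-words database.
VOCAB = ["simple", "implies", "sample", "temple", "medical"]
```

For the paper's example query, `nearest_words("imple", VOCAB)` returns "simple" (one insertion away) followed by "implies" (two insertions away).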

The result optimization process works in stages. In the first step, it checks the spelling of the input and, where possible, suggests a correction for the user to confirm. In the second step, it works with all the links and the description of the page where each was found.

To evaluate the results, user feedback on the search results is collected. The feedback is a score given by the user that indicates how many of the results were correct and useful. It can be viewed at the administrator end. The feedback screen lists each user query together with its relevance feedback out of 10 results.

Figure 4. Feedback screen

Figure 5 shows the results listed by a normal search. The search engine returns 10 results; some are in an unwanted form, and most of them are not related to the medical field.

Figure 5. Web page with normal search

Figure 6. Web page with modified search

After profiling is implemented, the same content is searched again, as shown in Figure 6, and the results are much closer to what the user needs. This is due to profiling and to the modification of the user query according to the selected profile. In addition, the result optimizer prunes the results: duplicate links and other unwanted content are removed before the results are mapped onto the page. After processing, only 9 of the 10 results remain; one link is removed by the result optimizer, and the closely matched results are listed on the web page.
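The pruning step can be illustrated with a simple Python sketch. Note that it swaps in plain URL normalization for the KNN-based duplicate detection the paper reports; the normalization rules and function names are assumptions, chosen only to show how 10 results can shrink to 9.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Reduce a URL to a comparison key (scheme is deliberately ignored)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return (host, path, parts.query)

def prune_duplicates(results):
    """Keep the first result for each normalized URL, preserving rank order."""
    seen, kept = set(), []
    for result in results:
        key = normalize(result["url"])
        if key not in seen:
            seen.add(key)
            kept.append(result)
    return kept
```

Because the first occurrence is kept, the original ranking of the surviving results is unchanged.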

VI. RESULT

This section presents the results of the user feedback. The proposed model performs better than the conventional approach of searching with a plain search engine.

Based on the data from the previous section, the graph in Figure 7 visualizes the feedback for the same user queries under both approaches. The feedback indicates that the results given by the proposed model are more accurate.

For any data mining or machine learning system, error rate and accuracy largely determine the suitability of the system. To obtain these measures, a validation scheme is included in the system. KNN is used twice in the complete design: first to search for words in the search contents, and second to improve the results before mapping them onto the web page.

[Chart: y-axis 0 to 10, x-axis 1 to 8; series: Normal, New Model]

Figure 7. Comparative results according to feedback
The accuracy is calculated using the following formula:

Accuracy = (number of correctly classified values × 100) / (total number of objects evaluated)

This accuracy measures how well the contents match the search query. After evaluation, we found that the results produced through the result mapper are improved. Here KNN is used to remove duplicate links and content from the listed results. Table 1 shows the accuracy after applying KNN through the result mapper.

Both the graph and the table show that accuracy improves considerably after implementing the system. Consequently, the modified search performs better than the normal system.

TABLE 1. ACCURACY DURING MAPPING

Normal search (%)    Modified search (%)
63.28                89.28
74.27                85.36
63.42                88.27
52.91                86.22
58.23                85.79

[Chart: y-axis 0 to 100, x-axis 1 to 5; series: Normal, Modified]

Figure 8. Accuracy of data filtering and mapping

The error rate is defined as the share of results that are misclassified, that is, results that do not match the desired classification. It is found by the following formula:

Error rate = 100 - Accuracy

or, equivalently,

Error rate = (number of wrongly classified values × 100) / (total number of values to classify)
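The accuracy and error rate formulas can be checked against the values reported in Tables 1 and 2 with a short Python sketch. The function names are illustrative; the accuracy values are copied from Table 1.

```python
def accuracy(correct, total):
    """Accuracy in percent: correctly classified values * 100 / total evaluated."""
    return correct * 100 / total

def error_rate(acc):
    """Error rate in percent, the complement of accuracy."""
    return 100 - acc

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Accuracy values (percent) as reported in Table 1.
NORMAL = [63.28, 74.27, 63.42, 52.91, 58.23]
MODIFIED = [89.28, 85.36, 88.27, 86.22, 85.79]

# Rounded to two decimals to match the precision used in the tables.
normal_errors = [round(error_rate(a), 2) for a in NORMAL]
modified_errors = [round(error_rate(a), 2) for a in MODIFIED]
```

Applied to Table 1, the complements reproduce the Table 2 values row by row, and the mean accuracy comes out at about 62.42% for normal search against about 86.98% for modified search.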

Table 2 and Figure 9 show the error rate of the system. Accuracy indicates how much of the data a system classifies correctly.

It is estimated in two ways. In the first, the user manually provides feedback about the system, such as a rating or relevancy ratio. In the second, the system keeps a record based on the selected algorithm and the number of correctly classified values. Accuracy is expressed as a percentage.

In contrast to accuracy, the error rate indicates how much the system still needs to improve. It is also expressed as a percentage. The modified search performs better than the previously designed normal system, but its remaining error rate indicates room for further improvement.

TABLE 2. ERROR RATE

Normal search (%)    Modified search (%)
36.72                10.72
25.73                14.64
36.58                11.73
47.09                13.78
41.77                14.21

[Chart: y-axis 0 to 50, x-axis 1 to 5; series: Normal, Modified]

Figure 9. Error rate of the system

Accuracy and error rate are complementary: they sum to 100%, so a lower error rate indicates better performance. The modified system therefore performs better than normal search.

In future work, we aim to further improve search accuracy and results.

VII. CONCLUSION AND FUTURE WORK

1. The implementation of an optimized medical search engine using KNN text mining was completed successfully.

2. The designed system produces more relevant results according to the user's interest and profile.

3. It performs better than the previously designed model.

However, this implementation does not completely solve the problem of obtaining a tree with higher performance. It is therefore essential to study the problem in more depth and explore new dimensions to enhance the decision tree model.

REFERENCES

[1] A study of search engines for health sciences, International Journal of Library and Information Science, Vol. 1(5), pp. 069-073, October 2009, http://www.academicjournals.org/ijlis

[2] Web Crawling, by Christopher Olston and Marc Najork, Foundations and Trends in Information Retrieval, Vol. 4, No. 3 (2010), pp. 175–246, DOI: 10.1561/1500000017

[3] http://en.wikipedia.org/wiki/Web_search_engine

[4] http://en.wikipedia.org/wiki/Search_engine_indexing

[5] http://www.hinduwebsite.com/webresources/articles/howseswork.asp

[6] .NET Google Search REST API Integration, by Cognize2k, 30 Dec 2009, http://www.codeproject.com/Articles/49643/NET-Google-Search-REST-API-Integration

[7] How to search Google and Bing in C#, submitted by JonUdell, http://answers.oreilly.com/topic/2165-how-to-search-google-and-bing-in-c/

[8] Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine, Digital Enterprise Research Institute, National University of Ireland, Galway, and AIFB, Karlsruhe Institute of Technology, Germany

[9] Identifying Task-based Sessions in Search Engine Query Logs, WSDM'11, February 9–12, 2011, Hong Kong, China, ACM 978-1-4503-0493-1/11/02

[10] The exploration of internet marketing strategy by search engine optimization: A critical review and comparison, African Journal of Business Management, Vol. 5(12), pp. 4644-4649, 18 June 2011, http://www.academicjournals.org/AJBM, DOI: 10.5897/AJBM10.1417, ISSN 1993-8233

[11] I-SEARCH: A Multimodal Search Engine based on Rich Unified Content Description (RUCoD), WWW 2012 European Projects Track, April 16–20, 2012, Lyon, France

[12] The Dynamics of Search Engine Marketing for Tourist Destinations, by Bing Pan, Department of Hospitality and Tourism Management, School of Business and Economics, College of Charleston, Charleston, SC, USA

[13] Architecture of a Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine, 6th International Workshop on MSPT, Proceedings MSPT 2006
