Search Query and Matching Approach of Information Retrieval in Cloud Computing


Thamilvaani Arvaree @ Alvar ¹, Associate Prof. Dr. Rodziah Atan ²

¹,² Department of Information System, FSKTM, University Putra Malaysia

Selangor, Malaysia

ABSTRACT- Websites have become an important tool for users seeking information in almost every field, including education, computer science, business and health. With the enormous growth of the Internet, it has become very difficult for users to find relevant documents or information. Search engines such as Google and Yahoo return a list of web pages that match the user's query, but inexperienced users may have difficulty formulating a query precise enough to retrieve relevant information from the web. This research investigates the search queries that users prefer and a suitable matching approach for retrieving relevant information in a cloud environment.

Keywords: Search query, matching, information retrieval

I. INTRODUCTION

Although search engines are very useful for obtaining information from the World Wide Web, users still have problems obtaining the most relevant information when processing their web queries. The continued explosion of available information on the World Wide Web has led to the need to process queries intelligently, so as to address more of the user's intended requirements than was previously possible [1].

In the past few years, a growing body of research has addressed information retrieval, the task of retrieving documents relevant to specific aspects (or facets, or meanings) of a given query. This task is motivated by the difficulties users encounter in discriminating the information contained in a conventional list of search results, due to their redundancy and lack of structure. The number of real user queries affected is potentially large, partly because informational queries have been estimated to account for 80% of web queries [2], and partly because today virtually any web query expressed in very few words may have multiple interpretations, depending on the user's intent or on the context in which it is issued. Furthermore, there is evidence that retrieving more information on a specific query of interest is often the primary goal of searchers' efforts [3].

Conventional document retrieval systems return long lists of ranked documents that users are forced to sift through to find relevant documents. The majority of today's Web search engines (e.g., Excite, AltaVista) follow this paradigm. The notoriously low precision of Web search engines, coupled with the ranked list presentation, makes it hard for users to find the information they are looking for.

II. METHODOLOGY

A. Experiments

Phase 1: Getting Services

A pre-analysis experiment was conducted to measure the user experience of getting services from a cloud service provider. The experiment was divided into two main parts. The first examined the user experience of getting services from any service provider. The second identified the number of users who succeeded in getting services from a service provider according to their requirements.

1. User Experiment

Scenario: Thirty final-year degree students in Business and Computing were selected. A brief introduction to cloud computing and related topics was given to the users. After the introduction, the users were introduced to a few cloud service providers' websites and guided through the information on them. They were required to register with one particular website to gain experience as a cloud user. www.salesforce.com was selected as one of the

the end.

From the analysis of the experiment, 67% of the thirty users were identified as novice users (those who did not know about the service provider salesforce.com). The remaining 33% were identified as experienced users, who already had some knowledge of this service provider. The experienced users were questioned about the services offered on this website. Most of them were not sure what types of software could be downloaded from it; they had merely browsed the website by chance and learned about the details that way.

In conclusion, this experiment shows clearly that there is no mechanism to help users decide which provider offers the services they are looking for. Under the existing practice, users must already have information about service providers before they approach a provider's website. Novice users find it difficult to overcome this obstacle and obtain the services.

2. Getting Services

Scenario: The same group of novice and experienced users was selected again. They were given training on the cloud computing environment and the purpose of the experiment. The users were required to access the service provider's website, and the services offered by the selected provider were explained to them. The www.salesforce.com website was selected for this experiment. The users were required to obtain a particular service from this website according to their requirements. Whether each user succeeded in getting the expected service according to the requirements was recorded.

The main aim of this experiment is to analyze customer experience and requirement problems in getting services from a service provider. Both novice and experienced users faced mostly similar types of problems. The three prominent problems faced by both groups are as follows:

• Users are not sure of the relevance of the search results to their needs.

• Users need to refine the search options (education, industry, free or paid software, and others) several times to get the required result.

• Users are not familiar with software names.

Phase 2: Search Query

Experiment setup

The three prominent problems encountered by the users clearly show that the query plays an important role in information retrieval. Therefore, this research continued with a second-phase experiment.

Fifty users were gathered and asked to search for and download any files from the websites of two different service providers. The users were required to record the search query they used to obtain each downloaded file. This research is limited to four commonly used categories, namely Computer Science, Business, Engineering, and Teaching & Education.

Experiment result

From the experiment, 100 websites with distinct URLs were obtained. In addition, 152 different search queries were recorded across the four main categories. Table 1 shows the number of search queries collected in the experiment for each of the four categories. This result will be used as a dataset later in the research.

Table 1: Search Query

No    Category                  Search queries
1     Computer Science          33
2     Business                  45
3     Engineering               33
4     Teaching and Education    41
      Total                     152

B. Matching

Once the search queries are gathered, the information retrieval process proceeds with matching the query against the available information. Retrieving information that meets a particular information need from the large pool of information available on the web by entering a query is a major challenge [4].

This research also looks into a method for finding related queries that approximate the information need of the input query issued to the website's search engine. Because this task is so difficult, many researchers have tried to develop general and efficient strategies for information retrieval. Information on the Web can be accessed and retrieved in several ways, listed below.

• Going directly to a site if you have the address.

• Browsing.

• Exploring a subject directory.

• Conducting a search using a Web search engine.

• Exploring information stored in online databases on the Web, also known as the ‘Deep Web’.

• Joining an email discussion group or Usenet group.

However, this study focuses mainly on search engines. Figure 1 below shows the detailed process of information retrieval.

Figure 1: Information Retrieval Process

A classic example of information retrieval using similarity searching is entering a keyword into the search box on Amazon's web site in order to retrieve descriptions of products related to that keyword. Approximate string matching algorithms can be divided into two main classes: equivalence algorithms and similarity ranking algorithms. This paper discusses the strike match similarity ranking algorithm, together with its associated string similarity metric, which was introduced by Simon White. The strike match similarity ranking algorithm was motivated by the following requirements:

1. A true reflection of lexical similarity - strings with small differences should be recognized as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.

2. Robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.

3. Language Independence - the algorithm should work not only in English, but in many different languages.

FRANCE' should be similar to both 'FRENCH REPUBLIC' and 'REPUBLIQUE FRANCAISE'. It should also be possible to make relative statements of similarity. For example, 'FRENCH' should be more similar to 'FRENCH FOOD' than it is to 'FRENCH CUISINE', because the size of the common substring is the same in both cases and 'FRENCH FOOD' is the shorter of the two strings.

C. Existing Algorithms

1. Equivalence Methods

Equivalence methods compare two strings and return a value of true or false according to whether the method deems those two strings to be, in some sense, equivalent. A simple example of equivalence is to treat 'Tweetle-Beetle Battle' the same as 'TWEETLE BEETLE BATTLE' despite the differences in case, and the replacement of a hyphen with a space in the second string.
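The equivalence test in the example above can be sketched in Python as follows. This is only an illustrative sketch: the function names `normalize` and `equivalent` are hypothetical, and the normalization rule (upper-casing and treating any run of non-alphanumeric ASCII characters as a single space) is an assumption chosen to reproduce the example, not a rule prescribed by any particular equivalence method.

```python
import re


def normalize(text: str) -> str:
    """Upper-case the text and collapse each run of characters other than
    A-Z and 0-9 into one space (an assumed, ASCII-only normalization)."""
    return re.sub(r"[^A-Z0-9]+", " ", text.upper()).strip()


def equivalent(a: str, b: str) -> bool:
    """Return True when the two strings are identical after normalization."""
    return normalize(a) == normalize(b)


print(equivalent("Tweetle-Beetle Battle", "TWEETLE BEETLE BATTLE"))
```

Because the method returns only true or false, it cannot express that one candidate is a closer match than another, which is the gap that similarity ranking methods address.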

2. The Soundex Algorithm

The Soundex algorithm is an attempt to match strings that sound alike. The idea is to take the two strings being compared, map each of them to a new string that represents its phonetics, and then compare those strings for an exact match. The algorithm is only intended to work with English pronunciation, and there are plenty of counterexamples, even in English, where it does not work. However, it is easy to implement and, even better, is already available as a pre-programmed function in the Oracle Database Management System.

The algorithm works as follows. When mapping the original strings to their phonetic strings, the first letter is always retained, and the rest of the string is processed from left to right. The subsequent letters of the string are compressed to a three-digit code according to the scheme shown in Table 2. Since the first letter is always retained, the algorithm always generates a four-character string. The code '0' is used as padding if there are not enough letters in the input string, and any excess letters are disregarded.

Letter                     Phonetic Code
B, F, P, V                 1
C, G, J, K, Q, S, X, Z     2
D, T                       3
L                          4
M, N                       5
R                          6
A, E, I, O, U, Y, H, W     not coded

Table 2: Phonetic Codes in the Soundex Algorithm

For example, the strings 'LICENCE', 'LICENSE' and 'LICENSING' all map to the same Soundex string, 'L252'.

Additionally,

• adjacent pairs of the same consonant are treated as one

• adjacent consonants from the same code group are treated as one

• a consonant immediately following an initial letter from the same code group is ignored

• consonants from the same code group separated by W or H are treated as one

The Soundex algorithm is interesting because it addresses the pronunciation of words, rather than raw lexical similarity. Its main drawbacks are that it is language dependent, and there are many examples of similar strings that nevertheless produce different Soundex codes. And of course it only provides for comparisons of alphabetic characters - anything outside of the range 'A'-'Z' will simply be ignored.
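The coding scheme of Table 2 and the additional rules listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Oracle implementation: the names `CODES` and `soundex` are hypothetical, and returning an empty string for input that contains no letters is a choice made for this sketch only.

```python
# Letter groups and digits taken from Table 2.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word."""
    # Characters outside 'A'-'Z' are ignored.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""  # assumption of this sketch: no letters, no code
    result = letters[0]                 # the first letter is always retained
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(letter, "")    # vowels and Y are not coded
        if code and code != previous:
            result += code              # skip repeats of the previous code
        previous = code                 # a vowel resets the previous code
    return (result + "000")[:4]         # pad with '0', drop excess letters


for word in ("LICENCE", "LICENSE", "LICENSING"):
    print(word, soundex(word))
```

All three words in the loop map to 'L252', matching the example given above.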

3. Similarity Ranking Methods

Similarity ranking methods compare a given string to a set of strings and rank those strings in order of similarity. To produce a ranking, we need a way of saying that one match is better than another. This is done by returning a numeric measure of similarity as the result of each comparison. Alternatively, the distance between two strings can be used instead of their similarity: strings with a large distance between them have low similarity, and vice versa. Two very common methods for ranking similarity are the Longest Common Substring and Edit Distance.

4. Longest Common Substring

The longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.

A disadvantage of this approach is that the position of an 'error' in the input affects the computed similarity between the two strings. If the error occurs in the middle of the string, then the distance between the two strings will be greater than if the error occurred at one end. For example, suppose we make a simple typing error on the keyboard by pressing the key adjacent to the one intended. With a word such as 'PINEAPPLE', typing 'PINESPPLE' gives a longest common substring of length 4, whereas 'OINEAPPLE' gives a value of 8. The problem is that 'PINESPPLE' is deemed to be just as good a match with 'PINEAPPLE' as the string 'PINE', which is probably not what we want.
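The 'PINEAPPLE' figures above can be reproduced with a short Python sketch. The function name `longest_common_substring_length` is hypothetical, and the dynamic-programming formulation is one common way to compute the value; the paper does not prescribe a particular implementation.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters found in both strings."""
    best = 0
    # previous[j] holds the length of the common run ending at the previous
    # character of `a` and at b[j - 1].
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current[j] = previous[j - 1] + 1   # extend the run
                best = max(best, current[j])
        previous = current
    return best


print(longest_common_substring_length("PINEAPPLE", "PINESPPLE"))  # 4
print(longest_common_substring_length("PINEAPPLE", "OINEAPPLE"))  # 8
print(longest_common_substring_length("PINEAPPLE", "PINE"))       # 4
```

The first and third calls return the same value, which illustrates the drawback described above: a single mid-word typing error scores no better than a much shorter string.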

5. Edit Distance

This method focuses on the most common typing errors, namely character omissions, insertions, substitutions and reversals. The idea is to compute the minimum number of such operations that it would take to transform one string into another. This number gives an indication of the similarity of the strings. A value of 0 indicates that the two strings are identical.

The algorithm can be described more generally by associating a cost with each of the operations, and deriving the distance between two strings as the minimum cost that transforms one string into another. There are two widely recognized variations of the edit distance. The Levenshtein Edit Distance is the most common variation and allows insertion, deletion or substitution of a single character where the cost of each operation is 1.
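As an illustration of the Levenshtein variation described above, the following Python sketch computes the distance with the standard row-by-row dynamic-programming approach, charging a cost of 1 for each insertion, deletion or substitution. The function name `levenshtein` is hypothetical; reversals are not handled as a single operation in this variation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] is the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("PINEAPPLE", "PINESPPLE"))  # 1
print(levenshtein("FRANCE", "FRENCH"))        # 2
```

Unlike the longest common substring, the distance for 'PINESPPLE' is 1 regardless of where in the word the typing error occurs.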

6. The New Metric

Given the drawbacks of the existing algorithms, the strike match algorithm introduces a new string similarity metric that rewards both common substrings and a common ordering of those substrings. In addition, the strike match algorithm considers not only the single longest common substring, but other common substrings too.

The basic idea is to find out how many adjacent character pairs are contained in both strings.

By considering adjacent characters, the algorithm takes account not only of the characters, but also of the character ordering in the original string, since each character pair contains a little information about the original ordering.

For example, consider comparing the two strings 'France' and 'French'. First, map them both to upper case (making the algorithm insensitive to case differences), then split them up into their character pairs:

FRANCE: {FR, RA, AN, NC, CE}

FRENCH: {FR, RE, EN, NC, CH}

Then check which character pairs occur in both strings. In this case, the intersection is {FR, NC}. The algorithm then computes a numeric metric that reflects the size of the intersection relative to the sizes of the original strings. If pairs(x) is the function that generates the pairs of adjacent letters in a string, then the numeric metric of similarity is:

$$\mathrm{similarity}(s_1, s_2) = \frac{2 \times |\mathrm{pairs}(s_1) \cap \mathrm{pairs}(s_2)|}{|\mathrm{pairs}(s_1)| + |\mathrm{pairs}(s_2)|}$$

The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings, divided by the sum of the number of character pairs in the two strings. Note that the formula rates completely dissimilar strings with a similarity value of 0, since the size of the letter-pair intersection in the numerator of the fraction will be zero. On the other hand, if a (non-empty) string is compared to itself, then the similarity is 1.

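For example, for the comparison of 'FRANCE' and 'FRENCH', the metric is computed as follows:

$$\mathrm{similarity}(\text{FRANCE}, \text{FRENCH}) = \frac{2 \times |\{\text{FR}, \text{NC}\}|}{|\{\text{FR}, \text{RA}, \text{AN}, \text{NC}, \text{CE}\}| + |\{\text{FR}, \text{RE}, \text{EN}, \text{NC}, \text{CH}\}|} = \frac{2 \times 2}{5 + 5} = 0.4$$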

Given that the values of the metric always lie between 0 and 1, it is also very natural to express these values as percentages. For example, the similarity between 'FRANCE' and 'FRENCH' is 40%. From here on, similarity values are expressed as percentages, rounded to the nearest whole number.
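The letter-pair metric described above can be sketched in Python as follows. This is a simplified illustration rather than Simon White's published code: the names `letter_pairs` and `similarity` are hypothetical, the pairs of each string are treated as a set (as the set notation above suggests), the whole string is processed as a single unit, and returning 0 when neither string yields any pair is an assumption made to avoid division by zero.

```python
def letter_pairs(text: str) -> set:
    """Set of adjacent character pairs in the upper-cased string."""
    upper = text.upper()
    return {upper[i:i + 2] for i in range(len(upper) - 1)}


def similarity(s1: str, s2: str) -> float:
    """Twice the number of shared pairs divided by the total number of pairs."""
    pairs1, pairs2 = letter_pairs(s1), letter_pairs(s2)
    total = len(pairs1) + len(pairs2)
    if total == 0:
        return 0.0  # assumption: strings too short to form a pair score 0
    return 2 * len(pairs1 & pairs2) / total


print(sorted(letter_pairs("France") & letter_pairs("French")))  # ['FR', 'NC']
print(similarity("France", "French"))                           # 0.4
```

Because the strings are upper-cased first, 'France' and 'FRANCE' produce the same pairs, so the comparison is insensitive to case.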

7. Ranking Results

Typically, we do not just want to know how similar two strings are. We want to know which of a set of known strings are most similar to a particular string. For example, which of the strings 'Heard', 'Healthy', 'Help', 'Herded', 'Sealed' or 'Sold' is most similar to the string 'Healed'? To answer this, compute the similarity between 'Healed' and each of the other words, and then rank the results in order of these values. The results for this example are presented in Table 3.

Word       Similarity
Sealed     80%
Healthy    55%
Heard      44%
Herded     40%
Help       25%
Sold       0%

Table 3: Find the Most Similar Word to 'Healed'
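The ranking in Table 3 can be reproduced with the following self-contained Python sketch. The names `letter_pairs`, `similarity` and `rank` are hypothetical, and the metric is the simplified set-based form of the letter-pair similarity, applied to single words only; it is an illustration under those assumptions rather than a complete implementation of the strike match algorithm.

```python
def letter_pairs(text: str) -> set:
    """Set of adjacent character pairs in the upper-cased string."""
    upper = text.upper()
    return {upper[i:i + 2] for i in range(len(upper) - 1)}


def similarity(s1: str, s2: str) -> float:
    """Twice the number of shared pairs divided by the total number of pairs."""
    pairs1, pairs2 = letter_pairs(s1), letter_pairs(s2)
    total = len(pairs1) + len(pairs2)
    return 2 * len(pairs1 & pairs2) / total if total else 0.0


def rank(query: str, candidates: list) -> list:
    """Return (word, percentage) tuples, most similar first."""
    scored = [(word, round(100 * similarity(query, word))) for word in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


words = ["Heard", "Healthy", "Help", "Herded", "Sealed", "Sold"]
for word, percent in rank("Healed", words):
    print(f"{word:<8} {percent}%")
```

The printed order and percentages agree with Table 3, with 'Sealed' ranked first at 80% and 'Sold' last at 0%.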

V. CONCLUSION

In conclusion, the technology of search engines is a very dynamic field, always looking for improvements and new ideas in order to satisfy user needs. The ability of a system to find relevant information based on the user's search query is key to its success. This ability can be significantly enhanced by employing an approximate string matching algorithm. This research collected a set of search queries from the experiments carried out. We have discussed various existing algorithms for query matching and information retrieval, together with their limitations. The strike match algorithm appears well suited to information retrieval in web applications. In future work, we would like to validate this algorithm with a sample prototype, which we expect to improve the effectiveness of information systems.

REFERENCES

[1] Allan, J. (2003). HARD track overview in TREC 2003: high accuracy retrieval from documents. In Proceedings of the 12th Text Retrieval Conference (pp. 24–37).

[2] Jansen, B. J., Booth, D. L., & Spink, A. (2008). Determining the informational, navigational, and transactional intent of Web queries. Information Processing and Management, 44(3), 1251–1266.

[3] Xu, Y., & Yin, H. (2008). Novelty and topicality in interactive information retrieval. Journal of the American Society for Information Science and Technology, 59(2), 201–215.

[4] Liu, X., & Croft, B. W. (2004). Cluster-based retrieval using language models. In Proceedings of the 27th international ACM SIGIR conference on research and development in information retrieval (pp. 186–193). Sheffield, UK: ACM Press.

[5] Gaines, B. R., Chen, L. L., & Shaw, M. L. G. (1997). Modeling the human factors of scholarly communities supported through the Internet and the www. Journal of the American Society for Information Science, 48(11), 987–1003.

[6] Martzoukou, K. (2005). A review of Web information seeking research: considerations of method and foci of interest. Information Research, 10(2), paper 215. Available at http://InformationR.net/ir/10-2/paper215.html.