EFFECTIVE CLOUD SEARCH BASED ON MULTI-KEYWORD RANKED OVER ENCRYPTED DATA


N. Veenasri¹, P. Rajesh²

¹Student, Computer Science and Engineering, Kingston Engineering College (India)

²Assistant Professor, Computer Science and Engineering, Kingston Engineering College (India)

ABSTRACT

With the advent of cloud computing, more and more data are outsourced to the public cloud for economic savings and ease of access. This raises problems because the Cloud Service Provider (CSP) has full control of the outsourced data and may perform unauthorized operations on it out of curiosity or for profit. Hence, private information has to be encrypted to guarantee its security. Meanwhile, existing search approaches over encrypted cloud data support only exact or fuzzy keyword search, not semantics-based multi-keyword ranked search. Such schemes are therefore not intelligent and omit some semantically related documents. To address this deficiency, an effective approach is proposed for multi-keyword ranked search over encrypted cloud data that supports synonym queries. Our solution returns not only the exactly matched files but also the files containing terms semantically related to the query keywords.

Keywords – Semantic, Multi-Keyword, Ranked Search

I. INTRODUCTION

Cloud computing is the long-dreamed vision of computing as a utility, where cloud customers can remotely store their data in the cloud and enjoy on-demand, high-quality applications and services from a shared pool of configurable computing resources. By storing their data in the cloud, data owners are relieved of the burden of data storage and maintenance.

However, the fact that data owners and the cloud server are not in the same trusted domain may put the outsourced data at risk, as the cloud server can no longer be fully trusted. The Cloud Service Provider (CSP) has full control of the outsourced data and may perform unauthorized operations on it out of curiosity or for profit. It follows that sensitive data usually should be encrypted prior to outsourcing, both for data privacy and to combat unsolicited access. However, data encryption makes effective data utilization a very challenging task, given that there could be a large number of outsourced data files. Moreover, in cloud computing, data owners may share their outsourced data with a large number of users, and an individual user might want to retrieve only the specific data files of interest during a given session. One of the most popular ways to do so is keyword-based search, which retrieves files selectively instead of retrieving all the encrypted files, an approach that is completely impractical in cloud computing scenarios. Keyword-based search has been widely applied in plaintext search scenarios, such as Google search.


Unfortunately, data encryption restricts users' ability to perform keyword search and thus makes the traditional plaintext search methods unsuitable for cloud computing. Besides this, data encryption also demands the protection of keyword privacy, since keywords usually contain important information related to the data files.

Although encrypting keywords can protect keyword privacy, it further renders the traditional plaintext search techniques useless in this scenario. To search securely over encrypted data, searchable encryption techniques have been developed in recent years. Searchable encryption schemes usually build an index for each keyword of interest and associate the index with the files that contain the keyword. By integrating the trapdoors of keywords within the index information, effective keyword search can be realized while both file content and keyword privacy are well preserved. Users prefer a retrieval result that contains the files most relevant to their interest rather than all matching files. The files should therefore be ranked in order of relevance to the user's interest, and only the files with the highest relevance should be sent back.

In real search scenarios, it is quite common for a cloud customer's search input to be a synonym of a predefined keyword rather than an exact or fuzzy match, either because of synonym substitution (for example, commodity and goods) or because the user lacks exact knowledge about the data. The existing searchable encryption schemes support only exact or fuzzy keyword search. That is, they do not tolerate synonym substitution or syntactic variation, both of which are typical user search behaviors and happen very frequently. Therefore, synonym-based multi-keyword ranked search over encrypted cloud data remains a very challenging problem.

To meet the challenge of an effective search system, we propose a practically efficient and flexible searchable scheme which supports both multi-keyword ranked search and synonym-based search. To address multi-keyword search and result ranking, the Vector Space Model (VSM) is used to build the document index: each document is expressed as a vector in which each dimension value is the Term Frequency (TF) weight of the corresponding keyword. A new vector is also generated in the query phase. It has the same dimension as the document index vector, and each of its dimension values is an Inverse Document Frequency (IDF) weight. The VSM measures the similarity between a document index vector and the query vector and hence supports more accurate ranked search results; the cosine measure is used to compute the similarity of a document to the search query. To improve search efficiency, a tree-based index structure, a balanced binary tree, is used. The searchable index tree is constructed from the document index vectors, so the related documents can be found by traversing the tree.

II. PROBLEM FORMULATION

2.1 System Model

The system model has three entities: the data owner, the data user, and the cloud server. The data owner, an individual or an enterprise, has a document collection to be outsourced to the cloud and encrypts the data before outsourcing. To make the data searchable, the data owner also generates a searchable index based on a set of distinct keywords extracted from the documents. The encrypted file collection and the searchable index are then outsourced to the cloud together. In the search phase, the system generates an encrypted search trapdoor based on the keywords, or the synonyms of the predefined keywords, entered by a user who has been authorized by the data owner. Given the trapdoor, the cloud server searches the index and returns the search results to the user. The search result is a set of encrypted documents containing the entered keywords, ranked according to similarity measures.

Figure 2.1 System Architecture

2.2 Keyword Extraction

Tf-idf is short for "term frequency-inverse document frequency". It reflects how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document. One of the simplest ranking functions is computed by summing the tf-idf values of the query terms. Typically, the tf-idf weight is composed of two terms:

o Term frequency

o Inverse document frequency

TF, term frequency, measures how frequently a term occurs in a document:

$$\mathrm{TF}(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

IDF, inverse document frequency, measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$\mathrm{IDF}(t) = \log_e \frac{\text{total number of documents in corpus } D}{\text{number of documents in } D \text{ containing term } t}$$

The weight of keyword i is then

$$W_i = \mathrm{TF} \times \mathrm{IDF} = f_i \cdot \log\frac{N}{n}$$

where

W_i - weight of keyword i in the document
f_i - frequency of term i in the document
N - total number of documents
n - number of documents that contain the keyword i

Improved formula:

$$W_i = f_i \cdot \log\frac{N}{n} \cdot \frac{E_1 - E_2}{n}$$

where

E₁ - number of documents in the largest category containing the term i
E₂ - number of documents in the second largest category containing the term i
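The weighting formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, and the category labels are assumptions made for illustration. In particular, E₁ and E₂ are interpreted here as the two largest per-category counts of documents containing the term, and f_i in the improved formula is taken as the raw count of the term in the document.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency log_e(N / n); 0.0 when no document has the term."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n else 0.0

def improved_weight(term, doc_tokens, corpus, labels):
    """W_i = f_i * log(N / n) * (E1 - E2) / n under the assumptions stated above."""
    n = sum(1 for doc in corpus if term in doc)
    if n == 0:
        return 0.0
    # Count, per category, the documents that contain the term.
    per_category = Counter(
        label for doc, label in zip(corpus, labels) if term in doc
    )
    counts = sorted(per_category.values(), reverse=True)
    e1 = counts[0]
    e2 = counts[1] if len(counts) > 1 else 0
    f_i = Counter(doc_tokens)[term]  # assumed: raw count of the term
    return f_i * math.log(len(corpus) / n) * (e1 - e2) / n

# Hypothetical toy corpus with two categories, "a" and "b".
corpus = [
    ["cloud", "search", "cloud"],
    ["cloud", "data"],
    ["ranked", "search"],
    ["data", "index"],
]
labels = ["a", "a", "b", "b"]

cloud_weight = improved_weight("cloud", corpus[0], corpus, labels)
search_weight = improved_weight("search", corpus[0], corpus, labels)
```

In this toy corpus, "cloud" occurs only in category "a", so the factor (E₁ - E₂)/n leaves its weight intact, while "search" is spread evenly over both categories and its weight drops to zero. This matches the intent of favoring terms that discriminate between categories.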

2.3 Synonym Expansion

The words are analyzed in WordNet so that the related terms can be found for use in the index file. WordNet is a lexical database for the English language. A common use of WordNet is to determine the similarity between words. It groups English words into sets of synonyms called synsets. Synsets include simplex words as well as collocations like "eat out" and "car pool". All synsets are connected to other synsets by means of semantic relations. Individual synset members (words) can also be connected with lexical relations. For example, (one sense of) the noun "director" is linked to (one sense of) the verb "direct". The synset types are as follows:

o Noun

o Verb

o Adjective

o Adverb

Two kinds of building blocks are distinguished in the source files: word forms and word meanings. Word forms are represented in their familiar orthography; word meanings are represented by synonym sets. Two kinds of relations are recognized: lexical and semantic. Lexical relations hold between word forms; semantic relations hold between word meanings.
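The synonym expansion step can be sketched as follows. To keep the example self-contained, a small hand-written synonym table stands in for WordNet synsets; TOY_SYNSETS, expand_keywords, and map_to_dictionary are hypothetical names, not part of WordNet or of the proposed scheme.

```python
# Hypothetical stand-in for WordNet synsets: each set groups words
# that are treated as synonyms of one another.
TOY_SYNSETS = [
    {"commodity", "goods"},
    {"car", "automobile"},
]

def expand_keywords(keywords, synsets=TOY_SYNSETS):
    """Return the keywords plus every word sharing a synset with one of them."""
    expanded = set(keywords)
    for word in keywords:
        for synset in synsets:
            if word in synset:
                expanded |= synset
    return sorted(expanded)

def map_to_dictionary(query_terms, dictionary, synsets=TOY_SYNSETS):
    """Map query terms to predefined index keywords, directly or via a synonym."""
    expanded = set(expand_keywords(query_terms, synsets))
    return sorted(expanded & set(dictionary))

expanded_query = expand_keywords(["goods", "cloud"])
matched = map_to_dictionary(["goods"], ["commodity", "cloud"])
```

Here a query for "goods" is mapped onto the predefined keyword "commodity" even though the user never typed it, which is the behavior the synonym-based search is meant to provide.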

2.4 Similarity Ranking

Fig 2.2 shows the searchable index tree, a balanced binary tree that serves as a dynamic data structure. The tree is built following a post-order traversal once all leaf nodes have been generated. The sequential search for the keywords in a search request proceeds as follows: the procedure starts from the root node, and when it arrives at an internal node u, it continues into both subtrees of u if the node contains at least one keyword of the search request; otherwise it stops searching that subtree, because none of its leaf nodes contains a keyword in the search query. When it arrives at a leaf node, the process computes the cosine value between the index vector stored in the leaf node and the query vector as the similarity score.

The number of documents that contain a keyword in the search query is denoted as r. In the sequential search, the procedure traverses as many paths as r.
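The search procedure described above can be sketched in Python on plaintext vectors. This is only an illustration under stated assumptions: encryption and trapdoors are omitted, the tree is built bottom-up by pairing nodes level by level (which gives a roughly balanced tree), each internal node stores the union of the keywords in its subtree, and all names (Node, build_tree, search, ranked_search) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    keywords: frozenset             # keywords present anywhere in this subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    doc_id: Optional[str] = None    # set on leaf nodes only
    vector: Optional[dict] = None   # leaf index vector: keyword -> weight

def build_tree(docs):
    """Create one leaf per document, then pair nodes until one root remains."""
    level = [Node(frozenset(vec), doc_id=d, vector=vec) for d, vec in docs.items()]
    if not level:
        return None
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                l, r = level[i], level[i + 1]
                parents.append(Node(l.keywords | r.keywords, left=l, right=r))
            else:
                parents.append(level[i])  # odd node is carried up unchanged
        level = parents
    return level[0]

def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(node, query, results=None):
    """Visit only subtrees that hold at least one query keyword."""
    if results is None:
        results = []
    if node is None or node.keywords.isdisjoint(query):
        return results  # prune: no leaf below contains a query keyword
    if node.doc_id is not None:
        results.append((node.doc_id, cosine(node.vector, query)))
        return results
    search(node.left, query, results)
    search(node.right, query, results)
    return results

def ranked_search(root, query, top_k):
    """Return the top_k (doc_id, score) pairs in descending order of score."""
    return sorted(search(root, query), key=lambda h: h[1], reverse=True)[:top_k]

# Hypothetical index vectors for four documents.
docs = {
    "d1": {"cloud": 0.5, "search": 0.5},
    "d2": {"cloud": 1.0},
    "d3": {"index": 1.0},
    "d4": {"tree": 0.7, "index": 0.3},
}
root = build_tree(docs)
hits = ranked_search(root, {"cloud": 1.0}, top_k=2)
```

For the query keyword "cloud", only the subtree holding d1 and d2 is explored, so the number of leaves reached equals r = 2, and d2 is ranked ahead of d1.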

Figure 2.2 Searchable Index Tree


The vector space model, or term vector model, is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking.

Documents and queries are represented as vectors.

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed.

One of the best known schemes is tf-idf weighting. The definition of a term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can be used to compare documents with queries.

Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

$$\cos\theta = \frac{d \cdot q}{\|d\| \, \|q\|}$$

where d · q is the intersection (i.e. the dot product) of the document and the query vectors, ||d|| is the norm of vector d, and ||q|| is the norm of vector q. The norm of a vector is calculated as:

$$\|q\| = \sqrt{\sum_{i} q_i^2}$$

As all vectors under consideration by this model are element-wise nonnegative, a cosine value of zero means that the query and document vectors are orthogonal and have no match (i.e. the query term does not exist in the document being considered).
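As a small worked illustration of the cosine measure, the following Python sketch ranks documents against a query vector. The vectors and the names norm, cosine_similarity, and rank are made up for illustration; in the scheme, the document vectors would hold TF weights and the query vector IDF weights.

```python
import math

def norm(v):
    """Euclidean norm of a dense vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(d, q):
    """Cosine of the angle between d and q; 0.0 if either is a zero vector."""
    denom = norm(d) * norm(q)
    if denom == 0:
        return 0.0
    return sum(di * qi for di, qi in zip(d, q)) / denom

def rank(doc_vectors, q):
    """Sort documents by descending cosine similarity to the query."""
    scored = [(doc_id, cosine_similarity(vec, q)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)

# Hypothetical three-keyword dictionary; one weight per keyword.
doc_vectors = {
    "d1": [1.0, 0.0, 0.0],
    "d2": [1.0, 1.0, 0.0],
    "d3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]
ranking = rank(doc_vectors, query)
```

Document d1 points in the same direction as the query and scores 1, d2 shares one of its two terms and scores 1/√2, and d3 is orthogonal to the query and scores 0.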

III. PERFORMANCE ANALYSIS

3.1 Performance of Keyword Extraction Method

To verify the effectiveness of the improved weighting formula (ETFIDF), a KNN (K Nearest Neighbor) classifier is used to train on and predict the category labels of documents. Two steps are performed as follows:

(1) The training documents are first used to train the KNN classifier, and a classification result is obtained;

(2) The test document is classified according to the similarity between the test document and the classifier.

Figure 3.1 The Classification Accuracy of Two Methods


The classification accuracy of the improved method and of the original method, each combined with the KNN algorithm, is shown in Fig 3.1. The Macro F1 of the improved method combined with KNN is higher than that of the original method combined with KNN. The maximum Macro F1 value of the original method is 90.382, which is lower than that of the proposed method. In addition, the standard deviation of the proposed method is lower than that of the original method, which shows that the proposed method reduces dependence on the number of features.
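For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-category F1 scores. A minimal sketch follows; the function name and the labels are illustrative only and are not the experimental data. It returns a fraction between 0 and 1, whereas the value quoted above appears to be on a percentage scale.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical true and predicted category labels for four documents.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
```

In this example category "a" has F1 = 2/3 and category "b" has F1 = 0.8, so the Macro F1 is their mean, about 0.733.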

3.2 Efficiency of Index Tree Construction

Given the keyword set constructed using the synonym-based method, the time cost of index tree construction is measured for the basic scheme and the enhanced scheme. The time cost of index tree construction is mainly affected by the number of documents in the dataset and the number of keywords in the dictionary. For each internal node in the searchable index tree, the major computation is the encryption of the hash table, the time cost of which is proportional to the number of keywords in the dictionary.

Figure 3.2 Index Tree Construction Time for Basic Scheme and Enhanced Scheme

Fig 3.2 indicates that the time cost for constructing index tree is proportional to the number of keywords in the dictionary with the same size of dataset.

Figure 3.3 Index Storage Cost for Basic Scheme and Enhanced Scheme

Fig 3.3 indicates that the index tree sizes of the two schemes are very close. Storage space is not a major concern in the cloud computing environment, because the index data consume only a small amount of storage space.

3.3 Search Efficiency

In this section, the performance of the search scheme is evaluated as the number of documents increases. The search process, which is carried out by the cloud server, consists of computing the similarity scores of relevant documents and ranking the results based on these scores. Fig 3.4 shows the search time for the basic scheme and the enhanced scheme. Let r represent the number of documents that include the search keywords. Fig 3.4 shows that the search time mainly depends on the number of documents in the dataset when r is fixed, and that the time cost of the two schemes is similar.

Figure 3.4 Search Time for Basic Scheme and Enhanced Scheme

IV. CONCLUSION

The proposed scheme solves the problem of synonym-based multi-keyword ranked search over encrypted cloud data. To achieve practical and effective multi-keyword text search over encrypted cloud data, we make contributions in two major aspects: similarity-based ranking for more accurate search results, and a tree-based search algorithm that achieves better-than-linear search efficiency. For accuracy, we adopt the vector space model combined with the cosine measure, which is popular in the information retrieval field, to evaluate the similarity between a search request and a document. Search results are returned even when authorized cloud customers enter synonyms of the predefined keywords rather than exact or fuzzy matches, whether because of synonym substitution or a lack of exact knowledge about the data. We use vectors consisting of TF values as document indexes. These vectors constitute a matrix, from which we analyze the latent semantic association between terms and documents. The sensitive data stored in the cloud is protected from the Cloud Service Provider (CSP) by performing encryption and index generation on the client side; the encrypted documents and index are then securely uploaded to the cloud, so that the CSP has no control over the uploaded documents. Finally, the performance of the proposed schemes, including search efficiency and search accuracy, is analyzed in detail through experiments on a real-world dataset. The results show that the proposed solution is efficient and effective in supporting synonym-based search.

V. FUTURE WORK

Future work will investigate semantics-based search approaches over encrypted cloud data that draw on other natural language processing (NLP) technology. Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input.

Information extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing.

To improve security, the access control policy will be enhanced. The term access control refers to the practice of restricting access to particular files. An access control system determines who is allowed to access an uploaded file. Access policies such as public and private are defined. A data owner who wishes to keep a file hidden from other users sets its access policy to private, in which case the file is stored in a separate folder and displayed only to that particular user. When the data owner sets the access policy of an uploaded file to public, the file is displayed to every data user who issues a search query related to the semantics of that file. Data users will also be given the facility to rate the documents they view.

Once a user submits a search query, all the documents that contain the exact query keyword or a semantically related keyword are listed. After accessing a file, the user can rate it higher or lower according to how relevant it is to the search query. If the same query is later submitted by any user, the documents with higher ratings are returned first. The aim of adding these features to the cloud search service is to let cloud consumers find the most relevant products or data using the designed system.
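One possible way to realize the rating-based ordering is sketched below. It is only an assumption about how this future feature might work: the RatingStore class, the use of the average rating per query and document, and the tie-breaking by similarity score are all hypothetical.

```python
from collections import defaultdict

class RatingStore:
    """Keeps the ratings users gave to a document for a particular query."""

    def __init__(self):
        self._ratings = defaultdict(list)  # (query, doc_id) -> list of ratings

    def rate(self, query, doc_id, rating):
        self._ratings[(query, doc_id)].append(rating)

    def average(self, query, doc_id):
        ratings = self._ratings.get((query, doc_id))
        return sum(ratings) / len(ratings) if ratings else 0.0

def rerank(query, scored_docs, store):
    """Order by average rating first, then by similarity score."""
    return sorted(
        scored_docs,
        key=lambda s: (store.average(query, s[0]), s[1]),
        reverse=True,
    )

store = RatingStore()
store.rate("cloud", "d2", 5)
reranked = rerank("cloud", [("d1", 0.9), ("d2", 0.4), ("d3", 0.5)], store)
```

In this example d2 has the lowest similarity score but was rated by an earlier user, so it moves to the front; unrated documents keep their similarity order.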
