WEB TECHNOLOGIES- SECURITY, SCALABILITY AND EFFICIENT SEARCH ENGINE IN THE LIBRARY


M. Paul N Vijaya Kumar

Faculty in Lib & Inf Science, DLISc, Andhra University, Visakhapatnam (India)

ABSTRACT

Technology advances, scenarios change, growth is exponential, and every day brings a new challenge. The question is how to incorporate these changes into Library Science and how to provide leadership with the help of cutting-edge technologies. Innovation and leadership are basic human tendencies: “The best innovation is the best leader”. Today we need to bring modern technologies into Library Science to enhance the users’ experience by integrating best practices from areas such as security, cloud computing, data mining and big data. New digital collections are stored in the cloud and on mobile and social networks, and are used in developing bleeding-edge applications. Such popular technologies and apps, combined with the availability of affordable services, make it possible to build communities, to store and analyze large collections of data, to create digital collections, and to access information and services in ways never thought of before.

As data grows, the task of the librarian becomes more complex and forces a focus on the most pressing concerns, such as security, mining of content, implementation of advanced search techniques and tools, and enhanced, scalable storage. Social media networks have introduced a new challenge: sharing, liking or unliking content in all forms, such as plain text, documents, files, audio and video. These new challenges must be met with quick solutions that give the user an ever better experience.

Keywords: Open Access, New Age Library Science, Digital Library and Ultra Modern Technology, Library Digitization with Cloud Resources, Libraries in Digital Format with Big Data, Social Media and Web in Library Science

I. INTRODUCTION

The modern digital library must be equipped with recent developments such as cloud storage services, privacy preservation in cloud storage systems, and an efficient search tool that can infer user search goals from click-through logs. There is also considerable focus on Big Data analytics, but adopting it implies complete changes to the storage and retrieval process, which requires every aspect of the existing system to be migrated to the new technology.

1.1 Cloud Storage Systems

Cloud computing is seen by many as the next big paradigm shift in information technology. The countless new commercial cloud solutions indicate that the cloud is being widely accepted. The benefits of moving to the cloud have already been discussed in multiple texts, e.g., [1]. The question that still needs to be answered is whether a secure transition to the cloud is possible for digital libraries that require data confidentiality.

Cloud security is a topic with many sides [2], but we focus on storage and security from the perspective of a user of a public cloud [3], whether a digital library, a company, a governmental organization or another body. The cause for alarm is rooted in the deep changes that the cloud paradigm brings to information technology. The most relevant change is that the user relinquishes control of its data to the cloud provider, which stores and processes it in its own data centers. The user may hesitate to trust the cloud provider, the unclear legal situation of data in the cloud, or both.

One of the most fundamental services offered by cloud providers is data storage. Let us consider a practical data application. A university digital library allows its members in the same group or department to store and share files in the cloud. By utilizing the cloud, the members can be completely relieved of troublesome local data storage and maintenance. However, this also poses a significant risk to the confidentiality of the stored files. Specifically, the cloud servers managed by cloud providers are not fully trusted by users, while the data files stored in the cloud may be sensitive and confidential. To preserve data privacy, a basic solution is to encrypt the data files and then upload the encrypted data into the cloud [4]. Unfortunately, designing an efficient and secure data sharing scheme for groups in the cloud is not an easy task, due to the following challenging issues.

Cloud providers are aware of the malicious insider threat and argue that they have solutions to mitigate the problem [5]. A first solution is not to allow physical access to the servers. However, most attacks are carried out remotely and do not require physical access. A second mechanism is a zero-tolerance policy for insiders who access the data stored in the cloud. This measure only takes effect after the attack has happened, which is not enough, for instance, in the case of an attacker who was fired or left voluntarily. For example, consider a person involved in developing an e-governance system, or a similar system that automates a country’s most crucial activities such as citizen services, security services and police services. If this person turns against the system, the result may be a disaster, because he or she may hold all the access codes and may even know some of the loopholes in the system.

The third mechanism consists of logging all accesses to the servers where the users’ data is stored. These logs are later used for internal audits, to find out whether employees are behaving according to privacy policies. Again, this solution only detects the attack after it happens, which may be too late. All these mechanisms are important and should be deployed, but they are not enough to prevent any of the attacks.

Scalability is a distinct concern in the digital library environment, in the areas of both storage and membership. Both the data and the subscriber groups grow exponentially, and the digital library must handle both in a balanced manner while preserving privacy for the data owners who contribute their content. As noted above, mitigation through a security framework alone is not enough protection; privacy-preserving rules must also be integrated.

The digital library must have an efficient internal search engine and must maintain click-through logs to understand user search goals; only then can content be delivered to the user with improved accuracy. At the same time, users’ sensitive data such as profiles, login details and security keys must be protected when subscribers of other social websites (Facebook, Twitter, LinkedIn, etc.) are integrated into the digital library system. The system must also have a method to analyze subscriber behavior. To infer user goals, clustering of feedback sessions is proposed. Clustering the feedback can effectively reflect user needs, so user expectations can be obtained conveniently from the ratings.

Many studies have investigated the analysis of user search goals. For a search engine, the task is to determine the number of diverse user search goals for a query and to describe each goal automatically with some keywords [6]. Here a feedback session is formed from the series of both clicked and un-clicked URLs in a session of the user click-through logs, ending with the last URL that was clicked. The works [7], [8], [9], [10], [11] demonstrate the use of logs in search engines, and the user goal can also be inferred from click-through data [12], [13]. It is more efficient to analyze the feedback sessions than to analyze the search results or clicked URLs directly. The works [14], [15] illustrate the inference of goals in a search engine using click-through data. Users provide their own outcome and then answer questions that may be predictive of that outcome [16]. Models that predict each user’s behavioral outcome are constructed against the growing dataset. Users also pose their own questions; when these are answered by other users, they become new independent variables in the modeling process.

The rest of the paper presents the study in three sections, each of which explains the importance of a key technology to be integrated into digital libraries. Section II studies cloud computing technologies and their secure use in the digital library. Section III presents the security framework to be built around the digital library content and subscription groups. Section IV proposes data mining techniques to understand user search goals and provide prioritized search results.

II. CLOUD COMPUTING SECURITY IN DIGITAL LIBRARY

To achieve secure data sharing for dynamic groups in the cloud, we expect to combine the group signature and dynamic broadcast encryption techniques. Specifically, the group signature scheme enables users to use the cloud resources anonymously, and the dynamic broadcast encryption technique allows data owners to securely share their data files with others, including newly joined users.

Unfortunately, in the dynamic broadcast encryption scheme each user has to compute revocation parameters to protect confidentiality from revoked users, so both the computational overhead of encryption and the size of the ciphertext increase with the number of revoked users. This heavy overhead and large ciphertext size may hinder adoption of the broadcast encryption scheme by capacity-limited users.

To tackle this challenging issue, we let the group manager compute the revocation parameters and make the result publicly available by migrating it into the cloud. Such a design can significantly reduce both the computation overhead of users for encrypting files and the ciphertext size. Specifically, the computation overhead of users for encryption operations and the ciphertext size are constant and independent of the number of revoked users.

2.1 Virtualization and IaaS Cloud Computing

The foundation of the IaaS cloud service model is native virtualization. In native virtualization, a hypervisor or virtual machine monitor provides an abstraction of the hardware resources to a set of virtual machines (VM) executing on top of it. A Virtual Machine includes its own operating system that runs above the hypervisor.

2.2 Attacking Confidentiality in the Cloud

A malicious insider can perform four attacks to access the user’s data. These attacks clearly demonstrate that it is currently possible to violate the confidentiality of the cloud user’s data. The attacks are available as


2.2.1 File Generation

To store and share a data file in the cloud, a group member performs the following operations:

1. Getting the revocation list from the cloud. In this step, the member sends the group identity ID_group as a request to the cloud. The cloud then responds to the member with the revocation list RL.

2. Verifying the validity of the received revocation list. First, the member checks whether the marked date is fresh. Second, the member verifies the contained signature sig(RL) using the equation e(W, f1(RL)) = e(P, sig(RL)). If the revocation list is invalid, the data owner stops this scheme.

3. Encrypting the data file M. This encryption process can be divided into two cases according to the revocation list.
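The control flow of step 2 can be sketched as follows. This is only an illustration under stated assumptions: the pairing check e(W, f1(RL)) = e(P, sig(RL)) is replaced by an HMAC tag from the Python standard library, which is a symmetric stand-in and not the publicly verifiable signature of the scheme. The function names, the "fresh means dated today" policy and the serialization of the list are all hypothetical.

```python
import hashlib
import hmac
from datetime import date


def sign_revocation_list(revoked_ids, marked_date, manager_key):
    """Group-manager side: tag the dated revocation list (HMAC stand-in)."""
    payload = (marked_date.isoformat() + "|" + ",".join(sorted(revoked_ids))).encode("utf-8")
    return hmac.new(manager_key, payload, hashlib.sha256).digest()


def verify_revocation_list(revoked_ids, marked_date, signature, today, manager_key):
    """Member side: accept the list only if it is fresh and its tag verifies."""
    # Freshness check (assumed policy: the list must be dated today).
    if marked_date != today:
        return False
    # Stand-in for the pairing equation: recompute the tag and compare
    # in constant time.
    expected = sign_revocation_list(revoked_ids, marked_date, manager_key)
    return hmac.compare_digest(expected, signature)
```

A member would call `verify_revocation_list` on the list returned by the cloud and proceed to encryption only when it returns `True`; otherwise the scheme stops, as described in step 2.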

III. CLOUD SECURITY BOOSTER TECHNOLOGY

Cloud Environments: Security Boosters

Current cloud service providers operate very large systems. They have sophisticated processes and expert personnel for maintaining their systems, which small enterprises may not have access to. As a result, there are many direct and indirect security advantages for the cloud users. Here we present some of the key security advantages of a cloud computing environment:

• Data Centralization: In a cloud environment, the service provider takes care of storage issues, and small businesses need not spend a lot of money on physical storage devices. Cloud-based storage also provides a way to centralize the data faster and potentially more cheaply. This is particularly useful for small businesses, which cannot spend additional money on security professionals to monitor the data.

• Incident Response: IaaS providers can put up a dedicated forensic server that can be used on demand. Whenever a security violation takes place, the server can be brought online. In some investigation cases, a backup of the environment can easily be made and put onto the cloud.

• Logging: In a traditional computing paradigm, logging is often an afterthought. In general, insufficient disk space is allocated, which makes logging either non-existent or minimal. In a cloud, however, the storage need for standard logs is met automatically.

3.1 Implementing a Secure Cloud

The following security measures represent general best practice implementations for cloud security. At the same time, they are not intended to be interpreted as a guarantee of success. Please consult with your IBM security services representative to identify the best practice guidance for your specific implementation requirements.

• Implement and maintain a security program.

• Build and maintain a secure cloud infrastructure.

• Ensure confidential data protection.

• Implement strong access and identity management.

• Implement a vulnerability and intrusion management program.


3.2 Ensuring Confidential Data Protection in Cloud

Data protection is a core principle of information security. All of the prevalent information security regulations and standards, as well as the majority of industry best practices, require that sensitive information be adequately protected in order to preserve confidentiality. Confidentiality of such data is required no matter where that data resides in the chain of custody, including the cloud environment.

3.3 Privacy Preserving in Cloud Storage System

Using cloud storage, users can remotely store their data and enjoy the on-demand high-quality applications and services from a shared pool of configurable computing resources, without the burden of local data storage and maintenance. However, the fact that users no longer have physical possession of the outsourced data makes the data integrity protection in cloud computing a formidable task, especially for users with constrained computing resources. Moreover, users should be able to just use the cloud storage as if it is local, without worrying about the need to verify its integrity. Thus, enabling public auditability for cloud storage is of critical importance so that users can resort to a third-party auditor (TPA) to check the integrity of outsourced data and be worry free.

To securely introduce an effective TPA, the auditing process should bring in no new vulnerabilities toward user data privacy and introduce no additional online burden to the user. A secure cloud storage system supporting privacy-preserving public auditing is described in [19]. We further extend our result to enable the TPA to perform audits for multiple users simultaneously and efficiently. Extensive security and performance analysis shows that the proposed schemes are provably secure and highly efficient. Our preliminary experiment, conducted on our software developed for the digital library, further demonstrates the fast performance of the design.
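The idea of an auditor checking integrity without reading the data can be illustrated with a much simpler technique than the scheme of [19]: precomputed challenge tokens. In this sketch, which is an assumption for illustration and not the construction of [19], the data owner hashes each block together with a random nonce before upload and hands only the (index, nonce, digest) triples to the auditor. The auditor later asks the cloud to recompute a digest and compares the two values, so it never sees block contents. All function names are hypothetical, and each token should be used only once.

```python
import hashlib
import secrets


def make_audit_tokens(blocks):
    """Owner side: one single-use (index, nonce, digest) token per block."""
    tokens = []
    for index, block in enumerate(blocks):
        nonce = secrets.token_bytes(16)
        digest = hashlib.sha256(nonce + block).hexdigest()
        tokens.append((index, nonce, digest))
    return tokens


def cloud_respond(stored_blocks, index, nonce):
    """Cloud side: prove possession of a block by hashing it with the nonce."""
    return hashlib.sha256(nonce + stored_blocks[index]).hexdigest()


def audit(tokens, ask_cloud):
    """Auditor side: replay every challenge and compare digests only."""
    return all(ask_cloud(index, nonce) == digest for index, nonce, digest in tokens)
```

Here `ask_cloud` is a callable standing in for the network request to the storage provider. Unlike the scheme of [19], this simple approach supports only a bounded number of audits and offers no batch auditing.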

IV. SEARCH ENGINE IN DIGITAL LIBRARY

Too many organizations suffer from customer amnesia, as though they have forgotten how to have routine conversations with their customers. When it comes to new product development, these organizations jump straight to product design, assuming they know what the customer expects, and then ship the finished product as soon as possible. For a product to reach its market successfully, the organization should find out what problems it can solve for the customer before designing the product, and should get early feedback on new product concepts by showing customers initial prototypes.

This feedback can be collected from customers by setting up questions. Once the organization has a system for collecting new product ideas and suggestions, it is easier to shape the product. Customers are a great resource for product development feedback. Through product development, the organization aims to meet the changing needs of its customers and to increase total customer spend. By offering more products and services, it hopes to increase profits.

Customer feedback provides an organization with valuable information that can be used to better position services or products in the marketplace. Still, some companies do not ask the customers who buy their products and services what they want and need, and many companies do not incorporate customer suggestions into the product development process. Several reasons could explain why some organizations do not include customer feedback in their product development and service improvement programs. Perhaps they do not realize that customer excellence is impossible to achieve without knowing or understanding what customers expect. Have they forgotten that the goal of collecting customer feedback regularly and proactively is to consistently exceed customer expectations? Maybe they are not aware that customer feedback programs can be used to create products or services that will ensure business success. Hence customer suggestions are considered very important in the development of a new product.

To sum up, our work has three major contributions as follows:

• We propose a framework to infer different user search goals for a query by clustering feedback sessions. We demonstrate that clustering feedback sessions is more efficient than clustering search results or clicked URLs directly. Moreover, the distributions of different user search goals can be obtained conveniently after feedback sessions are clustered.

• We propose a novel optimization method to combine the enriched URLs in a feedback session to form a pseudo-document, which can effectively reflect the information need of a user. Thus, we can tell what the user search goals are in detail.

• We propose a new criterion CCA to evaluate the performance of user search goal inference based on restructuring web search results. Thus, we can determine the number of user search goals for a query.
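The first contribution, clustering feedback sessions and reading off the distribution of search goals, can be sketched in a few lines. The text does not name a clustering algorithm, so plain k-means is used here purely as an assumption; the feature vectors, the naive initialization from the first k points, and the function names are likewise hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster feature vectors; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic init
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def goal_distribution(assignment, k):
    """Fraction of feedback sessions that fall under each inferred goal."""
    return [assignment.count(c) / len(assignment) for c in range(k)]
```

Each point stands for one feedback session (for example, its pseudo-document vector), each centroid for one inferred search goal, and `goal_distribution` gives the share of sessions per goal.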

4.1 Analyzing User Behavior

Market research is often needed to establish what the customer really wants. Analyzing customer behavior helps organizations improve their marketing strategies by understanding how customers think and select between different alternatives. Customer motivation and decision strategies differ between products that differ in their level of importance. When consumer behavior and marketing strategy are brought together, marketers can expect success in profit and sales, competitive sustainability and higher profit in the marketplace.

The benefit of using consumer behavior to create a marketing strategy is the knowledge marketers gain about the needs and values of their target market. Once marketers understand this, their message will be delivered to the correct target in the marketplace, resulting in a sale. The machine science model for analyzing customer behavior was introduced in [20].

Customer behavior is analyzed through questions posed by the customers about the products. The investigator checks whether each question is suitable; if it is, the question is selected and added by the investigator. Other customers can then propose their answers or responses to these questions. The behavioral outcome of customers is demonstrated in [21] and [22]. These responses can be analyzed to predict customer behavior. This effectively reflects user needs and expectations, which helps in new product development and improves market sales.

The proposed feedback session consists of both clicked and un-clicked URLs and ends with the last URL that was clicked in a single session. The motivation is that, before the last click, all the URLs have been scanned and evaluated by the user. Since users generally scan the URLs one by one from top to bottom, we can consider that besides the ‘n’ clicked URLs, the ‘m’ un-clicked ones before the last click have also been browsed and evaluated by the user, and they should reasonably be a part of the user feedback. Inside the feedback session, the clicked URLs tell what users require and the un-clicked URLs reflect what users do not care about. Note that the un-clicked URLs after the last clicked URL should not be included in the feedback session, since it is not certain whether they were scanned or not.


Each feedback session can tell what a user requires and what he/she does not care about. Moreover, there are plenty of diverse feedback sessions in user click-through logs. Therefore, for inferring user search goals, it is more efficient to analyze the feedback sessions than to analyze the search results or clicked URLs directly.
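The construction of a feedback session described above can be sketched as follows. The function name and the representation of a session as a list of (URL, clicked) pairs are assumptions made for illustration.

```python
def build_feedback_session(ranked_urls, clicked_urls):
    """Return [(url, was_clicked), ...] up to and including the last click.

    URLs ranked after the last clicked one are dropped, because it is not
    certain that the user scanned them.
    """
    clicked = set(clicked_urls)
    # Position of the last clicked URL in the ranked list (-1 if no clicks).
    last_click = max(
        (i for i, url in enumerate(ranked_urls) if url in clicked),
        default=-1,
    )
    return [(url, url in clicked) for url in ranked_urls[: last_click + 1]]
```

For a result list of five URLs in which the second and fourth were clicked, the session contains the first four URLs, two marked as clicked and two as un-clicked, while the fifth is excluded.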

4.2 Restructuring Web Search Results

Since search engines often return millions of search results, it is necessary to organize them to make it easier for users to find what they want. Restructuring web search results is an application of inferring user search goals. We first introduce how to restructure web search results using the inferred user search goals; the evaluation based on restructuring web search results is then described. The inferred user search goals are represented by vectors, and the feature representation of each URL in the search results can be computed. We can then categorize each URL into a cluster centered on one of the inferred search goals. In this paper, we perform the categorization by choosing the smallest distance between the URL vector and the user-search-goal vectors. In this way, the search results can be restructured according to the inferred user search goals.
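The categorization step can be sketched as a nearest-vector assignment. The text only says "smallest distance", so Euclidean distance is an assumption here, as are the function name and the toy two-dimensional vectors.

```python
import math


def restructure(url_vectors, goal_vectors):
    """Group URLs under the index of the nearest user-search-goal vector."""
    groups = {g: [] for g in range(len(goal_vectors))}
    for url, vec in url_vectors.items():
        # Pick the goal whose vector is closest to this URL's feature vector.
        nearest = min(
            range(len(goal_vectors)),
            key=lambda g: math.dist(vec, goal_vectors[g]),
        )
        groups[nearest].append(url)
    return groups
```

Given the feature vectors of the returned URLs and the goal vectors inferred earlier, `restructure` yields one list of URLs per goal, which can be presented to the user as grouped search results.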

V. CONCLUSION

In this paper, we design a secure data sharing scheme for dynamic groups in an untrusted cloud for a modern digital library. In this system, a user is able to share data with others in the group without revealing his or her identity to the cloud. Additionally, it supports efficient user revocation and the joining of new users. More specifically, efficient user revocation can be achieved through a public revocation list without updating the private keys of the remaining users, and new users can directly decrypt files that were stored in the cloud before their participation.

Moreover, the storage overhead and the encryption computation cost are constant. Our proposed system satisfies the desired security requirements and guarantees efficiency as well.

REFERENCES

[1] S. Yu, C. Wang, K. Ren, and W. Lou, “Achieving Secure, Scalable, and Fine-Grained Data Access Control in Cloud Computing,” Proc. IEEE INFOCOM, pp. 534-542, 2010.

[2] R. Lu, X. Lin, X. Liang, and X. Shen, “Secure Provenance: The Essential of Bread and Butter of Data Forensics in Cloud Computing,” Proc. ACM Symp. Information, Computer and Comm. Security, pp. 282-292, 2010.

[3] E. Goh, H. Shacham, N. Modadugu, and D. Boneh, “Sirius: Securing Remote Untrusted Storage,” Proc. Network and Distributed Systems Security Symp. (NDSS), pp. 131-145, 2003.

[4] S. Kamara and K. Lauter, “Cryptographic Cloud Storage,” Proc. Int’l Conf. Financial Cryptography and Data Security (FC), pp. 136-149, Jan. 2010.

[5] V. Goyal, O. Pandey, A. Sahai, and B. Waters, “Attribute-Based Encryption for Fine-Grained Access Control of Encrypted Data,” Proc. ACM Conf. Computer and Comm. Security (CCS), pp. 89-98, 2006.

[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, 1999.

[7] S. Beitzel, E. Jensen, A. Chowdhury, and O. Frieder, “Varying Approaches to Topical Web Query Classification,” Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development (SIGIR ’07), pp. 783-784, 2007.

[8] T. Joachims, “Evaluating Retrieval Performance Using Clickthrough Data,” Text Mining, J. Franke, G. Nakhaeizadeh, and I. Renz, eds., pp. 79-96, Physica/Springer Verlag, 2003.

[9] T. Joachims, “Optimizing Search Engines Using Clickthrough Data,” Proc. Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’02), pp. 133-142, 2002.

[10] R. Jones, B. Rey, O. Madani, and W. Greiner, “Generating Query Substitutions,” Proc. 15th Int’l Conf. World Wide Web (WWW ’06), pp. 387-396, 2006.

[11] R. Jones and K.L. Klinkner, “Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs,” Proc. 17th ACM Conf. Information and Knowledge Management (CIKM ’08), pp. 699-708, 2008.

[12] C.-K Huang, L.-F Chien, and Y.-J Oyang, “Relevant Term Suggestion in Interactive Web Search Based on Contextual Information in Query Session Logs,” J. Am. Soc. for Information Science and Technology, vol. 54, no. 7, pp. 638-649, 2003.

[13] T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, “Accurately Interpreting Clickthrough Data as Implicit Feedback,” Proc. 28th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’05), pp. 154-161, 2005.

[14] U. Lee, Z. Liu, and J. Cho, “Automatic Identification of User Goals in Web Search,” Proc. 14th Int’l Conf. World Wide Web (WWW ’05), pp. 391-400, 2005.

[15] X. Li, Y.-Y Wang, and A. Acero, “Learning Query Intent from Regularized Click Graphs,” Proc. 31st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’08), pp. 339-346, 2008.

[16] R. Baeza-Yates, C. Hurtado, and M. Mendoza, “Query Recommendation Using Query Logs in Search Engines,” Proc. Int’l Conf. Current Trends in Database Technology (EDBT ’04), pp. 588-596, 2004.

[17] X. Liu, Y. Zhang, B. Wang, and J. Yan, “Mona: Secure Multi-Owner Data Sharing for Dynamic Groups in the Cloud,” IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 6, June 2013.

[18] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A View of Cloud Computing,” Comm. ACM, vol. 53, no. 4, pp. 50-58, Apr. 2010.

[19] C. Wang, Q. Wang, K. Ren, and W. Lou, “Privacy-Preserving Public Auditing for Storage Security in Cloud Computing,” Proc. IEEE INFOCOM ’10, Mar. 2010.

[20] J. Evans and A. Rzhetsky, “Machine Science,” Science, vol. 329, no. 5990, p. 399, 2010.

[21] L. Barness, J. Opitz, and E. Gilbert-Barness, “Obesity: Genetic, Molecular, and Environmental Aspects,” American Journal of Medical Genetics Part A, vol. 143, no. 24, pp. 3016-3034, 2007.

[22] T. Parsons, C. Power, S. Logan, and C. Summerbell, “Childhood Predictors of Adult Obesity: A Systematic Review,” International Journal of Obesity and Related Metabolic Disorders: Journal of the International Association for the Study of Obesity, vol. 23, p. S1, 1999.