Private Content Based Image Information Retrieval using Map-Reduce

Private Content Based Image Information

Retrieval using Map-Reduce

A Dissertation

submitted in fulfillment for the award of

the degree

Master of Technology,

Computer Engineering

Submitted by

Arpit D. Dongaonkar

MIS No: 121122006

Under the guidance of

Prof. Sunil B. Mane

College of Engineering, Pune

DEPARTMENT OF COMPUTER ENGINEERING AND

INFORMATION TECHNOLOGY,

COLLEGE OF ENGINEERING, PUNE-5

DEPARTMENT OF COMPUTER ENGINEERING

AND

INFORMATION TECHNOLOGY,

COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation entitled

“Private Content Based Image Information Retrieval using

Map-Reduce”

has been successfully completed

by

Mr. Arpit Dilip Dongaonkar

MIS No. 121122006

as a fulfillment of End Semester evaluation of

Master of Technology.

SIGNATURE
Prof. Sunil B. Mane
Project Guide
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune

SIGNATURE
Dr. J. V. Aghav
Head
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune

Dedicated to My Mother

Sunita D. Dongaonkar,

My Father

Prof. Dilip R. Dongaonkar

for their love and everything, that they have given me... and also to

All family members

Acknowledgements

I would like to express my sincere gratitude to Prof. Sunil B. Mane, my research guide, for his patient guidance, enthusiastic encouragement and useful critiques of this research work. Without his invaluable guidance, this work would not have been successful.

I would also like to thank my classmate Vikas Jadhav for continuous and detailed knowledge sharing on Hadoop, from installation to operation. I would also like to thank all my M.Tech friends for their unconditional support and help.

Last but not least, my sincere thanks to all my teachers who directly or indirectly helped me to learn new technologies and complete my post graduation successfully.

Arpit Dilip Dongaonkar College of Engineering, Pune

Abstract

The role of information technology is changing, and this change is driving the development of real-life software applications. Information technology has entered every part of day-to-day life: traffic signals, power grids and everyday financial transactions are all handled by it. But this technology comes at a cost. As it touches every layer of life and community, there is a need to make it available to every layer of the community.

Cloud computing is an emerging technology which reduces this cost while providing an efficient information technology platform as well as service. The cloud reduces the load on the local hard disk, operating system and eventually the CPU. Cloud providers offer infrastructure, platforms, databases etc. on which users install their operating system and required software and store their data. But this service also comes at a cost, since cloud providers charge on a pay-per-hour basis. So any application deployed on the cloud must be fast, efficient and secure.

As the Internet expands, the amount of data created over it is also increasing exponentially. A major part of this data consists of images. Today a large amount of varied image data is produced through digital cameras, mobile phones, photo editing software etc. These images are private to a particular user. This digital image data should be secured so that it cannot be accessed by anyone except that user.

Some domains use content based image retrieval applications. These applications should be fast enough to carry out user functionality efficiently, and secure enough to protect user data. If an application is to be deployed on the cloud, it should be supported by the cloud structure. One technology which is supported by the cloud and performs distributed parallel computing is “Map Reduce”.

his continuously expanding image database. This system incorporates an encryption technique for security and the Map-Reduce technique for efficient upload, search and retrieval of images over a large dataset of user images.

The proposed solution is a system to securely upload, search and query images in massive image data storage using a content based image retrieval scheme built on the Map-Reduce technique. The proposed solution is evaluated and compared with existing solutions.

1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique

2 Literature Survey
2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.2 Future Scope/Issues
2.3 An Oblivious Image Retrieval Protocol [3]
2.3.1 System Model
2.3.2 Future Scope/Issues
2.4 Distributed Image Retrieval System Based on MapReduce [4]
2.4.1 System Model
2.4.2 Future Scope/Issues
2.5 Summary

3 Motivation
3.1 Problem Definition
3.2 Scope of Research
3.3 Objectives
3.4 System Requirement Specifications

4 System Design
4.1 Complete Architecture
4.1.1 Upload Images
4.1.2 Image Retrieval

5 Implementation of System
5.1 Log In Screen
5.2 Create Account
5.3 User Operation Screen
5.3.1 Changes in Hadoop
5.3.2 Upload Process
5.3.3 Image Retrieval

6 Experiments, Testing and Result Analysis
6.1 Cluster Configuration
6.2 Experiments and Testing
6.3 Result Analysis
6.5 Comparison of Different Systems

7 System Output
7.1 Input Images

8 Conclusions and Future Scope
8.1 Conclusion
8.2 Future Scope

List of Tables

6.1 Cluster Configuration
6.2 Second Cluster Configuration

List of Figures

1.1 Map Reduce Process
4.1 System Architecture
4.2 Upload Process
4.3 Image Retrieval Process
5.1 LogIn Screen
5.2 Create Account Screen
5.3 Account Creation Screen - Validation
5.4 User Operations Screen
5.5 Upload Screen
5.6 Uploading Screen
5.7 Process behind Upload
5.8 Search Screen
5.9 Searching Screen
5.10 Processes behind Search Screen
6.1 One Node Cluster Difference at Upload MapReduce
6.2 One Node Cluster Difference at Retrieval MapReduce
6.3 Three Node Cluster Difference at Upload MapReduce
6.4 Three Node Cluster Difference at Retrieval MapReduce
6.5 Performance of Upload MapReduce at Cluster
6.6 Performance of Retrieval MapReduce at Cluster
6.7 Scale Up between Different Cluster Nodes
6.8 Scale Up between Different Cluster Nodes
6.10 Error Check at Upload Images
6.11 Error Check at Search Images
6.12 Error Check at Search Images
7.1 Input Set of Images Uploaded
7.2 Input Search Image

Chapter 1 INTRODUCTION

1.1 Private Information Retrieval

Today most data on the Internet resides in database servers connected to the Internet. A user who queries a database should get the required information. However, some curious users query a database, intentionally or unintentionally, to obtain private information about other users. A lot of research is devoted to protecting databases from such curious users. Private information retrieval is a research area in which security against this type of intrusion is considered. A private information retrieval (PIR) protocol allows a user to retrieve an item from a server in possession of a database without revealing which item he/she is retrieving. Over time, various improvements to PIR have been proposed which make it faster, more efficient and more secure. Techniques such as oblivious transfer have also been implemented in PIR; this restricts the user from obtaining information about other database items [3]. PIR also keeps queries private from the database.

1.2 Image Retrieval

An image retrieval system is designed to browse, search and retrieve images from a large database of digital images. Most traditional and common methods of image retrieval add metadata such as captions, keywords, or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, a large amount of research has been done on automatic image annotation [2]. Annotated images are searched based on the image's metadata.

Image Meta search: Search of images based on associated metadata such as keywords, text, etc.

Content-based image retrieval (CBIR): The application of computer vision to image retrieval. CBIR aims at avoiding the use of textual descriptions and instead retrieves images based on similarities in their contents (textures, colors, shapes etc.) to a user-supplied query image or user-specified image features.

1.2.1 Content Based Image Retrieval

Image retrieval has been an active research area since the 1970s. In the beginning, research concentrated on text based search only. This was a new framework at the time, which used the names of image files as search criteria. In this framework, images first had to be annotated and were then retrieved through a database management system [5]. The framework had two limitations: first, the size of image data that could fit into the database, and second, the extensive manual annotation work. There was a need for retrieval research that was not limited by manual entry and large data size. This led to the Content Based Image Retrieval technique.

Content-based image retrieval (CBIR), also known as query by image content (QBIC) [6], is a technique in which the content of an image is used as the matching criterion instead of the image's metadata such as keywords, tags, or any name associated with the image. This provides a closer match than text based image retrieval [7], [8], [9], [10].

The term 'content' in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself [6], [8], [10]. Image content can be categorized into visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationship, etc. Domain specific visual content, like human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either by textual annotation or by complex inference procedures based on visual content [8].

COLOR: Image retrieval based on color means retrieval based on color descriptors. The most commonly used color descriptors are the color histogram, color coherence vector, color correlogram, and color moments [9]. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments comprise the first-order (mean), second-order (variance) and third-order (skewness) moments of an image [8].

TEXTURE: The texture of an image is the set of visual patterns that the image possesses and how they are spatially defined. Textures are represented by texels, which are then placed into a number of sets depending on how many textures are detected in the image. These sets define not only the texture but also where in the image the texture is located. Statistical methods such as co-occurrence matrices can be used for a quantitative measure of the arrangement of intensities in a region [8], [12].

SHAPE: Shape does not mean the shape of the image itself but the shape of a particular region or object within it. Segmentation and edge detection are prominent techniques that can be used in shape detection [6].

Several components are calculated from image content, such as color intensity, entropy and mean of the image, which are useful in the creation of a feature vector. The feature vector of every image is calculated and stored in the database. When a user wants to retrieve a set of images, he queries the system with an image. The feature vector of the query image is matched against the vectors of the stored images, and images whose vectors are similar to that of the query image are retrieved. The user then selects the required image from the set of images shown.
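As an illustration of how such a feature vector might be built, the following minimal sketch computes a normalized color histogram and the per-channel color moments from a list of RGB pixels. The function names, the choice of 4 bins per channel and the pixel representation are assumptions made for this example only; they are not taken from the implemented system.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram.

    pixels: list of (r, g, b) tuples with values in 0..255.
    bins_per_channel: assumed to divide 256 evenly (hypothetical default).
    """
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = len(pixels)
    # Dividing by the pixel count makes the histogram independent of image size.
    return [count / total for count in hist]


def color_moments(pixels):
    """Mean, variance and skewness for each of the R, G and B channels."""
    n = len(pixels)
    moments = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        skewness = third / variance ** 1.5 if variance > 0 else 0.0
        moments.extend([mean, variance, skewness])
    return moments


def feature_vector(pixels):
    """Concatenate histogram and moments into one feature vector."""
    return color_histogram(pixels) + color_moments(pixels)
```

With the defaults above, `feature_vector` returns 64 histogram entries followed by 9 moment values, so two images of different sizes still yield vectors of the same length that can be compared with a distance measure.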

1.3 Similarity Measurement

A similarity measurement is selected to determine how similar two vectors are. The problem can be converted to computing the discrepancy between two vectors $x, y \in \mathbb{R}^d$.

Three distance measurements, the Euclidean, Mahalanobis, and chord distances, are reviewed as follows [11].

1.3.1 Euclidean Distance

The Euclidean distance between $x, y \in \mathbb{R}^d$ is computed by [11]

$$\delta(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$

1.3.2 Mahalanobis Distance

The Mahalanobis distance between two vectors $x$ and $y$ with respect to the training patterns $x_i$ is computed by [11]

$$\delta(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}$$

where the mean vector $u$ and the sample covariance matrix $S$ from the sample $\{x_i \mid 1 \le i \le n\}$ of size $n$ are computed by [11]

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - u)(x_i - u)^t \quad \text{with} \quad u = \frac{1}{n} \sum_{i=1}^{n} x_i$$

1.3.3 Chord Distance

The chord distance between two vectors $x$ and $y$ measures the distance between the projections of $x$ and $y$ onto the unit sphere, which can be computed by [11]

$$\delta(x, y) = \left\| \frac{x}{\|x\|_2} - \frac{y}{\|y\|_2} \right\|_2$$
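The three measures can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, vectors are plain lists of floats, and the inverse covariance matrix `s_inv` is assumed to be computed beforehand from the training patterns and passed in by the caller.

```python
import math


def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis(x, y, s_inv):
    """Mahalanobis distance; s_inv is the inverse sample covariance matrix
    given as a list of rows (assumed precomputed by the caller)."""
    diff = [a - b for a, b in zip(x, y)]
    d = len(diff)
    quad = sum(diff[i] * s_inv[i][j] * diff[j] for i in range(d) for j in range(d))
    return math.sqrt(quad)


def chord(x, y):
    """Euclidean distance between x and y after projecting both onto the unit sphere.
    Both vectors are assumed to be non-zero."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return euclidean([a / norm_x for a in x], [b / norm_y for b in y])
```

When `s_inv` is the identity matrix the Mahalanobis distance reduces to the Euclidean distance, and the chord distance between two vectors pointing in the same direction is zero regardless of their lengths.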

1.4 Hadoop

Apache Hadoop is a collection of open-source software projects for reliable, scalable, distributed computing. The Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage [13].

Some modules of Hadoop are listed below [13]:

1. Hadoop Common: The common utilities that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

3. Hadoop YARN: A framework for job scheduling and cluster resource management.

4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

5. HBase: A scalable, distributed database that supports structured data storage for large tables.

1.4.1 MapReduce

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of commodity computers (nodes). The computation performed on the nodes is independent of their physical location, i.e. whether all nodes are on the same network and use similar hardware (called a cluster) or are shared across geographically and administratively distributed systems and use more heterogeneous hardware (called a grid). It is also independent of the location of the data: data can be stored at any node in the cluster.

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.

The map function, written by the programmer, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all values associated with the same intermediate key and passes them to the reduce function.

The reduce function, also written by the programmer, accepts an intermediate key and the set of values generated for that key by the map function. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation, and this is the actual output of the user program.

map (k1, v1) → list(k2, v2)

reduce (k2, list(v2)) → list(k3, v3)

Computational processing can occur on data stored either in a file system (unstructured/HDFS) or in a database (structured/HBase).

Map: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node [14].

Reduce: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, i.e. the answer to the problem it was originally trying to solve [14].

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or the reduction function is associative.

Figure 1.1: Map Reduce Process.

Example [15]

The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

map_Word(String key_file, String value_line):
    // key_file: text file name, value_line: text file contents
    for each word m in value_line:
        EmitIntermediateval(m, “1”);

reduce_Word(String key_file, Iterator value_line):
    // key_file: a word from text file, value_line: a list of counts of each word
    int total_count = 0;
    for each word t in value_line:
        total_count += ParseInt(t);
    Emit(AsString(total_count));
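The same word count can be run locally as a small Python sketch. This is not Hadoop code: the `run_job` driver is a hypothetical stand-in that simulates the grouping (shuffle) step of the MapReduce library in memory, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_word(key_file, value_line):
    """Map: emit (word, 1) for every word in the file contents."""
    for word in value_line.split():
        yield word, 1


def reduce_word(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(documents):
    """In-memory stand-in for the MapReduce library.

    documents: dict mapping file name to file contents.
    """
    grouped = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_word(name, text):
            grouped[key].append(value)  # group values by intermediate key
    return dict(reduce_word(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job({"a.txt": "to be or not to be", "b.txt": "be"}))
```

In a real cluster the map calls run on the nodes holding each input split and the grouping is performed by the framework, but the contract between the two user-written functions is the same as in this sketch.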

1.4.2 HDFS

The Hadoop Distributed File System (HDFS) [16] is another module of the Hadoop project. It is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and works on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets [16].

1.4.3 HBase

HBase runs on top of HDFS: it is a distributed column-oriented database built on HDFS [17]. It is the Hadoop application to use when a user requires real-time random read/write access to very large datasets.

HBase can scale linearly just by adding nodes. It is not relational and does not support SQL, but given the proper problem space, it is able to do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware [17].

1.5 Oblivious Technique

The oblivious retrieval technique is designed to achieve both user and database privacy. It is built on a homomorphic encryption technique, which allows operations such as addition or multiplication to be performed on encrypted data without exposing the original plaintext. In this application homomorphic encryption is applied to image feature data.

Today many website firms outsource their data to external storage servers. In this case, it is necessary to maintain the privacy of the user with respect to the external server and vice versa. The oblivious technique is one such attempt to maintain this type of security: the privacy of the user with respect to the server, as well as the privacy of the server with respect to the user, is maintained with the use of homomorphic encryption. The homomorphic encryption used here is an asymmetric cryptography technique that works on a public and private key pair. It allows specific types of computations to be carried out on ciphertext to obtain an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext.

For instance, one person could add two encrypted numbers and then another person could decrypt the result, without either of them being able to find the value of the individual numbers.

Using the public key of the user, the feature vector is stored in encrypted form in the database. When the user inputs a query, it goes to the database in encrypted form, and images are searched and retrieved on the basis of similar encrypted feature vectors. The encrypted image data is sent to the user, who decrypts it using his private key. The cryptographic technique used in this system is the Paillier cryptosystem. In the Paillier cryptosystem, the homomorphic property is supported as follows.

Homomorphic addition of plaintexts: The product of two ciphertexts decrypts to the sum of their corresponding plaintexts,

$$D(E(m_1, r_1) \cdot E(m_2, r_2) \bmod n^2) = (m_1 + m_2) \bmod n$$

Homomorphic Encryption Algorithms:

1. Simple RSA Algorithm
2. ElGamal Cryptosystem
3. Paillier Cryptosystem
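The additive property of the Paillier cryptosystem can be checked with a toy implementation. The sketch below is for illustration only and is not the code used in this system: it assumes the common simplification g = n + 1, uses tiny hard-coded primes that offer no security, and all function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow(x, -1, n)`.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation from two distinct primes (assumed given)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                                          # common simplified generator
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, m):
    """E(m, r) = g^m * r^n mod n^2 with a random r coprime to n."""
    n, g = public_key
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public_key, private_key, c):
    """D(c) = L(c^lam mod n^2) * mu mod n."""
    n, _ = public_key
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Homomorphic addition: multiply the ciphertexts modulo n^2."""
    n, _ = public_key
    return (c1 * c2) % (n * n)
```

For example, with `pub, priv = keygen(61, 53)`, decrypting `add_encrypted(pub, encrypt(pub, 123), encrypt(pub, 456))` yields 579, even though the party performing the multiplication never sees 123 or 456.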

Chapter 2 Literature Survey

2.1 PIRMAP [1]

Private Information Retrieval (PIR) allows for retrieval of bits from a database in a way that hides a user's access pattern from the server. This paper presents PIRMAP, a practical, highly efficient protocol for PIR in MapReduce, a widely supported cloud computing API. PIRMAP focuses especially on the retrieval of large files from the cloud, where it achieves optimal communication complexity (O(l) for retrieval of an l bit file) with query times significantly faster than previous schemes.

2.1.1 PIR

When a user connects to a cloud, there is a series of interactions between the user and the server. The user connects to the cloud to retrieve his private information stored there. The information can be in the form of different files.

Suppose the user wants to retrieve x files of his choice out of his n files, and the server should not learn which files were chosen. The user has to send the cloud a query such that he receives the required files and the transaction remains secure.

2.1.2 MapReduce in PIRMAP

This information retrieval technique uses MapReduce to parallelize the process, which reduces the computational time needed to retrieve a file from the cloud and makes it easy to use.

The first phase is called the “Map” phase. MapReduce automatically splits the input computation equally among the available nodes in the cloud data center, and each node then runs a function called map on its respective piece (called an InputSplit). It is important to note that the splitting actually occurs when the data is uploaded into the cloud. This means that each “mapper” node has local access to its InputSplit as soon as computation starts, which avoids a lengthy copying and distribution period. The map function runs a user-defined computation on each InputSplit and outputs (emits) a number of key-value pairs that go into the next phase.

The second phase, “Reduce”, takes as input all of the key-value pairs emitted by the mappers and sends them to “reducer” nodes in the data center. Specifically, each reducer node receives a single key, along with the sequence of values emitted by the mappers which share that key. The reducers then take each set and combine it in some way, emitting a single value for each key [1].

2.1.3 Working of PIRMAP

PIRMAP is an extension of the PIR protocol by Kushilevitz and Ostrovsky, targeting the retrieval of large files in a parallelization-aggregation computation framework such as MapReduce. We start by giving an overview of PIRMAP, which can be used with any additively homomorphic encryption scheme.

Upload: In the following, we assume that the cloud user has already uploaded its files into the cloud using the interface provided to them by the cloud provider.

is l bits in length. There is also an additional parameter k which is the block size of the chosen cipher. For ease of presentation, we will consider the case where all files are the same length, but PIRMAP can easily be extended to accommodate variable length files by padding or prepending each file with a few bytes that specify its length.
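The general idea of PIR over an additively homomorphic scheme with a map and a reduce step can be sketched as follows. This is a simplified illustration and not the PIRMAP implementation: it assumes a toy Paillier scheme with g = n + 1 and insecure small primes, files of equal length split into small integer blocks, and hypothetical function names throughout. The user sends an encryption of 1 for the wanted file and encryptions of 0 for all others; each mapper raises the query ciphertext to the power of the block value, and the reducer multiplies the results for each block position.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier keys (g = n + 1 assumed); returns n and the private pair."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))


def encrypt(n, m):
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def make_query(n, num_files, wanted):
    """User side: E(1) at the wanted index, E(0) elsewhere."""
    return [encrypt(n, 1 if i == wanted else 0) for i in range(num_files)]


def map_phase(n, query, files):
    """Mapper: for file i and block j emit (j, E(v_i)^block = E(v_i * block))."""
    n2 = n * n
    for i, blocks in enumerate(files):
        for j, block in enumerate(blocks):
            yield j, pow(query[i], block, n2)


def reduce_phase(n, pairs):
    """Reducer: multiply all ciphertexts sharing a block index j."""
    n2 = n * n
    products = {}
    for j, c in pairs:
        products[j] = products.get(j, 1) * c % n2
    return [products[j] for j in sorted(products)]


if __name__ == "__main__":
    n, private_key = keygen(61, 53)
    files = [[1, 2, 3], [10, 20, 30], [7, 8, 9]]   # block values must be below n
    answer = reduce_phase(n, map_phase(n, make_query(n, len(files), 1), files))
    print([decrypt(n, private_key, c) for c in answer])
```

Because every query entry is a fresh ciphertext, the server processes all files in the same way and cannot tell which index was requested, while the user recovers exactly the blocks of the chosen file after decryption.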

2.1.4 Future Scope/Issues

This implementation is specifically for files which contain textual data; it does not support multimedia data such as image, video or audio files.

2.2 Map/Reduce in CBIR Application [2]

Initially, image retrieval depended mainly on text based retrieval. This technology is widely used: mainstream search engines such as Google, Baidu and Yahoo mainly used this method to search for images. In this technique, the name of the image file was compared in order to retrieve the image from the database.

But it has drawbacks: researchers often need to manually annotate all images with text, and this text cannot objectively and accurately describe the visual information of the images.

Content-Based Image Retrieval, hereafter called CBIR, emerged after the 1990s. Unlike text based image retrieval (TBIR), it extracts visual features from images automatically and then retrieves images by their visual features. Since CBIR is intuitive and efficient and can be widely used in information retrieval, medical diagnosis, trademark and intellectual property protection, crime prevention and other areas, it has very high applied value.

2.2.1 System Model

Selecting an Algorithm

The technique used in this paper uses Color feature-based image retrieval. This mainly contains the following algorithms:

Since the color histogram algorithm is widely used and its feature extraction and similarity matching are easier, this method is used as the target color algorithm.

Color Histogram

The color histogram of an image gives extensive information about its structure, such as the number of pixels and their colors in RGB format. It is therefore easy to calculate the mean, entropy and median of an image, which can be used as features of the image during feature extraction. These features can be compared with the features of the query image to retrieve all similar images from the database.

Feature Calculation and Similarity Matching

The feature vector of each uploaded image is stored in the database. This feature vector is matched with the feature vector of an input image. Both feature vectors are used to calculate a similarity coefficient, called the Pearson correlation coefficient. MapReduce is used only in the similarity matching phase, to provide fast results.

The feature vector of the input image is calculated at query time and matched against the images in the image library, and the images whose feature vectors match are retrieved.
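As a reference for the similarity coefficient mentioned above, the Pearson correlation coefficient of two feature vectors can be computed as in the following sketch. The function name and the convention of returning 0.0 for a constant vector are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors.

    Returns a value in [-1, 1]; 0.0 is returned if either vector is constant
    (an assumed convention, since the coefficient is undefined in that case).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)
```

A coefficient close to 1 indicates that the two histograms rise and fall together, so images can be ranked by this value in descending order.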

2.2.2 Future scope/Issues

The MapReduce technique is used only at the similarity matching stage of the system. It could also be used efficiently in the feature extraction process, by splitting an image into parts at the map stage and combining the results at the reduce stage.

2.3 An Oblivious Image Retrieval Protocol [3]

Today many website providers hold large collections of images. For example, google.com and facebook.com have large image databases, where each user uploads his private data to the data center of the website. This data should be protected from external and internal threats, such as theft of image data, modification of the data, deletion of an individual's private data, or misuse of the data for a particular purpose.

As the size of this data increases day by day, companies find it hard to manage and secure it. Their data center servers are getting overloaded by the huge volume of data. To reduce the strain on their storage servers they are looking for another option: using external storage servers.

These external storage servers are maintained and managed by other companies, i.e. data is outsourced by one company to another to reduce the overhead on its internal servers. In this case it becomes very important to protect and secure customer data on the external outsourced database servers. The Oblivious Image Retrieval Protocol comes into the picture for outsourced image databases. It is a technique to securely query an image database for the required image data and retrieve the matched images from the database. The retrieved data should also be transferred securely to the user.

2.3.1 System Model

The system assumes that the feature vectors of all images are already in the database, where they are stored in encrypted form. All query operations are performed on encrypted data. The protocol mainly consists of two parts: a privacy preserving querying mechanism and oblivious transfer of the decryption keys.

The privacy preserving protocol of the paper works as follows. First, to query the image set, the user generates an encryption of the query feature vector set using a homomorphic public-key encryption technique and sends it to the database server. The query feature vector is distorted by the user with a constant random vector to prevent any statistical inference by the database server. Second, the database server uses the encrypted query feature vector and performs a homomorphic subtraction with a random feature vector. The server performs a subtraction of the database image features with the same random feature vector. Before sending the results back to the user, the database server performs a permutation of the subtracted feature vectors so that the user is not able to learn the relative indexing structure of the database images. The server includes pseudo-identifiers for the permuted images to allow the user to identify his choices. Third, upon receiving the server response, the user removes the distorting random constant from the server response and calculates the Euclidean norm of the numerical difference between the query image feature vector and the database image feature vectors [3].

2.3.2 Future Scope/Issues

The oblivious image retrieval technique can be implemented using the MapReduce technique; MapReduce can be used at the content based image retrieval stage.

2.4 Distributed Image Retrieval System Based on MapReduce [4]

A Distributed Image Retrieval System (DIRS) is a system in which images are retrieved in a content based way, and retrieval over massive image data storage is carried out in parallel using the MapReduce distributed computing model.

2.4.1 System Model

In this system, images are uploaded to HDFS in one MapReduce process. This process extracts the features of the images using the LIRE library and stores the images and their feature vectors in an HBase table.

Images are also retrieved in parallel. In the feature matching process, the CBIR system calculates the similarity between the sample image and the target images, and then returns the images that match the sample image most closely.

Before the MapReduce job is started, the sample image is added to the Distributed Cache so that every map task can access it. Each map task reads the image from the Distributed Cache, extracts its visual features, and compares the resulting feature vector with the feature vectors of the target images from HBase. All matched images are returned to the user.
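The retrieval flow described above can be sketched in plain Python. This is an illustrative stand-in and not the DIRS code: a dictionary plays the role of the HBase table, the query feature vector is assumed to be already extracted (in DIRS this is done with LIRE inside each map task), Euclidean distance is assumed as the similarity measure, and all names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import heapq
import math


def map_task(query_vector, rows):
    """Map: compare the query vector with every stored vector in this split."""
    for image_id, vector in rows:
        yield math.dist(query_vector, vector), image_id


def reduce_task(pairs, k):
    """Reduce: keep the k images with the smallest distance to the query."""
    return [image_id for _, image_id in heapq.nsmallest(k, pairs)]


def retrieve(query_vector, table, k=2, num_splits=2):
    """Driver: split the table, run the map tasks, then reduce to the top k."""
    rows = sorted(table.items())
    splits = [rows[i::num_splits] for i in range(num_splits)]
    pairs = [pair for split in splits for pair in map_task(query_vector, split)]
    return reduce_task(pairs, k)


if __name__ == "__main__":
    table = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
    print(retrieve([0.9, 0.9], table))  # prints ['b', 'a']
```

On a real cluster each split would be processed by a separate map task on the node holding that part of the table, and only the small (distance, image id) pairs would travel to the reducer.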

2.4.2 Future Scope/Issues

The distributed image retrieval technique needs some secure technique to protect data.

2.5 Summary

After the literature survey, I found that each paper leaves some future scope. I want to combine all three future scopes into one application. This application will upload, search and retrieve image files from a distributed database for a particular user and will do so in a secure manner, thereby maintaining user and data privacy. This will be carried out using MapReduce technology, which will reduce the time required for retrieval.

Chapter 3 Motivation

With the explosive growth of digital media data, there is a huge demand for new tools and systems that enable the average user to search, access, process, manage, author and share digital media content more efficiently and effectively. This has shifted people's interest from work alone to work combined with entertainment.

Examples:

Business Phones to Smartphones, Watching movies/T.V.,

Listening music,

Social Media: connect and share,

There are different types of multimedia content present today.

A multimedia information system can store and retrieve text, 2D grey-scale and color images, 1D time series, digitized voice or music, and video. There is a need for an efficient, robust and secure solution to retrieve this kind of multimedia information.

We consider image retrieval rather than audio or video retrieval because image retrieval systems have many applications today. Image retrieval techniques have undergone several changes and have become more advanced over time. In the early years of image retrieval, images were searched by text, so each image had to be annotated with particular text. This technique was troublesome because of its complexity. Today, content based image retrieval provides an extensive search facility over a database and retrieves all images whose content is similar to the queried image.

CBIR technology has been used in several applications such as fingerprint identification, biodiversity information systems, digital libraries, crime prevention, medicine and historical research, among others. CBIR thus covers a wide range of domains. Today many websites, such as google.com and facebook.com, hold a large number of images on their servers. Many users access these websites regularly to upload, search and download images. There is a need for a fast and secure technique to upload, search and retrieve these images on user demand.

3.1 Problem Definition

The purpose of this research work is to create an efficient private content based image information upload, search and retrieval system using MapReduce. The system will be built from easily available APIs that are supported on the cloud, so that it can be ported to the cloud easily.

3.2 Scope of Research

The scope of this research is not limited to one particular domain; it spans multiple domains. The domains in which extensive research can be done are Private Information Retrieval, Content Based Image Retrieval and Oblivious Transfer, i.e. security. Research papers in this area suggest content based image retrieval over cloud computing as well as secure transfer techniques for images, but not a single paper addresses fast (over cloud computing) as well as secure retrieval of images from servers using a Private Information Retrieval technique.

This system can evolve in many ways. By changing the CBIR algorithm, by using different encryption techniques or by using different storage methods over HDFS, the system can be moved to a new, more efficient and more secure level.

3.3 Objectives

1. To implement CBIR in Map-Reduce

2. To select and implement homomorphic encryption technique for secure retrieval

3. To evaluate proposed solution

3.4 System Requirement Specifications

Hardware:

A minimum of one desktop system with an Intel or AMD processor and 2 GB of RAM. Newer hardware gives better results: for example, a cluster of 5 nodes with high-frequency Intel Core series processors (i3, i5, i7) and 4 GB of RAM will perform better than a 5-node cluster with older, lower-frequency processors and less RAM. All nodes of the cluster should have the same configuration. If the nodes differ in RAM capacity or processor frequency, the cluster will not perform up to capacity.

Software:

1. Operating System: Any open source Linux operating system.

2. Coding Language: Hadoop map reduce technology/software, Java language (JDK)

3. IDE: Eclipse environment

Chapter 4 System Design

4.1 Complete Architecture

Figure 4.1 shows complete architecture of the system.

The system provides a web based interface to receive the user's requests. After receiving a request, the system performs the corresponding operation on the images, whether upload or retrieval. The core of the system is divided into two phases: one uploads images, and the other searches for and retrieves similar images. Both modules are deployed on a Hadoop cluster, which follows the MapReduce paradigm.

The upload process reads input data from HDFS, processes it using map and reduce functions, and then writes the output back to HDFS. Similarly, the image retrieval process uses map and reduce functions to retrieve images similar to the queried image. These images are delivered to the user on the local file system.

4.1.1 Upload Images

Upload Images is an independent process in which the user uploads image or photo information to the system's database. Uploading is carried out with Map/Reduce, which parallelizes the process for each image. The upload has three sub-processes: first, splitting of an image; second, feature extraction; and third, encryption and storage. When uploading private photos or images, the user may want to upload one image or many images at a time. To upload many images at once, the user only has to provide the path of the folder where the images are stored, and the map stage parallelizes the upload instead of uploading one image at a time. The architecture of the upload process is shown in Figure 4.2.

For clarity, the explanation of the upload process is split into three distinct phases: image splitting, image feature extraction and encryption.

Figure 4.2: Upload Process

Phase 1: An image that is going to be uploaded is first split into 16 small images. This step disperses the data across nodes so that an attacker cannot track the data from one location. Map/Reduce works on its own distributed file system, the Hadoop Distributed File System (HDFS) [18], [19], [13], [16], which is built specifically for parallelization. All image files are stored in this file system. Information about all split images of every image is stored in an HBase [17] table. HBase works on top of HDFS. The HBase table stores the path of each image and other features of that image.
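The splitting step of Phase 1 can be sketched as follows. This is a minimal illustration and not the code used in this work: the class name TileSplitter, its method names and the choice of a 4 by 4 grid to obtain 16 pieces are assumptions, and the last row and column are assumed to absorb any remainder pixels.

```java
import java.awt.image.BufferedImage;

final class TileSplitter {
    /** Returns {x, y, width, height} for every tile of a grid-by-grid layout, row by row. */
    static int[][] tileBounds(int width, int height, int grid) {
        int[][] bounds = new int[grid * grid][];
        int tileW = width / grid;
        int tileH = height / grid;
        for (int row = 0; row < grid; row++) {
            for (int col = 0; col < grid; col++) {
                // Assumption: the last column and row absorb leftover pixels.
                int w = (col == grid - 1) ? width - col * tileW : tileW;
                int h = (row == grid - 1) ? height - row * tileH : tileH;
                bounds[row * grid + col] = new int[] {col * tileW, row * tileH, w, h};
            }
        }
        return bounds;
    }

    /** Cuts the image into grid * grid sub-images. */
    static BufferedImage[] split(BufferedImage image, int grid) {
        int[][] bounds = tileBounds(image.getWidth(), image.getHeight(), grid);
        BufferedImage[] tiles = new BufferedImage[bounds.length];
        for (int i = 0; i < bounds.length; i++) {
            int[] b = bounds[i];
            // getSubimage returns a view that shares pixel data with the source image.
            tiles[i] = image.getSubimage(b[0], b[1], b[2], b[3]);
        }
        return tiles;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB);
        BufferedImage[] tiles = split(image, 4);
        System.out.println(tiles.length + " tiles, first tile "
                + tiles[0].getWidth() + "x" + tiles[0].getHeight());
    }
}
```

In a real upload job each tile would still have to be written to HDFS and registered in the HBase table; that part is left out here.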

Phase 2: These images are passed to the image processing part. In this phase, the color histogram of each image is calculated. The histogram provides feature values such as entropy, intensity, mean and median. These values are used for the similarity calculation [11], i.e. the distance between the queried image and the images from HDFS.

Phase 3: The calculated feature vector is given to the homomorphic encryption system, which stores it in the HBase table in encrypted form. A homomorphic encryption system allows operations to be carried out on encrypted data without decrypting it. In the scheme used here, multiplying all 16 encrypted feature vectors of a particular image yields an encryption of the sum of those feature vectors in plaintext form.

The advantage of this appears at retrieval time. When the user queries the system, the encrypted feature vector of the queried image is compared against the total feature vector of all split images of a particular image.

The algorithm used for encryption is the Paillier cryptosystem. It uses a private and public key pair of the user. The feature vector of an image is encrypted with the user's public key and entered into HDFS. The user's public key is available to the database server, whereas both the private and the public key are available to the user. The encrypted feature vector is stored in files on HDFS. Each split image has its own feature vector, which is stored in the same row as the encrypted image name.
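The additive property described above can be illustrated with a textbook Paillier sketch. This is not the implementation used in this work: the class name PaillierSketch, the simplification g = n + 1, the 512-bit modulus (far too small for real use) and the sample values are all assumptions made for the illustration. Plaintexts are assumed to be non-negative with a sum smaller than n.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

final class PaillierSketch {
    final BigInteger n, nSquare, g;        // public key
    private final BigInteger lambda, mu;   // private key
    private final SecureRandom random = new SecureRandom();

    PaillierSketch(int modulusBits) {
        BigInteger p = BigInteger.probablePrime(modulusBits / 2, random);
        BigInteger q;
        do {
            q = BigInteger.probablePrime(modulusBits / 2, random);
        } while (q.equals(p));
        n = p.multiply(q);
        nSquare = n.multiply(n);
        g = n.add(BigInteger.ONE);                     // assumed simplification g = n + 1
        BigInteger p1 = p.subtract(BigInteger.ONE);
        BigInteger q1 = q.subtract(BigInteger.ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1));   // lcm(p - 1, q - 1)
        mu = l(g.modPow(lambda, nSquare)).modInverse(n);
    }

    /** L(u) = (u - 1) / n, as in the standard description of Paillier. */
    private BigInteger l(BigInteger u) {
        return u.subtract(BigInteger.ONE).divide(n);
    }

    /** c = g^m * r^n mod n^2 with a fresh random r coprime to n. */
    BigInteger encrypt(BigInteger m) {
        BigInteger r;
        do {
            r = new BigInteger(n.bitLength(), random);
        } while (r.signum() == 0 || r.compareTo(n) >= 0 || !r.gcd(n).equals(BigInteger.ONE));
        return g.modPow(m, nSquare).multiply(r.modPow(n, nSquare)).mod(nSquare);
    }

    /** m = L(c^lambda mod n^2) * mu mod n. */
    BigInteger decrypt(BigInteger c) {
        return l(c.modPow(lambda, nSquare)).multiply(mu).mod(n);
    }

    /** Encrypts each value, multiplies the ciphertexts, and decrypts the product. */
    static long sumUnderEncryption(long[] values) {
        PaillierSketch key = new PaillierSketch(512);
        BigInteger product = BigInteger.ONE;
        for (long v : values) {
            product = product.multiply(key.encrypt(BigInteger.valueOf(v))).mod(key.nSquare);
        }
        return key.decrypt(product).longValueExact();
    }

    public static void main(String[] args) {
        long[] tileMeans = {120, 35, 7, 200};          // illustrative values only
        System.out.println(sumUnderEncryption(tileMeans));
    }
}
```

The product of the ciphertexts decrypts to the sum of the plaintexts, which is why the per-tile values can be combined on the server without the private key.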

4.1.2 Image Retrieval

Figure 4.3: Image Retrieval Process

This phase of the system provides the user with a GUI to search and retrieve images. To search for images similar to a particular image, the user uploads that image to the system. The processes of the upload phase are also carried out on this image: the feature vector of each split image is calculated from its color histogram, and all feature vectors of the split images are encrypted with the user's public key. These encrypted feature vectors of the queried image are sent to the server. Searching, comparing and retrieving images is itself one MapReduce process. In this process, the encrypted feature vectors of all split images of the queried image, along with all image feature vectors in the HBase table, go to the map process.

Map Stage: The map process multiplies all encrypted feature vectors of a particular image and emits the total feature vector of that image. It does this in parallel for all images present in the system, creating one mapper per image, which emits the total feature vector after combining the 16 feature vectors.

Reduce Stage: The vectors of all images are collected in the reduce step. Each vector is compared with the feature vector of the queried image on the basis of a similarity measure. Standard similarity measures include the Euclidean distance and the Pearson correlation coefficient [11]. The proposed protocol calculates the Euclidean distance between the color histograms of the images.
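The Euclidean distance named in the reduce stage can be written down in a few lines. The sketch below works on plaintext feature vectors only, so it shows the formula rather than the encrypted comparison of the protocol; the class name FeatureDistance and the sample vectors are assumptions.

```java
final class FeatureDistance {
    /** Euclidean distance between two feature vectors of equal length. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;               // accumulate squared differences
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] query = {0.0, 0.0, 0.0};    // illustrative values only
        double[] stored = {3.0, 4.0, 0.0};
        System.out.println(euclidean(query, stored));
    }
}
```

A smaller distance means the stored image is closer to the queried image.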

Chapter 5 Implementation of System

The graphical user interface (GUI) is one of the most important parts of any application, because it allows the user to interact with the system. The success of an application depends on how well the GUI is designed and how simple it is to use. Although this application was designed bottom up, i.e. the main MapReduce code first and the GUI last, I introduce the GUI first, because a user who starts the application interacts only with the GUI.

The complete application, including the GUI, was created in Eclipse [20], an open source integrated development environment (IDE) that helps to develop applications easily. This IDE comes with integrated J2EE or Java EE [21]. J2EE stands for Java 2 Platform, Enterprise Edition. It is used to build enterprise web applications. Java EE includes several API specifications, such as JDBC, Enterprise JavaBeans, Connectors, servlets, JavaServer Pages (JSPs) [12] and several web service technologies. This allows developers to create enterprise applications that are portable and scalable, and that integrate with legacy technologies [21].

I created the GUI with JSPs, which enable web developers and designers to rapidly develop and easily maintain information-rich, dynamic web pages. Java code can be integrated with HTML in JSPs. JSP technology separates the user interface from content generation, enabling designers to change the overall page layout without altering the underlying dynamic content [22]. In this application, the JSPs contain all GUI code and the Java code contains all business logic. All MapReduce processes are started from Java code only, as specific commands are needed to run a MapReduce process.

The application is designed so that if any process required by the MapReduce program is not running, the application automatically restarts all processes in the system and makes sure that they are up and running. This is done only for the local deployment of the Hadoop cluster, because all data is formatted when the server is restarted. In an actual industry deployment this case will not arise, because all processes will always be running.

5.1 Log In Screen

As the title suggests, this application performs private content based image information retrieval, so every user has a user ID and password to log in to his/her account and upload or retrieve images. The Log In screen lets the user maintain privacy. Companies often outsource their databases, in which case the data of many users can be present on the same server/cluster. This technique ensures that a user can see only his own data and not that of others.

Figure 5.1: LogIn Screen

5.2 Create Account

The Create Account form appears after the user clicks the "Create Account" link. To access the application, the user has to create an account with the web site; the form is much like a sign-up form.

Figure 5.2: Create Account Screen

A user account is created from the user's email ID. The email ID field is validated: if the user enters an invalid email ID, the form asks for a correct one. Once a valid email ID is given, the user account is created along with its security keys. The user is notified of the User ID to log in with. This User ID is used to create the user's account, i.e. a folder in HDFS. Login information, i.e. user IDs and passwords, is stored in an HBase table. HBase, which is also a part of Hadoop, is a scalable database that allows tables to be created and data to be stored in them, and it can expand at run time.

Figure 5.3: Account Creation Screen - Validation

After successful creation of account, user will get a message that account has been created and it is ready to access. User now will go to LogIn screen to log in.

5.3 User Operation Screen

After logging in, user will see a screen having three links. One link is for logout, second is for uploading images and third one is for retrieving similar images for one particular image from database.

Figure 5.4: User Operations Screen

5.3.1 Changes in Hadoop

A conventional setup of Hadoop does not allow different processes to write their output files to the same output path. A user of this system will not upload files only once; he will use the system at any time and from anywhere. The system therefore has to upload a user's images to one particular location, so whenever the upload MapReduce process runs, all encrypted feature vectors must go into the same folder, which belongs to that user. Because this process can happen at any time, we cannot allocate a new folder to the user each time. To remove this limitation, I modified the Hadoop code so that the output of every MapReduce process goes into the user-specific folder and not to another location.

Modified Files: FileOutputFormat.java, MultipleOutputFormat.java, MultipleTextOutputFormat.java

5.3.2 Upload Process

When the user selects the upload images link, he is directed to a page where he can specify the path of the folder containing the images. All images in the folder are first uploaded to HDFS and are then used by the upload MapReduce. A MapReduce process runs at each upload. It splits each image into 16 small images and calculates the feature vector, i.e. the color moment, of every split image.

This feature vector is a color moment of the image calculated from its color histogram. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images by using similarity distance measures [8], [11]. It tries to identify color proportion by region and is independent of image size, format or orientation [6]. Color moments comprise the first order (mean), the second order (variance) and the third order (skewness) moments of an image [8]. All moments are encrypted with the user's public key and then stored in a file. Whenever the user uploads images to the database, they go into the same folder.
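The three color moments mentioned above can be computed per channel as in the following sketch. It is an illustration under stated assumptions, not the code of this work: the class name ColorMoments is hypothetical, the variance is the population variance, and the third moment is reported as the cube root of the third central moment, which is one common convention for color moments.

```java
final class ColorMoments {
    /** Extracts one channel from packed ARGB pixels; shift is 16 (red), 8 (green) or 0 (blue). */
    static int[] channel(int[] argb, int shift) {
        int[] out = new int[argb.length];
        for (int i = 0; i < argb.length; i++) {
            out[i] = (argb[i] >> shift) & 0xFF;
        }
        return out;
    }

    /** Returns {mean, variance, skewness} of one color channel. */
    static double[] moments(int[] values) {
        double mean = 0.0;
        for (int v : values) {
            mean += v;
        }
        mean /= values.length;

        double second = 0.0;
        double third = 0.0;
        for (int v : values) {
            double d = v - mean;
            second += d * d;
            third += d * d * d;
        }
        second /= values.length;   // population variance (assumption)
        third /= values.length;    // third central moment
        return new double[] {mean, second, Math.cbrt(third)};
    }

    public static void main(String[] args) {
        int[] pixels = {0xFF102030, 0xFF405060, 0xFF708090};   // illustrative ARGB values
        double[] red = moments(channel(pixels, 16));
        System.out.println("mean=" + red[0] + " variance=" + red[1] + " skewness=" + red[2]);
    }
}
```

In the upload job these values would be computed for every split image and then encrypted before being stored.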

Hadoop does not support the use of the same output folder by different MapReduce processes, so I modified Hadoop so that the files of each upload MapReduce go into the same folder.

The MapReduce process takes some time to complete. If the user does not get a proper message, he might think that the images have already been uploaded and shut down the process. The user therefore gets a message showing that files are being uploaded. While the MapReduce process is executing, the user sees the following screen until the process has completed.

Figure 5.6: Uploading Screen

File names may also be repeated. Files whose names match files already present in HDFS are therefore renamed at the time of upload. First, all file names of the user are read from HDFS and collected in one string. The name of the image to be uploaded is matched against this string, and if the string contains the name, the image is renamed. The program then checks whether the generated name is present in the string, and if it is, the file is renamed again.
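The rename-until-unique check described above can be sketched as follows. The sketch swaps the single string of names for a Set, which avoids accidental substring matches, and the naming pattern (a numeric suffix before the extension) is an assumption; the text does not state how new names are generated. The class name UniqueNames is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class UniqueNames {
    /** Returns name unchanged if it is unused, otherwise name with the first free numeric suffix. */
    static String uniqueName(String name, Set<String> existing) {
        if (!existing.contains(name)) {
            return name;
        }
        int dot = name.lastIndexOf('.');
        String base = (dot < 0) ? name : name.substring(0, dot);
        String ext = (dot < 0) ? "" : name.substring(dot);
        int counter = 1;
        String candidate;
        do {
            // Assumed pattern: photo.jpg -> photo_1.jpg -> photo_2.jpg ...
            candidate = base + "_" + counter + ext;
            counter++;
        } while (existing.contains(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>(Arrays.asList("photo.jpg", "photo_1.jpg"));
        System.out.println(uniqueName("photo.jpg", stored));
    }
}
```

The loop mirrors the repeated check in the text: a generated name is tested again and replaced until it no longer collides.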

Map Reduce Algorithm

MapReduce is used to create the feature vectors of images. The mapper splits the input image file into small chunks of fixed size, calculates the feature vector of each chunk, encrypts it using the homomorphic encryption technique and stores it in the database. The reducer gets sorted input from the mapper. It combines all split encrypted feature vectors and emits the image ID and the combined list.

Algorithm 5.1 Upload Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by namesfeaturekey, the execution framework groups vectors by names along with namesfeaturekey, and the reducers write names and encrypted feature values to files.

imgs ← new BufferedImage;
random ← new BigInteger;
n, g, r, nsquare ← new BigInteger;
dMeanR ← new double;
dMeanG ← new double;
dMeanB ← new double;

1: procedure printPixelARGB(int pixel)
2:     dMeanR ← dMeanR + Red Pixel Value
3:     dMeanG ← dMeanG + Green Pixel Value
4:     dMeanB ← dMeanB + Blue Pixel Value

1: procedure Encryption(int meanVal)
2:     Return BigInteger encrypted(meanVal)

1: class Mapper
2:     procedure Map(Imgid iId, Img im)
3:         m0, m1, m2 ← new BigInteger;
4:         imgs ← splitImages
5:         for all img i ∈ imgs[count] do
6:             for all pixel p ∈ imgs i do
7:                 printPixelARGB
8:             m0 ← Encryption(dMeanR)
9:             m1 ← Encryption(dMeanG)
10:            m2 ← Encryption(dMeanB)
11:            Emit(Imgid iId, encrypted vals ⟨iId_i, m0, m1, m2⟩)

1: class Reducer
2:     procedure Reduce(Imgid iId, encrypted vals ⟨iId_i, m0, m1, m2⟩)
3:         iIdL ← new List
4:         for all encrypted vals ⟨iId_i, m⟩ ∈ encrypted vals
5:                 [⟨iId_0, m0, m1, m2⟩, ⟨iId_1, m0, m1, m2⟩ ...] do
6:             Append(iIdL, ⟨iId_i, m⟩)

The following image shows the MapReduce process at work behind the screen, as seen in Eclipse.

Figure 5.7: Process behind Upload

5.3.3 Image Retrieval

When the user clicks the search images link, he is directed to a page where he can specify the file path of an image. This is the image on the basis of which images are retrieved from the database. The comparison is done between the features of the images stored in HDFS and the features of the search image. In the upload process, feature vectors are calculated and stored in HDFS. These feature vectors are used for the comparison.

The selected image passes through several steps similar to those of the upload process, i.e. image feature calculation and encryption. The image is first split into 16 parts, as in the upload phase. This step keeps the loss of pixels equal in both cases. The features are encrypted with the same public key of the user, so that the features match exactly in the similarity matching phase.

At the upload stage, the features of the split images are stored in HDFS, so at similarity measurement time they must be combined back into one feature vector. This is where the Paillier cryptosystem comes in. Paillier allows the addition of plaintexts, and the multiplication of a plaintext by a known constant, to be performed on encrypted data without decrypting it.

The similarity distance is calculated between the stored image feature vectors and the feature vector of the image being searched for. A similarity distance is a mathematical formula that expresses the relation between images. Different measures can be used for the calculation [8], [11]. Here the Euclidean distance is used to calculate the similarity between images, and it is calculated between the means. This provides information about the similarity between two images. The comparison is done in a MapReduce process to make it parallel.

While the MapReduce process is running, the user sees the following screen.

The background processes can be seen in the following screen.

Figure 5.10: Processes behind Search Screen

Map Reduce Algorithm

Algorithm 5.2 Image Retrieval Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by opName_featureid, and the execution framework groups vectors by names along with opName_featureid. The combiner gets the encrypted feature vectors of the queried image, calculates the difference between the combined values from the datastore and the vectors of the queried image, squares it and emits the values. The reducers get the difference values and write the names and the square root of the encrypted feature values to files.

upload_img_m1 ← new BigInteger;
upload_img_m2 ← new BigInteger;

1: class Mapper
2:     procedure Map(LongWritable key, Text value)
3:         op_name ← new String;
4:         m1, m2, m3 ← new BigInteger;
5:         line ← line_from_file;
6:         arr_tokens ← tokenizer(line);
7:         while arr_tokens has tokens do
8:             m1 ← arr_tokens.nexttoken;
11:        Emit(LongWritable opName_m1, encrypted_vals m1);

1: class Combiner
2:     procedure Configure(JobConf conf)
3:         upload_img_m1 ← conf.mean_of_upload1;
6:     procedure Reduce(Text key, Text value)
7:         subtraction_of_features ← new Integer;
8:         va ← new Integer;
9:         while va ∈ values do
10:            va ← va + va;
11:            if key contains m1 then
12:                subtraction_of_features ← va − upload_img_m1;
17:        Emit(LongWritable opName, subtracted_encrypted_vals subtraction_of_features);

1: class Reducer
2:     procedure Reduce(Text key, Text value)
3:         final_root ← new Double;
4:         va ← new Integer;
5:         while va ∈ values do
6:             va ← va + va;
7:         final_root ← root(va);

Chapter 6 Experiments, Testing and Result Analysis

6.1 Cluster Configuration

The application is web based, so a web server is needed to run it.

Apache Tomcat is an open source web server and servlet container developed by the Apache Software Foundation (ASF). Tomcat implements the Java Servlet and JavaServer Pages (JSP) specifications from Sun Microsystems, and provides a "pure Java" HTTP web server environment for Java code to run in [23], [24].

Initially, Hadoop was installed on a single system, i.e. a single-node cluster, which was used primarily for application development. After development was complete, the single-node cluster was gradually increased to a two, three, four and then five node cluster.

In a single-node cluster, the master node and the slave node are the same, i.e. there is only one node, whereas in a multi-node cluster there is one master and the other nodes are slaves. More master nodes are possible. Each node of the cluster has an installation of Hadoop's MapReduce, but HBase, which stores the users' login credentials, is limited to the master.

On the master node, all major processes are running: the NameNode, DataNode, JobTracker, TaskTracker and Secondary NameNode processes, and the HMaster process of HBase. Slave nodes run only the DataNode and TaskTracker processes. One more small cluster of 3 nodes was created to check performance.

Nodes | Description | Role | Memory (GB) | CPU | Freq. (GHz) | Disk (GB) | OS
------+---------------------+---------------------------+-------------+-----------------------+-------------+-----------+-------------
1 | Laptop | Single Node Cluster | 4 | Intel Core i3-370M | 2.4 | 40 | Ubuntu 12.04
1 | Desktop Workstation | 1st Node-Cluster (Master) | 4 | Dual-Core AMD Opteron | 1.8 | 48 | Ubuntu 12.04
2 | Desktop Workstation | 2nd Node-Cluster (Slave) | 2 | Dual-Core AMD Opteron | 2.6 | 50 | Ubuntu 12.04
3 | Desktop Workstation | 3rd Node-Cluster (Slave) | 2 | Dual-Core AMD Opteron | 2 | 60 | Ubuntu 12.04
4 | Desktop Workstation | 4th Node-Cluster (Slave) | 4 | Dual-Core AMD Opteron | 1.8 | 150 | Ubuntu 12.04
5 | Desktop Workstation | 5th Node-Cluster (Slave) | 4 | Dual-Core AMD Opteron | 2 | 50 | Ubuntu 12.04
6 | Desktop Workstation | 6th Node-Cluster (Slave) | 2 | Dual-Core AMD Opteron | 1.8 | 150 | Ubuntu 12.04

Table 6.1: Cluster Configuration

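Nodes | Description | Role | Memory (GB) | CPU | Freq. (GHz) | Disk (GB) | OS
------+-------------+-------------------+-------------+---------+-------------+-----------+-------------
3 | Desktop | 1-Master 2-Slave | 4 | Core i5 | 3.10 | | Ubuntu 12.04

Table 6.2: Second Cluster Configuration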

6.2 Experiments and Testing

Experiments require a database, in this case an image database. Initially, experiments were done on static data. When the main processes of the application, i.e. the MapReduce jobs for upload and retrieval, were developed, they were tested on an image data set of 10 to 15 files. The feature vectors of these images were first created through Java code and uploaded to HDFS, and that data set was used in the processes. Code was then written to upload files to HDFS automatically and to perform all operations on encrypted image files directly.

Experiments were carried out at the upload and retrieval stages with a repository of images. The images were stored in data sets of sizes 10, 20, ..., 100, then 200, 300, ..., 1000, then 2000, ..., 5000. Many of the images were repeated. All had the same file format (.jpg) and the same size (640 by 480 pixels).

Further experiments were done with a set of 25 images of different sizes and formats such as .jpg, .gif, .bmp and .png. The time these processes take to complete is used to measure the performance of the system, at both the upload and the retrieval stage.

6.3 Result Analysis

At the start of testing, experiments were done on two machines. Both had the same single-node cluster setup of Hadoop but differed in processor, speed, RAM etc. Readings of the upload and retrieval processes were taken on both machines for different sets of images. The observations show that the MapReduce process depends on the processor version, its speed/frequency and the RAM present in the system. The following results highlight this. The figures compare the performance of the upload and retrieval MapReduce processes on a one-node cluster installed on the laptop and on the desktop. The configuration of both is given in the table above. Figure 6.1 shows the total time taken to upload x images to the system, where the horizontal axis stands for the number of images x and the vertical axis stands for the time in milliseconds.

Figure 6.1: One Node Cluster Difference at Upload MapReduce

Figure 6.2 shows the same experiment for the image retrieval MapReduce process. It leads to the same observations as the upload process. The retrieval process takes less time than the upload process, because the upload process contains more computation.

Similarly, the difference between three-node clusters was measured and can be seen in the following figures. The configuration of the other cluster is shown in Table 6.2. Comparing the two three-node clusters, we can clearly see that the processor version, its speed and the available memory directly affect the MapReduce process. Figure 6.3 shows the change in the upload MapReduce.

Figure 6.3: Three Node Cluster Difference at Upload MapReduce

Figure 6.4 shows the change in the retrieval MapReduce.

After the results were taken on the one-node cluster, it was gradually increased to a two, three, four, five and six node cluster. Results were taken for each cluster size and are displayed in Figure 6.5.

Figure 6.5: Performance of Upload MapReduce at Cluster

Figure 6.6 shows the same experiment for the image retrieval MapReduce process.

Figure 6.6: Performance of Retrieval MapReduce at Cluster

Readings were taken by running the process on a gradually increasing cluster size. The retrieval process takes less time than the upload process, because the upload process contains more computation.

As the figure shows, the readings on the 3-node cluster are lower than on the 4-node cluster, although the readings on 4 nodes should be lower. This happened because a bad node was added: it had 2 GB of RAM, low processor speed and slow LAN connectivity, so the reading increased instead of decreasing. The other readings decrease gradually as the cluster size increases.

Figure 6.7 and Figure 6.8 show the scale-up in the time spent by a process. These readings are the differences between the times on clusters of consecutive sizes, also in milliseconds. A negative value marks the addition of the bad node, i.e. the increase in cluster size from 3 to 4.
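The scale-up readings are differences between run times on clusters of consecutive sizes, which can be computed as in this sketch. The class name ScaleUp is hypothetical and the timings are made-up placeholders, not measurements from this work; they are chosen only so that one step comes out negative, as in the bad-node case.

```java
import java.util.Arrays;

final class ScaleUp {
    /** diffs[k] = time on cluster k minus time on cluster k + 1; negative means the larger cluster was slower. */
    static long[] consecutiveDifferences(long[] timesMs) {
        long[] diffs = new long[Math.max(0, timesMs.length - 1)];
        for (int k = 0; k < diffs.length; k++) {
            diffs[k] = timesMs[k] - timesMs[k + 1];
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Placeholder timings in milliseconds for cluster sizes 1, 2, 3 and 4.
        long[] timesMs = {100, 80, 70, 90};
        System.out.println(Arrays.toString(consecutiveDifferences(timesMs)));
    }
}
```

With these placeholder values the last difference is negative, which is how a slower, larger cluster shows up in the scale-up plot.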

Figure 6.7: Scale Up between Different Cluster Nodes

6.4 Error Check in Application

The following screens show the error checks performed throughout the application life cycle.

Figure 6.9: Error Checking in Email

This error check is performed on the upload screen when the user types a folder path at which no folder is present.

Figure 6.10: Error Check at Upload Images

Figure 6.11: Error check at Search Images

6.5 Comparison of Different Systems

The following table compares different commercially available image retrieval systems. Some of them are applications and some are user libraries. The comparison shows the differences between these systems and my system on some major criteria.

Chapter 7 System Output

7.1 Input Images

The following figure shows an example of images uploaded by a system user. It contains images of different formats and sizes; the formats shown are .jpg, .png, .gif and .bmp. The system does not support upload and retrieval of images with the extension .tiff.

The following is the set of images uploaded to the system.

Figure 7.1: Input set of Images Uploaded

Figure 7.2: Input Search Image

The following is the output of the system. All similar images are retrieved from the database on the basis of the mean value of the image. The system outputs all images whose mean values differ from the mean of the input image by no more than the allowed variation; a variation of five from the input mean is considered in the system.
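The selection rule can be sketched as a simple tolerance check. This reads "five variations from input mean" as an absolute difference of at most 5 between a stored mean and the query mean, which is an assumption; the class name ToleranceFilter and the sample means are hypothetical, and the sketch works on plaintext values.

```java
import java.util.ArrayList;
import java.util.List;

final class ToleranceFilter {
    /** Returns the indices of stored means that lie within tolerance of the query mean. */
    static List<Integer> matches(double queryMean, double[] storedMeans, double tolerance) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < storedMeans.length; i++) {
            if (Math.abs(storedMeans[i] - queryMean) <= tolerance) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[] storedMeans = {96.0, 105.0, 106.0, 100.0};   // illustrative values only
        System.out.println(matches(100.0, storedMeans, 5.0));
    }
}
```

Only the image with mean 106.0 falls outside the tolerance here, so the other three would be returned as similar.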

Chapter 8 Conclusions and Future Scope

8.1 Conclusion

We experimented with and evaluated the proposed system on three aspects: similarity retrieval, execution time complexity and security.

On the similarity measurement front, the application succeeds in uploading, searching and retrieving similar images with a precision of nearly 89 to 94% for image data of the same format and the same dimensions. Variation in dimensions changes the output similarity precision by 20%. The application is not affected by a change of format and delivers output with the same high similarity.

For execution time complexity, the application was tested on desktop systems with different configurations. Systems with the latest processors and memory deliver high performance and improve time efficiency by nearly 50%. The throughput of the system increased with the use of the latest high-frequency Core processors. Main memory also played a crucial role in increasing performance.

The Paillier cryptosystem used in this application is highly secure. It is used to achieve user privacy and database privacy with respect to each other. The application achieves high security because the retrieval process is done entirely on encrypted data, which is secured using a random key of the user, so it does not lag on the security front. It can be made more secure by using different keys provided by secure key management servers.

Although the system has provided satisfactory results, we consider this just a first step towards providing internet/cloud users with a private CBIR system, and it opens avenues for further research and development.

8.2 Future Scope

This application is designed in such a way that it will try to cover three different domains. It does not use detailed research work from any of the domain. So there is lot of scope to improve this application form different domain’s perspective. This application can be evolved in Private Information Retrieval, Content Based Image Retrieval and Oblivious transfer i.e. security domains.

Different homomorphic encryption technique can be used, similarity measurement can be changed to improve accuracy, also different feature extraction techniques like Tamura features, Shape features can be used along with color histogram to increase the accuracy of similarity measurement.

This application has large scope in the cloud domain and on the internet, because servers in every domain are adopting cloud technology and their operators want to offer customers new, cutting-edge imaging applications that are secure, efficient, and deliver high throughput.


Appendix A

Publication Status

Title

Journal

Status

Private Image Information Retrieval using Map/Reduce on Cloud

International Journal of Advances in Management, Technology and Engineering Sciences [ISSN 2249-7455]


Bibliography

[1] T. Mayberry, E.-O. Blass, and A. H. Chan, “Pirmap: Efficient private information retrieval for mapreduce,” 2012, http://eprint.iacr.org/.

[2] L. Shi, B. Wu, B. Wang, and X. Yan, “Map/reduce in cbir application,” in Computer Science and Network Technology (ICCSNT), 2011 International Conference on, vol. 4, Dec. 2011, pp. 2465–2468.

[3] P. R. Sabbu, U. Ganugula, S. Kannan, and B. Bezawada, “An oblivious image retrieval protocol,” in Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications, ser. WAINA ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 349–354. [Online]. Available: http://dx.doi.org/10.1109/WAINA.2011.128

[4] J. Zhang, X. Liu, J. Luo, and B. Lang, “Dirs: Distributed image retrieval system based on mapreduce,” in Pervasive Computing and Applications (ICPCA), 2010 5th International Conference on, 2010, pp. 93–98.

[5] Y. Rui and T. S. Huang, “Image retrieval: Current techniques, promising directions and open issues,” Journal of Visual Communication and Image Representation, vol. 10, pp. 39–62, 1999.

[6] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, “Query by image and video content: the QBIC system,” Computer, vol. 28, no. 9, pp. 23–32, 1995.

[7] Oracle interMedia user's guide and reference. [Online]. Available: http://docs.oracle.com/html/A88786_01/mm_cbr.htm


[8] F. Long, H. Zhang, and D. D. Feng, “Fundamentals of content-based image retrieval.”

[9] R. B. Gulfishan Firdose Ahmed, “A study on different image retrieval techniques in image processing.”

[10] J. Eakins, M. Graham, and T. Franklin, “Content-based image retrieval,” Library and Information Briefings, vol. 85, pp. 1–15, 1999.

[11] C.-C. Chen and H.-T. Chu, “Similarity measurement between images,” in Proceedings of the 29th annual international conference on Computer software and applications conference, ser. COMPSAC-W’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 41–42. [Online]. Available: http://dl.acm.org/citation.cfm?id=1890517.1890538

[12] C. W. Niblack, R. Barber, W. Equitz, M. D. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, “QBIC project: querying images by content, using color, texture, and shape,” C. W. Niblack, Ed., vol. 1908, no. 1. SPIE, 1993, pp. 173–187. [Online]. Available: http://dx.doi.org/10.1117/12.143648

[13] Apache hadoop. [Online]. Available: http://hadoop.apache.org/

[14] Mapreduce - hadoop wiki. [Online]. Available: http://wiki.apache.org/hadoop/MapReduce

[15] P. Mell and T. Grance, “The NIST definition of cloud computing.”

[16] HDFS users guide. [Online]. Available: http://hadoop.apache.org/docs/hdfs/current/hdfs_user_guide.html

[17] Apache hbase. [Online]. Available: http://hbase.apache.org/

[18] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492

Contents

List of Tables

List of Figures

Chapter 1 INTRODUCTION

1.1 Private Information Retrieval
1.2 Image Retrieval
1.2.1 Content Based Image Retrieval
1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique

Chapter 2 Literature Survey

2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.1 System Model
2.2.2 Future scope/Issues
2.3 An Oblivious Image Retrieval Protocol [3]