Privacy Preserving Aware Over Big Data in Clouds Using GSA and Map Reduce Framework


K.Sekar, M.Padmvathamma, J.Narasimhulu

Abstract: A growing number of applications must handle big data, yet analysing big data remains extremely difficult. For such applications, the Map Reduce (MR) framework has recently attracted considerable attention. The main intention of this paper is to present a privacy-preserving approach for big data in clouds using the Gravitational Search Algorithm (GSA) and the Map Reduce framework. The approach consists of two modules: a Map Reduce (MR) module and an evaluation module. In the MR module, a convolution process is applied to the dataset to create a new kernel matrix. When the convolution process is carried out correctly, the utility and privacy information of the data is well secured. Once the convolution process is over, the privacy-preserving framework over big data in cloud systems is carried out by the evaluation module. In the evaluation module, a neural network is trained with the Gravitational Search Algorithm combined with the Scaled Conjugate Gradient (GSA-SCG) algorithm, which improves the utility of the private data. Finally, the reduced private data are stored with the cloud service provider (CSP). The Map Reduce framework protects the private data and is in charge of anonymizing the original data sets as per the privacy requirements.

KEY WORDS: Map Reduce, privacy preserving, big data, cloud service provider, cloud system, GSA, convolution, entropy

I. INTRODUCTION

Cloud computing refers both to the applications delivered as services over the Internet and to the hardware and systems software in the datacenters that provide those services [1]. It delivers software, infrastructure, and storage over the Internet, either as separate components or as a complete platform, based on user demand [2]. Cloud computing is not space constrained. Companies can access this service without incurring extra cost for information management, since they need not own the servers and can use capacity leased from third parties. Moreover, cloud computing needs to address three main security issues: confidentiality, integrity and availability. Because its flexibility is high, the cloud has become the breeding ground for a new generation of products and services. The drawback of this flexibility is the risk to the security and privacy of users' data [3]. For IT organizations dealing with cloud computing, cloud applications such as data storage, data retrieval and data portability are of significant use. Considering these requirements, IT development and user-oriented services can be globalized and delivered in a single click by means of cloud applications such as Big Data [4]. Owing to this advantage, many companies and organizations are migrating or building their business into the cloud. However, numerous potential customers are still hesitant to use the cloud because of security and privacy concerns [5], [6].

However, providing a secure and privacy-preserving data service is very challenging, as there are many security problems at multiple levels of data services, and security and privacy protection may impede the functionality and performance of those services [7]. Moreover, preserving privacy for multiple datasets remains a challenging problem. Privacy-preserving techniques like generalization can withstand most privacy attacks on a single dataset, so to preserve the privacy of multiple datasets it is useful to anonymize all datasets first and then encrypt them before storing or sharing them in the cloud. Usually, the volume of intermediate datasets is huge [8] when the amount of data is large. It is then difficult or even impossible to store the data on a single machine, which renders sequential algorithms unusable. In situations where the amount of data is prohibitively large, the Map Reduce [9] programming paradigm is used to overcome this obstacle. Thanks to the Map Reduce framework proposed by Google [10], which simplifies programming for distributed data processing, and the Hadoop implementation by the Apache Foundation [11], which makes the framework freely available to everyone, distributed programming is becoming more and more popular.


Although privacy issues are not new, their importance is amplified by cloud computing and Big Data [12]. Recent research on privacy issues in the Map Reduce framework on cloud makes use of mechanisms such as encryption [13], access control [14], differential privacy [15] and auditing [16] to protect data privacy. These mechanisms are well-known pillars of privacy protection, yet they still have open questions in the context of cloud computing and Big Data [12]. If the data sets are encrypted, processing them becomes quite challenging, because most existing applications only run on unencrypted data sets. Furthermore, many encryption algorithms have also been developed for big data privacy. Data anonymization is a promising category of approaches for achieving this goal [17]. However, the computing infrastructure and paradigm have moved to the Map Reduce framework in order to gain scalability, e.g., the newly emerging project Apache Mahout [18]. Thus, achieving privacy preservation and high utility of Big Data in the Map Reduce framework on cloud for mining or analytic applications is still a challenging problem and needs extensive investigation.

In this paper we present a privacy-preserving approach for big data in clouds using GSA and the Map Reduce framework. The approach consists of two modules: a Map Reduce (MR) module and an evaluation module. First, we split the data among n mappers. We then apply the convolution process in each mapper to obtain the encrypted data. Once the encryption process is over, the evaluation process uses GSA-SCGNN to assess how much of the data is properly encrypted. After that, the reducer removes repeated data. Finally, the data are stored with the Cloud Service Provider (CSP). The rest of the paper is organized as follows. A brief review of the literature on privacy preservation in big data is presented in Section 2. The technical preliminaries are explained in Section 3. The proposed methodology is described in detail in Section 4. The experimental results and performance evaluation are discussed in Section 5. Finally, the conclusions are summed up in Section 6.

II. RELATED WORKS

In the literature, several methods have been proposed for preserving the privacy of big data in the cloud. Some of the most recently published works are presented here.

Yogachandran et al. [19] have explained a privacy-preserving multi-class support vector machine for outsourcing data classification in the cloud. They describe a privacy-preserving (PP) data classification technique in which the server is unable to learn anything about the clients' input data samples, while the server-side classifier is hidden from the clients during the classification process. More specifically, they explained the first known client-server data classification protocol using a support vector machine. The protocol performs PP classification for both two-class and multi-class problems. It exploits the properties of Paillier homomorphic encryption and secure two-party computation. At the core of the protocol lies an efficient, novel protocol to securely obtain the sign of Paillier-encrypted numbers.

Moreover, Xuyun Zhang et al. [20] have explained a privacy leakage upper bound constraint-based approach for cost-effective privacy preservation of intermediate data sets in the cloud. Existing approaches widely adopt encrypting all data sets in the cloud to address this challenge. The authors argue that encrypting all intermediate data sets is neither efficient nor cost-effective, because it is very time consuming and costly for data-intensive applications to encrypt and decrypt data sets frequently while performing any operation on them. They propose an upper bound privacy leakage constraint-based approach to identify which intermediate data sets need to be encrypted and which do not, so that the privacy-preserving cost is reduced while the privacy requirements of data holders are still satisfied. Evaluation results demonstrated that the privacy-preserving cost of intermediate data sets was significantly reduced with their approach compared with existing ones in which all data sets are encrypted.

Likewise, Xingjian Li [21] has explained the mining of frequent item sets from library big data. Frequent item set mining plays an important part in college library data analysis. Since library databases contain a lot of redundant data, the mining process may generate intra-property frequent item sets, which significantly hinders its efficiency. To address this issue, the author introduced an improved FP-Growth algorithm, called RFP-Growth, that avoids generating intra-property frequent item sets, and, to further boost efficiency, implemented a Map Reduce version with an additional pruning strategy. The algorithm was tested using both synthetic and real-world library data, and the experimental results showed that it outperformed existing algorithms.


ubiquitous environment. Preserving the proprietor's data and information in the cloud is one of the most challenging missions for a cloud provider. Here they explained a hybrid authentication technique that acts as an end-point lock. It is a composite model coupled with an algorithm for preserving the user's privacy, called Hash Diff Anomaly Detection and Prevention (HDAD). This algorithmic protocol acts as a privacy-preserving model and technique to ensure that the user's data are kept secret, thereby building endorsed trust in providers. It also explores the necessity of maintaining the confidentiality of cloud users' data.

Moreover, Xin Dong et al. [23] have explained a privacy-preserving data sharing service in cloud computing. They explained an effective, scalable and flexible privacy-preserving data policy with semantic security, utilizing ciphertext-policy attribute-based encryption (CP-ABE) combined with identity-based encryption (IBE) techniques. In addition to ensuring robust data sharing security, this policy succeeds in preserving the privacy of cloud users and supports efficient and secure dynamic operations including, but not limited to, file creation, user revocation and modification of user attributes. Security analysis indicates that the system policy is secure under the generic bilinear group model in the random oracle model, thereby enforcing fine-grained access control, full collusion resistance and backward secrecy. Furthermore, performance analysis and experimental results show that the overhead is as light as possible.

Additionally, Xuyun Zhang et al. [24] have explained a privacy-preserving layer over Map Reduce on cloud. Because cloud computing provides powerful and economical infrastructural resources for handling ever-increasing Big Data with data-processing frameworks such as Map Reduce, the Map Reduce framework has been widely adopted by various companies and organizations to process huge-volume data sets. Nevertheless, privacy concerns in Map Reduce are aggravated because privacy-sensitive information is scattered among various data sets and can be recovered with more ease when data and computational power are abundant. The proposed layer ensures data privacy preservation and data utility under the given privacy requirements before the data are further processed by subsequent Map Reduce tasks. A corresponding prototype system was developed for the privacy-preserving layer as well.

III. TECHNICAL PRELIMINARIES

A. System model:

The basic architecture of privacy preserving in cloud computing is shown in Fig. 1. The system consists of four major components: the data owner (DO), the cloud storage provider (CSP), the trusted authority (TA) and the user. The data owner is the person, enterprise or business that uploads data to the cloud space and might later update the outsourced data by performing modify, delete, insert and append operations. The cloud storage provider has a significant amount of computing resources and storage and is responsible for managing the cloud servers and the DO's data. The trusted authority has expertise and capabilities that the cloud users do not have and is trusted to assess the reliability of the cloud storage service on behalf of the user upon request. The user has to be authenticated by the DO as a trusted user and be permitted a pre-determined type of access to the outsourced data.

Fig. 1: Network architecture of privacy preserving in cloud computing

B. Map reduce (MR) framework


programs are generally used to process large files. The input and output for the map and reduce functions are expressed in the form of key-value pairs.

Fig. 2: Flow diagram of Map Reduce Framework

Basically, a Map Reduce job consists of two primitive functions, Map and Reduce, defined over a data structure named key-value pair $(key, value)$. Specifically, the Map function can be formalized as $Map: (k_1, v_1) \rightarrow (k_2, v_2)$, i.e., Map takes a pair $(k_1, v_1)$ as input and then outputs another intermediate key-value pair $(k_2, v_2)$. These intermediate pairs are consumed by the Reduce function as input. Formally, the Reduce function can be represented as $Reduce: (k_2, list(v_2)) \rightarrow (k_3, v_3)$, i.e., Reduce takes an intermediate key $k_2$ and all its corresponding values $list(v_2)$ as input and outputs another pair $(k_3, v_3)$. Usually, the list of $(k_3, v_3)$ pairs is the result that Map Reduce users attempt to obtain. Both the Map and Reduce functions are specified by users according to their specific applications. An instance running a Map function is called a Mapper, and one running a Reduce function is called a Reducer.
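The Map and Reduce signatures described above can be illustrated with a minimal single-process sketch. It is not the paper's implementation and does not use Hadoop; the word-count task and the names `map_fn`, `reduce_fn` and `run_map_reduce` are assumptions chosen only for illustration.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). Here k1 is a line number and v1 a line of text."""
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: (k2, list(v2)) -> (k3, v3). Here it sums the counts of one word."""
    return (key, sum(values))

def run_map_reduce(records, map_fn, reduce_fn):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())]

# Hypothetical input: two short lines of text keyed by line number.
records = [(0, "big data in cloud"), (1, "privacy of big data")]
result = run_map_reduce(records, map_fn, reduce_fn)
print(result)
```

In a real deployment the grouping step is performed by the framework across machines; the sketch only shows how the key-value contracts of the two functions fit together.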

C. Gravitational search algorithm

The gravitational search algorithm (GSA) is a population-based heuristic algorithm built on the law of gravity and mass interactions [27]. In this algorithm, the agents are treated as objects and their performance is measured by their masses. All the objects attract each other by the force of gravity, which causes a global movement of all objects towards those with heavier masses. Hence, the masses cooperate through a direct form of communication, the gravitational force. The heavier masses, which correspond to good solutions, move more slowly, and this ensures the exploitation stage of the technique. In the GSA, each mass (agent) has four specifications: its position, inertial mass, active gravitational mass, and passive gravitational mass. The position of the mass represents a solution of the problem, and its gravitational and inertial masses are determined by means of a fitness function. Thus, each mass offers a solution, and the technique is guided by appropriately adapting the gravitational and inertial masses. Over time, the masses are attracted by the heaviest mass, which represents an optimum solution in the search space. The GSA can be considered as an isolated system of masses. It is like a small artificial world of masses obeying the Newtonian laws of gravitation and motion. More precisely, masses obey the following laws:

Law of gravity: "Every particle attracts every other particle, and the gravitational force between two particles is directly proportional to the product of their masses and inversely proportional to the distance between them."

Law of motion: "The current velocity of any mass is equal to the sum of a fraction of its previous velocity and the variation in the velocity. The variation in the velocity, or acceleration, of any mass is equal to the force acting on the system divided by the mass of inertia."

Consider a system with $N$ agents (masses). The position of the $i$-th agent is represented as:

$$X_i = (x_i^1, \ldots, x_i^d, \ldots, x_i^n), \quad i = 1, 2, \ldots, N \qquad (1)$$

where $x_i^d$ is the position of the $i$-th agent in the $d$-th dimension.

At a particular time $t$, the force acting on mass $i$ from mass $j$ is represented as follows:

$$F_{ij}^d(t) = G(t)\,\frac{M_{pi}(t)\,M_{aj}(t)}{R_{ij}(t) + \varepsilon}\,\bigl(x_j^d(t) - x_i^d(t)\bigr) \qquad (2)$$

where $M_{aj}$ is the active gravitational mass associated with agent $j$, $M_{pi}$ is the passive gravitational mass linked to agent $i$, $G(t)$ is the gravitational constant at time $t$, $\varepsilon$ is a small constant, and $R_{ij}(t)$ is the Euclidean distance between the two agents $i$ and $j$.

The Euclidean distance is given by equation (3):

$$R_{ij}(t) = \lVert X_i(t), X_j(t) \rVert_2 \qquad (3)$$

The total force acting on agent $i$ in dimension $d$ is a randomly weighted sum of the $d$-th components of the forces exerted by the other agents, as given by equation (4):

$$F_i^d(t) = \sum_{j=1,\, j \neq i}^{N} rand_j\, F_{ij}^d(t) \qquad (4)$$

where $rand_j$ is a random value within the range [0, 1]. Thus, by the law of motion, the acceleration of agent $i$ at time $t$ in direction $d$ is:

$$a_i^d(t) = \frac{F_i^d(t)}{M_{ii}(t)} \qquad (5)$$

where $M_{ii}$ is the inertial mass of agent $i$.

Moreover, the next velocity of an agent is a fraction of its current velocity added to its acceleration. Hence, its velocity and position are updated by means of the following equations:

$$v_i^d(t+1) = rand_i \cdot v_i^d(t) + a_i^d(t) \qquad (6)$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \qquad (7)$$

where $rand_i$ is a uniform random variable in the interval [0, 1].

The gravitational constant $G$ is initialized at the start and is reduced over time so as to control the search accuracy. Thus, $G$ is a function of the initial value $G_0$ and time $t$:

$$G(t) = G(G_0, t) \qquad (8)$$

The gravitational and inertial masses are calculated from the fitness evaluation. A heavier mass indicates a more efficient agent. In other words, better agents exert stronger attraction and move more slowly. Assuming the equality of the gravitational and inertial masses, the values of the masses are calculated from the map of fitness. The gravitational and inertial masses are updated by means of the following equations:

$$M_{ai} = M_{pi} = M_{ii} = M_i, \quad i = 1, 2, \ldots, N$$

$$m_i(t) = \frac{fit_i(t) - worst(t)}{best(t) - worst(t)} \qquad (9)$$

$$M_i(t) = \frac{m_i(t)}{\sum_{j=1}^{N} m_j(t)} \qquad (10)$$

where $fit_i(t)$ signifies the fitness value of agent $i$ at time $t$, and $best(t)$ and $worst(t)$ are defined as follows (for a minimization problem):

$$best(t) = \min_{j \in \{1, \ldots, N\}} fit_j(t) \qquad (11)$$

$$worst(t) = \max_{j \in \{1, \ldots, N\}} fit_j(t) \qquad (12)$$

For a maximization problem, equations (11) and (12) are changed to equations (13) and (14), respectively:

$$best(t) = \max_{j \in \{1, \ldots, N\}} fit_j(t) \qquad (13)$$

$$worst(t) = \min_{j \in \{1, \ldots, N\}} fit_j(t) \qquad (14)$$

One way to reach a good compromise between exploration and exploitation is to reduce, over time, the number of agents that exert force in equation (4). The performance of the GSA is improved by regulating exploration and exploitation in this way. Only the $K_{best}$ agents with the best fitness attract the others. $K_{best}$ is a function of time, with initial value $K_0$ at the start, and it decreases over time. At the beginning all the agents apply force, and as time elapses $K_{best}$ is reduced linearly until finally just one agent applies force to the others. Hence, equation (4) can be rewritten as:

$$F_i^d(t) = \sum_{j \in K_{best},\, j \neq i} rand_j\, F_{ij}^d(t) \qquad (15)$$

where $K_{best}$ is the set of the first $K$ agents with the best fitness value and biggest mass.
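The update rules of the GSA can be sketched in a few lines of plain Python. This is a minimal illustration for a minimization problem, not the authors' implementation. The exponential decay used for the gravitational constant, the parameter values (`g0`, `alpha`, bounds, population size) and the function name `gsa_minimize` are assumptions, since the text only states that the constant decreases over time.

```python
import math
import random

def gsa_minimize(fitness, dim, n_agents=20, iters=100, g0=100.0, alpha=20.0,
                 lo=-5.0, hi=5.0, eps=1e-12, seed=0):
    """Minimal gravitational search sketch (minimization)."""
    rng = random.Random(seed)
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    V = [[0.0] * dim for _ in range(n_agents)]
    best_x, best_f = None, float("inf")
    for t in range(iters):
        fit = [fitness(x) for x in X]
        for x, f in zip(X, fit):
            if f < best_f:
                best_x, best_f = list(x), f
        best, worst = min(fit), max(fit)
        # Normalised masses from fitness: the best agent gets the heaviest mass.
        if best == worst:
            m = [1.0] * n_agents
        else:
            m = [(f - worst) / (best - worst) for f in fit]
        total = sum(m)
        M = [mi / total for mi in m]
        # Assumed schedule: exponential decay of the gravitational constant.
        G = g0 * math.exp(-alpha * t / iters)
        # Kbest shrinks linearly from all agents down to a single agent.
        k = max(1, round(n_agents * (1 - t / iters)))
        kbest = sorted(range(n_agents), key=lambda i: fit[i])[:k]
        A = [[0.0] * dim for _ in range(n_agents)]
        for i in range(n_agents):
            for j in kbest:
                if j == i:
                    continue
                R = math.dist(X[i], X[j])
                r = rng.random()
                for d in range(dim):
                    # Force divided by inertial mass: with passive mass equal to
                    # inertial mass, the mass of agent i cancels out.
                    A[i][d] += r * G * M[j] * (X[j][d] - X[i][d]) / (R + eps)
        for i in range(n_agents):
            for d in range(dim):
                V[i][d] = rng.random() * V[i][d] + A[i][d]
                X[i][d] = min(hi, max(lo, X[i][d] + V[i][d]))
    for x in X:  # evaluate the final positions as well
        f = fitness(x)
        if f < best_f:
            best_x, best_f = list(x), f
    return best_x, best_f

# Hypothetical test problem: the sphere function in two dimensions.
best_x, best_f = gsa_minimize(lambda x: sum(v * v for v in x), dim=2)
print(best_x, best_f)
```

Cancelling the mass of agent $i$ in the acceleration avoids a division by zero for the worst agent, whose normalised mass is zero.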

D. Back propagation Neural network learning algorithm


Basically, the neural network consists of three layers: an input layer, a hidden layer and an output layer. Let the input vector be denoted as $\{x_1, x_2, \ldots, x_a\}$. The values of the hidden layer nodes are denoted as $\{h_1, h_2, \ldots, h_b\}$ and the values of the output layer nodes as $\{o_1, o_2, \ldots, o_c\}$. $w^h_{jk}$ denotes the weight connecting input layer node $k$ and hidden layer node $j$, and $w^o_{ij}$ denotes the weight connecting hidden layer node $j$ and output layer node $i$, where $1 \le k \le a$, $1 \le j \le b$, $1 \le i \le c$. Finally, the error value is calculated with the error function $E = \frac{1}{2}\sum_{i=1}^{c}(t_i - o_i)^2$, where $t_i$ is the target value. Figure 3 shows the configuration of the BP neural network.

Fig. 3: Configuration of BP neural network


input layer to hidden layer and then to the output layer. This is called the forward pass of the back propagation algorithm. In the forward pass, each neuron in the hidden layer receives input from all neurons in the input layer; the inputs are multiplied by the appropriate weights and then summed. The tan-sigmoid transfer function is used as the activation function to compute the output of each neuron in the hidden layer. Each neuron in the output layer likewise receives input from all neurons in the hidden layer, multiplied by the appropriate weights and summed, and the tan-sigmoid transfer function is again used to compute its output. The output values of the output layer are compared with the target output values. The error between the actual and target output values is calculated and propagated back toward the hidden layer. This is called the backward pass of the back propagation algorithm. The error is used to update the connection weights between neurons: the weight matrices between the input and hidden layers and between the hidden and output layers are updated. Table 1 shows the back propagation neural network learning algorithm.

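Input: $N$ input sample vectors $V_i$, $1 \le i \le N$, with $a$ dimensions; maximum number of iterations $iteration_{max}$; learning rate $\eta$; target value $t_i$; sigmoid function $f(x)$

Output: Network with final weights $w^h_{jk}$, $w^o_{ij}$, $1 \le k \le a$, $1 \le j \le b$, $1 \le i \le c$

Begin:

    Randomly initialize $w^h_{jk}$, $w^o_{ij}$
    for iteration = 1, 2, ..., $iteration_{max}$ do
        for sample = 1, 2, ..., N do
            // feed forward stage:
            for j = 1, 2, ..., b do
                $h_j = f\left(\sum_{k=1}^{a} x_k\, w^h_{jk}\right)$
            for i = 1, 2, ..., c do
                $o_i = f\left(\sum_{j=1}^{b} h_j\, w^o_{ij}\right)$
            if $E = \frac{1}{2}\sum_{i=1}^{c}(t_i - o_i)^2 >$ threshold then
                // back propagation stage
                update $w^o_{ij}$ and $w^h_{jk}$ according to equations (16) and (17)
            else
                // learning finish
                break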

Table 1: Back propagation neural network learning algorithm

As described in Table 1, all the weights are initialized as small random numbers. In the Feed Forward Stage (FFS), the values at each layer are calculated using the weights, the sigmoid function, and the values at the previous layer. In the Back Propagation Stage (BPS), the algorithm checks whether the error between the output values and the target values is within the threshold. If not, all the weights are modified according to equations (16) and (17) and the learning procedure is repeated. Learning terminates only when the error is within the threshold or the maximum number of iterations is exceeded. After learning, the final weights on each link are used to generate the learned network.

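$$w^o_{ij} \leftarrow w^o_{ij} - \eta\,\frac{\partial E}{\partial w^o_{ij}} \qquad (16)$$

$$w^h_{jk} \leftarrow w^h_{jk} - \eta\,\frac{\partial E}{\partial w^h_{jk}} \qquad (17)$$

where $\eta$ is the learning rate, $E$ is the error function, $w^o_{ij}$ is a hidden-to-output weight and $w^h_{jk}$ is an input-to-hidden weight.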
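The feed-forward and back-propagation stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses the logistic sigmoid as the activation function, a squared-error loss, a constant input of 1.0 acting as a bias, and a hypothetical toy task (the logical OR function). All function names are chosen only for this example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h, w_o):
    """Feed forward stage: input -> hidden -> output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_o]
    return h, o

def total_error(samples, w_h, w_o):
    """Sum over samples of 0.5 * squared difference between target and output."""
    return sum(0.5 * sum((t - o) ** 2 for t, o in zip(target, forward(x, w_h, w_o)[1]))
               for x, target in samples)

def train(samples, n_hidden, eta=0.5, max_iter=2000, threshold=1e-3, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    for _ in range(max_iter):
        if total_error(samples, w_h, w_o) <= threshold:
            break  # learning finished: error is within the threshold
        for x, t in samples:
            h, o = forward(x, w_h, w_o)
            # Back propagation stage: gradients of the error w.r.t. pre-activations.
            delta_o = [(o_i - t_i) * o_i * (1 - o_i) for o_i, t_i in zip(o, t)]
            delta_h = [h[j] * (1 - h[j]) * sum(delta_o[i] * w_o[i][j] for i in range(n_out))
                       for j in range(n_hidden)]
            # Gradient-descent weight updates with learning rate eta.
            for i in range(n_out):
                for j in range(n_hidden):
                    w_o[i][j] -= eta * delta_o[i] * h[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    w_h[j][k] -= eta * delta_h[j] * x[k]
    return w_h, w_o

# Hypothetical toy data: logical OR, with a constant 1.0 input acting as bias.
samples = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
           ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
err_before = total_error(samples, *train(samples, 3, max_iter=0))
w_h, w_o = train(samples, 3)
err_after = total_error(samples, w_h, w_o)
print(err_before, err_after)
```

Calling `train` with `max_iter=0` returns the initial random weights for the same seed, which gives a baseline error to compare against.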

IV. PROPOSED PRIVACY PRESERVING OVER BIG DATA


Fig. 4: Overall diagram of the proposed approach

This work consists of two modules: the Map Reduce module and the evaluation module. Our aim is to reduce and encrypt the dataset and to store the encrypted data with the cloud service provider (CSP). A detailed description is given in the following sections.

Module 1: MAP REDUCE (MR) FRAMEWORK


Consider a dataset $D$ with $m$ attributes. First, we split the data into $n$ subsets $D_1, D_2, \ldots, D_n$. We then map the corresponding split data. Besides the functionality of such an application, a large volume of intermediate data sets is generated. In this process, the main challenge is securing the confidentiality of the intermediate data sets, because adversaries can steal privacy-sensitive information by examining multiple intermediate data sets. In this paper, we protect the intermediate data sets using conditional entropy combined with a convolution process. Without such protection, malicious parties can easily intercept the data in the cloud. When the convolution process is carried out correctly, the utility and privacy of the data are well secured in the cloud. The overall process of phase 1 is shown in Fig. 5.

Fig. 5: The overall process of phase 1 map reduce framework

In Fig. 5, the main component is the convolution process, which transforms the data into another format. In our work, a conditional entropy based convolution process is used to convert the data. The explanation is given in the following section.

Convolution process:

Convolution is a mathematical way of combining two signals to form a third signal. More precisely, convolution is a mathematical operation on two functions $f$ and $g$ that produces a third function, typically viewed as a modified version of one of the original functions, giving the area of overlap between the two functions as a function of the amount by which one of the original functions is translated. Let $f$ and $g$ be two functions. The convolution of $f$ and $g$, denoted by $f * g$, is given by:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \qquad (18)$$
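For sampled data the integral becomes a sum. The following minimal sketch shows the discrete form of the operation on two short sequences. The function name `convolve` and the sample values are assumptions for illustration only, and the paper's own block-wise procedure on data matrices may differ.

```python
def convolve(f, g):
    """Full discrete convolution: (f * g)[n] = sum over k of f[k] * g[n - k]."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

# Hypothetical values: a short data sequence and a small kernel.
signal = [1, 2, 3]
kernel = [0, 1, 0.5]
result = convolve(signal, kernel)
print(result)
```

The operation is commutative, so `convolve(kernel, signal)` yields the same sequence.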

After the convolution process, we calculate the conditional entropy of the dataset. Before calculating the conditional entropy, we calculate the entropy. Entropy in its basic form is a measure of uncertainty rather than a measure of information. Specifically, the entropy of a random variable is a measure of the uncertainty associated with that random variable. The entropy of a discrete random variable $Y$ is defined by:

$$H(Y) = -\sum_{y} p(y)\, \log p(y) \qquad (19)$$

where $p(y)$ is the probability that $Y$ takes the value $y$ and the entropy is denoted by $H(Y)$. Entropy is always non-negative, i.e., $H(Y) \ge 0$. Entropy measures the uncertainty inherent in the distribution of a random variable. Conditional entropy is a simple extension that measures the uncertainty in the conditional distribution of one random variable of a pair given the other. Here, the first column $X$ and the original class $Y$ form the pair of random variables.

$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) \qquad (20)$$

It can also be expanded and simplified as:

$$H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y)\, \log p(y \mid x) \qquad (21)$$
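The two quantities can be computed from empirical frequencies as in the sketch below. The base-2 logarithm, the function names and the toy columns are assumptions made for the example; the text does not state which logarithm base is used.

```python
import math
from collections import Counter

def entropy(ys):
    """H(Y) = -sum_y p(y) log2 p(y), with p estimated from frequencies."""
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p estimated from frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_x = Counter(xs)
    return -sum((c / n) * math.log2(c / marginal_x[x]) for (x, _), c in joint.items())

# Hypothetical columns: xs plays the role of the first column, ys the class label.
xs = ["a", "a", "b", "b"]
ys = [0, 1, 0, 0]
h_y = entropy(ys)
h_y_given_x = conditional_entropy(xs, ys)
print(h_y, h_y_given_x)
```

In this toy example knowing `xs` removes part of the uncertainty about `ys`, so the conditional entropy is lower than the entropy.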

Once the convolution process is over, we apply the evaluation process to the encrypted data. The evaluation is done using the SCG-NN classifier. If the data are found not to be correctly encrypted, they are passed through the convolution process again. The pseudo code of the data encryption process is given in Table 2.

Input: Big data dataset
Output: Encrypted dataset

Start
{
    Consider the dataset
    Split the dataset among the mappers
    For each mapper, apply the convolution process:
        Split the mapper's data into blocks
        Generate a random matrix with values from 0 to 5
        Multiply each block with the random matrix to generate a new matrix using (18)
        Calculate the entropy of the new matrix using equation (19)
        Calculate the conditional entropy of the new matrix using (21)
        Replace the corresponding matrix entry with the conditional entropy value
        Repeat until every position of the new matrix is replaced by the conditional entropy value
    Obtain the encrypted dataset
}
End

Table 2: Pseudo code of data encryption

Module 2: EVALUATION PROCESS


After the convolution function is done, the privacy preservation of big data in cloud computing is evaluated. Using the scaled conjugate gradient neural network (SCG-NN), a classifier-based measure is built that should capture the essential factors affecting the attributes of the data for our application. The outcome of the convolution phase is tested by the SCG-NN classifier. The original database is divided into n sets, and each set is used as input to the neural network (NN). The output is estimated from the encrypted data and the original data. From these scores we compute the error value, and from the error value we determine whether the data are adequately encrypted. The processing of the neural network structure is illustrated in Table 1.

Neural networks are mainly used for classification and pattern recognition. Among the many neural network models in use, the feed-forward network trained with the back-propagation technique is adopted here. Generally, a neural network is a set of nodes and links. The nodes represent the neurons, and the links represent the connections and the flow of data between the neurons. Connections are quantified by weights, which are adjusted during training. During training, a set of training elements is presented to the network. Each training element is described by a feature vector (known as the input vector) and is associated with a preferred output, given as another vector (known as the desired output vector). The back-propagation technique works as follows [5]: an input pattern is presented to the network, the output is compared with the desired output, and the error between them is computed. All related weights are then adjusted so that the next time the same instance is processed, the actual output is nearer to the desired one, that is, the error is reduced. This procedure continues until the minimum error is reached or a given number of training epochs is completed. Back propagation is particularly suited to training neural networks in which the performance index is the Mean Squared Error (MSE), but it is not able to escape local minima. To overcome this problem, this paper proposes a new open-loop technique that integrates GSA and the Scaled Conjugate Gradient (SCG) algorithm to train the neural network for big data.

Fig. 6: Visual representation of evaluation process for proposed methodology


Solution representation


Initially, the GSA begins with the solution encoding process, in which a solution represents a set of weights (input and output layer). A solution is a vector in which each dimension (location) holds one weight, so that a solution carries the weight allocation information for the neural network structure. The solution index runs from 1 to z, where z is the number of solutions in the population. Each solution contains two parts, which hold the input and output weight allocation information. The length of a solution is equal to the number of weights in the NN structure. The search is concluded when the maximum number of iterations is reached. The solution representation format is illustrated in Figure 7.

Fig. 7: Solution representation format

Fitness computation

Before the GSA is run, the fitness of the initial population is calculated. The fitness value of each solution is produced by the fitness function and indicates how suitable the solution is for the weight optimization problem. For every solution, the fitness is calculated as the Mean Magnitude of Relative Error (MMRE), which is given in equation (22).

(22)

(23)

(24) Where;

Original value Obtained output

Weight update based on GSA

In this stage, the agents are moved according to the law of gravity and mass interactions. In this algorithm, agents are represented as objects and their performance is measured by their masses. During the search process, the agents are moved according to the following equations:

$$v_i^d(t+1) = rand_i \cdot v_i^d(t) + a_i^d(t) \qquad (25)$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \qquad (26)$$

where $x_i^d$ is the position of agent $i$ in dimension $d$, $v_i^d$ is the velocity of agent $i$ in dimension $d$, and $a_i^d$ is the acceleration of agent $i$ in dimension $d$, which is given in equation (5).

1) Stopping criterion

The algorithm stops only when the maximum number of iterations is reached, and the solution holding the best fitness value is selected.

2) Evaluation based on SCG-NN

In this section, we evaluate the encrypted data obtained from module 1. The evaluation is carried out with the scaled conjugate gradient neural network (SCG-NN). The output of module 1 is given as input to the SCG-NN classifier. The network is trained on the data, and the optimal weights are calculated using the gravitational search algorithm (GSA). Finally, we measure the MMRE between the original output and the obtained output, and compare the MMRE value with a randomly assigned threshold value. If the MMRE value is higher than the threshold, the data are not considered privacy-preserved; they are returned to module 1 and the convolution process is applied again. If the MMRE value is less than the threshold, the data are considered privacy-preserved and are automatically passed to the reduce part and stored in the cloud. The MMRE equation and conditions are given in equations (27) and (28).

(27)

(28) Where;

Original value Obtained output

The data must satisfy the condition given in equation (29).

(29)
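The accept-or-repeat decision described above can be sketched as a small control loop. This is only an illustration: `transform` and `evaluate` are hypothetical stand-ins for the convolution step of module 1 and the MMRE-based evaluation, the `max_rounds` cap is an added safeguard, and treating an MMRE exactly equal to the threshold as acceptable is an assumption.

```python
def protect_block(block, transform, evaluate, threshold, max_rounds=10):
    """Repeat the transform until the evaluation score is within the threshold.

    Returns the accepted block and the number of rounds that were needed.
    """
    for rounds in range(1, max_rounds + 1):
        block = transform(block)          # module 1: convolution step (stand-in)
        if evaluate(block) <= threshold:  # module 2: MMRE check (stand-in)
            return block, rounds          # accepted: pass on to reduce and store
    raise RuntimeError("block did not meet the threshold within max_rounds")

# Hypothetical stand-ins: halve the values, and score a block by its largest value.
out, rounds = protect_block([8.0], lambda b: [v * 0.5 for v in b],
                            lambda b: max(b), threshold=1.0)
print(out, rounds)
```

Bounding the number of rounds prevents an endless loop if a block can never satisfy the threshold.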


propagation algorithm, a second order training algorithm for training of neural network. The SCG training algorithm was developed to avoid the time-consuming line search, thus significantly reducing the number of computations performed in each iteration, although it may require more iterations to converge than the other conjugate gradient algorithms. To train a neural network, the gradient G of the loss function is computed with respect to each weight of the network. It indicates how a small change in that weight affects the overall error. Initially, the loss function is divided into separate terms for each point in the training data. In each iteration of the SCG training algorithm, an attempt is made to reduce this global error by adjusting the weights and biases. This property of the SCG algorithm reliably increases the learning speed in successive iterations. Let us define the error function through its Taylor series expansion.

(30)

Suppose the notations are used as and. Weight vector in iteration may be mentioned as . C is the Hessian matrix and B is the local gradient vector. We use a second-order approximation of the error for further calculations, which may be written as follows:

(31)

The solution of the quadratic equation may be found as follows:

(32)

The above solution holds provided that C is a positive definite matrix. This equation states how much each weight must be shifted so that the error converges significantly.
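The description above matches the standard second-order Taylor model used in conjugate gradient training. As a sketch only, let E be the error function, w_k the weight vector at iteration k and Δw the weight change. These three symbols are assumed names; only B (local gradient) and C (Hessian) come from the text. The expansion, its quadratic approximation and the resulting weight shift can then be written as:

```latex
\begin{align*}
E(w_k + \Delta w) &= E(w_k) + B^{T}\Delta w + \tfrac{1}{2}\,\Delta w^{T} C\,\Delta w + \cdots \\
E(w_k + \Delta w) &\approx E(w_k) + B^{T}\Delta w + \tfrac{1}{2}\,\Delta w^{T} C\,\Delta w \\
\Delta w &= -\,C^{-1} B
\end{align*}
```

The last line follows from setting the gradient of the quadratic model with respect to Δw to zero, which gives B + CΔw = 0. This stationary point is a minimum only when C is positive definite, which is the condition stated in the text.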

V. EXPERIMENTAL EVALUATION

A. Experiment settings

Our experiments are conducted in a cloud environment named CloudSim. We have implemented our proposed Privacy Preserving-Aware over Big Data in Clouds system using Java (jdk 1.6), and a series of experiments were performed on a PC with the Windows XP operating system, a 2 GHz dual-core processor and 4 GB of main memory, running a 64-bit version of Windows 2007. We conduct two groups of experiments in this section to evaluate the effectiveness and efficiency of our approach: privacy preserving based on the convolution process, and the evaluation process.

B. Dataset description

The Census-Income (KDD) data set [31] is used in our experiments. Its subset, the Adult data set, has commonly been used as a de facto benchmark for testing anonymization algorithms [32]. In total, the data set has 299,285 records and 40 attributes. The data set is sanitized by removing records containing missing values and attributes with extremely skewed distributions. We obtain a sanitized data set with 153,926 records, from which the data sets in the following experiments are sampled. Twelve attributes are chosen out of the original 40, including 9 quasi-identifier attributes (4 numerical and 5 categorical) and 3 sensitive attributes (2 numerical and 1 categorical). The privacy-preserving system in the cloud is implemented in Java, and it is designed on the Map Reduce framework using CloudSim.
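The first sanitization step, dropping records that contain missing values, can be illustrated with a short Java sketch. It assumes comma-separated records and a "?" marker for missing values. The marker, the class name and the method names are assumptions made for illustration. The removal of attributes with skewed distributions is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of record sanitization; the "?" marker and all names are assumptions. */
final class RecordSanitizer {

    /** Assumed marker for a missing value; the real marker depends on the data file. */
    static final String MISSING = "?";

    /** Returns true if any field is empty or equals the missing-value marker. */
    static boolean hasMissingValue(String[] fields) {
        for (String field : fields) {
            if (field.isEmpty() || field.equals(MISSING)) {
                return true;
            }
        }
        return false;
    }

    /** Splits each comma-separated line, trims the fields and keeps only complete records. */
    static List<String[]> sanitize(List<String> csvLines) {
        List<String[]> kept = new ArrayList<>();
        for (String line : csvLines) {
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim();
            }
            if (!hasMissingValue(fields)) {
                kept.add(fields);
            }
        }
        return kept;
    }
}
```

Calling `RecordSanitizer.sanitize` on the raw lines returns only the complete records, which could then be sampled for the experiments.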

C. Experiment Process and Results


The basic idea of our research is privacy-preserving-aware processing of big data in clouds using GSA and the Map Reduce framework. We implement our work in two modules: the Map Reduce framework and the evaluation process. In the first module, we encrypt the data using the convolution process, so that inside the mapper the data are in encrypted form. After encryption, we use the GSA-SCGNN classifier to evaluate whether the data are correctly encrypted. The gravitational search algorithm (GSA) is an optimization algorithm used to optimize the weights inside the neural network. Finally, the reducer removes repeated data, and the encrypted data are stored in the cloud.
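The map and reduce roles described above can be sketched in plain Java, without a Hadoop dependency. The sketch is only an illustration under stated assumptions. The 1-D full convolution, the kernel values and all class and method names are hypothetical, since the convolution used by the proposed system is defined by its own kernel matrix. The reducer here simply drops repeated records.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative map/reduce flow; the 1-D convolution and kernel are assumptions, not the paper's method. */
final class ConvolutionMapReduceSketch {

    /** Full 1-D discrete convolution of one numeric record with a kernel. */
    static double[] convolve(double[] record, double[] kernel) {
        double[] out = new double[record.length + kernel.length - 1];
        for (int i = 0; i < record.length; i++) {
            for (int j = 0; j < kernel.length; j++) {
                out[i + j] += record[i] * kernel[j];
            }
        }
        return out;
    }

    /** Map step: transform every record independently. */
    static List<double[]> map(List<double[]> records, double[] kernel) {
        List<double[]> mapped = new ArrayList<>();
        for (double[] record : records) {
            mapped.add(convolve(record, kernel));
        }
        return mapped;
    }

    /** Reduce step: drop repeated records, keeping the first occurrence of each. */
    static List<double[]> reduce(List<double[]> mapped) {
        Set<List<Double>> seen = new HashSet<>();
        List<double[]> unique = new ArrayList<>();
        for (double[] record : mapped) {
            List<Double> key = new ArrayList<>();
            for (double value : record) {
                key.add(value);
            }
            if (seen.add(key)) { // add returns false for a record already seen
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<double[]> records = new ArrayList<>();
        records.add(new double[] {1, 2, 3});
        records.add(new double[] {1, 2, 3}); // repeated record
        records.add(new double[] {0, 1, 0});
        double[] kernel = {1, -1};           // hypothetical kernel
        List<double[]> reduced = reduce(map(records, kernel));
        System.out.println("records kept after reduce: " + reduced.size());
    }
}
```

In a real deployment the map and reduce steps would run as distributed tasks, and the evaluation module would sit between them to decide whether a record needs another convolution pass.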


Fig. 9: Performance analysis of memory by varying data size

Fig. 10: Performance analysis of execution time by varying number of mapper

Fig. 11: Performance analysis of memory by varying number of mapper

Fig. 8 shows that when the size of the database or the number of data records increases, a high computational overhead is incurred on the system. For example, the computational time for updating a 10,000-record data set is 254,684 ms, and when the size increases to 25,000 records the overhead increases to 364,714 ms (with the number of mappers set to 10). This indicates that our method is applicable to big data files with dynamic properties. Moreover, the figure shows that we obtain the minimum computation time compared to the existing techniques GA+NN, FA+NN and NN. The optimization approach improves the performance of the big data system. In Fig. 8, without optimization-based privacy preserving, we obtain a maximum computation time of 558,577 ms, compared with 50,000 ms when using the proposed approach.

Fig. 9 shows the performance analysis of memory usage by varying data size. This figure describes the change in memory usage of the proposed and existing approaches with respect to the number of data records, which ranges from 10,000 (10k) to 100,000 (100k). As the scale of data in these experiments is much greater than that in [9], the data sets in our experiments are big enough to evaluate the effectiveness of our approach in terms of data volume. Here, the number of mappers is fixed at 10. When 10,000 records are used for big data processing, the memory usage is 1,247,141 bits. As the number of records increases, the memory usage also increases.

In the evaluation module, we use the GSA-SCGNN classifier. In this section we check how accurately the already encrypted data are encrypted. For weight optimization in this classifier, we use the gravitational search algorithm. The decision is based on the threshold value explained in the evaluation process. Fig. 12 shows the performance analysis of execution time by varying threshold. When the threshold value is 0.1, the execution time is 268,714 ms for the proposed GSA+SCGNN, 306,543 ms for GA+NN, 323,343 ms for FA+NN and 347,654 ms for NN. Similarly, Fig. 13 shows the performance analysis of memory usage by varying threshold. Here too, we obtain the minimum memory usage compared to the existing approaches. Overall, the experimental results demonstrate that our approach significantly improves the privacy preservation, scalability and efficiency of the Map Reduce system over existing approaches.
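To make the comparison at threshold 0.1 concrete, the relative reduction in execution time can be computed directly from the four reported timings. The following Java sketch does only that arithmetic. The class and method names are illustrative assumptions, and the timing values are the ones quoted in the text.

```java
/** Relative execution-time reduction computed from the reported timings; names are illustrative. */
final class TimingComparison {

    /** Fractional reduction of the proposed time relative to a baseline time. */
    static double relativeReduction(double baselineMs, double proposedMs) {
        return (baselineMs - proposedMs) / baselineMs;
    }

    public static void main(String[] args) {
        double proposedMs = 268714;                   // GSA+SCGNN at threshold 0.1
        String[] names = {"GA+NN", "FA+NN", "NN"};
        double[] baselinesMs = {306543, 323343, 347654};
        for (int i = 0; i < names.length; i++) {
            double reduction = relativeReduction(baselinesMs[i], proposedMs);
            System.out.printf("%s: %.1f%% lower with the proposed approach%n",
                    names[i], 100.0 * reduction);
        }
    }
}
```

With the reported figures, the proposed approach is roughly 12.3% faster than GA+NN, 16.9% faster than FA+NN and 22.7% faster than NN at this threshold.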

VI. CONCLUSION


the cloud service provider. The implementation is done in Java, and the performance of the algorithm is analyzed with a benchmark data set. According to the experiments, the proposed algorithm achieves the maximum accuracy compared to the existing approaches.

REFERENCES

[1] Chandramohan Dhasarathan, Sathian Dananjayan, Rajaguru Dayalan, Vengattaraman Thirumal and Dhavachelvan Ponnurangam, "A multi-agent approach: To preserve user information privacy for a pervasive and ubiquitous environment", Egyptian Informatics Journal, vol. 16, pp. 151–166, 2015.

[2] Kaustubh Satpute, Charudatt Satpute and Dipti Bhade, “Review on Internet base Services of Cloud Computing”, International Journal on Recent and Innovation Trends in Computing and Communication, vol. 2, no. 2, 2014.

[3] Introduction to Cloud Computing Architecture by Sun Microsystems, Inc., June 2009.

[4] Chhaya S Dule, H.A. Girijamma and K.M Rajasekharaiah, "Privacy Preservation Enriched MapReduce for Hadoop Based BigData Applications", American International Journal of Research in Science, Technology, Engineering & Mathematics, vol. 6, no. 3, pp. 293-299, 2014.

[5] H. Takabi, J.B.D. Joshi, and G. Ahn, “Security and Privacy Challenges in Cloud Computing Environments”, IEEE Security & Privacy, vol. 8, no. 6, pp. 24-31, Nov./Dec. 2010.

[6] D. Zissis and D. Lekkas, “Addressing Cloud Computing Security Issues”, Future Generation Computer Systems, vol. 28, no. 3, pp. 583- 592, 2011.

[7] Divyakant Agrawal, Amr El Abbadi and Shiyuan Wang, "Secure and Privacy-Preserving Data Services in the Cloud: A Data Centric View", Proceedings of the VLDB Endowment, vol. 5, no. 12, 2012.

[8] S. Hemalatha and S. Alaudeen Basha, “Enabling for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud”, International Journal of Scientific and Research Publications, vol. 3, no. 10, 2013.

[9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137–150, 2004.

Mr. K. Sekar (Koneti Sekar) obtained his Bachelor's degree in Computer Science from Sri Venkateswara University. He then obtained his Master's degree from the University of Madras and is pursuing a Ph.D. at Sri Venkateswara University. He is currently an Associate Professor in the Department of Computer Science and Engineering, S.V. Engineering College for Women, Tirupati. His specializations include Software Engineering, Computer Programming, Computer Security, Computer Organization and Object Oriented Programming.

Prof. M. Padmavathamma (Mokkala Padmavathamma) was born in Chittoor District, A.P., India, in 1963. She received her M.Sc., M.Phil., M.Ed. and Ph.D. from S.V. University, Tirupathi, and her M.S. (Software Systems) from BITS Pilani. She is currently Head of the Department of Computer Science, S.V. University, Andhra Pradesh, India. Her research interests lie in the areas of number theory, cryptography, network security, distributed systems and data mining. She has published 35 research papers in national and international journals and conferences, and has co-authored two textbooks. She is also a life member of the Cryptology Research Society of India (CRSI) and the Andhra Pradesh Association Mathematical Teachers (APAMT).