A Model For Revelation Of Data Leakage In Data Distribution

(1)

A Model For Revelation Of Data Leakage In Data Distribution

Saranya.R

Assistant Professor, Department Of Computer Science and Engineering Lord Jegannath college of Engineering and Technology

Nagercoil, Tamil Nadu, India

Abstract

Leakage detection is the technique for handling secure data delivery. A data distributor has given sensitive data to a set of trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess how the data is leaked and who leaked the data. Data allocation strategies (across the agents) are proposed to improve the probability of identifying leakages. To improve the chances of detecting leaked data, “realistic but fake

“records was injected to the original data. The result can be extended to handle mixed cases, with some explicit and sample requests so that they can the handle agent request in online fashion.

Key Words: Allocation strategies, data leakage, data privacy, fake objects, leakage model and sensitive information.

1. Introduction

In the course of doing business, sometimes sensitive data must be handed over to supposedly trusted third parties.

For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. We call owner of the data, the distributor and the supposedly trusted third parties the agents. The goal of project is to detect when the distributor’s sensitive data has been leaked by agents, and show the probability for identifying the agent that leaked the data. We study unobtrusive techniques for detecting leakage of a set of objects or records. Specifically, we study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. At this point the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We develop a model for assessing the “guilt” of agents. We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leakier. Finally, we also consider the option of adding “Fake” objects to the distributed set.

In this paper, we proposed one model that can handle all the requests from customers and there is no limit on number of customers. The model gives the data allocation strategies featured with the forged objects injection proposed by [1] to improve the probability of identifying leakages, but they can accept request from only some number of customers. Also we study the application where there is a distributor, distributing and managing the files that contain sensitive information to users when they send request. The log is maintained for every request, which is later used to find overlapping with the leaked file set and the subjective risk assessment of guilt probability.

2. Previous work

The data leakage prevention based on the trustworthiness [2] is used to assess the trustiness of the customer. Maintaining the log of all customer’s request is related to the data provenance problem [3] i.e. tracing the lineage of objects. The data allocation strategy used is more relevant to the watermarking [4], [5] that is used as a means of establishing original ownership of distributed objects. There are also different mechanisms to allow only authorized users to access the sensitive information [6]

through access control policies, but these are restrictive and may make it impossible to satisfy agent’s requests.

YingweiCui et al [7] Suggested solutions are domain specific, such as lineage tracing for data warehouse and assume some prior knowledge on the way a data view is created out of data sources. We develop an approach to lineage tracing for general transformations that takes advantage of known structure or properties of transformations when present, yet provides tracing facilities in the absence of such information as well. Our tracing algorithms apply to single transformations, to linear sequences of transformations, and to arbitrary acyclic transformation graphs. We present optimizations that effectively reduce the storage and runtime overhead in the case of large transformation graphs. Our results can be used as the basis for an in-depth data warehouse analysis and debugging tool, by which analysts can browse their warehouse data, and then trace back to the

(2)

source data that produced warehouse data items of interest. Watermarking relational data were initial the need for watermarking database relations to determine their privacy, identify the unique characteristics of relational data which pose new challenges for watermarking, and provide desirable properties of a watermarking system for relational data. A watermark can be applied to any database relation having attributes which are such that changes in a few of their values do not affect the applications.

Recently [8], [9], and other works have also studied marks insertion to relational data. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver identifying information.

However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach watermarks to the distributed data are not applicable

3. Problem Setup 3.1 Problem definition

Suppose a distributor owns a set T = {t1, tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U₁, U₂ ,…U_n, but does wish the objects be leaked to other third parties. An agent U_i receives a subset of objects R_iwhich belongs to T, determined either by a sample request or an explicit request.

Sample Request R_i = SAMPLE (T, m_i ) : Any subset

of mi records from T can be given to Ui .

Explicit Request = EXPLICIT (T, cond_i ) : Agent U_i receives all the T objects that satisfy cond_i .

The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database. After giving objects to agents, the distributor discovers that a set L of T has leaked. This means that some third party called the target has been caught in possession of L. For example, this target may be displaying L on its web site, or perhaps as part of a legal discovery process, the target turned over L to the distributor. Since the agents U1, U2…Un , have some of the data, it is reasonable to suspect them leaking the data.

However, the agents can argue that they are innocent, and that the L data was obtained by the target through other means.

3.2 Guilty Agents

Let L denote the leaked data set that may be leaked intentionally or guessed by the target user. Since agent having some of the leaked data of L, may be

susceptible for leaking the data. But he may argue that he is innocent and that the L data were obtained by target through some other means.

Our goal is to assess the likelihood that the leaked data came from the agents as opposed to other resources. Let L denote the leaked data set that may be leaked intentionally or guessed by the target user. Since agent having some of the leaked data of L, may be susceptible for leaking the data. But he may argue that he is innocent and that the L data were obtained by target through some other means. Our goal is to assess the likelihood that the leaked data came from the agents as opposed to other resources.

For example, if one of the objects of L was given to only agent U₁, we may suspect U₁ more. So probability that agent A₁ is guilty for leaking data set L is denoted as Pr {G_i | L} For example if one of the object of L was given to only agent U₁, we may suspect U₁ more. So probability that agent U₁ is guilty for leaking data set L is denoted as Pr {G_i | L}.

4. Guilt Probability

Suppose an agent is guilty if it contributes one or more objects to the target. The event that agent is guilty for a given leaked set S is denoted by Gi| L. The next step is to estimate Pr {G_i| L}, i.e., the probability that agent is guilty given evidence S. To compute the Pr { G_i| L }, estimate the probability that values in S can be “guessed”

by the target. For instance, say some of the objects in t are emails of individuals. For the sake of simplicity our model relies on two assumptions:

Assumption 1: For all t₁, t₂, ……….., tn ∊L and t1≠ t2 , the provenance of t₁is independent of t₂.

Assumption 2: Tuple t∊L can only be obtained by third user in one of the two ways:

1. Single user U₁ leaked t or

2. Third user guessed t with the help of other resources.

Now to compute the guilt probability that he leaks a single object t to L, we define a set of users. To find the probability that an agent U_iis guilty for the given set L, consider the target guessed t₁with probability p and that agent leaks t₁ to L with probability 1-p. First compute the probability that he leaks a single object to L. To compute this, define the set of agents U_t= { U_i| t∊ R_i} that have t in their data sets. Then using Assumption 2 and known probability p,we have,

Pr {Some agent leaked t to L=1-P ……...1.1 Assuming that all agents that belongs to U_t can leak t to L with equal probability and using Assumption 2 we get,

(3)

Pr {some agent leaked t to L} =

0

otherwise ...…1.2

Given that user U_iis guilty if he leaks at least one value to L, with assumption 1 and equation 2, we can compute the probability Pr {G_i | L} that user A_iis guilty :

Pr {Gi/L} = ……...1.3

5. Data allocation problem

The distributor gives the data to agents such that he can easily detect the guilty agent in case of leakage of data. To improve the chances of detecting guilty agent, he injects fake objects into the distributed dataset. These fake objects are created in such a manner that, agent cannot distinguish it from original objects. One can maintain the separate dataset of fake objects or can create it on demand. In this paper we have used the dataset of fake tuples.

Depending upon the addition of fake tuples into the agent’s request, data allocation problem is divided into four cases as:

i. Explicit request with fake tuples ii. Explicit request without fake tuples iii. Implicit request with fake tuples iv. Implicit request without fake tuples

For example, distributor sends the tuples to agents U₁ and U2 as R1= {t1, t2} and R2= { t1}. If the leaked dataset is L={ t₁}, then agent U₂appears more guilty than U₁. So to minimize the overlap, we insert the fake objects in to one of the agent’s dataset.

6. Optimization problem

The distributor’s data allocation to agents has one constraint and one objective. The distributor’s constraint is to satisfy agents’ requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. His objective is to be able to detect an agent who leaks any portion of his data.We consider the constraint as strict.

The distributor may not deny serving an agent request and may not provide agents with different perturbed versions of the same objects. The fake object distribution as the only possible constraint relaxation.

The objective is to maximize the chances of detecting a guilty agent that leaks all his data objects. The

Pr {G_j |L = R_i} or simply Pr {G_j |R_i } is the probability that agent is guilty if the distributor discovers a leaked table S that contains all objects. The difference functions Δ ( i, j ) is defined as:

Δ ( i, j ) = Pr {G_i |R_i } – Pr {G_j |R_i } ……. 1.4 1) Problem definition: Let the distributor have data requests from n agents. The distributor wants to give tables R_i, ….Rn. to agents U₁ , . . .U_n , respectively, so that Distribution satisfies agents’ requests; and Maximizes the guilt probability differences Δ ( i, j ) for all i, j = 1. . . n and i = j.

Assuming that the sets satisfy the agents’ requests, we can express the problem as a multi-criterion

2) Optimization problem:

Maximize (. . . , Δ ( i, j ), . . . ) i ! = j …… 1.5 (over R₁, . . . R_n, )

The approximation [3] of objective of the above equation does not depend on agent’s probabilities and therefore minimize the relative overlap among the agents as

………..1.6 (over R₁ , . . . ,R_n ).

This approximation is valid if minimizing the relative overlap maximizes Δ (i, j).

7. Allocation strategies:

In this section the allocation strategies [10] solve exactly or approximately the scalar versions of Equation 1.6 for the different instances. In Section 7.1 deals with problems with explicit data requests and in Section 7.2 deals with problems with sample data requests.

7.1 Explicit Data request

In case of explicit data request with fake not allowed, the distributor is not allowed to add fake objects to the distributed data. So Data allocation is fully defined by the agent’s data request. In case of explicit data request with fake allowed, the distributor cannot remove or alter the requests R from the agent. However distributor can add the fake object. In algorithm for data allocation for explicit request, the input to this is a set of request R₁,R₂ ,……Rn, from n agents and different conditions for requests. The e-optimal algorithm finds the agents that are eligible to receiving fake objects. Then create one fake object in iteration and allocate it to the agent selected. The e-optimal algorithm minimizes every term of the objective summation by adding maximum number of fake objects to every set R_i yielding optimal solution.

(4)

Step 1: Calculate total fake records as sum of fake records allowed.

Step 2: While total fake objects > 0

Step 3: Select agent that will yield the greatest Improvement in the sum objective

i.e.

Step 4: Create fake record

Step 5: Add this fake record to the agent and also to fake record set.

Step 6: Decrement fake record from total fake record set.

Algorithm makes a greedy choice by selecting the agent that will yield the greatest improvement in the sum-objective.

7.2 Sample Data request

With sample data requests, each agent may receive any T from a subset out of different ones.

Hence, there are different allocations. In every allocation, the distributor can permute T objects and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. Therefore, from the distributor’s perspective there are different locations. An object allocation that satisfies requests and ignores the distributor’s objective is to give each agent a unique subset of T of size m. The s-max algorithm allocates to an agent the data record that yields the minimum increase of the maximum relative overlap among any pair of agents. The s-max algorithm is as follows.

Step 1: Initialize Min_overlap ← 1, the minimum out of the maximum relative overlaps that the allocations of different objects to U_i

Step 2: for k ∈ {k | ∈ } do

Initialize max_rel_ov ← 0, the maximum relative overlap between R_i and any set R_j that the allocation of t_kto U_i Step 3: for all j = 1,..., n : j = i and t_k∈ R_j do Calculate absolute overlap as

abs_ov ← |Ri ∩R_j | + 1 Calculate relative overlap as rel_ov ← abs_ov / min (mi , m_j ) Step 4: Find maximum relative as

max_rel_ov ← MAX (max_rel_ov, rel_ov) If max_rel_ov ≤ min_overlap then

min_overlap ← max_rel_ov

ret_k ← k Return ret_k

It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective in problems where M ≤ |T| and n < |T|. It is also optimal for the maxobjective if |T| ≤ M ≤ 2 |T| or all agents request data of the same size.

It is observed that the relative performance of algorithm and main conclusion do not change. If p approaches to 0, it becomes easier to find guilty agents and algorithm performance converges. On the other hand, if p approaches 1, the relative differences among algorithms grow since more evidence is needed to find an agent guilty. The algorithm presented implements a variety of data distribution strategies that can improve the distributor’s chances of identifying a leaker. It is shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the data that agents must receive.

8. Conclusion

In doing a business there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world we could watermark each object so that we could trace its origins with absolute certainty. Data leakage is a silent type of threat. Your employee as an insider can intentionally or accidentally leak sensitive information. This sensitive information can be electronically distributed via e-mail, Web sites, FTP, instant messaging, spreadsheets, databases, and any other electronic means available – all without your knowledge.

To assess the risk of distributing data two things are important, where first one is data allocation strategy that helps to distribute the tuples among customers with minimum overlap and second one is calculating guilt probability which is based on overlapping of his data set with the leaked data set.

.

References

[1] R. Sion, M. Atallah, and S. Prabhakar, “Rights Protection for Relational Data,” Proc. ACM SIGMOD, pp. 98-109, 2003.

[2] YIN Fan, WANG Yu, WANG Lina, Yu Rongwei. A Trustworthiness-Based Distribution Model for Data Leakage Detection: Wuhan University Journal Of Natural Sciences.

(5)

[3] P. Buneman, S. Khanna and W.C. Tan. Why and where: A characterization of data provenance. ICDT 2001, 8^th International Conference, London, UK, January4-6, 2001, Proceedings, volume 1973 of Lecture Notes in Computer Science, Springer, 2001.

[4] S. Czerwinski, R. Fromm, and T. Hodes. Digital music distribution and audio watermarking.

[5] Rakesh Agrawal, Jerry Kiernan. Watermarking Relational Databases// IBM Almaden Research Center.

[6] S. Jajodia, P. Samarati, M. L. Sapino, and V. S.

Subrahmanian. Flexible support for multiple access control policies. ACM Trans. Dataset

[7] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse Transformations,” The VLDB J., vol. 12, pp. 41-58, 2003

[8] F. Guo, J. Wang, Z. Zhang, X. Ye, and D. Li, “An Improved Algorithm to Watermark Numeric Relational Data,” Information Security Applications, pp. 138-149, Springer, 2006.

[9] Y. Li, V. Swarup, and S. Jajodia, “Fingerprinting Relational Databases: Schemes and Specialties,” IEEE Trans. Dependable and Secure Computing, vol. 2, no. 1, pp. 34-45, Jan.-Mar. 2005.

[10] P. Papadimitriou and H. Garcia-Molina, “Data leakage Detection,” technical report, Stanford Univ., 2011.