Information Flows Security in Big Data


Tsau Young (T. Y.) Lin

Institute of Data Science and Computing

and

Department of Computer Science

San Jose State University

San Jose, CA 95192

Abstract

This project focuses on an ancient issue: information flow security in DAC (discretionary access control). DAC is perhaps the most interesting, or rather the most paradoxical, security mechanism. On the one hand, DAC is the most popular commercial security mechanism (so is it therefore very secure?). On the other hand, it is probably the most seriously criticized concept: it was declared "fundamentally and totally flawed" almost 30 years ago by Boebert et al. (1985, 1988), when they successfully illustrated a Trojan Horse attack. Essentially the same critique has continued: "This is why DAC is unable to enforce information flow controls, particularly with respect to Trojan Horses" (Osborn et al., 2000); DAC "cannot prevent the illegitimate use of data, once access is granted" (Vachharajani et al., 2004). Some authors even add information propagation by e-mail into the equation, or simply over-criticize by saying "Once it's out, it's out."

We should point out that this "information propagation" or "release of sensitive information" occurs under DAC.

Goal: The essence of DAC will be clearly formalized and its security issues will be cleanly stated. Strategies for incremental solutions will be designed, validated, and verified, and the final solution will be clearly presented. What do we mean by a solution? All possible information flows will be mathematically analyzed (observe that the solution is beyond Turing machines (HRU, 1975) but mathematically expressible; see the initial solutions of 2003 and 2009).

I. INTRODUCTION

"Big Data is everywhere"; "Big information is floating on the cloud". Of course, terms such as "floating" and "cloud" are metaphors. The actual message is that massive data is readily available and that information exchanges are very common. Translated to the implementation level, these words imply that actual data are stored in very large data centers and that data flow within those centers. Naturally, security and privacy have become one of the top issues in big data, and hence in data centers.

Data centers have no "standard" specification yet, so our mathematical model is an abstract UNIX system with infinitely many users (hence infinite data). Mathematically, "infinite users" models a system in which unlimited additions of new users (and hence new data) are permitted. It is somewhat surprising that, based on such a model, we can obtain a good mathematical characterization of information flow secure systems with infinite users. To understand the problems and issues, let us recall some "ancient" critiques:

1) DAC (the discretionary access control model) was declared "fundamentally and totally flawed" when Boebert et al. (1985, 1988) successfully demonstrated a Trojan Horse attack on a system protected by DAC.

2) "This is why DAC is unable to enforce information flow controls, particularly with respect to Trojan Horses" (Osborn et al., 2000).

3) DAC "cannot prevent the illegitimate use of data, once access is granted" (e.g., Vachharajani et al., 2004).

Based on our abstract protection system of UNIX, we can mathematically enumerate all possible DAC systems. For 10 users, the total number of distinct DAC systems is $2^{100}$. Among them, there are 8977053873043 (almost 9 trillion) secure DAC systems.

Though 9 trillion is a large number, it is very small compared with the number of all possibilities. Namely, the probability of randomly selecting a secure system from all possible systems is near zero. Therefore the probability of Boebert et al. being wrong (about their critique) is also near zero.

The results in this paper are not isolated instances. The theory behind these results is an emerging technology called granular computing, which brings the methodology of "continuous mathematics" into the discrete world of computer science [13].

II. TROJAN HORSE ATTACKS IN UNIX SYSTEMS

In this section, we recast the idea of the Trojan Horse attack in a university setting.

[1] Trojan Horse attack in a UNIX system:

In ([2], p. 150), Boebert et al. (1988) explicitly stated:

1) B-invisibility: "The first element of Trojan Horse attack: Agents in the form of programs whose actions are visible indirectly, if at all"; for convenience, we name this element B-invisibility.


These two elements are very general, so it is easy to see that many systems, including UNIX, meet the two elements specified above. For us the important point is that a Trojan Horse attack can be set up in UNIX systems.

[2] The set of actors (users):

Two actors, DRAKE and DOE, were used by Boebert et al. We use the same names to reflect similar roles.

1) DRAKE, very fluent in UNIX systems, is a student, denoted by S, taking the class on discrete mathematics. In [2], DRAKE was a hacker; this new S has a similar personality.

2) DOE, a student of mathematics, is a teaching assistant, denoted by T, of the class. In [2], DOE faithfully carried out his duty of protecting important information; here this new T plays the same role, protecting the exam questions.

a) An important duty of the new DOE T is to edit all the exams.

b) T has signed a non-disclosure contract.

3) Implicitly, there is a new actor C, a Professor of Computer Science, who teaches the subject.

[3] The protection system is the DAC of a UNIX system:

The protection mechanism used in [2] was the access control list (ACL), which was adopted in MULTICS, a predecessor of UNIX. UNIX systems basically have an equivalent mechanism, called DAC (Discretionary Access Control), that is embedded in the "owner-group-other" bits; see Tanenbaum [16], p. 293, second paragraph. There are two types of Direct Information Flows (DIF) in UNIX systems.

1) The "read" DIF: The data owner X permits any member of his group F(X) to "read" X's files. This DIF is known to the data owner, since he is the one who set up the "group" bits.

2) The "write" DIF: By contrast, the user (data owner), in our example the new DOE T, is not aware of the information flow executed by himself; observe that this "write" permission is not set up by the data owner T himself. The new DOE T is unaware that he is in the group of the new DRAKE S and is permitted by S to write his data into S's account (a set of files owned by S).

• This "write" channel is hidden from T, yet owned and used by T. The "new" DAC is to make it visible to T.

[4] Setting up a Trojan Horse attack:

1) As we have noted, this new DRAKE S is very fluent in UNIX systems; this fact is well known among students.

2) On an appropriate occasion, S offers his help to the new DOE T; namely, S will set up a word processor for T.

3) During this setup, (1) S includes T in his group F(S) and (2) allows T to "write" T's data into a file owned by S (in [2], this file is called the "Back-Pocket" file). Observe that the UNIX system does not inform T of this new arrangement. This is the B-invisibility of the Trojan Horse stated in Item [1].

4) T's new word processor (a gift from S) produces a new DIF that sends T's data to S. T himself does not know about this. This tells us that the non-disclosure contract stated in Item [2] is useless.

a) The "save" command in the word processor has the normal function of saving T's data (exam problems) into a file in T's account;

b) However, the extra "write" command creates a DIF (unknown to T), which sends T's data into a file in S's account; this is the channel of the Trojan Horse attack.

We hope this explanation has clearly pointed out the weakness of UNIX protection systems: the "write" permission has unexpected implications.

III. THE PROTECTION SYSTEMS FOR BIG DATA

Let us illustrate the important features of the protection systems.

Example 1: The concept of group in "owner-group-other" bit patterns.

We will illustrate the concept of "group" using the example of the Trojan Horse attack in Section II. Let $U = \{C, T, S\}$ be the set of three users, where C is the computer science professor, T is the teaching assistant DOE, and S is the student DRAKE. For each user X, UNIX associates a set $F(X)$ of users, which is the "group" of users referred to in the "owner-group-other" bit patterns. We denote this association by a map $F : U \to 2^U$, which is defined pointwise as follows:


1) $C \mapsto \{C, T\} = F(C)$

2) $T \mapsto \{T\} = F(T)$

3) $S \mapsto \{S, T\} = F(S)$

Note that $F(X)$ may not include X; see File4 in Example 2.

The concept of group (friends) is important, so we give a formal definition.

Definition 1: The Concept of Group in Big Data

Let U be the set of users, possibly infinite. Let $F : U \to 2^U$ be a map, called the Friend-list, that associates with each user X a group of friends $F(X)$, called Friend-list(X) at X. $F(X)$ is the maximal set of users, selected by X, each member of which is permitted by X to read, write, and execute X's files. The "complement" $E(X) = U - F(X)$ is called the Enemy-list(X) at X. The concept of group reflects the topological structure of U; each member Y of $F(X)$ is "sociologically near" X. Topology models the concept of "near" [?]; "nearness" reflects a "large amount of communication". We are applying an emerging technology called granular computing [9], [8], [10], [11], [12], [13].
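Definition 1 can be made concrete for the three users of Example 1. The following is a minimal sketch, assuming users are represented as strings and the Friend-list as a dictionary of sets; the function name `enemy_list` is hypothetical and not part of the paper.

```python
# Hypothetical encoding of Example 1: users are strings, F is a dict of sets.
U = {"C", "T", "S"}

# Friend-list F : U -> 2^U, exactly as given in Example 1.
F = {
    "C": {"C", "T"},
    "T": {"T"},
    "S": {"S", "T"},
}


def enemy_list(F, U):
    """Return the map E with E(X) = U - F(X) for every user X."""
    return {x: U - F[x] for x in U}


E = enemy_list(F, U)
for x in sorted(U):
    print(x, "friends:", sorted(F[x]), "enemies:", sorted(E[x]))
```

In this encoding the student S is the only enemy of the professor C, while the teaching assistant T lists both C and S as enemies.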

Example 2: The concept of ACL (access control list)

We examine this concept, since Boebert et al. used it in constructing the Trojan Horse. Let us take some examples directly from Tanenbaum (1987) [16], p. 293:

File0: (Jan, *, rwx)

File2: (Jan, *, rw-), (Els, staff, r--), (Maaike, *, r--)

File4: (Jelle, *, ---), (*, student, r--)

where Jan, Els, Maaike, and Jelle are user names, * is a wildcard, and so on.

For each file, we associate a list of pairs (X, F(X)) that may access the file, together with how they may access it. Such a list is called an ACL. In UNIX systems, ACLs are compressed into three bits (rwx) each for the owner, the owner's group, and the others, that is, 9 bits per file.

Definition 2: Owner-Group-Other Bits in Big Data

Let U be the set of users, possibly infinite. Let $\mathcal{F}$ be the set of all data (files). Let $F : U \to 2^U$ be the Friend-list. Let $E : U \to 2^U$, defined by $E(X) = U - F(X)$, be the Enemy-list. A Bit-ACL is a map:

$M : \mathcal{F} \to \{(O_r O_w O_x,\ F_r F_w F_x,\ E_r E_w E_x) \mid X\}$

that marks each file with 9 bits that represent how the file is accessed, where $O_r O_w O_x$ are for the owner X, $F_r F_w F_x$ are for the Friend-list $F(X)$, and $E_r E_w E_x$ are for the Enemy-list $E(X)$. This is the mechanism that allows X to modify his initial selections.
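The 9-bit marking of Definition 2 can be sketched as follows. This is only an illustration under stated assumptions: the class `BitACL`, the function `allowed`, and the sample file `exam` (owned by C) are hypothetical names introduced here, and the rule that the owner triple takes precedence over the friend triple is an assumption borrowed from the usual UNIX convention.

```python
from dataclasses import dataclass

# Friend-list from Example 1 (users C, T, S).
F = {"C": {"C", "T"}, "T": {"T"}, "S": {"S", "T"}}


@dataclass(frozen=True)
class BitACL:
    """Hypothetical 9-bit marking of one file: three rwx triples."""
    owner: str
    owner_bits: str   # O_r O_w O_x, e.g. "rwx"
    friend_bits: str  # F_r F_w F_x, e.g. "rw-"
    enemy_bits: str   # E_r E_w E_x, e.g. "---"


def allowed(acl, user, mode, friend_list):
    """Return True if `user` may access the file in `mode` ('r', 'w' or 'x')."""
    if mode not in ("r", "w", "x"):
        raise ValueError("mode must be 'r', 'w' or 'x'")
    if user == acl.owner:                   # owner triple checked first (assumption)
        bits = acl.owner_bits
    elif user in friend_list[acl.owner]:    # user is in F(owner)
        bits = acl.friend_bits
    else:                                   # user is in E(owner)
        bits = acl.enemy_bits
    return mode in bits


# Hypothetical exam file owned by the professor C.
exam = BitACL(owner="C", owner_bits="rwx", friend_bits="rw-", enemy_bits="---")
print(allowed(exam, "T", "w", F))  # T is in F(C)
print(allowed(exam, "S", "r", F))  # S is in E(C)
```

With these bits, T (a friend of C) may read and write the file, while S (an enemy of C) has no direct access to it.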

The group structure among users reflects the communication structure among users; "Owner-Group-Other" reflects the directions of data movement. Group and Bits are important features of DAC that shape the flows of information.

IV. DIRECT INFORMATION FLOWS IN BIG DATA

Information flows are data moving between two users. At the extension level, the concern is the actual data movement; between two users some data may move and some may not. Nevertheless, in this case we do know that there is a communication channel between the two users and that information does flow between them; we say that this level of concern is the intension level. In this section, we study information flows at this level.

The concept of DIF (direct information flow) reflects information flows at the intension level. It will be expressed by binary relations on users. Recall that a binary relation on U is, by definition, a subset of $U \times U$.

The following two DIFs were used in the Trojan Horse attack of Boebert et al.

Definition 3: Read-DIF

Let U be the set of users, possibly infinite, and let $F : U \to 2^U$ be the Friend-list. Let $F_R$ be a binary relation that mathematically expresses the direct information flow: X's data are read by Y and saved in Y's account. Here Y's account is, by definition, the set of files that are owned by Y, and "data" refers to a file or files.

$F_R = \{(X, Y) \mid X \in U \text{ and } Y \in F(X)\}$

$F_R$ is called a Read-DIF; the data is accessed by a friend.


Definition 4: Write-DIF

As above, let $F_W$ be a binary relation that expresses the concept that X is permitted by Y to write his data into Y's account.

$F_W = \{(X, Y) \mid Y \in U \text{ and } X \in F(Y)\}$

$F_W$ is called a Write-DIF; note the direction: X writes his data into Y's account. This is where the Trojan Horse of Boebert et al. is embedded; X is unaware of his own action.

Definition 5: Total-DIF

The binary relation that combines $F_W$ and $F_R$ is:

$F_T = F_R \cup F_W$

The binary relation $F_T$ is called Total-DIF.

Observe that the union of two binary relations is a binary relation, since the union of two subsets of U×U is still a subset of U×U.
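Definitions 3 to 5 can be evaluated directly on the Friend-list of Example 1. The sketch below assumes a relation is stored as a Python set of ordered pairs; the function names `read_dif` and `write_dif` are hypothetical.

```python
# Friend-list from Example 1 (users C, T, S).
F = {"C": {"C", "T"}, "T": {"T"}, "S": {"S", "T"}}


def read_dif(F):
    """F_R = {(X, Y) | Y in F(X)}: X's data may be read by Y."""
    return {(x, y) for x in F for y in F[x]}


def write_dif(F):
    """F_W = {(X, Y) | X in F(Y)}: X may write his data into Y's account."""
    return {(x, y) for y in F for x in F[y]}


FR = read_dif(F)
FW = write_dif(F)
FT = FR | FW  # Total-DIF (Definition 5)

print("FR:", sorted(FR))
print("FW:", sorted(FW))
print("FT:", sorted(FT))
```

The pair (T, S) appears in `FW` but not in `FR`: it is the hidden "write" channel of Section II, created because S placed T in F(S). Note also that, with the definitions exactly as written, `FW` is the converse of `FR`, so an `FT` built this way from a Friend-list is symmetric.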

Definition 6: The Friend-list of Total-DIF

Let $X \in U$ be a user. The Friend-list of Total-DIF at X will be called the Total Friend of X, so we define a map $TF : U \to 2^U$, where

$TF(X) = \{Y \mid (X, Y) \in F_T\}$.

Such a Total Friend-list $TF(X)$ is called the (right) binary neighborhood of X in the terminology of granular computing [10], [11].

Definition 7: The Enemy-list of Total-DIF

Let $p \in U$ be a user. The Enemy-list of Total-DIF at p will be called the Enemy of total friend, and denoted by $E_{TF}(p)$:

$E_{TF}(p) = U - \{Y \mid (p, Y) \in F_T\}$

Proposition 1: There are $2^{n^2}$ distinct Total-DIF $F_T$ for a given U whose cardinal number is n.

V. INFORMATION FLOW SECURITY IN BIG DAC

First, let us recall some ancient concepts.

Data Owner Centric View of Information Flow Security:

1) The data owner has discretionary authority over who can or cannot access his data.

2) Security from the data owner's view: A direct information flow is said to be p-secure if p's data, say D, can never flow into p's Enemy-list $E_T(p) \equiv U - F_T(p)$.

Next, we will formalize this view. Up to now, the term DAC has been used vaguely to refer to a mathematical model that captures the protection mechanism of UNIX systems. For our purpose, we only need the intension-level information flow part of DAC (the Discretionary Access Control model). For example, in [6] there is a matrix of Subjects × Objects; since we are interested in the intension level of information flow (the maximal possible data flows between users), we compose two of their matrices into one Subjects × Subjects matrix, which is essentially our $F_T$:

Definition 8: A DAC is a 3-tuple $(U, F, F_T)$, where

1) U is a set of users, possibly infinite.

2) $F : U \to 2^U$ assigns a Friend-list $F(X)$ to each user X. However, we are actually interested in TF, a revised Friend-list called the Total-Friend-List: for each X, $TF(X)$ consists of $F(X)$ (those who may access X's data) and those Y whose $F(Y)$ includes X (these Y regard X as their friend). That is, $TF(X) = \{Y \mid (X, Y) \in F_T\}$.

• In current UNIX, X may be unaware that he can "write" data into Y's account. In this new model, X knows exactly who can read his data, and he also knows whom he may write into.

3) $F_T$ is a binary relation on U that captures all the information flows (actual data movement) among users. Information flows can be induced by read access and by write access; we have combined the two types of information flows. $F_T$ is the binary relation of total information flows defined in Definition 5 of the last section.

Note that two distinct F on the same U may generate the same $F_T$, so we define:

Definition 9: Given two DACs, $DAC_i = (U_i, F_i, F_{T,i})$, $i = 1, 2$, we define $DAC_1 = DAC_2$ if and only if $U_1 = U_2$ and $F_{T,1} = F_{T,2}$.

Let us introduce a technical term.

Definition 10: Information Trajectory

Let $T(p) = \{F_T^i(p) \mid i = 0, 1, 2, \ldots\}$, where $F_T^0(p) = p$ and $F_T^i(p) \equiv F_T(F_T^{i-1}(p))$. Such a $T(p)$ is called the trajectory of p. Informally, $T(p)$ is a set (though it is expressed as a sequence): $p, F_T(p), F_T(F_T(p)), F_T(F_T(F_T(p))), \ldots$.

Definition 11: Information Flow Security in a DAC

1) Individual Security: Let $p \in U$. A DAC is said to be p-secure if and only if

$E_T(p) \cap T(p) = \emptyset$.

2) Information Flow Security: A DAC is said to be Information Flow Secure if the DAC is p-secure for every user p.

Theorem 1: Secure Information Flows

1) A DAC is information flow secure if and only if the binary relation $F_T$ (Definition 5) is reflexive and transitive.

Mathematicians know a lot about reflexive and transitive binary relations on finite or infinite sets; the relevant facts are summarized in Lemma 1 in the Appendix. From the lemma we obtain the following corollaries:

2) A DAC is information flow secure if and only if $F_T$ defines an Alexandroff space on U.

3) A finite DAC is information flow secure if and only if $F_T$ defines a finite topological space on U.
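For a finite U, Definitions 10 and 11 and item 1 of Theorem 1 can be checked mechanically. The sketch below assumes that $F_T$ applied to a set of users means the union of the neighborhoods of its members, so that the trajectory is the set of users reachable from p; all function names are hypothetical. The relation `FT_example` is the Total-DIF obtained from the Friend-list of Example 1.

```python
def neighborhood(FT, p):
    """F_T(p) = {Y | (p, Y) in F_T}."""
    return {y for (x, y) in FT if x == p}


def trajectory(FT, p):
    """T(p): union of p, F_T(p), F_T(F_T(p)), ... for a finite relation."""
    seen = {p}
    frontier = {p}
    while frontier:
        reached = set()
        for q in frontier:
            reached |= neighborhood(FT, q)
        frontier = reached - seen   # only users not visited before
        seen |= frontier
    return seen


def is_p_secure(U, FT, p):
    """Definition 11(1): E_T(p) and T(p) are disjoint."""
    enemies = U - neighborhood(FT, p)
    return not (enemies & trajectory(FT, p))


def is_flow_secure(U, FT):
    """Definition 11(2): p-secure for every user p."""
    return all(is_p_secure(U, FT, p) for p in U)


def is_reflexive_transitive(U, FT):
    reflexive = all((x, x) in FT for x in U)
    transitive = all((x, z) in FT
                     for (x, y) in FT for (w, z) in FT if y == w)
    return reflexive and transitive


U = {"C", "T", "S"}
# Total-DIF generated by the Friend-list of Example 1.
FT_example = {("C", "C"), ("C", "T"), ("T", "C"), ("T", "T"),
              ("S", "S"), ("S", "T"), ("T", "S")}

print(sorted(trajectory(FT_example, "C")))           # users C's data can reach
print(is_flow_secure(U, FT_example))                 # security per Definition 11
print(is_reflexive_transitive(U, FT_example))        # criterion of Theorem 1
```

In this example C's data can reach S through T, although S lies in C's Enemy-list, so the DAC of the Trojan Horse scenario is not C-secure; correspondingly, `FT_example` contains (C, T) and (T, S) but not (C, S), so it is not transitive.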

Example 3: For n = 10, by Proposition 1 and Definition 9, we have $2^{100}$ distinct DACs. Among them there are 8977053873043 (almost 9 trillion) secure DACs! How is this number computed? From Theorem 1, item 3, the number of secure DACs is the number of topological spaces. From [5], [18], the number of topological spaces (for n = 10) is the number cited above. Comparing $2^{100}$ with 9 trillion, we can say that the probability of selecting a secure system at random is near zero.
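The counting argument of Example 3 can be reproduced by brute force for very small n. The following sketch counts, as Proposition 1 does, over all binary relations on an n-element set; the function name `count_reflexive_transitive` is hypothetical, and the enumeration is only feasible for n up to about 4. The value for n = 10 is not computed here; it is simply the number quoted in Example 3.

```python
from itertools import product


def count_reflexive_transitive(n):
    """Count reflexive and transitive relations on {0, ..., n-1} by brute force."""
    points = range(n)
    off_diagonal = [(i, j) for i in points for j in points if i != j]
    count = 0
    # Only reflexive relations can qualify, so the diagonal is always included.
    for bits in product((False, True), repeat=len(off_diagonal)):
        rel = {(i, i) for i in points}
        rel.update(pair for pair, keep in zip(off_diagonal, bits) if keep)
        if all((x, z) in rel for (x, y) in rel for (w, z) in rel if y == w):
            count += 1
    return count


for n in range(1, 5):
    total = 2 ** (n * n)                     # all relations (Proposition 1)
    secure = count_reflexive_transitive(n)   # secure ones (Theorem 1)
    print(n, total, secure, secure / total)

SECURE_10 = 8977053873043                    # value quoted in Example 3
ratio_10 = SECURE_10 / 2 ** 100
print(ratio_10)
```

For n = 1, 2, 3, 4 the counts are 1, 4, 29, and 355 out of 2, 16, 512, and 65536 relations, so the secure fraction already shrinks quickly; for n = 10 the quoted numbers give a ratio of roughly 7e-18.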

Therefore the probability that Boebert et al. are wrong (about their critique) is near zero.

VI. CONCLUSIONS

1) The Analysis of Static DAC.

From Example 3, for 10 users there are almost 9 trillion secure DACs, so we conclude that there are sufficiently many secure systems for data centers.

2) Analysis of Dynamic DAC.

The next question is: what happens if we allow updates to a DAC? For example, Harrison et al. (1976) identified six primitive operations that can be used as a base to model any protection system. The DAC for such a protection system should be updatable. Purely from the counting in the example discussed, the probability of updating a secure system into another secure system is near zero. Since we have characterized all secure DACs, we should be able to pre-program the updating operations so that the system jumps from one secure system to another secure system.

Though the analysis is of static DAC, it can be a good base for dynamic analysis.

3) Comparison with Earlier Models

a) Mandatory Access Control.

The users in a military system are labeled by elements of a lattice, so the set of users is mapped into a subset of the lattice. By pulling back the partial ordering of the lattice, the set of users in MAC carries a reflexive and transitive binary relation. So such systems are examples of Theorem 1.

b) Chinese Wall Security Policy Model.

In [3], Brewer and Nash proposed the Chinese Wall Security Policy Model. The idea was excellent; however, the model was incorrect (Brewer and Nash agreed with my comment at the conference, which led to a revised proposal [?]). However, showing that an information flow secure model can be built on the Chinese Wall security policy had to wait until (Lin, 2003) [12]. The Chinese Wall security policy starts from a given binary relation, called conflict of interests, on U, a set of companies. Could we build an information flow secure system for them?

that consists of all the companies Y that are in conflict with X. Could we build an information flow secure system?

consists of all the users who have the same set of conflicting users. This is also a good example of Theorem 1.

4) The direct information flows in this paper are intension flows: the maximal possible data flows. The actual data flows will have all possible sub-flows and will be colorful.

VII. APPENDIX

Lemma 1: [5], [18].

1) A binary relation B on U is reflexive and transitive if and only if B defines an Alexandroff space on U.

2) A binary relation B on a finite U is reflexive and transitive if and only if B defines a finite topological space on U.

3) For a finite set U, the topologies on U are in bijective correspondence with the reflexive and transitive relations on U.

4) A finite topological space is an Alexandroff topological space.
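The correspondence in item 3 of Lemma 1 can be illustrated on a small finite set. The sketch below uses one standard construction, which is an assumption here since the paper does not spell it out: a subset A is open when it is closed under B (x in A and (x, y) in B imply y in A), and the relation is recovered by declaring (x, y) related when y lies in every open set containing x. All function names and the sample relation `B` are hypothetical.

```python
from itertools import combinations


def powerset(U):
    """All subsets of U as frozensets."""
    items = sorted(U)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def open_sets(U, B):
    """Subsets A closed under B: x in A and (x, y) in B imply y in A."""
    return {A for A in powerset(U)
            if all(y in A for (x, y) in B if x in A)}


def is_topology(U, opens):
    """Finite case: contains the empty set and U, closed under pairwise union and intersection."""
    if frozenset() not in opens or frozenset(U) not in opens:
        return False
    return all((A | C) in opens and (A & C) in opens
               for A in opens for C in opens)


def relation_from_topology(U, opens):
    """(x, y) is related iff y lies in every open set that contains x."""
    return {(x, y) for x in U for y in U
            if all(y in A for A in opens if x in A)}


U = {"C", "T", "S"}
# A sample reflexive and transitive relation: C and T share data, S is isolated.
B = {("C", "C"), ("T", "T"), ("S", "S"), ("C", "T"), ("T", "C")}

opens = open_sets(U, B)
print(sorted(sorted(A) for A in opens))
print(is_topology(U, opens))
print(relation_from_topology(U, opens) == B)
```

Here the open sets are the empty set, {S}, {C, T}, and U, and the original relation is recovered from them, which is the round trip that the bijection in item 3 describes.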

A. Proof of Theorem 1

1) Assume $F_T$ is reflexive and transitive; we will prove that the Total-DIF is information flow p-secure for every p. Suppose p's data, say $D_p$, are allowed by Total-DIF to flow into $q \in F_T(p)$, that is, $(p, q) \in F_T$. Also suppose q's data, say D, are allowed by Total-DIF to flow into $x \in F_T(q)$, that is, $(q, x) \in F_T$. By transitivity of $F_T$, we conclude that $(p, x) \in F_T$. That is, $F_T(q) \subseteq F_T(p)$, so if D flows again, it will still be in $F_T(p)$, which is disjoint from $E_T(p)$; the argument is valid for every p.

2) Conversely, assume the Total-DIF is information flow p-secure, that is, $E_T(p) \cap T(p) = \emptyset$. So $T(p) \subseteq F_T(p) \equiv U - E_T(p)$. Since $T(p)$ is the reflexive-transitive closure of $F_T$ at p, $F_T$ is reflexive and transitive; the argument is valid for every p. QED.

REFERENCES

[1] W. E. Boebert, R. Y. Kain, W. D. Young, and S. A. Hansohn, Secure Ada Target: Issues, System Design, and Verification, Proceedings of the 1985 Symposium on Security and Privacy, April 1985.

[2] W. E. Boebert, R. Y. Kain, and W. D. Young, Secure Computing: The Secure Ada Target Approach, in R. Turn (ed.), Advances in Computer System Security, 1988, pp. 149-165. (Statement: the workings of a Trojan Horse attack show that the ACL mechanism is fundamentally and fatally flawed; p. 151, column 1.)

[3] David D. C. Brewer and Michael J. Nash, "The Chinese Wall Security Policy," IEEE Symposium on Security and Privacy, Oakland, May 1989, pp. 206-214.

[4] D. E. Denning, A Lattice Model of Secure Information Flow, Commun. ACM 19(2): 236-243 (1976).

[5] P. May, Finite Topological Spaces, University of Chicago, 203, 2008, 2010.

[6] M. Harrison, W. Ruzzo, and J. Ullman, Protection in Operating Systems, Commun. ACM 19(8): 461-471 (1976).

[7] C. E. Landwehr and C. L. Heitmeyer, Military Message Systems: Requirements and Security Model, NRL Memorandum Report 4925, Computer Science and Systems Branch, Naval Research Laboratory, 1982.

[8] T. Y. Lin, Neighborhood Systems and Approximation in Database and Knowledge Base Systems, Proceedings of the Fourth International Symposium on Methodologies of Intelligent Systems, Poster Session, October 12-15, 1989, pp. 75-86.

[9] T. Y. Lin, "Chinese Wall Security Policy: An Aggressive Model," Proceedings of the Fifth Aerospace Computer Security Application Conference, December 4-8, 1989, pp. 286-293.

[10] T. Y. Lin, "Granular Computing on Binary Relations I: Data Mining and Neighborhood Systems," in Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Physica-Verlag, 1998, pp. 107-121.

[11] T. Y. Lin, "Granular Computing on Binary Relations II: Rough Set Representations and Belief Functions," in Rough Sets in Knowledge Discovery, A. Skowron and L. Polkowski (eds), Physica-Verlag, 1998, pp. 121-140.

[12] T. Y. Lin, Chinese Wall Security Policy Models: Information Flows and Confining Trojan Horses, DBSec 2003: 275-287.

[13] T. Y. Lin, Granular Computing: Practices, Theories, and Future Directions, Encyclopedia of Complexity and Systems Science 2009: 4339-4355.

[14] T. Y. Lin and J. Pan, Granular Computing and Flow Analysis on Discretionary Access Control: Solving the Propagation Problem, IEEE SMC 2009: 2965-2971.

[15] S. Osborn, R. Sandhu, and Q. Munawer, "Configuring Role-Based Access Control to Enforce Mandatory and Discretionary Access Control Policies," ACM Transactions on Information and System Security, Vol. 3, No. 2, May 2000, pp. 85-106.

[16] A. Tanenbaum, Operating Systems: Design and Implementation, 1987.

[17] Neil Vachharajani, Matthew J. Bridges, Jonathan Chang, Ram Rangan, Guilherme Ottoni, Jason A. Blome, George A. Reis, Manish Vachharajani, and David I. August, RIFLE: An Architectural Framework for User-Centric Information-Flow Security, Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, Portland, Oregon, 2004, pp. 243-254. ISSN 1072-4451, ISBN 0-7695-2126-6.

[18] Finite Topological Spaces, Wikipedia, http://en.wikipedia.org/wiki/Finite-topological-space, July 16, 2014.