Efficient Auditing For Complex SQL queries

(1)

Efficient Auditing For Complex SQL queries

Raghav Kaushik

Microsoft Research

Ravi Ramamurthy

Microsoft Research

ABSTRACT

We address the problem of data auditing that asks for an audit trail of all users and queries that potentially breached information about sensitive data. A lot of the previous work in data auditing has fo-cused on providing strong privacy guarantees and studied the class of queries that can be audited efficiently while retaining the guar-antees. In this paper, we approach data auditing from a different perspective. Our goal is to design an auditing system forarbitrary

SQL queries containing constructs such as grouping, aggregation and correlated subqueries. Pivoted on the ability to feasibly ad-dress arbitrary queries, we study (1) what privacy guarantees we can expect, and (2) how we can efficiently perform auditing.

Categories and Subject Descriptors

H.2 [Database Management]: Systems

General Terms

Security, Performance

Keywords

Security, Auditing, Access Control, Privacy, Query Processing

1. INTRODUCTION

Database systems are used today as the primary repository of the most valuable information in any organization. As the volume of data stored in these repositories has increased, protecting the secu-rity of the data has gained increasing importance. Further, the re-sponsible management of sensitive data is mandated through laws such as the Sarbanes-Oaxley Act, the United States Fair Informa-tion Practices Act, the European Union Privacy Directive and the Health Insurance Portability and Accountability Act (HIPAA).

One of the important components of the DBMS security infras-tructure is an auditing system that can be used to aposteriori inves-tigate potential security breaches. Accordingly, there has been an increase in database auditing products on the market [1, 2, 3] in-cluding from the major database vendors. As the database system

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SIGMOD’11,June 12–16, 2011, Athens, Greece.

is in production, these products monitor various operations such as user logins, queries, data updates and DDL statements to obtain an audit trail. The audit trail is analyzed offline either periodically or when needed to answer questions about access to schema ob-jects such as: (1) find failed login attempts and (2) find queries and corresponding users that accessed columns corresponding to PII (personal identifier information).

An important class of auditing isdata auditing. A simple ex-ample is single tuple auditing where the goal is to find all queries and update statements that “accessed” a particular tuple, e.g. PII of a specific individual. Such queries potentially reveal sensitive information. While we are not aware of any commercial database auditing system that supports this functionality, there has been prior research that studies this form of auditing.

Prior work has proposed two fundamentally different semantics for data auditing which we can classify broadly as (data) instance dependent and (data) instance independent. The basis for all data auditing semantics is to define what it means for a query to have accessed a particular tuple (that is, single-tuple auditing). In the instance-dependent approach [4], a query is said to access a tuple if deleting the tuple changes the query result on the database instance where the query was originally run — a tuple accessed by the query is said to beindispensableto the query. Unfortunately, subsequent work has shown that the instance dependent approach can lead to breaches of privacy [5].

In contrast, an instance independent approach can be used to get strong privacy guarantees [5, 6, 7]. Under the instance indepen-dent approach, a query is said to have accessed a tuple if there is

somedatabase instance where deleting it changes the query result. Previous work developing the instance independent approach has focused on increasing the class of queries that can be audited effi-ciently while retaining strong privacy guarantees. Efficient audit-ing techniques have been developed for interestaudit-ing subclasses of select-project-join (SPJ) queries.

However, real-world queries such as the benchmark TPC-H queries are often complex using constructs like grouping, aggre-gation and correlated subqueries. While it may be acceptable to consider a restricted class of audit expressions, an auditing system that restricts the class of audited queries is fundamentally incom-plete. Therefore, in this paper we approach data auditing from a different perspective. Our goal is to design an auditing system for

arbitrarySQL queries. Pivoted on the ability to feasibly address ar-bitrary queries, we ask (1) what privacy guarantees we can expect, and (2) how we can efficiently perform auditing.

We first show that previously proposed instance independent se-mantics are computationally incompatible with complex SQL — in the presence of subqueries, enforcing the semantics becomes unde-cidable (§3).

(2)

Therefore, we revisit the instance dependent approach to define our auditing semantics. We begin with the special case of single-tuple auditing (§4). There are several real-world scenarios where a single tuple is a natural unit of analysis. For example, in an employee-department database, an employee tuple contains sensi-tive information such as the employee’s salary. We define an audit-ing semantics by applyaudit-ing the informal notion of an indispensable tupleuniformlyto all queries. In contrast, the approach proposed by Agrawal et al. [4] is (1) non-uniform, (2) does not cover arbitrary SQL and (3) becomes similar to the instance independent approach in the presence of groupby and having clauses (§4.1). By adopting an instance dependent semantics uniformly, we obtain a feasible implementation for an arbitrary query — it is possible to perform single tuple auditing by running the query and a rewritten version that excludes the tuple and checking if the results are equal.

But in this process we also inherit the known privacy limitations of the instance dependent approach [5]. We then ask the question whether there is at least a weaker privacy guarantee our semantics can provide. We answer in the affirmative. We introduce the notion of arisk-freeattack and show that under our instance dependent se-mantics, no attack is risk-free (§4.2). Intuitively, this means that an attacker may get access to sensitive information but not without tak-ing a risk of getttak-ing detected. While the above guarantee falls short of the stronger privacy guarantees yielded by the instance indepen-dent approach, we believe it offers us an interesting way forward in addressing the full complexity of SQL.

We then study how we can efficiently audit a workload of queries (§5). The straightforward implementation suggested by our definition can be inefficient (especially for a workload of com-plex queries). Even though auditing is typically an offline process, the efficiency of auditing is important. For instance, auditing may be invoked as a response to an information breach in an attempt to narrow down the cause of the breach, in which case the efficiency of auditing is critical. In order to address the whole of SQL, we propose a novel rule-based optimization framework (§5.2) that attempts to audit a query without any query execution. The idea is to start with an algebraic plan for a query and find if we can “reach” a plan for the rewritten query by transforming the initial plan. The plan is transformed using equivalence rules in the usual way deployed by any rule-based query optimizer. The main difference is that in addition to the standard rules that hold for all database instances (for example, pushing a selection below join, join commutativity), we also handle rules that are instance specific. Instance specific rules are naturally derived from audit checks — a query that passes the audit is equivalent to the rewritten query. We refer to our framework as anaudit optimizer. Just as the extensibility of rule based query optimizers is key to ensuring the efficient compilation of arbitrary queries, our audit optimizer is naturally extensible allowing us to add more rules over time. We also present a technique that considers reordering the queries in the workload to further improve the efficiency (§5.4).

In general, we would like to audit not only a single tuple but more complex audit expressions. Similar to previous work, we for-mulate our audit expression in the form of aforbidden viewthat captures the sensitive information. We discuss how our semantics, privacy guarantees and optimization techniques naturally extend to cover our class of forbidden views (§6). We then report the results of our initial empirical evaluation (§7) conducted over benchmark data in order to study the impact of our optimization techniques for complex queries. Our evaluation shows that using our framework, we can reduce the time taken for auditing by up to an order of mag-nitude even when the queries are complex (containing correlated subqueries). We discuss related work in §8 and conclude in §9.

Audit Log DBMS Application Auditing Tool Offline Online Monitoring Infrastructure

Figure 1:Auditing Infrastructure

To summarize, we propose an auditing semantics that can be fea-sibly implemented for all of SQL. We characterize its privacy guar-antees and present novel techniques for efficient auditing, while not compromising the class of queries that we can support. We view our paper as an important first step in efficiently auditing complex SQL queries with privacy guarantees.

2. PRELIMINARIES

In this section, we discuss the auditing infrastructure and overall auditing algorithm. We defer a discussion of the auditing semantics to §4.

2.1 Auditing Infrastructure

As with most systems that perform database auditing [1, 2, 3, 4, 5] our auditing system consists of two components — an on-linecomponent and anofflinecomponent, shown in Figure 1. The online component is used in production to log the query and up-date statements issued to the database system. This is implemented using the monitoring infrastructure supported by all commercial database systems. Along with each query/update statement, we also log the corresponding user id. (The user id does not necessar-ily have to be a DBMS user id — many applications manage their users internally. We provide a call-back mechanism that can be used to ascertain the application user id.) We refer to thesequence

of query and update statements with associated user ids as the work-load. The workload is logged in a separate secure database called theaudit log(we use previous work to ensure the audit log is tam-per proof). The audit log is used to tam-perform auditing in theoffline

component. In general, auditing is performed not only against the current database state but also over past database states. Accord-ingly, the auditing component needs to be able to reconstruct past database states. We use the “point-in-time” recovery API provided by commercial database systems that lets us rewind the database state to any point in the past using the database transaction log.

2.2 Audit Expressions

In general, auditing is performed given an audit expression as input. Similar to previous work, our audit expression specifies sen-sitive data via aforbidden view(defined formally in §6). We de-fine a notion of a query beingsafewith respect to the given view. We are given the workloadW collected by the online component. Our goal is to find all query and update statements inW that are not safe, along with the corresponding user ids. We introduce the formal definition of safety for the special case of single tuple au-diting in §4, and the general case in §6. While we are not aware of any commercial database auditing system that supports such a

(3)

functionality, there has been prior research that studies this form of auditing [4, 5, 6, 7].

2.3 Overall Auditing Algorithm

In this paper, we primarily focus on the offline component of the auditing tool (see Figure 1). We now outline our overall algorithm in Algorithm 2.1.

Algorithm 2.1Overall Auditing Algorithm

Input: Workload W of query/update statements with corresponding user ids; Audit ExpressionAE

Output: Subset ofWthat is unsafe

1: Use the database log to rewind to the appropriate state of the database

2: SplitWinto query-only workloads with alternating updates

3: Proceed in the same sequence asW

4: For each query-only workloadQW

5: Call QUERYAUDIT(QW,AE) to find all queries inQW that are unsafe

6: For each update statementU

7: Call UPDATEAUDIT(U) to check ifUis unsafe

8: Update the current state of the database

9: Return all queries and updates that are unsafe along with corresponding user ids

Our auditing is performed on the state of the database when the query was originally run. Accordingly, we first use the transaction log to rewind to the state of the database when the first query in the workload was run (we can further optimize this step by taking periodic backups of the database). We then process the workload in order splitting it into query-only portions alternating with data updates. We refer to the query-only portion of the workload as a

query workload. The query workloads are audited by a query audit-ing algorithm QUERYAUDITthat is invoked over the corresponding state of the database that we refer to as thecurrentinstance in the rest of the paper. The query auditing algorithm returns the unsafe queries. The update statement is audited using an update auditing algorithm UPDATEAUDIT. The update is also replayed to obtain a new state of the database.

Updates are handled in our approach by finding the query under-lying the update and extending our semantics for queries. Updates can have a cascading effect. For example, a sensitive record may be copied by an update statement and future queries could refer-ence the copy. In this example, our auditing system will flag the original update statement that performs the copy. It is possible to track the copy in a subsequent auditing session. We could think of extending our auditing semantics to automatically include cascad-ing effects but we defer the extension to future work. For ease of exposition, in the rest of the paper we consider a fixed instance of the database and focus only on auditing queries. The details of the auditing algorithms are presented in §4, §5 and §6.

3. INSTANCE

INDEPENDENT

SEMAN-TICS

Previously proposed instance independent semantics are pivoted on strong privacy guarantees. We have the notion ofperfect pri-vacyintroduced by Miklau and Suciu [7] and the (weaker) notion ofweak syntactic suspiciousnessintroduced by Motwani et al. [5]. For the special case of single tuple auditing, both of the above definitions reduce to checking whether a tuple iscriticalto a query. A tuple drawn from the domain of the corresponding table is said to be critical to a query if there issomedatabase instance where dropping the tuple changes the query result. We illustrate with an example.

Example 3.1 Consider the TPC-H benchmark database. Consider the query:

select * from customer where c_custkey = 100

that asks for all details of the customer with id 100. Any customer tupletwith id 100 is critical since we can construct a database with a customer with id 100 where deletingtmakes the query result

empty. 2

Previous work developing the instance independent approach has focused on increasing the class of queries that can be audited ef-ficiently while retaining strong privacy guarantees. Efficient tech-niques have been developed for checking whether a tuple is critical to a query for large subclasses of conjunctive queries [6].

As stated in §1, our goal is different — we wish to be able to handle arbitrary SQL containing constructs such as grouping, ag-gregation and correlated subqueries. Even though the notion of a critical tuple applies to the whole of SQL, we show that checking whether a tuple is critical to a complex query involving subqueries is undecidable.

THEOREM 3.2. Checking if a tuple is critical to a query that is allowed to contain subqueries is undecidable.

PROOF. Recent work shows that the query containment prob-lem under bag semantics in the presence of inequalities is undecid-able [8]. We reduce this problem to checking criticality. Suppose we want to check if queryQ1is contained in queryQ2under bag

semantics, denotedQ1⊆bQ2. We create a new queryQselect

* from T where EXISTS (Q1 except all Q2)where

T is a new table not referenced in eitherQ1 orQ2. We use the

except allclause in SQL to compute the bag difference

be-tweenQ1 andQ2. IfQ1 ⊆b Q2, thenQis always empty and

hence no tuple in the domain ofTis critical. On the other hand, if

Q16⊆bQ2, then every tuple in the domain ofTis critical toQ.

It is known that checking perfect privacy and weak syntactic suspi-ciousness is at least as hard as checking whether a tuple is critical to a query [5, 7]. Thus, from Theorem 3.2, it follows that the pre-viously proposed instance independent semantics are computation-ally incompatible with the complexity of full SQL. We therefore revisit the instance dependent semantics for auditing.

4. SINGLE TUPLE AUDITING

Motivated by the need to audit complex queries, we reconsider an instance dependent semantics. We present our overall auditing semantics by beginning with the special case of single tuple au-diting. As noted in §1, there are several scenarios where a single tuple is a natural unit of auditing. We present our extension to more general audit expressions in §6.

4.1 Query Differentials

We define the notion ofquery differentialsto capture instance dependent single tuple auditing.

DEFINITION 4.1. Given a database instanceD, a queryQand a tupletspecified by the valuevof its primary key, we rewriteQto excludetfromTby adding the predicateT.id6=v(we refer to this expression asT −t). The rewritten query is called itsdifferential

with respect tot, denotedQ0t. We say that queryQaccessestuplet

ifQ(D)6=Qt0(D). IfQ(D) =Q

0

t(D), then we say thatQissafe

with respect tot. 2

(4)

Example 4.2 Consider the data and query in Example 3.1. Sup-pose there is a customer tupletwith id 100 in the current instance, then by Definition 4.1, tupletis accessed by the query. 2

Our notion of a tuple being accessed by a query is identical to the informal notion of an indispensable tuple introduced by Agrawal et al. [4]. For the class of SPJ queries, the formal defi-nitions are also identical. However:

1. The formal definition proposed by Agrawal et al. [4] is non-uniform. In fact, no definition of indispensability is offered for multi-block SQL queries that cannot be decorrelated.

2. For select-project-groupby-having queries, the approach pro-posed by Agrawal et al. [4] proceeds by dropping the having clause which reduces it to the instance independent seman-tics (we illustrate in Example 4.3).

In contrast, Definition 4.1 is uniformly instance dependent. The following example illustrates the difference between the instance dependent and independent semantics (and also the difference be-tween our semantics and that of Agrawal et al. [4]).

Example 4.3 We continue with the database in Example 3.1. Con-sider the queryQ:

select o_custkey, count(*) from orders

group by o_custkey having count(*) >= 10

that finds the number of orders placed per customer but only for those customers that have at least 10 orders. Suppose that customer with id 100 has placed only 5 (<10) orders corresponding to tuples

{o1, . . . , o5}in theorderstable. The output ofQon the current

database would not have an entry corresponding to customer 100. Deleting any of the customer’s orders does not change the output ofQ. Thus, by Definition 4.1, none of the tuples in{o1, . . . , o5}is

accessed byQ. However, each of the tuples in{o1, . . . , o5}is

crit-ical toQ; for example we can construct an instance of the database where customer 100 has 10 orders includingo1where deletingo1

changes the output ofQ. We note that the definition offered by Agrawal et al. drops the having clause in the above query; thus their definition reduces to the instance independent semantics. 2

4.2 Privacy Guarantees

As noted in §1, there are known privacy breaches implied by the instance dependent semantics. One class of breaches is what are callednegative disclosures. Consider a queryQthat finds the subset of tuples satisfying a predicateP in a tableT. Suppose an adversary is aware that a tupletexists in the database. If the output of queryQdoes not include the tuplet, the adversary can infer that tupletdoes not satisfy the predicateP.

We now provide a different example of a privacy breach implied by the instance dependent semantics.

Example 4.4 Suppose the customer table in the TPC-H database has a credit rating attribute. Suppose that in the cur-rent instance of the database, customer John Doe has a credit rating of 700. Consider the following queries,Q1:

select sum(CreditRating - 700) from customer andQ2:

select sum(CreditRating - 700)

from customer where c_custname <> ’John Doe’

By checking if the results of the two queries are equal, an adversary can learn that John Doe’s credit rating is 700. However, the tuple corresponding to John Doe is not accessed by eitherQ1 orQ2.

Thus our auditing semantics would fail to detect the above attack.

2

The question arises whether we can precisely characterize the privacy guarantee that the instance dependent approach yields. Let us re-examine Example 4.4. The attack in Example 4.4 essentially requires knowing the credit rating value apriori — if we change the credit rating ofJohn Doeto say 600, queryQ1 accesses the

corresponding tuple and is therefore flagged as unsafe. Thus, if the adversary does not know the value of John Doe’s credit rating upfront, then by issuing queriesQ1 andQ2, he is taking a risk of

being detected by the audit system.

Based on the above intuition, we define the notion of arisk-free

attack below. As in the definition of perfect privacy [7], we assume that the adversary views the database as a probability distribution (over instances) obtained from independent tuple probabilities.

We first formalize the notion of anattack. Fix a database instance

Dand a tuplet∈T of interest. For this discussion we assume that (without loss of generality) there is one sensitive attributeAin ta-bleT. Suppose the adversary has knowledge of all the data in the database except tuplet. Since we are assuming that the tuple prob-abilities for the adversary are independent and sincetis not critical toT−t, the adversary’s knowledge preserves perfect privacy of the value oft.A[7]. The adversary would like to learn some non-trivial boolean property oft.A— we call a boolean propertynon-trivialif there is at least one value in the domain oft.Athat makes it true and at least one value that makes it false. In Example 4.4, the attacker would like to know whether John Doe’s credit is 700 (or not).

We assume that the only interaction the adversary has with the data is running queries and examining their results. The adver-sary issues a set of queries{Q1, . . . , Qn}to the database and

com-putes a boolean functionf on their results. Intuitively, the pair

<{Q1, . . . , Qn}, f >constitutes an attack if the overall

computa-tion reveals some non-trivial property oft.A. We formalize this intuition below.

DEFINITION 4.5. Consider the pair <{Q1, . . . , Qn}, f >

where{Q1, . . . , Qn}is a set of queries andfis a boolean

func-tion. Define the following functionhdefined over the domain of t.A. Given valueαin the domain oft.A:

1. Create database instanceDαobtained from the instanceD

by changing the value oft.Atoα

2. Computeh(α) =f(Q1(Dα), . . . , Qn(Dα))

The pair<{Q1, . . . , Qn}, f >is defined to be an attack ifhis some

non-trivial boolean property oft.A. 2

Example 4.4 illustrates an example of an attack. The goal of an attack is to “cheat” the auditing system. In other words, it should infer a non-trivial boolean property of a tuple by running a set of queries that the auditing system deems to be safe.

DEFINITION 4.6. An attack<{Q1, . . . , Qn}, f >is said to be:

1. Successfulfor a database instanceDα(as defined in

Defini-tion 4.5) if eachQi∈ {Q1, . . . , Qn}is safe with respect to

tinDα.

2. Risk-freewith respect to an auditing system if it is successful for all database instancesDα.

(5)

Example 4.7 The attack described in Example 4.4 is successful in a database instance where the credit rating of John Doe is 700. 2

The above example illustrates a key difference between the in-stance dependent and the inin-stance independent semantics (based on perfect privacy and weak syntactic suspiciousness) which does not permit successful attacks. However, as noted above the attack in Example 4.4 is not a risk-free attack. If the credit rating of John Doe is not 700, then queryQ1would be flagged as unsafe. In fact,

we can show that for our auditing system that uses the instance de-pendent semantics,norisk-free attacks are possible. As a special case, it follows that negative disclosures are not risk-free.

THEOREM 4.8. No attack is risk-free with respect to the single-tuple auditing semantics introduced in Definition 4.1.

PROOF. We provide a proof sketch. Suppose that the tuple of interest ist ∈ T. Suppose to the contrary that we have a risk-free attack<{Q1, . . . , Qn}, f >. From Definition 4.6, it follows

that eachQicould essentially be equivalently rewritten to only go

over the viewT −t(formally, eachQiis conditionally valid [9]

with respect to the view T −t). This contradicts the fact that

<{Q1, . . . , Qn}, f >is an attack.

In this section, we present a novel characterization of the privacy guarantee that the instance dependent approach yields. While it is a weaker guarantee than what the instance independent approach yields, we think it is interesting given that it does not restrict the set of queries that can be feasibly audited. As mentioned in the in-troduction, an auditing system that cannot handle arbitrary queries is fundamentally incomplete. At the same time, there are stronger notions of privacy that detect all successful attacks but it is not clear how we can support them efficiently for complex queries. Under-standing the spectrum in between is an interesting direction for fu-ture work.

4.3 Baseline Query Auditing Algorithm

We note that Definition 4.1 lends itself to a feasible implementa-tion for arbitrary SQL queries — run the query and its differential and check if the results are the same. This baseline algorithm is illustrated in Algorithm 4.1. As described in §4.2, this algorithm disallows risk-free attacks. The baseline algorithm can be

ineffi-Algorithm 4.1Baseline Query Auditing Algorithm QUERYAUDIT Input: (1) Query WorkloadQWof query statements with corresponding

user ids, (2) Current state of databaseD, (3) Tuplet Output: Subset ofQWthat is unsafe

1: For each queryQ∈QW

2: Check ifQ(D)andQ0

t(D)are equal by executing both

3: If (not equal) report thatQis unsafe

cient especially given a workload of complex queries. As discussed in §1, even though auditing is typically an offline process, the ef-ficiency of auditing is important. We next present a more efficient algorithm for query auditing.

5. AUDIT OPTIMIZER

As discussed in §1, even though auditing is typically an offline process, the efficiency of auditing is important. For instance, au-diting may be invoked as a response to an information breach in an attempt to narrow down the cause of the breach, in which case the efficiency of auditing is critical. The baseline implementation outlined in Algorithm 4.1 involves the execution of the query and its differential and a check to examine if the results are identical

which can be quite expensive. Efficient auditing is the focus of this section. We first discuss optimizations for simple classes of queries such as select-project-join (SPJ) queries and then discuss an audit optimization framework for general queries.

5.1 Optimization

Techniques

For

Simple

Queries

Based on previous work on incremental view maintenance [10], we can design various optimizations for special classes of queries such as select-project-join (SPJ) queries. We first discuss technique which we term assubstitutionbased on previous work [4] for au-diting SPJ queries without self-joins. Consider a single tupletin a tableT. Suppose we want to check if a self-join-free SPJ queryQ

accessedt. The substitution technique creates a new queryQ0from

Qby substituting tableTwith a tableT0which has exactly the sin-gle tupletin it. This is accomplished by adding a suitable predicate toQ. Checking whetherQaccessedtis equivalent to checking if the output ofQ0is empty. The following example illustrates this technique.

Example 5.1 Consider the following query Q on the TPC-H schema.

select * from orders

where o_orderdate > ’1998-01-01’

Suppose the tuple of interest is a specific sensitive order, say o_orderkey= 5000. The substitution approach would check if the output of the following queryQ0is empty.

select * from orders

where o_orderdate > ’1998-01-01’ and o_orderkey = 5000

This check can be very efficiently implemented if there is an index on theo_orderkeycolumn. 2 Sometimes, we can check if the modified query Q0 is empty without executing it, for instance if Q contains the predicate o_orderkey < 3000 and we are interested in

o_orderkey= 5000. This technique has been termed static

pruning[11]. We can extend the above technique to handle simple grouping and aggregation. However, just as with previous work on incremental view maintenance, there is no simple extension to more complex queries.

5.2 Rule Based Framework

In this section, we present a novel rule based framework that is extensible to the whole of SQL and that attempts to audit queries

without any query execution. We motivate our framework by using a simple example.

Example 5.2 Consider the following workload of two queries.

1. select * from orders

Suppose that the tuple of interest is a specific order tuplet. Suppose that we have already verified that the first query is safe. Since the second query is subsumed by the first, we can infer that the second query also is safe without executing it. 2 Example 5.2 suggests that we can use the result of auditing one query in auditing future queries.

We propose a rule based framework similar to what is used in rule-based query optimizers [12, 13] to capture the optimization

(6)

illustrated in Example 5.2 (We defer a discussion of related work in answering queries using views to §8.) The idea is that if a query

Qdoes not access tuplet, thenQis equal to its differential on the current database instance. We incorporate this knowledge in the form of arule. Unlike traditional optimizer rules (e.g., the join commutativity rule) which are valid for all database instances, the new rules are valid only on thecurrentdatabase instance. Hence, we term theminstance equivalence rules. We represent the rules using the algebraic representation of a query in the form of a logical plan.

DEFINITION 5.3. An instance equivalence ruleis an ordered pair of logical plans, with a left-hand side (LHS) and a right-hand side (RHS), whose results are equal on the current database

in-stance. 2

While leveraging instance equivalence rules is reminiscent of se-mantic query optimization [14, 15], we note that there are important differences. We model the rules algebraically that lets us easily in-tegrate them with traditional rules as well as represent equivalences between complex query plans (in contrast to semantic query opti-mization that typically relies on exposing “soft” constraints as se-mantic rules). Moreover, instance equivalences have not been pre-viously leveraged for the notion of data auditing where the previ-ously audited queries provide a natural source to derive such rules.

Example 5.4 Consider the first example query discussed in Exam-ple 5.2 which is safe with respect to the tuExam-pletmentioned in the same example. This fact is represented as an instance equivalence rule shown in Figure 2. The example rule shows that the selection

!

!"""""""""""""""""

!

!" ""

" "

#$%&$'" (#$%&$')*+"

Figure 2: Instance Equivalence Rule For Select Query

predicate (denoted byσ1) on the original table produces the same result as its differential — note that the table in the RHS of the rule

isOrders - t. 2

Broadly, we have both the standard transformation rules (like those used in an query optimizer) along with instance equivalence rules on the current instance of the database. For a given query, we use these rules to try to “reach” its differential. If we succeed, we know that the query and its differential are also equal on the current database instance, without performing any query execution. On the other hand, if we fail then we fall back to query execution (perhaps an optimized form as discussed in §5.1). We illustrate this approach through an example.

Example 5.5 Consider the second query in Example 5.2. Using the instance equivalence rule in Figure 2, we can start from the log-ical plan corresponding to the second query and “reach” the loglog-ical plan of its differential. The sequence of transformations is shown in Figure 3. The selection predicate for the second query is denoted byσ2. The individual steps in the derivation are as follows.

1. Rewrite the original plan as a predicate on the LHS of the instance equivalence rule.

2. Replace the LHS of the instance equivalence rule with the RHS.

!!!!!

!

"!!!!!!!!!!!!!!!!!!!!!!

!

"!!!!!!!!!!!!!!!!!!!!!!

!

"!!!!!!!!!!!!!!!!!!!!!

!

"!!!!!

!

!!!!!!!!!!!!!!!!!!!!!!!!!!!

!

#

!!!!!!!!!!!!!!!!!

!

#!

!

$%&'%(! !!!)$%&'%(*+,!

!!!!!!$%&'%(! !!!!!)$%&'%(*+,!

Figure 3: Inferring Equality of a Query and its Differential for a Select Query

3. Collapse the predicates by using a rule that eliminates redun-dant selection predicates to obtain the target differential plan.

2

We can see from Example 5.5 that wematchagainst the LHS of a rule in order to use the rule. Once we find a match, we consider

applyingthe rule by replacing the matching portion by the RHS. A planP0 is said to bereachablefrom planP if we can transform

P toP0 by matching and applying one or more rules. We now define the formal problem addressed by the transformation based audit optimizer.

Reachability Problem: Given a set of transformation rulesRand a set of instance equivalence rulesR0, an input planP and a plan

P0of its differential, check ifP0is reachable fromP through a sequence of transformation rules in(R ∪ R0

)∗. 2

Our framework is extensible. Just as commercial optimizers can handle arbitrary SQL using transformation rules, we can easily ex-tend our framework to handle complex SQL queries by adding suit-able transformation and instance equivalence rules. Rule matching and application are performed in a manner similar to rule-based query optimizers both for standard transformation rules as well as for instance equivalence rules. The LHS is treated as a pattern and matched exactly. Alternately, we check if any part of the plan under consideration issubsumedby the LHS. For subsumption, we use view matching logic. For instance, in Example 5.5 we invoke the first step of the rewrite based on view matching. View matching as implemented in commercial optimizers is restricted to the class of materializable views. In our implementation, we also match against more complex views as we illustrate below.

Example 5.6 Consider the following two queries similar to the previous example but including a nested subquery.

where o_orderdate > ’1997-01-01’ and o_totalprice >

select (avg(o_totalprice) from orders)

Let the tuple of interesttbe a specific order tuple. Suppose that the first query is safe. The corresponding instance equivalence rule is shown in Figure 4. The LHS of the rule uses anApply oper-ator [16, 17] to represent the subquery. We normalize theApply operator by pulling the selection above the operator (we identify the selection above theApplywith theApply.) Similar to the pre-vious example, we can again establish the equality of the second query and its differential as shown in Figure 5. An interesting point to note is that the first step leverages a more sophisticated form of

(7)

! !!!!!"##$%! !&! '()*(+! ",-! '()*(+ +! !!!!!"##$%! !&! .'()*(+/01! ",-! .'()*(+!"!01! 1!

Figure 4: Instance Equivalence rule for a nested subquery !!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!"##$%! !"! &'()'*! "+,! &'()'* *! !!!!!"##$%! !-! &'()'*! "+,! &'()'* *! !"! !!!!!"##$%! !-! .&'()'*/01! "+,! !!.&'()'*!#01! !"! !!!!!"##$%! !"! .&'()'*/01! "+,! .&'()'*!#!01! 1!

Figure 5: Inferring Equality of a Query and its Differential for a Nested Subquery

view matching than Example 5.5. We reason that anApply oper-ator is subsumed by another if its parent selection is subsumed and its children are identical. We note that this goes beyond existing view matching techniques in commercial systems that do not han-dle subqueries. The rest of the derivation is similar to the previous

example. 2

We finally illustrate an example of a rewriting that involves mul-tiple applications of instance equivalence rules.

Example 5.7 We show the instance equivalence rules obtained from prior executions below. We observe that there are two of

σ1 σ1 Orders (Orders-t) Max Orders Max (Orders-t)

Figure 6: Instance Equivalence Rules

them. The rule involving themaxaggregation corresponds to the fact that the tuple of interest does not have the maximum value in theOrderstable. We illustrate how we can leverage the above two rules to derive the safety of a complex query involving a nested subquery in Figure 7. We note in Figure 7, we move away from the normalized form of theApplyoperator by pushing the selection down. We then find that both the left and right children of the Applyoperator are subsumed by the LHS of some instance equiv-alence rule. Applying both the rules yields the target plan of the

differential query. 2

5.3 Implementation

In this section we briefly discuss how the proposed rule based framework can be implemented. If we modify the DBMS code, we can consider the framework to be a special mode of the query optimizer (“the audit optimizer” mode) in which we modify a tra-ditional rule based query optimizer to accept as adtra-ditional input the

Apply σ2 Orders Max Orders s Apply σ2 (Orders-t) Max (Orders – t) ) Apply σ2 Orders Max Orders s Apply σ2 Orders Max Orders s σ1 Apply σ2 (Orders-t) Max Orders s σ1 Apply σ2 (Orders-t) Max Orders s Apply σ2 (Orders-t) Max (Orders-t)

Figure 7: Inferring equality using multiple instance-equivalence rules

instance equivalence rules, the original logical plan and to check if the differential plan is reachable using its rule engine. We would in addition need to augment the view matching rules to match more complex operator trees as discussed above.

Alternately, we can also build a client based solution in which we implement a rule engine to manage the set of original transfor-mation rules as well as the instance equivalence rules. Note that we only need the rule application and matching logic in a query opti-mizer; in particular, we do not need the cardinality estimation and cost estimation modules. In addition, we can leverage the database server for performing predicate subsumption and view matching using thehypothetical views API provided by most commercial database systems [18]. In order to check if a viewV1is subsumed

byV2, we create a hypothetical view (which is just a metadata

en-try in the catalog and does not contain any data), optimize the query corresponding toV1and check if it can be answered using the

hy-pothetical viewV2.

Algorithm 5.1OPTQUERYAUDIT

Input:Query-only WorkloadQWof query statements with corresponding user ids; Tupletin tableT

Output: Subset ofQWthat is unsafe

1: LetDbe the database state corresponding toQW

2: LetIbe the set of instance equivalence rules

3: For each queryQ∈QW

4: LetPandP0_{denote the logical plans for}_Q(D)_and_Q0

t(D)

5: Check ifP0is reachable fromPusingI

6: If not, Check ifQ(D)andQ0_t(D)are equal

7: If (equal) augmentIwith a new instance equivalence rule

8: If (not equal) report thatQis unsafe

As the number of instance equivalence rules increases, the over-head of checking reachability using the rule engine also increases. We can leverage techniques that are typically used to control the overheads of query optimization, such as timeouts — we run the rule based framework till a particular timeout expires. If we have not yet reached the target plan, we resort to query execution. We defer a more thorough study of how the overheads of the rule based framework can be tuned to future work.

Another issue to consider is the maintenance of the instance equivalence rules under updates. Recall that these rules are valid only for a particular database instance. Currently, as a first step, we use simple syntactic checks such as only invalidating rules that re-fer to the table being updated. More advanced techniques for

(8)

main-taining the instance equivalence rules in the presence of updates is an interesting direction for future work.

We present the optimized algorithm for single tuple auditing that augments the baseline approach with the rule based framework in Algorithm 5.1. The key differences from the original algorithm are as follows. In Steps 4-5, we use the rule based framework to check if the differential plan is reachable from the original plan. If so, we skip the execution for this query. Otherwise, we generate a new instance equivalence rule (which is added to the setI) if the query is safe.

5.4 Reordering queries

We note that in Algorithm 5.1, we do not change the order in which queries are processed. In this section we present an opti-mization that considers reordering the queries in the workload to further improve the efficiency. We motivate this optimization with an example.

Example 5.8 Consider the following workload of three queries presented in order.

As discussed previously in Example 5.6, we can use instance equiv-alence rules to avoid any query execution for the second query. However, we would still need to execute the third query as it is not subsumed by any of the previous queries. Thus, we would only save the execution corresponding to the second query. However, suppose that we reorder the queries in the workload so that we au-dit the third query first. If the third query is safe, we can then infer the same for both the remaining queries thus saving two query

ex-ecutions. 2

We formally define theQuery Reordering Problembelow.Query Reordering Problem: Given a query-only workload, find the per-mutation of the sequence of the queries that leads to the minimal number of executions for single tuple auditing.

We solve the above problem by creating asubsumption graph

which is a directed acyclic graph (DAG) in which there is one node corresponding to each query in the workload and there is an edge from nodeQ1 to nodeQ2 ifQ2 is subsumed byQ1. We order

queries in the order yielded by a topological sort of the subsumption graph. The rest of the algorithm is identical to Algorithm 5.1.

6. AUDIT EXPRESSIONS

We now briefly discuss how we extend our solution for single tu-ple auditing to more general audit expressions. Similar to previous work, our audit expressions are expressed asforbidden views. We begin by considering forbidden views expressed as a predicate over a single table. If the predicate is of the formid = value, then we reduce to single tuple auditing.

Example 6.1 Consider the following view over thePatients(

patientID, disease)table in a health care database.

select * from Patients where disease = ‘cancer’

The above view expresses the intuition that information about can-cer patients is sensitive. 2 We also support a limited class of joins as forbidden views. Our goal in extending forbidden views to joins is to express simple pred-icates on set-valued attributes. For instance, an extension of Ex-ample 6.1 would be where the database has the information about patients in a table calledPatientsbut the diseases are stored in a separate tableDiseasesto account for the fact that a patient might suffer from multiple diseases over time. In this instance, we can think of a patient as consisting of a set valued attribute contain-ing all diseases they have suffered from over time. More formally, we allow the following class of forbidden views:

select * from universal-table where condition list

The term “universal table” above refers to the join of a set of tables where we begin with a key table (e.g.,Patients) that intuitively captures the atomic attributes in the set, and join other tables (e.g., Diseases) through foreign key lookups that essentially add the set-valued attributes. The “condition list” consists of simple predi-cates that do not involve subqueries. We illustrate with the follow-ing example.

Example 6.2 Consider the Patients-Diseases example above. We can write the following view:

select *

from Patients P join Diseases D on patientID where P.zipcode = ’53702’

and D.disease = ’cancer’

to capture our desire to hide information about cancer patients in

the zip code 53702. 2

Although our forbidden views are not as general as the class consid-ered by previous work, they cover most of the examples previously considered [4, 5].

We now briefly discuss how our auditing semantics extend to cover the above class of forbidden views. As noted above, we think of the forbidden view as expressing a boolean predicate on a tu-ple containing a set-valued attribute. For single tutu-ple auditing, we rewrite a query to exclude the tuple. Given a forbidden view, we similarly rewrite the query to exclude any tuples in the “univer-sal” table that belong to the forbidden view. The rewritten query is called itsdifferentialwith respect to the forbidden view. A query is deemed to besafewith respect to the forbidden view if it has the same result as its differential on the current instance of the database. We illustrate with an example.

Example 6.3 We continue with the forbidden view in Exam-ple 6.2. Consider the queryQ:

select * from Patients where patientID = ‘Alice’ The differential ofQis:

select * from Patients P where patientID = ‘Alice’ and not (zipcode = ’53702’ and

exists (select * from Disease D

where P.patientID = D.patientID and D.disease = ’cancer’))

If Alice happens to live in the zipcode 53702 and suffered from cancer, then the differential would produce a different result. Oth-erwise,Qwould be deemed safe with respect to the forbidden view.

(9)

In general, each table referred to in a query is rewritten to exclude the forbidden sets. We omit the details of the rewriting which are straightforward.

Interestingly, both our privacy guarantees as well as our audit op-timization techniques extend in a straightforward manner to cover the class of forbidden views above. We assume that the adversary views the database as a probability distribution (over instances) ob-tained from independentuniversal tabletuple probabilities. The elements in a set-valued attribute need not be independent. For instance, in Example 6.2, we do not assume that the diseases af-flicting a patient are independent of one another. But there is inde-pendence between the “universal” records (patient records in Ex-ample 6.2). With this adversary model, it is not difficult to extend the definition of an attack and a risk-free attack to tuples with set-valued attributes. Just as in Theorem 4.8, we can show that the audit semantics proposed above for forbidden views guarantees that no attack is risk-free.

THEOREM 6.4. No attack is risk-free with respect to the forbid-den view auditing semantics defined above.

Similarly, the baseline algorithm and audit optimization tech-niques also extend. Instead of an expression of the formT−tfor a tableTand tuplet, we have an expression of the formT−σp(T)

where predicatepis derived from the forbidden view. With that change, the plan transformation proceeds as in §5. For instance, the instance equivalence ruler1 :σ1(Orders)≡σ1(Orders-t)in

Fig-ure 2 would instead be the instance equivalencer01:σ1(Orders)≡

σ1(Orders−σp(Orders)). The instance equivalence ruler01can be

used in the same way asr1in Example 5.5.

6.1 Relationship to Previous Work

There are two strains of previous work we now comment on. Our approach to forbidden views outlined above differs from the (instance dependent) semantics adopted by Agrawal et al. [4]. The semantics proposed by Agrawal et al. [4] deems a query to be safe with respect to a forbidden view if they share no indispensable tu-ples. The following example illustrates how our semantics is dif-ferent.

Example 6.5 Consider the forbidden view expressed in Exam-ple 6.2 and the queryselect * from Disease D where D.disease = ’heart attack’ and D.patientID = ’Alice’. Suppose that patient Alice lives in the zip code 53702 and has suffered from both heart attack and cancer. Alice’s disease record corresponding to heart attack is not indispens-able to the forbidden view. Thus, by the semantics proposed by Agrawal et al. [4], the query would be deemed safe. On the other hand, under our definition, the query would be deemed unsafe. 2 Essentially, our semantics for forbidden views assumes that uni-versal table tuples are independent but the elements with a single set-valued attribute (e.g., ’heart attack’ and ’cancer’ which are both correlated with smoking) are not necessarily independent. Our se-mantics yields the privacy guarantees discussed above. This is dif-ferent from the approach proposed by Agrawal et al. [4] where the privacy guarantees are not discussed.

The second strain of previous work that is related to our seman-tics is the recent work by Fabbri et al. [11] that focuses on au-diting anauthorizationpolicy in an instance dependent approach. As with authorization policies, the auditing semantics presented in this paper also performs query rewriting. However, our input is a forbidden view, not an authorization policy. We can think of our semantics intuitively as deriving an authorization policy based on the complement of the forbidden view in order to rewrite the query.

However, we formally analyze the privacy guarantees yielded by such an approach. We also propose a generic optimization frame-work applicable to arbitrary queries; in contrast, the optimizations proposed by Fabbri et al. [11] focus mostly on SPJ queries.

7. EXPERIMENTAL EVALUATION

In this section, we present an experimental evaluation of the tech-niques presented in §5 to reduce the cost of computing differentials for complex queries. We focus primarily on the offline component of the auditing tool.

The goals of the experimental study are to study the impact of our optimization techniques as a function of (1) the data size, (2) the number of queries failing the audit, (3) the selectivity of the audit expression, and (4) the fraction of updates in the workload. We first discuss some details of our experimental setup and then present our results and summarize their implications.

7.1 Implementation and Experimental Setup

Our auditing tool is built as a client application over Mi-crosoft SQL Server 2008. The “online” part of the audit relies on the auditing infrastructure of SQL Server [19]. The audit optimizer is built as part of the client application (as described in §5.3). The current prototype has the basic set of transformation rules (covering joins, selects, aggregates, group-by etc.) as well as view matching rules for operator trees. As discussed previously we build on the hypothetical view API of Microsoft SQL Server 2008 for checking subsumption. We also leverage a client-side parser tool to parse SQL queries into logical plans and also to derive instance equivalence rules for queries that are safe.

We use the TPC-H benchmark database [20] for our experi-ments. We use the 1GB version for most of our experiments but also present results on the 100MB as well as the 10GB version when we examine the sensitivity of the optimization techniques to data size. For our query workload, we choose the following stored procedure on the TPC-H schema which obtains information about “high-priced” orders (which are defined as orders that are “close” to the highest priced order item) that were shipped after a certain date. The procedure below is based on query 17 in the benchmark.

SELECT * FROM orders, lineitem WHERE l_orderkey = o_orderkey

and l_shipdate > $1 and o_totalprice >

0.9 *(SELECT max(o_totalprice) from orders)

The query workload is generated by using 30 random instances of the above stored procedure. We first consider a scenario for sin-gle tuple auditing where we assume that we are interested in audit-ing for a specific order of a particular customer (that could repre-sent sensitive information) which is specified by a predicate on the

o_orderkeycolumn. In other words, we are interested in finding

the specific instances of the stored procedure that accessed the in-formation corresponding to an order of a particular customer. For our experiments on audit expressions, we assume that we are inter-ested in auditing for orders in a particular time window which is specified by a predicate on theo_orderdatecolumn.

We choose the above stored procedure because it tests the perfor-mance of our optimizations for a complex query containing aggre-gation and a subquery. Moreover, stored procedures are commonly used in databases and it is important to test the performance of our techniques for such scenarios.

7.2 Comparing different alternatives

In this experiment, we first compare how the different techniques discussed in §5 perform for the basic case of single tuple auditing when all the queries are safe (we revisit this assumption in a later

(10)

0 5 10 15 20 25 30 35 0 200 400 600 800 1,000 1,200

Baseline Opt Opt-Reorder

No . o f E xec u ti o n s Exec u tion Ti m e (se c.)

Execution Time Number of Executions

0 5 10 15 20 100MB 1GB 10GB Spe e d u p Ov e r B aseli n e Database Size Opt Opt-Reorder

Figure 8:(a)Comparison of different alternatives (b)Varying Data Size

experiment). We ensure this by auditing for a customer order which is not among the high priced ones.

We present results for the following techniques.

1. Baseline: This is the baseline approach that checks if the query and its differential are equal for every query in the workload

2. Opt: This technique (described in Algorithm 5.1) leverages previous safe queries by deriving and exploiting instance equivalence rules.

3. Opt-Reorder: This technique (described in §5.4) in addition reorders the sequence of queries in order to best exploit the instance equivalence rules

Figure 8(a) illustrates (1) the total number of query executions and (2) the total execution time in seconds. The graph indicates thatOpt-Reorderreduces the number of executions by a factor of 30 when compared to theBaselineapproach. Note that in this ex-periment all queries are safe, thusOpt-Reorderwould in fact exe-cute only the query with the largest predicate range of l_shipdate first and the remaining queries would then be inferred to be equiv-alent to the corresponding differential queries using the instance equivalence rules. However, sinceOpt-Reorderwould still execute the most expensive query the corresponding ratios for execution times are less (though still significant). The graph indicates that the optimization techniques proposed in this paper have the potential to reduce the time taken for single tuple auditing by even an order of magnitude.

7.3 Varying Data Size

In this experiment, we repeat the previous experiment while varying the original database size. We present results for three versions of the TPC-H database (100 MB, 1GB and 10GB). Since the number of executions remain the same for the techniques (in-dependent of the database size) we only illustrate the speedup in execution times over the baseline algorithm. The results in Fig-ure 8(b) illustrate the following points. First, for small databases

Opt-Reorder can potentially be worse thanOpt. This is because the relative overheads of computing the subsumption graph (recall that our implementation needs to leverage the hypothetical views API for checking predicate/view subsumption) can be significant particularly when the query execution times are small. Second, as the database size increases, the relative overheads of computing the

subsumption graph are smaller andOpt-Reordercan provide sig-nificant reductions in the time taken for auditing when compared to

Opt.

7.4 Sensitivity to Audit Failure Rate

BothOptandOpt-Reorderleverage instance equivalence rules that are derived from previous safe queries. Thus, an important pa-rameter that could influence the performance of the proposed tech-niques is theAudit Failure Rate which is defined as the fraction of the queries that fail the audit. In order to vary this parameter we repeat the experiment in §7.2 but force the firstk%percent of the queries to fail the audit. We note that this experiment is biased againstOpt-Reorder(and notOpt) since the first few queries (after reordering) are typically the most expensive queries in the work-load and we essentially forceOpt-Reorderto execute them.

The results in Figure 9(a) indicate that the proposed techniques work well even when the audit failure rate is high. For instance, the speedup obtained usingOpt-Reorderis around 3 even for the case when20%of the queries fail the audit (which is a large fraction). The performance ofOptand Opt-Reorderare relatively close in this example. This is because the experiment is biased against Opt-Reorderas described previously.

7.5 Sensitivity to Audit Expression Selectivity

The results presented thus far have focused on the scenario of single tuple auditing. As discussed earlier, our optimization tech-niques naturally extend to support a larger class of audit expres-sions. In this section, we examine the sensitivity of the optimiza-tion techniques to the selectivity of the audit expression used. Re-call that the audit expression we consider is the set of orders in a particular time window which is specified by a predicate on the

o_orderdatecolumn.

Figure 9(b) shows the results as the predicate range on the

o_orderdatecolumn is varied from a week to a year. The results

show that the speedup obtained reduces as the date range widens. This is expected because a larger date range in turn increases the audit failure rate. However, the results show that even when the au-dit expression selects all orders made over a year, our optimization techniques speed up execution by a factor of nearly 3.

7.6 Sensitivity to Updates

Finally, we examine the sensitivity ofOptandOpt-Reorderto updates in the workload. Recall that bothOpt andOpt-Reorder

(11)

0 200 400 600 800 1000 1200 0% 5% 10% 20% Exec u ti o n Ti m e (se c.)

Fraction of Audit Failures Baseline Opt Opt-Reorder

0 200 400 600 800 1000 1200 1400

1 week 1 month 6 months 1 year

Exec u tio n Ti m e (se c.) Predicate Range Baseline Opt Opt-Reorder

Figure 9:Varying (a)Audit Failure Rate (b)Audit Expression Selectivity

0 200 400 600 800 1000 1200 1400 0% 3% 6% 15% Exec u tion Ti m e (sec .)

Fraction of Updates (uniform) Baseline Opt Opt-Reorder

Figure 10:Effect of Updates

rules are only valid for a particular instance of the database and may need to be invalidated in the presence of updates. As discussed in §5, the rules can be still applied for the queries in a workload that are between update statements.

In order to study the effect of updates, we add an increasing num-ber of update queries uniformly to the workload of 30 instances of the stored procedure. Figure 10 shows the results for an audit ex-pression that selects all orders made over a range of three months. For instance, the data point corresponding to 6% updates would essentially split the original 30 query sequence into 3 smaller sub-sequences of 10 queries each. The results indicate thatOpt-Reorder

yields significant speedup even in the presence of a large fraction of updates (e.g., providing a speedup of a factor of 3 even when the percentage of updates is 15%). We find similar results for the case when the updates are not uniformly distributed but skewed to appear among the last 10 queries of the original workload.

7.7 Summary

We now summarize our empirical results. Our initial experi-ments indicate thatOpt-Reorderhas the potential to significantly reduce the time taken to audit queries (especially for instances of a stored procedure). It is particularly effective in cases where the database size is large and the audit failure fraction is low (which is likely to be common in practice). Moreover, the performance im-provements yielded byOpt-Reorderremain significant as the input characteristics such as the data size, audit failure rate, audit

expres-sion selectivity and the fraction of updates in the workload change. We defer a more thorough evaluation of our framework to future work.

8. RELATED WORK

The related work in the space of data auditing is discussed earlier in the paper. This section focuses mostly on the work related to the audit optimizer. Our audit optimizer is architected similar to a rule based optimizer [12, 13] and can be extended by adding additional transformation rules. In addition to regular transformation rules, an audit optimizer also leverages instance equivalence rules that are only valid for a current instance. While the notion of leveraging additional “semantic” rules has previously been studied in the area of semantic query optimization [14, 15], the main difference in our approach is that we model the rules algebraically (in the form of logical plans) that lets us easily integrate the instance equivalence rules with standard transformation rules into the optimizer search space. Further, auditing previous queries is a novel and natural source of instance equivalence rules.

Our techniques for efficiently performing an audit are also remi-niscent of previous work in query answering using views [21]. For instance, instead of using a rule based framework, we could lever-age previously audited queries using the following problem for-mulation. Given a set of views corresponding to previously audited queries, we can avoid the execution of the current query and its cor-responding differential if it can be rewritten completely in terms of the views. However, we do not use this problem formulation for the following reasons: 1) the problem of answering queries using views is known to be a hard problem for complex queries/views [21] and 2) the view language used by database systems is typically very limited.

Finally, the techniques proposed in §5.1 for simple queries are based on techniques for incremental view maintenance [10]. Most techniques for incremental view maintenance are restricted to sim-ple classes of queries. Moreover, our rule based framework is com-plementary in that it attempts to completely avoid any query exe-cution.

9. CONCLUSIONS

In this paper we study the problem of auditing real world queries that typically leverage constructs such as subqueries. We establish that the instance-independent semantics is computationally incom-patible with complex queries. We therefore revisit the instance

(12)

de-pendent approach and propose a semantics for auditing that is feasi-ble for arbitrary SQL. We formally analyze the privacy guarantees yielded by our approach. We propose a novel audit optimization framework that is general and attempts to perform auditing with-out any query execution. The key advantage of the semantics and the optimization techniques is the fact that they are applicable to arbitrarily complex SQL queries. Our initial experiments indicate that the audit optimizer can significantly cut down the time taken for auditing complex queries. We view our work as an important first step in auditing complex SQL queries with privacy guaran-tees. There are several interesting avenues for future work such as: (1) further improving the efficiency of auditing, (2) understanding the interplay between updates and queries, and (3) drilling down into unsafe queries.

10. REFERENCES

[1] Guardium, http://www.guardium.com. [2] Lumigent, “Auditdb,”

http://www.lumigent.com/database-auditdb.html. [3] Oracle Corporation, “Oracle audit vault,”

http://www.oracle.com/us/products/database/audit-vault-066522.html.

[4] R. Agrawal, R. J. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau, and R. Srikant, “Auditing compliance with a hippocratic database,” inVLDB, 2004.

[5] R. Motwani, S. U. Nabar, and D. Thomas, “Auditing sql queries,” inICDE, 2008.

[6] A. Machanavajjhala and J. Gehrke, “On the efficiency of checking perfect privacy,” inPODS, 2006.

[7] G. Miklau and D. Suciu, “A formal analysis of information disclosure in data exchange,”JCSS, vol. 73, no. 3, 2007. [8] T. S. Jayram, P. G. Kolaitis, and E. Vee, “The containment

problem for REAL conjunctive queries with inequalities,” in

PODS, 2006.

[9] S. Rizvi, A. O. Mendelzon, S. Sudarshan, and P. Roy, “Extending query rewriting techniques for fine-grained access control,” inSIGMOD, 2004.

[10] A. Gupta and I. S. Mumick, “Maintenance of Materialized Views: Problems, Techniques, and Applications,”IEEE Data Engineering Bulletin, vol. 18, pp. 3–18, 1995.

[11] D. Fabbri, K. LeFevre, and Q. Zhu, “PolicyReplay: Misconfiguration-Response Queries for Data Breach Reporting.” inVLDB, 2010.

[12] G. Graefe, “Volcano - an extensible and parallel query evaluation system,”IEEE Trans. Knowl. Data Eng., vol. 6, no. 1, 1994.

[13] H. Pirahesh, T. Y. C. Leung, and W. Hasan, “A Rule Engine for Query Transformation in Starburst and IBM DB2 C/S DBMS,” inICDE, 1997.

[14] M. Hammer and S. B. Zdonik, “Knowledge-based query processing,” inVLDB, 1980.

[15] J. J. King, “QUIST: A System for Semantic Query Optimization in Relational Databases,” inVLDB, 1981. [16] C. A. Galindo-Legaria and M. Joshi, “Orthogonal

optimization of subqueries and aggregation,” inSIGMOD, 2001.

[17] R. Guravannavar, H. S. Ramanujam, and S. Sudarshan, “Optimizing nested queries with parameter sort orders,” in

VLDB, 2005.

[18] “Special issue on self-managing systems,”IEEE Data Eng. Bull., vol. 29, no. 3, 2006.

[19] Microsoft Corporation, “SQL Server 2008 Auditing,”

http://msdn.microsoft.com/en-us/library/dd392015(SQL.100).aspx.

[20] “The TPC-H Benchmark,”http://www.tpc.org. [21] A. Y. Halevy, “Answering queries using views: A survey,”

Efficient Auditing For Complex SQL queries