Conclusions - Provenance and Uncertainty

In this dissertation we have studied connections between provenance and uncertainty in two main directions: (i) how a succinct representation of provenance can help propagate uncertainty from source to output and vice versa (Chapters 3 to 5), and (ii) how uncertainty can help enable provenance to be revealed while hiding associated private information (Chapters 6 and 7).

Chapters 3 and 4 focus on computing uncertainty in the output given uncertain source data. In particular, we considered query evaluation in probabilistic databases: given a query q and a probabilistic database I, compute the probabilities of the answers in q(I). Here our two main goals were (i) to compute the (exact or approximate) probability distribution of the answers efficiently in poly-time, and (ii) to identify the classes of queries q (or query-instance pairs ⟨q, I⟩) for which such efficient computation is possible. This problem reduces to computing the probabilities of boolean expressions given the probabilities of their constituent variables. Therefore, in both these chapters, we investigated boolean provenances resulting from query evaluation.
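As a minimal illustration of this reduction (not the algorithms of Chapters 3 and 4), the sketch below computes the exact probability of a boolean provenance in DNF by enumerating all truth assignments. It assumes that each variable is true independently with the given probability; the function name dnf_probability and the variable names x1, y1, y2 are hypothetical. The running time is exponential in the number of variables, which is why efficiently computable classes of provenances matter.

```python
from itertools import product

def dnf_probability(dnf, prob):
    """Exact probability that a DNF boolean provenance is true.

    dnf  : list of clauses; each clause is a set of variable names (a conjunction)
    prob : dict mapping each variable to its probability of being true
           (assumption: variables are mutually independent)
    """
    variables = sorted({v for clause in dnf for v in clause})
    total = 0.0
    # Enumerate every possible world, i.e. every truth assignment.
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        # The provenance holds if at least one clause is fully satisfied.
        if any(all(world[v] for v in clause) for clause in dnf):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if world[v] else 1.0 - prob[v]
            total += weight
    return total

# Provenance x1*y1 + x1*y2 with every variable true with probability 0.5.
example = dnf_probability(
    [{"x1", "y1"}, {"x1", "y2"}],
    {"x1": 0.5, "y1": 0.5, "y2": 0.5},
)
# example == 0.375, which equals P(x1) * P(y1 or y2) = 0.5 * 0.75
```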

In Chapter 3, we proposed the instance-by-instance approach that considers both the query q and the given database instance I, in contrast to the widely used approach of only considering the given query. We proposed a novel characterization and efficient algorithms to decide whether the boolean provenances of the answers in q(I) are read-once, which allows us to efficiently compute the probabilities even for the “unsafe” queries for which computing the exact probability is #P-hard in general. However, the read-once property of boolean provenances is not a necessary condition for poly-time probability computation. In fact, it has been shown that other well-known knowledge-compilation techniques like BDD, d-DNNF, etc. can explain poly-time computation of answer probabilities [101, 137]. Since the computation of provenance does not have much overhead while the query is evaluated, an exact characterization of boolean provenances that are amenable to poly-time computation is of both theoretical and practical interest.
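To see why the read-once property helps, consider the following sketch, which assumes that the provenance is already given in read-once form (every variable appears exactly once) and that the variables are independent. Under these assumptions the subexpressions of an AND or an OR share no variables, so their probabilities combine by simple products and the whole computation takes one pass over the expression. The tuple-based tree encoding and the name read_once_probability are hypothetical; deciding whether a provenance has a read-once form, which is the problem addressed in Chapter 3, is not shown here.

```python
def read_once_probability(expr, prob):
    """Probability of a read-once boolean expression over independent variables.

    expr : ("var", name) | ("and", [children]) | ("or", [children])
           (assumption: every variable occurs exactly once in the tree)
    prob : dict mapping each variable name to its probability of being true
    """
    kind = expr[0]
    if kind == "var":
        return prob[expr[1]]
    child_probs = [read_once_probability(child, prob) for child in expr[1]]
    if kind == "and":
        # Children are independent, so P(all true) is the product.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if kind == "or":
        # P(at least one true) = 1 - P(all false).
        none_true = 1.0
        for p in child_probs:
            none_true *= 1.0 - p
        return 1.0 - none_true
    raise ValueError(f"unknown node type: {kind!r}")

# x1*y1 + x1*y2 has the read-once form x1*(y1 + y2).
tree = ("and", [("var", "x1"), ("or", [("var", "y1"), ("var", "y2")])])
result = read_once_probability(tree, {"x1": 0.5, "y1": 0.5, "y2": 0.5})
# result == 0.375
```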

In Chapter 4, we expanded the class of poly-time computable queries by including difference operations. Difference operations are common in practice, but to the best of our knowledge, they have not been considered in this context. We showed that the computation of exact probability is #P-hard even in very restricted scenarios for queries with difference; moreover, unlike positive queries, even approximating these probabilities is computationally hard. On the positive side, we showed that for a class of queries (and a class of boolean provenances in the instance-by-instance approach), it is indeed possible to approximate the probabilities in poly-time. Our work is a first step toward understanding the complexity of queries with difference operations, and a deeper understanding of these queries will be an important direction to pursue in the future. As more general research directions, one can explore other classes of queries (e.g., recursive datalog queries) and uncertain inputs (e.g., databases allowing correlations in source tuples, semistructured or unstructured data) that have interesting practical applications.

In Chapter 5 we studied tracing errors in the output to find possible erroneous inputs, and thereby to refine the input to improve the quality of the output. We studied this problem in the context of dictionary refinement in information extraction. Many of the rules in a rule-based information extraction system can be abstracted as operations in relational queries, and therefore, the outputs of the system can be associated with boolean provenances in terms of the dictionary entries used in the system. We proposed solutions to address two main challenges in this problem: (i) handling incomplete and sparse labeled data, and (ii) selecting dictionary entries that remove as many false positives as possible without removing too many true positives. We also supported our theoretical results by an extensive empirical evaluation using a real-world information extraction system and extractors.

There are numerous interesting future directions in this area. For example, an important problem is to develop techniques for adaptive labeling so that a high-quality system can be built with only a small labeled dataset. This is more important for informal domains such as Twitter, Facebook, YouTube and Flickr that are increasingly receiving attention from millions of web users today (as opposed to formal domains like news articles, where several syntactic features like proper punctuation and capitalization in the text are available). In addition, the provenance-based framework, models and algorithms in our work on information extraction can be useful in the context of recommendation systems or automated question-answering systems. Given a set of outputs from these systems, where some outputs are correct while others are erroneous, a natural goal is to find the primary causes or sources of these errors; clearly, this is closely related to the objective in our work.

In Chapters 6 and 7, we studied publishing privacy-aware provenance information by introducing uncertainty in the provenance. In Chapter 6, we proposed a formal model for module privacy in the context of workflow provenance by hiding partial provenance information (selected data values over all executions). We showed that private solutions for individual modules can be composed to guarantee their privacy when they are part of a workflow where each module interacts with many other modules. Since hiding provenance information also has a cost in terms of the loss of utility to the user, we thoroughly studied the complexity of finding the minimum amount of provenance that needs to be hidden to guarantee a certain privacy level.

Then in Chapter 7 we took a closer look at module privacy in the presence of public modules with known functionality. The composability property of the previous chapter no longer holds in this setting, and “hiding” the names of the modules by privatization may not work in practice. We proposed a propagation model, where the requirement of data hiding is propagated through public modules. We showed that another composability property holds under certain restrictions in this case. We also studied the corresponding optimization problem.

There are other privacy concerns that we propose as interesting future directions. Revealing the sequence of executed modules, even if the entire experimental data is withheld, can pose structural privacy threats for workflow provenance. For example, the results of a set of diagnostic tests conducted on a patient may be concealed, but even knowing that the tests were performed reveals information about the patient’s medical condition. We can try to find suitable provenance queries and formalize structural privacy under such queries. Even for well-studied data privacy, we need to formalize the notion of privacy and utility with respect to workflow provenance where most of the data values are correlated.
