Related Work in Machine Learning - Learning To Scale Up Search-Driven Data Integration

Finally, ontology-based approaches rely on auxiliary external ontologies to compute explanations [34, 78]. Such ontologies model relationships among schema elements and are usually provided externally or inferred from the schema. There are also studies on why-not explanations in the context of top-k queries [19, 33, 40]. The goals of our work are different: determining where data experts should focus their attention and helping them debug incorrect answers, by learning a dynamic cost model. Recent work on Data X-Ray [83], which uses a similar cost model and Bayesian update method, is also closely related. The difference lies in the learning settings and in the generalization of the Bayesian method.

6.2 Related Work in Machine Learning

6.2.1 Active Learning

Active learning attempts to address the issue of high labeling cost. Typically, in active learning, a learning algorithm has access to unlabeled data and can select the next (explicit) sample for an oracle to annotate. Requesting the next labeled sample can be done either by explicitly constructing the sample or by issuing a query for a highly informative sample, depending on the learning model. The objective of active learning is to learn a good model while requesting significantly fewer labeled samples.

While active learning is a popular area of machine learning [68], standard techniques cannot be directly used on tree-structured queries in which individual edges have uncertainty. Three strategies have been primarily used in prior research: (1) the least confident strategy considers only the most likely prediction; (2) the maximum margin strategy considers the top two predictions; (3) the entropy maximization strategy considers all predictions, which can be exponential in the size of the structured object predicted. Our approach of clustering predictions and choosing a representative tree per cluster is, in some sense, an intermediate strategy.
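The three strategies above can be illustrated on a flat (non-structured) prediction task. The following is a minimal sketch, not the method used in this thesis: the candidate pool, its probabilities, and all function names are hypothetical, and the margin strategy is taken under the common convention that the instance with the smallest gap between its top two predictions is queried.

```python
import math

def least_confidence(probs):
    """Uncertainty based only on the most likely prediction."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the top two predictions; a small gap means high uncertainty."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy (in nats) over all predictions."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predictive distributions for three unlabeled instances.
pool = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.50, 0.45, 0.05],
    "c": [0.40, 0.30, 0.30],
}

# Each strategy selects the instance it considers most uncertain.
pick_lc = max(pool, key=lambda k: least_confidence(pool[k]))
pick_margin = min(pool, key=lambda k: margin(pool[k]))
pick_entropy = max(pool, key=lambda k: entropy(pool[k]))

print(pick_lc, pick_margin, pick_entropy)
```

The entropy score sums over every possible prediction, which is why it becomes expensive when the predictions are structured objects such as trees and the number of candidates grows exponentially.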

Prior work on active learning over structured output has sought to select the next instance upon which to receive feedback, with feedback given directly over the predicted objects [68, Sec. 2.4]. Our work differs in keeping the instance (keyword query) the same and soliciting feedback over different trees (interpretations) of the given query. With the notable exception of [45], most previous work on active learning over structured output involved sequences [21, 67], whereas in the Q system we infer trees. Also, our use of active learning in keyword search-based data integration is novel.

Although cluster-based active learning has been found to be useful in previous research [68, Sec. 5.2], such work has focused on classification and not structured prediction. Moreover, clustering in those cases is performed over the input instances, rather than the output Steiner trees corresponding to the given keyword query.

Another thread of related work is active learning methods on trees and graphs [?], which investigate the optimal arrangement of queries to minimize mistakes on non-queried nodes. This work uses spanning tree-based query selection methods and provides bounds on the number of mistakes. It assumes a different model from ours, in which binary labels are predicted on nodes.

6.2.2 Recommendation Systems

The rise of recommendation systems, and especially of their techniques, has drawn much attention from both academia and industry: many related techniques have been developed in this area, and many modern sites such as Amazon and Netflix deploy recommendation systems to increase profit. The goal of a recommendation system is to predict how likely a user is to prefer an item, based on past history.

Recommendation systems [51] are typically classified into three main categories: content-based methods, collaborative filtering methods, and hybrid methods. Content-based methods attempt to recommend items to a user based on the kinds of items the user likes most, by computing item similarities and examining the user's past history. Alternatively, content-based methods may also model user profiles to discover similar users. Content-based recommendation methods are often limited since they only consider the history of the particular user, largely ignoring correlations within the huge set of users. For this reason, recent methods mostly build on collaborative models, which aim to leverage ratings from other users as well. Collaborative filtering methods are further divided into neighborhood-based approaches and model-based approaches. A neighborhood-based (or memory-based) method usually predicts a rating from an aggregated score derived from all or a subset of the ratings (for example, ratings from similar users). In contrast to neighborhood-based methods, model-based methods seek to build effective statistical models that best reflect rating behaviors and patterns. These models are trained on existing data and are then applied to predict unknown ratings. Examples of such models include Bayesian networks, clustering, regression, and Singular Value Decomposition (SVD). Recent work [10] shows that neighborhood-based methods are strong at discovering local structure but weak at predicting overall ratings, while the opposite holds for model-based approaches. Furthermore, in a hybrid approach, one seeks to utilize content data and to build a unified model incorporating both content-based and collaborative methods.
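As an illustration of the neighborhood-based idea, the sketch below predicts an unknown rating as a similarity-weighted average of ratings from the most similar users. It is a toy example under stated assumptions: the rating data, the user and item names, the choice of cosine similarity over co-rated items, and the function names are all hypothetical, and real systems typically add normalization and handle sparsity more carefully.

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 5, "i2": 3, "i3": 4, "i4": 5},
    "carol": {"i1": 1, "i2": 5, "i4": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item, k=2):
    """Similarity-weighted average rating from the k most similar raters of item."""
    neighbors = [
        (cosine(ratings[user], other), other[item])
        for name, other in ratings.items()
        if name != user and item in other
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbor has rated this item
    return sum(sim * rating for sim, rating in neighbors) / total

predicted = predict("alice", "i4")
print(round(predicted, 2))
```

Here the prediction for alice on i4 is pulled toward bob's rating, because bob's past ratings match hers more closely than carol's do.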

Our work in Chapter 4 builds upon collaborative filtering [8, 51, 73], where the goal is to develop personalized rankings of items for a user, based on that user's similarity to other users and on those users' preferences. Our work addresses a more difficult problem than traditional collaborative filtering, in that the basic items we seek to rank (query trees) have structure and may overlap with one another. We also have close ties between the collaborative filtering and online learning aspects of our platform. These have been the focal points of study in Chapter 4.

6.2.3 Learning by Membership Query Synthesis

Our work on query debugging, and more generally the vision of improving integration quality by synthesizing examples for labeling, is very similar to learning by membership query synthesis. In this setting, the learner itself creates a sample on demand and requests the sample's label from an oracle. This model has been heavily studied in the literature [2, 3, 53]. To the best of our knowledge, our proposed learning models differ from those and are more complex.
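A textbook instance of membership query synthesis is learning a one-dimensional threshold: the learner constructs each query point itself, rather than drawing it from a pool, and asks the oracle for its label. The sketch below assumes a noise-free oracle and a hypothetical hidden threshold. It is meant only to illustrate the setting, not the more complex learning models proposed in this thesis.

```python
def learn_threshold(oracle, lo=0.0, hi=1.0, tol=1e-3):
    """Locate a decision threshold in [lo, hi] by synthesizing query points.

    oracle(x) returns True if x is labeled positive (x >= hidden threshold).
    Returns the threshold estimate and the number of labels requested.
    """
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2   # sample created by the learner on demand
        if oracle(mid):
            hi = mid          # threshold is at or below mid
        else:
            lo = mid          # threshold is above mid
        queries += 1
    return (lo + hi) / 2, queries

hidden = 0.37  # hypothetical value, known only to the oracle
estimate, num_queries = learn_threshold(lambda x: x >= hidden)
print(estimate, num_queries)
```

Because every query halves the remaining interval, the number of labels requested grows only logarithmically in the desired precision.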

Chapter 7 Conclusion and Future Work

7.1 Summary

The vision of rapid information integration remains elusive. Recent work has proposed to complement (or even replace) conventional integration with a “pay-as-you-go”, keyword search-driven data integration model. This thesis addresses several fundamental research challenges in implementing this model in an end-to-end system, where integration is driven by users’ information needs specified as keywords, and integration quality is iteratively improved from user feedback given on query results. These challenges require novel solutions that combine learning with a limited amount of expert feedback to best improve integration, sometimes in very noisy models. Overall, this thesis proposes

• Active learning techniques to repair links from small amounts of user feedback;

• Collaborative learning techniques to combine users’ conflicting feedback;

• Debugging techniques to identify where data experts could best improve integration quality.

In developing these methods, this thesis also describes several basic building blocks applicable to global-scale data integration:

• Combining outputs from schema matching and/or record linking tools to estimate the amount of uncertainty associated with a query result;
