
7.5 Distributions over Classifiers

7.6.2 Batch Learning

While online algorithms are widely used, batch algorithms are still preferred for many tasks. Batch algorithms can make global learning decisions by examining the entire dataset, an ability beyond online algorithms. In general, when batch algorithms can be applied they perform better. We compare our new online algorithm (Variance) against two standard batch algorithms: maxent classification (default configuration of the maxent learner in MALLET [155]) and support vector machines (LibSVM [44]). We also include stochastic gradient descent (SGD) [22], which performs well for NLP tasks. Classifier parameters (Gaussian prior for maxent, C for SVM and the learning rate for SGD) were tuned as for the online methods. SGD was run for 10 iterations over the data.

Results for batch learning are shown in Table 7.2. As expected, the batch methods tend to do better than PA, with SVM doing better 10 times and maxent 14 times. SGD does better 11 times. However, in most cases Variance improves over the batch method, doing better than SVM and maxent 12 out of 15 times (at least 7 of these differences are statistically significant). Furthermore, it improves over SGD 14 out of 15 times. These results show that on these tasks, the much faster and simpler online algorithm performs better than the slower, more complex batch methods.

We also evaluated the effects of commonly used techniques for online and batch learning, including averaging and TFIDF features, none of which improved accuracy. Although the above datasets are balanced with respect to labels and predictive features, we also evaluated the methods on variant datasets with unbalanced label or feature distributions, and still saw similar benefits from the Variance method.

7.7 Related Work

The idea of using parameter-specific variable learning rates has a long history in neural-network learning [233], although we do not know of a previous model that specifically models confidence in a way that takes into account the frequency of features. The second-order perceptron (SOP) [39] is perhaps the closest to our CW algorithm. Both are online algorithms that maintain a weight vector and some statistics about previous examples.

While the SOP models certainty with feature counts, CW learning models uncertainty with a Gaussian distribution. CW algorithms have a probabilistic motivation, while the SOP is based on the geometric idea of replacing a ball around the input examples with a refined ellipsoid. Shivaswamy and Jebara [219] used this intuition in the context of batch learning.

Gaussian process classification (GPC) maintains a Gaussian distribution over weight vectors (primal) or over regressor values (dual). Our algorithm uses a different update criterion than the standard Bayesian updates used in GPC [196, Ch. 3], avoiding the challenging issues in approximating posteriors in GPC. Bayes point machines [117] maintain a collection of weight vectors consistent with the training data, and use the single linear classifier which best represents the collection. Conceptually, the collection is a non-parametric distribution over the weight vectors. Its online version [114] maintains a finite number of weight vectors updated simultaneously.

Finally, with the growth of available data there is an increasing need for algorithms that process training data very efficiently. A similar approach to ours is to train classifiers incrementally [23]. The extreme case is to use each example once, without repetitions, as in the multiplicative update method of Carvalho and Cohen [37].

7.8 Conclusion

We have presented confidence-weighted linear classifiers, a new learning method designed for NLP problems based on the notion of parameter confidence. The algorithm maintains a distribution over parameter vectors; online updates both improve the parameter estimates and reduce the distribution’s variance. Our method improves over both online and batch methods and learns faster on fifteen NLP datasets. In addition to improving learning performance, confidence-weighted classifiers have a per-parameter confidence score, as well as a confidence in the predicted margin. In the next chapter, we explore some applications that use this confidence.
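As an illustration of how a single online update can both move the parameter estimates and shrink the variance, the following is a minimal sketch of a confidence-weighted update with a diagonal covariance. It is a sketch under stated assumptions, not a definitive implementation: the closed-form step size is the one commonly given for the variance formulation in the CW literature and is not restated in this section, and the function name `cw_variance_update`, its argument names, and the default `eta=0.9` are hypothetical choices made for this example.

```python
from statistics import NormalDist


def cw_variance_update(mu, sigma, x, y, eta=0.9):
    """One confidence-weighted update with a diagonal covariance (sketch).

    mu    -- list of weight means
    sigma -- list of per-weight variances (all assumed > 0)
    x     -- feature vector as a list of floats
    y     -- label, +1 or -1
    eta   -- required probability of a correct prediction (assumed > 0.5)
    Returns (new_mu, new_sigma, alpha).
    """
    phi = NormalDist().inv_cdf(eta)                      # Phi^{-1}(eta), positive for eta > 0.5
    margin = y * sum(m * xi for m, xi in zip(mu, x))     # mean margin M
    var = sum(s * xi * xi for s, xi in zip(sigma, x))    # margin variance V
    if var == 0.0:
        return list(mu), list(sigma), 0.0                # no active features, nothing to update

    # Assumed closed form: positive root of the quadratic in alpha that
    # makes the constraint M >= phi * V tight after the update.
    b = 1.0 + 2.0 * phi * margin
    disc = b * b - 8.0 * phi * (margin - phi * var)
    alpha = max(0.0, (-b + disc ** 0.5) / (4.0 * phi * var))

    # Means move in the direction of the example, scaled by current variance:
    # low-confidence (high-variance) weights change the most.
    new_mu = [m + alpha * y * s * xi for m, s, xi in zip(mu, sigma, x)]
    # Precision grows only for features present in x, so their variance shrinks.
    new_sigma = [1.0 / (1.0 / s + 2.0 * alpha * phi * xi * xi)
                 for s, xi in zip(sigma, x)]
    return new_mu, new_sigma, alpha


if __name__ == "__main__":
    mu, sigma, alpha = cw_variance_update(
        [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 2.0], y=1)
    print(alpha, mu, sigma)
```

In this sketch a feature that does not occur in the example keeps both its mean and its variance, while frequently observed features accumulate precision and therefore receive smaller updates over time.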

Figure 7.1: Accuracy on test data after each iteration on several datasets. While PA continues to improve after the first iteration, the CW-Variance classifier tends to converge more quickly. On most datasets, Variance-Exact reaches optimal performance after a single iteration, although it does not perform as well as CW-Variance (approximate).

Chapter 8

Confidence Based Applications of Confidence-Weighted Classifiers

8.1 Applications

One of the interesting aspects of confidence-weighted classifiers is a per-parameter confidence score, which translates into a confidence in the predicted margin. There are many potential learning applications that could benefit from a notion of confidence. In this chapter, we will explore some learning settings motivated by our intelligent email work and show how the confidence aspect of CW learning can be applied to address these problems.
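To make the link between per-parameter confidence and margin confidence concrete, here is a minimal sketch assuming a Gaussian distribution over weights with a diagonal covariance. Under that assumption the score w·x is itself Gaussian, with mean μ·x and variance given by the sum of σ_j x_j², so the probability that a sampled weight vector agrees with the predicted label is Φ(|μ·x| / sqrt(variance)). The function name `prediction_confidence` and its argument names are hypothetical and chosen for this example only.

```python
from statistics import NormalDist


def prediction_confidence(mu, sigma, x):
    """Return (label, probability) for a Gaussian over weight vectors (sketch).

    mu    -- list of weight means
    sigma -- list of per-weight variances (diagonal covariance, assumed)
    x     -- feature vector as a list of floats
    """
    mean = sum(m * xi for m, xi in zip(mu, x))           # mean of the score w.x
    var = sum(s * xi * xi for s, xi in zip(sigma, x))    # variance of the score w.x
    label = 1 if mean >= 0 else -1
    if var == 0.0:
        # Degenerate case: the score is deterministic.
        return label, (1.0 if mean != 0 else 0.5)
    # Probability that a sampled weight vector gives the same sign as the mean.
    prob = NormalDist().cdf(abs(mean) / var ** 0.5)
    return label, prob


if __name__ == "__main__":
    mu, sigma = [1.0, -0.5], [0.1, 2.0]
    print(prediction_confidence(mu, sigma, [1.0, 0.0]))  # only a well-estimated feature is active
    print(prediction_confidence(mu, sigma, [1.0, 1.0]))  # a high-variance feature is also active
```

The second call yields a lower probability than the first because the second example activates a feature whose weight has high variance, which widens the distribution of the score.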

Online algorithms are especially attractive for training on large amounts of data. Since they do not require access to all training instances, they can operate over a stream of training data, substantially reducing memory requirements and processing time. This property is especially attractive for email data. For large scale email providers with millions of users, the number of training examples for email problems could reach trillions of examples. We begin this chapter with experiments of Confidence-Weighted learning applied to a million training examples. We also consider how training can be spread across multiple processors using the learned confidence parameters. Our work in this section is based on previously published work in the International Conference on Machine Learning (ICML) [85].

Returning to our example of reply prediction, there are two particular learning challenges to deploying such a system. As we saw, reply prediction requires labeled examples for learning. While our features created a representation useful for out-of-the-box performance on new users, new training examples from a user can still improve the pre-trained classifier. However, obtaining labeled examples is expensive as it requires user interaction. If any deployed system were to solicit labeled examples, it would need to minimize the number of requested labels by using active learning. Active learning is a form of interactive learning, whereby a learning algorithm carefully selects examples from a pool of unlabeled examples to ask the user to label. Instead of labeling random examples, the user works with the learning algorithm to reduce the total labeling cost. We consider an application of active learning with online algorithms. Our work in this section is based on previously published work in the conference of the Association for Computational Linguistics (ACL) [83].
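As a rough illustration of pool-based selection, the sketch below ranks unlabeled examples by a confidence-normalized margin, |μ·x| divided by the standard deviation of the score, and requests labels for the least confident ones. This is one plausible selection rule, assumed here for illustration; this section does not specify the criterion used in [83]. The function name `select_queries` and its arguments are hypothetical.

```python
def select_queries(mu, sigma, pool, k):
    """Return indices of the k pool examples with the least confident predictions (sketch).

    mu    -- list of weight means
    sigma -- list of per-weight variances (diagonal covariance, assumed)
    pool  -- list of unlabeled feature vectors
    k     -- number of labels to request from the user
    """
    def normalized_margin(x):
        mean = sum(m * xi for m, xi in zip(mu, x))
        var = sum(s * xi * xi for s, xi in zip(sigma, x))
        # Examples with zero variance are treated as fully confident
        # and are therefore ranked last.
        return abs(mean) / var ** 0.5 if var > 0 else float("inf")

    # Smallest normalized margin first: these are the most uncertain examples.
    ranked = sorted(range(len(pool)), key=lambda i: normalized_margin(pool[i]))
    return ranked[:k]


if __name__ == "__main__":
    mu, sigma = [1.0, -0.5], [0.1, 2.0]
    pool = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(select_queries(mu, sigma, pool, k=2))
```

Compared with ranking by the raw margin alone, normalizing by the variance favors examples whose active features have rarely been observed, even when the mean score is not close to zero.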

While we showed that our features can learn behaviors common across users for reply prediction, they cannot capture differences in user behaviors. For example, if users reply to email from coworkers with different frequency, then we cannot learn this behavior from a single user’s labeled data. However, creating separate classification rules for each user ignores the commonality of behavior across users useful for learning. We formulate this as a new learning setting called multi-domain learning, whereby a learning algorithm receives examples from multiple users or domains and learns both common and domain-specific behaviors. Additionally, we consider the case of domain adaptation, where classifiers from multiple users or domains are combined for a new domain. We also extend these methods to a setting with multiple disparate users and consider scaling these applications to a much larger number of domains. Our work in this section is based on previously published work in the conference on Empirical Methods in Natural Language Processing (EMNLP) [84].