
ADVANCES IN ONLINE LEARNING-BASED SPAM FILTERING

A dissertation submitted by

D. Sculley, M.Ed., M.S.

In partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science

TUFTS UNIVERSITY

August 2008


Acknowledgments

I would like to take this opportunity to thank my advisor Carla Brodley for her patient guidance, my parents David and Paula Sculley for their support and encouragement, and my bride Jessica Evans for making everything worth doing.

I gratefully acknowledge Rediff.com for funding the writing of this dissertation.

D. Sculley

TUFTS UNIVERSITY August 2008


Abstract

The low cost of digital communication has given rise to the problem of email spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for automatically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state-of-the-art results for the standard online filtering task. We explore a variety of feature representations for spam data. We reduce human labeling cost through the use of efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users only give labeling feedback on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters prone to overfitting in real-world settings, and proposing several modifications for combating these negative effects. Finally, we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments. Together, these contributions enable more accurate spam filters to be deployed in real-world settings, with greater robustness to noise, lower computational cost, and lower human labeling cost.


Contents

Acknowledgments ii

List of Tables viii

List of Figures xi

Chapter 1 Introduction 1

1.1 Idealized Online Filtering . . . 2

1.2 Online Learners for the Idealized Scenario . . . 4

1.3 Contributions: Beyond Idealized Online Filtering . . . 5

1.3.1 Online Filtering with Reduced Human Effort . . . 6

1.3.2 Online Filtering with One-Sided Feedback . . . 7

1.3.3 Online Filtering with Noisy Feedback . . . 8

1.3.4 Online Filtering with Feedback from Diverse Users . . . 9

1.4 Defining Spam . . . 10

1.4.1 Conflicting Definitions . . . 10

1.4.2 Scope and Scale . . . 11

1.5 Machine Learning Problems . . . 12


Chapter 2 Online Filtering Methods 14

2.1 Notation . . . 15

2.2 Feature Mappings . . . 16

2.2.1 Hand-crafted features . . . 16

2.2.2 Word Based Features . . . 17

2.2.3 k-mer Features . . . 19

2.2.4 Wildcard and Gappy Features . . . 21

2.2.5 Normalization . . . 23

2.2.6 Message Truncation . . . 23

2.2.7 Semi-structured Data . . . 24

2.3 Online Machine Learning Algorithms for Online Spam Filtering . . . 24

2.3.1 Naive Bayes Variants . . . 26

2.3.2 Compression-Based Methods . . . 32

2.3.3 Perceptron Variants . . . 36

2.3.4 Logistic Regression . . . 39

2.3.5 Ensemble Methods . . . 40

2.4 Experimental Comparisons . . . 41

2.4.1 TREC Spam Filtering Methodology . . . 41

2.4.2 Data Sets . . . 42

2.4.3 Parameter Tuning . . . 42

2.4.4 The (1-ROCA)% Evaluation Measure . . . 43

2.4.5 Comparison Results . . . 45

Chapter 3 Online Filtering with Support Vector Machine Variants 50

3.1 An Anti-Spam Controversy . . . 51

3.1.1 Contributions . . . 51

3.2 Spam and Online SVMs . . . 52

3.2.1 Background: SVMs . . . 52

3.2.2 Online SVMs . . . 55

3.2.3 Tuning the Regularization Parameter, C . . . 56

3.2.4 Email Spam and Online SVMs . . . 59

3.2.5 Computational Cost . . . 59

3.3 Relaxed Online SVMs (ROSVM) . . . 60

3.3.1 Reducing Problem Size . . . 63

3.3.2 Reducing Number of Updates . . . 64

3.3.3 Reducing Iterations . . . 65

3.4 Experiments . . . 65

3.4.1 ROSVM Tests . . . 66

3.4.2 Online SVMs and ROSVM . . . 71

3.4.3 Results . . . 72

3.5 ROSVMs at the TREC 2007 Spam Filtering Competition . . . 73

3.5.1 Parameter Settings . . . 73

3.5.2 Experimental Results . . . 73

3.6 Discussion . . . 74

Chapter 4 Online Active Learning Methods for Spam Filtering 77

4.1 Re-Thinking Active Learning for Spam Filtering . . . 77

4.2 Related Work . . . 79

4.2.1 Pool-based Active Learning . . . 80

4.2.2 Online Active Learning . . . 81


4.3 Online Active Learning Methods . . . 84

4.3.1 Label Efficient b-Sampling . . . 84

4.3.2 Logistic Margin Sampling . . . 86

4.3.3 Fixed Margin Sampling . . . 88

4.3.4 Baselines . . . 89

4.4 Experiments . . . 90

4.4.1 Data Sets . . . 90

4.4.2 Classification Performance . . . 90

4.4.3 Comparing Online and Pool-Based Active Learning . . . 100

4.4.4 Online Sampling Rates . . . 103

4.4.5 Online Active Learning at the TREC 2007 Spam Filtering Competition . . . 104

4.5 Conclusions . . . 105

Chapter 5 Online Filtering with One-Sided Feedback 108

5.1 The One-Sided Feedback Scenario . . . 109

5.2 Contributions . . . 110

5.3 Preliminaries and Background . . . 111

5.3.1 Breaking Classical Learners . . . 112

5.3.2 An Apple Tasting Solution . . . 113

5.3.3 Improving on Apple Tasting . . . 114

5.4 Label Efficient Online Learning . . . 116

5.5 Margin-Based Learners . . . 117

5.5.1 Two Margin-Based Learners . . . 117

5.5.2 Margin-Based Pushes and Pulls . . . 118


5.5.4 Exploring and Exploiting . . . 121

5.5.5 Pathological Distributions . . . 123

5.6 Minority Class Problems . . . 125

5.7 Experiments . . . 125

5.8 Conclusions . . . 128

Chapter 6 Online Filtering with Noisy Feedback 131

6.1 Noise in the Labels . . . 132

6.1.1 Causes of Noise . . . 132

6.1.2 Contributions . . . 133

6.2 Related Work . . . 134

6.2.1 Label Noise in Email Spam . . . 134

6.2.2 Avoiding Overfitting . . . 134

6.3 Label Noise Hurts Aggressive Filters . . . 135

6.3.1 Evaluation . . . 135

6.3.2 Data Sets with Synthetic Noise . . . 136

6.3.3 Filters . . . 136

6.3.4 Initial Results . . . 138

6.4 Filtering without Overfitting . . . 141

6.4.1 Tuning Learning Rates . . . 142

6.4.2 Regularization . . . 143

6.4.3 Label Cleaning . . . 146

6.4.4 Label Correcting . . . 147

6.5 Experiments . . . 150

6.5.1 Synthetic Label Noise . . . 150


6.6 Discussion . . . 153

Chapter 7 Online Filtering with Feedback from Diverse Users 155

7.1 Blog Comment Filtering . . . 156

7.1.1 User Flags and Community Standards . . . 157

7.1.2 Contributions . . . 157

7.2 Related Work . . . 158

7.2.1 Blog Comment Abuse Filtering . . . 158

7.2.2 Splog Detection . . . 158

7.2.3 Comparisons to Email Spam Filtering . . . 159

7.3 The msgboard1 Data Set . . . 160

7.3.1 Noise and User Flags . . . 161

7.3.2 Patterns of Abuse . . . 162

7.3.3 Understanding User Flags . . . 163

7.4 Online Filtering Methods with Class-Specific Costs . . . 164

7.4.1 Feature Sets . . . 165

7.4.2 Alternatives . . . 165

7.5 Experiments . . . 166

7.5.1 Experimental Design . . . 166

7.5.2 Parameter Tuning . . . 166

7.5.3 Results Using User-Flags for Evaluation . . . 167

7.5.4 Filtering Thresholds . . . 169

7.5.5 Global Versus Per-Topic Filtering . . . 171

7.6 Gold-Standard Evaluation . . . 171

7.6.1 Constructing a Gold Standard Set . . . 172


7.6.3 Filters Versus User Flags . . . 177

7.6.4 Filters Versus Dedicated Adjudicators . . . 177

7.7 Discussion . . . 178

7.7.1 Two-Stage Filtering . . . 178

7.7.2 Feedback to and from Users . . . 179

7.7.3 Individual Thresholds . . . 179

Chapter 8 Conclusions 181

8.1 How Can We Benefit from Unlabeled Data? . . . 182

8.2 How can we attain better user feedback? . . . 182

8.3 How can the academic research community gain access to larger scale, real world benchmark data sets? . . . 183


List of Tables

2.1 Comparison results for methods on trec05p-1 data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parenthesis . . . 46

2.2 Comparison results for methods on trec06p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parenthesis . . . 47

2.3 Comparison results for methods on trec07p data set in the idealized online scenario. Results are reported as (1-ROCA)%, with 0.95 confidence intervals in parenthesis . . . 48

3.1 Results for Email Spam filtering with Online SVM on benchmark data sets. Score reported is (1-ROCA)%, where 0 is optimal. These results are directly comparable to those on the same data sets with other filters, reported in Chapter 2. . . 58

3.2 Execution time for Online SVMs with email spam detection, in CPU seconds. These times do not include the time spent mapping strings to feature vectors. The number of examples in each data set is given in the last row as corpus size. . . 60

3.3 Email Spam Benchmark Data. These results compare Online SVM and ROSVM on email spam detection, using the binary 4-mer feature space. Score reported is (1-ROCA)%, where 0 is optimal. . . 72

3.4 Results for ROSVMs and comparison methods at the TREC 2007 Spam Filtering track. Score reported is (1-ROCA)%, where 0 is optimal, with .95 confidence intervals in parenthesis. . . 74

5.1 Results for Email Spam filtering. We report F1 score, Recall, Precision, number of False Spams (lost ham) and number of False Hams (spam in inbox) with one-sided feedback. We report results with full feedback for comparison. . . 129

6.1 Results for prior methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. . . 139

6.2 Results for prior methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result for a given noise level, or confidence interval overlapping with confidence interval of best result. Methods unable to complete a given task are marked with dnf. . . 140

6.3 Results for modified methods on trec06p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result. . . 148

6.4 Results for modified methods on trec07p data set with uniform synthetic noise. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. Bold numbers indicate best result, or confidence interval overlapping with confidence interval of best result. . . 149

6.5 Results for natural and synthetic noise at identical noise levels. Natural label noise for trec05p-1 was uniformly sampled from human labelings collected by the spamorham.org project. Results are reported as (1-ROCA)%, with 0.95 confidence intervals. . . 152

7.1 Summary statistics for the msgboard1 corpus of blog comments, broken out by topic. . . 160

7.2 Selected words with high information gain, for flagged and non-flagged comments. Obscenities and references to specific religious figures have been removed from the flagged list for display, and stop words have been removed from the non-flagged list. . . 164

7.3 ROCA results of topic-specific versus global filtering. Generative methods benefit from topic-specific filtering, while discriminative methods are not significantly harmed by global filtering. . . 171

7.4 Summary statistics for the gold standard evaluation set. Adjudication and correction rates vary widely by topic. The news topic, in particular, required extensive adjudication of religious and racial comments. . . 175

7.5 Results for F1 Measure, Gold Standard Evaluation. F1 Measure is computed using precision and recall, where an abusive comment is considered a positive example. For all filters, the F1 measure was computed at the precision-recall break-even point. . . 176

List of Figures

1.1 Idealized online filtering scenario. . . 3

2.1 Obscured Text. These are the 25 most common variants of the word 'Viagra' found in the 2005 TREC spam data set, illustrating the problem of word obfuscation. . . 19

2.2 Pseudo-code for Classical Perceptron update rule and classification function. . . 36

2.3 Pseudo-code for Perceptron with Margins update rule; note that the classification function is the same as for classical Perceptron. . . 38

2.4 Logistic Regression employs a logistic loss function for positive and negative examples, which punishes mistakes made with high confidence more heavily than mistakes made with low confidence. . . 39

3.1 Visualizing SVM Classification. An SVM learns a hyperplane that separates the positive and negative data examples with the maximum possible margin. Error terms ξi > 0 are given for examples on the wrong side of their respective margin. . . 53

3.2 Pseudo-code for Online SVM. . . 55

3.3 Tuning the Regularization Parameter C. Tests were conducted with Online SMO, using binary feature vectors, on the spamassassin data set of 6034 examples. Graph plots C versus Area under the ROC curve. . . 57

3.4 Visualizing the effect of C. Hyperplane A maximizes the margin while accepting a small amount of training error. This corresponds to setting C to a low value. Hyperplane B accepts a smaller margin in order to reduce training error. This corresponds to setting C to a high value. Content-based spam filtering appears to do best with high values of C. . . 61

3.5 Pseudo-code for Relaxed Online SVM. . . 62

3.6 Reduced Size Tests. . . 67

3.7 Reduced Iterations Tests. . . 69

3.8 Reduced Updates Tests. . . 70

4.1 Online Active Learning. . . 83

4.2 Label Efficient b-Sampling Probabilities . . . 85

4.3 Logistic Margin Sampling Probabilities . . . 87

4.4 Online Active Learning using Perceptron with Margins, on trec05p-1 data. . . 92

4.5 Online Active Learning using Logistic Regression, on trec05p-1 data. . . 93

4.6 Online Active Learning using ROSVM, on trec05p-1 data. . . 94

4.7 Online Active Learning using Perceptron with Margins, on trec06p data. . . 95

4.8 Online Active Learning using Logistic Regression, on trec06p data. . . 96

4.9 Online Active Learning using ROSVM, on trec06p data. . . 97

4.10 Online Active Learning using Perceptron with Margins, on trec07p data. . . 98

4.11 Online Active Learning using Logistic Regression, on trec07p data. . . 99

4.12 Comparing Pool-based and Online Active Learning on trec06p . . . 102

4.13 Perceptron with Margins, sampling rate over time, trec05p-1 . . . 103

4.14 Screen shot of proposed user interface for active requests for label feedback. In this framework, the user would be encouraged to label a small number of informative messages. . . 106

5.1 Spam Filtering with One-Sided Feedback. . . 109

5.2 One Sided Feedback Breaks Perceptron. Here, white dots are ham examples, the black dots are spam, the dashed line is the prediction hyperplane, and the shaded area predicts spam. Examples 1, 2, and 3 each cause no updates: 1 and 3 are correct, and no feedback is given on 2. Examples 4 and 30 are the only examples causing updates, ratcheting the hyperplane until no hams are correctly identified. . . 111

5.3 Pseudo-code for Label Efficient active learner. . . 114

5.4 Margin-Based Pushes and Pulls. Examples 1, 2, and 3 cause no updates, as before. But Examples 4 and 25, each correctly classified but within the margins, push the hyperplane towards the spam. Example 15, a misclassified spam, pulls the hyperplane towards the ham. . . 118

5.5 Implicit Uncertainty Sampling for Perceptron with Margins. The margin-based learner with hypothesis h and margins m+ and m−, learning from one-sided feedback, reduces to an active learner with hypothesis h′ and margins m+ and h, using uncertainty sampling in the region between h and h′. . . 119

5.6 Exploration and Exploitation. If the initial hypothesis is h, then examples 1 and 2 cause margin updates pushing h̃ out towards m−, but not beyond it unless an example is found to lie between h and h̃. . . 121

5.7 Pathological Distributions for One-Sided Feedback. . . 123

6.1 Results for varying learning rate η for Logistic Regression, on spamassassin tuning data with varying levels of synthetic uniform label noise. For clarity, the order of results is consistent between legend and figure. . . 142

6.2 Results for varying C in ROSVM for regularization, on spamassassin tuning data with varying levels of synthetically added label noise. For clarity, the order of results is consistent between legend and figure. . . 144

6.3 Pseudo-code for Online Label Cleaning. . . 146

6.4 Pseudo-code for Online Label Correcting. . . 147

7.1 Flag rates over time for the most popular blog. The spikes indicate periods of high amounts of flagging, often caused by abusive flame wars among users. Graphs for other blogs show similar patterns. . . 161

7.2 ROCA results for User-Flag evaluation (top) and Gold Standard Evaluation (bottom). Legend is the same for both graphs. . . 168

7.3 ROC Curves using User Flag Evaluation (top) and Gold Standard Evaluation (bottom), for all-test. . . 170

7.4 Screen shot of the blog comment rating tool used by adjudicators. . . 174


Chapter 1

Introduction

The task of email spam filtering – automatically removing unwanted, harmful, or offensive email messages before they are delivered to a user – is an important, large scale application area for machine learning methods [37]. However, to date there has been a rift between academic researchers and industrial practitioners in spam filtering research. Academic evaluations, such as those conducted at the TREC spam filtering tracks [24, 16, 18], have reported that near-perfect filtering results may be obtained with a variety of machine learning methods [15]. Yet these methods and performance levels are not reflected in real-world practice.

This dissertation suggests that the rift between the academic and industrial spam filtering communities is rooted in an overly optimistic evaluation scenario used by academic researchers, which we refer to as the idealized online filtering scenario [15]. This idealized scenario assumes that a machine learning-based filter will be given perfectly accurate label feedback for every message that the filter encounters. In practice, this label feedback is provided only sporadically by users, and is far from perfectly accurate. Thus, practitioners have tended to discount the performance claims made by academics as being difficult or impossible to replicate in real-world settings.

This dissertation shows that high levels of spam filtering performance are achievable in settings that are more realistic than the idealized scenario. These settings include online scenarios in which users are willing to label only a subset of messages, in which users are willing to label predicted ham but not predicted spam, in which users give erroneous feedback, or in which diverse users disagree on what is spam and what is ham.

In the remainder of this introductory chapter, we review the idealized online filtering scenario in detail, and discuss machine learning methods that have performed well in this setting. We detail the modified filtering scenarios that form the bulk of this dissertation. We discuss the scope and scale of the spam filtering problem, and define the term "spam" for the purposes of this dissertation. Finally, we explain why these problems are of general interest to the machine learning community, as well as to the spam filtering community.

1.1 Idealized Online Filtering

Before exploring modified filtering scenarios, it is first necessary to understand the idealized online filtering scenario that has been traditionally used to evaluate learning-based filters [24, 16, 18, 15].

The idealized filtering scenario is depicted in Figure 1.1. Messages are assumed to arrive in streaming fashion, one message at a time. For each message, the learning-based filter makes a prediction of spam or ham. A human user then observes the message and the predicted label, and delivers feedback to the learning-based filter, providing it with the true label of that message. The filter is then able to use this label feedback to update its internal model, ideally improving future predictive performance.

Figure 1.1: Idealized online filtering scenario. (Diagram: a message stream flows into the learning-based spam filter, which predicts spam or ham; user feedback on the true label flows back to the filter.)

This idealized scenario is useful for three reasons. First, it is a reasonable first approximation of the setting in which spam filters are deployed in real-world settings, where messages do, indeed, arrive in streaming fashion and users do provide label feedback. Second, it is a natural adaptation of the online learning scenario that has been well studied in machine learning [65]. Thus, a range of existing machine learning algorithms may be applied to this task. Third, this scenario makes clear that the stream of messages may be unbounded. This emphasizes the need for solutions allowing updates that are efficient in both computation and memory requirements, and that may adjust to changing patterns in the data over time.

1.2 Online Learners for the Idealized Scenario

The idealized online filtering scenario has enabled the development and empirical evaluation of a wide range of machine learning methods for spam filtering. These methods are briefly discussed here, and are reviewed in detail in Chapters 2 and 3. The use of machine learning methods for online filtering dates back at least as far as 2002, when Paul Graham proposed using a variant of the Naive Bayes classifier for filtering email spam [39, 40]. Since then, a number of machine learning methods have been proposed for filtering, including several additional variants of the Naive Bayes classifier [63], as well as random decision-tree forests [72], support vector machines (SVMs) [34, 51, 72, 20, 84], logistic regression [38, 19], compression-based methods [6] and ensemble methods [62].

Our contributions in this regard have included the application of the Perceptron Algorithm with Margins [52, 48, 86], and the development of a fast, online SVM variant called Relaxed Online SVM (ROSVM) for spam filtering [84]. ROSVM has given top-level performance on several filtering tasks at TREC 2007 [18].

Despite this variety of possible methods, a recent survey of open-source spam filtering methods revealed that only Naive Bayes variants are commonly used in real-world filters [15]. Naive Bayes variants belong to the generative family of classifiers, which model the underlying processes generating the two classes of data (spam and ham) as an intermediate step [66]. The other main family of classifiers is discriminative; these methods seek to find boundaries between classes in the data space, without modeling the functions that produce these classes [66]. It has been found that the generative family performs better than the discriminative family when training data is scarce. However, when training data is plentiful, discriminative methods achieve lower asymptotic error [68]. For spam filtering systems collecting large amounts of training data, it seems reasonable to conjecture that discriminative methods would be superior in this setting. Indeed, in recent experiments, logistic regression and ROSVMs – both discriminative methods – have outperformed all known Bayesian competitors [38, 19, 84].

Of equal importance to the particular learning method is the feature representation chosen to represent this semi-structured email data. The majority of filters in the literature have relied on some form of word-based features [93]. However, spammers have developed attacks specifically designed to defeat word-based models, including tokenization and obfuscation attacks [93] that produce the now-familiar character-level modifications often seen in email spam messages. An exception to this is the family of compression-based filtering methods [6], which essentially rely on short character substrings [82]. We extend work in this area by proposing and testing the use of a variety of feature representations for spam data. Some of these feature mappings were originally developed in the field of computational biology, where character-level mutations are common [86]. Top-level performers from the most recent TREC evaluation used the binary 4-mer feature space we proposed for spam filtering [19, 85].

1.3 Contributions: Beyond Idealized Online Filtering

The user feedback in the idealized scenario, depicted by the dashed gray line in Figure 1.1, may be far from perfect. In this dissertation, we examine several ways that this assumption of perfect feedback may be modified to better reflect the needs and behaviors of real human users. First, in the idealized scenario, human users are required to perform significant amounts of hand labeling. This would ideally be reduced to require only a small fraction of examples to be labeled [79]. Second, in many settings users never give feedback for any messages predicted to be spam [80]. Third, users may give mistaken or even maliciously inaccurate feedback [83]. Fourth, when many users view the same message, there may be significant disagreement about its "true" label [81]. Each of these observations motivates a contribution of this dissertation, as detailed in the remainder of this section.

1.3.1 Online Filtering with Reduced Human Effort

Examining the idealized online filtering scenario, one obvious problem is the assumption that humans will give feedback for every message in the message stream [15]. There are several obvious flaws with this assumption, most notably the fact that requiring this effort from users reduces much of the benefit the filter is meant to provide. Gordon Cormack, among others, has proposed a scheme that allows users to report only errors made by filters [17], but even this requires users to scan every message. Industry experts have disclosed that real users label only a fraction of the total messages presented to them.

The active learning paradigm is a machine learning approach to reducing the labeling effort required by humans [14]. Such methods can reduce labeling effort dramatically, without significant reduction in classification performance. Although the idea of using active learning is a natural fit for reducing human labeling effort in the spam filtering domain, prior applications of active learning in this setting have been computationally expensive and have harmed classification performance [16]. This is because these methods have used a pool-based approach, in which the active learner selects a number of examples from a large pool of unlabeled examples, iterating through many rounds. Cost is incurred as each example in the pool is considered many times. Additionally, many methods for pool-based active learning are prone to selecting redundant examples in such a setting, reducing the benefit of the human labeling effort.

In this dissertation, we propose and evaluate the use of online active learning methods for spam filtering [79]. Online active learners do not rely on a pool of unlabeled examples, but consider each unlabeled message as it arrives. For each message, the learner not only predicts ham or spam, but also determines whether or not to request a label from a human. We test several prior online active learning methods, showing that the required human labeling effort may be significantly reduced with little reduction in classification performance and negligible additional computational cost. Furthermore, our simple and novel fixed-margin sampling method gave best results across a majority of data sets and base learning methods. Our proposal of online active learning as the natural form of active learning for spam filtering was adopted by the TREC 2007 spam filtering track [18].

1.3.2 Online Filtering with One-Sided Feedback

A second possible problem with the idealized online filtering scenario is the assumption that users will give feedback on both messages predicted to be ham and those predicted to be spam. There are a number of real-world settings in which users will never give feedback for predicted spam messages. For example, some systems remove predicted spam before it is shown to the user. In other cases, non-expert users may not know how to view predicted spam. Other users may simply choose never to view predicted spam messages. In all of these cases, feedback will only be given for predicted ham examples.

This scenario, which we refer to as the one-sided feedback scenario, was first examined by Helmbold et al., who called it the apple tasting problem, wherein labels were only provided for predicted positive examples [43]. They showed that one-sided feedback would break several online-learning algorithms such as Perceptron and Winnow, and gave a solution to this problem that involved sampling from the predicted negatives in a uniformly random manner at a rate determined by the past performance of the model.

We confirm that in the spam filtering domain, mistake-driven learners such as Perceptron are indeed broken by one-sided feedback. We apply the apple-tasting solution, and propose additional variants of online active learning to deal with this problem. However, we find the surprising result that margin-based learners such as the Perceptron Algorithm with Margins and ROSVMs are able to filter effectively in this setting without modification [80], as they implicitly perform a variant of fixed-margin uncertainty sampling on predicted spam messages.

1.3.3 Online Filtering with Noisy Feedback

A further issue with the idealized online filtering scenario is the assumption that user feedback will be accurate. In reality, users may give feedback that is mistaken, or even maliciously inaccurate [83]. Thus, the feedback may contain class label noise, wherein training data is incorrectly labeled. Indeed, John Graham-Cumming's spamorham.org project found that human labeling error rates approached 10% [41], and industry experts have cited a 3% error rate from users [96]. The errors included in these figures are objective errors, on the order of a lottery scam email being reported as ham. Clearly, real-world spam filters must be robust to class label noise, but this is not considered in the idealized scenario.

There are several machine learning methods for dealing with class label noise, including various forms of regularization [78], as well as methods for cleaning [7, 71] or correcting training examples suspected to be mis-labeled [98]. However, previous filtering methods, such as top performers from TREC spam filtering competitions, do not employ any such measures. Indeed, the current "folk wisdom" in the spam filtering community is that methods such as regularization only hurt filtering performance [17].

We show that even low levels of uniform class label noise harm or even break top-performing filtering methods from TREC. We explore several methods of ameliorating the effects of class label noise, finding that uniform label noise can be successfully handled with a variety of different methods. However, non-uniform label noise, as is produced by real users, remains a more difficult challenge.

1.3.4 Online Filtering with Feedback from Diverse Users

We extend our investigation of noisy feedback by considering cases where label feedback is inconsistent. The idealized online filtering scenario implicitly assumes that the user feedback is consistent – that is, that there is a single, objectively true label applied to each message. However, if many different users view the same message, it is possible that they will have differing perceptions of the "true" label for that message. This scenario is reflected in the domain of blog comment filtering. In a blog, or internet journal, the blogger periodically posts entries. The readers of that blog then may post comments about the entry, which are added to the blog as a form of community discussion. Such postings may contain abuse, such as obscenities, personal attacks, or remarks degrading a particular race, religion, or nationality. However, because each comment may be read by many different users, there may be different viewpoints as to which comments are abusive and which are not. Indeed, our recent study of blog comments found pairwise inter-annotator agreement to be below 75% among the three volunteer human adjudicators.

This environment is an extreme challenge for filters developed for the online filtering scenario. We explore the ability of such filters to perform in this challenging environment of filtering blog comment abuse, finding that regularization and the use of class-dependent misclassification costs both give improvements for filtering methods in this domain [81]. Additionally, we suggest methods for improving the use of feedback from diverse users [81].

1.4 Defining Spam

As the primary focus of this dissertation is on detecting and filtering content-based email spam, it is worth taking the time to define this term. Although the concept of spam as unwanted mass-email has entered the general lexicon [1], the term remains slightly ambiguous owing to several different conflicting usages.

1.4.1 Conflicting Definitions

In 1998, Cranor and LaMacchia defined spam as “unsolicited bulk email” [26]. This is a strict, unambiguous definition; however, in practice it is overly narrow. There may be messages that a user has “solicited” by neglecting to un-check a box when using a web form for purchase, for example. More importantly, unwanted, offensive, or harmful messages that are not sent in “bulk” still negatively impact the user. Such messages should ideally be filtered regardless of whether or not they are sent as part of a larger campaign.

A broader definition harkens back to Peter Denning's 1982 ACM President's letter, titled "Electronic Junk" [32]. Defining spam as electronic junk, or perhaps more specifically as unwanted or harmful electronic messages [84], encompasses a wide range of user needs. However, these terms themselves are subjective, and different users may have different perceptions of what is junk, unwanted, or harmful. Thus, this second definition gains increased coverage, but loses the ability to be objectively applied.

A definition of spam which may be objectively applied is anything marked as spam by a user. However, this form of definition by example has limited ability to generalize – it essentially requires that every possible message be labeled by a given user.

For the purposes of this dissertation, we rely on the definition of spam used by the experts who provided the gold-standard judgments of spam and ham in the TREC spam filtering benchmark data sets, using a bootstrapping process [23]. These experts used the following definition [23] of spam:

    Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.

Thus, these gold-standard labels attempt to provide an objective, consistent labeling as far as possible. In Chapters 6 and 7, we examine the impact of using actual user feedback labels for training, rather than gold-standard labels.

1.4.2 Scope and Scale

In practical settings, spam filtering is a large scale problem: recent estimates show that as much as 80% of all email traffic is spam, amounting to billions of messages per day worldwide [37]. This level of spam email creates cost in network capacity and storage, to say nothing of the human cost involved in sorting through unfiltered spam.


The primary form considered in this dissertation is content-based email spam, wherein the goal of the spammer is to deliver a spam message that will be understood by a human user, for possible commercial or social gain. Other forms of content-based spam include blog comment spam, which involves spam comments posted to message boards on electronic journals (blogs) [81, 64], SMS spam sent by text-messaging services on cellphones [21], and IM spam sent by instant-messaging services. Because of the similarity of these domains, it is reasonable to conjecture that advances in email spam filtering may be applied in these other domains as well. This dissertation also includes a chapter on filtering blog comment spam, highlighting both similarities and differences of this domain.

1.5 Machine Learning Problems

We believe that developing effective, efficient spam filtering will provide unambiguous social benefit. Yet aside from issues of serving society, we are also motivated in this work by the important machine-learning challenges inherent in this task, which make results from spam filtering applicable to other domains ranging from general text classification to computational biology to optimizing online advertising systems. This domain typically involves semi-structured data. For example, an email message includes not only textual data, but also header information with routing and meta-information, and may additionally include images, attachments, and links. Furthermore, the textual data may contain obfuscations that spammers employ in an attempt to avoid detection [93]. Typical feature representations for this domain result in high dimensionality, in which there may be many relevant features. In real-world settings, filtering is a large-scale task that may involve billions of messages per day, placing a premium on efficient, scalable solutions that update in real-time, or near real-time. Finally, such systems rely on non-expert humans to provide label feedback, and must be robust to a variety of imperfections in the training data. Thus, advances in this application area not only have practical benefit in the domain of spam filtering, but may also be of use in other large-scale, time-sensitive domains involving semi-structured, high-dimensional, noisy data.

1.6 Overview of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 reviews prior feature mappings and machine learning algorithms used for online spam filtering, including variants of Naive Bayes, compression methods, Perceptron variants, Logistic Regression, and ensemble methods, and provides an empirical comparison of these approaches. Chapter 3 introduces our novel ROSVM algorithm, an efficient SVM variant for streaming data suited to the online filtering task. Chapter 4 explores the use of online active learning for spam filtering. The problem of one-sided feedback is presented in Chapter 5, and issues of noisy feedback are discussed in Chapter 6. Chapter 7 shows the ability of online learning-based filters to perform on the blog comment abuse filtering task, which involves feedback from diverse users. Our conclusions and plans for future work appear in the final chapter.


Chapter 2

Online Filtering Methods

A wide range of machine learning methods have been applied to the online filtering scenario. In this chapter, we review the most successful of these methods, including variants of the Naive Bayes classifier, compression-based methods, Perceptron variants, Logistic Regression, and ensemble methods. We also describe several different feature mappings that have been used to transform semi-structured email data, containing text, header information, and possible attachments, into feature vectors usable by machine learning methods. Thus, this chapter serves as a review of background and related work.

This chapter is divided into four sections. The first reviews basic notation used throughout this dissertation. The second reviews methods for feature mapping, the pre-processing step necessary to convert semi-structured email data into numerical feature vectors used by most machine learning methods. The third section reviews online machine learning algorithms for online filtering, and the final section of this chapter includes an experimental comparison of these various methodologies. This chapter does not review Support Vector Machines or variants; these are presented in the next chapter of this dissertation, with experimental comparisons to the methods detailed in this chapter.

2.1 Notation

Before describing online filtering methods, it is first necessary to outline the notation used in this dissertation.

In the online filtering scenario, the filter is shown one message (or example) at a time, in a time-ordered sequence where i is the current time step. Each message is represented by a feature vector xi ∈ X, where X ⊆ Rn, with an associated label yi ∈ Y, where Y ⊆ {−1, +1} for ham and spam messages, respectively. For now, we will assume that this label is ground truth, and represents the "true" classification of the message. In Chapters 6 and 7, we will explore the case where the label may be noisy. In these later chapters, the example's "true" label is y′i, and it is not necessarily the case that yi = y′i.

For each example, the filter is asked to predict ham or spam using a function f(xi), with the label hidden. This function may make use of a weight vector w ∈ Rn, which maintains a set of weights, where each weight is associated with a particular dimension in the feature space.

Once the prediction is made, the example's label yi is revealed to the filter, which may then update its prediction model f(·) as needed using (xi, yi). This update is often performed by changing the weights stored in w.
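As a concrete illustration, the following sketch shows this predict-then-update protocol in Python using the notation above. It is a minimal outline only: the feature_map, predict, and update functions are placeholders standing in for the feature mappings and learning algorithms discussed in the rest of this chapter, not any specific filter studied in this dissertation.

    # Sketch of the online filtering protocol, using the notation of this section.
    # `feature_map`, `predict`, and `update` are placeholder callables.
    def online_filtering(message_stream, feature_map, predict, update):
        for message, y in message_stream:   # y_i in {-1, +1}: -1 = ham, +1 = spam
            x = feature_map(message)        # feature vector x_i
            y_hat = predict(x)              # prediction f(x_i), made with y_i hidden
            yield y_hat
            update(x, y)                    # label y_i revealed; model updated using (x_i, y_i)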

The sparsity of an example vector xi is given by s, and represents the number of non-zero values in the vector.¹ An individual feature in X is referred to by Xj. The value of a particular feature Xj in a specific feature vector xi is given by xij.

¹ Note that this usage of the term "sparsity" is slightly different than the common English usage of this word.

2.2 Feature Mappings

As noted in the previous section, each message is represented by a feature vector x ∈ Rn. That is, a set of numerical features is extracted from the message, and these numerical scores are stored in a vector. This process is called feature mapping, and allows generic machine learning methods to operate on semi-structured email data.² The choice of feature mapping is an important one. Experience in machine learning has shown that choosing an appropriate feature mapping can have more impact on the quality of results than the choice of learning method. In this section, we review several possible feature mappings and discuss the strengths of each.

² Those familiar with kernel methods will note that string kernels exist that can allow kernel-based learning methods to operate directly on string-based data, without an intermediate feature mapping step [78]. However, the increased classification cost of such methods makes them impractical for this large-scale filtering setting. Indeed, recent work in computational biology has shown that explicit feature mapping for string-based genomic data reduces computational cost in comparison to the use of string kernels in practical settings [89]. In this dissertation, we consider only explicit feature mappings in order to maintain scalability.

2.2.1 Hand-crafted features

One possibility is to hand-craft specific features that are uniquely suited to the spam filtering task. Several systems employ this approach, including one major industrial spam filtering system as well as the open-source filter SpamAssassin [2]. This approach requires experts to identify features within messages that may help to distinguish spam from ham. For example, the hand-crafted features used by SpamAssassin version 3.2 include the following [2]:

• Subject contains ‘‘As Seen’’

• Subject starts with dollar amount

• Offers an alert about a stock

• Money back guarantee

• Message talks about a replica watch

• Uses a numeric IP address in URL

• Phrase: extra inches

• Phrase: L0an

In SpamAssassin version 3.2, there are 748 of these hand-crafted features, which is a relatively low number of features. The human effort in hand-crafting these domain-specific features results in a focused feature space in which there are few irrelevant features, ensuring that computation and storage costs are kept to a minimum.
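To make the flavor of this approach concrete, the sketch below computes a handful of hand-crafted binary features in the style of the SpamAssassin examples above. The rule names and regular expressions are illustrative assumptions of ours, not the actual SpamAssassin rule definitions.

    import re

    # Illustrative hand-crafted binary features, loosely modeled on the
    # SpamAssassin-style rules listed above. The patterns are simplified
    # stand-ins, not the real rule implementations.
    HAND_CRAFTED_RULES = {
        "subject_contains_as_seen": lambda subject, body: "as seen" in subject.lower(),
        "subject_starts_with_dollar": lambda subject, body: bool(re.match(r"\$\d", subject)),
        "numeric_ip_in_url": lambda subject, body: bool(
            re.search(r"https?://\d{1,3}(\.\d{1,3}){3}", body)),
        "phrase_extra_inches": lambda subject, body: "extra inches" in body.lower(),
    }

    def hand_crafted_features(subject, body):
        # Returns a binary feature vector as a dict {rule_name: 0 or 1}.
        return {name: int(rule(subject, body)) for name, rule in HAND_CRAFTED_RULES.items()}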

However, this approach has significant drawbacks. First, the human effort required in crafting appropriate features is significant, and it is often difficult for humans to guess which characteristics of a message are most informative [39]. Second, such features are not robust to changing circumstances, and may be easily attacked by intelligent spammers [93]. Third, such features are language-specific, requiring the entire feature space to be reformulated for each new language.

2.2.2 Word Based Features

An alternative to hand-crafting a small set of focused features is to employ a wider set of generic features, with the hope that this will capture more patterns in the data with less human effort.

One simple feature space is the word-based feature space. This is constructed as follows. Define a "word" as a contiguous substring of non-whitespace characters [38]. If we consider only words of a maximum finite length, then there are n possible words and we can construct a feature space Rn with n dimensions, each indexed by a unique word. To map a message M to a feature vector x ∈ Rn, assign a score for each dimension in the vector based on the number of times that the indexing word appears in M.

There are several possible scoring schemes. The count-based method assigns the raw number of occurrences of a given word in the message as its score. The TF-IDF scoring method [75] has been used in information retrieval to weight rare (and presumably more informative) words more heavily. However, several tests (including our own) have found that the simple binary scoring method is most effective for spam filtering tasks [34, 63]. In this system, a score of 1 indicates that a word occurs in the message, and a score of 0 indicates that it does not.

Note that although the feature space may be as large as the total number of possible words, typical feature vectors will be sparse, containing a relatively small number of non-zero values. Thus, sparse vector data structures allow for efficient storage of these vectors. These may be implemented as linked lists or hash tables containing index-value pairs. In the binary case, it is particularly efficient to store sparse vectors as arrays containing non-zero index values, which may be sorted for efficient computation of inner products.
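As a sketch, the binary word-based mapping with sparse storage might look like the following. A growing word-to-index vocabulary is assumed here for concreteness; the text above simply treats each unique word as its own dimension.

    def word_binary_features(message, vocabulary):
        # Maps a message to a sparse binary feature vector, stored as a sorted
        # array of non-zero feature indices (the binary scoring scheme above).
        # `vocabulary` is a dict {word: index}; unseen words get new indices.
        indices = set()
        for word in message.split():   # "word" = contiguous non-whitespace substring
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
            indices.add(vocabulary[word])
        return sorted(indices)         # sorted, for efficient inner products

    # Usage: vocab = {}; x = word_binary_features("buy v1agra now now", vocab)
    # yields indices for the three distinct words, each scored 1 by presence.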

To our knowledge, the word-based feature space was first employed by Graham [39], using a Naive Bayes variant. This work highlighted the benefit of a wide-reaching feature space: his filter found the "word" ff0000, which is the HTML code for the color red, to be one of the most informative indicators of spaminess [39]. Such features are quickly identified by machine learning methods, but are far from obvious to humans attempting to hand-craft features. As an implicit admission of the limited utility of hand-crafted features, the SpamAssassin team has recognized these problems, and includes binary features that encode the output of a Naive Bayes classifier using a word-based feature space [2].

Viagra VIAGRA Viiagrra viagra visagra Vi@gra Viaagrra Viaggra Viagraa Viiaagra Via-ggra Viia-gra V1AAGRRA Viiagra Via-gra Vi graa V iagra via gra Viagrra V&Igra VIAgra V|agra Viaaggra vaigra V'iagra

Figure 2.1: Obscured Text. These are the 25 most common variants of the word 'Viagra' found in the 2005 TREC spam data set, illustrating the problem of word obfuscation.

Although word-based features are an improvement over hand-crafted features, they are still subject to attack. Spammers routinely attempt to defeat word-based filters using techniques such as intentional misspellings, character substitutions, and insertions of white-space, all of which can pose problems for word-based filters [93].

2.2.3 k-mer Features

As noted above, word-based features are subject to attack by spammers using word-obfuscation methods [93], which include intentional misspellings, character substitutions, and insertions of white-space. For example, Figure 2.1 shows the 25 most common obfuscations of the word viagra in the TREC 2005 public corpus of email spam, trec05p-1 [86]. There were hundreds of other obfuscations for this word alone. While a case could be made that a word-based filter would eventually encounter all possible obfuscations of a given word, the combinatorics of obfuscation render this possibility impractical. Instead, we suggest that feature mappings that allow for inexact string matching provide a practical alternative [86]. One such feature mapping employs a feature space using overlapping k-mers, which are contiguous substrings of k symbols [60, 53]. For example, the 4-mers of the string ababbacb are:

abab babb abba bbac bacb

This feature space was originally designed for use on genomic data, in which genomic sequences are encoded as strings [60, 53]. As with spam data, character-level substitutions, insertions, and deletions are common in this setting, and the use of overlapping k-mers provides a measure of robustness to these variations. As with the word-based features, several scoring methods are possible, including count-based scores and TF-IDF scoring. However, our tests have found that binary scoring is most effective [84].

This feature space requires a unique dimension for each possible unique k-mer. Thus, the dimensionality of this space is |Σ|^k, where |Σ| is the size of the alphabet of available symbols, and the value of each dimension in the space corresponds to the score associated with a particular k-mer. In email and spam classification tasks, which may include attachments, the available alphabet of symbols is quite large, consisting of all 256 possible single-byte characters. However, sparse data structures may also be employed here, in similar fashion to those suggested for the word-based feature space discussed above.

The first use of k-mers in spam detection was by Hershkop and Stolfo, who tested a spam filter using the cosine similarity measure between k-mer vectors in conjunction with a centroid-based variant of the Nearest Neighbor classifier [44]. Also, note that k-mers are sometimes referred to as character-level n-grams. We choose to use the term k-mers to avoid confusion with word-level n-grams, which are commonly employed and discussed in the information retrieval literature.

For those familiar with kernel methods, note that although k-mers may be employed in conjunction with string kernels [53, 54], we follow the recommendation of Sonnenburg [89] and represent k-mer features in explicit sparse feature vectors. Our tests have shown that a setting of k = 4 is often optimal for spam classification; thus, the resulting feature space is still possible to store explicitly, and doing so increases computational efficiency for classification.
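A sketch of the binary k-mer mapping is given below, assuming the message has already been read as a single character string. With k = 4 it reproduces the example above, the 4-mers of ababbacb.

    def kmer_binary_features(text, k=4):
        # Maps a string to the set of its overlapping k-mers (contiguous
        # substrings of k symbols), i.e. a sparse binary feature vector
        # indexed by k-mer, with binary (presence/absence) scoring.
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    # Example from the text: the 4-mers of "ababbacb".
    assert kmer_binary_features("ababbacb") == {"abab", "babb", "abba", "bbac", "bacb"}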

2.2.4 Wildcard and Gappy Features

In computational biology, it has been found that k-mers alone are not expressive enough to give optimal classification performance on genomic data. In general, a k-mer feature space may have reduced efficacy on substrings in which at least two character substitutions, insertions, or deletions occur no more than k positions apart. Computational biologists have developed extended forms of inexact string matching features to address this issue. These include wildcard features and gappy features [55], which allow k-mers to match even if there have been a small (specified) number of insertions, deletions, or character substitutions.

We applied variants of these modifications in our TREC 2006 Spam Filtering competition entries [86], in order to test the effectiveness of added flexibility in inexact string matching. Our subsequent tests (a subset of which are included in the results section of this chapter) showed that the simple binary k-mer feature space gave optimal results – the added flexibility of wildcards and gaps did not give added benefit in the spam filtering domain. For completeness, we describe these feature mappings in this section. As with the word-based and k-mer features above, these feature mappings may be performed explicitly using sparse vector data structures for computational efficiency [86].

Wildcards. The (k, w) wildcard mapping maps each k-mer to a set of k-mers in which up to w "wildcard" characters replace the characters in the original k-mer [55]. For example, the (3,1) wildcard mapping of the k-mer abc is the set {abc, *bc, a*c, ab*}. The wildcard character is a special symbol that is allowed to match with any other character. Naturally, allowing wildcards increases computational cost. However, in our testing with spam data, we have found that a fixed wildcard variant gives results equivalently strong to the standard wildcard kernel [86]. A fixed (k, p) wildcard mapping maps a given k-mer to a set of two k-mers: the original, and a k-mer with a wildcard character at position p. Note that the first position in a string is position 0. Thus, the fixed (3,1) wildcard mapping of abc is {abc, a*c}. This fixed mapping gives more flexibility to the k-mer feature space, but only increases computational cost by a constant factor of two.

Gaps. The (g, k) gappy mapping (where g ≤ k) allows g-mers to match with k-mers by inserting k − g gaps into the g-mer [55]. Note that this is equivalent to allowing k-mers to match with g-mers by deleting up to k − g characters from the k-mer. Thus, the (2,3) gappy mapping of the string acbd includes positive values for features indexed by {acb, cbd, ab, cb, cd, bd}. As with the wildcard mappings, we reduce computational cost with a fixed (k, p) gappy variant, in which a k-mer is mapped to a set of k-mers: the original k-mer, and a k-mer in which the character at position p has been removed [86].
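The fixed variants are straightforward to implement. The sketch below (with function names of our own choosing) generates the feature indices produced by the fixed (k, p) wildcard mapping and the fixed (k, p) gappy mapping for a single k-mer.

    def fixed_wildcard(kmer, p):
        # Fixed (k, p) wildcard mapping: the original k-mer, plus a copy with
        # the wildcard character '*' at position p (positions are 0-indexed).
        return {kmer, kmer[:p] + "*" + kmer[p + 1:]}

    def fixed_gappy(kmer, p):
        # Fixed (k, p) gappy mapping: the original k-mer, plus the (k-1)-mer
        # obtained by deleting the character at position p.
        return {kmer, kmer[:p] + kmer[p + 1:]}

    # Example from the text: the fixed (3,1) wildcard mapping of abc.
    assert fixed_wildcard("abc", 1) == {"abc", "a*c"}
    assert fixed_gappy("abc", 1) == {"abc", "ac"}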

2.2.5 Normalization

Email messages are of varying lengths, causing feature vectors mapped from these messages to be of varying magnitudes. This can cause problems for some machine learning methods, especially those that compute an inner product <w, xi>. In such cases, the learner may make predictions with exceptionally low confidence for short messages, even when those messages are clearly spam or ham.

One standard method from information retrieval and machine learning that reduces the impact of message length is to normalize the feature vectors representing the messages. The approach employed in this dissertation is to normalize using the Euclidean norm (also known as the L2 norm) of the feature vectors. That is:

    xi-normalized = xi / √<xi, xi>

2.2.6 Message Truncation

In 2006, Cormack first noted that the use of message truncation [17] led to improved filtering results as well as increased computational efficiency. In message truncation, the email message is truncated after n characters, including any header information and attachments. Typical values of n in message truncation have been 2500 [19] or 3000 [84]. This tends to emphasize information contained in the headers, including routing information, sender and recipient information, and the subject line. Computational cost is reduced in cases where the message is otherwise very long: some emails contain several megabytes of data. Furthermore, truncation provides a measure of implicit normalization, as truncated messages do not vary as greatly in length as un-truncated messages. Finally, truncation provides a measure of resistance against the good word attack, in which objectively "good" words are stuffed into the end of an email spam in an attempt to defeat learning-based spam filters [61].
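A sketch of these two preprocessing steps, message truncation followed by L2 normalization of the resulting sparse vector, is given below. The default truncation length of 3000 follows one of the values reported above; the function names and the dict-based sparse representation are illustrative assumptions.

    import math

    def truncate_message(raw_message, n=3000):
        # Message truncation: keep only the first n characters of the raw
        # message, including header information and any attachments.
        return raw_message[:n]

    def l2_normalize(sparse_vector):
        # Normalizes a sparse vector {feature_index: value} by its Euclidean
        # (L2) norm, i.e. x / sqrt(<x, x>), so that messages of different
        # lengths map to vectors of equal magnitude.
        norm = math.sqrt(sum(v * v for v in sparse_vector.values()))
        if norm == 0.0:
            return dict(sparse_vector)
        return {j: v / norm for j, v in sparse_vector.items()}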

2.2.7 Semi-structured Data

As noted in Chapter 1, email data is best described as semi-structured data. That is, email data contains information in different forms such as header information, message text, and attachments that may include images or other files. Furthermore, the message text may include obfuscations such as intentional misspellings, character substitutions, and the like.

Some attempts have been made to exploit the structure in emails for more effective feature mappings. These have included the hand-crafted rules discussed above, and also more general strategies such as treating words or features drawn from the message header differently than words or features drawn from the message body [38]. Interestingly, the 4-mer feature space with message truncation has out-performed results from all such attempts. One could argue that message truncation implicitly takes some structure into account: it clearly places emphasis on the message header and the early parts of the message body. Yet it is still interesting to note that apparently the most effective method for dealing with this semi-structured information is primarily to ignore the structure.

2.3 Online Machine Learning Algorithms for Online Spam Filtering

The previous section reviewed several feature mapping methods that transform labeled email data into feature vectors for training machine learning methods. In theory, any standard supervised machine learning method could be applied to such data. Yet the practical application of online spam filtering methods has strict requirements that are influenced by the scale of contemporary email usage. This scale renders many supervised machine learning methods impractical.

We identify five requirements that a given machine learning method must satisfy to be appropriate for the spam filtering domain. These are:

• Classification Performance. The first goal of any filter is to filter effectively. We assess classification performance with the (1-ROCA)% measure, our primary evaluation measure in this dissertation; it is reviewed in section 2.4.4, along with a discussion of alternative metrics.

• Fast Prediction. The computational cost of a prediction for a given example x_i should be O(s), where s is the sparsity of x_i.

• Scalable Online Updates. The cost of updating the model should not depend on the amount of data in the training data set. In the online case the size of the data set increases over time, and may be effectively unbounded.

• Fast Adaptation. In the real world, spammers adapt to filters by changing the patterns of their spam attacks [93]. An effective filtering method adapts to new attacks quickly; ideally after only a single example of a new attack.

• Robustness to High Dimensionality. As described in the previous section, several natural feature representations for spam filtering are of high dimensionality. Not all learning methods perform well with high-dimensional data [45].


In the remainder of this section, we review the machine learning methods that satisfy these requirements and are thus suitable for the task of online spam filtering. In doing so, we will note when a given method fails to meet any of the above requirements. Assessments of classification performance are given in the experimental section at the end of this chapter.

2.3.1 Naive Bayes Variants

Variants of the Naive Bayes classifier were among the first machine learning methods to be applied to the spam filtering problem [39, 40, 63]. We first review the general principles of the Naive Bayes classifier, and then describe several variants that have been proposed for spam filtering.

Overview of Naive Bayes In the supervised learning methodology, we assume that there is an unknown target function f : X → Y, mapping example vectors to class labels [66]. This target may be expressed as a probability distribution P(Y|X), in which the probability of a class label depends on a distribution over the space of possible messages [66]. We seek to model this target function using labeled training data.

The Bayes rule is given by:

P(Y|X) = P(X|Y) P(Y) / P(X)

Estimating P(Y) from observation is straightforward and requires only a relatively small sample of training data, but estimating P(X|Y) may require considerably more data. When X is a vector of n boolean features, there are 2(2^n - 1) parameters to estimate [66]. This is impractical in the general case with high-dimensional data.


For the case in which the features in X are conditionally independent given Y, then:

P(X|Y) = ∏_{j=1}^{n} P(X_j | Y)

When both the elements of X and the class values of Y are binary attributes, there are now only 2n parameters to estimate, dramatically reducing the amount of training data needed from the general case above [66].

Note that for numerical reasons, it is infeasible to compute products of many small fractions, as is required here, because these are quickly driven to zero due to round-off errors in finite precision computing. Thus, it is preferred to work with log probabilities [65] for all Naive Bayes variants.
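A small Python illustration of why log probabilities are preferred (the specific values 1e-5 and 400 are arbitrary choices, selected only to force underflow in double precision):

import math

probs = [1e-5] * 400                       # many small conditional probabilities
product = 1.0
for p in probs:
    product *= p                           # underflows to exactly 0.0
log_sum = sum(math.log(p) for p in probs)  # remains finite: about -4605.2
print(product, log_sum)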

This conditional independence assumption forms the basis of the Naive Bayes classifier: we (naively) assume that the elements of X are conditionally independent, and then use training data to estimate values for each P(X_j | Y) [66]. Although the assumption of conditional independence is often objectively untrue, in practice effective classifiers are still often obtained.

The training of a Naive Bayes classifier produces a model of the distribution that generates the data as an intermediate step. For this reason, these methods are said to belong to the generative family of classifiers [66].

There are several variants of Naive Bayes, which differ primarily in their assumptions about the probability distributions P(X_j | Y). Of these, the Multi-Variate Bernoulli Naive Bayes does not meet the criterion of prediction in O(s) time. Furthermore, all of the Naive Bayes variants deal poorly with high-dimensional data, as shown in the experimental section.


Multi-Variate Bernoulli Naive Bayes Perhaps the most commonly applied variant [63] is the Multi-Variate Bernoulli Naive Bayes. This variant assumes that a message is generated in the following way. There are two boxes, one marked spam and one marked ham, with each box containing a distinct biased coin for each unique feature in the feature space X. A message is generated by first flipping a biased coin to determine the class label y_i of the message. Depending on the resulting class label, the appropriate box of biased coins is selected. Each coin in the box is then flipped, one per feature in X. For each coin that comes up heads, the associated feature in the feature vector x_i is scored 1, and is scored 0 otherwise. This reflects the assumption that the features in messages are conditionally independent given the class [63].

The probability of a given coin j coming up heads is P(X_j | y_i). We can estimate this probability from data using a Laplacian prior that we refer to as the Document Prior:

P(X_j | y_i) = (1 + M_{X_j, y_i}) / (2 + M_{y_i})

Here, M_{X_j, y_i} is the number of messages seen with class label y_i for which the feature X_j = 1, and M_{y_i} is the total number of messages seen with class label y_i.

To classify a new message x_i, we can compute the probability that either box of coins would generate the message. For a given class label y_i, this probability is:

P(x_i | y_i) = ∏_{j=1}^{n} P(X_j | y_i)^{x_ij} (1 - P(X_j | y_i))^{(1 - x_ij)}


Given these probabilities and a fixed threshold τ, the filter returns a classification of spam only when:

P(y_i = +1) P(x_i | y_i = +1) / [P(y_i = +1) P(x_i | y_i = +1) + P(y_i = -1) P(x_i | y_i = -1)] > τ

In the online scenario, updating a Naive Bayes model simply requires updating the estimated probability P(X_j | y_i) for each feature.

Although Metsis et al. found that this variant was the most widely applied Naive Bayes variant in deployed systems [63], it is undesirable for two reasons. First, classification using this method requires a computation for each feature in the feature space, and is therefore O(n) rather than O(s). In general, the dimensionality n of the feature space may be many orders of magnitude larger than the sparsity s of a typical example vector, especially when message truncation is used. Second, this variant gave the worst classification performance of all variants tested by Metsis et al. in a comparison of Naive Bayes variants for spam filtering [63].
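The following sketch illustrates the Multi-Variate Bernoulli variant with the Document Prior, working in log space as discussed above. It is illustrative only: the class name, the smoothed estimate of P(y), and the set-of-features message representation are our own choices. Note the O(n) loop over the full vocabulary at prediction time, which is the first drawback identified above.

import math
from collections import defaultdict

class BernoulliNaiveBayes:
    """Sketch of Multi-Variate Bernoulli Naive Bayes with the Document Prior."""

    def __init__(self):
        self.msg_count = {+1: 0, -1: 0}            # M_{y}
        self.feat_count = {+1: defaultdict(int),   # M_{X_j, y}
                           -1: defaultdict(int)}
        self.vocabulary = set()

    def update(self, features, y):
        # Online update: increment the counts behind each P(X_j | y).
        self.msg_count[y] += 1
        for f in features:
            self.feat_count[y][f] += 1
            self.vocabulary.add(f)

    def _log_p_feature(self, f, y, present):
        # Document Prior: P(X_j = 1 | y) = (1 + M_{X_j, y}) / (2 + M_y).
        p = (1.0 + self.feat_count[y][f]) / (2.0 + self.msg_count[y])
        return math.log(p if present else 1.0 - p)

    def spam_score(self, features):
        # Returns a normalized spam score; note the O(n) vocabulary loop.
        total = self.msg_count[+1] + self.msg_count[-1]
        log_like = {}
        for y in (+1, -1):
            ll = math.log((1.0 + self.msg_count[y]) / (2.0 + total))  # smoothed P(y), our choice
            for f in self.vocabulary:
                ll += self._log_p_feature(f, y, present=(f in features))
            log_like[y] = ll
        m = max(log_like.values())
        exp = {y: math.exp(v - m) for y, v in log_like.items()}
        return exp[+1] / (exp[+1] + exp[-1])

# Usage sketch: train on two labeled messages, then classify with threshold tau = 0.5.
nb = BernoulliNaiveBayes()
nb.update({"viagra", "free", "click"}, +1)
nb.update({"meeting", "tomorrow"}, -1)
print(nb.spam_score({"free", "viagra"}) > 0.5)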

Multinomial Naive Bayes with Boolean Attributes (Token Prior) In the multinomial model, we assume that a message is generated in a different way. We still have two boxes, one for spam and one for ham as before. However, now the boxes are filled with tokens of varying sizes, and each token has a feature from X written on it. To generate a message, we first determine the number of tokens to pull by randomly selecting d from a distribution of possible message lengths. (This distribution of message lengths does not depend on the class label of the message [63]. Although this may not be a good assumption in the general case for spam filtering, it is not unreasonable when message truncation is performed.) We then choose a box by flipping a biased coin to determine the class label y_i of our new message. Finally, we blindly pull a total of d tokens from the box, with replacement. For each token pulled, we add the feature listed on the token to the generated message [63].

Each feature is randomly selected from the box with probability P(X_j | y_i), which we can estimate from training data using a Laplacian prior based on counts of tokens, referred to here as the Token Prior:

P(X_j | y_i) = (1 + T_{X_j, y_i}) / (n + T_{y_i})

where n is the total number of possible tokens, T_{X_j, y_i} is the number of times that token X_j occurs in messages of class label y_i, and T_{y_i} is the total number of token occurrences in messages of class label y_i.

Note that in the general case, a token may occur in a message more than once. Yet Metsis et al. found that restricting these counts to binary values in the resulting feature vector x_i improved classification performance. This agrees with work using other methods that has also found binary feature values to be most effective [34, 84]. When the token counts are restricted to binary values, this variant is referred to as Multinomial Naive Bayes with Boolean Attributes [63], and was one of the best performing Naive Bayes variants tested by Metsis et al.

The probability that a message x_i is generated by the box with class label y_i is given by [63]:

P(x_i, y_i) = P(y_i) P(d) d! ∏_{j=1}^{n} P(X_j | y_i)^{x_ij} / x_ij!

Note that when x_ij = 0, then P(X_j | y_i)^{x_ij} = 1 and x_ij! = 1, so these terms may be excluded from the product. Thus, it is only necessary to compute probabilities for those features that actually occur in the message, reducing classification cost from O(n) for the Multivariate Bernoulli Naive Bayes to O(s) with this variant.


Furthermore, the terms P(d), d!, and the x_ij! factors are constant across all possible class labels, and cancel out when computing the relative likelihood. Thus, to classify a new message, select a threshold τ as before, and return spam only when [63]:

P(y = +1) ∏_{j=1}^{n} P(X_j | y = +1)^{x_ij} / [Σ_{y ∈ {-1, +1}} P(y) ∏_{j=1}^{n} P(X_j | y)^{x_ij}] > τ
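A corresponding sketch of Multinomial Naive Bayes with Boolean Attributes (Token Prior); again the class name and the smoothed class prior are illustrative choices, and the vocabulary size observed so far stands in for the total number of possible tokens n. Here prediction touches only the features present in the message, giving the O(s) cost noted above.

import math
from collections import defaultdict

class MultinomialNBBoolean:
    """Sketch of Multinomial Naive Bayes with Boolean Attributes (Token Prior)."""

    def __init__(self):
        self.msg_count = {+1: 0, -1: 0}
        self.token_count = {+1: defaultdict(int),  # T_{X_j, y}, one count per message
                            -1: defaultdict(int)}
        self.total_tokens = {+1: 0, -1: 0}         # T_{y}
        self.vocabulary = set()

    def update(self, features, y):
        # Online update with boolean attributes: each distinct feature counts once.
        self.msg_count[y] += 1
        for f in set(features):
            self.token_count[y][f] += 1
            self.total_tokens[y] += 1
            self.vocabulary.add(f)

    def spam_score(self, features):
        n = max(len(self.vocabulary), 1)  # observed vocabulary size as a proxy for n
        total = self.msg_count[+1] + self.msg_count[-1]
        log_like = {}
        for y in (+1, -1):
            ll = math.log((1.0 + self.msg_count[y]) / (2.0 + total))  # smoothed P(y), our choice
            for f in set(features):
                # Token Prior: P(X_j | y) = (1 + T_{X_j, y}) / (n + T_y).
                p = (1.0 + self.token_count[y][f]) / (n + self.total_tokens[y])
                ll += math.log(p)
            log_like[y] = ll
        m = max(log_like.values())
        exp = {y: math.exp(v - m) for y, v in log_like.items()}
        return exp[+1] / (exp[+1] + exp[-1])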

Multinomial Naive Bayes with Boolean Attributes (Document Prior) Observe that, in practical terms, there are two differences between the Multivariate Bernoulli Naive Bayes and the Multinomial Naive Bayes with Boolean Attributes (Token Prior) discussed above.

The first difference is in the manner in which the token probabilities are estimated. In the Multivariate case, the default assumption is that a given token has probability P = 1/2 of appearing in a message, while in the Multinomial case the default assumption is that a given token has probability roughly d/n of appearing in a given message.

The second difference is that the Multivariate method explicitly includes the probabilities of tokens not occurring in the message when computing the likelihood that a message was generated by a given box of coins, while the Multinomial method does not include this information. This makes the Multinomial method more computationally efficient. The Multinomial method is also more effective in terms of classification performance. Is this difference in performance due to the different probability estimate, or to the exclusion of token-absence information?

We investigate this question by proposing a hybrid variant, which we refer to as Multinomial Naive Bayes with Boolean Attributes (Document Prior). In this case, the per-token conditional probabilities are estimated with the document prior used by the Multivariate method:


P(X_j | y_i) = (1 + M_{X_j, y_i}) / (2 + M_{y_i})

Classification is performed using the Multinomial decision rule:

P(y = +1) ∏_{j=1}^{n} P(X_j | y = +1)^{x_ij} / [Σ_{y ∈ {-1, +1}} P(y) ∏_{j=1}^{n} P(X_j | y)^{x_ij}] > τ
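The hybrid variant changes only the estimator plugged into the multinomial decision rule; a side-by-side sketch of the two estimates (argument names are our own):

def token_prior(t_count_fy, t_count_y, n):
    # Token Prior: P(X_j | y) = (1 + T_{X_j, y}) / (n + T_y)
    return (1.0 + t_count_fy) / (n + t_count_y)

def document_prior(m_count_fy, m_count_y):
    # Document Prior: P(X_j | y) = (1 + M_{X_j, y}) / (2 + M_y)
    return (1.0 + m_count_fy) / (2.0 + m_count_y)

Substituting document_prior for token_prior in the multinomial sketch above yields the hybrid (Document Prior) variant.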

Surprisingly, this variant gave the best results of all Naive Bayes methods tested on noisy blog comment data (see Chapter 7). That is, the traditional Multinomial (Token Prior) method out-performs the Multivariate (Document Prior) method, and the Multinomial (Document Prior) method out-performs the Multinomial (Token Prior) method. Thus, we conclude that both the Document Prior and the exclusion of absence information are helpful in spam filtering.

2.3.2 Compression-Based Methods

The Naive Bayes variants operate best under the condition that features are conditionally independent. When this condition breaks down, classification performance may suffer. For example, the Naive Bayes variants perform poorly with the feature space of overlapping k-mers. In this case, the features are inter-dependent, and treating them as independent tokens causes the assumption of conditional independence to break down.

A second set of methods in the generative family of supervised machine learning methods employs techniques from data compression [6] as a means to deal with this problem.

In this methodology, a message is assumed to be generated roughly as follows. Assume that we are dealing with a Markov process of order k that generates
