Logistic Regression for Spam Filtering


Nikhila Arkalgud February 14, 2008

Abstract

The goal of the spam filtering problem is to identify an email as spam or not spam. One of the classic techniques used in spam filtering is prediction using logistic regression. Words that frequently occur in spam emails are used as the feature set in the regression problem. In this report, we examine some of the different techniques used for minimizing the logistic loss function and provide a performance analysis of the different techniques. Specifically, three different types of minimization techniques were implemented and tested: the regular batch gradient descent algorithm, the regularized gradient descent algorithm, and the stochastic gradient descent algorithm.

1 Introduction and Problem Description

What is spam? One definition is electronic junk mail or junk newsgroup postings. Some people define spam even more generally as any unsolicited e-mail. A spam filter is a software tool used to separate spam emails from genuine emails; that is, the spam filter predicts which class an email belongs to, spam or not spam. This problem has been addressed using several techniques such as SVMs, Naive Bayes, and Logistic Regression.

Logistic regression is a model used for predicting the probability of occurrence of an event. It makes use of several predictor variables (features) that may be either numerical or categorical. Other names for logistic regression used in various other application areas include logistic model, logit model, and maximum-entropy classifier. Logistic regression is one of a class of models known as generalized linear models.

In this report, three different techniques for minimizing the logistic loss function have been studied: normal gradient descent, regularized gradient descent, and stochastic gradient descent. The Logistic Regression algorithm is introduced in section[2], and the minimization techniques are explained in detail in sections[3, 4 and 5]. A detailed experimental analysis is provided in section[6].

2 Logistic Regression

An explanation of logistic regression begins with an explanation of the logistic function (also called the sigmoid function):

$$f(z) = \frac{1}{1 + e^{-z}}$$

The logistic function is useful because it can take as input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable z represents the exposure to some set of risk factors, while f(z) represents the probability of a particular outcome, given that set of risk factors. The variable z is a measure of the total contribution of all the risk factors used in the model and is known as the logit. The variable z is usually defined as

$$z = \sum_{i=1}^{n} w_i x_i$$

where $x_1, \dots, x_n$ are the features and $w_1, \dots, w_n$ are the regression coefficients (weights).
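As a minimal sketch of these two definitions, the Python snippet below evaluates the logit and the logistic function for a single example. The weights and feature values are made up for illustration, and the helper names `logistic` and `logit` are not from the report.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function f(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so exp() cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(w, x):
    """z = sum_i w_i * x_i for weights w and features x."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Made-up weights and word counts for two hypothetical features.
w = [1.5, -2.0]
x = [2.0, 1.0]
z = logit(w, x)      # 1.5*2.0 + (-2.0)*1.0 = 1.0
p = logistic(z)      # f(1.0), roughly 0.731
print(z, p)
```

Whatever the magnitude of z, the returned value stays between 0 and 1, which is what allows it to be read as a probability.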

The Logistic Regression algorithm is as given below:

1. Initialize the weight vector to zero.

2. Train the weights by minimizing the logistic loss:

   while ($\|\text{gradient}\|_1 >$ precision) do

   calculate the new prediction vector $\tilde{y}$ using:

   $$\tilde{Y} = \frac{1}{1 + e^{-W \cdot X}}$$

   calculate the gradient vector using:

   $$\text{gradient} = \frac{1}{T} \sum_t (\tilde{y}_t - y_t)\, x_t$$

   update the weights by stepping against the gradient:

   $$w_{t+1} = w_t - \eta \cdot \text{gradient}$$

3. Calculate the logistic loss on the test set using

$$\text{loss}(y, \tilde{y}) = y \ln\frac{y}{\tilde{y}} + (1 - y) \ln\frac{1 - y}{1 - \tilde{y}} = \begin{cases} -\ln(1 - \tilde{y}) = \ln(1 + e^{w \cdot x}) & \text{if } y = 0 \\ -\ln(\tilde{y}) = \ln(1 + e^{w \cdot x}) - w \cdot x & \text{if } y = 1 \end{cases}$$

which is the negative log likelihood.
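The two cases of the loss can be checked numerically. The sketch below, in Python, computes the loss directly from the activation a = w.x using the form ln(1 + e^a) - y·a, which covers both cases. The function names are illustrative, not from the report, and the rearrangement inside `softplus` is only there to avoid overflow.

```python
import math

def softplus(a):
    """ln(1 + e^a), arranged so exp() is only called on non-positive values."""
    if a > 0:
        return a + math.log1p(math.exp(-a))
    return math.log1p(math.exp(a))

def logistic_loss(y, a):
    """Loss for a label y in {0, 1} and activation a = w.x.

    y = 0 gives ln(1 + e^a); y = 1 gives ln(1 + e^a) - a.
    """
    return softplus(a) - y * a

def mean_logistic_loss(w, X, Y):
    """Average logistic loss of weights w over examples X with labels Y."""
    total = 0.0
    for x, y in zip(X, Y):
        a = sum(wi * xi for wi, xi in zip(w, x))
        total += logistic_loss(y, a)
    return total / len(X)
```

With all weights at zero, every prediction is 0.5 and the mean loss is ln 2 regardless of the labels, which gives a convenient baseline for the loss values reported later.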

3 Minimization of logistic loss using Normal (batch) Gradient Descent

One of the standard methods for minimizing a convex function is gradient descent: steps are taken proportional to the negative of the gradient of the function, and the optimal solution is reached when the gradient is equal to zero. As shown in the algorithm above, for each gradient step, the gradient of the loss over all the examples in the batch is computed and the weight of each feature is updated. This is continued until the 1-norm of the gradient is less than some threshold precision value. The gradient and weight update equations are as given below:

$$\text{gradient} = \frac{1}{T} \sum_t (\tilde{y}_t - y_t)\, x_t$$

$$w_{t+1} = w_t - \eta \cdot \text{gradient}$$

Gradient descent can be time consuming, as it may take many iterations to converge to the minimum. Run to convergence, it can also overfit the training data, which is undesirable because it leads to a higher loss on the test data. Hence it is common to use other iterative minimization methods.
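The batch procedure of this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Matlab code used for the experiments: the toy data is made up, `max_steps` is a hypothetical safety cap, and the defaults for `eta` and `precision` are borrowed from values mentioned in the experiments.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def batch_gradient(w, X, Y):
    """(1/T) * sum_t (prediction_t - y_t) * x_t over all T examples."""
    T = len(X)
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / T
    return grad

def train_batch(X, Y, eta=0.2, precision=0.01, max_steps=10000):
    """Step against the batch gradient until its 1-norm drops to `precision`."""
    w = [0.0] * len(X[0])              # weights start at zero
    for _ in range(max_steps):         # max_steps is an assumed safety cap
        grad = batch_gradient(w, X, Y)
        if sum(abs(g) for g in grad) <= precision:
            break
        w = [wi - eta * g for wi, g in zip(w, grad)]
    return w

# Toy data (made up): a constant bias feature plus one word count.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 1, 0, 1]
w = train_batch(X, Y)
```

Raising `precision` makes the loop stop earlier, which is the early-stopping knob examined in the experiments.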

4 Minimization of logistic function using Regularized Gradient Descent

A common technique used to prevent overfitting of the training data is to regularize the weights. Regularization as defined in [1] is “Any tunable method that increases the average loss on the training set, but decreases the average loss on the test set”. Some of the techniques used in regularization are early stopping of training, regularizing with relative entropies, feature selection, and clipping the range of labels.

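We have implemented regularization using the following minimization function:

$$\inf_w \; \frac{1}{T} \sum_t \left( \frac{1}{2\eta} \|w\|_2^2 + \text{loss}(y_t, \tilde{y}_t) \right)$$

and train until the gradient satisfies:

$$\left\| \frac{1}{\eta} w + \frac{1}{T} \sum_t (\tilde{y}_t - y_t)\, x_t \right\|_1 \le \text{precision}$$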
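A minimal Python sketch of the regularized procedure follows. The gradient is the batch gradient plus a term proportional to w. The name `reg_coef` stands for the coefficient that multiplies w in the stopping criterion of this section, and `step` and `max_steps` are hypothetical parameters introduced here for illustration, so their defaults should not be read as the report's settings.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regularized_gradient(w, X, Y, reg_coef):
    """reg_coef * w + (1/T) * sum_t (prediction_t - y_t) * x_t."""
    T = len(X)
    grad = [reg_coef * wi for wi in w]     # gradient of the 2-norm penalty
    for x, y in zip(X, Y):
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / T
    return grad

def train_regularized(X, Y, reg_coef=0.01, step=0.2, precision=0.1, max_steps=10000):
    """Descend until the 1-norm of the regularized gradient is at most `precision`."""
    w = [0.0] * len(X[0])
    for _ in range(max_steps):             # max_steps is an assumed safety cap
        grad = regularized_gradient(w, X, Y, reg_coef)
        if sum(abs(g) for g in grad) <= precision:
            break
        w = [wi - step * g for wi, g in zip(w, grad)]
    return w

# Toy data (made up): a constant bias feature plus one word count.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 1, 0, 1]
w_weak = train_regularized(X, Y, reg_coef=0.01, precision=0.01)
w_strong = train_regularized(X, Y, reg_coef=1.0, precision=0.01)
```

On this toy data the strongly penalized weights end up much closer to zero than the weakly penalized ones, which is the shrinking effect the penalty is meant to have.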

5 Minimization of logistic function using Stochastic Gradient Descent with Simulated Annealing

In standard (or “batch”) gradient descent, the true gradient is used to update the parameters of the model. The true gradient is usually the sum of the gradients caused by each individual training example. The parameter vectors are adjusted by the negative of the true gradient multiplied by a step size. Therefore, batch gradient descent requires one sweep through the training set before any parameters can be changed.

In stochastic (or “online”) gradient descent, the true gradient is approximated by the gradient of the cost function evaluated on only a single training example. The parameters are then adjusted by an amount proportional to this approximate gradient. Therefore, the parameters of the model are updated after each training example. For large data sets, online gradient descent is found to be much faster than batch gradient descent.

Instead of using a constant learning rate for each gradient update, a variable learning rate was implemented, whose value is gradually reduced to control the weight vectors. This technique is intuitively similar to the annealing technique [2], in which a metal is heated to a high temperature and then gradually cooled. The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one. We cool the learning rate η by a factor of αⁱ⁻¹, where i is the iteration (or pass) number. The gradient and the weight update equations are as given below:

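$$\text{gradient} = (\tilde{y}_t - y_t)\, x_t$$

$$w_{t+1} = w_t - (\eta \cdot \alpha^{i-1}) \cdot \text{gradient}$$

where t indexes the single training example used for the update and i is the current pass.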
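The cooling schedule and the per-example update can be sketched as follows in Python. This is a sketch under assumptions rather than the report's Matlab implementation: examples are visited in their stored order, the default values are the ones quoted in the experiments, and the function names are illustrative.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def learning_rates(eta, alpha, passes):
    """Learning rate used in pass i = 1..passes: eta * alpha^(i-1)."""
    return [eta * alpha ** (i - 1) for i in range(1, passes + 1)]

def sgd_annealed(X, Y, eta=0.2, alpha=0.5, passes=10):
    """Update the weights after every example, cooling the rate once per pass."""
    w = [0.0] * len(X[0])
    for rate in learning_rates(eta, alpha, passes):
        for x, y in zip(X, Y):
            # Gradient of the loss on this single example: (prediction - y) * x.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - rate * err * xi for wi, xi in zip(w, x)]
    return w

# With eta = 0.2 and alpha = 0.5 the first four rates are 0.2, 0.1, 0.05, 0.025.
print(learning_rates(0.2, 0.5, 4))
```

With a single pass the rate never changes from η, and with α = 1 it stays constant over all passes, which are the two constant-rate cases discussed in the experiments.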

6 Experimental Results

Logistic regression algorithms were implemented in Matlab. Several tests were conducted, a detailed analysis of which is presented in the following sections. The spam dataset (given on the class website) was used for analysis. The number of trials (examples) was limited to 2000. In total, 2000 features were used for each example.

6.1 Cross Validation

A 5-fold cross validation over 10 runs was used to obtain the average training and test loss of each algorithm. We used 2000 examples to train and test the algorithms, with each example containing 2000 features (words).

1. for i = 1 to 10
      permute data
      split into 3/4 training and 1/4 testing set
      perform 5 fold cross validation to determine the best model parameters:
        - partition the training set into 5 parts
        - for each of the 5 holdouts
            * train all models on the 4/5 part - training set
            * record the average logistic loss on 1/5 part - validation set
        - the best model is chosen as the one with the best average over 5 holdouts
      evaluate the best model by computing the average logistic loss on the 1/4 test set

2. compute the average performance of the best model on the 10 runs
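The procedure above can be sketched generically in Python. The functions `train_fn` and `loss_fn` and the list `candidates` are placeholders for whichever training algorithm, loss and parameter settings are being compared; they are hypothetical names. Refitting the chosen parameters on the full 3/4 training set before testing is an assumption, since the procedure does not state it.

```python
import random

def k_fold_splits(items, k=5):
    """Yield (training part, held-out part) pairs for k-fold cross validation."""
    folds = [items[i::k] for i in range(k)]
    for h in range(k):
        rest = [item for j, fold in enumerate(folds) if j != h for item in fold]
        yield rest, folds[h]

def one_run(data, candidates, train_fn, loss_fn, rng, k=5):
    """One run: permute, split 3/4 vs 1/4, pick parameters by k-fold CV, then test."""
    data = list(data)
    rng.shuffle(data)
    cut = 3 * len(data) // 4
    train, test = data[:cut], data[cut:]

    def cv_loss(params):
        # Average validation loss over the k holdouts for one parameter setting.
        losses = [loss_fn(train_fn(params, tr), val)
                  for tr, val in k_fold_splits(train, k)]
        return sum(losses) / k

    best = min(candidates, key=cv_loss)
    # Assumption: the chosen parameters are refit on the full 3/4 training set.
    return best, loss_fn(train_fn(best, train), test)

def average_over_runs(data, candidates, train_fn, loss_fn, runs=10, seed=0):
    """Average test loss of the selected model over several permuted runs."""
    rng = random.Random(seed)
    test_losses = [one_run(data, candidates, train_fn, loss_fn, rng)[1]
                   for _ in range(runs)]
    return sum(test_losses) / runs
```

Keeping the 1/4 test set out of the parameter selection means the reported test loss is not biased by the choice of parameters.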

6.2 Logistic Regression using Gradient Descent

The regular (batch) gradient descent algorithm was implemented and the training and test losses were found using the 5-fold cross validation described in section[6.1]. Results were obtained over 10 runs for all of the precision values except 0.0001, which completed only 5 runs after running for 2 days. Hence, for that precision value, the training loss and test loss were computed over only 5 runs.


6.2.1 Effect of Early Stopping of the training

Running the gradient descent algorithm to minimize the total logistic loss and stopping the training early (not allowing the gradient to go to zero) amounts to implicit regularization, as the weights are initially small. Figure[1] shows the variation of the training loss and test loss as the precision value is varied. For precision 10⁻⁴ the training overfits the data: the training loss is very low, but the test loss is high. As the precision is loosened (a larger stopping threshold, that is, earlier stopping), the training loss increases because the weights are kept under control, and hence the test loss is lowered. A slight increase in both the test loss and the training loss is observed at precision 1; this could be because the weights were not trained enough. Hence, from the graph, we can select 0.01 as a good value for the stopping point of the gradient.

[Plot: Early Stopping for Logistic Regression using normal batch gradient descent. x-axis: Gradient at stopping point (10⁻⁴ to 10⁰); y-axis: Mean Logistic Loss; curves: Test Loss, Training Loss.]

Figure 1: Normal Gradient Descent: Variation of mean logistic loss as a function of gradient stopping point

6.3 Logistic Regression using Regularized Gradient Descent

Logistic regression using regularized gradient descent was implemented as explained in section[4]. A 5-fold cross validation over 10 runs was conducted to obtain the mean logistic loss on the test and validation data sets. After conducting the tests, a precision of 0.1, λ=0.01 and η=0.2 were found to be good choices for the parameters.

6.3.1 Effect of regularization parameter λ

Figure[2] shows the variation of the training loss and the test loss with respect to the regularization parameter λ. For λ ≥ 0.01 the test and training losses remained almost constant, as expected, which implies that regularization helps in preventing overfitting. For lower values of λ the effect of the regularization is reduced, and hence overfitting on the training data is observed, due to which high test losses are recorded.

[Plot: Logistic Regression using 2-Norm regularizer. x-axis: Regularization Parameter lambda (10⁻⁴ to 10⁰); y-axis: Mean Logistic Loss; curves: training loss, test loss.]

Figure 2: Regularized Gradient Descent: Variation of mean logistic loss as a function of λ

6.3.2 Effect of learning rate η

Figure[3] shows the variation of the logistic loss for different learning rates. Varying the learning rate did not have a pronounced effect on the test and training losses: the lowest loss was recorded at η = 0.2, and the losses were slightly higher for the other learning rates. This could be due to the regularization. The 2-norm regularization of the weights ensures a faster convergence of the gradient, and hence, even for low learning rates, the redundant weights do not influence the loss function.

6.4 Logistic Regression using Stochastic Gradient Descent with Simulated Annealing

Logistic Regression using Stochastic Gradient Descent was implemented as explained in section[5]. A 5-fold cross validation over 10 runs was conducted to obtain the mean logistic loss on the test and validation data sets. After studying the effects of the different parameters (α, η, passes), the parameters were set at passes=10, η=0.2 and α=0.5.

6.4.1 Effect of varying the number of passes

Figure[4] shows the variation of the mean logistic loss on the test and the validation set as the number of training passes (iterations) is varied. The loss is lowest for 10 passes, but the loss values do not differ significantly across different numbers of passes. It is unclear whether an optimal number of passes exists for stochastic gradient descent: since the weights are updated after seeing each example, the gradient values would be random after each update. However, when the number of passes is restricted to 1, the loss on both the test and training sets increases significantly. This is because with a single pass the learning rate remains constant for the entire algorithm, whereas with more passes the learning rate is reduced by a factor of αⁱ⁻¹, where i is the iteration number. This annealing helps in controlling the weight vector.

[Plot: Mean Logistic Loss vs learning rate eta. x-axis: Learning Rate eta (10⁻¹ to 10¹); y-axis: Mean Logistic Loss; curves: test loss, training loss.]

Figure 3: Regularized gradient Descent: Mean Logistic Loss as a function of η

[Plot: Stochastic Gradient Descent: Mean logistic loss vs Number of Passes. x-axis: Number of passes (0 to 100); y-axis: Mean Logistic Loss; curves: test loss, train loss.]

Figure 4: Stochastic Gradient Descent: Mean Logistic Loss as a function of number of passes

6.4.2 Effect of varying η

Figure[5] shows the variation of the mean logistic loss as a function of the learning rate η. Clearly, higher learning rates do not work well for stochastic gradient descent. This could be due to the randomness in the gradients: the weight vectors are not controlled at higher learning rates, even with a low cooling rate of α = 0.5. From Figure[5], η = 0.2 seems to be a good value for the learning rate parameter.

[Plot: Stochastic Gradient Descent: Mean Logistic Loss vs Learning Rate eta. x-axis: Learning rate eta (0 to 4); y-axis: Mean Logistic Loss.]

Figure 5: Stochastic Gradient Descent: Mean Logistic Loss as a function of learning rate η

6.4.3 Effect of varying α

Figure[6] shows the variation of the mean logistic loss as a function of the cooling rate α. As explained in section[5], α gradually reduces the learning rate applied to the weight vector. The performance of the algorithm was similar over the range 0.5 ≤ α ≤ 0.95. When α < 0.5, the loss on both the test and the training sets went up, and the same was the case when α = 1 (constant learning rate). This suggests that the simulated annealing technique helps in faster and better convergence of the gradient.

[Plot: Stochastic Gradient Descent: Mean Logistic Loss vs Cooling Rate alpha. x-axis: Cooling Rate alpha (0.2 to 1); y-axis: Mean Logistic Loss.]

Figure 6: Stochastic Gradient Descent: Mean Logistic Loss as a function of Cooling rate α

7 Conclusion

Three different logistic loss minimization techniques were implemented and studied: normal gradient descent with varying gradient stopping points, regularized gradient descent with different λ and η values, and stochastic (online) gradient descent with simulated annealing with varying α and η values.

In the case of normal gradient descent, it was observed that early stopping of the gradient helped in preventing overfitting on the training data and thus improved the performance on the test set.

Using a 2-norm regularizer of the weights along with the logistic loss helped in obtaining a faster and better convergence of the gradient. This was due to the fact that the regularizer acted as a relative entropy, thus controlling the learning of the weights. This prevented the weights from overfitting on the training data, leading to a better performance on the test set.

In stochastic gradient descent with simulated annealing, overfitting was prevented by starting with a low learning rate η and further reducing the learning rate using the cooling rate α. Since the gradient values vary on each update, it is still unclear how to optimally control the weight vector. An extension of this algorithm would be to include a relative entropy term in the minimization function and then apply stochastic gradient descent with varying learning rates.

8 References

[1] Manfred K. Warmuth, “Shrink-Stretch of labels for regularizing logistic regression”.

[2] Wikipedia, “Simulated Annealing”.