3.4 Empirical evaluation
3.4.2 Algorithms
We compared different variants of BOMC with three state-of-the-art online learners. BOMC was trained with the following settings.
1. The order of the training examples was randomized.
2. The priors of the feature weights were Gaussians N(0, 1). No class-wise bias was used. The single global bias had prior N(0, 10^4). The noise level in Eq. (3.1) was set to β = 0.01, reflecting the fact that text data lies in a high-dimensional space, where it is very likely to be linearly separable.
3. EP was used for inference. EP was considered converged once the relative change of all messages fell below 10^{-5}. On average, EP took only about 3 iterations to converge.
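The stopping rule in item 3 can be sketched as follows. This is only an illustrative sketch: it assumes each message is summarized by a flat list of real-valued parameters, and the names `relative_change` and `messages_converged` are hypothetical, not taken from the BOMC implementation.

```python
from typing import Sequence


def relative_change(old: float, new: float, eps: float = 1e-12) -> float:
    """Relative change |new - old| / |old|, guarded against division by zero."""
    return abs(new - old) / max(abs(old), eps)


def messages_converged(old_messages: Sequence[float],
                       new_messages: Sequence[float],
                       tol: float = 1e-5) -> bool:
    """True when every message parameter changed by less than tol (relative)."""
    return all(relative_change(o, n) < tol
               for o, n in zip(old_messages, new_messages))
```

An EP loop would call `messages_converged` after each sweep over the factors and stop as soon as it returns `True`.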
In practice, many classes have only very few positive examples, and over 90% of the examples are negative. This skewed ratio is commonly dealt with by two heuristics. The first approach tunes the threshold by using cross validation (CV) (Yang, 2001; Lewis et al., 2004; Fan & Lin, 2007). Intuitively it translates the original separating hyperplane towards the negative example region. However, CV is very expensive and relies heavily on the batch setting. The second approach requires more prior knowledge but is much cheaper. It uses different costs for misclassifying positive and negative examples, e.g. the “-j” parameter in SVMlight. Intuitively it increases the influence of less common classes. Lewis (2001) won the TREC-2001 Batch Filtering Evaluation by using this heuristic with SVMlight. Theoretically, Musicant et al. (2003) proved that such cost models approximately optimize F-score.
All the competing algorithms in our experiment performed very poorly when neither heuristic was used. Therefore we assume some prior knowledge such as the relative frequency of positive and negative examples (denoted by r). BOMC can encode this prior in the delta (loss) factor: for negative examples, the loss factor is set to δ(d < −1), while for positive examples the loss factor is set to δ(d > ln(e + 1/r)).
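The asymmetric margins can be illustrated numerically. The sketch below only evaluates the two thresholds stated above. It assumes r is given as the ratio of positive to negative examples, and the function name is hypothetical.

```python
import math


def margin_thresholds(r: float) -> tuple:
    """Return (negative_upper, positive_lower) for the delta loss factors.

    Negative examples require d < -1; positive examples require
    d > ln(e + 1/r), where r is assumed to be the positive/negative ratio.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    negative_upper = -1.0
    positive_lower = math.log(math.e + 1.0 / r)
    return negative_upper, positive_lower
```

Under this reading, the positive margin tends to 1 as r grows (mirroring the negative margin) and increases as the positive class becomes rarer, which is how the rare class gains influence.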
BOMC with sampling (BOMC Sample) To label the test data, we sampled from the posterior of the learned model as shown in Algorithm 9. Five samples were drawn, since our experiments showed that drawing 10 or 20 samples did not improve the F-score significantly. We call this algorithm BOMC Sample.
Special care was required when a class was never “activated” by the samples, i.e. when for all test examples the inner product of the features and the sampled weights was less than the sampled bias. Not being activated by 5 samples should probably not rule out the possibility of activating the class in the test set. Supposing the learned model is w ∼ N(µ, σ) and b ∼ N(µ0, σ0), we set the threshold of that class to the maximum of the membership probability (given by Eq. (3.5))
\[
p(y = 1 \mid x) = \Phi\left( \frac{\langle \mu, x \rangle - \mu_0}{\sqrt{\sigma_0^2 + \sum_d x_d^2 \sigma_d^2}} \right),
\]
over all testing examples x.
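The membership probability and the fallback threshold can be computed as in the following sketch. It assumes independent Gaussian posteriors per weight and passes variances (σ_d^2 and σ_0^2) directly. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def membership_probability(x: Sequence[float], mu: Sequence[float],
                           sigma2: Sequence[float],
                           mu0: float, sigma0_2: float) -> float:
    """p(y=1|x) = Phi((<mu,x> - mu0) / sqrt(sigma0^2 + sum_d x_d^2 sigma_d^2))."""
    mean = sum(m * xd for m, xd in zip(mu, x)) - mu0
    variance = sigma0_2 + sum(xd * xd * s for xd, s in zip(x, sigma2))
    return std_normal_cdf(mean / math.sqrt(variance))


def fallback_threshold(test_set: Sequence[Sequence[float]],
                       mu: Sequence[float], sigma2: Sequence[float],
                       mu0: float, sigma0_2: float) -> float:
    """Threshold for a never-activated class: max probability over test examples."""
    return max(membership_probability(x, mu, sigma2, mu0, sigma0_2)
               for x in test_set)
```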
BOMC with Class Mass Normalization (BOMC CMN) A much simpler but non-Bayesian heuristic for picking the threshold is to match the zero-th order moment, i.e. to make the class ratio in the testing set identical to that in the training set. This heuristic was proposed by Zhu et al. (2003) to solve a similar threshold tuning problem in semi-supervised learning. Technically, we sorted the membership probabilities (Eq. (3.5)) of all testing examples in decreasing order, and labeled the top fraction p as positive, where p is the fraction of positive examples in the training set. This approach is called class mass normalization (CMN) by Zhu et al. (2003), so we call this variant of BOMC BOMC CMN.
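The CMN labeling rule can be sketched in a few lines. This is a minimal illustration for a single class. Rounding p·n to the nearest integer and breaking ties by original order are assumptions, and the function name is hypothetical.

```python
from typing import List, Sequence


def cmn_labels(probabilities: Sequence[float],
               positive_fraction: float) -> List[bool]:
    """Label the top `positive_fraction` of test examples as positive.

    `probabilities` holds the membership probability of each test example;
    `positive_fraction` is the fraction of positives in the training set.
    """
    n = len(probabilities)
    k = int(round(positive_fraction * n))  # number of examples to mark positive
    # Indices sorted by decreasing membership probability (stable for ties).
    order = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    labels = [False] * n
    for i in order[:k]:
        labels[i] = True
    return labels
```

For example, with probabilities `[0.9, 0.1, 0.5, 0.7]` and a training-set positive fraction of 0.5, the first and last examples are labeled positive.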
BOMC: Training all classes independently (BOMC IND CMN and BOMC IND Sample) We also tried training each class independently, i.e. each class c has its own bias b_c and the shared global bias is no longer used. Now the posterior of each class can be computed in closed form for each training example. During testing, both CMN and sampling are again applicable, and the resulting variants are called BOMC IND CMN and BOMC IND Sample, respectively.
All the variants of BOMC were implemented in F#, and can be downloaded from
http://www.stat.purdue.edu/~zhang305/code/bomc.tar.bz2.
Batch SVM (SVM Batch) As a baseline, we compared against a batch SVM, whose batch nature gives it an unfair advantage over BOMC, which is an online learner. We
trained one SVM for each class independently. Since many classes in Reuters have very few positive examples, we applied the heuristic of using different costs for mislabeling positive and negative examples. The cost was chosen by 5-fold CV, and the final result was rather poor. So we tried the algorithm in (Yang, 2001), which tunes the threshold, and it yielded very competitive performance. The final algorithm relies heavily on CV: besides using CV for picking the loss-regularization tradeoff parameter C, it also employs a nontrivial 2-level CV strategy to tune the bias of SVM (Fan & Lin, 2007). So in total, CV with k folds costs k^3 rounds. We call this method SVM Batch.
As the 3-level CV is very expensive, our experiment used 3 folds for each level of CV, so the underlying trainer was called 3^3 + 1 = 28 times. We tried 5 folds on some random samples of the training and testing data, and it gave almost the same result. We used the C implementation of liblinear as the batch SVM solver (footnote 10), and wrote a Matlab script to deal with the multi-label data.
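The call count is simple arithmetic and can be checked with the sketch below. The reading of the extra "+ 1" as one final training run on the full training set with the selected parameters is an assumption, and the function name is hypothetical.

```python
def trainer_calls(folds: int, levels: int = 3) -> int:
    """Number of calls to the underlying trainer under nested CV.

    Each of the `levels` nested CV levels multiplies the cost by `folds`,
    giving folds**levels rounds, plus one further call (assumed here to be
    the final training run with the selected parameters).
    """
    return folds ** levels + 1
```

With 3 folds per level this gives 28 calls, matching the count above. With 5 folds it would be 126, which indicates why the number of folds was kept small.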
LaSVM (LaSVM) LaSVM is an online optimizer for the SVM objective proposed by Bordes et al. (2005), who showed that in a single pass through the dataset, LaSVM achieves almost as good generalization performance as the batch SVM. Operating in the dual, which allows nonlinear kernels, LaSVM maintains the active/support vector set, and employs a removal heuristic to avoid overfitting, especially when the data is noisy. Strictly speaking, it is not a stream learner because it memorizes some data points (support vectors).
For our multi-label problem, we again trained all classes separately. The experiment showed that using different costs for positive and negative examples did not improve the testing F-score of LaSVM on imbalanced data, hence we resorted to the CV-based strategy to tune the bias as in SVM Batch. Due to the high computational cost of LaSVM, we only managed to use 2 folds for each level/parameter under CV. This means calling LaSVM 2^3 + 1 = 9 times.
We used the C implementation of LaSVM, and wrote a Matlab script for the multi-label scenario. Although only linear kernels are used here, this LaSVM implementation was not optimized for this special case, and was hence inefficient.
Passive-Aggressive (PA) This online algorithm has been repeatedly proposed (under different names) for training SVMs, e.g. (Cheng et al., 2006; Crammer et al., 2006; Hsieh et al., 2008b). The idea is simple: given a current model w_t and a new training example (x_t, y_t), the model is updated to the minimizer
\[
w_{t+1} := \operatorname*{argmin}_{w} \; \frac{\lambda}{2} \| w - w_t \|_{H}^{2} + \mathrm{loss}(x_t, y_t, w). \tag{3.15}
\]
Footnote 10: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
PA does not naturally accommodate the bias in SVM. Hence we applied the same CV strategy used in SVM Batch to find the optimal bias. Here, CV may use either PA or batch SVM, which we call PA OnlineCV and PA BatchCV respectively.
Due to the equivalence of PA and running one pass of liblinear, we simply used liblinear with the iteration number set to 1.
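To make the update (3.15) concrete, the sketch below works it out for one specific case. It assumes the hinge loss max(0, 1 − y⟨w, x⟩) and the Euclidean norm (linear kernel), for which the minimizer has the closed form w_{t+1} = w_t + τ y_t x_t with τ = min(1/λ, loss/‖x_t‖²). This is not the liblinear code path used in the experiments, and the function name is hypothetical.

```python
from typing import List, Sequence


def pa_update(w: Sequence[float], x: Sequence[float],
              y: int, lam: float) -> List[float]:
    """One PA step for hinge loss: argmin_w lam/2 ||w - w_t||^2 + hinge(x, y, w).

    `y` is the label in {-1, +1}; `lam` is the regularization weight.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss at the current model
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                       # passive: no loss, no change
    tau = min(1.0 / lam, loss / sq_norm)     # aggressive: capped step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

The cap 1/λ plays the role of the aggressiveness parameter: a small λ lets a single example move the model until its loss vanishes, while a large λ keeps w_{t+1} close to w_t.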