Algorithms - Empirical evaluation - Graphical Models: Modeling, Optimization, and Hilbert Space

3.4 Empirical evaluation

3.4.2 Algorithms

We compared different variants of BOMC with three state-of-the-art online learners. BOMC was trained with the following settings.

1. The order of the training examples was randomized.

2. The priors of the feature weights were Gaussians $\mathcal{N}(0, 1)$. No class-wise bias was used. The single global bias had prior $\mathcal{N}(0, 10^4)$. The noise level in Eq. (3.1) was set to $\beta = 0.01$, reflecting the fact that text data lies in a high-dimensional space and is therefore very likely to be linearly separable.

3. EP was used for inference. The convergence criterion of EP was that the relative change of all messages fell below $10^{-5}$. On average, it took only about 3 iterations for EP to converge.

In practice, many classes have very few positive examples, and over 90% of the examples are negative. This skewed ratio is commonly dealt with by two heuristics. The first approach tunes the threshold by cross validation (CV) (Yang, 2001; Lewis et al., 2004; Fan & Lin, 2007). Intuitively, it translates the original separating hyperplane towards the negative example region. However, CV is very expensive and relies heavily on the batch setting. The second approach requires more prior knowledge but is much cheaper. It uses different costs for misclassifying positive and negative examples, e.g. the “-j” parameter in SVMlight. Intuitively, it increases the influence of less common classes. Lewis (2001) won the TREC-2001 Batch Filtering Evaluation by using this heuristic with SVMlight. Theoretically, Musicant et al. (2003) proved that such cost models approximately optimize F-score.

All the competing algorithms in our experiment perform very poorly when neither heuristic is used. Therefore we assume some prior knowledge, such as the relative frequency of positive and negative examples (denoted by $r$). BOMC can encode this prior in the threshold of the delta factor. For negative examples, the loss factor is set to $\delta(d < -1)$, while for positive examples the loss factor is set to $\delta(d > \ln(e + 1/r))$.
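The two thresholds can be sketched numerically as below. This is only an illustration: the function name is hypothetical, and it assumes that $r$ is the ratio of positive to negative examples, which the text does not state explicitly.

```python
import math


def loss_factor_thresholds(r: float) -> tuple[float, float]:
    """Return (negative_upper, positive_lower) thresholds on the margin d.

    Assumption (not stated in the text): r > 0 is the ratio of positive
    to negative examples for the class.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    negative_upper = -1.0                          # negative examples: d < -1
    positive_lower = math.log(math.e + 1.0 / r)    # positive examples: d > ln(e + 1/r)
    return negative_upper, positive_lower


if __name__ == "__main__":
    for r in (0.01, 0.1, 1.0):
        print(r, loss_factor_thresholds(r))
```

Under this reading, the rarer the positive class (smaller $r$), the larger the margin required of its positive examples. As $r$ grows, the positive threshold approaches $\ln e = 1$, which mirrors the $-1$ used for negative examples.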

BOMC with sampling (BOMC Sample) To label the test data, we sampled from the posterior of the learned model as shown in Algorithm 9. Five samples were drawn, since the experiment showed that drawing 10 or 20 samples did not improve the F-score significantly. We call this algorithm BOMC Sample.

Special care was required when a class was never “activated” by the samples, i.e. when, for all test examples, the inner product of the feature vector and the sampled weight was less than the sampled bias. Not being activated by 5 samples probably should not rule out the possibility of activating the class in the test set. Supposing the learned model is $w \sim \mathcal{N}(\mu, \sigma)$ and $b \sim \mathcal{N}(\mu_0, \sigma_0)$, we set the threshold of that class to the maximum of the membership probability (given by Eq. (3.5))

$$p(y = 1 \mid x) = \Phi\left( \frac{\langle \mu, x \rangle - \mu_0}{\sqrt{\sigma_0^2 + \sum_d x_d^2 \sigma_d^2}} \right),$$

over all testing examples $x$.
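The membership probability and the fallback threshold for a never-activated class can be sketched as follows. This is a minimal illustration, not the F# implementation: the function names are hypothetical, features are assumed to be dense lists, and the per-dimension variances $\sigma_d^2$ and $\sigma_0^2$ are passed directly.

```python
import math


def membership_probability(x, mu, var, mu0, var0):
    """p(y=1|x) = Phi((<mu, x> - mu0) / sqrt(var0 + sum_d x_d^2 * var_d)).

    x, mu, var: equal-length sequences (features, weight means, weight variances).
    mu0, var0: mean and variance of the bias; var0 is assumed to be > 0.
    """
    mean = sum(m * xd for m, xd in zip(mu, x)) - mu0
    variance = var0 + sum(xd * xd * v for xd, v in zip(x, var))
    z = mean / math.sqrt(variance)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fallback_threshold(test_examples, mu, var, mu0, var0):
    """Maximum membership probability over all test examples."""
    return max(membership_probability(x, mu, var, mu0, var0)
               for x in test_examples)


if __name__ == "__main__":
    tests = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
    print(fallback_threshold(tests, mu=[1.0, 0.0], var=[1.0, 1.0],
                             mu0=1.0, var0=1.0))
```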

BOMC with Class Mass Normalization (BOMC CMN) A much simpler but non-Bayesian heuristic for picking the threshold is to match the zeroth-order moment: making the class ratio in the testing set identical to that in the training set. This heuristic was proposed by Zhu et al. (2003) to solve a similar threshold tuning problem in semi-supervised learning. Technically, we sorted the membership probability (Eq. (3.5)) of all testing examples in decreasing order, and labeled the top p percent as positive, where p is the percentage of positive examples in the training set. This approach is called class mass normalization (CMN) by Zhu et al. (2003), so we call this variant BOMC CMN.
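The CMN labeling rule for a single class can be sketched as below. The function name is hypothetical, the positive rate is passed as a fraction in [0, 1] rather than a percentage, and rounding the count to the nearest integer is an assumption, since the text does not say how fractional counts are handled.

```python
def cmn_labels(probabilities, positive_fraction):
    """Label the top `positive_fraction` of test examples as positive.

    probabilities: membership probabilities of the test examples for one class.
    positive_fraction: fraction of positive training examples, in [0, 1].
    Returns a list of booleans aligned with `probabilities`.
    """
    n = len(probabilities)
    # Number of positives to assign (rounding rule is an assumption).
    k = min(n, max(0, round(positive_fraction * n)))
    # Indices sorted by decreasing membership probability.
    order = sorted(range(n), key=lambda i: probabilities[i], reverse=True)
    labels = [False] * n
    for i in order[:k]:
        labels[i] = True
    return labels


if __name__ == "__main__":
    print(cmn_labels([0.9, 0.1, 0.8, 0.3, 0.2], positive_fraction=0.4))
```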

BOMC: Training all classes independently (BOMC IND CMN and BOMC IND Sample) We also tried training each class independently, i.e. each class $c$ has its own bias $b_c$ and the shared global bias is no longer used. Now the posterior of each class can be computed in closed form for each training example. During testing, both CMN and sampling are again applicable, and the resulting variants are called BOMC IND CMN and BOMC IND Sample, respectively.

All the variants of BOMC were implemented in F# (footnote 9), and can be downloaded from http://www.stat.purdue.edu/~zhang305/code/bomc.tar.bz2.

Batch SVM (SVM Batch) As a baseline, we compared with a batch SVM, whose batch nature gives it an unfair advantage over BOMC, an online learner. We trained one SVM for each class independently. Since many classes in Reuters have very few positive examples, we applied the heuristic of using different costs for mislabeling positive and negative examples. The cost was chosen by 5-fold CV, but the final result was rather poor. So we tried the algorithm in (Yang, 2001), which tunes the threshold, and it yielded very competitive performance. The final algorithm relies heavily on CV: besides using CV to pick the loss-regularization tradeoff parameter $C$, it also employs a nontrivial 2-level CV strategy to tune the bias of the SVM (Fan & Lin, 2007). So in total, CV with $k$ folds costs $k^3$ rounds. We call this method SVM Batch.

As the 3-level CV is very expensive, our experiment used 3 folds for each level of CV, and so the underlying trainer was called $3^3 + 1 = 28$ times. We tried 5 folds on some random samples of training and testing data and it gave almost the same result. We used the C implementation of liblinear as the batch SVM solver (footnote 10), and wrote a Matlab script to deal with the multi-label data.
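The call counts quoted for the nested CV follow from a one-line formula, sketched below. The function name is hypothetical, and reading the extra call as a final fit with the selected parameters is an assumption, since the text only gives the totals.

```python
def trainer_calls(folds: int, levels: int = 3) -> int:
    """Number of calls to the underlying trainer for nested CV.

    Each of the `levels` nested CV levels uses `folds` folds, giving
    folds ** levels rounds. The additional call is assumed here to be
    the final training run with the selected parameters.
    """
    if folds < 1 or levels < 1:
        raise ValueError("folds and levels must be positive")
    return folds ** levels + 1


if __name__ == "__main__":
    for k in (2, 3, 5):
        print(k, trainer_calls(k))
```

With 3 folds this reproduces the 28 calls reported for SVM Batch, and with 2 folds the 9 calls reported for LaSVM. Five folds would already require 126 calls, which illustrates why the fold count was kept small.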

LaSVM (LaSVM) LaSVM is an online optimizer for the SVM objective proposed by Bordes et al. (2005), who showed that with a single pass through the dataset, LaSVM achieves almost as good generalization performance as the batch SVM. Operating in the dual, which allows nonlinear kernels, LaSVM maintains the active/support vector set and employs a removal heuristic to avoid overfitting, especially when the data is noisy. Strictly speaking, it is not a stream learner because it memorizes some data points (support vectors).

For our multi-label problem, we again trained all classes separately. The experiment showed that using different costs for positive and negative examples did not improve the testing F-score of LaSVM on imbalanced data, hence we resorted to the CV-based strategy to tune the bias as in SVM Batch. Due to the high computational cost of LaSVM, we only managed to use 2 folds for each level/parameter under CV. This means calling LaSVM $2^3 + 1 = 9$ times.

We used the C implementation of LaSVM (footnote 11), and wrote a Matlab script for the multi-label scenario. Although only linear kernels are used here, this LaSVM implementation was not optimized for this special case and was hence inefficient.

Passive-Aggressive (PA) This online algorithm has been repeatedly proposed (under different names) for training SVMs, e.g. (Cheng et al., 2006; Crammer et al., 2006; Hsieh et al., 2008b). The idea is simple: given a current model $w_t$ and a new training example $(x_t, y_t)$, find the new model

$$w_{t+1} := \operatorname*{argmin}_{w} \; \frac{\lambda}{2} \|w - w_t\|_H^2 + \mathrm{loss}(x_t, y_t, w). \tag{3.15}$$

Footnote 10: http://www.csie.ntu.edu.tw/~cjlin/liblinear/

PA does not naturally accommodate the bias in SVM. Hence we applied the same CV strategy used in SVM Batch to find the optimal bias. Here, CV may use either PA or batch SVM, which we call PA OnlineCV and PA BatchCV respectively.

Due to the equivalence of PA and running one pass of liblinear, we simply used liblinear with the iteration number set to 1.
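For intuition, a single PA step of the form in Eq. (3.15) has a closed-form solution when the loss is the hinge loss and the norm is Euclidean. Both are assumptions made here for illustration, since the text leaves the loss and the space $H$ generic, and the function name is hypothetical. Under these assumptions the minimizer is $w_{t+1} = w_t + \tau y_t x_t$ with $\tau = \min(1/\lambda, \ell_t / \|x_t\|^2)$, where $\ell_t$ is the hinge loss of $w_t$ on $(x_t, y_t)$.

```python
def pa_update(w, x, y, lam):
    """One passive-aggressive step, assuming hinge loss and Euclidean norm.

    Minimizes (lam / 2) * ||w_new - w||^2 + max(0, 1 - y * <w_new, x>).
    w, x: equal-length lists of floats; y: +1 or -1; lam: > 0.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)  # passive: the example is already classified with margin
    # Aggressive: step just far enough to reach margin 1, capped by 1 / lam.
    tau = min(1.0 / lam, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    for x, y in [([1.0, 0.0], 1), ([0.0, 2.0], -1)]:
        w = pa_update(w, x, y, lam=1.0)
    print(w)
```

A larger $\lambda$ keeps the new model closer to the previous one, so a single noisy example moves the weights less.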

In document Graphical Models: Modeling, Optimization, and Hilbert Space Embedding (Page 101-104)