Results and Analysis - Classification Methodology


3.3.2 Results and Analysis

In Section 3.2.3.4, we obtained the best performance (F1-score) for each category; here we report some statistics of that performance. Note that the predictive results presented are obtained by 3-fold cross validation on the training data rather than on new webpages from the Internet, so they may not accurately reflect real-world performance. We do not evaluate on new webpages because they carry no class labels, which makes performance on them difficult to measure. However, the diversity of the training set (webpages from various topics in ODP) lets our training data approximate the distribution of Internet webpages, and the large number of training examples helps reduce the risk of overfitting. Thus, the results reported here can still be regarded as an important indication of the classification performance on new Internet webpages.
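As a minimal sketch of the evaluation protocol described above, the following shows how a labeled training set can be split into three folds so that every example is scored exactly once by a classifier that did not see it during training. The function name `k_fold_splits` and the round-robin fold assignment are assumptions made for illustration; the text does not state how the folds were assigned.

```python
def k_fold_splits(n_examples, k=3):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Fold assignment is round-robin by index, an illustrative choice only.
    """
    indices = list(range(n_examples))
    for fold in range(k):
        test = [i for i in indices if i % k == fold]
        train = [i for i in indices if i % k != fold]
        yield train, test


splits = list(k_fold_splits(7, k=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each example appears in exactly one test fold, so the per-category F1-score can be computed over predictions for the whole training set.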

Table 3.4 presents the average F1-score over the categories at each level of the hierarchy. The lowest average F1-score among the four levels (at the first level) is close to 0.8. This performance is strong given that the classes are imbalanced (the average positive class ratio ranges from 0.1 to 0.2).
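To illustrate why the F1-score is an informative measure under this kind of class imbalance, the sketch below compares accuracy and F1 on a hypothetical category with a positive class ratio of 0.15. The labels and predictions are invented for illustration and are not data from the experiments.

```python
def f1_score(y_true, y_pred):
    """F1-score of the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical category: 3 positives out of 20 examples (ratio 0.15).
y_true = [1] * 3 + [0] * 17
always_negative = [0] * 20          # trivial classifier that never predicts positive
one_miss = [1, 1, 0] + [0] * 17     # finds 2 of the 3 positives, no false alarms

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))
print(accuracy(y_true, one_miss), f1_score(y_true, one_miss))
```

The trivial classifier reaches high accuracy while its F1-score is zero, which is why an average F1-score near 0.8 on imbalanced categories is a meaningful result.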

To observe the performance of each individual classifier, we also show the distribution of the F1-score over the categories at each level in Figure 3.3. On the X-axis, we split [0,1] into 10 ranges; the Y-axis gives the percentage of categories at each level whose F1-score falls into each range. For example, of the 499 categories in level 3, 33% have an F1-score within [0.9,1], 33% within [0.8,0.9), and only about 10% within [0.6,0.7). For all four levels, most of the categories have an F1-score higher than 0.7, and very few categories perform poorly (with an F1-score below 0.6).
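The binning behind such a distribution plot can be sketched as follows. The function name `f1_histogram` and the sample scores are hypothetical; the scores are not values from Figure 3.3.

```python
def f1_histogram(f1_scores, n_bins=10):
    """Return the percentage of categories whose F1-score falls in each bin.

    Bins are [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]; the last bin is closed
    so that a perfect score of 1.0 is counted.
    """
    counts = [0] * n_bins
    for score in f1_scores:
        # Rounding guards against floating-point error at bin boundaries.
        index = min(int(round(score * n_bins, 9)), n_bins - 1)
        counts[index] += 1
    total = len(f1_scores)
    return [100.0 * c / total for c in counts]


# Hypothetical per-category F1-scores for one level of the hierarchy.
scores = [0.95, 0.91, 0.85, 0.83, 0.74, 0.65, 1.0, 0.55, 0.97, 0.88]
percentages = f1_histogram(scores)
for i, pct in enumerate(percentages):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {pct:.0f}%")
```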

Thus, with their parameters optimized in this way, the classifiers are expected to achieve reasonable results when classifying new webpages.

Figure 3.3: The distribution of F1-score over different categories.

Furthermore, we find some other interesting results for the optimized parameters. Figure 3.4 plots the distribution of the feature filtering method in the optimized parameters. DF is selected as the best feature filtering method for most of the categories in the hierarchy, and for the first two levels, none of the categories chooses BNS. As discovered in our experiments, the tendency of BNS to select rare words can greatly reduce the size of the training data, which negatively affects the classification performance when the original data is large. For example, the training sets in the first level generally have a large number of examples with more than one million features, and one original training set in the first level usually takes 300MB of disk space. By applying BNS to reduce the features to 10%, we can decrease the needed disk space to an extremely small size (say 5MB). However, many of the reduced examples then contain no features and become useless for training the classifier, which significantly degrades the classification performance.
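The effect described above, where examples are left with no features after filtering, can be illustrated with a small sketch. It does not implement BNS: the "rare words" selection below simply keeps the lowest-DF words as a stand-in for a filter that favors rare words, and the toy corpus and all function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(doc)  # each doc is a set, so a word counts once per document
    return df


def keep_features(docs, kept):
    """Drop every word that is not in the kept feature set."""
    return [doc & kept for doc in docs]


def count_empty(docs):
    """Number of documents left with no features at all."""
    return sum(1 for doc in docs if not doc)


# Toy corpus: each document is a set of words (illustrative only).
docs = [
    {"web", "page", "football"},
    {"web", "page", "recipe"},
    {"web", "search", "engine"},
    {"page", "search", "hockey"},
    {"web", "page", "search"},
]

df = document_frequency(docs)
k = 3
frequent = set(sorted(df, key=lambda w: (-df[w], w))[:k])  # DF filtering
rare = set(sorted(df, key=lambda w: (df[w], w))[:k])       # stand-in for a rare-word filter

print("empty after DF filter:  ", count_empty(keep_features(docs, frequent)))
print("empty after rare filter:", count_empty(keep_features(docs, rare)))
```

With the same number of retained features, the frequent-word filter leaves every document non-empty, while the rare-word stand-in empties some of them.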

Figure 3.5 shows the distribution of the best calibration approach over different categories. Isotonic regression clearly dominates Platt scaling at all four levels. However, Platt scaling becomes more and more popular as the level goes deeper. In our experiments, we found that the step-wise isotonic regression is more likely to produce well-calibrated probabilities when the number of training examples is large, while Platt scaling is less sensitive to the data size since it fits the data to a sigmoid shape. This is why the deep categories (with fewer training examples) prefer Platt scaling to isotonic regression as the calibration approach.
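The contrast between the two calibration approaches can be sketched in a few lines. The isotonic fit below uses the standard pool-adjacent-violators procedure; the sigmoid fit is a simplified stand-in for Platt scaling (plain gradient descent on log loss, without the target smoothing of Platt's original procedure). The scores, labels, and function names are hypothetical.

```python
import math


def isotonic_fit(labels_sorted_by_score):
    """Pool adjacent violators: non-decreasing step-wise fit to 0/1 labels.

    The labels must be ordered by increasing classifier score.
    """
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels_sorted_by_score:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def sigmoid_fit(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = 1 / (1 + exp(-(a * s + b))) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical classifier scores (sorted) and their true 0/1 labels.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

step_probs = isotonic_fit(labels)
a, b = sigmoid_fit(scores, labels)
smooth_probs = [1.0 / (1.0 + math.exp(-(a * s + b))) for s in scores]
print("isotonic:", step_probs)
print("sigmoid: ", [round(p, 3) for p in smooth_probs])
```

The isotonic fit has one free value per step, so it needs enough examples to estimate each step reliably, whereas the sigmoid has only two parameters regardless of the data size.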

Figure 3.4: The distribution of best feature filter over different categories.

Figure 3.6: The distribution of best parameter C over different categories.

Figure 3.6 and Figure 3.7 show the distribution of the two LibLinear parameters (C and w₊). Generally, the higher the level, the smaller the C and the larger the w₊. C controls how closely the model fits the training data: usually, the larger the C, the better the model fits the training data. For the categories in the higher levels, the training sets are often large and a large C can easily cause the model to overfit the training data; thus high-level categories favor a smaller C (e.g., C = 1). We can use w₊ to balance the fitting of positive and negative training examples. A higher w₊ leads to a higher penalty when a positive example is inconsistent with the model. At the high levels, because we tolerate inconsistency between the model and the training data by choosing a smaller C, we want positive examples to be more consistent with the model, since the positive (rare) class is often more important in class-imbalanced problems. However, for the low-level categories, more consistency between the model and the (smaller) training data usually results in high classification performance. In this case, w₊ has less effect, since there are fewer training inconsistencies for both the positive and negative classes. This is why a larger w₊ is chosen at higher levels.
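The roles of C and w₊ can be made concrete with a sketch of a class-weighted, L2-regularized objective of the kind LibLinear minimizes. The logistic loss is an assumption here, since the text does not state which LibLinear solver was used; the function name and the one-dimensional examples are hypothetical.

```python
import math


def weighted_objective(w, b, X, y, C, w_pos):
    """0.5 * ||w||^2 + C * sum_i weight_i * logistic_loss_i.

    Labels are +1 / -1. Positive examples are weighted by `w_pos`,
    negative examples by 1. The logistic loss is an assumed choice.
    """
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        weight = w_pos if yi == 1 else 1.0
        loss += weight * math.log(1.0 + math.exp(-margin))
    return reg + C * loss


model_w, model_b = [1.0], 0.0

# A positive example on the wrong side of the model: w_pos scales its penalty.
pos_low = weighted_objective(model_w, model_b, [[-1.0]], [1], C=1.0, w_pos=1.0)
pos_high = weighted_objective(model_w, model_b, [[-1.0]], [1], C=1.0, w_pos=5.0)

# A negative example on the wrong side of the model: w_pos has no effect.
neg_low = weighted_objective(model_w, model_b, [[1.0]], [-1], C=1.0, w_pos=1.0)
neg_high = weighted_objective(model_w, model_b, [[1.0]], [-1], C=1.0, w_pos=5.0)

print(pos_low, pos_high)
print(neg_low, neg_high)
```

A smaller C shrinks the whole loss term relative to the regularizer, while a larger w₊ restores the penalty for positive examples only; when few examples are inconsistent with the model, both loss sums are small and w₊ has little influence.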

In document A New Web Search Engine with Learning Hierarchy (Page 62-65)