
4.3 Web Page Classification Performance Evaluation

4.3.2 Evaluation Methodology

We evaluate classification performance using the parameters discussed earlier for each classification method. To keep the dataset consistent with prior knowledge of browser usage in real-world traces, we randomly sample our 100,350 captured web page downloads by browser, using the weights from Table 4.2. We then use this data to evaluate web page classification under four independent labeling schemes: video streaming-based (VSL), web page navigation-based (WNL), target device-based (TDL), and Alexa genre-based (AGL). We conduct 10 independent 10-fold cross-validation trials, in which 90% of the dataset is used for training and 10% for testing. During this process, we ensure that samples of the same web page are never included in both the training and test sets. Some web pages have multiple genre-based labels (e.g., the Cnn.com web page is classified as both Arts and News); in such cases, we randomly select a single label for each cross-validation trial.

We next consider which metrics are best suited for evaluating the different classification methods. The most basic metric used for classification problems is perhaps accuracy [Erman et al., 2006, Dyer et al., 2012]. This metric is defined as:

$$\text{Accuracy} = \frac{\sum_{i=1}^{L} tp_i}{N} \tag{4.3}$$

where $L$ is the number of classes considered, $N$ is the number of samples, and $tp_i$ is the number of true positives for class $i$ [Erman et al., 2006]. Although accuracy applies to any classification problem and is easy to interpret, it is not ideal for performance evaluation on its own because it does not account for how other factors, such as the number of false negatives and false positives, affect the performance of different methods. Other prominent metrics that do account for these include precision, recall, and the F-score [Chase Lipton et al., 2014, Erman et al., 2006, Schatzmann et al., 2010]. These metrics, which are functions of the number of true positives ($tp$), false positives ($fp$), and false negatives ($fn$), are defined below:

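$$\text{Precision} = \frac{tp}{tp + fp} \tag{4.4}$$

$$\text{Recall} = \frac{tp}{tp + fn} \tag{4.5}$$

$$\text{Fscore} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.6}$$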
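As a minimal sketch of how accuracy, precision, recall, and the F-score follow from raw counts, the Python functions below compute each one, taking the F-score as the harmonic mean of precision and recall. The function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here, not a convention stated in this section.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    # Assumption: return 0.0 when nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    # Assumption: return 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp_per_class, n_samples):
    """Sum of per-class true positives divided by the number of samples."""
    return sum(tp_per_class) / n_samples


# Hypothetical counts for one class: 8 true positives, 2 false positives,
# and 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
f = f_score(8, 2, 8)
```

With these counts the precision is high while the recall is only one half, and the F-score falls between the two, closer to the lower value. This illustrates why precision and recall are usually reported alongside the F-score.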

Most prior work that evaluates classification methods uses the F-score as the single metric for determining the best-performing method overall. This is because the F-score is the harmonic mean of precision and recall, which together encapsulate the number of true positives, false positives, and false negatives in a single metric. One weakness of the F-score is that it is difficult to interpret, since it is a function of multiple metrics. For this reason, most prior work that reports the F-score also reports precision and recall to provide intuition about how each influences the F-score [Schatzmann et al., 2010, Ihm and Pai, 2011]. Another weakness is that the F-score, like precision and recall, is only defined for binary classification problems, that is, problems with exactly two classes. This is a problem for our evaluation because two of our labeling schemes (WNL and AGL) include more than two classes. Instead, we use modified versions of the F-score, called the micro and macro F-scores, that also apply to classification problems with two or more classes [Chase Lipton et al., 2014]. These metrics are defined below:

• Micro F-score: The micro F-score, $MiF$, is a function of the micro precision, $MiP$, and micro recall, $MiR$, computed over the $L$ classes in a classification problem. For each class $i$, we define $tp_i$ as the number of true positives, $fp_i$ as the number of false positives, and $fn_i$ as the number of false negatives. $MiP$, $MiR$, and $MiF$ are defined below:

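$$MiP = \frac{\sum_{i=1}^{L} tp_i}{\sum_{i=1}^{L} (tp_i + fp_i)} \tag{4.7}$$

$$MiR = \frac{\sum_{i=1}^{L} tp_i}{\sum_{i=1}^{L} (tp_i + fn_i)} \tag{4.8}$$

$$MiF = 2 \cdot \frac{MiR \times MiP}{MiP + MiR} \tag{4.9}$$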

Because $MiP$ and $MiR$ pool the counts of all classes before dividing, $MiF$ is biased towards classes that make up a large fraction of the dataset.

• Macro F-score: The macro F-score, $MaF$, is a function of the macro precision, $MaP$, and macro recall, $MaR$. The macro precision and recall are the averages of the individual precisions and recalls of the $L$ classes in the classification problem.

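$$MaP = \frac{\sum_{i=1}^{L} \frac{tp_i}{tp_i + fp_i}}{L} \tag{4.10}$$

$$MaR = \frac{\sum_{i=1}^{L} \frac{tp_i}{tp_i + fn_i}}{L} \tag{4.11}$$

$$MaF = 2 \cdot \frac{MaR \times MaP}{MaP + MaR}$$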

$MaF$ averages performance over the classes with equal weight, so it is not biased towards labels that make up a large fraction of the dataset.
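The difference between the two averages can be illustrated with a short Python sketch, assuming per-class counts are available as (tp, fp, fn) tuples. The function names and the two-class example are hypothetical, and treating a zero denominator as 0.0 (for instance, the precision of a class that is never predicted) is an assumption rather than a convention stated in this section.

```python
def _ratio(num, den):
    # Assumption: an undefined ratio (zero denominator) counts as 0.0.
    return num / den if den else 0.0


def micro_f(counts):
    """Micro F-score: pool tp, fp, and fn over all classes, then divide."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    mip = _ratio(tp, tp + fp)
    mir = _ratio(tp, tp + fn)
    return _ratio(2 * mir * mip, mip + mir)


def macro_f(counts):
    """Macro F-score: average per-class precision and recall, then combine."""
    n_classes = len(counts)
    map_ = sum(_ratio(tp, tp + fp) for tp, fp, fn in counts) / n_classes
    mar = sum(_ratio(tp, tp + fn) for tp, fp, fn in counts) / n_classes
    return _ratio(2 * mar * map_, map_ + mar)


# Hypothetical dataset of 100 samples: 90 in a dominant class and 10 in a
# rare class, with every sample predicted as the dominant class.
counts = [
    (90, 10, 0),  # dominant class: all 90 found, 10 rare samples mislabeled
    (0, 0, 10),   # rare class: never predicted, so all 10 are missed
]
mif = micro_f(counts)
maf = macro_f(counts)
```

In this example the micro F-score stays high because the dominant class contributes most of the pooled counts, while the macro F-score drops to roughly half of that because the rare class, which is never classified correctly, carries the same weight as the dominant one.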

Although we determine the best classification method using the macro and micro F-scores, we also report precision, recall, and accuracy for each labeling scheme for the best-performing classification method, since these are easier to interpret than F-scores. The mean performance for each metric above, $\bar{M}$, over the 10 cross-validation trials is computed as $\bar{M} = \frac{1}{10} \sum_{i=1}^{10} M_i$, where $M_i$ is the value of the metric in trial $i$.
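The evaluation protocol described in this section (repeated 10-fold cross validation in which samples of the same web page never appear in both the training and test sets, one randomly chosen label per multi-label page in each trial, and a mean taken over the trials) can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `grouped_folds`, `mean_over_trials`, and `evaluate` are hypothetical, pages are assigned to folds round-robin after shuffling (so the 90%/10% split is approximate when pages have unequal sample counts), and each trial's metric is assumed to be the mean over its folds, which this section does not specify.

```python
import random
from collections import defaultdict


def grouped_folds(page_ids, k, rng):
    """Split sample indices into k folds; all samples of a page share a fold."""
    by_page = defaultdict(list)
    for idx, page in enumerate(page_ids):
        by_page[page].append(idx)
    pages = sorted(by_page)  # sorted first so the shuffle is reproducible
    rng.shuffle(pages)
    folds = [[] for _ in range(k)]
    for j, page in enumerate(pages):
        folds[j % k].extend(by_page[page])
    return folds


def mean_over_trials(page_ids, page_labels, evaluate, n_trials=10, k=10, seed=0):
    """Average a metric over n_trials independent k-fold trials.

    page_labels maps each page to its set of candidate labels.
    evaluate(train_idx, test_idx, labels) is a hypothetical callback that
    trains on train_idx, tests on test_idx, and returns one metric value.
    """
    rng = random.Random(seed)
    trial_scores = []
    for _ in range(n_trials):
        # Pick a single label per page for this trial.
        chosen = {p: rng.choice(sorted(page_labels[p])) for p in sorted(page_labels)}
        labels = [chosen[p] for p in page_ids]
        fold_scores = []
        for test_idx in grouped_folds(page_ids, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(page_ids)) if i not in held_out]
            fold_scores.append(evaluate(train_idx, test_idx, labels))
        # Assumption: a trial's metric is the mean over its k folds.
        trial_scores.append(sum(fold_scores) / k)
    return sum(trial_scores) / n_trials
```

Grouping by page before assigning folds is what keeps a page's samples on one side of each split; splitting individual samples at random would not give that guarantee.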

For comparison, we also include results for a baseline heuristic called random guessing (RG), which randomly assigns a label to a web page. We also add a more challenging baseline heuristic called apriori guessing (AG), which relies on prior knowledge of the most common class. Apriori guessing always assigns the label of the class that appears most often in the dataset; its expected accuracy can therefore be expressed as $\max_{i \in [1, L]} \{F_i\}$, where $F_i$ is the fraction of samples in the dataset that belong to class $i$ (obtained from Table 4.1). Outperforming the apriori guessing baseline, using any metric, means that the classification performance is not biased towards dominant classes in the dataset. Outperforming these baseline heuristics also shows that web page classification using anonymized TCP/IP headers is feasible.
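The two baselines can be sketched in a few lines of Python. The function names and the 70/30 label split below are hypothetical (the actual class fractions come from Table 4.1 and are not reproduced here), and drawing random guesses uniformly over the classes is an assumption, since this section does not state the distribution used.

```python
import random
from collections import Counter


def apriori_guess(labels):
    """Return the most frequent label, which AG assigns to every web page."""
    return Counter(labels).most_common(1)[0][0]


def random_guess(classes, n, rng):
    """Assign one of the classes to each of n web pages.

    Assumption: labels are drawn uniformly at random.
    """
    return [rng.choice(classes) for _ in range(n)]


def fraction_correct(y_true, y_pred):
    """Accuracy: share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical two-class dataset in which class_a makes up 70% of the samples.
y_true = ["class_a"] * 70 + ["class_b"] * 30
majority = apriori_guess(y_true)
ag_accuracy = fraction_correct(y_true, [majority] * len(y_true))

rng = random.Random(0)
rg_pred = random_guess(["class_a", "class_b"], len(y_true), rng)
rg_accuracy = fraction_correct(y_true, rg_pred)
```

Here `ag_accuracy` equals the fraction of the dominant class, matching the expression $\max_{i} \{F_i\}$, whereas the accuracy of uniform random guessing fluctuates around $1/L$ regardless of the class distribution.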
