Split-at-Random Benchmarks - Experimental Results

6.2 Experimental Results

6.2.2 Split-at-Random Benchmarks

In this section, we analyze the performance of the models trained on the benchmarks split at random.

We report the best results obtained on the threshold and test batches. The former can be thought of as validation performance, while the latter tells us if the model just got lucky or cheated on the threshold batch. Due to space limitations, we specify the selected hyperparameter configurations for each model in AppendixC.

Table17 presents the precision-recall (PR) curves achieved by each model on the threshold batch. As we can see, for all three datasets, the AGGR_T and AGGR_3T models get much closer to the top right corner (precision = 1, recall = 1) than the

(a) Fit time

(b) Predict time

Figure 19: Relative ranking of the algorithms by their fit and predict times

NO_AGGR models. Furthermore, the AGGR_3T models seem to outperform the AGGR_T models.

Table 18 presents the values of precision, recall, F1-score and selected threshold

obtained by each model on the test batch. Keep in mind that, by our design, the threshold batch has a fixed contamination rate of 5%; therefore, the correct detection threshold value is always 95% percent. The selected threshold tells us if the model achieves a high precision or recall by virtue of performing well or just making a tradeoff.

The results from Table 18suggest that:

• Overall, we observe comparable F1-scores for CTU-13 and CICIDS 2017. This

is somewhat surprising, as we expected to see better performance on CICIDS 2017 with its emulated network traffic. We observe the best performance results for IoT-23. We believe this is because IoT devices generally have a simpler network profile. If we compare the performance between the threshold and test batches, we can see that they are, in fact, higher on the latter. This can be explained by the fact that the test batch has a higher percentage of anomalies (10% vs 5%), which seems to help the models achieve higher precision values. • As hinted by the PR curves, for all three datasets, we observe a significant im-

provement in performance for the AGGR models compared to the NO_AGGR models. This is most noticeable for the IoT-23 dataset. The numbers seem to suggest that three-minute aggregation yields better results than one-minute aggregation. However, this may be affected by the fact that there are fewer data points. Another interesting observation is that the AGGR models are also more successful at selecting the detection threshold (the values are closer to the true 95%).

• The autoencoding models, in particular VAE and AE, seem to consistently outperform IF and MAH in almost all cases, although the improvement is not always significant. On the other hand, AEGMM shows inconsistent performance, ranging from the best-performing model to the worst. In a few cases, the model

NO_AGGR AGGR_T AGGR_3T

CTU-13

CICIDS

IoT-23

Table 17: Precision-recall curves obtained on the threshold batches. For each model we choose the detection threshold that maximizes the F1-score on the threshold

THRESHOLD BATCH

CTU-13 NO_AGGR AGGR_M AGGR_3M