The foundations of experimental evaluation were laid in the Information Retrieval field through the well-known Cranfield paradigm, in the years 1958-1966 [8, 9]. This discipline studies the effectiveness measures and the experimental tests that make it possible to determine how a system behaves in the different information retrieval situations in which it could be used [27]. These tests make it possible to evaluate the performance of the models and to select the best solution for the development scenario. In particular, starting from a corpus of items called the pool, the output of the retrieval system, called the run, can be evaluated on the basis of the contingency table (Table 5.1) through the following measures.
Precision. Indicates how many relevant documents have been retrieved in response to a query compared to all those retrieved:
$$ \mathrm{Precision} = \frac{TP}{TP + FP}. $$
Recall. Indicates how many relevant documents have been retrieved in response to a query compared to all the relevant ones in the collection:
$$ \mathrm{Recall} = \frac{TP}{TP + FN}. $$
F-Measure (F1). The harmonic mean of Precision and Recall:
$$ F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. $$
                 Relevant               Not Relevant
Retrieved        True Positive – TP     False Positive – FP
Not Retrieved    False Negative – FN    True Negative – TN
Table 5.1: Contingency table.
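As a minimal sketch of how the three measures follow from the cells of the contingency table, the snippet below computes them from raw counts. The function names and the example counts are illustrative assumptions, as is the convention of returning 0.0 when a denominator is zero.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was retrieved."""
    retrieved = tp + fp
    return tp / retrieved if retrieved > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); 0.0 by convention if nothing is relevant."""
    relevant = tp + fn
    return tp / relevant if relevant > 0 else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical counts: 8 relevant items retrieved, 2 non-relevant items
# retrieved, 4 relevant items missed.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f = f1_score(p, r)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Note that the true negatives do not enter any of the three measures, which is why they remain usable when the number of non-relevant items is very large.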
Based on such measures, it is possible to define further metrics.
Average Precision (AP). A global precision measure of the run produced by the system. It can be calculated as the average of all the precision values defined on the run or as the average of the precision values interpolated at fixed cut-off levels:
$$ AP = \frac{1}{N} \cdot \sum_{i=0}^{N} \mathrm{Precision}_i. $$
Mean Average Precision (mAP). The average of the APs calculated for each topic t in the set T of topics of the collection:
$$ mAP = \frac{1}{|T|} \cdot \sum_{t \in T} AP_t. $$
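The two aggregated metrics can be sketched as follows. The sketch follows the first variant described above (the plain average of the precision values defined on the run, one per rank position), not the interpolated one. The helper names, the topic identifiers and the relevance flags are hypothetical.

```python
from typing import Dict, List, Sequence


def precisions_at_ranks(relevant: Sequence[bool]) -> List[float]:
    """Precision measured after each position of a ranked run."""
    values = []
    hits = 0
    for rank, is_relevant in enumerate(relevant, start=1):
        hits += int(is_relevant)
        values.append(hits / rank)
    return values


def average_precision(precisions: Sequence[float]) -> float:
    """AP as the plain average of the given precision values."""
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(ap_per_topic: Dict[str, float]) -> float:
    """mAP as the average of the per-topic AP values."""
    if not ap_per_topic:
        return 0.0
    return sum(ap_per_topic.values()) / len(ap_per_topic)


# Hypothetical runs: one relevance flag per ranked item, per topic.
runs = {
    "t1": [True, False, True, True],
    "t2": [False, True],
}
aps = {t: average_precision(precisions_at_ranks(r)) for t, r in runs.items()}
m_ap = mean_average_precision(aps)
print(aps, m_ap)
```

Other definitions of AP average the precision only at the ranks where a relevant item appears; swapping that in would only change `precisions_at_ranks`.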
The metrics defined above are the main model evaluation tools in the information retrieval field, but many other metrics can be used alongside them to extend the analysis to adjacent areas of investigation. These are generally placed in a specific evaluation pipeline that depends on the use case considered. Below we discuss the main model evaluation pipeline in the object detection field.
5.1.1 Experimental Evaluation in Object Detection
Experimental evaluation can easily be extended to other dimensions of investigation. In the object detection field, in particular, these metrics must also be contextualized with respect to the predictions of the models [39]. Precision, Recall, F-Measure and the aggregated metrics are in fact defined relative to a further metric, IoU (Intersection over Union), which measures how much two bounding boxes or masks overlap. Specifically, in most cases in the literature, the best model is selected through an average precision measure computed at a previously defined intersection over union threshold (mAP@IoU).
Intersection over Union (IoU). Given a ground truth bounding box $B_{gt}$ and a predicted one $B_p$, the IoU is the ratio between the area of the intersection of the two boxes and the area of their union:
$$ IoU = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}. $$
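For axis-aligned bounding boxes the ratio can be computed directly from the corner coordinates. The sketch below assumes boxes given as (x_min, y_min, x_max, y_max) tuples, a layout that is an assumption of this example and is not fixed by the text. The area of the union is obtained as the sum of the two areas minus the intersection.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(b_p: Box, b_gt: Box) -> float:
    """Intersection over Union of a predicted and a ground truth box."""
    # Corners of the intersection rectangle (empty if the boxes are disjoint).
    inter_box = (
        max(b_p[0], b_gt[0]),
        max(b_p[1], b_gt[1]),
        min(b_p[2], b_gt[2]),
        min(b_p[3], b_gt[3]),
    )
    inter = box_area(inter_box)
    union = box_area(b_p) + box_area(b_gt) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping on a 1x1 square: IoU = 1 / (4 + 4 - 1).
example = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(example)
```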
The effects of the IoU threshold are studied differently depending on the international challenge considered (e.g., PASCAL VOC [16], ILSVRC [62], MS COCO [46]), since different challenges have different evaluation rules. In particular, while PASCAL VOC uses mAP@IoU = 0.5 as its evaluation metric, MS COCO uses mAP@IoU = [0.5 : 0.95], an aggregated metric over different degrees of intersection over union. The evaluation steps of a classic object detection challenge are therefore the following for every submitted model.
1. Model predictions are computed over a reference test set.
2. Predictions are sorted by decreasing confidence.
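The sorting step can be sketched as below for a single image and a single class. The way each sorted prediction is then flagged as a true or false positive (greedy matching to the best still-unmatched ground truth box at an IoU threshold) is an assumption of this example, chosen because it is one common convention. The text does not specify the matching rule, and the function names and data are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_detections(preds: List[Tuple[Box, float]],
                     gts: List[Box],
                     iou_threshold: float = 0.5) -> List[bool]:
    """Return True (TP) / False (FP) flags in decreasing-confidence order."""
    ranked = sorted(preds, key=lambda p: p[1], reverse=True)
    matched = set()  # indices of ground truth boxes already assigned
    flags = []
    for box, _score in ranked:
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gts):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags


# Hypothetical data: one ground truth box and three scored predictions.
gts = [(0.0, 0.0, 10.0, 10.0)]
preds = [((0.0, 0.0, 10.0, 10.0), 0.6),
         ((1.0, 1.0, 10.0, 10.0), 0.9),
         ((20.0, 20.0, 30.0, 30.0), 0.8)]
flags = label_detections(preds, gts)
print(flags)
```

In this toy case the most confident prediction claims the only ground truth box, so the exact-overlap prediction with lower confidence is counted as a false positive; this is the effect that sorting by confidence is meant to produce.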
(Mean) Average Recall (mAR). A global recall measure of the run produced by the system. The average recall (AR) of a class is the maximum recall obtained for that class, and mAR is the average of these maxima over all classes:
$$ AR = \max_{i=0}^{N} \mathrm{Recall}_i, \qquad mAR = \frac{1}{|T|} \cdot \sum_{t \in T} AR_t. $$
(Mean) Average F-Measure (mAF1). A global F-Measure of the run produced by the system. It can be calculated as the F-Measure of the aggregated precision and recall metrics:
$$ mAF_1 = 2 \cdot \frac{mAP \cdot mAR}{mAP + mAR}. $$
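A minimal sketch of the two definitions: AR as the maximum recall reached for a class, mAR as the average of the per-class values, and mAF1 as the harmonic mean of mAP and mAR. The class names, the recall values and the mAP value are hypothetical.

```python
from typing import Dict, Sequence


def average_recall(recalls: Sequence[float]) -> float:
    """AR of one class: the maximum recall reached along the run."""
    return max(recalls) if recalls else 0.0


def mean_average_recall(ar_per_class: Dict[str, float]) -> float:
    """mAR: the average of the per-class AR values."""
    if not ar_per_class:
        return 0.0
    return sum(ar_per_class.values()) / len(ar_per_class)


def mean_average_f1(m_ap: float, m_ar: float) -> float:
    """mAF1: harmonic mean of mAP and mAR; 0.0 if both are zero."""
    total = m_ap + m_ar
    return 2 * m_ap * m_ar / total if total > 0 else 0.0


# Hypothetical recall values measured along the run, per class.
recalls_per_class = {
    "class_a": [0.2, 0.4, 0.6],
    "class_b": [0.5, 0.5, 0.8],
}
ars = {c: average_recall(r) for c, r in recalls_per_class.items()}
m_ar = mean_average_recall(ars)
m_af1 = mean_average_f1(0.5, m_ar)  # 0.5 is a hypothetical mAP
print(ars, m_ar, m_af1)
```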
Wilcoxon signed rank. For model selection, the Wilcoxon signed-rank test is a useful tool for comparing model performance. It is a non-parametric hypothesis test that compares the paired series of measurements produced by two different systems. In this context, the test is used to determine whether or not the predictions of two models come from the same probability distribution.
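To make the test concrete, the following dependency-free sketch computes the signed-rank statistic and a two-sided p-value for two paired series of scores. It uses the normal approximation without tie or continuity correction, which is a simplification assumed for this example. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead. The score values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Return (statistic, two-sided p-value) for paired samples x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired (same length)")
    # Zero differences carry no sign information and are discarded.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    # Normal approximation of the null distribution of the statistic.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (statistic - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return statistic, p_value


# Hypothetical per-class scores of a submitted model and of a target model.
submitted = [0.62, 0.70, 0.55, 0.81, 0.66, 0.74, 0.59, 0.77]
target = [0.60, 0.65, 0.56, 0.74, 0.63, 0.66, 0.50, 0.73]
stat, p = wilcoxon_signed_rank(submitted, target)
print(stat, p)
```

A small p-value leads to rejecting the null hypothesis that the paired differences are symmetrically distributed around zero, that is, to concluding that the two series differ systematically. With few pairs, as here, an exact test is preferable to the normal approximation.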
Based on these metrics, it is possible to define the following extended evaluation pipeline in the Customer use-case.
1. Model predictions are computed over a reference test set.
2. Mean average precision is computed by averaging the APs of all classes at a target IoU.
3. Mean average recall and mean average F-Measure are computed by averaging over all classes at a target IoU.
4. A final measure for mAP@IoU = [0.5 : 0.95], mAR@IoU = [0.5 : 0.95] and mAF1@IoU = [0.5 : 0.95] is computed by aggregating the previous measures over a target set of IoU thresholds.
5. The Wilcoxon signed-rank test is used to compare the submitted model against a target one, in order to ensure that the predictions do not come from the same distribution.
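The aggregation over the IoU threshold set can be sketched as follows. The step of 0.05 between 0.5 and 0.95 follows the MS COCO convention and is assumed here for the Customer pipeline. The functions `toy_map_at` and `toy_mar_at` are purely illustrative stand-ins for a real evaluator that would compute mAP and mAR at a given threshold.

```python
from typing import Callable, List, Sequence


def iou_thresholds(start: float = 0.5, step: float = 0.05,
                   count: int = 10) -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 (MS COCO style, assumed)."""
    return [round(start + step * i, 2) for i in range(count)]


def aggregate_over_iou(metric_at: Callable[[float], float],
                       thresholds: Sequence[float]) -> float:
    """Average a per-threshold metric over the threshold set."""
    return sum(metric_at(t) for t in thresholds) / len(thresholds)


def f1(p: float, r: float) -> float:
    """Harmonic mean of two values; 0.0 if both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Illustrative stand-ins: a real evaluator would run matching at `thr`.
def toy_map_at(thr: float) -> float:
    return 1.0 - thr


def toy_mar_at(thr: float) -> float:
    return 1.1 - thr


thresholds = iou_thresholds()
map_agg = aggregate_over_iou(toy_map_at, thresholds)
mar_agg = aggregate_over_iou(toy_mar_at, thresholds)
# mAF1 is computed at each threshold and then averaged, as in step 4.
maf1_agg = aggregate_over_iou(
    lambda t: f1(toy_map_at(t), toy_mar_at(t)), thresholds)
print(thresholds, map_agg, mar_agg, maf1_agg)
```

Averaging mAF1 per threshold is one reading of step 4; computing the F-Measure of the two aggregated values instead would give a slightly different number.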
In the following sections, the experimental evaluation of the models in the Customer scenario starts from the pipeline defined above and integrates metrics built ad hoc to fine-tune the network architecture and determine the best model.