
Besides evaluation data, performance metrics are the second fundamental pillar of machine learning evaluation methodology. Metrics aim to approximate the quality of a system's performance without the need to conduct human studies to judge its behaviour, or to deploy a model in the intended downstream application to observe its impact. Consequently, metrics are required to be automatic, fast to compute, and ideally to summarise performance in a single benchmark score to facilitate comparison with other models. The following paragraphs review the problem of capturing the quality of generative models in single-number metrics, and the statistical fallacies and inadequacies of comparing performance scores.

Performance metrics for generative models². Discriminative models are relatively straightforward to assess, as there are well-understood metrics like accuracy or precision/recall which are easy to interpret. In contrast, the output space for generative tasks is high-dimensional and the expected response is not well-defined, meaning that there is no single correct output. As a consequence, it is unclear how to quantify distances between potential outputs in a meaningful way, and what best characterises the quality and appropriateness of ‘good’ outputs. Theis et al. (2016) illustrated that three common metrics for generative image algorithms like GANs are largely independent for high-dimensional data, and Lučić et al. (2018) noted how a “memory GAN” just reproducing the training data would score perfectly in most current evaluations. Barratt and Sharma (2018) identified a range of problems with the recently introduced Inception score, both with the metric itself and with its popular adoption by the vision community, emphasising the need for “meaningful evaluation metrics” over “ad-hoc metrics” (Barratt and Sharma, 2018). Xu et al. (2018) assessed a range of common metrics for GANs for desirable characteristics like distinguishing generated from real images, sensitivity to mode dropping/collapsing, or detecting overfitting, and found that popular metrics did not overall cover these aspects well.

² Here and in the following, “generative model” refers to the informal usage of the term in the context of deep learning as a model which generates new data (images, language, etc.), as opposed to a “discriminative model” which produces classification labels from a fixed set of categories. In traditional machine learning, these terms are more narrowly formally defined: considering a function f : x → y to be learned (e.g., a classifier), a generative model is a model of the joint probability distribution p(x, y) of inputs and outputs, whereas a discriminative model is a model of the conditional probability distribution p(y | x) of outputs given inputs.

In the case of image captioning, Anderson et al. (2016) noted how metrics are primarily sensitive to n-gram overlap with gold captions, which is neither necessary nor sufficient for improving human judgement of generated captions. Low correlation between captioning metrics and human judgement was already identified as a problem by Elliott and Keller (2014). Kilickaya et al. (2017) investigated the robustness/sensitivity of various metrics to synonyms, word order, phrase replacement and similar modifications, and found that semantically close captions may receive differing scores whereas captions with different meaning but surface similarity do not. Similar concerns about the correlation with human judgement were expressed and experimentally confirmed by Liu et al. (2016) for the evaluation of dialogue systems, and by Sulem et al. (2018) for text simplification. One possible reason for this mismatch is that metrics like BLEU, METEOR or ROUGE originated from machine translation and were adopted for the respective tasks despite differences in what qualifies as a good solution. However, Callison-Burch et al. (2006) noted early on that even machine translation is overly reliant on BLEU, although improvements in BLEU are neither necessary nor sufficient for improved translation quality, and pointed out use cases where BLEU should and should not be used for evaluation. Furthermore, Post (2018) highlighted problems with changing parametrisation and reference processing schemes, resulting in substantially different performance scores, while Reiter (2018) reviewed reports of (non-)correlation of BLEU with human evaluation for language output quality assessment and concluded that its use outside of machine translation is questionable.
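The sensitivity to n-gram overlap can be illustrated with a minimal sketch. It computes only clipped n-gram precision against a single reference, which is one ingredient of BLEU, not the full metric (there is no brevity penalty, no geometric mean and no multiple references). The function names and example captions are made up for illustration and are not taken from any of the cited studies.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions, chosen only to illustrate the effect.
reference = "a man is riding a horse on the beach"
paraphrase = "a person rides a pony along the shore"  # similar meaning, different words
distortion = "a horse is riding a man on the beach"   # different meaning, same words

for name, caption in [("paraphrase", paraphrase), ("distortion", distortion)]:
    p1 = ngram_precision(caption, reference, 1)
    p2 = ngram_precision(caption, reference, 2)
    print(f"{name}: unigram precision={p1:.3f}, bigram precision={p2:.3f}")
```

In this toy case the paraphrase scores far lower than the caption whose meaning is reversed, which mirrors the kind of mismatch with human judgement described above.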

Originally used for speech recognition evaluation, perplexity is another common performance metric for generative prediction models. However, Smith (2012) pointed out problems with this measure: on the one hand, it unnecessarily requires the model to be probabilistic and the comparability of scores is highly sensitive to details of the event space; on the other hand, improved perplexity scores are known to not correlate well with actual error reduction in application tasks (Chang et al., 2009; Smith, 2012).
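To make the dependence on the event space concrete, the following sketch computes perplexity as the exponential of the average negative log-probability per event. The example is an assumption for illustration, not taken from Smith (2012): two hypothetical models assign the same total probability to a sentence, but one predicts it as 4 word-level events and the other as 16 character-level events.

```python
import math


def perplexity(event_probs):
    """Exponential of the average negative log-probability per predicted event."""
    avg_neg_log_prob = -sum(math.log(p) for p in event_probs) / len(event_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical numbers: both models give the whole sentence probability 1e-8,
# spread uniformly over their respective events.
total_prob = 1e-8
word_level = [total_prob ** (1 / 4)] * 4     # 4 word events
char_level = [total_prob ** (1 / 16)] * 16   # 16 character events

print(f"word-level perplexity: {perplexity(word_level):.2f}")
print(f"char-level perplexity: {perplexity(char_level):.2f}")
```

Although both hypothetical models are equally good at predicting the sentence as a whole, their perplexities differ by more than an order of magnitude, so the two scores are not directly comparable.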

Statistical flaws of interpreting performance scores. While the performance metrics for discriminative tasks are themselves well-defined, concerns have been raised repeatedly about statistically sound comparison between scores and what can be concluded from them. Ioannidis (2005) famously summarised the problems around the likelihood of statistically significant false-positive findings in the context of systematic biases, unreported failures and simultaneous experimentation by multiple research teams. Bennett et al. (2009) presented a striking example of a deliberately nonsensical experiment which investigated whether a dead salmon can correctly determine emotional state when shown photographs of humans. According to standard statistical analysis, the high-dimensional fMRI scans imply significantly positive results; the absurd conclusion, however, highlights how standard statistical thresholds are ineffective in controlling for multiple comparisons. Considering this problem in the context of machine learning, Demšar (2008) pointed out that the ease of generating new algorithms thanks to flexible machine learning frameworks, in combination with the practice of relying on significantly improved benchmark scores, implicitly encouraged many such false-positive findings. Arguably, deep neural networks and frameworks like TensorFlow and PyTorch nowadays allow for even more architecture variation than ten years ago.
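The multiple-comparisons problem can be reproduced in a small simulation. The sketch below runs many independent tests on pure noise, so every "significant" result is a false positive by construction. The test (a two-sided z-test with known unit variance), the number of tests and the Bonferroni correction are illustrative choices, not the analysis used in any of the cited works.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(num_tests=1000, sample_size=30, alpha=0.05, seed=0):
    """Run num_tests tests on null data and count rejections with and without correction."""
    rng = random.Random(seed)
    p_values = [
        two_sided_p([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        for _ in range(num_tests)
    ]
    uncorrected = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of comparisons.
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return uncorrected, bonferroni


uncorrected, bonferroni = count_false_positives()
print(f"significant without correction: {uncorrected} of 1000")
print(f"significant with Bonferroni correction: {bonferroni} of 1000")
```

Without correction, roughly 5% of the tests are expected to come out significant even though no effect exists. With the corrected threshold, false positives should be rare.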

An interesting theoretical analysis by Szucs and Ioannidis (2017) concluded that null hypothesis testing is unsuitable for large datasets, since increasing the sample size guarantees that the null hypothesis can eventually be rejected even with minuscule effect sizes. Reimers and Gurevych (2018) demonstrated a related effect by comparing an architecture with itself (a BiLSTM-CRF architecture on seven common NLP sequence tagging tasks). If the test is based on the predictions of two trained versions of this model, they found significant differences more frequently than the significance threshold of p = 0.05 would suggest. However, when comparing the “learning approach” with itself (that is, the performance score distribution of multiple training runs, which takes into account the various sources of randomness in modern ML training), the proportion of significantly different results is as expected at the 5% level. They concluded that the common approach of assessing significance based on a single run is problematic, and that randomness in modern ML can have substantial effects on model comparison.
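The effect of sample size on significance can be sketched with a simple calculation. The example below assumes a one-sample, two-sided z-test with unit variance in which the observed standardised mean difference equals a fixed, very small effect size. Both the test and the value 0.01 are hypothetical choices for illustration, not the setup analysed by Szucs and Ioannidis (2017).

```python
import math


def expected_p_value(effect_size, n):
    """Two-sided z-test p-value when the observed standardised mean
    difference equals effect_size exactly (unit variance assumed)."""
    z = effect_size * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))


effect_size = 0.01  # hypothetical, practically negligible effect
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  p={expected_p_value(effect_size, n):.3g}")
```

Since the test statistic grows with the square root of the sample size, the same negligible effect moves from clearly non-significant to overwhelmingly significant purely because more data is available.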

These concerns relate to problems that arise even when statistical comparison methodology is applied properly. However, a recent review of papers published at the Conference on Neural Information Processing Systems in 2017 (Király et al., 2018) assessed the mere completeness of argumentative steps, and found substantial shortcomings for most papers: besides missing baseline scores as reference, only around a third of the papers reported confidence intervals, and those without reference or explanation, while only 3% reported formal comparison/hypothesis testing.