Purpose of data - Evaluating visually grounded language capabilities using microworlds

3.1.2 Purpose of data

The function that data is supposed to fulfil within machine learning determines its ideal characteristics. Conversely, when introducing new data, it is important to take its intended purpose into account. The following distinctions present key aspects to consider:

• Fixed vs flexible source: The presentation of, interaction with, and availability of the data can follow either a more rigid or a more variable design.

• Application-driven vs hypothesis-driven structure: The higher-level structure of the data may be chosen with either a general task or a specific testable hypothesis in mind.

• Generic vs model-informed content: The content of the data may either make little to no assumptions or be tied to details of the model class or even instance.

In the following paragraphs, three common functions of data within machine learning are discussed with respect to these points: as training data, as comparative benchmark, and for in-depth evaluation.

Training. The most obvious use for data is to train machine learning models. Deep learning is comparatively insensitive to data quality but clearly profits from vast quantities of data points. It is further common practice to augment data in various ways, for instance by increasing the frequency of underrepresented classes, applying semantics-invariant transformations, adding auxiliary tasks on the same data, or leveraging related sources. All this suggests a preference for a flexible data source. Moreover, the structure of training data is largely determined by the application that the trained model is supposed to solve. Finally, while the content of training data is generally expected to be generic, to enable training of any type of model, model-dependent augmentation like the addition of adversarial examples is not uncommon.
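As an illustration of one of the augmentation strategies mentioned above, the following minimal sketch increases the frequency of underrepresented classes by random oversampling. The function name `oversample_to_balance` and the `(input, label)` pair format are assumptions made for this example only, and oversampling with replacement is just one of several ways to rebalance classes.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(examples, seed=0):
    """Duplicate randomly chosen examples of rarer classes until every
    class is as frequent as the most common one.

    `examples` is a list of (input, label) pairs; this format is an
    illustrative assumption, not something prescribed by the text.
    """
    if not examples:
        return []
    rng = random.Random(seed)

    # Group the examples by their label.
    by_label = defaultdict(list)
    for item in examples:
        by_label[item[1]].append(item)

    # Every class is filled up to the size of the largest class.
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Sample with replacement to close the gap to the largest class.
        balanced.extend(rng.choices(items, k=target - len(items)))

    rng.shuffle(balanced)
    return balanced


# Usage: a toy dataset with six examples of class 0 and two of class 1.
data = [("a", 0)] * 6 + [("b", 1)] * 2
balanced = oversample_to_balance(data)
print(Counter(label for _, label in balanced))
```

Because such a step changes the data on the fly, it presupposes a flexible rather than a fixed source.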

Comparative benchmarks. The relative qualities of machine learning models are usually assessed by comparing performance on benchmark data. Fair comparison requires standardisation of the evaluation procedure; consequently, a fixed data source is preferred here. This includes presenting data as a single dataset with an accompanying evaluation script, and eliminating model-unrelated confounding aspects which may affect results. In addition, such a dataset is ideally ‘temporally fixed’, that is, used over the years to facilitate comparison to older models, and offers only ‘fixed access’, that is, it limits the ability to repeatedly run evaluations to tune a model, a practice which ultimately leads to (community-wide) overfitting. Access can be limited, for instance, by restricting the number of submissions for evaluation on a withheld test dataset. Besides preventing overfitting, controlling access is also important to ensure the statistical validity of results given the problem of multiple comparisons, which, if not controlled for, are likely to yield some positive results just by chance. Benchmarks moreover require the structure to be driven solely by the problem in question, to preserve comparability despite changing hypotheses, which would otherwise intrinsically favour certain methods. Similarly, the content of a benchmark needs to be agnostic to any modelling aspects and cannot rely on properties like being probabilistic or using neural network techniques.
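The multiple-comparisons problem mentioned above can be made concrete with a small calculation. The sketch below assumes that each evaluation run is an independent significance test at level `alpha`, which is a simplification, since repeated runs on the same test set are typically correlated. The function names are chosen for this example, and the Bonferroni correction is shown only as one standard remedy, not as one proposed in the text.

```python
def family_wise_error_rate(alpha, num_comparisons):
    """Probability of at least one false positive among `num_comparisons`
    independent tests, each carried out at significance level `alpha`."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if num_comparisons < 0:
        raise ValueError("num_comparisons must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_comparisons


def bonferroni_threshold(alpha, num_comparisons):
    """Per-test significance level that keeps the family-wise error rate
    at or below `alpha` (Bonferroni correction)."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


# Usage: how quickly chance 'improvements' become likely.
for k in (1, 5, 20):
    print(k, round(family_wise_error_rate(0.05, k), 4))
```

Under these assumptions, 20 evaluations at a significance level of 0.05 already give a chance of roughly 0.64 of at least one spurious positive result, which is why restricting the number of submissions matters.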

In-depth evaluation. Data constitutes the only viable approach to investigate most deep learning models, which are otherwise hard to interpret. Detailed (as opposed to comparative) evaluation benefits from a flexible data source which facilitates controlling and adapting all aspects of the data. Since models may have different strengths and weaknesses, the data needs to be flexible enough to enable meaningful analysis in either case. Additionally, the structure of evaluation data is usually driven by changing hypotheses about which aspect of model behaviour is considered most interesting to assess more thoroughly. For instance, one may focus on relational reasoning as a suspected weakness and thus require data containing relational instances which challenge this capability. In particular, the structure here need not resemble an underlying task, but may test the limitations of a model on unrealistic and/or adversarial instances. Since a model usually introduces new techniques and architecture design decisions to address shortcomings of previous models, the content of data for in-depth evaluation is expected to be informed by these aspects to obtain the most convincing results. This may go as far as using details like model outputs or gradients to design adversarial evaluation instances.
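To illustrate what a flexible, hypothesis-driven data source could look like for the relational-reasoning example above, the following sketch generates captioned toy scenes whose truth value depends on a single configurable spatial relation. Everything here is a hypothetical simplification for this example: the shape vocabulary, the one-dimensional 'scene' of x-coordinates, and the function names are assumptions and do not describe any particular system.

```python
import random

# Hypothetical shape vocabulary for the toy scenes.
SHAPES = ("square", "circle", "triangle")


def generate_relational_instance(rng, relation="left of"):
    """Return (scene, caption, agreement) for one relational instance.

    The scene maps two distinct shape names to x-coordinates in [0, 1);
    `agreement` states whether the caption is true of the scene.
    """
    first, second = rng.sample(SHAPES, 2)  # two distinct shape types
    scene = {first: rng.random(), second: rng.random()}
    caption = f"The {first} is {relation} the {second}."

    if relation == "left of":
        agreement = scene[first] < scene[second]
    elif relation == "right of":
        agreement = scene[first] > scene[second]
    else:
        raise ValueError(f"unsupported relation: {relation}")
    return scene, caption, agreement


def generate_evaluation_set(size, relation="left of", seed=0):
    """Generate `size` instances targeting one relation, reproducibly."""
    rng = random.Random(seed)
    return [generate_relational_instance(rng, relation) for _ in range(size)]


# Usage: a small evaluation set focused on the relation 'right of'.
for scene, caption, agreement in generate_evaluation_set(3, "right of"):
    print(caption, agreement)
```

Changing the hypothesis under investigation then amounts to changing a generator parameter rather than collecting and annotating a new dataset.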

Current practice: monolithic datasets. Following the ML paradigm, the majority of recent work centres around monolithic datasets which serve as training data, as benchmark and, in some cases, as the basis for more detailed evaluation. The latter, however, is limited by the additional annotations a dataset provides, for instance, a more fine-grained categorisation of instance types which makes it possible to report performance per category. Where such annotations are missing, the remaining practice of qualitative evaluation by hand-picking a few illustrative examples is a questionable way to infer properties of a model. By concentrating on a single monolithic dataset, each of the aforementioned purposes of data suffers: (a) training: many different sources of training data could be utilised instead of just one; (b) comparative benchmark: the quality of benchmarking suffers from the lack of ‘fixed access’ and the consequent overfitting; and (c) in-depth evaluation: analyses are severely limited by the annotations a dataset provides, and are not at all informed by what motivated the design decisions for the analysed model.
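The per-category reporting mentioned above can be sketched as follows, assuming that every evaluation record carries a category annotation alongside the model prediction and the target. The record format and the category names in the usage example are hypothetical.

```python
from collections import defaultdict


def accuracy_per_category(records):
    """Compute accuracy separately for each annotated instance category.

    `records` is an iterable of (category, prediction, target) triples;
    this format is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, prediction, target in records:
        total[category] += 1
        correct[category] += int(prediction == target)
    return {category: correct[category] / total[category] for category in total}


# Usage: hypothetical categories and results.
records = [
    ("existential", True, True),
    ("existential", False, False),
    ("relational", True, False),
    ("relational", False, True),
    ("relational", True, True),
    ("relational", False, True),
]
print(accuracy_per_category(records))
```

In this toy example the aggregate accuracy of 0.5 hides that all errors fall into a single category, but such a breakdown is only possible when the dataset provides the category annotation in the first place.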
