3.1.3 Nature of data
Data comes in different forms, such as numbers, text or images. However, to be useful for current machine learning, a large set of equally-shaped data points is required. The shape is what unites the variety of instances for a task and thus characterises the space of valid data points: for instance, image captioning examples fundamentally consist of an image and an accompanying textual caption, possibly further constrained to be a single sentence, whereas a simple form of language inference may consist of two sentences – premise and hypothesis – plus an associated inference label indicating entailment/neutrality/contradiction. Not every point of this shape is valid, however: valid points are subject to various constraints, such as logical consistency, which rules out, for example, two sentences being entailing and contradicting at once. Exactly what constitutes a valid example of a real-world problem can be hard to specify formally, to the point of defaulting to an “I know it when I see it” specification. Importantly though, a more concrete characterisation of the underlying structure of data provides a framework for how to reason about it. For instance, syntactic theory identifies useful components like words or phrases and rules like valid phrase compositions, which inform the processing and analysis of natural language data. Finally, some instances may be more common than others, which is captured by the distribution over the space, specifying the relative frequency of valid data points. An explicit space structure helps to formulate this distribution in a more meaningful way, for example as a distribution over words instead of over (mostly unique) sentences.
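The last point of the paragraph above, that a distribution over components is more informative than one over whole instances, can be illustrated with a minimal sketch. The toy corpus and the helper `relative_frequencies` are made up for illustration and are not part of the original text.

```python
from collections import Counter

# Hypothetical toy corpus; the sentences are invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
    "the cat sat on the mat",
]


def relative_frequencies(items):
    """Map each distinct item to its relative frequency among `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


# Distribution over whole sentences: in realistic corpora almost every
# sentence occurs once, so this says little beyond "mostly unique".
sentence_dist = relative_frequencies(corpus)

# Distribution over words: components recur across sentences, so their
# relative frequencies carry more information about the data space.
word_dist = relative_frequencies(
    word for sentence in corpus for word in sentence.split()
)

if __name__ == "__main__":
    for word, freq in sorted(word_dist.items(), key=lambda kv: -kv[1]):
        print(f"{word:8s} {freq:.3f}")
```

Splitting on whitespace is a deliberate simplification; any other choice of component (n-grams, syntactic or semantic representations) would slot into the same helper.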
The nature of data is a defining property in this context, with natural real-world data on one end versus artificial abstract data on the other. Besides these two ‘extremes’, I introduce a third category, referred to as semi-artificial data, which, I argue, aptly describes many of the labelled and/or crowdsourced datasets used in modern deep learning. While this data appears to be natural at first glance, the name emphasises that it crucially has several artificial features.
Natural real-world data. It is hard to pin down exactly what makes data ‘natural’, which is why it is often intuitively defined as “like/from the real world”. While this vaguely describes the natural data space, it lacks specificity with regard to the structure, and thus its distribution can only really be characterised as given by random samples. This generally results in extremely sparse coverage, but some superficial structure can help to approximate its characterisation more densely. Note that real-world distributions often resemble a power law distribution, that is, the distribution is dominated by a few patterns and exhibits a long tail of relevant but rare points. For instance, most written sentences are unique, but their distribution can be roughly captured when interpreted in terms of components like words, n-grams or syntactic/semantic representations, all of which resemble power laws. With respect to evaluation, it is thus no surprise that natural data, lacking explicit structure, comes in the form of a large number of randomly sampled data points, and that the ‘best-possible’ differentiation of train and test data is based on a simple random split.
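To make the two notions in the paragraph above concrete, the following sketch draws samples from a Zipf-like power law and applies a simple random train/test split. It is only an illustration under stated assumptions: the function names, the number of types, the exponent of 1 and the 80/20 split are arbitrary choices, not taken from the text.

```python
import random
from collections import Counter


def zipf_sample(num_types, num_samples, rng, exponent=1.0):
    """Draw samples where the type of rank r has weight 1 / r**exponent."""
    types = list(range(1, num_types + 1))
    weights = [1.0 / rank ** exponent for rank in types]
    return rng.choices(types, weights=weights, k=num_samples)


def random_split(data, test_fraction, rng):
    """Shuffle a copy of `data` and cut it into a train and a test part."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
samples = zipf_sample(num_types=1000, num_samples=2000, rng=rng)
train, test = random_split(samples, test_fraction=0.2, rng=rng)

# A few frequent types dominate the sample ...
counts = Counter(samples)
top_10_share = sum(c for _, c in counts.most_common(10)) / len(samples)

# ... while the long tail means some test types never occur in training.
unseen_in_train = set(test) - set(train)

if __name__ == "__main__":
    print(f"share of the 10 most frequent types: {top_10_share:.2f}")
    print(f"test types unseen in training: {len(unseen_in_train)}")
```

Because both parts come from the same shuffled sample, train and test follow the same distribution by construction, which is exactly the property a random split provides and the only one it can guarantee for data without explicit structure.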
Artificial abstract data. I consider data ‘artificial’ if it is either created with a specific problem in mind or transferred from its natural context to the problem in question. Synthetic abstract data is the prime example of fully artificial data, while semi-artificial data is discussed below. The structure of the artificial data space is known by design and explicitly specified by its generating mechanism, which defines its rules and meaningful components. The distribution can thus be controlled in detail – although global patterns emerging from the interaction of different components can still be difficult to predict. The component distribution is usually chosen to be uniform, as there is no ‘natural’ reason to differentiate frequencies for abstract content. In contrast to natural data, artificial data is ideally represented by its generating process and not by a fixed dataset. On the one hand, a generator makes it possible to create datasets of any size and configuration when required. On the other hand, not all structural aspects may be obvious from the data points of a fixed dataset, but they are explicit in the generator specification. The more a generator supports the configuration of its parameters, the less its application is constrained to one specific task, and the more its data can be useful for a variety of problems. As a consequence, training and test data are not required to follow the same distribution.
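The idea of representing artificial data by a configurable generator, rather than by a fixed dataset, can be sketched as follows. This is a hypothetical toy generator, not the one used in this work: the names `GeneratorConfig` and `generate`, the colour/shape vocabulary and the held-out combination are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

SHAPES = ("square", "circle", "triangle")
COLOURS = ("red", "green", "blue")


@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical configuration of a toy data generator."""

    shapes: tuple = SHAPES
    colours: tuple = COLOURS
    excluded: frozenset = frozenset()  # (colour, shape) pairs never generated
    seed: int = 0


def generate(config, num_examples):
    """Yield (colour, shape) pairs, uniform over the allowed combinations."""
    rng = random.Random(config.seed)
    allowed = [
        (colour, shape)
        for colour in config.colours
        for shape in config.shapes
        if (colour, shape) not in config.excluded
    ]
    for _ in range(num_examples):
        yield rng.choice(allowed)


# Training data never contains red squares ...
train_config = GeneratorConfig(excluded=frozenset({("red", "square")}), seed=1)
train = list(generate(train_config, 1000))

# ... whereas the test data consists of red squares only, so the two sets
# deliberately follow different distributions.
test_config = GeneratorConfig(colours=("red",), shapes=("square",), seed=2)
test = list(generate(test_config, 100))
```

The structural facts that matter here (which combinations exist, which are held out) are explicit in the two configurations, whereas they would have to be inferred by inspection from a fixed dataset. Dataset size is likewise just an argument to `generate`.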
Current practice: semi-artificial data. One purpose of the distinction between natural and artificial data is to highlight how many of the recent labelled and crowdsourced datasets are best described as ‘semi-artificial’, as opposed to fully natural. On the one hand, labelled datasets usually introduce an artificial discrete classification which is supposed to uniquely characterise any instance. Depending on the application, these classes are more or less obviously chosen, but even in seemingly straightforward cases like object recognition or parsing, existing categorisations are controversial (Tommasi et al., 2015; Manning, 2011). On the other hand, crowdsourced datasets are collected by posing an artificial task to human workers precisely because such data does not naturally occur. While platforms like Amazon Mechanical Turk are expected to lead to more ‘natural’ annotators than, for instance, university students or subject experts (Smith, 2012), many other aspects of the crowdsourcing setup are artificial, like people doing crowdsourcing as a paid job with a consequent bias to solve tasks quickly and simply (Gururangan et al., 2018). If the dataset includes images, these are often sourced from available photo datasets – like MS-COCO based on Flickr (Lin et al., 2014) – which show staged scenes selected by human photographers based on aesthetic, social, humorous and other criteria (Pinto et al., 2008).
The degree of artificiality depends on the dataset and can be controlled to some extent by the data collection methodology. So what is the key difference from natural data? The more guided/enforced collection process implicitly shapes the nature of the data which, while still being opaque, can no longer be characterised as “like the real world”. In particular, it may introduce non-natural biases and artefacts which are not intentional, but at the same time hard to avoid or detect, given the opaque structure of natural data in the first place. The implication is that while such data approximates natural data well in many ways, we cannot rely on it doing so in every respect, and consequently have to question its status as a proxy for a real-world application. There is a danger that instead of combining the advantages of being natural as well as task-focused, such data in fact ends up being ‘opaquely artificial’ and thus irrelevant (as some examples in chapter 2 illustrate).