How Large Is a Large Dataset? - Statistics and Computing. Series Editors: J. Chambers D. Hand W

need for multiresolution strategies, that is, representing the data at different levels of abstraction or detail. Some computer scientists have published research on visualization of large datasets and have experimented with novel ideas. Keim (2000), for instance, considers pixel-oriented ways of encoding multivariate information for large datasets.

In the United States, the National Visualization and Analytics Center (NVAC) has published a book (Thomas and Cook; 2005) on analysing large datasets for combatting terrorism. They refer (p. 4) to taking advantage of “the human eye’s broad bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once.” They seek new methods using “multiple and large-area computer displays to assist analysts” (p. 84), a different approach to the one in this book, which concentrates on improving performance on single screens, the practical situation most of us face.

1.4 How Large Is a Large Dataset?

What is meant by “large” in “large datasets” tends to change over time and depends on what methods are to be applied to the data. Computers have more storage and more power, and tasks that were onerous last year become run-of-the-mill this year. The Lanarkshire Milk Experiment was a large-scale study from 1930, made famous amongst statisticians by Student’s devastating criticism (Student; 1931).

Figure 1.4 shows the odd pattern of average growth for the 10,000 schoolgirls in the study (there was a similar pattern for the boys). Each child was measured twice, once in winter and again six months later in summer. The display shows the averages at each age and links them together to estimate an average growth curve for girls over time. Such a display would be easy to produce today, but it must have taken substantial effort seventy years ago to organise the necessary calculations.

It would be interesting to know what statisticians have thought of as large over the years, but it turns out to be difficult to pin down. There is Huber’s (1992) classification, in which he divided datasets from tiny up to huge based on the storage space required (see Table 1.1). He estimated that a large dataset might have a million cases with 10 variables. Wegman (1995) extended the table by a further factor of 10² (“monster” datasets).
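As a rough check on Huber's estimate, the sketch below converts a count of cases and variables into bytes and looks up the nearest class in Table 1.1. The 8 bytes per value (double precision) and the rule of picking the class whose exponent is closest on a log10 scale are assumptions made for illustration, not part of Huber's classification, and the function names are hypothetical.

```python
import math

# Huber's (1992) classes from Table 1.1, keyed by the exponent of the byte count.
HUBER_CLASSES = {2: "tiny", 4: "small", 6: "medium", 8: "large", 10: "huge"}


def dataset_bytes(cases, variables, bytes_per_value=8):
    """Storage for a dense cases x variables table.

    bytes_per_value=8 assumes double-precision numbers (an assumption).
    """
    return cases * variables * bytes_per_value


def huber_class(n_bytes):
    """Return the Table 1.1 class whose size is closest on a log10 scale."""
    exponent = math.log10(n_bytes)
    nearest = min(HUBER_CLASSES, key=lambda e: abs(e - exponent))
    return HUBER_CLASSES[nearest]


size = dataset_bytes(1_000_000, 10)
print(size, huber_class(size))
```

Under these assumptions a million cases with 10 variables occupy 8 × 10⁷ bytes, which is closest to the 10⁸ bytes of the "large" class.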

Once electronic computers were used more commonly in statistics, their capabilities started to determine how big a dataset could be analysed. There are several levels of looking at this. Firstly, you can consider the amount of data that can be stored (which is related to the Huber scale of dataset sizes and depends on the capacity of the storage media available) and identified (which depends on the software available). Secondly, you can think about what analyses can be carried out on


Fig. 1.4. A reproduction of one of Student’s displays from the Lanarkshire Milk Study. Notice the apparent irregularity of weight increase with age, which drew attention to some of the problems in the design of the study and in how it was carried out.

the data. The requirements for some methods obviously grow too fast with numbers of cases (e.g., hierarchical clustering) and some software needs all the data in main memory, which automatically makes that a limiting factor. Thirdly, you can demand that analyses be carried out within an “acceptable” time. For instance, interactive analyses need very fast response times, much faster than anything dreamed about in the days when users expected to wait a few hours for their output to turn up.
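To illustrate why the requirements of some methods grow too fast with the number of cases, the sketch below counts the storage for all pairwise distances among n cases, as used by hierarchical clustering variants that keep the full distance matrix. The 8 bytes per distance and the choice of the stored-matrix variant are assumptions; the text does not say which algorithm or precision is meant, and the function name is hypothetical.

```python
def pairwise_distance_bytes(n_cases, bytes_per_value=8):
    """Memory for the n*(n-1)/2 distinct pairwise distances among n cases.

    bytes_per_value=8 assumes double-precision distances (an assumption).
    """
    n_pairs = n_cases * (n_cases - 1) // 2
    return n_pairs * bytes_per_value


for n in (1_000, 100_000, 1_000_000):
    gib = pairwise_distance_bytes(n) / 2**30
    print(f"{n:>9,} cases: {gib:,.2f} GiB of distances")
```

A tenfold increase in cases multiplies the storage by about a hundred, so a million cases would need roughly 4 × 10¹² bytes for the distances alone under these assumptions.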

A search of the major statistical journals using JSTOR threw up a number of comments in published papers on the size of datasets, but most

Table 1.1. Huber’s Classification of Dataset Sizes

Size | Description | Bytes
Tiny | Can be written on a blackboard | 10²
Small | Fits on a few pages | 10⁴
Medium | Fills a floppy disk | 10⁶
Large | Fills a tape | 10⁸
Huge | Needs many tapes | 10¹⁰

were not quantified. They are listed in Table 1.2 for illustration and some are discussed in more detail in chronological order.

Table 1.2: Quotations on Dataset Size

Year | Comment | Author
1959 | The phrase “If large-scale storage is not available” implies that a data set of 1,000 cases would have been large. | Harris
1965 | “the analysis of the data recorded by TelStar, an early communications satellite, involved tens of thousands of observations and challenged contemporary computing technology.” | Chambers
1966 | “The need for better editing is well known to those concerned with extensive data sets.” | Yates
1967 | Datasets of “modest bulk”. | Page
1967 | “Vast” datasets. | Gower
1975 | “It is now possible to access large data sets directly from magnetic tape.” | McNeill & Tukey
1978 | For SPSS “any one analytical use of the file is limited to using at most 500 variables”, though up to 5,000 could be loaded in all. | Muller
1981 | “There is now a collection of computer subroutines designed to summarize large data sets in histogram form.” In the example he used, statistics were calculated for 20,000 samples of size 50 and histograms with 800 (!) cells were prepared for each statistic. | Dickey
1981 | Restricted in their analysis at one site because the software there could only handle 88,000 real numbers. | Aitken et al.
1981 | “Substantial” data sets in the census | Kruskal
1982 | Moderate data sets have less than 500 cases and large have more than 2,000 (for linear-logistic models). | Koch
1986 | “... allows even very large data sets to be explored interactively” and referred to a regression data set with 11,000 cases. | Gilks
1986 | “The increased use of computing has in turn increased the importance of developing methods for interpretation of large volumes of data. . . ” | Eddy
1987 | “What is large depends on the frame of reference. If available plotting space for a scatterplot is a one-inch square, 500 points can seem large. For our purposes, N is large if plotting or computation times are long, or if plots can have an extensive amount of overplotting.” At that time, 50,000 points was large (based on rendering time) but the authors pointed out: “The representation of 1 million or more data points in each plot is feasible.” | Carr et al.
1987 | “a moderate amount of data, say several hundred observations.” | Becker et al.
1990 | “A regression model for 5,000 cases with 6 variables would be a high sample size for immediate evaluation (c. 3 seconds), but far too big for even rough bootstrapping (estimated to take an hour).” | Sawitzki
1990 | “Computing plain medians was not feasible because there were nRC = 2,621,400,000 data values in all, which could not be stored in central memory.” | Rousseeuw
1991 | “2% of the total census records is a very large data file”. For a UK population of 65 million this would be about a million cases. | Marsh et al.
1993 | “The data are listed in Good and Gaskins (1980) as a histogram of 172 bins of length 10 MeV constructed from the locations of 25,752 events on a mass-spectrum. For such a huge data set in one dimension...” |
1996 | “huge samples (size 100,000)” and “a large number of groups (100 say)” | Sasieni & Royston
1996 | “large surveys such as the NCVS may have 60,000 or more observations, and only recently has research begun on how to plot massive data sets.” | Fesco et al.
1998 | A “large” data set has 3667 cases. (Scatterplots for surveys) | Korn & Graubard
1999 | “We focus our attention on a very small [sic] part of the information available in these data; namely the birth weight of the 4,017,264 registered singleton births.” | Clemons & Pagano

In 1959, Harris described data plotting with an IBM 650. “With a fairly small table a 650 might handle up to 1,000 non-negative observations of not over 5 digits each.” The accompanying phrase “If large-scale storage is not available” implies that such a dataset of 1,000 cases would have been large. For its time, the display in Figure 1.5 must have been impressive. Of course, there is “large” and “large” and according to

Fig. 1.5. Reproduction of Figure 1 of Harris’s (1959) paper. Reprinted with permission from The American Statistician. Copyright 1959 by the American Statistical Association. All rights reserved.


Chambers (1999), “Large-scale applications did exist, even at this time [in 1965]; the analysis of the data recorded by TelStar, an early communications satellite, involved tens of thousands of observations and challenged contemporary computing technology.”

In his 1966 Fisher memorial lecture Yates wrote:

As an example of the type of work that can now be readily undertaken I may instance the analysis of the directional recording of swell from distant storms involving complicated spectral analyses of 10⁶ automatic recordings from three pressure gauges.

He also noted that “A serious fault of many statistical investigations in the past has been that all available data bearing on the question at issue were not made use of”, a clear cry for more powerful computing facilities to enable the analysis of larger datasets.

In 1980, Good and Gaskins published an article on what they called ‘bump-hunting’ and included a plot of the distribution of a dataset with just over 25,000 cases (Figure 1.6). For the 1980s, Carr et al. (1987) may sound surprisingly ambitious: “The representation of 1 million or more data points in each plot is feasible.” In fact, the only surprise is that so few people have followed up on this work. Figure 1.7 shows only a quarter of a million points, but it is clear that a plot could readily have been drawn for a million. In 1995, the US National Research Council organised a workshop on “Massive Data Sets”. Several of the papers were revised and reviewed a few years later and published in an issue of JCGS (Vol. 8, no. 3). There was some reference to numbers, but, according to Kettenring, the main organiser, in a later paper presented at the Interface meeting in 2001: “It seemed appropriate to stick with a murky definition

Fig. 1.6. Reproduction of Figure A of Good and Gaskins’s (1980) paper. Reprinted with permission from The Journal of the American Statistical Association. Copyright 1980 by the American Statistical Association. All rights reserved.


Fig. 1.7. Reproduction of Figure 10 of Carr et al.’s (1987) paper plotting 243,800 points from a glass-melter simulation. Reprinted with permission from The Journal of the American Statistical Association. Copyright 1987 by the American Statistical Association. All rights reserved.

[of massive]” (Kettenring; 2001). He offered one version of a murky definition as: “A massive data set is one for which the size, heterogeneity, and general complexity cause serious pain for the analyst(s).” A realistic, if unattractive, description.

In their paper, Eddy et al. (1999) worked with brain image data, analysing 2800 slices of 128 × 128 voxels each, making up about 256 MB of raw data. Kahn and Braverman (1999) described climate data being collected at the rate of 80 gigabytes per day (though they did not claim to analyse datasets of this size). In a later JCGS paper, Braverman (2002) discussed the analysis of a subset of 5 million cases for 2 variables, i.e., large according to Huber’s table. McIntosh (1999) studied telephone networks and was able to store 2 to 4 million messages on 128 MB. Four years later, when he revised his paper for publication in JCGS, his storage limit had jumped to 55 gigabytes!

It is gratifying to be able to show a plot for a genuinely large dataset (at least in comparison to most of the datasets used so far). Figure 1.8 displays the distribution of reported birthweights for more than 4 million children. The curious form is due to rounding. (Perhaps there are more urgent matters to attend to just after a birth than to record the precise weight of the baby?) Hand et al. (2000) in their paper on Data Mining give examples of datasets that are potentially much larger than anything discussed here (Barclaycard’s 350 million credit card transactions a year,
