
Data in statistics and in R

2.3 Data organization

Data that describe or measure a single attribute, say the height of a tree, are called univariate. They consist of a set of observations of objects, with a single value obtained for each object. Bivariate data are represented in pairs: two values are recorded for each object. Multivariate data consist of a set of observations on objects in which each observation contains several values that describe the object.
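
As a minimal sketch, the three kinds of data might be held in R as follows. The tree measurements are made-up values for illustration only, and the names height, diameter, species, tree_pairs and tree_multi are hypothetical.

```r
# Made-up measurements for illustration only
height <- c(12.1, 15.3, 9.8)            # univariate: one value per tree

# Bivariate: a pair of values (height, diameter) per tree
tree_pairs <- data.frame(height   = height,
                         diameter = c(0.31, 0.42, 0.25))

# Multivariate: several values per tree, possibly of different types
tree_multi <- data.frame(height   = height,
                         diameter = c(0.31, 0.42, 0.25),
                         species  = c("oak", "pine", "birch"))

length(height)    # number of observations
dim(tree_multi)   # observations (rows) by variables (columns)
```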

Statistical analysis usually involves more than one data file. Often we use several files to store different data that relate to a single analysis. We then need to somehow relate data from different files. This requires careful consideration of how the data are to be organized. Once you commit the data to a particular organization it is difficult to change. The way the data are organized will then dictate how easy they are to prepare for different types of statistical analyses.

Data are organized into tables and tables are related to each other. The tables, their relationships and other auxiliary information form a database. For example, you may have data about air pollution. The pollution is measured at numerically labeled stations and the data are stored in one table. Another table stores the correspondence between the station number and the name of the closest town (this is how the U.S. Environmental Protection Agency stores much of its pollution-related data).

Tables are often stored in separate files.
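
The station example can be sketched in R with two small data frames that are related through a shared column. The station numbers, readings and town names below are made up for illustration, and the column names station, ozone and town are assumptions. In practice each table would typically be read from its own file, for example with read.csv().

```r
# Hypothetical readings by station number (made-up values)
pollution <- data.frame(station = c(101, 102, 101, 103),
                        ozone   = c(41, 36, 12, 18))

# Hypothetical lookup table: station number -> closest town
stations <- data.frame(station = c(101, 102, 103),
                       town    = c("Springfield", "Riverton", "Lakeside"),
                       stringsAsFactors = FALSE)

# Relate the two tables through the shared 'station' column
readings <- merge(pollution, stations, by = "station")
readings
```

Keeping the town names in a separate table means that each name is stored once, however many readings a station contributes.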

2.3.1 Data tables

We arrange data in columns (variables) and rows (observations). In the database vernacular, we call variables fields and observations cases or rows. The following example demonstrates multivariate data. It is a good example of how data should be reported succinctly and referenced appropriately.


Example 2.5. R comes with some data frames bundled. You can access these data with data(), giving the name of the data set as an argument. If you give no argument, then R will print a list of all the data sets that are available. The output below shows the first observations from a data set with 6 variables:

> data(airquality)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1

We view the data frame by first bringing it into our R session with data() and then printing its head(). The data represent daily readings of air quality values in New York City from May 1 through September 30, 1973. The data consist of 153 observations on 6 variables:

1. Ozone (ppb) – numeric values that represent the mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island.

2. Solar.R (lang) – numeric values that represent solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park.

3. Wind (mph) – numeric values that represent average wind speed in miles per hour at 0700 and 1000 hours at La Guardia Airport.

4. Temp (degrees F) – numeric values that represent the maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

5. Month – numeric month (1–12)

6. Day – numeric day of month (1–31)

The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). The data were reported in Chambers and Hastie (1992).
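
The description in Example 2.5 can be checked against the data frame itself. The following is a brief sketch that uses only base R functions; typing ?airquality at the prompt opens the documentation bundled with the data.

```r
data(airquality)

dim(airquality)               # number of observations and variables
names(airquality)             # variable names
sapply(airquality, class)     # type of each variable
range(airquality$Month)       # months covered by the data
colSums(is.na(airquality))    # missing readings per variable
```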

The output in Example 2.5 illustrates a typical arrangement of data and reporting:

• the data in a table with observations in rows and variables in columns;

• the variable names and their types;

• the units of measurement;

• when and where the data were collected;

• the source of the data;

• where they were reported.

This is a good example of how data should be documented. Always include the units and cite the source. Give variables meaningful names and you will not have to waste time looking them up. The distinction between variables and observations need not be rigid. They may even switch roles, based on the questions asked.

Example 2.6. The data on vote counts in Florida were introduced in Example 2.2. In Table 2.1, the candidates are the variables. Each column displays the number of votes for a candidate. The counties are the observations (rows).

Table 2.1 Number of votes by county and candidate. U.S. 2000 presidential elections, Florida counts.

County       Gore      Bush       Buchanan   Nader
ALACHUA      47 365    34 124     263        3 226
BAKER        2 392     5 610      73         53
BAY          18 850    38 637     248        828
BRADFORD     3 075     5 414      65         84
BREVARD      97 318    115 185    570        4 470

In Table 2.2, the counties are the variables. The columns display the number of votes cast for the different candidates in a county. If you want to compute the total votes cast for Gore, you might have to present the data as in Table 2.1 to your statistical package. If you want the total number of votes cast in a county, you might have to present the data as in Table 2.2. Contrary to appearances, switching the roles of rows and columns may not be a trivial task. We shall see that R is particularly suitable for such switches.

Table 2.2 Number of votes by candidate and county. U.S. 2000 presidential elections, Florida counts.

Candidate    ALACHUA   BAKER    BAY       BRADFORD   BREVARD
Gore         47 365    2 392    18 850    3 075      97 318
Bush         34 124    5 610    38 637    5 414      115 185
Buchanan     263       73       248       65         570
Nader        3 226     53       828       84         4 470
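
The switch between the two layouts can be sketched in R with the five counties shown in Tables 2.1 and 2.2. The object names votes and votes_by_candidate are hypothetical, and the totals cover only the rows and columns listed here, not all of Florida.

```r
# Table 2.1 layout: counties in rows, candidates in columns
votes <- data.frame(
  Gore     = c(47365, 2392, 18850, 3075, 97318),
  Bush     = c(34124, 5610, 38637, 5414, 115185),
  Buchanan = c(263, 73, 248, 65, 570),
  Nader    = c(3226, 53, 828, 84, 4470),
  row.names = c("ALACHUA", "BAKER", "BAY", "BRADFORD", "BREVARD")
)

# Table 2.2 layout: switch the roles of rows and columns
votes_by_candidate <- t(votes)    # t() returns a matrix

sum(votes$Gore)                   # votes for Gore in these five counties
colSums(votes_by_candidate)       # votes per county over the four candidates
```

Because t() returns a matrix rather than a data frame, wrap the result in as.data.frame() if later steps expect a data frame.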

2.3.2 Relationships among tables

Many tables may be part of a single project that requires statistical analysis. Creating these tables may require data entry, a tedious and error-prone task. It is therefore important to minimize the amount of time spent on such activities. Sometimes the tables are very large. In epidemiological studies you might have hundreds of thousands of observations. Large tables take time to process and consume storage space. Therefore, you often need to minimize the amount of space occupied by your data.

Example 2.7. The World Health Organization (WHO) reports vital statistics from various countries (WHO, 2004). Figure 2.2 shows a few lines from three related tables from the WHO data. One table, named who.ccodes, stores country codes under the variable named code and country names under the variable named name. Another table, named who.pop.var.names, stores variable names under the column var and a description of each variable under the column descr. For example, the variable Pop10 stores population size for age group 20 to 24. The third table, who.pop.2000, stores population size by country (rows) and age group (columns).

Figure 2.2 Sample of WHO population data from three related tables.

If you wish to produce a legible plot or summary of the data, you will have to relate these tables. To show population size by country and age group, you have to read the country code from who.pop.2000 and fetch the country name from who.ccodes. You will also have to read population size for a variable and fetch its description from who.pop.var.names. In the figure, the population size in 2000 in Armenia for age group 20–24 was 158 400.
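
The lookups described in Example 2.7 can be sketched with one-row excerpts of the three tables. Several details here are assumptions made for illustration: the country code 9999 is a placeholder and not the actual WHO code, the country-code column of who.pop.2000 is assumed to be called code, and the text stored in descr is invented. Only the table names, the variable Pop10 and the value 158 400 come from the example.

```r
# Hypothetical one-row excerpts; 9999 is a placeholder country code
who.ccodes <- data.frame(code = 9999, name = "Armenia",
                         stringsAsFactors = FALSE)
who.pop.var.names <- data.frame(var = "Pop10",
                                descr = "Population aged 20-24",
                                stringsAsFactors = FALSE)
who.pop.2000 <- data.frame(code = 9999, Pop10 = 158400)

# Fetch the country name for each row of the population table
pop <- merge(who.pop.2000, who.ccodes, by = "code")

# Fetch the description of the variable Pop10
age <- who.pop.var.names$descr[who.pop.var.names$var == "Pop10"]

# Combine the pieces into a legible summary line
cat(pop$name, "-", age, ":", pop$Pop10, "\n")
```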

We could collapse the three tables in Example 2.7 into one by replacing the country code by the country name and the variable name by its description. In a table with thousands of records, the column code would then store names instead of numbers. If, for example, some statistical procedure needs to sort the table by country repeatedly, then sorting on character strings is more time-consuming than sorting on numbers. Worse yet, most statistical software and database management systems cannot store a variable name such as 20–24. The process of minimizing the amount of data that needs to be stored is called normalization in database “speak.” If the tables are not too large, you can store them in R as three distinct data frames in a list.
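
Storing the three tables as distinct data frames in a list might look like the sketch below. The list name who, the element names, the placeholder code 9999 and the descr text are all hypothetical; the rows are one-line stand-ins for the real tables.

```r
# One list holding three related data frames (hypothetical contents)
who <- list(
  ccodes        = data.frame(code = 9999, name = "Armenia"),
  pop.var.names = data.frame(var = "Pop10", descr = "Population aged 20-24"),
  pop.2000      = data.frame(code = 9999, Pop10 = 158400)
)

names(who)           # the tables held in the list
who$ccodes           # access one table by name
sapply(who, nrow)    # number of rows in each table
```

The list keeps the normalized tables separate while letting them be saved, loaded and passed around as a single object.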

In document Statistics and Data With R (Page 70-73)