2.1.8 Using if exp versus by varlist: with statistical commands

subset of the data. You can summarize the data for each region, where NE is coded as region 1 and N Cntrl is coded as region 2, by using an if exp:

. summarize medage marr divr if region==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         9    31.23333     1.023474      29.4       32.2
        marr |         9    44.47922     47.56717     5.226    144.518
        divr |         9    19.30433     19.57721     2.623     61.972

. summarize medage marr divr if region==2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |        12      29.525     .7008113      28.3       30.9
        marr |        12    47.43642     35.29558     6.094    109.823
        divr |        12    24.33583       19.684     2.142     58.809
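The two commands above differ only in the region code. As a minimal sketch, the same if exp pattern can be repeated with a loop. The dataset here is a small made-up one (not the census extract used in the text), and the loop macro r is an illustrative name.

```stata
* Toy data: hypothetical values, not the census extract used in the text
clear
input byte region float medage
1 31.2
1 29.4
1 32.2
2 28.3
2 30.9
2 29.5
end

* Run summarize once per region code, restricting the sample with if exp
forvalues r = 1/2 {
    display as text "region == `r'"
    summarize medage if region == `r'
}
```

A loop like this still requires you to know the codes in advance, which is the inconvenience that by varlist: removes.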

If your data have discrete categories, you can use Stata’s by varlist: prefix instead of the if exp qualifier.

If you use by varlist: with one or more categorical variables, the command is repeated automatically for each value of the by varlist:, no matter how many subsets are expressed by the by varlist:. However, by varlist: can execute only one command.

2. An even simpler approach would be to type generate largepop = 1 - smallpop. If you properly define smallpop to handle missing values, the algebra of the generate statement will ensure that they are handled in largepop, since any function of missing data produces missing data.
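The by varlist: prefix requires the data to be sorted by the by-variables. The following sketch, run on a small made-up dataset (the values are assumptions, not the census data), shows what happens when the data are unsorted and three equivalent ways of meeting the requirement.

```stata
* Toy data: hypothetical values, deliberately left unsorted by region
clear
input byte region float medage
2 28.3
1 31.2
2 30.9
1 29.4
end

* Without sorting, by: stops with a "not sorted" error; capture traps it
capture noisily by region: summarize medage
display "return code: " _rc

* Three equivalent ways to satisfy the sort requirement
sort region
by region: summarize medage

by region, sort: summarize medage

bysort region: summarize medage
```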

To illustrate how to use by varlist:, let’s generate the same summary statistics for the two census regions:

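. by region, sort: summarize medage marr divr

-------------------------------------------------------------------------------
-> region = NE

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         9    31.23333     1.023474      29.4       32.2
        marr |         9    44.47922     47.56717     5.226    144.518
        divr |         9    19.30433     19.57721     2.623     61.972

-------------------------------------------------------------------------------
-> region = N Cntrl

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |        12      29.525     .7008113      28.3       30.9
        marr |        12    47.43642     35.29558     6.094    109.823
        divr |        12    24.33583       19.684     2.142     58.809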

Here we needed the sort option because the by varlist: prefix requires the data to be sorted by region. The statistics indicate that Northeasterners are slightly older than those in North Central states, although a formal test would be needed to judge whether the means are statistically distinguishable.
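Whether two group means differ can be checked with a two-sample t test rather than by inspecting the summary tables. The sketch below uses made-up values for two region codes (an assumption for illustration, not the census data), so its output says nothing about the actual regions.

```stata
* Toy data: hypothetical values for two regions
clear
input byte region float medage
1 31.2
1 29.4
1 32.2
1 30.8
2 28.3
2 30.9
2 29.5
2 29.1
end

* Two-sample t test of equal mean medage across the two region codes
ttest medage, by(region)

* Same comparison without assuming equal variances in the two groups
ttest medage, by(region) unequal
```

The by() option of ttest requires that the grouping variable take exactly two values in the estimation sample.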

Do not confuse the by varlist: prefix with the by() option available on some Stata commands. For instance, we could produce the summary statistics for medage by using the tabstat command, which also generates statistics for the entire sample:

. tabstat medage, by(region) statistics(N mean sd min max)

Summary for variables: medage
     by categories of: region (Census region)

  region |         N      mean        sd       min       max
---------+--------------------------------------------------
      NE |         9  31.23333  1.023474      29.4      32.2
 N Cntrl |        12    29.525  .7008113      28.3      30.9
---------+--------------------------------------------------
   Total |        21  30.25714  1.199821      28.3      32.2
------------------------------------------------------------

Using by() as an option modifies the command, telling Stata that we want to compute a table with summary statistics for each region. On the other hand, the by varlist: prefix used above repeats the entire command for each value of the by-group.
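To make the contrast concrete, here is a minimal sketch that runs both forms side by side on a small made-up dataset (the values are assumptions; only the cenreg label text for codes 1 and 2 is taken from the text).

```stata
* Toy data: hypothetical values, not the census extract used in the text
clear
input byte region float medage
1 31.2
1 29.4
1 32.2
2 28.3
2 30.9
2 29.5
end
label define cenreg 1 "NE" 2 "N Cntrl"
label values region cenreg

* by() option: one command, one table, including a Total row
tabstat medage, by(region) statistics(n mean sd)

* by varlist: prefix: the whole command is rerun for each region,
* giving separate output per group and no overall total
by region, sort: summarize medage
```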

The by varlist: prefix may include more than one variable, so all combinations of the variables are evaluated, and the command is executed for each combination. Say that we combine smallpop and largepop into one categorical variable, popsize, which equals 1 for small states and 2 for large states. Then we can compute summary statistics for small and large states in each region:

. generate popsize = smallpop + 2*largepop

. by region popsize, sort: summarize medage marr divr

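-------------------------------------------------------------------------------
-> region = NE, popsize = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         5       30.74     1.121606      29.4         32
        marr |         5      12.011     8.233035     5.226     26.048
        divr |         5      6.2352     4.287408     2.623     13.488

-------------------------------------------------------------------------------
-> region = NE, popsize = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         4       31.85     .4509245      31.2       32.2
        marr |         4     85.0645     44.61079    46.273    144.518
        divr |         4    35.64075     18.89519    17.873     61.972

-------------------------------------------------------------------------------
-> region = N Cntrl, popsize = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         8     29.5625     .7998885      28.3       30.9
        marr |         8    26.85387     16.95087     6.094     54.625
        divr |         8    12.14637     8.448779     2.142     27.595

-------------------------------------------------------------------------------
-> region = N Cntrl, popsize = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         4       29.45     .5446711      28.8       29.9
        marr |         4     88.6015     22.54513    57.853    109.823
        divr |         4    48.71475     8.091091    40.006     58.809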

The youngest population is found in large North Central states. Remember that large states have popsize = 2. We will see below how to better present the results.
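The popsize construction above yields the codes 1 and 2 only if smallpop and largepop are mutually exclusive 0/1 indicators. A minimal sketch of how one might verify that before combining them, using made-up indicator values that include one missing observation (an assumption for illustration):

```stata
* Toy data: hypothetical indicators, with one observation missing
clear
input float(smallpop largepop)
1 0
0 1
1 0
. .
end

* The two indicators should sum to 1 wherever both are observed
assert smallpop + largepop == 1 if !missing(smallpop, largepop)

* Arithmetic on missing values yields missing, so popsize is
* missing for the last observation
generate popsize = smallpop + 2*largepop
tabulate popsize, missing
```

If the assert fails, Stata stops with an error, which flags overlapping or incomplete indicators before they produce unexpected codes such as 0 or 3.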

2.1.9 Labels and notes

Stata makes it easy to provide labels for the dataset, for each variable, and for each value of a categorical variable, which will help readers understand the data. To label the dataset, use the label data command:

. label data "1980 US Census data with population size indicators"

The new label overwrites any previous dataset label.

Say that we want to define labels for the urbanized, smallpop, largepop, and popsize variables:

. label variable urbanized "Population in urban areas, %"

. label variable smallpop "States with <= 5 million pop, 1980"

. label variable largepop "States with > 5 million pop, 1980"

. label variable popsize "Population size code"

. describe pop smallpop largepop popsize urbanized

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
pop             double %8.1f                  1980 Population, ’000
smallpop        float  %9.0g                  States with <= 5 million pop, 1980
largepop        float  %9.0g                  States with > 5 million pop, 1980
popsize         float  %9.0g                  Population size code
urbanized       float  %9.0g                  Population in urban areas, %

Now if we give this dataset to another researcher, the researcher will know how we defined smallpop and largepop.
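Variable labels can also be read back programmatically, which is convenient when building titles or log messages. The sketch below uses a placeholder variable with made-up values; the local macro name lbl is an arbitrary choice.

```stata
* Toy data: a placeholder variable with hypothetical values
clear
set obs 3
generate urbanized = 50 + 10*_n

label variable urbanized "Population in urban areas, %"

* Retrieve the label with an extended macro function
local lbl : variable label urbanized
display "`lbl'"

* The label also appears in the output of describe
describe urbanized
```

Because it uses a local macro, the block should be run as a whole from a do-file rather than line by line.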

Last, consider value labels, such as the one associated with the region variable:

. describe region

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
region          byte   %-8.0g      cenreg     Census region

region is a byte (integer) variable with the variable label Census region and the value label cenreg. Unlike value labels in some other statistical packages, Stata’s value labels are not specific to a particular variable. Once you define a label, you can assign it to any number of variables that share the same coding scheme.
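As a minimal sketch of reusing one value label across variables, the following defines a single label and attaches it to two 0/1 indicators. The label name yesno and the data values are assumptions for illustration.

```stata
* Toy data: hypothetical 0/1 indicators
clear
input float(smallpop largepop)
1 0
0 1
1 0
end

* One value label, defined once (yesno is a hypothetical name)
label define yesno 0 "No" 1 "Yes"

* Attach the same label to both variables, since they share the 0/1 coding
label values smallpop yesno
label values largepop yesno

tabulate smallpop largepop
```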

Let’s examine the cenreg value label:

. label list cenreg
cenreg:
           1 NE
           2 N Cntrl
           3 South
           4 West

cenreg contains codes for four Census regions, only two of which are represented in our dataset.

Because popsize is also an integer code, we should document its categories with a value label:

. label define popsize 1 "<= 5 million" 2 "> 5 million"

. label values popsize popsize

We can confirm that the value label was added to popsize by typing the following:

. describe popsize

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
popsize         float  %12.0g      popsize    Population size code
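A value label changes only how the codes are displayed; the underlying values stay numeric. The sketch below, on made-up data, lists the mapping, tabulates the variable with and without labels, and shows that conditions still refer to the numeric codes.

```stata
* Toy data: hypothetical values, not the census extract used in the text
clear
input float(popsize medage)
1 29.4
1 30.1
2 31.8
2 32.2
end
label define popsize 1 "<= 5 million" 2 "> 5 million"
label values popsize popsize

* Show the mapping, then tabulate with and without the labels
label list popsize
tabulate popsize
tabulate popsize, nolabel

* Conditions still refer to the numeric codes, not the label text
summarize medage if popsize == 2
```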

To view the mean for each of the values of popsize, type

. by popsize, sort: summarize medage

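-------------------------------------------------------------------------------
-> popsize = <= 5 million

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |        13    30.01538     1.071483      28.3         32

-------------------------------------------------------------------------------
-> popsize = > 5 million

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      medage |         8       30.65     1.363818      28.8       32.2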

The smaller states have slightly younger populations.

You can use the notes command to add notes to a dataset and individual variables (think of sticky notes, real or electronic):

. notes: Subset of Census data, prepared on TS for Chapter 2

. notes medagel: median age for large states only

. notes popsize: variable separating states by population size

. notes popsize: value label popsize defined for this variable

. describe

Contains data from http://www.stata-press.com/data/imeus/census2c.dta
  obs:            21                          1980 Census data for NE and NC states
 vars:            12                          14 Jun 2006 08:48
 size:         1,554 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
state           str13  %-13s                  State
region          byte   %-8.0g      cenreg     Census region
pop             double %8.1f                  1980 Population, ’000
popurb          double %8.1f                  1980 Urban population, ’000
medage          float  %9.2f                  Median age, years
marr            double %8.1f                  Marriages, ’000
divr            double %8.1f                  Divorces, ’000
urbanized       float  %9.0g                  Population in urban areas, %
medagel         float  %9.0g                *
smallpop        float  %9.0g                  States with <= 5 million pop, 1980
largepop        float  %9.0g                  States with > 5 million pop, 1980
popsize         float  %12.0g      popsize  * Population size code
                                            * indicated variables have notes
-------------------------------------------------------------------------------
Sorted by:  popsize
     Note:  dataset has changed since last saved

. notes

_dta:
  1.  Subset of Census data, prepared on 5 May 2011 19:07 for Chapter 2

medagel:
  1.  median age for large states only

popsize:
  1.  variable separating states by population size
  2.  value label popsize defined for this variable

The string TS in the first note is automatically replaced with a time stamp.
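Notes can also be listed for a single variable and removed individually. A minimal sketch, using a placeholder dataset (the variable values and the note text of the first note are assumptions):

```stata
* Toy data: a placeholder variable with hypothetical values
clear
set obs 3
generate popsize = 1

* TS surrounded by blanks is replaced with a time stamp
notes: toy dataset created on TS for illustration
notes popsize: variable separating states by population size
notes popsize: value label popsize defined for this variable

* List the notes for one variable, then drop its second note
notes list popsize
notes drop popsize in 2

* Show all remaining notes
notes
```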

In document Baum - An Introduction to Modern Econometrics Using Stata (Page 30-35)