2.2 Common data transformations
2.2.7 The egen command
Whereas the functions available in generate or replace are limited to those listed in [D] functions (see also help functions), Stata’s egen command provides an open-ended list of capabilities. Just as you can extend Stata’s command set by placing other .ado and .hlp files on the adopath, you can invoke egen functions that are defined by ado-files with names starting with _g, stored on the adopath. Many of these functions are part of official Stata (see [D] egen and help egen), but your copy of Stata may include other egen functions that you have written or that you have downloaded from the SSC archive ([R] ssc) or another Stata user’s net site. This section discusses several official Stata functions and several useful additions developed by the Stata user community.
8 For each of these data types, negative numbers of similar magnitudes may be stored. For floating-point numbers, see the maxfloat() and maxdouble() functions.
Although egen’s syntax is similar to that of generate, there are several differences. Not all egen functions allow a by varlist: prefix (see the documentation to determine whether a function is byable). Similarly, you cannot use _n and _N explicitly with egen. Since you cannot specify that a variable created with a nonbyable egen function should use the logic of replace, you may need to use a temporary variable as the egen result and then use replace to combine those values over groups.
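The temporary-variable approach described above can be sketched as follows. The data and the variable names g, x, and gmed are hypothetical, and median() is used only as a stand-in for a nonbyable function (median() itself is byable, which makes the result easy to check).

```stata
* Hypothetical data: two groups identified by g
clear
input g x
1 10
1 20
1 60
2 5
2 7
end

* Final result, filled in one group at a time
generate double gmed = .
levelsof g, local(groups)
foreach k of local groups {
    tempvar t
    * egen cannot overwrite gmed, so compute into a temporary variable
    egen double `t' = median(x) if g == `k'
    * then use replace to combine the values over groups
    replace gmed = `t' if g == `k'
    drop `t'
}
list
```

For a function that is byable, a single by varlist: prefix or by() option replaces the whole loop.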
Official egen functions
To get spreadsheetlike functionality in Stata’s data transformations, you will need to understand the rowwise egen functions, which allow you to calculate sums, averages, standard deviations, extrema, and counts across several Stata variables. You can also use wildcards. With a list of state-level U.S. Census variables pop1890, pop1900, ..., pop2000, you may use egen nrCensus = rowmean(pop*) to compute the average population of each state over those decennial censuses. As discussed in section 2.2.3, the rowwise functions can work with missing values. The mean will be computed for all 50 states, although several were not part of the United States in 1890. You can compute the number of nonmissing elements in the rowwise list with rownonmiss(), with rowmiss() as the complementary value. Other official rowwise functions include rowmax(), rowmin(), rowtotal(), and rowsd() (row standard deviation).
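A minimal sketch of the rowwise functions, using three made-up observations rather than actual Census figures. The new variable names (avgpop, nobs, nmiss, maxpop) are illustrative and deliberately do not begin with pop, so that the wildcard pop* keeps matching only the original variables.

```stata
* Made-up populations; missing values mark censuses with no data
clear
input pop1890 pop1900 pop1910
100 120 150
  .  80  90
  .   .  60
end

egen avgpop = rowmean(pop*)      // mean over the nonmissing values in each row
egen nobs   = rownonmiss(pop*)   // number of nonmissing values in each row
egen nmiss  = rowmiss(pop*)      // number of missing values in each row
egen maxpop = rowmax(pop*)       // largest nonmissing value in each row
list
```

In the second observation the mean is taken over the two available censuses only, and nobs and nmiss always sum to the number of variables in the list.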
Official egen also provides statistical functions for computing a statistic for specified observations of a variable and placing that constant value in each observation of the new variable. Since these functions generally let you use by varlist:, you can use them to compute statistics for each by-group of the data, as discussed in section 2.2.8. Using by varlist: makes it easier to compute statistics for each household for individual-level data or each industry for firm-level data. The count(), mean(), min(), max(), and total() functions are especially useful.9
Other functions in this statistical category include iqr() (interquartile range), kurt() (kurtosis), mad() (median absolute deviation), mdev() (mean absolute deviation), median(), mode(), pc() (percent or proportion of total), pctile(), p(n) (nth percentile), rank(), sd() (standard deviation), skew() (skewness), and std() (z-score).
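The by-group use of these statistical functions might look like the following sketch for individual-level data grouped by household. The data and the variable names hhid, income, hhmean, hhsize, hhtotal, and share are hypothetical.

```stata
* Hypothetical individual-level data with a household identifier
clear
input hhid income
1 100
1 300
2  50
2  70
2  90
end

bysort hhid: egen hhmean = mean(income)    // by varlist: prefix form
egen hhsize  = count(income), by(hhid)     // equivalent by() option form
egen hhtotal = total(income), by(hhid)
generate share = income / hhtotal          // each person's share of household income
list, sepby(hhid)
```

Each egen result is constant within a household, so it can be combined directly with individual-level variables, as in the share calculation.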
9 Before Stata 9, egen total() was called egen sum(), but the name was changed because it was often confused with generate’s sum() function.
egen functions from the user community
The most comprehensive collection of additional egen functions is Nicholas J. Cox’s egenmore package, available with the ssc command.10 The egenmore package contains routines by Cox and others (including me). Some of these routines extend the functionality of official egen routines, whereas others provide capabilities lacking in official Stata. Many of the routines require Stata version 8 or later.
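Obtaining the package with ssc typically looks like the sketch below. It assumes a working Internet connection and write access to your ado directories.

```stata
ssc describe egenmore    // show the package description and its files
ssc install egenmore     // download and install the package
help egenmore            // read the documentation for the added functions
```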
For example, extensions have been made to improve the way Stata handles dates. Stata’s date variables are stored internally as floating-point values. For a date variable measuring days (rather than weeks, months, quarters, half-years, or years), the integer part records the number of days elapsed since an arbitrary day zero of 1 January 1960. Although you could use the decimal part of a date to represent an elapsed fraction of a day (e.g., 0.25 as 6:00 a.m.), Stata does not support intraday values or time arithmetic. The egenmore package contains functions that provide such support. The dhms() function creates a date variable with the fractional part reflecting hours, minutes, and seconds, whereas hms() computes the number of seconds past midnight for time comparisons (e.g., for stock-market tick data, which are recorded to the second). The companion function elap2() displays an elapsed time between two fractional date variables in days, hours, minutes, and seconds, whereas elap() provides a similar function for a number of seconds. Functions hmm() and hmmss() display fractional days as hours and minutes, or hours, minutes, and seconds.
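The arithmetic behind a fractional date can be sketched by hand as below. This is not the egenmore implementation. It only illustrates the representation those functions build on, and the variable names day, stamp, and secs are hypothetical.

```stata
clear
set obs 1

* Integer part: days elapsed since 1 January 1960 (which is day 0)
generate double day = mdy(1, 2, 1960)

* Fractional part: 6 hours out of 24, so 0.25 of a day (6:00 a.m.)
generate double stamp = day + 6/24

* Recover the number of seconds past midnight from the fractional part
generate double secs = round((stamp - floor(stamp)) * 86400)

format day %td
list
```

Storing such values as double rather than float matters here, since a float does not have enough precision to resolve seconds within a day for dates far from 1960.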
Several egenmore functions work with standard Stata dates, expressed as integer days. The bom() and eom() functions create date variables corresponding to the first day or last day of a given calendar month. They can be used to generate the offset for any number of months (e.g., the last day of the third month from now). If you use the work option, you can specify the first (last) nonweekend day of the month (although this function does not support holidays). You can also use the functions bomd() and eomd() to find the first (last) day of a month in which their date-variable argument falls, which is useful if you wish to aggregate observations by calendar month.
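The first and last day of the month containing a given date can also be sketched with official date functions alone, which shows what bomd() and eomd() are described as returning. This is an illustration under that assumption, not the egenmore code, and the variable names m, dd, y, d, first, and last are hypothetical.

```stata
* Hypothetical dates given as month, day, year
clear
input m dd y
 2 15 2004
12 31 2004
end

generate long d     = mdy(m, dd, y)
generate long first = dofm(mofd(d))            // first day of d's calendar month
generate long last  = dofm(mofd(d) + 1) - 1    // day before the next month begins
format d first last %td
list d first last
```

Either result can then serve as a grouping variable when aggregating observations by calendar month.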
10 The package is labeled egenmore since it further extends egenodd, which appeared in the Stata Technical Bulletin (Cox 1999, 2000). Most of the egenodd functions now appear in official Stata, so they will not be discussed here.
Several egenmore functions extend egen’s statistical capabilities. The corr() function computes correlations (optionally covariances) between two variables; gmean() and hmean() compute geometric and harmonic means; rndint() computes random integers from a specified uniform distribution; semean() computes the standard error of the mean; and var() computes the variance.
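For reference, two of these statistics can be sketched with official egen functions only, using the textbook definitions: the geometric mean as the exponential of the mean log, and the standard error of the mean as the standard deviation divided by the square root of the count. This is not the egenmore code, and the variable names are hypothetical.

```stata
clear
input x
1
2
4
end

* Geometric mean: exp of the mean of ln(x), defined for positive x
egen double meanlog = mean(ln(x))
generate double gm = exp(meanlog)

* Standard error of the mean: sd / sqrt(n)
egen double sdx = sd(x)
egen n = count(x)
generate double se = sdx / sqrt(n)
list
```

The egenmore functions wrap calculations of this kind in a single call.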
The filter() function generalizes egen’s ma() function, which can produce only two-sided moving averages of an odd number of terms. In contrast, filter() can apply any linear filter to data that you have declared to be time-series data by using tsset, including panel data, for which the filter is applied separately to each panel (see section 3.4.1). You can use the companion function ewma() to apply an exponentially weighted moving average to time-series data.
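The contrast can be sketched with official Stata alone: egen’s ma() for a centered three-term average, and time-series operators for an arbitrary linear filter written out by hand. The one-sided weights 0.5, 0.3, and 0.2 are arbitrary illustrative values, and the operator expression stands in for filter() rather than reproducing its syntax.

```stata
clear
set obs 6
generate t = _n
generate y = t^2
tsset t

* Official egen: two-sided moving average of an odd number of terms
egen ma3 = ma(y), t(3)

* The same average written with lag and lead operators
generate ma3b = (L.y + y + F.y) / 3

* An arbitrary one-sided linear filter (weights are illustrative)
generate onesided = 0.5*y + 0.3*L.y + 0.2*L2.y
list
```

By default ma() leaves the first and last observations missing, since the full window is not available there; the operator-based versions are likewise missing wherever a required lag or lead does not exist.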
Useful data-management functions include rall(), rany(), and rcount(). These rowwise functions, working from a varlist, evaluate a specified condition and indicate whether all (any) of the variables satisfy the condition or how many variables satisfy the condition. For instance, typing
. egen allpos = rall(var1 var2 var3), cond(@ > 0 & @ < .)
. egen anyneg = rany(var1 var2 var3), cond(@ < 0)
. egen countpos = rcount(dum*), cond(@ > 0 & @ < .)
would create allpos with a value of 1 for each observation in which all three variables are positive and nonmissing and 0 otherwise; anyneg with a value of 1 where any of the three variables were negative and 0 otherwise; and countpos indicating the number of nonmissing dummy variables that are positive. You could use countpos to ensure that a set of dummies is mutually exclusive and exhaustive, since it should return 1 for each observation (countpos has other uses as well). The @ symbol is a placeholder, standing for the value of the variable in that observation. You can also apply these functions to string variables.
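To make the role of the @ placeholder concrete, the counting logic can be sketched as an explicit loop in official Stata, substituting each variable in turn into the condition. This is an illustration of the idea rather than the rcount() code, and the three dummies are made-up data.

```stata
* Made-up dummies that are mutually exclusive and exhaustive
clear
input dum1 dum2 dum3
1 0 0
0 1 0
0 0 1
end

generate byte countpos = 0
foreach v of varlist dum* {
    * the condition "@ > 0 & @ < ." with @ replaced by the current variable
    replace countpos = countpos + (`v' > 0 & `v' < .)
}

* Exactly one dummy should be switched on in every observation
assert countpos == 1
```

The final assert stops with an error if any observation has no dummy, or more than one dummy, equal to 1.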
Another useful data-management function is the record() function (the name is meant to evoke “setting a record”). You can use this function to compute the record value, such as the highest wage earned to date by each employee or the lowest stock price encountered to date. If the data contain annual wage rates for several employees over several years,
. egen hiwage = record(wage), by(empid) order(year)
will compute for each employee (as specified with by(empid)) the highest wage earned to date, allowing you to evaluate conditions when wages have fallen because of a job change, etc.11 Several other egen functions are available in the egenmore package on the SSC archive.
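The running-record calculation can be sketched by hand in official Stata, which shows what the result looks like and how it can flag wage declines. The data and the variable fell are hypothetical, and this is not the record() code.

```stata
* Hypothetical annual wages for two employees
clear
input empid year wage
1 2001 10
1 2002 12
1 2003 11
2 2001 20
2 2002 18
end

* Start each employee's record at the first year's wage
bysort empid (year): generate hiwage = wage if _n == 1

* Carry the record forward, raising it whenever the wage exceeds it
by empid: replace hiwage = max(wage, hiwage[_n-1]) if _n > 1

* Flag years in which the wage is below the employee's record to date
generate byte fell = wage < hiwage
list, sepby(empid)
```

Because replace works through the observations in order, each hiwage[_n-1] already holds the record up to the previous year.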
11 I am grateful to Nicholas J. Cox for his thorough documentation of help egenmore.
In summary, egen functions handle several common data-management tasks. The open-ended nature of this command implies that new functions often become available, either through ado-file updates to official Stata or through contributions from the user community. The latter will generally be announced on Statalist (with past messages accessible in the Statalist archives), and recent contributions will be highlighted in ssc whatsnew.