Session 4: Descriptive statistics and exporting Stata results


In this session we work with descriptive statistics in Stata. We first give a short introduction to the basic statistical concepts of the session, and then explain how to obtain them in Stata and how to export the results.

1. Short introduction to descriptive statistics

Descriptive statistics are used to describe the contents and properties of a given variable. With a single number, or a limited set of numbers, we can easily see how a variable is distributed in our sample or population of interest.

Average

It is the most well-known descriptive statistic, equal to the sum of all cases divided by the number of cases:

$$ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} $$

Weighted average

Each observation is weighted by a given value that represents the importance of its contribution to the final average. It is calculated like the average, but multiplying each observation by its weight and dividing by the sum of the weights:

$$ \bar{X}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i} $$

Median

It is the central value of a variable: it has as many cases below as above. More formally, it is the value of the distribution such that half of the values are lower than or equal to it and the other half are higher than or equal to it. If the number of cases is even, the median equals the average of the two central values.
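As an illustration of the average, the weighted average and the median, here is a small worked example. The numbers are made up for the purpose and do not come from any dataset used in the course.

```latex
% Five made-up observations, already ordered: 2, 4, 4, 7, 13
\bar{X} = \frac{2 + 4 + 4 + 7 + 13}{5} = \frac{30}{5} = 6
\qquad
\text{median} = 4 \quad (\text{the third of the five ordered values})

% With an even number of cases (2, 4, 7, 13) the two central values are averaged
\text{median} = \frac{4 + 7}{2} = 5.5

% Weighted average of two made-up marks, 6 and 9, with weights 1 and 3
\bar{X}_w = \frac{1 \cdot 6 + 3 \cdot 9}{1 + 3} = \frac{33}{4} = 8.25
```

Note how the single high value 13 pulls the average (6) above the median (4), and how the weighted average (8.25) lies closer to the mark with the larger weight.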

Mode

It is the most frequent value of a variable.

Quartiles and percentiles

Quartiles are an extension of the median: they are the values that have 25%, 50% and 75% of the cases below them, respectively. Percentiles are, in turn, a generalization of the same idea: percentile p has p% of the values below it and (100-p)% above it.

Variance

The variance expresses how spread out a distribution is. It equals the mean of the squared deviations of the variable from its mean:

$$ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n} $$



Standard deviation

The standard deviation is the square root of the variance:

$$ \sigma = \sqrt{\sigma^2} $$

The standard deviation is the most widely used measure of dispersion, in part because it is expressed in the same units as the variable itself. As a reference point we can use what we know about the normal distribution: approximately 95% of the cases lie within +/- 2 standard deviations of the mean, and 99.73% within +/- 3 standard deviations.
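The two definitions can be followed step by step in a small worked example. The five observations are made up for illustration only.

```latex
% Five made-up observations: 2, 4, 4, 7, 13, whose mean is 6
\sigma^2 = \frac{(2-6)^2 + (4-6)^2 + (4-6)^2 + (7-6)^2 + (13-6)^2}{5}
         = \frac{16 + 4 + 4 + 1 + 49}{5}
         = \frac{74}{5} = 14.8

\sigma = \sqrt{14.8} \approx 3.85
```

Keep in mind that Stata's summarize command divides by n - 1 rather than by n when it computes the variance, so for these five values it would report 74/4 = 18.5 and a standard deviation of about 4.30. The difference becomes negligible in large samples.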

Range

The range of a variable equals the difference between the largest and smallest values, and expresses its amplitude.


Interquartile range

The range can be affected by extreme values and may therefore misrepresent the amplitude. Instead we can use the interquartile range, which equals the difference between the third and first quartiles. Half of the cases lie within the interquartile range.

$$ IQR = Q_3 - Q_1 $$

Skewness

It measures the symmetry of the distribution. It takes the normal distribution as a reference point, because it is perfectly symmetrical. A normally distributed variable has a skewness of 0. Otherwise the skewness can be:

1. Positive: a longer tail to the right, with more observations on the left and therefore few high values. Also called right-skewed.

2. Negative: a longer tail to the left, with more observations on the right and few low values. Also called left-skewed.

Descriptive statistics in Stata

Stata can present all this information with the command summarize:

• Summarize

The command summarize variable1 variable2 (etc.) reports the number of valid observations, the mean, the standard deviation and the minimum and maximum values of the variables. If we want some additional information, we can use the option detail:


o detail: Typing summarize variable1 variable2, detail makes Stata display, in addition to the mean, standard deviation, minimum and maximum, the percentiles, the variance and the skewness of each variable.
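The following is a minimal sketch of both forms of summarize. It assumes the auto example dataset that ships with Stata (loaded with sysuse); the choice of dataset and of the variables price, mpg and weight is ours, purely for illustration. It also shows that summarize leaves its results in r(), so that the range and the interquartile range can be computed from them, and that a weighted average can be requested with analytic weights.

```stata
* Load an example dataset shipped with Stata (an assumption: any dataset works)
sysuse auto, clear

* Basic summary: observations, mean, standard deviation, minimum, maximum
summarize price mpg

* Extended summary of one variable: percentiles, variance, skewness, ...
summarize price, detail

* summarize stores its results in r(); reuse them right after the command
display "Median:              " r(p50)
display "Range:               " r(max) - r(min)
display "Interquartile range: " r(p75) - r(p25)
display "Skewness:            " r(skewness)

* Weighted average: each car's mpg weighted by the variable weight
summarize mpg [aweight = weight]
display "Weighted mean of mpg: " r(mean)
```

The r() results are overwritten by the next command that stores results, so they should be used or saved immediately after the summarize call they belong to.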

• Descriptive statistics tables

The summarize command is useful for summarizing the whole sample. Although we can combine it with the if qualifier and the by prefix to get descriptives of sub-samples, it is not the most appropriate command for that. Stata has several useful commands for building tables of descriptives by group:

▪ tabulate, summarize: tabulate groupvariable, summarize(variable1) shows, for each group defined by groupvariable, the mean and standard deviation of variable1 together with the frequency of the group.

▪ tabstat is a more powerful command, since the table can include a wider choice of descriptive statistics for more than one variable:

tabstat variable1 variable2, statistics(mean median sd min max) by(groupvariable) format(%9.2f)
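As a hedged illustration of the two table commands, the sketch below again assumes the auto dataset shipped with Stata, with foreign as the grouping variable and price and mpg as the variables to describe. These names are our choice for the example, not part of the course data.

```stata
* Load an example dataset shipped with Stata (an assumption for this sketch)
sysuse auto, clear

* Mean, standard deviation and frequency of price for each group of foreign
tabulate foreign, summarize(price)

* Several statistics of several variables, by group, with two decimals
tabstat price mpg, statistics(mean median sd min max) by(foreign) format(%9.2f)

* Same idea with the statistics in columns, often easier to paste into a paper
tabstat price mpg, statistics(mean sd) by(foreign) columns(statistics)
```

By default tabstat also adds a row for the whole sample; the option nototal suppresses it.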

Exporting Stata results

Stata produces results in the main window, but often we want to export them to a spreadsheet or a Word document. This requires some additional work.

• Log files

The Stata Results window does not store the whole session, only the last part. If we want to store the whole output we should use a log file. We can open and name it through an icon in the main window, but the same can be done with commands:

o Open a log file: log using file.log opens a log file with the specified name, which will store all our activity. We can choose the format: .log (plain text) or .smcl (formatted). If the file already exists, we can either overwrite it (option replace) or add the new results at the end of the file (option append).

o Close the log file: log close closes the log file.

o Suspend the log file: Sometimes we might want to suspend the storing of results and then restart it. The commands log off and log on will do the trick.


o Check the status of the log file: We might easily forget whether a log file is open or not. In that case, we can just type log in the command line and Stata will tell us.
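The log commands above can be combined as in the following sketch. The file name session4.log and the commands run in between are hypothetical, chosen only to have something to record.

```stata
* Start a plain-text log; replace overwrites an existing file of that name
log using session4.log, replace

* Some commands whose output we want to keep (example dataset shipped with Stata)
sysuse auto, clear
summarize price mpg

* Pause logging: the output of describe is not written to the file
log off
describe

* Resume logging
log on
tabstat price mpg, statistics(mean sd) by(foreign)

* Report the status of the log, then close it
log
log close
```

To continue the same file in a later session, open it with log using session4.log, append instead of replace.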

• Copy results

Whether or not we use a log file, we will commonly use copy and paste to export our results to Word or Excel. In Stata we can copy the relevant results by highlighting them, right-clicking on them and choosing one of the following options:

o Copy: Copies the selection as text. It can be pasted into a word processor, but to preserve the alignment of the tables we have to use a fixed-width font such as Courier or Courier New and choose a small font size (10, 9 or 8, depending on the table).

o Copy table: This is the most useful option; it copies the selection as a table. If the table fits in the document, it will appear aligned by tabs, so we can easily convert it into a Word table. However, this option is best suited to using Excel as an intermediate step. We have to export one table at a time and, if possible, select the minimum number of elements.

o Copy table as HTML: Can be useful in some contexts.

o Copy image: Copies the table as an image to the clipboard. Only useful if, for whatever reason, we wish to keep exactly the same appearance as in Stata.

• Advanced commands

In this introductory course we are not going to deal with these commands in detail, but it is useful to know that several commands can produce publication-quality tables directly from Stata, ready to be used in our papers. These commands can save us a lot of time.

▪ tabout is the most complete command, a full table creation program. It takes some effort to learn, but it pays off. We can install it with the command ssc install tabout, and find a tutorial at www.ianwatson.com.au/stata/tabout_tutorial.pdf.

▪ esttab: For more advanced analysis, mainly regression models, the command esttab is useful because it easily creates .rtf documents with the tables we need.
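A minimal sketch of how these commands are installed and how esttab writes an .rtf file is given below. It assumes an internet connection for the installation, uses the auto dataset shipped with Stata, and the two regression models and the file name models.rtf are hypothetical examples. Note that esttab is distributed as part of the estout package on SSC.

```stata
* One-time installation from the SSC archive (needs an internet connection)
ssc install tabout, replace
* esttab and eststo are part of the estout package
ssc install estout, replace

* Example dataset shipped with Stata (an assumption for this sketch)
sysuse auto, clear

* Estimate and store two illustrative regression models
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight

* Write both models side by side to an .rtf file that Word can open
esttab using models.rtf, replace
```

Typing esttab without using shows the same table in the Results window, which is a convenient way to check it before exporting.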