Assumptions of Statistical Tests - OReilly Statistics in a Nutshell A Desktop Quick Reference A

Typical violations of the assumptions behind some common statistical tests are described below, along with ways to check whether each assumption has been violated.

t-tests

Two-sample t-tests assume that the samples are unrelated; if they are related, then a paired t-test should be used (t-tests are discussed further in Chapter 8). Unrelated here means independent: you can test for linear independence by using the correlation coefficient. Serial correlation may become an issue if data is collected over a period of time.

t-tests are also influenced by outliers; these should be removed when they are two or more standard deviations above or below the mean. Alternatively, they may be visually detected by using a boxplot or a normal Q-Q plot. Use caution with outlier removal, as removal of any data will reduce the generalizability of your results.
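The two-standard-deviation screen described above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended procedure: the function name `flag_outliers`, the threshold parameter `k`, and the measurements are all hypothetical and are not taken from the book.

```python
from statistics import mean, stdev

def flag_outliers(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample SD (divides by n - 1)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical measurements, invented for illustration only
sample = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 9.5]
outliers = flag_outliers(sample)
print("Flagged as outliers:", outliers)
```

Note that the suspect value itself inflates both the mean and the standard deviation, so in very small samples this screen can fail to flag even an extreme point; a boxplot or Q-Q plot remains a useful cross-check.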

Note that discarding outliers on the basis of sound statistical measures, such as the standard deviation, is an entirely separate activity from discarding data that happens to be unfavorable. For example, when there is a 5% chance of committing a Type I error, then discarding the 19 out of 20 experiments that do not meet your favored conclusion would not be statistically valid (or ethical).

t-tests assume that the underlying population variances of the two groups are equal (since the variances are pooled as part of the test); if they are not, then the Welch-Satterthwaite t-test should be used, since this provides a direct means to adjust for the inequality. An F-test could be performed to directly test the equivalence of variances, or a side-by-side boxplot comparison could be used.
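To make the adjustment concrete, the following sketch computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand. The function name `welch_t` and both groups of data are hypothetical, chosen only so that the spreads visibly differ; the formulas are the standard ones for the unequal-variance test.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n)
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical groups with clearly different variances
group_a = [5.1, 4.9, 5.0, 5.2, 4.8]
group_b = [6.0, 7.5, 4.5, 8.0, 5.0, 6.5]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Unlike the pooled test, the variances are never averaged together, and the degrees of freedom shrink toward those of the smaller or noisier group. If SciPy is available, `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` should give the same statistic along with a p-value.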

Normality of the distributions of both variables is assumed, although for the small samples that a t-test is often used to test, this may be difficult to establish; a histogram of the distribution should reveal any significant lack of symmetry (or skew). If the data are clearly nonnormal, a nonparametric or “distribution-free” test (discussed in Chapter 11), such as the Wilcoxon rank-sum test, may be more appropriate. A lack of balance in sample sizes may result in biased estimation of the population parameters in one of the groups; certainly, other things being equal, the standard error of the mean will be greatest in the smaller group.

Note that the t-test is often used with small sample sizes. Using small samples in any design may result in a lack of power, meaning that a true difference may not be detected. Unless variances are small, testing within small samples may produce a nonsignificant result, even if there truly is a difference. Relaxing the alpha level will increase power, as will increasing the sample size and/or reducing variance.

ANOVA

ANOVA has a large number of assumptions that need to be met, which usually requires directly determining whether each assumption is met (rather than hoping that it is met, or ignoring it). ANOVA (discussed further in Chapter 12) assumes independence and normality; again, the impact of outliers needs to be considered if these are the main cause of the nonnormality, and attempting to screen them may radically change the result of the F-test, but at least the result would then be valid. The most important assumption, from a practitioner’s perspective, is the equality of variances.

ANOVA is most reliable when sample sizes are balanced and when the population variances are equal. Skewed distributions and unequal variances may make the interpretation of the F-test unreliable. A side-by-side boxplot comparison may be very helpful; if data is sampled from a truly normal distribution, then there should be symmetry in the boxplots. If there is no attempt to establish normality, ask why. While it’s true that, if the population data is normally distributed, increasing the sample size will bring about a greater approximation to normality, if the population is not normal, then increasing the sample size won’t help. And yet many studies rely on large numbers to claim reliability, putting great faith in the Central Limit Theorem. Levene’s test and Bartlett’s test are very useful for determining whether the assumption of equal population variances has been met from a sample.

If samples are nonnormal and the population variances are thought to be unequal, and/or there is a lack of balance in sample sizes, it might be best to use a nonparametric test, such as the Kruskal-Wallis test. Alternatively, if sample sizes are unequal, but the other assumptions are met, then a Tukey-Kramer adjustment may be made.

MANOVA

In addition to the assumptions underlying univariate ANOVA, MANOVA assumes the equality of variance-covariance matrices (more on MANOVA can be found in Chapter 13). This assumption can be tested using the Box test, and significance levels are often provided. Data is also assumed to be multivariate normal; unfortunately, there is no direct test available for multivariate normality, but univariate normality tests should at least be undertaken.

MANOVA is also sensitive to outliers, and these should be removed before analysis, again noting that removing any cases from your analysis may reduce the generalizability of your results. Tests for linear relationships (to exclude nonlinear relations) should be performed; however, where multicollinearity arises (usually where r > 0.80, as a rule of thumb), reducing redundancy for dependent measures (through principal component analysis or similar) should be considered.

Linear regression

Like the other techniques described here, linear regression assumes the independence of errors in the independent variable (IV) and dependent variable (DV): if a seasonal effect is present, then examining the residuals should indicate that a more complex model is required (look for any pattern other than a random distribution). Linear regression is covered in depth in Chapters 12 and 14. Time series analysis, for example, provides methods to remove seasonal or cyclical trends from data before performing linear regression. Examining residuals is more an art than a science. However, by becoming familiar with residual analysis, you will be better able to assess the regression analyses presented by others and pinpoint any problems.

Table 6-1 shows average wholesale coffee prices per pound for the past 10 years. As you can see, the rise in prices is strongly correlated with the year, r = 0.991. There is some random variation present in the data; perhaps some prices were transcribed incorrectly, or perhaps some growers were slightly more or less greedy each year. But generally, the relationship is linear.

Figure 6-6 shows the residuals from the model fit, with an overlaid normal distribution. Although there are some deviations for such a small sample, it’s actually a good fit.


However, what if coffee prices had spiked in 2002 to $10.86? The correlation would then be r = 0.572, resulting in only about 33% of the variation in the DV being accounted for by the IV, rather than 98%. The residual plot in Figure 6-7 shows 9 cases clustered around standardized residuals of 1, while there is only one case with a residual of approximately 3. If that single case had been removed as an outlier, using the ±2 SD criterion, then the almost-perfect fit observed in Figure 6-6 would have been maintained.
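The effect of that single spike can be checked directly from the Table 6-1 prices. The sketch below computes Pearson's r by hand with and without the 2002 value replaced by $10.86; the helper name `pearson_r` and the variable names are illustrative choices, not anything prescribed by the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient computed from sums of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

years = list(range(1998, 2008))
prices = [2.40, 2.89, 3.75, 4.00, 4.20, 4.82, 5.19, 5.98, 6.36, 7.31]  # Table 6-1

spiked = prices.copy()
spiked[years.index(2002)] = 10.86  # the what-if spike described in the text

r_clean = pearson_r(years, prices)
r_spike = pearson_r(years, spiked)
print(f"original: r = {r_clean:.3f}, r^2 = {r_clean ** 2:.2f}")
print(f"spiked:   r = {r_spike:.3f}, r^2 = {r_spike ** 2:.2f}")
```

One altered observation out of ten is enough to pull the proportion of explained variation from roughly 0.98 down to roughly 0.33, which is why outlier screening matters so much in small samples.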

Table 6-1. Average wholesale coffee prices

Year  Price ($/lb)
1998  2.40
1999  2.89
2000  3.75
2001  4.00
2002  4.20
2003  4.82
2004  5.19
2005  5.98
2006  6.36
2007  7.31

Imagine a seasonal effect (shown in Table 6-2) that reflects government policy to run a subsidization program every second year to ensure that growers can remain competitive in a global market. In this case, there is an increasing linear trend overall (r = 0.74), but you can see a repeating pattern where there are serial clusters that are above and below zero. You wouldn’t see this from the histogram, which is why, especially with regression through time, it’s useful to examine the serial order of residuals. Figure 6-8 illustrates this.

Figure 6-7. Residuals and overlaid fit to normal distribution—single outlier

Table 6-2. Average wholesale coffee prices with cyclical effect

Year  Price ($/lb)
1998  2.51
1999  1.97
2000  2.63
2001  1.91
2002  2.66
2003  2.12
2004  2.86
2005  2.94
2006  3.48
2007  3.25
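The serial pattern in the Table 6-2 residuals can also be examined numerically. The sketch below fits a least-squares line to the table, lists the residuals in year order, counts sign flips between consecutive residuals, and computes the Durbin-Watson statistic. The Durbin-Watson statistic is not mentioned in the text; it is brought in here as one common summary of serial correlation, and the helper name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

years = list(range(1998, 2008))
prices = [2.51, 1.97, 2.63, 1.91, 2.66, 2.12, 2.86, 2.94, 3.48, 3.25]  # Table 6-2

slope, intercept = fit_line(years, prices)
residuals = [p - (intercept + slope * yr) for yr, p in zip(years, prices)]

# Number of times consecutive residuals switch sign
sign_changes = sum(1 for a, b in zip(residuals, residuals[1:]) if a * b < 0)

# Durbin-Watson: squared successive differences over squared residuals
dw = (sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
      / sum(e ** 2 for e in residuals))

for yr, e in zip(years, residuals):
    print(f"{yr}  {e:+.3f}")
print(f"sign changes: {sign_changes} of {len(residuals) - 1}, Durbin-Watson: {dw:.2f}")
```

A Durbin-Watson value near 2 is what uncorrelated residuals would produce; a value well above 2, together with residuals that flip sign almost every year, points to the alternating subsidy cycle that a histogram of the same residuals would hide.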


In other situations, there may be an observed expansion in the divergence at positive and negative parts of the cycle; perhaps the government increases spending on the subsidy in the first year, and then has to decrease the subsidy because it has less money. In this situation, you may see a bifurcation, as shown in Figure 6-9. Again, the correlation is still high, at r = 0.79, but the residuals, shown in Figure 6-10, clearly show the oscillation between successive residuals, as well as their increase in magnitude.

Figure 6-8. Residuals plotted serially: cyclical effect

Chapter 7: Inferential Statistics

Statistical inference is the science of characterizing or making decisions about a population using information from a sample drawn from that population. Most of the practice of statistics is concerned with inferential statistics, and many sophisticated techniques have been developed to facilitate this type of inference.

The name “inferential statistics” derives from the term “inference,” given two definitions by the Merriam-Webster online dictionary (http://www.m-w.com/dictionary/inference):

a) the act of passing from one proposition, statement, or judgment considered as true to another whose truth is believed to follow from that of the former

b) the act of passing from statistical sample data to generalizations (as of the value of population parameters) usually with calculated degrees of certainty

The second meaning, which is specific to statistics, is clearly related to the first. Inference in general is a method of making suppositions about an unknown, drawing on what is known to be true. Statistical inference is a refinement of ordinary inference, and is a process of making generalizations about unmeasured populations using statistics calculated on measured samples. Statistical inference has the additional advantage of quantifying the degree of certainty for a particular inference.

People sometimes get confused about the difference between descriptive statistics (covered in Chapter 4) and inferential statistics, partly because in many cases the statistical procedures used are identical while the interpretation differs. For instance, the same formula is used for calculating a mean whether the data represents a population or a sample: add up all the data values and divide by the number of values. There are differences in the notations of the formula, however, such as the use of the Greek letter µ to represent the population mean (which is properly called a parameter since it is a number that describes a population) and the Latin letter x with a bar over it (x̄), pronounced “x-bar,” to represent a sample mean (properly called a statistic since it is a number that represents a sample), and the use of the uppercase N for population size versus the lowercase n for sample size. In other cases, the formula is different: for instance, to calculate a population standard deviation we divide by N, while for the sample standard deviation we divide by n - 1.
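The distinction between the two divisors is easy to see with Python's standard `statistics` module, which provides `pstdev` (divide by N) and `stdev` (divide by n - 1). The data values below are hypothetical and chosen so the arithmetic comes out cleanly.

```python
from statistics import mean, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, for illustration only

avg = mean(data)          # same formula for a population or a sample
pop_sd = pstdev(data)     # treats data as the whole population: divides by N
sample_sd = stdev(data)   # treats data as a sample: divides by n - 1

print(f"mean = {avg}, population SD = {pop_sd:.3f}, sample SD = {sample_sd:.3f}")
```

The sample version is always the larger of the two for the same data, and the gap narrows as the number of values grows.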

So it can make a difference, even before you get to the interpretation stage, whether you are working with descriptive or inferential statistics. To decide which applies, think about the purpose of your study: is it merely to describe the specific people or entities that provided the data upon which you will perform the calculations? Or is it to generalize to a larger group of which the study objects are considered representative? The basic rule is this:

Any time you want to generalize your results beyond the specific cases that provided your data, you should be doing inferential statistics.

To look at the same question from another side:

Any time the cases that provided your data do not represent the entire population of interest, you should be doing inferential statistics.

In document OReilly Statistics in a Nutshell A Desktop Quick Reference Aug 2008 pdf (Page 142-150)