Statistical Inference II: Interval Estimation, Hypothesis Testing,
2.2 Hypothesis Testing
2.2.1 Mechanics of Hypothesis Testing
To formulate questions about transportation phenomena, a researcher must pose two competing statistical hypotheses: a null hypothesis (the hypothesis to be nullified) and an alternative. The null hypothesis, typically denoted by H0, is an assertion about one or more population parameters that is assumed to be true until there is sufficient statistical evidence to conclude otherwise. The alternative hypothesis, typically denoted by Ha, is the assertion of all situations not covered by the null hypothesis. Together the null and the alternative constitute a set of hypotheses that covers all possible values of the parameter or parameters in question. Considering the NMSL repeal previously discussed, the following pair of competing hypotheses could be formulated:
Null Hypothesis (H0): There has not been a change in mean speeds as a result of the repeal of the NMSL.
Alternative Hypothesis (Ha): There has been a change in mean speeds as a result of the repeal of the NMSL.
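Stated in symbols, this pair of hypotheses might be written as follows. The notation is an assumption introduced here for illustration: \mu_{\text{before}} and \mu_{\text{after}} stand for the population mean speeds before and after the repeal.

```latex
H_0:\ \mu_{\text{after}} - \mu_{\text{before}} = 0
\qquad \text{versus} \qquad
H_a:\ \mu_{\text{after}} - \mu_{\text{before}} \neq 0
```

Written this way, the two hypotheses together cover every possible value of the difference in means, as the definition above requires.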
The purpose of a hypothesis test is to determine whether it is appropriate to reject or not to reject the null hypothesis. The test statistic is the sample statistic upon which the decision to reject, or fail to reject, the null hypothesis is based. The nature of the hypothesis test is determined by the question being asked. For example, if an engineering intervention is expected to change a population mean (such as the mean of vehicle speeds), then a null hypothesis of no difference in means is appropriate. If an intervention is expected to change the spread or variability of data, then a null hypothesis of no difference in variances should be used. There are many different types of hypothesis tests that can be conducted. Regardless of the type of hypothesis test, the process is the same: the empirical evidence is assessed and will either refute or fail to refute the null hypothesis based on a prespecified level of confidence. Test statistics used in many parametric hypothesis testing applications rely upon the Z, t, F, and χ² distributions.
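As a minimal sketch of a test statistic for a null hypothesis of no difference in means, the following computes a two-sample Z statistic. It assumes independent samples that are large enough for the sample standard deviations to stand in for the population values. The function name `two_sample_z` and the speed summaries are hypothetical, invented purely for illustration.

```python
from math import sqrt


def two_sample_z(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Z statistic for H0: mu_1 - mu_2 = 0 (large-sample approximation)."""
    # Standard error of the difference between two independent sample means.
    standard_error = sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)
    return (mean_1 - mean_2) / standard_error


# Hypothetical speed summaries (mph): mean, standard deviation, sample size.
z = two_sample_z(62.0, 5.0, 100, 60.0, 5.0, 100)
print(f"z = {z:.3f}")
```

A value of the statistic far from zero, in either direction, is evidence against the null hypothesis of no difference in means.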
The decision to reject or fail to reject the null hypothesis is based on the rejection region, which is the range of values such that, if the test statistic falls into the range, the null hypothesis is rejected. Recall that, upon calculation of a test statistic, there is evidence either to reject or to fail to reject the null hypothesis. The phrases reject and fail to reject have been chosen carefully. When a null hypothesis is rejected, the information in the sample does not support the null hypothesis and it is concluded that it is unlikely to be true, a definitive statement. On the other hand, when a null hypothesis is not rejected, the sample evidence is consistent with the null hypothesis. This does not mean that the null hypothesis is true; it simply means that it cannot be ruled out using the observed data. It can never be proved that a statistical hypothesis is true using the results of a statistical test. In the language of hypothesis testing, any particular result is evidence as to the degree of certainty, ranging from almost uncertain to almost certain. No matter how close to the two extremes a statistical result may be, there is always a non-zero probability to the contrary.
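The decision rule can be sketched as follows, assuming a two-sided test whose statistic follows a standard normal (Z) distribution under the null hypothesis. The names `critical_value` and `decide`, and the parameter `alpha` (the prespecified probability of rejecting a true null hypothesis), are illustrative choices rather than anything defined in the text.

```python
from statistics import NormalDist


def critical_value(alpha):
    """Return c such that the two-sided rejection region is |z| > c."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def decide(z, alpha=0.05):
    """Apply the rejection-region rule to an observed Z statistic."""
    if abs(z) > critical_value(alpha):
        return "reject H0"
    # Not rejecting does not show that H0 is true, only that the data
    # cannot rule it out.
    return "fail to reject H0"


print(decide(2.83))
print(decide(1.50))
```

Note that the function never returns "accept H0", mirroring the careful choice of the phrases reject and fail to reject.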
Whenever a decision is based on the result of a hypothesis test, there is a chance that it will be incorrect. Consider Table 2.1. In this classical Neyman–Pearson methodology, the sample space is partitioned into two regions. If the observed data, as reflected through the test statistic, fall into the rejection or critical region, the null hypothesis is rejected. If the test statistic falls into the acceptance region, the null hypothesis cannot be rejected. When the null hypothesis is true, there is a probability α of rejecting it (Type I error). When the null hypothesis is false, there is still a probability β of failing to reject it (Type II error). The probability of a Type I error is the size of the test. It is conventionally denoted by α and called the significance level. The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis, and is given as 1 − β.
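The size and power of a test can be computed directly in a simple case. The sketch below assumes a two-sided Z test in which the test statistic is normally distributed with unit variance and mean `shift`, where `shift` (a name introduced here) is the true difference divided by its standard error. A shift of zero corresponds to a true null hypothesis; the value 2.8 is an arbitrary illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided_z(shift, alpha=0.05):
    """Probability of rejecting H0 when the Z statistic is N(shift, 1)."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(c - shift)  # statistic above +c
    lower_tail = STD_NORMAL.cdf(-c - shift)       # statistic below -c
    return upper_tail + lower_tail


size = power_two_sided_z(0.0)   # H0 true: rejection probability is alpha
power = power_two_sided_z(2.8)  # H0 false: probability of correct rejection
beta = 1.0 - power              # probability of a Type II error
print(f"size = {size:.3f}, power = {power:.3f}, beta = {beta:.3f}")
```

When the null hypothesis is true the rejection probability reduces to the significance level, and when it is false the rejection probability is the power, which is one minus the Type II error probability.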
Because both α and β are probabilities of making errors, they should be kept as small as possible. There is, however, a trade-off between the two: the smaller the α, the larger the β. Thus, if α is made very small, the “cost” is a higher probability of making a Type II error, all else being equal. For several reasons, the probability of making a Type II error is often ignored. The determination of which statistical error is least desirable depends on the research question asked and the subsequent consequences of making the respective errors. Both error types are undesirable, so attention to proper experimental design prior to data collection and sufficiently large sample sizes will help to minimize the probability of making these two statistical errors. In practice, the probability of making a Type I error, α, is usually set in the range from 0.01 to 0.10 (1 and 10% error rates, respectively). The selection of an appropriate α level is based on the consequences of making a Type I error. For example, if human lives are at stake when an error is made (accident investigations, medical studies), then an α of 0.01 or 0.005 may be most appropriate. In contrast, if an error results in monies being spent for improvements (congestion relief, travel time, etc.) that might not bring about improvements, then perhaps a less stringent α is appropriate.
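The trade-off between the two error probabilities, and the benefit of a larger sample, can be illustrated numerically. This sketch assumes a two-sided, two-sample Z test with equal group sizes and a known common standard deviation. The function name `type_ii_error` and all numerical inputs (a true difference of 2 mph, a standard deviation of 5 mph, samples of 100 and 200 per group) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(alpha, true_diff, sigma, n_per_group):
    """Type II error probability for a two-sided, two-sample Z test."""
    standard_error = sigma * sqrt(2.0 / n_per_group)
    shift = true_diff / standard_error
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the statistic lands in the non-rejection region
    # even though the null hypothesis is false.
    return STD_NORMAL.cdf(c - shift) - STD_NORMAL.cdf(-c - shift)


for n in (100, 200):
    for alpha in (0.10, 0.05, 0.01):
        beta = type_ii_error(alpha, true_diff=2.0, sigma=5.0, n_per_group=n)
        print(f"n = {n:3d}  alpha = {alpha:.2f}  beta = {beta:.3f}")
```

Within each sample size, the Type II error probability rises as the significance level is tightened; doubling the sample size lowers it at every significance level.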
TABLE 2.1
Results of a Test of Hypothesis
                 Reality
Test Result      H0 is true               H0 is false
Reject           Type I error             Correct decision
                 P(Type I error) = α
Do not reject    Correct decision         Type II error
                                          P(Type II error) = β