
Statistical and Econometric Methods for Transportation Data Analysis

Simon P. Washington
Matthew G. Karlaftis
Fred L. Mannering

CHAPMAN & HALL/CRC
A CRC Press Company


Cover Images: Left, “7th and Marquette during Rush Hour,” photo copyright 2002 Chris Gregerson,

www.phototour.minneapolis.mn.us. Center, “Route 66,” and right, “Central Albuquerque,” copyright Marble Street Studio, Inc., Albuquerque, NM.

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.

Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

Visit the CRC Press Web site at www.crcpress.com

© 2003 by Chapman & Hall/CRC

No claim to original U.S. Government works
International Standard Book Number 1-58488-030-9
Library of Congress Card Number 2003046163
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper

Library of Congress Cataloging-in-Publication Data

Washington, Simon P.

Statistical and econometric methods for transportation data analysis / Simon P. Washington, Matthew G. Karlaftis, Fred L. Mannering.

p. cm.

Includes bibliographical references and index.
ISBN 1-58488-030-9 (alk. paper)
1. Transportation--Statistical methods. 2. Transportation--Econometric models. I. Karlaftis, Matthew G. II. Mannering, Fred L. III. Title.

HE191.5.W37 2003


Dedication

To Tracy, Samantha, and Devon

— S.P.W.

To Amy, George, and John

— M.G.K.

To Jill, Willa, and Freyda

— F.L.M.


Preface

Transportation plays an essential role in developed and developing societies. Transportation is responsible for personal mobility; provides access to services, jobs, and leisure activities; and is integral to the delivery of consumer goods. Regional, state, national, and world economies depend on the efficient and safe functioning of transportation facilities and infrastructure.

Because of the sweeping influence transportation has on economic and social aspects of modern society, transportation issues pose challenges to professionals across a wide range of disciplines, including transportation engineering, urban and regional planning, economics, logistics, systems and safety engineering, social science, law enforcement and security, and consumer theory. Where to place and expand transportation infrastructure, how to operate and maintain infrastructure safely and efficiently, and how to spend valuable resources to improve mobility, access to goods, services, and health care are among the decisions made routinely by transportation-related professionals.

Many transportation-related problems and challenges involve stochastic processes, which are influenced by observed and unobserved factors in unknown ways. The stochastic nature of transportation problems is largely a result of the role that people play in transportation. Transportation system users routinely face decisions in transportation contexts, such as which transportation mode to use, which vehicle to purchase, whether or not to participate in a vanpool or to telecommute, where to relocate a residence or business, whether to support a proposed light-rail project, and whether or not to utilize traveler information before or during a trip. These decisions involve various degrees of uncertainty. Transportation system managers and governmental agencies face similar stochastic problems in determining how to measure and compare system performance, where to invest in safety improvements, how to operate transportation systems efficiently, and how to estimate transportation demand.

The complexity, diversity, and stochastic nature of transportation problems require that transportation analysts have an extensive set of analytical tools in their toolbox. Statistical and Econometric Methods for Transportation Data Analysis describes and illustrates some of these tools commonly used for transportation data analysis.

Every book must strike an appropriate balance between depth and breadth of theory and applications, given the intended audience. Statistical and Econometric Methods for Transportation Data Analysis targets two general audiences. First, it serves as a textbook for advanced undergraduate, master's, and Ph.D. students in transportation-related disciplines, including engineering, economics, urban and regional planning, and sociology. There is sufficient material to cover two 3-unit semester courses in analytical methods. Alternatively, a one-semester course could consist of a subset of topics covered in this book. The publisher's Web site, www.crcpress.com, contains the data sets used to develop this book so that applied modeling problems will reinforce the modeling techniques discussed throughout the text. Second, the book serves as a technical reference for researchers and practitioners wishing to examine and understand a broad range of analytical tools required to solve transportation problems. It provides a wide breadth of transportation examples and case studies, covering applications in various aspects of transportation planning, engineering, safety, and economics. Sufficient analytical rigor is provided in each chapter so that fundamental concepts and principles are clear, and numerous references are provided for those seeking additional technical details and applications.

The first section of the book provides statistical fundamentals (Chapters 1 and 2). This section is useful for refreshing readers regarding the fundamentals and for sufficiently preparing them for the sections that follow.

The second section focuses on continuous dependent variable models. The chapter on linear regression (Chapter 3) devotes a few extra pages to introducing common modeling practice — examining residuals, creating indicator variables, and building statistical models — and thus serves as a logical starting chapter for readers new to statistical modeling. Chapter 4 discusses the impacts of failing to meet linear regression assumptions and presents corresponding solutions. Chapter 5 deals with simultaneous equation models and presents modeling methods appropriate when studying two or more interrelated dependent variables. Chapter 6 presents methods for analyzing panel data — data obtained from repeated observations on sampling units over time, such as household surveys conducted several times on a sample of households. When data are collected continuously over time, such as hourly, daily, weekly, or yearly, time-series methods and models are often applied (Chapter 7). Latent variable models, presented in Chapter 8, are used when the dependent variable is not directly observable and is approximated with one or more surrogate variables. The final chapter in this section presents duration models, which are used to model time-until-event data such as survival, hazard, and decay processes.

The third section presents count and discrete dependent variable models. Count models (Chapter 10) arise when the data of interest are non-negative integers. Examples of such data include vehicles in a queue and the number of vehicular crashes per unit time. Discrete outcome models, which are extremely useful in many transportation applications, are described in Chapter 11. A unique feature of the book is that discrete outcome models are first derived statistically and then related to economic theories of consumer choice. Discrete/continuous models, presented in Chapter 12, demonstrate that interrelated discrete and continuous data need to be modeled as a system rather than individually, such as the choice of which vehicle to drive and how far it is driven.


The appendices are complementary to the text of the book. Appendix A presents the fundamental concepts in statistics that support the analytical methods discussed. Appendix B is an alphabetical glossary of statistical terms that are commonly used, serving as a quick and easy reference. Appendix C provides tables of probability distributions used in the book. Finally, Appendix D describes typical uses of data transformations common to many statistical methods.

Although the book covers a wide variety of analytical tools for improving the quality of research, it does not attempt to teach all elements of the research process. Specifically, the development and selection of useful research hypotheses, alternative experimental design methodologies, the virtues and drawbacks of experimental vs. observational studies, and some technical issues involved with the collection of data, such as sample size calculations, are not discussed. These issues are crucial elements in the conduct of research and can have a drastic impact on the overall results and quality of the research endeavor. It is considered a prerequisite that readers of this book be educated and informed on these critical research elements so that they appropriately apply the analytical tools presented here.


Contents

Part I

Fundamentals

1

Statistical Inference I: Descriptive Statistics

1.1 Measures of Relative Standing

1.2 Measures of Central Tendency

1.3 Measures of Variability

1.4 Skewness and Kurtosis

1.5 Measures of Association

1.6 Properties of Estimators

1.6.1 Unbiasedness

1.6.2 Efficiency

1.6.3 Consistency

1.6.4 Sufficiency

1.7 Methods of Displaying Data

1.7.1 Histograms

1.7.2 Ogives

1.7.3 Box Plots

1.7.4 Scatter Diagrams

1.7.5 Bar and Line Charts

2

Statistical Inference II: Interval Estimation, Hypothesis Testing, and Population Comparisons

2.1 Confidence Intervals

2.1.1 Confidence Interval for μ with Known σ²

2.1.2 Confidence Interval for the Mean with Unknown Variance

2.1.3 Confidence Interval for a Population Proportion

2.1.4 Confidence Interval for the Population Variance

2.2 Hypothesis Testing

2.2.1 Mechanics of Hypothesis Testing

2.2.2 Formulating One- and Two-Tailed Hypothesis Tests

2.2.3 The p-Value of a Hypothesis Test

2.3 Inferences Regarding a Single Population

2.3.1 Testing the Population Mean with Unknown Variance

2.3.2 Testing the Population Variance

2.3.3 Testing for a Population Proportion

2.4 Comparing Two Populations

2.4.1 Testing Differences between Two Means: Independent Samples

2.4.2 Testing Differences between Two Means: Paired Observations

2.4.3 Testing Differences between Two Population Proportions

2.4.4 Testing the Equality of Two Population Variances

2.5 Nonparametric Methods

2.5.1 The Sign Test

2.5.2 The Median Test

2.5.3 The Mann–Whitney U Test

2.5.4 The Wilcoxon Signed-Rank Test for Matched Pairs

2.5.5 The Kruskal–Wallis Test

2.5.6 The Chi-Square Goodness-of-Fit Test

Part II

Continuous Dependent Variable Models

3

Linear Regression

3.1 Assumptions of the Linear Regression Model

3.1.1 Continuous Dependent Variable Y

3.1.2 Linear-in-Parameters Relationship between Y and X

3.1.3 Observations Independently and Randomly Sampled

3.1.4 Uncertain Relationship between Variables

3.1.5 Disturbance Term Independent of X and Expected Value Zero

3.1.6 Disturbance Terms Not Autocorrelated

3.1.7 Regressors and Disturbances Uncorrelated

3.1.8 Disturbances Approximately Normally Distributed

3.1.9 Summary

3.2 Regression Fundamentals

3.2.1 Least Squares Estimation

3.2.2 Maximum Likelihood Estimation

3.2.3 Properties of OLS and MLE Estimators

3.2.4 Inference in Regression Analysis

3.3 Manipulating Variables in Regression

3.3.1 Standardized Regression Models

3.3.2 Transformations

3.3.3 Indicator Variables

3.3.3.1 Estimate a Single Beta Parameter

3.3.3.2 Estimate Beta Parameter for Ranges of the Variable

3.3.3.3 Estimate a Single Beta Parameter for m – 1 of the m Levels of the Variable

3.3.4 Interactions in Regression Models

3.4 Checking Regression Assumptions

3.4.1 Linearity

3.4.2 Homoscedastic Disturbances

3.4.3 Uncorrelated Disturbances

3.4.4 Exogenous Independent Variables

3.4.5 Normally Distributed Disturbances

3.5 Regression Outliers

3.5.1 The Hat Matrix for Identifying Outlying Observations

3.5.2 Standard Measures for Quantifying Outlier Influence

3.5.3 Removing Influential Data Points from the Regression

3.6 Regression Model Goodness-of-Fit Measures

3.7 Multicollinearity in the Regression

3.8 Regression Model-Building Strategies

3.8.1 Stepwise Regression

3.8.2 Best Subsets Regression

3.8.3 Iteratively Specified Tree-Based Regression

3.9 Logistic Regression

3.10 Lags and Lag Structure

3.11 Investigating Causality in the Regression

3.12 Limited Dependent Variable Models

3.13 Box–Cox Regression

3.14 Estimating Elasticities

4

Violations of Regression Assumptions

4.1 Zero Mean of the Disturbances Assumption

4.2 Normality of the Disturbances Assumption

4.3 Uncorrelatedness of Regressors and Disturbances Assumption

4.4 Homoscedasticity of the Disturbances Assumption

4.4.1 Detecting Heteroscedasticity

4.4.2 Correcting for Heteroscedasticity

4.5 No Serial Correlation in the Disturbances Assumption

4.5.1 Detecting Serial Correlation

4.5.2 Correcting for Serial Correlation

4.6 Model Specification Errors

5

Simultaneous Equation Models

5.1 Overview of the Simultaneous Equations Problem

5.2 Reduced Form and the Identification Problem

5.3 Simultaneous Equation Estimation

5.3.1 Single-Equation Methods

5.3.2 System Equation Methods

5.4 Seemingly Unrelated Equations

5.5 Applications of Simultaneous Equations to Transportation Data

6

Panel Data Analysis

6.1 Issues in Panel Data Analysis

6.2 One-Way Error Component Models

6.2.1 Heteroscedasticity and Serial Correlation

6.3 Two-Way Error Component Models

6.4 Variable Coefficient Models

6.5 Additional Topics and Extensions

7

Time-Series Analysis

7.1 Characteristics of Time Series

7.1.1 Long-Term Movements

7.1.2 Seasonal Movements

7.1.3 Cyclic Movements

7.1.4 Irregular or Random Movements

7.2 Smoothing Methodologies

7.2.1 Simple Moving Averages

7.2.2 Exponential Smoothing

7.3 The ARIMA Family of Models

7.3.1 The ARIMA Models

7.3.2 Estimating ARIMA Models

7.4 Nonlinear Time-Series Models

7.4.1 Conditional Mean Models

7.4.2 Conditional Variance Models

7.4.3 Mixed Models

7.4.4 Regime Models

7.5 Multivariate Time-Series Models

7.6 Measures of Forecasting Accuracy

8

Latent Variable Models

8.1 Principal Components Analysis

8.2 Factor Analysis

8.3 Structural Equation Modeling

8.3.1 Basic Concepts in Structural Equation Modeling

8.3.2 The Structural Equation Model

8.3.3 Non-Ideal Conditions in the Structural Equation Model

8.3.4 Model Goodness-of-Fit Measures

8.3.5 Guidelines for Structural Equation Modeling

9

Duration Models

9.1 Hazard-Based Duration Models

9.2 Characteristics of Duration Data

9.3 Nonparametric Models

9.4 Semiparametric Models

9.5 Fully Parametric Models

9.6 Comparisons of Nonparametric, Semiparametric, and Fully Parametric Models

9.7 Heterogeneity

9.8 State Dependence

9.9 Time-Varying Covariates

9.10 Discrete-Time Hazard Models

9.11 Competing Risk Models

Part III

Count and Discrete Dependent Variable Models

10

Count Data Models

10.1 Poisson Regression Model

10.2 Poisson Regression Model Goodness-of-Fit Measures

10.3 Truncated Poisson Regression Model

10.4 Negative Binomial Regression Model

10.5 Zero-Inflated Poisson and Negative Binomial Regression Models

10.6 Panel Data and Count Models

11

Discrete Outcome Models

11.1 Models of Discrete Data

11.2 Binary and Multinomial Probit Models

11.3 Multinomial Logit Model

11.4 Discrete Data and Utility Theory

11.5 Properties and Estimation of Multinomial Logit Models

11.5.1 Statistical Evaluation

11.5.2 Interpretation of Findings

11.5.3 Specification Errors

11.5.4 Data Sampling

11.5.5 Forecasting and Aggregation Bias

11.5.6 Transferability

11.6 Nested Logit Model (Generalized Extreme Value Model)

11.7 Special Properties of Logit Models

11.8 Mixed MNL Models

11.9 Models of Ordered Discrete Data

12

Discrete/Continuous Models

12.1 Overview of the Discrete/Continuous Modeling Problem

12.2 Econometric Corrections: Instrumental Variables and Expected Value Method

12.3 Econometric Corrections: Selectivity-Bias Correction Term


Appendix A: Statistical Fundamentals

Appendix B: Glossary of Terms

Appendix C: Statistical Tables

Appendix D: Variable Transformations

References


Part I


1

Statistical Inference I: Descriptive Statistics

This chapter examines methods and techniques for summarizing and interpreting data. The discussion begins by examining numerical descriptive measures. These measures, commonly known as point estimators, enable inferences about a population by estimating the value of an unknown population parameter using a single value (or point). This chapter also overviews graphical representations of data. Relative to graphical methods, numerical methods provide precise and objectively determined values that can easily be manipulated, interpreted, and compared. They permit a more careful analysis of the data than the more general impressions conveyed by graphical summaries. This is important when the data represent a sample from which inferences must be made concerning the entire population.

Although this chapter concentrates on the most basic and fundamental issues of statistical analyses, there are countless thorough introductory statistical textbooks that can provide the interested reader with greater detail. For example, Aczel (1993) and Keller and Warrack (1997) provide detailed descriptions and examples of descriptive statistics and graphical techniques. Tukey (1977) is the classical reference on exploratory data analysis and graphical techniques. For readers interested in the properties of estimators (Section 1.6), the books by Gujarati (1992) and Baltagi (1998) are excellent and fairly mathematically rigorous.

1.1 Measures of Relative Standing

A set of numerical observations can be ordered from smallest to largest magnitude. This ordering allows the boundaries of the data to be defined and allows for comparisons of the relative position of specific observations. If an observation is in the 90th percentile, for example, then 90% of the observations have a lower magnitude. Consider the usefulness of percentile rank in terms of a nationally administered test such as the Scholastic Aptitude Test (SAT) or Graduate Record Exam (GRE). An individual's score on the test is compared with the scores of all people who took the test at the same time, and the relative position within the group is defined in terms of a percentile. If, for example, the 80th percentile of GRE scores is 660, this means that 80% of the sample of individuals who took the test scored below 660 and 20% scored 660 or better. A percentile is defined as that value below which lies P% of the numbers in the sample. For sufficiently large samples, the position of the Pth percentile is given by (n + 1)P/100, where n is the sample size.

Quartiles are the percentage points that separate the data into quarters: the first quartile, below which lies one quarter of the data, making it the 25th percentile; the second quartile, or 50th percentile, below which lies half of the data; and the third quartile, or 75th percentile. The 25th percentile is often referred to as the lower or first quartile, the 50th percentile as the median or middle quartile, and the 75th percentile as the upper or third quartile. Finally, the interquartile range, a measure of the spread of the data, is defined as the difference between the first and third quartiles.
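As a brief illustration (not part of the original text), the following Python sketch applies the (n + 1)P/100 position rule to a small, made-up sample; the data values are hypothetical:

```python
def percentile_position(n, p):
    """Position of the Pth percentile in a sorted sample of size n,
    using the (n + 1) * P / 100 rule from the text."""
    return (n + 1) * p / 100

data = sorted([12, 7, 9, 15, 11, 8, 10])   # hypothetical observations
pos = percentile_position(len(data), 50)   # (7 + 1) * 50 / 100 = 4.0
print(pos, data[int(pos) - 1])             # 4th ordered value, 10, is the median
```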

1.2 Measures of Central Tendency

Quartiles and percentiles are measures of the relative positions of points within a given data set. The median constitutes a useful point because it lies in the center of the data, with half of the data points lying above it and half below. Thus, the median is a measure of the centrality of the observations.

Despite the existence of the median, by far the most popular and useful measure of central tendency is the arithmetic mean or, more succinctly, the mean. The sample mean or expectation is a statistical term that describes the central tendency, or average, of a sample of observations, and varies across samples. The mean of a sample of measurements $x_1, x_2, \ldots, x_n$ is defined as

$$MEAN(X) = E[X] = \overline{X} = \frac{\sum_{i=1}^{n} x_i}{n},\qquad(1.1)$$

where $n$ is the size of the sample.

When an entire population constitutes the set to be examined, the sample mean $\overline{X}$ is replaced by $\mu$, the population mean. Unlike the sample mean, the population mean is constant. The formula for the population mean is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},\qquad(1.2)$$

where $N$ is the number of observations in the entire population.

The mode (or modes, because it is possible to have more than one) of a set of observations is the value that occurs most frequently, or the most commonly occurring outcome, and strictly applies to discrete variables (nominal and ordinal scale variables) as well as count data. Probabilistically, it is the most likely outcome in the sample; it has occurred more than any other value.

It is useful to examine the advantages and disadvantages of each of the three measures of central tendency. The mean uses and summarizes all of the information in the data, is a single numerical measure, and has some desirable mathematical properties that make it useful in many statistical inference and modeling applications. The median, in contrast, is the central-most (center) point of ranked data. When computing the median, the exact locations of data points on the number line are not considered; only their relative standing with respect to the central observation is required. Herein lies the major advantage of the median: it is resistant to extreme observations or outliers in the data. The mean is, overall, the most frequently used measure of central tendency; in cases, however, where the data contain numerous outlying observations, the median may serve as a more reliable measure of central tendency.

If the sample data are measured on the interval or ratio scale, then all three measures of centrality (mean, median, and mode) make sense, provided that the level of measurement precision does not preclude the determination of a mode. If data are symmetric and if the distribution of the observations has only one mode, then the mode, the median, and the mean are all approximately equal (the relative positions of the three measures in cases of asymmetric distributions are discussed in Section 1.4). Finally, if the data are qualitative (measured on the nominal or ordinal scales), using the mean or median is senseless, and the mode must be used. For nominal data, the mode is the category that contains the largest number of observations.
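A short Python sketch (not from the book; the travel-time numbers are hypothetical) makes the robustness contrast concrete — the mean is pulled by a single outlier while the median and mode are not:

```python
import statistics

# Hypothetical travel times (minutes); the final observation is an outlier.
times = [12, 14, 13, 14, 15, 13, 13, 55]

print(statistics.mean(times))    # 18.625 -- pulled upward by the outlier
print(statistics.median(times))  # 13.5   -- resistant to the extreme value
print(statistics.mode(times))    # 13     -- most frequently occurring value
```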

1.3 Measures of Variability

Variability is a statistical term used to describe and quantify the spread or dispersion of data around the center, usually the mean. In most practical situations, knowing the average or expected value of a sample is not sufficient to obtain an adequate understanding of the data. Sample variability provides a measure of how dispersed the data are with respect to the mean (or other measures of central tendency). Figure 1.1 illustrates two distributions of data, one that is highly dispersed and another that is more tightly packed around the mean. There are several useful measures of variability, or dispersion. One measure previously discussed is the interquartile range. Another measure is the range, which is equal to the difference between the largest and the smallest observations in the data. The range and the interquartile range are measures of the dispersion of a set of observations, with the interquartile range more resistant to outlying observations. The two most frequently used measures of dispersion are the variance and its square root, the standard deviation.

The variance and the standard deviation are more useful than the range because, like the mean, they use the information contained in all the observations. The variance of a set of observations, or sample variance, is the average squared deviation of the individual observations from the mean and varies across samples. The sample variance is commonly used as an estimate of the population variance and is given by

$$s^2 = \frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^2}{n-1}.\qquad(1.3)$$

When a collection of observations constitutes an entire population, the variance is denoted by $\sigma^2$. Unlike the sample variance, the population variance is constant and is given by

$$\sigma^2 = \frac{\sum_{i=1}^{N}\left(x_i - \mu\right)^2}{N},\qquad(1.4)$$

where $\overline{X}$ in Equation 1.3 is replaced by $\mu$.

Because calculation of the variance involves squaring the original measurements, the measurement units of the variance are the square of the original measurement units. While variance is a useful measure of the relative variability of two sets of measurements, it is often preferable to express variability in the same units as the original measurements. Such a measure is obtained by taking the square root of the variance, yielding the standard deviation. The formulas for the sample and population standard deviations are given, respectively, as

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^2}{n-1}}\qquad(1.5)$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}\left(x_i - \mu\right)^2}{N}}.\qquad(1.6)$$

FIGURE 1.1 Examples of high and low variability data. (Panels: a low-variability and a high-variability distribution.)

Consistent with previous results, the sample standard deviation $s$ is a random variable, whereas the population standard deviation $\sigma$ is a constant.

A mathematical theorem attributed to Chebyshev establishes a general rule, which states that at least $1 - 1/k^2$ of all observations in a sample or population will lie within $k$ standard deviations of the mean, where $k$ is not necessarily an integer. For the approximately bell-shaped normal distribution of observations, an empirical rule of thumb suggests that the following approximate percentages of measurements will fall within 1, 2, or 3 standard deviations of the mean. These intervals are given as

$$\left[\overline{X} - s,\ \overline{X} + s\right],$$

which contains approximately 68% of the measurements,

$$\left[\overline{X} - 2s,\ \overline{X} + 2s\right],$$

which contains approximately 95% of the measurements, and

$$\left[\overline{X} - 3s,\ \overline{X} + 3s\right],$$

which contains approximately 99% of the measurements.

The standard deviation is an absolute measure of dispersion; it does not take into consideration the magnitude of the values in the population or sample. On some occasions, a measure of dispersion that accounts for the magnitudes of the observations (a relative measure of dispersion) is needed. The coefficient of variation is such a measure. It provides a relative measure of dispersion, where dispersion is given as a proportion of the mean. For a sample, the coefficient of variation (CV) is given as

$$CV = \frac{s}{\overline{X}}.\qquad(1.7)$$

If, for example, on a certain highway section vehicle speeds were observed with mean $\overline{X} = 45$ mph and standard deviation $s = 15$, then the CV is $s/\overline{X} = 15/45 = 0.33$. If, on another highway section, the average vehicle speed is $\overline{X} = 65$ mph and the standard deviation is $s = 15$, then the CV is equal to $s/\overline{X} = 15/65 = 0.23$, which is smaller and conveys the information that, relative to average vehicle speeds, the data in the first sample are more variable.

Example 1.1

By using the speed data contained in the "speed data" file that can be downloaded from the publisher's Web site (www.crcpress.com), the basic descriptive statistics are sought for the speed data, regardless of the season, type of road, highway class, and year of observation. Any commercially available software with statistical capabilities can accommodate this type of exercise. Table 1.1 provides descriptive statistics for the speed variable.

The descriptive statistics indicate that the mean speed in the sample collected is 58.86 mph, with little variability in speed observations (s is low at 4.41, while the CV is 0.075). The mean and median are almost equal, indicating that the distribution of the sample of speeds is fairly symmetric. The data set contains more information, such as the year of observation, the season (quarter), the highway class, and whether the observation was in an urban or rural area, which could give a more complete picture of the speed characteristics in this sample. For example, Table 1.2 examines the descriptive statistics for urban vs. rural roads. Interestingly, although some of the descriptive statistics may seem to differ from the pooled sample examined in Table 1.1, it does not appear that the differences between mean speeds and speed variation in urban vs. rural Indiana roads are important. Similar types of descriptive statistics could be computed for other categorizations of average vehicle speed.
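As a minimal sketch of how such descriptive statistics can be computed (not from the book; the speed values below are hypothetical stand-ins for the downloadable data file):

```python
import numpy as np

# Hypothetical speed observations (mph); the book's "speed data" file from
# www.crcpress.com would be loaded here instead.
speeds = np.array([56.4, 58.5, 61.5, 57.2, 60.1, 58.9, 55.0, 63.3])

n = speeds.size
mean = speeds.mean()
s = speeds.std(ddof=1)                     # sample std. deviation (n - 1 divisor)
cv = s / mean                              # coefficient of variation, Eq. 1.7
q1, median, q3 = np.percentile(speeds, [25, 50, 75])

print(f"N = {n}, mean = {mean:.2f}, s = {s:.2f}, CV = {cv:.3f}")
print(f"lower quartile = {q1:.2f}, median = {median:.2f}, upper quartile = {q3:.2f}")
```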

1.4 Skewness and Kurtosis

TABLE 1.1
Descriptive Statistics for Speeds on Indiana Roads

Statistic                     Value
N (number of observations)    1296
Mean                          58.86
Std. deviation                4.41
Variance                      19.51
CV                            0.075
Maximum                       72.5
Minimum                       32.6
Upper quartile                61.5
Median                        58.5
Lower quartile                56.4

Two additional attributes of a frequency distribution that are useful are skewness and kurtosis. Skewness is a measure of the degree of asymmetry of a frequency distribution. It is given as the average value of $(x_i - \mu)^3$ over the entire population (this is often called the third moment around the mean, or third central moment, with variance the second moment). In general, when the distribution stretches to the right more than it does to the left, it can be said that the distribution is right-skewed, or positively skewed. Similarly, a left-skewed (negatively skewed) distribution is one that stretches asymmetrically to the left (Figure 1.2). When a distribution is right-skewed, the mean is to the right of the median, which in turn is to the right of the mode. The opposite is true for left-skewed distributions. To make the measure $(x_i - \mu)^3$ independent of the units of measurement of the variable, it is divided by $\sigma^3$. This results in the population skewness parameter, often symbolized as $\gamma_1$. The sample estimate of this parameter, $g_1$, is given as

$$g_1 = \frac{m_3}{m_2^{3/2}},\qquad(1.8)$$

TABLE 1.2
Descriptive Statistics for Speeds on Rural vs. Urban Indiana Roads

Statistic                     Rural Roads    Urban Roads
N (number of observations)    888            408
Mean                          58.79          59.0
Std. deviation                4.60           3.98
Variance                      21.19          15.87
CV                            0.078          0.067
Maximum                       72.5           68.2
Minimum                       32.6           44.2
Upper quartile                60.7           62.2
Median                        58.2           59.2
Lower quartile                56.4           56.15

FIGURE 1.2 Skewness of a distribution. (Panels: left-skewed distribution with mean < median < mode; symmetric distribution with mean = median = mode; right-skewed distribution with mode < median < mean.)

where

$$m_3 = \frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^3}{n}, \qquad m_2 = \frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^2}{n}.$$

If a sample comes from a population that is normally distributed, then the parameter $g_1$ is normally distributed with mean 0 and standard deviation $\sqrt{6/n}$.

Kurtosis is a measure of the "flatness" (vs. peakedness) of a frequency distribution and is shown in Figure 1.3; it is the average value of $(x_i - \overline{X})^4$ divided by $s^4$ over the entire population or sample. Kurtosis ($\gamma_2$) is often called the fourth moment around the mean, or fourth central moment. For the normal distribution the parameter $\gamma_2$ has a value of 3. If the parameter is larger than 3, there is usually a clustering of points around the mean (a leptokurtic distribution), whereas if the parameter is lower than 3, the curve demonstrates a "flatter" peak than the normal distribution (platykurtic).

The sample kurtosis parameter, g2, is often reported as standard output of many statistical software packages and is given as

$$g_2 = \hat{\gamma}_2 - 3 = \frac{m_4}{m_2^2} - 3,\qquad(1.9)$$

where

$$m_4 = \frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^4}{n}.$$

For most practical purposes, a value of 3 is subtracted from the sample kurtosis parameter so that leptokurtic sample distributions have positive kurtosis parameters, whereas platykurtic sample distributions have negative kurtosis parameters.

FIGURE 1.3 Kurtosis of a distribution. (Panels: a platykurtic and a leptokurtic distribution.)

Example 1.2

Revisiting the speed data from Example 1.1, there is interest in determining the shape of the distributions for speeds on rural and urban Indiana roads. Results indicate that when all roads are examined together their skewness parameter is –0.05, whereas for rural roads the parameter has the value of 0.056 and for urban roads the value of –0.37. It appears that, at least on rural roads, the distribution of speeds is symmetric, whereas for urban roads the distribution is left-skewed.

Although the skewness parameter is similar for the two types of roads, the kurtosis parameter varies more widely. For rural roads the parameter has a value of 2.51, indicating a distribution close to normal, whereas for urban roads the parameter has a value of 0.26, indicating a relatively flat (platykurtic) distribution.
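A compact Python sketch of Equations 1.8 and 1.9 follows (not from the book; the simulated normal sample and its parameters are illustrative only):

```python
import numpy as np

def skew_kurt(x):
    """Sample skewness g1 (Eq. 1.8) and kurtosis g2 (Eq. 1.9), moment-based."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2, m3, m4 = (np.mean(d ** k) for k in (2, 3, 4))
    g1 = m3 / m2 ** 1.5        # skewness
    g2 = m4 / m2 ** 2 - 3.0    # kurtosis with 3 subtracted (0 for a normal)
    return g1, g2

# A normal sample should give g1 near 0 and g2 near 0.
rng = np.random.default_rng(1)
print(skew_kurt(rng.normal(59.0, 4.4, size=5000)))
```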

1.5 Measures of Association

To this point the discussion has focused on measures that summarize a set of raw data. These measures are effective for providing information regarding individual variables. The mean and the standard deviation of a variable, for example, convey useful information regarding the nature of the measurements related to that variable. However, the measures reviewed thus far do not provide information regarding possible relationships between variables. The correlation between two random variables is a measure of the linear relationship between them. The correlation parameter $\rho$ is a commonly used measure of linear correlation and gives a quantitative measure of how well two variables move together.

The correlation parameter is always in the interval $[-1, 1]$. When $\rho = 0$ there is no linear association, meaning that a linear relationship does not exist between the two variables examined. It is possible, however, for two variables to be nonlinearly related and yet have $\rho = 0$. When $\rho > 0$, there is a positive linear relationship between the variables examined, such that when one of the variables increases the other variable also increases, at a rate given by the value of $\rho$ (Figure 1.4). In the case when $\rho = 1$, there is a "perfect" positively sloped straight-line relationship between two variables. When $\rho < 0$ there is a negative linear relationship between the two variables examined, such that an increase in one variable is associated with a decrease in the value of the other variable, at a rate given by the value of $\rho$. In the case when $\rho = -1$ there is a perfect negatively sloped straight-line relationship between two variables.

The concept of correlation stems directly from another measure of association, the covariance. Consider two random variables, X and Y, both normally distributed with population means $\mu_X$ and $\mu_Y$, and population

standard deviations $\sigma_X$ and $\sigma_Y$, respectively. The population and sample covariances between X and Y are defined, respectively, as follows:

$$COV(X, Y) = \frac{\sum_{i=1}^{N}\left(x_i - \mu_X\right)\left(y_i - \mu_Y\right)}{N}\qquad(1.10)$$

$$COV(X, Y) = \frac{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)\left(y_i - \overline{Y}\right)}{n-1}.\qquad(1.11)$$

As can be seen from Equations 1.10 and 1.11, the covariance of X and Y is the expected value of the product of the deviation of X from its mean and the deviation of Y from its mean. The covariance is positive when the two variables move in the same direction, negative when the two variables move in opposite directions, and zero when the two variables are not linearly related.

As a measure of association, the covariance suffers from a major drawback. It is usually difficult to interpret the degree of linear association between two variables from the covariance because its magnitude depends on the magnitudes of the standard deviations of X and Y. For example, suppose that the covariance between two variables is 175. What does this say regarding the relationship between the two variables? The sign, which is positive, indicates that as one increases, the other also generally increases. However, the degree to which the two variables move together cannot be ascertained.

FIGURE 1.4 Positive (top) and negative (bottom) correlations between two variables.

But, if the covariance is divided by the standard deviations, a measure that is constrained to the range of values $[-1, 1]$, as previously discussed, is obtained. This measure, called the Pearson product-moment correlation parameter or, for short, the correlation parameter, conveys clear information about the strength of the linear relationship between the two variables. The population correlation parameter $\rho$ and the sample correlation parameter $r$ of X and Y are defined, respectively, as

$$\rho = \frac{COV(X, Y)}{\sigma_X \sigma_Y}\qquad(1.12)$$

$$r = \frac{COV(X, Y)}{s_X s_Y},\qquad(1.13)$$

where $s_X$ and $s_Y$ are the sample standard deviations.

Example 1.3

By using data contained in the "aviation 1" file, the correlations between annual U.S. revenue passenger enplanements, per-capita U.S. gross domestic product (GDP), and price per gallon for aviation fuel are examined. After deflating the monetary values by the Consumer Price Index (CPI) to 1977 values, the correlation between enplanements and per-capita GDP is 0.94, and the correlation between enplanements and fuel price is –0.72.

These two correlation parameters are not surprising. One would expect that enplanements and economic growth go hand-in-hand, while enplanements and aviation fuel price (often reflected by changes in fare price) move in opposite directions. However, a word of caution is necessary. The existence of a correlation between two variables does not necessarily mean that one of the variables causes the other. The determination of causality is a difficult question that cannot be directly answered by looking at correlation parameters. To this end, consider the correlation parameter between annual U.S. revenue passenger enplanements and annual ridership of the Tacoma-Pierce Transit System in Washington State. The correlation parameter is considerably high (–0.90), indicating that the two variables move in opposite directions in nearly straight-line fashion. Nevertheless, it is safe to say (1) that neither of the variables causes the other or (2) that the two variables are not even remotely related. In short, it needs to be stressed that correlation does not imply causation.
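The sample covariance and Pearson correlation of Equations 1.11 and 1.13 are straightforward to compute; the following sketch (not from the book; both series are made up) illustrates, with a library call as a cross-check:

```python
import numpy as np

# Hypothetical paired observations (e.g., an economic indicator x and
# a travel-demand measure y).
x = np.array([1.0, 1.3, 1.7, 2.1, 2.4, 2.9])
y = np.array([210.0, 240.0, 300.0, 330.0, 390.0, 420.0])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)  # Eq. 1.11
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                     # Eq. 1.13

print(cov_xy, r)                 # r near +1: strong positive linear association
print(np.corrcoef(x, y)[0, 1])   # library check: same r
```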

To this point, the discussion on correlation has focused solely on continuous variables measured on the interval or ratio scale. In some situations, however, one or both of the variables may be measured on the ordinal scale. Alternatively, two continuous variables may not satisfy the requirement of approximate normality assumed when using the Pearson product-moment correlation parameter. In such cases the Spearman rank correlation parameter, an alternative (nonparametric) method, should be used to determine whether a linear relationship exists between two variables.

The Spearman correlation parameter is computed first by ranking the observations of each variable from smallest to largest. Then, the Pearson correlation parameter is applied to the ranks; that is, the Spearman correlation parameter is the usual Pearson correlation parameter applied to the ranks of two variables. The equation for the Spearman rank correlation parameter is given as

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)},\qquad(1.14)$$

where $d_i$, $i = 1, \ldots, n$, are the differences in the ranks of $x_i$ and $y_i$: $d_i = R(x_i) - R(y_i)$.

There are additional nonparametric measures of correlation between variables, including Kendall's tau. Its estimation complexity, at least when compared with Spearman's rank correlation parameter, makes it less popular in practice.
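A direct implementation of Equation 1.14 follows (a minimal sketch, not from the book; it assumes no tied ranks, and the data are invented):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (Eq. 1.14); assumes no tied ranks."""
    rank_x = np.argsort(np.argsort(x)) + 1   # ranks 1..n
    rank_y = np.argsort(np.argsort(y)) + 1
    d = rank_x - rank_y
    n = len(x)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = np.array([86, 97, 99, 100, 101, 103])   # hypothetical observations
y = np.array([2, 20, 28, 27, 50, 29])
print(spearman(x, y))                        # about 0.886: strong monotone association
```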

1.6 Properties of Estimators

The sample statistics computed in previous sections, such as the sample average $\overline{X}$, variance $s^2$, and standard deviation $s$, are used as estimators of population parameters. In practice, population parameters (often simply called parameters) such as the population mean and variance are unknown constants. In practical applications, the sample average $\overline{X}$ is used as an estimator for the population mean $\mu_X$, the sample variance $s^2$ for the population variance $\sigma^2$, and so on. These statistics, however, are random variables and, as such, are dependent on the sample. "Good" statistical estimators of true population parameters satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency.

1.6.1 Unbiasedness

If there are several estimators of a population parameter, and if one of these estimators coincides with the true value of the unknown parameter, then this estimator is called an unbiased estimator. An estimator is said to be unbiased if its expected value is equal to the true population parameter it is meant to estimate. That is, an estimator, say, the sample average $\overline{X}$, is an unbiased estimator of $\mu_X$ if

$$E\left[\overline{X}\right] = \mu_X.\qquad(1.15)$$

The principle of unbiasedness is illustrated in Figure 1.5. Any systematic deviation of the estimator away from the population parameter is called a bias, and the estimator is called a biased estimator. In general, unbiased estimators are preferred to biased estimators.
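A small simulation sketch (not from the book; the distribution and its parameters are assumed for illustration) shows unbiasedness empirically — the sample mean and the (n − 1)-divisor variance average out to the true parameters, while the n-divisor variance is a biased estimator:

```python
import numpy as np

# Repeated sampling from N(mu = 10, sigma^2 = 4) with n = 20 illustrates
# unbiasedness: averaging estimates over many samples recovers the truth.
rng = np.random.default_rng(42)
mu, sigma, n, reps = 10.0, 2.0, 20, 20000

samples = rng.normal(mu, sigma, size=(reps, n))
print(samples.mean(axis=1).mean())          # ~10.0: sample mean is unbiased
print(samples.var(axis=1, ddof=0).mean())   # ~3.8 = (n-1)/n * 4: biased low
print(samples.var(axis=1, ddof=1).mean())   # ~4.0: unbiased
```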

1.6.2 Efficiency

The property of unbiasedness is not, by itself, adequate, because there can be situations in which two or more parameter estimates are unbiased. In these situations, interest is focused on which of several unbiased estimators is superior. A second desirable property of estimators is efficiency. Efficiency is a relative property in that an estimator is efficient relative to another, which means that it has a smaller variance than an alternative estimator. An estimator with the smaller variance is more efficient. As can be seen from Figure 1.6, both $\hat{\mu}_1$ and $\hat{\mu}_2$ are unbiased estimators of $\mu_1$ and $\mu_2$, respectively; however, $VAR\left(\hat{\mu}_1\right) = \sigma^2$ while $VAR\left(\hat{\mu}_2\right) = \sigma^2/n$, yielding a relative efficiency of $\hat{\mu}_2$ relative to $\hat{\mu}_1$ of $1/n$, where $n$ is the sample size.

In general, the unbiased estimator with minimum variance is preferred to alternative estimators. A lower bound for the variance of any unbiased estimator of $\theta$ is given by the Cramer–Rao lower bound, which can be written as (Gujarati, 1992)

$$VAR\left(\hat{\theta}\right) \geq \frac{1}{n\,E\left[\left(\frac{\partial \ln f(X;\theta)}{\partial \theta}\right)^{2}\right]} = \frac{1}{-n\,E\left[\frac{\partial^{2} \ln f(X;\theta)}{\partial \theta^{2}}\right]}.\qquad(1.16)$$

The Cramer–Rao lower bound is only a sufficient condition for efficiency. Failing to satisfy this condition does not necessarily imply that the estimator is not efficient.

FIGURE 1.5 Biased and unbiased estimators of the mean value of a population $\mu_x$.

Finally, unbiasedness and efficiency hold true for any finite sample $n$, and when $n \to \infty$ they become asymptotic properties.

1.6.3 Consistency

A third asymptotic property is that of consistency. An estimator is said to be consistent if the probability of its being closer to the true value of the parameter it estimates ($\theta$) increases with increasing sample size. Formally, this says that as $n \to \infty$, $P\left[\left|\hat{\theta} - \theta\right| > c\right] \to 0$ for any arbitrary constant $c$. For example, this property indicates that $\overline{X}$ will not differ from $\mu$ as $n \to \infty$. Figure 1.7 graphically depicts the property of consistency, showing the behavior of an estimator of the population mean $\mu$ with increasing sample size.

It is important to note that a statistical estimator may not be an unbiased estimator; however, it may be a consistent one. In addition, a sufficient condition for an estimator to be consistent is that it is asymptotically unbiased and that its variance tends to zero as $n \to \infty$ (Hogg and Craig, 1994).

FIGURE 1.6 Comparing efficiencies. (Panels: $X_1 \sim N(\mu, \sigma^2)$ vs. $\overline{X} \sim N(\mu, \sigma^2/n)$.)

FIGURE 1.7 The property of consistency. (Densities of the estimator narrow around $\mu_x$ as the sample size increases.)
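The behavior depicted in Figure 1.7 can be checked numerically; the following sketch (not from the book; all numbers are hypothetical) shows the probability of the sample mean straying from $\mu$ shrinking with the sample size:

```python
import numpy as np

# As n grows, P(|Xbar - mu| > c) shrinks toward zero -- the behavior the
# narrowing densities of Figure 1.7 depict.
rng = np.random.default_rng(0)
mu, sigma, c = 59.0, 4.4, 0.5

for n in (10, 100, 1000):
    xbars = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbars - mu) > c))
```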


1.6.4 Sufficiency

An estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates. In other words, $\overline{X}$ is sufficient for $\mu$ if $\overline{X}$ contains all the information in the sample pertaining to $\mu$.

1.7 Methods of Displaying Data

Although the different measures described in the previous sections often provide much of the information necessary to describe the nature of the data set being examined, it is often useful to utilize graphical techniques for examining data. These techniques provide ways of inspecting data to determine relationships and trends, identify outliers and influential observations, and quickly describe or summarize data sets. Pioneering methods frequently used in graphical and exploratory data analysis stem from the work of Tukey (1977).

1.7.1 Histograms

Histograms are most frequently used when data are either naturally grouped (gender is a natural grouping, for example) or when small subgroups may be defined to help uncover useful information contained in the data. A histogram is a chart consisting of bars of various heights. The height of each bar is proportional to the frequency of values in the class represented by the bar. As can be seen in Figure 1.8, a histogram is a convenient way of plotting the frequencies of grouped data. In the figure, frequencies on the first (left) Y-axis are absolute frequencies, or counts of the number of city transit buses in the State of Indiana belonging to each age group (data were taken from Karlaftis and Sinha, 1997). Data on the second Y-axis are relative frequencies, which are simply the count of data points in the class (age group) divided by the total number of data points.

Histograms are useful for uncovering asymmetries in data and, as such, skewness and kurtosis are easily identified using histograms.

1.7.2 Ogives

A natural extension of histograms is ogives. Ogives are the cumulative relative frequency graphs. Once an ogive such as the one shown in Figure 1.9 is constructed, the approximate proportion of observations that are less than any given value on the horizontal axis can be read directly from the graph. Thus, for example, it can be estimated from Figure 1.9 that the proportion of buses that are less than 6 years old is approximately 60%, and the proportion less than 12 years old is 85%.
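The cumulative relative frequencies behind an ogive are easy to tabulate; this sketch (not from the book; the bus ages and class width are invented) mirrors the Figure 1.9 construction:

```python
import numpy as np

ages = np.array([1, 2, 2, 3, 5, 5, 6, 7, 9, 12, 14, 16])  # hypothetical bus ages
bins = np.arange(0, 18, 2)                                 # 2-year age classes

counts, edges = np.histogram(ages, bins=bins)
rel_freq = counts / counts.sum()        # histogram heights (relative frequency)
ogive = np.cumsum(rel_freq)             # cumulative relative frequency

for edge, c in zip(edges[1:], ogive):
    print(f"proportion of buses younger than {edge:2d} years: {c:.2f}")
```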


1.7.3 Box Plots

When faced with the problem of summarizing essential information of a data set, a box plot (or box-and-whisker plot) is a pictorial display that is extremely useful. A box plot illustrates how widely dispersed observations are and where the data are centered. This is accomplished by providing, graphically, five summary measures of the distribution of the data: the largest observation, the upper quartile, the median, the lower quartile, and the smallest observation (Figure 1.10).

FIGURE 1.8 Histogram for bus ages in the State of Indiana (1996 data).

FIGURE 1.9 Ogive for bus ages in the State of Indiana.

Box plots can be very useful for identifying the central tendency of the data (through the median), identifying the spread of the data (through the interquartile range (IQR) and the length of the whiskers), identifying possible skewness of the data (through the position of the median in the box), identifying possible outliers (points beyond the 1.5(IQR) mark), and for comparing data sets.
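The five summary measures and the 1.5(IQR) outlier fences can be computed directly; a minimal sketch with invented values (not from the book) follows:

```python
import numpy as np

x = np.array([44.0, 52, 55, 56, 57, 58, 59, 60, 61, 63, 68, 85])  # hypothetical

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = x[(x < fences[0]) | (x > fences[1])]

print(x.min(), q1, med, q3, x.max())  # the five summary measures of the box plot
print(outliers)                       # points beyond the 1.5(IQR) fences
```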

1.7.4 Scatter Diagrams

Scatter diagrams are most useful for examining the relationship between two continuous variables. As examples, assume that transportation researchers are interested in the relationship between economic growth and enplanements, or the effect of a fare increase on travel demand. In some cases, when one variable depends (to some degree) on the value of the other variable, then the first variable, the dependent, is plotted on the vertical axis. The pattern of the scatter diagram provides information about the relationship between two variables. A linear relationship is one that can be approximately graphed by a straight line (see Figure 1.4). A scatter plot can show a positive correlation, no correlation, or a negative correlation between two variables (Section 1.5 and Figure 1.4 analyzed this issue in greater depth). Nonlinear relationships between two variables can also be seen in a scatter diagram and typically will be revealed as curvilinear. Scatter diagrams are typically used to uncover underlying relationships between variables, which can then be explored in greater depth with more quantitative statistical methods.

1.7.5 Bar and Line Charts

A common graphical method for examining nominal data is a pie chart. The Bureau of Economic Analysis of the U.S. Department of Commerce in its 1996 Survey of Current Business reported the percentages of the U.S. GDP accounted for by various social functions. As shown in Figure 1.11, transportation is a major component of the economy, accounting for nearly 11% of GDP in the United States. The data are nominal since the "values" of the variable, major social function, include six categories: transportation, housing, food, education, health care, and other. The pie graph illustrates the proportion of expenditures in each category of major social function.

FIGURE 1.10 The box plot. (Elements: whiskers to the smallest and largest observations within 1.5(IQR), box spanning the lower and upper quartiles, and the median.)

The U.S. Federal Highway Administration (FHWA, 1997) completed a report for Congress that provided information on highway and transit assets,

trends in system condition, performance, and finance, and estimated investment requirements from all sources to meet the anticipated demands in both highway travel and transit ridership. One of the interesting findings of the report was the pavement ride quality of the nation's urban highways as measured by the International Roughness Index. The data are ordinal because the "values" of the variable, pavement roughness, include five categories: very good, good, fair, mediocre, and poor. This scale, although it resembles the nominal categorization of the previous example, possesses the additional property of natural ordering between the categories (without uniform increments between the categories). A reasonable way to describe these data is to count the number of occurrences of each value and then to convert these counts into proportions.

FIGURE 1.11 U.S. GDP by major social function (1995). (From U.S. DOT, 1997.)

FIGURE 1.12 Percent miles of urban interstate by pavement roughness category. (From FHWA, 1997.)

Bar charts are a common alternative to pie charts. They graphically represent the frequency (or relative frequency) of each category as a bar rising from the horizontal axis; the height of each bar is proportional to the frequency (or relative frequency) of the corresponding category. Figure 1.13, for example, presents the motor vehicle fatal accidents by posted speed limit for 1985 and 1995, and Figure 1.14 presents the percent of on-time arrivals for some U.S. airlines for December 1997.

FIGURE 1.13 Motor vehicle fatal accidents by posted speed limit. (From U.S. DOT, 1997.)

FIGURE 1.14 Percent of on-time arrivals for December 1997. (From Bureau of Transportation Statistics, www.bts.gov.)

The final graphical technique considered in this section is the line chart. A line chart is obtained by plotting the frequency of a category above the point on the horizontal axis representing that category and then joining the points with straight lines. A line chart is most often used when the categories are points in time (time-series data). Line charts are excellent for uncovering trends of a variable over time. For example, consider Figure 1.15, which represents the evolution of the U.S. air-travel market. A line chart is useful for showing the growth in the market over time. Two points of particular interest to the air-travel market, the deregulation of the market and the Gulf War, are marked on the graph.

FIGURE 1.15 U.S. revenue passenger enplanements, 1954 through 1999. (From Bureau of Transportation Statistics, www.bts.gov.)

2

Statistical Inference II: Interval Estimation, Hypothesis Testing, and Population Comparisons

Scientific decisions should be based on sound analysis and accurate information. This chapter provides the theory and interpretation of confidence intervals, hypothesis tests, and population comparisons, which are statistical constructs (tools) used to ask and answer questions about the transportation phenomena under study. Despite their enormous utility, confidence intervals are often ignored in transportation practice, and hypothesis tests and population comparisons are frequently misused and misinterpreted. The techniques discussed in this chapter can be used to formulate, test, and make informed decisions regarding a large number of hypotheses. Such questions as the following serve as examples. Does crash occurrence at a particular intersection support the notion that it is a hazardous location? Do traffic-calming measures reduce traffic speeds? Does route guidance information implemented via a variable message sign system successfully divert motorists from congested areas? Did the deregulation of the air-transport market increase the market share for business travel? Does altering the levels of operating subsidies to transit systems change their operating performance? To address these and similar types of questions, transportation researchers and professionals can apply the techniques presented in this chapter.

2.1 Confidence Intervals

In practice, the statistics calculated from samples, such as the sample average $\overline{X}$, variance $s^2$, standard deviation $s$, and others reviewed in the previous chapter, are used to estimate population parameters. For example, the sample average $\overline{X}$ is used as an estimator for the population mean $\mu_x$, the sample variance $s^2$ is an estimate of the population variance $\sigma^2$, and so on. Recall from Section 1.6 that desirable or "good" estimators satisfy four important properties: unbiasedness, efficiency, consistency, and sufficiency. However, regardless of the properties an estimator satisfies, estimates will vary across samples and there is at least some probability that an estimate will be different from the population parameter it is meant to estimate. Unlike the point estimators reviewed in the previous chapter, the focus here is on interval estimates. Interval estimates allow inferences to be drawn about a population by providing an interval, a lower and upper boundary, within which an unknown parameter will lie with a prespecified level of confidence. The logic behind an interval estimate is that an interval calculated using sample data contains the true population parameter with some level of confidence (the long-run proportion of times that the true population parameter is contained in the interval). Intervals are called confidence intervals (CIs) and can be constructed for an array of levels of confidence. The lower value is called the lower confidence limit (LCL) and the upper value the upper confidence limit (UCL). The wider a confidence interval, the more confident the researcher is that it contains the population parameter (overall confidence is relatively high). In contrast, a relatively narrow confidence interval is less likely to contain the population parameter (overall confidence is relatively low).

All the parametric methods presented in the first four sections of this chapter make specific assumptions about the probability distributions of sample estimators, or make assumptions about the nature of the sampled populations. In particular, the assumption of an approximately normally distributed population (and sample) is usually made. As such, it is imperative that these assumptions, or requirements, be checked prior to applying the methods. When the assumptions are not met, the nonparametric statistical methods provided in Section 2.5 are more appropriate.
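Because these methods lean on approximate normality, a quick distributional check is a sensible first step before constructing any of the intervals that follow. The sketch below is purely illustrative and not part of the original text: it assumes Python with numpy and scipy available, and uses a simulated speed sample (a stand-in for real data, not the actual Indiana observations) with the D'Agostino-Pearson test implemented in scipy.stats.normaltest.

# Minimal sketch of a pre-analysis normality check.
# Assumes numpy and scipy; `speeds` is simulated stand-in data,
# not the actual Indiana speed sample used in this chapter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
speeds = rng.normal(loc=58.86, scale=5.5, size=1296)  # simulated speeds (mph)

stat, p_value = stats.normaltest(speeds)  # D'Agostino-Pearson K^2 test
print(f"K^2 = {stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Normality rejected; consider the nonparametric methods of Section 2.5.")
else:
    print("No evidence against normality; parametric intervals are reasonable.")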

2.1.1 Confidence Interval for $\mu$ with Known $\sigma^2$

The central limit theorem (CLT) suggests that whenever a sufficiently large random sample is drawn from any population with mean $\mu$ and standard deviation $\sigma$, the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. It can easily be verified that the standardized random variable $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ has a 0.95 probability of being between the range of values $[-1.96,\; 1.96]$ (see Table C.1 in Appendix C). A probability statement regarding $Z$ is given as

$$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95. \qquad (2.1)$$

With some basic algebraic manipulation, the probability statement of Equation 2.1 can be written in a different, yet equivalent form:

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95. \qquad (2.2)$$

Equation 2.2 reveals that, over a large number of intervals computed from different random samples drawn from the population, the proportion of values of $\bar{X}$ for which the interval $\left(\bar{X} - 1.96\,\sigma/\sqrt{n},\; \bar{X} + 1.96\,\sigma/\sqrt{n}\right)$ captures $\mu$ is 0.95. This interval is called the 95% confidence interval estimator of $\mu$. A shortcut notation for this interval is

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}. \qquad (2.3)$$

Obviously, probabilities other than 95% can be used. For example, a 90% confidence interval is

$$\bar{X} \pm 1.645\frac{\sigma}{\sqrt{n}}.$$

In general, any confidence level can be used in estimating confidence intervals. The confidence level is $(1 - \alpha)100\%$, and $Z_{\alpha/2}$ is the value of $Z$ such that the area in each of the tails under the standard normal curve is $\alpha/2$. Using this notation, the $(1 - \alpha)100\%$ confidence interval estimator of $\mu$ is written as

$$\bar{X} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \qquad (2.4)$$

Because the confidence level is inversely proportional to the risk that the confidence interval fails to include the actual value of $\mu$, it generally ranges between 0.90 and 0.99, reflecting 10% and 1% levels of risk of not including the true population parameter, respectively.

Example 2.1

A 95% confidence interval is desired for the mean vehicular speed on Indiana roads (see Example 1.1 for more details). First, the assumption of normality is checked; if this assumption is satisfied, the analysis can proceed. The sample size is $n = 1296$ and the sample mean is $\bar{X} = 58.86$. Suppose a long history of prior studies has shown the population standard deviation to be $\sigma = 5.5$. Using Equation 2.4, the confidence interval is obtained:

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}} = 58.86 \pm 1.96\frac{5.5}{\sqrt{1296}} = 58.86 \pm 0.30 = \left[58.56,\; 59.16\right].$$

The result indicates that the 95% confidence interval for the unknown population parameter Q consists of lower and upper bounds of 58.56 and 59.16. This suggests that the true and unknown population parameter would lie somewhere in this interval about 95 times out of 100, on average. The confidence interval is rather “tight,” meaning that the range of possible values is relatively small. This is a result of the low assumed standard deviation (or variability in the data) of the population examined. The 90% confidence interval, using the same standard deviation, is [58.60, 59.11], and the 99% confidence interval is [58.46, 59.25]. As the confidence interval becomes wider, there is greater and greater confidence that the interval contains the true unknown population parameter.
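For readers who wish to verify such calculations numerically, the following sketch (an illustration added here, not part of the original text) implements the Equation 2.4 interval in Python with scipy; the function name z_interval is simply a convenient label.

# Sketch: confidence interval for a mean with known population sigma
# (Equation 2.4), evaluated with the Example 2.1 values.
import math
from scipy import stats

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Return (LCL, UCL) for the mean when sigma is known."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

lcl, ucl = z_interval(x_bar=58.86, sigma=5.5, n=1296)
print(f"95% CI: [{lcl:.2f}, {ucl:.2f}]")  # prints [58.56, 59.16]

Changing confidence to 0.90 or 0.99 reproduces the other intervals quoted above, to within rounding.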

2.1.2 Confidence Interval for the Mean with Unknown Variance

In the previous section, a procedure was discussed for constructing confidence intervals around the mean of a normal population when the variance of the population is known. In the majority of practical sampling situations, however, the population variance is rarely known and is instead estimated from the data. When the population variance is unknown and the population is normally distributed, a $(1 - \alpha)100\%$ confidence interval for $\mu$ is given by

$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}, \qquad (2.5)$$

where $s$ is the square root of the estimated variance $s^2$, and $t_{\alpha/2}$ is the value of the $t$ distribution with $n - 1$ degrees of freedom (for a discussion of the $t$ distribution, see Appendix A).

Example 2.2

Continuing with the previous example, a 95% confidence interval for the mean speed on Indiana roads is computed, assuming that the population variance is not known and is instead estimated from the data (here $s = 4.41$, the sample standard deviation computed from the speed data). The sample size is $n = 1296$ and the sample mean is $\bar{X} = 58.86$. Using Equation 2.5, the confidence interval is obtained as

$$\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 58.86 \pm 1.96\frac{4.41}{\sqrt{1296}} = \left[58.61,\; 59.10\right].$$

Interestingly, inspection of the probabilities associated with the $t$ distribution (see Table C.2 in Appendix C) shows that the $t$ distribution converges to the standard normal distribution as $n \rightarrow \infty$. Although the $t$ distribution is the correct distribution to use whenever the population variance is unknown, when the sample size is sufficiently large the standard normal distribution can be used as an adequate approximation to the $t$ distribution, which is why $t_{\alpha/2} \approx 1.96$ is used here.
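The same check can be coded for the unknown-variance case. The sketch below (illustrative only, assuming Python with scipy) implements Equation 2.5; with 1295 degrees of freedom the exact critical value is about 1.962 rather than 1.96, so the printed bounds agree with Example 2.2 to within about 0.01.

# Sketch: confidence interval for a mean with unknown variance
# (Equation 2.5), evaluated with the Example 2.2 values.
import math
from scipy import stats

def t_interval(x_bar, s, n, confidence=0.95):
    """Return (LCL, UCL) using the t distribution with n - 1 df."""
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

lcl, ucl = t_interval(x_bar=58.86, s=4.41, n=1296)
print(f"95% CI: [{lcl:.2f}, {ucl:.2f}]")  # approximately [58.62, 59.10]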

2.1.3 Confidence Interval for a Population Proportion

Sometimes, interest centers on a qualitative (nominal scale) variable rather than a quantitative (interval or ratio scale) variable. There might be interest in the relative frequency of some characteristic in a population such as, for example, the proportion of people in a population who are transit users. In such cases, the estimator $\hat{p}$ of the population proportion $p$ has an approximately normal distribution provided that $n$ is sufficiently large ($n\hat{p} \ge 5$ and $n\hat{q} \ge 5$, where $\hat{q} = 1 - \hat{p}$). The mean of the sampling distribution is the population proportion $p$, and the standard deviation is $\sqrt{pq/n}$.

A large-sample $(1 - \alpha)100\%$ confidence interval for the population proportion $p$ is given by

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}, \qquad (2.6)$$

where the estimated sample proportion $\hat{p}$ is equal to the number of "successes" in the sample divided by the sample size $n$, and $\hat{q} = 1 - \hat{p}$.

Example 2.3

A transit planning agency wants to estimate, at a 95% confidence level, the share of transit users in the daily commute "market" (that is, the percentage of commuters using transit). A random sample of 100 commuters is obtained, and it is found that 28 people in the sample are transit users. By using Equation 2.6, a 95% confidence interval for $p$ is calculated as

$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.28 \pm 1.96\sqrt{\frac{(0.28)(0.72)}{100}} = 0.28 \pm 0.088 = \left[0.192,\; 0.368\right].$$

Thus, the agency is 95% confident that the share of transit users in the daily commute market is between 19.2 and 36.8%.
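A corresponding sketch for proportions (again illustrative only, assuming Python with scipy) implements Equation 2.6 and reproduces the Example 2.3 interval.

# Sketch: large-sample confidence interval for a population proportion
# (Equation 2.6), evaluated with the Example 2.3 values.
import math
from scipy import stats

def proportion_interval(successes, n, confidence=0.95):
    """Return (LCL, UCL) for p; requires n*p_hat >= 5 and n*q_hat >= 5."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

lcl, ucl = proportion_interval(successes=28, n=100)
print(f"95% CI for p: [{lcl:.3f}, {ucl:.3f}]")  # prints [0.192, 0.368]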

2.1.4 Confidence Interval for the Population Variance

In many situations, in traffic safety research for example, interest centers on the population variance (or a related measure such as the population standard deviation). As a specific example, vehicle speeds contribute to crash probability, with an important factor being the variability in speeds on the roadway. Speed variance, measured as differences in travel speeds on a roadway, relates to crash frequency in that a larger variance in speed between vehicles correlates with a larger frequency of crashes, especially for crashes involving two or more vehicles (Garber, 1991). Large differences in speeds result in an increase in the frequency with which motorists pass one another, increasing the number of opportunities for multivehicle crashes. Clearly, vehicles traveling the same speed in the same direction do not overtake one another; therefore, they cannot collide as long as the same speed is maintained (for additional literature on the topic of speeding and crash probabilities, covering both the United States and abroad, the interested reader should consult FHWA, 1995, 1998, and TRB, 1998).

A $(1 - \alpha)100\%$ confidence interval for $\sigma^2$, assuming the population is normally distributed, is given by

$$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right], \qquad (2.7)$$

where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are values of the $\chi^2$ distribution with $n - 1$ degrees of freedom; the area in the right-hand tail of the distribution beyond $\chi^2_{\alpha/2}$ is $\alpha/2$, while the area in the left-hand tail of the distribution below $\chi^2_{1-\alpha/2}$ is $\alpha/2$. The chi-square distribution is described in Appendix A, and the table of probabilities associated with the chi-square distribution is provided in Table C.3 of Appendix C.

Example 2.4

A 95% confidence interval for the variance of speeds on Indiana roads is desired. With a sample size of $n = 100$ and a sample variance of $s^2 = 19.51$ mph², and using the values from the $\chi^2$ table (Appendix C, Table C.3), one obtains $\chi^2_{\alpha/2} = 129.56$ and $\chi^2_{1-\alpha/2} = 74.22$. Thus, the 95% confidence interval is given as

$$\left[\frac{99(19.51)}{129.56},\; \frac{99(19.51)}{74.22}\right] = \left[14.91,\; 26.02\right].$$

The speed variance is, with 95% confidence, between 14.91 and 26.02. Again, the units of the variance in speed are mph².
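Equation 2.7 can likewise be sketched in code (illustrative only, assuming Python with scipy). Note that scipy's exact critical values for 99 degrees of freedom, roughly 128.42 and 73.36, differ slightly from the tabled values quoted above, so the computed bounds shift accordingly.

# Sketch: confidence interval for a population variance (Equation 2.7),
# evaluated with the Example 2.4 values (n = 100, s^2 = 19.51 mph^2).
from scipy import stats

def variance_interval(s2, n, confidence=0.95):
    """Return (LCL, UCL) for sigma^2, assuming a normal population."""
    alpha = 1 - confidence
    df = n - 1
    chi2_upper = stats.chi2.ppf(1 - alpha / 2, df)  # ~128.42 for df = 99
    chi2_lower = stats.chi2.ppf(alpha / 2, df)      # ~73.36 for df = 99
    return df * s2 / chi2_upper, df * s2 / chi2_lower

lcl, ucl = variance_interval(s2=19.51, n=100)
print(f"95% CI for variance: [{lcl:.2f}, {ucl:.2f}] mph^2")  # ~[15.04, 26.33]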

2.2 Hypothesis Testing

Hypothesis tests are used to assess the evidence on whether a difference in a population parameter (a mean, variance, proportion, etc.) between two or more groups reflects a true difference or is instead attributable to chance (random sampling variation).
