Quantitative Methods
Summary
Based on the script written by
Professor David Targett
Table of Contents
Measures
Location
Arithmetic mean
Median
Mode
Scatter
Range
Interquartile range
Mean Absolute Deviation (MAD)
Variance
Standard Deviation
Coefficient of Variation
Indices
Simple Index
Simple Aggregate Index
Weighted Aggregate Index (Laspeyres, Paasche)
Other Summary Measures
Skew
Kurtosis
Distributions
ANOVA table
One-Way Analysis of Variance
Two-Way Analysis of Variance
Regression Analysis
Binomial Distribution
Normal Distribution
Poisson Distribution
t-Distribution
Chi-squared Distribution
F-Distribution
Significance Tests
5 steps
Null hypothesis / Alternative hypothesis
Difference in means of two samples
Difference between paired samples
Holt-Winters method
Decomposition Method
Box-Jenkins Method
Forecasting
Qualitative Methods
Causal Modelling
Time Series Methods
Regression
Simple Linear Regression
Testing randomness of residuals
Runs Test
Multiple Regression Analysis
Stages in multiple regression analysis
Discarding of variables
Correlation
Correlation coefficient
R-bar-squared
Collinearity
Exams
Measures
Location
Arithmetic mean
The arithmetic mean is calculated as
$$x_{\text{mean}} = \frac{\text{Sum of readings}}{\text{Number of readings}} = \frac{\sum x}{n}$$
Median
The median is the middle number of a set of values. In case of an even number of readings the arithmetic mean of the middle two numbers is used.
Mode
The most frequent number that occurs in a set of readings.
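The three measures of location can be checked quickly with Python's standard library. This is a minimal sketch; the readings and the helper name `location_measures` are made up for illustration.

```python
from statistics import mean, median, mode

def location_measures(readings):
    """Return (arithmetic mean, median, mode) of a list of readings."""
    return mean(readings), median(readings), mode(readings)

# Made-up readings, for illustration only
readings = [2, 3, 3, 5, 7, 10]
avg, mid, most_frequent = location_measures(readings)
print("mean:", avg, "median:", mid, "mode:", most_frequent)
```

With an even number of readings, `median` averages the two middle values, as described above.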
Scatter
Range
Largest reading – smallest reading
Interquartile range
Range of the middle 50% of readings: Strip off the top and bottom 25% of readings, then use the Range calculation.
Mean Absolute Deviation (MAD)
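The mean absolute deviation is the average distance of readings from their arithmetic mean:
$$\text{MAD} = \frac{\sum |x - x_{\text{mean}}|}{n}$$
Variance
The variance is the average squared distance of readings from their arithmetic mean, with n - 1 as the divisor:
$$\text{Variance} = \frac{\sum (x - x_{\text{mean}})^2}{n - 1}$$
Easier to calculate for large numbers of readings:
$$\text{Variance} = \frac{\sum x^2 - n \cdot x_{\text{mean}}^2}{n - 1}$$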
Standard Deviation
The standard deviation is the square root of the variance.
Standard deviation = sqrt(Variance)
Coefficient of Variation
To compare the scatter of two groups whose means differ, the measure of scatter must be standardised before a relative comparison can be made. The coefficient of variation is one way to achieve this:
Coefficient of variation = Standard deviation / Arithmetic mean
Example:
             Mean     Standard deviation   Coefficient of variation
Airport 1     4 200   1 050                0.25
Airport 2    15 600   2 250                0.14
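The measures of scatter can be computed side by side as a sanity check. The sketch below uses made-up readings and hypothetical helper names; the variance uses the shortcut form with n - 1 as the divisor, and the MAD is taken as the mean of the absolute deviations from the mean.

```python
from math import sqrt

def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the arithmetic mean."""
    return std_dev / mean

def scatter_measures(readings):
    """Return range, MAD, variance, standard deviation and CV of the readings."""
    n = len(readings)
    x_mean = sum(readings) / n
    value_range = max(readings) - min(readings)
    mad = sum(abs(x - x_mean) for x in readings) / n
    # Shortcut form: (sum of x^2 - n * mean^2) / (n - 1)
    variance = (sum(x * x for x in readings) - n * x_mean ** 2) / (n - 1)
    std_dev = sqrt(variance)
    return {
        "range": value_range,
        "mad": mad,
        "variance": variance,
        "std_dev": std_dev,
        "cv": coefficient_of_variation(std_dev, x_mean),
    }

print(scatter_measures([2, 4, 4, 4, 5, 5, 7, 9]))  # made-up readings
print(coefficient_of_variation(1050, 4200))        # Airport 1 from the table
```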
Indices
Simple Index
An index is the result of the conversion of one series of numbers into another based on 100. One value is picked as the base value and assigned the value “100”. Previous and following values are calculated relative to this base value:
Index value = value / base value * 100
Simple Aggregate Index
An aggregate index takes into account multiple factors. Example: The price index for meat consists of the prices of beef, pork and lamb.
To do this, add together the values for each meat in a time period. The first time period gets assigned the value “100”. Previous and following periods are calculated relative to it, as for the simple index.
Price relative index: If some inputs have very different levels, the effect of a high-value column can overshadow a low-value column. To counter this a price relative index can be used. Here an index value is calculated per column and the aggregate index is then based on those index values.
Weighted Aggregate Index
A weighted aggregate index allows different weights to be given to different prices (columns). Instead of simply adding up prices the prices are first weighted by the quantities and then added up. The final number is used to create the combined index.
The quantities used for the weighting should be the same for all time periods as this index measures price changes and not quantity changes. If the quantities from the base month (row) are used the index is known as Laspeyres Index. The Laspeyres Index tends to overestimate inflation.
To use the methods for a quantities index the roles of prices and quantities need to be reversed.
The Paasche Index uses the weights from the most current time period. Here the whole series has to be recalculated when a new row is added. The Paasche Index tends to underestimate inflation.
A fixed weight index does not use the weight from the first or most recent time period. Some intermediate time period is picked or an average is used to calculate the index.
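The weighted aggregate index can be sketched as one function in which only the choice of weights changes. The prices, quantities and the name `weighted_index` are made up for illustration. Following the description above, the Paasche variant here applies the quantities of the most recent period to the whole series.

```python
def weighted_index(prices, weights, base=0):
    """Index series (base period = 100) of prices weighted by fixed quantities.

    prices:  one row per time period, one column per item
    weights: one quantity per item, used for every time period
    """
    base_cost = sum(p * w for p, w in zip(prices[base], weights))
    return [100 * sum(p * w for p, w in zip(row, weights)) / base_cost
            for row in prices]

# Made-up data: two items, two time periods
prices = [[10, 20], [12, 22]]
quantities = [[5, 2], [4, 3]]

laspeyres = weighted_index(prices, quantities[0])   # base-period quantities
paasche = weighted_index(prices, quantities[-1])    # most recent quantities
print(laspeyres, paasche)
```

Passing an intermediate row or an average of the quantity rows as `weights` would give a fixed weight index.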
Other summary measures
Skew
Skew measures the extent to which a distribution is non-symmetrical. Left (or negatively) skewed graphs show a peak to the right of the middle. Right (or positively) skewed graphs show the peak to the left of the middle. Zero-skewed graphs are symmetrical.
Kurtosis
Kurtosis measures the extent to which a distribution is “broad”, i.e. how “thick” or “pointy” the middle of the graph is. Low kurtosis means a pointier graph, high kurtosis indicates a “fatter” graph.
Distributions
ANOVA table
Conventionally analyses of variance are laid out in a systematic form called an ANOVA table (ANalysis Of VAriance table).
One-Way Analysis of Variance
Variation                 Degrees of freedom   Sums of squares   Mean square   F
Explained by treatments   c - 1                SST               MST           MST/MSE
Error or unexplained      (r - 1) * c          SSE               MSE
Total                     r * c - 1            SS
Two-Way Analysis of Variance
Variation                 Degrees of freedom   Sums of squares   Mean square   F
Explained by treatments   c - 1                SST               MST           MST/MSE
Explained by blocks       r - 1                SSB               MSB           MSB/MSE
Error or unexplained      (r - 1) * (c - 1)    SSE               MSE
Total                     r * c - 1            SS
Regression Analysis
Variation                 Degrees of freedom   Sums of squares   Mean square   F
Explained by treatments   k                    SST               MST           MST/MSE
Error or unexplained      n - k - 1            SSE               MSE
Total                     n - 1                SS
Degrees of freedom
c = Number of columns
r = Number of rows
n = Number of observations
k = Number of x variables in the regression
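The one-way table can be filled in with a few lines of code. This is a sketch under assumptions: the summary does not spell out how SST and SSE are obtained, so the usual definitions are used (SST from the spread of the column means around the grand mean, SSE from the spread of readings around their own column mean). Data and function name are made up.

```python
def one_way_anova(columns):
    """One-way ANOVA for c treatment columns with r readings each."""
    c = len(columns)
    r = len(columns[0])
    grand_mean = sum(sum(col) for col in columns) / (r * c)
    col_means = [sum(col) / r for col in columns]
    # Variation explained by treatments (assumed standard definition)
    sst = r * sum((m - grand_mean) ** 2 for m in col_means)
    # Unexplained variation within the columns
    sse = sum((x - m) ** 2 for col, m in zip(columns, col_means) for x in col)
    mst = sst / (c - 1)          # degrees of freedom: c - 1
    mse = sse / ((r - 1) * c)    # degrees of freedom: (r - 1) * c
    return {"SST": sst, "SSE": sse, "SS": sst + sse,
            "MST": mst, "MSE": mse, "F": mst / mse}

# Made-up readings: three treatments, three readings each
print(one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]]))
```

The resulting F value would then be compared with the critical value from an F table for (c - 1) and (r - 1) * c degrees of freedom.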
Binomial Distribution
The binomial distribution is based on taking samples from a population whose elements are of two types. A random sample of size n is taken. Because of the randomness of the sample it could contain anywhere between 0 and n elements of type 1.
Example:
20 percent of chips produced are defective. A sample of 30 chips is inspected to see how many are actually defective.
Parameters:
p = Proportion of the population of type 1 (1 - p is the proportion of type 2)
n = The size of the sample being taken
Calculation:
$$P(r \text{ of type 1 in sample}) = {}^{n}C_{r} \cdot p^{r} \cdot (1 - p)^{n - r}$$
$$\text{with } {}^{n}C_{r} = \frac{n!}{r! \cdot (n - r)!}$$
nCr answers the question: how many ways are there of choosing r elements out of n? Think Lotto, pick 6 out of 49 numbers:
49! / (6! * 43!) = 13 983 816 (chance of winning 1 in ~14 million)
Uses of the binomial distribution:
- Inspection schemes (does the observed defect rate differ from the agreed one?)
- Opinion polls (for/against)
- Selling (sale/no sale)
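The chip example and the Lotto count can be reproduced with a short sketch. The function name `binomial_probability` is made up; `math.comb` supplies nCr.

```python
from math import comb

def binomial_probability(r, n, p):
    """P(r elements of type 1 in a sample of size n), p = proportion of type 1."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Lotto: number of ways to pick 6 out of 49
print(comb(49, 6))

# Chip example: 20% defective, sample of 30 chips
for r in range(4):
    print(r, "defective:", round(binomial_probability(r, 30, 0.2), 4))
```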
Normal Distribution
The normal distribution is, when drawn, a bell-shaped, continuous and symmetrical curve. Unlike discrete distributions, it is not the height of the curve that defines the probability. Instead the area under the curve between two values on the x-axis gives the probability of an observation falling between those two values.
The area below the curve has the following attributes:
68.26% of the readings lie within ±1 standard deviation of the mean
95.44% of the readings lie within ±2 standard deviations of the mean
99.74% of the readings lie within ±3 standard deviations of the mean
Example:
The IQ of children has a mean of 100 and a standard deviation of 17 points. This means that:
68.26% of children have an IQ of 83 to 117
95.44% of children have an IQ of 66 to 134
99.74% of children have an IQ of 49 to 151
Parameters:
- Mean
- Standard deviation
For looking up values in the normal tables:
$$z_{\text{calc}} = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$$
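Instead of a printed normal table, the area under the curve can be computed from the error function. The sketch below is one way to do this with the standard library; the helper names are made up, and the IQ figures are the ones from the example.

```python
from math import erf, sqrt

def z_value(observed, mean, std_dev):
    """Number of standard deviations between the observed value and the mean."""
    return (observed - mean) / std_dev

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def probability_between(low, high, mean, std_dev):
    """Probability that an observation falls between low and high."""
    return (normal_cdf(z_value(high, mean, std_dev))
            - normal_cdf(z_value(low, mean, std_dev)))

# IQ example: mean 100, standard deviation 17
print(z_value(134, 100, 17))
print(round(probability_between(83, 117, 100, 17), 4))
print(round(probability_between(66, 134, 100, 17), 4))
```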
When to use a normal distribution
The normal distribution should be used when observations or measurements are taken from a population. Each observation is subject to multiple sources of disturbance. Each of those sources changes the value of the observation slightly. Some errors might cancel each other out (some being positive, some being negative) so most of the measurements will fall close to a mean value. Some observations however will experience errors that add up and will be further away from the mean.
A lot of real-world examples exist:
- IQs of children
- Heights of people of the same sex
- Dimensions of mechanically produced components
- Weights of machine-produced items
- Arithmetic means of large samples
Approximating a binomial distribution with the normal:
The binomial distribution can be approximated by the normal if both n * p and n * (1 - p) exceed a value of 5.
Mean = n * p
Standard deviation = sqrt(n * p * (1 - p))
If the proportion of defectives (instead of the number of defectives) is looked at, these values need to be used:
Mean = p
Standard deviation = sqrt(p * (1 - p) / n)
When a discrete distribution is approximated by the Normal care needs to be taken to use the correct values for the limits. For example if the probability of an event occurring less than 50 times is required this means we need to look for
Poisson Distribution
Describes the occurrence of isolated events within a continuum. Like the binomial distribution, it is based on taking a sample from a population of elements of two types, the types being the occurrence and the non-occurrence of an event. The Poisson distribution is discrete; its shape varies from right-skewed to almost symmetrical.
Example:
Continuum: Time
Events:
- A telephone call arrives at a switchboard
- No telephone call arrives at a switchboard
The total number of elements is infinite, as an unlimited number of non-occurrences are part of the sample.
Other uses include flaws in cable (cable being the continuum, flaws being the events) or mechanical breakdown of machinery (time again as continuum, breakdown as event).
Parameter:
λ = Average number of events per sample
Probability of r events occurring in a sample:
$$P(r) = \frac{e^{-\lambda} \cdot \lambda^{r}}{r!}$$
Example:
λ = 2 (average number of calls arriving per minute), so e^-λ = 0.135
P(0 calls) = 0.135 * 1 / 1 = 0.135
P(1 call) = 0.135 * 2 / 1 = 0.27
Instead of deriving the “full” value for every r we can calculate it incrementally:
$$P(r + 1) = \frac{P(r) \cdot \lambda}{r + 1}$$
P(1) = P(0) * λ
P(2) = P(1) * λ / 2
P(3) = P(2) * λ / 3
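The incremental calculation is easy to verify against the direct formula. The following sketch uses the switchboard example (λ = 2); the function names are made up for illustration.

```python
from math import exp, factorial

def poisson_direct(r, lam):
    """P(r) from the full formula e^-lam * lam^r / r!."""
    return exp(-lam) * lam ** r / factorial(r)

def poisson_table(lam, max_r):
    """P(0) .. P(max_r), each value derived from the previous one."""
    probs = [exp(-lam)]                         # P(0) = e^-lam
    for r in range(max_r):
        probs.append(probs[r] * lam / (r + 1))  # P(r + 1) = P(r) * lam / (r + 1)
    return probs

# Switchboard example: on average 2 calls per minute
for r, prob in enumerate(poisson_table(2, 4)):
    print(r, "calls:", round(prob, 3))
```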
Approximating a binomial distribution with the Poisson:
t-Distribution
Similar to the normal distribution but with longer tails, it is also continuous with a symmetrical shape. For small sample sizes the tails are considerably longer than those of the normal distribution, while for sample sizes of 30 or more the t-distribution and the normal distribution can be considered to behave the same.
Parameters:
- Arithmetic (sample) mean
- Standard deviation
- Degrees of freedom (sample size - 1)
$$t = \frac{\text{Observed sample mean} - \text{Mean}}{\text{Estimate of standard deviation} / \sqrt{\text{Sample size}}}$$
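As an illustration, the t value for a small sample can be computed as below. The sample and the hypothesised mean are made up, and `statistics.stdev` (divisor n - 1) is assumed to be the appropriate estimate of the standard deviation.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, hypothesised_mean):
    """Return (t value, degrees of freedom) for a single sample."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # stdev uses n - 1 as divisor
    t = (mean(sample) - hypothesised_mean) / standard_error
    return t, n - 1

# Made-up sample of 5 readings tested against a hypothesised mean of 12
t, degrees_of_freedom = t_statistic([10, 12, 14, 16, 18], 12)
print(round(t, 3), degrees_of_freedom)
```

The t value would then be compared with the table value for the given degrees of freedom and significance level.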
When to use a t-distribution
- The population standard deviation is unknown and has to be estimated from the sample
- The sample size is less than 30 (for sample sizes of 30 or more the normal could be used)
- The underlying distribution of the population from which the sample was taken is normal
All these conditions need to be met for the t-distribution to be applicable!
Example:
Test for length of life of 40 light bulbs -> normal distribution
Test for length of life of 20 light bulbs -> t-distribution
Uses of t:
1) Calculate limits for observed sample means to be within a confidence limit
Look up the t-value for the given degrees of freedom and the confidence limit. With the t-value, the standard deviation and the sample size the limits can then be calculated.
Chi-squared Distribution
The chi-squared (χ²) distribution provides the method for comparing an observed sample variance with a hypothesised population variance. It can answer the question: Is the observed scatter of the sample in accord with what is thought to be the scatter of the population?
Parameters:
$$\chi^2 = \frac{(n - 1) \cdot \text{Observed sample variance}}{\text{Population variance}}$$
Using Chi-squared to test differences in proportion
For each cell calculate the expected value (based on the row and column totals). Calculate the chi-squared value for each cell and sum them up:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
Here f_o is the observed and f_e the expected frequency of a cell.
The test is one-sided as it tests if the chi-squared value is higher than should be expected. Degrees of freedom = (Number of rows - 1) * (Number of columns - 1). Look up the critical value in the chi-squared table and compare it against the sum.
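The test of differences in proportion can be sketched as follows. It assumes the expected value of a cell is row total * column total / grand total, which is the usual reading of "based on the total"; the observed counts and the function name are made up.

```python
def chi_squared(observed):
    """Return (chi-squared value, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected frequency if rows and columns were unrelated (assumption)
            f_e = row_totals[i] * col_totals[j] / total
            chi2 += (f_o - f_e) ** 2 / f_e
    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, degrees_of_freedom

# Made-up 2 x 2 table of observed counts
print(chi_squared([[10, 20], [20, 10]]))
```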
F-Distribution
The F-distribution is used to compare the variance of one sample with that of a second. The variable of an F-distribution is the ratio between two variance estimates. Just as the location of two samples could be compared through the difference in their means (by applying a normal or t-test), so the scatter of two samples can be compared through the ratio of their variances (by applying an F-test).
Parameters:
Significance Tests
5 steps:
- Formulate the hypothesis
- Collect a sample of evidence
- Decide on the significance level
- Calculate the probability of the sample evidence occurring
- Compare the probability with the significance level
Null hypothesis / Alternative hypothesis
The null hypothesis usually refers to a hypothesis that the test tries to disprove. Example: The question is if in a sample there is a significant difference between the response of male and female patients to the treatment. The null hypothesis would be that there is NO significant difference in the response.
The alternative hypothesis is accepted if the null hypothesis is rejected. In the above example, accepting the alternative hypothesis means concluding that there is a difference in the response of male and female patients.
Difference in means of two samples
Two samples are taken from a population, and their means and the difference between the means are calculated. The mean of the distribution of those differences is 0 (the differences between sample means cancel each other out).
Variance sum theorem:
Variance(x + y) = Variance(x) + Variance(y)
Variance(x - y) = Variance(x) + Variance(y)
Each sample mean has a variance of V / n (V = population variance, n = sample size), so for two samples of the same size it follows that
Variance(xmean - ymean) = V / n + V / n = 2 * V / n
Standard deviation = sqrt(2) * s / sqrt(n)
Therefore z can be calculated as:
$$z = \frac{x_{1,\text{mean}} - x_{2,\text{mean}}}{\sqrt{2} \cdot s / \sqrt{n}}$$
Difference between paired samples
Create a new sample with the difference between paired values. Treat the new sample like a basic single sample significance test with a mean of 0.
$$z = \frac{x_{\text{mean}} - 0}{s / \sqrt{n}}$$
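Both test statistics, for the difference in means of two samples and for paired samples, can be sketched in a few lines. All figures and function names below are made up for illustration; the two-sample version assumes equal sample sizes n and a common standard deviation s, as in the formula above.

```python
from math import sqrt
from statistics import mean, stdev

def z_two_sample_means(mean_1, mean_2, s, n):
    """z for the difference between two sample means (equal sizes n)."""
    return (mean_1 - mean_2) / (sqrt(2) * s / sqrt(n))

def paired_statistic(first, second):
    """Test statistic for paired samples, based on the pairwise differences."""
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    # Hypothesised mean of the differences is 0
    return (mean(differences) - 0) / (stdev(differences) / sqrt(n))

print(round(z_two_sample_means(52, 50, 4, 32), 3))
print(round(paired_statistic([10, 12, 9, 11], [12, 15, 10, 13]), 3))
```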
Tests on proportions
Arithmetic mean = p
Standard deviation = sqrt(p * (1 -‐ p) / n)
Errors in significance tests
Type 1 errors
Occur when a hypothesis is rejected due to a sample in the “reject” tail of the distribution even though the hypothesis is in fact true. The probability of this is equal to the significance level (the usual 5 or 1%).
Type 2 errors
These occur when a hypothesis is accepted falsely. Determining the probability of this requires knowledge of the alternative hypothesis, which has to be precisely defined (a vague alternative such as “the average IQ is not 100” is not sufficient).
The probability of correctly accepting the alternative hypothesis is the power of the significance test.
Time Series Techniques
Smoothing methods
Series type   Methods
Stationary    Moving averages, Exponential smoothing
Trend         Holt
Seasonal      Holt-Winters
Cyclical      Decomposition
Other         Box-Jenkins
Stationary series
Moving average
Replace the original series with a smoothed series, replacing each value with the average of it and the neighbouring values. Examples are three-point moving average, five-point moving average, etc.
The calculated value can be used as the forecast for the first period after the last value used for its calculation only!
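A centred moving average with an odd number of points might look like the sketch below. The series and the function name are made up; values at the start and end of the series, which lack enough neighbours, are simply dropped.

```python
def moving_average(series, points=3):
    """Centred moving average; `points` is assumed to be odd."""
    half = points // 2
    return [sum(series[i - half:i + half + 1]) / points
            for i in range(half, len(series) - half)]

# Made-up series
print(moving_average([1, 2, 3, 4, 5]))        # three-point
print(moving_average([4, 8, 6, 10, 12], 5))   # five-point
```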
Exponential Smoothing
Gives more weight to recent values.
$$S_t = (1 - \alpha) \cdot S_{t-1} + \alpha \cdot x_t$$
α is usually in the range of 0.1 to 0.4
The smoothed value can be used as the forecast for the next period, as only past values are used for the calculation. The first smoothed value is set equal to the first value of the original series.
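The smoothing recursion can be written as a short loop. This sketch starts the smoothed series with the first original value, as described above; the series, α = 0.5 and the function name are made up for illustration.

```python
def exponential_smoothing(series, alpha):
    """Smoothed series S_t = (1 - alpha) * S_(t-1) + alpha * x_t."""
    smoothed = [series[0]]   # first smoothed value = first original value
    for x in series[1:]:
        smoothed.append((1 - alpha) * smoothed[-1] + alpha * x)
    return smoothed

smoothed = exponential_smoothing([10, 20, 30], alpha=0.5)
print(smoothed)
print("forecast for the next period:", smoothed[-1])
```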
Series with a trend
Holt’s Method
Two parameters:
α: Smoothing parameter for series values
γ: Smoothing parameter for trend values
$$S_t = (1 - \alpha) \cdot (S_{t-1} + b_{t-1}) + \alpha \cdot x_t$$
$$b_t = (1 - \gamma) \cdot b_{t-1} + \gamma \cdot (S_t - S_{t-1})$$
$$F_{t+m} = S_t + m \cdot b_t$$
x_t = actual observation at time t
S_t = smoothed value at time t
b_t = smoothed trend at time t
F_{t+m} = forecast for m periods in the future
The forecast can be used for the next month as only past values are used for the calculation. The first two smoothed values are set equal to the first two values of the original series. The first trend value is the difference between these two values and sits in row 2. Only from row 3 onwards are there enough values to “properly” calculate the smoothed value and the smoothed trend.
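Holt's method can be sketched with the initialisation described above: the first two smoothed values are copied from the series and the first trend value is their difference. The forecast uses the usual Holt form S_t + m * b_t. Series, parameter values and function names are made up for illustration.

```python
def holt(series, alpha, gamma):
    """Return (smoothed values, trend values) using Holt's method."""
    smoothed = [series[0], series[1]]        # rows 1 and 2: original values
    trend = [None, series[1] - series[0]]    # first trend value sits in row 2
    for x in series[2:]:
        s = (1 - alpha) * (smoothed[-1] + trend[-1]) + alpha * x
        b = (1 - gamma) * trend[-1] + gamma * (s - smoothed[-1])
        smoothed.append(s)
        trend.append(b)
    return smoothed, trend

def holt_forecast(smoothed, trend, m=1):
    """Forecast m periods ahead of the last smoothed value."""
    return smoothed[-1] + m * trend[-1]

# Made-up series with a steady upward trend
smoothed, trend = holt([10, 12, 14, 16, 18], alpha=0.5, gamma=0.5)
print(smoothed, trend)
print(holt_forecast(smoothed, trend, m=3))
```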
Series with a trend and seasonality
Holt-Winters Method
The Holt-Winters Method adds a third smoothing equation for seasonality as compared to Holt’s method described above. A new smoothing constant denoted β is introduced.
Seasonality is measured as the ratio between actual data and smoothed data:
$$\text{Seasonality} = \frac{\text{Actual data}}{\text{Smoothed data}}$$
Decomposition Method
This method assumes that a time series can be decomposed into four distinct elements:
- Trend
- Cycle
- Seasonality
- Random
Cycles
By choosing a suitable moving average (12 points for monthly data, 4 for quarterly) the random and seasonal elements can be smoothed away, leaving just trend and cycle.
If St is such a moving average then the ratio between St and the trend value (a + b * t) must be the cycle. If the ratio is approx. 1 for all time periods there is no cycle.
Seasonality
Seasonality is isolated by a similar approach to that for cycles. The moving average (St) comprises trend and cycle. The actual value (xt) comprises trend, cycle, seasonality and random effects. The ratio
$$\frac{\text{Actual value}}{\text{Moving average}} = \frac{x_t}{S_t}$$
should reflect the seasonality and the random effect. If the data is quarterly then the seasonality for the first quarter can be calculated as
$$\text{Average of } \frac{x_1}{S_1}, \frac{x_5}{S_5}, \frac{x_9}{S_9}, \ldots$$
The averaging helps in eliminating random effects.
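The averaging of ratios per season can be sketched as follows. It assumes the actual values and the moving averages are already aligned period by period and start in the first quarter; the data and the function name are made up.

```python
def seasonal_indices(actual, moving_avg, period=4):
    """Average ratio actual / moving average for each season of the cycle."""
    ratios = [x / s for x, s in zip(actual, moving_avg)]
    indices = []
    for season in range(period):
        season_ratios = ratios[season::period]   # e.g. quarters 1, 5, 9, ...
        indices.append(sum(season_ratios) / len(season_ratios))
    return indices

# Made-up quarterly data covering two years
actual = [120, 80, 100, 100, 132, 88, 110, 110]
moving_avg = [100, 100, 100, 100, 110, 110, 110, 110]
print([round(i, 2) for i in seasonal_indices(actual, moving_avg)])
```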
Box-Jenkins Method
The Box-Jenkins method allows for the compensation of previous errors as time goes by. To do this past residuals (forecasting errors) are incorporated into the equation. Box-Jenkins is better described as a process:
(a) Pre-whiten
(b) Identify
(c) Estimate
(d) Diagnose
(e) Forecast
Forecasting
Three methods for forecasting:
- Qualitative
- Causal modelling
- Time series methods
Qualitative Methods are based on using judgement rather than (historical) data. May be the only method when dealing with new products or new technologies.
In causal modelling the variable to be forecast is related statistically to one or more other variables. The assumption is that the relationship between the forecast variable and the related variables will hold in the future!
Regression
Simple Linear Regression
Least squares method of regression: Minimize the sum of the squared residuals. The residuals of the regression should be random.
Line is defined as: y = a + b * x
With
$$b = \frac{\sum (x - x_{\text{mean}}) \cdot (y - y_{\text{mean}})}{\sum (x - x_{\text{mean}})^2}$$
$$a = y_{\text{mean}} - b \cdot x_{\text{mean}}$$
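The two formulas translate directly into code. A minimal sketch with made-up data and a hypothetical function name:

```python
def least_squares(xs, ys):
    """Return (a, b) of the least squares line y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    a = y_mean - b * x_mean
    return a, b

# Made-up data lying exactly on y = 1 + 2 * x
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```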
Testing randomness of residuals
Runs test
Group the residuals by their sign. A “run” is a sequence of consecutive residuals with the same sign, so each change of sign starts a new run. Having too many runs is a sign of non-randomness, as is a very low number of runs. Check the upper and lower critical values from a statistical table with the number of negative and positive residuals as the parameters.
If the number of observed runs is within the upper and lower critical values the residuals appear to be random.
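Counting the runs is the mechanical part of the test and can be sketched as below. The residuals and the function name are made up, and dropping residuals that are exactly zero is an assumption of this sketch. The critical values still have to come from a statistical table.

```python
def count_runs(residuals):
    """Return (number of runs, number of positive, number of negative residuals)."""
    # Residuals of exactly zero are ignored (assumption of this sketch)
    signs = [r > 0 for r in residuals if r != 0]
    if not signs:
        return 0, 0, 0
    # A new run starts at every change of sign
    runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)
    return runs, signs.count(True), signs.count(False)

# Made-up residuals with the sign pattern + + - - + -
print(count_runs([1, 2, -1, -2, 3, -1]))
```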
Multiple Regression Analysis
The idea of the Simple Linear Regression is extended to multiple variables on the right-hand side of the equation:
y = A + B * x + C * z + D * t
These are three independent ‘x’ variables: x, z, t. Their coefficients are B, C and D while A is the constant.
Stages in multiple regression analysis
- Identify dependent and independent variables
- Examine scatter diagrams (multiple needed)
- Run regression analysis
- Calculate R² value to determine proportion of total variation explained
- Test significance using ANOVA table and F-test
- Check residuals
  o Plot residuals against fitted y-values
  o Use Runs test to check for randomness
- Check for collinearity (see “Collinearity” below)
- Use model for prediction
Discarding of variables
In multiple regression analysis not all variables will have a statistically significant impact on the result. Each variable can be tested for its effect on y. For this a t-test is used with the usual 5 stages:
(a) H0: The population coefficient for this variable is 0.
(b) The coefficient and the standard error for the variable will have to be computed.
(c) Significance level is the usual 5 percent. This is a two-sided test, hence the 5% is split into a 2.5% upper and a 2.5% lower tail.
(d) Degrees of freedom are n - k - 1 with
n = Number of observations
k = Number of x variables in the regression
The observed t value is
$$t_{\text{Obs.}} = \frac{\text{Coefficient estimate}}{\text{Standard error of coefficient}}$$
(e) If tObs. (in absolute value) exceeds t0.025 then the hypothesis is rejected and the variable does have a significant impact. If the t value is lower the variable may be eliminated from the regression equation.
Correlation
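$$\text{Total variation} = \sum (y - y_{\text{mean}})^2$$
$$\text{Explained variation} = \sum (\text{Fitted } y - y_{\text{mean}})^2$$
$$\text{Unexplained variation} = \sum (\text{Residuals})^2$$
Correlation coefficient
$$r = \frac{\sum [(x - x_{\text{mean}}) \cdot (y - y_{\text{mean}})]}{\sqrt{\sum (x - x_{\text{mean}})^2 \cdot \sum (y - y_{\text{mean}})^2}}$$
R² is the proportion of the total variation that is explained (Explained variation / Total variation). The closer R² is to 1, the more of the total variation can be explained. The closer it is to 0, the less of the total variation can be explained by the current model.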
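For a simple linear regression the correlation coefficient and R² can be computed together and compared. This is a sketch under assumptions: R² is taken as explained variation divided by total variation, r is the usual product-moment correlation coefficient, and the data and function name are made up.

```python
from math import sqrt

def correlation_and_r_squared(xs, ys):
    """Return (r, R squared) for a least squares line fitted to xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    total = sum((y - y_mean) ** 2 for y in ys)            # total variation
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    fitted = [a + b * x for x in xs]
    explained = sum((f - y_mean) ** 2 for f in fitted)    # explained variation
    r = s_xy / sqrt(s_xx * total)
    return r, explained / total

r, r_squared = correlation_and_r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 4), round(r_squared, 4))
```

For a single x variable the square of r equals R², which the made-up data confirm.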
R-bar-squared
R-bar-squared is a more sensitive measure of closeness of fit. It is based on the same ratio as R² but with an adjusted formula that allows R-bar-squared to fall when a variable unconnected to the result is added.
Collinearity
Collinearity occurs when two (or more) of the x variables are highly correlated. In this case multiple variables contribute some of the same information to the end result. To avoid collinearity problems you can:
(a) Use only one of the variables (which to use is largely subjective)
(b) Combine the variables (if the aggregate has any meaning)
(c) Substitute with another variable with a similar meaning and a low correlation
Exams
J’06
CS1: Indices, Simple aggregate index, Laspeyres, Paasche, Methodology
CS2: Difference between paired samples, t-Test, Methodology
CS3: Simple linear regression, Correlation, R², Methodology
CS4: Time series, Moving average, Exponential Smoothing, MSE
D’06
CS1: Presentation of data
CS2: Compare two sample means, F-Test, Survey methodology
CS3: Binomial distribution, Methodology
J’07
CS1: One-way analysis of variance, ANOVA
CS2: Survey methodology
CS3: Hypothesis test
CS4: Time Series, Exponential smoothing, Forecasting techniques
D’07
CS1: Linear regression, ANOVA, Analysis recommendations
CS2: Normal distribution, Characteristics of the normal distribution
CS3: Compare two sample means, Survey methodology
CS4: Time series, Moving average, Holt’s method, Methodology
J’08
CS1: Indices, Simple aggregate index, Laspeyres, Paasche, Methodology
CS2: Normal distribution, Testing a sample proportion
CS3: Chi-squared, Test of difference in proportion, Methodology
CS4: Regression, R², ANOVA, t-Test, Methodology
D’08
CS1: Poisson distribution, Methodology
CS2: Two-way analysis of variance, Methodology
CS3: Decomposition analysis, Methodology
D’09
CS1: Measures of location and range, Methodology
CS2: One-way analysis of variance, ANOVA, Methodology
CS3: t-Test, Methodology
CS4: Simple linear regression, Correlation, R², Methodology
J’10
CS1: Normal distribution, Methodology
CS2: Multiple regression analysis, Collinearity, Runs test, Methodology
CS3: Chi-squared distribution, Methodology
CS4: Time series, Moving average, Exponential Smoothing, MSE
D’10
CS1: Binomial distribution, Poisson distribution, Methodology
CS2: Holt’s method, Methodology
CS3: Two-way analysis of variance, F-test for significance, Methodology
CS4: Graphs, presentation of numbers
J’11
CS1: Indices, Simple aggregate index, Laspeyres, Paasche, Methodology
CS2: Simple linear regression, Correlation, R², Methodology
CS3: Time series, Exponential Smoothing, Methodology
CS4: Compare two sample means, F-distribution, Survey methodology
D’11
CS1: Chi-squared distribution, Methodology
CS2: Two-way analysis of variance, F-test for significance, Methodology
CS3: Forecast, Decomposition analysis, Methodology
CS4: Normal distribution, Methodology