Week 1
Exploratory Data Analysis
Practicalities
This course (ST903) is taken by students from both the MSc in Financial Mathematics and the MSc in Statistics.
Two lectures and one seminar/tutorial per week.
Exam (for the MSc in Financial Mathematics) in January, plus assessed coursework.
Aims and Objectives
What’s the course about?
1. Describing financial data
2. Modelling financial data
3. Making inferences about financial data
Samples and Populations
(Experimental) Unit: the object on which measurements are made
Population: the set of all units about which information is wanted
Sample: the set of units about which information is available
(Simple) random sample: a sample such that units in the population have equal chance of inclusion, independent of the inclusion of any other unit
Variable: a measurable characteristic of a unit
Statistic: a measurable characteristic of a sample
Parameter: a measurable characteristic of a population
Variation
Natural Variation: variation due to different units in the population having different values of the same variable
Sampling Variation: variation due to different samples containing different units and hence producing different values of the same statistic
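The two kinds of variation can be seen in a small simulation. The sketch below is illustrative only: the population (10,000 Normal draws), the sample size of 50 and the seed are all assumptions made for the example, not part of the course material.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 units, one variable measured on each.
# Natural variation: the units have different values of that variable.
population = [random.gauss(100, 15) for _ in range(10_000)]

# The population mean is a parameter.
population_mean = statistics.fmean(population)

# Sampling variation: five simple random samples (drawn without
# replacement) give five different values of the same statistic.
sample_means = [
    statistics.fmean(random.sample(population, 50))
    for _ in range(5)
]

print(f"population mean: {population_mean:.2f}")
for k, m in enumerate(sample_means, start=1):
    print(f"sample {k} mean:   {m:.2f}")
```

Each sample mean is a statistic; the values differ from sample to sample and scatter around the single fixed parameter.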
Nature and Structure of Data
Primary and Secondary Data
Primary Data: collected specifically for the current study
- Observational, e.g. survey data
- Intervention, e.g. experimental data
Secondary Data: collected and/or compiled for another purpose
- There can be limitations or problems with quality
Example: National Unemployment Data
Suppose we want the UK unemployment figures in 5-year age bands, to compare with similar figures from China collected in 2005. Published data may be insufficient because they are
- only collected from major cities
- presented in 10-year age bands
- compiled from a survey 10 years ago
Form of Data
Form of Data - Samples
- Relationship between samples
  - independent samples, e.g. unemployment figures from two countries
  - dependent samples, e.g. social class of father and son
- Structure across samples
  - unstructured, e.g. unemployment figures from two countries
  - structured, e.g. 2 × 2 factorial experiment
Form of Data - Variables
- Number of variables
  - univariate, bivariate or multivariate
- Scales of measurement
  - continuous, e.g. age
  - discrete, e.g. sex (binary), ethnic origin (unordered categorical), social class (ordered categorical)
Stages of Data Analysis
1. Exploratory data analysis using descriptive statistics
   - numerical summaries
   - tabular summaries
   - graphical summaries
2. Formal analysis using statistical techniques, often based on an assumed probability model
3. Presentation and evaluation of results
Descriptive Statistics
Numerical Summaries
- Numerical summaries
  - help to describe and compare samples
  - give information about corresponding parameters
- Qualitative data can be summarised by counts or percentages.
- Quantitative data can be summarised by measures of location, scale and shape.
Measures of Location
Averages
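For observations $x_1, \ldots, x_n$, let $x_{(j)}$ denote the $j$th smallest observation (the $j$th order statistic).
Sample mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Sample median
$$x_M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even} \end{cases} \;=\; x_{\left(\frac{n}{2}+\frac{1}{2}\right)}$$
Sample mode: the value which occurs most frequently in the sample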
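The three averages can be computed directly from these definitions. The following is a minimal sketch in plain Python; the function names and the data values are hypothetical, chosen only for illustration. Returning every tied value for the mode is a choice made here, since the definition does not say what to do with ties.

```python
from collections import Counter

def sample_mean(xs):
    """Sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Middle order statistic (n odd) or mean of the two middle ones (n even)."""
    s = sorted(xs)            # s[j - 1] is the j'th order statistic x_(j)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]
    return 0.5 * (s[n // 2 - 1] + s[n // 2])

def sample_modes(xs):
    """All values attaining the highest frequency (there may be ties)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

data = [3, 1, 4, 1, 5, 9, 2, 6]   # illustrative values, n = 8
print(sample_mean(data))          # 3.875
print(sample_median(data))        # 3.5
print(sample_modes(data))         # [1]
```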
Averages - Advantages and Disadvantages
- Sample mean
  - adv: conventional average; uses every value; convenient mathematically
  - disadv: rarely corresponds to a sample unit; influenced by outliers
- Sample median
  - advantages and disadvantages are the reverse of those of the sample mean
- Sample mode
  - often not well defined; the sample mode is often a poor estimate of the population mode
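The sensitivity of the mean to outliers, and the robustness of the median, shows up even in a tiny sample. The salary figures below are hypothetical, invented for this sketch.

```python
import statistics

salaries = [30, 32, 35, 36, 40]      # hypothetical values, in £k
with_outlier = salaries + [400]      # the same sample plus one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)       # 34.6 95.5
print(median_before, median_after)   # 35 35.5
```

One extreme value nearly triples the mean, so that it no longer resembles any unit in the sample, while the median moves by only 0.5.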
Quantiles
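Sample Lower Quartile
$$x_L = x_{\left(\frac{n}{4}+\frac{1}{2}\right)}$$
Sample Upper Quartile
$$x_U = x_{\left(\frac{3n}{4}+\frac{1}{2}\right)}$$
$p$th Sample Percentile
$$x_{p\%} = x_{\left(\frac{pn}{100}+\frac{1}{2}\right)}$$
Five Number Summary
$$x_{(1)},\ x_L,\ x_M,\ x_U,\ x_{(n)}$$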
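A sketch of how these quantile rules might be coded. It assumes that a fractional index is read as linear interpolation between the two neighbouring order statistics, which agrees with the even-n median formula. Clamping the index to the range 1 to n is an extra assumption made here so that extreme percentiles stay defined. All names and data are hypothetical.

```python
import math

def order_stat(sorted_xs, k):
    """x_(k) for a possibly fractional 1-based index k (linear interpolation)."""
    n = len(sorted_xs)
    k = min(max(k, 1), n)            # assumption: clamp the index to 1..n
    lo = math.floor(k)
    hi = math.ceil(k)
    frac = k - lo
    return (1 - frac) * sorted_xs[lo - 1] + frac * sorted_xs[hi - 1]

def percentile(xs, p):
    """p'th sample percentile, p in [0, 100], using index p*n/100 + 1/2."""
    s = sorted(xs)
    return order_stat(s, p * len(s) / 100 + 0.5)

def five_number_summary(xs):
    """(minimum, lower quartile, median, upper quartile, maximum)."""
    s = sorted(xs)
    return (s[0], percentile(s, 25), percentile(s, 50), percentile(s, 75), s[-1])

data = [2, 4, 4, 5, 7, 8, 9, 10]     # illustrative values, n = 8
print(five_number_summary(data))     # (2, 4.0, 6.0, 8.5, 10)
```

With n = 8 the lower quartile has index 2.5, so it lies halfway between the 2nd and 3rd order statistics. Statistical software uses several slightly different quantile conventions, so library results may not match this rule exactly.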
Measures of Scale
Sample Variance
$$\mathrm{Var}(x) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n - 1}$$
Sample Standard Deviation (SD)
$$\sqrt{\mathrm{Var}(x)}$$
Inter-Quartile Range (IQR)
$$x_U - x_L$$
Sample Range
$$x_{(n)} - x_{(1)}$$
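The four measures of scale can be sketched as follows. The quartile helper uses the index q·n/4 + 1/2 with linear interpolation for fractional indices, which is an assumption about how that index is meant to be read; the function names and data are hypothetical.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(xs))

def quartile(xs, q):
    """Lower (q=1) or upper (q=3) quartile via the index q*n/4 + 1/2."""
    s = sorted(xs)
    n = len(s)
    k = q * n / 4 + 0.5
    lo = min(max(math.floor(k), 1), n)   # clamp to valid positions 1..n
    hi = min(max(math.ceil(k), 1), n)
    frac = k - math.floor(k)
    return (1 - frac) * s[lo - 1] + frac * s[hi - 1]

def iqr(xs):
    """Inter-quartile range: upper quartile minus lower quartile."""
    return quartile(xs, 3) - quartile(xs, 1)

def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)

data = [2, 4, 4, 5, 7, 8, 9, 10]         # illustrative values
print(sample_variance(data), sample_sd(data))
print(iqr(data), sample_range(data))     # 4.5 8
```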
Measures of Scale - Advantages and Disadvantages
- Variance
  - similar adv/disadv to mean
- SD
  - in the same units as the data - useful for interpretation
- IQR
  - robust measure
- Sample Range
  - sensitive to outliers, sampling variability and data errors
Measures of Shape
Modality: number of peaks in the sample distribution
Skewness: a statistic measuring symmetry such that
- 0 ⇒ symmetric sample distribution
- positive ⇒ skewed to the right (long right-hand tail)
- negative ⇒ skewed to the left (long left-hand tail)
Kurtosis: a statistic measuring peakedness such that
- 3 ⇒ same peakedness as the Normal distribution (mesokurtic)
- > 3 ⇒ more peaked - slim or long-tailed (leptokurtic)
- < 3 ⇒ less peaked - flat, fat or short-tailed (platykurtic)
Sometimes adjusted to give 0 for mesokurtic distributions.
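The notes do not give formulas for these statistics. One common choice, assumed here, is the moment-based definition: skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2, where mk is the k'th central sample moment with divisor n. The sketch below uses that definition; other conventions (for example with bias corrections) give slightly different values.

```python
def central_moment(xs, k):
    """k'th central sample moment, using divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n

def skewness(xs):
    """Assumed definition: m3 / m2**1.5 (0 for a symmetric sample)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Assumed definition: m4 / m2**2 (3 for the Normal distribution)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

def excess_kurtosis(xs):
    """Adjusted version that gives 0 for mesokurtic distributions."""
    return kurtosis(xs) - 3

symmetric = [1, 2, 3, 4, 5]             # illustrative values
right_skewed = [1, 1, 2, 2, 3, 10]      # one long right-hand tail value

print(skewness(symmetric))              # 0.0
print(skewness(right_skewed) > 0)       # True
print(kurtosis(symmetric) < 3)          # True: flatter than the Normal
```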
Skewness and Kurtosis
[Figure: two panels of density curves, f(x) against x, for x from −4 to 4.]
Measure of Linear Relation Between Two Variables
Correlation Coefficient
For observations $x_1, \ldots, x_n$; $y_1, \ldots, y_n$ of two variables $X$ and $Y$
Correlation Coefficient
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
- measure of linear relationship
- correlation does not imply causation
  - the two variables may be linked via a third variable
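A minimal sketch of the correlation coefficient, written directly from the formula for r. The data are hypothetical and chosen to show that r picks up linear relationships only: the second pair of variables is perfectly related, but through a curve, and r is zero.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_linear = [2 * v + 1 for v in x]       # exact linear relationship
y_curved = [(v - 3) ** 2 for v in x]    # exact, but non-linear, relationship

print(correlation(x, y_linear))         # 1.0
print(correlation(x, y_curved))         # 0.0
```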
Tabular Summaries
- Provide succinct display of data set
- Emphasise the structure of the data
- Sometimes more powerful than a graph, or may provide record of graphed data
- Things to consider
  - included data
  - layout (dimensions, ordering, totals)
  - representation of numbers (units, significant figures, percentages)
Example: Society of Business Economists Salary Survey
                 Per cent of responses     Median salaries (£k)*
Age (years)      2004    2003    1999      2004     2003     1999
30 & under         21      12      17      35.0     30.5     30.0
31 – 35            12      17      15      45.7     46.8     45.0
36 – 40            17      14       8      65.0     73.0     45.0
41 – 45             8       9      17      63.0    100.0     65.0
46 – 50            18      15      18      82.3     90.0     60.0
51 – 55            10      16      17      59.0     55.5     55.0
Over 55            14      17       8      75.0     55.0     50.0
Men                84      80      91      60.0     60.0     52.0
Women              16      20       9      52.5     45.5     30.0
*Including any London/regional allowance and self-employment income
Source: http://www.sbe.co.uk/survey/salary_survey_2004.pdf
Graphical Summaries
Graphical summaries are useful for
- providing an overall picture of the data
- exploring relationships, e.g. comparing groups, exploring trends over time
- checking assumptions underlying methods of formal analysis
- checking for problems with the data, e.g. outliers
Graphical Summaries for Qualitative Data
- Pie Charts
  - area of slices proportional to frequency - misleading to compare pie charts of different area or based on different sample sizes
  - limited accuracy - rounding can be misleading
  - hard to read with large number of segments
- Bar Charts
  - height of bars proportional to frequency - more intuitive
  - bars can be segmented to show component parts
Example: Shares of National Income
Source: Survey of Current Business (2006) 86(1), http://bea.gov/bea/pub/0106cont.htm
[Figure: shares of national income in 1959 and 2004, on a scale from 0 to 100, by component: Wages and salary accruals; Supplements to wages & salaries; Proprietors' income; Rental income of persons; Corporate profits; Net interest & misc. payments; Taxes on production & imports; Other.]
Graphical Summaries for Quantitative Data
Stem-and-Leaf Plots
- Tallies data in bins, using the values themselves for display
E.g. Times (in hours) to first failure of air-conditioning unit on Boeing 720, different transformations
hours × 10         √hours             log10 hours
1 | 04             3 | 27             1.0 | 0
2 | 03455679       4 | 589            1.1 | 5
3 | 2357           5 | 00124779       1.2 |
4 | 4469           6 | 1668           1.3 | 068
5 | 369            7 | 035778         1.4 | 00136
6 | 01             8 | 779            1.5 | 1247
7 | 569            9 | 25             1.6 | 4469
8 | 4                                 1.7 | 25789
9 | 0                                 1.8 | 88
                                      1.9 | 025
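A stem-and-leaf display is simple to build by hand or in code. The sketch below is one hypothetical implementation for non-negative values: each value is truncated to a whole number of leaf units, the last digit becomes the leaf and the remaining digits the stem. The failure times are read back from the first display (hours × 10), so they are accurate only to the leaf digit.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the display as a list of 'stem | leaves' lines.

    Assumes non-negative values. Each value is truncated to a whole
    number of leaf units; the last digit is the leaf, the rest the stem.
    """
    scaled = sorted(int(v // leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        leaves[s // 10].append(s % 10)
    lowest, highest = min(leaves), max(leaves)
    # Include empty stems so gaps in the data remain visible.
    return [
        f"{stem} | {''.join(str(d) for d in leaves[stem])}"
        for stem in range(lowest, highest + 1)
    ]

# Failure times in hours, read back from the first display above.
hours = [10, 14, 20, 23, 24, 25, 25, 26, 27, 29, 32, 33, 35, 37,
         44, 44, 46, 49, 53, 56, 59, 60, 61, 75, 76, 79, 84, 90]

lines = stem_and_leaf(hours)
for line in lines:
    print(line)
```

Run on these values, the function reproduces the first column of the display; passing `leaf_unit=10` would instead use hundreds as stems and tens as leaves.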
E.g. Carbon-dating fragments of a pre-historic artefact, different scales
1000 years               100 years            100 years, split
                                              (* = 0-4, . = 5-9)
4 | 99999999             48 | 89              48. | 89
5 | 0000000001111122     49 | 0111337778      49* | 011133
                         50 | 11235588        49. | 7778
                         51 | 2456            50* | 1123
                                              50. | 5588
                                              51* | 24
                                              51. | 56
Box Plots
- Represent five number summaries diagrammatically.
- Most software produces truncated box plots, whose whiskers exclude outliers; the outliers are usually plotted as isolated points.
E.g. Inflation rates over a 20-year period for five countries
[Figure: box plots of inflation rates for USA, UK, Japan, Germany and France; vertical axis from 5 to 25.]
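The notes do not say how software decides which points count as outliers. A common convention, assumed in this sketch, is to flag values more than 1.5 × IQR beyond the quartiles and to draw the whiskers to the most extreme values that are not flagged. The quartiles come from Python's `statistics.quantiles`, whose default rule may differ slightly from the n/4 + 1/2 index rule; the inflation figures are hypothetical.

```python
import statistics

def box_plot_stats(xs, k=1.5):
    """Quantities needed to draw a truncated box plot.

    Assumption: a value is an outlier if it lies more than k * IQR
    below the lower quartile or above the upper quartile.
    """
    s = sorted(xs)
    q1, median, q3 = statistics.quantiles(s, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    outliers = [x for x in s if x < low_fence or x > high_fence]
    return {
        "whisker_low": inside[0],      # smallest value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],    # largest value that is not an outlier
        "outliers": outliers,          # plotted as isolated points
    }

# Hypothetical annual inflation rates (%), one unusually high year.
rates = [2.1, 2.5, 3.0, 3.2, 3.8, 4.1, 4.4, 5.0, 12.5]
stats = box_plot_stats(rates)
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])
```

Here the upper whisker stops at 5.0 and the value 12.5 is reported separately, which is the behaviour described for truncated box plots.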
Histogram
- Equivalent of bar chart for binned continuous data
- Area of bars proportional to frequency in each bin - usually choose equal bin widths so height proportional to frequency
E.g. GDP per capita for 26 countries
[Figure: histogram of GDP per capita ($), horizontal axis from 0 to 30000; vertical axis Frequency, from 0 to 10.]
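The area rule matters once bin widths differ. In the sketch below, bar height is taken as count divided by bin width, so that area equals the count; with equal widths the heights are then simply proportional to the counts. The function and the GDP values are hypothetical, for illustration only.

```python
def histogram(values, edges):
    """Counts and bar heights for bins [edges[i], edges[i+1]).

    The last bin also includes its right-hand edge. Height is
    count / width, so each bar's area equals its frequency.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    heights = [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]
    return counts, heights

# Hypothetical GDP per capita values, in $.
gdp = [800, 1200, 2500, 4000, 4800, 7000, 12000, 18000, 26000]

equal_counts, equal_heights = histogram(gdp, [0, 10000, 20000, 30000])
uneq_counts, uneq_heights = histogram(gdp, [0, 5000, 10000, 30000])

print(equal_counts)   # [6, 2, 1]
print(uneq_counts)    # [5, 1, 3]
print(uneq_heights)   # the wide last bin gets a low bar despite 3 values
```

With the unequal bins, plotting raw counts would make the wide last bin look more prominent than it should; dividing by width keeps area proportional to frequency.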
Graphical Summaries of Distribution for Quantitative Data
- Stem-and-leaf
  - adv: good for small data sets - shows all of the data
  - disadv: choice of bins affects display
- Box plot
  - adv: simple, can split by group, almost any sample size will do
  - disadv: can be too simple, e.g. no good for multi-modal data
- Histogram
  - adv: good for large data sets, shows all characteristics of distribution
  - disadv: choice of bins affects display
Scatterplot
- Plot of data points in 2-D or 3-D space with variables as axes
- Useful for exploring relationships between variables
E.g. Standard & Poor's (S&P) index of 500 common stock prices against the Consumer Price Index (CPI) for 1978-1989
[Figure: scatterplot of the SP500 Index (vertical axis, 100 to 300) against CPI (horizontal axis, 70 to 120).]
Time Series
- Plot of data against time
- Look for seasonality, unusual events, etc.
E.g. Quarterly personal consumption expenditure (PCE) from 1977-1980 (AUS$)
[Figure: time series plot of PCE (vertical axis, 20 to 30) against Time, 1977 to 1980.]