Week 1. Exploratory Data Analysis

Academic year: 2022

Practicalities

This course (ST903) has students from both the MSc in Financial Mathematics and the MSc in Statistics.

Two lectures and one seminar/tutorial per week.

Exam (for the MSc in Financial Mathematics) in January, plus assessed coursework.

Aims and Objectives

What’s the course about?

1. Describing financial data
2. Modelling financial data
3. Making inferences about financial data

Samples and Populations

(Experimental) Unit: the object on which measurements are made

Population: the set of all units about which information is wanted

Sample: the set of units about which information is available

(Simple) random sample: a sample such that all units in the population have an equal chance of inclusion, independent of the inclusion of any other unit

Variable: a measurable characteristic of a unit

Statistic: a measurable characteristic of a sample

Parameter: a measurable characteristic of a population

Variation

Natural variation: variation due to different units in the population having different values of the same variable

Sampling variation: variation due to different samples containing different units and hence producing different values of the same statistic
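The distinction between the two kinds of variation can be illustrated with a small simulation. This is only a sketch: the population (1,000 simulated values), the sample size of 25 and the seed are assumptions made for illustration and do not come from the course.

```python
import random
import statistics

# Hypothetical population: 1,000 units, one variable per unit (e.g. a return in %).
rng = random.Random(903)
population = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Parameter: a characteristic of the population.
population_mean = statistics.mean(population)

# Natural variation: units in the population take different values.
population_sd = statistics.pstdev(population)

# Sampling variation: different simple random samples (drawn without
# replacement) contain different units, so the statistic changes.
sample_means = [statistics.mean(rng.sample(population, 25)) for _ in range(200)]

print(f"population mean (parameter):      {population_mean:.3f}")
print(f"population SD (natural variation): {population_sd:.3f}")
print(f"SD of the 200 sample means:        {statistics.stdev(sample_means):.3f}")
```

The sample means vary from sample to sample, but much less than individual units vary within the population.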

Nature and Structure of Data

Primary and Secondary Data

Primary data are collected specifically for the current study
- Observational, e.g. survey data
- Intervention, e.g. experimental data

Secondary data are collected and/or compiled for another purpose
- there can be limitations or problems with quality

Example: National Unemployment Data

Suppose we want to know the UK unemployment figures in 5-year age bands to compare with similar figures from China, collected in 2005. Published data may be insufficient because they are
- only collected from major cities
- presented as unemployment numbers in 10-year age bands
- compiled from a survey 10 years ago

Form of Data - Samples

- Relationship between samples
  - independent samples, e.g. unemployment figures from two countries
  - dependent samples, e.g. social class of father and son
- Structure across samples
  - unstructured, e.g. unemployment figures from two countries
  - structured, e.g. 2 × 2 factorial experiment

Form of Data - Variables

- Number of variables
  - univariate, bivariate or multivariate
- Scales of measurement
  - continuous, e.g. age
  - discrete, e.g. sex (binary), ethnic origin (unordered categorical), social class (ordered categorical)

Stages of Data Analysis

1. Exploratory data analysis using descriptive statistics
   - numerical summaries
   - tabular summaries
   - graphical summaries
2. Formal analysis using statistical techniques, often based on an assumed probability model
3. Presentation and evaluation of results

Descriptive Statistics

Numerical Summaries

- Numerical summaries
  - help to describe and compare samples
  - give information about corresponding parameters
- Qualitative data can be summarised by counts or percentages.
- Quantitative data can be summarised by measures of location, scale and shape.
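As a minimal sketch of summarising qualitative data by counts and percentages, the following uses a made-up sample of ten company sectors; the data and category names are hypothetical.

```python
from collections import Counter

# Hypothetical qualitative sample: sector of ten companies.
sectors = ["bank", "tech", "bank", "energy", "tech",
           "bank", "retail", "tech", "bank", "energy"]

counts = Counter(sectors)  # frequency of each category
n = len(sectors)
percentages = {category: 100 * count / n for category, count in counts.items()}

# Print categories from most to least frequent.
for category, count in counts.most_common():
    print(f"{category:<8}{count:>3}{percentages[category]:>7.1f}%")
```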

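Measures of Location

Averages

For observations $x_1, \ldots, x_n$, let $x_{(j)}$ denote the $j$th smallest observation (the $j$th order statistic).

Sample mean

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Sample median

\[ x_M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{if } n \text{ is even} \end{cases} \;=\; x_{\left(\frac{n}{2}+\frac{1}{2}\right)} \]

Here a half-integer index $k + \frac{1}{2}$ is read as the average of $x_{(k)}$ and $x_{(k+1)}$.

Sample mode: the value which occurs most frequently in the sample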
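The three averages can be sketched directly from their definitions. The function names and the data below are illustrative only; the mode is returned as a list because more than one value may be tied for most frequent.

```python
from collections import Counter

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_median(xs):
    ordered = sorted(xs)  # ordered[j - 1] is the j'th order statistic x_(j)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]
    # n even: average the two middle order statistics
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])

def sample_modes(xs):
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

data = [2, 3, 3, 5, 8, 13]  # hypothetical sample
print(sample_mean(data), sample_median(data), sample_modes(data))
```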

Averages - Advantages and Disadvantages

- Sample mean
  - adv: conventional average; uses every value; convenient mathematically
  - disadv: rarely corresponds to a sample unit; influenced by outliers
- Sample median
  - reverse adv/disadv of the sample mean
- Sample mode
  - often not well defined; sample values are often poor estimates of the population value

Quantiles

Sample lower quartile

\[ x_L = x_{\left(\frac{n}{4}+\frac{1}{2}\right)} \]

Sample upper quartile

\[ x_U = x_{\left(\frac{3n}{4}+\frac{1}{2}\right)} \]

$p$th sample percentile

\[ x_{p\%} = x_{\left(\frac{pn}{100}+\frac{1}{2}\right)} \]

Five number summary

\[ x_{(1)},\; x_L,\; x_M,\; x_U,\; x_{(n)} \]
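A possible implementation of these quantiles is sketched below, reading the percentile index as pn/100 + 1/2 with p in per cent. The slides do not say how a fractional index should be handled, so the linear interpolation between neighbouring order statistics is an assumption (it reproduces the median rule for even n); other software uses other conventions.

```python
def order_statistic(ordered, k):
    """Return x_(k) for a 1-based, possibly fractional, index k.

    Fractional indices are resolved by linear interpolation (an assumption).
    """
    k = min(max(k, 1), len(ordered))  # clamp the index to 1..n
    lower = int(k)                    # integer part of the index
    frac = k - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def percentile(xs, p):
    """p'th sample percentile, using the index p*n/100 + 1/2."""
    ordered = sorted(xs)
    return order_statistic(ordered, p * len(ordered) / 100 + 0.5)

def five_number_summary(xs):
    return (min(xs), percentile(xs, 25), percentile(xs, 50),
            percentile(xs, 75), max(xs))

data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical sample, n = 8
print(five_number_summary(data))  # (1, 2.5, 4.5, 6.5, 8)
```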

Measures of Scale

Sample variance

\[ \mathrm{Var}(x) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n - 1} \]

Sample standard deviation (SD)

\[ \sqrt{\mathrm{Var}(x)} \]

Inter-quartile range (IQR)

\[ x_U - x_L \]

Sample range

\[ x_{(n)} - x_{(1)} \]
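The four measures of scale can be computed as follows. This is a sketch on hypothetical data; the quartile helper assumes linear interpolation at fractional order-statistic indices, which the slides do not specify.

```python
import math

def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)  # divisor n - 1

def quartile(ordered, fraction):
    # 1-based index fraction*n + 1/2, with linear interpolation (an assumption)
    k = min(max(fraction * len(ordered) + 0.5, 1), len(ordered))
    lower = int(k)
    frac = k - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
ordered = sorted(data)

variance = sample_variance(data)
sd = math.sqrt(variance)                                  # same units as the data
iqr = quartile(ordered, 0.75) - quartile(ordered, 0.25)   # x_U - x_L
sample_range = ordered[-1] - ordered[0]                   # x_(n) - x_(1)

print(variance, sd, iqr, sample_range)
```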

Measures of Scale - Advantages and Disadvantages

- Variance
  - similar adv/disadv to mean
- SD
  - in the same units as the data - useful for interpretation
- IQR
  - robust measure
- Sample range
  - sensitive to outliers, sampling variability and data errors

Measures of Shape

Modality: number of peaks in the sample distribution

Skewness: a statistic measuring symmetry such that
- 0 ⇒ symmetric sample distribution
- +ve ⇒ skewed to the right (long right-hand tail)
- -ve ⇒ skewed to the left (long left-hand tail)

Kurtosis: a statistic measuring peakedness such that
- 3 ⇒ same peakedness as the Normal distribution (mesokurtic)
- > 3 ⇒ more peaked - slim or long-tailed (leptokurtic)
- < 3 ⇒ less peaked - flat, fat or short-tailed (platykurtic)

Sometimes adjusted to give 0 for mesokurtic distributions.
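The slides do not give formulas for these statistics. One common choice, assumed here, is the moment-based pair: skewness m3 / m2^(3/2) and kurtosis m4 / m2^2, where m_k is the k'th central sample moment with divisor n. Software packages often apply small-sample corrections, so their values may differ slightly from this sketch.

```python
def central_moment(xs, k):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

def skewness(xs):
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs, excess=False):
    value = central_moment(xs, 4) / central_moment(xs, 2) ** 2
    # excess=True gives the adjusted version that is 0 for mesokurtic data
    return value - 3 if excess else value

symmetric = [1, 2, 3, 4, 5]           # hypothetical data
right_skewed = [1, 1, 2, 2, 3, 10]    # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(symmetric, excess=True))
```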

Skewness and Kurtosis

[Figure: two plots of f(x) against x, for x from −4 to 4, illustrating skewness and kurtosis]

Measure of Linear Relation Between Two Variables

Correlation Coefficient

For observations $x_1, \ldots, x_n$; $y_1, \ldots, y_n$ of two variables $X$ and $Y$, the correlation coefficient is

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

- measure of linear relationship
- correlation does not imply cause
- may be linked via third variable
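A direct transcription of the formula for r is sketched below on hypothetical data. The second example shows why r is described as a measure of linear relationship: y is an exact function of x, yet r is zero.

```python
import math

def correlation(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]   # exact linear relation: r = 1
quadratic = [v ** 2 for v in x]   # exact but non-linear relation: r = 0

print(correlation(x, linear), correlation(x, quadratic))
```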

Tabular Summaries

- Provide succinct display of data set
- Emphasise the structure of the data
- Sometimes more powerful than a graph, or may provide record of graphed data
- Things to consider
  - included data
  - layout (dimensions, ordering, totals)
  - representation of numbers (units, significant figures, percentages)

Example: Society of Business Economists Salary Survey

                 Per cent of responses      Median salaries (£k)*
Age (years)      2004    2003    1999       2004    2003    1999
30 & under         21      12      17       35.0    30.5    30.0
31 – 35            12      17      15       45.7    46.8    45.0
36 – 40            17      14       8       65.0    73.0    45.0
41 – 45             8       9      17       63.0   100.0    65.0
46 – 50            18      15      18       82.3    90.0    60.0
51 – 55            10      16      17       59.0    55.5    55.0
Over 55            14      17       8       75.0    55.0    50.0

Men                84      80      91       60.0    60.0    52.0
Women              16      20       9       52.5    45.5    30.0

*Including any London/regional allowance and self-employment income

Source: http://www.sbe.co.uk/survey/salary_survey_2004.pdf

Graphical Summaries

Graphical summaries are useful for
- providing an overall picture of the data
- exploring relationships, e.g. comparing groups, exploring trends over time
- checking assumptions underlying methods of formal analysis
- checking for problems with the data, e.g. outliers

Graphical Summaries for Qualitative Data

- Pie Charts
  - area of slices proportional to frequency - misleading to compare pie charts of different area or based on different sample sizes
  - limited accuracy - rounding can be misleading
  - hard to read with large number of segments
- Bar Charts
  - height of bars proportional to frequency - more intuitive
  - bars can be segmented to show component parts

Example: Shares of National Income

Source: Survey of Current Business (2006) 86(1), http://bea.gov/bea/pub/0106cont.htm

[Figure: segmented bar chart of shares of national income for 1959 and 2004, on a scale from 0 to 100. Segments: Other; Taxes on production & imports; Net interest & misc. payments; Corporate profits; Rental income of persons; Proprietors' income; Supplements to wages & salaries; Wages and salary accruals]

Graphical Summaries for Quantitative Data

Stem-and-Leaf Plots

- Tallies data in bins, using the values themselves for display

E.g. times (in hours) to first failure of the air-conditioning unit on a Boeing 720, under different transformations

hours            sqrt(hours) × 10     log10(hours)

1 | 04           3 | 27               1.0 | 0
2 | 03455679     4 | 589              1.1 | 5
3 | 2357         5 | 00124779         1.2 |
4 | 4469         6 | 1668             1.3 | 068
5 | 369          7 | 035778           1.4 | 00136
6 | 01           8 | 779              1.5 | 1247
7 | 569          9 | 25               1.6 | 4469
8 | 4                                 1.7 | 25789
9 | 0                                 1.8 | 88
                                      1.9 | 025
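A stem-and-leaf display can be built in a few lines. The sketch below is one possible construction for non-negative data, with the last retained digit as the leaf; the function name and the `leaf_unit` parameter are illustrative. The ten values used are the ones shown in the first two rows of the hours display.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the display as a list of lines.

    Each value is truncated to a whole number of leaf units; the last digit
    is the leaf and the remaining digits form the stem. Non-negative data only.
    """
    scaled = sorted(int(v // leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        leaves[s // 10].append(s % 10)
    lines = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        lines.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return lines

hours = [10, 14, 20, 23, 24, 25, 25, 26, 27, 29]
print("\n".join(stem_and_leaf(hours)))
```

Changing `leaf_unit` (for example to 10 or 100) changes the bins, which is the main choice affecting how the display looks.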

Stem-and-Leaf Plots

E.g. carbon-dating fragments of a pre-historic artefact, different scales

1000 years              100 years            100 years, split
                                             (* = 0-4, . = 5-9)

4 | 99999999            48 | 89              48. | 89
5 | 0000000001111122    49 | 0111337778      49* | 011133
                        50 | 11235588        49. | 7778
                        51 | 2456            50* | 1123
                                             50. | 5588
                                             51* | 24
                                             51. | 56

Box Plots

- Represent five number summaries diagrammatically.
- Most software produces truncated box plots, which exclude outliers; these are usually plotted as isolated points.

E.g. inflation rates over a 20-year period for five countries

[Figure: box plots of inflation rates for the USA, UK, Japan, Germany and France; vertical axis from 5 to 25]
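The slides do not state how software decides which points count as outliers. A widely used convention, assumed in this sketch, flags points more than 1.5 × IQR beyond the quartiles and draws the whiskers to the most extreme points inside those fences. The quartile rule with linear interpolation and the inflation figures are also assumptions for illustration.

```python
def order_statistic(ordered, k):
    """x_(k) for a 1-based, possibly fractional index k (linear interpolation)."""
    k = min(max(k, 1), len(ordered))
    lower = int(k)
    frac = k - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def truncated_box_plot(xs, k=1.5):
    ordered = sorted(xs)
    n = len(ordered)
    lower_q = order_statistic(ordered, n / 4 + 0.5)
    median = order_statistic(ordered, n / 2 + 0.5)
    upper_q = order_statistic(ordered, 3 * n / 4 + 0.5)
    iqr = upper_q - lower_q
    low_fence, high_fence = lower_q - k * iqr, upper_q + k * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "whisker_low": inside[0],    # most extreme points within the fences
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "whisker_high": inside[-1],
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }

inflation = [2.1, 2.4, 2.8, 3.0, 3.3, 3.5, 4.1, 12.6]  # hypothetical rates (%)
result = truncated_box_plot(inflation)
print(result)
```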

Histogram

- Equivalent of a bar chart for binned continuous data
- Area of bars proportional to frequency in each bin; bin widths are usually chosen to be equal, so that height is proportional to frequency

E.g. GDP per capita for 26 countries

[Figure: histogram of GDP per capita ($), from 0 to 30,000, against frequency, from 0 to 10]
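The binning step behind a histogram can be sketched as follows, using equal-width bins so that bar height is proportional to frequency. The GDP figures are made up for illustration, and the bin edges (0 to 30,000 in steps of 5,000) are an assumption.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + i*width, start + (i+1)*width).

    The right-hand endpoint of the last bin is included in that bin;
    values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if v == start + n_bins * width:
            i = n_bins - 1
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical GDP per capita figures ($) for ten countries.
gdp = [800, 1200, 2500, 4300, 4900, 7600, 12000, 18500, 21000, 27500]
counts = histogram_counts(gdp, start=0, width=5000, n_bins=6)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    print(f"{i * 5000:>6} to {(i + 1) * 5000:>6} | {'#' * count}")
```

Re-running with a different `width` shows how the choice of bins changes the appearance of the display.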

Graphical Summaries of Distribution for Quantitative Data

- Stem-and-leaf
  - adv: good for small data sets - shows all of the data
  - disadv: choice of bins affects display
- Box plot
  - adv: simple, can split by group, almost any sample size will do
  - disadv: can be too simple, e.g. no good for multi-modal data
- Histogram
  - adv: good for large data sets, shows all characteristics of distribution
  - disadv: choice of bins affects display

Scatterplot

- Plot of data points in 2-D or 3-D space with variables as axes
- Useful for exploring relationships between variables

E.g. Standard & Poor's (S&P) index of 500 common stock prices against the Consumer Price Index (CPI) for 1978-1989

[Figure: scatterplot of the S&P 500 index (100 to 300) against CPI (70 to 120)]

Time Series

- Plot of data against time
- Look for seasonality, unusual events, etc.

E.g. quarterly personal consumption expenditure (PCE) from 1977-1980 (AUS$)

[Figure: time series plot of PCE (20 to 30) against time, 1977 to 1980]
