Exploratory data analysis
Dr. David Lucy
Lancaster University
Exploratory data analysis – p.1/36
Graphs
Graphics are a very important part of making sense of data:
• Allow the researcher to compare quantities easily and
simply by comparing lengths and/or areas.
• Many humans are better adapted to judging quantities
visually than to reading numbers.
• Immediate impact.
• Suggests ideas for further work.
For many scientists this is their main form of analysis; some of the world's best science has been done purely with graphs.
Graphs
A graph is a suitable way of representing data if:
• A line or area can represent the quantities in the data
in some way.
• Several “standard” forms can be used.
“Standard” forms are not the only forms. You can make up
your own if you please.
R is very good for this.
Standard forms
There are three “standard” types of graph:
1. histogram - used to examine the distribution of a set of
observations; it can also be used to compare distributions between sets of observations. Observations may be discrete (with an underlying continuous scale) or continuous,
2. scatterplot - used to look for relationships between
different continuous variables,
3. boxplot - sometimes called a box-and-whiskers plot; used
to compare distributions of a continuous variable across groups, which is equivalent to looking for relationships between factors and continuous variables.
Histograms
[Figure: histograms of run length (x-axis, 0 to 5) against count (y-axis, 0 to 60); panels: Partial and Full.]
Histograms
Do not confuse histograms with barcharts:
• Histograms have the area of each column proportional to the quantity
of interest:
• not necessarily equal column widths,
• although most are.
• Barcharts have the column height proportional to the
quantity of interest.
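The area rule for histograms can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_heights`, the data and the bin edges are all invented here, and the bins are assumed to be half-open except for the last one.

```python
def histogram_heights(values, edges):
    """Column heights such that each column's AREA equals the
    proportion of values falling in that bin.

    Bins are [edges[k], edges[k+1]), except the last bin, which
    also includes its right-hand edge.
    """
    n = len(values)
    last = len(edges) - 2
    heights = []
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        count = sum(1 for v in values
                    if lo <= v < hi or (k == last and v == hi))
        # height = proportion / width, so that height * width = proportion
        heights.append(count / (n * (hi - lo)))
    return heights


# Hypothetical data with deliberately unequal bin widths (1, 1 and 4).
values = [0.5, 0.7, 1.2, 1.5, 1.8, 2.5, 3.0, 4.5, 5.5, 6.0]
edges = [0, 1, 2, 6]

heights = histogram_heights(values, edges)
areas = [h * (edges[k + 1] - edges[k]) for k, h in enumerate(heights)]
```

In this made-up example the widest bin holds half of the observations yet has the lowest column, because its area rather than its height carries the proportion. A barchart of the same counts would instead draw that column tallest.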
Exercise 3.2.2
Rescale the full histogram so that the bars sum to one.
Run length       0   1   2   3   4   5
Number of runs  71  28   5   2   2   1
Exercise 3.2.2
The frequencies sum to Σ = 109, so divide each frequency by 109:
Run length                            0     1     2     3     4     5
Number of runs (109 observations)    71    28     5     2     2     1
Normalised                         0.65  0.26  0.05  0.02  0.02  0.01
This process is sometimes called normalisation by scientists.
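The same arithmetic can be checked with a short Python sketch. The counts are the ones from the exercise; the variable names are chosen here for illustration.

```python
run_lengths = [0, 1, 2, 3, 4, 5]
counts = [71, 28, 5, 2, 2, 1]

total = sum(counts)                        # total number of observations
proportions = [c / total for c in counts]  # exact proportions, summing to one
rounded = [round(p, 2) for p in proportions]

for length, p in zip(run_lengths, rounded):
    print(length, p)
```

The unrounded proportions sum to one. The two-decimal values shown in the table sum to 1.01, which is purely a rounding effect.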
Scaled histograms
[Figure: scaled histograms of run length (x-axis, 0 to 5); panels: Partial and Full; vertical axis from 0.0 to 0.6.]
Histogram comparison
[Figure: density histograms of daily max ozone; panels: Leeds and Ladybower Reservoir.]
Histogram comparison
[Figure: density histogram of daily max ozone; panel title: Differences.]
Histogram problems
[Figure: two density histograms of the Leeds summer daily maxima.]
Kernel density estimates
[Figure: kernel density estimates of ozone (ppb) for Ladybower and Leeds; y-axis: density.]
Kernel density estimates
[Figure: probability density plotted against x.]
Cumulative distributions
Recall the cumulative distribution function (c.d.f.) of a random variable $X$:
$$F(x) = P(X \le x)$$
How can we estimate this from a finite number of observations?
Cumulative distributions
Let us assume
• That our variables $X_1, \ldots, X_n$ are independent and
identically distributed (i.i.d.).
• They are replicates of a random variable $X$ which has
cumulative distribution function $F$.
• We can denote by $x_1, \ldots, x_n$ the observed values of
$X_1, \ldots, X_n$.
Cumulative distributions
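The empirical cumulative distribution function (c.d.f.) is
defined as:
$$\tilde{F}(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{\sum_{i=1}^{n} \Pi(x_i \le x)}{n}$$
where:
$$\Pi(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$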
Cumulative distributions
The empirical c.d.f. is a proper distribution function and has the following properties:
• $\tilde{F}(x)$ is a step function with jumps at the data points;
• $\tilde{F}(x) = 1$ if $x \ge \max(x_1, \ldots, x_n)$;
• $\tilde{F}(x) = 0$ if $x < \min(x_1, \ldots, x_n)$.
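The definition and its properties can be illustrated with a minimal Python sketch. The function name `ecdf` and the observations below are invented for illustration. The function counts the observations that are at most x and divides by n.

```python
def ecdf(observations, x):
    """Empirical c.d.f.: the proportion of observations <= x."""
    n = len(observations)
    # The generator plays the role of the indicator: 1 when xi <= x, else nothing.
    return sum(1 for xi in observations if xi <= x) / n


# Hypothetical observations, not taken from any data set in these notes.
obs = [2.1, 3.4, 0.7, 5.0]

below_min = ecdf(obs, 0.6)     # x is below the smallest observation
at_max = ecdf(obs, 5.0)        # x equals the largest observation
at_point = ecdf(obs, 0.7)      # x sits exactly on a data point
between = ecdf(obs, 1.5)       # x lies between two data points
```

Evaluating between two neighbouring data points gives the same value as at the lower one. This is the step-function behaviour described above.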
Cumulative distributions
To construct:
• Take the observed values and order them so that the
smallest one comes first.
• Label these ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ so that
$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
Exercise 3.2.5
For the observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.
x                    0   0.5     1   1.5     2   2.5     3   3.5     4   4.5     5
$\tilde{F}(x)$
Exercise 3.2.5
For the observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.
x                    0   0.5     1   1.5     2   2.5     3   3.5     4   4.5     5
num of x_i ≤ x       0     0     1     1     3     3     4     4     5     5     5
$\tilde{F}(x)$       0     0   1/5   1/5   3/5   3/5   4/5   4/5   5/5   5/5   5/5
$\tilde{F}(x)$       0     0   0.2   0.2   0.6   0.6   0.8   0.8     1     1     1
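The table can be reproduced with a short Python sketch that follows the construction recipe: sort the observations, then count how many ordered values are at most each grid point. Using `bisect_right` for that count is a choice made here, not something the notes prescribe.

```python
from bisect import bisect_right

observations = [1, 2, 2, 3, 4]
ordered = sorted(observations)        # x(1) <= x(2) <= ... <= x(n)
n = len(ordered)

grid = [k / 2 for k in range(11)]     # 0, 0.5, 1, ..., 5

# bisect_right gives the number of ordered values that are <= x.
counts = [bisect_right(ordered, x) for x in grid]
values = [c / n for c in counts]

for x, c, f in zip(grid, counts, values):
    print(x, c, f)
```

The jump at x = 2 is twice the size of the others because the value 2 occurs twice in the data.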
Exercise 3.2.5
[Figure: plot for the observations {1, 2, 2, 3, 4}; x-axis: x (0 to 5); vertical axis from 0.0 to 1.0, labelled density.]
Exercise 3.2.6
Construct the c.d.f. for the first 20 points from the Leeds summer ozone measurements and sketch it.
These are:
32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16
At each sorted data point we have a jump of i/n, which is
Exercise 3.2.6
[Figure: empirical c.d.f. $F_n(x)$ of the Leeds summer daily maxima; x-axis from 15 to 45; vertical axis from 0.0 to 1.0.]
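As a rough check on a hand sketch, the step heights for these 20 Leeds values can be tabulated in Python. The data are the values listed in the exercise. The jump at each distinct value is taken to be (number of observations equal to that value)/n, so tied values produce larger jumps.

```python
from collections import Counter

ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]
n = len(ozone)

tally = Counter(ozone)   # how many times each distinct value occurs
running = 0
steps = []               # (value, jump size, c.d.f. value just after the jump)
for value in sorted(tally):
    running += tally[value]
    steps.append((value, tally[value] / n, running / n))

for value, jump, height in steps:
    print(value, jump, height)
```

There are fewer steps than observations because several values are repeated. The value 32, for instance, occurs three times and so contributes a single jump of 3/20.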
Summer ozone Leeds
[Figure: empirical c.d.f. $F_n(x)$ of the Leeds summer daily maxima; x-axis from 0 to 80; vertical axis from 0.0 to 1.0.]
Scatterplots
Scatterplots look at the relationship between continuous variables.
• Usually they project two dimensions onto two
dimensions.
• Several ways of representing three dimensions.
Scatterplots are the mainstay of physical sciences.
Scatterplots
[Figure: scatterplots of O3 and NO2; panels: Summer and Winter.]
Scatterplots
[Figure: scatterplot of heart COHb saturation level and peripheral COHb saturation level.]
Independence
Scatterplots can be used to look for dependence between continuous variables.
• They can also be useful to identify situations in which
variables appear to be independent.
• If two variables are independent, then the distribution
of one variable will look the same regardless of the value of the other variable.
This is what the ozone versus NO2 plots above looked like.
Independence
Conditional probabilities were introduced in Math104:
If $A$ and $B$ are two events then, as long as $P(B) > 0$, the
conditional probability of $A$ given $B$ is written as $P(A \mid B)$
and calculated from:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
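A small worked instance of the formula, sketched in Python. The fair six-sided die and the events A (an even score) and B (a score of at least 4) are chosen here for illustration only and do not come from the course data.

```python
from fractions import Fraction

outcomes = range(1, 7)                     # one roll of a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}    # event A: the score is even
B = {x for x in outcomes if x >= 4}        # event B: the score is at least 4


def prob(event):
    """Probability of an event when all six outcomes are equally likely."""
    return Fraction(len(event), 6)


# P(A | B) = P(A and B) / P(B), which is valid because P(B) > 0.
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b, prob(A))
```

Here P(A | B) differs from P(A), so knowing that B occurred changes the probability of A. In other words, these two events are not independent.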
Independence
We can look for some structure in our data:
• including the dependence of one variable on another,
• by examining conditional distributions of some subsets
of our data.
Do this by separating the data by some defined criterion, and plotting the subsets.
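One way this separation might look in code is sketched below. The (NO2, O3) pairs are invented purely for illustration and are not the Leeds measurements. The helper name `condition_on` is also hypothetical, and the thresholds simply echo a split at 40 and 60.

```python
# Hypothetical (NO2, O3) pairs, invented purely for illustration.
pairs = [(25, 40), (30, 35), (38, 42), (45, 30), (52, 28), (58, 33), (70, 20)]


def condition_on(pairs, low, high):
    """O3 values whose paired NO2 value satisfies low < NO2 <= high."""
    return [o3 for no2, o3 in pairs if low < no2 <= high]


low_no2 = condition_on(pairs, float("-inf"), 40)   # O3 | NO2 <= 40
mid_no2 = condition_on(pairs, 40, 60)              # O3 | 40 < NO2 <= 60

print(low_no2)
print(mid_no2)
```

Each subset can then be drawn as its own histogram. If the histograms look alike, the data give little sign of dependence. Clear differences between them suggest that O3 depends on NO2.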
Exercise 3.2.8
[Figure: density histograms of O3; panels: Summer Ozone | NO2 <= 40, Winter Ozone | NO2 <= 40, Summer Ozone | 40 < NO2 <= 60, Winter Ozone | 40 < NO2 <= 60.]
Boxplots
The third of the “standard” forms for graphs:
• Similar to multiple histograms.
• Examine distribution of continuous variable.
• For different levels of a discrete variable.
The discrete variable can be ordered, or nominal.
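The numbers a boxplot typically displays for each level of the discrete variable can be sketched as follows. The group labels and values are hypothetical. Quartile conventions differ between packages, and the "inclusive" method of Python's `statistics.quantiles` is just one of them, so plotting software may place the box edges slightly differently.

```python
from statistics import quantiles

# Hypothetical measurements of one continuous variable at two levels
# of a discrete variable (labels and values invented for illustration).
groups = {
    "level A": [12, 15, 18, 21, 30],
    "level B": [5, 9, 10, 14, 16, 22, 40],
}


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


summaries = {name: five_number_summary(vals) for name, vals in groups.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Placing the summaries side by side is the numerical counterpart of placing boxplots side by side: one distribution of the continuous variable per level of the discrete one.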
Boxplots
[Figure: boxplots of Leeds.O3 and Ladybower.O3 (vertical axis 0 to 100); panels: Summer and Winter.]
Next session
Next time we shall:
1. take a look at boxplots,
2. learn about some of the classic plots from history,
3. find out what makes a good graph.