Exploratory data analysis

Dr. David Lucy

Lancaster University

Graphs

Graphics are a very important part of making sense of data:

• They allow the researcher to compare quantities easily, simply by comparing lengths and/or areas.

• Many humans are adapted to view quantities rather than numbers.

• Immediate impact.

• Suggests ideas for further work.

For many scientists this is their main form of analysis; some of the world's best science has been done purely with graphs.

Graphs

A graph is a suitable way of representing data if:

• A line or area can represent the quantities in the data in some way.

• Several “standard” forms can be used.

“Standard” forms are not the only forms. You can make up your own if you please.

R is very good for this.

Standard forms

There are three “standard” types of graph:

1. histogram - used to examine the distribution of a set of observations; can also be used to compare distributions between sets of observations. Observations may be discrete (with an underlying continuous scale) or continuous,

2. scatterplot - used to look for relationships between different continuous variables,

3. boxplot - sometimes called a box-and-whiskers plot; used to compare distributions of a continuous variable across groups, which is equivalent to looking for relationships between factors and continuous variables.

Histograms

[Figure: histograms of run length, panels “Partial” and “Full”; x-axis: run length (0 to 5), y-axis: count (0 to 60).]

Histograms

Do not confuse histograms with barcharts:

• Histograms have the area proportional to the quantity of interest:

  • not necessarily equal column widths,

  • although most are.

• Barcharts have the column height proportional to the quantity of interest.
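The area rule for histograms can be made concrete with a minimal Python sketch. The function name `histogram_heights`, the bin edges, and the counts are all hypothetical, chosen only to show what happens when column widths differ: the bar height is the count divided by (total × width), so the area of each bar equals its share of the data.

```python
def histogram_heights(counts, edges):
    """Return bar heights such that each bar's AREA is its share of the data."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more bin edge than counts")
    total = sum(counts)
    heights = []
    for count, left, right in zip(counts, edges[:-1], edges[1:]):
        width = right - left
        heights.append(count / (total * width))
    return heights


# Hypothetical counts in bins of unequal width: [0, 10), [10, 20), [20, 40)
counts = [10, 20, 20]
edges = [0, 10, 20, 40]

heights = histogram_heights(counts, edges)
areas = [h * (right - left) for h, left, right in zip(heights, edges[:-1], edges[1:])]

for h, a in zip(heights, areas):
    print(f"height = {h:.3f}, area = {a:.2f}")
```

The second and third bins hold the same number of observations, but the third bin is twice as wide, so its bar is drawn half as tall. A barchart would draw the two columns at the same height.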

Exercise 3.2.2

Rescale the full histogram so that the bars sum to one.

Run length         0     1     2     3     4     5
Number of runs    71    28     5     2     2     1

Exercise 3.2.2

The frequencies sum to Σ = 109 (there are 109 observations), so divide each frequency by 109:

Run length         0     1     2     3     4     5
Number of runs    71    28     5     2     2     1
Normalised      0.65  0.26  0.05  0.02  0.02  0.01

Scientists sometimes call this process normalisation.
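The same arithmetic as a short Python sketch, using the run-length counts from the exercise. The variable names are illustrative only.

```python
run_lengths = [0, 1, 2, 3, 4, 5]
counts = [71, 28, 5, 2, 2, 1]

total = sum(counts)  # total number of observations
proportions = [count / total for count in counts]

for length, count, proportion in zip(run_lengths, counts, proportions):
    print(f"run length {length}: count {count:2d}, normalised {proportion:.2f}")
```

The normalised values sum to one by construction, up to floating-point rounding.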

Scaled histograms

[Figure: scaled histograms of run length, panels “Partial” and “Full”; x-axis: run length (0 to 5), y-axis labelled “Count”, rescaled to run from 0.0 to 0.6.]

Histogram comparison

[Figure: density histograms of daily max ozone, panels “Leeds” and “Ladybower Reservoir”; x-axis: daily max ozone, y-axis: density.]

Histogram comparison

[Figure: density histogram titled “Differences”; x-axis: daily max ozone (−50 to 20), y-axis: density.]

Histogram problems

[Figure: two density histograms, both titled “Leeds”; x-axis: summer daily maxima, y-axis: density.]

Kernel density estimates

[Figure: kernel density estimates for Ladybower and Leeds; x-axis: ozone (ppb), y-axis: density.]

Kernel density estimates

[Figure: probability density plotted against x (0 to 8).]

Cumulative distributions

Recall the cumulative distribution function (c.d.f.) of a random variable $X$:

$$F(x) = P(X \le x)$$

How can we estimate this from a finite number of observations?

Cumulative distributions

Let us assume

• That our variables $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.).

• They are replicates of a random variable $X$ which has cumulative distribution function $F$.

• We can denote by $x_1, \ldots, x_n$ the observed values of $X_1, \ldots, X_n$.

Cumulative distributions

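The empirical cumulative distribution function (c.d.f.) is defined as:

$$\tilde{F}(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{1}{n}\sum_{i=1}^{n} \Pi(x_i \le x),$$

where

$$\Pi(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x, \\ 0 & \text{if } x_i > x. \end{cases}$$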

Cumulative distributions

The empirical c.d.f. is a proper distribution function and has the following properties:

• $\tilde{F}(x)$ is a step function with jumps at the data points;

• $\tilde{F}(x) = 1$ if $x \ge \max(x_1, \ldots, x_n)$;

• $\tilde{F}(x) = 0$ if $x < \min(x_1, \ldots, x_n)$.

Cumulative distributions

To construct:

• Take the observed values and order them so that the smallest one comes first.

• Label these ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ so that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.

Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.

x                     0  0.5    1  1.5    2  2.5    3  3.5    4  4.5    5
$\tilde{F}(x)$

Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find $\tilde{F}(x)$ and sketch the plot.

x                     0  0.5    1  1.5    2  2.5    3  3.5    4  4.5    5
number of x_i ≤ x     0    0    1    1    3    3    4    4    5    5    5
$\tilde{F}(x)$        0    0  1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5
$\tilde{F}(x)$        0    0  0.2  0.2  0.6  0.6  0.8  0.8    1    1    1
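The table can be checked with a short Python sketch that applies the definition of the empirical c.d.f. directly: count the observations less than or equal to x and divide by n. The function name `ecdf` is an illustrative choice, not part of the course material.

```python
def ecdf(data, x):
    """Proportion of observations in `data` that are less than or equal to x."""
    return sum(1 for xi in data if xi <= x) / len(data)


observations = [1, 2, 2, 3, 4]
grid = [k / 2 for k in range(11)]  # 0, 0.5, 1, ..., 5

values = [ecdf(observations, x) for x in grid]

for x, value in zip(grid, values):
    print(f"x = {x:3.1f}   F(x) = {value:.1f}")
```

The repeated observation 2 produces a jump of 2/5 at x = 2, whereas every other data point produces a jump of 1/5.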

Exercise 3.2.5

[Figure: plot of the empirical c.d.f. for the observations {1, 2, 2, 3, 4}; x-axis: x (0 to 5), y-axis labelled “density” (0.0 to 1.0).]

Exercise 3.2.6

Construct the c.d.f. for the first 20 points from the Leeds summer ozone measurements and sketch it.

These are:

32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16

At each sorted data point we have a jump to height i/n, which is
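A minimal Python sketch of the construction for these 20 values: tally the distinct values, sort them, and accumulate the proportion of observations at or below each one. The variable names are illustrative. Tied observations give one larger jump; for example, 32 occurs three times, so the step at 32 has height 3/20.

```python
from collections import Counter

ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]

n = len(ozone)
tally = Counter(ozone)  # how many times each distinct value occurs

steps = []       # (value, height of the empirical c.d.f. at that value)
cumulative = 0
for value in sorted(tally):
    cumulative += tally[value]
    steps.append((value, cumulative / n))

for value, height in steps:
    print(f"x = {value:2d}   F(x) = {height:.2f}")
```

Plotting these pairs as a step function, with height 0 to the left of the smallest value, gives the sketch asked for in the exercise.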

Exercise 3.2.6

[Figure: empirical c.d.f. titled “Leeds”; x-axis: summer daily maxima (15 to 45), y-axis: $F_n(x)$ (0.0 to 1.0).]

Summer ozone Leeds

[Figure: empirical c.d.f. titled “Leeds”; x-axis: summer daily maxima (0 to 80), y-axis: $F_n(x)$ (0.0 to 1.0).]

Scatterplots

Scatterplots look at the relationship between continuous variables.

• Usually they project two dimensions onto two dimensions.

• Several ways of representing three dimensions.

Scatterplots are the mainstay of physical sciences.

Scatterplots

[Figure: scatterplots of NO2 and O3, panels “Summer” and “Winter”.]

Scatterplots

[Figure: scatterplot of heart COHb saturation level and peripheral COHb saturation level.]

Independence

Scatterplots can be used to look for dependence between continuous variables.

• They can also be useful to identify situations in which variables appear to be independent.

• If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.

This is what the ozone versus NO2 plots above looked like.

Independence

Conditional probabilities were introduced in Math104:

If $A$ and $B$ are two events then, as long as $P(B) > 0$, the conditional probability of $A$ given $B$ is written as $P(A \mid B)$ and calculated from:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
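As a small worked example of this formula, the Python sketch below uses a fair six-sided die. The die and the events A (even score) and B (score greater than 3) are assumptions made purely for illustration; they do not come from the ozone data.

```python
from fractions import Fraction

outcomes = range(1, 7)                      # scores of a fair die
A = {x for x in outcomes if x % 2 == 0}     # event A: the score is even
B = {x for x in outcomes if x > 3}          # event B: the score is greater than 3


def prob(event):
    """Probability of an event when all six outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


p_a_given_b = prob(A & B) / prob(B)         # P(A | B) = P(A and B) / P(B)

print("P(A)     =", prob(A))
print("P(A | B) =", p_a_given_b)
```

Here P(A | B) = 2/3 differs from P(A) = 1/2, so knowing that B occurred changes the probability of A: the two events are not independent.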

Independence

We can look for some structure in our data:

• including the dependence of one variable on another,

• by examining conditional distributions of some subsets of our data.

Do this by separating the data by some defined criterion, and plotting the subsets.
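A minimal Python sketch of the splitting step. The (NO2, O3) pairs are invented for illustration and are not the course data; the band boundaries at 40 and 60 follow the panel titles of Exercise 3.2.8, and the function name `band` is hypothetical.

```python
from statistics import mean

# Hypothetical (NO2, O3) measurements, invented for illustration only.
pairs = [(25, 48), (35, 41), (38, 52), (45, 30), (52, 27), (58, 33), (70, 15)]


def band(no2):
    """Assign an NO2 value to the conditioning band it falls in."""
    if no2 <= 40:
        return "NO2 <= 40"
    if no2 <= 60:
        return "40 < NO2 <= 60"
    return "NO2 > 60"


# Separate the O3 values according to the NO2 band of each pair.
subsets = {}
for no2, o3 in pairs:
    subsets.setdefault(band(no2), []).append(o3)

for label, values in subsets.items():
    print(f"{label}: n = {len(values)}, mean O3 = {mean(values):.1f}")
```

Each list in `subsets` is one conditional sample of O3; drawing a histogram for each and comparing them shows whether the distribution of O3 changes with NO2.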

Exercise 3.2.8

[Figure: four density histograms of O3, panels “Summer Ozone | NO2 <= 40”, “Winter Ozone | NO2 <= 40”, “Summer Ozone | 40 < NO2 <= 60” and “Winter Ozone | 40 < NO2 <= 60”; x-axis: O3, y-axis: density.]

Boxplots

The third of the “standard” forms for graphs:

• Similar to multiple histograms.

• Examine the distribution of a continuous variable.

• For different levels of a discrete variable.

The discrete variable can be ordered or nominal.
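The numbers behind a boxplot can be sketched in Python as a five-number summary computed separately for each level of the discrete variable. The group names and values are hypothetical, and the quartile rule used here (`method="inclusive"` in `statistics.quantiles`) is only one of several conventions; plotting software may use a different rule and may draw the whiskers differently.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical continuous measurements at two levels of a discrete variable.
groups = {
    "site A": [1, 2, 3, 4, 5],
    "site B": [10, 20, 30, 40, 50],
}

summaries = {name: five_number_summary(values) for name, values in groups.items()}

for name, summary in summaries.items():
    print(name, summary)
```

Placing the summaries side by side is what a boxplot does graphically: one box per level, so the distributions can be compared at a glance.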

Boxplots

[Figure: boxplots of Leeds.O3 and Ladybower.O3, panels “Summer” and “Winter”; vertical axis from 0 to 100.]

Next session

Next time we shall:

1. take a look at boxplots,

2. learn about some of the classic plots from history,

3. find out what makes a good graph.