
Academic year: 2021


Exploratory data analysis

Dr. David Lucy

[email protected]

Lancaster University

Exploratory data analysis – p.1/36

Graphs

Graphics are a very important part of making sense of data:

• Allow the researcher to compare quantities easily and simply by comparing lengths and/or areas.

• Many humans are adapted to view quantities rather than numbers.

• Immediate impact.

• Suggest ideas for further work.

For many scientists this is their main form of analysis; some of the world's best science has been done purely by graphs.

Graphs

A graph is a suitable way of representing data if:

• A line or area can represent the quantities in the data in some way.

• Several “standard” forms can be used.

“Standard” forms are not the only forms. You can make up your own if you please.

R is very good for this.


Standard forms

There are three “standard” types of graph:

1. histogram: used to examine the distribution of a set of observations; can be used to compare distributions between sets of observations; observations may be discrete (underlying continuous) or continuous,

2. scatterplot: used to look for relationships between different continuous variables,

3. boxplot: sometimes called a box and whiskers plot; used to compare distributions of continuous variables, which is equivalent to looking for relationships between factors and continuous variables.


Histograms

[Figure: histograms of count (0 to 60) against run length (0 to 5); panels: Partial, Full]


Histograms

Do not confuse histograms with barcharts:

• Histograms have the area proportional to the quantity of interest:

  • not necessarily equal column widths,

  • although most are.

• Barcharts have the column height proportional to the quantity of interest.
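
The area rule can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper `histogram_heights`, the observations and the bin edges are all made up for illustration. Each bar height is the bin's share of the observations divided by the bin width, so the areas, not the heights, carry the quantity of interest.

```python
def histogram_heights(values, edges):
    """Return bar heights such that each bar's AREA is the fraction of values in its bin."""
    n = len(values)
    heights = []
    for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        last = i == len(edges) - 2
        # Bins are [left, right); the last bin also includes its right edge.
        count = sum(1 for v in values if left <= v < right or (last and v == right))
        heights.append(count / n / (right - left))
    return heights


# Hypothetical observations and deliberately unequal bin widths (1, 1 and 2).
values = [0.5, 0.7, 1.2, 1.5, 1.8, 2.5, 3.0, 3.9]
edges = [0, 1, 2, 4]

heights = histogram_heights(values, edges)
areas = [h * (right - left) for h, left, right in zip(heights, edges[:-1], edges[1:])]

print(heights)     # the widest bin gets a lower bar for the same share of the data
print(sum(areas))  # the areas sum to one
```

The second and third bins each hold three of the eight values, but the third bin is twice as wide and so its bar is half as tall. A barchart would draw both at the same height.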


Exercise 3.2.2

Rescale the full histogram so that the bars sum to one.

Run length       0    1    2    3    4    5
Number of runs  71   28    5    2    2    1


Exercise 3.2.2

The frequencies sum to Σ = 109, so divide each frequency by 109:

Run length        0     1     2     3     4     5
Number of runs   71    28     5     2     2     1    (109 observations in total)
Normalised     0.65  0.26  0.05  0.02  0.02  0.01

This process is sometimes called normalisation by scientists.
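
The same arithmetic as a short Python sketch, using the run-length counts from the exercise (the variable names are illustrative only):

```python
run_lengths = [0, 1, 2, 3, 4, 5]
counts = [71, 28, 5, 2, 2, 1]

total = sum(counts)                        # 109 observations
normalised = [c / total for c in counts]   # each frequency divided by the total

for length, proportion in zip(run_lengths, normalised):
    print(length, round(proportion, 2))
```

The unrounded proportions sum to one. The two-decimal values in the table sum to 1.01 because of rounding.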


Scaled histograms

[Figure: scaled histograms against run length (0 to 5), vertical axis from 0.0 to 0.6; panels: Partial, Full]

Histogram comparison

[Figure: density histograms of daily max ozone; panels: Leeds, Ladybower Reservoir]


Histogram comparison

[Figure: density histogram of the differences in daily max ozone (horizontal axis from −50 to 20)]


Histogram problems

[Figure: two density histograms of the Leeds summer daily maxima]


Kernel density estimates

[Figure: kernel density estimates of ozone (ppb) for Ladybower and Leeds]

Kernel density estimates

[Figure: probability density against x (horizontal axis from 0 to 8)]


Cumulative distributions

Recall the cumulative distribution function (c.d.f.) of a random variable X:

F(x) = P(X ≤ x)

How can we estimate this from a finite number of observations?


Cumulative distributions

Let us assume:

• That our variables X1, ..., Xn are independent and identically distributed (i.i.d.).

• They are replicates of a random variable X which has cumulative distribution function F.

• We can denote by x1, ..., xn the observed values of X1, ..., Xn.


Cumulative distributions

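The empirical cumulative distribution function (c.d.f.) is defined as:

\[
\tilde{F}(x) = \frac{1}{n}\,(\text{num of } x_i \le x) = \frac{\sum_{i=1}^{n} \Pi(x_i \le x)}{n},
\]

where

\[
\Pi(x_i \le x) =
\begin{cases}
1 & \text{if } x_i \le x, \\
0 & \text{if } x_i > x.
\end{cases}
\]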
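
As a rough illustration, the definition translates almost directly into code. This Python sketch is not from the slides: the function name `ecdf` and the sample values are assumptions chosen for the example. It counts the observations that are at most x and divides by n.

```python
def ecdf(observations, x):
    """Empirical c.d.f. at x: the fraction of observations that are <= x."""
    n = len(observations)
    # Each observation contributes 1 if it is <= x and 0 otherwise.
    return sum(1 for xi in observations if xi <= x) / n


sample = [1, 2, 2, 3, 4]  # illustrative values only

for x in (0.5, 1, 2, 2.5, 4, 10):
    print(x, ecdf(sample, x))
```

The function can be evaluated at any x, not only at the data points, which is what makes the result a step function.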

Cumulative distributions

The empirical c.d.f. is a proper distribution function and has the following properties:

• F̃(x) is a step function with jumps at the data points;

• F̃(x) = 1 if x ≥ max(x1, ..., xn);

• F̃(x) = 0 if x < min(x1, ..., xn).


Cumulative distributions

To construct:

• Take the observed values and order them so that the smallest one comes first.

• Label these ordered values x(1), x(2), ..., x(n) so that x(1) ≤ x(2) ≤ ... ≤ x(n).
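
One way to carry on from the ordered values is sketched below in Python. It assumes the usual convention that the empirical c.d.f. reaches i/n at the i-th ordered value, with tied values merged into a single larger jump. The helper name `ecdf_steps` and the input values are illustrative, not taken from the slides.

```python
def ecdf_steps(observations):
    """Return (value, height) pairs giving the empirical c.d.f. just after each jump."""
    ordered = sorted(observations)   # x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    steps = []
    for i, value in enumerate(ordered, start=1):
        if steps and steps[-1][0] == value:
            # A tied value: replace the previous height so the jump is larger.
            steps[-1] = (value, i / n)
        else:
            steps.append((value, i / n))
    return steps


print(ecdf_steps([3, 1, 2, 4, 2]))
```

Between consecutive jump points the function stays flat at the previous height, and it is 0 to the left of the smallest value.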


Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find F̃(x) and sketch the plot.

x       0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F̃(x)


Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find F̃(x) and sketch the plot.

x               0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
num of xi ≤ x   0    0    1    1    3    3    4    4    5    5    5
F̃(x)            0    0    1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5
F̃(x)            0    0    0.2  0.2  0.6  0.6  0.8  0.8  1    1    1

Exercise 3.2.5

[Figure: step plot for the observations {1, 2, 2, 3, 4}; horizontal axis x (0 to 5), vertical axis labelled density (0.0 to 1.0)]


Exercise 3.2.6

Construct the c.d.f. for the first 20 points from the Leeds summer ozone measurements and sketch it.

These are:

32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16

At each sorted data point we have a jump of i/n, which is


Exercise 3.2.6

[Figure: Fn(x) (0.0 to 1.0) against the Leeds summer daily maxima (horizontal axis from 15 to 45)]


Summer ozone Leeds

[Figure: Fn(x) (0.0 to 1.0) against the Leeds summer daily maxima (horizontal axis from 0 to 80)]

Scatterplots

Scatterplots look at the relationship between continuous variables.

• Usually they project two dimensions onto two dimensions.

• There are several ways of representing three dimensions.

Scatterplots are the mainstay of the physical sciences.


Scatterplots

[Figure: scatterplots of O3 against NO2; panels: Summer, Winter]


Scatterplots

[Figure: scatterplot of heart COHb saturation level and peripheral COHb saturation level]


Independence

Scatterplots can be used to look for dependence between continuous variables.

• They can also be useful to identify situations in which variables appear to be independent.

• If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable.

This is what the ozone versus NO2 plots above looked like.

Independence

Conditional probabilities were introduced in Math104:

If A and B are two events then, as long as P(B) > 0, the conditional probability of A given B is written as P(A|B) and calculated from:

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\]
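
A small worked case may help. The Python sketch below uses a fair six-sided die, an example chosen here rather than taken from the slides, with A = "the roll is even" and B = "the roll is greater than 3". Exact fractions avoid rounding.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                # a fair six-sided die (illustrative example)
A = {o for o in outcomes if o % 2 == 0}    # even roll: {2, 4, 6}
B = {o for o in outcomes if o > 3}         # roll greater than 3: {4, 5, 6}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


assert prob(B) > 0                         # the definition requires P(B) > 0
p_a_given_b = prob(A & B) / prob(B)        # P(A ∩ B) / P(B)

print(p_a_given_b)                         # 2/3
```

Here P(A ∩ B) = 2/6 and P(B) = 3/6, so P(A|B) = 2/3, which differs from P(A) = 1/2. Knowing B changes the probability of A, so the two events are not independent.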


Independence

We can look for some structure in our data:

• including the dependence of one variable on another,

• by examining conditional distributions of some subsets of our data.

Do this by separating the data by some defined criterion, and plotting the subsets.
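
A minimal Python sketch of this idea follows. The data are synthetic, generated only for illustration and not real ozone measurements. The variable names, the threshold of 40 and the helper `split_by` are likewise assumptions. It compares subset means rather than plots, to keep the example short.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Synthetic values, NOT real measurements.
no2 = [random.uniform(0, 100) for _ in range(500)]
o3_independent = [random.gauss(40, 10) for _ in no2]              # ignores no2
o3_dependent = [80 - 0.5 * x + random.gauss(0, 5) for x in no2]   # falls as no2 rises


def split_by(criterion, values, threshold):
    """Separate values into two subsets according to criterion <= threshold."""
    low = [v for c, v in zip(criterion, values) if c <= threshold]
    high = [v for c, v in zip(criterion, values) if c > threshold]
    return low, high


low_ind, high_ind = split_by(no2, o3_independent, 40)
low_dep, high_dep = split_by(no2, o3_dependent, 40)

print("independent:", round(mean(low_ind), 1), round(mean(high_ind), 1))
print("dependent:  ", round(mean(low_dep), 1), round(mean(high_dep), 1))
```

For the independent pair the two subsets should look alike, while for the dependent pair they should differ clearly. In practice one would plot a histogram of each subset rather than compare a single summary.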


Exercise 3.2.8

[Figure: density histograms of O3 (density from 0.00 to 0.05) in four panels: Summer Ozone | NO2 ≤ 40; Winter Ozone | NO2 ≤ 40; Summer Ozone | 40 < NO2 ≤ 60; Winter Ozone | 40 < NO2 ≤ 60]


Boxplots

The third of the “standard” forms for graphs:

• Similar to multiple histograms.

• Examine distribution of continuous variable.

• For different levels of a discrete variable.

The discrete variable can be ordered, or nominal.
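
The slides do not spell out how a box is drawn. The Python sketch below assumes the common convention that each box is built from the quartiles of the continuous variable within one level of the discrete variable. The levels and values are hypothetical, and `statistics.quantiles` with `method="inclusive"` is only one of several quartile definitions.

```python
from statistics import quantiles

# Hypothetical continuous values grouped by the levels of a discrete variable.
groups = {
    "Summer": [10, 20, 30, 40, 50],
    "Winter": [5, 15, 25, 35, 45],
}

# Lower quartile, median and upper quartile for each level.
summary = {
    level: quantiles(values, n=4, method="inclusive")
    for level, values in groups.items()
}

for level, (q1, median, q3) in summary.items():
    print(level, q1, median, q3)
```

Placing these summaries side by side is what lets a boxplot compare distributions across levels at a glance.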

Boxplots

[Figure: boxplots of Leeds.O3 and Ladybower.O3 (scale from 0 to 100); panels: Summer, Winter]


Next session

Next time we shall:

1. take a look at boxplots,

2. learn about some of the classic plots from history,

3. find out what makes a good graph.
