
Data Visualization in R

L. Torgo

[email protected]

Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto

(2)

Introduction

Motivation for Data Visualization

Humans are outstanding at detecting patterns and structures with their eyes

Data visualization methods try to exploit these capabilities

In spite of all their advantages, visualization methods also have several problems, particularly with very large data sets

(3)

Outline of what we will learn

Tools for univariate data
Tools for bivariate data
Tools for multivariate data

(4)

R Graphics

(5)

Standard Graphics (the graphics package)

R standard graphics, available through package graphics, include several functions that provide standard statistical plots, such as:

Scatterplots
Boxplots
Pie charts
Barplots
etc.

These graphs can typically be obtained with a single function call

Example of a scatterplot

plot(1:10,sin(1:10))

These graphs can easily be augmented by adding further elements (lines, text, etc.)
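As a minimal sketch of augmenting a plot, the code below first creates the scatterplot and then adds a line, a reference line and a text label with low-level functions from the graphics package. The choice of annotations and the temporary output file are illustrative assumptions, not part of the original slides.

```r
# Draw to a temporary PDF so the sketch also runs without a screen device;
# drop the pdf()/dev.off() pair to draw on screen instead.
out_file <- tempfile(fileext = ".pdf")   # hypothetical output path
pdf(out_file)

x <- 1:10
plot(x, sin(x), main = "sin(x) at the integers 1 to 10")  # high-level call: creates the plot
lines(x, sin(x), lty = 2)                 # add a dashed line joining the points
abline(h = 0, col = "grey")               # add a horizontal reference line at y = 0
peak <- which.max(sin(x))                 # index of the largest plotted value
text(x[peak], sin(x[peak]), labels = "maximum", pos = 1)  # label placed below that point

dev.off()
```

Each low-level call draws on top of the plot created by the last high-level call, so the order of the calls matters.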

(8)

Introduction to Standard Graphics

Graphics Devices

R graphics functions produce output that depends on the active graphics device

The default and most frequently used device is the screen

There are many more graphics devices in R, like the pdf device, the jpeg device, etc.

The user just needs to open (and in the end close) the graphics output device she/he wants. R takes care of producing the type of output required by the device

This means that the R code to produce a certain plot on the screen or as a JPEG file is exactly the same. You only need to open the target output device first!

Several devices may be open at the same time, but only one is the active device

(14)

A few examples

A scatterplot

plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))

The same plot, but stored in a JPEG file

jpeg('exp.jpg')

plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))

dev.off()

And now as a pdf file

pdf('exp.pdf',width=6,height=6)

plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))

dev.off()
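The notion of an active device can be sketched as follows: two file devices are open at the same time and the same plotting function is sent to each in turn. The helper draw_sine() and the temporary file names are assumptions made for illustration; dev.cur(), dev.set() and dev.off() are the standard grDevices functions for querying, switching and closing devices.

```r
# Same plotting code for every device; only the call that opens the device differs.
draw_sine <- function() {
  x <- seq(0, 4 * pi, by = 0.1)
  plot(x, sin(x), type = "l", main = "sin(x)")
}

pdf_file <- tempfile(fileext = ".pdf")   # hypothetical output paths
ps_file  <- tempfile(fileext = ".ps")

pdf(pdf_file, width = 6, height = 6)     # opens a device, which becomes the active one
pdf_dev <- dev.cur()                     # remember its device number
postscript(ps_file)                      # opens a second device, now the active one
ps_dev <- dev.cur()

draw_sine()                              # drawn on the active device (PostScript)
dev.set(pdf_dev)                         # make the PDF device active instead
draw_sine()                              # same code, now drawn on the PDF device

dev.off(ps_dev)                          # close each device so its file is completed
dev.off(pdf_dev)
```

Forgetting the final dev.off() is a common reason for empty or unreadable output files.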

(15)

Package ggplot2

Package ggplot2 implements the ideas of Wilkinson (2005) on a grammar of graphics

This grammar is the result of a theoretical study of what a statistical graphic is

ggplot2 builds upon this theory by implementing the concept of a layered grammar of graphics (Wickham, 2009)

The grammar defines a statistical graphic as:

a mapping from data into aesthetic attributes (color, shape, size, etc.) of geometric objects (points, lines, bars, etc.)

L. Wilkinson (2005). The Grammar of Graphics. Springer.

(16)

Introduction to ggplot

The Basics of the Grammar of Graphics

Key elements of a statistical graphic:

data
aesthetic mappings
geometric objects
statistical transformations
scales
coordinate system
faceting

(17)

Aesthetic Mappings

Control the relation between data variables and graphic variables

map the Temperature variable of a data set into the x variable in a scatter plot

map the Species of a plant into the colour of dots in a graphic

etc.

(18)

Geometric Objects

Control what is shown in the graphics

show each observation by a point, using the aesthetic mappings that map two variables in the data set into the x, y variables of the plot

etc.

(19)

Statistical Transformations

Allow us to calculate and do statistical analysis over the data in the plot

Use the data and approximate it by a regression line on the x, y coordinates

Count occurrences of certain values

etc.
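ggplot2 carries out such transformations inside the plot. As a rough base R sketch of the two computations mentioned above, the code below fits a regression line and counts occurrences on the built-in iris data; the choice of data set and variables is an assumption made for illustration.

```r
# Regression line: fit y = a + b * x by least squares
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
line_coefs <- coef(fit)          # intercept a and slope b of the fitted line
line_coefs

# Counting: number of observations for each value of a nominal variable
species_counts <- table(iris$Species)
species_counts
```

A smoothing layer draws a line like the first result, and a bar layer displays counts like the second.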

(20)

Scales

Maps the data values into values in the coordinate system of the graphics device

(21)

Coordinate System

The coordinate system used to plot the data

Cartesian
Polar
etc.

(22)

Faceting

Split the data into sub-groups and draw sub-graphs for each group

(23)

A Simple Example

data(iris)
library(ggplot2)
ggplot(iris, aes(x=Petal.Length, y=Sepal.Length, colour=Species)) + geom_point()

[Figure: scatterplot of Sepal.Length (5 to 8) against Petal.Length (2 to 6), with points coloured by Species (setosa, versicolor, virginica)]

(24)

Univariate Plots Distribution of Values of Nominal Variables

The Distribution of Values of Nominal Variables

The Barplot

The Barplot is a graph whose main purpose is to display a set of values as heights of bars

It can be used to display the frequency of occurrence of different values of a nominal variable as follows:

First obtain the number of occurrences of each value

Then use the Barplot to display these counts
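The two steps can be sketched on a built-in data set. Here chickwts$feed stands in for a nominal variable (an assumption made so the sketch runs without the algae data), and the plot is written to a temporary file.

```r
# Step 1: number of occurrences of each value of the nominal variable
feed_counts <- table(chickwts$feed)
feed_counts

# Step 2: display the counts as bar heights
out_file <- tempfile(fileext = ".pdf")   # hypothetical output path
pdf(out_file)
bar_mids <- barplot(feed_counts,
                    main = "Distribution of Chicks across Feed Types",
                    ylab = "Nr. of Occurrences")
dev.off()
```

barplot() returns the x positions of the bar midpoints, which is useful when adding text or lines on top of the bars.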

(26)

Barplots in base graphics

barplot(table(algae$season),

main='Distribution of the Water Samples across Seasons')

[Figure: bar plot "Distribution of the Water Samples across Seasons"; bars for autumn, spring, summer and winter; y axis from 0 to 60]

(27)

Barplots in ggplot2

ggplot(algae,aes(x=season)) + geom_bar() +

ggtitle('Distribution of the Water Samples across Seasons')

[Figure: bar plot of the water samples by season (autumn, spring, summer, winter); x axis: season; y axis: count (0 to 60)]

(28)

Pie Charts

Pie charts serve the same purpose as bar plots but present the information in the form of a pie.

pie(table(algae$season),

main='Distribution of the Water Samples across Seasons')

[Figure: pie chart "Distribution of the Water Samples across Seasons" with slices for autumn, spring, summer and winter]

Note: Avoid pie charts!

(29)

Pie Charts in ggplot

ggplot(algae,aes(x=factor(1),fill=season)) + geom_bar(width=1) +

ggtitle('Distribution of the Water Samples across Seasons') +

coord_polar(theta="y")

[Figure: pie chart in polar coordinates with one slice per season (autumn, spring, summer, winter); angular axis: count (0 to 200)]

(30)

Univariate Plots Distribution of the Values of a Continuous Variable

The Distribution of Values of a Continuous Variable

The Histogram

The Histogram is a graph whose main purpose is to display how the values of a continuous variable are distributed

It is obtained as follows:

First the range of the variable is divided into a set of bins (intervals of values)

Then the number of occurrences of values in each bin is counted

Then this number is displayed as a bar
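The binning and counting steps can be sketched by hand and compared with what hist() computes. The built-in faithful data and the bin limits are assumptions chosen for illustration.

```r
x <- faithful$eruptions                      # built-in data: eruption durations (minutes)

# Step 1: divide the range of the variable into bins
breaks <- seq(1.5, 5.5, by = 0.5)
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Step 2: count the values falling in each bin
manual_counts <- as.vector(table(bins))

# hist() does the same bookkeeping; plot = FALSE returns the numbers without drawing
h <- hist(x, breaks = breaks, plot = FALSE)
h$counts

# With prob = TRUE the bar heights are densities, so the bar areas add up to 1
bar_areas <- h$density * diff(h$breaks)
sum(bar_areas)
```

Note that the density scale is not a percentage: a bar height can exceed 1 when the bins are narrower than one unit.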

(32)

Two examples of Histograms in R

hist(algae$a1,main='Distribution of Algae a1',
     xlab='a1',ylab='Nr. of Occurrences')

hist(algae$a1,main='Distribution of Algae a1',
     xlab='a1',ylab='Density',prob=TRUE)

[Figure: two histograms titled "Distribution of Algae a1"; x axis: a1 (0 to 80); left panel y axis: counts (0 to 100); right panel y axis: density scale (0.00 to 0.05)]

(33)

Two examples of Histograms in ggplot2

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10) +
  ggtitle("Distribution of Algae a1") + ylab("Nr. of Occurrences")

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10,aes(y=after_stat(density))) +
  ggtitle("Distribution of Algae a1") + ylab("Density")

[Figure: two histograms titled "Distribution of Algae a1"; x axis: a1 (0 to 100); left panel y axis: counts (0 to 90); right panel y axis: density scale (0.00 to 0.04)]

(34)

Problems with Histograms

Histograms may be misleading in small data sets

Another key issue is how the limits of the bins are chosen

There are several algorithms for this

Check the “Details” section of the help page of function hist() if you want to know more about this and to obtain references on alternatives

Within ggplot2 you may control this through the binwidth parameter

(36)

Univariate Plots Kernel Density Estimation

Showing the Distribution of Values

Kernel Density Estimates

Some of the problems of histograms can be tackled by smoothing the estimates of the distribution of the values. That is the purpose of kernel density estimates

Kernel estimates calculate the estimate of the distribution at a certain point by smoothly averaging over the neighboring points

Namely, the density is estimated by

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K() is a kernel function and h a bandwidth parameter
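The estimator can be evaluated directly. The sketch below assumes a Gaussian kernel, the bandwidth rule bw.nrd0() that density() uses by default, and the usual normalised form of the estimate, in which the sum of K((x - x_i)/h) is divided by n times h. The built-in faithful data stand in for the algae data.

```r
x <- faithful$eruptions
n <- length(x)
h <- bw.nrd0(x)                          # default bandwidth rule of density()

# Kernel density estimate at a single point x0, with K = standard normal density
kde <- function(x0) sum(dnorm((x0 - x) / h)) / (n * h)

# Evaluate the estimate on a grid that extends 4 bandwidths beyond the data
grid <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = 512)
f_hat <- sapply(grid, kde)

# A density should integrate to (approximately) 1: check with a Riemann sum
area <- sum(f_hat) * (grid[2] - grid[1])

# Compare with the built-in estimator on the same grid
d <- density(x, bw = h, kernel = "gaussian",
             from = min(grid), to = max(grid), n = 512)
max_diff <- max(abs(d$y - f_hat))        # small: density() uses a fast approximation
```

Larger values of h give smoother curves; smaller values follow the individual observations more closely.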

(38)

An Example and how to obtain it in R base graphics

hist(algae$mxPH,main='The Histogram of mxPH (maximum pH)',
     xlab='mxPH Values',prob=TRUE)
lines(density(algae$mxPH,na.rm=TRUE))
rug(algae$mxPH)

[Figure: histogram "The Histogram of mxPH (maximum pH)" with the kernel density estimate overlaid and a rug along the x axis; x axis: mxPH Values (6 to 10); y axis: Density (0.0 to 0.7)]

(39)

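An Example and how to obtain it in ggplot2

ggplot(algae,aes(x=mxPH)) + geom_histogram(binwidth=.5, aes(y=after_stat(density))) +
  geom_density(color="red") + geom_rug() +
  ggtitle("The Histogram of mxPH (maximum pH)")

[Figure: histogram of mxPH with the kernel density estimate overlaid in red and a rug along the x axis; x axis: mxPH (5 to 10); y axis: density (0.0 to 0.6)]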

(40)

Showing the Distribution of Values

Graphing Quantiles

The x quantile of a continuous variable is the value below which an x% proportion of the data lies

Examples of this concept are the 1st (25%) and 3rd (75%) quartiles and the median (50%)

We can calculate these quantiles at different values of x and then plot them to provide an idea of how the values in a sample of data are distributed
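A small sketch of this idea, assuming the built-in airquality data in place of the algae data: quantile() computes the quantiles at a set of proportions and ecdf() gives the related empirical cumulative distribution function.

```r
x <- airquality$Temp                      # built-in data, no missing values in Temp

# The quartiles and the median are particular quantiles
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
quartiles

# Quantiles at many proportions, plotted against those proportions
probs <- seq(0.05, 0.95, by = 0.05)
q <- quantile(x, probs = probs)

out_file <- tempfile(fileext = ".pdf")    # hypothetical output path
pdf(out_file)
plot(probs, q, type = "b", xlab = "Proportion of the data", ylab = "Quantile of Temp")
dev.off()

# The empirical CDF answers the reverse question:
# which proportion of the data is at or below a given value?
Fn <- ecdf(x)
Fn(median(x))
```

The quantile plot and the cumulative distribution function show the same information with the axes exchanged.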

(43)

The Cumulative Distribution Function in Base Graphics

plot(ecdf(algae$NO3),

main='Cumulative Distribution Function of NO3',xlab='NO3 values')

[Figure: step plot "Cumulative Distribution Function of NO3"; x axis: NO3 values (0 to 50); y axis: Fn(x) (0.0 to 1.0)]

(44)

The Cumulative Distribution Function in ggplot

ggplot(algae,aes(x=NO3)) + stat_ecdf() + xlab('NO3 values') +
  ggtitle('Cumulative Distribution Function of NO3')

[Figure: step plot "Cumulative Distribution Function of NO3"; x axis: NO3 values (0 to 40); y axis: y (0.00 to 1.00)]

(46)

QQ Plots

Graphs that can be used to compare the observed distribution against the Normal distribution

Can be used to visually check the hypothesis that the variable under study follows a normal distribution

Obviously, more formal tests also exist

(48)

QQ Plots from package car using base graphics

library(car)
qqPlot(algae$mxPH,main='QQ Plot of Maximum PH Values')

[Figure: QQ plot "QQ Plot of Maximum PH Values"; x axis: norm quantiles (−3 to 3); y axis: algae$mxPH (6 to 9)]

(49)

QQ Plots using ggplot2

ggplot(algae,aes(sample=mxPH)) + stat_qq() +
  ggtitle('QQ Plot of Maximum PH Values')

[Figure: QQ plot of mxPH; x axis: theoretical (−3 to 3); y axis: sample (6 to 9)]

(50)

QQ Plots using ggplot2 (2)

q.x <- quantile(algae$mxPH,c(0.25,0.75),na.rm=TRUE)
q.z <- qnorm(c(0.25,0.75))
b <- (q.x[2] - q.x[1])/(q.z[2] - q.z[1])
a <- q.x[1] - b * q.z[1]

ggplot(algae,aes(sample=mxPH)) + stat_qq() +
  ggtitle('QQ Plot of Maximum PH Values') +
  geom_abline(intercept=a,slope=b,color="red")

[Figure: QQ plot "QQ Plot of Maximum PH Values" with the red reference line; x axis: theoretical (−3 to 3); y axis: sample (6 to 9)]
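The reference line above passes through the points defined by the first and third quartiles of the sample and of the normal distribution. A base R sketch of the same construction, assuming the built-in faithful data instead of the algae data, and using shapiro.test() as one example of a more formal test:

```r
x <- faithful$waiting                     # built-in data: waiting times (minutes)

# Line through the (theoretical, sample) quartile pairs
q.x <- quantile(x, c(0.25, 0.75))
q.z <- qnorm(c(0.25, 0.75))
b <- unname((q.x[2] - q.x[1]) / (q.z[2] - q.z[1]))   # slope
a <- unname(q.x[1] - b * q.z[1])                     # intercept

out_file <- tempfile(fileext = ".pdf")    # hypothetical output path
pdf(out_file)
qqnorm(x, main = "QQ Plot of Waiting Times")
abline(a, b, col = "red")                 # the line computed by hand
qqline(x, lty = 2)                        # base R helper based on the same quartiles
dev.off()

# A formal test of normality; a small p-value is evidence against normality
sw <- shapiro.test(x)
sw$p.value
```

Points that stray systematically from the line, especially in the tails, suggest that the normal distribution is not a good description of the data.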

(51)

Showing the Distribution of Values

Box Plots

Box plots provide interesting summaries of a variable's distribution

For instance, they inform us of the interquartile range and of the outliers (if any)
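The numbers behind a box plot can be inspected with boxplot.stats(). The sketch below assumes the built-in airquality data and the default rule of flagging values more than 1.5 times the box height away from the box.

```r
x <- airquality$Ozone
x <- x[!is.na(x)]                         # drop missing values

s <- boxplot.stats(x)
s$stats    # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out      # values drawn individually as outliers

# The box spans the two hinges, which are close to (but not always identical
# to) the 1st and 3rd quartiles returned by quantile()
hinges <- s$stats[c(2, 4)]
box_height <- diff(hinges)

# Default outlier rule: beyond 1.5 box heights from the box
fences <- hinges + c(-1.5, 1.5) * box_height
manual_out <- x[x < fences[1] | x > fences[2]]
```

The whiskers extend to the most extreme observations that are still inside the fences.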

(53)

An Example with base graphics

boxplot(algae$mnO2,ylab='Minimum values of O2')

[Figure: box plot of mnO2; y axis: Minimum values of O2 (2 to 12)]

(54)

An Example with ggplot2

ggplot(algae,aes(x=factor(0),y=mnO2)) + geom_boxplot() +
  ylab("Minimum values of O2") + xlab("") + scale_x_discrete(breaks=NULL)

[Figure: box plot of mnO2; y axis: Minimum values of O2 (ticks at 5 and 10)]

(55)

Hands on Data Visualization - the Algae data set

Using the Algae data set from package DMwR, answer the following questions:

1 Create a graph that you find adequate to show the distribution of the values of algae a6

2 Show the distribution of the values of size

3 Check visually if it is plausible to consider that oPO4 follows a normal distribution

(56)

Solution of Exercise 1

Create a graph that you find adequate to show the distribution of the values of algae a6

hist(algae$a6,main="Distribution of Algae a6",xlab="Frequency of a6",
     ylab="Nr. of Occurrences")

[Figure: histogram "Distribution of Algae a6"; x axis: Frequency of a6 (0 to 80); y axis: Nr. of Occurrences (0 to 150)]

(57)

Solution of Exercise 1 with ggplot2

Create a graph that you find adequate to show the distribution of the values of algae a6

ggplot(algae,aes(x=a6)) + geom_histogram(binwidth=10) +
  ggtitle("Distribution of Algae a6")

[Figure: histogram "Distribution of Algae a6"; x axis: a6 (0 to 75); y axis: count (0 to 150)]

(58)

Solution to Exercise 2

Show the distribution of the values of size

barplot(table(algae$size), main = "Distribution of River Size")

[Figure: bar plot "Distribution of River Size"; bars for large, medium and small; y axis from 0 to 80]

(59)

Solution to Exercise 2 with ggplot2

Show the distribution of the values of size

ggplot(algae,aes(x=size)) + geom_bar() +
  ggtitle("Distribution of River Size")

[Figure: bar plot "Distribution of River Size"; x axis: size (large, medium, small); y axis: count (0 to 80)]

(60)

Solutions to Exercise 3

Check visually if it is plausible to consider that oPO4 follows a normal distribution

library(car)
qqPlot(algae$oPO4,main = "QQ Plot of oPO4")

[Figure: QQ plot "QQ Plot of oPO4"; x axis: norm quantiles (−3 to 3); y axis: algae$oPO4 (0 to 500)]

(61)

Conditioned Graphs

Data sets frequently have nominal variables that can be used to create sub-groups of the data according to the values of these variables

e.g. the sub-group of male clients of a company

Some of the visual summaries described before can be obtained on each of these sub-groups

Conditioned plots present these sub-group graphs simultaneously, making it easier to find possible differences between the sub-groups

Base graphics offer only limited support for conditioning (e.g. through coplot()), whereas ggplot2 supports it through the concept of facets
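To make the idea of sub-group graphs concrete, conditioning can be emulated by hand in base graphics: split the data by a nominal variable and draw one panel per sub-group. This is only a sketch, assuming the built-in iris data and a temporary output file; facets in ggplot2 automate these steps.

```r
# Split a numeric variable into sub-groups defined by a nominal variable
groups <- split(iris$Sepal.Length, iris$Species)
sapply(groups, length)                    # size of each sub-group

out_file <- tempfile(fileext = ".pdf")    # hypothetical output path
pdf(out_file, width = 9, height = 3)
op <- par(mfrow = c(1, length(groups)))   # one row with one panel per sub-group
x_range <- range(iris$Sepal.Length)       # common x axis, so panels are comparable
for (g in names(groups)) {
  hist(groups[[g]], main = g, xlab = "Sepal.Length", xlim = x_range)
}
par(op)                                   # restore the previous layout settings
dev.off()
```

Using the same axis limits in every panel matters: without them, differences between the sub-groups are easy to misread.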

(65)

Conditioned Histograms

Goal: Contrast the distribution of data sub-groups

algae$speed <- factor(algae$speed,levels=c("low","medium","high"))

ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_wrap(~ speed) +
  ggtitle("Distribution of A3 by River Speed")

[Figure: histograms of a3 in three panels (low, medium, high); x axis: a3 (0 to 50); y axis: count (0 to 60)]

(66)

Conditioned Graphs Conditioned histograms

Conditioned Histograms (2)

algae$season<-factor(algae$season,levels=c("spring","summer","autumn","winter"))

ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_grid(speed~season) +
  ggtitle('Distribution of a3 as a function of River Speed and Season')

[Figure: "Distribution of a3 as a function of River Speed and Season"; grid of histograms with columns spring, summer, autumn, winter and rows low, medium, high; x axis: a3 (0 to 50); y axis: count (0 to 15)]

(67)

Conditioned Box Plots on base graphics

boxplot(a6 ~ season,algae,

main='Distribution of a6 as a function of Season',

xlab='Season',ylab='Distribution of a6')

[Figure: box plots "Distribution of a6 as a function of Season"; x axis: Season (spring, summer, autumn, winter); y axis: Distribution of a6 (0 to 80)]
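The formula y ~ group asks boxplot() to compute one box per level of the grouping variable. A short sketch, assuming the built-in chickwts data, that extracts the numbers without drawing and checks them against the medians of each sub-group:

```r
# One set of box plot statistics per feed type; plot = FALSE skips the drawing
b <- boxplot(weight ~ feed, data = chickwts, plot = FALSE)

b$names                                   # the sub-groups, one per box
b$stats                                   # one column of five summary values per box

# The middle row holds the medians of the sub-groups
group_medians <- tapply(chickwts$weight, chickwts$feed, median)
rbind(from_boxplot = b$stats[3, ], from_tapply = group_medians)
```

Calling the same function with plot = TRUE (the default) draws the conditioned box plots side by side on a common y axis.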

(68)

Conditioned Graphs Conditioned Box Plots

Conditioned Box Plots on ggplot2

ggplot(algae,aes(x=season,y=a6)) + geom_boxplot() +
  ggtitle('Distribution of a6 as a function of Season')

[Figure: box plots of a6 by season; x axis: season (spring, summer, autumn, winter); y axis: a6 (0 to 80)]

(69)

Hands on Data Visualization - Algae data set

Using the Algae data set from package DMwR, answer the following questions:

1 Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river

2 Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river

(70)

Solutions to Exercise 1

Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river.

ggplot(algae,aes(x=NO3)) + geom_histogram(binwidth=5) + facet_wrap(~size) +
  ggtitle("Distribution of NO3 for different river sizes")

[Figure: "Distribution of NO3 for different river sizes"; histograms in three panels (large, medium, small); x axis: NO3 (0 to 40); y axis: count (0 to 60)]

(71)

Solutions to Exercise 2

Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river

ggplot(algae,aes(x=speed,y=a1)) + geom_boxplot() +

ggtitle("Distribution of a1 for different River Speeds")

[Figure: box plots "Distribution of a1 for different River Speeds"; x axis: speed (high, low, medium); y axis: a1 (0 to 75)]

(72)

Other Plots Scatterplots

Scatterplots in base graphics

The Scatterplot is the natural graph for showing the relationship between two numerical variables

plot(algae$a1,algae$a6,

main='Relationship between the frequency of a1 and a6',

xlab='a1',ylab='a6')

[Figure: scatterplot "Relationship between the frequency of a1 and a6"; x axis: a1 (0 to 80); y axis: a6 (0 to 80)]

(73)

Scatterplots in ggplot2

ggplot(algae,aes(x=a1,y=a6)) + geom_point() +
  ggtitle('Relationship between the frequency of a1 and a6')

[Figure: scatterplot of a6 against a1; x axis: a1 (0 to 80); y axis: a6 (0 to 75)]

(74)

Conditioned Scatterplots using the ggplot2 package

ggplot(algae,aes(x=a4,y=Chla)) + geom_point() + facet_wrap(~season)

[Figure: scatterplots of Chla against a4 in four panels (autumn, spring, summer, winter); x axis: a4 (0 to 40); y axis: Chla (0 to 90)]

(75)

Scatterplot matrices through package GGally

These graphs try to present all pairwise scatterplots in a data set. They are impractical for data sets with many variables.

library(GGally)
ggpairs(algae,columns=12:16,
        diag=list(continuous="density",discrete="bar"),axisLabels="show")

[Figure: scatterplot matrix of a1 to a5, with density plots on the diagonal, pairwise scatterplots below it and pairwise correlations above it. Correlations by row: a1 with a2, a3, a4, a5: −0.294, −0.147, −0.038, −0.292; a2 with a3, a4, a5: 0.0321, −0.172, −0.16; a3 with a4, a5: 0.0123, −0.108; a4 with a5: −0.109]
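The growth in the number of panels is easy to quantify: p variables give p(p-1)/2 distinct pairs. A base R sketch, assuming the built-in iris data, that counts the pairs, computes the pairwise correlations shown in such a matrix and draws a plain scatterplot matrix with pairs():

```r
vars <- iris[, 1:4]                       # four numeric variables
p <- ncol(vars)

n_pairs <- choose(p, 2)                   # number of distinct pairwise scatterplots
n_pairs
choose(50, 2)                             # with 50 variables there are far too many to read

corr <- cor(vars)                         # pairwise correlations, as shown by ggpairs
round(corr, 3)

out_file <- tempfile(fileext = ".pdf")    # hypothetical output path
pdf(out_file)
pairs(vars, col = as.integer(iris$Species))   # base R scatterplot matrix
dev.off()
```

For data sets with many variables, a common workaround is to plot only a chosen subset of columns, as done above with columns=12:16.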

(76)

Other Plots Scatterplot matrices

Scatterplot matrices involving nominal variables

ggpairs(algae,columns=c("season","speed","a1","a2"),axisLabels="show")

[Figure: plot matrix of season, speed, a1 and a2; the correlation between a1 and a2 is −0.294]

(77)

Parallel Plots

Parallel plots are also interesting for visualizing a data set

ggparcoord(algae,columns=12:16,groupColumn="speed")

[Figure: parallel coordinates plot of a1 to a5; x axis: variable; y axis: value (0.0 to 10.0); lines coloured by speed (high, low, medium)]
