Summarizing data. Mauricio Romero


Remember...

Plotting the distribution

Inferring things about the distribution

Relationships Between Variables

Don’t trust statistics alone, visualize your data!

Simulations and relationship between variables

Remember...

Inference

Goal: Estimate unknown parameters

To approximate parameters, we use an estimator, which is a function of the data

Thus, an estimator is a random variable (it is a function of random variables)

Use the relationship between the estimator (usually its distribution) and the parameters to infer something about the parameters

Important notation

Based on this tweet: https://twitter.com/nickchk/status/1272993322395557888

Greek letters (e.g., $\mu$) are the truth (i.e., parameters of the true DGP)

Greek letters with hats (e.g., $\hat{\mu}$) are estimates (i.e., what we think the truth is)

Non-Greek letters (e.g., $X$) denote sample/data

Non-Greek letters with lines on top (e.g., $\bar{X}$) denote calculations from the data (e.g., $\bar{X} = \frac{1}{N}\sum_i X_i$)

We want to estimate the truth with some calculation from the data ($\hat{\mu} = \bar{X}$)

Data $\to$ Calculations $\to$ Estimate $\to$ (hopefully) Truth

Example: $X \to \bar{X} \to \hat{\mu} \to$ (hopefully) $\mu$
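To make the notation concrete, here is a minimal sketch in base R. It assumes a made-up data-generating process (a normal distribution with true mean `mu = 10`). The names `mu`, `X`, `X_bar`, `mu_hat` and `mu_hat2` are chosen here for illustration and do not come from the slides. Because the estimator is a function of random data, a second sample from the same process gives a different estimate.

```r
set.seed(123)                        # for reproducibility

mu <- 10                             # the truth (normally unknown; assumed here)
X  <- rnorm(100, mean = mu, sd = 2)  # data: one sample of size 100

X_bar  <- mean(X)                    # calculation from the data
mu_hat <- X_bar                      # our estimate of mu

# A second sample from the same DGP gives a different estimate,
# which is why the estimator is itself a random variable
X2      <- rnorm(100, mean = mu, sd = 2)
mu_hat2 <- mean(X2)

c(truth = mu, estimate_1 = mu_hat, estimate_2 = mu_hat2)
```

Both estimates should land near 10 without being equal to it or to each other.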

Plotting the distribution

A very good way to describe a variable is to show its distribution with a graph

Density plots (for continuous)

Histograms (for continuous)

Bar plots (for categorical)

Plus:

Adding plots together

Putting lines on plots


Plot types we won’t cover:

Lots and lots and lots of plot options

Mosaics, Sankey plots, pie graphs (cause I hate pie graphs)

Some aren’t common in Econ but could be!

Others are too advanced (like maps, but GIS knowledge is a really useful skill!)

Density plots and histograms

Density plots and histograms will show you the full distribution of the variable

Values along the x-axis, and how often those values show up on y

The density plots will present a smooth line by averaging nearby values
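As a rough illustration of how the smoothing works, the sketch below draws simulated data (not the PM10 data used later) and compares two bandwidths through the `bw` argument of `density()`. The sample and the two bandwidth values are assumptions chosen for illustration: a small bandwidth averages over a narrow range and gives a wiggly curve, a larger one gives a smoother curve.

```r
set.seed(1)
x <- rexp(500, rate = 1 / 50)       # simulated right-skewed data (illustrative)

d_narrow <- density(x, bw = 2)      # averages over a narrow range: wiggly
d_wide   <- density(x, bw = 20)     # averages over a wide range: smooth

plot(d_narrow, col = 2, lwd = 2, bty = "L",
     main = "Same data, two bandwidths", xlab = "x", ylab = "Density")
lines(d_wide, col = 4, lwd = 2, lty = 2)
legend("topright", c("bw = 2", "bw = 20"), col = c(2, 4), lty = c(1, 2), lwd = 2)
```

If `bw` is left out, `density()` picks a bandwidth from the data, which is what the plots in these slides rely on.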

Density plots

If a variable is continuous, we can’t count how often each value comes up

We smooth it by looking at the number of times it falls within a range

library(dplyr)  # provides filter()
df <- read.csv('http://www.aire.cdmx.gob.mx/opendata/red_manual/red_manual_particulas_susp.csv', skip=8, stringsAsFactors=F)
df = filter(df, cve_parameter=="PM10")
# The end of this call is cut off in the source; it is closed here after main=
plot(density(df$value), ylab="Density", xlab="PM10", lwd=2, bty="L", main="Distribution of PM10 in CDMX")

How would you make this figure better?

[Figure: density plot with the default title "density.default(x = df$value)"; y-axis: Density]

Titles and labels

Readability is super important in graphs

Add labels and titles! Titles with ’main’ and axis labels with ’xlab’ and ’ylab’

How would you make this figure better?

[Figure: density plot titled "Distribution of PM10 in CDMX"; y-axis: Density]

Histograms

When a variable is continuous, we can’t count the number of times each value comes up

We create “bins” and show how many observations fall into each bin

hist(df$value, ylab="Density", xlab="PM10", freq=F, bty="L",
     main="Distribution of PM10 in CDMX", breaks=30, col="#FAA43A", cex.lab=1.2, cex.axis=1.2)

How would you make this figure better?

[Figure: histogram with the default title "Histogram of df$value"; axis label: Density]

Histograms

These need labels too! Other important options:

Do proportions with ‘freq=FALSE‘

How would you make this figure better?

[Figure: histogram titled "Distribution of PM10 in CDMX"; y-axis: Density]

Barplot

If it’s a discrete variable (like a coin toss or the roll of a die), it is often best to just count the number (or fraction) of observations in each category

The ‘table()‘ command shows us the whole distribution of a categorical variable

Imagine we gather data from 500 rolls of a die

data <- sample(c(1:6), 500, replace=TRUE)
data <- data.frame(Result=data)
table(data)
prop.table(table(data))
barplot(prop.table(table(data)))

[Figure: bar plot of the results 1 to 6, titled "500 rolls of a dice"; y-axis: Proportion]

Overlaying Densities

Sometimes it’s nice to be able to compare two distributions

Because density plots are so simple, we can do this by overlaying them

The ‘lines()‘ function will add a line to your graph

Be sure to set colors so you can tell them apart

plot(density(filter(df, cve_station=="TLA")$value), ylab="Density", xlab="PM10", lwd=2,
     bty="L", main="Distribution of PM10", cex.lab=1.2, cex.axis=1.2, ylim=c(0,0.015))
lines(density(filter(df, cve_station=="PED")$value), col=2, lwd=2, lty=2)
# The end of the legend() call is cut off in the source; col, lty and lwd below are
# an assumption chosen to match the two lines drawn above
legend("topright", c("TLA","PED"), col=c(1,2), lty=c(1,2), lwd=2)

How would you make this figure better?

[Figure: overlaid density plots for stations TLA and PED, with the default title "density.default(x = filter(df, cve_station == "TLA")$value)"; y-axis: Density]

How would you make this figure better?

[Figure: overlaid density plots for stations TLA and PED, titled "Distribution of PM10"; y-axis: Density]

Practice

Install ‘Ecdat‘, load it, and get the ‘MCAS‘ data

Make a density plot for ‘totsc4‘, and add a vertical green dashed line at the median

Create a bar plot showing the proportion of observations that have nonzero values of ‘bilingua‘

Practice answers

install.packages('Ecdat')
library(Ecdat)
library(dplyr)    # mutate() and %>%
library(ggplot2)  # ggplot()
data(MCAS)
plot(density(MCAS$totsc4), xlab="4th Grade Test Score")
abline(v=median(MCAS$totsc4), col='green', lty='dashed')
# THE GGPLOT2 WAY
ggplot(MCAS, aes(x=totsc4)) + stat_density(geom='line') +
  geom_vline(aes(xintercept=median(totsc4)), color='green', linetype='dashed') +
  xlab("4th Grade Test Score")
MCAS <- MCAS %>% mutate(nonzerobi = MCAS$bilingua > 0)
ggplot(MCAS, aes(x=nonzerobi)) + geom_bar() +
  ggtitle("Nonzero Spending Per Bilingual Student")
# The rest of this call is cut off in the source
ggplot(MCAS, aes(y=totday)) + geom_boxplot()

[Figure: density plot; x-axis: 4th Grade Test Score]

How would you make this figure better?

[Figure: bar plot of counts by nonzerobi (FALSE, TRUE); y-axis: count]

How would you make this figure better?

[Figure: box plot of totday]

Inferring things about the distribution

Summary statistics

The mean, median, etc. describe the distribution in a condensed way

Means and medians both describe the center of the distribution

Percentiles describe where other parts not necessarily in the middle are

Standard deviations and variances describe how spread out the distribution is

Important to differentiate between the truth (e.g., E(X)) and the sample

The Mean: sample equivalent of the expected value

$E[X] := \sum_x f(x)\,x$

The sample equivalent is the mean (or average)

Calculated by multiplying each value by the proportion of times it comes up, and adding it all together

In R, ‘mean(x)‘

Essentially we estimate $f(x)$ with the proportion of times $x$ shows up in the data

Let our random variable be the roll of a die

$E(X) = \sum_{x=1}^{6} x \cdot \frac{1}{6} = 3.5$

From our 500 simulations: $\bar{X} = 3.382$
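The "multiply each value by the proportion of times it comes up" recipe can be checked directly. The sketch below uses freshly simulated rolls with an arbitrary seed, so its sample mean will not match the number quoted from the slides. The names `rolls`, `props`, `values`, `weighted_sum` and `expected` are chosen here for illustration.

```r
set.seed(42)
rolls <- sample(1:6, 500, replace = TRUE)   # 500 simulated die rolls

props  <- prop.table(table(rolls))          # proportion of times each value shows up
values <- as.numeric(names(props))          # the values themselves

weighted_sum <- sum(values * as.numeric(props))  # sum over x of (proportion of x) * x
expected     <- sum((1:6) * (1 / 6))             # E(X) for a fair die

c(weighted_sum = weighted_sum, mean = mean(rolls), expected = expected)
```

The first two numbers agree exactly, since `mean()` is this weighted sum. The third is what they estimate.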


The Mean

Nice things about the mean:

Easy to understand

The mean of ‘x-mean(x)‘ is 0 (the same holds in the population: $E(E(X) - X) = 0$)

Good statistical properties

Makes sense with large or small samples, with discrete or continuous variables

Not so nice:

Sensitive to outliers (also true of $E(X)$)

mean(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))    # 1
mean(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,100))  # 5.95

The mean doesn’t describe EVERYTHING about the distribution!

The Median

Order observations from smallest to largest and pick the one in the middle

If there’s an even number of observations, take the mean of the two middle

Population equivalent is $m$ such that $P(X \le m) = P(X \ge m) = \frac{1}{2}$

x <- c(3,1,4,2,2)
median(x)

The Median

Nice things about the median:

Super easy to calculate (you can often do it by hand)

Represents the “typical” observation

Not sensitive to outliers

median(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))    # 1
median(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,100))  # still 1

Generally not affected by order-preserving transformations of the data (such as taking logs)

Not so nice:

Insensitive to outliers means it can ignore real changes in the “tails”

Can ignore magnitudes generally
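One way to read the claim that the median is generally unaffected by transformations: for an increasing transformation such as `log()`, the median of the transformed data is the transformed median (exactly so when the number of observations is odd). The mean has no such property. The vector below is made up for illustration.

```r
x <- c(1, 2, 4, 8, 1000)           # five made-up observations, one large outlier

median(log(x)) == log(median(x))   # TRUE: the middle observation stays in the middle
mean(log(x)) == log(mean(x))       # FALSE: the mean does not commute with log()
```

With an even number of observations the two sides of the first comparison can differ slightly, because the median then averages the two middle values.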


Adding Lines

It’s pretty common for us to want to add some explanatory lines to our graphs

For example, adding mean/median/etc. to a density

Or showing where 0 is

Do this with ‘abline()‘

After creating the plot, THEN

Add the ‘abline(intercept,slope)‘

‘abline(h=value)‘ for a horizontal line at that y value

‘abline(v=value)‘ for a vertical line at that x value

Mean and Median Together

plot(density(df$value), ylab="Density", xlab="PM10", lwd=2, bty="L", main="Distribution of PM10 in CDMX")
abline(v=mean(df$value), lwd=3, col=2, lty=2)
abline(v=median(df$value), lwd=3, col=4, lty=4)

How would you make this figure better?

[Figure: density plot with the default title "density.default(x = df$value)" and vertical lines at the mean and the median; y-axis: Density]

How would you make this figure better?

[Figure: density plot titled "Distribution of PM10 in CDMX" with vertical lines at the mean and the median; y-axis: Density]

Percentiles

A percentile is just like a median

Except that you don’t necessarily pick the MIDDLE

Sort observations and pick the (percentile)th observation

Population equivalent: the kth percentile is $m$ such that $P(X \le m) = k/100$

Use the ‘quantile()‘ function, and list the percentiles you want

Percentiles can fully describe the distribution if you use enough!

quantile(c(0,1,2,3,4,5), c(.4,.5,1))

Percentiles

Exactly 20% of the observations are between each set of lines

plot(density(df$value), ylab="Density", xlab="PM10", lwd=2, bty="L", main="Distribution of PM10 in CDMX")
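The code that draws the vertical lines is not shown above. A minimal sketch of one way to do it, using simulated data in place of `df$value` (the object `x`, the gamma distribution and the name `cuts` are assumptions for illustration): compute the 20th, 40th, 60th and 80th percentiles with `quantile()` and pass them to `abline(v=...)`.

```r
set.seed(7)
x <- rgamma(1000, shape = 2, scale = 25)   # simulated stand-in for a skewed variable

cuts <- quantile(x, c(.2, .4, .6, .8))     # boundaries between the five 20% groups

plot(density(x), ylab = "Density", xlab = "x", lwd = 2, bty = "L",
     main = "Density with 20% boundaries")
abline(v = cuts, lty = 2, col = 4)         # one dashed vertical line per boundary

# Share of observations between the first two lines (about 20%)
mean(x > cuts[1] & x <= cuts[2])
```

`abline()` accepts a vector for `v`, so all four lines are drawn in one call.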

How would you make this figure better?

[Figure: density plot titled "Distribution of PM10 in CDMX" with vertical lines at percentiles; y-axis: Density]

Min and Max

Also useful are the minimum and maximum (a.k.a. the 0% and 100% percentiles)

Show you the range of values that the variable takes in the sample
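A quick check that the minimum and maximum are the 0% and 100% percentiles, using a small made-up vector:

```r
x <- c(3, 1, 4, 2, 2)

min(x) == quantile(x, 0)      # the 0% percentile is the minimum
max(x) == quantile(x, 1)      # the 100% percentile is the maximum
range(x)                      # minimum and maximum in one call
```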

Standard deviation and variance

Standard ways of understanding how much the data varies around the mean

Variance = Standard deviation squared

Standard deviation and variance

Start with data and subtract out the mean

The result is the residuals (left-over part, unexplained part)

Square the residuals

Average them (variance) [note: then multiply by N/(N-1)]

Square root of the variance is the standard deviation

Why this process rather than some other measure around the mean (i.e. why square it)? Good statistical reasons I promise
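Written as formulas, the steps above are as follows, where $s^2$ and $s$ are notation introduced here for the sample variance and the sample standard deviation:

```latex
s^2 = \frac{N}{N-1}\cdot\frac{1}{N}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2
    = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2,
\qquad
s = \sqrt{s^2}
```

The factor $N/(N-1)$ is the correction mentioned in the note on the averaging step. It is what R's ‘var()‘ and ‘sd()‘ apply.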

Standard deviation and variance

data <- c(1,1,1,1,2)
data <- data - mean(data)
data
# Variance, sd
c((5/4)*mean(data^2), var(c(1,1,1,1,2)), sqrt((5/4)*mean(data^2)), sd(c(1,1,1,1,2)))
data2 <- c(100,0,-30,50,80)
data2 <- data2 - mean(data2)
# Variance, sd
c((5/4)*mean(data2^2), var(c(100,0,-30,50,80)), sqrt((5/4)*mean(data2^2)), sd(c(100,0,-30,50,80)))

Graphically, SD and variance tell you how “wide” the distribution is

dat <- data.frame(wide=rnorm(100,sd=1), narrow=rnorm(100,sd=1/2))
plot(density(dat$wide), col=2, lwd=2, type="l", bty="L", ylim=c(0,1.5))
lines(density(dat$narrow), col=4, lwd=2, type="l", bty="L", lty=3)

How would you make this figure better?

[Figure: overlaid density plots labeled SD=1 and SD=0.5; y-axis: Density]

Summary statistics table

Something we will often want to do is display a bunch of summary statistics at once for the variables we have

This makes it easy to understand a variable’s distribution at a glance

We’ll be using the ‘stargazer‘ command for this

library(stargazer)
data(LifeCycleSavings)
# The stargazer() call itself is cut off in the source; a basic text table would be
stargazer(LifeCycleSavings, type='text')

Statistic                   N    Mean      St. Dev.   Min     Pctl(25)   Pctl(75)   Max
Savings                     50   9.7       4.5        0.6     7.0        12.6       21.1
% of population under 15    50   35.1      9.2        21.4    26.2       44.1       47.6
% of population over 75     50   2.3       1.3        0.6     1.1        3.3        4.7
Disposable income (DPI)     50   1,106.8   990.9      88.9    288.2      1,795.6    4,001.9
% growth rate of DPI        50   3.8       2.9        0.2     2.0        4.5        16.7

Stargazer

See help(stargazer) to see what other summary stats you can include

‘type=’text’‘ tells it to give us a basic text table.

Another one is ‘type=’html’‘, especially if we want to output our table to a file

You can open up the HTML table and copy/paste it into Excel or Word

Another one is ‘type=’latex’‘, for when exporting to LaTeX

Putting it all together

Data about public employees’ salaries is widely available in Mexico

https://nominatransparente.rhnet.gob.mx/
https://tudinero.cdmx.gob.mx/buscador_personas
https://datos.cdmx.gob.mx/explore/dataset/remuneraciones-al-personal-de-la-ciudad-de-mexico
https://www.transparencia.cdmx.gob.mx/
https://sep.gob.mx/es/sep1/

Article 73 of the Ley General de Contabilidad Gubernamental (General Law of Government Accounting)

To speed things up, I lightly cleaned the data for the fourth quarter of 2019 for the Secretaría de Gobierno (CDMX)

Practice answers

library(stargazer)
SalariosCDMX = read.csv("http://mauricio-romero.com/data/class/SalariosCDMX2019_Gobernacion.csv")
SalariosCDMX$Mujer = (SalariosCDMX$Sexo=="Femenino")

Statistic      N       Mean       St. Dev.   Min         Pctl(25)   Pctl(75)   Max
Clave puesto   9,773   231.4      231.6      20          159        190        1,184
Monto bruto    9,773   12,170.0   6,390.9    0           8,960      13,119     109,981
Monto neto     9,773   9,700.9    4,837.2    −22,323.2   7,004.3    10,415.8   77,909.5
Mujer (=1)     9,773   0.5        0.5        0           0          1          1

[Figure: overlaid density plots for Hombre and Mujer; y-axis: Density]

Summary statistics table - Males

Statistic      N       Mean       St. Dev.   Min   Pctl(25)   Pctl(75)   Max
Clave puesto   5,238   221.5      215.9      24    169        190        1,184
Monto bruto    5,238   12,678.5   6,966.0    0     9,640      13,119     104,740
Monto neto     5,238   10,077.8   5,239.1    0.0   7,305.0    10,523.5   74,971.0
Mujer (=1)     5,238   0.0        0.0        0     0          0          0

Summary statistics table - Females

Statistic      N       Mean       St. Dev.   Min         Pctl(25)   Pctl(75)   Max
Clave puesto   4,535   242.7      248.1      20          150        190        1,096
Monto bruto    4,535   11,582.8   5,597.8    2,816       8,790      13,119     109,981
Monto neto     4,535   9,265.6    4,286.5    −22,323.2   6,745.0    10,292.7   77,909.5
Mujer (=1)     4,535   1.0        0.0        1           1          1          1

How would you make this figure better?

[Figure: overlaid density plots for "No confianza" and "Confianza"; y-axis: Density]

Practice

Use ‘data(LifeCycleSavings)‘ to get the Life Cycle Savings data, and use ‘help()‘ and ‘str()‘ to look at it

Use ‘stargazer()‘ to get a text table of summary statistics for all the variables EXCEPT ddpi

Now make an HTML table for all the variables. Open the file and look at it in a browser.

For each of the statistics that the ‘stargazer()‘ table gives you, plus the median, calculate that statistic on your own for the ‘pop15‘ variable using the appropriate R function

Calculate the max, min, and median in two ways - using their own respective functions, and as percentiles.

Practice answers

library(stargazer)
library(dplyr)  # select()
data(LifeCycleSavings)
help(LifeCycleSavings)
str(LifeCycleSavings)
stargazer(select(LifeCycleSavings, -ddpi), type='text')
stargazer(select(LifeCycleSavings, -ddpi), type='html', out='table.html')
LS <- LifeCycleSavings
# The end of this call is cut off in the source; it is closed here after min()
c(length(LS$pop15), mean(LS$pop15), sd(LS$pop15), min(LS$pop15))

Relationships Between Variables

We aren’t just interested in looking at variables by themselves!

We want to know how variables can be related to each other

When ‘X‘ is high, would we expect ‘Y‘ to also be high, or be low?

How are variables correlated?

How does one variable explain another?

What Does it Mean to be Related?

Two variables are related if knowing something about one tells you something about the other

For example, consider the answer to two questions:

Do you have a uterus?

Are you pregnant?

What is the probability that a random person is pregnant?

What is the probability that a random person who doesn’t have a uterus is pregnant?

What Does it Mean to be Related?

Variables are dependent if telling you the value of one gives you information about the value of the other

Variables are correlated if knowing whether one of them is unusually high gives you information about whether the other is unusually high (positive correlation) or unusually low (negative correlation)

Explaining one variable ‘Y‘ with another ‘X‘ means predicting ‘Y‘ by looking at

An Example: Dependence

library(haven)  # read_stata()
wage1 <- read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta")
table(wage1$numdep, wage1$smsa, dnn=c('Num. Dependents','Lives in Metropolitan Area'))

An Example: Dependence

What are we looking for here?

For dependence, simply see if the distribution of one variable changes for the different values of the other.

Does the distribution of Number of Dependents differ based on your SMSA status?

# The final argument is cut off in the source; margin=2 (proportions within each
# column, i.e. within each SMSA status) is assumed because it matches the question
prop.table(table(wage1$numdep, wage1$smsa, dnn=c('Num. Dependents','Lives in Metropolitan Area')), margin=2)

An Example: Dependence

Does the distribution of SMSA differ based on your Number of Dependents Status?

prop.table(table(wage1$numdep, wage1$smsa, dnn=c('Number of Dependents','Lives in Metropolitan Area')), margin=1)

Looks like it!

An Example: Correlation

We are interested in whether two variables tend to move together (positive correlation) or move apart (negative correlation)

One basic way to do this is to see whether values tend to be high together

One way to check in dplyr is to use ‘group_by()‘ to organize the data into groups

Then ‘summarize()‘ the data within those groups

wage1 %>%
  group_by(smsa) %>%
  summarize(numdep=mean(numdep))

An Example: Correlation

There’s also a summary statistic we can calculate called correlation; this is typically what we mean by “correlation”

Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)

Basically, “a one-standard-deviation increase in ‘X‘ is associated with a ‘cor(X,Y)‘ standard-deviation change in ‘Y‘”

cor(wage1$numdep, wage1$smsa)
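The standard-deviation reading of the correlation can be checked numerically. The sketch below uses simulated data rather than `wage1`. The variables `x` and `y`, the coefficient 0.5 and the names `zx`, `zy`, `r` and `slope` are assumptions made here for illustration. After rescaling both variables to mean 0 and standard deviation 1, the slope of a straight-line fit of one on the other equals the correlation.

```r
set.seed(2024)
x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000)          # simulated data with a positive relationship

r <- cor(x, y)

# Standardize both variables, then fit a line:
# the slope says how many SDs y changes for a one-SD increase in x
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
slope <- coef(lm(zy ~ zx))[["zx"]]

c(correlation = r, standardized_slope = slope)
```

The two numbers agree up to floating-point error.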


An Example: Explanation

Let’s go back to those different means:

wage1 %>%
  group_by(smsa) %>%
  summarize(numdep=mean(numdep))

Explanation would be saying that:

If you’re in an SMSA, I predict that you have this many dependents

mean(filter(wage1, smsa==1)$numdep)

If you’re not in an SMSA, I predict that you have this many dependents

mean(filter(wage1, smsa==0)$numdep)

If you are in an SMSA and have 2 dependents, then only some of those dependents are explained by SMSA and some of them are unexplained by SMSA

Coding Recap

‘table(df$var1,df$var2)‘ to look at two variables together

‘prop.table(table(df$var1,df$var2))‘ for the proportion in each cell

‘prop.table(table(df$var1,df$var2),margin=2)‘ to get proportions within each column

‘prop.table(table(df$var1,df$var2),margin=1)‘ to get proportions within each row

‘df %>% group_by(var1) %>% summarize(mean(var2))‘ to get the mean of var2 for each value of var1

Graphing Relationships

Relationships between variables can be easier to see graphically

And graphs are extremely important to understanding relationships and the “shape” of those relationships


Wage and Education

Let’s use ‘plot(xvar,yvar)‘

plot(wage1$educ, wage1$wage, xlab="Years of Education", ylab="Wage")
# THE GGPLOT2 WAY
ggplot(wage1, aes(x=educ, y=wage)) + geom_point() + xlab('Years of Education') +
  ylab('Wage')

As we look at different values of ‘educ‘, what changes about the values of ‘wage‘ that we see?

[Figure: scatter plot; x-axis: Years of Education; y-axis: Wage]

Try to picture the shape of the data

Should this be a straight line? A curved line? Positively sloped? Negatively?

plot(wage1$educ, wage1$wage, xlab="Years of Education", ylab="Wage")
abline(-.9, .5, col='red')
plot(function(x) 5.4-.6*x+.05*(x^2), 0, 18, add=TRUE, col='blue')
# THE GGPLOT2 WAY (the rest of this call is cut off in the source)
ggplot(wage1, aes(x=educ, y=wage)) + geom_point() + xlab('Years of Education') +
  ylab('Wage') +
  geom_abline(aes(intercept=-.9, slope=.5), col='red')

‘plot(xvar,yvar)‘ is extremely powerful, and will show you relationships at a glance

The previous graph showed a clear positive relationship, and indeed ‘cor(wage1$wage,wage1$educ)‘ is 0.406

Further, we don’t only see a positive relationship, but we have some sense of how


Let’s compare clothing sales volume vs. profit margin for men’s clothing firms

library(Ecdat)
data(Clothing)
plot(Clothing$sales, Clothing$margin, xlab="Gross Sales", ylab="Margin")
# THE GGPLOT2 WAY
ggplot(Clothing, aes(x=sales, y=margin)) + geom_point() + xlab('Gross Sales') + ylab('Margin')

[Figure: scatter plot; x-axis: Gross Sales; y-axis: Margin]

Comparing Singapore diamond prices vs. carats

library(Ecdat)
data(Diamond)
plot(Diamond$carat, Diamond$price, xlab="Number of Carats", ylab="Price")
# THE GGPLOT2 WAY (the end of this call is cut off in the source)
ggplot(Diamond, aes(x=carat, y=price)) + geom_point() + xlab('Number of Carats')

[Figure: scatter plot; x-axis: Number of Carats; y-axis: Price]

Another way to graph a relationship, especially when one of the variables only takes a few values, is to plot the ‘density()‘ function for different values

plot(density(filter(wage1, married==0)$wage), col='blue', xlab="Wage", main="Wage Distribution; Blue = Unmarried, Red = Married", bty="L", las=1, lwd=2)
lines(density(filter(wage1, married==1)$wage), col='red', lwd=2)
# THE GGPLOT2 WAY
ggplot(filter(wage1, married==0), aes(x=wage)) + stat_density(geom='line', col='blue') +
  xlab('Wage') + ylab('Density') +
  ggtitle("Wage Distribution; Blue = Unmarried, Red = Married") +
  stat_density(data=filter(wage1, married==1), geom='line', col='red')

[Figure: overlaid density plots titled "Wage Distribution; Blue = Unmarried, Red = Married"; x-axis: Wage]

Different distributions: married people earn more!

We can back that up other ways

wage1 %>% group_by(married) %>% summarize(wage = mean(wage))

Keep in mind!

Just because two variables are related doesn’t mean we know why

If ‘cor(x,y)‘ is positive, it could be that ‘x‘ causes ‘y‘... or that ‘y‘ causes ‘x‘, or that something else causes both!

Or many other configurations...
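A small simulation makes the "something else causes both" case concrete. In the sketch below, which is entirely made up for illustration, a third variable `z` drives both `x` and `y`. Neither `x` nor `y` appears in the other's formula, yet the two are clearly correlated.

```r
set.seed(99)
n <- 2000
z <- rnorm(n)               # a third variable that drives both x and y
x <- z + rnorm(n)           # x depends on z, not on y
y <- z + rnorm(n)           # y depends on z, not on x

cor(x, y)                   # clearly positive, although neither causes the other
```

Under this setup the correlation is about 0.5 on average, so a positive ‘cor(x,y)‘ on its own cannot tell these stories apart.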


Practice

Install the ‘SMCRM‘ package, load it, get the ‘customerAcquisition‘ data. Rename it ca

Among ‘acquisition==1‘ observations, see if the size of first purchase is related to duration as a customer, with ‘cor‘ and (labeled) ‘plot‘

See if ‘industry‘ and ‘acquisition‘ are dependent on each other using ‘prop.table‘ with the ‘margin‘ option

See if average revenues differ between industries using ‘aggregate‘, then check the ‘cor‘

Plot the density of revenues for ‘industry==0‘ in blue and, on the same graph, revenues for ‘industry==1‘ in red

Practice Answers

install.packages('SMCRM')
library(SMCRM)
library(dplyr)  # filter()
data(customerAcquisition)
ca <- customerAcquisition
cor(filter(ca, acquisition==1)$first_purchase, filter(ca, acquisition==1)$duration)
plot(filter(ca, acquisition==1)$first_purchase, filter(ca, acquisition==1)$duration, xlab="Value of First Purchase", ylab="Customer Duration")
prop.table(table(ca$industry, ca$acquisition), margin=1)
prop.table(table(ca$industry, ca$acquisition), margin=2)
aggregate(revenue ~ industry, data=ca, FUN=mean)
cor(ca$revenue, ca$industry)
plot(density(filter(ca, industry==0)$revenue), col='blue', xlab="Revenues", main="Revenue Distribution")
# The source is cut off here; the line below is a reconstruction of the red
# industry==1 density that the exercise asks for
lines(density(filter(ca, industry==1)$revenue), col='red')

Don’t trust statistics alone, visualize your data!

Anscombe’s Quartet

Developed by F.J. Anscombe in 1973

Anscombe’s Quartet is a set of four datasets, all with the same summary statistics (mean, standard deviation, and correlation)

Anscombe’s Quartet

library(datasets)
library(stargazer)
datasets::anscombe
stargazer(anscombe, type='text', summary.stat=c("n","mean","sd"))
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)

Anscombe’s Quartet

par(mfrow=c(2,2))
plot(anscombe$x1, anscombe$y1, pch=19, col="#FAA43A", bty="L", xlim=c(0,20), ylim=c(0,13),
     xlab="x1", ylab="y1", main="Dataset #1", cex.lab=1.2, cex.axis=1.2, las=1)
# The second line of each remaining call is cut off in the source; the labels below
# are an assumption that follows the first call and the panel titles in the figure
plot(anscombe$x2, anscombe$y2, pch=19, col="#5DA5DA", bty="L", xlim=c(0,20), ylim=c(0,13),
     xlab="x2", ylab="y2", main="Dataset #2", cex.lab=1.2, cex.axis=1.2, las=1)
plot(anscombe$x3, anscombe$y3, pch=19, col="#4D4D4D", bty="L", xlim=c(0,20), ylim=c(0,13),
     xlab="x3", ylab="y3", main="Dataset #3", cex.lab=1.2, cex.axis=1.2, las=1)
plot(anscombe$x4, anscombe$y4, pch=19, col="#F15854", bty="L", xlim=c(0,20), ylim=c(0,13),
     xlab="x4", ylab="y4", main="Dataset #4", cex.lab=1.2, cex.axis=1.2, las=1)

[Figure: 2x2 grid of scatter plots titled "Dataset #1" to "Dataset #4"; x-axes: x1 to x4; y-axes: y1 to y4]

Datasaurus Dozen datasets

It’s like Anscombe’s Quartet on steroids

The original Datasaurus was created by Alberto Cairo (see http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html)

The other twelve were created by Justin Matejka and George Fitzmaurice (see https://www.autodeskresearch.com/publications/samestats)

Datasaurus Dozen datasets

datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x = mean(x),
    mean_y = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y = cor(x, y)
  )

Datasaurus Dozen datasets

install.packages("datasauRus")
library(datasauRus)
par(mfrow=c(3,4))
for(data in unique(datasaurus_dozen$dataset)){
  print(data)
  df = datasaurus_dozen[which(datasaurus_dozen$dataset==data),]
  plot(df$x, df$y, pch=19, col="#FAA43A", bty="L",
       xlim=range(datasaurus_dozen$x), ylim=range(datasaurus_dozen$y),
       xlab="x", ylab="y", main=data, cex=0.5, cex.lab=1.2, cex.axis=1.2, las=1)
}

[Figure: grid of scatter plots of y against x, one panel per dataset: dino, away, h_lines, v_lines, x_shape, star, high_lines, dots, circle, bullseye, slant_up, slant_down, wide_lines]

Summarizing data

Remember...

(117)

Summarizing data

Remember...

Don’t trust statistics alone, visualize your data!

(118)

Let’s Simulate!

Let’s expand our use of simulation to simulate the relationship between two variables

We can do this by using one variable to build another

I draw 400 random genders, and add to them 400 random normals

# 400 people equally likely to be M or F
simdata <- data.frame(gender = sample(c("Male","Female"), 400, replace=T))
# Height is normally distributed with mean 180cm and sd 10cm
# and men are 5cm taller
simdata <- simdata %>% mutate(heightft = rnorm(400,180,10) + 5*(gender=="Male"))
simdata %>% group_by(gender) %>% summarize(height = mean(heightft))

Two Variable Simulation

We get in our simulation that men are on average ‘mean(filter(simdata,gender=="Male")$heightft)-mean(filter(simdata,gender=="Female")$heightft)‘ taller than women.

The true data-generating process is that heightft is a normal variable with mean 180, plus 5cm if you’re male

# The rest of this call is cut off in the source
ggplot(filter(simdata, gender=="Male"), aes(x=heightft, col='Male')) +
  stat_density(geom='line') +
  stat_density(data=filter(simdata, gender=="Female"), aes(x=heightft, col='Female'), geom='line')

[Figure: overlaid density plots of heightft for Female and Male; x-axis: heightft; y-axis: density]

Two Variable Simulation

So does checking for the difference of means give us back the difference in height from the data-generating process? Let’s loop!

heightdiff <- c()
for (i in 1:2500) {
  simdata <- data.frame(gender = sample(c("Male","Female"), 400, replace=T))
  # Height is normally distributed with mean 180cm and sd 10cm
  # and men are 5cm taller
  simdata <- simdata %>% mutate(heightft = rnorm(400,180,10) + 5*(gender=="Male"))
  simdata %>% group_by(gender) %>% summarize(height = mean(heightft))
  heightdiff[i] <- mean(filter(simdata, gender=="Male")$heightft) -
    mean(filter(simdata, gender=="Female")$heightft)
}
stargazer(as.data.frame(heightdiff), type='text')

Another Example

So far, no problem, right? Everything works out (I mean, of course it does)

Of course the share of heads in a sample will on average be 50%

So what can we actually learn here?

(124)

Another Example

# Is your company in tech? Let's say 30% of firms are
df <- data.frame(tech = sample(c(0,1),500,replace=TRUE,prob=c(.7,.3)))
# Tech firms on average spend $3mil more defending IP lawsuits
df <- df %>% mutate(IP.spend = 3*tech+runif(500,min=0,max=4))
# Now let's check for how profit and IP.spend are correlated!
df <- df %>% mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500,mean=2))
cor(df$log.profit,df$IP.spend)

(125)

(126)

Another Example

Maybe just a fluke? Let’s loop.

IPcorr <- c()
for (i in 1:1000) {
  # Is your company in tech? Let's say 30% of firms are
  df <- data.frame(tech = sample(c(0,1),500,replace=TRUE,prob=c(.7,.3)))
  # Tech firms on average spend $3mil more defending IP lawsuits
  df <- df %>% mutate(IP.spend = 3*tech+runif(500,min=0,max=4))
  # Now let's check for how profit and IP.spend are correlated!
  df <- df %>% mutate(log.profit = 2*tech - .3*IP.spend + rnorm(500,mean=2))
  IPcorr[i] <- cor(df$log.profit,df$IP.spend)
}

(127)

(128)

What’s Happening? A graph might help

ggplot(df,aes(x=IP.spend,y=log.profit,color=as.factor("All Companies")))+
  geom_point()

(129)

[Figure: scatter plot of log.profit (y-axis) against IP.spend (x-axis); legend "Firm Type": All Companies]

(130)

What’s Happening? A graph might help

ggplot(mutate(df,tech=factor(tech,labels=c("Not Tech","Tech"))),
       aes(x=IP.spend,y=log.profit,color=tech))+
  geom_point()+
  guides(color=guide_legend(title="Firm Type"))

(131)

[Figure: scatter plot of log.profit (y-axis) against IP.spend (x-axis); legend "Firm Type": Not Tech, Tech]

(132)

Simpson’s Paradox

Here we have a true negative relationship - we know it’s in the true model!

But when we plot it out, it’s positive

WITHIN tech companies and non-tech companies, IP spend is negatively correlated with profit

But because tech companies have higher IP spend and higher profit, IP spend and profit are positively correlated across all firms!
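The sign flip can be checked directly by computing the correlation once for all firms and once within each firm type. Here is a minimal base R sketch, assuming the same data-generating process as the slides (30% tech, 500 firms). The seed and the names `cor_all`, `cor_nontech` and `cor_tech` are made up for illustration.

```r
set.seed(1)  # arbitrary seed, assumed only so the sketch is reproducible

n <- 500
tech <- sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3))
IP.spend <- 3 * tech + runif(n, min = 0, max = 4)
log.profit <- 2 * tech - .3 * IP.spend + rnorm(n, mean = 2)

# Pooled correlation, ignoring firm type
cor_all <- cor(log.profit, IP.spend)

# Correlation computed separately within each firm type
cor_nontech <- cor(log.profit[tech == 0], IP.spend[tech == 0])
cor_tech <- cor(log.profit[tech == 1], IP.spend[tech == 1])

c(all = cor_all, nontech = cor_nontech, tech = cor_tech)
```

The two within-group correlations should come out negative, while the pooled one sits well above them (and is typically positive).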

(133)

Simpson’s Paradox

So our method (looking at the correlation between them) doesn’t work!

The simulation has shown us that we’d get it wrong if we do this

Our analysis method doesn’t correct for what tech is doing here

We need to have some way of incorporating what we know about tech

That means taking what we might know about the true model (that firm type has something to do with this) and adjusting so we get the right answer
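One possible way to "adjust" for firm type is a linear regression that includes tech alongside IP spend. This is a swapped-in technique for illustration, not necessarily the adjustment the lecture has in mind. The sketch below assumes the same data-generating process as the slides. The seed and the object names `naive` and `adjusted` are made up.

```r
set.seed(1)  # arbitrary seed, assumed only so the sketch is reproducible

n <- 500
df <- data.frame(tech = sample(c(0, 1), n, replace = TRUE, prob = c(.7, .3)))
df$IP.spend <- 3 * df$tech + runif(n, min = 0, max = 4)
df$log.profit <- 2 * df$tech - .3 * df$IP.spend + rnorm(n, mean = 2)

# Naive fit: ignores firm type, like the raw correlation does
naive <- lm(log.profit ~ IP.spend, data = df)

# Adjusted fit: lets tech firms have their own intercept
adjusted <- lm(log.profit ~ IP.spend + tech, data = df)

coef(naive)["IP.spend"]     # slope when tech is ignored
coef(adjusted)["IP.spend"]  # should land near the true -0.3
```

Because we wrote the true model ourselves, we can see whether the adjusted slope gets back to -0.3, which is exactly what the simulation is for.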

(134)

Practice

Use the ‘prob‘ option in ‘sample‘ to generate 300 coin flips weighted to be 55% heads. Calculate the proportion of heads.

Loop it 2000 times. How often will you correctly claim that the coin is more than 50% heads?

Create ‘dat‘: 1500 obs of ‘married‘ (0 or 1), ‘educ‘ (unif 0 to 16, plus ‘2*married‘), ‘log.wage‘ (normal mean 5 plus ‘.1*educ‘ plus ‘2*married‘)

Loop it 1000 times and calculate ‘cor‘ between ‘educ‘ and ‘log.wage‘ each time.

It’s positive - does that mean it’s right? If not, how do you know?

Use ‘plot‘ and then ‘points‘ to plot the ‘married==0‘ and ‘married==1‘ data in different colors
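As a starting point for the first two practice items, here is a minimal base R sketch of the weighted coin flips. It assumes one reading of "correctly claim": you claim the coin favors heads whenever the sample proportion is above 50%. The helper name `flip_prop` and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed, assumed only so the sketch is reproducible

# One experiment: n flips of a coin weighted toward heads;
# returns the proportion of heads in that sample
flip_prop <- function(n = 300, p_heads = 0.55) {
  flips <- sample(c("H", "T"), n, replace = TRUE,
                  prob = c(p_heads, 1 - p_heads))
  mean(flips == "H")
}

flip_prop()  # heads proportion in a single sample of 300 flips

# Repeat the experiment 2000 times
heads_prop <- replicate(2000, flip_prop())

# Share of samples in which the heads proportion exceeds 50%
mean(heads_prop > 0.5)
```

The same pattern (wrap one sample in a function, then `replicate`) carries over to the ‘dat‘ exercise.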