• No results found

Summarizing data. Mauricio Romero

N/A
N/A
Protected

Academic year: 2021

Share "Summarizing data. Mauricio Romero"

Copied!
134
0
0

Loading.... (view fulltext now)

Full text

(1)

Summarizing data

(2)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(3)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(4)

Inference

ˆ Goal: Estimate unknown parameters

ˆ To approximate parameters, we use an estimator, which is a function of the data

ˆ Thus, estimator is a random variable (it is a function of a random variable)

ˆ Use relationship between estimator (its distribution usually) and parameters to infer something about the parameters

(5)

Important notation

Based on this tweet: https://twitter.com/nickchk/status/1272993322395557888 ˆ Greek letters (e.g.,µ) are the truth (i.e., parameters of the true DGP)

ˆ Greek letters with hats (e.g., µb) are estimates (i.e., what we think the truth is)

ˆ Non-Greek letters (e.g., X) denote sample/data

ˆ Non-Greek letters with lines on top (e.g., X) denote calculations from the data (e.g., X = N1 P

iXi).

ˆ We want to estimate the truth, with some calculation from the data (µb=X)

ˆ Data −→ Calculations −→ Estimate −→ |{z} Hopefully Truth ˆ Example: X−→ X −→µb −→ |{z} Hopefully µ

(6)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(7)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(8)

Very good way to describe a variable is by showing it’s distribution with a graph

ˆ Density plots (for continuous)

ˆ Histograms (for continuous)

ˆ Bar plots (for categorical)

ˆ Plus:

ˆ Adding plots together

ˆ Putting lines on plots

(9)

Plot types we won’t cover:

ˆ Lots and lots and lots of plot options

ˆ Mosaics, Sankey plots, pie graphs (cause I hate pie graphs)

ˆ Some aren’t common in Econ but could be!

ˆ Others are too advanced (likemaps, but GIS knowledge is a really useful skill!)

(10)

Density plots and histograms

ˆ Density plots and histograms will show you the full distribution of the variable

ˆ Values along the x-axis, and how often those values show up on y

ˆ The density plots will present a smooth line by averaging nearby values

(11)

Density plots

ˆ If variable iscontinuous, we can’t count how ofteneach value comes up

ˆ We smooth it by looking at the number of times it falls within a range

df < - r e a d.csv(

’ h t t p :// www . a i r e . c d m x . gob . mx / o p e n d a t a / red _ m a n u a l / red _ m a n u a l _ p a r t i c u l a s _ s u s p . csv ’, s k i p =8, s t r i n g s A s F a c t o r s = F )

df= f i l t e r (df, cve _ p a r a m e t e r ==" PM1 0")

p l o t(d e n s i t y(df$v a l u e ) , y l a b =" D e n s i t y ", x l a b =" PM1 0", lwd =2, bty =" L ", m a i n =" D i s t r i b u t i o n of PM1 0 in C D M X ",

(12)

How would you make this figure better? 0.000 0.005 0.010 0.015 density.default(x = df$value) Density

(13)

Titles and labels

ˆ Readability is super important in graphs

ˆ Add labels and titles! Titles with ’main’ and axis labels with ’xlab’ and ’ylab’

(14)

How would you make this figure better? 0.000 0.005 0.010 0.015 Distribution of PM10 in CDMX Density

(15)

Histograms

ˆ When a variable is continuous, we can’t count the number of timeseach value comes up

ˆ We create ‘bins” and show how many observations fall into each bin

h i s t(df$value , y l a b =" D e n s i t y ", x l a b =" PM1 0", f r e q = F , bty =" L ",

m a i n =" D i s t r i b u t i o n of PM1 0 in C D M X ", b r e a k s =3 0,col=" # FAA4 3A ", cex . lab =1.2, cex .a x i s=1.2)

(16)

How would you make this figure better? Histogram of df$value Density 0 1000 2000 3000 4000 5000 6000 7000

(17)

Histograms

ˆ These need labels too! Other important options:

ˆ Do proportions with ‘freq=FALSE‘

(18)

How would you make this figure better? Distribution of PM10 in CDMX Density 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014

(19)

Barplot

ˆ If it’s a discreet variable (like a coin toss or a roll of a dice), often best to just count the number (or fraction) of observations in each category

ˆ ‘table()‘ command, shows us the whole distribution of a categorical

ˆ Imagine we gather data from 500 rolls of a dice

d a t a < - s a m p l e(c(1:6) ,5 0 0,r e p l a c e=T R U E) d a t a < - d a t a . f r a m e ( R e s u l t = d a t a )

t a b l e( d a t a )

p r o p.t a b l e(t a b l e( d a t a ))

b a r p l o t(p r o p.t a b l e(t a b l e( d a t a )))

(20)
(21)

1 2 3 4 5 6

0.00

0.05

0.10

(22)

1 2 3 4 5 6 500 rolls of a dice Propor tion 0.00 0.05 0.10 0.15

(23)

Overlaying Densities

ˆ Sometimes it’s nice to be able to compare two distributions

ˆ Because density plots are so simple, we can do this by overlaying them

ˆ The ‘lines()‘ function will add a line to your graph

ˆ Be sure to set colors so you can tell them apart

p l o t(d e n s i t y( f i l t e r (df, cve _ s t a t i o n ==" TLA ")$v a l u e ) , y l a b =" D e n s i t y ", x l a b =" PM1 0", lwd =2,

bty =" L ", m a i n =" D i s t r i b u t i o n of PM1 0", cex . lab =1.2, cex .a x i s=1.2, y l i m =c(0,0.0 1 5))

l i n e s(d e n s i t y( f i l t e r (df, cve _ s t a t i o n ==" PED ")$v a l u e ) ,col=2, lwd =2, lty =2)

l e g e n d(" t o p r i g h t ",c(" TLA "," PED ") ,

(24)

How would you make this figure better?

0.000

0.005

0.010

0.015

density.default(x = filter(df, cve_station == "TLA")$value)

Density

TLA PED

(25)

How would you make this figure better? 0 50 100 150 200 250 300 350 0.000 0.005 0.010 0.015 0.020 0.025 Distribution of PM10 Density TLA PED

(26)

Practice

ˆ Install ‘Ecdata‘, load it, and get the ‘MCAS‘ data

ˆ Make a density plot for ‘totsc4‘, and add a vertical green dashed line at the median

ˆ Create a bar plot showing the proportion of observations that have nonzero values of ‘bilingua‘

(27)

Practice answers i n s t a l l.p a c k a g e s(’ E c d a t a ’) l i b r a r y( E c d a t a ) d a t a ( M C A S ) p l o t(d e n s i t y( M C A S$t o t s c4) , x l a b ="4th G r a d e T e s t S c o r e ") a b l i n e( v =m e d i a n( M C A S$t o t s c4) ,col=’ g r e e n ’, lty =’ d a s h e d ’) # THE G G P L O T2 WAY g g p l o t ( MCAS , aes ( x = t o t s c4))+s t a t_d e n s i t y( g e o m =’ l i n e ’)+ g e o m _ v l i n e ( aes ( x i n t e r c e p t =m e d i a n( t o t s c4)) , c o l o r =’ g r e e n ’, l i n e t y p e =’ d a s h e d ’)+ x l a b ("4th G r a d e T e s t S c o r e ") M C A S < - M C A S % >% m u t a t e ( n o n z e r o b i = M C A S$b i l i n g u a > 0) g g p l o t ( MCAS , aes ( x = n o n z e r o b i ))+ g e o m _ bar ()+

g g t i t l e (" N o n z e r o S p e n d i n g Per B i l i n g u a l S t u d e n t ") g g p l o t ( MCAS , aes ( y = t o t d a y ))+ g e o m _b o x p l o t()+

(28)

How would you make this figure better?

0.00 0.01 0.02

675 700 725

4th Grade Test Score

(29)

How would you make this figure better? 0 50 100 150 FALSE TRUE nonzerobi count

(30)

How would you make this figure better? 4000 6000 8000 10000 −0.4 −0.2 0.0 0.2 0.4 totda y

(31)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(32)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution

Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(33)

Summary statistics

ˆ The mean, median, etc. describe the distribution in a condensed way

ˆ Means and medians both describe the center of the distribution

ˆ Percentiles describe where other parts not necessarily in the middle are

ˆ Standard deviations and variances describe how spread outthe distribution is

ˆ Important to differentiate between the truth (e.g., E(X)) and the sample

(34)

The Mean — sample equivalent of the expected value ˆ E[X] :=Pxf(x)x

ˆ The sample equivalent is the mean (or average)

ˆ Calculated by multiplying each value by the proportion of times it comes up, and adding it all together

ˆ In R, ‘mean(x)‘

ˆ Essentially we estimate f(x) with the proportion of timesx shows up in the data

ˆ Let our random variable be the distribution of the rolls of a die

ˆ E(X) =P6x=1x16 = 3

ˆ From our 500 simulations: X = 3.382

(35)

The Mean

ˆ Nice things about the mean:

ˆ Easy to understand

ˆ The mean of ‘x-mean(x)‘ is 0 (same forE(E(X)−X) = 0)

ˆ Good statistical properties

ˆ Makes sense with large or small samples, with discrete or continuous variables ˆ Not so nice:

ˆ Sensitive to outliers (also true toE(X))

m e a n(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

m e a n(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 0 0)) ˆ The mean doesn’t describe EVERYTHINGabout the distribution!

(36)
(37)

The Median

ˆ Order observations from smallest to largest and pick the one in the middle

ˆ If there’s an even number of observations, take the mean of the two middle

ˆ Population equivalent is m such thatP(X ≤m) =P(X ≥m) = 12

x < - c(3,1,4,2,2)

m e d i a n( x )

(38)
(39)

The Median

ˆ Nice things about the median:

ˆ Super easy to calculate (you can often do it by hand)

ˆ Represents the “typical” observation

ˆ Not sensitive to outliers

m e d i a n(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

m e d i a n(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 0 0))

ˆ Generally not affected by transforming the data ˆ Not so nice:

ˆ Insensitive to outliers means it can ignore real changes in the “tails”

ˆ Can ignore magnitudes generally

(40)
(41)

Adding Lines

ˆ It’s pretty common for us to want to add some explanatory lines to our graphs

ˆ For example, adding mean/median/etc. to a density

ˆ Or showing where 0 is

ˆ Do this with ‘abline()‘

ˆ After creating the plot, THEN

ˆ Add the ‘abline(intercept,slope)‘

ˆ ‘abline(h=horizontal)‘ for horizontal numbers

ˆ ‘abline(v=vertical)‘ for vertical numbers

(42)

Mean and Median Together

p l o t(d e n s i t y(df$v a l u e ) , y l a b =" D e n s i t y ", x l a b =" PM1 0", lwd =2, bty =" L ", m a i n =" D i s t r i b u t i o n of PM1 0 in C D M X ")

a b l i n e( v =m e a n(df$v a l u e ) , lwd =3,col=2, lty =2)

a b l i n e( v =m e d i a n(df$v a l u e ) , lwd =3,col=4, lty =4)

(43)

How would you make this figure better? 0 100 200 300 400 500 600 0.000 0.005 0.010 0.015 density.default(x = df$value) Density Mean Median

(44)

How would you make this figure better? 0.000 0.005 0.010 0.015 Distribution of PM10 in CDMX Density Mean Median

(45)

Percentiles

ˆ A percentile is just like a median

ˆ Except that you don’t necessarily pick the MIDDLE

ˆ Sort observations and pick the (percentile)th observation

ˆ Population equivalent: Percentile kth is m such thatP(X <m)<k/100

ˆ Use the ‘quantile()‘ function, and list the percentiles you want

ˆ Percentiles can fully describe the distribution if you use enough!

q u a n t i l e(c(0,1,2,3,4,5) ,c(.4,.5,1))

(46)
(47)

Percentiles

Exactly 20% of the observations are between each set of lines

p l o t(d e n s i t y(df$v a l u e ) , y l a b =" D e n s i t y ", x l a b =" PM1 0", lwd =2, bty =" L ", m a i n =" D i s t r i b u t i o n of PM1 0 in C D M X ")

(48)

How would you make this figure better? 0.000 0.005 0.010 0.015 Distribution of PM10 in CDMX Density

(49)

Min and Max

ˆ Also useful are the minimum and maximum (a.k.a. the 0% and 100% percentiles)

ˆ Show you the range of values that the variable takes in the sample

(50)

Standard deviation and variance

ˆ Standard ways of understanding how much the datavaries around the mean

ˆ Variance = Standard deviation squared

(51)

Standard deviation and variance

ˆ Start with data and subtract out the mean

ˆ The result is the residuals (left-over part, unexplained part)

ˆ Square the residuals

ˆ Average them (variance) [note: then multiply by N/(N-1)]

ˆ Square root of the variance is the standard deviation

ˆ Why this process rather than some other measure around the mean (i.e. why square it)? Good statistical reasons I promise

(52)

Standard deviation and variance d a t a < - c(1,1,1,1,2) d a t a < - d a t a - m e a n( d a t a ) d a t a # V a r i a n ce , sd c((5/4)*me a n( d a t a ^2) ,var(c(1,1,1,1,2)) , s q r t((5/4)*m e a n( d a t a ^2)) ,sd(c(1,1,1,1,2))) d a t a2 < - c(1 0 0,0, -3 0,5 0,8 0) d a t a2 < - d a t a2 - m e a n( d a t a2) # V a r i a n ce , sd c((5/4)*me a n( d a t a2^2) ,var(c(1 0 0,0, -3 0,5 0,8 0)) , s q r t((5/4)*m e a n( d a t a2^2)) ,sd(c(1 0 0,0, -3 0,5 0,8 0)))

(53)
(54)

Graphically, SD and variance tell you how “wide” the distribution is

dat < - d a t a . f r a m e ( w i d e = r n o r m(1 0 0,sd=1) , n a r r o w = r n o r m(1 0 0,sd=1/2))

p l o t(d e n s i t y( dat$w i d e ) ,col=2, lwd =2, t y p e =" l ", bty =" L ", y l i m =c(0,1.5))

l i n e s(d e n s i t y( dat$n a r r o w ) ,col=4, lwd =2, t y p e =" l ", bty =" L ", lty =3)

(55)

How would you make this figure better? −4 −2 0 2 0.0 0.5 1.0 1.5 Density SD=1 SD=0.5

(56)

Summary statistics table

ˆ Something we will often want to do is display a bunch of summary statistics at once for the variables we have

ˆ This makes it easy to understand a variable’s distribution at a glance

ˆ We’ll be using the ‘stargazer‘ command for this

l i b r a r y( s t a r g a z e r ) d a t a ( L i f e C y c l e S a v i n g s )

(57)
(58)

Summary statistics table

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max

Savings 50 9.7 4.5 0.6 7.0 12.6 21.1

% of population under 15 50 35.1 9.2 21.4 26.2 44.1 47.6 % of population over 75 50 2.3 1.3 0.6 1.1 3.3 4.7 Disposable income (DPI) 50 1,106.8 990.9 88.9 288.2 1,795.6 4,001.9 % growth rate of DPI 50 3.8 2.9 0.2 2.0 4.5 16.7

(59)

Stargazer

ˆ See help(stargazer) to see what other summary stats you can include

ˆ ‘type=’text’‘ tells it to give us a basic text table.

ˆ Another one is ‘type=’html’‘, especially if we want to output our table to a file

ˆ You can open up the HTML table and copy/paste it into Excel or Word ˆ Another one is ‘type=’latex’‘, for when exporting to LATEX

(60)

Putting it all together

ˆ Data about public employees salaries is widely available in Mexico

ˆ https://nominatransparente.rhnet.gob.mx/ ˆ ttps://tudinero.cdmx.gob.mx/buscador personas ˆ https://datos.cdmx.gob.mx/explore/dataset/ remuneraciones-al-personal-de-la-ciudad-de-mexico ˆ https://www.transparencia.cdmx.gob.mx/ ˆ https://sep.gob.mx/es/sep1/

Articulo 73 de la Ley General de Contabilidad Gubernamental

ˆ To speed things up, I cleaned a little the data of the 4th trimester of 2019 for the Secretaria de Gobierno (CDMX)

(61)

Practice answers

l i b r a r y( s t a r g a z e r )[ b a s i c s t y l e =\ t i n y ]

S a l a r i o s C D M X =r e a d.csv(" h t t p :// m a u r i c i o - r o m e r o . com / d a t a / c l a s s / S a l a r i o s C D M X2 0 1 9_ G o b e r n a c i o n . csv ") S a l a r i o s C D M X$M u j e r =( S a l a r i o s C D M X$S e x o ==" F e m e n i n o ")

(62)

Summary statistics table

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max Clave puesto 9,773 231.4 231.6 20 159 190 1,184 Monto bruto 9,773 12,170.0 6,390.9 0 8,960 13,119 109,981 Monto neto 9,773 9,700.9 4,837.2 −22,323.2 7,004.3 10,415.8 77,909.5

Mujer (=1) 9,773 0.5 0.5 0 0 1 1

(63)

How would you make this figure better?

0e+00 2e+04 4e+04 6e+04 8e+04 1e+05 0e+00 1e−04 2e−04 3e−04 4e−04 5e−04 Density Hombre Mujer

(64)

Summary statistics table - Males

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max Clave puesto 5,238 221.5 215.9 24 169 190 1,184 Monto bruto 5,238 12,678.5 6,966.0 0 9,640 13,119 104,740 Monto neto 5,238 10,077.8 5,239.1 0.0 7,305.0 10,523.5 74,971.0

Mujer (=1) 5,238 0.0 0.0 0 0 0 0

(65)

Summary statistics table - Females

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max Clave puesto 4,535 242.7 248.1 20 150 190 1,096 Monto bruto 4,535 11,582.8 5,597.8 2,816 8,790 13,119 109,981 Monto neto 4,535 9,265.6 4,286.5 −22,323.2 6,745.0 10,292.7 77,909.5

Mujer (=1) 4,535 1.0 0.0 1 1 1 1

(66)

How would you make this figure better? 0e+00 1e−04 2e−04 3e−04 Density No confianza Confianza

(67)

Practice

ˆ Use ‘data(LifeCycleSavings)‘ to get the Life Cycle Savings data, and use ‘help()‘ and ‘str()‘ to look at it

ˆ Use ‘stargazer()‘ to get a text table of summary statistics for all the variables EXCEPT ddpi

ˆ Now make an HTML table for all the variables. Open the file and look at it in a browser.

ˆ For each of the statistics that the ‘stargazer()‘ table gives you, plus the median, calculate that statistic on your own for the ‘pop15‘ variable using the appropriate R function

ˆ Calculate the max, min, and median in two ways - using their own respective functions, and as percentiles.

(68)

Practice answers l i b r a r y( s t a r g a z e r ) d a t a ( L i f e C y c l e S a v i n g s ) h e l p( L i f e C y c l e S a v i n g s ) str ( L i f e C y c l e S a v i n g s ) s t a r g a z e r ( s e l e c t ( L i f e C y c l e S a v i n g s , - d d p i ) , t y p e =’ t e x t ’) s t a r g a z e r ( s e l e c t ( L i f e C y c l e S a v i n g s , - d d p i ) , t y p e =’ h t m l ’, out =’ t a b l e . h t m l ’) LS < - L i f e C y c l e S a v i n g s

c( l e n g t h ( LS$pop1 5) ,m e a n( LS$pop1 5) ,sd( LS$pop1 5) ,min( LS$pop1 5) ,

(69)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(70)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution

Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(71)

Relationships Between Variables

ˆ We aren’t just interested in looking at variables by themselves!

ˆ We want to know how variables can be related to each other

ˆ When ‘X‘ is high, would we expect ‘Y‘ to also be high, or be low?

ˆ How are variables correlated?

ˆ How does one variable explainanother?

(72)

What Does it Mean to be Related?

ˆ Two variables are related if knowing something about one tells you something about the other

ˆ For example, consider the answer to two questions:

ˆ Do you have an uterus?

ˆ Are you pregnant?

ˆ What is the probability that a random person is pregnant?

ˆ What is the probability that a random person who doesn’t have an uterusis pregnant?

(73)

What Does it Mean to be Related?

ˆ Variables are dependentif telling you the value of one gives you information about the value of the other

ˆ Variables are correlatedif knowing whether one of them isunusually high gives you information about whether the other is unusually high(positive correlation)

or unusually low (negative correlation)

ˆ Explaining one variable ‘Y‘ with another ‘X‘ means predicting ‘Y‘by looking at

(74)

An Example: Dependence w a g e1 < - r e a d_ s t a t a ( " h t t p :// f m w w w . bc . edu / ec - p / d a t a / w o o l d r i d g e / w a g e1. dta " ) t a b l e( w a g e1$numdep , w a g e1$smsa , dnn =c(’ Num . D e p e n d e n t s ’,’ L i v e s in M e t r o p o l i t a n A r e a ’))

(75)
(76)

An Example: Dependence

ˆ What are we looking for here?

ˆ For dependence, simply see if the distribution of one variable changes for the

different values of the other.

ˆ Does the distribution of Number of Dependents differ based on your SMSA status?

p r o p.t a b l e(t a b l e( w a g e1$numdep , w a g e1$smsa , dnn =c(’ Num . D e p e n d e n t s ’,

’ L i v e s in M e t r o p o l i t a n A r e a ’)) ,

(77)
(78)

An Example: Dependence

ˆ Does the distribution of SMSA differ based on your Number of Dependents Status?

p r o p.t a b l e(t a b l e( w a g e1$numdep , w a g e1$smsa , dnn =c(’ N u m b e r of D e p e n d e n t s ’, ’ L i v e s in M e t r o p o l i t a n A r e a ’)) ,

m a r g i n=1)

ˆ Looks like it!

(79)
(80)

An Example: Correlation

ˆ We are interested in whether two variables tend tomove together (positive correlation) or move apart(negative correlation)

ˆ One basic way to do this is to see whether values tend to be hightogether

ˆ One way to check in dplyr is to use ‘group by()‘ to organize the data into groups

ˆ Then ‘summarize()‘ the data within those groups

w a g e1 % >%

g r o u p _by( s m s a ) % >%

s u m m a r i z e ( n u m d e p =m e a n( n u m d e p ))

(81)
(82)

An Example: Correlation

ˆ There’s also a summary statistic we can calculatecalledcorrelation, this is typically what we mean by “correlation”

ˆ Ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)

ˆ Basically ”a one-standard deviation increase in ‘X‘ is associated with a “correlation” standard-deviation increase in ‘Y‘”

cor( w a g e1$numdep , w a g e1$s m s a )

(83)
(84)

An Example: Explanation

ˆ Let’s go back to those different means:

w a g e1 % >%

g r o u p _by( s m s a ) % >%

s u m m a r i z e ( n u m d e p =m e a n( n u m d e p )) ˆ Explanation would be saying that:

ˆ If you’re in an SMSA, I predict that you have these many dependents

m e a n( f i l t e r ( w a g e1, s m s a ==1)$n u m d e p )

ˆ If you’re not in an SMSA, I predict that you have these many dependents

m e a n( f i l t e r ( w a g e1, s m s a ==0)$n u m d e p )

ˆ If you are in an SMSA and have 2 dependents, then only some of those dependents

are explained by SMSA and some of them are unexplained by SMSA

(85)

Coding Recap

ˆ ‘table(df$var1,df$var2)‘ to look at two variables together

ˆ ‘prop.table(table(df$var1,df$var2))‘ for the proportion in each cell

ˆ ‘prop.table(table(df$var1,df$var2),margin=2)‘ to get proportions within each column

ˆ ‘prop.table(table(df$var1,df$var2),margin=1)‘ to get proportions within each row

ˆ ‘df %>% group by(var1) %>% summarize(mean(var2))‘ to get mean of var2 for each value of var1

(86)

Graphing Relationships

ˆ Relationships between variables can be easier to see graphically

ˆ And graphs are extremely important to understanding relationships and the “shape” of those relationships

(87)

Wage and Education

ˆ Let’s use ‘plot(xvar,yvar)‘

p l o t( w a g e1$educ , w a g e1$wage , x l a b =" Y e a r s of E d u c a t i o n ", y l a b =" W a g e ") # THE G G P L O T2 WAY

g g p l o t ( w a g e1, aes ( x = educ , y = w a g e ))+ g e o m _ p o i n t ()+ x l a b (’ Y e a r s of E d u c a t i o n ’)+

y l a b (’ W a g e ’)

ˆ As we look at different values of ‘educ‘, what changes about the values of ‘wage‘ we see?

(88)

0 5 10 15 0 5 10 15 20 25 Years of Education W age

(89)

0 5 10 15 0 5 10 15 20 25 Years of Education W age

(90)

Graphing Relationships

ˆ Try to picture the shapeof the data

ˆ Should this be a straight line? A curved line? Positively sloped? Negatively?

p l o t( w a g e1$educ , w a g e1$wage , x l a b =" Y e a r s of E d u c a t i o n ", y l a b =" W a g e ")

a b l i n e( -.9,.5,col=’ red ’)

p l o t(f u n c t i o n( x ) 5.4-.6* x +.0 5*( x ^2) ,0,1 8,add=TRUE,col=’ b l u e ’)

# THE G G P L O T2 WAY

g g p l o t ( w a g e1, aes ( x = educ , y = w a g e ))+ g e o m _ p o i n t ()+ x l a b (’ Y e a r s of E d u c a t i o n ’)+

y l a b (’ W a g e ’)+

g e o m _a b l i n e( aes ( i n t e r c e p t = -.9, s l o p e =.5) ,col=’ red ’)+

(91)

0 5 10 15 0 5 10 15 20 25 Years of Education W age

(92)

Graphing Relationships

ˆ ‘plot(xvar,yvar)‘ is extremely powerful, and will show you relationships at a glance

ˆ The previous graph showed a clear positive relationship, and indeed ‘cor(wage1$wage,wage1$educ)‘ is 0.406

ˆ Further, we don’t only see a positive relationship, but we have some sense of how

(93)

Graphing Relationships

ˆ Let’s compare clothing sales volume vs. profit margin for men’s clothing firms

l i b r a r y( E c d a t ) d a t a ( C l o t h i n g ) p l o t( C l o t h i n g$sales , C l o t h i n g$margin, x l a b =" G r o s s S a l e s ", y l a b =" M a r g i n ") # THE G G P L O T2 WAY l i b r a r y( E c d a t ) d a t a ( C l o t h i n g ) g g p l o t ( C l o t h i n g , aes ( x = sales , y =m a r g i n))+ g e o m _ p o i n t ()+ x l a b (’ G r o s s S a l e s ’)+ y l a b (’ M a r g i n ’)

(94)

0 5000 10000 15000 20000 25000 20 30 40 50 60 Gross Sales Margin

(95)

Graphing Relationships

ˆ Comparing Singapore diamond prices vs. carats

l i b r a r y( E c d a t ) d a t a ( D i a m o n d ) p l o t( D i a m o n d$carat , D i a m o n d$price , x l a b =" N u m b e r of C a r a t s ", y l a b =" P r i c e ") # THE G G P L O T2 WAY l i b r a r y( E c d a t ) d a t a ( D i a m o n d )

g g p l o t ( Diamond , aes ( x = carat , y = p r i c e ))+ g e o m _ p o i n t ()+ x l a b (’ N u m b e r of C a r a t s ’)+

(96)

0.2 0.4 0.6 0.8 1.0 5000 10000 15000 Number of Carats Pr ice

(97)

Graphing Relationships

ˆ Another way to graph a relationship, especially when one of the variables only takes a few values, is to plot the ‘density()‘ function for different values

p l o t(d e n s i t y( f i l t e r ( w a g e1, m a r r i e d ==0)$w a g e ) ,col=’ b l u e ’, x l a b =" W a g e ", m a i n =" W a g e D i s t r i b u t i o n ; B l u e = U n m a r r i e d , Red = M a r r i e d ", bty =" L ", las =1, lwd =2) l i n e s(d e n s i t y( f i l t e r ( w a g e1, m a r r i e d ==1)$w a g e ) ,col=’ red ’, lwd =2) # THE G G P L O T2 WAY g g p l o t ( f i l t e r ( w a g e1, m a r r i e d ==0) , aes ( x = w a g e ))+s t a t_d e n s i t y( g e o m =’ l i n e ’,col=’ b l u e ’)+ x l a b (’ W a g e ’)+ y l a b (’ D e n s i t y ’)+ g g t i t l e (" W a g e D i s t r i b u t i o n ; B l u e = U n m a r r i e d , Red = M a r r i e d ")+ s t a t_d e n s i t y( d a t a = f i l t e r ( w a g e1, m a r r i e d ==1) , g e o m =’ l i n e ’,col=’ red ’)

(98)

0 5 10 15 20 0.00 0.05 0.10 0.15 0.20 0.25

Wage Distribution; Blue = Unmarried, Red = Married

Wage

(99)

Graphing Relationships

ˆ Different distributions: married people earn more!

ˆ We can back that up other ways

w a g e1 % >% g r o u p _by( m a r r i e d ) % >% s u m m a r i z e ( w a g e = m e a n( w a g e ))

(100)
(101)

Keep in mind!

ˆ Just because two variables are related doesn’t mean we knowwhy

ˆ If ‘cor(x,y)‘ is positive, it could be that ‘x‘ causes ‘y‘... or that ‘y‘ causes ‘x‘, or that something else causes both!

ˆ Or many other configurations...

(102)

Practice

ˆ Install the ‘SMCRM‘ package, load it, get the ‘customerAcquisition‘ data. Rename it ca

ˆ Among ‘acquisition==1‘ observations, see if the size of first purchase is related to duration as a customer, with ‘cor‘ and (labeled) ‘plot‘

ˆ See if ‘industry‘ and ‘acquisition‘ are dependent on each other using ‘prop.table‘ with the ‘margin‘ option

ˆ See if average revenues differ between industries using ‘aggregate‘, then check the ‘cor‘

ˆ Plot the density of revenues for ‘industry==0‘ in blue and, on the same graph, revenues for ‘industry==1‘ in red

(103)

Practice Answers i n s t a l l.p a c k a g e s(’SMCRM ’) l i b r a r y(SMCRM) d a t a ( c u s t o m e r A c q u i s i t i o n ) c a<−c u s t o m e r A c q u i s i t i o n c o r( f i l t e r ( ca , a c q u i s i t i o n==1)$f i r s t p u r c h a s e , f i l t e r ( ca , a c q u i s i t i o n==1)$d u r a t i o n ) p l o t( f i l t e r ( ca , a c q u i s i t i o n==1)$f i r s t p u r c h a s e , f i l t e r ( ca , a c q u i s i t i o n==1)$d u r a t i o n , x l a b=” V a l u e o f F i r s t P u r c h a s e ”, y l a b=” C u s t o m e r D u r a t i o n ”) p r o p.t a b l e(t a b l e( c a$i n d u s t r y , c a$a c q u i s i t i o n ) ,m a r g i n=1) p r o p.t a b l e(t a b l e( c a$i n d u s t r y , c a$a c q u i s i t i o n ) ,m a r g i n=2)

a g g r e g a t e( r e v e n u e ˜ i n d u s t r y , d a t a=ca , FUN=mean)

c o r( c a$r e v e n u e , c a$i n d u s t r y )

p l o t(d e n s i t y( f i l t e r ( ca , i n d u s t r y==0)$r e v e n u e ) ,c o l=’ b l u e ’, x l a b=” R e v e n u e s ”, main=” R e v e n u e D i s t r i b u t i o n ”)

(104)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(105)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data!

(106)

Anscome’s Quartet

ˆ Developed by F.J. Anscombe in 1973

ˆ Anscombe’s Quartet is a set of four datasets, all with the same summary statistics (mean, standard deviation, and correlation)

(107)

Anscome’s Quartet l i b r a r y( d a t a s e t s ) d a t a s e t s :: a n s c o m b e s t a r g a z e r ( a n s c o m b e , t y p e =’ t e x t ’,s u m m a r y.s t a t = c(" n "," m e a n "," sd ")) cor( a n s c o m b e$x1, a n s c o m b e$y1) cor( a n s c o m b e$x2, a n s c o m b e$y2) cor( a n s c o m b e$x3, a n s c o m b e$y3) cor( a n s c o m b e$x4, a n s c o m b e$y4)

(108)
(109)

Anscome’s Quartet

par( m f r o w =c(2,2))

p l o t( a n s c o m b e$x1, a n s c o m b e$y1, pch =1 9,col=" # FAA4 3A ", bty =" L ", x l i m =c(0,2 0) , y l i m =c(0,1 3) ,

x l a b =" x1", y l a b =" y1", m a i n =" D a t a s e t #1", cex . lab =1.2, cex .a x i s=1.2, las =1)

p l o t( a n s c o m b e$x2, a n s c o m b e$y2, pch =1 9,col=" #5DA5DA ", bty =" L ", x l i m =c(0,2 0) , y l i m =c(0,1 3) ,

x l a b =" x2", y l a b =" y2", m a i n =" D a t a s e t #2", cex . lab =1.2, cex .a x i s=1.2, las =1)

p l o t( a n s c o m b e$x3, a n s c o m b e$y3, pch =1 9,col=" #4D4D4D ", bty =" L ", x l i m =c(0,2 0) , y l i m =c(0,1 3) ,

x l a b =" x3", y l a b =" y3", m a i n =" D a t a s e t #3", cex . lab =1.2, cex .a x i s=1.2, las =1)

p l o t( a n s c o m b e$x4, a n s c o m b e$y4, pch =1 9,col=" # F1 5 8 5 4", bty =" L ", x l i m =c(0,2 0) , y l i m =c(0,1 3) ,

(110)

0 5 10 15 20 0 2 4 6 8 10 12 Dataset #1 x1 y1 0 5 10 15 20 0 2 4 6 8 10 12 Dataset #2 x2 y2 0 5 10 15 20 0 2 4 6 8 10 12 Dataset #3 x3 y3 0 5 10 15 20 0 2 4 6 8 10 12 Dataset #4 x4 y4

(111)

Datasaurus Dozen datasets

ˆ It’s like Anscome’s Quartet on steroids

ˆ The original Datasaurus was created by Alberto Cairo (see

http://www.thefunctionalart.com/2016/08/ download-datasaurus-never-trust-summary.html)

ˆ The other twelve were created by Justin Matejka and George Fitzmaurice (see

https://www.autodeskresearch.com/publications/samestats)

(112)

Datasaurus Dozen datasets d a t a s a u r u s _ d o z e n % >% g r o u p _by( d a t a s e t ) % >% s u m m a r i z e ( m e a n_ x = m e a n( x ) , m e a n_ y = m e a n( y ) , std _dev_ x = sd( x ) , std _dev_ y = sd( y ) , c o r r _ x _ y = cor( x , y ) )

(113)
(114)

Datasaurus Dozen datasets i n s t a l l.p a c k a g e s(" d a t a s a u R u s ") l i b r a r y( d a t a s a u R u s ) par( m f r o w =c(3,4)) for( d a t a in u n i q u e( d a t a s a u r u s _ d o z e n$d a t a s e t )){ p r i n t( d a t a ) df= d a t a s a u r u s _ d o z e n [w h i c h( d a t a s a u r u s _ d o z e n$d a t a s e t == d a t a ) ,] p l o t(df$x ,df$y , pch =1 9,col=" # FAA4 3A ", bty =" L ", x l i m =r a n g e( d a t a s a u r u s _ d o z e n$x ) , y l i m =r a n g e( d a t a s a u r u s _ d o z e n$y ) , x l a b =" x ", y l a b =" y ", m a i n = data , cex =0.5, cex . lab =1.2, cex .a x i s=1.2, las =1) }

(115)

20 40 60 80 100 0 20 40 60 80 100 dino x y 20 40 60 80 100 0 20 40 60 80 100 away x y 20 40 60 80 100 0 20 40 60 80 100 h_lines x y 20 40 60 80 100 0 20 40 60 80 100 v_lines x y 20 40 60 80 100 0 20 40 60 80 100 x_shape x y 20 40 60 80 100 0 20 40 60 80 100 star x y 20 40 60 80 100 0 20 40 60 80 100 high_lines x y 20 40 60 80 100 0 20 40 60 80 100 dots x y 20 40 60 80 100 0 20 40 60 80 100 circle x y 20 40 60 80 100 0 20 40 60 80 100 bullseye x y 20 40 60 80 100 0 20 40 60 80 100 slant_up x y 20 40 60 80 100 0 20 40 60 80 100 slant_down x y 20 40 60 80 100 0 20 40 60 80 100 wide_lines x y

(116)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data! Simulations and relationship between variables

(117)

Summarizing data

Remember...

Plotting the distribution

Inferring things about the distribution Relationships Between Variables

Don’t trust statistics alone, visualize your data!

(118)

Let’s Simulate!

ˆ Let’s expand our use of simulation to simulate the relationship betweentwo

variables

ˆ We can do this by using one variable to build another

ˆ I draw 400 random genders, and add to them 400 random normals) #4 0 0 p e o p l e e q u a l l y l i k e l y t o be M o r F s i m d a t a<−d a t a . f r a m e ( g e n d e r = s a m p l e(c(” Male ”,” F e m a l e ”) ,4 0 0,r e p l a c e=T ) ) # H e i g h t i s n o r m a l l y d i s t r i b u t e d w i t h mean 1 8 0cm and s d 1 0cm #and men a r e 5cm o f a f o o t t a l l e r s i m d a t a<− s i m d a t a %>%m u t a t e ( h e i g h t f t = r n o r m(4 0 0,1 8 0,1 0)+5*( g e n d e r ==” Male ”) ) s i m d a t a %>% g r o u p by( g e n d e r ) %>% s u m m a r i z e ( h e i g h t =mean( h e i g h t f t ) )

(119)

Two Variable Simulation

ˆ We get in our simulation that men are on average ‘mean(filter(simdata,gender==”Male”)$

heightft)-mean(filter(simdata,gender==”Female”)$heightft)‘ taller than women.

ˆ The true data-generating process is that heightft is a normal variable with mean 180, plus 5cm if you’re male

g g p l o t ( f i l t e r ( s i m d a t a , g e n d e r==” Male ”) , a e s ( x=h e i g h t f t ,c o l=’ Male ’))+

s t a t d e n s i t y( geom=’ l i n e ’)+

s t a t d e n s i t y( d a t a= f i l t e r ( s i m d a t a , g e n d e r==” F e m a l e ”) , a e s ( x=h e i g h t f t ,c o l=’ F e m a l e ’) , geom=’ l i n e ’)+

(120)

0.00 0.01 0.02 0.03 0.04 160 180 200 220 heightft density colour Female Male

(121)

Two Variable Simulation

So does checking for the difference of means give us back the difference in height from the data-generating process? Let’s loop!

h e i g h t d i f f<−c( ) f o r ( i i n 1:2 5 0 0) { s i m d a t a<−d a t a . f r a m e ( g e n d e r = s a m p l e(c(” Male ”,” F e m a l e ”) ,4 0 0,r e p l a c e=T ) ) # H e i g h t i s n o r m a l l y d i s t r i b u t e d w i t h mean 1 8 0cm and s d 1 0cm #and men a r e 5cm o f a f o o t t a l l e r s i m d a t a<−s i m d a t a %>%m u t a t e ( h e i g h t f t = r n o r m(4 0 0,1 8 0,1 0)+5*( g e n d e r ==” Male ”) ) s i m d a t a %>% g r o u p by( g e n d e r ) %>% s u m m a r i z e ( h e i g h t = mean( h e i g h t f t ) ) h e i g h t d i f f [ i ]<−mean( f i l t e r ( s i m d a t a , g e n d e r==” Male ”)$h e i g h t f t )− mean( f i l t e r ( s i m d a t a , g e n d e r==” F e m a l e ”)$h e i g h t f t ) } s t a r g a z e r ( a s . d a t a . f r a m e ( h e i g h t d i f f ) , t y p e=’ t e x t ’)

(122)
(123)

Another Example

ˆ So far, no problem, right? Everything works out (I mean, of course it does)

ˆ Of coursethe average number of heads in a sample will on average be 50%

ˆ So what can we actually learn here?

(124)

Another Example

# Is y o u r c o m p a n y in t e c h ? Let ’ s say 3 0% of f i r m s are

df < - d a t a . f r a m e ( t e c h = s a m p l e(c(0,1) ,5 0 0,r e p l a c e= T , p r o b =c(.7,.3))) # T e c h f i r m s on a v e r a g e s p e n d $3mil m or e d e f e n d i n g IP l a w s u i t s

df < - df% >% m u t a t e ( IP . s p e n d = 3* t e c h +r u n i f(5 0 0,min=0,max=4)) # Now let ’ s c h e c k for how p r o f i t and IP . s p e n d are c o r r e l a t e d !

df < - df% >% m u t a t e (log. p r o f i t = 2* t e c h - .3* IP . s p e n d + r n o r m(5 0 0,m e a n=2))

cor(df$log. profit ,df$IP . s p e n d )

(125)
(126)

-Another Example

Maybe just a fluke? Let’s loop.

I P c o r r < - c()

for ( i in 1:1 0 0 0) {

# Is y o u r c o m p a n y in t e c h ? Let ’ s say 3 0% of f i r m s are

df < - d a t a . f r a m e ( t e c h = s a m p l e(c(0,1) ,5 0 0,r e p l a c e= T , p r o b =c(.7,.3))) # T e c h f i r m s on a v e r a g e s p e n d $3mil m o r e d e f e n d i n g IP l a w s u i t s

df < - df% >% m u t a t e ( IP . s p e n d = 3* t e c h +r u n i f(5 0 0,min=0,max=4)) # Now let ’ s c h e c k for how p r o f i t and IP . s p e n d are c o r r e l a t e d !

df < - df% >% m u t a t e (log. p r o f i t = 2* t e c h - .3* IP . s p e n d + r n o r m(5 0 0,m e a n=2)) I P c o r r [ i ] < - cor(df$log. profit ,df$IP . s p e n d )

}

(127)
(128)

What’s Happening? A graph might help

g g p l o t (df, aes ( x = IP . spend , y =log. profit , c o l o r = as .f a c t o r(" All C o m p a n i e s ") ) ) + g e o m _ p o i n t ()+

(129)

−2 0 2 4 0 2 4 6 IP.spend log.profit Firm Type All Companies

(130)

What’s Happening? A graph might help

g g p l o t ( m u t a t e (df, t e c h =f a c t o r( tech ,l a b e l s=c(" Not T e c h "," T e c h "))) , aes ( x = IP . spend , y =log. profit , c o l o r = t e c h ))+ g e o m _ p o i n t ()+ g u i d e s ( c o l o r = g u i d e _l e g e n d(t i t l e=" F i r m T y p e "))

(131)

−2 0 2 4 0 2 4 6 IP.spend log.profit Firm Type Not Tech Tech

(132)

Simpson’s Paradox

ˆ Here we have a true negative relationship - we know it’s in the true model!

ˆ But when we plot it out, it’s positive

ˆ WITHIN tech companies and non-tech companies, IP spend is negatively correlated with profit

ˆ But because tech companies have higher IP spend and higher profit, they’re positively correlated!

(133)

Simpson’s Paradox

ˆ So our method (looking at the correlation between them) doesn’t work!

ˆ The simulation has shown us that we’d get it wrong if we do this

ˆ Our analysis method doesn’t correct for what tech is doing here

ˆ We need to have some way of incorporating what we know about tech

ˆ Taking what we might know about the true model - that firm type has something to do with this, and adjusting so we get the right answer

(134)

Practice

ˆ Use the ‘prob‘ option in ‘sample‘ to generate 300 coin flips weighted to be 55% heads. Calculate heads prop.

ˆ Loop it 2000 times. How often will you correctly claim that the coin is more than 50% heads?

ˆ Create ‘dat‘: 1500 obs of ‘married‘ (0 or 1), ‘educ‘ (unif 0 to 16, plus ‘2*married‘), ‘log.wage‘ (normal mean 5 plus ‘.1*educ‘ plus ‘2*married‘)

ˆ Loop it 1000 times and calculate ‘cor‘ between ‘educ‘ and ‘log.wage‘ each time.

ˆ It’s positive - does that mean it’s right? If not, how do you know?

ˆ Use ‘plot‘ and then ‘points‘ to plot the ‘married==0‘ and ‘married==1‘ data in different colors

References

Related documents

With sales representatives handling incoming calls, 53.2 percent are answered or connected successfully with the correct party.. In a call center environment, 67.6 percent

As robust watermarking technology is used to embed the absolute time code, very small parts of a media stream could be deleted without destroying the watermark..

a) That a By-law to Authorize the Submission of an Application to Ontario Infrastructure and Lands Corporation (“OILC”) for Financing Certain Ongoing Capital Works of the

107 general que establece que la mayor parte de esta explicación corresponde a las diferencias en las condiciones socioeconómicas de unos y otros niños, mientras que las

Topographical maps of the statistical z values at the early learning stages showed that beta amplitudes for the Learning group were significantly larger than those for the

A technical assistance plan was proposed to assist the Bank Supervision Department in developing such a model for monitoring bank performance whereby the consultant would

If indoor samples are essentially random subsamples of the outdoor air, as the simulation suggest (Supplementary Figures 6 and 7), then the rare species randomly assembled and

On the other hand, inoculations we performed at these early shoot growth stages (simulating wet weather and an overwintering inoculum source in the trellis) can produce leaf and