Adapt deck Jun 2018

(1)

Murtaza Haider

Email: [email protected]

(2)

Professor of Real Estate Management, Ryerson University, Toronto

Author: Getting Started with Data Science
Syndicated Columnist: Financial Post
Blogger: Huffington Post
Instructor: www.CognitiveClass.ai
Education: Ph.D. in Transportation Engineering and Planning
Research focus:

(5)

Our address on the web
http://tinyurl.com/r-analytics

What you need for today
Have R and RStudio installed on your device

Do good-looking people get higher salaries/promotions?
Hamermesh, Daniel S. and Amy M. Parker (2005). Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity. Economics of Education Review, August 2005. pp. 5-16.

Help with software
https://sites.google.com/site/statsr4us/intro/software

Learning R

https://sites.google.com/site/statsr4us/intro/software/learning-r

First Session in RCmdr

(6)

1. Morning session
   1. 10:00 – 10:30
      1. Welcome and Introductions
      2. What is Big Data?
   2. 10:30 – 11:00
      1. Introducing data
      2. Data types and structures
   3. 11:05 – 12:00
      1. Introducing R and RStudio
      2. Ten Commandments of Statistical Analysis
   4. 12:00 – 1:00
      1. Lunch
2. Afternoon session
   1. 1:00 – 2:00
      1. Making Summary Tables
   2. 2:05 – 3:00
      1. Graphic Details
   3. 3:10 – 4:30
      1. Regression Models: The mother of all analytics
      2. Hypothesis testing with Regression
      3. All Else Being Equal
   4. 4:45 – 5:00

(7)

Murtaza Haider

It teaches you three things:
1. How to summarize data in tables
2. How to turn data into illustrations

(12)

Ontario held elections in June 2018

Let’s plot the

(13)

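# Seats won by each party in the June 2018 Ontario election
seats = c(76,40,7,1)
seats
names(seats) = c("PC", "NDP", "Liberal", "Green")
seats
# Default plot first, then a basic bar chart
plot(seats)
barplot(seats)
# Bar chart with a fixed y-axis range and an axis label
barplot(seats,
        ylim = c(0,80),
        ylab = "Seats won")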

(16)

# Twitter followers and tweets for each party account
followers = c(36500, 37000, 31600, 34500); followers
tweets = c(8540, 7597, 23800, 5013); tweets
# Followers per tweet
fpt = followers/tweets
names(fpt) = c("@OntarioPCParty", "@OntarioNDP", "@OntLiberal", "@OntarioGreens")
fpt
barplot(fpt)
# Bar chart with a fixed y-axis range and an axis label
barplot(fpt,
        ylim = c(0,8),
        ylab = "Followers per tweet")

(20)

In a world awash with Big Data and Analytics, businesses and institutions are increasingly competing on analytics.

• For this, they need professionals skilled in data/statistical analysis.

McKinsey Global Institute estimates a shortage of hundreds of thousands of skilled data scientists.

• It’s time to say Hi to data science!

The workshop provides hands-on training in skills necessary to be proficient in a data-centric world.

Prerequisites:

(21)

Three dimensions

• Volume

• Variety

• Velocity

Gartner, Inc. defines big data in similar terms:
• “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for

(23)

Continuous variables Age

Income

Housing prices

Discrete or categorical variables Binary:

Gender .. Male/female Multinomial

Mode of travel … Walk, bike, drive, transit

Ordinal
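The variable types above map onto R's basic data structures. Here is a minimal sketch, assuming a tiny made-up survey: every value, the `travel_mode` and `satisfaction` names, and the Low/Medium/High scale are illustrative assumptions, not workshop data.

```r
# Hypothetical respondents; all values are invented for illustration only.
age    <- c(34, 51, 27, 45)                  # continuous
income <- c(52000, 88000, 39500, 61000)      # continuous

gender      <- factor(c("Female", "Male", "Female", "Male"))   # binary
travel_mode <- factor(c("Walk", "Transit", "Drive", "Bike"))   # multinomial

# Ordinal: an assumed satisfaction scale with a meaningful order
satisfaction <- factor(c("Low", "High", "Medium", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

survey <- data.frame(age, income, gender, travel_mode, satisfaction)

str(survey)                 # shows the type R assigned to each column
mean(survey$age)            # averages make sense for continuous variables
table(survey$travel_mode)   # counts make sense for categorical variables

# Ordered factors support comparisons; unordered factors do not
survey$satisfaction[1] < survey$satisfaction[2]
```

Storing categories as factors, rather than plain text, lets R treat them correctly later in tables, plots, and regression models.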

(24)

Type of Data
• Cross-sectional: measurements taken at one time period
   • e.g., students' course evaluations in a course
• Cross-sectional panel
   • Same student’s evaluation of different courses in a particular year
• Time series: data collected over time
   • e.g., unemployment rate, monthly retail sales
• Panel time series
   • Housing sales by month by

(25)

What do I need to know about the data others have collected?

• Who sponsored data collection?

• Cereal breakfasts are the best.

• Study funded by Kellogg

• Metadata: The data about data

• Questionnaire

• Data dictionaries

• Top line surveys

(27)

It’s powerful

It’s free

Extensive support documentation

It’s current

(28)

Recent advances

• R Commander from John Fox

• R integrated into Microsoft Excel

• R Through Excel

• Zelig

• RStudio

(35)

StatsR4US

(37)

THE SOLUTION FOR STATS HEADACHES
• We have taught a prescribed set of methods
• These methods were developed a hundred years ago
• We have not updated them
• We must
• The alternative approach
• Teach tools that business analysts, engineers, and researchers need
• Keep things simple
• Don’t follow the table of contents generated 100 years ago
• Start afresh
• Fit everything one needs in ten lines
• The Ten Commandments of Statistical Analysis in R
• Master 10 lines of code to do almost

(38)

Tasks
1. Open/Import a dataset
2. Review/Summarize it
3. Summarize by groups
4. Generate cross tabs
5. Generate some plots
6. Check the distribution
   1. Plot it
   2. Normal/t-distribution
7. Test your hypothesis
8. Chi-square test
9. Run regressions
10. Save data

Methods
1. read.csv( )
2. summary( )
3. tapply( )
4. table( )
5. plot( )
6. Working with 2 distributions
   1. hist( )
   2. pnorm( )
   3. pt( )
7. lm( )
8. chisq.test( )
9. lm( )
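To see how the tasks and methods above line up, here is a rough sketch that walks through them on `mtcars`, a dataset that ships with R. The choice of dataset, the variables used, and `write.csv( )` for the "save data" step are assumptions made for illustration; the slides do not name a method for step 10.

```r
# mtcars stands in for a dataset you would normally import from disk.
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file)                  # 10. save data (assumed method)

cars <- read.csv(csv_file, row.names = 1)    # 1. open/import a dataset
summary(cars)                                # 2. review/summarize it
tapply(cars$mpg, cars$cyl, mean)             # 3. summarize by groups
xt <- table(cars$cyl, cars$am)               # 4. generate cross tabs
xt
plot(cars$wt, cars$mpg)                      # 5. generate some plots
hist(cars$mpg)                               # 6.1 plot the distribution
pnorm(1.96)                                  # 6.2 normal: P(Z <= 1.96)
pt(1.96, df = 30)                            # 6.3 t-distribution, 30 df

# 7. test a hypothesis with a regression: does mpg differ by transmission?
fit_am <- lm(mpg ~ am, data = cars)
summary(fit_am)

# 8. chi-square test on the cross tab (small cell counts trigger a warning)
suppressWarnings(chisq.test(xt))

# 9. run a regression with more than one explanatory variable
fit <- lm(mpg ~ wt + hp, data = cars)
summary(fit)
```

Swapping in your own file name at the `read.csv( )` step is usually the only change needed to reuse the same sequence on another dataset.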

(39)

# Total fertility rate (average births per woman), Central America, 2012
TFR <- c(2.6, 1.9, 2.0, 3.3, 2.5, 2.3, 2.5)
names(TFR) <- c("Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Nicaragua", "Panama")
TFR <- sort(TFR)
# Widen the left margin so the country names fit
par(mar = c(5,8,4,2))
barplot(TFR, horiz = TRUE, col = "lightblue",
        border = "lightblue",
        main = "Central America, Fertility Rate 2012",
        xlab = "average births per woman",
        xlim = c(0,4), cex.lab = 1.4, cex.main = 1.7, cex.names = 1.4, las = 1)

(43)

Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the

(44)

Data from University of Texas
• 98 instructors
• 463 courses
• Teaching evaluation score registered by the students
The Beauty Panel
• 6 students ranked professors for beauty
Caveats
• Beauty evaluations done by a separate group
Teaching effectiveness might depend upon the following:
• Knowledge of the subject matter
• Eagerness and enthusiasm to transfer knowledge
• The ability to make the complex appear simple
• Respect for learners

(45)

Dataset:

TeachingRatings.rda

DESCRIPTION:

Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.

FORMAT:

A data frame containing 463 observations on 13 variables.

BEAUTY

Rating of the instructor’s physical appearance by a panel of six students, averaged across the six panelists, shifted to have a mean of zero.

EVAL

(46)

MINORITY (factor)

Does the instructor belong to a minority (non-Caucasian)?

AGE

Professor’s age

GENDER

Factor indicating instructor’s gender.

CREDITS (factor)

Is the course a single-credit elective (e.g., yoga, aerobics, dance)?

DIVISION (factor)

Is the course an upper or lower division course? (Lower division courses are mainly large freshman and sophomore courses.)

NATIVE (factor)

Is the instructor a native English speaker?

TENURE (factor)

Is the instructor on tenure track?

STUDENTS

Number of students that participated in the evaluation.

ALLSTUDENTS

Number of students enrolled in the course.

PROF

(49)

A report or a deliverable comprises

three necessary ingredients, namely:

(80)

Let’s Regress

(81)

If there is one tool analysts should be comfortable with, it's Regression

It’s the mother, the father, and the grandmother of all analytic tools.
• It can answer most questions that you’d put to other tools.
• It’s the state-of-the-art, even though it’s hundreds of years old.
• It is Regression.

Isn’t it odd that the tool you should be most familiar with is tucked away at the back of most texts on statistical analysis?
• You have to leaf through hundreds of pages to get to Regression Analysis.
• Well, let’s not regress anymore as we embrace Regression Analysis in all its glory!

Why Regress?
• Regression models are the bridge between traditional statistical analysis and modern-day data science.
• Regression can answer what t-tests, correlation tests, ANOVA, and other tools could answer.
• Regression models (including GLS) often offer better (more useful) forecasts than ANN and other data mining tools.

Regression can help you determine if good-looking instructors

(83)

• Sir Francis Galton & regression toward the mean
• Galton, F. (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263. http://www.jstor.org/pss/2841583.
• “the average regression of the offspring is a constant fraction of their respective mid-parental deviations”

(84)

QUESTIONS
• Do tall parents have tall children?
• Do women spend more on clothing than men?
• Do unmarried women spend more on clothing?
• Do married women with children spend less on clothing?
• Do households postpone buying expensive goods during a recession?
• Do low-income households postpone buying expensive goods during a recession?
• Do households with young children

(85)

• Do more women eat cereal for breakfast than men?
• Are single- or two-person households more likely than households with children to prefer living in a condominium/apartment?
• Do Canadian buyers prefer Japanese cars over American cars?
• Are couples postponing child rearing to later in their lives?

(86)

PARTIAL ANSWERS
Correlations
A positive correlation
• exists between obesity and suburban living
• exists between fertility and suburban living
• exists between poverty and fertility rates

(87)

OTHER FACTORS:
• A positive correlation exists between obesity and suburban living
   • What about age and income?
• A positive correlation exists between fertility and suburban living
   • What about moving to suburbs for cheaper housing to raise a family?
• A positive correlation exists between poverty and fertility rates

(88)

A positive correlation exists between summer months and drownings
Does summer cause drowning?

(90)

WHEN SEVERAL FACTORS ARE INVOLVED
Factors affecting spending on clothing by women:
• Age
• Income
• Type of career
• Marital status
• Spouse’s income
• Personal taste
• Children
• Location: Cultural differences, e.g., Chicoutimi vs. Montreal
• Location: Climate differences, e.g., Florida versus New York
• Location: proximity to clothing

(91)

ALL ELSE BEING EQUAL
How to isolate the impact of one particular factor by holding other influences constant.
What is the impact of proximity to public transit, if the size of the house, its structural type, quality of construction, neighbourhood amenities and ambience and other factors are the same?
All else being equal implies that

(93)

Have you ever travelled in a cab?
• Cabs are sooooo 2015!
• I know.

(94)

• $3.25: You sit in the cab and see a base fare that is constant
• $0.25 per 143 m: The fare increases with each additional 143 m at a certain rate
• $0.25 per additional 29 seconds: The fare increases for each 29 seconds of the taxi being idle
• $2.00: For each passenger in

(95)

Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers
Fare = 3.25 + 0.25 x Distance + 0.25 x Idle Time + 2 x Four_plus

A sample trip
3 people, 6 km, 4 minutes

Adjusting time and distance
6000 m / 143 m = 41.96 segments of 143 m
4(60) s / 29 s = 8.28 segments of 29 seconds
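The fare formula and sample trip above can be worked through in a few lines of R. This is a minimal sketch: the rates come from the slides, while the function name `taxi_fare` and its argument names are hypothetical and chosen only for illustration.

```r
# Meter rates taken from the slides
base_fare <- 3.25   # constant shown when you sit in the cab
dist_rate <- 0.25   # dollars per 143 m travelled
wait_rate <- 0.25   # dollars per 29 seconds
four_plus <- 2.00   # surcharge when there are four or more passengers

# taxi_fare() is a hypothetical helper, not something from the workshop files
taxi_fare <- function(distance_m, time_s, passengers) {
  base_fare +
    dist_rate * (distance_m / 143) +     # number of 143 m segments
    wait_rate * (time_s / 29) +          # number of 29 second segments
    four_plus * (passengers >= 4)        # TRUE counts as 1, FALSE as 0
}

# The sample trip: 3 people, 6 km, 4 minutes
fare <- taxi_fare(distance_m = 6000, time_s = 4 * 60, passengers = 3)
round(fare, 2)
```

With three passengers the surcharge term drops out, so the fare is the base fare plus the distance and time charges.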

(96)

Rates: $0.25 for 143 m; $0.25 for 29 seconds; $2 for 4+ pax

Fare ($)   Distance (km)   Time (seconds)   4+ pax
15.8       6               240              0
16.9       6.7             225              0
23.1       8.5             350              1
11.4       3.8             180              0
12.1       3.25            135              1

(97)

What if you don’t know the rates?

Regression estimates the unknown rates

Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers

$$y = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{time} + \beta_3 \cdot \text{pax} + \epsilon$$

(98)

Y is a function of X

Y (the dependent variable) is explained by other variables, X₁, X₂ (the explanatory variables)

Regression Notation

The betas in the above equation are the regression coefficients that explain the relationship between the dependent variable and the explanatory variables. Epsilon is the error term that captures what is not captured by the variables in the model.
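To make the notation concrete, here is a rough sketch of how `lm( )` could recover the taxi rates when they are treated as unknown. The distances, times, and passenger flags are the five trips from the fare table; the fares are recomputed from the meter formula without rounding, which is an assumption made here so that the fit is exact. The column names are hypothetical.

```r
# The five trips from the fare table in the slides
trips <- data.frame(
  distance_km = c(6, 6.7, 8.5, 3.8, 3.25),
  time_s      = c(240, 225, 350, 180, 135),
  pax4        = c(0, 0, 1, 0, 1)       # 1 if four or more passengers
)

# Assumption: unrounded fares generated from the meter formula.
# The slide's table shows the same fares rounded to the nearest $0.10.
trips$fare <- 3.25 +
  0.25 * (trips$distance_km * 1000 / 143) +
  0.25 * (trips$time_s / 29) +
  2 * trips$pax4

# y = b0 + b1 * distance + b2 * time + b3 * pax + error
fit <- lm(fare ~ distance_km + time_s + pax4, data = trips)
round(coef(fit), 4)
```

Because distance enters in kilometres and time in seconds, the estimated betas are per-km and per-second rates: the intercept should match the $3.25 base fare, the distance coefficient should equal 0.25 x 1000/143 (about $1.75 per km), the time coefficient 0.25/29 (under one cent per second), and the passenger coefficient the $2 surcharge. With real, noisy data the error term would no longer be zero and the estimates would only approximate the true rates.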