Murtaza Haider
Email: [email protected]
Professor of Real Estate Management, Ryerson University, Toronto
Author: Getting Started with Data Science
Syndicated Columnist: Financial Post
Blogger: Huffington Post
Instructor: www.CognitiveClass.ai
Education: Ph.D. in Transportation Engineering and Planning
Research focus:
Our address on the web
http://tinyurl.com/r-analytics
What you need for today
Have R & R Studio installed on your device
Do good looking people get higher salaries/promotions?
Hamermesh, Daniel S. and Amy M. Parker (2005). Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity. Economics of Education Review, August 2005. pp. 5-16.
Help with software
https://sites.google.com/site/statsr4us/intro/software
Learning R
https://sites.google.com/site/statsr4us/intro/software/learning-r
First Session in RCmdr
1. Morning session
   1. 10:00 – 10:30
      1. Welcome and Introductions
      2. What is Big Data?
   2. 10:30 – 11:00
      1. Introducing data
      2. Data types and structures
   3. 11:05 – 12:00
      1. Introducing R and RStudio
      2. Ten Commandments of Statistical Analysis
   4. 12:00 – 1:00
      1. Lunch
2. Afternoon session
   1. 1:00 – 2:00
      1. Making Summary Tables
   2. 2:05 – 3:00
      1. Graphic Details
   3. 3:10 – 4:30
      1. Regression Models: The mother of all analytics
      2. Hypothesis testing with Regression
      3. All Else Being Equal
   4. 4:45 – 5:00
1. It teaches you three things:
   1. How to summarize data in tables
   2. How to turn data into illustrations
Ontario held elections in June 2018
Let’s plot the
seats = c(76, 40, 7, 1)
seats
names(seats) = c("PC", "NDP", "Liberal", "Green")
seats
plot(seats)
barplot(seats)
barplot(seats,
        ylim = c(0, 80),
        ylab = "Seats won")
followers = c(36500, 37000, 31600, 34500); followers
tweets = c(8540, 7597, 23800, 5013); tweets
fpt = followers/tweets
names(fpt) = c("@OntarioPCParty", "@OntarioNDP", "@OntLiberal", "@OntarioGreens")
fpt
barplot(fpt)
barplot(fpt,
        ylim = c(0, 8),
        ylab = "Followers per tweet")
In a world awash with Big Data and Analytics, businesses and institutions are increasingly competing on analytics.
• For this, they need professionals skilled in data/statistical analysis.
McKinsey Global Institute estimates a shortage of hundreds of thousands of skilled data scientists.
• It’s time to say Hi to data science!
The workshop provides hands-on training in skills necessary to be proficient in a data-centric world.
Prerequisites:
Three dimensions
• Volume
• Variety
• Velocity
Gartner, Inc. defines big data in similar terms:
• “Big data is volume, velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for
Continuous variables
• Age
• Income
• Housing prices
Discrete or categorical variables
• Binary: Gender .. Male/female
• Multinomial: Mode of travel … Walk, bike, drive, transit
• Ordinal
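The variable types above map onto R's own data types. A minimal sketch, assuming a made-up four-row data set (every value and the `rating` variable are hypothetical, chosen only for illustration):

```r
# Hypothetical mini data set; all values are invented for illustration.
people <- data.frame(
  age    = c(23, 35, 47, 52),              # continuous
  income = c(32000, 58000, 71000, 64000),  # continuous
  gender = factor(c("Female", "Male", "Female", "Male")),  # binary
  mode   = factor(c("Walk", "Transit", "Drive", "Bike")),  # multinomial
  rating = factor(c("Low", "High", "Medium", "High"),
                  levels = c("Low", "Medium", "High"),
                  ordered = TRUE)          # ordinal: levels have an order
)

str(people)      # shows how R stores each column
summary(people)  # numeric summaries for continuous, counts for factors
```

Storing categorical variables as factors (and ordinal ones as ordered factors) lets `summary()`, `table()` and `lm()` treat them appropriately later on.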
Type of Data
Cross-Sectional – measurements taken at one time period
e.g., students' course evaluations in a course
Cross-Sectional Panel
Same student’s evaluation of different courses in a particular year
Time series – data collected over time
e.g., unemployment rate, monthly retail sales
Panel Time Series
Housing sales by month by
What do I need to know about the data others have collected?
• Who sponsored data collection?
• Cereal breakfasts are the best.
• Study funded by Kellogg
• Metadata: The data about data
• Questionnaire
• Data dictionaries
• Top line surveys
It’s powerful
It’s free
Extensive support documentation
It’s current
Recent advances
• R Commander from John Fox
• R integrated into Microsoft Excel
• R Through Excel
• Zelig
• R Studio
StatsR4US: THE SOLUTION FOR STATS HEADACHES
We have taught a prescribed set of methods
These methods were developed a hundred years ago
We have not updated them
We must
The alternative approach
Teach tools that business analysts, engineers, and researchers need
Keep things simple
Don’t follow the table of contents generated a hundred years ago
Start afresh
Fit everything one needs in ten lines
The Ten Commandments of Statistical Analysis in R
Master 10 lines of code to do almost
Tasks
1. Open/Import a dataset
2. Review/Summarize it
3. Summarize by groups
4. Generate cross tabs
5. Generate some plots
6. Check the distribution
   1. Plot it
   2. Normal/T-distribution
7. Test your hypothesis
8. Chi Square Test
9. Run Regressions
10. Save Data
Methods
1. read.csv( )
2. summary( )
3. tapply( )
4. table( )
5. plot( )
6. Working with 2 distributions
   1. hist( )
   2. pnorm( )
   3. pt( )
7. lm( )
8. chisq.test( )
9. lm( )
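The tasks and methods above can be strung together in a few lines. The sketch below is only one possible walk-through, assuming R's built-in `mtcars` data as a stand-in for your own file; the temporary CSV, the chosen variables and the use of `save( )` for step 10 are assumptions made for illustration, not part of the workshop material.

```r
# 1. Open/import a dataset. mtcars is first written to a temporary CSV
#    only so that read.csv() has a file to read.
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file, row.names = FALSE)
cars <- read.csv(csv_file)

# 2. Review/summarize it
summary(cars)

# 3. Summarize by groups: mean miles per gallon by number of cylinders
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)

# 4. Cross tab: cylinders by transmission type (am: 0 = automatic, 1 = manual)
cyl_by_am <- table(cars$cyl, cars$am)

# 5. Plot
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# 6. Check the distribution
hist(cars$mpg)
pnorm(1.96)         # area to the left of 1.96 under the standard normal
pt(1.96, df = 30)   # same area under a t-distribution with 30 df

# 7 and 9. Test a hypothesis / run a regression
fit <- lm(mpg ~ wt + am, data = cars)
summary(fit)

# 8. Chi-square test (small cell counts trigger a warning, silenced here)
chi <- suppressWarnings(chisq.test(cyl_by_am))

# 10. Save data (assumed method; the slide does not name one)
save(cars, file = tempfile(fileext = ".RData"))
```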
# Total fertility rates, Central America, 2012
TFR <- c(2.6, 1.9, 2.0, 3.3, 2.5, 2.3, 2.5)
names(TFR) <- c("Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Nicaragua", "Panama")
TFR <- sort(TFR)
par(mar = c(5, 8, 4, 2))  # wider left margin for the country names
barplot(TFR, horiz = TRUE, col = "lightblue",
        border = "lightblue",
        main = "Central America, Fertility Rate 2012", xlab = "average births per woman",
        xlim = c(0, 4), cex.lab = 1.4, cex.main = 1.7, cex.names = 1.4, las = 1)
Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity."
Data from University of Texas
• 98 instructors
• 463 courses
• Teaching evaluation score registered by the students
The Beauty Panel
• 5 students ranked professors for beauty
Caveats
• Beauty evaluations done by a separate group
Teaching effectiveness might depend upon the following:
• Knowledge of the subject matter
• Eagerness and enthusiasm to transfer knowledge
• The ability to make the complex appear simple
• Respect for learners
Dataset:
TeachingRatings.rda
DESCRIPTION:
Data on course evaluations, course characteristics, and professor
characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.
FORMAT:
A data frame containing 463 observations on 13 variables.
BEAUTY
Rating of the instructor’s physical appearance by a panel of six
students, averaged across the six panelists, shifted to have a mean of zero.
EVAL
MINORITY FACTOR
Does the instructor belong to a minority (non-Caucasian)?
AGE
Professor’s age
GENDER
Factor indicating instructor’s gender.
CREDITS FACTOR
Is the course a single-credit elective (e.g., yoga, aerobics, dance)?
DIVISION FACTOR
Is the course an upper or lower division course? (Lower division courses are mainly large freshman and sophomore courses.)
NATIVE FACTOR
Is the instructor a native English speaker?
TENURE FACTOR
Is the instructor on tenure track?
STUDENTS
Number of students that participated in the evaluation.
ALLSTUDENTS
Number of students enrolled in the course.
PROF
A report or a deliverable comprises
three necessary ingredients, namely:
If there is one tool analysts should be comfortable with, it's
Regression
It’s the mother, the father, and the grandmother of all analytic tools.
It can answer most questions that you’d put to other tools.
It’s the state-of-the-art, even though it’s hundreds of years old. It is Regression.
Isn’t it odd that the tool you should be most familiar with is tucked at the back of most texts on statistical analysis?
You have to leaf through hundreds of pages to get to Regression Analysis.
Well, let’s not regress anymore as we embrace Regression Analysis in all its glory!
Why Regress?
Regression Models are the bridge between traditional statistical analysis and modern-day data science.
Regression can answer what t-tests, correlation tests, ANOVA and other tools could answer.
Regression models (including GLS) often offer better (more useful) forecasts than ANN and other data mining tools.
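One way to see the overlap between regression and a t-test is to run both on the same comparison. The sketch below is an illustration only, assuming R's built-in `mtcars` data and an equal-variance t-test; it compares fuel economy for automatic and manual cars both ways.

```r
# Two-sample t-test with equal variances: mpg for automatic (am = 0)
# versus manual (am = 1) cars
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The same comparison as a regression on a 0/1 explanatory variable
fit <- lm(mpg ~ am, data = mtcars)
reg <- summary(fit)$coefficients

tt$statistic            # t statistic from the t-test
reg["am", "t value"]    # t statistic on the slope: same size, opposite sign
tt$p.value              # p-value from the t-test
reg["am", "Pr(>|t|)"]   # p-value on the slope: identical
```

The sign differs only because `t.test( )` reports group 0 minus group 1, while the regression slope is group 1 minus group 0. The regression version also extends naturally: adding more explanatory variables to the formula is all it takes to hold other factors constant.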
Regression can help you determine if good-looking instructors
Sir Francis Galton & Regression toward the mean
Galton, F. (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263. http://www.jstor.org/pss/2841583.
“the average regression of the offspring is a constant fraction of their respective mid-parental deviations”
QUESTIONS
Do tall parents have tall children?
Do women spend more on clothing than men?
Do unmarried women spend more on clothing?
Do married women with children spend less on clothing?
Do households postpone buying expensive goods during recession?
Do low-income households postpone buying expensive goods during recession?
Do households with young children
Do more women eat cereal for breakfast than men?
Do single- or two-person households prefer living in a condominium/apartment than households with children?
Do Canadian buyers prefer Japanese cars over American cars?
Are couples postponing child rearing to later in their lives?
PARTIAL ANSWERS
Correlations
A positive correlation exists between obesity and suburban living
A positive correlation exists between fertility and suburban living
A positive correlation exists between poverty and fertility rates
A positive correlation
OTHER FACTORS:
A positive correlation exists between obesity and suburban living
What about age and income?
A positive correlation exists between fertility and suburban living
What about moving to suburbs for cheaper housing to raise a family?
A positive correlation exists between poverty and fertility rates
A positive correlation exists between summer months and drownings
Does summer cause drowning?
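A small simulation can show how a correlation arises from an omitted factor. The sketch below is purely illustrative: the data are simulated, the names `other_factor`, `x` and `y` are hypothetical, and `x` has no effect on `y` by construction.

```r
set.seed(42)                 # arbitrary seed, for reproducibility only
n <- 1000

other_factor <- rnorm(n)               # drives both variables
x <- other_factor + rnorm(n)           # depends on the other factor
y <- 2 * other_factor + rnorm(n)       # also depends on it, but not on x

cor(x, y)                    # clearly positive, although x does not cause y

naive    <- lm(y ~ x)                  # ignores the other factor
adjusted <- lm(y ~ x + other_factor)   # holds the other factor constant

coef(naive)["x"]             # sizeable slope
coef(adjusted)["x"]          # close to zero once the other factor is included
```

The correlation is real, but it disappears as an "effect" once the common driver enters the model, which is the point of asking about other factors.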
WHEN SEVERAL FACTORS ARE INVOLVED
Factors affecting spending on clothing by women:
• Age
• Income
• Type of career
• Marital status
• Spouse’s income
• Personal taste
• Children
• Location: Cultural differences, e.g., Chicoutimi vs. Montreal
• Location: Climate differences: Florida versus New York
• Location: proximity to clothing
ALL ELSE BEING EQUAL
How to isolate the impact of one particular factor by holding other influences constant.
What is the impact of proximity to public transit, if the size of the house, its structural type, quality of construction, neighbourhood amenities and ambience and other factors are the same?
All else being equal implies that
Have you ever travelled in a cab?
• Cabs are sooooo 2015!
• I know.
• $3.25
You sit in the cab and see a base fare that is constant
• $0.25 per 143 m
The fare increases with each additional 143 m at a certain rate
• $0.25 per additional 29 seconds
The fare increases for each 29 seconds of taxi being idle
• $2.00
For each passenger in
Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers
Fare = 3.25 + 0.25 x Distance + 0.25 x Idle Time + 2 x Four_plus
A sample trip
3 people, 6 km, 4 minutes
Adjusting time and distance
6000 m / 143 m ≈ 42.0 segments of 143 m
4(60) s / 29 s ≈ 8.3 segments of 29 seconds
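The fare equation and the sample trip can be checked with a few lines of R. This is a minimal sketch using the rates quoted above; the function name `taxi_fare` and its argument names are hypothetical.

```r
# Rates as quoted on the slides
base_fare <- 3.25   # constant charge
dist_rate <- 0.25   # per 143 m segment
wait_rate <- 0.25   # per 29 second segment of idle time
four_plus <- 2.00   # surcharge when four or more passengers ride

# Hypothetical helper: fare for a trip given distance (km), idle time
# (seconds) and number of passengers
taxi_fare <- function(distance_km, time_sec, passengers) {
  dist_segments <- distance_km * 1000 / 143   # distance in 143 m segments
  time_segments <- time_sec / 29              # idle time in 29 s segments
  base_fare + dist_rate * dist_segments + wait_rate * time_segments +
    four_plus * (passengers >= 4)
}

# The sample trip: 3 people, 6 km, 4 minutes
round(taxi_fare(6, 4 * 60, 3), 1)   # 15.8
```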
Rates: $0.25 for 143 m, $0.25 for 29 seconds, $2 for 4+ pax

Fare     Distance (km)   Time (seconds)   4+ pax
$15.8    6               240              0
$16.9    6.7             225              0
$23.1    8.5             350              1
$11.4    3.8             180              0
$12.1    3.25            135              1
What if you don’t know the rates?
Regression estimates the unknown rates
Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers
$$ y = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{time} + \beta_3 \cdot \text{pax} + \epsilon $$
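To see regression estimate the unknown rates, one can generate fares from the known rates and then pretend the rates are unknown. The sketch below uses simulated trips (the number of trips, the ranges and the share of four-plus trips are all assumptions), with distance and time measured in 143 m and 29 second segments and no noise added to the fares.

```r
set.seed(1)                                # arbitrary seed
n <- 200                                   # number of simulated trips (assumption)

distance <- runif(n, 1, 15) * 1000 / 143   # trip length in 143 m segments
time     <- runif(n, 60, 600) / 29         # idle time in 29 second segments
pax      <- rbinom(n, 1, 0.2)              # 1 if four or more passengers

# Fares generated from the posted rates, with no noise added
fare <- 3.25 + 0.25 * distance + 0.25 * time + 2 * pax

# Regression sees only fares and trip characteristics, not the rates
fit <- lm(fare ~ distance + time + pax)
round(coef(fit), 2)    # recovers 3.25, 0.25, 0.25 and 2
```

The intercept plays the role of the base fare and each slope is a rate. With real meter readings the fares would be rounded and noisy, so the estimates would be close to the posted rates rather than exact.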
Y is a function of X
Y (the dependent variable) is explained by other variables, X1, X2 (the explanatory variables):
$$ Y = f(X_1, X_2) $$
Regression Notation
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon $$
The betas in the above equation are the regression coefficients that explain the relationship between the dependent variable and the explanatory variables. Epsilon is the error term that captures what is not captured by the variables in the model.
Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the