Murtaza Haider
THE BIG PLAN
FOR TODAY
Introductions
Who am I?
Who are you?
Why the book?
Why are you here?
Course Introduction
MY JOURNEY
WITH DATA
SCIENCE
Civil Engineering
Journalism
Nesbitt Burns
U of Toronto
Housing & Transportation
McGill
Infrastructure, land development &
logistics
Travel demand models
Ryerson
Supply chain
Housing
“This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most
importantly it teaches how to tell a story with data.”
— Thomas H. Davenport,
Distinguished Professor, Babson College; Research Fellow, MIT; author of Competing on Analytics and Big Data @ Work
How is this book different?
1. It’s not trying to turn you into a statistician
2. It repeats the important lessons
3. It believes analytics are performed to tell fascinating stories
4. It teaches you three things:
R and RStudio
WHAT IS
DATA
SCIENCE?
Data science is what data scientists do.
Who are data scientists?
I define a data scientist as someone who finds
solutions to problems by analyzing big or small data using appropriate tools and then tells stories to communicate her findings to the relevant stakeholders.
I do not use the data size as a restrictive
clause. Working with data below a certain arbitrary size threshold does not make one less of a data scientist.
Nor is my definition of a data scientist
restricted to particular analytic tools, such as machine learning.
As long as one has a curious mind, fluency in
analytics, and the ability to communicate the findings, I consider the person a data
scientist.
Harvard Business Review called data science
BUSINESS ANALYSTS/DATA SCIENTISTS
While the world is awash with large volumes of data,
inexpensive computing power, and vast amounts of digital
storage, the skilled workforce capable of analyzing data and
interpreting it is in short supply.
A 2011 McKinsey Global Institute report suggests that “the
WHO USES
DATA?
EVERYONE
Getting Started with Data Science (GSDS) is an applied text on
analytics written for
professionals like Chelsea Clinton who either perform or manage analytics for small and large corporations.
Ms. Clinton credited statistical analysis software (Stata) for helping her to “absorb
information more quickly and mentally sift through and
catalog it.”
The text is equally appealing to those who would like to develop skills in analytics to pursue a career as a data analyst
TEACHING
PHILOSOPHY
Unlike academic research, industry research delivers
reports that often have only three key ingredients:
Summary tabulations
Insightful graphics
Narrative
A review of the reports produced by the industry
leaders, such as PricewaterhouseCoopers, Deloitte, and large commercial banks, revealed that most used
simple analytics (i.e., summary tabulations and insightful graphics) to present data-driven findings.
Industry reports seldom highlighted advanced
statistical models or other similar techniques. Instead, they focused on creative prose that told stories from data.
GSDS appreciates the fact that most working analysts
will not be required to generate reports with advanced statistical methods, but instead will be expected to summarize data in tables and charts (graphics) and wrap these up in convincing narratives.
Thus, GSDS extensively uses graphs and tables to
THE STORYTELLING
DIFFERENTIATOR
This book is as much about
storytelling as it is about analytics. I
believe that a data scientist is a
person who uses data and analytics to
find solutions to problems, and then
uses the findings to tell the most
convincing and compelling story.
I believe that unless a data scientist is
willing to tell the story, she will remain
in a back office job where others will
use her analytics and findings to build
the narrative and receive praise, and
in time, promotions.
Storytelling is, in fact, the final and
most important stage of analytics.
Therefore, successful communication
of findings to stakeholders is as
BACK TO
STORYTELLING
Storytelling is equally important to the biggest big data firm in the world.
“Google has a very data-led culture. But we care just as much about the storytelling...”
Lorraine Twohill, who served as Google’s senior vice president of global marketing in 2014.
Twohill believes “getting the storytelling right— and having the substance and the authenticity in the storytelling— is as respected internally as [is] the return and the impact.”
“If you fail on the messaging and storytelling, all that those tools will get you are a lot of bad impressions.”
“[T]here is one very important aspect we look for, which perhaps differentiates a data analyst from other technologists. It exponentially improves their career prospects if they can match this technical, data-geek knowledge with great communication and presentation skills.”
America’s chief data scientist, D.J.
Patil.
White House Office of Science and
Technology Policy.
A “data scientist is that unique blend
of skills that can both unlock the
GROWING
DATA PAINS
Our digital footprint has expanded
rapidly over the past 10 years.
The size of the digital universe was
roughly 130 billion gigabytes in
2005.
SAP, a leader in data and analytics, reported
from a survey that 92% of the responding firms in its sample experienced a significant
increase in their data holdings.
At the same time, three-quarters identified
the need for new data science skills in their firms.
Accenture believes that the demand for data
scientists outstripped supply by 250,000 in 2015 alone.
A similar survey of 150 executives by KPMG in
2014 found that 85% of the respondents did not know how to analyze data.
“Most organizations are unable to connect
the dots because they do not fully
understand how data and analytics can transform their business.”
Alwin Magimay, head of digital and
WHAT’S IN
IT FOR YOU
Realizing the demand sooner than other universities, North Carolina State University launched a Master’s in Analytics degree in 2007.
Michael Rappa, director of the Institute for Advanced Analytics, told the New York Times that each of the 84 graduates of the class of 2012 received a job offer.
Those without experience earned $89,000 on average, and experienced graduates earned more than $100,000.
Galvanize: Its data science course runs for 12 weeks
Google Flu forecasts
Target predicting teenage pregnancies
Another example of big data hubris dates back to
1936, when Alfred Landon, a Republican, was contesting the American presidential election against F.D. Roosevelt.
The Literary Digest decided to survey 10 million
individuals, which constituted one-fourth of the electorate, about their choice of the presidential candidate.
The Digest compiled 2.4 million responses
received in the mail and claimed that Alfred Landon would win by a landslide. It predicted Landon would receive 55% of the vote, whereas Roosevelt would receive 41%.
F.D. Roosevelt won by a landslide by securing
61% of the votes.
George Gallup, a pollster, conducted a small
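The Digest's numbers can be checked with a little arithmetic, and a toy simulation illustrates why a huge mail-in sample can still mislead. This is a minimal sketch, not a reconstruction of the 1936 poll: the response rates by voter group and the sample sizes used in the simulation are made-up assumptions, chosen only to show how unequal non-response can bias an estimate.

```python
import random

# Figures quoted on the slide.
ballots_mailed = 10_000_000
responses_received = 2_400_000
predicted_roosevelt = 41   # percent, Digest forecast
actual_roosevelt = 61      # percent, election result

response_rate = responses_received / ballots_mailed
forecast_error = actual_roosevelt - predicted_roosevelt  # percentage points


def simulate_poll(n_contacted, true_share, rate_supporter, rate_opponent, rng):
    """Share of respondents backing a candidate when supporters and
    opponents answer the poll at different rates."""
    supporters = opponents = 0
    for _ in range(n_contacted):
        is_supporter = rng.random() < true_share
        rate = rate_supporter if is_supporter else rate_opponent
        if rng.random() < rate:
            if is_supporter:
                supporters += 1
            else:
                opponents += 1
    return supporters / (supporters + opponents)


rng = random.Random(1936)  # fixed seed so the run is repeatable

# ASSUMPTION: hypothetical response rates; supporters answer less often.
big_biased = simulate_poll(200_000, 0.61, 0.18, 0.34, rng)
# ASSUMPTION: a small sample in which everyone contacted responds.
small_random = simulate_poll(3_000, 0.61, 1.0, 1.0, rng)

print(f"Response rate: {response_rate:.0%}")
print(f"Digest missed Roosevelt's share by {forecast_error} points")
print(f"Large biased sample estimate: {big_biased:.3f}")
print(f"Small random sample estimate: {small_random:.3f}")
```

Under these assumed response rates, the large sample is expected to settle near 0.45 no matter how many people are contacted, while the small sample with full response stays close to the true 0.61. Sample size does not correct for who chooses to respond.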
Do attractive professors get better teaching evaluations?
Are religious individuals more or less likely to have extramarital affairs?
What motivates one to start smoking?
What determines housing prices more: lot size or the number of bedrooms?
How do teenagers and older people differ in the way they use social media?
Who is more likely to use online dating services?
Why do some people purchase iPhones and others Blackberries?
Data from University of Texas
98 instructors
463 courses
Teaching evaluation score registered by the students
The Beauty Panel
Six students ranked professors for beauty
Caveats
Beauty evaluations were done by a separate group
Teaching effectiveness might depend upon the following:
Knowledge of the subject matter
Eagerness and enthusiasm to transfer knowledge
The ability to make the complex appear simple
Respect for learners
Dataset: TeachingRatings.rda
DESCRIPTION:
Data on course evaluations, course characteristics, and professor
characteristics for 463 courses for the academic years 2000-2002 at the
University of Texas at Austin.
FORMAT:
A data frame containing 463 observations on 13 variables.
BEAUTY
Rating of the instructor’s physical appearance by a panel of six
students, averaged across the six panelists, shifted to have a mean of
zero.
EVAL
MINORITY FACTOR
Does the instructor belong to a
minority (non-Caucasian)?
AGE
Professor’s age
GENDER
Factor indicating instructor’s gender.
CREDITS FACTOR
Is the course a single-credit elective
(e.g., yoga, aerobics, dance)?
DIVISION FACTOR
Is the course an upper or lower
division course? (Lower division
courses are mainly large freshman and sophomore courses.)
NATIVE FACTOR
Is the instructor a native English
speaker?
TENURE FACTOR
Is the instructor on tenure track?
STUDENTS
Number of students that participated
in the evaluation.
ALLSTUDENTS
Number of students enrolled in the
course.
PROF
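The BEAUTY, STUDENTS, and ALLSTUDENTS definitions above describe simple transformations that are easy to reproduce by hand. The sketch below is in Python purely for illustration (the course tools are SPSS and R). The instructor names, the panel scores, and the helper functions `centered_beauty` and `evaluation_response_rate` are all hypothetical, and the sketch assumes the mean-zero shift is taken across instructors.

```python
from statistics import mean

# HYPOTHETICAL raw scores: six panelist ratings per instructor.
raw_ratings = {
    "instructor_a": [4, 5, 3, 4, 6, 5],
    "instructor_b": [7, 6, 8, 7, 6, 7],
    "instructor_c": [2, 3, 3, 4, 2, 3],
}


def centered_beauty(ratings):
    """Average each instructor's six panel scores, then subtract the
    grand mean so the resulting variable has a mean of zero."""
    averages = {name: mean(scores) for name, scores in ratings.items()}
    grand_mean = mean(list(averages.values()))
    return {name: avg - grand_mean for name, avg in averages.items()}


def evaluation_response_rate(students, allstudents):
    """Share of enrolled students who took part in the evaluation."""
    return students / allstudents


beauty = centered_beauty(raw_ratings)
for name, score in beauty.items():
    print(f"{name}: {score:+.2f}")

# HYPOTHETICAL course: 24 of 43 enrolled students filled in the evaluation.
print(f"Response rate: {evaluation_response_rate(24, 43):.0%}")
```

A positive BEAUTY value then means an instructor was rated above the panel's overall average, and a negative value means below it. Comparing STUDENTS with ALLSTUDENTS shows how much of each class is actually represented in the evaluation score.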
We will use the following tools
SPSS
R
R Commander
RStudio
Data Scientist Workbench
Help with software
https://sites.google.com/site/statsr4us/intro/software
Learning R