Welcome
Who am I?
Why are You Here?
Introducing Big Data
The Learning Path
1. Morning session
   1. 9:00 – 10:00
      1. Welcome and Introductions
      2. What is Big Data?
   2. 10:00 – 11:00
      1. Introducing R and RStudio
      2. Ten Command(ments) of Statistical Analysis
   3. 11:00 – 12:00
      1. Introducing data
      2. Data types and structures
      3. Making Summary Tables
   4. 12:00 – 1:00
      1. Lunch
2. Afternoon session
   1. 1:00 – 2:00
      1. Graphic Details
   2. 2:15 – 4:45
      1. Regression Models: The mother of all analytics
      2. Hypothesis testing with Regression
      3. All Else Being Equal
   3. 4:45 – 5:00
“This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most
importantly it teaches how to tell a story with data.”
— Thomas H. Davenport,
Distinguished Professor, Babson College; Research Fellow, MIT; author of Competing on Analytics and Big Data @ Work
Murtaza Haider
How is this book different?
1. It’s not trying to turn you into a statistician
2. It repeats the important lessons
3. It believes analytics are performed to tell fascinating stories
4. It teaches you three things:
Who are you?
What do you know?
• In a world awash with Big Data and Analytics, businesses and
institutions are increasingly competing on analytics. For this, they need professionals skilled in data/statistical analysis.
• McKinsey Global Institute estimates a shortage of hundreds of
thousands of skilled data scientists.
• It’s time to say Hi to data science!
• The workshop provides hands-on training in skills necessary to
be proficient in a data-centric world.
• Prerequisites: Curiosity, high-school math, prescribed book, a
laptop computer, and willingness to learn R.
Our address on the web
http://tinyurl.com/r-analytics
What you need for today
Have R and RStudio installed on your device
Do good looking people get higher salaries/promotions?
Hamermesh, Daniel S. and Amy M. Parker (2005). Beauty in the
Classroom: Instructors' Pulchritude and Putative Pedagogical
Size is the first, and at times, the only dimension that leaps out at
the mention of big data.
We offered a broader definition of big data that captures its other
unique and defining characteristics.
The rapid evolution and adoption of big data by industry has
leapfrogged the discourse to popular outlets, forcing the academic press to catch up.
A particular distinguishing feature of this paper is its focus on
analytics related to unstructured data, which constitute 95% of big data.
This paper highlights the need to develop appropriate and
efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats.
The heterogeneity, noise, and the massive size of structured big
Three dimensions
Volume Variety Velocity
Gartner, Inc. defines big data in similar terms:
“Big data is high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
TechAmerica Foundation defines big data as follows:
“Big data is a term that describes large volumes of high velocity,
complex and variable data that require advanced techniques and
o Our digital footprint has expanded rapidly over the past 10 years.
o The size of the digital universe was roughly 130 billion gigabytes in 1995.
o By 2020, this number will swell to 40 trillion gigabytes.
o Companies will compete for hundreds of thousands, if not millions, of new workers needed to navigate the digital world.
o No wonder the prestigious Harvard Business Review called data
o A report by the McKinsey Global Institute warns of huge talent shortages for data and analytics.
o By 2018, the United States alone could face a shortage of
   o 140,000 to 190,000 people with deep analytical skills
   o 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
o SAP reported from a survey that 92% of the responding firms in its sample experienced a significant increase in their data holdings.
o At the same time, three-quarters identified the need for new data science skills in their firms.
o Accenture believes that the demand for data scientists may outstrip supply by 250,000 in 2015 alone.
It’s powerful
It’s free
Extensive support documentation
It’s current
Recent advances
R Commander from John Fox
R integrated into Microsoft Excel
R Through Excel
Zelig
RStudio
StatsR4US
THE SOLUTION FOR STATS HEADACHES
We have taught a prescribed set of methods
These methods were developed a hundred years ago
We have not updated them
We must
The alternative approach
Teach tools that business analysts, engineers, and researchers need
Keep things simple
Don’t follow the table of contents generated 100 years ago
Start afresh
Fit everything one needs in ten lines
The Ten Commandments of Statistical Analysis in R
Master 10 lines of code to do TASKS NOT METHODS
Here are the tasks we need to perform:
Tasks
1. Open/Import a dataset
2. Review/Summarize it
3. Summarize by groups
4. Generate cross tabs
5. Generate some plots
6. Check the distribution
   1. Plot it
   2. Normal/T-distribution
7. Test your hypothesis
8. Chi Square Test
9. Run Regressions
10. Save Data
Methods
1. read.csv( )
2. summary( )
3. tapply( )
4. table( )
5. plot( )
6. Working with 2 distributions
   1. hist( )
   2. pnorm( )
   3. pt( )
7. lm( )
8. chisq.test( )
9. lm( )
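The tasks and methods above can be strung together in a short R session. The following is only a sketch under stated assumptions: it uses the built-in mtcars data set rather than the workshop data, it writes a temporary CSV so that read.csv( ) has something to open, it uses base R's summary( ) for task 2, and it uses write.csv( ) for task 10, for which the slide names no function. The variable names are hypothetical.

```r
# Keep plots off the screen so the sketch also runs non-interactively.
pdf(NULL)

# 1. Open/import a dataset. mtcars ships with R; it is written to a
#    temporary CSV first so that read.csv() has a file to read.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)
cars <- read.csv(csv_path)

# 2. Review/summarize it.
print(summary(cars))

# 3. Summarize by groups: mean miles per gallon by number of cylinders.
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)

# 4. Generate a cross tab: cylinders by transmission type.
cyl_by_am <- table(cars$cyl, cars$am)

# 5. Generate a plot.
plot(cars$wt, cars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# 6. Check the distribution: plot it, then look up normal and t probabilities.
hist(cars$mpg)
p_normal <- pnorm(1.96)
p_t <- pt(1.96, df = 30)

# 7 and 9. Test a hypothesis and run a regression with lm().
fit <- lm(mpg ~ wt, data = cars)

# 8. Chi-square test on the cross tab (small cell counts trigger a warning).
chi <- suppressWarnings(chisq.test(cyl_by_am))

# 10. Save data (write.csv is an assumption; the slide names no function).
out_path <- tempfile(fileext = ".csv")
write.csv(cars, out_path, row.names = FALSE)

dev.off()
```

Each numbered comment maps to one task; in practice the temporary file would be replaced by the path to your own CSV.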
Data is (are) to Data Science what air is to humans.
Without data, or air for that matter, there is no Data Science.
Learning opportunities in Data Science, though, suffer from a major drawback.
Instructors routinely assume that whatever data sets are needed are either available, or could be made available, to learners.
A bigger constraint is that instructors assume learners know how to deal with data, and so they embark directly on analytics.
Ready to go
Most new, and some experienced, learners do not know where to look for, or how to deal with, data.
They do not even know whether data is singular or plural. Datum is singular, data are plural.
Today, we will embrace data in all its imperfections.
[Chart: share of effort (0% to 100%) by task: Finding data 10, Cleaning data 35, Wrangling data 20, Analytics 10, Storytelling 25]
* Chart is not based on real data.
Algorithmics or (al-Khwārizmī)
By the time you are halfway through this paragraph: 2.5 million
Facebook users would have exchanged contents online. Google would have received more than 4 million search requests. More than 200 million email messages would have flown over the
Internet and some 275,000 tweets would have been heard.
Never before in the history of humankind have we been able to
generate a living history of ourselves. In the process, we are creating new data of immense size and scope.
It is indeed a transformative change to see that within a few
decades we have moved from complaining about the lack of data to a data deluge.
Population data
Survey data
Census data
Time series data
Panel data
Panel-time series data
Unstructured data
Audio and video data
Big data
Continuous variables
   Age
   Income
   Housing prices
Discrete or categorical variables
   Gender: male/female (binary)
   Mode of travel: walk, bike, drive, transit (multinomial)
   Number of children: none, one, two, three, four or more (ordinal)
Type of Data
Cross-sectional: measurements taken at one time period
   e.g., students' course evaluations in a course
Cross-sectional panel
   Same student’s evaluation of different courses in a particular year
Time series: data collected over time
   e.g., unemployment rate, monthly retail sales
Panel time series
Number of Variables
Univariate
   data consisting of a single variable to measure some entity
Multivariate
Categorical (nominal)
   data sorted into mutually exclusive categories (an observation cannot belong to more than one category)
   Geographical region, type of employee, gender, state of birth, type of automobile owned
   In R, categorical data are referred to as factors.
   Properties
Ordinal data
   Data ordered or ranked according to some relationship to one another
   Number of cars owned by a household
   Properties
      Categories can be compared with one another
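A small sketch of how nominal and ordinal data might be stored in R. The values below are made up for illustration, and the variable names are hypothetical; factor( ) with ordered = TRUE is base R.

```r
# Nominal (categorical) data: categories have no inherent order.
mode_of_travel <- factor(c("walk", "bike", "drive", "transit", "drive"))
print(levels(mode_of_travel))     # levels are stored alphabetically

# Ordinal data: the order of the levels is declared explicitly.
children <- factor(
  c("none", "two", "one", "four or more", "none"),
  levels  = c("none", "one", "two", "three", "four or more"),
  ordered = TRUE
)

# Ordered categories can be compared with one another.
two_exceeds_one <- children[2] > children[3]

# Frequency table, reported in level order (including the empty level).
print(table(children))
```

Comparisons such as `>` are meaningful only for the ordered factor; on an unordered factor R returns NA with a warning.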
Open data
   Government data
   World Bank, UN, IMF, others
   FRED, QUANDL
   Census agencies
   Pew, ICPSR
   Others
Closed data
Open data refers to the idea that liberating all sorts of data, so that a larger community can subsequently use them, will benefit the human race.
Other similar movements advocating for free access to software
(open source), and publications and other media (open content) work on the same principle.
A widely held belief in the development community is that free
data assists in making informed decisions about human
development. In the case of developing countries, open data
Hans Rosling, the promoter-in-chief for data and analytics. Dr. Rosling is a Swedish medical doctor whose passionate and
animated presentations about analytics spiced up the dialogue on data-driven analytics.
He manages Gapminder, a not-for-profit agency that promotes the use of data and statistics to achieve global development and the United Nations’ Millennium Development Goals.
Gapminder and other groups like it took data liberation to
another level by making data visualization and analytics available through the Internet.
Examples:
www.Data.gov
www.Data.gov.uk
It follows that because there is liberated or open data, there is
also caged or proprietary data.
Governments, businesses, and others hold on to the vast majority
of data that exists today.
Most of such data sets will never be made public because they
are as valuable to enterprise profitability as any other resource.
In fact, Microsoft calls data the new natural resource, equally
From data starvation to data deluge, the shift was immediate and
massive in magnitude.
Those who saw value in liberating data and making it easier for
the rest of us to access and use it have facilitated the new dawn of the age of data.
These entities have made huge strides in making the data freely
FRED
Toronto-based Quandl is fast becoming one of the biggest repositories of data.
At present, it is similar to FRED because most available data sets are economic time series. However, Quandl aims to be much more than a disburser of time series data.
Google Trends Correlate
Nike
Pew Research Data
Inter-University Consortium for Political and Social Research
A repository of social science survey data.
“ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.”
More than 700 universities and similar institutions are members
of the consortium, which holds approximately 500,000 data sets.
The data sets include information on education, aging, criminal
What do I need to know about the data others have collected?
Who sponsored data collection?
   Cereal breakfasts are the best. Study funded by Kellogg
Metadata: The data about data
   Questionnaire
   Data dictionaries
   Top line surveys
Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the
Classroom: Instructors' Pulchritude and Putative Pedagogical
Data from University of Texas
   98 instructors
   463 courses
   Teaching evaluation score registered by the students
The Beauty Panel
   5 students ranked professors for beauty
Caveats
   Beauty evaluations done by a separate group
   Teaching effectiveness might depend upon the following:
      Knowledge of the subject matter
      Eagerness and enthusiasm to transfer knowledge
      The ability to make complex appear simple
      Respect for learners
Dataset:
TeachingRatings.rda
DESCRIPTION:
Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.
FORMAT:
A data frame containing 463 observations on 13 variables.
BEAUTY
   Rating of the instructor’s physical appearance by a panel of six students, averaged across the six panelists, shifted to have a mean of zero.
EVAL
   Course overall teaching evaluation score, on a scale of 1 (very
MINORITY (factor)
   Does the instructor belong to a minority (non-Caucasian)?
AGE
   Professor’s age
GENDER
   Factor indicating instructor’s gender.
CREDITS (factor)
   Is the course a single-credit elective (e.g., yoga, aerobics, dance)?
DIVISION (factor)
   Is the course an upper or lower division course? (Lower division courses are mainly large freshman and sophomore courses.)
NATIVE (factor)
   Is the instructor a native English speaker?
TENURE (factor)
   Is the instructor on tenure track?
STUDENTS
   Number of students that participated in the evaluation.
ALLSTUDENTS
   Number of students enrolled in the course.
PROF
We will use the following tools
   R
   R Commander
   RStudio
   Data Scientist Workbench
Help with software
https://sites.google.com/site/statsr4us/intro/software
Learning R
https://sites.google.com/site/statsr4us/intro/software/learning-r
First Session in R Cmdr
Murtaza Haider
Email: [email protected]
1. Morning session
   1. 10:00 – 11:00
      1. Welcome and Introductions
      2. What is Big Data?
   2. 11:00 – 12:00
      1. Introducing R and RStudio
      2. Ten Command(ments) of Statistical Analysis
   3. 12:00 – 1:00
      1. Introducing data
      2. Data types and structures
      3. Making Summary Tables
   4. 1:00 – 2:00
      1. Lunch
2. Afternoon session
   1. 2:00 – 3:00
      1. Graphic Details
   2. 3:00 – 4:45
      1. Regression Models: The mother of all analytics
      2. Hypothesis testing with Regression
      3. All Else Being Equal
   3. 4:45 – 5:00
A report or a deliverable comprises three necessary ingredients,
namely:
Graphics Tables Narrative
Most books on statistics and analytics do not dedicate space to
the fundamentals of summarizing data in tables. Such books are full of tables. However, these books assume that their readers already know how to generate the tables and, more importantly, format them adequately to serve as effective tools for
communicating findings to their intended audiences.
Reality: most tables fail, completely or partially, to communicate the findings to the intended audience.
Recall how many times you have seen some summary statistics
Let us begin by calculating descriptive statistics for the
continuous variables in the data set.
Note that I am relying on the describe and describeBy
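The describe and describeBy functions mentioned above are presumably those of the psych package (an assumption, since the slide is cut off). As a rough base-R stand-in that needs no extra packages, the sketch below computes similar descriptive statistics for the continuous variables of the built-in iris data set, first overall and then by group. The helper describe_basic is hypothetical and reports far fewer statistics than a dedicated package would.

```r
# Keep only the continuous (numeric) columns.
continuous <- iris[, sapply(iris, is.numeric)]

# Hypothetical helper: n, mean, standard deviation, min, and max of a vector.
describe_basic <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# One row per variable.
overall <- t(sapply(continuous, describe_basic))
print(round(overall, 2))

# The same statistics for one variable, split by a grouping factor.
by_species <- t(sapply(split(iris$Sepal.Length, iris$Species), describe_basic))
print(round(by_species, 2))
```

Rounding is applied only when printing, so the stored tables keep full precision for later formatting.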
If there is one tool analysts should be comfortable with, it's Regression
It’s the mother, the father, and the grandmother of all analytic tools.
It can answer most questions that you’d put to other tools.
It’s the state-of-the-art, even though it’s hundreds of years old.
It is Regression.
Isn’t it odd that the tool you should be most familiar with is tucked at the back of most texts on statistical analysis?
You have to leaf through hundreds of pages to get to Regression Analysis.
Well, let’s not regress anymore as we embrace Regression Analysis in all its glory!
Why Regress?
Regression models are the bridge between traditional statistical analysis and modern-day data science.
Regression can answer what t-tests, correlation tests, ANOVA, and other tools can answer.
Regression models (including GLS) often offer better (more useful) forecasts than ANN and other data-mining tools.
Regression can help you determine if good-looking instructors
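The claim that regression can answer what a t-test answers can be checked on a small example. The sketch below assumes the built-in mtcars data set (not the workshop data) and compares a pooled-variance two-sample t-test with a regression on a 0/1 indicator; the variable names are hypothetical.

```r
# Two-sample t-test with pooled variance: mpg by transmission type (am = 0/1).
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The same comparison written as a regression on the 0/1 indicator.
fit <- lm(mpg ~ am, data = mtcars)
reg <- summary(fit)$coefficients

# The p-value of the t-test matches the p-value on the am coefficient.
p_ttest      <- tt$p.value
p_regression <- reg["am", "Pr(>|t|)"]

# The regression slope is the difference between the two group means.
slope <- reg["am", "Estimate"]
gap   <- unname(diff(tt$estimate))

print(c(p_ttest = p_ttest, p_regression = p_regression))
print(c(slope = slope, gap = gap))
```

The equivalence holds for the equal-variance version of the t-test; R's default Welch test (var.equal = FALSE) gives a slightly different p-value.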
Sir Francis Galton & Regression toward the mean
Galton, F. (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263. http://www.jstor.org/pss/2841583.
“the average regression of the offspring is a
constant fraction of their respective mid-parental deviations”
“So if its parents are each two inches taller
QUESTIONS
Do tall parents have tall children?
Do women spend more on clothing than men?
Do unmarried women spend more on clothing?
Do married women with children spend less on clothing?
Do households postpone buying expensive goods during a recession?
Do low-income households postpone buying expensive goods during a recession?
Do households with young children
Do more women eat cereal for breakfast than men?
Do single- or two-person households prefer living in a condominium/apartment more than households with children do?
Do Canadian buyers prefer Japanese cars over American cars?
Are couples postponing child rearing to later in their lives?
PARTIAL ANSWERS
Correlations
A positive correlation exists between obesity and suburban living
A positive correlation exists between fertility and suburban living
A positive correlation exists between poverty and fertility rates
A positive correlation
OTHER FACTORS:
A positive correlation exists between obesity and suburban living
   What about age and income?
A positive correlation exists between fertility and suburban living
   What about moving to suburbs for cheaper housing to raise a family?
A positive correlation exists between poverty and fertility rates
   What about the high mortality
A positive correlation exists between summer months and drownings
   Does summer cause drowning
WHEN SEVERAL FACTORS ARE INVOLVED
Factors affecting spending on clothing by women:
   Age
   Income
   Type of career
   Marital status
   Spouse’s income
   Personal taste
   Children
   Location: Cultural differences, e.g., Chicoutimi vs. Montreal
   Location: Climate differences: Florida versus New York
   Location: proximity to clothing
ALL ELSE BEING EQUAL
How to isolate the impact of one particular factor by holding other influences constant.
What is the impact of proximity to public transit, if the size of the house, its structural type, quality of construction, neighbourhood amenities and ambience and other factors are the same?
All else being equal implies that
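One way to see what holding other influences constant does is a small simulation. The sketch below uses made-up data, not anything from the workshop: a hypothetical third factor drives both the factor of interest and the outcome, so the two are correlated even though the factor of interest has no effect of its own. All names and numbers are assumptions chosen for illustration.

```r
set.seed(42)
n <- 1000

# Simulated data: other_factor drives both variables.
other_factor       <- rnorm(n)
factor_of_interest <- other_factor + rnorm(n)
outcome            <- 2 * other_factor + rnorm(n)   # no direct effect of factor_of_interest

# Naive regression: the factor of interest appears to matter.
naive <- coef(lm(outcome ~ factor_of_interest))[["factor_of_interest"]]

# All else being equal: include the other factor in the model.
adjusted <- coef(lm(outcome ~ factor_of_interest + other_factor))[["factor_of_interest"]]

print(round(c(naive = naive, adjusted = adjusted), 2))
```

In this setup the naive slope should land near 1 while the adjusted slope should land near 0, which is the sense in which adding the other factor to the regression isolates the impact of a single factor.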
Have you ever travelled in a cab?
• Cabs are sooooo 2015!
• I know.
• $3.25
   You sit in the cab and see a base fare that is constant
• $0.25 per 143 m
   The fare increases with each additional 143 m at a certain rate
• $0.25 per additional 29 seconds
   The fare increases for each 29 seconds of taxi being idle
• $2.00
Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers
Fare = 3.25 + 0.25 x Distance + 0.25 x Idle Time + 2 x Four_plus
(Distance is counted in segments of 143 m and Idle Time in segments of 29 seconds.)
A sample trip
   3 people
   6 km
   4 minutes
Adjusting time and distance
   6000 m / 143 = 42.0 segments of 143 m
   4(60)/29 = 8.3 segments of 29 seconds
Rates: $0.25 for 143 m; $0.25 for 29 seconds; $2 for 4+ pax

Fare      Distance (km)   Time (seconds)   4+ pax
$ 15.8    6               240              0
$ 16.9    6.7             225              0
$ 23.1    8.5             350              1
$ 11.4    3.8             180              0
$ 12.1    3.25            135              1
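The fare rule and the table above can be reproduced with a few lines of R. This is a sketch under the rates stated on the slides ($3.25 base, $0.25 per 143 m, $0.25 per 29 seconds idle, $2 for four or more passengers); the function name taxi_fare and its argument names are hypothetical.

```r
# Hypothetical helper: fare from distance (km), idle time (seconds),
# and a 0/1 indicator for four or more passengers.
taxi_fare <- function(distance_km, idle_seconds, four_plus) {
  distance_segments <- distance_km * 1000 / 143   # segments of 143 m
  idle_segments     <- idle_seconds / 29          # segments of 29 seconds
  3.25 + 0.25 * distance_segments + 0.25 * idle_segments + 2 * four_plus
}

# The sample trip: 3 people, 6 km, 4 minutes idle.
sample_fare <- taxi_fare(distance_km = 6, idle_seconds = 4 * 60, four_plus = 0)

# The five trips from the table.
trips <- data.frame(
  distance_km  = c(6, 6.7, 8.5, 3.8, 3.25),
  idle_seconds = c(240, 225, 350, 180, 135),
  four_plus    = c(0, 0, 1, 0, 1)
)
trips$fare <- round(
  taxi_fare(trips$distance_km, trips$idle_seconds, trips$four_plus), 1
)
print(trips)
```

Rounded to one decimal place, the computed fares match the Fare column of the table.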
What if you don’t know the rates?
Regression estimates the unknown rates
Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers
$y = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{time} + \beta_3 \cdot \text{pax} + \epsilon$
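To illustrate how regression estimates the unknown rates, the sketch below rebuilds the five trips from the slides, computes their fares from the stated meter rates, and then lets lm( ) recover those rates as if they were unknown. It is only a sketch: the fares are kept unrounded so that the recovery is exact, whereas with the one-decimal fares shown in the table and only five trips the estimates would be approximate. The column names are hypothetical.

```r
# The five trips from the slides.
trips <- data.frame(
  distance_km  = c(6, 6.7, 8.5, 3.8, 3.25),
  idle_seconds = c(240, 225, 350, 180, 135),
  four_plus    = c(0, 0, 1, 0, 1)
)

# Express distance and time in the meter's own units.
trips$distance_segments <- trips$distance_km * 1000 / 143
trips$idle_segments     <- trips$idle_seconds / 29

# Unrounded fares generated from the stated rates.
trips$fare <- 3.25 + 0.25 * trips$distance_segments +
  0.25 * trips$idle_segments + 2 * trips$four_plus

# Pretend the rates are unknown and estimate them.
fit <- lm(fare ~ distance_segments + idle_segments + four_plus, data = trips)
estimates <- round(coef(fit), 2)
print(estimates)
```

The intercept plays the role of the base fare ($\beta_0$), and the three slopes correspond to the distance rate, the waiting rate, and the four-plus surcharge.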
© Murtaza Haider, 2016
Notation
Y is a function of X
Y (the dependent variable) is explained by other variables, $X_1$, $X_2$ (the explanatory variables)
$Y = f(X_1, X_2)$
Regression Notation
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$
The betas in the above equation are the regression coefficients that explain the relationship between the dependent variable and explanatory variables.
Epsilon is the error term that captures what is not captured by the variables in the model.