ADaPT slidedeck 2018

(1)

Murtaza Haider

Email: [email protected]

Session - 1

(2)

Welcome Who am I?

Why are You Here? Introducing Big Data The Learning Path

(3)

1. Morning session
   1. 9:00 – 10:00
      1. Welcome and Introductions
      2. What is Big Data?
   2. 10:00 – 11:00
      1. Introducing R and RStudio
      2. Ten Command(ments) of Statistical Analysis
   3. 11:00 – 12:00
      1. Introducing data
      2. Data types and structures
      3. Making Summary Tables
   4. 12:00 – 1:00
      1. Lunch
2. Afternoon session
   1. 1:00 – 2:00
      1. Graphic Details
   2. 2:15 – 4:45
      1. Regression Models: The mother of all analytics
      2. Hypothesis testing with Regression
      3. All Else Being Equal
   3. 4:45 – 5:00

(4)

(5)

(6)

(7)

“This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”

— Thomas H. Davenport,

Distinguished Professor, Babson College; Research Fellow, MIT; author of Competing on Analytics and Big Data @ Work

Murtaza Haider

How is this book different?

1. It’s not trying to turn you into a statistician
2. It repeats the important lessons
3. It believes analytics are performed to tell fascinating stories
4. It teaches you three things:

(8)

(9)

• Who are you?
• What do you know?

(10)

(11)

(12)

(13)

(14)

(15)

(16)

(17)

(18)

• In a world awash with Big Data and Analytics, businesses and institutions are increasingly competing on analytics. For this, they need professionals skilled in data/statistical analysis.

• McKinsey Global Institute estimates a shortage of hundreds of thousands of skilled data scientists.

• It’s time to say Hi to data science!

• The workshop provides hands-on training in skills necessary to be proficient in a data-centric world.

• Prerequisites: Curiosity, high-school math, prescribed book, a laptop computer, and willingness to learn R.

(19)

Our address on the web
http://tinyurl.com/r-analytics

What you need for today
Have R & RStudio installed on your device

Do good-looking people get higher salaries/promotions?

Hamermesh, Daniel S. and Amy M. Parker (2005). Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical

(20)

(21)

Size is the first, and at times, the only dimension that leaps out at the mention of big data.

We offered a broader definition of big data that captures its other unique and defining characteristics.

The rapid evolution and adoption of big data by industry has leapfrogged the discourse to popular outlets, forcing the academic press to catch up.

A particular distinguishing feature of this paper is its focus on analytics related to unstructured data, which constitute 95% of big data.

This paper highlights the need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats.

The heterogeneity, noise, and the massive size of structured big

(22)

Three dimensions
• Volume
• Variety
• Velocity

Gartner, Inc. defines big data in similar terms:

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

TechAmerica Foundation defines big data as follows:

“Big data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and

(23)

(24)

(25)

(26)

o Our digital footprint has expanded rapidly over the past 10 years.

o The size of the digital universe was roughly 130 billion gigabytes in 2005.

o By 2020, this number will swell to 40 trillion gigabytes.

o Companies will compete for hundreds of thousands, if not millions, of new workers needed to navigate the digital world.

o No wonder the prestigious Harvard Business Review called data

(27)

o A report by the McKinsey Global Institute warns of huge talent shortages for data and analytics.

o By 2018, the United States alone could face a shortage of
   o 140,000 to 190,000 people with deep analytical skills
   o 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

o SAP reported from a survey that 92% of the responding firms in its sample experienced a significant increase in their data holdings.
   o At the same time, three-quarters identified the need for new data science skills in their firms.

o Accenture believes that the demand for data scientists may outstrip supply by 250,000 in 2015 alone.

(28)

(29)

(30)

(31)

(32)

(33)


Intro to R

(34)

It’s powerful It’s free

Extensive support documentation It’s current

(35)

Recent advances
• R Commander from John Fox
• R integrated into Microsoft Excel: R Through Excel
• Zelig
• RStudio

(36)

(37)

(38)

(39)

(40)

(41)

(42)

(43)

StatsR4US

(44)

(45)

THE SOLUTION FOR STATS HEADACHES

• We have taught a prescribed set of methods
• These methods were developed a hundred years ago
• We have not updated them
• We must
• The alternative approach
• Teach tools that business analysts, engineers, and researchers need
• Keep things simple
• Don’t follow the table of contents generated 100 years ago
• Start afresh
• Fit everything one needs in ten lines
• The Ten Commandments of Statistical Analysis in R
• Master 10 lines of code to do

(46)

TASKS NOT METHODS

• Here are the tasks we need to perform:

1. Open/Import a dataset
2. Review/Summarize it
3. Summarize by groups
4. Generate cross tabs
5. Generate some plots
6. Check the distribution
   1. Plot it
   2. Normal/T-distribution
7. Test your hypothesis
8. Chi Square Test

(47)

Tasks

1. Open/Import a dataset
2. Review/Summarize it
3. Summarize by groups
4. Generate cross tabs
5. Generate some plots
6. Check the distribution
   1. Plot it
   2. Normal/T-distribution
7. Test your hypothesis
8. Chi Square Test
9. Run Regressions
10. Save Data

Methods

1. read.csv( )
2. summary( )
3. tapply( )
4. table( )
5. plot( )
6. Working with 2 distributions
   1. hist( )
   2. pnorm( )
   3. pt( )
7. lm( )
8. chisq.test( )
9. lm( )
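
The task-to-method pairing above can be sketched as one short script. This is only an illustration under stated assumptions: it uses R's built-in mtcars data as a stand-in for the workshop data, a temporary CSV file so that read.csv( ) has something to open, and write.csv( ) for task 10, which the slide leaves unnamed.

```r
# Stand-in data: mtcars ships with R; the temporary file is an assumption of this sketch.
csv_file <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_file, row.names = FALSE)
if (!interactive()) pdf(NULL)   # draw off-screen when run as a script

cars <- read.csv(csv_file)                        # 1. open/import a dataset
summary(cars)                                     # 2. review/summarize it
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)    # 3. summarize by groups
cyl_by_am  <- table(cars$cyl, cars$am)            # 4. generate cross tabs
plot(cars$wt, cars$mpg)                           # 5. generate a plot
hist(cars$mpg)                                    # 6.1 plot the distribution
p_norm <- pnorm(1.96)                             # 6.2 normal: P(Z <= 1.96)
p_t    <- pt(1.96, df = 30)                       # 6.3 t: P(T <= 1.96) with 30 df
fit <- lm(mpg ~ wt, data = cars)                  # 7. and 9. hypothesis test via regression
summary(fit)
chi <- suppressWarnings(chisq.test(cyl_by_am))    # 8. chi-square test (small counts warn)
write.csv(cars, csv_file, row.names = FALSE)      # 10. save data (assumed method)
```

The t statistic and p-value that summary(fit) reports for wt are what let lm( ) serve both the hypothesis-testing and the regression tasks.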

(48)


Session - 2

(49)

(50)

Data is (are) to Data Science what air is to humans.

Without data, or air for that matter, there is no Data Science.

Learning opportunities in Data Science, though, suffer from a major drawback.

Instructors routinely assume that whatever data sets are needed are either available, or could be made available, to learners.

A bigger constraint is that instructors assume learners know how to deal with data, and so they embark directly on analytics.

Ready to go

(51)

Most new, and some experienced, learners do not know where to look for, or how to deal with, data.

They don't even know whether data is singular or plural. Datum is singular; data are plural.

Today, we will embrace data in all its imperfections.

(52)

[Chart: share of effort, on a 0% to 100% axis. Stages: Finding data, Cleaning data, Wrangling data, Analytics, Storytelling. Values shown: 10, 35, 20, 10, 25.]

* Chart is not based on real data.

Algorithmics or (al-Khwārizmī)

(53)

By the time you are halfway through this paragraph: 2.5 million Facebook users would have exchanged contents online. Google would have received more than 4 million search requests. More than 200 million email messages would have flown over the Internet and some 275,000 tweets would have been heard.

Never before in the history of humankind have we been able to generate a living history of ourselves. In the process, we are creating new data of immense size and scope.

It is indeed a transformative change to see that within a few decades we have moved from complaining about the lack of data to a data deluge.

(54)

Population data Survey data

Census data

Time series data Panel data

Panel-time series data

Unstructured data Audio and video data Big data

(55)

(56)

Continuous variables Age

Income

Housing prices

Discrete or categorical variables Gender .. Male/female (binary)

Mode of travel … Walk, bike, drive, transit (multinomial)

Number of children … none, one, two, three, four or more (ordinal)

(57)

Type of Data

• Cross-sectional: measurements taken at one time period, e.g., students' course evaluations in a course
• Cross-sectional panel: the same student’s evaluation of different courses in a particular year
• Time series: data collected over time, e.g., unemployment rate, monthly retail sales
• Panel time series

(58)

Number of Variables Univariate

data consisting of a single variable to measure some entity Multivariate

(59)

Categorical (nominal)

• Data sorted into mutually exclusive categories (an observation cannot belong to more than one category)
• Geographical region, type of employee, gender, state of birth, type of automobile owned
• In R, categorical data are stored as factors.

Properties

(60)

Ordinal data

• Data ordered or ranked according to some relationship to one another
• Number of cars owned by a household

Properties

• Categories can be compared with one another
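
The difference between nominal and ordinal data can be sketched in R with factor( ). The values below are made up for illustration and reuse the travel-mode and number-of-children examples from the earlier slide.

```r
# Nominal data: categories with no inherent order.
mode <- factor(c("walk", "bike", "drive", "transit", "drive"))
levels(mode)   # levels default to alphabetical order
table(mode)    # counts per category

# Ordinal data: an ordered factor records the ranking of the categories.
children <- factor(
  c("none", "two", "one", "four or more", "none"),
  levels  = c("none", "one", "two", "three", "four or more"),
  ordered = TRUE
)
fewer_than_two <- children < "two"   # comparisons only make sense for ordered factors
fewer_than_two
```

Running the same comparison on the unordered factor mode returns NA with a warning, which is R's way of saying the categories cannot be ranked.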

(61)

Open data
• Government data
• World Bank, UN, IMF, others
• FRED, QUANDL
• Census agencies
• Pew, ICPSR
• Others

Closed data

(62)

Open data refers to the idea that liberating all sorts of data for use by a larger community will benefit the human race.

Other similar movements advocating for free access to software (open source), and publications and other media (open content) work on the same principle.

A widely held belief in the development community is that free data assists in making informed decisions about human development. In the case of developing countries, open data

(63)

Hans Rosling, the promoter-in-chief for data and analytics.

Dr. Rosling is a Swedish medical doctor whose passionate and animated presentations about analytics spiced up the dialogue on data-driven analytics.

He manages Gapminder, a not-for-profit agency that promotes the use of data and statistics to achieve global development and the United Nations’ Millennium Development Goals.

Gapminder and other groups like it took data liberation to another level by making data visualization and analytics available through the Internet.

Examples:
www.Data.gov
www.Data.gov.uk

(64)

It follows that because there is liberated or open data, there is also caged or proprietary data.

Governments, businesses, and others hold on to the vast majority of data that exists today.

Most of such data sets will never be made public because they are as valuable to enterprise profitability as any other resource.

In fact, Microsoft calls data the new natural resource, equally

(65)

From data starvation to data deluge, the shift was immediate and massive in magnitude.

Those who saw value in liberating data and making it easier for the rest of us to access and use it have facilitated the new dawn of the age of data.

These entities have made huge strides in making the data freely

(66)

FRED

(67)

Toronto-based Quandl is fast becoming one of the biggest repositories of data.

At present, it is similar to FRED because most available data sets are economic time series. However, Quandl aims to be much more than a disburser of time series data.

(68)

(69)

Google Trends Correlate

Nike

(70)

(71)

(72)

Pew Research Data

(73)

Inter-University Consortium for Political and Social Research

A repository of social science survey data.

“ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.”

More than 700 universities and similar institutions are members of the consortium, which holds approximately 500,000 data sets.

The data sets include information on education, aging, criminal

(74)

(75)

What do I need to know about the data others have collected?

• Who sponsored data collection?
   • "Cereal breakfasts are the best." Study funded by Kellogg
• Metadata: the data about data
   • Questionnaire
   • Data dictionaries
   • Top line surveys

(76)

(77)

Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the

Classroom: Instructors' Pulchritude and Putative Pedagogical

(78)

Data from University of Texas 98 instructors

463 courses

Teaching evaluation score registered by the students The Beauty Panel

5 student ranked professors for beauty

Caveats

Beauty Evaluations done by a separate group

Teaching effectiveness might depend upon the following: Knowledge of the subject matter

Eagerness and enthusiasm to transfer knowledge The ability to make complex appear simple

Respect for learners

(79)

Dataset: TeachingRatings.rda

DESCRIPTION:
Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.

FORMAT:
A data frame containing 463 observations on 13 variables.

BEAUTY
Rating of the instructor’s physical appearance by a panel of six students, averaged across the six panelists, shifted to have a mean of zero.

EVAL

(80)

• MINORITY (factor)
   • Does the instructor belong to a minority (non-Caucasian)?
• AGE
   • Professor’s age
• GENDER
   • Factor indicating instructor’s gender.
• CREDITS (factor)
   • Is the course a single-credit elective (e.g., yoga, aerobics, dance)?
• DIVISION (factor)
   • Is the course an upper or lower division course? (Lower division courses are mainly large freshman and sophomore courses.)
• NATIVE (factor)
   • Is the instructor a native English speaker?
• TENURE (factor)
   • Is the instructor on tenure track?
• STUDENTS
   • Number of students that participated in the evaluation.
• ALLSTUDENTS
   • Number of students enrolled in the course.
• PROF

(81)

(82)

We will use the following tools R

R Commander Rstudio

Data Scientist Workbench

Help with software

https://sites.google.com/site/statsr4us/intro/software Learning R

https://sites.google.com/site/statsr4us/intro/software/learning-r First Session in R Cmdr

(83)


(84)

2. 11:00 – 12:00

1. Introducing data

(85)

A report or a deliverable comprises three necessary ingredients, namely:
• Graphics
• Tables
• Narrative

Most books on statistics and analytics do not dedicate space to the fundamentals of summarizing data in tables. Such books are full of tables. However, these books assume that their readers already know how to generate the tables and, more importantly, format them adequately to serve as effective tools for communicating findings to their intended audiences.

Reality: most tables either fail completely or partially to communicate the findings to the intended audience.

Recall how many times you have seen some summary statistics

(86)

(87)

Let us begin by calculating descriptive statistics for the continuous variables in the data set.

Note that I am relying on the describe and describeBy
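
A summary table of this kind can be sketched in base R alone, so that it runs without extra packages. It is not the slide's own code: the six rows below are made-up values, not the TeachingRatings data, and describe_basic is a hypothetical helper defined only for this example.

```r
# Made-up illustration data; column names mimic the teaching-ratings variables.
ratings <- data.frame(
  eval   = c(4.3, 3.7, 4.5, 3.9, 4.1, 3.5),
  beauty = c(0.2, -0.5, 0.9, -0.1, 0.4, -0.8),
  gender = c("female", "male", "female", "male", "female", "male")
)

# Hypothetical helper: a handful of descriptive statistics for one numeric variable.
describe_basic <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# One row per continuous variable, one column per statistic.
overall <- t(sapply(ratings[c("eval", "beauty")], describe_basic))
round(overall, 2)

# The same idea by group: mean of each continuous variable by gender.
by_gender <- aggregate(cbind(eval, beauty) ~ gender, data = ratings, FUN = mean)
by_gender
```

Rounding only at the printing step keeps the stored statistics exact while the displayed table stays readable.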

(88)

(89)

(90)

(91)

(92)


(95)

(96)

(97)

(98)

(99)

(100)

(101)

(102)

(103)

(104)

(105)

(106)

(107)

(108)

(109)

(110)

(111)

(112)

(113)

(114)

(115)

(116)

(117)


Let’s Regress

(118)


(119)

If there is one tool analysts should be comfortable with, it's Regression.

It’s the mother, the father, and the grandmother of all analytic tools.
• It can answer most questions that you’d put to other tools.
• It’s the state of the art, even though it’s hundreds of years old.
• It is Regression.

Isn’t it odd that the tool you should be most familiar with is tucked at the back of most texts on statistical analysis?
• You have to leaf through hundreds of pages to get to Regression Analysis.
• Well, let’s not regress anymore as we embrace Regression Analysis in all its glory!

Why Regress?
• Regression models are the bridge between traditional statistical analysis and modern-day data science.
• Regression can answer what t-tests, correlation tests, ANOVA, and other tools could answer. Regression models (including GLS) often offer better (more useful) forecasts than ANN and other data mining tools.

Regression can help you determine if good-looking instructors

(120)

(121)

• Sir Francis Galton & regression toward the mean
• Galton, F. (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263. http://www.jstor.org/pss/2841583.
• “the average regression of the offspring is a constant fraction of their respective mid-parental deviations”
• “So if its parents are each two inches taller

(122)

QUESTIONS

• Do tall parents have tall children?
• Do women spend more on clothing than men?
• Do unmarried women spend more on clothing?
• Do married women with children spend less on clothing?
• Do households postpone buying expensive goods during recession?
• Do low-income households postpone buying expensive goods during recession?
• Do households with young children

(123)

• Do more women eat cereal for breakfast than men?
• Do single- or two-person households prefer living in a condominium/apartment more than households with children do?
• Do Canadian buyers prefer Japanese cars over American cars?
• Are couples postponing child rearing to later in their lives?

(124)

PARTIAL ANSWERS

Correlations

A positive correlation
• exists between obesity and suburban living
• exists between fertility and suburban living
• exists between poverty and fertility rates

(125)

OTHER FACTORS:

• A positive correlation exists
   • between obesity and suburban living
      • What about age and income?
   • between fertility and suburban living
      • What about moving to suburbs for cheaper housing to raise a family?
   • between poverty and fertility rates
      • What about the high mortality

(126)

A positive correlation exists between summer months and drownings.

Does summer cause drowning?

(127)

(128)

WHEN SEVERAL FACTORS ARE INVOLVED

Factors affecting spending on clothing by women:
• Age
• Income
• Type of career
• Marital status
• Spouse’s income
• Personal taste
• Children
• Location: cultural differences, e.g., Chicoutimi vs. Montreal
• Location: climate differences, e.g., Florida versus New York
• Location: proximity to clothing

(129)

ALL ELSE BEING EQUAL

How to isolate the impact of one particular factor by holding other influences constant.

What is the impact of proximity to public transit, if the size of the house, its structural type, quality of construction, neighbourhood amenities and ambience and other factors are the same?

All else being equal implies that

(130)

(131)

Have you ever travelled in a cab?

• Cabs are sooooo 2015!
• I know.

(132)

• $3.25: You sit in the cab and see a base fare that is constant
• $0.25 per 143 m: The fare increases with each additional 143 m at a certain rate
• $0.25 per additional 29 seconds: The fare increases for each 29 seconds the taxi is idle
• $2.00

(133)

Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers

Fare = 3.25 + 0.25 x Distance + 0.25 x Idle Time + 2 x Four_plus

A sample trip
• 3 people
• 6 km
• 4 minutes

Adjusting time and distance
• 6000 m / 143 m ≈ 42.0 segments of 143 m
• 4(60) s / 29 s ≈ 8.3 segments of 29 seconds

(134)

Rates: $0.25 for 143 m; $0.25 for 29 seconds; $2 for 4+ pax

Fare ($)   Distance (km)   Time (seconds)   4+ pax
15.8       6               240              0
16.9       6.7             225              0
23.1       8.5             350              1
11.4       3.8             180              0
12.1       3.25            135              1
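
The fares in the table follow directly from the posted rates. A minimal sketch in R, assuming fractional segments are billed proportionally (as the 8.3-segment arithmetic on the previous slide suggests); the function name fare and its argument names are made up for this example.

```r
# Hypothetical helper: fare implied by the posted rates.
fare <- function(distance_km, idle_seconds, four_plus = 0) {
  3.25 +                                  # constant base fare
    0.25 * (distance_km * 1000 / 143) +   # $0.25 per 143 m travelled
    0.25 * (idle_seconds / 29) +          # $0.25 per 29 seconds idle
    2.00 * four_plus                      # $2 surcharge for four or more passengers
}

# The five trips from the table.
trips <- data.frame(
  distance_km  = c(6, 6.7, 8.5, 3.8, 3.25),
  idle_seconds = c(240, 225, 350, 180, 135),
  four_plus    = c(0, 0, 1, 0, 1)
)
trips$fare <- round(fare(trips$distance_km, trips$idle_seconds, trips$four_plus), 1)
trips
```

The first row is the sample trip: 3 people (no surcharge), 6 km, and 4 minutes of idle time.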

(135)

What if you don’t know the rates?

Regression estimates the unknown rates

Fare = Constant + Dist_Rate x Distance + Wait_Rate x Time + FourPlus x Four or more Passengers

$$y = \beta_0 + \beta_1 \cdot \text{distance} + \beta_2 \cdot \text{time} + \beta_3 \cdot \text{pax} + \epsilon$$

Rates: $0.25 for 143 m; $0.25 for 29 seconds; $2 for 4+ pax

Fare ($)   Distance (km)   Time (seconds)   4+ pax
15.8       6               240              0
16.9       6.7             225              0
23.1       8.5             350              1
11.4       3.8             180              0
12.1       3.25            135              1
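
How regression recovers unknown rates can be sketched with lm( ). The sketch makes two assumptions: fares are regenerated, unrounded, from the posted rates so the fit is exact, and distance and time are expressed in billing units (143 m and 29 second segments) so the coefficients are directly comparable with the rates. The column names are made up for this example.

```r
# The five trips from the table.
trips <- data.frame(
  distance_km  = c(6, 6.7, 8.5, 3.8, 3.25),
  idle_seconds = c(240, 225, 350, 180, 135),
  four_plus    = c(0, 0, 1, 0, 1)
)

# Convert to billing units so that each coefficient is a rate per segment.
trips$segments_143m <- trips$distance_km * 1000 / 143
trips$segments_29s  <- trips$idle_seconds / 29

# Unrounded fares generated from the posted rates (the "true" model).
trips$fare <- 3.25 + 0.25 * trips$segments_143m +
  0.25 * trips$segments_29s + 2 * trips$four_plus

# Pretend the rates are unknown and estimate them from fares and trip details.
fit <- lm(fare ~ segments_143m + segments_29s + four_plus, data = trips)
round(coef(fit), 2)
```

With metered fares that have been rounded, or any real-world data containing noise, the estimates would only approximate the rates, and the error term in the equation absorbs the difference.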

(136)

(137)

Notation

• Y is a function of X
• Y (the dependent variable) is explained by other variables, X₁, X₂ (the explanatory variables)

$$Y = f(X_1, X_2)$$

Regression Notation

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

• The betas in the above equation are the regression coefficients that explain the relationship between the dependent variable and explanatory variables.
• Epsilon is the error term that captures what is not captured by the variables in the model.

(138)

(139)

(140)

(141)

(142)

(143)

Hamermesh, Daniel S. and Amy M. Parker (2005). "Beauty in the

(144)

Data from University of Texas 98 instructors

463 courses

Teaching evaluation score registered by the students The Beauty Panel

5 student ranked professors for beauty

Caveats

Beauty Evaluations done by a separate group

Teaching effectiveness might depend upon the following: Knowledge of the subject matter

Eagerness and enthusiasm to transfer knowledge The ability to make complex appear simple

Respect for learners

(145)

Dataset: TeachingRatings.rda

DESCRIPTION:
Data on course evaluations, course characteristics, and professor characteristics for 463 courses for the academic years 2000-2002 at the University of Texas at Austin.

FORMAT:
A data frame containing 463 observations on 13 variables.

BEAUTY
Rating of the instructor’s physical appearance by a panel of six students, averaged across the six panelists, shifted to have a mean of zero.

EVAL
Course overall teaching evaluation score, on a scale of 1 (very

(146)

• MINORITY (factor)
   • Does the instructor belong to a minority (non-Caucasian)?
• AGE
   • Professor’s age
• GENDER
   • Factor indicating instructor’s gender.
• CREDITS (factor)
   • Is the course a single-credit elective (e.g., yoga, aerobics, dance)?
• DIVISION (factor)
   • Is the course an upper or lower division course? (Lower division courses are mainly large freshman and sophomore courses.)
• NATIVE (factor)
   • Is the instructor a native English speaker?
• TENURE (factor)
   • Is the instructor on tenure track?
• STUDENTS
   • Number of students that participated in the evaluation.
• ALLSTUDENTS
   • Number of students enrolled in the course.
• PROF

(147)

(148)