Enhancing Big Data Projects through Statistical Engineering


Statistical Engineering and BIG DATA


National Quality Month Statistics Symposium

With significant contributions from Roger Hoerl, Union College, and Dick DeVeaux, Williams College


Abstract

Massive data sets, or Big Data, have become more common recently due to improved technology for data acquisition, storage, and processing. New tools have been developed to analyze such data, including classification and regression trees (CART), neural nets, and methods based on bootstrapping. These tools make high-powered statistical methods available not only to professional statisticians but also to casual users. As with any tools, the results to be expected are proportional to the knowledge and skill of the user, as well as the quality of the data. Unfortunately, much of the professional literature may give casual users the impression that if one has powerful enough algorithms and a lot of data, good models and good results are guaranteed at the push of a button.

This presentation focuses on the application of principles of statistical engineering (Anderson-Cook and Lu, Quality Engineering, 2012) to the Big Data problem. Viewed through the statistical engineering lens, several potential pitfalls of commonly used approaches to Big Data projects become apparent. The consequences of four major issues are addressed: 1) Lack of a strategic, disciplined approach to modeling, 2) Use of "one shot studies" versus sequential approaches, 3) Assuming all data are high


Special Issue on Statistical Engineering

Statistical Engineering Developments


Statistical Engineering Developments

Statistical Engineering Advances

• Five Phase Implementation Framework
• Underlying Theory
• Data Pedigree
• Statistical Thinking Book - CH 4
• “Statistical Engineering: Frameworks and Tools”


Agenda

Today’s Realities
Significance of “Big Data” phenomenon and Data Mining Growth
What Can Go Wrong?
How Statistical Engineering Principles Can Help
The Importance of Data Quality
The Advantages of Sequential Approaches
The Need for Subject Matter Knowledge
Develop a Strategy
Putting It All Together – Doing Big Data the Right Way
Summary


Today's Realities – Data Mining and Big Data

Data Mining and Big Data are popular subjects today
Data Mining – Analyzing data
Big Data – The data set being analyzed
Data mining has been around for decades

1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson to analyze oil refinery data

1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan and others at Bell Labs to gain insight from large data sets

1970s: DuPont uses data compression algorithms in process monitoring using on-line systems

“The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (), unusual records (


Today's Realities – Big Data

BIG DATA the “new” idea, new frenzy that has the promise of changing the world

Fueled by ubiquitous availability of high speed, high capacity information technology

2007 “Competing on Analytics”, Thomas Davenport and Jeanne Harris
Used by many Industries and Companies: Google, Netflix, Amazon, FedEx...

2009 “I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google. “And I’m not kidding.”

March 2012, The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.

Characteristics that Make Big Data Work (Intel IT Center):
Volume – Huge data sets
Variety – Heterogeneous, complex, and variable data
Velocity – Data is generated as a constant stream


Big Data – An Example

Developing Vehicle Emission Standards

How to Use Ambient Air Quality Data to Help Set Vehicle Emission Standards?
• EPA Rollback Model – Reduce vehicle emissions in proportion to the reduction needed to achieve the air quality standard
• Carbon Monoxide Standard is 9 ppm (8-hour average)
• Ex: 2nd highest value at a site is 18 ppm – the change required to meet the standard is a 50% reduction
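The arithmetic behind the rollback example can be sketched in a few lines of Python. This is a minimal illustration of proportional rollback only; the function name `rollback_reduction` is hypothetical, and the sketch assumes no background concentration or traffic growth term, neither of which is described on this slide.

```python
def rollback_reduction(observed_ppm: float, standard_ppm: float) -> float:
    """Fractional emission reduction under simple proportional rollback.

    Assumes ambient concentration scales directly with emissions
    (no background level, no growth factor).
    """
    if observed_ppm <= 0:
        raise ValueError("observed concentration must be positive")
    # No reduction is needed when the site already meets the standard.
    return max(0.0, (observed_ppm - standard_ppm) / observed_ppm)


# Slide example: 2nd highest value of 18 ppm against a 9 ppm standard.
reduction = rollback_reduction(observed_ppm=18.0, standard_ppm=9.0)
print(f"Required reduction: {reduction:.0%}")  # prints "Required reduction: 50%"
```

Because the required reduction is driven entirely by one observed value, any error in that value passes straight through to the emission standard.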


Determining Vehicle Emission Standards

An Alternative Model

Strategy: Assess

• Data quality; in particular the variation of the 2nd highest 8-hour CO value


Available Carbon Monoxide (CO) Data

Continuous Air Monitoring Program (CAMP) Hourly CO Data at 6 major US cities from 1962 - 1971.
Chicago, Denver, Washington, DC, Cincinnati, St Louis and Philadelphia
51 station-years of hourly CO data resulting in 450k measurements
Traffic volume and meteorological data significantly add to the volume of data

Data Sources
Traffic Data – Chicago, Colorado, Illinois
CAMP Carbon Monoxide Data - EPA-NERC, Chicago
Meteorological Data - National Climatic Center
Denver Vehicle Emissions – Colorado Department of Highways
New Jersey Air Quality Data - New Jersey EPA

Data Quality Assessment
Ability of a single monitoring station to represent the air quality of a city
Air Quality Standard and models use 2nd highest observed value
Highly variable and susceptible to errors caused by equipment malfunction.


Chicago CAMP Data 1968-69

Effect of Meteorology, Traffic Volume and Business Activity on

Hourly CO Levels – Regression Statistics


Wind Direction Has A Major Effect on the

Chicago Camp Station Hourly CO Values

Wind Direction

• Can Reduce the CO Levels by As Much as Approximately 50% of the Annual Average CO

• Largest Effect in 1968
• 2nd Largest Effect in 1969


Standard Deviation of 8-Hour Average CO Values and

Annual Average CO for All CAMP Cities


Percent of Time 8-hour CO Standard was Exceeded vs

Annual Average CO at CAMP Stations 1962-71

Model Confirmed Using New Jersey Data 1965-73 (> 100 Station-Years of Data)


Carbon Monoxide Emission Rates for New Cars

Based on CO Annual Average


BIG DATA


As Seen on 60 Minutes Webpage

Deception at Duke: Fraud in cancer care?


What Could Possibly Go Wrong?

Duke Genomics Center publishes groundbreaking cancer biomarker articles from 2005 - 2010
Clinical trials based on this research surprisingly did not pan out
Women died unexpectedly
Two statisticians, Keith Baggerly and Kevin Coombes, dug into the research.

Are We At the Point of Providing


What Could Possibly Go Wrong?

Their conclusions?

“Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless – moving a row or column over by one in a giant spreadsheet – while others seemed inexplicable. The Duke team shrugged them off as “clerical errors”...In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke...His collaborator and mentor, Dr. Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.”

Large Amounts of Data Plus Sophisticated Algorithms

Do Not Guarantee Success!


What Could Possibly Go Wrong?

On April 18th, 2011 the textbook “The Making of a Fly” debuts on Amazon.com

Amazon’s automated algorithm price: $1,730,045

Later in the day, the price goes up to $23,698,656

Plus $3.55 for shipping and handling.

No one buys the book that day.

Days later, the Amazon price was $106.

People started to buy the book.


What Could Possibly Go Wrong?

Our quandary:

All other things being equal,

“Big Data” is better than “little data”

The newer data mining tools are powerful and work quite

well

Classification and regression trees (CART),

Neural nets,

Methods based on bootstrapping.

Yet, modeling disasters continue to occur; why?

Clearly, we are missing something in the equation


Can Statistical Engineering

Fundamentals Help?


Role of Statistical Engineering in Big Data

So where are we?

Explosion of large data sets …. Big Data ….

Lots of new analytical tools available

But modeling disasters continue to occur:

Poorly defined problems

Poor data – quantity but no quality

“Unbridled empiricism”: lack of subject matter knowledge

Need more strategic approach

Statistical engineering is one option

Effective framework for handling large, complex unstructured

problems


Statistical Engineering Definition

Statistical Engineering:

“The study of how to best utilize statistical concepts, methods, and tools and integrate them with information technology and other relevant sciences to generate improved results”

Hoerl and Snee 2010

Building Something Bigger From the

“Parts List” of Statistical Tools


Typical Phases of Statistical Engineering Projects

1. Identify problems: find the high-impact issues inhibiting achievement of the organization’s strategic goals
2. Create structure: carefully define the problem, objectives, constraints, metrics for success, and so on.
3. Understand the context: identify important stakeholders (e.g., customers, organizations, individuals, management), research the history of the issue, identify unstated complications and cultural issues, locate relevant data sources.
4. Develop a strategy: create an overall, high level approach to attacking the problem, based on phases 2 and 3.
5. Establish tactics: develop and implement diverse initiatives or projects that collectively will accomplish the strategy


Statistical Engineering – Critical Considerations for BIG DATA

Data Quality
Omissions, errors, missing values, etc.
Missing variables

Subject Matter Knowledge
Variables selection and appropriate scales
Interpretation of results, ability to extrapolate findings

Use of Sequential Approaches
Big problems not solved with one analysis or one data set
Strategy must move beyond the “one-shot study” mindset

Develop a Strategy – Treat as a Project
A high level plan for success
Generate short-term wins


Data Pedigree - What to Look for?

Data Quality

Measurement process

Understand how the process operates

Understand how

Product was made

Service provided

Check data collection assumptions

Observational data and designed experiments


Data Mining ala Wikipedia (2012)

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), a field at the intersection of [...] and [...], attempts to discover patterns in large [...]. It utilizes methods at the intersection of [...], [...], [...], and [...]. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and [...] aspects, [...], [...] and [...] considerations, interestingness metrics, [...] discovered structures, [...]

Some Key Words and Phrases

1. Large amounts of data
2. Preexisting databases
3. Computer technology
4. Algorithms
5. Turn data into information
6. Knowledge generation
7. Discovery
8. Identification of trends and patterns in data
9. Relationships and models


Data Mining - What’s Missing?

Some Key Words and Phrases

1. Large amounts of data
2. Preexisting databases
3. Computer technology
4. Algorithms
5. Turn data into information
6. Knowledge generation
7. Discovery
8. Identification of trends and patterns in data
9. Relationships and models
10. Understandable structure

Missing Critical Words and Phrases

Understanding the
Process and product or service from which the data were collected
Relevant science, engineering and subject matter
Data quality
Errors, mistakes, atypical values
Measurement process
Measurement variation
Data collection process and assumptions


Data Mining

Perceptions and Unspoken Assumptions

More data is always better
All data are good data
If you have enough data
Sophisticated algorithms will find useful trends and patterns
We don't need to understand the context of the problem, or have good subject matter expertise
Measurement systems (accuracy, precision, etc.) are not important
Changes over time due to unknown variables;
The algorithms will take care of time variations in the current data set
A good model today will remain a good model in the future

A single mathematical metric (e.g., prediction accuracy) can completely quantify how good a model is

Sophisticated algorithms replace skill of the model builder


Need to Know the Pedigree

In the world of farm animals, horses and other livestock, if you want to assess and predict the “quality” of an animal and how it will perform, you look at its pedigree.
Triple Crown horserace winners often produce winning offspring.
Secretariat – 1973 Triple Crown Winner
Kentucky Derby, Preakness and Belmont Races
Sired many stakes race winners
Bold Ruler – Sire of Secretariat
Preakness winner
Sired 10 other champions and many stakes race winners


Pedigree of Secretariat (pedigree chart showing sire and dam lines; source: Wikipedia 2012)


Knowing Data Pedigree Provides Deep Understanding of Data

It is important to get the right data at the beginning of a study.

But what do we do when the data are in hand?

Evaluating “data pedigree” involves understanding:

Science, engineering and structure of the process or product

from which the data were collected.

Collection process used to obtain data and prepare for analysis.

How the measurements were made.


Observations on Current Practice

Data collected without controls and careful administration of the data collection process often contain erroneous results

Data residing in electronic files says nothing about the quality of the data
Data mining as practiced seems to be assuming that all data are good data and more data is better

Knowing how the data were collected is also critical to performing the correct analysis of the data.

Data structure and sources of variation are easily identified.

Model form that best fits the structure and situation becomes more apparent (e.g., crossed vs nested factors, split plotting).

Analyzing data without understanding the associated process, sampling and testing procedures greatly increases the risk of erroneous results.

Failure to reproduce study results involves more than simply using a wrong analysis - The Duke study is a classic case of this.


Case Study - Vehicle Emission Standards

Types of Data and Data Sources

Traffic Data

Illinois Department of Transportation
Chicago Bureau of Streets
Colorado Department of Highways

Continuous Air Monitoring Program (CAMP) Hourly Data
EPA-NERC – Approximately 51 station-years of data from 6 major cities
Chicago Department of Environmental Control

Meteorological Data

National Climatic Center

Vehicle Emissions – Denver

Colorado Department of Highways

New Jersey Hourly Air Quality Data - >100 station-years of data

New Jersey EPA


Assessing Data Quality – An Example

Carbon Monoxide Data

(Snee and Pierrard 1977)

Ambient air quality standard for carbon monoxide is 9 ppm (8-hour average) not to be exceeded more than once per year.

Second highest value over an 8 hour period in a year was being used to assess the air quality in the vicinity of the sampler.

2nd highest value is a highly variable statistic. For example:

Denver CO sampling station in 1971 had a 2nd highest CO value of 35 ppm with a maximum value of 39 ppm, well above the standard

Hourly data used to compute the 2nd highest value were evaluated

A plot of the hourly CO values for the period in question showed 10 consecutive hourly readings of 39 ppm, with 4 out of the next 6 hourly readings at 39 ppm and the remaining 2 readings at 36 ppm.

This small amount of variation over a 16-hour period is not typical of variation in hourly CO readings and does not represent an accurate characterization of the air quality in the area of the sampler.

It is highly probable that these results are due to equipment malfunction. A similar problem was found in the CO data from Cincinnati in 1968.
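A screening check for this kind of suspect pattern might look like the sketch below. It is an illustration, not the method used in the original study: the function name `flag_constant_runs`, the run-length threshold of 6 hours, and the surrounding readings are all assumptions. Only the block of ten readings at 39 ppm followed by four at 39 ppm and two at 36 ppm comes from the slide, and the order of those last six readings is assumed.

```python
from itertools import groupby


def flag_constant_runs(readings, min_run=6):
    """Return (start_index, run_length, value) for every run of identical
    consecutive readings that is at least min_run long."""
    runs = []
    index = 0
    for value, group in groupby(readings):
        length = len(list(group))
        if length >= min_run:
            runs.append((index, length, value))
        index += length
    return runs


# Hypothetical hourly series (ppm): three ordinary readings, the ten
# consecutive 39s described above, six more hours with four 39s and
# two 36s (order assumed), then two ordinary readings.
hourly_co = [12, 15, 21] + [39] * 10 + [36, 39, 39, 36, 39, 39] + [18, 14]

print(flag_constant_runs(hourly_co))  # prints [(3, 10, 39)]
```

A flagged run is a prompt to look at the instrument log and the raw plot, not proof of malfunction; the threshold would need to be set with knowledge of the measurement process.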


Histogram of 8-Hour Non-overlapping Average CO

Measurements at the Denver CAMP Station 1972
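The non-overlapping 8-hour averages named in this slide title, and the two summary statistics discussed in this case study, can be computed from hourly readings roughly as follows. This is a sketch under stated assumptions: the function names and the sample readings are hypothetical, and a trailing partial window is simply dropped.

```python
def eight_hour_averages(hourly, window=8):
    """Non-overlapping window averages; a trailing partial window is dropped."""
    n_full = len(hourly) // window
    return [
        sum(hourly[i * window:(i + 1) * window]) / window
        for i in range(n_full)
    ]


def summarize(hourly, standard_ppm=9.0):
    """2nd highest 8-hour average and percent of 8-hour periods above the standard."""
    averages = eight_hour_averages(hourly)
    if len(averages) < 2:
        raise ValueError("need at least two full 8-hour periods")
    ranked = sorted(averages, reverse=True)
    n_exceeding = sum(a > standard_ppm for a in averages)
    return {
        "second_highest": ranked[1],
        "percent_exceeding": 100.0 * n_exceeding / len(averages),
    }


# Hypothetical 32 hours of readings (ppm), four full 8-hour periods.
hourly = [4] * 8 + [10] * 8 + [12] * 8 + [6] * 8
result = summarize(hourly)
print(result)  # prints {'second_highest': 10.0, 'percent_exceeding': 50.0}
```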


Sequence Plot of Hourly CO Readings

Denver CAMP Station June 18 -19, 1972

Data Associated with the 2 Highest Values during 1972


Sequence Plot of Hourly CO Readings

Cincinnati CAMP Station Friday July 12 to Sunday July 14, 1968

Data Associated with the 7 Highest Values during 1968
Accurate Measure of Air Quality in 1968???


Annual Average-Useful Alternative Measure of Air Quality

Annual Average

Good predictor of % time standard is exceeded

Much more precise and robust than the 2nd highest value

2nd highest value of a lognormal distribution is related to the annual average

Empirical evidence from the data collected at six major cities over a 10 year period and at several sampling stations in New Jersey showed that the relationship is very strong.

Recommendation:

• Use 2nd Highest Value to Compare Air Quality to Air Quality Standard

• Use Annual Average to Make Predictions and Calculate Emission Reductions
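The precision claim can be illustrated with a small simulation. This is a sketch only: it assumes 8-hour averages are independent lognormal draws, which ignores the serial correlation and seasonality in real CO data, and the parameters `mu`, `sigma`, the number of simulated years, and the seed are hypothetical choices, not values fitted to the CAMP data.

```python
import random
import statistics


def simulate_year(rng, n_periods=1095, mu=1.0, sigma=0.6):
    """One hypothetical year of non-overlapping 8-hour CO averages (ppm),
    drawn independently from a lognormal distribution.
    Returns (annual average, 2nd highest value)."""
    values = [rng.lognormvariate(mu, sigma) for _ in range(n_periods)]
    second_highest = sorted(values)[-2]
    return statistics.fmean(values), second_highest


def coefficient_of_variation(xs):
    """Standard deviation relative to the mean."""
    return statistics.stdev(xs) / statistics.fmean(xs)


rng = random.Random(1977)  # fixed seed so the run is repeatable
years = [simulate_year(rng) for _ in range(200)]

cv_average = coefficient_of_variation([avg for avg, _ in years])
cv_second = coefficient_of_variation([second for _, second in years])

print(f"CV of annual average:    {cv_average:.3f}")
print(f"CV of 2nd highest value: {cv_second:.3f}")
```

Under these assumptions the year-to-year relative variation of the 2nd highest value is several times that of the annual average, which is the sense in which the annual average is the more precise basis for predictions.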


Conclusions – Regarding Data Pedigree

Trust but Verify - Data pedigree must be assessed when analyzing BIG DATA.

Data quality is an issue with all sources of data

Careful thought must be given to the model form needed to answer the question. Different models can often get you to the same place, or to different places

Multiple sources of data require careful thought as to data pedigree and how to fit the data bases together to produce useful results

Different data sources are typically associated with political issues, different agendas, different objectives, etc.


The Advantages of a Sequential Approach

Much of our professional literature, including our textbooks, assumes that statistical problems are “one-shot studies”

We are handed a fixed data set, and must develop the “best” model to fit the data

Articles are frequently published challenging previously published analyses, and proposing a better model for the same data

This is clearly the tone of many high-profile data analysis competitions:

Netflix Challenge

Kaggle.com


The Advantages of a Sequential Approach

The important problems I have faced have almost always:

Needed a sequential approach, involving more than one statistical tool

Viewing problem solving and data analysis as a sequential process results in a very different viewpoint versus one-shot studies

A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the “correct” model

Sequential approaches also offer the opportunity for using hindsight to our advantage

“The best time to plan an experiment is after you have done it”

R. A. Fisher

Are Netflix and Kaggle.com Missing Something?


Data

Subject Matter Theory – Hypothesis - Conjecture

Process Knowledge Increases

Business Process

Customer

Data


Need for


Use of Subject Matter Knowledge is Critical

“Data have no meaning of themselves, they only have meaning within the context of a conceptual model of the phenomenon under study”
Box, Hunter and Hunter (1978)

Some believe that all you need is data –
“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (Anderson 2008)

Theory underlying the process and the data can help
Select the variables to be studied
Select model form
Interpret the results and extrapolate the findings

Understanding the process provides the context that aids
Framing of the problem


Putting It All Together


Statistical Engineering Approach to Big Data

Seek out high-impact problems

Don’t wait for them to knock on your door

Provide structure to poorly defined problems

Take time to understand context and background

Evaluate data quality – get more if needed

Develop a strategy or overall plan of attack

Software tools provide horsepower to “tame” Big Data

Don’t forget the fundamentals


Summary

The glass is half-full:

BIG DATA offers us a unique opportunity
Useful statistical software exists
When used correctly, enables effective analysis of BIG DATA
Statistical engineering fundamentals still apply
Ignoring these fundamentals increases likelihood of modeling disasters
Probability of success is significantly increased with:
Identifying a Strategy, creating a plan for the project
Understanding of “data pedigree”
Utilization of sequential approaches
Integration of subject matter knowledge

Big Data Analytics Do Not Replace


Underlying Theory of Statistical Engineering

Systems and strategy are needed to guide effective use of statistical tools and methods

Impact of statistical thinking and methods can be increased by integrating several statistical tools

Enables practitioners to deal with highly complex issues that cannot be addressed with any one method or tool.

Linking and sequencing the use of statistical tools
Speeds the learning of the approach
Increases the impact of the methods

Use of information technology increases the effectiveness of identifying and implementing statistical engineering solutions

Embedding statistical thinking and tools into daily work institutionalizes their application

Viewing statistical thinking and methods from an engineering context provides a clear focus on problem solving to the benefit of humankind


References

Anderson, C. (2008) “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine, June 23, 2008

Davenport, T. H and J. G. Harris (2007) Competing on Analytics, Harvard Business School Press, Boston, MA

DeVeaux, R. D. and D. J. Hand (2005) “How to Lie with Bad Data”, Statistical Science, Vol. 20, No.3, 231-238

Hoerl, R. W. and R. D. Snee (2012) Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.

Pierrard, J. M. (1974) “Relating Automotive Emissions and Urban Air Quality”, DuPont Innovation, Vol. 5. No. 2, pp 6-9.

Pierrard, J. M., R. D. Snee and J. Zelson (1973) “A New Approach to Setting Vehicle Emission Standards”, Presented at Air Pollution Control Association Annual Meeting, June 24-28, 1973

Pierrard, J. M., R. D. Snee and J. Zelson (1974) “A New Approach to Setting Vehicle Emission Standards”, Air Pollution Control Association Journal, Vol. 24, No. 9, pp 841-848.

Snee, R. D. and R. W. Hoerl (2003) Leading Six Sigma – A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.

Snee, R. D. and R. W. Hoerl (2012) “Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?” Quality Progress, December 2012, 66-68.

Snee, R. D. and J. M. Pierrard (1977) “The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality”, Air Pollution Control Association Journal, Vol. 27, No. 2, pp 131-133.


Articles on Statistical Engineering by Hoerl and Snee

Roger W. Hoerl and Ronald D. Snee, (2009) “Post Financial Meltdown: What Do Services Industries Need From Us Now?” Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.

Roger W. Hoerl and Ronald D. Snee, (2010) “Moving the Statistics Profession Forward to the Next Level,” The American Statistician, February 2010, pp. 10-14.

Roger W. Hoerl and R. D. Snee, (2010) “Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools,” Quality Progress, May 2010, pp. 52-53.

Roger W. Hoerl and R. D. Snee, (2010) “Tried and True—Organizations Put Statistical Engineering to the Test and See Real Results,” Quality Progress, June 2010, pp. 58-60.

Roger W. Hoerl and Ronald D. Snee, (2010) “Statistical Thinking and Methods in Quality Improvement: A Look to the Future,” Quality Engineering, 22, 3, pp. 119-139.

Roger W. Hoerl and Ronald D. Snee, (2011) “Statistical Engineering: Is This Just Another Term for Applied Statistics?” Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, 4-6.

Ronald D. Snee and Roger W. Hoerl, (2010) “Further Explanation; Clarifying Points About Statistical Engineering,” Quality Progress, December 2010, pp. 68-72

Ronald D. Snee and Roger W. Hoerl (2011) “Engineering an Advantage”, Six Sigma Forum Magazine, Guest Editorial, February 2011, 6-7.


For Further Information, Please Contact:

Ronald D. Snee, PhD

Snee Associates, LLC

[email protected]

610-213-5595