Statistical Engineering and BIG DATA
Enhancing Big Data Projects through
Statistical Engineering
National Quality Month Statistics Symposium
Sponsored by
Philadelphia ASQ Section at
Penn State Great Valley, Malvern, PA
October 4, 2013
Ronald D. Snee, Snee Associates, LLC
With significant contributions from Roger Hoerl, Union College, and Dick DeVeaux, Williams College
Abstract
Massive data sets, or Big Data, have become more common recently, due to improved technology for data acquisition, storage, and processing. New tools have been developed to analyze such data, including classification and regression trees (CART), neural nets, and methods based on bootstrapping. These tools make high-powered statistical methods available not only to professional statisticians, but also to casual users. As with any tools, the results to be expected are proportional to the knowledge and skill of the user, as well as the quality of the data. Unfortunately, much of the professional literature may give casual users the impression that if one has powerful enough algorithms and a lot of data, good models and good results are guaranteed at the push of a button.
This presentation focuses on the application of principles of statistical engineering (Anderson-Cook and Lu, Quality Engineering, 2012) to the Big Data problem. Viewed through the statistical engineering lens, several potential pitfalls of commonly used approaches to Big Data projects become apparent. The consequences of four major issues are addressed: 1) Lack of a strategic, disciplined approach to modeling, 2) Use of "one shot studies" versus sequential approaches, 3) Assuming all data are high
Special Issue on Statistical Engineering
Statistical Engineering Developments
Statistical Engineering Advances
• Five Phase Implementation Framework
• Underlying Theory
• Data Pedigree
• Statistical Thinking Book - CH 4
• “Statistical Engineering: Frameworks and Tools”
Agenda
Today’s Realities
Significance of “Big Data” phenomenon and Data Mining Growth
What Can Go Wrong?
How Statistical Engineering Principles Can Help
The Importance of Data Quality
The Advantages of Sequential Approaches
The Need for Subject Matter Knowledge
Develop a Strategy
Putting It All Together – Doing Big Data the Right Way
Summary
Today's Realities – Data Mining and Big Data
Data Mining and Big Data are popular subjects today
Data Mining – Analyzing data
Big Data – The data set being analyzed
Data mining has been around for decades
1950s: Stepwise regression first developed at Esso (now Exxon) by Efroymson to analyze oil refinery data
1960s: Graphical methods developed by Tukey, Wilk, Gnanadesikan and others at Bell Labs to gain insight from large data sets
1970s: DuPont uses data compression algorithms in process monitoring using on-line systems
“The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (), unusual records (
Today's Realities – Big Data
BIG DATA the “new” idea, new frenzy that has the promise of changing the world
Fueled by ubiquitous availability of high speed, high capacity information technology
2007 “Competing on Analytics”, Thomas Davenport and Jeanne Harris
Used by many industries and companies: Google, Netflix, Amazon, FedEx…
2009 “I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google. “And I’m not kidding.”
March 2012, The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.
Characteristics that Make Big Data Work (Intel IT Center):
Volume – Huge data sets
Variety – Heterogeneous, complex, and variable data
Velocity – Data is generated as a constant stream
Big Data – An Example
Developing Vehicle Emission Standards
How to Use Ambient Air Quality Data to Help Set
Vehicle Emission Standards?
• EPA Rollback Model – Reduce vehicle emissions in proportion to the reduction needed to achieve the air quality standard
• Carbon Monoxide Standard is 9 ppm (8-hour average)
• Ex: 2nd highest value at a site is 18 ppm – the reduction required to meet the standard is 50%
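The rollback arithmetic in the example can be sketched in a few lines of Python. This is an illustration only: the function name `rollback_reduction` is hypothetical, and the sketch assumes the simple proportional form stated above, with no background concentration term.

```python
def rollback_reduction(second_highest_ppm: float, standard_ppm: float = 9.0) -> float:
    """Fraction by which emissions must fall under a simple proportional rollback.

    Assumes ambient concentration scales in direct proportion to emissions
    (no background term), which is the form described on the slide.
    """
    if second_highest_ppm <= 0:
        raise ValueError("observed concentration must be positive")
    # No reduction is needed if the site already meets the standard.
    return max(0.0, (second_highest_ppm - standard_ppm) / second_highest_ppm)


# Slide example: 2nd highest 8-hour value of 18 ppm against the 9 ppm standard.
print(f"{rollback_reduction(18.0):.0%}")  # 50%
```

Because the required reduction depends on a single extreme value, any error in that one observation passes straight through to the emission standard.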
Determining Vehicle Emission Standards
An Alternative Model
Strategy: Assess
• Data quality; in particular the variation of the 2nd highest 8-hour CO value
Available Carbon Monoxide (CO) Data
Continuous Air Monitoring Program (CAMP) Hourly CO Data at 6 major US cities from 1962 - 1971.
Chicago, Denver, Washington, DC, Cincinnati, St Louis and Philadelphia
51 station-years of hourly CO data resulting in 450k measurements
Traffic volume and meteorological data significantly add to the volume of data
Data Sources
Traffic Data – Chicago, Colorado, Illinois
CAMP Carbon Monoxide Data - EPA-NERC, Chicago
Meteorological Data - National Climatic Center
Denver Vehicle Emissions – Colorado Department of Highways
New Jersey Air Quality Data - New Jersey EPA
Data Quality Assessment
Ability of a single monitoring station to represent the air quality of a city
Air Quality Standard and models use 2nd highest observed value
Highly variable and susceptible to errors caused by equipment malfunction.
Chicago CAMP Data 1968-69
Effect of Meteorology, Traffic Volume and Business Activity on
Hourly CO Levels – Regression Statistics
Wind Direction Has A Major Effect on the
Chicago Camp Station Hourly CO Values
Wind Direction
• Can Reduce the CO Levels by As Much as Approximately 50% of the Annual Average CO
• Largest Effect in 1968
• 2nd Largest Effect in 1969
Standard Deviation of 8-Hour Average CO Values and
Annual Average CO for All CAMP Cities
Percent of Time 8-hour CO Standard was Exceeded vs
Annual Average CO at CAMP Stations 1962-71
Model Confirmed Using New Jersey Data 1965-73 (> 100 Station-Years of Data)
Carbon Monoxide Emission Rates for New Cars
Based on CO Annual Average
BIG DATA
As Seen on 60 Minutes Webpage
Deception at Duke: Fraud in cancer care?
What Could Possibly Go Wrong?
Duke Genomics Center publishes groundbreaking cancer
biomarker articles from 2005 - 2010
Clinical trials based on this research surprisingly did not pan out
Women died unexpectedly
Two statisticians, Keith Baggerly and Kevin Coombes, dug into
the research.
Are We At the Point of Providing
What Could Possibly Go Wrong?
Their conclusions?
“Dr. Baggerly and Dr. Coombes found errors almost immediately. Some
seemed careless – moving a row or column over by one in a giant
spreadsheet – while others seemed inexplicable. The Duke team
shrugged them off as “clerical errors”...In the end, four gene signature
papers were retracted. Duke shut down three trials using the results.
(Lead investigator) Dr. Potti resigned from Duke...His collaborator and
mentor, Dr. Nevins, no longer directs one of Duke’s genomics centers.
The cancer world is reeling.”
Large Amounts of Data Plus Sophisticated Algorithms
Do Not Guarantee Success!
What Could Possibly Go Wrong?
On April 18th, 2011 the textbook “The Making of a Fly”
debuts on Amazon.com
Amazon’s automated algorithm price: $1,730,045
Later in the day, the price goes up to $23,698,656
Plus $3.55 for shipping and handling.
No one buys the book that day.
Days later, the Amazon price was $106.
People started to buy the book.
What Could Possibly Go Wrong?
Our quandary:
All other things being equal,
“Big Data” is better than “little data”
The newer data mining tools are powerful and work quite
well
Classification and regression trees (CART),
Neural nets,
Methods based on bootstrapping.
Yet, modeling disasters continue to occur; why?
Clearly, we are missing something in the equation
Can Statistical Engineering
Fundamentals Help?
Role of Statistical Engineering in Big Data
So where are we?
Explosion of large data sets …. Big Data ….
Lots of new analytical tools available
But modeling disasters continue to occur:
Poorly defined problems
Poor data – quantity but no quality
“Unbridled empiricism”: lack of subject matter knowledge
Need more strategic approach
Statistical engineering is one option
Effective framework for handling large, complex unstructured
problems
Statistical Engineering Definition
Statistical Engineering:
“The study of how to best utilize statistical concepts,
methods, and tools and integrate them with
information technology and other relevant sciences
to generate improved results”
Hoerl and Snee 2010
Building Something Bigger From the
“Parts List” of Statistical Tools
Typical Phases of Statistical Engineering Projects
1. Identify problems: find the high-impact issues inhibiting achievement of the organization’s strategic goals.
2. Create structure: carefully define the problem, objectives, constraints, metrics for success, and so on.
3. Understand the context: identify important stakeholders (e.g., customers, organizations, individuals, management), research the history of the issue, identify unstated complications and cultural issues, locate relevant data sources.
4. Develop a strategy: create an overall, high level approach to attacking the problem, based on phases 2 and 3.
5. Establish tactics: develop and implement diverse initiatives or projects that collectively will accomplish the strategy.
Statistical Engineering – Critical Considerations for BIG DATA
Data Quality
Omissions, errors, missing values, etc.
Missing variables
Subject Matter Knowledge
Variables selection and appropriate scales
Interpretation of results, ability to extrapolate findings
Use of Sequential Approaches
Big problems not solved with one analysis or one data set
Strategy must move beyond the “one-shot study” mindset
Develop a Strategy – Treat as a Project
A high level plan for success
Generate short-term wins
Data Pedigree - What to Look for?
Data Quality
Measurement process
Understand how the process operates
Understand how
Product was made
Service provided
Check data collection assumptions
Observational data and designed experiments
Data Mining ala Wikipedia (2012)
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), a field at the intersection of computer science and statistics, attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures,
Some Key Words and Phrases
1. Large amounts of data
2. Preexisting databases
3. Computer technology
4. Algorithms
5. Turn data into information
6. Knowledge generation
7. Discovery
8. Identification of trends and patterns in data
9. Relationships and models
Data Mining - What’s Missing?
Some Key Words and Phrases
1. Large amounts of data
2. Preexisting databases
3. Computer technology
4. Algorithms
5. Turn data into information
6. Knowledge generation
7. Discovery
8. Identification of trends and patterns in data
9. Relationships and models
10. Understandable structure
Missing Critical Words and Phrases
Understanding the
Process and product or service from which the data were collected
Relevant science, engineering and subject matter
Data quality
Errors, mistakes, atypical values
Measurement process
Measurement variation
Data collection process and assumptions
Data Mining
Perceptions and Unspoken Assumptions
More data is always better
All data are good data
If you have enough data, sophisticated algorithms will find useful trends and patterns
We don't need to understand the context of the problem, or have good subject matter expertise
Measurement systems (accuracy, precision, etc.) are not important
Changes over time due to unknown variables:
The algorithms will take care of time variations in the current data set
A good model today will remain a good model in the future
A single mathematical metric (e.g., prediction accuracy) can completely quantify how good a model is
Sophisticated algorithms replace skill of the model builder
Need to Know the Pedigree
In the world of farm animals, horses and other livestock, if you want to assess and predict the “quality” of an animal and how it will perform, you look at its pedigree.
Triple Crown horserace winners often produce winning offspring.
Secretariat – 1973 Triple Crown Winner
Kentucky Derby, Preakness and Belmont Races
Sired many stakes race winners
Bold Ruler – Sire of Secretariat
Preakness winner
Sired 10 other champions and many stakes race winners
[Figure: Pedigree of Secretariat (Wikipedia 2012)]
Knowing Data Pedigree Provides Deep Understanding of Data
It is important to get the right data at the beginning of a study.
But what do we do when the data are in hand?
Evaluating “data pedigree” involves understanding:
Science, engineering and structure of the process or product
from which the data were collected.
Collection process used to obtain data and prepare for analysis.
How the measurements were made.
Observations on Current Practice
Data collected without controls and careful administration of the data collection process often contain erroneous results
Data residing in electronic files says nothing about the quality of the data
Data mining as practiced seems to assume that all data are good data and more data is better
Knowing how the data were collected is also critical to performing the correct analysis of the data.
When the collection process is known, data structure and sources of variation are easily identified.
Model form that best fits the structure and situation becomes more apparent (e.g., crossed vs nested factors, split plotting).
Analyzing data without understanding the associated process, sampling and testing procedures greatly increases the risk of erroneous results.
Failure to reproduce study results stems from more than simply using the wrong analysis; the Duke study is a classic case of this.
Case Study - Vehicle Emission Standards
Types of Data and Data Sources
Traffic Data
Illinois Department of Transportation
Chicago Bureau of Streets
Colorado Department of Highways
Continuous Air Monitoring Program (CAMP) Hourly Data
EPA-NERC – Approximately 51 station-years of data from 6 major cities
Chicago Department of Environmental Control
Meteorological Data
National Climatic Center
Vehicle Emissions – Denver
Colorado Department of Highways
New Jersey Hourly Air Quality Data - >100 station-years of data
New Jersey EPA
Assessing Data Quality – An Example
Carbon Monoxide Data
(Snee and Pierrard 1977)
Ambient air quality standard for carbon monoxide is 9 ppm (8-hour average) not to be exceeded more than once per year.
Second highest value over an 8 hour period in a year was being used to assess the air quality in the vicinity of the sampler.
The 2nd highest value is a highly variable statistic. For example:
The Denver CO sampling station in 1971 had a 2nd highest CO value of 35 ppm with a maximum value of 39 ppm, well above the standard
Hourly data used to compute the 2nd highest value were evaluated
A plot of the hourly CO values for the period in question showed 10 consecutive hourly readings of 39 ppm, with 4 out of the next 6 hourly readings at 39 ppm and the remaining 2 readings at 36 ppm.
This small amount of variation over a 16-hour period is not typical of variation in hourly CO readings and does not represent an accurate characterization of the air quality in the area of the sampler.
It is highly probable that these results are due to equipment malfunction. A similar problem was found in the CO data from Cincinnati in 1968.
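A screen for this kind of suspicious flat run can be sketched as follows. It is a minimal illustration, not the procedure used in the original study: the function name `flat_runs`, the threshold of 6 identical hours, and the sample series are all assumptions made for the example.

```python
from itertools import groupby


def flat_runs(readings, min_length=6):
    """Return (start_index, length, value) for each run of identical readings.

    min_length is an assumed screening threshold, not a value from the study.
    """
    runs = []
    index = 0
    for value, group in groupby(readings):
        length = len(list(group))
        if length >= min_length:
            runs.append((index, length, value))
        index += length
    return runs


# Hypothetical hourly series shaped like the pattern described above:
# ordinary readings, then ten consecutive hours at 39 ppm, then six more
# hours mixing 39 ppm and 36 ppm (the order of those six is invented).
hourly_co = [4, 6, 9, 7, 5] + [39] * 10 + [36, 39, 39, 36, 39, 39] + [8, 5]
print(flat_runs(hourly_co))  # [(5, 10, 39)]
```

A flagged run is only a prompt to examine the instrument records and the sequence plot; the judgment that the readings reflect a malfunction still rests on knowledge of the measurement process.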
Histogram of 8-Hour Non-overlapping Average CO
Measurements at the Denver CAMP Station 1972
Sequence Plot of Hourly CO Readings
Denver CAMP Station June 18 -19, 1972
Data Associated with the 2 Highest Values during 1972
Sequence Plot of Hourly CO Readings
Cincinnati CAMP Station Friday July 12 to Sunday July 14, 1968
Data Associated with the 7 Highest Values during 1968
Accurate Measure of Air Quality in 1968???
Annual Average-Useful Alternative Measure of Air Quality
Annual Average
Good predictor of % time standard is exceeded
Much more precise and robust than the 2nd highest value
2nd highest value of a lognormal distribution is related to the annual average
Empirical evidence from the data collected at six major cities over a 10 year period and at several sampling stations in New Jersey showed that the relationship is very strong.
Recommendation:
• Use 2nd Highest Value to Compare Air Quality to Air Quality Standard
• Use Annual Average to Make Predictions and Calculate Emission Reductions
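The difference in precision between the two statistics can be illustrated with a small simulation. This is a sketch under strong assumptions, not a reanalysis of the CAMP data: it treats the 8-hour averages within a year as independent lognormal draws (real CO series are autocorrelated), and the parameters `mu`, `sigma`, the seed, and the number of simulated years are arbitrary choices.

```python
import random
import statistics


def simulate_year(rng, n_periods=1095, mu=1.0, sigma=0.7):
    """One simulated year of non-overlapping 8-hour averages (3 per day).

    mu and sigma are arbitrary lognormal parameters chosen for illustration.
    """
    values = sorted(rng.lognormvariate(mu, sigma) for _ in range(n_periods))
    return statistics.fmean(values), values[-2]  # annual average, 2nd highest


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(xs) / statistics.fmean(xs)


rng = random.Random(2013)  # fixed seed so the sketch is repeatable
years = [simulate_year(rng) for _ in range(200)]
cv_mean = coefficient_of_variation([avg for avg, _ in years])
cv_second = coefficient_of_variation([second for _, second in years])
print(f"CV of annual average: {cv_mean:.3f}")
print(f"CV of 2nd highest:    {cv_second:.3f}")
```

Under these assumptions the year-to-year coefficient of variation of the 2nd highest value comes out several times larger than that of the annual average, which is the sense in which the annual average is the more precise statistic.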
Conclusions – Regarding Data Pedigree
Trust but Verify - Data pedigree must be assessed when analyzing BIG DATA.
Data quality is an issue with all sources of data
Careful thought must be given to the model form needed to answer the question.
Different models can often get you to the same place, or to different places
Multiple sources of data require careful thought as to data pedigree and how to fit the databases together to produce useful results
Different data sources are typically associated with political issues, different agendas, different objectives, etc.
The Advantages of a Sequential Approach
Much of our professional literature, including our textbooks, assumes that statistical problems are “one shot studies”
We are handed a fixed data set, and must develop the “best” model to fit the data
Articles are frequently published challenging previously published analyses, and proposing a better model for the same data
This is clearly the tone of many high-profile data analysis competitions:
Netflix Challenge
Kaggle.com
The Advantages of a Sequential Approach
The important problems I have faced have almost always needed a sequential approach, involving more than one statistical tool
Viewing problem solving and data analysis as a sequential process results in a very different viewpoint versus one-shot studies
A key goal in the process is to direct the next round of data gathering and analysis, as opposed to finding the “correct” model
Sequential approaches also offer the opportunity for using hindsight to our advantage
“The best time to plan an experiment is after you have done it”
R. A. Fisher
Are Netflix and Kaggle.com Missing Something?
Data
Subject Matter Theory – Hypothesis - Conjecture
Process Knowledge Increases
Business Process
Customer
Data
Need for
Use of Subject Matter Knowledge is Critical
“Data have no meaning of themselves, they only have meaning within the context of a conceptual model of the phenomenon under study”
Box, Hunter and Hunter (1978)
Some believe that all you need is data –
“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (Anderson 2008)
Theory underlying the process and the data can help
Select the variables to be studied
Select model form
Interpret the results and extrapolate the findings
Understanding the process provides the context that aids
Framing of the problem
Putting It All Together
Statistical Engineering Approach to Big Data
Seek out high-impact problems
Don’t wait for them to knock on your door
Provide structure to poorly defined problems
Take time to understand context and background
Evaluate data quality – get more if needed
Develop a strategy or overall plan of attack
Software tools provide horsepower to “tame” Big Data
Don’t forget the fundamentals
Summary
The glass is half-full: BIG DATA offers us a unique opportunity
Useful statistical software exists
When used correctly, it enables effective analysis of BIG DATA
Statistical engineering fundamentals still apply
Ignoring these fundamentals increases likelihood of modeling disasters
Probability of success is significantly increased with:
Identifying a strategy, creating a plan for the project
Understanding of “data pedigree”
Utilization of sequential approaches
Integration of subject matter knowledge
Big Data Analytics Do Not Replace
Underlying Theory of Statistical Engineering
Systems and strategy are needed to guide effective use of statistical
tools and methods
Impact of statistical thinking and methods can be increased by
integrating several statistical tools
Enables practitioners to deal with highly complex issues that cannot
be addressed with any one method or tool.
Linking and sequencing the use of statistical tools
Speeds the learning of the approach
Increases the impact of the methods
Use of information technology increases the effectiveness of identifying
and implementing statistical engineering solutions
Embedding statistical thinking and tools into daily work institutionalizes
their application
Viewing statistical thinking and methods from an engineering context
provides a clear focus on problem solving to the benefit of humankind
References
Anderson, C. (2008) “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired Magazine, June 23, 2008
Davenport, T. H and J. G. Harris (2007) Competing on Analytics, Harvard Business School Press, Boston, MA
DeVeaux, R. D. and D. J. Hand (2005) “How to Lie with Bad Data”, Statistical Science, Vol. 20, No.3, 231-238
Hoerl, R. W. and R. D. Snee (2012) Statistical Thinking: Improving Business Performance, 2nd Ed., Wiley.
Pierrard, J. M. (1974) “Relating Automotive Emissions and Urban Air Quality”, DuPont Innovation, Vol. 5. No. 2, pp 6-9.
Pierrard, J. M., R. D. Snee and J. Zelson (1973) “A New Approach to Setting Vehicle Emission Standards”, Presented at Air Pollution Control Association Annual Meeting, June 24-28, 1973
Pierrard, J. M., R. D. Snee and J. Zelson (1974) “A New Approach to Setting Vehicle Emission Standards”, Air Pollution Control Association Journal, Vol. 24, No. 9, pp 841-848.
Snee, R. D. and R. W. Hoerl (2003) Leading Six Sigma – A Step by Step Guide Based on Experience With General Electric and Other Six Sigma Companies, FT Prentice Hall, New York, NY.
Snee, R. D. and R. W. Hoerl (2012) “Inquiry on Pedigree – Do You Know the Quality and Origin of Your Data?” Quality Progress, December 2012, 66-68.
Snee, R. D. and J. M. Pierrard (1977) “The Annual Average: An Alternative to the Second Highest Value as a Measure of Air Quality”, Air Pollution Control Association Journal, Vol. 27, No. 2, pp 131-133.
Articles on Statistical Engineering by Hoerl and Snee
Roger W. Hoerl and Ronald D. Snee, (2009) “Post Financial Meltdown: What Do Services Industries Need From Us Now?” Applied Stochastic Models in Business and Industry, December 2009, pp. 509-521.
Roger W. Hoerl and Ronald D. Snee, (2010) “Moving the Statistics Profession Forward to the Next Level,” The American Statistician, February 2010, pp. 10-14.
Roger W. Hoerl and R. D. Snee, (2010) “Closing the Gap: Statistical Engineering Can Bridge Statistical Thinking with Methods and Tools,” Quality Progress, May 2010, pp. 52-53.
Roger W. Hoerl and R. D. Snee, (2010) “Tried and True—Organizations Put Statistical Engineering to the Test and See Real Results,” Quality Progress, June 2010, pp. 58-60.
Roger W. Hoerl and Ronald D. Snee, (2010) “Statistical Thinking and Methods in Quality Improvement: A Look to the Future,” Quality Engineering, 22, 3, pp. 119-139.
Roger W. Hoerl and Ronald D. Snee, (2011) “Statistical Engineering: Is This Just Another Term for Applied Statistics?” Joint Newsletter of the ASA Section on Physical and Engineering Sciences and Quality and Productivity, March 2011, 4-6.
Ronald D. Snee and Roger W. Hoerl, (2010) “Further Explanation; Clarifying Points About Statistical Engineering,” Quality Progress, December 2010, pp. 68-72
Ronald D. Snee and Roger W. Hoerl (2011) “Engineering an Advantage”, Six Sigma Forum Magazine, Guest Editorial, February 2011, 6-7.
For Further Information, Please Contact:
Ronald D. Snee, PhD
Snee Associates, LLC
610-213-5595