Enhancing “Big Data” Projects
JMP Discovery Summit San Antonio TX
September 9-12, 2013
Ronald Snee, Richard De Veaux, Roger Hoerl
Outline
The Impact of Big Data Analytics
What Could Possibly Go Wrong?
How Statistical Engineering Fundamentals Can Help
Demo of JMP Tools for Big Data
Doing Big Data the Right Way
Summary
The Big Data Phenomenon
New technology for acquiring, storing, and processing data
IBM: 1.6 zettabytes (10^21 bytes) of digital data now available.
Enough to watch HD TV for 47,000 years
Varian: “I keep saying that the sexy job in the next 10 years
will be statisticians, and I’m not kidding.”
White House national "Big Data Initiative" launched in 2012
Real-time data acquisition becoming the norm
Computational research replacing the laboratory
The Impact of Big Data Analytics
Analytics based on Big Data changing our lives in profound ways
“Competing on Analytics” a best seller in 2007
$1,000,000 Netflix competition in 2009
“Crowdsourcing” approach to analytics
kaggle.com becomes the “eBay of data competitions”
Actress Angelina Jolie has a double-mastectomy because of DNA analytics, not a diagnosis of cancer
Doodle online poll showed me an ad for the Hilton Hotel in Sofia, Bulgaria
Who else has gotten this ad?
What else does Doodle know about me?
As Seen on 60 Minutes Webpage
What Could Possibly Go Wrong?
Duke Genomics Center publishes groundbreaking cancer
biomarker articles from 2005 - 2010
Clinical trials based on this research surprisingly did not pan out
Women died unexpectedly
Two statisticians, Keith Baggerly and Kevin Coombes, dug into
the research.
What Could Possibly Go Wrong?
Their conclusions?
Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless – moving a row or column over by one in a giant
spreadsheet – while others seemed inexplicable. The Duke team
shrugged them off as “clerical errors”...In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke...His collaborator and mentor, Dr. Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.
Anybody Remember These Guys?
What Could Possibly Go Wrong?
Lehman Brothers declared bankruptcy on September 15, 2008.
Largest bankruptcy filing in US history ($600 billion in assets)
DJIA drops over 500 points
Ironically, I had visited Lehman Brothers with GE Capital a few
years earlier
Lehman was selling models to predict corporate defaults
Their models were quite sophisticated, and based on large
amounts of historical financial data
Virtually all financial institutions impacted by the crisis had
models
What Could Possibly Go Wrong?
On April 18, 2011 the textbook “The Making of a Fly” debuts on Amazon.com
Amazon’s automated algorithm price: $1,730,045
Later in the day, the price goes up to $23,698,656
Plus $3.55 for shipping and handling.
No one buys the book that day.
Days later, the Amazon price was $106.
People started to buy the book.
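The slides do not say how the price escalated. One plausible mechanism, offered here purely as an illustrative assumption, is two automated repricing rules reacting to each other with no human check. The sketch below simulates that loop. The function name reprice, the seller labels, and the multipliers 0.998 and 1.27 are hypothetical and are not taken from the talk.

```python
def reprice(start_a, start_b, undercut=0.998, markup=1.27, rounds=10, ceiling=None):
    """Simulate two sellers whose pricing rules depend on each other.

    Seller A always prices slightly below seller B (undercut < 1).
    Seller B always prices above seller A (markup > 1).
    If undercut * markup > 1, both prices grow without bound
    unless a sanity ceiling is applied.
    All parameter values are hypothetical.
    """
    a, b = start_a, start_b
    history = []
    for _ in range(rounds):
        a = undercut * b          # A reacts to B's latest price
        b = markup * a            # B reacts to A's new price
        if ceiling is not None:   # optional sanity check on the output
            a, b = min(a, ceiling), min(b, ceiling)
        history.append((round(a, 2), round(b, 2)))
    return history


if __name__ == "__main__":
    for day, (a, b) in enumerate(reprice(100.0, 120.0), start=1):
        print(f"round {day}: seller A = {a:.2f}, seller B = {b:.2f}")
```

Passing a ceiling, for example reprice(100.0, 120.0, rounds=50, ceiling=500.0), shows how a simple subject-matter sanity check keeps the loop from running away.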
Critical Considerations for BIG DATA
Data Quality
Omissions, errors, missing values, etc.
Missing variables
Subject Matter Knowledge
Variable selection and appropriate scales
Interpretation of results
Ability to extrapolate findings
Use of Sequential Approaches
Big problems not solved with one analysis or one data set
Strategy must move beyond the “one-shot study” mindset
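The data quality items above (omissions, errors, missing values) can be checked before any modeling starts. The following is a minimal sketch in Python, assuming the data arrive as a list of dictionaries. The function name audit, the set of missing-value codes, and the plausible ranges are all hypothetical and would need to come from subject matter knowledge in a real project.

```python
from collections import Counter

# Hypothetical codes treated as "missing"; adjust to the data source.
MISSING = {"", "NA", "N/A", "null", None}


def audit(rows, numeric_ranges):
    """Count missing values, out-of-range values, and duplicate rows.

    rows: list of dicts mapping column name to raw value.
    numeric_ranges: {column: (low, high)} plausible limits supplied
    by a subject matter expert (an assumption of this sketch).
    """
    report = {
        "n_rows": len(rows),
        "missing": Counter(),
        "out_of_range": Counter(),
        "duplicates": 0,
    }
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1   # exact repeat of an earlier row
        seen.add(key)
        for col, value in row.items():
            if value in MISSING:
                report["missing"][col] += 1
            elif col in numeric_ranges:
                low, high = numeric_ranges[col]
                try:
                    x = float(value)
                except (TypeError, ValueError):
                    # Non-numeric entry in a numeric column counts as an error.
                    report["out_of_range"][col] += 1
                    continue
                if not low <= x <= high:
                    report["out_of_range"][col] += 1
    return report


if __name__ == "__main__":
    sample = [
        {"age": "34", "dose": "1.5"},
        {"age": "", "dose": "1.5"},
        {"age": "340", "dose": "NA"},
        {"age": "34", "dose": "1.5"},
    ]
    print(audit(sample, {"age": (0, 120), "dose": (0, 10)}))
```

A report like this does not detect missing variables or misaligned rows, which still require knowledge of how the data were collected.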
Role of Statistical Engineering in Big Data
So where are we?
Explosion of large data sets: Big Data
Lots of new analytical tools available
But modeling disasters continue to occur:
Poorly defined problems
Poor data – quantity but no quality
“Unbridled empiricism”: lack of subject matter knowledge
Big Data problems are typically large, complex, and unstructured
Textbook approaches are not suitable for such problems
Need more strategic approach
Statistical engineering is one option
Statistical Engineering Definition
Statistical engineering:
The study of how to best utilize statistical concepts,
methods, and tools and integrate them with
information technology and other relevant sciences
to generate improved results
Typical Phases of Statistical Engineering Projects
1. Identify high-impact problems
2. Create structure
3. Understand the context
4. Develop a strategy
5. Establish tactics
Statistical Engineering Approach to Big Data
Seek out high-impact problems
Don’t wait for them to knock on your door
Provide structure to poorly defined problems
Take time to understand context and background
Evaluate data quality – get more if needed
Develop a strategy or overall plan of attack
JMP tools provide horsepower to “tame” Big Data
Don’t forget the fundamentals
Summary
The glass is half-full:
Big Data offers us a unique opportunity
JMP has very powerful tools that, if used correctly, enable effective analysis of Big Data sets
Statistical engineering fundamentals still apply
Ignoring these fundamentals increases likelihood of modeling disasters
Probability of success is significantly increased with:
Understanding of “data pedigree”
Utilization of sequential approaches
Integration of subject matter knowledge