
Enhancing Big Data Projects


(1)

Enhancing “Big Data” Projects

JMP Discovery Summit San Antonio TX

September 9-12, 2013

Ronald Snee Richard De Veaux Roger Hoerl

(2)

Outline

The Impact of Big Data Analytics

What Could Possibly Go Wrong?

How Statistical Engineering Fundamentals Can Help

Demo of JMP Tools for Big Data

Doing Big Data the Right Way

Summary

(3)

The Big Data Phenomenon

New technology for acquiring, storing, and processing data

IBM: 1.6 zettabytes (10^21 bytes) of digital data now available.

Enough to watch HD TV for 47,000 years

Varian: “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”

White House national "Big Data Initiative" launched in 2012

Real-time data acquisition becoming the norm

Computational research replacing the laboratory

(4)

The Impact of Big Data Analytics

Analytics based on Big Data changing our lives in profound ways

“Competing on Analytics” a best seller in 2007

$1,000,000 Netflix competition in 2009

“Crowdsourcing” approach to analytics

kaggle.com becomes the “eBay of data competitions”

Actress Angelina Jolie has a double mastectomy because of DNA analytics, not a diagnosis of cancer

Doodle online poll showed me an ad for the Hilton Hotel in Sofia, Bulgaria

Who else has gotten this ad?

What else does Doodle know about me?

(5)
(6)
(7)

As Seen on 60 Minutes Webpage

(8)

What Could Possibly Go Wrong?

Duke Genomics Center publishes groundbreaking cancer biomarker articles from 2005 to 2010

Clinical trials based on this research surprisingly did not pan out

Women died unexpectedly

Two statisticians, Keith Baggerly and Kevin Coombes, dug into the research.

(9)

What Could Possibly Go Wrong?

Their conclusions?

Dr. Baggerly and Dr. Coombes found errors almost immediately. Some seemed careless – moving a row or column over by one in a giant spreadsheet – while others seemed inexplicable. The Duke team shrugged them off as “clerical errors”... In the end, four gene signature papers were retracted. Duke shut down three trials using the results. (Lead investigator) Dr. Potti resigned from Duke... His collaborator and mentor, Dr. Nevins, no longer directs one of Duke’s genomics centers. The cancer world is reeling.
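The "row or column over by one" error mentioned above is the kind of slip that a key-based join can catch. The following is a minimal sketch, not the method Baggerly and Coombes used: the sample IDs, labels, and function names are hypothetical, and it assumes each table carries its own sample ID column.

```python
def positional_mismatches(label_rows, measurement_rows):
    """Return row indices where two (sample_id, value) tables disagree on sample ID."""
    return [
        i
        for i, (label, measurement) in enumerate(zip(label_rows, measurement_rows))
        if label[0] != measurement[0]
    ]


def join_by_id(label_rows, measurement_rows):
    """Pair labels with measurements by sample ID instead of by row position."""
    labels = dict(label_rows)
    measurements = dict(measurement_rows)
    unmatched = sorted(set(labels) ^ set(measurements))
    if unmatched:
        # Refuse to proceed silently when the two tables cover different samples.
        raise ValueError(f"sample IDs present in only one table: {unmatched}")
    return {sid: (labels[sid], measurements[sid]) for sid in labels}


# Hypothetical data: the measurement table has slipped down by one row.
label_rows = [("S1", "sensitive"), ("S2", "resistant"), ("S3", "sensitive")]
shifted_rows = [("S2", 0.4), ("S3", 0.9), ("S4", 0.1)]
aligned_rows = [("S1", 0.7), ("S2", 0.4), ("S3", 0.9)]

shifted_report = positional_mismatches(label_rows, shifted_rows)
aligned_join = join_by_id(label_rows, aligned_rows)
```

Pasting columns side by side trusts row position. Joining on an explicit ID makes a shifted table fail loudly instead of producing plausible-looking but wrong pairings.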

(10)

Anybody Remember These Guys?

(11)

What Could Possibly Go Wrong?

Lehman Brothers declared bankruptcy on September 15th, 2008.

Largest bankruptcy filing in US history ($600 billion in assets)

DJIA drops over 500 points

Ironically, I had visited Lehman Brothers with GE Capital a few years earlier

Lehman was selling models to predict corporate defaults

Their models were quite sophisticated, and based on large amounts of historical financial data

Virtually all financial institutions impacted by the crisis had models

(12)
(13)

What Could Possibly Go Wrong?

On April 18th, 2011 the textbook “The Making of a Fly” debuts on Amazon.com

Amazon’s automated algorithm price: $1,730,045

Later in the day, the price goes up to $23,698,656

Plus $3.55 for shipping and handling.

No one buys the book that day.

Days later, the Amazon price was $106.

People started to buy the book.
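The slides do not say how the algorithm arrived at these prices. One possible mechanism, assumed here purely for illustration, is two automated sellers that each set their price as a multiple of the other's, with no upper bound. The multipliers, starting price, and `cap` parameter below are hypothetical, not the actual rules used on Amazon.

```python
def simulate_repricing(start_price, undercut=0.998, markup=1.27, rounds=10, cap=None):
    """Simulate two sellers who reprice off each other once per round.

    Seller A slightly undercuts seller B; seller B marks up over seller A.
    All parameter values are illustrative assumptions.
    """
    price_b = start_price
    history = []
    for _ in range(rounds):
        price_a = undercut * price_b
        price_b = markup * price_a
        if cap is not None:
            # A sanity limit that a human reviewer or a business rule might impose.
            price_a = min(price_a, cap)
            price_b = min(price_b, cap)
        history.append((round(price_a, 2), round(price_b, 2)))
    return history


unbounded = simulate_repricing(100.0, rounds=10)
bounded = simulate_repricing(100.0, rounds=10, cap=150.0)
```

Because the product of the two multipliers exceeds 1, the unbounded prices grow geometrically every round. The capped run shows how a single subject-matter constraint (no textbook is worth this much) stops the spiral.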

(14)

Critical Considerations for BIG DATA

Data Quality

Omissions, errors, missing values, etc.

Missing variables

Subject Matter Knowledge

Variable selection and appropriate scales

Interpretation of results

Ability to extrapolate findings

Use of Sequential Approaches

Big problems not solved with one analysis or one data set

Strategy must move beyond the “one-shot study” mindset
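As a rough illustration of the data quality items on this slide, a first pass over a data set can count, for each variable, how many records omit it entirely and how many carry a blank value. This is a minimal sketch in plain Python. The record layout, column names, and the `profile_columns` helper are hypothetical, and a real project would use whatever profiling tools it already has (the talk itself demonstrates JMP).

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN as missing values."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def profile_columns(rows, columns):
    """Count absent keys (omissions) and blank entries (missing values) per column."""
    report = {}
    for col in columns:
        absent = sum(1 for row in rows if col not in row)
        blank = sum(1 for row in rows if col in row and is_missing(row[col]))
        report[col] = {"absent": absent, "blank": blank, "n": len(rows)}
    return report


# Hypothetical records with typical defects.
records = [
    {"age": 34, "dose": 1.0},
    {"age": None, "dose": float("nan")},
    {"age": 51},
]
quality_report = profile_columns(records, ["age", "dose"])
```

A count like this only finds values that are visibly missing. Detecting wrong values and missing variables still depends on subject matter knowledge, which is the slide's second point.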

(15)
(16)

Role of Statistical Engineering in Big Data

So where are we?

Explosion of large data sets …. Big Data ….

Lots of new analytical tools available

But modeling disasters continue to occur:

Poorly defined problems

Poor data – quantity but no quality

“Unbridled empiricism”: lack of subject matter knowledge

Big Data problems are typically large, complex, and unstructured

Textbook approaches are not suitable for such problems

Need more strategic approach

Statistical engineering is one option

(17)

Statistical Engineering Definition

Statistical engineering:

The study of how to best utilize statistical concepts, methods, and tools and integrate them with information technology and other relevant sciences to generate improved results

(18)

Typical Phases of Statistical Engineering Projects

1. Identify high-impact problems

2. Create structure

3. Understand the context

4. Develop a strategy

5. Establish tactics

(19)
(20)
(21)

Statistical Engineering Approach to Big Data

Seek out high-impact problems

Don’t wait for them to knock on your door

Provide structure to poorly defined problems

Take time to understand context and background

Evaluate data quality – get more if needed

Develop a strategy or overall plan of attack

JMP tools provide horsepower to “tame” Big Data

Don’t forget the fundamentals

(22)

Summary

The glass is half-full:

Big Data offers us a unique opportunity

JMP has very powerful tools which, if used correctly, enable effective analysis of Big Data sets

Statistical engineering fundamentals still apply

Ignoring these fundamentals increases the likelihood of modeling disasters

Probability of success is significantly increased with:

Understanding of “data pedigree”

Utilization of sequential approaches

Integration of subject matter knowledge
