
(1)

Big Data Analytics: The Art of the Data Scientist

Neil Raden

Founder, Hired Brains Research

Twitter: @NeilRaden

Blog: http://hiredbrains.wordpress.com

Website: http://www.hiredbrains.com

Mail: nraden@hiredbrains.com

(2)

Convergence: End of managing from scarcity

[Timeline figure, 1950 to 2010. Eras shown: Batch Reporting, CICS/OLTP, C/S OLTP, Y2K/ERP, 4GL/PC/SS, DW/BI, Big Data, Hybrid]

(3)
(4)

My Generation: Control, Security, Stability

This Generation: Experience, Engagement, Gamification

(5)

Big Is Relative

(6)
(7)

Even Big Data Doesn’t Speak for Itself

• Incomplete

• Behaviors under-represented

• Anonymizing disasters

• Single source of data inadequate

• Harmonization

• Not a crystal ball

(8)

Missing in Big Data: Abstraction

The 1971 Audi 100 had an accelerator cable that connected the gas pedal to the carburetor and operated it. The 2013 Audi S8 is purely fly-by-wire. Sensors in the accelerator pedal communicate with the engine management system, which determines how much fuel the fuel injectors shoot into the engine. Stepping on the gas in 2013 is an abstraction.

A 2013 Audi S8 has more aggregate computing power than a 1971 IBM mainframe.

If you had to manage all of these systems yourself, you wouldn’t get out of the driveway. Their function is “abstracted.”

(9)

What Is Data Science?

• Discovering what we don’t know from data

• Getting predictive and/or actionable insight

• Development of data products that have clear business value

• Providing value to the organization through sharing and learning

• Using techniques like storytelling and metaphor to explain concepts

(10)
(11)

Do You Know This Number?

2.718281828459...
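
The number on this slide is Euler's number e. As a minimal sketch (the function names `e_series` and `e_limit` are illustrative, not from the talk), it can be approximated either from the series of reciprocal factorials or from the compound-interest limit:

```python
import math

def e_series(terms):
    """Approximate e as the sum of 1/k! for k = 0 .. terms-1."""
    total = 0.0
    term = 1.0  # 1/0!
    for k in range(terms):
        total += term
        term /= (k + 1)  # turn 1/k! into 1/(k+1)!
    return total

def e_limit(n):
    """Approximate e as (1 + 1/n)^n, which converges slowly as n grows."""
    return (1 + 1 / n) ** n

print(e_series(20))        # agrees with math.e to floating-point precision
print(e_limit(1_000_000))  # correct to roughly five decimal places
print(math.e)
```

The series converges far faster than the limit, which is why twenty terms are enough while a million compounding steps still leave a visible error.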

(12)

Euler Gave Us the Tools

Contribution: Example

Graph Theory: Graph & Ontology Databases

Infinitesimal Calculus: Everything

Topology: Topological Data Analysis

Number Theory: Encryption

(13)

But Euler Got One Thing Wrong

• Tobias Mayer

• A contemporary of Euler

• Famous for his observations of the libration of the moon

• TONS of observations

• Figured out how to group them

Famous quote:

Because these observations were derived from nine times as many observations, one can therefore conclude that they are

(14)

Euler Not a Data Scientist

Euler: “By the combination of two or more equations, the errors of the combinations and the calculations multiply themselves.”

The greatest mathematician of all time pre-dated the concept of statistical error
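
The idea of statistical error can be illustrated with a small simulation. This is a sketch under stated assumptions, not anything from the talk: observations are modeled as a true value of zero plus independent Gaussian noise, and `error_of_mean` is a hypothetical helper name. Under that model, combining observations shrinks the error of the average (in proportion to one over the square root of the number of observations) instead of multiplying it.

```python
import random
import statistics

def error_of_mean(n_obs, trials=2000, sigma=1.0, seed=0):
    """Spread of the average of n_obs noisy observations, estimated by simulation.

    Assumption: each observation is the true value (0.0) plus independent
    Gaussian noise with standard deviation sigma.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n_obs)]
        means.append(statistics.fmean(sample))
    return statistics.pstdev(means)

single = error_of_mean(1)    # close to sigma
grouped = error_of_mean(9)   # close to sigma / 3
print(f"error of one observation:    {single:.3f}")
print(f"error of a mean of nine:     {grouped:.3f}")
print(f"ratio (expected about 3):    {single / grouped:.2f}")
```

With nine times as many observations the error falls by a factor of about three, the square root of nine.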

(15)

Why Does This Matter?

Because Data Science is not the realm of the most brilliant mathematicians

It’s for people who know how to do it and who have the correct training and tools to do it themselves

(16)

The Data Scientist

• Term invented by Yahoo

• Super-tech, super-quant

• Business expert too

• Orientation: Search and Web

• We used to call them quants

• Few and far between

• How do you find/train them?

• Hint: like actuaries

(17)

A Typical Day

• Basic data manipulations to wrangle data and fit a variety of standard models - 40%

• Translate a business problem into the design of a data analysis strategy - 5%

• Graphically explore data to motivate modeling choices and improvements - 10%

• Interpret and critically examine standard model output - 5%

• Test the performance of models on holdout data - 10%
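
The last task in the list, testing a model on holdout data, can be sketched as follows. Everything here is an illustrative assumption: the data is synthetic, the model is a one-variable least-squares line, and `holdout_split`, `fit_line`, and `mse` are hypothetical helper names.

```python
import random

def holdout_split(rows, holdout_fraction=0.25, seed=0):
    """Shuffle the rows and set aside a fraction that the model never sees."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
noise = random.Random(1)
rows = [(x, 2.0 * x + 1.0 + noise.gauss(0.0, 0.5)) for x in range(100)]

train, holdout = holdout_split(rows)
model = fit_line([x for x, _ in train], [y for _, y in train])
train_mse = mse(model, [x for x, _ in train], [y for _, y in train])
holdout_mse = mse(model, [x for x, _ in holdout], [y for _, y in holdout])
print(f"training MSE: {train_mse:.3f}, holdout MSE: {holdout_mse:.3f}")
```

The holdout error is the number to report, since the training error is measured on the same points the model was fitted to.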

(18)
(19)

Data Scientist Job Listing

A PhD, or a Master's degree with at least five years of data mining experience.

A degree in one of the following fields: computer science, computational biology, statistics, or another computational area with emphasis on the use of machine learning/data mining to build predictive models.

Extensive hands-on experience working with very large data sets, including statistical analyses, data visualization, data mining, and data cleansing/transformation.

Under-the-hood knowledge of machine learning: supervised/unsupervised, loss functions, regularization, feature selection, regression/classification, cross-validation, bagging, kernel methods, sampling, probability distributions.

Experience prototyping and developing data mining solutions using statistical software (R, Matlab, etc).

Strong ability to communicate deep analytical results in forms that resonate with scientific and/or business collaborators, highlighting actionable insights.

Entrepreneurial inclination to discover novel opportunities for applying analytical techniques to business/scientific problems across the company.

Strong Unix/Linux scripting skills (Perl, Python, etc).

Familiarity with writing SQL queries and working with databases.

Object oriented programming experience (Java, C++, etc).

Capacity to motivate and train junior scientists and offer counsel to peers.

(20)

Who Remembers Distributions?

The equations are the Moment Generating Function and Fourier transform of a normal distribution f with mean μ and standard deviation σ.

This is one of the most common formulas in statistics.
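
The equations themselves did not survive in this copy of the slide. For reference, the standard forms for a normal distribution with mean μ and standard deviation σ are given below; the sign convention chosen for the Fourier transform (kernel e^{-itx}) is an assumption, since the slide's own convention is not recoverable.

```latex
% Moment generating function of X ~ N(\mu, \sigma^2)
M_X(t) = \mathbb{E}\left[e^{tX}\right]
       = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^{2} t^{2}\right)

% Fourier transform of the density f, assuming the kernel e^{-itx}
\hat{f}(t) = \int_{-\infty}^{\infty} f(x)\, e^{-itx}\, dx
           = \exp\!\left(-i\mu t - \tfrac{1}{2}\sigma^{2} t^{2}\right)
```

With the opposite sign in the kernel the result is the characteristic function, exp(iμt − σ²t²/2); in both cases the transform has the same Gaussian shape in t.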

(21)

From IBM: Looking for Unicorns

“There is not a clear definition of the kind of profile you need for a top analytic performer. IBM is conducting research with various organizations to interview leaders and analytic experts to develop this kind of profile. What’s bubbling out of this is that top performers:”

1. Have good communication skills

2. Have good math skills (not typically associated with #1)

3. Good individual problem-solver

4. Collaborative (again, a bit of a contradiction with #3)

5. Intellectually curious

(22)

Not Invented Here Danger

• Very few industries start with the rawest materials and finish with the complete product

• Mining Big Data for insight and action can be considered the same way

• Even digital giants like Twitter don’t

• Why would you?

(23)
(24)

Analytic Types

Columns: Descriptive Title; Quantitative Sophistication/Numeracy; Sample Roles

Type I: Quantitative R&D
Quantitative Sophistication/Numeracy: PhD or equivalent
Sample Roles: Creation of theory, development of algorithms. Academic/research. Work in business/government for very specialized roles

Type II: Data Scientist or Quantitative Analyst
Quantitative Sophistication/Numeracy: Advanced Math/Stat, not necessarily PhD
Sample Roles: Internal expert in statistical and mathematical modelling and development, with solid business domain knowledge.

Type III: Operational Analytics
Quantitative Sophistication/Numeracy: Good business domain, background in statistics optional
Sample Roles: Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation

Type IV: Business Intelligence/Discovery
Quantitative Sophistication/Numeracy: Data and numbers oriented, but no special advanced statistical skills
Sample Roles: Reporting, dashboard, OLAP and visualization, some design, posterior analysis of results from quantitative

(25)

Analytic Types

[The same table, repeated with overlay labels: Types of Analysis; Training/Mentoring/Apps; Training/Mentoring/Apps; 3rd Party Services; Type Shifting]

(26)

Type Shifting

• As much as 80% of “Data Scientist” work can be done by others

• Data gathering, cleansing, profiling, parsing and loading

• Data and process stewardship

• Platform availability

• Providing organizational and market domain expertise

(27)

Types of Analytics

[Figure: types of analytics on an axis labeled from “Knowledge - Description” to “Action - Prescription”]

Business Intelligence: How do I use data to learn about my customers? What has been happening in my business?

Data Mining: Who are my best/worst customers? How do I turn my data into rules for better decisions?

Predictive Analytics: How are those customers likely to behave in the future? How do they react to the myriad ways I can “touch” them?

Optimization: How do I make the best possible decisions given my constraints?

(28)

Descriptive Analytics - Improve Rules

[Figure: customer segments plotted against Income and Education, each axis running up to High. Segment labels: Low-moderate income, young; High income, low-moderate education; Moderate-high education, low-moderate income; Moderate education, low income, middle-aged]

(29)

Predictive Analytics – Add Insight


(30)
(31)

Stat Tools Can Be Dangerous

• Tests are not the event

• Tests are flawed

• Tests detect things that don’t exist

• Tests give test probabilities not the real probabilities

• False positives skew results

• People prefer natural numbers

• Even Science is a test
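
The gap between a test's stated accuracy and the real probability can be worked through with Bayes' rule. This is a sketch with made-up numbers: the 1% prevalence, 90% sensitivity, and 91% specificity are illustrative assumptions, and both function names are hypothetical.

```python
def probability_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: chance the condition is real, given a positive test."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

def as_natural_numbers(population, prevalence, sensitivity, specificity):
    """The same calculation as head counts, which people find easier to read."""
    affected = population * prevalence
    true_positives = affected * sensitivity
    false_positives = (population - affected) * (1 - specificity)
    return round(true_positives), round(false_positives)

# Illustrative assumptions: 1% prevalence, 90% sensitivity, 91% specificity.
p = probability_given_positive(0.01, 0.90, 0.91)
hits, false_alarms = as_natural_numbers(10_000, 0.01, 0.90, 0.91)
print(f"P(condition | positive test) = {p:.1%}")
print(f"Out of 10,000 people: {hits} true positives, {false_alarms} false positives")
```

Under these assumed numbers a positive result is right only about 9% of the time, because the false positives from the large unaffected group outnumber the true positives.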

(32)

The combination of some data and an aching

(33)
(34)

Analytics is hard

Analytics takes resources

Analytics takes effort to create and assimilate

You need to focus your analytics at the key leverage

points of your business

UPS focuses on where the package is

Marriott focuses on yield management

If you try to do everything, you won’t do anything

(35)

Analytics is hard. Analytics takes resources. It takes effort for an organization to create and assimilate learnings from analytics.

You need to focus your analytics at the key leverage points of your business. UPS focuses its analytics on knowing where packages are; Marriott focuses on yield management.

If you try to do everything, you won’t do anything well.

(36)

Decisions: A Miracle Happens?

The Jordan River Problem: 40 years wandering with BI; how do we get across?

Will Data Science Lead Us to Better Decision Processes?

(37)

A Final Thought About Analytics

The challenge of analytics is communication and creating a shared understanding.

It’s about focusing on high impact areas, moving forward one step at a time, being skeptical, being creative, searching for the truth.

Any company can “Compete on Analytics.”

But not like this

[Figure: Stock Market Returns for the “Competing on Analytics” Cohort; axis from 0% to 120%]

(38)

Five Things to Remember

• Data is an “asset,” people make it valuable

• Your data scientists may well be a team

• Communication, insight and reason are more important than math

• You have lurking data scientists in your firm

• Start with what matters, build confidence

(39)

Thank You

Questions?

Neil Raden

Founder, Hired Brains Research

Twitter: NeilRaden

Blog: http://hiredbrains.wordpress.com

Website: http://www.hiredbrains.com

Mail: nraden@hiredbrains.com
