Big Data Analytics: The Art of the Data Scientist
Neil Raden
Founder, Hired Brains Research
Twitter: @NeilRaden
Blog: http://hiredbrains.wordpress.com
Website: http://www.hiredbrains.com
Mail: nraden@hiredbrains.com
[Timeline, 1950 to 2010: Batch Reporting, CICS/OLTP, C/S OLTP, Y2K/ERP, 4GL/PC/SS, DW/BI, Big Data, Hybrid]
Convergence: End of managing from scarcity
My Generation: Control, Security, Stability
This Generation: Experience, Engagement, Gamification
Big Is Relative
Even Big Data Doesn’t Speak for Itself
• Incomplete
• Behaviors under-represented
• Anonymizing disasters
• Single source of data inadequate
• Harmonization
Not a crystal ball
Missing in Big Data: Abstraction
The 1971 Audi 100 had an accelerator cable that ran from the gas pedal and operated the carburetor. The 2013 Audi S8 is purely fly-by-wire. Sensors in the accelerator pedal communicate with the engine management system, which determines how much fuel the injectors deliver to the engine. Stepping on the gas in 2013 is an abstraction.
A 2013 Audi S8 has more aggregate computing power than a 1971 IBM mainframe.
If you had to manage all of these systems yourself, you wouldn’t get out of the driveway. Their function is “abstracted.”
What Is Data Science?
• Discovering what we don’t know from data
• Getting predictive and/or actionable insight
• Development of data products that have clear
business value
• Providing value to the organization through
sharing and learning
• Using techniques like storytelling and
metaphor to explain concepts
Do You Know This Number?
2.718281828459...
Euler Gave Us the Tools
Contribution | Example
Graph Theory | Graph & Ontology Databases
Infinitesimal Calculus | Everything
Topology | Topological Data Analysis
Number Theory | Encryption
But Euler Got One Thing Wrong
• Tobias Mayer
• A contemporary of Euler
• Famous for his observations of the
libration of the moon
• TONS of observations
• Figured out how to group them
Famous quote:
Because these observations were derived from nine times as
many observations, one can therefore conclude that they are
Euler Not a Data Scientist
Euler: “By the combination of two or more equations, the errors of the combinations and the calculations multiply themselves.”
The greatest mathematician of all time pre-dated the concept of statistical error
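The disagreement between Mayer and Euler can be checked with a small simulation. The sketch below is not from the original slides. It assumes independent, zero-mean Gaussian measurement errors (the noise level, trial count, and seed are hypothetical choices) and compares the spread of single observations with the spread of averages of nine observations. Under those assumptions the error of the average shrinks by roughly the square root of the number of observations, about 3 for nine observations, rather than multiplying.

```python
import random
import statistics

def spread_of_means(group_size, trials=20000, sigma=1.0, seed=42):
    """Standard deviation of the mean of `group_size` noisy observations.

    Each observation is the true value (0.0 here) plus Gaussian noise.
    All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        group = [rng.gauss(0.0, sigma) for _ in range(group_size)]
        means.append(sum(group) / group_size)
    return statistics.pstdev(means)

single = spread_of_means(1)    # error of one observation
grouped = spread_of_means(9)   # error of an average of nine
ratio = single / grouped

print(f"spread of one observation: {single:.3f}")
print(f"spread of a mean of nine:  {grouped:.3f}")
print(f"improvement factor:        {ratio:.2f}")
```

Combining observations helps, but by the square root of the count, not by the count itself.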
Why Does This Matter?
Because Data Science is not the realm of the most brilliant mathematicians.
It’s for people who know how to do it and who have the correct training and tools to do it themselves.
The Data Scientist
• Term invented by Yahoo
• Super-tech, super-quant
• Business expert too
• Orientation: Search and Web
• We used to call them quants
• Few and far between
• How do you find/train them?
• Hint: like actuaries
A Typical Day
• Basic data manipulations to wrangle data and fit a variety of standard models - 40%
• Translate a business problem into the design of a data analysis strategy - 5%
• Graphically explore data to motivate modeling choices and improvements - 10%
• Interpret and critically examine standard model output - 5%
• Test the performance of models on holdout data - 10%
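The last task in the list, testing a model on holdout data, might look like the following minimal sketch. Everything in it is an assumption for illustration: the synthetic data, the 80/20 split, and the one-variable least-squares model are hypothetical stand-ins, not a workflow described in the slides.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(points, intercept, slope):
    """Root mean squared error of the fitted line on the given points."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return (sum(errors) / len(errors)) ** 0.5

rng = random.Random(0)
# Synthetic data: y = 2 + 3x plus Gaussian noise (hypothetical example).
data = []
for _ in range(500):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0.0, 1.0)))

# Shuffle, then hold out 20% of the rows that the model never sees.
rng.shuffle(data)
cut = int(0.8 * len(data))
train, holdout = data[:cut], data[cut:]

intercept, slope = fit_line(train)
train_rmse = rmse(train, intercept, slope)
holdout_rmse = rmse(holdout, intercept, slope)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train RMSE={train_rmse:.2f} holdout RMSE={holdout_rmse:.2f}")
```

A holdout error far above the training error would suggest the model has fit noise rather than signal.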
Data Scientist Job Listing
• A PhD, or a Master's degree with at least five years of data mining experience.
• A degree in one of the following fields: computer science, computational biology, statistics, or another computational area with emphasis on the use of machine learning/data mining to build predictive models.
• Extensive hands-on experience working with very large data sets, including statistical analyses, data visualization, data mining, and data cleansing/transformation.
• Under-the-hood knowledge of machine learning: supervised/unsupervised learning, loss functions, regularization, feature selection, regression/classification, cross-validation, bagging, kernel methods, sampling, probability distributions.
• Experience prototyping and developing data mining solutions using statistical software (R, Matlab, etc).
• Strong ability to communicate deep analytical results in forms that resonate with scientific and/or business collaborators, highlighting actionable insights.
• Entrepreneurial inclination to discover novel opportunities for applying analytical techniques to business/scientific problems across the company.
• Strong Unix/Linux scripting skills (Perl, Python, etc).
• Familiarity with writing SQL queries and working with databases.
• Object oriented programming experience (Java, C++, etc).
• Capacity to motivate and train junior scientists and offer counsel to peers.
Who Remembers Distributions?
The equations are the Moment Generating Function and Fourier transform of a normal distribution f with mean μ and standard deviation σ.
This is one of the most common formulas in statistics.
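For reference, the standard closed forms for a normal distribution with mean μ and standard deviation σ are M(t) = exp(μt + σ²t²/2) for the moment generating function and φ(t) = exp(iμt − σ²t²/2) for the characteristic function. The latter is the Fourier transform of the density under the E[exp(itX)] sign convention, which is assumed here. Other conventions flip the sign of t. The sketch below checks both formulas numerically. The parameter values are arbitrary, and the trapezoid integration is a rough illustration rather than a rigorous method.

```python
import cmath
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def expect(g, mu, sigma, steps=20000, width=12.0):
    """Trapezoid estimate of E[g(X)] over mu +/- width * sigma."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * g(x) * normal_pdf(x, mu, sigma)
    return total * h

mu, sigma, t = 1.0, 2.0, 0.3  # arbitrary illustrative values

# Moment generating function: E[exp(tX)]
mgf_numeric = expect(lambda x: math.exp(t * x), mu, sigma)
mgf_closed = math.exp(mu * t + 0.5 * sigma**2 * t**2)

# Characteristic function: E[exp(itX)]
cf_numeric = expect(lambda x: cmath.exp(1j * t * x), mu, sigma)
cf_closed = cmath.exp(1j * mu * t - 0.5 * sigma**2 * t**2)

print("MGF numeric vs closed form:", mgf_numeric, mgf_closed)
print("CF  numeric vs closed form:", cf_numeric, cf_closed)
```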
From IBM: Looking for Unicorns
“There is not a clear definition of the kind
of profile you need for a top analytic
performer. IBM is conducting research
with various organizations to interview
leaders and analytic experts to develop
this kind of profile. What’s bubbling out of
this is that top performers:”
• Have good communication skills
• Have good math skills (not typically associated with #1)
• Are good individual problem-solvers
• Are collaborative (again, a bit of a contradiction with #3)
• Are intellectually curious
Not Invented Here Danger
• Very few industries start with the rawest materials and carry through to the finished product
• Mining Big Data for insight and action can be considered the same way
• Even digital giants like Twitter don’t do it all themselves
• Why would you?
Analytic Types
Type | Descriptive Title | Quantitative Sophistication/Numeracy | Sample Roles
Type I | Quantitative R&D | PhD or equivalent | Creation of theory, development of algorithms. Academic/research. Work in business/government for very specialized roles
Type II | Data Scientist or Quantitative Analyst | Advanced Math/Stat, not necessarily PhD | Internal expert in statistical and mathematical modelling and development, with solid business domain knowledge.
Type III | Operational Analytics | Good business domain, background in statistics optional | Running and managing analytical models. Strong skills in and/or project management of analytical systems implementation
Type IV | Business Intelligence/Discovery | Data and numbers oriented, but no special advanced statistical skills | Reporting, dashboard, OLAP and visualization, some design, posterior analysis of results from quantitative
Types of Analysis
Training/Mentoring/Apps
3rd Party Services
Type Shifting
• As much as 80% of “Data Scientist” work can
be done by others
• Data gathering, cleansing, profiling, parsing
and loading
• Data and process stewardship
• Platform availability
• Providing organizational and market domain
expertise
Types of Analytics
Data Mining: Who are my best/worst customers? How do I turn my data into rules for better decisions?
Predictive Analytics: How are those customers likely to behave in the future? How do they react to the myriad ways I can “touch” them?
Optimization: How do I make the best possible decisions given my constraints?
Business Intelligence: How do I use data to learn about my customers? What has been happening in my business?
[Axis: Knowledge - Description to Action - Prescription]
Descriptive Analytics - Improve Rules
[Figure: customer segments plotted by Income and Education. Segment labels: Low-moderate income, young; High income, low-moderate education; Moderate-high education, low-moderate income; Moderate education, low income, middle-aged]
Predictive Analytics – Add Insight
Stat Tools Can Be Dangerous
• Tests are not the event
• Tests are flawed
• Tests detect things that don’t exist
• Tests give test probabilities not the real probabilities
• False positives skew results
• People prefer natural numbers
• Even Science is a test
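The gap between a test's stated accuracy and the real probability of the event, and the appeal of natural numbers, can be illustrated with Bayes' rule. The sketch below uses hypothetical figures (1% prevalence, 90% sensitivity, 91% specificity), none of which come from the original slides, and restates the result as counts of people rather than probabilities.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | positive test) via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

def natural_frequencies(population, prevalence, sensitivity, specificity):
    """The same calculation expressed as whole-number counts of people."""
    affected = round(population * prevalence)
    true_pos = round(affected * sensitivity)
    false_pos = round((population - affected) * (1.0 - specificity))
    return true_pos, false_pos

# Hypothetical test characteristics, chosen only for illustration.
prevalence, sensitivity, specificity = 0.01, 0.90, 0.91

ppv = positive_predictive_value(prevalence, sensitivity, specificity)
true_pos, false_pos = natural_frequencies(1000, prevalence, sensitivity, specificity)

print(f"P(condition | positive test) = {ppv:.3f}")
print(f"Of 1000 people, {true_pos + false_pos} test positive, "
      f"but only {true_pos} of them have the condition.")
```

With these assumed numbers, a test that sounds about 90% accurate yields a positive result that is right only about 9% of the time, because false positives from the large unaffected group swamp the true positives.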
The combination of some data and an
aching
Analytics is hard
Analytics takes resources
Analytics takes effort to create and assimilate
You need to focus your analytics at the key leverage
points of your business
UPS focuses on where the package is
Marriott focuses on yield management
If you try to do everything, you won’t do anything
Analytics is hard. Analytics takes resources. It takes effort for an organization to create and assimilate learnings from analytics.
You need to focus your analytics at the key leverage points of your business. UPS focuses their analytics on knowing where packages are, Marriott focuses on yield management.
If you try to do everything, you won’t do anything well.
Decisions: A Miracle Happens?
The Jordan river
Problem: 40 years wandering with BI; how do we get across?
Will Data Science Lead Us to Better Decision Processes?
A Final Thought About Analytics
The challenge of analytics is communication and
creating a shared understanding.
It’s about focusing on high impact areas, moving
forward one step at a time, being skeptical, being
creative, searching for the truth.
Any company can
“Compete on Analytics.”
But not like this
[Chart: Stock Market Returns for the “Competing on Analytics” Cohort]