(1)

Leveraging Mesos as the Ultimate Distributed Data Science Platform

(such a long title,) by @DataFellas @Noootsab, 8th Oct. ‘15 @MesosCon

(2)

(Legacy) Data Science Pipeline/Product

What changed since then

Distributed Data Science (today)

Luckily, we have Mesos and friends

Going beyond (productivity)

(3)

Data Fellas

5-month-old Belgian startup

Andy Petrella: Maths, Scala, Apache Spark, Spark Notebook, Trainer, Data Banana

Xavier Tordoir: Physics, Bioinformatics, Scala, Spark

(4)

(Legacy) Data Science Pipeline

Or, so called, Data Product

Static results

Lots of information lost in translation

Sounds like Waterfall

ETL look and feel

(5)

(Legacy) Data Science Pipeline

Or, so called, Data Product

Mono machine!

CPU bounds

Memory bounds

(6)

Facts

Data gets bigger or, more precisely, the number of available sources explodes

Data gets faster (and faster); just consider watching Netflix over 4G

Our world Today

(7)

Consequences

HARD

(or will be too big...)

Ephemeral

Restricted View

Sampling

Report

Our world Today

(8)

Interpretation

Too SLOW to get real ROI out of the overall system

How to work around that?

Our world Today

No, it wasn’t better before

(9)

Our world Today

No, it wasn’t better before

Alerting system over descriptive charts

More accurate results

more or harder models (e.g. Deep Learning)

More data

Constant data flow

Online interactions under control (e.g. direct feedback)

Needs

(10)

Our world Today

No, it wasn’t better before

Distributed Systems

(11)

Distributed Data Science

System/Platform/SDK/Pipeline/Product/… whatever you call it

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer
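The steps above can be sketched in a few lines. This is only an illustrative, single-process stand-in and not the speakers' implementation: `Pipeline`, `in_memory_source`, and `double_model` are hypothetical names, and a real deployment would swap in distributed sources, models, and sinks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, float]


@dataclass
class Pipeline:
    """Hypothetical three-stage pipeline: source -> model -> sink."""
    source: Callable[[], Iterable[Record]]  # "Connect to sources"
    model: Callable[[Record], Record]       # "Create distributed data pipeline/Model"
    sink: Callable[[List[Record]], None]    # "Write results to Sinks"

    def run(self) -> int:
        # Apply the model to every record, hand the results to the sink,
        # and report how many results were written.
        results = [self.model(record) for record in self.source()]
        self.sink(results)
        return len(results)


def in_memory_source() -> Iterable[Record]:
    # Stand-in for a real source (file, queue, database).
    return [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]


def double_model(record: Record) -> Record:
    # Stand-in "model": adds a prediction field to the record.
    return {**record, "prediction": 2.0 * record["x"]}


stored: List[Record] = []  # stand-in sink: an in-memory list
pipeline = Pipeline(in_memory_source, double_model, stored.extend)
written = pipeline.run()
```

Keeping each stage behind a plain function makes it possible to tune accuracy (swap the model) or performance (swap the source or sink) without touching the other stages.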

(16)

Distributed Data Science

System/Platform/SDK/Pipeline/Product/… whatever you call it

Write results to Sinks

YO! Aren’t we talking about “Big” Data? Fast Data?

So could (all) results really be neither big nor fast?

Actually, results are themselves becoming “Big” Data! Fast Data!

(17)

Distributed Data Science

System/Platform/SDK/Pipeline/Product/… whatever you call it

Access Layer

How have we accessed data since the 90’s? Remember SOA?

→ SERVICES!

Nowadays, we’re talking about microservices. Here we are: one service for one result.
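"One service for one result" could look like the minimal sketch below, using only Python's standard library. The `RESULT` payload, the `/result` path, and the helper names are assumptions made for illustration; the talk does not specify how its services are built.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple

# Hypothetical result produced by one pipeline run.
RESULT = {"name": "daily_count", "value": 42}


def respond(path: str) -> Tuple[int, bytes]:
    """Map a request path to (HTTP status, JSON body)."""
    if path == "/result":
        return 200, json.dumps(RESULT).encode("utf-8")
    return 404, json.dumps({"error": "not found"}).encode("utf-8")


class ResultHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Serve the single result this service is responsible for.
        status, body = respond(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 0) -> HTTPServer:
    # Port 0 lets the OS pick a free port; call serve_forever() to start serving.
    return HTTPServer(("127.0.0.1", port), ResultHandler)
```

Each result gets its own small process like this one, which is what makes it a natural unit to deploy and scale on a cluster.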

(18)

Distributed Data Science

System/Platform/SDK/Pipeline/Product/… whatever you call it

User Access

C’mon, charts and tables cannot be the only views offered to customers/clients, right?

We need to open up the capabilities to UIs (dashboards), connectors (third parties), other services (“SOA”), …

(19)

Where is Mesos?

(Almost) EVERYWHERE!

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Implies Allocation

Implies Scalability

Implies Deployment

Implies Deployment

Implies Scalability

(20)

Why Mesos?

Because it can… (and even more)

Mesos

Allocate, Access, Configure, Deploy, Scale, Schedule

Marathon, Chronos, DCOS
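As a rough illustration of "Allocate", "Deploy", and "Scale", the sketch below builds the kind of JSON app definition that Marathon accepts on its `/v2/apps` REST endpoint. The app id, command, and resource figures are made-up values, and nothing is sent over the network; the helper names `marathon_app` and `scale` are hypothetical.

```python
import json
from typing import Any, Dict


def marathon_app(app_id: str, cmd: str, cpus: float, mem_mb: int,
                 instances: int) -> Dict[str, Any]:
    """Build a JSON-serialisable Marathon app definition."""
    return {
        "id": app_id,            # deploy: which app this is
        "cmd": cmd,              # deploy: what to run
        "cpus": cpus,            # allocate: CPU share per instance
        "mem": mem_mb,           # allocate: memory per instance, in MB
        "instances": instances,  # scale: how many copies to run
    }


def scale(app: Dict[str, Any], instances: int) -> Dict[str, Any]:
    # Scaling is the same definition with a different instance count.
    return {**app, "instances": instances}


# Assumed example values, for illustration only.
app = marathon_app("/results/daily-count", "python3 result_service.py",
                   cpus=0.5, mem_mb=256, instances=2)
payload = json.dumps(scale(app, 4), sort_keys=True)
```

The point of the sketch is that allocation, deployment, and scaling all reduce to fields in one declarative document, which the scheduler then enforces.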

(21)

What about Productivity?

Streamlining development lifecycle most welcome

“Create” Cluster

Find available sources (context, content, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

[Figure labels on the steps: ops, data, sci, web]

(22)

What about Productivity?

Streamlining development lifecycle most welcome

➔ Longer production line

➔ More constraints (resources sharing, time, …)

➔ More people

➔ More skills

Overlook these points and you’ll be … soon or sooner

So, how to have:

● Results coming fast enough whilst keeping the accuracy level high?

● Responsiveness to external/unpredictable events?

(23)

What about Productivity?

Streamlining development lifecycle most welcome

At Data Fellas, we think that we need Interactivity and Reactivity to tighten the frontiers (within team and in time).

Hence, Data Fellas

● extends the Spark Notebook (Interactivity)

● in the Shar3 product (Integrated Reactivity)

(24)

That’s all folks

Thanks for listening/staying

Poke us on

@DataFellas @Shar3_Fellas @SparkNotebook

@Xtordoir & @Noootsab

Now @TypeSafe: http://t.co/o1Bt6dQtgH

Follow up Soon on http://NoETL.org
