designing and exploiting
BIG DATA PLATFORM
Purpose: Capturing all business data and powering ‘smart applications’ that turn data into profits
Smart Applications turn data into profits by delivering…
…the most relevant LOYALTY offers for EACH CLIENT…
…the best PRODUCT ASSORTMENT for each store all the time…
…the right amount of INVENTORY in each store every day…
…the right check-out COUPON for each client in real time…
…the BEST PRICE for each product at all times…
…the most RELEVANT 1-TO-1 COMMUNICATION at all times…
…the most engaging MOBILE CONTENT or COUPON for each client…
…the most RELEVANT WEBSHOP CONTENT for each client…
Why BIG DATA PLATFORM?
Big Data Platform compared with SQL (status quo):
Cost: 1/100th of SQL’s cost
Capacity: Never lose or dump data again
Speed: Decrease latency from hours to seconds
Capability: Massively increase the depth of information captured
Profitability: Generate > 10€ of profits for every 1€ of cost
Acting as a bridge between Transactions and Decisions
Transactional world: Supply Chain, Marketing, Operations… all relevant data sources…
Data
Decisional world: smart applications exploiting the data
Making operational decisions
High degree of automation
High ROI, strong profitability
Unlimited capacity, Event Sourcing, Agility & Resilience, Very low cost
Removing the performance limitations of SQL
Unlimited storage: Close to unlimited, affordable storage
True availability: Stable and high performance no matter how much data is stored and how many applications are pulling data
Preserving the full history of your data: Record every change to your data and enable much smarter decisions (Event sourcing)
High agility: Rapid iterations over the data
Simplicity: Data is documented and truly easy to consume
Serviceability: Emphasis on self-service patterns while avoiding table clutter and cryptic fields
Rapid & simple implementation: Scoping and implementation in weeks
Smart Apps take operational decisions and create ROI
The purpose of the Data Platform is to serve “smart apps”: applications that deliver intelligence over the data made available.
Examples:
Choosing the right amount of inventory for each reference in each store.
Choosing the right price for each reference in each store.
Choosing the ‘right’ coupon for each loyalty card holder.
….
Each question is addressed by its own dedicated “smart app”. These apps can be developed internally or by 3rd parties. The Data Platform is built to offer maximal agility in the process: ‘plugging’ an app on top should be a matter of hours.
The goal of these smart apps is to directly generate operational decisions (e.g. adjusting a price, launching a re-order, choosing a coupon) that are then plugged directly into the transactional system. Smart apps that operate over the Data Platform should not be confused with Business Intelligence.
All experts in the company are empowered to initiate and use smart apps
The number of people or teams that are involved with these smart apps can be very large:
The Data Platform makes it very easy to consume clean and practical data.
The performance to access data is both very stable and very high.
Expertise that exists among all the teams of the retailer can be preserved and leveraged.
The Data Platform is precisely designed to not be the bottleneck of single data initiatives. The whole setup emphasizes self-service. After the initial coaching, “consumer” teams are able to work autonomously (without the Data Platform team).
The Pitfalls of classic data approaches
‘True’ granular intelligence requires the full depth and history of data for really smart decisions.
Classic approaches in commerce fail mainly for 3 reasons:
Lack of agility (speed): Being “data smart” implies being able to iterate rapidly (weekly or even daily) over the data. Classic approaches are too slow: the world changes faster than results get delivered.
Ongoing loss of information: A lot of data is lost in subtle yet critical ways due to the performance limits of classic solutions (e.g. historical inventory levels, sales history beyond x months, etc.)
Lack of affordable scalability: Classic solutions limit the amount of data that can be stored and processed, while being extremely expensive. “Unlimited” scalability should require neither expensive hardware nor expensive teams to run.
A Custom Design ensures 100% fit and max performance
Custom Design: Why not use a big framework/application for all clients?
90% of the development effort is attributable to the creation of connectors to the various relevant data sources. This is always required.
No packaged framework is ever a perfect fit for your business. The result is unnecessary complexity and an impedance mismatch: teams end up trying to recycle features that were never intended for that exact use.
A customization from the bottom up increases performance and ‘fit’ compared to a pre-packaged approach.
Example: By adopting a domain-specific data format, it is possible to store and process 1 year of checkout receipts for a network of 1000 stores on a smartphone.
See http://w3.lokad.com/receiptstream
SQL implies adopting tabular storage and its subtle yet very strong limitations.
Example Stock-On-Hand History: In most ERPs, each SKU is associated with a table that maps the SKU ID to the SKU stock-on-hand. However, the history that has led to this stock-on-hand situation is lost. Many important questions therefore cannot be addressed:
What was the stock on hand at any point in time?
What was the list of stock corrections applied to the SKU over time?
How many units have been discarded because they reached the expiration date?
What was the true service level of a product over time?
Example Price Change: Instead of just entering a new price X for a SKU/location, it becomes possible to capture “Price moved to X because competitor just lowered to Y”. This type of information is hugely valuable to build smart apps and create ROI.
Event Sourcing (I): Capture Much More…
In SQL, only the present state of each data field is preserved. The history of each data point is lost.
When using Event Sourcing, all tables are replaced by a single (potentially very long) list of events. The whole path that has led to the present situation is recorded. Each event can truly capture the intent behind each data change.
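As an illustration of the stock-on-hand questions raised earlier, here is a minimal sketch in Python. The `StockEvent` schema, the `reason` labels and the sample data are assumptions made for this example, not the platform's actual format. It shows how a single list of events can answer "what was on hand at a given time?" and "which corrections were applied?" by replaying the list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class StockEvent:
    """One immutable entry in the event list (hypothetical schema)."""
    timestamp: datetime
    sku: str
    delta: int   # signed change in units on hand
    reason: str  # intent behind the change, e.g. "sale", "correction"


def stock_on_hand(events: List[StockEvent], sku: str, at: datetime) -> int:
    """Replay the events up to `at` to rebuild the stock level at that time."""
    return sum(e.delta for e in events if e.sku == sku and e.timestamp <= at)


def events_with_reason(events: List[StockEvent], sku: str, reason: str) -> List[StockEvent]:
    """List every event recorded for a SKU with a given intent."""
    return [e for e in events if e.sku == sku and e.reason == reason]


# Illustrative data only.
events = [
    StockEvent(datetime(2024, 1, 1), "SKU-1", +100, "delivery"),
    StockEvent(datetime(2024, 1, 2), "SKU-1", -30, "sale"),
    StockEvent(datetime(2024, 1, 3), "SKU-1", -5, "correction"),
    StockEvent(datetime(2024, 1, 4), "SKU-1", -10, "expired"),
]

level_jan_2 = stock_on_hand(events, "SKU-1", datetime(2024, 1, 2))
corrections = events_with_reason(events, "SKU-1", "correction")
expired_units = -sum(e.delta for e in events_with_reason(events, "SKU-1", "expired"))
```

Because nothing is overwritten, the same list also yields the current level: replay it up to the latest timestamp.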
SQL is by design very slow on querying large datasets.
The event sourcing approach consumes the events as they come in, and the result is always up to date. The result is an extremely low latency. See example ‘Loyalty data’ at the end of the presentation.
Scalability & Cost: Storing billions of events is simple, and extremely well-suited for cloud storage. In practice, the monthly storage cost is approximately 1/100th of the cost of SQL.
Storage cost example: 1 TB of data: 100€/month instead of 10,000€/month in SQL storage
… at a fraction of the latency and 1/100th of the price
A relational database trying to address a SQL query in real time has no other option but to sequentially iterate over a massive chunk of data. It simply cannot be made fast.
Cloud Computing is a Must-Have for the Data Platform
Availability and auto-scaling
The cloud offers two properties that are simply invaluable for a Data Platform:
1. Auto-scaling: The infrastructure will adapt (*) to the workload pressure. Performance is always exactly the same (no matter whether it is the first day of the month or the middle of the night).
2. High availability and fault-tolerance: The cloud allocates computing resources and abstracts away hardware failures. No need to worry any more about failing hard drives and the myriad of similar hardware glitches that do happen all the time.
(*) However, auto-scaling works only if the software architecture has been natively designed for the cloud. Achieving this is exactly one of the aspects covered by Lokad.
While the Data Platform is custom software, we strongly suggest adopting a public cloud, as it is an essential ingredient both to massively lower the costs and to massively increase the agility of the project.
Unprecedented TCO
Public clouds (Windows Azure in particular) offer an unprecedented total cost of ownership to access computing resources:
Ballpark: 100€/month per TB storage (Internal initiatives cannot come close to economies of scale of a public cloud)
Automation is a requirement for cost efficiency & scalability
When numbers are read by people, they are very expensive. In retail, any software that produces numbers that are expected to be read by people is fundamentally non-scalable.
The goal of the “smart” apps is to generate operational decisions (e.g. moving a price down) that are fed directly into the transactional systems.
Smart apps that operate over the Data Platform should not be confused with Business Intelligence.
The reliability of smart app outputs is ensured by the Data Platform
ERP systems are likely to expect very reliable data sources. The Data Platform offers a way to collect the results from the smart apps and to expose them. This introduces reliability even when the underlying smart app’s analysis is not reliable.
By doing so, the Data Platform makes those results suitable for production use through the existing transactional systems.
The Data Platform serves as a dedicated abstraction layer that helps retail experts to focus on their core domain instead of IT technicalities.
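One possible way to read the reliability point above is sketched below in Python. The mechanism is an assumption: the slides do not say how the platform achieves it. In this sketch, the platform keeps the last result that passed a sanity check and keeps exposing it when a smart app delivers an invalid output. `ResultStore`, `price_is_sane` and the sample results are hypothetical names and data.

```python
from typing import Callable, Dict, Optional


class ResultStore:
    """Keeps the last accepted output of each smart app (hypothetical design)."""

    def __init__(self, is_valid: Callable[[dict], bool]) -> None:
        self._is_valid = is_valid
        self._last_good: Dict[str, dict] = {}

    def publish(self, app: str, result: dict) -> bool:
        """Accept a new result only if it passes the sanity check."""
        if not self._is_valid(result):
            return False  # rejected: the previous good result stays exposed
        self._last_good[app] = result
        return True

    def read(self, app: str) -> Optional[dict]:
        """What the transactional system sees: always the last good result."""
        return self._last_good.get(app)


def price_is_sane(result: dict) -> bool:
    """Illustrative check: a price must be a positive number."""
    price = result.get("price")
    return isinstance(price, (int, float)) and price > 0


store = ResultStore(price_is_sane)
first_accepted = store.publish("pricing", {"sku": "SKU-1", "price": 9.99})
second_accepted = store.publish("pricing", {"sku": "SKU-1", "price": -1})
exposed = store.read("pricing")
```

Under these assumptions the ERP never sees the negative price: it keeps reading the 9.99 result until the smart app publishes a valid one again.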
The projections are made accessible through an API (application programming interface).
We favor one very specific flavor of API: the REST API.
There are many practical benefits of having APIs:
The technology behind the API (aka the Data Platform itself) can be radically different from the technology powering the smart apps. Retail is vast; “one size fits all” is not a reasonable position for any relatively large retailer.
It creates overall “access” patterns that are much easier to document and much easier to consume as well.
It allows tuning very specific access rules on an as-needed basis, which can be extremely valuable when plugging 3rd party companies into the Data Platform.
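To make the API idea concrete, here is a small Python sketch of a read-only REST endpoint with per-key access rules. Everything specific in it is an assumption for illustration: the `/projections/<name>/<id>` URL layout, the `X-Api-Key` header, the key names and the sample data are not taken from the actual platform.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Tuple

# Hypothetical projection data and access rules, for illustration only.
PROJECTIONS = {
    "loyalty": {"card-42": {"purchases_3y": 17, "avg_basket": 31.5}},
    "inventory": {"SKU-1": {"on_hand": 55}},
}
ACCESS_RULES = {
    "internal-key": {"loyalty", "inventory"},
    "partner-key": {"inventory"},  # a 3rd party sees only what it needs
}


def handle_get(path: str, api_key: str) -> Tuple[int, str]:
    """Resolve GET /projections/<name>/<id> into (HTTP status, JSON body)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) != 3 or parts[0] != "projections":
        return 404, json.dumps({"error": "not found"})
    name, item_id = parts[1], parts[2]
    if name not in ACCESS_RULES.get(api_key, set()):
        return 403, json.dumps({"error": "forbidden"})
    item = PROJECTIONS.get(name, {}).get(item_id)
    if item is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(item)


class ProjectionHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; pass it to http.server.HTTPServer to serve requests."""

    def do_GET(self) -> None:
        status, body = handle_get(self.path, self.headers.get("X-Api-Key", ""))
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

Keeping the routing in a plain function (`handle_get`) separates the access rules from the HTTP plumbing, so a consuming smart app only depends on the URL and the JSON shape, not on the technology behind them.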
Application Example: Loyalty Data Storage
Challenge: Maintaining a projection of all loyalty card holders with half a dozen dimensions such as:
The number of purchases in the last 3 years
The average basket size
Demographics
….
SQL: The above projection is nearly impossible to run as a SQL query. The query has to iterate over every single transaction of the last 3 years, which proves extremely time intensive. (*)
Cloud Data Platform: Retrieving such a “projection” can be done in seconds.
(*) A possible work-around is a SQL table dedicated to the “client” profiles and nightly batches that will update this table with the data of the day. However:
It is convoluted and requires a database expert to devise a “strategy” to deal with the problem.
It creates confusion between tables containing input data and intermediate computations.
It creates data duplication and amplifies the overall scalability problems.
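The loyalty projection can be sketched as a small in-memory aggregate that is updated as each purchase event arrives, so reading a profile never requires rescanning past transactions. The Python below is an illustrative sketch only: `ClientProfile`, `LoyaltyProjection` and their fields are assumed names, and the 3-year rolling window is deliberately left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClientProfile:
    """Per-card aggregates kept up to date as events arrive (hypothetical fields)."""
    purchases: int = 0
    total_spent: float = 0.0

    @property
    def average_basket(self) -> float:
        return self.total_spent / self.purchases if self.purchases else 0.0


class LoyaltyProjection:
    def __init__(self) -> None:
        self.profiles: Dict[str, ClientProfile] = {}

    def apply(self, card_id: str, basket_amount: float) -> None:
        """Fold one purchase event into the projection in constant time."""
        profile = self.profiles.setdefault(card_id, ClientProfile())
        profile.purchases += 1
        profile.total_spent += basket_amount


# Illustrative purchase events: (loyalty card, basket amount in €).
projection = LoyaltyProjection()
for card_id, amount in [("card-42", 20.0), ("card-7", 12.5), ("card-42", 40.0)]:
    projection.apply(card_id, amount)

profile = projection.profiles["card-42"]
```

Each event costs one dictionary lookup and two additions, which is why the projection stays current without a nightly batch. Restricting the counts to the last 3 years would additionally require subtracting events as they age out of the window.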
Application Example: Inventory Tracking
Status Quo: In SQL, only the present state of each data field is preserved. The history of each data point is lost.
The problem: Knowing what was on hand at any point in time is very valuable for many decisions and smart analytics, e.g.:
Correction of electronic inventory records: increased accuracy
Out-of-shelf monitoring: increased accuracy
Inventory optimization: measure ‘true’ availability and performance
SQL: SQL storage does not preserve the history of on-hand inventory levels over time.
Cloud Data Platform: Full history of on-hand levels available. Used for out-of-shelf monitoring, for the automatic correction of records and for tracking inventory performance.
Application Example: Receipt and Coupon Storage
Status Quo: Limitations on storage capacity cause the loss of data history. Limitations on latency make reasonable queries impossible.
Examples:
Receipts can only be recorded for a few months; history is truncated
Querying receipts is time consuming, expensive and limited
Loyalty coupons cannot be accessed or created ‘in real time’
Cloud Data Platform: The platform allows the efficient storage of all data and full history including ‘events’. The data is accessible from all parts of the company, even mobile devices. Low latency makes the data usable for smart apps that provide value to staff and customers.
Application Example: eCommerce ‘Data Hub’
Status Quo: eCommerce managers know the value of data for their business. Massive amounts of data are generated each day. Personalization is the next challenge. However, ambition is far ahead of the status quo.
Storage capacity: large amounts of data generated by web analytics and operations
Latency: ‘near real time’ access often required
Accessibility
‘Smart’ exploration (smart applications)
Cloud Data Platform: The platform allows the efficient storage of all data and full history including ‘events’ from all relevant sources. The data is accessible from all parts of the company and all applications. Low latency provides ‘real time’ capabilities. Smart applications on top increase profitability. Examples:
More relevant product suggestions/more relevant personalization of the webshop
The best price for each product at any point in time
Required Investment < 100k€
50k€ for a scoping mission
Scope the usages that would benefit from the Data Platform.
Clarify the vision about the data, how it should be collected, structured and exposed.
Set up the proper collaborative tools, development tools and processes to carry on with this Data Platform initiative.
If required: Help hire the 1 or 2 developers that will be needed internally to run the Data Platform.
20k€ for drafting a prototype Data Platform
Goal: Set up a minimal project that could be extended to your technical teams, with the architecture and design patterns in place to kickstart the project.
Plugging in two identified data sources.
Setting up an event storage over Windows Azure.
Setting up sample projections.
Setting up sample APIs.
20k€ for drafting a prototype smart app
Goal: Illustrate that ROI can be generated through the Data Platform.
Devise a statistical analysis (prototype) to address an existing problem for the retailer.
Lokad can help you roll out your own BIG DATA PLATFORM, and plug in and/or build smart apps that generate direct ROI.
Head Office 10 rue P. de Champaigne 75013, Paris France