Big Data and Big Data Modeling
Robin Bloor
The Bloor Group
March 19, 2015 TP02
The Age of Disruption
Presenter Bio
Robin Bloor, Ph.D.
Robin Bloor is Chief Analyst at The Bloor Group. He has been an industry analyst and commentator on technology for 25 years, with expertise in software development, database, BI and associated technologies. He is a frequent keynote speaker at industry events and primary author of The Bloor Group’s research reports.
Big Data and Big Data Modeling – The Age of Disruption
The Data Curve and the Data Warehouse
Disruption, Disruption, Disruption
A New Modeling Dynamic
The Data Curve
Corporate data volumes grow exponentially, at about 55% per annum
Data has been growing at this rate for, maybe, 40 years
There is nothing new about big data. It clings to an established exponential trend
(It may be speeding up)
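To make the 55% figure concrete, the compounding can be worked through directly. This is a minimal sketch: the 55% rate is the slide's estimate, and the function name and the chosen horizons are illustrative assumptions. At that rate, volume grows roughly 14x over six years.

```python
# Compound growth at the slide's estimated rate of about 55% per annum.
ANNUAL_GROWTH = 0.55  # assumption taken from the slide, not a measured value


def growth_factor(years: int, rate: float = ANNUAL_GROWTH) -> float:
    """Cumulative multiplier after `years` of compound growth at `rate`."""
    return (1.0 + rate) ** years


if __name__ == "__main__":
    # Illustrative horizons: 1 year, one 6-year cycle, a decade, 40 years.
    for years in (1, 6, 10, 40):
        print(f"{years:>2} years: x{growth_factor(years):,.1f}")
```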
The Visible Big Data Trend
Technology Evolution (The Way We Were – Bloor Curve)
Software architectures change: centralized, client/server, 3 tier/web, service-oriented architecture, etc.
Applications migrate according to latencies.
Dominant applications and software brands can die via “the innovator’s dilemma.”
Wholly new applications appear because of lower latencies, e.g., virtual machines and complex event processing (CEP).
And This Implies…
The biggest databases are new databases
They grow at the cube of Moore’s Law
Moore’s Law = 10x every 6 years
VLDB: 1000x every 6 years
– 1991/2 megabytes
– 1997/8 gigabytes
– 2003/4 terabytes
– 2009/10 petabytes
– 2015/16 exabytes
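The arithmetic behind "the cube of Moore's Law" can be sketched as follows. The 10x-per-6-years figure and the unit progression come from the slide; the function and constant names are illustrative assumptions. Cubing 10x gives 1000x per six-year period, which is one step up the megabyte-to-exabyte ladder each period, or a little over 3x per year.

```python
# Moore's Law cubed, using the slide's figures.
MOORE_FACTOR = 10   # slide: Moore's Law = 10x every 6 years
PERIOD_YEARS = 6
START_YEAR = 1991   # slide: 1991/2 is the megabyte era

VLDB_FACTOR = MOORE_FACTOR ** 3                     # 1000x per period
ANNUAL_FACTOR = VLDB_FACTOR ** (1 / PERIOD_YEARS)   # a little over 3x per year

UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes", "exabytes"]


def vldb_scale(year: int) -> str:
    """Unit of the largest databases in `year`, per the slide's progression."""
    steps = (year - START_YEAR) // PERIOD_YEARS
    if year < START_YEAR or steps >= len(UNITS):
        raise ValueError("year is outside the range covered by the slide")
    return UNITS[steps]


if __name__ == "__main__":
    print(f"Per period: {VLDB_FACTOR}x, per year: {ANNUAL_FACTOR:.2f}x")
    for year in range(START_YEAR, START_YEAR + PERIOD_YEARS * len(UNITS), PERIOD_YEARS):
        print(year, vldb_scale(year))
```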
The Invisible Data Trend: Moore’s Law Cubed
The Genesis of Hadoop
The old databases were having scaling problems.
New databases appeared, but so did Hadoop.
The number of data sources was exploding.
Hadoop quickly became the staging area for these databases, even though it was immature.
The Evolution of Hadoop
From -> To
Serial batch workloads -> Multiple concurrent workloads
MapReduce -> Multiple algorithms
Versatile data storage -> “Optimized” data storage
Key-value access only -> SQL, JSON and even SPARQL access
An island of processing -> Integrated processing
The Data Warehouse: From/To
© Bloor Group
The Staging Workload
Disruption, Disruption, Disruption
Disruption in Several Dimensions
1. At the hardware layer
2. In software architecture
3. In the data layer
Parallelism: The Imp is Out of the Bottle
Multicore chips enabled parallelism
It has changed the whole performance equation
It enabled Big Data
Big Data is really Big Processing
Technology Revolutions
Tech Revolution: Architecture
Computer: Batch
Online: Centralized
PC: Client/server
Internet: Multi-tier
Mobile: Service orientation
Internet of Things (IoT): Event driven/big data/parallel/distributed
Unprecedented Acceleration
Moore’s Law regularly delivered a speed-up of 10x every 6 years
Implication: apps get faster every 6 years or so
Parallelism delivers an almost unlimited speed-up, assuming you can build the application with a scalable architecture
Implications: see later
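The caveat about a scalable architecture can be illustrated with Amdahl's law, which is not named in the slides and is brought in here only as a standard way to reason about it. The function name and the sample fractions are illustrative assumptions. The sketch shows that the serial part of an application caps the speed-up, however many cores are added.

```python
# Amdahl's law: speed-up from `workers` parallel units when only a fraction
# of the work can be parallelized. Not from the slides; used here to show
# why "almost unlimited" depends on a scalable architecture.


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speed-up for a workload with the given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions: a fully parallel workload scales linearly,
    # while a 5% serial portion limits the speed-up to below 20x.
    for fraction in (0.50, 0.95, 1.00):
        for workers in (8, 64, 1024):
            speedup = amdahl_speedup(fraction, workers)
            print(f"parallel={fraction:.2f} workers={workers:>4}: {speedup:,.1f}x")
```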
Hardware Disruption: It’s Over for Spinning Disk
Solid state drives are now on the Moore’s Law curve
Disk is not and never was (in respect to seek time)
All traditional databases were engineered for spinning disk and not for scale-out
This explains the new database management system (DBMS) products…
Hardware: In-Memory Disruption
Memory may gradually become the primary store for data (this impacts data flows)
Almost all applications are poorly built for this
Memory is an accelerator – as is CPU cache. This is becoming a factor
Hardware: The Memory Cascade
On-chip speed vs. RAM
L1 (32K) = 100x
L2 (256K) = 30x
L3 (8-20MB) = 8.6x
RAM vs. SSD
RAM = 300x
SSD vs. disk
SSD = 10x
Note: Vector instructions and data compression
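Chaining the ratios in the cascade gives each tier's speed relative to spinning disk. This is a rough sketch that simply multiplies the slide's figures, which are approximate; the names are illustrative assumptions. On these numbers RAM is about 3,000x faster than disk and L1 cache about 300,000x.

```python
# The memory cascade, expressed relative to spinning disk by chaining the
# slide's ratios. All figures are the slide's approximations.
SSD_VS_DISK = 10
RAM_VS_SSD = 300
CACHE_VS_RAM = {"L1": 100, "L2": 30, "L3": 8.6}


def speed_vs_disk() -> dict[str, float]:
    """Approximate speed multiple of each tier relative to spinning disk."""
    ram_vs_disk = RAM_VS_SSD * SSD_VS_DISK
    tiers = {"Disk": 1.0, "SSD": float(SSD_VS_DISK), "RAM": float(ram_vs_disk)}
    for cache, ratio in CACHE_VS_RAM.items():
        tiers[cache] = ratio * ram_vs_disk
    return tiers


if __name__ == "__main__":
    # Print the tiers from slowest to fastest.
    for tier, multiple in sorted(speed_vs_disk().items(), key=lambda kv: kv[1]):
        print(f"{tier:<4} ~{multiple:,.0f}x disk")
```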
Hardware: Putting a SoC in IT
It’s possible that the CPU-memory split will vanish (soon)
This requires the emergence of the commodity System on a Chip (SoC)
There are already Systems on a Chip that run Linux
Grids of Systems on a Chip could replace grids of servers
Graphic from Samsung Electronics
Data Disruption – The Barriers are Down
Internal:
Server log files
Network log files
Unstructured sources
Data streams
Web data
External:
Mobile data
Social media data
Internet of things
Web scavenging
Data markets
External streams
Data Flow – A Set of Principles
The data layer is one logical collection of data, both external and internal
The data flows, from ingest through a refining process to a point of application
It is best if data doesn’t flow much
Hadoop means corporate data staging
Beyond that a database is required to manage workloads
The Corporate Data Flows
There need to be two data flows (at minimum)
Currently we can distinguish between:
• Real-time/business time applications
• Analytical applications
We will build specific architectures for each
A New Modeling Dynamic
Data mapping/modeling
Metadata discovery
Metadata management
Master data management
Data lineage and lifecycle
The Staging Workloads
The New World #1 …
The primary driver of the new world is that external data sources have expanded
Data is being captured without metadata knowledge or even relationship knowledge
Unstructured/semi-structured data is prevalent – even normal
The provenance of data has become an issue
The new dimensions: geography and time
The New World #2 …
The single source of truth idea is “dead.”
MDM will become about ontologies
Modeling will not die or even diminish – but we will explicitly model for context
Data flows will be modeled
There will be a metadata warehouse
There will be event to entity models
We will record data lineage
We may need to model data lifecycles
Big Data and Big Data Modeling – The Age of Disruption
The Data Curve and the Data Warehouse
Disruption, Disruption, Disruption
A New Modeling Dynamic
In Summary
Thank You for Attending!
For any further questions, feel free to contact me following ERworld.
Robin Bloor
email: [email protected] twitter: @robinbloor
www.insideanalysis.com
Please enjoy the rest of your time at ERworld 2015!
Legal Notice
© Copyright CA 2015. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. No unauthorized use, copying or distribution permitted.
THIS PRESENTATION IS FOR YOUR INFORMATIONAL PURPOSES ONLY. CA assumes no responsibility for the accuracy or completeness of the information. TO THE EXTENT PERMITTED BY APPLICABLE LAW, CA PROVIDES THIS DOCUMENT “AS IS” WITHOUT WARRANTY OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT. In no event will CA be liable for any loss or damage, direct or indirect, in connection with this presentation, including, without limitation, lost profits, lost investment, business interruption, goodwill, or lost data, even if CA is expressly advised of the possibility of such damages.