(1)

Big Data and Big Data Modeling

Robin Bloor

The Bloor Group

March 19, 2015 TP02

The Age of Disruption

(2)

Presenter Bio

Robin Bloor, Ph.D.

Robin Bloor is Chief Analyst at The Bloor Group. He has been an industry analyst and commentator on technology for 25 years, with expertise in software development, database, BI and associated technologies. He is a frequent keynote speaker at industry events and primary author of The Bloor Group’s research reports.

(3)

Big Data and Big Data Modeling – The Age of Disruption

• The Data Curve and the Data Warehouse

• Disruption, Disruption, Disruption

• A New Modeling Dynamic

(4)

The Data Curve

(5)

The Visible Big Data Trend

Corporate data volumes grow exponentially, at about 55% per annum

Data has been growing at this rate for, maybe, 40 years

There is nothing new about big data. It clings to an established exponential trend

(It may be speeding up)
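To make the 55% figure concrete, here is a minimal sketch of what constant compound growth implies. It assumes the rate stays fixed at the slide's 55% per annum; the function names are illustrative and not from the presentation.

```python
import math

ANNUAL_GROWTH = 0.55  # the slide's figure: about 55% per annum (assumed constant)


def growth_factor(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Multiple by which data volume grows after `years` of compounding."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for the volume to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


print(f"Doubling time: {doubling_time():.2f} years")
for years in (1, 5, 10):
    print(f"After {years:>2} years: x{growth_factor(years):,.1f}")
```

Under this assumption the volume doubles roughly every year and a half, which is why a steady percentage rate still feels like an explosion.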

(6)

Technology Evolution (The Way We Were – Bloor Curve)

(7)

And This Implies…

Software architectures change: centralized, client/server, 3-tier/web, service-oriented architecture, etc.

Applications migrate according to latencies.

Dominant applications and software brands can die via “the innovator’s dilemma.”

Wholly new applications appear because of lower latencies, e.g., virtual machines and complex event processing (CEP).

(8)

The Invisible Data Trend: Moore’s Law Cubed

The biggest databases are new databases

They grow at the cube of Moore’s Law

Moore’s Law = 10x every 6 years

VLDB: 1000x every 6 years

1991/2: megabytes

1997/8: gigabytes

2003/4: terabytes

2009/10: petabytes

2015/16: exabytes
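The arithmetic behind "Moore's Law cubed" can be checked with a short sketch. It takes the slide's figures at face value (10x every 6 years for Moore's Law, 1000x every 6 years for the largest databases) and converts them to yearly multipliers; the helper names and the year-to-unit lookup are illustrative assumptions.

```python
MOORE_FACTOR = 10.0               # slide: 10x every 6 years
PERIOD_YEARS = 6.0
VLDB_FACTOR = MOORE_FACTOR ** 3   # "Moore's Law cubed": 1000x every 6 years

# Size units from the slide, one step per 6-year period starting in 1991/2.
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes", "exabytes"]


def annual_rate(factor: float, period_years: float = PERIOD_YEARS) -> float:
    """Equivalent constant yearly multiplier for `factor` growth per period."""
    return factor ** (1.0 / period_years)


def vldb_scale(year: int, base_year: int = 1991) -> str:
    """Unit of the largest databases in `year`, following the slide's table."""
    steps = int((year - base_year) // PERIOD_YEARS)
    steps = max(0, min(steps, len(UNITS) - 1))  # clamp outside the table
    return UNITS[steps]


print(f"Moore's Law per year: x{annual_rate(MOORE_FACTOR):.2f}")
print(f"VLDB growth per year: x{annual_rate(VLDB_FACTOR):.2f}")
print(f"Largest databases in 2009: {vldb_scale(2009)}")
```

On these figures, the largest databases more than triple each year, well above the 55% per annum quoted for corporate data as a whole.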

(9)

The Genesis of Hadoop

• The old databases were having scaling problems.

• New databases appeared, but so did Hadoop.

• The number of data sources was exploding.

• Hadoop quickly became the staging area for these databases, even though it was immature.

(10)

The Evolution of Hadoop

From:

• Serial batch workloads

• MapReduce

• Versatile data storage

• Key-value access only

• An island of processing

To:

• Multiple concurrent workloads

• Multiple algorithms

• “Optimized” data storage

• SQL, JSON and even SPARQL access

• Integrated processing
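For readers unfamiliar with the "From" column, the following is a minimal single-process sketch of the MapReduce pattern with key-value pairs. It is not Hadoop code and uses no Hadoop API; the word-count task and all function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit one (key, value) pair per word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases one after another, as a serial batch job."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


print(word_count(["big data", "Big processing"]))
```

Each phase must finish before the next starts and everything is addressed by key, which matches the "serial batch" and "key-value access only" entries in the list above.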

(11)

The Data Warehouse: From/To

© Bloor Group

(12)

The Staging Workload

(13)

Disruption, Disruption, Disruption

(14)

Disruption in Several Dimensions

1. At the hardware layer

2. In software architecture

3. In the data layer

(15)

Parallelism: The Imp is Out of the Bottle

Multicore chips enabled parallelism

It has changed the whole performance equation

It enabled Big Data

Big Data is really Big Processing

(16)

Technology Revolutions

Tech Revolution: Architecture

• Computer: Batch

• Online: Centralized

• PC: Client/server

• Internet: Multi-tier

• Mobile: Service orientation

• Internet of Things (IoT): Event driven/big data/parallel/distributed

(17)

Unprecedented Acceleration

• Moore’s Law regularly delivered a speed-up of 10x every 6 years

• Implication: apps get faster every 6 years or so

• Parallelism delivers an almost unlimited speed-up, assuming you can build the application with a scalable architecture

• Implications: see later
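The caveat about a scalable architecture can be illustrated with Amdahl's law, which the slide does not name but which is the standard way to express the limit. In this sketch, `parallel_fraction` is the assumed share of the work that can run in parallel; the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speed-up predicted by Amdahl's law for a given number of workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for fraction in (0.50, 0.95, 1.00):
    speedup = amdahl_speedup(fraction, workers=1024)
    print(f"{fraction:.0%} parallel on 1024 workers: x{speedup:.1f}")
```

Only a fully parallel design scales with the number of workers. Even a small serial portion caps the gain, which is why the speed-up is "almost unlimited" only for applications built to scale out.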

(18)

Hardware Disruption: It’s Over for Spinning Disk

• Solid state drives are now on the Moore’s Law curve

• Disk is not and never was (with respect to seek time)

• All traditional databases were engineered for spinning disk and not for scale-out

• This explains the new database management system (DBMS) products…

(19)

Hardware: In-Memory Disruption

• Memory may gradually become the primary store for data (this impacts data flows)

• Almost all applications are poorly built for this

• Memory is an accelerator, as is CPU cache. This is becoming a factor

(20)

Hardware: The Memory Cascade

• On-chip speed vs. RAM

L1 (32 KB) = 100x

L2 (256 KB) = 30x

L3 (8-20 MB) = 8.6x

• RAM vs. SSD

RAM = 300x

• SSD vs. disk

SSD = 10x

Note: Vector instructions and data compression
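As a rough sketch, the ratios above can be chained to estimate how much faster each tier is than spinning disk. This assumes the slide's ratios simply multiply, which is a simplification; the constant and function names are illustrative.

```python
from typing import Dict

# Speed ratios quoted on the slide, each relative to the next slower tier.
CACHE_VS_RAM = {"L1": 100.0, "L2": 30.0, "L3": 8.6}
RAM_VS_SSD = 300.0
SSD_VS_DISK = 10.0


def speedup_vs_disk() -> Dict[str, float]:
    """Approximate speed of each tier relative to disk (assumes ratios multiply)."""
    ram_vs_disk = RAM_VS_SSD * SSD_VS_DISK
    tiers = {"SSD": SSD_VS_DISK, "RAM": ram_vs_disk}
    for name, ratio in CACHE_VS_RAM.items():
        tiers[name] = ratio * ram_vs_disk
    return tiers


for tier, factor in sorted(speedup_vs_disk().items(), key=lambda kv: kv[1]):
    print(f"{tier:>3}: about x{factor:,.0f} faster than disk")
```

On these figures RAM is about 3,000 times faster than disk and L1 cache about 300,000 times faster, which is the scale of difference behind the in-memory argument.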

(21)

Hardware: Putting a SoC in IT

• It’s possible that the CPU-memory split will vanish (soon)

• This requires the emergence of the commodity System on a Chip (SoC)

• There are already Systems on a Chip that run Linux

• Grids of Systems on a Chip could replace grids of servers

Graphic from Samsung Electronics

(22)

Data Disruption – The Barriers are Down

Internal:

• Server log files

• Network log files

• Unstructured sources

• Data streams

• Web data

External:

• Mobile data

• Social media data

• Internet of things

• Web scavenging

• Data markets

• External streams

(23)

Data Flow – A Set of Principles

The data layer is one logical collection of data, both external and internal

The data flows, from ingest through a refining process to a point of application

It is best if data doesn’t flow much

Hadoop means corporate data staging

Beyond that, a database is required to manage workloads
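The flow from ingest through refinement to a point of application can be sketched as a chain of generators. This is a minimal illustration only: the record layout, the `name,amount` input format and all function names are assumptions, not part of the presentation.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def ingest(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Stage 1: capture raw input unchanged, tagged with where it landed."""
    for line in raw_lines:
        yield {"source": "staging", "raw": line}


def refine(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 2: parse 'name,amount' records and drop those that do not parse."""
    for record in records:
        parts = record["raw"].strip().split(",")
        if len(parts) != 2:
            continue
        try:
            amount = float(parts[1])
        except ValueError:
            continue
        yield {**record, "name": parts[0].strip(), "amount": amount}


def apply_totals(records: Iterable[Record]) -> Dict[str, float]:
    """Stage 3: the point of application, here a total per name."""
    totals: Dict[str, float] = {}
    for record in records:
        totals[record["name"]] = totals.get(record["name"], 0.0) + record["amount"]
    return totals


print(apply_totals(refine(ingest(["a,1", "b,2.5", "bad", "a,3"]))))
```

Because each stage is a generator, records pass through one at a time without intermediate copies, in the spirit of "it is best if data doesn't flow much".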

(24)

The Corporate Data Flows

There need to be two data flows (at minimum)

Currently we can distinguish between:

Real-time/business time applications

Analytical applications

We will build specific architectures for this

(25)

A New Modeling Dynamic

(26)

The Staging Workloads

Data mapping/modeling

Metadata discovery

Metadata management

Master data management

Data lineage and lifecycle

(27)

The New World #1 …

The primary driver of the new world is that external data sources have expanded

Data is being captured without metadata knowledge or even relationship knowledge

Unstructured/semi-structured data is prevalent – even normal

The provenance of data has become an issue

The new dimensions: geography and time

(28)

The New World #2 …

The single source of truth idea is “dead.”

MDM will become about ontologies

Modeling will not die or even diminish – but we will explicitly model for context

Data flows will be modeled

There will be a metadata warehouse

There will be event to entity models

We will record data lineage

We may need to model data lifecycles
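One way to picture "we will record data lineage" is a log that stores, for each dataset, what it was derived from and how. The sketch below is hypothetical: the class names, fields and example dataset names are assumptions made for illustration and do not come from the presentation or any product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LineageStep:
    dataset: str                 # name of the dataset produced
    derived_from: Optional[str]  # upstream dataset, or None for an original source
    operation: str               # how it was produced


class LineageLog:
    """Records one producing step per dataset and traces it back to its source."""

    def __init__(self) -> None:
        self._steps: Dict[str, LineageStep] = {}

    def record(self, dataset: str, derived_from: Optional[str], operation: str) -> None:
        self._steps[dataset] = LineageStep(dataset, derived_from, operation)

    def trace(self, dataset: str) -> List[str]:
        """Walk upstream from `dataset`, stopping at a source or a cycle."""
        path: List[str] = []
        seen = set()
        current: Optional[str] = dataset
        while current is not None and current in self._steps and current not in seen:
            seen.add(current)
            step = self._steps[current]
            path.append(f"{step.dataset} <- {step.operation}")
            current = step.derived_from
        return path


log = LineageLog()
log.record("raw_clicks", None, "ingest from web stream")
log.record("sessions", "raw_clicks", "sessionize")
print(log.trace("sessions"))
```

A real lineage store would also need timestamps and support for multiple upstream sources; this version keeps a single parent per dataset to stay short.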

(29)

In Summary

Big Data and Big Data Modeling – The Age of Disruption

• The Data Curve and the Data Warehouse

• Disruption, Disruption, Disruption

• A New Modeling Dynamic

(30)

Thank You for Attending!

For any further questions, feel free to contact me following ERworld.

Robin Bloor

email: [email protected] twitter: @robinbloor

www.insideanalysis.com

Please enjoy the rest of your time at ERworld 2015!

(31)

Legal Notice

© Copyright CA 2015. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies. No unauthorized use, copying or distribution permitted.

THIS PRESENTATION IS FOR YOUR INFORMATIONAL PURPOSES ONLY. CA assumes no responsibility for the accuracy or completeness of the information. TO THE EXTENT PERMITTED BY APPLICABLE LAW, CA PROVIDES THIS DOCUMENT “AS IS” WITHOUT WARRANTY OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT. In no event will CA be liable for any loss or damage, direct or indirect, in connection with this presentation, including, without limitation, lost profits, lost investment, business interruption, goodwill, or lost data, even if CA is expressly advised of the possibility of such damages.
