Big Data: Big IT Party?


Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission of the copyright owners.

by

Rick F. van der Lans

R20/Consultancy BV

Twitter @rick_vanderlans

www.r20.nl

Big Data:

Big IT Party?


Rick F. van der Lans

Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, database technology, and data virtualization. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which data warehousing and integration technology were applied.

Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty-five years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches. He is the author of several books on computing, including his new Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. The popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL Developers.

As author for BeyeNetwork.com, writer of whitepapers, chairman of the annual European Enterprise Data and Business Intelligence Conference, and columnist for a few IT magazines, he has close contacts with many vendors.

R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via:

Email: [email protected]

Twitter: @Rick_vanderlans

LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223


WikiBon February 2014

Source: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017


Gartner: Big Data Market Forecast

Big data will drive $232 billion in spending through 2016. It will directly or indirectly drive $96 billion of worldwide IT spending in 2012, and is forecast to drive $120 billion of IT spending in 2013.


McKinsey Global Institute: Benefits of Big Data

Big Data has the potential …

to increase the value of the US health care industry by $300 billion

to increase the industry value of Europe’s public sector administration by EUR 250 billion

to decrease manufacturing (development and assembly) costs by 50%

to increase service provider revenue by $100 billion due to global personal location data

to increase US retail’s net margin by 60%


Big Data Exaggerations

Big data: A revolution that will transform how we live, work and think

Companies are being destroyed and created around big data, …

Management of big data: Key to … survival in the health care sector

Big data has arrived and is shaping IT today

The disruptive power of big data


Analytical Challenges of Tomorrow

Improve product development

Optimize business processes

Improve customer care

Improve customer delight

Improve pro-active customer care

Personalize products


External Data: UK-based Retail Company

A 10-degree rise in temperature means 300% more barbecue meat, 45% more lettuce, and 50% more coleslaw

A city-center store will see an uplift in sandwiches (to eat outside) on a warm weekday, and almost no effect at all on a warm weekend

Result: 6 million UK pounds less food wastage in the summer, 50 million less stock in warehouses


Privacy?


Databases are Boring!

Source: The 451 Group

SQL is Intergalactic DataSpeak! Or was?


Scale Up – Scale Out

Scale up (vertical scaling) means adding more resources to one node in a system

Scale out (horizontal scaling) means adding more nodes to a system

Continuous availability/redundancy

Cost/performance flexibility

Contiguous upgrades

Geographical distribution


Operations of a Query

Analytical functions

Recursive operations

Joins

Having filters

Group by

Complex scalar functions

Projections and simple transformations

Filters - selections

WITH FLIGHTPLAN(FLIGHTNO, PLAN_AIRPORTS, PLAN_FLIGHTS, START_AIRPORT, END_AIRPORT,
                START_TIME, END_TIME, DEPARTURE_AIRPORT, ARRIVAL_AIRPORT,
                DEPARTURE_TIME, ARRIVAL_TIME, PRICE, STOPS) AS
  (SELECT FLIGHTNO,
          CAST(DEPARTURE_AIRPORT || '->' || ARRIVAL_AIRPORT AS VARCHAR(100)),
          CAST(RTRIM(CHAR(FLIGHTNO)) AS VARCHAR(100)),
          DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
          DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
          PRICE, 0
   FROM   FLIGHTS
   WHERE  DEPARTURE_AIRPORT = 'AMS'
   AND    CAST(DEPARTURE_TIME AS DATE) = '2007-03-01'
   UNION ALL
   SELECT P.FLIGHTNO,
          P.PLAN_AIRPORTS || '->' || F.ARRIVAL_AIRPORT,
          P.PLAN_FLIGHTS || '->' || RTRIM(CHAR(F.FLIGHTNO)),
          P.START_AIRPORT, F.ARRIVAL_AIRPORT, P.START_TIME, F.ARRIVAL_TIME,
          P.DEPARTURE_AIRPORT, P.ARRIVAL_AIRPORT, P.DEPARTURE_TIME, P.ARRIVAL_TIME,
          P.PRICE + F.PRICE, STOPS + 1
   FROM   FLIGHTPLAN AS P, FLIGHTS AS F
   WHERE  P.ARRIVAL_AIRPORT = F.DEPARTURE_AIRPORT
   AND    P.ARRIVAL_TIME < F.DEPARTURE_TIME
   AND    F.DEPARTURE_AIRPORT <> 'PHX'
   AND    LOCATE(F.ARRIVAL_AIRPORT, P.PLAN_AIRPORTS) = 0
   AND    STOPS < 1
   AND    P.ARRIVAL_TIME + 4 HOURS > F.DEPARTURE_TIME)
SELECT PLAN_AIRPORTS, PLAN_FLIGHTS, START_AIRPORT, END_AIRPORT,
       START_TIME, END_TIME, PRICE
FROM   FLIGHTPLAN
WHERE  END_AIRPORT = 'PHX'
ORDER BY PRICE ASC
FETCH FIRST 1 ROW ONLY
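To make the recursive part of such a query concrete, here is a minimal sketch in Python using SQLite's WITH RECURSIVE. It is not the statement from the slide: the table layout, the `cheapest_route` helper, the `SAMPLE_FLIGHTS` rows, and the use of integer hours instead of timestamps are all assumptions made for illustration, and the four-hour connection window is left out.

```python
import sqlite3

# Invented sample data: (flightno, dep, arr, dep_hour, arr_hour, price)
SAMPLE_FLIGHTS = [
    (1, "AMS", "PHX", 9, 20, 900.0),
    (2, "AMS", "JFK", 8, 11, 300.0),
    (3, "JFK", "PHX", 13, 18, 250.0),
    (4, "JFK", "PHX", 10, 15, 100.0),  # leaves before flight 2 lands
]

def cheapest_route(flights, origin, dest, max_stops=1):
    """Return (route, price) of the cheapest plan from origin to dest, or None."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE flights (flightno INTEGER, dep TEXT, arr TEXT, "
        "dep_hour INTEGER, arr_hour INTEGER, price REAL)"
    )
    con.executemany("INSERT INTO flights VALUES (?, ?, ?, ?, ?, ?)", flights)
    row = con.execute(
        """
        WITH RECURSIVE flightplan(route, end_airport, end_hour, price, stops) AS (
            -- anchor part: all direct flights leaving the origin
            SELECT dep || '->' || arr, arr, arr_hour, price, 0
            FROM flights
            WHERE dep = :origin
            UNION ALL
            -- recursive part: extend an existing plan with a connecting flight
            SELECT p.route || '->' || f.arr, f.arr, f.arr_hour,
                   p.price + f.price, p.stops + 1
            FROM flightplan AS p
            JOIN flights AS f ON p.end_airport = f.dep
            WHERE p.end_hour < f.dep_hour          -- connection must be catchable
              AND p.end_airport <> :dest           -- do not fly on past the target
              AND instr(p.route, f.arr) = 0        -- never revisit an airport
              AND p.stops < :max_stops
        )
        SELECT route, price FROM flightplan
        WHERE end_airport = :dest
        ORDER BY price ASC
        LIMIT 1
        """,
        {"origin": origin, "dest": dest, "max_stops": max_stops},
    ).fetchone()
    con.close()
    return row
```

With the sample rows, `cheapest_route(SAMPLE_FLIGHTS, "AMS", "PHX")` picks the one-stop plan via JFK, because the cheaper connecting flight 4 departs before flight 2 has arrived and is filtered out.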

Parallel Database Architecture

[Figure: Application and Database server; within the server, a Master and Worker 1, Worker 2, and Worker 3, together with the stack of query operations from filters and selections up to analytical functions.]
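One way to picture a master with several workers is partial aggregation: each worker filters and groups its own partition, and the master merges the partial results. The sketch below assumes that division of labor; the function names and the (key, amount) row shape are hypothetical, and threads stand in for separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker_group_by(partition, min_amount):
    """Worker step: filter the local rows, then sum amounts per key."""
    partial = Counter()
    for key, amount in partition:
        if amount >= min_amount:      # filter (selection)
            partial[key] += amount    # local group by
    return partial

def master_query(partitions, min_amount=0):
    """Master step: send the query to every worker and merge the partial sums."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        partials = pool.map(lambda p: worker_group_by(p, min_amount), partitions)
        for partial in partials:
            total.update(partial)     # Counter.update adds values per key
    return dict(total)
```

The filter and the group by run next to the data; only the small per-key sums travel to the master, which is where a single merge step can become the bottleneck as partitions are added.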

Effect of Partitions on Query Response

[Figure: total throughput plotted against the number of partitions/processors, with a bottleneck marked.]


Internal Database Server “Administration”

Source: VoltDB / Michael Stonebraker

NewSQL


The Market of Hadoop/NoSQL Products

Categories of Database Servers

all database servers
    SQL database servers
        Classic SQL database servers
        Analytical SQL database servers
        NewSQL database servers
    NoSQL database servers
        Key-value stores
        Document stores
        Column-family stores
        Graph database servers


Strong Consistency vs. Eventual Consistency

Strong

Eventual


SQL DBMS versus NoSQL Solution

[Figure: an application with a SQL database server, next to an application with a NoSQL solution.]


Hadoop 2.0


Examples of Complex Values (1)

Comma-separated value

"Anchorage Daily News","PO Box 149001","Anchorage","AK","99514-9001",
"907-257-4200","907-258-2157","71","","82",
"http://www.adn.com/",[email protected]

EDIFACT message

UNB+UNOA:1+005435656:1+006415160:1+060515:1434+00000000000778'XXXUNH+
00000000000117+INVOIC:D:97B:UN'XXXBGM+380+342459+9'XXXDTM+
3:20060515:102'XXXRFF+ON:521052'XXXNAD+BY+792820524::16++
CUMMINSMIDRANGEENGINEPLANT'XXXNAD+SE+005435656::16++
GENERALWIDGETCOMPANY'XXXCUX+1:USD'XXXLIN+1++157870:IN'XXXIMD+
F++:::WIDGET'XXXQTY+47:1020:EA'XXXALI+US'XXXMOA+203:1202.58'XXXPRI+
INV:1.179'XXXLIN+2++157871:IN'XXXIMD+F++:::DIFFERENTWIDGET'XXXQTY+
47:20:EA'XXXALI+JP'XXXMOA+203:410'XXXPRI+INV:20.5'XXXUNS+S'XXXMOA+
39:2137.58'XXXALC+C+ABG'XXXMOA+8:525'XXXUNT+23+00000000000117'XXXUNZ+
1+00000000000778'

Example of Complex Value (2)

Weblog record

datestamp ip request
6/1/2012 11:10:19 AM 107.1.187.170 GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticker=DDD HTTP/1.1
6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player/vendor/Chef%20Tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
6/1/2012 8:55:54 AM 107.34.51.63 GET /tv/3/search/content/The%20Andy%20Griffith%20Show/s/The%20Andy%20Griffith%20Show HTTP/1.1
6/1/2012 3:12:43 PM 107.5.115.117 GET /tv/3/search/content/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe/s/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1
6/1/2012 4:48:35 PM 108.225.132.245 GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest%20Catch HTTP/1.1
6/1/2012 10:25:12 AM 108.246.20.125 GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1
6/1/2012 1:58:14 AM 108.246.25.117 GET /tv/3/player/vendor/Chef%20Tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1

Unraveling the Data Model

1. Unravel & store in a classic database, then query

2. Store in a classic database, then query & unravel

3. Store in a MapReduce database, then query & unravel


Schema-On-Write

SoW = Data written to a database has a schema

A schema is not optional

Fixed schema-on-write

All records in a table have the same schema

For example, SQL systems

Variable schema-on-write

When data is stored in the database, a schema is written together with the data itself

Different records in a table can have different schemas
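The difference between the two flavors of schema-on-write can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for a SQL system with a fixed schema, a list of JSON strings stands in for a store with a variable schema, and the `customer` table and all helper names are hypothetical.

```python
import json
import sqlite3

# Fixed schema-on-write: the table definition fixes the columns of every row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT NOT NULL)")

def insert_fixed(row):
    """Return True if the row fits the table's schema, False if it is rejected."""
    try:
        con.execute("INSERT INTO customer VALUES (?, ?)", row)
        return True
    except sqlite3.Error:      # wrong number of values, NULL in a NOT NULL column, ...
        return False

# Variable schema-on-write: each record is stored together with its own schema
# (here simply the field names inside a JSON document).
store = []

def insert_variable(record):
    """Store the record; its field names are written along with the values."""
    store.append(json.dumps(record))
    return len(store) - 1      # position of the stored record

def fields_of(position):
    """Read back the schema that was written with one record."""
    return sorted(json.loads(store[position]).keys())
```

In the fixed case a row with an extra value is refused at write time; in the variable case two records in the same store can carry different field lists, and each list can be recovered from the record itself.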


Schema-On-Read

SoR = Data is assigned a schema when it is read from a database

Stored data has no schema

Complex values or schema-less values

Schema-on-application-read

The application assigns a schema to the schema-less data (unraveling)

Schema-on-database-read

The database server assigns a schema to the schema-less data

The application receives data with a schema
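As a minimal sketch of schema-on-application-read, the Python snippet below unravels a comma-separated value like the one shown earlier. The field names in `SCHEMA` are assumptions (the stored value carries none), and `unravel` is a hypothetical helper; schema-on-database-read would move the same step into the database server.

```python
import csv
import io

# Hypothetical field names: the stored value itself has no schema.
SCHEMA = ["name", "address", "city", "state", "zip", "phone", "fax"]

def unravel(raw_value, schema=SCHEMA):
    """Assign a schema to one schema-less comma-separated value."""
    values = next(csv.reader(io.StringIO(raw_value)))
    return dict(zip(schema, values))   # extra values beyond the schema are dropped

RAW = ('"Anchorage Daily News","PO Box 149001","Anchorage","AK",'
       '"99514-9001","907-257-4200","907-258-2157"')
record = unravel(RAW)
```

After unraveling, the application can address fields by name, for example `record["city"]`, although the store only ever held an opaque string.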

Tyranny of Performance

The Balancing Act

Productivity

Maintainability

Time-to-market

Performance

Scalability

Availability


The Classic Reporting Environment

[Figure: production applications, production databases, data staging area, data warehouse, data marts, personal data store, interactive reporting, executive reporting.]


The Upcoming Analytical Labyrinth

[Figure: production applications, production databases, data staging area, data warehouse, data marts, personal data store, big data, big data analytics, sandboxes, unstructured data, external data, private data, operational reporting, interactive reporting, predictive analytics.]

Do We Want Analytical Silos?

[Figure: production applications, production databases, streaming databases, social media data, data staging area, big data stores, unstructured data, sandboxes, data warehouse & data marts, external data, private data; consumers: self-service BI, reporting, iterative predictive analytics, mobile reporting, predefined reporting.]

Heading for an Integration Labyrinth

[Figure: production applications, production databases, streaming databases, social media data, data staging area, big data stores, unstructured data, sandboxes, data warehouse & data marts, external data, private data.]


Different Database Workloads

[Figure: database types over time (pre-relational, SQL, OO, OLAP, and XML databases) and the workload types OLTP, OLCP, OLAP, and OLXP.]


Hadoop APIs Too Technical?

Is Google Going SQL?

2012: Spanner supports general-purpose transactions, and provides a SQL-based query language.

Google’s motivation: “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”

Market of SQL-fication Products

SQL-on-Hadoop Engines

Examples: Apache Hive, Cassandra CQL, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Simba, SpliceMachine, …

Data virtualization and data federation servers

Examples: Cirro, Cisco/Composite, Denodo, Informatica IDS, Red Hat JBoss Data Virtualization, Stone Bond, …

SQL databases (polyglot persistence)

Examples: EMC Greenplum UAP, Hadapt, Microsoft PolyBase, ParAccel, Teradata Aster database (SQL-H), …


CitusData CitusDB

Designed for analytical queries

Characteristics

No use of MapReduce or Hive

Knows the location of data – speeds up data access

Based on PostgreSQL

Queries are pushed to the data nodes

Statistics are collected on the data

UDFs are supported

[Figure: CitusDB, HDFS, MongoDB.]


JethroData Jethro

Designed for interactive queries

Characteristics

Every column is indexed!!

Append-only inverted lists – index entries are appended

Inserts have no impact on reads

30-40% extra storage

Columnar store

ANSI-92 SQL: DDL + query

Supports joins

[Figure: Jethro, HDFS.]
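The idea of indexing every column with append-only inverted lists can be sketched generically in Python. This is not Jethro's implementation: the `ColumnIndex` and `Table` classes are hypothetical, everything lives in memory, and only equality conditions are supported.

```python
from collections import defaultdict

class ColumnIndex:
    """Append-only inverted list for one column: value -> list of row ids."""

    def __init__(self):
        self.postings = defaultdict(list)

    def append(self, row_id, value):
        self.postings[value].append(row_id)   # existing entries are never changed

    def lookup(self, value):
        return self.postings.get(value, [])

class Table:
    def __init__(self, columns):
        self.columns = columns
        self.rows = []
        self.indexes = {c: ColumnIndex() for c in columns}   # every column is indexed

    def insert(self, row):
        row_id = len(self.rows)
        self.rows.append(row)
        for column, value in zip(self.columns, row):
            self.indexes[column].append(row_id, value)
        return row_id

    def where(self, **conditions):
        """Rows matching all column=value conditions, found via the indexes."""
        if not conditions:
            return list(self.rows)
        ids = None
        for column, value in conditions.items():
            hits = set(self.indexes[column].lookup(value))
            ids = hits if ids is None else ids & hits
        return [self.rows[i] for i in sorted(ids)]
```

Because an insert only appends to the end of each list, a query that is already reading a list is not disturbed; the price is the extra storage taken by the index entries.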

PivotalHD HAWQ

PivotalHD HAWQ = Greenplum on HDFS

Dual database strategy

Uses the same file format as GemFire/SQLFire for transactions

Greenplum = mature cost-based query optimizer

HAWQ compatible with Greenplum

ACID compliant

[Figure: HAWQ, HBase, HDFS.]


Data Virtualization Overview (1)

[Figure: a Data Virtualization Server surrounded by production databases, streaming databases, social media data, big data stores, unstructured data, data warehouse & data marts, external data, private data, ESB, applications, production application, website, analytics & reporting, mobile app, internal portal, and dashboard.]


Data Virtualization Overview (2)

[Figure: the same overview annotated with interfaces and languages (ODBC/SQL, JDBC/SQL, XML/SOAP, REST/JSON, XQuery, MDX/DAX; JMS, SQL, SQL+, XSLT, Hive, Prop., Excel, JSON, CICS, SOAP) and with the JMS messages, SQL statements, and SOAP messages that pass through the Data Virtualization Server.]

Definition of Data Virtualization

Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.
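The definition can be illustrated with a toy example in Python: two differently shaped stores behind one view. It is only a sketch under stated assumptions. The `SqlSource`, `JsonSource`, and `VirtualView` classes, the customer fields, and the choice of SQLite and JSON as the heterogeneous stores are all hypothetical, and a real data virtualization server does far more (query optimization, pushdown, security).

```python
import json
import sqlite3

class SqlSource:
    """Wraps a SQL store (here: in-memory SQLite)."""

    def __init__(self, rows):
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
        self.con.executemany("INSERT INTO customer VALUES (?, ?)", rows)

    def customers(self):
        for cid, name in self.con.execute("SELECT id, name FROM customer"):
            yield {"id": cid, "name": name}

class JsonSource:
    """Wraps a document store (here: a JSON string with different field names)."""

    def __init__(self, text):
        self.docs = json.loads(text)

    def customers(self):
        for doc in self.docs:
            yield {"id": doc["customer_id"], "name": doc["full_name"]}

class VirtualView:
    """Unified, abstracted view: consumers do not see where a row is stored."""

    def __init__(self, *sources):
        self.sources = sources

    def customers(self, name_prefix=""):
        rows = (r for s in self.sources for r in s.customers())
        matches = (r for r in rows if r["name"].startswith(name_prefix))
        return sorted(matches, key=lambda r: r["id"])
```

A consumer calls `VirtualView(...).customers("A")` and receives rows in one shape, regardless of whether they came from the SQL table or the JSON documents.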

The Market of Data Virtualization Servers

Cirro Data Hub

Cisco/Composite Information Server

Denodo Platform

IBM InfoSphere Federation Server

Informatica Data Services

Information Builders EII

Oracle Data Services Integrator

Progress Easyl

Red Hat Teiid and JBoss Data Virtualization

Stone Bond Enterprise Enabler

Virtuoso


Data Stays Where it’s Collected


Data generated per day is more than can be moved across the network. Network will look like this …

Big Data is Too Big To Move

Data Virtualization to the Rescue?


C-Level and Big Data

85% expect to gain substantial business and IT benefits from Big Data initiatives

85% have Big Data initiatives planned or in progress

70% report that these initiatives are enterprise-driven

85% of the initiatives are sponsored by a C-level executive or the head of a line of business

75% expect an impact across multiple lines of business

15% ranked their access to data as adequate or world-class

21% ranked their analytic capabilities as adequate or world-class

17% ranked their ability to use data and analytics to transform their business as more than adequate or world-class

Battle of Chancellorsville, 1863


IT specialists?

IT departments?

Benelux / Europe?


You Can’t Hide From Big Data Anymore

Big IT Party?

all database servers
    SQL database servers
        Classic SQL database servers
        Analytical SQL database servers
        NewSQL database servers
    NoSQL database servers
        Key-value stores
        Document stores
        Column-family stores
        Graph database servers
    ?

Recommended Books
