Copyright © 1991 - 2014 R20/Consultancy B.V., The Hague, The Netherlands. All rights reserved. No part of this material may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, electronic, mechanical, photographic, or otherwise, without the explicit written permission
of the copyright owners.
by
Rick F. van der Lans
R20/Consultancy BV
Twitter @rick_vanderlans
www.r20.nl
Big Data:
Big IT Party?
Rick F. van der Lans
Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, database technology, and data virtualization. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which data warehousing and integration technology were applied.
Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty-five years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches. He is the author of several books on computing, including his new Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. The popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL Developers.
As author for BeyeNetwork.com, writer of whitepapers, chairman for the annual European Enterprise Data and Business Intelligence Conference, and as columnist for a few IT magazines, he has close contacts with many vendors.
R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via:
Email: [email protected]
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223
WikiBon February 2014
Source: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
Gartner: Big Data Market Forecast
Big data will drive $232 billion in spending through 2016. It will directly or
indirectly drive $96 billion of worldwide IT spending in 2012, and is forecast to
drive $120 billion of IT spending in 2013.
McKinsey Global Institute: Benefits of Big Data
Big Data has the potential …
• to increase the value of the US Health Care industry by $300 Billion
• to increase the industry value of Europe’s public sector administration by EUR 250 Billion
• to decrease manufacturing (development and assembly) costs by 50%
• to increase service provider revenue by $100 Billion due to global personal location data
• to increase US retail’s net margin by 60%
Big Data Exaggerations
Big data: A revolution that will transform how we live, work and think
Companies are being destroyed and created around big data, …
Management of big data: Key to … survival in the health care sector
Big data has arrived and is shaping IT today
The disruptive power of big data
Analytical Challenges of Tomorrow
Improve product development
Optimize business processes
Improve customer care
Improve customer delight
Improve pro-active customer care
Personalize products
…
External Data: UK-based Retail Company
A 10-degree rise in temperature means 300% more barbecue meat, 45% more lettuce, and 50% more coleslaw
A city-center store will see an uplift in sandwiches (to eat outside) on a warm weekday, and almost no effect at all on a warm weekend
Result: 6 million UK pounds less food wastage in the summer, 50 million less stock in warehouses
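The temperature uplift percentages for this retailer translate into a simple stock forecast. The following is a minimal sketch, assuming hypothetical baseline sales figures; the baselines and the function name forecast are illustrative and do not come from the retailer.

```python
# Uplift percentages quoted for a 10-degree rise in temperature.
UPLIFT_PCT = {"barbecue meat": 300, "lettuce": 45, "coleslaw": 50}

# Hypothetical baseline daily sales in units (assumed for illustration only).
BASELINE = {"barbecue meat": 100, "lettuce": 200, "coleslaw": 80}

def forecast(baseline, uplift_pct):
    """Apply the uplift: '300% more' means the baseline plus three times the baseline."""
    return {
        product: baseline[product] * (100 + uplift_pct[product]) / 100
        for product in baseline
    }

warm_day = forecast(BASELINE, UPLIFT_PCT)
for product, units in warm_day.items():
    print(f"{product}: {units:.0f} units")
```

Note that "300% more" quadruples demand, which is why external data such as a weather forecast can matter so much for stock levels.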
Privacy?
Databases are Boring!
Source: The 451 Group
SQL is Intergalactic DataSpeak!
Or was?
Scale Up – Scale Out
Scale up (vertical scaling) means adding more resources to one node in a system
Scale out (horizontal scaling) means adding more nodes to a system
• Continuous availability/redundancy
• Cost/performance flexibility
• Contiguous upgrades
• Geographical distribution
[Figure: scale out versus scale up]
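Scaling out depends on spreading data over the nodes. The following is a minimal sketch of one common approach, hash partitioning, and is an assumption: the slides do not prescribe a partitioning scheme. The key names and the functions node_for and load_per_node are hypothetical.

```python
import zlib
from collections import Counter

def node_for(key, n_nodes):
    """Map a key to a node with a stable hash (crc32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes

def load_per_node(keys, n_nodes):
    """Count how many keys each node would store."""
    counts = Counter(node_for(key, n_nodes) for key in keys)
    return [counts.get(node, 0) for node in range(n_nodes)]

# Hypothetical workload: 1000 customer keys.
keys = [f"customer-{i}" for i in range(1000)]

# Scaling out from 4 to 8 nodes spreads the same keys over more nodes.
load_4 = load_per_node(keys, 4)
load_8 = load_per_node(keys, 8)
print("4 nodes:", load_4)
print("8 nodes:", load_8)
```

With plain modulo hashing most keys move to a different node when the node count changes, which is one reason real systems often use more elaborate schemes.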
Operations of a Query
Analytical functions
Recursive operations
Joins
Having filters
Group by
Complex scalar functions
Projections and simple transformations
Filters - selections
WITH FLIGHTPLAN(FLIGHTNO, PLAN_AIRPORTS, PLAN_FLIGHTS,
                START_AIRPORT, END_AIRPORT, START_TIME, END_TIME,
                DEPARTURE_AIRPORT, ARRIVAL_AIRPORT,
                DEPARTURE_TIME, ARRIVAL_TIME, PRICE, STOPS) AS
  (SELECT FLIGHTNO,
          CAST(DEPARTURE_AIRPORT || '->' || ARRIVAL_AIRPORT AS VARCHAR(100)),
          CAST(RTRIM(CHAR(FLIGHTNO)) AS VARCHAR(100)),
          DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
          DEPARTURE_AIRPORT, ARRIVAL_AIRPORT, DEPARTURE_TIME, ARRIVAL_TIME,
          PRICE, 0
   FROM   FLIGHTS
   WHERE  DEPARTURE_AIRPORT = 'AMS'
   AND    CAST(DEPARTURE_TIME AS DATE) = '2007-03-01'
   UNION ALL
   SELECT P.FLIGHTNO,
          P.PLAN_AIRPORTS || '->' || F.ARRIVAL_AIRPORT,
          P.PLAN_FLIGHTS || '->' || RTRIM(CHAR(F.FLIGHTNO)),
          P.START_AIRPORT, F.ARRIVAL_AIRPORT, P.START_TIME, F.ARRIVAL_TIME,
          P.DEPARTURE_AIRPORT, P.ARRIVAL_AIRPORT, P.DEPARTURE_TIME, P.ARRIVAL_TIME,
          P.PRICE + F.PRICE, STOPS + 1
   FROM   FLIGHTPLAN AS P, FLIGHTS AS F
   WHERE  P.ARRIVAL_AIRPORT = F.DEPARTURE_AIRPORT
   AND    P.ARRIVAL_TIME < F.DEPARTURE_TIME
   AND    F.DEPARTURE_AIRPORT <> 'PHX'
   AND    LOCATE(F.ARRIVAL_AIRPORT, P.PLAN_AIRPORTS) = 0
   AND    STOPS < 1
   AND    P.ARRIVAL_TIME + 4 HOURS > F.DEPARTURE_TIME)
SELECT PLAN_AIRPORTS, PLAN_FLIGHTS, START_AIRPORT, END_AIRPORT,
       START_TIME, END_TIME, PRICE
FROM   FLIGHTPLAN
WHERE  END_AIRPORT = 'PHX'
ORDER  BY PRICE ASC
FETCH  FIRST 1 ROW ONLY
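The flight-plan query combines several of the operations listed: a filter, a join, a recursive operation, and a final ordering. The following is a simplified sketch of the same idea that runs on SQLite through Python's sqlite3 module. The table layout, the five sample flights, and the prices are hypothetical, and the time conditions of the original query are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flights (flightno INTEGER, dep TEXT, arr TEXT, price REAL);
INSERT INTO flights VALUES
  (1, 'AMS', 'JFK', 400),
  (2, 'AMS', 'ORD', 350),
  (3, 'JFK', 'PHX', 200),
  (4, 'ORD', 'PHX', 300),
  (5, 'AMS', 'PHX', 900);
""")

query = """
WITH RECURSIVE flightplan(route, end_airport, price, stops) AS (
    -- Anchor: every direct flight leaving AMS (filter - selection).
    SELECT dep || '->' || arr, arr, price, 0
    FROM flights
    WHERE dep = 'AMS'
    UNION ALL
    -- Recursive step: extend a plan by one connecting flight (join),
    -- never revisiting an airport and allowing at most one stop.
    SELECT p.route || '->' || f.arr, f.arr, p.price + f.price, p.stops + 1
    FROM flightplan AS p
    JOIN flights AS f ON p.end_airport = f.dep
    WHERE p.stops < 1
      AND instr(p.route, f.arr) = 0
)
SELECT route, price
FROM flightplan
WHERE end_airport = 'PHX'
ORDER BY price ASC
LIMIT 1
"""

cheapest = conn.execute(query).fetchone()
print(cheapest)
```

With this sample data the one-stop route via JFK is cheaper than the direct flight, so the recursive part of the query is what finds the answer.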
Parallel Database Architecture
[Figure: an application connected to a database server consisting of a Master and Worker 1, Worker 2, and Worker 3; the query operations shown are analytical functions, recursive operations, joins, having filters, group by, complex scalar functions, projections and simple transformations, and filters/selections]
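The division of work between a master and its workers can be sketched as follows. This is a toy illustration only, assuming that filters and a partial group by run on the workers and that the master merges the partial results; the function names, the three-way split, and the sample rows are hypothetical, and threads stand in for separate worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker(partition):
    """Run the cheap operations close to the data: filter, then pre-aggregate."""
    partial = Counter()
    for region, amount in partition:
        if amount > 0:                  # filter - selection
            partial[region] += amount   # partial group by
    return partial

def master(rows, n_workers=3):
    """Split the rows over the workers and merge their partial aggregates."""
    partitions = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)           # final group by on the master
    return dict(total)

# Hypothetical sales rows: (region, amount).
rows = [("north", 10), ("south", 5), ("north", -3), ("south", 7), ("east", 1)]
result = master(rows)
print(result)
```

The merge step on the master is the part that does not parallelize, which hints at why adding workers does not speed up every kind of operation equally.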
Effect of Partitions on Query Response
[Figure: total throughput versus number of partitions/processors, with a bottleneck marked]
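One common way to reason about such a bottleneck is Amdahl's law, which is used here as an assumption; the slide itself does not name a model. The sketch below computes the speedup for a hypothetical query in which 10% of the work cannot be parallelized.

```python
def speedup(n_processors, serial_fraction):
    """Amdahl's law: the serial part limits the gain from adding processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Hypothetical workload: 10% of the query cannot be parallelized.
SERIAL = 0.1
curve = {n: speedup(n, SERIAL) for n in (1, 2, 4, 8, 16, 64, 1024)}
for n, s in curve.items():
    print(f"{n:5d} partitions -> speedup {s:.2f}")
```

Under this assumption the speedup can never exceed 1 / 0.1 = 10, no matter how many partitions or processors are added.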
Internal Database Server “Administration”
Source: VoltDB / Michael Stonebraker
NewSQL
The Market of Hadoop/NoSQL Products
Categories of Database Servers
all database servers
  SQL database servers
    Classic SQL database servers
    Analytical SQL database servers
    NewSQL database servers
  NoSQL database servers
    Key-value stores
    Document stores
    Column-family stores
    Graph database servers
Strong Consistency vs. Eventual Consistency
Strong
Eventual
SQL DBMS versus NoSQL Solution
[Figure: an application on a SQL database server next to an application on a NoSQL solution]
Hadoop 2.0
Examples of Complex Values (1)
Comma-separated value
"Anchorage Daily News","PO Box 149001","Anchorage","AK","99514-9001",
"907-257-4200","907-258-2157","71","","82",
"http://www.adn.com/",[email protected]
EDIFACT message
UNB+UNOA:1+005435656:1+006415160:1+060515:1434+00000000000778'XXXUNH+
00000000000117+INVOIC:D:97B:UN'XXXBGM+380+342459+9'XXXDTM+
3:20060515:102'XXXRFF+ON:521052'XXXNAD+BY+792820524::16++
CUMMINSMIDRANGEENGINEPLANT'XXXNAD+SE+005435656::16++
GENERALWIDGETCOMPANY'XXXCUX+1:USD'XXXLIN+1++157870:IN'XXXIMD+
F++:::WIDGET'XXXQTY+47:1020:EA'XXXALI+US'XXXMOA+203:1202.58'XXXPRI+
INV:1.179'XXXLIN+2++157871:IN'XXXIMD+F++:::DIFFERENTWIDGET'XXXQTY+
47:20:EA'XXXALI+JP'XXXMOA+203:410'XXXPRI+INV:20.5'XXXUNS+S'XXXMOA+
39:2137.58'XXXALC+C+ABG'XXXMOA+8:525'XXXUNT+23+00000000000117'XXXUNZ+
1+00000000000778'
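Both complex values have to be taken apart before individual fields can be queried. The following is a minimal sketch using the first seven fields of the comma-separated value and two segments of the EDIFACT message. The column names are assumptions, since the slide does not name the fields, and the EDIFACT helper ignores the release character and other details of the standard.

```python
import csv
import io

# First seven fields of the comma-separated value shown above.
raw_csv = ('"Anchorage Daily News","PO Box 149001","Anchorage","AK",'
           '"99514-9001","907-257-4200","907-258-2157"')

# Assumed column names (not given in the source).
columns = ["name", "address", "city", "state", "zip", "phone", "fax"]

fields = next(csv.reader(io.StringIO(raw_csv)))
record = dict(zip(columns, fields))

def parse_segment(segment):
    """Split one EDIFACT segment using the default separators:
    '+' between data elements and ':' between components."""
    tag, *elements = segment.split("+")
    return tag, [element.split(":") for element in elements]

# Segments in a message are terminated by an apostrophe.
segments = [parse_segment(s) for s in "QTY+47:1020:EA'MOA+203:1202.58".split("'")]

print(record["city"], record["state"])
print(segments)
```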
Example of Complex Value (2)
Weblog record
datestamp ip request
6/1/2012 11:10:19 AM 107.1.187.170 GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticker=DDD HTTP/1.1
6/1/2012 5:53:49 AM 107.1.2.180 GET /tv/3/player/vendor/Chef%20Tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
6/1/2012 8:55:54 AM 107.34.51.63 GET /tv/3/search/content/The%20Andy%20Griffith%20Show/s/The%20Andy%20Griffith%20Show HTTP/1.1
6/1/2012 3:12:43 PM 107.5.115.117 GET /tv/3/search/content/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe/s/Kathie%20Lee%20Gifford's%20epic%20'Today'%20gaffe HTTP/1.1
6/1/2012 4:48:35 PM 108.225.132.245 GET /tv/3/search/content/Deadliest%20Catch/s/Deadliest%20Catch HTTP/1.1
6/1/2012 10:25:12 AM 108.246.20.125 GET /x.php?u=http://studio-5.financialcontent.com/synacor?Page=QUOTE&Ticker=DJ:DJI HTTP/1.1
6/1/2012 1:58:14 AM 108.246.25.117 GET /tv/3/player/vendor/Chef%20Tips/player/fiveminute/content/steak/asset/gnrc_15879500 HTTP/1.1
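A weblog record like these can be unraveled into separate fields with a regular expression. The following is a minimal sketch, assuming each record consists of a datestamp, an IP address, and a request made up of method, path, and protocol; the function name unravel and the output field names are hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import unquote

# datestamp (date, time, AM/PM), ip, then the request: method, path, protocol.
RECORD = re.compile(
    r"^(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) (\S+) (\S+) (\S+) (\S+)$"
)

def unravel(line):
    """Turn one raw weblog record into a dictionary of typed fields."""
    match = RECORD.match(line)
    if match is None:
        raise ValueError(f"unrecognized weblog record: {line!r}")
    stamp, ip, method, path, protocol = match.groups()
    return {
        "datestamp": datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"),
        "ip": ip,
        "method": method,
        "path": unquote(path),   # decode %20 and similar escapes
        "protocol": protocol,
    }

line = ("6/1/2012 4:48:35 PM 108.225.132.245 GET "
        "/tv/3/search/content/Deadliest%20Catch/s/Deadliest%20Catch HTTP/1.1")
parsed = unravel(line)
print(parsed)
```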
Unraveling the Data Model
1: Unravel & store in a classic database, then query
2: Store in a classic database, then query & unravel
3: Store in a MapReduce database, then query & unravel
Schema-On-Write
SoW = Data written to a database has a schema
• A schema is not optional
Fixed schema-on-write
• All records in a table have the same schema
• For example, SQL systems
Variable schema-on-write
• When data is stored in the database, a schema is written together with the data itself
• Different records in a table can have different schemas
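The difference between fixed and variable schema-on-write can be sketched as follows. SQLite stands in for a SQL system and JSON documents stand in for a store that writes the schema with each record; the table, the field names, and the sample values are hypothetical.

```python
import json
import sqlite3

# Fixed schema-on-write: the table defines the schema for every record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER NOT NULL, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer VALUES (?, ?)", (1, "Ann"))

try:
    # A record with an extra field does not fit the fixed schema.
    conn.execute("INSERT INTO customer (id, name, twitter) VALUES (2, 'Bob', '@bob')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Variable schema-on-write: each stored record carries its own field names.
documents = [
    json.dumps({"id": 1, "name": "Ann"}),
    json.dumps({"id": 2, "name": "Bob", "twitter": "@bob"}),
]
schemas = [sorted(json.loads(doc).keys()) for doc in documents]

print("fixed schema rejected the extra field:", rejected)
print("schemas per document:", schemas)
```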
Schema-On-Read
SoR = Data written to a database has no schema; a schema is assigned when the data is read
• Stored data has no schema
• Complex values or schema-less values
Schema-on-application-read
• The application assigns a schema to the schema-less data (unraveling)
Schema-on-database-read
• The database server assigns a schema to the schema-less data
• The application receives data with a schema
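The two schema-on-read variants differ in who does the unraveling. The following is a minimal sketch in which SQLite stands in for the database server: the raw value is stored without a schema, and either the application splits it or a view inside the database does. The pipe-separated value, the field names, and the helper function field are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The stored data has no schema: one opaque text value per record.
conn.execute("CREATE TABLE raw_store (raw TEXT)")
conn.execute("INSERT INTO raw_store VALUES ('AK|Anchorage|99514')")

# Schema-on-application-read: the application unravels the value itself.
raw = conn.execute("SELECT raw FROM raw_store").fetchone()[0]
state, city, zip_code = raw.split("|")
app_record = {"state": state, "city": city, "zip": zip_code}

# Schema-on-database-read: the database assigns the schema (here via a view),
# so the application receives named columns.
conn.create_function("field", 2, lambda value, index: value.split("|")[index])
conn.execute("""
    CREATE VIEW city_view AS
    SELECT field(raw, 0) AS state,
           field(raw, 1) AS city,
           field(raw, 2) AS zip
    FROM raw_store
""")
db_record = conn.execute("SELECT state, city, zip FROM city_view").fetchone()

print(app_record)
print(db_record)
```

In the second variant the unraveling logic lives in one place, so every application that reads the view sees the same schema.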
Tyranny of Performance
The Balancing Act
Productivity
Maintainability
Time-to-market
Performance
Scalability
Availability
The Classic Reporting Environment
[Figure labels: production applications, production databases, data staging area, data warehouse, data marts, personal data store, interactive reporting, executive reporting]
The Upcoming Analytical Labyrinth
[Figure labels: production applications, production databases, data staging area, data warehouse, data marts, personal data store, big data, big data analytics, sandboxes, unstructured data, external data, private data, operational reporting, predictive analytics, interactive reporting, reporting]
Do We Want Analytical Silos?
[Figure labels: production applications, production databases, streaming databases, social media data, data staging area, big data stores, unstructured data, sandboxes, data warehouse & data marts, external data, private data, self-service BI, iterative reporting, predictive analytics, mobile reporting, predefined reporting]
Heading for an Integration Labyrinth
[Figure labels: production databases, streaming databases, social media data, data staging area, big data stores, unstructured data, sandboxes, data warehouse & data marts, external data, private data, production]
Different Database Workloads
[Figure labels: pre-relational database, SQL database, OO database, OLAP database, XML database; time axis; workloads OLTP, OLCP, OLAP, OLXP]
Hadoop APIs Too Technical?
Is Google Going SQL?
2012: Spanner supports general-purpose transactions, and
provides a SQL-based query language.
Google’s motivation: “We believe it is better to have
application programmers deal with performance problems
due to overuse of transactions as bottlenecks arise,
rather than always coding around the lack of transactions.”
Market of SQL-fication Products
SQL-on-Hadoop Engines
• Examples: Apache Hive, Cassandra CQL, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, MapR Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba, SpliceMachine, …
Data virtualization and data federation servers
• Examples: Cirro, Cisco/Composite, Denodo, Informatica IDS, Red Hat JBoss Data Virtualization, Stone Bond, …
SQL databases (polyglot persistence)
• Examples: EMC Greenplum UAP, Hadapt, Microsoft Polybase, ParAccel, Teradata Aster database (SQL-H), …
CitusData CitusDB
Designed for analytical queries
Characteristics
• No use of MapReduce or Hive
• Knows the location of data – speeds up data access
• Based on PostgreSQL
• Queries are pushed to the data nodes
• Statistics are collected on the data
• UDFs are supported
[Figure labels: CitusDB, HDFS, MongoDB]
JethroData Jethro
Designed for interactive queries
Characteristics
• Every column is indexed!!
• Append-only inverted lists – index entries are appended
• Inserts have no impact on reads
• 30-40% extra storage
• Columnar store
• ANSI SQL-92: DDL + query
• Supports joins
[Figure labels: Jethro, HDFS]
PivotalHD Hawq
PivotalHD Hawq = Greenplum on HDFS
Dual database strategy
• Uses the same file format as GemFire/SQLFire for transactions
Greenplum = mature cost-based query optimizer
Hawq compatible with Greenplum
ACID compliant
[Figure labels: Hawq, HBase, HDFS]
Data Virtualization Overview (1)
[Figure: a Data Virtualization Server between data stores (production databases, streaming databases, social media data, big data stores, unstructured data, data warehouse & data marts, external data, private data, applications) and data consumers (production application, website, ESB, analytics & reporting, mobile App, internal portal, dashboard)]
Data Virtualization Overview (2)
[Figure: the same Data Virtualization Server overview, annotated with interfaces. Interface labels: ODBC/SQL, JDBC/SQL, XML/SOAP, REST/JSON, XQuery, MDX/DAX; JMS SQL SQL+ XSLT Hive Prop. Excel JSON CICS SOAP. Traffic shown: JMS messages, SQL statements, SOAP messages]
Definition of Data Virtualization
Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.
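The definition can be made concrete with a toy example. The following sketch is not how any of the products work internally; it only illustrates the idea of one view over two heterogeneous stores, here a SQL table and a CSV file. The class VirtualOrdersView, the store contents, and all field names are hypothetical.

```python
import csv
import io
import sqlite3

# Data store 1: a SQL database holding customers.
sql_store = sqlite3.connect(":memory:")
sql_store.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
sql_store.execute("INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob')")

# Data store 2: a CSV file holding orders (in memory for this sketch).
csv_store = io.StringIO("customer_id,amount\n1,20.0\n1,5.5\n2,7.0\n")

class VirtualOrdersView:
    """Unified view: consumers see orders with customer names and do not
    need to know that two different stores and formats are involved."""

    def __init__(self, sql_conn, csv_file):
        self._sql = sql_conn
        self._csv = csv_file

    def rows(self):
        names = dict(self._sql.execute("SELECT id, name FROM customers"))
        self._csv.seek(0)
        for order in csv.DictReader(self._csv):
            yield {
                "customer": names[int(order["customer_id"])],
                "amount": float(order["amount"]),
            }

view = VirtualOrdersView(sql_store, csv_store)
orders = list(view.rows())
total_ann = sum(o["amount"] for o in orders if o["customer"] == "Ann")
print(orders)
print("total for Ann:", total_ann)
```

The data stays in its original stores; only the view combines it at query time.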
The Market of Data Virtualization Servers
Cirro Data Hub
Cisco/Composite Information Server
Denodo Platform
IBM InfoSphere Federation Server
Informatica Data Services
Information Builders EII
Oracle Data Services Integrator
Progress Easyl
Red Hat Teiid and JBoss Data Virtualization
Stone Bond Enterprise Enabler
Virtuoso
Data Stays Where it’s Collected
Big Data is Too Big To Move
Data generated per day is more than can be moved across the network.
Network will look like this …
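A quick calculation shows why moving big data can be impractical. The figures below are hypothetical: 100 TB generated per day and a dedicated 1 Gbit/s link, using decimal units and ignoring protocol overhead.

```python
def transfer_days(terabytes, gigabits_per_second):
    """Days needed to move a volume of data over a link of a given speed."""
    bits = terabytes * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (gigabits_per_second * 1e9)     # link speed in bits/second
    return seconds / 86_400                          # seconds per day

# Hypothetical: one day's worth of data (100 TB) over a 1 Gbit/s link.
days = transfer_days(100, 1)
print(f"{days:.1f} days to move one day of data")
```

Under these assumptions it takes more than nine days to move what is generated in one day, so the backlog grows without bound.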
Data Virtualization to the Rescue?
C-Level and Big Data
85% expect to gain substantial business and IT benefits from Big Data initiatives
85% have Big Data initiatives planned or in progress
70% report that these initiatives are enterprise-driven
85% of the initiatives are sponsored by a C-level executive or the head of a line of business
75% expect an impact across multiple lines of business
C-Level and Big Data
15% ranked their access to data as adequate or world-class
21% ranked their analytic capabilities as adequate or world-class
17% ranked their ability to use data and analytics to transform their business as more than adequate or world-class
Battle of Chancellorsville, 1863
IT specialists?
IT departments?
Benelux / Europe?