An Interim SC 32
Viewpoint of Big Data &
Next-Generation Analytics
Jim Melton
JTC 1/SC 32 Chair
November 2013 JTC 1 Plenary
Perros-Guirec, FR
Abstract
The data management industry is not standing still. There are new capabilities in the SQL standard, new data
management technologies, and new applications.
Big Data
Analytics
Support for Big Data
SQL Standard
Other types of databases
NoSQL Databases
NewSQL Databases
Your Humble Servant
Architect at Oracle
Data management standardization
SQL Standards committees since 1986
Editor of all parts of ISO/IEC 9075 and TR 19075 since 1987
Major author of proposals for many years
Chair of SC 32 since 2011 (Acting Chair in 2011)
XQuery Standardization since 1998
Editor/Co-Editor: Functions & Operators, XQueryX
Chair of W3C WG since 2004 (Co-Chair until 2008)
Many other standards activities and interests
Dimensions of Big Data
Characteristics:
Quantity, size
Complexity
Rate of change
Varieties
Availability
Persistence
Integrity
Location
Relevance
Etc.
Aspects:
Data, per se
Metadata*, models*
Privacy & Security
Storage, reliability
Query & Analysis*
Transport, interchange*
Life cycle
Accessibility
Integration
Etc.
Paradigm Shift in Database Industry
Many database users are attempting to escape the restrictions of the current SQL databases and
database vendors
Distribution
Replication
High Availability
Large data volumes
Reduced up-front development costs
Minimal upfront licensing costs
5 2013 ISO/IEC JTC 1 Plenary
Driving Forces
Big Data
Inexpensive storage of large volumes of data
Inexpensive compute power
High bandwidth networks
Next Generation Analytics
Today’s Responses
SQL Databases
NoSQL Databases
Big Data Definition
Gartner's 3V definition of big data
Volume – terabytes, petabytes, …
Velocity – frequent inserts/updates, streaming
Variety – textual, geospatial, images, etc.
Additional Vs
Value – data is useful to someone
Veracity – validity of data can be assessed
Diagram from NBD-WG M0055, Big Data Architecture Framework
How big is big?
Data Volume
Terabyte – 1000**4
Petabyte – 1000**5
Exabyte – 1000**6
Zettabyte – 1000**7
Yottabyte – 1000**8
Data Distribution
Server
Cluster
Datacenter
Continent
World
Solar System
Big Data Examples
Big Science – e.g., Large Hadron Collider, Sky
Survey
Search Engines – e.g., Google, Bing
Web page click streams
Sensor networks, Internet of Things
Medical Research & Healthcare
Next Generation Analytics
Analytics is moving from:
Off-line ⇨ in-line embedded analytics
Explaining what happened ⇨ predicting what will happen
Operating on:
Data at rest – stored someplace
Data in motion – streaming
Examples:
Targeted web site advertising, real-time advertising
Search engine results
Identifying best time to purchase tickets
Identifying cancer factors
Common Analytical Techniques
Analytical Functions
Logistic Regression
Random Forests
Naive Bayesian classifiers
Clustering
K-means clustering
Canopy clustering
LDA (Latent Dirichlet Allocation) for text analysis
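As an illustration of one technique from this list, here is a minimal sketch of K-means clustering (Lloyd's algorithm) on 2-D points. The function name `kmeans`, its parameters, and the sample points are assumptions made for this example; production libraries add smarter initialization and convergence checks.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's algorithm on (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: two well-separated groups of four points each.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(sample, k=2)
```

On data this cleanly separated the two centroids settle at the centre of each group; on real data the result depends on the starting centroids, which is why the seed is exposed.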
Big Data Complications
Too much data to process sequentially
Process in parallel
Too much data to fit on one server
Distribute to multiple servers
Too much data to fit in one computer room
Distribute to multiple computer rooms
Too much data to move across network
Distribute queries to process in parallel on multiple
servers in multiple computer rooms
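The last point (send the query to the data rather than the data to the query) can be sketched as a scatter-gather aggregate: each server reduces its own rows to a small partial result, and only those partials cross the network. In this sketch threads stand in for servers, and the names `local_partial` and `distributed_average` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def local_partial(rows):
    """Runs 'next to the data': reduce one server's rows to (sum, count)."""
    return (sum(rows), len(rows))

def distributed_average(partitions):
    """Scatter the query to every partition in parallel, then merge partials."""
    # Assumption: each thread stands in for a remote server.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_partial, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None

# Hypothetical data spread over three "servers".
average = distributed_average([[1, 2, 3], [4, 5], [6]])
```

Only one (sum, count) pair per server is transferred, regardless of how many rows each server holds.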
Even More Complications
Availability
Replicate
Not all applications need all of the data all of the
time to provide useful responses
Network Latency and Bandwidth
Another reason to distribute the query
Process query on replica closest to query source
c (the speed of light): Not just a good idea; it’s the law
Database technologies and tools are being
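A rough worked example of why latency argues for querying the nearest replica: distance alone sets a floor on round-trip time. The speed of light in vacuum is exact; the 2/3 slowdown factor for optical fiber is an approximation, and the replica names and distances below are made up for illustration.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # exact, by definition of the metre
FIBER_FACTOR = 2 / 3  # assumption: light in fiber travels at roughly 2/3 of c

def min_round_trip_ms(distance_km, factor=FIBER_FACTOR):
    """Lower bound on round-trip time imposed by distance alone."""
    return 2 * distance_km / (SPEED_OF_LIGHT_KM_PER_S * factor) * 1000

def closest_replica(replica_distances_km):
    """Pick the replica with the smallest distance to the query source."""
    return min(replica_distances_km, key=replica_distances_km.get)

# Hypothetical replicas and distances (km) from the query source.
replicas = {"replica-a": 8000, "replica-b": 300, "replica-c": 5500}
best = closest_replica(replicas)
floor_ms = min_round_trip_ms(replicas[best])
```

Under these assumptions a replica 10,000 km away cannot answer in much under 100 ms, no matter how fast the servers are.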
Big Data Logical View
Service Layer: Analysis & Prediction
Platform Layer: Data Integration, Data Semantic Intellectualization
Data Layer: Data Identification, Data Collection, Data Registry, Data Repository
Big Data Silo
Service Layer: Analysis & Prediction
Platform Layer: Data Semantic Intellectualization, Data Integration
Data Layer: Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)
Integrating Big Data – Silos
[Diagram: three separate silos side by side. Each silo has its own Service Layer (Analysis & Prediction), Platform Layer (Data Semantic Intellectualization, Data Integration), and Data Layer (Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)).]
Integrated Big Data Silos
[Diagram: several silos (…), each keeping its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)), under one shared Service Layer (Analysis & Prediction).]
Big Data Integrated Silos
[Diagram: several silos (…), each keeping its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)), under one shared Service Layer (Analysis & Prediction).]
Service Support: Data Quality Management, Data Visualization, Workflow Management
Big Data Integrated Silos
[Diagram: several silos (…), each keeping its own Platform Layer (Data Semantic Intellectualization, Data Integration) and Data Layer (Data Collection, Data Registry, Data Repository, Data Identification (Data Mining & Metadata Extraction)), under one shared Service Layer (Analysis & Prediction).]
Service Support Layer: Data Quality Management, Data Visualization, Workflow Management
Big Data Management: Data Curation, Security, Privacy, …
Terminology
Workflow Management – scheduling queries, reports, etc.
Data Quality Management – minimize garbage
Data Visualization – displaying the results of querying a trillion data points
Data Curation – where does it come from, where does it go, provenance, lifetime
Security – define and enforce access policies
Privacy – prevent release of personal data
Terminology
Data Semantic Intellectualization –
Semantic Data Integration based on
technologies such as Ontology, Reasoning, and so on
Data Layer
SQL Databases (SQL Classic)
NoSQL Databases
NewSQL Databases
Spatial Data
Video/Image/Sound
Satellite/Radar/Seismic/Sonar (sensors)
Streaming Data
Etc.
Data Layer Characteristics
Volume
Storage Structure: Row Store, Column Store, Document Store, Key-Value Store, Graph, Streaming
Metadata: Can be queried; Known beforehand (or not)
Distribution
Interface: SQL, JDBC, ODBC (SQL/CLI), Custom
Transactions: ACID Transactions, BASE Transactions, No Transactions
Challenges
Distribute queries across disparate data sources
Integrate query results
Security
Privacy
For each data source, need to understand
Structure
Types of queries needed & supported
Multi-Data Source Queries
What is the correlation between mosquito borne
diseases, precipitation, and temperatures?
What correlations exist between the genomes of
cancer patients and the effectiveness of cancer treatments?
Based on recordings of vehicle sounds and
previous maintenance histories, what preventative maintenance is needed?
Seriously?
Provide names, photos, and travel history for all
individuals taller than 200 cm with blond hair below their shoulders and weighing between about 85 kg and 115 kg who flew between New York City and any
destination in southeast Asia during any period between 2007 and 2010 when the weather in southern Brazil
included rain > 2 cm/hour during the same month when the Prime Minister of Japan was on holiday and any southwest Asian nation experienced a slip-fault earthquake of magnitude 5.5 to 6.5.
Industry & Standards Efforts
NoSQL Projects & Products
NewSQL Projects & Products
Standards – ISO/IEC JTC 1/SC 32
NoSQL Products/Projects
http://www.nosql-database.org/ lists 150 NoSQL Databases. Some examples:
Cassandra
CouchDB
Hadoop & HBase
MongoDB
StupidDB
Etc.
NoSQL – Distributed Storage
Distribute across multiple servers potentially in
multiple computer rooms
Replicate across multiple servers potentially in
multiple computer rooms
Details depend on products & eco-system
Infrastructure to distribute queries
Map/Reduce
NoSQL – Map Reduce
Indexing and searching large data volumes
Two Phases: Map and Reduce
Map
Extract sets of Key-Value pairs from underlying data
Potentially in parallel on multiple machines
Reduce
Merge and sort sets of Key-Value pairs
Results may be useful for other searches
Techniques differ across products
Application developers or underlying software must understand the distribution scheme
Today: Mostly application responsibility
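The two phases can be sketched in a single process with a word count. This is a toy illustration, not any product's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample documents are assumptions, and a real deployment would run the map calls on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: extract (key, value) pairs from one chunk of underlying data."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sort the keys and merge each key's values into one result."""
    return {key: sum(values) for key, values in sorted(groups.items())}

def map_reduce(documents):
    # Each map_phase call is independent, so it could run in parallel.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

# Hypothetical input chunks.
counts = map_reduce(["big data", "Big analytics"])
```

Here the application code decides how the input is split into chunks, which mirrors the slide's point that distribution is mostly the application's responsibility today.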
Automated Query Distribution
Some products automate query distribution &
execution
Isolate application from underlying distribution
NoSQL – Retrieving Data
Syntax Varies
No set-based or declarative query language
Procedural programming languages such as Java, C, etc.
Application specifies retrieval path
No query optimizer
Quick answer is important
May not be a single “right” answer
NoSQL – Updating Data
BASE transactions
“Eventually correct”
Lazy updates
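A minimal sketch of what lazy updates mean for a reader, assuming a hypothetical in-memory store (`LazyReplicatedStore`) with one primary copy and background propagation. A write is acknowledged after reaching one replica, so other replicas may return stale data until propagation runs.

```python
class LazyReplicatedStore:
    """Toy store: writes land on replica 0 and reach the others later."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates accepted but not yet propagated

    def write(self, key, value):
        # Acknowledge as soon as one replica has the update.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        # A read sees only what this particular replica has received so far.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stand-in for background replication: apply queued updates in order.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

store = LazyReplicatedStore()
store.write("x", 1)
stale = store.read("x", replica=1)   # replica 1 has not seen the write yet
store.propagate()
fresh = store.read("x", replica=1)   # now consistent with replica 0
```

The window between `write` and `propagate` is where two readers can get different answers to the same question.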
NewSQL Projects & Products
Scalable performance of NoSQL products
Distributed storage – sharding
Distributed queries
In-memory techniques
Etc.
Support for Online Transaction Processing –
ACID transaction guarantees
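Sharding, as mentioned above, can be sketched as hashing each key to one of a fixed number of shards. The function names and the choice of SHA-256 are assumptions for this example; real products use a variety of schemes (range partitioning, consistent hashing, and so on).

```python
import hashlib

def shard_for(key, shard_count):
    """Map a string key to a shard index using a stable hash."""
    # hashlib gives the same result on every machine and run, unlike
    # Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def distribute(keys, shard_count):
    """Place every key on exactly one shard."""
    shards = [[] for _ in range(shard_count)]
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards

# Hypothetical customer keys spread over three shards.
layout = distribute(["alice", "bob", "carol", "dave"], 3)
```

Because the mapping depends on `shard_count`, changing the number of shards moves most keys; that cost is one reason other schemes exist.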
What is ISO/IEC JTC 1/SC 32 Doing?
WG2 – Metadata
WG3 – Database Languages (SQL)
WG2 – Metadata
ISO/IEC 11179 Metadata Registry
Metadata Registry structure and procedures
Semantics, data representation, & data descriptions
ISO/IEC 19763 Metamodel Framework for
Interoperability (MFI)
Communicate, execute programs, or transfer data
among various functional units
Requires little or no knowledge of the unique
characteristics of those units
WG3 – Database Languages
Next version of the SQL standards (ISO/IEC
9075) publication in late 2015 or early 2016
Row Pattern Recognition – already added
Bi-Temporal support – already added
Other possible additions
JavaScript Object Notation (JSON) documents
User Defined Aggregation Functions
Dynamic Table Functions
Row Pattern Recognition
Adds MATCH_RECOGNIZE clause to FROM
Specifies pattern across a sequence of rows
New Syntax:
ONE ROW PER MATCH
Returns single summary row for each match of the
pattern
Default
ALL ROWS PER MATCH
Returns one row for each row of each match
Row Pattern Recognition Example
SELECT M.Symbol,    /* ticker symbol */
       M.Matchno,   /* sequential match number */
       M.Tradeday,  /* day of trading */
       M.Price,     /* price on day of trading */
       M.Classy,    /* classifier */
       M.Startp,    /* starting price */
       M.Bottomp,   /* bottom price */
       M.Endp,      /* ending price */
       M.Avgp       /* average price */
FROM Ticker
  MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tradeday
    MEASURES MATCH_NUMBER () AS Matchno,
             CLASSIFIER () AS Classy,
             A.Price AS Startp,
             FINAL LAST (B.Price) AS Bottomp,
             FINAL LAST (C.Price) AS Endp,
             FINAL AVG (U.Price) AS Avgp
    ALL ROWS PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B+ C+)
    SUBSET U = (A, B, C)
    DEFINE /* A defaults to True, matches any row */
           B AS B.Price < PREV (B.Price),
           /* C and the closing alias are assumed; the source text is cut off after B */
           C AS C.Price > PREV (C.Price)
  ) AS M
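To make the pattern concrete, here is a small sketch that emulates PATTERN (A B+ C+) over one partition of prices already ordered by trading day. It assumes B means "price fell from the previous row" and C means "price rose from the previous row" (C's definition is not visible in the extracted slide text), and it returns one (start, bottom, end) summary per match rather than all rows. The function name `find_v_shapes` is hypothetical.

```python
def find_v_shapes(prices):
    """Return (start, bottom, end) prices for each match of A B+ C+."""
    matches = []
    i, n = 0, len(prices)
    while i < n:
        # A: any row. B+: one or more rows with a strictly falling price.
        j = i + 1
        while j < n and prices[j] < prices[j - 1]:
            j += 1
        bottom = j - 1
        # C+: one or more rows with a strictly rising price (assumed).
        k = j
        while k < n and prices[k] > prices[k - 1]:
            k += 1
        if bottom > i and k > j:
            matches.append((prices[i], prices[bottom], prices[k - 1]))
            i = k       # AFTER MATCH SKIP PAST LAST ROW
        else:
            i += 1      # no match starting here; try the next row
    return matches

# Hypothetical daily prices for one ticker symbol.
shapes = find_v_shapes([10, 8, 6, 7, 9, 9, 5, 6])
```

For the sample prices the function finds two V shapes: 10 down to 6 and back up to 9, then 9 down to 5 and up to 6.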
Other Possible Additions
JSON – Just Some Other Notation
User-Defined Data Aggregation Functions
Select Col1, MyFunction(col2), SUM(col3) From Table1 Group by Col1;
Dynamic/Polymorphic Table Functions
Parameter is one arbitrary table
Function result can be another arbitrary table
Insert into T1 Select * from
MyInferenceEngine(ExternalDataStream.T2);
Result table “shape” not known until run-time
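The user-defined aggregate query on this slide can be tried out today using SQLite through Python's `sqlite3` module, whose `create_aggregate` call registers a class with `step` and `finalize` methods. This is a stand-in for the proposed SQL standard feature, not an implementation of it; the aggregate's behaviour (value range, max minus min) and the sample rows are assumptions made for the example.

```python
import sqlite3

class ValueRange:
    """Hypothetical aggregate: max - min of the non-NULL values in a group."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Called once per row in the group.
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        # Called once per group to produce the aggregate's result.
        return None if self.low is None else self.high - self.low

def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("MyFunction", 1, ValueRange)
    conn.execute("CREATE TABLE Table1 (Col1 TEXT, Col2 REAL, Col3 INTEGER)")
    conn.executemany(
        "INSERT INTO Table1 VALUES (?, ?, ?)",
        [("a", 2.0, 1), ("a", 8.0, 2), ("b", 5.0, 3)],
    )
    rows = conn.execute(
        "SELECT Col1, MyFunction(Col2), SUM(Col3) "
        "FROM Table1 GROUP BY Col1 ORDER BY Col1"
    ).fetchall()
    conn.close()
    return rows

result = run_demo()
```

The user-defined aggregate sits next to the built-in SUM in the same GROUP BY query, which is the usage the slide's example shows.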
WG4 – SQL/MM
Support for extended datatypes within SQL databases
ISO/IEC 13249-2 Full Text – content-based
retrieval.
ISO/IEC 13249-5 Still Image – basic functions for
image data management.
ISO/IEC 13249-3 Spatial – functions to support
geo-spatial applications
ISO/IEC 13249-6 Data Mining – support for
Big Data Gaps
Data source registry
Location, contents, and semantics
Ability to discover and utilize data
Common interface to disparate data sources
Better support for
Queries against images, video, & sound
Streaming data
Security & privacy
Integration of analytical functions
Summary
Big Data has arrived
Significant hype but practical applications emerging
Hype tends to focus on data store capabilities
It’s just another data store
SQL Standards development is ongoing
Next version in 2015/2016
Temporal data
Row Pattern Recognition
Additional temporal support?
Multi-dimensional Data Type?
Additional support for big data?
SC 32 Opportunities
WG 2 – Metadata for Big Data
Many kinds of metadata (structural, semantic,
catalog, integration, correlation)
Automatic discovery of & reasoning over metadata
WG 3 – Big Data Querying & Manipulation
Tables, trees, etc. necessary, but wholly inadequate
New query paradigms, analysis, transaction models
WG 4 – Specialized Big Data Types
Support for other forms of data (e.g., seismic)
“Helper” functionality
Acknowledgements
All errors, misunderstandings, misleading
statements, and idiotic comments are mine and mine alone.
Keith Hare (JTC 1/SC 32/WG 3 Convenor)
JTC 1/SC 32 Study Group on Next-Generation
Analytics and Big Data
Jörn Bartels, 정성재 (Jung, Sung Jae), Keith
Questions?
Big Data Analysis Challenges
A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span generation of the data, preparation for analysis, and policy-related challenges in its sharing and use, including the following:
Dealing with highly distributed data sources,
Tracking data provenance, from data generation through data preparation,
Validating data,
Coping with sampling biases and heterogeneity,
Working with different data formats and structures,
Developing algorithms that exploit parallel and distributed architectures,
Ensuring data integrity,
Ensuring data security,
Enabling data discovery and integration,
Enabling data sharing,
Developing methods for visualizing massive data,
Developing scalable and incremental algorithms, and
Coping with the need for real-time analysis and decision-making.
References
“Big Data”, Viktor Mayer-Schönberger &
Kenneth Cukier, Houghton Mifflin Harcourt, New York, NY, 2013.
“Big Data Now: 2012 Edition”,
http://oreilly.com/data/radarreports/big-data-now-2012.csp
NIST Big Data Working Group
http://bigdatawg.nist.gov/home.php
References
“Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison,” Kristóf Kovács, viewed 2013-03-16
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis?imm_mid=0a2ec6&cmp=em-velocity-newsletters-vlny-cfp-20130307-direct
References
National Research Council. 2013. Frontiers in
Massive Data Analysis. Washington, D.C.: The National Academies Press.
http://www.nap.edu/catalog.php?record_id=18374
“What Does 'Big Data' Mean?”
Michael Stonebraker, Communications of the ACM Blogs
Part 1: http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
Part 2: http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-data-mean-part-2/fulltext
Part 3: http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-data-mean-part-3/fulltext
Part 4: http://cacm.acm.org/blogs/blog-cacm/162095-what-does-big-data-mean-part-4/fulltext