(1)

An Interim SC 32 Viewpoint of Big Data & Next-Generation Analytics

Jim Melton, JTC 1/SC 32 Chair
November 2013 JTC 1 Plenary, Perros-Guirec, FRA

(2)

Abstract

The data management industry is not standing still. There are new capabilities in the SQL standard, new data management technologies, and new applications.

 Big Data

 Analytics

 Support for Big Data

 SQL Standard

 Other types of databases

 NoSQL Databases
 NewSQL Databases

(3)


Your Humble Servant

 Architect at Oracle

 Data management standardization

 SQL Standards committees since 1986

 Editor of all parts of ISO/IEC 9075 and TR 19075 since 1987
 Major author of proposals for many years
 Chair of SC 32 since 2011 (Acting Chair in 2011)
 XQuery Standardization since 1998
 Editor/Co-Editor: Functions & Operators, XQueryX
 Chair of W3C WG since 2004 (Co-Chair until 2008)
 Many other standards activities and interests

(4)

Dimensions of Big Data

 Characteristics:
   Quantity, size
   Complexity
   Rate of change
   Varieties
   Availability
   Persistence
   Integrity
   Location
   Relevance
   Etc.
 Aspects
   Data, per se
   Metadata*, models*
   Privacy & Security
   Storage, reliability
   Query & Analysis*
   Transport, interchange*
   Life cycle
   Accessibility
   Integration
   Etc.

(5)

Paradigm Shift in Database Industry

Many database users are attempting to escape the restrictions of the current SQL databases and database vendors

 Distribution

 Replication

 High Availability

 Large data volumes

 Reduced up-front development costs

 Minimal upfront licensing costs

5 2013 ISO/IEC JTC 1 Plenary

(6)

Driving Forces

 Big Data

 Inexpensive storage of large volumes of data

 Inexpensive compute power

 High bandwidth networks

 Next Generation Analytics

 Today’s Responses

 SQL Databases

 NoSQL Databases

(7)

Big Data Definition

 Gartner's 3V definition of big data

 Volume – terabytes, petabytes, …

 Velocity – frequent inserts/updates, streaming

 Variety – textual, geospatial, images, etc.

 Additional Vs

 Value – data is useful to someone

 Veracity – validity of data can be assessed


(8)

Diagram from NBD-WG M0055, Big Data Architecture Framework

(9)

How big is big?

 Data Volume
   Terabytes – 1000**4
   Petabytes – 1000**5
   Exabytes – 1000**6
   Zettabytes – 1000**7
   Yottabytes – 1000**8
 Data Distribution
   Server
   Cluster
   Datacenter
   Continent
   World
   Solar System
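The powers of 1000 listed above can be checked with a few lines of arithmetic. This is only a sketch: the names `UNITS`, `unit_size`, and `describe` are made up for illustration, and the units below terabytes are added to complete the scale.

```python
# Decimal (SI) storage units: each step up multiplies the size by 1000.
UNITS = [
    "bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
    "petabytes", "exabytes", "zettabytes", "yottabytes",
]


def unit_size(name: str) -> int:
    """Number of bytes in one unit, e.g. terabytes -> 1000**4."""
    return 1000 ** UNITS.index(name)


def describe(byte_count: int) -> str:
    """Express a byte count in the largest unit that does not exceed it."""
    power = 0
    while power < len(UNITS) - 1 and byte_count >= 1000 ** (power + 1):
        power += 1
    return f"{byte_count / 1000 ** power:g} {UNITS[power]}"


if __name__ == "__main__":
    print(unit_size("terabytes"))
    print(describe(2_500_000_000_000_000))
```

Each step along the scale is a factor of 1000, so a yottabyte is a trillion terabytes.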

(10)

Big Data Examples

 Big Science – e.g., Large Hadron Collider, Sky Survey

 Search Engines – e.g., Google, Bing

 Web page click streams

 Sensor networks, Internet of Things

 Medical Research & Healthcare

(11)

Next Generation Analytics

 Analytics is moving from:
   Off-line ⇨ in-line embedded analytics
   Explaining what happened ⇨ predicting what will happen
 Operating on:
   Data at rest – stored someplace
   Data in motion – streaming
 Examples:
   Targeted web site advertising, real-time advertising
   Search engine results
   Identifying best time to purchase tickets
   Identifying cancer factors


(12)

Common Analytical Techniques

 Analytical Functions

 Logistic Regression

 Random Forests

 Naive Bayesian classifiers

 Clustering

 K-means clustering

 Canopy clustering

 LDA (Latent Dirichlet Allocation) for text analysis

(13)

Big Data Complications

 Too much data to process sequentially

 Process in parallel

 Too much data to fit on one server

 Distribute to multiple servers

 Too much data to fit in one computer room

 Distribute to multiple computer rooms

 Too much data to move across network

 Distribute queries to process in parallel on multiple servers in multiple computer rooms


(14)

Even More Complications

 Availability

 Replicate

 Not all applications need all of the data all of the time to provide useful responses

 Network Latency and Bandwidth

 Another reason to distribute the query

 Process query on replica closest to query source

 c (the speed of light): Not just a good idea; it’s the law

Database technologies and tools are being

(15)

Big Data Logical View

 Service Layer
   Analysis & Prediction
 Platform Layer
   Data Integration
   Data Semantic Intellectualization
 Data Layer
   Data Identification
   Data Collection
   Data Registry
   Data Repository

(16)

Big Data Silo

[Diagram: a single Big Data silo with three layers. Data Layer: Data Identification, Data Collection, Data Registry, Data Repository. Platform Layer (Data Mining & Metadata Extraction): Data Integration, Data Semantic Intellectualization. Service Layer: Analysis & Prediction.]

(17)

Integrating Big Data – Silos

[Diagram: three separate Big Data silos, each with its own Data Layer, Platform Layer (Data Mining & Metadata Extraction), and Analysis & Prediction Service Layer.]

(18)

Integrated Big Data Silos

[Diagram: three Data Layer and Platform Layer (Data Mining & Metadata Extraction) stacks sharing a single Analysis & Prediction Service Layer.]

(19)

Big Data Integrated Silos

[Diagram: three Data Layer and Platform Layer (Data Mining & Metadata Extraction) stacks sharing a single Analysis & Prediction Service Layer, with added components labeled Data Quality Management, Data Visualization, Workflow Management, and Service Support.]

(20)

Big Data Integrated Silos

[Diagram: three Data Layer and Platform Layer (Data Mining & Metadata Extraction) stacks sharing a single Analysis & Prediction Service Layer, with added components labeled Data Quality Management, Data Visualization, Workflow Management, Service Support Layer, Big Data Management, Data Curation, Security, and Privacy.]

(21)

Terminology

Workflow Management – scheduling queries, reports, etc.

Data Quality Management – minimize garbage

Data Visualization – displaying the results of querying a trillion data points

Data Curation – where does it come from, where does it go, provenance, lifetime

Security – define and enforce access policies

Privacy – prevent release of personal data


(22)

Terminology

Data Semantic Intellectualization – Semantic Data Integration based on technologies such as Ontology, Reasoning, and so on

(23)

Data Layer

 SQL Databases (SQL Classic)
 NoSQL Databases
 NewSQL Databases
 Spatial Data
 Video/Image/Sound
 Satellite/Radar/Seismic/Sonar (sensors)
 Streaming Data
 Etc…

(24)

Data Layer Characteristics

 Volume
 Storage Structure
   Row Store
   Column Store
   Document Store
   Key-Value Store
   Graph
   Streaming
 Metadata
   Can be queried
   Known beforehand (or not)
 Distribution
 Interface
   SQL
   JDBC
   ODBC (SQL/CLI)
   Custom
 Transactions
   ACID Transactions
   BASE Transactions
   No Transactions

(25)

Challenges

 Distribute queries across disparate data sources

 Integrate query results

 Security

 Privacy

 For each data source, need to understand

 Structure

 Types of queries needed & supported


(26)

Multi-Data Source Queries

 What is the correlation between mosquito borne diseases, precipitation, and temperatures?

 What correlations exist between the genomes of cancer patients and the effectiveness of cancer treatments?

 Based on recordings of vehicle sounds and previous maintenance histories, what preventative maintenance is needed?
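A query like the first one above has two steps: line up values from separate data sources on a shared key, then analyze the combined rows. The sketch below shows that shape on a toy scale. The function names, the month keys, and the numbers are all made up for illustration; real sources would sit behind different interfaces and be far larger.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def join_on_key(*sources: Dict[str, float]) -> List[Tuple[float, ...]]:
    """Keep only the keys present in every source and line up their values."""
    common = sorted(set.intersection(*(set(source) for source in sources)))
    return [tuple(source[key] for source in sources) for key in common]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up illustrative values keyed by month, standing in for two sources.
    disease_cases = {"2013-01": 10, "2013-02": 20, "2013-03": 30}
    rainfall_cm = {"2013-01": 1.0, "2013-02": 2.0, "2013-03": 3.0, "2013-04": 4.0}
    rows = join_on_key(disease_cases, rainfall_cm)
    cases, rain = zip(*rows)
    print(pearson(cases, rain))
```

The join discards keys missing from any source (here the fourth month), one of many integration decisions a real multi-source query would have to make explicitly.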

(27)

Seriously?

 Provide names, photos, and travel history for all individuals taller than 200 cm with blond hair below their shoulders and weighing between about 85 kg and 115 kg who flew between New York City and any destination in southeast Asia during any period between 2007 and 2010 when the weather in southern Brazil included rain > 2 cm/hour during the same month when the Prime Minister of Japan was on holiday and any southwest Asian nation experienced a slip-fault earthquake of magnitude 5.5 to 6.5.


(28)

Industry & Standards Efforts

 NoSQL Projects & Products

 NewSQL Projects & Products

 Standards – ISO/IEC JTC 1/SC 32

(29)

NoSQL Products/Projects

http://www.nosql-database.org/ lists 150 NoSQL Databases. Some examples:

 Cassandra

 CouchDB

 Hadoop & HBase

 MongoDB

 StupidDB

Etc.

(30)

NoSQL – Distributed Storage

 Distribute across multiple servers potentially in multiple computer rooms

 Replicate across multiple servers potentially in multiple computer rooms

 Details depend on products & eco-system

 Infrastructure to distribute queries

 Map/Reduce
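Distribution and replication can be pictured as one placement decision per key. The sketch below hashes a key to a primary server and copies it to the following servers in the list. It is a simplified scheme for illustration, not how any particular NoSQL product places data; the function name `placement` and the server names are hypothetical.

```python
import hashlib
from typing import List


def placement(key: str, servers: List[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` distinct servers for a key: a hashed primary plus its successors."""
    if not 1 <= replicas <= len(servers):
        raise ValueError("replicas must be between 1 and the number of servers")
    # A stable hash, so every client computes the same placement for a key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    primary = int.from_bytes(digest, "big") % len(servers)
    return [servers[(primary + i) % len(servers)] for i in range(replicas)]


if __name__ == "__main__":
    # Hypothetical server names spread over two computer rooms.
    servers = ["room1-a", "room1-b", "room2-a", "room2-b"]
    print(placement("customer:42", servers, replicas=2))
```

With this naive modulo scheme, adding or removing a server moves most keys, which is one reason the details depend on the product and its eco-system.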

(31)

NoSQL – Map Reduce

 Indexing and searching large data volumes
 Two Phases: Map and Reduce
 Map
   Extract sets of Key-Value pairs from underlying data
   Potentially in Parallel on multiple machines
 Reduce
   Merge and sort sets of Key-Value pairs
   Results may be useful for other searches
 Techniques differ across products
 Application developers, underlying software
 Must understand distribution scheme
 Today: Mostly application responsibility
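The two phases can be sketched on a single machine with a word count, the usual introductory example. The function names are illustrative; a real framework would run the map calls on many machines and shuffle the pairs between phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: extract (key, value) pairs from the underlying data."""
    return [(word.lower(), 1) for word in document.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: merge the values for each key and return the keys in sorted order."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(sorted(totals.items()))


def map_reduce(documents: Iterable[str]) -> Dict[str, int]:
    pairs: List[Tuple[str, int]] = []
    # Each map call is independent of the others, so in a distributed
    # setting these calls could run in parallel on separate machines.
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(map_reduce(["big data", "Big analytics"]))
```

Here the application code decides how work is split, which mirrors the point that distribution is mostly an application responsibility today.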

(32)

Automated Query Distribution

 Some products automate query distribution & execution

 Isolate application from underlying distribution

(33)

NoSQL – Retrieving Data

 Syntax Varies

 No set-based or declarative query language

 Procedural program languages such as Java, C, etc.

 Application specifies retrieval path

 No query optimizer

 Quick answer is important

 May not be a single “right” answer

(34)

NoSQL – Updating Data

 BASE transactions

 “Eventually correct”

 Lazy updates
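A toy model of lazy updates, written only to make "eventually correct" concrete: a write reaches the primary copy at once, while replicas see it only after a later propagation step. The class and method names are hypothetical, and real products add conflict handling, ordering, and failure recovery that this sketch ignores.

```python
from typing import Any, Dict, List, Optional, Tuple


class LazyReplicatedStore:
    """Primary copy updated immediately; replicas updated later (lazily)."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: Dict[str, Any] = {}
        self.replicas: List[Dict[str, Any]] = [{} for _ in range(replica_count)]
        self.pending: List[Tuple[str, Any]] = []  # writes not yet on the replicas

    def write(self, key: str, value: Any) -> None:
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, index: int, key: str) -> Optional[Any]:
        # May return a stale value (or None) until propagate() has run.
        return self.replicas[index].get(key)

    def propagate(self) -> None:
        """Apply pending writes to every replica in their original order."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


if __name__ == "__main__":
    store = LazyReplicatedStore()
    store.write("balance", 100)
    print(store.read_replica(0, "balance"))  # stale read before propagation
    store.propagate()
    print(store.read_replica(0, "balance"))  # replicas have caught up
```

Between `write` and `propagate`, a reader on a replica gets an answer quickly, but it may not be the latest one.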

(35)

NewSQL Projects & Products

 Scalable performance of NoSQL products

 Distributed storage – sharding

 Distributed queries

 In-memory techniques

Etc.

 Support for Online Transaction Processing – ACID transaction guarantees


(36)

What is ISO/IEC JTC 1/SC 32 Doing?

 WG2 – Metadata

 WG3 – Database Languages (SQL)

(37)

WG2 – Metadata

 ISO/IEC 11179 Metadata Registry

 Metadata Registry structure and procedures

 Semantics, data representation, & data descriptions

 ISO/IEC 19763 Metamodel Framework for Interoperability (MFI)

 Communicate, execute programs, or transfer data among various functional units

 Requires little or no knowledge of the unique characteristics of those units


(38)

WG3 – Database Languages

 Next version of the SQL standard (ISO/IEC 9075): publication expected in late 2015 or early 2016

 Row Pattern Recognition – already added

 Bi-Temporal support – already added

 Other possible additions

 JavaScript Object Notation (JSON) documents

 User Defined Aggregation Functions

 Dynamic Table Functions

(39)

Row Pattern Recognition

 Adds MATCH_RECOGNIZE clause to FROM

 Specifies pattern across a sequence of rows

 New Syntax:

 ONE ROW PER MATCH

 Returns single summary row for each match of the pattern

 Default

 ALL ROWS PER MATCH

 Returns one row for each row of each match


(40)

Row Pattern Recognition Example

SELECT M.Symbol,    /* ticker symbol */
       M.Matchno,   /* sequential match number */
       M.Tradeday,  /* day of trading */
       M.Price,     /* price on day of trading */
       M.Classy,    /* classifier */
       M.Startp,    /* starting price */
       M.Bottomp,   /* bottom price */
       M.Endp,      /* ending price */
       M.Avgp       /* average price */
FROM Ticker
  MATCH_RECOGNIZE (
    PARTITION BY Symbol
    ORDER BY Tradeday
    MEASURES MATCH_NUMBER () AS Matchno,
             CLASSIFIER () AS Classy,
             A.Price AS Startp,
             FINAL LAST (B.Price) AS Bottomp,
             FINAL LAST (C.Price) AS Endp,
             FINAL AVG (U.Price) AS Avgp
    ALL ROWS PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B+ C+)
    SUBSET U = (A, B, C)
    DEFINE /* A defaults to True, matches any row */
           B AS B.Price < PREV (B.Price),
           /* Assumed: the source text is cut off after B. The definition of C
              and the alias M are inferred from the pattern and the SELECT list. */
           C AS C.Price > PREV (C.Price)
  ) AS M
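The pattern (A B+ C+) in the query above describes a V shape: any starting row, one or more falling prices, then one or more rising prices. The sketch below emulates that match procedurally for a single symbol's prices, already ordered by trading day. It assumes C is defined as a price above the previous row's price, which the visible text does not show, and the name `find_v_shapes` is made up for illustration.

```python
from typing import List, Tuple


def find_v_shapes(prices: List[float]) -> List[Tuple[int, int, int]]:
    """Return (start, bottom, end) row indexes for each A B+ C+ match."""
    matches: List[Tuple[int, int, int]] = []
    n = len(prices)
    i = 0
    while i < n:
        j = i  # row i plays the role of A, which matches any row
        # B+: rows whose price is below the previous row's price
        while j + 1 < n and prices[j + 1] < prices[j]:
            j += 1
        bottom = j
        # C+ (assumed definition): rows whose price is above the previous row's price
        while j + 1 < n and prices[j + 1] > prices[j]:
            j += 1
        if bottom > i and j > bottom:  # at least one B row and one C row
            matches.append((i, bottom, j))
            i = j + 1  # AFTER MATCH SKIP PAST LAST ROW
        else:
            i += 1
    return matches


if __name__ == "__main__":
    print(find_v_shapes([10, 9, 8, 9, 10, 11, 7, 8]))
```

In MEASURES terms, `prices[start]` corresponds to Startp, `prices[bottom]` to Bottomp, and `prices[end]` to Endp. PARTITION BY is not emulated; the function would be called once per symbol.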

(41)

Other Possible Additions

 JSON – Just Some Other Notation 

 User-Defined Data Aggregation Functions

Select Col1, MyFunction(col2), SUM(col3) From Table1 Group by Col1;

 Dynamic/Polymorphic Table Functions

 Parameter is one arbitrary table

 Function result can be another arbitrary table

Insert into T1 Select * from MyInferenceEngine(ExternalDataStream.T2);

 Result table “shape” not known until run-time
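To show what the user-defined aggregate query above would look like in use, here is a sketch built on Python's `sqlite3` module, which already lets an application register its own aggregate. This is SQLite's mechanism, not the syntax under discussion for the SQL standard. The aggregate's behavior (largest value minus smallest) and the sample rows are made up for illustration.

```python
import sqlite3


class ValueRange:
    """Hypothetical user-defined aggregate: largest value minus smallest value."""

    def __init__(self):
        self.lowest = None
        self.highest = None

    def step(self, value):
        # Called once per row in the group; NULLs are ignored.
        if value is None:
            return
        self.lowest = value if self.lowest is None else min(self.lowest, value)
        self.highest = value if self.highest is None else max(self.highest, value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        if self.lowest is None:
            return None
        return self.highest - self.lowest


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("MyFunction", 1, ValueRange)
    conn.execute("CREATE TABLE Table1 (Col1 TEXT, Col2 INTEGER, Col3 INTEGER)")
    conn.executemany(
        "INSERT INTO Table1 VALUES (?, ?, ?)",
        [("a", 1, 10), ("a", 5, 20), ("b", 3, 7)],  # made-up sample rows
    )
    rows = conn.execute(
        "SELECT Col1, MyFunction(Col2), SUM(Col3) "
        "FROM Table1 GROUP BY Col1 ORDER BY Col1"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_demo())
```

The user-defined aggregate sits in the SELECT list next to the built-in SUM and follows the same GROUP BY, which is the point of the example query.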


(42)

WG4 – SQL/MM

Support for extended datatypes within SQL databases

 ISO/IEC 13249-2 Full Text – content-based retrieval.

 ISO/IEC 13249-5 Still Image – basic functions for image data management.

 ISO/IEC 13249-3 Spatial – functions to support geo-spatial applications

 ISO/IEC 13249-6 Data Mining – support for

(43)

Big Data Gaps

 Data source registry

 Location, contents, and semantics

 Ability to discover and utilize data

 Common interface to disparate data sources

 Better support for

 Queries against images, video, & sound

 Streaming data

 Security & privacy

 Integration of analytical functions


(44)

Summary

 Big Data has arrived

 Significant hype but practical applications emerging

 Hype tends to focus on data store capabilities

 It’s just another data store

 SQL Standards development is ongoing

 Next version in 2015/2016
   Temporal data
   Row Pattern Recognition
   Additional temporal support?
   Multi-dimensional Data Type?
   Additional support for big data?

(45)

SC 32 Opportunities

 WG 2 – Metadata for Big Data

 Many kinds of metadata (structural, semantic, catalog, integration, correlation)

 Automatic discovery of & reasoning over metadata

 WG 3 – Big Data Querying & Manipulation

 Tables, trees, etc. necessary, but wholly inadequate

 New query paradigms, analysis, transaction models

 WG 4 – Specialized Big Data Types

 Support for other forms of data (e.g., seismic)

 “Helper” functionality


(46)

Acknowledgements

 All errors, misunderstandings, misleading statements, and idiotic comments are mine and mine alone.

 Keith Hare (JTC 1/SC 32/WG 3 Convenor)

 JTC 1/SC 32 Study Group on Next-Generation Analytics and Big Data

 Jörn Bartels, 정성재 (Jung, Sung Jae), Keith

(47)

Questions?


(48)

Big Data Analysis Challenges

A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span generation of the data, preparation for analysis, and policy-related challenges in its sharing and use, including the following:

 Dealing with highly distributed data sources,
 Tracking data provenance, from data generation through data preparation,
 Validating data,
 Coping with sampling biases and heterogeneity,
 Working with different data formats and structures,
 Developing algorithms that exploit parallel and distributed architectures,
 Ensuring data integrity,
 Ensuring data security,
 Enabling data discovery and integration,
 Enabling data sharing,
 Developing methods for visualizing massive data,
 Developing scalable and incremental algorithms, and
 Coping with the need for real-time analysis and decision-making.

(49)

References

 “Big Data”, Viktor Mayer-Schönberger & Kenneth Cukier, Houghton Mifflin Harcourt, New York, NY, 2013.

 “Big Data Now: 2012 Edition”, http://oreilly.com/data/radarreports/big-data-now-2012.csp

 NIST Big Data Working Group, http://bigdatawg.nist.gov/home.php


(50)

References

 “Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison,” Kristóf Kovács, viewed 2013-03-16, http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis?imm_mid=0a2ec6&cmp=em-velocity-newsletters-vlny-cfp-20130307-direct

(51)

References

 National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press. http://www.nap.edu/catalog.php?record_id=18374


(52)

“What Does 'Big Data' Mean?”

 Michael Stonebraker, Communications of the ACM Blogs
   Part 1: http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
   Part 2: http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-data-mean-part-2/fulltext
   Part 3: http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-data-mean-part-3/fulltext
   Part 4: http://cacm.acm.org/blogs/blog-cacm/162095-what-does-big-data-mean-part-4/fulltext
