Olivier Caudron. Big Data and NoSQL


(1)

Olivier Caudron

Big Data and NoSQL

(2)

"Big" Data?

"Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"

(3)

The 3 V's of Big Data (or more…)

Volume

Velocity

Variety

(4)

Why Big Data?

"Monetizing data" is what the hype is all about: some "big data" monetization stories that have gone viral evidently make many people envious

For many, Big Data is nothing more than finding as many needles (preferably golden) as possible in the huge haystack of Internet data

http://datascienceseries.com/assets/blog/GREENPLUM_Information_is_the_new_oil-LR.pdf

"Big Data is not about the amounts of data. It's about the cool stuff you can do with Big Data"

(5)

Taxonomy of Big Data

There is a lot of debate on the exact domain of application of "Big Data"

– First off: Big Data is NOT a conceptual revolution!!!

– The most practical definition of "Big Data" is a negative one: any problem that is not tractable through "traditional" means because of its size and/or complexity and/or velocity will be considered a "Big Data" problem

– … However it's not all that simple…

"Big Data" was popularized by some big players on the Internet; however, the reality is much less clear cut:

– Facebook and Twitter use MySQL mostly (and some Cassandra)

– Wikipedia and YouTube use MySQL (and little or no "NoSQL")

– Amazon is on Oracle DB

(6)

Taxonomy of Big Data

"Big Data" solutions can be divided into 2 categories:

Big Data "processing" solutions are mostly offline (batch, non-transactional) solutions for processing data and can be seen as an evolution of OLAP

Example: Apache Hadoop (and its ecosystem)

Big Data "database" solutions come mostly under the "NoSQL" terminology ("No" SQL or "Not Only" SQL) and can be seen as an evolution of OLTP

Examples: MongoDB, CouchBase, Cassandra, Big Table, Redis, Neo4J…

(7)

Apache Hadoop in a Nutshell

Low-level set of libraries designed for parallel processing of large data sets

2 main components:

– Hadoop Distributed File System (file system designed for horizontal scaling and replication on a cluster of commodity servers)

– Hadoop Map/Reduce (utilities for analyzing data using the Map/Reduce paradigm)

Open-source, built by the community under the Apache Software Foundation and distributed under the Apache License 2.0
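
The Map/Reduce paradigm named above can be illustrated with a word count. This is a single-process sketch in plain Python, not Hadoop API code; the function names (map_phase, shuffle, reduce_phase) and the sample input are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input record.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, which a framework
    # would do between the map and reduce steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input.
counts = word_count(["big data is big", "data is data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines; the sketch only shows how the work is split into the two phases.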

(8)

Apache Hadoop in a Nutshell

HDFS is designed to handle immutable files (once written, they don't change) and is not suitable for just any FS use

Map/Reduce requires heavy programmer involvement

Has generated a host of solutions (of diverse levels of maturity) that are meant to simplify its use and/or build functionality on it

– Pig, Hive, Cascading: higher-level map/reduce frameworks

– Yarn: Hadoop resource management

– Elasticsearch, Kibana: search and analytics engine

– Lingual: SQL layer on Cascading

– And more…

InterSystems is currently integrating Caché with Hadoop

– Real-time copy of Caché data to Hadoop for offline processing

– In development (alpha)

(9)

Types of NoSQL Databases

[Figure: types of NoSQL databases; axis labels: Data Complexity, Velocity vs Data Size]

(10)

Commonalities of Volume-Oriented NoSQL Databases

There are too many different NoSQL solutions out there to characterize them in general terms, but the following usually applies to all paradigms except graph-oriented:

Typically non-ACID transactions ("BASE": Basically Available, Soft state, Eventually consistent)

Always denormalized: no referential integrity means the same data will probably be present in several entities and won't be synchronized by the system

Often built for horizontal scaling (e.g. sharding)

Typically optimized for inserts and retrieval, not meant for full CRUD

Not typically meant for classical applications (client/server, multi-tier, web applications)
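
Horizontal scaling by sharding, mentioned in the list above, can be sketched as routing each key to one of several independent stores. The following is a toy in-memory illustration under stated assumptions: the names shard_for and ShardedStore, the use of SHA-256 for routing, and the sample keys are all hypothetical and do not describe any particular product.

```python
import hashlib

def shard_for(key, shard_count):
    # Use a stable hash (not Python's per-process randomized hash())
    # so the same key always maps to the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

class ShardedStore:
    """Each shard is a plain dict standing in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

# Hypothetical sample data.
store = ShardedStore(4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})
print([len(shard) for shard in store.shards])
```

Note that modulo-based routing like this moves most keys when the shard count changes; real systems often use other schemes to limit that, which the sketch does not attempt.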

(11)

Key/Value Databases

The Key is the only retrieval parameter

– In some products, several data types can be supported for keys, including collections (lists, maps, sorted sets…)

– Users often structure the key in a way that allows for multi-parameter record search – quite a dirty trick, and this must be carefully planned in advance

The Value can be anything:

– The database doesn't have to understand the contents

– Contents can be completely different for each record

e.g. Redis, Membase, LevelDB, Aerospike, Tokyo Cabinet, Project Voldemort, Hyperdex…
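
The structured-key trick described above can be sketched as follows, assuming a store that keeps its keys in sorted order (not every key/value product does). The OrderedKV class, the order_key layout and the sample records are hypothetical; the point is that the search parameters must be baked into the key, in a fixed order, before any data is written.

```python
from bisect import bisect_left, insort

class OrderedKV:
    """Toy ordered key/value store: values are opaque, keys are sorted strings."""

    def __init__(self):
        self._keys = []
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan_prefix(self, prefix):
        # Jump to the first key >= prefix, then walk while the prefix matches.
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._data[self._keys[i]]
            i += 1

def order_key(country, city, order_id):
    # Hypothetical key layout: most significant search parameter first.
    return f"order:{country}:{city}:{order_id:06d}"

kv = OrderedKV()
kv.put(order_key("BE", "Brussels", 1), b"opaque-1")
kv.put(order_key("BE", "Ghent", 2), b"opaque-2")
kv.put(order_key("FR", "Paris", 3), b"opaque-3")

belgian = [key for key, _ in kv.scan_prefix("order:BE:")]
print(belgian)
```

With this layout a search by country, or by country and city, is a cheap prefix scan, but a search by city alone is not, which is why the key structure has to be planned in advance.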

(12)

Key/Value Databases

Pros:

– Ultrafast on inserts and key-based retrieval in large volumes

– Horizontal scaling possible (?)

Cons:

– Messy paradigm

– No standardization whatsoever, no SQL support (usually)

– Popular solutions (Redis) actually in-memory with clunky persistence options

– Must use tricks for multi-parameter queries (typically, use special structure for keys)

– Any non-key query is unrealistic (full table scan with document interpretation for each record required)

– Key size often limited (but key contents essential for queries!)

(13)

Document-oriented Databases

Similar to Key/Value stores except that the database understands the data structure

– No need to tinker with keys to optimize searches on diverse items

Typically based on some variant of JSON (e.g. BSON: "Binary" JSON)

Typically allows extra indexes to be defined (beyond the key) to speed up non-key-based queries
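
The effect of an extra index beyond the key can be sketched with a toy document store. This is an illustration only, assuming documents are Python dicts with scalar field values; the DocumentStore class and its method names are hypothetical and are not the API of MongoDB, CouchBase or any other product.

```python
from collections import defaultdict

class DocumentStore:
    def __init__(self):
        self.docs = {}      # primary key -> document (a dict)
        self.indexes = {}   # field name -> {field value -> set of primary keys}

    def create_index(self, field):
        index = defaultdict(set)
        for key, doc in self.docs.items():
            if field in doc:
                index[doc[field]].add(key)
        self.indexes[field] = index

    def insert(self, key, doc):
        # Simplification: assumes the key is new, so no stale index
        # entries need to be removed.
        self.docs[key] = doc
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].add(key)

    def find(self, field, value):
        if field in self.indexes:
            # Indexed path: direct lookup, no scan.
            keys = self.indexes[field].get(value, set())
            return [self.docs[k] for k in sorted(keys)]
        # Unindexed path: the store must inspect every document.
        return [self.docs[k] for k in sorted(self.docs)
                if self.docs[k].get(field) == value]

docs = DocumentStore()
docs.create_index("age")
docs.insert(1, {"name": "Olivier", "age": 47})
docs.insert(2, {"name": "Danny"})          # documents need not share a shape
docs.insert(3, {"name": "Alain", "age": 53})

print(docs.find("age", 53))     # served by the index
print(docs.find("name", "Danny"))  # full scan, but still possible
```

Because the store understands the document structure, both queries work without encoding anything into the key; the index only changes how much data has to be read.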

(14)

Document-oriented Databases

Pros:

– Very popular paradigm at the moment (MongoDB, CouchBase)

– Good match with JSON, quite popular at the moment

– Handles a reasonable level of complexity

– Handles reasonably large amounts of data

– Typically provides horizontal scaling out of the box

Cons:

– (Typically) not optimized for updates and deletes

– No relationship between entities, no normalization, no referential integrity

– Not really standardized, but is the most converging of all NoSQL DBs

– Typically relies on eventual consistency – no ACID transactions

(15)

Column-oriented Databases

e.g. Google BigTable, Apache Cassandra, HBase, Accumulo…

Classical relational model

Id  Name     Age  WorksOn
1   Olivier  47   Caché, Ensemble
2   Danny         Caché, DeepSee, iKnow
3   Alain    53   Caché
4   Luc

Column-oriented storage of the same data:

Id  Name
1   Olivier
2   Danny
3   Alain
4   Luc

Id  Age
1   47
3   53

Id  WorksOn
1   Caché
1   Ensemble
2   Caché
2   DeepSee
2   iKnow
3   Caché
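
The decomposition shown in the tables above can be reproduced in a few lines. This is a sketch of the idea only, assuming rows are Python dicts and multi-valued fields are lists; the to_columns helper is a hypothetical name, and real column stores use far more compact encodings.

```python
# The People rows from the slide; missing values are simply absent.
rows = [
    {"Id": 1, "Name": "Olivier", "Age": 47, "WorksOn": ["Caché", "Ensemble"]},
    {"Id": 2, "Name": "Danny", "WorksOn": ["Caché", "DeepSee", "iKnow"]},
    {"Id": 3, "Name": "Alain", "Age": 53, "WorksOn": ["Caché"]},
    {"Id": 4, "Name": "Luc"},
]

def to_columns(rows, key="Id"):
    # Build one (Id, value) list per field. Absent values take no space,
    # and a multi-valued field yields one entry per value.
    columns = {}
    for row in rows:
        for field, value in row.items():
            if field == key:
                continue
            values = value if isinstance(value, list) else [value]
            for v in values:
                columns.setdefault(field, []).append((row[key], v))
    return columns

columns = to_columns(rows)
for field, entries in columns.items():
    print(field, entries)
```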

(16)

Column-oriented Databases

Select count(*) from People where Age>50

Select Name, WorksOn from People where Age<50

"Lockstep" BigQuery Algorithm

[Figure: the per-column tables Id/Name, Id/Age and Id/WorksOn]

See http://cdn.parleys.com/p/529c6b62e4b039ad2298ca1b/529c5678140df_1385976886785.pdf
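
To show why a column layout suits the two queries above, here is a plain-scan sketch over the column data from the slide. It is not the "lockstep" algorithm from the referenced presentation, only an illustration that each query touches just the columns it names; the variable names are assumptions.

```python
# Column data as on the slide: one (Id, value) list per column.
name_col = [(1, "Olivier"), (2, "Danny"), (3, "Alain"), (4, "Luc")]
age_col = [(1, 47), (3, 53)]
works_on_col = [(1, "Caché"), (1, "Ensemble"), (2, "Caché"),
                (2, "DeepSee"), (2, "iKnow"), (3, "Caché")]

# Select count(*) from People where Age>50
# Only the Age column is read; Name and WorksOn are never touched.
count_over_50 = sum(1 for _id, age in age_col if age > 50)

# Select Name, WorksOn from People where Age<50
# Filter on the Age column first, then pick matching Ids from the others.
ids_under_50 = {person_id for person_id, age in age_col if age < 50}
names = dict(name_col)
name_works_on = [(names[person_id], product)
                 for person_id, product in works_on_col
                 if person_id in ids_under_50]

print(count_over_50)
print(name_works_on)
```

People with no Age value (Danny, Luc) match neither predicate, which mirrors how SQL treats NULL in a comparison.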

(17)

Column-oriented Databases

Columns can be distributed on separate servers, distributing the load automatically

Sharding

[Figure: the Id/Name, Id/Age and Id/WorksOn column tables, each placed on a separate server]

(18)

Column-oriented Databases

Typically, resultsets for big queries are "reconstructed" by higher-level servers

Sharding and Big Data Aggregation

[Figure: server hierarchy; labels: Storage Layer (e.g. Google FS), Leaf Servers, Intermediate Servers]
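
The way higher-level servers "reconstruct" a result from shards can be sketched as a tree of partial aggregates. This is a single-process illustration under stated assumptions: the function names, the fan_in parameter and the sample age shards are hypothetical, and a count is used because partial counts can simply be summed.

```python
def leaf_count(shard, predicate):
    # A leaf server scans only its own shard and returns a partial count.
    return sum(1 for value in shard if predicate(value))

def intermediate_merge(partials):
    # An intermediate server only combines partial results; it reads no data.
    return sum(partials)

def tree_count(shards, predicate, fan_in=2):
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    partials = [leaf_count(shard, predicate) for shard in shards]
    # Merge level by level until a single result reaches the root.
    while len(partials) > 1:
        partials = [intermediate_merge(partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0] if partials else 0

# Hypothetical Age column split across four leaf servers.
age_shards = [[47, 53], [61, 22, 35], [58], [19, 70, 44]]
total_over_50 = tree_count(age_shards, lambda age: age > 50)
print(total_over_50)
```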

(19)

Column-oriented Databases

Pros:

– Ultrafast queries on huge amounts of data

– No indexing required (each column is its own index)

Cons:

– Actually less efficient (than relational) for small databases

– Requires a significant infrastructure in any relevant scenario

– No referential integrity – limited complexity in structure AND queries

– Not designed for updates (and deletes?)

– Transactions?

(20)

Graph-oriented Databases

e.g. Neo4J, OrientDB, Allegrograph, Dex…

[Figure: property graph of the Simpsons family. Nodes are people carrying properties such as Lastname, Firstname, Middlename, Maidenname, Nickname, AKA and Gender, for example: Lastname: Simpson, Firstname: Abraham; Lastname: Simpson, Firstname: Marjorie, Maidenname: Bouvier, Nickname: Marge; Lastname: Simpson, Firstname: Homer, Middlename: Jay; Lastname: Simpson, Firstname: Bartholomew, Midname: Jojo, AKA: Bart; Lastname: Simpson, Firstname: Lisa, Gender: F; Lastname: Simpson, Firstname: Margaret, Nickname: Maggie; Lastname: Van Houten, Firstname: Milhouse, Middlename: Mussolini; Lastname: Burns, Firstname: Montgomery, AKA: Monty; plus members of the Bouvier family. Edges carry a Rel property (Spouse, Daughter, Son, Sister, Brother, Friend, Employee, Victim); one Spouse edge also has Since: 4/19/1987.]
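
The property-graph model in the figure above (nodes and relationships that both carry properties) can be sketched with plain Python structures. This is not the API of Neo4J or any other graph database; the node identifiers, the edge directions and the choice of which people are connected are assumptions, since the figure's layout is not recoverable from the text.

```python
from collections import deque

# Nodes: id -> properties (property names as in the figure).
nodes = {
    "homer": {"Firstname": "Homer", "Lastname": "Simpson"},
    "marge": {"Firstname": "Marjorie", "Lastname": "Simpson", "Nickname": "Marge"},
    "bart": {"Firstname": "Bartholomew", "Lastname": "Simpson", "AKA": "Bart"},
    "lisa": {"Firstname": "Lisa", "Lastname": "Simpson", "Gender": "F"},
    "milhouse": {"Firstname": "Milhouse", "Lastname": "Van Houten"},
}

# Edges: (source, target, properties). Directions are assumed.
edges = [
    ("homer", "marge", {"Rel": "Spouse"}),
    ("homer", "bart", {"Rel": "Son"}),
    ("homer", "lisa", {"Rel": "Daughter"}),
    ("bart", "milhouse", {"Rel": "Friend"}),
]

def neighbors(node_id, rel=None):
    # Follow outgoing edges, optionally filtered on the Rel property.
    return [dst for src, dst, props in edges
            if src == node_id and (rel is None or props["Rel"] == rel)]

def reachable(start, max_hops):
    # Breadth-first traversal limited to max_hops relationships.
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen

print(neighbors("homer", rel="Son"))
print(sorted(reachable("homer", 2)))
```

Queries of the "friends of my children" kind are traversals like this one, which is the workload graph databases are built around.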

(21)

Pros & Cons

Pros:

– According to their supporters, more "natural" way of handling structured data

– Typically ACID transactions

– Capable of handling reasonable volumes, horizontal scaling typically supported, indexing possible

– Support a high level of data complexity with good mining tools

– Contrary to other NoSQL solutions, can (possibly) be fit for general, non-specific use

Cons:

– Still unproven paradigm in all but specialized cases

– Complexity might be too high for simple problems

– Maintenance of the data model might be complicated

– Not yet popular, not yet standardized

(22)

What about Object-Oriented Databases?

THE classical NoSQL database paradigm!

Still a very valid paradigm but…

Object-oriented databases have had their chance and missed it

– Poor overall performance

– Competition from ORM tools (Hibernate, EclipseLink, JPA…) with equivalent ease of use and better performance of the underlying relational database

– Deserved to generate hype but failed to do so

The only exception today is Caché – a very powerful object-oriented database system, the only OO DB to really pass the test of real-life use with competitive performance

(23)

The original NoSQL database! (remember globals? = ultrafast multidimensional key/value store!)

Relational database that easily competes with the best of them

The ONLY object-oriented database to pass the test of real-life projects

… All in one consistent package

… With fully ACID transactions

… With extensive enterprise tooling (monitoring, backup, task scheduling, horizontal scaling, replication, etc.)

… With outstanding support from InterSystems

… And added value technologies (DeepSee, iKnow, Ensemble)

Caché pushes ACID transactions to the extreme
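
The description of globals above as a multidimensional key/value store can be pictured as a sparse array in which each value is addressed by a list of subscripts. The following is a toy Python model of that idea only; the Global class and its methods are hypothetical and are not the Caché or Globals API.

```python
class Global:
    """Toy sparse multidimensional array: each node is addressed by a tuple of subscripts."""

    def __init__(self):
        self._nodes = {}

    def set(self, subscripts, value):
        self._nodes[tuple(subscripts)] = value

    def get(self, subscripts, default=None):
        return self._nodes.get(tuple(subscripts), default)

    def children(self, subscripts=()):
        # List the distinct next-level subscripts below the given node.
        prefix = tuple(subscripts)
        depth = len(prefix)
        found = {key[depth] for key in self._nodes
                 if len(key) > depth and key[:depth] == prefix}
        return sorted(found, key=str)

# Sample data reusing the People rows from the column-oriented slides.
person = Global()
person.set((1, "Name"), "Olivier")
person.set((1, "Age"), 47)
person.set((3, "Name"), "Alain")
person.set((3, "Age"), 53)

print(person.children())       # top-level subscripts
print(person.children((1,)))   # subscripts below node 1
print(person.get((3, "Age")))
```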

(24)

Research Projects

InterSystems is working on several research projects related to Big Data and NoSQL

Apache Hadoop integration

Document-based database implemented in Caché (Morpheus project)

Graph-oriented approach via the Globals client interface (currently Node.js) – GitHub project: https://github.com/GlobalsDB/Contributions/tree/master/NodeJS/GlobalsGraphDB

Your feedback is important to determine the future directions of our technology

(25)

Conclusions

There is no real universal "game changer" in new database architectures, only scoped solutions to specific problems

Only graph-oriented databases can plausibly aim for universality – but they have yet to prove themselves in general

When considering a NoSQL solution, one must consider the whole picture including known limitations, e.g. ACID transactions, CRUD, in-memory, etc.

Having the same data in different data stores (or offline copies, as for Hadoop Map/Reduce) to solve your problem(s) is no trivial decision: doubling 100s of TB of data is hardly inconsequential

Caché simplifies these issues – and it pushes the boundaries of transactional processing of high volumes far enough to be the right solution in most cases

(26)

Caché Big Data Success Stories
