Olivier Caudron
Big Data and NoSQL"Big" Data?
"Big data
is the term for a collection of data sets so large
and
complex
that it becomes difficult to process using on-hand
database management tools
or traditional data processing
applications"
The 3 V's of Big Data (or more…
)
V
olume
V
elocity
V
ariety
Why Big Data?
•
"Monetizing data" is what the hype is all about:
some "big data" monetization stories that have
gone viral evidently make many people envious
•
For many, Big Data is nothing more than
finding
as many needles (preferably golden) as
possible in the huge haystack of Internet data
http://datascienceseries.com/assets/blog/GREENPLUM_Information_is_the_new_oil-LR.pdf
"Big Data is not about the
amounts of data. It's about
the cool stuff you can do
with Big Data"
Taxonomy of Big Data
•
There is a lot of
debate
on the exact
domain of application
of
"Big Data"
– First off: Big Data is NOT a conceptual revolution!!!
– The most practical definition of "Big Data" is a negative one: any problem that is not tractable through "traditional" means because of its size and/or complexity and/or velocity will be considered a "Big Data" problem
– … However it's not all that simple…
•
"Big Data" was
popularized by some big players on the Internet
,
however,
the reality is much less clear cut
:
– Facebook and Twitter use MySQL mostly (and some Cassandra) – Wikipedia and YouTube use MySQL (and little or no "NoSQL") – Amazon is on Oracle DB
Taxonomy of Big Data
•
"Big Data" solutions can be divided into 2 categories:
Big Data "processing"
solutions are mostly offline (batch,
non-transactional) solutions for processing data and can be seen as an
evolution of OLAP
Example:
Apache Hadoop (and its ecosystem)
Big Data "database"
solutions that come mostly under the "NoSQL"
terminology ("No" SQL or "Not Only" SQL) and can be seen as an
evolution of OLTP
Examples:
MongoDB, CouchBase, Cassandra,
Big Table, Redis, Neo4J…
Apache Hadoop in a Nutshell
•
Low-level
set of libraries
designed for parallel processing
of
large data sets
•
2 main components:
– Hadoop Distributed File System (file system designed for horizontal scaling and replication on a cluster of commodity servers)
– Hadoop Map/Reduce (utilities for analyzing data using the Map/Reduce paradigm)
•
Open-source, built by the community under the Apache Software
Foundation and distributed under the Apache License 2.0
Apache Hadoop in a Nutshell
•
HDFS is designed to handle
immutable files
(once written, they
don't change) and is not suitable for just any FS use
•
Map/Reduce requires
heavy programmer involvement
•
Has generated a host of solutions (of diverse levels of maturity)
that are meant to simplify its use and/or build functionality on it
– Pig, Hive, Cascading: higher-level map/reduce frameworks– Yarn: Hadoop resource management
– Elasticsearch, Kibana: search and analytics engine – Lingual: SQL layer on Cascading
– And more…
•
InterSystems is currently integrating Caché with Hadoop
– Real-time copy of Caché data to HADOOP for offline processing – In development (alpha)
Types of NoSQL Databases
Data Complexity
Velocity
vs
Data
Size
Commonalities of Volume-Oriented NoSQL Databases
•
There are too many different NoSQL solutions out there to
characterize them in general terms, but the following usually
applies to all paradigms except graph-oriented:
•
Typically non-ACID transactions
("BASE": Basically Available,
Soft state, Eventually consistent)
•
Always denormalized: no referential integrity means the same
data will probably be present in several entities and won't be
synchronized by the system
•
Often built for horizontal scaling
(e.g. sharding)
•
Typically optimized for inserts and retrieval, not meant for full
CRUD
•
Not typically meant for classical applications (client/server,
multi-tier, web applications)
Key/Value Databases
•
The Key
is the only retrieval parameter
– In some products, several data types can be supported for keys, including collections (lists, maps, sorted sets…)
– Users often structure the key in a way that allows for multi-parameter record search – quite a dirty trick, and this must be carefully planned in advance
•
The Value
can be anything:
– The database doesn't have to understand the contents
– Contents can be completely different for each record
e.g. Redis, Membase, LevelDB, Aerospike, Tokyo
Cabinet, Project Voldemort, Hyperdex…
Key/Value Databases
•
Pros:
– Ultrafast on inserts and key-based retrieval in large volumes – Horizontal scaling possible (?)
•
Cons:
– Messy paradigm
– No standardization whatsoever, no SQL support (usually)
– Popular solutions (Redis) actually in-memory with clunky persistence options
– Must use tricks for multi-parameter queries (typically, use special structure for keys)
– Any non-key query is unrealistic (full table scan with document interpretation for each record required)
– Key size often limited (but key contents essential for queries!)
Document-oriented Databases
•
Similar to Key/Value
stores except that
the database understands
the data structure
– No need to tinker with keys to optimize searches on diverse items
•
Typically based on some variant of JSON (e.g. BSON: "Binary"
JSON)
•
Typically allows
extra indexes
to be defined (beyond the key) to
speed up non-key-based queries
Document-oriented Databases
•
Pros:
– Very popular paradigm at the moment (MongoDB, CouchBase) – Good match with JSON, quite popular at the moment
– Handles a reasonable level of complexity – Handles reasonably large amounts of data
– Typically provides horizontal scaling out of the box
•
Cons:
– (Typically) not optimized for updates and deletes
– No relationship between entities, no normalization, no referential integrity
– Not really standardized, but is the most converging of all NoSQL DBs – Typically relies on eventual consistency – no ACID transactions
e.g. Google BigTable, Apache Cassandra, Hbase,
Accumulo…
Column-oriented Databases
Id Name Age WorksOn
1 Olivier 47 Caché, Ensemble
2 Danny Caché, DeepSee, iKnow
3 Alain 53 Caché 4 Luc Id Name 1 Olivier 2 Danny 3 Alain 4 Luc Id Age 1 47 3 53 Id WorksOn 1 Caché 1 Ensemble 2 Caché 2 DeepSee 2 iKnow 3 Caché
Classical relational model
Column-oriented Databases
•
Select count(*) from People where Age>50
• Select Name, WorksOn from People where Age<50
"Lockstep" BigQuery Algorithm
Id Name 1 Olivier 2 Danny 3 Alain 4 Luc Id Age 1 47 3 53 Id WorksOn 1 Caché 1 Ensemble 2 Caché 2 DeepSee 2 iKnow 3 Caché See http://cdn.parleys.com/p/529c6b62e4b039ad2298ca1b/529c5678140df_1385976886785.pdf
Column-oriented Databases
•
Columns can be distributed on separate servers, distributing the
load automatically
Sharding
Separate Servers Id Name 1 Olivier 2 Danny 3 Alain 4 Luc Id Age 1 47 3 53 Id WorksOn 1 Caché 1 Ensemble 2 Caché 2 DeepSee 2 iKnow 3 CachéColumn-oriented Databases
•
Typically, resultsets for big queries are "reconstructed" by
higher-level servers
Sharding and Big Data Aggregation
Storage Layer (e.g. Google FS)
Leaf Servers
Intermediate Servers
Column-oriented Databases
•
Pros:
– Ultrafast queries on huge amounts of data
– No indexing required (each column is its own index)
•
Cons:
– Actually less efficient (than relational) for small databases – Requires a significant infrastructure in any relevant scenario
– No referential integrity – limited complexity in structure AND queries – Not designed for updates (and deletes?)
– Transactions?
e.g. Neo4J, OrientDB, Allegrograph, Dex…
Graph-oriented Databases
Rel: Spouse Rel: Spouse
Rel: Spouse Since: 4/19/1987 Rel: Daughter Rel: Daughter Rel: Son Rel: Son
Rel: Daughter Rel: Daughter
Rel: Son
Rel: Son
Rel: Daughter
Rel: Daughter
Rel: Friend Rel: Employee
Rel: Victim Rel: Sister Rel: Sister Rel: Brother Rel: Brother Lastname: Bouvier Firstname: Clancy Maidenname: Gurney Lastname: Bouvier
Firstname: Clancy Firstname: Mona
Lastname: Simpson Firstname: Abraham Maidenname: Bouvier Lastname: Simpson Firstname: Marjorie Nickname: Marge Lastname: Simpson Firstname: Homer Middlename: Jay Lastname: Simpson Firstname: Bartholomew Midname: Jojo AKA: Bart Lastname: Simpson Firstname: Lisa Gender: F Lastname: Simpson Firstname: Margaret Nickname: Maggie
Lastname: Van Houten
Firstname: Milhouse Middlename: Mussolini
Lastname: Burns
Firstname: Montgomery AKA: Monty
•
Pros:
– According to their supporters, more "natural" way of handling structured data
– Typically ACID transactions
– Capable of handling reasonable volumes, horizontal scaling typically supported, indexing possible
– Support a high level of data complexity with good mining tools
– Contrary to other NoSQL solutions, can (possibly) be fit for general, non-specific use
•
Cons:
– Still unproven paradigm in all but specialized cases – Complexity might be too high for simple problems – Maintenance of the data model might be complicated – Not yet popular, not yet standardized
Pros & Cons
What about Object-Oriented Databases?
THE classical NoSQL database paradigm!
•
Still a very valid paradigm but…
•
Object-oriented databases have had their chance and missed it
– Poor overall performance– Competition of ORM tools (Hibernate, EclipseLink, JPA…) with equivalent ease of use and better performance of underlying relational database
– Deserved to generate hype but failed to do it