NOSQL DATABASE SYSTEMS


Academic year: 2021


NoSQL Database Systems

Categorization

– Data Model

– Storage Layout

– Query Models

– Solution Architectures

Data Modeling

Application Development

Scalability, Availability and Consistency

– Partitioning, Replication

– Consistency Models and Transactions

Select the Right DBMS

– Performance and Benchmarks

– Polyglot Persistence


(3)

NoSQL Database Systems

Considered Categories of NoSQL Database Systems

– Key-Value Database Systems

– Document Database Systems

– Column Family Database Systems

(4)

Key-Value Database Systems

Data Model

– Key-value pairs

• Unique keys

• Values

– arbitrary type (serialized byte arrays), or

– strings, lists, sets, ordered sets (of strings)

Schema-free

Storage Layout

– Hash-Maps, B-Trees, …

Indexes

– Primary indexes (Hash, B-tree) on key

– Secondary indexes on values?

Big Data Technologies: NoSQL DBMS - SoSe 2015 4

(Figure: a table of key-value pairs)

(5)

Key-Value Database Systems (Cont.)

Query Models

– Simple API

• set (key, value)

• value = get (key)

• delete (key)

• Operations on values?

– More complex operations

Language Bindings

→ MapReduce (later in this chapter)
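The simple API above (set, get, delete) can be sketched as a small in-memory store. This is only an illustration: the class `SimpleKeyValueStore` and its method signatures are hypothetical and do not correspond to the API of Redis, Riak, or any other product named in this chapter. Values are treated as opaque byte arrays, as in the data model above.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory store; names are illustrative only.
class SimpleKeyValueStore {
    // The hash map acts as the primary index on the unique key.
    private final Map<String, byte[]> data = new HashMap<>();

    // set(key, value): insert or overwrite. The value is an opaque byte array.
    void set(String key, byte[] value) {
        data.put(key, value.clone());
    }

    // value = get(key): empty result if the key does not exist.
    Optional<byte[]> get(String key) {
        byte[] value = data.get(key);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    // delete(key): returns true if a pair was actually removed.
    boolean delete(String key) {
        return data.remove(key) != null;
    }

    public static void main(String[] args) {
        SimpleKeyValueStore store = new SimpleKeyValueStore();
        store.set("session:42", "cart=football boot".getBytes(StandardCharsets.UTF_8));
        String value = store.get("session:42")
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElse("<missing>");
        System.out.println(value);
        System.out.println(store.delete("session:42"));
        System.out.println(store.get("session:42").isPresent());
    }
}
```

Because the store never interprets the value, any "operation on values" (for example, changing one field of a serialized object) requires a get, a change in the application, and a set.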

Systems

– Oracle Berkeley DB (mid-90s)

– Caches (Ehcache, Memcached)

– Amazon Dynamo/S3, Redis, Riak, Voldemort, …


h_da Prof. Dr. Uta Störl

(6)

{
  "id": 1,
  "name": "football boot",
  "price": 199,
  "stock": { "warehouse": 120, "retail": 10 }
}

Document Store Database Systems

Data Model

– Key-value pairs with “documents” as value

– Document format: JSON or BSON (Binary JSON)

• Loosely structured name(key)-value pairs

• Hierarchical

Additionally, MongoDB uses collections

• “arbitrary” documents could be grouped together

• documents in a collection should be similar to facilitate effective indexing

Storage Layout

– B-Trees to store the documents

– MongoDB: Documents in a single collection are stored together


(7)

Document Store Database Systems (Cont.)

Indexes

– Primary indexes on documentId (key)

– Secondary indexes on JSON-names

• Default or user defined

• Composite indexes may be supported
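To make the difference between the primary index on the document id and a secondary index on a JSON name concrete, here is a minimal sketch. It assumes documents are held as nested `Map` objects and that exactly one top-level name is indexed; the class `DocumentCollection` and its methods are hypothetical and are not the API of MongoDB, CouchDB, or Couchbase.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical collection of schema-free documents with one secondary index.
class DocumentCollection {
    // Primary index: document id -> document (nested name-value pairs).
    private final Map<Integer, Map<String, Object>> byId = new TreeMap<>();
    // Secondary index on one top-level name: value -> ids of matching documents.
    private final String indexedName;
    private final Map<Object, Set<Integer>> secondary = new HashMap<>();

    DocumentCollection(String indexedName) {
        this.indexedName = indexedName;
    }

    void set(int id, Map<String, Object> document) {
        delete(id); // remove stale index entries when a document is overwritten
        byId.put(id, document);
        Object value = document.get(indexedName);
        if (value != null) { // documents without the name are simply not indexed
            secondary.computeIfAbsent(value, v -> new TreeSet<>()).add(id);
        }
    }

    Map<String, Object> get(int id) {
        return byId.get(id);
    }

    void delete(int id) {
        Map<String, Object> old = byId.remove(id);
        if (old == null) return;
        Object value = old.get(indexedName);
        Set<Integer> ids = (value == null) ? null : secondary.get(value);
        if (ids != null) {
            ids.remove(id);
            if (ids.isEmpty()) secondary.remove(value);
        }
    }

    // Lookup through the secondary index instead of scanning all documents.
    Set<Integer> findIdsBy(Object value) {
        return Collections.unmodifiableSet(
                secondary.getOrDefault(value, Collections.emptySet()));
    }

    public static void main(String[] args) {
        DocumentCollection products = new DocumentCollection("price");
        products.set(1, Map.of("id", 1, "name", "football boot", "price", 199,
                "stock", Map.of("warehouse", 120, "retail", 10)));
        System.out.println(products.findIdsBy(199));
    }
}
```

The sketch also hints at why documents in a collection should be similar: a document that lacks the indexed name never appears in the index.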

Query Models

– Simple API: set/get/delete

– Further query support differs widely

• Powerful ad-hoc queries with integrated query language (MongoDB)

• No ad-hoc queries, predefined views with indexes only (CouchDB & Couchbase)

– Language Bindings

→ MapReduce (later in this chapter)

Systems

– MongoDB, CouchDB, Couchbase, …


(8)

Column Family Database Systems

Data Model

– Loosely structured by columns and column families (“set of nested maps”)

Column Family

• set of columns grouped together into a bundle

• Column families have to be predefined

Column

• Not predefined; any type or data (can be nested)

(Figure: a table with two column families; the rows Row Key1 and Row Key2 each hold a different number of columns per column family)

(9)

Column Family Database Systems (Cont.)

Data Model (Cont.)

– Example:

Row Key: title | Column Family text | Column Family revision
"NoSQL" | text:content: "A NoSQL database provides a mechanism …" | revision:author: "Mendel"; revision:comment: "changed …"
"Redis" | text:content: "Redis is an open-source, networked …" | revision:author: "Torben"; revision:comment: "initial …"

Column family database systems support multiple versions of each cell by timestamps:

Row Key: title | Time Stamp | Column Family text | Column Family revision
"NoSQL" | t5 | text:content: "…" | revision:author: "Mendel"; revision:comment: "changed …"
"NoSQL" | t4 | | revision:author: "Torben"; revision:comment: "there …"
"Redis" | t3 | text:content: "…" | revision:author: "Torben"; revision:comment: "initial …"

(10)

Column Family Database Systems (Cont.)

Storage Layout

– Data is stored by column family

Stored column family text:

Row Key: title | Time Stamp | column: content
NoSQL | t5 | A NoSQL database provides a mechanism …
Redis | t3 | Redis is an open-source, networked …

Stored column family revision:

Row Key: title | Time Stamp | column: author | column: comment
NoSQL | t5 | Mendel | changed view …
NoSQL | t4 | Torben | there should be
Redis | t3 | Torben | initial …

Logical table:

Row Key: title | Time Stamp | Column Family text | Column Family revision
"NoSQL" | t5 | text:content: "…" | revision:author: "Mendel"; revision:comment: "changed view …"
"NoSQL" | t4 | | revision:author: "Torben"; revision:comment: "there should be …"
"Redis" | t3 | text:content: "…" | revision:author: "Torben"; revision:comment: "initial …"

(11)

Column Family Database Systems (Cont.)

• Classical example: Web table

Row Key | Time Stamp | Column Family contents | Column Family anchor
"com.cnn.www" | t9 | | anchor:anchor: "cnnsi.com"; anchor:anchortext: "CNN"
"com.cnn.www" | t8 | | anchor:anchor: "my.look.ch"; anchor:anchortext: "CNN.com"
"com.cnn.www" | t6 | "<html>…" |
"com.cnn.www" | t5 | "<html>…" |

(12)

Column Family Database Systems (Cont.)

Query Models

– Simple API

• set (table, row, column, value)

• value = get (table, row, column)

• delete (table, row, column)

• timestamp optional
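The "set of nested maps" data model and the set/get/delete API with an optional timestamp can be sketched together. The class below is a hypothetical single-table model (so the table argument is dropped); column names are assumed to have the form `family:qualifier`, and the version handling is a simplification that does not reproduce how HBase or Cassandra resolve timestamps.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical single table: row key -> column -> timestamp -> value.
class ColumnFamilyTable {
    // Column families have to be predefined; columns inside them do not.
    private final Set<String> families;
    // Rows and columns are kept sorted; versions of a cell are sorted newest first.
    private final SortedMap<String, SortedMap<String, SortedMap<Long, String>>> rows =
            new TreeMap<>();
    private long clock = 0; // highest timestamp seen so far

    ColumnFamilyTable(String... predefinedFamilies) {
        families = new HashSet<>(Arrays.asList(predefinedFamilies));
    }

    // set without timestamp: the table assigns the next logical timestamp.
    void set(String row, String column, String value) {
        set(row, column, value, clock + 1);
    }

    void set(String row, String column, String value, long timestamp) {
        int colon = column.indexOf(':');
        String family = colon < 0 ? column : column.substring(0, colon);
        if (!families.contains(family)) {
            throw new IllegalArgumentException("column family not predefined: " + family);
        }
        clock = Math.max(clock, timestamp);
        rows.computeIfAbsent(row, r -> new TreeMap<String, SortedMap<Long, String>>())
            .computeIfAbsent(column,
                    c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // get without timestamp: newest version of the cell, or null.
    String get(String row, String column) {
        SortedMap<Long, String> versions = versions(row, column);
        return versions == null ? null : versions.get(versions.firstKey());
    }

    // get with timestamp: exactly that version of the cell, or null.
    String get(String row, String column, long timestamp) {
        SortedMap<Long, String> versions = versions(row, column);
        return versions == null ? null : versions.get(timestamp);
    }

    // delete: removes all versions of the cell.
    void delete(String row, String column) {
        SortedMap<String, SortedMap<Long, String>> columns = rows.get(row);
        if (columns != null) {
            columns.remove(column);
            if (columns.isEmpty()) rows.remove(row);
        }
    }

    private SortedMap<Long, String> versions(String row, String column) {
        SortedMap<String, SortedMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    public static void main(String[] args) {
        ColumnFamilyTable articles = new ColumnFamilyTable("text", "revision");
        articles.set("NoSQL", "revision:author", "Torben", 4);
        articles.set("NoSQL", "revision:author", "Mendel", 5);
        System.out.println(articles.get("NoSQL", "revision:author"));
        System.out.println(articles.get("NoSQL", "revision:author", 4));
    }
}
```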

– Language Bindings

– More powerful query engines integrated (Cassandra Query Language) or as additional software products (e.g. Google App Engine / Google Datastore for BigTable, Hive for Data Warehousing on HBase)

→ MapReduce (later in this chapter)

Indexes

– Primary indexes (B-Trees → sorted order)

– Default or user defined secondary indexes

Systems

(13)

NoSQL (Not only SQL): Definition

“NoSQL Definition: Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.”

Source: S. Edlich, nosql-database.org

(14)

NoSQL (Not only SQL): Definition

Next Generation Databases mostly addressing some of the points:

non-relational

schema-free

simple API → more complex APIs currently under development

distributed and horizontally scalable

easy replication support

eventually consistent / BASE (not ACID) → BASE as well as ACID are supported nowadays

open-source ???

(15)

NoSQL: The Essence

Data Model

non-relational

schema-free

Scalability

distributed and horizontally scalable

easy replication support

(16)

NoSQL Database Systems: Use Cases

Suitable Use Cases

– Key-Value Database Systems: Storing Session Information; User Profiles, Preferences; Shopping Cart Data

– Document Store Database Systems: Event Logging; Content Management Systems; Blogging Platforms; Web Analytics or Real-Time Analytics

– Column Family Database Systems: Event Logging; Content Management Systems; Blogging Platforms

Examples

– Key-Value Database Systems: Amazon (shopping carts), Temetra (meter data), …

– Document Store Database Systems: Forbes (CMS), MTV (CMS), …

– Column Family Database Systems: Google (web pages), Facebook (messaging), Twitter (places of interest)

(17)

NoSQL Family Tree

Source: cloudant.com

(18)

Solution Architectures (Examples)

Google Stack

Source: Saake/Schallehn:2011

(19)

NoSQL Database Systems

Categorization

– Data Model

– Storage Layout

– Query Models

– Solution Architectures

Data Modeling

Application Development

Scalability, Availability and Consistency

– Partitioning, Replication

– Consistency Models and Transactions

Select the Right DBMS

– Performance and Benchmarks

– Polyglot Persistence


(20)

Data Modeling

Object-relational impedance mismatch

• Example: blog, blogpost, comment, author

– Object-oriented modeling

– Mapping to relational database

(21)

Data Modeling Decisions

Primary Decision: Embedding vs. Referencing

However, consider:

– There are no join operations within NoSQL database systems!

– There are no distributed transactions within NoSQL database systems!

Advantages and Disadvantages of Embedding

Advantages and Disadvantages of Referencing

• Martin Fowler: Aggregate-Oriented Modeling
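The trade-off between embedding and referencing can be sketched with the blogpost/comment example. In this sketch, plain Java maps stand in for collections in a data store (key to aggregate); the classes `EmbeddedPost`, `ReferencingPost`, and `Comment` and both helper methods are hypothetical. The point is only the access pattern: with embedding one read returns the whole aggregate, while with referencing the application has to perform the "join" itself, one lookup per reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical classes; maps stand in for collections of a NoSQL data store.
class BlogModeling {
    static final class Comment {
        final String author;
        final String text;
        Comment(String author, String text) { this.author = author; this.text = text; }
    }

    // Embedding: the blogpost aggregate contains its comments.
    static final class EmbeddedPost {
        final String title;
        final List<Comment> comments;
        EmbeddedPost(String title, List<Comment> comments) {
            this.title = title;
            this.comments = comments;
        }
    }

    // Referencing: the blogpost only stores the keys of its comments.
    static final class ReferencingPost {
        final String title;
        final List<String> commentIds;
        ReferencingPost(String title, List<String> commentIds) {
            this.title = title;
            this.commentIds = commentIds;
        }
    }

    // One read of the aggregate is enough.
    static List<String> authorsEmbedded(Map<String, EmbeddedPost> posts, String postId) {
        List<String> authors = new ArrayList<>();
        for (Comment comment : posts.get(postId).comments) {
            authors.add(comment.author);
        }
        return authors;
    }

    // Application-side join: one read for the post plus one read per comment.
    static List<String> authorsReferenced(Map<String, ReferencingPost> posts,
                                          Map<String, Comment> comments, String postId) {
        List<String> authors = new ArrayList<>();
        for (String commentId : posts.get(postId).commentIds) {
            authors.add(comments.get(commentId).author);
        }
        return authors;
    }

    public static void main(String[] args) {
        Map<String, EmbeddedPost> posts = Map.of("042", new EmbeddedPost("NoSQL",
                List.of(new Comment("Torben", "initial"), new Comment("Mendel", "changed"))));
        System.out.println(authorsEmbedded(posts, "042"));
    }
}
```

Embedding pays for the single read with redundancy when the same comment or author data is needed in several aggregates; referencing avoids the redundancy but updates that span several keys are not covered by a transaction.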


(22)

Data Modeling: Document Store DBS

• How to realize references?

• Direction of references?

• Embedding: What about denormalization and redundancy?

(23)

Data Modeling: Column Family DBS

• How to implement embedded objects in column family database systems?

– Variant 1: Using run-time named column qualifiers

– Variant 2: Using timestamps (or other IDs)

– New (Cassandra CQL3): Using collection types (map, set, list)

• What about column families?
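Variant 1 (run-time named column qualifiers) can be illustrated as follows. The sketch assumes one row per blogpost, held as a sorted map from column name to value, and a hypothetical naming scheme `comments:<commentId>:<field>`; neither the scheme nor the helper methods come from a particular product. Because columns are stored in sorted order, all columns of the embedded comments can be read with a prefix range scan.

```java
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical naming scheme: "comments:<commentId>:<field>".
class QualifierEmbeddingSketch {
    // Embed a comment by creating new columns whose names are built at run time.
    static void addComment(SortedMap<String, String> row, String commentId,
                           String author, String text) {
        row.put("comments:" + commentId + ":author", author);
        row.put("comments:" + commentId + ":text", text);
    }

    // Read the ids of all embedded comments with a prefix scan over "comments:".
    // ';' is the character directly after ':' in ASCII, so it bounds the prefix range.
    static Set<String> commentIds(SortedMap<String, String> row) {
        Set<String> ids = new TreeSet<>();
        for (String column : row.subMap("comments:", "comments;").keySet()) {
            ids.add(column.split(":")[1]); // assumes ids contain no ':'
        }
        return ids;
    }

    public static void main(String[] args) {
        SortedMap<String, String> row = new TreeMap<>(); // row key '042'
        row.put("blogpost_data:title", "NoSQL");
        addComment(row, "c1", "Torben", "initial");
        addComment(row, "c2", "Mendel", "changed");
        System.out.println(commentIds(row));
        System.out.println(row.keySet());
    }
}
```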

(24)

Data Modeling

• What about data modeling in key-value database systems?

Data Modeling: Conclusion

– More degrees of freedom

– Embedding vs. referencing

– Denormalization and redundancy

(25)

NoSQL Database Systems

Categorization

– Data Model

– Storage Layout

– Query Models

– Solution Architectures

Data Modeling

Application Development

Scalability, Availability and Consistency

– Partitioning, Replication

– Consistency Models and Transactions

Select the Right DBMS

– Performance and Benchmarks

– Polyglot Persistence


(26)

Application Development for NoSQL

• Simple command line APIs

• REST-API

• (Some) more powerful query languages / query engines

• Language Bindings

– Java, Ruby, C#, Python, Erlang, PHP, Perl, …

– REST

(27)

Application Development for NoSQL

• Example: title, content from blogpost with id = 042

// HBase
get 'blogposts', '042', { COLUMN => ['blogpost_data:title', 'blogpost_data:content'] }

// Cassandra
SELECT title, content FROM blogposts WHERE id = '042';

// MongoDB
db.blogposts.find( { _id : '042' }, { title: 1, content: 1 } )

// Couchbase
function (doc) {
  if (doc._id == '042') {
    emit(doc._id, [doc.title, doc.content]);
  }
}

(28)

Application Development for NoSQL

Challenge

– Big data

– Data distributed over several hundred nodes (remember: scale out)

Data-to-Code or Code-to-Data?

Executing jobs in parallel over several nodes

(29)

MapReduce: Basic Idea

• Old idea from functional programming (LISP, ML, Erlang, Scala etc.)

– Divide tasks into small discrete tasks and run them in parallel

– Never change original data (pipe concept)

→ Different operations on the same data do not influence each other

– No concurrency conflicts

– No deadlocks

– No race conditions

MapReduce

– Basic idea and framework introduced by Google 2004:

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04. 2004

http://labs.google.com/papers/mapreduce.html


(30)

MapReduce: Basic Idea & WordCount Example

• Developers should implement two primary methods

Map: (key1, val1) → [(key2, val2)]

Reduce: (key2, [val2]) → [(key3, val3)]

Documents (input)

Doc1: Sport, Handball, Soccer
Doc2: Soccer, FIFA
Doc3: Sport, Gym, Money
Doc4: Soccer, FIFA, Money

MAP output (key, value)

Doc1, Doc2: (Sport, 1), (Handball, 1), (Soccer, 1), (Soccer, 1), (FIFA, 1)
Doc3, Doc4: (Sport, 1), (Gym, 1), (Money, 1), (Soccer, 1), (FIFA, 1), (Money, 1)

REDUCE output (key, value)

Reducer 1: (Sport, 2), (Handball, 1), (Soccer, 3)
Reducer 2: (FIFA, 2), (Gym, 1), (Money, 2)
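The two signatures above can be tried out without a cluster. The following sketch simulates map, shuffle (grouping by key), and reduce sequentially in plain Java for the four example documents; it is not Hadoop code, and the class `WordCountSketch` and its methods are hypothetical. A real framework would run the map and reduce calls in parallel on different nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sequential simulation of MapReduce word count; not a Hadoop program.
class WordCountSketch {
    // Map: (docId, text) -> [(word, 1)]
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.split(",\\s*")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce: (word, [counts]) -> (word, sum)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return new SimpleEntry<>(word, sum);
    }

    static Map<String, Integer> run(Map<String, String> documents) {
        // Shuffle: group all intermediate values by their key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // One reduce call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            Map.Entry<String, Integer> reduced = reduce(group.getKey(), group.getValue());
            result.put(reduced.getKey(), reduced.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("Doc1", "Sport, Handball, Soccer");
        documents.put("Doc2", "Soccer, FIFA");
        documents.put("Doc3", "Sport, Gym, Money");
        documents.put("Doc4", "Soccer, FIFA, Money");
        System.out.println(run(documents));
    }
}
```

For the four documents the result contains Sport 2, Handball 1, Soccer 3, FIFA 2, Gym 1, and Money 2, which matches the REDUCE output of the example.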

(31)

MapReduce: Architecture and Phases

(32)

Map & Reduce Functions (Example)

Hadoop Example

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Reused output objects: the constant count 1 and the current word.
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, …) … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);   // emit (word, 1)
    }
  }
}

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, …) … {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up all counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}

(33)

MapReduce: Optional Combine Phase

• Decrease the shuffling cost

– Reduce the result size of map functions

– Perform reduce-like function in each machine

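Documents (input)

Node 1: Doc1: Sport, Handball, Soccer; Doc2: Soccer, FIFA
Node 2: Doc3: Sport, Gym, Money; Doc4: Soccer, FIFA, Money

MAP output (key, value)

Node 1: (Sport, 1), (Handball, 1), (Soccer, 1), (Soccer, 1), (FIFA, 1)
Node 2: (Sport, 1), (Gym, 1), (Money, 1), (Soccer, 1), (FIFA, 1), (Money, 1)

COMBINE output (key, value)

Node 1: (Sport, 1), (Handball, 1), (Soccer, 2), (FIFA, 1)
Node 2: (Sport, 1), (Gym, 1), (Money, 2), (Soccer, 1), (FIFA, 1)

REDUCE: receives the combined pairs of both nodes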
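The effect of the combine phase can be checked with a small simulation. The sketch below assumes two nodes that each hold two of the example documents; the class `CombinerSketch` and its methods are hypothetical and run sequentially rather than on a cluster. Each node pre-aggregates its own map output, so fewer pairs have to be shuffled while the final counts stay the same.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sequential simulation of map, combine, and reduce on two nodes.
class CombinerSketch {
    // MAP on one node: emit (word, 1) for every word of every local document.
    static List<Map.Entry<String, Integer>> map(List<String> localDocuments) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String document : localDocuments) {
            for (String word : document.split(",\\s*")) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // COMBINE on the same node: reduce-like local sum, one pair per distinct word.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            local.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return local;
    }

    // REDUCE: add up the partial counts received from all nodes.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> entry : partial.entrySet()) {
                total.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> node1 =
                map(List.of("Sport, Handball, Soccer", "Soccer, FIFA"));
        List<Map.Entry<String, Integer>> node2 =
                map(List.of("Sport, Gym, Money", "Soccer, FIFA, Money"));
        Map<String, Integer> combined1 = combine(node1);
        Map<String, Integer> combined2 = combine(node2);
        System.out.println("pairs without combine: " + (node1.size() + node2.size()));
        System.out.println("pairs with combine: " + (combined1.size() + combined2.size()));
        System.out.println(reduce(List.of(combined1, combined2)));
    }
}
```

In this tiny example the number of shuffled pairs only drops from 11 to 9; the saving grows with the number of repeated words per node.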

(34)

MapReduce Frameworks

• MapReduce frameworks take care of

– Scaling

– Fault tolerance

– (Load balancing)

MapReduce Frameworks

– Google (however, Google now promotes Dataflow)

– Apache Hadoop

• standalone or integrated in NoSQL (and SQL) DBMS

• Also commercial distributors: Cloudera, MapR, HortonWorks, …

(35)

Map Reduce and Query Languages

• MapReduce paradigm is too low-level

– Only two declarative primitives (map + reduce)

– Custom code for simple operations like projection and filtering

– Code is difficult to reuse and maintain

Combination of high-level declarative querying and low-level programming with MapReduce

Dataflow Programming Languages

– HiveQL

– Pig

(36)

Hadoop Stack

(37)

HiveQL

• Hive: data warehouse infrastructure built on top of Hadoop, providing:

– Data Summarization

– Ad hoc querying

• Simple query language: HiveQL (based on SQL)

• Extendable via custom mappers and reducers

• Developed by Facebook, now subproject of Hadoop

• http://hadoop.apache.org/hive/

(38)

HiveQL: Example

(39)

Pig

• A platform for analyzing large data sets

• Pig consists of two parts:

– Pig Latin: A Data Processing Language

– Pig Infrastructure: An Evaluator for Pig Latin programs

– Pig compiles Pig Latin into physical plans

– Plans are to be executed over Hadoop

• Interface between the declarative style of SQL and low-level, procedural style of MapReduce

(40)

Pig: Example

(41)

MapReduce in Practice

VLDB 2012: Chen, Alspaugh, Katz: Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads

(42)

MapReduce in Practice (Cont.)

(43)

MapReduce Trends

Hadoop 2.0 with YARN (abstracts from MapReduce)

Apache Spark

– “In-Memory” Hadoop

– Performance!

– Written in Scala

(44)

Application Development for NoSQL

• MapReduce: Concept and Frameworks

• “State of the art” application development

– With relational database systems: Object-Relational Mapping (ORM) frameworks and standards (Java Persistence API etc.)

– Frameworks for Object-NoSQL mapping?!

(45)

Object-NoSQL Mapper: Architecture

Application

SELECT b.titel, b.text FROM blogpost b WHERE b.id = '042'

Object-NoSQL Mapper

// Cassandra
SELECT titel, text FROM blogposts WHERE id = '042';

// HBase
get 'blogposts', '042', { COLUMN => ['blogpost_daten:titel', 'blogpost_daten:text'] }

// MongoDB
db.blogposts.find( { _id : '042' }, { titel: 1, text: 1 } )

(Figure: the mapper passes the application's query to the different data stores in their native form; the stores return rows (id, titel, …) or a document { "id" : "042", "titel" : ... })

(46)

Object-NoSQL Mapper: Market Overview

Mapper for different Programming Languages

– Java, .NET, Python, Ruby …

– Volatile Market …

Main Focus: Object-NoSQL Mapper for Java

– Standardization: Java Persistence API (JPA) with Java Persistence Query Language (JPQL)

Categorization

– Multi Data Store Mapper

– Single Data Store Mapper

(47)

Java Multi Data Store Mapper

• Support for Document Store, Column Family, and Graph Database Systems in Java Multi Data Store Mapper

Mappers compared: DataNucleus, EclipseLink, Hibernate OGM, Kundera, PlayORM, Spring Data

– Document Store: Couchbase, CouchDB, MongoDB

– Column-Family DBMS: Cassandra, HBase

– Graph DBMS: Neo4J

(Table: support matrix of mappers and data stores; the individual support marks are not reproduced)

(48)

Java Multi Data Store Mapper

• Support for Key-Value Database Systems in Java Multi Data Store Mapper

Mappers compared: DataNucleus, EclipseLink, Hibernate, Kundera, PlayORM, Spring Data

– Key-Value DBMS: Amazon DynamoDB, Apache Solr, Ehcache, Elasticsearch, GemFire, Infinispan, Oracle NoSQL, Redis

(Table: support matrix of mappers and data stores; the individual support marks are not reproduced)

(49)

Java Object-NoSQL Mapper: Supported Functionality

Single Data Store Mapper

*Limited functionality (depending on the underlying NoSQL data store)

Source: Störl/Hauf/Klettke/Scherzinger: Schemaless NoSQL Data Stores – Object-NoSQL Mappers to the Rescue? BTW 2015, Hamburg, March 2015

(50)

Object-NoSQL Mapper: Query Language Support

Challenge: Different Query Language Interfaces

– Examples:

• Most systems do not support any JOINS

• Many systems do not offer aggregate functions, LIKE operator, or NOT operator, …

Approaches

1. Offer only the particular subset of features that is implemented by all supported NoSQL data stores, i.e. the intersection of features

2. Distinguish by data store and offer only the set of features implemented by a particular NoSQL data store

3. Offer the same set of features for all supported NoSQL data stores, possibly complementing missing features by implementing them inside the Object-NoSQL Mapper
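Approach 3 can be illustrated with the LIKE operator. The sketch assumes a hypothetical data store interface `DataStore` that offers only a full scan and no LIKE support; the mapper then evaluates the operator itself on the scanned entities. None of the names below are taken from DataNucleus, Hibernate OGM, or Kundera. The sketch also shows where the performance drawback comes from: every entity is transferred to the mapper before filtering.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical mapper-side implementation of LIKE for stores that lack it.
class LikeInMapperSketch {
    // Assumed minimal store capability: return all entities, no filtering.
    interface DataStore {
        Collection<Map<String, String>> scan();
    }

    // Translate a LIKE pattern into a regular expression:
    // '%' matches any sequence of characters, '_' exactly one character.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char ch : like.toCharArray()) {
            if (ch == '%') {
                regex.append(".*");
            } else if (ch == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The missing feature is complemented inside the mapper:
    // scan everything, then filter in the application.
    static List<Map<String, String>> findWhereLike(DataStore store, String field,
                                                   String like) {
        Pattern pattern = likeToRegex(like);
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> entity : store.scan()) {
            String value = entity.get(field);
            if (value != null && pattern.matcher(value).matches()) {
                hits.add(entity);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> blogposts = List.of(
                Map.of("title", "NoSQL"),
                Map.of("title", "Redis"),
                Map.of("title", "NoSQL Databases"));
        DataStore store = () -> blogposts;
        System.out.println(findWhereLike(store, "title", "NoSQL%"));
    }
}
```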

(51)

Object-NoSQL Mapper: Query Language Support

Approach 2: NoSQL data store specific support of JPQL operators

– Drawback: restricted portability

– Systems: Hibernate OGM, Kundera, EclipseLink

• Example: JPQL operators (selection) in Kundera

(52)

Object-NoSQL Mapper: Query Language Support

Approach 2: NoSQL data store specific support of JPQL operators

Extension: Use third-party libraries to offer more functionality for some but not for all supported NoSQL data stores

– Systems: Hibernate OGM (Hibernate Search), Kundera, each with Apache Lucene

(Figure: Application, Object-NoSQL Mapper, NoSQL-DBMS, Search Engine Index)

(53)

Object-NoSQL Mapper: Query Language Support

Approach 3: Offer the same set of features for all supported NoSQL data stores

– Complementing missing features by implementing them inside the Object-NoSQL Mapper

– Benefit: Portability

– Drawback: Performance

– Systems: DataNucleus, Hibernate OGM (announced)

(Figure: Object-NoSQL Mapper, NoSQL-DBMS)

(54)

Object-NoSQL Mapper: Query Language Support

Outlook: Combination of Approach 2 and 3

– Systems: Hibernate OGM (announced)

(Figure: Object-NoSQL Mapper, NoSQL-DBMS, Search Engine Index)

(55)

Conclusion: Java Object-NoSQL Mapper

Vendor Independency / Portability

Standardized Query Language (JPQL)

– Support for different NoSQL data stores

– Supported query operators often depend on the capabilities of the underlying NoSQL data stores

Performance (as of end of 2014)

– In reading data, there is only a small gap between native access and the Object-NoSQL Mappers for the majority of the evaluated products

– Yet in writing, object mappers introduce a significant overhead

– Further reading: U. Störl, Th. Hauf, M. Klettke and S. Scherzinger: Schemaless NoSQL Data Stores – Object-NoSQL Mappers to the Rescue? BTW 2015, Hamburg, March 2015

(56)

NoSQL Database Systems

Categorization

– Data Model

– Storage Layout

– Query Models

– Solution Architectures

Data Modeling

Application Development

Scalability, Availability and Consistency

– Partitioning, Replication

– Consistency Models and Transactions

Select the Right DBMS

– Performance and Benchmarks

– Polyglot Persistence
