NoSQL Data Base Basics

(1)

NoSQL Data Base Basics

Spring- 2013

Jordi Torres, UPC - BSC

Cloud Computing – MIRI (CLC-MIRI)

UPC Master in Innovation & Research in Informatics

Course Notes in

Transparency Format

(2)

HDFS

Hadoop: standard storage mechanism for HADOOP Hadoop Distributed File System (HDFS)

(3)

HDFS

§ 

Hadoop Distributed File System (HDFS)

–  Fault tolerance

•  Assuming that failure will happen allows HDFS to run on commodity hardware.

–  Streaming data access

•  HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.

–  Extreme scalability

•  HDFS will scale to petabytes (current versions)

(4)

Hadoop: standard storage mechanism

§ 

Hadoop Distributed File System (HDFS)

–  Most HDFS applications need a write-once-read-many access model for files

•  By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.

–  “Moving Computation is Cheaper than Moving Data”: Locality of computation

•  Due to data volume, it is often much faster to move the program near to the data

à HDFS has features to facilitate this.

(5)

Hadoop: standard storage mechanism

Starting point

http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html /

(6)

Hadoop: standard storage mechanism

§ 

HDFS Interface

–  Interface similar to that of regular filesystems.

–  can only store and retrieve data, not index it.

§ 

Simple random access to data is not possible.

§ 

Solution: higher-level layers àHBase

•  have been created to provide finer-grained functionality to Hadoop deployments

Map Reduce Hbase

HDFS

(7)

Hbase, the Hadoop Database

§  ^HBase

–  Creates indexes àoffers fast and random access to its content

–  Modeled after Google's BigTable DB

–  is a column-oriented database designed to store massive amounts of data.

–  Uses HDFS as a storage system

§ 

It belongs to the NoSQL universe

–  similar to Cassandra, Hypertable, …

Map Reduce

Hbase

HDFS

(8)

Hbase versus HDFS (a brief comparison)

§ 

^HDFS:

–  Optimized For:

–  Large Files

–  Sequential Access (High Throughput) –  Append Only

–  Use for fact tables that are mostly append only and require sequential full table scans.

§ 

^HBase:

–  Optimized For:

–  Small Records (but many records) –  Random Access

–  Atomic Record Updates

–  Use for dimension lookup tables which are updated frequently and require random low-latency lookups.

(9)

HDFS: an example

§ 

A given file

–  is broken down into blocks (default=64MB),

1 2 3 4 5

(10)

HDFS: an example

–  then blocks are replicated across cluster (default=3).

1 2 3 4 5

2 3 4

1 3 5

1 3 4 2 4 5

1 2 5

(11)

MapReduce: Resource Management

§ 

Scheduling

–  A given job is broken down into tasks,

–  then tasks are scheduled to be as close to data as possible.

–  Optimized for

•  Bach processing

•  Failure recovery

2 3 4

1 3 5

1 3 4 2 4 5

1 2 5

(12)

Common characteristics of NoSQL

§ 

Shared nothing systems

Shared nothing systems have proven to be most cost-effective and flexible Shared Disk

Shared RAM Shared Nothing

CPU RAM

SAN

LAN

CPU

RAM

CPU

BUS

CPU RAM Disk

LAN

Source: h*p://www.slideshare.net/Couchbase/webinar-‐making-‐sense-‐of-‐

nosql-‐applying-‐nonrela?onal-‐databases-‐to-‐business-‐needs?ref=h*p://

(13)

Common characteristics of NoSQL

§ 

Distributed models

Master-Slave Peer-to-Peer

Master

Standby Master

Node Node

Node

requests requests

Used only if primary master fails

ce: h*p://www.slideshare.net/Couchbase/webinar-‐making-‐sense-‐of-‐ l-‐applying-‐nonrela?onal-‐databases-‐to-‐business-‐needs?ref=h*p:// w.slideshare.net/slideshow/embed_code/18124982?rel=0

(14)

Common characteristics of NoSQL

§ 

Move Queries to the Nodes

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Database

MapReduce

Queries work best if the run on the local node that

has the data Query

Source: h*p://www.slideshare.net/Couchbase/webinar-‐making-‐sense-‐of-‐nosql-‐applying-‐nonrela?onal-‐

(15)

Alternatives to Hbase/HDFS?

§ 

An Apache project, Cassandra

originated at Facebook and is now in production in many large-scale websites (also at BSC).

§ 

Hypertable was created at Zvents and spun out as an open source project.

§ 

Are both scalable column-store

databases that follow the pattern of BigTable, similar to HBase.

Map Reduce

Cassandra

Map Reduce

Hypertable

(16)

And … dozens

§ 

http://nosql-database.org

List Of NoSQL Databases [currently 150]

(17)

NoS QL

§ 

The concept is something that has gained momentum in recent years

§ 

Today is a mature and efficient alternative that can help us solve the problems of scalability and

performance

(e.g. online applications with thousands of concurrent users and million hits a day)

(18)

NoSQL on Google Trends

Source: http://www.google.com/trends/explore#q=NoSQL

(19)

Different Types of NoSQL Systems

•  Distributed Key-Value Systems

–  Amazon’s S3 Key-Value Store (Dynamo) –  Voldemort (LinkedIn)

–  Cassandra (Facebook) –  …

•  Column-based Systems

–  BigTable (Google) –  HBase

–  Cassandra –  …

•  Document-based systems

–  CouchDB –  MongoDB –  …

(20)

Common Themes

§ 

Horizontal scalability

§ 

Clever use of hashing and caching

§ 

Parallel execution of queries

–  move queries to the data, not the other way around

§ 

Share resources when possible

–  Example – memcached protocol

§ 

Use simple interfaces when possible

–  put, get, delete

Source: Kelly-McCreary & Associates, LLC

http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying- nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/

slideshow/embed_code/18124982?rel=0

(21)

NoSQL Data Base Basics