
(1)

NoSQL Data Base Basics

Spring 2013

Jordi Torres, UPC - BSC

Cloud Computing – MIRI (CLC-MIRI)

UPC Master in Innovation & Research in Informatics

Course Notes in Transparency Format

(2)

Hadoop: standard storage mechanism

Hadoop Distributed File System (HDFS)

(3)

HDFS

§  Hadoop Distributed File System (HDFS)

–  Fault tolerance

•  Assuming that failure will happen allows HDFS to run on commodity hardware.

–  Streaming data access

•  HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.

–  Extreme scalability

•  HDFS will scale to petabytes (current versions)

(4)

Hadoop: standard storage mechanism

§  Hadoop Distributed File System (HDFS)

–  Most HDFS applications need a write-once-read-many access model for files

•  By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.

–  “Moving Computation is Cheaper than Moving Data”: Locality of computation

•  Due to data volume, it is often much faster to move the program near to the data

→ HDFS has features to facilitate this.

(5)

Hadoop: standard storage mechanism

Starting point

http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html

(6)

Hadoop: standard storage mechanism

§  HDFS Interface

–  Interface similar to that of regular filesystems.

–  Can only store and retrieve data, not index it.

§  Simple random access to data is not possible.

§  Solution: higher-level layers → HBase

•  These layers have been created to provide finer-grained functionality to Hadoop deployments.

[Figure: MapReduce and HBase layered on top of HDFS]

(7)

HBase, the Hadoop Database

§  HBase

–  Creates indexes → offers fast and random access to its content

–  Modeled after Google's BigTable DB

–  is a column-oriented database designed to store massive amounts of data.

–  Uses HDFS as a storage system

§  It belongs to the NoSQL universe

–  similar to Cassandra, Hypertable, …

[Figure: MapReduce on top of HBase, on top of HDFS]

(8)

HBase versus HDFS (a brief comparison)

§  HDFS:

–  Optimized for:

•  Large files

•  Sequential access (high throughput)

•  Append only

–  Use for fact tables that are mostly append-only and require sequential full-table scans.

§  HBase:

–  Optimized for:

•  Small records (but many records)

•  Random access

•  Atomic record updates

–  Use for dimension lookup tables which are updated frequently and require random low-latency lookups.
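The difference between the two access patterns can be sketched in a few lines of Python. This is only a toy illustration: the class names `AppendOnlyLog` and `IndexedTable` are hypothetical, and neither class reflects how HDFS or HBase is actually implemented. It only shows why a lookup without an index has to read every record, while an index gives direct access.

```python
class AppendOnlyLog:
    """Toy stand-in for an append-only file: lookups need a full scan."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        self._records.append((key, value))

    def scan(self, key):
        # Read every record sequentially and keep the latest match.
        result = None
        for k, v in self._records:
            if k == key:
                result = v
        return result


class IndexedTable:
    """Toy stand-in for an indexed store: the index maps key -> position."""

    def __init__(self):
        self._log = []
        self._index = {}

    def put(self, key, value):
        # Record where the newest version of the key lives.
        self._index[key] = len(self._log)
        self._log.append((key, value))

    def get(self, key):
        # One index lookup, then one direct read; no scan.
        pos = self._index.get(key)
        return None if pos is None else self._log[pos][1]
```

Both classes return the same answers; only the amount of data touched per lookup differs.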

(9)

HDFS: an example

§  A given file

–  is broken down into blocks (default=64MB),

[Figure: a file split into blocks 1 to 5]

(10)

HDFS: an example

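–  then blocks are replicated across the cluster (default=3).

[Figure: blocks 1 to 5 replicated three times across five nodes, which hold blocks {2, 3, 4}, {1, 3, 5}, {1, 3, 4}, {2, 4, 5} and {1, 2, 5}]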
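The two steps of the example (split into blocks, then replicate) can be sketched as follows. The function names and the round-robin placement policy are assumptions made for illustration only; real HDFS placement is rack-aware and decided by the NameNode.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default block size quoted in these notes (64 MB)
REPLICATION = 3                # default replication factor quoted in these notes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks are needed to hold file_size bytes."""
    return (file_size + block_size - 1) // block_size  # ceiling division


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(1, num_blocks + 1):
        start = (block - 1) % len(nodes)
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


# A 300 MB file on a five-node cluster, as in the example above.
blocks = split_into_blocks(300 * 1024 * 1024)
layout = place_replicas(blocks, ["n1", "n2", "n3", "n4", "n5"])
```

With these inputs the file needs five blocks, and every block ends up on three different nodes.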

(11)

MapReduce: Resource Management

§  Scheduling

–  A given job is broken down into tasks,

–  then tasks are scheduled to be as close to data as possible.

–  Optimized for

•  Batch processing

•  Failure recovery

[Figure: the same five nodes with replicated blocks {2, 3, 4}, {1, 3, 5}, {1, 3, 4}, {2, 4, 5} and {1, 2, 5}]
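A minimal sketch of locality-aware scheduling, assuming one task per block and a fixed number of free task slots per node. The function `schedule_tasks` and its greedy policy are hypothetical simplifications and do not reproduce the actual Hadoop scheduler.

```python
def schedule_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: dict block -> list of nodes holding a replica
    free_slots:      dict node  -> number of tasks the node can still accept
    Returns dict block -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict intact
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            # Data-local task: pick the replica holder with the most free slots.
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # No replica holder is free: fall back to any node (remote read).
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node = max(candidates, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Block layout taken from the figure: node -> blocks it stores.
nodes = {"A": [2, 3, 4], "B": [1, 3, 5], "C": [1, 3, 4], "D": [2, 4, 5], "E": [1, 2, 5]}
locations = {}
for node, held in nodes.items():
    for block in held:
        locations.setdefault(block, []).append(node)

plan = schedule_tasks(locations, {n: 1 for n in nodes})
```

With one slot per node, every one of the five tasks can run on a node that already holds its block.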

(12)

Common characteristics of NoSQL

§  Shared nothing systems

Shared nothing systems have proven to be the most cost-effective and flexible.

[Figure: three architectures compared: Shared RAM (CPUs and RAM on a bus), Shared Disk (CPU/RAM nodes attached to a SAN over a LAN), Shared Nothing (CPU/RAM/disk nodes connected by a LAN)]

Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://

(13)

Common characteristics of NoSQL

§  Distributed models

[Figure: Master-Slave (a master, a standby master used only if the primary master fails, and nodes) versus Peer-to-Peer (nodes only); requests arrive at each cluster]

Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0

(14)

Common characteristics of NoSQL

§  Move Queries to the Nodes

[Figure: a query sent to a cluster of nodes, each pairing a Database with MapReduce]

Queries work best if they run on the local node that has the data.

Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-
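The idea of moving the query to the nodes can be sketched as a scatter/gather loop: each node filters and aggregates its own records, and only the small partial results are merged. The names `local_query` and `run_query` and the record layout are assumptions for illustration; in a real cluster the per-node step would run remotely and in parallel.

```python
from collections import Counter


def local_query(records, predicate):
    """Runs on one node: count matching records per key using local data only."""
    return Counter(r["key"] for r in records if predicate(r))


def run_query(nodes, predicate):
    """Send the query to every node and merge the partial counts."""
    total = Counter()
    for records in nodes.values():
        # Only the per-node counts travel back, never the raw records.
        total.update(local_query(records, predicate))
    return total


# Hypothetical cluster: node name -> records stored on that node.
cluster = {
    "n1": [{"key": "a", "v": 1}, {"key": "b", "v": 5}],
    "n2": [{"key": "a", "v": 7}],
}
result = run_query(cluster, lambda r: r["v"] > 4)
```

Here only two small counters cross the network instead of three records; the saving grows with the amount of data held on each node.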

(15)

Alternatives to HBase/HDFS?

§  Cassandra, an Apache project, originated at Facebook and is now in production at many large-scale websites (also at BSC).

§  Hypertable was created at Zvents and spun out as an open source project.

§  Both are scalable column-store databases that follow the pattern of BigTable, similar to HBase.

[Figure: MapReduce on top of Cassandra; MapReduce on top of Hypertable]

(16)

And … dozens

§  http://nosql-database.org

List Of NoSQL Databases [currently 150]

(17)

NoSQL

§  The concept has gained momentum in recent years.

§  Today it is a mature and efficient alternative that can help us solve problems of scalability and performance (e.g. online applications with thousands of concurrent users and millions of hits a day).

(18)

NoSQL on Google Trends

Source: http://www.google.com/trends/explore#q=NoSQL

(19)

Different Types of NoSQL Systems

•  Distributed Key-Value Systems

–  Amazon's Dynamo key-value store

–  Voldemort (LinkedIn)

–  Cassandra (Facebook)

•  Column-based Systems

–  BigTable (Google)

–  HBase

–  Cassandra

•  Document-based systems

–  CouchDB

–  MongoDB

(20)

Common Themes

§  Horizontal scalability

§  Clever use of hashing and caching

§  Parallel execution of queries

–  move queries to the data, not the other way around

§  Share resources when possible

–  Example: memcached protocol

§  Use simple interfaces when possible

–  put, get, delete

Source: Kelly-McCreary & Associates, LLC

http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
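Two of these themes, hashing and a simple put/get/delete interface, can be combined in a short sketch. The class `PartitionedStore` is hypothetical and uses plain modulo hashing over in-memory dictionaries; production systems typically use more elaborate schemes such as consistent hashing, so that adding a node does not remap most keys.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that spreads keys over several nodes by hash."""

    def __init__(self, num_nodes):
        if num_nodes < 1:
            raise ValueError("need at least one node")
        # Each dict stands in for the storage of one node.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # MD5 gives a hash that is stable across runs, unlike Python's
        # built-in hash() for strings, which is randomized per process.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

    def delete(self, key):
        """Remove the key; return True if it existed."""
        node = self._node_for(key)
        if key in node:
            del node[key]
            return True
        return False


store = PartitionedStore(num_nodes=4)
store.put("user:1", "alice")
store.put("user:2", "bob")
```

Because the node is derived from the key alone, any client can locate a record without asking a central index.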

