NoSQL Database Basics
Spring 2013
Jordi Torres, UPC - BSC
Cloud Computing – MIRI (CLC-MIRI)
UPC Master in Innovation & Research in Informatics
Course Notes in
Transparency Format
HDFS
Hadoop: standard storage mechanism
§ Hadoop Distributed File System (HDFS)
– Fault tolerance
• Assuming that failure will happen allows HDFS to run on commodity hardware.
– Streaming data access
• HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.
– Extreme scalability
• HDFS will scale to petabytes (current versions)
Hadoop: standard storage mechanism
§ Hadoop Distributed File System (HDFS)
– Most HDFS applications need a write-once-read-many access model for files
• By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
– “Moving Computation is Cheaper than Moving Data”: Locality of computation
• Due to data volume, it is often much faster to move the program near to the data
→ HDFS has features to facilitate this.
Hadoop: standard storage mechanism
Starting point
http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html
Hadoop: standard storage mechanism
§ HDFS Interface
– Interface similar to that of regular filesystems.
– Can only store and retrieve data, not index it.
§ Simple random access to data is not possible.
§ Solution: higher-level layers → HBase
• have been created to provide finer-grained functionality to Hadoop deployments
[Figure: MapReduce and HBase layered on top of HDFS]
HBase, the Hadoop Database
§ HBase
– Creates indexes → offers fast, random access to its content
– Modeled after Google's BigTable DB
– Is a column-oriented database designed to store massive amounts of data
– Uses HDFS as a storage system
§ It belongs to the NoSQL universe
– similar to Cassandra, Hypertable, …
[Figure: MapReduce on top of HBase, on top of HDFS]
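The difference between "store and retrieve only" and "indexed, random access" can be sketched with a toy in-memory model. This is only an illustration: `AppendOnlyLog` and `IndexedStore` are hypothetical names, and nothing here reflects how HDFS or HBase are actually implemented. Without an index, finding a record means scanning; with a sorted key index, a lookup goes straight to the record.

```python
import bisect


class AppendOnlyLog:
    """Toy stand-in for a store that can only append and scan (hypothetical)."""

    def __init__(self):
        self.records = []  # (key, value) pairs in arrival order

    def append(self, key, value):
        self.records.append((key, value))
        return len(self.records) - 1  # offset of the new record

    def scan(self, key):
        """Sequential search: returns (value, records_touched)."""
        touched = 0
        for k, v in self.records:
            touched += 1
            if k == key:
                return v, touched
        return None, touched


class IndexedStore:
    """Keeps a sorted key -> offset index on top of the log (hypothetical)."""

    def __init__(self, log):
        self.log = log
        self.keys = []     # sorted keys
        self.offsets = []  # offsets aligned with self.keys

    def put(self, key, value):
        offset = self.log.append(key, value)
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            self.offsets[pos] = offset  # the newest version wins
        else:
            self.keys.insert(pos, key)
            self.offsets.insert(pos, offset)

    def get(self, key):
        pos = bisect.bisect_left(self.keys, key)  # binary search, no scan
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.log.records[self.offsets[pos]][1]
        return None


log = AppendOnlyLog()
store = IndexedStore(log)
for i in range(1000):
    store.put(f"row{i:04d}", i * i)

value_by_scan, touched = log.scan("row0999")  # reads all 1000 records
value_by_index = store.get("row0999")         # binary search on the index
```

Note that an update is still an append to the underlying log; only the index entry changes, which is one simple way to offer record updates on top of append-only storage.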
HBase versus HDFS (a brief comparison)
§ HDFS
– Optimized for:
• Large files
• Sequential access (high throughput)
• Append only
– Use for fact tables that are mostly append-only and require sequential full-table scans.
§ HBase
– Optimized for:
• Small records (but many records)
• Random access
• Atomic record updates
– Use for dimension lookup tables which are updated frequently and require random low-latency lookups.
HDFS: an example
§ A given file
– is broken down into blocks (default = 64 MB),
[Figure: a file split into blocks 1 2 3 4 5]
HDFS: an example
– then blocks are replicated across the cluster (default = 3).
[Figure: blocks 1 2 3 4 5, each stored three times across five nodes holding {2,3,4}, {1,3,5}, {1,3,4}, {2,4,5} and {1,2,5}]
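The two steps of this example (split into blocks, then replicate) can be sketched as follows. The 64 MB block size and the replication factor of 3 come from the slides; the function names, the node names A to E and the round-robin placement are assumptions made for illustration. Real HDFS placement also takes racks into account.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # default block size from the slides (64 MB)
REPLICATION = 3                # default replication factor from the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file of file_size bytes."""
    return -(-file_size // block_size)  # ceiling division


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is an assumption for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in nodes}
    ring = itertools.cycle(nodes)
    for block in range(1, num_blocks + 1):
        # Consecutive nodes on the cycle are distinct as long as
        # replication <= len(nodes), so no node gets a block twice.
        for _ in range(replication):
            placement[next(ring)].append(block)
    return placement


num_blocks = split_into_blocks(300 * 1024 * 1024)  # hypothetical 300 MB file
placement = place_replicas(num_blocks, ["A", "B", "C", "D", "E"])
```

A 300 MB file needs five 64 MB blocks, so the sketch ends up with the same shape as the figure: five blocks, three copies each, spread over five nodes.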
MapReduce: Resource Management
§ Scheduling
– A given job is broken down into tasks,
– then tasks are scheduled to be as close to data as possible.
– Optimized for
• Batch processing
• Failure recovery
[Figure: tasks placed on the five nodes holding block replicas {2,3,4}, {1,3,5}, {1,3,4}, {2,4,5} and {1,2,5}]
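A minimal sketch of "schedule tasks as close to the data as possible", assuming one task per block and a fixed number of task slots per node. The greedy strategy, the function name `schedule_tasks` and the node names A to E are assumptions for illustration, not the actual Hadoop scheduler.

```python
def schedule_tasks(block_locations, slots_per_node):
    """Greedy locality-aware assignment, one task per block.

    block_locations: dict block -> list of nodes holding a replica
    slots_per_node:  dict node -> number of tasks the node may run
    Returns dict block -> (node, is_local).
    """
    free = dict(slots_per_node)
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        # First choice: a node that already stores a replica of the block.
        local = [n for n in holders if free.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: (free[n], n))
            is_local = True
        else:
            # Fallback: any node with a free slot; the block must be fetched.
            candidates = [n for n, slots in free.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free slots left")
            node = max(candidates, key=lambda n: (free[n], n))
            is_local = False
        free[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Replica layout from the HDFS example; node names A-E are hypothetical.
block_locations = {
    1: ["B", "C", "E"],
    2: ["A", "D", "E"],
    3: ["A", "B", "C"],
    4: ["A", "C", "D"],
    5: ["B", "D", "E"],
}
assignment = schedule_tasks(block_locations, {n: 1 for n in "ABCDE"})
```

Because every block has three replicas, the scheduler has three candidate nodes per task. The same redundancy helps failure recovery: if a node is lost, its tasks can be rerun on another node that holds a copy.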
Common characteristics of NoSQL
§ Shared nothing systems
– Shared nothing systems have proven to be most cost-effective and flexible
[Figure: three architectures. Shared RAM: CPUs sharing RAM over a bus. Shared Disk: CPU+RAM nodes on a LAN sharing a SAN. Shared Nothing: CPU+RAM+Disk nodes connected only by a LAN.]
Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://
Common characteristics of NoSQL
§ Distributed models
[Figure: two models side by side. Master-Slave: requests go to a master placed in front of the nodes; a standby master is used only if the primary master fails. Peer-to-Peer: requests go to the nodes themselves.]
Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
Common characteristics of NoSQL
§ Move Queries to the Nodes
– Queries work best if they run on the local node that has the data
[Figure: a query sent to 16 nodes, each pairing a Database with a local MapReduce engine]
Source: http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-
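A small sketch of moving the query to the nodes: each node filters and summarizes its own records, and only the small partial results travel over the network to be merged. Everything here runs in one process for illustration; the record fields (`type`, `ms`), the node names and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def local_query(records, predicate):
    """Runs on the node that holds `records`; returns only a small summary."""
    return Counter(r["type"] for r in records if predicate(r))


def run_everywhere(partitions, predicate):
    """Send the query to every partition, then merge the partial results."""
    partials = [local_query(records, predicate) for records in partitions.values()]
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical data, already partitioned across three nodes.
partitions = {
    "node1": [{"type": "click", "ms": 120}, {"type": "view", "ms": 30}],
    "node2": [{"type": "click", "ms": 45}, {"type": "click", "ms": 300}],
    "node3": [{"type": "view", "ms": 500}],
}
slow_by_type = run_everywhere(partitions, lambda r: r["ms"] >= 100)
```

In a real deployment the list comprehension in `run_everywhere` would be parallel remote calls. The point of the pattern is that the data shipped back (one counter per node) is much smaller than the records themselves.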
Alternatives to HBase/HDFS?
§ Cassandra, an Apache project, originated at Facebook and is now in production at many large-scale websites (also at BSC).
§ Hypertable was created at Zvents and spun out as an open source project.
§ Both are scalable column-store databases that follow the pattern of BigTable, similar to HBase.
[Figure: MapReduce on top of Cassandra; MapReduce on top of Hypertable]
And … dozens
§ http://nosql-database.org
– List of NoSQL Databases [currently 150]
NoSQL
§ The concept has gained momentum in recent years
§ Today it is a mature and efficient alternative that can help us solve problems of scalability and performance
(e.g. online applications with thousands of concurrent users and millions of hits a day)
NoSQL on Google Trends
Source: http://www.google.com/trends/explore#q=NoSQL
Different Types of NoSQL Systems
• Distributed Key-Value Systems
– Amazon’s S3 Key-Value Store (Dynamo)
– Voldemort (LinkedIn)
– Cassandra (Facebook)
– …
• Column-based Systems
– BigTable (Google)
– HBase
– Cassandra
– …
• Document-based Systems
– CouchDB
– MongoDB
– …
Common Themes
§ Horizontal scalability
§ Clever use of hashing and caching
§ Parallel execution of queries
– move queries to the data, not the other way around
§ Share resources when possible
– Example: memcached protocol
§ Use simple interfaces when possible
– put, get, delete
Source: Kelly-McCreary & Associates, LLC
http://www.slideshare.net/Couchbase/webinar-making-sense-of-nosql-applying-nonrelational-databases-to-business-needs?ref=http://www.slideshare.net/slideshow/embed_code/18124982?rel=0
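Two of these themes, hashing for horizontal scalability and a simple put/get/delete interface, can be combined in a short sketch. It uses consistent hashing with virtual nodes, which is one common technique and an assumption here rather than something the slides prescribe; `HashRing`, `KeyValueCluster`, the node names and the choice of 64 virtual nodes are all hypothetical.

```python
import bisect
import hashlib


def _hash(text):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent hashing: each node owns many points on a ring of hashes."""

    def __init__(self, nodes, vnodes=64):
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        pos = bisect.bisect_right(self._hashes, _hash(key)) % len(self._points)
        return self._points[pos][1]


class KeyValueCluster:
    """put / get / delete routed to the node chosen by the ring."""

    def __init__(self, nodes):
        self.ring = HashRing(nodes)
        self.data = {node: {} for node in nodes}  # one dict per simulated node

    def put(self, key, value):
        self.data[self.ring.node_for(key)][key] = value

    def get(self, key, default=None):
        return self.data[self.ring.node_for(key)].get(key, default)

    def delete(self, key):
        shard = self.data[self.ring.node_for(key)]
        if key in shard:
            del shard[key]
            return True
        return False


cluster = KeyValueCluster(["n1", "n2", "n3"])
cluster.put("user:42", {"name": "Ada"})
found = cluster.get("user:42")
removed = cluster.delete("user:42")
```

The reason to prefer a ring over `hash(key) % number_of_nodes` is what happens when the cluster grows: adding a node to the ring only moves the keys that the new node takes over, while the modulo scheme reshuffles most keys.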