Databases 2 (VU) (707.030)
Introduction to NoSQL
Denis Helic
KMI, TU Graz
Outline
1 NoSQL Motivation 2 NoSQL Systems 3 NoSQL Examples 4 NoSQL Approaches SlidesSlides are based on “Introduction to Databases” course from Stanford University by Jennifer Widom
NoSQL Motivation
NoSQL: What is in the name?
Very often “SQL” = traditional relations DBMS Experience from the past decade: not every data
management/analysis problem is best solved using a traditional relational DBMS
“NoSQL” = not using traditional relational DBMS “NoSQL” 6= don’t use SQL language
NoSQL Motivation
Databases: Types of problems
Not every data management/analysis problem is best solved using a traditional DBMS
DBMS
Database Management System (DBMS) provides . . . efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data.
NoSQL Motivation
Databases: Types of problems
Not every data management/analysis problem is best solved using a traditional DBMS Convenient Multi-user Safe Persistent Reliable Massive Efficient
NoSQL Motivation
Databases: Types of problems
Not every data management/analysis problem is best solved using a traditional DBMS
E.g. a tagging system
http://delicious.com/ http://www.bibsonomy.org/
The relation between tags and URLs is a “many-to-many” (m : n) relation
NoSQL Motivation
Tagging
url id url 1 kti.tugraz.at 2 www.tugraz.az . . . .NoSQL Motivation
Tagging
tag id tag 1 knowledge 2 data 3 school . . . .Table: Tag table
NoSQL Motivation
Tagging
tag id url id 1 1 2 1 3 2 . . . .NoSQL Motivation
Tagging
To retrieve tags and URLs you JOIN the tables and select records that fulfill a criteria
E.g. to form a tag cloud
http://www.bibsonomy.org/
How to retrieve related tags?
NoSQL Motivation
Tagging
Retrieve all URLs for a tag, then for each URL retrieve all tags “Hub” tags have millions of URLs
That is just one-hop neighbors Two-hop, three-hop neighbors? What is the problem here?
NoSQL Motivation
Tagging
Retrieve all URLs for a tag, then for each URL retrieve all tags “Hub” tags have millions of URLs
That is just one-hop neighbors Two-hop, three-hop neighbors? What is the problem here?
Relational data model is not well-suited for tagging
NoSQL Motivation
Tagging
knowledge data school ...
NoSQL Motivation
Tagging
Graph-based model is better suited Bipartite graph
Finding related tags is equivalent to traversing graphs E.g. breadth-first search
NoSQL Systems
NoSQL systems
Alternative to traditional relational DBMS Advantages
Flexible schema
Quicker/cheaper to set up Massive scalability
NoSQL Systems
NoSQL systems
Alternative to traditional relational DBMS Disadvantages
No declarative query language → more programming Relaxed consistency → fewer guarantees
NoSQL Examples
Example: Web log analysis
Each record: UserID, URL, timestamp, additional-info Task: Load into database system
NoSQL Examples
Example: Web log analysis
Relational DBMS 1 Data cleaning 2 Data extraction 3 Verification 4 Schema specification NoSQL 1 Nothing
NoSQL Examples
Example: Web log analysis
Each record: UserID, URL, timestamp, additional-info Task: Find all records for:
Given UserID Given URL Given timestamp
Certain construct appearing in additional-info
None of the operations requires SQL: “NoSQL” Highly parallelizable
NoSQL Examples
Example: Web log analysis
Each record: UserID, URL, timestamp, additional-info Task: Find all pairs of UserIDs accessing same URL Here a SELFJOIN is required
But very unlikely that we will have such a query at all
NoSQL Examples
Example: Web log analysis
Each record: UserID, URL, timestamp, additional-info Separate records: UserID, name, age, gender, . . . Task: Find average age of user accessing given URL SQL-like
Some aspects of NoSQL: do we need consistency? Statistical operation: e.g. we could sample
NoSQL Examples
Example: Social network
Each record: UserID1, UserID2
Separate records: UserID, name, age, gender, . . . Task: Find all friends of given user
Simple operation: we do not need a complicated query language
NoSQL Examples
Example: Social network
Each record: UserID1, UserID2
Separate records: UserID, name, age, gender, . . . Task: Find all friends of friends given user The number of JOINs increases
NoSQL Examples
Example: Social network
Each record: UserID1, UserID2
Separate records: UserID, name, age, gender, . . .
Task: Find all women friends of men friends of given user The number of JOINs increases
NoSQL Examples
Example: Social network
Each record: UserID1, UserID2
Separate records: UserID, name, age, gender, . . .
Task: Find all friends of friends of friends of . . . friends of given user The number of JOINs explodes
Graph-based data model Do we need consistency?
NoSQL Examples
Example: Wikipedia
Large collection of documents
Combination of structured and unstructured data
Task: Retrieve introductory paragraph of all pages about U.S. presidents before 1900
Not suitable for SQL systems because of the mix of structured and unstructured database
Do we need consistency?
NoSQL Examples
NoSQL systems
Alternative to traditional relational DBMS Advantages
Flexible schema
Quicker/cheaper to set up Massive scalability
NoSQL Examples
NoSQL systems
Alternative to traditional relational DBMS Disadvantages
No declarative query language → more programming Relaxed consistency → fewer guarantees
NoSQL Approaches
Types of NoSQL systems
Map-Reduce framework Key-value stores
Document stores Graph database systems
NoSQL Approaches
Map-Reduce Framework
Originally from Google Open source Hadoop
No data model, data stored in files User provides specific functions
NoSQL Approaches
Map-Reduce Framework
System provides data processing “glue”, fault-tolerance, scalability Map and Reduce functions
Map: Divide problem into subproblems
NoSQL Approaches
Key-Value Stores
Extremely simple interface Data model: (key, value) pairs
Operations: Create(key,value), Retrieve(key), Update(key), Delete(key)
CRUD operations
NoSQL Approaches
Key-Value Stores
Implementation: efficiency, scalability, fault-tolerance Records distributed to nodes based on key
Replication
NoSQL Approaches
Key-Value Stores
Some allow (non-uniform) columns within value Some allow Retrieve on range of keys
Example systems
Google BigTable, Amazon Dynamo, Cassandra, Voldemort, HBase, etc.
NoSQL Approaches
Document Stores
Like Key-Value Stores except value is document Data model: (key, document) pairs
Document: JSON, XML, other semistructured formats
Basic operations: Create(key,document), Retrieve(key), Update(key), Delete(key)
NoSQL Approaches
Document Stores
Also Retrieve based on document contents Example systems
CouchDB, MongoDB, SimpleDB, etc.
NoSQL Approaches
Graph database systems
Data model: nodes and links
Nodes may have properties (including ID) Links may have labels, roles, or other properties
NoSQL Approaches
Graph database systems
Interfaces and query languages vary
Single-step versus “path expressions” versus full recursion Example systems
Neo4j, FlockDB, Pregel, etc.
RDF “triple stores” can map to graph databases