Comparing Scalable NOSQL Databases


Functionalities and Measurements

Dory Thibault, UCL

Contact: [email protected]

Sponsor: Euranova

Website: nosqlbenchmarking.com

February 15, 2011


Overview of the databases

Methodology

Results

Summary and conclusion

Clarifications

Many people who read these slides did not get the oral explanations that MUST go with them, so here are a few words of warning:

All the databases were used with their default configurations; I will post them soon on nosqlbenchmarking.com

No index was set manually; doing so could have a big impact on performance

Don't jump to conclusions too fast: it would be WRONG to say that Cassandra is very good and that HBase sucks.

The Cassandra implementation of MapReduce seems to be buggy and does not scale.

There must be something wrong with my HBase configuration: HBase is known to run gigantic clusters without problems.


Clarifications

Also keep in mind that a benchmark is always biased by the chosen methodology, so:

The way I store data in each database could have an impact on performance

The summary of the results should not be taken in an absolute way, especially the first one. When I say Good or Bad, it is in THIS particular case.

Moreover, raw results are not the most important thing; scalability is very important too. So good performance for Cassandra MapReduce without scalability is NOT good.

The data set is too small, so I'm testing cache performance (but this is the same for all of the databases)


Motivation

YCSB

The Yahoo! Cloud Serving Benchmark is the best-known NoSQL benchmarking application, so why make another one?

YCSB uses data generated from statistical distributions instead of real data

YCSB only focuses on read/write/update/scan performance

YCSB results for elasticity are not conclusive

Idea

Data and use case inspired by a concrete case: Wikipedia

Test read/update performance

Test MapReduce performance by computing an inverted index


Cassandra 0.6.10, HBase 0.20.6, mongoDB 1.6.5, Riak 0.14

Cassandra 0.6.10

Overview

Cassandra is a fully distributed, column-oriented data store that provides a MapReduce implementation using Hadoop.

All the nodes in the cluster play the same role

The data (existing and new) are sharded automatically among the nodes

The developer can choose the consistency level for each request


HBase 0.20.6

Overview

HBase is a column-oriented database that aims to provide low-latency requests on top of Hadoop HDFS

An HBase cluster uses several kinds of servers:

HDFS needs at least one namenode and several datanodes

HBase needs a ZooKeeper cluster, a master and several regionservers

The requests must be made to the master(s)

On the HDFS level, existing data are not sharded automatically but new data are

On the HBase level, the data are divided into regions that are sharded automatically across regionservers


mongoDB 1.6.5

Overview

mongoDB is a document-oriented database that stores JSON dictionaries. It provides auto-sharding and a MapReduce implementation.

A mongoDB cluster is made of several kinds of servers:

The shard servers that store data

The configuration servers that store the configuration

The router servers that receive and route the requests

Existing and new data are sharded automatically

MapReduce can only use one thread per server


Riak 0.14

Overview

Riak is a fully distributed key/bucket store with an implementation of MapReduce.

Buckets can store the data directly or be a link to another bucket

All the nodes in the cluster play the same role

The data (existing and new) are sharded automatically among the nodes

The developer can choose the consistency level for each request


The data used

The client

The methodology

The data

Wikipedia export

20,000 pages downloaded from Wikipedia

Every document is in XML format

All documents add up to 620 MB

Each document is associated with a single integer ID

Insertions

Each document is inserted only once during the whole benchmark


The methodology

The client

Overview

Fully random requests

Acts as a perfect load balancer

The proportion of updates can be specified

Specific parts: read/write/update and MapReduce

Updates

The updates simply concatenate the string "1" at the end of the article.
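
The client behaviour described on this slide can be sketched as follows. This is a minimal illustration and not the benchmark's actual client: the names `next_request`, `apply_update` and `run`, the in-memory dict standing in for a database, and the uniform random choice of node used to spread the load are all assumptions.

```python
import random


def next_request(rng, nodes, doc_ids, update_ratio):
    """Pick a random node, a random document and an operation type."""
    node = rng.choice(nodes)      # assumption: uniform choice spreads the load
    doc_id = rng.choice(doc_ids)  # fully random document selection
    op = "update" if rng.random() < update_ratio else "read"
    return node, doc_id, op


def apply_update(article):
    """An update appends the string "1" to the end of the article."""
    return article + "1"


def run(store, nodes, total_requests, update_ratio, seed=0):
    """Issue total_requests random requests against a dict-like store."""
    rng = random.Random(seed)
    doc_ids = list(store)
    counts = {"read": 0, "update": 0}
    for _ in range(total_requests):
        _node, doc_id, op = next_request(rng, nodes, doc_ids, update_ratio)
        if op == "update":
            store[doc_id] = apply_update(store[doc_id])
        else:
            _ = store[doc_id]  # a read only fetches the article
        counts[op] += 1
    return counts
```

In a real run, the node returned by `next_request` would decide which server receives the request; here it is ignored because the store is local.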


The methodology

MapReduce

Overview

MapReduce is used to build a reverse index for a given keyword.

The reverse index is a list of pairs made of:

ID: the ID of the article, if Count ≠ 0

Count: the number of occurrences of the keyword in this article

Justification

This kind of computation requires crawling all the documents and takes advantage of the specifics of MapReduce
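
The logic of this computation can be sketched in plain Python. The real jobs ran inside each database's own MapReduce framework, so the function names below are illustrative only, and counting with `str.count` (case-sensitive, non-overlapping substring matches) is an assumption about what counts as an occurrence.

```python
from collections import defaultdict


def map_article(article_id, text, keyword):
    """Map step: emit (article_id, count) only when the keyword occurs."""
    count = text.count(keyword)
    if count != 0:
        yield article_id, count


def reduce_counts(pairs):
    """Reduce step: merge partial counts per article ID into a sorted list."""
    totals = defaultdict(int)
    for article_id, count in pairs:
        totals[article_id] += count
    return sorted(totals.items())


def reverse_index(articles, keyword):
    """Run the map step over every article, then reduce the emitted pairs."""
    emitted = []
    for article_id, text in articles.items():
        emitted.extend(map_article(article_id, text, keyword))
    return reduce_counts(emitted)
```

Every article has to be visited by the map step, which is why this workload exercises the whole data set rather than a few hot keys.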


The methodology

1 Start up a clean cluster of size 3 and insert all the documents

2 Choose a total number of requests and a read percentage, then start the benchmark

3 Wait one minute and start the benchmark again

4 Wait five minutes and start the benchmark again

5 Start the MapReduce benchmark

6 Add a new node to the cluster, wait for it to be ready, then immediately restart the benchmark with the new node's IP in the list

7 Jump to 3 until there are no more computers to add to the cluster
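
The steps above can be read as the following control loop. This is only a sketch under assumptions: all the callables (`insert_all`, `run_benchmark`, `run_mapreduce`, `add_node`) are hypothetical placeholders for the real tooling, and it assumes that steps 3 to 5 are run one last time after the final node has been added.

```python
import time


def run_procedure(cluster, spare_nodes, insert_all, run_benchmark,
                  run_mapreduce, add_node, sleep=time.sleep):
    """Drive the benchmark; cluster and spare_nodes are lists of node IPs."""
    insert_all(cluster)             # step 1: load every document once
    run_benchmark(cluster)          # step 2: first read/update run
    while True:
        sleep(60)                   # step 3: wait one minute, run again
        run_benchmark(cluster)
        sleep(300)                  # step 4: wait five minutes, run again
        run_benchmark(cluster)
        run_mapreduce(cluster)      # step 5: MapReduce benchmark
        if not spare_nodes:         # step 7: stop when no machine is left
            break
        node = spare_nodes.pop(0)
        add_node(cluster, node)     # step 6: assumed to block until ready
        cluster.append(node)        # the new IP joins the request list
        run_benchmark(cluster)      # restart immediately with the new node
```

Passing `sleep` as a parameter makes it possible to dry-run the sequence without actually waiting.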


Read/update results


Read/update results without HBase


MapReduce performance


The HBase case

Verifications made:

Checked the logs: nothing seemed problematic

HDFS level: running the balancer with a very low threshold distributed the blocks evenly, but without any impact on performance

HBase level: the regions were always nearly evenly distributed across the regionservers

The number of rows did not change and the content of each row was correct


Summary of raw performances

DB          Read/update performance   MapReduce performance

Cassandra   Good                      Very good

HBase       Bad / N.A.                Average / N.A.

mongoDB     Good                      Poor but scalable

Riak        Poor / unstable           Average but scalable


Summary of scalability

Going from 3 to 8 servers is a 266% increase in capacity; here are the observed increases in performance:

DB                   Read/update   MapReduce

Cassandra            153%          112%

HBase                11%           43%

mongoDB              145%          211%

Riak                 74%           189%

Riak (7 nodes max)   155%          168%
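
One way to read these figures is to divide each observed increase by the 266% stated for the move from 3 to 8 servers. The sketch below does only that; it assumes the table values use the same convention as the 266% reference, which the slides do not spell out, and the names `LINEAR_REFERENCE`, `fraction_of_linear` and `summarize` are illustrative.

```python
# Reference figure stated on the slide for going from 3 to 8 servers (percent).
LINEAR_REFERENCE = 266.0

# Observed increases in percent, copied from the table: (read/update, MapReduce).
# The "Riak (7 nodes max)" row is left out because it covers a different
# cluster size, so the 3-to-8 reference does not apply to it.
OBSERVED = {
    "Cassandra": (153.0, 112.0),
    "HBase": (11.0, 43.0),
    "mongoDB": (145.0, 211.0),
    "Riak": (74.0, 189.0),
}


def fraction_of_linear(observed, reference=LINEAR_REFERENCE):
    """Share of the reference increase that was actually observed."""
    return observed / reference


def summarize(observed=OBSERVED):
    """Return {db: (read/update share, MapReduce share)}, rounded to 2 places."""
    return {
        db: tuple(round(fraction_of_linear(value), 2) for value in values)
        for db, values in observed.items()
    }


if __name__ == "__main__":
    for db, (read_update, map_reduce) in summarize().items():
        print(f"{db:10s} read/update={read_update:.2f} MapReduce={map_reduce:.2f}")
```

A value of 1.0 would mean the gain matched the reference exactly; every entry here falls below it.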


Conclusion and future work

Conclusion

The elasticity gain seems more apparent than with YCSB, but it is not linear either

It is worth testing MapReduce performance, as the results vary a lot between databases for both raw performance and scalability

Future work

This is still a work in progress:

Applying this benchmark to other databases (Terrastore, Voldemort, Scalaris...)


Questions and remarks

Any questions or remarks?
