
Comparing Scalable NOSQL Databases

Functionalities and Measurements

Dory Thibault, UCL

Contact: [email protected]

Sponsor: Euranova

Website: nosqlbenchmarking.com

February 15, 2011


Outline: Overview of the databases, Methodology, Results, Summary and conclusion

Clarifications

Many people read these slides without the oral explanations that MUST go with them, so here are a few words of warning:

All the databases were used with their default configurations; I will post them soon on nosqlbenchmarking.com

No index was set manually; doing so could have a big impact on performance

Don't jump to conclusions: it would be WRONG to say that Cassandra is very good and that HBase sucks.

The Cassandra implementation of MapReduce seems to be buggy and does not scale.

There must be something wrong with my HBase configuration: HBase is known to run gigantic clusters without problems.


Clarifications (continued)

Also keep in mind that a benchmark is always biased by the chosen methodology, so:

The way I store data in each database could have an impact on performance

The summary of the results should not be taken in an absolute way, especially the first one. When I say Good or Bad, it is in THIS particular case.

Moreover, raw results are not the most important thing; scalability is very important too. So good performance for Cassandra MapReduce without scalability is NOT good.

The data set is too small, so I am really testing cache performance (but this is the same for all of the databases)


Motivation

YCSB

Yahoo! Cloud Serving Benchmark is the best-known NoSQL benchmarking application, so why make another one?

YCSB uses data generated from statistical distributions instead of real data

YCSB only focuses on read/write/update/scan performance

YCSB results for elasticity are not conclusive

Idea

Data and use case inspired by a concrete case: Wikipedia

Test read/update performance

Test MapReduce performance by computing an inverted index


Databases covered: Cassandra 0.6.10, HBase 0.20.6, mongoDB 1.6.5, Riak 0.14

Cassandra 0.6.10

Overview

Cassandra is a fully distributed column-oriented data store that provides a MapReduce implementation using Hadoop.

All the nodes in the cluster play the same role

The data (existing and new) are sharded automatically among the nodes

The developer can choose the consistency level for each request
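What a per-request consistency choice means in practice can be illustrated with the usual replica-overlap rule for quorum-based stores. This is a generic sketch and not Cassandra's client API: the names `n` (number of replicas), `r` (replicas contacted on a read) and `w` (replicas contacted on a write) are assumptions introduced here for illustration.

```python
def quorum(n: int) -> int:
    """Smallest number of replicas that forms a majority of n replicas."""
    return n // 2 + 1


def read_overlaps_write(n: int, r: int, w: int) -> bool:
    """True when any read set of size r must share a replica with any
    write set of size w, which holds exactly when r + w > n."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    n = 3  # hypothetical replication factor
    print(quorum(n))                      # majority size for 3 replicas
    print(read_overlaps_write(n, 2, 2))   # quorum reads and quorum writes
    print(read_overlaps_write(n, 1, 1))   # single-replica reads and writes
```

Choosing small `r` and `w` favours latency, while choosing values whose sum exceeds `n` guarantees that a read contacts at least one replica that acknowledged the latest write.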


HBase 0.20.6

Overview

HBase is a column oriented database that aims to provide low latency requests on top of Hadoop HDFS

An HBase cluster uses several kinds of servers :

HDFS needs at least one namenode and several datanodes

HBase needs a ZooKeeper cluster, a master and several regionservers

The requests must be made to the master(s)

On the HDFS level, existing data are not sharded automatically but new data are

On the HBase level, the data are divided into regions that are sharded automatically across regionservers


mongoDB 1.6.5

Overview

mongoDB is a document-oriented database that stores JSON dictionaries. It provides auto-sharding and a MapReduce implementation.

A mongoDB cluster is made of several kinds of servers :

The shard servers that store data

The configuration servers that store the configuration

The router servers that receive and route the requests

Existing and new data are sharded automatically

MapReduce can only use one thread per server


Riak 0.14

Overview

Riak is a fully distributed key/bucket store with an implementation of MapReduce.

Buckets can store the data directly or be a link to another bucket

All the nodes in the cluster play the same role

The data (existing and new) are sharded automatically among the nodes

The developer can choose the consistency level for each request


The data used The client

The methodology

The data

Wikipedia export

20,000 pages downloaded from Wikipedia

Every document is in XML format

All documents add up to 620 MB

Each document is associated with a single integer ID

Insertions

Each document is inserted only once during the whole benchmark


The client

Overview

Fully random requests

Acts as a perfect load balancer

The proportion of updates can be specified

Specific parts: read/write/update and MapReduce

Updates

The updates simply concatenate the string "1" at the end of the article.
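The behaviour of the client can be sketched as follows. This is an illustration under stated assumptions, not the actual benchmark client: the store is a plain in-memory dict, `run_client` and its parameters are hypothetical names, and "perfect load balancer" is interpreted here as round-robin dispatch over the node list.

```python
import random
from collections import Counter


def run_client(store, nodes, total_requests, read_percentage, seed=None):
    """Issue random read/update requests against `store` (dict: id -> text).

    Returns (reads, updates, requests_per_node).
    """
    rng = random.Random(seed)
    doc_ids = list(store)
    reads = updates = 0
    per_node = Counter()
    for i in range(total_requests):
        # Assumption: round-robin dispatch spreads requests evenly over nodes.
        per_node[nodes[i % len(nodes)]] += 1
        doc_id = rng.choice(doc_ids)  # fully random target document
        if rng.random() * 100 < read_percentage:
            _ = store[doc_id]         # read
            reads += 1
        else:
            store[doc_id] += "1"      # update: append "1" to the article
            updates += 1
    return reads, updates, dict(per_node)


if __name__ == "__main__":
    articles = {1: "first article", 2: "second article"}  # hypothetical data
    print(run_client(articles, ["node-a", "node-b", "node-c"], 1000, 80, seed=42))
```

Passing a `seed` makes a run reproducible, which helps when comparing the same request mix across several databases.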


MapReduce

Overview

MapReduce is used to build a reverse index for a given keyword.

The reverse index is a list of pairs made of:

ID: the ID of the article, if Count ≠ 0

Count: the number of occurrences of the keyword in this article

Justification

This kind of computation requires crawling all the documents and takes advantage of the specifications of MapReduce
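The reverse-index computation can be sketched in plain Python, without any of the four databases. The function names are hypothetical, and counting case-insensitive whitespace-separated tokens is an assumption; the slides do not say how occurrences were matched.

```python
from collections import defaultdict


def map_article(doc_id, text, keyword):
    """Map phase: emit (keyword, (ID, Count)) only when Count is non-zero."""
    count = text.lower().split().count(keyword.lower())
    if count != 0:
        yield keyword, (doc_id, count)


def reduce_pairs(pairs):
    """Reduce phase: gather the (ID, Count) pairs into one sorted list."""
    return sorted(pairs)


def build_reverse_index(documents, keyword):
    """Run the map phase over every document, then reduce per key."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_article(doc_id, text, keyword):
            grouped[key].append(value)
    return reduce_pairs(grouped[keyword])


if __name__ == "__main__":
    docs = {1: "nosql is nosql", 2: "sql only", 3: "NoSQL rocks"}  # toy data
    print(build_reverse_index(docs, "nosql"))  # [(1, 2), (3, 1)]
```

Every document goes through the map phase, which is why this job exercises a full scan of the data set.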


The methodology

1. Start up a clean cluster of size 3 and insert all the documents
2. Choose a total number of requests and a read percentage, then start the benchmark
3. Wait one minute and start the benchmark again
4. Wait five minutes and start the benchmark again
5. Start the MapReduce benchmark
6. Add a new node to the cluster and wait for it to be ready, then immediately restart the benchmark with the new node's IP in the list
7. Jump to 3 until there are no more computers to add to the cluster
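The steps above can be read as a small driver loop. The sketch below is one interpretation and not the author's tooling: `run_benchmark` and `run_mapreduce` are hypothetical callbacks, step 1 (starting the cluster and inserting the documents) is assumed to be done by the caller, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time


def run_methodology(initial_nodes, spare_nodes, run_benchmark, run_mapreduce,
                    sleep=time.sleep):
    """Return a list of (kind, cluster_size, result) tuples in execution order."""
    cluster = list(initial_nodes)   # step 1 is assumed already done
    spares = list(spare_nodes)
    results = []

    def bench():
        results.append(("benchmark", len(cluster), run_benchmark(cluster)))

    bench()                         # step 2: first run
    while True:
        sleep(60)                   # step 3: wait one minute, run again
        bench()
        sleep(300)                  # step 4: wait five minutes, run again
        bench()
        results.append(("mapreduce", len(cluster), run_mapreduce(cluster)))  # step 5
        if not spares:              # step 7: stop when no machine is left
            break
        cluster.append(spares.pop(0))  # step 6: add a node ...
        bench()                        # ... and restart immediately
    return results


if __name__ == "__main__":
    log = run_methodology(
        ["10.0.0.1", "10.0.0.2", "10.0.0.3"],   # hypothetical IPs
        ["10.0.0.4"],
        run_benchmark=lambda nodes: f"bench on {len(nodes)} nodes",
        run_mapreduce=lambda nodes: f"mapreduce on {len(nodes)} nodes",
        sleep=lambda seconds: None,             # skip real waiting in the demo
    )
    for entry in log:
        print(entry)
```

With this reading, each cluster size gets three read/update runs (immediately, after one minute, after five more minutes) followed by one MapReduce run.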


Read/update results


Read/update results without HBase


MapReduce performance


The HBase case

Verifications made:

Checked the logs : nothing seemed problematic

HDFS level : running the balancer with a very low threshold distributed the blocks evenly but without any impact on the performances

HBase level: the regions were always nearly evenly distributed across the regionservers

The number of rows did not change and the content of each row was correct


Summary of raw performances

DB          Read/update performance   MapReduce performance
Cassandra   Good                      Very Good
HBase       Bad / N.A.                Average / N.A.
mongoDB     Good                      Poor but scalable
Riak        Poor / unstable           Average but scalable


Summary of scalability

Going from 3 to 8 servers brings capacity to about 267% of its starting value (8/3 ≈ 2.67). Here are the observed increases in performance:

DB                   Read/update   MapReduce
Cassandra            153%          112%
HBase                11%           43%
mongoDB              145%          211%
Riak                 74%           189%
Riak (7 nodes max)   155%          168%
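The capacity figure can be checked with a few lines of arithmetic. The slides do not state how the percentages in the table were computed, so the two helpers below (hypothetical names) simply show the two common conventions: a value expressed as a percentage of the baseline, and a value expressed as a percentage increase over it.

```python
def percent_of_baseline(before: float, after: float) -> float:
    """`after` expressed as a percentage of `before`."""
    return 100.0 * after / before


def percent_increase(before: float, after: float) -> float:
    """Growth from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    # Cluster size: 3 servers at the start, 8 at the end.
    print(round(percent_of_baseline(3, 8), 1))  # 266.7
    print(round(percent_increase(3, 8), 1))     # 166.7
```

Under either convention, linear scaling would mean that throughput grows by the same factor as the number of servers, so the same helper can be applied to measured throughput at 3 and at 8 nodes to compare against the capacity figure.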


Conclusion and future work

Conclusion

The elastic gain seems more apparent than with YCSB but not linear either

It is worth testing MapReduce performance, as the results vary a lot between databases for both raw performance and scalability

Future work

This is still a work in progress :

Applying this benchmark to other databases (Terrastore, Voldemort, Scalaris ...)


Questions and remarks

Any questions or remarks?
