
6.2 Data Management Approaches: File System vs Relational Database System vs NoSQL

6.2.3 NoSQL Database System

In the age of big data, the relational database model has run into scalability and performance problems as data volumes increase. NoSQL databases have therefore been developed as an alternative data management solution to relational databases.

The term NoSQL initially referred to a database system that had no structure and no standard query language (SQL is the standard query language for relational databases), but after a few years developers settled on a more meaningful expansion of the term: Not Only SQL. Any database system that does not use a relational model can be described as a NoSQL database.

The concept of NoSQL was proposed long ago, but the big movement in this area began in 2009 [163] and has since grown rapidly. This is because the NoSQL database system has been found to have advantages when dealing with large volumes of data. The main advantages of the NoSQL database system over the RDBMS system are:

• High-performance data access: One design goal of the NoSQL database system is efficient data access when data volumes are large. To achieve this goal, the NoSQL database system follows the BASE (Basically Available, Soft state, Eventual consistency) model instead of the ACID (Atomicity, Consistency, Isolation and Durability) model used by the RDBMS system [189] [190]. ACID-compliant relational databases usually perform worse mainly because they rigidly enforce data consistency. The NoSQL database system therefore sacrifices ACID compliance for performance. In practice, this means that the NoSQL database system typically has no foreign keys/referential integrity constraints and no JOIN operations, as these features would significantly affect performance.

The “consistency” that is of concern in NoSQL databases is the one found in the CAP (Consistency, Availability, Partition tolerance) theorem (also referred to as Brewer’s theorem [191]); it denotes the immediate or eventual consistency of data across all nodes that participate in a distributed database.

• Flexible data model: The NoSQL database system is based on a dynamic schema (also referred to as “schema-less”). This dynamic schema allows the NoSQL database system to accept data of any structure (structured, semi-structured and unstructured) much more easily than a relational database, which requires a predefined schema. The NoSQL database system can therefore support many use cases, including those that do not fit well into an RDBMS system. In addition, the structure of a NoSQL database can be modified in real time, which means that system updates, modifications or the addition of new information can be made at any time without taking the database system offline.

• Scalability: The NoSQL database system is horizontally scalable, which means that data are easily distributed across multiple servers. These servers can be cheap commodity hardware or cloud instances, which is much more cost-effective than vertical scaling. Moreover, many NoSQL database systems distribute data across servers automatically, so NoSQL database systems are sometimes also referred to as cloud-based databases [164].

• High data availability: NoSQL databases achieve high data availability through replication. The data replication model of NoSQL databases consists of one master server (also called “primary”) and one or more slave servers (also known as “secondary”). In this model, there can only be one master server at any particular time, with the other servers being slave servers. Write operations always take place on the master server, but read operations can be served by either the master server or the slave servers. When new data has been written to the master server, the same data is copied to all of the slave servers. In other words, all of the slave servers hold a copy of the same data as the master server. Moreover, the replication feature also offers a fault-tolerance mechanism (also called fail-over) [266] to guarantee the high availability of the database system. When the master server goes down, one of the slave servers is automatically elected as the new master. Therefore, even if a single database server is unavailable, the whole database system can continue to serve users without interruption, because users can access the data they want from the other database servers (slaves) within the replication infrastructure. Please note that enabling replication is a necessary condition for achieving high data availability. Without data replication, the availability of a NoSQL database cannot be guaranteed: if only one NoSQL database server is used to provide services and this server becomes unavailable, users cannot access the data until the server comes back online.
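The replication and fail-over behaviour described above can be illustrated with a minimal in-memory sketch. It is not the implementation of any particular NoSQL product: the class name ReplicaSet, its methods and the server names are hypothetical, replication is modelled as an immediate copy, and the "election" is simplified to choosing the alphabetically first server that is still online.

```python
import copy


class ReplicaSet:
    """Toy master/slave replica set (illustrative only, not a real driver API)."""

    def __init__(self, server_names):
        if not server_names:
            raise ValueError("at least one server is required")
        # Every server holds its own full copy of the data.
        self.stores = {name: {} for name in server_names}
        self.online = set(server_names)
        self.master = server_names[0]  # exactly one master at any time

    def write(self, key, value):
        """Write to the master, then copy the value to every online slave."""
        if self.master is None:
            raise RuntimeError("no master available")
        self.stores[self.master][key] = value
        for name in self.online:
            if name != self.master:
                self.stores[name][key] = copy.deepcopy(value)

    def read(self, key, server=None):
        """Serve a read from the master (default) or from a named online server."""
        name = self.master if server is None else server
        if name not in self.online:
            raise RuntimeError(f"server {name!r} is offline")
        return self.stores[name].get(key)

    def fail(self, name):
        """Take a server offline; if it was the master, promote a slave."""
        self.online.discard(name)
        if name == self.master:
            # Simplified election (assumption): first surviving server by name.
            self.master = min(self.online) if self.online else None


# Hypothetical usage: three servers, then the master fails.
rs = ReplicaSet(["a", "b", "c"])
rs.write("obj-1", {"mag": 17.2})
rs.fail("a")
print(rs.master)         # b
print(rs.read("obj-1"))  # {'mag': 17.2}
```

With a single server, calling fail on that server leaves no master, and every later read raises an error until the server returns. This corresponds to the case noted above in which availability cannot be guaranteed without replication.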

Even though the NoSQL database system offers remarkable benefits when handling ‘big data’ situations, it is not without drawbacks. In general, the NoSQL database system has two major drawbacks. First, unlike the RDBMS system, the NoSQL database system does not have a standard query language; each type of NoSQL database system has its own query language. In other words, developers need to learn many different query languages in order to use different types of NoSQL databases. This lack of a standardized query language costs developers extra time and effort.

The second drawback is that the NoSQL database system does not have a standard interface. Because each NoSQL database system is unique both in how the data are stored and in how different pieces of data relate to each other, no single interface can manage them all. When adopting a new NoSQL database system for data storage, the developer must invest time and effort to learn about it.


There are more than a hundred open source and commercial NoSQL databases available, but they can be divided into four major categories: Key/Value stores, Object Oriented stores, Document Oriented stores, and Graph stores [165]. Each category is described in detail below:

• Key/Value stores: The fundamental data model of Key/Value (KV) stores is based on an associative array (like a hashtable or map). To store data in a KV database, it is necessary to create a key for each record; a key consists of a string and cannot be duplicated. Once the key has been created, the actual data to be stored is treated as the value; the value can be of any basic type (such as string, integer, or array), and it is retrieved by looking up its key. In other words, in a KV database, data is represented as a collection of key/value pairs, and each key is only allowed to appear once in the collection.

• Object oriented stores: Object oriented stores (also referred to as Object databases) first appeared in 1986. The data stored in an object database take the form of objects, as in object oriented programming.

The main use of an object database is in object oriented development. This category of database is mainly used for storing objects created by programming languages such as C++, Java and others. Object oriented stores work well for storing and managing multimedia data, such as images, text, audio and video.

• Graph stores: The graph database is a database management system with CRUD methods that expose a graph data model; it is typically designed for storing interconnected data. In theory, the graph database consists of two components: “Node” and “Edge”. A node is used for data storage, with all data in the graph database stored in a series of nodes. An edge is a line that connects one node with another node. Each edge must be attached to nodes at both ends, and it represents the relationship between those nodes. The graph database is well suited to modelling domains that are naturally graphs (e.g. social networks) and is often used for storing and working with unstructured and semi-structured data (e.g. blogs, audio files, video files and emails).

• Document Oriented Stores: A document-oriented database is another type of NoSQL database that is designed for storing, retrieving and managing structured, semi-structured, or machine-readable data. The key concept of a document-oriented database is that it uses the document as its storage unit, instead of the record used in a relational database.

In general, the documents inside a document oriented database can be in any standard document format, such as XML, YAML, JSON, BSON or flat text [166]. This category of database is useful for content management systems, blogging platforms, web analytics and real-time analytic systems.
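As a rough illustration of the Key/Value, Graph and Document data models described above, the sketch below expresses each one with plain Python data structures. It does not use the API of any of the database systems discussed in this chapter, and every name and value in it (kv_put, neighbours, find, the sample records) is hypothetical.

```python
# Key/Value store: an associative array in which each key appears only once.
kv_store = {}


def kv_put(key, value):
    """Insert a record, refusing duplicate keys (hypothetical helper)."""
    if key in kv_store:
        raise KeyError(f"duplicate key: {key!r}")
    kv_store[key] = value


kv_put("transient:001", "supernova candidate")

# Graph store: nodes hold the data, edges hold the relationships.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
}
# Each edge is (source node, relationship, target node).
edges = [("alice", "follows", "bob")]


def neighbours(node_id):
    """Return the nodes reachable from node_id over one outgoing edge."""
    return [dst for src, _, dst in edges if src == node_id]


# Document store: the document is the storage unit, and documents in the
# same collection need not share the same fields (dynamic schema).
collection = [
    {"_id": 1, "title": "First post", "tags": ["nosql"]},
    {"_id": 2, "title": "Second post", "author": {"name": "Bob"}},
]


def find(field, value):
    """Return every document whose given field equals the given value."""
    return [doc for doc in collection if doc.get(field) == value]


print(kv_store["transient:001"])    # supernova candidate
print(neighbours("alice"))          # ['bob']
print(find("title", "Second post")[0]["_id"])  # 2
```

The two documents in the collection carry different fields ("tags" in one, "author" in the other), which a relational table with a predefined schema would not accept without a schema change.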

Some of the major open source NoSQL database implementations are listed in Table 6:

| Database name | Data model category | Implementation language | License |
|---|---|---|---|
| Oracle Berkeley DB (DBD) | Key/Value store | C | GNU Affero General Public License (AGPLv3) |
| Redis | Key/Value store | ANSI C | BSD |
| NeoDatis ODB | Object Oriented store | Java | GNU Lesser General Public License (LGPL) |
| Db4o | Object Oriented store | Java | GNU General Public License (GPL) |
| Neo4J | Graph store | Java | GPLv3 |
| OpenCog | Graph store | C++ | AGPL |
| MongoDB | Document Oriented Store | C++ | AGPLv3 |
| Xindice | Document Oriented Store | Java | Apache License 2.0 |
| CouchDB | Document Oriented Store | Erlang | Apache License 2.0 |


Each category of NoSQL database system has its own area of strength; no single category is suitable for all situations. As mentioned before, the NoSQL database is not designed to replace traditional relational database systems. Instead, it is designed for handling those situations that traditional relational database systems cannot cope with, such as data that must be scaled out across servers, evolving data that require a dynamic schema, and the storage and management of large quantities of data [267][268][206][270]. The next section begins the discussion of how to use a NoSQL database to store and manage different astronomical transient data.

6.3 Suitable NoSQL Category for Transient Data Storage and