6.3 Suitable NoSQL Category for Transient Data Storage and Management

6.3.2 Choice of Document Stores Implementation

XML, JSON and BSON are the data formats most commonly used by document store databases. In the early stage of development, a document store called Xindice [167] was used for our data storage. Xindice is an open source database developed by Apache; data is stored in it as XML documents, so it is also referred to as an XML-based database.

Xindice allows XML documents with different structures to be stored in the same collection, because it does not require an XML schema to be associated with the collection; it is therefore described as schema independent [168]. However, after mapping the data into XML documents and storing them in Xindice, several disadvantages became apparent, both in using XML documents as the unit of storage and in Xindice itself.

First, the XML syntax is redundant; this redundancy may reduce application efficiency through higher processing and transmission costs.

Second, XML documents are very verbose in comparison with alternative data formats such as JSON and BSON, because XML requires every opening markup tag to be matched by a closing tag for a document to be well formed; these tags also consume a considerable amount of extra storage space.

Finally, XML documents are stored in Xindice as document trees (a storage mechanism common to XML-based databases); when looking up a particular piece of data, the search may have to traverse all branches of the tree before it locates the expected results. If the structure of the XML documents is complex and the collection holds a large number of documents, query performance can be very slow. In our situation, each collection could contain a large number of XML documents, and some of them could have a very complex structure (many nested tags) after data mapping. XML-based databases therefore do not meet the requirements of our project, nor do they offer a scalable data management solution for future astronomical projects. They were rejected, and it was necessary to look for other types of document store.
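The verbosity argument above can be illustrated with a minimal sketch that serialises the same record as XML and as compact JSON and compares the sizes. The record and its field names (`event_id`, `ra`, `dec`, `magnitude`) are hypothetical and are not taken from the project's data model; the sketch uses only the Python standard library and says nothing about how Xindice stores documents internally.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical transient event record; the field names are illustrative only.
event = {"event_id": "evt-001", "ra": 83.633, "dec": 22.014, "magnitude": 17.2}


def to_xml(record: dict) -> bytes:
    """Serialise a flat record as an XML element with one child per field."""
    root = ET.Element("event")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str without an XML declaration,
    # so only the markup itself is counted.
    return ET.tostring(root, encoding="unicode").encode("utf-8")


def to_json(record: dict) -> bytes:
    """Serialise the same record as compact JSON (no extra whitespace)."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


xml_bytes = to_xml(event)
json_bytes = to_json(event)

# Every field name appears twice in the XML form (opening and closing tag)
# but only once in the JSON form.
print("XML bytes: ", len(xml_bytes))
print("JSON bytes:", len(json_bytes))
```

For this small record the XML form is noticeably larger than the JSON form; the exact ratio depends on how long the field names are relative to the values.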

Further research identified another document store, MongoDB [169]. MongoDB is developed and maintained by 10gen (since renamed MongoDB Inc.) and has been publicly available since 2009; it is an open source database written in C++ and licensed under the GNU Affero General Public License [170]. It is ranked as the fourth most popular database management system [171] and is the most popular document store, with a large user base, strong community support and adoption by many organisations and projects, including The New York Times, Foursquare, SourceForge and eBay.

Using MongoDB for data storage has many advantages over XML-based databases. The major benefits are:

• Lightweight and efficient data format: The data storage format used by MongoDB is BSON (Binary JSON) [172], a binary representation of JSON documents. A BSON document is composed of field-and-value pairs; in its JSON notation, the pairs are separated by ",", as in the example below:

{"field1": "value", "field2": "value"}

The BSON format is much more lightweight than XML, because it does not require opening and closing markup tags to form a valid document, so the structure of a BSON document is clear and robust. Most importantly, without the overhead of markup tags, BSON needs considerably less disk space than XML. Reducing overhead to a minimum is an important property of any data representation format. Moreover, BSON is a very efficient format: because it uses C data types, encoding data to BSON and decoding data from BSON can be performed very quickly in most programming languages.

• Rich queries: The MongoDB query language aims to offer features similar to those of SQL; it supports both complex queries and ad hoc queries [270]. The MongoDB query engine allows the user to search documents by field name, value range, regular expression, geospatial criteria and even full text, which makes it much more powerful than XPath. With this query engine, we are able to search data in MongoDB using multiple query criteria. In addition, the MongoDB query engine has a built-in 2D search feature [271], which is very useful in astronomical contexts: in some situations we may want to list event data for a given set of sky coordinates, and the 2D search feature is well suited to this kind of search.

• Indices: MongoDB supports indices, including unique indices, to enhance query performance, much as an RDBMS does. Indices can be created on single or multiple fields. They are B-tree indices and are defined per collection, which means that each collection has its own indices. Moreover, we can also enhance query performance by putting only documents of a single type into the same collection.

• Replication and automatic failure recovery: MongoDB also supports replication (through what it calls a replica set), which allows us to use a set of servers linked in a primary-secondary (master-slave) arrangement. Setting up replication in MongoDB is easy and fast. Furthermore, failover is handled automatically: if the primary server has a problem and goes down, a secondary server becomes the primary without any human intervention. Note that a minimum of three servers is required for this form of replication: a primary server, a secondary server and an arbiter server. The arbiter is not used for data storage; it is only used during failover to help decide which server will be the next primary. The election of the new primary is performed automatically as soon as the old primary is found to be unavailable.

• Auto sharding: Sharding distributes the data across physical partitions to overcome hardware limitations (CPU, storage space and RAM), and the data is automatically balanced across the cluster.

• Rich programming language APIs: MongoDB provides APIs for a variety of programming languages, which allow programmers to develop data access programs in the language they are most familiar with. These APIs cover almost all modern programming languages, including C++, Python, Java, JavaScript, PHP and C.

• Binary communication protocol: The communication protocol used by MongoDB is called the Mongo Wire Protocol [173]; it is a simple socket-based, request-response style protocol. With the Mongo Wire Protocol, programs can communicate with MongoDB through a regular TCP/IP socket.

• Cost effective: MongoDB is also a cost-effective data management solution [275], because it scales horizontally to serve increased workloads and larger datasets. Horizontal scaling can reduce hardware and storage costs, because the capacity of the database system can be increased by adding cheaper commodity machines rather than by purchasing expensive hardware (RAM, hard disks, etc.) to boost the capacity of a single server.
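To make the BSON layout mentioned in the first benefit more concrete, the following is a minimal hand-written sketch of how a flat document of field-and-value pairs maps onto C-style data types. It is not the encoder used by MongoDB or its drivers, and the function name `encode_bson` is hypothetical. It handles only strings, 32-bit integers and doubles, following the element layout of the public BSON specification; a real application would use the BSON support shipped with a MongoDB driver.

```python
import struct


def encode_bson(document: dict) -> bytes:
    """Encode a flat dict of str/int/float values as a BSON document.

    Sketch only: nested documents, arrays, booleans, dates and other BSON
    types are deliberately left out.
    """
    body = b""
    for name, value in document.items():
        key = name.encode("utf-8") + b"\x00"  # field name as a C string
        if isinstance(value, bool):
            # bool is a subclass of int in Python; reject it explicitly.
            raise TypeError("bool is not handled in this sketch")
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # 0x02 = UTF-8 string: int32 length (including the null), bytes
            body += b"\x02" + key + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            # 0x10 = 32-bit integer; struct raises an error if out of range
            body += b"\x10" + key + struct.pack("<i", value)
        elif isinstance(value, float):
            # 0x01 = 64-bit IEEE 754 double
            body += b"\x01" + key + struct.pack("<d", value)
        else:
            raise TypeError(f"unsupported type: {type(value).__name__}")
    # The leading int32 is the total size: its own 4 bytes, the elements,
    # and the trailing null byte that terminates the document.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


encoded = encode_bson({"field1": "value", "field2": "value"})
print(len(encoded), encoded)
```

Because every value is preceded by a type byte and, for strings, an explicit length, a decoder can skip over fields without parsing them character by character, which is part of why BSON can be encoded and decoded quickly.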

Although MongoDB offers many advantages in data storage, data access and data management, it also has drawbacks, like any other database system.

First, a BSON document has a size limit: a single BSON document cannot exceed 16 MB, and any document larger than this cannot be stored in MongoDB.

Second, MongoDB, like most NoSQL databases, does not support the 'join' operation, so a query can address only one collection at a time. As mentioned before, different types of data will be stored in different collections. Therefore, if we want to display multiple types of data in a single output, it is necessary to issue multiple queries to gather the different types of data from the different collections, and then combine the results manually within the code. In other words, extra programming effort is required on the application side.

In fact, the lack of joins is the major drawback when migrating from SQL to NoSQL. No database management solution is perfect, however; the extra coding effort is the price a programmer pays for the benefits offered by NoSQL databases.
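The kind of application-side combination described above can be sketched as follows. The two lists stand in for the results of two separate queries against two collections; the collection contents, the field names (`telescope_id`, `site`) and the function name are all hypothetical and serve only to illustrate the technique, not the project's actual schema.

```python
# Hypothetical results of two separate single-collection queries.
events = [
    {"_id": 1, "name": "evt-001", "telescope_id": "T1"},
    {"_id": 2, "name": "evt-002", "telescope_id": "T2"},
    {"_id": 3, "name": "evt-003", "telescope_id": "T1"},
]
telescopes = [
    {"_id": "T1", "site": "Site A"},
    {"_id": "T2", "site": "Site B"},
]


def join_events_with_telescopes(events, telescopes):
    """Combine two query results in application code (a manual join)."""
    # Build a lookup table once so that each event is matched in O(1).
    by_id = {telescope["_id"]: telescope for telescope in telescopes}
    joined = []
    for event in events:
        telescope = by_id.get(event["telescope_id"])
        # Keep events without a matching telescope, as a left join would.
        site = telescope["site"] if telescope else None
        joined.append({**event, "site": site})
    return joined


for row in join_events_with_telescopes(events, telescopes):
    print(row["name"], row["site"])
```

Building the lookup dictionary first keeps the cost proportional to the combined size of the two result sets, rather than scanning the second result for every document in the first.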

Overall, the disadvantages of MongoDB are relatively small compared with its advantages, and MongoDB is a feasible NoSQL solution for astronomical data management because it can store and manage transient data efficiently. We therefore decided to use MongoDB as the data management system. It should be noted that there is a similar database implementation called CouchDB, which uses JSON as its data storage format. However, the data access mechanism supported by CouchDB is limited, since it only allows data exchange through a REST API. CouchDB was therefore not considered for our data management system.

6.4 Design & Implementation of NoSQL Based Astronomical Database

In document Event based transient notification architecture and NoSQL solution for astronomical data management : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany (Auc (Page 195-200)