The Hybrid Data Storage - Implementing WikiSensing’s Data Management and Collaboration

4. Implementing WikiSensing’s Data Management and Collaboration

4.1. The Hybrid Data Storage

WikiSensing’s database is implemented using a hybrid storage strategy that combines a relational database, a non-relational database and an ontology, each storing a different part of the sensor data. Each of these databases runs on a separate virtual machine on the IC Cloud [85]. Furthermore, multiple server instances are used as replicas for the non-relational database cluster that stores the sensor data streams. The hybrid strategy stores the sensor measurements in a MongoDB non-relational database, the sensor meta-information (accuracy, range, etc.) in an ontology, and all other sensor data (location details, virtual sensors, units of measurement, etc.) in a relational MySQL database.
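The division of responsibilities described above can be sketched as a small routing table. This is only an illustration: the category labels and the store_for helper are hypothetical names, and only the assignment of each kind of data to a store is taken from the text.

```python
from enum import Enum


class Store(Enum):
    MONGODB = "MongoDB (sensor measurements)"
    ONTOLOGY = "ontology (sensor meta-information)"
    MYSQL = "MySQL (all other sensor data)"


# Hypothetical category labels; the store assigned to each follows the text.
ROUTING = {
    "measurement": Store.MONGODB,
    "accuracy": Store.ONTOLOGY,
    "range": Store.ONTOLOGY,
    "location": Store.MYSQL,
    "virtual_sensor": Store.MYSQL,
    "unit_of_measurement": Store.MYSQL,
}


def store_for(category: str) -> Store:
    """Return the store responsible for a category of sensor data."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError(f"unknown data category: {category!r}") from None
```

For example, store_for("measurement") selects the MongoDB cluster, while store_for("location") selects the MySQL server.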

4.1.1. Relational and Non-Relational Databases

Data in MongoDB is stored in collections, each of which is a grouping of MongoDB documents or records. A collection maps to the concept of an RDBMS table, although documents within a collection can have different fields (records with heterogeneous formats). The sensor measurements in WikiSensing are stored in a single collection.
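To illustrate heterogeneous records sharing one collection, the sketch below models a collection as a plain Python list of dictionaries. The field names other than UserId and SensorId, and all values, are made up for the example and are not taken from WikiSensing.

```python
# A plain list stands in for the single measurements collection.
# The two documents have different fields, which a relational table
# with a fixed set of columns would not allow directly.
measurements = [
    {"UserId": 12, "SensorId": 301, "Timestamp": "2013-05-01T10:00:00Z",
     "Temperature": 21.4},
    {"UserId": 12, "SensorId": 302, "Timestamp": "2013-05-01T10:00:05Z",
     "Humidity": 48.0, "WindSpeed": 3.2},
]


def fields_in_use(collection):
    """Return every field name that appears in at least one document."""
    names = set()
    for document in collection:
        names.update(document)
    return names


def readings_for(collection, sensor_id):
    """Return the documents recorded for one sensor."""
    return [doc for doc in collection if doc["SensorId"] == sensor_id]
```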

The MongoDB deployment in WikiSensing is configured as a cluster of shards [83], as illustrated in Figure 4.1, and sharding is applied to the collection that stores the sensor measurements. Sharding leads to better performance and improved scalability, allowing the system to adapt to increasing numbers of users and growing storage demands. Sharding is also known as horizontal partitioning: when it is applied in MongoDB, the schema is replicated on each shard and the data is then divided among the shards. This contrasts with vertical partitioning, which splits the data stored in one entity into multiple entities.
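The difference between the two partitioning styles can be shown with a minimal, self-contained sketch. It does not reproduce MongoDB's own placement logic: the modulo rule, the function names and the sample fields are assumptions chosen for brevity.

```python
def horizontal_partition(records, shard_count, key="SensorId"):
    """Split whole records across shards; every shard keeps all fields.

    A simple modulo on an integer key decides the shard. This is a
    simplification and not how MongoDB assigns data to shards.
    """
    shards = [[] for _ in range(shard_count)]
    for record in records:
        shards[record[key] % shard_count].append(record)
    return shards


def vertical_partition(records, id_field, field_groups):
    """Split each record's fields across several entities.

    Every resulting entity holds all rows, but only the identifier
    plus one group of fields.
    """
    entities = []
    for group in field_groups:
        entities.append([
            {id_field: r[id_field], **{f: r[f] for f in group if f in r}}
            for r in records
        ])
    return entities
```

With horizontal partitioning each shard holds a subset of the records; with vertical partitioning each entity holds a subset of the fields.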

Sharding enables data records to be stored across multiple machines and is the mechanism MongoDB uses to support data growth. As the volume of sensor data increases (e.g. as new sensors and measurements are added), multiple machines are required to store the data and provide acceptable read and write throughput. With sharding, WikiSensing can scale horizontally by adding more hardware to support data growth and the demands of read and write operations. When the data on a machine reaches a certain threshold, a new shard can be added to provide further storage resources. This threshold is based on the disk space used on a machine and is currently set heuristically to between 60% and 70%. The MongoDB cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another.
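The disk-usage heuristic can be expressed as a short check. How WikiSensing actually measures disk usage is not stated, so the use of shutil.disk_usage, the function names and the choice of the lower bound (60%) as the default are all assumptions made for this sketch.

```python
import shutil

# Lower bound of the 60%-70% band mentioned in the text (assumed default).
DISK_USAGE_THRESHOLD = 0.60


def used_fraction(path="."):
    """Return the fraction of disk space in use on the volume holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def needs_new_shard(fraction, threshold=DISK_USAGE_THRESHOLD):
    """Decide whether a further shard should be added to the cluster."""
    return fraction >= threshold
```

A monitoring job could call needs_new_shard(used_fraction(data_directory)) on each shard machine, where data_directory is a hypothetical path to the database files.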

As in other database systems, indexes are used in MongoDB to execute queries efficiently. Every MongoDB collection has a default index on the _id field (the document identifier, which MongoDB generates automatically for each record). In addition, a user-defined index is created on UserId and SensorId, since almost every sensor measurement stored in WikiSensing contains these fields and they are also used when querying.
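A compound index of this kind could be declared as follows. The sketch assumes a PyMongo-style collection object that provides create_index; the order of the two fields, the ascending direction and the helper names are assumptions, as the text only states which fields are indexed.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING

# Assumed field order and direction; the text names only the two fields.
MEASUREMENT_INDEX = [("UserId", ASCENDING), ("SensorId", ASCENDING)]


def ensure_measurement_index(collection):
    """Create the compound index on a PyMongo-style collection object."""
    return collection.create_index(MEASUREMENT_INDEX)


def measurement_query(user_id, sensor_id):
    """Build a filter document that the compound index can serve."""
    return {"UserId": user_id, "SensorId": sensor_id}
```

A filter built by measurement_query can be passed to the collection's find method, so that lookups by user and sensor avoid scanning the whole collection.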

[Figure: virtual machines A to F in the IC Cloud. Replica set RS0: A (mongod, primary), B and C (mongod, secondary). Replica set RS1: D (mongod, primary), E and F (mongod, secondary). A also runs mongos, which serves the WikiSensing MongoDB interface; config servers run on B, C and D; each replica set is added to the cluster with addShard.]

Figure 4.1: The MongoDB cluster for WikiSensing using Sharding

The relational database of WikiSensing is a MySQL server deployed on a virtual machine in the IC Cloud. WikiSensing’s application layer communicates with all database servers through a secure virtual private network. Moreover, the business logic at the application layer maps the relational data to the data held in the other sources, namely the non-relational database and the ontology.
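One way to picture this mapping is a function in the application layer that assembles a single view of a sensor from the three stores. The dictionaries and lists below are in-memory stand-ins for the real servers, and the function name and the use of SensorId as the join key are assumptions.

```python
def sensor_view(sensor_id, relational_rows, measurements, ontology_properties):
    """Combine data about one sensor from the three stores.

    relational_rows:      {sensor_id: row dict}, standing in for MySQL
    measurements:         list of documents, standing in for MongoDB
    ontology_properties:  {sensor_id: property dict}, standing in for the ontology
    """
    return {
        "SensorId": sensor_id,
        "details": dict(relational_rows[sensor_id]),
        "properties": dict(ontology_properties.get(sensor_id, {})),
        "measurements": [m for m in measurements if m["SensorId"] == sensor_id],
    }
```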

4.1.2. Managing Data by Ontology

Using an ontology to store the sensor properties or meta-information enables a shared understanding of the concepts in the sensor domain. The set of sensor properties needs to be extensible, and different users can interpret certain concepts in different ways. An ontology addresses both issues by supporting an extensible list of properties and by providing constructs (RDF and OWL) that incorporate semantics into the data. The advantage of using RDF (Resource Description Framework) [86] with OWL (Web Ontology Language) [67] over XML lies in the complexity of querying the data. The same data can be represented in XML in a large number of ways, and hence it is difficult to design queries that are independent of this structure. In contrast, RDF enforces a standard way of writing statements, so that irrespective of where or how a statement occurs in a document it produces the same effect in RDF terms (e.g. RDF uses URIs to represent elements that are globally unique). WikiSensing uses the OntoSensor ontology to represent sensor properties and extends this ontology where new properties are introduced.
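The uniformity of RDF statements can be illustrated without an RDF library by modelling a graph as a set of (subject, predicate, object) triples. The namespace, the property names and the values below are hypothetical and are not taken from OntoSensor.

```python
EX = "http://example.org/sensors#"  # hypothetical namespace

# Each statement is one (subject, predicate, object) triple. Because the
# graph is a set, the order in which statements are written is irrelevant.
graph = {
    (EX + "sensor301", EX + "hasAccuracy", "0.5"),
    (EX + "sensor301", EX + "hasRange", "-40..125"),
    (EX + "sensor302", EX + "hasAccuracy", "0.1"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )
```

A query such as match(graph, predicate=EX + "hasAccuracy") depends only on the triple pattern, not on how the statements were laid out in a source file, which is the property that makes RDF data easier to query than arbitrarily structured XML.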

The ontology data is managed using the proprietary Virtuoso server. The dotNetRDF [87] .NET library is used to access (e.g. query, update, etc.) the ontology data from the WikiSensing middleware. dotNetRDF provides a set of APIs for working with ontology files and supports the Virtuoso data store.

4.1.3. Motivations for Hybrid Data Storage

The hybrid database approach of WikiSensing offers several advantages. Firstly, using a high-speed database to store the vast number of sensor readings enhances the performance of data access. MongoDB is a document-oriented, schema-free storage system that uses memory-mapped files [23]. It is also a non-relational database that provides better performance than a relational database such as MySQL [88][24]. The schema-less nature of MongoDB has the advantage of allowing heterogeneous types of records to be stored. Hence sensors that produce different types or different numbers of measurements can be accommodated in a single collection (analogous to a relational table).

Secondly, non-relational databases such as MongoDB lack the atomicity, consistency and durability transaction properties [25] and are not suitable for storing information that requires a degree of concurrency control. MongoDB is used primarily because it is lightweight and fast, as it does not use traditional locking or complex transactions with rollback [26]. Accordingly, MongoDB is used to store data that is only inserted and never modified, which does not require any record locking mechanisms. On the other hand, data that is regularly modified (e.g. sensor environment details, virtual sensor configurations, etc.) is stored in the more mature MySQL database, which ensures these transactional properties.
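The contrast between insert-only measurements and transactionally updated sensor details can be sketched as follows. Python's built-in sqlite3 module is used here only as a stand-in for MySQL so that the example is self-contained; the table layout, the function names and the sample values are assumptions.

```python
import sqlite3

# Insert-only data: a list stands in for the measurements collection.
# Records are appended and never modified, so no locking is modelled.
measurements = []


def record_measurement(document):
    measurements.append(document)


def open_sensor_db():
    """Create an in-memory relational store with one sample sensor row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sensor (id INTEGER PRIMARY KEY, "
        "location TEXT NOT NULL, unit TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO sensor VALUES (301, 'Lab A', 'Celsius')")
    conn.commit()
    return conn


def update_sensor(conn, sensor_id, location, unit):
    """Apply both changes atomically: either both persist or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE sensor SET location = ? WHERE id = ?",
                     (location, sensor_id))
        conn.execute("UPDATE sensor SET unit = ? WHERE id = ?",
                     (unit, sensor_id))
```

If the second statement fails, for instance because unit is None and violates the NOT NULL constraint, the first change is rolled back as well, which is the kind of guarantee the modifiable data relies on.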

Thirdly, the use of the ontology enables the storage of extensible information while preserving a common vocabulary. For example, ontologies help to distinguish different sensor properties (e.g. sensor resolution and sensitivity) as well as to identify properties that are the same but named differently (e.g. sensor accuracy or true variation). An ontology is also useful for defining rules by setting relationships between these concepts; for example, subclasses are a good way to group sensor properties that have similar characteristics.
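These two roles of the ontology, resolving differently named properties to one concept and grouping related properties through subclasses, can be sketched with two small lookup tables. The class names and the hierarchy are invented for illustration and do not reflect the actual OntoSensor structure; only the accuracy and true variation pairing comes from the text.

```python
# Hypothetical vocabulary: alternative name -> canonical property name.
EQUIVALENT = {"true variation": "accuracy"}

# Hypothetical hierarchy: property or class -> its direct superclass.
SUBCLASS_OF = {
    "accuracy": "measurement quality",
    "resolution": "measurement quality",
    "sensitivity": "response characteristic",
    "measurement quality": "sensor property",
    "response characteristic": "sensor property",
}


def canonical(name):
    """Map a property name to the shared vocabulary term."""
    key = name.strip().lower()
    return EQUIVALENT.get(key, key)


def is_a(name, ancestor):
    """Check whether a property belongs to a class, following subclass links."""
    current = canonical(name)
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False
```

Here canonical("True Variation") and canonical("accuracy") resolve to the same term, while resolution and sensitivity remain distinct properties that still share the ancestor "sensor property".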

In document WikiSensing: A Collaborative Sensor Management System with Trust Assessment for Big Data (Page 68-71)