12/2/2013
CS 480: Senior Design | Group 3 - LightSpeed | Chelsea Skotnicki
West Virginia University
Individual Research Paper
Table of Contents
Table of Figures
Needs
Background
Objectives
    Reliable
    Fast
    User-Friendly
Stakeholders
References

Table of Figures
Figure 1: Relational Database Model
Figure 2: Distributed Database Model
Figure 3: Speed Comparison of Database Reads
Figure 4: Actor Model
Needs
Databases have become one of the most effective means of storing and organizing data, and there is a growing need for large-scale systems that can store and manage very large datasets. Such systems already exist, but our goal is for our product, LightSpeed, to be faster, more reliable, and more user-friendly than the systems already on the market, such as Hadoop, an industry standard. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models (What Is Apache Hadoop?, 2012). Its users include eBay, Facebook, LinkedIn, and Hulu (Hadoop Wiki PoweredBy, 2013), so LightSpeed would be competing for large, well-known customers. LightSpeed will be built on the Lightning Memory-Mapped Database (LMDB), an ultra-fast, ultra-compact embedded key-value data store developed by Symas (Symas Lightning Memory-Mapped Database (LMDB), 2013). On top of LMDB, we will implement concurrency in Scala, an object-oriented programming language, because Scala supports actor-based concurrency. We will use Spray, an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and Akka, for our actor-based layer, since it is asynchronous, actor-based, fast, lightweight, modular, and testable (Spray, 2012). Finally, we will host LightSpeed on the Amazon Web Services cloud computing platform, which gives LightSpeed scalability. By combining these components, we aim to meet the need for an ultra-fast, highly reliable, and user-friendly database system for handling “big data”.
Background
Information storage is a challenge that has existed throughout human history, long before modern computer systems. The 1960s marked the beginning of computerized databases. In 1970, E. F. Codd changed the way people thought about databases when he proposed the relational database model. In his model, the database’s logical organization is separated from its physical storage, and this became the standard principle for database systems (A Timeline of Database History, 2013).
Figure 1: Relational Database Model
A distributed database is a database under the control of a central database management system (DBMS) in which the storage devices are not all attached to a single computer: the data may be stored on multiple computers in the same physical location or dispersed over a network of interconnected computers. Replication and duplication keep the distributed copies up to date. Replication identifies changes in the distributed database and applies them so that all copies look identical. Duplication designates one database as the master and then copies that database to the other locations. During this process, changes are allowed only to the master database, so that local data will not be overwritten. Both processes can keep the data current in all distributed locations (Distributed database, 2013).
Figure 2: Distributed Database Model
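The difference between duplication and replication described above can be illustrated with a minimal sketch. It is written in Python only for brevity (LightSpeed itself targets Scala), each "site" is a plain dictionary standing in for a database copy, and the function names `duplicate` and `replicate` are hypothetical, not part of any library or of the LightSpeed design.

```python
# Illustrative only: each "site" is a plain dict standing in for a database copy.

def duplicate(master, sites):
    """Duplication: the master is authoritative; every site becomes a copy of it."""
    for site in sites:
        site.clear()
        site.update(master)


def replicate(sites, change_log):
    """Replication: apply every recorded change to every site so all copies match."""
    for key, value in change_log:
        for site in sites:
            site[key] = value


# Duplication: writes go to the master only and are then copied outward.
master = {"user:1": "Ada"}
site_a, site_b = {}, {"stale": "x"}
master["user:2"] = "Grace"
duplicate(master, [site_a, site_b])

# Replication: changes are collected and applied to every copy.
east, west = {"user:1": "Ada"}, {"user:1": "Ada"}
changes = [("user:2", "Grace"), ("user:3", "Edsger")]
replicate([east, west], changes)
```

After `duplicate` runs, both sites hold exactly the master's contents, including the removal of entries the master does not have. After `replicate` runs, both copies contain the same three records. A real system would also have to detect changes and resolve conflicts, which this sketch leaves out.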
LMDB is a B+tree-based database management library recently released by Howard Chu, CTO and owner of Symas Corp. It uses a copy-on-write strategy, so active data pages are never overwritten; this provides resistance to corruption and eliminates the need for special recovery procedures after a system crash. Unlike other well-known database mechanisms, which use either write-ahead transaction logs or append-only data writes, LMDB requires no maintenance during operation (Symas Lightning Memory-Mapped Database (LMDB), 2013). As Figure 3 shows, LMDB outperforms the other databases compared in read speed by a substantial margin. We therefore expect that, once we add a layer of concurrency on top of LMDB, LightSpeed’s read performance will be highly competitive.
Figure 3: Speed Comparison of Database Reads
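The copy-on-write idea can be sketched in a few lines. This is a toy model in Python, not LMDB's implementation: LMDB copies individual B+tree pages, whereas this sketch copies the whole dictionary for simplicity, and the class name `CowStore` and its methods are hypothetical.

```python
class CowStore:
    """Toy copy-on-write store: a write never modifies the published version in place."""

    def __init__(self):
        self._root = {}  # the currently published version

    def snapshot(self):
        """Readers take a reference to the current version, which is never mutated."""
        return self._root

    def put(self, key, value):
        new_version = dict(self._root)  # copy first ...
        new_version[key] = value        # ... modify only the copy ...
        self._root = new_version        # ... then publish by swapping one reference


store = CowStore()
store.put("a", 1)
before = store.snapshot()  # a reader starts here
store.put("a", 2)
store.put("b", 3)
after = store.snapshot()
```

The reader holding `before` still sees the data exactly as it was when the snapshot was taken, even though two writes happened afterwards. Because the published version is replaced in a single step and never edited in place, a crash in the middle of a write cannot leave a half-updated version visible, which is the intuition behind the corruption resistance described above.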
Because LMDB is an embedded library, we can add concurrency in a wrapper language. We chose Scala on the recommendation of our mentor, Ray Morehead. Scala is a Java-like programming language which unifies object-oriented and functional programming (Odersky, 2013). Scala will allow us to implement actor-based concurrency. Spray, a toolkit built on the Akka actor library for Scala, will allow for non-blocking, asynchronous processing between clusters. In the actor model, each actor encapsulates a state and a thread of control that manipulates this state. In response to a message, an actor may perform one of the following actions (see Figure 4): (1) alter its current state, possibly changing its future behavior, (2) send messages to other actors asynchronously, (3) create new actors with a specified behavior, and (4) migrate to another computing host (Weng, 2006).
Figure 4: Actor Model
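The first three actor actions listed above can be illustrated with a minimal sketch. It is written in Python with the standard `queue` and `threading` modules only for brevity; it is not Spray or Akka code. The classes `Actor`, `Counter`, and `Collector` and the message shapes are hypothetical. Migration to another host, action (4), is left out.

```python
import queue
import threading

_STOP = object()  # sentinel message that tells an actor to shut down


class Actor:
    """Minimal actor: private state plus one thread draining a mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Finish the messages already queued, then wait for the thread to exit."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Collector(Actor):
    """Records every message it receives."""

    def __init__(self):
        self.received = []
        super().__init__()

    def receive(self, message):
        self.received.append(message)


class Counter(Actor):
    def __init__(self):
        self.count = 0
        self.children = []
        super().__init__()

    def receive(self, message):
        kind = message[0]
        if kind == "inc":        # (1) alter its own state
            self.count += 1
        elif kind == "report":   # (2) send a message to another actor
            message[1].send(("count", self.count))
        elif kind == "spawn":    # (3) create a new actor
            self.children.append(Counter())


collector = Collector()
counter = Counter()
counter.send(("inc",))
counter.send(("inc",))
counter.send(("report", collector))
counter.send(("spawn",))
counter.stop()    # returns once the counter has handled all four messages
collector.stop()  # returns once the collector has handled the report
```

Each actor's state is touched only by its own thread, so no locks are needed around `count` or `received`. Callers never wait for a message to be processed, only for it to be queued, which is what makes the sends non-blocking.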
To host LightSpeed, we decided to use Amazon Web Services. Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud (Amazon Elastic Compute Cloud (Amazon EC2), 2013). Storage is provided by Amazon Simple Storage Service (S3), which provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web (Amazon Simple Storage Service (Amazon S3), 2013). Conveniently, these services are billed on a pay-as-you-go basis. This will be the only part of our project that needs funding.
Objectives
Figure 5: Objective Tree
Reliable
Companies depend on a reliable database system. Lost data is a worst-case scenario for them, and no company is going to use a database system that has a track record of losing data. For that reason, reliability has to be our number one concern when creating LightSpeed.
Fast
To run an efficient business, companies need a fast database. Most people dislike waiting and have come to expect fast results. Our goal for LightSpeed is to make it 10-100 times faster than competing database systems. If the end result is drastically faster than the database systems companies currently use, we hope they will be convinced to switch to our system.
User-Friendly
We want LightSpeed to be user-friendly so that people won’t shy away from using it. Without an easy-to-use graphical interface, our target audience will shrink. Our goal is to create a web interface that the average user can operate with little instruction.
Stakeholders
LightSpeed’s stakeholders are companies that need a fast, reliable way of storing a lot of data. Because such databases are highly sought after by large companies, if LightSpeed meets our objectives of being 10-100 times faster and more reliable than databases already on the market, we could command a premium price for our product and attract big-name clients.
References
A Timeline of Database History. (2013). Retrieved October 15, 2013, from Intuit: http://quickbase.intuit.com/articles/timeline-of-database-history
Amazon Elastic Compute Cloud (Amazon EC2). (2013). Retrieved December 1, 2013, from Amazon Web Services: http://aws.amazon.com/ec2/
Amazon Simple Storage Service (Amazon S3). (2013). Retrieved December 1, 2013, from Amazon Web Services: http://aws.amazon.com/s3/
Distributed database. (2013). Retrieved October 15, 2013, from Princeton: http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Distributed_database.html
Hadoop Wiki PoweredBy. (2013, June 19). Retrieved October 15, 2013, from Apache: http://wiki.apache.org/hadoop/PoweredBy
Odersky, M. (2013). The Scala Language Specification. Retrieved December 1, 2013, from Scala: http://www.scala-lang.org/files/archive/nightly/pdfs/ScalaReference.pdf
Spray. (2012). Retrieved October 15, 2013, from Spray: http://spray.io/
Symas Lightning Memory-Mapped Database (LMDB). (2013, September 17). Retrieved October 15, 2013, from Symas: http://symas.com/mdb/
Weng, W.-J. (2006, October 14). The Actor Model. Retrieved December 1, 2013, from Worldwide Computing Laboratory: http://wcl.cs.rpi.edu/salsa/tutorial/salsa-1_1_0/node8.html
What Is Apache Hadoop? (2012). Retrieved October 15, 2013, from Apache: http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F