Database management as a cloud-based service for small and medium organizations

Masaryk University

Faculty of Informatics

Master Thesis

Database management as a cloud-based service for small and medium organizations

Student:

Dime Dimovski

Statement

I declare that I have worked on this thesis independently, using only the sources listed in the bibliography. All resources, sources, and literature that I used or drew on in preparing it are properly cited in the thesis with a full reference to the source.

Resume

The goal of this thesis is to explore cloud computing, mainly focusing on database management systems as a cloud service. It reviews some of the currently available SQL and NOSQL database management systems offered as a cloud service, the advantages and disadvantages of cloud computing in general, and the common considerations.

Keywords

Cloud computing, SaaS, PaaS, Database management, SQL, NOSQL, DBaaS, Database.com, SQL Azure, Amazon Web Services, SimpleDB, DynamoDB, Google SQL, MongoDB, CouchDB, Google Datastore.

1. Introduction

The boom of cloud computing over the past few years has made it a common element of many innovations and new technologies. It has become common for enterprises and individuals to use services offered in the cloud and to recognize that cloud computing is a big deal, even if they are not clear why that is so. Even the phrase “in the cloud” has entered our colloquial language. A large share of the world’s developers is currently working on “cloud-related” products. The cloud is thus an amorphous entity that is supposed to represent the future of modern computing. In an attempt to gain a competitive edge, businesses are looking for new, innovative ways to cut costs while maximizing value. They recognize the need to grow, but at the same time they are under pressure to save money. The cloud gives businesses this opportunity, allowing them to focus on their core business by offering hardware and software solutions that they do not have to develop on their own. In this thesis I give an overview of what cloud computing is. I describe its main concepts and architecture, look at the XaaS paradigm (something/everything as a service), and review the currently available options in the cloud, focusing mostly on databases in the cloud, or Database as a Service. I take a closer look at how cloud computing in general, and Database as a Service in particular, can be used by small and medium enterprises, what main benefits it offers, and whether it really helps businesses reduce their budgets and focus on their core business.

2. Introduction to Cloud Computing

In reality the cloud is something that we have been using for a long time: it is the Internet, with all the standards and protocols that provide Web services to us. The Internet is usually drawn as a cloud, which represents one of the essential characteristics of cloud computing: abstraction. Cloud computing refers to applications and services that run on a distributed network using virtualized resources and are accessed by common Internet protocols and networking standards. It is distinguished by the notion that resources are virtual and limitless and that details of the physical systems on which software runs are abstracted from the user.[1]

One of the main drivers of cloud computing is the recent advancement in wireless speed and connectivity. Without it, cloud computing would not be practical or even possible. In many ways, cloud computing was an eventuality. The influence of telecommunications organizations and their push towards simplifying and miniaturizing virtually every electronic device used by mobile users is driving cloud computing forward even faster. This represents a major breakthrough not only in computing but also in communication.

Cloud computing represents a real paradigm shift in the way in which systems are deployed. The massive scale of cloud computing systems was enabled by the popularization of the Internet and the growth of some large service companies.[1]

Cloud computing has been compared to the standard utility companies, and it does bear a striking resemblance to these institutions. Just like water, electricity or gas, cloud computing makes the long-held dream of utility computing possible with a pay-as-you-go, infinitely scalable, universally available system. In other words, the ‘goods’ come from one central location; we are just turning things off and on. This may ultimately give more people access to a larger pool of resources at a greatly reduced cost. One of the biggest benefits of cloud computing is its ability to offer users access to off-site hardware and software. With cloud computing, the resources of the cloud itself are at your disposal. This means that all the hardware, software, processors and networks combine to give individuals much more computing power than has ever been possible. This will change nearly every facet of information exchange and influence everything from social networking to web development. Because individual access devices can be kept light and simple, they are going to last a lot longer and become much more durable. Losing or breaking a device is also no longer of particular concern, as the device is easily replaced and there is no danger of losing your files or information.

With cloud computing, you can start very small and become big very fast. That's why cloud computing is revolutionary, even if the technology it is built on is evolutionary.

2.1 Cloud computing – definition

The word “cloud” refers to two essential concepts:

• Abstraction

• Virtualization

Abstraction

Cloud computing is abstracting the details of the system implementation from the users and the developers. Applications run on physical systems that aren't specified, data is stored in locations that are unknown, administration of systems is outsourced to others, and access by users is ubiquitous.[1]

Virtualization

Cloud computing virtualizes systems by pooling and sharing resources. Systems and storage can be provisioned as needed from a centralized infrastructure, costs are assessed on a metered basis, multi-tenancy is enabled, and resources are scalable with agility.

Cloud computing is an abstraction based on the notion of pooling physical resources and presenting them as a virtual resource. It is a new model for provisioning resources, for staging applications, and for platform-independent user access to services. Clouds can come in many different types, and the services and applications that run on clouds may or may not be delivered by a cloud service provider.

2.2 Cloud Types

Cloud computing is usually separated into two distinct sets of models:

• Deployment models – refer to the location and management of the cloud’s infrastructure.

• Service models – the particular types of services that can be accessed on a cloud computing platform.

2.2.1 NIST model

The NIST model is a set of working definitions published by the U.S. National Institute of Standards and Technology. This cloud model is composed of five essential characteristics, three service models, and four deployment models.[2]

Essential Characteristics:

• On-demand self-service - A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.

• Broad network access - Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).

• Resource pooling - The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, and network bandwidth.

• Rapid elasticity - Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

• Measured service - Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
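The measured-service characteristic can be illustrated with a minimal sketch of pay-per-use billing. The resource names, the unit prices and the `bill` helper below are invented for illustration only and do not correspond to any provider's price list or API.

```python
from dataclasses import dataclass

# Hypothetical unit prices; real providers publish their own rates.
UNIT_PRICES = {
    "storage_gb_month": 0.10,
    "cpu_hours": 0.05,
    "bandwidth_gb": 0.12,
}


@dataclass
class UsageRecord:
    """One metered reading for a single consumer and resource."""
    consumer: str
    resource: str
    quantity: float


def bill(records):
    """Sum metered usage into one charge per consumer."""
    totals = {}
    for record in records:
        cost = record.quantity * UNIT_PRICES[record.resource]
        totals[record.consumer] = totals.get(record.consumer, 0.0) + cost
    return {consumer: round(total, 2) for consumer, total in totals.items()}


records = [
    UsageRecord("acme", "storage_gb_month", 50),
    UsageRecord("acme", "cpu_hours", 200),
    UsageRecord("beta", "bandwidth_gb", 10),
]
charges = bill(records)
```

Because every charge is derived from recorded usage, both the provider and the consumer can audit the same records, which is the transparency the definition refers to.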

Service Models:

• Software as a Service (SaaS) - The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

• Platform as a Service (PaaS) - The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

• Infrastructure as a Service (IaaS) - The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models:

• Private cloud - The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

• Community cloud - The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.

• Public cloud - The cloud infrastructure is provisioned for open use by the general public. It is usually an open system available to the general public via the WWW or the Internet. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider. Examples of public clouds: Google App Engine, Amazon Elastic Compute Cloud, Microsoft Azure.

• Hybrid cloud - The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds). [2]

2.3 Cloud computing architecture

Cloud computing is essentially a series of levels that function together in various ways to create a system. This system is also referred to as cloud computing architecture. The cloud creates a system where resources can be pooled and partitioned as needed. Cloud architecture can couple software running on virtualized hardware in multiple locations to provide an on-demand service to user-facing hardware and software. A cloud can be created within an organization's own infrastructure or outsourced to another datacenter. Usually resources in a cloud are virtualized resources because virtualized resources are easier to modify and optimize. A compute cloud requires virtualized storage to support the staging and storage of data. From a user's perspective, it is important that the resources appear to be infinitely scalable, that the service be measurable, and that the pricing be metered.[1]

Figure 1 Cloud computing stack

Applications in the cloud are usually composable systems, which means that they use standard components to assemble services tailored for a specific purpose. A composable component must be:

• Modular: It is a self-contained and independent unit that is cooperative, reusable, and replaceable.

• Stateless: A transaction is executed without regard to other transactions or requests.

In general, cloud computing does not require hardware and software to be composable, but composability is a highly desirable characteristic. It makes system design easier to implement and makes solutions more portable and interoperable.

Some of the benefits of a composable system are:

• Easier to assemble systems

• Cheaper system development

• More reliable operation

• A larger pool of qualified developers

• A logical design methodology

The trend toward designing composable systems in cloud computing is visible in the widespread adoption of what has come to be called Service Oriented Architecture (SOA). The essence of a service-oriented design is that services are constructed from a set of modules using standard communications and service interfaces. One widely used set of standards describes the services themselves in terms of the Web Services Description Language (WSDL), data exchange between services using some form of XML, and the communications between the services using the SOAP protocol. There are, of course, alternative sets of standards.[1]

What isn't specified is the nature of the module itself; it can be written in any programming language the developer wants. From the standpoint of the system, the module is a black box, and only the interface is well specified. This independence of the internal workings of the module or component means it can be swapped out for a different model, relocated, or replaced at will, provided that the interface specification remains unchanged. That is a powerful benefit to any system or application provider as their products evolve.
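The idea of a module as a black box behind a fixed interface can be sketched in a few lines. The example is only an illustration: it uses a Python `Protocol` in place of a WSDL description, and the names `KeyValueService`, `DictBackedService`, `JsonBackedService` and `store_and_read` are hypothetical.

```python
import json
from typing import Protocol


class KeyValueService(Protocol):
    """The published interface: the only thing a caller may rely on."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> str: ...


class DictBackedService:
    """One possible implementation, keeping values in a dictionary."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = value

    def get(self, key: str) -> str:
        return self._items[key]


class JsonBackedService:
    """A different implementation that keeps everything in one JSON string."""

    def __init__(self) -> None:
        self._blob = "{}"

    def put(self, key: str, value: str) -> None:
        items = json.loads(self._blob)
        items[key] = value
        self._blob = json.dumps(items)

    def get(self, key: str) -> str:
        return json.loads(self._blob)[key]


def store_and_read(service: KeyValueService) -> str:
    """Caller code, written against the interface and not an implementation."""
    service.put("greeting", "hello")
    return service.get("greeting")


results = [store_and_read(DictBackedService()), store_and_read(JsonBackedService())]
```

The caller behaves identically with either module, so one can be swapped for the other as long as the interface stays unchanged.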

Essentially there are three tiers in a basic cloud computing architecture:

• Infrastructure

• Platform

• Application

If we break down the standard cloud computing architecture further, there are really two areas to deal with: the front end and the back end.

Front End - The front end includes all client (user) devices and hardware, in addition to their computer network and the application that they actually use to make a connection with the cloud.

Back End - The back end is populated with the various servers, data storage devices and hardware that facilitate the functionality of a cloud computing network.

2.3.1 Infrastructure

The infrastructure of a cloud computing architecture is essentially all the hardware, data storage devices (including virtualized hardware), networking equipment, applications and software that operate and drive the cloud.

Most Infrastructure as a Service (IaaS) providers use virtual machines to deliver servers that run applications. Virtual machine images or instances are containers that have specific resources assigned to them (number of CPU cycles, memory access, network bandwidth, etc.).

Figure 2 shows the part of the cloud computing stack that is defined as the server. The Virtual Machine Monitor, also called a hypervisor, is the low-level software that allows different operating systems to run in their own memory space and manages I/O for the virtual machines.[1]

Figure 2 "Server" stack

2.3.2 Platform

A cloud computing platform is the programming, code and implemented interfacing systems that help user-level devices (and applications) connect with the hardware and software resources of the cloud. It is a software layer that is used to create higher-level services.

A cloud computing platform is generally divided between the front end and the back end of a network. Its job is to provide a communication and access portal for clients, so that they can effectively utilize the resources of the cloud network. The platform may be only a set of directions, but it is in fact the most integral part of a cloud computing network; without it cloud computing would not be possible.

There are many different Platform as a Service (PaaS) providers. Some of them are:

• Salesforce.com’s Force.com and Database.com Platforms

• Windows Azure Platform

• Google Apps and Google AppEngine

• Amazon Web Services

All platform services offer the hosted hardware and software needed to build and deploy Web applications or services that are custom built by developers.

with the same technologies that have been successfully used to create Web applications. Thus, you might find a platform based on an Oracle xVM hypervisor virtual machine that includes a NetBeans Integrated Development Environment (IDE) and that supports the Oracle GlassFish Web stack programmable using Perl or Ruby. For Windows, Microsoft would be similarly interested in providing a platform that allowed Windows developers to run on a Hyper-V VM, use the ASP.NET application framework, support one of its enterprise applications such as SQL Server, and be programmable within Visual Studio, which is essentially what the Azure Platform does. This approach allows someone to develop a program in the cloud that can be used by others.

Platforms often come with tools and utilities to aid in application design and deployment. Depending on the vendor, these can include tools for team collaboration, testing tools, versioning tools, database and web service integration, and storage tools. Platform providers begin by creating a developer community to support the work done in the environment.

A platform is exposed to users through an API; likewise, an application built in the cloud using a platform service would encapsulate the service through its own API. An API can control data flow, communications, and other important aspects of the cloud application. So far there is no standard API, and each cloud vendor has its own.

2.3.3 Application Platform as a Service (APaaS) or Virtual appliances

A virtual appliance is software that installs as middleware onto a virtual machine. Virtual appliances are usually a Web server, database server, BPM, ESBs, Messaging Portals and others that run on a virtual machine image. This model, referred to by some as Application Platform as a Service, is more or less a horizontal extension of the PaaS offerings.

APaaS is a type of service model that gives cloud software developers the power to actually do their jobs. It gives them an opportunity to use APaaS/virtual appliances to build more complex services. Within the APaaS system, the actual software architectures of applications are built and established. It is also within this layer that overall portability (and the ability of an application to function alongside a bevy of other cloud applications as well as operating systems) is established. Since most of the actual developmental breakthroughs (both in terms of software and overall cloud usability) occur within the realm of the middleware (PaaS, APaaS), it makes sense that a great deal of attention is paid to it. [3] For example, Amazon Web Services offers more than 700 different virtual machine images preconfigured with enterprise applications such as Oracle BPM and SQL Server, and even complete application stacks such as LAMP (Linux, Apache, MySQL, and PHP), which are used to create a virtual machine within the Amazon Elastic Compute Cloud (EC2). Such an image serves as the basic unit of deployment for services delivered using EC2.

APaaS gives software developers a solid part of the platform that they can stand on, with its own impressive workbench of tools, while they are constructing and envisioning new possibilities. The true benefit of APaaS, however, is its ability to provide accurate feedback regarding the functionality and compatibility of applications that are still under development. This is extremely important to software developers, who can take serious losses (in terms of both money and time spent) if they produce an application that simply won’t function in an environment, behave as expected once deployed, or function in a compatible manner with other elements in a cloud infrastructure. Companies that want to run their IT and/or software development projects through an APaaS need only pay subscription fees and not licensing fees. Subscription is

packages that are put together for designers are often much easier to use than most standardized design tools. These packages often allow software development teams to integrate and share their work more smoothly as well as run the project from start to finish much faster than with other systems.[3]

The global emergence of APaaS will no doubt lead to the creation of a number of companies that will utilize the tools of APaaS to create their own business model, especially one that seeks to provide yet another proprietary service aimed at delivering timely solutions to business software issues. One particular area that could use the help is enterprise software. Enterprise software is often hard to manage, difficult to customize and frequently falls short in its functionality. When these shortcomings are coupled with the fact that it is often quite expensive, there is a serious problem. An obvious solution for dealing with enterprise software problems would be the deployment of an APaaS-style service. Not only would this greatly increase the overall functionality of expensive enterprise business software, but it would also allow for a great range of customization, as well as the option of integrating it with other cloud services and/or networking opportunities. APaaS was created to make the lives of software designers, developers and investors much easier. It is through the use of APaaS that many excellent next-generation apps have been developed, and many experts in the field of cloud computing agree that it is APaaS that will produce some of the upcoming “game changing” applications that will actually shape the future of cloud computing in general.

2.3.4 Application

This area comprises the client hardware and the interface used to connect to the cloud. Big problems arise from the fact that Internet protocols are designed to treat each request to a server as an independent transaction (stateless service) [1]. The standard HTTP commands are all atomic in nature. While stateless servers are easier to architect and stateless transactions are more resilient and can survive outages, much of the useful work that computer systems need to accomplish is stateful. Transaction servers, message queuing servers and similar middleware are meant to bridge this gap. Standard methods that are part of Service Oriented Architecture, that help to solve this issue, and that are used in cloud computing are:

• Orchestration – process flow can be choreographed as a service

• Use of a service bus that controls cloud components
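One simple way to carry stateful work over a stateless protocol is to keep the state outside the request handler, which is a much reduced version of what the middleware mentioned above does. The sketch below is an assumption-laden illustration: `SESSION_STORE` is a plain dictionary standing in for an external store, and `handle_request` and its request fields are hypothetical names.

```python
import uuid

# Stands in for an external session store (for example a database or a
# cache service); a plain dictionary is used here only for illustration.
SESSION_STORE = {}


def handle_request(request):
    """Handle one independent request; all state lives in SESSION_STORE."""
    session_id = request.get("session_id") or uuid.uuid4().hex
    cart = SESSION_STORE.setdefault(session_id, [])
    if "add_item" in request:
        cart.append(request["add_item"])
    # Return a copy so the caller cannot modify the stored state directly.
    return {"session_id": session_id, "cart": list(cart)}


first = handle_request({"add_item": "book"})
second = handle_request({"session_id": first["session_id"], "add_item": "pen"})
```

The handler itself remembers nothing between calls, so any server instance with access to the store could answer the second request.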

There are many ways in which clients can connect to a cloud service. The most common are:

• Web browser

• Proprietary application

These applications can run on a number of different devices: PCs, servers, smartphones, and tablets. They all need a secure way to communicate with the cloud. Some of the basic methods to secure the connection are:

• A secure protocol such as SSL (HTTPS), FTPS, IPSec or SSH

• A virtual connection using a virtual private network (VPN)

• Remote data transfer such as Microsoft RDP or Citrix ICA, which use a tunneling mechanism
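As an illustration of the first method, the following minimal sketch uses Python's standard `ssl` module to prepare a client connection that verifies the server certificate. The helper names `make_client_context` and `tls_version_of` are hypothetical, and the host to contact is left as a parameter.

```python
import socket
import ssl
from typing import Optional


def make_client_context() -> ssl.SSLContext:
    """Create a client context with the library's secure defaults."""
    return ssl.create_default_context()


def tls_version_of(host: str, port: int = 443) -> Optional[str]:
    """Open a verified TLS connection and report the negotiated version."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=10) as raw_socket:
        with context.wrap_socket(raw_socket, server_hostname=host) as tls_socket:
            return tls_socket.version()


# The default context requires a valid certificate and checks the host name.
context = make_client_context()
```

Calling `tls_version_of` with the host name of a cloud service would open an encrypted channel; the call is omitted here because it needs network access.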

3. Scalability

Scalability is the ability of a system to handle a growing amount of work in a capable manner, or its ability to improve when additional resources are added.

The scalability requirement arises due to the constant load fluctuations that are common in the context of Web-based services. In fact these load fluctuations occur at varying frequencies: daily, weekly, and over longer periods. The other source of load variation is unpredictable growth (or decline) in usage. Scalable design is needed to ensure that the system capacity can be augmented by adding hardware resources whenever warranted by load fluctuations. Thus, scalability has emerged both as a critical requirement and as a fundamental challenge in the context of cloud computing.[1][4]

Typically there are two ways to increase scalability:

• Vertical scalability – adding hardware resources to an existing node, usually CPU, memory, etc. Vertical scaling (scaling up) enables virtualization technologies to be used more effectively by providing more resources for the hosted operating systems and applications to share.

• Horizontal scalability – adding more nodes to a system, such as adding a new node to a distributed software application or adding more access points within the current system. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power. The horizontal scalability (scale-out) model also creates an increased demand for shared data storage with very high I/O performance, especially where processing of large amounts of data is required. In general, the scale-out paradigm has served as the fundamental design paradigm for the large-scale data centers of today.

Integrating multiple load balancers into your system is probably the best solution for dealing with scalability issues. There are many different forms of load balancers to choose from: server farms, software and even hardware that have been designed to handle and distribute increased traffic. Items that interfere with scalability[3]:

• Too much software clutter (no organization) within the hardware stack(s).

• Overuse of third-party scaling.

• Reliance on the use of synchronous calls.

• Not enough caching.

• Database not being used properly.

Creating a cloud network that offers the maximum level of scalability potential is entirely possible if we apply a more “diagonal” solution. By incorporating the best solutions present in both vertical and horizontal scaling, it is possible to reap the benefits of both models[3]. Once the servers reach the limit of diminishing returns (no growth), we should simply start cloning them. This allows us to keep a consistent architecture when adding new components, software, apps and users. For most individuals, problems arise from a lack of resources, not from the inherent architecture of the cloud itself. A more diagonal approach should help the business deal with the current and growing demands that it is facing.
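The diagonal approach can be expressed as a small planning rule: scale a server up until it reaches its limit, then clone it. The sketch below assumes, for simplicity, that capacity is measured by CPU count alone; `Plan` and `diagonal_plan` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    """How many identical servers to run and how large each one is."""
    cpus_per_server: int
    servers: int


def diagonal_plan(required_cpus: int, max_cpus_per_server: int) -> Plan:
    """Scale one server up to its limit first, then clone it (scale out)."""
    if required_cpus < 1 or max_cpus_per_server < 1:
        raise ValueError("CPU counts must be positive")
    if required_cpus <= max_cpus_per_server:
        # Vertical scaling is still sufficient.
        return Plan(cpus_per_server=required_cpus, servers=1)
    # The vertical limit is reached: clone the largest server.
    servers = math.ceil(required_cpus / max_cpus_per_server)
    return Plan(cpus_per_server=max_cpus_per_server, servers=servers)


small = diagonal_plan(required_cpus=4, max_cpus_per_server=16)
large = diagonal_plan(required_cpus=40, max_cpus_per_server=16)
```

Because every clone has the same size, the architecture stays consistent as capacity grows.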

4. Elasticity

Of all the attributes possessed by cloud computing, the most important is certainly its elasticity: its ability to amplify and instantly upgrade resources and/or capacities at a moment’s notice. Storage, processing and the scalability of applications are all elastic in the cloud. The really remarkable thing about cloud computing is the real-time infrastructure that actively responds to user requests for resources. Without the real-time monitoring and support behind this elasticity, the effectiveness, adaptability and muscle of cloud computing would be greatly undermined. It is this elastic ability that allows service providers to offer their users access to cloud computing services at such reduced costs. Since users pay only for what they use, they can save money. For example, in a traditional grid computing network every user has their own intensive hardware setup, of which most users rarely use more than 50% of the capacity. Their combined resource usage might be only 20-30% of the total resources available on a central cloud computing hardware stack. What cloud computing really offers is the ability for average users to retain their current standards and expectations while leaving the door open for instant expansion if they desire it. This also makes for a much more efficient use of energy.
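The saving that comes from paying only for what is used can be made concrete with a small calculation. The load figures and the capacity per instance below are invented for illustration, and `instances_needed` is a hypothetical helper rather than any provider's scaling policy.

```python
import math


def instances_needed(load: float, capacity_per_instance: float) -> int:
    """Smallest number of instances that can serve the given load."""
    return max(1, math.ceil(load / capacity_per_instance))


# Invented load in requests per second for six consecutive hours.
hourly_load = [120, 80, 300, 950, 400, 100]
CAPACITY_PER_INSTANCE = 100

# Elastic provisioning follows the load hour by hour.
elastic_instances = [
    instances_needed(load, CAPACITY_PER_INSTANCE) for load in hourly_load
]
elastic_instance_hours = sum(elastic_instances)

# Fixed provisioning must be sized for the peak during every hour.
fixed_instance_hours = max(elastic_instances) * len(hourly_load)
```

In this invented example the elastic setup consumes 21 instance-hours, while a fixed setup sized for the peak consumes 60.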

Elasticity offers the same computing experience to which we are accustomed, with the added benefit of near-limitless resources, while at the same time offering a way to manage energy consumption. [1][3] The elastic capabilities offered by cloud computing make it perfectly suited to handling certain activities or processes:

• Establishing an “in office” communication and online networking infrastructure (for employees). Setting up a cleaner and more efficient system for communicating and working within the organization can lead to greatly increased profits.

• Using cloud computing to handle overdrafting (high-volume data transfer periods and events). Some businesses only use cloud computing when they run out of their own resources, or when they anticipate that they might lack needed functionality. This can be something that is scheduled on an annual or bi-annual basis, designed for example to meet a seasonal demand for a particular product.

• Assigning all customer data and transaction information to a cloud computing element. This allows an organization to keep its customers’ data safe even from its own employees. Utilizing a third party to handle all customer data can also pay off in the event of a catastrophe. Cloud computing providers tend to keep your information more securely backed up than most people are aware of. [3]

5. Database Management Systems in the cloud (Database as a service)

Data and database management are an integral part of a wide variety of applications. Relational DBMSs in particular have been widely used because of the many features that they offer:

• Overall functionality, offering an intuitive and relatively simple model for modeling different types of applications.

• Consistency, dealing with concurrent workloads without worrying about the data getting out of sync.

• Performance, low latency and high throughput combined with many years of engineering and development.

• Reliability, persistence of data in the presence of different types of failures and ensuring safety.

The main concern is that DBMSs and RDBMSs are not cloud-friendly, because they are not as scalable as web servers and application servers, which can scale from a few machines to hundreds. Traditional DBMSs are not designed to run on top of a shared-nothing architecture (where a set of independent machines accomplishes a task with minimal resource overlap), and they do not provide the tools needed to scale out from a few to a large number of machines.
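A shared-nothing layout can be sketched by giving each node its own private storage and routing every key to exactly one node. The sketch is a simplified illustration that assumes hash-based routing; `Node` and `Cluster` are hypothetical names, not part of any product discussed here.

```python
import hashlib


class Node:
    """An independent machine with its own private storage."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class Cluster:
    """Routes each key to exactly one node; nodes share nothing."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, row):
        self._node_for(key).rows[key] = row

    def get(self, key):
        return self._node_for(key).rows.get(key)


cluster = Cluster(["node-a", "node-b", "node-c"])
for number in range(100):
    cluster.put(f"customer-{number}", {"id": number})
```

With this simple modulo routing, changing the number of nodes changes the placement of most keys, which hints at why scaling out a database is harder than scaling out stateless web servers.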

Technology leaders such as Google, Amazon, and Microsoft have demonstrated that data centers comprising thousands to hundreds of thousands of compute nodes provide unprecedented economies of scale, since multiple applications can share a common infrastructure. All three companies provide frameworks, such as Amazon’s AWS, Google’s AppEngine and Microsoft Azure, for hosting third-party applications in their clouds (data-center infrastructures).

The RDBMSs, or “transactional data management” databases, that back banking, airline reservation, online e-commerce, and supply chain management applications typically rely on the ACID (Atomicity, Consistency, Isolation, Durability) guarantees that databases provide. Because it is hard to maintain ACID guarantees in the face of data replication over large geographic distances (see footnote 1), these companies have even developed proprietary data management technologies referred to as key-value stores, informally called NOSQL database management systems [6].

The need for web-based applications to support a virtually unlimited number of users and to respond to sudden load fluctuations raises the requirement to make them scalable on cloud computing platforms. Such scalability needs to be provisioned dynamically, without causing any interruption in the service. Key-value stores and other NOSQL database solutions, such as Google Datastore offered with Google AppEngine, Amazon SimpleDB and DynamoDB, MongoDB, and others, have been designed to be elastic, so that they can be dynamically provisioned in the presence of load fluctuations. Some of these systems are explained in more detail later on.

1 The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency (all nodes see the same data at the same time), availability (a guarantee that every request receives a response about whether it was successful or failed), and partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.

As we move to the cloud-computing arena, which typically comprises data centers with thousands of servers, the manual approach to database administration is no longer feasible. Instead, there is a growing need to make the underlying data management layer autonomic or self-managing, especially when it comes to load redistribution, scalability, and elasticity. [7]

Figure 3 Traditional VS Cloud Data Services

This issue becomes especially acute in the context of pay-per-use cloud-computing platforms hosting multi-tenant applications. In this model, the service provider is interested in minimizing its operational cost by consolidating multiple tenants on as few machines as possible during periods of low activity and distributing these tenants over a larger number of servers during peak usage [7]. Because key-value stores have these desirable properties in the context of cloud computing and large-scale data centers, they are widely used as the data management tier for cloud-enabled Web applications.

Although it is claimed that atomicity at a single key is adequate for many Web-oriented applications, evidence is emerging that in many application scenarios this is not enough. In such cases, the responsibility for ensuring atomicity and consistency across multiple data entities falls on the application developers. As a result, multi-entity synchronization mechanisms are duplicated many times in the application software. In addition, because concurrent programs are widely recognized to be highly vulnerable to subtle bugs and errors, this approach adversely affects application reliability. How to provide atomicity beyond single entities is widely discussed in developer blogs. Recently, this problem has also been recognized by senior architects from Amazon and Google, leading to systems like MegaStore [10] that provide transactional guarantees on key-value stores.

Both the RDBMS and the NOSQL DBMS offerings in the cloud will be explained in more detail: how they work, who offers them, and how they are provisioned.

I will first focus on the relational databases offered in the cloud, starting with one of the first enterprise databases built for the cloud, Salesforce’s Database.com.


6. Database.com

Database.com is a database management system that is built for cloud computing, with multitenancy inherent in its design. Traditional RDBMSs were designed to support on-premises deployments for one organization. All core mechanisms, such as the system catalog, caching mechanisms, and query optimizer, are built to support single-tenant applications and to run directly on a specifically tuned host operating system and hardware. The only possible way to build a multi-tenant cloud database service with a standard RDBMS is to use virtualization. Unfortunately, the extra overhead of the hypervisor typically hurts the performance of the RDBMS. Database.com combines several different persistence technologies, including a custom-designed relational database schema, which are innately designed for clouds and multitenancy, with no virtualization required.

6.1 Database.com Architecture

Database.com’s core relational database technology uses a runtime engine that materializes all application data from metadata (data about the data itself). In Database.com’s metadata-driven architecture, there is a clear separation of the compiled runtime database engine (kernel), tenant data, and the metadata that describes each application’s schema. These distinct boundaries make it possible to independently update the system kernel and tenant-specific application schemas.

Figure 4 Database.com Architecture [9]

Every logical database object is internally managed using metadata. Objects (“tables” in traditional relational database parlance), fields, stored procedures, and database triggers are all abstract constructs that exist merely as metadata in Database.com’s Universal Data Dictionary (UDD).

The terminology used by Database.com is shown in Table 1.

Relational Database Term    Equivalent Term in Database.com
Database                    Organization
Table                       Object
Column                      Field
Row                         Record

Table 1 Database.com Terminology

When a new application object is defined or some procedural code is written, Database.com does not create an actual table in a database or compile any code; it simply stores metadata that the system’s engine can use to generate the virtual application components at runtime. When something about the application schema needs to be modified or customized, such as an existing field in an object, all that is required is a simple non-blocking update to the corresponding metadata [9].

To avoid performance-sapping disk I/O and code recompilations, and to improve application response times, Database.com uses massive and sophisticated metadata caches to maintain the most recently used metadata in memory. The system runtime engine must be optimized for metadata access, because otherwise frequent metadata access would prevent the service from scaling.

At the heart of Database.com is its transaction database engine. Database.com uses a relational database engine with a specialized schema built for multitenancy. It also employs a search engine (separate from the transaction engine) that optimizes full-text indexing and searches. As applications update data, the search service’s background processes asynchronously update tenant- and user-specific indexes in near real time. This separation of duties between the transaction engine and the search service lets applications process transactions without the overhead of text index updates [9].

6.2 Multitenant data model

The Database.com storage model manages virtual database structures using a set of metadata, data, and pivot tables, as illustrated in Figure 5.

When application schemas are created, the UDD keeps track of metadata concerning the objects, their fields, their relationships, and other object attributes. A few large database tables store the structured and unstructured data for all virtual tables. A set of related multitenant indexes, implemented as simple pivot tables with denormalized data, makes the combined data set extremely functional. Because Database.com manages object and field definitions as metadata rather than actual database structures, the system can tolerate online multitenant application schema maintenance activities without blocking the concurrent activity of other tenants and users [9].
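The idea of virtual tables that exist only as metadata can be illustrated with a minimal Python sketch. It is not Salesforce's implementation: the class names, the in-memory lists, and the numbered value slots are assumptions made for illustration. It only shows that a schema change touches metadata while all tenants share one data heap.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDef:
    name: str
    slot: int  # index of the generic value slot that stores this field


@dataclass
class ObjectDef:
    org_id: str
    name: str
    fields: dict = field(default_factory=dict)  # field name -> FieldDef


class MetadataStore:
    """Hypothetical stand-in for the UDD plus one shared data table."""

    def __init__(self):
        self.objects = {}   # (org_id, object name) -> ObjectDef
        self.mt_data = []   # shared rows for every tenant and every object

    def define_object(self, org_id, name, field_names):
        obj = ObjectDef(org_id, name)
        for slot, fname in enumerate(field_names):
            obj.fields[fname] = FieldDef(fname, slot)
        self.objects[(org_id, name)] = obj

    def add_field(self, org_id, name, fname):
        # A schema change is only a metadata update; stored rows are untouched.
        obj = self.objects[(org_id, name)]
        obj.fields[fname] = FieldDef(fname, len(obj.fields))

    def insert(self, org_id, name, record):
        obj = self.objects[(org_id, name)]
        values = {obj.fields[k].slot: v for k, v in record.items()}
        self.mt_data.append({"org_id": org_id, "obj": name, "values": values})

    def select(self, org_id, name):
        # The virtual table is materialized at read time from the metadata.
        obj = self.objects[(org_id, name)]
        for row in self.mt_data:
            if row["org_id"] == org_id and row["obj"] == name:
                yield {f.name: row["values"].get(f.slot)
                       for f in obj.fields.values()}
```

In this sketch, adding a field after rows have been inserted simply makes older rows report an empty value for it, which mirrors the non-blocking schema maintenance described above.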

6.3 Multitenant indexes

Database.com automatically indexes various types of fields to deliver scalable performance. Traditional database systems rely on native database indexes to quickly locate specific rows in a database table that have fields matching a specific condition. Database.com instead manages an index of MT_Data, the shared table that holds the data of all virtual tables, by synchronously copying field data marked for indexing to an appropriate column in a pivot table called MT_Indexes.

In some circumstances the external search engine can fail to respond to a search request. In these cases Database.com falls back to a secondary search mechanism. A fallback search is implemented as a direct database query with search conditions that reference the Name field of target records. To optimize global object searches (searches that span tables) without having to execute potentially expensive union queries, Database.com maintains a pivot table called MT_Fallback_Indexes that records the Name of all records. Updates to MT_Fallback_Indexes happen synchronously, as transactions modify records, so that fallback searches always have access to the most current database information [9].

6.4 Multitenant relationships

Database.com provides “relationship” datatypes that an organization can use to declare relationships (referential integrity) among tables. When an organization declares an object’s field with a relationship type, the system maps the field to a Value field in MT_Data and then uses this field to store the ObjID of a related object [9].

6.5 Multitenant field history

Database.com provides history tracking for any field. When a tenant enables auditing for a specific field, the system asynchronously records information about the changes made to the field (old and new values, change date, etc.) using an internal pivot table as an audit trail [9].

6.6 Partitioning of metadata, data, and index data

All Database.com data, metadata, and pivot table structures, including underlying database indexes, are physically partitioned by tenant (OrgID) using native database partitioning mechanisms. Data partitioning is a proven technique that database systems provide to physically divide large logical data structures into smaller, more manageable pieces. Partitioning can also help to improve the performance, scalability, and availability of a large database system such as a multitenant environment. For example, by definition, every Database.com query targets a specific tenant’s information, so the query optimizer need only consider accessing data partitions that contain that tenant’s data rather than an entire table or index. This common optimization is sometimes referred to as “partition pruning” [9].
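Partition pruning by tenant can be sketched in a few lines of Python. This is a toy model under stated assumptions: partitions are in-memory lists keyed by a hypothetical org_id, and the counter exists only to make the pruning visible. It does not reflect the native partitioning mechanism Database.com actually uses.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table that is physically split by tenant (org_id)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # org_id -> rows of that tenant
        self.partitions_scanned = 0          # instrumentation for the sketch

    def insert(self, org_id, row):
        self.partitions[org_id].append(row)

    def query(self, org_id, predicate):
        # Pruning: every query names a tenant, so only that tenant's
        # partition is read; all other partitions are never touched.
        self.partitions_scanned += 1
        rows = self.partitions.get(org_id, [])
        return [row for row in rows if predicate(row)]
```

However many tenants the table holds, a query in this sketch scans exactly one partition, which is the effect the optimizer achieves with partition pruning.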

6.7 Application development

Developers can declaratively build server-side application components using the Database.com Console. This point-and-click interface supports all facets of the application schema building process, including the creation of an application’s data model (objects and their fields, relationships, etc.), security and sharing model (users, profiles, role hierarchies, etc.), declarative logic (workflows), and programmatic logic (stored procedures and triggers). The Console provides access to built-in system features, which make it easy to implement application functionality without writing code [9].

6.8 Data Access

Database.com provides the following tools to query and work with data.

Database.com REST API and Force.com Web Services API

The REST API and Web Services API can be used to interact with Database.com by creating, retrieving, updating, and deleting records, maintaining passwords, performing searches, etc. These APIs can be used with any language that supports Web services.

The SOAP-based Web Services API is optimized for real-time client applications that update small numbers of records at a time [8] [9].

Force.com Bulk API

The Bulk API is based on REST principles, and is optimized for loading or deleting large sets of data. It can be used to insert, update, delete, or restore a large number of records asynchronously by submitting a number of batches that are processed in the background by Database.com. The Bulk API is designed to simplify the processing of a few thousand to millions of records.
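The batching idea behind the Bulk API can be illustrated with a small Python helper. It is only a client-side sketch: the function name and the batch size are assumptions, no request is sent to Database.com, and the real batch limits of the Bulk API are not modeled.

```python
from itertools import islice


def make_batches(records, batch_size):
    """Split an iterable of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A client would submit each yielded batch as one unit of background work, so a large load never has to be held in a single request.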

Apex Data Manipulation Language (DML)

DML statements are used to insert, delete, and update data from within your Apex code.

Apex Web Services

Apex methods can be exposed as Web service operations that can be called by external Web client applications. This is a powerful tool for building efficient communication between the data service and the application tier. By aggregating business logic onto Database.com, developers can:

 Simplify client development and maintenance by providing a coarse-grained application-level API

 Build more robust applications, since all of the logic implemented in Apex is executed within a transaction on Database.com [9]

6.9 Query languages

Database.com uses the Salesforce Object Query Language (SOQL) to construct database queries. Similar to the SELECT command in the Structured Query Language (SQL), SOQL allows you to specify the source object, a list of fields to retrieve, and conditions for selecting rows in the source object. Database.com also includes a full-text, multi-lingual search engine that automatically indexes all text-related fields. Apps can leverage this pre-integrated search engine using the Salesforce Object Search Language (SOSL) to perform text searches.

Unlike SOQL, which can only query one object at a time, SOSL can search text, email, and phone fields for multiple objects simultaneously [9].
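To make the difference concrete, the following Python sketch builds one SOQL and one SOSL statement as plain strings. The helper functions and the object names Account and Contact are illustrative assumptions; the sketch performs no escaping and does not send the statements anywhere.

```python
def soql_select(obj, fields, where=None):
    """Build a SOQL query against a single object."""
    query = f"SELECT {', '.join(fields)} FROM {obj}"
    if where:
        query += f" WHERE {where}"
    return query


def sosl_find(term, returning):
    """Build a SOSL text search across several objects.

    returning maps each object name to the list of fields to return.
    """
    targets = ", ".join(
        f"{obj}({', '.join(flds)})" for obj, flds in returning.items()
    )
    return f"FIND {{{term}}} IN ALL FIELDS RETURNING {targets}"


single_object = soql_select("Account", ["Id", "Name"], "Name = 'Acme'")
multi_object = sosl_find("Acme", {"Account": ["Id", "Name"], "Contact": ["Id"]})
```

The SOQL statement names exactly one source object, while the SOSL statement lists several objects in its RETURNING clause.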

6.10 Multitenant search processing

Web-based application users have come to expect an interactive search capability that can scan the entire database of an application or a selected scope of it, return ranked results that are up to date, and do it all with sub-second response times. To provide such robust search functionality for applications, Database.com uses a search engine that is separate from its transaction engine. The relationship between the two engines is depicted in Figure 6.

Figure 6 Transaction and Search engine [9]

The search engine receives data from the transactional engine, with which it creates search indexes. The transactional engine forwards search requests to the search engine, which returns results that the transaction engine uses to locate rows that satisfy the search request.

As applications update data in text fields (CLOBs, Name, etc.), a pool of background processes called indexing servers is responsible for asynchronously updating the corresponding indexes, which the search engine maintains outside the core transaction engine. To optimize the indexing process, Database.com synchronously copies modified chunks of text data to an internal “to-be-indexed” table as transactions commit, thus providing a relatively small data source that minimizes the amount of data that indexing servers must read from disk. The search engine automatically maintains separate indexes for each organization (tenant).

Depending on the current load and utilization of indexing servers, text index updates may noticeably lag behind actual transactions. To avoid unexpected search results originating from stale indexes, Database.com also maintains an MRU (most recently used) cache of recently updated rows that the system considers when materializing full-text search results. In order to efficiently support possible search scopes, MRU caches are maintained per-user and per-organization.
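One possible reading of how a cache of recently updated rows can correct a stale index is sketched below in Python. The source does not describe the actual mechanism, so everything here is an assumption: the cache capacity, the dictionary-based index, and the rule that cached text overrides indexed text.

```python
from collections import OrderedDict


class MruCache:
    """Keeps the latest text of the most recently updated rows."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()  # row id -> latest text, oldest first

    def touch(self, row_id, text):
        self.rows.pop(row_id, None)
        self.rows[row_id] = text
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the oldest update


def search(term, index, mru):
    """Search a possibly stale index, then correct it with the cache."""
    hits = {row_id for row_id, text in index.items() if term in text}
    for row_id, text in mru.rows.items():
        if term in text:
            hits.add(row_id)      # updated row now matches
        else:
            hits.discard(row_id)  # indexed text is out of date
    return hits
```

Rows that the indexing servers have not yet processed are therefore judged by their current text rather than by the lagging index.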

Database.com’s search engine optimizes the ranking of records within search results using several different methods. For example, the system considers the security domain of the user performing a search and weighs those rows to which the current user has access more heavily. The system can also consider the modification history of a particular row and rank more actively updated rows ahead of those that are relatively static. The user can choose to weight search results as desired, for example, placing more emphasis on recently modified rows.

6.11 Multitenant isolation and protection

To protect the overall scalability and performance of the shared database system for all concerned applications, Database.com uses an extensive set of governors and resource limits associated with code execution. The execution of a code script is monitored, and limits apply to how much CPU time it can use, how much memory it can consume, how many queries and DML statements it can execute, how many math calculations it can perform, how many outbound Web services calls it can make, and much more. Individual queries that the optimizer regards as too expensive to execute throw an exception to the caller [9].
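The governor concept can be sketched as a per-execution counter that raises an exception once a limit is crossed. The class names, resource names, and limit values below are assumptions for illustration and do not correspond to the actual Database.com limits.

```python
class LimitExceeded(Exception):
    """Raised when a script uses more of a resource than it is allowed."""


class Governor:
    def __init__(self, limits):
        self.limits = dict(limits)               # resource -> maximum allowed
        self.used = {name: 0 for name in limits}

    def charge(self, resource, amount=1):
        # Record the usage first, then fail the caller if the budget is spent.
        self.used[resource] += amount
        if self.used[resource] > self.limits[resource]:
            raise LimitExceeded(
                f"{resource}: {self.used[resource]} > {self.limits[resource]}"
            )
```

A runtime would call charge before each query or DML statement, so a single tenant's script fails fast instead of degrading the shared system.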

Before an organization can transition a new application from development to production status, salesforce.com requires unit tests that validate the functionality of the application’s Database.com code routines. Salesforce.com executes submitted unit tests in Database.com’s sandbox development environment to ascertain if the application code will adversely affect the performance and scalability of the multitenant population at large.

Once an application’s code is certified for production by salesforce.com, the deployment process copies all the application’s metadata into a production Database.com instance and reruns the corresponding unit tests.

After a production application is live, the performance profiler automatically analyzes and provides associated feedback to administrators. Performance analysis reports include information about slow queries, data manipulations, and sub-routines that you can review and use to tune application functionality.


6.12 Deletes, undeletes

When an app deletes a record from an object, Database.com simply marks the row for deletion. Selected rows can be restored from the Recycle Bin for up to 30 days before they are permanently removed. The total number of records that are maintained for an organization is limited based on the storage limits for that organization.

The Recycle Bin also stores dropped fields and their data until an organization permanently deletes them or 45 days have elapsed, whichever happens first. Until that time, the entire field and all its data are available for restoration [9].

6.13 Backup

Database.com uses a variety of methods to ensure that organizations do not experience any data loss. Every transaction is stored to RAID disks in real-time with archive mode enabled, allowing the database to recover all transactions prior to any system failure. Every night all data is backed up to a separate backup server and automatic tape library. The backup tapes are cloned as an additional precautionary measure, and the cloned tapes are transported to an off-site, fireproof vault twice a month [8].

6.14 Pricing

Database.com pricing is based on the number of users, records, and transactions per month. Registering a new account is free and includes:

 3 Standard Users

 3 Administration Users

 100,000 records in the database

 50,000 Transactions per month


7. Microsoft’s SQL Azure

Microsoft SQL Azure Database is a cloud-based relational database service that is built on SQL Server technologies and runs in Microsoft data centers on hardware that is owned, hosted, and maintained by Microsoft.

SQL Azure is probably the most fully featured relational database available in the cloud. It is based on the SQL Server standalone database, but the way data is managed and stored in SQL Azure is significantly different.

Similar to an instance of SQL Server, SQL Azure Database exposes a tabular data stream (TDS) interface for Transact-SQL-based database access. This allows your database applications to use SQL Azure Database in the same way that they use SQL Server. Because SQL Azure Database is a service, administration in SQL Azure Database is slightly different.

Unlike administration for an on-premises instance of SQL Server, SQL Azure Database abstracts the logical administration from the physical administration. Users continue to administer databases, logins, users, and roles, but Microsoft administers the physical hardware such as hard drives, servers, and storage. This approach helps SQL Azure Database provide a large-scale multitenant database service that offers enterprise-class availability, scalability, security, and self-healing [11].

7.1 Subscriptions

To use SQL Azure, a Windows Azure platform account is required. This account allows access to all the Windows Azure-related services, such as Windows Azure, Windows Azure AppFabric, and SQL Azure. The Windows Azure platform account is used to set up and manage subscriptions and to bill for consumption of any of the Windows Azure services, including SQL Azure; running SQL Azure does not, however, require Windows Azure itself. With the Windows Azure platform account, the Windows Azure Platform Management portal can be used to create SQL Azure servers, databases, and their associated administrator accounts [11].

Each subscription allows one instance of SQL Server to be defined, which initially includes only a master database. For each server, firewall settings have to be configured to determine which connections will be allowed access.

7.2 Databases

Each SQL Azure server always includes a master database. Up to 149 additional databases can be created for each SQL Azure server. Microsoft offers two editions of SQL Azure databases, Web and Business. When a database is created using the Windows Azure Platform Management portal, the maximum size specified determines the edition. A Web Edition database can have a maximum size of 1 GB or 5 GB. A Business Edition database can have a maximum size of up to 150 GB, in 10 GB increments up to 50 GB and then in 50 GB increments [11][12]. If the size of the database reaches the limit, it is not possible to insert data, update data, or create new database objects. However, it is still possible to read and delete data, truncate tables, drop tables and indexes, and rebuild indexes.
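The size rules above can be restated as a short Python sketch. The list of Business Edition sizes is derived from the increments stated in the text; the function names and the operation labels are hypothetical and are not part of any SQL Azure API.

```python
WEB_SIZES_GB = (1, 5)
# 10 GB steps up to 50 GB, then 50 GB steps up to 150 GB.
BUSINESS_SIZES_GB = (10, 20, 30, 40, 50, 100, 150)

# Operations that grow the database and are refused once the limit is reached.
GROWING_OPERATIONS = {"insert", "update", "create_object"}


def edition_for_max_size(max_size_gb):
    """Return the edition implied by the chosen maximum size."""
    if max_size_gb in WEB_SIZES_GB:
        return "Web"
    if max_size_gb in BUSINESS_SIZES_GB:
        return "Business"
    raise ValueError(f"unsupported maximum size: {max_size_gb} GB")


def can_execute(operation, current_size_gb, max_size_gb):
    """Reads, deletes, truncates, drops, and index rebuilds stay allowed."""
    at_limit = current_size_gb >= max_size_gb
    return not (at_limit and operation in GROWING_OPERATIONS)
```

For instance, a database created with a 5 GB maximum is a Web Edition database, and once it holds 5 GB only non-growing operations such as reads and deletes are accepted in this sketch.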

The SQL Azure data access model does not support cross-database queries in the current version; a connection is made to a single database. If data from another database is needed, a new connection must be created [11].

7.3 Security and Access to a SQL Azure Database

Most security issues for SQL Azure databases are managed by Microsoft within the SQL Azure data center, with very little setup required by the users. A user must have a valid login and password in order to connect to the SQL Azure database. Because SQL Azure supports only standard security, each login must be explicitly created.

In addition, the firewall can be configured on each SQL Azure server to allow only traffic from specified IP addresses to access the SQL Azure server. This helps to greatly reduce the chance of a denial-of-service (DoS) attack. All communications between clients and SQL Azure must be SSL encrypted, and clients should always connect with Encrypt = True to guard against man-in-the-middle attacks. DoS attacks are further reduced by a service called DoSGuard, which actively tracks failed logins from IP addresses; if it notices too many failed logins from the same IP address within a period of time, the IP address is blocked from accessing any resources in the service [11].

The security model within a database is identical to that in SQL Server. Users are created and mapped to login names. Users can be assigned to roles, and users can be granted permissions. Data in each database is protected from users in other databases because the connections from the client application are established directly to the connecting user’s database.
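The failed-login tracking attributed to DoSGuard can be approximated with a sliding-window counter. The following Python sketch is an assumption about how such a guard might work, not Microsoft's implementation: the threshold, the window length, and the class name are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class FailedLoginGuard:
    def __init__(self, max_failures, window_seconds):
        self.max_failures = max_failures
        self.window = window_seconds
        self.failures = defaultdict(deque)  # ip -> timestamps of failures
        self.blocked = set()

    def record_failure(self, ip, now):
        attempts = self.failures[ip]
        attempts.append(now)
        # Forget failures that fell out of the observation window.
        while attempts and now - attempts[0] > self.window:
            attempts.popleft()
        if len(attempts) > self.max_failures:
            self.blocked.add(ip)

    def is_blocked(self, ip):
        return ip in self.blocked
```

With a threshold of three failures per 60 seconds, a fourth failure inside the window blocks the address, while the same number of failures spread over several minutes does not.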

7.4 SQL Azure architecture

Each SQL Azure database is associated with its own subscription. From the subscriber’s perspective, SQL Azure provides logical databases for application data storage. In reality, each subscriber’s data is replicated across three SQL Server databases that are distributed across three physical servers in a single data center. Many subscribers may share the same physical database, but the data is presented to each subscriber through a logical database that abstracts the physical storage architecture and uses automatic load balancing and connection routing to access the data. The logical database that the subscriber creates and uses for database storage is referred to as a SQL Azure database [11].

7.5 Logical Databases on a SQL Azure Server

SQL Azure subscribers access the actual databases, which are stored on multiple machines in the data center, through the logical server. The SQL Azure Gateway service acts as a proxy, forwarding the Tabular Data Stream (TDS) requests to the logical server. It also acts as a security boundary, providing login validation, enforcing the firewall, and protecting the instances of SQL Server behind the gateway against denial-of-service attacks. The Gateway is composed of multiple computers, each of which accepts connections from clients, validates the connection information, and then passes on the TDS to the appropriate physical server, based on the database name specified in the connection. Figure 7 shows the physical architecture represented by the single logical server.

Figure 7 A logical server and its databases distributed across machines in the data center [11]

The machines with the SQL Server instances are called data nodes. Each data node contains a single SQL Server instance, and each instance has a single user database, divided into partitions. Each partition contains one SQL Azure client database, either a primary or a secondary replica. Each database hosted in the SQL Azure data center has three replicas: one primary replica and two secondary replicas. All reads and writes go through the primary replica, and any changes are replicated to the secondary replicas asynchronously. The replicas are the central means of providing high availability for SQL Azure databases.

The partitions of other SQL Azure databases that exist within the same SQL Server instances in the data center are completely invisible and unavailable to other subscribers [11].

For SQL Azure databases every commit needs to be a quorum commit. That is, the primary replica and at least one of the secondary replicas must confirm that the log records have been written before the transaction is considered to be committed.
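The quorum rule can be expressed as a minimal Python sketch. It models each replica as an in-memory log and writes to the secondaries synchronously for simplicity; the class, the function, and the healthy flag are assumptions made for illustration only.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def write_log(self, record):
        """Append the record and acknowledge, unless the replica is down."""
        if self.healthy:
            self.log.append(record)
        return self.healthy


def quorum_commit(record, primary, secondaries):
    """Commit only if the primary and at least one secondary confirm."""
    if not primary.write_log(record):
        return False
    acks = sum(1 for replica in secondaries if replica.write_log(record))
    return acks >= 1
```

With one primary and two secondaries, the transaction still commits when a single secondary is unavailable, but not when both secondaries or the primary fail to confirm.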

Each data node machine hosts a set of processes referred to as the fabric. The fabric processes perform the following tasks:

 Failure detection: notes when a primary or secondary replica becomes unavailable so that the Reconfiguration Agent can be triggered

 Reconfiguration Agent: manages the re-establishment of primary or secondary replicas after a node failure


 PM (Partition Manager) Location Resolution: allows messages to be sent to the Partition Manager

 Engine Throttling: ensures that one logical server does not use a disproportionate amount of the node’s resources, or exceed its physical limits

 Ring Topology: manages the machines in a cluster as a logical ring, so that each machine has two neighbors that can detect when the machine goes down
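The ring arrangement named in the last item can be pictured with a short Python function that returns the two neighbors of a machine. The machine names and the list-based ring are assumptions for illustration; the real fabric is not described at this level of detail in the source.

```python
def ring_neighbors(machines, machine):
    """Return the (previous, next) neighbors of a machine in a logical ring."""
    position = machines.index(machine)
    count = len(machines)
    previous = machines[(position - 1) % count]
    following = machines[(position + 1) % count]
    return previous, following
```

Because the indices wrap around, the first and last machines in the list are neighbors of each other, so every machine always has two peers that can notice when it goes down.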

The machines in the data center are all commodity machines with components that are of low-to-medium quality and low-to-medium performance capacity. The low cost and the easily available configuration make it easy to quickly replace machines in case of a failure condition. In addition, Windows Azure machines use the same commodity hardware, so that all machines in the data center, whether used for SQL Azure or for Windows Azure, are interchangeable.

In Figure 7, the logical server contains three databases: DB1, DB3, and DB4. The primary replica for DB1 is on Machine 6 and the secondary replicas are on Machine 4 and Machine 5. For DB3, the primary replica is on Machine 4, and the secondary replicas are on Machine 5 and on another machine not shown in this figure. For DB4, the primary replica is on Machine 5, and the secondary replicas are on Machine 6 and on another machine not shown in this figure. Note that this diagram is a simplification. Most production Microsoft SQL Azure data centers have hundreds of machines with hundreds of actual instances of SQL Server to host the SQL Azure replicas, so it is extremely unlikely that if multiple SQL Azure databases have their primary replicas on the same machine, their secondary replicas will also share a machine [11].

The physical distribution of databases that all are part of one logical instance of SQL Server means that each connection is tied to a single database, not a single instance of SQL Server.

7.6 Network Topology

Four distinct layers of abstraction work together to provide the logical database for the subscriber’s application to use: the client layer, the services layer, the platform layer, and the infrastructure layer. Figure 8 illustrates the relationship between these four layers.

The client layer resides closest to the application, and it is used by the application to communicate directly with SQL Azure. The client layer can reside on-premises in a data center, or it can be hosted in Windows Azure. Every protocol that can generate TDS over the wire is supported. Because SQL Azure provides the same TDS interface as SQL Server, familiar tools and libraries can be used to build client applications for data that is in the cloud.

The infrastructure layer represents the IT administration of the physical hardware and operating systems that support the services layer.


7.7 High Availability with SQL Azure

The goal for Microsoft SQL Azure is to maintain 99.9 percent availability for the subscribers’ databases. As stated earlier, this goal is achieved through the use of commodity hardware that can be quickly and easily replaced in the case of machine or drive failure, and through the management of the replicas, one primary and two secondary, for each SQL Azure database [12].

7.8 Failure Detection

Management in the data centers needs to detect not only a complete failure of a machine, but also conditions where machines are slowly degenerating and communication with them is affected. The concept of quorum commit, discussed earlier, addresses these conditions. First, a transaction is not considered to be committed unless the primary replica and at least one secondary replica can confirm that the transaction log records were successfully written to disk. Second, because both a primary replica and a secondary replica must report success, small failures that might not prevent a transaction from committing but that might point to a growing problem can be detected [11].

7.9 Reconfiguration

The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required due to failed hardware or to an operating system crash, or to a problem with the instance of SQL Server running on the node in the data center. Reconfiguration can also be necessary when an upgrade is performed, whether for the operating system, for SQL Server, or for SQL Azure. All nodes are monitored by six peers, each on a different rack than the failed machine. The peers are referred to as neighbors. A failure is reported by one of the neighbors of the failed node, and the process of reconfiguration is carried out for each database that has a replica on the failed node. Because each machine holds replicas of hundreds of SQL Azure databases (some primary replicas and some secondary replicas), if a node fails, the reconfiguration operations are performed hundreds of times. There is no prioritization in handling the hundreds of failures when a node fails; the Partition Manager randomly selects a failed replica to handle, and when it is done with that one, it chooses another, until all of the replica failures have been dealt with.

If a node goes down because of a reboot, that is considered a clean failure, because the neighbors receive a clear exception message.

Another possibility is that a machine stops responding for an unknown reason, and an ambiguous failure is detected. In this case, an arbitrator process determines whether the node is really down. Although this discussion centers on the failure of a single replica, it is really the failure of a node that is detected and dealt with. A node contains an entire SQL Server instance with multiple partitions containing replicas from up to 650 different databases. Some of the replicas will be primary and some will be secondary. When a node fails, the processes described earlier are performed for each affected database. That is, for some of the databases, the primary replica fails, and the arbitrator chooses a new primary replica from the existing secondary replicas, and for other databases, a