Data Characteristics - ARCHITECTING THE CLOUD

There are many characteristics of data that should be taken into consideration when building cloud services. Here is a short list of categories:

■ Physical characteristics

■ Performance requirements

■ Volatility

■ Volume

■ Regulatory requirements

■ Transaction boundaries

■ Retention period

All of the data requirements listed here factor into the decision of how to store the underlying data. There are two key decisions to make that we will discuss toward the end of the chapter:

1. Multitenant or single tenant.

2. Which type of data store to use: SQL, NoSQL, fi le, and so on.

In the following sections we will discuss design considerations for each data characteristic.

Physical Characteristics

When analyzing physical characteristics, many data points need to be col-lected. The location of the data is an important piece of information. Does the data already exist or is this a new data set? If it already exists, does the data need to be moved to the cloud or will the data be created in the cloud?

If the data has to be transported to the cloud, how big is it? If we are talking about a huge amount of data (terabytes), this presents a challenge. Some cloud vendors offer the capability to ship large amounts of data to them so they can manually load the data on the customer ’s behalf, but if the data is highly sen-sitive, do we really want a truckload of sensitive data going off-site? If the data is new, more than likely all of the data can be created in the cloud (public or private) and the ugly step of transferring huge amounts of data for an initial load is not needed. The location of the data also is relevant when analyzing legal responsibilities. Different countries have different laws about data enter-ing and leaventer-ing the country ’s borders.

Who owns the data? Does the company building the software own it, is the data coming from a third party, or does the customer of the system own the data? Can the data be shared with other companies? If so, do certain attrib-utes need to be masked to hide it from other parties? Ownership of data is an important characteristic and the answer to the ownership question should be written in the contracts between the service providers and their customers.

For a company building Software as a Service (SaaS), Platform as a Service (PaaS), or Infrastructure as a Service (IaaS) solutions, answers to data owner-ship and sharing could drive design decisions on whether separate databases and even separate database servers are required per client in order to meet certain demands around privacy, security, and service level agreements (SLAs).

Performance Requirements

Performance falls into three categories: real time, near real time, and delayed time. Real-time performance is usually defi ned as subsecond response time.

Websites typically strive for half-second response time or less. Near real time usually refers to within a second or two. Sometimes near real time means “per-ceived real time.” Per“per-ceived real time means the performance is near real time but to the end user it appears to be real time. For example, with point-of-sale (POS) technology, the perceived time for cashing out a customer at the cash register is the time it takes to spit out the receipt after all the items are scanned and the discounts are processed. Often, tasks are performed during the entire

shopping experience that may take one second or more, but these tasks are fi n-ished by checkout. Even though some tasks take longer than the real-time stan-dard of half a second, the tasks visible to the consumer (generating the receipt) occur in real time, hence the term perceived real time. Delayed time can range from a few seconds to batch time frames of daily, weekly, monthly, and so on.

The response time categories drive major design decisions. The faster the response time required, the more likely the architects will need to leverage memory over disk. Common design patterns for high-volume fast-performing data sets are:

■ Use a caching layer.

■ Reduce the size of data sets (store hash values or binary representa-tions of attributes).

■ Separate databases into read-only and write-only nodes.

■ Shard data into customer-, time-, or domain-specifi c shards.

■ Archive aging data to reduce table sizes.

■ Denormalize data sets.

There are many other methods. Understanding performance require-ments is key in driving these types of design decisions.

Volatility

Volatility refers to the frequency in which the data changes. Data sets are either static or dynamic. Static data sets are usually event-driven data that occur in chronological order. Examples are web logs, transactions, and collection data.

Common types of information in web logs are page views, referring traffi c, search terms, user IP addresses, and the like. Common transactions are bank debits and credits, point-of-sale purchases, stock trading, and so forth. Exam-ples of collection data are readings from manufacturing machines, environmen-tal readings like weather, and human genome data. Static data sets like these are write-once, read-many type data sets because they occur in a point of time but are analyzed over and over to detect patterns and observe behaviors. These data sets often are stored over long periods of time and consume terabytes of data. Large, static data sets of this nature often require nonstandard database practices to maximize performance. Common practices for mining these types of data sets are to denormalize the data, leverage star or snowfl ake schemas, leverage NoSQL databases, and, more recently, apply big data technologies.

Dynamic data requires entirely different designs. If data is changing frequently, a normalized relational database management system (RDMS) is the most common solution. Relational databases are great for processing ACID (atomicity, consistency, isolation, durability) transactions to ensure the

reliability of the data. Normalized relational databases protect the integrity of the data by ensuring that duplicate data and orphan records do not exist.

Another important characteristic of volatility is the frequency of the data.

One million rows of data a month is a much easier problem to design for than one million rows a minute. The speed at which data fl ows (add, change, delete) is a huge factor in the overall architecture of the data layer within the cloud. Understanding the different disk storage systems within the cloud is critical. For example, on Amazon Web Services (AWS), S3 is a highly reliable disk storage system, but it is not the highest performing. EBS volumes are local disk systems that lack the reliability and redundancy of S3 but perform faster. It is key to know the data requirements so that the best disk storage system is selected to solve specifi c problems.

Volume

Volume refers to the amount of data that a system must maintain and pro-cess. There are many advantages of using relational databases, but when data volumes hit a certain size, relational databases can become too slow and too expensive to maintain. Architects must also determine how much data is rel-evant enough to be maintained online and available for access and how much data should be archived or stored on slower and less expensive disks. Volume also impacts the design of a backup strategy. Backing up databases and fi le sys-tems is a critical component of business continuity and disaster recovery, and it must be compliant with regulations such as SSAE16 and others. Backups tend to consume large amounts of CPU cycles and could impact the overall system performance without a sound design. Full backups are often performed daily while incremental backups are performed multiple times throughout the day.

One common strategy is to perform backups on a slave database so that the application performance is not impacted.

Regulatory Requirements

Regulation plays a major role in design decisions pertaining to data. Almost any company delivering cloud services in a B2B model can expect custom-ers to demand certifi cations in various regulations such as SAS 70, SSAE 16, ISO 27001, HIPAA, PCI, and others. Data that is classifi ed as PII (personally identifi able information) must be encrypted in fl ight and at rest, which creates performance overhead, especially if those fi elds have high volatility and vol-umes. PII data is a big contributor for companies choosing to leverage private and hybrid clouds. Many companies refuse to put sensitive and private data in a public, multitenant environment. Understanding regulatory constraints and risks can drive deployment model decisions.

Transaction Boundaries

Transaction boundaries can be thought of as a unit of work. In e-commerce, shoppers interact with data on a web form and make various changes to the data as they think through their shopping event. When they place an order, all of the decisions they have made are either committed or rejected based on whether they have a valid credit card and available balance or if the items are still in stock. A good example of a transaction boundary is the following process fl ow common to travel sites like Expedia.com.

A consumer from Chicago is looking to take a family vacation to Disney World in Orlando, Florida. She logs onto Expedia.com and proceeds to book a fl ight, a hotel, and a rental car. Behind the scenes, Expedia is calling application programming interfaces (APIs) to three different companies, like US Airways for the airfare, Marriott for the hotel, and Hertz for the rental car. As the con-sumer is guided through the process fl ow of selecting an airline, a hotel, and a car, the data is uncommitted until the consumer confi rms the purchase. Once the consumer confi rms the itinerary, Expedia calls the three different vendors ’ APIs with a request to book the reservation. If any one of the three calls fails, Expedia needs to ask the consumer if she still wants to proceed with the other two. Even though the initial call to US Airways may have been valid, the entire transaction boundary is not yet complete, and, therefore, the fl ight should not yet be booked. Committing each part of the transaction independently could create real data quality and customer satisfaction issues if any of the parts of the transaction are purchased while other parts fail. The consumer may not want the trip if any one of the three parts of the trip can ’t be booked.

Understanding transaction boundaries is critical for determining which data points need to store state and which don ’t. Remember that RESTful (Representational State Transfer) services are stateless by design, so the archi-tect needs to determine the best way to maintain state for this multipart trans-action, which might require caching, writing to a queue, writing to a temp table or disk, or some other method. The frequency in which a multipart transaction like the Expedia example occurs and the volume of these trans-actions also come into play. If this use case is expected to occur often, then writing the data to tables or disk will likely create performance bottlenecks, and caching the data in memory might be a better solution.

Retention Period

Retention period refers to how long data must be kept. For example, fi nan-cial data is usually required to be stored for seven years for auditing pur-poses. This does not mean that seven years must be available online; it simply means the data should not be destroyed until it is older than seven years. For

example, online banking usually provides six months’ to a year ’s worth of bank statements online. Users need to submit requests for any statements older than a year. These requests are handled in batch and sometimes come with a processing fee, because the bank has to retrieve the data from off-line storage.

Understanding retention periods is a deciding factor in selecting the proper storage solutions. Data that needs to be maintained but does not have to be available online in real time or near real time can be stored on very cheap off-line disk and/or tape solutions. Often this archived data is kept off-site at a disaster recovery site. Data that needs to be retrieved instantly needs to be stored on a fast-performing disk that is redundant and can recover quickly from failures.

In document ARCHITECTING THE CLOUD (Page 111-116)