The VMware vSphere Install, Configure, Manage course teaches that a good choice for virtualization is a “well-behaved application.” An application that is badly coded or that has memory or CPU process issues, for example, can behave poorly on a virtual platform even if it performs acceptably on a physical platform.
There may be a few differences between an application running on a physical platform and the same application running on a virtualization platform. One example is the impact on other applications when such problems occur. For example, memory use peaks when a Windows service in an application stalls, and this could cause a virtual machine to request more and more memory (up to the amount of configured memory) from the hypervisor. This, in turn, causes host active memory overcommitment and may affect other workloads on the same physical host, which can no longer use memory as needed.
It is crucial to create a good virtual machine guest design, based on the application attributes, that results in a well-behaved workload (assuming that the application is stable).
The specific elements that make up guest design are discussed in Chapter 5, “Virtual Machine Design.”
What makes a good design in general, though? You need to answer this question before you can make any specific design choices. Consider a normal enterprise IT platform with file, print, email, and business application services. If the platform hosts the applications so work is generally completed on time, and the company makes money, can the IT platform be classed as a success and also a good design?
The answer may be “yes” to some extent, but there are many factors to consider.
Figure 3-1 and Table 3-2 show a key list of considerations an architect should always use to validate designs and plans. These infrastructure qualities are availability, manageability, performance, recoverability, and security. If virtualizing an application affects any of these qualities detrimentally, you need to answer the question “Should I virtualize?” When you’re considering vSphere platforms or infrastructure solutions in general, these key aspects are extremely useful as a minimum checklist to ensure that a good, well-thought-out design has been created.
[Figure: the five infrastructure qualities (availability, manageability, performance, recoverability, and security), each with areas to consider. The areas shown include uptime requirements (SLA minimum); redundancy at the platform level (VMware HA and Fault Tolerance; multiple pieces of hardware, for example, teamed NICs) and at the application level; easy to install; easy to support; simple upgrade; patch management; the KISS principle; workload attributes; SLA requirements; recovery time objective (RTO); recovery point objective (RPO); backup; archival; disaster recovery; risk; access points; and compliance requirements.]
Figure 3-1 General Areas to Consider with the Infrastructure Qualities
Table 3-2 The Infrastructure Qualities
Quality          Description
Availability     The requirements for uptime and whether fault tolerance is required
Manageability    A system that works and doesn’t need an army of dedicated engineers
Performance      KPI and satisfaction
Recoverability   How long it takes to recover; how long can the business do without it? How much data loss can be tolerated?
Security         A balance of ease of access and preventing intruders
A platform designer must develop a platform that meets the business objectives or requirements. It can be very easy to see nothing else but the requirements and to produce a solution that, although it meets requirements, is not a technically good design.
Say that a technical consultant has been hired by a business department in a large enterprise to deploy a specific multitier application. The normal in-house IT department performs the enterprise application deployment; however, due to a backlog and internal processes, the delivery time is far longer than the business department can tolerate in current market conditions. The main vision is to deploy a specific application as quickly as possible. It consists of three tiers:
The application vendor can provide support in the event of a development issue.
The business department has several projects to place in this application, within a very tight time frame.
The technical consultant reviews the size of the current project workload and considers the minimum specifications for the software:
A project plan is constructed to deploy the application, hardware is acquired, and the platform is built for around $30,000. Approximately seven days after the consultant begins work, the application is in production, and the business department is using the application on live projects.
Has the project been a success? It depends. What are the expectations from the business? The application is in production; it is serving the current workloads, and both the business users and end clients are happy. The project has technically met business requirements. However, consider a few scenarios:
• What if the web server operating system crashes?
• What if the workloads increase dramatically?
• What if a security patch needs to be installed?
• What if the hardware fails?
• What if the system is hacked?
• What if there is a human error, and the data is deleted with a SQL job or a remote access query?
Would the business reasonably expect the application to tolerate these issues? A customer or client may not specify these kinds of things, but a platform designer should at least consider them. It is assumed that the technical requirements discussed in interviews during the information-gathering phase would have identified the expectations.
When you factor in these possible issues, the delivery time frame may be unsuitable for the business, or the cost may be too much for the current projects. However, the project team should be made aware that there are possible associated risks with the proposed design and the current technical approach. All these scenarios could be stated as assumptions or risks (for example, “It is assumed that the business does not require disaster recovery”).
If the business takes on more projects that use this application, or if the criticality of the data changes (for example, through brand damage), a subsequent project could be created to add the ability to scale or recover. Should ease of scalability be considered in the original installation, or does the business feel comfortable with a quick installation, going live, seeing if it works and sells, and repeating with a bigger project, considering the possible scenarios that might affect the infrastructure?
When a platform or VMware designer is asked to review or to plan a solution, it is crucial that with all the technical platform components, the designer considers the attributes of good design, whether they are specifically requested or not. These attributes may not change an approach in some cases, but in others they could prevent a possible project failure. A platform designer must show due diligence and ensure that every design is of a good quality as well as functional for the end client.
Personally, I use the infrastructure qualities as a checklist. Each quality helps form a set of questions and areas to consider while I’m designing or reviewing a design.
If I can check each quality as considered and incorporated in the design, I feel more comfortable.
The key thoughts associated with each quality will differ from project to project.
Figure 3-1 shows some of the general areas to consider.
“It’s Okay, We Have HA!”
This is a statement I have heard in many meetings. It is sometimes valid, depending on the business requirements. But before it can be taken as true, a VMware architect needs to clarify a few basic points:
Nothing can make a techie switch off quicker than the mention of acronyms such as RTO and RPO. However, these aspects are pretty crucial to a designer.
The recovery time objective (RTO), also known as the return to operation, is the amount of time it takes to restore a service after a failure has taken place. The recovery point objective (RPO) is the point in time to which the system needs to be restored following a failure.
Say that a database crashes at 1 a.m. on Sunday morning. The website is rendered unusable. Users cannot log into their accounts. The company has a service-level agreement (SLA) with an RTO of 4 hours and an RPO of 15 minutes.
The design for the solution must provide a system that can be completely restored within 4 hours of the outage (that is, the RTO). Within 4 hours, the system must also be in the state it was in at a maximum of 15 minutes before the outage (that is, the RPO).
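The arithmetic behind these two targets can be sketched in a few lines of Python. This is only an illustration: the function name meets_sla and all timestamps are hypothetical, and only the 4-hour RTO and 15-minute RPO come from the example above.

```python
from datetime import datetime, timedelta

# SLA targets from the example in the text.
RTO = timedelta(hours=4)
RPO = timedelta(minutes=15)


def meets_sla(outage_start, service_restored, last_recoverable_point):
    """Return (rto_met, rpo_met) for a single outage."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_recoverable_point
    return downtime <= RTO, data_loss_window <= RPO


# Hypothetical timestamps: the database crashes at 1 a.m. on a Sunday.
outage = datetime(2024, 1, 7, 1, 0)
restored = datetime(2024, 1, 7, 4, 30)        # service back 3.5 hours later
last_good_copy = datetime(2024, 1, 7, 0, 50)  # newest restorable data, 10 minutes old

rto_met, rpo_met = meets_sla(outage, restored, last_good_copy)
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

Note that the two checks are independent: a restore can finish well inside the RTO and still breach the RPO if the newest recoverable copy of the data is too old.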
The RTO and RPO expectations can have a dramatic effect on the other infrastructure qualities. The lower these values, the less time the business can tolerate the service being unavailable. Added redundancy may be required. But additional redundancy can add complexity, which may in turn reduce the ease of management or in some cases make the platform harder to secure.
Understanding an application’s data flow is extremely useful. Consider the following scenario: You are a VMware consultant at a web firm. The latest solution consists of a multitier application. There are three major server roles:
• Web server, which provides web content with IIS and a custom application service.
• Application servers, which provide a platform for three services:
• Database, which provides a large single instance of Microsoft SQL Server 2008.
The application listener service takes relevant data from the feed and places it in an in-memory database. The calculation service takes this useful data and places it into usable form in the database. The web service communicates with the database and provides data to the user.
The SLA specifies an RTO of 4 hours and an RPO of 15 minutes.
From a platform perspective, a well-thought-out and configured HA solution combined with a backup/restore routine could in fact meet the SLA. Closer inspection is needed, however.
VMware HA handles a host or virtual machine restart in an automated way. The fact that the system has a number of services and an in-memory database raises concerns about usability and the state of the servers in the event of a restart. Do the services restart automatically? How long does it take for the in-memory database to populate and display to the end user? The answers to these questions will help the designer determine whether HA meets the requirements. In my experience, an RTO of a few minutes using VMware HA is enough for the majority of business applications.
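One way to answer these questions is to add up the steps that must complete before the end user sees data again and compare the total with the RTO. The sketch below assumes the steps run one after another. Every step name and duration is a hypothetical placeholder to be replaced with figures measured in a real restart test.

```python
from datetime import timedelta

RTO = timedelta(hours=4)  # from the SLA in the scenario


def estimated_recovery_time(steps):
    """Sum sequential recovery steps given as (name, duration) pairs."""
    return sum((duration for _, duration in steps), timedelta())


# All durations are hypothetical; measure them in a real restart test.
recovery_steps = [
    ("HA detects the failure and restarts the VM", timedelta(minutes=5)),
    ("guest operating system boots", timedelta(minutes=3)),
    ("application services start", timedelta(minutes=10)),
    ("in-memory database repopulates from the feed", timedelta(minutes=90)),
]

total = estimated_recovery_time(recovery_steps)
print(f"Estimated recovery: {total}, within RTO: {total <= RTO}")
```

In a breakdown like this, the virtual machine restart is often the smallest item. The application-level steps decide whether the restart alone is enough.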
Ultimately, it is the end users’ experience that matters. End users expect their systems to work and their data to be available when needed. But what does “when needed” mean?
Consider a multitier application. The end users review data over a web connection; internal analysts upload and process data from a backend private connection.
Who are the end users? From an infrastructure perspective, different levels of availability may be required for the two sets of users: An SLA helps with service delivery, but when designing a platform, it would be beneficial to understand the working practices.
In this example, the different user types at each end of the product life cycle could make your application uptime requirement a lot higher than expected. This concern would not only need to be addressed in a solution but brought to the attention of management if solutions are impractical in regard to costs and so on.
Single Points of Failure and Risk
Many IT engineers look at a technology and highlight the obvious single points of failure. This is a good practice, but it’s important to also answer some follow-up questions:
• How likely is it that this failure will happen, and are the improvements required to remove all single points of failure pragmatic from a business standpoint?
• What do we gain by removing the single points of failure?
• What about factors such as cost and ease of support and complexity? What about the applications running on or using the platform?
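A rough way to weigh likelihood against cost is an expected-loss comparison, a common risk-assessment technique that the text itself does not prescribe. The sketch below assumes that the mitigation removes the loss entirely. All figures and both function names are hypothetical.

```python
def expected_annual_loss(failures_per_year, cost_per_failure):
    """Expected yearly cost of leaving a single point of failure in place."""
    return failures_per_year * cost_per_failure


def worth_removing(failures_per_year, cost_per_failure, annual_mitigation_cost):
    """True when the redundancy costs less per year than the loss it avoids."""
    loss = expected_annual_loss(failures_per_year, cost_per_failure)
    return annual_mitigation_cost < loss


# Hypothetical figures for two single points of failure.
# A switch expected to fail once every 5 years, with a costly outage:
print(worth_removing(0.2, 50_000, 4_000))
# A file server with static data, failing once every 10 years, cheap to restore:
print(worth_removing(0.1, 2_000, 3_000))
```

A calculation like this covers only the first two questions. Ease of support, added complexity, and the applications that depend on the platform still need a judgment call.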
Consider another scenario, in which internal users use an application to upload large amounts of data using an application-specific upload tool. As part of the upload process, this tool removes file extensions and changes file names to globally unique identifiers (GUIDs). The removed information is pushed into a load file and placed into a database.
End users log in to a website and review the uploaded data securely by searching the database and being presented with rendered images of the documents. No sensitive file names or executables are revealed to the outside world. A high level of availability is required in this case. The RTO is 4 hours, and the RPO is 1 day.
A platform architect could start to build in HA by using mirrored databases and multiple single-port cards throughout. The architect could present a detailed process to allow the operations team to back up and restore the individual components.
A full disaster recovery test, often neglected until later in the project, can expose a number of issues. In this example, the operations team restores all servers from backups and templates, and all are individually successful. Great news!
The issue that arises is that the original data is backed up and restored, but the uploaded data from the tool is not backed up and restored. The team determines that the data needs to be uploaded to the remote hosting site and the original native form should be suitable.
The database is restored, and it contains a list of GUIDs that reference documents.
The original data is uploaded using the tool and given different GUIDs. This process makes the application unusable, and the objective of bringing the service back within the time frame of the SLA fails. It is possible to add remedial steps such as backing up the uploaded data and contacting the vendor to repopulate GUIDs in a restore mode. These measures would fail to work in the SLA time frame with the platform that has been designed.
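The mismatch can be reproduced in miniature. The following sketch is not the vendor tool: the upload function and the file names are hypothetical stand-ins that mimic only the behavior described above, where every upload assigns a fresh GUID.

```python
import uuid


def upload(documents):
    """Mimic the upload tool: store each file under a freshly generated GUID."""
    return {str(uuid.uuid4()): name for name in documents}


originals = ["contract.pdf", "invoice.xlsx", "memo.docx"]

# Production: the tool uploads the files and the database records their GUIDs.
file_store = upload(originals)
database_guids = set(file_store)

# Disaster recovery: the database is restored from backup, but the uploaded
# files are not, so the originals are pushed through the tool again.
restored_file_store = upload(originals)

# Database rows whose GUID no longer points at any stored file.
dangling = database_guids - set(restored_file_store)
print(f"{len(dangling)} of {len(database_guids)} database references are broken")
```

Each component restores successfully when checked on its own. The failure appears only when the restored database and the re-uploaded files are checked against each other, which is why a full end-to-end test is needed to expose it.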
Although such a restore problem may be difficult for a platform engineer to spot, this issue could have a major impact on a business. In this example, is the process a complete failure? Not at all. The tested process could in fact meet business continuity requirements. The application consists of 20+ servers; the likelihood of all servers being corrupted or impacted in the designed stable environment is very low.
Therefore, a more likely request would be to restore a specific server. A file server with static data could be seen as a low risk.
This example illustrates the difference between planning for handling problems while keeping the system up and running with some user impact (for example, one project down versus all projects down) and planning to recover from a major outage.
What Is DR, and When Is It Invoked?
In the example we’ve been looking at, it is understood that due to how the application works and factors such as security of the documents, the recovery of the full system could take a long time. Defining when the DR process (that is, full restore of service to RTO and RPO times) should be invoked would therefore be useful. For instance, a single project corruption caused by a server error may not necessarily cause the DR process to be invoked. The impact to the other projects outweighs the loss of data in that one project.
Defining clear entry criteria and success criteria for DR would make these times less stressful. Who will perform the DR process? How complicated is the process of DR? Having a DR plan that encompasses the items listed here would ensure that in the event of an emergency, the service can be brought back in a clear, methodical manner:
• Entity relationship diagram
• Key contacts in the business and vendor support
• End user notification process
• Information on backup strategy and location
• Entry criteria for the DR process
• Operations guide on the service restore process
• How to determine whether the service is successful
Testing the DR process for a service is normally carried out a specified minimum number of times per year. A platform architect should consider a well-tested and verified DR process critical and a key factor in the measurement of a successful project finish.