We’ve talked a lot about the main principles by which we’ve been designing our system—scalability and elasticity—and in the process, you’ve learned about the concepts. Thus far, we have barely been able to put these principles into practice. While you have seen that you can rapidly increase your resource allocation with a variety of AWS services, we have not done so intelligently yet. Sure, you could always run your application on the largest servers possible, but that misses the point. Elasticity, once again, means being able to scale our infrastructure in response to demand or other events, which we will collectively refer to as incidents. In this chapter, you will learn how to apply this principle by first identifying incidents and then responding to them.
How do we know we need to scale? The first and most important obstacle in responding to incidents is assessing the health of our infrastructure. While we don’t have to know specifically how many users are logged in currently, we do need to be able to assess specific metrics that have some consequence for our application. For example, if we want to know if the size of our EC2 instances is sufficient, we have to measure things such as CPU utilization and memory in order to determine the status of the current instances. If our application is running on a single instance, and its CPU utilization is 100%, there are going to be serious performance issues, and we can call this an incident.
Once we have detected an incident, we have to formulate a response. In the preceding scenario, the most obvious response would be to add another instance to our application stack. In this chapter, we will plan for this and other eventualities and automate the response to the incident. You will learn how to use load-based and time-based EC2 instances to deploy extra resources in response to, and in anticipation of, high demand. There are, of course, some incidents that will require manual intervention to fix or whose resolution is outside the scope of a beginner book. In these cases, we will set up notifications when a critical incident has occurred.
CloudWatch
Amazon has consolidated all of their monitoring metrics under an umbrella service called CloudWatch (http://aws.amazon.com/cloudwatch/). Any metric you can view in any other AWS service can also be collected and tracked in CloudWatch. Typically, the metrics will be easier to view in detail in CloudWatch. It’s important to note that AWS metrics are only available for two weeks.
Let’s begin by taking a look at the metrics in OpsWorks and comparing them to CloudWatch. Log in to the AWS Console and navigate to OpsWorks. Select the Photoalbums stack. Then open the Navigation drop-down and click Monitoring. By default, you will see the monitoring view for the layers of your application that have EC2 instances assigned (see Figure 7-1). In this view, the RDS and ELB layers do not appear.
The four categories of layer metrics are displayed, each with its own graph: CPU System, Memory Used, Load, and Processes. By default, the metrics are loaded for the last 24 hours. You can quickly see whether there are any major incidents wherein the CPU system (utilization), memory use, or load were maxed or if there is a spike in active processes. Each of the first three graph headers is actually a drop-down that can be used to select another metric from that category. Figure 7-2 shows the Memory Used drop-down.
Figure 7-1. The OpsWorks Monitoring view
Figure 7-2. An OpsWorks Monitoring metric drop-down
You can also change the date range from 24 hours to another range. These seem like pretty useful metrics, but there is no way to view them in greater detail. You might think that clicking one of the graphs will expand it, but it will take you elsewhere: to the Monitoring view for the instances in the layer. Take note of the Memory Used metric, as we will be viewing this in CloudWatch. Now let’s go to the CloudWatch dashboard to view this metric there.
Open the Services drop-down, and select CloudWatch. At the top of the console, you’ll see a heading titled “Metric Summary” (see Figure 7-3). Under this header, click the Browse Metrics button.
In the Metrics view, you can see all the metrics for your AWS account, broken down by service and then by category within that service (see Figure 7-4). The number of metrics you see here will depend on what resources have already been created on your account.
Figure 7-3. CloudWatch Metric Summary
Figure 7-4. CloudWatch Metrics by Category
Let’s take a look at the metrics for our application. Imagine that we want to see how much memory has been used by our application layer. Under the OpsWorks Metrics header, click Layer Metrics. You should see a list of metrics alphabetized by name and their corresponding LayerId, as shown in Figure 7-5. Since we only have one OpsWorks layer with EC2 instances assigned to it, each metric appears once and corresponds to the Node.js application layer.
Scroll down to the metric named memory_used. Click the check box next to it. Underneath the metric list, a graph will magically appear, displaying the metric at 5-minute intervals over the past 12 hours (see Figure 7-6).
Figure 7-5. OpsWorks Layer Metrics
Figure 7-6. OpsWorks Layer Metrics in CloudWatch
This is the same metric we were just looking at in OpsWorks, only with greater detail. By default, the span and interval of the data points are different, but we can easily change the graph to 24 hours and 1-minute averages to match the graph in OpsWorks. Click 5 Minutes to expand the interval drop-down and change to 1 Minute. Using the Time Range filter to the right of the graph (not shown in the figure), change the From field to 24. Then click Update Graph. You should now see the same data set as you saw in OpsWorks. You can also mouse over the line in the graph to view more details about each point.
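The same 24-hour, 1-minute view can also be requested programmatically. The following is a minimal sketch, not the book's own method: it assumes the boto3 Python SDK, assumes that OpsWorks layer metrics are published under the AWS/OpsWorks namespace with a LayerId dimension, and uses a hypothetical layer ID that you would replace with your own. The helper names build_metric_query and fetch_datapoints are illustrative.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(layer_id, metric_name="memory_used", hours=24,
                       period_seconds=60, now=None):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/OpsWorks",  # assumption: namespace for OpsWorks metrics
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LayerId", "Value": layer_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,  # 60 seconds matches the 1 Minute setting
        "Statistics": ["Average"],
    }


def fetch_datapoints(layer_id, **kwargs):
    """Fetch datapoints sorted by time (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the query builder works without the SDK

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**build_metric_query(layer_id, **kwargs))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    # "your-layer-id" is a placeholder, not a real OpsWorks layer ID.
    print(build_metric_query("your-layer-id"))
```

Separating the query builder from the network call makes it possible to inspect or adjust the request (for example, switching the metric name to memory_total) before anything is sent to AWS.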
Another useful feature in CloudWatch is that you can view multiple metrics. In our case, the amount of memory used is a more valuable metric when compared to the total available memory. In the metrics list, select the memory_total metric. Your memory_used plot should turn orange, and the memory_total metric
As you can see in Figure 7-7, I have 600MB in total available memory, and my usage in the past 24 hours has hovered between 400MB and 500MB for the most part. Keep in mind that this isn’t just the memory usage of the Node.js application; it includes all software, including the operating system, running on the instances. The next question is what to do with this information.
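To make the comparison concrete, the figures above can be turned into a utilization percentage. This is a small worked sketch; the function name is illustrative, and it assumes the metrics are reported in kilobytes, so that 600MB corresponds to roughly 600,000.

```python
def memory_utilization(used_kb, total_kb):
    """Return memory usage as a percentage of total memory."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * used_kb / total_kb


# The low and high ends of the usage range described above, against 600MB total.
for used in (400_000, 500_000):
    print(f"{used} KB used -> {memory_utilization(used, 600_000):.0f}% of total")
```

With these numbers, usage works out to roughly 67% and 83% of total memory, which says more about the remaining headroom than either raw value does on its own.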
CloudWatch Alarms
When your metrics cross certain thresholds, you can configure a CloudWatch alarm to send a notification to you or your team. You can create up to 5,000 alarms on your account, using any of the metrics that are accessible in CloudWatch, and at the time of this writing, these alarms cost $0.10 per month per alarm. The main purpose of creating these alarms is to quantify an incident that is occurring in your application stack, leading to either a manual or automated response.
Creating useful alarms may not be as easy as it sounds. It is entirely possible to create alarms that you would expect to go off if there was a problem with your application and then completely miss an incident occurring. For example, let’s say you were creating alarms to monitor the output of your application over HTTP (more on this later). You might create an alarm that fires when an HTTP response code of 500 is returned to a user, but instead, your application just hangs, and the request times out, due to some unforeseen error in the code. Your application would be unresponsive, and you would never know it!
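The gap described here can be illustrated with a health probe that treats "no response at all" as its own failure case instead of only looking for a 500 status. This is a hedged sketch using Python's standard library, not a CloudWatch feature; the function names, the verdict strings, and the 5-second timeout are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def classify(status_code):
    """Map a probe outcome to a verdict; None means no response arrived."""
    if status_code is None:
        return "unresponsive"  # hang, timeout, or refused connection
    if status_code >= 500:
        return "server_error"
    return "ok"


def probe(url, timeout_seconds=5.0):
    """Request the URL once and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return classify(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 500.
        return classify(error.code)
    except OSError:
        # Covers timeouts and connection failures: the case a 500-only check misses.
        return classify(None)
```

A check that only counts 500 responses would report nothing in the "unresponsive" case, which is exactly the missed incident in the example above.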
When you create an alarm, you choose a metric and a comparison operator (greater than, less than, greater than or equal to, etc.). You cannot create an alarm directly comparing two metrics. For example, you could not create an alarm that goes off when memory_used >= memory_total. You would have to configure the alarm to go off when memory_used >= 600,000. Unfortunately, this alarm would not be that useful, unless you intended to keep your instances at a particular scale.
At first glance, you might think, couldn’t the alarm go off when we use all 600MB? Then, couldn’t we add new instances and turn them back off when the alarm stops going off? When you add another instance to your layer, there will also be significant memory overhead associated with that instance, so this may not be the most practical alarm to use. As long as you have more instances (and thus more memory) online, the memory footprint may inaccurately keep the alarm state active, when in actuality the incident is no longer occurring.
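For reference, the single-metric alarm discussed above (memory_used >= 600,000) might be defined through the API roughly as follows. This is a sketch under stated assumptions: it uses the boto3 Python SDK's put_metric_alarm call, assumes the AWS/OpsWorks namespace and LayerId dimension, and the alarm name, 5-minute period, and optional notification topic ARN are illustrative choices rather than values from the text.

```python
def build_alarm_request(layer_id, threshold_kb=600_000, topic_arn=None):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    request = {
        "AlarmName": f"memory-used-{layer_id}",  # hypothetical naming scheme
        "Namespace": "AWS/OpsWorks",  # assumption: namespace for OpsWorks metrics
        "MetricName": "memory_used",
        "Dimensions": [{"Name": "LayerId", "Value": layer_id}],
        "Statistic": "Average",
        "Period": 300,  # evaluate 5-minute averages
        "EvaluationPeriods": 1,
        # The threshold is a fixed number, not another metric such as memory_total.
        "Threshold": float(threshold_kb),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
    if topic_arn:
        # Optional: an SNS topic to notify when the alarm enters the ALARM state.
        request["AlarmActions"] = [topic_arn]
    return request


def create_alarm(layer_id, **kwargs):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**build_alarm_request(layer_id, **kwargs))
```

Because the threshold is hard-coded, the request has to be rebuilt with a new value whenever the instance size or count changes, which is the practical limitation described above.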
An alarm has three possible states: OK, ALARM, and INSUFFICIENT_DATA. The OK state occurs when the condition for the alarm is false. If your alarm is designed to sound when memory_used >= 600,000 (which, as we know, is not that useful), it will be in the OK state until this condition is met, at which point it will switch to ALARM. The INSUFFICIENT_DATA state means that there is not enough data to determine whether the alarm is in the OK or ALARM state. You may see this state when the alarm is first created and has not yet collected enough data or if, for some reason, the metric data is not currently available. If you see this state for an extended period of time, it means there is something wrong with your alarm, and you should investigate why it is not working as expected.
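The three states can be modeled with a few lines of code. This is a deliberately simplified illustration of the idea, not CloudWatch's actual evaluation logic: the function name is hypothetical, missing data is represented as None, and the alarm is assumed to use a "greater than or equal to" comparison.

```python
def evaluate_alarm_state(datapoints, threshold, evaluation_periods=1):
    """Return OK, ALARM, or INSUFFICIENT_DATA for a '>= threshold' alarm.

    datapoints is a list of recent metric values, oldest first; None marks
    a period in which no data was reported.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(value is None for value in recent):
        return "INSUFFICIENT_DATA"
    if all(value >= threshold for value in recent):
        return "ALARM"
    return "OK"


print(evaluate_alarm_state([], 600_000))                  # newly created alarm
print(evaluate_alarm_state([450_000, 480_000], 600_000))  # condition is false
print(evaluate_alarm_state([590_000, 610_000], 600_000))  # condition is met
```

The three calls print INSUFFICIENT_DATA, OK, and ALARM in that order, mirroring an alarm that has just been created, one whose condition is false, and one whose condition has been met.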