Enhancing Reliability and Service Ability

Part II Foundation Elements

Chapter 3 Workstations

3.1 The Basics

4.2.1 Enhancing Reliability and Service Ability

4.2.1.1 Server Appliances

Anapplianceis a device designed specifically for a particular task. Toasters make toast. Blenders blend. One could do these things using general-purpose devices, but there are benefits to using a device designed to do one task very well.

The computer world also has appliances: file server appliances, web server appliances; email appliances; DNS appliances; and so on. The first appliance was the dedicated network router. Some scoffed, “Who would spend all that money on a device that just sits there and pushes packets when we can easily add extra interfaces to our VAX and do the same thing?” It turned out that quite a lot of people would. It became obvious that a box dedicated to a single task, and doing it well, was in many cases more valuable than a general-purpose computer that could do many tasks. And, heck, it also meant that you could reboot the VAX without taking down the network.

A server appliance brings years of experience together in one box. Architecting a server is difficult. The physical hardware for a server has all the

4.2 The Icing 85

requirements listed earlier in this chapter, as well as the system engineering and performance tuning that only a highly experienced expert can do. The software required to provide a service often involves assembling various pack- ages, gluing them together, and providing a single, unified administration system for it all. It’s a lot of work! Appliances do all this for you right out of the box.

Although a senior SA can engineer a system dedicated to file service or email out of a general-purpose server, purchasing an appliance can free the SA to focus on other tasks. Every appliance purchased results in one less system to engineer from scratch, plus access to vendor support in the unit of an outage. Appliances also let organizations without that particular expertise gain access to well-designed systems.

The other benefit of appliances is that they often have features that can’t be found elsewhere. Competition drives the vendors to add new features, increase performance, and improve reliability. For example, NetApp Filers have tunable file system snapshots, thus eliminating many requests for file restores.

4.2.1.2 Redundant Power Supplies

After hard drives, the next most failure-prone component of a system is the power supply. So, ideally, servers should have redundant power supplies.

Having a redundant power supply does not simply mean that two such devices are in the chassis. It means that the system can be operational if one power supply is not functioning: n+ 1 redundancy. Sometimes, a fully loaded system requires two power supplies to receive enough power. In this case, redundant means having three power supplies. This is an important question to ask vendors when purchasing servers and network equipment. Network equipment is particularly prone to this problem. Sometimes, when a large network device is fully loaded with power-hungry fiber interfaces, dual power supplies are a minimum, not a redundancy. Vendors often do not admit this up front.

Each power supply should have a separate power cord. Operationally speaking, the most common power problem is a power cord being accidentally pulled out of its socket. Formal studies of power reliability often overlook such problems because they are studying utility power. A single power cord for everything won’t help you in this situation! Any vendor that provides a single power cord for multiple power supplies is demonstrating ignorance of this basic operational issue.

Another reason for separate power cords is that they permit the following trick: Sometimes a device must be moved to a different power strip, UPS, or circuit. In this situation, separate power cords allow the device to move to the new power source one cord at a time, eliminating downtime.

For very-high-availability systems, each power supply should draw power from a different source, such as separate UPSs. If one UPS fails, the system keeps going. Some data centers lay out their power with this in mind. More commonly, each power supply is plugged into a different power distribution unit (PDU). If someone mistakenly overloads a PDU with two many devices, the system will stay up.

Benefit of Separate Power Cords

Tom once had a scheduled power outage for a UPS that powered an entire machine room. However, one router absolutely could not lose power; it was critical for projects that would otherwise be unaffected by the outage. That router had redundant power supplies with separate power cords. Either power supply could power the entire system. Tom moved one power cord to a non-UPS outlet that had been installed for lights and other devices that did not require UPS support. During the outage, the router lost only UPS power but continued running on normal power. The router was able to function during the entire outage without downtime.

4.2.1.3 Full versusn+ 1 Redundancy

As mentioned earlier,n+ 1 redundancyrefers to systems that are engineered such that one of any particular component can fail, yet the system is still func- tional. Some examples are RAID configurations, which can provide full service even when a single disk has failed, or an Ethernet switch with additional switch fabric components so that traffic can still be routed if one portion of the switch fabric fails.

By contrast, infull redundancy, two complete sets of hardware are linked by a fail-over configuration. The first system is performing a service and the second system sits idle, waiting to take over in case the first one fails. Thisfailovermight happen manually—someone notices that the first system failed and activates the second system—or automatically—the second system monitors the first system and activates itself (if it has determined that the first one is unavailable).

4.2 The Icing 87

Other fully redundant systems are load sharing. Both systems are fully operational and both share in the service workload. Each server has enough capacity to handle the entire service workload of the other. When one system fails, the other system takes on its failed counterpart’s workload. The systems may be configured to monitor each other’s reliability, or some external resource may control the flow and allocation of service requests.

Whennis 2 or more,n+ 1 is cheaper than full redundancy. Customers often prefer it for the economical advantage.

Usually, only server-specific subsystems aren+ 1 redundant, rather than the entire set of components. Always pay particular attention when a vendor tries to sell you onn + 1 redundancy but only parts of the system are redundant: A car with extra tires isn’t useful if its engine is dead.

4.2.1.4 Hot-Swap Components

Redundant components should be hot-swappable. Hot-swap refers to the ability to remove and replace a component while the system is running. Nor- mally, parts should be removed and replaced only when the system is powered off. Being able to hot-swap components is like being able to change a tire while the car is driving down a highway. It’s great not to have to stop to fix common problems.

The first benefit of hot-swap components is that new components can be installed while the system is running. You don’t have to schedule a downtime to install the part. However, installing a new part is a planned event and can usually be scheduled for the next maintenance period. The real benefit of hot-swap parts comes during a failure.

Inn+1 redundancy, the system can tolerate a single component failure, at which time it becomes critical to replace that part as soon as possible or risk adouble component failure.The longer you wait, the larger the risk. Without hot-swap parts, an SA will have to wait until a reboot can be scheduled to get back into the safety of n+ 1 computing. With hot-swap parts, an SA can replace the part without scheduling downtime. RAID systems have the concept of ahot spare disk that sits in the system, unused, ready to replace a failed disk. Assuming that the system can isolate the failed disk so that it doesn’t prevent the entire system from working, the system can automatically activate the hot spare disk, making it part of whichever RAID set needs it. This makes the systemn+ 2.

The more quickly the system is brought back into the fully redundant state, the better. RAID systems often run slower until a failed component

has been replaced and the RAID set has been rebuilt. More important, while the system is not fully redundant, you are at risk of a second disk failing; at that point, you lose all your data. Some RAID systems can be configured to shut themselves down if they run for more than a certain number of hours in nonredundant mode.

Hot-swappable components increase the cost of a system. When is this additional cost justified? When eliminated downtimes are worth the extra expense. If a system has scheduled downtime once a week and letting the system run at the risk of a double failure is acceptable for a week, hot- swap components may not be worth the extra expense. If the system has a maintenance period scheduled once a year, the expense is more likely to be justified.

When a vendor makes a claim of hot-swappability, always ask two ques- tions: Which parts aren’t hot-swappable? How and for how long is service interrupted when the parts are being hot-swapped? Some network devices have hot-swappable interface cards, but the CPU is not hot-swappable. Some network devices claim hot-swap capability but do a full system reset after any device is added. This reset can take seconds or minutes. Some disk subsystems must pause the I/O system for as much as 20 seconds when a drive is replaced. Others run with seriously degraded performance for many hours while the data is rebuilt onto the replacement disk. Be sure that you understand the ramifications of component failure. Don’t assume that hot-swap parts make outages disappear. They simply reduce the outage.

Vendors should, but often don’t, label components as to whether they are hot-swappable. If the vendor doesn’t provide labels, you should.

Hot-Plug versus Hot-Swap

Be mindful of components that are labeledhot-plug. This means that it is electrically safe for the part to be replaced while the system is running, but the part may not be recognized until the next reboot. Or worse, the part can be plugged in while the system is running, but the system will immediately reboot to recognize the part. This is very different from hot-swappable.

Tom once created a major, but short-lived, outage when he plugged a new 24-port FastEthernet card into a network chassis. He had been told that the cards were hot- pluggable and had assumed that the vendor meant the same thing as hot-swap. Once the board was plugged in, the entire system reset. This was the core switch for his server room and most of the networks in his division. Ouch!

4.2 The Icing 89

You can imagine the heated exchange when Tom called the vendor to complain. The vendor countered that if the installer had to power off the unit, plug the card in, and then turn power back on, the outage would be significantly longer. Hot-plug was an improvement.

From then on until the device was decommissioned, there was a big sign above it saying, “Warning: Plugging in new cards reboots system. Vendor thinks this is a good thing.”

4.2.1.5 Separate Networks for Administrative Functions

Additional network interfaces in servers permit you to build separate administrative networks. For example, it is common to have a separate network for backups and monitoring. Backups use significant amounts of bandwidth when they run, and separating that traffic from the main network means that backups won’t adversely affect customers’ use of the network. This separate network can be engineered using simpler equipment and thus be more reliable or, more important, be unaffected by outages in the main network. It also provides a way for SAs to get to the machine during such an outage. This form of redundancy solves a very specific problem.

In document The Practice of System and Network Administration 2nd Edition pdf (Page 125-130)