Clustering is the running of a single application on multiple machines at one time. This allows you to apply the resources of many machines to one problem, and when properly
implemented, it is an excellent way to handle large problems like enterprise databases, commercial websites, and serious scientific applications.
Two different technologies fall under the clustering definition: fail-over clustering and load balancing.
Fail-over clustering, also called server replication, is the process of maintaining a running spare for a server that can take over automatically in the event that the primary server fails. Typically, these solutions use disk systems that can be switched from one machine to another automatically or they mirror changes to the disk from the primary server to the secondary server, so that if something happens to the primary server, the secondary server can take over immediately.
fail-over clustering
A fault tolerance method where a server can assume the services of a failed server.
Fail-over clustering does not allow multiple servers to handle the same service at the same time; rather, responsibility for clients is switched amongst members of the cluster when a failure event occurs.
These solutions are not without their problems. Information stored in RAM on the servers is not maintained, so while the server can switch over, open network sessions will be dropped unless they use stateless protocols like HTTP. This would happen anyway if the primary server failed and was not replicated, and sessions can usually be automatically re-established on the new server for file sharing protocols without difficulty. But fail-over clustering must be specifically supported by application services like SQL servers and messaging servers, because those applications maintain responsibility for moving data amongst the members of the cluster.
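The switch-over described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how a cluster detects that the primary has failed, so the heartbeat timeout used here is an assumed detection mechanism, and the `FailoverPair` class and its parameters are hypothetical names.

```python
import time


class FailoverPair:
    """Track which of two servers is active, based on heartbeats from the primary."""

    def __init__(self, primary, secondary, timeout_seconds=5.0, clock=time.monotonic):
        self.primary = primary
        self.secondary = secondary
        self.timeout_seconds = timeout_seconds
        self._clock = clock                  # injectable so the logic can be tested
        self._last_heartbeat = clock()
        self.active = primary                # only one server handles clients at a time

    def heartbeat(self):
        """Record that the primary server is still alive."""
        self._last_heartbeat = self._clock()

    def check(self):
        """Promote the secondary if the primary has been silent too long."""
        silent_for = self._clock() - self._last_heartbeat
        if self.active == self.primary and silent_for > self.timeout_seconds:
            self.active = self.secondary     # responsibility for clients switches over
        return self.active
```

Note that the sketch only decides which server is active. It does nothing about session state held in RAM or about moving application data, which is exactly the part the applications themselves must handle.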
stateless protocol
Protocols which do not maintain any information about the client session on the server side. Stateless protocols can be easily clustered across multiple machines without fear of data loss or side effects because it does not matter which server the client connects to from one instance to the next.
Note: Fail-over clustering is the form implemented natively by Windows 2000 Advanced Server.
Load-Balancing
There is another form of clustering that works quite well for certain problems: load balancing. Load balancing is quite simple; it allows multiple machines to respond to the same IP address and balances the client load among that group. For problems such as a web service, this makes all the servers appear to be one server that can handle a massive number of simultaneous connections. Both Windows and Unix support this type of clustering.
load balancing
A clustering mechanism where individual client sessions are connected to any one of a number of identically configured servers, so that the entire load of client sessions is spread evenly among the pool of servers.
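One simple way to spread sessions evenly is round-robin assignment, sketched below in Python. The text does not name a particular balancing algorithm, so round-robin is an assumption here, and the `RoundRobinBalancer` class and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each new client session to the next server in a fixed pool."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._next = cycle(self._servers)    # endlessly repeats the pool in order

    def assign(self):
        """Return the server that should handle the next session."""
        return next(self._next)


# Hypothetical pool of identically configured web servers.
balancer = RoundRobinBalancer(["web1", "web2", "web3"])
assignments = [balancer.assign() for _ in range(6)]
# Each server receives two of the six sessions.
```

Because a returning client may land on a different server each time, this only works when the servers are interchangeable.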
Load balancing doesn’t work for problems such as file service, database, or e-mail, because there’s no standard way to replicate data stored on one server to all the rest of the servers. For
example, if on your first session you stored a file to the cluster (meaning one of the machines in the cluster) and then connected to the cluster at a later date, there’s only a small chance that you would connect again to the machine that had your file. Stateless clustering works only with applications that don’t maintain any data transmitted by the client—you can think of them as “output only” applications. Examples of this sort of application are web and FTP services.
There is a solution to even that problem, though—all the clustered machines can transmit their stored data to a single back-end storage or database server. This puts all the information in one place, where any user can find it, no matter which clustered server they’re attached to. Unfortunately, it also means that the cluster is no faster than the single machine used to store everything.
Stateless clustering works well in the one environment it was designed for: web service for large commercial sites. The amount of user information to store for a website is usually minuscule compared to the massive amount of data transmitted to each user. Because some websites need to handle millions of simultaneous sessions, this method lets designers put the client-handling load on frontline web servers and maintain the database load on back-end database servers.
Simple Server Redundancy
High availability and clustering solutions are all expensive—the software to implement them is likely to cost as much as the server you put it on. There are easy ways to implement fault tolerance, but they change depending on what you’re doing and exactly what level of fault tolerance you need. I’ll present a few ideas here to get you thinking about your fault tolerance problems.
Vendors traditionally calculate the cost of downtime using this method:
Employees × Average Pay Rate × Down Hours = Downtime Costs
Sounds reasonably complete, but it’s based on the assumption that employees in your organization become worthless the moment their computers go down. Sometimes that’s the case, but often it’s not. I’m not advocating downtime; I’m merely saying that the assumptions used to cost downtime are flawed, and that short periods of downtime aren’t nearly as expensive as data loss or the opportunity cost of lost business if your business relies on computers to transact.
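To see how much the "everyone is idle" assumption drives the result, the vendor formula can be written as a small Python function. The `idle_fraction` parameter is a hypothetical adjustment that is not part of the vendor formula, and the figures are illustrative only.

```python
def downtime_cost(employees, average_hourly_pay, down_hours, idle_fraction=1.0):
    """Vendor-style downtime cost, optionally scaled by how idle staff really are.

    idle_fraction is a hypothetical adjustment (not part of the vendor
    formula): 1.0 reproduces the vendor assumption that nobody can work.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return employees * average_hourly_pay * down_hours * idle_fraction


# Illustrative numbers only: 50 employees, $30 per hour, 15 minutes of downtime.
vendor_estimate = downtime_cost(50, 30.0, 0.25)
adjusted_estimate = downtime_cost(50, 30.0, 0.25, idle_fraction=0.4)
```

With these made-up inputs the vendor estimate comes to $375, and assuming only 40 percent of staff time is actually lost brings it down to about $150.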
If you can tolerate 15 minutes of downtime, a whole array of less expensive options emerges. For example, manually swapping an entire server doesn’t take long, especially if the hard disks are on removable cartridges. For an event that might occur once a year, this really isn’t all that bad.
The following inexpensive methods can achieve different measures of fault tolerance for specific applications.
The DNS service can assign more than one IP address to a single domain name. If there’s no response from the first address, the client can check, in order, each of the next addresses until it gets a response (however, depending on the client-side address caching mechanism, it may take a few minutes for the client to make another DNS attempt). This means that for web service, you can simply put up an array of web servers, each with its own IP address, and trust that users will be able to get through to one of them. With web service, it rarely matters which server clients attach to; as long as they’re all serving the same data, you have fault tolerance.
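The client-side behavior described above, trying each listed address in turn, can be sketched in Python as follows. The helper names (`first_responder`, `resolve_all`, `open_tcp`) are hypothetical, and real clients and resolvers may reorder or cache the address list, so treat this as an illustration rather than a description of any particular browser.

```python
import socket


def first_responder(addresses, try_connect):
    """Return the result of try_connect for the first address that succeeds."""
    last_error = None
    for address in addresses:
        try:
            return try_connect(address)
        except OSError as exc:
            last_error = exc     # this server is down; fall through to the next
    raise ConnectionError("no address responded") from last_error


def resolve_all(hostname, port):
    """Return every socket address DNS lists for hostname, in the order given."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return [info[4] for info in infos]


def open_tcp(address, timeout=3.0):
    """Open a TCP connection to one address, raising OSError on failure."""
    return socket.create_connection(address[:2], timeout=timeout)
```

A caller would combine the pieces as `first_responder(resolve_all(hostname, 80), open_tcp)`, where `hostname` is whatever name the array of web servers shares. The short timeout matters: without it, a dead server can stall the client for a long time before the next address is tried.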
Another way to solve the load-balancing problem is with firewalls. Many firewalls can be configured to load balance a single IP address across a group of identical machines, so you can have three web servers that all respond to a single address behind one of these firewalls.
Fault tolerance for standard file service can be achieved by simply cross-copying files among two or more servers. By doubling the amount of disk space in each server, you can maintain a complete copy of all the data on another machine by periodically running a script to copy files from one machine to another, or using a mechanism like the Windows File Replication Service. In the event that a machine has crashed, users can simply remap the drive letter they use for the primary machine to the machine with the share to which you have backed everything up. By using the archive bit to determine which files should be copied, you can update only those files that have changed, and you can make the update period fairly frequent (say, once per hour).
There is a time lag based on the periodicity of your copy operation, so this method may not work in every situation. Since it’s not completely automatic (users have to recognize the problem and manually remap a drive letter), it’s not appropriate for every environment. You can reduce the automation problem by providing a desktop icon that users can click to run a batch file that will remap the drive.
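A periodic cross-copy script of the kind described above might look like the following Python sketch. It makes one notable substitution: the archive bit is a Windows file attribute, so this version compares modification times instead, which is an assumption made for portability. The function name `mirror_changed_files` and its parameters are hypothetical.

```python
import os
import shutil


def mirror_changed_files(source_root, mirror_root):
    """Copy files that are new or newer than the mirror's copy; return their paths."""
    copied = []
    for folder, _subdirs, filenames in os.walk(source_root):
        relative = os.path.relpath(folder, source_root)
        target_folder = os.path.normpath(os.path.join(mirror_root, relative))
        os.makedirs(target_folder, exist_ok=True)
        for name in filenames:
            source = os.path.join(folder, name)
            target = os.path.join(target_folder, name)
            # Assumption: modification time stands in for the archive bit.
            if (not os.path.exists(target)
                    or os.path.getmtime(source) > os.path.getmtime(target)):
                shutil.copy2(source, target)   # copy2 also preserves timestamps
                copied.append(os.path.normpath(os.path.join(relative, name)))
    return copied
```

Scheduled to run once per hour, a script like this keeps the second server at most an hour behind the first. It does not remove files from the mirror that were deleted on the source, which is usually what you want from a safety copy.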
Fault tolerance doesn’t mean you have to spend a boatload of money on expensive hardware and esoteric software. It means that you must think about the problem and come up with the simplest workable solution. Sometimes that means expensive hardware and esoteric software, but not always.
Review Questions
1. What are the four major causes for loss, in order of likelihood?
2. What is the best way to recover from the effects of human error?
3. What is the most likely component to fail in a computer?
4. What is the most difficult component to replace in a computer?
5. What is the easiest way to avoid software bugs and compatibility problems?
6. How can you recover from a circuit failure when you have no control over the ISP’s repair actions?
7. What are the best ways to mitigate the effects of hacking?
8. What is the most common form of fault tolerance?
9. What is the difference between an incremental backup and a differential backup?
10. What causes the majority of failures in a tape backup solution?
12. RAID-10 is a combination of which two technologies?
13. If you create a RAID-5 pack out of five 36GB disks, how much storage will be available?
14. What are the two methods used to perform offsite storage?
15. What is the difference between backup and archiving?
16. What are the two common types of clustering?
Answers
1. The four major causes for loss are human error, routine failure, crimes, and environmental events.
2. Having a good archiving policy is the best way to recover from the effects of human error.
3. The hard disk is the most likely component to fail in a computer.
4. The hard disk is the most difficult component to replace in a computer.
5. Deployment testing is the easiest way to avoid software bugs and compatibility problems.
6. Using multiple circuits from different ISPs will help you recover from a circuit failure.
7. Strong border security, permissions security, and offline backup are the best ways to minimize the damage caused by hackers.
8. Tape backups are the most common form of fault tolerance.
9. An incremental backup contains all the files changed since the last incremental backup, while a differential backup contains the files changed since the last full system backup.
10. Humans cause the majority of failures in a tape backup system.
11. RAID-0 actually makes failure more likely rather than less likely.
12. RAID-1 and RAID-0 are combined in RAID-10.
13. Since you have to leave 1 disk for parity information, the storage available would be (5-1) x 36GB = 144GB.
14. Physically moving offline backup media to another location and transmitting data to another facility via a network are the two methods used to perform offsite storage.
15. Backup is the process of making a copy of every file for the purpose of restoration. Archiving is the process of retaining a copy of every version of all files created by users for the purpose of restoring individual files in case of human error.
16. The two common types of clustering are fail-over clustering and load balancing.