Chapter 1. Linux high availability cluster solutions
1.5 Implementing a Linux cluster
1.5.2 Red Hat High Availability Server 1.0
We also took the opportunity to use a different software distribution, still based on the same Linux Virtual Server architecture. The Red Hat Server is more recent than TurboLinux TurboCluster Server 4.0, but it should be pointed out that a new version of the TurboCluster Server will be available shortly and that we simply did not have enough time to test a beta version of this code ourselves during the preparation of this paper.
1.5.2.1 Setting up Red Hat Server
We chose to set up a failover solution as opposed to the load balancing solution we implemented using TurboLinux. This should not imply that the Red Hat software cannot also implement a load balancing solution.
                    Page Summary   Hits   TTFB Avg   TTLB Avg
via ATM             GET /          4685   82.08      82.80
direct to server    GET /          4157   167.45     171.61

Figure 32 shows an example of our setup.

Primary node s6: eth0 = 192.168.140.151
Backup node s7: eth0 = 192.168.140.152
Network: FastEthernet
FOS Cluster providing web (http, port 80) service, virtual IP address = 192.168.140.155 (eth0:1)
Figure 32. FOS setup/scenario

The installation process itself was straightforward and involved booting our servers from the installation CD and selecting the Install option (the “upgrade” option was not available because this was the first release of the high-availability server). We did not customize the package selection, allowing all the “standard packages and services” to be installed.
Since this was a new product, we also chose to upgrade the installation by using the updates available from the Red Hat FTP site.
Some minimal prerequisite configuration work was required: we had to set the Piranha password by running piranha-passwd <your password> on all cluster machines, and we had to enable remote network access by editing /etc/hosts.allow and .rhosts for root.
Each time the configuration files are set up or changed, they must be copied to all nodes in the cluster. Either rcp (in the case of rsh) or scp (in the case of ssh) is used to perform the actual copy. Note that this copy is done as root. You must decide whether to use rsh or ssh. Although the installation panels allow for either, ssh is not installed by default, so you must install it manually (using rpm) if you want to use it. Use ssh unless all network access will be over a private, trusted network.
After setting the initial Piranha password, connect to one of your servers by pointing your browser to:
http://<hostname>/piranha
The panel shown in Figure 33 on page 42 should appear. If it does not, make sure that the HTTP server is actually running on that machine. If the HTTP server is not running, you can start it by issuing the following command:
/etc/rc.d/init.d/httpd start
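The check-then-start step can be sketched as a small shell function. This is only a minimal sketch: it assumes that the init script accepts a status argument and exits with a nonzero code when the service is stopped, which we did not verify for this release. The function name ensure_running is hypothetical and is not part of Piranha.

```shell
#!/bin/sh
# Sketch only: start a service through its init script if it is not running.
# Assumption: the script supports "status" and exits nonzero when stopped.

# ensure_running SCRIPT
ensure_running() {
    if "$1" status >/dev/null 2>&1; then
        echo "already running"
    else
        "$1" start
    fi
}

# Example (run as root on a cluster node):
#   ensure_running /etc/rc.d/init.d/httpd
```

Starting the server only when the status check fails avoids a second start attempt on a node where httpd is already serving the Piranha pages.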
Figure 33. Piranha welcome window
Now you can log in using the password defined earlier, as shown in Figure 34.
Figure 34. Logging on to Piranha
After login, you will see the control/monitoring page of Piranha as in Figure 35. Since we have not yet completed the setup, there is nothing running and the daemon is stopped.
Figure 35. Control/monitoring
To configure the cluster, go to the “Global Settings” page and select the type of cluster environment to be configured. We selected fos (failover service) rather than lvs (load balancing) for our example. In this environment, there is a primary node and a backup node. As with the load-balancing environment, there is also a cluster IP address, which the active server adopts in addition to its own address. First of all, you must provide the IP address of the primary server/node (we used 192.168.140.151 = s6), and then select rsh or ssh for performing file synchronization. As mentioned earlier, you must install ssh separately if you want to use it, but there are security advantages to doing this. Click Accept to save your settings. Figure 36 on page 44 shows the configuration information we provided.
Figure 36. Global settings
Having defined the primary server, we then needed to define the backup server. This is done in the “Redundancy” section as shown in Figure 37 on page 45. Here, we entered the IP address of the backup server/node (we used 192.168.140.152 = s7).
You must also decide on some timeouts/intervals on this panel. The possible choices include:
• Heartbeat interval: The number of seconds between every heartbeat packet.
• Assume dead after: The number of seconds without a heartbeat after which the other server is assumed dead, causing a failover to occur.
• Heartbeat runs on port: The port number to use for the heartbeat protocol; normally the default value of 539 is acceptable.
The values depend on how quickly you expect a failover to occur. Smaller values mean shorter failover times, but more heartbeat traffic on the network. However, heartbeat packets are very small, so this should not be an overriding consideration. By making the dead timeout greater than the heartbeat interval, you allow a certain number of heartbeats to be lost before a failover is considered. This allows temporary network overload or network errors to be tolerated without initiating an automatic failover.
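The relationship between the two values can be sketched with simple integer arithmetic. The function name heartbeats_in_window and the example values (2 and 6 seconds) are illustrative assumptions, not the settings we used. The result is only a rough guide to how many consecutive heartbeats would have to be missed before the partner is assumed dead.

```shell
#!/bin/sh
# Sketch only: relate the heartbeat interval to the "assume dead after" value.
# Both arguments are whole seconds, as entered on the Redundancy panel.

# heartbeats_in_window INTERVAL DEAD_AFTER
# Prints how many whole heartbeat intervals fit into the dead timeout.
heartbeats_in_window() {
    if [ "$1" -le 0 ] || [ "$2" -le "$1" ]; then
        echo "dead timeout must be greater than a positive heartbeat interval" >&2
        return 1
    fi
    echo $(( $2 / $1 ))
}

# Example with illustrative values: heartbeats_in_window 2 6
```

A result of 1 would mean that a single delayed packet could be enough to trigger a failover, so a larger ratio gives more tolerance at the cost of a slower takeover.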
Remember to press Accept after entering your data in the fields. Your panel should now say Backup: active, as shown in Figure 37.
Figure 37. Backup server is now active
Finally, we configured the failover parameters as shown in Figure 38 on page 47. This is where we configured the services covered by the failover setup.
This panel requires that the following fields are completed:
• Name: A meaningful name for the service
• Address: A virtual IP address for this service (we used 192.168.140.155), distinct from the actual IP addresses of the servers themselves
• Application port: The application port (for example, 80 for HTTP, 21 for FTP, 25 for SMTP)
• Device: The name of the aliased interface associated with the virtual IP address
• Service timeout: The time, in seconds, that the service is allowed to take to respond before it is considered non-functioning
If you want, you can edit the generic monitoring scripts for watching the service.
Press Accept, and you are done. You can add more services by choosing Add instead of Edit on the failover panel.
Figure 38. Failover parameters
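To illustrate what a service check of this kind involves, the following is a minimal sketch of a send/expect probe. The request string and the expected reply prefix are the ones that appear on the nanny command line in the startup log (Figure 40); how nanny uses them internally is an assumption on our part. The helper names matches_expect and probe_http are hypothetical, and the sketch assumes that nc (netcat) is installed, which is not part of the product.

```shell
#!/bin/sh
# Sketch only: a send/expect probe similar in spirit to the service monitor.
# Assumptions: nc (netcat) is installed; probe_http and matches_expect are
# hypothetical helper names, not Piranha commands.

# matches_expect EXPECT: succeed if the first line on stdin starts with EXPECT.
matches_expect() {
    first_line=
    IFS= read -r first_line || [ -n "$first_line" ] || return 1
    case "$first_line" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# probe_http HOST PORT TIMEOUT: send a minimal HTTP request, check the reply.
probe_http() {
    printf 'GET / HTTP/1.0\r\n\r\n' | nc -w "$3" "$1" "$2" 2>/dev/null |
        matches_expect HTTP
}

# Example: probe_http 192.168.140.155 80 2 && echo "service answering"
```

Keeping the comparison in its own function makes it possible to try other expect strings for other services without touching the network part.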
Next, we distributed the configuration file /etc/lvs.cf to the other (backup) server using rcp (or scp). rcp is a remote copy program; scp is more secure and will probably prompt for a password before allowing the copy to take place. Failing to copy this configuration file, so that the primary and backup servers are both configured identically, is the most likely cause of cluster failure. Figure 39 shows the result of the rsh copy command.
[root@s6 /etc]# rcp lvs.cf s7:/etc/lvs.cf
s7.tcs.org: Connection refused
Trying krb4 rcp...
s7.tcs.org: Connection refused
trying normal rcp (/usr/bin/rcp)
[root@s6 /etc]#
Figure 39. Configuration file distribution using rsh
The Connection refused warnings can be ignored. They were received because the first two attempts tried to use Kerberos, which we had not set up. The third attempt, using /usr/bin/rcp, was successful.
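When there are several nodes, or when the file changes often, the copy step can be scripted. The following minimal sketch only prints the copy commands so that they can be reviewed before being run as root. The function names copy_tool and sync_commands are hypothetical, and the root@ prefix on the destination is our assumption about how the copy would be addressed.

```shell
#!/bin/sh
# Sketch only: print (not run) the copy command for each cluster node.
# CONFIG is the file copied in Figure 39; node names are passed as arguments.
CONFIG=/etc/lvs.cf

# copy_tool REMOTE_SHELL: map the chosen remote shell to its copy program.
copy_tool() {
    case "$1" in
        rsh) echo rcp ;;
        ssh) echo scp ;;
        *)   echo "unknown remote shell: $1" >&2; return 1 ;;
    esac
}

# sync_commands REMOTE_SHELL NODE...: print one copy command per node.
sync_commands() (
    tool=$(copy_tool "$1") || exit 1
    shift
    for node in "$@"; do
        printf '%s %s root@%s:%s\n' "$tool" "$CONFIG" "$node" "$CONFIG"
    done
)

# Example: sync_commands rsh s7
# To execute the printed commands after review: sync_commands rsh s7 | sh
```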
1.5.2.2 Using Red Hat Server 1.0
Once the configuration steps are complete, you can start your cluster by typing /etc/rc.d/init.d/pulse start on both machines. If successful, you should see the startup messages in /var/log/messages, as shown in Figure 40.
Aug 10 16:21:19 s6 pulse[31468]: STARTING PULSE AS MASTER
Aug 10 16:21:19 s6 pulse[31468]: Starting Failover Service Monitors
Aug 10 16:21:19 s6 pulse[31469]: running command "/sbin/ifconfig" "eth0:1" "down"
Aug 10 16:21:19 s6 pulse[31468]: running command "/usr/sbin/fos" "--monitor" "-c" "/etc/lvs.cf" "--nofork"
Aug 10 16:21:19 s6 fos[31470]: Stopping local services (if any)
Aug 10 16:21:19 s6 fos[31470]: Shutting down local service 192.168.140.155:80
Aug 10 16:21:19 s6 fos[31470]: running command "/etc/rc.d/init.d/httpd" "stop"
Aug 10 16:21:19 s6 pulse: pulse startup succeeded
Aug 10 16:21:19 s6 httpd: httpd shutdown failed
Aug 10 16:21:19 s6 fos[31470]: running command "/usr/sbin/nanny" "-c" "-h" "192.168.140.152" "-V" "192.168.140.155" "-p" "80" "-s" "GET / HTTP/1.0\r\n\r\n" "-x" "HTTP" "-R" "/etc/rc.d/init.d/httpd start" "-D" "/etc/rc.d/init.d/httpd stop" "-t" "2"
Aug 10 16:21:19 s6 fos[31470]: Starting monitor for 192.168.140.155:80 running as pid 31470
Aug 10 16:21:19 s6 nanny[31485]: Failover service monitor for 192.168.140.155:80 started
Aug 10 16:21:19 s6 nanny[31485]: No service active & available...
Aug 10 16:21:21 s6 pulse[31468]: partner dead: activating failover services
Aug 10 16:21:21 s6 fos[31470]: Shutting down due to signal 15
Aug 10 16:21:21 s6 fos[31470]: Shutting down monitor for webservice 192.168.140.155:80 running as pid 31485
Aug 10 16:21:21 s6 nanny[31485]: terminating due to signal 15
Aug 10 16:21:21 s6 fos[31470]: will now exit to notify pulse...
Aug 10 16:21:21 s6 pulse[31468]: running command "/usr/sbin/fos" "--active" "-c" "/etc/lvs.cf" "--nofork"
Aug 10 16:21:21 s6 fos[31488]: Stopping local services (if any)
Aug 10 16:21:21 s6 fos[31488]: Shutting down local service 192.168.140.155:80
Aug 10 16:21:21 s6 fos[31488]: running command "/etc/rc.d/init.d/httpd" "stop"
Aug 10 16:21:21 s6 pulse[31491]: running command "/sbin/ifconfig" "eth0:1" "192.168.140.155" "up"
Aug 10 16:21:21 s6 pulse[31490]: running command "/usr/sbin/send_arp" "-i" "eth0" "192.168.140.155" "0004AC6EE826" "192.168.140.159" "ffffffffffff"
Aug 10 16:21:21 s6 httpd: httpd shutdown failed
Aug 10 16:21:21 s6 fos[31488]: Starting local service 192.168.140.155:80 ...
Aug 10 16:21:21 s6 fos[31488]: running command "/etc/rc.d/init.d/httpd" "start"
Aug 10 16:21:22 s6 httpd: httpd startup succeeded
Aug 10 16:21:26 s6 pulse[31487]: gratuitous fos arps finished
Figure 40. Master starting (s6)

The backup node should log startup messages to /var/log/messages, as shown in Figure 41.

Aug 10 16:20:13 s7 pulse[24524]: STARTING PULSE AS BACKUP
Aug 10 16:20:13 s7 pulse[24524]: Starting Failover Service Monitors
Aug 10 16:20:13 s7 pulse[24525]: running command "/sbin/ifconfig" "eth0:1" "down"
Aug 10 16:20:13 s7 pulse[24524]: running command "/usr/sbin/fos" "--monitor" "-c" "/etc/lvs.cf" "--nofork"
Aug 10 16:20:13 s7 fos[24526]: Stopping local services (if any)
Aug 10 16:20:13 s7 fos[24526]: Shutting down local service 192.168.140.155:80
Aug 10 16:20:13 s7 fos[24526]: running command "/etc/rc.d/init.d/httpd" "stop"
Aug 10 16:20:13 s7 pulse: pulse startup succeeded
Aug 10 16:20:13 s7 httpd: httpd shutdown failed
Aug 10 16:20:13 s7 fos[24526]: running command "/usr/sbin/nanny" "-c" "-h" "192.168.140.151" "-V" "192.168.140.155" "-p" "80" "-s" "GET / HTTP/1.0\r\n\r\n" "-x" "HTTP" "-R" "/etc/rc.d/init.d/httpd start" "-D" "/etc/rc.d/init.d/httpd stop" "-t" "2"
Aug 10 16:20:13 s7 fos[24526]: Starting monitor for 192.168.140.155:80 running as pid 24526
Aug 10 16:20:13 s7 nanny[24541]: Failover service monitor for 192.168.140.155:80 started
Aug 10 16:20:13 s7 nanny[24541]: Remote service 192.168.140.151:80 is available
Figure 41. Backup starting (s7)

The httpd shutdown failed messages are received because the httpd server was not actually running in the first place.
The Control/Monitoring panel should state Daemon: running and list the current failover processes as shown in Figure 42 on page 50.
Figure 42. The failover cluster operating
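The startup messages in Figures 40 and 41 can also be checked from the command line instead of reading the whole log. The following is a minimal sketch that relies only on the message texts shown in those figures; the function names pulse_role and pulse_started are hypothetical.

```shell
#!/bin/sh
# Sketch only: summarize pulse startup from a syslog file.
# LOGFILE is normally /var/log/messages; the function names are hypothetical.

# pulse_role LOGFILE: print MASTER or BACKUP from the most recent startup line.
pulse_role() {
    sed -n 's/.*STARTING PULSE AS \([A-Z]*\).*/\1/p' "$1" | tail -n 1
}

# pulse_started LOGFILE: succeed if a startup confirmation was logged.
pulse_started() {
    grep -q 'pulse: pulse startup succeeded' "$1"
}

# Example:
#   pulse_started /var/log/messages && pulse_role /var/log/messages
```

Running this on each node gives a quick confirmation that one machine started as master and the other as backup.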
1.5.2.3 Red Hat Server 1.0 in operation
We adopted a simple approach to testing failover: we pulled the network cable from the master server and watched the /var/log/messages log file (tail -f /var/log/messages). After the configured timeout period, we saw messages indicating that the master was no longer available and that the backup had taken over the role of active server. The messages shown in Figure 43 are taken from s7, which was our backup server (see Figure 32 on page 40 for our network diagram).
Figure 43. Failover to backup node (s7)
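Rather than watching the whole log scroll by, the takeover decision can be filtered out of it. This minimal sketch assumes that the backup node logs the same pulse message on takeover that appears in the startup log in Figure 40 (partner dead: activating failover services). The function name failover_events is hypothetical.

```shell
#!/bin/sh
# Sketch only: list the moments at which pulse decided to take over.
# Assumption: the takeover message matches the one shown in Figure 40.

# failover_events LOGFILE: print "Mon DD HH:MM:SS host" for each takeover.
failover_events() {
    grep 'pulse\[[0-9]*\]: partner dead: activating failover services' "$1" |
        awk '{ print $1, $2, $3, $4 }'
}

# Example: failover_events /var/log/messages
```

Comparing the printed time with the moment the cable was pulled gives a rough measure of how long the failover took with the chosen heartbeat settings.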
All services remained reachable using the virtual IP address 192.168.140.155. We then reconnected the cable and observed a fail-back in which the primary server, s6, resumed its role, as shown in Figure 44.
Figure 44. Fall back to master (log from backup node = s7)
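A test of this kind can be made more repeatable by polling the virtual IP address from a client while the cable is pulled and replaced. The following is a minimal sketch, not what we ran: it assumes that curl is available on the client, and the function names watch_vip and longest_outage are hypothetical.

```shell
#!/bin/sh
# Sketch only: poll the cluster address and summarize any outage.
# Assumption: curl is installed on the client machine.

# watch_vip URL: print "HH:MM:SS up|down" once per second until interrupted.
watch_vip() {
    while :; do
        if curl -s -o /dev/null --max-time 2 "$1"; then
            state=up
        else
            state=down
        fi
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$state"
        sleep 1
    done
}

# longest_outage: read watch_vip output on stdin and print the longest run
# of consecutive "down" samples.
longest_outage() {
    awk '$2 == "down" { run++; if (run > max) max = run; next }
         { run = 0 }
         END { print max + 0 }'
}

# Example:
#   watch_vip http://192.168.140.155/ > samples.txt   (stop with Ctrl-C)
#   longest_outage < samples.txt
```

Because each sample takes about one second plus the request time, the reported run length is only an approximation of the outage in seconds.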