Well-distributed Linux Clusters, Virtual Guests and Nodes
Article first appeared in c’t 2007, issue 26 (published by Heise)
Authored by Mirko Dölle
The current server distributions from Novell® and Red Hat* include all of the programs and tools needed to combine Linux* servers into a highly available cluster. With Xen*, you can not only create virtual servers but also form clusters of virtual machines to further increase availability.

Companies demand virtualization and protection against failure from today’s computer systems. Virtualization is often understood only as the potential to replace a physical server farm with a few modern, powerful machines on which the various server systems run fully virtualized or paravirtualized. Protection against failure, in turn, is the job of a cluster: several servers operating in an active-passive configuration keep a service running even if an individual machine or an infrastructure component fails.
The combination of the two technologies also makes multi-layer clusters possible, which can distribute both individual services and a cluster of virtual machines redundantly across several computers. This means, for instance, that the individual physical machines can be better utilized during normal operation. Of course, virtualized and distributed clusters are not needed everywhere. In practice, the required degree of availability must be weighed against the costs of such a solution.

Selecting the right operating system is also a matter of costs. For example, Debian* Linux is free and is perfectly capable of managing a cluster together with virtual guests; however, it lacks the certifications from various software manufacturers that medium-sized and large companies in particular require to obtain support for databases or special programs. CentOS, the free clone of Red Hat Enterprise Linux, can likewise only be used where certifications are not required. Therefore, the only choice that usually remains in the professional arena is between Red Hat Enterprise Linux (RHEL) and SUSE® Linux Enterprise Server. Red Hat and Novell also offer security updates and bug fixes for their server distributions for five years and beyond, whereas the free distributions receive no more updates after just a few years.

With the support of the administrators and the IT quality management department at Munich Airport, we went through various cluster and virtualization scenarios with the newly released RHEL 5.1 from Red Hat and the current SUSE Linux Enterprise Server 10 with Service Pack 1 from Novell, testing them as they are used in businesses.
Complex Infrastructure
The starting point for a cluster is two or more servers (cluster nodes), which are connected with one another in such a way that the second node automatically takes over if the first one fails. However, the second node must then be able to determine with certainty whether the first is still running or whether it has failed. This is where heartbeats come in: Each node sends data packets to all the other nodes at regular intervals. In practice, a separate network is used for heartbeats, its only intended purposes being to transmit heartbeats and to permit communication between the cluster nodes. If a server does not send any heartbeats over a certain period of time, the remaining cluster nodes assume that it has failed and take over the tasks of the failed computer.
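The timeout rule described above can be illustrated with a minimal Python sketch. It is not how the cluster software of either distribution is implemented; the class name, the node names and the ten-second timeout are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when each peer node was last heard from (illustrative only)."""
    timeout: float  # seconds of silence after which a node counts as failed (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat packet from `node` arrives.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        # A node is suspected of failure once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=10.0)
monitor.record("node1", now=0.0)
monitor.record("node2", now=0.0)
monitor.record("node2", now=8.0)   # node2 keeps sending, node1 has gone silent
suspects = monitor.failed_nodes(now=12.0)
print(suspects)
```

Here only node1 ends up on the list of suspects, because node2 was last heard from four seconds before the check.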
However, the remaining nodes must ensure that the computer suspected of failure can no longer interfere with the cluster in any way. This is called “fencing,” whereby the cluster excludes the failed node. The most secure method is for a node that is still running to immediately switch off the computer that is no longer responding. This is usually achieved via remote power switches in the server cabinet, to which all of the cluster’s servers are connected. If the computers have redundant power supplies, several power switches are required to ensure that the server’s power supplies continue to be fed redundantly and yet can still be switched off reliably.
If the heartbeat is transmitted via only one network connection, there is only one heartbeat channel, and it becomes a weak spot. A simple cable break is enough to separate the cluster nodes from one another and to create a split-brain situation, in which both nodes think the neighboring node has failed. In the worst case, the nodes switch each other off via the power switches and the whole cluster fails. This is why at least two heartbeat channels are used in practice.

It is not just power switches and heartbeat channels that must be designed redundantly, but also the Internet connection, the emergency power supply, the air conditioning and the connection of the individual cluster nodes to the common data storage unit, which must itself be fail-safe or redundantly arranged.

If the cluster consists of just two nodes, e.g., because you merely want to operate a Web server in a fail-safe manner, it is recommended that you use one network connection solely as a heartbeat channel and connect it by network cable directly to the corresponding network connection of the second node. The second network connection can be used both for the Internet connection and as a second heartbeat channel. This dispenses with the need for a third network connection.

With SUSE Linux Enterprise Server 10, the individual heartbeat channels are set up using YaST; with Red Hat Enterprise Linux 5.1, Conga* takes care of this as part of the cluster setup. In this case, the browser front-end Luci* must be installed on a separate RHEL machine and may not run on one of the cluster nodes. However, Luci can manage several clusters at the same time.
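Why a second heartbeat channel guards against split brain can be shown with a short Python sketch. It assumes that a peer is treated as failed only when it has been silent on every channel; the function name, the channel names and the timeout are hypothetical and do not come from either distribution.

```python
def peer_has_failed(last_seen, now, timeout):
    """Treat the peer as failed only if it has been silent on every heartbeat channel."""
    if not last_seen:
        raise ValueError("at least one heartbeat channel is required")
    return all(now - seen > timeout for seen in last_seen.values())


# Hypothetical channel names; timestamps are in seconds.
cable_break = {"crossover": 0.0, "lan": 29.0}   # dedicated link cut, LAN still alive
node_down = {"crossover": 0.0, "lan": 1.0}      # silent on both channels

after_cable_break = peer_has_failed(cable_break, now=30.0, timeout=10.0)
after_node_down = peer_has_failed(node_down, now=30.0, timeout=10.0)
print(after_cable_break, after_node_down)
```

With a single channel, the cable break alone would have been indistinguishable from a node failure; with two channels, only the second case leads to fencing.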
The Web server provided by the cluster is referred to as the cluster service; it normally runs on just one of the nodes and is started on the other cluster node after the physical machine has failed. The cluster service itself consists of three different “cluster resources”: the Apache* daemon, the IP address of the Web server and the data storage unit with the HTML pages. For a simple Web server, it is generally sufficient to place the HTML files on a data storage unit that both nodes can reach. In practice, however, it is recommended that you use a cluster file system that all cluster nodes mount simultaneously. This means you can, for example, add an FTP server at any time that offers a download directory in addition to the Web server but runs on a different cluster node. However, the cluster resource IP address, just like the Apache daemon that uses it, may be assigned exclusively to one cluster node. Therefore, when setting up a cluster service, you must consider very carefully which cluster resources the service depends on, and whether these resources may be used by only one node or are generally available.
If, in the case of the highly available Web server, the node on which the Web server service has been running fails, the node that is still working properly first switches off the failed node. It then starts the cluster resources that had until then been used exclusively on the defective cluster node: in this case, it configures the IP address of the Web server and then starts the Apache daemon. Because the defective node has been switched off via the power switch, no second computer with the IP address of the Web server can be running.
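The order of operations matters here: fence first, then start the exclusive resources. The following Python sketch only illustrates that ordering. The `fence` and `start` callables are hypothetical stand-ins for a power-switch call and a resource start script, and the IP address is a documentation address, not one from the test setup.

```python
def take_over(failed_node, resources, fence, start):
    """Fence the failed node first, then start its exclusive resources in order."""
    log = []
    if not fence(failed_node):
        # Without confirmed fencing, starting the resources could duplicate the IP address.
        raise RuntimeError(f"could not fence {failed_node}; refusing to take over")
    log.append(f"fenced {failed_node}")
    for resource in resources:
        start(resource)
        log.append(f"started {resource}")
    return log


# Hypothetical stand-ins: fencing always succeeds, starting just records the name.
started = []
steps = take_over(
    "node1",
    ["ip 192.0.2.10", "apache"],
    fence=lambda node: True,
    start=started.append,
)
print(steps)
```

If fencing cannot be confirmed, the sketch refuses to start anything, which mirrors the requirement that two machines must never hold the Web server’s IP address at once.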
Cluster File Systems
Fencing is especially important if all nodes are to use the same data storage unit at the same time. A special cluster file system usually takes care of access coordination. Novell uses the Oracle* Cluster File System 2 (OCFS2) in SUSE Linux Enterprise Server 10 for this purpose; Red Hat uses the Global File System (GFS). One disadvantage of OCFS2 in SUSE Linux Enterprise Server 10 is that, by default, it does not prevent cluster nodes from simultaneously writing to the same file. In contrast, GFS in RHEL prevents simultaneous writing “out of the box” through integrated locking; where simultaneous writing is required, it has to be enabled via special file system attributes.

Whether the common data storage unit is a SCSI RAID, an iSCSI drive or a SAN plays only a secondary role for the cluster. It must merely be ensured that it is a redundant system and that both nodes can access it at the same time. The usual solutions in SMEs are external RAID systems with dual-channel SCSI, which have two separate SCSI connections for the two nodes. iSCSI systems are also very interesting; however, you must ensure a redundant connection to the storage network with this type of system. Storage Area Networks (SANs) are usually found only in larger companies. If the appropriate SAN host bus adapter is supported by the kernel, Linux treats the SAN drives like conventional SCSI hard drives and there are no peculiarities. A SAN is also the easiest way to set up redundant data paths.
Care must be taken when connecting iSCSI drives under SUSE Linux Enterprise Server 10. The Open-iSCSI package on the installation CD contains an error: in the test on HP* Proliant* servers, the computer froze when the graphical YaST module iSCSI Initiator was called up and when the command line program /sbin/fwparam-ibft was launched. Novell assumes an error in extracting the iBFT (iSCSI Boot Firmware Table), which has been fixed in the Open-iSCSI package now available via the online update. If possible, the configuration of the iSCSI drives should therefore be postponed until after the first update. Alternatively, you can use YaST with the Ncurses interface or carry out the installation directly in text mode. In the test, the iSCSI drives could then be used with the old Open-iSCSI package without any crashes.

XXL: High Availability

Major airports place huge demands on their computer systems in terms of availability: not one bag or flight can be checked in without their support. Central databases such as the flight check-in database have multiple safeguards and are operated on two independent clusters, one for live operation and one as backup. Each cluster comprises two Sun servers running under Solaris*. Experience has shown that Solaris drivers are often less complex and better tested than Linux ones because the hardware they support is less diverse; in other areas we also employ Linux alongside Solaris.

Each cluster has its own database and SAN. In the event of an error, cluster services are automatically switched to the nodes that are still available within the cluster. Switching between clusters must be carried out manually; this also takes place every week for functional testing.

Clusters and data storage units are often distributed across several buildings and connected via redundant, individually routed fiber-optic networks to ensure protection against natural phenomena and accidents. Servers and SAN components use up to four power supplies fed by various UPS units, which in turn are connected to several energy suppliers in order to prevent power failure. The network connections, switches and routers have the same redundant setup.

Oliver Kluge is the IT Quality Manager at Munich Airport and is responsible for technical quality assurance in the computer center.
Drawback of the Device ID
Many companies use servers of similar construction for all nodes of a cluster. This simplifies maintenance and the procurement of spare parts. It is not rare for one more server than is necessary to be purchased for the cluster, and for this extra server to be used as an exchange unit or a source of spare parts. In the event of a hardware failure, an administrator restores a backup of the failed node onto the replacement machine (bare metal recovery) and swaps the computers. The defective server is repaired where possible or replaced with a new, similarly constructed machine, which then serves as the reserve machine.

However, one feature of the SUSE Linux Enterprise Server installation makes restoring a complete backup onto a replacement machine very complicated. By default, YaST uses an ID generated from the serial number of the hard drive or the MAC address of the network card to identify drives, partitions and network configurations. This is useful if an additional hard disk or network card is integrated into an existing system and the device names change as a result: the system can unambiguously identify the root partition or a network interface at any time via the device ID. However, SUSE Linux Enterprise Server does not apply this concept consistently. In the GRUB bootloader configuration, we still found the usual device names such as /dev/sda5 as the root parameter of the kernel, so that if the device name changes, SUSE Linux Enterprise Server can no longer find its root file system and halts with a kernel panic.

However, the big catch comes when one of the ID-referenced components fails. Normally, when a mainboard is damaged, you only have to transplant the hard drives into a similarly constructed server to be able to restart the system. With SUSE Linux Enterprise Server 10, however, the network must be reconfigured because the MAC address, and therefore the ID, has changed. The init scripts no longer find the old network cards and have no configuration on hand for the new ones. The server therefore remains without a network connection and must be configured manually using keyboard, mouse and monitor or the respective KVM extender.
Even worse is a total server failure. If you replace the entire device with a replacement system via a restore operation, as described above, SUSE Linux Enterprise Server still boots the initrd system thanks to the device name used in the GRUB configuration, but it hangs while mounting the partitions because, at the very least, the serial number of the hard drive has changed, even if the partitioning is still the same. The administrator must then first manually adjust the file /etc/fstab in a minimal system and, after a reboot, deal with the new network configuration.
Therefore, it is wise to assign a label to each partition during the installation and to instruct YaST in the partitioning dialog to use it. As long as there are no duplicate labels, Linux finds the associated partition and mounts it correctly, regardless of the hard drive it happens to be on at the time. However, if a backup is restored onto a reserve server, you must take care to assign the corresponding labels to the partitions again.

It is also generally recommended that you select the proven Ext3 file system in the partitioning dialog during installation instead of the ReiserFS suggested by YaST. Ext3 is supported “out of the box” by practically every Linux distribution, while Red Hat Enterprise Linux, for example, has no ReiserFS support.
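On a system that was already installed with ID-based device names, the switch to labels recommended above amounts to rewriting the first column of /etc/fstab. The following Python sketch shows the idea on a string rather than on the real file. The by-id device names and the labels are invented for illustration, and the partitions themselves would still have to be labeled separately.

```python
def use_labels(fstab_text, labels):
    """Replace device paths in the first fstab column with LABEL= references."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines and devices without a known label untouched.
        if fields and not line.lstrip().startswith("#") and fields[0] in labels:
            fields[0] = "LABEL=" + labels[fields[0]]
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out)


# Hypothetical by-id device names and labels, for illustration only.
fstab = (
    "/dev/disk/by-id/scsi-EXAMPLE-part5 / ext3 defaults 1 1\n"
    "/dev/disk/by-id/scsi-EXAMPLE-part6 /home ext3 defaults 1 2\n"
    "proc /proc proc defaults 0 0"
)
labels = {
    "/dev/disk/by-id/scsi-EXAMPLE-part5": "root",
    "/dev/disk/by-id/scsi-EXAMPLE-part6": "home",
}
converted = use_labels(fstab, labels)
print(converted)
```

Because the resulting entries no longer contain a serial number, the same file keeps working after the disks or the whole server have been exchanged, provided the labels are recreated.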
Virtualization
Both Enterprise distributions include the virtualization solution Xen as standard. It is therefore possible, for example, to place the operating systems of several dedicated servers on a single, suitably equipped computer. While Xen is part of the standard installation under RHEL 5.1, you must first install it via the YaST module “Install Hypervisor and Tools” when working with SUSE Linux Enterprise Server. After a restart, the hypervisor must be selected explicitly in the boot menu; otherwise, SUSE Linux Enterprise Server starts a non-virtualized system. This can be remedied by changing the default boot entry, either using YaST or manually in the bootloader configuration file /boot/grub/menu.lst.
For the administration of Xen and the setup of virtual guest systems, both Enterprise distributions use the Virtual Machine Manager (virt-manager). Originally developed by Red Hat, it creates a new virtual machine with just a few mouse clicks. It is best to begin with a fresh installation of the desired operating system so that there are no problems with the hardware emulated by Xen.
With fully virtualized systems, i.e., all Windows* guests and those Linux distributions for which there is no paravirtualized Xen kernel, the installation requires a bootable CD or DVD or an image of one on the hard drive. If the installation set comprises more than one CD, keep in mind that virt-manager does not yet support exchanging CD images during the installation process. In this case, you should use a physical drive as the installation source. You must also bear in mind that virt-manager provides the installation source only for the installation procedure itself and deletes it from the Xen configuration again before the finished guest system starts.
However, Novell and Red Hat have chosen different times for the deletion of the installation source. The virt-manager from RHEL deletes the installation source as soon as the guest system shuts down the virtual machine for a restart. That is fatal when installing Windows 2003 Server, because Windows initially only sets up the virtual hard drive and copies some files for the graphical installation. The actual installation does not begin until the system boots from the hard drive; however, there is no longer a CD drive at that point.

If it is only a question of completing this installation, you can simply delete the configuration file of the virtual guest using the virt-manager function for deleting a virtual machine. Because the virtual hard drive is managed separately and is not deleted, you then create a new virtual machine with the same settings as the VM you just deleted, including the existing virtual hard drive. Once again, the CD or an image can serve as the installation source. As the Windows CD bootloader recognizes the graphical installation system on the hard drive and starts it, you can now complete the Windows installation, after which virt-manager removes the installation drive once again.
In SUSE Linux Enterprise Server, on the other hand, the virt-manager does not remove the installation source until the Windows installation is complete. However, just as with RHEL, the Windows guest no longer has a CD drive once the installation is complete. With SUSE Linux Enterprise Server, the easiest way to give the guest access to the host system’s CD drive is once again to delete the virtual machine in the virt-manager first. Although the virt-manager does offer a way to adjust the VM configuration afterwards, the changes were not saved during the test. When recreating the VM, specify that you already have a system installed. You then add the virtual hard disk drive and the host’s physical CD drive to the new VM, using the values from the installation VM for the rest of the settings.
With the virt-manager from RHEL, it is not possible to add a CD drive afterwards or to give a new VM a CD-ROM drive that is not used as an installation source. The only way is to change the virtual machine’s configuration file manually; this file is found in the directory /etc/xen. Below is an example of a drive configuration with a physical hard disk drive and physical access to the CD drive:

disk = [ 'phy:/dev/sda6,hda,w',
         'phy:/dev/cdrom,hdc:cdrom,r', ]
A paravirtualized Linux guest can only be installed if the installation medium has a suitable kernel. While a paravirtualized SUSE Linux Enterprise Server 10 can be installed on a SUSE Linux Enterprise Server 10 host from a corresponding DVD image file with no problem, and the procedure scarcely differs from the installation of a fully virtualized guest, RHEL 5.1 must be installed on an RHEL host from a network source. The reason for this is that, while RHEL does provide a Xen kernel for installation, the installation system does not recognize the virtual CD drive. An FTP, HTTP or NFS server that provides the directory structure of the installation DVD is required. The URL that leads to the DVD’s base directory is specified as the installation source in the virt-manager; for example: http://192.168.1.100/pub/rhel or nfs:192.168.1.100:/pub/rhel.

Once the guest system is completely installed and configured, there is the option of using it as a basis for other guests. That way you can, for example, perform generic operating system installations, which are later merely cloned and then adapted individually. However, to do this you must resort to the command line in both RHEL and SUSE Linux Enterprise Server; the virt-manager does not offer such a function. The clone requires a copy of the virtual data carrier, which is usually stored in the directory /var/lib/xen/images. The Xen configuration file from /etc/xen is copied and given a new name. In addition, a new UUID, which the command uuidgen delivers, must be entered in the copied configuration file, and the name of the virtual hard drive must be adjusted there.
With SUSE Linux Enterprise Server, there are no fewer than three different configuration files per virtual machine. In this case, it is easier just to copy the virtual data carrier and then to use the virt-manager to create a new virtual machine that uses the copied data carrier.
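The manual cloning steps for a single Xen configuration file, as described for RHEL, can be sketched in Python. The guest names, the image path and the layout of the sample configuration are assumptions made for illustration; a real configuration file may contain further entries that need adjusting.

```python
import re
import uuid


def clone_config(config_text, old_name, new_name):
    """Return a copy of a Xen guest configuration adapted for a clone."""
    # Give the clone its own UUID (the job uuidgen does on the command line).
    text = re.sub(r'(?m)^uuid\s*=.*$', f'uuid = "{uuid.uuid4()}"', config_text)
    # Rename the guest and point it at the copied disk image.
    return text.replace(old_name, new_name)


# Hypothetical configuration of an existing guest called "web1".
original = (
    'name = "web1"\n'
    'uuid = "00000000-0000-0000-0000-000000000001"\n'
    "disk = [ 'file:/var/lib/xen/images/web1.img,xvda,w' ]\n"
)
clone = clone_config(original, "web1", "web2")
print(clone)
```

The sketch only rewrites the text. Copying the image file itself (here web1.img to web2.img) and saving the result under a new name in /etc/xen remain separate steps.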
VMs as Cluster Resources
A VM does not become fail-safe simply by running on a cluster node. With no additional configuration, the virtual machine is not a cluster service and is therefore neither monitored by the cluster nor restarted on another node in the cluster if the host fails. However, you can use the distribution tools to configure the virtual machine as a cluster resource and then let it run on a node as an exclusive service. The advantage of this is that, if the physical computer fails, the virtual machine is automatically restarted on another cluster node.

If, for example, maintenance work is to be carried out on a computer, Xen also allows a running virtual system to be moved to another node from the command line. However, in the test, live migration did not work the first time with either RHEL or SUSE Linux Enterprise Server. Xend was not preconfigured accordingly, and the front-ends could not compensate for this.

If virtual machines are to be included as services in the cluster, the Xen configuration files and the virtual hard drives must first be moved to the cluster’s common data storage unit. With both Enterprise distributions, the directories /etc/xen and /var/lib/xen are affected; with SUSE Linux Enterprise Server, the directory /var/lib/xend is affected as well. On the individual cluster nodes, these directories are replaced by symbolic links to the corresponding directories on the common data storage medium. It is also possible to use only individual virtual machines as cluster services. In that case, the named directories must not be moved to the cluster data storage unit and replaced by symbolic links; only the individual configuration files in the directories and the virtual drives are moved and replaced in this way.
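The move-and-link step can be sketched as follows. To stay harmless, the example works inside a temporary directory; the directory layout under it merely imitates a node’s /etc/xen and a shared mount point and is an assumption, as is the helper name `relocate`.

```python
import os
import shutil
import tempfile


def relocate(local_path, shared_path):
    """Move a directory or file onto shared storage and leave a symbolic link behind."""
    shutil.move(local_path, shared_path)
    os.symlink(shared_path, local_path)


# Sandbox that imitates a node-local /etc/xen and a shared cluster file system.
with tempfile.TemporaryDirectory() as sandbox:
    local = os.path.join(sandbox, "node", "etc", "xen")
    shared = os.path.join(sandbox, "shared", "etc", "xen")
    os.makedirs(local)
    os.makedirs(os.path.dirname(shared))
    with open(os.path.join(local, "vm1"), "w") as f:
        f.write('name = "vm1"\n')

    relocate(local, shared)
    is_link = os.path.islink(local)
    visible = os.listdir(local)   # follows the link to the shared copy

print(is_link, visible)
```

Only the first node would move the data in this way; on the remaining nodes, the local directories would simply be replaced by links to the copy that already exists on the shared storage.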
The cluster treats virtual machines like all other services: they possess resources, in this case the Xen configuration file and the common data storage medium, and the service may be started on only one of the nodes. Accordingly, the setup in the cluster front-end and in the XML cluster configuration files hardly differs from that of a Web server. To be able to select, with SUSE Linux Enterprise Server, the node on which the virtual machine should run by default, a dependency on the host name must be inserted in the XML cluster configuration file for the virtual machine. Here is an example of a dependency on the host name “sles-p1”:
<rsc_location id="vm1_location" rsc="vm1">
  <rule id="pref_vm1_location" score="INFINITY">
    <expression attribute="#uname" operation="eq" value="sles-p1"/>
  </rule>
</rsc_location>
The value of the attribute “#uname” can then be adjusted using the hb_gui and thus the virtual machine can be restarted on another node. In this regard, Red Hat’s front-end is much easier to operate. To reposition a service, you simply select it in the services overview and indicate the desired node. Conga takes care of the rest in the background.
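For SUSE Linux Enterprise Server, a location constraint of the kind shown above could also be generated rather than typed by hand, for instance when several virtual machines need one each. The following Python sketch builds the same structure with the standard library; the function name is hypothetical, and how the fragment is then loaded into the cluster configuration is outside its scope.

```python
import xml.etree.ElementTree as ET


def location_constraint(resource, node):
    """Build a rsc_location element that makes `resource` prefer `node`."""
    loc = ET.Element("rsc_location", id=f"{resource}_location", rsc=resource)
    rule = ET.SubElement(loc, "rule", id=f"pref_{resource}_location", score="INFINITY")
    ET.SubElement(rule, "expression", attribute="#uname", operation="eq", value=node)
    return loc


xml_text = ET.tostring(location_constraint("vm1", "sles-p1"), encoding="unicode")
print(xml_text)
```

Calling the function with a different node name yields the constraint for moving the virtual machine’s preferred location to that node.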
Virtual Clusters
With relatively little effort, Red Hat Enterprise Linux allows you to make a Web server running in a virtual machine on the cluster even more fail-safe. You clone the virtual machine and set up the clone as another cluster service. After cloning, each of the two VMs should first receive a new IP address that differs from that of the other. The result is two virtual machines with almost identical configurations, which the cluster operates as fail-safe services and which should ideally run on different cluster nodes. Then you use Conga to combine both VMs in a separate cluster, following the procedure described above to set up the Web server as a cluster service with its IP address and a common data storage medium with a cluster file system to which both virtual machines have access. Such a configuration is also possible with SUSE Linux Enterprise Server; however, it is a lot of hard work without a convenient front-end.
The advantage of a cluster like this, which consists only of virtual nodes that are themselves cluster services of a physical cluster, is the double redundancy. If the first virtual machine crashes due to a software or storage problem, the virtual cluster recognizes the failure of a virtual node. The remaining virtual node fences the failed node, switching it off to ensure that under no circumstances will two Web servers be running at the same time. It then restarts the Web server service. At the same time, the physical cluster recognizes that a cluster service in the form of a virtual machine has crashed and, depending on the configuration, restarts the VM on the same physical node or on another one. Once the virtual guest has started, the virtual cluster sees the previously failed virtual node responding again, as if after a repair. Consequently, the cluster is redundant again.
If the physical node hosting the first virtual machine fails, the second virtual machine recognizes that its neighbor has failed and starts Apache. The physical cluster, for its part, also recognizes the failure of a physical node and of the VM as a cluster service, and restarts the VM on the remaining physical node. After the VM has started, it rejoins the virtual cluster as a virtual cluster node, so there continues to be a virtual cluster that is protected against software errors. Until the physical node is repaired, the cluster has merely lost its hardware redundancy.
472-001002-001 | 08/08 | © 2008 Novell, Inc. All rights reserved. Novell, the Novell logo, the N logo and SUSE are registered trademarks of Novell, Inc. in the United States and other countries.
*Linux is a registered trademark of Linus Torvalds. All other third-party trademarks are the property of their respective owners.