A software image is a complete Linux file system that is to be installed on a non-head node. Chapter 9 describes images and their management in detail.
The head node holds the head copy of the software images. Whenever files in the head copy are changed using CMDaemon, the changes automatically propagate to all provisioning nodes via the updateprovisioners command (section 6.2.4).
6.3.1 Booting To A “Good State” Software Image
When nodes boot from the network in simple clusters, the head node supplies them with a known good state during node start up. The known good state is maintained by the administrator and is defined using a software image that is kept in a directory of the filesystem on the head node. Supplementary filesystems such as /home are served via NFS from the head node by default.
For a diskless node the known good state is copied over from the head node, after which the node becomes available to cluster users.
For a disked node, by default, the hard disk contents on specified local directories of the node are checked against the known good state on the head node. Content that differs on the node is changed to that of the known good state. After the changes are done, the node becomes available to cluster users.
Each software image contains a Linux kernel and a ramdisk. These are the first parts of the image that are loaded onto a node during early boot. The kernel is loaded first. The ramdisk is loaded next, and contains driver modules for the node’s network card and local storage. The rest of the image is loaded after that, during the node-installer stage (section 6.4).
6.3.2 Selecting Kernel Driver Modules To Load Onto Nodes
Kernel Driver Modules With cmsh
In cmsh, the modules that are to go on the ramdisk can be placed using the kernelmodules submode of the softwareimage mode. The order in which they are listed is the attempted load order.
Whenever a change is made via the kernelmodules submode to the kernel module selection of a software image, CMDaemon automatically runs the createramdisk command. The createramdisk command regenerates the ramdisk inside the initrd image and sends the updated image to all provisioning nodes, to the image directory, set by default to /cm/images/default-image/boot/. The original initrd image is saved as a file with suffix “.orig” in that directory. An attempt is made to generate the image for all software images that CMDaemon is aware of, regardless of category assignment, unless the image is protected from modification by CMDaemon with a FrozenFile directive (Appendix C).
The createramdisk command can also be run manually from cmsh at any time by the administrator when in softwareimage mode, which is useful if a kernel or modules build is done without using CMDaemon.
Kernel Driver Modules With cmgui
In cmgui the selection of kernel modules is done by selecting the Software Images resource, and then choosing the “Kernel Config” tabbed pane (figure 6.4).
Figure 6.4: cmgui: Selecting Kernel Modules For Software Images
The order of module loading can be rearranged by selecting a module and clicking on the arrow keys. Clicking on the “Recreate Initrd” button runs the createramdisk command.
Implementation Of Kernel Driver Via Ramdisk Or Kernel Parameter
An example of regenerating the ramdisk is seen in section 6.8.5.
may be more convenient. How to do that is covered in section 9.3.4.
6.3.3 InfiniBand Provisioning
On clusters that have InfiniBand hardware, it is normally used for data transfer as a service after the nodes have fully booted up (section 4.3). It can also be used for PXE booting (section 6.1.3) and for node provisioning (described here), but these are not normally a requirement. This section (about InfiniBand node provisioning) may therefore safely be skipped in almost all cases.
During node startup on a setup for which InfiniBand networking has been enabled, the init process runs the rdma script. For SLES and distributions based on versions prior to Red Hat 6, the openib script is used instead of the rdma script. The script loads up InfiniBand modules into the kernel. When the cluster is finally fully up and running, the use of InfiniBand is thus available for all processes that request it. Enabling InfiniBand is normally set by configuring the InfiniBand network when installing the head node, during the Additional Network Configuration screen (figure 2.10).
Provisioning nodes over InfiniBand is not implemented by default, because the init process, which handles initialization scripts and daemons, takes place only after the node-provisioning stage launches. InfiniBand modules are therefore not available for use during provisioning, which is why, for default kernels, provisioning in Bright Cluster Manager is done via Ethernet.
Provisioning at the faster InfiniBand speeds rather than Ethernet speeds is however a requirement for some clusters. Getting the cluster to provision using InfiniBand requires both of the following configuration changes to be carried out:
1. configuring InfiniBand drivers for the ramdisk image that the nodes first boot into, so that provisioning via InfiniBand is possible during this pre-init stage
2. defining the provisioning interface of nodes that are to be provisioned with InfiniBand. It is assumed that InfiniBand networking is already configured, as described in section 4.3.
The administrator should be aware that the interface from which a node boots (conveniently labeled BOOTIF) must not be an interface that is already configured for that node in CMDaemon. For example, if BOOTIF is the device ib0, then ib0 must not already be configured in CMDaemon. Either BOOTIF or the ib0 configuration should be changed so that node installation can succeed.
How these two changes are carried out is described next:
InfiniBand Provisioning: Ramdisk Image Configuration
An easy way to see which modules must be added to the ramdisk for a particular HCA is to run rdma (or openibd), and to see which modules load up on a fully booted system.
One way to do this is to run the following three lines as root:
modlist(){ cut -f1 -d" " /proc/modules; }
IB=/etc/init.d/rdma
The first line sets up a function modlist that lists the modules in use by the system at any instant. The list is obtained by using the cut operation to extract only the first column of /proc/modules.
In the second line, the InfiniBand init script is set to rdma. The rdma setting should be replaced by openibd when using SLES, or distributions based on versions of Red Hat prior to version 6.
In the third line, the diff command then finds the difference between modlist output when starting up or stopping InfiniBand, using a bash redirection technique called process substitution.
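The three lines described above can be tried out as a small bash sketch. It is an illustration under assumptions rather than the exact commands: the function name ibmoddiff and the MODULES_FILE override are hypothetical additions, and the two module lists are captured one after the other before diff compares them, because bash starts the commands inside process substitutions concurrently and the stop should finish before the start begins.

```shell
#!/bin/bash
# List the names of the loaded kernel modules (first column of /proc/modules).
# MODULES_FILE is a hypothetical override so the sketch can be tried on sample data.
modlist() { cut -f1 -d" " "${MODULES_FILE:-/proc/modules}"; }

# InfiniBand init script. Use /etc/init.d/openibd on SLES, or on
# distributions based on versions of Red Hat prior to version 6.
IB=${IB:-/etc/init.d/rdma}

# ibmoddiff (hypothetical name): show what starting InfiniBand adds.
ibmoddiff() {
    local stopped started
    # Capture the two states sequentially: stop first, then start.
    stopped=$("$IB" stop; modlist)
    started=$("$IB" start; modlist)
    # Process substitution feeds both captured lists to diff as if they were files.
    diff <(printf '%s\n' "$stopped") <(printf '%s\n' "$started")
}
```

Running ibmoddiff as root on a fully booted node prints the lines that differ between the stopped and started states. Since it stops and restarts InfiniBand, it is best run when nothing depends on the interconnect.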
For rdma, the output may display something like:
Example
1c1,13
< Unloading OpenIB kernel modules: [ OK ]
---
> Loading OpenIB kernel modules: [ OK ]
> ib_ipoib
> rdma_ucm
> ib_ucm
> ib_uverbs
> ib_umad
> rdma_cm
> ib_cm
> iw_cm
> ib_addr
> ib_sa
> ib_mad
> ib_core
As suggested by the output, the modules ib_ipoib, rdma_ucm and so on are the modules loaded when rdma starts, and are therefore the modules that are needed for this particular HCA. Other HCAs may cause different modules to be loaded.
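The module names can be picked out of such diff output mechanically. The following is a minimal sketch, assuming output in the format of the preceding example; the helper name ib_modules_from_diff is hypothetical.

```shell
#!/bin/bash
# ib_modules_from_diff (hypothetical helper): reads diff output on stdin
# and prints the module names that appear only after InfiniBand starts.
ib_modules_from_diff() {
    # Keep only the lines from the "start" side ("> name") and strip the marker,
    # then drop status messages, which contain spaces (module names do not).
    sed -n 's/^> //p' | grep -v ' '
}
```

Piping the diff output into ib_modules_from_diff yields one module name per line, which can then be compared with the modules listed in the kernelmodules submode (section 6.3.2).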
The modules should then be part of the initrd image in order to allow InfiniBand to be used during the node provisioning stage.
The initrd image for the nodes is created by adding the required InfiniBand kernel modules to it. How to load kernel modules into a ramdisk is covered more generally in section 6.3.2. A typical Mellanox HCA may have it created as follows (some text elided in the following example):
Example
[root@bright52 ~]# cmsh
[bright52]% softwareimage use default-image
[bright52->softwareimage[default-image]]% kernelmodules
[b...image[default-image]->kernelmodules]% add mlx4_ib
[b...image*[default-image*]->kernelmodules*[mlx4_ib*]]% add ib_ipoib
[b...image*[default-image*]->kernelmodules*[ib_ipoib*]]% add ib_umad
[b...image*[default-image*]->kernelmodules*[ib_umad*]]% commit
[bright52->softwareimage[default-image]->kernelmodules[ib_umad]]%
Tue May 24 03:45:35 2011 bright52: Initial ramdisk for image default-im\
age was regenerated successfully.
InfiniBand Provisioning: Network Configuration
It is assumed that the networking configuration for the final system for InfiniBand is configured following the general guidelines of section 4.3. If it is not, that should be checked first to see if all is well with the InfiniBand network.
The provisioning aspect is set by defining the provisioning interface. An example of how it may be set up for 150 nodes with a working InfiniBand interface ib0 in cmsh is:
Example
[root@bright52 ~]# cmsh
[bright52]% device
[bright52->device]% foreach -n node001..node150 (set provisioninginter\
face ib0)
[bright52->device*]% commit