ClusterWorX®: A Framework to Manage Large Clusters Effectively

Dr. Thomas M. Warschko

Linux NetworX Inc., Sandy, Utah USA (http://www.lnxi.com) E-mail: [email protected]

Abstract

Linux clusters are going to be the high-performance compute engine of choice for research labs as well as industry. Clusters are now well known for their flexibility, reliability, scalability and price/performance ratio compared to traditional supercomputers, and Linux seems to be the operating system of choice to drive these clusters effectively.

As cluster systems scale to thousands of processors, management becomes exponentially more complex and can be a daunting challenge for any organization. To ease this effort, Linux NetworX has developed ClusterWorX®, which integrates all aspects of cluster management and administration within a simple and user-friendly solution.

Keywords: Cluster Management, LinuxBIOS, ICE Box™, ClusterWorX®, High-Speed Interconnects, High-Performance Cluster Computing.

1 Introduction

Paying for expensive and proprietary software and hardware only to end up tied to an inflexible platform is a trend of the past. Today’s rapidly changing IT industry is shifting towards open source software platforms using commercial off-the-shelf hardware components. By using Linux as the operating system and hardware based on standard x86 architectures, Linux clustering is the culmination of both of these concepts. It leverages the power of the open source community’s prize, while harnessing the power of low-cost components to deliver a solution that is powerful, scalable, flexible and very reliable.

Despite the cost savings, questions remain about the manageability of Linux clusters. A common myth is that Ph.D.-level knowledge is required to adopt the technology. At one time this was true. The earliest adopters of Linux clusters were in fact universities and national laboratories, generally because they possessed the knowledge base and resources to take on the ’challenge’ of setting up and maintaining a cluster system. Today, however, vendors provide services such as integration, installation, system optimization and training. New cluster management tools also help empower administrators over these complex systems. As a result, the barriers to adopting the technology have been significantly lowered.

Administrators have several issues and concerns with managing and maintaining a Linux cluster. Cluster administrators need to know not only where the nodes are, but also who they are with, what they are doing, how hard they are working, and even the locations of the network bottlenecks. They need to see all, know all, and be able to take action on the system remotely. The challenge for administrators is finding the best available tools to help them do their job as painlessly as possible. Cluster administrators need empowering tools that help them become essentially omniscient and omnipotent over their systems. Items to consider include cluster efficiency, hardware failures, software upgrades, remote access, cloning and storage management, and system consistency, all integrated within a single tool to make an administrator’s life easy. In fact, this was the motivation when designing the components of ClusterWorX.

This article is organized as follows: section 2 discusses the LinuxBIOS project and its features in detail, section 3 focuses on the integration of the ICE Box hardware within ClusterWorX, section 4 explains the cloning and image management capabilities, and section 5 covers the event handling and management of ClusterWorX. Further software developments are outlined in section 6, and a conclusion is drawn in section 7.

2 LinuxBIOS

The primary motivation behind LinuxBIOS [6] is the desire to have the operating system gain control of a cluster node from power-on. It aims to replace the normal BIOS found on PCs with a Linux kernel that can boot Linux from a cold start. LinuxBIOS is primarily Linux with a few changes to the current Linux kernel. It initializes the hardware, activates serial console output, checks for valid memory, and starts loading the operating system. It does all this in about 3 seconds, whereas most commercial BIOS alternatives require about 30 to 60 seconds to boot.

Current PCs used as cluster nodes depend on a vendor-supplied BIOS for booting. The BIOS in turn relies on inherently unreliable devices such as video cards, floppy disks, CD-ROMs and hard drives to boot the operating system. In addition, current BIOS software is unable to accommodate non-standard hardware, making it difficult to support experimental work. The BIOS is slow, often redundant, and, most importantly in a cluster environment, difficult to maintain. Imagine walking around with a keyboard and monitor to every one of the 1000 nodes in a large cluster to change one BIOS setting.

Using a real operating system to boot another operating system provides much greater flexibility than using a simple netboot program or the BIOS. Because Linux is the boot mechanism, it can boot over standard Ethernet or over other interconnects such as Myrinet [9], Quadrics [10], or SCI [11]. It can use SSH connections to load the kernel, or it can use the InterMezzo caching file system or traditional NFS. Cluster nodes can be as simple as they need to be, perhaps as simple as a CPU and memory with no disk, no floppy, no graphics adapter, and no file system. The nodes will be much less autonomous, making them easier to maintain.

With a terminal server such as the ICE Box (see section 3), an administrator is able to trace the boot process from the very beginning and access the nodes using the serial console. LinuxBIOS reports all detected errors and hardware failures using the serial console. The output is captured and logged through the ICE Box to allow even post-mortem troubleshooting of nodes. After initializing the hardware, LinuxBIOS is able to boot from the network or local hard disk. Booting options (see section 4) can be easily changed using ClusterWorX or network configuration options such as DHCP.

Additional tools are provided to change BIOS settings or to flash new LinuxBIOS releases on demand. Because LinuxBIOS can be accessed and configured from within the Linux operating system, changes can be made remotely to a single node or to all nodes in a cluster system. These changes become active as soon as the nodes are rebooted.

3 ICE Box™

The ICE Box provides three essential cluster management capabilities: serial console, remote power management, and remote monitoring, accessible through a variety of protocols (NIMP, SIMP, Serial, Telnet, SSH, SNMP, ClusterWorX). All the hardware of an ICE Box is controlled, and all services are provided, by an embedded computer running Linux. For detailed information on the ICE Box see [4, 3].

3.1 Remote Power Management

Controlling the power to the nodes and other devices is a basic cluster management task. However, this feature is one that is most often overlooked in cluster system design. A remotely managed power solution is superior to one that requires an on-site user. Each ICE Box provides power to 10 compute nodes and two auxiliary devices. Two 15 A power inlets each provide power to five nodes and one auxiliary device. Whereas the node outlets can be power-cycled on demand, the auxiliary outlets are powered on and stay on as long as the ICE Box is receiving power. This ensures that host nodes, switches and other devices are not powered off by mistake. During the power-up procedure, the ICE Box also automatically sequences power, reducing the risk of power spikes.
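The power sequencing described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the outlet numbering, the mapping of outlets to inlets, the half-second delay, and the names `PowerStep` and `power_up_schedule` are all hypothetical and are not taken from the ICE Box firmware or its documentation.

```python
from dataclasses import dataclass
from typing import List

NODE_OUTLETS = 10     # from the text: ten switchable node outlets per ICE Box
NODES_PER_INLET = 5   # from the text: each 15 A inlet feeds five nodes


@dataclass(frozen=True)
class PowerStep:
    at_seconds: float  # offset from the start of the power-up procedure
    outlet: int        # node outlet number 1..10 (numbering is an assumption)
    inlet: int         # inlet feeding the outlet, 0 or 1 (mapping is an assumption)


def power_up_schedule(delay_s: float = 0.5) -> List[PowerStep]:
    """Return a staggered plan that switches on one outlet every delay_s seconds."""
    steps = []
    for i in range(NODE_OUTLETS):
        steps.append(PowerStep(at_seconds=i * delay_s,
                               outlet=i + 1,
                               inlet=i // NODES_PER_INLET))
    return steps
```

Staggering the switch-on times keeps the inrush currents of the individual power supplies from adding up at the same instant, which is the effect the text attributes to sequencing.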

3.2 Remote Monitoring

The ICE Box hardware contains power and temperature probes and a reset switch inside each node. The reset switch allows the user to remotely reset any standard motherboard, avoiding a full power-down. The power probe is used to detect failing power supplies, and the temperature probes are used in combination with the event handling capabilities of ClusterWorX (see section 5) to prevent overheating of the system.

3.3 Serial Terminal Access

Serial terminal access, also known as console port management or serial console, is generally used for managing remote systems in data centers. Although the technology is not new, terminal servers have not been widely used for clusters because of the low scalability and legacy design of traditional console access and terminal servers. The ICE Box overcomes this challenge by offering high scalability and high port density, making it well suited for cluster management.

Serial networks provide remote access to a machine by opening a UNIX console through the serial (COM) port on a machine. However, this type of access usually has two inherent problems: it requires a user to plug in a cable, and it is not scalable. To solve these problems, terminal servers are used to access many serial devices from a centralized location. Besides providing serial access to each connected device, the ICE Box also provides logging and buffering (up to 16k) of the output on each serial device. This capability allows even post-mortem analysis of what has happened to a specific node.
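The per-port buffering can be sketched as a bounded buffer that keeps only the most recent output. This is a minimal illustration, not the ICE Box implementation: the class name `ConsoleBuffer` is hypothetical, and reading "16k" as 16 × 1024 bytes is an assumption.

```python
from collections import deque


class ConsoleBuffer:
    """Keep only the most recent `capacity` bytes written by a serial console."""

    def __init__(self, capacity: int = 16 * 1024):  # "16k" read as 16 KiB (assumption)
        self._bytes = deque(maxlen=capacity)

    def write(self, data: bytes) -> None:
        # Once the deque is full, each appended byte pushes the oldest one out.
        self._bytes.extend(data)

    def dump(self) -> bytes:
        """Return the buffered output, oldest byte first."""
        return bytes(self._bytes)
```

Because old output is discarded automatically, the last messages a node printed before it failed remain available for inspection even if nobody was watching the console at the time.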

3.4 Accessing the ICE Box

The ICE Box itself provides serial as well as network (Ethernet) access. There are native command protocols which can be used with ClusterWorX or other software to control the ICE Box remotely. The serial ICE management protocol (SIMP) operates over the serial connection of an ICE Box, and the network ICE management protocol (NIMP) uses its onboard Ethernet. Furthermore, the ICE Box provides access via telnet and ssh (v1 & v2), and native IP filtering can be used for higher security. Telnet and ssh connections can be established either with the ICE Box or with each individual device connected to the ICE Box using specific port numbers. Last but not least, the ICE Box is SNMP compliant, so ICE Boxes can be controlled through standard SNMP management software.

4 Image Support and Cloning

Disk image consistency is accomplished using a technique called disk cloning, a process of quickly copying a system image from the ClusterWorX management host to individual nodes within the cluster. Disk cloning allows the administrator to load or update the operating system on single nodes, or on the entire cluster at one time, using reliable multicast technology. Using a multicast mechanism, even a single Fast Ethernet network is sufficient to clone several hundred nodes simultaneously². On startup, all participating nodes listen to the multicast stream, buffering the received data locally. Once the multicast stream has been sent, individual nodes acknowledge the reception of the new image in a round-robin fashion controlled by the cloning host. If an individual node is still lacking image data, the missing parts are transferred during the acknowledgment phase on a peer-to-peer basis with the master node. As soon as a node has all the image data, it starts the cloning process locally and reboots itself into operational mode.
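The two phases of this scheme, a lossy multicast followed by a round-robin acknowledgment with direct repair, can be simulated in a few lines. The sketch below is a toy model under stated assumptions: the function name `clone_image`, the chunking, the random loss model and the `loss_rate` parameter are hypothetical, and no real network traffic is involved.

```python
import random
from typing import Dict, List, Tuple


def clone_image(chunks: List[bytes], node_names: List[str],
                loss_rate: float = 0.1, seed: int = 0
                ) -> Tuple[Dict[str, bytes], Dict[str, int]]:
    """Simulate multicast cloning; return each node's image and its repair count."""
    rng = random.Random(seed)
    received: Dict[str, Dict[int, bytes]] = {name: {} for name in node_names}

    # Phase 1: multicast. Each chunk is sent once; any node may miss it.
    for index, chunk in enumerate(chunks):
        for name in node_names:
            if rng.random() >= loss_rate:
                received[name][index] = chunk

    # Phase 2: the cloning host polls the nodes one after another. A node
    # reports which chunks it lacks and gets them directly from the master.
    images: Dict[str, bytes] = {}
    repairs: Dict[str, int] = {}
    for name in node_names:
        missing = [i for i in range(len(chunks)) if i not in received[name]]
        for i in missing:
            received[name][i] = chunks[i]
        repairs[name] = len(missing)
        images[name] = b"".join(received[name][i] for i in range(len(chunks)))
    return images, repairs
```

The point of the design is visible in the repair counts: the bulk of the image crosses the network once regardless of the number of nodes, and only the chunks a node actually missed are sent again.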

With ClusterWorX, cloning is done from the easy-to-use GUI. Administrators are able to load the OS and applications to build the required functionality into an image. Then ClusterWorX automatically clones the images to selected nodes. Improvements to cloning add the ability to more easily update the kernel on all nodes, create new types of images, and update files or packages on the nodes in parallel. Disk cloning greatly reduces the time, effort, and cost of installing, upgrading, or updating a large cluster system. For convenience we offer prebuilt images for cloning, for hard disk as well as NFS boot. Furthermore, customized images can be built with little effort (see [5]).

² It took about 12 min. to clone and reboot over 400 nodes of the Lawrence Livermore cluster.

5 Monitoring and Event Handling

ClusterWorX is the main framework for our cluster management solution. Besides providing a graphical user interface (GUI) to the ICE Box and to the disk cloning facilities, ClusterWorX is responsible for monitoring and event handling within a cluster, which is the topic of this section. For a more detailed description of ClusterWorX see [1, 2].

5.1 Monitoring

ClusterWorX can monitor virtually any system function, including CPU usage, CPU type, network bandwidth, memory usage, disk I/O and system uptime. It comes standard with over 40 monitors built in. The UDP echo port is used to ensure network connectivity. In addition, ClusterWorX offers plug-in support so administrators can include their own monitors. In combination with additional sensor packages (e.g. lm_sensors [7]) it is possible to monitor fans as well as CPU and board temperature, although temperature monitoring is usually accomplished using the ICE Box sensors. A plug-in itself can be any program, script (shell, perl, etc.) or any combination thereof. As long as it resides in the ClusterWorX plug-in directory, it will be recognized by the system automatically. This flexible concept of plug-ins allows ClusterWorX to fit the needs of any system, no matter how unique its functionality.
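The directory-based plug-in idea can be illustrated as follows. This is a sketch of the general technique, not ClusterWorX code: the function name `run_plugins`, the convention that a plug-in prints its value on standard output, and the timeout are all assumptions made for the example.

```python
import os
import subprocess
from typing import Dict


def run_plugins(plugin_dir: str, timeout_s: float = 5.0) -> Dict[str, str]:
    """Run every executable file in plugin_dir and collect what it prints."""
    results: Dict[str, str] = {}
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        # Anything executable counts as a plug-in, whatever language it is in.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        try:
            done = subprocess.run([path], capture_output=True,
                                  text=True, timeout=timeout_s)
        except (OSError, subprocess.TimeoutExpired):
            continue  # a broken or hanging plug-in must not stop the others
        if done.returncode == 0:
            results[name] = done.stdout.strip()
    return results
```

Because discovery relies only on the file being present and executable, adding a monitor requires no registration step: dropping a script into the directory is enough for it to be picked up on the next pass.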

Through a secure connection, ClusterWorX allows administrators to remotely monitor and manage a cluster system from an on-site or off-site location with any Java-enabled browser. If problems arise, administrators have full access to the cluster at home or on the road.

ClusterWorX is written in Java for cross-platform, client-side independence. The Java-based GUI provides a platform for advanced visual cluster management. The 3-tier design allows multiple clients to access the ClusterWorX server at the same time without conflict. The ClusterWorX main monitoring screen is easily customized to allow administrators to view system statistics relevant to their system in near real time. With ClusterWorX, cloning an image or adding a node to the cluster becomes as simple as a few mouse clicks.

Historical graphing allows the administrator to chart monitoring values over time. The administrator can view cluster use and performance trends over a selected time interval, analyze the relationships between monitored values, or compare performance between nodes. Analyzing this data can help the administrator spot system bottlenecks, improve cluster efficiency, and predict future computing needs.

5.2 Event Handling

Online monitoring is only one capability of ClusterWorX. More important, especially in case of problems or failures, is the event and notification engine. When cluster problems arise, administrators can customize ClusterWorX to automatically take action, e.g. power down, reboot, or halt any malfunctioning node. This is accomplished through an event engine that allows administrators to set thresholds on any value monitored. This allows corrective action to be taken before problems become critical (e.g. powering down a node on CPU fan failure to prevent the CPU from burning). If the administrator-defined threshold is exceeded, ClusterWorX automatically triggers an action. Default actions include node power down and node reboot. For example, the event engine can report and take an administrator-defined action, such as powering down a node, when processors rise above a certain temperature or if the load is too high. Events are configured by administrators, who can choose to receive a notification when an event occurs. Events are also extendable in that they monitor administrator-defined values and execute administrator-defined plug-ins. Customizable actions can be created using shell scripts, perl scripts, symbolic links, programs, and more.
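A threshold-driven event engine of the kind described above can be sketched compactly. The names `Event` and `evaluate`, the monitor name `cpu_temp_c` and the threshold value used in the usage note are hypothetical; the sketch shows the general mechanism and is not the ClusterWorX event engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    name: str                      # administrator-chosen event name
    monitor: str                   # name of the monitored value to watch
    threshold: float               # administrator-defined limit
    action: Callable[[str], None]  # called with the node name when triggered


def evaluate(events: List[Event], node: str,
             readings: Dict[str, float]) -> List[str]:
    """Run the action of every event whose threshold is exceeded on this node."""
    fired = []
    for event in events:
        value = readings.get(event.monitor)
        if value is not None and value > event.threshold:
            event.action(node)
            fired.append(event.name)
    return fired
```

An administrator-defined event would then pair a monitor with an action, for instance `Event("overheat", "cpu_temp_c", 70.0, power_down)`, where `power_down` stands for any callable, script wrapper or plug-in that takes the node name.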

Using a smart notification algorithm, ClusterWorX notifies administrators of problems without swamping them with unnecessary e-mails. The e-mail informs the administrator which cluster is malfunctioning, the name of the triggered event, the node(s) which are experiencing the problem, and the action (if any) that was taken. Only one e-mail is sent per triggered event, even if multiple nodes are involved. If a node is fixed by an administrator but fails again later, the event re-fires automatically, without administrative intervention. For those who desire, e-mail can be directed to most wireless devices such as pagers and cell phones.
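The text does not spell out the notification algorithm, so the following sketch shows only one plausible reading of it: nodes that trigger the same event in one monitoring cycle share a single message, nodes already reported stay silent while they remain in the failed state, and a node that recovers and fails again is reported anew. The class name `Notifier` and the message layout are assumptions.

```python
from typing import Callable, Dict, Iterable, Optional, Set


class Notifier:
    """Send at most one message per event and cycle, and only for new failures."""

    def __init__(self, cluster: str, send: Callable[[str], None]):
        self.cluster = cluster
        self.send = send                      # e.g. a function that sends e-mail
        self._active: Dict[str, Set[str]] = {}  # event name -> nodes already reported

    def update(self, event: str, failing_nodes: Iterable[str],
               action: Optional[str] = None) -> bool:
        """Call once per monitoring cycle with the nodes currently over threshold."""
        failing = set(failing_nodes)
        new = failing - self._active.get(event, set())
        # Recovered nodes drop out here, so a later failure counts as new again.
        self._active[event] = failing
        if not new:
            return False
        self.send(f"cluster={self.cluster} event={event} "
                  f"nodes={','.join(sorted(new))} action={action or 'none'}")
        return True
```

Tracking the set of already-reported nodes per event is what keeps a persistent fault from generating a message on every monitoring cycle.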

5.3 Performance Issues

Monitoring is at the heart of cluster management. The data is used to schedule tasks, load-balance devices and services, notify administrators of software and hardware failures, and generally monitor the health and usage of a system. The difficulty is that the information used to perform these operations must be gathered from the cluster without impacting application performance.

Cluster monitoring primarily consumes two important resources: CPU cycles and network bandwidth. The CPU usage problem is completely localized on a node and is addressed by creating efficient gathering and consolidation algorithms. The network bandwidth problem affects a shared resource and is addressed by finding ways to minimize the amount of data transmitted over the network. To address these two issues, we divide cluster monitoring into three stages: gathering, consolidation, and transmission.

5.3.1 Gathering

The gathering stage is responsible for loading the data from the operating system, parsing the values, and storing the results in memory. Standard tools for gathering system statistics, such as rstatd and SNMP tools, only provide limited information and tend to be slow and inefficient. Thus we focus on using the /proc virtual file system to gather all system statistics. An important note about the proc file system is that each time a proc file is read, a handler is called by the kernel, or the owning module, to generate the data. The data is generated on the fly, and the entire file is reconstructed whether a single character or a large block is read, which is a crucial point for efficiency.

The test system used was a 1 GHz Pentium III with 1 GB of memory, using version 2.4.18 of the Linux kernel. Our first implementation for loading and analyzing the memory statistics (/proc/meminfo) rendered only 85 samples per second at 100% CPU utilization. Loading /proc/meminfo at once into a separate buffer and parsing the data within that buffer increases the gathering rate to 4173 samples per second, or a 4800% increase in performance. By taking advantage of the fact that /proc data uses standard ASCII output and by using a priori knowledge about the output format of /proc/meminfo, we were able to achieve another 236% increase in performance, resulting in a monitoring rate of 14031 samples per second. The last improvement was due to not closing and reopening /proc/meminfo each time we needed new memory statistics. Instead we keep the file open all the time, just resetting the file pointer to the beginning of the file between two consecutive steps. This optimization yields an additional 141% increase in performance. Now we reach a gathering rate of 33855 samples per second, which translates to 29.5 µs of CPU time per call. In other words, the optimized gathering process takes approximately 5 seconds of CPU time per hour at a monitoring rate of 50 samples per second.
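Two of the optimizations described above, reading the whole file into a buffer and keeping the file open between samples, can be shown in a short sketch. It is written in Python purely for illustration and makes no claim about reproducing the measured rates; the class name `MeminfoReader` is hypothetical, and the parser assumes lines of the form `Key: value [unit]`, skipping anything else.

```python
from typing import Dict


class MeminfoReader:
    """Sample /proc/meminfo repeatedly without reopening the file."""

    def __init__(self, path: str = "/proc/meminfo"):
        # Opened once and kept open for the lifetime of the reader.
        self._file = open(path, "rb", buffering=0)

    def sample(self) -> Dict[str, int]:
        self._file.seek(0)         # rewind instead of close and reopen
        data = self._file.read()   # whole file into one buffer, parsed in memory
        stats: Dict[str, int] = {}
        for line in data.splitlines():
            key, _, rest = line.partition(b":")
            fields = rest.split()
            try:
                stats[key.strip().decode("ascii")] = int(fields[0])
            except (IndexError, ValueError):
                continue           # skip header or otherwise unexpected lines
        return stats

    def close(self) -> None:
        self._file.close()
```

Rewinding works because the kernel regenerates the contents of a /proc file on each read from the start, so every call to `sample()` sees fresh values without paying for an open and a close.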

Other statistics are taken from /proc/stat at 35 µs per call, from /proc/loadavg at 7.5 µs per call, from /proc/uptime at 6.2 µs per call, and from /proc/net/dev at 21.6 µs per call per network device. Furthermore, we investigated the difference between implementing the gathering process in C or Java and found that C is only slightly ahead of Java. Thus we decided to use the Java implementation, because ClusterWorX is also written in Java.
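The per-call costs quoted in this section can be converted into an hourly CPU budget with simple arithmetic. The sketch below uses only the figures stated in the text; the function names and the assumption that every source is sampled at the same rate are illustrative.

```python
# Per-call CPU cost in microseconds, as reported in the text.
COST_US = {
    "/proc/meminfo": 29.5,
    "/proc/stat": 35.0,
    "/proc/loadavg": 7.5,
    "/proc/uptime": 6.2,
    "/proc/net/dev": 21.6,   # per network device
}


def cpu_seconds_per_hour(cost_us: float, rate_hz: float = 50.0) -> float:
    """CPU seconds consumed per hour when sampling at rate_hz."""
    return cost_us * 1e-6 * rate_hz * 3600.0


def total_cpu_seconds_per_hour(rate_hz: float = 50.0,
                               network_devices: int = 1) -> float:
    """Sum over all sources, assuming each is sampled at the same rate."""
    total_us = sum(cost * (network_devices if path == "/proc/net/dev" else 1)
                   for path, cost in COST_US.items())
    return cpu_seconds_per_hour(total_us, rate_hz)
```

For /proc/meminfo alone this gives 29.5 µs × 50 × 3600 ≈ 5.3 s per hour, matching the "approximately 5 seconds" stated earlier. Sampling all five sources with one network device at 50 samples per second comes to about 18 s per hour, or roughly 0.5% of one CPU.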

5.3.2 Consolidation

The consolidation stage is responsible for bringing the data from multiple sources together, for determining whether values have changed, and for filtering. In the interest of efficiency, this task is performed exclusively on the node, because the node is the gatherer and provider of the monitored data. The consolidation stage is used to combine data from multiple data sources at independent gathering rates. The consolidation process distinguishes between static and dynamic monitoring data and transmits only data that has changed since the last transmission. This reduces the amount of transferred data substantially. Furthermore, monitor data is cached so that simultaneous requests can be served using the same set of data. This approach reduces the burden on the operating system and increases the responsiveness of the monitoring system.
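The change-detection part of consolidation amounts to comparing each new snapshot against what was last sent. A minimal sketch follows, with the class name `Consolidator` and the flat name-to-value snapshot format as assumptions; caching for simultaneous requests is not modelled here.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel so that a stored None still compares correctly


class Consolidator:
    """Remember the last transmitted values and pass on only the differences."""

    def __init__(self) -> None:
        self._last_sent: Dict[str, Any] = {}

    def changed(self, snapshot: Dict[str, Any]) -> Dict[str, Any]:
        """Return the entries of snapshot that differ from what was last sent."""
        delta = {name: value for name, value in snapshot.items()
                 if self._last_sent.get(name, _MISSING) != value}
        self._last_sent.update(delta)
        return delta
```

Static values such as the CPU type appear in the first delta and never again, while dynamic values are sent only in the cycles in which they actually move.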

5.3.3 Transmission

The transmission stage is responsible for compression and transmission of the data to a management node. Since we use the /proc file system, monitored data is stored in human-readable form. Although binary formats require less storage, we leave the data in text form because of platform independence and the human-readable nature of the data. Nevertheless, when transmitting the data, we use data compression techniques, which are known to be very effective on text input.
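The effect of compressing text-form monitoring data can be tried out with a general-purpose compressor. The text does not name the compression technique used, so zlib serves here only as a stand-in; the `name: value` line format and the function names are likewise assumptions.

```python
import zlib
from typing import Dict


def encode_text(stats: Dict[str, int]) -> bytes:
    """Render the values as human-readable 'name: value' lines."""
    return "".join(f"{name}: {value}\n"
                   for name, value in sorted(stats.items())).encode("ascii")


def pack(stats: Dict[str, int]) -> bytes:
    """Compress the text form for transmission (zlib is a stand-in)."""
    return zlib.compress(encode_text(stats))


def unpack(payload: bytes) -> Dict[str, int]:
    """Decompress and parse the text form on the management node."""
    stats: Dict[str, int] = {}
    for line in zlib.decompress(payload).decode("ascii").splitlines():
        name, _, value = line.partition(": ")
        stats[name] = int(value)
    return stats
```

Monitoring output is highly repetitive, with recurring key names and similar digits, which is why text of this kind tends to compress well while staying platform independent and easy to inspect after decompression.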

6 Future Work

The Lawrence Livermore National Laboratory (LLNL) and Linux NetworX are designing and developing SLURM (Simple Linux Utility for Resource Management). SLURM provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. SLURM is not a sophisticated batch system, but it does provide an Applications Programming Interface (API) for integration with external schedulers such as the Maui Scheduler [8]. While other resource managers do exist, SLURM is unique in several respects:

• Its source code is freely available under the GNU General Public License.

• It is designed to operate in a heterogeneous cluster with up to thousands of nodes.

• It is portable: written in C with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets. The interconnect to be initially supported is Quadrics Elan3, but support for other interconnects (e.g. Myrinet) is already planned.

• SLURM is highly tolerant of system failures, including failure of the node executing its control functions.

• It is simple enough for the motivated end user to understand its source and add functionality.

Further information about the design and the current state of SLURM is available on the SLURM homepage [12].

7 Conclusion

Linux clustering is a reasonable alternative to supercomputing because it is a reliable, flexible, scalable and cost-effective solution. However, many organizations are prevented from benefiting from Linux clusters because of limited technical resources. To help alleviate this problem, we developed ClusterWorX and ICE Box, lowering the barriers to adopting this technology.

On the software side, the cluster management solution ClusterWorX scales to meet the needs of any size system and includes remote management capabilities, a customizable, easy-to-use graphical user interface, integrated disk cloning, sophisticated monitoring and event handling, and automatic administrator notification. On the hardware side, the cluster management solution ICE Box fully integrates with ClusterWorX to provide advanced power monitoring and power control as well as thermal probing and serial console access to all nodes of a cluster.

Furthermore, we support and participate in open source projects such as LinuxBIOS and SLURM to provide future cluster management enhancements.

References

[1] Linux NetworX. ClusterWorX 2.1, April 2002. http://www.lnxi.com/news/pdf/whitepaper cwx.pdf.
[2] Linux NetworX. ClusterWorX User Guide, 2002.
[3] Linux NetworX. ICE Box, 2002. http://www.lnxi.com/news/pdf/whitepaper ice.pdf.
[4] Linux NetworX. ICE Box User Guide, 2002.
[5] Linux NetworX. Image Manager User Guide, 2002.
[6] The LinuxBIOS Homepage. http://www.linuxbios.org.
[7] Hardware Monitoring by lm_sensors. http://www2.lm-sensors.nu/~lm78.
[8] Maui Scheduler Open Cluster Software. http://mauischeduler.sourceforge.net.
[9] Myrinet. http://www.myri.com.
[10] Quadrics. http://www.quadrics.com.
[11] Dolphin. http://www.dolphinics.com.
[12] Simple Linux Utility for Resource Management. http://www.llnl.gov/linux/slurm.