Single System Image with Virtualization
Technology for Cluster Computing Environment
Liang Yong∗
∗Network Center of Nanchang University, Nanchang 330031, China
Email: [email protected]
Abstract—Single system image (SSI) for clusters has long been one of the hot topics in parallel computer architecture research, since SSI makes a cluster environment easier to program and administer. Most current SSI studies focus on the middleware level on top of the operating system, which leads to problems such as poor transparency and low performance. This paper presents a novel solution that uses a distributed virtual machine monitor (DVMM) built on hardware-assisted virtualization technologies. The DVMM consists of symmetrical, cooperative VMMs distributed across the nodes. The cooperation among the VMMs virtualizes the distributed hardware resources, so the DVMM can support an unmodified legacy operating system running transparently on the distributed nodes of a cluster. Compared with other state-of-the-art SSI solutions, our solution offers better transparency, higher performance and easier implementation.
I. INTRODUCTION
Parallel computer architecture has developed in two important directions. One is the shared memory architecture, represented by SMP (Symmetric Multiprocessor) and cc-NUMA (Cache-Coherent Non-Uniform Memory Architecture); the other is the distributed memory architecture, represented by COW (Cluster of Workstations) and MPP (Massively Parallel Processing) [1]. The shared memory architecture usually supports the shared memory programming model and has good programmability, but it scales poorly because of constraints such as the bandwidth of the shared memory. The distributed memory architecture usually uses the more complicated message passing programming model and has poor programmability, but it scales well because of its loosely coupled interconnection. Since the advantages of the two architectures are complementary, it is natural to try to obtain both. One possible way to combine them is to implement the image of a shared memory architecture on the hardware of a distributed memory architecture. DSM (distributed shared memory) and SSI on clusters are typical examples of this approach.
This paper presents a novel solution that provides SSI on clusters using a distributed virtual machine monitor (DVMM) with hardware-assisted virtualization technologies. The rest of the paper is organized as follows. Section II describes the background of SSI and virtualization technologies and introduces related work. Section III describes the design and implementation of the DVMM. Section IV compares the DVMM with existing solutions. Finally, Section V concludes the paper.
II. BACKGROUND
A. Single System Image
SSI means that all the distributed resources are organized into a uniform unit for users, so that users are not aware of the individual nodes that make up the computer system. SSI includes attributes such as a single memory space, a single process space, a single I/O space, a single login point, a single file system, and single load management. The key attributes of SSI are the single memory space and the single process space [2]. The SSI of a cluster can be implemented on the hardware level, the underware level, the middleware level, or the application level. Currently there are few solutions on the hardware level; examples are Enterprise X-Architecture [3], cc-NUMA [4] and DSM [5], [2]. These solutions adopt special chips or hardware, so they have high cost and limited applications. Solutions on the underware level are also rare. The representative solutions are MOSIX [6] and Sun Solaris-MC [2]. SSI implemented on this level has good transparency for users, but it is difficult to implement, and current solutions on this level implement only some of the SSI attributes. The middleware level is the most convenient level for implementing SSI on a cluster, so there are many related studies on this level. Typical work includes distributed shared memory systems such as IVY [7], parallel and distributed file systems such as Lustre [8], resource management and load scheduling systems such as LSF [9], and parallel programming environments such as MPI and PVM [10]. SSI implemented on the middleware level has poor transparency for users. SSI implemented on the application level is tied too closely to the applications, so there are few solutions on this level; the most representative one is the Linux Virtual Server (LVS) [11]. In summary, the possible levels for implementing the SSI of a cluster are the application level, the middleware level, the underware level and the hardware level.
From the top down, the difficulty of implementing SSI increases, but so does the transparency for users. Most current studies were done on the middleware level, which leads to problems such as poor transparency. There are few solutions on the application level, the underware level or the hardware level, and the solutions on each of these levels have their own pitfalls: solutions on the hardware level have high cost, solutions on the underware level cannot implement the full set of SSI attributes, and solutions on the application level have poor flexibility.
Third 2008 International Conference on Convergence and Hybrid Information Technology
1) Virtualization: Virtualization means that computation and processing are carried out on a virtual base instead of on the real one. Virtualization techniques construct a virtual platform between the hardware and the OS, creating multiple running domains on one hardware platform. The domains are isolated from one another, and each domain can run its own OS and applications [12].
The focus of virtualization technology is to construct the Virtual Machine Monitor (VMM) by virtualizing the Instruction Set Architecture (ISA). Virtualization techniques can be classified into full virtualization, para-virtualization, pre-virtualization and hardware-assisted virtualization [13], [14].
Hardware-assisted virtualization is the most advanced virtualization technology. VT-x (Virtualization Technology) [15] is a hardware-assisted virtualization technology for the IA-32 architecture. Its main elements are as follows. A new form of processor operation, called VMX (Virtual Machine Extensions) operation, is added to the processor. Two kinds of VMX transitions are defined, called VM entries and VM exits. A VMCS (Virtual-Machine Control Structure) is added to the architecture. Ten new instructions for controlling the VM are added to the architecture. With the support of VT-x, the design of a VMM can be simplified and full virtualization can be implemented without binary translation, so the guest software can run transparently on the virtual machine without any modification.
B. Related Work
The essence of virtualization is to make the details of the hardware transparent to the software and to separate the software from the hardware by abstracting the physical resources. The goal of SSI is to hide the distributed hardware environment and to make the cluster appear as a single system. Since the aims of SSI and virtualization are consistent, SSI can be implemented by virtualization. The related studies are Virtual Multiprocessor [16] and vNUMA [17].
1) Virtual Multiprocessor: Virtual Multiprocessor implements an 8-way shared memory multiprocessor virtual machine on a cluster of 8 PCs. The VMMs rely on the support of a host OS and run in application space. The guest OS is modified because para-virtualization is used. The shared memory space is supported by DSM techniques; the virtual processors are emulated by special processes; I/O virtualization is implemented through cooperation between the VMMs and a dedicated I/O server. Virtual Multiprocessor has several disadvantages. Running the VMMs on the application level leads to low performance and weak flexibility. Para-virtualization requires modifying the code of the guest OS, and only the I/O devices in the I/O server can be used, so the approach has limited application and is difficult to implement. Furthermore, Virtual Multiprocessor cannot provide SSI on an SMP cluster because it does not support SMP nodes.
2) vNUMA: vNUMA implements a 2-way NUMA virtual machine on a cluster of two workstations, each with a single IA64 processor. The VMMs are implemented directly on the hardware without host OS support. Pre-virtualization is used to scan the guest OS code at compile time and replace the privileged instructions statically. The guest OS is forced to run in ring 1 so that the VMM keeps its privileged status. The shared memory is supported by DSM. One node is the master node, from which the system is set up, and the others are slave nodes. The disadvantages of vNUMA are that pre-virtualization requires modifying the guest OS and that lowering the privilege level of the guest OS can cause privilege confusion. Also, vNUMA cannot provide SSI on an SMP cluster.
III. DESIGN AND IMPLEMENTATION OF DVMM
A. Overview
The major goal of the DVMM is to hide the distributed nature of the hardware, provide SSI on an SMP cluster, and support a single OS running transparently on the cluster. In other words, the DVMM implements the image of a shared memory architecture on the hardware of a distributed memory architecture in order to combine the advantages of both, namely ease of programming and good scalability. To achieve this, three essential problems must be solved. Firstly, the distributed hardware configurations of the cluster must be detected and merged into global information covering all the distributed hardware resources. Secondly, the global hardware resources must be virtualized and presented to the OS. Thirdly, the OS must be able to manage, schedule and use the global resources on the distributed nodes of the cluster just as on a single SMP or cc-NUMA machine.
To provide SSI on a cluster, a new layer named DVMM is added between the OS and the cluster hardware. The DVMM consists of symmetrical, cooperative VMMs, one on each node of the cluster. A single OS that supports the cc-NUMA architecture runs on the DVMM. Using hardware-assisted virtualization technologies, the DVMM detects and merges the physical resources of the cluster into global information covering all physical resources, virtualizes these physical resources into virtual resources, and presents the virtual resources to the OS. The OS schedules and runs processes, and manages and allocates the virtual resources. The underlying DVMM remains transparent to the OS while it performs these actions. The DVMM intercepts the resource accesses issued by the OS and handles them on behalf of the OS, for example by mapping the virtual resources to the physical resources and manipulating the physical resources. In this way, the OS is aware of all the resources of the cluster and can manage and use them. The distributed nature of the underlying hardware is hidden, and the whole cluster is presented to the OS as a cc-NUMA virtual machine.
B. Strategies
of VMMs and integrating the physical resources of the cluster through communication among the VMMs; virtualizing the physical resources into virtual resources and then reporting them to the OS through hardware-assisted virtualization; managing and using the physical resources of the cluster through cooperation between the OS and the DVMM. The details of the strategies are described as follows.
1) Resource Detecting and Merging: The BIOS is emulated and extended into the eBIOS (Extended Basic Input/Output System). After the eBIOS acquires the information about the physical resources of the native node through the basic BIOS operations, it communicates with the other nodes to collect the information about the physical resources of the whole cluster, and merges it into the global physical resource information. Based on the global physical resources, the DVMM reserves some resources and virtualizes the remaining resources into virtual resources. The DVMM then organizes the virtual resources: it builds the resource mapping tables, implements the mappings from the virtual resources to the physical resources and from the physical resources to the nodes, and creates the global virtual resource information table intended for the OS. The OS is booted on top of the virtual resources; its BIOS calls during boot are captured, and the global virtual resource information is reported to it, so that the OS is aware of the global virtual resources.
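The detecting and merging step can be illustrated with a minimal Python sketch. It is not the eBIOS implementation: the record fields, the function names and the fixed per-node reservation are assumptions made only for illustration. Each node reports its CPUs and memory pages, the DVMM withholds a reserved range on every node, and the remaining resources are numbered globally so that each virtual index maps to a (node, local physical index) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    """Physical resources reported by one node (hypothetical fields)."""
    node_id: int
    cpus: int
    mem_pages: int


def merge_resources(nodes, reserved_pages_per_node=0):
    """Merge per-node reports into global virtual-resource mapping tables.

    Returns (vcpu_map, vpage_map). Each table maps a virtual index to a
    (node_id, local physical index) pair.
    """
    vcpu_map = {}
    vpage_map = {}
    for node in sorted(nodes, key=lambda n: n.node_id):
        for cpu in range(node.cpus):
            vcpu_map[len(vcpu_map)] = (node.node_id, cpu)
        # Pages below reserved_pages_per_node are kept by the VMM itself
        # and are never exposed to the OS (an assumed reservation policy).
        for page in range(reserved_pages_per_node, node.mem_pages):
            vpage_map[len(vpage_map)] = (node.node_id, page)
    return vcpu_map, vpage_map


def report_to_os(vcpu_map, vpage_map):
    """Summary of the global virtual resources that the OS would see."""
    return {"cpus": len(vcpu_map), "mem_pages": len(vpage_map)}


# Two nodes, reported out of order, each with 2 CPUs and 8 pages.
nodes = [NodeResources(1, 2, 8), NodeResources(0, 2, 8)]
vcpu_map, vpage_map = merge_resources(nodes, reserved_pages_per_node=2)
summary = report_to_os(vcpu_map, vpage_map)
```

In this toy configuration the OS sees one machine with four CPUs and twelve pages, while the mapping tables retain the node on which each virtual resource physically resides.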
2) Resource Virtualization: Resource virtualization includes ISA virtualization, interrupt mechanism virtualization, memory virtualization and I/O device virtualization. All of them are based on the global physical resource information formed by the eBIOS. Unlike existing virtualization techniques, the technique in this paper can virtualize resources across nodes, for example the distributed memory resources.
Fig. 1. System architecture. (Elements shown: Operating System; a cc-NUMA virtual machine; Distributed Virtual Machine Monitor (DVMM); four Nodes; Gigabit Ethernet; Cluster with SMP nodes.)
Fig. 2. DVMM structure. (Elements shown for each node, between Hardware and Operating System: Initialization, eBIOS, ISA Virtualization, I/O Virtualization, Interrupt Virtualization, MMU Virtualization, DSM and Communication modules; the nodes are connected by Gigabit Ethernet.)
The IA-32 ISA is virtualized through the VT-x mechanism; the techniques are similar to those used in the HVM of Xen [18]. The interrupt mechanism is virtualized as follows. The DVMM emulates the various interrupt controllers in software and intercepts the accesses from the OS to the interrupt controllers. If the target interrupt controller is on the native node, the DVMM reads or modifies the contents of the interrupt controller to reflect the effect of the guest's operation; if the target interrupt controller is on a remote node, the DVMM sends the access request to the target node, and the VMM on the target node manipulates the virtual interrupt controller accordingly. The DVMM also catches hardware interrupts; the contents of the virtual interrupt controller are then modified by the native VMM or by a remote VMM, depending on the node on which the interrupted object resides, so that the interrupt can be presented to the OS. The distributed memory resources are virtualized by combining shadow page tables with software DSM: the memory of the whole cluster is first merged into a distributed shared memory using software DSM, and this distributed shared memory is then virtualized using shadow page tables. I/O resource virtualization is implemented by having the VMM intercept and process the I/O operations on behalf of the OS. The I/O operations of the OS are intercepted by the VT-x mechanism. If the I/O operation is to be processed on the native node, the native VMM processes the intercepted instruction and returns the results to the OS. If the I/O operation is to be performed on a remote node, the I/O instruction is sent to the target node and executed by the VMM there; the results are sent back to the native VMM, which returns them to the OS.
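The local-versus-remote handling of an intercepted I/O operation can be sketched as follows. This is a simplified Python model under stated assumptions, not the DVMM code: the class and method names are hypothetical, devices are modeled as dictionaries from port number to value, and a direct method call on the peer object stands in for the message exchange over Gigabit Ethernet.

```python
class NodeVMM:
    """Toy VMM for one node; devices are modeled as port -> value dicts."""

    def __init__(self, node_id, ports):
        self.node_id = node_id
        self.ports = dict(ports)   # I/O ports physically present on this node
        self.peers = {}            # node_id -> NodeVMM (stands in for the network)
        self.port_owner = {}       # global table: port -> owning node_id
        self.remote_requests = 0   # number of operations forwarded to a peer

    def execute_local(self, op, port, value=None):
        """Perform the I/O operation on a device of this node."""
        if op == "in":
            return self.ports[port]
        self.ports[port] = value   # op == "out"
        return None

    def handle_io_exit(self, op, port, value=None):
        """Called when an I/O instruction of the guest is intercepted."""
        owner = self.port_owner[port]
        if owner == self.node_id:
            return self.execute_local(op, port, value)
        # Remote device: forward the request. The owner's VMM executes it,
        # and its result is handed back to the guest by the native VMM.
        self.remote_requests += 1
        return self.peers[owner].execute_local(op, port, value)


def build_cluster(layout):
    """Create one VMM per node and share the global port ownership table."""
    vmms = {nid: NodeVMM(nid, ports) for nid, ports in layout.items()}
    owner = {port: nid for nid, ports in layout.items() for port in ports}
    for vmm in vmms.values():
        vmm.peers = vmms
        vmm.port_owner = owner
    return vmms


vmms = build_cluster({0: {0x3F8: 0x41}, 1: {0x1F0: 0x7F}})
local_value = vmms[0].handle_io_exit("in", 0x3F8)    # served on node 0
vmms[0].handle_io_exit("out", 0x1F0, 0x55)           # forwarded to node 1
remote_value = vmms[0].handle_io_exit("in", 0x1F0)   # forwarded to node 1
```

The guest issues the same instruction in both cases; only the ownership table consulted by the native VMM decides whether the operation is executed locally or forwarded.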
3) Resource Management and Utilizing: The OS is
C. Design and Implementation
1) System Architecture: The system architecture is shown in Figure 1. The system contains three levels; from the bottom up, they are the hardware level, the DVMM level and the OS level. The hardware level consists of SMP nodes interconnected by Gigabit Ethernet, whose CPUs support VT-x. The DVMM level consists of symmetrical, cooperative VMMs distributed on the nodes. The VMMs communicate through dedicated low-level communication software built on Gigabit Ethernet. The OS can be any OS that supports cc-NUMA. The key to implementing this system is constructing the DVMM, whose structure and mechanism are introduced in the following passages.
2) DVMM Structure: The DVMM is composed of the VMMs distributed on the nodes of the cluster, with one VMM running on each node. The DVMM runs on the bare machines. The functions of each VMM are detecting, integrating and virtualizing the physical resources, reporting the virtual resources to the operating system, and cooperating across the nodes. The structure of the function modules of the DVMM is shown in Figure 2.
The initialization module is responsible for loading and running the VMM on each node. The eBIOS module is responsible for detecting and integrating the resource information of the cluster and reporting it to the OS. The ISA virtualization module is responsible for virtualizing the IA-32 ISA and cooperating with the interrupt virtualization module so that the OS can manage and schedule the virtual computing resources of the cluster. The I/O virtualization module is responsible for virtualizing the global I/O resources, so that the OS can detect and use the I/O resources of the whole system. The interrupt virtualization module is responsible for virtualizing the interrupt control mechanism and notifying the OS of interrupt events by injecting virtual interrupts. The MMU virtualization module is responsible for virtualizing the memory resources and ensuring that the OS runs correctly in the virtual physical address space. The DSM module is responsible for implementing a distributed shared memory transparently. The communication module is responsible for providing the cooperating VMMs with a low-latency, high-bandwidth, highly reliable communication service.
3) DVMM Mechanism: The relations among the modules of the DVMM are shown in Figure 3.
Fig. 3. Relations among the modules of DVMM. (Elements shown: Operating System; Hardware; the eBIOS, ISA Virtualization, I/O Virtualization, Interrupt Virtualization, MMU Virtualization, DSM and Communication modules; connections labeled VM entry/VM exit, External Interrupt, Communicating and Direct Access.)
The ISA virtualization module is both the entry point and the exit point of the DVMM. This module may invoke every other module of the VMM except the communication module, and vice versa. When a VM exit occurs, the ISA virtualization module analyzes the reason for the VM exit and invokes the appropriate module to handle it. When that module completes its work, it invokes the ISA virtualization module to return to the guest system. The communication module is the basis of the cooperation among the modules of the DVMM. The communication module may invoke every other module of the VMM except the ISA virtualization module, and vice versa. The eBIOS module is used only during the initialization of the DVMM and the booting of the OS. Firstly, the eBIOS module invokes the interrupt virtualization module, the I/O virtualization module and the communication module to detect and build the resource information of the whole system. Secondly, when the ISA virtualization module captures the BIOS calls made while the OS boots and passes them to the eBIOS module, the eBIOS module returns the information about the virtual resources of the whole system to the OS. The I/O virtualization module receives instructions from the ISA virtualization module; depending on the node on which the I/O request should be performed, it either executes the I/O instruction itself or invokes the communication module to send the I/O request to the target node. When the I/O virtualization module receives an I/O request from a remote node, it manipulates the native I/O device and sends the result to the source node. The interrupt virtualization module is invoked by the ISA virtualization module to emulate the OS's operations on the virtual interrupt controller; it also converts external interrupt vectors to virtual interrupt vectors and injects virtual interrupts into the OS. The MMU virtualization module is invoked only by the ISA virtualization module. When the ISA virtualization module captures a sensitive instruction or a trap related to the MMU, it invokes the MMU virtualization module to handle it. When the MMU virtualization module finds that the requested page is not on the native node, it invokes the DSM module to get the page from the host node. When invoked by the MMU virtualization module, the DSM module requests the page from the remote node; when invoked by the communication module, it serves the request and sends the result to the remote node.
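The dispatch pattern described in this subsection, in which the ISA virtualization module selects a handler by VM exit reason and the MMU virtualization module falls back to the DSM module for pages held elsewhere, can be sketched in Python as follows. All names and exit-reason strings are hypothetical, and the DSM is reduced to a single-owner page migration scheme chosen only for brevity; the coherence protocol actually used by the DVMM is not specified in the text above.

```python
class DSM:
    """Toy software DSM: every page has one owner node; a fetch migrates it.

    This single-owner migration policy is an assumption for illustration.
    """

    def __init__(self, page_owner, memories):
        self.page_owner = page_owner   # page number -> owning node_id
        self.memories = memories       # node_id -> {page number: contents}
        self.fetches = 0

    def fetch(self, page, to_node):
        owner = self.page_owner[page]
        self.memories[to_node][page] = self.memories[owner].pop(page)
        self.page_owner[page] = to_node
        self.fetches += 1


class VMM:
    """Toy per-node VMM with an exit dispatcher (the ISA virtualization role)."""

    def __init__(self, node_id, dsm):
        self.node_id = node_id
        self.dsm = dsm
        self.log = []
        # Exit reason -> handling module (hypothetical reason names).
        self.handlers = {
            "page_fault": self.mmu_handle,
            "io": self.io_handle,
            "interrupt_ctrl": self.interrupt_handle,
        }

    def handle_vm_exit(self, reason, **info):
        """Analyze the exit reason, invoke the module, then resume the guest."""
        result = self.handlers[reason](**info)
        self.log.append(reason)   # a real VMM would perform a VM entry here
        return result

    def mmu_handle(self, page):
        local = self.dsm.memories[self.node_id]
        if page not in local:
            # The page is not on the native node: ask the DSM module for it.
            self.dsm.fetch(page, self.node_id)
        return local[page]

    def io_handle(self, port):
        return f"io:{port}"        # placeholder for the I/O module

    def interrupt_handle(self, vector):
        return f"irq:{vector}"     # placeholder for the interrupt module


memories = {0: {0: "a"}, 1: {1: "b"}}
dsm = DSM({0: 0, 1: 1}, memories)
vmm0 = VMM(0, dsm)
first = vmm0.handle_vm_exit("page_fault", page=1)    # fetched from node 1
second = vmm0.handle_vm_exit("page_fault", page=1)   # now local, no fetch
io_result = vmm0.handle_vm_exit("io", port=0x60)
```

Only the first access to page 1 triggers a DSM fetch; the second is served from the native node, which is the behavior the MMU and DSM modules are described as providing together.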
TABLE I
COMPARISON AMONG DVMM, VIRTUAL MULTIPROCESSOR AND VNUMA
Solution                Level              Technique                         Difficulty  Transparency  Symmetry  Performance  SMP Supporting
Virtual Multiprocessor  Application Level  Para-virtualization               High        Poor          No        Low          No
vNUMA                   Underware Level    Pre-virtualization                High        Good          No        Moderate     No
DVMM                    Underware Level    Hardware-assisted Virtualization  Moderate    Good          Yes       High         Yes
IV. DISCUSSION
Providing SSI on clusters has long been one of the hot topics in the research field of parallel computer architecture, so there are many existing solutions. A few of them are based on virtualization techniques, while the others are not. To highlight the features of our solution, we compare it with the existing solutions as follows.
As Table I shows, the DVMM has advantages over Virtual Multiprocessor and vNUMA. Firstly, the DVMM supports SMP nodes and can implement SSI on SMP clusters, while Virtual Multiprocessor and vNUMA do not support SMP nodes and therefore cannot. Secondly, the DVMM uses the most advanced virtualization technology, hardware-assisted virtualization, which implements full virtualization with hardware support and does not require modifying the guest OS, so it is only moderately difficult to design and implement. Virtual Multiprocessor and vNUMA adopt para-virtualization and pre-virtualization respectively; both need to modify the guest OS, so they are difficult to implement and have limited applications. Thirdly, the DVMM is implemented with hardware assistance and runs on the bare metal, so it has high performance. Both Virtual Multiprocessor and vNUMA are implemented in software, so their performance is lower; moreover, Virtual Multiprocessor is implemented at the application level and must pass through several software layers, which lowers its performance further. Finally, the nodes of the DVMM are fully symmetrical, while the nodes of Virtual Multiprocessor and vNUMA are not: one of them is the master node and the others are slave nodes. Besides, the DVMM is implemented at the underware level while Virtual Multiprocessor is implemented at the application level, so the DVMM is more transparent than Virtual Multiprocessor; and because IA-32 is used more widely than IA64, the DVMM has wider application and higher practical value than vNUMA.
As mentioned in Section II-A, there are many solutions for providing SSI on clusters, implemented on the hardware level, the underware level, the middleware level and the application level respectively. Compared with these existing solutions, the DVMM also has advantages. Firstly, the DVMM does not demand special hardware, so it has lower cost and wider application than the solutions on the hardware level. Secondly, the DVMM can implement the full set of SSI attributes, while the solutions on the underware level can implement only some of them, so the DVMM has higher practical value. Thirdly, the DVMM has better transparency and higher performance than the solutions on the middleware level. Finally, the DVMM has better flexibility and higher performance than the solutions on the application level.
V. CONCLUSIONS AND FUTURE WORK
The DVMM implements the SSI of clusters on the underware level based on hardware-assisted virtualization technologies, so it can support an unmodified legacy OS running transparently on a cluster. Compared with the existing solutions for implementing the SSI of clusters, the DVMM has several advantages: it requires neither special hardware nor a modified guest OS, it supports SMP nodes, and its nodes are fully symmetrical.
Further improvements remain to be made. Firstly, the more advanced VT-d [19] and EPT (Extended Page Tables) [20] techniques can be used to reduce the implementation difficulty, and the processor consistency model can be adopted instead of the sequential consistency model for higher performance. Secondly, the physical resources can be detected dynamically to support changes in the number of nodes. Thirdly, resource management and load scheduling functions can be added to the DVMM to support multiple guest OSes running transparently and separately on a cluster.
REFERENCES
[1] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. China Machine Press, 1999.
[2] R. Buyya and T. Cortes, “Single system image (SSI),” The International Journal of High Performance Computing Applications, vol. 15, pp. 124–135, 2001.
[3] IBM Corporation, “IBM Enterprise X-Architecture technology.”
[4] B. Brock, G. D. Carpenter, E. Chiprout, M. E. Dean, P. L. D. Backer, E. N. Elnozahy, H. Franke, M. Giampapa, D. Glasco, J. L. Peterson, R. Rajamony, R. Ravindran, F. L. R. III, R. L. Rockhold, and J. Rubio, “Experience with building a commodity Intel-based ccNUMA system,” IBM Journal of Research and Development, vol. 45, no. 2, pp. 207–228, 2001.
[5] A. Itzkovitz and A. Schuster, “Distributed shared memory: Bridging the granularity gap,” in Proceedings of the First ACM Workshop on Software Distributed Shared Memory (WSDSM), 1999.
[6] L. Amar, A. Barak, and A. Shiloh, “The MOSIX direct file system access method for supporting scalable cluster file systems,” Cluster Computing, vol. 7, no. 2, pp. 141–150, 2004.
[7] K. Li and P. Hudak, “Memory coherence in shared virtual memory systems,” ACM Trans. Comput. Syst., vol. 7, no. 4, pp. 321–359, 1989.
[8] Sun Microsystems, Inc., “Lustre file system.” [Online]. Available: http://www.sun.com/software/products/lustre/
[9] Platform Computing Corporation, “Platform LSF reference.” [Online]. Available: http://www.platform.com/Products/platform-lsf
[10] V. S. Sunderam, “PVM: a framework for parallel distributed computing,” Concurrency, Practice and Experience, vol. 2, no. 4, pp. 315–340, 1990. [Online]. Available: citeseer.ist.psu.edu/sunderam90pvm.html
[11] W. Zhang, “Linux virtual servers for scalable network services,”
[12] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, 2005.
[13] VMware, Inc., “Understanding full virtualization, paravirtualization, and hardware assist.” [Online]. Available: http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf
[14] J. LeVasseur, V. Uhlig, M. Chapman, P. Chubb, B. Leslie, and G. Heiser, “Pre-virtualization: Slashing the cost of virtualization,” Tech. Rep., 2005. [Online]. Available: http://l4ka.org/publications/2005/previrtualization-techreport.pdf
[15] Intel 64 and IA-32 Architectures Software Developer's Manuals, “System programming guide,” 2007. [Online]. Available: http://www.intel.com/products/processor/manuals/
[16] K. Kaneda, O. Yoshihiro, and Y. Akinori, “A virtual machine monitor for providing a single system image,” Transactions of Information Processing Society of Japan, vol. 46, pp. 27–39, 2006.
[17] M. Chapman and G. Heiser, “Implementing transparent shared memory on clusters using virtual machines,” in ATEC ’05: Proceedings of the annual conference on USENIX Annual Technical Conference. Berkeley, CA, USA: USENIX Association, 2005, pp. 23–23.
[18] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. L. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the art of virtualization,” in SOSP, 2003, pp. 164–177.
[19] Intel Corporation, “Intel virtualization technology for directed I/O.” [Online]. Available: http://www.intel.com/technology/itj/2006/v10i3/2-io/5-platform-hardware-support.htm