PERFORMANCE ENHANCEMENT OF INTER-CLUSTER COMMUNICATION WITH SOFTWARE-BASED DATA COMPRESSION IN LINK LAYER


Shinichi Yamagiwa INESC-ID Lisboa, Portugal

Rua Alves Redol, 9 1000-029 Lisboa Portugal

[email protected]

Keiichi Aoki and Koichi Wada Department of Computer Science

University of Tsukuba

1-1-1 Tennodai, Tsukuba, Ibaraki, Japan [email protected], [email protected]

ABSTRACT

Communication performance is an important factor in the performance of a cluster computer and must be considered in any effort to improve the overall performance toward a desired level. This paper focuses on a communication method that reduces the communication data in a cluster computer by using a software-based data compression mechanism in the link layer of the cluster network, called the link layer compression. The technique is implemented in a lower layer of the cluster network than the mechanisms proposed in preceding research. The performance results show that the technique is able to enhance the performance of a cluster computer while achieving good compatibility with cluster-oriented applications.

KEY WORDS

cluster computing, network performance, data compression

1 Introduction

Cluster computers, which are composed of commercially available commodity personal computers or workstations and networks, are widely accepted in the parallel computing field due to their excellent cost/performance ratio [15][13]. Cluster computers execute computations by exchanging data among their computing nodes. Therefore, communication performance becomes one of the most important factors in maintaining high performance in cluster computers [18].

Previous research has tried to improve the communication performance of cluster computers. For example, high performance networks oriented to cluster computing, such as Myrinet [2], SCI [8] and Infiniband [9], are now commercially available and achieve gigabit-per-second bandwidth. Moreover, zero-copy communication [7] has been proposed to skip the thick protocol stack in order to decrease the latency and increase the throughput of those cluster-oriented networks. To achieve higher computational power in a cluster computer, communication methods will be upgraded to ones that incorporate those techniques without modifying the parallel application program itself.

In addition to the techniques mentioned above that overcome the potential communication overheads in a cluster computer, reducing the communication data is another important issue for achieving high performance inter-cluster communication. We can consider two solutions: (1) modifying the parallel algorithm and (2) compressing the communication data. The former method needs to change the application code, as in [19]. If the algorithm cannot be modified any further, there is no chance left to improve the performance. The latter method is widely used to reduce the communication data, for example in the MiMPI [3] and cMPI [11] communication libraries. Although it is not necessary to change the algorithm of the application if it uses the standardized interface of those communication libraries, application programs that do not use the same interface as the communication library cannot benefit from its compression mechanism. In addition, if the library does not support the network interface hardware embedded in the target cluster computer, application programs cannot transfer the communication data with the compression mechanism. For example, application programs that use PVM [4] communication functions do not fit the MPI [5] interface, and thus cannot use the MPI-based communication libraries that include a compression mechanism for the communication data. Thus, with the current compression methods for communication data, there exists a functional disparity among an application program, a communication library with a compression mechanism, and the network interface hardware.

To address this disparity, we propose in this paper a solid solution that resolves it by migrating the compression mechanism to a lower layer of a cluster computer, namely the link layer of the network interface. We call this technique the link layer compression.

The next section of this paper describes the characteristics of cluster computers and the conventional communication methods that use compression mechanisms. Section 3 shows the design and implementation of the link layer compression. Section 4 presents the experimental evaluation. Finally, Section 5 concludes this paper.

2 Backgrounds and definitions

2.1 Characteristics of inter-cluster communication

The network of a cluster computer connects the computing nodes with high performance network hardware such as Gbit Ethernet, Myrinet or Infiniband, gathering the computing nodes in a physically and electrically closed environment. That means the computing nodes of a cluster computer never access the outside, that is, the internet or the intranet of the laboratory. Therefore, the communication of a cluster computer performs the data exchange only within a physically closed environment.

According to this characteristic, vigorous research efforts have been made to address the potential communication overheads in cluster computers. We can categorize the types of solutions for achieving higher communication performance as follows:

• Network interface hardware

The physical medium of the network is changing from copper wire to optical fiber, and also from serial to parallel. For example, in the past ten years, the available bandwidth of Ethernet has changed from 10Mbps to 10Gbps. In addition, with the increase in the baseband clock speed of the network medium, the latency of the network has decreased from the order of milliseconds to the order of microseconds. This drastic performance shift is raising the popularity of cluster computing. Nowadays, we can use 3Gbps bandwidth with commercially available networks such as Myrinet or Infiniband as the network hardware for a cluster computer.

• Communication controls

Due to the very thick communication layers in a node of a cluster computer, the application program needs a long time to transfer data to the other nodes [18]. To bypass those layers, direct access from the application program in a node to its network interface hardware has been proposed. Moreover, the network hardware of a node exchanges communication data in the application program without a copy operation, which is called zero-copy communication. Nowadays, GM, PM and BIP implement these communication styles.

The solutions above have resolved potential bottlenecks with respect to the physical control of the communication in cluster computers. However, there should still be room to devise how data is transferred before an application program hands it to the network. Therefore, we need to consider another aspect: reducing the communication data in a cluster computer.

The communication data generated by an application program running in a cluster computer can be reduced by changing the algorithm or by compressing the communication data. For the algorithm changes, we must explore opportunities in the algorithm to make the computation in a node longer or to remove redundant communications. However, those changes need big modifications in the application code itself. Therefore, the users of cluster computers try to avoid this situation first.

On the other hand, communication data compression fits into any application program because the performance improvement comes not from the application contents but from the data communication itself. When a compression mechanism is supported in the communication method of a cluster computer, the application program can simply execute its computation while exchanging compressed data. We will focus on this attractive speedup technique for cluster computing applications.

2.2 Conventional compression techniques

Preceding research has proposed two types of compression mechanisms for the communication of cluster computing: (1) algorithms combined with compression and (2) libraries supporting compression.

The former method combines the algorithm with communication data compression. For example, TRLE [12] and libpglc [19] compress data before passing it to the communication method of the cluster computer, adding cooperating processes that compress the communication data. However, this strategy is coded only for those specific applications and cannot be migrated easily to other numerical computations.

On the other hand, the latter method compresses the communication data in the communication library. For example, MiMPI [3] and cMPI [11] compress the communication data inside the MPI [5] library. If an application program uses the MPI functions and if the network interface hardware is supported by the MPI library, it will receive great benefits from the communication data compression without any modification of the application program.

From the discussion above, we conclude that communication data compression is useful for reducing the amount of communication data. However, we need to be careful when using these methods, because the application program will have to be rewritten many times with more code involved, or it will face incompatibility with the interface of a communication library. The main aim of this paper is to address this disparity among an application program, a communication library with a compression mechanism, and the network interface hardware. Thus, in order to reduce the communication traffic with the smallest modification of the system while keeping compatibility with all application programs, we propose to compress communication data in the data link layer, as described in the following sections of this paper.


3 Link layer compression

3.1 Compression mechanism

The link layer of the inter-cluster network is in charge of the control of the physical medium. Generally, it is implemented as a device driver or embedded hardware. All the communications from application programs running in a cluster computer exchange messages via a communication library and/or communication protocol software. For example, an MPI-based application program sends and receives messages using the MPI library, which can communicate over TCP/IP. Finally, the link layer, that is, a device driver, transfers the message to and from the network hardware.

We propose to add a data compression mechanism to the link layer. The compression mechanism can be a software-based part of the driver or a part of the network hardware. We call this mechanism the link layer compression.

Figure 1 summarizes the mechanism of the link layer compression. The link layer compression compresses the communication data transmitted from the upper layer (Figure 1 (1)) and decompresses the data received from the network (Figure 1 (2)). In the transmission operation, it prepares two versions of the data: compressed data and uncompressed data (Figure 1 (3) and (4)). It sends the smaller of the two to the network, so when the compressed data is bigger than the uncompressed one, it sends the uncompressed data. In either case it sets a flag that indicates whether the data is compressed or not (Figure 1 (5)). On the receiver side, the flag is tested and the received data is decompressed if it is compressed (Figure 1 (6)).
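The transmit-side comparison and the receive-side flag test described above can be sketched as follows. This is only an illustration under stated assumptions: zlib from the Python standard library stands in for the compressor (the paper evaluates RLE, LZW and LZO instead), and the function names link_layer_send and link_layer_receive are hypothetical, not taken from the driver.

```python
import zlib
from typing import Tuple


def link_layer_send(payload: bytes) -> Tuple[bool, bytes]:
    """Return (compressed_flag, data_to_transmit) for one payload.

    zlib is a stand-in compressor chosen only for this sketch.
    """
    compressed = zlib.compress(payload)
    # Comparator: transmit whichever representation is smaller.
    if len(compressed) < len(payload):
        return True, compressed
    return False, payload


def link_layer_receive(compressed_flag: bool, data: bytes) -> bytes:
    """Flag tester: decompress only when the sender marked the data."""
    return zlib.decompress(data) if compressed_flag else data


if __name__ == "__main__":
    for sample in (bytes(1400), bytes(range(256))):
        flag, wire = link_layer_send(sample)
        assert link_layer_receive(flag, wire) == sample
        print(flag, len(sample), len(wire))
```

Because the receiver acts only on the flag, data that does not compress well is passed through unchanged and never grows on the wire.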

The merits of the link layer compression are: (1) the application program does not need to be modified just to use the compression mechanism, (2) the communication library does not need any special consideration for the compression mechanism and (3) the compression algorithm in the link layer can be changed easily because the compressing operation is an add-in function applied just before sending data to, or just after receiving data from, the network.

To implement the link layer compression mechanism, we can consider two styles. One is fully based on software. In this case, the mechanism is implemented in a device driver of the network hardware. While this implementation fits easily into recent cluster computers because it can simply be installed into the OS, it could become an overhead that interferes with the main computation of the application program in the cluster computer, because the compression operation needs the CPU power of the computing node. The other choice is a hardware-based implementation that includes DMA with compression. When the network hardware sends or receives a message, it accesses the main memory of the computing node. During this data transfer, the link layer compression mechanism compresses the data as part of the DMA operation. Thus, this implementation does not incur the compression overhead of the software-based one.

[Figure 1: block diagram. The sender side and the receiver side each consist of an application program, a communication library (ex. MPI library), a protocol stack (ex. TCP/IP) and the link layer, and are connected by the network physical medium. The sender-side link layer contains the compressor, the compressed data, the uncompressed data and the comparator; the receiver-side link layer contains the flag tester and the decompressor. The labels (1) to (6) mark the steps referenced in the text.]

Figure 1. Overview of the link layer compression mechanism

3.2 Implementation

In this paper, we focus on the software-based implementation to examine the performance of the link layer compression. As an implementation example, we choose gigabit Ethernet [17] because it is one of the widely accepted cluster-oriented networks and because we can easily change the bandwidth of the physical medium by switching the mode of the switching hub among 10BASE, 100BASE and 1000BASE. To control the Ethernet, the device driver sends and receives IP data between the IP layer and the NIC. Therefore, we implement the link layer compression in the driver.

We use an Intel 1000/MT gigabit Ethernet card, for which the source code of the Linux driver is provided, and add the link layer compression mechanism to that driver source code.

3.2.1 Transmission operation

For transmission, the layer above the link layer sends fragments via the socket buffer mechanism of the Linux kernel [10], generally called skb. The skb holds the information and data specific to the frame of the network hardware. In the Ethernet case, it includes the MAC frame.

The device driver receives the skb, compresses it, and compares the size after compression with the size before compression. Finally, the smaller buffer is transferred to the network hardware. We have implemented the compression function in the xmit_frame() function of the Intel driver.


(a) 425.raw (5760000bytes) (b) j.raw (5760000bytes) (c) r.raw (5760000bytes) (d) o.raw (3686400bytes)

Figure 2. 24bit full color photo images used in the experiments (the format is BMP without the bitmap header)

We also need to take care of the Ethernet frame when the driver decides to send the compressed frame. The Ethernet frame has a field for the frame type, called the Ethernet frame type field. We use this field to indicate whether the payload is compressed or not: an undefined type number (ex. 0x0860 for IP) specifies a compressed payload.
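As a rough user-space illustration of this frame-type marking (not the driver code itself), the sketch below builds a 14-byte Ethernet header and selects the type field according to whether the payload was compressed. The function name build_frame is hypothetical, zlib is only a stand-in compressor, and 0x0860 is the example value given in the text.

```python
import struct
import zlib

ETH_P_IP = 0x0800             # standard EtherType for IPv4
ETH_P_IP_COMPRESSED = 0x0860  # example value from the text: compressed IP payload
ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, frame type


def build_frame(dst_mac: bytes, src_mac: bytes, ip_packet: bytes) -> bytes:
    """Build an Ethernet frame whose type field records whether it is compressed."""
    compressed = zlib.compress(ip_packet)  # stand-in compressor for this sketch
    if len(compressed) < len(ip_packet):
        return ETH_HEADER.pack(dst_mac, src_mac, ETH_P_IP_COMPRESSED) + compressed
    # Compression did not help: send the original packet with the normal type.
    return ETH_HEADER.pack(dst_mac, src_mac, ETH_P_IP) + ip_packet
```

A node without the modified driver would see an unknown frame type for compressed frames, which is why the marking is only meaningful inside a closed cluster network where all nodes run the same driver.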

3.2.2 Receiving operation

The Ethernet frames are received into a ring buffer of the Intel driver. No software operation can be involved in the transfer of the frames from the NIC to these buffers, because the DMA hardware transfers the received frames directly into the buffers. After a frame is received, the receive_frame() function of the Intel driver takes the frame and passes it to the upper layer as an skb structure. To these operations we have added two more: (a) checking the Ethernet frame type and (b) decompressing the data. When the check of the Ethernet frame in an skb detects a number such as 0x0860, the receive_frame() function recognizes that the frame in the skb is a compressed IP packet, decompresses the frame into a temporary buffer, copies it back to the skb, and changes the frame type to the one for uncompressed data, such as 0x0800 for IP, and the data size to the size after decompression.
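The two added receive-side steps, checking the frame type and decompressing, can be sketched in user space as follows. The sketch assumes a zlib-compressed payload as a stand-in for the algorithms used in the paper, and the Python function receive_frame below only mirrors the role of the driver function of the same name; it is not the driver code.

```python
import struct
import zlib

ETH_P_IP = 0x0800             # standard EtherType for IPv4
ETH_P_IP_COMPRESSED = 0x0860  # example value from the text: compressed IP payload
ETH_HLEN = 14                 # destination MAC (6) + source MAC (6) + type (2)


def receive_frame(frame: bytes) -> bytes:
    """Return the frame as it should be handed to the upper layer."""
    (frame_type,) = struct.unpack_from("!H", frame, 12)
    if frame_type != ETH_P_IP_COMPRESSED:
        return frame  # (a) not marked as compressed: pass through untouched
    payload = zlib.decompress(frame[ETH_HLEN:])  # (b) decompress the payload
    # Restore the normal frame type so the upper layer sees an ordinary IP frame.
    return frame[:12] + struct.pack("!H", ETH_P_IP) + payload
```

Since unmarked frames are returned unchanged, frames sent without compression remain readable on the same network.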

Because the link layer compression uses this detection mechanism based on the Ethernet frame type, the device driver keeps compatibility with frames sent without compression. Therefore, the link layer compression mechanism will never disrupt the inter-cluster communication. Moreover, an application program that uses the network interface hardware is able to use the compression mechanism in the link layer without any modification of its code. Thus, the communication data in a cluster computer can be reduced effectively, and the potential performance of the cluster computer will be enhanced.

4 Experimental Evaluation

To evaluate the link layer compression, we will perform four experimental evaluations in the environment shown in Table 1:

Table 1. The experimental environment

Host PCs    4x Dual AthlonMP 1600, PC2100 DDR SDRAM 512MByte
OS          Linux 2.4.18-smp
NIC         Intel 1000/MT (64bit 66MHz PCI)
MPI         MPICH-1.2.6

[Figure 3: combined bar and line chart for 425.raw, j.raw, o.raw and r.raw. Bars: compressed size (Mbytes) for RLE, LZW and LZO. Lines: time for compression (msec) for RLE, LZW and LZO.]

Figure 3. Compression speed and ratio among three compression algorithms

1. Comparison of compression algorithms using photo images

We need to choose the compression algorithm carefully because it will degrade the overall performance if it cannot achieve both a short computation time and a high compression ratio. First, we compare the compression speeds and the compression ratios of three major compression algorithms (RLE, LZW and LZO), using the four photo images shown in Figure 2. RLE is run-length compression based on an 8bit data unit. LZW is the Lempel-Ziv-Welch compression algorithm used in the compress command of Unix. LZO is a modified LZ compression algorithm [14].

2. Static communication performance evaluation

To see the best performance of the link layer compression, we measure the throughput between two nodes by using netperf [16], transferring all-zero communication data and the four photos used in experiment (1). The result of the all-zero data transfer shows the best performance of the mechanism because the compression ratio is highest when all the bits in the communication data are only 0 or only 1. The amount of data transferred per experiment is restricted to 110MBytes by the command line options of netperf. Therefore, the same image data is transferred repeatedly.

3. Dynamic communication performance evaluation

We run the Himeno benchmark [6] and the CG and IS benchmarks from the NAS parallel benchmark suite [1], and compare the results with the performance without the link layer compression.

For the first experiment, we simply compress the image files with the three different compression algorithms and compare the elapsed times and the compression ratios. For the second and third experiments, we use the TX counter of the Ethernet device driver to obtain the total amount of data transmitted from all the computation nodes. Before each experiment, the counters are reset to zero. To see the effects of the link layer compression with SMP nodes, we use both processors in a computing node only for the results with 8 processors. The following experiments use a 1500byte MTU for the Ethernet frame. In addition, we use the word "original" to denote the result without the link layer compression.
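One way to observe a per-interface transmit counter on Linux is to parse the text of /proc/net/dev, as sketched below. This is an assumption about how such a counter could be read from user space. The paper does not say how the driver counter was accessed, and the sketch takes the difference of two snapshots instead of resetting the counter. The helper names tx_bytes and transmitted_between are hypothetical.

```python
def tx_bytes(proc_net_dev_text: str, iface: str) -> int:
    """Extract the transmitted-bytes counter of iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Eight receive columns precede the transmit byte counter.
            return int(fields[8])
    raise KeyError(iface)


def transmitted_between(before_text: str, after_text: str, iface: str) -> int:
    """Bytes transmitted on iface between two snapshots of /proc/net/dev."""
    return tx_bytes(after_text, iface) - tx_bytes(before_text, iface)


if __name__ == "__main__":
    # Reads the live counters; the interface name "eth0" is an assumption.
    with open("/proc/net/dev") as f:
        print(tx_bytes(f.read(), "eth0"))
```

Summing this difference over all computing nodes would give a total comparable to the transmitted sizes reported in the experiments.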

4.1 Comparison of compression algorithms

Figure 3 shows the compressing times and the compression ratios when the three compression algorithms are applied to the photo images. The horizontal axis shows the names of the photographs. The lines in the graph show the compressing times. The bars in the graph show the file sizes after compression.

According to the results, RLE is the fastest of the three algorithms. However, the compression ratio of RLE is the lowest of the three. On the other hand, LZW and LZO show much better compression ratios than RLE. Moreover, LZO is 2.6 to 2.8 times faster than LZW while its compression ratio is close to that of LZW. Therefore, from this result we can expect that communication data compression with LZO will achieve the best communication performance, owing to its short compressing time and its good compression ratio.
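To make the behaviour of RLE on an 8bit data unit concrete, the following is a minimal sketch of a byte-oriented run-length coder. The (run length, value) pair encoding with runs capped at 255 is an assumption made for illustration; the paper does not specify the exact RLE format used in the experiments.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (run_length, value) byte pairs, runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)
```

The encoder makes a single pass with no dictionary, which is consistent with RLE being the fastest of the three. With this pair format, data without repeated neighbouring bytes doubles in size, which illustrates why RLE gains little on photographic images and why the link layer falls back to the uncompressed data in that case.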

4.2 Static performance analysis

Figure 4 shows the throughputs measured by netperf with and without the link layer compression mechanism using the three compression algorithms. The transmitted data sets are the all-zero data and the photo images used in the previous experiment. The bars in the graphs show the throughputs for the original and for each of the three compression methods. The lines in the graphs show the total amount of transmitted data.

First, we measured the throughput of all-zero data transmission using 1000BASE Ethernet, as shown in Figure 4(A). However, the transfer speed of the network medium is faster than the compression calculation for the communication data. This confirms that the link layer compression must be implemented in hardware when the network medium provides 1Gbps or more. Therefore, we discuss the effects of the link layer compression using 100BASE Ethernet in the following evaluations.
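The trade-off between compression speed and link speed can be illustrated with a simplified model. It assumes that compression and transmission of a frame happen one after the other without overlap, so the time per unit of data is 1/C + r/B, where C is the compressor throughput, B is the link bandwidth and r is the ratio of compressed size to original size. The model, the function names and the numbers in the usage note are assumptions for illustration, not measurements from the experiments.

```python
def effective_throughput(link_mbps: float, compress_mbps: float, ratio: float) -> float:
    """Application-level throughput (Mbps) under the serial model.

    ratio is compressed size divided by original size (0 < ratio <= 1).
    """
    return 1.0 / (1.0 / compress_mbps + ratio / link_mbps)


def min_compress_speed(link_mbps: float, ratio: float) -> float:
    """Compressor throughput (Mbps) above which compression beats the raw link."""
    if ratio >= 1.0:
        return float("inf")  # no size reduction, so compression can never win
    return link_mbps / (1.0 - ratio)
```

For a hypothetical compressor running at 400 Mbps with a ratio of 0.5, the model gives about 133 Mbps on a 100 Mbps link, which is a gain, but only about 333 Mbps on a 1000 Mbps link, which is a loss. In this model the break-even compressor speed on the 1000 Mbps link is 2000 Mbps.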

Figure 4(B) shows the results using 100BASE Ethernet. RLE achieves the best throughput when netperf transmits all-zero data. The throughput is about six times as high as the peak performance of the physical medium. Although LZW achieves the best compression ratio in this experiment, the throughputs of LZW are much lower than those of the other compression algorithms due to its long compressing time. The compression ratio of LZO is lower than that of LZW. However, the throughputs of LZO are five times higher for the all-zero data transfer and 20-50% higher for the photo image transfers than those of the original.

4.3 Dynamic performance analysis

4.3.1 Himeno benchmark

The Himeno benchmark is a parallel benchmark test suite built around a kernel computation mainly used in incompressible Navier-Stokes solvers. It reports performance in MFLOPS (millions of floating point operations per second). The higher the MFLOPS value is, the higher the performance the system achieves. We use the middle size of the Himeno98 suite for this experiment. From this experiment, we can confirm the performance when the input data set consists of orderly, realistic floating point values. The Himeno98 suite uses the MPI interface for communication. Figure 5 shows the MFLOPS measured by the Himeno benchmark as bars and the total amount of transmitted data using the three compression algorithms as lines.

RLE cannot compress the communication data at all in this experiment, because runs of the same bits in the communication data are very short. Only in the execution with eight processors can RLE compress the data a little, and the overall performance is about 2% higher than that of the original.

Although LZW can reduce 75% of the communication data in the execution with eight processors, the overall performance is lower than that of the original, due to the large overhead of the compression operation.

LZO is able to compress the communication data about 3% more than LZW and, in addition, obtains up to 50% higher performance than the original when the number of processors is eight. Even though the original performance does not speed up when the number of processors increases from four to eight, the performance with LZO achieves a speedup of 10% or more.

[Figure 4: two combined bar and line charts, (A) and (B), for the all-zero data, 425.raw, o.raw, j.raw and r.raw. Bars: throughput (Mbps) for Original, RLE, LZW and LZO. Lines: transmitted size (Mbytes) for Original, RLE, LZW and LZO.]

Figure 4. Throughput and transmitted data size by netperf among three compression algorithms over 100BASE and 1000BASE Ethernet

4.3.2 NAS parallel benchmark

We used the CG and IS benchmarks in the NAS parallel benchmark suite. CG performs a parallel conjugate gradient method. IS performs a parallel integer sort. The NAS parallel benchmark suite reports performance in MOPS (millions of operations per second). The higher the MOPS value is, the better the performance the computer achieves. From these experiments, we can confirm the performance when the input data set consists of randomly generated integer or floating point values. In addition, for a cluster computer with 100Mbps Ethernet, it is very hard to achieve speedup when the number of processors increases. Therefore, we can observe how much the performance is improved by the link layer compression.

Figure 6(A) and Figure 6(B) show the results of the CG and IS benchmarks, respectively. The bars show the MOPS. The lines show the amount of data transmitted.

The result of CG shows that it is very hard for the three compression algorithms to compress the communication data, and thus the total amounts of transmitted data are almost the same as that of the original. The performance result shows that the performance does not degrade even if the compression algorithm cannot compress the communication data at all, provided that we use a fast compression algorithm such as RLE or LZO.

On the other hand, the performance of the IS benchmark does not scale linearly, because the communication data in the IS benchmark is exchanged in the same phase of the calculation and the size of each message is big. Thus, to achieve good performance for the IS benchmark, the cluster computer needs a network with as high a bandwidth as possible.

[Figure 5: combined bar and line chart over the number of processors (2, 4, 8). Bars: MFLOPS for Original, RLE, LZW and LZO. Lines: transmitted size (Mbytes) for Original, RLE, LZW and LZO.]

Figure 5. MFLOPS and transmitted data sizes by Himeno benchmark among three compression algorithms over 100BASE Ethernet

The graph shows that RLE is not able to compress the communication data at all. On the other hand, LZO and LZW are able to reduce up to 60% and 80%, respectively, of the original communication data when the number of processors is eight. Thus, the MOPS with LZO is higher than that of the original at every number of processors from two to eight.

4.4 Summary

Finally, let us summarize the experimental results presented above as follows:

1. Compatibility

We used MPI programs in the experimental evaluations. However, any communication method used by any application program that accesses the Ethernet NIC, such as NFS, HTTP, FTP or a UDP-based program, is able to use the compression mechanism because the mechanism is implemented in the device driver. Thus, the link layer compression mechanism can become a breakthrough for reducing the amount of communication data in a cluster computer.

[Figure 6: two combined bar and line charts over the number of processors (2, 4, 8), (A) CG and (B) IS. Bars: Mop/s for Original, RLE, LZW and LZO. Lines: transmitted size (Mbytes) for Original, RLE, LZW and LZO.]

Figure 6. Performance and transmitted data size by (A) CG and (B) IS among three compression algorithms over 100BASE Ethernet

2. Compression ratio

We compared the compression performance and ratio among RLE, LZW and LZO. The communication performance results using LZO showed the best balance between performance and compression ratio. Moreover, the static and dynamic performance comparisons showed that good performance was obtained when we used LZO for the link layer compression. However, we believe that LZO is just one of the compression algorithms that fit the applications above best. If the photo images are black and white, for instance, RLE might achieve better performance than LZO because of its compression speed.

3. Performance

The experiment with 1000BASE Ethernet did not achieve better performance than the original. The link layer compression based on software runs in the idle time while the application program is waiting for the completion of the communication, so we need to migrate the compression mechanism to the network hardware when we use a faster network. Considering this point, we performed the experimental evaluations with 100BASE. Regarding the static performance, the results showed almost the same relations as the compression algorithm comparison. As mentioned in 2, this might change when we use other data sets to be transferred. Regarding the dynamic performance, we showed an example that fits the link layer compression best, which was the Himeno benchmark. It achieved better performance than the original. Moreover, we showed an example in which the link layer compression was able to improve the performance, which was IS. It improved the potential network performance through the compression mechanism. However, CG was a bad example in which none of the communication data in the computing process could be compressed. In this kind of application, there may still be a chance to compress the data by trying other compression algorithms.

All in all, the link layer compression maintains very good compatibility with the communication methods of applications. Moreover, we confirmed that the link layer compression can improve the potential performance of the cluster computer when we choose the compression algorithm that best fits the target application program. Thus, we can conclude that the link layer compression is a remarkable method for enhancing the performance of cluster computers.

5 Conclusion

In this paper, we proposed the link layer compression, designed and implemented it for Gigabit Ethernet, and performed experimental evaluations based on a software implementation. According to the experimental evaluation, we confirmed its validity for reducing the inter-cluster communication and for improving the potential performance of a cluster computer, while maintaining the compatibility of communication interfaces for any application program.

However, when we use a faster network with the software-based link layer compression, it is very difficult to achieve higher performance than without the compression mechanism, because the compression calculation time is large relative to the communication speed. Therefore, as future work, we need to take on the challenge of embedding the link layer compression mechanism into the network hardware. Moreover, the compression algorithms used in this paper may not fit other application programs. Therefore, we need to carry out more experimental evaluations with other compression algorithms, and also add a function to select an algorithm for each application program.

References

[1] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. The NAS parallel benchmarks. The International Journal of Supercomputer Applications, 5(3):63–73, Fall 1991.

[2] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet – a gigabit-per-second local-area network. IEEE Micro, Vol.15, No.1:29–35, 1995.

[3] Alejandro Calderon, Felix Garcia, Jesus Carretero, Javier Fernandez, and Oscar Perez. New Techniques for Collective Communications in Clusters: A Case Study with MPI. In 2001 International Conference on Parallel Processing (ICPP ’01), pages 185–193, 2001.

[4] Adam Beguelin et al. PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[5] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1995. http://www-unix.mcs.anl.gov/mpich/.

[6] Ryutaro Himeno. Himeno benchmark. http://accc.riken.jp/HPC/HimenoBMT/.

[7] Yutaka Ishikawa Hiroshi Tezuka, Atsushi Hori and Mitsuhisa Sato. PM: An operating system coordinated high performance communication library. In High-Performance Computing and Networking, volume 1225 of Lecture Notes in Computer Science, pages 708–717, 1997.

[8] Maximilian Ibel, Klaus E. Schauser, Chris J. Scheiman, and Manfred Weis. High-Performance Cluster Computing Using SCI. In Proceedings of Hot Interconnects V, 1997.

[9] Infiniband Trade Association. InfiniBand Architecture Specification, Release 1.0, October 2000.

[10] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers, 3rd Edition. O’Reilly and Associates, February 2005.

[11] Jian Ke, Martin Burtscher, and Evan Speight. Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications. In Proceedings of ACM/IEEE SC 2004 Conference (SC’04), 2004.

[12] Chin-Feng Lin, Yeh-Ching Chung, and Don-Lin Yang. TRLE - an efficient data compression scheme for image composition of parallel volume rendering systems. In Proceedings of the First International Symposium on Cyber Worlds (CW’02), pages 499–507, 2002.

[13] Mark Baker. Cluster Computing White Paper, December 2000.

[14] Markus F.X.J. Oberhumer. LZO real-time data compression library. http://www.oberhumer.com/opensource/lzo/.

[15] Rajkumar Buyya ed. High Performance Cluster Computing: Systems and Architectures, volume 1. Prentice Hall, 1999.

[16] R.Jones. Netperf: a network performance monitoring tool. http://www.netperf.org/.

[17] Stephen Saunders. Gigabit Ethernet. McGraw-Hill, 1998.

[18] V. Karamcheti and A. Chien. Software overhead in messaging layers: Where does the time go? In Proceedings of the Sixth Symposium on Architectural Support of Programming Languages and Operating Systems (ASPLOS-VI), pages 51–60, 1994.

[19] Brian Wylie, Constantine Pavlakos, Vasily Lewis, and Ken Morel. Scalable Rendering on PC Clusters. In IEEE Computer Graphics and Applications, pages