Benchmarks and Comparisons of Performance for Data Intensive Research
Saad A. Alowayyed
August 23, 2012
MSc in High Performance Computing
The University of Edinburgh
Abstract
Rapid developments in technology and the increasing precision of captured data have led to a significant increase in the amount of data, known as big data. Big data is valuable because it is the outcome of efforts such as experiments or discoveries, so it must be interpreted to obtain information. It cannot be interpreted efficiently without effective methods for storing, retrieving, manipulating and analysing it. The modest performance of secondary storage hinders such methods; in other words, it exposes the I/O bottlenecks that prevent the I/O system from providing the CPU with the right amount of data. The right amount of I/O to provide to the CPU was analysed by Gene Amdahl, who derived a ratio widely known as the Amdahl number, which is one of the main criteria used in this project to decide whether or not a system is balanced.
Two different systems at the University of Edinburgh are examined: EDIM1 and Eddie. This project aims to decide whether or not EDIM1 and Eddie are balanced and, hence, suitable for data intensive computing. This is achieved by running a number of benchmarks for each component to determine whether the sustained bandwidths are of the same order of magnitude. Moreover, the Amdahl number is calculated theoretically and experimentally, using three benchmarks with different I/O patterns. Two of these benchmarks were developed during this project.
From the results, it was found that the two systems are suitable for data intensive computing and can provide a good platform for data intensive programs. The first machine, EDIM1, was designed to be similar to the Amdahl-balanced blades by attaching more disks to a low-power processor. This increases the I/O throughput considerably, so the I/O should keep pace with the CPU performance. The second machine, Eddie, is not built specifically for data intensive research, but it could reduce the gap between the I/O and the CPU in the field of general high performance computing.
Contents
Chapter 1 Introduction
Chapter 2 Background
2.1 Big Data
2.2 Data Intensive Computing
2.2.1 Balanced systems
2.2.2 Raw sequential I/O
2.2.3 Multi-scaling
2.3 Styles of Data Intensive Computing
2.3.1 Data-processing pipelines
2.3.2 Data warehouse
2.3.3 Data centres
2.4 Amdahl-balanced blades
2.5 Devices
2.5.1 Technical information
2.5.2 Machines’ theoretical values
2.5.3 Mythical system
Chapter 3 Benchmarks
3.1 Floating point operations per second
3.1.1 FLOPS
3.1.2 LINPACK
3.1.3 Simple CPU
3.2 Memory bandwidth
3.2.1 Stream
3.2.2 Simple memory
3.3 File system
3.3.1 IOZone
3.3.2 Simple I/O
3.4 Balanced systems
3.4.1 Amdahl synthetics benchmarks
3.4.1.1 Existing benchmark
3.4.1.2 Modified benchmarks
Chapter 4 Benchmarks Results and Analysis
4.1 CPU flops rate
4.1.1 EDIM1
4.1.2 Eddie
4.2 Memory flops rate
4.2.1 EDIM1
4.2.2 Eddie
4.3 Disk’s flops rate
4.3.1 EDIM1
4.3.2 Eddie
4.2 Testing
Chapter 5 Conclusions
Chapter 6 Project analysis
6.1 Project Post-mortem
6.1.1 Process
6.1.2 Materials and technology
6.1.3 What made the project successful
6.1.4 What needs more improvements
6.1.5 Lessons learned
6.2 Deviations from the work plan
6.3 Risks
Appendix A Benchmarks results
Appendix B Benchmarks compiling and running
List of Tables
Table 2-1: Theoretical (Datasheet) Values
Table 2-2: Machine theoretical values
Table 3-1: The eight kernels in the FLOPS benchmark
Table 3-2: The sizes of the array compared to the data used in LINPACK
Table 4-1: Benchmark results for the EDIM1 and Eddie CPUs
Table 4-2: Benchmark results for the EDIM1 and Eddie main memories
Table 4-3: Benchmark results for the EDIM1 and Eddie disks
Table 4-4: Summaries of test suites completed in the CUnit
Table 6-1: The Gantt Chart for the project
List of Figures
Figure 2-1: Illustration of the steps in the data-processing pipeline
Figure 2-2: Node Architecture in EDIM1
Figure 3-1: Illustration of Amdahl synthetics benchmarks
Figure 3-2: Different threads compositions’ throughputs and Amdahl numbers
Figure 3-3: Different real program patterns
Figure 3-4: Algorithm for real program I/O pattern benchmark
Figure 4-1: Summary of the EDIM1 CPU-memory flops rates
Figure 4-2: Summary of the Eddie CPU-memory flops rates
Figure 4-3: Throughput for different sets of storage
Figure 4-4: Amdahl numbers for different sets of storage
Figure 4-5: Illustration of Amdahl numbers among different buffer sizes
Figure 4-6: Performance of different sets of storage
Figure 4-7: Amdahl numbers for a combination of disks
Figure 4-8: Amdahl number for different I/O patterns
Figure 4-9: Performance for different I/O patterns
Acknowledgements
I am sincerely grateful to Dr Adam Carter; this dissertation would not have been possible without his guidance, encouragement and feedback.
I owe thanks to Mr Gareth Francis for helping me with problems related to the cluster. I am truly indebted to Dr Orlando Richards from ECDF for his help in exploring new information.
I would like to show my gratitude to my family, with special reference to my father, Dr Abdullah Alowayyed, for giving me their support without which none of this would have happened.
Chapter 1
Introduction
Recently, there has been a great increase in the demand for retrieving, processing, sorting and storing large amounts of raw data. It is a challenge to deal with this amount of data efficiently, especially given its value and the considerable efforts, such as experiments or discoveries, from which it results. Storage, on the other hand, has seen a remarkable increase in capacity, and costs have decreased significantly. However, storage performance is not in line with CPU performance because, in most cases, the CPU can process the data faster than the storage can retrieve or store it; this is widely known as the I/O bottleneck. The I/O bottleneck is a real difficulty for computer experts, programmers, developers and users. Even when the algorithm is fast and well defined, the I/O slows down the overall speed. The overall speed drops sharply as the amount of data or the number of I/O operations increases.
Data Intensive Computing (DIC) is concerned with reducing the I/O bottleneck and the performance gap between the CPU and the I/O by organising the system to store and retrieve data quickly enough to ensure that the CPU does not remain idle.
This project examines two systems in depth, EDIM1 and Eddie, to determine whether or not they are suitable for data intensive computing by analysing theoretical and experimental values. To make that decision, two main objectives were set. The first is to run a set of benchmarks for the components involved in the CPU-I/O model: the CPU, main memory, file system and disks.
The second objective is to calculate the Amdahl number, because this gives a precise indication of whether or not the system is balanced in terms of the throughput of the I/O system and the performance of the CPU. The Amdahl number is calculated theoretically for each system and, in addition, experimentally. In 2011, a student at EPCC wrote a benchmark to calculate the Amdahl number [1]. In this
benchmarks, should be in the same order of magnitude to maintain a similar achievable sustained CPU rate that is independent of the source of the data.
Chapter 2 begins with the elementary definitions of ‘big data’ and ‘data intensive computing’. Next, a review of the related literature is included to provide a wider picture of the project and of the data intensive computing field. The chapter ends with an overview of the two systems under study and the calculation of their theoretical values.
Chapter 3 describes the set of benchmarks used, the theory behind them and how their results are calculated. The chapter categorises the benchmarks by component, following the path from the CPU to the disks. It ends with a detailed description of the different Amdahl synthetic benchmarks developed during this project.
Chapter 4 presents the results obtained from the benchmarks described in the previous chapter, in addition to the theoretical numbers taken from Chapter 2. This chapter gathers the numbers obtained to develop some hypotheses and an appropriate conclusion that answers the research question discussed earlier.
In Chapter 5, a final set of conclusions is drawn to determine whether or not the machines are suitable for data intensive computing, and whether heavily I/O-bound programs will benefit from them. This chapter closes with recommendations for both systems.
Finally, in Chapter 6 the project is analysed, and deviations from the work plan and the risks identified during the project preparation period are clarified.
Appendix A presents the benchmark results and a sample of the test results. Appendix B shows how to obtain the benchmark sources, and how to compile and run them with different configurations. Moreover, one of the test functions is illustrated.
Chapter 2
Background
2.1 Big Data
Big Data is defined as large and fast-growing datasets. These datasets are captured from several resources, such as experiments and sensors. The technology is developing and becoming more precise, which produces more data. The storage, search, analysis and resultant availability of this Big Data is a problem. This issue
doubles every year [2] [3], and perhaps even more so, because of the imminence of the Exascale age. Currently, various sources, such as sensors, search engines, social networks and experiments, are creating 2.5 exabytes of data daily [4]. Two examples out of many are Google, which produces 20 petabytes of data per day, and the European Organization for Nuclear Research (CERN), whose ATLAS experiment produces a petabyte per second [5].
Clearly, there is great demand for dealing efficiently with this big data. Moreover, big data will generate more data movement than before. Unfortunately, present-day secondary storage may not keep up, and its modest performance may worsen the I/O bottlenecks. These bottlenecks lead to most of the time being spent moving data instead of processing it [6]. However, data intensive computing is now being introduced to increase the ability to analyse and understand the data where it is captured [1]. Moreover, [7] mentions that the increase in processing capabilities is another reason for big data’s growth.
2.2 Data Intensive Computing
Data-intensive computing, according to [8], is the fourth basic research paradigm, after experiment, theory and computer simulation, and is urgently required
The second reason involves saving the energy consumed by this great demand by using computational resources more effectively through building low-power systems [3]. These systems will be described later.
Finally, analysing a large, distributed amount of data is not an easy task. The ATLAS experiment, for example, processes 10 petabytes of data every year [9]. Three main issues [2] should be considered when DIC is discussed: balanced systems, multi-scaling and raw sequential I/O. The next subsections describe these terms in depth.
2.2.1 Balanced systems
Balanced systems can be defined as systems that move data at a reasonable speed to prevent the CPUs from experiencing delays. The performance of a system is limited by its slowest component, and I/O operations are usually the slowest, followed by the main memory and a poorly designed cache [6]. Thus, Amdahl set out a number of laws [10] that determine whether or not a system is balanced. These laws are:
• the Throughput Balance Law: A system needs one bit of I/O per second per instruction per second; this ratio is known as the Amdahl number. It can be calculated by dividing the bandwidth (how many bits the system can read per second) by the flops rate, which is the number of floating point operations per second. This number is the main concern of this project.
• the Memory Law: One byte of memory per one instruction/sec. This ratio provides an indication of whether or not there is sufficient memory to undertake the computation. The ratio is known as the Amdahl memory ratio.
• the I/O Law: This is the ratio of one I/O operation for every 50,000 instructions. Moreover, this ratio captures the I/O latency when there is less than one I/O operation per 50,000 instructions.
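The three laws above reduce to simple ratios. The following Python sketch is illustrative only: the function names and the example machine are assumptions made for this illustration, not part of the benchmarks used in this project, and the instruction rate is approximated by whatever CPU rate the caller supplies.

```python
def amdahl_number(io_bits_per_second, instructions_per_second):
    """Throughput Balance Law: bits of I/O per second per instruction per second."""
    return io_bits_per_second / instructions_per_second


def amdahl_memory_ratio(memory_bytes, instructions_per_second):
    """Memory Law: bytes of memory per instruction per second."""
    return memory_bytes / instructions_per_second


def amdahl_iops_ratio(io_operations_per_second, instructions_per_second,
                      instructions_per_io=50_000):
    """I/O Law: one I/O operation per 50,000 instructions gives a ratio of 1."""
    return io_operations_per_second * instructions_per_io / instructions_per_second


if __name__ == "__main__":
    # Hypothetical machine: 1e9 instructions/s, 125 MB/s of I/O,
    # 1e9 bytes of memory and 20,000 I/O operations per second.
    rate = 1e9
    print("Amdahl number:      ", amdahl_number(125e6 * 8, rate))
    print("Amdahl memory ratio:", amdahl_memory_ratio(1e9, rate))
    print("Amdahl IOPS ratio:  ", amdahl_iops_ratio(20_000, rate))
```

With these hypothetical figures all three ratios equal one, which is the balanced case described by the laws.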
Amdahl’s experience as a computer architect and his observations of real life applications led to these practical laws [11]. Although these laws are a few decades old now, the kind of computations being carried out over these decades has not changed in any fundamental way. Thus, these laws are still valid and are easily applied to current real-life applications.
In addition to the Throughput Balance Law, another indication that a system is balanced is that the sustained bandwidths of the CPU, memory and I/O are of the same order of magnitude.
2.2.2 Raw sequential I/O
When considering a large amount of data, several terabytes for example, the memory and cache should be taken out of the performance equation. The best results are achieved by using external schedulers such as in-drive scheduling, which schedules the command queue for maximum efficiency. Another such mechanism is on-board caching, a simple cache on modern disks that holds data so that it can be served without searching the disk itself [12]. On-board caching is effective in the case of sequential I/O. The performance of the Hard Disk Drive (HDD) has improved significantly over the years. For example, the first 500 GB hard drive [13], the Hitachi GST, had a transfer speed of 3 Gb/s; then, in 2011, Seagate produced the first 4.0 terabyte hard drive with data transfer boosted up to 80 Gb/s [14]. However, these improvements in size and data transfer are not sufficient when compared with modern developments in CPUs or the huge increase in data sizes. The problem of modest sequential read performance could be improved by an order of magnitude using Amdahl-balanced blades (discussed in Section 2.4).
2.2.3 Multi-scaling
To address the slowdown in I/O performance, one of two scaling approaches must be used: scaling-up or scaling-out. Scaling-up means adding more multi-processors and a high-performance disk array [3]. This method is expensive and gives low productivity, especially in the case of sequential I/O, but it has lower management and partitioning overheads [2]. Scaling-out means using lower-power processors and attaching more nodes, each node being attached to one or more disks [1]. Scaling-out is cheaper and increases the throughput, thus reducing the bottleneck. On the other hand, this method creates a massive overhead in the form of disk I/O management and data partitioning between the disks. [15] illustrates that scale-up is used in large symmetric multiprocessing (SMP) systems, while scale-out takes the form of clusters. A number of research outcomes, such as Gray’s law [16], have suggested that scale-out is the most suitable approach for data intensive computing. In this project, the two scaling approaches correspond to the differences between Eddie and EDIM1. In other words, EDIM1 is scale-out while Eddie is more scale-up.
2.3 Styles of Data Intensive Computing
There are three styles of Data Intensive Computing that have been developed over recent years [17]. First, data-processing pipelines initialise a pipeline of processes
2.3.1 Data-processing pipelines
To reach small and easy-to-interpret datasets, data-processing pipelines have three main steps. The first is High-throughput Capture; in this step, the data are captured from the source. After that, the researchers manipulate the data, for example by deleting unnecessary content and searching for an easy route for analysis. Finally, the data are stored and are ready for the next step.
The second step is Analytics; in this step, the data are interpreted using a specific, complicated algorithm. This step requires high-performance computers to obtain the results.
The final step is Understanding, which is the focus area of visual analytics. In this step, the researchers interpret the results produced by the previous step. The outcomes of the work might finally be visualised.
[Figure: pipeline stages High-throughput capture, Analytics and Understand; quantities shown are Data volume and Information density (scale labels Low and High) against Time.]
Figure 2-1: Illustration of the steps in the data-processing pipeline [17]
2.3.2 Data warehouse
The data warehouse can be considered as the main storage location of the data, or the main archive that provides the complete picture needed to make a decision [18]. Its main function is to store the data until researchers extract and interpret the information. The Sloan Digital Sky Survey (SDSS) [19] is one of the most important surveys in astronomy. All the images and information produced by the SDSS telescope are collected in the SDSS SkyServer. The SDSS SkyServer, as an example of a data warehouse, can host up to 40 terabytes of raw data, to be interpreted later by astronomers [20]. The SkyServer provides a public search tool called dr6, which allows the data to be searched after it has been interpreted.
The concept of the data warehouse is similar to “cloud computing”. Alexander Szalay [2] considers them similar because both cloud computing and the data warehouse provide access to services, both data and computation, in the same shared centralised facility. Szalay argued that a data warehouse design has more advantages than a grid-based one, because it allows many users to access the shared datasets. On the other hand, it has been shown that the popularity of remote grids is unclear [21].
2.3.3 Data centres
The newest data intensive computing style is inspired by the Internet and the concept of data distribution [17]. In this style, the data is stored in any distributed data centre, and a specific programming model, such as MapReduce, is used to process these large datasets. One example is the set of data centres supporting academic research that are provided by the National Science Foundation, Google and IBM. In 2008, they provided 1,600-processor clusters to give researchers access to distributed sources using Hadoop [22]. Another characteristic of data centres, as suggested by [2], is that because of the large amount of data, the computation should be brought to where the data is, not vice versa, thus saving transmission time. Avoiding I/O problems and possible data loss are the expected outcomes of this change. In addition, Jim Gray [16] shows that scientific computing would benefit from data centres because it is, in fact, becoming more data intensive.
2.4 Amdahl-balanced blades
The increase in data size yields a corresponding increase in power consumption [23]. Moreover, the main target of achieving a system with fewer I/O bottlenecks led [3] to look for a blade that is energy efficient and produces a high sequential read throughput. Sequential read is used because it is the most appropriate metric for I/O throughput, owing to its intensive usage and high throughput.
The power consumption problem can be improved by using energy-efficient CPUs [23], such as the Intel 1.6 GHz Atom N330 (used in EDIM1 [24] and GrayWulf [2]). Theoretically, and according to [3], the 1.6 GHz CPU is balanced with 32,000 I/O operations per second (1.6 × 10^9 instructions per second divided by 50,000 instructions per I/O operation), and this number is almost what the Solid State Drives (SSDs) used in GrayWulf can deliver. Another example of an SSD, the Crucial RealSSD C300 [25], which is the SSD used in EDIM1, can carry out up to 45,000 I/O operations per second. Another reason for choosing the SSD is that its access time is lower than that of the HDD. On the other hand, SSDs are much more expensive.
The Amdahl-balanced blade used in GrayWulf [3] consists of a dual-core Atom 330, and the blade is considered to be balanced by using two SSDs. EDIM1 uses an architecture similar to the Amdahl-balanced blades, but instead of a second SSD, three HDDs are used. Comparing the price of one of the SSDs used in GrayWulf with that of the three HDDs used in EDIM1, one finds that the disks in EDIM1 are cheaper in total by US$100.
2.5 Devices
In this project, two supercomputers will be examined. These systems are EDIM1 and Eddie. The following sections will first describe the technical information; and then, the theoretical machine values will be computed and demonstrated. Finally, a well-balanced mythical system will be described.
2.5.1 Technical information
2.5.1.1 EDIM1
Edinburgh Data Intensive Machine 1 (EDIM1) is a data-intensive experimental cluster, owned by EPCC and the School of Informatics at the University of Edinburgh, containing 120 compute nodes in three racks. Each compute node consists of a dual-core Intel Atom N330 CPU (1.6 GHz) on a Zotac mini-ITX motherboard, all with low power consumption. The flops rate is increased through the use of an NVIDIA ION GPU. Each node has 4 GB of shared DDR3 main memory, three 2 TB Hitachi Deskstar 7K3000 HDDs and one 256 GB RealSSD C300 SSD [24], as shown in Figure 2-2.
As noted in the specifications, each node is similar to the Amdahl-balanced blades described in Section 2.4. Table 2-1 summarises the technical information and clearly shows that the nodes do not reach the high flops rates of computational clusters such as Eddie, but are cheaper, more suitable for data intensive work and have lower power consumption.
2.5.1.2 Eddie
The Edinburgh Compute and Data Facility (ECDF) has a high performance cluster known as Eddie with 156 nodes, each consisting of two 8-core Intel Xeon E5645 CPUs (2.4 GHz). Each node comes with 24 GB of DDR3 memory and a 250 GB 7200 rpm SATA drive [26]. Moreover, Eddie is equipped with a special storage system that allows it to supply a massive amount of storage for the University of Edinburgh. The nodes are connected by 1 Gb Ethernet to 48-port switches, each of which connects back by 1 x 10 Gb Ethernet to another network switch. Eight storage servers connect to that switch (also at 10 Gb). The storage servers connect to the storage via 2 x 8 Gb fibre channel SAN connections [27].
Table 2-1: Theoretical (Datasheet) Values
                           EDIM1                    Eddie
CPU model                  Intel Atom N330          Intel Xeon E5645
CPU clock rate             1.6 GHz                  2.4 GHz
Cores / CPU                2                        8
CPUs / node                1                        2
Memory / node              4 GB DDR3                24 GB DDR3
Secondary storage / node   256 GB SSD + 6 TB HDD    250 GB HDD + storage system
2.5.2 Machines’ theoretical values
2.5.2.1 Theoretical flops rate
The theoretical flops rate for a node, widely known as the theoretical peak, can be calculated using the following formula [28]:
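\[
\text{Theoretical flops rate (GFLOP/s)} = \text{CPU speed} \times \text{number of CPU instructions per cycle} \times \text{number of cores per CPU} \times \text{number of CPUs per node} \tag{1}
\]
In EDIM1:
\[
\text{Theoretical flops rate} = 1.6 \times 2 \times 2 \times 1 = 6.4\ \text{GFLOP/s}
\]
In Eddie:
\[
\text{Theoretical flops rate} = 2.4 \times 2 \times 8 \times 2 = 76.8\ \text{GFLOP/s}
\]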
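The theoretical peak calculation can be reproduced with a few lines of Python. This is only a sketch: the function name and parameter names are chosen for this illustration, and the inputs are the values quoted in the text for each machine.

```python
def theoretical_peak_gflops(clock_ghz, instructions_per_cycle,
                            cores_per_cpu, cpus_per_node):
    """Theoretical peak of one node in GFLOP/s, following equation (1)."""
    return clock_ghz * instructions_per_cycle * cores_per_cpu * cpus_per_node


# Values quoted in the text for each machine.
edim1_peak = theoretical_peak_gflops(1.6, 2, 2, 1)
eddie_peak = theoretical_peak_gflops(2.4, 2, 8, 2)

if __name__ == "__main__":
    print(f"EDIM1: {edim1_peak:.1f} GFLOP/s")
    print(f"Eddie: {eddie_peak:.1f} GFLOP/s")
    print(f"Eddie / EDIM1: {eddie_peak / edim1_peak:.0f}x")
```

The last line makes the gap explicit: on paper, an Eddie node has twelve times the peak of an EDIM1 node.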
Unsurprisingly, this gap in the results between the two machines provides the first indicator that they are built for different purposes. In other words, EDIM1 is specific to data intensive research and provides a balance between the CPU and the disks, while Eddie is designed for general-purpose high-performance computing.
2.5.2.2 Theoretical Amdahl Memory Ratio
The Amdahl Memory Ratio indicates whether or not there is sufficient memory to undertake the computation. This indicator is based on the Memory Law discussed previously in Section 2.2.1. When the memory ratio is exactly one, there is theoretically exactly one byte of memory per instruction per second, which means the system’s performance would not decrease as a result of having to read from the disks several times. In other words, the main memory has plenty of space, so there is no need to read from the disk. The Amdahl Memory Ratio can be calculated as follows [1]:
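\[
\text{Amdahl Memory Ratio} = \frac{\text{Memory size per node (MB)}}{\text{Instruction rate (million instructions/s)}} \tag{2}
\]
While Eddie's value is:
\[
\text{Amdahl Memory Ratio} = \frac{24 \times 1024}{19.2 \times 1000} = 1.28
\]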
From the previous results, one can note that there is more than one byte of memory per instruction in both systems. That means both systems’ nodes offer more memory than Amdahl noted for a real-world program’s requirements.
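As a quick check of the arithmetic, the short Python sketch below recomputes the value printed for Eddie. The units are an assumption inferred from the factors 1024 and 1000 in that calculation (memory in MB, instruction rate in millions of instructions per second); the function name is chosen for this illustration only.

```python
def amdahl_memory_ratio(memory_mb, million_instructions_per_second):
    """Memory per instruction per second; the 'mega' prefixes cancel out."""
    return memory_mb / million_instructions_per_second


# Figures as they appear in the calculation for Eddie in the text.
eddie_memory_ratio = amdahl_memory_ratio(24 * 1024, 19.2 * 1000)

if __name__ == "__main__":
    print(f"Eddie Amdahl memory ratio: {eddie_memory_ratio:.2f}")
```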
2.5.2.3 Theoretical Amdahl IOPS ratio
The I/O, in addition to memory, produces a bottleneck, widely known as the I/O bottleneck. In HPC, the I/O bottleneck is painful and wastes the CPU's time and energy in waiting. This dilemma is exacerbated in DIC systems because of the massive amount of data and the number of I/O operations involved. Theoretically, the Amdahl IOPS ratio [29] indicates whether or not I/O bottlenecks are likely to occur. As with the Amdahl Memory Ratio, a value close to one means that I/O bottlenecks are unlikely to occur and shows that the system can do more I/O than calculations. This value is very important in DIC systems; reducing as many I/O bottlenecks as possible saves much time and power, which are criteria to be borne in mind when purchasing or building a system. To compute the theoretical Amdahl IOPS ratio, it is necessary to calculate the IOPS value before applying the following formula [1]:
\[
\text{Amdahl IOPS ratio} = \frac{\text{IOPS} \times 50000}{\text{flops rate (FLOP/s)}} \tag{3}
\]
To find the IOPS value for the HDD:
\[
\text{HDD IOPS} = \frac{1000}{\text{average seek time (ms)} + \text{average latency (ms)}} \tag{4}
\]
According to the technical specification for the RealSSD C300 [25], the random read IOPS is 60,000. Moreover, for each node, there are three Hitachi Deskstar 7K3000 HDDs [30] for which it is stated that the average seek time is 0.5 ms and the average latency is 4.16 ms. From these numbers we get:
\[
\text{HDD IOPS} = \frac{1000}{0.5 + 4.16} = 214.6
\]
\[
\text{SSD IOPS} = 60000
\]
Finally, the Amdahl IOPS ratio is:
\[
\text{Amdahl IOPS ratio} = \frac{60643.8 \times 50000}{6.4 \times 10^{9}} = 0.473
\]
For Eddie, there is only one 250 GB HDD per node with an average seek time of 2.0 ms and an average latency of 4.2 ms. Calculating the HDD IOPS as before, we get:
\[
\text{HDD IOPS} = \frac{1000}{2.0 + 4.2} = 161.29
\]
Moreover, the ECDF researchers have found a way to store a large amount of data using the storage architecture to retrieve data from two-tier storage. According to [27], the IOPS for Eddie’s file system is 652113:
\[
\text{Total node IOPS} = 652113 + 161.29 = 652274.29
\]
Then, the Amdahl IOPS ratio for a single node reading from all disks is:
\[
\text{Amdahl IOPS ratio} = \frac{652274.29 \times 50000}{76.8 \times 10^{9}} = 0.424
\]
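The IOPS calculations for both machines can be retraced with the Python sketch below. It is illustrative only: the function and variable names are assumptions, the inputs are the datasheet figures quoted in the text, and the EDIM1 total is taken to be three HDDs plus one SSD, as in the 60643.8 figure used there. Because the sketch does not round intermediate values, the last digit may differ slightly from the printed results.

```python
INSTRUCTIONS_PER_IO = 50_000  # from Amdahl's I/O Law


def hdd_iops(avg_seek_ms, avg_latency_ms):
    """IOPS of one hard disk from its average seek time and latency (in ms)."""
    return 1000.0 / (avg_seek_ms + avg_latency_ms)


def amdahl_iops_ratio(total_iops, flops_per_second):
    """I/O operations available per 50,000 instructions executed."""
    return total_iops * INSTRUCTIONS_PER_IO / flops_per_second


# EDIM1 node: three HDDs plus one SSD rated at 60,000 random-read IOPS.
edim1_total_iops = 3 * hdd_iops(0.5, 4.16) + 60_000
edim1_ratio = amdahl_iops_ratio(edim1_total_iops, 6.4e9)

# Eddie node: one local HDD plus the file system figure quoted from [27].
eddie_total_iops = 652_113 + hdd_iops(2.0, 4.2)
eddie_ratio = amdahl_iops_ratio(eddie_total_iops, 76.8e9)

if __name__ == "__main__":
    print(f"EDIM1: {edim1_total_iops:.1f} IOPS, ratio {edim1_ratio:.3f}")
    print(f"Eddie: {eddie_total_iops:.2f} IOPS, ratio {eddie_ratio:.3f}")
```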
To conclude, the Amdahl IOPS ratios for both systems are not very close to one, but they are similar to the average results noted in [10]. However, because Eddie relies on a special storage system to deal with parallel I/O operations, its Amdahl IOPS ratio is not a reliable indicator, and this ratio is negligible compared to the whole system’s I/O performance.
2.5.2.4 Theoretical Amdahl number
The most important number among Amdahl's numbers is the value that decides the balance between the I/O throughput and the CPU rate. The Throughput Balance Law (Section 2.2.1) states that, for a system to be considered balanced, it should perform exactly one bit of I/O per second per instruction per second. The theoretical Amdahl number is calculated as [1]:
\[
\text{Amdahl number} = \frac{\text{Total throughput (bits/s)}}{\text{CPU rate (FLOP/s)}} \tag{5}
\]
\[
\text{Maximum throughput} = (3 \times 162) + 265 = 751\ \text{MB/s}
\]
Finally, the theoretical Amdahl number is:
\[
\text{Amdahl number} = \frac{751 \times 8 \times 1024^{2}}{6.4 \times 10^{9}} = 0.98
\]
For Eddie, on the other hand, the ECDF researchers used a special I/O system using NAS and GPFS to retrieve data from two-tier storage (described in Section 2.5.1.2). According to [27], the throughput for Eddie is 3.72 GB/s:
\[
\text{Amdahl number} = \frac{3.72 \times 8 \times 1024^{3}}{76.8 \times 10^{9}} = 0.416
\]
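The two theoretical Amdahl numbers can be checked with the following Python sketch. The names are chosen for this illustration; the per-device throughputs (162 MB/s and 265 MB/s) are the figures appearing in the maximum throughput calculation, and reading them as three HDDs plus one SSD per EDIM1 node is an assumption based on the node description in Section 2.5.1.1.

```python
MIB = 1024 ** 2  # bytes per MB as used in the text
GIB = 1024 ** 3  # bytes per GB as used in the text


def amdahl_number(throughput_bytes_per_second, flops_per_second):
    """Bits of I/O per second per floating point operation per second."""
    return throughput_bytes_per_second * 8 / flops_per_second


# EDIM1 node: assumed to be three HDDs at 162 MB/s and one SSD at 265 MB/s.
edim1_throughput_mb = 3 * 162 + 265
edim1_amdahl = amdahl_number(edim1_throughput_mb * MIB, 6.4e9)

# Eddie: 3.72 GB/s file system throughput quoted from [27].
eddie_amdahl = amdahl_number(3.72 * GIB, 76.8e9)

if __name__ == "__main__":
    print(f"EDIM1: {edim1_throughput_mb} MB/s, Amdahl number {edim1_amdahl:.2f}")
    print(f"Eddie: Amdahl number {eddie_amdahl:.3f}")
```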
In theory, EDIM1 is almost perfectly balanced, because its Amdahl number is very close to 1.0. In the case of Eddie, this ratio is not accurate, owing to the use of a system that depends on multiple components, such as the network and fibre channels, that may have bottlenecks.
To compute this ratio in practice, [1] wrote a benchmark to determine the Amdahl number of a machine. Moreover, other synthetic benchmarks with different I/O patterns are written during this project.
To see whether the sustained bandwidths of the CPU, memory and I/O are of the same order of magnitude, simple benchmarks are also written during this project. To make the theoretical numbers clearer, Table 2-2 includes all of these values and compares the theoretical numbers of EDIM1 and Eddie side by side.
Table 2-2: Machine theoretical values
                               EDIM1           Eddie
FLOPS rate                     6.4 GFLOP/s     76.8 GFLOP/s
IOPS / node                    60643.8 IOPS    652274.29 IOPS
Disks peak throughput / node   0.733 GB/s      1.636 GB/s
Amdahl Memory ratio            1.28            1.28
Amdahl IOPS ratio              0.47            0.42
2.5.3 Mythical system
To make the Amdahl number easier to understand, a mythical system is assumed here to show the best balance and the highest number that a system could have. Because the system is mythical, a set of assumptions is made: the CPU sustains a flops rate equal to the theoretical peak, which in the case of the Atom 330 is 6.4 GFLOP/s, and the budget and space are unlimited.
To obtain the full benefit from the CPU, the I/O should be able to feed 6.4 GFLOP/s, which means the desired throughput is 51.2 GB/s.
\[
6.4\ \text{GFLOP/s} \times 8 = 51.2\ \text{GB/s}
\]
Thus, to achieve this throughput, the system needs approximately 321 HDDs or approximately 191 SSDs per node. Alternatively, if the ratio of 3 HDDs to 1 SSD used in EDIM1 is preferred, the system should have 210 HDDs and 70 SSDs per node. With this mythical I/O system the throughput would be 51.3 GB/s, which is what the system requires. Finally, the Amdahl number for this mythical system is:
\[
\text{Amdahl number} = \frac{51.3 \times 8 \times 1024^{3}}{6.4 \times 10^{9}} = 68.85
\]
This is an illustration of how data intensive computation could be, in theory. However, it is very unlikely to find an application that would benefit from this ratio; the reason being that Amdahl’s idea of what would make a balanced machine in practice is based on his observations of real-life applications, not theoretical numbers.
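The figures for the mythical system can be retraced with the Python sketch below. It is a rough illustration under stated assumptions: the per-device throughputs of 162 MB/s per HDD and 265 MB/s per SSD are taken from the EDIM1 maximum throughput calculation in Section 2.5.2.4, 1 GB is taken as 1024 MB, and all names are chosen for this example.

```python
HDD_MB_PER_S = 162  # assumed per-HDD throughput (Section 2.5.2.4)
SSD_MB_PER_S = 265  # assumed per-SSD throughput (Section 2.5.2.4)


def aggregate_throughput_gb(n_hdd, n_ssd):
    """Combined sequential throughput of the disks in GB/s (1 GB = 1024 MB)."""
    return (n_hdd * HDD_MB_PER_S + n_ssd * SSD_MB_PER_S) / 1024


def amdahl_number(throughput_gb_per_second, flops_per_second):
    """Bits of I/O per second per floating point operation per second."""
    return throughput_gb_per_second * 8 * 1024 ** 3 / flops_per_second


mythical_throughput = aggregate_throughput_gb(210, 70)
mythical_amdahl = amdahl_number(51.3, 6.4e9)  # 51.3 GB/s as quoted in the text

if __name__ == "__main__":
    print(f"210 HDDs + 70 SSDs: {mythical_throughput:.1f} GB/s")
    print(f"Amdahl number: {mythical_amdahl:.2f}")
```

The result is roughly seventy times the EDIM1 value of 0.98, which illustrates how far such a configuration would be from what the Throughput Balance Law asks for.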
Chapter 3
Benchmarks
3.1 Floating point operations per second
Measuring a computer's Floating Point Operations per Second (FLOP/s) to determine its overall performance is widely used nowadays. For instance, the TOP500 list is compiled using the flops rate from the LINPACK benchmark, and the list also contains the theoretical peak (Section 2.5.2.1) for each system [31]. Although the computer has other components, such as the I/O, memory, cache and communications, the flops rate is chosen as the main measurement criterion because floating-point operations are computationally intensive and are used a great deal in scientific computing [32].
The author noted that, because the data intensive field is relatively new, certain terms are used inconsistently by different authors. For instance, the term FLOPS could refer to the flops rate, the name of the FLOPS benchmark, the unit of floating point operations per second or the abbreviation for floating point operations. Maintaining consistency in the terminology is important to avoid confusion.
3.1.1 FLOPS
Before running the LINPACK stress test, a simple single-threaded C program was identified [33]. In this benchmark, there are eight computational kernels. Each kernel carries out several floating-point operations (Table 3-1).
Table 3-1: The eight kernels in the FLOPS benchmark [1]
Kernel  FADD  FSUB  FMUL  FDIV  TOTAL
1       7     0     6     1     14
2       3     2     1     1     7
3       6     2     9     0     17
4       7     0     8     0     15
5       13    0     15    1     29
6       13    0     16    0     29
7       3     3     3     3     12
8       13    0     17    0     30
The output from each kernel is its flops rate. The average over all the computational kernels is then calculated, which gives the performance of one core of one CPU in the machine. Finally, depending on the machine, the following formula is used to calculate the overall performance [1]:
$$\text{flops rate} = \text{number of cores in CPU} \times \text{number of CPUs in a node} \times \text{avg. number of flops from FLOPS benchmark} \qquad (6)$$
3.1.2 LINPACK
In 1979, Jack Dongarra introduced the LINPACK benchmark, which solves a dense system of linear equations by Gaussian elimination, using the LU decomposition with partial pivoting [34]. Because of the nature of the algorithm, the benchmark can be considered a stress test for the CPU, the on-chip caches and the main memory [35]. This provides a good estimate of how fast the computer could solve real problems. On the most powerful CPUs, the LINPACK flops rate is close to the theoretical flops rate [36] calculated in Section 2.5.2.1.
Unlike the FLOPS benchmark, the LINPACK benchmark uses all the CPUs and cores available in the system. Thus, there is no need to multiply by the number of CPUs and the number of cores to obtain the flops rate. The benchmark asks the user for the size of the array used in the Gaussian elimination; Table 3-2 [37] compares the size of the array with the data used. Moreover, the number of runs must be specified by the user. Finally, the output of this benchmark is the average FLOP/s over all runs.
Table 3-2: The sizes of the array compared to the data used in LINPACK [37]
Size of the array  Data used
7000               1 GByte
13700              2 GBytes
20000              4 GBytes
3.1.3 Simple CPU
To help in finding the system's main bottleneck, a simple set of benchmarks was written. These benchmarks, together with the established benchmarks described above, provide a clear picture of the machine's flops rate and of the possible bottlenecks, if any. These benchmarks (Sections 3.1.3, 3.2.2 and 3.3.2) are similar to other widely-used benchmarks, but they were written to keep each benchmark as simple as possible, so that the real number of FLOP/s observed is well defined.
This set consists of three simple benchmarks. The first is a simple CPU benchmark, which follows the same concept as the FLOPS benchmark. In this benchmark, a variable of type double is defined so that it can be held in a register or in the CPU cache. Then, the time to carry out N floating-point operations is measured. Finally, the number of FLOP/s is calculated by:
$$\text{CPU FLOP/s} = \frac{\text{Number of operations}}{\text{time}}$$
From the maximum FLOP/s value, it is possible to calculate the data rate from the memory that is needed to obtain the full benefit of the CPU rate. For a given computer architecture (a 64-bit processor in this example) and the maximum number of FLOP/s calculated above, this rate can be calculated as [1]:
$$\text{Data rate (GB/s)} = \frac{64 \times \text{maximum flops rate (FLOP/s)}}{8 \times 1024^3} \qquad (7)$$
The divisor is used to convert the bits to Gbytes. This is the rate at which the bits of data would have to arrive at the processor to ensure that it could maintain its full flops rate.
3.2 Memory bandwidth
The next main measurement is the memory bandwidth. The memory bandwidth determines how fast the memory can supply data to sustain the CPU's flops rate. Therefore, a fast CPU with a low memory bandwidth leads to a bottleneck. In the following benchmarks, cache effects are not taken into account, because the size of the problem is larger than the cache.
3.2.1 STREAM
STREAM is another stress test, but it examines the memory rather than the CPU, which was the target of the FLOPS and LINPACK benchmarks. The output of this benchmark is the actual data rate that the main memory provides, known as the memory bandwidth [38]. The STREAM benchmark was written by John McCalpin and Joe Zagar and uses four operations [39]:
1. Copy: a[i] = b[i]
2. Scale: a[i] = d * b[i]
3. Sum: a[i] = b[i] + c[i]
4. Triad: a[i] = b[i] + d * c[i]
As shown above, these four operations are applied to a number of vectors. These vectors should be much larger than the cache. [40] suggests that the size of the vector should be at least:
$$\text{Vector size} = \text{Cache size (bytes)} \times \text{Number of CPUs} \qquad (8)$$
In the case of more than one core in the CPU, the benchmark should be run on all the cores using the OpenMP version of the benchmark. The benchmark runs a number of times, with the best result being taken. As a result of the high reliance on the memory, the Triad kernel will be the benchmark used to quantify memory bandwidth [39]. To determine in practice whether or not the memory can deliver data at a speed that avoids the bottleneck, the balance ratio should be calculated — conceptually, it is similar to the Amdahl number. To calculate the balance, one must calculate how many FLOP/s the memory can deliver using the triad bandwidth resulting from the stream benchmark:
$$\text{Memory FLOP/s} = \frac{\text{Triad bandwidth (bytes/s)}}{8}$$
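The conversion from Triad bandwidth to a memory flops rate can be sketched as follows. This is a minimal illustration in pure Python, not the STREAM code itself: the function names are hypothetical, the vectors are far smaller than a real run would use, and interpreter overhead means the figures it prints say nothing about the hardware. It counts 24 bytes per element (two 8-byte reads and one 8-byte write), takes the best of several runs, and divides the bandwidth by 8.

```python
import time
from array import array

BYTES_PER_DOUBLE = 8


def triad(a, b, c, d):
    """Triad kernel: a[i] = b[i] + d * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + d * c[i]


def triad_bandwidth(n, repeats=3):
    """Return the best observed Triad bandwidth in bytes/s for n doubles."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, 3.0)
        best = min(best, time.perf_counter() - start)
    # Each element involves two 8-byte reads and one 8-byte write.
    bytes_moved = 3 * BYTES_PER_DOUBLE * n
    return bytes_moved / best


def memory_flops(bandwidth_bytes_per_s):
    """FLOP/s that the memory can feed, at 8 bytes per operand."""
    return bandwidth_bytes_per_s / BYTES_PER_DOUBLE


bandwidth = triad_bandwidth(200_000)
print(f"Triad bandwidth: {bandwidth / 1e6:.1f} MB/s")
print(f"Memory FLOP/s:   {memory_flops(bandwidth) / 1e6:.1f} MFLOP/s")
```

Taking the minimum time over the repeats mirrors the benchmark's practice of keeping the best result.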
3.2.2 Simple memory
The second simple benchmark is the memory benchmark. In this benchmark, a vector of doubles is defined with a size of four times the L2 cache [39]. A number of threads then add 1 to every element of the vector, the elapsed time is measured, and the following is computed:
$$\text{Memory FLOP/s} = \frac{\text{size of vector (doubles)} \times \text{number of threads}}{\text{time}}$$
The balance, referred to earlier as the Amdahl Memory Law, is calculated by the following formula [41]:
$$\text{Balance} = \frac{\text{CPU FLOP/s}}{\text{Memory FLOP/s}} \qquad (9)$$
The best result is when the balance is exactly 1.0; this shows that the CPU can retrieve the data from the memory without being idle.
3.3 File system
For data intensive applications, the I/O is valuable, because these applications are dealing with massive amounts of data retrieved from disks. Thus, these data intensive applications are more likely than others to expose I/O bottlenecks in a computer. For this reason, the I/O tests are important for determining the computers’ balances.
3.3.1 IOZone
The most popular and generally-used File System benchmark is the IOZone. Using a number of operations, multiple file/record sizes and secondary options, the benchmark determines the file system's performance, the CPU usage and the throughput [42] [43].
To ensure reliability, the entire test should be applied to files larger than the node's main memory [43]. Moreover, only the sequential read will be considered, because it has a high bandwidth and is widely used, and so best reflects the main throughput [1] [32].
3.3.2 Simple I/O
The third simple benchmark is a simple I/O benchmark. Basically, this benchmark calculates the aggregate throughput by reading a number of files sequentially using a number of threads. The number of threads should be equal to the number of files, because each thread reads a file. Also, the user sets the appropriate buffer size and the program calculates the throughput by:
$$\text{Disk throughput (bytes/s)} = \frac{\text{bytes read}}{\text{time}}$$
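The structure of such a benchmark can be sketched as below. This is an illustrative sketch only, not the benchmark written for this project: the function names are hypothetical, and the demo creates small temporary files, which will be served from the operating system's cache rather than from disk. A real measurement would use files larger than main memory, as noted for IOZone.

```python
import os
import tempfile
import threading
import time


def read_file(path, buffer_size, totals, index):
    """Read one file sequentially in buffer_size chunks; store the byte count."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(buffer_size)
            if not chunk:
                break
            count += len(chunk)
    totals[index] = count


def aggregate_throughput(paths, buffer_size):
    """Read each file in its own thread; return (bytes read, bytes per second)."""
    totals = [0] * len(paths)
    threads = [
        threading.Thread(target=read_file, args=(path, buffer_size, totals, i))
        for i, path in enumerate(paths)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    bytes_read = sum(totals)
    return bytes_read, bytes_read / elapsed


def demo(file_count=4, file_size=1 << 20, buffer_size=64 * 1024):
    """Create small temporary files (a demo assumption) and read them back."""
    with tempfile.TemporaryDirectory() as directory:
        paths = []
        for i in range(file_count):
            path = os.path.join(directory, f"input_{i}")
            with open(path, "wb") as handle:
                handle.write(os.urandom(file_size))
            paths.append(path)
        return aggregate_throughput(paths, buffer_size)


bytes_read, rate = demo()
print(f"Read {bytes_read} bytes at {rate / 1e6:.1f} MB/s")
```

The timer is started before the first thread and stopped after the last join, so the result is the aggregate rate over all files rather than a per-thread rate.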
3.4 Balanced systems
Finally, the concept of a balanced system is important in the data intensive computing field. A system is balanced when the computations and the I/O operations keep pace with each other, that is, when there are few I/O bottlenecks. There are a number of laws that define balance, one of which is the throughput balance law (Section 2.2.1). The I/O machine balance ratio, better known as the Amdahl number, states that a balanced system requires one bit of I/O per second per instruction per second (Section 2.5.2.4). In this report, this theoretical value will be measured using an existing benchmark and a set of modified benchmarks that represent both the real and theoretical I/O patterns.
3.4.1 Amdahl synthetics benchmarks
3.4.1.1 Existing benchmark
Written by Kulkarni [1] as part of a dissertation at EPCC in 2011, the existing benchmark mainly calculates the practical value of the Amdahl number using equation 5 (Section 2.5.2.4). Figure 3-1 shows the basic structure of the benchmark. The main idea is that it declares a set of buffers, equal in number to the threads, and two kinds of threads: I/O threads, which read from a file into a buffer, and compute threads, which carry out certain computations on the data read in. The number of I/O threads is the same as the number of files, while the number of compute threads is specified by the user.
Figure 3-1: Illustration of Amdahl synthetics benchmarks [1]
There are three states of the buffer: free, loading and ready. At the start of the program the buffers are free. Reading the data from the files to the buffers by the I/O threads would change the status of the buffer to loading. When the buffer is filled, the buffer’s status changes to ready for computation. On the other hand, the computational threads wait for the buffers to be ready and then carry out the appropriate computations — determined by the user — turning them to free again when finished.
In addition, the benchmark decides whether the system is compute-bound or I/O-bound by calculating the idle time of the compute threads and of the I/O threads. When the compute threads show a large idle time because they wait a long time for the buffers to become ready, the system is I/O-bound. More information about the algorithm used and the details on how to use the benchmark can be found in [1].
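The buffer life cycle and the idle-time comparison can be sketched as follows. This is not the benchmark from [1]; it is a minimal illustration under stated assumptions. The class and function names are hypothetical, the "read" is simulated by handing over an in-memory list, and a fourth COMPUTING state is added (it is not one of the three states described above) so that two compute threads cannot claim the same buffer.

```python
import threading
import time
from enum import Enum


class State(Enum):
    FREE = "free"            # empty; an I/O thread may fill it
    LOADING = "loading"      # being filled by an I/O thread
    READY = "ready"          # full; waiting for a compute thread
    COMPUTING = "computing"  # assumption: buffer claimed by a compute thread


class BufferPool:
    def __init__(self, count):
        self.state = [State.FREE] * count
        self.data = [None] * count
        self.cond = threading.Condition()
        self.io_finished = False

    def claim(self, wanted, new_state):
        """Wait for a buffer in state `wanted` and move it to `new_state`.

        Returns (index, seconds waited). The index is None once all I/O has
        finished and no READY buffer is left for a compute thread.
        """
        start = time.perf_counter()
        with self.cond:
            while True:
                for i, state in enumerate(self.state):
                    if state is wanted:
                        self.state[i] = new_state
                        return i, time.perf_counter() - start
                if wanted is State.READY and self.io_finished:
                    return None, time.perf_counter() - start
                self.cond.wait()

    def release(self, index, new_state):
        with self.cond:
            self.state[index] = new_state
            self.cond.notify_all()


def io_worker(pool, chunks, idle, k):
    for chunk in chunks:
        i, waited = pool.claim(State.FREE, State.LOADING)
        idle[k] += waited
        pool.data[i] = chunk  # stands in for reading from a file
        pool.release(i, State.READY)


def compute_worker(pool, sums, idle, k):
    while True:
        i, waited = pool.claim(State.READY, State.COMPUTING)
        idle[k] += waited
        if i is None:
            return
        sums[k] += sum(x + 1.0 for x in pool.data[i])
        pool.release(i, State.FREE)


def run(n_io=2, n_compute=2, chunks_per_io=20, chunk_len=1000):
    pool = BufferPool(n_io + n_compute)  # one buffer per thread
    io_idle, cpu_idle = [0.0] * n_io, [0.0] * n_compute
    sums = [0.0] * n_compute
    chunks = [[1.0] * chunk_len for _ in range(chunks_per_io)]
    io = [threading.Thread(target=io_worker, args=(pool, chunks, io_idle, k))
          for k in range(n_io)]
    cpu = [threading.Thread(target=compute_worker, args=(pool, sums, cpu_idle, k))
           for k in range(n_compute)]
    for thread in io + cpu:
        thread.start()
    for thread in io:
        thread.join()
    with pool.cond:  # tell the compute threads that no more data will arrive
        pool.io_finished = True
        pool.cond.notify_all()
    for thread in cpu:
        thread.join()
    bound = "I/O-bound" if sum(cpu_idle) > sum(io_idle) else "compute-bound"
    return sum(sums), sum(io_idle), sum(cpu_idle), bound


total, io_wait, cpu_wait, verdict = run()
print(f"sum={total}, I/O idle={io_wait:.4f}s, compute idle={cpu_wait:.4f}s, {verdict}")
```

Because the I/O here is only simulated, the verdict printed by this sketch reflects Python scheduling rather than any property of a storage system; only the state transitions and the idle-time bookkeeping are the point.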
3.4.1.2 Modified benchmarks
The next set of benchmarks is a modification of the original Amdahl benchmark in which the I/O patterns are changed. The aim is to determine the best number of threads, the best technique and the best buffer sizes for a real application to reach a good level of balance between I/O and computation.
3.4.1.2.1 Synthetic benchmark
The synthetic benchmark determines the composition of threads that gives the best Amdahl number (that is, the one closest to 1.0). This benchmark is based on theoretical calculations from the output of the simple I/O benchmark (Section 3.3.2) and the simple CPU benchmark (Section 3.1.3), using different numbers of threads. Here, the main calculation is that of the Amdahl number (Section 2.5.2.4). Figure 3-2 shows the theoretical calculations of the outputs from the simple CPU and simple I/O benchmarks using multiple threads. Four threads are used in Figure 3-2: although any number of threads could be run on a core, hyper-threading makes each core appear to the operating system as two virtual cores, and there are two cores in a node.
In Figure 3-2, (a) is useless, because no data can reach the application; (b) is also useless, because there are no computations; (c) is a waste of I/O because a typical application is unlikely to be able to sustain a throughput of 0.415 GB/s. Moreover, this scenario limits the number of FLOP/s that can be carried out. From the fourth possibility, (d), onwards, the scenarios look quite balanced. In (d), one could manage to achieve just over 1 FLOP on each byte read in, if the application can maintain a throughput of 0.302 GB/s. Finally, (e) and (f) are the most balanced, using one thread for the I/O operations and three compute threads. Using the HDD in (f) provides the lowest throughput among the scenarios that perform I/O and, as a consequence, the Amdahl number closest to 1.0. To conclude, the closer the Amdahl number is to 1.0, the greater the chance that all components will be put to good use by a typical code. On the other hand, a highly data intensive code benefits from a machine with an Amdahl number greater than one.
a. Four compute threads
Flop/s: 1.04 GFLOP/s Throughput: 0 GB/s Amdahl number: 0
b. Four I/O threads
Flop/s: 0 GFLOP/s Throughput: 0.615 GB/s Amdahl number: inf
c. One compute thread and 3 I/O threads
Flop/s: 0.186 GFLOP/s Throughput: 0.41 GB/s Amdahl number: 19.16
d. Two compute threads and 2 I/O threads
Flop/s: 0.39 GFLOP/s Throughput: 0.30 GB/s Amdahl number: 6.66
e. Three compute threads and 1 I/O thread (SSD)
Flop/s: 0.95 GFLOP/s Throughput: 0.25 GB/s Amdahl number: 2.24
f. Three compute threads and 1 I/O thread (HDD)
Flop/s: 0.955 GFLOP/s Throughput: 0.15 GB/s Amdahl number: 1.34
Figure 3-2: Different threads compositions’ throughputs and Amdahl numbers
A synthetic benchmark was written to turn the previous theoretical results into a form fit for a practical application. The benchmark defines two types of threads: I/O and compute. The number of I/O threads is the same as the number of files, while the number of compute threads is calculated by subtracting the number of I/O threads from the total number of threads requested. Each I/O thread has a corresponding buffer, whose size is entered by the user; the thread fills this buffer repeatedly until it has read the entire corresponding file. The compute threads carry out an operation (floating-point addition, for simplicity) the number of times specified by the user.
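The Amdahl numbers for the thread compositions in Figure 3-2 can be recomputed with a few lines of code. This is a sketch under stated assumptions: the function name is hypothetical, a GB is taken as 1024^3 bytes, and the throughputs for (c) and (d) use the three-digit values quoted in the text (0.415 and 0.302 GB/s). Because the printed inputs are rounded, the results agree with the figure only to within a few hundredths.

```python
BITS_PER_GB = 8 * 1024**3  # assumption: 1 GB = 1024^3 bytes


def amdahl_number(throughput_gb_s, gflops):
    """Bits of I/O per second divided by floating-point operations per second."""
    if gflops == 0:
        return float("inf")
    return throughput_gb_s * BITS_PER_GB / (gflops * 1e9)


# (label, GFLOP/s, throughput in GB/s), taken from Figure 3-2 and the text.
compositions = [
    ("a. 4 compute", 1.04, 0.0),
    ("b. 4 I/O", 0.0, 0.615),
    ("c. 1 compute, 3 I/O", 0.186, 0.415),
    ("d. 2 compute, 2 I/O", 0.39, 0.302),
    ("e. 3 compute, 1 I/O (SSD)", 0.95, 0.25),
    ("f. 3 compute, 1 I/O (HDD)", 0.955, 0.15),
]

for label, gflops, throughput in compositions:
    print(f"{label:28s} {amdahl_number(throughput, gflops):6.2f}")
```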
3.4.1.2.2 Real world I/O patterns benchmark
The real world I/O pattern benchmark is used to calculate the balance — using the Amdahl number — and the performance of the system by using I/O patterns similar to those found in real-world applications.
The I/O is the medium that connects the application with the real world. In traditional applications, I/O operations were used only at the beginning of the program, to read the data from a file, and at the end, to write the results to a file. However, the reliability of these programs decreased over time because of the increasing number of faults in processors. To tolerate these faults, it is necessary to write results at certain points before the end of the program; these points are widely known as checkpoints. The main role of a checkpoint is to allow the application to be restarted from a given point. A checkpoint also allows a restart if an interesting event happens in the simulation, possibly with different parameters, for purposes such as simulating from that point on in more detail or with more instrumentation. Other uses of intermediate output include data dumps for visualisation, partial results to allow computational steering, and an output of the system state at a fixed point in simulation time for later measurement of the dynamic properties.
Here, different techniques and their effects are measured to find the most balanced and highest-performing I/O patterns, which can then be recommended for use on the machine. These techniques are checkpoints and barriers. Figure 3-3 shows the different I/O patterns that are considered within the scope of this benchmark. In Figure 3-3, (a) uses both barriers and checkpoints, so each thread writes to its file twice and all threads write at the same times; (b) uses barriers but no checkpoints, so each thread writes to its file only once. The third possibility, (c), uses checkpoints but no barriers, so each thread writes twice, but the threads write at different times and, hence, finish at different times. Finally, (d) uses neither barriers nor checkpoints, which is the simplest of the scenarios.
a. With Checkpoint, with Barriers
b. Without Checkpoint, with Barrier
c. With Checkpoint, without Barrier
d. Without Checkpoint, without Barrier
Figure 3-3: Different real program patterns
All the patterns begin by reading from the number of files entered by the user. The number of files should be the same as the number of threads. Barriers (a and b) and a checkpoint (a and c) are the options that differentiate between the patterns in this benchmark. The user requests these two options by running the program with --f = 1 to add barriers and --c = 1 to add checkpoints; to leave an option out, replace 1 with 0. For example, to get a benchmark that includes a checkpoint but no barriers (scenario c), the user should choose --f = 0 and --c = 1.
The benchmark begins by reading a buffer's worth of data from a file and carrying out a set of operations on the buffer. Next, barriers or checkpoints are applied between the computations if the user has requested them. Finally, the output is written: each thread writes its own results to a separate file named output_[thread_id], where thread_id is the number of the thread, starting from zero. A set of results is then calculated, one of them being the Amdahl number. Figure 3-4 illustrates the algorithm of the real-world I/O pattern benchmark.
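The four patterns can be sketched with a single worker function and two switches. This is an illustrative sketch, not the project's benchmark: the function and parameter names are hypothetical, the computations are placeholders, the input files are small temporary files, and the placement of the barriers (immediately before each write) is an assumption about Figure 3-3.

```python
import os
import tempfile
import threading


def worker(thread_id, in_path, out_dir, barrier, use_checkpoint):
    with open(in_path, "rb") as handle:
        data = handle.read()                  # READ phase
    values = [byte + 1.0 for byte in data]    # first computation (placeholder)
    out_path = os.path.join(out_dir, f"output_{thread_id}")
    with open(out_path, "w") as out:
        if use_checkpoint:
            if barrier is not None:
                barrier.wait()                # all threads checkpoint together
            out.write(f"checkpoint {sum(values)}\n")
        values = [v * 2.0 for v in values]    # second computation (placeholder)
        if barrier is not None:
            barrier.wait()                    # all threads write results together
        out.write(f"result {sum(values)}\n")


def run_pattern(n_threads, use_barrier, use_checkpoint, file_size=4096):
    """Run one pattern; return the number of lines written to each output file."""
    with tempfile.TemporaryDirectory() as directory:
        barrier = threading.Barrier(n_threads) if use_barrier else None
        threads = []
        for i in range(n_threads):
            in_path = os.path.join(directory, f"input_{i}")
            with open(in_path, "wb") as handle:
                handle.write(bytes(file_size))  # one input file per thread
            threads.append(threading.Thread(
                target=worker,
                args=(i, in_path, directory, barrier, use_checkpoint)))
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        counts = []
        for i in range(n_threads):
            with open(os.path.join(directory, f"output_{i}")) as handle:
                counts.append(len(handle.readlines()))
        return counts


for name, use_barrier, use_checkpoint in [("a", True, True), ("b", True, False),
                                          ("c", False, True), ("d", False, False)]:
    print(name, run_pattern(4, use_barrier, use_checkpoint))
```

Patterns with a checkpoint write two lines per output file and patterns without write one, which matches the "writes twice" and "writes once" descriptions of the scenarios.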
In all of the benchmarks developed during this project, the precision of the time measurements was an important consideration. To achieve an acceptable precision, the time for an empty loop (compiled without optimisations) was subtracted from the time measured for a loop that carried out a number of floating-point operations. Moreover, cache effects were turned off by running the code with special flags. The volatile keyword was applied to the variable being measured to prevent the compiler from optimising it away. The simple benchmarks are run with a number of threads that depends on the machine, and for a number of repetitions determined by the user. Again, the best time is taken.
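The timing method (empty-loop subtraction and best-of-N) can be sketched as follows. This is an illustration of the idea only, with hypothetical function names; in Python the interpreter overhead dominates, so the rate printed is far below what compiled code achieves, and the volatile keyword and compiler flags mentioned above have no equivalent here.

```python
import time


def empty_loop(n):
    """Time a loop of n iterations that does no floating-point work."""
    start = time.perf_counter()
    for _ in range(n):
        pass
    return time.perf_counter() - start


def add_loop(n):
    """Time a loop of n iterations that performs one floating-point add each."""
    x = 0.0
    start = time.perf_counter()
    for _ in range(n):
        x += 1.0
    return time.perf_counter() - start


def best_time(loop, n, repeats):
    """Run the loop several times and keep the best (smallest) time."""
    return min(loop(n) for _ in range(repeats))


def flops_rate(n_ops, loop_time, empty_time):
    """FLOP/s after removing loop overhead; None if the difference is not positive."""
    net = loop_time - empty_time
    return n_ops / net if net > 0 else None


n = 500_000
rate = flops_rate(n, best_time(add_loop, n, 5), best_time(empty_loop, n, 5))
print("FLOP/s:", rate)
```

Returning None when the net time is not positive guards against timer noise on very short runs, where the empty loop can occasionally appear slower than the working loop.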
Chapter 4
Benchmarks Results and Analysis
4.1 CPU flops rate
There are four ways in this project to calculate the flops rate. The main theoretical indication of the flops rate is the theoretical peak, which is 6.4 GFLOP/s for EDIM1 and 76.8 GFLOP/s for Eddie. However, the benchmark results for the FLOP/s could not achieve these values. The ratio between these two theoretical peaks is 12x. This is because a single node in Eddie has two processors, while a node in EDIM1 has only one. Moreover, the number of cores per processor in Eddie is four times the number of cores in EDIM1.
Next, the results are presented for the two machines, on which three flop/s benchmarks were run. These benchmarks are FLOPS (Section 3.1.1), LINPACK (Section 3.1.2) and the simple CPU benchmark (Section 3.1.3).
4.1.1 EDIM1
From the results, it was observed that EDIM1 can achieve 1.305 GFLOP/s. This result is the output of the simple CPU FLOPS benchmark using four threads, while the FLOPS benchmark gave 1.062 GFLOP/s and the LINPACK value is 0.7851 GFLOP/s when using an array of four gigabytes.
4.1.2 Eddie
The LINPACK benchmark result in Eddie is very close to the theoretical peak of the machine. The LINPACK result is 65.95 GFLOP/s with an array size of 30,000 elements. Unlike EDIM1, the FLOPS and simple CPU benchmark show a lesser flops rate than the LINPACK. The flops rate from the FLOPS benchmark is 36.249 GFLOP/s and the simple CPU flops benchmark is 18.6 GFLOP/s.
Table 4-1: Benchmark results for the EDIM1 and Eddie CPUs
Benchmark         EDIM1 (GFLOP/s)  Eddie (GFLOP/s)
Theoretical peak  6.4              76.8
Simple CPU flops  1.305            18.6
FLOPS             1.062            36.249
LINPACK           0.7851           65.95
Using the best-observed result, the EDIM1 flops rate is 1.305 GFLOP/s, while the Eddie flops rate is 36.249 GFLOP/s. At this stage, it is hard to tell whether or not these results are good for data intensive computing. Comparing them with the resulting flop/s of the other components will make this easier to decide, and will also help to find the bottlenecks.
From the previous numbers, it is possible to calculate the maximum data rate from the memory that would be required to obtain the full benefit of the CPU rate described in Section 3.1.3 equation 7. The desired memory rate for both machines is:
$$\text{EDIM1 desired data rate} = \frac{64 \times 1.305 \times 10^9}{8 \times 1024^3} = 9.72\ \text{GB/s}$$
$$\text{Eddie desired data rate} = \frac{64 \times 36.249 \times 10^9}{8 \times 1024^3} = 270\ \text{GB/s}$$
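These two values can be checked with a short calculation based on equation 7. The function name below is hypothetical; the 64-bit word size and the 1024^3 divisor are those stated in Section 3.1.3, and the inputs are the best-observed flops rates quoted above.

```python
def desired_data_rate_gb(flops_per_s, word_bits=64):
    """Data rate in GB/s needed to deliver one word per floating-point operation."""
    return word_bits * flops_per_s / (8 * 1024**3)


edim1 = desired_data_rate_gb(1.305e9)   # best-observed EDIM1 flops rate
eddie = desired_data_rate_gb(36.249e9)  # best-observed Eddie flops rate

print(f"EDIM1: {edim1:.2f} GB/s")
print(f"Eddie: {eddie:.1f} GB/s")
```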
4.2 Memory flops rate
Similar to the CPU's flops rate, the memory's flops rate was measured using two benchmarks: STREAM (Section 3.2.1) and the simple memory benchmark (Section 3.2.2). The desired memory data rate is 9.72 GB/s for EDIM1 and 270 GB/s for Eddie. These values are equivalent to 1.21 GFLOP/s and 33.75 GFLOP/s, respectively. These desired rates are not attainable, but they provide the scale against which the memory's performance is judged. In other words, the closer the rate at which the memory delivers data is to the desired rate, the better. In the following sections, the bandwidth (bytes/s) is converted to a flops rate (FLOP/s). This unifies the units between the components and makes the results clearer and easier to interpret.
4.2.1 EDIM1
The results of the simple memory benchmark show that the system could deliver data at a speed of 2.7 GB/s, which is equal to 0.3408 GFLOP/s. On the other hand, the STREAM shows that the data rate is 2.57 GB/s, equivalent to 0.321 GFLOP/s.
4.2.2 Eddie
As a result of the high CPU flops rate, the required memory bandwidth should be high, if it is desirable to consider the machine as a memory-balanced machine. The STREAM result shows that the memory bandwidth is 49.5 GB/s. This rate is equal to 6.2 GFLOP/s. On the other hand, the simple memory benchmark provided the data rate of 69.81 GB/s, which is equivalent to 8.29 GFLOP/s. These rates are less than 25% of the desired memory bandwidth. This may occur because the two processors in a single node are sharing the same memory, which reduces by half the reserved bandwidth for each processor.
Table 4-2: Benchmark results for the EDIM1 and Eddie main memories
Benchmark             EDIM1 (GFLOP/s)  Eddie (GFLOP/s)
Desired Memory flops  1.21             33.75
Simple Memory flops   0.340            8.29
STREAM                0.321            6.2
Figure 4-1: Summary of the EDIM1 CPU-memory flops rates (bar chart; x-axis: benchmarks; y-axis: FLOPS rate in GFLOP/s)
Figure 4-2: Summary of the Eddie CPU-memory flops rates
Taking the best observed memory flop/s results from both systems and knowing the best observed CPU flop/s results (Table 4-1 and Table 4-2), it is straightforward to compute the memory balance of the system. Using the formula in Section 3.2.2 equation 9, one can find that:
$$\text{EDIM1 balance ratio} = \frac{1.305}{0.3408} = 3.82$$
$$\text{Eddie balance ratio} = \frac{36.24}{8.29} = 4.37$$
Therefore, from the previous calculations, one can find that both memories can handle the speed of the CPU better than the results noted in [44]. This could be the first indicator that both of these systems might be a good platform for the data intensive programs.
It was noticeable from the results that the Atom performs poorly in LINPACK (LINPACK on EDIM1 achieved about 12.3% of the theoretical peak, that is, 0.7851 out of 6.4 GFLOP/s). The Atom processors are therefore not well suited to matrix computations, because such computations depend on both the CPU and the main memory, and both perform poorly in EDIM1. In contrast, the Xeon is very good in LINPACK: its LINPACK result reaches 85.8% of the theoretical peak. For this reason, the Xeon is found among the processors in the TOP500 list, while the Atom is not.
4.3 Disk’s flops rate
As shown in Section 2.5.2.4, the total theoretical throughput per node was 0.73 GB/s for EDIM1 and 1.636 GB/s for Eddie. In this section, these numbers are measured using the IOZone benchmark (Section 3.3.1) and the simple I/O benchmark (Section 3.3.2). A comparison of the measured throughput with the theoretical numbers then determines the ability of each system and whether or not it is suitable for data intensive needs. The IOZone result for Eddie was taken from [27], while for EDIM1 the throughput was measured using both the IOZone and simple I/O benchmarks.
4.3.1 EDIM1
The IOZone benchmark was run on EDIM1 on a file 2 GB in size and with a 4 MB record size. The result shows an aggregate throughput of 0.562 GB/s, while the simple I/O benchmark shows an aggregate throughput of 0.620 GB/s. The experimental throughput therefore reaches up to about 85% of the theoretical value. This indicates that all the components of the I/O subsystem (disks, disk controllers, SATA connections and the memory) are performing well.
4.3.2 Eddie
Using the same configuration as EDIM1 (a 2 GB file and a 4 MB record size), the IOZone test shows that the file system has a transfer rate of 1.636 GB/s. Unlike EDIM1's results, this result did not come close to the expected level: it reaches only half (50%) of the theoretical throughput. This may be because bottlenecks occur in one or more components, such as the network, switches, storage servers or fibre channels.
Table 4-3 illustrates the GFLOP/s calculated from the sustained bandwidth taken from the different benchmarks in addition to those calculated from the theoretical throughput. Considering the best-observed result among the benchmarks, the EDIM1 disks can achieve 0.0775 GFLOP/s, while Eddie storage system can obtain 0.204 GFLOP/s.
Table 4-3: Benchmark results for the EDIM1 and Eddie disks
Benchmark EDIM1 Eddie
At this stage, it is possible to determine the Amdahl number of each machine using the simple I/O and simple CPU benchmarks. Using these outputs, the experimental Amdahl number was obtained: 4.08 for EDIM1 and 0.387 for Eddie. These results show that EDIM1 provides about 4 bits per second for each floating-point operation per second. This rate is greater than the 1 bit/s per flop/s that Amdahl noted for a real application. Eddie, on the other hand, shows a lower balance ratio of about 1 bit/s per 2.5 flop/s. However, this ratio is better than most of the results in [10] for clusters that are considered to be for data intensive applications. Because of the relatively low Amdahl number for Eddie, and to reach a good practical balance level, a local SSD could be added alongside the HDD in each node; this would increase the balance ratio to 0.45, which is near the best number (0.5 for Graywulf) in [10]. However, this is more expensive in terms of purchasing, maintenance and power consumption. A proposed solution is to attach SSDs only to the nodes that require high I/O usage rates, rather than to every node.
On the other hand, the CPU rate in EDIM1 is lower in the experiments than in theory, which makes EDIM1 unbalanced. This low rate is due to the Atom processor, which is a low-voltage processor with correspondingly lower performance. For example, the performance obtained from the Atom processors is half the performance obtained from the Pentium at the same clock rate [45]. If power consumption is not a concern, using a high performance processor, or even two Atom processors in the same node, would bring the system closer to practical balance by reducing the ratio of 4 bits/s per flop/s. To quantify this theoretically, adding another Atom processor to the node would increase the flops rate to 2.6 GFLOP/s. This means the Amdahl number would decrease to 2.04, which is half of the original ratio. However, the new Amdahl number is still greater than 1, which might benefit an application that is highly data intensive, i.e. one that does more I/O than Amdahl observed in a typical application.
The other assumption is to place the CPUs used in Eddie's node into EDIM1's node, in which case the Amdahl number would be 0.28. This ratio shows that using Eddie's processors in EDIM1 is not a good idea. It is also possible to infer that the storage system used in Eddie plays a better role in obtaining a suitable level of balance without assigning a specific set of disks to each node. Next, a set of Amdahl synthetics benchmarks with different I/O patterns on EDIM1 will be used to find the best way to define algorithms that attain a good level of balance when using the machine.
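The what-if scenarios for EDIM1 can be reproduced with a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, a GB is taken as 1024^3 bytes, the I/O throughput is the 0.620 GB/s measured by the simple I/O benchmark, and the Eddie CPU rate used is the simple CPU result of 18.6 GFLOP/s, which is the value that reproduces the quoted 0.28.

```python
BITS_PER_GB = 8 * 1024**3  # assumption: 1 GB = 1024^3 bytes


def amdahl_number(throughput_gb_s, gflops):
    """Bits of I/O per second per floating-point operation per second."""
    return throughput_gb_s * BITS_PER_GB / (gflops * 1e9)


EDIM1_THROUGHPUT = 0.620   # GB/s, simple I/O benchmark on EDIM1
ATOM_GFLOPS = 1.305        # simple CPU benchmark on EDIM1
EDDIE_GFLOPS = 18.6        # simple CPU benchmark on Eddie (assumed input)

scenarios = {
    "EDIM1 as measured": amdahl_number(EDIM1_THROUGHPUT, ATOM_GFLOPS),
    "EDIM1 with two Atom processors": amdahl_number(EDIM1_THROUGHPUT, 2 * ATOM_GFLOPS),
    "EDIM1 with Eddie's CPUs": amdahl_number(EDIM1_THROUGHPUT, EDDIE_GFLOPS),
}

for name, value in scenarios.items():
    print(f"{name:32s} {value:.2f}")
```

Doubling the flops rate while keeping the throughput fixed halves the Amdahl number, which is why the second scenario comes out at half of the measured value.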
4.4 Amdahl synthetics benchmarks
As explained in Section 3.4.1, there are three types of benchmarks: the existing benchmark, the synthetic benchmark and the real world I/O patterns benchmark. Starting with the existing benchmark, which overlaps the computation and the I/O, Figure 4-3 shows the aggregate throughput for different sets of storage when reading a 16 GB file from each disk with a 16 M buffer for each I/O thread. This figure shows that the throughput scales reasonably as more storage is added. For example, the throughput of two HDDs is almost double the throughput of one HDD. That means the disk controller is handling the increase in the number of disks successfully, and no bottlenecks are occurring. Furthermore, it may be possible to add more disks without problems. However, the system would not benefit from this addition, because it is already imbalanced in practice in favour of the I/O, and adding more disks would exacerbate this. The figure also shows that using an SSD increases the throughput considerably.
Figure 4-3: Throughput for different sets of storage
Figure 4-4 illustrates the Amdahl number for each set of storage. In Section 2.5.2.4, the theoretical Amdahl number of EDIM1 was found to be 0.832, which can be reached by using two HDDs. This means that two HDDs give the best balance for programs that overlap I/O and computation.
Figure 4-4: Amdahl numbers for different sets of storage
To reach a good balance level, the selection of the buffer size is critical. The buffer controls the delay that occurs when reading the file from the disk. If the buffer is too small (1 Kbyte), there are overheads from reading the data from the disk many times. If the buffer is too large (512 Mbyte), copying the data from the disk to the buffer and then to the main memory consumes time. Figure 4-5 shows that reading from the disk many times takes more time than copying from the disk to the memory through the buffer. This may occur because the time to seek and retrieve the data is much greater than the time to copy it. Moreover, the figure shows that the most balanced situation is achieved when choosing an 8 to 16 Kbyte buffer size for the I/O operations.
[Figure 4-4 plot: Amdahl number (y-axis) for each storage set (x-axis): SSD, 2 HDD, SSD + HDD, 3 HDD, SSD + 2 HDD, SSD + 3 HDD]
Figure 4-5: Illustration of Amdahl numbers among different buffer sizes (x-axis: buffer size, 1 K to 512 M; y-axis: Amdahl number; series: HDD, SSD, 2 HDD, HDD+SSD)
Finally, using one SSD and three HDDs achieves the best performance. As shown in Figure 4-6, it takes 12.89 seconds on average per disk to carry out a single floating-point operation on each value in a 16 Gbyte file. This means it is capable of achieving $2.1 \times 10^9$ floating-point operations in 12.89 seconds. A simple calculation gives the maximum performance that could be obtained as 0.16 GFLOP/s. Using all the disks in the node produces the best performance, because there is less CPU delay owing to the high aggregate throughput.