Bottleneck Detection in Parallel File Systems with Trace-Based Performance Monitoring

(1)

with Trace-Based Performance Monitoring

Julian M. Kunkel and Thomas Ludwig

Ruprecht-Karls-Universit¨at Heidelberg Im Neuenheimer Feld 348, 69120 Heidelberg, Germany

[email protected] http://pvs.informatik.uni-heidelberg.de/

Abstract. Today we recognize a high demand for powerful storage. In

industry this issue is tackled either with large storage area networks, or by deploying parallel ﬁle systems on top of RAID systems or on smaller storage networks. The bigger the system gets the more important is the ability to analyze the performance and to identify bottlenecks in the architecture and the applications.

We extended the performance monitor available in the parallel file system PVFS2 by including statistics of the server process and informa-tion of the system. Performance monitor data is available during run-time and the server process was modified to store this data in off-line traces suitable for post-mortem analysis. These values can be used to de-tect bottlenecks in the system. Some measured results demonstrate how these help to identify bottlenecks and may assists to rank the servers depending on their capabilities.

1 Introduction

Deployed storage systems continuously increase in size: Last year the US De-partment of Defense linked existing Storage Area Networks (SANs) into a single 17,000-port Meta SAN [1]. Also GPFS [2] and Lustre [3] are deployed over thou-sands of nodes. High-availability is mandatory for large setups, fail-over mecha-nisms must be incorporated to deal with defective hardware. Administration of such an environment requires sophisticated tools at least capable to monitor the state of each particular component to detect failures. For cost-eﬀectiveness oper-ators of such a system expect a high utilization of the provided infrastructure. In case the performance of a deployed system stays behind expectations, an on-line analysis of the system is necessary to detect the bottleneck. There are many reasons for performance degraded hardware. In case the components consist of inhomogeneous hardware a setup is likely to result in load-imbalance among the components. The common way to increase cost-eﬀectiveness in big setups is to deploy a hierarchical storage management (HSM). Also, already existing hard-ware might be combined to improve throughput leading to an inhomogeneous environment.

The performance of disks and RAID systems varies depending on the observed access patterns. Ongoing RAID-rebuilds degrade the capabilities of a server.

E. Luque, T. Margalef, and D. Ben´ıtez (Eds.): Euro-Par 2008, LNCS 5168, pp. 212–221, 2008. c

(2)

A server with more main memory could cache more data leading to performance variation. The question which arises is: how do we determine the ineﬃcient subsystem or component that causes the performance degradation?

To tackle this issue most storage systems implement a performance monitor which measures various metrics like throughput of disk and network. Unfor-tunately, these metrics are not always suitable to determine if this particular component is the bottleneck or if another component involved in the I/O path causes this problem. While it is easy to determine the recent throughput of a component, it is no easy task to assess the observed performance or even to determine the bottleneck of the current workload. For instance, one component might be twice as fast as another, but the measured performance is at the same level. This paper is structured as follows: In section 2 related work shows an excerpt of the state of monitoring concepts in SANs and parallel file systems. Available and potential useful performance statistics are discussed in section 3. The implemented solution for an extended performance monitoring with PVFS is presented in section 4. Results obtained by a set of experiments with differ-ent levels of access in MPI are assessed in section 5. The concept and potdiffer-ential future work is discussed in section 5.

2 State-of-the-Art and Related Work

It is common for network, distributed or parallel and cluster file systems to in-clude a performance monitor. On Linux machines NFS provides access statistics via the /proc interface on client and server nodes. The statistics include the total number of observed data- or metadata operations of a given type on this partic-ular machine since startup, e.g. the number of write operations, lookups, created directories. In GPFS a monitoring of client I/O performance is possible with a separate command line application (mmpmon, see [4]). The performance statistics provided are quite similar to NFS, but data is monitored on the file system level. In addition, the amount of data read or written is available. Furthermore, the number of requests within a byte range, which took a specific amount of time is maintained in configurable histograms. These histograms contain up to 16 byte ranges and up to 16 latency ranges.

In Lustre more information is available via the /proc interface [5]. Lustre is capable to track information per process on the client nodes. This includes statistics about non-sequential accesses as well as total numbers of accesses de-pending on the amount of data accessed in a histogram, i.e. it is possible to determine that a process has written an amount of data between 0 and 4 KBytes about 10 times. Lustre also monitors the usage of each Object Storage Target (i.e. storage node or SAN disk) including the number of pending and currently active operations. It is also possible to get more information, for instance by enabling the debugging output. See [6] for more information. Recently a new project was started, which collects interesting performance data of the servers using the debugging output and the /proc interface to visualize this performance data with the help of Ganglia [7]. Collected data includes the average pending

(3)

(and currently processed) operations and the processing time of the different VFS calls, which is obtained from the clients. The parallel file system PVFS [8] embeds a performance monitor in the server process which counts the number of metadata and I/O operations, and the amount of data accessed. Statistics are recorded in configurable sampling intervals and can be fetched with the com-mand line tool pvfs2-perf-mon-example directly from the server process.

There are some frameworks which allow to monitor the state of a system be-yond the state of just the ﬁle system. These tools have in common that collected data is used to generate diagrams showing the usage of a given resource within a ﬁxed sampling interval, e.g. the number of NFS opens per second or the CPU usage could be visualized in a diagram. The collectl utility [9] uses the proc interface to access Lustre statistics and NFS statistics and shows them in rela-tion to other system parameters like network throughput or load. Collectd [10] is built on a plugin concept and can be fed by any source, i.e. by external appli-cations. The already mentioned Ganglia uses a similar module concept.

In a Storage Area Network (SAN) it is possible to measure the performance of a switch, storage I/O response time on SCSI transactions or the average queue depth of the host port. Available performance statistics vary depending on the manufacturer of the storage system.

An oﬀ-line performance analysis tool for PVFS2 is PIOviz [11,12]. This tool allows to trace MPI applications in conjunction with all their induced server ac-tivities and visualizes them with Jumpshot. However, the level of detail provided by the tool is more than a normal user can cope with. Also, the data only gets available after the application ﬁnished. Further system information like observed disk throughput or CPU usage is not traced by the tool. Currently, the Univer-sity of Dresden works to integrate I/O rates of Lustre for each target device in the visualization tool Vampir [13]. However, compared to the PIOviz approach, no detailed insight into the server activities is given. In contrast to discussed solutions the extensions presented in this paper allow to trace MPI-I/O calls together with several metrics provided by Linux and new PVFS2-internal met-rics. These metrics could also be accessed directly to allow a quantitative on-line detection of performance bottlenecks. First, however, interesting performance statistics are discussed.

3 Useful Performance Statistics and Metrics

It is important to detect the bottleneck of the workload to tune or upgrade the system. While it is necessary to detect the machines responsible for a de-graded performance, it is also important to detect the component within the particular machine. Basic components are CPU, network and disks. In general, two categories of sampled performance statistics can be diﬀerentiated: absolute (but limited) metrics i.e. statistics showing a value, but we do not know how this value corresponds to the maximum possible throughput in the given sit-uation, and relative metrics i.e. values of statistics which directly depend on the capability and usage of the component or subsystem. An example for the

(4)

absolute metrics is observed throughput. While the value of this metric might re-veal a difference in current usage, it does not allow to determine the capability of the monitored system. Imagine data being striped in round-robin-fashion across multiple servers. If we observe that all servers contribute a specific amount of throughput to the aggregated throughput, we can not conclude that one server is faster than another. An instance of this problem is the case where servers are attached to the network with different speeds. However, if we can measure a higher throughput on a subset of servers, we can conclude that these servers are more utilized than the others. As an administrator you might be interested to see a metric like 90% utilization of network and 50% utilization of the disks of a particular machine. Knowing the maximum throughput of a component, these values could be computed easily. Consequently, the administrator could see a relative usage of the component. Unfortunately, even these computed values de-pend on the requests of the clients. Assume a machine’s network is utilized by 50% and another machine’s network by 90%. It is no good idea to conclude that the machine with 50% utilization is the bottleneck. The reason might be that the clients simply accessed less data on this machine. However, a load imbalance could be detected with this information.

Sometimes it is not possible to achieve the maximum possible throughput, for instance with random disk access patterns or during a RAID rebuild. Also, just a speciﬁc network path between a subset of machines might be slower. Con-sequently, measured values are inaccurate. On the other hand, metrics showing a value depending on the relative usage help to identify whether a component is overloaded in comparison to other components. An instance is the 60 second kernel load on machines, which depends on the number of currently active pro-cesses and indirectly on the capabilities of the system i.e. faster machines ﬁnish their jobs earlier, resulting in a lower average load.

In the following, load refers to such a relative metric, in particular the load of a subsystem is the average number of queued and scheduled requests within the sampling interval. Normally, a component must multiplex available capabil-ities somehow between pending jobs. Consequently, a problem with this kind of relative values is that a measured load of a sampling interval might vary depending on the processing order of short requests within the sampling in-terval. This problem is demonstrated for a configuration with two clients and two servers in figure 1. While jobs running longer than the length of the sam-pling interval result in the expected average load (see figure 1(a)), the load of short jobs within a sampling interval is higher if they are processed concurrently (see figure 1(b)). In this example, the time to process both jobs concurrently doubles. Due to statistical effects, the average value should be close to reality. Under the assumption that the component in the first server works twice as fast as the component in server two, the two situations in figure 2 are possible. Jobs running over multiple intervals might finish earlier, resulting in intervals with zero load (see figure 2(a)). Short running jobs lead to a decreased load as shown in figure 2(b). Now one might ask the question whether the load metric is more suitable instead of measuring some absolute metric like throughput?

(5)

Job started by client 2 Job started by client 1 Job started by client 2 Job started by client 1

t Server 1 Server 2 2 Load Load 2 2 2 Sampling Interval

(a) Long running jobs

t Server 1 Server 2 2 Load Load 0.5 1 1

(b) Short running jobs

Fig. 1. Example load for homogenous hardware

t Server 1 Server 2 2 Load Load 2 2 0

(a) Long running jobs

t Server 1 Server 2 2 Load Load 1 2 1

(b) Short running jobs

Fig. 2. Example load for inhomogeneous hardware

Considering a ﬁle distributed in a round-robin fashion, the average throughput of each server is identical for unbalanced hardware. Therefore, looking at the average throughput does not help to identify the faster (or slower) components, while the load allows to do so.

Another metric which directly depends on the behavior of the clients and the performance of a component is the idle time i.e. the time the component does not serve any jobs at all. The ratio between jobs currently pending on the network and the disk might also help to determine whether the network path or I/O subsystem is the bottleneck. For each physical component (network, disk) of a server, the maximum (best case) value and the current value is important. For physical and virtual components, e.g. diﬀerent server internal layers, the average number of processed jobs (load) and the idle time are useful.

4 Extended Performance Monitoring with PVFS

This section presents extensions made to PVFS. The modiﬁcations include the new statistics and extensions to the PIOviz environment capable to trace the values and to visualize them with Jumpshot. First, to acquire various statistics available by the kernel, pieces of the atop [14] source code were incorporated into the PVFS performance monitor. This code uses the proc interface to fetch kernel statistics into a single data structure. The following selection of system wide metrics is incorporated into the performance monitor: average load over

(6)

Fig. 3. Screenshot of the enhanced Jumpshot of PIOviz

one minute, free memory (Bytes), memory used for I/O caches (Bytes), CPU usage (Percent), data received from the network (Bytes/s), data sent to the net-work (Bytes/s), data read from the I/O subsystem (Bytes/s), data written by the I/O subsystem (Bytes/s). Next, a set of server internal metrics and statis-tics was added to the performance monitor. Modifications to the persistency layer (TROVE), the network layer (BMI), the job layer (glues the other layers together) and the server’s main process allow to trace the load (as defined in section 3) of the server, the flow layer, the network and disk subsystems. Addi-tionally, the idle time of the persistency layer was added. Note that in PVFS the persistency layer uses an underlying local file system to store the actual data. For metadata multiple Berkeley databases are used. These layers can use back-ground threads to synchronize modified data with the disks. The statistics are updated in the fixed interval of the performance monitor, which is configurable. All statistics are computed (or fetched from /proc) at the end of each interval. In order to visualize the percentage of a metric’s value, Jumpshot must be adapted, because it is not possible to render such a concept with the plain Jumpshot. Fig-ure 3 shows a screenshot of the modified version of Jumpshot included in the extended PIOviz environment. The window on the left contains new categories for the metrics of the performance monitor, while the main window shows dif-ferent height levels depending on the value of each metric. Modifications of the menu panel are highlighted in the screenshot. For more details refer to [15].

5 Results

The evaluation focuses on the four levels of access in MPI [16]. All experi-ments are run on our evaluation cluster consisting of 10 Linux machines with the following conﬁguration: Two Intel Xeon 2000 Mhz processors (32-bit archi-tecture), Kernel 2.6.21-rc5, Debian 3.1, MPICH2-1.0.5, 1 GByte memory, Intel 82545EM Gbit Ethernet network card. All nodes are equipped with an IBM IC35L090AVV207-0 hard drive and ﬁve nodes contain two additional Maxtor hard drives bundled to a software RAID-0 by a Promise FastTrack TX RAID

(7)

controller. Specifications from IBM list the disk’s read throughput between 29 and 56 MiB/s and an average access time of 8.5 ms. Characteristics of a single Maxtor 6G160E0 (DiamondMax 17 series) as specified by Maxtor are an access time of 8.9 ms, a track-to-track seek time of 2.5 ms and a sustained I/O through-put between 30.8 MiB/s and 58.9 MiB/s. Ext3 is configured as a local file system on top of a 38 GByte partition. The RAID performance was measured by IOzone and for convenience, a sequential throughput of 100 MiB/s can be assumed on the RAID partition and about 50 MiB/s on the local drive. Currently, the nodes use the anticipatory elevator to schedule I/O operations.

In PVFS2, files are split in datafiles (subfiles), which are distributed over the available servers. Instead of placing only a single datafile per server, two datafiles (subfiles) are placed in the experiments. The experiments use the benchmark MPI-IO-level which is an I/O-intensive application designed by the working group to record traces in the PIOviz environment for different access levels. It first writes an amount of data with a strided access pattern into a single file, and once all processes finish, it waits a second, then the data is read back in the same fashion. Non-contiguous operation accesses a number of i contiguous blocks at a time and may repeat this process a couple of times (o times). Con-tiguous operation accesses a specified block size a number of iterations equal to the total number of accessed blocks by a non-contiguous version (i∗ o). In the following experiments, each client accesses a total of 2000 MByte. With in-dependend access (level 0), a contiguous 10 MByte block is accessed 200 times by each client. In level 2 and 3, each client accesses 500 MByte 4 times with a single call.

Average statistics over the whole run are extracted from the resulting trace files by a script. In the first experiment, an inhomogeneous I/O subsystem is used. While three servers use the local RAID system, one server operates on the single local disk, which runs at almost half the speed of the RAID disks. Screenshots of a few statistics are shown in figure 4. For each server and statis-tic an individual timeline is given. Note that the first timeline is always the slower server. With independent client I/O in figure 4(a), the actual data ac-cessed during the write and read phase is balanced over all three servers (see statistics IV and VI ). However, the Request load (statistic II ) and Trove load (statistic IX ) show that the operations aggregate on the first server. Conse-quently, it can be concluded that the first server is slower than the others. Also, the I/O subsystem’s idle time on the first server is close to 0, while it is much higher on the others (statistic III ). In PVFS2, up to 8 concurrent Trove opera-tions are started per datafile, which leads to the observed Trove load of 80 (five clients use two datafiles). When non-contiguous requests are used, more data is transfered to the servers by each call (see figure 4(b)). Now requests aggre-gate on the slow server - when the faster servers finish to access data, the first server is still busy to process the data. Then long idle times manifest (statis-tic III ) and can be seen in the access statis(statis-tics IV and V as well. However, compared to the request load II a look at the write statistic does not reveal per-formance variation that clearly. The statistics in figure 4(d) were measured with

(8)

collective non-contiguous calls. With this level of access, the two-phase protocol of MPICH2 is used to access data. Two-phase ships data between the clients leading to more idle time on all machines. In conjunction with the kernel’s write behind strategy, the server with the single disk is not slower during the write phase. However, in the read phase the load is higher and the idle time is lower than on the faster machines. This reveals a load-imbalance during reads. For verification a screenshot of a run on homogenous hardware i.e. all servers use the RAID system, is given in figure 4(c). Average statistics of these experiments are shown in table 1. Statistics include values for metadata accesses and ob-served throughput. In addition, the statistics for homogenous hardware and for only one datafile are listed for the sake of completion. By comparing the aver-age Trove load or Request load, it is possible to identify the slower server in all

(a) level 0 (inhomogeneous hardware) (b) level 2 (inhomogeneous hardware)

(c) level 3 (homogenous hardware) (d) level 3 (inhomogeneous hardware)

(9)

cases. However, there are some abnormalities for homogenous hardware. With level 0 and level 2, one server appears slower than the others i.e. with level 0 the first server has a request load of 7.8, while the other servers have a load of 5.1 and 5.5. The reason for the load variance are short term processing imbalances leading to faster or slower processing on a single server. In average all servers have the same capabilities. However, the I/O subsystem on a server may need slightly longer to reposition the access arm on a disk for a sequence of pending operations. Thus, the operations on the other servers finish earlier. An I/O op-eration ends when all pending flows are finished, therefore once the slower server finishes, a new I/O operation is started by the client. In the meantime, the other servers have to process less work and potentially finish earlier. In fact, this leads to a congestion on the server which takes a bit longer first. Due to statistical effects the roles of the servers may change. With increasing network and CPU speed compared to the I/O subsystem, the effect of a local congestion gets more likely.

Further evaluated experiments can be found in [15]. These experiments in-clude data for cached I/O, partially cached data, unbalanced access patterns which prefer a server subset, and a static load balancing of inhomogeneous server hardware. With cached access, the network is the bottleneck resulting in a high BMI load compared to the Trove load.

Table 1. Average statistics for ﬁve clients and three data-servers. The ﬁrst two blocks

contain data measured on homogeneous hardware. The last set of experiments uses an inhomogeneous conﬁguration in which the ﬁrst server stores data on a single local disk instead of the faster RAID.

Re que s t load BM I lo a d Tr o v e lo a d [%] T r o v e id len es s [o p / s ] Me t a d. r e ads [o p / s ] Me t a d. w r it e s Wr it e [M iB / s ] Re ad [M iB /s] level 0 7.8 5.1 5.5 2.1 2.1 2.1 36.7 17.9 20.6 5.4 9.7 9.8 23.09 0.05 216 80 level 1 5.1 5 5.1 1.7 1.7 1.7 9.2 9 9.2 26.9 27.6 27.3 55.08 0.04 166 64 level 2 8.7 8.8 8.2 2.2 2.3 2.2 54.4 54.9 49.9 3.4 2.1 4.4 1.2 0.05 235 78 level 3 2.9 3 2.9 1.2 1.2 1.2 4.4 4.8 4.6 56.7 54 56.3 34.19 0.03 86 57 level 0 - 1 datafile 3.2 3.1 3.9 1.2 1.2 1.2 14.8 14.5 20.2 15.5 16.9 9.5 12.85 0.03 214 93 level 1 - 1 datafile 2.7 2.5 2.6 1.1 1.1 1.1 8.5 7.6 7.9 30.3 34.9 33.2 30.37 0.02 167 73 level 2 - 1 datafile 4 4.3 4.1 1.2 1.4 1.2 25.7 28.6 26.6 10.1 5.3 8.8 0.64 0.02 198 92 level 3 - 1 datafile 1.5 1.5 1.5 0.9 0.9 0.9 4.1 4.1 4.1 60.8 60.4 60.9 18.3 0.01 83 66

level 0-inh. I/O 8.4 2.3 2.2 1.2 1.3 1.3 44.1 5 4.6 2.6 40.6 42.1 15.2 0.03 120 56

level 1-inh. I/O 5.8 3.5 3.6 1.2 1.3 1.3 11.9 6 6.2 10.1 49.5 49.2 42.08 0.03 119 50

level 2-inh. I/O 9.2 4.1 4.1 2.6 1.1 1.1 65.2 25.6 25 1.6 39.1 39.8 0.76 0.03 114 54

(10)

6 Conclusions and Future Work

The presented implementation allows to record system behavior of the servers in conjunction with the activities of the MPI application. Compared to absolute metrics like measured throughput, relative metrics allowed us to identify the limiting server and even the component i.e. disk or network, which might be the bottleneck. This is an important task in inhomogeneous and large environments. In combination with well known maximum values, servers could be identiﬁed to perform load balancing in dynamic environments. Also, implemented average statistics support the monitoring of the ﬁle system, or they could be used for an automatic on-line bottleneck detection by the administrator.

References

1. Mellor, C.: US defense department builds world’s biggest SAN (2006), http://www.techworld.com/news/index.cfm?RSS\&NewsID=6846

2. Schmuck, F., Haskin, R.: GPFS: A Shared-Disk File System for Large Comput-ing Clusters. In: Proc. of the First Conference on File and Storage Technologies (FAST), January 2002, pp. 231–244 (2002)

3. Cluster File Systems Inc: Lustre, http://www.lustre.org

4. IBM: General Parallel File System - Advanced Administration Guide V3.1. (2006), http://publib.boulder.ibm.com/epubs/pdf/bl1adv00.pdf

5. Cluster File Systems Inc: Lustre 1.6 Manual, http://manual.lustre.org/manual/ LustreManual16 HTML/DynamicHTML-21-1.html

6. Cluster File Systems Inc: Lustre Debugging (2007),

http://wiki.lustre.org/index.php?title=Lustre Debugging 7. Cluster File Systems Inc: Lustre: Proﬁling Tools for IO (2007),

http://arch.lustre.org/index.php?title=Profiling Tools for IO

8. Ligon, W., Ross, R.: PVFS: Parallel Virtual File System. In: Sterling, T. (ed.) Beowulf Cluster Computing with Linux. Scientiﬁc and Engineering Computation, November 2001, pp. 391–430. The MIT Press, Cambridge (2001)

9. Seger, M.: Homepage of collectl, http://collectl.sourceforge.net/ 10. Forster, F.: Homepage of collectd, http://collectd.org/

11. Ludwig, T., Krempel, S., Kunkel, J.M., Panse, F., Withanage, D.: Tracing the MPI-IO Calls’ Disk Accesses. In: Mohr, B., Tr¨aﬀ, J.L., Worringen, J., Dongarra, J. (eds.) PVM/MPI 2006. LNCS, vol. 4192, pp. 322–330. Springer, Heidelberg (2006) 12. Ludwig, T., Krempel, S., Kuhn, M., Kunkel, J.M., Lohse, C.: Analysis of the MPI-IO Optimization Levels with the PMPI-IOViz Jumpshot Enhancement. In: Cappello, F., Herault, T., Dongarra, J. (eds.) PVM/MPI 2007. LNCS, vol. 4757, pp. 213–222. Springer, Heidelberg (2007)

13. Juckeland, G.: Vampir and Lustre (2007), http://clusterfs-intra.com/cfscom/ images/lustre/LUG2007/lug07-dresden.pdf

14. AT Consultancy bv: Atop, http://www.atcomputing.nl/Tools/atop

15. Kunkel, J.M.: Towards Automatic Load Balancing of a Parallel File System with Subﬁle Based Migration. Master’s thesis, Ruprecht-Karls-Universit¨at Heidelberg, Institute of Computer Science (July 2007)

16. Gropp, W., Thakur, R., Lusk, E.: 3.10.1. In: Using MPI-2: Advanced Features of the Message Passing Interface, pp. 101–105. MIT Press, Cambridge (1999)