4.4 An Outlook on Automated Performance Analysis
4.4.1 Automated Performance Analysis
The computational power of supercomputers grows at a significantly higher rate than would be expected according to Moore’s law. This performance increase is achieved by an increased level of parallelism, both in the number of cores per socket and in the total number of sockets in the system. Five years ago, the majority of systems in the Top500 supercomputer list (395 systems, 79%) consisted of ≤ 1024 cores, and only 16 systems had more than 4096 cores. As of June 2010, the majority of systems (387 systems, 77.4%) comprise more than 4096 processor cores, with five systems encompassing more than 128,000 cores. The number one system even comprises 224,162 processor cores. This trend is set to continue, with more and more systems entering the Petascale league.
Developing applications for such massively parallel architectures is a challenging task, and good scalability may be hard to achieve. However, due to the enormous costs of such machines, with respect to both acquisition and operation, a high level of efficiency is required. Yet optimizing an application for best performance in the design phase is almost impossible due to the complex architecture of such machines. Donald Knuth even points out that “premature optimization is the root of all evil” [87], as programmers often tend to focus on performance too early in the development process. Hence, performance optimization is usually conducted after the main development of an application has been finished and its features have been tested for functional correctness. Additionally, most scientific codes have a long history and have usually been ported repeatedly from one hardware architecture to another over time. The source code of those applications is typically sufficiently portable and therefore compiles with only minor modifications on a new platform. The time-consuming part of the porting process is instead optimizing the code for high performance on the new hardware architecture. Hence, performance analysis is a crucial task during the development and adaptation of applications for high-performance computers.
Performance optimization is typically not straightforward but an iterative process with alternating phases of performance analysis and application tuning. This development process is often referred to as the measure-analyze-modify cycle [8, 101], as the performance analysis step actually consists of two steps: measuring performance-critical properties of an application and analyzing the measured data. While the measuring step is largely determined by the properties to be measured, there are multiple approaches for handling and analyzing the measured data.
Most traditional performance analysis tools for parallel systems are so-called trace-based tools. That is, they collect information on performance-critical events during the execution of an application and write these event traces to an output file. After the application has finished, the traces are visualized by the tool and interpreted by the developer. However, the increasing level of parallelism required to exploit the full potential of current supercomputers also increases the complexity of the applications. Manually analyzing trace files of such large-scale applications and identifying performance bottlenecks is therefore challenging, if not impossible. Yet, as scaling down the application (i.e., executing it with a lower processor count) also alters its performance characteristics, actual program runs with real data and the desired number of processors have to be analyzed to detect performance problems and their causes. Hence, tools are required that automatically collect, filter, and analyze the performance data of a given application and pinpoint the code regions that impact performance.
Automated performance analysis has been a research topic for more than 15 years now. Over time, many tools have been developed, the most notable ones being Paradyn [100], TAU [129], Vampir [19], KOJAK [155], and SCALASCA [50]. The automation of the analysis process by these tools typically aims at detecting pre-defined properties that are known to cause performance bottlenecks in applications. For MPI applications, most performance-critical problems are related to load imbalance, which causes a subset of the processes to wait for communication partners that need more time to process their work share and thus call the respective MPI function too late. This waiting time can be measured by utilizing the MPI profiling interface, which also allows for identifying the respective MPI function calls that caused the delay. A list of frequently observed performance bottlenecks and load imbalance problems in MPI applications has been assembled by the APART working group [39].
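The waiting time caused by load imbalance can be illustrated with a small model. The following sketch does not use MPI or its profiling interface; it merely assumes that all processes meet at one synchronizing operation and that each process waits there until the slowest one arrives. The function names and the sample work times are hypothetical.

```python
def waiting_times(work_seconds):
    """Per-process waiting time at a synchronizing operation.

    Assumption: every process waits until the slowest process arrives.
    """
    slowest = max(work_seconds)
    return [slowest - w for w in work_seconds]


def imbalance_summary(work_seconds):
    """Return (waits, total waiting time, fraction of processor time lost)."""
    waits = waiting_times(work_seconds)
    total_wait = sum(waits)
    # Processor time consumed up to the synchronization point:
    # number of processes times the duration of the slowest process.
    consumed = len(work_seconds) * max(work_seconds)
    return waits, total_wait, total_wait / consumed


# Hypothetical work shares: three processes need 2 s, one needs 4 s.
work = [2.0, 2.0, 2.0, 4.0]
waits, total_wait, lost_fraction = imbalance_summary(work)
print(waits, total_wait, lost_fraction)
```

In this model a single overloaded process leaves the other three idle for two seconds each, so more than a third of the consumed processor time is spent waiting.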
Most performance analysis tools are trace-based, i.e., they perform the analysis step after the application to be analyzed has terminated (post-mortem). However, collecting information about all performance-relevant events of a large-scale parallel application can easily produce a huge amount of data. Without further filtering, this mass of information would not only overwhelm the user but may also strain the machine, as the data has to be stored in main memory or written to hard disk during the application’s runtime. Periscope takes a different approach to automated performance analysis. Its analysis is solely based on summary information: instead of keeping a record of every single performance-critical event, the collected information is aggregated as soon as possible. For example, instead of storing the individual time spent on every single call to a specific function, the total time spent on all calls is stored along with the total number of calls to the function. This approach dramatically reduces the amount of performance data to be stored but still allows for drawing the same conclusions as a trace-based analysis: a developer is typically not interested in the time wasted on a single function call but in the total time wasted by a specific function. A function that is called only
once during a program run may waste a whole second without any noticeable impact on the program’s overall performance. However, a function that is called a million times and wastes just a millisecond on every call accumulates about 1,000 seconds of overhead and usually degrades the performance significantly.
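The aggregation described above can be sketched as follows. This is a minimal illustration and not Periscope’s actual data structure: the class SummaryProfile, its methods, and the function names init_once and hot_kernel are hypothetical. Times are kept as integer microseconds so that the sums are exact.

```python
from collections import defaultdict


class SummaryProfile:
    """Keeps only the total time and the call count per function."""

    def __init__(self):
        self._total_us = defaultdict(int)  # accumulated time in microseconds
        self._calls = defaultdict(int)     # number of recorded calls

    def record(self, function, microseconds):
        # Aggregate immediately; the individual event is not stored.
        self._total_us[function] += microseconds
        self._calls[function] += 1

    def total_us(self, function):
        return self._total_us.get(function, 0)

    def calls(self, function):
        return self._calls.get(function, 0)

    def ranked(self):
        """Function names ordered by accumulated time, largest first."""
        return sorted(self._total_us, key=self._total_us.get, reverse=True)


profile = SummaryProfile()
profile.record("init_once", 1_000_000)   # one call wasting a whole second
for _ in range(1_000_000):
    profile.record("hot_kernel", 1_000)  # a million calls, 1 ms each

for name in profile.ranked():
    print(name, profile.calls(name), profile.total_us(name) / 1e6, "s")
```

A trace of the same run would hold 1,000,001 event records, whereas the summary holds two counters per function and still ranks hot_kernel (1,000 s in total) far ahead of init_once (1 s).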
Another benefit of performance analysis based on summary information is that the analysis can be performed online, i.e., while the application is still running. On the one hand, this allows for instantaneous feedback to the user without time-consuming post-processing of trace files after the program has terminated. On the other hand, an online analysis can react to previously gathered information. For example, if a function has been identified as a bottleneck, the analysis tool can drill deeper into the function body to pinpoint the actual source code lines that cause the bottleneck.
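Such a drill-down can be sketched as an iterative refinement over nested code regions. The sketch below is an assumption-laden illustration and not the search strategy of any particular tool: the region names, the timings, the 50% threshold, and the function drill_down are all hypothetical. In an online setting, the measure callback would request a fresh measurement from the running application instead of reading a fixed table.

```python
def drill_down(measure, children, region, threshold=0.5):
    """Follow the dominant sub-region while it accounts for at least
    `threshold` of its parent's time; return the path of regions visited."""
    path = [region]
    while True:
        total = measure(region)
        hot = [c for c in children.get(region, [])
               if total > 0 and measure(c) / total >= threshold]
        if not hot:
            return path  # no sub-region dominates; stop refining
        region = max(hot, key=measure)
        path.append(region)


# Hypothetical summary timings in seconds for nested code regions.
timings = {
    "main": 10.0,
    "solve": 8.0,
    "io": 2.0,
    "solve:loop": 7.0,
    "solve:setup": 1.0,
}
# Hypothetical region hierarchy: function -> regions inside its body.
children = {
    "main": ["solve", "io"],
    "solve": ["solve:loop", "solve:setup"],
}

path = drill_down(timings.get, children, "main")
print(" -> ".join(path))
```

Each refinement step only needs summary data for the children of the current region, so the amount of data collected stays small even as the analysis narrows down to a specific loop.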