Parallel Performance Analysis - Schmidberger, Markus (2009): Parallel Computing for Biol

of task idle time. Load balancing is important to parallel programs for performance reasons.

Granularity: In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. Periods of computation are typically separated from periods of communication by synchronization events. In fine-grain parallelism, relatively small amounts of computational work are done between communication events. The opposite is called coarse-grain parallelism. The most efficient granularity depends on the algorithm and the hardware environment in which it runs.

Random Numbers: Generating random numbers presents a particular problem for parallel programming. For example, if you are using a large number of random numbers on a number of different processors and using the same random number generator on each, there is a chance that some of the streams will overlap. However, there are tools available to fix these problems, e.g., SPRNG.
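The overlap problem can be illustrated with a minimal Python sketch. Python's standard `random` module is used here purely for illustration and is not the tooling discussed in the text; the helper `worker_stream` and the seeding scheme are assumptions. If every worker seeds the same generator identically, all workers draw the same stream. Giving each worker its own seed avoids identical streams but, unlike dedicated libraries such as SPRNG, does not by itself guarantee that the streams never overlap.

```python
import random


def worker_stream(seed, count=5):
    """Return `count` random numbers from a generator private to one worker."""
    rng = random.Random(seed)  # independent generator object, no shared global state
    return [rng.random() for _ in range(count)]


# Same seed on every worker: all streams are identical (complete overlap).
same_seed = [worker_stream(seed=42) for _worker in range(3)]

# One distinct seed per worker (hypothetical scheme: base seed + worker index).
base_seed = 42
distinct_seed = [worker_stream(seed=base_seed + worker) for worker in range(3)]

print(same_seed[0] == same_seed[1] == same_seed[2])  # True
print(distinct_seed[0] == distinct_seed[1])          # False
```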

4.5 Parallel Performance Analysis

Performance analysis and tuning of parallel algorithms are very difficult. As with debugging, monitoring and analyzing parallel program execution is significantly more challenging than for serial programs. A number of tools for monitoring and program analysis of parallel code are available. For debugging, especially of code running on the workers, only a limited number of tools exist.

4.5.1 Computation Time

First of all, the computation time for different numbers of processors and different sizes of input data can be measured and visualized. Typically, a

$$T_N \propto \frac{1}{N}$$

trend can be seen for the computation time (T) plotted over the number of processors (N). In theory, an N-fold acceleration in computation is expected when using N processors.
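As a minimal sketch of such a measurement, the following Python snippet times a placeholder workload for several input sizes. The names `workload` and `measure_seconds` are assumptions made for illustration; in practice, the same measurement would be repeated for each number of processors.

```python
import time


def workload(size):
    """Hypothetical stand-in for the real computation: sum of squares."""
    return sum(i * i for i in range(size))


def measure_seconds(func, *args, repeats=3):
    """Return the shortest wall clock time of `repeats` calls to func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Computation time for different sizes of input data.
timings = {size: measure_seconds(workload, size) for size in (10_000, 100_000, 1_000_000)}
for size, seconds in timings.items():
    print(f"input size {size:>9}: {seconds:.4f} s")
```

Taking the minimum over a few repeats is one common way to reduce the influence of other processes on the measurement; a mean or median would be an equally reasonable choice.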

4.5.2 Speedup

In parallel computing, speedup (S) refers to how much faster a parallel algorithm is than a corresponding sequential algorithm:

$$S_N = \frac{T_1}{T_N}$$

where N is the number of processors, T_1 the execution time of the sequential algorithm, and T_N the execution time of the parallel algorithm on N processors. The speedup is called absolute speedup when T_1 is the execution time of the best sequential algorithm, and relative speedup when T_1 is the execution time of the same parallel algorithm on one processor.
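The ratio of sequential to parallel execution time can be computed directly from measured run times. The following Python sketch assumes a hypothetical set of timings (the values in `measured` are invented for illustration, not results from this work) and evaluates the relative speedup T_1 / T_N for each number of processors.

```python
def speedup(t_1, t_n):
    """Speedup S_N = T_1 / T_N for execution times given in the same unit."""
    if t_1 <= 0 or t_n <= 0:
        raise ValueError("execution times must be positive")
    return t_1 / t_n


# Hypothetical measurements in seconds, keyed by the number of processors N.
measured = {1: 120.0, 2: 62.5, 4: 36.0, 8: 24.0}

for n, t_n in measured.items():
    print(f"N = {n}: S_N = {speedup(measured[1], t_n):.2f}")
# N = 1: S_N = 1.00
# N = 2: S_N = 1.92
# N = 4: S_N = 3.33
# N = 8: S_N = 5.00
```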

Amdahl’s Law

One of the best rules for describing the limits and costs of parallel programming is Amdahl's Law [Amd67]. It states that the potential program speedup is bounded by the fraction of code that can be parallelized (P):

$$S \le \frac{1}{1-P}$$

If none of the code can be parallelized (P = 0) then the speedup is 1 (no speedup). If all of the code is parallelized (P = 1), the speedup is infinite (in theory). If 50% of the code can be parallelized, maximum speedup is 2, meaning the code will run twice as fast (see Figure 4.6). Introducing the number of processors (N) performing the parallel fraction of work, the relationship can be modeled by

$$S_N \le \frac{1}{\frac{P}{N} + S}$$

where P is the parallel fraction and S = 1 − P the serial fraction (not to be confused with the speedup). As visualized in Figure 4.6, there are limits to the scalability of parallelism.

Figure 4.6: Visualization of theoretical speedup for parallel computing plotted for the parallel portion of code (left) and number of processors (right). [The right panel shows curves for parallel portions of 50%, 75%, 90%, 95% and 100%.]

Due to parallelization, additional costs arise (for example, for communication), so it is necessary to add a parameter o(N), which grows with increasing N.

$$S_N \le \frac{1}{\frac{P}{N} + S + o(N)}$$

The speedup curves no longer converge to $\frac{1}{1-P}$. They reach a maximum and then fall off (visualized in Figure 5.11). This effect can commonly be observed in practical examples. If the number of processors is large enough, the costs for communication exceed the computing time.
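The behavior of both bounds can be explored numerically. The Python sketch below evaluates 1 / (P/N + S + o(N)) with S = 1 − P, once without overhead and once with an overhead term. The text does not specify the form of o(N); the linear model in `linear_overhead` and its coefficient are assumptions chosen only to make the effect visible.

```python
def amdahl_speedup(p, n, overhead=0.0):
    """Upper bound 1 / (P/N + S + o(N)) with serial fraction S = 1 - P."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 / (p / n + (1.0 - p) + overhead)


def linear_overhead(n, cost_per_processor=0.002):
    """Assumed overhead model o(N) = c * (N - 1); the form and c are illustrative."""
    return cost_per_processor * (n - 1)


p = 0.95  # parallel fraction
for n in (1, 8, 64, 512):
    ideal = amdahl_speedup(p, n)
    with_cost = amdahl_speedup(p, n, linear_overhead(n))
    print(f"N = {n:>3}: without overhead {ideal:5.2f}, with overhead {with_cost:5.2f}")

# Number of processors with the highest speedup under the assumed overhead.
best_n = max(range(1, 513), key=lambda n: amdahl_speedup(p, n, linear_overhead(n)))
print("best N with overhead:", best_n)
```

Without overhead, the speedup for P = 0.95 approaches but never exceeds 20. With the assumed overhead, it peaks at a moderate number of processors and decreases beyond that point.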

Amdahl's Law was formulated in 1967, and newer technologies, especially caching, were not considered. Therefore, a speedup of more than N when using N processors is sometimes observed in parallel computing, which is called super-linear speedup. Super-linear speedup rarely happens and can have different reasons, such as inefficient serial code or cache effects.

4.5.3 Efficiency

Another performance metric is called efficiency and is defined as

$$E_N = \frac{S_N}{N}.$$

It is a value, typically between zero and one, that estimates how well utilized the processors are in solving the problem, compared to how much effort is wasted in communication and synchronization. Algorithms with linear speedup and algorithms running on a single processor have an efficiency of 1, while many difficult-to-parallelize algorithms have an efficiency such as $\frac{1}{\log N}$, which approaches zero as the number of processors increases.
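As a small illustration, efficiency can be derived from the same run times used to compute the speedup, since E_N = S_N / N = T_1 / (N · T_N). The timings in the following Python sketch are hypothetical values chosen for illustration.

```python
def efficiency(t_1, t_n, n):
    """Efficiency E_N = S_N / N, with speedup S_N = T_1 / T_N."""
    if t_1 <= 0 or t_n <= 0 or n < 1:
        raise ValueError("times must be positive and n >= 1")
    return (t_1 / t_n) / n


# Hypothetical measurements in seconds, keyed by the number of processors N.
measured = {1: 120.0, 2: 62.5, 4: 36.0, 8: 24.0}

for n, t_n in measured.items():
    print(f"N = {n}: E_N = {efficiency(measured[1], t_n, n):.3f}")
```

In this invented data set, the efficiency drops as processors are added, which is the typical pattern when communication and synchronization costs grow with N.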

4.5.4 Karp-Flatt Metric

The Karp-Flatt Metric is a measure of parallelization of code in parallel processor systems.

This metric exists in addition to Amdahl's Law as an indication of the extent to which a particular computer code is parallelized [KF90]. The experimentally determined serial fraction e is defined as

$$e = \frac{\frac{1}{S_N} - \frac{1}{N}}{1 - \frac{1}{N}}.$$

The lower the value of e, the better the parallelization. In the case of super-linear speedup, the value becomes negative.
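The metric is straightforward to evaluate from a measured speedup. The following Python sketch computes e = (1/S_N − 1/N) / (1 − 1/N) for three hypothetical speedup values on eight processors, covering the sublinear, linear, and super-linear cases.

```python
def karp_flatt(speedup, n):
    """Experimentally determined serial fraction e = (1/S_N - 1/N) / (1 - 1/N)."""
    if n < 2:
        raise ValueError("the metric is only defined for n >= 2")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


print(karp_flatt(speedup=5.0, n=8))   # sublinear speedup: e is positive
print(karp_flatt(speedup=8.0, n=8))   # linear speedup: e is 0.0
print(karp_flatt(speedup=10.0, n=8))  # super-linear speedup: e is negative
```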

4.5.5 Resource Requirements

The primary intent of parallel programming is to decrease execution wall clock time. However, in order to accomplish this, more CPU time is required. For example, a parallel code that runs in one hour on eight processors actually uses eight hours of CPU time. The amount of memory required can be greater for parallel codes than for serial codes, due to the need to replicate data and the overheads associated with parallel support libraries and subsystems. For short-running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation. The overhead costs associated with setting up the parallel environment, task creation, communications, and task termination can comprise a significant portion of the total execution time for short runs.
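Both effects can be made concrete with a short Python sketch. The function `cpu_hours` reproduces the arithmetic of the eight-processor example, assuming all processors are occupied for the whole run. The function `parallel_wall_time` uses an assumed, simplified model (a fixed setup cost plus perfectly divided work) with invented numbers to show why short runs can become slower in parallel.

```python
def cpu_hours(wall_clock_hours, n_processors):
    """CPU time consumed when n_processors are busy for the whole run."""
    return wall_clock_hours * n_processors


def parallel_wall_time(serial_seconds, n_processors, setup_seconds):
    """Assumed model: fixed setup and teardown overhead plus perfectly divided work."""
    return setup_seconds + serial_seconds / n_processors


# Example from the text: one hour of wall clock time on eight processors.
print(cpu_hours(1.0, 8))  # 8.0

# Hypothetical runs with 5 s of parallel setup overhead on eight processors.
short_run = parallel_wall_time(serial_seconds=2.0, n_processors=8, setup_seconds=5.0)
long_run = parallel_wall_time(serial_seconds=3600.0, n_processors=8, setup_seconds=5.0)
print(short_run)  # 5.25, slower than the 2.0 s serial run
print(long_run)   # 455.0, much faster than the 3600.0 s serial run
```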

Table 4.1 shows the CPU time and main memory consumed for this PhD thesis at the IBE. Monitoring by the batch system 'Sun Grid Engine' has been available since March 2009. In May 2009, the permutation test described in Chapter 7.4 was calculated.

                                   March   April      May    June    July
CPU time in days - non parallel      0.6     4.0      6.0     0.9     0.8
CPU time in days - parallel        246.3   118.5   1143.9   162.6   267.4
Memory in GB - non parallel          0.1    0.79      2.1     0.1     0.1
Memory in GB - parallel              7.2     3.9     55.2    12.3    13.8

Table 4.1: Used computer resources for this PhD thesis at the cluster at the IBE.

In document Schmidberger, Markus (2009): Parallel Computing for Biological Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 65-68)