Patterns of production and consumption - Quantifying the potential benefits of automatic overla

5.4 Quantifying the potential benefits of automatic overlap

5.4.2 Patterns of production and consumption

For studying the potential of automatic overlap, it is very important to explore how are the transferred messages produced at the sender side and consumed at the receiver side. Each MPI program consists of computation bursts separated by MPI communication calls. A computation burst consumes the data received in the preceding MPI call and

produces the data for the forthcoming MPI call. Ideally, if the production of the buffer to be sent is linear and uniform across the burst, the process can send the first half of the message half way through the computation burst. Similarly, if the consumption of the received buffer is linear and uniform across the burst, the process can postpone the blocking wait for the second half. Thus, computation patterns determine the potential for advancing sends and postponing receives and strongly influence the potential for overlap. Therefore, our first study focuses on characterizing patterns of consumption and production of the communicated data. To that end we developed an instrumenta- tion tool based on Valgrind (Section 4.2.1). For each MPI transfer, the tool identifies

the address of a communicated buffer and instruments all memory accesses to that

buffer. This information allows us to determine when each element of a communicated buffer is produced or consumed.

Results of tracing patterns

Due to complex computational algorithms, scientific applications tend to have diverse patterns of production and consumption (Figure 5.11). Figures on the left show production patterns, and figures on the right show consumption patterns. For each plot, the x axis represents the normalized time within the corresponding computation interval (from start to end), while the y axis represents an element’s address within the accessed buffer. Points represent when each element was written for production patterns and read for consumption patterns. Many such patterns have been identified within each application.

Figures5.11a and5.11bpresent the production and consumption patterns that are present in 50% of all transfers in Sweep3D. As presented on the y-axis, the communicated buffer has 600 elements and all of them are revisited and accessed many times during one computation phase. This causes very late production of the final version of the elements and decreases the potential for advancing partial sends. As shown in Figure5.11a, the final version of the first element is produced at 66.3% of the production interval, while the final version of the first quarter of the message is produced at 94.8% of the production interval. Figure5.11bindicates that the potential for postponing partial receives is even lower. This is because the application reads all the elements of the received message very early in the consumption phase. Compared to the ideal

0 100 200 300 400 500 600 0 20 40 60 80 100