Multiphysics Simulations and Petascale Computing

3.5 Applications Present and Future

The arrival of terascale supercomputers such as Blue Gene/L ushers in a new era of computational science, in which scientific simulation will emerge as a true peer to theory and experiment in the process of scientific discovery. In the past, simulations were viewed largely as an extension of theory. Moreover, these simulations were often lacking in some fundamental way, for example, insufficient spatial resolution due to limited computational resources. Today’s supercomputers finally possess sufficient computing power (and memory) to enable unprecedented simulations of physical phenomena, simulations that often suggest new theories or guide future experiments. This section examines the state of the art in terascale simulation through two illustrative applications. It also offers a glimpse of the future of petascale simulation by describing the use of cooperative parallelism to enable a multiphysics simulation.

3.5.1 State of the art in terascale simulation

The LLNL Blue Gene/L supercomputer described in Section 3.2 represents a milestone in both scientific computing and computational science. In the former area, Blue Gene/L is the first computer to exceed 100 TFLOPS on the LINPACK benchmark [15]. It achieved 280.6 TFLOPS, 76% of the machine’s theoretical peak. It is the first computer to employ more than 100,000 cores, thus redefining the term “massively parallel.” This is important because it challenges the scientific computing community to think anew about parallelism and scalability. Many observers view Blue Gene/L as a stepping stone to the petascale.

The computational science milestone is even more impressive and important: In November 2005, Blue Gene/L ran the first meaningful scientific simulation to sustain more than 100 TFLOPS. Less than a year later, two additional application codes exceeded 100 TFLOPS on Blue Gene/L, including one that has since sustained 207 TFLOPS. In fact, the last two winners of the Gordon Bell Prize were LLNL’s ddcMD and Qbox codes. Both codes ran molecular dynamics (MD) simulations of material properties under extreme conditions on Blue Gene/L. MD codes are particularly well suited to this machine, but other codes have been ported with excellent results.

World’s first 100 TFLOPS sustained calculation (ddcMD). The 2005 Gordon Bell Prize was awarded to an LLNL-IBM team led by computational physicist Fred Streitz for the world’s first 100 TFLOPS sustained calculation [17]. They simulated the solidification of tantalum and uranium via classical molecular dynamics using pseudopotentials. They ran a number of simulations, including some with more than 500 million atoms. These simulations begin to span the atomistic to mesoscopic scales.

The code, called ddcMD, sustained 102 TFLOPS over a seven-hour run, thus achieving a remarkable 28% of theoretical peak performance. It was fine-tuned by IBM performance specialists, and it demonstrated exemplary strong and weak scaling across 131,072 cores for several different simulations. The simulation results are scientifically important in their own right. For the first time, scientists had both sufficient computing power (in Blue Gene/L) and a scalable application code (ddcMD) to fully resolve the physics of interest. Specifically, their 16 million atom simulation of grain formation was the first to produce physically correct, size-independent results. In contrast, previous simulations on smaller machines were able to use at most 64,000 atoms. In order to simulate the physical system of interest, periodic boundary conditions were imposed, resulting in an unphysical periodicity in the simulation results. The striking difference between these two simulations is readily seen in Figure 3.5, where three snapshots in time are shown. In the top row, one can see the rich 3D detail in the planar slices, whereas in the bottom row one sees the replicated pattern resulting from the under-resolved simulation. Moreover, the team proved that for this problem, no more than 16 million atoms are needed.
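The role of periodic boundary conditions in the small-cell result can be illustrated with the standard minimum-image convention, in which each atom interacts with the nearest periodic copy of its neighbors. The following is a minimal sketch for a cubic box; the function name, box length, and coordinates are hypothetical illustrations and are not taken from ddcMD.

```python
import math


def minimum_image_displacement(r_i, r_j, box_length):
    """Displacement from atom j to atom i in a periodic cubic box.

    Each component is shifted by a whole number of box lengths so that it
    refers to the nearest periodic image of atom j.
    """
    return [
        (xi - xj) - box_length * round((xi - xj) / box_length)
        for xi, xj in zip(r_i, r_j)
    ]


# Illustrative values only: two atoms near opposite faces of a box of length 10.
box = 10.0
atom_a = [0.5, 5.0, 5.0]
atom_b = [9.5, 5.0, 5.0]

d = minimum_image_displacement(atom_a, atom_b, box)
distance = math.sqrt(sum(c * c for c in d))

# The atoms are 9 units apart inside the box, but only 1 unit apart through
# the periodic boundary, so they are treated as near neighbors.
print(d, distance)
```

Because every structure is wrapped in this way, a feature that grows to the size of the cell meets its own periodic image, which is consistent with the replicated pattern in the bottom row of Figure 3.5.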

FIGURE 3.5: (See color insert following page 18.) Three-dimensional molecular dynamics simulation of nucleation and grain growth in molten tantalum. Three snapshots in time are shown for two simulations. The top row corresponds to a simulation using 16 million atoms on the Blue Gene/L supercomputer at LLNL. This 2005 Gordon Bell Prize-winning calculation was the first to produce physically correct, size-independent results. The rich 3D detail is seen in the planar slices. The bottom row used 64,000 atoms on a smaller supercomputer. Periodic boundary conditions were used to generate the entire domain, resulting in the unphysical replicated pattern. (Image from Streitz et al. [17])

Current world record of 207 TFLOPS sustained (Qbox). The 2006 Gordon Bell Prize was awarded to an LLNL-UC Davis team led by computational scientist François Gygi for their 207 TFLOPS simulation of molybdenum [12]. This result is the current world record for performance of a scientific application code on a supercomputer. The code, called Qbox, simulates material properties via quantum molecular dynamics. It is a first principles approach based on Schrödinger’s equation using density functional theory (with a plane-wave basis) and pseudopotentials. This versatile code has been used to simulate condensed matter subject to extreme conditions (such as high pressure and temperature) in a variety of applications.

The code is written in C++ and MPI and is parallelized over several physics parameters (plane waves, electronic states, and k-points). It employs optimized ScaLAPACK and BLACS linear algebra routines, as well as the FFTW Fast Fourier Transform library. The test problem involved 1000 molybdenum atoms. In November 2005, this code sustained 64 TFLOPS on Blue Gene/L. Less than one year later, after considerable effort and tuning, the code achieved 207 TFLOPS on the problem above. It should be mentioned that this code’s performance is highly dependent on the distribution of tasks across processors. The best performance was achieved using a quadpartite distribution across a 64×32×32 processor grid.
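The performance figures quoted in this section can be cross-checked with simple arithmetic. The sketch below takes the core count, processor grid, and sustained rates from the text. The clock rate and the number of floating-point operations per cycle per core are assumptions that the text does not state, as is the reading of 131,072 cores as the full machine.

```python
# Values quoted in the text.
CORES = 131_072                      # cores in the ddcMD scaling runs
QBOX_GRID = (64, 32, 32)             # processor grid of the best Qbox run
SUSTAINED_TFLOPS = {"LINPACK": 280.6, "ddcMD": 102.0, "Qbox": 207.0}
QBOX_TFLOPS_NOV_2005 = 64.0

# Assumptions, not stated in the text: clock rate and flops per cycle per core,
# and that 131,072 cores is the whole machine.
CLOCK_GHZ = 0.7
FLOPS_PER_CYCLE = 4

# Assumed theoretical peak: cores x GFLOPS per core, converted to TFLOPS.
peak_tflops = CORES * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0

# Fraction of the assumed peak reached by each sustained rate.
fraction_of_peak = {
    name: rate / peak_tflops for name, rate in SUSTAINED_TFLOPS.items()
}

# Number of positions in the Qbox processor grid, and the Qbox improvement
# between November 2005 and the 207 TFLOPS run.
grid_positions = QBOX_GRID[0] * QBOX_GRID[1] * QBOX_GRID[2]
qbox_improvement = SUSTAINED_TFLOPS["Qbox"] / QBOX_TFLOPS_NOV_2005

for name, frac in fraction_of_peak.items():
    print(f"{name}: {100 * frac:.1f}% of an assumed {peak_tflops:.1f} TFLOPS peak")
print(f"Qbox grid positions: {grid_positions}")
print(f"Qbox improvement factor: {qbox_improvement:.2f}")
```

Under these assumptions the LINPACK and ddcMD fractions round to the 76% and 28% quoted in the text, and the Qbox run reaches a little over half of peak. The grid has 65,536 positions, half the quoted core count, which would be consistent with two cores per grid position.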

The heroic effort behind the simulations described above should not be understated: In each case, teams of dedicated computational and computer scientists toiled to tune their codes to perform well on a radically new architecture. These pioneers deserve accolades, but future success will be measured by how routine and easy similar simulations become. The next section hints at one approach to realizing this vision.

3.5.2 Multiphysics simulation via cooperative parallelism

The two preceding examples illustrate the state of the art in large-scale simulation using single-physics codes on terascale supercomputers. In many scientific applications of interest, the desire is to integrate across multiple scales and physics regimes rather than resolve further a single regime. For example, a computational scientist might wish to federate several extant simulation codes (each scalable in its own right) into a single multiphysics simulation program. As discussed earlier, such applications require petascale computing power. The question is how to harness this computing power in an application developer-friendly way.

Quantum MD codes are much more expensive than classical MD codes, so Qbox cannot model as many atoms as can ddcMD. On the other hand, Qbox is much more accurate for certain classes of problems. These are but two codes in a family of MD codes being used at LLNL for various scientific and programmatic applications.

One approach to building such multiphysics application codes is cooperative parallelism, an MPMD programming model discussed in Section 3.3. Although this programming model is still in its early development, some work is already being done to show how it can facilitate the development of multiphysics applications. Specifically, a team has modified a material modeling code so that the coarse-scale material response computation uses parameters computed from fine-scale polycrystal simulations [4]. If the fine-scale parameters were recomputed each time they were needed, these computations would overwhelm the execution time of the simulation and make it infeasible to complete in a reasonable period of time. However, using a technique known as adaptive sampling (derived from in situ adaptive tabulation [16]), the simulation can eliminate many of these expensive calculations. It does so by storing results from early fine-scale computations in a database. Subsequent fine-scale computations can interpolate or extrapolate values based on previously stored results, if the estimated error is not too large. When it does become necessary to perform a full fine-scale computation, the simulation assigns the work to a server, as illustrated in Figure 3.1. So far, this application has run on just over 4000 cores, but the combined effect of the reduced need for fine-scale computations (because of adaptive sampling) and the improved load balance (because of the server pool model) has demonstrated performance gains ranging from one to two orders of magnitude, depending on the problem and the desired accuracy. Moreover, the application shows how MPMD programming can exploit several kinds of parallelism concurrently.
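The store-then-reuse logic of adaptive sampling can be sketched in a few lines. This is a deliberately simplified, one-dimensional illustration and not the method of [4] or [16]: the class name, the toy fine-scale model, the first-order extrapolation, and the curvature-based error estimate are all assumptions chosen for clarity. In the cooperative-parallel application the full fine-scale computation would be assigned to a server; here it is simply called in place.

```python
import math


class AdaptiveSampler:
    """Reuses stored fine-scale results when the estimated error is small."""

    def __init__(self, fine_scale_model, tolerance, curvature_bound):
        self.fine_scale_model = fine_scale_model  # expensive: x -> (value, slope)
        self.tolerance = tolerance                # largest acceptable error estimate
        self.curvature_bound = curvature_bound    # assumed bound on |second derivative|
        self.database = []                        # stored (x, value, slope) samples
        self.fine_scale_calls = 0

    def evaluate(self, x):
        if self.database:
            # Nearest stored sample and a first-order extrapolation from it.
            x0, f0, df0 = min(self.database, key=lambda s: abs(s[0] - x))
            dx = x - x0
            # Taylor remainder bound used as the error estimate.
            error_estimate = 0.5 * self.curvature_bound * dx * dx
            if error_estimate <= self.tolerance:
                return f0 + df0 * dx
        # Estimated error too large (or empty database): run the full
        # fine-scale computation and store the result for later reuse.
        value, slope = self.fine_scale_model(x)
        self.fine_scale_calls += 1
        self.database.append((x, value, slope))
        return value


def toy_fine_scale_model(x):
    """Hypothetical stand-in for an expensive fine-scale simulation."""
    return math.sin(x), math.cos(x)


sampler = AdaptiveSampler(toy_fine_scale_model, tolerance=1e-3, curvature_bound=1.0)
queries = [0.01 * i for i in range(200)]
results = [sampler.evaluate(x) for x in queries]
max_error = max(abs(r - math.sin(x)) for r, x in zip(results, queries))

print(f"fine-scale calls: {sampler.fine_scale_calls} of {len(queries)} queries")
print(f"largest actual error: {max_error:.2e}")
```

In this toy setting most queries are answered from the database, and the tolerance controls the trade-off between accuracy and the number of full fine-scale computations. A production scheme would work in many dimensions and would need a more careful error estimate and database search.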