PARALLEL COMPUTING - Computational Wave Field Modeling in Anisotropic Media

9.1 What is Parallel Computing?

Parallel computing is a type of computation in which multiple calculations are performed at the same time. It rests on the observation that a large sequential (serial) problem can, in most cases, be split into smaller subproblems that are solved simultaneously and therefore more efficiently[85]. The development of computers over the past few decades has advanced the technological and scientific fields by leaps and bounds[86]. The development of parallel computing has further increased, often dramatically, the speed at which information can be processed. This has made it possible to tackle complex research problems in science, engineering, information technology, and other fields that were previously intractable. Parallel computing entered a new phase with the appearance of multicore hardware[87]. Parallel hardware is now used worldwide: newly developed laptops, desktops, and servers all use multicore processors. These platforms require software to be developed in a new manner, one that can fully exploit the benefits of multiple cores.

9.2 Historical Background

The idea of developing parallel hardware and software arose from the shortcomings of conventional serial hardware and software, which run a single job at a time, and from the need for faster computation and better overall performance. The shift in computing architectures from traditional serial machines to machines built from many slower processors working together began in the early 1980s. Although designing parallel computers was difficult, the effort was pursued because of the very high predicted gains. Over time, researchers found ways to use parallel machines for various scientific applications. The rise of parallel computing was a conceptual departure from expensive-to-build supercomputers, since it obtained greater computing power from hundreds, if not thousands, of microprocessors, all running calculations concurrently[88].

These small processors, working together with memory and an interconnect, were the building blocks of a new computing paradigm, leading to the development and advancement of diverse parallel computing architectures over the following decade.

Systems with multiple cores and processors are no longer just the domain of supercomputing but are ubiquitous[86]. New laptops and even mobile phones contain more than one processing core. The mainstream adoption of parallel computing is a result of falling component costs, commonly associated with Moore's law, the observation that the number of transistors on a microchip doubles about every two years while the cost of computers is halved[89]. As a result, the cost of hardware such as CPUs has gone down and various multicore chips have been released, such as the 64-core Tilera TILE64 and the Cell BE. These chips were a natural evolution of the multisocket platforms of the mid- to late 1990s, i.e., machines that could host several CPUs, each on a separate chip[90]. Manufacturers such as Dell and Apple have produced even faster machines for the home market that easily outperform the room-sized supercomputers of old. Devices that contain multiple cores allow us to explore parallel programming on a single machine.

The development of multicore CPU hardware was followed by the arrival of GPGPU (general-purpose computing on graphics processing units), i.e., the concept of using a graphics processing unit (GPU) for general-purpose computing[91]. Although a single GPU core cannot compete with a contemporary CPU core, the massively parallel architecture of a GPU, with hundreds or thousands of cores connected to high-bandwidth, high-speed RAM, more than overcomes this disadvantage. As a result, for suitably parallel workloads the computational throughput of a GPU can be far higher than that of a CPU. GPGPU also has a definite advantage in energy consumption: more computation can be done with less energy. This is especially critical in the server and cloud infrastructure domain, where the cost of the energy consumed by a CPU over its operational lifetime can be much higher than its purchase price. The development of GPGPU technology is a revolutionary step. It enables the solution of problems that were previously out of reach with contemporary single-core or even multicore CPU technology. Although these multicore architectures have significantly improved computational performance, they require an explicit redesign of algorithms so that a traditional serial program can be transformed into a parallel program.

9.3 Why Parallel Computing?

Before the development of multicore processors, most programs were written for sequential operation on the then-standard single-core systems. Even after the progress in hardware, many researchers in various fields still mostly write sequential programs because they are unaware of the presence and prospects of multiple cores and parallel computing. Although many programs obtain acceptable performance on a single core, researchers need to be made aware of the enormous runtime improvements that parallel computing can provide. Some researchers unknowingly try to utilize multiple cores by running multiple instances of the same program. Simply running a serial program on a computer with a multicore processor will not improve its performance either. The reason is simple: serial programs are not designed to utilize multiple processors and are therefore unaware of their existence. Consequently, the performance of such a program on a multiprocessor system is the same as its performance on a single processor of that system. That is not what we want. Rather, we want the program to run faster and more efficiently. To achieve this goal, we need to transform our serial programs into parallel programs and utilize the multiple cores in current computers to the fullest extent[92]. This can be done in two ways: 1) develop software and libraries that automatically convert serial programs into parallel programs, or 2) manually rewrite the serial program so that it can be executed in parallel.

9.4 How to write a parallel program?

We know that faster computation requires parallel computation, and that we obtain it by writing programs that support parallel computing. There are many theoretical ideas about how to write a parallel program. However, the most fundamental concept is to partition the work at hand into smaller pieces to be completed among the cores. There are two extensively used approaches: task-parallelism and data-parallelism[86].

In task-parallelism, we partition the various tasks carried out in solving the problem among the cores, i.e., there are multiple tasks that need to be done. This form of parallelism covers the simultaneous execution of different functions on multiple cores, on the same or on multiple machines, across the same or different datasets. It focuses on executing different operations in parallel to fully utilize the available processors and memory. One way to do so is to create threads, each responsible for performing a different operation. For example, suppose you have a large data set and want to know its minimum, maximum, and average values. Different processors can each look at the same data set and compute one of the three answers. So, in task-parallelism you divide up the work to be done.
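The minimum/maximum/average example can be sketched in a few lines of Python. This is only an illustration, not part of the original work: the function name `summarize` and the sample values are assumptions, and threads are used purely to show the structure (in CPython, CPU-bound tasks would typically need processes rather than threads to run truly in parallel).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def summarize(data):
    """Run three different tasks (min, max, mean) concurrently on the same data."""
    # Each entry is a distinct task; all of them read the same data set.
    tasks = {"min": min, "max": max, "mean": fmean}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        # One worker per task: the work is divided by operation, not by data.
        futures = {name: pool.submit(func, data) for name, func in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(summarize([4.0, 8.0, 15.0, 16.0, 23.0, 42.0]))
```

Each worker receives the whole data set and a different function, which is the defining feature of task-parallelism.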

In data-parallelism, we partition the data used in solving the problem among the cores, and each core carries out the same operations on its part of the data. This form of parallelism focuses on distributing the data set across multiple processors. It is the simultaneous execution on multiple cores of the same function across the elements of a dataset. One way to do so is to divide the input data into subsets and pass them to threads that perform the same task on different CPUs. For example, you may have a lot of data to process, such as the pixels in an image or a large number of payroll cheques to update. Taking that data and dividing it among multiple processors is a way of obtaining data-parallelism.
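A minimal Python sketch of the pixel example follows. The helper names (`split_into_chunks`, `brighten`, `brighten_parallel`), the brightening operation, and the default of four workers are assumptions chosen for illustration; threads again stand in for cores to show the structure rather than to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def brighten(chunk):
    """The single operation that every worker applies to its own chunk."""
    return [min(pixel + 10, 255) for pixel in chunk]


def brighten_parallel(pixels, workers=4):
    """Apply the same function to different parts of the data concurrently."""
    chunks = split_into_chunks(pixels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(brighten, chunks))
    return [pixel for chunk in results for pixel in chunk]


if __name__ == "__main__":
    print(brighten_parallel([0, 100, 250, 255]))
```

Here every worker runs the same function, and only the portion of the data it receives differs.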

For a clear distinction between the two forms of parallelism, consider an example that showcases both. Suppose that a class of one hundred students took a midterm exam consisting of four questions, and that the class has four graders: A, B, C, and D. To grade the exam, the four graders can use either of two approaches. 1) Each grader can grade one question on all one hundred papers, i.e., A grades question 1, B grades question 2, and so on. 2) They can divide the one hundred exam papers into four subsets of twenty-five papers each, and each grader can grade all the papers in one subset, i.e., A grades the papers in the first subset, B grades those in the second subset, and so on.

In both approaches the “cores” are the graders. The first approach can be considered task-parallelism. There are four tasks to be carried out: grading the first question, grading the second question, and so on. The graders are therefore “executing different tasks in parallel”. The second approach, on the other hand, can be considered data-parallelism. The “data” are the students’ exam papers, which are divided among the cores, and each core grades all four questions in each paper.

9.5 Understanding the patterns of program structure

To select the appropriate parallelization approach, it is important to understand the patterns of program structure. Parallel program structure patterns fall into two major categories[93]:

1) Globally Parallel, Locally Sequential (GPLS): GPLS implies that the program can perform multiple tasks in parallel, but each of the tasks runs sequentially. Prominent patterns in this category include single program, multiple data (SPMD) and multiple program, multiple data (MPMD).

2) Globally Sequential, Locally Parallel (GSLP): GSLP implies that the program runs as a sequential program, but some of its tasks can run in parallel when required. Prominent patterns in this category include fork/join and loop parallelism.
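As a rough illustration of the GSLP category, the following Python sketch shows a program that is sequential overall but forks workers for one loop whose iterations are independent, then joins before continuing. The function names and the toy computation (a sum of squares of the non-negative inputs) are assumptions made for the example only.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Body of the loop; each iteration is independent of the others."""
    return x * x


def run_pipeline(values, workers=4):
    # Sequential step: prepare the input.
    cleaned = [v for v in values if v >= 0]

    # Fork: the loop iterations are handed out to a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squares = list(pool.map(square, cleaned))
    # Join: leaving the `with` block waits for all workers to finish.

    # Sequential step: combine the results.
    return sum(squares)


if __name__ == "__main__":
    print(run_pipeline([3, -1, 4]))
```

Only the loop in the middle runs concurrently; the surrounding steps execute one after another, which is what distinguishes GSLP from GPLS.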

9.6 Types of Parallel Hardware

It is essential to understand the architectural characteristics of parallel machines. In 1966, Michael Flynn introduced a taxonomy of computer architectures that classifies systems by the number of instruction streams and the number of data streams they can handle simultaneously. Each can be either single or multiple, so their combination produces four possible classes:

1) Single Instruction, Single Data (SISD): A simple system that executes one instruction at a time, operating on a single data item. A von Neumann system has a single instruction stream and a single data stream, so it is classified as a SISD system. Nowadays, most CPUs have a multicore configuration, and each of those cores can be considered a SISD machine.

2) Single Instruction, Multiple Data (SIMD): A system that executes a single instruction at a time, but the instruction can be applied to multiple data items. This type of system often executes its instructions in lockstep: the first instruction is applied to all the data items simultaneously, and only then are the subsequent instructions applied. This type of parallel system is usually employed in data-parallel programs, in which the data are divided among the processors and each data item is subjected to the same set of instructions. Vector processors were the first systems that followed this concept. Graphics processing units are also classified as SIMD systems at the level of their building blocks: the Streaming Multiprocessor (SM, in Nvidia terminology) or the SIMD unit (in AMD terminology).

3) Multiple Instructions, Single Data (MISD): A system that executes multiple instructions on a single data item, i.e., performs different operations on the same data. Systems built on the MISD model are not useful in most applications. However, the model is useful when fault tolerance is required in a system.

4) Multiple Instructions, Multiple Data (MIMD): A system that executes multiple independent instruction streams, each of which can operate on its own data. It is considered the most versatile class. A MIMD system is, in practice, a collection of autonomous processors that execute independently. Multicore machines, including GPUs, follow this concept. GPUs are made from a collection of SM/SIMD units, each of which can execute its own program. So, although each unit is considered a SIMD system, collectively they behave as a MIMD system.

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, massively parallel processors (MPPs), and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors, for accelerating specific tasks.

9.7 Types of Parallel Software

Parallel processing software manages the execution of a program's tasks on a parallel computing architecture by distributing large application workloads among multiple CPU or GPU cores of the underlying architecture, thereby reducing runtime. Specific algorithms are built for efficient task processing. Such software is used to solve large and complex back-end computations. It manages the division and distribution of tasks between processors. Its primary purpose is to use multiple processor cores so that throughput, application availability, and scalability are optimal for the end user.

Parallel programming languages

Various parallel programming languages, libraries, APIs, and parallel programming models have been developed for parallel computing. They are based on the underlying memory architecture, which is either shared memory or distributed memory. In a shared-memory system, the cores share access to the computer’s memory, i.e., each core can read and write each memory location. In a distributed-memory system, conversely, each core has its own private memory, and the cores must communicate explicitly, for example by sending messages across a network. Shared-memory programs communicate by manipulating shared variables, whereas distributed-memory programs use message passing. POSIX Threads and OpenMP are two of the most widely used shared-memory APIs, whereas the Message Passing Interface (MPI) is the most widely used message-passing API.
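The difference between the two communication styles can be sketched in Python using only the standard library. This is an analogy under stated assumptions, not Pthreads, OpenMP, or MPI code: the function names are hypothetical, both variants run as threads inside one process, and a `queue.Queue` merely imitates the explicit send/receive of a message-passing system.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers communicate by updating one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # protect the shared location from concurrent updates
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Workers keep their data private and send partial results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # explicit "send" of a private result

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    # Explicit "receive": one message is expected from each worker.
    total = sum(mailbox.get() for _ in chunks)
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_memory_sum(data), message_passing_sum(data))
```

In the first variant correctness depends on synchronizing access to the shared variable; in the second, no state is shared and all coordination happens through the messages themselves.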

Automatic parallelization

Automatic parallelization, also called auto-parallelization, refers to converting a serial program into a parallel program that employs multiple processors simultaneously in a shared-memory multiprocessor (SMP) system[94]. The objective of automatic parallelization, as the name implies, is to automate the transformation of a serial program into a parallel one so that programmers do not have to go through the tedious and error-prone manual parallelization process[95]. Though the quality of automatic parallelization has improved over the past several decades, fully automatic parallelization of sequential programs by compilers is still far from the norm. Auto-parallelization mostly focuses on loops, since most of a program's execution time is spent inside loops.

Mainstream parallel programming languages remain either explicitly parallel or (at best) partially implicit, in which a programmer gives the compiler directives for parallelization. A few fully implicit parallel programming languages exist: SISAL[96], Parallel Haskell, SystemC (for FPGAs), Mitrion-C, VHDL, and Verilog.

9.8 CPU vs GPU Parallel Computing

With the advancement of parallel computing, serial programs are being parallelized most prominently using two different approaches: CPU and GPU. A CPU primarily decides what to do to a data item depending on the tasks that have already been completed. Parallel programming for CPUs is about distinguishing instructions that can take place simultaneously from those that must take place in sequence, and scheduling them accordingly. A GPU primarily decides what to do to a data item based on its location among other data items. Parallel programming for GPUs is about subdividing the input data using a coordinate system that you define, in order to distinguish between data items that need to be processed with different instructions.
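The idea of a coordinate system over the data can be illustrated with a plain-Python imitation of how a GPU kernel derives the index of its data item from block and thread coordinates (in CUDA terms, `blockIdx.x * blockDim.x + threadIdx.x`). Everything here is a sequential stand-in written for illustration: `launch`, `scale_kernel`, and `scale` are hypothetical names, and the nested loops only emulate what a GPU would execute concurrently.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a GPU launch: call the kernel once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, out, factor):
    """Each 'thread' locates its own data item from its coordinates."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(data):  # threads past the end of the data do nothing
        out[i] = factor * data[i]


def scale(data, factor, block_dim=4):
    """Multiply every element by `factor` using the emulated launch."""
    out = [0] * len(data)
    # Enough blocks to cover all elements (ceiling division).
    grid_dim = (len(data) + block_dim - 1) // block_dim
    launch(scale_kernel, grid_dim, block_dim, data, out, factor)
    return out


if __name__ == "__main__":
    print(scale([1, 2, 3, 4, 5], 2))
```

With five elements and a block size of four, two blocks are launched and the last three threads of the second block fall outside the data, which is why the bounds check in the kernel is needed.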

A CPU performs parallel computing using its cores. Each core is powerful, with significant processing power, but the number of tasks a CPU can run at the same time is limited by its core count. A GPU, by comparison, has hundreds of cores, each with limited processing power. However, because all of these weaker GPU cores execute the same instruction at a time on different data, the GPU architecture can finish suitable jobs much faster. The important point is that a GPU is a system, not simply a single processor. It is organized for graphics problems, with a massively parallel architecture consisting of thousands of smaller, efficient cores designed for handling multiple tasks simultaneously.

A CPU, on the other hand, is organized for sequential processing and has a large amount of cache, with coherency and isolation between its two to 12 cores (72 in a new Intel announcement).

CPU parallel computing is done by writing parallel programs using the Message Passing Interface (MPI), POSIX threads (Pthreads), and OpenMP, three of the most widely used application programming interfaces (APIs) for parallel programming[86]. MPI and Pthreads are libraries of type definitions, functions, and macros that can be used with various compilers. Pthreads and OpenMP were designed for programming shared-memory systems and provide mechanisms for accessing shared-memory locations. MPI, on the other hand, was designed for programming distributed-memory systems and provides mechanisms for sending messages. OpenMP is a relatively high-level extension to C/C++ that can parallelize a loop with a single directive. Pthreads, in turn, provides some coordination constructs that are unavailable in OpenMP. OpenMP thus allows us to parallelize many programs with relative ease, while Pthreads provides constructs that make other programs easier to parallelize.

Similarly, GPU parallel computing has advanced by leaps and bounds since the first early attempts. Current tools cover a wide range of capabilities as far as problem decomposition and expressing parallelism are concerned. At one end of the spectrum are tools that require explicit problem decomposition, such as CUDA and OpenCL, and at the other end are tools like OpenACC that let the compiler take care of all the data migration and thread spawning necessary to complete a task on a GPU[90].

9.9 Implementation of Parallel Computing in the Distributed Point Source Method (DPSM) Problem

This problem focuses on the development of procedures to perform computational non-destructive evaluation (NDE) modeling or simulation with the distributed point source method (DPSM). We want to develop an automated, one-stop wave simulation platform for any anisotropic medium with damage scenarios such as material degradation, delamination, etc. The simulation platform is developed by combining knowledge of the physics of wave propagation with the computational technique, DPSM, to model the wave field in an accurate and computationally efficient
