Active Disk System
Chapter 6: Software Structure
6.1 Application Structure for Active Disks
This section provides an outline of the structure of applications that execute on Active Disks, including that design philosophy, the structure of the on-drive code and the types of changes required for the code that remains on the host.
6.1.1 Design Principles
The basic design principles of developing an application to run in an Active Disk setting are to 1) expose the maximum amount of parallelism in the data-intensive portion of the processing, 2) isolate the code to be executed at the disks as much as possible to form self-contained and manageable units, and 3) utilize adaptive primitives to take full advantage of variations in available resources during execution. These three goals allow the largest amount of performance and flexibility in the placement and execution of the application code.
6.1.2 Basic Structure
The basic structure of an Active Disk application is that a “core” piece of code will run at each of the drives, while any high-level synchronization, control, or merging code continues to run at the host. This creates a client-server parallel programming model as illustrated in Figure 6-1. Input parameters are initialized at the host and distributed to all of the disks. Each of the disks then computes on its own local data and produces its portion of the final result. The host collects the results from all the disks and merges them into the final answer. Since there is a portion of code that runs at the host and processes data from
the disks, it is always possible to “fix up” an incomplete computation at the drive, as long as the drive functions always act conservatively. If they improperly filter a record that should have been part of the result, for example, then the host code will never catch it.
The high-level structure of an Active Disk application is similar to the normal pro- cessing loop that simply uses that basic filesystem calls to do I/O. At an abstract level, any data processing application will have a structure similar to the following:
The challenge for Active Disks is to specify the code for steps (3) and (4) in such a way that this portion of the code can be executed directly at the disks, and in parallel. This means there must not be any dependence on global state, or requirements for ordering in the processing of the blocks, because the execution will occur in parallel across all the disks on which F is stored. In all of the applications discussed here, this extraction of the appropriate code was done manually within the source code, although it should be possi- ble to do a significant portion of this extraction automatically, or at least provide tools to aid the programmer in identifying candidate sections of code and eliminating global dependencies.
As a specific example of this two-stage processing, operating in parallel on the sep- arate blocks in a file and then combining all these partial results, consider the frequent sets application. It operates on blocks of transaction data individually and converts them into itemset counts. The itemsets counts for all the blocks are then combined at the host simply
Figure 6-1 Basic structure of Active Disk computation. The host initializes the computation, each of the drives computes results for its local data, and these results are combined by the host.
Initialize Parameters Compute Local Results Merge Results Input Final Result (1) initialize
(2) foreach block(B) in file(F)
(3) read(B)
(4) operate(B) -> B’
(5) combine(B’) to result(R)
by summing the individual counts, as shown in Figure 6-2. The operate step for fre- quent sets takes raw transaction data and converts it into an array of counts of candidate itemsets. The figure shows the 2-itemset phase, where the counts are stored as a two- dimensional array and a 1 is added whenever a pair of items appears together in a transac- tion (in a particular shopper’s basket). This set of counts can be calculated independently at each of the disks, and the combine step merges the arrays from all the disks by simply summing the values into a final array. There are no global dependencies once the initial list of candidate itemsets is distributed, and each block of transactions can be processed in parallel and in any order.
The search application is partitioned in a similar manner, with the operate step choosing the k closest records on a particular disk, and the combine step choosing the k closest records across all these individual lists. As with the frequent sets application, the serial fraction of the search application is orders of magnitude less expensive than the par- allel, counting fraction of the application, which leads to the linear speedups shown in Chapter 5.
The edge detection and image registration applications operate purely as filters, so they consist of completely parallel operate phases that convert an input image into a set of output parameters. These is no combine step, as the results are simply output for each image, without being combined further at the host. Since these two applications contain essentially no serial fraction (every application has a tiny serial fraction in overhead to
combine
operate
combine
Figure 6-2 Basic structure of one phase of the frequent sets application. The blocks of transaction data are converted into itemset counts, which can then be combined into total counts.
0002 C G F K P A B C D E 0004 C G E X Y 0003 A B C D V C 1 1 1 - - B 1 - - - - A - - - - - E 0 0 1 0 - D 1 0 0 - - opera te 0296 B E L O X A B C D E 0298 D E M R T 0297 A Y D G J C 0 0 0 - - B 0 - - - - A - - - - - E 0 1 0 1 - D 1 0 0 - - A B C D E C 1 1 1 - - B 1 - - - - A - - - - - E 0 1 1 1 - D 2 0 0 - -
start the computation across n disks, however, this cost is quickly amortized when used across reasonably-sized data sets), they also show linear scalability in the prototype results from Chapter 5.