Active Disk System
Chapter 6: Software Structure
6.3 Database System
6.3.4 Active Disk Code
The code for PostgreSQL processing on the Active Disks can be divided into four categories. The disk must include the code for basic page layout, for the layout of tuples or records within a page, support for operations on the data types required by a particular
query (operators such as less-than, greater-than, equality, plus, or minus), and support for the core database functions. This means that all the infrastructure for query parsing, query optimization, and recovery does not have to be duplicated at the drive. Table 6-3 gives the sizes and breakdown of the major portions of code required at the drives in the Postgr- eSQL prototype.
6.3.4.1 Page Layout
The code required at the drive must understand the layout of a PostgreSQL page. Since database pages come in fixed-length chunks (8 KB, in the case of PostgreSQL), this is relatively easy to manage at the Active Disks. There are no concerns about alignment or pages that span multiple disks, as there is in the case of arbitrary filesystem files. How- ever, this code will still need to be specific each database system being used, Oracle and Informix, for example, will have their own page layout formats.
6.3.4.2 Tuple Layout
The code for dealing with database schemas and tuple layouts, as well as NULL- handling, must also be specific for each database system. There must be a way to describe this structure between the host-based code and the drive code. In the execution of Postgr-
Table 6-3 Code sizes for the Active Disk portions of PostgreSQL. Size of the various portions of the Active Disk code for the database system.
Module Original Modified Host
(New & Changed)
Active Disk
Files Code Files Code Files Code
access 72 26,385 - - 1 838 bootstrap 2 1,259 - - - - catalog 43 13,584 - - - - commands 34 11,635 - - - - executor 49 17,401 9 938 4 3,574 parser 31 9,477 - - - - lib 35 7,794 - - - - nodes 24 13,092 - - 6 4,130 optimizer 72 19,187 2 620 - - port 5 514 - - - - regex 12 4,665 - - - - rewrite 13 5,462 - - - - storage 50 17,088 1 273 - - tcop 11 4,054 - - - - utils/adt 40 31,526 - - 2 315 utils/fmgr 4 2,417 - - 1 281 utils 81 19,908 - - 1 47 Total 578 205,448 12 1,831 15 9,185 New 1,257
eSQL with Active Disks, the tuple format is one of the parameters passed to the Active Disk code when it is initialized.
6.3.4.3 Data Type Operators
Operators for dealing with basic calculations on database fields must be present at the drive for all the data types that the drive supports. This includes comparison operators for scanning and sorting and the arithmetic operators for aggregation.
In PostgreSQL, the type system is extensible through user-provided functions, and these functions are written in C and linked with the core database code. For the prototype Active Disk system, only the operators necessary for completing the TPC-D queries were ported to the drive, and the total code size of these pieces is shown in Table 6-3.
6.3.4.4 Core Operations
The basic database operations on Active Disks - scan, semijoin, sort, and aggrega- tion can be used in common among multiple database systems. These operations require user-provided routines for comparing and merging tuples, but the code for the basic scan, sort, and replacement selection with aggregation can be common among multiple database platforms. The basic operators are scan, semijoin, and aggregate.
6.3.4.5 Scan
The scan primitive at the drive supports all simple scans, using a static condition that is evaluated for every tuple in the relation, such as the where clause from Query 1 of TPC-D that specifies l_shipdate <= ‘1998-09-02’. The tuple_desc describes the layout of tuples within the relation and qual_expression gives the con- dition to be evaluated, which is can be any SQL condition, including constants, expres- sions, and references to individual fields in the relation being processed.
When used in a database system that allows user-defined-functions and the addition of user-defined abstract data types, this scan primitive with the appropriate data types could be used to implement all of the data mining and multimedia applications described in Chapter 4. This would require providing the user-defined functions and data type opera-
typedef struct database_scan_param_s {
char* tuple_desc; /* format of the tuples on disk */ char* qual_expression;
} database_scan_param_t;
int database_setup_scan(database_scan_param_t params) void database_scan(char* buffer, unsigned int len,
char* output,unsigned int *out_len, unsigned int max_len) void database_complete_scan(void)
tors that operate as disk functions, just as the basic data type operators must be made avail- able at the Active Disks.
6.3.4.6 Semijoin
The semijoin primitive at the drive supports the semijoin portion of a full join operation. The tuple_desc describes the layout of tuples within the relation and join_key gives the field that is being joined. The join_filter is the bloom filter representing a list of join key values, such as those for l_partkey from Query 9 or l_suppkey from Query 5.
When used in a multiple-pass algorithm such as the hybrid hash join [DeWitt90], the bloom filter is simply set to select only tuples from the the partition that is currently being processed. Depending on the number of partitions, simply re-scanning the entire relation and returning the matching tuples will be more efficient than writing the partitioned records during the initial scan phase.
6.3.4.7 Aggregate/Sort
The aggregate primitive at the drive supports both aggregation and generic sort- ing. The basic algorithm used is replacement selection in both cases. Aggregates are sim- ply a special case where records with matching keys are merged using the aggregation function, rather than being inserted into adjacent slots in the sorted heap. In the case of sorting, the aggr_expr is null and no merge operation is provided, all records are sim-
typedef struct database_semijoin_param_s {
char* tuple_desc; /* format of the tuples on disk */ char* join_keys;
char* join_filter; } database_semijoin_param_t;
int database_setup_semijoin(database_scan_param_t params) void database_semijoin(char* buffer, unsigned int len,
char* output,unsigned int *out_len, unsigned int max_len) void database_complete_semijoin(void)
ply output in sorted order. If the amount of memory at the Active Disks exceeds the size of the sorted output, then separate sorted runs are output and must then be merged at the host. This allows a single, flexible primitive to be used for both operations.
The use of replacement selection as the basic algorithm provides the benefit of longer average run length, adaptivity in the face of changing memory conditions (the abil- ity to give up memory pages as the operation progresses, and make use of additional mem- ory page that become available [Pang93a]), and support for the merging operation necessary to perform aggregation while sorting.