
5.2 Adaptive Tile Matrix

5.2.4 Internal Structure

The organization of the adaptive tiles in the AT Matrix is internally realized via a non-periodic grid index. It is similar to the organization of the Grid File (Nievergelt et al., 1984), but instead of buckets, the two-dimensional grid indexes adaptive matrix tiles. Figure 5.3 sketches the different physical components of the AT Matrix. We utilize a dynamic 2D array that contains references to tile instances. The tile instances are regular instances of either a sparse (CSR) or a dense matrix in the column-oriented storage layer, as described in Section 3.4. In order to query matrix A, the AT Matrix contains a row-dimension index and a column-dimension index, in the form of two separate search trees. Note that having two separate indexes is advantageous over a single, linearized tree such as the LAB-tree (Zhang et al., 2011), since most operations process matrix tiles in either row or column direction, e.g., in the blocked matrix multiplication. The leaf nodes map from a range in absolute matrix coordinates to the corresponding index in the grid. The leaf ranges are non-periodic and depend on the actual topology of the matrix. Let the matrix in Figure 5.3 have the dimensions 3500 × 3500. Hence, with b_atomic = 1024, there can be at most ⌈3500/1024⌉ = 4 intervals in each direction, and thus a maximum of 4 × 4 = 16 tiles in total. Due to the particular topology of A, the row dimension is split into only three intervals {[0,1023], [1024,2047], [2048,3499]}. Furthermore, the example matrix is symmetric; hence, the row- and column-dimension indexes are equivalent. In general this is not the case, and the grid can index arbitrary non-symmetric matrices. It is worth mentioning that each grid entry refers to an area that has at least the size of a logical block, but covers at most one physical tile. In particular, one matrix tile is potentially referenced by multiple grid entries. For instance, Grid[0][2] and Grid[1][2] both point to tile e in Figure 5.3.
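
The coordinate translation described above can be illustrated with a minimal Python sketch. It is not the actual implementation: the search trees are replaced by sorted lists with binary search, the names `row_starts`, `col_starts`, `grid_index`, and `locate` are hypothetical, and all tile labels except `e` (which the text places at Grid[0][2] and Grid[1][2]) are placeholders.

```python
from bisect import bisect_right

# Assumption: each dimension index is modeled as a sorted list of interval
# start offsets instead of a search tree.
N = 3500
row_starts = [0, 1024, 2048]   # [0,1023], [1024,2047], [2048,3499]
col_starts = [0, 1024, 2048]   # identical, since the example matrix is symmetric

# Grid of tile references. Only "e" is taken from the text; the other
# labels are placeholders chosen for this sketch.
grid = [["a", "b", "e"],
        ["c", "d", "e"],
        ["f", "g", "h"]]

def grid_index(starts, size, x):
    """Map an absolute coordinate x to the index of its interval."""
    if not 0 <= x < size:
        raise IndexError(x)
    return bisect_right(starts, x) - 1

def locate(i, j):
    """Translate absolute coordinates (i, j) into (ti, tj, tile reference)."""
    ti = grid_index(row_starts, N, i)
    tj = grid_index(col_starts, N, j)
    return ti, tj, grid[ti][tj]
```

In this sketch, `locate(500, 3000)` and `locate(1500, 2048)` land in different grid entries but return the same tile reference, which mirrors the case of one physical tile spanning several grid cells.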

[Figure 5.4 plot. Panels: (a), (b). x-axis: Matrix Instance (R1 to R4). y-axis: Time [s]. Series: CSR, ATmult.]

Figure 5.4: The performance for reading 2000 random single rows (a) and columns (b) of an AT Matrix vs. a naive CSR representation for several real world matrices of Table A.1.

The translation of absolute matrix coordinates (i, j) into the grid coordinates (ti, tj) and the additional indirection naturally incur some overhead. However, for set- or block-based algorithmic patterns, whole blocks are queried rather than single matrix elements, so many index lookups can be avoided. An example of such a pattern is the tiled matrix multiplication operator, which will be explained in greater detail in Section 5.3. Since the tile grid is stored in row-major order, the operation control flow can be shifted to work directly on the grid. In particular, the matrix multiplication is driven by iterating over grid coordinates (ti, tj), and operates at the level of complete tiles rather than single elements. In addition, the AT Matrix contains a hash table that maps each tile reference to its absolute position in matrix A. This is necessary because tile instances themselves do not carry any metadata, i.e., they are “unaware” of their membership in the AT Matrix.
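
As a rough illustration of grid-driven control flow, the following Python sketch walks the grid in row-major order and looks up each tile's absolute position in a dictionary that stands in for the hash table. The names `tile_origin` and `distinct_tiles`, the tile labels, and the origin offsets are assumptions made for this example; the actual multiplication operator of Section 5.3 is not reproduced here.

```python
# Placeholder grid; "e" is referenced by two grid entries, as in the text.
grid = [["a", "b", "e"],
        ["c", "d", "e"],
        ["f", "g", "h"]]

# Assumption: the hash table is modeled as a dict from tile reference to the
# absolute (row, column) offset of the tile's upper-left corner in A.
tile_origin = {
    "a": (0, 0),    "b": (0, 1024),    "e": (0, 2048),
    "c": (1024, 0), "d": (1024, 1024),
    "f": (2048, 0), "g": (2048, 1024), "h": (2048, 2048),
}

def distinct_tiles(grid):
    """Yield (ti, tj, tile) once per physical tile, in row-major grid order."""
    seen = set()
    for ti, row in enumerate(grid):
        for tj, tile in enumerate(row):
            if tile not in seen:   # skip repeated references to a spanning tile
                seen.add(tile)
                yield ti, tj, tile

visited = [(tile, tile_origin[tile]) for _, _, tile in distinct_tiles(grid)]
```

The point of the sketch is that the nine grid entries yield only eight tile visits, and no per-element coordinate translation is needed once a tile has been selected.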

Accessing Rows and Columns

Many algorithms work on complete matrix rows, complete matrix columns, or both. As a consequence, data structures are usually chosen such that they offer good access performance in one of the two directions, by indexing either the rows (CSR) or the columns (CSC). However, there are also use cases where access in both directions is desired. For example, imagine an adjacency matrix of a directed graph, where one is interested in both the outgoing (row access) and the incoming edges (column access) of a certain vertex. In order to fetch a single column from the CSR structure, a full scan of all elements is required, and the same holds for fetching a single row from CSC. A naive workaround for this problem is to keep two copies of the matrix, one in a CSR and the other in a CSC representation. Obviously, this is an undesirable solution for very large matrices that occupy a significant fraction of the system’s memory. Moreover, the manipulation of a matrix in such a setup creates a considerable synchronization overhead, since both copies have to be maintained in parallel.
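
The asymmetry between row and column access in CSR can be made concrete with a small Python sketch. The arrays and the helper names `csr_row` and `csr_col` are illustrative assumptions, not part of the system described here.

```python
# CSR encoding of the 3 x 3 example matrix
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
values  = [1, 2, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

def csr_row(r):
    """Row read: one contiguous slice, located via row_ptr."""
    lo, hi = row_ptr[r], row_ptr[r + 1]
    return list(zip(col_idx[lo:hi], values[lo:hi]))

def csr_col(c):
    """Column read: every stored element has to be inspected."""
    out = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            if col_idx[k] == c:
                out.append((r, values[k]))
    return out
```

A row read touches only the elements of that row, whereas the column read in this sketch scans all non-zero elements regardless of how few belong to the requested column.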

In contrast, our approach uses a single AT Matrix instance per matrix. Unlike a plain CSR data representation, the tiled substructure of our AT Matrix already lowers the penalty of column reads significantly. This is because only the tiles that intersect the queried column range have to be scanned. Figure 5.4 shows a small study on the single-core read performance of the AT Matrix data structure vs. a plain CSR data structure for row reads (Fig. 5.4a) and column reads (Fig. 5.4b). In this experiment, 2000 randomly selected matrix rows and columns were read sequentially from different matrices of Table A.1, using an Intel Xeon X5650 CPU with 48 GB RAM as platform. The matrix dimensions range from 17,040 (R1) to 45,101 (R4). In the left plot of Figure 5.4, we observe that the overhead of the additional indirection, which is involved in the reads of AT Matrix rows, results in a runtime that is at most doubled compared to CSR. However, for column accesses, the read performance on the AT Matrix is up to 25x faster. As a consequence, the AT Matrix can be considered much more robust against irregular access patterns.
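
The following Python sketch illustrates why tiling reduces the cost of a column read. It assumes a toy 4 × 4 matrix cut into four 2 × 2 sparse tiles, each stored as a CSR triple with tile-local coordinates; the layout, the names `tiles` and `read_column`, and the use of a sorted list in place of the column-dimension index are assumptions for illustration only.

```python
from bisect import bisect_right

# Toy matrix:
# [[1, 0, 0, 2],
#  [0, 3, 0, 0],
#  [0, 0, 4, 0],
#  [5, 0, 0, 6]]
row_starts = [0, 2]   # row intervals [0,1], [2,3]
col_starts = [0, 2]   # column intervals [0,1], [2,3]

# (ti, tj) -> (row_ptr, col_idx, values), all in tile-local coordinates.
tiles = {
    (0, 0): ([0, 1, 2], [0, 1], [1, 3]),
    (0, 1): ([0, 1, 1], [1],    [2]),
    (1, 0): ([0, 0, 1], [0],    [5]),
    (1, 1): ([0, 1, 2], [0, 1], [4, 6]),
}

def read_column(j):
    """Return (row, value) pairs of column j, scanning one grid column only."""
    tj = bisect_right(col_starts, j) - 1     # grid column containing j
    local_j = j - col_starts[tj]             # column offset inside the tile
    out = []
    for ti, r0 in enumerate(row_starts):
        row_ptr, col_idx, values = tiles[(ti, tj)]
        for r in range(len(row_ptr) - 1):
            for k in range(row_ptr[r], row_ptr[r + 1]):
                if col_idx[k] == local_j:
                    out.append((r0 + r, values[k]))
    return out
```

Here a read of column 3 inspects only the three non-zero elements stored in the tiles of the second grid column, instead of all six non-zeros of the matrix.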

Note that unlike the random reads of single rows or columns, many operations such as blocked matrix multiplication are tile-based. Hence, tiles are read as a whole, and the overhead incurred by the additional indirection is significantly lower. Furthermore, the overhead is easily outweighed by the advantages of the tiled, heterogeneous AT Matrix substructure, as can be seen in the evaluation in Section 5.5. Although we used CSR as the sparse tile representation, the concept of the AT Matrix is orthogonal to the physical tile representation. As mentioned in Chapter 3, many algorithms can be written in a way that they are based solely on either of the two representations. For example, the sparse accumulator algorithm sketched in Algorithm 10 is CSR-based, but there is an equivalent CSC version (Gilbert et al., 1992). If, for some reason, the most frequent access pattern is random reads on columns, then CSC can be used instead of CSR to increase the efficiency. For instance, an internal tile optimizer might select the layout based on the user’s priority and the usage statistics of a matrix.