Patent Application: Accelerator for Analytical Workloads

Table 5.7: Inputs to VecOp to answer Q4

Parameter Value Comments

ad1 ad(price) see Table 5.3

ad2 ad(discount) start address of the second vector; see above

size |price| see Table 5.3

type 64-bit fixed point the type of each element on the memory page

vec op multiply vector operation used

agg op sum operation used for aggregation

agg 0.00 see Table 5.3

mask ad 0 no mask is specified

mask N/A not used

invmask N/A not used
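The parameterization in Table 5.7 can be illustrated with a small software model. This is only a sketch under stated assumptions: the function `vec_op`, the list used as "memory", the list indices used as addresses, and the sample values are all hypothetical, and `Decimal` merely stands in for the 64-bit fixed point type.

```python
from decimal import Decimal
import operator

def vec_op(memory, ad1, ad2, size, vec_operation, agg_operation, agg, mask_ad=0):
    """Hypothetical software model of VecOp (not the hardware interface)."""
    if mask_ad != 0:
        # Table 5.7 passes mask ad = 0 (no mask); the masked case is not modeled here.
        raise NotImplementedError("masked variant is not modeled in this sketch")
    for i in range(size):
        # Combine the i-th elements of both vectors, then fold into the aggregate.
        element = vec_operation(memory[ad1 + i], memory[ad2 + i])
        agg = agg_operation(agg, element)
    return agg

# Toy "memory": three prices followed by three discounts (made-up values).
memory = [Decimal("10.00"), Decimal("20.00"), Decimal("30.00"),
          Decimal("0.10"), Decimal("0.05"), Decimal("0.00")]
ad_price, ad_discount, size = 0, 3, 3  # assumed stand-ins for ad(price), ad(discount), |price|

q4_result = vec_op(memory, ad_price, ad_discount, size,
                   vec_operation=operator.mul,   # vec op: multiply
                   agg_operation=operator.add,   # agg op: sum
                   agg=Decimal("0.00"),          # initial aggregate
                   mask_ad=0)                    # no mask specified
print(q4_result)
```

Passing the vector and aggregation operations as arguments mirrors the way the table treats them as parameters rather than fixed behavior.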

5.8 Patent Application: Accelerator for Analytical Workloads

The following figures show possible implementations of the operations introduced in the previous section.

The Filter Engine, shown in Figure 5.6, produces a bit mask that is stored in memory. The Filter Engine consists of a Parallel Compare circuit, which performs 2 · k parallel comparisons between the k input elements from the L4 cache and the lower bound as well as upper bound provided by the CPU instruction. The outcome of those comparisons is given by k bits, which are fed into the MaskBuffer, to create a bit mask, and additionally into an Incrementer to keep track of the number of elements that passed the filter.

The figure shows an instance of the Filter Engine with k = 4 and a scalar width of 64 bits. Figure 5.7 shows the control sequence for the Filter Engine.
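The behavior of the Filter Engine can be sketched in software as follows. This is a minimal model, not the hardware design: the function name `filter_page`, the list-based page, and the sample data are hypothetical, and treating both bounds as inclusive is an assumption.

```python
K = 4  # elements per fetched line, as in the k = 4 instance described above

def filter_page(page, lower, upper, k=K):
    """Hypothetical model: return (bit mask, count) for one page."""
    mask = []   # models the bit mask that is eventually stored in memory
    count = 0   # models the Incrementer's running count
    for start in range(0, len(page), k):
        line = page[start:start + k]  # one fetched line of k elements
        # 2 * k comparisons: each element against the lower and the upper bound.
        # Assumption: both bounds are inclusive.
        hits = [int(lower <= x and x <= upper) for x in line]
        mask.extend(hits)
        count += sum(hits)
    return mask, count

page = [5, 12, 7, 30, 18, 3, 25, 10]  # made-up 64-bit values, two lines of four
mask, count = filter_page(page, lower=7, upper=20)
print(mask, count)
```

In hardware the k comparisons of a line happen in parallel; the list comprehension only models their combined outcome.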

[Figure 5.6: Filter Engine (block diagram). Recoverable labels: L4 (32 B); Page Start Address, Fetch Address, Next Fetch Address (8 B); Parallel Compare; Lower bound and Upper bound (8 B each); Bounds handling (2 bit); Low Buffer and High Buffer; MaskBuffer (4 bit); Mask Address, Mask Fetch Address, Next Mask Store Address (8 B); Incrementer; Count Buffer; Count.]


Start

Software receives request to filter a storage area

Software issues instruction to filter a page

CPU sends command to Filter Engine

Filter Engine fetches line

Filter Engine compares all elements of the line with the upper and lower bound

Filter Engine puts the results in the bitmask buffer and increments count

Bitmask buffer full?

Filter Engine ensures no other CPU works on the cache line of the bit mask address

Filter Engine stores bit mask buffer at the bit mask address and increments the address

Last line in page?

Software adds count to running sum

Last page in storage area? (no: next page; yes: stop)

Figure 5.7: Filter Engine Process Flow

[Figure 5.8: Aggregation under Mask Engine (block diagram). Recoverable labels: L4 (32 B); Page Start Address, Fetch Address, Next Fetch Address (8 B); 5-way Add₆₄ under mask; ResultBuffer; Result (8 B); MaskBuffer (4 bit); Mask Address, Mask Fetch Address, Next Mask Fetch Address (8 B).]

A bit mask, created for example by the Filter Engine, can be used by the Aggregation under Mask Engine (see Figure 5.8) to aggregate all qualifying entries of a memory page. This is achieved by loading the bit mask and chunks of the memory page into the engine. Afterwards, aggregation operations, such as finding the maximum or summing up all qualifying values, are performed with respect to the loaded bit mask. Figure 5.9 visualizes the process flow of the Aggregation under Mask Engine.
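The aggregation under a bit mask can be sketched in software as follows. This is an illustrative model only: the function `aggregate_under_mask`, the list-based page and mask, and the sample data are hypothetical and do not describe the hardware interface.

```python
import operator

def aggregate_under_mask(page, mask, agg_operation, initial, k=4):
    """Hypothetical model: aggregate only the elements whose mask bit is set."""
    result = initial
    for start in range(0, len(page), k):
        line = page[start:start + k]   # chunk of the memory page
        bits = mask[start:start + k]   # matching slice of the bit mask
        for value, bit in zip(line, bits):
            if bit:                    # only qualifying entries contribute
                result = agg_operation(result, value)
    return result

page = [5, 12, 7, 30, 18, 3, 25, 10]  # made-up values
mask = [0, 1, 1, 0, 1, 0, 0, 1]       # made-up mask, e.g. produced by a filter step

masked_sum = aggregate_under_mask(page, mask, operator.add, 0)
# For the maximum, the smallest signed 64-bit value serves as the neutral start value.
masked_max = aggregate_under_mask(page, mask, max, -2**63)
print(masked_sum, masked_max)
```

Passing the aggregation operation and its start value as arguments lets the same loop serve both summation and maximum.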

It is interesting to note that the two engines could be operated in an interleaved fashion instead of sequentially. This would allow the bit mask to be cached in the engine and could thereby reduce the memory traffic for storing and loading the bit mask.
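The difference between sequential and interleaved operation can be illustrated with a toy model that counts how many mask bits travel to and from memory. Everything here is an assumption for illustration: the function names, the inclusive bounds, the sample data, and the traffic accounting are hypothetical and say nothing about actual hardware costs.

```python
def sequential(page, lower, upper, k=4):
    """Filter the whole page first, then aggregate under the stored mask."""
    mask = []  # models the bit mask written to memory by the filter step
    for start in range(0, len(page), k):
        line = page[start:start + k]
        mask.extend(int(lower <= x <= upper) for x in line)
    # Every mask bit is stored once by the filter and loaded once by the aggregation.
    mask_traffic_bits = 2 * len(mask)
    total = sum(x for x, bit in zip(page, mask) if bit)
    return total, mask_traffic_bits

def interleaved(page, lower, upper, k=4):
    """Filter and aggregate line by line; the mask never leaves the engine."""
    total = 0
    for start in range(0, len(page), k):
        line = page[start:start + k]
        line_mask = [int(lower <= x <= upper) for x in line]  # kept in the engine
        total += sum(x for x, bit in zip(line, line_mask) if bit)
    return total, 0  # no mask bits are stored to or loaded from memory

page = [5, 12, 7, 30, 18, 3, 25, 10]  # made-up values
seq_result = sequential(page, 7, 20)
int_result = interleaved(page, 7, 20)
print(seq_result, int_result)
```

Both variants produce the same aggregate; only the modeled mask traffic differs.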


Start

Software receives request to conditionally sum up storage area

Software issues instruction to conditionally sum up page

CPU sends command to Aggregation Engine

Aggregation Engine fetches mask

Aggregation Engine fetches line

Aggregation Engine sums up all qualifying elements of the line

Mask exhausted? (a branch labeled "Next mask" leads back to fetching the mask)

Last line in page?

Software adds page sum to running sum

Last page in storage area? (no: next page; yes: stop)

[Figure 5.9: process flow of the Aggregation under Mask Engine]

5.9 Conclusion

In this chapter we developed a new hardware approach to improve the runtimes of many analytical queries. The idea of bringing computational power closer to the data by introducing it into the caches is a promising compromise compared to intelligent RAM. Not only can filter and aggregation operations be performed at the cache level, but vector operations and compression techniques can also further increase the payload that is brought into the upper levels of the cache hierarchy.

This idea also led to a patent application and will hopefully be used in next-generation mainframes to enable efficient in-memory databases and analytics as the future workload for the IBM System Z.

Appendix A

Additional Results of the "New Workloads for the IBM Mainframe System Z" Project

A.1 Algorithms

The survey paper Top Ten Algorithms in Data Mining [93] served as a starting point to identify the most important problems and algorithms in data mining workloads. Table A.1 lists those ten algorithms.

Algorithm Type

C4.5 Classification

k-means Clustering

Support Vector Machines Classification

Apriori Frequent Itemset Mining

Expectation-Maximization Clustering

PageRank Ranking

AdaBoost Classification

k-nearest neighbor Classification

CART Classification

Table A.1: Top Ten Algorithms in Data Mining

The performance of classification algorithms is measured not only with respect to runtime but also with respect to the quality of the resulting model. The quality of a classification model is not easy to measure. Therefore, we decided to first look at improved algorithms for clustering and frequent itemset mining.

Additionally, we took a look at sorting in the context of in-memory databases, since sorting is, especially for string data, one of the more expensive operations performed in an in-memory database. For some algorithms, sorting the data is a necessary preprocessing step to achieve high performance.

In document Understanding fundamental database operations on modern hardware (Page 155-162)