Table 5.7: Inputs to VecOp to answer Q4
Parameter  Value               Comments
ad1        ad(price)           see Table 5.3
ad2        ad(discount)        start address of the second vector; see above
size       |price|             see Table 5.3
type       64-bit fixed point  the type of each element on the memory page
vec op     multiply            vector operation used
agg op     sum                 operation used for aggregation
agg        0.00                see Table 5.3
mask ad    0                   no mask is specified
mask       N/A                 not used
invmask    N/A                 not used
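To make the parameter list more concrete, the following minimal sketch models in software what a VecOp call with these inputs could compute. It assumes that Q4 boils down to summing the element-wise product of the price and discount vectors, as the multiply and sum settings suggest. The function name `run_vec_op`, its signature, and the fixed-point scale of 100 are hypothetical and chosen only for illustration; they are not the instruction's actual interface.

```python
import operator
from typing import Callable, Optional, Sequence

SCALE = 100  # assumption: fixed-point values keep two decimal places in an integer


def run_vec_op(vec1: Sequence[int], vec2: Sequence[int],
               vec_op: Callable[[int, int], int],
               agg_op: Callable[[int, int], int],
               agg: int,
               mask: Optional[Sequence[int]] = None) -> int:
    """Combine vec1 and vec2 element-wise with vec_op and fold each result
    into the running aggregate agg with agg_op.
    mask=None mirrors 'mask ad = 0' (no mask): every element takes part."""
    for i, (a, b) in enumerate(zip(vec1, vec2)):
        if mask is not None and not mask[i]:
            continue  # element is masked out
        agg = agg_op(agg, vec_op(a, b))
    return agg


price = [1000, 2550, 400]   # 10.00, 25.50, 4.00 as scaled integers
discount = [5, 10, 0]       # 0.05, 0.10, 0.00 as scaled integers

# vec op = multiply, agg op = sum, agg starts at 0.00, no mask
total = run_vec_op(price, discount, operator.mul, operator.add, agg=0)
print(total / (SCALE * SCALE))  # 3.05 (a product of two scaled values has scale 100 * 100)
```

Passing the operations as callables keeps the sketch close to the table, where the vector operation and the aggregation operation are independent parameters.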
5.8
Patent Application: Accelerator for Analytical Workloads
The following figures show possible implementations of the operations that were introduced in the previous section.
The Filter Engine, shown in Figure 5.6, produces a bit mask that is stored in memory. It consists of a Parallel Compare circuit, which performs 2 · k parallel comparisons between the k input elements from the L4 cache and the lower and upper bounds provided by the CPU instruction. The outcome of those comparisons is given by k bits, which are fed into the MaskBuffer to create a bit mask and additionally into an Incrementer to keep track of the number of elements that passed the filter.
In the figure we see an instance of the Filter Engine with k = 4 and a scalar width of 64 bits. Figure 5.7 shows the control sequence for the Filter Engine.
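The behavior described above can be approximated in software. The sketch below is a simplified model, not the hardware design: it processes a page in lines of k = 4 elements, compares each element with both bounds, collects the result bits, and counts the elements that passed. The function names and the two inclusive/exclusive flags are assumptions made for illustration; the flags are only one possible reading of the 2-bit "Bounds handling" input in Figure 5.6.

```python
from typing import List, Tuple

K = 4  # elements per line: a 32-byte line holds four 64-bit scalars


def filter_line(line: List[int], lower: int, upper: int,
                lower_inclusive: bool = True,
                upper_inclusive: bool = True) -> List[int]:
    """Compare every element of one line with both bounds (2 * k comparisons)
    and return one result bit per element."""
    bits = []
    for value in line:
        above_lower = value >= lower if lower_inclusive else value > lower
        below_upper = value <= upper if upper_inclusive else value < upper
        bits.append(1 if (above_lower and below_upper) else 0)
    return bits


def filter_page(page: List[int], lower: int, upper: int) -> Tuple[List[int], int]:
    """Filter a page line by line; return (bit mask, number of qualifying elements)."""
    mask: List[int] = []
    count = 0
    for start in range(0, len(page), K):
        bits = filter_line(page[start:start + K], lower, upper)
        mask.extend(bits)   # role of the MaskBuffer: collect the k result bits
        count += sum(bits)  # role of the Incrementer: count elements that passed
    return mask, count


page = [5, 12, 7, 30, 18, 3, 25, 10]
mask, count = filter_page(page, lower=7, upper=25)
print(mask, count)  # [0, 1, 1, 0, 1, 0, 1, 1] 5
```

In hardware the k comparisons of a line run in parallel; the loop in `filter_line` only serializes them for readability.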
[Block diagram; labels: L4 (32 B), Page Start Address, Fetch Address (8 B), Next Fetch Address (8 B), Parallel Compare, Low Buffer, High Buffer, Lower bound (8 B), Upper bound (8 B), Bounds handling (2 bit), MaskBuffer (4 bit), Mask Address, Mask Fetch Address (8 B), Next Mask Store Address (8 B), Incrementer (4 bit), Count Buffer, Count]
Figure 5.6: Filter Engine
Start
1. Software receives a request to filter a storage area.
2. Software issues an instruction to filter a page.
3. The CPU sends the command to the Filter Engine.
4. The Filter Engine fetches a line.
5. The Filter Engine compares all elements of the line with the upper and lower bound.
6. The Filter Engine puts the results in the bit mask buffer and increments the count.
7. Bit mask buffer full? If yes: the Filter Engine ensures that no other CPU works on the cache line of the bit mask address, stores the bit mask buffer at the bit mask address, and increments the address.
8. Last line in page? If no: continue with step 4.
9. Software adds the count to the running sum.
10. Last page in storage area? If no: continue with the next page (step 2). If yes: stop.
Figure 5.7: Filter Engine Process Flow
[Block diagram; labels: L4 (32 B), Page Start Address, Fetch Address (8 B), Next Fetch Address (8 B), 5 Way-Add64 under mask, ResultBuffer, Result, MaskBuffer (4 bit), Mask Address, Mask Fetch Address (8 B), Next Mask Fetch Address (8 B)]
Figure 5.8: Aggregation under Mask Engine
A bit mask, for example one created by the Filter Engine, can be used by the Aggregation under Mask Engine (see Figure 5.8) to aggregate all qualifying entries of a memory page. This is achieved by loading the bit mask and chunks of the memory page into the engine. Afterwards, aggregation operations, such as finding the maximum or summing up all qualifying values, are performed with respect to the loaded bit mask. Figure 5.9 visualizes the process flow of the Aggregation under Mask Engine.
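As a rough software analogue of aggregation under a mask, the following sketch folds only those elements whose mask bit is set into a running result. The function name `aggregate_under_mask`, the sample data, and the use of the smallest signed 64-bit value as the starting point for the maximum are assumptions for illustration, not part of the engine's specification.

```python
import operator
from typing import Callable, Sequence

INT64_MIN = -(2 ** 63)  # assumed neutral start value for a maximum over signed 64-bit scalars


def aggregate_under_mask(page: Sequence[int], mask: Sequence[int],
                         agg_op: Callable[[int, int], int], initial: int) -> int:
    """Fold all elements of the page whose mask bit is 1 into one result."""
    result = initial
    for value, bit in zip(page, mask):
        if bit:
            result = agg_op(result, value)
    return result


page = [5, 12, 7, 30, 18, 3, 25, 10]
mask = [0, 1, 1, 0, 1, 0, 1, 1]  # e.g. the outcome of an earlier range filter

print(aggregate_under_mask(page, mask, operator.add, 0))  # 72
print(aggregate_under_mask(page, mask, max, INT64_MIN))   # 25
```

The same loop serves both sum and maximum because only the aggregation operation and its start value change.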
It is interesting to note that the two engines could be operated in an interleaved fashion instead of sequentially. This would allow the bit mask to be cached in the engine and could thereby reduce the memory traffic for storing and loading the bit mask.
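The difference between the two modes can be illustrated with a small software model. In the sketch below, `sequential` first materializes the complete bit mask and then aggregates under it, whereas `interleaved` consumes the mask bits of each line immediately, so no mask has to be written out and read back. Both function names, the inclusive bounds, and the choice of filtering one column while summing another are assumptions made for this example.

```python
from typing import List

K = 4  # elements per 32-byte line with 64-bit scalars


def sequential(filter_col: List[int], value_col: List[int],
               lower: int, upper: int) -> int:
    """Two passes: build the full bit mask first, then aggregate under it."""
    mask = [1 if lower <= v <= upper else 0 for v in filter_col]  # would be stored to memory
    return sum(v for v, bit in zip(value_col, mask) if bit)       # mask would be loaded again


def interleaved(filter_col: List[int], value_col: List[int],
                lower: int, upper: int) -> int:
    """One pass: the mask bits of a line are used right away and then discarded."""
    total = 0
    for start in range(0, len(filter_col), K):
        bits = [1 if lower <= v <= upper else 0
                for v in filter_col[start:start + K]]
        total += sum(v for v, bit in zip(value_col[start:start + K], bits) if bit)
    return total


filter_col = [5, 12, 7, 30, 18, 3, 25, 10]
value_col = [1, 2, 3, 4, 5, 6, 7, 8]
print(sequential(filter_col, value_col, 7, 25))   # 25
print(interleaved(filter_col, value_col, 7, 25))  # 25
```

Both variants return the same result; the interleaved one simply never keeps more than one line's worth of mask bits alive.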
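Start
1. Software receives a request to conditionally sum up a storage area.
2. Software issues an instruction to conditionally sum up a page.
3. The CPU sends the command to the Aggregation Engine.
4. The Aggregation Engine fetches a mask.
5. The Aggregation Engine fetches a line.
6. The Aggregation Engine sums up all qualifying elements of the line.
7. Mask exhausted? If no: continue with step 5.
8. Last line in page? If no: continue with the next mask (step 4).
9. Software adds the page sum to the running sum.
10. Last page in storage area? If no: continue with the next page (step 2). If yes: stop.
Figure 5.9: Aggregation under Mask Engine Process Flow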
5.9
Conclusion
In this chapter we developed a new hardware approach to improve the runtimes of many analytical queries. The idea of bringing computational power closer to the data by introducing it into the caches is a promising compromise compared with intelligent RAM. Not only can filter and aggregation operations be performed at the cache level; vector operations and compression techniques can also further increase the payload that is brought into the upper levels of the cache hierarchy.
This idea also led to a patent application and will hopefully be used in next-generation mainframes to enable efficient in-memory databases and analytics as the future workload for the IBM System Z.
Appendix A
Additional Results of the “New Workloads for the IBM Mainframe System Z” Project
A.1
Algorithms
The survey paper Top Ten Algorithms in Data Mining [93] served as a starting point to identify the most important problems and algorithms in data mining workloads. Table A.1 lists those ten algorithms.
Algorithm                 Type
C4.5                      Classification
k-means                   Clustering
Support Vector Machines   Classification
Apriori                   Frequent Itemset Mining
Expectation-Maximization  Clustering
PageRank                  Ranking
AdaBoost                  Classification
k-nearest neighbor        Classification
Naive Bayes               Classification
CART                      Classification
Table A.1: Top Ten Algorithms in Data Mining
The performance of classification algorithms is measured not only with respect to runtime but also with respect to the quality of the resulting model. The quality of a classification model is not easy to measure. Therefore, we decided to first look at improved algorithms for clustering and frequent itemset mining.
Additionally, we took a look at sorting in the context of in-memory databases, since sorting is, especially for string data, one of the more expensive operations performed in an in-memory database. For some algorithms, sorting the data is a necessary preprocessing step to achieve high performance.