In this section we present the GPU implementation of RLE decoding. It is a fairly straightforward implementation of RLE, with minor modifications to speed it up on GPUs.
4.1.1 Layout of the RLE data
The RLE format is the one described in Section 2.1.2: a counter followed by either a single value or a number of different values. The implemented RLE encoder uses a 32-bit type. The counter gives either the number of times to repeat a value or the number of following bytes that should be copied to the output stream. The sign of the counter distinguishes the two cases: a positive value means that the next byte is to be repeated, and a negative value means that the following bytes are to be copied directly to the output. In both cases the absolute value gives the actual count.
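The format above can be illustrated with a small sequential decoder. This is a minimal Python sketch, not the GPU implementation. It assumes that the 32-bit type is a signed little-endian counter and that symbols are single bytes; the helper names `run`, `literal` and `rle_decode` are hypothetical and only serve the illustration.

```python
import struct


def rle_decode(encoded: bytes) -> bytes:
    """Sequentially decode a stream of (counter, payload) records."""
    out = bytearray()
    pos = 0
    while pos < len(encoded):
        # Assumption: the counter is a signed 32-bit little-endian integer.
        (count,) = struct.unpack_from("<i", encoded, pos)
        pos += 4
        if count >= 0:
            # Run: one symbol byte follows and is repeated `count` times.
            out += encoded[pos:pos + 1] * count
            pos += 1
        else:
            # Literal block: copy the next |count| bytes unchanged.
            n = -count
            out += encoded[pos:pos + n]
            pos += n
    return bytes(out)


def run(count: int, symbol: bytes) -> bytes:
    """Hypothetical helper: encode `symbol` repeated `count` times."""
    return struct.pack("<i", count) + symbol


def literal(data: bytes) -> bytes:
    """Hypothetical helper: encode `data` as a block to be copied as-is."""
    return struct.pack("<i", -len(data)) + data


example = run(3, b"a") + literal(b"xyz") + run(2, b"b")
decoded = rle_decode(example)
```

Note that the decoder can only find the second counter after it has read the first one, which is the sequential dependency that the offset table described next is meant to break.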
The modification made to RLE decoding on the GPU is the addition of a table of offsets into the input stream at which decoding can start. The motivation for this offset table is mainly to increase performance. Without it, a run-length encoded stream has to be decoded from the start, because there is no way of telling where a counter starts without following the counters from the start of the input.
To distribute the workload of decoding an RLE-encoded stream among several processors, the encoded stream is partitioned into smaller sections. This allows the different processors to work on their own section of the encoded stream and produce their own output section. The encoded stream is decomposed such that the output sections produced by the processors are about the same size. This is achieved by choosing the counters that are closest to the given positions in the original (decoded) stream. For example, with a table of two offsets, decoding would start at the beginning of the encoded input stream and at a position in the encoded stream whose output begins close to the middle of the decoded output stream.
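The selection of starting points can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the encoder used in this work: counters are assumed to be signed 32-bit little-endian integers, symbols single bytes, and the function names `record_starts` and `choose_split_points` are hypothetical. For each evenly spaced target position in the decoded stream, the counter whose output position is nearest to the target is chosen.

```python
import struct


def record_starts(encoded: bytes):
    """Yield (input_pos, output_pos) for every counter in the stream."""
    in_pos = out_pos = 0
    while in_pos < len(encoded):
        yield in_pos, out_pos
        (count,) = struct.unpack_from("<i", encoded, in_pos)
        if count >= 0:
            in_pos += 4 + 1          # counter plus one repeated symbol
            out_pos += count
        else:
            in_pos += 4 + (-count)   # counter plus |count| literal bytes
            out_pos += -count


def choose_split_points(encoded: bytes, decoded_len: int, sections: int):
    """Pick, per section, the counter nearest to an even output split."""
    starts = list(record_starts(encoded))
    points = []
    for k in range(sections):
        target = k * decoded_len // sections
        best = min(starts, key=lambda s: abs(s[1] - target))
        # Skip duplicates, which occur when there are few counters.
        if not points or best != points[-1]:
            points.append(best)
    return points


def run(count: int, symbol: bytes) -> bytes:
    return struct.pack("<i", count) + symbol


def literal(data: bytes) -> bytes:
    return struct.pack("<i", -len(data)) + data


# Decodes to 4 + 3 + 5 + 4 = 16 bytes.
stream = run(4, b"a") + literal(b"bcd") + run(5, b"e") + run(4, b"f")
two_way = choose_split_points(stream, 16, 2)
```

With two sections the ideal split is at output position 8, but the nearest counter starts writing at position 7, so the second section begins there. The sections are therefore only approximately equal in size.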
The offset table contains three variables for each entry: input position, output position and tag count. The input position gives the offset in bytes from the beginning of the encoded stream. The output position gives the offset in bytes from the beginning of the decoded stream. Finally, the tag count gives the number of the tag, counted from the beginning of the encoded stream. This last variable is used to keep track of the extent of the section being decoded: knowing the tag count of the next section, decoding can proceed until the tag count of the section being decoded equals the tag count of the next section. Pseudocode for the RLE decoding is given in Algorithm 4.1.1.
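To make the role of the three table fields concrete, the following Python sketch decodes each section independently from a hand-built offset table. It is a sequential illustration only and rests on several assumptions: counters are signed 32-bit little-endian integers, symbols are single bytes, and the last section stops at the total number of tags (the text only states the stop condition for sections that have a successor). The names `OffsetEntry`, `decode_section` and `decode_with_table` are hypothetical.

```python
import struct
from typing import NamedTuple


class OffsetEntry(NamedTuple):
    input_pos: int   # byte offset into the encoded stream
    output_pos: int  # byte offset into the decoded stream
    tag_count: int   # number of the section's first tag (counter)


def decode_section(encoded: bytes, output: bytearray,
                   entry: OffsetEntry, stop_tag: int) -> None:
    """Decode tags entry.tag_count .. stop_tag - 1 into `output`."""
    pos, out, tag = entry
    while tag != stop_tag:
        (count,) = struct.unpack_from("<i", encoded, pos)
        pos += 4
        if count >= 0:
            output[out:out + count] = encoded[pos:pos + 1] * count
            pos += 1
            out += count
        else:
            n = -count
            output[out:out + n] = encoded[pos:pos + n]
            pos += n
            out += n
        tag += 1


def decode_with_table(encoded, table, total_tags, decoded_len):
    output = bytearray(decoded_len)
    # Each section stops at the tag count of the next section; the
    # last one stops at the total tag count (an assumption here).
    stops = [e.tag_count for e in table[1:]] + [total_tags]
    # Sections are independent, so the order does not matter; they are
    # processed in reverse to emphasise this.
    for entry, stop in reversed(list(zip(table, stops))):
        decode_section(encoded, output, entry, stop)
    return bytes(output)


def run(count, symbol):
    return struct.pack("<i", count) + symbol


def literal(data):
    return struct.pack("<i", -len(data)) + data


stream = run(4, b"a") + literal(b"bcd") + run(5, b"e") + run(4, b"f")
table = [OffsetEntry(0, 0, 0), OffsetEntry(12, 7, 2)]
result = decode_with_table(stream, table, total_tags=4, decoded_len=16)
```

Because every section writes to its own disjoint range of the output, the sections can be handed to different processors without synchronisation between them.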
4.1.2 The RLE decoding kernel
Instead of using branches to select which section of the encoded stream a group of threads should handle, the thread number is used to select the correct section. The kernel is designed to handle an RLE stream that is divided into eight parts. To fully utilize coalesced memory accesses, the kernel has 128 threads assigned to it. This way each section has 16 threads available for coalesced reads from and writes to global memory.
The partitioning is as follows. First, the thread ID is shifted to the right so that its three most significant bits, out of the seven bits needed for a block of 128 threads, become the three least significant bits. Then a mask is applied to ensure that the only possible values are in the range 0 to 7. The result is that thread IDs in the range 0–15 belong to section 0, thread IDs in the range 16–31 to section 1, and so on. Figure 4.1 illustrates how binary numbers in the range 000₂ to 111₂ map to the different sections of the decoded stream.¹
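The shift-and-mask mapping can be checked with a few lines of Python. This is only an arithmetic illustration of the scheme, not kernel code; the constant and function names are hypothetical, and the shift amount of 4 is derived from the stated numbers (128 threads, eight sections, hence 16 threads per section).

```python
THREADS_PER_BLOCK = 128
SECTIONS = 8
SHIFT = 4            # log2(128 / 8): drop the four low bits of the ID
MASK = SECTIONS - 1  # 0b111 keeps the result in the range 0 to 7


def section_of(thread_id: int) -> int:
    """Map a thread ID to the section it works on, without branching."""
    return (thread_id >> SHIFT) & MASK


# Group all thread IDs of one block by the section they map to.
threads_by_section = {s: [] for s in range(SECTIONS)}
for tid in range(THREADS_PER_BLOCK):
    threads_by_section[section_of(tid)].append(tid)
```

Every thread evaluates the same expression, so no thread has to branch to find its section.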
¹ We denote the radix by subscript, e.g. 11₂ is 3 (decimal) in radix 2, and assume radix
Algorithm 4.1.1: rle-decode(input, output, lenOut, threadID)
  local currentPos, posOut, startTag, stopTag, currentTag
  (currentPos, posOut) ← GetPositions(input, threadID)
  (startTag, stopTag) ← GetTagNumbers(threadID)
  currentTag ← startTag
  repeat
    comment: Read counter from input stream.
    count ← getCount(input[currentPos])
    if count ≥ 0 then
      symbol ← getNextSymbol()
      for i ← 1 to count do
        write(output[posOut], symbol)
        posOut ← posOut + 1
    else
      comment: Copy abs(count) symbols from input to output.
      copy(output[posOut], input[currentPos], abs(count))
      currentPos ← currentPos + abs(count)
      posOut ← posOut + abs(count)
    currentTag ← currentTag + 1
  until currentTag = stopTag
Starting from the initial length at the top of the figure, binary numbers whose leftmost digit is zero handle the first half of the output stream, and binary numbers starting with 00 handle the first quarter. At the bottom of the figure, the sections from 0 to 1 and from 1 to 2 are handled by the binary numbers 000 and 001.
Figure 4.1: Illustration of division based on a binary number. (Figure labels: initial length at the top; prefixes 0xx and 00x; section boundaries 0 to 8 at the bottom.)