a CUDA kernel provided by the courtesy of Jiri Matela3. However, this code contained approximations that introduced unacceptable noise in the decoded image. Therefor, no successful decoding was obtained with IDWT mapped to the GPU.
5.3
Core Components Revisited
The components briefly visited in section 5.1 are now discussed more thor- ough. For the deepest level of details, the reader is referred to the source code given in appendix E.
5.3.1 Socet2PacketBuffer: Receive UDP Packets
The transport layer services are maintained by the User Datagram Proto-
col (UDP). A datagram socket must be created in order to join a UDP
multicast and thereby receive UDP packets. This is an interface between an application process and the IP protocol stack provided by the operat- ing system. The operation system uniquely identifies a datagram socket by the protocol, the local IP address and port number. The operating system forwards incoming IP data packets to the correct application by extract- ing the above socket address information from the IP and UDP headers. Sockets were developed at Berkley as an abstraction to enable computer communication over different mediums, and forms the de-facto standard for networking. Windows Sockets 2 builds on the Berkley Sockets and provides the necessary methods to join a multicast and receive UDP packets. The received UDP payloads are placed in a circular buffer to facilitate detection of packet reordering, duplication, deletion and immunity to mild jitter. The application layer services are maintained by the Real-time Transport Pro-
tocol (RTP), where the RTP header contains a sequence number used to
detect packet reordering, insertion and deletion.
5.3.2 PacketBuffer2Codestream: Reconstruct JPEG2000 Code- stream from Multiple UDP Payloads
Figure 5.2 shows an example of a packetized codestream originating from a TVG430. As seen in this figure, each UDP payload is composed of a RTP header and a RTP payload. The RTP payload is composed of a JPEG2000 codestream segments, encapsulated in the Material eXchange Format (MXF)[39, 40]. MXF is a container format used to carry sound-, image- and metadata content, collectively termed essence. It is also used for synchronization. Since each compressed video frame is divided amongst multiple UDP payloads, several codestream segments must be extracted and put together in order to obtain a complete frame ready for decoding.
3
32 Chapter 5. Implementation
The compressed video frames are coded with the YCbCr color space, where luma and chroma components are received as separate JPEG2000 codestreams, i.e., the Y component (luma) is encoded as one codestream, while the Cb and Cr components (chroma) are encoded as another code- stream. These two codestreams could be decoded separately and combined before the YCbCr to RGB color space transform, but instead they are com- bined into one single codestream before decoding in order to avoid multiple decoders. This is possible without any transcoding, owing to the flexibil- ity in JPEG2000. Some minor changes must be made in the codestream header, since the luma and chroma codestreams contains redundant marker segments4, but no changes are made to the compressed bitstreams. By inserting a Progression Order Change (POC) marker segment in the code- stream header and encapsulating the luma and chroma codestream in indi- vidual tile-parts, the underlying decoder is able to correctly decode all three components in one run. The meaning of progression order within JPEG2000 is explained in section 3.1.3. For more information on the POC marker seg- ment and the changes made in the codestream header, please consult annex A in [14] and the source code in appendix E, respectively.
Complete encoded frame, ready for decoding:
Header RTP payload
Header MXF payload
codestream segment
Payload from UDP packet #1 ....
....
Header RTP payload
MXF payload JPEG2000 codestream
segment
Payload from UDP packet #2
Header RTP payload
MXF payload JPEG2000 codestream
segment
Payload from UDP packet #n Header
Figure 5.2: Simplified overview of the TVG430 bitstream.
5.3.3 HostDecoder and DeviceDecoder: Decoding of JPEG2000 Codestream
The implemented decoder is composed of two parts, where the first is named HostDecoder5, is executed on the CPU and involve all decoding steps up to and including inverse wavelet transform. The second part is named De- viceDecoder5, include chroma upsampling and Irreversible Color Transform (ICT) and is executed on the GPU. This architectureis identical to the one described in section 5.2.4, named Kakadu enhancement 2.
HostDecoder is based on Kakadu[37], which is a JPEG2000 SDK and a
complete implementation of JPEG2000 Part 1, and a great deal of Parts 2 and 3. More specific, the CPU executed part of the decoder is based on the high-level application kdu_stripe_decompressor, which again is based on the low-level application kdu_decoder, both found in the Kakadu SDK.
4See section 3.1.4 for more information on marker segments. 5
5.3 Core Components Revisited 33
Input to the decoder is a JPEG2000 compressed codestream, while the out- put is a three component bitmap represented with the YCbCr color space with 4:2:2 subsampling.
The CUDA language was used to perform all GPU processing in De-
viceDecoder. After the output from the CPU executed decoder is transfered
from host to device memory, a Pixel Buffer Object (PBO) is mapped into CUDA memory space. This enables the computed RGBA6 values with 4:4:4 subsampling to be directly written to the PBO during the concurrent chroma upsampling and Irreversible Color Transform (ICT). The ICT is from the YCbCr color space to the RGB color space, and is treated mathematically in appendix C. The PBO is used in order to reduce the overhead connected with the later displaying of the result, which is further discussed in section 5.3.4. The CPU and GPU are pipelined to exploit the inherent parallelism between them, meaning that both the CPU an GPU can perform decoding at the same time, just on different frames. Synchronization is needed when accessing shared buffers, and is obtained with mutexes. The synchronization is based on the assumption that the CPU decoder is slower than the GPU decoder, meaning that frames will be lost if the opposite is true. This is a tradeoff between fool-prof synchronization and execution time, since additional synchronization would add latency to the critical path. The buffer between the CPU and GPU executed decoder should probably be increased to accommodate multiple partly decoded frames, in order to better absorb the inevitable jitter in processing time. Currently it holds only one partly decoded frame.
In addition to the implemented decoder described above, a serious at- tempt has been made to map the inverse wavelet transform from CPU to GPU. This is explained more thorough in section 5.2.5.
5.3.4 Gui: Application Front-End
The application is implemented with a graphical user interface, which also is composed of two parts. The first part is a startup wizard based on the Qt framework[41], used at startup to obtain the multicast address and port number for the UDP multicast. This window can be seen in figure 5.3.
The second part is the main window used to display the decoded video frames, and can be seen in figure 5.4. It is implemented with the OpenGL interoperability found in the CUDA runtime API, which enables the Gui to create a texture based on the above discussed PBO. This texture is then drawn to the screen. As mentioned in section 4.1.2, transfers between host and device are costly with respect to execution time. This architecture com- pletely avoids data read-back from device to host, since the host does not access the data after they are transfered to device memory. This is accom-
6
34 Chapter 5. Implementation
plished since the decoded stream is meant for immediate visual consumption, and CUDA facilitate a simple binding to OpenGL.
Because of restrictions in CUDA with respect to threads and memory access, the GPU decoding procedure described in section 5.3.3 has to be called from the Gui main loop. The reason is that "any CUDA resources created through the runtime in one host thread cannot be used by the run- time from another host thread"7. This means that two host threads not can use the same device memory. The same restrictions seems to apply to a PBO, because the PBO has not been successfully mapped to two threads for writing or reading. The only known solution is to let the thread running the Gui main-loop copy data from hostOutputBuffer to deviceInputBuffer, perform the GPU processing and use the OpenGL bindings to display the decoded frame. This may be a little illogical, since the GUI not should be involved in the decoding process, but it is the solution used in the OpenGL sample applications in the CUDA SDK. A more logical partitioning between decoder and GUI would be obtained, if the decoder could update the PBO from a thread other than the GUI main-loop thread. However, the cur- rent arrangement is advantageous from a performance perspective, since it provide a simple way to pipeline the CPU and GPU decoding operations.
Figure 5.3: Screenshot of application startup wizard.
7