Review of Embedded GPUs and Accelerators - Background and Prior Work

Chapter 2: Background and Prior Work

2.3 Review of Embedded GPUs and Accelerators

In this section, we examine the current state of GPU technology in embedded applications; this review includes a discussion of similar GPU-like accelerator technologies. This review demonstrates that there is a market for data-parallel processor architectures in embedded systems, a domain where real-time constraints are common. This confirms the relevance of the research presented in this dissertation.

GPUs may be “discrete” or “integrated.” Discrete GPUs (dGPUs) plug into a host system as a daughter card. Integrated GPUs (iGPUs) reside on the same physical chip as CPUs. iGPUs are common in system-on-chip (SoC) processors targeted at embedded applications, including smartphones and tablets. dGPUs are traditionally far more computationally capable than iGPUs. It is feasible to use a dGPU in an embedded system, as long as the host platform supports the necessary I/O peripheral interconnects (e.g., PCIe). Unfortunately, conventional dGPUs may not be well-suited to all embedded applications for several reasons. First, dGPUs may be physically too large. Second, dGPUs commonly draw enough power (typically between 150 and 250 watts) that they must rely upon active cooling, which requires a fan and unobstructed airflow. Third, the physical port where the dGPU connects to the host system may be vulnerable to physical vibration. However, these challenges are not insurmountable.

General Electric (General Electric, 2011) manufactures “ruggedized” dGPU platforms designed to deal with harsh embedded environments. These dGPUs may be secured to the host computing platform through reinforced I/O ports or soldered directly onto the motherboard of the computing system. Special heat-dissipating enclosures cool the GPU without the need for free-flowing air. General Electric’s ruggedized systems have been used in radar for unmanned aerial vehicles (Pilaud, 2012), sonar for unmanned underwater vehicles (Keller, 2013), and situational awareness applications in manned armored vehicles (McMurray, 2011). However, these dGPUs may still not meet the needs of every embedded application due to several limitations: (i) the heat-dissipating enclosures are large and heavy; (ii) ruggedized dGPUs draw the same power as their conventional counterparts; and (iii) they are expensive. General Electric’s platforms are clearly meant for defense applications. What is affordable for a multi-million dollar military vehicle may not be affordable for a mass-market automobile.

GPU Name          Designer/Maker  GFLOPS (single-precision)  GPGPU Runtime          SoCs (not exhaustive)
----------------  --------------  -------------------------  ---------------------  ---------------------
GC2000            Vivante         32 [a]                     OpenCL 1.2 (embedded)  Freescale i.MX6
SGX544 MP3        PowerVR         51 [b]                     OpenCL 1.1             MediaTek MT6589
GC4000            Vivante         64 [a]                     OpenCL 1.2 (embedded)  hiSilicon K3V2
Mali-628 MP6      ARM             109 [c]                    OpenCL 1.1             Samsung Exynos 5422
G6400             PowerVR         256 [d]                    OpenCL 1.2             Renesas R-Car H2
GC7000            Vivante         256 [a]                    OpenCL 1.2             -
Radeon HD 8210    AMD             256 [e]                    OpenCL 1.2             AMD A4-1340
HD Graphics 4000  Intel           295 [f]                    OpenCL 1.2             Intel BayTrail-T
Mali-760 MP16     ARM             326 [h]                    OpenCL 1.2             -
GX6650            PowerVR         384 [e]                    OpenCL 1.2             Apple A8 (iPhone 6)
K1                NVIDIA          384 [e]                    OpenCL 1.2, CUDA       NVIDIA Tegra K1

[a] Vivante (2014)  [b] Klug (2011)  [c] Sandhu (2013)  [d] Shimpi (2013b)  [e] Smith (2014b)  [f] Shimpi (2013a)  [h] Athow (2013); Smith (2014b)

Table 2.5: Performance and GPGPU support of several embedded GPUs.

Although less capable, iGPUs may offer a viable alternative to dGPUs for some applications. iGPUs lack the physical limitations of dGPUs. The size of an iGPU is negligible as it resides on-chip with CPUs. The interconnect between the host system and iGPU is also on-chip, so it is not vulnerable to physical vibration. iGPUs require far less power; SoCs with iGPUs commonly draw four watts of power, and rarely more than eight watts. As a consequence, iGPUs seldom require active cooling. In addition to these favorable physical characteristics, iGPUs are also more affordable, due to the economies of scale in the smartphone and tablet markets.

We now examine recent trends in iGPU performance and capabilities. Table 2.5 lists several recent iGPUs and their characteristics. We quantify computational capabilities in terms of theoretical peak single-precision floating-point performance, measured in GFLOPS. Unfortunately, GPU manufacturers do not always provide these numbers to the public. As a result, our data is gathered primarily from technology news websites. Each source is cited by footnote. We caution that this data may not be entirely precise. Nevertheless, we are confident that the GFLOPS reported in Table 2.5 are accurate enough to give a sense of relative performance.

We begin by observing trends in computational performance. The Freescale i.MX6, which includes the GC2000 iGPU, was first announced in early 2011. The NVIDIA Tegra K1, which includes the K1 GPU, was first made available to developers in mid-2014. We see that the K1 is twelve times faster than the GC2000 (384 versus 32 GFLOPS). The K1’s performance is not unusual. The Mali-760 MP16 and the GX6650 perform at a similar level. The K1, Mali-760 MP16, and GX6650 were released in 2014. Comparing the GFLOPS of these recent iGPUs to the trends in Figure 1.1(a), we see that the performance of an iGPU today is roughly equivalent to that of a high-end dGPU in 2006. We note, however, that dGPUs of that era regularly required over 150 watts of power. Contrast this with the four to eight watts of an entire SoC today.
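The comparisons in this paragraph reduce to simple ratios, which the following minimal sketch reproduces. The GFLOPS values are those of Table 2.5, and the power figures are those quoted in the text. The throughput assigned to the 2006 dGPU is an assumption: we read "roughly equivalent" as the same 384 GFLOPS. All names are illustrative.

```python
# Peak single-precision GFLOPS as listed in Table 2.5.
GC2000_GFLOPS = 32.0
K1_GFLOPS = 384.0

# Power figures quoted in the text.
SOC_WATTS_LOW, SOC_WATTS_HIGH = 4.0, 8.0   # entire SoC, not just the iGPU
DGPU_2006_WATTS = 150.0                    # "over 150 watts", so a lower bound

# ASSUMPTION: "roughly equivalent" is read as the same peak throughput.
DGPU_2006_GFLOPS = K1_GFLOPS


def speedup(new_gflops, old_gflops):
    """Ratio of theoretical peak throughput."""
    return new_gflops / old_gflops


def gflops_per_watt(gflops, watts):
    """Theoretical peak throughput per watt of power draw."""
    return gflops / watts


k1_over_gc2000 = speedup(K1_GFLOPS, GC2000_GFLOPS)
soc_worst = gflops_per_watt(K1_GFLOPS, SOC_WATTS_HIGH)
soc_best = gflops_per_watt(K1_GFLOPS, SOC_WATTS_LOW)
dgpu_2006 = gflops_per_watt(DGPU_2006_GFLOPS, DGPU_2006_WATTS)

print(f"K1 vs. GC2000: {k1_over_gc2000:.0f}x")
print(f"SoC: {soc_worst:.0f} to {soc_best:.0f} GFLOPS/W")
print(f"2006 dGPU (assumed): {dgpu_2006:.2f} GFLOPS/W")
```

Under these assumptions, the SoC delivers between 48 and 96 peak GFLOPS per watt, against at most 2.56 for the dGPU. These are ratios of theoretical peaks to typical power draws, so they indicate only the order of magnitude of the gap.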

In Table 2.5, we also observe wide adoption of GPGPU technology. Table 2.5 lists six different GPU designers that produce iGPUs with GPGPU support. These GPUs are licensed by even more SoC manufacturers. We see that OpenCL 1.2 is widely supported. Only the four least-performing GPUs are limited to OpenCL 1.1 or the embedded profile of OpenCL 1.2.
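The counts in this paragraph can be checked directly against Table 2.5. The sketch below transcribes the table (omitting the SoC column) and tallies the designers and the GPUs limited to OpenCL 1.1 or the embedded profile. The record layout and helper names are our own, chosen only for illustration.

```python
from collections import namedtuple

Gpu = namedtuple("Gpu", "name designer gflops runtimes")

# Rows transcribed from Table 2.5 (SoC column omitted).
TABLE_2_5 = [
    Gpu("GC2000", "Vivante", 32, ("OpenCL 1.2 (embedded)",)),
    Gpu("SGX544 MP3", "PowerVR", 51, ("OpenCL 1.1",)),
    Gpu("GC4000", "Vivante", 64, ("OpenCL 1.2 (embedded)",)),
    Gpu("Mali-628 MP6", "ARM", 109, ("OpenCL 1.1",)),
    Gpu("G6400", "PowerVR", 256, ("OpenCL 1.2",)),
    Gpu("GC7000", "Vivante", 256, ("OpenCL 1.2",)),
    Gpu("Radeon HD 8210", "AMD", 256, ("OpenCL 1.2",)),
    Gpu("HD Graphics 4000", "Intel", 295, ("OpenCL 1.2",)),
    Gpu("Mali-760 MP16", "ARM", 326, ("OpenCL 1.2",)),
    Gpu("GX6650", "PowerVR", 384, ("OpenCL 1.2",)),
    Gpu("K1", "NVIDIA", 384, ("OpenCL 1.2", "CUDA")),
]


def is_limited(gpu):
    """True if the GPU lacks full-profile OpenCL 1.2."""
    return "OpenCL 1.2" not in gpu.runtimes


designers = sorted({g.designer for g in TABLE_2_5})
limited = [g for g in TABLE_2_5 if is_limited(g)]
full = [g for g in TABLE_2_5 if not is_limited(g)]

print(len(designers), "designers:", ", ".join(designers))
print("Limited:", ", ".join(g.name for g in limited))
# The limited GPUs are the least-performing ones if every one of them
# is slower than every full-profile GPU.
print(max(g.gflops for g in limited) < min(g.gflops for g in full))
```

The tally yields six designers and four limited GPUs, each of which has a lower peak GFLOPS than any GPU with full-profile OpenCL 1.2 support.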

iGPUs that support GPGPU also cross instruction-set boundaries. Although the SoCs in Table 2.5 predominantly use the ARM instruction set, we also see support for the x86 instruction set from the Intel BayTrail-T and AMD A4-1340.

It is difficult to judge which GPUs best support an embedded real-time system, as this is not strictly defined by the GPU. Other SoC features are important to consider as well. Freescale and Renesas have an established presence in embedded markets. They have demonstrated an understanding and appreciation of real-time system constraints. In contrast, Samsung, Apple, and NVIDIA largely focus on consumer electronics like smartphones and tablets. Each manufacturer tailors its SoCs to its chosen market. For example, the Tegra K1 includes a modern cellphone radio for smartphones and tablets (NVIDIA, 2014f). However, it lacks integrated support for CAN, a data bus commonly used in automotive electronics. The converse is true of the Renesas R-Car H2: it supports CAN, but lacks a cellphone radio (Renesas, 2013).

There are also differences in software to consider. For instance, the CUDA programming language is more succinct than OpenCL: less code is necessary to perform the same operations. Moreover, NVIDIA has developed a broad set of tuned CUDA libraries and development tools. As a result, development may proceed faster on a K1 than it might on any of the OpenCL-only GPUs. The instruction set of the SoC may also affect development. For example, the Intel BayTrail-T and AMD A4-1340 support the x86 instruction set. Development on these platforms benefits from a wide set of tools and software libraries originally developed for desktops and servers. Also, prototypes developed on x86 workstations are easier to port to x86 SoCs than to ARM SoCs.

Before concluding this survey of iGPUs, we wish to discuss digital signal processors (DSPs) designed to support computer vision computations. These DSPs function much like an iGPU that executes GPU kernels. As such, we can apply the same GPU scheduling techniques we present in Chapter 3 to these DSPs.

We begin with the G2-APEX DSP developed by CogniVue (CogniVue, 2014), which CogniVue licenses to SoC manufacturers. CogniVue claims that the G2-APEX consumes only milliwatts of power, making it more power efficient than iGPUs. The company provides development tools for implementing computer vision applications, including a custom version of the popular OpenCV computer vision library (OpenCV, 2014). Freescale has licensed CogniVue technology for their own SoCs (Freescale, 2014).

The other computer vision accelerator we consider is the IMP-X4 DSP, which is incorporated into the Renesas R-Car H2 SoC (Renesas, 2013). Like CogniVue, Renesas distributes a custom version of OpenCV. What is unique to the R-Car H2 is that it also includes a G6400 iGPU from PowerVR. As we see in Table 2.5, the G6400 is among the more modest iGPUs. However, this shortcoming may be offset when the G6400 is paired with the IMP-X4. Unfortunately, we were unable to obtain benchmark information from Renesas or other sources to support this speculation.

It is clear from this survey of ruggedized dGPUs, iGPUs, and specialized accelerator DSPs that there is a market for data-parallel processor architectures in embedded systems. Solutions today range from expensive military-grade hardware, to specialized embedded DSPs, to common consumer-grade electronics.

These technologies will evolve with time. What direction will this evolution take? Industry has already signaled that we can expect CPUs and GPUs to become more tightly coupled. For instance, CUDA 6.0 (released in early 2014) introduced memory management features that automatically move data between host and GPU local memory. This eases programming because it frees the programmer from the burden of explicit memory management in their program code. A yet stronger signal for tightly coupled CPUs and GPUs comes from the development of the “Heterogeneous System Architecture” (HSA), which is backed by several industry leaders. HSA is a processor architecture where CPU and GPU memory subsystems are tightly integrated with full cache coherency (HSA Foundation, 2014). A GPU is more of a peer to CPUs in this architecture, rather than an I/O device as GPUs are today. Many of the advanced features of OpenCL 2.0 (the latest revision of the OpenCL standard) require HSA-like functionality from hardware. We speculate that CPUs and GPUs with HSA-like functionality will first come from manufacturers that design both types of processors, as they are in the best position to tightly integrate them. This includes companies such as Intel, AMD, NVIDIA, and ARM. It may take more time for makers of licensed GPU processors, such as PowerVR and Vivante. Also, although dGPUs typically lead iGPUs in functionality and performance, iGPUs are likely to support HSA-like features before dGPUs.
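To make the contrast between explicit and automatic data movement concrete, the following toy model may help. It is a pure-Python simulation, not the CUDA API: `ToyDevice`, `ManagedBuffer`, and every method name are hypothetical stand-ins that we introduce only to show where transfer calls appear in program code under each style.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU with its own local memory."""

    def __init__(self):
        self.memory = {}   # device-local buffers, keyed by name
        self.copies = 0    # host<->device transfers performed

    def upload(self, name, host_data):
        self.memory[name] = list(host_data)
        self.copies += 1

    def download(self, name):
        self.copies += 1
        return list(self.memory[name])

    def scale(self, name, factor):
        """Toy 'kernel': multiply each element held in device memory."""
        self.memory[name] = [factor * x for x in self.memory[name]]


def explicit_style(device, data):
    """The programmer issues every transfer by hand."""
    device.upload("buf", data)
    device.scale("buf", 2)
    return device.download("buf")


class ManagedBuffer:
    """Toy managed allocation: the 'runtime' migrates data on demand."""

    def __init__(self, device, name, data):
        self.device, self.name = device, name
        self.host_copy = list(data)
        self.on_device = False

    def launch_scale(self, factor):
        if not self.on_device:            # migrate host -> device
            self.device.upload(self.name, self.host_copy)
            self.on_device = True
        self.device.scale(self.name, factor)

    def read(self):
        if self.on_device:                # migrate device -> host
            self.host_copy = self.device.download(self.name)
            self.on_device = False
        return list(self.host_copy)


def managed_style(device, data):
    """No transfer calls appear in the program code."""
    buf = ManagedBuffer(device, "buf", data)
    buf.launch_scale(2)
    return buf.read()


explicit_device, managed_device = ToyDevice(), ToyDevice()
explicit_result = explicit_style(explicit_device, [1, 2, 3])
managed_result = managed_style(managed_device, [1, 2, 3])
print(explicit_result, managed_result)
print(explicit_device.copies, managed_device.copies)
```

Both styles perform the same two transfers in this model. The difference is only in who issues them: the program in the explicit style, the runtime in the managed style. The model says nothing about how a real runtime decides when to migrate data.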

[Figure 2.18 labels: Application; GPGPU API; GPGPU Runtime; ioctl(); GPL Layer; Linux Device Driver; User Space; Kernel Space]

Figure 2.18: Layered GPGPU software architecture on Linux with closed-source drivers.
