Convolutional Neural Network Accelerators

The rapid development of convolutional neural networks in recent years is also attributed to the introduction of GPUs into general-purpose computing. Of course, large training datasets and new training methods alongside new architectures played a significant role as well, but in the absence of sheer computing power those achievements would not be available to everybody.

In this part we discuss some of the hardware architectures that are designed to accelerate the training and the inference of neural networks. In general, such specialized hardware can be faster and more energy efficient than GPUs. CPUs, on the other hand, are not an optimal platform for neural networks, as they fail to exploit the parallelism that exists in neural network computation.

Although researchers have published some very specialized hardware designs for ASICs and FPGAs, it should be noted that big tech companies are nowadays behind many of the recent achievements and also provide a lot of open-source tools and training sets. The same companies usually offer cloud computing platforms in order to make their software usable even on less capable devices like the Raspberry Pi 3. In the case of a computer vision application, for example, the image is acquired from the camera and transmitted to the cloud platform, the computation takes place in the data center, and only the answer is transmitted back to the embedded device.

The Tensor Processing Unit (TPU) [14] is the ASIC designed by Google to allow its data centers to perform the computation during inference of deep neural networks. Although the first version was not able to assist during the training phase, the second version that was recently announced will be able to do both inference and training. Figure 16 illustrates the block diagram of the TPU. It can be seen that the basic block of the device is the matrix multiplication unit, which performs the convolutions. A significant amount of area is dedicated to the connection of the computing blocks to the weights and data memory, but most of the area is allocated to the internal buffers and the matrix multiply unit.
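
To make the role of a matrix multiplication unit more concrete, the following minimal sketch shows one common way of expressing a convolution as a single matrix product, often called im2col. It is only an illustration in NumPy and makes no claim about how the TPU arranges its data internally; the function names `im2col` and `conv2d_as_matmul` are hypothetical, and the example is limited to one channel, stride 1 and no padding.

```python
import numpy as np

def im2col(image, kh, kw):
    """Collect every kh x kw patch of a 2D image as one row of a matrix."""
    h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    rows = [image[i:i + kh, j:j + kw].ravel()
            for i in range(out_h) for j in range(out_w)]
    return np.array(rows), (out_h, out_w)

def conv2d_as_matmul(image, kernel):
    """Valid cross-correlation (the 'convolution' used in CNNs) as one matrix product."""
    patches, out_shape = im2col(image, *kernel.shape)
    # Each output pixel is the dot product of one flattened patch with the flattened kernel.
    return (patches @ kernel.ravel()).reshape(out_shape)

# Illustrative input: a 4x4 ramp image and a 2x2 diagonal-difference kernel.
image = np.arange(16, dtype=np.float64).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
result = conv2d_as_matmul(image, kernel)
```

With several kernels, the flattened kernels become the columns of a weight matrix and the whole layer reduces to one matrix-matrix product, which is the kind of workload a dedicated multiply unit is built for.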

Microsoft, on the other hand, decided to deploy an FPGA-based solution in order to provide cloud DNN computing that targets only the inference of real-time applications. The main reason behind the choice of FPGAs is the flexibility that they offer for future improvements and for changes to the datatype of the accelerator.

3 CNN based Object Detection

This part documents the development of a suitable training dataset for the prey-predator application together with the training procedure of the CNN. We also show the results obtained from the different classification architectures that were tested. Furthermore, it is shown that those architectures are suitable for basic object detection and that this can be further improved if we combine our neural networks with a pre-trained CNN.

Due to the iterative development, all the components of the work were redesigned and evaluated multiple times, especially because of the close relation between the CNN functionality and the accelerator. Only the final state of the development is presented here.

3.1 Training Dataset Generation

The training of an artificial neural network may be unsupervised or supervised. The former means that we define a cost function and let the back-propagation algorithm minimize the cost by working on the task without any previous knowledge or example. During supervised learning, on the other hand, we also define a cost function, but now we first provide examples of the desired output and hope that the network will imitate the behavior on new inputs as well. For this thesis we train the CNN in a supervised fashion, as this is the common case in the literature for image classification and has been proven to deliver good results, whereas unsupervised learning is more challenging to apply.

Supervised training requires a large number of examples in order to tune the parameters for maximum performance on unseen data. Moeys et al. created a training dataset of 500,000 examples by recording 1.2 hours of video and manually annotating each frame to indicate the ground truth position. The video recordings were acquired directly from the image sensor of the robot, which was remotely controlled for this purpose. In this thesis there was no constraint on the environment in which the robots could operate, so the goal was to be able to detect the robot in any environment. This raises the demand for an even bigger training dataset that includes many diverse examples. Additionally, at the beginning of the project the image source was not specified, so the option of recording video from the robot itself was not available.

To overcome those issues and start experimenting quickly, it was decided to create an artificial dataset by using a small number of images of the robot in combination with a larger number of random background images. In order to have the GoPiGo in various poses that we can combine with the different backgrounds, it was first necessary to extract only the robot, as in figure 17. The result is then saved in a file format that can also include transparency information; in this case the PNG file format was selected.

Figure 17: Extraction of the GoPiGo robot. It can be seen that the extracted picture of the robot is influenced by the environment, due to the transparent material that it is made from, and also by the lighting conditions.

After the extraction of 50 images, we mirror them in order to create a total of 100 PNG files that contain only the GoPiGo robot. The next step was to download about 1500 random images from the Internet and combine them with the robot's pictures by overlaying the latter on top. The motivation behind the choice of random backgrounds is that our goal was to create a system that would work in any environment. By presenting to the network a large number of images that contain shapes and colors that can be found around us, we help the network not to correlate the GoPiGo with features that can also be extracted from other objects. To make the dataset even more versatile, the robot's pictures are randomly resized and rotated, and the color and lighting properties of the generated image are altered. Some of the images used are shown in figure 19 together with samples of the generated training images and the corresponding labels. The generated images are in most cases unrealistic, but that is not a disadvantage, as we want to extract features that do not correlate with the context of a scene.
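
The overlay step described above can be sketched as follows. This is a minimal illustration and not the generation script of this work: the names `overlay_rgba` and `make_example` are hypothetical, images are assumed to be floating-point NumPy arrays with values in [0, 1], the robot picture is assumed to be smaller than the background, and the random resizing and rotation are left out to keep the example free of external image libraries.

```python
import numpy as np

def overlay_rgba(background, sprite, top, left):
    """Alpha-blend an RGBA sprite onto an RGB background (floats in [0, 1])."""
    out = background.copy()
    h, w = sprite.shape[:2]
    alpha = sprite[:, :, 3:4]                      # keep the last axis for broadcasting
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * sprite[:, :, :3] + (1.0 - alpha) * region
    return out

def make_example(background, sprite, rng):
    """Randomly mirror and dim the sprite, paste it at a random position,
    and return the image together with the known bounding box."""
    if rng.random() < 0.5:
        sprite = sprite[:, ::-1]                   # horizontal mirror
    sprite = sprite.copy()
    sprite[:, :, :3] *= rng.uniform(0.6, 1.0)      # assumed brightness range
    bg_h, bg_w = background.shape[:2]
    h, w = sprite.shape[:2]
    top = int(rng.integers(0, bg_h - h + 1))
    left = int(rng.integers(0, bg_w - w + 1))
    image = overlay_rgba(background, sprite, top, left)
    return image, (left, top, w, h)                # ground truth comes for free

# Illustrative data: a black 160x120 background and a fully opaque white sprite.
rng = np.random.default_rng(0)
background = np.zeros((120, 160, 3))
sprite = np.ones((30, 40, 4))
image, box = make_example(background, sprite, rng)
```

Because the script chooses `top` and `left` itself, the position of the robot is known without any manual annotation.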

Figure 18: Labeling of the different regions that the robot may lie within. L, C and R stand for the left, central and right part of the image, whereas N indicates the absence of the robot. The dimensions of the images we used to train our CNNs are 160×120 pixels.

Figure 19: Example of images created by the training dataset generation algorithm. The ratio between the GoPiGo and background images is altered in this illustration. In reality, the number of GoPiGo photos is much smaller than the number of backgrounds we have used. The capital letters on the left and right side of the generated images are their corresponding labels. Finally, two examples with the same background are shown in order to emphasize the different color and lighting settings.

The great advantage of this method is that, since the script itself places the robot over the background, the pixel coordinates are already known. Thus, independently of how the problem is defined, the ground truth is available and can be directly used for the creation of the training dataset. The output of the generating script consists of two matrices: a 4D matrix which contains a number of color images, and a 2D matrix which contains the ground truth for every image in the former. In this specific case we treat the problem as a classification task, so the ground truth value is an integer which indicates the output class for each image. Figure 18 shows how the image is divided into 3 regions, namely Left, Center and Right. In order to describe the absence of the robot we need a fourth class, which means that the classifier is trained to classify each frame into those four categories. Finally, for better results during the training phase it is beneficial to have an equal number of examples per class, so each category (Left, Center, Right, Non-visible) should ideally be represented by 25% of the images in the training dataset.
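
A minimal sketch of the labeling and of the two output matrices is given below. Several details are assumptions made only for illustration: the three regions are taken to be equal thirds of the image width, the class is decided by the horizontal center of the robot, the image matrix uses the axis order (examples, height, width, channels), and the label matrix has one column. The names `lcrn_label` and `CLASSES` are hypothetical.

```python
import numpy as np

CLASSES = ("L", "C", "R", "N")      # Left, Center, Right, Non-visible
WIDTH, HEIGHT = 160, 120            # image dimensions used for training

def lcrn_label(center_x, width=WIDTH):
    """Return the class index for a robot centered at center_x.
    center_x is None when the robot is absent. Equal thirds are an assumption."""
    if center_x is None:
        return CLASSES.index("N")
    third = width / 3.0
    if center_x < third:
        return CLASSES.index("L")
    if center_x < 2 * third:
        return CLASSES.index("C")
    return CLASSES.index("R")

# Illustrative robot positions, two per class, so the classes are balanced.
centers = [10, 80, 150, None] * 2

# 4D matrix of color images (left empty here) and 2D matrix of integer labels.
images = np.zeros((len(centers), HEIGHT, WIDTH, 3), dtype=np.uint8)
labels = np.array([[lcrn_label(c)] for c in centers])

# Fraction of examples per class; ideally 0.25 each.
share = np.bincount(labels.ravel(), minlength=len(CLASSES)) / len(labels)
```

Checking `share` before training is a cheap way to confirm that no class dominates the generated dataset.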

We chose to follow this labeling method (LCRN), which was found in the work of Moeys et al. The reason is that the initial approach to the problem was to train a network end-to-end, so that the output of the network would directly drive the robot. This is the simplest way to tackle the predator-prey problem, but it is not optimal because such a simple solution cannot preserve information from past frames. However, this method proved to be a good starting point, and the classification networks that were trained with this dataset were also used for localization, as we discuss in part 3 of this Chapter.
