Transfer Learning - Embedded neural network design on the ZYBO FPGA for vision based object loc

Within the field of machine learning, transfer learning describes the reuse of knowledge gained while solving one task to solve another, different task. In our case, the knowledge is stored as a network topology together with a set of parameters, namely a model. The task for which this knowledge was developed is the ImageNet classification challenge. The new task we want to solve is the loose detection of the GoPiGo with the LCRN classifier we have discussed so far.

In effect, we replace the handcrafted feature extraction of the traditional image processing approach with a CNN that has learned parameters. On top of that we add some extra layers, similar to the SVM that traditionally sits on top of the feature extractor. The resulting CNN can be seen in Figure 29. Note that the ILSVRC is mainly focused on classification, thus we have to remove the top fully connected layers and use only the convolutional part of the CNN.

When using an already trained CNN, it is known that robust performance can be achieved with a relatively small number of examples. We can work around the small number of examples by augmenting them. The Keras library we use has built-in functionality for image augmentation, so it natively supports transformations such as rotation, zooming, and lighting and color alteration. The only limitation we face is that our classification problem (LCRN) needs the GoPiGo to be in a certain position. Consequently, we cannot use most of the built-in functionality because it would change the position of the robot. Our dataset generation algorithm, however, performs the same kind of transformations and also provides the correct position, so it can be used without any change.
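The interaction between augmentation and position labels can be illustrated with a minimal sketch. It assumes that LCRN stands for left, center, right and none, and that the three segments are equal thirds of the image width; both are assumptions made for illustration, and the function names are hypothetical rather than part of the dataset generator described here.

```python
def lcrn_label(center_x, image_width):
    """Map the robot's horizontal center (pixels) to an LCRN label.

    Assumption: the image is split into three equal vertical segments.
    A missing or out-of-frame center is labelled "N" (none).
    """
    if center_x is None or not (0 <= center_x < image_width):
        return "N"
    third = image_width / 3
    if center_x < third:
        return "L"
    if center_x < 2 * third:
        return "C"
    return "R"


def label_after_shift(center_x, shift_px, image_width):
    """Label after translating the image horizontally by shift_px pixels."""
    if center_x is None:
        return "N"
    return lcrn_label(center_x + shift_px, image_width)


def label_after_flip(center_x, image_width):
    """Label after mirroring the image around its vertical axis."""
    if center_x is None:
        return "N"
    return lcrn_label(image_width - 1 - center_x, image_width)


if __name__ == "__main__":
    width = 300
    print(lcrn_label(50, width))              # robot in the left segment
    print(label_after_shift(50, 120, width))  # same robot after a shift
    print(label_after_flip(50, width))        # same robot after a flip
```

A generic augmenter applies the shift or flip to the pixels only, so the stored label would become stale. Recomputing the label from the transformed position, as above, keeps the image and its label consistent.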

The pre-trained CNN we have chosen is a member of the MobileNets family. The main reason is its small computation and memory requirements; this can also be seen in [11], where MobileNets are used in different object tracking pipelines and compared to other available CNNs. At this stage we rely on the Keras library for support of the pre-trained ConvNets, as it is time consuming and challenging to port the architecture and the weights from another framework or library. Thankfully, most of the state-of-the-art CNNs are officially supported or have been open-sourced by Keras users.
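As a rough sketch of how such a model could be assembled with Keras, the snippet below loads the convolutional part of a MobileNet without its top layers, freezes it, and adds a small trainable head. The head (global average pooling followed by one softmax layer), the input size, the optimizer, and the four class names are assumptions made for illustration; the layers actually used in this work are those shown in Figure 29. The `alpha` argument is the MobileNet width multiplier.

```python
# Assumed meaning of the LCRN outputs; not confirmed by the text.
LCRN_CLASSES = ("left", "center", "right", "none")


def build_classifier(input_shape=(224, 224, 3), alpha=0.25, weights="imagenet"):
    """Frozen MobileNet feature extractor with a small trainable head."""
    # Imported lazily so the module can be read without TensorFlow installed.
    import tensorflow as tf

    # Convolutional part only: include_top=False drops the ImageNet classifier.
    base = tf.keras.applications.MobileNet(
        input_shape=input_shape,
        alpha=alpha,
        include_top=False,
        weights=weights,
    )
    base.trainable = False  # 'freeze' the pre-trained parameters

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs, training=False)  # keep batch-norm statistics fixed
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(len(LCRN_CLASSES), activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_classifier()` downloads the ImageNet weights on first use; passing `weights=None` builds the same topology with random initialization, which is useful for checking shapes offline.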

In order to begin the training procedure, we only need to choose an appropriate value for the α parameter of the MobileNets. It controls the size of the network: setting α = 0.25 gives the smallest possible configuration of the CNN, whereas α = 1 utilizes the full network. What the α parameter controls is the number of filters applied at each convolution layer. For example, when α = 0.25 the output of the network consists of 256 feature maps, whereas when α = 1 the number of feature maps at the output is 1024. Consequently, the number of parameters is also influenced, ranging from about 300,000 for α = 0.25 up to 3,200,000 for α = 1. For our classifier we chose the minimum configuration, as we only want to track a single object, namely the GoPiGo robot.
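The effect of the width multiplier can be checked with a little arithmetic. The sketch below assumes that each layer's filter count is obtained as `int(base * alpha)` and that the last layer of the full network has 1024 filters, as stated above; the function names are hypothetical.

```python
FULL_OUTPUT_FILTERS = 1024  # feature maps at the output when alpha = 1


def scaled_filters(base_filters, alpha):
    """Number of filters in a layer after applying the width multiplier."""
    return int(base_filters * alpha)


def pointwise_weights(base_in, base_out, alpha):
    """Weights of a 1x1 convolution whose input and output are both scaled."""
    return scaled_filters(base_in, alpha) * scaled_filters(base_out, alpha)


if __name__ == "__main__":
    for alpha in (0.25, 1.0):
        maps = scaled_filters(FULL_OUTPUT_FILTERS, alpha)
        weights = pointwise_weights(FULL_OUTPUT_FILTERS, FULL_OUTPUT_FILTERS, alpha)
        print(f"alpha={alpha}: {maps} feature maps, {weights} weights in the last 1x1 conv")
```

Because both the input and the output channels of a 1x1 convolution shrink with α, its weight count shrinks roughly with α squared. This is consistent with the parameter count dropping by about an order of magnitude between α = 1 and α = 0.25.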

The procedure of training the last layers that we have added on top of the MobileNet takes a bit longer this time, but it is much more predictable and progressive. We use 38,000 images for training and 2,000 images for validation, as we did with the compact CNNs we trained. Each epoch now requires about 10 minutes, but we achieve very high accuracy with only 4 epochs. The accuracy on the training dataset is 98%, but more important is the fact that the performance on real images is very robust this time. We test the classifier with streaming video from a webcam and observe that the lighting conditions or other objects that previously degraded the performance of our custom CNNs no longer influence the performance of the MobileNets-based classifier. Of course, the performance drops in the absence of adequate lighting.

Figure 29: MobileNets-based convolutional neural network. Because the MobileNet serves as a feature extractor, we 'freeze' its parameters and let the optimization algorithm tune only the layers we added on top.

Finally, the last thing we have tested is to train the same MobileNets-based classifier with a configuration different from the LCRN one used so far. The decision was based on the observation that the LCRN classifier produces most of its erroneous results when the robot is in between the segments we defined, for example when the robot is not exactly in the left or the central part of the image but rather in both. In order to avoid this effect, we train a binary classifier with only two outputs that indicate the presence or absence of the GoPiGo in the image.

To achieve this, we only alter the training dataset generation slightly and train the layers that we add on top of the MobileNet, which acts as a feature extractor. Similarly to the previous configuration, the CNN achieves very high accuracy on the training dataset, and this also holds on previously unseen images. The performance of the binary classifier is very robust, and we only observe some false positives in a very small number of frames. When the robot is moving, this problem will not be difficult to overcome if we use, for example, a Kalman filter to filter out those errors that last for a couple of frames.

3.4.1 Using MobileNets for Robust GoPiGo Detection

The GoPiGo detection pipeline we previously presented suffers from the poor generalization of the classification neural network. The localization method, on the other hand, delivers adequate performance even in its most basic form. Having solved the classification problems with the aid of a pre-trained CNN, we can now improve the overall performance of the detection pipeline.

This version of the GoPiGo detector can be seen in Figure 30. We now have to evaluate the MobileNets-based CNN first, and only if the prediction is positive do we go on to evaluate our compact CNN. Because the classification is based on a different network, we only need to compute the convolutional layers of our compact CNN and not its fully connected layers. This is because the fully connected layers produce the output of the classifier but are not needed in order to extract the location with our method.

Figure 30: MobileNets-based object detection pipeline.
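The two-stage control flow described above can be sketched as follows. The three callables are hypothetical stand-ins for the MobileNets-based presence classifier, the convolutional layers of the compact CNN, and the localization step; the 0.5 decision threshold is likewise an assumption.

```python
def detect(frame, presence_classifier, compact_conv_layers, locate, threshold=0.5):
    """Run the cheap localization stage only when the robot is judged present.

    presence_classifier(frame) -> probability that the GoPiGo is in the frame
    compact_conv_layers(frame) -> feature maps of the compact CNN (no dense layers)
    locate(feature_maps)       -> estimated location of the robot
    Returns the location, or None when the classifier reports no robot.
    """
    if presence_classifier(frame) < threshold:
        return None  # negative prediction: skip the compact CNN entirely
    feature_maps = compact_conv_layers(frame)
    return locate(feature_maps)


if __name__ == "__main__":
    # Dummy stand-ins that only demonstrate the control flow.
    calls = []

    def fake_conv(frame):
        calls.append(frame)
        return frame

    print(detect("frame-a", lambda f: 0.9, fake_conv, lambda fm: (12, 34)))
    print(detect("frame-b", lambda f: 0.1, fake_conv, lambda fm: (12, 34)))
    print(len(calls))  # the compact CNN ran for the positive frame only
```

Skipping the second stage on negative frames means the compact CNN contributes no computation when the robot is absent, and its fully connected layers are never evaluated in either case.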
