Transfer Learning Techniques - DCNNs for Grasp Detection

2.4 DCNNs for Grasp Detection

2.4.2 Transfer Learning Techniques

The most successful robotic grasp detection work has used transfer learning methods to achieve accuracies close to 90%. Any transfer learning approach includes the following steps:

1. Data pre-processing 2. Pre-trained CNN model

Compared to image classification, robotic grasp detection requires the capability of a DCNN to identify grasp configurations for novel objects. This requires training on generalised object scene images. Therefore, most researchers limit their pre-processing techniques to just accommodate CNN input dimensions such as image width and the number of channels. Unlike in image classification, the ground truth data for grasping was less augmented. Redmon et al. [19] reported the minimum amount of necessary pre-processing for RGB-D datasets as centre cropping of images and replacing the blue channel of RGB images with depth data while normalising the depth data to (0, 255) range, which is the default RGB colour space range. While following the exact same procedure in [19], Watson et al. [58] normalised all RGB values and grasp labels to (0, 1) arguing that training targets should be in the same range as the training data. In their method, Pinto et al. [64] resized images to 227 × 227 which was the input size for their model. Their network had a similar architecture to AlexNet [46] as shown in Figure 2.8.

Due to the problem of overfitting with the limited available datasets, the deep learning robotic grasp detection literature indicates that many authors have used pre- training. Pre-training was identified as a transfer learning technique where the deep network was pre-trained on larger datasets prior to the training on domain specific

Figure 2.8: 18-way binary classifier by Pinto et al. [64].

data. The weights of the convolutional layers that were originally learnt during the pre-training were kept frozen during the training on domain specific data.

In their one-shot detection method, Redmon et al. [19] used the AlexNet [46] convolutional network architecture in compiling their network model [19] which achieved a grasp detection accuracy of 84.4%. The same approach was used in [20, 58] with similar results reported. They modified the orientation parameter from the grasp rep- resentation while arguing that the angle predictions are two-fold (positive or negative) [19]. Therefore the authors replaced θ with (sin 2θ, cos 2θ) following the trigonometric definitions. The argument was further supported in [20, 58].

A similar pre-training approach was employed in [10] using the deeper ResNet- 50 architecture as opposed to the variant of the AlexNet from [19]. They reported that the deeper network model afforded them an increased accuracy of 89.21%. They further reported a second model as shown in Figure 2.9. It was aimed at uni-modal data such as RGB or RG-D images. This model achieved an accuracy of 88.84%.

Even though the transfer learning made it less challenging to use a pre-trained model for the convolutional part of a DCNN, researchers still had to design the fully connected part of the DCNN. While there is not enough evidence to determine the optimal number of units required, Redmon et al. [19] used two fully connected layer that had 512 units in each individual layer. In [10], Kumra et al. used one fully connected layer with 512 units for their Uni-modal architecture and two layers with 512 units for their Multi-modal architecture. A model similar to the deep network in [19] was later followed in [20] and [58]. In DCNN training applications, the popular

learning optimiser is the Stochastic Gradient Decent (SGD). In deep grasp detection most authors used the SGD but they argued that it was not an optimal optimiser and reported more advanced optimisers were necessary. Ruder provided a comprehensive overview of different learning optimisers in [93].

While most transfer learning approaches employed pre-trained network models in an end-to-end learning process some researchers used them as feature extractors for shallow network models. Chollet [94] stated that due to the 2-step method of feature extraction it was impossible to employ data augmentation techniques if required as the learnt features would not be the same as the training images. Also running end- to-end learning was costlier as it required the convolutional base of the network to be run on data repetitively.

Table 2.2: Comparison between different transfer learning techniques in one-shot grasp detection. Results were reported from tests performed on the Cornell Grasp Dataset [70]

Method Architecture Accuracy (%) Accuracy (%)

(image-wise) (object-wise)

Direct regression AlexNet [46] 84.4% 84.9%

by Redmon et al. [19]

Regression+Classification AlexNet [46] 85.5% 84.9% by Redmon et al. [19]

MultiGrasp Detection AlexNet [46] 88.0% 87.1% by Redmon et al. [19] Uni-modal, SVM, RGB ResNet-50 [89] 84.76% 84.47% by Kumra et al. [10] Uni-modal RGB ResNet-50 [89] 88.84% 87.72% by Kumra et al. [10] Uni-modal RGD ResNet-50 [89] 88.53% 88.40% by Kumra et al. [10] Multi-modal SVM, RGB-D ResNet-50 [89] 86.44% 84.47% by Kumra et al. [10] Multi-modal RGB-D ResNet-50 [89] 89.21% 88.96% by Kumra et al. [10]

Direct Regression (RG-D) AlexNet [46] 71% NA by Basalla et al. [20]

Direct Regression (RG-D) AlexNet [46] 78% NA by Watson et al. [58]

Detecting object grasping configurations from images is still accurately solved using analytical methods but the use of empirical methods is exponentially increasing due to successful results in recent publications. One commonly used method is to train a visuomotor controller using deep learning that iteratively corrects the grasping point until the object is successfully grasped between the gripper jaws. The next best method is to learn a function that scores the possible grasps on an image and use it to select the highest scored grasp as the candidate. There are other methods that learn a certain heuristic and exhaustively search for possible grasps on the images. Training CNNs to detect grasps requires a high volume of manually labelled data. As a solution, most researchers opt to use simulated training data as in [77, 65, 14]. Alternatively, data collection can be automated as in [64]. Recent approaches have suggested network pre-training can help to avoid overfitting due to the limited availability of training data [19, 10, 20]. As shown in Table 2.2, AlexNet [46] and ResNet-50 [89] were widely used in these studies. The reasoning for the reduced results in [20] and [58] compared to [19] was not clarified by any author. However,

the literature survey suggested that these methods utilised a lower number of training data instances and a varied set of data augmentation techniques as compared to [19]. This, further supported the arguments made by many different researchers regarding the limited availability of high quality annotated training data for such robotic grasp detection work.

In document Learning to grasp in unstructured environments with deep convolutional neural networks using a Baxter Research Robot (Page 50-54)