The OrigCPU Algorithm - Support Vector Machines

5.1.1 The OrigCPU Algorithm

The OrigCPU algorithm is depicted in Figure 5.2. The face detection component is the foundation of the feature extraction procedure and is used to obtain the centre of the face in the current image. The centre of the face is used for two purposes:

1. To obtain a colour distribution that is representative of the individual’s skin.
2. For normalization purposes. The individual is repositioned in the image such that he/she is in the centre of the frame.

The nose region is positioned around the centre of the facial frame. A 10 × 10 pixel area around the centre of the nose is extracted as illustrated in Figure 5.3. The Hue values of this area are represented as a histogram, which functions as a look-up table for skin pixel values. This method was discussed in the previous chapter. The histogram is back-projected onto the original image to produce a greyscale image in which regions that are more likely to be skin appear brighter. Following the work of Achmed [1] and Li [51], the result is thresholded at a value of 60 to obtain the binary skin image illustrated in Figure 5.4.
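The histogram look-up and thresholding described above can be sketched as follows. This is a minimal illustration in plain Python rather than the implementation used in this work. The function names, the hue range of 0 to 179, the scaling of the histogram so that its peak maps to 255, and the strict "greater than" comparison are all assumptions made for the example. Only the 10 × 10 patch size and the threshold of 60 come from the text.

```python
def hue_histogram(hue, centre, half=5, bins=180):
    """Histogram of the hue values in the (2*half x 2*half) patch around
    centre=(row, col), scaled so that the most frequent hue maps to 255.
    Assumes the patch lies fully inside the image and hue values are
    integers in [0, bins)."""
    r0, c0 = centre
    hist = [0] * bins
    for r in range(r0 - half, r0 + half):
        for c in range(c0 - half, c0 + half):
            hist[hue[r][c]] += 1
    peak = max(hist)
    return [round(255 * h / peak) for h in hist]


def back_project(hue, hist):
    """Replace every pixel by the histogram value of its hue, so that
    hues common in the nose patch appear brighter."""
    return [[hist[h] for h in row] for row in hue]


def threshold(grey, t=60):
    """Binary image: 255 where the back-projection exceeds t, else 0."""
    return [[255 if v > t else 0 for v in row] for row in grey]


def skin_image(hue, nose_centre):
    """Hue image -> binary skin image, using the nose patch as the model."""
    return threshold(back_project(hue, hue_histogram(hue, nose_centre)))
```

Because the histogram acts purely as a look-up table, the back-projection costs one table access per pixel, independent of the number of bins.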

The GMM background subtraction technique is used to highlight the moving foreground in the current image as illustrated in Figure 5.5. The result of the background subtraction is combined with the result of the skin detection using a logical AND operation.

Chapter 5. Design and Implementation of the Upper Body Pose Recognition and

Estimation System 55

Figure 5.2: Original upper body pose recognition and estimation algorithm.

Figure 5.3: Face detection and nose region.

Figure 5.4: Skin image.

Figure 5.5: GMM background subtraction.

This technique highlights only the skin pixels that have moved. The result is henceforth referred to as the moving skin image and is depicted in Figure 5.6. Stationary pixels that were falsely detected as skin, such as skin-coloured furniture, are eliminated, as are moving objects that are not skin-coloured. The majority of noise in the image is thereby eliminated, which makes the feature extraction robust to both kinds of false detection.

No further processing is performed when the result of the background subtraction contains fewer than a certain number of pixels. Achmed found 7000 pixels to be the optimal threshold [1].
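The combination of the two masks and the minimum-pixel check can be sketched as below. The function names and the convention of returning None when too little has moved are hypothetical choices made for this example. The logical AND and the threshold of 7000 pixels are taken from the text.

```python
MIN_FOREGROUND_PIXELS = 7000  # threshold reported by Achmed [1]


def count_set(mask):
    """Number of non-zero pixels in a binary mask."""
    return sum(1 for row in mask for v in row if v)


def moving_skin(skin, foreground, min_pixels=MIN_FOREGROUND_PIXELS):
    """Logical AND of the skin mask and the foreground mask.

    Returns None (assumed convention) when the foreground mask contains
    fewer than min_pixels set pixels, so that the caller can skip all
    further processing for this frame."""
    if count_set(foreground) < min_pixels:
        return None
    return [[255 if s and f else 0 for s, f in zip(skin_row, fg_row)]
            for skin_row, fg_row in zip(skin, foreground)]
```

A pixel survives only if it is both skin-coloured and moving, which is why stationary skin-coloured objects and moving non-skin objects both drop out.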

Figure 5.6: The results of a skin image superimposed on the objects of interest to obtain the moving skin image.

Figure 5.7: Enhanced moving skin image with noise removed.

It can be observed in Figure 5.6 that the arm of the individual has rough contours and a large hole, that is, a big discontinuity of white pixels in the arm. The following morphological operations are applied to remove such unwanted features from the image: Erosion, Opening and Dilation, in that order. Erosion is applied, using a 17 × 17 rectangular structuring element, to remove isolated noise regions from the image. The rough contours and any remaining noise are removed by applying Opening with a 21 × 21 rectangular structuring element. Dilation is applied to the resulting image, using a 13 × 13 rectangular structuring element, to produce the enhanced image illustrated in Figure 5.7.
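The sequence of Erosion, Opening and Dilation with rectangular structuring elements can be illustrated with the following sketch. It is a deliberately simple, unoptimized version in plain Python, not the implementation used in this work. The function names are hypothetical, the structuring element sizes are assumed to be odd, and pixels outside the image are simply ignored at the border, which is an assumption about border handling. The default sizes of 17, 21 and 13 are those given in the text.

```python
def _apply(mask, k, reducer):
    """Slide a k x k rectangular structuring element over a binary mask
    and combine the covered pixels with reducer (min or max).
    Pixels outside the image are ignored (assumed border handling)."""
    h, w = len(mask), len(mask[0])
    half = k // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [mask[rr][cc]
                      for rr in range(max(0, r - half), min(h, r + half + 1))
                      for cc in range(max(0, c - half), min(w, c + half + 1))]
            out[r][c] = reducer(window)
    return out


def erode(mask, k):
    """A pixel stays set only if every pixel under the element is set."""
    return _apply(mask, k, min)


def dilate(mask, k):
    """A pixel becomes set if any pixel under the element is set."""
    return _apply(mask, k, max)


def opening(mask, k):
    """Erosion followed by dilation with the same element."""
    return dilate(erode(mask, k), k)


def enhance(mask, k_erode=17, k_open=21, k_dilate=13):
    """Erosion, Opening and Dilation, in that order."""
    return dilate(opening(erode(mask, k_erode), k_open), k_dilate)
```

On a small test mask, `opening(mask, 3)` removes an isolated pixel while restoring a 5 × 5 block to its original extent, which is the behaviour the text relies on for noise removal.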

The location of the face is used to normalize the moving skin image. The normalization process shifts the moving skin image vertically and horizontally such that the face in the current frame is aligned with the first facial frame in the sequence.
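The normalization shift can be sketched as a pure translation. In this illustration the function name, the (row, column) convention for face centres, and the choice to fill vacated pixels with 0 and discard pixels shifted out of the frame are assumptions. The text only specifies that the image is shifted so that the face aligns with its position in the first frame.

```python
def shift_to_reference(mask, face, ref_face):
    """Translate mask so that the face centre face=(row, col) of the
    current frame lands on ref_face=(row, col), the face centre of the
    first frame in the sequence."""
    dr = ref_face[0] - face[0]
    dc = ref_face[1] - face[1]
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]  # vacated pixels stay 0 (assumed)
    for r in range(h):
        for c in range(w):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:  # drop pixels leaving the frame
                out[nr][nc] = mask[r][c]
    return out
```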

The normalized image is resized. The input images used have a size of 640 × 480 pixels. The more pixels an image contains, the greater the amount of detail, which in turn allows for more accurate extraction of features. However, training on a large number of features results in very long SVM training and testing times. An image size of 640 × 480 amounts to 307 200 features per image. When training on 1500 images, a total of 460 800 000 features are obtained. An efficient way to reduce the number of features, while retaining the essence thereof, is to reduce the size of the image [12].

Each image is resized to 40 × 30 pixels using an external program called Convert. This takes place by averaging every 16 × 16 block of pixels into a single pixel. The pixels of the resized image form the feature vector of the input image, which is written to a data file to be used by the SVM.
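The effect of the resizing step can be illustrated by the following block-averaging sketch. It is not the Convert program itself, whose internals are not described here. It only mirrors the 16 × 16 averaging stated in the text. The function names are hypothetical, and the image dimensions are assumed to be exact multiples of the block size.

```python
def block_average(image, block=16):
    """Average every block x block tile into a single pixel.
    Assumes the image height and width are multiples of block."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h, block):
        row = []
        for c in range(0, w, block):
            total = sum(image[rr][cc]
                        for rr in range(r, r + block)
                        for cc in range(c, c + block))
            row.append(total / (block * block))
        out.append(row)
    return out


def to_feature_vector(image):
    """Flatten the resized image row by row into one feature vector."""
    return [v for row in image for v in row]
```

For a 640 × 480 input this yields a 40 × 30 image, that is, 1200 features per image instead of 307 200. For 1500 training images the total drops from 460 800 000 to 1 800 000 features.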

This system was implemented on the CPU. In this implementation, the entire input image sequence is first loaded into primary memory before the feature extraction phase. This is carried out to avoid the latencies associated with transferring data between the hard disk and memory. Also, each step in the process is executed on all cores of the CPU using Threading Building Blocks (TBB). Making use of these two optimizations ensures that the CPU runs at its full capacity, so that a fair comparison between the CPU and GPU implementations can be carried out in the next chapter.

In document Faster upper body pose recognition and estimation using compute unified device architecture (Page 66-70)