2.2 Rapid Object Detection: a review of the Viola and Jones Method
2.2.5 Multiresolution Analysis in Feature-based Systems
Methods based on invariant features and methods based on non-invariant features require different approaches to translation, scaling, rotation, lighting conditions, and articulation. As invariance is usually limited to a few transformations (there is no ideal feature set that is absolutely invariant to all transformations), even invariant features often need to be computed over subsets of pixels (such as sub-windows at a specific position, scale and rotation). To compute features on the region where the image of the object is, one needs to compute the feature set in many sub-windows using many different scales and positions. An exhaustive search is usually impossible due to real-time constraints.
The Haar-like features used by Viola and Jones (2001a) can be made invariant to scaling (by dividing the feature value by the area of the feature), but not to any other transformation. This requires the use of a multi-scale approach where each sub-window is a rectangular sub-set of the image. Multiresolution analysis would usually require an image pyramid approach, which is computationally expensive.
The following paragraphs discuss how to assess the sub-windows of a frame with respect to translation, scaling, and rotation. For Haar-like features, rotation is limited to specific angles, while translation and scaling are limited only by accuracy and speed.
• Translation
In order to detect an object of the same size as the kernel, several sub-windows are examined. If the kernel has a size of M×N pixels and the image has a size of W×H pixels, then the number of sub-windows S is:
\[ S = \frac{(W - M)(H - N)}{t} \tag{2.6} \]
where:
– W,H are the width and the height of the image
– M,N are the width and the height of the kernel
– t is the translation factor in pixels
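As a quick check of Equation 2.6, the short Python sketch below evaluates the expression exactly as it is stated above. The function name `translation_subwindows` is hypothetical, and the frame and kernel sizes are illustrative values only.

```python
def translation_subwindows(W, H, M, N, t=1):
    """Evaluate Equation 2.6: S = (W - M)(H - N) / t."""
    if t < 1:
        raise ValueError("translation factor t must be at least 1 pixel")
    if M > W or N > H:
        return 0  # the kernel does not fit inside the image
    # Integer division: a fractional sub-window count is rounded down.
    return (W - M) * (H - N) // t


# Illustrative values: a 640x480 frame and a 24x24 kernel.
print(translation_subwindows(640, 480, 24, 24, t=1))  # 280896
print(translation_subwindows(640, 480, 24, 24, t=2))  # 140448
```

Doubling the translation factor halves the count given by Equation 2.6, which is the trade-off between speed and the risk of missing an object.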
A common problem is that classifiers can hit the same object more than once. This happens because two sub-windows that are very close to each other can both yield values within the margins allowed by the classifier. Usually, some form of post-processing is necessary to eliminate these additional hits and compose a single coherent hit. Two approaches are possible. The first approach, used in the OpenCV library, is to eliminate small hit regions inside larger hit regions and take an average for the final hit position. The second approach takes into account how far each hit was from the final classifier threshold and assumes that the actual hit position is the one with the best margin over that threshold.
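The first post-processing approach can be illustrated with a minimal Python sketch. It is a simplified variant written for this review, not the OpenCV implementation: the function names, the (x, y, w, h) rectangle format and the `min_overlap` threshold are all assumptions. Hits that overlap sufficiently are grouped, and each group is replaced by its average rectangle.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0  # the rectangles do not intersect
    return (iw * ih) / min(aw * ah, bw * bh)


def merge_hits(hits, min_overlap=0.5):
    """Group overlapping hits and return one averaged rectangle per group."""
    groups = []  # each group is a list of rectangles (x, y, w, h)
    for hit in hits:
        merged = [hit]
        remaining = []
        for group in groups:
            # A hit joins every group containing a rectangle it overlaps enough.
            if any(overlap_ratio(hit, r) >= min_overlap for r in group):
                merged.extend(group)
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    # Average position and size within each group.
    return [tuple(sum(r[k] for r in g) / len(g) for k in range(4))
            for g in groups]


hits = [(100, 100, 24, 24), (102, 100, 24, 24), (101, 102, 24, 24),
        (300, 200, 48, 48)]
print(merge_hits(hits))  # two rectangles: one per detected object
```

Because the overlap is measured against the smaller rectangle, a small hit lying entirely inside a larger one has a ratio of 1.0 and is absorbed into the larger hit's group.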
• Scaling
The smallest sub-window is of the size of the images with which the classifier was trained. This base window is called a kernel.
Scaling is necessary to find objects whose size differs from that of the trained kernel. Once a classifier is trained, there is no need to scale down the image to be assessed. Instead, Haar-like features can be computed at any scale directly from the SATs, which are computed once for each frame (Viola and Jones, 2004). One drawback arises from rounding: because digital images are discrete, scaling generates fractional positions and sizes that must be rounded to whole pixels. Lienhart, Kuranov and Pisarevsky (2003) showed how to compute correction factors that minimise this problem.
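The idea of scaling the feature rather than the image can be sketched in Python as follows. This is an illustrative sketch only: the function names, the choice of a left-minus-right two-rectangle feature and the list-of-lists image format are assumptions, and the correction factors of Lienhart, Kuranov and Pisarevsky (2003) are not reproduced here. The SAT is built once, and a feature at any scale then costs a fixed number of lookups.

```python
def build_sat(image):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(image), len(image[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def rect_sum(sat, x, y, w, h):
    """Sum of the pixels in a rectangle using four SAT lookups."""
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x]


def two_rect_feature(sat, wx, wy, fx, fy, fw, fh, scale=1.0):
    """Left-minus-right feature at a given scale, normalised by its area.

    (wx, wy) is the sub-window origin in the image; (fx, fy, fw, fh) is the
    left half of the feature in kernel coordinates. The caller must make
    sure the scaled feature stays inside the image.
    """
    # Rounding maps fractional positions and sizes back onto the pixel grid.
    sx, sy = wx + round(fx * scale), wy + round(fy * scale)
    sw, sh = max(1, round(fw * scale)), max(1, round(fh * scale))
    left = rect_sum(sat, sx, sy, sw, sh)
    right = rect_sum(sat, sx + sw, sy, sw, sh)
    return (left - right) / (2 * sw * sh)  # divide by the feature area


# Illustrative 4x4 image: bright left half, dark right half.
image = [[10, 10, 0, 0] for _ in range(4)]
sat = build_sat(image)
print(two_rect_feature(sat, 0, 0, 0, 0, 1, 2, scale=2.0))  # 5.0
```

The rounding step inside `two_rect_feature` is where the scaled rectangle can drift from its ideal fractional size, which is the inaccuracy that the correction factors address.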
Computing every possible scale is not feasible if the real-time constraints are to be met. A reasonable number of sub-windows have to be neglected. Typically, scales
are computed using factors from 1.1 up to 1.4. The smaller the factor, the more demanding the computation. If the factor is too large, objects may be missed. The total number of sub-windows that have to be assessed is:
\[ \sum_{i=0}^{n} (W - M f^{i})(H - N f^{i}) \tag{2.7} \]
where:
– W,H are the width and the height of the image
– M,N are the width and the height of the kernel
– f is the scaling factor (the kernel sizes need to be rounded to an integer)
– n is the maximum number of times the scaling is computed, limited by:
$\mathrm{Round}(M f^{n}) < W$ and $\mathrm{Round}(N f^{n}) < H$
For example, for a 640×480 pixel frame, a 24×24 pixel kernel and a factor of 1.1, the total number of sub-windows is 4482974. At 15 frames per second, and if each feature needs 8 lookups, approximately $5.4 \times 10^{8}$ lookups per second (4482974 × 8 × 15) are needed just for the calculation of features. In practice, translation and scaling are usually computed in steps of more than one pixel. Large translation and scaling factors can cause a loss of accuracy in the form of missed objects. At the other extreme, small translation and scaling factors can slow down the classification process and produce an overwhelming number of detections.
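Equation 2.7 and the lookup estimate can be evaluated with a short Python sketch. The function names are hypothetical, and the sketch assumes that the kernel size is rounded with Python's built-in `round` and that one feature is evaluated per sub-window; the exact total depends on the rounding convention used.

```python
def scaled_subwindows(W, H, M, N, f):
    """Evaluate Equation 2.7; returns (number of scales, total sub-windows)."""
    if f <= 1.0:
        raise ValueError("scaling factor f must be greater than 1")
    total = 0
    i = 0
    while True:
        m = round(M * f ** i)  # kernel width at scale i, rounded to an integer
        n = round(N * f ** i)  # kernel height at scale i
        if m >= W or n >= H:   # stop once the scaled kernel no longer fits
            break
        total += (W - m) * (H - n)
        i += 1
    return i, total


def lookups_per_second(subwindows, lookups_per_feature, fps):
    """SAT lookups per second if one feature is evaluated per sub-window."""
    return subwindows * lookups_per_feature * fps


# Small illustrative case: 10x10 image, 4x4 kernel, factor 2.
print(scaled_subwindows(10, 10, 4, 4, 2.0))  # (2, 40)

# Lookup rate for the sub-window total quoted in the text.
print(lookups_per_second(4482974, 8, 15))  # 537956880
```

Calling `scaled_subwindows(640, 480, 24, 24, 1.1)` evaluates the setting of the example above, and trying factors between 1.1 and 1.4 shows how quickly the workload falls as the factor grows.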
• Rotation
Haar-like features are not invariant to rotation and it is computationally expensive to rotate every sub-window and detect all possible rotations. An alternative for dealing with rotation using Haar-like features is to train several classifiers using rotated examples. The disadvantage of this is the added time and effort to train the set of classifiers, but this is compensated by the flexibility and by the control over separate parts of this process. Jones and Viola (2003) suggested that for faces only a few extra classifiers for different angles would be necessary.
In order to deal with the problem of multi-view faces, Lienhart, Liang and Kuranov (2003) implemented the idea of a detector tree (rather than a single cascade classifier). They validated the results with the XM2FDB video database and concluded that a tree structure for the classifier improves both accuracy and performance. Several parallel cascades of classifiers would also work well for objects that have very different patterns when assessed from different viewpoints.
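The parallel-cascade alternative can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the detector tree of Lienhart, Liang and Kuranov (2003): each cascade is represented as a list of hypothetical stage functions, the sub-window is represented as a dictionary of precomputed values, and the view labels and thresholds are invented for the example.

```python
def run_cascade(stages, window):
    """Return True only if every stage accepts the sub-window.

    all() stops at the first stage that rejects, so most sub-windows
    are discarded after only a few stage evaluations.
    """
    return all(stage(window) for stage in stages)


def classify_multi_view(cascades, window):
    """Run one cascade per trained view and return the labels that accept."""
    return [label for label, stages in cascades.items()
            if run_cascade(stages, window)]


# Hypothetical stages: each tests one precomputed value of the sub-window.
cascades = {
    "upright": [lambda w: w["f1"] > 0.2, lambda w: w["f2"] > 0.5],
    "rotated": [lambda w: w["f1"] < 0.1, lambda w: w["f3"] > 0.4],
}

window = {"f1": 0.3, "f2": 0.7, "f3": 0.9}
print(classify_multi_view(cascades, window))  # ['upright']
```

The cost of this arrangement grows with the number of cascades, since every sub-window is offered to each of them, which is why the number of simultaneous classifiers is one of the factors to be tuned.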
Summary of Multiresolution Analysis
Various approaches to the analysis of sub-windows in images have been described. The performance of detection and recognition systems is related to the total number of sub-windows surveyed. This number depends on various factors that have to be tuned for best performance or best accuracy. The factors are:
• Kernel size (base sub-window)
• Translation factor
• Scaling factor
• Number of simultaneous classifiers
• Number of SATs