2. BACKGROUND
2.1 MCT Based Face Detection
This section includes the information required to understand the testing phase of the face detection algorithm for which a CUDA implementation is developed.
2.1.1 MCT based weak classifiers
MCT [1] is a transform that generates a binary number from the pixel values in a given neighbourhood N . The binary value corresponding to a pixel location is obtained by comparing all pixel values on a neighbourhood centered on that pixel with the mean of all pixel values in the neighbourhood in row-major order. Let N(x) be a spatial neighbourhood centered at the pixel location x and I(x) be the mean of the pixel intensity values on this neighbourhood. If is the concatenation operator, then the MCT can be defined as follows:
Γ(x) = y∈N
ζ(I(x), I(y)) (2.1)
where the comparison function ζ(I(x), I(y)) is defined as
ζ(I(x), I(y)) =
1, I(x) < I(y)
0, otherwise (2.2)
The definition of MCT does not put any restrictions on the size of the neighbourhood, but for the purposes of object detection, the size of the neighbourhood is chosen as
3 × 3, because using larger sizes lead to memory inefficient weak classifiers as one can see later in this section. When a3×3 neighbourhood is considered, the resulting MCT value becomes a 9 bit string that can take values in the range [0, 510] when converted to a decimal value. These values correspond to local structure kernels, some of which are shown in Figure 2.1.
Figure 2.1: Some of the possible local structure kernels in a3 × 3 neighbourhood [1].
These kernels can code information about the structures like edges, ridges, etc. in the image in binary form. An example of computing the MCT value from a 3 × 3 neighbourhood is shown in Figure 2.2.
Ϯϵ ϭϵ ϰϲ Ϯϴ Ϯϱ ϱ ϭϱ ϵ ϳϲ ϭ Ϭ ϭ Ϭ Ϭ Ϭ Ϭ ϭ DĞĂŶсϮϴ ;ϭϬϭϬϬϬϬϬϭͿϮсϯϮϭ Ϭ
Figure 2.2: MCT Computation example in a3 × 3 neighbourhood.
A MCT based weak classifier hx consists of a coordinate pair x = (x, y) relative to the origin of the scanning window and a 511 element lookup table. The coordinates specify the center location of the neighbourhood that needs to be used when calculating
the MCT value. The detector uses a base resolution of24 × 24 and since MCT values cannot be calculated at the edges and corners, total number of different values x can take is limited to22 × 22 = 484. The lookup table contains a weight for each kernel index γ such that0 ≤ γ ≤ 510. The output of a weak classifier on a window in the input image is determined by defining a3 × 3 neighbourhood centered at the location
x, calculating the MCT value which is equal to the γ index, and getting the value from
the lookup table that corresponds to γ.
An important advantage of MCT based weak classifiers is their robustness to illumination variations, which are commonly simulated by smooth, linear monotonic transformations [1]. Since the MCT is defined in a small neighbourhood, the order of its bits does not change after the application of such a transform and makes the detector more robust to illumination variations without bringing additional computational overhead. This property can be seen clearly from Figure 2.3 that shows the fact that results of applying MCT to two face images having very different illumination levels are nearly identical. Also, MCT based features are very easy to compute, because once MCT is applied to the grayscale input image, computing the value a weak classifier takes on a window becomes a matter of doing a single memory lookup.
Figure 2.3: Illumination invariance of the MCT.
2.1.2 Classifier cascade
The detector has a cascaded structure as in [3], but has fewer stages because of the superior distinguishing ability of MCT based weak classifiers and a different learning algorithm. Each stage contains a strong classifier containing a number of weak classifiers, each of which having a differentx position. Maximum number of allowed
different positions in each stage is given as input to the training algorithm as explained in [1]. Hj(Γ), the strong classifier of the j. stage, is defined as follows:
Hj(Γ) = x∈W
hx(Γ(x)) (2.3)
where W ⊆ W is the set of unique positions used by the weak classifiers hx in the strong classifier. A window W passes the j. stage if the sum of the responses of the weak classifiers on that window is less than or equal to the stage threshold Tj. Formally, stage j allows the window to pass if Hj(Γ) ≤ Tj. Stage thresholds Tj are tuned after the strong classifiers are trained using a set of samples not used during training. A window is classified as the searched object by the algorithm when it passes all stages.
The cascade used in this work has 5 stages utilizing 10, 20, 40, 80, and 217 positions, respectively, as shown in Figure 2.4. The maximum number of possible positions allowed for the stages were 10, 20, 40, 80 and 484 for the stages in question. It should be noted that even though all 484 positions were allowed to be used in the last stage, only 217 of them are utilized by the training algorithm. During training, all stages were trained on the same positive set consisting of 8400 manually collected and scaled images of frontal faces with limited variance in pose, expression and illumination. 40000 negative samples were used for training each stage. These samples were obtained by using the partial cascade trained after each stage on 10000 large images collected from the internet that do not contain any faces and selecting the windows that are misclassified. This training method, which is adopted from [3], is the only difference between the implementation used in this work and the one explained in [1]. Since the focus on this thesis is on the testing algorithm and not the training stage, things explained this thesis can be used with a cascade trained using the method in [1] as well.
5HMHFW 5HMHFW 5HMHFW 5HMHFW
)DFH 5HMHFW 7HVW:LQGRZ 6WDJH&ODVVLILHU&DVFDGH
Figure 2.4: The 5-stage cascade used in this work.
2.1.3 Scanning process
The scanning process starts with the construction of an image pyramid from the input image. Then MCT is applied to every image in the pyramid to obtain the transformed image pyramid. The whole image pyramid is scanned with a sliding window and at each location the classifier cascade is used to find out whether the window contains the searched object or not. Then multiple detections whose area of intersection is close to the total area of the detections are grouped and merged into a single detection. As in any other sliding window approach, there are scanning parameters like horizontal and vertical step sizes (Δx and Δy), the ratio between consecutive scales, or the scale factor (s) and the starting scale. In this work, we chooseΔx = Δy = 1, s = 1.15 and use a starting scale of 1, which results in a computationally demanding, fine-grained scanning process and makes it possible to detect faces as small as24 × 24 accurately.