Kernel methods map the original samples into an implicit high-dimensional space, where linear classification is then applied. As a result, kernel classifiers typically achieve better accuracy than linear classifiers. In a boosted classifier, the overall accuracy is mainly decided by the weak classifiers [93]. If kernel weak classifiers are used instead of conventional weak classifiers, the accuracy of the overall boosted classifier can therefore be expected to improve. In this section, we show that the basis mapping approximates the use of additive kernel methods as weak classifiers in the boosting algorithm.
Figure 6.3: Sample distribution before and after Bhattacharyya Mapping (BCM).
In general, linear classification in the implicit space can be carried out in the original space through the kernel trick. Given two $m$-dimensional samples $x$, $z$ in the original space and a kernel function $K(x, z)$ that satisfies Mercer's condition, there exists a mapping $\psi$ such that
$$K(x, z) = \psi(x) \cdot \psi(z), \tag{6.12}$$
where $\cdot$ denotes the dot product of two vectors.
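The identity in Eq. 6.12 can be checked numerically for any kernel whose mapping is known in closed form. The sketch below uses the degree-2 homogeneous polynomial kernel, which is an illustrative choice and not one of the kernels used in this chapter; the function names are hypothetical.

```python
def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated directly in the original space."""
    dot = sum(a * b for a, b in zip(x, z))
    return dot * dot


def poly2_feature_map(x):
    """Explicit psi for the kernel above: all m^2 pairwise products x_i * x_j."""
    return [a * b for a in x for b in x]


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

# Left-hand side of Eq. 6.12: kernel evaluated in the original 3-dim space.
k_direct = poly2_kernel(x, z)

# Right-hand side of Eq. 6.12: dot product in the 9-dim mapped space.
k_mapped = sum(p * q for p, q in zip(poly2_feature_map(x), poly2_feature_map(z)))
```

Both values agree up to floating-point rounding, although the direct evaluation never builds the 9-dimensional vectors. For kernels whose mapping has no convenient closed form, only the left-hand side is available in practice.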
In boosting training, learning a weak classifier $h$ can be viewed as finding an optimal classification hyperplane in the original $m$-dimensional space from the training samples. If a kernel method is applied when learning the weak classifier, let $w^*$ denote the optimal classification hyperplane in the implicit space. For a sample $x = [x^{(1)}, \ldots, x^{(m)}]$ in the original $m$-dimensional space, the optimal weak classifier $h^*(x)$ is the dot product of $\psi(x)$ and $w^*$:
$$h^*(x) = w^* \cdot \psi(x). \tag{6.13}$$
In the special case where some vector $x^* \in \mathbb{R}^m$ satisfies $\psi(x^*) = w^*$, Eq. 6.13 can be rewritten as
$$h^*(x) = w^* \cdot \psi(x) = \psi(x^*) \cdot \psi(x) = K(x, x^*). \tag{6.14}$$
Eq. 6.14 is convenient to use as a weak classifier, because it only requires evaluating the kernel in the original space. The remaining problem is to find such an $x^*$.
Unfortunately, in most cases $\psi$ is not invertible, or $\psi$ itself cannot be described explicitly, so finding such an $x^*$ directly is generally not possible. Within the boosting framework, however, we can approximate $x^*$ by one of the current training samples, $x_0$. After several training samples are evaluated and the best one is selected to approximate $x^*$, the optimal $h^*$ can be approximated by the $h$ in Eq. 6.15:
$$h^*(x) \approx h(x) = w_0 \cdot \psi(x) = K(x, x_0), \tag{6.15}$$
where $w_0 = \psi(x_0)$. This implies that, given an appropriate reference sample $x_0$, linear classification in the implicit space can be approximated by the kernel function above.
We now return to the basis mapping. In all three proposed basis mappings, each dimension of the feature vector $x$ is mapped independently of the others, so the basis mapping $\Phi(x)$ can be written as
$$\Phi(x) = \phi(x, x_b) = [\phi(x^{(1)}, x_b^{(1)}), \ldots, \phi(x^{(m)}, x_b^{(m)})]. \tag{6.16}$$
Notice that all of these basis mappings correspond to additive kernels of the form in Eq. 6.17: HIM corresponds to the histogram intersection kernel, CHM to the chi-square kernel, and BCM to the Bhattacharyya kernel.
$$K(x, x_0) = \sum_{i=1}^{m} k(x^{(i)}, x_0^{(i)}). \tag{6.17}$$
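The additive structure of Eq. 6.17 can be sketched in a few lines of Python. The per-dimension kernels below are the commonly used forms of the three kernels (min, 2ab/(a+b), and sqrt(ab)); this is an assumption, since the exact definitions in Eqs. 6.6, 6.8, and 6.10 are not reproduced in this section and may differ in scaling. All function names are hypothetical, and the inputs are assumed to be non-negative histogram values.

```python
import math


# Per-dimension kernels k(a, b) in their commonly used forms (assumed here).
def k_intersection(a, b):
    return min(a, b)


def k_chi_square(a, b):
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)


def k_bhattacharyya(a, b):
    return math.sqrt(a * b)


def basis_map(k, x, xb):
    """Per-dimension values [k(x_1, xb_1), ..., k(x_m, xb_m)]; same length as x."""
    return [k(a, b) for a, b in zip(x, xb)]


def additive_kernel(k, x, x0):
    """K(x, x0) = sum_i k(x_i, x0_i), following the form of Eq. 6.17."""
    return sum(basis_map(k, x, x0))


kernels = (k_intersection, k_chi_square, k_bhattacharyya)
x = [0.2, 0.3, 0.5]   # toy L1-normalized histogram
x0 = [0.5, 0.3, 0.2]  # toy reference sample

values = {k.__name__: additive_kernel(k, x, x0) for k in kernels}
self_similarity = {k.__name__: additive_kernel(k, x, x) for k in kernels}
```

With these forms, each kernel evaluates to 1 when an L1-normalized histogram is compared with itself, and the list returned by `basis_map` has the same length as the input, so the mapping does not change the feature dimension.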
For these basis mappings, the $\phi$ in Eq. 6.16 is exactly the $k$ in Eq. 6.17, so the weak classifier in Eq. 6.15 can be written as
$$h(x) = K(x, x_0) = \sum_{i=1}^{m} \phi(x^{(i)}, x_0^{(i)}). \tag{6.18}$$
As mentioned above, in the boosting framework we can use $x_0$ to approximate $x^*$. This is done by evaluating different hard samples in the current training stage and keeping the best one, $x_b$. Eq. 6.18 then becomes Eq. 6.19:
$$h(x) = \sum_{i=1}^{m} \phi(x^{(i)}, x_0^{(i)}) = \sum_{i=1}^{m} \phi(x^{(i)}, x_b^{(i)}). \tag{6.19}$$
Parameters
  $N$       number of training samples
  $N_b$     number of basis samples per iteration
  $N_f$     number of features per iteration
  $T$       maximum number of weak classifiers
  $\theta$  threshold of false positive rate
Input: Training set $\{(x_i, y_i)\}$, $x_i \in \mathbb{R}^m$, $y_i \in \{0, 1\}$
1. Initialization
   $w_i = 1/N$, $H(x_i) = 0$, $p(x_i) = 0.5$
2. Repeat for $t = 1, 2, \ldots, T$
   2.1 Compute $z_i$ and $w_i$
   2.2 For m = 1 to $N_f$
         For n = 1 to $N_b$
           2.2.1 Randomly select a basis sample $x_b$ with top 20% weights
           2.2.2 Calculate the original feature vectors $x_i$
           2.2.3 Calculate the mapped vectors $\Phi(x_i) = \phi(x_i, x_b)$
           2.2.4 Fit the function $h$ (Eq. 6.20) by weighted least-squares regression from $\Phi(x_i)$ to $z_i$
       2.2.5 Select the best feature and basis sample with minimum regression error
       2.2.6 Calculate the false positive rate. If it is lower than $\theta$, break
   2.3 Update $H(x_i)$ and $p(x_i)$
3. Output classifier $H(x) = \mathrm{sign}\left[\sum_{j=1}^{T} h_j(x)\right]$
Figure 6.4: LogitBoost training with basis mapping.
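Steps 2.1 and 2.2.1 of Figure 6.4 can be sketched as follows. The figure does not spell out how $z_i$, $w_i$, and $p(x_i)$ are computed, so the sketch assumes the standard LogitBoost formulas; the function names, the `eps` guard, and the toy values are hypothetical.

```python
import math
import random


def update_probability(H):
    """p(x) = 1 / (1 + exp(-2 H(x))): the usual LogitBoost link (assumed here)."""
    return [1.0 / (1.0 + math.exp(-2.0 * h)) for h in H]


def working_response(y, p, eps=1e-6):
    """Step 2.1 with the standard LogitBoost formulas (assumed here):
    w_i = p_i (1 - p_i) and z_i = (y_i - p_i) / w_i, for labels y_i in {0, 1}."""
    w = [max(pi * (1.0 - pi), eps) for pi in p]  # eps guards against division by zero
    z = [(yi - pi) / wi for yi, pi, wi in zip(y, p, w)]
    return z, w


def basis_pool(w, fraction=0.2):
    """Step 2.2.1: indices of the samples whose weights are in the top `fraction`."""
    n_top = max(1, math.ceil(fraction * len(w)))
    order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    return order[:n_top]


# Toy state: labels and current strong-classifier scores H(x_i) (made-up values).
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
H = [0.0, 2.0, -1.5, 0.1, -0.2, 1.0, -3.0, 0.05, 0.6, -0.7]

p = update_probability(H)
z, w = working_response(y, p)
pool = basis_pool(w)            # the two samples whose p is closest to 0.5
rng = random.Random(0)          # seeded only to make the sketch repeatable
basis_index = rng.choice(pool)  # index of the basis sample x_b for this trial
```

Under these formulas the largest weights belong to the samples the current classifier is least certain about, which matches the description of basis samples as hard samples of the current training stage.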
The weak classifier fitted in step 2.2.4 of Figure 6.4 is
$$h(x) = \sum_{i=1}^{m} a^{(i)} \phi(x^{(i)}, x_b^{(i)}) + b. \tag{6.20}$$
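A minimal sketch of fitting Eq. 6.20 by weighted least squares is given below, in plain Python. It assumes the chi-square mapping in its common form 2ab/(a+b), solves the weighted normal equations with a small Gaussian elimination routine, and uses hypothetical names and toy data throughout; it is an illustration of the regression step, not the implementation used for the experiments.

```python
def chi_square_map(x, xb):
    """Phi(x) for the chi-square mapping, assuming phi(a, b) = 2ab / (a + b)."""
    return [0.0 if a + b == 0 else 2.0 * a * b / (a + b) for a, b in zip(x, xb)]


def solve(A, rhs):
    """Solve A u = rhs by Gaussian elimination with partial pivoting.
    Singular systems are not handled; a real implementation would regularize."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - tail) / M[r][r]
    return u


def fit_weak_classifier(X, z, w, xb):
    """Weighted least squares for h(x) = sum_i a_i * phi(x_i, xb_i) + b.
    Returns (a, b, weighted squared error)."""
    rows = [chi_square_map(x, xb) + [1.0] for x in X]  # trailing 1.0 carries b
    d = len(rows[0])
    gram = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows)) for k in range(d)]
            for j in range(d)]
    rhs = [sum(wi * r[j] * zi for wi, r, zi in zip(w, rows, z)) for j in range(d)]
    coef = solve(gram, rhs)
    err = sum(wi * (zi - sum(c * v for c, v in zip(coef, r))) ** 2
              for wi, r, zi in zip(w, rows, z))
    return coef[:-1], coef[-1], err


X = [[0.1, 0.2], [0.4, 0.9], [0.9, 0.3], [0.5, 0.5], [0.2, 0.8]]
xb = [0.25, 0.75]
w = [0.25] * len(X)
# Targets built from known coefficients so the fit can be checked.
z = [2.0 * phi[0] - 3.0 * phi[1] + 0.5 for phi in (chi_square_map(x, xb) for x in X)]
a, b, err = fit_weak_classifier(X, z, w, xb)
```

Because the targets are generated from known coefficients, the fit should recover them up to rounding. In training, the returned error is the quantity that step 2.2.5 compares across features and basis samples.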
According to Eq. 6.16, Eq. 6.20 is a linear classifier on the mapped space $\Phi(x)$ around the basis sample, so the kernel classification in Eq. 6.14 is finally transformed into a linear classification. We therefore conclude that the weak classifier based on the basis mapping $\Phi$ approximates additive kernel classification in the original space, which has considerably better discriminative ability than a simple decision stump or a linear weak classifier. Because the performance of a boosted classifier depends mainly on its weak classifiers, the proposed basis mapping contributes to the overall accuracy of the boosted classifier. In addition, the basis mappings constructed by Eqs. 6.6, 6.8, and 6.10 are computationally efficient and do not increase the feature dimension. Therefore, the computational cost does not increase much compared with linear weak classifiers.
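As a consistency check on this argument (a worked special case in the notation above, not an additional result), setting every coefficient $a^{(i)}$ to 1 and $b$ to 0 in Eq. 6.20 gives back Eq. 6.19, which is the additive kernel of Eq. 6.17 evaluated at the basis sample:

```latex
h(x)\Big|_{a^{(i)} = 1,\; b = 0}
  = \sum_{i=1}^{m} \phi\bigl(x^{(i)}, x_b^{(i)}\bigr)
  = K(x, x_b)
```

The plain kernel classifier is thus one point in the family of functions searched by the weighted least-squares fit, and the fitted coefficients allow each dimension to be reweighted.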