2.9.1 Proof of Theorem 1
SinceπΈ[βπ(π)ππ (π π(πΏ))] =πΈ[πΈ[βπ(π)ππ (π π(πΏ))β£πΏ=π]], we can minimizeπΈ[βπ(π)ππ (π π(πΏ))]
by minimizing πΈ[βπ(π)ππ (π π(πΏ))β£πΏ = π] for every π. Note that πΈ[βπ(π)ππ (π π(πΏ))β£πΏ =
π] = π(π)(1βπ)ππ (π(π)) + (1 βπ(π))πππ (βπ(π)). Because ππ is a nonincreasing function,
the minimizer πβ
π should satisfy that ππβ β©Ύ 0 if π(π)(1βπ) > (1βπ(π))π, ππβ β©½ 0 other-
wise. Note that π(π)(1βπ) > (1βπ(π))π is equivalent to π(π) > π. Hence, it is suf- ο¬cient to show that π = 0 is not a minimizer. We can assume π(π) > π without loss of generality. For π = 0, πΈ[βπ(π)ππ (0)β£πΏ = π] = π(π)(1βπ)ππ (0) + (1βπ(π))πππ (0), and πΈ[βπ(π)ππ (1)β£πΏ=π] =π(π)(1βπ)ππ (1) + (1βπ(π))πππ (β1). Hence πΈ[βπ(π)ππ (0)β£πΏ=π]> πΈ[βπ(π)ππ (1)β£πΏ=π] becauseππ (0)> ππ (1) andππ (0) =ππ (β1). Thusπ = 0 is not a minimizer
in this case. Forπ <0, ππ(ππ)πΈ[βπ(π)ππ (π π(πΏ))β£πΏ =π]β£π(π)=0 = ππ(ππ)[π(π)(1βπ)ππ (π(π)) +
(1βπ(π))πππ (βπ(π))]β£π(π)=0=π(π)(1βπ)πβ²
π (0) + (1βπ(π))ππβ²π (0) = (π(π)βπ)πβ²π (0)<0 be-
causeπβ²
π (0)<0. Thusπ = 0 is not a minimizer. Hence,πππβ(π) has the same sign asπ(π)βπ.
2.9.2 Proof of Theorem 2
Deο¬ne π΄(π) = πΈ[βπ(π)ππ (π π(πΏ))β£πΏ = π]. Observe that π΄(π) = π(π)(1βπ) min(π‘,log(1 + πβπ(π))) + (1βπ(π))πmin(π‘,log(1 +ππ(π))), where π‘= log(1 +πβπ ). We consider three cases, π β©½π β©½βπ ,π < π , andπ >βπ .
First, when π β©½π β©½βπ ,π΄β²(π) = π
ππ(π)[π(π)(1βπ) log(1 +πβπ) + (1βπ(π))πlog(1 +ππ)] = 1
1+ππ[βπ(π)(1βπ) + (1βπ(π))πππ], and π΄β²β²(π) = (π(π)(1βπ) + (1βπ(π))π)ππ/(1 +ππ)2. Note thatπ΄β²β²(π)>0 for anyπ β[π ,βπ ], andπ΄β²( Λπ) = 0 when Λπ = logπ(1(1ββπ)ππ((ππ))) = logπ(π(π), π). Hence, Λπ is the minimizer of π΄(π) for π β [π ,βπ ]. Note that π΄( Λπ) = (π(π)(1βπ) + (1β π(π))π) log(π(π)(1βπ) + (1βπ(π))π)βπ(π)(1βπ) log(π(π)(1βπ))β(1βπ(π))πlog((1β π(π))π).
Second, when π < π , note that π΄(π) = π(π)(1βπ)π‘+ (1βπ(π))πlog(1 +ππ(π)) and it
is an increasing function in π. Thus, the minimum of π΄(π) in this case is limπβββπ΄(π) =
π(π)(1βπ)π‘.
a decreasing function in π. Likewise, the minimum of π΄(π) in this case is limπββπ΄(π) = (1βπ(π))ππ‘.
Hence, Λπ is the minimizer of π΄(π) if π΄( Λπ) < limπβββπ΄(π) = π(π)(1βπ)π‘ and π΄( Λπ) <
limπββπ΄(π) = (1βπ(π))ππ‘. If π΄( Λπ) > limπβββπ΄(π) = π(π)(1βπ)π‘ and limπββπ΄(π) = (1βπ(π))ππ‘ > limπβββπ΄(π) =π(π)(1βπ)π‘, π = ββ is the minimizer of π΄(π). Similarly,
π =β is the minimizer of π΄(π) if π΄( Λπ) >limπββπ΄(π) = π(π)(1βπ)π‘ and limπββπ΄(π) =
(1βπ(π))ππ‘ < limπβββπ΄(π) = π(π)(1βπ)π‘. Finally, ifπ΄( Λπ) > limπββπ΄(π) = π(π)(1β
π)π‘ = limπβββπ΄(π) = π(π)(1βπ)π‘, then π = ββ,β is the minimizer of π΄(π). The de-
sired results can follow with that π»1(π, π(π)) = π‘π΄( Λπ)/limπβββπ΄(π) and π»2(π, π(π)) = π‘π΄( Λπ)/limπββπ΄(π).
2.9.3 Proof of Lemma 1
Let π(βπ) = (π§
1,β β β , π§πβ1, ππ(βπ)(ππ), π§π+1,β β β , π§π)π, and βΛπβ(π§, π) = βπ§π + log(1 +ππ). Since ββΛπββπ(π§,π) =βπ§+ 1/(1 +πβπ) andββ2Λπβ(π§,π)
βπ2 =ππ/(1 +ππ)2 β©Ύ0, for any ο¬xedπ§, the minimizer of βΛπβ(π§, π) is π which satisο¬es π§= 1/(1 +πβπ). Therefore, using π(βπ)
π (ππ) = 1/(1 +πβπ (βπ) π (ππ)), we have βΛπβ(π(βπ) π (ππ), π (βπ) π (ππ))β©½βΛπβ(π (βπ) π (ππ), ππ(ππ)). This implies βΛπ(ππ(βπ)(ππ), ππ(βπ)(ππ))β©½βΛπ(ππ(βπ)(ππ), ππ(ππ)) (2.35)
sinceβΛπ(π§π, π(ππ)) = min{π‘,βΛπβ(π§π, π(ππ))}. Hence, for anyπ, we have
πΌπ(π,π(βπ)) =βΛπ(π(βπ) π (ππ), π(ππ))β β πβ=πΛπ(π§π, π(ππ)) +πππ½(π) β©ΎβΛπ(ππ(βπ)(ππ), π(βπ)(ππ))β β πβ=πΛπ(π§π, π(ππ)) +πππ½(π) β©ΎβΛπ(ππ(βπ)(ππ), π(βπ)(ππ))β β πβ=πΛπ(π§π, π (βπ) π (ππ)) +πππ½(π (βπ) π ) (2.36)
Chapter 3
Bounded Constraint Machine
3.1
Introduction
The Support Vector Machine (SVM) has been very popular due to its success in many appli- cations (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000). It was originally proposed using the idea of maximal separation. It is well known that the SVM can be ο¬t in πππ π +ππππππ‘π¦
framework using the hinge loss. In this regularization framework,πππ π measures goodness of ο¬t on the training data, andππππππ‘π¦ reο¬ects smoothness of the resulting model. Viewing the SVM in the regularization framework with the hinge loss helps us understand how the SVM uses the training data to build a classiο¬er.
Despite its success, the SVM has some drawbacks. One known drawback is that the SVM classiο¬er only depends on the set of SVs, which include training data points that are correctly classiο¬ed but relatively close to the boundary as well as those misclassiο¬ed training points. As a result, extreme outliers can have relatively big impact on the resulting classiο¬er. In the liter- ature, there have been some attempts to modify the SVM to gain robustness to outliers (Shen et al., 2003; Liu and Shen, 2006; Collobert et al., 2006; Wu and Liu, 2007). The idea is to trun- cate the unbounded hinge loss function so that the eο¬ect of extreme outliers can be bounded. The corresponding optimization, however, involves challenging nonconvex minimization. An- other drawback is that the standard SVM was originally designed for binary classiο¬cation. Its extension to multicategory classiο¬cation is nontrivial. Previous attempts include Vapnik (1998); Weston and Watkins (1999); Crammer and Singer (2001); Lee et al. (2004). Despite these extensions seem natural and reasonable, not all of them are Fisher consistent (Liu, 2007).
Our motivation here is to modify the criterion of the SVM. Instead of the maximum separa- tion criterion whose solution only depends on a subset of the training data, we propose to use an alternative criterion so that all data points can inο¬uence the solution. One main advantage of using all data points for the classiο¬er is that the resulting classiο¬er may depend less heavily on a smaller subset and consequently can be more robust to outliers. More speciο¬cally, we propose the Bounded Constraint Machine (BCM), which minimizes the sum of the signed distance to the classiο¬cation boundary subject to some constraints on the solution. Our focus in this chapter is on binary classiο¬cation. However, the BCM can be extended for multicategory classiο¬cation directly with Fisher consistency.
To further study the relationship between the SVM and the BCM, we investigate another method, the Balancing Support Vector Machine (BSVM). The BSVM can be viewed as a mod- iο¬cation of the SVM with all training points inο¬uencing the resulting classiο¬er. The BSVM is characterized using the parameter π£ with π£ = 0 corresponding to the SVM and π£ = β corre- sponding to the BCM. As a result, the BSVM helps to build a continuous path from the SVM to the BCM by changing the value ofπ£. Along with the eο¬ect ofπ£, the properties of the BSVM including Fisher consistency and asymptotic behaviors of the coeο¬cients are investigated.
In practice, the performance of these methods may vary from problem to problem. Therefore, it may be desirable to treat π£ data dependent. To improve the computational eο¬ciency, we establish the entire solution path with respect to the value ofπ£, so that we can get the solution of the BSVM for every value ofπ£ eο¬ciently.
The rest of this chapter is organized as follows. Section 3.2 brieο¬y reviews the standard SVM and proposes the BCM. In Section 3.3, we investigate the BSVM and describe its behavior using the Lagrange dual problem. The eο¬ect ofπ£is explored and we show how the BSVM builds connection from the SVM to the BCM. Section 3.4 shows Fisher consistency of the BSVM and the BCM, as well as some asymptotic properties. Section 3.5 develops the regularized solution path with respect toπ£. Numerical results are reported in Section 3.6 and Section 3.7 gives some discussion. The proofs of our theorems are included in Section 3.8.