1.6 Regularized risk estimation and optimizations
1.6.1 Regularized risk minimization
Besides the quadratic regularizer in Eq. (1.23), many other measures of model complexity exist. For example, the $L_1$ norm $\|w\|_1 := \sum_i |w_i|$ encourages sparse solutions in which many $w_i$ are zero, meaning that the corresponding features are unimportant (Tibshirani, 1996; Candes & Tao, 2005). Entropy or relative entropy is also commonly used when $w$ corresponds to a distribution on the features, and this prior encourages a uniform distribution. It is noteworthy that the quadratic ($L_2$) regularizer and entropy are strongly convex and smooth, while the $L_1$ norm is convex but neither strongly convex nor differentiable.

Table 1.2: Loss for multi-class classification
Name            Definition
“0-1” loss      $\delta(\operatorname{argmax}_{y \in \mathcal{Y}} \langle \phi(x_i, y), w \rangle \neq y_i)$
hinge loss      $\max\{0, 1 - \min_{y \neq y_i} \langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle\}$
                $= \max_{y \in \mathcal{Y}} \{\langle \phi(x_i, y) - \phi(x_i, y_i), w \rangle + \delta(y \neq y_i)\}$
logistic loss   $-\langle \phi(x_i, y_i), w \rangle + \log \sum_{\bar{y}_i} \exp(\langle \phi(x_i, \bar{y}_i), w \rangle)$

Table 1.3: Loss for binary classification
Name              Definition
“0-1” loss        $\delta(\operatorname{sign}(\langle \phi(x_i), w \rangle) \neq y_i)$
hinge loss        $\max\{0, 1 - y_i \langle \phi(x_i), w \rangle\}$
logistic loss     $\log(1 + \exp(-y_i \langle \phi(x_i), w \rangle))$
exponential loss  $\exp(-y_i \langle \phi(x_i), w \rangle)$
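The regularizers mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the factor 1/2 in the quadratic regularizer is a guess at the convention of Eq. (1.23), the function names are hypothetical, and relative entropy is taken with respect to the uniform distribution, where it is smallest.

```python
import numpy as np

def l1_norm(w):
    """L1 norm: sum of absolute values (convex, not differentiable at 0)."""
    return float(np.sum(np.abs(w)))

def quadratic_regularizer(w):
    """0.5 * ||w||_2^2; the 1/2 factor is an assumed convention."""
    w = np.asarray(w, dtype=float)
    return float(0.5 * np.dot(w, w))

def relative_entropy_to_uniform(w):
    """KL(w || uniform) for w on the probability simplex, with 0 log 0 := 0."""
    w = np.asarray(w, dtype=float)
    n = w.size
    nz = w > 0  # skip zero entries, which contribute nothing
    return float(np.sum(w[nz] * np.log(w[nz] * n)))
```

For a distribution over four features, the relative entropy is zero at the uniform distribution and grows to log 4 when all mass sits on one feature, which is the sense in which this regularizer favors uniform weights.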
On the other hand, the empirical risk also admits a wide range of choices. In the simplest case of the statistical query model (Kearns, 1998), it decomposes additively into the losses on individual training examples, $l(x_i, y_i; w)$. Examples include
• Logistic loss as in Eq. (1.24), named in analogy to logistic regression,
• 0-1 loss as in Eq. (1.25), which simply checks whether the output label is correct,
• Hinge loss as in Eq. (1.27), which looks at all the incorrect labels and encourages their discriminant values to be less than the correct label's value by at least 1 (the margin).
We summarize these losses in Table 1.2.
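As a rough sketch of the three multi-class losses, the functions below take a vector of discriminant values, one per label, standing in for $\langle \phi(x_i, y), w \rangle$. The function names and the integer label encoding are assumptions made for illustration, not part of the original text.

```python
import numpy as np

def zero_one_loss(scores, y_true):
    """1 if the highest-scoring label differs from the correct one, else 0."""
    return float(np.argmax(scores) != y_true)

def multiclass_hinge(scores, y_true):
    """max_y { s_y - s_{y_i} + [y != y_i] }; the y = y_i term equals 0."""
    scores = np.asarray(scores, dtype=float)
    augmented = scores - scores[y_true] + 1.0
    augmented[y_true] = 0.0  # no margin requirement against the correct label
    return float(np.max(augmented))

def multiclass_logistic(scores, y_true):
    """-s_{y_i} + log sum_y exp(s_y), computed with a max shift for stability."""
    scores = np.asarray(scores, dtype=float)
    shift = np.max(scores)
    return float(-scores[y_true] + shift + np.log(np.sum(np.exp(scores - shift))))
```

With scores `[2.0, 0.5, -1.0]` and correct label 0, the prediction is right and the margin exceeds 1, so both the 0-1 loss and the hinge loss are zero; with correct label 1, the hinge loss is 2.5 because label 0 outscores it by 1.5.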
When specialized to binary classification with $y \in \{-1, 1\}$, the above definitions can be simplified by letting $\phi(x_i, y) := y\phi(x_i)/2$, and are summarized in Table 1.3. All four losses in Table 1.3 are plotted in Figure 1.6. Exponential loss is used in boosting (Hastie et al., 2009, Section 10.4). Hinge loss leads to maximum margin models, and the commonly used support vector machine (SVM) for binary classification is simply a combination of hinge loss and $L_2$ regularization. Notice that “0-1” loss is neither convex nor continuous. Hinge loss is convex and continuous but not differentiable at one point, namely where $y_i \langle \phi(x_i), w \rangle = 1$. Logistic loss and exponential loss are both smooth, strongly convex, and have Lipschitz continuous gradient on any compact subset of $\mathbb{R}$.

Figure 1.6: 0-1 loss, hinge loss, logistic loss and exponential loss for binary classification. (Horizontal axis: $y_i \langle \phi(x_i), w \rangle$, from $-2$ to $2$; vertical axis: loss, from 0 to 3.)
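The binary losses of Table 1.3 depend on the data only through the margin $y_i \langle \phi(x_i), w \rangle$, so they can be sketched as functions of that single number. The second function checks the reduction $\phi(x_i, y) := y\phi(x_i)/2$ by evaluating the multi-class hinge and logistic forms on the two labels. Function names are hypothetical, and the 0-1 loss at margin exactly 0 is counted as an error, since the sign is then 0 and differs from either label.

```python
import numpy as np

def binary_losses(margin):
    """Losses of Table 1.3 as functions of margin = y * <phi(x), w>."""
    m = np.asarray(margin, dtype=float)
    return {
        "zero_one": (m <= 0).astype(float),
        "hinge": np.maximum(0.0, 1.0 - m),
        "logistic": np.logaddexp(0.0, -m),  # log(1 + exp(-m)), overflow-safe
        "exponential": np.exp(-m),
    }

def multiclass_forms(f, y):
    """Multi-class hinge and logistic loss with scores y' * f / 2, y' in {-1, +1}."""
    labels = np.array([-1.0, 1.0])
    scores = labels * f / 2.0
    s_true = y * f / 2.0
    wrong = labels != y
    hinge = max(0.0, 1.0 - np.min(s_true - scores[wrong]))
    logistic = -s_true + np.log(np.sum(np.exp(scores)))
    return float(hinge), float(logistic)
```

Evaluating `binary_losses(np.linspace(-2, 2, 41))` gives the values over the same margin range as Figure 1.6, and `multiclass_forms(f, y)` agrees with the `"hinge"` and `"logistic"` entries of `binary_losses(y * f)`.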
Historically, although 0-1 loss is the real objective that one wants to minimize, its discontinuity and nonconvexity prompted people to use convex upper bounds as surrogates for easier optimization. The statistical consequences of these surrogates are still under research, e.g., (Bartlett et al., 2006).
More general loss functions can be defined for regression, ranking, novelty detection, etc. In the case of multi-class classification, the hinge loss defined above can be generalized in two ways, which can be summarized by
\[ \max_{y \in \mathcal{Y}} \{\rho(y, y_i) [\langle \phi(x_i, y) - \phi(x_i, y_i), w \rangle + \Delta(y, y_i)]\}. \]
Here $\Delta(y, y_i)$ gives a more refined comparison between the proposed label $y$ and the correct label $y_i$, characterizing to what extent the proposed label is wrong. This is much more informative than $\delta(y \neq y_i)$, which merely checks whether the labeling is correct. For instance, when the output space is a sequence, $\Delta(y, y_i)$ can be the Hamming distance. Path distances (Dekel et al., 2004) or H-loss (Cesa-Bianchi et al., 2006) can also be used when the output space has hierarchies or an ontology. $\rho(y, y_i)$ yields a similar effect of penalizing different mistakes differently, but in a different way from $\Delta(y, y_i)$. This can be best illustrated by two concrete examples: a) margin rescaling, where $\rho \equiv 1$ and the cost of a wrong label enters through $\Delta$, and b) slack rescaling, where $\Delta = 1$ for $y \neq y_i$ and the cost enters through $\rho$. With a cost of 2 for an incorrect label $y$, the two give:

Name              Example                                                                     Proposed by
margin rescaling  $\max\{0, 2 - \langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle\}$           (Taskar et al., 2004)
slack rescaling   $2 \max\{0, 1 - \langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle\}$         (Tsochantaridis et al., 2005)

Figure 1.7: Slack rescaling and margin rescaling. (Horizontal axis: $\langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle$; vertical axis: loss.)
Plotting these two rescalings in Figure 1.7, we can see that margin rescaling starts to penalize early but mildly: once $\langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle$ falls below 2, it starts to incur a unit loss for each unit gap. In contrast, slack rescaling starts to penalize only after $\langle \phi(x_i, y_i) - \phi(x_i, y), w \rangle$ falls below 1, but once it kicks in, the penalty is severe: two units for each unit gap.
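The comparison can be checked numerically with a small sketch of the general form $\max_y \{\rho(y, y_i)[\langle \phi(x_i, y) - \phi(x_i, y_i), w \rangle + \Delta(y, y_i)]\}$. The matrices `rho` and `delta` below are illustrative choices that reproduce the two examples for a two-label problem; they assume $\Delta(y_i, y_i) = 0$, so the correct label always contributes 0 to the maximum.

```python
import numpy as np

def rescaled_hinge(scores, y_true, rho, delta):
    """max_y rho[y, y_true] * (s_y - s_{y_true} + delta[y, y_true])."""
    scores = np.asarray(scores, dtype=float)
    values = rho[:, y_true] * (scores - scores[y_true] + delta[:, y_true])
    return float(np.max(values))

# Two labels; label 0 is correct. Off-diagonal entries carry the cost.
off_diag = 1.0 - np.eye(2)
margin_rho, margin_delta = np.ones((2, 2)), 2.0 * off_diag  # cost in Delta
slack_rho, slack_delta = 2.0 * np.ones((2, 2)), off_diag    # cost in rho

def both_losses(gap):
    """gap = <phi(x_i, y_i) - phi(x_i, y), w> for the single wrong label y."""
    scores = np.array([gap, 0.0])
    return (rescaled_hinge(scores, 0, margin_rho, margin_delta),
            rescaled_hinge(scores, 0, slack_rho, slack_delta))
```

At a gap of 1.5 only margin rescaling is active (loss 0.5 versus 0), while at a gap of 0.5 both are active and margin rescaling gives 1.5 against 1.0 for slack rescaling.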
When the output space $\mathcal{Y}$ is equipped with a graphical model, the sufficient statistics $\phi$ decompose, and furthermore Taskar et al. (2004) assumed the same factorization of $\Delta(y, y')$:
\[ \Delta(y, y') = \sum_{c \in \mathcal{C}} \Delta_c(y_c, y'_c). \]
This factorization is crucial for efficient maximum margin estimation for structured data with margin rescaling. We will revisit it in Section 5.4.1.
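A toy sketch may help show why the factorization matters. It makes a strong simplifying assumption that is not in the text: every clique is a single sequence position, so the scores are given by a hypothetical array `unary[t, k]` and $\Delta$ is the Hamming distance. Under that assumption the margin-rescaled maximization splits into independent per-position maximizations, which the code compares against brute-force enumeration. Models with pairwise cliques would need dynamic programming instead.

```python
import itertools
import numpy as np

def hamming(y, y_true):
    """Hamming distance: a sum of per-position terms [y_t != y_true_t]."""
    return sum(int(a != b) for a, b in zip(y, y_true))

def loss_augmented_max_bruteforce(unary, y_true):
    """max over all K^T labelings of score difference plus Hamming distance."""
    T, K = unary.shape
    best = -np.inf
    for y in itertools.product(range(K), repeat=T):
        diff = sum(unary[t, y[t]] - unary[t, y_true[t]] for t in range(T))
        best = max(best, diff + hamming(y, y_true))
    return float(best)

def loss_augmented_max_decomposed(unary, y_true):
    """Same maximum, computed position by position in O(T * K) time."""
    T, K = unary.shape
    total = 0.0
    for t in range(T):
        wrong = (np.arange(K) != y_true[t]).astype(float)
        total += np.max(unary[t] - unary[t, y_true[t]] + wrong)
    return float(total)
```

The brute-force version enumerates $K^T$ labelings, whereas the decomposed version touches each position once; the two agree because both the score and $\Delta$ are sums over the same positions.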
Finally, non-decomposable loss functions are also common, especially in applications like information retrieval. Examples include the F-score and the area under the ROC curve (Joachims, 2005). Optimization for these multivariate performance measures is nonconvex and harder, and we will introduce an approximate method for optimizing the F-score in Chapter 3.
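Putting the regularizer and the empirical risk together gives the regularized risk estimation framework (RRM):
\[ \min_{w} J(w) := \lambda \Omega(w) + R_{\mathrm{emp}}(w), \quad \text{where} \quad R_{\mathrm{emp}}(w) := \frac{1}{n} \sum_{i=1}^{n} l(x_i, y_i; w). \]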
Here $\Omega(w)$ is the regularizer and $R_{\mathrm{emp}}(w)$ is the empirical risk. $l$ is a loss function measuring the discrepancy between the true label $y_i$ and the output of the model $w$. We will consider optimization for this class of problems, with special focus on strongly convex $\Omega(w)$ and convex (nonsmooth) $R_{\mathrm{emp}}$. This makes optimization relatively simple (Boyd & Vandenberghe, 2004), and allows one to focus on modeling without being entangled with numerical stability or suboptimal solutions due to local minima. In addition, this assumption is not too restrictive, as we have shown above that a large number of machine learning models fit in this framework.
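As one concrete instance of this setting, the sketch below evaluates $J(w)$ for a strongly convex regularizer and a convex nonsmooth loss. The specific choices are assumptions for illustration only: $\Omega(w) = \frac{1}{2}\|w\|_2^2$, the binary hinge loss, and features used directly as $\phi(x_i)$. The second function returns one valid subgradient of $J$, taking the zero element of the hinge subdifferential at the kink.

```python
import numpy as np

def rrm_objective(w, X, y, lam):
    """J(w) = lam * 0.5 * ||w||^2 + mean hinge loss (assumed instance of RRM)."""
    margins = y * (X @ w)
    emp_risk = np.mean(np.maximum(0.0, 1.0 - margins))
    return float(lam * 0.5 * np.dot(w, w) + emp_risk)

def rrm_subgradient(w, X, y, lam):
    """One subgradient of J at w; examples with margin < 1 contribute -y_i x_i."""
    margins = y * (X @ w)
    active = margins < 1.0
    grad_risk = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return lam * w + grad_risk
```

Because $J$ is convex, any such subgradient $g$ at $w$ satisfies $J(v) \ge J(w) + \langle g, v - w \rangle$ for all $v$, which is the property that subgradient-based solvers for this framework rely on.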