3.2 Algorithms for learning with probabilistic soft-labels

3.2.1 Discriminative regression methods

In the discriminative classification approach we want to learn a function $f : X \to \mathbb{R}$ that lets us discriminate between examples of the two classes. Once the function $f$ is known, the class decision is made with the help of a threshold $\sigma$: an example is classified as class 1 if $f(x) \ge \sigma$, and as class 0 otherwise. In our framework, in addition to binary class labels {0, 1}, we also have auxiliary probabilistic information associated with these class labels. The question is how this information can be used to learn a better discriminant function.

3.2.1.1 Linear regression

One relatively straightforward solution is to assume that the discriminant function is defined directly in terms of these auxiliary probabilities. In that case, learning the discriminant function can be converted into a regression problem. One way to learn the function is to regress the features directly to the probabilities, that is, to learn a regression mapping $f$ where $(x_i, p_i)$ are the input-output pairs.

Assuming the function $f : X \to \mathbb{R}$ is a linear model $f(x) = w^T x$, the learning problem becomes a linear regression problem, solved by minimizing an error function based on the sum of squared residuals:

$$w^* = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} \left( w^T x_i - p_i \right)^2 + Q(w)$$

where $Q(w)$ is an optional regularization term that may help to prevent model overfitting (Section 2.1.1.2). The solution $w^*$ is the weight vector that optimizes the linear model.
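The squared-error objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this work: it assumes that $Q(w)$ is an L2 penalty $\lambda \|w\|^2$ (the text only says the term is optional), and the names `fit_linear_soft_labels`, `discriminant`, and `lam` are hypothetical.

```python
import numpy as np


def fit_linear_soft_labels(X, p, lam=0.0):
    """Minimize (1/N) * sum_i (w^T x_i - p_i)^2 + lam * ||w||^2 in closed form.

    The L2 form of the regularizer is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X / N + lam * I) w = X^T p / N.
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ p / n
    return np.linalg.solve(A, b)


def discriminant(X, w):
    """Evaluate f(x) = w^T x for every row of X."""
    return np.asarray(X, dtype=float) @ w


# Toy data: the first column is a constant bias feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
p = np.array([0.1, 0.3, 0.5, 0.7])  # soft labels, here exactly 0.1 + 0.2 * x
w = fit_linear_soft_labels(X, p)    # approximately [0.1, 0.2]
```

With `lam=0.0` the system is solvable only when the feature columns are linearly independent; a small positive `lam` makes it well conditioned in the general case.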

In the rest of this work, we refer to this method as LinRaux (Linear regression with

Defining the classification threshold. Once the weights of the discriminant function are learned, a classifier can be defined using a decision threshold $\sigma$. To find the optimal threshold, we can use the class labels and minimize the overall loss on the training data.
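One simple way to carry out this threshold search is to try every distinct training score as a candidate. The sketch below assumes that the "overall loss" is the 0/1 misclassification error, which the text does not specify; the function name `best_threshold` is hypothetical.

```python
def best_threshold(scores, labels):
    """Return (sigma, error_rate) minimizing the training 0/1 error.

    The decision rule is: class 1 if score >= sigma, class 0 otherwise.
    Using the 0/1 error as the loss is an assumption of this sketch.
    """
    # Candidates: every distinct score, plus +inf (predict class 0 everywhere).
    candidates = sorted(set(scores)) + [float("inf")]
    best_sigma, best_errors = None, None
    for sigma in candidates:
        errors = sum((s >= sigma) != bool(y) for s, y in zip(scores, labels))
        # Strict comparison keeps the smallest threshold among ties.
        if best_errors is None or errors < best_errors:
            best_sigma, best_errors = sigma, errors
    return best_sigma, best_errors / len(scores)


# Scores f(x_i) from a learned discriminant and the binary class labels.
sigma, err = best_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
# sigma == 0.35, err == 0.25
```

The search is quadratic in the number of examples, which is adequate for an illustration; sorting the scores once and sweeping the threshold would give the same result faster.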

3.2.1.2 Consistency with probabilistic assessment

With an arbitrary function model, the outputs of the regression may not be consistent with probabilities. For example, when linear regression is applied directly to the input-probability pairs, nothing guarantees that the learned model produces valid probabilities: some predictions may fall outside the $[0, 1]$ interval. An alternative is to regress the inputs to a new space in $\mathbb{R}$ obtained by transforming the probabilistic space, such that the transformation is monotonic in $p_i$ and its inverse lets us revert back to probabilities. An example of such a transformation is $t(p_i) = \ln \frac{p_i}{1 - p_i}$, which is the inverse of the logistic function. In this case the regression model is trained on $(x_i, t(p_i))$ pairs. The results of the regression can be transformed back to the probability space using the logistic function $g(s) = \frac{1}{1 + e^{-s}}$, so the resulting probabilities are consistent. The learning problem then becomes a linear regression problem solved by minimizing the following error function:

$$w^* = \arg\min_w \frac{1}{N} \sum_{i=1}^{N} \left( w^T x_i - t(p_i) \right)^2 + Q(w)$$

where $Q(w)$ is a regularization term (Section 2.1.1.2). The solution $w^*$ is the weight vector that optimizes the linear model and, if needed, the posterior probability is recovered as:

$$p(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^T x}}$$

In the rest of this work, we refer to this method as LogRaux (Regression with log transformation on auxiliary soft-label information).
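The log-transformed regression can be sketched as follows. As before, this is an illustration under stated assumptions rather than the implementation used in this work: $Q(w)$ is taken to be an L2 penalty, and the soft labels are clipped away from 0 and 1 by a small `eps` because $t(p)$ is undefined at those endpoints. The clipping step and all function names are assumptions introduced here.

```python
import numpy as np


def logit(p, eps=1e-6):
    """t(p) = ln(p / (1 - p)); clipping by eps is an assumption of this sketch."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def sigmoid(s):
    """g(s) = 1 / (1 + exp(-s)), the inverse of the logit transformation."""
    return 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))


def fit_logit_regression(X, p, lam=0.0):
    """Minimize (1/N) * sum_i (w^T x_i - t(p_i))^2 + lam * ||w||^2."""
    X = np.asarray(X, dtype=float)
    t = logit(p)
    n, d = X.shape
    # Same normal equations as plain linear regression, with targets t(p_i).
    A = X.T @ X / n + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ t / n)


def predict_proba(X, w):
    """Map the linear output back to the probability space."""
    return sigmoid(np.asarray(X, dtype=float) @ w)


# Toy data generated from a known weight vector (first column is a bias).
w_true = np.array([-1.0, 0.5])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
p = sigmoid(X @ w_true)
w = fit_logit_regression(X, p)  # recovers w_true up to rounding
```

Because the prediction passes through the logistic function, it stays inside $[0, 1]$ even for inputs far from the training data, which is the consistency property discussed above.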

3.2.1.3 Soft-labels help to learn better classification models

To test whether the probabilistic soft-label information helps to learn better classification models, we compare the AUC performance of the two regression methods learned from soft-labels, LinRaux and LogRaux, with two standard binary classifiers learned from binary class labels only: support vector machines (SVM) and logistic regression (LogR). We conduct experiments on five UCI data sets with simulated soft-labels and on our medical data with real soft-labels given by experts. For illustration purposes, we take one of the data sets, "Concrete", as the running example in the current and following sections. Details about the experimental setup, how class and soft labels are given, and results on all other data sets are described in the experiment Sections 3.4 and 3.5.
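For reference, AUC can be computed from real-valued scores and binary labels as the fraction of positive-negative pairs that are ranked correctly, which is why the regression outputs can be compared with classifier outputs without fixing a threshold. The following is a generic sketch of that definition, not the evaluation code used in the experiments; the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one,
    counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in pos
        for sn in neg
    )
    return wins / (len(pos) * len(neg))


# Three of the four positive-negative pairs are ordered correctly.
value = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```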

Figure 10 shows the AUC performance of the four methods on the "Concrete" data set. We see that LinRaux and LogRaux clearly outperform SVM and logistic regression. Similar results are also observed on all other data sets. This demonstrates that soft-label information can help us to learn better classification models.

Note that the performance of the two regression methods, LinRaux and LogRaux, is very similar, and their curves overlap on the "Concrete" data, as shown in Figure 10. It turns out that this is also true for all other data sets we experimented with (see Section 3.5 for full results on all data sets). Therefore, for simplicity's sake, we sometimes omit LogRaux results in the remaining figures.

[Figure 10 plot. Panel title: "Data: concrete, 25% pos, no noise". x-axis: Number of Training Examples (20 to 160). y-axis: AUC (0.75 to 0.9). Curves: LogR, SVM, LinRaux, LogRaux.]

Figure 10: Regression methods LinRaux and LogRaux, which learn from probabilistic soft-label information, outperform the binary classification models SVM and logistic regression (LogR), which learn from binary class labels only. The soft-labels are not corrupted by noise.