CSC 411: Lecture 07: Multiclass Classification

(1)

CSC 411: Lecture 07: Multiclass Classification

Class based on Raquel Urtasun & Rich Zemel’s lectures

Sanja Fidler

University of Toronto

(2)

Today

Multi-class classification with: Least-squares regression Logistic Regression K-NN

(3)

Discriminant Functions for

K

>

2 classes

First idea: UseK−1 classifiers, each solving a two class problem of separating point in a classCk from points not in the class.

Known as1 vs allor 1 vs the restclassifier

(4)

Discriminant Functions for

K

>

2 classes

Another simple idea: IntroduceK(K−1)/2 two-way classifiers, one for each possible pair of classes

Each point is classified according to majority vote amongst the disc. func. Known as the1 vs 1 classifier

(5)

K-Class Discriminant

We can avoid these problems by considering a single K-class discriminant comprising K functions of the form

yk(x) =wTkx+wk,0

and then assigning a pointxto classCk if

∀j6=k yk(x)>yj(x)

Note thatwT

k is now a vector, not thek-th coordinate

The decision boundary between class Cj and classCk is given by

yj(x) =yk(x), and thus it’s a (D−1) dimensional hyperplane defined as

(wk−wj)Tx+ (wk0−wj0) = 0

What about the binary case? Is this different? What is the shape of the overall decision boundary?

(6)

K-Class Discriminant

The decision regions of such a discriminant are always singly connected andconvex

In Euclidean space, an object isconvexif for every pair of points within the object, every point on the straight line segment that joins the pair of points is also within the object

(7)

K-Class Discriminant

The decision regions of such a discriminant are always singly connected andconvex

Consider 2 pointsxA andxB that lie inside decision regionRk

Any convex combination ˆx of those points also will be inRk

ˆ

(8)

Proof

A convex combination point, i.e.,λ∈[0,1] ˆ

x=λxA+ (1−λ)xB

From the linearity of the classifiery(x)

yk(ˆx) =λyk(xA) + (1−λ)yk(xB)

SincexA andxB are inRk, it follows thatyk(xA)>yj(xA),yk(xB)>yj(xB), ∀j 6=k

Sinceλand 1−λare positive, then ˆx is insideRk

(9)

(10)

Multi-class Classification with Linear Regression

From before we have:

yk(x) =wTkx+wk,0

which can be rewritten as:

y(x) = ˜WT˜x where thek-th column of ˜Wis [wk,0,wTk]

T_{, and ˜}_x_{is [1}_,_xT_]T

Training: How can I find the weights ˜Wwith the standard sum-of-squares regression loss?

1-of-K encoding:

For multi-class problems (withK classes), instead of usingt =k(target has labelk) we often use a1-of-K encoding, i.e., a vector ofK target values containing a single 1 for the correct class and zeros elsewhere

Example: For a 4-class problem, we would write a target with class label 2 as:

(11)

Multi-class Classification with Linear Regression

Sum-of-least-squares loss: `( ˜W) = N X n=1 ||W˜T˜x(n)−t(n)||2 = ||X˜W˜ −T||2 F

where then-th row of ˜Xis [˜x(n)]T, andn-th row ofTis [t(n)]T Setting derivative wrt ˜Wto 0, we get:

˜

(12)

Multi-class Logistic Regression

Associate a set of weights with each class, then use a normalized exponential output

p(Ck|x) =yk(x) =

exp(zk)

P

jexp(zj)

where theactivationsare given by

zk =wTkx

The function exp(zk)

P

(13)

Multi-class Logistic Regression

The likelihood p(T|w1,· · ·,wk) = N Y n=1 K Y k=1 p(Ck|x(n)) t(_kn) = N Y n=1 K Y k=1 y_k(n)(x(n))t (n) k with p(Ck|x) =yk(x) = exp(zk) P jexp(zj)

wheren-th row of Tis 1-of-K encoding of examplenand

zk =wTkx+wk0

What assumptions have I used to derive the likelihood? Derive the loss by computing the negative log-likelihood:

E(w1,· · · ,wK) =−logp(T|w1,· · · ,wK) =− N X n=1 K X k=1 t_k(n)log[y_k(n)(x(n))]

This is known as thecross-entropyerror for multiclass classification How do we obtain the weights?

(14)

Training Multi-class Logistic Regression

How do we obtain the weights?

E(w1,· · ·,wK) =−logp(T|w1,· · ·,wK) =− N X n=1 K X k=1 t_k(n)log[y_k(n)(x(n))]

Do gradient descent, where the derivatives are

∂y_j(n) ∂z_k(n) =δ(k,j)y_j(n)−y_j(n)y_k(n) and ∂E ∂z_k(n) = K X j=1 ∂E ∂y_j(n) · ∂y (n) j ∂z_k(n) =y_k(n)−t_k(n) ∂E ∂wk,j = N X n=1 K X j=1 ∂E ∂y_j(n) · ∂y (n) j ∂z_k(n) · ∂z (n) k ∂wk,j = N X n=1 (y_k(n)−t_k(n))·x_j(n)

(15)

Softmax for 2 Classes

Let’s write the probability of one of the classes

p(C1|x) =y1(x) = exp(z1) P jexp(zj) = exp(z1) exp(z1) + exp(z2)

I can equivalently write this as

p(C1|x) =y1(x) =

exp(z1)

exp(z1) + exp(z2)

= 1

1 + exp (−(z1−z2))

So the logistic is just a special case that avoids using redundant parameters Rather than having two separate set of weights for the two classes, combine into one

z0=z1−z2=w1Tx−w2Tx=wTx

The over-parameterization of the softmax is because the probabilities must add to 1.

(16)

Multi-class K-NN

(17)

Multi-class Decision Trees

Can directly handle multi class problems How is this decision tree constructed?