• No results found

3F3: Signal and Pattern Processing

N/A
N/A
Protected

Academic year: 2021

Share "3F3: Signal and Pattern Processing"

Copied!
13
0
0

Loading.... (view fulltext now)

Full text

(1)

3F3: Signal and Pattern Processing

Lecture 3: Classification Zoubin Ghahramani [email protected] Department of Engineering University of Cambridge Lent Term

(2)

Classification

We will represent data by vectors in some vector space.

Let x denote a data point with elements x = (x1, x2, . . . , xD)

The elements of x, e.g. xd, represent measured (observed) features of the data point; D

denotes the number of measured features of each point.

The data set D consists of N pairs of data points and corresponding discrete class labels: D = {(x(1), y(1)) . . . , (x(N ), y(N ))}

where y(n) ∈ {1, . . . , C} and C is the number of classes.

The goal is to classify new inputs correctly (i.e. to generalize).

x o x x x x x x o o o o o o x x x x o Examples: • spam vs non-spam • normal vs disease • 0 vs 1 vs 2 vs 3 ... vs 9

(3)

Classification: Example Iris Dataset

3 classes, 4 numeric attributes, 150 instances

A data set with 150 points and 3 classes. Each point is a random sample of measurements of flowers from one of three iris species—setosa, versicolor, and virginica—collected by Anderson (1935). Used by Fisher (1936) for linear discrimant function technique.

The measurements are sepal length, sepal width, petal length, and petal width in cm.

Data: 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa ... 7.0,3.2,4.7,1.4,Iris-versicolor 6.4,3.2,4.5,1.5,Iris-versicolor 6.9,3.1,4.9,1.5,Iris-versicolor ... 6.3,3.3,6.0,2.5,Iris-virginica 5.8,2.7,5.1,1.9,Iris-virginica 7.1,3.0,5.9,2.1,Iris-virginica 24 5 6 7 8 2.5 3 3.5 4 4.5 sepal length sepal width

(4)

Linear Classification

Data set D:

D = {(x(1), y(1)) . . . , (x(N ), y(N ))}

Assume y(n) ∈ {0, 1} (i.e. two classes) and x(n) ∈ <D, and let ˜x(n) = 

x(n) 1

 .

Linear Classification (Deterministic)

y(n) = H β>x˜(n)

where H(z) = 

1 if z ≥ 0

(5)

Linear Classification

Data set D:

D = {(x(1), y(1)) . . . , (x(N ), y(N ))}

Assume y(n) ∈ {0, 1} (i.e. two classes) and x(n) ∈ <D, and let ˜x(n) = 

x(n) 1

 .

Linear Classification (Probabilistic)

P (y(n) = 1|˜x(n), β) = σ β>x˜(n)

where σ(·) is a monotonic increasing function σ : < → [0, 1].

For example if σ(z) is σ(z) = 1+exp(−z)1 , the sigmoid or logistic function, this is called logistic

regression. Compare this to...

Linear Regression y(n) = β>x˜(n) + n −50 0 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 z σ (z) Logistic Function

(6)

Logistic Classification

1

Data set: D = {(x(1), y(1)) . . . , (x(N ), y(N ))}.

P (y(n) = 1|˜x(n), β) = σ β>x˜(n)

P (y(n) = 0|˜x(n), β) = 1 − P (y(n) = 1|˜x(n), β) = σ −β>x˜(n)

Since σ(z) = 1+exp(−z)1 , note that 1 − σ(z) = σ(−z).

This defines a Bernoulli distribution over y at each point x. (cf θy(1 − θ)(1−y)) The likelihood for β is:

P (y|X, β) = Y

n

P (y(n)|˜x(n), β) = Y

n

σ(β>x˜(n))y(n)(1 − σ(β>x˜(n)))(1−y(n))

Given the data, this is just some function of β which we can optimize...

(7)

Logistic Classification

Maximum Likelihood learning:

L(β) = ln P (y|X, β) = X n ln P (y(n)|˜x(n), β) = X n h y(n) ln σ(β>x˜(n)) + (1 − y(n)) ln σ(−β>x˜(n))i

Taking derivatives, using ∂ ln σ(z)∂z = σ(−z), and defining zn def = β>x˜(n), we get: ∂L(β) ∂β = X n h y(n)σ(−zn)˜x(n) − (1 − y(n))σ(zn)˜x(n) i = X n h y(n)σ(−zn)˜x(n) + y(n)σ(zn)˜x(n) − σ(zn)˜x(n) i = X n (y(n) − σ(zn))˜x(n) Intuition?

(8)

Logistic Classification

The gradient: ∂L(β) ∂β = X n (y(n) − σ(zn))˜x(n) A learning rule (steepest gradient ascent in the likelihood):

β[t+1] = β[t] + η∂L(β

[t])

∂β[t] where η is a learning rate

Which gives:

β[t+1] = β[t] + η X

n

(y(n) − σ(zn))˜x(n)

This is a batch learning rule, since all N points are considered at once.

An online learning rule, can consider one data point at a time...

β[t+1] = β[t] + η(y(t) − σ(zt))˜x(t) This is useful for real-time learning and adaptation applications.

(9)

Logistic Classification

Demo of the online logistic classification learning rule:

β[t+1] = β[t] + η(y(t) − σ(zt))˜x(t) where zt = β[t]>x˜(t). −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 lrdemo

(10)

Maximum a Posteriori (MAP) and Bayesian Learning

MAP: Assume we define a prior on β, for example a Gaussian prior with variance λ:

P (β) = √ 1 2πλD+1 exp{− 1 2λβ > β}

The instead of maximizing the likelihood we can maximize the posterior probability of β ln P (β|D) = ln P (D|β) + ln P (β) + const

= L(β) − 1

2λβ

>

β + const’

The second term penalizes the length of β.

Why would we want to do this?

Bayesian Learning: We can also do Bayesian learning of β by trying to compute or approximate P (β|D) rather than optimize it.

(11)

Nonlinear Logistic Classification

Easy: just map the inputs into a higher dimensional space by computing some nonlinear functions of the inputs. For example:

(x1, x2) → (x1, x2, x1x2, x21, x22)

Then do logistic classification using these new inputs.

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 nlrdemo

(12)

Multinomial Classification

How do we do multiple classes?

y ∈ {1, . . . , C}

Answer: Use a softmax function which generalizes the logistic function: Each class c has a vector βc

P (y = c|x, β) = exp{β > c x} PC c0=1 exp{β > c0x}

(13)

Some exercises

1. show how using the softmax function is equivalent to doing logistic classification when C = 2.

2. implement the online logistic classification rule in Matlab/Octave. 3. implement the nonlinear extension.

4. investigate what happens when the classes are well separated, and what role the prior has in MAP learning.

References

Related documents

Ultimately, the fundamental inspiration of this project is to describe pattern recognition of electromyography (EMG) signal during load lifting using Artificial

Simulations show that average correct signal modulation type classification % C &gt; 90% is achieved for SNR &gt; 9 dB, average signal modulation family classification % C &gt; 90%

The number of discrete time samples needed to represent a frequency-sparse signal with known sparsity pattern follows the law of algebra, i.e., the number of time samples should

The extracted features (Standard deviation, &amp;Normalized values) have been taken for non- stationary power signal classification using fuzzy wavelet neural networks

Thus it is feasible that AATs may be reduced by using a signal with a pattern that includes silences (of unknown duration). The aim of Part A was to determine the signal with

ANFIS based MiCS Pattern Classification sEMG Signal Acquisition Wheelchair’s Motor Movement Wheelchair’s Navigation Control Unit

Applications in digital signal processing can be implemented on the SHARC using both C and SHARC assembly code. The SHARC can be efficiently used to process time-critical

Thus, this study aims to explore the pattern of platoon dispersion caused by traffic signal, in which the focus are given on vehicle headway, intra-platoon headway