In parametric classification the observed data are assumed to follow a known distribution, e.g. the Gaussian distribution, and the parameters needed to specify this distribution are estimated from the feature data, e.g. the mean $\mu_i$ and the covariance matrix $\Sigma_i$ of the normal distribution.
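As a minimal sketch of this estimation step, the following computes a mean vector and covariance matrix per class with NumPy. The function name `estimate_gaussian_parameters` and the toy data are assumptions made for illustration only; they do not come from the original text.

```python
import numpy as np

def estimate_gaussian_parameters(features, labels):
    """Return {class_label: (mean, covariance)} for each class in labels."""
    params = {}
    for label in np.unique(labels):
        class_features = features[labels == label]
        mean = class_features.mean(axis=0)
        # bias=True divides by N (maximum likelihood) instead of N - 1
        covariance = np.cov(class_features, rowvar=False, bias=True)
        params[label.item()] = (mean, covariance)
    return params

# Hypothetical toy data: four 2-D feature vectors from two classes
features = np.array([[1.0, 2.0], [3.0, 4.0], [6.0, 0.0], [8.0, 2.0]])
labels = np.array([0, 0, 1, 1])
params = estimate_gaussian_parameters(features, labels)
```

With so few samples the estimated covariance matrices are singular; a real data set needs clearly more samples per class than feature dimensions.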
5.2.1 Classification based on Bayesian theory
The Bayes classification rule assigns an object to the class with the highest a posteriori probability $P(\omega_i \mid x)$. From Bayes' theorem, this probability is given as

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)} \tag{5.18}$$

where $p(x)$ is the pdf of $x$, which can be written as

$$p(x) = \sum_{i=1}^{n} p(x \mid \omega_i)\, P(\omega_i) \tag{5.19}$$
In other words, we assign the pattern $x$ to the class $\omega_i$ that satisfies

$$P(\omega_i \mid x) > P(\omega_j \mid x), \quad \forall j \neq i \tag{5.20}$$
Given the $n$ classes, we assume that the a priori class probabilities $P(\omega_i)$ are known; in practice they are usually estimated as the class frequencies. We then need to find the class-conditional probability density functions $p(x \mid \omega_i)$, also known as the likelihood functions, which describe the distribution of the feature vectors in each of the classes.
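The rule can be illustrated numerically. The sketch below estimates the priors as class frequencies and applies (5.18) and (5.19); the likelihood values are assumed numbers chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np

def class_priors(labels):
    """Estimate P(w_i) as the relative frequency of each class."""
    classes, counts = np.unique(labels, return_counts=True)
    return classes, counts / counts.sum()

def posteriors(likelihoods, priors):
    """Bayes' rule (5.18): P(w_i|x) = p(x|w_i) P(w_i) / p(x)."""
    joint = likelihoods * priors
    evidence = joint.sum()  # p(x) from (5.19)
    return joint / evidence

labels = np.array([0, 0, 0, 1])         # hypothetical training labels
classes, priors = class_priors(labels)  # priors are 0.75 and 0.25
likelihoods = np.array([0.2, 0.4])      # assumed values of p(x|w_i) at one x
post = posteriors(likelihoods, priors)
predicted = classes[np.argmax(post)]
```

In this example the likelihood alone favours class 1, but the larger prior of class 0 gives it the higher posterior, so the pattern is assigned to class 0.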
5.2.2 Discriminant functions
Instead of working directly with probabilities it is common, and often useful, to represent a classifier by discriminant functions $g_i(x)$, $i = 1, \ldots, n$. These are defined as $g_i(x) \equiv f(P(\omega_i \mid x))$, where $f(\cdot)$ is a monotonically increasing function [21]. We then assign $x$ to class $\omega_i$ if

$$g_i(x) > g_j(x), \quad \forall j \neq i \tag{5.21}$$
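The decision surfaces, which separate the decision regions, are defined as [21]

$$g_{ij}(x) \equiv g_i(x) - g_j(x) = 0, \quad i, j = 1, \ldots, n, \; i \neq j \tag{5.22}$$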
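5.2. PARAMETRIC CLASSIFICATION 51

For the Bayesian case, taking $f(\cdot)$ to be the identity makes the discriminant function just the a posteriori probability, and we get

$$g_i(x) = P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)} \tag{5.23}$$

This can be simplified: $p(x)$ is the same for all classes, so (5.23) is proportional to

$$g_i(x) = p(x \mid \omega_i)\, P(\omega_i) \tag{5.24}$$

Taking the logarithm, which is monotonically increasing and therefore leaves the decision unchanged, gives

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) \tag{5.25}$$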
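A small numerical check, using assumed prior and likelihood values, suggests that the three forms (5.23) to (5.25) lead to the same decision, since each is a monotonically increasing transformation of the posterior.

```python
import numpy as np

priors = np.array([0.75, 0.25])     # assumed P(w_i)
likelihoods = np.array([0.2, 0.4])  # assumed p(x|w_i) at one x

evidence = np.sum(likelihoods * priors)
g_posterior = likelihoods * priors / evidence  # form (5.23)
g_product = likelihoods * priors               # form (5.24)
g_log = np.log(likelihoods) + np.log(priors)   # form (5.25)

# Collect the winning class index under each form of the discriminant
decisions = {int(np.argmax(g)) for g in (g_posterior, g_product, g_log)}
```

The set `decisions` contains a single class index, because the ordering of the classes is preserved by all three forms.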
5.2.3 Gaussian distribution
The Gaussian distribution is probably the most commonly used pdf in practice, mainly because of its well-known properties and its asymptotic nature: by the central limit theorem, a feature that arises as the sum of many independent contributions tends toward a Gaussian distribution, so the Gaussian often fits the data well. Given the $d$-dimensional feature vector $x = (x_1, \ldots, x_d)^T$, the multivariate Gaussian density is defined as

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\} \tag{5.26}$$
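where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of class $\omega_i$, replaced in practice by their maximum likelihood estimates. We now have all the ingredients to compute the a posteriori probability and hence the discriminant functions. Inserting (5.26) into (5.25), the discriminant function is given as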
$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i) \tag{5.27}$$
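The discriminant function (5.27) can be evaluated as in the following sketch. The function name `gaussian_discriminant` and the example values are assumptions for illustration; solving a linear system and using `slogdet` are implementation choices made here for numerical stability, not something prescribed by the text.

```python
import numpy as np

def gaussian_discriminant(x, mean, covariance, prior):
    """Evaluate g_i(x) of (5.27) for a single class."""
    d = mean.shape[0]
    diff = x - mean
    # Solve Sigma z = diff rather than forming the inverse explicitly
    mahalanobis_sq = diff @ np.linalg.solve(covariance, diff)
    # slogdet returns (sign, ln|det|), which avoids overflow in det()
    _, log_det = np.linalg.slogdet(covariance)
    return (-0.5 * mahalanobis_sq
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * log_det
            + np.log(prior))

# Hypothetical example: standard normal class with prior 0.5
x = np.array([1.0, 0.0])
g = gaussian_discriminant(x, mean=np.zeros(2), covariance=np.eye(2), prior=0.5)
```

To classify a pattern, the function would be evaluated once per class and the class with the largest value chosen, as in (5.21).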
Three cases of covariance structure
The covariance structure in the Gaussian density can be specified in one of three ways. The choice affects the complexity of the model, the shape of the decision surface and the computation time. The three cases are:
1. $\Sigma_i = \sigma^2 I$:
In the first case all classes share the same covariance matrix, and the features are independent with equal variance, i.e., the correlation between the features is zero and the covariance matrix is diagonal with all diagonal elements equal to $\sigma^2$. We then get the discriminant functions as

$$g_i(x) = -\frac{1}{2\sigma^2} (x - \mu_i)^T (x - \mu_i) - \frac{1}{2} \ln |\sigma^2 I| + \ln P(\omega_i) \tag{5.28}$$

where the term $-\frac{d}{2} \ln 2\pi$ has been dropped because it is the same for all classes. The samples fall in equal-size hyperspherical clusters, each centered about its class mean. In figure 5.3 the prior probabilities are equal for the two classes, and the decision boundary is linear: it is orthogonal to the line between the means and crosses this line exactly at the midpoint between them. If the priors had not been equal, the decision boundary would have been shifted toward the mean of the class with the lower prior probability.
2. $\Sigma_i = \Sigma$:
In this case the classes share a common covariance matrix, which results in hyperellipsoidal clusters with the same shape and size. The features may now be correlated and are no longer assumed to be independent. The discriminant function will be

$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) - \frac{1}{2} \ln |\Sigma| + \ln P(\omega_i) \tag{5.29}$$

Again we have a linear classifier, which is shown in figure 5.4. The two classes have the same prior probabilities, and the decision rule is in principle the same as for the case above.
3. $\Sigma_i$ arbitrary:
In the general multivariate Gaussian model each class has its own covariance matrix, and the discriminant functions are quadratic. This gives a more complex decision boundary, but even if such a function could separate the objects better, there are still some issues to consider. How many coefficients must be estimated to determine the decision boundary? Do we want a classifier that complex? The discriminant function in this case will be

$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i) \tag{5.30}$$
Figure 5.5: The case with arbitrary covariance structure.
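The linearity claimed for the shared-covariance case, $\Sigma_i = \Sigma$, can be checked by expanding the quadratic form in its discriminant function. The symbols $w_i$ and $w_{i0}$ are introduced here for illustration and are not part of the original notation. Using the symmetry of $\Sigma^{-1}$,

$$(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) = x^T \Sigma^{-1} x - 2 \mu_i^T \Sigma^{-1} x + \mu_i^T \Sigma^{-1} \mu_i$$

The terms $x^T \Sigma^{-1} x$ and $\frac{1}{2} \ln |\Sigma|$ are the same for every class and can be dropped without changing the decision, which leaves

$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$$

This is linear in $x$, so the surface $g_i(x) = g_j(x)$ is a hyperplane. When each class has its own $\Sigma_i$, the term $x^T \Sigma_i^{-1} x$ depends on the class and no longer cancels, which is why the general case is quadratic.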
Although the features computed for the different cancer prognosis classes in our data may have different variances, we will still assume independent features with the same variance.
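Under this assumption, $\Sigma_i = \sigma^2 I$, the effect of the priors on the decision boundary can be sketched as follows. The class means, variance and priors are hypothetical values, and the expression in `boundary_point` is obtained by setting the discriminant functions of two classes equal; it is offered as an illustration, not as the procedure used on the data in this work.

```python
import numpy as np

def isotropic_discriminant(x, mean, sigma_sq, prior):
    """g_i(x) for Sigma_i = sigma^2 I, with class-independent terms dropped."""
    diff = x - mean
    return -(diff @ diff) / (2.0 * sigma_sq) + np.log(prior)

def boundary_point(mean_a, mean_b, sigma_sq, prior_a, prior_b):
    """Point where the boundary crosses the line between the two means,
    found by solving g_a(x) = g_b(x) along that line."""
    delta = mean_a - mean_b
    shift = sigma_sq / (delta @ delta) * np.log(prior_a / prior_b)
    return 0.5 * (mean_a + mean_b) - shift * delta

mean_a = np.array([0.0, 0.0])  # hypothetical class means
mean_b = np.array([4.0, 0.0])
sigma_sq = 1.0

# Equal priors: the boundary passes through the midpoint of the means
equal = boundary_point(mean_a, mean_b, sigma_sq, 0.5, 0.5)
# Unequal priors: the boundary moves toward the less probable class (b)
skewed = boundary_point(mean_a, mean_b, sigma_sq, 0.9, 0.1)
```

At `skewed` both discriminant functions take the same value, and the point lies closer to `mean_b`, the mean of the class with the lower prior.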