Problem Context - Feature and Decision Level Fusion Using Multiple Kernel Learning and Fuzzy In

It is no coincidence that the high-level examples of the previous section ultimately ended with a decision; essentially all of the algorithms discussed in this dissertation are binary decision makers, i.e., binary classifiers.¹ Hence, following the typical workflow for many machine learning or statistical prediction methods, the algorithms discussed later will be trained on a set of training data (data that are accurately labeled with known classes and are representative of the problem at hand), then tested on testing data (data with unknown labels). The training process allows the classifier to learn the underlying model so that accurate predictions can be made on the testing data.

¹ While the classifier I use in these algorithms is the support vector machine (SVM), generally any classifier can be used in its place.

An example of a high-level feature-level fusion pipeline is shown in Figure 1.1. The input data, which can be from various sources or even a single source, is shown on the left and is fed into the first processing blocks that process the data in some way.² Next, the features are fused in some manner before a single classifier gives an overall decision. In the work that follows, I use an “off the shelf” classifier known as a support vector machine (SVM), and the feature-level fusion algorithms I propose focus on the feature fusion block just before classification.

Figure 1.1: High-level block diagram of feature-level fusion.

Figure 1.2 shows a similar block diagram for a decision-level fusion pipeline. Just as with feature-level fusion, the input data is on the left and is fed to some processing blocks. The difference here is that classification is performed before the fusion block; each processing block gets its very own classifier. The decisions generated by the different classifiers are then aggregated to form an overall decision by the fusion block. Again, the classifiers I use are SVMs and the decision-level fusion methods discussed later are included in the decision fusion block.

Figure 1.2: High-level block diagram of decision-level fusion.
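The two pipelines in Figures 1.1 and 1.2 differ only in where fusion sits relative to classification, and a small Python sketch can make that ordering concrete. Everything in it is an illustrative assumption rather than the method proposed in later chapters: a nearest-centroid rule stands in for the SVM, feature fusion is plain concatenation, decision fusion is an unweighted average of signed scores, and all function and variable names are hypothetical.

```python
# Toy contrast of feature-level and decision-level fusion.
# Assumptions (not from the text): a nearest-centroid rule replaces the SVM,
# concatenation is the feature fusion, score averaging is the decision fusion.

def train_centroids(samples, labels):
    """Return the mean vector of each class (labels are +1 or -1)."""
    centroids = {}
    for cls in (+1, -1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def score(centroids, x):
    """Signed score: positive means x is closer to the +1 centroid."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return sqdist(centroids[-1], x) - sqdist(centroids[+1], x)

def feature_level_fusion(train_sources, labels, test_sources):
    """Fuse the features first (concatenate), then train ONE classifier."""
    fused_train = [sum(rows, []) for rows in zip(*train_sources)]
    fused_test = [sum(rows, []) for rows in zip(*test_sources)]
    model = train_centroids(fused_train, labels)
    return [1 if score(model, x) >= 0 else -1 for x in fused_test]

def decision_level_fusion(train_sources, labels, test_sources):
    """Train one classifier PER source, then fuse their scores."""
    models = [train_centroids(src, labels) for src in train_sources]
    decisions = []
    for rows in zip(*test_sources):
        scores = [score(m, x) for m, x in zip(models, rows)]
        fused = sum(scores) / len(scores)
        decisions.append(1 if fused >= 0 else -1)
    return decisions

# Two made-up sources describing the same four training objects.
source_a_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
source_b_train = [[5.0], [5.2], [1.0], [0.8]]
labels = [-1, -1, +1, +1]

source_a_test = [[0.1, 0.1], [1.0, 1.0]]
source_b_test = [[5.1], [0.9]]

feature_decisions = feature_level_fusion(
    [source_a_train, source_b_train], labels, [source_a_test, source_b_test])
decision_decisions = decision_level_fusion(
    [source_a_train, source_b_train], labels, [source_a_test, source_b_test])
print(feature_decisions, decision_decisions)
```

In the feature-level function a single model sees the concatenated vector, whereas in the decision-level function each source gets its own model and only the scores are combined. The fusion step is the part the later chapters replace with MKL and the Choquet fuzzy integral, respectively.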

The following sections briefly explain the tools used for fusion in the following chapters. Specifically, kernel SVMs and multiple kernel learning are discussed as the tools chosen for classification and feature-level fusion, respectively, and the Choquet fuzzy integral is introduced as the tool of choice for decision-level fusion.

1.1.1 SVMs and Kernels

A support vector machine is a type of binary classifier that finds a hyperplane in some space that discriminates between two classes of data; for linearly separable data, the SVM can separate the two classes perfectly. This is not to say, however, that the SVM cannot be applied to more “complex” data: data that are not linearly separable can be accurately classified with a kernel SVM, i.e., an SVM that has been extended using the kernel trick.

The kernel trick allows data to be nonlinearly mapped to a new higher-dimensional space termed the reproducing kernel Hilbert space (RKHS), where the data are (potentially) linearly separable. When they are, a linear classifier implemented in the RKHS can perfectly discriminate the two classes. The SVM is one of the most popular classifiers to utilize the kernel trick since its formulation turns out to be very efficient: the nonlinear mapping can be performed implicitly through the use of kernel matrices, Hermitian matrices whose elements represent all pairwise inner products of the training data in the RKHS. The elements of a kernel matrix are computed using a kernel function, which represents the inner product of two vectors in an RKHS defined by the kernel function chosen. There are many kernel functions to choose from, e.g., various radial basis function kernels, polynomial kernels, sigmoidal kernels, etc., and they each have at least one free parameter that must be chosen. This abundance of choice leads to the problem of determining which kernel (and parameter) to employ with the SVM. Recall that the goal of using a kernel is to project the data to a space where the data are linearly separable, something not all kernels can achieve. This is the challenge that multiple kernel learning (MKL) addresses.
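To illustrate what "performing the mapping implicitly" means, the following Python sketch builds kernel matrices for a few made-up points. It assumes two textbook kernel functions, a Gaussian radial basis function kernel and a homogeneous degree-2 polynomial kernel, and checks that the polynomial kernel matrix equals the matrix of ordinary inner products after an explicit feature map. The data, the `gamma` value, and the function names are hypothetical choices for illustration only.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel; gamma is its free parameter (value is arbitrary)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel, (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit map for 2-D inputs whose inner product equals poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def kernel_matrix(data, kernel):
    """All pairwise kernel values of the data."""
    return [[kernel(xi, xj) for xj in data] for xi in data]

# Three made-up 2-D training points.
data = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]

K_rbf = kernel_matrix(data, rbf_kernel)
K_poly = kernel_matrix(data, poly2_kernel)

# The same polynomial matrix computed the explicit way: map every point,
# then take ordinary inner products in the mapped (3-D) space.
mapped = [poly2_feature_map(x) for x in data]
K_explicit = [[sum(a * b for a, b in zip(u, v)) for v in mapped] for u in mapped]

for row_implicit, row_explicit in zip(K_poly, K_explicit):
    print(row_implicit, row_explicit)
```

The kernel function never forms the mapped vectors, yet it yields the same pairwise inner products up to floating-point rounding. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.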

MKL assumes that the kernel used as described in the previous paragraph is actually a linear combination of pre-selected base kernels. One must still choose the various base kernels with this MKL approach, but the process of learning the mixing coefficients generally minimizes the influence of kernels that do not work well and maximizes the contribution of kernels that do separate the data well. In a mathematical nutshell, given $m$ base kernel matrices, $K_k$, MKL is the process of learning the mixing coefficients, $\sigma_k$, that form an “optimal” kernel as

$$K = \sum_{k=1}^{m} \sigma_k K_k. \qquad (1.1)$$

The MKL algorithms in this dissertation all assume the formulation in (1.1), and many address the problem of learning a suitable set of mixing coefficients. Appendix A provides a more quantitative discussion of SVMs including their kernel extension.
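The weighted sum in (1.1) can be sketched in a few lines of Python. This is only the combination step under stated assumptions: the base kernel matrices and the coefficient values below are made up, the name `combine_kernels` is hypothetical, and the coefficients are fixed by hand, whereas an MKL algorithm would learn them from training data.

```python
def combine_kernels(kernels, sigma):
    """Return K = sum_k sigma[k] * kernels[k], the combination in (1.1)."""
    if len(kernels) != len(sigma):
        raise ValueError("need one mixing coefficient per base kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for K_k, s_k in zip(kernels, sigma):
        for i in range(n):
            for j in range(n):
                combined[i][j] += s_k * K_k[i][j]
    return combined

# Two made-up 2x2 base kernel matrices.
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[4.0, 1.0], [1.0, 9.0]]

# Hand-picked coefficients for illustration; MKL would learn these.
sigma = [0.75, 0.25]

K = combine_kernels([K1, K2], sigma)
print(K)
```

The example uses nonnegative coefficients that sum to one. That constraint is an assumption here, not something this section states; it is a convenient choice because a nonnegative combination of positive semidefinite matrices is itself positive semidefinite, so the result is still a valid kernel matrix.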

1.1.2 The Choquet Fuzzy Integral

Most of the decision-level fusion work in this dissertation uses the Choquet fuzzy integral to combine the outputs of an ensemble of decision makers into a single overall decision. This integral is extremely flexible and is parametrized by the fuzzy measure (FM), a function that maps the power set of all decision makers to the unit interval and can be thought of as the “worth” of a set. Therefore, we can say the Choquet fuzzy integral is “uber-parametrized,” since aggregating the decisions from a set of $m$ decision makers using the integral requires $2^m$ terms in its FM.³ Similar to MKL’s goal of learning the “optimal” mixing coefficients based on training data, techniques using the Choquet integral learn the FM that fits the training data.

³ Note that due to some properties of the FM, the number of required terms is actually $2^m - 2$. This will be explained in later chapters.
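As a rough illustration of how an FM parametrizes the aggregation, the sketch below implements the standard discrete form of the Choquet integral: the outputs are sorted from largest to smallest, and each is weighted by the increase in the FM as its decision maker joins the growing set. The dissertation's formal definition appears in later chapters, so this form, the three hand-built FMs, and the names `choquet_integral`, `all_subsets`, and `svm_a` through `svm_c` are assumptions made for illustration.

```python
from itertools import combinations

def choquet_integral(h, g):
    """Discrete Choquet integral of the outputs h with respect to the FM g.

    h: dict mapping each decision maker to its output.
    g: dict mapping frozensets of decision makers to values in [0, 1].
    """
    ordered = sorted(h, key=h.get, reverse=True)  # largest output first
    total, previous, subset = 0.0, 0.0, frozenset()
    for source in ordered:
        subset = subset | {source}
        total += h[source] * (g[subset] - previous)
        previous = g[subset]
    return total

def all_subsets(sources):
    """Every nonempty subset of the decision makers, as frozensets."""
    items = list(sources)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Made-up outputs of three decision makers.
h = {"svm_a": 0.9, "svm_b": 0.4, "svm_c": 0.6}
subsets = all_subsets(h)

# Three hand-built FMs; each assigns 1 to the set of all decision makers.
g_max = {s: 1.0 for s in subsets}
g_min = {s: (1.0 if len(s) == len(h) else 0.0) for s in subsets}
g_mean = {s: len(s) / len(h) for s in subsets}

print(choquet_integral(h, g_max))   # behaves like the maximum
print(choquet_integral(h, g_min))   # behaves like the minimum
print(choquet_integral(h, g_mean))  # behaves like the average
```

With three decision makers the FM dictionaries hold seven entries (every nonempty subset), and changing only those entries turns the same integral into a maximum, a minimum, or an average. This is the flexibility, and the parameter count, that the paragraph above describes.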

The number of required terms of the FM explodes as $2^m$, so learning the FM quickly becomes an underdetermined problem since sets of training data will rarely have the diversity to include $2^m$ independent observations. This manifests as a learned FM that is only partially accurate: values of the FM driven by the training data are very accurately learned, but the remaining values are driven only by constraints, so their values are essentially erroneous. Thus, when testing data rely on the incorrectly learned FM values, the classification accuracy will generally suffer. Much of the work in the following chapters addresses this problem through the use of regularization, a technique commonly used in machine learning to prevent overfitting. Doing so reduces the influence of the constraints on the learned FM and instead reassigns that influence to the regularization function; the choice of regularization function defines how the values of the FM not driven by training data are learned.
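The underdetermination described above can be made tangible by counting which FM terms a training set actually touches. The sketch below assumes the standard discrete Choquet integral, in which one sample with $m$ inputs uses only the $m$ nested subsets given by its sort order, and it assumes the FM of the full set is fixed, leaving $2^m - 2$ free terms. The data are synthetic uniform random numbers and the function names are hypothetical; the counts illustrate the growth argument and are not results from the dissertation.

```python
import random

def touched_subsets(sample):
    """FM subsets that one sample's Choquet integral actually uses."""
    order = sorted(range(len(sample)), key=lambda i: sample[i], reverse=True)
    return {frozenset(order[:k]) for k in range(1, len(order) + 1)}

def coverage(m, n_samples, seed=0):
    """Count the free FM terms touched by n_samples synthetic samples."""
    rng = random.Random(seed)
    full = frozenset(range(m))
    touched = set()
    for _ in range(n_samples):
        sample = [rng.random() for _ in range(m)]
        touched |= touched_subsets(sample)
    touched.discard(full)            # the FM of the full set is fixed
    return len(touched), 2 ** m - 2  # (touched terms, free terms)

for m in (3, 5, 10, 15):
    used, free = coverage(m, n_samples=100)
    print(f"m={m:2d}: {used} of {free} free FM terms appear in the data")
```

Each sample can touch at most $m - 1$ free terms, so $n$ samples touch at most $n(m - 1)$ of them, while the number of free terms grows as $2^m - 2$. The untouched terms are the ones whose values are set by the constraints or, with regularization, by the regularization function.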
