It is no coincidence that the high-level examples of the previous section ultimately ended with a decision; essentially all of the algorithms discussed in this dissertation are binary decision makers, i.e., binary classifiers.1 Hence, following the typical workflow for many machine learning or statistical prediction methods, the algorithms discussed later will be trained on a set of training data (data that are accurately labeled with known classes and are representative of the problem at hand) and then tested on testing data (data whose labels are unknown to the classifier). The training process allows the classifier to learn the underlying model so that accurate predictions can be made on the testing data.
A high-level example of a feature-level fusion pipeline is shown in Figure 1.1. The input data, which can be from various sources or even a single source, is shown on the left and is fed into the first processing blocks, which process the data in some way2. Next, the features are fused in some manner before a single classifier gives an overall decision. In the work that follows, I use an “off the shelf” classifier known as a support vector machine (SVM), and the feature-level fusion algorithms I propose focus on the feature fusion block just before classification.
Figure 1.1: High-level block diagram of feature-level fusion.
1 While the classifier I use in these algorithms is the support vector machine (SVM), generally any classifier can be used in its place.
Figure 1.2 shows a similar block diagram for a decision-level fusion pipeline. Just as with feature-level fusion, the input data is on the left and is fed to some processing blocks. The difference here is that classification is performed before the fusion block; each processing block gets its very own classifier. The decisions generated by the different classifiers are then aggregated to form an overall decision by the fusion block. Again, the classifiers I use are SVMs and the decision-level fusion methods discussed later are included in the decision fusion block.
Figure 1.2: High-level block diagram of decision-level fusion.
The following sections briefly explain the tools used for fusion in later chapters. Specifically, kernel SVMs and multiple kernel learning are discussed as the tools chosen for classification and feature-level fusion, respectively, and the Choquet fuzzy integral is introduced as the tool of choice for decision-level fusion.
1.1.1 SVMs and Kernels
A support vector machine is a type of binary classifier that finds a hyperplane in some space that discriminates between two classes of data; for linearly separable data, the SVM can separate the training data without error. This is not to say, however, that the SVM cannot be applied to more “complex” data: data that are not linearly separable can be accurately classified with a kernel SVM, i.e., an SVM that has been extended using the kernel trick.
The kernel trick allows data to be nonlinearly mapped to a new higher-dimensional space termed the reproducing kernel Hilbert space (RKHS), where the data are (potentially) linearly separable. When they are, a linear classifier implemented in the RKHS can discriminate the two classes. The SVM is one of the most popular classifiers to utilize the kernel trick since its formulation turns out to be very efficient: the nonlinear mapping can be performed implicitly through the use of kernel matrices, Hermitian matrices whose elements represent all pairwise inner products of the training data in the RKHS. The elements of a kernel matrix are computed using a kernel function, which represents the inner product of two vectors in the RKHS defined by the kernel function chosen. There are many kernel functions to choose from, e.g., various radial basis function kernels, polynomial kernels, sigmoidal kernels, etc., and they each have at least one free parameter that must be chosen. This abundance of choice leads to the problem of determining which kernel (and parameter) to employ with the SVM. Recall that the goal of using a kernel is to project the data to a space where the data are linearly separable, something not all kernels can achieve. This is the challenge that multiple kernel learning (MKL) addresses.
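As a minimal sketch of how a kernel matrix is filled in, the following Python computes all pairwise kernel evaluations for a toy data set. The Gaussian radial basis function, the parameter name `gamma`, its value, and the three data points are assumptions chosen for illustration; they are not taken from the experiments in this dissertation.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2); gamma is the free parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_matrix(data, kernel):
    """Return the n x n matrix of all pairwise kernel evaluations."""
    return [[kernel(xi, xj) for xj in data] for xi in data]

# Hypothetical toy training set: three points in two dimensions.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, rbf_kernel)

for row in K:
    print(["%.4f" % value for value in row])
```

Each entry of `K` is the inner product of two training points in the RKHS induced by the chosen kernel, so the matrix is symmetric; for this particular kernel the diagonal is all ones. A different `gamma` yields a different matrix, which is the parameter-selection problem described above.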
MKL assumes that the kernel used as described in the previous paragraph is actually a linear combination of pre-selected base kernels. One must still choose the various base kernels with this MKL approach, but the process of learning the mixing coefficients generally reduces the influence of kernels that do not work well and increases the contribution of kernels that do separate the data well. In a mathematical nutshell, given m base kernel matrices, K_k, MKL is the process of learning the mixing coefficients, σ_k, that form an “optimal” kernel as
K = \sum_{k=1}^{m} \sigma_k K_k. \qquad (1.1)
The MKL algorithms in this dissertation all assume the formulation in (1.1), and many address the problem of learning a suitable set of mixing coefficients. Appendix A provides a more quantitative discussion of SVMs including their kernel extension.
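The combination in (1.1) can be sketched in a few lines of Python. This is only an illustration of the weighted sum itself: the two 2 x 2 base kernel matrices and the coefficient values are hypothetical, and the coefficients are fixed by hand rather than learned. The choice of nonnegative coefficients that sum to one is a common convention assumed here, not a constraint stated in this section.

```python
def combine_kernels(base_kernels, sigma):
    """Return K = sum_k sigma_k * K_k for equally sized square matrices."""
    if len(base_kernels) != len(sigma):
        raise ValueError("need one mixing coefficient per base kernel")
    n = len(base_kernels[0])
    K = [[0.0] * n for _ in range(n)]
    for s, Kk in zip(sigma, base_kernels):
        for i in range(n):
            for j in range(n):
                K[i][j] += s * Kk[i][j]
    return K

# Hypothetical base kernel matrices for two training points.
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]

# Hand-picked mixing coefficients (assumed nonnegative, summing to one).
sigma = [0.25, 0.75]

K = combine_kernels([K1, K2], sigma)
print(K)
```

An MKL algorithm would replace the hand-picked `sigma` with coefficients learned from training data; the resulting matrix is then passed to the SVM in place of a single kernel matrix.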
1.1.2 The Choquet Fuzzy Integral
Most of the decision-level fusion work in this dissertation uses the Choquet fuzzy integral to combine the outputs of an ensemble of decision makers into a single overall decision. This integral is extremely flexible and is parametrized by the fuzzy measure (FM), a function that maps the power set of all decision makers to the unit interval and can be thought of as the “worth” of each subset of decision makers. The Choquet fuzzy integral can therefore be described as “uber-parametrized,” since aggregating the decisions from a set of m decision makers using the integral requires 2^m terms in its FM3. Similar to MKL’s goal of learning the “optimal” mixing coefficients based on training data, techniques using the Choquet integral learn the FM that fits the training data.
3 Note that due to some properties of the FM, the number of required terms is actually 2^m − 2. This will be explained in later chapters.
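To make the role of the FM concrete, here is a minimal sketch of the standard discrete Choquet integral in Python: the outputs are sorted in decreasing order, and each output is weighted by the increase in the FM as its decision maker joins the set of those already considered. The dictionary-based representation, the three decision makers `a`, `b`, and `c`, and all numeric values are hypothetical and chosen only for illustration.

```python
def choquet_integral(h, g):
    """Discrete Choquet integral of the outputs h with respect to the FM g.

    h: dict mapping each decision maker to its output.
    g: dict mapping frozensets of decision makers to values in [0, 1].
    The FM value of the empty set is taken to be 0.
    """
    order = sorted(h, key=h.get, reverse=True)  # largest output first
    total, previous, subset = 0.0, 0.0, set()
    for source in order:
        subset.add(source)
        current = g[frozenset(subset)]
        total += h[source] * (current - previous)
        previous = current
    return total

# Hypothetical FM on three decision makers (monotone, full set worth 1).
g = {
    frozenset("a"): 0.3, frozenset("b"): 0.4, frozenset("c"): 0.2,
    frozenset("ab"): 0.8, frozenset("ac"): 0.5, frozenset("bc"): 0.6,
    frozenset("abc"): 1.0,
}

# Hypothetical outputs of the three decision makers for one test sample.
h = {"a": 0.9, "b": 0.6, "c": 0.1}

fused = choquet_integral(h, g)
print(fused)  # approximately 0.59
```

Changing the FM changes the aggregation: if every nonempty subset is assigned the value 1, the same function returns the maximum of the outputs, which illustrates the flexibility mentioned above.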
The number of required terms of the FM grows as 2^m, so learning the FM quickly becomes an underdetermined problem, since sets of training data will rarely have the diversity to include 2^m independent observations. This manifests as a learned FM that is only partially accurate: values of the FM driven by the training data are learned accurately, but the remaining values are driven only by the constraints, and so are essentially erroneous. Thus, when testing data call on the incorrectly learned FM values, the classification accuracy will generally suffer.
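The gap between the number of FM terms and the number of terms that training data actually exercise can be illustrated with a short sketch. In the standard discrete Choquet integral, one observation uses only the nested subsets obtained by sorting its outputs, so it touches at most m FM values. The four decision makers and the three observations below are hypothetical.

```python
def fm_terms_touched(observations):
    """Collect the FM subsets that a set of observations actually uses."""
    touched = set()
    for h in observations:
        order = sorted(h, key=h.get, reverse=True)  # largest output first
        for i in range(1, len(order) + 1):
            touched.add(frozenset(order[:i]))
    return touched

m = 4
# Hypothetical training observations from four decision makers.
observations = [
    {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1},
    {"a": 0.8, "b": 0.6, "c": 0.1, "d": 0.3},
    {"a": 0.5, "b": 0.9, "c": 0.3, "d": 0.2},
]

touched = fm_terms_touched(observations)
free_terms = 2 ** m - 2  # terms left after excluding the empty and full sets
covered = len([s for s in touched if 0 < len(s) < m])
print(covered, "of", free_terms, "free FM terms are driven by data")
```

With these three observations only 5 of the 14 free terms are touched; the other 9 would be determined by the constraints alone, which is the situation described above.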
Much of the work in the following chapters addresses this problem through the use of regularization, a technique commonly used in machine learning to prevent overfitting. Doing so reduces the influence of the constraints on the learned FM and instead reassigns that influence to the regularization function; the choice of regularization function defines how the values of the FM not driven by training data are learned.
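One generic way to write such a regularized learning problem is sketched below. The squared-error data term, the training outputs \mathbf{h}_i with labels y_i, the weight \lambda, and the regularization function v(g) are all assumptions introduced here for illustration; this is not necessarily the formulation developed in later chapters. The constraints shown are the standard boundary and monotonicity properties of a fuzzy measure g on the set X of decision makers.

```latex
\min_{g} \; \sum_{i=1}^{n} \bigl( C_g(\mathbf{h}_i) - y_i \bigr)^2 + \lambda \, v(g)
\quad \text{subject to} \quad
g(\emptyset) = 0, \quad g(X) = 1, \quad
g(A) \le g(B) \;\; \text{for all } A \subseteq B \subseteq X
```

Here C_g(\mathbf{h}_i) denotes the Choquet integral of the i-th observation with respect to g. Larger values of \lambda shift influence from the data term toward v(g), which then determines the FM values that the training data leave unconstrained.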