Consider a set of numerical feature-vector data of the form $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathbb{R}^d$, where the coordinates of $\mathbf{x}_i$ provide feature values (e.g., bits per second, speed, volts) describing some object (e.g., a wireless sensor network node, traffic camera, or radar). We are also given a training label for each feature vector, such that we have the pair $(\mathbf{y}, X)$, where $\mathbf{y} = (y_1, \ldots, y_n)^T$ and $y_i$ is the label of the $i$th object; that is, each $y_i$ is associated with its respective feature vector $\mathbf{x}_i$. The classifier learning task is thus to learn a prediction function $f$ such that we can predict the label of a feature vector, i.e., $y = f(\mathbf{x})$.

Most classifiers delineate the classes by finding some "best" decision boundary in the feature space. Perceptrons and linear support vector machines (SVMs) find hyperplanes¹. These classifiers are easy to train, are often effective, and are computationally very efficient (the operational decision is just a single dot product in the feature space). However, they are ineffective for classes that are not linearly separable, i.e., that cannot be separated by a hyperplane. Hence, we will use kernel classifiers to non-linearly project the features into a high-dimensional space, where hyperplanes that serve as good decision boundaries may be more easily found.
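To make the idea concrete, the following minimal sketch trains a kernel perceptron on the XOR problem, which no hyperplane in the original two-dimensional feature space can separate. The dual-form perceptron and the degree-2 polynomial kernel are illustrative assumptions chosen for brevity; they are not the classifiers or kernels used in the experiments of this chapter, and all function names are hypothetical.

```python
# Illustrative sketch (assumed setup): a kernel perceptron on XOR.
# The kernel implicitly maps the 2-D inputs into a higher-dimensional
# space in which the two classes can be separated by a hyperplane.

def poly_kernel(x, z, degree=2):
    """Inhomogeneous polynomial kernel (1 + x.z)^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree

def train_kernel_perceptron(X, y, kernel, max_epochs=1000):
    """Return dual coefficients; alpha[i] counts the mistakes made on sample i."""
    n = len(X)
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if y[i] * score <= 0:      # misclassified (or on the boundary)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:              # every training sample is classified correctly
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    """Sign of the kernel expansion evaluated at x."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
    return 1 if score > 0 else -1

X_xor = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_xor = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(X_xor, y_xor, poly_kernel)
predictions = [predict(x, X_xor, y_xor, alpha, poly_kernel) for x in X_xor]
print(predictions)
```

The decision function depends on the data only through kernel evaluations, so the high-dimensional mapping never has to be computed explicitly.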

As its name implies, multiple kernel learning (MKL) combines multiple kernels to form a new kernel, and thus a new classification space. Furthermore, because kernels known to exploit the data's various features can be used as building blocks, MKL can perform very well on heterogeneous data. Many works discuss MKL [8, 9, 10, 11, 12, 13, 14], and nearly all of them rely on operations that aggregate kernels in ways that preserve symmetry and positive semi-definiteness, such as element-wise addition and multiplication. Most MKL algorithms learn a "best" kernel space in which to classify by learning a weight for each component kernel. Details are contained in Section 2.4.
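As a hedged illustration of such aggregation, the sketch below builds three component kernel matrices and combines them by a weighted element-wise sum and by an element-wise product, then checks numerically that both results remain symmetric and positive semi-definite. The data are random, the kernel parameters and weights are arbitrary values chosen for illustration (an MKL algorithm would learn the weights), and the helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def linear_kernel(X):
    """Linear kernel matrix with entries x_i . x_j."""
    return X @ X.T

def is_symmetric_psd(K, tol=1e-8):
    """Numerical check: symmetric and smallest eigenvalue not below -tol."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))               # assumed toy data: 20 vectors in R^3

kernels = [rbf_kernel(X, 0.1), rbf_kernel(X, 1.0), linear_kernel(X)]
weights = np.array([0.5, 0.3, 0.2])        # assumed weights: non-negative, sum to 1

K_sum = sum(w * K for w, K in zip(weights, kernels))   # weighted element-wise sum
K_prod = kernels[0] * kernels[1]                       # element-wise (Hadamard) product

print(is_symmetric_psd(K_sum), is_symmetric_psd(K_prod))
```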

Table 2.1: Acronyms and Select Notation

Acronym     Meaning
SVM         support vector machine
MKL         multiple kernel learning
FM          fuzzy measure
FI          fuzzy integral
FIGA        fuzzy integral: genetic algorithm
LCS         linear convex sum
GAMKLp      p-norm genetic algorithm MKL
DeFIMKL     decision-level fuzzy integral MKL
DeGAMKLp    p-norm decision-level genetic algorithm MKL
DeLSMKL     decision-level least squares MKL
QP          quadratic program
MKLGL       MKL group lasso
MKLGLp      MKLGL with p-norm regularization
RBF         radial basis function

Notation    Meaning
$X$         feature-vector data, $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subset \mathbb{R}^d$
$\mathbf{y}$         data labels, $\mathbf{y} = (y_1, \ldots, y_n)^T$
$f(\mathbf{x})$      prediction function
$g$         fuzzy measure
$\pi(i)$    sorting index in Choquet integral
$\phi(\mathbf{x})$   non-linear mapping of $\mathbf{x}$
$\kappa(\mathbf{x}_i, \mathbf{x}_j)$   kernel function, $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$
$K$         kernel matrix, $K = [K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)]$
$f_k(\mathbf{x})$    decision function using the $k$th kernel, $K_k$
$f_g(\mathbf{x})$    decision function using the Choquet integral with respect to FM $g$

Two MKL formulations explored in this chapter focus on aggregation using the Choquet fuzzy integral (FI) with respect to a fuzzy measure (FM) [15]. First, we investigate our previously proposed fuzzy integral: genetic algorithm (FIGA) approach to MKL [11, 12], proving that it reduces to a special kind of linear convex sum (LCS) kernel aggregation. This leads us to propose the p-norm genetic algorithm MKL (GAMKLp) approach, which learns an MKL classifier using a genetic algorithm and a generalized p-norm weight domain. These algorithms perform a feature-level aggregation of the kernel matrices, producing a new feature representation. We also propose a decision-level MKL method called DeFIMKL, which learns an FM and uses the Choquet FI with respect to that FM to fuse decisions from individual kernel classifiers. The FM is learned from training data with a regularized quadratic program (QP) approach [16]. We further explore two additional decision-level methods based on a least-squares formulation. We start with decision-level least-squares MKL (DeLSMKL), where we compute the weights for decision values from an ensemble of classifiers using a closed-form expression. We then extend this method with a nonlinear cost function and use a genetic algorithm to compute the weights in decision-level genetic algorithm MKL (DeGAMKL).
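For readers unfamiliar with the discrete Choquet FI, the following sketch shows how it could fuse the decision values of three classifiers under different FMs. The decision values and FMs are made up for illustration and are not learned by the QP approach mentioned above; the function and variable names are assumptions. The sketch uses the standard definition: sort the inputs in decreasing order and weight each one by the increase in the FM as its source joins the set of sources already visited.

```python
from itertools import combinations

def choquet_integral(h, g):
    """Discrete Choquet integral of values h with respect to fuzzy measure g.

    h: list of m values (e.g., decision values from m classifiers).
    g: dict mapping each non-empty frozenset of source indices to its measure.
    """
    # Visit sources in order of decreasing value (the sorting index pi).
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    total, g_prev, visited = 0.0, 0.0, set()
    for i in order:
        visited.add(i)
        g_cur = g[frozenset(visited)]
        total += h[i] * (g_cur - g_prev)   # weight = increase in the measure
        g_prev = g_cur
    return total

def all_nonempty_subsets(m):
    """All non-empty subsets of {0, ..., m-1} as frozensets."""
    return [frozenset(c) for r in range(1, m + 1) for c in combinations(range(m), r)]

h = [0.8, -0.2, 0.5]            # assumed decision values from three classifiers
densities = [0.5, 0.3, 0.2]     # assumed per-classifier weights, summing to 1
subsets = all_nonempty_subsets(len(h))

g_additive = {A: sum(densities[i] for i in A) for A in subsets}   # additive FM
g_max = {A: 1.0 for A in subsets}                                 # FM giving the maximum
g_min = {A: 1.0 if len(A) == len(h) else 0.0 for A in subsets}    # FM giving the minimum

fused_additive = choquet_integral(h, g_additive)
fused_max = choquet_integral(h, g_max)
fused_min = choquet_integral(h, g_min)
print(fused_additive, fused_max, fused_min)
```

With an additive FM the Choquet FI reduces to a weighted sum of the inputs, while the two extreme FMs recover the maximum and the minimum, which illustrates the range of aggregation operators a learned FM can represent.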

Since the size of a kernel matrix is directly related to the number of feature vectors in the dataset (an $n \times n$ matrix for $n$ feature vectors), large datasets lead to large kernel matrices. Thus, approximations to the kernel matrices that reduce the number of values to be stored could allow MKL methods to be used on these large datasets. We explore the use of the Nyström approximation for this task and show the effects of the approximation on classifier accuracy.
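A minimal sketch of the Nyström idea follows, assuming an RBF kernel, random data, and uniformly sampled landmark points; these choices and all names are illustrative assumptions rather than the experimental setup of Section 2.7. Instead of the full n × n kernel matrix, only an n × m block C and an m × m block W are computed, and the kernel matrix is approximated as C W⁺ Cᵀ, where W⁺ is the pseudo-inverse of W.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel between rows of A and rows of B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystrom_factors(X, landmark_idx, gamma):
    """Return C (n x m) and pinv(W) (m x m) for the chosen landmark points."""
    landmarks = X[landmark_idx]
    C = rbf_kernel(X, landmarks, gamma)          # kernel between all points and landmarks
    W = rbf_kernel(landmarks, landmarks, gamma)  # kernel among landmarks only
    return C, np.linalg.pinv(W)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))           # assumed toy data: n = 30 vectors in R^5
gamma = 0.5                            # assumed RBF width parameter
K = rbf_kernel(X, X, gamma)            # full kernel, computed only for comparison

perm = rng.permutation(len(X))         # nested landmark sets: first m entries of perm
errors = {}
for m in (5, 15, 30):
    C, W_pinv = nystrom_factors(X, perm[:m], gamma)
    K_approx = C @ W_pinv @ C.T
    # Relative Frobenius-norm error of the approximation.
    errors[m] = float(np.linalg.norm(K - K_approx) / np.linalg.norm(K))

print(errors)
```

Storing C and W⁺ requires nm + m² values rather than n², and with m = n the approximation recovers the kernel matrix up to rounding error.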

The FI-based MKL approaches are first compared with a leading machine learning MKL method, called MKL group lasso (MKLGL)² [9], on several benchmark datasets. We also investigate the effect of regularization on the results of DeFIMKL. In Section 2.2 we briefly review data fusion, and Section 2.3 introduces FMs and FIs, specifically the fuzzy Choquet integral. Section 2.4 details the MKL methods. A review of the preliminary experimental results generated in [17] is presented in Section 2.6, Section 2.7 presents the details and results of our Nyström experiments, and Section 2.8 discusses our final experiment with a large dataset. Table 2.1 contains acronyms and selected notation used in this chapter.