where P_MLE(x|Ωi) is the maximum likelihood estimate of P(x|Ωi), and Υ is a decision
threshold parameter (for details, see, e.g., Duda et al. [33]). The results of the Collins et al. study showed that the GLRT significantly outperforms the conventional threshold-based methods. The GLRT has also been used for CW EMI data by Riggs et al. [69]. The performance of the GLRT has been compared in the literature for signal space classification, i.e., using only the raw signal, and for feature space classification. For example, Tantum and Collins [104] used decay parameters as features and compared the result with that obtained by using the whole time-domain signal for classification. Based on their simulated results, they argue that using the whole signal yields a higher level of accuracy than using the features. However, Aliamiri et al. [50] made a similar comparison and came to the opposite conclusion. They attribute this to model mismatch, i.e., the presence of non-Gaussian deterministic noise within the signal, especially in the case of large targets. Yet, they show that some reasonable violations of the dipole model assumptions may be tolerated if their effect on parameter estimation is well understood. Furthermore, they state that overcoming the model mismatch problem requires developing a rich library that takes parameter variation, such as the dependency of the parameters on position and orientation, into account as completely as possible. They used kernel density estimation (KDE) with a Gaussian kernel to estimate prior PDFs for the classes. The MPT eigenvalues at four distinct frequencies were used as features. However, they observed that the eigenvalue PDFs were markedly non-Gaussian by nature. Consequently, a whitening transform was applied to normalize the shape of the data distribution prior to applying KDE. Furthermore, though the MPT eigenvalues should be orientation- and position-invariant in principle, this is not the case in reality; therefore, measurement data from several orientations and positions are needed to model the PDFs [105].
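As a rough illustration of the KDE step described above, the following Python sketch estimates a one-dimensional class PDF with a Gaussian kernel and then compares two such estimates through a likelihood ratio against a threshold Υ. It is a simplification under stated assumptions: the function names, the bandwidth h, the sample values, and the threshold are hypothetical and are not taken from the cited studies, which worked with multi-dimensional MPT eigenvalue features. The ratio shown is a plain likelihood ratio, not the generalized form with maximum likelihood parameter estimates.

```python
import math

def gaussian_kde(samples, h):
    """Return a density estimate p(x) built from 1-D training samples.

    h is the kernel bandwidth (an assumed, hand-picked value here).
    """
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each training sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density

def likelihood_ratio_decision(x, p1, p0, threshold):
    """Decide class 1 if p1(x) / p0(x) exceeds the threshold, else class 0."""
    eps = 1e-300  # guards against division by zero far from all samples
    ratio = p1(x) / max(p0(x), eps)
    return 1 if ratio > threshold else 0

# Hypothetical feature values for two classes (not measured data).
class0 = [0.9, 1.0, 1.1, 1.2]
class1 = [2.8, 3.0, 3.1, 3.3]
p0 = gaussian_kde(class0, h=0.3)
p1 = gaussian_kde(class1, h=0.3)

decision_low = likelihood_ratio_decision(1.05, p1, p0, threshold=1.0)
decision_high = likelihood_ratio_decision(3.0, p1, p0, threshold=1.0)
```

In this toy setting the two sample clouds are well separated, so a sample near either cloud is assigned to the corresponding class; with overlapping clouds the choice of bandwidth and threshold would matter far more.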
In summary, while feature-space classifiers are suboptimal in theory compared, e.g., to a signal space GLRT, they are more robust against the problems of the dipole model. Aliamiri et al. [50] argue that as long as the feature value clouds are distinct and well-defined, the feature-based methods perform well regardless of the above model mismatch.
4.4 Non-parametric discriminative methods
Parametric classification methods are problematic because they depend on the estimation of prior PDFs for each class. Moreover, parametric PDF estimation typically assumes a unimodal PDF, which makes it poorly suited to modeling multimodal PDFs. In contrast, discriminative, non-parametric feature-based methods skip prior PDF estimation and instead use the training data to estimate directly the posterior probabilities P(Ωi|x) for a given feature vector, i.e., the probability of each class Ωi given the unknown sample x.
Typically, a discriminative classifier finds a decision rule that divides the feature space into regions, each of which corresponds to a certain class Ωi. A binary classifier based on linear discriminant analysis (LDA) is one of the simplest examples of such a classifier. Based on training data, it finds a linear function that divides unknown samples into two categories. The linear function is defined by finding a weight factor wi for each feature, i.e., for each component of the feature vector x. In the case of two features, the problem can be seen as finding a line in
two-dimensional space that divides the training samples into two categories in an optimal way. The linear discriminant is calculated by
w · x ≷ Υ,
where w is a weight vector, and Υ is the decision threshold. This simple binary LDA approach may be easily applied to a multiclass case by defining a binary decision tree classifier. A binary decision tree is an intuitive and transparent way to define the classification logic of a system. Such a tree consists of a series of decision nodes, which divide the feature space hierarchically into subspaces until a conclusion is reached, i.e., a leaf of the tree. Each decision node may be defined as a binary LDA rule, though any rule that divides the feature space can be used. Moreover, the nodes need not even deal with numeric features. This makes decision trees a logical choice for problems in which similarity between feature values is difficult to define. The complexity of a node can vary from a simple linear discriminant to a multilayer neural network.
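The following Python sketch illustrates how binary linear rules can be chained into a small decision tree for a three-class problem. It assumes that the discriminant takes the form of a weighted sum w · x compared with a threshold Υ; the weights, thresholds, and class names are hypothetical values chosen for illustration and are not fitted to any data.

```python
def linear_discriminant(x, w, threshold):
    """Return True if the weighted sum w . x exceeds the threshold."""
    return sum(wi * xi for wi, xi in zip(w, x)) > threshold

def classify(x):
    """Two-node binary decision tree with a linear rule at each node.

    The weights, thresholds, and class names are hypothetical.
    """
    # Node 1: separate class "A" from the rest of the feature space.
    if not linear_discriminant(x, w=(1.0, 1.0), threshold=2.0):
        return "A"
    # Node 2: split the remaining region into classes "B" and "C".
    if linear_discriminant(x, w=(1.0, -1.0), threshold=0.0):
        return "B"
    return "C"

labels = [classify(p) for p in [(0.5, 0.5), (3.0, 1.0), (1.0, 3.0)]]
```

Because every node has exactly two exits and each exit ends in either another node or a leaf, every point of the feature space receives a label.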
Pasion and Oldenburg [73] propose a classification scheme that resembles a tree classifier. The inputs of the classifier are, as discussed in Section 4.2, the parameters ψ1, ψ2, κ1, and κ2. First, the algorithm calculates ψ = (ψ1 + ψ2)/2. If ψ > 0.8, the object is likely to be magnetic; otherwise, it is considered non-magnetic. Then, in the case of magnetic targets, if κ1/κ2 > 1 and ψ1/ψ2 < 1, the object is magnetic and rod-like. However, if κ1/κ2 < 1 and ψ1/ψ2 > 1, the object is magnetic and plate-like. In the case of non-magnetic targets, if κ1/κ2 > 1, the object is non-magnetic and plate-like. Otherwise, the object is non-magnetic and rod-like [73]. This heuristics-based method has been shown to be applicable in practice to the identification of UXO [106]. It is not a proper binary tree, however, because undefined outcomes are possible in theory: the leaves do not cover the whole feature space.
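The rules above can be written out as a short Python sketch, which also makes the uncovered regions of the feature space visible. The function name and the return format are hypothetical, and the sketch assumes that ψ is the mean of ψ1 and ψ2 and that ψ2 and κ2 are non-zero.

```python
def classify_target(psi1, psi2, kappa1, kappa2):
    """Apply the heuristic rules described in the text.

    Returns a (magnetic, shape) tuple; shape is None when no rule applies.
    Assumes psi2 and kappa2 are non-zero.
    """
    psi_mean = (psi1 + psi2) / 2.0  # assumed definition of psi
    kappa_ratio = kappa1 / kappa2
    if psi_mean > 0.8:
        psi_ratio = psi1 / psi2
        if kappa_ratio > 1 and psi_ratio < 1:
            return ("magnetic", "rod-like")
        if kappa_ratio < 1 and psi_ratio > 1:
            return ("magnetic", "plate-like")
        # Region of the feature space not covered by the rules.
        return ("magnetic", None)
    if kappa_ratio > 1:
        return ("non-magnetic", "plate-like")
    return ("non-magnetic", "rod-like")
```

For example, a magnetic target with κ1/κ2 > 1 and ψ1/ψ2 > 1 matches neither magnetic rule, so the sketch returns None for its shape.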
A support vector machine (SVM) is a linear binary classifier, and hence related to the LDA classifier. Therefore, it is suited, e.g., for distinguishing between threatening and innocuous items. An SVM requires no a priori knowledge about the underlying process that has generated the data [101]. Let xi be the feature vectors used for training and gi ∈ {−1, 1} the corresponding ground truth labels for the two classes under consideration, i.e., g = −1 for Ω0 and g = 1 for Ω1. An SVM searches for a hyperplane that satisfies gi(xi · w + sf) − 1 ≥ 0 for all training samples, where w is a vector of weight factors and sf is a scaling factor. The optimal hyperplane should maximize the margin 2/|w|. The idea is that most training samples turn out to be irrelevant during the training phase, so that their weights converge to zero. The remaining samples that get nonzero weights are called support vectors, and they essentially define the hyperplane. Moreover, to prevent overfitting, an SVM imposes a penalty for misclassifications. This is called the capacity of the machine (for details on SVMs, see, e.g., Duda et al. [33] and Fernandez et al. [95]).
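To make the constraint and the margin concrete, the following Python sketch checks candidate hyperplanes against the condition gi(xi · w + sf) − 1 ≥ 0 and computes the margin 2/|w|. It does not train an SVM: the toy data and the candidate values of w and sf are hypothetical and hand-picked, and the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def satisfies_constraints(samples, labels, w, sf):
    """Check g_i (x_i . w + sf) - 1 >= 0 for every training sample."""
    return all(g * (dot(x, w) + sf) - 1 >= 0 for x, g in zip(samples, labels))

def margin(w):
    """Width of the margin, 2 / |w|."""
    return 2.0 / math.sqrt(dot(w, w))

def support_vectors(samples, labels, w, sf, tol=1e-9):
    """Samples lying on the margin boundary, g_i (x_i . w + sf) = 1."""
    return [x for x, g in zip(samples, labels)
            if abs(g * (dot(x, w) + sf) - 1) <= tol]

# Hypothetical, linearly separable toy data.
samples = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

# Two hand-picked candidates describing the same separating line x1 = 1.
wide = ((1.0, 0.0), -1.0)    # margin 2
narrow = ((2.0, 0.0), -2.0)  # margin 1
```

Both candidates satisfy the constraints, but the first has the larger margin and is therefore the one an SVM would prefer; the samples returned by support_vectors for it are the ones that touch the margin boundary.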
This linear SVM can be transformed into a non-linear version by using a kernel function K(a, b) = Φ(a) · Φ(b), which maps the input vectors a and b into a higher-dimensional space [33]. The radial basis function (RBF), also known as the Gaussian kernel, is a commonly used kernel function. It is given by K(a, b) = exp(−|a − b|²/(2σ²)), where σ is a parameter controlling kernel width. This function essentially measures the similarity between a and b; i.e., when they are close in Euclidean space, the output is close to one, and if they are dissimilar, the output is close to zero. Therefore, the classifier will converge to the nearest neighbour classifier with small values of σ [33, 101, 107] (see Section 4.5).
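The similarity interpretation of the RBF kernel can be checked with a few lines of Python. The sketch assumes the kernel has the form exp(−|a − b|²/(2σ²)); the function name, the input vectors, and the value of σ are illustrative assumptions.

```python
import math

def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) kernel K(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical feature vectors; sigma controls the kernel width.
k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0), sigma=0.5)  # identical vectors
k_near = rbf_kernel((1.0, 2.0), (1.1, 2.0), sigma=0.5)  # close neighbours
k_far = rbf_kernel((1.0, 2.0), (4.0, 6.0), sigma=0.5)   # distant vectors
```

Identical vectors give exactly one, close neighbours give a value just below one, and distant vectors give a value that is practically zero. Decreasing σ makes the kernel fall off faster, so only the closest training samples contribute noticeably.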
To discriminate between multiple classes, the problem must be split into several binary classification problems by using multiple one-against-one (OAO) or one-against-all (OAA) SVM classifiers, or a directed acyclic graph SVM [108]. In the OAO method, the output class is usually determined by choosing the class that has the most positive outcomes out of all comparison pairs. In the OAA method, on the other hand, usually the class with