As already noted, the feature vector format is the standard format used in data mining. Using this format first requires the generation of a feature space. In some cases this is straightforward; in others it is not. In the case of the region-based representations considered in this thesis, the feature space is formed by identifying a set of features of interest across the regions. Whatever the case, for reasons of computational efficiency, it is often necessary to reduce the number of dimensions in a given feature space. A variety of techniques can be used to achieve this; these are discussed in Subsection 2.6.1. In the whole image-based representation, where the decompositions are represented as a graph, extracting a feature space is less straightforward. Techniques for identifying a feature space from a graph representation are therefore discussed in Subsection 2.6.2.
2.6.1 Feature Vector Generation for Region-based Methods
Two methods commonly used to reduce the dimensionality of a given feature space are: (i) Principal Component Analysis (PCA) and (ii) the coding-pooling framework. In PCA [69] an orthogonal linear transform is applied to the set of feature vectors, forming a new set of vectors whose axes are ordered according to the variance of the feature vectors along them. PCA is thus used to transform a feature space into a lower-dimensional space. PCA operates by first calculating the Eigenvectors and Eigenvalues of the covariance matrix of the feature vectors. The Eigenvectors are then sorted by Eigenvalue and a specific number of them are chosen; the new feature vectors are generated by projecting the original feature vectors onto the chosen Eigenvectors. The assumption is that an Eigenvector with a larger Eigenvalue is more significant, so the Eigenvectors with the largest Eigenvalues are selected to represent the image [146].
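The PCA steps described above can be illustrated with a minimal sketch. It assumes NumPy is available and uses synthetic data; the function name `pca_reduce` and the choice of k = 2 are illustrative assumptions and are not taken from [69] or [146].

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (n_samples x n_features) onto the k
    Eigenvectors of the covariance matrix with the largest Eigenvalues."""
    X_centred = X - X.mean(axis=0)            # centre each feature
    cov = np.cov(X_centred, rowvar=False)     # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest Eigenvalues
    components = eigvecs[:, order]            # chosen Eigenvectors as columns
    return X_centred @ components, eigvals[order]

# Synthetic feature vectors (assumption: 100 vectors with 5 features each).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=100)  # a nearly redundant feature

Z, top_eigvals = pca_reduce(X, k=2)
print(Z.shape)
```

In this sketch the variance of each projected coordinate equals the corresponding Eigenvalue, which is why sorting by Eigenvalue amounts to keeping the directions of greatest variance.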
In the coding-pooling framework the coding element consists of identifying a subset of vectors (the “dictionary”). The pooling element is then used to generate a single feature vector, guided by the dictionary, in which feature vectors linked to the same vector in the dictionary are combined. The coding should operate so that the selected vectors include the most representative features. There are different ways of conducting the coding; of note are: (i) Vector Quantization (VQ), (ii) Sparse Coding (SC), (iii) Locality-constrained Linear Coding (LLC), (iv) Improved Fisher Kernel Encoding (IFK) and (v) SuperVector encoding (SV).
Lazebnik et al. [82] proposed the use of Vector Quantization (VQ) for feature selection. VQ is essentially a clustering technique. In the context of region-based representation methods, K-means is applied to a random subset of regions extracted from the training images to form the dictionary. The dictionary in this case thus consists of cluster centres, typically between 200 and 400 of them. Alternatively, Sparse Coding (SC) or Locality-constrained Linear Coding (LLC) may be used to form the dictionary. SC tries to find a feature vector that best represents a group of feature vectors by measuring the “response” of the vector to the group. In the case of region-based representation methods, this group of feature vectors is extracted from the whole set of given feature vectors across all images [149]. In order to reduce the computing time, SC may be applied to a random sample of feature vectors. LLC may be used as an alternative to SC to achieve the same result [142]. Improved Fisher Kernel Encoding (IFK) [105] is another method used for generating dictionaries. Here, given a set of feature vectors, the distributions of the feature vectors are computed using a Gaussian Mixture Model (GMM), fitted by Maximum Likelihood (ML) estimation, to identify a feature vector-specific distribution. In order to distinguish between different image signatures, L2 normalisation is applied to the feature vector-specific distribution to form the “Fisher vector signature”, aimed at encapsulating class-specific information. SuperVector (SV) encoding, used in [158], is similar to IFK but uses K-means clustering in place of a GMM. The clusters are then improved by using upper bounds aimed at minimising the error, measured as the Euclidean distance between feature vectors and their means. In [62] an experiment is reported that compares the operation of different coding methods; the IFK proposed in [105] was shown to outperform the rest.
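The VQ dictionary construction described above can be sketched as follows. This is a minimal illustration only, assuming NumPy and a plain Lloyd-style K-means; the function names `build_dictionary` and `assign` are hypothetical, the descriptors are synthetic, and a dictionary of 4 centres is used purely to keep the example small (the text above cites 200 to 400 as typical). It is not the implementation used in [82].

```python
import numpy as np

def assign(descriptors, centres):
    """Index of the nearest dictionary entry (squared Euclidean distance)
    for every descriptor."""
    diffs = descriptors[:, None, :] - centres[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def build_dictionary(descriptors, n_words, n_iter=20, seed=0):
    """Plain K-means: the returned cluster centres form the dictionary."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=n_words, replace=False)
    centres = descriptors[start].copy()       # initialise from random descriptors
    for _ in range(n_iter):
        labels = assign(descriptors, centres)
        for j in range(n_words):
            members = descriptors[labels == j]
            if len(members) > 0:              # leave an empty cluster's centre unchanged
                centres[j] = members.mean(axis=0)
    return centres

# Synthetic region descriptors (assumption: 200 descriptors of dimension 8).
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(200, 8))

dictionary = build_dictionary(descriptors, n_words=4)
labels = assign(descriptors, dictionary)      # VQ code of each descriptor
print(dictionary.shape, labels[:10])
```

The hard assignment returned by `assign` is the VQ code of each descriptor; SC and LLC would replace this step with a weighted combination of several dictionary entries.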
With respect to the pooling element of the coding-pooling framework, the aim is to map each feature vector to its equivalent vector in the dictionary in order to form a single feature vector. There are two common methods for achieving this: average pooling and maximum pooling. In “average pooling”, the average values of the feature vector elements associated with the same dictionary entry are computed and then used to form a new feature vector. Following this, a long global feature vector is generated for each image by concatenating the new feature vectors. The resulting feature vector is then normalised [82]. One example of maximum pooling is Multi-scale Spatial Maximum Pooling (MSMP) [149]. MSMP recursively computes the histograms of the maximum values for a given set of vectors and their association with elements in the dictionary. Feature vectors in neighbouring regions are recursively united by taking the maximum value of each element. This process of combining feature vectors is applied until a single final feature vector is reached.
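The difference between average and maximum pooling can be illustrated with a simplified sketch. It assumes NumPy, hard-assignment (VQ) codes, a dictionary of 4 entries and two regions per image; all function names and the toy labels are illustrative assumptions, and the multi-scale recursion of MSMP [149] is deliberately omitted.

```python
import numpy as np

def one_hot_codes(labels, n_words):
    """Hard-assignment codes: one row per descriptor, one column per
    dictionary entry."""
    codes = np.zeros((len(labels), n_words))
    codes[np.arange(len(labels)), labels] = 1.0
    return codes

def average_pool(codes):
    """Mean response of each dictionary entry over all descriptors."""
    return codes.mean(axis=0)

def max_pool(codes):
    """Largest response of each dictionary entry over all descriptors."""
    return codes.max(axis=0)

def image_signature(region_labels, n_words, pool):
    """Pool each region separately, concatenate the results and
    L2-normalise the concatenated vector."""
    pooled = [pool(one_hot_codes(labels, n_words)) for labels in region_labels]
    signature = np.concatenate(pooled)
    norm = np.linalg.norm(signature)
    return signature / norm if norm > 0 else signature

# Toy dictionary assignments for two regions of one image (assumption).
region_labels = [np.array([0, 2, 2, 1, 2]), np.array([3, 3, 0])]

avg_first = average_pool(one_hot_codes(region_labels[0], 4))
max_first = max_pool(one_hot_codes(region_labels[0], 4))
sig = image_signature(region_labels, n_words=4, pool=max_pool)
print(avg_first, max_first, sig.shape)
```

With hard-assignment codes, average pooling yields a normalised histogram of dictionary usage, whereas maximum pooling records only whether each dictionary entry occurs at all in the region.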
2.6.2 Feature Vector Generation for Whole Image-based Methods
As noted above, the generation of feature vectors from tree-based representations is more challenging than in the case of region-based representation methods. One approach, used in [59] in the context of 2D retinal images and in [89] with respect to MRI brain scan data, is to first identify frequently occurring sub-graphs in the tree data using some appropriate search method. Various Frequent Sub-Graph (FSG) mining techniques can be used for this purpose. One of the most commonly used is the graph-based Substructure pattern mining (gSpan) algorithm [148]. The gSpan algorithm uses a Depth First Search (DFS) approach to identify frequent sub-graphs (sub-trees). A sub-graph is said to be frequent if its support meets a “support threshold” σ. In the case of the tree representations considered in this thesis, each identified frequent sub-tree is then conceptualised as a dimension within a feature space. Sub-graphs may be ranked according to some weighting measure, as suggested in [15], and the top K sub-graphs selected. This latter approach was adapted in [59] in the context of 2D retinal images.
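The idea of treating each frequent sub-graph as a dimension of the feature space can be illustrated with a deliberately simplified sketch. It is not gSpan: as a stand-in, the only candidate "sub-graphs" considered are single labelled edges, each tree is given as a list of (parent label, child label) pairs, and the support threshold is expressed as an absolute count. The function names and toy trees are assumptions made for illustration.

```python
from collections import Counter

def frequent_patterns(graphs, min_count):
    """Return the patterns (here: single labelled edges) that occur in at
    least min_count graphs. Each graph counts a pattern at most once."""
    counts = Counter()
    for g in graphs:
        counts.update(set(g))
    return sorted(p for p, c in counts.items() if c >= min_count)

def to_feature_vector(graph, patterns):
    """Binary feature vector: one dimension per frequent pattern."""
    present = set(graph)
    return [1 if p in present else 0 for p in patterns]

# Toy trees given as lists of (parent label, child label) edges (assumption).
graphs = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "d")],
    [("a", "b"), ("b", "c"), ("c", "e")],
]

patterns = frequent_patterns(graphs, min_count=2)
vectors = [to_feature_vector(g, patterns) for g in graphs]
print(patterns)
print(vectors)
```

A full FSG miner such as gSpan would enumerate larger connected sub-graphs as well; the mapping from the resulting frequent patterns to feature vector dimensions would follow the same principle.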