Chapter 5 Classifier Fusion Strategy
5.3 Classifiers Combining Rules—Combiners
After the base-level classifiers have been selected, the next step is to find a module that assembles them. This module is called the combiner. Combiners can be differentiated on the basis of several characteristics: trainability, adaptivity, and the type of output they require from the individual classifiers. Some combining techniques are adaptive in nature: they evaluate the decisions of the individual classifiers depending on the input. Examples include adaptive weighting [124], associative switch, mixture of local experts (MLE) [125], and hierarchical MLE.
Depending on the type of output produced by the individual classifiers, Xu et al. [126] grouped the output information into three levels: 1) measurement (or confidence), 2) rank, and 3) abstract. At the measurement level, the output of the classifier is a numerical value that indicates how likely it is that the given input belongs to a particular class. At the rank level, the classifier orders the candidate classes, and the chosen class is the one it ranks highest. However, the highest rank does not necessarily correspond to the highest confidence level. At the abstract level, the decision is normally made on the basis of a unique class label or a set of class labels. The authors add that the measurement level provides the most information about the class decision, while the abstract level provides the least.
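The relationship between the three output levels can be illustrated with a minimal Python sketch. The class names and confidence values are made up for illustration, and the helper names `to_rank_level` and `to_abstract_level` are hypothetical; the sketch only shows that each level can be derived from the one above it, losing information at every step.

```python
# Illustrative sketch: deriving rank-level and abstract-level outputs
# from a measurement-level output. All names and values are hypothetical.

def to_rank_level(scores):
    """Return the class labels ordered from most to least supported."""
    return sorted(scores, key=scores.get, reverse=True)

def to_abstract_level(scores):
    """Return only the single winning class label."""
    return max(scores, key=scores.get)

# Measurement level: one confidence value per class (made-up numbers).
measurement = {"A": 0.40, "B": 0.35, "C": 0.25}

rank = to_rank_level(measurement)       # ordering only, confidences are lost
label = to_abstract_level(measurement)  # winner only, ordering is lost

print(rank)
print(label)
```

In this example class A receives the top rank although its confidence is only 0.40, which is the kind of information that is no longer visible at the rank and abstract levels.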
A combination process consists of a set of individual (base-level) classifiers and a combining rule that merges their results into a final decision. When and how the base-level classifiers work together depends on the combination scheme. According to Anil et al. [119], combination schemes can be differentiated on the basis of their architecture, the characteristics of the combiner, and the selection of the individual classifiers.
On the basis of architecture, combining schemes are divided into three categories [119]: 1) parallel, 2) cascading (or serial combination), and 3) hierarchical (tree-like). In the parallel scheme, all the base-level classifiers are invoked independently and their results are then combined by a combiner. In the cascading scheme, the individual classifiers are invoked in a linear sequence. For the sake of efficiency, classifiers that are cheap in terms of computational time and measurement demands are invoked first, followed by the more accurate and more expensive ones. In the hierarchical architecture, the base-level classifiers are combined into a decision-tree-like structure.
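The difference between the parallel and cascading schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the two classifiers are hypothetical stand-ins that return posteriors for classes 0 and 1, the combiner is a simple mean, and the confidence threshold of 0.8 that decides when the cascade stops is an arbitrary choice.

```python
# Illustrative sketch of parallel vs. cascading invocation.
# Both "classifiers" are hypothetical stand-ins, not real models.

def cheap_classifier(x):
    # Hypothetical inexpensive model: confident only when x is far from 0.5.
    p1 = min(max(x, 0.0), 1.0)
    return [1.0 - p1, p1]

def expensive_classifier(x):
    # Hypothetical costly model that is assumed to be more accurate.
    p1 = 0.9 if x >= 0.45 else 0.1
    return [1.0 - p1, p1]

def mean_combiner(outputs):
    # Average the posteriors of all classifiers, class by class.
    return [sum(col) / len(col) for col in zip(*outputs)]

def parallel(x, classifiers, combiner):
    # Every classifier is invoked independently; the combiner merges the results.
    return combiner([clf(x) for clf in classifiers])

def cascade(x, classifiers, threshold=0.8):
    # Classifiers are tried in order (cheapest first); the sequence stops as
    # soon as one of them is confident enough. The threshold is an assumption.
    for clf in classifiers[:-1]:
        posteriors = clf(x)
        if max(posteriors) >= threshold:
            return posteriors
    return classifiers[-1](x)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

models = [cheap_classifier, expensive_classifier]
print(argmax(parallel(0.5, models, mean_combiner)))
print(argmax(cascade(0.95, models)))  # the cheap classifier alone decides
print(argmax(cascade(0.5, models)))   # falls through to the expensive classifier
```

In the cascade, the expensive classifier is only evaluated for inputs on which the cheap one is not confident, which is where the efficiency gain comes from.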
In our implementation of the classifier fusion strategy, the parallel architecture is selected because of its simplicity, lower computational time, and higher confidence level.
Once the posterior probabilities of all the classifiers have been computed, the next step is to combine them into a new set of values from which the final class is chosen by maximum selection. Robert and David [127] describe two sets of combining rules: 1) fixed combining rules and 2) trained combining rules.
5.3.1 Fixed Combining Rule
The fixed combining rules require that the classifier output is not just a number but has a clear interpretation, such as a class label, a distance, or a confidence level. Posterior probabilities are also treated as confidences. The main fixed combining rules are as follows:
Maximum: the maximum rule selects the outcome of the classifier that produces the highest estimated confidence, which makes it sensitive to noise. Selecting the classifier that is most confident in its output appears simple and reasonable. However, the rule fails if the classifiers are overtrained: the final decision is then based on overconfidence, so an overtrained classifier dominates the outcome without providing better performance [128]. In addition, the maximum rule fails for simple classifiers that are not sensitive to nuances, so better classifiers are required for detection.
Median and Mean: they both average the posterior probability estimates thereby reducing the estimation error. This works well if all the base-level classifiers estimate more or less the same quantity.
Minimum: the minimum rule selects the outcome of the classifier that has the least objection against a certain class. As with the maximum rule, it is hard to identify situations in which this rule performs best.
Product: the product rule multiplies, for each class, the posterior probabilities produced by the individual classifiers.
Majority/Voting: this rule counts the votes for each class over the input classifiers and selects the class with the most votes. For a two-class dataset, this coincides with the simple majority, normally 50% of the votes plus one.
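The fixed rules listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in this work: the function names `combine`, `decide`, and `majority_vote` are hypothetical, the posterior values are made up, and ties are resolved in favour of the lowest class index as a simplifying assumption.

```python
# Illustrative sketch of the fixed combining rules (standard library only).
from collections import Counter
from math import prod
from statistics import median

RULES = {
    "max": max,
    "min": min,
    "mean": lambda v: sum(v) / len(v),
    "median": median,
    "product": prod,
}

def combine(posteriors, rule):
    """Apply a fixed rule class by class.

    posteriors holds one list of class posteriors per classifier.
    """
    per_class = zip(*posteriors)  # regroup the values: one tuple per class
    return [RULES[rule](values) for values in per_class]

def decide(combined):
    """Maximum selection: index of the class with the largest combined value."""
    return max(range(len(combined)), key=combined.__getitem__)

def majority_vote(posteriors):
    """Each classifier votes for its own most probable class."""
    votes = [decide(p) for p in posteriors]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical classifiers, two classes (made-up posteriors).
# The third classifier is very confident and disagrees with the other two.
posteriors = [[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]]

for name in RULES:
    print(name, decide(combine(posteriors, name)))
print("vote", majority_vote(posteriors))
```

With these made-up values the maximum, minimum, mean, and product rules select class 1, while the median rule and the majority vote select class 0. The single highly confident classifier is enough to determine the outcome of the maximum rule, which illustrates the sensitivity to overconfidence described above.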
5.3.2 Trained Combining Rule
Trained combining rules, on the other hand, train an arbitrary classifier, the output classifier, on the intermediate space formed by the outputs of the base-level classifiers. The output classifier is usually trained on the same training data set as the base-level classifiers. The posterior probabilities are used directly to build the intermediate space. If the classes are not normally distributed, it is more advantageous to apply a nonlinear rescaling.
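A trained combiner can be sketched as follows. The sketch makes several assumptions that are not taken from the text: the output classifier is a nearest-mean classifier (any classifier could be used instead), the intermediate space is built by concatenating the posterior vectors of two hypothetical base classifiers, and all posterior values and labels are made up. No nonlinear rescaling is applied.

```python
# Illustrative sketch of a trained combining rule (standard library only).
# The nearest-mean output classifier is an assumed, deliberately simple choice.

def build_intermediate(posteriors):
    """Concatenate the posterior vectors of all base classifiers for one sample."""
    return [p for classifier_output in posteriors for p in classifier_output]

def train_nearest_mean(features, labels):
    """Fit one mean vector per class in the intermediate space."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, value in enumerate(f):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(means, f):
    """Assign the class whose mean is closest (squared Euclidean distance)."""
    def dist2(m):
        return sum((a - b) ** 2 for a, b in zip(f, m))
    return min(means, key=lambda y: dist2(means[y]))

# Made-up training outputs of two base classifiers for four samples.
# The second classifier systematically favours the wrong class.
train_outputs = [
    ([0.8, 0.2], [0.3, 0.7]),
    ([0.7, 0.3], [0.4, 0.6]),
    ([0.3, 0.7], [0.8, 0.2]),
    ([0.2, 0.8], [0.9, 0.1]),
]
train_labels = [0, 0, 1, 1]

features = [build_intermediate(o) for o in train_outputs]
means = train_nearest_mean(features, train_labels)

new_sample = build_intermediate(([0.6, 0.4], [0.2, 0.8]))
print(predict(means, new_sample))
```

For the new sample, averaging the two posterior vectors would favour class 1, whereas the trained combiner assigns class 0 because it has learned the systematic bias of the second classifier from the training outputs.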
5.3.3 Fixed vs. Trained Combining Rules
This section briefly describes the advantages and disadvantages of fixed and trained combining rules:
Fixed rules are simple to use and require no training of the combiner.
Fixed rules require little memory and computational time, while trained combining rules require more of both.
Fixed rules are suitable when the base-level classifiers make independent or weakly correlated errors and exhibit similar performance. In contrast, trained rules are suitable for classifiers that are correlated or exhibit different performance.
Trained rules are more flexible than fixed rules and, most of the time, also perform better.