3.7 Supervised classification: Comparison and conclusion

The simplest approach to selecting an appropriate algorithm is to estimate the accuracy of each candidate algorithm on the problem at hand. Combining two or more algorithms into a hybrid algorithm can increase accuracy. Table 3.5 compares the algorithms presented in this chapter.
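As a minimal sketch of this selection idea, the snippet below estimates accuracy on held-out labels and combines several classifiers by majority vote. The labels, the three prediction lists and the choice of majority voting as the combination rule are all illustrative assumptions, not a method taken from this chapter.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_vote(*prediction_lists):
    """Combine several classifiers by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*prediction_lists)]


# Hypothetical held-out labels and outputs of three hypothetical classifiers.
y_true = ["a", "b", "a", "b", "a", "b"]
clf_1 = ["a", "b", "b", "b", "a", "a"]
clf_2 = ["a", "a", "a", "b", "a", "b"]
clf_3 = ["b", "b", "a", "b", "a", "b"]

hybrid = majority_vote(clf_1, clf_2, clf_3)

for name, preds in [("clf_1", clf_1), ("clf_2", clf_2), ("clf_3", clf_3), ("hybrid", hybrid)]:
    print(name, round(accuracy(preds, y_true), 3))
```

In this toy case the vote corrects each classifier's isolated mistakes, because the three classifiers err on different samples. A hybrid gains little when its members make the same errors.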

The study of supervised classification techniques has shown that the k-nearest neighbour technique is a computationally demanding classifier. It performs well, and can outperform other classifiers, only when the data sets are small. Its drawback is that it requires extensive memory and becomes computationally impractical when large data sets are used. The Decision Trees classifier offers better performance on large data sets than the k-nearest neighbour classifier. It uses tree-structured graphs of internal and external nodes to classify the data. Its drawback is that it suffers from over-fitting of the training data, meaning it wastes time growing the tree beyond what is needed to generalise. Some techniques have been proposed to reduce the over-fitting problem, such as pre-pruning and post-pruning methods. Artificial neural networks are mathematical models that are trained to predict specific behaviour and to remember that behaviour in the future, much as a human brain does.
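The memory and computation cost of k-nearest neighbour can be seen in a minimal sketch such as the one below. The function name `knn_predict`, the toy points and the use of Euclidean distance are assumptions made for illustration only.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Every stored training point is visited for every query. This is the
    # memory and computation cost that grows with the size of the data set.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-class training set.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["low", "low", "low", "high", "high", "high"]

prediction_near_origin = knn_predict(train_points, train_labels, (1.1, 1.0))
prediction_far = knn_predict(train_points, train_labels, (5.0, 4.8))
print(prediction_near_origin, prediction_far)
```

There is no training step beyond storing the data, which is why the method is described as lazy. All of the work is deferred to prediction time.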

The Naïve Bayes classifier was compared to other supervised classifiers and it was determined that its performance is better than that of Decision Trees as well as neural network classifiers. When applied to large databases it also offers high accuracy and speed. It is a simple technique to implement that obtains good results in most cases. SVMs are classifiers that rely on kernel functions to handle linearly inseparable cases when classifying training data into classes. Two strategies, One-Against-One and One-Against-All, are the most commonly used for multiclass SVM problems.
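The difference between the two multiclass strategies can be illustrated by listing the binary sub-problems each one creates. The sketch below only enumerates those sub-problems and trains no SVM; the class names and function names are hypothetical.

```python
from itertools import combinations


def one_against_one_tasks(classes):
    """One binary problem per unordered pair of classes: k * (k - 1) / 2 in total."""
    return list(combinations(classes, 2))


def one_against_all_tasks(classes):
    """One binary problem per class, against all remaining classes: k in total."""
    return [(c, [other for other in classes if other != c]) for c in classes]


# Hypothetical class labels.
classes = ["water", "forest", "urban", "soil"]

ovo = one_against_one_tasks(classes)
ova = one_against_all_tasks(classes)

print(len(ovo), "One-Against-One problems, e.g.", ovo[0])
print(len(ova), "One-Against-All problems, e.g.", ova[0])
```

One-Against-One trains more classifiers, each on the data of only two classes. One-Against-All trains fewer classifiers, each on the full data set.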

Table 3.5: Comparison of supervised classification techniques

Technique: K-Nearest Neighbour

Advantages:
• Simple and easy to learn.
• Training is very fast.
• Robust to noisy training data.
• Optimal for a large number of samples.
• Outperforms support vector machine, Naïve Bayes and Decision Trees.

Disadvantages:
• Biased by the value of K.
• Computationally complex.
• Large memory storage is required, as all results need to be stored until the algorithm completes the classification.
• A supervised lazy-learning algorithm.
• Easily fooled by irrelevant attributes.

Technique: Decision Trees

Advantages:
• Performance does not degrade excessively when a smaller number of features is used at each internal node.
• Provides a high level of accurate results for large data sets.

Disadvantages:
• Overlap when the number of classes is large.
• For large tree levels, there is error accumulation.
• Optimal Decision Trees tend to be hard to design.

Technique: Artificial Neural Networks

Advantages:
• They learn to recognize the pattern which exists in the data set.
• Fault tolerance: built-in redundancy, or the capability to withstand component failures without crashing.
• Associative recall: the ability to retrieve information instantaneously based on content and to make an intelligent guess if there is no exact match for the required information.

Disadvantages:
• They are unable to explain the model. They are built in a useful way and get better results, but it is difficult to explain how the results were obtained.
• The analyst has to spend time understanding the problem and the outcomes that will be predicted. If the data is not a good representation of the problem, the neural network may not produce good results.
• Time consuming, because the system learns from an analyst who specifies the behaviour of the model.

Technique: Naïve Bayes classifier

Advantages:
• Easy to learn and implement.
• It mostly provides accurate results.

Disadvantages:
• The class-conditional independence assumption it relies on often does not hold.
• The Naïve classifier is not reliable in modelling.

Technique: Support Vector Machine

Advantages:
• Produces accurate results.
• Less overfitting and robust to noise.
• When the data contain non-regularities, the SVM is a useful tool for insolvency analysis.

Disadvantages:
• SVMs are expensive and run slowly.
• The inability of SVM to deal with non-


CHAPTER 4

UNSUPERVISED CLASSIFICATION

4.1 Introduction

Clustering, also called unsupervised classification, is the partitioning of a data set into clusters. For an object to belong to a cluster it must share the same characteristics as the other objects in that cluster, while dissimilar objects belong to their own separate clusters (Rai & Singh, 2010:1).

The similarity or dissimilarity between objects is computed using distance measures, such as the Euclidean and Manhattan distances, and the resulting pairwise values can be collected in a proximity matrix. Methods of clustering include hierarchical and partitional clustering methods (Gupta, 2011). As shown in Figure 4.1, within each category a number of sub-categories exist, which are discussed within this chapter.
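As an illustration of the distance measures named above, the following sketch computes the Euclidean and Manhattan distances and builds a proximity matrix of pairwise distances. The function names and sample points are assumptions chosen for the example.

```python
def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def proximity_matrix(points, distance):
    """Square matrix whose entry [i][j] is the distance between points i and j."""
    return [[distance(p, q) for q in points] for p in points]


# Hypothetical two-dimensional objects.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

euclidean_matrix = proximity_matrix(points, euclidean)
manhattan_matrix = proximity_matrix(points, manhattan)

for row in euclidean_matrix:
    print(row)
```

The matrix is symmetric with zeros on its diagonal, so a clustering algorithm only needs the entries above the diagonal.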

Figure 4.1: The categories of unsupervised classification methods (Adapted from Gupta, 2011)

4.1.1 Hierarchical Method

According to Aggarwal and Reddy (2014), hierarchical clustering produces a tree, also called a dendrogram, that shows how the clusters are nested. Once the dendrogram is constructed, the right number of clusters can be chosen by splitting the tree at different levels to get different clustering solutions for the same data, without running the clustering algorithm again. Hierarchical methods were created to overcome the drawbacks associated with partitional clustering methods. Two categories of hierarchical clustering, namely agglomerative and divisive clustering techniques, are identified. The agglomerative method merges clusters starting from those generated at the bottom levels, whereas the divisive technique separates clusters into smaller clusters. Although the divisive method splits and the agglomerative method merges clusters, both methods utilize the dendrogram when clustering the data, and they differ only in the criterion used when clustering the data.
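The agglomerative procedure, and the idea of cutting the tree at different levels without re-running the algorithm, can be sketched as follows. This is a simplified illustration: single linkage (the closest pair of members) is assumed as the merge criterion, the dendrogram is represented only as a list of successive clusterings, and the function names and points are hypothetical.

```python
import math


def agglomerative(points):
    """Single-linkage agglomerative clustering.

    Returns a list of clusterings, one per level: levels[0] puts every point
    in its own cluster and levels[-1] holds a single cluster of all points.
    """
    clusters = [[i] for i in range(len(points))]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        levels.append([sorted(c) for c in clusters])
    return levels


def cut(levels, n_clusters):
    """Read off the level that has `n_clusters` clusters, without re-clustering."""
    return levels[len(levels[0]) - n_clusters]


# Hypothetical points forming two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
levels = agglomerative(points)

three_clusters = sorted(cut(levels, 3))
two_clusters = sorted(cut(levels, 2))
print(three_clusters)
print(two_clusters)
```

Both solutions are read from the same merge history, which is the property that distinguishes hierarchical methods from partitional ones, where the number of clusters must be fixed before the algorithm runs.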

[Figure 4.1 content: Unsupervised classification methods. Hierarchical: Agglomerative, Divisive. Partition: k-Means, Isodata, Expectation Maximization.]
