3.3 Data Mining
3.3.2. DM modelling approaches
3.3.3.4. Apriori
Apriori is an algorithm commonly utilised to find associations and relationships among sets of data. It was introduced by Agrawal and Srikant (1994) and takes an iterative approach, making several passes over the database to detect frequent item-sets. The algorithm utilises the two thresholds described above, support and confidence, to generate the desired rules: it counts the support and confidence of candidate rules and retains only those that meet both constraints (Lai and Cerpa, 2001). The Apriori algorithm separates association rule mining into two main stages (Vishwakarma, 2013). First, it generates the frequent item-sets that meet the predefined minimum support value; second, it extracts only the association rules that meet the specified minimum confidence value. Frequent item-sets are detected with a level-wise approach. The algorithm first scans the database for candidate item-sets of size one. Any candidate whose support is below the minimum support value is rejected, and the remaining candidates become the frequent item-sets of size one. The algorithm then searches for larger frequent item-sets. It generates new candidate item-sets, whose size is that of the previous frequent item-sets plus one, and scans the database again to check the support value of each candidate. As in the previous step, only candidates that meet the minimum support value are added to the new frequent item-sets, which are larger by one item. The scanning continues until no more frequent item-sets are generated.
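The two stages described above can be illustrated with a minimal Python sketch. It is not the implementation of Agrawal and Srikant (1994): the function names, the toy transactions and the threshold values are assumptions chosen for illustration, and a candidate is treated as frequent when its support is greater than or equal to the minimum support.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Stage 1: return {item-set: support} for every frequent item-set."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the item-set.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: candidate item-sets of size one.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        previous = list(current)
        # Join frequent (k-1)-item-sets into candidates of size k.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Scan the data again to count the support of each candidate.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Stage 2: keep rules whose confidence meets the threshold."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                antecedent = frozenset(antecedent)
                # confidence(A -> B) = support(A and B) / support(A)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical toy transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

In this toy example the pair {milk, butter} falls below the minimum support, so the three-item candidate is pruned without a further scan, which is the effect the level-wise approach relies on.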
The Apriori algorithm has been frequently applied in healthcare. An example of such a utilisation is documented by Nahar et al. (2013) who used this algorithm to discover factors that contribute to heart conditions in males and females. In another example, Ilayaraja (2013) utilised the Apriori algorithm to find frequently occurring diseases in multiple locations and times to facilitate medical decision-making for healthcare staff.
The advantages and limitations of the Apriori algorithm have been investigated in the literature. On the positive side, Apriori is considered the most popular algorithm for centralised databases (Manoj and Rajni, 2016). In addition, it has influenced most of the algorithms subsequently proposed for generating association rules. It is a simple approach, which can be incorporated in many DM software tools. Moreover, the time spent scanning the database can be reduced through efficiency improvements such as hash tables, partitioning and transaction reduction (Agrawal and Srikant, 1994; Ji et al., 2011). On the negative side, Apriori produces a very large number of candidate item-sets when the item-sets are large or when the minimum support value is very low (Vishwakarma, 2013; Wu et al., 2008). As the algorithm requires multiple scans of the database to count the support of each candidate, the process is considered tedious and consumes considerable processing time and memory space (Vishwakarma, 2013; Kumar and Rukmani, 2010). According to Witten et al. (2011), this process is costly, especially with a large amount of data.
Improvements have been proposed to overcome these issues. Han et al. (2000) proposed a new algorithm, FP-Growth, which avoids the costly candidate generation of Apriori when mining frequent patterns. The FP-Growth algorithm works in two stages. Firstly, it constructs a compressed representation of the database, called an FP-Tree, which holds its frequent items; secondly, it extracts the frequent item-sets, from which the association rules are derived, directly from the previously constructed tree. FP-Growth scans the database only twice and requires less searching time and space, as it does not generate a large number of candidate item-sets. However, the tree itself requires considerable space when dealing with large databases (Vishwakarma, 2013).
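The first stage of FP-Growth, the construction of the tree in two scans, can be sketched as follows. This is a simplified illustration and not the implementation of Han et al. (2000): the class and function names, the tie-breaking rule for equally frequent items and the toy transactions are assumptions, and the second stage (mining the tree) is omitted.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item and the number of transactions sharing this prefix."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support_count):
    # First scan: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i: c for i, c in counts.items() if c >= min_support_count}

    root = FPNode(None)
    # Second scan: insert each transaction, keeping only frequent items,
    # ordered by descending frequency (ties broken alphabetically, an assumption).
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent_items),
            key=lambda i: (-frequent_items[i], i),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1  # shared prefixes are stored once and counted
    return root, frequent_items


# Hypothetical toy transactions, for illustration only.
fp_transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
fp_root, fp_items = build_fp_tree(fp_transactions, min_support_count=2)
b_node = fp_root.children["b"]
print(b_node.count, sorted(b_node.children))
```

Because all four toy transactions share the prefix "b", they are compressed into a single branch whose node counts record how often each prefix occurs.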
3.3.3.5. Expectation-Maximisation
The Expectation-Maximisation (EM) algorithm is a simple mathematical approach to clustering continuous data through normal mixture models. EM also provides a flexible way to estimate the underlying density functions. The algorithm is an iterative approach, which aims to fit finite mixture distributions to observed data from a random phenomenon (Wu et al., 2008). This approach to modelling has been used widely to cluster datasets on different occasions (McLachlan and Peel, 2004). The EM algorithm is one of the preferred approaches for maximum likelihood estimation, and is commonly used in many areas, such as system identification, image processing, pattern recognition and other statistical applications (Couvreur, 1997). The EM algorithm has many advantages. The algorithm is simple and its implementation is easy. In general, it performs best when the dimensionality of the data is not too large. However, it requires many iterations, and some of its steps may perform slowly when the dimensionality is high (Singh, 2005). Roy et al. (2014) stated that the EM algorithm is limited, in that it only clusters convex data and needs to know the number of clusters in order to perform well.
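A minimal sketch of EM for a one-dimensional mixture of normal distributions is given below. The function names, the toy data, the starting values and the fixed number of iterations are assumptions made for illustration; the number of components is set in advance through the starting values, in line with the limitation noted by Roy et al. (2014).

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, init_means, init_vars, init_weights, iterations=50):
    """Fit a 1-D normal mixture by alternating E-steps and M-steps."""
    means, variances, weights = list(init_means), list(init_vars), list(init_weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted observations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return means, variances, weights


# Hypothetical toy data with two visible groups, for illustration only.
em_data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
em_means, em_vars, em_weights = em_gmm_1d(
    em_data, init_means=[0.0, 6.0], init_vars=[1.0, 1.0], init_weights=[0.5, 0.5]
)
print([round(m, 2) for m in em_means], [round(w, 2) for w in em_weights])
```

Each observation can then be assigned to the component with the highest responsibility, which is how the fitted mixture is used for clustering.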
Healthcare is one of the domains that has utilised the EM algorithm. For example, the EM algorithm was utilised to infer the emergency department throughput model (Wang et al., 2013). In this study, the algorithm helped to estimate the parameters of the model and the system noise in the emergency department with the utilisation of real-world time-series data.
3.3.3.6. PageRank
The PageRank algorithm is a statistical method utilised to rank web pages. The algorithm was introduced by Brin and Page in 1998. It interprets a hyperlink from one web page to another as a vote for the page that is pointed to. Under PageRank, a web page becomes more significant when it receives votes from more important web pages. The algorithm scores a web page by considering the number of hyperlinks that point to that particular web page from other web pages. In addition, PageRank considers the importance of each web page that points to that particular web page. That is, a web page that is pointed to by other important web pages receives a higher score, and therefore becomes more important than web pages that are pointed to by less important web pages (Wasserman and Faust, 1994).
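The voting idea can be illustrated with a short power-iteration sketch. The damping factor of 0.85, the even redistribution of the score of pages without outgoing links, the function name and the four-page toy graph are assumptions introduced for illustration and are not taken from the discussion above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to,
                # so votes from highly ranked pages carry more weight.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph, for illustration only.
web_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C is pointed to by three pages and receives the highest score, while page D, which no page points to, keeps only the baseline score.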
PageRank remains a simple approach for ranking web pages. However, it has some limitations, especially when compared with other algorithms. Its results are less relevant because it ranks web pages at indexing time (Sharma and Sharma, 2010): the scores are based on the time at which the web pages were indexed and not on the time of the query. Moreover, the quality of its results is considered medium, and many other web ranking algorithms are reported to achieve higher quality. In relation to the algorithm's utilisation in healthcare, PageRank is mainly concerned with ranking web pages, and its implementation in the medical domain is scarcely documented in the literature.
3.3.3.7. AdaBoost
The AdaBoost algorithm is an important ensemble learning method proposed by Freund and Schapire in 1995. The algorithm constructs a classifier as a linear combination of simpler classifiers. It combines different weak hypotheses into one strong hypothesis that has a very low error rate (Mukherjee et al., 2011). AdaBoost was the first practical boosting algorithm and has been utilised in many domains (Schapire, 2013). One successful example is the application of AdaBoost with a cascade process for face detection (Viola and Jones, 2004).
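The combination of weak hypotheses into a strong one can be sketched with one-dimensional threshold rules (decision stumps) as the weak hypotheses. The choice of stumps, the function names and the toy data are assumptions made for illustration, and the sketch is not the original formulation of Freund and Schapire.

```python
import math


def train_adaboost(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weak hypotheses."""
    n = len(xs)
    weights = [1.0 / n] * n  # every training example starts equally important
    thresholds = sorted(set(xs))
    ensemble = []
    for _ in range(rounds):
        # Choose the stump with the lowest weighted error.
        best = None
        for thr in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x < thr else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, polarity, preds)
        err, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak hypothesis
        ensemble.append((alpha, thr, polarity))
        # Increase the weight of misclassified examples, then renormalise.
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    """Sign of the weighted (linear) combination of the weak hypotheses."""
    score = sum(a * (pol if x < thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can separate.
ada_xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ada_ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_stump = train_adaboost(ada_xs, ada_ys, rounds=1)
three_stumps = train_adaboost(ada_xs, ada_ys, rounds=3)
errors_one = sum(predict_adaboost(one_stump, x) != y for x, y in zip(ada_xs, ada_ys))
errors_three = sum(predict_adaboost(three_stumps, x) != y for x, y in zip(ada_xs, ada_ys))
print(errors_one, errors_three)
```

On this toy data a single stump misclassifies three training examples, whereas the weighted combination of three stumps classifies all of them correctly.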
The algorithm has been implemented in the healthcare domain in many studies. An example of such a utilisation is the study by Madabhushi et al. (2006), who utilised AdaBoost to combine image features to distinguish between lesion and shadowing. The experiment helped to improve the use of breast ultrasound for breast diseases.
3.3.3.8. k-Nearest Neighbour
k-Nearest Neighbour (kNN) is a classification algorithm that discovers the group of objects in the training set that is closest to the test object (Wu et al., 2008). The algorithm classifies new objects on the basis of previously known objects, using a voting system (Silver et al., 2001).
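The voting system can be sketched in a few lines. The Euclidean distance, the value k = 3, the function name and the toy training set are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """training is a list of (point, label) pairs; query is a point."""
    # Find the k training objects closest to the query (Euclidean distance).
    neighbours = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    # Each neighbour votes for its own class; the majority class wins.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two classes, for illustration only.
knn_training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_classify(knn_training, (1.1, 1.0)))
print(knn_classify(knn_training, (4.9, 5.0)))
```

The sketch keeps the whole training set in memory and compares the query with every stored object, so there is no separate model-building step and all of the computation takes place when a new object is classified.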
Besides being widely utilised, kNN is also easy to implement. The algorithm relies on a training set of data to perform correctly, and this initial training process is completed very quickly (Tomar and Agarwal, 2013). However, some limitations of kNN's performance have been noted. Firstly, the algorithm requires considerable storage space. Secondly, kNN is very sensitive to noisy data. In addition, the testing process of this algorithm performs slowly.
In healthcare, kNN has been utilised by many scholars and for different medical purposes. The algorithm was implemented to diagnose skin disease and heart disease (Cataloluk and Kesler, 2012; Shouman et al., 2012). In addition, the algorithm was utilised to classify chronic diseases to help build an early warning system. kNN helped to analyse the association between cardiovascular disease and hypertension and the risk factors of many other diseases (Jen et al., 2012).
3.3.3.9. Naïve Bayes
Naïve Bayes is a classification algorithm that helps to classify new objects into a class (Wu et al., 2008). The algorithm derives a rule for classifying new objects from objects that have already been classified. Naïve Bayes is therefore a supervised classification method that classifies new objects based on the vectors of variables of previously classified data.
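A minimal sketch of such a classification rule for categorical variables is shown below. The add-one (Laplace) smoothing, the function names and the toy data are assumptions introduced for illustration; the toy records are invented and carry no clinical meaning.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and per-class variable values in a single pass."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, variable index) -> value counts
    variable_values = defaultdict(set)   # variable index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            variable_values[i].add(value)
    return class_counts, value_counts, variable_values


def predict_naive_bayes(model, row):
    """Return the class with the highest posterior score for the given row."""
    class_counts, value_counts, variable_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Log prior plus the sum of log likelihoods, treating the variables
        # as independent of each other given the class.
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(label, i)][value] + 1  # add-one smoothing
            denominator = count + len(variable_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (temperature, cough) with a known class.
nb_rows = [
    ("high", "yes"), ("high", "yes"), ("normal", "no"),
    ("normal", "yes"), ("high", "no"), ("normal", "no"),
]
nb_labels = ["sick", "sick", "healthy", "healthy", "sick", "healthy"]
nb_model = train_naive_bayes(nb_rows, nb_labels)
print(predict_naive_bayes(nb_model, ("high", "yes")))
print(predict_naive_bayes(nb_model, ("normal", "no")))
```

Training consists only of counting, so no iterative estimation is involved; the product of per-variable probabilities is where the independence assumption enters.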
Naïve Bayes has several advantages. It is a very simple and easy classification model. According to Jothi et al. (2015), Naïve Bayes makes the computation process very easy. Moreover, the algorithm does not need an iterative parameter estimation scheme (Wu et al., 2008). Another important advantage is that Naïve Bayes performs quickly and accurately, even with large datasets. On the other hand, Naïve Bayes has a limitation, as it assumes that all attributes are independent of each other (Tomar and Agarwal, 2013). This limitation can cause serious deficiencies, especially if the algorithm is utilised in the healthcare domain.
Naïve Bayes has been utilised in healthcare as well as in other domains. For example, Bakar et al. (2011) utilised the Naïve Bayes algorithm to detect dengue disease. The experiment also utilised other classifiers, including the decision tree and rough set classifier, and helped in the early and better detection of the disease.