Chapter 2: Background and Related Research
2.3 Data Mining-Based Disease Prediction
2.3.2 Collaborative filtering
Collaborative filtering is a set of data mining techniques used to analyse the collective behaviour of users and to identify patterns in that behaviour (Terveen and Hill, 2001). This broad idea has been adapted for application in many different fields of science. In the present research, collaborative filtering refers to predicting a user’s choice by analysing the choices (or collaboration) of similar users. It is often used in recommender systems, which learn from users’ past choices of items (Adomavicius and Tuzhilin, 2005). The basic proposition is that, if a certain user group habitually chooses a set of items and another user chooses an item from that set, that user is likely to prefer the other items in the set as well. The
preference information might be explicit (e.g., user ratings, likes on posts) or implicit (e.g., browsing history, bounce rate). As part of a recommender system, collaborative filtering is used extensively to track a user’s activity or choices on various online sites and to provide a more personalised experience, including suggestions and advertisements based on the user’s choice profile. For example, collaborative filtering-based recommender systems are used to suggest items on e-commerce sites such as Amazon (Linden, et al., 2003), movies on Netflix (Bell and Koren, 2007; Bennett and Lanning, 2007) or news (Billsus, et al., 2002), and even to find compatible matches in online dating services (Brozovsky and Petricek, 2007).
The methods of implementing collaborative filtering vary widely in the literature. The approach to finding similarity can be loosely categorised as memory-based, model-based, or a hybrid of the two (Das, et al., 2007). In memory-based collaborative filtering, all user preferences for items are loaded or mapped into memory, and their similarities with a test user’s preferences are then computed. The test user’s preferences are predicted from the items that have the highest similarity weight. Alternatively, in many cases, only the preferences of a limited number of ‘nearest neighbours’ are used (Davis, et al., 2010); these neighbours are found by computing the similarity between two users and iterating the process over all users. The two most common similarity measures are Pearson correlation (Resnick, et al., 1994) and vector similarity (Breese, et al., 1998). Memory-based collaborative filtering is relatively straightforward and intuitive, and it normally performs well. However, loading an entire user preference database for similarity calculation is often resource-intensive, especially when the data is sparse, as is the case for most online items. In addition, because the item database or vector space is tightly structured, introducing new items is often difficult, as it requires the structure to be reorganised.
Model-based collaborative filtering methods generally utilise machine learning and data mining algorithms. A wide variety of algorithms exists for implementing model-based filtering, and they vary considerably in performance and prediction accuracy (Si and Jin, 2003). Different clustering algorithms are often used in model-based
collaborative filtering. Here, users are clustered into different classes using the training dataset of user-item preferences. The active user is then classified using the same algorithm to discover which cluster he or she belongs to, and choices are predicted from that cluster. Bayesian clustering and Bayesian network models (Breese, et al., 1998; Dempster, et al., 1977) are often used as the clustering algorithm. Another model-based collaborative filtering method is k-means clustering (Xue, et al., 2005). Here, the users are grouped into k clusters or classes by vector quantisation, such that each user belongs to the cluster whose mean is nearest. However, finding the clustering that minimises the total within-cluster distance to the cluster means is computationally expensive (NP-hard in general). Therefore, iterative heuristic algorithms (Kanungo, et al., 2002) are often used to find an approximate solution. Besides these, the Markov decision process (Su and Khoshgoftaar, 2009), an extension of state-based Markov chain modelling, is also implemented in some collaborative filtering methods.
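The cluster-then-predict idea described above can be illustrated with a minimal sketch. It assumes binary user-item preference vectors, the common iterative heuristic (assign each user to the nearest mean, then recompute the means), and a deterministic initialisation from the first k users; the function names and the toy data are hypothetical and are not taken from any of the cited works.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two preference vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, max_iterations=100):
    """Iterative heuristic for k-means; returns (centroids, assignment).

    Assumption: the first k points are used as initial centroids so that
    the result is reproducible. Practical systems usually randomise this.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, assignment


def recommend(active_user, centroids):
    """Rank items the active user has not chosen by the mean preference
    of the cluster the user falls into (highest first)."""
    centroid = centroids[nearest(active_user, centroids)]
    unseen = [i for i, chosen in enumerate(active_user) if chosen == 0]
    return sorted(unseen, key=lambda i: -centroid[i])


# Hypothetical training data: rows are users, columns are items 0..3.
training_users = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
]
centroids, assignment = k_means(training_users, k=2)
active_user = [1, 0, 0, 0]  # has so far chosen only item 0
ranking = recommend(active_user, centroids)
print(assignment, ranking)
```

In this toy dataset the active user falls into the cluster of users who mostly chose items 0 and 1, so item 1 is ranked first. The heuristic only guarantees a local optimum, which is the approximation referred to above.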
Unlike memory-based methods, model-based algorithms are usually scalable and faster, as they can efficiently handle the sparsity of the data when the dataset is very large. However, constructing the model is often expensive, which can make it difficult to introduce new data into the training dataset. Several algorithms have been proposed to increase performance by reducing complexity. For example, principal component analysis (Kim and Yum, 2005) has been used to reduce the number of parameters; fuzzy clustering (Honda, et al., 2001) has been used to approximate the clustering process and missing values; and singular value decomposition has been used to reduce dimensionality (Sarwar, et al., 2000). Again, this introduces a trade-off between prediction accuracy and scalability: as the complexity of the model or data is reduced, scalability and speed increase at the cost of prediction accuracy.
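The trade-off between reduced complexity and accuracy can be made concrete with a small sketch. It keeps only the single largest singular component of a rating matrix, computed here by power iteration, and measures how much of the original matrix is lost. This is an illustration of the general idea of truncated singular value decomposition under stated assumptions (non-negative ratings, one retained component); it is not the procedure of Sarwar et al., and all names and data are hypothetical.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Largest singular value with its left and right singular vectors,
    found by power iteration on A^T A.

    Assumption: ratings are non-negative, so the uniform start vector is
    not orthogonal to the leading right singular vector.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # u = A v, then w = A^T u; normalising w gives the next v.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols))
             for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols))
         for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return u, sigma, v


def rank_one_approximation(matrix):
    """Reconstruct the matrix from its leading singular component only."""
    u, sigma, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


def frobenius_error(a, b):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


# Hypothetical ratings: two groups of users with opposite tastes.
ratings = [
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
]
# A matrix whose rows are multiples of one another (exactly rank one).
proportional = [[1, 2], [2, 4], [3, 6]]

loss_ratings = frobenius_error(ratings, rank_one_approximation(ratings))
loss_proportional = frobenius_error(proportional,
                                    rank_one_approximation(proportional))
print(loss_ratings, loss_proportional)
```

Each user is reduced from four ratings to a single latent coordinate. For the proportional matrix nothing is lost, whereas for the two-group rating matrix the single component cannot represent the opposing tastes, so the reconstruction error is substantial.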
Collaborative filtering methods have great potential in predicting diseases because diseases are often comorbid; that is, many diseases or symptoms tend to occur together in the same patient. However, only limited research has been done in this particular field (Hassan and Syed, 2010). Davis et al. presented ICARE (Davis, et al., 2008; Davis, et al.,
2010), a method based on clustering and collaborative filtering to predict risk for individual patients based on their own medical history and that of other similar patients. The ICARE system uses vector similarity as a collaborative filtering method, where ‘users’ are patients and ‘items’ are diseases. This method is extended by using the inverse frequency of diseases to capture the effect that having a rare disease in common has more impact than having a trivial disease in common. While this method incorporates several features, it still does not consider the sequence or timing of diseases, which is very important in disease behaviour. For example, the risk of disease C might be higher if the patient has disease A and then disease B in sequence. However, if the patient develops disease B before A, the chances of getting C might be lower. Vector similarity does not capture this sequence. Further, gaps in time between occurrences of diseases are also important. These longitudinal aspects of disease occurrence are not captured in the ICARE system. While our research method does not directly employ collaborative filtering, our approach implicitly captures its essence. Further, we have taken into account the effects of sequence and time gaps between occurrences of diseases.
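The limitation discussed above can be demonstrated with a minimal sketch of vector similarity weighted by inverse disease frequency. It is not the ICARE implementation: the weight log(n / n_d), where n is the number of patients and n_d the number of patients with disease d, is assumed here by analogy with the inverse user frequency used in collaborative filtering, and the disease labels and patient histories are hypothetical.

```python
import math


def inverse_frequency_weights(histories):
    """Weight each disease by log(n / n_d): rare diseases weigh more."""
    n = len(histories)
    counts = {}
    for history in histories.values():
        for disease in set(history):
            counts[disease] = counts.get(disease, 0) + 1
    return {disease: math.log(n / count) for disease, count in counts.items()}


def weighted_similarity(history_a, history_b, weights):
    """Cosine (vector) similarity between two patients, with each disease
    represented by its inverse-frequency weight.

    The histories are converted to sets, so the order of diagnoses and the
    time gaps between them have no effect on the result.
    """
    set_a, set_b = set(history_a), set(history_b)
    dot = sum(weights.get(d, 0.0) ** 2 for d in set_a & set_b)
    norm_a = math.sqrt(sum(weights.get(d, 0.0) ** 2 for d in set_a))
    norm_b = math.sqrt(sum(weights.get(d, 0.0) ** 2 for d in set_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical histories, listed in order of diagnosis.
histories = {
    "p1": ["common_a", "rare_b"],
    "p2": ["rare_b", "common_a"],   # same diseases as p1, opposite order
    "p3": ["common_a", "other_c"],  # shares only the common disease with p1
    "p4": ["rare_b", "other_d"],    # shares only the rare disease with p1
    "p5": ["common_a"],
    "p6": ["common_a"],
}
weights = inverse_frequency_weights(histories)

same_diseases = weighted_similarity(histories["p1"], histories["p2"], weights)
shares_common = weighted_similarity(histories["p1"], histories["p3"], weights)
shares_rare = weighted_similarity(histories["p1"], histories["p4"], weights)
print(same_diseases, shares_common, shares_rare)
```

Sharing the rarer disease yields a higher similarity than sharing the common one, as intended by the inverse-frequency extension. However, patients p1 and p2 are judged to be identical even though their diseases occurred in opposite order, which is the sequence information that a purely vector-based similarity discards.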