Utilization of Data Mining Techniques to Improve the System Scalability

8.3 Utilization of Data Mining Techniques to Improve the System Scalability

Various data mining techniques, including clustering, similarity indexing, classification and rule induction, have been used to improve recommender systems [159]. A state-of-the-art survey of this field was provided in [105], in which the author divided the personalization procedure into three phases and explained in detail how data mining techniques might be applied in each of them. Here we mainly focus on the application of clustering techniques, as it is the most relevant to the research presented in this thesis.

To improve the scalability of recommender systems, Chee et al. developed the data structure RecTree with a divide-and-conquer approach in [132]. Their method iteratively performs k-means clustering upon users until the number of users in each partition is smaller than a threshold. The recommendation algorithm can then be conducted within this reduced data space, which achieves high scalability and avoids diluting the suggestions of good users with those of a multitude of poor users. Suryavanshi et al. combined memory-based and model-based collaborative filtering techniques in [140]. They used the k-NN algorithm to identify the neighbours of the current user and generated recommendations based on such small groups of users instead of the whole dataset.
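The divide-and-conquer idea described above can be sketched as a recursive split of the user set. This is a minimal illustration only, not the RecTree implementation from [132]: it assumes dense rating vectors, uses a plain 2-means split with a deterministic seeding, and the names `two_means` and `partition_users` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two rating vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(users: List[Vector], iters: int = 20) -> Tuple[List[Vector], List[Vector]]:
    """Split users into two groups with a plain 2-means loop."""
    # Assumed seeding: the first user and the user farthest from it.
    c0 = users[0]
    c1 = max(users, key=lambda u: sq_dist(u, c0))
    left, right = users, []
    for _ in range(iters):
        left = [u for u in users if sq_dist(u, c0) <= sq_dist(u, c1)]
        right = [u for u in users if sq_dist(u, c0) > sq_dist(u, c1)]
        if not left or not right:
            break
        new_c0, new_c1 = mean(left), mean(right)
        if new_c0 == c0 and new_c1 == c1:
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return left, right


def partition_users(users: List[Vector], threshold: int) -> List[List[Vector]]:
    """Recursively split until every partition has fewer than `threshold` users."""
    if len(users) < threshold:
        return [users]
    left, right = two_means(users)
    if not left or not right:
        return [users]  # identical vectors cannot be split any further
    return partition_users(left, threshold) + partition_users(right, threshold)


# Hypothetical ratings of six users on two items.
users = [[5, 5], [5, 4], [4, 5], [1, 1], [1, 2], [2, 1]]
leaves = partition_users(users, threshold=4)
```

A neighbourhood-based recommender would then be run inside the leaf that contains the current user, so the cost of the neighbour search depends on the leaf size rather than on the total number of users.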

In contrast, Mobasher suggested grouping items or users and then applying traditional collaborative filtering algorithms to these aggregated profiles [105].

Nevertheless, the above approaches are less helpful for improving system scalability when domain knowledge about items is considered in the recommendation procedure. For example, in our movie datasets, calculating the pairwise similarities even within a small set of movies is still time-consuming [4]. A better solution is to construct a data model that preserves item similarities. Recommender systems can then retrieve these values directly instead of computing them online when generating recommendations. Model-based approaches are more suitable for this purpose because the huge number of candidate items often prevents the use of memory-based approaches in practice. In [159] the author reviewed the similarity indexing approaches that can be used to identify users with the same taste or items with the same characteristics. The inverted index (widely used in the information retrieval community) provides an efficient index structure by mapping the users or items into state vectors, but this technique cannot efficiently resolve queries in recommender systems.
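The idea of preserving item similarities in a data model and retrieving them at recommendation time can be illustrated with a small lookup table. This is a hedged sketch and not the model developed in this thesis: the function `build_similarity_model`, the Jaccard measure over genre sets and the movie data are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Jaccard similarity of two attribute sets (assumed similarity measure)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def build_similarity_model(
    items: Dict[str, Set[str]],
    sim: Callable[[Set[str], Set[str]], float],
    top_k: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """Offline step: compute all pairwise similarities once and keep,
    for every item, only its top_k most similar items."""
    neighbours: Dict[str, List[Tuple[str, float]]] = {name: [] for name in items}
    for a, b in combinations(items, 2):
        s = sim(items[a], items[b])
        neighbours[a].append((b, s))
        neighbours[b].append((a, s))
    return {
        name: sorted(pairs, key=lambda p: p[1], reverse=True)[:top_k]
        for name, pairs in neighbours.items()
    }


# Hypothetical movies described by genre sets.
movies = {
    "A": {"action", "sci-fi"},
    "B": {"action", "thriller"},
    "C": {"romance", "drama"},
    "D": {"sci-fi", "action", "thriller"},
}
model = build_similarity_model(movies, jaccard, top_k=2)

# Online step: a dictionary lookup replaces the similarity computation.
most_similar_to_a = model["A"][0][0]
```

The quadratic pairwise computation is paid once offline; keeping only `top_k` neighbours per item bounds the memory of the stored model at the cost of discarding weak similarities.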

8.4 Proximity Measure based on Taxonomy

When exploiting the taxonomy to compute the item similarities based on domain knowledge in recommender systems, an important issue is how to define an appropriate measure to evaluate the proximity between nodes (corresponding to concepts or data instances) in the taxonomy. Roughly speaking, there are two categories of proximity measure based on the taxonomic structure:

Edge-based approaches mainly consider the topological structure of the hierarchy. The most intuitive way is to use the length of the shortest path between nodes as the similarity measure, which depends only on their relative location in the hierarchy [120]. Wu and Palmer took into account the absolute locations of both nodes as well as that of their Lowest Common Ancestor (LCA) to calculate the similarity [153]. Additional [...] hierarchy [65]. For example, the similarity value between two objects in SimTree is calculated by multiplying all weights along the path that connects both objects [157].

Node-based approaches are based on the idea of Information Content. Each node in the hierarchy, i.e. a concept in a taxonomy, contains a certain amount of information quantified by $IC(c) = -\log_2 p(c)$, where $p(c)$ is the probability of encountering an instance of concept $c$. The root concept has zero information content because all sub-concepts are derived from it, so every instance is an instance of the root and its probability is 1. As one moves down the hierarchy, the probability of encountering the instances of a concept decreases, and so the concept's information content or informativeness increases monotonically. The more abstract a concept, the lower its information content, and vice versa. Hence, the similarity of two concepts is determined by the information they share, i.e. the information content of their LCA concept. In a text corpus, the concept probability is calculated by maximum likelihood estimation (MLE) based on the concept frequency [126] [127].
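The two categories above can be compared on a toy taxonomy. The sketch below is illustrative only: the taxonomy, the instance counts and all function names are hypothetical, the root is assumed to have depth 1, and the Wu and Palmer measure is written in the commonly used form 2 * depth(LCA) / (depth(a) + depth(b)).

```python
import math
from typing import Dict, List, Optional

# Hypothetical taxonomy given as child -> parent (the root has no parent).
PARENT: Dict[str, Optional[str]] = {
    "movie": None,
    "fiction": "movie",
    "documentary": "movie",
    "sci-fi": "fiction",
    "fantasy": "fiction",
    "nature": "documentary",
}

# Hypothetical number of instances attached directly to each concept.
COUNT: Dict[str, int] = {
    "movie": 0, "fiction": 0, "documentary": 2,
    "sci-fi": 3, "fantasy": 1, "nature": 2,
}


def ancestors(c: Optional[str]) -> List[str]:
    """Concepts on the path from c up to the root, c itself included."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path


def depth(c: str) -> int:
    """Depth counted in nodes, so the root has depth 1 (assumption)."""
    return len(ancestors(c))


def lca(a: str, b: str) -> str:
    """Lowest Common Ancestor of two concepts."""
    seen = set(ancestors(a))
    for c in ancestors(b):
        if c in seen:
            return c
    raise ValueError("concepts are not in the same taxonomy")


def path_length(a: str, b: str) -> int:
    """Edge-based: number of edges on the shortest path between a and b."""
    return depth(a) + depth(b) - 2 * depth(lca(a, b))


def wu_palmer(a: str, b: str) -> float:
    """Edge-based: uses the depths of both nodes and of their LCA."""
    return 2 * depth(lca(a, b)) / (depth(a) + depth(b))


def probability(c: str) -> float:
    """MLE estimate: instances of c and of all its descendants, over all instances."""
    total = sum(COUNT.values())
    covered = sum(n for node, n in COUNT.items() if c in ancestors(node))
    return covered / total


def information_content(c: str) -> float:
    """IC(c) = -log2 p(c)."""
    return -math.log2(probability(c))


def ic_similarity(a: str, b: str) -> float:
    """Node-based: information content of the LCA of a and b."""
    return information_content(lca(a, b))
```

With these counts, `sci-fi` and `fantasy` share the ancestor `fiction`, which covers half of all instances, whereas `sci-fi` and `nature` only share the root, whose information content is zero.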

The above categories of proximity measure evaluate semantic similarity from different viewpoints, and so they are suitable for different situations. When an accurate taxonomy is available, the edge-based approaches work better. The node-based approaches perform better in other cases, because they ignore the topological details of the hierarchy. To overcome their individual disadvantages, several hybrid approaches have been proposed to incorporate the information content into the edge-based measure [65] [96]. In comparison, Li et al. proposed a measure that combines the length of the shortest path between two concepts and the location of their LCA within the semantic taxonomy in a non-linear manner [94]. A variant of this measure was later used in [24] to personalize web search. In addition, Ganesan et al. reviewed the evolution of proximity measures over hierarchical domain structures and analyzed and compared these measures empirically in [46]. They also extended these measures to deal with multiple occurrences of elements in the taxonomy.
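A non-linear combination of path length and LCA depth can take, for example, the form of an exponential decay in the path length multiplied by a saturating function of the LCA depth. The sketch below uses exp(-alpha * l) * tanh(beta * h); whether this matches the exact definition in [94] is an assumption, and the function name and the default values of `alpha` and `beta` are illustrative only.

```python
import math


def nonlinear_similarity(path_len: int, lca_depth: int,
                         alpha: float = 0.2, beta: float = 0.6) -> float:
    """Assumed form: similarity decays exponentially with the shortest-path
    length and grows, with saturation, as the LCA lies deeper in the taxonomy."""
    return math.exp(-alpha * path_len) * math.tanh(beta * lca_depth)


# Two concept pairs with the same path length but LCAs at different depths.
shallow = nonlinear_similarity(path_len=2, lca_depth=1)
deep = nonlinear_similarity(path_len=2, lca_depth=3)
```

Under this form, two concepts that are the same number of edges apart are judged more similar when their LCA is a specific concept deep in the hierarchy than when it is an abstract concept near the root.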

In document Relational clustering models for knowledge discovery and recommender systems (Page 149-152)