

CHAPTER 6 : SYSTEM AND DATA ARCHITECTURE

6.4 DATA COLLECTION AND THE DATA MANAGER

We rely upon the data we collect from our listeners to make recommendations in Smart Radio. Listeners can provide explicit ratings on individual music tracks or individual artists on a scale of 1–5, where 5 is the top score. We also log implicit feedback, where the user has listened to music items but has not explicitly rated these items. In this case we allocate positive ratings to frequently played items, and negative ratings to less frequently played items. We discuss the algorithm for deriving implicit data later in this section.

The data model of the ACF recommendation component is updated regularly so that recommendations are made with the most up-to-date data available. Since feedback may be posted by several users concurrently, we use a multi-threaded data logger, similar to that used to log requests in a web server. The Data Manager is the component that periodically converts the raw log data collected by the system into data that can be used by the recommendation engine. The data conversion period can be adjusted; currently the database is checked for new data every 2 minutes while users are on-line. The Data Manager does this by checking a database table, new_data_flag, which contains records indicating which users have posted new data since the last data conversion. This table records the time and the type of data posted by each user. The Data Manager can then make an incremental update to the ACF database for each user profile listed in the new_data_flag table.

As we saw in Chapter 4, the data used in ACF systems usually consists of three columns: userId, itemId and score. Smart Radio has two such databases, recdata_item and recdata_artist, which store the data driving the ACF algorithms. Rather than allowing multiple user threads to access these key ACF databases, the Data Manager periodically compiles new log data into the ACF format and updates the central ACF databases. The Data Manager is also responsible for deriving the implicit scores for item and artist data.
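The division of labour described above can be illustrated with a minimal sketch. It assumes in-memory dictionaries in place of the real database tables, and the class name, method names and the "item"/"artist" type labels are hypothetical; only new_data_flag, recdata_item and recdata_artist are names taken from the text. The timestamp that the real new_data_flag table records is omitted for brevity.

```python
import threading


class DataManagerSketch:
    """Illustrative stand-in for the logger plus Data Manager (not the real code)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.raw_log = []         # (user_id, data_type, target_id, score) entries
        self.new_data_flag = {}   # user_id -> set of data types posted since last run
        self.recdata_item = {}    # (user_id, item_id) -> score
        self.recdata_artist = {}  # (user_id, artist_id) -> score

    def post_feedback(self, user_id, data_type, target_id, score):
        """Called from many user threads: log the feedback and flag the user."""
        with self.lock:
            self.raw_log.append((user_id, data_type, target_id, score))
            self.new_data_flag.setdefault(user_id, set()).add(data_type)

    def convert(self):
        """Called periodically from one thread: fold pending log data into the ACF tables."""
        with self.lock:
            # Take the pending work and reset the shared structures in one step.
            pending, self.raw_log = self.raw_log, []
            flagged, self.new_data_flag = self.new_data_flag, {}
        for user_id, data_type, target_id, score in pending:
            table = self.recdata_item if data_type == "item" else self.recdata_artist
            # Assumption: the most recent score for a (user, target) pair wins.
            table[(user_id, target_id)] = score
        return sorted(flagged)  # users whose profiles were updated
```

Only convert touches the two ACF tables, so user threads never access them directly; they contend only for the short lock around the log and the flag table.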

6.4.1 Implicit Track Data

As we discussed in Chapter 4, assigning an implicit value to a user action is likely to be an imprecise affair. Different heuristics have been devised as indicators of user interest, such as the time spent reading a piece of text (Morita & Shinoda 1994, Konstan et al. 1997, Rafter & Smyth 2000). In Smart Radio we make use of the fact that people will very often listen repeatedly to music they like, and listen less often to music they dislike. We measured the correlation between the ratings and the number of listens for each user, and found the mean correlation across users to be 0.236. Although this is a lower correlation than we might have expected, it can be explained by noise in the listen counts. Since Smart Radio delivers compilations of music, a compilation containing an item that the user dislikes and has rated negatively may still be played frequently because of other, more favourable items in the compilation.
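The per-user measurement described above can be sketched as follows. The text does not say which correlation coefficient was used; this sketch assumes Pearson's coefficient, and the function names and the shape of the profiles argument are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def mean_user_correlation(profiles):
    """profiles: user_id -> list of (explicit_rating, listen_count) pairs."""
    values = []
    for pairs in profiles.values():
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None:
            values.append(r)
    # Assumption: users with an undefined correlation are left out of the mean.
    return sum(values) / len(values) if values else None
```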

We calculate the mean and standard deviation of each user’s rating set and listening count set. Our implicit scoring algorithm allocates positive scores to items the user has listened to more often than his/her average listening count, and negative scores to items the user has listened to less often than that average. Each score is allocated around the user’s explicit mean score: a positive vote is calculated as the mean plus a single standard deviation, and a negative vote as the mean minus a single standard deviation.
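A minimal sketch of this scoring rule follows. Several details are assumptions, since the text does not settle them: the population standard deviation is used, items already rated explicitly are skipped, items played exactly the average number of times receive no score, and scores are not clamped to the 1–5 scale. The function and argument names are hypothetical.

```python
from statistics import mean, pstdev


def implicit_track_scores(explicit_ratings, listen_counts):
    """Derive implicit scores for one user.

    explicit_ratings: item_id -> explicit rating given by the user (1-5).
    listen_counts:    item_id -> number of times the user played the item.
    Returns item_id -> implicit score for items with no explicit rating.
    """
    ratings = list(explicit_ratings.values())
    rating_mean = mean(ratings)
    rating_std = pstdev(ratings)  # assumption: population standard deviation
    avg_count = mean(list(listen_counts.values()))

    scores = {}
    for item_id, count in listen_counts.items():
        if item_id in explicit_ratings:
            continue  # assumption: an explicit rating takes precedence
        if count > avg_count:
            scores[item_id] = rating_mean + rating_std  # positive vote
        elif count < avg_count:
            scores[item_id] = rating_mean - rating_std  # negative vote
        # count == avg_count: left unscored (assumption)
    return scores
```

For example, a user whose explicit ratings are 2 and 4 has a mean of 3 and a standard deviation of 1, so under these assumptions a frequently played unrated track receives 4 and a rarely played one receives 2.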

In certain cases we have no explicit data on which to base implicit track scores. In such situations we allocate the average explicit score in the database for each item, if the count for that item is above the user’s average item count. This allows us to provisionally correlate the user with other users for the purpose of making recommendations. However, we do not use this user’s data for making recommendations for other users. As such, in Chapter 7 the implicit track data we evaluate is based on user data where users have also submitted explicit scores.
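The fallback for users with no explicit ratings can be sketched in the same style. This reads "the average explicit score in the database for each item" as the mean of the explicit ratings that other users have given to that item, which is an interpretation rather than something the text states outright; the function and argument names are hypothetical.

```python
from statistics import mean


def fallback_implicit_scores(listen_counts, item_explicit_ratings):
    """Provisional scores for a user who has submitted no explicit ratings.

    listen_counts:         item_id -> number of times this user played the item.
    item_explicit_ratings: item_id -> list of explicit ratings from other users.
    """
    avg_count = mean(list(listen_counts.values()))
    scores = {}
    for item_id, count in listen_counts.items():
        ratings = item_explicit_ratings.get(item_id)
        # Only items played more often than the user's average are scored,
        # and only if some explicit rating exists for the item.
        if count > avg_count and ratings:
            scores[item_id] = mean(ratings)
    return scores
```

Scores produced this way serve only to find neighbours for the new user; as stated above, they are not used when making recommendations for other users.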

6.4.2 Implicit Artist Data – Dimensionality Reduction

In Chapter 4 we discussed dimensionality reduction, a technique for reducing dataset sparsity where the aim is to reduce the horizontal dimensions of the user–item matrix. Singular Value Decomposition (SVD), a matrix factorisation technique, has been used for this purpose (Sarwar et al. 2000a, Billsus & Pazzani 1999). SVD is the basis of latent semantic indexing (LSI), a technique used in Information Retrieval to solve the problems of synonymy and polysemy in a corpus of documents (Deerwester et al. 1990, Berry et al. 1995). When SVD is used in an information-rich environment like information retrieval, it is capable of producing reduced features that still have human-level semantics attached to them. With data that has very little semantic information attached to it, such as in an ACF environment, SVD is used simply as a matrix factorisation technique, in which case the reduced feature set has no human-level semantics associated with it.

While reducing the dimensionality of the user–item matrix in order to improve ACF performance is desirable, so too is the goal of producing an interpretable feature set, particularly for the purpose of bootstrapping a new user into the system. Since the Smart Radio dataset has very little content attached to it, we did not consider SVD to be a solution that could meet both objectives. However, on closer inspection we realised that the content data available to us did provide the type of categorical information we would look for from an LSI perspective. The content ‘features’ associated with our music data can be used to describe each music track as belonging to an artist and genre class. Therefore we decided to reduce the user–music item matrix by mapping it to a user–artist matrix. To calculate a score for a user–artist pair we simply average the scores for tracks from that artist that the user has explicitly rated.

Equation 6.1

$$S_{ua} = \mathrm{Average}\left(t_{ui} \mid t_{ui} \in A\right)$$

In Equation 6.1, A represents the set of tracks by artist a. The score S_ua allocated to user u for artist a is the mean of the explicit ratings t_ui that user u has given to tracks by artist a. Applying Equation 6.1 to the explicit track data we obtain a transformed dataset with the same number of users but with the item dimension reduced from 4131, the number of music items, to 333, the number of artists represented in the dataset. This decreases the sparsity of the dataset from 0.9734 (track_explicit) to 0.8682 (artist_implicit). We perform a detailed analysis of this dataset in Chapter 7. In the next section we will discuss the issue of bootstrapping in ACF, which concerns the performance of the algorithm when there is a deficit of appropriate data. As we will demonstrate, the user–artist dataset is useful for bootstrapping new users into the system.
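The mapping in Equation 6.1 and its effect on sparsity can be sketched as follows. The function names and the toy data structures are hypothetical, and sparsity is assumed here to mean the fraction of empty cells in the user-by-column matrix.

```python
from collections import defaultdict


def user_artist_scores(track_ratings, track_artist):
    """Apply Equation 6.1: average each user's explicit track ratings per artist.

    track_ratings: (user_id, track_id) -> explicit rating.
    track_artist:  track_id -> artist_id.
    Returns (user_id, artist_id) -> mean rating.
    """
    grouped = defaultdict(list)
    for (user_id, track_id), rating in track_ratings.items():
        grouped[(user_id, track_artist[track_id])].append(rating)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


def sparsity(n_scores, n_users, n_columns):
    """Fraction of empty cells in an n_users x n_columns matrix (assumed definition)."""
    return 1 - n_scores / (n_users * n_columns)
```

In a toy dataset where two users have rated four of six possible user–track cells, collapsing three tracks into two artists leaves three of four user–artist cells filled, so sparsity under this definition falls from about 0.33 to 0.25. The same mechanism underlies the reduction reported above for the Smart Radio data.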