Decision and Regression Trees - Recommender Systems the Textbook

Decision and regression trees are frequently used in data classification. Decision trees are designed for those cases in which the dependent variable is categorical, whereas regression trees are designed for those cases in which the dependent variable is numerical. Before discussing the generalization of decision trees to collaborative filtering, we will first discuss the application of decision trees to classification.

Consider the case in which we have an m × n matrix R. Without loss of generality, assume that the first (n − 1) columns are the independent variables, and the final column is the dependent variable. For ease of discussion, assume that all variables are binary. Therefore, we will discuss the creation of a decision tree rather than a regression tree. Later, we will discuss how to generalize this approach to other types of variables.

The decision tree is a hierarchical partitioning of the data space with the use of a set of hierarchical decisions, known as the split criteria, on the independent variables. In a univariate decision tree, a single feature is used at a time in order to perform a split. For example, in a binary matrix R, in which the feature values are either 0 or 1, all the data records in which a carefully chosen feature variable takes on the value of 0 will lie in one branch, whereas all the data records in which the feature variable takes on the value of 1 will lie in the other branch. When the feature variable is chosen so that it is correlated with the class variable, the data records within each branch will tend to be purer. In other words, most of the records belonging to the different classes will be separated out: one of the two branches will predominantly contain one class, whereas the other branch will predominantly contain the other class. When each node in a decision tree has two children, the resulting decision tree is said to be a binary decision tree.

The quality of the split can be evaluated by using the weighted average Gini index of the child nodes created from a split. If p₁, ..., pᵣ are the fractions of data records belonging to r different classes in a node S, then the Gini index G(S) of the node is defined as follows:

$$G(S) = 1 - \sum_{i=1}^{r} p_i^2 \tag{3.1}$$

The Gini index lies between 0 and 1, with smaller values being more indicative of greater discriminative power. The overall Gini index of a split is equal to the weighted average of the Gini index of the children nodes. Here, the weight of a node is defined by the number of data points in it. Therefore, if S₁ and S₂ are the two children of node S in a binary decision tree, with n₁ and n₂ data records, respectively, then the Gini index of the split S ⇒ [S₁, S₂] may be evaluated as follows:

$$\text{Gini}(S \Rightarrow [S_1, S_2]) = \frac{n_1 \cdot G(S_1) + n_2 \cdot G(S_2)}{n_1 + n_2} \tag{3.2}$$
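The two quantities above, the Gini index of a node and the weighted Gini index of a binary split, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names `gini` and `split_gini` and the toy label lists are assumptions introduced here.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Size-weighted average Gini index of the two children of a binary split."""
    n1, n2 = len(left_labels), len(right_labels)
    return (n1 * gini(left_labels) + n2 * gini(right_labels)) / (n1 + n2)

# Hypothetical node with four records of each class, and one candidate split.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini(parent))             # 0.5
print(split_gini(left, right))  # 0.375
```

The split lowers the index from 0.5 to 0.375, which reflects children that are purer than the parent node.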

[Figure 3.2 omitted: a binary decision tree whose internal nodes split on attributes 1 through 4, shown with test instances A = 0010 and B = 0110.]

Figure 3.2: Example of a decision tree with four binary attributes

The Gini index is used for selecting the appropriate attribute to use for performing the split at a given level of the tree. One can test each attribute to evaluate the Gini index of its split according to Equation 3.2. The attribute with the smallest Gini index is selected for performing the split. The approach is executed hierarchically, in top-down fashion, until each node contains only data records belonging to a particular class. It is also possible to stop the tree growth early, when a minimum fraction of the records in the node belong to a particular class. Such a node is referred to as a leaf node, and it is labeled with the dominant class of the records in that node.

To classify a test instance with an unknown value of the dependent variable, its independent variables are used to map a path in the decision tree from the root to the leaf. Because the decision tree is a hierarchical partitioning of the data space, the test instance will follow exactly one path from the root to the leaf. The label of the leaf is reported as the relevant one for the test instance.

An example of a decision tree, constructed on four binary attributes, is illustrated in Figure 3.2. The leaf nodes of the tree are shaded in the figure. Note that not all attributes are necessarily used for splits by the decision tree. For example, the leftmost path uses attributes 1 and 2, but it does not use attributes 3 and 4. Furthermore, different paths in the decision tree may use different sequences of attributes. This situation is particularly common with high-dimensional data. Examples of the mappings of test instances A = 0010 and B = 0110 to respective leaf nodes are illustrated in Figure 3.2. Each of these test instances is mapped to a unique leaf node because of the hierarchical nature of the data partitioning.
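The top-down construction and the root-to-leaf classification described above might be sketched as follows. This is a simplified illustration under stated assumptions: the nested-dictionary tree layout, the names `build_tree` and `classify`, the `min_purity` parameter, and the toy training records are all hypothetical, and the resulting tree is not the one drawn in Figure 3.2.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, min_purity=1.0):
    """Grow a binary tree on 0/1 features, choosing the smallest split Gini index."""
    dominant, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_purity:
        return {"leaf": dominant}  # pure enough: label with the dominant class
    best = None
    for j in range(len(rows[0])):
        left = [y for x, y in zip(rows, labels) if x[j] == 0]
        right = [y for x, y in zip(rows, labels) if x[j] == 1]
        if not left or not right:
            continue  # this attribute does not separate the records in the node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, j)
    if best is None:
        return {"leaf": dominant}  # no usable split remains
    j = best[1]
    children = {}
    for v in (0, 1):
        subset = [(x, y) for x, y in zip(rows, labels) if x[j] == v]
        children[v] = build_tree([x for x, _ in subset],
                                 [y for _, y in subset], min_purity)
    return {"attr": j, "children": children}

def classify(tree, x):
    """Follow the single root-to-leaf path determined by the feature values of x."""
    while "leaf" not in tree:
        tree = tree["children"][x[tree["attr"]]]
    return tree["leaf"]

# Hypothetical training data: four binary attributes; the class equals attribute 2.
rows = [(0, 0, 1, 0), (0, 1, 1, 0), (1, 0, 0, 1),
        (1, 1, 0, 0), (0, 1, 0, 1), (1, 0, 1, 1)]
labels = [x[1] for x in rows]
tree = build_tree(rows, labels)

print(classify(tree, (0, 0, 1, 0)))  # test instance A = 0010 -> 0
print(classify(tree, (0, 1, 1, 0)))  # test instance B = 0110 -> 1
```

Setting `min_purity` below 1.0 corresponds to stopping the growth early once the dominant class reaches that fraction of the records in a node.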

The approach can be extended to numerical dependent and independent variables with minor modifications. To handle numerical independent (feature) variables, the attribute values can be divided into intervals in order to perform the splits. Note that this approach might result in a multi-way split, where each branch of the split corresponds to a different interval. The split is then performed by choosing the attribute on the basis of the Gini index criterion. Such an approach also applies to categorical feature variables, wherein each value of the categorical attribute corresponds to a branch of the split.
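A multi-way split on a numeric attribute can be scored in the same spirit as Equation 3.2. The sketch below assumes that the weighted average simply extends from two children to one child per interval; the function name `multiway_split_gini`, the cut points, and the toy data are illustrative assumptions.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def multiway_split_gini(values, labels, boundaries):
    """Size-weighted Gini index of a multi-way split on a numeric attribute.

    boundaries is a sorted list of cut points; a value v is sent to the
    branch with index bisect_right(boundaries, v).
    """
    branches = defaultdict(list)
    for v, y in zip(values, labels):
        branches[bisect_right(boundaries, v)].append(y)
    return sum(len(b) * gini(b) for b in branches.values()) / len(labels)

# Hypothetical numeric attribute and class labels.
values = [1.0, 2.0, 3.5, 4.0, 4.5, 5.0]
labels = [0, 0, 1, 1, 0, 0]

print(multiway_split_gini(values, labels, [3.0, 4.2]))  # three intervals -> 0.0
print(multiway_split_gini(values, labels, [3.0]))       # two intervals, about 0.333
```

A categorical attribute could be handled the same way by grouping the records on the attribute value instead of on the interval index.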

To handle numeric dependent variables, the split criterion is changed from the Gini index to a measure better suited to numeric attributes. Specifically, the variance of the numeric dependent variable is used instead of the Gini index. A lower variance is more desirable because it means that the training instances in the node have similar values of the dependent variable, so the split discriminates well with respect to it. Either the average value in the leaf node, or a linear regression model, is used at the leaf node to perform the prediction [22].
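As a rough sketch of the variance criterion, the code below scores each binary attribute by the size-weighted variance of the dependent variable in the two children and predicts with the leaf average. Weighting by node size is an assumption made by analogy with Equation 3.2, and the names `split_variance` and `best_split`, as well as the data, are hypothetical.

```python
from statistics import mean, pvariance

def split_variance(left, right):
    """Size-weighted variance of the dependent variable in the two children."""
    n1, n2 = len(left), len(right)
    return (n1 * pvariance(left) + n2 * pvariance(right)) / (n1 + n2)

def best_split(rows, targets):
    """Return (attribute index, score) of the binary attribute with the lowest variance."""
    best_attr, best_score = None, float("inf")
    for j in range(len(rows[0])):
        left = [t for x, t in zip(rows, targets) if x[j] == 0]
        right = [t for x, t in zip(rows, targets) if x[j] == 1]
        if not left or not right:
            continue  # attribute does not separate the records
        score = split_variance(left, right)
        if score < best_score:
            best_attr, best_score = j, score
    return best_attr, best_score

# Hypothetical data: two binary features and a numeric dependent variable.
rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
targets = [1.0, 2.0, 4.0, 5.0]

attr, score = best_split(rows, targets)
print(attr, score)  # attribute 0, weighted variance 0.25

# Leaf prediction by the average value of the training instances in the leaf.
print(mean(t for x, t in zip(rows, targets) if x[attr] == 0))  # 1.5
```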

In many cases, the tree is pruned to reduce overfitting. In this case, a portion of the training data is not used during the tree construction phase. Then, the effect of pruning a node is tested on the portion of the training data that is held out. If the removal of the node improves the accuracy of the decision tree prediction on the held-out data, then the node is pruned. Additionally, other variations of the split criteria, such as error rates and entropy, are commonly used. Detailed discussions of various design choices in decision tree construction may be found in [18, 22].
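The held-out pruning test can be sketched as a bottom-up pass over the tree. The version below is one possible reading of the description, not a definitive procedure: it assumes a nested-dictionary tree in which each internal node stores a hypothetical `majority` field (the dominant training class at that node), and it replaces a subtree with a leaf only when accuracy on the held-out records strictly improves.

```python
def classify(tree, x):
    while "leaf" not in tree:
        tree = tree["children"][x[tree["attr"]]]
    return tree["leaf"]

def accuracy(tree, rows, labels):
    return sum(classify(tree, x) == y for x, y in zip(rows, labels)) / len(labels)

def prune(node, root, rows, labels):
    """Prune children first, then try turning this node into a leaf."""
    if "leaf" in node:
        return
    for child in node["children"].values():
        prune(child, root, rows, labels)
    before = accuracy(root, rows, labels)
    saved = dict(node)
    node.clear()
    node["leaf"] = saved["majority"]
    if accuracy(root, rows, labels) <= before:
        node.clear()
        node.update(saved)  # no improvement on held-out data: restore the subtree

# Hypothetical tree whose lower split was fitted to noise in the training data.
root = {"attr": 0, "majority": 1, "children": {
    0: {"leaf": 0},
    1: {"attr": 1, "majority": 1, "children": {0: {"leaf": 0}, 1: {"leaf": 1}}},
}}
held_out_rows = [(0, 0), (1, 0), (1, 1), (1, 0)]
held_out_labels = [0, 1, 1, 1]

prune(root, root, held_out_rows, held_out_labels)
print(root["children"][1])                               # {'leaf': 1}
print(accuracy(root, held_out_rows, held_out_labels))    # 1.0
```

In this example the lower split is collapsed because doing so raises held-out accuracy from 0.5 to 1.0, while the root split is kept because replacing it with a leaf would lower the accuracy.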

3.2.1 Extending Decision Trees to Collaborative Filtering

The main challenge in extending decision trees to collaborative filtering is that the predicted entries and the observed entries are not clearly separated in column-wise fashion as feature and class variables. Furthermore, the ratings matrix is very sparsely populated, with the majority of entries missing. This creates challenges in hierarchically partitioning the training data during the tree-building phase. In addition, since the dependent and independent variables (items) are not clearly demarcated in collaborative filtering, what item should be predicted by the decision tree?

The latter issue is relatively easy to address by constructing separate decision trees to predict the rating of each item. Consider an m × n ratings matrix R with m users and n items. A separate decision tree needs to be constructed by fixing each attribute (item) to be dependent and the remaining attributes as independent. Therefore, the number of decision trees constructed is exactly equal to the number n of attributes (items). When predicting the rating of a particular item for a user, the decision tree corresponding to the relevant item is used for prediction.
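The one-tree-per-item setup amounts to deriving n separate training problems from the same ratings matrix. The sketch below only prepares those problems and does not train any tree. It assumes that missing ratings are stored as `None` and that only users who have rated item j serve as training rows for the jth tree; both are illustrative choices, as is the name `per_item_problems`.

```python
def per_item_problems(R):
    """Yield (j, X, y): column j is the dependent variable, the rest are features."""
    n = len(R[0])
    for j in range(n):
        X, y = [], []
        for row in R:
            if row[j] is None:
                continue  # assumption: skip users who have not rated item j
            X.append(row[:j] + row[j + 1:])  # remaining (n - 1) items as features
            y.append(row[j])
        yield j, X, y

# Hypothetical 3 x 3 ratings matrix with missing entries.
R = [
    [5, None, 1],
    [4, 2, None],
    [None, 1, 5],
]

problems = list(per_item_problems(R))
print(len(problems))  # one training problem (and hence one tree) per item: 3
print(problems[0])    # (0, [[None, 1], [2, None]], [5, 4])
```

Note that the feature rows still contain missing values, which is exactly the difficulty taken up next.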

On the other hand, the issue of missing independent features is more difficult to address. Consider the case in which a particular item (say, a particular movie) is used as a splitting attribute. All users whose rating is less than a threshold are assigned to one branch of the tree, whereas the users whose ratings are larger than the threshold are assigned to the other branch. Because ratings matrices are sparse, most users will not have specified ratings for this item. Which branch should such users be assigned to? Logic dictates that such users should be assigned to both branches. However, in such a case, the decision tree no longer remains a strict partitioning of the training data. Furthermore, according to this approach, test instances will map to multiple paths in the decision tree, and the possibly conflicting predictions from the various paths will need to be combined into a single prediction.
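A small sketch makes the loss of a strict partitioning concrete. It assumes that missing ratings are stored as `None` and that a rating equal to the threshold goes to the upper branch (the text does not say which branch such ratings take); the function name `split_users` and the data are hypothetical.

```python
def split_users(ratings, threshold):
    """Assign users to branches; users with a missing rating go to both."""
    low = {u for u, r in ratings.items() if r is None or r < threshold}
    high = {u for u, r in ratings.items() if r is None or r >= threshold}
    return low, high

# Hypothetical ratings of one movie by four users; two have not rated it.
ratings = {"u1": 1, "u2": 5, "u3": None, "u4": None}
low, high = split_users(ratings, threshold=3)

print(sorted(low))          # ['u1', 'u3', 'u4']
print(sorted(high))         # ['u2', 'u3', 'u4']
print(sorted(low & high))   # ['u3', 'u4'] appear in both branches
```

The two branches together hold six user assignments for only four users, so the branches overlap rather than partition the training data.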

A second (and more reasonable) approach is to create a lower-dimensional representation of the data using the dimensionality reduction methods discussed in Section 2.5.1.1 of Chapter 2. Consider the scenario in which the rating of the jth item needs to be predicted. At the very beginning, the m × (n − 1) ratings matrix, excluding the jth column, is converted into a lower-dimensional m × d representation, in which d ≪ n − 1 and all attributes are fully specified. The covariance between each pair of items in the m × (n − 1) ratings matrix is estimated using the methods discussed in Section 2.5.1.1 of Chapter 2. The top-d

