
4.3 Preference System

4.3.1 Discretizing the Dataset

Many elicitation techniques are designed for finite discrete variables, while many real-world datasets contain continuous-valued variables. Most preference algorithms are designed for scenarios where the attribute space is defined over a set of variables, A. In an exploration context, users usually do not have strong preferences for specific attribute values. We assume instead that users are interested in ranges of values, i.e., intervals of attributes. Before beginning preference elicitation, our system therefore constructs an appropriate discretized attribute space from the set of attribute variables. Choosing the granularity of this discretization is important: it must be fine enough to separate interesting attribute ranges from uninteresting ones, but not so fine that it creates too many intervals for our preference algorithms.

Constructing an ideal discretization of the preference space is a challenging problem. There are several possible approaches for discretizing a continuous space. Discretization methods can be described as either supervised or unsupervised [28]. Supervised methods build discretizations using a training set, while unsupervised methods do not. We do not assume that a training set is always available, so we turn to unsupervised methods.

The two simplest discretization techniques are equal width intervals and equal frequency intervals. Equal width discretization segments a range into k equally sized bins. While simple, it is susceptible to outliers, which can stretch the bins to inappropriate widths; wide intervals are more likely to contain elements that belong in separate bins. Equal frequency discretization creates k intervals that each contain n/k of the n elements. However, forcing a fixed number of elements into each bin may also produce inappropriate intervals.
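The two schemes can be compared on a small example. The following is a minimal sketch, not the system's code; the function names and the sample values (including the single outlier) are hypothetical and chosen only to make the weaknesses described above visible.

```python
def equal_width_edges(values, k):
    """Return k+1 edges that split [min, max] into k equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k)] + [hi]


def count_per_bin(values, edges):
    """Count the values in each bin defined by consecutive edges."""
    k = len(edges) - 1
    counts = [0] * k
    for v in values:
        # Bins are half-open [lo, hi); the last bin also includes its upper edge.
        idx = next((i for i in range(k) if v < edges[i + 1]), k - 1)
        counts[idx] += 1
    return counts


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of (nearly) n/k elements each."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i holds ordered[i*n//k : (i+1)*n//k], so sizes differ by at most one.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


data = [1, 2, 2, 3, 3, 4, 5, 100]   # hypothetical values with one outlier
width_counts = count_per_bin(data, equal_width_edges(data, 4))
freq_bins = equal_frequency_bins(data, 4)
print(width_counts)   # [7, 0, 0, 1]
print(freq_bins)      # [[1, 2], [2, 3], [3, 4], [5, 100]]
```

With equal width, the outlier stretches the bins so that seven of the eight values share one bin. With equal frequency, identical values (the two 2s and the two 3s) are split across bins, and 5 is grouped with 100.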

Fundamentally, discretization attempts to assign each attribute value to an interval range, i.e., a class, making discretization a classification task. Unsupervised classification is analogous to clustering. Therefore, another way to discretize the attribute space is to use clustering algorithms [11, 58].

One way to discretize a continuous n-space is to cluster the values of each attribute independently of the other attributes. The discretized space is then built by taking the cross product of the resulting clusters for each attribute. Alternatively, clustering can build a user-defined number of clusters from a set of elements without considering each attribute separately. The more clusters allowed in the system, the smaller the differences between data elements within the same cluster.

k-means clustering is a well-known technique that makes use of the error sum of squares metric [78]. Minimizing this value over candidate clusterings keeps the elements of each cluster close together while keeping the clusters themselves far apart. However, k-means clustering is sensitive to how the initial cluster centers are selected: a poor initial assignment can lead to a poor clustering. Another clustering technique attempts to create clusters with the greatest amount of “contrast” [26]. The contrast measure is related to the error sum of squares.
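The sensitivity to initial centers can be seen in one dimension. The sketch below uses the standard Lloyd iteration, which is an assumption here since the text does not name a particular k-means variant; the data values and both sets of seeds are hypothetical.

```python
def sse(values, centers):
    """Error sum of squares: squared distance of each value to its nearest center."""
    return sum(min((v - c) ** 2 for c in centers) for v in values)


def kmeans_1d(values, seeds, max_iter=100):
    """Lloyd's algorithm in one dimension, started from the given seeds."""
    centers = list(seeds)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2)
            groups[nearest].append(v)
        # A center with no assigned values stays where it is.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


data = [0, 1, 2, 10, 11, 12, 20, 21, 22]   # hypothetical attribute values
good = kmeans_1d(data, [0, 10, 20])         # one seed per dense region
poor = kmeans_1d(data, [0, 1, 2])           # all seeds in the first region
print(good, sse(data, good))   # [1.0, 11.0, 21.0] 6.0
print(poor, sse(data, poor))   # [0.0, 1.5, 16.0] 154.5
```

Both runs converge, but the second stops in a local optimum that merges two dense regions, with a much larger error sum of squares.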

Hierarchical clustering is another technique that can be used to discretize an attribute space [58]. It works by repeatedly merging the clusters that are most similar to one another. Initially, every value is its own cluster. After the similarity between every pair of clusters is computed, the two closest clusters are merged. The similarities between the newly merged cluster and the remaining clusters are then computed, and the process continues until only the desired number of clusters remains.
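A minimal sketch of this bottom-up merging for a single attribute follows. The text does not specify the similarity measure, so the distance between cluster means is used here as an assumption; the function names and sample values are hypothetical.

```python
def mean(cluster):
    return sum(cluster) / len(cluster)


def agglomerate(values, k):
    """Merge the two closest clusters until only k remain."""
    # Initially, every value is its own cluster.
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > k:
        # Clusters stay sorted and contiguous in one dimension, so the
        # closest pair of cluster means is always an adjacent pair.
        i = min(range(len(clusters) - 1),
                key=lambda j: mean(clusters[j + 1]) - mean(clusters[j]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


result = agglomerate([1, 2, 3, 10, 11, 30], 3)   # hypothetical values
print(result)   # [[1, 2, 3], [10, 11], [30]]
```

Restricting the search to adjacent pairs is a one-dimensional shortcut; for multi-attribute clustering every pair of clusters would have to be compared.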

Assistant’s Discretization Routine

Without a preliminary training set to guide the discretization process, the assistant relies on a clustering algorithm to partition the continuous range of an attribute A into a discrete set X, where |X| = k and X = {x₁, ..., x_k}. Every x ∈ X represents an interval of the continuous range of A. The clustering process therefore maps every attribute value a_i ∈ [a_lo, a_hi] to one of the k discrete values x_j ∈ X such that a_i ∈ [x_j,lo, x_j,hi]. The number of clusters for a given attribute is computed as the logarithm of the number of distinct values in the dataset that fall within this attribute’s range.
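The mapping from attribute values to intervals might be sketched as follows. The base of the logarithm and the rounding rule are not stated in the text, so the natural logarithm, rounding to the nearest integer, and a minimum of one cluster are all assumptions; the helper names are hypothetical.

```python
import math


def num_clusters(values, base=math.e):
    """k = log of the number of distinct values (base and rounding are assumptions)."""
    distinct = len(set(values))
    return max(1, round(math.log(distinct, base)))


def to_intervals(clusters):
    """Represent each cluster x_j by the interval [x_j,lo, x_j,hi] it covers."""
    return [(min(c), max(c)) for c in clusters]


def discretize(a, intervals):
    """Return the index j of the interval containing a, or None if a lies in a gap."""
    for j, (lo, hi) in enumerate(intervals):
        if lo <= a <= hi:
            return j
    return None


intervals = to_intervals([[1, 2, 3], [10, 11], [30]])   # hypothetical clusters
print(num_clusters(range(1000)))   # 7, since ln(1000) is about 6.9
print(discretize(11, intervals))   # 1
```

Because the intervals are taken from the clustered data, a value that was not in the dataset can fall between two intervals; this sketch reports such a value as `None` rather than guessing a neighbor.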

Discretization is done on each attribute range with a hybrid clustering algorithm. It runs as follows:

1. Compute the number of clusters k as the logarithm of the number of unique attribute values to be clustered.

2. Run an initial k-means clustering on the attribute values and compute the resulting cluster centers.

3. Run hierarchical clustering on the set of centers from the k-means clustering. Compute the centers of the resulting k clusters.

4. Run a final k-means clustering using the results of the hierarchical clustering step as seeds.
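The four steps can be sketched for a single attribute as shown below. This is an illustration under stated assumptions rather than the assistant's actual code: the natural logarithm with rounding, the number of initial k-means centers (`oversample * k`), their evenly spaced seeding, and the use of distances between group means in the hierarchical step are not given in the text, and all function names are hypothetical.

```python
import math


def assign(values, centers):
    """Group each value with its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        j = min(range(len(centers)), key=lambda c: abs(v - centers[c]))
        groups[j].append(v)
    return groups


def kmeans_1d(values, seeds, max_iter=100):
    """Lloyd's algorithm in one dimension, started from the given seeds."""
    centers = list(seeds)
    for _ in range(max_iter):
        groups = assign(values, centers)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def merge_to_k(points, k):
    """Hierarchical step: merge the two closest groups until k remain."""
    groups = [[p] for p in sorted(points)]

    def center(g):
        return sum(g) / len(g)

    while len(groups) > k:
        # Sorted 1-D groups: the closest pair of centers is always adjacent.
        i = min(range(len(groups) - 1),
                key=lambda j: center(groups[j + 1]) - center(groups[j]))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return [center(g) for g in groups]


def hybrid_discretize(values, oversample=3):
    """Return the intervals produced by the four-step hybrid routine."""
    distinct = sorted(set(values))
    n = len(distinct)
    # Step 1: k is the logarithm of the number of distinct values
    # (natural log and rounding are assumptions).
    k = max(1, round(math.log(n)))
    # Step 2: initial k-means with m >= k evenly spaced seeds (m is an assumption).
    m = min(n, oversample * k)
    seeds = [distinct[i * (n - 1) // max(1, m - 1)] for i in range(m)]
    initial = kmeans_1d(values, seeds)
    # Step 3: hierarchical clustering reduces the initial centers to k centers.
    merged = merge_to_k(initial, k)
    # Step 4: final k-means seeded with the merged centers.
    final = kmeans_1d(values, merged)
    # Report each non-empty cluster as the interval it covers.
    return [(min(g), max(g)) for g in assign(values, final) if g]


# Hypothetical attribute with three dense regions and 20 distinct values.
values = list(range(0, 7)) + list(range(50, 57)) + list(range(100, 106))
print(hybrid_discretize(values))   # [(0, 6), (50, 56), (100, 105)]
```

Seeding the final k-means from the hierarchical result removes the dependence on a random initial assignment, which is the weakness of plain k-means noted earlier in this section.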

This process produces discretizations (see Figures 4.6 and 4.7) that bin attribute values according to both their frequency and their similarity. Unlike equal width discretization, clustering is less likely to split up locally dense regions. Likewise, cluster populations are not explicitly bound to a maximum, as they are with equal frequency discretization. Thus, cluster-based discretization is less susceptible to the problems that affect equal width and equal frequency discretization.

Once the discretization of each attribute is complete, the discretized form of D is created by taking the cross product of the discretized ranges. This space is labeled F. At this point, every element e_i ∈ D can be mapped to an element f_i ∈ F. For the remainder of this document, if e_i is mapped to f_i ∈ F, f_i is called the discretized form of e_i.
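As an illustration of the cross product and of the mapping from e_i to f_i, the sketch below builds F for two attributes. The attribute names, their intervals, and the helper names are hypothetical; elements are represented as dictionaries purely for convenience.

```python
from itertools import product

# Hypothetical per-attribute discretizations: attribute name -> list of intervals.
discretized = {
    "price": [(0, 6), (50, 56), (100, 105)],
    "weight": [(1.0, 2.5), (7.0, 9.0)],
}


def build_space(discretized):
    """F: the cross product of every attribute's intervals."""
    return list(product(*discretized.values()))


def interval_of(value, intervals):
    """Return the interval that contains the value."""
    for lo, hi in intervals:
        if lo <= value <= hi:
            return (lo, hi)
    raise ValueError(f"{value} is not covered by any interval")


def discretized_form(element, discretized):
    """Map an element e (attribute name -> value) to its discretized form f in F."""
    return tuple(interval_of(element[name], ivs) for name, ivs in discretized.items())


F = build_space(discretized)
f = discretized_form({"price": 52, "weight": 8.0}, discretized)
print(len(F))   # 6
print(f)        # ((50, 56), (7.0, 9.0))
```

The size of F is the product of the per-attribute interval counts, which is one reason the number of clusters per attribute is kept small.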
