The most naïve form of propositionalisation would involve simply ‘flattening’ the data, converting all the information related to a given relational instance into a series of attributes. However, there are a number of difficulties with this approach – for example, placing explicit labels on created attributes that indicate relationships not actually present in the data. To use the Mutagenesis dataset as an example, each Compound instance, under this transformation, would contain attributes for the elements, charges and quanta types for each atom in the compound (with the added complication that every instance would need to contain as many attributes as the largest instance, and smaller instances would have to deal with missing values). This would result in a structure similar to:
Compound, Atom1El, Atom1Ch, Atom1Qu, Atom2El, Atom2Ch, Atom2Qu, ...
In this representation the ordering of the atoms within each compound will have an effect on the models produced by attribute-value machine learning algorithms, as correspondences are explicitly drawn between atoms in the same position in the ordering – an ordering not present in the original dataset.
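The flattening described above can be illustrated with a minimal Python sketch. The compound names, atom values and the `flatten` helper are hypothetical and chosen only for illustration; they are not taken from the Mutagenesis data. The sketch pads smaller compounds with missing values and shows that listing the same atoms in a different order yields a different attribute vector.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical atom record: (element, charge, quanta type).
Atom = Tuple[str, float, int]


def flatten(compounds: Dict[str, List[Atom]]) -> Tuple[List[str], List[List[Optional[object]]]]:
    """Naively flatten each compound into one fixed-width row.

    Every row is as wide as the largest compound; smaller compounds
    are padded with None to stand for missing values.
    """
    width = max(len(atoms) for atoms in compounds.values())
    header = ["Compound"]
    for n in range(1, width + 1):
        header += [f"Atom{n}El", f"Atom{n}Ch", f"Atom{n}Qu"]

    rows = []
    for name, atoms in compounds.items():
        row: List[Optional[object]] = [name]
        for element, charge, quanta in atoms:
            row += [element, charge, quanta]
        row += [None] * (3 * (width - len(atoms)))  # pad missing atoms
        rows.append(row)
    return header, rows


# Made-up example values, for illustration only.
compounds = {
    "d1": [("c", -0.117, 22), ("h", 0.142, 3), ("o", -0.388, 40)],
    "d2": [("c", -0.117, 22), ("h", 0.142, 3)],
}
header, rows = flatten(compounds)

# The same compound with its atoms listed in another order gives a
# different row, although the underlying molecule is unchanged.
_, rows_reordered = flatten({"d1": list(reversed(compounds["d1"]))})
```

Here `rows[0]` and `rows_reordered[0]` describe the same set of atoms but differ position by position, which is the artificial ordering effect noted in the text.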
Rsd [80] (Relational Subgroup Discovery) takes a logic-based approach to propositionalisation. It computes all possible combinations of first-order predicates (within defined constraints) that could form useful features, then instantiates selected variables to produce features. The features produced in this way are then transformed into a set of Boolean values that denote, for each instance in the data, whether that instance is covered by that feature.
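The final step described above, turning features into Boolean coverage values, might be sketched as follows. This is not Rsd's feature construction: the two features are written by hand as Python predicates standing in for constructed first-order features, and the data and names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical atom record: (element, charge).
Atom = Tuple[str, float]

# Hand-written stand-ins for first-order features; each one tests
# whether a compound is covered by the feature.
features: Dict[str, Callable[[List[Atom]], bool]] = {
    "has_atom(el=o)": lambda atoms: any(el == "o" for el, _ in atoms),
    "has_atom(el=c, charge<0)": lambda atoms: any(
        el == "c" and ch < 0 for el, ch in atoms
    ),
}


def boolean_table(compounds: Dict[str, List[Atom]]) -> Dict[str, List[bool]]:
    """Return, per compound, one Boolean per feature (covered or not)."""
    return {
        name: [covers(atoms) for covers in features.values()]
        for name, atoms in compounds.items()
    }


# Made-up example values, for illustration only.
compounds = {
    "d1": [("c", -0.1), ("o", -0.4)],
    "d2": [("c", 0.2), ("h", 0.1)],
}
table = boolean_table(compounds)
```

Each row of `table` is a fixed-length Boolean vector that an attribute-value learner can consume, regardless of how many atoms the compound contains.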
Relaggs [43] uses relational database-oriented techniques, such as aggregation, to produce propositional representations that are not limited to Boolean values. For example, a set of relational attributes can be summarised by values such as minimum, maximum, mean, mode or frequency counts.
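As a rough illustration of aggregation-based propositionalisation, the sketch below summarises the atoms of one compound with the kinds of values listed above. It is not the Relaggs implementation; the `aggregate` helper, the attribute names and the data are assumptions made for the example.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical atom record: (element, charge).
Atom = Tuple[str, float]


def aggregate(atoms: List[Atom]) -> Dict[str, object]:
    """Summarise a non-empty list of atoms as a fixed set of attributes."""
    charges = [charge for _, charge in atoms]
    elements = Counter(element for element, _ in atoms)
    return {
        "n_atoms": len(atoms),
        "charge_min": min(charges),
        "charge_max": max(charges),
        "charge_mean": mean(charges),
        "element_mode": elements.most_common(1)[0][0],
        "count_c": elements["c"],  # frequency count for one element
    }


# Made-up example values, for illustration only.
d1 = [("c", -0.2), ("c", 0.1), ("o", -0.5)]
summary = aggregate(d1)
```

Unlike the flattened representation, these aggregates have the same width for every compound and, in this example, do not change when the atoms are listed in a different order.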
1.7 Learning with Unlabeled Data
The application of random relational rules to learning from data without explicit class labels is discussed in subsequent chapters of the thesis, so a brief overview of such learning is given here.
This section describes two methods for learning from data in which some or all of the instances are without explicit class labels – clustering, in which all of the data is unlabeled, and semi-supervised learning, in which a portion of the data is labeled and the remainder is not.
In supervised learning for classification, the learning algorithm is provided with a set of instances with class labels [13]. A model is derived from the labeled training data and used to classify test instances whose labels have been hidden. On the other hand, in unsupervised learning, the data has no class labels at all. Instead of classifying the data, unsupervised algorithms search for useful structure and groupings within the data.
1.7.1 Clustering
Clustering is a form of unsupervised learning, in which instances are divided into groups, generally according to some distance measure, such as the Euclidean distance [34]. Clustering is unlike the train-test procedure of supervised learning, where a model is produced on labeled training data and used to assign labels to test data, in that it takes a set of unlabeled instances and attempts to produce a meaningful grouping of instances within that set in the absence of class labels.
Instances within a cluster should be, according to the distance measure used, more similar to each other than they are to instances in other clusters. In fact, the ‘Cluster assumption’ states “If points are in the same cluster, they are likely to be of the same class” [13] (although the reverse does not necessarily hold), which suggests a method for assessing the quality of clustering if class labels are available for the data.
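One simple way to act on this idea, when class labels are available, is to measure how often the instances in a cluster share the cluster's most common class. The sketch below computes this quantity, usually called purity; the choice of purity is an assumption made for illustration and is not necessarily the measure used later in the thesis.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], class_labels: Sequence[Hashable]) -> float:
    """Fraction of instances that carry the majority class of their cluster."""
    by_cluster: Dict[Hashable, Counter] = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(class_labels)


# Made-up example: cluster 0 contains one 'b' among two 'a' instances.
score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

A purity of 1.0 means every cluster contains a single class. Because the reverse of the cluster assumption need not hold, a class split across several pure clusters still scores 1.0.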
The k-means algorithm is used to cluster data in Chapter 4 and is thus described here, in Algorithm 5. K-means is a partitioning clustering algorithm, meaning that it produces a single set of partitions (as opposed to methods that produce a nested series of partitions). K-means begins with some number (k) of randomly assigned partition centres (centroids) and iteratively reassigns partition centres (and thus the partitioning) until convergence is reached. An example of this process, reproduced from a diagram previously published in [25], is shown in Figure 1.5. Initially two points are randomly selected as cluster centres and each of the remaining points is assigned to whichever of the centre points it is closer to. Then the centroids of each cluster are calculated and the data points are assigned to their nearest centroids repeatedly until the clusters converge to the final stage shown.
The algorithm initially creates k cluster centres by selecting instances at random from the data, then assigns each remaining instance to the cluster with the nearest centre. Once this initialisation is complete, the k-means algorithm sets the centre of each cluster to its centroid and again assigns each instance to the closest centre. This process is repeated until it has converged on a set of centroids that will not change with further iterations.
Algorithm 5 Pseudocode for the k-means algorithm
Randomly select k instances as initial reference points R1..Rk
for each instance i in the data do
    Assign instance i to the closest of the k reference points
end for
Set new reference points R'1..R'k to be the centroids of the instances assigned to each reference point
converged = false
while converged = false do
    Set the reference points R1..Rk to R'1..R'k
    for each instance i in the data do
        Assign instance i to the closest of the k reference points
    end for
    Set new reference points R'1..R'k to be the centroids of the instances assigned to each reference point
    if ∀i Ri = R'i then
        converged = true
    end if
end while
Figure 1.5: k-means clustering process
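The k-means procedure described in this section can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the thesis: the function names, the `seed` and `max_iter` parameters, the use of squared Euclidean distance and the handling of empty clusters (the old centre is kept) are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (same ordering as Euclidean distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data: Sequence[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    rng = random.Random(seed)
    # Randomly select k instances as the initial reference points.
    centres: List[Point] = rng.sample(list(data), k)
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assign every instance to its closest reference point.
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in data
        ]
        # Move each reference point to the centroid of its instances.
        new_centres = []
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            # Assumption: an empty cluster keeps its previous centre.
            new_centres.append(centroid(members) if members else centres[j])
        if new_centres == centres:  # no centre moved: converged
            break
        centres = new_centres
    return centres, assignment


# Made-up two-dimensional data with two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, assignment = k_means(data, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initial selection.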
1.7.2 Semi-supervised Learning
Semi-supervised learning falls between supervised and unsupervised learning, in that semi-supervised algorithms utilise both labeled and unlabeled data. A common task in semi-supervised learning is, given a dataset in which some instances have class labels and some do not, to predict class labels for those instances without them.
An example of a basic Expectation-Maximisation (EM) [18] algorithm for semi-supervised learning is shown in Algorithm 6. This is a more general form of an algorithm discussed in [13].
Algorithm 6 EM algorithm for semi-supervised learning
Build a classifier from labeled instances only
while Classifier parameters improve do
    Use the current classifier to estimate a class for each unlabeled instance
    Re-estimate the classifier, given the estimated class membership of each instance
end while
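The loop in Algorithm 6 can be made concrete with a small Python sketch. Several simplifications are assumed here and are not part of the original algorithm: the classifier is a nearest-class-mean classifier, class estimates are hard assignments rather than probabilities, and "parameters improve" is approximated by "parameters changed since the last iteration". All names and data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Point = Tuple[float, ...]
Means = Dict[Hashable, Point]


def class_means(points: Sequence[Point], labels: Sequence[Hashable]) -> Means:
    """Classifier parameters: one mean vector per class."""
    sums: Dict[Hashable, Point] = {}
    counts: Dict[Hashable, int] = {}
    for p, y in zip(points, labels):
        prev = sums.get(y, (0.0,) * len(p))
        sums[y] = tuple(s + x for s, x in zip(prev, p))
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}


def predict(means: Means, p: Point) -> Hashable:
    """Assign p to the class with the nearest mean."""
    return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, means[y])))


def semi_supervised_em(labeled: Sequence[Point], labels: Sequence[Hashable],
                       unlabeled: Sequence[Point],
                       max_iter: int = 50) -> Tuple[Means, List[Hashable]]:
    # Build a classifier from labeled instances only.
    means = class_means(labeled, labels)
    guesses: List[Hashable] = []
    for _ in range(max_iter):
        # Estimate a class for each unlabeled instance.
        guesses = [predict(means, p) for p in unlabeled]
        # Re-estimate the classifier from labeled and estimated labels.
        new_means = class_means(list(labeled) + list(unlabeled),
                                list(labels) + guesses)
        if new_means == means:  # parameters no longer change
            break
        means = new_means
    return means, guesses


# Made-up one-dimensional data: two labeled and four unlabeled instances.
means, guesses = semi_supervised_em(
    labeled=[(0.0,), (10.0,)], labels=["a", "b"],
    unlabeled=[(1.0,), (2.0,), (8.0,), (9.0,)],
)
```

In this example the unlabeled instances pull the class means from 0 and 10 towards 1 and 9, showing how unlabeled data can refine a classifier built from very few labels.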
This form of learning is especially valuable in situations where there are large amounts of unlabeled data available, but it is expensive (in terms of time and/or money) to obtain labels for that data. Examples of such situations, also given in [13], include:
• Speech recognition – obtaining recorded speech is cheap, but transcribing (and thus labeling) it requires human effort.
• Webpage classification – vast numbers of webpages are freely available, but their classification requires human effort.
• Protein functions – large numbers of protein sequences are available, but classifying the function of a protein may take years of investigation.