
3.3 Machine Learning and Data Mining

In this work, we aim to describe methods for machine learning and data mining over data and knowledge maintained in DL knowledge bases. In particular, we are concerned with methods for generating new DL concept expressions as hypotheses which describe patterns in the data. In this section, we describe the particular settings in machine learning and data mining which we address in this thesis, and in Section 3.4 we describe how we apply techniques for learning concepts in DL knowledge bases to these settings.

3.3.1 Supervised Learning Problems

Supervised learning problems typically take a set of examples $E$ in which each member $e_\omega \in E$ has been assigned some label $\omega \in \Omega$, where $|\Omega| \geq 2$. In this way, the set of examples can be partitioned into sets of examples sharing a common label $\omega$, denoted $E_\omega$, where

$$E = \bigcup_{\omega \in \Omega} E_\omega.$$

In a supervised learning problem, we seek to construct hypotheses $h$ which describe certain proportions of the labelled examples in each set $E_\omega$. We say that a hypothesis covers an example, denoted by the boolean function $\mathit{covers}(h, e_\omega)$, if $h$ describes example $e_\omega$ where $e_\omega \in E_\omega$. With respect to the set of all examples $E$, we denote the cover of hypothesis $h$ as the set

$$\mathit{cover}(h, E) = \{ e \in E \mid \mathit{covers}(h, e) \}.$$

6 However, highly optimised reasoner algorithms do exist to handle moderately large knowledge
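The cover of a hypothesis can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption: a hypothesis is modelled as a plain boolean predicate, whereas in a DL setting the covers test would be an instance check performed by a reasoner. The names `Hypothesis` and `is_even` and the integer examples are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Set

# Assumption: a hypothesis is modelled as a boolean predicate over examples.
# In a DL knowledge base, this test would be delegated to a reasoner instead.
Hypothesis = Callable[[Hashable], bool]


def covers(h: Hypothesis, e: Hashable) -> bool:
    """Return True when hypothesis h describes example e."""
    return bool(h(e))


def cover(h: Hypothesis, examples: Iterable[Hashable]) -> Set[Hashable]:
    """Return the set {e in E | covers(h, e)}."""
    return {e for e in examples if covers(h, e)}


# Illustrative data: examples are integers and the hypothesis is "e is even".
E = {1, 2, 3, 4, 5, 6}
is_even: Hypothesis = lambda e: e % 2 == 0
print(sorted(cover(is_even, E)))
```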

3.3.1.1 Classification

The typical binary classification problem in machine learning has two labels, $\Omega = \{+, -\}$, where $E_+$ are the positive examples and $E_-$ are the negative examples. Hypotheses are sought which cover all positive examples, $\forall e \in E_+ : \mathit{covers}(h, e)$, and none of the negative examples, $\forall e \in E_- : \neg\mathit{covers}(h, e)$.
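The two conditions sought in binary classification can be checked directly. The sketch below again assumes hypotheses are simple predicates; the function name `is_ideal_classifier` and the example data are hypothetical.

```python
from typing import Callable, Hashable, Set

# Assumption: a hypothesis is a boolean predicate over examples.
Hypothesis = Callable[[Hashable], bool]


def is_ideal_classifier(
    h: Hypothesis, positives: Set[Hashable], negatives: Set[Hashable]
) -> bool:
    """True when h covers every positive example and no negative example."""
    covers_all_positives = all(h(e) for e in positives)
    covers_no_negatives = not any(h(e) for e in negatives)
    return covers_all_positives and covers_no_negatives


# Illustrative data: even numbers are positive, odd numbers are negative.
E_pos = {2, 4, 6}
E_neg = {1, 3, 5}
is_even: Hypothesis = lambda e: e % 2 == 0
greater_than_one: Hypothesis = lambda e: e > 1

print(is_ideal_classifier(is_even, E_pos, E_neg))
print(is_ideal_classifier(greater_than_one, E_pos, E_neg))
```

In practice such a hypothesis may not exist or may over-fit, which is why thresholded measures are usually preferred to this strict test.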

The performance of any hypothesis $h$ in a learning problem is often assessed with a measure function $f$, which maps hypotheses $h$ from the set of all possible hypotheses $L$, together with their covers $\mathit{cover}(h, E) \subseteq E$, to real values.

Definition 3.3.1. (Measure Function) Given a set of labelled examples $E$ and the space of all hypotheses $L$, a measure function is a real-valued function $f : L \times \{E\} \mapsto \mathbb{R}$ which maps pairs of hypotheses $h \in L$ and the set of labelled examples $E$ to a real value denoting the performance of $h$ over $E$.

In order to describe when a hypothesis $h$ may be considered a solution to a learning problem based on its cover over a set of examples $E$, we define a threshold $\tau$ over $f$, where $f(h, E) \geq \tau$ describes $h$ as being a solution. A quality function is a boolean function which succeeds when $h$ is a solution in terms of some threshold $\tau$ on a measure function $f$.

Definition 3.3.2. (Quality Function) Given a set of labelled examples $E$ and the space of all hypotheses $L$, a quality function is a boolean function $Q : L \times \{E\} \mapsto \mathbb{B}$ which maps pairs of hypotheses $h \in L$ and the set of labelled examples $E$ to a boolean value denoting whether $h$ may be considered a solution to a learning problem over $E$. Quality functions are often defined in terms of a minimum threshold $\tau$ over a measure function $f$, where $Q(h, E) = (f(h, E) \geq \tau)$.
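The relationship between a measure function and a quality function can be sketched as a higher-order function that turns any measure $f$ and threshold $\tau$ into a boolean test. Everything named here is an assumption made for illustration: labelled examples are represented as a dictionary from example to label, and `positive_coverage` is a hypothetical stand-in measure, not one defined in this thesis.

```python
from typing import Callable, Dict, Hashable

# Assumptions: hypotheses are predicates; E maps each example to its label.
Hypothesis = Callable[[Hashable], bool]
LabelledExamples = Dict[Hashable, str]
Measure = Callable[[Hypothesis, LabelledExamples], float]
Quality = Callable[[Hypothesis, LabelledExamples], bool]


def make_quality(f: Measure, tau: float) -> Quality:
    """Build Q(h, E) = (f(h, E) >= tau) from a measure f and a threshold tau."""
    def quality(h: Hypothesis, examples: LabelledExamples) -> bool:
        return f(h, examples) >= tau
    return quality


def positive_coverage(h: Hypothesis, examples: LabelledExamples) -> float:
    """Hypothetical measure: the fraction of '+' examples covered by h."""
    positives = [e for e, label in examples.items() if label == "+"]
    if not positives:
        return 0.0
    return sum(1 for e in positives if h(e)) / len(positives)


# Illustrative data and hypotheses.
E = {2: "+", 4: "+", 6: "+", 1: "-", 3: "-"}
is_even: Hypothesis = lambda e: e % 2 == 0
greater_than_three: Hypothesis = lambda e: e > 3

Q = make_quality(positive_coverage, tau=0.9)
print(Q(is_even, E), Q(greater_than_three, E))
```

Separating the measure from the threshold keeps the same `make_quality` usable with any real-valued measure.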

An example of a commonly used measure function for assessing hypothesis performance in binary classification is accuracy, which is defined as follows.

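Definition 3.3.3. (Accuracy) Given a labelled set of examples $E$ where $E = \bigcup_{\omega \in \Omega} E_\omega$ and $\Omega = \{+, -\}$, partitioned into positive examples $E_+$ and negative examples $E_-$, a hypothesis $h$ and its cover $C$ where $C = \mathit{cover}(h, E)$, the accuracy function is defined as:

$$\mathit{acc}(h, E) = \frac{TP + TN}{TP + FP + TN + FN}$$

where

$$
\begin{aligned}
TP &= |E_+ \cap C| && \text{(true positives)} \\
FP &= |E_- \cap C| && \text{(false positives)} \\
TN &= |E_- \setminus C| && \text{(true negatives)} \\
FN &= |E_+ \setminus C| && \text{(false negatives)}
\end{aligned}
$$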

A quality function over accuracy may be defined as $Q(h, E) = (\mathit{acc}(h, E) > 0.95)$, which holds when hypothesis $h$ has an accuracy over 95%.
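The accuracy definition and its quality function translate directly into set operations. The following is a minimal sketch, assuming hypotheses are predicates and that the positive and negative examples are given as two disjoint Python sets; the data is invented for illustration.

```python
from typing import Callable, Hashable, Set

# Assumption: a hypothesis is a boolean predicate over examples.
Hypothesis = Callable[[Hashable], bool]


def accuracy(h: Hypothesis, positives: Set[Hashable], negatives: Set[Hashable]) -> float:
    """Compute acc(h, E) = (TP + TN) / (TP + FP + TN + FN)."""
    covered = {e for e in positives | negatives if h(e)}  # C = cover(h, E)
    tp = len(positives & covered)   # true positives
    fp = len(negatives & covered)   # false positives
    tn = len(negatives - covered)   # true negatives
    fn = len(positives - covered)   # false negatives
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty set of examples")
    return (tp + tn) / total


def quality(h: Hypothesis, positives: Set[Hashable], negatives: Set[Hashable]) -> bool:
    """Q(h, E) holds when the accuracy of h is over 95%."""
    return accuracy(h, positives, negatives) > 0.95


# Illustrative data: one negative example (10) is wrongly covered by "is even".
E_pos = {2, 4, 6, 8}
E_neg = {1, 3, 5, 10}
is_even: Hypothesis = lambda e: e % 2 == 0

print(accuracy(is_even, E_pos, E_neg))  # TP=4, FP=1, TN=3, FN=0, so 7/8 = 0.875
print(quality(is_even, E_pos, E_neg))
```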

As hypotheses in classification problems are often sought to exclusively cover examples of a single common label, they can be used for prediction. Given a new unseen, unlabelled example $u$, we may use a hypothesis $h$ which is deemed a solution to label $u$ by testing whether $\mathit{covers}(h, u)$ succeeds. The performance of a hypothesis considered a solution for a classification problem relative to a set of labelled examples $E$, also known as the training set, can be tested with a test set of unseen labelled examples, $U$. Any hypothesis $h$ which was induced over a training set of labelled examples $E$ can then be assessed for performance over unseen test data by computing its measure $f$ relative to $U$. If $h$ performs well over $E$ and $U$, we may consider $h$ to be a good classification hypothesis suitable for prediction, and may use it to provide labels for new examples. If $h$ performs poorly over $U$, then it may be considered a poor predictor. In this case, $h$ may have been induced to over-fit the set of training data $E$ such that it does not generalise well to previously unseen examples $U$.

One approach to assessing whether $h$ will generalise well to unseen examples is to split the set of examples $E$, composed of labelled sets $E_\omega$ for each $\omega \in \Omega$, into $k \geq 2$ training and test set pairs $(E_i, U_i)$ for $1 \leq i \leq k$, where $E_i \subset E$ and $U_i = E \setminus E_i$, and where each pair $(E_i, U_i)$ is composed of roughly the same proportion of labelled examples relative to $E$. We train a hypothesis on each $E_i$, assess its performance on the remaining examples $U_i$, and compute the overall performance as the average of the measure $f(h, U_i)$ over all $k$ pairs. This technique is known as cross validation. Partitioning the training set into $k$ different sets is called $k$-fold cross validation, as we generate $k$ different 'folds' of test and training data. Other techniques also exist for ensuring that hypotheses generalise well, such as ensuring that the expressions they are composed of are as simple as possible, according to the minimum description length principle [84], which formalises Occam's razor in that "among competing hypotheses, the one with the fewest assumptions should be selected".