Algorithm for Prediction - Machine learning algorithms for crime prevention and predictive poli

In this dataset, the Random Forests algorithm will be used to produce predictions of offender survival in the following way:

1. For each of a series of m crimes c1, ..., cm, there is a corresponding set of survival

times y1, ..., ym. The n potentially predictive factors attached to this crime are then

denoted by a set of values x1, ..., xn. The form of the dataset is given in Table 3.1

below:

Table 3.1: Dyfed-Powys Dataset Example: Survival Model. See Appendix for de- scriptions and explanations of independent variables.

ID x1 AgeCommitted x2 BurgValue ... c1 54 0 ... c2 17 290 ... c3 29 0 ... ... ... ... ... cm−1 33 5000 ... cm 21 25 ...

ID xn−1 MDAClass xn MultipleOffences y Survival Time (Days)

c1 A 0 y1 = 200 c2 U 1 y2 = 550 c3 U 0 y3 = 10 ... ... ... ... cm−1 C 0 ym−1 = 375 cm U 0 ym = 100

2. Since some of these survival times yi correspond to ”deaths” (i.e. the offender

having committed a reoffence at time yi), and others correspond to ”survivals” (i.e

the offender not having committed a reoffence by time yi), it also needs to be indi-

cated whether or not a variable is right-censored at yi. As such, the corresponding

indicators of censoring for each crime yi are denoted by δ1, ..., δm, where δi = 1 if

the offender has committed a reoffence at time yi and δi = 0 has ”survived” until

time yi without committing a reoffence.

3. Select a number of Decision Trees, B, to be grown for the forest. For each Decision Tree b ∈ B, sample, with replacement, a subset of m crimes c1, ..., ci i < m.

planatory variables within the dataset. In this case, j will be set to its default,√n rounded to the next largest integer, then later tuned using the same Grid Search method as detailed in Chapter 2.

5. By recursively splitting the dataset into subsets so that one parent node splits into two separate child nodes, considering all j available features in the bootstrap sample for that tree and all jvsplit values for that feature, either the feature j∗ and the value

of that feature jv∗ that maximises survival difference in the child nodes according to

a selected statistic (Logrank, C, Maxstat), or a random value from those available to split the sample at the parent node (Extremely Randomised Trees) will be chosen.

6. Using this method, grow each Decision Tree b ∈ B to its maximum depth, i.e. keep splitting the data under the constraint that a terminal node should have no less than a predefined number d0 _{> 0 of unique deaths, determined by the min.node.size}

parameter under ranger.

7. Once each Decision Tree has been grown, the individual survival times for the offender following each crime ci within the dataset can be predicted. These pre-

dictions are produced from the set of terminal nodes H of each decision tree, i.e. the most extreme set of nodes in each of the trees grown by the algorithm, in the following way:

7.1. For a terminal node h ∈ H containing T survival times, let the subset of the survival times in the original dataset that exists at node h be denoted by y1,h, ..., yT ,h

and the corresponding censoring times by δ1,h, ..., δT ,h.

7.2. The unique reoffence times, i.e. the list of all possible survival times in that node for which δi = 1, tl,h, can then be defined to be a sequence t1,h < t2,h < ... < tT,h.

7.3. Furthermore, defining dl,h to be the number of deaths at time tl,h and sl,h to be

the number of individuals at risk, an estimate of the cumulative hazard function at node h, as defined by the Nelson-Aalen estimator, will be calculated:

ˆ Ah(t) = X tl,h≤t dl,h sl,h (3.1) An illustration of this concept is detailed in Figure 3.1 below.

Figure 3.1: Tree Example 1: Nelson-Aalen Estimator, ˆAh(t), in Terminal Nodes

Every crime ci that occupies the same terminal node h will have the same cu-

mulative hazard function estimate. For example, if crimes c23, c30and c45 all occupy

the leftmost node in the above tree, then they will all have the same cumulative hazard estimate ˆAh(t) = 0.75. To determine, therefore, the estimate of the cumu-

lative hazard function for any crime ci, the crime is then simply dropped down the

tree until it falls into one of the terminal nodes h ∈ H.

8. The cumulative hazard function A(t|ci) for this crime ci will then be the Nelson-

Aalen estimator for ci’s terminal node, defined as follows:

A(t|ci) = ˆAh(t) if ci ∈ h (3.2)

9. Once cumulative hazard function estimates A∗_b(t|ci) have been produced for

each tree b ∈ B, the cumulative hazard function is averaged across each of the B decision trees to obtain the ensemble cumulative hazard function.

For a crime ci, the ensemble cumulative hazard function A∗e(t|ci) is as follows:

A∗_h(t|ci) = 1 B B X b=1 A∗_b(t|ci) (3.3)

Assume that the following two trees exist in addition to the tree shown in Figure 3.1:

Figure 3.2: Tree Examples 2 and 3: Nelson-Aalen Estimator, ˆAh(t), in Terminal

Nodes

In this case, cumulative hazard estimates and ensemble cumulative hazard are calculated as follows:

Table 3.2: Nelson-Aalen Estimators of Cumulative Hazard ˆAh(t) in each Terminal

Node and Ensemble Cumulative Hazard A∗_e(t|ci)

Tree Node 1 Node 2 Node 3 Node 4 Tree b = 1 0.75 0.12 0.93 0.47 Tree b = 2 0.72 0.33 0.97 0.51 Tree b = 3 0.81 0.15 0.95 0.40 Ensemble (Average) 0.76 0.2 0.95 0.46

For each tree b ∈ B, m − i crimes from the training set are not included in the bootstrapping process. These crimes are known as the out of bag (OOB) samples and can be used to validate the effectiveness of the training data’s predictions on an unseen dataset. As such, it is also of interest to generate the cumulative hazard

function and corresponding survival probability estimates for the out of bag samples. These estimates are calculated as follows:

If a crime ci is an out of bag sample for tree b, set Ii,b = 1 and if otherwise, set

Ii,b = 0 . The out of bag ensemble cumulative hazard function is then calculated as

follows: A∗∗_h (t|ci) = PB b=1Ii,bA∗b(t|ci) PB b=1Ii,b (3.4) Now that it has been determined how trees will be grown and how the survival function of an individual crime will be determined, an outline of how these estimates of survival will be evaluated against the actual survival times for the training, test and validation datasets will be provided below.

3.2.1 Split Rules

In order to provide a complete overview of the possibilities inherent within this method, four different variants of the Random Forests algorithm will be tested, which provide four different perspectives on how an individual Decision Tree within the forest should be grown. These variants, which all alter the rule that decides the values jv of a feature j at which a node should be split, are all available within

ranger’s [93] R implementation, which has been chosen for both its relative speed and flexibility at the time of writing.

Split Rule 1: Log-rank Statistic (logrank) The first variant of this algorithm that will be tested uses the log-rank statistic, a special case of the Gehan statistic [31], to decide at which variables, and at which value of these variables, a Decision Tree’s node would be optimally split. In this implementation, the nodes are split by maximising the log-rank statistic, which compares the relative hazard of two nodes of a tree for each feature value jv in order to maximise the difference in survival

times between these nodes. As the split rule originally recommended by Ishwaran et al. [43], this can be considered to be the ’default’ option. With the use of this split rule being based on the proportional hazards assumption, however, concerns have been expressed with regards to its use in this context.

Split Rule 2: Extremely Randomised Trees (extratrees) The second variant that will be tested is known as Extremely Randomised Trees [33], which (true to its name) chooses the values jv for the split at each Decision Tree node randomly,

producing a series of completely randomised trees. Compared to the log-rank statistic method, the Extremely Randomised Trees method can address some issues with bias (i.e. overfitting to training data) and is usually much faster to compute, but may underfit to the training set.

Split Rule 3: Harrell’s C Statistic (C) The third variant, as proposed by the authors of the ranger package [76], uses Harrell’s C statistic [46], another special case of the Gehan statistic, to determine the optimal split at each node. The use of this statistic for the purpose of building optimally split trees is very intuitive, as this statistic is commonly used to evaluate the predictive accuracy of the Random Forest. Compared to the log-rank method, this method has often been shown to perform favourably if the dataset contains a lot of censored data or comparatively more continuous independent variables, but for large or noisy datasets (i.e. datasets that contain a lot of uninformative independent variables) there can be some performance issues.

Split Rule 4: Maximally Selected Rank Statistics (maxstat) The fourth variant, as similarly proposed by the authors of the ranger package [92], uses maximally selected rank statistics [50], an alternative to the linear rank statistics em- ployed in conditional inference forests, to determine the optimal splits for each node. This method, which uses p-value approximations to determine splits instead of the usual conditional Monte-Carlo methods, avoids the log-rank statistic’s tendency to be biased towards selecting variables with a greater number of possible split points [85] and also offers some improvement in algorithm runtime - this is due to the p-value approximation used in the Ranger package, which is detailed in the original paper.

In document Machine learning algorithms for crime prevention and predictive policing (Page 95-100)