A hidden Markov model for criminal behaviour classification

Francesco Bartolucci, Institute of economic sciences,

Urbino University, Italy.

Fulvia Pennoni, Department of Statistics,

Background

Analysis of criminal behaviour: we want to model offending patterns while taking into account the nature of offending and the sequence of offence types;

criminal histories are taken from official records: the England and Wales Offenders Index, a court-based record of the criminal histories of all offenders in England and Wales from 1963 to the current day;

general population sample of n = 5,470 individuals paroled from the cohort of those born in 1953, and followed through to 1993;

offences are combined into J = 10 major categories described in the Offenders Index Codebook (1998);

following Francis et al. (2004) we define T = 6 time windows or age strips: 10-15, 16-20, 21-25, 26-30, 31-35, 36-40.

Univariate Latent Markov model

Used by Bijleveld and Mooijaart (2003): the offending pattern of a subject within age strip $t$, $t = 1, \ldots, T$, is represented by a single discrete random variable $X_t$;

$\{X_t\}$ depends only on a latent random process $\{C_t\}$;

$\{C_t\}$ follows a first-order homogeneous Markov chain with $k$ states, initial probabilities $\pi_c$ and transition probabilities $\pi_{c_1 c_2}$;

the joint distribution of $\{X_t\}$ may be expressed as

$$p(X_1 = x_1, \ldots, X_T = x_T) = \sum_{c_1} \phi_{x_1|c_1} \pi_{c_1} \sum_{c_2} \phi_{x_2|c_2} \pi_{c_1 c_2} \cdots \sum_{c_T} \phi_{x_T|c_T} \pi_{c_{T-1} c_T},$$

where $\phi_{x|c} = p(X_t = x \mid C_t = c)$.
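The nested sums over latent states do not need to be enumerated path by path: a forward recursion gives the same joint probability in a number of operations proportional to T k². A minimal sketch in Python, assuming hypothetical parameter values (`init`, `trans`, `emis`) chosen only for illustration:

```python
from itertools import product


def sequence_probability(x, init, trans, emis):
    """p(X_1 = x_1, ..., X_T = x_T) computed by the forward recursion.

    init[c]       = pi_c          (initial probabilities)
    trans[c1][c2] = pi_{c1 c2}    (transition probabilities)
    emis[c][x]    = phi_{x|c}     (conditional response probabilities)
    """
    k = len(init)
    # forward[c] = p(x_1, ..., x_t, C_t = c)
    forward = [init[c] * emis[c][x[0]] for c in range(k)]
    for xt in x[1:]:
        forward = [
            emis[c2][xt] * sum(forward[c1] * trans[c1][c2] for c1 in range(k))
            for c2 in range(k)
        ]
    return sum(forward)


# Hypothetical two-state chain with a binary response (illustration only).
init = [0.7, 0.3]
trans = [[0.9, 0.1], [0.2, 0.8]]
emis = [[0.95, 0.05], [0.40, 0.60]]

# The probabilities of all 2^3 sequences of length 3 must add up to one.
total = sum(
    sequence_probability(seq, init, trans, emis)
    for seq in product([0, 1], repeat=3)
)
print(total)
```

Checking that the probabilities of all possible sequences sum to one is a cheap sanity test for any implementation of the recursion.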

Multivariate Extension

$X_{tj}$ is a binary random variable equal to 1 if the subject is convicted for an offence of type $j$ within age strip $t$ and to 0 otherwise;

we assume local independence, i.e. that for $t = 1, \ldots, T$ the variables $X_{t1}, \ldots, X_{tJ}$ are conditionally independent given $C_t$:

$$\phi_{x|c} = p(X_t = x \mid C_t = c) = \prod_{j=1}^{J} \lambda_{j|c}^{x_j} (1 - \lambda_{j|c})^{1 - x_j},$$

where $\lambda_{j|c} = p(X_{tj} = 1 \mid C_t = c)$, $X_t = (X_{t1}, \ldots, X_{tJ})$ and $x_j$ denotes the $j$-th element of the vector $x$.
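Under local independence the probability of a whole conviction vector is just a product of Bernoulli terms. A small sketch, assuming hypothetical values of λ for J = 3 offence types (the names `emission_probability` and `lam_c` are illustrative):

```python
from itertools import product


def emission_probability(x, lam_c):
    """phi_{x|c} = prod_j lam_{j|c}^{x_j} * (1 - lam_{j|c})^{1 - x_j}."""
    p = 1.0
    for xj, lam_j in zip(x, lam_c):
        p *= lam_j if xj == 1 else 1.0 - lam_j
    return p


# Hypothetical conviction probabilities for one latent state c (J = 3).
lam_c = [0.30, 0.05, 0.10]

# Convicted for offence type 1 only: 0.30 * 0.95 * 0.90.
p_first_only = emission_probability((1, 0, 0), lam_c)

# The 2^J possible conviction vectors exhaust the sample space.
total = sum(emission_probability(x, lam_c) for x in product([0, 1], repeat=3))
print(p_first_only, total)
```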

Restricted version of the model (unidimensional Rasch)

We assume that for each type of offence we have

$$\mathrm{logit}(\lambda_{j|c}) = \alpha_c + \beta_j, \qquad (1)$$

where

$\alpha_c$ is the tendency to commit crimes of a subject in latent class $c$ (i.e. an individual characteristic);

$\beta_j$ is the easiness of committing a crime of type $j$;

the constraint allows the latent classes to be labelled so that they are ordered:

$$\lambda_{j|1} \le \cdots \le \lambda_{j|k}, \quad j = 1, \ldots, J;$$

this constraint is used to formulate a latent class version of the Rasch (1961) model, which is well known in the psychometric literature.
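Constraint (1) can be illustrated numerically: once the classes are labelled by increasing tendency, every offence type inherits the same ordering of conviction probabilities. A sketch with hypothetical values of α and β (not the estimates reported later):

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def rasch_probabilities(alpha, beta):
    """lam[c][j] = expit(alpha_c + beta_j), as in constraint (1)."""
    return [[expit(a + b) for b in beta] for a in alpha]


alpha = [0.0, 2.0, 4.0]    # hypothetical tendencies, sorted increasingly
beta = [-5.0, -7.0, -6.0]  # hypothetical easiness parameters

lam = rasch_probabilities(alpha, beta)

# With alpha sorted, lam[c][j] is non-decreasing in c for every offence j.
ordered = all(
    lam[c][j] <= lam[c + 1][j]
    for c in range(len(alpha) - 1)
    for j in range(len(beta))
)
print(ordered)
```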

Restricted version of the model (multidimensional Rasch)

The previous model assumes that all types of offence depend on the same latent trait: this may be too restrictive;

we consider that the crimes may be partitioned into $s$ homogeneous subgroups so that

$$\mathrm{logit}(\lambda_{j|c}) = \sum_{d=1}^{s} \delta_{jd} \alpha_{cd} + \beta_j, \qquad (2)$$

where

$\alpha_{cd}$ is the tendency of a subject in latent class $c$ to commit crimes in subgroup $d$;

$\delta_{jd}$ is equal to 1 if crime $j$ is in subgroup $d$ and to 0 otherwise;

we can classify the offences into groups where crimes belonging to the same group depend on the same latent trait.
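Because each row of δ contains a single 1, the sum in (2) simply picks the tendency of the subgroup to which the crime belongs. A sketch with a hypothetical partition of J = 4 offence types into s = 2 subgroups (all values are illustrative):

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def multidim_rasch(alpha, beta, delta):
    """lam[c][j] = expit(sum_d delta[j][d] * alpha[c][d] + beta[j])."""
    s = len(delta[0])
    return [
        [
            expit(sum(delta[j][d] * alpha_c[d] for d in range(s)) + beta[j])
            for j in range(len(beta))
        ]
        for alpha_c in alpha
    ]


# Hypothetical setting: offences 1-2 in subgroup 1, offences 3-4 in subgroup 2.
delta = [[1, 0], [1, 0], [0, 1], [0, 1]]
alpha = [[0.0, 0.0], [3.0, 1.0]]  # alpha[c][d] for k = 2 latent classes
beta = [-5.0, -6.0, -5.0, -7.0]

lam = multidim_rasch(alpha, beta, delta)
print(lam[1])
```

In this toy setting the second class has a high tendency for subgroup 1 and a low one for subgroup 2, which a single latent trait could not represent.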

Likelihood inference

The log-likelihood of the model for an observed cohort of $n$ subjects is

$$l(\theta) = \sum_{i=1}^{n} \log[L_i(\theta)],$$

where $\theta$ denotes the vector of all parameters and $L_i(\theta)$ is the probability $p(x_{i1}, \ldots, x_{iT})$ defined above, evaluated at $\theta$;

$L_i(\theta)$ may be computed through the well-known recursions in the hidden Markov literature (see Levinson et al., 1983, and MacDonald and Zucchini, 1997, Sec. 2.2);

$l(\theta)$ is maximized with the EM algorithm, which requires the complete data log-likelihood; this may be expressed as

$$l^*(\theta) = \sum_c v_{\cdot 1 c} \log \pi_c + \sum_{c_1} \sum_{c_2} u_{c_1 c_2} \log \pi_{c_1 c_2} + \sum_i \sum_t \sum_c v_{itc} \sum_j \{ x_{itj} \log \lambda_{j|c} + (1 - x_{itj}) \log(1 - \lambda_{j|c}) \},$$

where $v_{itc}$ is a dummy variable, referred to the $i$-th subject, which is equal to 1 if $C_t = c$ and to 0 otherwise, $v_{\cdot tc} = \sum_i v_{itc}$ and $u_{c_1 c_2}$ is the number of transitions from the $c_1$-th to the $c_2$-th state.

EM algorithm

E: computes the conditional expected value of $l^*(\theta)$, given the observed data and the current value of the parameters.

M: updates the parameter estimates by maximizing the expected value of $l^*(\theta)$ computed above.

When the model is constrained (unidimensional or multidimensional Rasch), the parameters $\alpha_{cd}$ and $\beta_j$ are estimated by fitting to the data a logistic model with a suitable design matrix $Z$, defined according to the model of interest.
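For the unconstrained model, maximizing the expected value of $l^*(\theta)$ has a closed form: each probability is a ratio of expected counts. This is a standard result for log-likelihoods of this form rather than something stated on the slides. A minimal sketch, assuming the E-step has already produced the expected values `v_hat[i][t][c]` and `u_hat[c1][c2]` (hypothetical names, filled here with a toy hard assignment) and that every state has positive expected weight:

```python
def m_step(x, v_hat, u_hat):
    """Closed-form M-step for the unconstrained model.

    x[i][t][j]     observed binary responses
    v_hat[i][t][c] expected value of v_itc given the data
    u_hat[c1][c2]  expected number of transitions from c1 to c2
    """
    n, T, J, k = len(x), len(x[0]), len(x[0][0]), len(u_hat)
    # pi_c: expected share of subjects starting in state c
    init = [sum(v_hat[i][0][c] for i in range(n)) / n for c in range(k)]
    # pi_{c1 c2}: expected transitions normalised by row
    trans = [[u / sum(row) for u in row] for row in u_hat]
    # lambda_{j|c}: weighted proportion of convictions of type j in state c
    lam = []
    for c in range(k):
        weight = sum(v_hat[i][t][c] for i in range(n) for t in range(T))
        lam.append([
            sum(v_hat[i][t][c] * x[i][t][j] for i in range(n) for t in range(T))
            / weight
            for j in range(J)
        ])
    return init, trans, lam


# Toy data: n = 2 subjects, T = 3 occasions, J = 1 offence type, k = 2 states.
x = [[[0], [1], [1]], [[1], [0], [1]]]
# Hard assignments: subject 1 follows states (1, 2, 2), subject 2 (1, 1, 2).
v_hat = [[[1, 0], [0, 1], [0, 1]], [[1, 0], [1, 0], [0, 1]]]
u_hat = [[1, 2], [0, 1]]

init, trans, lam = m_step(x, v_hat, u_hat)
print(init, trans, lam)
```

In a full EM iteration `v_hat` and `u_hat` would be posterior expectations obtained from forward and backward recursions, not 0/1 values.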

Choice of the number of classes (k)

The optimal number of latent classes can be chosen with the likelihood ratio between the model with $k$ states and that with $k + 1$ states, $D_k = -2(\hat{l}_k - \hat{l}_{k+1})$, for increasing values of $k$;

or using the Bayesian Information Criterion (Kass and Raftery, 1995), defined as

$$BIC_k = -2 \hat{l}_k + r_k \log(n),$$

where $r_k$ is the number of parameters in the model with $k$ states.

According to this strategy, the optimal number of states is the one for which $BIC_k$ is minimum.

Choice of the number of latent traits

The crimes are clustered using a hierarchical algorithm.

At each step the algorithm aggregates the two clusters of crimes which are the closest in terms of deviance between the model fitted at the previous step and the multidimensional Rasch model fitted after the aggregation of the two clusters.

The steps are iterated until the BIC of the resulting model is lower than that of the unconstrained model.

An application

We applied the model to a sample of n = 5,470 males taken from the dataset illustrated above;

we used the estimated number of live births in the cohort year 1953 as reported by Prime et al. (2001).

For a number of classes between 1 and 7 we obtain

| $k$ | $l_k$ | $r_k$ | $BIC_k$ |
|---|---|---|---|
| 1 | −21,341 | 10 | 42,768 |
| 2 | −20,076 | 23 | 40,349 |
| 3 | −19,643 | 38 | 39,612 |
| 4 | −19,284 | 55 | 39,041 |
| 5 | −19,142 | 74 | 38,921 |
| 6 | −19,086 | 95 | 38,990 |
| 7 | −19,010 | 118 | 39,036 |
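The BIC column can be recomputed from the other two. The sketch below assumes that n = 5,470 is the sample size entering the penalty term; with this value the reported figures are reproduced to within the rounding of the log-likelihoods. It also checks that the parameter counts agree with k² + Jk − 1 (k − 1 initial probabilities, k(k − 1) transition probabilities and Jk conditional probabilities), a formula that matches the table but is not stated on the slides.

```python
import math

n = 5470  # assumed sample size in the BIC penalty
J = 10    # number of offence categories

loglik = {1: -21341, 2: -20076, 3: -19643, 4: -19284,
          5: -19142, 6: -19086, 7: -19010}
n_par = {1: 10, 2: 23, 3: 38, 4: 55, 5: 74, 6: 95, 7: 118}
reported_bic = {1: 42768, 2: 40349, 3: 39612, 4: 39041,
                5: 38921, 6: 38990, 7: 39036}

# BIC_k = -2 l_k + r_k log(n)
bic = {k: -2 * loglik[k] + n_par[k] * math.log(n) for k in loglik}

# Likelihood ratio statistics D_k = -2 (l_k - l_{k+1})
lr_stat = {k: -2 * (loglik[k] - loglik[k + 1]) for k in range(1, 7)}

best_k = min(bic, key=bic.get)
for k in sorted(bic):
    print(k, round(bic[k]), reported_bic[k], lr_stat.get(k))
print("selected k:", best_k)
```

The minimum is attained at k = 5, the number of latent states used in the estimates that follow.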

Choice of the clusters

Using the hierarchical algorithm, the best fit (BIC = 35,433) was obtained for the following aggregation of the 10 types of crime into clusters, shown together with the estimates of the β's.

| Offence category ($j$) | latent trait (1, 2 or 3) | $\beta_j$ |
|---|---|---|
| Violence against the person | X | −5.824 |
| Sexual offences | X | −7.787 |
| Burglary | X | −7.004 |
| Robbery | X | −10.212 |
| Theft and handling stolen goods | X | −5.375 |
| Fraud and Forgery | X | −6.473 |
| Criminal Damage | X | −5.890 |
| Drug Offences | X | −6.720 |
| Motoring Offences | X | −8.170 |

Estimated α parameters

Values of the estimated tendencies of the subject for each latent state in every subgroup

| $c$ | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| 1 | 0.000 | | |
| 2 | −0.134 | 2.860 | −9.513 |
| 3 | 3.315 | 7.100 | 6.192 |
| 4 | 3.831 | 4.445 | 5.02 |
| 5 | 5.283 | 6.990 | 7.439 |

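Estimate of π and Π

Initial probabilities $\pi_c$

| $\pi_1$ | $\pi_2$ | $\pi_3$ | $\pi_4$ | $\pi_5$ |
|---|---|---|---|---|
| 0.393 | 0.552 | 0.054 | 0.000 | 0.000 |

Transition probabilities $\pi_{c_1 c_2}$ of the Markov chain (rows: state of origin $c_1$; columns: state of destination $c_2$)

| $c_1$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0.996 | 0.000 | 0.000 | 0.003 | 0.000 |
| 2 | 0.364 | 0.375 | 0.010 | 0.226 | 0.024 |
| 3 | 0.000 | 0.241 | 0.288 | 0.172 | 0.300 |
| 4 | 0.555 | 0.012 | 0.000 | 0.429 | 0.005 |
| 5 | 0.000 | 0.071 | 0.014 | 0.445 | 0.470 |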
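Since the chain is homogeneous, the distribution of the latent state in each age strip follows from the estimates above by repeated multiplication with the transition matrix. A sketch under two assumptions: the last row of the matrix refers to state 5, and each row is renormalised because the printed values are rounded to three decimals.

```python
init = [0.393, 0.552, 0.054, 0.000, 0.000]
trans = [
    [0.996, 0.000, 0.000, 0.003, 0.000],
    [0.364, 0.375, 0.010, 0.226, 0.024],
    [0.000, 0.241, 0.288, 0.172, 0.300],
    [0.555, 0.012, 0.000, 0.429, 0.005],
    [0.000, 0.071, 0.014, 0.445, 0.470],
]


def normalise(p):
    """Rescale a vector of rounded probabilities so that it sums to one."""
    total = sum(p)
    return [v / total for v in p]


def marginal_distributions(init, trans, T):
    """Return p(C_t = c) for t = 1, ..., T as a list of T vectors."""
    k = len(init)
    dist = [init]
    for _ in range(T - 1):
        prev = dist[-1]
        dist.append([
            sum(prev[c1] * trans[c1][c2] for c1 in range(k))
            for c2 in range(k)
        ])
    return dist


init = normalise(init)
trans = [normalise(row) for row in trans]

state_dist = marginal_distributions(init, trans, T=6)
for t, p in enumerate(state_dist, start=1):
    print(t, [round(v, 3) for v in p])
```

The near-absorbing first row (0.996) means that the probability of state 1 increases from the first age strip to the second.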

Advantages of the proposed methodology

We achieve a parsimonious description of the dynamic process underlying the data;

the approach is based on a general population sample and not on an offender-based sample as in other studies;

it allows a wide choice of models to be estimated and the best one to be chosen, ranging from the simple latent class model to the constrained model with subgroups;

it can provide important information for policy, such as incarceration or incapacitation policy against the offenders.

Future extensions

Constrain the probabilities $\lambda_{j|c}$ to be equal to 0 for one latent class, so that this class may be identified as that of non-offending subjects;

consider also models in which the transition probabilities may vary with age (non-homogeneous Markov chains);

consider restricted models in which the transition matrix has a particular structure (e.g. triangular, symmetric).

References

Bijleveld, C. J. H., and Mooijaart, A. (2003). Latent Markov Modelling of Recidivism Data. Statistica Neerlandica, 57, 3, 305-320.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., Series B, 39, 1-38.

Feng, Z. and McCulloch, C. E. (1996). Using Bootstrap Likelihood Ratios in Finite Mixture Models. J. R. Statist. Soc., Series B, 58, 3, 609-617.

Francis, B., Soothill, K. and Fligelstone, R. (2004). Identifying Patterns and Pathways of Offending Behaviour: A New Approach to Typologies of Crime. European Journal of Criminology, 1, 47-87.

Kass, R. E. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773-795.

Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.

Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983). An introduction to an application of theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62, 1035-74.

Lindsay, B., Clogg, C. and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis.

McCutcheon, A. L. and Thomas, G. (1995). Patterns of drug use among white institutionalized delinquents in Georgia. Evidence from a latent class analysis. Journal of Drug Education, 25, 61-71.

MacDonald, I. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. London: Chapman & Hall.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.

Research Development and Statistics Directorate (1998). Offenders Index Codebook. London: Home Office. Available at http://homeoffice.gov.uk/rds/pdfs/oicodes.pdf.

Prime, J., White, S., Liriano, S. and Patel, K. (2001). Criminal careers of those born between 1953 and 1978. Statistical Bulletin 4/01. London: Home Office.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology, Proceedings of the IV Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-333.

Wiggins, L. M. (1973). Panel Analysis: Latent Probability Models for Attitudes and Behavior Processes. Amsterdam: Elsevier.