A hidden Markov model
for criminal behaviour classification
Francesco Bartolucci, Institute of economic sciences,
Urbino University, Italy.
Fulvia Pennoni, Department of Statistics,
Background
Analysis of criminal behaviour: we want to model offending patterns while taking into account the nature of offending and the sequence of offence types;
criminal histories recorded as official histories: the England and Wales Offenders Index, a court-based record of the criminal histories of all offenders in England and Wales from 1963 to the current day;
general population sample of n = 5,470 individuals taken from the cohort of those born in 1953 and followed through to 1993;
offences are combined into J = 10 major categories described in the Offenders Index Codebook (1998);
following Francis et al. (2004) we have defined T = 6 time windows or age strips: 10-15, 16-20, 21-25, 26-30, 31-35, 36-40.
Univariate Latent Markov model
Used by Bijleveld and Mooijaart (2003): the offending pattern of a subject within age strip $t$, $t = 1, \ldots, T$, is represented by a single discrete random variable $X_t$;
$\{X_t\}$ depends only on a random process $\{C_t\}$;
$\{C_t\}$ follows a first-order homogeneous Markov chain with $k$ states, initial probabilities $\pi_c$ and transition probabilities $\pi_{c_1 c_2}$;
the joint distribution of $\{X_t\}$ may be expressed as
$$p(X_1 = x_1, \ldots, X_T = x_T) = \sum_{c_1} \phi_{x_1|c_1} \pi_{c_1} \sum_{c_2} \phi_{x_2|c_2} \pi_{c_1 c_2} \cdots \sum_{c_T} \phi_{x_T|c_T} \pi_{c_{T-1} c_T},$$
where $\phi_{x|c} = p(X_t = x \mid C_t = c)$.
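The nested sums in the joint distribution can be evaluated one age strip at a time instead of enumerating all $k^T$ latent paths. The following is a minimal sketch of that forward recursion, checked against brute-force enumeration; the function names and the toy values of `init`, `trans` and `emis` are hypothetical and are not taken from the study.

```python
from itertools import product

def joint_prob_forward(x, init, trans, emis):
    """p(X_1 = x_1, ..., X_T = x_T) via the forward recursion.

    init[c]       plays the role of pi_c
    trans[c1][c2] plays the role of pi_{c1 c2}
    emis[c][x]    plays the role of phi_{x|c}
    """
    k = len(init)
    # alpha[c] = p(x_1, ..., x_t, C_t = c), starting at t = 1
    alpha = [init[c] * emis[c][x[0]] for c in range(k)]
    for xt in x[1:]:
        alpha = [
            emis[c2][xt] * sum(alpha[c1] * trans[c1][c2] for c1 in range(k))
            for c2 in range(k)
        ]
    return sum(alpha)

def joint_prob_brute(x, init, trans, emis):
    """Same quantity, summing explicitly over every latent path."""
    k = len(init)
    total = 0.0
    for path in product(range(k), repeat=len(x)):
        p = init[path[0]] * emis[path[0]][x[0]]
        for t in range(1, len(x)):
            p *= trans[path[t - 1]][path[t]] * emis[path[t]][x[t]]
        total += p
    return total

# Toy example (assumed values): k = 2 states, binary X_t, T = 3.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.3, 0.7]]
emis = [[0.8, 0.2], [0.1, 0.9]]
x = [0, 1, 1]
p_forward = joint_prob_forward(x, init, trans, emis)
p_brute = joint_prob_brute(x, init, trans, emis)
```

The two functions agree up to rounding error, but the recursion needs on the order of $T k^2$ operations rather than $k^T$.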
Multivariate Extension
$X_{tj}$ is a binary random variable equal to 1 if the subject is convicted of an offence of type $j$ within age strip $t$ and to 0 otherwise;
we assume local independence, i.e. that for $t = 1, \ldots, T$ the variables $X_{t1}, \ldots, X_{tJ}$ are conditionally independent given $C_t$:
$$\phi_{x|c} = p(X_t = x \mid C_t = c) = \prod_{j=1}^{J} \lambda_{j|c}^{x_j} (1 - \lambda_{j|c})^{1 - x_j},$$
where $\lambda_{j|c} = p(X_{tj} = 1 \mid C_t = c)$, $X_t = (X_{t1}, \ldots, X_{tJ})$ and $x_j$ denotes the $j$-th element of the vector $x$.
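Under local independence the probability of a whole conviction pattern is a product of Bernoulli terms. A small sketch, assuming hypothetical conviction probabilities for one latent state and J = 3 offence types:

```python
from itertools import product

def emission_prob(x, lam_c):
    """phi_{x|c}: probability of the binary vector x given latent state c.

    lam_c[j] plays the role of lambda_{j|c}.
    """
    p = 1.0
    for xj, lj in zip(x, lam_c):
        p *= lj if xj == 1 else 1.0 - lj
    return p

# Hypothetical conviction probabilities for one latent state.
lam_c = [0.2, 0.5, 0.9]
p_pattern = emission_prob((1, 0, 1), lam_c)  # 0.2 * (1 - 0.5) * 0.9

# The probabilities of all 2**J patterns form a distribution.
p_total = sum(emission_prob(x, lam_c) for x in product([0, 1], repeat=len(lam_c)))
```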
Restricted version of the model (unidimensional Rasch)
We assume that for each type of offence we have
$$\mathrm{logit}(\lambda_{j|c}) = \alpha_c + \beta_j, \qquad (1)$$
where
$\alpha_c$ is the tendency to commit crimes of a subject in latent class $c$ (i.e. an individual characteristic);
$\beta_j$ is the easiness of committing a crime of type $j$;
the constraint allows the latent classes to be ordered, and hence labelled appropriately, so that
$$\lambda_{j|1} \le \cdots \le \lambda_{j|k}, \quad j = 1, \ldots, J;$$
this constraint is used to formulate a latent class version of the Rasch (1961) model, which is well known in the psychometric literature.
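To see why constraint (1) orders the classes, the conviction probabilities can be computed from the inverse logit. The sketch below uses hypothetical values of α and β (not the estimates reported later); with increasing α the probabilities increase across classes for every offence type.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def rasch_probs(alpha, beta):
    """lam[c][j] = expit(alpha_c + beta_j), as in constraint (1)."""
    return [[expit(a + b) for b in beta] for a in alpha]

# Hypothetical parameters: k = 3 classes with increasing tendency, J = 3 offence types.
alpha = [0.0, 1.5, 3.0]
beta = [-5.0, -7.0, -6.0]
lam = rasch_probs(alpha, beta)

# For each offence type j the classes are ordered: lam[0][j] <= lam[1][j] <= lam[2][j].
ordered = all(lam[0][j] <= lam[1][j] <= lam[2][j] for j in range(len(beta)))
```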
Restricted version of the model (multidimensional Rasch)
The previous model assumes that every type of offence depends on the same latent trait: this may be too restrictive;
we consider that the crimes may be partitioned into $s$ homogeneous subgroups, so that
$$\mathrm{logit}(\lambda_{j|c}) = \sum_{d=1}^{s} \delta_{jd} \alpha_{cd} + \beta_j, \qquad (2)$$
where
$\alpha_{cd}$ is the tendency of a subject in latent class $c$ to commit crimes in subgroup $d$;
$\delta_{jd}$ is equal to 1 if crime $j$ is in subgroup $d$ and to 0 otherwise;
we can thus classify the offences into groups such that crimes belonging to the same group depend on the same latent trait.
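A sketch of how constraint (2) maps subgroup membership into conviction probabilities. The indicator matrix `delta` and the values of `alpha` and `beta` are hypothetical toy inputs chosen for illustration.

```python
import math

def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))

def multidim_rasch_probs(alpha, beta, delta):
    """lam[c][j] = expit(sum_d delta[j][d] * alpha[c][d] + beta[j]), as in (2)."""
    return [
        [
            expit(sum(dj * a for dj, a in zip(delta[j], alpha_c)) + beta[j])
            for j in range(len(beta))
        ]
        for alpha_c in alpha
    ]

# Hypothetical setting: J = 4 offence types, s = 2 subgroups, k = 2 classes.
delta = [[1, 0], [1, 0], [0, 1], [0, 1]]  # each offence belongs to exactly one subgroup
alpha = [[0.0, 0.0], [2.0, -1.0]]         # alpha[c][d]
beta = [-3.0, -4.0, -3.5, -5.0]           # beta[j]
lam = multidim_rasch_probs(alpha, beta, delta)
```

In this toy setting the second class is more likely than the first to be convicted of the offences in subgroup 1 and less likely for those in subgroup 2, a pattern that the single ordering implied by the unidimensional model cannot represent.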
Likelihood inference
The log-likelihood of the model for an observed cohort of $n$ subjects is
$$l(\theta) = \sum_{i=1}^{n} \log[L_i(\theta)],$$
where $\theta$ denotes the vector of all the parameters and $L_i(\theta)$ is the probability $p(x_{i1}, \ldots, x_{iT})$ defined above, evaluated at $\theta$.
$L_i(\theta)$ may be computed through the well-known recursions in the hidden Markov literature (see Levinson et al., 1983, and MacDonald and Zucchini, 1997, Sec. 2.2);
$l(\theta)$ is maximized with the EM algorithm, which requires the complete data log-likelihood.
The complete data log-likelihood may be expressed as
$$l^*(\theta) = \sum_c v_{\cdot 1 c} \log \pi_c + \sum_{c_1} \sum_{c_2} u_{c_1 c_2} \log \pi_{c_1 c_2} + \sum_i \sum_t \sum_c v_{itc} \sum_j \{x_{itj} \log \lambda_{j|c} + (1 - x_{itj}) \log(1 - \lambda_{j|c})\},$$
where $v_{itc}$ is a dummy variable, referring to the $i$-th subject, which is equal to 1 if $C_t = c$ and to 0 otherwise, $v_{\cdot tc} = \sum_i v_{itc}$, and $u_{c_1 c_2}$ is the number of transitions from the $c_1$-th to the $c_2$-th state.
EM algorithm
E step: computes the conditional expected value of $l^*(\theta)$, given the observed data and the current value of the parameters.
M step: updates the parameter estimates by maximizing the expected value of $l^*(\theta)$ computed above.
When the model is constrained (unidimensional or multidimensional Rasch) the parameters $\alpha_{cd}$ and $\beta_j$ are estimated by fitting to the data a logistic model with a suitable design matrix $Z$, defined according to the model of interest.
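The two steps can be sketched for the unconstrained multivariate model, where the M step has a closed form. This is an illustrative sketch only: the data are randomly generated, the starting values are arbitrary, and the constrained (Rasch) M step, which requires a logistic fit, is not implemented here.

```python
import math
import random

def emission(x, lam_c):
    """phi_{x|c} under local independence."""
    p = 1.0
    for xj, lj in zip(x, lam_c):
        p *= lj if xj else 1.0 - lj
    return p

def forward_backward(xs, init, trans, lam):
    """E step for one subject: likelihood, posterior state probabilities
    v[t][c] and expected transition counts u[c1][c2]."""
    k, T = len(init), len(xs)
    phi = [[emission(x, lam[c]) for c in range(k)] for x in xs]
    fwd = [[init[c] * phi[0][c] for c in range(k)]]
    for t in range(1, T):
        fwd.append([phi[t][c2] * sum(fwd[t - 1][c1] * trans[c1][c2] for c1 in range(k))
                    for c2 in range(k)])
    bwd = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        bwd[t] = [sum(trans[c1][c2] * phi[t + 1][c2] * bwd[t + 1][c2] for c2 in range(k))
                  for c1 in range(k)]
    lik = sum(fwd[-1])
    v = [[fwd[t][c] * bwd[t][c] / lik for c in range(k)] for t in range(T)]
    u = [[sum(fwd[t - 1][c1] * trans[c1][c2] * phi[t][c2] * bwd[t][c2]
              for t in range(1, T)) / lik for c2 in range(k)] for c1 in range(k)]
    return lik, v, u

def em_step(data, init, trans, lam):
    """One EM iteration; returns the log-likelihood at the current
    parameters and the updated parameters."""
    k, J, n = len(init), len(lam[0]), len(data)
    loglik, v1 = 0.0, [0.0] * k
    U = [[0.0] * k for _ in range(k)]
    num = [[0.0] * J for _ in range(k)]
    den = [0.0] * k
    for xs in data:
        lik, v, u = forward_backward(xs, init, trans, lam)
        loglik += math.log(lik)
        for c in range(k):
            v1[c] += v[0][c]
            for c2 in range(k):
                U[c][c2] += u[c][c2]
            for t, x in enumerate(xs):
                den[c] += v[t][c]
                for j in range(J):
                    num[c][j] += v[t][c] * x[j]
    # M step: closed-form updates (unconstrained model only).
    new_init = [v1[c] / n for c in range(k)]
    new_trans = [[U[c][c2] / sum(U[c]) for c2 in range(k)] for c in range(k)]
    new_lam = [[num[c][j] / den[c] for j in range(J)] for c in range(k)]
    return loglik, new_init, new_trans, new_lam

# Synthetic data (assumed): 30 subjects, T = 4 occasions, J = 3 binary responses.
random.seed(0)
data = [[[random.randint(0, 1) for _ in range(3)] for _ in range(4)] for _ in range(30)]
init, trans = [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]]
lam = [[0.2, 0.3, 0.4], [0.6, 0.7, 0.5]]
logliks = []
for _ in range(15):
    ll, init, trans, lam = em_step(data, init, trans, lam)
    logliks.append(ll)
```

The sequence `logliks` should never decrease, which is a convenient check on an EM implementation. For longer sequences the forward and backward quantities would need rescaling to avoid numerical underflow.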
Choice of the number of classes (k)
The optimal number of latent classes can be chosen with the likelihood ratio statistic between the model with $k$ states and that with $k + 1$ states, $D_k = -2(\hat{l}_k - \hat{l}_{k+1})$, for increasing values of $k$;
or using the Bayesian Information Criterion (Kass and Raftery, 1995), defined as
$$BIC_k = -2\hat{l}_k + r_k \log(n),$$
where $r_k$ is the number of parameters in the model with $k$ states.
According to this strategy, the optimal number of states is the one for which $BIC_k$ is minimum.
Choice of the number of latent traits
The crimes are clustered using a hierarchical algorithm.
At each step the algorithm aggregates the two clusters of crimes that are closest in terms of the deviance between the model fitted at the previous step and the multidimensional Rasch model fitted after the aggregation of the two clusters.
The steps are iterated until the BIC of the resulting model is lower than that of the unconstrained model.
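The aggregation step can be sketched as a greedy search over pairs of clusters. In the sketch below the callback `loglik` is a hypothetical stand-in for fitting the multidimensional Rasch model to a given partition and returning its maximized log-likelihood; a toy function is used so that the code runs. The BIC-based stopping rule is left to the caller, who can inspect the returned history.

```python
from itertools import combinations

def agglomerate(items, loglik):
    """Greedy merging: at each step join the two clusters whose merge has the
    smallest deviance 2 * (l_previous - l_merged) relative to the previous fit."""
    partition = [frozenset([i]) for i in items]
    l_prev = loglik(partition)
    history = [(list(partition), l_prev)]
    while len(partition) > 1:
        best = None
        for a, b in combinations(partition, 2):
            candidate = [c for c in partition if c not in (a, b)] + [a | b]
            l_new = loglik(candidate)
            deviance = 2.0 * (l_prev - l_new)
            if best is None or deviance < best[0]:
                best = (deviance, candidate, l_new)
        _, partition, l_prev = best
        history.append((list(partition), l_prev))
    return history

# Toy stand-in for the model fit (assumption): clusters of items with similar
# scores lose little "log-likelihood" when merged.
scores = {"a": 0.0, "b": 0.1, "c": 2.0, "d": 2.1}

def toy_loglik(partition):
    total = 0.0
    for cluster in partition:
        mean = sum(scores[i] for i in cluster) / len(cluster)
        total -= sum((scores[i] - mean) ** 2 for i in cluster)
    return total

history = agglomerate(list(scores), toy_loglik)
```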
An application
We applied the model to a sample of n = 5,470 males taken from the dataset illustrated above;
we used the estimated number of live births in the cohort year 1953 as reported by Prime et al. (2001).
For a number of classes between 1 and 7 we obtain
k   lk        rk    BICk
1   −21,341   10    42,768
2   −20,076   23    40,349
3   −19,643   38    39,612
4   −19,284   55    39,041
5   −19,142   74    38,921
6   −19,086   95    38,990
7   −19,010   118   39,036
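As a check on the selection rule, the BIC can be recomputed from the reported log-likelihoods and numbers of parameters with n = 5,470. Because the log-likelihoods are reported as rounded values, the recomputed BIC may differ from the tabulated one by about a unit. The variable names are illustrative.

```python
import math

n = 5470
# (k, maximized log-likelihood l_k, number of parameters r_k) as reported above
fits = [
    (1, -21341, 10), (2, -20076, 23), (3, -19643, 38), (4, -19284, 55),
    (5, -19142, 74), (6, -19086, 95), (7, -19010, 118),
]

def bic(loglik, n_params, n_obs):
    """BIC_k = -2 * l_k + r_k * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

bic_by_k = {k: bic(l, r, n) for k, l, r in fits}
best_k = min(bic_by_k, key=bic_by_k.get)

# Likelihood ratio statistics D_k = -2 * (l_k - l_{k+1}) between consecutive models.
lr_stats = {fits[i][0]: -2.0 * (fits[i][1] - fits[i + 1][1]) for i in range(len(fits) - 1)}
```

With these inputs the minimum BIC is attained at k = 5, which matches the five latent states used in the estimates that follow.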
Choice of the clusters
Using the hierarchical algorithm, the best fit (BIC = 35,433) was obtained with the following aggregation of the 10 types of crime into clusters, shown together with the estimates of the β's.
Offence category (j)   latent trait (1, 2, 3)   βj
Violence against the person X −5.824
Sexual offences X −7.787
Burglary X −7.004
Robbery X −10.212
Theft and handling stolen goods X −5.375
Fraud and Forgery X −6.473
Criminal Damage X −5.890
Drug Offences X −6.720
Motoring Offences X −8.170
Estimated α parameters
Values of the estimated tendencies of the subject for each latent state in every subgroup
c   α1       α2      α3
1   0.000    0.000   0.000
2   −0.134   2.860   −9.513
3   3.315    7.100   6.192
4   3.831    4.445   5.02
5   5.283    6.990   7.439
Estimates of π and Π
Initial probabilities πc
π1      π2      π3      π4      π5
0.393   0.552   0.054   0.000   0.000
Transition probabilities πc1c2 of the Markov chain (rows: state c1; columns: state c2)
c   1       2       3       4       5
1   0.996   0.000   0.000   0.003   0.000
2   0.364   0.375   0.010   0.226   0.024
3   0.000   0.241   0.288   0.172   0.300
4   0.555   0.012   0.000   0.429   0.005
5   0.000   0.071   0.014   0.445   0.470
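One way to read these estimates is to propagate the initial distribution through the transition matrix, which gives the implied share of subjects in each latent state at every age strip. The sketch below uses the reported values, assuming the last five entries of the table form the row for state 5 and that the same matrix applies between all T = 6 strips (homogeneous chain). Since the reported figures are rounded, the resulting shares sum to 1 only approximately.

```python
# Reported estimates (rounded to three decimals in the source).
init = [0.393, 0.552, 0.054, 0.000, 0.000]
trans = [
    [0.996, 0.000, 0.000, 0.003, 0.000],
    [0.364, 0.375, 0.010, 0.226, 0.024],
    [0.000, 0.241, 0.288, 0.172, 0.300],
    [0.555, 0.012, 0.000, 0.429, 0.005],
    [0.000, 0.071, 0.014, 0.445, 0.470],  # assumed to be the row for state 5
]

def step(dist, trans):
    """Marginal state distribution at the next age strip: dist times trans."""
    k = len(dist)
    return [sum(dist[c1] * trans[c1][c2] for c1 in range(k)) for c2 in range(k)]

marginals = [init]
for _ in range(5):  # five transitions between the T = 6 age strips
    marginals.append(step(marginals[-1], trans))

for t, dist in enumerate(marginals, start=1):
    print(t, [round(p, 3) for p in dist])
```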
Advantages of the proposed methodology
We achieve a parsimonious description of the dynamic process underlying the data;
the approach is based on a general population sample and not on an offender-based sample as in other studies;
it allows a wide choice of models to be estimated, going from the simple latent class model to the constrained model with subgroups, and the best one to be chosen;
it can provide important information for policy, such as incarceration or incapacitation policy towards offenders.
Future extensions
Constrain the probabilities λj|c to be equal to 0 for one latent class, so that this class may be identified as that of non-offending subjects;
consider also models in which the transition probabilities may vary with age (non-homogeneous Markov chains);
consider restricted models in which the transition matrix has a particular structure (e.g. triangular, symmetric).
References
Bijleveld, C. J. H., and Mooijaart, A. (2003). Latent Markov Modelling of Recidivism Data. Statistica Neerlandica, 57, 3, 305-320.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc., Series B, 39, 1-38.
Feng, Z. and McCulloch, C. E. (1996). Using Bootstrap Likelihood Ratios in Finite Mixture Models. J. R. Statist. Soc., Series B, 58, 3, 609-617.
Francis, B., Soothill, K. and Fligelstone, R. (2004). Identifying Patterns and Pathways of Offending Behaviour: A New Approach to Typologies of Crime. European Journal of Criminology, 1, 47-87.
Kass, R. E. and Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773-795.
Lazarsfeld, P. F. and Henry, N. W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.
Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62, 1035-74.
Lindsay, B., Clogg, C. and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis.
McCutcheon, A. L. and Thomas, G. (1995). Patterns of drug use among white institutionalized delinquents in Georgia. Evidence from a latent class analysis. Journal of Drug Education, 25, 61-71.
MacDonald, I. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. London: Chapman & Hall.
McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: John Wiley & Sons.
Research Development and Statistics Directorate (1998). Offenders Index Codebook. London: Home Office. Available at http://homeoffice.gov.uk/rds/pdfs/oicodes.pdf.
Prime, J., White, S., Liriano, S. and Patel, K. (2001). Criminal careers of those born between 1953 and 1978. Statistical Bulletin 4/01. London: Home Office.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology, Proceedings of the IV Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-333.
Wiggins, L. M. (1973). Panel Analysis: Latent Probability Models for Attitudes and Behavior Processes. Amsterdam: Elsevier.