Maximum pseudolikelihood estimation with Markov random fields in the
segmentation of brain magnetic resonance images
Amy Chan
Bachelor of Mathematics (Hons I)
A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland in 2018
Abstract
Having an accurate model of the tissue structure of the brain is useful in studying the development and progression of neurodegenerative diseases like dementia. Brain magnetic resonance images (MRI) can be used to create such models, by detecting the tissue boundaries in the image and classifying each voxel as a particular tissue. This task is known as image segmentation.
Many segmentation methods use a mixture-Markov random field probabilistic model for the image intensities, which can then be used to determine the most likely segmentation of the image. This model consists of a normal distribution for the image intensities of each tissue, and a Markov random field (MRF) for the prior distribution of tissue labels. The purpose of the MRF is to incorporate spatial dependence between the labels of neighbouring voxels, adding smoothness to the segmentation to remove noise.
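As a rough illustration of how the two components combine, the sketch below computes label probabilities for a single voxel from a normal intensity likelihood and a Potts-style neighbourhood prior. The function names, the tissue means and standard deviations, and the prior form exp(beta * u_j) are illustrative assumptions rather than definitions or values taken from this thesis.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def label_probabilities(y, means, sds, neighbour_counts, beta):
    """Probability of each tissue label at one voxel.

    Assumed form: likelihood phi(y; mean_j, sd_j) times a Potts-style prior
    exp(beta * u_j), where u_j is the number of neighbours with label j.
    """
    weights = [
        normal_pdf(y, m, s) * math.exp(beta * u)
        for m, s, u in zip(means, sds, neighbour_counts)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical intensity parameters for CSF, GM and WM (made-up numbers).
means = [30.0, 80.0, 120.0]
sds = [10.0, 10.0, 10.0]

# A voxel whose intensity lies midway between the GM and WM means,
# with one GM neighbour and three WM neighbours.
no_prior = label_probabilities(100.0, means, sds, [0, 1, 3], beta=0.0)
with_prior = label_probabilities(100.0, means, sds, [0, 1, 3], beta=1.0)
```

With beta set to zero the intensity alone cannot separate GM from WM for this voxel; with a positive beta the neighbouring labels tip the decision towards WM.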
In this thesis, we develop and validate methods to perform automatic tissue segmentation of brain MRI. We specifically focus on the MRF component of the image model. This is used to model spatial dependence between neighbouring tissue labels, which has the effect of spatial regularisation on the segmentations.
This thesis begins with a general introduction to mixture models, Markov random fields, and the combined mixture-MRF model as it applies to image segmentation. Estimation of the tissue intensity parameters and also of the segmentation using Expectation-Maximisation (EM) is explained, as well as effective approximations required to accommodate the MRF’s intractable normalising constant.
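For orientation, the following is a minimal sketch of Expectation-Maximisation for a univariate normal mixture without any spatial prior. It is the textbook algorithm, not the mixture-MRF variant developed in this thesis; the function name, the starting values and the toy data are assumptions made for illustration.

```python
import math

def em_normal_mixture(ys, means, sds, props, n_iter=50):
    """EM for a univariate normal mixture (no MRF prior on the labels)."""
    means, sds, props = list(means), list(sds), list(props)
    g = len(means)
    taus = []
    for _ in range(n_iter):
        # E-step: tau[i][j] is the posterior probability that observation i
        # belongs to component j. The 1/sqrt(2*pi) factor cancels on normalising.
        taus = []
        for y in ys:
            w = [
                props[j] * math.exp(-0.5 * ((y - means[j]) / sds[j]) ** 2) / sds[j]
                for j in range(g)
            ]
            total = sum(w)
            taus.append([wj / total for wj in w])
        # M-step: posterior-weighted updates of proportions, means and standard deviations.
        for j in range(g):
            nj = sum(t[j] for t in taus)
            props[j] = nj / len(ys)
            means[j] = sum(t[j] * y for t, y in zip(taus, ys)) / nj
            var = sum(t[j] * (y - means[j]) ** 2 for t, y in zip(taus, ys)) / nj
            sds[j] = math.sqrt(max(var, 1e-6))  # floor guards against a collapsing component
    return means, sds, props, taus

# Toy intensities forming two well-separated groups (made-up data).
ys = [9.0, 10.0, 11.0, 10.0, 29.0, 30.0, 31.0, 30.0]
fit_means, fit_sds, fit_props, fit_taus = em_normal_mixture(
    ys, means=[5.0, 35.0], sds=[5.0, 5.0], props=[0.5, 0.5]
)
# A hard segmentation assigns each observation to its most probable component.
hard_labels = [max(range(2), key=t.__getitem__) for t in fit_taus]
```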
First, the homogeneous Potts MRF is introduced. It is used ubiquitously for MRI segmentation. The Potts model has one parameter that controls the strength of the MRF compared to the normal intensity probabilities, and hence the smoothness of the resulting segmentation. In the literature and in practice, this parameter is almost always fixed to a value chosen by manual tuning or with the use of training data. When no training data is available, selection of an appropriate parameter value is subjective and can affect the accuracy of the segmentation. We propose use of the maximum pseudolikelihood estimator (Besag, 1986) to automatically determine the value of this parameter and show how to incorporate it into the EM algorithm. The proposed method adaptively determines the amount of spatial regularisation on a per-image basis, without needing training data or an anatomical atlas. The maximum pseudolikelihood estimator (MPLE) is statistically consistent. It is also computationally tractable, involving only univariate maximisation of a concave function, and a straightforward extension of EM. The proposed method is demonstrated on real brain MRI and compared to various existing methods that require manual specification of the smoothing parameter. It is also compared to the least-squares method of Derin and Elliott (1987), which has previously been used by Van Leemput et al. (1999b) to automatically determine the smoothing parameter. The MPLE produces segmentations that are comparable to or significantly more accurate than these.
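To make the univariate maximisation concrete, here is a minimal sketch of maximum pseudolikelihood estimation of a single Potts parameter from a fixed 2D label image. Several things are assumptions for illustration: the conditional form proportional to exp(beta * u_ij), the 4-neighbour system, the use of hard labels (in the thesis the labels are latent and estimated within EM), the golden-section search standing in as a generic one-dimensional maximiser, and the search interval [0, 5].

```python
import math

def neighbour_counts(labels, i, j, g):
    """Count how many 4-neighbours of pixel (i, j) carry each of the g labels."""
    rows, cols = len(labels), len(labels[0])
    counts = [0] * g
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = i + di, j + dj
        if 0 <= r < rows and 0 <= c < cols:
            counts[labels[r][c]] += 1
    return counts

def log_pseudolikelihood(beta, labels, g):
    """Sum over pixels of log p(z_i | neighbours), with p proportional to exp(beta * u)."""
    total = 0.0
    for i, row in enumerate(labels):
        for j, z in enumerate(row):
            u = neighbour_counts(labels, i, j, g)
            top = max(beta * uk for uk in u)  # log-sum-exp shift for stability
            log_norm = top + math.log(sum(math.exp(beta * uk - top) for uk in u))
            total += beta * u[z] - log_norm
    return total

def mple_beta(labels, g, lo=0.0, hi=5.0, tol=1e-6):
    """Golden-section search for the maximiser of the concave log-pseudolikelihood."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = log_pseudolikelihood(c, labels, g)
    fd = log_pseudolikelihood(d, labels, g)
    while b - a > tol:
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = log_pseudolikelihood(c, labels, g)
        else:        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = log_pseudolikelihood(d, labels, g)
    return (a + b) / 2.0

# Two smooth regions with a single isolated "noise" pixel.
smooth = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
smooth[2][1] = 1
# A checkerboard: every neighbour carries the opposite label.
checker = [[(i + j) % 2 for j in range(4)] for i in range(4)]

beta_smooth = mple_beta(smooth, g=2)
beta_checker = mple_beta(checker, g=2)
```

The smooth image yields a clearly positive estimate, while the checkerboard drives the estimate to the lower bound of the search interval, which matches the intuition that the parameter measures how strongly neighbouring labels agree. A perfectly clean image with no minority-label pixels would push the estimate to the upper bound, so the bounds matter in this sketch.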
Next, the image model is extended to use the non-homogeneous Potts MRF, which has not been studied in detail for tissue segmentation. While the homogeneous Potts MRF has one parameter that controls global smoothness, the non-homogeneous MRF has multiple pairwise parameters that allow different smoothness constraints depending on the specific neighbouring tissues. The MRF additionally has unary parameters allowing for tissue-specific prior information to be incorporated. The role of each of these parameters is studied in isolation and together. The previously proposed MPLE is applied to this MRF to automatically determine the parameters. The method is applied to real brain images. Model selection using the pseudolikelihood information criterion (Forbes and Peyrard, 2003) suggests that the MRF with smoothing parameters but without unary parameters is favoured. However, segmentation accuracy suggests that the non-homogeneous Potts MRF (with various combinations of unary and smoothing parameters) is not more beneficial than the homogeneous Potts MRF. A review of similar MRFs in the literature suggests that the use of prior anatomical knowledge is required to constrain the parameters of the non-homogeneous Potts MRF to tailor it for brain segmentation. Leaving all parameters free to be estimated can lead to oversmoothing, particularly if a given tissue boundary is relatively rare compared to others.
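The role of the pairwise and unary parameters can be sketched for one voxel as follows. The potential used here, alpha_j minus the sum over k of beta_jk * u_k with a symmetric, zero-diagonal matrix B, is an assumed form chosen to agree with the List of Notation; the exact potential used in the thesis may differ, and all parameter values are made up.

```python
import math

def conditional_label_probs(neighbour_counts, alpha, B):
    """p(z_i = j | neighbours) under an assumed non-homogeneous Potts form.

    Assumed score for label j: alpha[j] - sum_k B[j][k] * u[k], where u[k] is
    the number of neighbours with label k. B is symmetric with a zero diagonal,
    so only neighbours carrying a different label are penalised.
    """
    g = len(alpha)
    scores = [
        alpha[j] - sum(B[j][k] * neighbour_counts[k] for k in range(g))
        for j in range(g)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters for (CSF, GM, WM).
alpha = [0.0, 0.0, 0.0]
homogeneous = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# Penalise a CSF/WM boundary more heavily than the other two boundaries.
non_homogeneous = [[0.0, 1.0, 3.0], [1.0, 0.0, 1.0], [3.0, 1.0, 0.0]]

u = [2, 0, 2]  # two CSF neighbours, no GM neighbours, two WM neighbours
p_hom = conditional_label_probs(u, alpha, homogeneous)
p_non = conditional_label_probs(u, alpha, non_homogeneous)
```

With equal penalties, GM is the least likely label for this voxel because none of its neighbours is GM. With a heavier CSF/WM penalty, GM becomes the most likely label, since placing GM between CSF and WM avoids the costly boundary.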
Finally, the image model is extended to consider anisotropic MRFs. Based on the Potts MRF, these allow for smoothing that can incorporate local features of the image in addition to the tissue labels. Drawing from the principles of Perona-Malik diffusion (Perona and Malik, 1990), a model is designed and proposed to smooth the segmentation tangentially along a detected edge but not across it, with strength proportional to the detected edge strength. Similar anisotropic MRFs have been used for tissue segmentation before, but are discriminative models requiring training data and different solution methods. The proposed model is generative and may be estimated using Expectation-Maximisation and maximum pseudolikelihood estimation, thus requiring no training. The model MRF and two variants are applied to brain MRI, and their segmentation accuracy compared to the homogeneous Potts MRF. The two supplementary MRFs underperform the homogeneous Potts MRF but demonstrate that the anisotropy is being appropriately applied. The proposed MRF significantly outperforms the homogeneous Potts MRF and demonstrates anisotropic smoothing as intended. Suggestions are made to further improve the framework and MRF to make better use of the local image structure.
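For reference, the two classical edge-stopping functions from Perona and Malik (1990) can be written as below. This shows only those standard functions; how the thesis turns them into an MRF potential is not reproduced here, and the function names and the value of kappa are illustrative assumptions.

```python
import math

def pm_exponential(gradient_magnitude, kappa):
    """Perona-Malik edge-stopping function exp(-(|grad I| / kappa)^2)."""
    return math.exp(-((gradient_magnitude / kappa) ** 2))

def pm_rational(gradient_magnitude, kappa):
    """Perona-Malik edge-stopping function 1 / (1 + (|grad I| / kappa)^2)."""
    return 1.0 / (1.0 + (gradient_magnitude / kappa) ** 2)

# Both functions are close to 1 in flat regions and decay towards 0 across
# strong edges; kappa sets the intensity scale at which the decay happens.
kappa = 10.0  # made-up scale
flat, moderate, strong = 0.0, 10.0, 50.0
weights_exp = [pm_exponential(d, kappa) for d in (flat, moderate, strong)]
weights_rat = [pm_rational(d, kappa) for d in (flat, moderate, strong)]
```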
In summary, the thesis comprises two main directions of research. First, automatic determination of MRF parameters in the mixture-Markov random field framework may be achieved in a computationally tractable manner using maximum pseudolikelihood, avoiding poor segmentations due to manual specification of the spatial parameter. Second, different MRFs allow for finer control of smoothing on a tissue-specific or even more local neighbourhood-specific level, and when properly specified may improve segmentation accuracy.
Declaration by author
This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.
I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, financial support and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my higher degree by research candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.
I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.
I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis and have sought permission from co-authors for any jointly authored works included in the thesis.
Publications during candidature
Conference papers
Chan, A., Wood, I. A., and Fripp, J. (2016). Maximum Pseudolikelihood Estimation for Mixture-Markov Random Field Segmentation of the Brain. In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–7. IEEE
Journal papers
Boyd, R., George, J., Fripp, J., Pannek, K., Chan, A., Fiori, S., Guzzetta, A., Ware, R., Rose, S., and Colditz, P. (2015). Relationship between early brain structure on MRI, white matter integrity (diffusion MRI) and neurological function at 30 weeks post menstrual age in infants born very preterm. Developmental Medicine & Child Neurology, 57:8–9
Lai, M., D’Acunto, G., Guzzetta, A., Fripp, J., Chan, A., Rose, S., Ngenda, N., Whittingham, K., Colditz, P., and Boyd, R. (2015). Randomised controlled trial of PREMM: Early somatosensory stimulation (massage) in preterm infants. Developmental Medicine & Child Neurology, 57:94–95
George, J., Fripp, J., Shen, K., Pannek, K., Chan, A., Ware, R., Rose, S., Colditz, P., and Boyd, R. (2015). Relationship between white matter integrity and neurological function in preterm infants at 30 weeks post-menstrual age. Developmental Medicine & Child Neurology, 57:88–89
George, J., Fripp, J., Shen, K., Pannek, K., Chan, A., Ware, R., Rose, S., Colditz, P., and Boyd, R. (2016). Relationship between white matter integrity at 3T MRI and neurological function in preterm infants at 30 weeks postmenstrual age. Developmental Medicine & Child Neurology, 58:33–34
Publications included in this thesis
Chan, A., Wood, I. A., and Fripp, J. (2016). Maximum Pseudolikelihood Estimation for Mixture-Markov Random Field Segmentation of the Brain. In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–7. IEEE. Incorporated as a part of Chapter 3.
Contributor: Statement of contribution
Amy Chan (Candidate): Conception and design (70%); Analysis and interpretation (80%); Drafting and production (80%)
Ian Wood: Conception and design (15%); Analysis and interpretation (10%); Drafting and production (10%)
Jurgen Fripp: Conception and design (15%); Analysis and interpretation (10%); Drafting and production (10%)
Contributions by others to the thesis
All chapters were written entirely by the candidate, with editorial advice provided by Dr Ian Wood and Dr Jurgen Fripp. The research in this thesis was developed under the guidance and suggestions of Dr Ian Wood and Dr Jurgen Fripp.
Statement of parts of the thesis submitted to qualify for the award of another degree
None.
Research Involving Human or Animal Subjects
Acknowledgements
Completing my doctorate has been simultaneously one of the best and worst times of my life. I have in turns enjoyed and lamented the student lifestyle. I felt both completely lost in a sea of knowledge, and the satisfaction of coming up with a reasonable idea and realising you’ve gained enough knowledge to do so.
This thesis offered an interesting challenge, being at the intersection of statistics and imaging. It has been at times frustrating to undertake a thesis straddling two fields, needing to constantly learn more about each, running back and forth between the two, in order to gain proficiency in the middle (“Jack of all trades, master of none”!). Yet it has been very rewarding to finally reach that middle and discover that it is a mastery in and of itself. Taking techniques from statistics with various theoretical plaudits and adapting them to practical use in image segmentation has taught me much about compromise, and renewed my sense of wonder at the ability of mathematics and statistics to transcend decades and technologies.
This thesis would not have been possible without my supervisors. I thank Dr Ian Wood for many hours of discussion not only on my research, but also on how to develop as a researcher and writer, how to manage the lack of motivation and writer’s block that all students experience, and being generally good to talk to. I appreciate the shared laughs from many of our meetings. Thank you for your willingness to learn about imaging and brains, and to stay up to date as I went further and further into narrower and narrower areas of research. Thank you also for helping me through my “mid-thesis crisis”, without which this thesis probably would not exist.
I thank Dr Jurgen Fripp for his willingness to learn about the nitty gritty of parameter estimation in Markov random fields, sometimes at the expense of losing sight of the practical application. Your reminders to pull back and think about context within the big picture when I get stuck on tiny small details are much appreciated. I greatly value your vast store of knowledge on the many different imaging techniques out there.
I also thank Dr Geoffrey McLachlan for giving me a good grounding in statistics and offering helpful advice and suggestions in the early days of my doctorate.
One thing I have learned while writing this thesis is the value of community. To my fortnightly Friday board games group, to the folk of the Cactus and Succulent Society QLD, the trivia gang, my fellow students (many now post-docs) and jigsaw crew at the AeHRC, BGSC, QUGS, the Royal Raiders, and #nethack on freenode - you have all been valuable sources of friendship, support and encouragement.
I thank my friends and family – Andrew, Milly, Jess, and too many others to list – for offering support, commiseration and encouragement, and also for your patience with me for missing out on so many occasions particularly in the last year, due to “I have to work on my thesis”.
Last but not least thanks to the DevTeam, NAO, and Keldon’s AI - many, many happy hours squandered.
Financial support
This research was supported by an Australian Government Research Training Program Scholarship and the Commonwealth Scientific and Industrial Research Organisation.
Keywords
Markov random field, image segmentation, magnetic resonance imaging, pseudolikelihood, Potts, mixture models, expectation-maximisation
Australian and New Zealand Standard Research Classifications (ANZSRC)
ANZSRC code: 010406, Applied Statistics, 50%
ANZSRC code: 080106, Image Processing, 50%
Fields of Research Classification
FoR code: 0104, Statistics, 50%
A PhD student hitting keys at random on a keyboard will almost surely finish their thesis.
Contents
Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
List of Notation
1 Introduction
1.1 Background
1.1.1 Magnetic resonance imaging
1.1.2 Brain MRI segmentation
1.2 Aims
1.2.1 Automatic determination of the smoothing parameter
1.2.2 Different types of MRF
1.3 Contributions
1.4 Overview of the thesis
2 Mathematical background
2.1 Introduction
2.2 Mixture models
2.2.1 Expectation-maximisation
2.2.2 Normal mixture models
2.2.3 Image segmentation
2.3 Markov Random Fields
2.3.1 Hammersley-Clifford theorem
2.3.2 Likelihood approximations
2.4 Expectation-Maximisation for a mixture-MRF model
2.4.1 Approximating z
2.5 Algorithm
2.6 Conclusion
3 Homogeneous Potts MRF
3.1 Introduction
3.1.1 Aim
3.2 Background
3.2.1 Potts MRF
3.2.2 Spatial regularisation parameter
3.2.3 Related work
3.3 Method
3.3.1 Maximum Pseudolikelihood Estimation
3.3.2 Least-squares estimate
3.3.3 Algorithm
3.4 Experiments
3.4.1 Choice of approximation and neighbourhood size
3.4.2 MRF estimation
3.4.3 Grid search
3.5 Results
3.5.1 Choice of approximation and neighbourhood size
3.5.2 MRF estimation
3.5.3 Grid search
3.6 Discussion
3.6.1 Choice of approximation and neighbourhood size
3.6.2 MRF estimation
3.6.3 Grid search
3.7 Conclusion
4 Non-homogeneous Potts MRF
4.1 Introduction
4.1.1 Aim
4.2 Background
4.2.1 Non-homogeneous Potts MRF
4.2.2 Related work
4.3 Method
4.3.1 Choice of MRF
4.3.2 Maximum pseudolikelihood estimation
4.3.3 Least-squares estimation
4.3.4 Algorithm
4.4 Experiments
4.5 Results
4.5.2 Comparison of estimators
4.5.3 Parameter values
4.6 Discussion
4.6.1 Model selection
4.6.2 Comparison of estimators
4.7 Conclusion
5 Anisotropic MRFs
5.1 Introduction
5.1.1 Aim
5.2 Background
5.2.1 Image-based diffusion
5.2.2 Perona-Malik diffusion
5.2.3 Related work
5.3 Method
5.3.1 Choice of weight function
5.3.2 Parameter estimation
5.3.3 Limitations
5.3.4 Algorithm
5.4 Experiments
5.5 Results
5.6 Discussion
5.6.1 Comparison of anisotropic potentials
5.6.2 Parameter values
5.6.3 The intensity normalisation parameter κ
5.6.4 Alternate anisotropic schemes
5.7 Conclusion
6 Conclusion
6.1 Summary and findings
6.1.1 Homogeneous Potts MRF
6.1.2 Non-homogeneous Potts MRF
6.1.3 Locally anisotropic models
6.2 Contributions
6.3 Future work
6.3.1 Markov random field
6.3.2 Intensity distribution
6.4 Final remarks
Bibliography
A.1 Joint distribution
A.2 E-step
A.3 M-step
A.3.1 Mixing proportions
A.3.2 Gaussian components
A.4 Summary
B Coding schemes for three dimensional images
B.1 6 neighbours
B.2 18 neighbours
List of Figures
1.1 Tissues in the brain
1.2 Example slices of an MRI of the brain
1.3 T1 image vs T2 image
1.4 Examples of brains with injuries from cerebral palsy
1.5 MRI and histogram of its voxel intensities - overall and by tissue
1.6 EM fit from a Gaussian mixture model and corresponding segmentation
1.7 Segmentations with various smoothing values β
2.1 Axial MRI slice and intensity histogram
2.2 Segmentation with a 3-component Gaussian mixture model is susceptible to image noise.
2.3 Cliques for a regular 2D lattice with 4 and 8 neighbours.
2.4 Coding sets for a two-dimensional image grid
3.1 Example Potts MRFs with various smoothing values β and 4 neighbours
3.2 Segmentations with various smoothing values β
3.3 Example pixel configurations with two labels, A and B.
3.4 Neighbourhoods in a 3x3x3 cube with 6, 18 and 26 neighbours
3.5 Mean segmentation accuracy for MRF configurations using MPLE
3.6 Average segmentation accuracy for different neighbourhood sizes
3.7 Average segmentation accuracy for different likelihood approximations
3.8 Estimated β values for various configurations.
3.9 Segmentation metrics (accuracy or Dice coefficient) for the various methods.
3.10 Paired differences in accuracy/Dice, relative to MPL.
3.11 Range of estimated β values
3.12 Accuracy for various fixed beta values and subjects
3.13 Two different 3x3x3 neighbourhoods that the 6-neighbourhood MRF cannot distinguish between.
3.14 Mid-brain slice of segmentations of subject IBSR_18 with the PL approximation.
3.15 Estimated and fitted β values with MPL.
3.16 Example segmentations for subjects by various methods
3.17 Difference in tissue volume relative to manual segmentations
3.18 Grey matter of subject IBSR_09
3.19 Manual segmentation and segmentations produced for various fixed β values
4.1 Segmentation metrics (accuracy or Dice coefficient) for the various MRFs using MPL.
4.2 Paired differences in accuracy/Dice, relative to the single-beta MRF.
4.3 Tissue proportions compared to exp(α_j) (normalised to sum to 1).
4.4 β_jk values estimated by MPL for various potentials; one line per subject
4.5 Example segmentations for different MRFs and subjects
4.6 Tissue proportions for various MRFs compared to the manual segmentation.
4.7 Proportion of neighbouring voxel pairs with different tissues for multi-beta MRFs
4.8 Proportion of (CSF, WM) neighbouring voxel pairs for each MRF potential
4.9 Example segmentation from which β_jk cannot be estimated using LS
5.1 Examples of isotropic (Gaussian) and anisotropic (Perona-Malik) image diffusion.
5.2 Comparison of Perona-Malik functions.
5.3 Example of anisotropic MRFs in vein segmentation
5.4 Graph-cut segmentation schematic
5.5 Neighbourhood of voxel i showing an intensity edge and the orientation of the gradient
5.6 The underlying dependence graph for mixture-MRF segmentation
5.7 Segmentation metrics (accuracy or Dice coefficient) for the various MRFs using MPL
5.8 Paired difference in accuracy and Dice score, relative to the single-beta MRF
5.9 β values for various MRFs
5.10 Example segmentations for different anisotropic MRFs
5.11 Comparison of how different anisotropic MRFs treat a thin feature
B.1 Coding scheme into 2 sets for 6-neighbourhood.
B.2 Coding scheme into 4 sets for 18-neighbourhood.
B.3 Coding scheme into 8 sets for the 26-neighbourhood. The “odd” slice is in the centre (symbols !, #, ^, @).
B.4 Coding scheme into 8 sets for the 26-neighbourhood. The “even” slice is in the centre (symbols x, o, ., *).
List of Tables
3.1 Experiment summary: MRF neighbourhood size and approximations
3.2 Experiment summary: comparison of various mixture-MRF algorithms.
3.3 Fixed β values used for grid search
3.4 Accuracy for different MRF approximations and neighbourhood sizes
3.5 Mixed-effects model of segmentation accuracy by MRF approximation and neighbourhood size
3.6 Post-hoc pairwise comparisons of accuracy for different neighbourhood sizes
3.7 Post-hoc pairwise comparisons of accuracy for different MRF approximations
3.8 Average performance for different algorithms, ordered by accuracy decreasing.
3.9 Mixed-effects model of segmentation accuracy by algorithm
3.10 Post-hoc pairwise comparisons of accuracy by algorithm
3.11 β values for various algorithms, ordered by accuracy decreasing.
3.12 Average number of matching-label neighbours in MPL segmentations
3.13 Mixed linear regression of 1/β against average number of matching neighbours
4.1 Summary of MRFs
4.2 Experiment summary: comparison of different MRFs
4.3 Number of subjects successfully segmented using LS
4.4 Model selection. Average PLIC and accuracy
4.5 Mixed-effects model of segmentation accuracy for different MRFs
4.6 Post-hoc pairwise comparisons of accuracy for different MRFs
4.7 Comparison of accuracy and Dice coefficient between LS and MPLE
5.1 Average accuracy and Dice for different MRF potentials
5.2 Mixed-effects model of segmentation accuracy for different anisotropic models
5.3 Post-hoc pairwise comparisons of accuracy for different MRFs
List of Abbreviations used in the thesis
CSF cerebrospinal fluid
EM Expectation-Maximisation
GM grey matter
ICE, ICM Iterated Conditional Expectation/Modes
LS, LSE least-squares (estimator)
MF mean-field
MRF Markov random field
MRI magnetic resonance images
MPL, MPLE maximum pseudolikelihood (estimator)
pdf probability density function
PL pseudolikelihood
WM white matter
List of Notation used in the thesis
Indices
$i$: index used for voxel $i$
$m$: index used for voxel $m$, usually a neighbour of $i$
$\partial i$: indices in the neighbourhood of voxel $i$
$n$: total number of voxels
$j, k$: indices used for tissue labels
$g$: total number of tissues
Variables
$\mathbf{Y}_i, Y_i, \mathbf{y}_i, y_i$: random variable (uppercase) and realisation (lowercase) of intensity at voxel $i$, either vector or scalar
$Y, y$: intensities for all voxels, i.e. $(y_1, y_2, \ldots, y_n)$
$Z_i, z_i$: random variable and realisation of tissue labels; $z_i \in \{0, 1\}^g$ such that $\sum_{j=1}^{g} z_{ij} = 1$
$z_{ij}$: the $j$th element of $z_i$
$\langle z_i \rangle$: mean-field approximation of $z_i$
$Z, z$: random variable and realisation of tissue labels for all voxels
$e_j$: an indicator vector of length $g$ that is 0 everywhere except the $j$th element, which is 1
$z_{\partial i}$: labels in the neighbourhood of voxel $i$, i.e. $z_m$ such that $m \in \partial i$
Distributions
$f$: usually a density function involving continuous variables such as $y$ or $y_i$
$\phi$: the normal probability density function
$\Theta$: parameters of the intensity distribution
$\mu_j, \sigma_j^2$: mean and variance of the normal distribution corresponding to the $j$th tissue
$p$: usually a probability function over discrete variables such as $z$ or $z_i$
$\tilde{p}$: pseudolikelihood or mean-field approximation of $p$
$U_i(z_i \mid z_{\partial i})$: MRF potential of voxel $i$
$C$: normalising constant of $p(z)$
$u_{ij}$: number of neighbours of voxel $i$ with label $j$, i.e. $\sum_{m \in \partial i} z_{mj}$
$\delta_{im}$: distance between voxels $i$ and $m$
$\Psi$: parameters of the Markov random field
$\beta$: single smoothing parameter of the homogeneous Potts model
$\beta_{jk}$: multiple smoothing parameters of the non-homogeneous Potts model, applying to the boundary between tissues $j$ and $k$
$B$: $g \times g$ matrix with element $(j, k)$ being $\beta_{jk}$ when $j \neq k$; $\beta_{kj} = \beta_{jk}$ and $\beta_{jj} = 0$
$\alpha, \alpha_j$: unary parameters of the non-homogeneous Potts model such that $\alpha =$
Miscellaneous
$E[\,]$: expectation
$Q$: Q-function in Expectation-Maximisation
$\tau_{ij}$: posterior probability that voxel $i$ belongs to tissue $j$
Chapter 1
Introduction
As we develop and then age, our brain changes continuously. Various substructures within the brain may change in shape or size as part of healthy aging (Dennis and Thompson, 2013). On the other hand, the presence of neurodegenerative diseases such as dementia can also affect the brain. Having an accurate model of the brain is vital to studying the progression of such diseases and how they differ from the processes of normal aging.
Clinically, qualitative measures are commonly used to diagnose and assess the severity of neurological disorders. These are both time-consuming and dependent on rater expertise and experience. With the rapid development and improvement of medical technology, more accurate diagnoses can be found by including biomarkers from cerebrospinal fluid analyses and images obtained by magnetic resonance imaging and positron emission tomography (Dubois et al., 2007). Quantitative measurement of, for example, the volume of the hippocampus (Jack et al., 1997, 2000; Schuff et al., 2009) or the thickness of the cortical wall (Thompson et al., 2003) has the potential to better characterise the nature of dementia.
The challenge is obtaining accurate measurements of the brain and then accurately modelling it. These models are constructed from medical images of the brain, such as magnetic resonance images (MRI). Automated measurement of structures in the brain from MRI saves both time and the need for a fully-trained expert, and can be highly reliable (Han et al., 2006).
Before such measurements can be taken, a reliable reconstruction of the brain from the MRI is needed. In the brain, there are three main tissue types: cerebrospinal fluid (CSF), grey matter (GM) and white matter (WM). Classifying each spatial location of the MRI to the underlying tissue type being imaged there is known as a tissue segmentation of the brain. The underlying tissue type at a given location may be inferred from the observed signal of the MRI there, as well as prior anatomical knowledge. The task is made more difficult by the presence of artefacts that can degrade the quality of the image, for example scanner noise, the machine’s bias field, or patient motion. An example of an MRI and its corresponding segmentation is shown in figure 1.1. Segmentation of brain MRI is the primary focus of this thesis. In particular, we are interested in the use of probability models of brain MRI for segmentation, and the incorporation of adaptive spatial regularisation into these models.
Figure 1.1: Brain tissue is primarily grey matter (GM), white matter (WM) or cerebrospinal fluid (CSF). Left: an MRI. Right: corresponding tissue classification.
1.1 Background
1.1.1 Magnetic resonance imaging
An MRI machine has a strong static magnetic field in which the object to be imaged is placed, causing the nuclear spins of the object to become aligned. This mostly corresponds to hydrogen atoms in water present in the body. A radiofrequency field is briefly applied in the plane transverse to the static field, causing the spins to align with it. When this field is removed, the protons precess or relax back to their equilibrium position, producing a signal that is detected. The static magnetic field has a physical gradient in field strength, allowing the physical location of the signal to be inferred. In this way a signal is recorded from a dense grid of spatial locations within and around the object to be imaged.
The image itself may be viewed as a set of measurements on a regular (square or cubic) grid, either in 2D or 3D. Each cell of the grid is known as a pixel for a 2D image, or a voxel for a 3D image (also called a volume). For example, colour images may have 3 integer values at each pixel, being red, green and blue intensity values. For an MRI, each voxel contains the signal strength at the corresponding point of real space. Different tissues have different relaxation times, allowing them to be distinguished in the image. An MRI is typically a 3D volume, consisting of many 2D slices (figure 1.2).
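The grid view of an image can be sketched with a toy volume. The sizes, the stored values and the slice-row-column index ordering are arbitrary assumptions; real MRI volumes are far larger and are usually held in array libraries rather than nested lists.

```python
# Toy 3D volume indexed as volume[s][r][c]: slice, row, column (an assumed ordering).
n_slices, n_rows, n_cols = 2, 3, 4
volume = [
    [[100 * s + 10 * r + c for c in range(n_cols)] for r in range(n_rows)]
    for s in range(n_slices)
]

def linear_index(s, r, c):
    """Map grid coordinates to a single voxel index i in 0..n-1."""
    return (s * n_rows + r) * n_cols + c

def grid_coordinates(i):
    """Inverse of linear_index: recover (slice, row, column) from voxel index i."""
    s, rest = divmod(i, n_rows * n_cols)
    r, c = divmod(rest, n_cols)
    return s, r, c

n_voxels = n_slices * n_rows * n_cols  # total number of voxels
one_slice = volume[1]                  # a 2D slice with n_rows rows and n_cols columns
signal = volume[1][2][3]               # value stored at a single voxel
rgb_pixel = (255, 128, 0)              # a colour pixel holds three intensity values
```

Indexing voxels by a single integer, as done by `linear_index`, is convenient when a model treats the image as a collection of n sites with neighbourhood relations between them.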
Depending on the imaging sequence used in the MRI, different types of image can be produced. For example, a T1-weighted image shows CSF as the darkest tissue, white matter as the brightest, and grey matter intermediate. In a T2-weighted image, CSF is the brightest tissue, grey matter grey, and white matter dark (figure 1.3).
Figure 1.2: Example slices of an MRI of the brain
Figure 1.3: T1 image (left) vs T2 image (right). In a T1 image, CSF is the darkest tissue, followed by grey matter with white matter as the brightest. The order is reversed in a T2 image.
1.1.2 Brain MRI segmentation
A segmentation of the brain may mean multiple things:
• a tissue segmentation of the brain into major tissues such as cerebrospinal fluid, grey matter, white matter, dura, glial tissue, and so on.
• an anatomical segmentation of the brain into finer anatomical regions such as the hippocampus, ventricles, thalamus, and so on.
This can be done manually, semi-automatically, or fully automatically. See Despotović et al. (2015), Balafar et al. (2010) and Withey and Koles (2008) for reviews.
Manual segmentation involves a technician or expert manually delineating the regions of interest on the MRI. This can be extremely time-consuming. While manual segmentations are often used as a ‘gold standard’, they can be subjective and suffer from poor inter- and intra-rater reliability (Clarke et al., 1995; Collier et al., 2003). For example, Gurleyik and Haacke (2002) reported an inter-observer error as large as 16% for five experts performing manual segmentation of the caudate nucleus. Even for one well-trained expert, segmentations can differ significantly depending on what segmentation protocol is used (Boccardi et al., 2011). Additionally, it can take many man-hours to manually delineate tissues in each subject. Using semi-automatic or fully-automatic methods has the advantage of restoring objectivity to the segmentations while also saving manual labour.
Semi-automatic segmentation is mostly automatic, but requires some manual input. For example, a technician could click on regions of the MRI they know to be grey matter, white matter, and CSF, and these are used to initialise a segmentation algorithm. These approaches can greatly increase the reliability of the segmentations (Yushkevich et al., 2006). However, such methods still require manual input, though much less than a full manual segmentation.
Related to semi-automatic segmentation methods are those that are fully automatic to run, but require training data. Examples of these methods include neural networks and deep learning-based approaches (Zhang et al., 2015; Moeskops et al., 2016; Litjens et al., 2017; Shen et al., 2017, and the references therein). These methods can offer very promising results and a good compromise between the accuracy of manual segmentation and the convenience and objectivity of automatic segmentation. However, such methods still require training data, typically consisting of images and their matching manual segmentations. They can fail if a test image is presented that is significantly different from the training data (e.g. in image contrast, or brain morphology). There is a need for methods that do not require manual intervention or extensive training data.
In this thesis, we focus on fully-automatic segmentation methods that do not require training data. Automatic methods may largely be classified into those that use prior anatomical knowledge (an atlas), and those that do not. However, combinations of these are also often used, and many methods that do not require an atlas may still make use of one if available. An atlas is typically a representative MRI, along with a hard or probabilistic labelling of it into tissues or regions. For example, each voxel may have a probability associated with it to be white matter, grey matter, or cerebrospinal fluid. An unlabelled input MRI is registered to the atlas (possibly non-rigidly), aligning the two brains. The labels are propagated from the atlas onto the registered input brain, which may then be transformed back into the original space. The advantage of atlas-based methods is that the atlas may be labelled in finer detail than could be inferred from the MRI alone. For example, two cortical walls pressed so closely as to appear a single contiguous region based on the MRI alone, could be properly distinguished as two separate walls.
However, the registration may easily fail if the unlabelled brain does not match the atlas closely enough. Some examples of this can be seen in figure 1.4. If the brain to be segmented has pathologies or injuries, there may not exist a mapping between it and the atlas. Another example is the neonate brain, which changes significantly over a short period of time (Rutherford, 2002). If the atlas selected for the brain does not match its current developmental state closely enough, registration will fail. For this reason, we focus on fully-automatic brain MR segmentation that does not require an atlas or prior training.
Figure 1.4: Examples of brains with injuries from cerebral palsy
Figure 1.5: MRI and histogram of its voxel intensities - overall and by tissue (from manual segmentation)
Segmentation methods that do not require an atlas make use of image intensities. Such methods can be edge- and surface-based, for example active contours and level set methods (Tsai et al., 2001; Vese and Chan, 2002). Edges in the image are located with intensity gradient information, and used to define regions of interest. Such methods often also incorporate region-based metrics, identifying regions in the image as having homogeneous intensity within each region (Wang et al., 2009; Huang et al., 2009).
A large number of segmentation methods for the brain focus on the clustering of image intensities. Figure 1.5 shows a T1 MRI and its corresponding intensity histogram, as well as the intensity distribution of each of the three main tissues (determined by a manual segmentation of the image). Since CSF is generally dark, WM is bright, and GM is in between, the most basic approach is simply to threshold the image intensities to determine an image classification. This can be quite subject to noise. More sophisticated clustering methods include k-means (Cocosco et al., 2003; Vrooman et al., 2007) and fuzzy C-means (Ahmed et al., 2002).
However, the most common clustering method for brain tissue segmentation, and the focus of this thesis, employs a Gaussian mixture model of the MRI intensities (we will cover this in further detail in Chapter 2). On examining figure 1.5, the image intensities appear to be well approximated by three overlapping Gaussian distributions, one per tissue type. In fact, it has been shown that noise in tissue intensities is Rician, but may be approximated by a Gaussian given the signal-to-noise ratio of MRI (Gudbjartsson and Patz, 1995). A mixture of three Gaussians may be fitted to an MRI’s intensities, inducing a segmentation of the image by assigning each voxel to the Gaussian it is most likely to belong to.
Figure 1.6: EM fit from a Gaussian mixture model and corresponding segmentation
As can be seen (figure 1.6), this method is susceptible to noise in the image. All voxels with a given intensity will be classified as the same tissue, even if entirely surrounded by a different tissue. To address this, the segmentation can be smoothed. Morphological operators such as openings and closings can be applied to the EM segmentation to remove isolated noise. However, such operators are ‘blind’ to the image around them, and indiscriminately fill in all features of the same size regardless of the surrounding image data.
An alternative is to incorporate the smoothness constraint into the image model itself, so that smoothing is context-aware of the local intensity information. To do this, it is standard to use a Markov Random Field (MRF) as a prior probability distribution over the tissue labels in the mixture model. The Potts model for atomic spins from statistical mechanics (Potts, 1952) is most commonly used. When considering the probability for a given voxel to be a given tissue conditioned on the tissues of its neighbours, the Potts model prefers the majority tissue in the neighbourhood. The Gaussian distribution of the intensities of each tissue is retained from the standard mixture model. In this way, the smoothing of the Potts model is weighted by the intensity probabilities, so that the smoothing is both intensity- and spatially-dependent. With some modifications, Expectation-Maximisation can be adapted to handle the MRF (we will show these details in Chapter 2).
This model - each tissue distributed according to a Gaussian, and the prior distribution of the tissues with the Potts MRF - is ubiquitous in MR segmentation. Introduced for image segmentation by Besag (1986), it is a component of common segmentation packages such as NiftySeg (Cardoso et al., 2009, 2011), Expectation Maximisation Segmentation (Van Leemput et al., 1999b), Atropos (Avants et al., 2011) and FAST (Zhang et al., 2001). From this basis many extensions may be made; for example, one can add bias-field correction (Van Leemput et al., 1999a; Wells et al., 1996) or partial volume correction (Noe and Gee, 2001; Shattuck et al., 2001; Van Leemput et al., 2003).
Figure 1.7: Segmentations with various smoothing values β: (a) 0.1, (b) 0.3, (c) 1, (d) 5, (e) 10
The Potts MRF has one non-negative parameter β that controls the strength of the smoothing. This is typically set by manual tuning, or to arbitrarily-chosen values in the literature (for example, β = 1.5 (Besag, 1986), β = 1 (Zhang et al., 2001; McLachlan et al., 1996), β = 0.7 (Owen, 1986; Ripley, 1986), β = 0.3 (Avants et al., 2011), β = 0.25 (Cardoso et al., 2009)). Larger values correspond to stronger smoothing, while 0 corresponds to no smoothing. Mis-specifying this parameter can lead to not enough smoothing, or too much (figure 1.7). In addition, the value of β that is best for one MRI may not be the same as that for a different MRI. An automatic method to determine the amount of smoothing, i.e. β, on a per-image basis is of value, and could then be used in all methods based on the mixture-MRF formulation.
In addition, the Potts MRF is quite basic, with its single parameter β only allowing for the same uniform smoothness across the entire image. In the brain, it is known that some tissue boundaries are more convoluted than others (e.g. cortical folding vs the ventricle boundary). MRFs that allow smoothing to be applied at different scales, for example on a per-tissue basis, or incorporating further local image features, are worth investigating.
1.2 Aims
This thesis addresses the issues mentioned in the previous section through development of a fully-automatic, three-dimensional brain MR segmentation algorithm that does not require training data. We aim to segment the skull-stripped brain into the three primary tissues, cerebrospinal fluid (CSF), grey matter (GM) and white matter (WM). Our segmentation method is based on constructing a probability model for the MRI, which is then used to classify it into tissues. Specifically, this thesis focuses on the Markov random field (MRF) prior probability distribution over the tissue labels, where the smoothing is applied on a global, per-tissue, or local level. There are two aspects to this, given below.
1.2.1 Automatic determination of the smoothing parameter
The standard probabilistic segmentation method uses a Gaussian distribution of intensities for each tissue, and the Potts MRF for the tissue labels. The Potts MRF has one parameter that determines the amount of smoothing to apply to the image, but there is no well-accepted, principled method to determine the value of this parameter automatically, with manual tuning being common.
The first aim of the thesis is to develop methods to automatically and adaptively determine the smoothing parameter for the Potts MRF. The method should be computationally tractable, should not require training data, and should be able to adjust the parameter on a per-image basis.
There have been a number of attempts at automatically setting the smoothing parameter, including Bayesian approaches (Woolrich et al., 2005; Woolrich and Behrens, 2006) and regression-based estimators (Van Leemput et al., 1999b). The Bayesian approaches are computationally intensive and slow, requiring many simulations of the desired Markov random field at each iteration of the algorithm. The regression-based estimator does not have this drawback, but relies on building a neighbourhood histogram of the image which is time-consuming, and suffers additional limitations on its use.
The proposed method utilises the pseudolikelihood (Besag, 1975) and mean-field (Chandler, 1987) approximations in order to determine a suitable value for the smoothing parameter. The method is computationally tractable and easy to interpret and understand. Additionally, the method is a natural extension of the modified Expectation-Maximisation algorithm already used in existing methods, so it adds little implementation burden when incorporated into them.
We focus on segmentation of brain tissues only, i.e. we assume that the skull has already been stripped in the images and artefacts such as bias-field already corrected. However, the image model used in this thesis is common to many segmentation algorithms that can also perform these tasks, and the methods developed in this thesis can be readily incorporated into these algorithms.
Fully-automatic estimation of the smoothing parameter in the Potts MRF will be studied in chapter 3. We hypothesise that estimation of the parameter individually for each image will provide more accurate segmentations than setting it to the same fixed value for each image.
1.2.2 Different types of MRF
The second aim of this thesis is to investigate the use of more complex forms of MRF in brain segmentation. As previously mentioned, the Potts MRF can only apply the same smoothing uniformly across the image. It may be advantageous to allow different tissue boundaries to be smoothed to different degrees. For example, the GM-WM boundary of the cortical folds could be permitted to be less smooth than the GM-CSF boundary of the ventricles. Additionally, the standard Potts MRF cannot explicitly account for certain tissues being less prevalent than others. In Chapter 4, we aim to use a more general form of the Potts MRF that allows both per-tissue smoothing and relative tissue proportions to be controlled. Additionally, we will use the method developed in the first aim to estimate the parameters of the MRF. We hypothesise that this MRF will allow greater sensitivity to the different tissues when smoothing, and be able to adapt to images where the tissue proportions are very different from each other.
It is also of interest to incorporate MRF smoothing on an even finer scale than the tissue level. For example, local image features such as edge orientation and strength can be used to further adjust the smoothing so as not to smooth away thin features such as the cortical folds. In Chapter 5, we will develop and investigate an MRF that achieves anisotropic smoothing, and use the method developed in the first aim to estimate its parameters. We hypothesise that the use of anisotropic MRFs will prevent thin features from being smoothed away, while still permitting strong smoothing of noise in otherwise homogeneous regions.
1.3 Contributions
The key contributions of the thesis are
1. to demonstrate the effectiveness of MRF parameter estimation (by any method) as opposed to fixing the spatial regularisation parameter to a manually-chosen constant. This results in a fully-automatic intensity-based brain segmentation algorithm with adaptive spatial regularisation. While maximum pseudolikelihood estimation in MRFs has been performed before (e.g. Celeux et al. (2003)), it has not been studied in detail with regard to neighbourhood size, choice of MRF approximation, or the form of the MRF itself when applied to MR segmentation.
2. to specifically demonstrate the suitability of maximum pseudolikelihood estimation for MRF parameter estimation in brain segmentation, as compared to other estimation techniques. Maximum pseudolikelihood is more computationally tractable than existing Bayesian methods (Woolrich et al., 2005; Woolrich and Behrens, 2006). It makes use of quantities already calculated in the MRF segmentation framework, so is modular and straightforward to implement in methods that already use this framework, unlike the regression estimator of Van Leemput et al. (1999b). Finally, results from real brain datasets indicate that using maximum pseudolikelihood to automatically determine the spatial regularisation is superior to the current methods which fix it, especially when no atlas can be used.
3. to explore more complex MRF models (with parameter estimation), assess their suitability for brain MR segmentation, and tailor them to it. These MRFs can smooth on finer scales: firstly, on a per-tissue basis (as studied in Chapter 4), and secondly, in local neighbourhoods (as studied in Chapter 5). We use maximum pseudolikelihood to adaptively smooth with these MRFs also.
1.4 Overview of the thesis
In chapter 2, we give the necessary mathematical and imaging background that will be used throughout the remaining chapters.
In chapter 3, we consider the most common form of MRF used for brain segmentation, the homogeneous Potts MRF, which requires one global smoothing parameter. We study and compare two methods to automatically determine this parameter, least-squares estimation and maximum pseudolikelihood estimation. We argue that the latter is ideal for brain MR segmentation: it involves an optimisation that is concave and computationally tractable, and it uses quantities already calculated in existing methods, so it can easily be incorporated into them. We demonstrate its use on a real-brain dataset and show it has favourable performance compared to existing fixed-parameter methods. Although the method itself is not new, we make a detailed study of how the neighbourhood specification and approximation of the MRF are related to segmentation accuracy. To our knowledge, a study focusing on these aspects has not been presented before. The work of this chapter can be thought of as smoothing on a global (image-wide) level.
In chapter 4, we consider smoothing on a per-tissue level, which may be more realistic for the brain. We do this by considering the non-homogeneous Potts MRF, a generalisation of the homogeneous Potts MRF of Chapter 3. We show how to automatically determine the model parameters, comparing least-squares estimation with maximum pseudolikelihood estimation. We focus on various forms of the non-homogeneous Potts MRF, separating out its unary and pairwise terms and studying their effect on the resulting segmentation in detail. We demonstrate the use of these MRFs on a real-brain dataset. This expands on existing work with a similar per-tissue-smoothing MRF.
In chapter 5, we consider smoothing on a local neighbourhood level, by allowing local features such as the presence, strength and orientation of edges to be incorporated into the MRF. As before we show how to automatically determine the model parameters and demonstrate the algorithm on a real-brain dataset, showing promising results against the models considered thus far. Incorporation of local features like this into the probability model is novel.
Chapter 2
Mathematical background
2.1 Introduction
Throughout this thesis, the same general probability model is used to describe the intensities of a brain MRI. This comprises a Gaussian (normal) mixture model on the brain intensities, and a Markov random field as the prior density for the brain tissue classification. The details of the mixture model and MRF vary, but much of the underlying theory remains the same. In this chapter, we cover the common basics of the image model used throughout the thesis.
The first part of the chapter covers the normal mixture model and how it is used for image segmentation. The second part briefly covers Markov random fields, difficulties in working with them and likelihood approximations used to mitigate those difficulties. The last part shows how to incorporate a Markov random field into the normal mixture model for image segmentation.
2.2 Mixture models
Mixture models provide an important tool for statistical modelling and inference. A mixture model is a density that is a linear combination of other densities. One advantage of using a mixture model is that quite complex densities may be built up of simpler and well-known component densities, which need not all be the same. Mixtures are particularly useful for clustering applications. The resulting model can be used to calculate the probability that a given observation (intensity) belongs to a particular mixture component (tissue type), providing a soft classification of the data. This may be converted into a segmentation by e.g. assigning each pixel to the mixture component it has the highest posterior probability of belonging to. McLachlan and Peel (2000) provide an extensive treatment of the theory of finite mixture models.
Let $Y_i$ ($i = 1, \ldots, n$) be a random sample of size $n$, where $Y_i$ is a $p$-dimensional random vector and has probability density function (pdf) $f(y_i)$. Let $y_i$ be a realisation of $Y_i$, and let $y = (y_1^T, \ldots, y_n^T)^T$ denote the entire observed data.
The random vector $Y_i$ is said to have come from a $g$-component mixture if its pdf takes the form
\[ f(y_i; \Theta) = \sum_{j=1}^{g} \pi_j f_j(y_i; \theta_j), \tag{2.1} \]
where component $j$ has pdf $f_j$ with parameters $\theta_j$, and $\Theta$ contains the elements of $(\theta_1^T, \ldots, \theta_g^T)^T$ known a priori to be distinct. It is not necessary that the $f_j$ be of the same form (e.g. all Gaussian). The mixing proportions $\pi_j$ ($j = 1, \ldots, g$) are non-negative and sum to one.
It is assumed that each observation belongs to one component of the mixture, but it is not necessarily known which. It is helpful to introduce latent (unobserved) random variables Z = (Z1, . . . , Zn) indicating the component of the mixture each observation belongs to, with zi correspondingly being a realisation of Zi. Each Zi is a vector of length g consisting of exactly one 1 in the jth position and all other elements 0. Let the jth element of Zi be Zij. Then observation i is said to be in component j if and only if Zij = 1. Alternatively, we may write that Zi = ej, where ej is the vector that is 1 in the jth position and 0 elsewhere.
In a standard mixture model, we assume Zi are independently and identically distributed according to p(zi; Ψ). To arrive at the pdf f(yi) in (2.1), we suppose that Zi are drawn from the multinomial distribution (with a single trial) with probabilities Ψ = (π1, . . . , πg). That is, p(Zi = ej) = πj, and only π1, . . . , πg−1 need to be estimated (as the proportions sum to 1, this determines πg). In this formulation, we may identify fj(yi) with f(yi|Zi = ej), which we will usually write as f(yi|ej) for brevity.
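The generative view of (2.1) can be sketched in a few lines of Python: a latent label is drawn with probabilities πj, and the observation is then drawn from the corresponding component. This is a minimal illustration only, assuming scalar observations and Gaussian components; the function names and the parameter values are hypothetical and not taken from the thesis, and labels are stored as 0-based integers j rather than as indicator vectors ej.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, proportions, means, sds):
    """Mixture density (2.1): sum over j of pi_j * f_j(y; theta_j)."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(proportions, means, sds))


def sample_mixture(n, proportions, means, sds, seed=0):
    """Draw n (label, value) pairs: first the latent label, then the observation."""
    rng = random.Random(seed)
    labels = rng.choices(range(len(proportions)), weights=proportions, k=n)
    values = [rng.gauss(means[j], sds[j]) for j in labels]
    return labels, values


# Hypothetical parameters for three components (illustrative values only).
proportions = [0.2, 0.45, 0.35]
means = [30.0, 80.0, 120.0]
sds = [10.0, 12.0, 9.0]

labels, values = sample_mixture(1000, proportions, means, sds)
```

In practice only `values` would be observed; recovering `labels` from them is the segmentation problem addressed in the rest of this chapter.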
2.2.1 Expectation-maximisation
Observing only y, we wish to estimate the mixture parameters Θ and mixing proportions Ψ. For clustering and classification applications we may also wish to retrieve the latent variables z. Expectation-Maximisation (EM) (Dempster et al., 1977) can be used to search for maximum likelihood estimates for the tissue means and covariance matrices and mixing proportions. Only the main equations are shown here; for their derivations, see Appendix A.
Since $Z_{ij} = 1$ for exactly one $j$ and is 0 for all others, and the $Z_i$ are assumed independent, the marginal density of the class labels is
\[ p(z; \Psi) = \prod_{i=1}^{n} \prod_{j=1}^{g} \pi_j^{z_{ij}}. \tag{2.2} \]
We also assume that the observed values $Y_i$ are independent given their labels $Z_i$, so that
\[ f(y \mid z; \Theta) = \prod_{i=1}^{n} f(y_i \mid z_i; \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{g} f(y_i \mid Z_i = e_j; \theta_j)^{z_{ij}}. \tag{2.3} \]
Combining (2.3) and (2.2) yields the joint likelihood:
\[ L(\Theta, \Psi; Y, Z) = f(y, z; \Theta, \Psi) = \prod_{i=1}^{n} \prod_{j=1}^{g} \left( \pi_j f(y_i \mid e_j; \theta_j) \right)^{z_{ij}}. \tag{2.4} \]
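The Q-function, being the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter estimates, is then
\[ Q(\Theta, \Psi \mid \Theta^{(t)}, \Psi^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{g} E\left[ z_{ij} \mid y; \Theta^{(t)}, \Psi^{(t)} \right] \left( \log \pi_j + \log f(y_i \mid e_j; \theta_j) \right), \]
where $(t)$ indicates values at the $t$-th iteration.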
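On the E-step, the expected label values given the data are calculated:
\[ \tau_{ij}^{(t)} = E\left[ z_{ij} \mid y; \Theta^{(t)}, \Psi^{(t)} \right] = \frac{\pi_j^{(t)} f(y_i \mid e_j; \theta_j^{(t)})}{\sum_{k=1}^{g} \pi_k^{(t)} f(y_i \mid e_k; \theta_k^{(t)})}. \tag{2.5} \]
The quantities $\tau_{ij}^{(t)}$ are also the posterior probabilities that observation $i$ belongs to component $j$ of the mixture, $p(Z_i = e_j \mid y; \Theta^{(t)}, \Psi^{(t)})$.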
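As an illustration of (2.5), the following Python sketch computes the posterior probabilities for a handful of scalar intensities. It assumes univariate Gaussian components and uses made-up parameter values; the function names are hypothetical. It also does not guard against all component densities underflowing to zero, which a practical implementation would handle by working with logarithms.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def e_step(values, proportions, means, sds):
    """Return tau[i][j], the posterior probability (2.5) that y_i is in component j."""
    tau = []
    for y in values:
        # Numerator of (2.5) for each component j.
        weighted = [p * normal_pdf(y, m, s) for p, m, s in zip(proportions, means, sds)]
        # Denominator of (2.5): the mixture density at y.
        total = sum(weighted)
        tau.append([w / total for w in weighted])
    return tau


# Hypothetical intensities and current parameter estimates (illustrative only).
tau = e_step(
    values=[25.0, 80.0, 130.0],
    proportions=[0.2, 0.45, 0.35],
    means=[30.0, 80.0, 120.0],
    sds=[10.0, 12.0, 9.0],
)
```

Each row of `tau` sums to one, giving the soft classification of the corresponding observation.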
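On the M-step, the Q-function is maximised with respect to the parameters $\Theta$ and $\Psi$, yielding:
\[ \pi_j^{(t+1)} = \frac{\sum_{i=1}^{n} \tau_{ij}^{(t)}}{n}, \qquad \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \Psi \mid \Theta^{(t)}, \Psi^{(t)}), \tag{2.6} \]
where the maximiser over $\Theta$ is typically found by solving $\partial Q / \partial \Theta = 0$.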
The E and M steps are repeated until a stopping condition has been satisfied. Common stopping conditions are the convergence of the parameter values or the convergence of the observed-data log-likelihood. The observed-data likelihood is given by
\[ f(y; \Theta, \Psi) = \sum_{z} f(y \mid z; \Theta) \, p(z; \Psi). \]
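Although the sum above runs over all $g^n$ possible labellings, it does not need to be enumerated for the standard mixture model. As a short worked step (following only from the independence assumptions behind (2.2) and (2.3)), the sum over labellings of a product over observations can be exchanged for a product over observations of a sum over components:

```latex
\begin{align*}
f(y; \Theta, \Psi)
  &= \sum_{z} \prod_{i=1}^{n} \prod_{j=1}^{g}
     \left( \pi_j f(y_i \mid e_j; \theta_j) \right)^{z_{ij}}
   = \prod_{i=1}^{n} \sum_{j=1}^{g} \pi_j f(y_i \mid e_j; \theta_j), \\
\log f(y; \Theta, \Psi)
  &= \sum_{i=1}^{n} \log \sum_{j=1}^{g} \pi_j f(y_i \mid e_j; \theta_j).
\end{align*}
```

The inner sum is the denominator of (2.5), so the observed-data log-likelihood can be accumulated during the E-step at no extra cost.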
2.2.2 Normal mixture models
In the context of brain MRI segmentation, Yi represents the intensity of an n-pixel MRI at voxel i. This is typically scalar, though if a multichannel image is taken it will be a vector. Note that Yi represents measurements at a single pixel, and the sample Y consists of the measurements at all pixels of a single image (possibly multi-channel), as opposed to Yi being an entire image of a subject and Y being images of many subjects.
Figure 2.1: Axial MRI slice and intensity histogram, with tissue intensity distributions from a manual segmentation
We are concerned with segmentation of the brain into CSF, GM and WM only (not including bone, background etc.), so have g = 3. The unobserved variables Zi give the tissue label at each pixel, and z represents a particular segmentation of the image.
Figure 2.1 shows a brain MRI and the intensity histogram of the brain voxel intensities. It also shows the intensity distribution of each tissue, where the tissues are determined by an expert manual segmentation. The distribution is trimodal, with one mode corresponding to each of the tissues. CSF has the lowest average intensity, followed by grey matter, and then white matter. Each tissue’s intensity distribution appears to be approximately normal. In fact, use of a 3-component normal mixture model for brain MRI segmentation is standard. The skull may be stripped from the image beforehand using various techniques so that only brain tissue is included. Multiple Gaussians per tissue are also sometimes used (Ashburner and Friston, 1997). We will assume each mixture component $f(y_i \mid Z_i = e_j)$ to be Gaussian with mean $\mu_j$ and covariance matrix $\Sigma_j$:
\[ f(y_i; \Theta) = \sum_{j=1}^{g} \pi_j \phi(y_i; \mu_j, \Sigma_j), \]
where $\phi$ is the Gaussian pdf. For a normal mixture model, the M-step for the mean and covariance matrix is (see Appendix A for the derivation):
\[ \mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} \tau_{ij}^{(t)} y_i}{\sum_{i=1}^{n} \tau_{ij}^{(t)}}, \qquad \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} \tau_{ij}^{(t)} (y_i - \mu_j^{(t+1)})(y_i - \mu_j^{(t+1)})^T}{\sum_{i=1}^{n} \tau_{ij}^{(t)}}. \tag{2.7} \]
A mixture model is only identifiable up to the labels $j$; for example, switching the parameters and mixing proportions of components 1 and 2 will yield the same density. Identifiability is generally restored by imposing some constraint on the parameters. For example, for scalar $y_i$ (as we will deal with in this thesis),
\[ \mu_1 \le \mu_2 \le \mu_3. \]
In this thesis our example datasets consist of T1 MR images; this convention is equivalent to having $j = 1$ for CSF, $j = 2$ for GM and $j = 3$ for WM.
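The E-step (2.5), the M-step (2.6) and (2.7), a log-likelihood stopping rule and the ordering constraint on the means can be combined into a short EM loop. The Python sketch below is an illustration under simplifying assumptions rather than the implementation used in this thesis: observations are scalar, the data are synthetic, the function names, starting values and tolerance are hypothetical, and no safeguard against collapsing variances is included.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of a univariate normal distribution with the given variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, means, variances, proportions, max_iter=500, tol=1e-9):
    """Fit a scalar normal mixture by EM; components are returned sorted by mean."""
    means, variances, proportions = list(means), list(variances), list(proportions)
    g, n = len(means), len(values)
    previous = -math.inf
    for _ in range(max_iter):
        # E-step (2.5), accumulating the observed-data log-likelihood as we go.
        tau, loglik = [], 0.0
        for y in values:
            weighted = [proportions[j] * normal_pdf(y, means[j], variances[j])
                        for j in range(g)]
            total = sum(weighted)
            loglik += math.log(total)
            tau.append([w / total for w in weighted])
        # Stop once the log-likelihood no longer increases appreciably.
        if loglik - previous < tol * abs(loglik):
            break
        previous = loglik
        # M-step (2.6)-(2.7): responsibility-weighted proportions, means, variances.
        for j in range(g):
            weight = sum(row[j] for row in tau)
            proportions[j] = weight / n
            means[j] = sum(row[j] * y for row, y in zip(tau, values)) / weight
            variances[j] = sum(row[j] * (y - means[j]) ** 2
                               for row, y in zip(tau, values)) / weight
    # Restore identifiability by ordering the components by increasing mean.
    order = sorted(range(g), key=lambda j: means[j])
    return ([proportions[j] for j in order],
            [means[j] for j in order],
            [variances[j] for j in order])


# Synthetic intensities from three well-separated classes (illustrative only).
rng = random.Random(1)
values = ([rng.gauss(30.0, 5.0) for _ in range(300)]
          + [rng.gauss(80.0, 6.0) for _ in range(400)]
          + [rng.gauss(120.0, 5.0) for _ in range(300)])

# Starting means are deliberately given out of order to exercise the final sort.
fitted_proportions, fitted_means, fitted_variances = fit_gmm_1d(
    values, means=[110.0, 20.0, 70.0], variances=[100.0] * 3, proportions=[1 / 3] * 3)
```

Sorting by mean only at the end leaves the EM iterations untouched; the constraint simply fixes which fitted component is reported as CSF, GM or WM.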
2.2.3 Image segmentation
The aim of image segmentation is not so much to determine the tissue parameters $\Theta$, but rather to determine the underlying segmentation $z$. Once estimates for the parameters have been determined, they can be used to calculate the posterior probability that each observation $i$ belongs to a particular class $j$, i.e. $\tau_{ij}$. The class memberships $z_i$ (i.e. the hard image segmentation into tissue classes) may be estimated for each voxel by
\[ \hat{z}_i = \arg\max_{e_j,\; j = 1, \ldots, g} p(Z_i = e_j \mid Y_i; \Theta^{(t)}, \Psi^{(t)}) = \arg\max_{j} \tau_{ij}^{(t)}. \tag{2.8} \]
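Rule (2.8) amounts to taking, for each voxel, the index of its largest posterior probability. A minimal Python sketch follows, with hypothetical posterior values and 0-based labels (0 for CSF, 1 for GM, 2 for WM under the ordering convention above); ties, which (2.8) leaves unspecified, are broken here in favour of the lowest index.

```python
def hard_labels(tau):
    """Classification rule (2.8): index of the largest posterior probability per voxel."""
    return [max(range(len(row)), key=row.__getitem__) for row in tau]


# Hypothetical posterior probabilities for four voxels and g = 3 tissues.
tau = [
    [0.90, 0.09, 0.01],
    [0.20, 0.70, 0.10],
    [0.05, 0.35, 0.60],
    [0.40, 0.45, 0.15],
]
labels = hard_labels(tau)
```

The last voxel shows how little evidence is needed to fix a label: it is assigned to the second class although its posterior probability is only 0.45.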
2.3 Markov Random Fields
The standard mixture model assumes that all voxel labels are independent. This leads to the classification rule (2.8), which assigns the same tissue label to all voxels that have the same intensity. This can be a problem with noisy images - isolated bright or dark pixels will be classified purely according to their intensity, even if they are located in regions of opposite brightness. This can lead to segmentations that are themselves quite noisy, as can be seen in figure 2.2. In practice, tissue labels are not independent. Rather, a pixel’s label should depend on the labels and intensities of its neighbouring pixels. It is more likely that an isolated bright pixel in a dark image region should belong to the same tissue as its dark neighbours, than to a component with bright mean intensity.
There have been various attempts to incorporate spatial smoothness into intensity-based segmentation. The most basic involve convolving the pixel intensities with e.g. a Gaussian kernel before fitting a standard Gaussian mixture. This smooths the intensities, reducing noise. The problem with this is that noise and edges are smoothed uniformly, so that sharpness around legitimate boundaries of tissues is lost. Also, the Gaussian mixture still assumes that each voxel is independent of the others. Rather than incorporating the spatial dependence into the model itself, this modifies the observations (image intensities) prior to fitting in order to make the model more applicable.
Figure 2.2: Segmentation with a 3-component Gaussian mixture model is susceptible to image noise. (a) MRI; (b) EM segmentation; (c) intensity thresholds on the image histogram.
An alternative to pre-processing the input image y is post-processing the output segmentation z instead. After the segmentation is obtained, morphological operations such as dilations and erosions may be used to fill in small holes in the segmentation and smooth the tissue boundaries. This suffers from similar problems to pre-processing - in particular, the sulci and gyri forming the convoluted boundary of the brain can be of sufficiently small size in the image that they are smoothed as well as the noise.
A more elegant option is to incorporate the spatial dependence directly into the probability model itself. The voxel intensity Yi could be allowed to depend on the intensities and/or labels of its neighbours. This would be a suitable model for image blurring, where the observation at location i is corrupted by observations from nearby locations, or where observations are made on a coarser grid than the underlying location lattice. More suited to our situation is to allow each voxel’s label (rather than intensity) to depend on the labels and/or intensities of its neighbours. This encodes the statement that locations in close proximity are more likely to be of the same tissue.
We will proceed by allowing each voxel’s label Zi to depend on the labels of its neighbours. We still assume that Yi|Zi, the intensities given their labels, are conditionally independent, but relax the assumption of independence between the Zi. A suitable way to capture the dependence of each pixel on its neighbours is through a Markov random field.
Let us represent a set of variables by an undirected graph: each variable is a vertex, while an edge between variables indicates dependence between these variables. Vertices that are not directly connected by an edge should depend on each other only through intermediate nodes that form a path between the vertices. A Markov random field is a probability distribution that encapsulates the dependencies in the graph. For a more extensive treatment, as well as analogues for directed and hierarchical graphs, see (Koller and Friedman, 2009; Lauritzen, 1996). For a concise review of statistical inference for MRFs, see (Stoehr, 2017).
More formally, let $(V, E)$ be the vertices and edges of an undirected graph, and let $X_i$, $i \in V$, be random variables, one per vertex. First, we define the notion of conditional independence (Dawid, 1980): we say that a variable $X_i$ is conditionally independent of $X_j$ given $X_m$, written $X_i \perp\!\!\!\perp X_j \mid X_m$, if the conditional probability $p(X_i \mid X_j, X_m)$ does not depend on $X_j$, i.e. $p(X_i \mid X_j, X_m) = p(X_i \mid X_m)$. For a subset of vertices $A$, let $X_A$ denote the collection $\{X_i : i \in A\}$. Let $\partial i$ denote the set of neighbours of vertex $i$; that is, all vertices that are connected by an edge to $i$. Then, $X_{\partial i}$ denotes the variables that depend directly on $X_i$. The random variables form a Markov random field if the following properties are satisfied:
• Pairwise Markov property: two variables corresponding to vertices that are not connected by an edge are conditionally independent given the rest of the variables. For any $i$ and $m$ not connected by an edge,
\[ X_i \perp\!\!\!\perp X_m \mid X_{V \setminus \{i, m\}}. \]
• Local Markov property: a variable (vertex) is conditionally independent of all other variables (vertices) not including its neighbours, given its neighbours. For any $i$,
\[ X_i \perp\!\!\!\perp X_{V \setminus (\{i\} \cup \partial i)} \mid X_{\partial i}. \]
• Global Markov property: disjoint sets of variables are conditionally independent given a separating subset. For any disjoint sets of vertices $A$, $B$ and $S$ such that $S$ separates $A$ from $B$,
\[ X_A \perp\!\!\!\perp X_B \mid X_S. \]
A subset of vertices $S$ is said to separate sets $A$ and $B$ if removing $S$ from the graph disconnects $A$ and $B$ into separate connected components. Equivalently, every path (if any) from $A$ to $B$ passes through $S$.
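The separation condition in the global Markov property is purely graph-theoretic, so it can be checked directly. The following is a minimal sketch, not part of the original development: the helper names `grid_edges` and `separates` are hypothetical, and the 3 x 3 lattice with 4 neighbours per vertex is an assumed toy example. It illustrates that the neighbours of a vertex separate it from the rest of the graph, which is how the local property arises as a special case of the global one.

```python
from collections import deque


def grid_edges(rows, cols):
    """Edges of a 4-neighbour lattice whose vertices are (row, col) pairs."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))
    return edges


def separates(edges, A, B, S):
    """Return True if every path from A to B passes through S.

    A, B and S are assumed to be disjoint sets of vertices.
    """
    A, B, S = set(A), set(B), set(S)
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Breadth-first search from A that never steps onto a vertex of S.
    seen = set(A - S)
    frontier = deque(seen)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False  # reached B while avoiding S
        for v in adjacency.get(u, ()):
            if v not in S and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True


edges = grid_edges(3, 3)
centre = (1, 1)
neighbours = {(0, 1), (1, 0), (1, 2), (2, 1)}    # the set "∂i" of the centre
corners = {(0, 0), (0, 2), (2, 0), (2, 2)}       # all remaining vertices

print(separates(edges, {centre}, corners, neighbours))  # True
print(separates(edges, {centre}, corners, {(0, 1)}))    # False
```

In the second call only one neighbour is removed, so a path such as (1, 1), (1, 0), (0, 0) still connects the centre to a corner.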
It can be shown that for an undirected graph, the global property implies the local, which implies the pairwise (Lauritzen, 1996, proposition 3.4). However, they are not in general equivalent. In the context of image segmentation, $X_i = Z_i$ and $V$ is the set of voxels in the image.
2.3.1 Hammersley-Clifford theorem
It is usually more convenient to define dependences between variables locally, i.e. the pdf of each node given its neighbours, $p(Z_i = z_i \mid Z_{\partial i} = z_{\partial i})$, is given. We will shorten this to $p(z_i \mid z_{\partial i})$ for convenience of notation in the remainder of the thesis. We assume $p(z_i \mid z_{\partial i})$ belongs to the exponential family. It is common to write it as
\[ p(z_i \mid z_{\partial i}; \Psi) = \frac{\exp(-U_i(z_i \mid z_{\partial i}; \Psi))}{C_i}, \qquad C_i = \sum_{k=1}^{g} \exp(-U_i(e_k \mid z_{\partial i}; \Psi)), \tag{2.9} \]
where the negative sign is by convention. The function $U_i$ is often called a potential.
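As an illustration of (2.9), the sketch below evaluates the local conditional for one vertex. It rests on assumptions that are not part of the definition above: labels are coded as integers 0, ..., g-1 rather than indicator vectors $e_k$, and the potential is taken to be of Potts type, $U_i(e_k \mid z_{\partial i}; \beta) = -\beta \times$ (number of neighbours with label $k$), with a single parameter $\beta$ playing the role of $\Psi$. The function names are hypothetical.

```python
import math


def potts_potential(k, neighbour_labels, beta):
    """Assumed Potts-type potential: minus beta times the number of
    neighbours that share label k."""
    return -beta * sum(1 for m in neighbour_labels if m == k)


def local_conditional(neighbour_labels, g, beta, potential=potts_potential):
    """Evaluate p(z_i = k | z_∂i) for k = 0, ..., g-1 as in (2.9)."""
    weights = [math.exp(-potential(k, neighbour_labels, beta)) for k in range(g)]
    c_i = sum(weights)  # local normalising constant C_i
    return [w / c_i for w in weights]


# A vertex with four neighbours, three of which carry label 0.
probs = local_conditional([0, 0, 1, 0], g=3, beta=1.0)
print(probs)

# With beta = 0 the neighbours carry no information and every label
# is equally likely.
print(local_conditional([0, 0, 1, 0], g=3, beta=0.0))
```

Because $C_i$ sums over only the $g$ possible labels of a single vertex, the local conditional is cheap to evaluate, in contrast to a normalising constant that sums over all joint configurations.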
Given a set of local conditional pdfs, three questions arise:
• What is the corresponding joint density $p(z)$?
• Under what conditions are the $p(z_i \mid z_{\partial i})$ even compatible with each other?
• If they are compatible, does $p(z)$ satisfy the Markov properties?
These questions are addressed in Besag (1974). First, it is assumed that $p$ is positive, i.e. all realisations $z$ have positive probability. Then, the joint density may be found by taking the product of conditionals, normalised to sum to one (see (2.2) of Besag (1974)):
\[ p(z; \Psi) = \frac{1}{C} \prod_{i=1}^{n} p(z_i \mid z_{\partial i}; \Psi), \qquad C = \sum_{\text{all possible } z'} \prod_{i=1}^{n} p(z'_i \mid z'_{\partial i}; \Psi), \tag{2.10} \]
where $z_{\partial i}$ denotes all $z_m$ that $z_i$ depends on, i.e. all the neighbours of $i$, and $\Psi$ are the parameters of the MRF. We will omit the dependence on $\Psi$ unless relevant, for ease of notation.
As to whether a given joint pdf forms a valid Markov random field, the Hammersley-Clifford theorem gives the necessary and sufficient conditions. It was first proven by Hammersley and Clifford in an unpublished manuscript (Hammersley and Clifford, 1971), and later proved more generally and concisely by Besag (1974). A positive $p(z)$ forms a valid Markov random field (satisfies the Markov properties) if and only if it factorises over the cliques of its underlying graph.
A clique of a graph is a fully-connected subset of vertices. All the cliques of a 2D grid/lattice where each node has 4 neighbours, or where each node has 8 neighbours, are shown in Figure 2.3. For a pdf to factorise over cliques of a graph means that it can be written
p(z) = 1
C
Y
cliquesc
ψc<