Pelvis segmentation using multi-pass U-Net and iterative shape estimation

(1)

This is a repository copy of Pelvis segmentation using multi-pass U-Net and iterative

shape estimation.

White Rose Research Online URL for this paper:

http://eprints.whiterose.ac.uk/145379/

Version: Accepted Version

Proceedings Paper:

Wang, C, Connolly, B, de Oliveira Lopes, PF et al. (2 more authors) (2019) Pelvis

segmentation using multi-pass U-Net and iterative shape estimation. In: Lecture Notes in

Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture

Notes in Bioinformatics). MICCAI 2018, 21st International Conference on Medical Image

Computing & Computer Assisted Intervention, 16 Sep 2018, Granada, Spain. Springer,

Cham , pp. 49-57. ISBN 9783030111656

https://doi.org/10.1007/978-3-030-11166-3_5

© Springer Nature Switzerland AG 2019. This is a post-peer-review, pre-copyedit version

of an article published in Lecture Notes in Artificial Intelligence. The final authenticated

version is available online at: https://doi.org/10.1007/978-3-030-11166-3_5

[email protected] https://eprints.whiterose.ac.uk/

Reuse

Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of the full text version. This is indicated by the licence information on the White Rose Research Online record for the item.

Takedown

If you consider content in White Rose Research Online to be in breach of UK law, please notify us by

(2)

iterative shape estimation

Chunliang Wang, Bryan Connolly, Pedro Filipe de Oliveira Lopes, Alejandro F Frangi, ¨Orjan Smedby

Departement of Biomedical Engineering and Health Systems, Stockholm, Sweden Radiology Department, Karolinska Institute,Stockholm, Sweden

Center for Computational Imaging & Simulation Technologies in Biomedicine, Electronic & Electrical Engineering Department, The University of Sheffield,

Sheffield, UK [email protected]

Abstract. In this report, we propose an automatic method for segmen-tation of the pelvis from 3D CT images. The method is based on a 3D U-net which takes both the 3D CT image and estimated volumetric shape models of the targeted structures as input and outputs the probability maps of each structure. During training, the 3D U-net is initially trained using blank shape context inputs to generate the segmentation masks, i.e. relying only on the image channel of the input. The preliminary seg-mentation results are used to estimate a new shape model, which is then fed to the same network again, with the input images. With the addi-tional shape context information, the U-net is trained again to generate better segmentation results. During the testing phase, the input image is fed through the same 3D U-net multiple times, first with blank shape context channels and then with iteratively re-estimated shape models. Preliminary results show that the proposed multi-pass U-net with itera-tive shape estimation outperforms both 2D and 3D conventional U-nets without the shape model.

Keywords: deep learning, multi-pass U-net, pelvis segmentation, shape context, statistic shape model

1 Introduction

Improving resolution of computed tomography (CT) scanners and recent ad-vances in 3D printing technology make image-guided personalized treatment planning and implant design more feasible than ever before. However, how to accurately extract anatomical structure models from 3D high-resolution medical images promptly remains a non-negligible challenge that limits the clinical ap-plicability of new approaches in therapy planning. As a crucial step in surgery planning, radiotherapy planning and quantitative disease evaluation, pelvis seg-mentation is often done manually in clinical practice. It can take half an hour to several hours for a trained radiologist to segment the complete pelvis, which is

(3)

2 Pelvis segmentation using multi-pass U-net and iterative shape estimationt

not acceptable in most public hospitals [1]. A large variety of automated or semi-automated segmentation methods have been developed to address this problem. The methods range from thresholding-based approaches, over clustering-based approaches to deformable-model-based approaches [1–3]. However, most exist-ing methods have achieved moderate success only on a few cases. In this study, we tried to evaluate a new deep-learning-based pelvis segmentation method on a relatively large dataset of 90 patients. In recent years, deep-learning-based meth-ods have gained more and more attention due to their superior performance in several image segmentation challenges [4, 5]. The deep neural networks are often trained on a large number of training samples in a more or less black-box man-ner. It learns task-specific image features and rules by minimizing the objective function often correlated with the segmentation error. However, these learned features and rules are difficult to interpret. One can only imagine that the clas-sification decision is based on local appearance and global shapes represented by a stack of convolution kernels and mixed through weighted non-linear rules. The consensus from existing literature is that shape prior knowledge is crucial for accurate and robust pelvis segmentation [1]. The most promising conven-tional pelvis segmentation methods are based on either statistical shape models or multi-atlas registration [2, 3, 6]. How to enforce shape prior in a deep learning framework and whether that helps to improve segmentation performance re-mains an open question. In a previous study [7], Wang et al. proposed a method where statistical shape models were combined with 2.5D U-net to segment mul-tiple structures of the heart in 3D MRI and CT images. In their setup, three 2D U-nets were first trained to segment the multiple heart structures in three orthogonal views, and then the probability maps were combined and used to estimate 3D shape models of the heart and ventricles. The estimated shapes combined with input images are fed to a second set of 2D U-nets to deliver a refined segmentation result. In this study, we adapt a similar strategy and train a 3D U-net that takes both the 3D CT image and estimated volumetric shape models of the targeted structures as input to generate the segmentation results. However, instead of training two U-nets, we retrain the same 3D U-net multiple times, first with blank images as shape context channels, and then with volu-metric shape representation estimated on the preliminary segmentation results. During the testing phase, the input image is passed through the trained 3D U-net several times, with iteratively re-estimated shape context information. In our preliminary experiments, the proposed method delivered better segmenta-tion results than the convensegmenta-tional methods using 2D or 3D U-net or statistical shape model methods.

2 Methods

The proposed framework, with a multi-pass 3D U-net with iterative shape model estimation, is summarized in Fig. 1. The core module of this framework is a 3D U-net that takes 3 channels as input: one for the input 3D CT image, two for shape context images. The network outputs 6 channels representing the

(4)

proba-bility maps of the background, left femur, right femur, left hip bone, right hip bone and sacrum, respectively. The U-net architecture was initially proposed by Ronneberger et al. [5]. In this study we simply extend it to process 3D images instead of 2D images. Our 3D U-net consists of 3 max pooling layers and 3 up-sampling layers, and two convolutional layers with kernel size of 3 × 3 × 3 located between max pooling or/and up-sampling layers. Due to limited GPU memory, the input volume size to the U-net is set to 128 × 128 × 72. Original CT data are down-sampled to 3mm isotropic resolution before being split into overlapping 3D patches of required size. The outputs of the U-net are passed on to a level-set-based volumetric statistical shape estimation module where the shapes of the whole pelvis (including hip bones and sacrum as a whole) and the sacrum bone are estimated. The volumetric shape images are then added to the corresponding input channels and trigger the 3D U-net to re-run. The segmentation and shape estimation loop can be repeated several times until no significant changes occur.

Fig. 1. Overview of the proposed multi-pass 3D U-net with iterative shape model estimation

2.1 Data set and ground truth generation

From the publicly available image database (cancerimagingarchive.net [8]), we consecutively selected 90 abdominal CT from two studies (50 from the CT colonography study [9], 40 from the lymph node study [10]). Imaging protocols

(5)

for these studies are available at [9] and [10] respectively. For the sake of pelvis segmentation, we only selected scans that cover the full pelvis, i.e. all hip bones are contained in the scan range without being partially cut off. Ground-truth segmentation masks were created by an experienced radiologist, using an inter-active segmentation tool based on fuzzy connectedness and the level set method [11]. Besides the interactive segmentation, all segmentation masks were carefully inspected and manually fixed by the radiologist using the manual segmentation tool in ITKSnap [12]. On average, the radiologist spent 30-50 minutes on each case during the interactive segmentation and manual mask curation procedure. 2.2 Statistical shape model creation

The so-called shape context, or shape image, is a volumetric representation of the subjects shape, which is a signed distance map from the surface of the segmented object. However, instead of performing distance transform on the segmentation result directly, we fit a statistical shape model to it to eliminate irregularity that may present in the segmentation results. A statistical model is created by averag-ing of the signed distance maps of several segmented subjects (trainaverag-ing subjects) and computing the main variations using Principal Component Analysis (PCA) [13]. In this study, the shape model is created using 20 randomly selected train-ing subjects, in which the top 10 principal components are computed via PCA. As suggested by Leventon et al., the shape model M that matches the current segmentation is estimated by solving a level set function

∂φ

∂t = αF (x) + βM (T (x)) + γκ(x) |∇φ| (1) where F is the image force related to the probability output by the U-net, M is the statistical model as a weighted sum of the mean shape and PCA components, T is the global transformation and κ is the mean curvature. The transformation T and the weighting factors of PCA components are updated iteratively by minimizing the squared distance between the model and the level set function, which is also a signed distance map. α, β and γ are weighting factors that can be determined empirically.

2.3 Training phase

In the proposed framework, the statistical shape model training is performed only once, but the multi-pass 3D U-net must be trained in two steps. The first step is to train the net by feeding the 3D CT volumes and blank shape context volumes (all voxels are set to zero) to the 3D U-net, which will force the network to learn the segmentation using only CT images. In the second step, the pre-trained U-net is repre-trained with the 3D CT volumes and the real shape context volumes generated from fitting the statistical shape models to the output of the pre-trained 3D U-net. As the weights of the network are already initialized to recognize anatomical structures from the CT channel, the U-net is expected to learn gradually to use the context information where it helps to improve the

(6)

segmentation results, instead of heavily relying on the context layer. For the U-net training, we used categorical cross entropy as the loss function, stochastic gradient descent was used as optimizer and the training rate is set to 0.01. The network was first trained for 100 epochs in the first-step training and 40 epochs in the second-step training (Fig. 2). Sample augmentation was used, where random translating, rotating and scaling are added when creating the training images. 2.4 Testing phase

During testing, the input images were first sent to the trained 3D U-net with blank shape context layers, and then patient-specific shape models were created by fitting the statistical shape model to the preliminary segmentation results. The shape models were added as context layers to the 3D U-net to re-generate the segmentations. The process can be repeated several times. Besides image resampling and intensity normalization, no pre-processing steps are required for the proposed multi-pass U-net segmentation.

Fig. 2.the Dice coefficient of the right femur (A), right hip bone (B) and sacrum bone (C) during the two-step training (first step: 0-100 epochs, second step: 100-140 epochs). Thick line: Dice coefficient on the training set. Thin line: Dice coefficient on the testing set.

3 Results

To test the performance of the proposed method, we ran a 5-fold cross vali-dation using the 90 cases. In each fold, 72 subjects are used for training and 18 subjects are used for testing. Both the 3D U-net and the statistical shape models are retrained in every fold. To make comparison, we also implemented a hierarchical statistical shape model based segmentation method reported in [14], and plain 2D and 3D U-nets and trained them on the same dataset. The average segmentation accuracy of 5 bone structures of the pelvis from the 90 subjects is summarized in Table 1. On average, running a 2-pass 3D U-net de-livers better results than a 3D U-net. The performance gain is more visible on the sacrum than on other bone structures. Fig. 3 shows several example cases where adding shape context helps to improve the segmentation results. Running

(7)

the multi-pass U-net for the third time will slightly improve the segmentation accuracy, but no significant improvement is observed when run over 3 iterations. For each fold, the training took 60 hours on a NVIDIA GTX 1080ti GPU. For testing, the U-net prediction takes 20-30 seconds to process a 3D volume, and the shape model estimation takes 2-3 minutes for each pass.

Fig. 3. Sample cases comparing the results from the first-pass segmentation (upper row) and the second-pass segmentation (lower row). Arrows indicate segmentation errors.

Table 1.Segmentation accuracy measured as Dice coefficient of the proposed method and alternative methods

Methods Statistical shape model 2D U-net axial view 3D U-net 1st-pass 3D U-net with blank context 2nd-pass 3D U-net with shape context 3rd-pass 3D U-net with shape context Left femur - 0.925±_{0.178 0.949}±_{0.039 0.937}±_{0.047 0.958}±_{0.032 0.958}±_0.031 Right femur - 0.939±_{0.122 0.953}±_{0.019 0.942}±_{0.026 0.961}±_{0.018 0.962}±_0.018 Left hip 0.915±_{0.065 0.940}±_{0.079 0.947}±_{0.016 0.947}±_{0.020 0.957}±_{0.012 0.958}±_0.013 Right hip 0.908±_{0.059 0.943}±_{0.054 0.947}±_{0.016 0.944}±_{0.021 0.957}±_{0.011 0.957}±_0.011 Sacrum 0.850±_{0.082 0.894}±_{0.056 0.905}±_{0.032 0.909}±_{0.029 0.921}±_{0.028 0.924}±_0.027

4 Discussion and conclusion

The results suggest that adding shape context information to a deep neural network seems to improve the segmentation accuracy, especially for relatively

(8)

challenging anatomical structures like the sacrum. Wang et al. reported similar findings when they added shape context to 2D U-net [7]. Moving to 3D U-net is important, since even though the previous study showed that shape context helped to improve the segmentation accuracy, it is unclear whether that is due to the 3D shape model providing 3D information that is not accessible to 2D U-nets, or to it providing useful context information that will help segmentation. Another contribution of the proposed method is to replace the two sets of U-nets with a single U-net that can take blank shape context channels. This not only reduces the file size for deploying the trained net, saving the time of loading and switching network into computer memory, but also makes it possible to run the shape model estimation and shape-context-based segmentation multiple times in an iterative manner. This will hopefully further improve the segmentation accuracy. (However, the benefit of running over 2 passes was not very evident in our experiments.) Wang et al. reported overfitting on certain structures when applying their 2.5 U-net with shape context to the heart segmentation, i.e. the segmentation accuracy drops when the shape context is added [7]. The design of trying to use a single 3D U-net to handle cases with and without shape con-text will force the network to not rely on the shape concon-text too much, avoiding overfitting. We tried several strategies to make sure that, after the second-round training, the single 3D U-net can perform well on samples with only the in-put CT image (with blank shape context channels) and samples with both CT image and shape context channels. These strategies include mixing the train-ing samples with and without shape context, alternattrain-ing traintrain-ing between two types of training samples, and complete retraining while gradually adding one group into another. However, we found that simply continuing to train the U-net with samples with shape context works best. As shown in Table 1, the segmen-tation accuracy of 1st pass on the retrained 3D U-net is only slightly inferior to the results of the 3D U-net trained on samples without shape context. One limitation of this study is there are no diseased cases in the image database. The performance must be evaluated where fracture or other abnormality ex-ists. Another limitation is that the input images must be down-sampled for the segmentation operation which introduce additional errors. Finally, the ground-truth segmentation was generated by a single doctor, no inter-observer variation information is available. Future research activities have been planned to address these limitations.

In conclusion, we proposed a multi-pass 3D U-net framework with iteratively estimated shape models as context information. Preliminary results show that the proposed method outperforms both 2D and 3D conventional U-nets in 3D pelvis segmentation.

Acknowledgments. This study was supported by the Swedish Heart-lung foundation (grant no. 20160609) and the Swedish Childhood Cancer Founda-tion (grant no. MT2016-00166).

(9)

References

1. Ma, Z., Tavares, J.M.R.S., Jorge, R.N., Mascarenhas, T.: A review of algorithms for medical image segmentation and their applications to the female pelvic cavity. Computer Methods in Biomechanics and Biomedical Engineering 13(2), 235–246 (2010)

2. Seim, H., Kainmueller, D., Heller, M., Lamecker, H., Zachow, S., Hege, H.C.: Auto-matic segmentation of the pelvic bones from CT data based on a statistical shape model. In: Proceedings of the First Eurographics Conference on Visual Computing for Biomedicine. pp. 93–100. EG VCBM’08, Eurographics Association (2008) 3. Chu, C., Chen, C., Liu, L., Zheng, G.: FACTS: Fully automatic CT segmentation

of a hip joint. Annals of Biomedical Engineering 43(5), 1247–1259 (2015)

4. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR). pp. 3431–3440 (2015-06)

5. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi-cal image segmentation. In: International Conference on Medibiomedi-cal Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

6. Yokota, F., Okada, T., Takao, M., Sugano, N., Tada, Y., Sato, Y.: Automated segmentation of the femur and pelvis from 3d CT data of diseased hip using hier-archical statistical shape model of joint structure. In: Medical Image Computing and Computer-Assisted Intervention MICCAI 2009. pp. 811–818. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2009)

7. Wang, C., Smedby, ¨O.: Automatic whole heart segmentation using deep learning and shape context. In: MICCAI statistical atlases and computation models of the heart workshop proceedings (2017)

8. Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., Prior, F.: The cancer imaging archive (TCIA): Maintaining and operating a public information repository. Jour-nal of Digital Imaging 26 (2013)

9. Johnson, C.D., Chen, M.H., Toledano, A.Y., Heiken, J.P., Dachman, A., Kuo, M.D., Menias, C.O., Siewert, B., Cheema, J.I., Obregon, R.G., Fidler, J.L., Zim-merman, P., Horton, K.M., Coakley, K., Iyer, R.B., Hara, A.K., Halvorsen, R.A.J., Casola, G., Yee, J., Herman, B.A., Burgart, L.J., Limburg, P.J.: Accuracy of CT colonography for detection of large adenomas and cancers. New England Journal of Medicine 359(12), 1207–1217 (2008)

10. Roth, H.R., Lu, L., Seff, A., Cherry, K.M., Hoffman, J., Wang, S., Liu, J., Turk-bey, E., Summers, R.M.: A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations. In: Medical Im-age Computing and Computer-Assisted Intervention MICCAI 2014. pp. 520–527. Lecture Notes in Computer Science, Springer, Cham (2014)

11. anonymous (2018)

12. Yushkevich, P.A., Piven, J., Hazlett, H.C., Smith, R.G., Ho, S., Gee, J.C., Gerig, G.: User-guided 3d active contour segmentation of anatomical structures: signifi-cantly improved efficiency and reliability. NeuroImage 31(3), 1116–1128 (2006) 13. Leventon, M.E., Grimson, W.E.L., Faugeras, O.: Statistical shape influence in

geodesic active contours. In: Computer vision and pattern recognition, 2000. Pro-ceedings. IEEE conference on. vol. 1, pp. 316–323. IEEE (2000)

14. Wang, C., Smedby, ¨O.: Automatic multi-organ segmentation in non-enhanced CT datasets using hierarchical shape priors. In: Proceedings of the 22th International Conference on Pattern Recognition (ICPR). IEEE (2014)