A Physics-Guided Modular Deep-Learning Based Automated Framework for Tumor Segmentation in PET Images
Kevin H. Leung, Wael Marashdeh, Rick Wray, Saeed Ashrafinia, Martin G. Pomper, Arman Rahmim, Abhinav K. Jha
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA; The Russell H. Morgan Department of Radiology, Johns Hopkins University, Baltimore, MD, USA; Department of Radiology and Nuclear Medicine, Jordan University of Science and Technology, Arramtha, Jordan; Memorial Sloan Kettering Cancer Center, Greater New York City Area, NY, USA; Department of Electrical & Computer Engineering, Johns Hopkins University, Baltimore, MD, USA; Departments of Radiology and Physics, University of British Columbia, Vancouver, BC, Canada; Department of Biomedical Engineering and Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, MO, USA

E-mail: [email protected]
Abstract
An important need exists for reliable PET tumor-segmentation methods for tasks such as PET-based radiation-therapy planning and reliable quantification of volumetric and radiomic features. The purpose of this study was to develop an automated physics-guided deep-learning-based PET tumor-segmentation framework that addresses the challenges of limited spatial resolution, high image noise, and lack of clinical training data with ground-truth tumor boundaries in PET imaging. We propose a three-module PET-segmentation framework in the context of segmenting primary tumors in 3D 18F-fluorodeoxyglucose (FDG)-PET images of patients with lung cancer on a per-slice basis. The first module generates PET images containing highly realistic tumors with known ground-truth using a new stochastic and physics-based approach, addressing the lack of training data. The second module trains a modified U-net on these images, helping it learn the tumor-segmentation task. The third module fine-tunes this network using a small-sized clinical dataset with radiologist-defined delineations as surrogate ground-truth, helping the framework learn features potentially missed in simulated tumors. The framework's accuracy, generalizability to different scanners, sensitivity to partial volume effects (PVEs), and efficacy in reducing the number of training images were quantitatively evaluated using the Dice similarity coefficient (DSC) and several other metrics. The framework yielded reliable performance in both simulated (DSC: 0.87 (95% CI: 0.86, 0.88)) and patient images (DSC: 0.73 (95% CI: 0.71, 0.76)), outperformed several widely used semi-automated approaches, accurately segmented relatively small tumors (the smallest segmented cross-section was 1.83 cm²), generalized across five PET scanners (DSC: 0.74 (95% CI: 0.71, 0.76)), was relatively unaffected by PVEs, and required little training data (training with data from even 30 patients yielded a DSC of 0.70 (95% CI: 0.68, 0.71)). In conclusion, a modular deep-learning-based framework yielded reliable automated tumor delineation in FDG-PET images of patients with lung cancer using a small-sized clinical training dataset, generalized across scanners, and demonstrated the ability to segment small tumors.
Keywords: PET, automated segmentation, deep learning, oncology, partial volume effects

1. Introduction
Accurate tumor delineation in positron emission tomography (PET) is important for many clinical tasks, such as PET-based radiation-therapy planning and reliable quantification of volumetric and radiomic features (Jha et al).
Our objective in this paper was to develop a method that, when given a PET image slice with a tumor, automatically localizes and accurately delineates the tumor. In this context, deep-learning (DL)-based methods, in particular those based on convolutional neural networks, such as U-net, have shown substantial promise in medical-image segmentation, especially in anatomical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) (Litjens et al). However, PET poses additional challenges: limited spatial resolution, high image noise, and a lack of clinical training data with ground-truth tumor boundaries.
To address these challenges, instead of training a conventional DL-segmentation approach only on limited clinical manually delineated training images, we propose a physics-guided three-module DL framework (Fig. 1). The framework was developed and comprehensively evaluated in the context of segmenting primary tumors in 18F-fluorodeoxyglucose (FDG)-PET images of patients with lung cancer. It yielded accurate segmentation with a small clinical training set, generalized across different scanners, and was relatively insensitive to PVEs.

This was an IRB-approved, HIPAA-compliant, retrospective study with a waiver for obtaining informed consent. Data from 160 patients (91 male, 69 female; mean age, 63.2 ± 11.7 [standard deviation] years; range, 27-90 years) with biopsy-proven lung cancer and a measurable pulmonary tumor on staging 18F-FDG PET/CT were used. Patients with a second primary malignancy were excluded. Detailed patient characteristics are given in the Appendix (Table A1). The standard imaging protocol involved FDG administration of 0.22 mCi/kg and image acquisition 60 minutes post-injection. The data were acquired across five different scanners: Discovery LS (N=104), Discovery RX (N=40), Discovery HR (N=7), Discovery ST (N=7), and Discovery STE (N=2). Details on scanner and reconstruction parameters are given in Table 1.

A modular DL-based framework was developed in the context of segmenting 3D PET images on a per-slice basis. The framework, when input a PET image slice containing a tumor, localized and segmented the tumor (Fig. 1). At its core, the framework had a modified U-net (mU-net) (a U-net with the modifications detailed in Section 2.1.2) (Fig. 2). Training DL approaches requires large datasets with accurate ground-truth (Shen et al).
Figure 1: Illustration of the generation of simulated 18F-fluorodeoxyglucose (FDG)-PET images (a), workflow for the 5-fold cross-validation process used to optimize and train the modified U-net (mU-net) with simulated data (b), fine-tuning of the mU-net with patient data (c), overview of the proposed modular framework (d), and evaluation of the proposed framework (e).

Using kernel density estimation (KDE) and a physics-based approach, the first module generates realistic 2D tumors using tumor descriptors obtained from clinical data to capture the observed variability in actual populations. Further, the module uses PET physics with the intent of accounting for partial volume effects in the segmentation process. The second module trains and optimizes the mU-net using these simulated images such that the mU-net learns the tumor-segmentation task for low-resolution images. The third module fine-tunes the mU-net with patient data to learn tumor features missed in simulated tumors. The modules are further described below.

A novel KDE- and physics-based approach was developed to simulate FDG-PET image slices with realistic tumors (Fig. 1a). Since segmentation performance is especially influenced by tumor shape and size, count statistics, and tumor-to-background ratio, it was important to accurately account for these parameters when simulating the tumors. For this purpose, we obtained the distributions of these parameters from clinical data and used these distributions to simulate the realistic tumors. Tumor descriptors were extracted from clinical FDG-PET images, including first- and second-order statistics for the intensity, size, shape, intra-tumor heterogeneity, and tumor-to-background intensity ratio.
Table 1: Technical acquisition and reconstruction parameters of the PET/CT systems.

Parameter | Discovery LS | Discovery RX | Discovery HR | Discovery ST | Discovery STE
PET transaxial FOV (mm) | 550 | 700 | 700 | 700 | 700
PET axial FOV (mm) | 153 | 153 | 157 | 157 | 153
Reconstruction method | OSEM | OSEM | OSEM | OSEM | OSEM
Subsets | 28 | 21 | 21 | 21 | 20
Iterations | 2 | 2 | 2 | 2 | 2
Randoms correction method | RTSUB | SING | DLYD | DLYD | DLYD
Attenuation correction method | CT | CT | None | CT | CT
Scatter correction method | Convolution subtraction | Convolution subtraction | None | Convolution subtraction | Convolution subtraction
Energy window (keV) | 300-650 | 425-650 | 425-650 | 375-650 | 375-650
Voxel size (mm³) | 3.91 × … | … | … | … | …

CT: computed tomography, DLYD: delayed event subtraction method, FOV: field of view, OSEM: ordered subset expectation-maximization, PET: positron emission tomography, RTSUB: real-time delayed event subtraction method, SING: singles-based correction.
Figure 2: Illustration of the mU-net architecture present in the second module of the proposed framework. ReLU: rectified linear unit.

Figure 3: Examples of segmented tumors produced by the proposed framework in simulated 18F-fluorodeoxyglucose (FDG)-PET images where the ground-truth tumor boundaries were known (a-f). Examples of segmented tumors produced by the proposed framework on patient FDG-PET images where manual segmentations were used as ground-truth (g-l). A full representative slice and several adjacent slices, in a zoomed view, are shown per example (g-l).

The background intensities were obtained from non-tumor pixels present within a circular ROI around the tumor. Tumor shape was quantified using five-harmonic elliptical Fourier shape descriptors (Kuhl and Giardina 1982). Tumor size was quantified using diameter and volume. Each metric was modeled with a kernel distribution. The kernel distributions of each tumor descriptor were sampled to generate simulated tumors. Intra-tumor heterogeneity was simulated by incorporating unimodal variability in intensity values within the tumor and, for some tumors, by modeling the intensity distribution as a mixture model. For example, tumor cores assigned a lower intensity than the corresponding rim modeled a necrotic tumor. Examples of simulated tumors are shown in Figs. 3a-f. For the simulated tumors, the ground-truth tumor boundaries were known. Since the ground-truth for the image background need not be known, we used multiple existing patient images from the training set as templates to provide a realistic tumor background and account for inter-patient variability. For each simulated tumor slice, an FDG-PET patient image slice containing lung but not tumor was selected as background.
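To make the descriptor-sampling step concrete, the sketch below fits a Gaussian kernel density to each measured descriptor and draws samples from it. This is a minimal sketch with placeholder values; the variable names and the use of scipy's gaussian_kde are assumptions, not the authors' implementation.

```python
# Minimal sketch of KDE-based sampling of tumor descriptors.
# Placeholder values and names; not the authors' implementation.
import numpy as np
from scipy.stats import gaussian_kde

# Descriptors measured from clinical FDG-PET tumors (placeholder data)
measured = {
    'diameter_mm':         np.array([18.0, 25.5, 31.2, 22.8, 40.1, 27.4]),
    'mean_intensity':      np.array([5.2, 7.9, 6.4, 9.1, 4.8, 8.3]),
    'tumor_to_background': np.array([3.1, 4.6, 2.8, 5.2, 3.9, 4.1]),
}

# Fit one Gaussian kernel density per descriptor
kdes = {name: gaussian_kde(values) for name, values in measured.items()}

def sample_descriptors():
    """Draw one simulated tumor's descriptors from the fitted densities."""
    return {name: float(kde.resample(1)[0, 0]) for name, kde in kdes.items()}

print(sample_descriptors())  # e.g. {'diameter_mm': 23.7, ...}
```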
Candidate tumor locations in the patient background image slices were first manually selected such that the simulated tumors would appear at visually realistic locations within the lung. The simulated tumors were then randomly placed at these manually selected seed locations within the lung region of the patient background slices. The tumor orientation was determined by applying a random rigid rotation to the tumor.
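A sketch of this placement step is shown below, under the assumption that the rotated tumor patch is simply added to the background at the seed location; the function and variable names are hypothetical.

```python
# Sketch of random-rotation tumor placement (hypothetical names).
import numpy as np
from scipy.ndimage import rotate

def place_tumor(background, tumor_patch, seed_row, seed_col, rng):
    """Apply a random rigid rotation to the tumor patch and add it to the
    background slice, centered at the manually selected seed location."""
    angle = rng.uniform(0.0, 360.0)                        # random orientation
    patch = rotate(tumor_patch, angle, reshape=True, order=1)
    out = background.copy()
    r0 = seed_row - patch.shape[0] // 2
    c0 = seed_col - patch.shape[1] // 2
    out[r0:r0 + patch.shape[0], c0:c0 + patch.shape[1]] += patch
    return out

rng = np.random.default_rng(0)
slice_with_tumor = place_tumor(np.zeros((128, 128)), np.ones((9, 9)), 64, 64, rng)
```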
Table 2: Architecture of the modified U-net.
Layer | Type | Filter Size | Stride | Filters | Input Size | Output Size
Layer 1 | Conv. | 3 × 3 | 1 × 1 | 16 | 128 × 128 × 1 | 128 × 128 × 16
Layer 2 | Conv. | 3 × 3 | 1 × 1 | 16 | 128 × 128 × 16 | 128 × 128 × 16
Layer 3 | Max-pool | 2 × 2 | 2 × 2 | - | 128 × 128 × 16 | 64 × 64 × 16
Layer 4 | Conv. | 3 × 3 | 1 × 1 | 32 | 64 × 64 × 16 | 64 × 64 × 32
Layer 5 | Conv. | 3 × 3 | 1 × 1 | 32 | 64 × 64 × 32 | 64 × 64 × 32
Layer 6 | Max-pool | 2 × 2 | 2 × 2 | - | 64 × 64 × 32 | 32 × 32 × 32
Layer 7 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 32 | 32 × 32 × 64
Layer 8 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 64 | 32 × 32 × 64
Layer 9 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 64 | 32 × 32 × 64
Layer 10 | Max-pool | 2 × 2 | 2 × 2 | - | 32 × 32 × 64 | 16 × 16 × 64
Layer 11 | Transposed Conv. | 2 × 2 | 2 × 2 | 64 | 16 × 16 × 64 | 32 × 32 × 64
Layer 11(a) | Skip Connection (add output of Layer 9) | - | - | - | 32 × 32 × 64 | 32 × 32 × 64
Layer 12 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 64 | 32 × 32 × 64
Layer 13 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 64 | 32 × 32 × 64
Layer 14 | Conv. | 3 × 3 | 1 × 1 | 64 | 32 × 32 × 64 | 32 × 32 × 64
Layer 15 | Transposed Conv. | 2 × 2 | 2 × 2 | 32 | 32 × 32 × 64 | 64 × 64 × 32
Layer 15(a) | Skip Connection (add output of Layer 5) | - | - | - | 64 × 64 × 32 | 64 × 64 × 32
Layer 16 | Conv. | 3 × 3 | 1 × 1 | 32 | 64 × 64 × 32 | 64 × 64 × 32
Layer 17 | Conv. | 3 × 3 | 1 × 1 | 32 | 64 × 64 × 32 | 64 × 64 × 32
Layer 18 | Transposed Conv. | 2 × 2 | 2 × 2 | 16 | 64 × 64 × 32 | 128 × 128 × 16
Layer 18(a) | Skip Connection (add output of Layer 2) | - | - | - | 128 × 128 × 16 | 128 × 128 × 16
Layer 19 | Conv. | 3 × 3 | 1 × 1 | 16 | 128 × 128 × 16 | 128 × 128 × 16
Layer 20 | Conv. | 3 × 3 | 1 × 1 | 2 | 128 × 128 × 16 | 128 × 128 × 2
Layer 21 | Softmax | - | - | - | 128 × 128 × 2 | 128 × 128 × 2
Similar to Ma et al (Ma et al), the simulated tumors placed in patient backgrounds were passed through a simulated PET system modeling the major image-degrading processes in PET, such as detector blurring with a 5 mm full-width-at-half-maximum (FWHM) Gaussian blur and noise at clinical count levels. These data were added in projection space to incorporate the impact of image reconstruction on the tumor appearance and noise texture. The projection data were reconstructed using the 2D ordered subset expectation-maximization (OSEM) algorithm (Hudson and Larkin 1994) with 16 subsets and 3 iterations to yield a large number of simulated images for different phantoms. These reconstruction parameters yielded the most visually realistic images. Another reason for this choice was that 16 subsets with 3 iterations is roughly equivalent to 48 maximum-likelihood expectation-maximization (MLEM) iterations, which was approximately equivalent to the number of iterations used to generate the patient images (Table 1). Realism of the generated images was evaluated visually by a board-certified radiologist. The realism of images generated by such an approach has also been evaluated in previous studies (Ma et al).
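For concreteness, a minimal sketch of such a simulation chain is given below, using scikit-image's radon transform as the projector and a plain MLEM loop in place of a full OSEM implementation (48 MLEM iterations approximating 16 subsets × 3 OSEM iterations, per the text). The count level, the blur width in pixels, and the API details (older scikit-image versions take `filter=`; newer ones `filter_name=`) are assumptions.

```python
# Minimal sketch of the PET simulation chain: forward projection, detector
# blur, Poisson noise at a chosen count level, and MLEM reconstruction.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from skimage.transform import radon, iradon

def simulate_pet_slice(activity, total_counts=2e5, fwhm_px=1.3, n_iter=48):
    # fwhm_px ~ 5 mm FWHM / ~3.91 mm pixels (assumed pixel size)
    theta = np.linspace(0.0, 180.0, 180, endpoint=False)
    sino = radon(activity, theta=theta, circle=True)          # forward project
    sino = gaussian_filter1d(sino, fwhm_px / 2.355, axis=0)   # detector blur
    sino *= total_counts / sino.sum()                         # set count level
    noisy = np.random.poisson(np.clip(sino, 0, None)).astype(float)

    # MLEM: x <- x * A^T(y / Ax) / A^T(1); iradon with filter=None ~ A^T
    backproject = lambda s: np.clip(
        iradon(s, theta=theta, filter=None, circle=True), 0, None)
    recon = np.ones_like(activity)
    sens = backproject(np.ones_like(noisy)) + 1e-8
    for _ in range(n_iter):
        proj = radon(recon, theta=theta, circle=True)
        recon *= backproject(noisy / (proj + 1e-8)) / sens
    return recon
```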
The core of the proposed framework was a modified U-net (mU-net) with an encoder-decoder architecture (Fig. 2). The encoder network learns spatially local features from the input data through a series of convolutional layers (Goodfellow et al 2016). Each convolutional layer learns feature maps from the previous layer by performing convolution of the input with a filter bank. After each convolutional layer in the network, a leaky rectified linear unit (ReLU) activation function is applied (Maas et al 2013). The ReLU has been shown to help alleviate the vanishing-gradient problem (Maas et al 2013). Max-pooling layers were applied following the convolutional layers in the encoder network to condense meaningful features (Goodfellow et al 2016). The decoder network up-sampled the output of the encoder network via transposed convolutional layers, which performed a learned up-sampling of feature maps by restoring the spatial resolution of the layers prior to pooling. The decoder output was then mapped to a tumor mask: it was fed into a softmax layer, which performed a pixel-wise tumor classification. The mU-net was trained by minimizing a class-weighted cross-entropy loss function quantifying the error between the predicted and true tumor masks via the Adam optimization algorithm (Pereira et al 2016, Kingma and Ba 2014). A detailed description of the network architecture is given in Table 2.
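To make the architecture concrete, below is a minimal Keras sketch consistent with Table 2 (Keras being the library listed in Table 4). The function names and the α value are placeholders, and the compile step omits the class weighting described in the text; this is an illustrative sketch, not the authors' released code.

```python
# Minimal Keras sketch of the mU-net in Table 2 (placeholder names/values).
from keras.layers import (Input, Conv2D, Conv2DTranspose, MaxPooling2D,
                          LeakyReLU, add)
from keras.models import Model
from keras.optimizers import Adam

def conv_block(x, n_filters, alpha=0.03):
    """3x3 convolution, stride 1, followed by a leaky ReLU (alpha assumed)."""
    x = Conv2D(n_filters, (3, 3), strides=(1, 1), padding='same')(x)
    return LeakyReLU(alpha=alpha)(x)

def build_mu_net(input_shape=(128, 128, 1)):
    inp = Input(shape=input_shape)
    # Encoder (Layers 1-10)
    c1 = conv_block(conv_block(inp, 16), 16)                 # 128x128x16
    p1 = MaxPooling2D((2, 2), strides=(2, 2))(c1)            # 64x64x16
    c2 = conv_block(conv_block(p1, 32), 32)                  # 64x64x32
    p2 = MaxPooling2D((2, 2), strides=(2, 2))(c2)            # 32x32x32
    c3 = conv_block(conv_block(conv_block(p2, 64), 64), 64)  # 32x32x64
    p3 = MaxPooling2D((2, 2), strides=(2, 2))(c3)            # 16x16x64
    # Decoder (Layers 11-20) with additive skip connections
    u1 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same')(p3)
    u1 = add([u1, c3])                                       # Layer 11(a)
    d1 = conv_block(conv_block(conv_block(u1, 64), 64), 64)
    u2 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same')(d1)
    u2 = add([u2, c2])                                       # Layer 15(a)
    d2 = conv_block(conv_block(u2, 32), 32)
    u3 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same')(d2)
    u3 = add([u3, c1])                                       # Layer 18(a)
    d3 = conv_block(u3, 16)                                  # Layer 19
    out = Conv2D(2, (3, 3), padding='same',
                 activation='softmax')(d3)                   # Layers 20-21
    return Model(inp, out)

model = build_mu_net()
model.compile(optimizer=Adam(), loss='categorical_crossentropy')
```

Note that the skip connections here are additive (add) rather than the concatenation used in the original U-net, following the "add output of Layer N" entries in Table 2.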
There were some differences between the implementation of the mU-net and that of the original U-net (Ronneberger et al 2015). For example, the lowest resolution in the original U-net was 32 × 32 pixels, whereas the lowest resolution in our model was 16 × 16 pixels. The mU-net automatically extracted important local contextual and global localization features in the encoder and decoder paths, respectively (Ronneberger et al 2015). Hyperparameters optimized during 5-fold cross-validation (Fig. 1b) included the α parameter used for the leaky ReLU activation function, the dropout probability, the initialization value of the bias term for all weights in the network, and the class-weighting on the cross-entropy loss function (Table 3). It was found during the cross-validation process that initializing the network weights by the Glorot initialization procedure helped to address the problem of vanishing or exploding backpropagated gradients (Glorot and Bengio 2010). Additionally, since there were relatively few tumor pixels compared to background pixels, class balancing on the cross-entropy loss function was used by weighting the loss more heavily for tumor pixels (Pereira et al 2016). The mU-net was then trained from scratch using the hyperparameters that performed best on the training set during the 5-fold cross-validation.

The objective of the third module was to fine-tune the mU-net with patient data to learn tumor features that may have been missed in simulated tumors. The pre-trained network was fine-tuned using a small-sized clinical dataset (Fig. 1c), where the weights of the pre-trained network were used to initialize training of the fine-tuned network on patient data. The approach was similar to the fine-tuning process used in certain transfer-learning-based approaches (Van Opbroek et al 2014).
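A minimal sketch of this fine-tuning step, reusing build_mu_net from the sketch above; the weight file name, learning rate, epoch count, and batch size are hypothetical, and the placeholder arrays stand in for patient slices and radiologist-derived masks.

```python
# Minimal fine-tuning sketch (hypothetical settings and file name).
import numpy as np
from keras.optimizers import Adam

model = build_mu_net()
# model.load_weights('mu_net_pretrained_on_simulated.h5')  # hypothetical path

# Re-compile with a small learning rate (assumed) before fine-tuning
model.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy')

# patient_slices: (N, 128, 128, 1); patient_masks: (N, 128, 128, 2) one-hot
# masks from radiologist delineations used as surrogate ground-truth
patient_slices = np.zeros((8, 128, 128, 1))          # placeholder data
patient_masks = np.zeros((8, 128, 128, 2))
patient_masks[..., 0] = 1.0                          # all-background placeholder
model.fit(patient_slices, patient_masks, epochs=20, batch_size=8)
```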
The framework was comprehensively evaluated via multiple experiments with independent training and test sets (Fig. 1e). The framework's accuracy in tumor segmentation and localization in image slices was quantified using the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), true positive fraction (TPF), true negative fraction (TNF), and Hausdorff distance (HD) (Foster et al). Statistical significance was assessed using a t-test, where a p-value < 0.01 was used to infer a statistically significant difference. The framework was quantitatively compared to semi-automated segmentation methods, including commonly used thresholding-based approaches in which thresholds of 30%, 40%, and 50% of SUVmax were considered (Foster et al, Sridhar et al), the snakes active-contour method (Kass et al 1988), and a Markov random fields-Gaussian mixture model (MRF-GMM) approach.
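For reference, the overlap and distance metrics can be computed on binary masks as in the sketch below. Computing HD on all mask pixel coordinates is one reasonable convention; the paper does not specify its exact implementation.

```python
# Minimal sketch of the evaluation metrics on boolean masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred, truth):
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def jaccard(pred, truth):
    inter = np.logical_and(pred, truth).sum()
    return inter / np.logical_or(pred, truth).sum()

def tpf_tnf(pred, truth):
    tpf = np.logical_and(pred, truth).sum() / truth.sum()
    tnf = np.logical_and(~pred, ~truth).sum() / (~truth).sum()
    return tpf, tnf

def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between mask pixel coordinates."""
    p, t = np.argwhere(pred), np.argwhere(truth)
    return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])
```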
The semi-automated techniques were provided the tumor location as user input through a seed pixel or ROI. In contrast, the proposed automated framework had to both localize and segment the tumor, a more challenging task. For the thresholding-based approaches, a procedure similar to that of Sridhar et al (Sridhar et al) was followed. The generalizability of DL-based approaches to data acquired from different scanners is highly important; otherwise, the DL approach would have to be retrained using data acquired from every scanner, making the approach impractical (Chang et al 2018).

Table 3: Hyperparameters of the proposed framework.

Hyperparameter | Value
Weight initialization | Xavier
Bias initialization | 0.03
Leaky ReLU α | -
For this purpose, the framework's performance was compared to that of an mU-net trained only on clinical data in two experiments. In the first experiment, the framework was pre-trained with simulated images based on 104 patient images acquired by the Discovery LS scanner. The size of the clinical training set, which consisted of patient images from the Discovery LS scanner, was varied over 1, 5, 25, 50, 75, and 104 patients. Both the framework and the mU-net trained only on clinical data were evaluated on a test set of 56 patient images from all other scanners. In the second experiment, the framework was pre-trained with simulated images based on the 56 patient images from all other scanners. The size of the clinical training set, which consisted of patient images from all other scanners, was varied over 1, 5, 20, 30, 40, and 56 patients. Testing was performed with the 104 patient images from the Discovery LS scanner. Performance was quantitatively compared on the tasks of segmentation and tumor localization in both experiments.
For this purpose, experiments were performed with the test set of 2,000 simulated image slices defined in Section 2.2.1. Simulated images were used since the ground-truth tumor masks for these images were known. These ground-truth masks were blurred by applying a rectangular filter that incorporated the resolution degradation due to the imaging system and reconstruction, yielding a tumor boundary affected by PVEs. These PVE-affected tumor boundaries and the tumor boundaries estimated by the proposed framework were compared to the ground-truth on the basis of DSC and the ratio between the measured and the true tumor areas in the slices (referred to as %area) (De Bernardi et al).

The network architecture and training were implemented in Python 3.4.5, TensorFlow 1.6.0, and Keras 2.1.5. Experiments were run on an NVIDIA Tesla K40 GPU under Linux CentOS 5.10. A detailed list of the hardware and software used to implement the network architecture is given in Table 4.
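A sketch of how such a PVE-affected boundary could be derived from a ground-truth mask is shown below; the kernel size and the 0.5 re-threshold are assumptions.

```python
# Sketch: blur a binary ground-truth mask with a rectangular (uniform) filter
# approximating resolution loss, then re-threshold to get a PVE-affected mask.
import numpy as np
from scipy.ndimage import uniform_filter

def pve_affected_mask(truth_mask, kernel_px=3, threshold=0.5):
    blurred = uniform_filter(truth_mask.astype(float), size=kernel_px)
    return blurred >= threshold

def percent_area(measured_mask, truth_mask):
    """Ratio of measured to true tumor area, in percent (%area)."""
    return 100.0 * measured_mask.sum() / truth_mask.sum()
```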
The proposed framework quantitatively outperformed all other semi-automated methods on the basis of DSC, JSC, and HD (p-values < 0.001) for both simulated and patient images (Figs. 4a-b and Table 5). The framework yielded a DSC of 0.87 (95% CI: 0.86, 0.88) and 0.73 (95% CI: 0.71, 0.76) on simulated and patient images, respectively, indicating reliable segmentation performance (Zijdenbos et al).

The proposed framework provided reliable segmentation and yielded a DSC of 0.74 (95% CI: 0.71, 0.76) and 0.71 (95% CI: 0.69, 0.73) on the test sets of the first and second generalization experiments described in Section 2.2.2, respectively (Figs. 4c-d and Table 6), outperforming the other semi-automated techniques on the basis of DSC, JSC, and HD (p-value < 0.001). When TL was correct, the framework yielded a DSC of 0.85 (95% CI: 0.84, 0.86) and 0.82 (95% CI: 0.81, 0.84) on the test sets of the first and second generalization experiments, respectively. This provided evidence that the framework generalized across different PET scanners.

Table 4: Hardware and software platform.

Graphics Processing Unit (GPU) model | NVIDIA Tesla K40
Operating system | Linux CentOS 5.10
Deep Learning (DL) platform:
Programming language | Python 3.4.5
Open-source DL libraries | TensorFlow 1.6.0, Keras 2.1.5

The proposed framework outperformed the mU-net trained only on clinical data on the basis of DSC and localization accuracy (Figs. 5a-d) for all training-set sizes (p-values < 0.001). Representative examples in Figs. 5e-f demonstrate this further. When trained on patient images acquired by the Discovery LS scanner, the proposed framework yielded a DSC of 0.68 (95% CI: 0.66, 0.71) even when trained on 25 patient images (Fig. 5a) and a localization accuracy of 74% (95% CI: 71%, 77%) when trained with just one patient image (Fig. 5b). When TL was correct, the proposed framework yielded a DSC of 0.79.

Table 5: Comparing the proposed framework and other methods on simulated images and patient images using the procedure in Section 2.2.1.
Simulated Images

Segm. Methods | mU-net | MRF-GMM | Snakes | 30% SUVmax | 40% SUVmax | 50% SUVmax
DSC | 0.87 (0.86, 0.88) | 0.61 (0.60, 0.63) | 0.58 (0.57, 0.60) | 0.58 (0.56, 0.59) | 0.63 (0.61, 0.64) | 0.57 (0.56, 0.59)
DSC with correct TL | 0.91 (0.91, 0.92) | 0.61 (0.60, 0.63) | 0.58 (0.57, 0.60) | 0.58 (0.56, 0.59) | 0.63 (0.61, 0.64) | 0.57 (0.56, 0.59)
JSC | 0.81 (0.80, 0.82) | 0.50 (0.48, 0.51) | 0.47 (0.45, 0.48) | 0.48 (0.47, 0.50) | 0.55 (0.54, 0.57) | 0.49 (0.47, 0.50)
TPF | 0.90 (0.89, 0.91) | 0.86 (0.86, 0.87) | 0.88 (0.88, 0.89) | 0.80 (0.79, 0.82) | 0.68 (0.66, 0.70) | 0.53 (0.51, 0.55)
TNF | 1.00 (1.00, 1.00) | 0.99 (0.98, 0.99) | 0.99 (0.99, 0.99) | 0.99 (0.99, 0.99) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00)
HD | 1.45 (1.39, 1.51) | 6.05 (5.75, 6.34) | 11.61 (11.29, 11.93) | 7.76 (7.44, 8.07) | 6.52 (6.17, 6.86) | 6.78 (6.43, 7.14)

Patient Images

Segm. Methods | mU-net | MRF-GMM | Snakes | 30% SUVmax | 40% SUVmax | 50% SUVmax
DSC | 0.73 (0.71, 0.76) | 0.68 (0.66, 0.70) | 0.67 (0.65, 0.68) | 0.66 (0.64, 0.68) | 0.60 (0.58, 0.62) | 0.50 (0.48, 0.52)
DSC with correct TL | 0.84 (0.83, 0.85) | 0.68 (0.66, 0.70) | 0.67 (0.65, 0.68) | 0.66 (0.64, 0.68) | 0.60 (0.58, 0.62) | 0.50 (0.48, 0.52)
JSC | 0.65 (0.63, 0.68) | 0.55 (0.53, 0.57) | 0.52 (0.51, 0.53) | 0.53 (0.51, 0.55) | 0.46 (0.44, 0.47) | 0.36 (0.35, 0.38)
TPF | 0.76 (0.73, 0.79) | 0.86 (0.84, 0.88) | 0.68 (0.67, 0.70) | 0.63 (0.61, 0.65) | 0.50 (0.48, 0.52) | 0.38 (0.36, 0.39)
TNF | 1.00 (1.00, 1.00) | 0.99 (0.98, 0.99) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00)
HD | 3.25 (2.92, 3.58) | 5.40 (4.88, 5.92) | 7.14 (6.62, 7.67) | 5.50 (4.99, 6.01) | 5.63 (5.12, 6.14) | 6.11 (5.60, 6.63)
Note – Data in parentheses are 95% confidence intervals. DSC: Dice similarity coefficient, HD: Hausdorff distance, JSC: Jaccard similarity coefficient, MRF-GMM: Markov random fields-Gaussian mixture model, mU-net: modified U-net, TL: tumor localization, TNF: true negative fraction, TPF: true positive fraction. Sample sizes were 1,916 and 486 for DSC with correct TL and HD of the mU-net for simulated and patient images, respectively. Sample sizes were 2,000 and 557 for all other metrics and segmentation methods for simulated and patient images, respectively.

Figure 4: Comparing the proposed framework and other semi-automated techniques for simulated images (a) and patient images (b) using the procedure in Section 2.2.1. Evaluating the generalizability of the proposed framework across multiple PET scanners when trained on images acquired by a single PET scanner (c) and across a single PET scanner when trained on images acquired by multiple PET scanners (d) using the procedure in Section 2.2.2. Sample sizes were 2,000 (a), 557 (b), 662 (c), and 1,001 (d). Error bars represent 95% confidence intervals. MRF-GMM: Markov random fields-Gaussian mixture model.

The network correctly localized the tumor in 1,916 of the 2,000 images (95.8%) from the test set of simulated images. Only these correctly localized cases were considered here to study sensitivity to PVEs. Representative examples comparing the tumor boundaries predicted by the proposed framework to the PVE-affected tumor boundaries are shown in Figs. 6a-d. The proposed framework yielded a DSC of 0.91 (95% CI: 0.91, 0.92), while the PVE-affected tumor boundaries yielded a DSC of 0.75 (95% CI: 0.74, 0.75) for simulated images. The proposed framework outperformed (p-value < 0.001) the PVE-affected tumor boundaries on the basis of DSC and %area (Figs. 6e-f) and yielded reliable segmentation and accurate tumor-area prediction for all tumor sizes (Fig. 6).

We proposed a modular automated DL-based framework for tumor segmentation in PET images. The framework accurately localized and delineated primary tumors on FDG-PET images of patients with lung cancer using a small-sized clinical training dataset. The framework generalized across several PET scanners, indicating that the features learned by the framework were invariant to scanner differences. These attributes provide evidence that the framework is robust and could be deployed across institutions and centers with different PET scanners. Further, the proposed framework outperformed other semi-automated methods for both simulated and patient images. Visually, the proposed framework successfully localized the primary lung tumor even in cases where multiple high-uptake regions were present within the same image (e.g. heart, mediastinum, lymph nodes, or secondary metastatic deposits) (Figs. 3 and 5e). Concurrently, there were a few cases where the DL approach could not localize the tumor correctly (Fig. 5g). However, the mU-net trained only on clinical images also failed in those cases (Fig. 5h). The localization accuracy of the proposed framework was generally higher than 80%, and up to 91% when data from 104 patients were used for training (Fig. 5b). To address cases of inaccuracy, one solution would be to display the segmented-tumor output to a radiologist for approval or refinement and integrate this feedback with a reinforcement-learning approach, similar to that in (Wang et al).
Table 6: Comparing the proposed framework and other methods on patient images using the procedure in Section 2.2.2.

Generalizability Experiment 1

Segm. Methods | mU-net | MRF-GMM | Snakes | 30% SUVmax | 40% SUVmax | 50% SUVmax
DSC | 0.74 (0.71, 0.76) | 0.68 (0.67, 0.70) | 0.67 (0.66, 0.68) | 0.67 (0.65, 0.69) | 0.62 (0.61, 0.64) | 0.54 (0.52, 0.55)
DSC with correct TL | 0.85 (0.84, 0.86) | 0.68 (0.67, 0.70) | 0.67 (0.66, 0.68) | 0.67 (0.65, 0.69) | 0.62 (0.61, 0.64) | 0.54 (0.52, 0.55)
JSC | 0.66 (0.64, 0.68) | 0.55 (0.53, 0.57) | 0.53 (0.51, 0.54) | 0.55 (0.53, 0.57) | 0.48 (0.47, 0.50) | 0.39 (0.38, 0.40)
TPF | 0.76 (0.74, 0.79) | 0.83 (0.81, 0.84) | 0.71 (0.70, 0.72) | 0.71 (0.69, 0.73) | 0.56 (0.55, 0.58) | 0.42 (0.41, 0.44)
TNF | 1.00 (1.00, 1.00) | 1.00 (0.99, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00)
HD | 2.50 (2.30, 2.71) | 4.17 (4.00, 4.44) | 5.85 (5.45, 6.25) | 4.71 (4.32, 5.10) | 4.64 (4.26, 5.02) | 5.04 (4.67, 5.42)

Generalizability Experiment 2

Segm. Methods | mU-net | MRF-GMM | Snakes | 30% SUVmax | 40% SUVmax | 50% SUVmax
DSC | 0.71 (0.69, 0.73) | 0.66 (0.65, 0.67) | 0.65 (0.64, 0.66) | 0.63 (0.61, 0.64) | 0.56 (0.55, 0.58) | 0.47 (0.46, 0.48)
DSC with correct TL | 0.82 (0.81, 0.84) | 0.66 (0.65, 0.67) | 0.65 (0.64, 0.66) | 0.63 (0.61, 0.64) | 0.56 (0.55, 0.58) | 0.47 (0.46, 0.48)
JSC | 0.63 (0.61, 0.65) | 0.53 (0.52, 0.55) | 0.51 (0.50, 0.52) | 0.50 (0.48, 0.51) | 0.42 (0.41, 0.43) | 0.34 (0.32, 0.35)
TPF | 0.72 (0.70, 0.74) | 0.83 (0.82, 0.85) | 0.69 (0.68, 0.70) | 0.61 (0.60, 0.63) | 0.48 (0.46, 0.49) | 0.36 (0.34, 0.37)
TNF | 1.00 (1.00, 1.00) | 0.99 (0.98, 0.99) | 0.99 (0.99, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00) | 1.00 (1.00, 1.00)
HD | 3.38 (3.11, 3.66) | 5.79 (5.40, 6.19) | 7.44 (7.03, 7.85) | 5.93 (5.56, 6.30) | 6.04 (5.69, 6.39) | 6.56 (6.20, 6.91)
Note – Data in parentheses are 95% confidence intervals. DSC: Dice similarity coefficient, HD: Hausdorff distance, JSC: Jaccard similarity coefficient, MRF-GMM: Markov random fields-Gaussian mixture model, mU-net: modified U-net, TL: tumor localization, TNF: true negative fraction, TPF: true positive fraction. Sample sizes were 574 and 867 for DSC with correct TL and HD of the mU-net for experiments 1 and 2, respectively. Sample sizes were 662 and 1,001 for all other metrics and segmentation methods for experiments 1 and 2, respectively.

Experiments with simulated data demonstrated that the proposed framework was relatively insensitive to PVEs (Fig. 6). The proposed framework successfully segmented relatively small tumors in patient images, despite the presence of PVEs. These results suggest the applicability of the proposed framework in modalities with limited resolution and a lack of ground-truth, such as SPECT and optical imaging.
Figure 5: Comparing the proposed framework to the modified U-net (mU-net) trained only on clinical images in terms of Dice similarity coefficient and tumor-localization accuracy for the various training-set sizes, with 95% confidence intervals (a-d), using the procedure in Section 2.2.3. Representative examples of tumors segmented by the proposed framework (e) and by the mU-net trained on clinical images only (f). The examples in (e) and (f) refer to the same image slices. Cases where the proposed framework (g) and the mU-net trained on only clinical images (h) failed to localize the tumor are also shown; again, the examples in (g) and (h) refer to the same image slices.

The proposed framework uses a new stochastic KDE-based and physics-based approach to generate realistic simulated images that address the limited availability of clinical training data. This approach allows the generated data to account for patient-population variability and simultaneously account for the imaging physics. Other data-augmentation strategies include transforming the tumor (e.g. translation, rotation, scaling) or changing tumor intensity in the patient images (Litjens et al).
Figure 6: Representative examples comparing the tumor boundaries predicted by the proposed framework to the tumor boundaries affected by partial volume effects (PVEs) in simulated images (a-d). Evaluating the ability of the proposed framework to compensate for PVEs on the basis of Dice similarity coefficient (e) and %area (f) using the procedure in Section 2.2.4. A sample size of 1,916 simulated images from the test set, in which the tumor was correctly localized, was used (e). The %area was plotted as a function of true tumor area (cm²) to demonstrate the effect of PVEs for different tumor sizes, where the tumor sizes were binned with a bin width of 5 cm². The sample sizes for each bin, in order of increasing tumor area, were 405, 457, 284, 203, 162, 111, 74, 45, 41, 16, 33, 18, 11, 11, 13, 10, 15, and 7, respectively. Error bars represent 95% confidence intervals.

Another strategy would be the use of generative adversarial networks (GANs) (Gong et al).
Our framework segmented 3D images on a per-slice basis for several reasons. Using all tumor-containing 2D slices as training examples increased the amount of training data. Further, 2D tumor delineation is less computationally expensive. Finally, from a usage perspective, the framework can segment 3D images on a per-slice basis when provided transaxial image slices in which the primary tumor is present, as in this study. However, fully 3D segmentation could allow the network to learn important 3D information to segment contiguous tumor volumes. Extending the proposed framework to direct 3D segmentation is an important research area. Another limitation of our study lies in the simulation framework. The framework generated realistic simulated images in which synthetic tumors were generated and randomly placed in the lung region of a patient background slice containing no tumor. While this approach yields visually realistic simulated PET images with lung tumors, it does not capture the biological and clinical information about the tumor location in the clinical dataset. Extending the proposed method to incorporate this information in simulations is an important area of research.
Also, the proposed framework only locates and segments a single tumor per image slice. This limitation is an outcome of the patient-selection criteria, under which patients with a second primary malignancy were excluded. Extending this method to find multiple tumors per patient image is another important research area.
A physics-guided modular DL framework for automated tumor segmentation was developed. It provided accurate delineation of primary lung tumors in FDG-PET images with a small-sized clinical training dataset, generalized across different scanners, and demonstrated the ability to segment small tumors. Open-source code for the proposed framework is available here: https://drive.google.com/drive/folders/1H483HB-byS6UiPlf3ip1i52i0EH7R4lK?usp=sharing.
Acknowledgements
Financial support for this project was provided, in part, by the National Institutes of Health under grants P41-EB024495 and R21-EB024647. The researchers thank Dr. Martin Lodge and Martin Stumpf for helpful discussions.
Appendix

The data used in this study is a subset of previously reported data (Sheikhbahaei et al). Detailed patient characteristics are given in Table A1.

Table A1: Patient characteristics.

Characteristic | Percent
Age: <40 | 1% (2/160)
Age: 40-60 | 43% (69/160)
Age: >60 | 56% (89/160)
Sex: Male | 57% (91/160)
Sex: Female | 43% (69/160)
Race: White | 68% (109/160)
Race: Black | 21% (34/160)
Race: Other | 11% (17/160)
Histology: Non-small cell lung cancer | 82% (131/160)
Histology: Small cell lung cancer | 16% (26/160)
Histology: Mesothelioma | 1% (1/160)
Histology: Unknown | 1% (2/160)
Smoking history: Positive | 79% (126/160)
Smoking history: Negative | 14% (22/160)
Smoking history: Unknown | 8% (12/160)
Stage: I | 11% (18/160)
Stage: II | 9% (15/160)
Stage: III | 34% (54/160)
Stage: IV | 38% (61/160)
Stage: Unknown | 8% (12/160)
Treatment: Surgery | 3% (4/160)
Treatment: Chemotherapy | 26% (41/160)
Treatment: Radiation therapy | 6% (9/160)
Treatment: Surgery & chemoradiation | 10% (16/160)
Treatment: Chemoradiation | 48% (77/160)
Treatment: Surgery & chemotherapy | 8% (13/160)
Interval between treatment and PET study: 1-8 weeks | 65% (104/160)
Interval between treatment and PET study: 8-12 weeks | 11% (17/160)
Interval between treatment and PET study: 12-24 weeks | 22% (35/160)
Interval between treatment and PET study: 24-40 weeks | 3% (4/160)
Outcome: Alive | 33% (52/160)
References
Aristophanous M, Penney B C, Martel M K and Pelizzari C A 2007 A Gaussian mixture model for definition of lung tumor volumes in positron emission tomography Med. Phys.
Belhassen S and Zaidi H 2010 A novel fuzzy C-means algorithm for unsupervised heterogeneous tumor quantification in PET Med. Phys.
Brambilla M, Matheoud R, Secco C, Loi G, Krengli M and Inglese E 2008 Threshold segmentation for PET target volume delineation in radiation treatment planning: the role of target-to-background ratio and target size Med. Phys.
Chang K, Balachandar N, Lam C, Yi D, Brown J, Beers A, Rosen B, Rubin D L and Kalpathy-Cramer J 2018 Distributed deep learning networks among institutions for medical imaging J. Am. Med. Informatics Assoc.
Glorot X and Bengio Y 2010 Understanding the difficulty of training deep feedforward neural networks Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics pp 249-56
Gong K, Guan J, Liu C-C and Qi J 2018 PET image denoising using a deep neural network through fine tuning IEEE Trans. Radiat. Plasma Med. Sci.
Goodfellow I, Bengio Y and Courville A 2016 Deep Learning (MIT Press)
Hatt M, Le Rest C C, Turzo A, Roux C and Visvikis D 2009 A fuzzy locally adaptive Bayesian segmentation approach for volume determination in PET IEEE Trans. Med. Imaging
Hudson H M and Larkin R S 1994 Accelerated image reconstruction using ordered subsets of projection data IEEE Trans. Med. Imaging
Kass M, Witkin A and Terzopoulos D 1988 Snakes: active contour models Int. J. Comput. Vis.
Kingma D P and Ba J 2014 Adam: a method for stochastic optimization arXiv preprint arXiv:1412.6980
Kuhl F P and Giardina C R 1982 Elliptic Fourier features of a closed contour Comput. Graph. Image Process.
Maas A L, Hannun A Y and Ng A Y 2013 Rectifier nonlinearities improve neural network acoustic models Proc. ICML vol 30 p 3
Mena E, Sheikhbahaei S, Taghipour M, Jha A K, Vicente E, Xiao J and Subramaniam R M 2017a 18F-FDG PET/CT metabolic tumor volume and intra-tumoral heterogeneity in pancreatic adenocarcinomas: impact of dual-time-point and segmentation methods Clin. Nucl. Med. e16
Mena E, Taghipour M, Sheikhbahaei S, Jha A K, Rahmim A, Solnes L and Subramaniam R M 2017b Value of intratumoral metabolic heterogeneity and quantitative 18F-FDG PET/CT parameters to predict prognosis in patients with HPV-positive primary oropharyngeal squamous cell carcinoma Clin. Nucl. Med. e227-34
Van Opbroek A, Ikram M A, Vernooij M W and De Bruijne M 2014 Transfer learning improves supervised image segmentation across imaging protocols IEEE Trans. Med. Imaging
Pereira S, Pinto A, Alves V and Silva C A 2016 Brain tumor segmentation using convolutional neural networks in MRI images IEEE Trans. Med. Imaging
Ronneberger O, Fischer P and Brox T 2015 U-net: convolutional networks for biomedical image segmentation International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer) pp 234-41
Shah B, Srivastava N, Hirsch A E, Mercier G and Subramaniam R M 2012 Intra-reader reliability of FDG PET volumetric tumor parameters: effects of primary tumor size and segmentation methods Ann. Nucl. Med.
Werner-Wasik M, Nelson A D, Choi W, Arai Y, Faulhaber P F, Kang P, Almeida F D, Xiao Y, Ohri N and Brockway K D 2012 What is the best way to contour lung tumors on PET scans? Multiobserver validation of a gradient-based method using a NSCLC digital PET phantom Int. J. Radiat. Oncol. Biol. Phys.