Deep cross-modality (MR-CT) educed distillation learning for cone beam CT lung tumor segmentation
Jue Jiang, Sadegh Riyahi Alam, Ishita Chen, Perry Zhang, Andreas Rimner, Joseph O. Deasy, Harini Veeraraghavan
Department of Medical Physics and Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065. Correspondence: Box 84 - Medical Physics; email: [email protected]
Abstract

Purpose:
Despite the widespread availability of in-treatment room cone beam computed tomography (CBCT) imaging, CBCT is used only for gross setup corrections in lung radiotherapy because reliable segmentation methods are lacking. Accurate and reliable auto-segmentation tools could potentiate volumetric response assessment and geometry-guided adaptive radiation therapies. Therefore, we developed a new deep learning CBCT lung tumor segmentation method.
Methods:
The key idea of our approach, called cross modality educed distillation (CMEDL), is to use magnetic resonance imaging (MRI) to guide a CBCT segmentation network to extract more informative features during training. We accomplish this by training an end-to-end network comprised of unpaired domain adaptation (UDA) and cross-domain segmentation distillation networks (SDN) using unpaired CBCT and MRI datasets. The UDA approach uses CBCT and MRI that are not aligned and may arise from different sets of patients. The UDA network synthesizes pseudo MRI from CBCT images. The SDN consists of teacher MRI and student CBCT segmentation networks. Feature distillation regularizes the student network to extract CBCT features that match the statistical distribution of MRI features extracted by the teacher network and to obtain better differentiation of tumor from background. The UDA network was implemented with a cycleGAN improved with contextual losses. We evaluated both Unet and dense fully convolutional segmentation networks (DenseFCN). Performance comparisons were done against CBCT-only training using 2D and 3D networks. We also compared against an alternative framework that used UDA with an MR segmentation network, whereby segmentation was done on the synthesized pseudo MRI representation. All networks were trained with 216 weekly CBCTs and 82 T2-weighted turbo spin echo MRIs acquired from different patient cohorts. Validation was done on 20 weekly CBCTs from patients not used in training. Independent testing was done on 38 weekly CBCTs from patients not used in training or validation. Segmentation accuracy was measured using the surface Dice similarity coefficient (SDSC) and the Hausdorff distance at the 95th percentile (HD95).
Results:
The CMEDL approach significantly improved segmentation accuracy compared with CBCT-only networks (p < 0.001), achieving an HD95 of 21.70 mm.

Conclusions:
Our results demonstrate that the introduced CMEDL approach produces reasonably accurate lung cancer segmentation from CBCT images. Further validation on larger datasets is necessary for clinical translation.
Keywords:
CBCT segmentation, lung tumors, MR informed segmentation, distillation learning, adversarial deep learning

I. Introduction

Lung cancer is the leading cause of cancer-related deaths in both men and women in the United States. The standard treatment for inoperable or unresectable stage III locally advanced non-small cell lung cancer (LA-NSCLC) is definitive/curative radiotherapy to 60 Gy in 30 fractions with concomitant chemotherapy. Recently, dose escalation trials and high-dose adaptive radiotherapies have shown the feasibility of improving local control and survival in LA-NSCLC. However, a key technical challenge in delivering high-dose treatments, both at the time of planning and of delivery, is accurate and precise delineation of both target tumors and normal organs. Importantly, X-ray cone beam CT (CBCT) imaging is available as part of standard equipment. However, much of the in-treatment-room CBCT information cannot be used routinely beyond basic positioning corrections. In fact, a key obstacle to clinical adoption of adaptive radiotherapy for LA-NSCLC is the lack of reliable segmentation tools needed for geometric corrections of the target and possibly the critical organs at risk (OARs). Despite the widespread development of deep learning segmentation methods, to our best knowledge there are no reliable CBCT methods for routine lung cancer treatment. The latest published works were primarily focused on the pelvic region for prostate cancer radiotherapy.

The difficulty in generating accurate segmentations results from the lack of sufficient soft-tissue contrast on CBCT imaging, especially for centrally located cancers. Low soft-tissue contrast makes it inherently difficult to extract features that clearly differentiate the target from its background structures, even for a deep learning method. Prior works have used pseudo MRI (pMRI) produced from CBCT to generate more accurate pelvic organ segmentations than CBCT alone. The key idea is that pMRI, which mimics the statistical intensity characteristics of MRI, contains better soft-tissue contrast than CBCT, which helps accuracy. Our approach improves on this idea: MRI is used to regularize the extraction of more informative segmentation features even on the less informative CBCT modality. Unlike the approach in Fu et al., which required paired CT and MRI image sets, our approach uses unpaired CBCT and MRI images, which are easier to obtain and practically applicable without requiring specialized imaging protocols for algorithm development.

Our approach introduces unpaired cross-modality distillation learning. Distillation learning was introduced to compress the knowledge contained in an information-rich, high-capacity network (trained with a large training dataset) into a small, low-capacity network, using paired images. Model compression is meaningful when a high-capacity model is not required for a task, or when the computational cost of such a large classifier necessitates a simpler and computationally fast model for real-time analysis. The original approach used same-modality images and was applied to different image-based classification tasks. Distillation itself was done by using the probabilistic "softmax" outputs of the teacher as the target output for the student (compressed) network. Improvements to this approach included hint learning for image classification, where features from intermediate layers of the student network are constrained to mimic the features of the teacher network.
Recent works in computer vision extended this approach using low- and high-resolution images, as well as for different-modality distillation between paired RGB (red, green, blue) and depth images for semantic segmentation. Our work extends this approach to unpaired distillation learning using explicit hints between MRI and CBCT images. We also modify how the knowledge distillation is employed: instead of transferring knowledge from a very deep network into a smaller network, the teacher and student networks are identical except for the imaging modalities used to train them. The teacher network is trained with the more informative MRI, while the student CBCT network is regularized to extract similar features for inference as the teacher network. As a result, only the CBCT segmentation network is required at testing. Methods that require pMRI as an additional input need both the cross-modality image-to-image (I2I) translation network and the segmentation network at testing time.

This work builds on our prior work that used unpaired MRI and contrast enhanced CT (CECT) datasets to improve CECT lung tumor segmentation. Our approach, called cross-modality educed distillation learning (or CMEDL), extends this idea to the more challenging CBCT images. We tested the hypothesis that MRI information extracted using unpaired CBCT and MRI can regularize features computed by the CBCT network and improve performance over CBCT-only segmentation. End-to-end network training also benefits from these losses to improve I2I translation.

Our contributions include: (i) a new unpaired cross-modality educed distillation-based segmentation framework for regularizing inference on a less informative modality by using a more informative imaging modality, (ii) application of this framework to the challenging CBCT lung tumor segmentation problem, and (iii) implementation of our framework using two different segmentation networks, with performance comparisons against other related approaches.

II. Materials and Methods

II.A. Patient and image characteristics

A total of 274 weekly CBCT scans from 69 unique patients diagnosed with LA-NSCLC and treated with conventionally fractionated radiotherapy, sourced from 49 internal and 20 external institution patients, were analyzed. The internal scans had segmentations on weekly CBCTs with a maximum of 7 per patient. Only week 1 CBCTs from the external dataset were analyzed. Two of the 49 internal patients had seven weekly CBCTs; 18 had six weekly CBCTs; 18 had five weekly CBCTs; 9 had four weekly CBCTs; and the remaining 2 had three weekly CBCTs segmented.

The internal CBCT scans were acquired for routine monitoring of geometric changes of the tumor in response to radiotherapy. The external institution 4D CBCT scans were originally collected for investigating breathing patterns of LA-NSCLC patients undergoing chemoradiotherapy. Image resolution for the weekly 4D CBCTs was 0.98 to 1.17 mm in-plane spacing with 3 mm slice thickness.

Each CBCT image was standardized and normalized using its global mean and standard deviation, and then registered to the planning CT scan using a multi-resolution B-spline regularized diffeomorphic image registration. All contours were reviewed by a radiation oncologist, modified when necessary, and served as the expert delineation. B-spline registration was performed using a mesh size of 32 mm at the coarsest level, reduced by a factor of two at each sequential level. The optimization step was set to 0.2 with the number of iterations set to (100, 70, 30) at each level. Additional details of this registration for these datasets are described in prior work.
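To make the registration settings concrete, the following is a minimal sketch of a multi-resolution B-spline registration with roughly the parameters reported above, written with SimpleITK as an assumed toolkit; the study itself used a B-spline regularized diffeomorphic registration, so this is illustrative rather than the exact implementation.

```python
import SimpleITK as sitk

def register_cbct_to_planning_ct(fixed_ct, moving_cbct, mesh_spacing_mm=32.0):
    """Hypothetical multi-resolution B-spline registration sketch (SimpleITK),
    loosely following the settings reported in the text."""
    # Initialize a B-spline transform with control points spaced ~32 mm apart.
    mesh_size = [max(1, int(sz * sp / mesh_spacing_mm))
                 for sz, sp in zip(fixed_ct.GetSize(), fixed_ct.GetSpacing())]
    tx = sitk.BSplineTransformInitializer(fixed_ct, mesh_size)

    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetMetricSamplingStrategy(reg.RANDOM)
    reg.SetMetricSamplingPercentage(0.1)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetOptimizerAsGradientDescent(learningRate=0.2, numberOfIterations=100)
    reg.SetOptimizerScalesFromPhysicalShift()
    # Three-level multi-resolution pyramid (coarse to fine).
    reg.SetShrinkFactorsPerLevel([4, 2, 1])
    reg.SetSmoothingSigmasPerLevel([2, 1, 0])
    reg.SetInitialTransform(tx, inPlace=False)
    return reg.Execute(fixed_ct, moving_cbct)
```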
Eighty-one T2-weighted turbo spin echo (TSE) MRIs for cross-modality learning were obtained from 28 stage II-III LA-NSCLC patients scanned weekly on a 3T Philips Ingenia scanner. Eleven of these 28 patients had weekly MRI scans spanning 6 to 7 weeks. Seven of these 11 patients overlapped with the internal MSK CBCT cohort; however, the MRI and CBCT images were neither co-registered nor treated as paired image sets for training. The MRI scans were acquired with a 16-element phased array anterior coil and a 44-element posterior coil (TE/TR = 120/3000-6000 ms, slice thickness of 2.5 mm, in-plane pixel size of 1.1 × 1.1 mm, flip angle of 90°, number of averages = 2, and field of view of 300 mm).

II.B. Approach

An overview of our cross-modality educed distillation (CMEDL) approach is shown in Fig. 1. The end-to-end trained network consists of an unpaired cross-domain adaptation (UDA) network, composed of a generative adversarial network (GAN), and a segmentation distillation network (SDN), which includes a teacher MRI and a student CBCT segmentation network. The teacher network is trained with expert-segmented MRI and pMRI images. The CBCT network is trained with expert-segmented CBCT images. Feature distillation is performed by hint learning, whereby the feature activations of the CBCT network in specific layers (last and penultimate) are forced to mimic the corresponding layer feature activations of the teacher network extracted from the corresponding synthesized pMRI images.

II.C. Notations

The network is trained using a set of expert-segmented CBCT $\{x_c, y_c\} \in \{X_C, Y_C\}$ and MRI $\{x_m, y_m\} \in \{X_M, Y_M\}$ datasets. The CBCT and MRI do not have to arise from the same sets of patients and are not aligned for network training. The cross-modality adaptation network consists of generators $G_{C \rightarrow M}: x_c \mapsto x_m$ to produce pseudo MRI $x'_m$, $G_{M \rightarrow C}: x_m \mapsto x_c$ to produce pseudo CT images $x'_c$, and domain discriminators $D_M$ and $D_C$. The sub-networks $G_{M \rightarrow C}$ and $D_C$ are used to enforce cyclically consistent transformations when using unpaired CBCT and MRI datasets. Feature vectors produced through a mapping function $F(x): x \mapsto f(x)$ are indicated using italicized text $f_j$, where $j = 1, \ldots, K$, with $K = H \times W \times C$ for 2D and $K = H \times W \times Z \times C$ for 3D being the number of features for an image of height $H$, width $W$, depth $Z$, and channels $C$.

II.D. Stage I: Unpaired cross-domain adaptation for image-to-image translation

The UDA network is composed of a pair of GANs for producing pseudo MRI $x'_m$ and pseudo CT $x'_c$ images using generator networks $G_{C \rightarrow M}$ and $G_{M \rightarrow C}$, respectively. The images produced by these generators are constrained by global intensity discriminators $D_M$ and $D_C$ for MRI and CBCT images, respectively. The adversarial losses for these two networks are computed as:

$$
\begin{aligned}
L^{M}_{adv}(G_{C \rightarrow M}, D_M, X_M, X_C) &= E_{x_m \sim X_M}[\log(D_M(x_m))] + E_{x_c \sim X_C}[\log(1 - D_M(G_{C \rightarrow M}(x_c)))] \\
L^{C}_{adv}(G_{M \rightarrow C}, D_C, X_C, X_M) &= E_{x_c \sim X_C}[\log(D_C(x_c))] + E_{x_m \sim X_M}[\log(1 - D_C(G_{M \rightarrow C}(x_m)))].
\end{aligned} \qquad (1)
$$
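A minimal PyTorch sketch of the MRI-side adversarial term in Eq. (1) is shown below; the module names (d_m for the MRI discriminator, g_c2m for $G_{C \rightarrow M}$) are placeholders rather than names taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def mri_adversarial_losses(d_m, g_c2m, x_c, x_m):
    """Sketch of the MRI-side adversarial loss of Eq. (1), assuming d_m and g_c2m
    are torch.nn.Module instances (placeholder names)."""
    pseudo_mri = g_c2m(x_c)                       # x'_m = G_{C->M}(x_c)

    # Discriminator: real MRI scored as 1, synthesized pseudo MRI scored as 0.
    d_real = d_m(x_m)
    d_fake = d_m(pseudo_mri.detach())             # detach so the generator is not updated here
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator into scoring the pseudo MRI as real.
    d_fake_for_g = d_m(pseudo_mri)
    loss_g = F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g))
    return loss_d, loss_g, pseudo_mri
```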
Because the networks are trained with unpaired images, cyclical consistency is enforced to shrink the space of possible mappings computed by the generator networks. The loss enforcing this constraint is computed by minimizing the pixel-to-pixel loss (L1-norm) between the reconstructed images (e.g. $G_{M \rightarrow C}(G_{C \rightarrow M}(x_c))$) and the original images:

$$
L_{cyc}(G_{C \rightarrow M}, G_{M \rightarrow C}, X_C, X_M) = E_{x_c \sim X_C}[\| G_{M \rightarrow C}(G_{C \rightarrow M}(x_c)) - x_c \|_1] + E_{x_m \sim X_M}[\| G_{C \rightarrow M}(G_{M \rightarrow C}(x_m)) - x_m \|_1]. \qquad (2)
$$

However, the cyclical consistency loss alone can only preserve global statistics while failing to preserve organ- or target-specific constraints. Furthermore, when performing unpaired adaptation between unaligned images, pixel-to-pixel matching losses are inadequate to preserve the spatial fidelity of the structures. Therefore, we used the contextual loss introduced in prior work. The contextual loss is computed by matching the low- or mid-level features extracted from the generated and the target images using a pre-trained network such as VGG19 (the default network used in this work). This loss treats the features as a collection and computes all feature-pair similarities, thereby ignoring spatial locations. In other words, the similarity between the generated feature maps ($f(G(x_c)) = g_j$) and target feature maps ($f(x_m) = m_i$) is marginalized over all feature pairings, and the maximal similarity is taken as the similarity between those two images. Therefore, this loss also considers the textural aspects of images when computing the generated-to-target domain matching. It is similar to perceptual losses, but ignores the spatial alignment of the images, which is advantageous when comparing non-corresponding target and source modality generated images. The contextual similarity is computed by normalizing the inverse of the cosine distances between the features $g_j$ and $m_i$ as:

$$
CX(g, m) = \frac{1}{N} \sum_{j} \max_{i} CX(g_j, m_i), \qquad (3)
$$

where $N$ corresponds to the number of features. The loss is computed as:

$$
L_{cx} = -\log\left(CX(f(G(x_c)), f(x_m))\right). \qquad (4)
$$

The pseudo MRI images produced through I2I translation from this stage are used in the distillation learning described below.
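The following is a simplified PyTorch sketch of the contextual similarity and loss of Eqs. (3)-(4), computed between bags of VGG19 feature vectors; the full contextual loss of Mechrez et al. includes an additional distance normalization step that is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def contextual_loss(feat_gen, feat_tgt, eps=1e-5):
    """Simplified sketch of Eqs. (3)-(4): features from the generated and target images
    (e.g., VGG19 activations of shape (B, C, H, W)) are compared as unordered bags of
    feature vectors, ignoring spatial location."""
    b, c, h, w = feat_gen.shape
    g = feat_gen.view(b, c, h * w).permute(0, 2, 1)   # (B, N, C) generated features
    m = feat_tgt.view(b, c, h * w).permute(0, 2, 1)   # (B, N, C) target features

    g = F.normalize(g - g.mean(dim=1, keepdim=True), dim=-1)
    m = F.normalize(m - m.mean(dim=1, keepdim=True), dim=-1)

    sim = torch.bmm(g, m.transpose(1, 2))             # pairwise cosine similarities
    cx = sim.max(dim=2)[0].mean(dim=1)                # max over targets, mean over features
    return (-torch.log(cx + eps)).mean()
```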
II.E. Stage II: Cross-modality distillation-based segmentation

This stage consists of a teacher (MRI) segmentation network and a student (CBCT) segmentation network. The goal of distillation learning is to provide hints to the student network such that features extracted in specific layers of the student network match the feature activations of those same layers in the teacher network.

Both the teacher ($S_M$) and student ($S_C$) networks use the same network architecture (Unet is the default architecture), but process different imaging modalities. Both networks are trained from scratch. The teacher network is trained using expert-segmented T2w TSE MRI ($\{x_m, y_m\} \in \{X_M, Y_M\}$) and pseudo MRI datasets ($\{x'_m, y_c\} \in \{X_C, Y_C\}$) obtained from the expert-segmented CBCT datasets. The CBCT network is trained with the expert-segmented CBCT datasets. The two networks are optimized using the Dice overlap loss:

$$
L_{seg} = L^{M}_{seg} + L^{C}_{seg} = E_{x_m \sim X_M}[-\log P(y_m | S_M(x_m))] + E_{x'_m \sim G_{C \rightarrow M}(X_C)}[-\log P(y_c | S_M(x'_m))] + E_{x_c \sim X_C}[-\log P(y_c | S_C(x_c))]. \qquad (5)
$$

Feature distillation is performed by matching the feature activations computed on the pseudo MRI using $S_M$ with the feature activations extracted from the corresponding CBCT images by $S_C$. Because the features closest to the output are the most correlated with the task, we match the features computed from the last two network layers by minimizing the L2 loss:

$$
L_{hint} = \sum_{i=1}^{N} \| \phi_{C,i}(x_c) - \phi_{M,i}(G_{C \rightarrow M}(x_c)) \|_2, \qquad (6)
$$

where $\phi_{C,i}$ and $\phi_{M,i}$ are the $i$-th layer features computed from the two networks and $N$ is the total number of features. As identical network architectures are used for both networks, the features can be matched directly without requiring an additional step to adapt the feature sizes. We call this loss the hint loss.
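A minimal sketch of the hint loss in Eq. (6), assuming each network returns a list of activations from its last two layers (as indicated by the red arrows in Fig. 2):

```python
import torch

def hint_loss(student_feats, teacher_feats):
    """Sketch of the hint loss of Eq. (6): an L2 penalty between the last/penultimate
    layer activations of the student CBCT network and those of the teacher network
    evaluated on the corresponding pseudo MRI. Both inputs are lists of tensors in the
    same layer order."""
    loss = 0.0
    for f_c, f_m in zip(student_feats, teacher_feats):
        # Whether the teacher-side features are detached here is an implementation
        # choice not specified in the text.
        loss = loss + torch.mean((f_c - f_m) ** 2)
    return loss
```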
The total loss is expressed as:

$$
Loss = L_{adv} + \lambda_{cyc} L_{cyc} + \lambda_{CX} L_{CX} + \lambda_{hint} L_{hint} + \lambda_{seg} L_{seg}, \qquad (7)
$$

where $\lambda_{cyc}$, $\lambda_{CX}$, $\lambda_{hint}$, and $\lambda_{seg}$ are the weighting coefficients for each loss. The network update alternates between the cross-modality adaptation and the segmentation distillation. The network is updated with the following gradients: $-\Delta\theta_G(L_{adv} + \lambda_{cyc} L_{cyc} + \lambda_{CX} L_{CX} + \lambda_{hint} L_{hint} + \lambda_{seg} L_{seg})$, $-\Delta\theta_D(L_{adv})$, and $-\Delta\theta_S(L_{hint} + L_{seg})$.
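The alternating update can be sketched as follows, with the loss tensors and the three optimizers (opt_G for the generators, opt_D for the discriminators, opt_S for the teacher/student segmentors) assumed to be available from the forward pass; the exact update ordering in the released implementation may differ.

```python
# One training iteration (a sketch; variable names are placeholders).
opt_G.zero_grad(); opt_D.zero_grad(); opt_S.zero_grad()

# Generator-side objective of Eq. (7).
loss_G = (loss_adv_gen + lam_cyc * loss_cyc + lam_cx * loss_cx
          + lam_hint * loss_hint + lam_seg * loss_seg)
loss_G.backward(retain_graph=True)

# Discriminators only see the adversarial term (computed on detached pseudo images),
# matching the -delta(theta_D)(L_adv) update.
loss_adv_disc.backward()

# Segmentation parameters, matching -delta(theta_S)(L_hint + L_seg). In practice this
# backward is taken on pseudo MRI detached from the generators so that the generator
# gradients are not double counted.
(loss_hint + loss_seg).backward()

opt_G.step(); opt_D.step(); opt_S.step()
```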
II.F. Implementation details

Cross-modality adaptation network structure: The UDA network architectures were constructed from well-proven architectures used in our prior work on CT and MRI domain adaptation for tumor and organ-at-risk segmentation. The generator architectures were adopted from DCGAN, which has been shown to avoid issues of mode collapse. Specifically, the generators consisted of two stride-2 convolutions, 9 residual blocks, and two fractionally strided convolutions with half strides. The generator networks used rectified linear units (ReLU) to increase training stability. Similarly, instance normalization was used in all but the last layer, which has a tanh activation for image generation, to increase training stability.

A patchGAN discriminator, which scores 70 × 70 overlapping pixel image patches, was used to increase the number of patches available for distinguishing real from fake images and to improve discriminator stability. We have used a patchGAN discriminator in our prior works and found it to achieve stable performance for both tumor and organ segmentation from CT and MRI. Leaky ReLU was used instead of ReLU based on the results from DCGAN, along with batch normalization in all except the first and last layers to increase discriminator stability during training.
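For concreteness, a standard 70 × 70 patchGAN discriminator of the kind described above can be sketched in PyTorch as follows; the channel widths follow the common patchGAN configuration and are assumptions, not values reported in the paper.

```python
import torch.nn as nn

def conv_block(c_in, c_out, stride, norm=True):
    layers = [nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(c_out))   # no normalization in the first/last layers
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchDiscriminator(nn.Module):
    """Sketch of a 70x70 patchGAN discriminator: stride-2 convolutions with leaky ReLU
    and batch normalization in all but the first and last layers."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            *conv_block(in_ch, 64, 2, norm=False),
            *conv_block(64, 128, 2),
            *conv_block(128, 256, 2),
            *conv_block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),  # per-patch real/fake logits
        )

    def forward(self, x):
        return self.net(x)
```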
A pre-trained VGG19 based on the standard VGGNet was used for memory efficiency. The VGG19 consists of 16 convolution layers with 3 × 3 filters, 3 fully connected layers, and 5 maxpool layers. The lowest-level convolution filters, of size 3 × 3 × 64, are progressively doubled in channel number while the feature size is reduced through subsampling with maxpool operations.
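A sketch of how a frozen, pre-trained VGG19 can serve as the fixed feature extractor for the contextual loss is shown below; the torchvision layer indices are assumptions chosen only to illustrate tapping mid-level convolutional activations.

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGFeatures(nn.Module):
    """Frozen VGG19 feature extractor (sketch). Grayscale CBCT/MRI slices would need to
    be replicated to three channels before being passed in."""
    def __init__(self, layer_ids=(16, 19, 21)):   # assumed indices into vgg19().features
        super().__init__()
        self.vgg_layers = vgg19(pretrained=True).features.eval()
        for p in self.vgg_layers.parameters():
            p.requires_grad_(False)                # used only as a fixed extractor
        self.layer_ids = set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg_layers):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats
```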
Segmentation networks structure:
We implemented Unet and DenseFCN networks. We chose these networks because they are the most commonly used segmentation architectures in medical image segmentation and have shown fairly good performance for multiple disease sites.
Figure 1: Approach overview: (a) image translation and (b) knowledge distillation. $x_c$, $x_m$ are the CBCT and MR images from unrelated patient sets; $G_{C \rightarrow M}$ and $G_{M \rightarrow C}$ are the CBCT-to-MRI and MRI-to-CBCT translation networks; $x'_m$ is the pseudo MRI (pMRI) image; $x'_c$ is the pseudo CBCT image; $S_{MR}$ is the teacher network; $S_{CT}$ is the student CBCT segmentation network; CX loss is the contextual loss. $L^{rM}_{seg}$ and $L^{pM}_{seg}$ are the segmentation losses used to train the teacher network, while $L^{ct}_{seg}$ is the loss for the student CBCT network.
Figure 2: The segmentation structures of the (a) Unet and (b) DenseFCN57 networks. The red arrows indicate the layers whose outputs are used for distilling information from MR into CT, done by minimizing the L2-norm between the features of these layers in the two networks. The blue blocks indicate the lower layers, the green blocks the middle layers, and the orange blocks the upper layers of the Unet. Best viewed in color.

The U-Net is composed of a series of convolutional blocks, with each block consisting of convolution, batch normalization, and ReLU activation. Skip connections are implemented to concatenate high-level and lower-level features. Max-pooling and up-pooling layers are used to down-sample and up-sample the feature resolution. We use 4 max-pooling and 4 up-pooling layers in the implemented U-Net structure. The layers from the last two blocks of the Unet, with feature sizes of 128 × 128 × 64 and 256 × 256 × 64, are used to tie the features, shown as red arrows in Fig. 2(a). This network had 13.39 M parameters and 33 layers (counting only layers with tunable weights).

The Dense-FCN is composed of Dense Blocks (DB), which successively concatenate feature maps computed from previous layers, thereby increasing the size of the feature maps. A dense block is produced by iterative concatenation of previous-layer feature maps within that block, where a layer is composed of a batch normalization, ReLU, and 3 × 3 convolution. We used dense blocks with 4 layers for feature concatenation, 5 transition down (TD) blocks for feature down-sampling, and 5 transition up (TU) blocks for feature up-sampling, with a growth rate of 12. Although the authors of DenseFCN provide implementations of deeper networks, including DenseFCN67 and DenseFCN120, we used DenseFCN57 as it has the lowest cost when combined with the cross-modality adaptation network using the CycleGAN framework. This resulted in 1.37 M parameters and 106 layers.
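An illustrative sketch of how a Unet forward pass can expose the activations of its last two decoder blocks for the hint loss (the red arrows in Fig. 2(a)) is given below; the encoder, decoder blocks, and output head are placeholder modules rather than the exact implementation.

```python
import torch.nn as nn

class UnetWithHints(nn.Module):
    """Wrapper (sketch) that returns segmentation logits plus the activations of the
    last two decoder blocks, which are the features tied by the hint loss."""
    def __init__(self, encoder, decoder_blocks, head):
        super().__init__()
        self.encoder = encoder                      # returns bottleneck features and skip features
        self.decoder_blocks = nn.ModuleList(decoder_blocks)
        self.head = head                            # 1x1 convolution producing the segmentation

    def forward(self, x):
        feats, skips = self.encoder(x)
        hints = []
        for i, block in enumerate(self.decoder_blocks):
            feats = block(feats, skips[-(i + 1)])   # up-pool and concatenate the skip connection
            if i >= len(self.decoder_blocks) - 2:   # keep the last two blocks for distillation
                hints.append(feats)
        return self.head(feats), hints
```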
Networks training: All networks were implemented using the Pytorch library and trained end-to-end on a Tesla V100 GPU with 16 GB memory and a batch size of 2. The ADAM algorithm with an initial learning rate of 1e-4 was used for training the image translation networks. The segmentation networks were trained with a learning rate of 2e-4. We set λ_adv = 1, λ_cyc = 10, λ_CX = 1, λ_hint = 1, and λ_seg = 5 for the coefficients of Equation 7.

A pre-trained VGG19 network using the ImageNet dataset was used to compute the contextual loss. The low-level features extracted using VGG19 quantify edge and textural characteristics of images. Although such features could be more useful for quantifying the textural differences between the activation maps, the substantial memory requirement for extracting them precluded their use in this work. Instead, we used higher-level features computed from layers Conv7, Conv8, and Conv9, which capture the mid- and high-level contextual information between the various organ structures; these features were used for network training.

In order to increase the number of training examples and obtain a more generalizable model, the networks were trained with 42,740 2D image slices containing the tumor after cropping the original 512 × 512 images. Early stopping was used to avoid over-fitting and the networks were trained for at most 100 epochs.

The trained models were validated on an independent set of 20 CBCT scans arising from 3 internal patients with 15 weekly segmented CBCTs and 5 week-1 CBCTs from the external institution dataset. Testing was done on 38 CBCT scans from 6 internal patients with 33 weekly segmented CBCTs and 5 week-1 CBCTs from the external institution dataset. Image sets were separated such that all CBCT scans pertaining to a patient did not overlap across the training, validation, and testing sets, to prevent any potential data leak.

In order to support reproducible research, we will make the code for our approach available upon reasonable request after acceptance for publication.
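Putting the reported training settings together, a minimal configuration sketch looks as follows; the train_one_epoch and validate helpers, the patience value, and the sub-network names are placeholders.

```python
import itertools
import torch

# Sub-networks (placeholders): g_c2m, g_m2c, d_m, d_c (translation GANs) and
# s_mr, s_ct (teacher/student segmentors).
opt_trans = torch.optim.Adam(itertools.chain(g_c2m.parameters(), g_m2c.parameters(),
                                             d_m.parameters(), d_c.parameters()), lr=1e-4)
opt_seg = torch.optim.Adam(itertools.chain(s_mr.parameters(), s_ct.parameters()), lr=2e-4)
loss_weights = dict(adv=1.0, cyc=10.0, cx=1.0, hint=1.0, seg=5.0)   # weights of Eq. (7)

best_val, patience, bad_epochs = float("inf"), 10, 0    # patience is an assumed value
for epoch in range(100):                                # trained for at most 100 epochs
    train_one_epoch(opt_trans, opt_seg, loss_weights)   # placeholder training step, batch size 2
    val_loss = validate()                               # 1 - DSC on the validation CBCTs
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # early stopping
            break
```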
III. Experiments and Results
Experiments were done to test the hypothesis that guiding the CBCT network training to extract features as informative as those of the MRI network produces more accurate tumor segmentation on CBCT. In other words, the CBCT network is regularized to extract features similar to those extracted by the teacher network.

We tested our distillation learning framework on two commonly used segmentation networks, the Unet and DenseFCN, to measure performance differences due to network architecture. We also evaluated whether the pMRI images produced through the UDA network were more useful than the CBCT images for segmentation. For this purpose, we used the UDA and teacher networks of the CMEDL architecture; the default CMEDL network only requires the student CBCT network during testing. We also evaluated the performance of pMRI-based segmentation using a standard cycleGAN and a Unet segmentor, which is similar to the work in Lei et al., and the more advanced variational auto-encoder using unpaired image-to-image translation (UNIT) with the Unet network. Additionally, we evaluated whether combining the pMRI with the CBCT as an additional input channel improved performance; for this experiment, the pMRI was generated using a cycleGAN. Finally, we benchmarked the performance of the 2D CMEDL network against a 3D Unet network.

All networks were trained from scratch using identical sets of training and testing datasets. Reasonable hyper-parameter optimization was done to ensure good performance by all networks.

III.A. Network training stability

Fig. 3 shows the training and validation loss curves for both the Unet and DenseFCN networks trained with and without the CMEDL approach. All networks achieved stable loss behavior, although the CMEDL approach resulted in better loss values for the same length of training as the CBCT-only network architectures.
III.B. Evaluation Metrics

Segmentation performance was evaluated on 3D volumetric segmentations produced by combining segmentations from 256 × 256 pixel 2D image patches. In order to establish the clinical utility of the developed method, we computed the surface Dice similarity coefficient (SDSC) metric, which was shown to be more representative of the additional effort needed for clinical adoption than the more commonly used geometric metrics such as DSC and Hausdorff distance. For completeness, we also report the DSC and the Hausdorff distance at the 95th percentile (HD95), as recommended in prior works.

The surface DSC metric emphasizes incorrect segmentations on the boundary, as this is where edits are most likely to be performed by clinicians for clinical acceptance. It is computed as:

$$
D^{(\tau)}_{i,j} = \frac{|S_i \cap B^{(\tau)}_j| + |S_j \cap B^{(\tau)}_i|}{|S_i| + |S_j|}, \qquad (8)
$$

where $B^{(\tau)}_i \subset R^3$ is a border region around the segmented surface $S_i$. The tolerance threshold $\tau$ = 4.38 mm was computed using the standard deviation of the HD95 distances of 8 segmentations performed by two different experts blinded to each other.
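A simplified, voxel-based sketch of the surface DSC in Eq. (8) is shown below; the published metric is computed on surface meshes, so this is an approximation for illustration (boolean 3D masks and voxel spacing in mm are assumed as inputs).

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface_dice(seg_a, seg_b, spacing, tol_mm=4.38):
    """Voxel-based approximation of Eq. (8): the fraction of boundary voxels of each
    segmentation lying within the tolerance tau of the other segmentation's boundary."""
    seg_a, seg_b = seg_a.astype(bool), seg_b.astype(bool)
    border_a = seg_a ^ binary_erosion(seg_a)
    border_b = seg_b ^ binary_erosion(seg_b)

    # Distance (mm) from every voxel to the nearest boundary voxel of the other mask.
    dist_to_b = distance_transform_edt(~border_b, sampling=spacing)
    dist_to_a = distance_transform_edt(~border_a, sampling=spacing)

    overlap_a = np.count_nonzero(dist_to_b[border_a] <= tol_mm)
    overlap_b = np.count_nonzero(dist_to_a[border_b] <= tol_mm)
    return (overlap_a + overlap_b) / (np.count_nonzero(border_a) + np.count_nonzero(border_b))
```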
Statistical comparisons between the CMEDL and other approaches were performed using paired, two-sided Wilcoxon tests on the DSC accuracy measure. Adjustments for multiple comparisons were performed using the Holm-Bonferroni method.

III.C. Comparison of CMEDL and CBCT segmentation accuracy

Table 1 shows the segmentation accuracies on the validation and test sets produced by the various methods using the Unet network. The accuracies for the DenseFCN methods are shown in Table 2. The CMEDL and pMRI-CMEDL methods were significantly more accurate than the CBCT-only networks for both Unet (P < 0.001 for DSC and SDSC) and DenseFCN (P < 0.001 for DSC and SDSC). These two methods were also more accurate than the networks using pMRI computed from UNIT and cycleGAN networks trained separately from the segmentation network. The approach combining pMRI as a separate input channel with the CBCT was the least accurate of all methods, indicating that simply adding pMRI as a second channel does not yield higher accuracy, possibly because features from the two modalities are averaged early on. Finally, the CMEDL approach also outperformed a 3D Unet network, underscoring the importance of extracting better features for differentiating target from background over incorporating information from adjacent slices. Fig. 4 shows example segmentations generated on CBCT images by the CBCT-only, pMRI-Cycle, pMRI-UNIT, and CMEDL methods using the Unet network.
III.D. Pseudo MRI translation accuracy

We also measured the accuracy of translating CBCT into pseudo MRI images using the CMEDL, CycleGAN, and UNIT methods with the Kullback-Leibler (KL) divergence measure.
Figure 3: Training and validation loss (1.0 - DSC) curves for the Unet and DenseFCN networks trained with and without the CMEDL approach.
Figure 4: Segmentation of representative CBCT images from the test set using the CBCT-only, pMR-Cycle, pMR-UNIT, and CMEDL methods (columns, left to right), all using U-net as the segmentation architecture. The yellow contour indicates the manual segmentation and the red contour the algorithm segmentation. The slice-wise DSC score of each example is shown in the blue rectangle.
Table 1: Segmentation accuracy for the CBCT dataset using the various approaches with the Unet network, reporting DSC, surface DSC, and HD95 (mm) on the validation (N = 20) and test (N = 38) sets.

Table 2: Segmentation accuracy for the CBCT dataset using the various approaches with the DenseFCN network, reporting DSC, surface DSC, and HD95 (mm) on the validation (N = 20) and test (N = 38) sets.
The CMEDL approach resulted in the most accurate translation, with the lowest KL divergence of 0.082, compared with 0.42 for CycleGAN and 0.29 for the UNIT method. Fig. 5 shows representative examples of translated pMRI images from the test set produced using the CMEDL method. Tumor and various other structures are clearly visualized on the pMRI, with a clearer boundary between tumor and central structures than on the corresponding CBCT image.

III.E. Sensitivity analysis

We evaluated the robustness of the networks' segmentations by introducing noise into the learned models. This was done by randomly dropping out weights in the last two network layers during testing. Test-time dropout was performed by setting the dropout rate to 0.5 and randomly zeroing out the learned weights. Dropout was run 10 times for each case, and the average for each case is shown in Fig. 8. Both CMEDL networks resulted in lower segmentation variability, as shown in Fig. 8 for each test case. The mean standard deviation (mSD) of the DSC was 0.016 for CMEDL-Unet and 0.016 for CMEDL-DenseFCN, whereas the mSD was higher for both networks trained using only CBCT data, at 0.024 for the Unet and 0.027 for the DenseFCN.
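A sketch of this test-time dropout procedure is given below; it assumes the model contains nn.Dropout modules in its final layers and a dice(pred, ref) helper, both of which are placeholders.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def test_time_dropout_dsc(model, image, reference, n_runs=10, p_keep_active=True):
    """Sketch of the sensitivity analysis: dropout (rate 0.5, assumed to be configured
    in the model's last layers) stays stochastic at inference and segmentation is
    repeated n_runs times, reporting the spread of DSC values."""
    model.eval()
    if p_keep_active:
        for m in model.modules():
            if isinstance(m, (nn.Dropout, nn.Dropout2d)):
                m.train()                          # keep dropout stochastic during inference
    scores = []
    for _ in range(n_runs):
        pred = model(image).argmax(dim=1)
        scores.append(dice(pred, reference))       # placeholder DSC computation
    scores = torch.tensor(scores)
    return scores.mean().item(), scores.std().item()
```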
III.F. Separation of target from background using feature maps extracted from the CMEDL and CBCT-only networks

Finally, we studied how the cross-modality anatomical information from MRI leads to segmentation improvement on CBCT. Fig. 6 shows an example case with eight randomly chosen feature maps selected from the last layer (of size 256 × 256 × 64) of the Unet trained with only CBCT (Fig. 6(a)), and the corresponding feature maps for the same case computed from a CMEDL-Unet (Fig. 6(c)). For reference, the feature maps computed in the same layer by the teacher MRI network using the translated pMRI are also shown (Fig. 6(b)).
Figure 5: Representative examples of pMRI generated from CBCT images using the CMEDL method; the tumor boundary appears clearer on the pMRI than on the corresponding CBCT. The tumor region is enclosed in the red contour.
Figure 6: Feature activation maps computed from (a) the CBCT-only Unet, (b) the teacher network of CMEDL, and (c) the student CBCT network.
Figure 7: t-SNE map of the feature activations computed from (a) the CBCT-only Unet and (b) the CMEDL-CBCT Unet, showing tumor and background pixels.
Figure 8: Sensitivity analysis using test-time dropout for the CMEDL vs. CBCT-only networks: (a) Unet, (b) DenseFCN. The segmentation variability for each test case is shown for both methods.

As seen, the feature maps extracted using the CBCT-only method are less effective in differentiating the tumor regions from the background parenchyma compared with the CMEDL CBCT network. The feature maps extracted from CBCT using the student network are highly similar to the feature maps extracted from the corresponding pMRI images by the teacher network, indicating sufficient knowledge distillation between the teacher and student networks. As a result, the CBCT student network extracted features that clearly distinguished the tumor from the background even when fed only CBCT images. This, in turn, produced more accurate segmentation than a CBCT-only network trained without the CMEDL approach.
Fig. 7 shows the result of unsupervised clustering of the feature maps from the last layer (256 × 256 × 64) of the CMEDL and CBCT-only Unet networks for all 38 test cases. Note that during testing only the student CBCT Unet network of the CMEDL method is used. The input to the clustering consisted of a randomly selected set of pixel features chosen from inside a 160 × 160 pixel region of interest enclosing the tumor in each slice containing tumor, resulting in a total of 200,000 pixels. The number of pixels corresponding to target and background was balanced. Clustering was done using the t-distributed stochastic neighbor embedding (t-SNE) method in Matlab. Briefly, t-SNE computes an unsupervised embedding of high-dimensional data by representing the high-dimensional data and a low-dimensional embedding as probability distributions. Gradient descent is used to minimize the Kullback-Leibler divergence between the two distributions until convergence or a maximum number of iterations. The clustering parameters, namely the perplexity (related to the number of nearest neighbors), was set to 60, and the number of gradient descent iterations was set to 1000. As seen in Fig. 7, the features extracted from the CBCT Unet network trained using CMEDL better distinguish the tumor from background pixels than the features extracted from a CBCT-only Unet.
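An equivalent embedding can be produced in Python with scikit-learn's t-SNE (the study used the MATLAB implementation); the features argument is assumed to be an (n_pixels, n_channels) array of sampled last-layer activations, with matching tumor/background labels kept separately for plotting.

```python
from sklearn.manifold import TSNE

def embed_pixel_features(features, perplexity=60, n_iter=1000, seed=0):
    """2D t-SNE embedding of per-pixel feature vectors (sketch); scatter-plot the
    returned coordinates colored by tumor vs. background label to reproduce a map
    like Fig. 7."""
    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter, random_state=seed)
    return tsne.fit_transform(features)
```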
IV. Discussion

We introduced a new approach that leverages higher-contrast information from the more informative MRI modality to improve CBCT segmentation. Our approach uses unpaired sets of MRI and CBCT images from different sets of patients and learns to extract informative features on CBCT that help to obtain good lung tumor segmentation, even for some centrally located tumors. Our approach shows a clear improvement over CBCT-only segmentation using both 2D and 3D methods, as well as over pMRI-based segmentation when the pMRI is generated using UDA networks trained independently of the CMEDL framework. Importantly, our results also showed that the CMEDL approach enables the CBCT network to extract more informative features that better distinguish tumor from background. This arises by guiding the CBCT student network to extract features that match the statistical distribution of the teacher MRI network.

To our best knowledge, this is one of the first works to address the problem of fully automatic lung tumor segmentation on CBCT images. Prior work on CBCT used semi-automated segmentation, and prior CBCT deep learning methods were applied to segment pelvic normal organs. A clear difference of our approach from these deep learning methods is the use of pMRI as side information during training to guide feature extraction by the CBCT network. The method in Fu et al. combined pMRI with the CBCT features even during testing as a late-fusion network, while other work performed segmentation from pMRI image patches using an attention-gated Unet. The main advantage of our method is that only a lightweight CBCT segmentation network is needed for testing, and the requirements on pMRI accuracy are less stringent than for methods that use pMRI as an input modality.

Our results showed that the MRI information educed on CBCT is most useful when used as hints through cross-modality distillation. This is because the teacher network provides additional regularization to guide the extraction of CBCT features that yield the best possible segmentation performance during training. This regularization is accomplished by matching the features computed from CBCT with the features computed from the corresponding pMRI images. On the other hand, a CBCT network that does not use such constraints may still reach a local minimum after training, but it is not guaranteed to extract features that better distinguish tumor from background, as shown in our results.
Similarly, the frameworks that used the synthesized pMRIs directly for segmentation were also less accurate than the CMEDL CBCT network. In the extreme case, using pMRI as a secondary input channel with CBCT led to a degradation in performance. This is because distillation learning itself only provides additional regularization to constrain the set of features extracted by the student CBCT network, and so does not require as accurate a pMRI translation as would be needed when using the pMRI itself as input to the segmentation network. On the other hand, accurate I2I translation is more important when using the translated image for segmentation. As shown, pMRI-Cycle and pMRI-UNIT were comparable in performance, while the pMRI-CMEDL approach, which uses side information from the CBCT student network to also constrain the I2I translation, improves the accuracy of the teacher network for segmenting CBCT. However, all of these methods were less accurate than segmentation with the student CMEDL CBCT network.

As opposed to standard distillation methods that used a pre-trained teacher network and required paired image sets, ours is, to our best knowledge, the first that works with completely unrelated sets of images from widely different imaging modalities. Removing the constraint of paired image sets makes our approach more practical for medical applications, including new image-guided cancer treatments.

As a limitation, our approach used a modest number of CBCT images for training and testing. The addition of more training sets would likely improve performance further. Similarly, use of a 3D architecture instead of a 2D architecture could enable more accurate volumetric segmentations. Testing on multi-institutional datasets with different imaging acquisitions is also essential to establish the generality of the developed approach and is future work. To summarize, to our best knowledge, this is the first approach to tackle the problem of CBCT lung tumor segmentation.

V. Conclusion

We introduced a novel cross-modality educed distillation learning approach for segmenting lung tumors on cone-beam CT images. Our approach uses unpaired MRI and CBCT image sets to constrain the features extracted on CBCT to improve inference and segmentation performance. Our approach, implemented on two different segmentation networks, showed clear performance improvements over CBCT-only methods. Evaluation on much larger datasets is essential to assess the potential for clinical translation.

VI. Acknowledgements

This research was supported by NCI [grant number R01-CA198121]. It was also partially supported through the NIH/NCI Cancer Center Support Grant [grant number P30 CA008748]. The funders had no involvement in the study design; the collection, analysis, and interpretation of data; the writing of the report; or the decision to submit the article for publication.
VII. References

R. Siegel, K. Miller, and A. Jemal, Cancer statistics, CA: A Cancer Journal for Clinicians, 7–30 (2020).

J. Bradley, C. Hu, R. Komaki, G. Masters, G. Blumenschein, S. Schild, J. Bogart, K. Forster, A. Magliocco, V. Kavadi, S. Narayan, P. Iyengar, C. Robinson, R. Wynn, C. Koprowski, M. Olson, J. Meng, R. Paulus, W. Curran Jr, and H. Choy, Long-Term Results of NRG Oncology RTOG 0617: Standard- Versus High-Dose Chemoradiotherapy With or Without Cetuximab for Unresectable Stage III Non–Small-Cell Lung Cancer, Journal of Clinical Oncology, 706–714 (2020).

E. Weiss, M. Fatyga, Y. Wu, N. Dogan, S. Balik, W. Sleeman, and G. Hugo, Dose escalation for locally advanced lung cancer using adaptive radiation therapy with simultaneous integrated volume-adapted boost, Int J Radiat Oncol Biol Phys, 414–9 (2013).

J. Kavanaugh, G. Hugo, C. Robinson, and M. Roach, Anatomical Adaptation—Early Clinical Evidence of Benefit and Future Needs in Lung Cancer, Semin Radiat Oncol, 274–283 (2019).

J. Sonke and J. Belderbos, Adaptive radiotherapy for lung cancer, Semin Radiat Oncol, 94–106 (2010).

J. Sonke, M. Aznar, and C. Rasch, Adaptive radiotherapy for anatomical changes, Semin Radiat Oncol, 245–257 (2019).

X. Jia, S. Wang, X. Liang, A. Balagopal, D. Nguyen, M. Yang, Z. Wang, J. X. Ji, X. Qian, and S. Jiang, Cone-Beam Computed Tomography (CBCT) Segmentation by Adversarial Learning Domain Adaptation, in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, edited by D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, pages 567–575, Springer International Publishing, 2019.

Y. Fu, Y. Lei, T. Wang, S. Tian, P. Patel, A. Jani, W. Curran, T. Liu, and X. Yang, Pelvic multi-organ segmentation on cone-beam CT for prostate adaptive radiotherapy, Med Phys.

Y. Lei, T. Wang, S. Tian, X. Dong, A. Jani, D. Schuster, W. Curran, P. Patel, T. Liu, and X. Yang, Male pelvic multi-organ segmentation aided by CBCT-based synthetic MRI, Phys Med Biol, 035013 (2020).

G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, (2015).

C. Bucilua, R. Caruana, and A. Niculescu-Mizil, Model compression, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541, ACM, 2006.

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, FitNets: Hints for thin deep nets, 2015.

Q. Li, S. Jin, and J. Yan, Mimicking very efficient network for object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6356–6364, 2017.

J.-C. Su and S. Maji, Cross quality distillation, arXiv preprint arXiv:1604.00433 (2016).

S. Gupta, J. Hoffman, and J. Malik, Cross modal distillation for supervision transfer, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2827–2836, 2016.

J. Jiang, Y. Hu, C. Liu, D. Halpenny, M. D. Hellmann, J. O. Deasy, G. Mageras, and H. Veeraraghavan, Multiple Resolution Residually Connected Feature Streams for Automatic Lung Tumor Segmentation From CT Images, IEEE Transactions on Medical Imaging, 134–144 (2019).

G. D. Hugo, E. Weiss, W. C. Sleeman, S. Balik, P. J. Keall, J. Lu, and J. F. Williamson, A longitudinal four-dimensional computed tomography and cone beam computed tomography dataset for image-guided radiation therapy research in lung cancer, Medical Physics, 762–771 (2017).

N. Tustison and B. Avants, Explicit B-spline regularization in diffeomorphic image registration, Front Neuroinform (2013).

S. Alam, M. Thor, A. Rimner, N. Tyagi, S.-Y. Zhang, L. Kuo, S. Nadeem, W. Lu, Y.-C. Hu, E. Yorke, and P. Zhang, Quantification of accumulated dose and associated anatomical changes of esophagus using weekly magnetic resonance imaging acquired during radiotherapy of locally advanced lung cancer, Phys Imaging Radiat Oncol, 36–43 (2020).

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

J. Jiang, Y.-C. Hu, N. Tyagi, P. Zhang, A. Rimner, G. S. Mageras, J. O. Deasy, and H. Veeraraghavan, Tumor-Aware, Adversarial Domain Adaptation from CT to MRI for Lung Cancer Segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 777–785, Springer, 2018.

R. Mechrez, I. Talmi, and L. Zelnik-Manor, The contextual loss for image transformation with non-aligned data, arXiv preprint arXiv:1803.02077 (2018).

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).

G. Lin, A. Milan, C. Shen, and I. Reid, RefineNet: Multi-path refinement networks for high-resolution semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1925–1934, 2017.

J. Jiang, J. Hu, N. Tyagi, A. Rimner, S. Berry, J. Deasy, and H. Veeraraghavan, Integrating cross-modality hallucinated MRI with CT to aid mediastinal lung tumor segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 221–229, 2019.

J. Jiang, Y.-C. Hu, N. Tyagi, A. Rimner, N. Lee, J. Deasy, S. Berry, and H. Veeraraghavan, PSIGAN: Joint probabilistic segmentation and image distribution matching for unpaired cross-modality adaptation-based MRI segmentation, IEEE Trans. Med Imaging, 4071–4084 (2020).

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, arXiv preprint arXiv:1512.03385 (2015).

D. Ulyanov, A. Vedaldi, and V. Lempitsky, Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6924–6932, 2017.

J.-Y. Zhu, T. Park, P. Isola, and A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in International Conference on Computer Vision (ICCV), pages 2223–2232, 2017.

O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, Springer, 2015.

S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation, in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175–1183, IEEE, 2017.

P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adversarial networks, arXiv preprint (2017).

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch, (2017).

D.-P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2014).

Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, Universal style transfer via feature transforms, in Advances in Neural Information Processing Systems, pages 386–396, 2017.

Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, 3D U-Net: Learning dense volumetric segmentation from sparse annotation, in International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432, Springer, 2016.

S. Nikolov et al., Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy, arXiv preprint arXiv:1809.04430 (2018).

F. Vaassen, C. Hazelaar, A. Vaniqui, M. Gooding, B. van der Heyden, R. Canters, and W. van Elmpt, Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy, Physics and Imaging in Radiation Oncology, 1–6 (2020).

B. H. Menze et al., The multimodal brain tumor image segmentation benchmark (BRATS), IEEE Transactions on Medical Imaging, 1993 (2015).

L. Van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research (2008).

B. K. Veduruparthi, J. Mukhopadhyay, P. P. Das, M. Saha, S. Prasath, S. Ray, R. K. Shrimali, and S. Chatterjee, Segmentation of Lung Tumor in Cone Beam CT Images Based on Level-Sets, pages 1398–1402, 2018.

G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, Learning efficient object detection models with knowledge distillation.