Probabilistic feature extraction, dose statistic prediction and dose mimicking for automated radiation therapy treatment planning
Tianfang Zhang∗,†, Rasmus Bokrantz†, Jimmy Olsson∗
∗Department of Mathematics, KTH Royal Institute of Technology, Stockholm SE-100 44, Sweden
†RaySearch Laboratories, Sveavägen 44, Stockholm SE-103 65, Sweden
February 24, 2021
Abstract
Purpose: We propose a general framework for quantifying predictive uncertainties of dose-related quantities and leveraging this information in a dose mimicking problem in the context of automated radiation therapy treatment planning.
Methods: A three-step pipeline, comprising feature extraction, dose statistic prediction and dose mimicking, is employed. In particular, the features are produced by a convolutional variational autoencoder and used as inputs in a previously developed nonparametric Bayesian statistical method, estimating the multivariate predictive distribution of a collection of predefined dose statistics. Specially developed objective functions are then used to construct a dose mimicking problem based on the produced distributions, creating deliverable treatment plans.
Results: The numerical experiments are performed using a dataset of retrospective treatment plans of prostate cancer patients. We show that the features extracted by the variational autoencoder capture geometric information of substantial relevance to the dose statistic prediction problem, that the estimated predictive distributions are reasonable and outperform a benchmark method, and that the deliverable plans agree well with their clinical counterparts.
Conclusions: We demonstrate that prediction of dose-related quantities may be extended to include uncertainty estimation and that such probabilistic information may be leveraged in a dose mimicking problem. The treatment plans produced by the proposed pipeline resemble their original counterparts well, illustrating the merits of a holistic approach to automated planning based on probabilistic modeling.
Keywords:
Knowledge-based planning, uncertainty modeling, dose–volume histogram prediction, variational autoencoder, mixture-of-experts, dose mimicking.

Introduction
In light of its rapid advancement of late, machine learning has been extensively applied to various areas within biomedical engineering, often with considerable success (see e.g. Park et al. 2012). As pointed out by Babier et al. (2020b), when designing and evaluating methods for automated planning, it is essential to maintain a holistic picture of both parts of a prediction–mimicking pipeline. Considering this, omitting the probabilistic information contained in predictive distributions when using purely predictive methods may be the cause of a substantial disconnect between the two parts. For example, the predictive uncertainty associated with the mean dose in some organ at risk may be orders of magnitude higher than that of the near-minimum dose in the planning target volume (PTV), which is crucial information for the dose mimicking problem; the same holds for individual voxel doses in spatial dose prediction. Without this information, finding the right weights in an optimization problem minimizing deviations between DVH statistics or voxelwise doses may be just as hard as tuning objective weights in conventional inverse planning.

While certain aspects of a general probabilistic approach have been previously addressed—for instance, in Nilsson et al. (2021), where a U-net–based mixture density network is trained to output voxelwise one-dimensional Gaussian mixtures—the estimation of predictive distributions for DVHs or other dose statistics has yet to be investigated. In particular, given a collection of dose statistics, which may represent a discretized DVH, one would like to estimate their multivariate conditional probability distribution given the current patient geometry and the training dataset. As the raw inputs are typically contoured CT images (or, alternatively, only the contours), this is a high-dimensional problem made especially hard by the scarcity of data often prevalent in similar settings. One way to remedy this problem is to employ dimensionality reduction using unsupervised learning methods to preprocess the input images before passing them to a machine learning model. Such dimensionality reduction methods have been previously explored in e.g. Gruselius (2018), where a variational autoencoder (Kingma and Welling 2014) is used for the purpose, but also in other related areas, examples including deformable registration (Fu et al. 2020).

In the present work, based on the estimated predictive distributions over dose statistics, dose mimicking is performed to produce complete treatment plans. The computational study shows that the proposed pipeline yields extracted features containing geometric information of substantial relevance for dose statistic prediction, reasonable estimations of multivariate predictive distributions over dose statistics and complete treatment plans agreeing well with their respective ground truths. In conclusion, the experiments serve to demonstrate the feasibility of the framework for automated treatment planning.
2 Method

Let {(x_n, d_n)}_n ⊂ X × D be a dataset of clinical, historically delivered treatment plans consisting of pairs of contoured CT images x_n and dose distributions d_n, which are represented as vectors x_n = (x_ni)_i and d_n = (d_ni)_i, i being the index over voxels. Here, X and D denote spaces of contoured images and dose distributions, respectively. For our purposes, we will only use the binary encodings of the regions of interest (ROIs) and not the radiodensities in the CT image. Let also {ψ_j}_j be a collection of dose statistic functions ψ_j : D → R—e.g., dose-at-volume D_v in different ROIs and at different volume levels v—and let y_n be defined as the vector y_n = (ψ_j(d_n))_j of evaluated dose statistic values on d_n for each n. In the following, we will use the terms dose statistic and dose-related quantity interchangeably.

Given a new patient with input image x*, the main task is to predict the corresponding dose statistic values y* and solve an accordingly constructed dose mimicking optimization problem. Our pipeline, shown in Figure 1 and outlined in a sketch after the figure caption, is divided into the following three parts:

1. training a variational autoencoder to extract features φ(x*) ∈ Z for some relatively low-dimensional vector space Z, where φ is the encoder part,
2. using {(φ(x_n), y_n)}_n as training set to train a similarity-based Bayesian mixture-of-experts model, which outputs an estimate of the multivariate predictive density p(y* | x*, {(x_n, y_n)}_n), and
3. solving a dose mimicking problem using specially developed objective functions incorporating probabilistic information in the predictive density p(y* | x*, {(x_n, y_n)}_n), thus creating a deliverable plan.
Figure 1:
The proposed automatic treatment planning pipeline.
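To make the interplay between the three steps concrete, the following minimal Python sketch outlines a hypothetical pipeline interface; the function and object names (encoder, predictor, optimizer, mimic, etc.) are illustrative placeholders and not the actual implementation used in this work.

```python
import numpy as np

def run_pipeline(x_star, training_images, training_dose_stats,
                 encoder, predictor, optimizer):
    """Hypothetical end-to-end pass through the three-step pipeline.

    encoder    : trained VAE encoder phi mapping a contoured image to a feature vector
    predictor  : similarity-based mixture-of-experts model (Zhang et al. 2020a)
    optimizer  : dose mimicking routine solving problem (1)
    """
    # 1. Feature extraction: embed the new patient geometry in the latent space Z
    z_star = encoder(x_star)

    # 2. Dose statistic prediction: multivariate predictive distribution over y*
    z_train = np.array([encoder(x_n) for x_n in training_images])
    predictive = predictor.fit(z_train, training_dose_stats).predict(z_star)

    # 3. Dose mimicking: optimize machine parameters against the predictive CDFs
    plan = optimizer.mimic(predictive)
    return plan
```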
2.1 Feature extraction

Using neural networks to extract features from high-dimensional images is a form of representation learning (Bengio et al. 2013), where the autoencoder is a common architecture for learning a compact representation of each input x_n. An autoencoder consists of an encoder φ : X → Z and a decoder ϕ : Z → X, and is trained in such a way that ϕ(φ(x*)) will resemble each new input x* as well as possible. If this is successful, one can deem the low-dimensional feature vector φ(x*) to contain sufficient input to reconstruct x* reasonably well, thereby making φ qualify as a feature extractor.

A drawback with such plain autoencoders is the possible lack of regularity—that is, there is generally no guarantee that the decoded images ϕ(z), ϕ(z') will be close to each other whenever z, z' are close. This may be problematic when using the produced features in another statistical model, particularly if the model is a nonparametric model relying on interpolations. While regularization methods for autoencoders have been extensively studied (Bengio et al. 2013), the variational autoencoder instead treats the latent representation z ∈ Z corresponding to each input x ∈ X as a random variable, and one uses neural networks to approximate the conditional densities p(z | x) and p(x | z). The latter density, referred to as the likelihood, is often well-defined whereas the former, known as the posterior, is subsequently derived from Bayes' rule if one also defines a prior p(z). However, the posterior p(z | x) in this case will be intractable to compute—instead, it is approximated with some closed-form density q(z), often called the variational posterior, in such a way that the Kullback–Leibler divergence d_KL(q ‖ p(· | x)) between q(z) and p(z | x) is minimized. Equivalently, one may find q(z) by maximizing the evidence lower bound ELBO(q) = E_{q(z)} log(p(x, z)/q(z)). For a detailed review of variational inference in general, see Blei et al. (2017).

In our case, given a fixed set of ROIs {R_k}_{k=1}^K as index sets over voxels (each voxel may belong to several ROIs), a binary encoding of each input image x = (x_i)_i ∈ X is used, where x_i is the vector x_i = (1_{i ∈ R_k})_k ∈ R^K—that is, entry k of x_i equals 1 if voxel i belongs to ROI R_k and 0 otherwise. For the model, we follow Kingma and Welling (2014) and define for each n the likelihood and prior
\[
p(x_n \mid z_n) = \prod_i \prod_k \mathrm{Be}\bigl(x_{nik} \mid \varphi(z_n)_{ik}\bigr), \qquad
p(z_n) = \prod_j \mathrm{N}(z_{nj} \mid 0, 1),
\]
where z_n = (z_{nj})_j, with the variational posterior being
\[
q(z_n) = \prod_j \mathrm{N}\bigl(z_{nj} \mid \phi^{\mu}(x_n)_j, \phi^{\sigma^2}(x_n)_j\bigr).
\]
Here, Be and N denote the Bernoulli and the normal distribution, and φ and ϕ are two neural networks with outputs of shape dim Z × 2 and dim X × K, respectively, φ^μ(x_n) and φ^{σ²}(x_n) denoting the two output components of the encoder. The negative ELBO, which is the loss function to be minimized with respect to the network weights in φ and ϕ, is then obtained as
\[
-\mathrm{ELBO}(q) = \sum_n \Biggl( -\frac{1}{2} \sum_j \Bigl(1 + \log \phi^{\sigma^2}(x_n)_j - \phi^{\mu}(x_n)_j^2 - \phi^{\sigma^2}(x_n)_j\Bigr) - \mathrm{E}_{q(z_n)} \sum_i \sum_k \Bigl( x_{nik} \log \varphi(z_n)_{ik} + (1 - x_{nik}) \log\bigl(1 - \varphi(z_n)_{ik}\bigr) \Bigr) \Biggr).
\]
In particular, we use standard Monte Carlo approximations of the expectations and their according stochastic reparameterization gradients (Kingma and Welling 2014).
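As a concrete illustration of the loss above, the following minimal NumPy sketch evaluates a single-sample Monte Carlo estimate of the negative ELBO for one patient, assuming the encoder outputs a mean vector mu and a log-variance vector log_var and that the decoder maps a latent sample to Bernoulli probabilities per voxel and ROI; the function names and the eps safeguard are illustrative choices.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so that gradients can flow through the sampling step.
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def negative_elbo(x, decoder, mu, log_var, rng, eps=1e-7):
    """Single-sample Monte Carlo estimate of -ELBO for one input x.

    x       : binary array of shape (n_voxels, K) with ROI membership encodings
    decoder : maps a latent vector z to Bernoulli probabilities of the same shape as x
    """
    # Closed-form KL divergence between q(z | x) = N(mu, diag(exp(log_var))) and N(0, I)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Bernoulli reconstruction log-likelihood, approximated with one latent sample
    z = sample_latent(mu, log_var, rng)
    probs = decoder(z)
    recon = np.sum(x * np.log(probs + eps) + (1.0 - x) * np.log(1.0 - probs + eps))
    return kl - recon
```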
2.2 Dose statistic prediction

Having trained the variational autoencoder, we may treat the encoder as fixed and use the set {(φ(x_n), y_n)}_n, comprising pairs of feature vectors and evaluated dose statistics, as training dataset for a statistical model in which the multivariate predictive distribution p(y* | φ(x*), {(φ(x_n), y_n)}_n) is obtained given a new input x*. However, despite the embedding of the inputs as feature vectors, the dimensionalities involved are still high relative to typical dataset sizes. Due to the rather complex relationships between feature vectors and dose statistics, this will make a parametric model of suitable size—e.g., a neural network outputting for each input a multivariate normal distribution predicting the corresponding output—hard to train to acceptable generalization accuracy without overfitting. Moreover, as distributions of dose statistics may be skewed or even multimodal (Nilsson et al. 2021), we use the similarity-based Bayesian mixture-of-experts model developed in Zhang et al. (2020a). Given a new feature vector φ(x*), mixture weights are calculated by first evaluating the similarity τ_n to each equivalent φ(x_n) in the training set, and then the probability σ_nc of each y_n belonging to each expert class c—the τ_n are referred to as first transitions and the σ_nc as second transitions. The experts, in turn, are multivariate normal distributions {N(μ_c, Σ_c)}_{c=1}^C. Specifically, the distance metric d_Z used to determine similarities in the feature space Z is the Mahalanobis distance given by d_Z(z, z')² = (z − z')ᵀ Λ (z − z'), where Λ is a precision matrix. This leads to a predictive likelihood of the form
\[
p\bigl(y^* \mid \phi(x^*), \{(\phi(x_n), y_n)\}_n, \theta\bigr) = \sum_c \sum_n \tau_n \sigma_{nc} \, \mathrm{N}(y^* \mid \mu_c, \Sigma_c),
\]
where θ = (Λ, {(μ_c, Σ_c)}_c) are the parameters and where the first and second transitions τ_n, σ_nc are evaluated as
\[
\tau_n = \frac{\mathrm{N}(\phi(x^*) \mid \phi(x_n), \Lambda^{-1})}{\sum_{n'} \mathrm{N}(\phi(x^*) \mid \phi(x_{n'}), \Lambda^{-1})}, \qquad
\sigma_{nc} = \frac{\mathrm{N}(y_n \mid \mu_c, \Sigma_c)}{\sum_{c'} \mathrm{N}(y_n \mid \mu_{c'}, \Sigma_{c'})}.
\]
The parameters θ are treated in a Bayesian fashion, fitted using updates according to a mean-field variational Bayes algorithm, and the resulting predictive distribution is obtained from Monte Carlo samples from the variational posterior. For further details, see Zhang et al. (2020a).

2.3 Dose mimicking

With a feature extraction method and dose statistic prediction model in place, consider a new patient with input image x* for which the predictive distribution p(y* | x*, {(x_n, y_n)}_n) over a vector y* = (ψ_j(d*))_j of dose statistic values has been estimated. In particular, let F_j be the cumulative distribution function of the one-dimensional marginal density p(y*_j | x*, {(x_n, y_n)}_n) for each j. To utilize the probabilistic information captured in these distributions, we will base the objective function contribution from dose statistic j for a given dose d on the value of F_j(ψ_j(d))—loosely speaking, the contribution is based on the fraction of training patients the current patient is better than in terms of dose statistic j. While one could, in principle, use the joint cumulative distribution function over all dose statistics, the choice of using the marginal distributions separately is motivated by the advantage of increased control over individual dose statistics. We assume that each dose statistic is either ideally maximized or minimized, and assign binary labels t_j = 1 and t_j = 0 for the former and the latter case, respectively, to reflect this.
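For intuition, the following sketch evaluates the predictive likelihood above for fixed point estimates of the parameters θ; the full method instead uses Monte Carlo samples from the variational posterior, and all variable names here are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_density(y_star, z_star, Z_train, Y_train, Lambda, mus, Sigmas):
    """Point-estimate version of the mixture-of-experts predictive likelihood.

    Z_train : (N, dim Z) feature vectors phi(x_n);  Y_train : (N, J) dose statistic vectors y_n
    Lambda  : precision matrix of the feature-space Mahalanobis distance
    mus, Sigmas : lists of expert means and covariances (mu_c, Sigma_c)
    """
    cov_z = np.linalg.inv(Lambda)
    # First transitions tau_n: normalized Gaussian similarities in feature space
    tau = np.array([multivariate_normal.pdf(z_star, mean=z_n, cov=cov_z) for z_n in Z_train])
    tau /= tau.sum()
    # Second transitions sigma_nc: responsibility of expert c for training output y_n
    resp = np.array([[multivariate_normal.pdf(y_n, mean=mu, cov=Sig)
                      for mu, Sig in zip(mus, Sigmas)]
                     for y_n in Y_train])
    resp /= resp.sum(axis=1, keepdims=True)
    # Mixture weights over experts, then the predictive density at y_star
    weights = tau @ resp
    return sum(w * multivariate_normal.pdf(y_star, mean=mu, cov=Sig)
               for w, mu, Sig in zip(weights, mus, Sigmas))
```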
Since the cumulative distribution function ranges from 0 to 1, we set up the loss in a cross-entropy form to obtain the contribution corresponding to ψ_j as
\[
\mathrm{CE}_j(d) = -t_j \log F_j(\psi_j(d)) - (1 - t_j) \log\bigl(1 - F_j(\psi_j(d))\bigr).
\]
Note that this means that the higher the certainty that the dose statistic is around some range of values, the more will the objective penalize deviations from those values. At the same time, since F_j is continuous and strictly increasing in the case of a mixture-Gaussian predictive density, the optimizer will always have incentive to improve even when beyond the range of typical values. Figure 2 illustrates the behavior of the penalty for different probability distributions.

As for the resulting optimization problem, let η denote the optimization variables with feasible set E, where the total dose d is determined by some dose deposition mapping d = d(η). For example, in the case of direct machine parameter optimization for volumetric modulated arc therapy (VMAT), the relation between the machine parameters η and the dose d may be modeled according to Section 2c in Unkelbach et al. (2015). The optimization problem may then be written as
\[
\underset{\eta \in \mathcal{E}}{\text{minimize}} \quad \sum_j w_j \, \mathrm{CE}_j(d(\eta)), \tag{1}
\]
where w_j is an importance weight for each dose statistic ψ_j.
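As an illustration, a minimal sketch of the objective contribution is given below, assuming that the marginal predictive distribution of ψ_j is a univariate Gaussian mixture so that F_j can be evaluated from component CDFs; the names and the small eps safeguard are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from scipy.stats import norm

def cross_entropy_objective(psi_value, weights, means, stds, t_j, eps=1e-12):
    """CE_j(d) evaluated at psi_value = psi_j(d).

    weights, means, stds : parameters of the Gaussian-mixture marginal of y*_j
    t_j                  : 1 if the statistic should be maximized, 0 if minimized
    """
    # F_j(psi_j(d)) for a Gaussian mixture is the weighted sum of component CDFs
    F = np.sum(weights * norm.cdf(psi_value, loc=means, scale=stds))
    return -t_j * np.log(F + eps) - (1.0 - t_j) * np.log(1.0 - F + eps)
```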
Figure 2: Illustration of how the shapes of the predictive densities (dashed) affect their respective penalty contributions (solid). The blue and purple lines correspond to dose statistics to be minimized and the green lines to those to be maximized.

It is important to note that, in contrast to conventional inverse planning or dose mimicking, the optimization problem above is relatively insensitive to the weights w_j, since a substantial part of the information regarding the relative importances of the dose statistics is already stored in the cumulative distribution functions F_j.

3 Computational study

To showcase the advantages and disadvantages of the proposed approach, we use the following numerical experiments to demonstrate the full pipeline comprising feature extraction, dose statistic prediction and dose mimicking. The data used for this purpose consists of retrospective treatment plans of prostate cancer patients having undergone a prostatectomy prior to radiation therapy, originating from the Iridium Cancer Network in Antwerp, Belgium. The ROIs were contoured according to RTOG guidelines (Gay et al. 2012), with prescriptions to the prostate bed and to the seminal vesicles and pelvic nodes delivered in a fractionated schedule. All patients were treated using dual VMAT arcs. The DVHs of the training patients are shown in the backgrounds of Figures 6 and 7. For the feature extraction and dose statistic prediction parts, the dataset was split into a training set and a test set.

The variational autoencoder described in Section 2.1 was implemented in TensorFlow 2.3, using the architecture depicted in Figure 3. Specifically, convolutional and transpose-convolutional layers were mostly used for the encoder and decoder parts, although complemented with dense layers, and layer normalization and dropout were applied after each rectified linear unit (ReLU) activation. Using the prostate PTV, seminal vesicles PTV, rectum, bladder, left femur, right femur and small bowel as the set of ROIs whose contours are inputted to the neural network—a total of seven ROIs—the input images were all preprocessed to binary arrays, and the latent space Z was kept relatively low-dimensional. The network weights were fitted using a standard Adam optimizer (Kingma and Ba 2014), with the data split into training and validation sets. In particular, random data augmentations of small shifts and rotations were employed during training. A sketch of an encoder along these lines is given below.
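The following Keras sketch indicates the general type of encoder described above, assuming a hypothetical input grid size and latent dimension (the actual values are those indicated in Figure 3); it is illustrative only and not the network used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 64                  # hypothetical latent dimension
INPUT_SHAPE = (96, 96, 96, 7)    # hypothetical voxel grid with 7 binary ROI channels

def conv_block(x, filters):
    # Strided convolution followed by ReLU, layer normalization and dropout,
    # mirroring the block structure described for the encoder.
    x = layers.Conv3D(filters, kernel_size=3, strides=2, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.LayerNormalization()(x)
    return layers.Dropout(0.1)(x)

inputs = layers.Input(shape=INPUT_SHAPE)
h = conv_block(inputs, 16)
h = conv_block(h, 32)
h = conv_block(h, 64)
h = layers.Flatten()(h)
h = layers.Dense(256, activation="relu")(h)
z_mean = layers.Dense(LATENT_DIM)(h)       # encoder output: mean of q(z | x)
z_log_var = layers.Dense(LATENT_DIM)(h)    # encoder output: log-variance of q(z | x)
encoder = tf.keras.Model(inputs, [z_mean, z_log_var], name="vae_encoder")
```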
Figure 3:
Illustration of the architecture of the variational autoencoder described in Section 2.1. The encoder consists of convolutional (yellow) and dense (purple) layers, and the decoder upsamples using transpose-convolutional (blue) layers. After each layer, a ReLU activation is applied if indicated with a darker color, followed by layer normalization and dropout. The green vector represents the stochastic source, and the last gray layer represents a softmax activation.

The similarity-based mixture-of-experts model was subsequently trained using the implementation described in Zhang et al. (2020a), in particular also using TensorFlow 2.3. For organs at risk, the set {ψ_j}_j of dose statistics consisted of dose-at-volume at the volume levels
10 %, 20 %, ..., 90 %; for targets, both dose-at-volume and mean-tail-dose (Romeijn et al.) were used. Here, mean-tail-dose was included to achieve better control of the tails of the target DVHs (Zhang et al. 2020b). The prediction model was fitted using C = 32 experts, with a posterior sample of each mean–covariance pair (μ_c, Σ_c), leading to a mixture model with a correspondingly larger number of classes.

The dose mimicking problem (1) was subsequently set up using the fitted dose statistic prediction model and the same beam configuration as used in the training dataset. To obtain increased control of the resulting plans, the clinical goals in Table 1 were added to the set of dose statistics, with predictive distributions substituted by synthetically constructed normal distributions centered around the respective goal values. The importance weights w_j were set to separate values for the left and right femurs, for the rectum, bladder and small bowel regions, and for the PTVs. The objective functions, with gradients available due to the recent developments in Zhang et al. (2020b), were implemented in a research version of RayStation 10B. Each mimicking optimization comprised several runs of iterations with accurate dose computations in between—as per standard practice, approximate doses during optimization were calculated by a singular value decomposition algorithm and accurate doses by a collapsed cone algorithm.

Table 1: Additional dose-at-volume (D_v) and lower/upper mean-tail-dose (MTD^±_v) clinical goals used in the dose mimicking problem.

ROI                     Goal
PTV, prostate           MTD^-_{98%} ≥ cGy
PTV, prostate           MTD^+_{0.5%} ≤ cGy
PTV, seminal vesicles   MTD^-_{98%} ≥ cGy
PTV, seminal vesicles   MTD^+_{5%} ≤ cGy
Rectum                  D_{50%} ≤ cGy
Bladder                 D_{25%} ≤ cGy
Near-target             MTD^+_{5%} ≤ cGy

As the main purpose of the variational autoencoder is to provide input features for the dose statistic prediction models, we first ensure that the produced features contain sufficient predictive power. For this, we temporarily use a simplistic linear regression model fitted using the data {(φ(x_n), y_n^pc)}_n, where y_n^pc = (y_{n,1}^pc, y_{n,2}^pc) are the two first principal components of the dose statistic values over all ROIs. Then, for each patient x* in the test dataset, we compute the standard Euclidean distance between the predicted y_*^pc and each predicted y_n^pc, which serves as a relevance-adapted distance in the feature space Z. Figure 4 shows a comparison of this distance to the actual distance between the principal component vectors in scatterplots for six test patients. Despite the simplicity of the linear model, the relevance-adapted distances agree well with the principal component distances, with points lying close to the ground truth also being close in feature space. As a start, this shows that the features produced by the variational autoencoder contain substantial information of relevance for predicting dose statistics. A sketch of this check is given below.
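A minimal sketch of this sanity check, using scikit-learn for the principal component analysis and linear regression, might look as follows; the variable names and the choice of library are illustrative assumptions rather than the implementation used for Figure 4.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def relevance_adapted_distances(Z_train, Y_train, z_test):
    """Distances in feature space adapted to dose statistic relevance.

    Z_train : (N, dim Z) array of training feature vectors phi(x_n)
    Y_train : (N, J) array of evaluated dose statistic vectors y_n
    z_test  : (dim Z,) feature vector of a test patient
    """
    # Project the dose statistics onto their two first principal components
    pca = PCA(n_components=2)
    Y_pc = pca.fit_transform(Y_train)

    # Simplistic linear model mapping features to principal component scores
    reg = LinearRegression().fit(Z_train, Y_pc)

    # Euclidean distances between the test patient's predicted scores and
    # those predicted for each training patient
    y_pc_star = reg.predict(z_test.reshape(1, -1))
    y_pc_train = reg.predict(Z_train)
    return np.linalg.norm(y_pc_train - y_pc_star, axis=1)
```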