Probabilistic feature extraction, dose statistic prediction and dose mimicking for automated radiation therapy treatment planning
Tianfang Zhang∗,†, Rasmus Bokrantz†, Jimmy Olsson∗
∗Department of Mathematics, KTH Royal Institute of Technology, Stockholm SE-100 44, Sweden
†RaySearch Laboratories, Sveavägen 44, Stockholm SE-103 65, Sweden
February 24, 2021
Abstract
Purpose: We propose a general framework for quantifying predictive uncertainties of dose-related quantities and leveraging this information in a dose mimicking problem in the context of automated radiation therapy treatment planning.
Methods: A three-step pipeline, comprising feature extraction, dose statistic prediction and dose mimicking, is employed. In particular, the features are produced by a convolutional variational autoencoder and used as inputs in a previously developed nonparametric Bayesian statistical method, estimating the multivariate predictive distribution of a collection of predefined dose statistics. Specially developed objective functions are then used to construct a dose mimicking problem based on the produced distributions, creating deliverable treatment plans.
Results: The numerical experiments are performed using a dataset of retrospective treatment plans of prostate cancer patients. We show that the features extracted by the variational autoencoder capture geometric information of substantial relevance to the dose statistic prediction problem, that the estimated predictive distributions are reasonable and outperform a benchmark method, and that the deliverable plans agree well with their clinical counterparts.
Conclusions: We demonstrate that prediction of dose-related quantities may be extended to include uncertainty estimation and that such probabilistic information may be leveraged in a dose mimicking problem. The treatment plans produced by the proposed pipeline resemble their original counterparts well, illustrating the merits of a holistic approach to automated planning based on probabilistic modeling.
Keywords:
Knowledge-based planning, uncertainty modeling, dose–volume histogram prediction, variational autoencoder, mixture-of-experts, dose mimicking.

Introduction
In light of its rapid advancement of late, machine learning has been extensively applied to various areas within biomedical engineering, often with considerable success (see e.g. Park et al. 2012). As pointed out by Babier et al. (2020b), when designing and evaluating methods for automated planning, it is essential to maintain a holistic picture of both parts of a prediction–mimicking pipeline. Considering this, omitting the probabilistic information contained in predictive distributions when using purely predictive methods may be the cause of a substantial disconnect between the two parts. For example, the predictive uncertainty associated with the mean dose in some organ at risk may be orders of magnitude higher than that of the near-minimum dose in the planning target volume (PTV), which is crucial information for the dose mimicking problem; the same holds for individual voxel doses in spatial dose prediction. Without this information, finding the right weights in an optimization problem minimizing deviations between DVH statistics or voxelwise doses may be just as hard as tuning objective weights in conventional inverse planning.

While certain aspects of a general probabilistic approach have been previously addressed—for instance, in Nilsson et al. (2021), where a U-net–based mixture density network is trained to output voxelwise one-dimensional Gaussian mixtures—the estimation of predictive distributions for DVHs or other dose statistics has yet to be investigated. In particular, given a collection of dose statistics, which may represent a discretized DVH, one would like to estimate their multivariate conditional probability distribution given the current patient geometry and the training dataset. As the raw inputs are typically contoured CT images (or, alternatively, only the contours), this is a high-dimensional problem made especially hard by the scarcity of data often prevalent in similar settings. One way to remedy this problem is to employ dimensionality reduction using unsupervised learning methods to preprocess the input images before passing them to a machine learning model. Such dimensionality reduction methods have been previously explored in e.g. Gruselius (2018), where a variational autoencoder (Kingma and Welling 2014) is used for the purpose, but also in other related areas, examples including deformable registration (Fu et al. 2020).

In the present work, based on the estimated predictive distributions over dose statistics, dose mimicking is performed to produce complete treatment plans. The computational study shows that the proposed pipeline yields extracted features containing geometric information of substantial relevance for dose statistic prediction, reasonable estimations of multivariate predictive distributions over dose statistics and complete treatment plans agreeing well with their respective ground truths. In conclusion, the experiments serve to demonstrate the feasibility of the framework for automated treatment planning.
2 Method

Let {(x_n, d_n)}_n ⊂ X × D be a dataset of clinical, historically delivered treatment plans consisting of pairs of contoured CT images x_n and dose distributions d_n, which are represented as vectors x_n = (x_ni)_i and d_n = (d_ni)_i, i being the index over voxels. Here, X and D denote spaces of contoured images and dose distributions, respectively. For our purposes, we will only use the binary encodings of the regions of interest (ROIs) and not the radiodensities in the CT image. Let also {ψ_j}_j be a collection of dose statistic functions ψ_j : D → R—e.g., dose-at-volume D_v in different ROIs and at different volume levels v—and let y_n be defined as the vector y_n = (ψ_j(d_n))_j of evaluated dose statistic values on d_n for each n. In the following, we will use the terms dose statistic and dose-related quantity interchangeably.

Given a new patient with input image x*, the main task is to predict the corresponding dose statistic values y* and solve an accordingly constructed dose mimicking optimization problem. Our pipeline, shown in Figure 1 and outlined in a sketch after the figure caption, is divided into the following three parts:

1. training a variational autoencoder to extract features φ(x*) ∈ Z for some relatively low-dimensional vector space Z, where φ is the encoder part,
2. using {(φ(x_n), y_n)}_n as training set to train a similarity-based Bayesian mixture-of-experts model, which outputs an estimate of the multivariate predictive density p(y* | x*, {(x_n, y_n)}_n), and
3. solving a dose mimicking problem using specially developed objective functions incorporating probabilistic information in the predictive density p(y* | x*, {(x_n, y_n)}_n), thus creating a deliverable plan.
Figure 1:
The proposed automatic treatment planning pipeline.
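To make the interplay between the three steps concrete, the following minimal Python sketch outlines a hypothetical pipeline interface; the function and object names (encoder, predictor, optimizer, mimic, etc.) are illustrative placeholders and not the actual implementation used in this work.

```python
import numpy as np

def run_pipeline(x_star, training_images, training_dose_stats,
                 encoder, predictor, optimizer):
    """Hypothetical end-to-end pass through the three-step pipeline.

    encoder    : trained VAE encoder phi mapping a contoured image to a feature vector
    predictor  : similarity-based mixture-of-experts model (Zhang et al. 2020a)
    optimizer  : dose mimicking routine solving problem (1)
    """
    # 1. Feature extraction: embed the new patient geometry in the latent space Z
    z_star = encoder(x_star)

    # 2. Dose statistic prediction: multivariate predictive distribution over y*
    z_train = np.array([encoder(x_n) for x_n in training_images])
    predictive = predictor.fit(z_train, training_dose_stats).predict(z_star)

    # 3. Dose mimicking: optimize machine parameters against the predictive CDFs
    plan = optimizer.mimic(predictive)
    return plan
```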
2.1 Feature extraction

Using neural networks to extract features from high-dimensional images is a form of representation learning (Bengio et al. 2013), where the autoencoder is a common architecture for learning a compact representation of each input x_n. An autoencoder consists of an encoder φ : X → Z and a decoder ϕ : Z → X, and is trained in such a way that ϕ(φ(x*)) will resemble each new input x* as well as possible. If this is successful, one can deem the low-dimensional feature vector φ(x*) to contain sufficient input to reconstruct x* reasonably well, thereby making φ qualify as a feature extractor.

A drawback with such plain autoencoders is the possible lack of regularity—that is, there is generally no guarantee that the decoded images ϕ(z), ϕ(z') will be close to each other whenever z, z' are close. This may be problematic when using the produced features in another statistical model, particularly if the model is a nonparametric model relying on interpolations. While regularization methods for autoencoders have been extensively studied (Bengio et al. 2013), the variational autoencoder instead treats the latent representation z ∈ Z corresponding to each input x ∈ X as a random variable, and one uses neural networks to approximate the conditional densities p(z | x) and p(x | z). The latter density, referred to as the likelihood, is often well-defined whereas the former, known as the posterior, is subsequently derived from Bayes' rule if one also defines a prior p(z). However, the posterior p(z | x) in this case will be intractable to compute—instead, it is approximated with some closed-form density q(z), often called the variational posterior, in such a way that the Kullback–Leibler divergence d_KL(q ‖ p(· | x)) between q(z) and p(z | x) is minimized. Equivalently, one may find q(z) by maximizing the evidence lower bound ELBO(q) = E_{q(z)} log(p(x, z)/q(z)). For a detailed review of variational inference in general, see Blei et al. (2017).

In our case, given a fixed set of ROIs {R_k}_{k=1}^K as index sets over voxels (each voxel may belong to several ROIs), a binary encoding of each input image x = (x_i)_i ∈ X is used, where x_i is the vector x_i = (1_{i ∈ R_k})_k ∈ R^K—that is, entry k of x_i equals 1 if voxel i belongs to ROI R_k and 0 otherwise. For the model, we follow Kingma and Welling (2014) and define for each n the likelihood and prior
\[
p(x_n \mid z_n) = \prod_i \prod_k \mathrm{Be}\bigl(x_{nik} \mid \varphi(z_n)_{ik}\bigr), \qquad
p(z_n) = \prod_j \mathrm{N}(z_{nj} \mid 0, 1),
\]
where z_n = (z_{nj})_j, with the variational posterior being
\[
q(z_n) = \prod_j \mathrm{N}\bigl(z_{nj} \mid \phi^{\mu}(x_n)_j, \phi^{\sigma^2}(x_n)_j\bigr).
\]
Here, Be and N denote the Bernoulli and the normal distribution, and φ and ϕ are two neural networks with outputs of shape dim Z × 2 and dim X × K, respectively, φ^μ(x_n) and φ^{σ²}(x_n) denoting the two output components of the encoder. The negative ELBO, which is the loss function to be minimized with respect to the network weights in φ and ϕ, is then obtained as
\[
-\mathrm{ELBO}(q) = \sum_n \Biggl( -\frac{1}{2} \sum_j \Bigl(1 + \log \phi^{\sigma^2}(x_n)_j - \phi^{\mu}(x_n)_j^2 - \phi^{\sigma^2}(x_n)_j\Bigr) - \mathrm{E}_{q(z_n)} \sum_i \sum_k \Bigl( x_{nik} \log \varphi(z_n)_{ik} + (1 - x_{nik}) \log\bigl(1 - \varphi(z_n)_{ik}\bigr) \Bigr) \Biggr).
\]
In particular, we use standard Monte Carlo approximations of the expectations and their according stochastic reparameterization gradients (Kingma and Welling 2014).
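As a concrete illustration of the loss above, the following minimal NumPy sketch evaluates a single-sample Monte Carlo estimate of the negative ELBO for one patient, assuming the encoder outputs a mean vector mu and a log-variance vector log_var and that the decoder maps a latent sample to Bernoulli probabilities per voxel and ROI; the function names and the eps safeguard are illustrative choices.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so that gradients can flow through the sampling step.
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def negative_elbo(x, decoder, mu, log_var, rng, eps=1e-7):
    """Single-sample Monte Carlo estimate of -ELBO for one input x.

    x       : binary array of shape (n_voxels, K) with ROI membership encodings
    decoder : maps a latent vector z to Bernoulli probabilities of the same shape as x
    """
    # Closed-form KL divergence between q(z | x) = N(mu, diag(exp(log_var))) and N(0, I)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Bernoulli reconstruction log-likelihood, approximated with one latent sample
    z = sample_latent(mu, log_var, rng)
    probs = decoder(z)
    recon = np.sum(x * np.log(probs + eps) + (1.0 - x) * np.log(1.0 - probs + eps))
    return kl - recon
```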
2.2 Dose statistic prediction

Having trained the variational autoencoder, we may treat the encoder as fixed and use the set {(φ(x_n), y_n)}_n, comprising pairs of feature vectors and evaluated dose statistics, as training dataset for a statistical model in which the multivariate predictive distribution p(y* | φ(x*), {(φ(x_n), y_n)}_n) is obtained given a new input x*. However, despite the embedding of the inputs as feature vectors, the dimensionalities involved are still high relative to typical dataset sizes. Due to the rather complex relationships between feature vectors and dose statistics, this will make a parametric model of suitable size—e.g., a neural network outputting for each input a multivariate normal distribution predicting the corresponding output—hard to train to acceptable generalization accuracy without overfitting. Moreover, as distributions of dose statistics may be skewed or even multimodal (Nilsson et al. 2021), we use the similarity-based Bayesian mixture-of-experts model developed in Zhang et al. (2020a). Given a new feature vector φ(x*), mixture weights are calculated by first evaluating the similarity τ_n to each equivalent φ(x_n) in the training set, and then the probability σ_nc of each y_n belonging to each expert class c—the τ_n are referred to as first transitions and the σ_nc as second transitions. The experts, in turn, are multivariate normal distributions {N(μ_c, Σ_c)}_{c=1}^C. Specifically, the distance metric d_Z used to determine similarities in the feature space Z is the Mahalanobis distance given by d_Z(z, z')² = (z − z')ᵀ Λ (z − z'), where Λ is a precision matrix. This leads to a predictive likelihood of the form
\[
p\bigl(y^* \mid \phi(x^*), \{(\phi(x_n), y_n)\}_n, \theta\bigr) = \sum_c \sum_n \tau_n \sigma_{nc} \, \mathrm{N}(y^* \mid \mu_c, \Sigma_c),
\]
where θ = (Λ, {(μ_c, Σ_c)}_c) are the parameters and where the first and second transitions τ_n, σ_nc are evaluated as
\[
\tau_n = \frac{\mathrm{N}(\phi(x^*) \mid \phi(x_n), \Lambda^{-1})}{\sum_{n'} \mathrm{N}(\phi(x^*) \mid \phi(x_{n'}), \Lambda^{-1})}, \qquad
\sigma_{nc} = \frac{\mathrm{N}(y_n \mid \mu_c, \Sigma_c)}{\sum_{c'} \mathrm{N}(y_n \mid \mu_{c'}, \Sigma_{c'})}.
\]
The parameters θ are treated in a Bayesian fashion, fitted using updates according to a mean-field variational Bayes algorithm, and the resulting predictive distribution is obtained from Monte Carlo samples from the variational posterior. For further details, see Zhang et al. (2020a).

2.3 Dose mimicking

With a feature extraction method and dose statistic prediction model in place, consider a new patient with input image x* for which the predictive distribution p(y* | x*, {(x_n, y_n)}_n) over a vector y* = (ψ_j(d*))_j of dose statistic values has been estimated. In particular, let F_j be the cumulative distribution function of the one-dimensional marginal density p(y*_j | x*, {(x_n, y_n)}_n) for each j. To utilize the probabilistic information captured in these distributions, we will base the objective function contribution from dose statistic j for a given dose d on the value of F_j(ψ_j(d))—loosely speaking, the contribution is based on the fraction of training patients the current patient is better than in terms of dose statistic j. While one could, in principle, use the joint cumulative distribution function over all dose statistics, the choice of using the marginal distributions separately is motivated by the advantage of increased control over individual dose statistics. We assume that each dose statistic is either ideally maximized or minimized, and assign binary labels t_j = 1 and t_j = 0 for the former and the latter case, respectively, to reflect this.
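For intuition, the following sketch evaluates the predictive likelihood above for fixed point estimates of the parameters θ; the full method instead uses Monte Carlo samples from the variational posterior, and all variable names here are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_density(y_star, z_star, Z_train, Y_train, Lambda, mus, Sigmas):
    """Point-estimate version of the mixture-of-experts predictive likelihood.

    Z_train : (N, dim Z) feature vectors phi(x_n);  Y_train : (N, J) dose statistic vectors y_n
    Lambda  : precision matrix of the feature-space Mahalanobis distance
    mus, Sigmas : lists of expert means and covariances (mu_c, Sigma_c)
    """
    cov_z = np.linalg.inv(Lambda)
    # First transitions tau_n: normalized Gaussian similarities in feature space
    tau = np.array([multivariate_normal.pdf(z_star, mean=z_n, cov=cov_z) for z_n in Z_train])
    tau /= tau.sum()
    # Second transitions sigma_nc: responsibility of expert c for training output y_n
    resp = np.array([[multivariate_normal.pdf(y_n, mean=mu, cov=Sig)
                      for mu, Sig in zip(mus, Sigmas)]
                     for y_n in Y_train])
    resp /= resp.sum(axis=1, keepdims=True)
    # Mixture weights over experts, then the predictive density at y_star
    weights = tau @ resp
    return sum(w * multivariate_normal.pdf(y_star, mean=mu, cov=Sig)
               for w, mu, Sig in zip(weights, mus, Sigmas))
```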
Since the cumulative distribution function ranges from 0 to 1, we set up the loss in a cross-entropy form to obtain the contribution corresponding to ψ_j as
\[
\mathrm{CE}_j(d) = -t_j \log F_j(\psi_j(d)) - (1 - t_j) \log\bigl(1 - F_j(\psi_j(d))\bigr).
\]
Note that this means that the higher the certainty that the dose statistic is around some range of values, the more will the objective penalize deviations from those values. At the same time, since F_j is continuous and strictly increasing in the case of a mixture-Gaussian predictive density, the optimizer will always have incentive to improve even when beyond the range of typical values. Figure 2 illustrates the behavior of the penalty for different probability distributions.

As for the resulting optimization problem, let η denote the optimization variables with feasible set E, where the total dose d is determined by some dose deposition mapping d = d(η). For example, in the case of direct machine parameter optimization for volumetric modulated arc therapy (VMAT), the relation between the machine parameters η and the dose d may be modeled according to Section 2c in Unkelbach et al. (2015). The optimization problem may then be written as
\[
\underset{\eta \in \mathcal{E}}{\text{minimize}} \quad \sum_j w_j \, \mathrm{CE}_j(d(\eta)), \tag{1}
\]
where w_j is an importance weight for each dose statistic ψ_j.
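As an illustration, a minimal sketch of the objective contribution is given below, assuming that the marginal predictive distribution of ψ_j is a univariate Gaussian mixture so that F_j can be evaluated from component CDFs; the names and the small eps safeguard are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from scipy.stats import norm

def cross_entropy_objective(psi_value, weights, means, stds, t_j, eps=1e-12):
    """CE_j(d) evaluated at psi_value = psi_j(d).

    weights, means, stds : parameters of the Gaussian-mixture marginal of y*_j
    t_j                  : 1 if the statistic should be maximized, 0 if minimized
    """
    # F_j(psi_j(d)) for a Gaussian mixture is the weighted sum of component CDFs
    F = np.sum(weights * norm.cdf(psi_value, loc=means, scale=stds))
    return -t_j * np.log(F + eps) - (1.0 - t_j) * np.log(1.0 - F + eps)
```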
Figure 2: Illustration of how the shapes of the predictive densities (dashed) affect their respective penalty contributions (solid). The blue and purple lines correspond to dose statistics to be minimized and the green lines to those to be maximized.

It is important to note that, in contrast to conventional inverse planning or dose mimicking, the optimization problem above is relatively insensitive to the weights w_j, since a substantial part of the information regarding the relative importances of the dose statistics is already stored in the cumulative distribution functions F_j.

3 Computational study

To showcase the advantages and disadvantages of the proposed approach, we use the following numerical experiments to demonstrate the full pipeline comprising feature extraction, dose statistic prediction and dose mimicking. The data used for this purpose consists of retrospective treatment plans of prostate cancer patients having undergone a prostatectomy prior to radiation therapy, originating from the Iridium Cancer Network in Antwerp, Belgium. The ROIs were contoured according to RTOG guidelines (Gay et al. 2012), with prescriptions to the prostate bed and to the seminal vesicles and pelvic nodes delivered in a fractionated schedule. All patients were treated using dual VMAT arcs. The DVHs of the training patients are shown in the backgrounds of Figures 6 and 7. For the feature extraction and dose statistic prediction parts, the dataset was split into a training set and a test set.

The variational autoencoder described in Section 2.1 was implemented in TensorFlow 2.3, using the architecture depicted in Figure 3. Specifically, convolutional and transpose-convolutional layers were mostly used for the encoder and decoder parts, although complemented with dense layers, and layer normalization and dropout were applied after each rectified linear unit (ReLU) activation. Using the prostate PTV, seminal vesicles PTV, rectum, bladder, left femur, right femur and small bowel as the set of ROIs whose contours are inputted to the neural network—a total of seven ROIs—the input images were all preprocessed to binary arrays, and the latent space Z was kept relatively low-dimensional. The network weights were fitted using a standard Adam optimizer (Kingma and Ba 2014), with the data split into training and validation sets. In particular, random data augmentations of small shifts and rotations were employed during training. A sketch of an encoder along these lines is given below.
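The following Keras sketch indicates the general type of encoder described above, assuming a hypothetical input grid size and latent dimension (the actual values are those indicated in Figure 3); it is illustrative only and not the network used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 64                  # hypothetical latent dimension
INPUT_SHAPE = (96, 96, 96, 7)    # hypothetical voxel grid with 7 binary ROI channels

def conv_block(x, filters):
    # Strided convolution followed by ReLU, layer normalization and dropout,
    # mirroring the block structure described for the encoder.
    x = layers.Conv3D(filters, kernel_size=3, strides=2, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.LayerNormalization()(x)
    return layers.Dropout(0.1)(x)

inputs = layers.Input(shape=INPUT_SHAPE)
h = conv_block(inputs, 16)
h = conv_block(h, 32)
h = conv_block(h, 64)
h = layers.Flatten()(h)
h = layers.Dense(256, activation="relu")(h)
z_mean = layers.Dense(LATENT_DIM)(h)       # encoder output: mean of q(z | x)
z_log_var = layers.Dense(LATENT_DIM)(h)    # encoder output: log-variance of q(z | x)
encoder = tf.keras.Model(inputs, [z_mean, z_log_var], name="vae_encoder")
```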
Figure 3:
Illustration of the architecture of the variational autoencoder described in Section 2.1. The encoder consists of convolutional (yellow) and dense (purple) layers, and the decoder upsamples using transpose-convolutional (blue) layers. After each layer, a ReLU activation is applied if indicated with a darker color, followed by layer normalization and dropout. The green vector represents the stochastic source, and the last gray layer represents a softmax activation.

The similarity-based mixture-of-experts model was subsequently trained using the implementation described in Zhang et al. (2020a), in particular also using TensorFlow 2.3. For organs at risk, the set {ψ_j}_j of dose statistics consisted of dose-at-volume at the volume levels
10 %, 20 %, ..., 90 %; for targets, both dose-at-volume and mean-tail-dose (Romeijn et al.) were used. Here, mean-tail-dose was included to achieve better control of the tails of the target DVHs (Zhang et al. 2020b). The prediction model was fitted using C = 32 experts, with a posterior sample of each mean–covariance pair (μ_c, Σ_c), leading to a mixture model with a correspondingly larger number of classes.

The dose mimicking problem (1) was subsequently set up using the fitted dose statistic prediction model and the same beam configuration as used in the training dataset. To obtain increased control of the resulting plans, the clinical goals in Table 1 were added to the set of dose statistics, with predictive distributions substituted by synthetically constructed normal distributions centered around the respective goal values. The importance weights w_j were set to separate values for the left and right femurs, for the rectum, bladder and small bowel regions, and for the PTVs. The objective functions, with gradients available due to the recent developments in Zhang et al. (2020b), were implemented in a research version of RayStation 10B. Each mimicking optimization comprised several runs of iterations with accurate dose computations in between—as per standard practice, approximate doses during optimization were calculated by a singular value decomposition algorithm and accurate doses by a collapsed cone algorithm.

Table 1: Additional dose-at-volume (D_v) and lower/upper mean-tail-dose (MTD^±_v) clinical goals used in the dose mimicking problem.

ROI                     Goal
PTV, prostate           MTD^-_{98%} ≥ cGy
PTV, prostate           MTD^+_{0.5%} ≤ cGy
PTV, seminal vesicles   MTD^-_{98%} ≥ cGy
PTV, seminal vesicles   MTD^+_{5%} ≤ cGy
Rectum                  D_{50%} ≤ cGy
Bladder                 D_{25%} ≤ cGy
Near-target             MTD^+_{5%} ≤ cGy

As the main purpose of the variational autoencoder is to provide input features for the dose statistic prediction models, we first ensure that the produced features contain sufficient predictive power. For this, we temporarily use a simplistic linear regression model fitted using the data {(φ(x_n), y_n^pc)}_n, where y_n^pc = (y_{n,1}^pc, y_{n,2}^pc) are the two first principal components of the dose statistic values over all ROIs. Then, for each patient x* in the test dataset, we compute the standard Euclidean distance between the predicted y_*^pc and each predicted y_n^pc, which serves as a relevance-adapted distance in the feature space Z. Figure 4 shows a comparison of this distance to the actual distance between the principal component vectors in scatterplots for six test patients. Despite the simplicity of the linear model, the relevance-adapted distances agree well with the principal component distances, with points lying close to the ground truth also being close in feature space. As a start, this shows that the features produced by the variational autoencoder contain substantial information of relevance for predicting dose statistics. A sketch of this check is given below.
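A minimal sketch of this sanity check, using scikit-learn for the principal component analysis and linear regression, might look as follows; the variable names and the choice of library are illustrative assumptions rather than the implementation used for Figure 4.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def relevance_adapted_distances(Z_train, Y_train, z_test):
    """Distances in feature space adapted to dose statistic relevance.

    Z_train : (N, dim Z) array of training feature vectors phi(x_n)
    Y_train : (N, J) array of evaluated dose statistic vectors y_n
    z_test  : (dim Z,) feature vector of a test patient
    """
    # Project the dose statistics onto their two first principal components
    pca = PCA(n_components=2)
    Y_pc = pca.fit_transform(Y_train)

    # Simplistic linear model mapping features to principal component scores
    reg = LinearRegression().fit(Z_train, Y_pc)

    # Euclidean distances between the test patient's predicted scores and
    # those predicted for each training patient
    y_pc_star = reg.predict(z_test.reshape(1, -1))
    y_pc_train = reg.predict(Z_train)
    return np.linalg.norm(y_pc_train - y_pc_star, axis=1)
```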