Efficient inversion strategies for estimating optical properties with Monte Carlo radiative transport models
Callum M. Macdonald, Simon Arridge, Samuel Powell
1 Department of Medical Physics and Biomedical Engineering, University College London, London, WC1E 6BT, UK. 2 Department of Computer Science, University College London, London, WC1E 6BT, UK. 3 Faculty of Engineering, University of Nottingham, Nottingham, NG7 2RD, UK.
July 10, 2020
Abstract

Significance:
Indirect imaging problems in biomedical optics generally require repeated evaluation of forward models of radiative transport, for which Monte Carlo is accurate yet computationally costly. We develop a novel approach to reduce this bottleneck which has significant implications for quantitative tomographic imaging in a variety of medical and industrial applications.
Aim:
Our aim is to enable computationally efficient image reconstruction in (hybrid) diffuse optical modalities using stochastic forward models.
Approach:
Using Monte Carlo we compute a fully stochastic gradient of an objective function for a given imaging problem. Leveraging techniques from the machine learning community we then adaptively control the accuracy of this gradient throughout the iterative inversion scheme, in order to substantially reduce computational resources at each step.
Results:
For example problems of Quantitative Photoacoustic Tomography and Ultrasound Modulated Optical Tomography, we demonstrate that solutions are attainable using a total computational expense that is comparable to (or less than) that which is required for a single high accuracy forward run of the same Monte Carlo model.
Conclusions:
This approach demonstrates significant computational savings when approaching the full non-linear inverse problem of optical property estimation using stochastic methods.
Keywords: Monte Carlo, Radiative Transport, Optical Tomography, Machine Learning, Stochastic Gradient Descent
Inverse problems arise in many areas within biomedical optics, both for global characterisation of optical properties of media and for image reconstruction, amongst other applications [1]. Inverse problems are often considered as optimisation problems, solved by deriving the gradient of an objective function and iteratively descending through the solution space. This process requires repeated solutions of forward and corresponding adjoint problems which are often computationally demanding in their own right. If the forward problem is given by the solution to a partial differential equation (PDE) then one appealing approach is to solve the forward and inverse problems simultaneously so that at intermediate stages in the algorithm (i.e. before it has finally converged) the forward problem is only approximately solved; this approach (which has its basis in Optimal Control) is known as PDE-constrained Optimisation [2–4]. In this work we seek an equivalent framework for the case where approximate noisy solutions to the forward model can (or must) be sought by stochastic methods.

The application of stochastic methods for the solution of PDEs is particularly pertinent in problems involving diffuse optics, since the "gold standard" method of solving the Radiative Transfer Equation (RTE), which is the most generally applicable description of the underlying physics, is to use stochastic (Monte Carlo) techniques [5]; their use in such applications parallels their extensive employment in other fields such as Neutron Physics [6]. Whilst approximations to the RTE (such as diffusion) which permit deterministic solutions are available, these are often not valid in many cases such as in small domains, close to sources and boundaries, and in regions with weak scattering or strong absorption.
Analytical solutions to the RTE itself are known for some geometries, such as infinite space [7] and layered media [8], but such expressions are not readily available for general domains. The practicality of Monte Carlo techniques has been significantly boosted by recent advances in computational hardware, particularly in the application of parallelization [9, 10]. Other approaches to improve their computational performance have been explored, such as the introduction of perturbation techniques [11], or variance reduction techniques [12, 13]. Consequently, even when the aforementioned approximations to the RTE are reasonable, Monte Carlo solutions may offer an attractive alternative to the use of deterministic techniques such as the finite element method, when the complexity of the geometry or probe requires a high density discretisation of the spatial domain.

With both deterministic and stochastic solvers, the computational cost of the forward model typically remains the limiting factor in image reconstruction procedures. However stochastic methods have a particular quality distinct from deterministic methods: one may arbitrarily trade computational expense against noise in the estimated solution, without bias. In the case of diffuse optics, this trade-off is mediated through the number of virtual photons simulated by the Monte Carlo model for a given problem. This fact naturally leads one to consider how much noise can be tolerated during the solution of the inverse problem, and whether a strategy can be found by which to approach this solution with the least work.

Parallels can be drawn between this problem and large scale machine learning, where the requirement is to find the global minimum of a loss function expressed through fitting a model to a very large set of training data.
The recent growth in this field has led to significant developments in optimisation methods using stochastic subsets, and in particular the use of approximate gradients at intermediate steps, a technique known as Stochastic Gradient Descent (SGD). At the heart of this issue is the interplay between optimisation and randomness, and the fact that attaining highly accurate estimations of the gradient at each step in SGD can come at a high cost when dealing with large datasets. However, if we can accept certain levels of randomness in our gradient computation, then each step in the gradient descent can be achieved at a lower computational cost. Returning to the context of biomedical optics, we may be able to accept a "noisy" low-cost forward model computation (which would otherwise be undesirable in the PDE-constrained approach) and simulate fewer photon trajectories during the earlier stages of the inversion process, leading to an overall accuracy vs computation time benefit. Thus, the topic of how to most efficiently utilize finite sized data sets in machine learning is relevant to the deployment of Monte Carlo based solvers in biomedical optics.

In this study we attempt to translate these recent insights from SGD in machine learning into practical suggestions to improve the use of Monte Carlo methods in inverse problems which arise in biomedical optics. To do this, we employ a fully stochastic computation of an objective gradient using forward and adjoint models of the RTE solved by the Monte Carlo method. This allows for the full non-linear inverse problem to be approached. In our demonstration problems the inverse problem can be approximately solved using a total computational expenditure which is similar to or less than that which would typically be dedicated to a single high quality (low variance) solution of the forward imaging problem.

This paper is arranged as follows.
First we will begin by outlining some key aspects of Gradient Descent in Section 2, including what appropriate metrics can be used to quantify acceptable levels of variance in the computation of sub-gradients via a stochastic process (i.e. Monte Carlo), and what step sizes to use in order to allow convergence. In Section 3 we will then describe the example problems that we will use to evaluate the improvements of SGD in a biomedical context, as well as the details of the Monte Carlo forward and adjoint models and gradient calculations. In Section 4 we apply these ideas to two different Coupled Physics Imaging (CPI) modalities, namely Quantitative Photoacoustic Tomography (QPAT) and Ultrasound Modulated Optical Tomography (UMOT) [14, 15]. Both of these problems are non-linear and entail the RTE for an accurate description, but exhibit different degrees of ill-posedness and resolution; thus they serve to demonstrate the generality of our approach. In Section 4 we then evaluate the performance of various Monte Carlo inversions using simulated QPAT and UMOT data, and discuss what practical lessons can be taken from this in Section 5.
A common problem in biomedical optics involves finding the internal distribution of some optical properties x within a medium using various measurements made around and/or within the medium, y_obs. To do this, we can employ some forward model of the underlying physical problem A, which produces an output y, given some estimate of the internal properties x,

\[ y = A(x), \tag{1} \]

where in this case the forward model A could represent the radiative transport equation, and all relevant aspects of the optical setup (geometry of sources & detectors etc.). In cases where A is not directly invertible, then in order to solve for an unknown distribution of properties x, we can formulate a cost function as a measure of the quality of an estimate. This could for example be the L2-norm of the residual between the real measured data, y_obs, and our forward modelled data, y:

\[ F(x) = \frac{1}{2} \left\| y_{\mathrm{obs}} - y \right\|^2 = \frac{1}{2} \left\| y_{\mathrm{obs}} - A(x) \right\|^2 \tag{2} \]

From this point, the problem now becomes one of minimization, where we will qualify our solution x* as that which minimizes the cost function, x* = arg min_x F(x). Note that the ground truth parameters x_true may differ from the minimizer x*, leading to reconstruction error. This minimization problem can be approached via iterative Gradient Descent (GD), where we start with some estimate x_0, and each successive iterate, x_n, is determined by subtracting a (scaled) gradient of our cost function ∇F (relative to the internal optical properties) from the previous iterate,

\[ x_n = x_{n-1} - \alpha_n \nabla F(x_{n-1}), \tag{3} \]

where α_n is the step size which scales the update term.
If we have access to some computation or set of computations (sometimes referred to as a "first-order oracle") which we can call to compute F(x_{n−1}) and ∇F(x_{n−1}), then this algorithm can be implemented, and is said to converge if lim_{n→∞} F(x_n) = 0. In practice, the descent may be terminated early once the cost function reaches some acceptable value, for example when the norm of the difference between observed and model data is of the same order as measurement noise, a criterion known as the Discrepancy Principle [16].
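The scheme of Eqs. (1)–(3) with early termination by the Discrepancy Principle can be illustrated on a toy problem. In the sketch below, the forward model A is simply a random matrix and the noise level is assumed known; all names and constants are hypothetical stand-ins rather than the models used in this paper.

```python
import numpy as np

# Toy illustration of Eqs. (1)-(3): gradient descent on F(x) = 0.5*||y_obs - A(x)||^2
# with early termination via the discrepancy principle. The linear operator A and
# the noise level are hypothetical stand-ins for a real forward model.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))               # forward model: here simply a matrix
x_true = rng.normal(size=5)
noise_level = 0.05
y_obs = A @ x_true + noise_level * rng.normal(size=20)

x = np.zeros(5)
L = np.linalg.norm(A.T @ A, 2)             # Lipschitz constant of grad F
alpha = 1.0 / L                            # classical step size alpha <= 1/L
tau = 1.5                                  # safety factor for the discrepancy principle
for n in range(10_000):
    residual = y_obs - A @ x
    if np.linalg.norm(residual) <= tau * noise_level * np.sqrt(20):
        break                              # stop once the misfit matches the noise
    x = x + alpha * (A.T @ residual)       # x_n = x_{n-1} - alpha * grad F(x_{n-1})
print(n, np.linalg.norm(x - x_true))
```

The iteration halts well before over-fitting the noise, rather than descending until the gradient vanishes.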
In a stochastic setting, for instance when our forward model A is a Monte Carlo model of radiative transport, the true cost F and gradient ∇F stated in Eq. (3) are not directly available. Instead, we may only have access to estimates of the cost function and gradient (provided by a "stochastic first-order oracle"). In Section 3 we will detail the nature of these stochastic Monte Carlo computations in the radiative transport setting. In the interest of generality, for now we will simply assume such models exist, and that we can make a call to a "stochastic oracle" to attain F_{S_n} and ∇F_{S_n}, which we assume are non-biased approximations, i.e.

\[ \mathbb{E}[F_{S_n}(x_n)] = F(x_n), \qquad \mathbb{E}[\nabla F_{S_n}(x_n)] = \nabla F(x_n), \tag{4} \]

where E denotes the mean (expected) value for scalar quantities, or the mean (expected) vector for vector quantities such as the gradient. Here S_n denotes the n'th "sample" used in the computation. The meaning of "sample" here depends on the application. For example in machine learning this may refer to a particular training example (or group of training examples) to be used during one learning iteration [17]. In Monte Carlo modelling of radiative transport, the sample refers to the set of virtual photons (and their associated random number seeds) that are initiated in the simulation to represent an optical source, which are subsequently used to estimate F(x_n) and ∇F(x_n). The stochastic version of Gradient Descent (SGD) thus attempts to minimize a sampled objective function, F_{S_n}, by updating the previous iterate with a scaled sampled gradient,

\[ x_n = x_{n-1} - \alpha_n \nabla F_{S_n}(x_{n-1}). \tag{5} \]

As with any computation, a call to a stochastic oracle at each iteration comes with a certain computational cost.
The particular cost may depend on a number of factors, including the sample size, |S_n|. This is one of the reasons why the study of SGD is of such importance in modern machine learning, where training data sets may be of an enormous size, meaning that computing a gradient based on all available data at each iteration could be very costly. Rather, individual samples (|S_n| = 1), or batches of samples (|S_n| > 1), may be used instead at each iteration. While this degrades the quality of any individual gradient estimate compared to using all available data, if the variance of these estimates is maintained below an acceptable value, the overall tradeoff may be net positive. What this means in a Monte Carlo radiative transport context is that we may be able to allow low quality gradient estimations (simulating only a small number of photons) for a large part of the inversion process when estimating optical properties, saving on per-iteration computational resources, leading to an overall efficiency improvement. This is in contrast to typical implementations of iterative Monte Carlo solvers in the biomedical optics community, where each iteration is computed with large numbers of photons that are deemed sufficient to produce "stable" and "smooth" (low variance) forward model data [18–24]. In some cases, where a linearised approximation is assumed for the inverse problem, the cost of rerunning the forward model can be avoided using techniques such as Perturbation Monte Carlo (PMC) methods [11, 25, 26]. However, for the full non-linear problem, although PMC can be used for calculation of the problem Jacobian, this has to be recomputed at each iteration of, for example, a Gauss–Newton optimisation scheme [27].

If in this study we are to accept a level of variance and imperfection in our forward/adjoint models, this of course raises the question of how much variance is acceptable in order for SGD to be successful. Furthermore, what measure of the variance is the best indicator in terms of efficiency/performance for common Monte Carlo solvers? To begin to answer this, it is important to first note that fixed-step SGD does not in general converge to a solution [28, 29].
That is, if α_n is fixed for all n, eventually there will come a point where the next update of the estimate (with the term α_n ∇F_{S_n}(x_{n−1})) will reliably "undo" the work of the prior step, which will effectively halt the descent. The point at which this occurs depends on the variance of ∇F_{S_n}. We can see this by re-writing the sampled stochastic gradient estimate as

\[ \nabla F_{S_n}(x_n) = \nabla F(x_n) + \epsilon_{S_n}(x_n), \tag{6} \]

where ε is a random vector with E[ε_{S_n}(x_n)] = 0 for all n. As gradient descent progresses successfully, the "true" gradient ∇F will eventually begin reducing in size as we near the minimum. Once the magnitude of the true gradient reduces to a point at which it is comparable to the randomness of ε_{S_n}, the problem arises. The larger the expected magnitude of ε_{S_n}, the sooner the minimization of the cost function reaches this limiting scenario, where further iterations will only lead to a random walk about this point.

To prevent this from happening, we may take one of two actions (or a combination thereof): i) reduce the step size at each iteration such that we can avoid "backtracking" in the descent, more on this in Section 2.3; or ii) gradually improve the accuracy of our sampled gradient such that the variance of the sampled gradient remains below some threshold value compared to the norm of the true gradient ∇F. In other words, we may wish to ensure the inequality

\[ V(x_n) := \frac{\mathbb{E}\left[ \left\| \epsilon_{S_n}(x_n) \right\|^2 \right]}{\left\| \nabla F(x_n) \right\|^2} \leq \gamma_{\mathrm{tot}}, \qquad \gamma_{\mathrm{tot}} > 0, \tag{7} \]

where γ_tot is a positive coefficient describing the acceptable threshold. The above inequality is known as the "norm test" [30]. Note that, since for any vector of random variables the variance of its length is the sum of the variances parallel and orthogonal to any fixed vector, this test equally penalizes the components of randomness parallel and perpendicular to the true gradient.
Recent studies however have demonstrated that controlling the component of randomness parallel to ∇F is potentially a more relevant objective, as the component of the sampled gradient orthogonal to the true gradient is zero in expectation. An alternative measure of acceptable variance in ∇F_{S_n} has thus been introduced as the "inner product test" [30], which only aims to restrict the component of variance in the sampled gradient parallel to the true gradient ∇F,

\[ V_{\parallel}(x_n) := \frac{\mathbb{E}\left[ \left\langle \epsilon_{S_n}(x_n), \nabla F(x_n) \right\rangle^2 \right]}{\left\| \nabla F(x_n) \right\|^4} \leq \gamma_{\parallel}, \qquad \gamma_{\parallel} > 0. \tag{8} \]

This inner product test imposes a less restrictive limitation on the overall variance in the sampled gradients, particularly in cases where the variance may be higher in directions orthogonal to the true gradient than in the direction parallel to ∇F. Either of these metrics however will be able to exploit the fact that an increased expected error, E[‖ε_{S_n}‖], will correlate with a cheaper computation of the estimated gradient. Thus, setting larger values of γ_tot or γ_∥ in the inequalities will correspond to cheaper computational requirements for each step, but also a more pronounced random walk component to the gradient descent. In many cases, it may be found that the penalty paid by increasing the random walk component is acceptable (up to a point) compared to the penalty paid in computational cost for reducing the expected norm of ε to a negligible value. For example, using Monte Carlo RTE simulations to compute ∇F with a negligible level of variance (i.e. setting γ_tot ≪ 1) may take billions of simulated photons at each step. Whereas, it may be possible to compute a gradient that passes the norm test or inner product test with larger values of γ_tot or γ_∥ with many orders of magnitude fewer photons, particularly during the early stages of gradient descent, where we may be far from the minimum. The ideal choice of γ_tot or γ_∥ will depend on the specific application.

We have discussed two different measures of the variance in the sampled gradient ∇F_{S_n} that we wish to investigate in the context of Monte Carlo estimation of media properties, viz. the norm test Eq. (7), and the inner product test Eq. (8). In order to satisfy the inequalities defining these tests as the gradient descent progresses, we will be required to reduce the variance in the sampled gradients ∇F_{S_n} whenever the norm test or inner product test fail. This can be done by increasing the sample size (number of photons used, |S_n|) when making a call to the stochastic oracle. Two practical considerations are still required: first, how to compute the "true" gradient ∇F, which is needed to evaluate the norm test and inner product test; and second, by how much should we increase the sample size in a situation where one of the tests fails?

The true gradient ∇F is only calculable in the limit that an infinite number of photons are used in the Monte Carlo model. This limit can equivalently be represented as an average over independent repeated outputs of the sampled gradient,

\[ \nabla F(x_n) = \lim_{N_{\mathrm{rep}} \to \infty} \frac{1}{N_{\mathrm{rep}}} \sum_{j=1}^{N_{\mathrm{rep}}} \nabla F_{S_j}(x_n), \qquad |S_j| = |S_n| \;\; \forall j. \tag{9} \]

Using a finite value of N_rep in the evaluation of Eq. (9) provides an approximation to the true gradient, and when this is used to compute the norm and inner product tests (Eqs. 7 & 8) the inequalities will fail sooner than they would if N_rep = ∞, thus acting as a conservative approximation. It is noted that this method of approximating the true gradient is computationally taxing.
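The two tests can be evaluated directly from a stack of repeated sampled gradients, using their mean as the finite-N_rep proxy for the true gradient of Eq. (9). The sketch below is a minimal illustration with a hypothetical oracle (a known gradient plus Gaussian noise); function and variable names are our own.

```python
import numpy as np

def variance_tests(sampled_grads, gamma_tot, gamma_par):
    """Evaluate the norm test (Eq. 7) and inner product test (Eq. 8) from a stack
    of N_rep independently sampled gradients, whose mean serves as the finite-N_rep
    proxy for the true gradient, as in Eq. (9)."""
    g = sampled_grads.mean(axis=0)                   # approximate true gradient
    eps = sampled_grads - g                          # noise vectors epsilon_{S_j}
    g2 = np.dot(g, g)
    V_tot = np.mean(np.sum(eps**2, axis=1)) / g2     # norm test statistic, Eq. (7)
    V_par = np.mean((eps @ g) ** 2) / g2**2          # inner product statistic, Eq. (8)
    return V_tot <= gamma_tot, V_par <= gamma_par

# Hypothetical oracle: noisy observations of a known gradient, N_rep = 100.
rng = np.random.default_rng(1)
true_grad = np.array([1.0, 0.5, -0.2])
grads = true_grad + 0.1 * rng.normal(size=(100, 3))
tests_pass = variance_tests(grads, gamma_tot=0.5, gamma_par=0.5)
print(tests_pass)
```

Note that the inner product statistic is insensitive to noise orthogonal to the mean gradient, which is why it admits larger sample variances than the norm test for the same threshold.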
In practice however, the inner product and norm tests can still be conducted efficiently if they are only computed occasionally (not at every iteration) during the descent. For example, using N_rep = 100 repeated computations of the sampled gradient to conduct the tests once every 100 iterations (thus only updating our sample size every 100 iterations) would only double the total number of simulated photons required for the inversion. In this study we evaluate these metrics once every 10 iterations, using N_rep = 100 repeated sampled gradients. While this is a significant computational burden, we do so in this study as we are interested in assessing the best case scenarios for such methods. Note that although we compute the above approximation to the true gradient for the purposes of evaluating the inner product and norm tests, we only ever update our estimate using the sampled gradient.

In terms of increasing the sample size in the event that the inner product and/or norm tests fail, this can be done in a number of ways. A simple method we will employ in this study is to scale the current sample size by some factor κ(n), to increase the number of photons used in the next iteration,

\[ |S_{n+1}| = \kappa(n) |S_n|. \tag{10} \]

One option for κ(n) is to use the same factor by which the variance exceeds our imposed limit at a given point in the descent. For instance, upon failure of the inner product test for a chosen value of γ_∥, we can increase the sample size on the next iteration using κ(n) = V_∥(x_n)/γ_∥. However, we also investigate other forms of κ(n) in Section 4, which better cope with statistical variations that can lead to over-estimating the required sample size increase.

In cases where we are not taking actions to bound the error in the sampled gradient (such as enforcing successful outcomes of an inner product test or norm test), fixed step SGD may only converge to a region around the solution.
Reducing the step size sufficiently at each step is usually required to allow convergence [31]. However, it can be shown that if we are bounding the error in the sampled gradient, e.g. by increasing the sample size, then fixed step SGD may converge so long as the following is satisfied for all n [30]:

\[ \alpha_n \leq \frac{1}{(1 + \gamma) L}, \tag{11} \]

where L is the Lipschitz constant for F and γ is the threshold used in the relevant variance test. As intuition may indicate, when the sample size (e.g. number of simulated photons) increases towards the maximum number of samples, |S_n| → |S_max| (|S_max| = ∞ in the case of Monte Carlo RTE simulations), the expected error in the sampled gradient approaches zero, |ε_{S_n}| → 0, as do the measures of variance in the sampled gradients (V → 0, V_∥ → 0), and the bound recovers the classical deterministic step size α = 1/L [32].

In this study we will aim to satisfy the above step size criterion for an assumed value of the Lipschitz constant L, which we will choose conservatively depending on the particular scenario.

The Lipschitz constant for a functional F is a measure of its rate of change with respect to its parameter and can be defined for example as the smallest constant such that ∇²F ⪯ L Id, where Id is the identity matrix, and we assume that F is twice continuously differentiable. It can also be interpreted as the largest eigenvalue of the Hessian of F [32].

The inversion proceeds until a prescribed total photon budget, N_ph, is exhausted.

Algorithm 1
Inversion using Monte Carlo sampled gradients with adaptive sample size

Choose initial photon sample size |S_1|, and desired value of γ_∥ or γ_tot
while Σ_{i=1}^{n} |S_i| < N_ph do
    if run test? then
        compute sampled gradient, ∇F_{S_n}, and approximate true gradient, ∇F (using Eq. (9))
        check norm test (or) inner product test is satisfied
        if test fails then
            increase sample size on next iteration, |S_{n+1}| = κ(n)|S_n|
        else
            set |S_{n+1}| = |S_n|
        end if
    else
        compute sampled gradient only, ∇F_{S_n}
        set |S_{n+1}| = |S_n|
    end if
    update x_{n+1} = x_n − α_n ∇F_{S_n}
end while

In this section we cover the computation of the stochastic forward model and stochastic gradient approximation, referred to above as the first-order stochastic oracle. We will cover the basic radiative transport forward problem, as well as the gradient computations involved in our example problems of absorption estimation in Quantitative Photoacoustic Tomography (QPAT) and Ultrasound Modulated Optical Tomography (UMOT). The specific details of these models are not required to understand the main premise of this paper, but serve as a demonstration in a context familiar to many in the biomedical optics community, where Monte Carlo models of optical transport are employed to estimate medium properties.
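Before detailing the Monte Carlo models, Algorithm 1 can be exercised end-to-end on a toy problem. In the sketch below the "stochastic oracle" is hypothetical (gradient noise shrinking as 1/√|S_n|, mimicking photon statistics), and all constants are illustrative rather than those used in our experiments.

```python
import numpy as np

# Runnable sketch of Algorithm 1 on a toy quadratic, F(x) = 0.5*||x - x_star||^2,
# with a hypothetical stochastic oracle whose gradient noise scales as 1/sqrt(|S_n|).
rng = np.random.default_rng(2)
x_star = np.array([1.0, -2.0, 0.5])

def sampled_grad(x, sample_size):
    """Unbiased noisy gradient of F; variance scales as 1/|S_n|."""
    return (x - x_star) + rng.normal(size=3) / np.sqrt(sample_size)

x = np.zeros(3)
S, N_ph, used, n = 16, 200_000, 0, 0      # initial |S_n| and total photon budget N_ph
gamma_par = 0.25                          # inner product test threshold
alpha = 1.0 / ((1.0 + gamma_par) * 1.0)   # step size of Eq. (11), with L = 1 here
while used + S <= N_ph:
    if n % 10 == 0:                                  # run the test only occasionally
        reps = np.stack([sampled_grad(x, S) for _ in range(100)])  # N_rep = 100
        g = reps.mean(axis=0)                        # proxy for grad F, Eq. (9)
        V_par = np.mean(((reps - g) @ g) ** 2) / np.dot(g, g) ** 2
        if V_par > gamma_par:                        # inner product test failed
            S = int(np.ceil(S * V_par / gamma_par))  # kappa(n) = V_par / gamma_par
    x = x - alpha * sampled_grad(x, S)               # SGD update, Eq. (5)
    used += S
    n += 1
print(S, np.linalg.norm(x - x_star))
```

The sample size stays small while the iterate is far from the minimum and grows only once the sampled-gradient noise becomes comparable to the true gradient, which is the behaviour the adaptive tests are designed to produce.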
For any optical source Q(r, ŝ), either incident on a medium or present within it, we wish to model the resulting radiance, φ(r, ŝ), describing the radiant flux at each position r, and in each direction ŝ. This can be achieved using the Radiative Transport Equation (RTE):

\[ \underbrace{\left( \hat{s} \cdot \nabla + \mu_a(\mathbf{r}) + \mu_s(\mathbf{r}) \right)}_{T_{\mu_a,\mu_s}} \phi(\mathbf{r}, \hat{s}) = \underbrace{\mu_s(\mathbf{r}) \int_{S^{n-1}} p(\hat{s}, \hat{s}') \, \phi(\mathbf{r}, \hat{s}') \, d\hat{s}'}_{S_{\mu_s} \phi} + Q(\mathbf{r}, \hat{s}), \tag{12} \]

where we denote T and S as the attenuation and scattering operators, which together compose the RTE operator L. Here, μ_a is the absorption coefficient, μ_s is the scattering coefficient, and p is the scattering phase function. Using the defined operators, Eq. (12) can be rewritten in a more compact form

\[ \mathcal{L}_{\mu_a,\mu_s} \phi = \left( T_{\mu_a,\mu_s} - S_{\mu_s} \right) \phi = Q. \tag{13} \]

In order to obtain (stochastic) estimates of the radiance resulting from a given source, and thus to obtain an estimate of any derived data function y(φ), we can implement a Monte Carlo solver, L⁻¹_MC. In this study we have adapted a GPU-accelerated version of the commonly employed "Monte Carlo Multi-Layer" (MCML) program used to simulate radiative transport within a layered planar medium [34, 35]. The basic operation of this program is unchanged from the original release. Simulated photons are initiated by sampling from a given source function, Q, and scattering/absorption events are pseudo-randomly generated along each photon's trajectory until either: i) the photon leaves the domain, or; ii) the photon drops its weight below some threshold value. The scattering directions are sampled from the Henyey–Greenstein scattering phase function in this study. The expected accuracy of the computed radiance using such Monte Carlo solvers L⁻¹_MC depends on the total number of photons used, i.e. the sample size |S_n|. As |S_n| → ∞, the radiance approaches the deterministic solution of the RTE.
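The essential mechanics of such a solver (free-path sampling, weight deposition, Henyey–Greenstein scattering, and roulette termination) can be sketched in a few lines. The example below is not MCML itself but a minimal, assumed-parameter illustration: a pencil beam enters a homogeneous semi-infinite medium with an index-matched boundary, and we tally the fraction of launched energy absorbed; since the medium is laterally homogeneous, only the depth and the z-direction cosine need be tracked.

```python
import numpy as np

# Minimal MCML-style photon random walk in a homogeneous semi-infinite medium.
# Illustrative parameters; index-matched boundary (no Fresnel reflection) for brevity.
rng = np.random.default_rng(3)
mu_a, mu_s, g = 0.1, 10.0, 0.9          # absorption, scattering (mm^-1), HG anisotropy
mu_t = mu_a + mu_s

def hg_cosine(g):
    """Sample the deflection cosine from the Henyey-Greenstein phase function."""
    if g == 0.0:
        return 2.0 * rng.random() - 1.0
    tmp = (1.0 - g * g) / (1.0 - g + 2.0 * g * rng.random())
    return (1.0 + g * g - tmp * tmp) / (2.0 * g)

absorbed, n_photons = 0.0, 2000
for _ in range(n_photons):
    z, uz, w = 0.0, 1.0, 1.0            # depth, direction cosine, photon weight
    while True:
        z += uz * (-np.log(rng.random()) / mu_t)    # free path to next interaction
        if z < 0.0:                                 # photon escaped the surface
            break
        absorbed += w * mu_a / mu_t                 # deposit the absorbed share
        w *= mu_s / mu_t
        if w < 1e-2:                                # Russian roulette termination
            if rng.random() < 0.1:
                w *= 10.0                           # survivor keeps the walk unbiased
            else:
                break
        ct, phi = hg_cosine(g), 2.0 * np.pi * rng.random()
        st = np.sqrt(max(0.0, 1.0 - ct * ct))
        # only the z-direction cosine matters for depth statistics here
        uz = uz * ct + st * np.cos(phi) * np.sqrt(max(0.0, 1.0 - uz * uz))
print(absorbed / n_photons)             # estimated absorbed fraction
```

Repeating this with larger n_photons shrinks the estimator's standard error as 1/√n_photons at proportionally greater cost, which is precisely the accuracy-versus-expense trade-off exploited in this work.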
Importantly however, Monte Carlo models allow an estimate of the radiance to be achieved with any number of photons, |S_n| ≥ 1. The expected computational requirements (number of floating point operations) of the Monte Carlo solver L⁻¹_MC also scale with the number of photons simulated, and it is this trade-off between accuracy of the forward model (and corresponding adjoint model) and computational cost that we will be investigating.

To compute the gradient of our cost function ∇F with respect to the optical properties of the medium, we make use of an adjoint RTE model. Although direct methods of finding the derivative of a Monte Carlo method can also be developed [12], adjoint methods have more applicability in general, and also allow closer comparison with optimisation techniques used in machine learning. For further details of forward and adjoint methods in the RTE we refer to [36]; for specific details of coupled physics imaging problems we refer to [37]. We first consider a change to Eq. (12) where μ_a → μ_a + μ_a^δ, μ_s → μ_s + μ_s^δ, for the same source Q, which results in a change in radiance φ → φ + φ^δ. This implies

\[ \left( T_{\mu_a + \mu_a^\delta, \mu_s + \mu_s^\delta} - S_{\mu_s + \mu_s^\delta} \right) \left( \phi + \phi^\delta \right) = \left( T_{\mu_a,\mu_s} - S_{\mu_s} \right) \phi \;\; \Rightarrow \;\; \left( T_{\mu_a,\mu_s} - S_{\mu_s} \right) \phi^\delta = -\left( \mu_a^\delta + \mu_s^\delta - S_{\mu_s^\delta} \right) \phi \tag{14} \]

\[ \mathcal{L}_{\mu_a,\mu_s} \phi^\delta = -\underbrace{\left( \mu_a^\delta + \mu_s^\delta - S_{\mu_s^\delta} \right)}_{\mathcal{L}^\delta_{\mu_a^\delta,\mu_s^\delta}} \phi. \tag{15} \]

We also define the fluence, Φ, as the angular integral of the radiance,

\[ \Phi(\mathbf{r}) = \int_{S^{n-1}} \phi(\mathbf{r}, \hat{s}) \, d\hat{s}. \tag{16} \]

To proceed beyond this point, we must now consider the specific form of the data function relevant to a particular modality of interest. We begin with the first of our two example modalities, Quantitative Photoacoustic Tomography (QPAT).

In QPAT the medium is illuminated with a pulsed optical source, Q (see Fig. 1). The distributed optical energy is absorbed at various points within the sample, giving rise to internal acoustic waves.
(12) is combined with appropriate boundary conditions which wedo not write explicitly here; see [33] for more details. edium Acoustic waveAcoustic Sensor/ TransducerOptical source
Figure 1: Setup for Quantitative Photoacoustic Tomography.
These acoustic waves can be detected at the surface of the medium by a sensor, and processed to locate the initial pressure distribution p_0 within the medium [38–40]. This internal pressure distribution is related to the spatial distribution of absorbed optical energy, h, where

\[ h(\mathbf{r}) = \mu_a(\mathbf{r}) \Phi(\mathbf{r}), \tag{17} \]

and where Φ is the optical fluence of Eq. (16). Assuming that we can recover the absorbed optical energy, h, the problem remains to find the distribution of μ_a(r) within the medium [41, 42]. Note that although the optical source is pulsed, it is acceptable to use a continuous wave (time-independent) model to describe φ and Φ because the time scale of the acoustic wave propagation is orders of magnitude slower than the optical propagation [43]. First, restating our cost function in terms of the QPAT data function, h, we have

\[ F_{\mathrm{QPAT}} = \frac{1}{2} \int_\Omega (h_{\mathrm{obs}} - h)^2 \, d\mathbf{r} = \frac{1}{2} \left\langle h_{\mathrm{obs}} - h, \, h_{\mathrm{obs}} - h \right\rangle_{L^2(\Omega)}. \tag{18} \]

We then write the Fréchet derivative of F_QPAT as

\[ DF_{\mathrm{QPAT}} = -\left\langle h_{\mathrm{obs}} - h, \, Dh \, \mu_a^\delta \right\rangle_{L^2(\Omega)}, \tag{19} \]

where μ_a^δ is a small change in absorption. In this paper we will neglect changes in scattering, however the below formalism is still general for the gradient with respect to absorption. The gradient term with respect to scattering coefficient is described for example in [42], and will be included in future investigations. Writing the Fréchet derivative of h as

\[ Dh = \Phi + \mu_a \cdot D\Phi, \tag{20} \]

and defining Φ^δ = DΦ μ_a^δ, we arrive at

\[ DF_{\mathrm{QPAT}} = -\left\langle \Phi (h_{\mathrm{obs}} - h), \, \mu_a^\delta \right\rangle_{L^2(\Omega)} - \left\langle \mu_a (h_{\mathrm{obs}} - h), \, \Phi^\delta \right\rangle_{L^2(\Omega)}. \tag{21} \]

Next, we define the adjoint radiance, φ*, as the solution to

\[ \mathcal{L}^* \phi^* = \mu_a (h_{\mathrm{obs}} - h). \tag{22} \]

We have omitted the Grüneisen parameter for clarity of exposition, though this parameter can be included in practice.

We then substitute the above into Eq.
(21) to give

\[ DF_{\mathrm{QPAT}} = -\left\langle \Phi (h_{\mathrm{obs}} - h), \, \mu_a^\delta \right\rangle_{L^2(\Omega)} - \left\langle \mathcal{L}^* \phi^*, \, \phi^\delta \right\rangle_{L^2(\Omega \times S^{n-1})}, \tag{23} \]

where we exploited the fact that the right hand side of Eq. (22) does not depend on direction. Using the definition of the adjoint operator, and the fact that the change in radiance is zero on the boundary ∂Ω, yields

\[ DF_{\mathrm{QPAT}} = -\left\langle \Phi (h_{\mathrm{obs}} - h), \, \mu_a^\delta \right\rangle_{L^2(\Omega)} - \left\langle \phi^*, \, \mathcal{L} \phi^\delta \right\rangle_{L^2(\Omega \times S^{n-1})}. \tag{24} \]

Finally we make use of the perturbation expression Eq. (15), whilst again here we neglect any change in scattering. This gives

\[ DF_{\mathrm{QPAT}} = -\left\langle \Phi (h_{\mathrm{obs}} - h), \, \mu_a^\delta \right\rangle_{L^2(\Omega)} + \left\langle \phi^* \phi, \, \mu_a^\delta \right\rangle_{L^2(\Omega \times S^{n-1})}, \tag{25} \]

allowing us to define the (absorption) gradient as in Eq. (33) of [42]:

\[ \frac{\partial F_{\mathrm{QPAT}}}{\partial \mu_a} = \nabla F_{\mathrm{QPAT}} = -\Phi (h_{\mathrm{obs}} - h) + \int_{S^{n-1}} \phi^* \phi \, d\hat{s}. \tag{26} \]

To compute a stochastic approximation of this gradient, we can thus use the forward model Monte Carlo solver L⁻¹_MC to provide estimates of φ and Φ, and an adjoint Monte Carlo solver L⁻¹*_MC to produce φ* from an adjoint source term Q_adj = μ_a(h_obs − h), as defined in Eq. (22). Due to the symmetry of the problem, the adjoint solver is identical to the forward solver, and follows the same basic operating principles. The only difference is that here the adjoint source Q_adj = μ_a(h_obs − h) may in fact be negative in some locations. This is handled by splitting the source term into two parts, one purely positive, Q⁺_adj, and one purely negative, Q⁻_adj. Two simulations are then run (where the total number of photons to be used is split between the two simulations accordingly), and the results summed to produce φ*. The following algorithm describes the basic operation for computing a sampled gradient, ∇F_{S_n}, for QPAT using the above derivation. This will be used in conjunction with Algorithm 1 to conduct an inversion with adaptive sample size for each iterate, |S_n|.

Algorithm 2
Monte Carlo sampled QPAT gradient
1) Compute L⁻¹_MC Q ↦ φ, Φ, using |S_n|/2 photons.
2) Compute the adjoint source Q_adj = μ_a(h_obs − h).
3) Compute L⁻*_MC Q_adj ↦ φ*, Φ*, using |S_n|/2 photons.
4) Compute ∇F_{S_n} via Eq. (26).

Figure 2: Setup for Ultrasound-Modulated Optical Tomography in the transmission geometry. [Figure: medium with optical source, ultrasound source, ultrasound focus, and optical detector.]

Referring to Figure 2, in Ultrasound Modulated Optical Tomography we have an optical light source Q_q incident on a medium, as well as an optical detector J_m. In addition, an ultrasound source is incident on the medium, whose focus η(r) is scanned through the sample [44, 45]. Assuming for simplicity an ideal (delta-function) ultrasound focus, the data of interest in this case is found to be of the form [46]

\[ b(\mathbf{r}) = \eta(\mathbf{r})\,\Phi_q(\mathbf{r})\,\Phi_m(\mathbf{r}), \quad (27) \]

where Φ_q is the fluence resulting from the optical source Q_q, and Φ_m is the fluence resulting from a virtual source Q_m which is reciprocal to the detector J_m [46]. From this point we proceed in a similar fashion to Section 3.2.1, where now our data-fitting error is given by

\[ F_{\mathrm{UMOT}} = \frac{1}{2}\int_\Omega (b_{\mathrm{obs}} - b)^2\,\mathrm{d}\mathbf{r} = \frac{1}{2}\left\langle b_{\mathrm{obs}} - b,\; b_{\mathrm{obs}} - b \right\rangle_{L^2(\Omega)}, \quad (28) \]

and its Fréchet derivative as

\[ DF_{\mathrm{UMOT}} = -\left\langle b_{\mathrm{obs}} - b,\; Db\,\mu_a^\delta \right\rangle_{L^2(\Omega)}. \quad (29) \]

In this case the Fréchet derivative of b becomes

\[ Db = \eta\,\Phi_q \cdot D\Phi_m + \eta\,\Phi_m \cdot D\Phi_q, \quad (30) \]

leading to

\[ DF_{\mathrm{UMOT}} = -\left\langle \eta\,\Phi_q(b_{\mathrm{obs}} - b),\; \Phi_m^\delta \right\rangle_{L^2(\Omega)} - \left\langle \eta\,\Phi_m(b_{\mathrm{obs}} - b),\; \Phi_q^\delta \right\rangle_{L^2(\Omega)}. \quad (31) \]

Here we need to define two adjoint radiances, φ*,1 and φ*,2, as the solutions to

\[ \mathcal{L}^* \phi^{*,1} = \eta\,\Phi_q(b_{\mathrm{obs}} - b), \quad (32) \]
\[ \mathcal{L}^* \phi^{*,2} = \eta\,\Phi_m(b_{\mathrm{obs}} - b), \quad (33) \]

and substituting into Eq. (31) gives

\[ DF_{\mathrm{UMOT}} = -\left\langle \mathcal{L}^*\phi^{*,1},\; \phi_m^\delta \right\rangle_{L^2(\Omega\times S^{n-1})} - \left\langle \mathcal{L}^*\phi^{*,2},\; \phi_q^\delta \right\rangle_{L^2(\Omega\times S^{n-1})}; \quad (34) \]

by the same arguments as for QPAT we get

\[ DF_{\mathrm{UMOT}} = -\left\langle \phi^{*,1},\; \mathcal{L}\phi_m^\delta \right\rangle_{L^2(\Omega\times S^{n-1})} - \left\langle \phi^{*,2},\; \mathcal{L}\phi_q^\delta \right\rangle_{L^2(\Omega\times S^{n-1})}. \quad (35) \]

Again using the perturbation expression Eq. (15), we have

\[ DF_{\mathrm{UMOT}} = \left\langle \phi^{*,1}\phi_m,\; \mu_a^\delta \right\rangle_{L^2(\Omega\times S^{n-1})} + \left\langle \phi^{*,2}\phi_q,\; \mu_a^\delta \right\rangle_{L^2(\Omega\times S^{n-1})}, \quad (36) \]

allowing us to define the (absorption) gradient as

\[ \frac{\partial F_{\mathrm{UMOT}}}{\partial \mu_a} = \nabla F_{\mathrm{UMOT}} = \int_{S^{n-1}} \left( \phi^{*,1}\phi_m + \phi^{*,2}\phi_q \right) \mathrm{d}\hat{\mathbf{s}}. \quad (37) \]

Thus, similar to the QPAT case, here we are able to compute a stochastic approximation of this gradient using the forward Monte Carlo solver L⁻¹_MC to provide φ_q and φ_m from our two sources, and an adjoint Monte Carlo solver L⁻*_MC to produce φ*,1 and φ*,2 from the adjoint source terms Q₁ = ηΦ_q(b_obs − b) and Q₂ = ηΦ_m(b_obs − b), as defined in Eqs. (32)–(33). Here as well, the adjoint source terms are split into two parts, one purely positive, Q⁺_adj, and one purely negative, Q⁻_adj, with the photon budget being split accordingly. The following algorithm describes the basic operation for computing a sampled gradient, ∇F_{S_n}, for UMOT using the above derivation. This will be used in conjunction with Algorithm 1 to conduct an inversion with adaptive sample size for each iterate, |S_n|.

Algorithm 3
Monte Carlo sampled UMOT gradient
1) Compute L⁻¹_MC Q_q ↦ φ_q, Φ_q, and L⁻¹_MC Q_m ↦ φ_m, Φ_m, each using |S_n|/4 photons.
2) Compute the adjoint sources Q₁ = ηΦ_q(b_obs − b) and Q₂ = ηΦ_m(b_obs − b).
3) Compute L⁻*_MC Q₁ ↦ φ*,1, Φ*,1, and L⁻*_MC Q₂ ↦ φ*,2, Φ*,2, each using |S_n|/4 photons.
4) Compute ∇F_{S_n} via Eq. (37).

It should be noted that numerous Monte Carlo radiative transport solvers do not explicitly output the radiance, as this requires additional programming to store the angular ordinates at each location. Commonly, only the fluence will be available, which is the angular integral of the radiance, Eq. (16). In such cases, the above integrals for the gradients of interest, Eqs. (26) and (37), can be computed under the assumption of approximately angularly isotropic radiances, where for example ∫ φ*φ dŝ becomes Φ*Φ. The accuracy of this approximation of course depends on the true angular dependence of the radiances: the approximation is poorest in regions close to directional light sources, but improves further away. The higher the scattering asymmetry g of the medium, the more slowly the approximation improves as a function of distance from these sources. In many cases, however, this is a satisfactory assumption, and it is employed in the example cases below.

In this section we present the results of a number of investigations using our two example problems of QPAT and UMOT. We demonstrate the implementation of the forward-adjoint Monte Carlo solvers described above, along with adaptive sampling strategies to estimate the absorption coefficient of a medium via SGD. Here we investigate media with a semi-infinite slab geometry, with numerous layers in the z-direction having different optical properties, but otherwise homogeneous in the x and y directions. The layered geometry was chosen for this demonstration to provide a simple, easily recognizable setting in which to test these adaptive sampling methods.
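As a concrete illustration of the sampled-gradient computation of Algorithm 2, the following is a minimal sketch, not the authors' implementation: `mc_solve` is a hypothetical Monte Carlo solver returning a fluence estimate on a voxel grid from a given source and photon count (serving as both forward and adjoint solver, by the symmetry noted above), and the angular integral in Eq. (26) is replaced by Φ*Φ under the isotropic-radiance approximation just discussed.

```python
import numpy as np

def qpat_sampled_gradient(mc_solve, Q, mu_a, h_obs, n_photons):
    """Sketch of Algorithm 2: sampled QPAT absorption gradient, Eq. (26),
    with the angular integral replaced by Phi_adj * Phi (isotropic
    approximation).  `mc_solve(source, n)` is a hypothetical MC solver
    returning a fluence estimate computed from n photons."""
    # 1) Forward run with half the per-iteration photon budget |S_n|.
    Phi = mc_solve(Q, n_photons // 2)
    h = mu_a * Phi                        # data function, Eq. (17)
    residual = h_obs - h
    # 2) Adjoint source, Eq. (22); it may be negative in places.
    Q_adj = mu_a * residual
    Q_pos = np.maximum(Q_adj, 0.0)
    Q_neg = np.maximum(-Q_adj, 0.0)
    # 3) Split the remaining budget between the +/- parts in proportion
    #    to their total source weight, run each, and sum the results.
    n_adj = n_photons - n_photons // 2
    total = Q_pos.sum() + Q_neg.sum()
    w_pos = Q_pos.sum() / total if total > 0 else 0.0
    n_pos = int(round(n_adj * w_pos))
    Phi_adj = np.zeros_like(Phi)
    if Q_pos.sum() > 0 and n_pos > 0:
        Phi_adj += mc_solve(Q_pos, n_pos)
    if Q_neg.sum() > 0 and n_adj - n_pos > 0:
        Phi_adj -= mc_solve(Q_neg, n_adj - n_pos)
    # 4) Gradient of Eq. (26) under the isotropic approximation.
    return -Phi * residual + Phi_adj * Phi
```

A simple sanity check: when the modelled data match the observations exactly, the adjoint source vanishes and the gradient reduces to zero.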
Furthermore, whilst apparently simplistic, layered geometries are still of practical interest for applications including instrument calibration and validation, and the imaging of biological structures with small curvature but significant heterogeneity in depth. The latter includes studies such as functional (cognitive) imaging localised to small activation regions. Application of these new methods in more complicated 3D geometries will be carried out in future work. Each of the medium layers can be described in terms of its thickness, scattering coefficient, absorption coefficient, (background) refractive index, and scattering asymmetry parameter. We will assume all parameters of the layered medium are known a priori with the exception of the absorption coefficient, which we attempt to solve for. For the examples in this study we set the total slab thickness to 2 cm, and the inversion is conducted with a resolution of 0.[…] cm; the measured data were generated using 10^[…] photons. With this sample size, the variance of the measured forward data h_obs, b_obs is found to be negligible in this setup, and as such it can be treated as effectively equivalent to the deterministic solution of the RTE.

To conduct an inversion, we stipulate a total photon budget, N_ph, which the combined sample sizes used in the descent must not exceed, i.e. Σ_n |S_n| ≤ N_ph. Once the total photon budget is expended we terminate the descent. This emulates an imposed restriction on the computational resources required to reach a solution. Whilst each iteration (involving forward and adjoint runs of the Monte Carlo) has a non-zero computational overhead, optimization of these Monte Carlo programs for repeated iteration (such as employed in [23]) allows this overhead to become negligibly small. This means that the required computational resources of the inversion (and therefore the required computation time) are proportional to the total number of simulated photons used throughout the descent, i.e. the photon budget N_ph.
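The budget-limited termination rule above can be sketched as a simple driver loop. This is a hedged sketch rather than Algorithm 1 itself (whose variance tests are omitted here); `sampled_gradient(mu, n)` stands for a hypothetical Monte Carlo gradient estimator such as Algorithm 2 or 3.

```python
def budget_limited_descent(sampled_gradient, mu0, step_size,
                           sample_size0, n_ph_total):
    """Run a stochastic descent until the cumulative photon expenditure
    would exceed the total budget N_ph (i.e. sum_n |S_n| <= N_ph).
    `sampled_gradient(mu, n)` is a hypothetical estimator using n photons."""
    mu = list(mu0)
    sample_size = sample_size0
    photons_used = 0
    n_iters = 0
    while photons_used + sample_size <= n_ph_total:
        g = sampled_gradient(mu, sample_size)
        mu = [m - step_size * gi for m, gi in zip(mu, g)]
        photons_used += sample_size
        n_iters += 1
        # Algorithm 1 would evaluate the norm / inner product tests here
        # and possibly increase `sample_size`; omitted in this sketch.
    return mu, n_iters, photons_used
```

With a fixed sample size this caps the iteration count at N_ph/|S₀|; an adaptive sample size instead trades later, more expensive iterations against many cheap early ones.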
The inversions are carried out using Algorithm 1, along with Algorithms 2 & 3 to compute the gradients for QPAT and UMOT, respectively. In Algorithm 1, we compute the metrics V and V_∥ and conduct the norm test and inner product test once every 10 iterations to evaluate the quality of our computed gradients (using N_rep = 100 independent repeated samples of the gradient), and to update the step size and sample size. Note that, as this is an investigation of how such methods might perform in best-case scenarios, we do not count the photons used to compute these metrics against the total allowed photon budget.

Strategy   Step size, α_n                        Sample size, |S_{n+1}| = κ(n)|S_n|
1          constant: α = 1/((1 + γ_tot) L)       |S_{n+1}| = (V/γ_tot) |S_n|
2          adaptive: α_n = 1/((1 + V) L)         |S_{n+1}| = (V_∥/γ_∥) |S_n|
3          adaptive: α_n = 1/((1 + V_tot) L)     |S_{n+1}| = (V_∥/γ_∥) |S_n|

Table 1:
Table showing the different inversion strategies used. Strategy 1 has a constant step size with adaptive sample size. Strategies 2 & 3 both have adaptive step sizes and adaptive sample sizes. Note that, in accordance with Algorithm 1, the sample size is only increased upon a failure of the relevant test. If the test passes, then |S_{n+1}| = |S_n|.

There are three different strategies we have employed to control the step size and sample size as the inversion progresses; see Table 1 for a summary. Strategy 1 uses a fixed step size as described in Eq. (11) for a chosen value of γ_tot. The sample size is adaptive and attempts to enforce successful outcomes of the norm test (V ≤ γ_tot) by increasing the sample size when the norm test is violated. In the event of a violation of this inequality, the fractional increase in the sample size is equal to the factor by which the norm test fails, V/γ_tot. Strategy 2 uses an adaptive step size which still satisfies Eq. (11), but selects the largest step size possible under this criterion each time the metrics are evaluated. In this strategy, the sample size is also adaptive and attempts to enforce successful outcomes of the inner product test (V_∥ ≤ γ_∥) by increasing the sample size when the inner product test is violated. In the event of a violation of this inequality, the fractional increase in the sample size is equal to the factor by which the inner product test fails, V_∥/γ_∥. In Strategy 3, we attempt to accelerate the descent by using a larger adaptive step size with V_tot in the denominator in place of V. Upon failure of the inner product test, the sample size is increased by the fraction V_∥/γ_∥; this differs from Strategy 2 in order to reduce the speed at which the photon budget is depleted.
This is an attempt to reduce premature increases of the sample size caused by volatility in the computation of the norm and inner product metrics.

Finally, we introduce an error function for the estimated absorption distribution, μ_a:

\[ F_{\mu_a} = \frac{1}{2}\left\| \mu_a^{\mathrm{true}} - \mu_a \right\|^2. \quad (38) \]

This metric would not be available under normal circumstances (as we would not know the ground truth μ_a^true), however it is useful to monitor in terms of the underlying performance of each strategy. Furthermore, as we will see, the sampled data cost function F_{S_n} is itself heavily dependent on the number of photons (sample size) used in the forward Monte Carlo, and is thus not an ideal indicator of proximity to the true solution.
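For concreteness, the variance metrics and the sample-size update rule can be sketched as follows. The precise expressions for V and V_∥ used in the paper are not reproduced in this excerpt, so the forms below (the standard norm test and inner product test of the stochastic-optimisation literature) are assumptions.

```python
import numpy as np

def variance_metrics(grad_samples):
    """Estimate the norm-test metric V and inner-product-test metric
    V_par from N_rep independently sampled gradients (the rows of
    `grad_samples`).  Assumed forms: V is the relative variance of the
    gradient vector; V_par is the variance of its component along the
    mean direction, relative to the squared mean norm."""
    g_bar = grad_samples.mean(axis=0)
    norm2 = float(g_bar @ g_bar)
    diffs = grad_samples - g_bar
    V = float(np.mean(np.sum(diffs**2, axis=1))) / norm2
    proj = (grad_samples @ g_bar) / norm2    # components along the mean
    V_par = float(proj.var())
    return V, V_par

def updated_sample_size(sample_size, metric, gamma):
    """Increase |S| by the factor by which the test fails; otherwise
    leave it unchanged (as in Table 1)."""
    if metric > gamma:
        return int(np.ceil(sample_size * metric / gamma))
    return sample_size
```

For example, a gradient whose repeated samples are identical gives V = V_∥ = 0 and leaves the sample size untouched, while a test failing by a factor of two doubles it.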
Figure 3:
QPAT inversion: (a) - Ground truth absorption distribution, μ_a^true, and estimated absorption distribution μ_a at the point where the photon budget is expended, using each of the three strategies with the stated values of γ_tot or γ_∥. (b) - Associated measured data from the ground truth medium, and simulated forward data at the end of the inversion using each strategy.
Figure 4:
QPAT inversion: (a) - Sampled cost function, F_{S_n}, as a function of iteration, n. (b) - Error in absorption estimate, F_{μ_a}, as a function of iteration, n.

We begin with our example QPAT problem. The starting sample size in all cases shown is |S₀| = 200 photons per iteration (100 for each forward run, and 100 for each adjoint run, in accordance with Algorithm 2), and the total photon budget for the inversion was set to N_ph = 2 × 10⁶ photons. The Lipschitz constant was set at L = 2.5, as this displayed stable descent in our test problems using large photon budgets (the low-variance case). The initial estimate of the absorption distribution in the medium
Figure 5:
QPAT inversion: (a) - Step sizes, α_n, as a function of iteration, n. (b) - Adaptive sample size, |S_n|, as a function of iteration.
Figure 6:
QPAT inversion: (a) - V_∥ as a function of iteration. (b) - V_tot as a function of iteration.

is μ_a = 0.[…] cm⁻¹ in all layers. The scattering coefficient of all layers was set to μ_s = 40 cm⁻¹, and the scattering asymmetry parameter was set to g = 0.9. The ground truth absorption μ_a^true is shown in Fig. 3(a), along with the final retrieved absorption distributions obtained via Strategies 1, 2, & 3 using the stated values of γ_tot and γ_∥. Figure 3(b) shows the corresponding "measured" data, and the final forward modelled data for each of the strategies. Figure 4 shows the outcome of each strategy in terms of the sampled data cost function F_{S_n} and the absorption error function F_{μ_a}. It can be seen that the ranking of these methods in terms of the lowest achieved value of the sampled cost function F_{S_n} does not correlate directly with the best outcomes in terms of the error in the estimated absorption F_{μ_a}. This is due to the above-mentioned dependence of the sampled data cost function on the sample size used in the forward model: for example, Strategy 2 only appears to perform poorly in terms of F_{S_n} due to the small sample size it uses throughout the inversion. This is more clearly illustrated in Fig. 3(b), where the final forward modelled data from Strategy 2 is noisier than that of the other strategies due to the low sample size at the end of the inversion, and this noise clearly impacts the sampled cost function. Note that the relevant step sizes and sample sizes for each of these three examples are shown in Figure 5. Before finding the best parameter for Strategy 1, we trialled a range of values of γ_tot over the range [0.[…], […]],
Figure 7:
QPAT inversion: With an initial estimate of μ_a = 1.0 cm⁻¹: (a) - Sampled cost function, F_{S_n}, as a function of iteration, n. (b) - Error in absorption estimate, F_{μ_a}, as a function of iteration, n.

Too small a value of γ_tot caused the sample size to increase rapidly in order to maintain high-quality (low-variance) sampled gradients. This resulted in the photon budget being depleted early, terminating the descent after around 100 iterations, which did not perform well. Too large a value of γ_tot and the norm test never failed, meaning the sample size was never required to increase and the inversion progressed for the maximum 10,000 iterations permitted by the photon budget. However, as Strategy 1 has a fixed value of γ in the denominator of the step size, large values also result in step sizes which are too small to perform well. A value of γ_tot = 4 was found to strike a balance between these two extremes, and was the best performer using Strategy 1. Strategy 2 has an adaptive step size which selects the largest possible step size that still satisfies Eq. (11), instead of selecting a constant step size which accounts for the worst-case scenario as in Strategy 1. For this reason, we found that the largest value, γ_∥ = 20, was the best performer for this strategy, where the sample size remained at 200 photons for each of the 10,000 iterations. For Strategy 3, the best performer was γ_∥ = 10, where larger values appeared to allow too much variance in the gradient, leading to unstable descents. In all strategies the recovered absorption distribution matched the ground truth absorption more closely in the regions of the sample closest to the light source at z = 0. This is due to the decay of the fluence as a function of depth, as we can see that the QPAT signal is highest at shallow depths in Fig. 3(b). The deeper regions of the sample were the last to approach the ground truth in each of the three strategies.

Next, looking at Figure 6, we see the values of our two metrics, V_∥ and V_tot.
In all cases, both measures of the variance begin at low values, indicating that even with low numbers of simulated photons the computed gradients are of reasonable quality, likely because the poor initial guess is far from the true solution. Each measure of the variance increases as the inversion progresses, until it begins to violate the norm test or inner product test, depending on the strategy. It is seen that the Strategy 1 example attempts to keep V_tot ≤ 4; however, due to some level of variation in the metrics themselves, this condition can be seen to be violated regularly, requiring regular updates to the sample size. For Strategy 2, the imposed limit of V_∥ ≤ 20 is never violated, and thus the sample size is never required to increase. We also see that Strategy 3 manages to keep V_∥ ≤ 10 for the majority of the descent.

In addition to the experiments presented in Figures 3-6, we also trialled a number of other conditions, including media with isotropic scatterers (i.e. with g = 0), various scattering coefficients, and various initial estimates of the absorption. In all cases explored, the methods showed similar behaviour to the above, but with some differences in the ideal values of γ_tot and γ_∥ for each strategy. The outcomes of a range of these experiments are summarized in Table 2 for various problem parameters. Strategy 3 was used in all cases in the table, with the same Lipschitz constant (L = 2.5), starting sample size (|S₀| = 200 photons), photon budget (N_ph = 2 × 10⁶ photons), and ground truth absorption distribution μ_a^true as used in the above examples.

                              Starting μ_a (cm⁻¹)
Medium properties             0.01                       0.2                        1.0
g = 0.9, μ_s = 40 cm⁻¹        γ_∥ = 20, 10000 iter.      γ_∥ = 10, 4476 iter.       γ_∥ = 10, 6533 iter.
                              F_{S_n} = 2.[…]×10⁻[…]     F_{S_n} = 4.[…]×10⁻[…]     F_{S_n} = 3.[…]×10⁻[…]
                              F_{μ_a} = 3.[…]×10⁻[…]     F_{μ_a} = 7.[…]×10⁻[…]     F_{μ_a} = 1.[…]×10⁻[…]
g = 0.9, μ_s = 4 cm⁻¹         γ_∥ = 5, 3819 iter.        γ_∥ = 5, 2579 iter.        γ_∥ = 5, 2834 iter.
                              F_{S_n} = 5.[…]×10⁻[…]     F_{S_n} = 9.[…]×10⁻[…]     F_{S_n} = 1.[…]×10⁻[…]
                              F_{μ_a} = 2.[…]×10⁻[…]     F_{μ_a} = 3.[…]×10⁻[…]     F_{μ_a} = 2.[…]×10⁻[…]
g = 0, μ_s = 4 cm⁻¹           γ_∥ = 20, 10000 iter.      γ_∥ = 5, 2056 iter.        γ_∥ = 10, 10000 iter.
                              F_{S_n} = 4.[…]×10⁻[…]     F_{S_n} = 1.[…]×10⁻[…]     F_{S_n} = 2.[…]×10⁻[…]
                              F_{μ_a} = 8.[…]×10⁻[…]     F_{μ_a} = 3.[…]×10⁻[…]     F_{μ_a} = 8.[…]×10⁻[…]

Table 2: Final outcomes of QPAT inversions with various medium optical properties and starting values of μ_a. Values of F_{S_n} and F_{μ_a} are the final values at the end of each inversion after the stated number of iterations. In each case Strategy 3 was employed, with a starting sample size of |S₀| = 200 photons per iteration, and a total photon budget of N_ph = 2 × 10⁶ photons. Slab thickness is 2 cm in all cases, with the same ground truth μ_a^true distribution as shown in Fig. 3(a).

The final attained values of the sampled data cost function F_{S_n} and absorption error F_{μ_a} are similar in all cases, with the exception of the high-asymmetry and low-scattering case (g = 0.9 and μ_s = 4 cm⁻¹). In this case the reduced scattering coefficient is only μ′_s = μ_s(1 − g) = 0.4 cm⁻¹, meaning much lower overall attenuation of the light through the sample. This results in a more uniform data function, h, where the simulated photons probe the domain more uniformly, and allows the problem to converge significantly faster than in the more highly attenuating cases demonstrated in Figs. 3-6. It is also worth noting that the regime of low scattering and high scattering asymmetry is generally problematic for the performance of approximate transport models such as the diffusion approximation, and the results here highlight the flexibility of RTE-based approaches, as well as the efficiency of the proposed adaptive sampling techniques.

Finally, interesting behaviour was observed when using certain initial guesses of the absorption. An example of this is shown in Fig. 7, where we show the resulting cost functions for a starting estimate of μ_a = 1 cm⁻¹ (significantly overestimating the absorption at all depths), and medium properties of g = 0.9, μ_s = 40 cm⁻¹. In this case we see that the descent appears to encounter local minima in the data cost function F_{S_n} at various points during the descent, depending on the particular strategy used. However, the algorithm manages to escape these local minima and converge to a better solution. This is seen to be the case for all three strategies shown in Fig. 7.

Next we demonstrate similar experiments performed using the UMOT modality described in Section 3.2.2 for the transmission geometry. In this setup we used the same medium slab size as in the QPAT example, and the same optical properties apart from the absorption distribution. The starting sample size in all cases shown is |S₀| = 4000 photons per iteration: 1000 for each forward run (one per source), and 1000 for each of the two adjoint sources, as outlined in Algorithm 3. The total photon budget for the inversion was set to N_ph = 4 × 10^[…] photons.
The Lipschitz constant was set at L = 50, as this displayed stable descent in our test problems using large photon budgets (the low-variance case). The initial estimate of the absorption distribution in the medium is μ_a = 0.[…] cm⁻¹ in all layers. The ground truth absorption μ_a^true is shown in Fig. 8(a), along with the final retrieved absorption
Figure 8:
UMOT inversion: (a) - Ground truth absorption distribution, μ_a^true, and recovered absorption distribution μ_a using each of the three strategies with the stated values of γ_tot or γ_∥. (b) - Associated measured data from the ground truth medium, and simulated forward data at the end of the inversion using each strategy.

distributions obtained via Strategies 1, 2, & 3 using the stated values of γ_tot and γ_∥. Figure 8(b) shows the true "measured" UMOT data, b_obs, along with the forward modelled data from the final estimated medium for each strategy. Figure 9 shows the outcome of each strategy in terms of the sampled data cost function F_{S_n} and the absorption error function F_{μ_a}. The relevant step sizes and sample sizes for each of these three examples are shown in Figure 10, and the values of the metrics measuring the variance in the sampled gradients are presented in Figure 11. Similar to the QPAT modality, we found that Strategy 3 performed the best in terms of the final achieved value of the error in the absorption estimate, F_{μ_a}.

In addition to the results shown in Figures 8-11, in Table 3 we also present a summary of results for a range of different medium optical parameters and starting estimates of the absorption. In all cases, Strategy 3 was used, and the starting sample size was the same as in the previous UMOT examples (|S₀| = 4000 photons), with a total photon budget of N_ph = 4 × 10^[…] photons. For each of the inversions presented in this table, we conducted the inner product test once every 50 iterations, using N_rep = 50 repeated evaluations of the gradient. The resulting inversions display similar error in these cases to the above examples, where we used N_rep = 100 repeated evaluations of the sampled gradient once every 10 iterations to run the inner product test.
This demonstrates that the described methods can still be successful when dedicating fewer computational resources to the inner product or norm test metrics which control the adaptive sample size and step size.

The results shown in Section 4 demonstrate that the adaptive sampling strategies performed well in both our example problems of QPAT and UMOT. We were able to achieve low-error estimates of the medium absorption using a total computational expenditure that was either comparable to, or significantly lower than, the resources required to simulate a single low-variance run of the forward problem. In each demonstration the adaptive sampling strategies maintained low photon numbers throughout the early stages of the inversion. Photon numbers were only increased when required to keep the variance in the gradients below the stipulated limits. These adaptive sampling strategies thus enabled significant computational savings compared to a naïve implementation which might seek to use low-variance (high-quality) computations of the gradient at every iteration. For instance, if we were to use a constant step size of 1/L, and the same number of photons per iteration as that which

                              Starting μ_a (cm⁻¹)
Medium properties             0.01                       0.1                        1.0
g = 0.9, μ_s = 40 cm⁻¹        γ_∥ = 15, 11412 iter.      γ_∥ = 10, 3495 iter.       γ_∥ = 10, 7823 iter.
                              F_{S_n} = 1.[…]×10⁻[…]     F_{S_n} = 2.[…]×10⁻[…]     F_{S_n} = 5.[…]×10⁻[…]
                              F_{μ_a} = 4.[…]×10⁻[…]     F_{μ_a} = 2.[…]×10⁻[…]     F_{μ_a} = 9.[…]×10⁻[…]
g = 0.9, μ_s = 4 cm⁻¹         γ_∥ = 15, 19017 iter.      γ_∥ = 15, 15813 iter.      γ_∥ = 15, 14249 iter.
                              F_{S_n} = 4.[…]×10⁻[…]     F_{S_n} = 5.[…]×10⁻[…]     F_{S_n} = 4.[…]×10⁻[…]
                              F_{μ_a} = 8.[…]×10⁻[…]     F_{μ_a} = 2.[…]×10⁻[…]     F_{μ_a} = 3.[…]×10⁻[…]
g = 0, μ_s = 4 cm⁻¹           γ_∥ = 15, 11594 iter.      γ_∥ = 15, 7304 iter.       γ_∥ = 15, 13981 iter.
                              F_{S_n} = 1.[…]×10⁻[…]     F_{S_n} = 5.[…]×10⁻[…]     F_{S_n} = 1.[…]×10⁻[…]
                              F_{μ_a} = 7.[…]×10⁻[…]     F_{μ_a} = 4.[…]×10⁻[…]     F_{μ_a} = 7.[…]×10⁻[…]

Table 3:
Final outcomes of UMOT inversions with various medium optical properties and starting values of μ_a. Values of F_{S_n} and F_{μ_a} are the final values at the end of each inversion after the stated number of iterations. In each case Strategy 3 was employed, with a starting sample size of |S₀| = 4000 photons per iteration, and a total photon budget of N_ph = 4 × 10^[…] photons. Slab thickness is 2 cm in all cases, with the same ground truth μ_a^true distribution as shown in Fig. 8(a).

was used to generate the "measured" data (10^[…] photons), then we find we still require hundreds of iterations to reach a similar quality of estimate of the absorption as seen in the above problems. This means that the computational requirements of the low-variance approach would be proportional to N_ph = 10^[…] photons. Comparing this to the N_ph = 2 × 10⁶ photons used in the QPAT examples, or the N_ph = 4 × 10^[…] photons used in the UMOT examples, the required computational resources/time to attain our solutions with these adaptive sampling methods is multiple orders of magnitude lower than with the naïve low-variance approach.

In this work we have emphasised the similarities between our approach and that of Stochastic Gradient Descent, as employed in the context of machine learning. However, it should be noted that there are significant differences between the two settings. In machine learning, the measured data are assumed to consist of a large number of samples to be fit to a deterministic model so as to minimise a suitable loss function, and each stochastic gradient is generated by a random subset of these data forming the descent direction of a sub-function. The same method has also been applied in alternative image reconstruction techniques where the data can be more naturally considered as consisting of a large number of random samples from some underlying distribution, for example in Positron Emission Tomography [47].
By contrast, our image reconstruction approach considers the complete measured data at each iteration, with stochasticity arising from the approximation within the forward model: we are effectively sub-sampling the gradient in terms of the parameter space, rather than the data space. That is to say, at each iteration we utilise a subset of some notionally complete model, rather than of the data. The motivation with which each approach is employed is consistent: stochasticity is intentionally introduced to whichever part of the objective function introduces the greatest computational demand. This suggests a third possible approach, in which the computational load of the (sub-)gradient computation is lowered through some stochastic division of both the data and the model; this might be relevant in imaging modalities with discrete counting data, such as time-domain and/or dynamic diffuse optical tomography.

Our work suggests a number of interesting future developments:

• In the examples shown here the "observed" data were effectively "noise-free" by virtue of running the forward Monte Carlo with a very large number of photons. Thus an interesting topic for further study will be to evaluate these methods on noisy forward data, wherein the data fitting term
Figure 9:
UMOT inversion: (a) - Sampled cost function, F_{S_n}, as a function of iteration, n. (b) - Error in absorption estimate, F_{μ_a}, as a function of iteration, n.

should not be iterated to convergence, but where regularisation should be introduced either by early stopping (i.e. by setting a minimum threshold for the data error) or by adding an explicit penalty term.

• Related to the previous point, we further note that our objective function employed a least-squares data fitting term in this study. Depending upon the nature of the noise in the data and that of the stochastic forward model, more suitable metrics may include the Kullback-Leibler discrepancy (for a Poisson likelihood), or a generalised measure of the distance between samples of probability distributions (such as the Wasserstein distance [48]).

• Our results demonstrate a consistent tendency for the adaptive sampling method to exhibit a geometric increase in the sample size as the descent progresses. This suggests that our adaptive approach could be employed to find a particular set of sampling parameters that perform well in a given regime, including the starting sample size |S₀|, the rate of increase of the sample size κ(n), and the rate of change of the step size α_n. If a suitable set of such parameters could be found, they could help determine a fully prescribed sampling strategy. Once calibrated for a given problem of choice, this would avoid the need to explicitly compute the variance of the sampled gradients during the descent, leading to even greater efficiency and speed in the inverse problem.

• Further topics of interest include more advanced methods of variance reduction (e.g. recursive gradients [49]); adaptive estimates of the Lipschitz constant as described in Ref. [30]; alternative optimisation strategies such as back-tracking line search, or primal-dual methods [50]; the use of preconditioning and/or second-order optimisation methods [51]; and an in-depth comparison of these non-linear adaptive methods to alternative approaches such as Perturbation Monte Carlo [27].

In summary, we have successfully demonstrated a means by which stochastic forward models, not directly amenable to standard variational methods for optimisation, can be employed efficiently in non-linear image reconstruction. We expect this concept to lead to many new directions of research in optical image reconstruction.
Disclosures
No conflicts of interest, financial or otherwise, are declared by the authors.
Figure 10:
UMOT inversion: (a) - Step sizes, α_n, as a function of iteration, n. (b) - Adaptive sample size, |S_n|, as a function of iteration.
Figure 11:
UMOT inversion: (a) - V_∥ as a function of iteration. (b) - V_tot as a function of iteration.

Acknowledgements
This work was funded by the Engineering and Physical Sciences Research Council (EPSRC), EP/N032055/1 & EP/N025946/1. S. Powell further acknowledges Royal Academy of Engineering (RAEng) fellowship RF1516/15/33.

The authors would like to thank Robert Twyman and Kris Thielemans for helpful discussions on stochastic gradient and data subset methods.
References

[1] S. R. Arridge and J. Schotland, "Optical tomography: Forward and inverse problems," Inverse Problems (12), p. 123010 (59pp), 2009.

[2] L. T. Biegler, O. Ghattas, M. Heinkenschloss, and B. van Bloemen Waanders, "Large-scale PDE-constrained optimization: An introduction," in Large-Scale PDE-Constrained Optimization, L. T. Biegler, M. Heinkenschloss, O. Ghattas, and B. van Bloemen Waanders, eds., pp. 3–13, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003.

[3] K. Bredies, C. Clason, K. Kunisch, and G. Winckel, eds., Control and Optimization with PDE Constraints, vol. 164, Springer Science & Business Media, 2013.

[4] J. C. D. los Reyes, Numerical PDE-Constrained Optimization, Springer Briefs in Optimization, Springer, 2015.

[5] C. Zhu and Q. Liu, "Review of Monte Carlo modeling of light transport in tissues," Journal of Biomedical Optics (5), p. 050902, 2013.

[6] I. Lux and L. Koblinger, Monte Carlo Particle Transport Methods: Neutron and Photon Calculations, CRC Press, 1991.

[7] A. Liemert and A. Kienle, "Analytical solution of the radiative transfer equation for infinite-space fluence," Phys. Rev. A, p. 015804, 2011. Erratum: Phys. Rev. A 83, 039903 (2011).

[8] A. Liemert, D. Reitzle, and A. Kienle, "Analytical solutions of the radiative transport equation for turbid and fluorescent layered media," Scient. Rep., p. 3819, 2017.

[9] W. C. Lo, K. Redmond, J. Luu, P. Chow, J. Rose, and L. D. Lilge, "Hardware acceleration of a Monte Carlo simulation for photodynamic treatment planning," Journal of Biomedical Optics (1), p. 014019, 2009.

[10] N. Ren, J. Liang, X. Qu, J. Li, B. Lu, and J. Tian, "GPU-based Monte Carlo simulation for light propagation in complex heterogeneous tissues," Optics Express (7), pp. 6811–6823, 2010.

[11] C. K. Hayakawa, J. Spanier, F. Bevilacqua, A. K. Dunn, J. S. You, B. J. Tromberg, and V. Venugopalan, "Perturbation Monte Carlo methods to solve inverse photon migration problems in heterogeneous tissues," Optics Letters (17), pp. 1335–1337, 2001.

[12] R. Graaff, M. H. Koelink, F. F. M. de Mul, W. G. Zijlstra, A. C. M. Dassel, and J. G. Aarnoudse, "Condensed Monte Carlo simulations for the description of light transport," Appl. Opt. (4), pp. 426–434, 1993.

[13] I. T. Lima, A. Kalra, and S. S. Sherif, "Improved importance sampling for Monte Carlo simulation of time-domain optical coherence tomography," Biomedical Optics Express (5), pp. 1069–1081, 2011.

[14] H. Ammari, Optical, Ultrasound, and Opto-Acoustic Tomographies, Mathematical Modeling in Biomedical Imaging II, Springer, 2011.

[15] S. R. Arridge and O. Scherzer, "Imaging from coupled physics,"
Inverse Problems (8), p. 080201,2012.[16] V. A. Morozov, “On the solution of functional equations by the method of regularization,” SovietMath. Doklady , pp. 414–417, 1966.[17] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variancereduction,” in NeurIPS (NIPS 2013) , Advances in Neural Information Processing Systems ,2013.[18] M. Friebel, A. Roggan, G. J. M¨uller, and M. C. Meinke, “Determination of optical properties ofhuman blood in the spectral range 250 to 1100 nm using Monte Carlo simulations with hematocrit-dependent effective scattering phase functions,” Journal of biomedical optics (3), p. 034021,2006. 2219] I. Yaroslavsky, A. Yaroslavsky, T. Goldbach, and H.-J. Schwarzmaier, “Inverse hybrid techniquefor determining the optical properties of turbid media from integrating-sphere measurements,” Applied optics (34), pp. 6797–6809, 1996.[20] Q. Fang and D. Boas, “Monte Carlo simulation of photon migration in 3D turbid media acceleratedby graphics processing units,” Opt. Express (22), pp. 20178–20190, 2009.[21] Q. Fang, “Mesh-based Monte Carlo method using fast ray-tracing in Pl¨ucker coordinates,” Biomed. Opt. Express , p. 165, 2010.[22] J. Buchmann, B. A. Kaplan, S. Powell, S. Prohaska, and J. Laufer, “Three-dimensional quan-titative photoacoustic tomography using an adjoint radiance Monte Carlo model and gradientdescent,” Journal of biomedical optics (6), p. 066001, 2019.[23] R. Hochuli, S. Powell, S. Arridge, and B. Cox, “Quantitative photoacoustic tomography using for-ward and adjoint Monte Carlo models of radiance,” Journal of biomedical optics (12), p. 126004,2016.[24] A.A.Leino, A. Pulkkinen, and T. Tarvainen, “ValoMC: A Monte Carlo software and MATLABtoolbox for simulating light transport in biological tissue,” OSA Continuum (3), pp. 957–972,2019.[25] U. Tricoli, C. M. Macdonald, A. Da Silva, and V. A. 
Markel, “Reciprocity relation for the vectorradiative transport equation and its application to diffuse optical tomography with polarizedlight,” Optics letters (2), pp. 362–365, 2017.[26] R. Yao, X. Intes, and Q. Fang, “A direct approach to compute Jacobians for diffuse optical to-mography using perturbation Monte Carlo-based photon ’replay’,” Biomed. Optics Express (10),pp. 4588–4603, 2018.[27] A. Leino, T. Lunttila, M. Mozumder, A. Pulkkinen, and T. Tarvainen, “Perturbation Monte Carlomethod for quantitative photoacoustic tomography,” IEEE Transactions on Medical Imaging ,pp. 1–1, 2020.[28] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,”
Siam Review (2), pp. 223–311, 2018.[29] D. Newton, F. Yousefian, and R. Pasupathy, “Stochastic gradient descent: Recent trends,” in Re-cent Advances in Optimization and Modeling of Contemporary Problems , pp. 193–220, INFORMS,2018.[30] R. Bollapragada, R. Byrd, and J. Nocedal, “Adaptive sampling strategies for stochastic optimiza-tion,”
SIAM Journal on Optimization (4), pp. 3312–3343, 2018.[31] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematicalstatistics , pp. 400–407, 1951.[32] Y. Nesterov,
Introductory lectures on convex optimization: A basic course , vol. 87, Springer Science& Business Media, 2013.[33] S. R. Arridge, “Optical tomography in medical imaging,”
Inverse Problems (2), pp. R41–R93,1999.[34] A. Erik, S. Tomas, and A. Stefan, “CUDAMCML user manual and implementation notes,” 2009.[35] L. Wang, S. L. Jacques, and L. Zheng, “MCML—Monte Carlo modeling of light transport inmulti-layered tissues,” Computer methods and programs in biomedicine (2), pp. 131–146, 1995.2336] G. Bal, “Inverse transport theory and applications,” Inverse Problems (5), p. 053001, 2009.[37] G. Bal, “Hybrid inverse problems and internal functionals,” in Inverse problems and applications:inside out. II , , pp. 325–368, 2013.[38] L. V. Wang, “Multiscale photoacoustic microscopy and computed tomography,” Nature Photon-ics , pp. 503–509, Sep 2009.[39] P. Beard, “Biomedical photoacoustic imaging,” Interface Focus , 2011.[40] L. Nie and X. Chen, “Structural and functional photoacoustic molecular tomography aided byemerging contrast agents.,”
Chemical Society Reviews (20), pp. 7132–70, 2014.[41] B. Cox, J. G. Laufer, S. R. Arridge, and P. C. Beard, “Quantitative spectroscopic photoacousticimaging: a review,” Journal of Biomedical Optics (6), pp. 061202–1–061202–22, 2012.[42] T. Saratoon, T. Tarvainen, B. Cox, and S. Arridge, “A gradient-based method for quantitativephotoacoustic tomography using the radiative transfer equation,” Inverse Problems , p. 075006,2013.[43] K. Wang and M. Anastasio, “Photoacoustic and thermoacoustic tomography: Image formationprinciples,” in Handbook of Mathematical Methods in Imaging , O. Scherzer, ed., pp. 781–815,Springer New York, 2011.[44] H. Ammari, E. Bossy, J. Garnier, L. H. Nguyen, and L. Seppecher, “A reconstruction algorithmfor ultrasound-modulated diffuse optical tomography,” 2014.[45] F. J. Chung and J. C. Schotland, “Inverse transport and acousto-optic imaging,”
SIAM J. Math.Analysis , pp. 4704–4721, 2017.[46] S. Powell, S. R. Arridge, and T. S. Leung, “Gradient-based quantitative image reconstructionin ultrasound-modulated optical tomography: first harmonic measurement type in a lineariseddiffusion formulation,” IEEE transactions on medical imaging (2), pp. 456–467, 2015.[47] K. Thielemans and S. Arridge, “Adaptive adjustment of the number of subsets during iterativeimage reconstruction,” in , p. 1–2, IEEE, 2016.[48] C. Villani, Optimal Transport: Old and New , vol. 338 of
Grundlehren der mathematischen Wis-senschaften , Springer, 2009.[49] L. Nguyen, J. Liu, K. Scheinberg, and M. Tak´ac, “SARAH: A novel method for machine learn-ing problems using stochastic recursive gradient,” in
ICML 2017 , International Conference onMachine Learning , 2017.[50] A. Chambolle, M. J. Ehrhardt, P. Richt´arik, and C.-B. Sch¨onlieb, “Stochastic primal-dual hy-brid gradient algorithm with arbitrary sampling and imaging applications,” SIAM Journal onOptimization (4), pp. 2783–2808, 2018.[51] P. Moritz, R. Nishihara, and M. I. Jordan, “A linearly-convergent stochastic L-BFGS algorithm,”in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AIS-TATS ,41