Consistency analysis of bilevel data-driven learning in inverse problems
Neil K. Chada, Claudia Schillings, Xin T. Tong, Simon Weissmann
Abstract.
One fundamental problem when solving inverse problems is how to find regularization parameters. This article considers solving this problem using data-driven bilevel optimization, i.e. we consider the adaptive learning of the regularization parameter from data by means of optimization. This approach can be interpreted as solving an empirical risk minimization problem, and we analyze its performance in the large data sample size limit for general nonlinear problems. To reduce the associated computational cost, online numerical schemes are derived using the stochastic gradient method. We prove convergence of these numerical schemes under suitable assumptions on the forward problem. Numerical experiments are presented illustrating the theoretical results and demonstrating the applicability and efficiency of the proposed approaches for various linear and nonlinear inverse problems, including Darcy flow, the eikonal equation, and an image denoising example.
AMS subject classifications:
Keywords: bilevel optimization, statistical consistency, inverse problems, stochastic gradient descent, Tikhonov regularization

1. Introduction
Data-driven modeling seeks to improve model accuracy and predictability by exploiting information from existing data. It has led to a wide range of successes in deep learning, reinforcement learning, natural language processing and other areas [26, 28, 47]. This article is interested in its applications to solving inverse problems. Mathematically speaking, when solving an inverse problem, we try to recover $u \in \mathcal{U}$ from perturbed data $y \in \mathcal{Y}$, where the relationship is given as
$$ (1.1) \qquad y = G(u) + \eta. $$
In (1.1), $\eta \sim N(0, \Gamma)$ is additive Gaussian noise and $G : \mathcal{U} \to \mathcal{Y}$ is the mapping from the parameter space $\mathcal{U}$ to the observation space $\mathcal{Y}$. Here, $\mathcal{U}$ and $\mathcal{Y}$ denote possibly infinite dimensional Banach spaces. Solutions of inverse problems have been well-studied through variational and optimization methods, which are well-documented in the texts [24, 49].

Regularization is an important aspect of the numerical treatment of inverse problems. It helps overcome ill-posedness in theory and the overfitting phenomenon in practice. It can also be interpreted as a form of a-priori knowledge in the Bayesian approach [48, 32]. To implement regularization for (1.1), we estimate the unknown parameter by minimizing a regularized loss function, i.e. we consider
$$ (1.2) \qquad u^* := \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(G(u), y) + S_\lambda(u), \quad \lambda \in \mathbb{R}_+, $$
where $L_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is some metric on $\mathcal{Y}$ and $S_\lambda : \mathcal{U} \to \mathbb{R}_+$ is a regularization function with regularization parameter $\lambda > 0$. A common choice is Tikhonov regularization [50], which can be included in (1.2) through the penalty term $S_\lambda(u) = \lambda \|u\|_{\mathcal{U}}^2$. The choice of norm $\|\cdot\|_{\mathcal{U}}$ often models prior information on the unknown parameter. Other common forms include $L^1$ and total variation regularization, which are particularly useful for imaging purposes [4, 24, 37].

In (1.2), the parameter $\lambda$ balances the influence of the data and the a-priori knowledge via the regularization. While expert knowledge can often provide a rough range for $\lambda$, the exact value, i.e. the $\lambda$ leading to the best estimate of the unknown parameter $u$, is often difficult to determine. However, the parameter $\lambda$ strongly influences the accuracy of the estimate and has to be properly chosen. Bilevel optimization is one way to resolve this issue [17, 21, 45, 49]. It seeks to learn the regularization parameter in a variational manner, and it can be viewed as a data-driven regularization [2]. To formulate this approach, we view the unknown parameter $U \in \mathcal{U}$ and the data $Y \in \mathcal{Y}$ in the model (1.1) as jointly distributed random variables with distribution $\mu_{(U,Y)}$. To find the best possible regularization parameter of the model (1.1), the bilevel minimization seeks to solve
$$ (1.3) \qquad \lambda^* = \arg\min_{\lambda > 0} F(\lambda), \quad F(\lambda) = \mathbb{E}^{\mu_{(U,Y)}}[L_{\mathcal{U}}(u_\lambda(Y), U)], \quad \text{(upper level)} $$
$$ u_\lambda(Y) := \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(G(u), Y) + S_\lambda(u), \quad \text{(lower level)} $$
where $L_{\mathcal{U}} : \mathcal{U} \times \mathcal{U} \to \mathbb{R}_+$ is some metric on the parameter space $\mathcal{U}$. The upper level problem seeks to minimize the distance between the unknown parameter $U$ and the regularized solution corresponding to its data $Y$, which is computed through $u_\lambda(Y)$ in the lower level problem. To solve this (stochastic) bilevel optimization problem, we assume that we have access to training data, given through samples $(U_i, Y_i) \sim \mu_{(U,Y)}$, and the function $F$ in (1.3) can be approximated by its empirical Monte-Carlo approximation. Bilevel optimization has been applied to various methodologies for inverse problems.
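To make the role of $\lambda$ in (1.2) concrete, the following minimal Python sketch (our own toy illustration; the matrix $A$, noise level, and dimensions are arbitrary and not taken from this paper) computes the Tikhonov-regularized estimate of a linear model in closed form, and shows the two limiting regimes: $\lambda \to 0$ recovers the least-squares solution, while large $\lambda$ shrinks the estimate toward zero.

```python
import numpy as np

# Toy illustration: Tikhonov-regularized least squares for a linear forward
# map G(u) = A u with S_lambda(u) = lambda * ||u||^2.  The minimizer of
# ||A u - y||^2 + lambda * ||u||^2 has the closed form
#   u* = (A^T A + lambda I)^{-1} A^T y.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
u_true = rng.standard_normal(5)
y = A @ u_true + 0.1 * rng.standard_normal(20)

def tikhonov(A, y, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Small lambda approaches the least-squares solution; large lambda shrinks
# the estimate toward zero.
u_small = tikhonov(A, y, 1e-8)
u_large = tikhonov(A, y, 1e8)
print(np.linalg.norm(u_large))  # close to 0
```

Choosing $\lambda$ between these two extremes is precisely the problem that the bilevel formulation (1.3) addresses in a data-driven way.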
To motivate this, we provide various examples of the application of bilevel optimization to inverse problems, in the setting described by (1.3), together with an overview of recent literature.

1.1. Motivating Examples.
1.1.1. Example 1 - PDE-constrained inverse problems.
We first consider an inverse problem (1.1) with the lower level problem formulated via a partial differential equation (PDE):
$$ (1.4) \qquad \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(\mathcal{O}(p), y) + S_\lambda(u), \quad \text{s.t.}\ M(u, p) = 0, $$
where $u \in \mathcal{U}$ denotes the unknown parameter and $p \in \mathcal{V}$ is the state. The function $M : \mathcal{U} \times \mathcal{V} \to \mathcal{W}$ describes an underlying ODE or PDE model. The operator $\mathcal{O} : \mathcal{V} \to \mathbb{R}^K$ denotes the observation operator, which maps the state $p$ to finite dimensional observations. The Darcy flow problem is one such example: $u$ describes a subsurface structure, $p$ is the corresponding pressure field, $M$ describes Darcy's law, and $\mathcal{O}$ evaluates $p$ at different locations.

In order to formulate the corresponding bilevel problem (1.3), we assume that the forward model $M(u, p) = 0$ is well-posed, which means that for each $u \in \mathcal{U}$ there exists a unique $p \in \mathcal{V}$ such that $M(u, p) = 0 \in \mathcal{W}$. Hence, using the solution operator $\mathcal{G} : \mathcal{U} \to \mathcal{V}$ such that $M(u, \mathcal{G}(u)) = 0$, we can formulate the reduced problem of (1.4) by
$$ \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(G(u), y) + S_\lambda(u), $$
where we have defined $G = \mathcal{O} \circ \mathcal{G}$. Hence, given a training data set $(u^{(j)}, y^{(j)})$, we can also formulate the empirical bilevel minimization problem
$$ \hat\lambda_n := \arg\min_{\lambda > 0} \frac{1}{n} \sum_{j=1}^n \|u_\lambda(y^{(j)}) - u^{(j)}\|^2, \qquad u_\lambda(y^{(j)}) := \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(G(u), y^{(j)}) + S_\lambda(u). $$
In terms of applications, many inverse problems arising from PDEs [3] are concerned with the recovery of an unknown which is heterogeneous. As a result, it is very natural to model the unknown as a Gaussian random field. Such models include Darcy flow, the Navier-Stokes model [30] and electrical impedance tomography [5, 32]. Physical constraints such as boundary or initial conditions are required for modeling purposes.

Holler et al. [29] consider bilevel optimization for inverse problems in the setting of (1.4).
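For intuition, the reduced map $G = \mathcal{O} \circ \mathcal{G}$ can be sketched for a 1D analogue of such PDE-constrained problems (a hypothetical toy, not the paper's Darcy example): a finite difference solve of $-p'' = u$ with homogeneous Dirichlet boundary conditions plays the role of the solution operator $\mathcal{G}$, and pointwise evaluation plays the role of $\mathcal{O}$.

```python
import numpy as np

# Minimal sketch (our illustration): a reduced forward map G = O ∘ G_sol for
# the 1D Poisson problem  -p''(x) = u(x),  p(0) = p(1) = 0,  discretized by
# finite differences.  G_sol solves the PDE constraint M(u, p) = 0 and O
# observes p at a few grid points.
n = 99                       # interior grid points, mesh width h = 1/(n+1)
h = 1.0 / (n + 1)
# Standard second-order FD Laplacian (tridiagonal, Dirichlet BC).
L = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h**2
obs_idx = [24, 49, 74]       # observation locations x = 0.25, 0.5, 0.75

def forward(u):
    p = np.linalg.solve(L, u)    # solution operator G_sol(u)
    return p[obs_idx]            # observation operator O

# For u ≡ 1 the exact solution is p(x) = x(1-x)/2, so p(0.5) = 0.125.
y = forward(np.ones(n))
print(y[1])   # ≈ 0.125
```

In the full 2D examples of Section 4 the solve is a finite element computation, but the reduced structure of the forward operator is the same.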
These authors provide theory which suggests existence of solutions and formulate their problem as an optimal control problem. This is connected with the work of Kaltenbacher [33, 34], who provided a modified approach known as "all-at-once" inversion. These works have also been used in the context of deconvolution [13, 46].

1.1.2. Example 2 - Image & signal processing.
Bilevel optimization is a popular solution choice for image processing problems [6, 35]. In these problems, one is interested in optimizing over an underlying image and particular areas or segments of that image. A common example is image denoising, i.e. removing noise from an image. Another example is image deblurring, where the image is commonly given as a convolution with a linear kernel $A$, i.e.
$$ y = A * u + \eta, $$
where $*$ denotes the convolution of $A$ and $u$, commonly expressed as
$$ A * u(x) = \int_{\mathbb{R}^d} A(x - \tau)\, u(\tau)\, d\tau. $$
This inverse problem is also known as deconvolution. The setting of (1.3) is common for deconvolution, where the loss functions are given as
$$ (1.5) \qquad \hat\lambda_n := \arg\min_{\lambda > 0} \frac{1}{n}\sum_{j=1}^n \|u_\lambda(y^{(j)}) - u^{(j)}\|^2, \qquad u_\lambda(y^{(j)}) := \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(A * u, y^{(j)}) + \lambda \|Lu\|^2. $$
In (1.5), $L$ is a regularization matrix, and the upper level problem is taken as the minimization of the empirical loss function given by a training data set $(u^{(j)}, y^{(j)})$. Commonly, $\lambda$ is taken either as a weight balancing $L_{\mathcal{Y}}$ against the penalty term, or it can be viewed as the noise level within the system. Traditional choices of $L$ are $L = I$ or a first or second order difference operator, which can depend on the unknown or image of interest. Further details on the choices of $L$ and $A$ are discussed in [6].

The work of De los Reyes, Schönlieb and coauthors [9, 18, 19, 20] considered the application of bilevel optimization to denoising and deblurring, where non-smooth regularization is used, such as total variation and Bregman regularization. The latter forms of regularization are useful in imaging as they preserve non-smooth features, such as edges and straight lines.
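A minimal discrete sketch of the deconvolution setting in (1.5) (our toy illustration; the kernel, signal, and value of $\lambda$ are hypothetical choices, not from the paper's experiments) blurs a piecewise-constant signal with a convolution matrix $A$ and recovers it by Tikhonov regularization with a first-order difference matrix $L$.

```python
import numpy as np

# Toy deconvolution sketch: y = A u + eta with A a (near-singular) blurring
# matrix, solved by Tikhonov regularization with a difference operator L.
rng = np.random.default_rng(1)
n = 50
kernel = np.array([0.25, 0.5, 0.25])          # simple blurring kernel
A = sum(np.diag(np.full(n - abs(k - 1), kernel[k]), k - 1) for k in range(3))
u_true = np.zeros(n); u_true[20:30] = 1.0     # piecewise-constant signal
y = A @ u_true + 0.01 * rng.standard_normal(n)

# First-order difference operator L (penalizes oscillatory reconstructions).
L = np.diag(np.ones(n)) - np.diag(np.ones(n - 1), 1)
lam = 0.1
# Normal equations of  ||A u - y||^2 + lam * ||L u||^2 :
u_rec = np.linalg.solve(A.T @ A + lam * L.T @ L, A.T @ y)
print(np.linalg.norm(u_rec - u_true))         # reconstruction error
```

The quality of `u_rec` depends strongly on `lam`, which is exactly the quantity the upper level problem in (1.5) learns from training pairs $(u^{(j)}, y^{(j)})$.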
1.2. Our contributions.
In this article, we investigate two different approaches to solving bilevel optimization problems and their performance on inverse problems. Firstly, we formulate the offline approach of bilevel optimization as an empirical risk minimization (ERM) problem. Analyzing the performance of the ERM solution is difficult, since the loss function is random and non-convex, so numerical solutions often can only find local minima. We build a theoretical framework under these general settings. In particular, it provides the convergence rate of the ERM solution as the sample size goes to infinity. This framework is applied to linear inverse problems to understand the performance of the bilevel optimization approach.

Secondly, we discuss how to implement stochastic gradient descent (SGD) methods for bilevel optimization. SGD is a popular optimization tool for empirical risk minimization because of its straightforward implementation and efficiency. The low computational cost is particularly appealing in the bilevel context, as finding the lower-level solution is already time consuming. Besides exact SGD, we also consider SGD with central difference approximation. This can be useful for problems with complicated forward observation models. A general consistency analysis framework is formulated for both exact SGD and approximated SGD. We demonstrate how to apply this framework to linear inverse problems.

Various numerical results are presented highlighting and verifying the theory acquired. Our results are first presented for various partial differential equations, both linear and nonlinear, including Darcy flow and the eikonal equation, as motivated through Example 1 in subsection 1.1.1.
We also test our theory on an image denoising example, which is discussed through Example 2 in subsection 1.1.2. In particular, we demonstrate numerically the statistical consistency result, which includes the rate of convergence, and we show that the learned parameter $\lambda$ outperforms a fixed $\lambda$ within each inverse problem experiment.

1.3. Organization. The structure of this paper is given as follows. In Section 2 we present the bilevel optimization problem of interest in a stochastic framework, and present a statistical consistency result for the adaptive parameter. We then extend this result to the linear inverse setting with Tikhonov regularization. This leads on to Section 3, where we discuss the stochastic gradient descent method and its application to our bilevel optimization problem. We provide the various assumptions required, and show in the linear setting that our parameter converges in $L^2$ to the minimizer. In Section 4 we test our theory on various numerical models, both linear and nonlinear, such as Darcy flow and the eikonal equation. This also includes an image denoising example. Finally, we conclude our findings in Section 5. The appendix contains the proofs of the results in Sections 2 and 3.

2. Regularization parameter offline recovery
In this section we discuss how to use offline bilevel optimization to recover regularization parameters. We also show that the solution is statistically consistent under suitable conditions.

2.1. Offline bilevel optimization.
Regularization parameter learning by bilevel optimization views the unknown parameter $U$ and the data $Y$ as jointly distributed random variables with distribution $\mu_{(U,Y)}$; see e.g. [2] for more details. Recall that the bilevel optimization problem is given by
$$ \lambda^* = \arg\min_{\lambda \in \Lambda} F(\lambda), \quad F(\lambda) = \mathbb{E}^{\mu_{(U,Y)}}[L_{\mathcal{U}}(u_\lambda(Y), U)], \quad \text{(upper level)} $$
$$ u_\lambda(Y) := \arg\min_{u \in \mathcal{U}} \Psi(\lambda, u, Y), \quad \Psi(\lambda, u, y) := L_{\mathcal{Y}}(G(u), y) + S_\lambda(u), \quad \text{(lower level)} $$
where $L_{\mathcal{U}}$ denotes a discrepancy function in the parameter space $\mathcal{U} := \mathbb{R}^d$ and $L_{\mathcal{Y}}$ denotes a discrepancy function in the observation space $\mathcal{Y} := \mathbb{R}^K$. $S_\lambda(u)$ represents the regularization with parameter $\lambda \in \Lambda$. Here, $\Lambda$ represents the range of regularization parameters, which often comes from physical constraints. For simplicity, we assume all the functions here are continuous and integrable, and so are their first and second order derivatives with respect to $\lambda$.

In general, we do not know the exact distribution $\mu_{(U,Y)}$ in the upper level of (1.3). We consider the scenario where we have access to training data $(u^{(j)}, y^{(j)})_{j=1}^n$, which we assume to be i.i.d. samples from $\mu_{(U,Y)}$. With these data, we can approximate $F$ in (1.3) by its empirical average:
$$ (2.1) \qquad \hat F_n = \frac{1}{n} \sum_{j=1}^n L_{\mathcal{U}}(u_\lambda(y^{(j)}), u^{(j)}). $$
This leads to a data-driven estimator of the regularization parameter,
$$ (2.2) \qquad \hat\lambda_n = \arg\min_{\lambda \in \Lambda} \hat F_n, \qquad u_\lambda(y^{(j)}) = \arg\min_{u \in \mathcal{U}} L_{\mathcal{Y}}(G(u), y^{(j)}) + S_\lambda(u). $$
This method of estimation is known as empirical risk minimization in machine learning [47]. We refer to it as "offline" since minimizing $\hat F_n$ involves all $n$ data points at each algorithmic iteration. With $\hat\lambda_n$ formulated, it is of natural interest to investigate its convergence to the true parameter $\lambda^*$ as the sample size increases.
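The estimator (2.1)-(2.2) can be sketched for a scalar toy model where everything is explicit (our illustration, not the paper's setup): $Y = U + \eta$ with $U \sim N(0, s^2)$ and $\eta \sim N(0, \sigma^2)$, lower-level loss $\frac12(u - y)^2 + \frac{\lambda}{2}u^2$, and hence the closed-form lower-level solution $u_\lambda(y) = y/(1+\lambda)$. For this model the Bayes-optimal choice is $\lambda^* = \sigma^2/s^2$, which ERM should recover for large $n$.

```python
import numpy as np

# Offline ERM sketch on a scalar toy model: the grid search below minimizes
# the empirical risk (2.1) over lambda; the optimum is lam* = sigma^2/s^2.
rng = np.random.default_rng(2)
s, sigma, n = 1.0, 0.5, 20000        # here lam* = 0.25
u = s * rng.standard_normal(n)
y = u + sigma * rng.standard_normal(n)

def F_hat(lam):
    # empirical risk (2.1) with u_lam(y) = y / (1 + lam)
    return np.mean((y / (1.0 + lam) - u) ** 2)

grid = np.linspace(0.01, 1.0, 200)
lam_hat = grid[np.argmin([F_hat(l) for l in grid])]
print(lam_hat)   # close to lam* = 0.25
```

The consistency results below quantify how fast such an estimator $\hat\lambda_n$ concentrates around $\lambda^*$ as $n$ grows.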
Consistency analysis is of central interest in the study of statistics. In particular, if $\hat\lambda_n$ is the global minimum of $\hat F_n$, we have the following theorem, adapted from Theorem 5.2.3 of Bickel and Doksum [7] and formulated in our notation.

Theorem 2.1.
Suppose that for any $\epsilon > 0$,
$$ P\Big(\sup_{\lambda \in \Lambda} |\hat F_n(\lambda) - F(\lambda)| > \epsilon\Big) \to 0, \quad \text{as } n \to \infty, $$
that $\hat\lambda_n$ is the global minimizer of $\hat F_n$, and that $\lambda^*$ is the unique minimizer of $F$. Then $\hat\lambda_n$ is a consistent estimator.

In more practical scenarios, finding $\hat\lambda_n$ relies on the choice of optimization algorithm. If we are using gradient based algorithms, such as gradient descent, $\hat\lambda_n$ can be the global minimum of $\hat F_n$ if $\hat F_n$ is convex. More generally, we can only assume $\hat\lambda_n$ to be a stationary point of $\hat F_n$, i.e. $\nabla \hat F_n(\hat\lambda_n) = 0$. In such situations, we provide the following alternative tool replacing Theorem 2.1:

Proposition 2.2.
Suppose $F$ is $C^2$, $\lambda^*$ is a local minimum of $F$, and $\hat\lambda_n$ is a local minimum of $\hat F_n$. Let $\mathcal{D}$ be an open convex neighborhood of $\lambda^*$ in the parameter space and $c$ be a positive constant. Denote by $A_n$ the event
$$ A_n = \{\hat\lambda_n \in \mathcal{D},\ \nabla_\lambda^2 \hat F_n(\lambda) \succeq c I \text{ for all } \lambda \in \mathcal{D}\}. $$
When $A_n$ takes place, the following holds:
$$ \|\hat\lambda_n - \lambda^*\| \le \frac{\|\nabla_\lambda \hat F_n(\lambda^*) - \nabla_\lambda F(\lambda^*)\|}{c}. $$
In particular, we have
$$ \mathbb{E}_{A_n} \|\hat\lambda_n - \lambda^*\| \le \frac{\sqrt{\mathrm{tr}(\mathrm{Var}(\nabla_\lambda f(\lambda^*, Z)))}}{c \sqrt{n}}. $$

Proposition 2.2 makes two claims. From the second claim, we see that $\hat\lambda_n$ converges to $\lambda^*$ at rate $1/\sqrt{n}$. With the first claim, we can sometimes obtain more accurate estimates of large or medium deviations. We will see how to do that in the linear inverse problem discussed below.

On the other hand, Proposition 2.2 requires the existence of a region $\mathcal{D}$ containing both $\hat\lambda_n$ and $\lambda^*$; moreover, $\hat F_n$ needs to be strongly convex inside $\mathcal{D}$. The convexity part is necessary, since without it there might be multiple local minima, and we would have identifiability issues. In order to apply Proposition 2.2, one needs to find $\mathcal{D}$ and bound the probability of the outlier event $A_n^c$. This procedure can be nontrivial and requires some advanced tools from probability. We demonstrate how to do so for the linear inverse problem.

2.2. Offline consistency analysis with linear observation models.
In this section we demonstrate how to apply Proposition 2.2 to linear observation models with Tikhonov regularization. In particular, we assume $u \in \mathbb{R}^d$ and the data $y$ is observed through a matrix $A \in \mathbb{R}^{K \times d}$,
$$ (2.3) \qquad y = Au + \eta, $$
with Gaussian prior information $u \sim N(0, \lambda^* C)$ and Gaussian noise $\eta \sim N(0, \Gamma)$. The common choice of discrepancy functions in the lower level are the corresponding negative log-likelihoods
$$ L_{\mathcal{Y}}(G(u), y) = \frac{1}{2}\|Au - y\|_\Gamma^2, \qquad S_\lambda(u) = \frac{\lambda}{2}\|u\|_C^2. $$
Since both of these functions are quadratic in $u$, the lower level optimization problem has the explicit solution
$$ u_\lambda(y) = (A^\top \Gamma^{-1} A + \lambda C^{-1})^{-1} A^\top \Gamma^{-1} y. $$
If we use the root-mean-square error in the upper level to learn $\lambda$, the discrepancy function is given by
$$ f(\lambda, u, y) = \|u_\lambda(y) - u\|^2, $$
and the empirical loss function is defined by
$$ \hat F_n(\lambda) = \frac{1}{n} \sum_{i=1}^n \|u_\lambda(y_i) - u_i\|^2. $$
It is worth mentioning that $F(\lambda)$ is not convex on the real line despite $G$ being linear; the detailed calculation can be found in Remark A.3. For this reason, it is necessary to introduce in Proposition 2.2 the local region $\mathcal{D}$ on which $F$ is convex.

Since the formulation of $u_\lambda$ involves inverting the matrix $A^\top \Gamma^{-1} A + \lambda C^{-1}$, this operation may be unstable for $\lambda$ approaching $0$. When $\lambda$ approaches $\infty$, the gradient of $\hat F_n$ approaches zero, so $\infty$ can be a stationary point that an optimization algorithm tries to converge to. To avoid these issues, we assume there are lower and upper bounds such that
$$ 0 < \lambda_l < \lambda^* < \lambda_u, $$
where $\lambda_l$ can be chosen as a very small number and $\lambda_u$ can be very large. Their values can often be obtained from physical restrictions of the inverse problem. By assuming their existence, we can restrict $\hat\lambda_n$ to the interval $\Lambda = (\lambda_l, \lambda_u)$. Now we are ready to present our main result for the offline recovery of the regularization parameter. In particular, we show that $\hat\lambda_n$ converges to $\lambda^*$ with high probability.

Theorem 2.3.
Suppose $\hat\lambda_n \in (\lambda_l, \lambda_u)$ is a local minimum of $\hat F_n$. Then there exist $C^*, c^* > 0$ such that for any $1 > \epsilon > 0$ and any $n$,
$$ P\big(|\hat\lambda_n - \lambda^*| > \epsilon,\ \lambda_l < \hat\lambda_n < \lambda_u\big) \le C^* \exp\big(-c^* n \min(\epsilon, \epsilon^2)\big). $$
The values of $C^*, c^* > 0$ depend on $\lambda_l, \lambda_u, \lambda^*, C$, but not on $n$. Since we can obtain consistency assuming only that $\hat\lambda_n$ is a local minimum, we do not demonstrate how to implement Theorem 2.1 for the more restrictive scenario where $\hat\lambda_n$ is a global minimum.

Remark 2.4.
We note that in the Gaussian setting with Tikhonov regularization one can also estimate $\lambda^*$ empirically by using the maximum likelihood estimator
$$ \hat\lambda_n = d \cdot n \Big( \sum_{j=1}^n (u^{(j)})^\top C^{-1} u^{(j)} \Big)^{-1}. $$
However, only in the Gaussian setting with Tikhonov regularization does this estimate lead to the optimal solution of (1.3). When considering alternative regularization, or dropping the Gaussian assumption on $u$, it is not clear whether this approach still leads to a good estimate of $\lambda^*$.

3. Regularization parameter online recovery
In this section we discuss how to implement the stochastic gradient descent (SGD) method for online solutions of the bilevel optimization problem. We will formulate the SGD method for general nonlinear inverse problems and state certain assumptions on the forward model and the regularization function that ensure convergence of the proposed method.

3.1. Bilevel stochastic gradient descent method.
In the offline solution of the bilevel optimization problem (2.2), one has to compute the empirical loss function $\hat F_n$ in (2.1) or its gradient. This involves solving the lower level problem for each training data point $(u^{(j)}, y^{(j)})$, $j = 1, \dots, n$. When $n$ is very large, this can be computationally demanding. One way to alleviate this is to use stochastic gradient descent (SGD). This has been done in the context of traditional optimization [16], where various convergence results were shown. As a result, it has been applied to problems in machine learning, most notably support vector machines [14, 15], but also in a more general context without the use of SGD [25, 31]. Here we propose an SGD method to solve the bilevel optimization problem (1.3) online.

To formulate the SGD method, we first note that the gradient descent method generates iterates $\lambda_{k+1}$ based on the update rule
$$ \lambda_{k+1} = \lambda_k - \beta_k \nabla_\lambda F(\lambda_k), $$
where $\beta_k$ is a sequence of stepsizes. As mentioned above, the population gradient $\nabla_\lambda F$ is often computationally inaccessible, and its empirical approximation $\nabla_\lambda \hat F_n$ is often expensive to compute. One general solution to this issue is to use a stochastic approximation of $\nabla_\lambda F$. Here we choose $\nabla_\lambda f(\lambda_k, Z^{(k)})$, since it is an unbiased estimator of $\nabla_\lambda F$:
$$ \nabla_\lambda F(\lambda_k) = \mathbb{E}_Z \nabla_\lambda f(\lambda_k, Z). $$
The identity above holds by Fubini's theorem, since we assume $f$ and its second order derivatives are all continuous and integrable. Compared with $\nabla_\lambda \hat F_n$, $\nabla_\lambda f$ involves only one data point $Z^{(k)}$, so it has a significantly smaller computational cost. We refer to this method as "online", since it does not require all $n$ data points to be available at each algorithmic iteration. We formulate the stochastic gradient descent method to solve (1.3) as Algorithm 1.
Algorithm 1 Bilevel Stochastic Gradient Descent
Input: $\lambda_0$, $m$, $\beta = (\beta_k)_{k=1}^n$ with $\beta_k > 0$, i.i.d. samples $(Z^{(k)})_{k \in \{1,\dots,n\}} \sim \mu_{(U,Y)}$.
for $k = 0, \dots, n-1$ do
$\quad$ (3.1) $\lambda_{k+1} = \chi(\lambda_k - \beta_k \nabla_\lambda f(\lambda_k, Z^{(k)}))$
end for
Output: the average $\bar\lambda_n = \frac{1}{m}\sum_{k=n-m+1}^n \lambda_k$

In Algorithm 1, the step size $\beta_k$ is a sequence decreasing to zero not too fast, so that the Robbins-Monro conditions [41] apply:
$$ (3.2) \qquad \sum_{k=1}^\infty \beta_k = \infty, \qquad \sum_{k=1}^\infty \beta_k^2 < \infty. $$
One standard choice is a decreasing step size $\beta_k = \beta_0 k^{-\alpha}$ with $\alpha \in (1/2, 1]$. The output is the average $\bar\lambda_n$ of the final $m$ iterates, which has been shown to accelerate the scheme for standard SGD methods; see [40]. The projection map $\chi$ ([47], Section 14.4.1) is defined as
$$ \chi(\lambda) = \arg\min_{\theta \in \Lambda} \|\theta - \lambda\|. $$
In other words, it maps $\lambda$ to itself if $\lambda \in \Lambda$; otherwise it outputs the point in $\Lambda$ that is closest to $\lambda$. Using $\chi$ ensures $\lambda_{k+1}$ is still in the range of regularization parameters if $\Lambda$ is closed. This operation in general shortens the distance between $\lambda_{k+1}$ and $\lambda^*$ when $\Lambda$ is convex:

Lemma 3.1 (Lemma 14.9 of [47]). If $\Lambda$ is convex, then for any $\lambda$, $\|\chi(\lambda) - \lambda^*\| \le \|\lambda - \lambda^*\|$.

In particular, the stochastic gradient $\nabla_\lambda f(\lambda_k, Z^{(k)})$ is given by the following lemma, which states sufficient conditions on $\Psi$ ensuring that both $u_\lambda$ and $f$ are continuously differentiable w.r.t. $\lambda$.

Lemma 3.2.
Suppose the lower level loss function $\Psi(\lambda, u, y)$ is $C^2$ and strictly convex for $(u, \lambda)$ in a neighborhood of $(u_\lambda, \lambda)$. Then the function $\lambda \mapsto u_\lambda(y)$ is continuously differentiable w.r.t. $\lambda$ near $\lambda$, and the derivative is given by
$$ \nabla_\lambda u_\lambda(y) = -\big(\nabla_u^2 [\Psi(\lambda, u_\lambda(y), y)]\big)^{-1} \nabla_{\lambda u} [\Psi(\lambda, u_\lambda(y), y)], $$
and
$$ (3.3) \qquad \nabla_\lambda f(\lambda, y, u) = \nabla_w L_{\mathcal{U}}(u_\lambda(y), u)^T \nabla_\lambda u_\lambda(y), $$
where $\nabla_w$ denotes the derivative with respect to the first argument of $L_{\mathcal{U}}$.

3.2. Approximate stochastic gradient method.
In order to implement Algorithm 1, it is necessary to evaluate the gradient $\nabla_\lambda f$. While Lemma 3.2 provides a formula to compute the gradient, its evaluation can be expensive for complicated PDE forward models. In these scenarios, it is more reasonable to implement approximate SGD. One general way to find an approximate gradient is to apply a central finite difference scheme. This involves perturbing each coordinate in opposite directions and using the difference of the function values to approximate the gradient:
$$ (3.4) \qquad (\tilde\nabla_\lambda f(\lambda_k, z))_i \approx \frac{f(\lambda_k + h_k e_i, z) - f(\lambda_k - h_k e_i, z)}{2h_k}, $$
where $e_i$ is the $i$-th Euclidean basis vector and $h_k$ is a step size. $h_k$ can either be fixed as a small constant, or it can decay as $k$ increases, so that higher accuracy gradients are used when the iterates are converging.
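As a sanity check of the derivative formula in Lemma 3.2 against such a central difference approximation, consider a quadratic lower-level loss (our toy example, with arbitrary $A$, $y$, and $\lambda$): for $\Psi(\lambda, u, y) = \frac12\|Au - y\|^2 + \frac{\lambda}{2}\|u\|^2$ we have $\nabla_u^2 \Psi = A^\top A + \lambda I$ and $\nabla_{\lambda u}\Psi = u$, so the lemma gives $\nabla_\lambda u_\lambda(y) = -(A^\top A + \lambda I)^{-1} u_\lambda(y)$.

```python
import numpy as np

# Implicit differentiation (Lemma 3.2) vs. central difference on a toy
# quadratic lower-level problem.
rng = np.random.default_rng(5)
K, d, lam = 10, 4, 0.7
A = rng.standard_normal((K, d))
y = rng.standard_normal(K)

def u_of(lam):
    # closed-form lower-level minimizer u_lam(y)
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

u_lam = u_of(lam)
# Lemma 3.2: d u_lam / d lam = -(A^T A + lam I)^{-1} u_lam
du_implicit = -np.linalg.solve(A.T @ A + lam * np.eye(d), u_lam)

# Central difference check, as in (3.5):
h = 1e-5
du_fd = (u_of(lam + h) - u_of(lam - h)) / (2 * h)
print(np.linalg.norm(du_implicit - du_fd))   # ≈ 0 (O(h^2) agreement)
```

The two derivatives agree up to the $O(h^2)$ truncation error of the central difference, which is the error order assumed in the analysis below.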
In many cases, the higher level optimization uses the $L^2$ loss function $L_{\mathcal{U}}(y, u) = \frac12\|y - u\|^2$. The exact SGD update step (3.1) can then be written as
$$ \lambda_{k+1} = \lambda_k - \beta_k \nabla_\lambda \tfrac12\|u_{\lambda_k}(y^{(k)}) - u^{(k)}\|^2 = \lambda_k - \beta_k \big(\nabla_\lambda u_{\lambda_k}(y^{(k)})\big)^\top \big(u_{\lambda_k}(y^{(k)}) - u^{(k)}\big). $$
In this case, it makes more sense to apply the central difference scheme only to the $\nabla_\lambda u_\lambda$ part:
$$ (3.5) \qquad (\nabla_\lambda u_\lambda(y^{(k)}))_i = \frac{\partial u_\lambda(y^{(k)})}{\partial \lambda_i} \approx \frac{u_{\lambda + h_k e_i}(y^{(k)}) - u_{\lambda - h_k e_i}(y^{(k)})}{2h_k} =: (\tilde\nabla_\lambda u_\lambda(y^{(k)}))_i, $$
where $(e_i)_{i=1,\dots,d}$ denote the unit vectors in $\mathbb{R}^d$. Using this approximation, we formulate the approximate SGD method in the following algorithm, where we replace the exact gradient $\nabla_\lambda u_\lambda(y^{(k)})$ by the numerical approximation $\tilde\nabla_\lambda u_\lambda(y^{(k)})$ defined in (3.5). Here we have defined the numerical approximation of $\nabla_\lambda f$ by
$$ (3.6) \qquad \tilde\nabla_\lambda f(\lambda, (y, u)) := \big(\tilde\nabla_\lambda u_\lambda(y)\big)^\top (u_\lambda(y) - u). $$
In most finite difference approximation schemes, the approximation error is controlled by $h_k$. In particular, we assume the central difference scheme used in either (3.4) or (3.6) yields an error of order
$$ \|\mathbb{E}\, \tilde\nabla_\lambda f(\lambda, Y, U) - \nabla_\lambda F(\lambda)\| =: \alpha_k = O(h_k^2). $$
Replacing the stochastic gradient in Algorithm 1 with its approximation, we obtain the algorithm below:
Algorithm 2 Approximate Bilevel Stochastic Gradient Descent
Input: $\lambda_0$, $m$, $\beta = (\beta_k)_{k=1}^n$ with $\beta_k > 0$, i.i.d. samples $(Z^{(k)})_{k \in \{1,\dots,n\}} \sim \mu_{(U,Y)}$.
for $k = 0, \dots, n-1$ do
$\quad$ $\lambda_{k+1} = \chi(\lambda_k - \beta_k \tilde\nabla_\lambda f(\lambda_k, Z^{(k)}))$
end for
Output: the average $\bar\lambda_n = \frac{1}{m}\sum_{k=n-m+1}^n \lambda_k$

3.3. Consistency analysis for online estimators.
Next we formulate sufficient conditions ensuring that $\lambda_k$ converges in $L^2$ to the optimal solution $\lambda^*$ of (1.3).

Proposition 3.3.
Suppose that there is a convex region $\mathcal{D} \subset \Lambda$ and a constant $c > 0$ such that
$$ (3.7) \qquad \inf_{\lambda \in \mathcal{D}} (\lambda - \lambda^*)^\top \nabla_\lambda F(\lambda) > c \|\lambda - \lambda^*\|^2, $$
and there are constants $a, b > 0$ such that for all $\lambda \in \mathcal{D}$ it holds that
$$ (3.8) \qquad \mathbb{E}[|\tilde\nabla_\lambda f(\lambda, Z)|^2] < a + b \|\lambda - \lambda^*\|^2. $$
Also assume the bias in the approximate SGD is bounded by
$$ (3.9) \qquad \|\mathbb{E}\, \tilde\nabla_\lambda f(\lambda_k, Z_k) - \nabla_\lambda F(\lambda_k)\| \le \alpha_k. $$
Let $A_n$ be the event that $\lambda_k \in \mathcal{D}$ for all $k \le n$, and suppose $\beta_1 \le c/b$. Then, if the approximation error is bounded by a small constant, $\alpha_k \le \alpha$, there is a sequence $C_n$ such that
$$ \mathbb{E}_{A_n} \|\lambda_n - \lambda^*\|^2 \le \Big(\mathbb{E}\, Q_0 + 2a \sum_{j=1}^\infty \beta_j^2\Big) C_n + \frac{\alpha^2}{c^2}, $$
where $Q_0 = \|\lambda_0 - \lambda^*\|^2$. Here
$$ C_n = \min_{k \le n} \max\Big\{ \prod_{j=k+1}^n (1 - c\beta_j),\ a\beta_k / c \Big\} $$
is a sequence converging to zero. If the approximation error is decaying so that $\alpha_k \le D\beta_k$, then we have the estimation error
$$ \mathbb{E}_{A_n} \|\lambda_n - \lambda^*\|^2 \le \Big(\mathbb{E}\, Q_0 + 2(a + D/c) \sum_{j=1}^\infty \beta_j^2\Big) C_n. $$

Remark 3.4.
We note that the above result also leads to similar convergence of the averaged estimator $\bar\lambda_n$, since by Jensen's inequality
$$ \|\bar\lambda_n - \lambda^*\|^2 \le \frac{1}{m} \sum_{k=n-m+1}^n \|\lambda_k - \lambda^*\|^2. $$
Further, for standard SGD methods the averaging step has been shown to lead to the best possible convergence rate under suitable assumptions. Interested readers can refer to [40] for more details.
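To make Algorithm 1 and the averaged output concrete, the following sketch runs exact bilevel SGD on a scalar toy model (our illustration; the step size schedule, projection range, and sample size are hypothetical choices, not from the paper's experiments). For $Y = U + \eta$ with $U \sim N(0, s^2)$, $\eta \sim N(0, \sigma^2)$, the lower-level solution is $u_\lambda(y) = y/(1+\lambda)$, the per-sample loss is $f(\lambda, (u,y)) = (u_\lambda(y) - u)^2$ with exact gradient $-2(u_\lambda(y) - u)\, y/(1+\lambda)^2$, and the bilevel optimum is $\lambda^* = \sigma^2/s^2$.

```python
import numpy as np

# Exact bilevel SGD (Algorithm 1) on a scalar toy model; lam* = 0.25.
rng = np.random.default_rng(0)
s, sigma, n, m = 1.0, 0.5, 5000, 500
lam_l, lam_u = 0.01, 2.0            # projection range Lambda

lam, iterates = 1.0, []
for k in range(1, n + 1):
    u = s * rng.standard_normal()
    y = u + sigma * rng.standard_normal()
    beta = 5.0 / k**0.7             # Robbins-Monro step size, alpha = 0.7
    grad = -2.0 * (y / (1.0 + lam) - u) * y / (1.0 + lam) ** 2
    lam = float(np.clip(lam - beta * grad, lam_l, lam_u))   # projection chi
    iterates.append(lam)

lam_bar = np.mean(iterates[-m:])    # averaged output, cf. Remark 3.4
print(lam_bar)                      # close to lam* = 0.25
```

The averaging over the final $m$ iterates smooths out the fluctuations of the individual $\lambda_k$, in line with the Jensen bound above.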
3.4. Consistency analysis with linear inverse problems.
We consider again the linear inverse problem from Section 2.2,
$$ y = Au + \xi, $$
with Gaussian prior information $u \sim N(0, \lambda^* C)$ and Gaussian noise $\xi \sim N(0, \Gamma)$, and the corresponding bilevel optimization with least squares data misfit and Tikhonov regularization, i.e.
$$ L_{\mathcal{Y}}(Au, y) = \frac{1}{2}\|Au - y\|_\Gamma^2, \qquad S_\lambda(u) = \frac{\lambda}{2}\|u\|_C^2. $$

Theorem 3.5.
Let $\beta = (\beta_k)_{k \in \mathbb{N}}$ be a sequence of step sizes with $\beta_k > 0$, $\sum_{k=1}^\infty \beta_k = \infty$, and $\sum_{k=1}^\infty \beta_k^2 < \infty$. Then for some constant $B$ and a sequence $C_n$ converging to zero, the following hold:
(1) the iterates generated by the exact SGD, Algorithm 1, converge to $\lambda^*$ in the sense
$$ \mathbb{E}\|\lambda_n - \lambda^*\|^2 \le B C_n; $$
(2) the iterates generated by the approximate SGD, Algorithm 2 with formula (3.6) and fixed $h_k = h$, converge to $\lambda^*$ up to an error of order $O(h^2)$, i.e.
$$ \mathbb{E}\|\lambda_n - \lambda^*\|^2 \le B (C_n + h^4). $$
If we use a decaying finite difference stepsize $h_k \le h \beta_k^{1/2}$, then the error can be further bounded by
$$ \mathbb{E}\|\lambda_n - \lambda^*\|^2 \le B C_n. $$
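A minimal sketch of Algorithm 2 in the same scalar toy setting used for Algorithm 1 (our illustration; the fixed step $h$ and all constants are hypothetical choices) shows that the central difference version behaves like the exact scheme up to a small bias.

```python
import numpy as np

# Approximate bilevel SGD (Algorithm 2) on the scalar toy model: the
# derivative of u_lam(y) = y/(1+lam) is replaced by the central difference
# (3.5) with fixed step h, introducing an O(h^2) bias.  lam* = 0.25.
rng = np.random.default_rng(1)
s, sigma, n, m, h = 1.0, 0.5, 5000, 500, 1e-3
lam_l, lam_u = 0.01, 2.0

lam, iterates = 1.0, []
for k in range(1, n + 1):
    u = s * rng.standard_normal()
    y = u + sigma * rng.standard_normal()
    u_lam = y / (1.0 + lam)
    # central difference approximation of d u_lam / d lam:
    du_fd = (y / (1.0 + lam + h) - y / (1.0 + lam - h)) / (2.0 * h)
    grad = 2.0 * du_fd * (u_lam - u)
    lam = float(np.clip(lam - (5.0 / k**0.7) * grad, lam_l, lam_u))
    iterates.append(lam)

lam_bar = np.mean(iterates[-m:])
print(lam_bar)   # close to lam* = 0.25, up to an O(h^2) bias
```

With $h$ this small, the finite difference bias is negligible compared to the stochastic fluctuation, consistent with part (2) of Theorem 3.5.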
Remark 3.6.
While in the offline setting the proof of the consistency result for the linear Gaussian setting relied heavily on the Gaussian assumptions on $u$ and $\xi$, in the online setting we are able to extend the result to non-Gaussian distributions of $u$ and $\xi$. For our proof of Theorem 3.5 in Appendix B.3, we only need to assume that $\mathbb{E}[|u|^4] < \infty$ and $\mathbb{E}[|A^\top \Gamma^{-1} \xi|^4] < \infty$. Hence, the result can also be applied to general linear inverse problems without a Gaussian assumption on the unknown parameter or on the noise.

4. Numerical results
In this section our focus is directed at testing the results of the previous sections. We apply our theory to various inverse problems based on partial differential equations, both linear and nonlinear, which include a linear 2D Laplace equation, a 2D Darcy flow problem from the geophysical sciences, and a 2D eikonal equation which arises in wave propagation. As a final numerical experiment, related to the examples discussed in Section 1, we test our theory on an image denoising problem.

For the linear example, we have access to the exact derivative of the Tikhonov solution for the bilevel optimization. In particular, we can implement both offline and online bilevel optimization methodologies. In contrast, finding the exact derivatives for nonlinear inverse problems is difficult both in derivation and computation, so we will only use online methods with approximated gradients. For the online methods, we implement the following variants:
• bSGD: application of the bilevel SGD, Algorithm 1, with exact derivative (3.3).
• bSGD$_a$: application of the bilevel SGD, Algorithm 2, with derivative approximation (3.6) for fixed $h_k = h_0$ in (3.5).
For our first model we have tested both bSGD and bSGD$_a$, while for the nonlinear models we have used bSGD$_a$. It is worth mentioning that we have also tested, as a side experiment, the adaptive derivative stepsize $h_k = h_0/k^{1/2}$. For these experiments the adaptive scheme does not show any major difference from the case of fixed $h_k = h_0$. In fact, Theorem 3.5 already implies this, since the difference between the two schemes is of order $h_0^2$, which is often smaller than the error from the numerical forward map solver or the use of $\bar\lambda_n$. For this reason, we do not demonstrate this scheme in our numerics.

4.1. Linear example: 2D Laplace equation.
We consider the forward model
(4.1)  −△p(x) = u(x), x ∈ X,  p(x) = 0, x ∈ ∂X,
with Lipschitz domain X = [0,1]², and the corresponding inverse problem of recovering the unknown u† from observations of (4.1), described through
y = O(p) + η,
where η ∼ N(0, Γ) is measurement noise and p is the solution of (4.1). We solve the PDE in weak form, where A⁻¹: U → V, with U = L^∞(X) and V = H²(X) ∩ H₀¹(X), denotes the solution operator for (4.1), and O: V → R^K denotes the observation map taking measurements at K randomly chosen points in X, i.e. O(p) = (p(x₁), …, p(x_K))⊤ for p ∈ V, x₁, …, x_K ∈ X. In our numerical setting K = 250 points are observed, as illustrated in Figure 1. We can express this problem as a linear inverse problem in the reduced form (2.3) by
y† = Au† + η ∈ R^K,
where A = O ∘ A⁻¹ is the forward operator taking measurements of (4.1). The forward model (4.1) is solved numerically on a uniform mesh with 1024 grid points in X by a finite element method with continuous, piecewise linear finite element basis functions.

We assume that our unknown parameter u† follows a Gaussian distribution N(0, λ∗C) with covariance
(4.2)  C = β · (τI − △)^{−α},
where the Laplacian operator △ is equipped with Dirichlet boundary conditions, and β, τ > 0, α > 1 and λ∗ > 0 are known. To sample from the Gaussian distribution, we consider the truncated Karhunen-Loève (KL) expansion [36], a series representation for u ∼ N(0, λ∗C), i.e.
(4.3)  u(x) = Σ_{i=1}^∞ ξ_i √(λ∗σ_i) φ_i(x),
where (σ_i, φ_i)_{i∈N} are the eigenvalues and eigenfunctions of the covariance operator C and (ξ_i)_{i∈N} is a sequence of i.i.d. N(0,1) random variables. Here, we have sampled from the KL expansion for the discretized C on the uniform mesh. Furthermore, we assume access to training data (u^(j), y^(j) = Au^(j) + η^(j))_{j=1,…,n}, n ∈ N, which we use to learn the unknown scaling parameter λ∗ before solving the inverse problem. For the numerical experiment we set β = 100, α = 2 and λ∗ = 0.1, together with a fixed small τ > 0. After learning the regularization parameter, we compare the estimates resulting from the different Tikhonov minima
u_{λ_i}(y†) = (A⊤Γ⁻¹A + λ_i · C⁻¹)⁻¹ A⊤Γ⁻¹y†,
for λ = λ̂ learned from the training data, λ = λ∗, and the fixed choice λ = 1. We have used the MATLAB function fmincon to recover the regularization parameter offline by solving the empirical optimization problem
λ̂_n ∈ arg min_{λ>0} (1/n) Σ_{j=1}^n |u_λ(y^(j)) − u^(j)|².
We use M = 1000 independent repetitions to construct Monte Carlo estimates of E[|λ̂_n − λ∗|²]. Since the computation of the empirical loss function can be computationally demanding, we also apply the proposed online recovery in the form of the SGD method to learn the regularization parameter λ, by running Algorithm 1 with step size β_k = 200/k, a bounded admissible range Λ for the regularization parameter, and initial value λ₀ = 1. The resulting iterates λ_k are shown in Figure 2 on the right.

From the numerical experiments for the linear example we observe that the numerics match our derived theory. In the offline recovery setting this is first evident in Figure 2 on the left, where the MSE appears to decay at the theoretical rate. The online recovery is highlighted by the right plot in Figure 2, which demonstrates the convergence towards λ∗ as the iterations progress. Further, we show the result of the approximate bSGD method, Algorithm 2, for the fixed choice h_k = 0.01 in (3.5). As the derivative approximation (3.5) is close to exact, the approximate bSGD method shows very similar performance.

Finally, Figure 3 shows the recovery of the underlying unknown for different choices of λ. It verifies that the adaptive learning of λ outperforms the fixed regularization parameter λ = 1.
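The offline recovery above can be sketched in a few lines on a finite-dimensional analogue of the linear problem. This is a hedged illustration, not the PDE experiment itself: the matrix A, the whitened prior C = I with precision scaled so that the population-optimal regularization parameter equals λ∗ (matching the whitened setup of Appendix A), and all sizes and parameter values are illustrative assumptions; SciPy's bounded scalar minimizer stands in for MATLAB's fmincon.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def tikhonov(A, Gamma_inv, C_inv, y, lam):
    # u_lambda(y) = (A^T Gamma^{-1} A + lam * C^{-1})^{-1} A^T Gamma^{-1} y
    return np.linalg.solve(A.T @ Gamma_inv @ A + lam * C_inv, A.T @ Gamma_inv @ y)

def empirical_risk(lam, A, Gamma_inv, C_inv, U, Y):
    # hat F_n(lam) = (1/n) sum_j || u_lam(y_j) - u_j ||^2, vectorised over the sample
    H = A.T @ Gamma_inv @ A + lam * C_inv
    U_hat = np.linalg.solve(H, A.T @ Gamma_inv @ Y)
    return float(np.mean(np.sum((U_hat - U) ** 2, axis=0)))

rng = np.random.default_rng(0)
d, K, n, lam_star, gamma = 10, 40, 4000, 0.1, 1.0
A = rng.standard_normal((K, d)) / np.sqrt(d)   # illustrative forward operator
C_inv = np.eye(d)                              # whitened prior shape C = I
Gamma_inv = np.eye(K) / gamma**2               # noise covariance Gamma = gamma^2 I

# training pairs: u^(j) ~ N(0, C / lam_star), y^(j) = A u^(j) + eta^(j)
U = rng.standard_normal((d, n)) / np.sqrt(lam_star)
Y = A @ U + gamma * rng.standard_normal((K, n))

res = minimize_scalar(empirical_risk, bounds=(1e-3, 10.0), method="bounded",
                      args=(A, Gamma_inv, C_inv, U, Y))
u_hat = tikhonov(A, Gamma_inv, C_inv, Y[:, 0], res.x)  # reconstruction with learned lambda
print(f"learned lambda_n = {res.x:.3f} (true lambda* = {lam_star})")
```

With this prior scaling the Tikhonov family passes through the posterior mean at λ = λ∗, so the empirical minimizer concentrates around λ∗ as n grows, consistent with Theorem 2.3.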
Figure 1. Reference PDE solution for the Laplace equation of the underlying unknown parameter u†, and the corresponding randomized observation points x₁, …, x_K ∈ X.
Figure 2. MSE (left) resulting from the offline recovery, depending on the training data size. Learned regularization parameter λ_k (right) resulting from the online recovery, Algorithm 1, for the Laplace equation.

4.2. Nonlinear example: 2D Darcy flow.
Figure 3. Comparison of different Tikhonov solutions for choices of the regularization parameter λ_i. The learned Tikhonov regularized solution corresponds to the output of the SGD method, Algorithm 1, for the Laplace equation.

We now consider an elliptic PDE arising in the study of subsurface flow, known as Darcy flow. The forward model uses the log-permeability u ∈ L^∞(X) =: U to solve for the pressure p ∈ H²(X) ∩ H₀¹(X) =: V from
(4.4)  −∇·(exp(u)∇p) = f, x ∈ X,  p = 0, x ∈ ∂X,
with domain X = [0,1]² and known constant source term f ∈ R. We again consider the corresponding inverse problem of recovering the unknown u† from observations of (4.4), described through
y = O(p) + η,
where O: V → R^K denotes the linear observation map, which again takes measurements at K randomly chosen points in X, i.e. O(p) = (p(x₁), …, p(x_K))⊤ for p ∈ V, x₁, …, x_K ∈ X. In our numerical setting we choose K = 125 observation points, which can be seen in Figure 4. The measurement noise is η ∼ N(0, Γ), with Γ ∈ R^{K×K} symmetric and positive definite.

We formulate the inverse problem through
y† = G(u†) + η,
with G = O ∘ 𝒢, where 𝒢: U → V denotes the solution operator of (4.4) in weak form. The forward problem (4.4) has been solved by a second-order centered finite difference method on a uniform mesh with 256 grid points.

We assume that u† follows the Gaussian distribution N(0, λ∗C) with covariance operator (4.2), now prescribed with Neumann boundary conditions and, similarly as before, known β, τ > 0, α > 1 and λ∗ > 0; see also [10, 27] for more details. We truncate the KL expansion (4.3) after d terms and consider the nonlinear map 𝒢: R^d → R^K with 𝒢(ξ) = G(u_ξ(·)) and
u_ξ(·) = Σ_{i=1}^d ξ_i √(λ∗σ_i) φ_i(·).
The unknown parameter is thus ξ ∈ R^d, on which we place the Gaussian prior N(0, λ∗I), λ∗ > 0. We assume access to training data (ξ^(j), y^(j))_{j=1,…,n}, n ∈ N, where ξ^(j) ∼ Ξ ∼ N(0, λ∗I), and we aim to solve the original bilevel optimization problem
λ̂ ∈ arg min_{λ>0} E[‖u_λ(Y) − Ξ‖²],  u_λ(Y) = arg min_{ξ∈R^d} ‖𝒢(ξ) − Y‖²_Γ + λ‖ξ‖².
The corresponding empirical optimization problem is given by
(4.5)  λ̂_n ∈ arg min_{λ>0} (1/n) Σ_{j=1}^n ‖u_λ(y^(j)) − ξ^(j)‖²,  u_λ(y^(j)) = arg min_{ξ∈R^d} ‖𝒢(ξ) − y^(j)‖²_Γ + λ‖ξ‖²,
for a given size n of the training data. In contrast to the linear setting, we cannot compute the Tikhonov minimum analytically for each observation y^(j), so solving (4.5) directly requires considerably more computational power. We therefore solve (4.5) online by applying Algorithm 2, where we approximate the derivative of the forward model by the centered difference method (3.5), with the accuracy of the numerical approximation fixed to a small constant h_k ≡ h_0. We truncate after d = 25 coefficients in the KL expansion and take noise covariance Γ = γ²I with small γ > 0. For this experiment we set β = 10, α = 2 and τ = 3, with true scaling parameter λ∗ > 0, and run the SGD with a step size β_k of order 1/k. The learned parameter moves quickly towards the true λ∗ and then oscillates around this value, with the variance reducing over the iterations, as seen in Figure 6. Finally, Figure 5 highlights again the importance and the improvements of choosing the right regularization parameter.
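The approximate bilevel SGD (bSGD_a) used for the nonlinear examples can be sketched as follows. This is a hedged illustration: the toy nonlinear map G below merely stands in for the PDE solution operator, and the matrix W, the prior scaling, the step-size constant and the iteration count are all hypothetical choices, not the paper's settings. The λ-derivative of the lower-level minimizer is replaced by a centered difference, in the spirit of (3.5) and (3.6).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, K, lam_star, gamma = 3, 8, 0.5, 1.0
W = 0.5 * rng.standard_normal((K, d))      # fixed matrix defining the toy forward map

def G(xi):
    # mildly nonlinear forward map, standing in for the PDE solution operator
    return W @ xi + 0.1 * np.sin(W @ xi)

def lower_level(y, lam):
    # u_lambda(y) = argmin_xi ||G(xi) - y||^2 / gamma^2 + lam * ||xi||^2
    def obj(xi):
        r = G(xi) - y
        return np.sum(r**2) / gamma**2 + lam * np.sum(xi**2)
    def grad(xi):
        J = W + 0.1 * (np.cos(W @ xi)[:, None] * W)   # Jacobian of G
        return 2.0 * J.T @ (G(xi) - y) / gamma**2 + 2.0 * lam * xi
    return minimize(obj, np.zeros(d), jac=grad, method="BFGS",
                    options={"gtol": 1e-10}).x

def grad_f(lam, xi_true, y, h=1e-4):
    # d/dlam ||u_lam(y) - xi||^2, with d u_lam/d lam replaced by the
    # centered difference (u_{lam+h} - u_{lam-h}) / (2h), cf. (3.5)-(3.6)
    du = (lower_level(y, lam + h) - lower_level(y, lam - h)) / (2 * h)
    return 2.0 * (lower_level(y, lam) - xi_true) @ du

def sample():
    xi = rng.standard_normal(d) / np.sqrt(lam_star)   # prior N(0, I / lam_star)
    y = G(xi) + gamma * rng.standard_normal(K)
    return xi, y

lam, lam_lo, lam_hi = 2.0, 0.05, 10.0
for k in range(1, 151):                    # online bilevel SGD, Algorithm 2 sketch
    xi, y = sample()
    lam = float(np.clip(lam - (0.5 / k) * grad_f(lam, xi, y), lam_lo, lam_hi))
print(f"learned lambda after 150 steps: {lam:.3f} (target around {lam_star})")
```

Each SGD step draws one fresh training pair, so no empirical risk over the whole data set is ever formed; this is the computational advantage of the online scheme over (4.5).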
Figure 4. Reference PDE solution for Darcy flow of the underlying unknown parameter u†, and the corresponding randomized observation points x₁, …, x_K ∈ X.

4.3. Nonlinear example: Eikonal equation.
We also test our theory on the eikonal equation, which describes wave propagation. Given a slowness (inverse velocity) function s(x) ∈ C⁰(X̄) =: U characterizing the medium, and a source location x₀ ∈ X, the forward eikonal equation is to solve for the travel time T(x) ∈ C⁰(X̄) =: V satisfying
(4.6)  |∇T(x)| = s(x), x ∈ X \ {x₀},  T(x₀) = 0,  ∇T(x)·ν(x) ≥ 0, x ∈ ∂X.
The forward solution T(x) represents the shortest travel time from x₀ to a point in the domain X. The Soner boundary condition imposes that the wave propagates along the unit outward normal ν(x) on the boundary of the domain. The model equation (4.6) is of the form (1.4), with an additional constraint arising from the Soner boundary condition.

The inverse problem for (4.6) is to determine the speed function s = exp(u) from measurements of the shortest travel time T(x). The data is assumed to take the form
y = O(T) + η,
where O: V → R^K denotes the linear observation map, which again takes measurements at K = 125 randomly chosen grid points in X, i.e. O(T) = (T(x₁), …, T(x_K))⊤ for T ∈ V, x₁, …, x_K ∈ X. The observation points can be seen in Figure 7. The measurement noise is again η ∼ N(0, Γ), with Γ ∈ R^{K×K} symmetric and positive definite. Again we formulate the inverse problem through
y† = G(u†) + η,
with G = O ∘ 𝒢, where 𝒢: U → V denotes the solution operator of (4.6). As before, we assume that our unknown u† is distributed according to a mean-zero Gaussian with covariance structure (4.2). For this numerical example we set β = 1, α = 2 and λ∗ = 0.1, together with a fixed small τ > 0. We truncate the KL expansion such that the unknown parameter is ξ ∈ R^d with d = 25.

Figure 5. Comparison of different Tikhonov solutions for choices of the regularization parameter λ. The learned Tikhonov regularized solution corresponds to the output of the SGD method, Algorithm 2, for Darcy flow.

Figure 6. Learned regularization parameter λ_k for Darcy flow, resulting from the approximate bilevel SGD method, Algorithm 2, with fixed derivative accuracy h_k ≡ h_0, and the corresponding mean λ̄_n over the last 50 iterations. The resulting error |λ∗ − λ̄_n| is small.

For the eikonal equation we take a similar approach to Section 4.2; that is, we use the SGD method described through Algorithm 2. For the SGD method we have chosen the adaptive step size
β_k = min(c, λ₀/|∇_λ f(λ_k, Z^(k))|) k⁻¹,
for a small constant c > 0. This choice bounds the maximal move in each SGD step, i.e.
|β_k · ∇_λ f(λ_k, Z^(k))| ≤ λ₀/k.
This helps to avoid instability arising from the high variance of the stochastic gradient, while the step size is mainly of order c/k. From the theoretical side, however, it is not clear whether assumption (3.2) is still satisfied. Therefore, we also show the accumulated step sizes Σ_{k=1}^n β_k and the realizations of the stochastic gradient ∇_λ f(λ_k, Z^(k)) in Figure 10. Our parameter choices for the prior and for the bilevel optimization problem otherwise remain the same. To discretize (4.6) on a uniform mesh with 256 grid points we use a fast marching method, as described in the work of Sethian [23, 44].

Turning to the numerical experiments, Figure 8 highlights that using the learned λ̄_n provides recoveries almost identical to those obtained with the true λ∗. In both cases we see an improvement over the choice λ = 1, which is what we expected and have seen throughout our experiments. This is verified through Figure 9, where the learned λ_k oscillates around the true λ∗ until approximately 100 iterations, after which it starts to stabilize. Finally, from Figure 10 we see that the summation of our step sizes β_k diverges, but not as quickly as the summation of the deterministic step size c/k, which is the effect of the adaptive upper bound based on the size of the stochastic gradient ∇_λ f(λ_k, Z^(k)). Figure 10 also shows the histogram of the stochastic gradient and its rare large realizations.
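The gradient-norm cap in the adaptive step size can be sketched in a few lines. This is a hedged illustration: the function name `capped_step` and the constants `c0` and `lam0` are hypothetical stand-ins for the constants in the step-size rule above, and the gradient values are fabricated to show the clipping behaviour.

```python
import numpy as np

def capped_step(lam, grad, k, c0=0.1, lam0=1.0):
    """One SGD step with the gradient-norm-capped step size
    beta_k = min(c0, lam0 / |grad|) / k, so that |beta_k * grad| <= lam0 / k."""
    beta = min(c0, lam0 / abs(grad)) / k
    return lam - beta * grad

# even a heavy-tailed stochastic gradient cannot move the iterate by more than lam0/k
lam = 1.0
for k, g in enumerate([0.5, 250.0, -3.0, 1e6], start=1):
    new = capped_step(lam, g, k)
    assert abs(new - lam) <= 1.0 / k + 1e-12   # the per-step move is capped
    lam = new
print(f"final lambda: {lam:.4f}")
```

The cap only activates when |∇_λ f| is large, so on typical draws the scheme behaves like a plain c/k step size, which is why the accumulated step sizes in Figure 10 still diverge.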
Figure 7. Reference PDE solution for the eikonal equation of the underlying unknown parameter u†, and the corresponding randomized observation points x₁, …, x_K ∈ X.

4.4. Signal denoising example.
We now consider implementing our methods for signal denoising, which is discussed in Section 1 and subsection 1.1.2. We are interested in denoising a 1D compound Poisson process of the form
(4.7)  u_t = Σ_{i=1}^{N_t} X_i,
where (N_t)_{t∈[0,T]} is a Poisson process with rate r > 0 and (X_i)_{i=1}^{N_t} are i.i.d. random variables representing the jump sizes; here we have chosen mean-zero Gaussian jumps. In particular, the observed signal u = (u_{t₁}, …, u_{t_d})⊤ ∈ R^d is perturbed by white noise,
(4.8)  y_{t_i} = u_{t_i} + η_{t_i},
where t_i ∈ {1/d·T, 2/d·T, …, T} and the η_{t_i} ∼ N(0, σ²) are i.i.d. random variables. The Tikhonov estimate corresponding to the lower-level problem of (1.5) for a given regularization parameter λ > 0 is
(4.9)  u_λ(y) = (Γ⁻¹ + λL⁻¹)⁻¹ Γ⁻¹y,
with a given regularization matrix L ∈ R^{d×d} and y = (y_{t₁}, …, y_{t_d})⊤ ∈ R^d. We assume access to training data (u^(j), y^(j))_{j=1}^n of (4.8) and choose the regularization parameter λ̂ according to Algorithm 1. Further, we compare the resulting estimate of the signal
y_obs = u† + η
to fixed choices of λ and to the best possible choice λ∗ = arg min_λ ‖u_λ(y_obs) − u†‖².

Figure 8. Comparison of different Tikhonov solutions for choices of the regularization parameter λ. The learned Tikhonov regularized solution corresponds to the output of the SGD method, Algorithm 2, for the eikonal equation.

Figure 9. Learned regularization parameter λ_k for the eikonal equation, resulting from the approximate bilevel SGD method, Algorithm 2, with fixed derivative accuracy h_k ≡ h_0, and the corresponding mean λ̄_n over the last 50 iterations. The resulting error |λ∗ − λ̄_n| is small.

Figure 10. Summation of the realized adaptive step sizes (left) and the realized stochastic gradients ∇_λ f(λ_k, Z^(k)) (right) resulting from the online recovery for the eikonal equation.

Table 1. MSE over time of the reconstruction for different choices of the regularization parameter for the signal denoising example.

For the experiment we set the rate of jumps r = 10 and consider the signal observed up to time T = 1 at d = 1000 observation points. For Algorithm 1, we use a training data set of size n = 500, an initial value λ₀ = 0.001, and a step size β_k of order 1/k. The Tikhonov solution (4.9) has been computed with a second-order regularization matrix L built from the discrete Laplacian. As we can see from our results, the value λ = 0.001 oversmooths the estimate in comparison with the other fixed choice of λ. Comparing these fixed choices of λ with the learned λ̂ in Figure 12, we see an improvement, closer to the best possible λ∗, which is verified further through Table 1, where we report the MSE over the time interval. Both Figure 11 and Figure 12 show on the right-hand side the pointwise squared error over time.

5. Conclusion
In this work we have provided new insights into the theory of bilevel learning for inverse problems. In particular, our focus was on deriving statistical consistency results, in the large data limit, for the regularization parameter λ. This was considered for both the offline and the online representations of the bilevel problem. For the online version we used and motivated stochastic gradient descent as the optimizer of choice, as it is well known to reduce the computational time required compared to other methodologies. To test our theory we ran numerical experiments on various PDEs, which not only verified the theory but also clarified that adapting the regularization parameter λ outperforms a fixed value.

Figure 11. Comparison of different Tikhonov solutions for fixed choices of the regularization parameter λ for the signal denoising example.

Figure 12. Comparison of the learned and the best possible Tikhonov solutions for choices of the regularization parameter λ. The learned Tikhonov regularized solution corresponds to the output of the SGD method, Algorithm 1, for the signal denoising example.

Our results in this article suggest numerous directions for future work, both practical and analytical.
• One direction is to consider a fully Bayesian approach to, or understanding of, bilevel learning. In the context of statistical inverse problems, this could be related to treating λ as a hyperparameter of the underlying unknown. This is referred to as hierarchical learning [39], which aims to improve the overall accuracy of the reconstruction [1, 22].
• Another potential direction is to understand statistical consistency for other choices of regularization. Answering this for other penalty terms, such as L¹, total variation and adversarial regularization [38] (based on neural networks), is of importance and interest in numerous applications [2]. A potential first step in this direction would be the well-known elastic-net regularization [24], which combines L¹ and Tikhonov regularization; of course, this would require modifying the convexity assumptions.
• Finally, one could propose alternative optimizers that provide a lower computational cost. A natural choice would be derivative-free optimization. One potential optimizer is ensemble Kalman inversion (EKI), a recent derivative-free methodology of particular interest to the authors. As EKI has been used in hierarchical settings [10, 11], the reduction in cost could be combined with the hierarchical motivation discussed above.
Acknowledgements
NKC acknowledges a Singapore Ministry of Education Academic Research Funds Tier 2 grant [MOE2016-T2-2-135]. SW is grateful to the DFG RTG 1953 "Statistical Modeling of Complex Systems and Processes" for funding this research. The research of XTT is supported by the National University of Singapore grant R-146-000-292-114.
Appendix A. Proofs of offline consistency analysis
A.1.
General framework.
We start with the proof for the general framework:
Proof of Proposition 2.2.
To simplify the mathematical notation, we use z to denote the data couple (u, y), and f to denote the data loss function
f(λ, z) = L_U(u_λ(y), u).
When λ̂_n ∈ D, we apply the fundamental theorem of calculus to ∇_λ F̂_n and find
∇_λ F̂_n(λ∗) = ∇_λ F̂_n(λ̂_n) + ∫₀¹ ∇²_λ F̂_n(sλ∗ + (1−s)λ̂_n)(λ∗ − λ̂_n) ds = A_F(λ∗ − λ̂_n),
where
A_F := ∫₀¹ ∇²_λ F̂_n((1−s)λ̂_n + sλ∗) ds ⪰ c₀ I.
Note that
0 = ∇_λ F(λ∗) = ∇_λ F(λ∗) − ∇_λ F̂_n(λ∗) + ∇_λ F̂_n(λ∗) = A_F(λ∗ − λ̂_n) + ∇_λ F(λ∗) − ∇_λ F̂_n(λ∗).
We can reorganize this as
∇_λ F̂_n(λ∗) − ∇_λ F(λ∗) = A_F(λ∗ − λ̂_n).
As a consequence, we now have a formula for the point estimation error λ∗ − λ̂_n:
‖λ∗ − λ̂_n‖² = ‖A_F⁻¹(∇_λ F̂_n(λ∗) − ∇_λ F(λ∗))‖² ≤ c₀⁻² ‖∇_λ F̂_n(λ∗) − ∇_λ F(λ∗)‖².
Note that, using ∇_λ E f(λ, Z) = E ∇_λ f(λ, Z) (see [43, Theorem 12.5]),
∇_λ F̂_n(λ∗) − ∇_λ F(λ∗) = (1/n) Σ_{i=1}^n ∇_λ f(λ∗, z_i) − E ∇_λ f(λ∗, Z).
So
E‖∇_λ F̂_n(λ∗) − ∇_λ F(λ∗)‖² = (1/n) tr(Var(∇_λ f(λ∗, Z))).
And our second claim follows by the Cauchy-Schwarz inequality:
E‖∇_λ F̂_n(λ∗) − ∇_λ F(λ∗)‖ ≤ √(E‖∇_λ F̂_n(λ∗) − ∇_λ F(λ∗)‖²). □

Next, we apply Proposition 2.2 to the linear inverse problem. To simplify the discussion, we apply a whitening transformation, u_i ← C^{−1/2}u_i, so that we may assume w.l.o.g. that C = λ∗⁻¹I. Denote D := A⊤Γ^{−1/2} and ξ_i = Γ^{−1/2}(Au_i − y_i) ∼ N(0, I), and note that
(A⊤Γ⁻¹A + λI)⁻¹A⊤Γ⁻¹y_i − u_i = (DD⊤ + λI)⁻¹A⊤Γ⁻¹(Au_i + η_i) − u_i = −(DD⊤ + λI)⁻¹(λu_i + Dξ_i).
Therefore we define
f(λ, z) = Tr((DD⊤ + λI)⁻²(λu + Dξ)(λu + Dξ)⊤) = Tr((DD⊤ + λI)⁻²(λ²uu⊤ + 2λDξu⊤ + Dξξ⊤D⊤)). A.2.
Pointwise consistency analysis.
To apply Proposition 2.2, it is necessary to show that the gradient of F̂_n(λ) is a good approximation of ∇F(λ) at λ = λ∗ with high probability. This is in fact true for general λ. To show this, we start by showing that the sample covariances are consistent. Lemma A.1.
Let X^(i) and Y^(i), i ∈ N, be i.i.d. N(0, I_d), and let Σ ∈ R^{d×d} be fixed. There exists a universal constant c > 0 such that for any n and
C_n = (1/n) Σ_{i=1}^n X^(i)(X^(i))⊤,  B_n = (1/n) Σ_{i=1}^n X^(i)(Y^(i))⊤,
the following holds:
P(|Tr(ΣC_n) − Tr(Σ)| > t) ≤ 2 exp(−cn min(t²/‖Σ‖_F², t/‖Σ‖)),
P(|Tr(ΣB_n)| > t) ≤ 2 exp(−cn min(t²/‖Σ‖_F², t/‖Σ‖)). Proof.
First note that Tr(Σu_iu_i⊤) = u_i⊤Σu_i. We define the block-diagonal matrix D_Σ ∈ R^{nd×nd} consisting of n blocks of Σ, and Z = [u_1; u_2; ⋯; u_n] ∈ R^{nd}. Note that
Tr(ΣC_n) = Z⊤(n⁻¹D_Σ)Z.
By the Hanson-Wright inequality [42, Theorem 1.1], we obtain, for some constants c and K,
P(|Tr(ΣC_n) − Tr(Σ)| > t) ≤ 2 exp(−c min(t²/(K⁴‖n⁻¹D_Σ‖_F²), t/(K²‖n⁻¹D_Σ‖))).
Note that
‖n⁻¹D_Σ‖_F² = n⁻²‖D_Σ‖_F² = n⁻¹‖Σ‖_F²,  ‖n⁻¹D_Σ‖ = n⁻¹‖D_Σ‖ = n⁻¹‖Σ‖,
so the first assertion is proved. For the second claim, we first note that
Tr(Σξ_iu_i⊤) = u_i⊤Σξ_i = [u_i⊤, ξ_i⊤] Q [u_i; ξ_i],  Q = (1/2)[0, Σ; Σ⊤, 0] ∈ R^{(d+d_y)×(d+d_y)}.
Consider the block-diagonal matrix D_Q ∈ R^{n(d+d_y)×n(d+d_y)} consisting of n blocks of Q, and Z = [u_1; ξ_1; u_2; ξ_2; ⋯; u_n; ξ_n] ∈ R^{n(d+d_y)}. Then we can verify that
Tr(ΣB_n) = Z⊤(n⁻¹D_Q)Z.
By the Hanson-Wright inequality [42, Theorem 1.1], we have
P(|Tr(ΣB_n)| > t) ≤ 2 exp(−c min(t²/(K⁴‖n⁻¹D_Q‖_F²), t/(K²‖n⁻¹D_Q‖))).
Again we note that
‖n⁻¹D_Q‖_F² = n⁻²‖D_Q‖_F² = n⁻¹‖Q‖_F² ≤ n⁻¹‖Σ‖_F²,  ‖n⁻¹D_Q‖ = n⁻¹‖Q‖ ≤ n⁻¹‖Σ‖,
and finally end up with
P(|Tr(ΣB_n)| > t) ≤ 2 exp(−cn min(t²/‖Σ‖_F², t/‖Σ‖)). □
By the previous result we obtain the following convergence results.
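Before stating them, Lemma A.1's concentration can be illustrated numerically. This is a hedged sanity check, not part of the proof: the test matrix Σ, the dimensions and the sample sizes are arbitrary choices, and the O(n^{-1/2}) shrinkage of the typical deviation is what the lemma's sub-exponential tail predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
Sigma = np.diag(np.arange(1.0, d + 1))    # fixed test matrix, Tr(Sigma) = 15

def trace_dev(n):
    # |Tr(Sigma C_n) - Tr(Sigma)| for C_n = (1/n) sum_i X_i X_i^T, X_i ~ N(0, I_d)
    X = rng.standard_normal((n, d))
    Cn = X.T @ X / n
    return abs(np.trace(Sigma @ Cn) - np.trace(Sigma))

# average deviation over 200 repetitions for two sample sizes
devs = {n: float(np.mean([trace_dev(n) for _ in range(200)])) for n in (100, 10000)}
print(devs)   # the deviation shrinks roughly like 1/sqrt(n)
```

Increasing n by a factor of 100 reduces the typical deviation by roughly a factor of 10, consistent with the variance bound Var(Tr(ΣC_n)) = 2‖Σ‖_F²/n used implicitly above.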
Lemma A.2.
Let Q_λ = (DD⊤ + λI)⁻¹. The empirical loss function F̂_n is C³ in λ, and for any λ ∈ (λ_l, λ_r) there exist constants C, c > 0 such that for all ε > 0
P(|∂_λ F̂_n(λ) − 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³)| > ε) ≤ C exp(−cn min{ε, ε²}),
and
P(|∂²_λ F̂_n(λ) − (2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴)| > ε) ≤ C exp(−cn min{ε, ε²}),
where we have defined D := A⊤Γ^{−1/2}.

Proof. We compute
∂_λ f(λ, z) = Tr(−2Q_λ³(λ²uu⊤ + 2λDξu⊤ + Dξξ⊤D⊤) + Q_λ²(2λuu⊤ + 2Dξu⊤)),
∂²_λ f(λ, z) = 2 Tr(3Q_λ⁴(λ²uu⊤ + 2λDξu⊤ + Dξξ⊤D⊤) − 4Q_λ³(λuu⊤ + Dξu⊤) + Q_λ²uu⊤)
= 2 Tr((3λ²Q_λ⁴ − 4λQ_λ³ + Q_λ²)uu⊤ + 2(3λQ_λ⁴ − 2Q_λ³)Dξu⊤ + 3Q_λ⁴Dξξ⊤D⊤).
Since F̂_n(λ) = (1/n)Σ_{i=1}^n f(λ, z_i), if we let
C_u = (1/n)Σ_{i=1}^n u_iu_i⊤,  B = (1/n)Σ_{i=1}^n ξ_iu_i⊤,  C_ξ = (1/n)Σ_{i=1}^n ξ_iξ_i⊤,
then
(A.1)  ∂_λ F̂_n(λ) = Tr(−2Q_λ³(λ²C_u + 2λDB + DC_ξD⊤) + Q_λ²(2λC_u + 2DB))
= −2λ²Tr(Q_λ³C_u) − 4λTr(Q_λ³DB) − 2Tr(D⊤Q_λ³DC_ξ) + 2λTr(Q_λ²C_u) + 2Tr(Q_λ²DB),
and
(A.2)  ∂²_λ F̂_n(λ) = 2 Tr((3λ²Q_λ⁴ − 4λQ_λ³ + Q_λ²)C_u + 2(3λQ_λ⁴ − 2Q_λ³)DB + 3Q_λ⁴DC_ξD⊤)
= 2 Tr((3λ²Q_λ⁴ + Q_λ²)C_u) − 8λTr(Q_λ³C_u) + 12λTr(Q_λ⁴DB) − 8Tr(Q_λ³DB) + 6Tr(D⊤Q_λ⁴DC_ξ).
Therefore
(A.3)  ∂³_λ F̂_n(λ) = 2 Tr((18λQ_λ⁴ − 12λ²Q_λ⁵ − 6Q_λ³)C_u + 2(9Q_λ⁴ − 12λQ_λ⁵)DB − 12Q_λ⁵DC_ξD⊤).
We note that ‖Q_λ‖ ≤ λ⁻¹ ≤ λ_l⁻¹ for all λ ≥ λ_l, and ‖Q_λ‖_F ≤ √d λ_l⁻¹ respectively, and that
E ∂_λ F̂_n(λ) = 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³),
where the expectation averages over realizations of the z_i. Moreover, by (A.1), ∂_λ F̂_n can be written as a sum of terms of the form Tr(ΣC_u), Tr(ΣB) and Tr(ΣC_ξ) for certain matrices Σ. Note that for any random variables A_k,
P(|Σ_{k=1}^m (A_k − E A_k)| > ε) ≤ Σ_{k=1}^m P(|A_k − E A_k| > ε/m).
Therefore we can apply Lemma A.1 to each trace term and bound its probability of deviating from its mean, so that we can find constants C, c such that
P(|∂_λ F̂_n(λ) − 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³)| > ε) ≤ C exp(−cn min(ε², ε)).
The same method yields the second assertion, since
E ∂²_λ F̂_n(λ) = (2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴). □
Remark A.3.
It is worth noting that
∂²_λ F(λ) = E ∂²_λ F̂_n(λ) = (2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴)
is not always positive; it can be negative if λ is very large. In other words, F is not convex on the whole real line. Therefore, it is necessary to introduce a local parameter domain on which F is convex. Remark A.4.
We note that, through the definition of f, we can ensure that E[∂_λ f(λ, z)] = ∂_λ E[f(λ, z)]. This can be seen by the following computation of E[f(λ, z)]. We can write
E[f(λ, z)] = E‖Q_λ(λu + Dξ)‖² = Tr(Cov(Q_λ(λu + Dξ))),
and with Q_λ(λu + Dξ) ∼ N(0, Q_λ(λ²λ∗⁻¹I + DD⊤)Q_λ) we obtain
E[f(λ, z)] = Tr(Q_λ(λ²λ∗⁻¹I + DD⊤)Q_λ) = Tr(Q_λ²(λ²λ∗⁻¹I + DD⊤)).
Hence
∂_λ E[f(λ, z)] = Tr(−2Q_λ³(λ²λ∗⁻¹I + DD⊤) + 2λλ∗⁻¹Q_λ²) = E[∂_λ f(λ, z)]. A.3.
Consistency analysis within an interval.
To apply Proposition 2.2, it is also necessary to show that F̂_n(λ) is strongly convex in a local region/interval. This can be done using a chaining argument from probability theory. First, we show that the empirical loss function has bounded derivatives with high probability. Lemma A.5.
There exists an L > 0 such that
P(max_{λ_l ≤ λ ≤ λ_u} |∂^k_λ F̂_n(λ)| > L, k = 1, 2, 3) ≤ 6 exp(−nc₀).

Proof. Recall that ‖Q_λ‖ ≤ λ_l⁻¹ and the formulae (A.1), (A.2) and (A.3). It is easy to see that if C_u, B, C_ξ are all bounded in operator norm, then |∂^k_λ F̂_n(λ)| is uniformly bounded for all λ_l ≤ λ ≤ λ_u. Taking t = √d in Lemma A.1,
P(‖C_u‖ > dλ∗⁻¹ + √d) ≤ P(Tr(C_u) > dλ∗⁻¹ + √d) ≤ 2 exp(−cn min(1, √d)) = 2 exp(−cn),
P(‖C_ξ‖ > d + √d) ≤ P(Tr(C_ξ) > d + √d) ≤ 2 exp(−cn),
P(‖B‖ > d + √d) ≤ 2 exp(−cn).
Then, with a certain constant M(λ_l) and L = M(λ_l)(dλ∗⁻¹ + d + √d),
P(max_{λ_l ≤ λ ≤ λ_u} |∂^k_λ F̂_n(λ)| ≤ L, k = 1, 2, 3)
≥ P(‖C_u‖ ≤ dλ∗⁻¹ + √d, ‖C_ξ‖ ≤ d + √d, ‖B‖ ≤ d + √d) ≥ 1 − 6 exp(−nc₀). □
Next, we show that if a function is bounded at each fixed point with high probability, it is likely to be bounded on a fixed interval if it is Lipschitz.
Lemma A.6.
Let f_n(λ) be a function of λ such that, for some interval I = [λ_l, λ_u],
P(f_n(λ) > a) ≤ C exp(−nc_a)  for all λ_l ≤ λ ≤ λ_u.
Then
P(max_{λ∈I} f_n(λ) > 2a, max_{λ∈I} |∂f_n(λ)| ≤ M) ≤ 2a⁻¹|λ_u − λ_l| M C exp(−nc_a).
Likewise, if f_n(λ) is a function of λ such that, for the same interval I,
P(f_n(λ) < a) ≤ C exp(−nc_a)  for all λ_l ≤ λ ≤ λ_u,
then
P(min_{λ∈I} f_n(λ) < a/2, max_{λ∈I} |∂f_n(λ)| ≤ M) ≤ 2a⁻¹|λ_u − λ_l| M C exp(−nc_a).

Proof. Pick λ_i = λ_l + (a/M)i for i = 0, …, ⌊|λ_u − λ_l|M/a⌋. Then λ_l ≤ λ_i ≤ λ_u, and for any λ_l ≤ λ ≤ λ_u we have |λ − λ_i| ≤ a/M for some λ_i. Note that if |∂f_n(λ)| ≤ M on I and f_n(λ_i) ≤ a for all i, then for any λ_l ≤ λ ≤ λ_u,
f_n(λ) ≤ f_n(λ_i) + |λ − λ_i| max_{λ∈I}|∂f_n(λ)| ≤ a + (a/M)M = 2a.
Consequently, by the union bound,
P(max_{λ∈I} f_n(λ) > 2a, max_{λ∈I} |∂f_n(λ)| ≤ M) ≤ P(f_n(λ_i) > a for some i) ≤ 2a⁻¹|λ_u − λ_l| M C exp(−nc_a).
The same argument applies to the second claim, except that the grid points are spaced by a/(2M). □
The next lemma indicates that the loss function is strongly convex within D with high probability. Lemma A.7.
Assume that the largest eigenvalue of DD⊤ is λ_D and let λ ∈ D := [λ∗/2, 3λ∗/2]. Then, for some constants c, C > 0,
P(min_{λ∈D} ∂²_λ F̂_n(λ) < H∗/4) ≤ C min{H∗, 1}⁻¹ exp(−cn min(H∗, H∗², 1)),
with H∗ = H∗(λ_D, λ∗) = 2λ_D²(λ_D + 3λ∗/2)⁻⁴/λ∗ > 0.

Proof. Note that for any λ ∈ D, with v the eigenvector of DD⊤ corresponding to the largest eigenvalue λ_D,
(2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴) ≥ (2/λ∗)Tr(DD⊤DD⊤Q_λ⁴) ≥ (2/λ∗)v⊤DD⊤DD⊤Q_λ⁴v = 2λ_D²/((λ_D + λ)⁴λ∗) ≥ H∗
for λ ∈ D, using 3λ∗ − 2λ ≥ 0 on D. Setting ε = H∗/2 > 0 in Lemma A.2, we find constants C, c > 0 with
C exp(−cn min(H∗, H∗²)) ≥ P(|∂²_λ F̂_n(λ) − (2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴)| > H∗/2)
≥ P(∂²_λ F̂_n(λ) < (2/λ∗)Tr((3λ∗I − 2λI + DD⊤)DD⊤Q_λ⁴) − H∗/2)
≥ P(∂²_λ F̂_n(λ) < H∗/2).
By Lemma A.5 there exists an L > 0 such that
P(max_{λ∈D} |∂³_λ F̂_n(λ)| > L) ≤ 6 exp(−nc₀),
and by Lemma A.6 it holds true that
C min(H∗, 1)⁻¹ exp(−cn min(H∗, H∗², 1)) ≥ P(min_{λ∈D} ∂²_λ F̂_n(λ) < H∗/4, max_{λ∈D} |∂³_λ F̂_n(λ)| ≤ L),
for some C > 0. We define the sets A_n := {min_{λ∈D} ∂²_λ F̂_n(λ) < H∗/4} and B_n := {max_{λ∈D} |∂³_λ F̂_n(λ)| ≤ L}, and we obtain
P(min_{λ∈D} ∂²_λ F̂_n(λ) < H∗/4) = P(A_n | B_n)P(B_n) + P(A_n | B_nᶜ)P(B_nᶜ)
≤ P(A_n ∩ B_n) + P(B_nᶜ) ≤ C min(H∗, 1)⁻¹ exp(−cn min(H∗, H∗², 1)). □
The last lemma indicates that the empirical loss function is unlikely to have local minima outside D. Lemma A.8.
Assume again that the largest eigenvalue of DD⊤ is λ_D, and let L∗ = λ_D(λ_D + λ_u)⁻³. There are constants c, C such that
P(min_{λ_u ≥ λ > 3λ∗/2} ∂_λ F̂_n(λ) < L∗/2) ≤ C min{L∗, 1}⁻¹ exp(−cn min(L∗, L∗², 1)),
and
P(max_{λ∗/2 ≥ λ > λ_l} ∂_λ F̂_n(λ) > −L∗/2) ≤ C min{L∗, 1}⁻¹ exp(−cn min(L∗, L∗², 1)).

Proof. We first note that, with v the leading eigenvector of DD⊤,
Tr(DD⊤Q_λ³) ≥ v⊤DD⊤Q_λ³v ≥ L∗.
And for λ > 3λ∗/2 we have 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³) ≥ L∗. We set ε = L∗/2 in Lemma A.2 to obtain
C exp(−nc min(L∗, L∗²)) ≥ P(|∂_λ F̂_n(λ) − 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³)| > L∗/2)
≥ P(∂_λ F̂_n(λ) < 2(λ/λ∗ − 1)Tr(DD⊤Q_λ³) − L∗/2)
≥ P(∂_λ F̂_n(λ) < L∗/2).
Similarly as in Lemma A.7, we use Lemma A.5 and Lemma A.6 to obtain the first assertion. We obtain the second assertion by using
2(λ/λ∗ − 1)Tr(DD⊤Q_λ³) ≤ −L∗ < 0  for λ < λ∗/2. □
A.4.
Summarizing argument.
Finally, we are ready to prove Theorem 2.3.
Proof of Theorem 2.3.
Denote D = [λ∗/2, 3λ∗/2], H∗ = 2λ_D²(λ_D + 3λ∗/2)⁻⁴/λ∗, and the events
B = {λ_l < λ̂_n < λ_u},  A_n = {λ̂_n ∈ D, ∂²_λ F̂_n(λ) ≥ H∗/4 for all λ ∈ D}.
First we decompose
P(|λ̂_n − λ∗| > ε, B) = P(|λ̂_n − λ∗| > ε, B | A_n)·P(A_n) + P(|λ̂_n − λ∗| > ε, B | A_nᶜ)·P(A_nᶜ)
≤ P(|λ̂_n − λ∗| > ε, B | A_n) + P(B ∩ A_nᶜ).
In the last step we have used P(λ̂_n ≤ λ_u) = 1. By Proposition 2.2,
P(|λ̂_n − λ∗| > ε, B | A_n) ≤ P(|∂_λ F̂_n(λ∗) − ∂_λ F(λ∗)| > H∗ε/4, B) = P(|∂_λ F̂_n(λ∗)| > H∗ε/4, B),
which can be bounded by Lemmas A.2 and A.6:
P(|λ̂_n − λ∗| > ε | A_n) ≤ C₁ exp(−nc₁ min{ε, ε²}),
for some C₁, c₁. We bound the probability P(B ∩ A_nᶜ) by
P(B ∩ A_nᶜ) ≤ P(B, λ̂_n ∉ D) + P({∂²_λ F̂_n(λ) ≥ H∗/4 for all λ ∈ D}ᶜ),
and study both terms separately. Note first that, by Lemma A.8, for some constants C₂, c₂ > 0,
P(B, λ̂_n ∉ D) ≤ P(∂_λ F̂_n(λ) = 0 for some λ ∈ (λ_l, λ_u) \ D) ≤ C₂ exp(−c₂n).
Second, by Lemma A.7, for some constants C₃, c₃ > 0,
P({∂²_λ F̂_n(λ) ≥ H∗/4 for all λ ∈ D}ᶜ) ≤ C₃ exp(−c₃n).
Finally, there exist constants C∗, c∗ > 0 such that
P(|λ̂_n − λ∗| > ε) ≤ C∗ exp(−c∗n min(ε, ε²)). □

Appendix B. Online consistency analysis
B.1.
Stochastic gradient decent.
We start by verifying Lemma 3.2.
Proof of Lemma 3.2.
We apply the implicit function theorem to prove this statement. For fixed y ∈ R^K, we define the function
φ(λ, u) := ∇_u Ψ(λ, u, y).
Since u ↦ Ψ(λ, u, y) is strictly convex, for all (λ, u) near (λ̄, u_λ̄(y)) the Jacobian of φ with respect to u is invertible, i.e.
D_u φ(λ, u) = ∇²_u Ψ(λ, u, y) ≻ 0.
Let λ̄ ∈ R^d be arbitrary; then for ū = u_λ̄(y) it holds that φ(λ̄, ū) = 0, and by the implicit function theorem there exists an open neighborhood D̄ ⊂ R^d with λ̄ ∈ D̄ and a unique continuously differentiable function Ū: D̄ → R^d with Ū(λ̄) = ū and
φ(λ, Ū(λ)) = 0  for all λ ∈ D̄,
i.e. Ū maps each λ ∈ D̄ to the corresponding regularized solution Ū(λ) = u_λ(y). Further, the partial derivatives of Ū are given by
∂Ū/∂λ_i(λ) = −[D_u φ(λ, Ū(λ))]⁻¹ [∂φ/∂λ_i(λ, Ū(λ))].
Since the choice of λ̄ was arbitrary, it follows that λ ↦ u_λ(y) is continuously differentiable with derivative
∇_λ u_λ(y) = −(∇²_u[Ψ(λ, u_λ(y), y)])⁻¹ ∇_λ∇_u[Ψ(λ, u_λ(y), y)].
The computation of ∇_λ f is then obtained by the chain rule. □
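The derivative formula from Lemma 3.2 can be checked numerically in the Tikhonov case. As a hedged illustration, assume Ψ(λ, u, y) = ‖Au − y‖²_{Γ⁻¹} + λ‖u‖², so that ∇²_uΨ = 2(A⊤Γ⁻¹A + λI) and ∇_λ∇_uΨ = 2u, and the formula reduces to ∇_λ u_λ(y) = −(A⊤Γ⁻¹A + λI)⁻¹u_λ(y); the matrix A, the noise covariance and all sizes below are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, lam = 4, 7, 0.3
A = rng.standard_normal((K, d))
Gi = np.eye(K)                                     # Gamma^{-1} = I for simplicity
y = rng.standard_normal(K)

def u_lam(l):
    # lower-level minimiser of Psi(l, u, y) = ||Au - y||^2 + l ||u||^2
    return np.linalg.solve(A.T @ Gi @ A + l * np.eye(d), A.T @ Gi @ y)

# Lemma 3.2: grad_lam u_lam = -(Hess_u Psi)^{-1} (d/dlam grad_u Psi)
#                           = -(A^T Gamma^{-1} A + lam I)^{-1} u_lam
exact = -np.linalg.solve(A.T @ Gi @ A + lam * np.eye(d), u_lam(lam))

h = 1e-6
fd = (u_lam(lam + h) - u_lam(lam - h)) / (2 * h)   # centered difference
print(np.max(np.abs(exact - fd)))                  # small: the two derivatives agree
```

This is also the exact derivative (3.3) that the bSGD variant exploits in the linear example of Section 4.1.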
B.2.
General consistency analysis framework.
Proof of Proposition 3.3.
Note that $\triangle_{k+1} = \chi\big(\lambda_k - \beta_k \widetilde\nabla_\lambda f(\lambda_k, Z_k)\big) - \lambda^*$, and apply Lemma 3.1:
\begin{align*}
\|\triangle_{k+1}\| &= \big\|\chi\big(\lambda_k - \beta_k \widetilde\nabla_\lambda f(\lambda_k, Z_k)\big) - \lambda^*\big\| \le \|\lambda_k - \beta_k \widetilde\nabla_\lambda f(\lambda_k, Z_k) - \lambda^*\| \\
&= \|\triangle_k - \beta_k \widetilde\nabla_\lambda f(\lambda_k, Z_k)\| = \|\triangle_k - \beta_k \nabla_\lambda F(\lambda_k) - \beta_k \delta_k - \beta_k \xi_k\|,
\end{align*}
where
\[
\delta_k = \mathbb{E}_k \widetilde\nabla f(\lambda_k, Z) - \nabla F(\lambda_k), \qquad \xi_k = \widetilde\nabla_\lambda f(\lambda_k, Z_k) - \mathbb{E}_k \widetilde\nabla_\lambda f(\lambda_k, Z)
\]
are the bias and the noise in the stochastic gradient. We denote the expectation conditioned on the information available at step $k$ by $\mathbb{E}_k$, and define the first exit time of $D$ by $\tau = \inf\{k \ge 0 : \lambda_k \notin D\}$. Next, we note that
\[
\mathbb{E}_k\|\widetilde\nabla f(\lambda_k, Z_k)\|^2 = \|\nabla_\lambda F(\lambda_k) + \delta_k\|^2 + \mathbb{E}_k\|\xi_k\|^2.
\]
So if $\tau \ge k$,
\begin{align*}
\mathbb{E}_k\|\triangle_{k+1}\|^2 &\le \|\triangle_k\|^2 - 2\beta_k \triangle_k^T\big(\nabla_\lambda F(\lambda_k) + \delta_k\big) + \beta_k^2\big(\|\nabla_\lambda F(\lambda_k) + \delta_k\|^2 + \mathbb{E}_k\|\xi_k\|^2\big) \\
&\le \|\triangle_k\|^2 - 2\beta_k \triangle_k^T \nabla_\lambda F(\lambda_k) + 2\beta_k\|\triangle_k\|\|\delta_k\| + \beta_k^2\big(a + b\|\triangle_k\|^2\big) \\
&\le \|\triangle_k\|^2 - 2c\beta_k\|\triangle_k\|^2 + \tfrac12 c\beta_k\|\triangle_k\|^2 + \tfrac{2}{c}\beta_k\|\delta_k\|^2 + a\beta_k^2 + b\beta_k^2\|\triangle_k\|^2 \\
&\le \big(1 - \tfrac32 c\beta_k + b\beta_k^2\big)\|\triangle_k\|^2 + \big(a\beta_k + 2\alpha_k^2/c\big)\beta_k.
\end{align*}
Since $\beta_k < c/(2b)$, we have
\[
\mathbb{E}_k \mathbb{1}_{\tau \ge k+1}\|\triangle_{k+1}\|^2 \le \mathbb{E}_k \mathbb{1}_{\tau \ge k}\|\triangle_{k+1}\|^2 \le \mathbb{1}_{\tau \ge k}\Big(\big(1 - c\beta_k\big)\|\triangle_k\|^2 + \big(a\beta_k + 2\alpha_k^2/c\big)\beta_k\Big).
\]
Let $Q_k = \mathbb{1}_{\tau \ge k}\|\triangle_k\|^2$; then we have just derived that
\[
\mathbb{E} Q_{k+1} \le (1 - c\beta_k)\,\mathbb{E} Q_k + \big(a\beta_k + 2\alpha_k^2/c\big)\beta_k.
\]
Therefore, by Gronwall's inequality,
\[
(\text{B.1})\quad \mathbb{E} Q_n \le a\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 + \frac{2}{c}\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k \alpha_k^2 + \exp\Big(-c\sum_{j=1}^n \beta_j\Big)\mathbb{E} Q_0.
\]
Next we look at the second term of (B.1). Note that when $\alpha_k \le \alpha$, then
\[
\frac{2}{c}\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k \alpha_k^2 \le \frac{2\alpha^2}{c^2}\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,c\beta_k \le \frac{2\alpha^2}{c^2}\sum_{k=1}^n \Big(\prod_{j=k+1}^n (1 - c\beta_j) - \prod_{j=k}^n (1 - c\beta_j)\Big) \le \frac{2\alpha^2}{c^2}.
\]
In this case, (B.1) becomes
\[
\mathbb{E} Q_n \le a\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 + \frac{2\alpha^2}{c^2} + \exp\Big(-c\sum_{j=1}^n \beta_j\Big)\mathbb{E} Q_0.
\]
And if $\alpha_k^2 \le D\beta_k$, then (B.1) can be simplified to
\[
\mathbb{E} Q_n \le \big(a + 2D/c\big)\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 + \exp\Big(-c\sum_{j=1}^n \beta_j\Big)\mathbb{E} Q_0.
\]
In both cases, to show our claim, we just need to show that
\[
\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 \le C_n, \qquad \exp\Big(-c\sum_{j=1}^n \beta_j\Big) \le C_n.
\]
Let $k_0$ be the minimizer
\[
k_0 = \arg\min_{k \le n}\, \max\Big\{\prod_{j=k+1}^n (1 - c\beta_j),\ \beta_k/c\Big\}.
\]
Then note that
\[
\sum_{k=1}^{k_0} \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 \le \prod_{j=k_0+1}^n (1 - c\beta_j)\sum_{k=1}^\infty \beta_k^2 \le C_n,
\]
and also
\[
\sum_{k=k_0+1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 \le \frac{\beta_{k_0}}{c}\sum_{k=k_0+1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,c\beta_k \le \frac{\beta_{k_0}}{c}\sum_{k=k_0+1}^n \Big(\prod_{j=k+1}^n (1 - c\beta_j) - \prod_{j=k}^n (1 - c\beta_j)\Big) \le \frac{\beta_{k_0}}{c} \le C_n.
\]
The sum of the previous two inequalities leads to
\[
\sum_{k=1}^n \prod_{j=k+1}^n (1 - c\beta_j)\,\beta_k^2 \le 2C_n.
\]
Finally,
\[
\exp\Big(-c\sum_{j=1}^n \beta_j\Big)\mathbb{E} Q_0 \le \exp\Big(-c\sum_{j=k_0+1}^n \beta_j\Big)\mathbb{E} Q_0 \le C_n.
\]
To see that $C_n$ converges to zero, simply let
\[
k_n = \max\Big\{k : \prod_{j=1}^k (1 - c\beta_j) > \sqrt{\prod_{j=1}^n (1 - c\beta_j)}\Big\}.
\]
Because $\prod_{j=1}^n (1 - c\beta_j)$ decays to zero as $n \to \infty$, $k_n$ increases to $\infty$, and $\beta_{k_n}$ decays to zero. Meanwhile,
\[
C_n \le \min\Big\{\prod_{j=k_n+1}^n (1 - c\beta_j),\ \beta_{k_n}\Big\} \le \min\Big\{\sqrt{\prod_{j=1}^n (1 - c\beta_j)},\ \beta_{k_n}\Big\},
\]
which decays to zero as $n \to \infty$. $\square$
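The projected stochastic gradient recursion analysed above can be illustrated with a small numerical sketch. The objective, the interval, the step-size schedule, and all constants below are toy choices of ours, not the paper's experiments:

```python
import numpy as np

# Toy illustration of the projected stochastic gradient recursion
#   lambda_{k+1} = chi(lambda_k - beta_k * grad_f(lambda_k, Z_k))
# for a hypothetical one-dimensional objective F(lam) = (lam - lam_star)^2.
rng = np.random.default_rng(1)
lam_star = 1.0
lam_lo, lam_hi = 0.1, 3.0          # projection interval D = [lam_lo, lam_hi]

def chi(lam):
    """Projection onto D, as in the recursion."""
    return min(max(lam, lam_lo), lam_hi)

def sgd(n_steps, beta0=1.0):
    lam = 2.5                      # initial guess inside D
    for k in range(1, n_steps + 1):
        beta_k = beta0 / k         # step sizes: sum beta_k = inf, sum beta_k^2 < inf
        grad = 2.0 * (lam - lam_star) + rng.standard_normal()  # unbiased noisy gradient
        lam = chi(lam - beta_k * grad)
    return lam

errs = [abs(sgd(5000) - lam_star) for _ in range(30)]
print(f"mean |lambda_n - lambda*| over 30 runs: {np.mean(errs):.4f}")
```

With a strongly convex objective and this step-size schedule, the averaged error shrinks as the number of steps grows, consistent with the $\mathbb{E} Q_n \to 0$ conclusion of the proposition.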
B.3.
Application to linear inverse problems.
Proof of Theorem 3.5.
We will set $D = \Lambda = [\lambda_l, \lambda_u]$. Then, because $\chi$ always brings $\lambda_k$ back into $D$, the event $A$ always happens. We have seen in the proof of Lemma A.8 that
\[
\nabla_\lambda F(\lambda) = 2(\lambda/\lambda^* - 1)\operatorname{Tr}(Q_\lambda^2 D^TD).
\]
Multiplication with $(\lambda - \lambda^*)$ gives
\[
(\lambda - \lambda^*)\nabla_\lambda F(\lambda) = 2(\lambda - \lambda^*)^2\operatorname{Tr}(Q_\lambda^2 D^TD)/\lambda^*.
\]
So if we take $c = 2\operatorname{Tr}(Q_{\lambda_u}^2 D^TD)/\lambda^* > 0$, then $(\lambda - \lambda^*)\nabla_\lambda F(\lambda) \ge c(\lambda - \lambda^*)^2$ on $D$, since $Q_\lambda \succeq Q_{\lambda_u}$ for $\lambda \le \lambda_u$. Next, since $\widetilde\nabla_\lambda f(\lambda, Z) - \nabla_\lambda f(\lambda, Z) = 2\langle \widetilde\nabla_\lambda u_\lambda(y) - \nabla_\lambda u_\lambda(y),\, u_\lambda(y) - u^\dagger\rangle$, the Cauchy–Schwarz inequality yields
\[
\mathbb{E}|\widetilde\nabla_\lambda f(\lambda, Z) - \nabla_\lambda f(\lambda, Z)|^2 \le 4\,\mathbb{E}\big[\|\nabla_\lambda u_\lambda(y) - \widetilde\nabla_\lambda u_\lambda(y)\|^2\,\|u_\lambda(y) - u^\dagger\|^2\big] \le 4\,\mathbb{E}\big[\|\nabla_\lambda u_\lambda(y) - \widetilde\nabla_\lambda u_\lambda(y)\|^4\big]^{1/2}\,\mathbb{E}\big[\|u_\lambda(y) - u^\dagger\|^4\big]^{1/2} \in \mathcal{O}(h^4).
\]
This leads to (3.9) by
\[
|\mathbb{E}\,\widetilde\nabla_\lambda f(\lambda, Z) - \nabla_\lambda F(\lambda)| = |\mathbb{E}\,\widetilde\nabla_\lambda f(\lambda, Z) - \mathbb{E}\,\nabla_\lambda f(\lambda, Z)| \le \mathbb{E}\,|\widetilde\nabla_\lambda f(\lambda, Z) - \nabla_\lambda f(\lambda, Z)|.
\]
First note that it holds true that
\[
\mathbb{E}[\|u_\lambda(y) - u^\dagger\|^4] \le 8\,\mathbb{E}[\|u_\lambda(y)\|^4] + 8\,\mathbb{E}[\|u^\dagger\|^4] \le 64\|Q_\lambda\|^4\big(\|DD^T\|^4\,\mathbb{E}[\|u^\dagger\|^4] + \mathbb{E}[\|D^T\xi\|^4]\big) + 8\,\mathbb{E}[\|u^\dagger\|^4] \le \big(64\|DD^T\|^4\lambda_l^{-4} + 8\big)m_{u,4} + 64\lambda_l^{-4}m_{D\xi,4} =: C_1.
\]
Next, we show that
\[
(\text{B.2})\quad \mathbb{E}[\|\nabla_\lambda u_\lambda(y) - \widetilde\nabla_\lambda u_\lambda(y)\|^4] \le C_2 h^8.
\]
To this end, we apply a Taylor expansion of $u_\lambda(y) = Q_\lambda D^T(Du^\dagger + \xi) = \phi(\lambda)V$, with $V := D^T(Du^\dagger + \xi) \sim N\big(0,\, D^T(\tfrac{1}{\lambda^*}DD^T + I)D\big)$ and $\phi(\lambda) := Q_\lambda$; i.e. there exist $\lambda_+, \lambda_- \in [\lambda - h, \lambda + h]$ such that
\[
\widetilde\nabla_\lambda u_\lambda(y) - \nabla_\lambda u_\lambda(y) = \frac{1}{12}\big(\phi^{(3)}(\lambda_+) + \phi^{(3)}(\lambda_-)\big)h^2 V.
\]
Using $\phi^{(k)}(\lambda) = (-1)^k k!\,Q_\lambda^{k+1}$, we can bound
\[
\mathbb{E}[\|\phi^{(3)}(\lambda_i)h^2 V\|^4] \le 6^4\lambda_l^{-16}\,\mathbb{E}[\|V\|^4]\,h^8
\]
for $\lambda_i \in \{\lambda_+, \lambda_-\} \subset [\lambda_l, \lambda_u]$, and we obtain (B.2) with $C_2 = \big(\tfrac{2\cdot 6}{12}\big)^4\lambda_l^{-16}\,\mathbb{E}[\|V\|^4] = \lambda_l^{-16}\,\mathbb{E}[\|V\|^4]$. Taking the square root finally verifies (3.9) with $\alpha \in \mathcal{O}(h^2)$.

To prove that (3.8) is satisfied, we first compute
\[
|\nabla_\lambda f(\lambda, Z)| = \big|{-2\lambda^2}\langle u^\dagger, Q_\lambda^3 u^\dagger\rangle + 4\lambda\langle u^\dagger, Q_\lambda^3 D^T\xi\rangle - 2\langle D^T\xi, Q_\lambda^3 D^T\xi\rangle + 2\lambda\langle u^\dagger, Q_\lambda^2 u^\dagger\rangle - 2\langle u^\dagger, Q_\lambda^2 D^T\xi\rangle\big|,
\]
so that
\[
|\nabla_\lambda f(\lambda, Z)|^2 \le 20\lambda^4\|Q_\lambda\|^6\|u^\dagger\|^4 + 80\lambda^2\|Q_\lambda\|^6\|u^\dagger\|^2\|D^T\xi\|^2 + 20\|Q_\lambda\|^6\|D^T\xi\|^4 + 20\lambda^2\|Q_\lambda\|^4\|u^\dagger\|^4 + 20\|Q_\lambda\|^4\|u^\dagger\|^2\|D^T\xi\|^2.
\]
We apply the bound $\|Q_\lambda\| \le 1/\lambda \le 1/\lambda_l$ for $\lambda \in D$ and obtain
\[
|\nabla_\lambda f(\lambda, Z)|^2 \le 20\lambda_l^{-2}\|u^\dagger\|^4 + 80\lambda_l^{-4}\|u^\dagger\|^2\|D^T\xi\|^2 + 20\lambda_l^{-6}\|D^T\xi\|^4 + 20\lambda_l^{-2}\|u^\dagger\|^4 + 20\lambda_l^{-4}\|u^\dagger\|^2\|D^T\xi\|^2.
\]
Then we note that
\[
\mathbb{E}\|\widetilde\nabla f(\lambda, Z)\|^2 \le 2\,\mathbb{E}\|\nabla f(\lambda, Z)\|^2 + 2\,\mathbb{E}\|\nabla f(\lambda, Z) - \widetilde\nabla f(\lambda, Z)\|^2.
\]
Therefore, (3.8) holds with
\[
a = 80\lambda_l^{-2} m_{u,4} + 160\lambda_l^{-4} m_{u,2}\,m_{D\xi,2} + 40\lambda_l^{-6} m_{D\xi,4} + 40\lambda_l^{-4} m_{u,2}\,m_{D\xi,2} + 8\sqrt{C_1 C_2}\,h^4, \qquad b = 0,
\]
where $m_{u,2} = \mathbb{E}[\|u^\dagger\|^2]$, $m_{u,4} = \mathbb{E}[\|u^\dagger\|^4]$, $m_{D\xi,2} = \mathbb{E}[\|D^T\xi\|^2]$ and $m_{D\xi,4} = \mathbb{E}[\|D^T\xi\|^4]$. $\square$

References

[1] S. Agapiou, J. M. Bardsley, O. Papaspiliopoulos and A. M. Stuart. Analysis of the Gibbs sampler for hierarchical inverse problems.
SIAM/ASA J. Uncertainty Quantification, 2(1), 511–544, 2014.
[2] S. Arridge, P. Maass, O. Öktem and C.-B. Schönlieb. Solving inverse problems using data-driven models. Acta Numerica, 2019.
[3] Y. Y. Belov. Inverse Problems for Partial Differential Equations. Inverse and Ill-Posed Problems Series, 32, De Gruyter, 2002.
[4] M. Benning and M. Burger. Modern regularization methods for inverse problems. Acta Numerica, 27, 2018.
[5] L. Borcea. Electrical impedance tomography. Inverse Problems, 18, 2002.
[6] M. Burger, A. C. G. Mennucci, S. Osher and M. Rumpf. Level Set and PDE Based Reconstruction Methods in Imaging. Springer, 2008.
[7] P. J. Bickel and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Cambridge University Press, 1971.
[8] L. Bottou. On-line learning and stochastic approximations. In On-Line Learning in Neural Networks, Cambridge University Press, USA, 9–42, 1999.
[9] L. Calatroni, C. Chung, J. C. De los Reyes, C.-B. Schönlieb and T. Valkonen. Bilevel approaches for learning of variational imaging models. Variational Methods, 2016.
[10] N. K. Chada, M. A. Iglesias, L. Roininen and A. M. Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 32, 2018.
[11] N. K. Chada. Analysis of hierarchical ensemble Kalman inversion. arXiv:1801.00847, 2018.
[12] N. K. Chada, A. M. Stuart and X. T. Tong. Tikhonov regularization within ensemble Kalman inversion. SIAM J. Numer. Anal., 58(2), 1263–1294, 2020.
[13] J. Chung, M. I. Espanol and T. Nguyen. Optimal regularization parameters for general-form Tikhonov regularization. Inverse Problems, 2014.
[14] N. Couellan and W. Wang. Bi-level stochastic gradient for large scale support vector machine. Neurocomputing, 2014.
[15] N. Couellan and W. Wang. Uncertainty-safe large scale support vector machines. Computational Statistics & Data Analysis, 109, 215–230, 2017.
[16] N. Couellan and W. Wang. On the convergence of stochastic bi-level gradient methods. hal-01932372, 2018.
[17] B. Colson, P. Marcotte and G. Savard. An overview of bilevel optimization. Ann. Oper. Res., 153, 235–256, 2007.
[18] M. D'Elia, J. C. De los Reyes and A. M. Trujillo. Bilevel parameter optimization for learning nonlocal image denoising models. arXiv:1912.02347, 2020.
[19] J. C. De los Reyes and C.-B. Schönlieb. Image denoising: learning the noise model via nonsmooth PDE-constrained optimization. Inverse Problems and Imaging, 7, 1183–1214, 2013.
[20] J. C. De los Reyes, C.-B. Schönlieb and T. Valkonen. Bilevel parameter learning for higher-order total variation regularisation models. J. Math. Imaging Vision, 57, 1–25, 2017.
[21] S. Dempe. Foundations of Bilevel Programming. Springer, 2002.
[22] M. Dunlop, M. A. Iglesias and A. M. Stuart. Hierarchical Bayesian level set inversion. Statistics and Computing, 27, 1555–1584, 2017.
[23] C. M. Elliott, K. Deckelnick and V. Styles. Numerical analysis of an inverse problem for the eikonal equation. Numerische Mathematik, 2011.
[24] H. W. Engl, M. Hanke and A. Neubauer. Regularization of Inverse Problems. Mathematics and its Applications, 375, Kluwer Academic Publishers, Dordrecht, 1996.
[25] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. International Conference on Machine Learning, 1563–1572, 2018.
[26] J. H. Friedman, R. Tibshirani and T. Hastie. The Elements of Statistical Learning. Springer, 2001.
[27] A. Garbuno-Inigo, F. Hoffmann, W. Li and A. M. Stuart. Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler. SIAM J. Appl. Dyn. Syst., 19(1), 412–441, 2020.
[28] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. The MIT Press, 2016.
[29] G. Holler, K. Kunisch and R. C. Barnard. A bilevel approach for parameter learning in inverse problems. Inverse Problems, 34, 115012, 2018.
[30] N. Kantas, A. Beskos and A. Jasra. Sequential Monte Carlo methods for high-dimensional inverse problems: a case study for the Navier–Stokes equations. SIAM/ASA J. Uncertain. Quantif., 2, 464–489, 2014.
[31] S. Jenni and P. Favaro. Deep bilevel learning. European Conference on Computer Vision, 2018.
[32] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer-Verlag, New York, 2004.
[33] B. Kaltenbacher, A. Kirchner and B. Vexler. Goal-oriented adaptivity in the IRGNM for parameter identification in PDEs II: all-at-once formulations. Inverse Problems, 30, 045002, 2014.
[34] B. Kaltenbacher. Regularization based on all-at-once formulations for inverse problems. SIAM J. Numer. Anal., 54, 2594–2618, 2016.
[35] K. Kunisch and T. Pock. A bilevel optimization approach for parameter learning in variational models. SIAM J. Imaging Sci., 6, 938–983, 2013.
[36] G. Lord, C. E. Powell and T. Shardlow. An Introduction to Computational Stochastic PDEs. Cambridge Texts in Applied Mathematics, 2014.
[37] S. Lu and S. V. Pereverzev. Regularization Theory for Ill-posed Problems: Selected Topics. Walter de Gruyter, Inverse and Ill-Posed Problems Series, 58, 2013.
[38] S. Lunz, O. Öktem and C.-B. Schönlieb. Adversarial regularizers in inverse problems. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 8516–8525, 2018.
[39] O. Papaspiliopoulos, G. O. Roberts and M. Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, 22(1), 59–73, 2007.
[40] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4), 838–855, 1992.
[41] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22, 400–407, 1951.
[42] M. Rudelson and R. Vershynin. Hanson–Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 1–9, 2013.
[43] R. Schilling. Measures, Integrals and Martingales. Cambridge University Press, second edition, 2017.
[44] J. A. Sethian. Level Set Methods and Fast Marching Methods. Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, 1999.
[45] A. Sinha, P. Malo and K. Deb. A review on bilevel optimization: from classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2), 276–295, 2018.
[46] Inverse Problems, 33, 114008, 2019.
[47] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[48] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19, 451–559, 2010.
[49] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. Elsevier, 1987.
[50] A. N. Tikhonov, A. S. Leonov and A. G. Yagola. Nonlinear Ill-posed Inverse Problems. Springer, Applied Mathematical Sciences, 1998.
Department of Statistics and Applied Probability, National University of Singapore, 119077, Singapore
E-mail address: [email protected]

Mannheim School of Computer Science and Mathematics, University of Mannheim, 68131 Mannheim, Germany
E-mail address : [email protected] Department of Mathematics, National University of Singapore, 119077, Singapore
E-mail address: [email protected]

Mannheim School of Computer Science and Mathematics, University of Mannheim, 68131 Mannheim, Germany
E-mail address: