Bayesian optimization with improved scalability and derivative information for efficient design of nanophotonic structures
Xavier Garcia-Santiago, Sven Burger, Carsten Rockstuhl, Philipp-Immanuel Schneider
This is the authors' version of an article published in the Journal of Lightwave Technology: X. Garcia-Santiago, et al., J. Light. Technol., 167 (2021). The final version of record is available at https://dx.doi.org/10.1109/JLT.2020.3023450. Copyright 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].
Abstract—We propose the combination of forward shape derivatives and the use of an iterative inversion scheme for Bayesian optimization to find optimal designs of nanophotonic devices. This approach widens the range of applicability of Bayesian optimization to situations where a larger number of iterations is required and where derivative information is available. This was previously impractical because the computational effort required to identify the next evaluation point in the parameter space became much larger than the actual evaluation of the objective function. We demonstrate an implementation of the method by optimizing a waveguide edge coupler.
I. INTRODUCTION
Finding the optimal design of a device is a fundamental problem in science and engineering. The general solution procedure consists of parametrizing the geometry or materials of the device by a set of variables x = [x_1, x_2, ..., x_d]. By using a numerical solver to determine with an objective function f_{ob}(x) the performance of devices at different points in the d-dimensional parameter space D, one seeks to find an optimal set of parameters x_{opt} such that

f_{ob}(x_{opt}) \geq f_{ob}(x) \quad \forall\, x \in D.   (1)

Depending on the parametrization, one can access a different range of possible devices. Usually, the larger the number of parameters, the broader the range of designs that can be addressed. However, except for convex problems, the complexity of finding a globally optimal design scales exponentially with the number of parameters. This fact is commonly referred to as the curse of dimensionality [1] or the Hughes effect [2], [3].

When designing nanophotonic devices, the objective function depends on the electromagnetic field, and for its determination one normally needs to solve Maxwell's equations.

Xavier Garcia-Santiago is with the Institute of Theoretical Solid State Physics, Karlsruhe Institute of Technology, 76137 Karlsruhe, Germany and with JCMwave GmbH, Bolivarallee 22, 14050 Berlin, Germany, e-mail: ([email protected]). Carsten Rockstuhl is with the Institute of Theoretical Solid State Physics, Karlsruhe Institute of Technology, 76137 Karlsruhe, Germany, with the Institute of Nanotechnology, Karlsruhe Institute of Technology, 76344 Eggenstein-Leopoldshafen, Germany, and with the Max Planck School of Photonics, Germany. Sven Burger and Philipp-Immanuel Schneider are with JCMwave GmbH, Bolivarallee 22, 14050 Berlin, Germany and with the Zuse Institute Berlin, Takustraße 7, 14195 Berlin, Germany.
In general, computing the objective function requires times between some minutes and several hours, depending on the complexity of the problem. This, added to the problem of the exponential scaling of the parameter space, makes finding optimal nanophotonic devices a very hard or even intractable problem. Only for highly symmetric devices, such as devices invariant in one dimension or cylindrically symmetric devices, computing the objective function requires far less time and the global optimization problem can be solved with reasonable effort.

The convergence rate of the optimization can often be improved by providing derivatives of the objective function with respect to the design parameters x_i. To achieve this, two different methods have been proposed: the direct method [4], [5] and the adjoint method [6], [7]. The direct method propagates all derivatives with the solution of Maxwell's equations. It is suitable for problems with a low or medium number of parameters, a few dozen at most. The adjoint method determines derivatives by additionally solving an adjoint problem. This additional effort is only worthwhile for high-dimensional parameter spaces. For example, topology optimization typically requires thousands or even millions of degrees of freedom. This makes, however, the problem of finding a global optimum in general intractable [8]. Nevertheless, designs obtained using local optimization methods show a good performance [9]. Here, we consider medium dimensional design spaces, with no more than twenty parameters, where finding a global optimum can still be feasible.

One highly efficient global optimization algorithm for medium dimensional optimization problems of computationally expensive objective functions is Bayesian optimization [10]–[14]. Bayesian optimization is based on the use of a surrogate stochastic model of the objective function. The stochastic model is used to infer different probabilistic quantities of the objective function with the aim of achieving a faster convergence with respect to the number of evaluations. The stochastic model most frequently used in Bayesian optimization is a Gaussian process [15], [16]. Gaussian processes are widely used in the field of machine learning [17], [18] for regression and classification tasks. One property to highlight is that they can also incorporate derivative information in a natural manner. In several benchmarks [19], Bayesian optimization required about an order of magnitude fewer iterations to find good parameter values when compared to other global optimization methods such as particle swarm optimization or differential evolution. Moreover, the benchmarks showed that providing derivative information can significantly increase the convergence rate.

However, one fundamental disadvantage of Bayesian optimization is its poor scalability with respect to the number of objective function evaluations N_ev. The time to compute a new sampling point scales with O(N_ev^3).
Moreover, if also provided with the d derivatives of the objective, the sample computation time increases asymptotically by a factor of (1 + d)^3. For very expensive objective functions and a moderate number of iterations, the sample computation time is often negligible. However, in many cases it can become the major bottleneck of the method. Specifically, if the simulations are suitable to be parallelized or devices with a high degree of symmetry are considered, e.g., diffractive gratings or cylindrically symmetric devices, the simulation times can be drastically reduced. Often, this allows for solving Maxwell's equations even in a few seconds. The design space can then be sampled very often, allowing to explore higher dimensional parameter spaces. In these cases, the O(N_ev^3) scaling renders Bayesian optimization impractical.

There are two main steps during which Bayesian optimization suffers from scalability issues: the inversion of the covariance matrix and the evaluation of the acquisition function [20] at different points. In a previous work [19], we proposed a technique to mitigate the problem of the time for evaluating the acquisition function. In this contribution, we propose the use of a matrix update scheme for the inversion of the covariance matrix. With this scheme, the method can scale with O(N_ev^2 (d+1)^2) instead of O(N_ev^3 (d+1)^3), enabling a significant increase of the number of possible optimization steps in practical applications.

The work starts with a short review of the finite element method with forward shape and material derivatives. Then, the Bayesian optimization method is presented and its scalability problems are demonstrated. After introducing the matrix update approach for addressing the scalability problem, the method is applied, as an example, to optimize a waveguide coupler with a minimal spatial footprint as a prototypical problem in integrated nanophotonics.

II. NUMERICAL PROCEDURE
A. Finite element method with shape derivatives
The objective function in nanophotonic design problems is typically a function of the electromagnetic field that is a solution of Maxwell's equations. To solve Maxwell's equations, we employ the finite element method [21] (FEM), as it is particularly suitable for handling general shapes of varying complexity. Specifically, we use the commercial solver JCMsuite [22], [23]. Note, however, that the method that we will present in this work can be used with any numerical solver able to calculate the derivatives of an objective function with respect to the design parameters. Other examples are the finite difference time domain method [24], the beam propagation method [25], or the eigenmode expansion method [26]–[28], to name a few.

Within FEM, one solves the weak form of Maxwell's equations approximating the electric field with a discrete set of local basis functions. This calculation translates into an algebraic system

A e = s,   (2)

where A is the system matrix, e is the solution of the electric field projected into the basis of the finite element functions, and the right hand side s describes source terms like the illumination field. Solving the system (2), one obtains the solution of the electromagnetic field as

e = A^{-1} s.   (3)

In addition to the solution e, it is also possible to determine the derivatives of the electromagnetic field with respect to the design parameters x_i. There are two main methods of computing the shape and material derivatives of some merit function f_{ob} that depends on the value of the electromagnetic field: the direct method [4], [5] and the adjoint method [29]. For both methods the fundamental step is to be able to determine the derivatives of the system matrix A with respect to the design parameters x_i, dA/dx_i.

In the direct method, the shape or material derivatives of the electric field are obtained as

\frac{d e}{d x_i} = A^{-1} \left( -\frac{d A}{d x_i} e + \frac{d s}{d x_i} \right).   (4)

Therefore, once the electric field has been solved, obtaining every derivative de/dx_i implies solving the same system of Eq. (2) with a different right hand side term. Notably, the same convergence properties as for the solution e are obtained for de/dx_i [4]. The derivatives de/dx_i can then be propagated to compute the derivatives of the objective function f_{ob},

\frac{d f_{ob}}{d x_i} = \left(\frac{d f_{ob}}{d e}\right)^T \cdot \frac{d e}{d x_i},   (5)

where the vector df_{ob}/de gives the derivatives of the objective function with respect to the values of the solution e.

The adjoint method determines derivatives by solving, additionally to Eq. (2), the adjoint problem

A^{\dagger} \lambda = \frac{d f_{ob}}{d e},   (6)

where A^{\dagger} denotes the conjugate transpose of the system matrix A. Once the adjoint system has been solved, obtaining the value of each shape or material derivative involves a matrix-vector multiplication and a vector-vector multiplication,

\frac{d f_{ob}}{d x_i} = -\lambda^T \cdot \left(\frac{d A}{d x_i} e\right).   (7)

In terms of computation times, the main difference between both methods is that the direct method only requires to invert or decompose one system matrix and then to perform two matrix-vector multiplications or one Gaussian elimination and one matrix-vector multiplication per parameter.
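To make the bookkeeping of Eqs. (2)-(5) concrete, the following minimal Python sketch mimics the direct method on a small random sparse system. The system matrix, source term, derivative matrices, and the quadratic objective are toy stand-ins introduced only for this illustration (they are not produced by JCMsuite or any FEM assembly); only the structure of the computation follows the equations above.

```python
# Minimal sketch of the direct (forward) method, Eqs. (2)-(5), on a toy system.
# A, s, dA/dx_i and ds/dx_i are random stand-ins for the FEM quantities.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, d = 200, 3                                  # degrees of freedom, design parameters
rng = np.random.default_rng(0)

A = (sp.eye(n) * (4.0 + 0.5j) + sp.random(n, n, density=0.01, random_state=1)).tocsc()
s = rng.standard_normal(n)                     # right hand side (illumination source)
dA = [sp.random(n, n, density=0.005, random_state=10 + i).tocsc().astype(complex)
      for i in range(d)]                       # toy dA/dx_i matrices
ds = [np.zeros(n, dtype=complex) for _ in range(d)]   # source independent of the parameters

lu = spla.splu(A)                              # factorize the system matrix once
e = lu.solve(s.astype(complex))                # field solution e = A^{-1} s, Eq. (3)

# Direct method: one additional solve with a new right hand side per parameter, Eq. (4)
de = [lu.solve(-dA_i @ e + ds_i) for dA_i, ds_i in zip(dA, ds)]

# Hypothetical objective f_ob = |c^T e|^2 and its chain-rule derivatives, Eq. (5)
c = rng.standard_normal(n)
f_ob = abs(c @ e) ** 2
df_ob = [2.0 * np.real(np.conj(c @ e) * (c @ de_i)) for de_i in de]
print(f_ob, df_ob)
```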
The adjoint method needs to decompose two different matrices, but once this is done, obtaining the derivatives for each parameter only requires a matrix-vector multiplication and a vector-vector multiplication. This last operation results in an important advantage when the number of parameter derivatives is considerably large, as it is, e.g., the case for topology optimization. Then, the adjoint method is especially advantageous because the matrix dA/dx_i is mainly composed of zeros, as each parameter x_i usually represents the permittivity value of a small pixel. The matrix-vector operation can then be neglected and the time for obtaining each parameter derivative is determined just by the vector-vector multiplication.

For the opposite reason, in this work the direct method is used. The goal is to optimize problems with around 20 parameters at maximum. For this case, solving the additional adjoint system typically imposes a severe overhead with respect to the direct method, and the nonzero entries of dA/dx_i are, in general, considerably more numerous than for topology optimization problems. Moreover, using forward derivatives allows one to work with objective functions of higher complexity, as the derivatives can be propagated through the objective function in a simpler way using automatic differentiation [30]. Reference [31] provides a detailed analysis and comparison between both the forward and the adjoint methods.

The FEM solver used in our calculations allows us to compute the shape and material derivatives of different quantities derived from the fields, as for example the amplitudes of its angular spectrum representation. To compute the shape derivatives, we only need to specify the partial shape derivatives of the nanophotonic structure, i.e., how the boundaries of the different elements of the structure change with respect to an infinitesimal change in the design variables. Then the solver automatically obtains the derivatives of the mesh elements and propagates them until it obtains the derivatives of the final quantity of interest.

B. Bayesian optimization with Gaussian processes
A Gaussian process GP(m, k) is a stochastic model, which yields a probabilistic distribution over functions (Ref. [17], Chapter 2.2). We use it as a surrogate model of the computationally expensive objective function f_{ob}(x). A Gaussian process (GP) is characterized by its mean m(x) and covariance k(x, x') functions. We parametrize the mean function as

m(x) = m_0,   (8)

and the covariance function with the Matérn 5/2 [32] covariance function

k(x, x') = \sigma^2 \left(1 + \sqrt{5}\, r + \frac{5}{3} r^2\right) e^{-\sqrt{5}\, r},   (9)

where

r^2 = \sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{l_i^2}.   (10)

The hyperparameters m_0, \sigma, l_1, ..., l_d are chosen to maximize the log-likelihood of all observations of the function f_{ob}(x) [17]. The hyperparameter optimization is computationally expensive. Therefore, we perform it only if required [33] and stop their optimization after 300 observations.

The probabilistic behaviour of a function f(x) modelled as a GP at any collection of points X_{N_ev} = {x_1, ..., x_{N_ev}} is described by a multivariate Gaussian distribution over these points,

f_{N_ev} \sim \mathcal{N}\left(m_{N_ev}, K_{N_ev}\right),   (11)

where f_{N_ev} is the random vector of function values [f(x_1), ..., f(x_{N_ev})]^T, m_{N_ev} the vector of mean values [m(x_1), ..., m(x_{N_ev})]^T, and K_{N_ev} the covariance matrix, whose i-j element is determined by k(x_i, x_j).

Let us assume that f_{ob}(x) has been evaluated at some points X_ev = [x_{ev,1}, ..., x_{ev,N}] and let X_* = [x_{*,1}, ..., x_{*,M}] denote another set of points. The prior joint distribution for the two sets can be calculated with the mean m(x) and covariance k(x, x') functions (Ref. [17], Eq. 2.18),

\begin{bmatrix} f_{ev} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_{ev} \\ m_* \end{bmatrix}, \begin{bmatrix} K_{ev} & K_{ev,*} \\ K_{ev,*}^T & K_* \end{bmatrix} \right),   (12)

where K_ev is the covariance matrix for the set X_ev, K_* the covariance matrix for the set X_*, and K_{ev,*} the crossvariance between the two sets. Using Bayes' theorem [34], one can determine the posterior probability distribution over function values f_* at X_* given the observations (X_ev, f_ev). The posterior distribution

f_* \,|\, X_*, X_{ev}, f_{ev} \sim \mathcal{N}\left(m_p, K_p\right)   (13)

is again a multivariate Gaussian distribution with the posterior mean vector m_p and covariance matrix K_p,

m_p = m_* + K_{*,ev} K_{ev}^{-1} \left(f_{ev} - m_{ev}\right),   (14)

K_p = K_* - K_{*,ev} K_{ev}^{-1} K_{*,ev}^T.   (15)

The above two equations give a probabilistic distribution for the values of the objective function f_{ob}(x) at any point x_* where we do not know its true value, based on the information we have obtained from the evaluations already computed. The diagonal elements of K_p contain the variance \sigma_p^2 at the different points of X_*. Figure 1a shows the values of m_p and \sigma_p for a function f(x) with three known function values.

The Bayesian optimization method uses the statistical information provided by the posterior GP to determine promising parameter sets and to sequentially converge to the optimum value of f_{ob}. There are different strategies for how to use the information provided by the GP [20]. One commonly used strategy, which is also used in this work, consists in finding the point of maximum expected improvement [17].

In addition to the function values themselves, it is also possible to include derivative observations to obtain more accurate GP predictions, as, e.g., described in Ref. [35] chapter 10 or Ref. [17] chapter 9.4.
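As an illustration of Eqs. (8)-(15), the short Python sketch below performs GP regression with the Matérn 5/2 covariance on a few toy observations. The data, hyperparameter values, and the small jitter term are assumptions of this sketch and do not correspond to the implementation used for the optimizations in this work.

```python
# Minimal sketch of GP regression with the Matern 5/2 kernel of Eqs. (9)-(10)
# and the posterior of Eqs. (14)-(15); toy data and fixed hyperparameters.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def matern52(X1, X2, sigma=1.0, length=np.array([0.3])):
    """k(x, x') = sigma^2 (1 + sqrt(5) r + 5/3 r^2) exp(-sqrt(5) r), Eq. (9)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / length          # scaled differences
    r = np.sqrt(np.sum(diff ** 2, axis=-1))                     # Eq. (10)
    return sigma**2 * (1 + np.sqrt(5) * r + 5.0 / 3.0 * r**2) * np.exp(-np.sqrt(5) * r)

# Toy observations of an "expensive" objective (here a cheap stand-in)
X_ev = np.array([[0.1], [0.4], [0.8]])
f_ev = np.sin(3 * X_ev[:, 0])
m0 = f_ev.mean()                                                # constant prior mean, Eq. (8)

K_ev = matern52(X_ev, X_ev) + 1e-10 * np.eye(len(X_ev))         # small jitter for conditioning
L = cho_factor(K_ev, lower=True)                                # K_ev = L L^T

X_star = np.linspace(0, 1, 50)[:, None]
K_s = matern52(X_star, X_ev)                                    # crossvariance K_{*,ev}
m_post = m0 + K_s @ cho_solve(L, f_ev - m0)                     # posterior mean, Eq. (14)
K_post = matern52(X_star, X_star) - K_s @ cho_solve(L, K_s.T)   # posterior covariance, Eq. (15)
sigma_post = np.sqrt(np.clip(np.diag(K_post), 0.0, None))       # pointwise uncertainty
```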
The use of derivative information for regression with Gaussian processes was already used to model dynamic systems [36] and has also been applied to Bayesian optimization [19], [37], [38]. The covariance function of the joint process of objective values and first derivatives,

\begin{bmatrix} f(x) \\ \nabla f(x) \end{bmatrix} \sim \mathcal{GP}\left(0, K^{f,\nabla f}\right),   (16)

can be split into the following four blocks [38],

K^{f,\nabla f}(x, x') = \begin{bmatrix} k^{[f,f]}(x, x') & k^{[f,\nabla f]}(x, x') \\ k^{[\nabla f,f]}(x, x') & k^{[\nabla f,\nabla f]}(x, x') \end{bmatrix},   (17)

with the covariance functions k^{[f,f]}(x,x'), k^{[f,\nabla f]}(x,x'), k^{[\nabla f,f]}(x,x'), and k^{[\nabla f,\nabla f]}(x,x') being

k^{[f,f]}(x, x') = \mathrm{cov}(f(x), f(x')) = k(x, x'),   (18)

k^{[f,\nabla f]}(x, x') = \mathrm{cov}(f(x), \nabla f(x')) = \nabla_{x'} k(x, x'),   (19)

k^{[\nabla f,f]}(x, x') = \mathrm{cov}(\nabla f(x), f(x')) = \nabla_{x} k(x, x'),   (20)

k^{[\nabla f,\nabla f]}(x, x') = \mathrm{cov}(\nabla f(x), \nabla f(x')) = \nabla_{x} \nabla_{x'} k(x, x').   (21)

The same inference rules for the posterior distributions as in Eqs. (14)-(15) also apply for the larger process. That is, the posterior distribution over f_* can be obtained as [38]

m_p(x_*) = K^{\nabla f}_{*,ev} \left(K^{\nabla f}_{ev}\right)^{-1} \left[f_{ev}^T, \nabla f_{ev}^T\right]^T,   (22)

K_p(x_*) = K_* - K^{\nabla f}_{*,ev} \left(K^{\nabla f}_{ev}\right)^{-1} \left(K^{\nabla f}_{*,ev}\right)^T,   (23)

where K_* is the same covariance matrix as in Eq. (12), K^{\nabla f}_{ev} is the covariance matrix for the set X_ev obtained evaluating K^{f,\nabla f}(x, x'), and K^{\nabla f}_{*,ev} is the crossvariance between the two sets X_* and X_ev, obtained also using Eqs. (18)-(21). For simplification of notation, from this point on we will refer with the term K_ev to the covariance matrix of the GP for both the cases with and without derivative information. The rank of the matrix is determined by the number of derivative and non-derivative observations, N_obs = N_ev · (d + 1).

In Fig. 1 we can see how adding derivative observations increases the accuracy of the GP representation of the function in a vicinity of the evaluation points. The extra information also provides a better estimation of the hyperparameters in Eqs. (8) and (9) and thus of the uncertainty \sigma_p. Due to the more accurate GP predictions, derivative information can significantly improve the convergence rate of Bayesian optimization. On the other hand, as can be seen from Eq. (17), including the derivative information of d parameters implies introducing d extra entries in the covariance matrix after every evaluation. This increases the effort to compute the right-hand side of Eqs. (22)-(23).

The covariance matrix K_ev typically becomes ill-conditioned during the course of the optimization. Therefore, the computation and application of its inverse is avoided. Since K_ev is symmetric and positive definite, a numerically more stable approach is to determine its Cholesky decomposition L_{K_ev} in O(N_obs^3) steps and solve the equation

K_{ev} \cdot b = L_{K_{ev}} \cdot L_{K_{ev}}^T \cdot b = x   (24)

by forward and backward substitution in O(N_obs^2) steps.
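The block structure of Eqs. (17)-(21) and the resulting covariance matrix of size N_ev(d+1) can be sketched as follows. For readability, the example uses a squared-exponential kernel, whose derivatives have a very compact closed form, instead of the Matérn 5/2 kernel of Eq. (9); the block assembly is the same in both cases. This is a toy illustration, not the implementation used in the paper.

```python
# Sketch of the covariance blocks of Eqs. (17)-(21) and of the joint covariance
# matrix entering Eqs. (22)-(23). A squared-exponential kernel replaces the
# Matern 5/2 kernel here only because its derivatives are compact; toy data.
import numpy as np

def rbf_blocks(X1, X2, sigma=1.0, length=None):
    """Return k, dk/dx', dk/dx and d^2 k/(dx dx') for all point pairs."""
    d = X1.shape[1]
    length = np.ones(d) if length is None else length
    diff = X1[:, None, :] - X2[None, :, :]                        # (n1, n2, d)
    k = sigma**2 * np.exp(-0.5 * np.sum((diff / length) ** 2, axis=-1))
    k_fg = k[:, :, None] * diff / length**2                        # Eq. (19)
    k_gf = -k_fg                                                   # Eq. (20)
    hess = np.eye(d) / length**2 \
        - diff[:, :, :, None] * diff[:, :, None, :] / (length**2)[:, None] / length**2
    k_gg = k[:, :, None, None] * hess                              # Eq. (21)
    return k, k_fg, k_gf, k_gg

X = np.random.default_rng(1).random((4, 2))                        # N_ev = 4 points, d = 2
N, d = X.shape
k, k_fg, k_gf, k_gg = rbf_blocks(X, X)

# Covariance of the joint process ordered as [f(x_1..x_N), grad f(x_1..x_N)]
K = np.block([
    [k,                                         k_fg.reshape(N, N * d)],
    [k_gf.transpose(0, 2, 1).reshape(N * d, N), k_gg.transpose(0, 2, 1, 3).reshape(N * d, N * d)],
])
assert K.shape == (N * (d + 1), N * (d + 1))
assert np.allclose(K, K.T)                                          # covariance is symmetric
```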
Other approaches [38]–[40] that address the ill-conditioning of the covariance matrix are discussed in Appendix B.

Fig. 1. Example of Gaussian process regression. The figure shows the values of the posterior mean and standard deviation, Eqs. (22)-(23), from three observations f_ev of a function f(x). Top: GP regression using the information of the evaluations of the function f. Bottom: The derivative of the function f with respect to the input variable x has also been included into the GP regression and the hyperparameter estimation. The standard deviation has larger values for the regression including derivative observations, but this results in a better estimation of the real behavior of the function f.

Figure 2 shows the time to determine the Cholesky decomposition both with and without derivative information for d = 10 parameters. For the case of nanophotonic structures that are solvable on the time scale of seconds, already after a thousand simulations decomposing the covariance matrix requires around the same time as solving Maxwell's equations when including the derivative information. For ten thousand simulations it would require around twenty minutes, rendering Bayesian optimization impractical.

In order to still be able to treat these optimization problems using Bayesian optimization, we propose the use of a matrix update method to compute the Cholesky decomposition of the covariance matrix. The main idea of the approach is to take advantage of the fact that at each iteration, a large block of the covariance matrix has already been decomposed in the previous iteration. A block inversion or decomposition of a matrix is not numerically faster than the standard algorithms used for this task, unless one has already pre-computed the decomposition of a large block of the matrix. For this reason, a matrix update method for the GP regression is only of advantage if the GP is iteratively updated and predictions are required at each iteration, as in the case of Bayesian optimization.

Fig. 2. Time needed to perform the Cholesky decomposition of the covariance matrix K_ev as a function of the number of evaluations of a function f: R^10 → R (orange and green points). For comparison, also the time to perform a singular value decomposition (SVD) is included (blue points). The orange and blue points correspond to the case of observing only the function value f at each evaluation. The green points correspond to the case of observing also the derivatives df/dx_i with respect to all ten parameters. The lines show the asymptotic behavior for the calculation of L_{K_ev}. The calculations were done on a cluster node with 250 GB of memory, due to the memory requirements of Gaussian processes for data sets containing a large number of observations.
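A timing curve like the one in Fig. 2 can be reproduced in a few lines by factorizing synthetic covariance matrices of the corresponding sizes; the random matrices and sizes below are placeholders, and the absolute times depend on the hardware and the BLAS library in use.

```python
# Timing sketch for the cubic scaling shown in Fig. 2: the cost of a full
# Cholesky factorization of a covariance matrix with N_obs = N_ev (d + 1)
# rows grows rapidly with the number of evaluations. Sizes are illustrative.
import time
import numpy as np

d = 10
rng = np.random.default_rng(0)
for n_ev in (50, 100, 200, 400):
    n_obs = n_ev * (d + 1)
    A = rng.standard_normal((n_obs, n_obs))
    K = A @ A.T + n_obs * np.eye(n_obs)        # symmetric positive definite stand-in
    t0 = time.perf_counter()
    np.linalg.cholesky(K)
    print(f"N_ev = {n_ev:4d}, N_obs = {n_obs:5d}: {time.perf_counter() - t0:.3f} s")
```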
The numerical procedure to update L_{K_ev} is shown in Appendix A. Computing L_{K_ev} iteratively using the previous L_{K_ev} reduces the numerical effort from O(N_obs^3) to O(N_obs^2), as shown in Fig. 3. The numerical implementation is based on the C implementation of the BLAS [41] and LAPACK [42] libraries.

Fig. 3. Time needed to perform the decomposition of the covariance matrix K_ev as a function of the number of evaluations (N_ev). Each evaluation is composed of the observation of the function value, f, and of its derivatives, df/dx_i, with respect to ten parameters. Three different operations are considered: the calculation of the Cholesky decomposition L_{K_ev} (green points), the iterative calculation of L_{K_ev} (red points), and the iterative calculation of L_{K_ev}^{-1} (purple points). Lines mark the asymptotic behavior of the computation times.

In addition to the iterative construction of the Cholesky decomposition, it is also possible to determine the inverse of the Cholesky decomposition, L_{K_ev}^{-1}, in an iterative way (see Appendix A). As it holds that K_{ev}^{-1} = L_{K_{ev}}^{-T} \cdot L_{K_{ev}}^{-1}, it is possible to directly solve the inference of Eqs. (22)-(23). With this second iterative approach one can halve the inference time, as only a matrix-vector multiplication has to be performed instead of a forward and backward substitution. However, we found that the matrix update of this second scheme is numerically much more unstable, as shown in Appendix B.

We like to note that the iterative update of the Cholesky matrix is not applicable if the GP hyperparameters are updated at each iteration. In this case, the full covariance matrix is updated and a full Cholesky decomposition is required. On the other hand, after some hundred observations the hyperparameter optimization becomes numerically extremely expensive, since a series of Cholesky decompositions has to be performed. Moreover, practicable values of the hyperparameters can usually be estimated based on fewer observations.

III. WAVEGUIDE COUPLER FOR INTER-CHIP OPTICAL COMMUNICATION
To demonstrate the applicability of our approach to design nanophotonic structures, we apply it to optimize a freeform waveguide coupler with a minimal spatial footprint, as shown in the schematic of Fig. 4. The waveguide coupler is part of a photonic inter-chip communication architecture [43]. It consists of a silicon based integrated photonic chip that shall be connected to a different photonic chip by means of a photonic wire bond (PWB). The PWB is a freeform dielectric waveguide that can be written with 3D laser nanolithography to follow an arbitrary path. To keep the connection ideally lossless, the PWB shall have an optimal trajectory [44], [45], but also the connection between the silicon waveguide and the PWB should be optimized. We concentrate here on the second challenge, and our objective is to design a coupler with a minimal spatial footprint that couples the fundamental mode of a silicon waveguide into the fundamental mode of a PWB at a central wavelength of λ_c = 1550 nm.

The silicon waveguide has standard dimensions for an integrated photonic circuit, i.e., we consider a width of 500 nm and a height of 220 nm. The PWB has a width of 2 µm and a height of 1.8 µm. Its refractive index at λ_c is 1.53. Both waveguides are placed on top of a silica substrate, n_SiO2 = 1.44, and are surrounded by a cladding with a refractive index of 1.36.

Adiabatic tapers are the most standard structure used for edge couplers [46], [47], although different structures have been proposed [48]–[51]. To achieve low coupling losses and a compact device design, many different structures have been proposed, such as tapers with smooth non-uniform profiles [52]–[54] or with more complex shapes [55]. Generally, the more compact the device, the more complex its shape needs to be and the smaller its bandwidth. Many different ultra-compact devices have been achieved using topology optimization [56], [57]. However, the fabrication of these devices is complex.
Fig. 4. Schematic of the waveguide coupler between the silicon waveguide and the photonic wire bond waveguide. The shape of the coupler is described by a NURBS curve. The curve has a number d of control points, equispaced along the x direction. The total length of the coupler is 3 µm.

An alternative is to design a freeform taper. This approach has been used to design similar silicon photonic elements, such as power splitters [58], [59]. Here, we follow this approach to design a waveguide coupler with a length of 3 µm. The shape of the coupler is parametrized with a mirror symmetric NURBS curve, with a series of d control points. The control points are equispaced in the x dimension and they are allowed to vary in the y dimension. The width of the coupler is constrained to be larger than 140 nm and smaller than 1800 nm. These dimensions are compatible with fabrication processes using e-beam lithography.

The simulation of the full 3D structure is computationally very demanding. In order to reduce the computation time and to be able to assess how the coupling efficiency improves when more complex shapes are considered, i.e., when more control points are included, we reduce the three dimensional problem to an effective two dimensional approximation. In the two dimensional simulations, the refractive indices of the silicon waveguide and PWB are replaced by the effective refractive indices of their fundamental modes [60], computed with a FEM mode solver [22]. The refractive index used for the cladding corresponds to the one of the cladding of the 3D model, 1.36. The layout of the considered model corresponds to the x-y top plane shown in Fig. 4.

The objective function used for the optimization is the mode overlap of the scattered field, E_scat, and the fundamental mode of the PWB, E_PWB, along the cross section S,

f_{ob} = \left| \int_S \left( E_{scat} \times H_{PWB,0} \right) \cdot dS \right|^2.   (25)

The silicon waveguide is excited with its fundamental mode E_{Si,0}. For the simulations of the scattered field, we use a FEM with a polynomial order of 3 and a mesh size of λ/10, where λ is the wavelength inside the material. Each simulation requires around 15 seconds to obtain both the coupling efficiency and its shape derivatives with respect to the control points of the NURBS curve.

We have performed optimizations for waveguide couplers parameterized with a different number of control points d equal to 4, 10, 15, and 20. To test the convergence rate of the optimizations, every optimization is repeated four times.

Fig. 5. Current optimal value of f_ob during the optimization process of the waveguide coupler. The results are shown for waveguide couplers parametrized with a different number d of control points. The line shown for each d is the mean value obtained over four independent optimization runs. The shadowed area represents the region within one standard deviation. The optimizations for d equal to 4 were stopped after one thousand simulations, as the maximum values of the expected improvement were already lower than −.
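Before turning to the results, the following lines sketch how an overlap objective of the form of Eq. (25) can be evaluated on a discretized cross section of the two dimensional model. The field arrays are analytic placeholders for the scattered field and a suitably normalized fundamental PWB mode exported from the FEM solver; the selected field components and the normalization are assumptions of this sketch.

```python
# Sketch of evaluating the overlap objective of Eq. (25) on a discretized cross
# section S of the 2D model. The fields below are placeholders; in practice the
# scattered field and the normalized PWB mode would be exported from the solver.
import numpy as np

y = np.linspace(-2e-6, 2e-6, 801)                            # cross section coordinate (m)
Ez_scat = np.exp(-((y / 0.8e-6) ** 2)) * np.exp(1j * 2.0)    # placeholder scattered field
Hy_pwb = np.exp(-((y / 0.8e-6) ** 2))                        # placeholder PWB mode field

# For the 2D model, (E_scat x H_PWB,0) . dS reduces (up to sign) to the integral
# of the product of the relevant transverse field components over S.
dy = y[1] - y[0]
overlap = np.sum(Ez_scat * Hy_pwb) * dy
f_ob = np.abs(overlap) ** 2
print(f_ob)
```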
The results of the optimizations are shown in Fig. 5. They show the average convergence rate of the four optimizations and its standard deviation. We can see that the optimizations for different d converge to different values. The efficiency of the coupler increases with the value of d. This result can be expected, as the shapes of the devices represented with fewer control points are contained in the design spaces of the larger parametrizations. However, increasing the number of parameters d could also slow down the convergence rate due to the exponentially growing search space. Each optimization exploits the data of N_ev · (d + 1) observations, N_ev being the total number of iterations used. One can measure the percentage of the explored volume of the parameter space as the ratio between the number of observations used and the number of d-dimensional length scale blocks with side lengths l_i that compose the parameter space. The length scales are defined in Eq. (10). This measure gives values of 20.5, 10.2, 1.0, and 0.02 percent for the optimizations with d equal to 4, 10, 15, and 20, respectively.

Indeed, for a growing number of parameters it takes increasingly longer to locate the global maximum. This can be seen in Fig. 6, where the average distance to the best parameter configuration of all four optimization runs is shown as a function of the iteration number. For the optimizations with 15 parameters, around four thousand simulations are required for all four optimization runs to approach the same optimum. For the optimizations with 20 parameters, the optimizations did not converge to the same design even after 4500 simulations. However, the standard deviation of the coupling efficiency converged to 0.001 already after one thousand simulations. This result indicates that for these high dimensional spaces there are multiple designs showing a similar performance. Similar results are also frequently obtained in topology optimizations that use the adjoint method.

Fig. 6. Mean value of the distance between the optimal points obtained in different optimization runs. The results are shown for the optimization of waveguide couplers parametrized with a different number d of control points.

Altogether, using a larger number of parameters d for the optimization seems to be favourable. Although it takes longer to reach a global maximum, the coupling efficiencies are on average better already after a small number of iterations for larger d. This shows that the approach scales well for an increasing number of design parameters. The use of shape derivatives probably plays a crucial role in this behavior.

To further analyze the role played by the shape derivatives in the convergence of the algorithm, we compare the results obtained with the proposed algorithm with those obtained using the same Bayesian optimization algorithm without including derivative information. The comparison is done for the coupler parametrized with 20 parameters, d = 20. The number of iterations for the optimizations without derivatives is set such that the total time used for both optimization methods is approximately the same. The results are shown in Fig. 7.
As one can see, the algorithm that used derivative information obtained slightly better results, and it required around one third of the time to obtain coupling efficiencies as high as the best one obtained by the algorithm without derivative information.

Let us note that two thousand evaluations with twenty parameters using shape derivatives already implies a covariance matrix of size 40,000 x 40,000. Decomposing this matrix without using the matrix update scheme would already require computation times of around one and a half minutes, compared with the 15 seconds needed for the FEM simulation. With the iterative approach, on the other hand, the decomposition for this matrix size takes only 1 second. Without the use of the iterative approach, the algorithm that exploits derivative information would require 276 hours instead of 47 hours to perform the 4500 iterations. The determination of a parameter set with large expected improvement for the optimization case with d = 20 parameters took around 12 seconds. This number is the average over all the iterations.

Fig. 7. Comparison of the performance between two Bayesian optimization algorithms with and without derivative information. The comparison is done for the optimization of the waveguide coupler with d = 20. Top: Current optimum of the coupling efficiency with respect to the number of iterations (FEM simulations) performed by the optimization algorithm. Bottom: Optimum of the coupling efficiency with respect to the run time of the algorithm.

Let us note also that with the proposed approach the memory still scales with O(N_obs^2). The representation of the covariance matrix and its Cholesky decomposition requires storing 2 · N_obs^2 double precision floating point numbers, each taking 8 bytes of memory. The optimization of the coupler with 20 parameters needed 120 gigabytes.

Figure 8 shows the optimal geometries obtained for the cases of d equal to 4, 15, and 20. The optimal design for d = 4 parameters has the shape of an adiabatic taper. The couplers parametrized with more control points present more complex geometries, although the couplers obtained using 15 and 20 control points present similar characteristics. It is expected that these complex shapes are the result of maximizing the coupling efficiency at the target optimization wavelength of 1550 nm. But, interestingly enough, by taking a look at Fig. 8, one can see important differences between the field intensities for the designs with d equal to 4 and 20 parameters. For the design with d = 4, the width of the coupler is narrower than the silicon waveguide along the entire coupler region. As a result, the light is no longer confined within the silicon core, but it is radiated into the PWB waveguide section that has a lower refractive index. Since the width of the coupler decreases continuously and smoothly, the back-reflections along the coupler are small and they are produced continuously along the coupler. This can be concluded from the smooth profile of the field intensity within the core of the coupler. After 2 µm, most of the incident light has already been radiated into the PWB waveguide section, where it continues to be guided and the core carries almost no energy.
Such a behaviour can be considered as the expected or rational solution to the problem.

Fig. 8. Geometries of the obtained optimal waveguide couplers. The results show the optimal designs for couplers parametrized with a number d of 4, 15, and 20 control points. The colors indicate the energy density of the electromagnetic field.

The behavior of the coupler with d = 20 is completely different. Its width is larger than the width of the silicon waveguide almost along the entire 3 µm of the coupling section. As a consequence, the light is kept confined in the silicon core of the coupler and there is almost no radiation into the PWB waveguide until the terminating section of the coupler. The profile of the coupler presents a chirped subwavelength modulation. As a further difference to the design with d = 4, an interference pattern is clearly present within the core section of the coupler. The strong interference is a clear indication that multiple reflections occur along the coupler. Nevertheless, the interferences between counter propagating waves in the input silicon waveguide are weaker than in the design for d = 4. This observation is in agreement with the higher coupling efficiency shown in Fig. 5. The fact that the coupler with d = 20 produces higher internal reflections but presents a smaller total reflection at the input port can only be explained by destructive interference of the multiple reflections. The shape of the coupler seems to modulate the reflections in such a way that they interfere destructively at the design wavelength, allowing all the incoming power to be emitted into the PWB waveguide. At the same time, the shape at the region of the tip also modulates the radiation pattern, leading to an almost perfect overlap with the fundamental mode of the PWB. This alternative, completely non-adiabatic solution emerges somewhat unexpectedly.

Fig. 9. Wavelength sweeps for the coupling efficiency of the optimal couplers.
To demonstrate how this fine tuning of the geometry affects the coupling efficiency in a spectral bandwidth around the central wavelength, Fig. 9 shows a wavelength scan of the efficiencies of the optimized designs. The more complex couplers present a more pronounced, spectrally selective response. Their coupling efficiencies are higher than that of the adiabatic taper within a bandwidth of 100 nm. Beyond this spectral region the efficiencies drop below that of the adiabatic coupler, which maintains an efficiency above 0.9.
IV. CONCLUSION
We propose a global optimization scheme for the design of nanophotonic devices composed of a FEM solver with forward shape derivatives and an iterative Bayesian optimization algorithm. The iterative algorithm mitigates the main scalability problem of Bayesian optimization, allowing access to optimization problems that are not feasible for Bayesian optimization with a classical implementation, especially when derivative observations are included. With this approach, one can benefit from the good performance of Bayesian optimization for a wider range of devices.

As an application example, we optimized the two dimensional model of a compact freeform waveguide coupler, achieving a coupling efficiency of 98 percent. The coupler is optimized for a target wavelength of 1550 nm and it can be implemented using e-beam lithography. The results show that a larger dimensional design space for the freeform shape leads to significantly higher coupling efficiencies at the target wavelength. On the other hand, the higher dimensional couplers are also more frequency selective.

APPENDIX A
MATRIX UPDATE FOR L_{K_ev} AND L_{K_ev}^{-1}

To derive the method to update L_{K_ev} based on a previously computed submatrix, we split the covariance matrix K_ev into four blocks,

K_{ev} = \begin{bmatrix} A & B^T \\ B & C \end{bmatrix},   (26)

where A represents the covariance matrix computed on a subset of all evaluation points. In our case, A will be the main block of K_ev, and the submatrices B and C the result of adding a few more observations to the GP. However, this consideration about the relative sizes of the sub-blocks is not needed in the derivation below. It follows that L_{K_ev} can be obtained in terms of the elements of A, B, and C as (Ref. [61], Sec. 6.5.4)

L_{K_{ev}} = \begin{bmatrix} L_A & 0 \\ B \cdot L_A^{-T} & L_S \end{bmatrix},   (27)

with S being the Schur complement of K_ev,

S = C - B \cdot A^{-1} \cdot B^T.   (28)

By defining the matrix X = B \cdot L_A^{-T}, one can express S as

S = C - X \cdot X^T.   (29)

Here, we have used that the Cholesky decomposition of the inverse of a matrix equals the inverse transpose of the Cholesky decomposition of the matrix. To see this, consider the Cholesky decomposition

M = L_M \cdot L_M^T   (30)

of a matrix M. By inverting Eq. (30) and comparing with the Cholesky decomposition of the inverse, M^{-1} = L_{M^{-1}} \cdot L_{M^{-1}}^T, we have

M^{-1} = L_{M^{-1}} \cdot L_{M^{-1}}^T = L_M^{-T} \cdot L_M^{-1}.   (31)

To determine X^T one can solve the triangular system L_A \cdot X^T = B^T. Finally, we get the expression for the matrix update scheme,

L_{K_{ev}} = \begin{bmatrix} L_A & 0 \\ X & L_S \end{bmatrix}.   (32)

Equation (31) shows how to efficiently use L_{K_ev} to compute the inference of \sigma_p, Eq. (23). If we define the vector b as

b = L_{K_{ev}}^{-1} \cdot K_{*,ev}^T,   (33)

we can infer the posterior \sigma_p as

K_{*,ev} K_{ev}^{-1} K_{*,ev}^T = b^T \cdot b.   (34)

The vector b can be obtained by solving the triangular system L_{K_{ev}} \cdot b = K_{*,ev}^T using forward and backward substitution.
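The update of Eq. (32) can be written down directly. The sketch below builds the Cholesky factor of an enlarged covariance matrix from the factor of its upper-left block and checks it against a full decomposition; the random matrices are stand-ins for covariance blocks, and the final lines illustrate the inference of Eqs. (33)-(34).

```python
# Minimal sketch of the Cholesky update of Eq. (32): given L_A for the previous
# block A, the factor of K_ev = [[A, B^T], [B, C]] is obtained with one
# triangular solve and a small Cholesky factorization of the Schur complement,
# instead of refactorizing from scratch. Random stand-in matrices are used.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive definite matrix as a stand-in for a covariance matrix."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

K = spd(12)                        # enlarged covariance matrix K_ev
n_old, n_new = 9, 3                # 9 previous observations, 3 added in this iteration
A, B, C = K[:n_old, :n_old], K[n_old:, :n_old], K[n_old:, n_old:]

L_A = cholesky(A, lower=True)                        # known from the previous iteration
X = solve_triangular(L_A, B.T, lower=True).T         # X = B L_A^{-T}, solves L_A X^T = B^T
S = C - X @ X.T                                      # Schur complement, Eqs. (28)-(29)
L_S = cholesky(S, lower=True)

L = np.zeros_like(K)                                 # assemble Eq. (32)
L[:n_old, :n_old] = L_A
L[n_old:, :n_old] = X
L[n_old:, n_old:] = L_S
assert np.allclose(L @ L.T, K)                       # agrees with a full decomposition

# Inference of Eqs. (33)-(34): the quadratic form via one triangular solve
k_star = rng.standard_normal(K.shape[0])             # placeholder crossvariance K_{*,ev}^T
b = solve_triangular(L, k_star, lower=True)
quad = b @ b                                         # K_{*,ev} K_ev^{-1} K_{*,ev}^T, Eq. (34)
```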
One can also update the inverse of L_{K_ev} instead of L_{K_ev} itself,

L_{K_{ev}}^{-1} = \begin{bmatrix} L_A^{-1} & 0 \\ -L_S^{-1} \cdot X \cdot L_A^{-1} & L_S^{-1} \end{bmatrix}.   (35)

Using the inverse of L_{K_ev}, one can compute b with a matrix-vector multiplication instead of calculating it with a forward and backward substitution, which halves the computation time of the inference. However, as we show in Appendix B, working with L_{K_ev}^{-1} instead of L_{K_ev} leads to higher numerical errors, as the covariance matrix K_ev is typically ill-conditioned.

APPENDIX B
STABILITY ANALYSIS
It is well known that the covariance matrix becomes ill-conditioned in the course of Bayesian optimization. Especially when the observed sampling points are too close together, some rows of the covariance matrix are almost identical. The degeneracy of the matrix becomes even worse when using derivative information [37]. It is therefore an important question whether the presented iterative approach suffers from this degeneracy, i.e., whether the GP inference is sufficiently accurate after a large number of iterations.

Many regularization techniques have been proposed [39] in order to be able to work with an ill-conditioned covariance matrix. The most frequently used methods are the use of the LU decomposition (Ref. [62], Section 3.1.1), the Cholesky decomposition (Ref. [17], Section 2.2), and the use of the truncated singular value decomposition [63], TSVD (see for example Ref. [62], Section 3.1.2). The LU decomposition requires 2/3 N_ev^3 floating point operations (flops) (Ref. [64], Lecture 23), the Cholesky decomposition requires 1/3 N_ev^3 flops (Ref. [64], Lecture 23), and the first phase of the SVD, which is in principle the most expensive part of the algorithm, requires 8/3 N_ev^3 flops when using the Golub-Kahan bidiagonalization (Ref. [64], Lecture 31). Note that, as discussed already in Appendix A, twice the time is required to perform an inference when using the Cholesky decomposition in comparison with direct inversion methods, such as the SVD.

Another noteworthy regularization technique that was proposed in the context of Bayesian optimization using derivative information is the spectral representation [37] of the Gaussian process, SGP. The basic principle consists of Fourier transforming the design space and performing the inference in the frequency domain. By limiting the value of the highest frequency, the method can deal well with ill-conditioned covariance matrices. However, the method is not suitable for the discussed applications, as it does not scale well with the number of dimensions d of the design space. The method requires a fixed number M of equidistant points in every dimension, so the memory requirement for the transformed covariance matrix grows exponentially with d. Because of this memory requirement, we were not able to perform a comparison with this method. Note that in Ref. [37] the authors already mention that the method is memory efficient only when the length scales of the problem are large compared to the size of the design space.

To assess the numerical stability of the proposed matrix update schemes, we run a comparison of the error produced by the Cholesky decomposition and the TSVD decomposition and by both matrix update schemes. To measure the error, we infer the values of \sigma_p at all the evaluation points contained in the GP after applying the different regularization techniques. The uncertainty should ideally be zero at the sampling points. The error is therefore defined as the maximum deviation from zero of \sigma_p at all previous sampling points.
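This error measure can be stated compactly in code: build a GP from a set of observations, predict \sigma_p at the sampling points themselves, and report its maximum deviation from zero. The kernel, jitter, and random points below are toy assumptions, not the data of Sec. III.

```python
# Sketch of the stability measure: the posterior standard deviation sigma_p of
# Eq. (23) is evaluated at the sampling points themselves, where it should
# ideally vanish; the reported error is its maximum deviation from zero.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def kernel(X1, X2, length=0.3):
    r2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1) / length**2
    return np.exp(-0.5 * r2)

X_ev = np.random.default_rng(0).random((100, 2))        # sampling points
K_ev = kernel(X_ev, X_ev) + 1e-10 * np.eye(len(X_ev))   # small jitter for the factorization

L = cho_factor(K_ev, lower=True)
K_s = kernel(X_ev, X_ev)                                 # predict at the sampling points
var_p = np.diag(K_s - K_s @ cho_solve(L, K_s.T))         # posterior variance, Eq. (23)
error = np.max(np.sqrt(np.clip(var_p, 0.0, None)))       # maximum sigma_p at known points
print(error)
```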
Fig. 10. Error produced by different numerical techniques used for solving the system of Eq. (23): direct inversion of K_ev, the truncated singular value decomposition, the Cholesky decomposition, and the two matrix update approaches proposed in this work. The error shown is the maximum error obtained after evaluating \sigma_p at all the evaluation points contained in the GP.

The results of the test are shown in Fig. 10. The behavior of both matrix update schemes is very different. While the matrix update scheme for the Cholesky decomposition stays as stable as the Cholesky decomposition itself, the use of the inverse of the Cholesky decomposition produces large errors in \sigma_p as the covariance matrix increases in size. Based on these results, the use of the inverse of the Cholesky decomposition seems not to be suitable for performing Bayesian optimization. However, the results also show the high stability of the matrix update scheme for the Cholesky decomposition in comparison to the TSVD and the direct inversion. We attribute the stable behaviour to the fact that the algorithms for the full Cholesky decomposition also compose the matrix row-by-row or column-by-column and do not require a pivoting of the matrix [65]. Therefore, it is possible to improve the scalability of Bayesian optimization without losing accuracy in the predictions of the GP.

ACKNOWLEDGEMENTS
This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 675745. The authors also gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (DFG) through Project-ID 258734477 – SFB 1173 and Project-ID 390761711 – EXC 2082/1. This project has received funding from the German Federal Ministry of Education and Research (BMBF Forschungscampus MODAL, project number 05M20ZBM) and by the Deutsche Forschungsgemeinschaft under Germany's Excellence Strategy – MATH+ (EXC-2046/1, project ID 390685689, AA4-6). X.G.S. acknowledges support from the Karlsruhe School of Optics and Photonics (KSOP).
REFERENCES

[1] R. Bellman, Dynamic Programming, ser. Dover Books on Computer Science Series. Dover Publications, 2003.
[2] G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inf. Theory, vol. 14, no. 1, pp. 55–63, 1968.
[3] T. Oommen, D. Misra, N. K. C. Twarakavi, A. Prakash, B. Sahoo, and S. Bandopadhyay, "An objective analysis of support vector machine based classification for remote sensing," Math. Geosci., vol. 40, no. 4, pp. 409–424, 2008.
[4] S. Burger, L. Zschiedrich, J. Pomplun, F. Schmidt, and B. Bodermann, "Fast simulation method for parameter reconstruction in optical metrology," Proc. SPIE, vol. 8681, p. 868119, 2013.
[5] T. W. Hughes, I. A. D. Williamson, M. Minkov, and S. Fan, "Forward-mode differentiation of Maxwell's equations," ACS Photonics, vol. 6, no. 11, pp. 3010–3016, 2019.
[6] O. Sigmund and J. Søndergaard Jensen, "Systematic design of phononic band–gap materials and structures by topology optimization," Philos. Trans. R. Soc. A, vol. 361, no. 1806, pp. 1001–1019, 2003.
[7] P. I. Borel, A. Harpøth, L. H. Frandsen, M. Kristensen, P. Shi, J. S. Jensen, and O. Sigmund, "Topology optimization and fabrication of photonic crystal structures," Opt. Express, vol. 12, no. 9, pp. 1996–2001, 2004.
[8] J. Lu, S. Boyd, and J. Vučković, "Inverse design of a three-dimensional nanophotonic resonator," Opt. Express, vol. 19, no. 11, pp. 10 563–10 570, 2011.
[9] S. Molesky, Z. Lin, A. Y. Piggott, W. Jin, J. Vučković, and A. W. Rodriguez, "Inverse design in nanophotonics," Nat. Photonics, vol. 12, no. 11, p. 659, 2018.
[10] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz, "BOA: The Bayesian optimization algorithm," in Proc. Gen. Ev. Comp. Conf. GECCO-99, vol. 1, 1999, pp. 525–532.
[11] A. D. Bull, "Convergence rates of efficient global optimization algorithms," J. Mach. Learn. Res., vol. 12, no. 88, pp. 2879–2904, 2011.
[12] D. Zhang, F. Qin, Q. Zhang, Z. Liu, G. Wei, and J. J. Xiao, "Segmented Bayesian optimization of meta-gratings for sub-wavelength light focusing," J. Opt. Soc. Am. B, vol. 37, no. 1, pp. 181–187, 2020.
[13] F. Qin, Z. Liu, Q. Zhang, H. Zhang, and J. Xiao, "Mantle cloaks based on the frequency selective metasurfaces designed by Bayesian optimization," Sci. Rep., vol. 8, no. 14033, pp. 1–10, 2018.
[14] F. Qin, D. Zhang, Z. Liu, Q. Zhang, and J. Xiao, "Designing metal-dielectric nanoantenna for unidirectional scattering via Bayesian optimization," Opt. Express, vol. 27, no. 21, pp. 31 075–31 086, 2019.
[15] M. A. Osborne, R. Garnett, and S. J. Roberts, "Gaussian processes for global optimization," in , 2009, pp. 1–15.
[16] D. R. Jones, "A taxonomy of global optimization methods based on response surfaces," J. Global Optim., vol. 21, no. 4, pp. 345–383, 2001.
[17] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006, vol. 2.
[18] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," in Adv. Neural Inf. Process. Syst., 2004, pp. 329–336.
[19] P.-I. Schneider, X. Garcia Santiago, V. Soltwisch, M. Hammerschmidt, S. Burger, and C. Rockstuhl, "Benchmarking five global optimization approaches for nano-optical shape optimization and parameter reconstruction," ACS Photonics, vol. 6, no. 11, pp. 2726–2733, 2019.
[20] P.-I. Schneider, X. G. Santiago, C. Rockstuhl, and S. Burger, "Global optimization of complex optical structures using Bayesian optimization based on Gaussian processes," Proc. SPIE, vol. 10335, p. 103350O, 2017.
[21] P. Monk, "Analysis of a finite element method for Maxwell's equations," SIAM J. Numer. Anal. physica status solidi (b), vol. 244, no. 10, pp. 3419–3434, 2007.
[24] E. Hassan, B. Scheiner, F. Michler, M. Berggren, E. Wadbro, F. Röhrl, S. Zorn, R. Weigel, and F. Lurz, "Multilayer topology optimization of wideband SIW-to-waveguide transitions," IEEE Trans. Microw. Theory Tech., vol. 68, no. 4, pp. 1326–1339, 2020.
[25] A. Iguchi, Y. Tsuji, T. Yasui, and K. Hirayama, "Topology optimal design of optical waveguides in consideration of polarization dependence using BPM and AVM," in . IEEE, 2016, pp. 1772–1773.
[26] N. Burschäpers, S. Fiege, R. Schuhmann, and A. Walther, "Sensitivity analysis of waveguide eigenvalue problems," Adv. Radio Sci., vol. 9, pp. 85–89, 2011.
[27] J. Petráček and J. Luksch, "Bidirectional eigenmode propagation algorithm for 3D waveguide structures," in . IEEE, 2011, pp. 1–4.
[28] D. V. Murthy and R. T. Haftka, "Derivatives of eigenvalues and eigenvectors of a general complex matrix," Int. J. Numer. Meth. Eng., vol. 26, no. 2, pp. 293–311, 1988.
[29] G. Veronis, R. W. Dutton, and S. Fan, "Method for sensitivity analysis of photonic crystal devices," Opt. Lett., vol. 29, no. 19, pp. 2288–2290, 2004.
[30] M. Bartholomew-Biggs, S. Brown, B. Christianson, and L. Dixon, "Automatic differentiation of algorithms," J. Comput. Appl. Math., vol. 124, no. 1-2, pp. 171–190, 2000.
[31] M. Kollmann, "Sensitivity analysis: The direct and adjoint method," Master's thesis, Johannes Kepler Universität Linz, 2010.
[32] M. L. Stein, Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media, 2012.
[33] X. Garcia-Santiago, P.-I. Schneider, C. Rockstuhl, and S. Burger, "Shape design of a reflecting surface using Bayesian optimization," J. Phys. Conf. Ser., vol. 963, p. 012003, 2018.
[34] T. Bayes and N. Price, "LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S," Philos. Trans. Royal Soc., vol. 53, pp. 370–418, 1763.
[35] A. Papoulis and S. U. Pillai, Probability, Random Variables, and Stochastic Processes. Tata McGraw-Hill Education, 2002.
[36] E. Solak, R. Murray-Smith, W. E. Leithead, D. J. Leith, and C. E. Rasmussen, "Derivative observations in Gaussian process models of dynamic systems," in Adv. Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, 2003, pp. 1057–1064.
[37] J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier, "Bayesian optimization with gradients," in Adv. Neural Inf. Process. Syst., Long Beach, California, 2017, pp. 5267–5278.
[38] A. Wu, M. C. Aoi, and J. W. Pillow, "Exploiting gradients and Hessians in Bayesian optimization and Bayesian quadrature," arXiv preprint arXiv:1704.00060, 2017.
[39] H. Mohammadi, R. L. Riche, N. Durrande, E. Touboul, and X. Bay, "An analytic comparison of regularization methods for Gaussian processes," arXiv preprint arXiv:1602.00853, 2016.
[40] L. Foster, A. Waagen, N. Aijaz, M. Hurley, A. Luis, J. Rinsky, C. Satyavolu, M. J. Way, P. Gazis, and A. Srivastava, "Stable and efficient Gaussian process calculations," J. Mach. Learn. Res., vol. 10, no. 31, pp. 857–882, 2009.
[41] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, "Basic linear algebra subprograms for Fortran usage," ACM Trans. Math. Softw., vol. 5, no. 3, pp. 308–323, 1979.
[42] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen, "LAPACK: A portable linear algebra library for high-performance computers," in Proc. 1990 ACM/IEEE Conf. Supercomputing. New York, USA: IEEE Computer Society Press, 1990, pp. 2–11.
[43] A. Nesic, M. Blaicher, T. Hoose, A. Hofmann, M. Lauermann, Y. Kutuvantavida, M. Nöllenburg, S. Randel, W. Freude, and C. Koos, "Photonic-integrated circuits with non-planar topologies realized by 3D-printed waveguide overpasses," Opt. Express, vol. 27, no. 12, pp. 17 402–17 425, 2019.
[44] F. Negredo, M. Blaicher, A. Nesic, P. Kraft, J. Ott, W. Dörfler, C. Koos, and C. Rockstuhl, "Fast and reliable method to estimate losses of single-mode waveguides with an arbitrary 2D trajectory," J. Opt. Soc. Am. A, vol. 35, no. 6, pp. 1063–1073, 2018.
[45] M. Hammer, A. Hildebrandt, and J. Förstner, "Full resonant transmission of semiguided planar waves through slab waveguide steps at oblique incidence," J. Light. Technol., vol. 34, no. 3, pp. 997–1005, 2015.
[46] R. E. Smith, C. T. Sullivan, G. A. Vawter, G. R. Hadley, J. R. Wendt, M. B. Snipes, and J. F. Klem, "Reduced coupling loss using a tapered-rib adiabatic-following fiber coupler," IEEE Photon. Technol. Lett., vol. 8, no. 8, pp. 1052–1054, 1996.
[47] T. A. Ramadan and R. M. Osgood, "Adiabatic couplers: design rules and optimization," J. Light. Technol., vol. 16, no. 2, p. 277, 1998.
[48] P. Wang, A. Michael, and C. Y. Kwok, "Cantilever inverse taper coupler with SiO2 gap for submicrometer silicon waveguides," IEEE Photon. Technol. Lett., vol. 29, no. 16, pp. 1407–1410, 2017.
[49] M. Teng, B. Niu, K. Han, S. Kim, Y. Xuan, Y. J. Lee, and M. Qi, "Trident shape SOI metamaterial fiber-to-chip edge coupler," in Optical Fiber Communication Conference (OFC) 2019. San Diego, USA: Optical Society of America, 2019, p. Tu2J.6.
[50] M. Hammer, L. Ebers, and J. Förstner, "Oblique quasi-lossless excitation of a thin silicon slab waveguide: a guided-wave variant of an anti-reflection coating," J. Opt. Soc. Am. B, vol. 36, no. 9, pp. 2395–2401, 2019.
[51] X. Mu, S. Wu, L. Cheng, and H. Fu, "Edge couplers in silicon photonic integrated circuits: A review," Appl. Sci., vol. 10, no. 4, p. 1538, 2020.
[52] A. Agrebi and R. Attia, "Coupling efficiency using non-uniform diffused tapered waveguide," in , Sousse, Tunisia, 2012, pp. 511–515.
[53] S. J. Hettrick, J. Wang, C. Li, J. S. Wilkinson, and D. P. Shepherd, "An experimental comparison of linear and parabolic tapered waveguide lasers and a demonstration of broad-stripe diode pumping," J. Light. Technol., vol. 22, no. 3, p. 845, 2004.
[54] C. Liu, G. Zhao, F. Yang, Q. Lu, and W. Guo, "Design of compact but fabrication-tolerant vertical coupler for active–passive integration," J. Light. Technol., vol. 36, no. 3, pp. 755–762, 2017.
[55] P. Sethi, A. Haldar, and S. K. Selvaraja, "Ultra-compact low-loss broadband waveguide taper in silicon-on-insulator," Opt. Express, vol. 25, no. 9, pp. 10 196–10 203, 2017.
[56] M. Teng, A. Honardoost, Y. Alahmadi, S. S. Polkoo, K. Kojima, H. Wen, C. K. Renshaw, P. LiKamWa, G. Li, S. Fathpour et al., "Miniaturized silicon photonics devices for integrated optical signal processors," J. Light. Technol., vol. 38, no. 1, pp. 6–17, 2020.
[57] Y. Liu, W. Sun, H. Xie, N. Zhang, K. Xu, Y. Yao, S. Xiao, and Q. Song, "Adiabatic and ultracompact waveguide tapers based on digital metamaterials," IEEE J. Sel. Top. Quantum Electron., vol. 25, no. 3, pp. 1–6, 2018.
[58] C. Sideris, E. Garza, and O. P. Bruno, "Ultrafast simulation and optimization of nanophotonic devices with integral equation methods," ACS Photonics, vol. 6, no. 12, pp. 3233–3240, 2019.
[59] C. M. Lalau-Keraly, S. Bhargava, O. D. Miller, and E. Yablonovitch, "Adjoint shape optimization applied to electromagnetic design," Opt. Express, vol. 21, no. 18, pp. 21 693–21 701, 2013.
[60] Y. Chung and N. Dagli, "An assessment of finite difference beam propagation method," IEEE J. Quantum Electron., vol. 26, no. 8, pp. 1335–1339, 1990.
[61] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012, vol. 3.
[62] M. N. Gibbs, "Bayesian Gaussian processes for regression and classification," Ph.D. dissertation, University of Cambridge, 1998.
[63] P. C. Hansen, "The truncated SVD as a method for regularization," BIT Numer. Math., vol. 27, no. 4, pp. 534–553, 1987.
[64] L. N. Trefethen and D. Bau III, Numerical Linear Algebra. SIAM, 1997, vol. 50.
[65] G. H. Golub and C. F. Van Loan,