Neural Schrödinger Equation: Physical Law as Neural Network
Mitsumasa Nakajima, Kenji Tanaka, Toshikazu Hashimoto
NTT Device Technology Labs., Kanagawa, Japan. [email protected]
Abstract
We present a new family of neural networks based on the Schrödinger equation (SE-NET). In this analogy, the trainable weights of the neural networks correspond to the physical quantities of the Schrödinger equation. These physical quantities can be trained using the complex-valued adjoint method. Since the propagation of the SE-NET can be described by the evolution of physical systems, its outputs can be computed by using a physical solver. As a demonstration, we implemented the SE-NET using the finite difference method. The trained network is transferable to actual optical systems. Based on this concept, we show a numerical demonstration of end-to-end machine learning with an optical frontend. Our results extend the application field of machine learning to hybrid physical-digital optimizations.
Deep neural networks (DNNs) have a remarkable ability to learn and generalize from data. This feature enables excellent capabilities in a wide range of applications, such as image processing, natural language processing, and robot operation [24, 13, 38, 28]. Recent studies have shown that it is possible to model the residual layer of ResNet as a continuous ordinary differential equation (ODE) network [44] or by partial differential equations (PDEs). ODEs and PDEs are commonly used to simulate physical systems. Therefore, the previous work on ODE-Nets suggests that the evolution equations used for actual physical systems can be considered as a family of neural networks. In this study, we propose a new building block for neural networks based on the Schrödinger equation (SE-NET), as depicted in Fig. 1(a). The Schrödinger equation is used for modeling quantum phenomena and is also commonly used for analyzing lightwave transmission in the design of analog optical waveguides [15, 10]. As the Schrödinger equation is related to various fields of physics (see Fig. A.1), the discussion in this paper applies to various physical laws, including those in thermodynamics, electromagnetism, and mechanics. Our contributions in this paper are as follows:
Network model augmentation to physical law:
We extend continuous representations of neural networks to a physical law (the Schrödinger equation). The relationship between the manipulation of a physical quantity (the potential field) and the neural network is revealed. We show that the SE-NET can be trained by using the complex-valued adjoint method. In this framework, the learning targets are no longer only the weights of DNNs but also actual physical quantities.
Computational physics for neural network solver:
Thanks to the model augmentation, a physical simulator can be used as a neural network solver. We show that the SE-NET can be efficiently computed by using the Crank-Nicolson finite difference method (FDM). To prevent exploding and vanishing gradients, we also introduce a phase-only optimization method that was originally developed for the topology optimization of optics.
Physical data processing:
The SE-NET describes actual physical behaviors, such as beam propagation in an optical waveguide. Training amounts to the "design" of a physical structure, and the designed structure is physically transferable. We can train the physical structure as a black box on a machine learning (ML) platform such as PyTorch. A physical neural network provides the chance to access such natural data.
Figure 1: (a) Concept of the neural SE. Block diagrams of (b) the neural ODE and (c) the neural SE.
End-to-end hybrid physical-digital machine learning:
In general, much information is lost during measurement: e.g., camera images lose depth (phase), polarization, and wavelength information through measurement. By using our framework, we can jointly optimize the physical structure and digital DNNs under the same cost function. All gradients can be computed by the chain rule and the adjoint method. This enables end-to-end ML with a physical structure that manipulates the natural information. As an example, we demonstrate a data-driven end-to-end design of an ultracompact optical spectrometer.
In the following sections, we demonstrate the above-mentioned features of the SE-NET. Note that our discussion basically assumes classical (not quantum) systems. Thus, the discussion does not pertain to quantum ML systems, but there are several similarities between our learning scheme and quantum systems, as described in Section 6.
First, we describe the continuous representation of neural networks. The hidden state $x(L) \in \mathbb{R}^d$ of a ResNet is represented by $x(L+1) = x(L) + f[x(L), \theta]$, where $L \in \mathbb{N}$ denotes the layer number, $f$ denotes the nonlinear function, and $\theta$ is a trainable weight. It has been shown that a residual block can be interpreted as a discrete approximation of an ODE by setting the discretization step to one. When the discretization step approaches zero, it yields a family of neural networks called neural ODEs [9]. A block diagram of the processing in an ODE-Net is shown in Fig. 1(b). Formally, in a neural ODE, the relation between the input and output is characterized by
$$\frac{dx(t)}{dt} = f[x(t), \theta, t]. \qquad (1)$$
From the given input $x(t_o)$, the output $x(t_1)$ can be computed by solving the ODE in (1). The backward propagation of neural ODE-Nets can also be computed with a standard ODE solver by considering their adjoint.
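As a minimal illustration of Eq. (1), the forward pass can be sketched with a fixed-step Euler integrator in PyTorch; the dynamics function, step count, and dimensions below are illustrative assumptions rather than the adaptive solver of [9]:

```python
import torch

def odeint_euler(f, x0, t0, t1, steps=100):
    """Integrate dx/dt = f(x, t) from t0 to t1 with fixed-step Euler."""
    x, dt = x0, (t1 - t0) / steps
    for i in range(steps):
        x = x + dt * f(x, t0 + i * dt)  # one residual-style update per step
    return x

# Illustrative dynamics: a tiny trainable network playing the role of f(x, theta, t)
net = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 2))
f = lambda x, t: net(x)
x_out = odeint_euler(f, torch.randn(8, 2), 0.0, 1.0)  # batch of 8 trajectories
loss = x_out.pow(2).mean()
loss.backward()  # autograd backpropagates through the unrolled solver
```

Because the solver is unrolled here, autograd recovers the gradients directly; the adjoint method of [9] computes the same gradients without storing the intermediate states.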
Neural Schrödinger equation: Next, we consider the time-dependent one-dimensional Schrödinger equation in the neural ODE framework. A block diagram of the processing in the SE-NET is shown in Fig. 1(c). The Schrödinger equation is a complex-valued second-order hyperbolic equation described as
$$j\hbar\frac{\partial \psi(x,t)}{\partial t} = H\psi(x,t), \qquad (2)$$
where $\psi(x,t) \in \mathbb{C}$ is a wave function, $t$ is time, $x \in \mathbb{R}$ is a position, $j$ is the imaginary unit, $\hbar$ is Dirac's constant, and $H$ is the Hamiltonian of the considered system. For example, the Hamiltonian for a particle under the potential field $V(x,t) \in \mathbb{C}$ is described as follows:
$$H_L = -\frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} + V(x,t), \qquad (3)$$
where $m$ denotes the mass of the particle. As discussed in the appendix and [35], a one-dimensional kernel filter $K(\theta)$ in a convolutional neural network (CNN) is described as
$$K(\theta) = \alpha_1(\theta) + \alpha_2(\theta)\frac{\partial}{\partial x} + \alpha_3(\theta)\frac{\partial^2}{\partial x^2}. \qquad (4)$$
By comparing Eqs. (3) and (4), we see that the linear Schrödinger equation [Eq. (2)] corresponds to the complex-valued evolution of the ODE-Net in Eq. (1) without a nonlinear function. In this analogy, the tuning of the potential field corresponds to the change in the weights of the kernel filter. The above discussion can be extended to the three-dimensional case; then, the dimensions of the corresponding kernel filter are expanded. The Schrödinger equation with a linear Hamiltonian does not include nonlinear conversion. Therefore, we also consider a nonlinear Hamiltonian such as $H_{NL} = H_L + g|\psi(x,t)|^2$, where $g$ is a nonlinear constant and $|\cdot|$ denotes the absolute value. This equation is commonly used for the analysis of nonlinear optical phenomena and quantum chaos [10]. Note that the linear Hamiltonian is also valuable when considering the actual implementation of a physical SE-NET.
As discussed above, the time evolution of the Schrödinger equation is considered as a special type of neural network, which we call the SE-NET. In this study, we simply deal with the Schrödinger equation as a classical complex-valued wave equation; that is, $\psi(x,t)$ in the Schrödinger equation simply represents the complex-valued amplitude of a transmitted wave (not the probability of the existence of a particle).
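As a rough sketch of this correspondence, the evolution under Eqs. (2) and (3) with a trainable potential can be integrated with a split-step Fourier scheme (an alternative to the FDM solver adopted later); the grid sizes, step counts, and toy loss below are assumptions:

```python
import torch

def se_net_forward(psi, V, dx=0.1, dt=0.01, hbar=1.0, m=1.0, g=0.0):
    """Evolve j*hbar dpsi/dt = H psi with H = -(hbar^2/2m) d^2/dx^2 + V + g|psi|^2
    by split-step Fourier integration. V has shape (steps, nx) and is trainable."""
    nx = psi.shape[-1]
    k = 2 * torch.pi * torch.fft.fftfreq(nx, d=dx)
    kinetic = torch.exp(-1j * hbar * k**2 * dt / (2 * m))   # kinetic part of each step
    for Vz in V:  # each time slice of the potential acts like one layer's weights
        psi = torch.fft.ifft(kinetic * torch.fft.fft(psi))
        psi = psi * torch.exp(-1j * (Vz + g * psi.abs()**2) * dt / hbar)
    return psi

V = torch.zeros(50, 128, requires_grad=True)       # trainable potential field
psi0 = torch.randn(128, dtype=torch.cfloat)
psi_out = se_net_forward(psi0, V, g=0.1)           # nonlinear Hamiltonian, g = 0.1
loss = (psi_out.abs()**2).mean()
loss.backward()                                    # dL/dV via complex autograd
```

Here the 50 potential slices play exactly the role that weight matrices play in a 50-layer residual network.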
Training by adjoint method: As the building block of the SE-NET is no longer a simple weight matrix, it seems difficult to apply the standard back-propagation method. However, as pointed out in [9] and also described in the past [27, 18], the backpropagation of an ODE-based continuous network is equivalent to an adjoint sensitivity analysis. The adjoint method is typically used to optimize physical systems, which is called inverse design or topology optimization [3, 21].
Here, we consider optimizing a loss function $L$ whose input is the result of the SE-NET outputs. To optimize $L$, we consider how the gradient of the loss depends on the hidden state $\psi(x,t)$. This quantity is called the adjoint $\psi_a(x,t)$, which is defined as $\psi_a(x,t) \equiv \partial L/\partial \psi(x,t)$. Its dynamics are given by
$$\frac{\partial \psi_a(x,t)}{\partial t} = \psi_a^*(x,t)\left(jH - \frac{\partial H}{\partial \psi(x,t)}\right)\psi(x,t), \qquad (5)$$
$$\frac{\partial L}{\partial V(x_o,t_o)} = -j\,\psi_a^*(x_o,t_o)\,\frac{\partial H}{\partial V(x_o,t_o)}\,\psi(x_o,t_o), \qquad (6)$$
which can be thought of as the instantaneous analog of the chain rule. From (2) and (6), we derive the following gradients for $V$:
$$\frac{\partial L}{\partial V_{real}(x_o,t_o)} = \mathrm{Im}\left[\psi_a^*(x_o,t_o)\,\psi(x_o,t_o)\right], \qquad (7)$$
$$\frac{\partial L}{\partial V_{imag}(x_o,t_o)} = \mathrm{Re}\left[\psi_a^*(x_o,t_o)\,\psi(x_o,t_o)\right], \qquad (8)$$
where $V_{real}(x_o,t_o)$ and $V_{imag}(x_o,t_o)$ are the real and imaginary parts of the potential field at each $(x_o,t_o)$. As shown in Eqs. (7) and (8), the SE-NET can be trained by optimizing the local potential field $V(x,t)$. The gradients can be computed by evaluating the forward-propagated field $\psi(x,t)$ and the backpropagated adjoint field $\psi_a(x,t)$. As can be seen in Eq. (5), we can solve for the backpropagated wave by using almost the same propagation rules: for a linear Hamiltonian [$\partial H/\partial \psi = 0$], in particular, the backpropagation equation mirrors the forward propagation one. The same training rule can be introduced by considering the calculus of variations, as described in [14].
Physical simulator as SE-NET solver: Several techniques for efficiently solving Eq. (2) have been widely studied [33, 7]. We can use these methods as an SE-NET solver. The algorithm based on the fast Fourier transform (FFT) is simple, but it requires $O(n_x \log(n_x))$ complexity, where $n_x$ is the number of grid points along the $x$-axis. Here, we solve this equation using a finite difference method (FDM), which requires only $O(n_x)$ complexity. There are various ways to implement the FDM. As also discussed in the original ODE paper, the choice of solver is important because the numerical stability differs for each solver. Explicit methods such as the Euler method are simple and easy to implement. However, their solutions become unstable when $\Delta t/\Delta x^2 > 1/2$, where $\Delta x$ and $\Delta t$ are the grid spacings along the $x$- and $t$-axes. In addition, we need fine computational meshes to obtain accurate solutions because their numerical error is relatively large: $[O(\Delta x^2) + O(\Delta t)]$. Here, we employ the Crank-Nicolson method to solve the FD-BPM because its solutions are stable and its numerical error is small: $[O(\Delta x^2) + O(\Delta t^2)]$. The details of the Crank-Nicolson method are described in the appendix. It is known that lightwave propagation in an optical waveguide under the paraxial ray approximation is described by the Schrödinger equation. Therefore, we can consider the SE-NET as a form of optical beam transmission. In this analogy, the training of the SE-NET means the "design" of the waveguides. Thus, the trained network is transferable to actual optical systems.
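Given the forward field and the adjoint field, Eqs. (7) and (8) reduce to a pointwise product; a minimal sketch, with placeholder arrays standing in for the two propagation solves:

```python
import torch

def potential_gradients(psi, psi_a):
    """Evaluate Eqs. (7)-(8): gradients of the loss w.r.t. the real and imaginary
    parts of the potential, from the forward field psi and adjoint field psi_a
    sampled on the same (x, t) grid."""
    overlap = torch.conj(psi_a) * psi     # psi_a^*(x_o, t_o) psi(x_o, t_o)
    dL_dV_real = overlap.imag             # Eq. (7)
    dL_dV_imag = overlap.real             # Eq. (8)
    return dL_dV_real, dL_dV_imag

# psi and psi_a would come from the forward and adjoint (backward) solves
psi   = torch.randn(50, 128, dtype=torch.cfloat)   # placeholder forward field
psi_a = torch.randn(50, 128, dtype=torch.cfloat)   # placeholder adjoint field
gV_re, gV_im = potential_gradients(psi, psi_a)
```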
Here, we consider light propagation in a medium with a refractive index distribution for the emulation and training of the SE-NET. For optical systems, we can derive the Schrödinger equation as follows by considering time-independent two-dimensional ($x$-$z$) systems with the slowly varying envelope approximation [15]:
$$j\frac{\partial \psi(x,z)}{\partial z} = H\psi(x,z), \qquad (9)$$
where $\psi(x,z)$ is an electric field. For a standard linear optical system, the Hamiltonian $H_L$ is described as
$$H_L = \frac{1}{2kn_r}\left(-\frac{\partial^2}{\partial x^2} + k^2\,\Delta n^2(x,z)\right), \qquad (10)$$
where $k$ is the wavenumber. $\Delta n^2$ is defined as $\Delta n^2 \equiv n^2 - n_r^2$, where $n_r$ is a reference refractive index and $n$ is the complex-valued local refractive index. The real part of the refractive index, $n_{real}$, denotes the phase shift of the wave, and the imaginary part $n_{imag}$ contributes the loss or gain of the system. By comparing with the Schrödinger equation in Eq. (2), it is clear that the first term of Eq. (10) corresponds to momentum and the second term corresponds to the local potential. Nonlinear conversions can be introduced by considering the approach described in Section 2.2 (see Fig. 2). Thus, the $z$-axis evolution of Eq. (9) is also considered as forward propagation in a neural network.
By considering the adjoint $\psi_a(x,z) \equiv \partial L/\partial \psi(x,z)$, we can derive the update rule in the same way as in Eqs. (9) and (10). This means that the SE-NET can be trained by optimizing the local refractive index $n(x,z)$. We can solve both the forward propagation and the backpropagation by using a simulation method for optical systems.
Phase-only training: We have two parameters at each position $(x_o, z_o)$. However, $n_{imag}$ means the gain or loss of the transmitting medium. ODE-based dynamical systems are known to be unstable when the gain of the weight matrix becomes large, which corresponds to gradient explosion. In addition, system loss significantly affects the signal-to-noise ratio of the output signal, which degrades system performance. To avoid these issues, we consider a phase-only update rule for the local refractive index $n$, which was developed for the inverse design of optical devices [14, 36]. This method is called the wavefront-matching (WFM) method. In the WFM method, $n_{imag}$ and $\partial L/\partial n_{imag}$ are fixed at zero. We only update $n_{real}$, which means that the update is executed to match the wavefronts of the forward-propagating wave and the backpropagating wave. By using this method, the system automatically maintains unitarity. The weight matrix derived from the local refractive index is a Hermitian matrix. Thus, we can maintain model stability. This means that the system is always stable and that the law of conservation of energy is satisfied. Therefore, the physical system has no principal loss or energy consumption for solving SE-NETs. This framework is considered as a specific expression of a unitary neural network [1].
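A minimal sketch of the phase-only rule: only $n_{real}$ is registered with the optimizer, while $n_{imag}$ stays fixed at zero, so no gain or loss is ever introduced. The toy forward_model stand-in, grid sizes, and loss below are assumptions:

```python
import torch

# Trainable refractive-index perturbation, stored as real and imaginary parts.
n_real = torch.zeros(50, 128, requires_grad=True)
n_imag = torch.zeros(50, 128)              # fixed at zero: no gain and no loss

opt = torch.optim.Adam([n_real], lr=1e-3)  # only the phase (n_real) is updated

def train_step(forward_model, psi_in, target):
    """One WFM-style step: update n_real only, so the evolution stays unitary."""
    opt.zero_grad()
    psi_out = forward_model(psi_in, torch.complex(n_real, n_imag))
    loss = (psi_out.abs()**2 - target).pow(2).mean()
    loss.backward()                        # dL/dn_imag is never formed or applied
    opt.step()
    return loss.item()

# Toy stand-in for the propagator: each slice of n applies a pure phase shift.
forward_model = lambda psi, n: psi * torch.exp(-1j * n.sum(0) * 0.01)
train_step(forward_model, torch.randn(128, dtype=torch.cfloat), torch.rand(128))
```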
Manipulating lightwave information: Lightwaves have inherent parallelism with respect to phase, space, and wavelength. These data are lost when the intensity is detected by a photodetector (PD). The optical SE-NET can process these data via $n$. The phase and space information can be directly tuned by $n_{real}(x,z)$. Since the amount of phase modulation induced by $n$ differs for each wavelength, the wavelength information can be processed by using the standard SE-NET [Eqs. (9) and (10)]. Thus, we can train the physical structure by considering the wavelength.
Figure 2: Learning examples using the physical solver (FD-BPM) on PyTorch for (a) a 1:2 power splitter and (b) a wavelength demultiplexer.
Optics on PyTorch:
We implement the optical SE-NET using the finite difference beam propagation method (FD-BPM) on PyTorch. Fig. 2(a) shows a training example, the design of an optical 1:2 splitter. A training example of a wavelength splitter is also shown in Fig. 2(b). In our framework, we treat the wavelength information as the data in each minibatch. Although this framework is almost the same as topology optimization or inverse design [14], we can optimize the physical structure as a black box with little prior physical knowledge on PyTorch. Additional information on the experiments is described in the appendix.
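A sketch of this minibatch-over-wavelengths training loop, with a toy phase-only stand-in for the FD-BPM propagator; the propagator, grid sizes, and targets below are assumptions:

```python
import torch

def fdbpm_forward(psi, n, lam, dz=1e-6):
    """Toy stand-in for the Crank-Nicolson FD-BPM propagator of Appendix B:
    each z-slice of n applies a wavelength-dependent phase to the field."""
    for nz in n:
        psi = psi * torch.exp(-1j * (2 * torch.pi / lam) * nz * dz)
    return psi

wavelengths = [1.30e-6, 1.55e-6]                    # one minibatch sample per wavelength
targets = [torch.rand(256), torch.rand(256)]        # placeholder output intensities
psi0 = torch.randn(256, dtype=torch.cfloat)
n_real = torch.zeros(200, 256, requires_grad=True)  # trainable index perturbation
opt = torch.optim.Adam([n_real], lr=1e-3)

for step in range(100):
    opt.zero_grad()
    loss = sum((fdbpm_forward(psi0, n_real, lam).abs()**2 - t).pow(2).mean()
               for lam, t in zip(wavelengths, targets)) / len(wavelengths)
    loss.backward()                                 # dL/dn averaged over wavelengths
    opt.step()
```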
End-to-end hybrid physical-digital ML:
Thanks to the PyTorch implementation, we can jointly optimize the physical structure and digital DNNs under the same cost function. All gradients can be computed by the chain rule and the adjoint method thanks to the autograd function of PyTorch. This enables end-to-end ML with a physical structure. The algorithm is shown in the appendix.
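A sketch of the joint update, assuming hypothetical optics_forward and detect stand-ins for the physical solver and detector model; a single optimizer holds both the physical and digital parameters:

```python
import torch

# Physical parameters (optical frontend) and digital DNN weights in one optimizer.
n_real = torch.zeros(200, 256, requires_grad=True)
dnn = torch.nn.Sequential(torch.nn.Linear(25, 64), torch.nn.ReLU(), torch.nn.Linear(64, 500))
opt = torch.optim.Adam([{"params": [n_real]}, {"params": dnn.parameters()}], lr=1e-3)

def end_to_end_step(spectrum, optics_forward, detect):
    """One joint step: the loss gradient flows through the digital DNN, the
    detector model, and the optical propagation into the physical parameter n_real."""
    opt.zero_grad()
    x = detect(optics_forward(spectrum, n_real))  # physical measurement X (25 values)
    y = dnn(x)                                    # digital reconstruction (500 points)
    loss = torch.nn.functional.mse_loss(y, spectrum)
    loss.backward()                               # chain rule + adjoint in one autograd pass
    opt.step()
    return loss.item()
```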
To examine the above-described training framework, we validated the SE-NET using several datasets. To compute the $z$-axis evolution of the SE-NET numerically, we implemented the FD-BPM solver with the Crank-Nicolson method on the PyTorch framework. To train the weights, we used a standard Adam optimizer with a learning rate of 0.001. For the training, we set the size of the calculation grid along the $x$- and $z$-axes to 1.0 µm and the reference refractive index $n_r$ to 1.45. These hyperparameter values for the SE-NET are based on standard silica-based optical circuits for telecom applications.
Image classification:
At first, we confirmed the performance of the SE-NET itself by neglecting the physical implementation. We trained SE-NETs to classify the MNIST dataset of handwritten digits [26]. The dataset consists of 60,000 labeled digital images of size 28 × 28 showing handwritten digits from 0 to 9. For the training, the optimizer steps were computed using minibatches consisting of 128 randomly chosen examples from the 60,000 training data. The input wavelength ($\lambda = 2\pi/k$) was set to 1.55 µm. The trained models for our experiment are shown in Fig. 3(a) and (b). The architecture in Fig. 3(a) is named CNN-SE-NET. It has a CNN-based down sampling layer containing three two-dimensional convolution units, which is the same as in the pre-filtering method used in the previously reported ODE-Net model [9]. To neglect the effect of the CNN layer, we also trained a model with SE-NET-based down sampling, named SE2-NET. The calculation regions are also shown in Fig. 3(a) and (b). We consider two types of nonlinear conversion in this task, as shown in Fig. 3(c) and (d). The first uses the typical nonlinear Hamiltonian with $g = 0.1$. We decided this value using a simpler task (see the appendix). The other employs nonlinear activation after each propagation segment of the SE-NET, as shown in Fig. 3(d). This system can be physically devised by inserting nonlinear material discretely. At the input edge of the SE-NETs, the input data are aligned along the $x$-axis and mapped to the real part of $\psi(x,z)$. At the output edge of the SE-NETs, the real and imaginary parts of $\psi(x, z = z_L)$ are treated independently. They go to the fully connected layer. The experimental results for the test error are shown in Table 1. As can be seen in this table, the performance of the SE-NETs is on the same order of magnitude as the best performance found in the literature. Although SE2-NET achieved a 1.33% test error, the CNN-SE-NET model showed much better performance, suggesting that CNN-based down sampling is superior to SE-NET-based down sampling. We think this difference is due to the one-dimensional nature of the tried model, as described in Eq. (4). Thus, it will be improved when we consider a higher-dimensional SE-NET model including time or $y$-axis transmission. An experiment with a larger dataset is also shown in the appendix.
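The pipeline can be sketched as follows, loosely following the reported architecture; the down sampling widths and the phase-only propagation stand-in are assumptions:

```python
import torch
import torch.nn as nn

class CNNSENet(nn.Module):
    """Sketch of the CNN-SE-NET pipeline: CNN-based down sampling to a 24x24 patch
    (576 points), SE-NET-like propagation with a trainable potential, then an FC
    head acting on the real and imaginary output parts."""
    def __init__(self, nx=576, steps=60, n_classes=10):
        super().__init__()
        self.down = nn.Sequential(                 # 28x28 -> 26x26 -> 24x24
            nn.Conv2d(1, 16, 3), nn.ReLU(),
            nn.Conv2d(16, 1, 3), nn.BatchNorm2d(1))
        self.n_real = nn.Parameter(torch.zeros(steps, nx))  # trainable potential
        self.fc = nn.Linear(2 * nx, n_classes)
    def forward(self, img):
        x = self.down(img).flatten(1)              # (batch, 576)
        psi = x.to(torch.cfloat)                   # data mapped to Re[psi] at the input edge
        for nz in self.n_real:                     # stand-in for the FD-BPM layers
            psi = psi * torch.exp(-1j * nz)
        return self.fc(torch.cat([psi.real, psi.imag], dim=1))

logits = CNNSENet()(torch.randn(4, 1, 28, 28))     # raw scores for 10 digit classes
```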
Figure 3: Trained models for the MNIST task with (a) CNN-based and (b) SE-NET-based down sampling layers.
Table 1: Test error for the MNIST dataset.
Hamiltonian         CNN-SE-NET (this work)  SE2-NET (this work)  ODE-Net [9]  LeNet [25]
Linear (N=1)        0.77%                   1.33%                –            –
Linear (N=5)        0.63%                   1.05%                0.42%        1.–%
Nonlinear (g=0.1)   0.69%                   1.–%                 –            –
End-to-end compressed sensing spectroscopy:
Next, we demonstrate a possible physical implementation with the joint end-to-end machine learning scheme. We tried to design an ultracompact optical spectrometer. As the photodetector loses the wavelength information at the measurement, a standard spectrometer is designed to receive each wavelength independently, as shown in Fig. 4(a). It requires bulk gratings and long optical paths to separate the wavelengths spatially. For miniaturization, compressive sensing (CS) spectroscopy is attracting much attention [2, 29, 22, 23, 17]. In this method, spectra $y \in \mathbb{R}^{N\times 1}$ are reconstructed from a few detector signals $x \in \mathbb{R}^{M\times 1}$ with a known relationship $x = Ty$, where $T \in \mathbb{R}^{M\times N}$ is the detection matrix of the optical structure. It has been reported that deep learning is a powerful tool for the reconstruction [22, 23, 17]. Each row of $T$ represents the spectral transmission to each detector. In general, the optical implementation of $T$ requires complex optics and specialized knowledge for the design.
Here, we consider the system shown in Fig. 4(b). We implement $T$ using the SE-NET, which enables simultaneous optimization of $T$ and the DNN for reconstruction. To evaluate the performance, we used 12,000 synthetic spectral datasets with wavelengths of 1000–1500 nm generated using Gaussian distribution functions. The data were randomly divided into 10,000 training examples and 2,000 examples used for validation. The input data are represented as a complex-valued Gaussian beam with a beam width of 100 µm. The analyzed region was set to 500 × 500 µm². The 25 detectors were set on the output edge of the waveguide. We used a linear Hamiltonian in consideration of easy transfer to optics. For the post-processing, we used a residual layer with a one-dimensional convolution filter, which has almost the same structure as the filter in [23] (see the appendix for more details). Figures 4(c) and (d) show the validation loss and peak signal-to-noise ratio (PSNR) with and without SE-NET updates. The results are the average of five runs. After training for 20 epochs, the validation PSNR for the model with SE-NET updates reached 41.7 dB, which is far superior to the PSNR without SE-NET updating (25.9 dB). Fig. 4(e) shows two examples of reconstructed spectra. Figure 4(f) shows the evolution of the refractive index during training. As can be seen, the structure is optimized from a flat initial state, suggesting that we can optimize the physical structure as a black box with little prior physical knowledge. These results suggest the effectiveness of the hybrid physical-digital ML.
Figure 4: Schematic of (a) a standard spectrometer and (b) the considered CS spectrometer. (c) Loss and (d) PSNR as a function of minibatch. (e) Typical reconstructed spectra and (f) refractive index at each minibatch state.
Augmentability to other wave dynamics:
We summarize the correspondence between physics and neural networks in Figure A.1. The real-valued representation of the Schrödinger equation corresponds to the diffusion equation. Thus, our work could be extended to thermal conduction, reaction-diffusion systems, and fluid systems. The Schrödinger equation is derived from the scalar wave equation. Our work could thus be extended to the complex-valued propagation of, for instance, acoustic waves. As is well known, the classical limit (i.e., $\hbar \to 0$ in quantum systems or the paraxial approximation in optics) corresponds to the Hamilton-Jacobi equation. Therefore, our discussion can be extended to members of its family, such as the Eikonal equation, by considering such approximations. The application of Hamilton-Jacobi dynamics to ML is discussed in [8]. Possible applications:
One application of the SE-NET is a design method for functional optical waveguides and lenses that enables end-to-end ML with a physical structure, e.g., for compressed sensing, computational imaging [39, 11, 6], and optical communication [19]. Another application is an optical processor for ultrafast and energy-efficient inference engines [37]. This is because the operation of the transferred network is performed at the speed of light, which does not require any principal energy consumption.
Scalability of physical SE-NET:
Our SE-NET-based optical device requires less than 1 µm for each pixelized matrix element, which is much smaller than in previous on-chip optical neural networks [37, 20]. This means that we can integrate over 100 million weights into a 1-cm² optical chip, which is compatible with state-of-the-art application-specific integrated circuits (ASICs). In addition, we have not yet extracted the full potential of the SE-NET because there are many dimensions for augmentation, including time, the $y$-axis, and polarization. From the viewpoint of ResNet:
Our model is inspired by the ResNet and its ODE representations. Although the ResNet has a very deep structure, its essence is an ensemble of relatively shallow neural networks, which was confirmed by the drop-out dependency of neural networks [42]. The SE-NET is formed by connecting each point as a node in the time direction. It can be seen as a superposition of many paths. The change in a path at a given point is obtained by multiplying the path by $\exp(-jV(x,t)dt) \approx 1 - jV(x,t)dt$, assuming that $dt$ is very small. In this case, we get the unscattered path from the first term on the right-hand side and the scattered path from the second term. Since $dt$ is very small, the probability that a single path is scattered at many points is very small. The majority of the paths will be scattered only a small number of times. Furthermore, their superposition can be interpreted as an ensemble of relatively shallow networks. This is consistent with ResNet's analysis. This feature provides robustness against weight errors, which is effective for the actual analog implementation described in Section 3. These investigations remain as future work.
Related works
Neural ODEs:
The basic proposal of neural ODEs and their operation are described in Section 2. Recently, network architectures based on ODE-Nets and their applications have been intensively investigated. For example, W. Grathwohl et al. proposed a neural ODE-based generative model to solve inverse problems [12]. A higher-order approximation of the states in a neural ODE was proposed in [44], and Y. Rubanova et al. proposed an ODE-based RNN structure [34]. The robustness of neural ODEs is discussed in [43].
Neural network with optics:
The crossover between neural networks and optics is a recent hot topic [40]. On the application side, it has been reported that joint optimization of the lens structure is effective for computational imaging applications [39, 11, 6]. However, the optimization is limited to a single-layer Fresnel lens, limiting the functionality of the optics. Furthermore, since the optimization target is a point spread function (PSF), the PSF has to be converted to an actual lens structure. By augmenting our method with a three-dimensional equation, we could optimize the local refractive index directly to enhance system performance.
On the basic research side, optical neuromorphic accelerators are being intensively investigated as candidates for DNN processors. They typically use a Mach-Zehnder interferometer [37] or a discrete diffractive optical element [30, 5, 3] as a weight element. Adjoint optimization of these elements has already been shown in [30, 3, 20]. In general, increasing the number of nodes in each layer improves the characteristics in proportion to a polynomial of the number of nodes. On the other hand, increasing the number of layers is said to improve the characteristics exponentially, so it is better to have more layers. From this point of view, we think that the SE-NET is more advantageous than diffractive neural networks as a configuration for increasing the number of layers. In addition, the SE-NET does not have a waveguide structure. Instead, each point becomes a node, and the nodes are connected through the Hamiltonian. In contrast, waveguide-based circuits are difficult to scale up because the elements that optically connect the waveguides must be arranged in proportion to the square of the number of waveguides.
Complex-valued neural network:
Our proposed network acts as a complex-valued neural network. It has been demonstrated that the use of complex-valued parameters has numerous advantages. For example, Trabelsi et al. constructed the building blocks for complex-valued neural networks and demonstrated that a deep complex-valued network shows performance superior to that of a real-valued one [41]. Arjovsky et al. demonstrated that a recurrent neural network based on unitary weight matrices provides a richer representation [1].
Quantum Machine Learning:
Our work is partially related to quantum ML because we use the Schrödinger equation for ML. Among the various techniques, quantum circuit learning [31, 16] is the most closely related to our work. With this technique, the angles of unitary quantum gates are trained to decrease the cost function through a backpropagation method, which is similar to our proposed WFM-based learning. However, our approach differs from quantum circuit learning. We assume classical optical systems (not quantum systems), which are easier to fabricate and apply than quantum ones, but there is no quantum acceleration. In addition, we directly train the local potential field (not a gate device), allowing scalable system implementation.
We investigated the potential of the Schrödinger equation as a building block of neural networks. We showed that the SE-NET can be trained by optimizing the local potential field of the Schrödinger equation. As a demonstration, we trained the SE-NET by using the finite difference method and found that its performance is comparable to that of standard deep neural networks. The trained network is transferable to actual optical systems. Our work extends the application field of machine learning from digital-only optimization to hybrid physical-digital optimization.
Broader impact
Our research bridges the gap between physical laws and neural networks. By augmenting the physical neural network concept, we believe that our work has the following impacts.
Optimizing data acquisition:
Recent machine learning (ML) techniques require complex processing and enormous datasets to achieve state-of-the-art performance. On the other hand, the input data are still captured so as to be easily understood by humans. It is questionable whether traditional data acquisition is the optimal approach for neural post-processing. For instance, it is difficult to get depth, spectral, and polarization information from standard commercial camera data. However, when we introduce special-purpose lenses or gratings in front of sensors, we can easily reconstruct those data with simple processing. This implies that an optimized physical structure can simplify data processing and improve its performance. Our method provides a way to jointly optimize the physical structure and deep neural networks. Although the method has only been explored for simple tasks so far, it has the potential to integrate sensing and information processing in the future.
Physical structure as a computational resource:
The computational resources required to execute deep learning are increasing more and more. Thus, alternative methods of computation are being widely explored to reduce energy consumption and processing latency. Based on our concept, we can outsource part of the digital processing to a passive physical structure, which will reduce the energy consumption of the whole processing operation.
Physics-inspired ML:
The crossover of ML and physics is a topic of great interest for, for instance, materials informatics and physics-inspired networks such as neural ODEs. Our paper shows the relationship between the manipulation of a physical quantity and the neural network, connecting ML and physics (details are described in the appendix). According to our idea, physicists can treat physical equations as neural networks, and ML researchers can refer to physics for the design of DNNs.
A Correspondence between physics and neural network
We summarize the correspondence between physical phenomena and neural networks in Figure A.1. Neural ODEs are the continuous representation of the ResNet or RNN [9]. As discussed in the main article, the Schrödinger equation is considered as a continuous expression of a complex-valued residual layer. In particular, it is a special type of unitary neural network [1] when we use the phase-only update rule described in the main article. Our discussions are directly connected to waveguide optics and quantum mechanics because they are described by the Schrödinger equation. The real-valued representation of the Schrödinger equation corresponds to the diffusion equation [32]. The diffusion equation is typically used for modeling thermal conduction and reaction-diffusion systems. The discussion in Section A.3 also holds in the real-valued case. Thus, the diffusion equation is also considered as a special type of neural network. As is well known, the classical limit ($\hbar \to 0$) in quantum systems, or the paraxial approximation ($\lambda \to 0$) in wave optics, corresponds to the Hamilton-Jacobi equation or the Eikonal equation. Our work can thus be extended to analytical mechanics or analytical geometric optics. Note that neural networks based on such second-order ODEs have already been introduced in [4]. The Schrödinger equation is derived from the scalar wave equation. Our work could thus be extended to the complex-valued propagation of, for instance, electromagnetic waves and acoustic waves. The training of such physical neural networks can be executed by the adjoint method (reverse-mode ODE or PDE) in the same manner as the neural SE, which is essentially equivalent to the backpropagation method used in standard neural networks.
Figure A.1: Correspondence between physical equations and neural networks [4], [1].
B Crank-Nicolson Finite Difference Method
We describe the Crank-Nicolson finite difference method (FDM) for solving the Schrödinger equation numerically [33]. In the FDM framework, the equation to solve is discretized on a linear spatial grid, as shown in Fig. A.2(a). Then, by considering the first-order approximation $\partial \psi/\partial z \approx (\psi_p^{q+1} - \psi_p^q)/\Delta z$, Eq. (9) can be expressed as the difference equation
$$j\frac{\psi_p^{q+1} - \psi_p^q}{\Delta z} = (1-\gamma)H\psi_p^q + \gamma H\psi_p^{q+1}, \qquad (A.1)$$
where $\psi_p^q$ denotes the electric field at each grid position $(x,z) = (p\Delta x, q\Delta z)$, and the parameter $\gamma$ indicates the contribution of the forward (explicit) and backward (implicit) differences. For example, Eq. (A.1) with $\gamma = 0$ is known as an explicit method, and its calculation diagram is shown in Fig. A.2(b). This scheme is simple and easy to implement, but the solution becomes unstable when $\Delta z/\Delta x^2 > 1/2$. In addition, we need fine computational meshes to obtain accurate solutions because the numerical error is relatively large $[O(\Delta x^2 + \Delta z)]$. In the case of $\gamma = 1$, Eq. (A.1) becomes an implicit method, as shown in Fig. A.2(c). This scheme is numerically stable, but its numerical accuracy is the same as that of the explicit method. On the other hand, Eq. (A.1) with $\gamma = 1/2$ is known as the Crank-Nicolson method [Fig. A.2(d)]. This scheme is numerically stable, and its numerical error is smaller than that of the explicit and implicit methods $[O(\Delta x^2 + \Delta z^2)]$. Thus, the Crank-Nicolson method is widely used. In this case, Eq. (A.1) becomes
$$j\frac{\psi_p^{q+1} - \psi_p^q}{\Delta z} = \frac{H}{2}\left(\psi_p^{q+1} + \psi_p^q\right). \qquad (A.2)$$
By considering the definitions
$$H_L = \frac{1}{2kn_r}\left(-\frac{\partial^2}{\partial x^2} + k^2\Delta n^2_{qp}\right), \qquad (A.3)$$
$$\frac{\partial^2 \psi}{\partial x^2} \approx \frac{\psi_{p+1}^q + \psi_{p-1}^q - 2\psi_p^q}{(\Delta x)^2}, \qquad (A.4)$$
we can derive
$$\psi_p^{q+1} = \left(1 + \frac{jH\Delta z}{2}\right)^{-1}\left(1 - \frac{jH\Delta z}{2}\right)\psi_p^q \approx \left(u + D + v\Delta n_{qp}\right)^{-1}\left(u^* + D + v\Delta n_{qp}\right)\psi_p^q, \qquad (A.5)$$
where $\Delta n_{qp}$, $u$, and $v$ are defined as
$$\Delta n_{qp} = n(p\Delta x, q\Delta z) - n_r, \qquad (A.6)$$
$$u = \frac{4jkn_r(\Delta x)^2}{\Delta z}, \qquad (A.7)$$
$$v = 2k^2 n_r (\Delta x)^2, \qquad (A.8)$$
and $D$ is the tridiagonal matrix
$$D = \begin{pmatrix} -2 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ & & & 1 & -2 \end{pmatrix}. \qquad (A.9)$$
Thus, we can compute the electric field at each calculation grid point by solving Eq. (A.5) with a linear solver. When we simply use a standard dense linear solver (e.g., Gaussian elimination, the numpy.linalg module), $O(n_x^3)$ computational resources are typically required. Here, we use a sparse matrix solver based on the Thomas algorithm. The computational resources then decrease to $O(n_x)$. The backward transmission can be solved in the same way. We implemented the above algorithm on the PyTorch framework. Thus, we can train the SE-NET as a standard building block of deep neural networks.
Figure A.2: (a) Finite difference space grid. Diagrams for computing the FDM with the (b) explicit ($\gamma = 0$), (c) implicit ($\gamma = 1$), and (d) Crank-Nicolson conditions. • and ◦ denote known and unknown points.
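One Crank-Nicolson step of Eq. (A.5), with the tridiagonal system solved by the Thomas algorithm, can be sketched as follows; the prefactors in $u$ and $v$ follow the reconstruction above and should be treated as assumptions:

```python
import torch

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = RHS.
    O(n) as opposed to O(n^3) for dense Gaussian elimination."""
    n = b.shape[0]
    cp, dp = c.clone(), d.clone()
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = dp.clone()
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def cn_step(psi, dn, k, n_r, dx, dz):
    """One step of Eq. (A.5): (u + D + v*dn) psi_new = (u* + D + v*dn) psi."""
    u = 4j * k * n_r * dx**2 / dz                 # Eq. (A.7)
    v = 2 * k**2 * n_r * dx**2                    # Eq. (A.8)
    diag = -2.0 + v * dn                          # main diagonal of D + v*dn
    off = torch.ones_like(psi)                    # off-diagonals of D
    rhs = (u.conjugate() + diag) * psi            # (u* + D + v*dn) psi, row-wise
    rhs[1:] += psi[:-1]
    rhs[:-1] += psi[1:]
    return thomas(off, u + diag, off, rhs)

k = 2 * torch.pi / 1.55e-6                        # wavenumber for a 1.55 um wave
psi1 = cn_step(torch.randn(128, dtype=torch.cfloat), torch.zeros(128),
               k, n_r=1.45, dx=1e-6, dz=1e-6)
```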
C Convolutional neural network and PDE
We show that a convolution layer can be considered as a partial differential equation (PDE). The same discussion is given in [35]. Here we assume a one-dimensional input $y \in \mathbb{C}^{n_x}$, which represents a one-dimensional grid function obtained at the cell centers with a mesh size $h = 1/n_x$. A convolution layer with a one-dimensional kernel filter is represented as $K \in \mathbb{C}^{n_x \times n_x}$, which is parameterized by the convolution stencil $\theta \in \mathbb{C}^3$. Applying a coordinate change, the convolution operation can be described as follows:
$$K(\theta)y = [\theta_1, \theta_2, \theta_3] * y = \left(\frac{\alpha_1}{4}[1,2,1] + \frac{\alpha_2}{2h}[-1,0,1] + \frac{\alpha_3}{h^2}[1,-2,1]\right) * y, \qquad (A.10)$$
where $*$ denotes the convolution operation and the $\alpha_i$ are given by the unique solution of (A.11):
$$\begin{pmatrix} 1/4 & -1/(2h) & 1/h^2 \\ 1/2 & 0 & -2/h^2 \\ 1/4 & 1/(2h) & 1/h^2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{pmatrix}. \qquad (A.11)$$
By considering the limit $h \to 0$,
$$K(\theta) = \alpha_1(\theta) + \alpha_2(\theta)\frac{\partial}{\partial x} + \alpha_3(\theta)\frac{\partial^2}{\partial x^2}. \qquad (A.12)$$
This argument extends to higher spatial dimensions, where it acts as a high-dimensional kernel filter.
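Under this decomposition, recovering the PDE coefficients $\alpha$ from a given stencil $\theta$ amounts to solving the 3×3 system (A.11); a minimal sketch with $h = 1$ assumed:

```python
import torch

def stencil_to_pde(theta, h=1.0):
    """Invert Eq. (A.11): recover (alpha1, alpha2, alpha3) such that
    [t1,t2,t3] = a1/4*[1,2,1] + a2/(2h)*[-1,0,1] + a3/h^2*[1,-2,1]."""
    A = torch.tensor([[0.25, -0.5 / h, 1 / h**2],
                      [0.50,  0.0,    -2 / h**2],
                      [0.25,  0.5 / h, 1 / h**2]])
    return torch.linalg.solve(A, theta)

alpha = stencil_to_pde(torch.tensor([1., -2., 1.]))  # pure second-derivative stencil
# -> alpha = (0, 0, 1) for h = 1, i.e. K(theta) acts as d^2/dx^2
```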
D End-to-end learning method
As an example of end-to-end ML, we consider the system for spectroscopy shown in Fig. 4 of the main article. The detailed processing scheme is shown in Fig. A.3. It consists of a two-dimensional optical circuit that acts as the SE-NET, photodetectors (PDs) that receive the SE-NET outputs, and digital electric circuits that act as a standard neural network. The input power spectrum $u(\lambda)$ is represented as a complex-valued Gaussian beam, described as follows:
$$\psi(x, z_0, \lambda) = \sqrt{u(\lambda)}\,\Phi(x), \qquad (A.13)$$
where $\Phi$ represents the input waveguide field:
$$\Phi(x) = \frac{1}{\sqrt{\pi\omega_i^2}}\exp\left[-\frac{x^2}{\omega_i^2}\right], \qquad (A.14)$$
where $\omega_i$ is the mode radius at the input edge of the optimized region. The data are processed by the SE-NET [Eq. (9)] and detected by the PDs. The detection efficiency $\eta$ of each PD for each wavelength $\lambda$ is described as
$$\eta_i(\lambda) = \int \psi(x, z_L, \lambda)\,\Phi'_i(x)\,dx, \qquad (A.15)$$
where $i$ denotes the detector number and $\Phi'$ represents the detection field of each PD, which is typically described by a Gaussian profile:
$$\Phi'_i(x) = \frac{1}{\sqrt{\pi\omega_o^2}}\exp\left[-\frac{(x - i x_p)^2}{\omega_o^2}\right], \qquad (A.16)$$
where $\omega_o$ is the mode radius of the output waveguide and $x_p$ is the PD channel spacing. The total output power is described as
$$X_i = \int |\eta_i(\lambda)|^2\,d\lambda. \qquad (A.17)$$
The detected data are acquired by the digital circuits and become the inputs of the digital DNN. The DNN output $Y$ is converted to the loss $L$ by the cost function. Then, the backward pass of $L$ is computed by using the standard backward process, and we get the following equations:
$$\frac{\partial L}{\partial X_i} = \frac{\partial Y}{\partial X_i}\frac{\partial L}{\partial Y}, \qquad (A.18)$$
$$\psi_a(x, z_L, \lambda) = \sum_{i}^{M} \Phi'_i(x)\,\mathrm{Re}[\eta_i(\lambda)]\,\frac{\partial L}{\partial X_i}, \qquad (A.19)$$
where $M$ is the number of PDs. As the backward field $\psi_a(x,z)$ in the analog photonics is solved by the above equations, we can design the optimal structure through end-to-end hybrid machine learning.
Figure A.3: End-to-end machine learning with an optical frontend.
For the experiment in Fig. 4 of the main article, we set the parameters as follows. The first layer is the SE-NET using the optical waveguide. The $\omega_i$ and $\omega_o$ were set to 100 and 6 µm, respectively. The analyzed region was set to 500 × 500 µm². The 25 detectors were set on the output edge of the waveguide ($M = 25$) with $x_p = 12$ µm. We used a linear Hamiltonian in consideration of easy transfer to optics. For the post-processing, we used a residual layer with a one-dimensional convolution filter, which has almost the same structure as the filter in [23]. It consists of an up-conversion layer, a one-dimensional convolution layer, two fully connected (fc) layers, and a residual connection. The kernel size and padding size of the convolution filter were set to 16 and 8, respectively. For the up-conversion, we use the pseudo-inverse of the transmission matrix $T$.
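A discretized sketch of the detector forward model [Eqs. (A.15)–(A.17)] and the adjoint source of Eq. (A.19); the grid, Gaussian parameters, and placeholder fields below are assumptions:

```python
import torch

def detect(psi_out, x, x_p=12.0, w_o=6.0, n_det=25):
    """Eqs. (A.15)-(A.17) on a grid: overlap each PD's Gaussian mode with the
    output field for each wavelength, then sum |eta_i|^2 over wavelength.
    psi_out: (n_lambda, n_x) complex field at the output edge z = z_L."""
    i = torch.arange(n_det, dtype=torch.float32).view(-1, 1)
    phi = torch.exp(-(x - i * x_p)**2 / w_o**2) / (torch.pi * w_o**2)**0.5
    eta = (psi_out.unsqueeze(0) * phi.unsqueeze(1)).sum(-1)  # (n_det, n_lambda)
    X = (eta.abs()**2).sum(-1)                               # (n_det,)
    return X, eta, phi

x = torch.linspace(0.0, 500.0, 512)                # transverse grid in um
psi_out = torch.randn(50, 512, dtype=torch.cfloat) # placeholder field (50 wavelengths)
X, eta, phi = detect(psi_out, x)

# Eq. (A.19): per-wavelength adjoint field launched backward from the detectors,
# with dL/dX supplied by the digital DNN's backward pass (placeholder here).
dL_dX = torch.randn(25)
psi_a = (phi.unsqueeze(1) * eta.real.unsqueeze(-1) * dL_dX.view(-1, 1, 1)).sum(0)
```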
Figure A.4: (a) Simulation set-up. Results for the SE-NET with (b) the nonlinear Hamiltonian and (c) cascade-connected networks.
E Additional Experiments
Binary classification task: We consider a small-scale test problem in two dimensions called ellipses and moons. The experimental set-up is shown in Fig. A.4(a). The test data consist of 1,200 points that are evenly divided into two groups, as shown in the subplots of Figs. A.4(b) and (c). The original data are randomly divided into 1,000 training examples and 200 examples used for validation. We train the SE-NETs with the nonlinear Hamiltonian [Fig. 3(c)] and the cascade-connected network [Fig. 3(d)]. For the activation function of the cascade-connected network, the tanh function was employed, which can be physically implemented by using saturable absorbers. The input data are represented as a complex-valued Gaussian beam with a beam width of 6 µm. The analyzed region was set to … µm². The nonlinearity of the networks was scanned by changing the $g$ value (for the nonlinear Hamiltonian case) and the number of divided layers $N$ (for the cascaded network case). At the output edge, the intensity was detected by two arrayed PDs with $\omega_o = 6$ µm. We show the results in Figs. A.4(b) and (c). As the examined task obviously requires nonlinear conversion, the linear SE-NETs [$g = 0$ in Fig. A.4(b), $N = 0$ in Fig. A.4(c)] could not classify the data. When the network has nonlinearity, the validation accuracy increases up to 100%. When $g$ increased too much, the accuracy degraded. Thus, we set $g = 0.1$ for the other experiments. CIFAR10:
We trained SE-NETs to classify the CIFAR10 image dataset. The dataset consists of 60,000 labeled digital images of size 32 × 32. For the training, the optimizer steps were computed using minibatches consisting of 128 randomly chosen examples from the training data. The input wavelength ($\lambda = 2\pi/k$) was set to 1.55 µm. The trained model is almost the same as the CNN-SE-NET in Fig. 3(a). The differences are as follows: (i) the input dimension of the convolution filter was set to three to match the data dimension of CIFAR10, and (ii) the region of the SE-NET was set to … µm². It has a CNN-based down sampling layer containing three two-dimensional convolution units, which is the same as in the pre-filtering method used in the previously reported ODE-Net model [9]. We consider the nonlinear activation model shown in Fig. 3(d). This system can be physically devised by inserting nonlinear material discretely. At the input edge of the SE-NETs, the input data are aligned along the $x$-axis and mapped to the real part of $\psi(x,z)$. At the output edge of the SE-NETs, the real and imaginary parts of $\psi(x, z = z_L)$ are treated independently. They go to the fully connected layer. The test accuracy was 71.4% after 30 epochs of training.
References
[1] Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning (2016), pp. 1120–1128.
[2] August, Y., and Stern, A. Compressive sensing spectrometry based on liquid crystal devices. Optics Letters 38, 23 (2013), 4996–4999.
[3] Backer, A. S. Computational inverse design for cascaded systems of metasurface optics. Optics Express 27, 21 (2019), 30308–30331.
[4] Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
[5] Chang, J., Sitzmann, V., Dun, X., Heidrich, W., and Wetzstein, G. Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Scientific Reports 8, 1 (2018), 1–10.
[6] Chang, J., and Wetzstein, G. Deep optics for monocular depth estimation and 3D object detection. In Proceedings of the IEEE International Conference on Computer Vision (2019), pp. 10193–10202.
[7] Chang, Q., Jia, E., and Sun, W. Difference schemes for solving the generalized nonlinear Schrödinger equation. Journal of Computational Physics 148, 2 (1999), 397–415.
[8] Chaudhari, P., Oberman, A., Osher, S., Soatto, S., and Carlier, G. Deep relaxation: partial differential equations for optimizing deep neural networks. Research in the Mathematical Sciences 5, 3 (2018), 30.
[9] Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In NeurIPS (2018), pp. 6572–6583.
[10] Eisenberg, H. S., Silberberg, Y., Morandotti, R., Boyd, A. R., and Aitchison, J. S. Discrete spatial optical solitons in waveguide arrays. Phys. Rev. Lett. 81, 16 (1998), 3383–3386.
[11] Elmalem, S., Giryes, R., and Marom, E. Learned phase coded aperture for the benefit of depth of field extension. Optics Express 26, 12 (2018), 15316–15331.
[12] Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367 (2018).
[13] Graves, A., Mohamed, A.-R., and Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (2013), pp. 6645–6649.
[14] Hashimoto, T., Saida, T., Ogawa, I., Kohtoku, M., Shibata, T., and Takahashi, H. Optical circuit design based on a wavefront-matching method. Optics Letters 30, 19 (2005), 2620–2622.
[15] Haus, H. A. Waves and Fields in Optoelectronics. Prentice-Hall, 1984.
[16] Havlíček, V., Córcoles, A. D., Temme, K., Harrow, A. W., Kandala, A., Chow, J. M., and Gambetta, J. M. Supervised learning with quantum-enhanced feature spaces. Nature 567, 7747 (2019), 209–212.
[17] Heiser, Y., Oiknine, Y., and Stern, A. Compressive hyperspectral image reconstruction with deep neural networks. In Big Data: Learning, Analytics, and Applications (2019), vol. 10989, International Society for Optics and Photonics, p. 109890M.
[18] Hermans, M., Antonik, P., Haelterman, M., and Massar, S. Embodiment of learning in electro-optical signal processors. Physical Review Letters 117, 12 (2016), 128301.
[19] Huang, C., Fujisawa, S., de Lima, T. F., Tait, A. N., Blow, E., Tian, Y., Bilodeau, S., Jha, A., Yaman, F., Batshon, H. G., et al. Demonstration of photonic neural network for fiber nonlinearity compensation in long-haul transmission systems. In (2020), IEEE, pp. 1–3.
[20] Hughes, T. W., Minkov, M., Shi, Y., and Fan, S. Training of photonic neural networks through in situ backpropagation and gradient measurement. Optica 5, 7 (2018), 864–871.
[21] Hughes, T. W., Minkov, M., Williamson, I. A., and Fan, S. Adjoint method and inverse design for nonlinear nanophotonic devices. ACS Photonics 5, 12 (2018), 4781–4787.
[22] Kim, C., Park, D., and Lee, H.-N. Convolutional neural networks for the reconstruction of spectra in compressive sensing spectrometers. In Optical Data Science II (2019), vol. 10937, International Society for Optics and Photonics, p. 109370L.
[23] Kim, C., Park, D., and Lee, H.-N. Compressive sensing spectroscopy using a residual convolutional neural network. Sensors 20, 3 (2020), 594.
[24] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436–444.
[25] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[26] LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist.
[27] LeCun, Y., Touresky, D., Hinton, G., and Sejnowski, T. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School (1988), vol. 1, Morgan Kaufmann, pp. 21–28.
[28] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR (2016).
[29] Lin, X., Liu, Y., Wu, J., and Dai, Q. Spatial-spectral encoded compressive hyperspectral imaging. ACM Transactions on Graphics (TOG) 33, 6 (2014), 1–11.
[30] Lin, X., Rivenson, Y., Yardimci, N. T., Veli, M., Luo, Y., Jarrahi, M., and Ozcan, A. All-optical machine learning using diffractive deep neural networks. Science 361, 6406 (2018), 1004–1008.
[31] Mitarai, K., Negoro, M., Kitagawa, M., and Fujii, K. Quantum circuit learning. Physical Review A 98, 3 (2018), 032309.
[32] Okino, T. Correlation between diffusion equation and Schrödinger equation.
[33] Pedrola, G. L. Beam Propagation Method for Design of Optical Waveguide Devices. John Wiley & Sons, 2015.
[34] Rubanova, Y., Chen, R. T. Q., and Duvenaud, D. Latent ordinary differential equations for irregularly-sampled time series. In NeurIPS (2019).
[35] Ruthotto, L., and Haber, E. Deep neural networks motivated by partial differential equations. CoRR abs/1804.04272 (2018).
[36] Sakamaki, Y., Saida, T., Hashimoto, T., and Takahashi, H. New optical waveguide design based on wavefront matching method. Journal of Lightwave Technology 25, 11 (2007), 3511–3518.
[37] Shen, Y., Harris, N. C., Skirlo, S., Prabhu, M., Baehr-Jones, T., Hochberg, M., Sun, X., Zhao, S., Larochelle, H., Englund, D., et al. Deep learning with coherent nanophotonic circuits. Nature Photonics 11, 7 (2017), 441–446.
[38] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[39] Sitzmann, V., Diamond, S., Peng, Y., Dun, X., Boyd, S., Heidrich, W., Heide, F., and Wetzstein, G. End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–13.
[40] Sui, X., Wu, Q., Liu, J., Chen, Q., and Gu, G. A review of optical neural networks. IEEE Access 8 (2020), 70773–70783.
[41] Trabelsi, C., Bilaniuk, O., Zhang, Y., Serdyuk, D., Subramanian, S., Santos, J. F., Mehri, S., Rostamzadeh, N., Bengio, Y., and Pal, C. J. Deep complex networks. arXiv preprint arXiv:1705.09792 (2017).
[42] Veit, A., Wilber, M. J., and Belongie, S. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems (2016), pp. 550–558.
[43] Yan, H., Du, J., Tan, V., and Feng, J. On robustness of neural ordinary differential equations. In International Conference on Learning Representations (2020).
[44] Yildiz, C., Heinonen, M., and Lähdesmäki, H. ODE2VAE: Deep generative second order ODEs with Bayesian neural networks. In NeurIPS (2019).