Fourier-domain Variational Formulation and Its Well-posedness for Supervised Learning
Tao Luo, Zheng Ma, Zhiwei Wang, Zhi-Qin John Xu, Yaoyu Zhang
Tao Luo* (LUOTAO@SJTU.EDU.CN)
Zheng Ma (ZHENGMA@SJTU.EDU.CN)
Zhiwei Wang (VICTORYWZW@SJTU.EDU.CN)
Zhi-Qin John Xu (XUZHIQIN@SJTU.EDU.CN)
Yaoyu Zhang (ZHYY.SJTU@SJTU.EDU.CN)
School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC and Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai, 200240, P.R. China
*Corresponding author. © T. Luo, Z. Ma, Z. Wang, Z.-Q.J. Xu & Y. Zhang.
Abstract
A supervised learning problem is to find a function in a hypothesis function space given values on isolated data points. Inspired by the frequency principle in neural networks, we propose a Fourier-domain variational formulation for supervised learning problems. This formulation circumvents the difficulty of imposing the constraints of given values on isolated data points in continuum modelling. Under a necessary and sufficient condition within our unified framework, we establish the well-posedness of the Fourier-domain variational problem by exhibiting a critical exponent that depends on the data dimension. In practice, a neural network is a convenient way to implement our formulation, and it automatically satisfies the well-posedness condition.
Keywords:
Fourier-domain variational problem, well-posedness, critical exponent, frequency principle, supervised learning
1. Introduction
Supervised learning is ubiquitous. In a supervised learning problem, the goal is to find a function in a hypothesis function space given values on isolated data points with labels. In practice, the deep neural network (DNN), although with limited theoretical understanding, has been a powerful method. A series of works provide a good explanation for the good generalization of DNNs by showing a Frequency Principle (F-Principle), i.e., a DNN tends to learn a target function from low to high frequencies during the training (Xu et al., 2019, 2020; Rahaman et al., 2019). The F-Principle shows a low-frequency bias of DNNs when fitting a given data set. In the neural tangent kernel regime (Jacot et al., 2018; Lee et al., 2019), later works show that the long-time training solution of a wide two-layer neural network is equivalent to the solution of a constrained Fourier-domain variational problem (Zhang et al., 2019; Luo et al., 2020).

Inspired by the above works on the F-Principle, in this paper we propose a general Fourier-domain variational formulation for supervised learning problems and study its well-posedness. In continuum modelling, it is often difficult to impose the constraint of given values on isolated data points in a function space without sufficient regularity, e.g., an $L^p$ space. We circumvent this difficulty by regarding the Fourier-domain variational problem as the primal problem, with the constraint on isolated data points imposed through a linear operator. Under a necessary and sufficient condition within our unified framework, we establish the well-posedness of the Fourier-domain variational problem. We show that the well-posedness depends on a critical exponent, which equals the data dimension. This is in stark contrast with a traditional partial differential equation (PDE) problem. For example, in a boundary value problem for a PDE in a $d$-dimensional domain, the boundary data are prescribed on the $(d-1)$-dimensional boundary of the domain, where the dimension $d$ plays an important role. However, in a well-posed supervised learning problem, the constraint is always on isolated points, which are $0$-dimensional independent of $d$, while the model has to satisfy a well-posedness condition depending on the dimension. In practice, a neural network is a convenient way to implement our formulation, and it automatically satisfies the well-posedness condition. With a clear understanding of its well-posedness, the Fourier-domain variational formulation also provides insight for designing methods for supervised learning problems.

The rest of the paper is organized as follows. Section 2 reviews related work. In Section 3, we propose a Fourier-domain variational formulation for supervised learning problems. The necessary and sufficient condition for the well-posedness of our model is presented in Section 4. Section 5 is devoted to numerical demonstrations in which we solve the Fourier-domain variational problem using band-limited functions. Finally, we present a short conclusion and discussion in Section 6.
2. Related Works
Our work, as a modelling approach for supervised learning, is related to the point cloud interpolation problem, which belongs to semi-supervised learning. One of the most widely used methods for point cloud problems is the 2-Laplacian method (Zhu et al., 2003), an approach based on a Gaussian random field and a weighted graph model. However, it has been observed (El Alaoui et al., 2016; Nadler et al., 2009) that when the number of unlabeled data points is large, the graph Laplacian method is usually ill-posed. A new weighted Laplacian method was proposed to overcome this shortcoming of the original 2-Laplacian method (Shi et al., 2017). Calder and Slepčev (2019) further considered a way to correctly set the weights in Laplacian regularization with an exponent $\alpha > d$ and proved the well-posedness of the corresponding continuum model in the large-sample limit. We remark that our continuum model is proposed for the case of a finite number of data points, i.e., $n < +\infty$, not the large-sample limit.

Based on extensive synthetic and realistic datasets, the frequency principle was proposed to characterize the training process of deep neural networks (Xu et al., 2019; Rahaman et al., 2019; Xu et al., 2020). A series of theoretical works subsequently showed that the frequency principle holds in different settings, for example, a non-NTK (neural tangent kernel) regime with infinite samples (Luo et al., 2019) and the NTK regime with finite samples (Zhang et al., 2019; Bordelon et al., 2020; Luo et al., 2020) or infinite samples (Cao et al., 2019; Ronen et al., 2019). E et al. (2020) show that an integral-equation formulation naturally leads to the frequency principle. The frequency principle has also inspired the design of deep neural networks that quickly learn functions with high-frequency components (Liu et al., 2020; Wang et al., 2020b; Jagtap et al., 2020; Cai et al., 2019; Biland et al., 2020; Li et al., 2020; Wang et al., 2020a).
3. Fourier-domain Variational Problem for Supervised Learning
In the following, we consider the regression problem of fitting a target function $f \in C_c(\mathbb{R}^d)$. Clearly, $f \in L^2(\mathbb{R}^d)$. Specifically, we use a DNN, $h_{\rm DNN}(x, \theta(t))$, to fit the training dataset $\{(x_i, y_i)\}_{i=1}^n$ of $n$ sample points, where $x_i \in \mathbb{R}^d$ and $y_i = f(x_i)$ for each $i$. For convenience of notation, we denote $X = (x_1, \ldots, x_n)^\intercal$, $Y = (y_1, \ldots, y_n)^\intercal$. It has been shown in (Jacot et al., 2018; Lee et al., 2019) that, if the number of neurons in each hidden layer is sufficiently large, then $\|\theta(t) - \theta(0)\| \ll 1$ for any $t \geq 0$. In such cases, the function
$$h(x, \theta) = h_{\rm DNN}(x, \theta_0) + \nabla_\theta h_{\rm DNN}(x, \theta_0) \cdot (\theta - \theta_0)$$
is a very good approximation of the DNN output $h_{\rm DNN}(x, \theta(t))$ with $\theta(0) = \theta_0$. Note that we impose the following requirement on $h_{\rm DNN}$, which is easily satisfied for common DNNs: for any $\theta \in \mathbb{R}^m$, there exists a weak derivative of $h_{\rm DNN}(\cdot, \theta)$ with respect to $\theta$ satisfying $\nabla_\theta h_{\rm DNN}(\cdot, \theta) \in L^2(\mathbb{R}^d)$.

Inspired by the F-Principle and the linear dynamics in the kernel regime, Zhang et al. (2019) and Luo et al. (2020) derived a Linear F-Principle (LFP) dynamics to effectively study the training dynamics of a two-layer ReLU NN with the mean square loss in the large width limit. Up to a multiplicative constant in the time scale, the gradient descent dynamics of a sufficiently wide two-layer NN is approximated by
$$\partial_t \mathcal{F}[u](\xi, t) = -(\gamma(\xi))^2 \mathcal{F}[u_\rho](\xi), \qquad (1)$$
where $u(x, t) = h(x, t) - h_{\rm target}(x)$ and $u_\rho(x) = u(x, t)\rho(x)$. We follow this work and further assume that $\rho(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)$, accounting for the real case of a finite training dataset $\{(x_i, y_i)\}_{i=1}^n$, and
$$(\gamma(\xi))^2 = \mathbb{E}_{a(0), r(0)}\left[\frac{r(0)^3}{16\pi^4 \|\xi\|^{d+3}} + \frac{a(0)^2 r(0)}{4\pi^2 \|\xi\|^{d+1}}\right],$$
where $r(0) = |w(0)|$ and the two-layer ReLU NN parameters at initialization, $a(0)$ and $w(0)$, are random variables with a certain given distribution. In this work, for any function $g$ defined on $\mathbb{R}^d$, we use the following convention for the Fourier transform and its inverse:
$$\mathcal{F}[g](\xi) = \int_{\mathbb{R}^d} g(x)\,\mathrm{e}^{-2\pi\mathrm{i}\,\xi^\intercal x}\,\mathrm{d}x, \qquad g(x) = \int_{\mathbb{R}^d} \mathcal{F}[g](\xi)\,\mathrm{e}^{2\pi\mathrm{i}\,x^\intercal\xi}\,\mathrm{d}\xi.$$
Different from $\mathcal{F}[u](\xi, t)$ on the left-hand side of (1), the expression on the right-hand side reads
$$\mathcal{F}[u_\rho](\xi, t) = \mathcal{F}[u(\cdot, t)\rho(\cdot)](\xi) = \frac{1}{n}\mathcal{F}\left[\sum_{i=1}^n \big(h(\cdot, \theta(t)) - y_i\big)\delta(\cdot - x_i)\right](\xi),$$
which incorporates the information of the training dataset. The solution of the LFP model (1) is equivalent to that of the following optimization problem in a proper hypothesis space $F_\gamma$:
$$\min_{h - h_{\rm ini} \in F_\gamma} \int_{\mathbb{R}^d} (\gamma(\xi))^{-2}\,|\mathcal{F}[h](\xi) - \mathcal{F}[h_{\rm ini}](\xi)|^2\,\mathrm{d}\xi,$$
subject to the constraints $h(x_i) = y_i$ for $i = 1, \ldots, n$. The weight $(\gamma(\xi))^{-2}$ grows as the frequency $\xi$ increases, which means that a large penalty is imposed on the high-frequency part of $h(x) - h_{\rm ini}(x)$. As we can see, a random non-zero initial output of the DNN leads to a specific type of generalization error. To eliminate this error, we use DNNs with the antisymmetrical initialization (ASI) trick (Zhang et al., 2020), which guarantees $h_{\rm ini}(x) = 0$. Then the final output $h(x)$ is dominated by low frequencies, and the DNN model possesses good generalization.
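To make the low-frequency bias concrete, the following minimal Monte Carlo sketch estimates the weight $(\gamma(\xi))^2$ as written above. The standard Gaussian initialization of $a(0)$ and $w(0)$ is an illustrative assumption, not the paper's prescription; the point is only that the weight decays polynomially in $\|\xi\|$, so its inverse penalizes high frequencies.

```python
import numpy as np

# Monte Carlo estimate of the LFP weight (gamma(xi))^2 for a two-layer ReLU
# network.  The initial parameters a(0), w(0) are drawn as standard Gaussians
# here purely for illustration.
def gamma_squared(xi_norm, d, n_samples=100_000, rng=np.random.default_rng(0)):
    a0 = rng.standard_normal(n_samples)
    w0 = rng.standard_normal((n_samples, d))
    r0 = np.linalg.norm(w0, axis=1)
    term1 = r0**3 / (16 * np.pi**4 * xi_norm**(d + 3))
    term2 = a0**2 * r0 / (4 * np.pi**2 * xi_norm**(d + 1))
    return np.mean(term1 + term2)

# The weight decays like ||xi||^{-(d+1)} for large ||xi||, so (gamma(xi))^{-2}
# imposes an increasingly large penalty on high frequencies.
for xi in (1.0, 2.0, 4.0, 8.0):
    print(xi, gamma_squared(xi, d=1))
```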
Inspired by the variational formulation of the LFP model, we propose a new continuum model for supervised learning. This is a variational problem with a parameter $\alpha > 0$ to be determined later:
$$\min_{h \in \mathcal{H}} Q_\alpha[h] = \int_{\mathbb{R}^d} \langle\xi\rangle^\alpha\,|\mathcal{F}[h](\xi)|^2\,\mathrm{d}\xi, \qquad (2)$$
$$\mathrm{s.t.}\quad h(x_i) = y_i, \quad i = 1, \cdots, n, \qquad (3)$$
where $\langle\xi\rangle = (1 + \|\xi\|^2)^{1/2}$ is the "Japanese bracket" of $\xi$ and $\mathcal{H} = \{h(x) \mid \int_{\mathbb{R}^d} \langle\xi\rangle^\alpha |\mathcal{F}[h](\xi)|^2\,\mathrm{d}\xi < \infty\}$. Note that in the spatial domain, the evaluation at $n$ given data points is meaningless in the sense of $L^2$ functions. Therefore, we consider the problem in the frequency domain and define a linear operator $P_X : L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d) \to \mathbb{R}^n$ for the given sample set $X$ to transform the original constraints into ones in the Fourier domain: $P_X \phi^* = Y$. More precisely, for $\phi \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ we define
$$P_X \phi := \left(\int_{\mathbb{R}^d} \phi(\xi)\,\mathrm{e}^{2\pi\mathrm{i}\,\xi\cdot x_1}\,\mathrm{d}\xi, \;\cdots,\; \int_{\mathbb{R}^d} \phi(\xi)\,\mathrm{e}^{2\pi\mathrm{i}\,\xi\cdot x_n}\,\mathrm{d}\xi\right)^\intercal. \qquad (4)$$
The admissible function class reads
$$\mathcal{A}_{X,Y} = \left\{\phi \in L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d) \,\middle|\, P_X \phi = Y\right\}.$$
Notice that $\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} = \left(\int_{\mathbb{R}^d} \langle\xi\rangle^\alpha |\phi(\xi)|^2\,\mathrm{d}\xi\right)^{1/2}$ is a Sobolev norm, which characterizes the regularity of the final output function $h(x) = \mathcal{F}^{-1}[\phi](x)$. The larger the exponent $\alpha$, the better the regularity.

For example, when $d = 1$ and $\alpha = 2$, by Parseval's theorem,
$$\|u\|_{H^2}^2 = \int_{\mathbb{R}} (1 + |\xi|^2)\,|\mathcal{F}[u](\xi)|^2\,\mathrm{d}\xi = \int_{\mathbb{R}} u^2 + \frac{1}{4\pi^2}|\nabla u|^2\,\mathrm{d}x.$$
Accordingly, the Fourier-domain variational problem reads as a standard variational problem in the spatial domain. This is true for any quadratic Fourier-domain variational problem, though our Fourier-domain variational formulation is not necessarily quadratic. The details for the general (non-quadratic) cases are left to future work. For the quadratic setting with exponent $\alpha$, i.e., Problem (2), the problem is roughly equivalent to the following spatial-domain variational problem:
$$\min \int_{\mathbb{R}^d} \left(u^2 + |\nabla^{\alpha/2} u|^2\right)\mathrm{d}x.$$
This is clear for integer $\alpha/2$, while fractional derivatives are required for non-integer $\alpha/2$.

Back to our problem: after the above transformation, our goal becomes the study of the following Fourier-domain variational problem.
Problem 1 Find a minimizer $\phi^*$ in $\mathcal{A}_{X,Y}$ such that
$$\phi^* \in \arg\min_{\phi \in \mathcal{A}_{X,Y}} \|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}. \qquad (5)$$
This formulation is novel in the following two aspects:
1. We regard $\mathcal{F}[h]$ as the primal solution.
2. The evaluation at the sample points $x_i$ is imposed by the linear operator $P_X$.

Now we explain the importance of these new viewpoints for our task. Traditionally, the problem is considered in $x$-$y$ space, the spatial domain. Recently, through the understanding of the F-Principle (Xu et al., 2019, 2020; Luo et al., 2020; Zhang et al., 2019), we believe that if a DNN is used to fit the data, it is more natural to consider the problem in the frequency domain. In particular, the functions defined on the Fourier domain are regarded as primal, and our variational problem is posed for such functions.

We remark that the operator $P_X$ is the inverse Fourier transform composed with evaluation at the sample points $X$. In fact, the linear operator $P_X$ maps a function defined on $\mathbb{R}^d$ to a function defined on the $0$-dimensional manifold $X$, just as the (linear) trace operator $T$ in a Sobolev space maps a function defined on a $d$-dimensional manifold to a function defined on its $(d-1)$-dimensional boundary manifold. Note that the only function space over the $0$-dimensional manifold $X$ is the $n$-dimensional vector space $\mathbb{R}^n$, where $n$ is the number of data points, while any Sobolev (or Besov) space over a $d$-dimensional manifold ($d \geq 1$) is an infinite-dimensional vector space.
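As a concrete sanity check of the norm used above, the following minimal sketch verifies numerically that for $d = 1$ and $\alpha = 2$ the Fourier-domain quantity $\int (1+|\xi|^2)|\mathcal{F}[u]|^2\,\mathrm{d}\xi$ matches the spatial-domain quantity $\int u^2 + \frac{1}{4\pi^2}|\nabla u|^2\,\mathrm{d}x$ under the transform convention of Section 3. The test function and grid parameters are illustrative choices.

```python
import numpy as np

# Numerical check of the identity (d = 1, alpha = 2):
#   int (1 + |xi|^2) |F[u](xi)|^2 dxi = int u^2 + (1/(4 pi^2)) |u'|^2 dx,
# with the convention F[u](xi) = int u(x) exp(-2 pi i xi x) dx.
L, N = 40.0, 2**14
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
dx = x[1] - x[0]
u = np.exp(-x**2)                      # smooth, rapidly decaying test function

# Spatial-domain side: u^2 plus the scaled gradient term.
ux = np.gradient(u, dx)
spatial = np.sum(u**2 + ux**2 / (4 * np.pi**2)) * dx

# Fourier-domain side: the FFT approximates the integral transform above.
xi = np.fft.fftfreq(N, d=dx)
F_u = np.fft.fft(u) * dx * np.exp(-2j * np.pi * xi * x[0])
fourier = np.sum((1 + xi**2) * np.abs(F_u)**2) * (xi[1] - xi[0])

print(spatial, fourier)                # the two values agree closely
```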
4. Existence and Non-existence for the Fourier-domain Variational Problem
In this section, we consider the existence/non-existence dichotomy for Problem 1. In subsection 4.1, we prove that there is no solution to Problem 1 in the subcritical case $\alpha < d$. The supercritical case $\alpha > d$ is investigated in subsection 4.2, where we prove that the optimal function is a continuous and nontrivial solution. All proofs of the propositions and theorems in this section can be found in the Appendix.

4.1. The Subcritical Case $\alpha < d$
In order to prove the nonexistence of a solution to Problem 1 in the case $\alpha < d$, we first need to find a class of functions whose norms tend to zero. Let $\psi_\sigma(\xi) = (2\pi)^{d/2}\sigma^d \mathrm{e}^{-2\pi^2\sigma^2\|\xi\|^2}$; then by direct calculation we have $\mathcal{F}^{-1}[\psi_\sigma](x) = \mathrm{e}^{-\|x\|^2/(2\sigma^2)}$. For $\alpha < d$, the following proposition shows that the norm $\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}$ can be made arbitrarily small as $\sigma \to 0$.

Proposition 1 (critical exponent)
For any input dimension $d$, we have
$$\lim_{\sigma\to 0}\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}^2 =
\begin{cases}
0, & \alpha < d,\\
C_d, & \alpha = d,\\
\infty, & \alpha > d.
\end{cases} \qquad (6)$$
Here the constant $C_d = (d-1)!\,(2\pi)^{-d}\,\frac{\pi^{d/2}}{\Gamma(d/2)}$ depends only on the dimension $d$.
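The dichotomy in Proposition 1 is easy to observe numerically. The following sketch evaluates the radial form of $\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}^2$ used in the proof (Appendix B) for a decreasing sequence of $\sigma$; the choice $d = 2$ and the quadrature are illustrative.

```python
import math
import numpy as np
from scipy.integrate import quad

# Numerical illustration of Proposition 1.  After the radial substitution of
# Appendix B,
#   ||F^{-1}[psi_sigma]||_{H^alpha}^2
#     = (2 pi)^d sigma^{d-alpha} omega_d
#       * int_0^inf (sigma^2 + r^2)^{alpha/2} r^{d-1} exp(-4 pi^2 r^2) dr,
# which tends to 0 (alpha < d), C_d (alpha = d), or infinity (alpha > d).
def h_alpha_norm_sq(sigma, alpha, d):
    omega_d = 2 * np.pi**(d / 2) / math.gamma(d / 2)  # area of unit (d-1)-sphere
    integrand = lambda r: ((sigma**2 + r**2)**(alpha / 2) * r**(d - 1)
                           * np.exp(-4 * np.pi**2 * r**2))
    val, _ = quad(integrand, 0, np.inf)
    return (2 * np.pi)**d * sigma**(d - alpha) * omega_d * val

d = 2                                   # illustrative dimension
C_d = math.factorial(d - 1) * (2 * np.pi)**(-d) * np.pi**(d / 2) / math.gamma(d / 2)
for alpha in (d - 1, d, d + 1):         # sub-, exactly, and super-critical
    vals = [h_alpha_norm_sq(s, alpha, d) for s in (0.1, 0.01, 0.001)]
    print(alpha, vals)                  # the alpha = d row approaches C_d
```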
The Gaussian profile $\mathcal{F}^{-1}[\psi_\sigma]$ can be replaced by any function in the Schwartz space. Proposition 1 still holds with a (possibly) different $C_d$.

For every small $\sigma$, we can use the $n$ rapidly decreasing functions $\mathcal{F}^{-1}[\psi_\sigma](x - x_i)$ to construct a solution $\mathcal{F}^{-1}[\phi_\sigma](x)$ of the supervised learning problem. However, according to Proposition 1, as the parameter $\sigma$ tends to $0$, the limit is the zero function in the sense of $L^2(\mathbb{R}^d)$. Therefore we have the following theorem:
Theorem 1 (non-existence)
Suppose that $Y \neq 0$. For $\alpha < d$, there is no function $\phi^* \in \mathcal{A}_{X,Y}$ satisfying
$$\phi^* \in \arg\min_{\phi \in \mathcal{A}_{X,Y}} \|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}.$$
In other words, there is no solution to Problem 1.

4.2. The Supercritical Case $\alpha > d$
In this subsection, we provide a theorem establishing the existence of a minimizer for Problem 1 in the case $\alpha > d$.
For $\alpha > d$, there exists $\phi^* \in \mathcal{A}_{X,Y}$ satisfying
$$\phi^* \in \arg\min_{\phi \in \mathcal{A}_{X,Y}} \|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}.$$
In other words, there exists a solution to Problem 1.
Remark 2
Note that, according to the Sobolev embedding theorem (Adams and Fournier, 2003; Evans, 1999), the minimizer in Theorem 2 has smoothness index no less than $[(\alpha - d)/2]$; in particular, for $\alpha > d$ it is continuous.
5. Numerical Results
In this section, we illustrate our results by solving Fourier-domain variational problems numerically. We use a uniform mesh in the frequency domain with mesh size $\Delta\xi$ and band limit $M\Delta\xi$. In this discrete setting, the space under consideration becomes $\mathbb{R}^{(2M)^d}$. We emphasize that the numerical solution with this setup always exists, even in the subcritical case covered by the non-existence theorem. However, as we show below, the numerical solution is trivial in nature when $\alpha < d$.

5.1. One Point in One Dimension

To simplify the problem, we start with a single point $X = 0 \in \mathbb{R}$ with the label $Y = 2$. Denote $\phi_j = \phi(\xi_j)$ for $j \in \mathbb{Z}$. We also assume that the function $\phi$ is even. Then, according to the definition of $P_X$, we have the following problem:

Example 1 (Problem 1 with a particular discretization)
$$\min_{\phi\in\mathbb{R}^M} \sum_{j=1}^{M}\left(1+(j\Delta\xi)^2\right)^{\alpha/2}|\phi_j|^2, \qquad (7)$$
$$\mathrm{s.t.}\quad \sum_{j=1}^{M}\phi_j\,\Delta\xi = 1, \qquad (8)$$
where we further assume $\phi_0 = \phi(0) = 0$. We denote $\phi = (\phi_1, \phi_2, \ldots, \phi_M)^\intercal$, $b = \frac{1}{\Delta\xi}$, $A = (1, 1, \ldots, 1) \in \mathbb{R}^{1\times M}$ and
$$\Gamma = \sqrt{\lambda}\,\operatorname{diag}\left(\left(1+(1\Delta\xi)^2\right)^{\alpha/4},\; \left(1+(2\Delta\xi)^2\right)^{\alpha/4},\; \ldots,\; \left(1+(M\Delta\xi)^2\right)^{\alpha/4}\right),$$
so that $\|\Gamma\phi\|_2^2 = \lambda\sum_{j=1}^{M}(1+(j\Delta\xi)^2)^{\alpha/2}|\phi_j|^2$.
In fact, this is a standard Tikhonov regularization (Tikhonov and Arsenin, 1977), also known as a ridge regression problem, with Lagrange multiplier $\lambda$. The corresponding ridge regression problem is
$$\min_{\phi} \|A\phi - b\|_2^2 + \|\Gamma\phi\|_2^2, \qquad (9)$$
where we put $\lambda$ in the regularization term $\|\Gamma\phi\|_2^2$ instead of the constraint term $\|A\phi - b\|_2^2$. This problem admits an explicit and unique solution (Tikhonov and Arsenin, 1977),
$$\phi = \left(A^\intercal A + \Gamma^\intercal \Gamma\right)^{-1} A^\intercal b. \qquad (10)$$
We point out that the method below also applies when the matrix $\Gamma$ is not diagonal.

Back to our problem: in order to obtain an explicit expression for the optimal $\phi$, we relate the solution of the ridge regression to the singular value decomposition (SVD). Denoting $\tilde{\Gamma} = I$ and
$$\tilde{A} = A\Gamma^{-1} = \frac{1}{\sqrt{\lambda}}\left(\left(1+(1\Delta\xi)^2\right)^{-\alpha/4},\; \left(1+(2\Delta\xi)^2\right)^{-\alpha/4},\; \ldots,\; \left(1+(M\Delta\xi)^2\right)^{-\alpha/4}\right),$$
where $I$ is the identity matrix, the optimal solution (10) can be written as
$$\phi = (\Gamma^\intercal)^{-1}\left(\tilde{A}^\intercal\tilde{A} + I\right)^{-1}(\Gamma^{-1})^\intercal A^\intercal b = (\Gamma^\intercal)^{-1}\left(\tilde{A}^\intercal\tilde{A} + I\right)^{-1}\tilde{A}^\intercal b = (\Gamma^\intercal)^{-1}\tilde{\phi},$$
where $\tilde{\phi} = (\tilde{A}^\intercal\tilde{A} + I)^{-1}\tilde{A}^\intercal b$ is the solution of the ridge regression with $\tilde{A}$ and $\tilde{\Gamma}$. In order to obtain an explicit expression for $\tilde{\phi}$, we need the following relation between the ridge regression solution and the SVD.

Lemma 1 If $\tilde{\Gamma} = I$, then the least-squares solution can be obtained via the SVD. Given the singular value decomposition $\tilde{A} = U\Sigma V^\intercal$ with singular values $\sigma_i$, the Tikhonov-regularized solution can be expressed as
$$\tilde{\phi} = VDU^\intercal b,$$
where $D$ has diagonal values $D_{ii} = \frac{\sigma_i}{\sigma_i^2 + 1}$ and is zero elsewhere.

Proof
In fact, $\tilde{\phi} = (\tilde{A}^\intercal\tilde{A} + \tilde{\Gamma}^\intercal\tilde{\Gamma})^{-1}\tilde{A}^\intercal b = V(\Sigma^\intercal\Sigma + I)^{-1}V^\intercal V\Sigma^\intercal U^\intercal b = VDU^\intercal b$, which completes the proof.

Since $\tilde{A}\tilde{A}^\intercal = \frac{1}{\lambda}\sum_{j=1}^{M}\left(1+(j\Delta\xi)^2\right)^{-\alpha/2}$, we have $\tilde{A} = U\Sigma V^\intercal$ with $U = 1$,
$$\Sigma = \frac{1}{\sqrt{\lambda}}\left(\sum_{j=1}^{M}\left(1+(j\Delta\xi)^2\right)^{-\alpha/2}\right)^{1/2} := Z/\sqrt{\lambda},$$
$$V = \left(\left(1+(1\Delta\xi)^2\right)^{-\alpha/4}/Z,\; \left(1+(2\Delta\xi)^2\right)^{-\alpha/4}/Z,\; \ldots,\; \left(1+(M\Delta\xi)^2\right)^{-\alpha/4}/Z\right)^\intercal.$$
Then we get the diagonal value
$$D = \frac{Z/\sqrt{\lambda}}{Z^2/\lambda + 1}.$$
Therefore, by Lemma 1,
$$\tilde{\phi} = VDU^\intercal b = \frac{1/\sqrt{\lambda}}{Z^2/\lambda + 1}\left(\left(1+(1\Delta\xi)^2\right)^{-\alpha/4},\; \ldots,\; \left(1+(M\Delta\xi)^2\right)^{-\alpha/4}\right)^\intercal b.$$
Finally, the original optimal solution is
$$\phi = (\Gamma^\intercal)^{-1}\tilde{\phi} = \frac{1}{(Z^2 + \lambda)\Delta\xi}\left(\left(1+(1\Delta\xi)^2\right)^{-\alpha/2},\; \left(1+(2\Delta\xi)^2\right)^{-\alpha/2},\; \ldots,\; \left(1+(M\Delta\xi)^2\right)^{-\alpha/2}\right)^\intercal,$$
which means
$$\phi_j = \frac{\left(1+(j\Delta\xi)^2\right)^{-\alpha/2}}{(Z^2 + \lambda)\Delta\xi}.$$
To obtain the function in $x$ space, say $h(x)$, we compute
$$h(x) = \frac{1}{Z^2 + \lambda}\sum_{\substack{j=-M\\ j\neq 0}}^{M}\left(1+(j\Delta\xi)^2\right)^{-\alpha/2}\mathrm{e}^{2\pi\mathrm{i}\,j\Delta\xi\,x} = \frac{2}{Z^2 + \lambda}\sum_{j=1}^{M}\left(1+(j\Delta\xi)^2\right)^{-\alpha/2}\cos(2\pi j\Delta\xi\,x). \qquad (11)$$
Fig. 1 shows that for this special case with a large $M$, $h(x)$ is not a trivial function when $\alpha > d$ and degenerates to a trivial function when $\alpha < d$.

Figure 1: Fitting the function $h(x)$ given in equation (11) with different exponents $\alpha$ ($\alpha = 0.5, 1, 10$). Here we take $M = 10$, $\Delta\xi = 0.1$, $\lambda = 1$ and vary $\alpha$, and observe that $h(x)$ is not a trivial function in the $\alpha > d$ case and degenerates to a trivial function in the $\alpha < d$ case.
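The closed-form solution above can be cross-checked against a direct solve of the normal equations (10). The following sketch (with illustrative parameter values) verifies the agreement and evaluates $h(x)$ from equation (11):

```python
import numpy as np

# One data point X = 0 with label Y = 2, discretized as in Example 1.
# Solve phi = (A^T A + Gamma^T Gamma)^{-1} A^T b directly and compare with
# the closed form phi_j = (1 + (j dxi)^2)^{-alpha/2} / ((Z^2 + lam) dxi).
M, dxi, alpha, lam = 10, 0.1, 10.0, 1.0
j = np.arange(1, M + 1)
w = (1 + (j * dxi)**2)**(alpha / 2)     # diagonal weights <xi_j>^alpha

A = np.ones((1, M))                      # constraint: sum_j phi_j dxi = 1
b = np.array([1.0 / dxi])
Gamma = np.diag(np.sqrt(lam * w))        # so ||Gamma phi||^2 = lam sum w_j phi_j^2
phi = np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

Z2 = np.sum(1.0 / w)                     # Z^2 = sum_j <xi_j>^{-alpha}
phi_closed = (1.0 / w) / ((Z2 + lam) * dxi)
print(np.allclose(phi, phi_closed))      # True: the two solutions agree

# Reconstructed output, equation (11); h(0) is close to 2 when Z^2 >> lam.
x = np.linspace(-5, 5, 1001)
h = (2.0 / (Z2 + lam)) * np.sum(
    (1.0 / w)[:, None] * np.cos(2 * np.pi * np.outer(j * dxi, x)), axis=0)
```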
5.2. $n$ Points in $d$ Dimensions

Assume that we have $n$ data points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, where each data point has $d$ components, $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})^\intercal$, and denote the corresponding labels by $(y_1, y_2, \ldots, y_n)^\intercal$. For simplicity, we denote the vector $(j_1, j_2, \cdots, j_d)^\intercal$ by $J_{j_1\ldots j_d}$. Then our problem becomes:

Example 2 (Problem 1 with general discretization)
$$\min_{\phi\in\mathbb{R}^{(2M)^d}} \sum_{j_1,\ldots,j_d=-M}^{M}\left(1+\|J_{j_1\ldots j_d}\|^2\Delta\xi^2\right)^{\alpha/2}|\phi_{j_1\ldots j_d}|^2, \qquad (12)$$
$$\mathrm{s.t.}\quad \sum_{j_1,\ldots,j_d=-M}^{M}\phi_{j_1\ldots j_d}\,\mathrm{e}^{2\pi\mathrm{i}\,\Delta\xi\,J_{j_1\ldots j_d}^\intercal x_k} = y_k, \quad k = 1, 2, \ldots, n. \qquad (13)$$

The calculation for this example can be completed by a method analogous to the one used in subsection 5.1. Let
$$A_j = \left(\mathrm{e}^{2\pi\mathrm{i}\,\Delta\xi\,J_{-M-M\ldots-M}^\intercal x_j},\; \ldots,\; \mathrm{e}^{2\pi\mathrm{i}\,\Delta\xi\,J_{j_1 j_2\ldots j_d}^\intercal x_j},\; \ldots,\; \mathrm{e}^{2\pi\mathrm{i}\,\Delta\xi\,J_{MM\ldots M}^\intercal x_j}\right)^\intercal, \quad j = 1, 2, \ldots, n, \qquad (14)$$
$$A = (A_1, A_2, \ldots, A_n)^\intercal \in \mathbb{R}^{n\times(2M)^d}, \qquad b = (y_1, y_2, \ldots, y_n)^\intercal \in \mathbb{R}^{n\times 1}, \qquad (15)$$
$$\Gamma = \sqrt{\lambda}\,\operatorname{diag}\left(\ldots,\; \left(1+\|J_{j_1 j_2\ldots j_d}\|^2\Delta\xi^2\right)^{\alpha/4},\; \ldots\right) \in \mathbb{R}^{(2M)^d\times(2M)^d}. \qquad (16)$$
We just need to solve the equation
$$\phi = \left(A^\intercal A + \Gamma^\intercal\Gamma\right)^{-1}A^\intercal b. \qquad (17)$$
Then we obtain the output function $h(x)$ by the inverse Fourier transform:
$$h(x) = \sum_{j_1,\ldots,j_d=-M}^{M}\phi_{j_1\ldots j_d}\,\mathrm{e}^{2\pi\mathrm{i}\,\Delta\xi\,J_{j_1\ldots j_d}\cdot x}. \qquad (18)$$
Since the size of the matrix is very large, it is difficult to obtain $\phi$ by an explicit calculation. Thus we choose particular $n$, $d$ and $M$ and show that $h(x)$ is not a trivial solution (i.e., not the zero function).

In our experiments, we set the hyper-parameters $M$, $\alpha$, $\lambda$, $\Delta\xi$ in advance. We set $\lambda = 0.1$, $\Delta\xi = 0.1$ in the 1-dimensional case and $\lambda = 0.1$, $\Delta\xi = 0.1$ in the 2-dimensional case. We select two data points located at $x = -0.5$ and $x = 0.5$ as the given points in the 1-dimensional case, and four points as the given points in the 2-dimensional case, whose second coordinates are all $0.5$ so that the phenomenon is convenient to observe. First, we use formulas (14), (15) and (16) to calculate the matrices $A$, $\Gamma$ and the vector $b$.

Figure 2: Fitting data points in different dimensions with different band limits $M$: (a) 2 points in 1 dimension ($M = 5, 10, 100, 1000$); (b) 20 points in 2 dimensions ($M = 3, 10, 25, 100$). We use a proper $\alpha$ ($\alpha > d$) and observe that even for a large $M$, the function $h(x)$ does not degenerate to a trivial function. Note that the blue curve and the red one overlap with each other. Here a trivial function means a function whose value decays rapidly to zero away from the given training points.

Then from equation (17) we deduce the vector $\phi$. The final output function $h(x)$ is obtained by the inverse discrete Fourier transform (18). In Fig. 2, we set $\alpha = 10$ in both cases to ensure $\alpha > d$ and vary the band limit $M$. We observe that as $M$ increases, the fitting curve converges to a non-trivial curve. In Fig. 3, we set $M = 1000$ in the 1-dimensional case and $M = 100$ in the 2-dimensional case. By varying the exponent $\alpha$, we see that in all cases the fitting curves are non-trivial when $\alpha > d$ but degenerate when $\alpha < d$.
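For completeness, here is a compact sketch of the pipeline (14)–(18); the sample points and labels are illustrative, the frequency indices are taken in $\{-M,\ldots,M-1\}^d$ to match the $(2M)^d$-dimensional setting, and the conjugate transpose replaces $A^\intercal$ since the discrete constraint matrix is complex:

```python
import numpy as np
from itertools import product

# Sketch of the general discretization (Example 2): build the constraint
# matrix A and the diagonal penalty, solve eq. (17), evaluate h via eq. (18).
d, M, dxi, alpha, lam = 1, 10, 0.1, 10.0, 0.1
X = np.array([[-0.5], [0.5]])            # two illustrative sample points
Y = np.array([1.0, 1.0])                 # illustrative labels

# All frequency multi-indices J in {-M, ..., M-1}^d, (2M)^d of them.
J = np.array(list(product(range(-M, M), repeat=d)))
weights = (1 + np.linalg.norm(J * dxi, axis=1)**2)**(alpha / 2)

A = np.exp(2j * np.pi * dxi * (X @ J.T))          # (n, (2M)^d), complex
Gamma2 = np.diag(lam * weights)                    # Gamma^T Gamma
phi = np.linalg.solve(A.conj().T @ A + Gamma2, A.conj().T @ Y)

def h(x):
    # Inverse discrete Fourier transform, eq. (18); real up to round-off
    # because the data and the quadratic penalty are real and symmetric.
    return np.real(np.exp(2j * np.pi * dxi * (x @ J.T)) @ phi)

print(h(X))                              # approximately reproduces Y
```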
6. Conclusion
In this paper, we study the supervised learning problem by proposing a Fourier-domain variational formulation motivated by the frequency principle in deep learning. We establish the sufficient and necessary conditions for the well-posedness of the Fourier-domain variational problem, followed by numerical demonstrations.

Our Fourier-domain variational formulation provides a novel viewpoint for modelling machine learning problems: imposing more constraints, e.g., higher regularity, on the model rather than on the data (which in practice are always isolated points) can yield well-posedness as the dimension of the problem increases. This is different from the modelling in physics and traditional point cloud problems, in which the model is in general independent of the dimension. Our work suggests a potential approach to algorithm design by considering dimension-dependent models for data modelling.

In contrast to the natural sciences, where models are usually derived from fundamental physical laws, in data science the story may be totally different: we may propose mathematical models based on algorithms of great practical success, such as DNNs. Afterwards, the continuum formulation and its mathematical validation can be analyzed. This seems to be a new scientific paradigm, along which our work plays a role as one step.

Figure 3: Fitting data points in different dimensions with different exponents $\alpha$: (a) 2 points in 1 dimension ($\alpha = 0.1, 1, 10$); (b) 20 points in 2 dimensions ($\alpha = 0.1, 2, 10$). We observe that with a proper $M$, the function $h(x)$ is not a trivial function in the $\alpha > d$ case and degenerates to a trivial function in the $\alpha < d$ case.

References
Robert A. Adams and John J. F. Fournier. Sobolev Spaces. Elsevier Science, 2003.

Simon Biland, Vinicius C. Azevedo, Byungsoo Kim, and Barbara Solenthaler. Frequency-aware reconstruction of fluid simulations with generative networks. In Alexander Wilkie and Francesco Banterle, editors, Eurographics 2020 - Short Papers. The Eurographics Association, 2020. ISBN 978-3-03868-101-4. doi: 10.2312/egs.20201019.

Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. arXiv preprint arXiv:2002.02561, 2020.

Wei Cai, Xiaoguang Li, and Lizuo Liu. A phase shift deep neural network for high frequency approximation and wave problems. Accepted by SISC, arXiv:1909.11759, 2019.

Jeff Calder and Dejan Slepčev. Properly-weighted graph Laplacian for semi-supervised learning. Applied Mathematics and Optimization, 2019.

Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2019.

Weinan E, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint, I. Science China Mathematics, pages 1–34, 2020.

Ahmed El Alaoui, Xiang Cheng, Aaditya Ramdas, Martin J. Wainwright, and Michael I. Jordan. Asymptotic behavior of $\ell_p$-based Laplacian regularization in semi-supervised learning. In Conference on Learning Theory, pages 879–906, 2016.

Lawrence C. Evans. Partial differential equations. Mathematical Gazette, 83(496):185, 1999.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

Ameya D. Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8572–8583, 2019.

Xi-An Li, Zhi-Qin John Xu, and Lei Zhang. A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales. Communications in Computational Physics, 28(5):1886–1906, 2020.

Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains. Communications in Computational Physics, 28(5):1970–2001, 2020.

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. Theory of the frequency principle for general deep neural networks. arXiv preprint arXiv:1906.09235, 2019.

Tao Luo, Zheng Ma, Zhi-Qin John Xu, and Yaoyu Zhang. On the exact computation of linear frequency principle dynamics and its generalization. arXiv preprint arXiv:2010.08153, 2020.

B. Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: the limit of infinite unlabelled data. In NIPS 2009, 2009.

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310, 2019.

Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems, pages 4763–4772, 2019.

Zuoqiang Shi, Stanley Osher, and Wei Zhu. Weighted nonlocal Laplacian on interpolation from sparse data. Journal of Scientific Computing, 73(2-3):1–14, 2017.

Andrej Nikolaevich Tikhonov and Vasiliy Yakovlevich Arsenin. Solutions of ill-posed problems. Mathematics of Computation, 32(144):491–491, 1977.

Bo Wang, Wenzhong Zhang, and Wei Cai. Multi-scale deep neural network (MscaleDNN) methods for oscillatory Stokes flows in complex domains. Communications in Computational Physics, 28(5):2139–2157, 2020a.

Feng Wang, Alberto Eljarrat, Johannes Müller, Trond R. Henninen, Rolf Erni, and Christoph T. Koch. Multi-resolution convolutional neural networks for inverse problems. Scientific Reports, 10(1):1–11, 2020b.

Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, pages 264–274, 2019.

Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746–1767, 2020.

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. Explicitizing an implicit bias of the frequency principle in two-layer neural networks. arXiv preprint arXiv:1905.10264, 2019.

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, and Zheng Ma. A type of generalization error induced by initialization in deep neural networks. In Jianfeng Lu and Rachel Ward, editors, Proceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107, pages 144–164, 2020.

Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.
Appendix A. Lemma 2
Lemma 2
Let the function $\psi_\sigma(\xi) = (2\pi)^{d/2}\sigma^d \mathrm{e}^{-2\pi^2\sigma^2\|\xi\|^2}$, $\xi \in \mathbb{R}^d$. We have
$$\lim_{\sigma\to 0}\int_{\mathbb{R}^d}\|\xi\|^{\alpha}|\psi_\sigma(\xi)|^2\,\mathrm{d}\xi =
\begin{cases}
0, & \alpha < d,\\
C_d, & \alpha = d,\\
\infty, & \alpha > d.
\end{cases} \qquad (19)$$
Here the constant $C_d = (d-1)!\,(2\pi)^{-d}\,\frac{\pi^{d/2}}{\Gamma(d/2)}$ depends only on the dimension $d$.

Proof
In fact,
$$\lim_{\sigma\to 0}\int_{\mathbb{R}^d}\|\xi\|^{\alpha}|\psi_\sigma(\xi)|^2\,\mathrm{d}\xi
= \lim_{\sigma\to 0}\int_{\mathbb{R}^d}\|\xi\|^{\alpha}(2\pi)^{d}\sigma^{2d}\mathrm{e}^{-4\pi^2\sigma^2\|\xi\|^2}\,\mathrm{d}\xi
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_{\mathbb{R}^d}\|\sigma\xi\|^{\alpha}\mathrm{e}^{-4\pi^2\|\sigma\xi\|^2}\,\mathrm{d}(\sigma\xi)
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d,$$
where $\omega_d = \frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the surface area of the unit $(d-1)$-sphere. Notice that
$$\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r
= \int_0^{1}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r + \int_1^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r
\leq \int_0^{\infty}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r + \int_0^{\infty}r^{[\alpha]+d}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r
= \frac{1}{4\sqrt{\pi}} + \int_0^{\infty}r^{[\alpha]+d}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r$$
and
$$\int_0^{\infty}r^{[\alpha]+d}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r =
\begin{cases}
\dfrac{1}{2}\left(\dfrac{[\alpha]+d-1}{2}\right)!\,(2\pi)^{-([\alpha]+d+1)}, & [\alpha]+d \text{ odd},\\[8pt]
\sqrt{\pi}\,(2\pi)^{-([\alpha]+d+1)}\left(\dfrac{1}{2}\right)^{\frac{[\alpha]+d}{2}+1}([\alpha]+d-1)!!, & [\alpha]+d \text{ even}.
\end{cases}$$
Therefore, in both cases, the integral $\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r$ is finite. Then we have
$$\lim_{\sigma\to 0}\int_{\mathbb{R}^d}\|\xi\|^{\alpha}|\psi_\sigma(\xi)|^2\,\mathrm{d}\xi
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d
= \begin{cases} 0, & \alpha < d,\\ \infty, & \alpha > d. \end{cases}$$
When $\alpha = d$, it follows that
$$\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r = \int_0^{\infty}r^{2d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r = \frac{1}{2}(d-1)!\,(2\pi)^{-2d}.$$
Therefore
$$\lim_{\sigma\to 0}\int_{\mathbb{R}^d}\|\xi\|^{d}|\psi_\sigma(\xi)|^2\,\mathrm{d}\xi = \frac{1}{2}(d-1)!\,(2\pi)^{-d}\cdot\frac{2\pi^{d/2}}{\Gamma(d/2)} = (d-1)!\,(2\pi)^{-d}\,\frac{\pi^{d/2}}{\Gamma(d/2)},$$
which completes the proof.

Appendix B. Proof of Proposition 1
Proof
Similar to the proof of Lemma 2, we have
$$\lim_{\sigma\to 0}\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}^2
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_{\mathbb{R}^d}\left(\sigma^2+\|\sigma\xi\|^2\right)^{\alpha/2}\mathrm{e}^{-4\pi^2\|\sigma\xi\|^2}\,\mathrm{d}(\sigma\xi)
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_0^{\infty}r^{d-1}\left(\sigma^2+r^2\right)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d.$$
For $\sigma < 1$, the following integrals are bounded from below and above, respectively:
$$\int_0^{\infty}r^{d-1}\left(\sigma^2+r^2\right)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r \geq \int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r = C_1 > 0,$$
and
$$\int_0^{\infty}r^{d-1}\left(\sigma^2+r^2\right)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r
\leq \int_0^{1}r^{d-1}(1+r^2)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r + \int_1^{\infty}r^{d-1}(2r^2)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r
\leq \int_0^{1}r^{d-1}(1+r^2)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r + 2^{\alpha/2}\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r = C_2 < \infty,$$
where $C_1 = \int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r$ and $C_2 = \int_0^{1}r^{d-1}(1+r^2)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r + 2^{\alpha/2}\int_0^{\infty}r^{\alpha+d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r$. Therefore, we obtain the results for the subcritical ($\alpha < d$) and supercritical ($\alpha > d$) cases:
$$\lim_{\sigma\to 0}\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}^2
= \lim_{\sigma\to 0}(2\pi)^{d}\sigma^{d-\alpha}\int_0^{\infty}r^{d-1}\left(\sigma^2+r^2\right)^{\alpha/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d
= \begin{cases} 0, & \alpha < d,\\ \infty, & \alpha > d. \end{cases}$$
For the critical case $\alpha = d$, we have
$$\lim_{\sigma\to 0}\|\mathcal{F}^{-1}[\psi_\sigma]\|_{H^\alpha}^2
= \lim_{\sigma\to 0}(2\pi)^{d}\int_0^{\infty}r^{d-1}\left(\sigma^2+r^2\right)^{d/2}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d
= \lim_{\sigma\to 0}(2\pi)^{d}\int_0^{\infty}r^{2d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d + \lim_{\sigma\to 0}\left[\frac{\alpha}{2}(2\pi)^{d}\sigma^2\int_0^{\infty}r^{2d-3}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d + o(\sigma^2)\right]
= (2\pi)^{d}\int_0^{\infty}r^{2d-1}\mathrm{e}^{-4\pi^2 r^2}\,\mathrm{d}r\cdot\omega_d
= (d-1)!\,(2\pi)^{-d}\,\frac{\pi^{d/2}}{\Gamma(d/2)}.$$
Therefore the proposition holds.
Appendix C. Proof of Theorem 1
Proof
Given $X = (x_1, \ldots, x_n)^\intercal$ and $Y = (y_1, \ldots, y_n)^\intercal$, let $A = \left(\exp\left(-\frac{\|x_j - x_i\|^2}{2\sigma^2}\right)\right)_{n\times n}$ be an $n\times n$ matrix. For sufficiently small $\sigma$, the matrix $A$ is diagonally dominant and hence invertible. So the linear system $Ag^{(\sigma)} = Y$ has a solution $g^{(\sigma)} = \left(g_1^{(\sigma)}, g_2^{(\sigma)}, \cdots, g_n^{(\sigma)}\right)^\intercal$. Let
$$\phi_\sigma(\xi) = \sum_i g_i^{(\sigma)}\,\mathrm{e}^{-2\pi\mathrm{i}\,\xi^\intercal x_i}\,\psi_\sigma(\xi),$$
where $\psi_\sigma(\xi) = (2\pi)^{d/2}\sigma^d\mathrm{e}^{-2\pi^2\sigma^2\|\xi\|^2}$ satisfies $\mathcal{F}^{-1}[\psi_\sigma](x) = \mathrm{e}^{-\|x\|^2/(2\sigma^2)}$. Thus
$$\mathcal{F}^{-1}[\phi_\sigma](x) = \sum_i g_i^{(\sigma)}\,\mathcal{F}^{-1}[\psi_\sigma](x - x_i) = \sum_i g_i^{(\sigma)}\,\mathrm{e}^{-\frac{\|x - x_i\|^2}{2\sigma^2}}.$$
In particular, for all $i = 1, 2, \cdots, n$,
$$\mathcal{F}^{-1}[\phi_\sigma](x_i) = \sum_j g_j^{(\sigma)}\,\mathrm{e}^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} = \left(Ag^{(\sigma)}\right)_i = y_i.$$
Therefore, $\phi_\sigma \in \mathcal{A}_{X,Y}$ for sufficiently small $\sigma > 0$.
According to the above discussion, we can construct a sequence $\{\phi_m\}_{m=M}^{\infty} \subset \mathcal{A}_{X,Y}$, where $M$ is a sufficiently large positive integer making the matrix $A$ invertible. As Proposition 1 shows, $\lim_{m\to+\infty}\|\mathcal{F}^{-1}[\phi_m]\|_{H^\alpha} = 0$. Now suppose that there exists a solution to Problem 1, denoted by $\phi^* \in \mathcal{A}_{X,Y}$. By definition,
$$\|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha} \leq \min_{\phi\in\mathcal{A}_{X,Y}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} \leq \lim_{m\to+\infty}\|\mathcal{F}^{-1}[\phi_m]\|_{H^\alpha} = 0.$$
Therefore $\phi^*(\xi) \equiv 0$ and $P_X\phi^* = 0$, which contradicts the constraint $P_X\phi^* = Y$ in the situation that $Y \neq 0$. The proof is completed.
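The construction used in this proof is easy to reproduce numerically: for small $\sigma$, the Gaussian kernel matrix $A$ is close to the identity, and the interpolant matches the labels exactly. The following sketch uses illustrative random points and labels:

```python
import numpy as np

# Sketch of the construction in the proof of Theorem 1: the Gaussian
# interpolant F^{-1}[phi_sigma](x) = sum_i g_i exp(-||x - x_i||^2 / (2 sigma^2))
# fits the labels exactly, since for small sigma the kernel matrix A is
# diagonally dominant and hence invertible.
rng = np.random.default_rng(0)
n, d, sigma = 5, 2, 0.05
X = rng.uniform(-1, 1, size=(n, d))     # illustrative sample points
Y = rng.standard_normal(n)               # illustrative labels

sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
A = np.exp(-sq_dists / (2 * sigma**2))   # A_ij = exp(-||x_i - x_j||^2/(2 sigma^2))
g = np.linalg.solve(A, Y)                # g^(sigma) = A^{-1} Y

interp = lambda x: np.exp(-np.sum((x - X)**2, axis=-1) / (2 * sigma**2)) @ g
print(np.allclose([interp(x) for x in X], Y))   # True: exact interpolation

# As sigma -> 0, Proposition 1 gives ||F^{-1}[phi_sigma]||_{H^alpha} -> 0 for
# alpha < d, so the minimizing sequence collapses to zero in L^2.
```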
Appendix D. Proof of Theorem 2

Proof
1. We introduce a distance for functions $\phi, \psi \in L^2(\mathbb{R}^d)$:
$$\operatorname{dist}(\phi, \psi) = \|\mathcal{F}^{-1}[\phi] - \mathcal{F}^{-1}[\psi]\|_{H^\alpha}.$$
Under the topology induced by this distance, the closure of the admissible function class $\mathcal{A}_{X,Y}$ reads
$$\overline{\mathcal{A}_{X,Y}} := \overline{\left\{\phi \in L^1(\mathbb{R}^d)\cap L^2(\mathbb{R}^d) \,\middle|\, P_X\phi = Y\right\}}^{\,\operatorname{dist}(\cdot,\cdot)}.$$
2. We consider an auxiliary minimization problem: find $\phi^*$ such that
$$\phi^* \in \arg\min_{\phi\in\overline{\mathcal{A}_{X,Y}}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}. \qquad (20)$$
Let $m := \inf_{\phi\in\overline{\mathcal{A}_{X,Y}}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}$. According to the proofs of Proposition 1 and Theorem 1, for small enough $\sigma > 0$, the inverse Fourier transform of the function
$$\phi_\sigma(\xi) = \sum_i g_i^{(\sigma)}\,\mathrm{e}^{-2\pi\mathrm{i}\,\xi^\intercal x_i}\,\psi_\sigma(\xi)$$
has finite Sobolev norm $\|\mathcal{F}^{-1}[\phi_\sigma]\|_{H^\alpha} < \infty$, where $\psi_\sigma(\xi)$ satisfies $\mathcal{F}^{-1}[\psi_\sigma](x) = \mathrm{e}^{-\|x\|^2/(2\sigma^2)}$, $A = \left(\exp\left(-\frac{\|x_j - x_i\|^2}{2\sigma^2}\right)\right)_{n\times n}$ and $g^{(\sigma)} = \left(g_1^{(\sigma)}, g_2^{(\sigma)}, \cdots, g_n^{(\sigma)}\right)^\intercal = A^{-1}Y$. Thus $m < +\infty$.

3. Choose a minimizing sequence $\{\bar{\phi}_k\}_{k=1}^{\infty} \subset \overline{\mathcal{A}_{X,Y}}$ such that
$$\lim_{k\to\infty}\|\mathcal{F}^{-1}[\bar{\phi}_k]\|_{H^\alpha} = m.$$
By the definition of the closure, for each $k$ there exists a function $\phi_k \in \mathcal{A}_{X,Y}$ such that
$$\|\mathcal{F}^{-1}[\bar{\phi}_k] - \mathcal{F}^{-1}[\phi_k]\|_{H^\alpha} \leq \frac{1}{k}.$$
Therefore $\{\phi_k\}_{k=1}^{\infty} \subset \mathcal{A}_{X,Y}$ is also a minimizing sequence, i.e.,
$$\lim_{k\to\infty}\|\mathcal{F}^{-1}[\phi_k]\|_{H^\alpha} = m.$$
Then $\{\mathcal{F}^{-1}[\phi_k]\}_{k=1}^{\infty}$ is bounded in the Sobolev space $H^\alpha(\mathbb{R}^d)$. Hence there exist a weakly convergent subsequence $\{\mathcal{F}^{-1}[\phi_{n_k}]\}_{k=1}^{\infty}$ and a function $\mathcal{F}^{-1}[\phi^*] \in H^\alpha(\mathbb{R}^d)$ such that $\mathcal{F}^{-1}[\phi_{n_k}] \rightharpoonup \mathcal{F}^{-1}[\phi^*]$ in $H^\alpha(\mathbb{R}^d)$ as $k\to\infty$. Note that
$$m = \inf_{\phi\in\overline{\mathcal{A}_{X,Y}}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} \leq \|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha} \leq \liminf_{k\to\infty}\|\mathcal{F}^{-1}[\phi_{n_k}]\|_{H^\alpha} = m,$$
where we have used the weak lower semi-continuity of the $H^\alpha(\mathbb{R}^d)$ norm. Hence $\|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha} = m$.

4. We further establish the strong convergence $\mathcal{F}^{-1}[\phi_{n_k}] - \mathcal{F}^{-1}[\phi^*] \to 0$ in $H^\alpha(\mathbb{R}^d)$ as $k\to\infty$. In fact, since $\mathcal{F}^{-1}[\phi_{n_k}] \rightharpoonup \mathcal{F}^{-1}[\phi^*]$ in $H^\alpha(\mathbb{R}^d)$ as $k\to\infty$ and $\lim_{k\to\infty}\|\mathcal{F}^{-1}[\phi_{n_k}]\|_{H^\alpha} = m = \|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha}$, we have
$$\lim_{k\to\infty}\|\mathcal{F}^{-1}[\phi_{n_k}] - \mathcal{F}^{-1}[\phi^*]\|_{H^\alpha}^2
= \lim_{k\to\infty}\left(\langle\mathcal{F}^{-1}[\phi_{n_k}], \mathcal{F}^{-1}[\phi_{n_k}]\rangle + \langle\mathcal{F}^{-1}[\phi^*], \mathcal{F}^{-1}[\phi^*]\rangle - \langle\mathcal{F}^{-1}[\phi_{n_k}], \mathcal{F}^{-1}[\phi^*]\rangle - \langle\mathcal{F}^{-1}[\phi^*], \mathcal{F}^{-1}[\phi_{n_k}]\rangle\right)
= m^2 + m^2 - 2\langle\mathcal{F}^{-1}[\phi^*], \mathcal{F}^{-1}[\phi^*]\rangle = 0.$$
Here $\langle\cdot,\cdot\rangle$ is the inner product of the Hilbert space $H^\alpha$.

5. We have $\phi^* \in L^1(\mathbb{R}^d)$ because
$$\int_{\mathbb{R}^d}|\phi^*(\xi)|\,\mathrm{d}\xi = \int_{\mathbb{R}^d}\langle\xi\rangle^{\alpha/2}|\phi^*(\xi)|\,\langle\xi\rangle^{-\alpha/2}\,\mathrm{d}\xi \leq \|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha}\left(\int_{\mathbb{R}^d}\langle\xi\rangle^{-\alpha}\,\mathrm{d}\xi\right)^{1/2} = Cm < +\infty,$$
where $C := \left(\int_{\mathbb{R}^d}\langle\xi\rangle^{-\alpha}\,\mathrm{d}\xi\right)^{1/2} < +\infty$ since $\alpha > d$. Hence $\phi^* \in L^1(\mathbb{R}^d)\cap L^2(\mathbb{R}^d)$ and $P_X\phi^*$ is well-defined.

6. Recall that $P_X\phi_{n_k} = Y$. Componentwise, for each sample point $x$ in $X$ we have
$$|Y - P_X\phi^*| = \lim_{k\to+\infty}|P_X\phi_{n_k} - P_X\phi^*|
= \lim_{k\to+\infty}\left|\int_{\mathbb{R}^d}(\phi_{n_k} - \phi^*)\,\mathrm{e}^{2\pi\mathrm{i}\,x\xi}\,\mathrm{d}\xi\right|
= \lim_{k\to+\infty}\left|\int_{\mathbb{R}^d}\langle\xi\rangle^{\alpha/2}(\phi_{n_k} - \phi^*)\,\langle\xi\rangle^{-\alpha/2}\,\mathrm{e}^{2\pi\mathrm{i}\,x\xi}\,\mathrm{d}\xi\right|
\leq \lim_{k\to+\infty}\|\mathcal{F}^{-1}[\phi_{n_k}] - \mathcal{F}^{-1}[\phi^*]\|_{H^\alpha}\left(\int_{\mathbb{R}^d}\frac{|\mathrm{e}^{2\pi\mathrm{i}\,x\xi}|^2}{\langle\xi\rangle^{\alpha}}\,\mathrm{d}\xi\right)^{1/2}
= C\lim_{k\to+\infty}\|\mathcal{F}^{-1}[\phi_{n_k}] - \mathcal{F}^{-1}[\phi^*]\|_{H^\alpha} = 0.$$
Hence $P_X\phi^* = Y$ and $\phi^* \in \mathcal{A}_{X,Y}$.
7. Note that
$$m = \inf_{\phi\in\overline{\mathcal{A}_{X,Y}}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} \leq \inf_{\phi\in\mathcal{A}_{X,Y}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} \leq \|\mathcal{F}^{-1}[\phi^*]\|_{H^\alpha} = m.$$
This implies that $\inf_{\phi\in\mathcal{A}_{X,Y}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha} = m$ and $\phi^* \in \arg\min_{\phi\in\mathcal{A}_{X,Y}}\|\mathcal{F}^{-1}[\phi]\|_{H^\alpha}$, which completes the proof.