Nonlinear dimension reduction for surrogate modeling using gradient information
Daniele Bigoni∗, Youssef Marzouk∗, Clémentine Prieur†, Olivier Zahm†‡

∗ Center for Computational Science & Engineering, Massachusetts Institute of Technology
† Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
‡ Corresponding author ([email protected])

February 23, 2021
Abstract
We introduce a method for the nonlinear dimension reduction of a high-dimensional function $u : \mathbb{R}^d \to \mathbb{R}$, $d \gg 1$. Our objective is to identify a nonlinear feature map $g : \mathbb{R}^d \to \mathbb{R}^m$, with a prescribed intermediate dimension $m \ll d$, so that $u$ can be well approximated by $f \circ g$ for some profile function $f : \mathbb{R}^m \to \mathbb{R}$. We propose to build the feature map by aligning the Jacobian $\nabla g$ with the gradient $\nabla u$, and we theoretically analyze the properties of the resulting $g$. Once $g$ is built, we construct $f$ by solving a gradient-enhanced least squares problem. Our practical algorithm makes use of a sample $\{x^{(i)}, u(x^{(i)}), \nabla u(x^{(i)})\}_{i=1}^N$ and builds both $g$ and $f$ on adaptive downward-closed polynomial spaces, using cross validation to avoid overfitting. We numerically evaluate the performance of our algorithm across different benchmarks, and explore the impact of the intermediate dimension $m$. We show that building a nonlinear feature map $g$ can permit more accurate approximation of $u$ than a linear $g$, for the same input data set.

Keywords: high-dimensional approximation, nonlinear dimension reduction, feature map, Poincaré inequality, adaptive polynomial approximation.
1 Introduction

Computational models from a wide range of fields, such as physics, biology, and finance, involve large numbers of uncertain input parameters. Quantifying uncertainty is essential to improving the reliability of these models. Most uncertainty quantification analyses, however, require a large number of model evaluations. When a single evaluation is computationally expensive, a common practice is therefore to replace the model with a surrogate—meaning an approximation that can be evaluated cheaply, without further evaluations of the original model. Yet constructing accurate approximations is a challenging task, because many function approximation tools become inexpressive in high dimensions. This is often referred to as the curse of dimensionality. This problem is exacerbated in the small-data regime, i.e., when few model evaluations are available.

This paper addresses the problem of reducing parameter space dimension from the perspective of surrogate modeling. We represent the model by a scalar-valued quantity of interest $u(x)$ which depends on a high-dimensional parameter $x \in \mathbb{R}^d$ with $d \gg 1$.
When the parameter is uncertain, it is denoted by a random vector $X$ whose law models the uncertainty of the parameter. Dimension reduction consists in finding a map $g : \mathbb{R}^d \to \mathbb{R}^m$, with $m \ll d$, that captures the most "relevant" features of the parameters. This feature map permits reduction of the parameter dimension from $d$ to $m$ by replacing $X$ with the $m$-dimensional random vector $Z = g(X)$. From the perspective of surrogate modeling, a good feature map should enable $u(X)$ to be well approximated as $f(Z) = f \circ g(X)$, for some function $f : \mathbb{R}^m \to \mathbb{R}$ of $m$ variables only. If such a feature map $g$ is known in advance, $f$ can be constructed by minimizing the mean squared error, $\mathbb{E}[(u(X) - f \circ g(X))^2]$, over a class of functions of $m \ll d$ variables. This task is, in principle, easier than constructing a $d$-dimensional approximation to $u(X)$ directly.

Linear dimension reduction corresponds to identifying linear feature maps $g$. Many linear dimension reduction strategies have been proposed in different research fields. Global sensitivity analysis [36] identifies a set of $m$ parameters $g(x) = (x_{\sigma_1}, \ldots, x_{\sigma_m})$ that best explain, in some statistical sense, the model output. More generally, ridge functions [34] are functions of the form $x \mapsto f \circ g(x)$ where $g(x) = W^T x$ for some matrix $W \in \mathbb{R}^{d \times m}$. In [9, 16], the model $u$ is assumed to be a ridge function and $W$ is recovered via adaptive model query strategies. Linear dimension reduction also arises in the statistical regression literature under the name sufficient dimension reduction [2, 28], where $W$ is constructed via sliced inverse regression (SIR) [29], sliced average variance estimation (SAVE) [13], and their variants. Closely related to the present work is the active subspace method [12, 11, 23], which identifies $W$ using gradients of the model. The recent papers [45, 32] show that the active subspace method constructs the matrix $W$ by minimizing an upper bound for the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ obtained with the optimal profile function. This result is particularly relevant because it motivates the construction of $g$ from the perspective of approximating $u$ in the least-squares sense. Similar ideas are developed in [46, 14, 7] for the detection of informed subspaces in the context of Bayesian inverse problems.

While linear dimension reduction methods are quite successful in many applications, they can fail to detect certain kinds of low-dimensional structure that a model might have; consider an isotropic $u(x) = h(\|x\|)$, for instance. Nonlinear dimension reduction allows $g$ to detect such nonlinear features, in order to improve the approximation power of the composed approximation $f \circ g$. Nonlinear dimension reduction methods have been developed and analyzed mostly in the community of sufficient dimension reduction; see for instance [43, 44, 27], to cite just a few. In these works, the main idea is to use kernel methods to construct a nonlinear feature map $g(x) = W^T \Phi(x)$, where $\Phi(x) = (\Phi_1(x), \Phi_2(x), \ldots)$ are the eigenfunctions of an ad hoc kernel (typically a squared exponential or a polynomial kernel), and where the matrix $W$ is determined using inverse regression techniques (SIR, SAVE) on the transformed variables $Z = \Phi(X)$. Those methods, however, typically require a large sample size to accurately detect the low-dimensional structure of the model, and thus are not well suited to the small-data regime.
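As a point of reference for the nonlinear construction developed in this paper, the following is a minimal sketch of the gradient-based linear approach recalled above (the active subspace method): the feature map is linear, $g(x) = W^T x$, with $W$ spanning the dominant eigenspace of $\mathbb{E}[\nabla u(X) \nabla u(X)^T]$ estimated from samples. The model used here is a placeholder, not one of the paper's benchmarks.

```python
import numpy as np

# Gradient-based *linear* dimension reduction (active-subspace style), for reference only.
def grad_u(x):
    w = np.linspace(0.1, 1.0, x.size)      # hidden direction of the toy model u(x) = cos(x . w)
    return -np.sin(x @ w) * w

d, N = 20, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))            # samples of X ~ N(0, I_d)

G = np.array([grad_u(x) for x in X])       # gradient samples, shape (N, d)
H_hat = G.T @ G / N                        # Monte Carlo estimate of E[grad u grad u^T]

eigval, eigvec = np.linalg.eigh(H_hat)     # eigenvalues in ascending order
m = 1
W = eigvec[:, -m:]                         # the m dominant eigenvectors
g = lambda x: W.T @ x                      # linear feature map g(x) = W^T x
```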
In the spirit of kernel principal component analysis (KPCA), [25] builds a feature map of the form $g(x) = (\Phi_1(x), \ldots, \Phi_m(x))$ by taking the $m$ first eigenfunctions of a kernel whose hyperparameters (e.g., correlation length, smoothness) are determined by an outer optimization procedure.

1.1 Contribution

The main contribution of this paper is to propose and analyze a nonlinear parameter space dimension reduction method, for the purpose of function approximation, using gradients of the model. We assume here that the implementation of the computational model permits computing the gradient of $x \mapsto u(x)$ with respect to the parameters $x$. Recent advances in computational science permit computing such gradients at a complexity comparable to that of evaluating the model itself, for instance using automatic differentiation [19] and/or adjoint state methods [35]. Having access to gradient evaluations is a valuable workaround in small-data regimes, as $\nabla u(X)$ constitutes additional information for learning the model; see [26]. In this paper we propose to build $g$ by minimizing the loss function
$$ J(g) = \mathbb{E}\Big[ \big\| \nabla u(X) - \Pi_{\operatorname{range}(\nabla g(X)^T)} \nabla u(X) \big\|^2 \Big], $$
where $\Pi_{\operatorname{range}(\nabla g(X)^T)}$ denotes the orthogonal projector onto the range of the Jacobian $\nabla g(X)^T$. Intuitively, minimizing this loss yields a feature map whose Jacobian $\nabla g(X)$ tends to be aligned with the gradient $\nabla u(X)$. Based on the same heuristic, the authors of [47] introduce a different loss function to align $\nabla g(X)$ with $\nabla u(X)$ (see Appendix A for more details), but without proposing a deeper mathematical or computational analysis. In the present paper, we prove that, under some assumptions, the loss $J(g)$ yields an upper bound on the mean squared error that can be obtained after constructing $f$; that is,
$$ \min_{f : \mathbb{R}^m \to \mathbb{R}} \mathbb{E}\big[ (u(X) - f \circ g(X))^2 \big] \le C \, J(g), $$
for some Poincaré-type constant $C$ associated with $X$. We propose a quasi-Newton algorithm to minimize $J(g)$ and show that this algorithm is similar to the power iteration used to compute an eigendecomposition in the active subspace method.

In practice, we make use of a data set $\{x^{(i)}, u(x^{(i)}), \nabla u(x^{(i)})\}_{i=1}^N$ to estimate the loss $J(g)$ and the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$. We assume that the computational cost is dominated by the $N$ evaluations of $u(x^{(i)})$ and $\nabla u(x^{(i)})$, so that the cost of constructing $f$ and $g$ is relatively negligible. Borrowing ideas from [5, 30, 10], we represent both $f$ and $g$ on adaptive downward-closed polynomial spaces which are built using a greedy algorithm. In order to avoid overfitting, a cross validation procedure is used to determine when to stop the adaptive polynomial enrichment. We show that building a nonlinear feature map $g$ permits more accurate approximation of $u$ than a linear $g$, for the same input data set.

We emphasize that our method is a two-step procedure: we first build the feature map $g$ by minimizing $J(g)$, and we then build $f$ by minimizing the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$. Another strategy would consist of minimizing the mean squared error jointly over $f$ and $g$. For instance, in [20] the authors build a linear $g$ and polynomial $f$ by employing dedicated optimization algorithms on Grassmann manifolds, without using gradients of the model. Nonlinear $g$ are also built in [25] by joint minimization over $f$ and $g$.
However, the structure of such optimization problems, and of the algorithms they employ, remains not well understood.

The rest of this paper is organized as follows. In Section 2 we analyze the problem of approximating a function $u$ by a composition $f \circ g$. In particular, we give sufficient conditions on $\nabla g$ and $\nabla u$ so that there exists an $f$ such that $f \circ g = u$. We then introduce the loss $J(g)$ and describe its properties regarding the approximation problem. In Section 3 we present algorithms for constructing $g$ and $f$ on adaptive polynomial spaces. Then, in Section 4, we illustrate the method on numerical examples.

2 Dimension reduction via smooth feature maps
Let $u : \mathcal{X} \to \mathbb{R}$ be a scalar-valued function defined on an open set $\mathcal{X} \subseteq \mathbb{R}^d$ with $d \gg 1$. Our goal is to construct a feature map $g : \mathcal{X} \to \mathbb{R}^m$ with $m \ll d$ such that, given a prescribed tolerance $\varepsilon > 0$, there exists a function $f : \mathbb{R}^m \to \mathbb{R}$ for which
$$ \mathbb{E}\big[ (u(X) - f(g(X)))^2 \big] \le \varepsilon. \qquad (1) $$
Here, $X$ denotes a random vector with probability density function $\pi$ such that $\operatorname{supp}(\pi) = \mathcal{X}$, and $\mathbb{E}[\cdot]$ denotes the mathematical expectation. The function $f$ is called the profile function and $m$ the intermediate dimension. The construction of the profile function is postponed to Section 3.3, and we focus here on how to find a suitable feature map $g$ such that (1) is attainable for some $f$. We note that the $f$ which minimizes the above mean squared error is the conditional expectation $f : z \mapsto \mathbb{E}[u(X) \mid g(X) = z]$. This well-known result will be used later. We now give two trivial solutions to (1) which help to understand the problem:

• With $g = \mathrm{Id}$, the identity function on $\mathcal{X}$, the profile function $f = u$ yields $f \circ g = u$. In this case we have $m = d$.

• With $g = u$, the profile $f = \mathrm{Id}$ also yields $f \circ g = u$, with an intermediate dimension $m = 1$.

Those two trivial solutions are not satisfactory, either because $m = d \gg 1$ or because $g = u$ is intractable. The balance between the intermediate dimension $m$ and the complexity of the feature map $g$ appears as a central question in dimension reduction. Our goal is to construct $g$ in a tractable space $\mathcal{G}_m$ of functions from $\mathcal{X}$ to $\mathbb{R}^m$. For instance, $\mathcal{G}_m$ could be a space of multivariate polynomial functions, a reproducing kernel Hilbert space, etc. We emphasize the necessity of constraining the function $g$ to belong to a space of tractable functions; otherwise problem (1) makes no sense, as it admits a trivial solution with $g = u$. From now on, we assume that $u : \mathcal{X} \to \mathbb{R}$ is continuously differentiable over the open set $\mathcal{X} \subseteq \mathbb{R}^d$ and that all the functions in $\mathcal{G}_m$ are also continuously differentiable.

Assumption 2.1. $u \in C^1(\mathcal{X}; \mathbb{R})$ and $g \in \mathcal{G}_m \subseteq C^1(\mathcal{X}; \mathbb{R}^m)$.

Let us assume for a moment that $u$ is exactly of the form $u = f \circ g$ for some $g : \mathcal{X} \to \mathbb{R}^m$ and $f : \mathbb{R}^m \to \mathbb{R}$. Denoting by $\nabla f(z) \in \mathbb{R}^m$ the gradient of $f$ at point $z \in \mathbb{R}^m$, and by
$$ \nabla g(x) = \begin{pmatrix} \nabla g_1(x)^T \\ \vdots \\ \nabla g_m(x)^T \end{pmatrix} \in \mathbb{R}^{m \times d} $$
the Jacobian of $g$ at point $x \in \mathcal{X}$ (we use the standard convention that each row of the Jacobian matrix is the transpose of the gradient of the corresponding component), the chain rule allows writing $\nabla u(x) = \nabla g(x)^T \nabla f(g(x))$ for any $x \in \mathcal{X}$. In this case, $\nabla u(x)$ lies in the subspace $\operatorname{range}(\nabla g(x)^T)$ for any $x \in \mathcal{X}$. In short, we have
$$ u = f \circ g \;\Longrightarrow\; \nabla u(x) \in \operatorname{range}(\nabla g(x)^T), \quad \forall x \in \mathcal{X}. $$

[Figure 1: the set $\mathcal{X}$ and, in blue, different pre-images of two candidate functions $g$. On the left, the function $g$ satisfies Assumption 2.2; on the right, the level sets of $g$ are not pathwise-connected.]

Conversely, one can ask whether a function $u$ which satisfies $\nabla u(x) \in \operatorname{range}(\nabla g(x)^T)$ for some vector-valued differentiable function $g$ is necessarily of the form $u = f \circ g$ for some $f$. The following proposition gives a positive answer to this question, under additional assumptions on $g$.

Assumption 2.2.
The pre-image under $g$ of any point is smoothly pathwise-connected; that is, for any $z \in \operatorname{Im}(g) \subseteq \mathbb{R}^m$ and for any points $x, y$ in the preimage $g^{-1}(z) = \{ s \in \mathcal{X} : g(s) = z \}$, there exists a continuously differentiable function $\gamma : [0,1] \to g^{-1}(z)$ such that $\gamma(0) = x$ and $\gamma(1) = y$.

Proposition 2.3. Under Assumptions 2.1 and 2.2, if $u : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{X} \to \mathbb{R}^m$ satisfy
$$ \nabla u(x) \in \operatorname{range}(\nabla g(x)^T), \qquad (2) $$
for any $x \in \mathcal{X}$, then $u = f \circ g$ for some function $f : \mathbb{R}^m \to \mathbb{R}$.

Proof.
We first show that relation (2) implies the following property: if g ( x ) = g ( y ) for some x , y ∈ X , then u ( x ) = u ( y ). Thus, let x , y ∈ X be any two points such that g ( x ) = g ( y ). ByAssumption 2.2, the pre-image g − ( z ) , z = g ( x ), is smoothly pathwise-connected so that thereexsits a continuously differentiable path γ : [0 , → X from x = γ (0) to y = γ (1) such that g ( γ ( t )) = z for any t ∈ [0 , ≤ i ≤ m the function g i ◦ γ : [0 , → R is con-stant so that ( g i ◦ γ ) (cid:48) ( t ) = ∇ g i ( γ ( t )) T γ (cid:48) ( t ) = 0 for any t ∈ [0 , γ (cid:48) ( t ) ∈ R d denotes thederivative of γ at point t . This means that, for any t ∈ [0 , γ (cid:48) ( t ) is orthogonal tospan {∇ g ( γ ( t )) , . . . , ∇ g m ( γ ( t )) } = range( ∇ g ( γ ( t )) T ). By (2) we then have( u ◦ γ ) (cid:48) ( t ) = ∇ u ( γ ( t )) T γ (cid:48) ( t ) = 0 , which implies that the continuous function u ◦ γ : [0 , → R is constant. Then u ( x ) = u ( γ (0)) = u ( γ (1)) = u ( y ).Now we build a function f : R m → R such that u = f ◦ g . Such a function needs to be definedonly on the image g ( X ) ⊆ R m and can be set to zero on the complement of g ( X ) in R m . We define f such that for any z ∈ g ( X ), f ( z ) = u ( x ) where x ∈ X is any point such that g ( x ) = z . Evenif this x is not unique, f ( z ) is uniquely defined because u ( x ) = u ( y ) whenever g ( y ) = g ( x ). Byconstruction we have f ( g ( x )) = u ( x ) for any x ∈ X , which concludes the proof.Let us note that Assumption 2.2 is a necessary condition in Proposition 2.3. Indeed, if thepre-images of g are not smoothly pathwise-connected, as in the right plot of Figure 1, one can build function u which satisfies (2) without being of the form f ◦ g . For example, think of a smoothfunction u which is constant on each of the connected parts of g − ( z ) (so that (2) is satisfied) butwhich takes different values on each of those connected parts (so that u (cid:54) = f ◦ g ).Here are some examples where Assumption (2.2) is satisfied. Example 2.4 (Affine feature map) . Any function g ( x ) = A x + b with A ∈ R m × d and b ∈ R m satisfies Assumption 2.2, provided X is a convex set. Indeed, for any z ∈ R m , x , y ∈ g − ( z ), and t ∈ [0 , γ ( t ) := t x + (1 − t ) y belongs to X and it satisfies g ( γ ( t )) = t ( A x + b ) + (1 − t )( A y + b ) = z , which shows that γ is a continuously differentiable path in g − ( z ) from x to y . Example 2.5 (Feature map following from a C -diffeomorphism) . Assume X is convex. One wayto build functions which satisfy Assumption 2.2 is to consider a C -diffeomorphism φ : X → X ,meaning a continuously differentiable invertible function whose inverse is continuously differen-tiable, and to define g ( x ) = ( φ ( x ) , . . . , φ m ( x )) where φ i ( x ) is the i -th component of φ ( x ). Such a g satisfies Assumption 2.2: for any x , y ∈ X such that g ( x ) = g ( y ) = z , the function γ ( t ) = φ − (cid:0) tφ ( y ) + (1 − t ) φ ( x ) (cid:1) , defined for t ∈ [0 ,
1] is a smooth path from $x = \gamma(0)$ to $y = \gamma(1)$, as a composition of smooth functions. It is well defined because $t\phi(y) + (1-t)\phi(x)$ is in $\mathcal{X}$ by convexity. By construction we have $\phi(\gamma(t)) = t\phi(y) + (1-t)\phi(x)$, and the $m$ first components of that relation yield $g(\gamma(t)) = t g(y) + (1-t) g(x) = z$. This shows that $\gamma(t) \in g^{-1}(z)$, so that $g$ satisfies Assumption 2.2.

Example 2.6 (Polynomial feature map). Consider the case where $g$ is a polynomial function on $\mathcal{X} = \mathbb{R}^d$. Assumption 2.2 is satisfied if and only if, for any $z \in g(\mathcal{X})$, the zeros of the polynomial $x \mapsto g(x) - z$ are pathwise-connected. Calculating the number of connected components (i.e., the zeroth Betti number) of an algebraic set like $\{x : g(x) - z = 0\}$ is a difficult question, commonly encountered in algebraic geometry. Unfortunately, there is no easy answer to this question; see [37]. Still, we show later in Section 4 that polynomials work well from a numerical point of view, even though Assumption 2.2 is not checked in practice.

Motivated by Proposition 2.3, we propose to build $g$ by minimizing a cost function which measures how well the gradient $\nabla u(x)$ is aligned with the subspace $\operatorname{range}(\nabla g(x)^T)$. For any $g : \mathcal{X} \to \mathbb{R}^m$ we introduce the cost function
$$ J(g) = \mathbb{E}\Big[ \big\| \nabla u(X) - \Pi_{\operatorname{range}(\nabla g(X)^T)} \nabla u(X) \big\|^2 \Big], \qquad (3) $$
where $\Pi_{\operatorname{range}(\nabla g(X)^T)} \in \mathbb{R}^{d \times d}$ denotes the orthogonal projector onto $\operatorname{range}(\nabla g(X)^T)$ and $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$. Obviously we have $J(g) \ge 0$. The following proposition shows that if $J(g) = 0$ then there exists a profile function $f$ such that $u = f \circ g$.

Proposition 2.7.
Let $u : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{X} \to \mathbb{R}^m$ be continuously differentiable functions such that $J(g) = 0$. If $g$ satisfies Assumption 2.2 and if
$$ \operatorname{rank}(\nabla g(x)^T) = m, \qquad (4) $$
for any $x \in \mathcal{X}$, then there exists a function $f : \mathbb{R}^m \to \mathbb{R}$ such that $u = f \circ g$.

Before we give the proof of Proposition 2.7, let us comment on condition (4). This condition is commonly encountered in implicit function theory. It ensures that, for all $z \in g(\mathcal{X})$, the level set $g^{-1}(z)$ is a smooth manifold of dimension $d - m$; see for instance Theorem 4.3.1 in [22]. One can easily check that (4) is satisfied in the case of affine feature maps $g(x) = Ax + b$ with $\operatorname{rank}(A) = m$, but also in the case of feature maps following from a $C^1$-diffeomorphism; see Example 2.5.

Proof of Proposition 2.7.
Let us assume for a moment that x (cid:55)→ Π range( ∇ g ( x ) T ) is a continuousfunction from X to R d × d . Then x (cid:55)→ (cid:107)∇ u ( x ) − Π range( ∇ g ( x ) T ) ∇ u ( x ) (cid:107) is a continuous function, viaproducts and sums of continuous functions. As J ( g ) = 0, then (cid:13)(cid:13) ∇ u ( x ) − Π range( ∇ g ( x ) T ) ∇ u ( x ) (cid:13)(cid:13) isequal to zero π -almost surely. By continuity, we have that (cid:107)∇ u ( x ) − Π range( ∇ g ( x ) T ) ∇ u ( x ) (cid:107) is equalto zero for all x ∈ supp( π ) = X , so that ∇ u ( x ) ∈ range( ∇ g ( x ) T ) holds for any x ∈ X . Togetherwith Assumption 2.2, Proposition 2.3 ensures the existence of f : R m → R such that u = f ◦ g .It remains to show that x (cid:55)→ Π range( ∇ g ( x ) T ) is continuous. Let M ( x ) = ∇ g ( x ) ∇ g ( x ) T ∈ R m × m .By Assumption (4) M ( x ) is invertible and we can write Π range( ∇ g ( x ) T ) = ∇ g ( x ) T M ( x ) − ∇ g ( x ) forany x ∈ X . For any δ ∈ R d we can write (cid:107) M ( x ) − − M ( x + δ ) − (cid:107) sp ≤ (cid:107) M ( x + δ ) − (cid:107) sp (cid:107) M ( x + δ ) M ( x ) − − I d (cid:107) sp = λ min ( M ( x + δ )) − (cid:107) M ( x + δ ) M ( x ) − − I d (cid:107) sp , where (cid:107) · (cid:107) sp denotes the spectral norm and where λ min ( M ( x + δ )) denotes the smallest eigenvalueof M ( x + δ ). Because the eigenvalues are continuous with respect to the matrix entries (see [38])and by Assumption (4), we have λ min ( M ( x + δ )) → λ min ( M ( x )) > δ →
0. Therefore we have $\lambda_{\min}(M(x+\delta))^{-1} \, \| M(x+\delta) M(x)^{-1} - I_d \|_{\mathrm{sp}} \to \lambda_{\min}(M(x))^{-1} \, \| I_d - I_d \|_{\mathrm{sp}} = 0$. This shows the continuity of $x \mapsto M(x)^{-1}$ and therefore the continuity of $x \mapsto \Pi_{\operatorname{range}(\nabla g(x)^T)} = \nabla g(x)^T M(x)^{-1} \nabla g(x)$. This concludes the proof.

Next we consider the minimization problem
$$ \min_{g \in \mathcal{G}_m} J(g), \qquad (5) $$
where $\mathcal{G}_m \subseteq C^1(\mathcal{X}; \mathbb{R}^m)$ is a set of tractable functions. In general, given some choice of $\mathcal{G}_m$, the minimum of the cost function will not be exactly zero, and thus an assumption of Proposition 2.7 will not hold. Using arguments based on Poincaré inequalities, Proposition 2.9 below shows that, under specific assumptions, there exists at least one function $f : \mathbb{R}^m \to \mathbb{R}$ such that $\mathbb{E}[(u(X) - f(g(X)))^2]$ is of the same order of magnitude as $J(g)$. In other words, we will be able to control the $L^2$ error in an approximation of $u$ by making $J(g)$ small. Let us first introduce the Poincaré inequality associated with a random variable.

Definition 2.8 (Poincaré inequality). Given a continuous random variable $X_{\mathcal{M}}$ taking values in a smooth manifold $\mathcal{M}$, the Poincaré constant $C(X_{\mathcal{M}}) \in [0, +\infty]$ is defined as the smallest constant such that
$$ \mathbb{E}\Big[ \big( h(X_{\mathcal{M}}) - \mathbb{E}[h(X_{\mathcal{M}})] \big)^2 \Big] \le C(X_{\mathcal{M}}) \, \mathbb{E}\Big[ \big\| \nabla h(X_{\mathcal{M}}) \big\|^2 \Big] \qquad (6) $$
holds for any continuously differentiable function $h : \mathcal{M} \to \mathbb{R}$. Here, the gradient $\nabla h(z)$ is a vector in $T_z(\mathcal{M})$, the tangent space of $\mathcal{M}$ at point $z \in \mathcal{M}$. We say that $X_{\mathcal{M}}$ satisfies the Poincaré inequality (6) if $C(X_{\mathcal{M}}) < +\infty$.

We refer to [4] for a simple proof of the Poincaré inequality for a large class of probability measures.

Proposition 2.9. Assume that the set of functions $\mathcal{G}_m \subseteq C^1(\mathcal{X}; \mathbb{R}^m)$ is such that $\operatorname{rank}(\nabla g(x)^T) = m$ for all $g \in \mathcal{G}_m$ and all $x \in \mathcal{X}$. Furthermore, assume that $\mathcal{G}_m$ satisfies
$$ C(X \,|\, \mathcal{G}_m) := \sup_{g \in \mathcal{G}_m} \sup_{z \in g(\mathcal{X})} C\big( X \,\big|\, g(X) = z \big) < \infty, \qquad (7) $$
where $X \,|\, g(X) = z$ denotes the random variable obtained by conditioning $X$ on the event $g(X) = z$. Then, for any $g \in \mathcal{G}_m$, there exists a measurable $f : \mathbb{R}^m \to \mathbb{R}$ such that
$$ \mathbb{E}\Big[ \big( u(X) - f(g(X)) \big)^2 \Big] \le C(X \,|\, \mathcal{G}_m) \, J(g), \qquad (8) $$
where $J(g)$ is defined as in (3).

Proof of Proposition 2.9.
Let g ∈ G m . Because rank( ∇ g ( x ) T ) = m for any x ∈ X , the level set M = g − ( z ) for some z ∈ g ( X ) is a smooth manifold of dimension d − m ; see Theorem 4.3.1 in[22]. Let u M : M → R be the restriction of u to M . Together with (7), the Poincar´e inequality(6) with h = u M and X M = ( X | g ( X ) = z ) permits writing E [( u ( X M ) − E [ u ( X M )]) ] = E [( u M ( X M ) − E [ u M ( X M )]) ] (6)&(7) ≤ C ( X |G m ) E [ (cid:107)∇ u M ( X M ) (cid:107) ] . (9)Because M is a smooth manifold embedded in R d , the gradient ∇ u M can be expressed by meansof the gradient ∇ u as follows ∇ u M ( x ) = Π T x ( M ) ∇ u ( x ) (10)for all x ∈ M , where Π T x ( M ) ∈ R d × d is the orthogonal projector onto T x ( M ), the tangent spaceof M at x . Since M is a level set of g , we have T x ( M ) = ker( ∇ g ( x )) = (range( ∇ g ( x ) T )) ⊥ (see forinstance [1, Section 3.5.7]) so thatΠ T x ( M ) = Π ker( ∇ g ( x )) = I d − Π range( ∇ g ( x ) T ) . (11)Combining (9) with (10) and (11) we obtain E [( u ( X M ) − E [ u ( X M )]) ] ≤ C ( X |G m ) E [ (cid:107) ( I d − Π range( ∇ g ( X M ) T ) ) ∇ u ( X M ) (cid:107) ] . (12)Now, because X M is the conditional random variable X | g ( X ) = z , we can interpret any ex-pectation E [ φ ( X M )] as a conditional expectation E [ φ ( X ) | g ( X ) = z ] for any integrable function φ : X → R . This manipulation permits rewriting the inequality (12) as E [( u ( X ) − E [ u ( X ) | g ( X )]) | g ( X ) = z ] ≤ C ( X |G m ) E (cid:104) (cid:107) ( I d − Π range( ∇ g ( X ) T ) ) ∇ u ( X ) (cid:107) (cid:12)(cid:12)(cid:12) g ( X ) = z (cid:105) Replacing z by the random variable Z = g ( X ) and taking the expectation on both sides, we obtain E (cid:2) ( u ( X ) − E [ u ( X ) | g ( X )]) (cid:3) ≤ C ( X |G m ) E (cid:104) (cid:107) ( I d − Π range( ∇ g ( X ) T ) ) ∇ u ( X ) (cid:107) (cid:105) . Finally we define the measurable function f : R m → R such that f ( z ) = E [ u ( X ) | g ( X ) = z ] for any z ∈ R m . We can write E [ u ( X ) | g ( X )] = f ( g ( X )) which yields (8) and concludes the proof. roposition 2.9 ensures that, for any g ∈ G m , there exists a function f : R m → R such that themean squared error between u and f ◦ g is bounded by C ( X |G m ) J ( g ). This remarkable propertyjustifies the use of the cost function J for the construction of g . Remark . When X ∼ N (0 , I d ) is a stan-dard Gaussian random vector and when G m = { x (cid:55)→ U x : U ∈ R m × d , U U T = I m } contains linearfeatures, the constant C ( X |G m ) is equal to 1. Indeed, the level sets g − ( z ) are affine subspacesand any conditional random variable of the form X | g ( X ) = z is Gaussian with identity covariance.Theorem 3.20 in [6] ensures that C ( X | g ( X ) = z ) = 1 for any g ∈ G m and z ∈ g ( X ), which yields C ( X |G m ) = 1.We conclude this section with an important property of J . Consider a C -diffeomorphism φ : R m → R m . Since ∇ φ ( x ) ∈ R m × m is invertible for all x ∈ X , it holds that range( ∇ φ ◦ g ( X ) T ) =range( ∇ g ( X ) T ∇ φ ( g ( X )) T ) = range( ∇ g ( X ) T ). Thus we have J ( φ ◦ g ) = J ( g ) . (13)This invariance reflects the following property of our initial dimension reduction problem (1): anycomposed function f ◦ g can be written as the composition of f ◦ φ − with φ ◦ g so that the featuremaps g and φ ◦ g are equivalent with regard to the problem (1). The invariance (13) offers thepossibility to arbitrarily impose the probability law of g ( X ). 
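To make the preceding discussion concrete, the following minimal sketch (a toy illustration, not the paper's implementation) estimates the loss $J(g)$ of (3) by Monte Carlo, building the orthogonal projector from a QR factorization of $\nabla g(x)^T$, and checks numerically that composing $g$ with an invertible map $\phi$ leaves the loss unchanged, as stated in (13). The toy model and feature map are assumptions of this sketch only.

```python
import numpy as np

# Monte Carlo estimate of the loss J(g) in (3) and a check of the invariance (13).
def proj_range(Jg):
    """Orthogonal projector onto range(Jg^T), where Jg is the m-by-d Jacobian of g."""
    Q, _ = np.linalg.qr(Jg.T)              # orthonormal basis of range(Jg^T)
    return Q @ Q.T

def loss_J(grads_u, jacs_g):
    return np.mean([np.sum((gu - proj_range(Jg) @ gu) ** 2)
                    for gu, Jg in zip(grads_u, jacs_g)])

d, N = 20, 200
rng = np.random.default_rng(1)
a, b = rng.standard_normal(d), rng.standard_normal(d)
X = rng.standard_normal((N, d))

# toy model u(x) = sin(a.x) + (b.x)^2, so grad u(x) = cos(a.x) a + 2 (b.x) b lies in span{a, b}
grads_u = [np.cos(a @ x) * a + 2 * (b @ x) * b for x in X]
jacs_g = [np.vstack([a, b]) for _ in X]    # feature map g(x) = (a.x, b.x), constant Jacobian

A = np.array([[2.0, 1.0], [0.0, 3.0]])     # invertible matrix defining phi(z) = A z
print(loss_J(grads_u, jacs_g))                        # ~ 0 up to round-off
print(loss_J(grads_u, [A @ Jg for Jg in jacs_g]))     # same value, illustrating (13)
```

The QR-based projector used here is numerically equivalent to the formula $\nabla g(x)^T (\nabla g(x) \nabla g(x)^T)^{-1} \nabla g(x)$ appearing in the text, provided $\nabla g(x)$ has full rank.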
Indeed, under natural assumptions on $g$, there exists a $C^1$-diffeomorphism $\phi = \phi_g$ depending on $g$ so that $\phi_g \circ g(X)$ follows, for instance, the standard normal distribution $\mathcal{N}(0, I_m)$; see [41]. Replacing $g$ by $\bar{g} = \phi_g \circ g$ yields the same value of $J(\bar{g}) = J(g)$ with $\bar{g}(X) \sim \mathcal{N}(0, I_m)$. However, constructing $\phi_g$ can be numerically expensive in practice. A more pragmatic way to exploit (13) is simply to consider the affine transformation $\phi_g(z) = \operatorname{Cov}(g(X))^{-1/2}(z - \mathbb{E}[g(X)])$, which ensures that $\phi_g \circ g(X)$ is centered with identity covariance. This affine map is readily computable and allows one to normalize the feature map $g$. In the following, we will consider the constrained minimization problem
$$ \min_{\substack{g \in \mathcal{G}_m \\ \mathbb{E}[g(X)] = 0 \\ \operatorname{Cov}(g(X)) = I_m}} J(g). \qquad (14) $$
The constraints $\mathbb{E}[g(X)] = 0$ and $\operatorname{Cov}(g(X)) = I_m$ will be useful to stabilize the minimization algorithms, as described in the next section.

Based on the previous section, an approximation $f \circ g$ of $u$ can be obtained by first minimizing $J(g)$ over some prescribed feature map space $\mathcal{G}_m$, and then by minimizing the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ over $f \in \mathcal{F}_m$. In this section we propose adaptive algorithms to construct a feature map space $\mathcal{G}_m$ of the form
$$ \mathcal{G}_m = \Big\{ g : x \mapsto \big( g_1(x), \ldots, g_m(x) \big)^T \;\text{ where }\; g_i \in \operatorname{span}\{\Phi_1, \ldots, \Phi_K\} \Big\} \qquad (15) $$
and a profile function space $\mathcal{F}_m$ of the form
$$ \mathcal{F}_m = \operatorname{span}\{\Psi_1, \ldots, \Psi_P\}, \qquad (16) $$
(18)Assuming the Jacobian ∇ g ( X ) = G T ∇ Φ( X ) has rank m almost surely, the orthogonal projectorΠ range( ∇ g ( X ) T ) can be expressed asΠ range( ∇ g ( X ) T ) = ∇ g ( X ) T (cid:0) ∇ g ( X ) ∇ g ( X ) T (cid:1) − ∇ g ( X ) , and the cost function J ( g ) becomes J ( g ) = E (cid:104)(cid:13)(cid:13) ∇ u ( X ) − Π range( ∇ g ( X ) T ) ∇ u ( X ) (cid:13)(cid:13) (cid:105) = E (cid:2) (cid:107)∇ u ( X ) (cid:107) (cid:3) − E (cid:104)(cid:13)(cid:13) Π range( ∇ g ( X ) T ) ∇ u ( X ) (cid:13)(cid:13) (cid:105) = E (cid:2) (cid:107)∇ u ( X ) (cid:107) (cid:3) − E (cid:104) ∇ u ( X ) T ∇ g ( X ) T (cid:0) ∇ g ( X ) ∇ g ( X ) T (cid:1) − ∇ g ( X ) ∇ u ( X ) (cid:105) = E (cid:2) (cid:107)∇ u ( X ) (cid:107) (cid:3) − E (cid:104) trace (cid:0) G T A ( X ) G (cid:1)(cid:0) G T B ( X ) G (cid:1) − (cid:105) . Here, A ( X ) ∈ R K × K and B ( X ) ∈ R K × K are two symmetric positive semidefinite matrices givenby A ( X ) = ∇ Φ( X ) ∇ u ( X ) ∇ u ( X ) T ∇ Φ( X ) T ,B ( X ) = ∇ Φ( X ) ∇ Φ( X ) T . Minimizing g (cid:55)→ J ( g ) over G m is the same as maximizing R ( G ) = E (cid:104) trace (cid:16)(cid:0) G T A ( X ) G (cid:1)(cid:0) G T B ( X ) G (cid:1) − (cid:17)(cid:105) , (19) ver G ∈ R K × m . Similarily, minimizing g (cid:55)→ (cid:98) J ( g ) over G m is the same as maximizing (cid:98) R ( G ) = 1 N N (cid:88) i =1 trace (cid:16)(cid:0) G T A ( X ( i ) ) G (cid:1)(cid:0) G T B ( X ( i ) ) G (cid:1) − (cid:17) , (20)over G ∈ R K × m . The quantity R ( G ) corresponds to the expectation of the generalized Rayleighquotient associated with the matrix pair ( A ( X ) , B ( X )), and (cid:98) R ( G ) to its Monte Carlo estimate. Itis easier to recognize the generalized Rayleigh quotient when m = 1, since G ∈ R K becomes a vectorso that R ( G ) = E [ G T A ( X ) GG T B ( X ) G ] and (cid:98) R ( G ) = N (cid:80) Ni =1 G T A ( x ( i ) ) GG T B ( x ( i ) ) G . Generalized Rayleigh quotients areubiquitous in dimension reduction; see [21]. However, the expectations or sums of generalizedRayleigh quotients as in (19) and (20) are not common and appear to be much more difficult tomaximize. As shown in [42, 48, 49], maximizing the sum of two generalized Rayleigh quotientsis already a difficult task, which requires dedicated algorithms. In the particular case where thefeature map is linear, however, maximizing R ( G ) can be done analytically, as shown by the nextremark. Remark . The space of linear feature maps G m = { x (cid:55)→ G T x : G ∈ R d × m } corresponds to (15) with Φ( x ) = x , the identity map. In thiscase ∇ Φ( x ) = I d is independent of x so that A ( X ) = ∇ u ( X ) ∇ u ( X ) T and B ( X ) = I d . Theexpected generalized Rayleigh quotient (19) becomes the standard (matrix) Rayleigh quotient R ( G ) = trace(( G T HG )( G T G ) − ) where H = E [ ∇ u ( X ) ∇ u ( X ) T ] . The maximum of G (cid:55)→ R ( G ) is known to be attained by any matrix G ∈ R K × m whose columnsspan the m -dimensional dominant eigenspace of H . This subspace is sometimes called the activesubspace; see [11, 12, 45]. When considering the sample approximation (cid:98) R ( G ) in (20), the matrix H is simply replaced by its approximation (cid:98) H = N (cid:80) Ni =1 ∇ u ( x ( i ) ) ∇ u ( x ( i ) ) T . The accuracy of theactive subspace recovery from (cid:98) H depends on the sample size N , on the active subspace dimension m , and on the spectrum of H ; see [23] for more details.So far we have seen that, provided the basis Φ( x ) = (Φ ( x ) , . . . 
, Φ K ( x )) satisfies E [Φ( X )] = 0,the minimization problem (14) can be rewritten asmin g ∈G m E [ g ( X )]=0Cov( g ( X ))= I d J ( g ) g ( x )= G T Φ( x ) ⇐⇒ max G ∈ R K × m G T Cov(Φ( X )) G = I d R ( G ) . (21)Next we propose a quasi-Newton method to solve this problem. The following proposition givesthe expression for the gradient of G (cid:55)→ R ( G ). The proof is given in Appendix B. Proposition 3.2.
Let A ( X ) , B ( X ) ∈ R K × K be two random symmetric positive semidefinite ma-trices. Assume that for a given G ∈ R K × m , there exists ε > G + δG ) T B ( X )( G + δG )is almost surely invertible for any (cid:107) δG (cid:107) ≤ ε . Then R ( · ) defined by (19) is differentiable at G andits gradient ∇R ( G ) ∈ R K × m is such that ( ∇R ( G )) ij = ∂ R ( G ) ∂G ij can be written as ∇R ( G ) = 2 (cid:0)(cid:0) H ( G ) − Σ( G ) (cid:1) G vec (cid:1) mat , (22)where H ( G ) and Σ( G ) are two symmetric positive semidefinite matrices in R ( Km ) × ( Km ) given by H ( G ) = E (cid:2)(cid:0) ( G T B ( X ) G ) − (cid:1) ⊗ A ( X ) (cid:3) (23)Σ( G ) = E (cid:2)(cid:0) ( G T B ( X ) G ) − G T A ( X ) G ( G T B ( X ) G ) − (cid:1) ⊗ B ( X ) (cid:3) . (24) ere, the notation ( · ) vec denotes the vectorization of a matrix, such that G vec ∈ R Km is the verticalconcatenation of the columns of G ∈ R K × m . The matricization ( · ) mat is the reverse operation, suchthat ( G vec ) mat = G . The notation ⊗ denotes the Kronecker product.Starting at an initial guess G (0) ∈ R K × m , a quasi-Newton method for maximizing G (cid:55)→ R ( G ) isan iterative procedure G ( k +1) = G ( k ) − ( H ( k ) ) − ∇R ( G ( k ) ) where H ( k ) : R K × m → R K × m is an ap-proximation to the Hessian of R ( · ) at point G ( k ) ; see [15]. Because our goal is to maximize R ( · ), theoperator H ( k ) should be chosen symmetric negative definite. We propose to use H ( k ) = − G ( k ) ).This matrix naturally appears in the expression of the Hessian ∇ R ( G ( k ) ) when differentiating therelation (22). Assuming Σ( G ( k ) ) is invertible (we observe in practice that it is non-singular) thequasi-Newton iteration in vectorized form is G ( k +1)vec = G ( k )vec − (cid:16)(cid:0) H ( k ) (cid:1) − ∇R ( G ( k ) ) (cid:17) vec(22) = G ( k )vec − (cid:16) − G ( k ) ) (cid:17) − (cid:16) H ( G ( k ) ) − G ( k ) ) (cid:17) G ( k )vec = Σ( G ( k ) ) − H ( G ( k ) ) G ( k )vec . (25)To account for the constraint G T Cov(Φ( X )) G = I m in (21), notice that, by the definition (19)of R ( · ), we have R ( GM ) = R ( G ) for any invertible matrix M ∈ R m × m . By letting M =( G T Cov(Φ( X )) G ) − / , the matrix (cid:101) G = GM satisfies the constraint (cid:101) G T Cov(Φ( X )) (cid:101) G = I m andyields the same Rayleigh quotient R ( (cid:101) G ) = R ( G ). Following this reasoning, we modify the itera-tions (25) by adding a normalization step: G ( k +1 / = (cid:16) Σ( G ( k ) ) − H ( G ( k ) ) G ( k )vec (cid:17) mat , (26) G ( k +1) = G ( k +1 / (cid:16) G ( k +1 / T Cov(Φ( X )) G ( k +1 / (cid:17) − / . (27)Interestingly, this quasi-Newton procedure is very similar to a power iteration for solving eigenvalueproblems; see the next remark. Remark . Let us continue Remark 3.1, where G m isthe space of linear feature maps. Recall that Φ( x ) = x , A ( X ) = ∇ u ( X ) ∇ u ( X ) T , B ( X ) = I d , andassume for simplicity that Cov(Φ( X )) = I d . Given an iterate G ( k ) such that G ( k ) G ( k ) T = I d , thematrices H ( G ( k ) ) and Σ( G ( k ) ) introduced in (23) and (24) become H ( G ( k ) ) = I d ⊗ H and Σ( G ( k ) ) =( G ( k ) T HG ( k ) ) ⊗ I d , where H = E [ ∇ u ( X ) ∇ u ( X ) T ]. Using the relation (( S ⊗ S ) G vec ) mat = S GS for any symmetric matrices S , S , the quasi-Newton iteration (26) becomes G ( k +1 / = (cid:18)(cid:18)(cid:16) G ( k ) T HG ( k ) (cid:17) − ⊗ H (cid:19) G ( k )vec (cid:19) mat = HG k (cid:16) G ( k ) T HG ( k ) (cid:17) − . 
(28)Thus, the relationrange( G ( k +1) ) (27) = range( G ( k +1 / ) (28) = range( HG ( k ) ) = range( H k +1 G (0) )holds and shows that the quasi-Newton iteration (26) with the normalization step (27) is preciselya power iteration method which aims to compute the m -dimensional dominant eigenspace of thematrix H . n practice, the quasi-Newton method (26) and (27) can be used to maximize (cid:98) R ( G ) (20) byreplacing H ( G ) and Σ( G ) with their sample approximations: (cid:98) H ( G ) = 1 N N (cid:88) i =1 (cid:16) ( G T B ( x ( i ) ) G ) − (cid:17) ⊗ A ( x ( i ) ) (cid:98) Σ( G ) = 1 N N (cid:88) i =1 (cid:16) ( G T B ( x ( i ) ) G ) − G T A ( x ( i ) ) G ( G T B ( x ( i ) ) G ) − (cid:17) ⊗ B ( x ( i ) ) . The procedure is summarized in Algorithm 1. In the next section, we propose a relevant choice forthe initialization G of Algorithm 1. We emphasize that assembling these Km -by- Km matriceswould require the storage of K m scalars, which is obviously not affordable when K (and m ) arelarge. In practice, we never assemble these matrices explicitly. Using the formulas (cid:98) H ( G ) x = (cid:32) N N (cid:88) i =1 A ( x ( i ) ) x mat ( G T B ( x ( i ) ) G ) − (cid:33) vec (29) (cid:98) Σ( G ) x = (cid:32) N N (cid:88) i =1 B ( x ( i ) ) x mat ( G T B ( x ( i ) ) G ) − G T A ( x ( i ) ) G ( G T B ( x ( i ) ) G ) − (cid:33) vec , (30)the matrix-vector products x (cid:55)→ H ( G ) x and x (cid:55)→ Σ( G ) x are computationally tractable. In thissense, the matrices H ( G ) and Σ( G ) are implicit matrices. For the calculation of x (cid:55)→ (cid:98) Σ( G ) − x , asrequired in (26), iterative solvers are well suited because they rely only on matrix-vector products;see [17]. Here we use a conjugate gradient solver preconditioned with the diagonal matrix containingthe diagonal of (cid:98) Σ( G ). Algorithm 1:
Quasi-Newton method to maximize G (cid:55)→ (cid:98) R ( G ). Require:
Computing the matrix-vector products x (cid:55)→ (cid:98) H ( G ) x and x (cid:55)→ (cid:98) Σ( G ) x as in(29) and (30). Data:
Training sample
Input:
Feature map space G m , initial guess G (0) ∈ R K × m , tolerance ε >
0, maxiteration K max Initialize k = 0 and stepsize = ε + 1 while k < K max and stepsize ≥ ε do Compute b = (cid:98) H ( G ( k ) ) G ( k )vec ∈ R Km Solve (cid:98) Σ( G ( k ) ) x = b using preconditioned conjugate gradientMatricize x mat = ( x ) mat ∈ R K × m and update G ( k +1 / = G ( k ) − x mat Normalize G ( k +1) = G ( k +1 / M − / with M = G ( k +1 / T Cov(Φ( X )) G ( k +1 / ∈ R m × m Update k ← k + 1 and stepsize ← (cid:107) x (cid:107) endOutput: final iterate G ( k ) .2 Adaptive polynomial feature map space In the previous section we proposed an algorithm for minimizing g (cid:55)→ (cid:98) J ( g ) over a given feature mapspace G m , as in (15). In this section, we borrow ideas from [5, 30, 10] to construct G m adaptivelyusing multivariate polynomials.We assume that the probability density function π of X is a product density π ( x ) = π ( x ) . . . π d ( x d ).For any 1 ≤ ν ≤ d we denote by { Φ ν , Φ ν , . . . } an orthonormal polynomial basis, with the degree ofΦ νi equal to i , such that (cid:90) Φ νi ( x )Φ νj ( x ) π ν ( x )d x = δ ij , holds for any i, j ≥
0. For any multi-index α = ( α , . . . , α d ) ∈ N d , we define the multivariatepolynomial Φ α as Φ α ( x ) = d (cid:89) ν =1 Φ να ν ( x ν ) , and, for a given multi-index set Λ K ⊆ N d of cardinality K = K , we introduce G Λ K m = x (cid:55)→ g ( x )... g m ( x ) , g i ∈ span { Φ α ; α ∈ Λ K } . (31)This feature map space parametrized by Λ K is, up to a change of notation, of the form of G m in(15). The optimal multi-index set Λ K is that which minimizes the minimum of J ( g ) over g ∈ G Λ K m ,meaning arg min Λ K ⊆ N d K = K min g ∈G Λ Km (cid:98) J ( g ) . (32)This best K -term approximation problem is combinatorial and not tractable in practice. Wepropose a suboptimal solution to (32) using a greedy procedure of the formΛ K +1 = Λ K ∪ { α K +1 } , where α K +1 ∈ N d is a multi-index to determine. Suppose we are given Λ K and that the corre-sponding optimal feature map g Λ K ∈ argmin g ∈G Λ Km (cid:98) J ( g )has been computed (for instance using Algorithm 1). The optimal multi-index α K +1 to add wouldbe the one which minimizes α (cid:55)→ (cid:98) J ( g Λ K ∪{ α } ). This would require the computation of g Λ K ∪{ α } formany α ∈ N d , which is not affordable in practice. Instead we choose the multi-index α K +1 as theone which yields the steepest gradient of the function v (cid:55)→ (cid:98) J ( g Λ K + v Φ α ) around v = 0, meaning α K +1 ∈ arg max α ∈ N d (cid:13)(cid:13)(cid:13) ∇ v (cid:98) J ( g Λ K + v Φ α ) (cid:12)(cid:12) v =0 (cid:13)(cid:13)(cid:13) . (33)The rationale behind (33) is to select the polynomial Φ α which, once added to the feature map space G m , yields the best immediate improvement of (cid:98) J ( · ) when moving away from g Λ K in the directionΦ α .Maximization over the entire N d as in (33) is not feasible in practice. A standard workaroundis to search for the maximum over an arbitrary subset of N d with finite cardinality. The subset (a) Λ K , M (Λ K ) and α K +1 . (b) Λ K +1 and M (Λ K +1 ). Figure 2: Greedy construction of the downward closed set Λ K ⊆ N d with d = 2. Adding α K +1 (the cross on the left) to Λ K (gray boxes on the left) yields Λ K +1 and the new reducedmargin M (Λ K +1 ) (right plot). { α ∈ N d , (cid:80) di =1 α i ≤ p } is commonly used, as it corresponds to the polynomials Φ α with total degreebounded by p . However the cardinality of this subset is (cid:0) d + pd (cid:1) = ( d + p )! p ! d ! which can still be very large.Borrowing ideas from [30, 31], we propose an alternative strategy which relies on the notion ofdownward-closed sets; see [8, 10]. We assume that the set Λ K is downward-closed, meaning that α ∈ Λ K and α (cid:48) ≤ α ⇒ α (cid:48) ∈ Λ K , (34)where α (cid:48) ≤ α means α (cid:48) i ≤ α i for all 1 ≤ i ≤ d . Intuitively, (34) means that Λ K has a pyramidalshape that contains no hole. We denote by M (Λ K ) the reduced margin of Λ K , defined by M (Λ K ) = { α ∈ N d \ Λ K such that α − e i ∈ Λ K for all 1 ≤ i ≤ d with α i (cid:54) = 0 } where e i denotes the i -th canonical vector of N d . By construction, any set of the form Λ K ∪ { α } with α ∈ M (Λ K ) remains downward closed, which is the fundamental property of the reducedmargin. By searching for the new multi-index in the reduced margin of Λ K , as in α K +1 ∈ argmax α ∈M (Λ K ) (cid:13)(cid:13)(cid:13) ∇ v (cid:98) J ( g Λ K + v Φ α ) (cid:12)(cid:12) v =0 (cid:13)(cid:13)(cid:13) , we ensure that Λ K +1 remains downward closed. 
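For concreteness, the reduced margin of a downward-closed set can be enumerated as in the following sketch; representing multi-indices as tuples is an assumption of this sketch rather than the paper's data structure.

```python
# Reduced margin M(Lambda) of a downward-closed set Lambda in N^d: beta is in M(Lambda)
# if beta is not in Lambda and beta - e_i belongs to Lambda for every i with beta_i != 0.
def reduced_margin(Lambda, d):
    Lambda = set(Lambda)
    candidates = set()
    for alpha in Lambda:                               # every margin element is some alpha + e_i
        for i in range(d):
            beta = list(alpha)
            beta[i] += 1
            candidates.add(tuple(beta))
    margin = []
    for beta in candidates - Lambda:
        if all(tuple(b - (j == i) for j, b in enumerate(beta)) in Lambda
               for i in range(d) if beta[i] != 0):
            margin.append(beta)
    return sorted(margin)

# Example with d = 2 and Lambda = {(0,0), (1,0), (0,1)} (downward closed):
print(reduced_margin({(0, 0), (1, 0), (0, 1)}, d=2))   # [(0, 2), (1, 1), (2, 0)]
```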
This is illustrated on Figure 2.As pointed out in [30, 10] in the context of least-squares regression, adding multiple multi-indicesat each greedy iteration could yield better performance compared to adding only one multi-indexat a time. Instead of the enrichment Λ K +1 = Λ K ∪ { α K +1 } , we consider the so-called bulk chasing procedure Λ K +1 = Λ K ∪ λ K +1 , where λ K +1 ⊆ M (Λ K ) is the smallest set of multi-indices such that (cid:88) α ∈ λ K +1 (cid:13)(cid:13)(cid:13) ∇ v (cid:98) J ( g Λ K + v Φ α ) (cid:12)(cid:12) v =0 (cid:13)(cid:13)(cid:13) ≥ θ (cid:88) α ∈M (Λ K ) (cid:13)(cid:13)(cid:13) ∇ v (cid:98) J ( g Λ K + v Φ α ) (cid:12)(cid:12) v =0 (cid:13)(cid:13)(cid:13) , (35)for some parameter 0 < θ ≤
1. That is, λ K +1 contains the λ K +1 largest values of (cid:107)∇ v (cid:98) J ( g Λ K + v Φ α ) (cid:12)(cid:12) v =0 (cid:107) which capture a prescribed fraction θ of the norm of the gradient of J on the reducedmargin. With the bulk chasing procedure we have K (cid:54) = K in general.This procedure is summarized in Algorithm 2. We choose to start the algorithm with the setΛ K = Λ d = { α ∈ N d : (cid:80) di =1 α i = 1 } . This corresponds to the space of linear feature maps and, s explained in Remark 3.3, Algorithm 1 boils down to a power iteration for which a randominitialization works well. Later, we initialize Algorithm 1 by adding a row of zeros to G Λ K toaccount for the newly added basis terms. Notice that Algorithm 2 stops after K max iterations. Wewill explain in Section 3.4 how to use cross validation to determine K max . Algorithm 2:
Construction of feature map g on a downward-closed polynomialspace Data:
Training sample
Input:
Intermediate dimension m , max iteration K max , parameter θ Initialize K = d and Λ K = { α ∈ N d : (cid:80) di =1 α i = 1 } Compute G Λ K ∈ R d × m using Algorithm 1 with random initialization.Define g Λ K ( x ) = G T Λ K x for K = d, . . . , K max − do Compute (cid:107)∇ v (cid:98) J ( g Λ K + v Φ α ) | v =0 (cid:107) for all α ∈ M (Λ K )Select λ K +1 as in (35)Update Λ K +1 = Λ K ∪ λ K +1 and G Λ K +1 m Compute G Λ K +1 ∈ G Λ K +1 m using Algorithm 1 initialized with G (0)Λ K +1 = (cid:20) G Λ K [0 , . . . , (cid:21) ∈ R ( K +1) × m Define g Λ K +1 ( · ) = G T Λ K +1 Φ( · ), where Φ = [Φ , . . . , Φ α K +1 ] : R d → R K +1 endOutput: final iterate g Λ K max Remark . The greedy procedure of Algorithm 2 can get stuck because it “doesn’t see” behindthe reduced margin. For instance, if a relevant index is located above M (Λ K ) and if the gradientvanishes on the reduced margin, the algorithm will never activate that index. [31] suggests asafeguard mechanism to avoid this: arbitrarily activate the most ancient index from the reducedmargin every n -th iteration. In our numerical tests, however, we never needed such a safeguardmechanism. In this section we assume the feature map g has been computed using Algorithm 2. We now build theprofile function f in a polynomial space F m . As in the previous section, we propose to greedily enrich F m so that the minimum of the empirical mean squared error (cid:98) E g ( f ) = N (cid:80) Ni =1 ( u ( x ( i ) ) − f ◦ g ( x ( i ) )) over f ∈ F m is minimized. Since the gradients u ( x (1) ) , . . . , u ( x ( N ) ) are available, we instead considerthe gradient-enhanced empirical mean squared error, (cid:98) E ∇ g ( f ) = 1 N N (cid:88) i =1 (cid:16) ( u ( x ( i ) ) − f ◦ g ( x ( i ) )) + (cid:107)∇ u ( x ( i ) ) − ∇ f ◦ g ( x ( i ) ) (cid:107) (cid:17) . (36)Using (cid:98) E ∇ g ( f ) instead of (cid:98) E g ( f ) is known to yield better mean squared error in the small sample regime;see [33]. This will be illustrated in the next section. Given a finite multi-index set Γ L ⊆ N m we ntroduce F Γ L m = span { Ψ α ; α ∈ Γ L } , (37)where Ψ α denotes the α -th multivariate Hermite polynomial. These polynomials form an orthogonalbasis of L N (0 ,I m ) . In the present context it would have been preferable to work with a L g (cid:93) µ -orthogonal basis, but such a basis is not readily obtainable as it would require computing expensivehigh-dimensional integrals (e.g., for a Gram-Schmidt procedure). We justify the use of Hermitebasis by the fact that, since g ( X ) is centered and has identity covariance (recall the constraints in(14)), { Ψ α } α ∈ N d is a relatively well conditioned basis in L g (cid:93) µ . We show numerically in Section 4that Hermite polynomials perform well.As in the previous section, we propose to build a sub-optimal solution to the best L -termapproximation problem min Γ L ⊆ N d L = L min f ∈F Γ Lm (cid:98) E ∇ g ( f )by greedily constructing the multi-index set as follows: Γ L +1 = Γ L ∪ λ L +1 , where λ L +1 ⊆ M (Γ L )is the smallest multi-index set such that (cid:88) α ∈ λ L +1 (cid:12)(cid:12)(cid:12)(cid:12) dd t (cid:98) E ∇ g ( f Γ L + t Ψ α ) (cid:12)(cid:12)(cid:12) t =0 (cid:12)(cid:12)(cid:12)(cid:12) ≥ θ (cid:88) α ∈M (Γ L ) (cid:12)(cid:12)(cid:12)(cid:12) dd t (cid:98) E ∇ g ( f Γ L + t Ψ α ) (cid:12)(cid:12)(cid:12) t =0 (cid:12)(cid:12)(cid:12)(cid:12) . (38)Here, f Γ L denotes the minimizer of (cid:98) E ∇ g ( f ) over f ∈ F Γ L m and M (Γ L ) the reduced margin of Γ L .This is summarized in Algorithm 3. 
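Each iteration of Algorithm 3 solves the gradient-enhanced least-squares problem (36) on the current space $\mathcal{F}_m^{\Gamma_L}$. A minimal sketch of that inner solve is given below for $m = 1$ with a monomial basis; the basis and the toy data are illustrative simplifications (the paper uses Hermite polynomials on the adaptively enriched set $\Gamma_L$).

```python
import numpy as np

# Gradient-enhanced least-squares fit of the profile f on a fixed basis (cf. (36)), m = 1.
def psi(z, l):
    return z ** l                                     # Psi_l(z) = z^l

def dpsi(z, l):
    return 0.0 if l == 0 else l * z ** (l - 1)

def fit_profile(x_samp, u_samp, gradu_samp, g, jac_g, L):
    """Minimize (36); the misfit is linear in the coefficients of f, so lstsq applies."""
    rows, rhs = [], []
    for x, u_val, gu in zip(x_samp, u_samp, gradu_samp):
        z = g(x)
        rows.append([psi(z, l) for l in range(L)])    # value misfit u(x) - f(g(x))
        rhs.append(u_val)
        Jg = jac_g(x)                                 # Jacobian of g, shape (1, d)
        for k in range(x.size):                       # gradient misfit, one row per coordinate
            rows.append([dpsi(z, l) * Jg[0, k] for l in range(L)])
            rhs.append(gu[k])
    w, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return w                                          # coefficients of f = sum_l w_l Psi_l

# Toy check: u(x) = ||x||^2 with g(x) = ||x||^2 should recover f(z) = z, i.e. w ~ [0, 1, 0].
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
w = fit_profile(X, (X**2).sum(axis=1), 2 * X, lambda x: x @ x, lambda x: 2 * x[None, :], L=3)
print(np.round(w, 6))
```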
Since (cid:98) E ∇ g ( f ) is quadratic in f , this algorithm corresponds toan Orthogonal Matching Pursuit (OMP) approach, as explained in the next remark. Remark . Using the expansion f = (cid:80) Ll =1 w l Ψ α l ∈ F Γ L m with w = ( w , . . . , w L ) T ∈ R L , thegradient-enhanced empirical mean squared error (36) can be written as (cid:98) E ∇ g ( f ) = (cid:107) y − A w (cid:107) , where y ∈ R N ( d +1) is given by y = 1 √ N (cid:18) u ( x (1) ) . . . u ( x ( N ) ) ∇ u ( x (1) ) . . . ∇ u ( x ( N ) ) (cid:19) vec and the α -th column of the matrix A = [ A α · · · A α L ] ∈ R N ( d +1) × L is A α = 1 √ N (cid:18) Ψ α ( z (1) ) . . . Ψ α ( z ( N ) ) ∇ g ( x (1) ) ∇ Ψ α ( z (1) ) . . . ∇ g ( x ( N ) ) ∇ Ψ α ( z ( N ) ) (cid:19) vec with z ( i ) = g ( x ( i ) ). Recall that the subscript “vec” stands for the vectorization of a matrix. Thuswe have | dd t (cid:98) E ∇ g ( f + t Ψ α ) | t =0 | = | A Tα ( y − Ax L ) | , which shows that the selection procedure (38)corresponds to choosing the (nonactive) column of A α which is most correlated with the residual y − Ax L . This is similar to the OMP algorithm [40]; the difference is that, instead of seeking α ina prescribed set, Algorithm (38) seeks α in M (Γ L ), which evolves during the iteration process. Algorithms 2 and 3 need to be stopped before they begin overfitting the data. We employ the ν -foldcross-validation procedure decribed in Algorithm 4. It consists of partitioning the initial sampleΞ = { ( x ( i ) , u ( x ( i ) ) , ∇ u ( x ( i ) ) } Ni =1 into ν subsets Ξ train i , i = 1 , . . . , ν of equal cardinality N/ν , thenrunning the algorithms on each subset Ξ train i while monitoring the error on the corresponding testset Ξ test i = Ξ \ Ξ train i . The optimal number of iterations K ∗ (for Algorithm 2) and L ∗ (for Algorithm lgorithm 3: Construction of profile function f on downward-closed polynomialspace Data:
Training sample
Input:
Feature map g with intermediate dimension m , max iteration L max ,parameter θ Initialize Γ = { (0 , . . . , } Solve the least-squares problem f Γ = min { (cid:98) E ∇ g ( f ); f ∈ F Γ m } for L = 0 , . . . , L max − do Compute | dd t (cid:98) E ∇ g ( f + t Ψ α ) | t =0 | for all α ∈ M (Γ L )Select λ L +1 as in (38)Update Γ L +1 = Γ L ∪ λ L +1 and F Γ L +1 m Solve the least-squares problem f Γ L +1 = min { (cid:98) E ∇ g ( f ); f ∈ F Γ L +1 m } endOutput: final iterate f Γ L max
3) are those which minimize the test error averaged over the ν folds. With these numbers in hand,we then run K ∗ and L ∗ iterations of the algorithms on the entire sample.In Algorithm 4, we use the same sample to train both f and g . Alternatively, we can build f and g using two independent samples. We tried this alternative without obtaining significantimprovement. Thus, in the context where the model u is expensive to evaluate, we recommend raining f and g on the same sample. Algorithm 4:
Learning a composed model f ◦ g ≈ u using values and gradients of u Data:
Sample { ( x ( i ) , u ( x ( i ) ) , ∇ u ( x ( i ) ) } Ni =1 Input:
Intermediate dimension m , max iteration K max and L max , number of folds ν Partition the data set
Ξ = { ( x ( i ) , u ( x ( i ) ) , ∇ u ( x ( i ) ) } Ni =1 for cross validation Partition Ξ into ν subsets of equal cardinality: - i -th test set: Ξ test i is the i -th subset of Ξ - i -th training set: Ξ train i = Ξ \ Ξ test i Construction of the feature mapfor i = 1 , . . . , ν do Run K max iterations of Algorithm 2 on the i -th training set Store the iterates g (1) , . . . , g ( K max ) Monitor the loss J i,j = (cid:98) J ( g ( j ) ), 1 ≤ j ≤ K max , on the i -th test setend Define K ∗ as the minimum of the mean j (cid:55)→ ν (cid:80) νi =1 J i,j Run K ∗ iterations of Algorithm 2 using the whole sample Ξ return feature map g = g ( K ∗ ) Construction of the profilefor i = 1 , . . . , ν do Run L max iterations of Algorithm 3 on the i -th training set Store the iterates f (1) , . . . , f ( L max ) Monitor the mean squared error E i,j = (cid:98) E g ( f ( j ) ), 1 ≤ j ≤ L max , on the i -th test setend Define L ∗ as the minimum of the mean j (cid:55)→ ν (cid:80) νi =1 E i,j Run L ∗ iterations of Algorithm 3 using the whole sample Ξ return profile function f = f ( L ∗ ) Output:
Composed approximation f ◦ g Source code for the algorithms above and numerical experiments below is freely available so thatall results presented here are entirely reproducible. Our implementation uses the toolbox Approxi-mationToolbox [3].
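Algorithm 4 wraps Algorithms 2 and 3 in a $\nu$-fold cross-validation loop to choose the number of greedy iterations. A schematic sketch of that selection logic follows; `run_greedy` and `test_loss` are placeholders standing for "run Algorithm 2 (or 3) on a training fold, returning all iterates" and "evaluate the corresponding loss on a test fold", not functions of the toolbox used by the paper (the full implementation is in the repository cited below).

```python
import numpy as np

# nu-fold cross-validation choosing the number of greedy iterations (cf. Algorithm 4).
def choose_num_iterations(sample, run_greedy, test_loss, max_iter, nu=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(sample)), nu)
    errors = np.zeros((nu, max_iter))
    for i, test_idx in enumerate(folds):
        test_set = {int(j) for j in test_idx}
        train = [sample[j] for j in range(len(sample)) if j not in test_set]
        test = [sample[j] for j in test_idx]
        iterates = run_greedy(train, max_iter)        # iterates after 1, ..., max_iter steps
        errors[i] = [test_loss(it, test) for it in iterates]
    return int(np.argmin(errors.mean(axis=0))) + 1    # iteration count minimizing mean test error

# Usage (schematic):
#   K_star = choose_num_iterations(data, run_algorithm_2, loss_J_hat, K_max)
#   g = run_algorithm_2(data, K_star)[-1]             # rerun on the whole sample
```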
(Source code for all experiments: https://gitlab.inria.fr/ozahm/nonlinear-dimension-reduction-for-surrogate-modeling.git)

We first consider the function $u : \mathbb{R}^d \to \mathbb{R}$ with $d = 20$ defined by
$$ u(x) = \cos(\|x\|), $$
and we let $\mu = \mathcal{N}(0, I_d)$ be the standard normal distribution.
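Under the setting just described, the training data $\{x^{(i)}, u(x^{(i)}), \nabla u(x^{(i)})\}$ can be generated as in the following sketch (a sketch of the setup only; the reported experiments themselves use the toolbox and repository cited above).

```python
import numpy as np

# Data generation for the isotropic example: d = 20, X ~ N(0, I_d), N = 100 samples.
d, N = 20, 100
rng = np.random.default_rng(0)

def u(x):
    return np.cos(np.linalg.norm(x))

def grad_u(x):
    r = np.linalg.norm(x)
    return -np.sin(r) * x / r                  # well defined almost surely (r > 0)

X = rng.standard_normal((N, d))
data = [(x, u(x), grad_u(x)) for x in X]       # the sample {x^(i), u(x^(i)), grad u(x^(i))}
```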
This function is isotropic: it cannot be well approximated by $f \circ g$ with a linear feature map $g$. However, if one allows $g$ to be a quadratic polynomial, the function $g(x) = x_1^2 + \cdots + x_d^2 = \|x\|^2$ allows one to write $u = f \circ g$ with a rather simple one-dimensional profile function, $f(z) = \cos(\sqrt{z})$.

First we assess the performance of the quasi-Newton method (Algorithm 1) for the minimization of $g \mapsto \widehat{J}(g)$ over a fixed space of feature maps $\mathcal{G}_m$. Results are reported in Figure 3.

[Figure 3: Isotropic function. Evolution of $\widehat{J}(g)$ during the quasi-Newton Algorithm 1 using $N = 100$ gradients of $u$ (10 different realizations). For the first 20 iterations, $\mathcal{G}_m^{\Lambda_K}$ contains linear functions only ($K = 20$). At the 21st iteration, $\mathcal{G}_m^{\Lambda_K}$ is enlarged to include all quadratic functions ($K = 20 + 210 = 230$).]

During the first 20 iterations, $\mathcal{G}_m$ is chosen to be the space of linear feature maps; after the 21st iteration, $\mathcal{G}_m$ is enlarged to contain linear and quadratic feature maps. During the first period, we observe a rapid convergence of $J(g)$ towards a plateau which decreases with $m$. Once the quadratic terms are activated, $J(g)$ converges toward zero at an exponential rate. This shows the efficiency of the quasi-Newton approach in Algorithm 1 for building $g$ on a fixed function space $\mathcal{G}_m^{\Lambda_K}$. We observe that the convergence rates are not the same for $m = 1$, $m = 5$, and $m = 10$.

Figure 4a shows the behavior of the adaptive Algorithm 2 for constructing a feature map $g$. Recall that Algorithm 2 is initialized with $\Lambda_K = \{\alpha \in \mathbb{N}^d : \sum_{i=1}^d \alpha_i = 1\}$, which corresponds to the space of linear feature maps. For this experiment, we enrich $\Lambda_K$ with only one multi-index at a time, i.e., $\Lambda_{K+1} = \Lambda_K \cup \{\alpha_{K+1}\}$ with $\alpha_{K+1}$ as in (33). We observe that the algorithm is always capable of building a polynomial $g$ such that $J(g) = 0$ with very few greedy iterations. Note that for large $m$, $J(g) = 0$ is attained earlier, i.e., for smaller $K$. To explain this phenomenon, Table 1 lists a few exact decompositions $u = f \circ g$, where we see that a large intermediate dimension $m$ compensates for a small feature map space $K$.

Figure 4b shows the performance of Algorithm 3. We use a bulk chasing parameter $\theta$ as in (35) and $\nu = 5$ folds to determine when to stop the enrichment process. With $m = 1$, the algorithm is capable of recovering a very accurate approximation to $u$ with only $N = 100$ samples. In contrast, using the same sample, a full-dimensional polynomial approximation (black curve in Figure 4b) attains a much larger error. With intermediate dimensions $m = 5$ and $m = 10$, we still outperform the full-dimensional approach $m = d$, but the error does not reach the level obtained with $m = 1$. This example nicely illustrates the fundamental issue of balancing the complexity between $f$ and $g$:

[Figure 4(a): Evolution of $J(g)$ during the greedy enrichment process of Algorithm 2.]
• With $m = 1$, we obtain a complex $g \in \mathcal{G}_{\Lambda_K}^m$ with $K \geq 40$ and a simple $f \in \mathcal{F}_{\Gamma_L}^m$ with $L \leq 5$; this setting yields the smallest error of the three.
• With $m = 5$ or $m = 10$, we obtain a simpler $g \in \mathcal{G}_{\Lambda_K}^m$ with $30 \leq K \leq 40$ and a more complex $f \in \mathcal{F}_{\Gamma_L}^m$ with $L \geq 20$; the error remains larger than with $m = 1$.
• With $m = d$ (no dimension reduction), $g(x) = x$ is linear and all the complexity is carried by $f \in \mathcal{F}_{\Gamma_L}^m$, which has the largest $L$; the error is the largest of the three settings.

Clearly, for the considered isotropic function, the optimal choice of intermediate dimension is $m = 1$. We will see in the next examples that this is not always the case.

[Figure 4: Isotropic function. Performance of Algorithms 2 and 3 using $N = 100$ samples (5 realizations). First, we construct $g$ using Algorithm 2 (left plot: evolution of $J(g)$ during the greedy enrichment process, for $m = 1, 5, 10$) and then, given $g$, we construct $f$ using Algorithm 3 (right plot: evolution of the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ for $m = 1, 5, 10$ and $m = d$; the black curve $m = d$ is obtained by running Algorithm 3 with $g(x) = x$, the identity map). Both $J(g)$ and $\mathbb{E}[(u(X) - f \circ g(X))^2]$ are computed on a large validation sample of size 2000.]

$m = 1$:  $f(z) = \cos(\sqrt{z})$,  $g(x) = x_1^2 + \ldots + x_{20}^2$,  $K = 40$
$m = 2$:  $f(z_1, z_2) = \cos\big(\sqrt{z_1^2 + z_2}\big)$,  $g(x) = (x_1,\, x_2^2 + \ldots + x_{20}^2)$,  $K = 39$
$\vdots$
$m = 19$: $f(z_1, \ldots, z_{19}) = \cos\big(\sqrt{z_1^2 + \ldots + z_{18}^2 + z_{19}}\big)$,  $g(x) = (x_1, \ldots, x_{18},\, x_{19}^2 + x_{20}^2)$,  $K = 22$
$m = 20$: $f(z_1, \ldots, z_{20}) = \cos\big(\sqrt{z_1^2 + \ldots + z_{20}^2}\big)$,  $g(x) = (x_1, \ldots, x_{20})$,  $K = 20$

Table 1: Isotropic function. List of exact decompositions $u = f \circ g$ with polynomials $g \in \mathcal{G}_{\Lambda_K}^m$, with $K$ ranging from 40 (for $m = 1$) to 20 (for $m = 20$). This explains why, in Figure 4a, $J(g)$ drops to zero earlier in $K$ when $m$ is large.

Our second example is the commonly used Borehole function [39], which models water flow through a borehole. It is a function of $d = 8$ variables defined by
$$u(X) = \frac{2\pi T_u (H_u - H_l)}{\ln(r/r_w)\Big(1 + \dfrac{2 L T_u}{\ln(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\Big)},$$
where $X$ is a random vector in $\mathbb{R}^d$ with independent components given by
$$X_1 = r_w \sim \mathcal{N}(0.10,\, 0.0161812), \quad X_2 = r \sim \log\mathcal{N}(7.71,\, 1.0056), \quad X_3 = T_u \sim \mathcal{U}[63070,\, 115600], \quad X_4 = H_u \sim \mathcal{U}[990,\, 1110],$$
$$X_5 = T_l \sim \mathcal{U}[63.1,\, 116], \quad X_6 = H_l \sim \mathcal{U}[700,\, 820], \quad X_7 = L \sim \mathcal{U}[1120,\, 1680], \quad X_8 = K_w \sim \mathcal{U}[9855,\, 12045].$$

We first numerically illustrate Proposition 2.9. Recall that this proposition states that, given $g \in \mathcal{G}_m$, there exists a function $f$ such that the mean squared error $\mathbb{E}[(u(X) - f(g(X)))^2]$ is bounded by $J(g)$ multiplied by the Poincaré-type constant $C(X \,|\, \mathcal{G}_m)$. In general, $C(X \,|\, \mathcal{G}_m)$ is unknown. We build three feature maps $g$: a linear map, a quadratic map, and a cubic map, defined as the minimizers of $\widehat{J}(g)$ over the polynomial spaces
$$\mathcal{G}_{\Lambda_{\mathrm{lin}}}^m \ \text{ where } \ \Lambda_{\mathrm{lin}} = \Big\{\alpha \in \mathbb{N}^d : 1 \leq \sum_{i=1}^d \alpha_i \leq 1\Big\}, \quad \#\Lambda_{\mathrm{lin}} = 8,$$
$$\mathcal{G}_{\Lambda_{\mathrm{quad}}}^m \ \text{ where } \ \Lambda_{\mathrm{quad}} = \Big\{\alpha \in \mathbb{N}^d : 1 \leq \sum_{i=1}^d \alpha_i \leq 2\Big\}, \quad \#\Lambda_{\mathrm{quad}} = 44,$$
$$\mathcal{G}_{\Lambda_{\mathrm{cub}}}^m \ \text{ where } \ \Lambda_{\mathrm{cub}} = \Big\{\alpha \in \mathbb{N}^d : 1 \leq \sum_{i=1}^d \alpha_i \leq 3\Big\}, \quad \#\Lambda_{\mathrm{cub}} = 164,$$
respectively. To compute these feature maps, we estimate $\widehat{J}(g)$ with $N = 30$, 60, or 150 samples. The dashed curves in Figure 5 show the resulting $J(g)$ (computed on a validation set of size 2000) as a function of $m$. Once $g$ is built, we construct the profile $f$ using Algorithm 3 on the same sample. The continuous lines in Figure 5 represent $\mathbb{E}[(u(X) - f \circ g(X))^2]$ (computed on the validation set). As the sample size $N$ increases, we obtain a better profile function $f$, and the mean squared error decreases until it falls below $J(g)$. We also observe that the larger $m$ is, the higher $N$ must be to obtain a mean squared error below $J(g)$. Domination of the mean squared error by $J(g)$ is consistent with Proposition 2.9, with a Poincaré-type constant $C(X \,|\, \mathcal{G}_m)$ that seems to be close to one for this benchmark.

In the limit $N \to \infty$, $g$ converges towards the optimal linear/quadratic/cubic feature map, while the profile function $f$, built adaptively in Algorithm 3, converges towards the solution of
$$\min_{f : \mathbb{R}^m \to \mathbb{R}} \mathbb{E}\big[(u(X) - f \circ g(X))^2\big].$$
With a larger polynomial degree for $g$, the best achievable error $\min_{f : \mathbb{R}^m \to \mathbb{R}} \mathbb{E}[(u(X) - f \circ g(X))^2]$ is smaller, and so we obtain a better approximation $f \circ g$ to $u$.
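As an illustration of this second step, the sketch below fits a polynomial profile $f$ to the reduced samples $z^{(i)} = g(x^{(i)})$ by ordinary (gradient-free) least squares, i.e., by minimizing $\frac{1}{N}\sum_{i=1}^N (u(x^{(i)}) - f\circ g(x^{(i)}))^2$ over the coefficients of a fixed total-degree basis. The monomial basis, the fixed degree, and the callable g are simplifying assumptions of this sketch; the adaptive, orthonormal-polynomial construction of Algorithm 3 is not reproduced here.

```python
import itertools
import numpy as np

def poly_features(Z, degree):
    """Monomial features z^alpha for all multi-indices with |alpha| <= degree."""
    m = Z.shape[1]
    alphas = [a for a in itertools.product(range(degree + 1), repeat=m) if sum(a) <= degree]
    Phi = np.stack([np.prod(Z ** np.array(a), axis=1) for a in alphas], axis=1)
    return Phi, alphas

def fit_profile(X, U, g, degree=3):
    """Gradient-free least-squares fit of the profile f on the reduced variables z = g(x)."""
    Z = np.array([g(x) for x in X])                    # reduced samples z^(i) = g(x^(i))
    Phi, alphas = poly_features(Z, degree)             # design matrix of monomials
    coef, *_ = np.linalg.lstsq(Phi, np.asarray(U), rcond=None)   # minimize the empirical MSE
    def f(z):
        phi = np.array([np.prod(np.asarray(z, dtype=float) ** np.array(a)) for a in alphas])
        return float(phi @ coef)
    return f
```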
[Figure 5: Borehole. Continuous lines: mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$; dashed lines: cost function $J(g)$, both plotted as functions of $m$ for sample sizes $N \in \{30, 60, 150\}$ and for the linear (a), quadratic (b), and cubic (c) feature maps. The width of the shaded region corresponds to the standard deviation over 20 experiments. The feature map $g$ is built by minimizing $\widehat{J}(g)$ using Algorithm 1 on samples of size $N \in \{30, 60, 150\}$. To build $f$, we employ Algorithm 3 on the same sample with the bulk-chasing parameter $\theta$.]

Notice, however, that when the mean squared error is far above $J(g)$ (typically for large $m$), increasing the polynomial degree of $g$ does not significantly improve the approximation $f \circ g$. The interpretation is that if we cannot build a sufficiently accurate profile function $f$ (either because $m$ is too large or $N$ is too small), there is no benefit in having a complex (i.e., high polynomial degree) feature map $g$.

We now build both $g$ and $f$ adaptively using Algorithm 4 with $\nu = 5$ folds and the same bulk-chasing parameter $\theta$ (from now on we use these parameters by default). Compared to the previous experiments, where the polynomial degree of $g$ was fixed, the mean squared errors shown in Figure 6a go to zero as $N \to \infty$, even for small $m$. Figure 6b shows the cardinalities of $\Lambda_K$ and $\Gamma_L$ as functions of the intermediate dimension $m$. We clearly see that, for small $m$, our adaptive algorithm builds complex feature maps and simple profile functions; for large $m$, it is the other way around.

From Figure 6a, it seems that the optimal intermediate dimension $m$ depends on $N$: for small sample sizes $N = 30$ or $N = 60$, the best intermediate dimension is $m = 2$ or $m = 3$. For $N = 150$, however, one clearly obtains better results with $m = d$, meaning without dimension reduction, i.e., $u(x) \approx f(x)$ with $g(x) = x$.

[Figure 6: Borehole. Same settings as for Figure 5, but with a feature map $g$ built using the adaptive Algorithm 2. Panel (a): gradient-enhanced construction of $f$ (mean squared error as a function of $m$ for $N \in \{30, 60, 150\}$). Panel (b): mean cardinality of $\Lambda_K$ (top) and of $\Gamma_L$ (bottom), which show the complexity of $g$ and $f$, respectively (mean over 20 experiments).]

We now consider the benchmark introduced in [18], defined as a deep composition of functions. We consider the function $u$ of $d = 16$ variables defined by
$$u(x) = h\Big( h\big( h(h(x_1, x_2), h(x_3, x_4)),\, h(h(x_5, x_6), h(x_7, x_8)) \big),\; h\big( h(h(x_9, x_{10}), h(x_{11}, x_{12})),\, h(h(x_{13}, x_{14}), h(x_{15}, x_{16})) \big) \Big),$$
where $h(s, t) = 9^{-1}(1 + st)^2$, and we let $X$ be the random vector with uniform measure on $[-1, 1]^{16}$. This function $u$ is a polynomial (as a composition of polynomials) and can readily be written as $u = f \circ g$ for $m = 2$, with polynomial $f$ and $g$.

Numerical results are reported in Figure 7. For each choice of $N$ and $m$, after constructing the feature map $g$ via Algorithm 2 and the cross-validation procedure in the first half of Algorithm 4, we illustrate the benefits of the gradient-enhanced construction of the profile function $f$ by building it either with gradient-free least squares (i.e., by minimizing $\widehat{E}_g(f) = \frac{1}{N}\sum_{i=1}^N (u(x^{(i)}) - f \circ g(x^{(i)}))^2$) or with gradient-enhanced least squares (i.e., by minimizing $\widehat{E}^{\nabla}_g(f)$ in (36)). For large $m$, the gradient-enhanced approach clearly outperforms the gradient-free approach, while for small $m$ both approaches perform equally well.
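The exact gradient-enhanced objective $\widehat{E}^{\nabla}_g(f)$ is the one given in (36); the sketch below only shows one natural variant of the idea, assembling value residuals $u(x^{(i)}) - f\circ g(x^{(i)})$ and gradient residuals $\nabla u(x^{(i)}) - \nabla g(x^{(i)})^T \nabla f(g(x^{(i)}))$ (with equal weights, an assumption made here) into a single least-squares system for a profile that is linear in its coefficients. The callables g, grad_g, basis, and grad_basis are hypothetical placeholders, so this is an illustration rather than the paper's implementation.

```python
import numpy as np

def fit_profile_grad_enhanced(X, U, dU, g, grad_g, basis, grad_basis):
    """Gradient-enhanced least squares for a profile f(z) = sum_j c_j phi_j(z) (one possible form).

    basis(z)      : vector (phi_j(z))_j of length P.
    grad_basis(z) : m-by-P Jacobian (d phi_j / d z_k)(z).
    g, grad_g     : feature map and its m-by-d Jacobian (placeholders).
    """
    rows, rhs = [], []
    for x, u_val, du in zip(X, U, dU):
        z = g(x)                          # reduced variable z = g(x)
        G = grad_g(x)                     # Jacobian of g, shape (m, d)
        rows.append(basis(z))             # value equation: phi(z)^T c = u(x)
        rhs.append(u_val)
        B = G.T @ grad_basis(z)           # chain rule: grad_x phi_j(g(x)) = grad g(x)^T grad_z phi_j(z)
        rows.extend(B)                    # d gradient equations: B c = grad u(x)
        rhs.extend(du)
    A, b = np.array(rows), np.array(rhs)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef
```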
It seems that, for small $m$, the profile can be estimated accurately using evaluations of $u(x^{(i)})$ only. Since gradients are needed to construct $g$ regardless, our recommendation is to always use the gradient-enhanced approach to construct $f$, as it makes better use of the available information.

For this benchmark, it seems that $m = 2$ is the best intermediate dimension for the considered range of sample sizes $N$. With this choice, the mean squared error can be reduced by around a factor of 10 compared to a full-dimensional function approximation scheme that simply uses $g = \mathrm{Id}$ with the same sample.

[Figure 7: Composed function. Mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ (computed on a validation set of size 1000), where $g$ and $f$ are obtained by Algorithm 4 ($\nu = 5$), as a function of $m$ for $N \in \{30, 60, 150, 300\}$. The lines (resp. the width of the shaded regions) correspond to the mean (resp. the variance) over 20 experiments. Panel (a): $f$ is built by minimizing the gradient-free mean squared error $\widehat{E}_g(f) = \frac{1}{N}\sum_{i=1}^N (u(x^{(i)}) - f \circ g(x^{(i)}))^2$. Panel (b): $f$ is built by minimizing $\widehat{E}^{\nabla}_g(f)$, see (36). Panel (c): mean cardinalities of $\Lambda_K$ (top) and of $\Gamma_L$ (bottom), with the gradient-enhanced construction of $f$.]

Our last numerical experiment is a PDE-based model where the quantity of interest $u(x)$ is the smallest resonance frequency of a 2D structure which has the shape of a bridge, as shown in Figure 8. Here, $x$ parameterizes the Young modulus field of the structure. An important feature of this problem is that, while it relies on a complex numerical model, one can evaluate the gradient $\nabla u(x)$ at the same computational cost as an evaluation of $u(x)$, as we explain below.

To model the structure, we consider a linear elasticity problem in two spatial dimensions under the plane stress assumption. After finite element discretization, the smallest resonance frequency $u(x)$ is defined as the minimum of a Rayleigh quotient,
$$u(x) = \min_{v \in \mathbb{R}^n} \frac{v^T K(x) v}{v^T M v},$$
where $K(x) \in \mathbb{R}^{n \times n}$ and $M \in \mathbb{R}^{n \times n}$ are the stiffness and mass matrices given by
$$K_{ij}(x) = \int_\Omega \Big\langle \frac{E(x)}{1+\nu}\, \varepsilon(\phi_i) + \frac{\nu E(x)}{1-\nu^2}\, \mathrm{trace}(\varepsilon(\phi_i))\, I,\ \varepsilon(\phi_j) \Big\rangle_F \, d\Omega, \qquad M_{ij} = \int_\Omega \langle \phi_i, \phi_j \rangle \, d\Omega.$$
Here, $n = 960$ is the number of nodes in the finite element mesh, $\phi_i : \Omega \to \mathbb{R}^2$ is the $i$-th finite element function, $\varepsilon(v) = \frac{1}{2}(\nabla v + \nabla v^T) \in \mathbb{R}^{2 \times 2}$ is the strain tensor, $\langle \cdot, \cdot \rangle_F$ is the Frobenius scalar product in $\mathbb{R}^{2 \times 2}$, and $\langle \cdot, \cdot \rangle$ is the canonical scalar product in $\mathbb{R}^2$. The Poisson coefficient $\nu$ is held fixed, and the Young modulus field $E(x) : \Omega \to \mathbb{R}$ is parameterized by a $d = 32$-dimensional parameter $x \in \mathbb{R}^d$ as follows:
$$E(x) = \exp\Big( \sum_{i=1}^{d} x_i \sqrt{\sigma_i}\, \psi_i \Big),$$
where $\psi_i : \Omega \to \mathbb{R}$ and $\sigma_i$ are the $i$-th leading eigenfunctions and eigenvalues of a Gaussian (squared-exponential) covariance kernel $c(s,t)$, and $X$ is endowed with the standard normal distribution on $\mathbb{R}^d$. We denote by
$$v(x) = \operatorname*{argmin}_{v \in \mathbb{R}^n} \frac{v^T K(x) v}{v^T M v}$$
the minimizer of the Rayleigh quotient, i.e., the eigenvector associated with the eigenvalue/frequency $u(x)$.

[Figure 8: Bridge. A realization of the Young modulus field $E(X)$ (color of the elements) and the associated resonance mode $v(X)$ (displacement of the mesh).]
The $i$-th component of $\nabla u(x) = (\partial_{x_1} u(x), \ldots, \partial_{x_d} u(x))$ can be written as
$$\partial_{x_i} u(x) = \frac{v(x)^T \big(\partial_{x_i} K(x)\big)\, v(x)}{v(x)^T M\, v(x)}. \qquad (39)$$
To show this, let us write $u(x) = R(v(x), x)$, where $R(v, x) = \frac{v^T K(x) v}{v^T M v}$ is the Rayleigh quotient. By definition of $v(x)$ we have $\nabla_v R(v(x), x) = 0$, so that the chain rule yields $\partial_{x_i} u(x) = \nabla_v R(v(x), x)^T \partial_{x_i} v(x) + \partial_{x_i} R(v(x), x) = \partial_{x_i} R(v(x), x)$, which is (39). By definition of $E(x)$ and $K(x)$, the matrix $\partial_{x_i} K(x)$ is given by
$$\partial_{x_i} K_{kl}(x) = \int_\Omega \sqrt{\sigma_i}\, \psi_i \Big\langle \frac{E(x)}{1+\nu}\, \varepsilon(\phi_k) + \frac{\nu E(x)}{1-\nu^2}\, \mathrm{trace}(\varepsilon(\phi_k))\, I,\ \varepsilon(\phi_l) \Big\rangle_F \, d\Omega.$$
The cost of assembling $\partial_{x_i} K$ for $1 \leq i \leq d$ is negligible compared to the cost of computing the eigenmode $v(x)$, which requires an expensive inverse power iteration. In other words, once $v(x)$ is computed, one can evaluate both $u(x)$ and $\nabla u(x)$ almost for free.
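The observation that the gradient comes almost for free translates directly into code. In the sketch below, assemble_K, assemble_dK, and M are hypothetical placeholders for the finite element assembly of $K(x)$, of its parametric derivatives $\partial_{x_i} K(x)$, and for the mass matrix; the smallest generalized eigenpair gives $u(x)$ and $v(x)$, and formula (39) then gives $\nabla u(x)$ without any eigenvector derivative.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def frequency_and_gradient(x, assemble_K, assemble_dK, M):
    """Smallest resonance frequency u(x) and its gradient via formula (39).

    assemble_K(x)  : placeholder returning the stiffness matrix K(x) (n x n, sparse or dense).
    assemble_dK(x) : placeholder returning the list [dK/dx_1, ..., dK/dx_d].
    M              : mass matrix (n x n).
    """
    K = assemble_K(x)
    # Smallest eigenpair of the generalized problem K v = u M v (the Rayleigh-quotient minimizer);
    # shift-invert about 0 assumes K is positive definite after boundary conditions are applied.
    u_val, v = eigsh(K, k=1, M=M, sigma=0, which="LM")
    v = v[:, 0]
    vMv = float(v @ (M @ v))
    # Formula (39): du/dx_i = v^T (dK/dx_i) v / (v^T M v); no eigenvector derivative is needed.
    grad = np.array([float(v @ (dK @ v)) / vMv for dK in assemble_dK(x)])
    return float(u_val[0]), grad
```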
In Table 2 we report the performance of Algorithm 4 on this benchmark, for a sample size $N = 100$ and a range of values of $m$. The best performance is obtained with an intermediate dimension of $m = 3$. For $m = 8$ or $m = 16$, the mean squared error is slightly higher than for $m = d$, i.e., when we do not reduce the dimension. As before, we observe that a small intermediate dimension $m$ yields complex feature maps $g$ (i.e., large $K$) and simple profiles $f$ (i.e., small $L$).

m                :    1          2          3         4         6         8        16       d = 32
mean (std) MSE   :    –          –          –         –         –         –         –         –
K (mean ± std)   : 148 (±64)  129 (±45)   91 (±21)  80 (±23)  64 (±16)  57 (±9)   51 (±1)   32 (±–)
L (mean ± std)   :   – (±1)     8 (±1)    11 (±1)   15 (±3)   24 (±7)   44 (±24) 133 (±–)    –

Table 2: Bridge. Mean and standard deviation (std) of the mean squared error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ over 20 experiments, where $g$ and $f$ are constructed using Algorithm 4 with $N = 100$ samples. The error is computed on a (fixed) validation set of size 1000. The last two lines of the table give the mean ($\pm$ std) of the cardinalities $K$ and $L$, which represent the complexity of $g$ and $f$, respectively.

Conclusion

We have proposed and analyzed a novel framework for the dimension reduction of multivariate functions. Our approach relies on gradient evaluations of the model $u : \mathbb{R}^d \to \mathbb{R}$ and is a two-step procedure. First, we build a feature map $g : \mathbb{R}^d \to \mathbb{R}^m$ in a function space $\mathcal{G}_m$ by aligning the Jacobian of $g$ with the gradients of $u$. Second, we build a profile function $f : \mathbb{R}^m \to \mathbb{R}$ by minimizing the mean squared error between $u$ and $f \circ g$. We prove that having a finite Poincaré constant $C(X \,|\, \mathcal{G}_m)$ ensures good theoretical properties of the feature map, namely that the objective used to identify $g$ bounds the $L^2$ error between $u$ and its approximation. The Poincaré constant depends both on the probability measure of the inputs $X$ and on the feature space $\mathcal{G}_m$. In practice we observe good approximation performance using polynomial spaces $\mathcal{G}_m$, constructed via a greedy adaptive procedure, but we cannot easily check that $C(X \,|\, \mathcal{G}_m) < \infty$ in this case. Indeed, theoretically guaranteeing that $C(X \,|\, \mathcal{G}_m) < \infty$ for a computationally feasible space of nonlinear feature maps $\mathcal{G}_m$ remains a challenge.

Our numerical experiments also illustrate the role of the intermediate dimension $m$ in this setting. It is natural to ask: what is the intrinsic intermediate dimension $m$ of a model $u$? From a theoretical perspective, we argue that this question is void without specifying a function class $\mathcal{G}_m$ for $g$. For instance, we can speak of the linear or quadratic intrinsic intermediate dimension of $u$ as the smallest $m$ such that there exists a linear or a quadratic $g$ so that the error $\mathbb{E}[(u(X) - f \circ g(X))^2]$ is less than a prescribed tolerance for some $f : \mathbb{R}^m \to \mathbb{R}$. The OMP-type algorithm we propose, which adapts the complexity of $\mathcal{G}_m$ to the sample size, then makes the interpretation of $m$ more complicated.

A useful alternative question is how to optimally select the intermediate dimension $m$ in practice. For now, we have no way to select it a priori. In our numerical tests, we run the algorithm for all possible values of $m = 1, \ldots, d$ and select the intermediate dimension which yields the lowest cross-validation error. We have observed that the intermediate dimension which yields the smallest reconstruction error depends on the sample size $N$: for instance, in the small-sample regime, an intermediate dimension of $m = 2$ or 3 might yield better approximations, while in the large-sample regime no dimension reduction, i.e., $m = d$, could be a better choice. This trend depends very much on the target function $u$, and we show examples where an intermediate value of $m$ is best over a range of sample sizes.

The minimization of the function $J(g)$ turns out to be quite a challenging task. While the quasi-Newton method proposed here is generally effective, recent work [24] may offer a novel optimization perspective to address the essential problem of minimizing sums of generalized Rayleigh quotients.

Another interesting direction motivated by the present work is the recursive construction of approximations of the form $f_k \circ f_{k-1} \circ \ldots \circ f_1$, where each $f_i$ is built using gradients of $u$.
This composition is related to deep neural network architectures for function approximation, and may offer a perspective on the choice of latent space and internal dimension in such methods.

Acknowledgment
The authors gratefully acknowledge support from the Inria associate team UNQUESTIONABLE. CP and OZ also acknowledge support from the CIROQUO consortium. DB and YMM also acknowledge support from the US Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS project.
References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009.
[2] K. P. Adragni and R. D. Cook, Sufficient dimension reduction and prediction in regression, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367 (2009), pp. 4385–4405.
[3] A. Nouy, E. Grelier, and L. Giraldi, ApproximationToolbox, Feb. 2020.
[4] D. Bakry, F. Barthe, P. Cattiaux, A. Guillin, et al., A simple proof of the Poincaré inequality for a large class of probability measures, Electronic Communications in Probability, 13 (2008), pp. 60–66.
[5] A. Beck and Y. C. Eldar, Sparsity constrained nonlinear optimization: Optimality conditions and algorithms, SIAM Journal on Optimization, 23 (2013), pp. 1480–1509.
[6] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, 2013.
[7] M. C. Brennan, D. Bigoni, O. Zahm, A. Spantini, and Y. Marzouk, Greedy inference with structure-exploiting lazy maps, arXiv preprint arXiv:1906.00031, (2020).
[8] A. Chkifa, A. Cohen, and C. Schwab, Breaking the curse of dimensionality in sparse polynomial approximation of parametric PDEs, Journal de Mathématiques Pures et Appliquées, 103 (2015), pp. 400–428.
[9] A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, and D. Picard, Capturing ridge functions in high dimensions from point queries, Constructive Approximation, 35 (2012), pp. 225–243.
[10] A. Cohen and G. Migliorati, Multivariate approximation in downward closed polynomial spaces, in Contemporary Computational Mathematics - A celebration of the 80th birthday of Ian Sloan, Springer, 2018, pp. 233–282.
[11] P. G. Constantine, Active subspaces: Emerging ideas for dimension reduction in parameter studies, SIAM, 2015.
[12] P. G. Constantine, E. Dow, and Q. Wang, Active subspace methods in theory and practice: applications to kriging surfaces, SIAM Journal on Scientific Computing, 36 (2014), pp. A1500–A1524.
[13] R. D. Cook and S. Weisberg, Discussion of sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 86 (1991), pp. 328–332.
[14] T. Cui and O. Zahm, Data-free likelihood-informed dimension reduction of Bayesian inverse problems, (2020).
[15] J. E. Dennis, Jr. and J. J. Moré, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.
[16] M. Fornasier, K. Schnass, and J. Vybiral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, 12 (2012), pp. 229–262.
[17] G. H. Golub and C. F. Van Loan, Matrix computations, vol. 3, JHU Press, 2013.
[18] E. Grelier, A. Nouy, and M. Chevreuil, Learning with tree-based tensor formats, arXiv preprint arXiv:1811.04455, (2018).
[19] A. Griewank et al., On automatic differentiation, Mathematical Programming: recent developments and applications, 6 (1989), pp. 83–107.
[20] J. M. Hokanson and P. G. Constantine, Data-driven polynomial ridge approximation using variable projection, SIAM Journal on Scientific Computing, 40 (2018), pp. A1566–A1589.
[21] E. Kokiopoulou, J. Chen, and Y. Saad, Trace optimization and eigenproblems in dimension reduction methods, Numerical Linear Algebra with Applications, 18 (2011), pp. 565–602.
[22] S. G. Krantz and H. R. Parks, The implicit function theorem: history, theory, and applications, Springer Science & Business Media, 2012.
[23] R. R. Lam, O. Zahm, Y. M. Marzouk, and K. E. Willcox, Multifidelity dimension reduction via active subspaces, SIAM Journal on Scientific Computing, 42 (2020), pp. A929–A956.
[24] J. B. Lasserre, V. Magron, S. Marx, and O. Zahm, Minimizing rational functions: a hierarchy of approximations via pushforward measures, arXiv preprint arXiv:2012.05793, (2020).
[25] C. Lataniotis, S. Marelli, and B. Sudret, Extending classical surrogate modeling to high dimensions through supervised dimensionality reduction: a data-driven approach, International Journal for Uncertainty Quantification, 10 (2020).
[26] L. Laurent, R. Le Riche, B. Soulier, and P.-A. Boucard, An overview of gradient-enhanced metamodels with applications, Archives of Computational Methods in Engineering, 26 (2019), pp. 61–106.
[27] K.-Y. Lee, B. Li, F. Chiaromonte, et al., A general theory for nonlinear sufficient dimension reduction: Formulation and estimation, Annals of Statistics, 41 (2013), pp. 221–249.
[28] B. Li, Sufficient dimension reduction: Methods and applications with R, CRC Press, 2018.
[29] K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 86 (1991), pp. 316–327.
[30] G. Migliorati, Adaptive polynomial approximation by means of random discrete least squares, in Numerical Mathematics and Advanced Applications - ENUMATH 2013, Springer, 2015, pp. 547–554.
[31] G. Migliorati, Adaptive approximation by optimal weighted least-squares methods, SIAM Journal on Numerical Analysis, 57 (2019), pp. 2217–2245.
[32] M. T. Parente, J. Wallin, B. Wohlmuth, et al., Generalized bounds for active subspaces, Electronic Journal of Statistics, 14 (2020), pp. 917–943.
[33] J. Peng, J. Hampton, and A. Doostan, On polynomial chaos expansion via gradient-enhanced ℓ1-minimization, Journal of Computational Physics, 310 (2016), pp. 440–458.
[34] A. Pinkus, Ridge functions, vol. 205, Cambridge University Press, 2015.
[35] R.-E. Plessix, A review of the adjoint-state method for computing the gradient of a functional with geophysical applications, Geophysical Journal International, 167 (2006), pp. 495–503.
[36] A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, M. Saisana, and S. Tarantola, Global sensitivity analysis: the primer, John Wiley & Sons, 2008.
[37] P. Scheiblechner, On the complexity of deciding connectedness and computing Betti numbers of a complex algebraic variety, Journal of Complexity, 23 (2007), pp. 359–379.
[38] G. W. Stewart, Matrix perturbation theory, (1990).
[39] S. Surjanovic and D. Bingham, Virtual library of simulation experiments, 2013.
[40] J. A. Tropp and A. C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Transactions on Information Theory, 53 (2007), pp. 4655–4666.
[41] C. Villani, Optimal transport: old and new, vol. 338, Springer Science & Business Media, 2008.
[42] X. Wang, L. Wang, and Y. Xia, An efficient global optimization algorithm for maximizing the sum of two generalized Rayleigh quotients, Computational and Applied Mathematics, 37 (2018), pp. 4412–4422.
[43] H.-M. Wu, Kernel sliced inverse regression with applications to classification, Journal of Computational and Graphical Statistics, 17 (2008), pp. 590–610.
[44] Y.-R. Yeh, S.-Y. Huang, and Y.-J. Lee, Nonlinear dimension reduction with kernel sliced inverse regression, IEEE Transactions on Knowledge and Data Engineering, 21 (2008), pp. 1590–1603.
[45] O. Zahm, P. G. Constantine, C. Prieur, and Y. M. Marzouk, Gradient-based dimension reduction of multivariate vector-valued functions, SIAM Journal on Scientific Computing, 42 (2020), pp. A534–A558.
[46] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk, Certified dimension reduction in nonlinear Bayesian inverse problems, arXiv preprint arXiv:1807.03712, (2018).
[47] G. Zhang, J. Zhang, and J. Hinkle, Learning nonlinear level sets for dimensionality reduction in function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 13199–13208.
[48] L.-H. Zhang, On optimizing the sum of the Rayleigh quotient and the generalized Rayleigh quotient on the unit sphere, Computational Optimization and Applications, 54 (2013), pp. 111–139.
[49] L.-H. Zhang, On a self-consistent-field-like iteration for maximizing the sum of the Rayleigh quotients, Journal of Computational and Applied Mathematics, 257 (2014), pp. 14–28.
A Link with the loss function introduced in [47]
As in Example 2.5, let $\phi : \mathcal{X} \to \mathcal{X}$ be a $C^1$-diffeomorphism and let $g : \mathcal{X} \to \mathbb{R}^m$ be a feature map defined by $g(x) = (\phi_1(x), \ldots, \phi_m(x))$. In [47], the diffeomorphism $\phi$ is built by minimizing the loss function
$$\mathcal{L}_\omega(\phi) := \mathbb{E}\left[ \sum_{i=1}^d \omega_i \left\langle \frac{\nabla \phi_i(X)}{\|\nabla \phi_i(X)\|}, \nabla u(X) \right\rangle^2 \right],$$
where $\omega = (\omega_1, \ldots, \omega_d) \in \mathbb{R}^d_{\geq 0}$ are non-negative weights which are arbitrarily chosen. To link this loss function with the proposed cost function $J(g)$, let us assume that the orthogonality condition
$$\nabla \phi_i(x)^T \nabla \phi_j(x) = 0 \qquad (40)$$
holds for any $i \neq j$ and for any $x \in \mathcal{X}$. Under this assumption, the cost function $J(g)$ can be written as
$$J(g) = \mathbb{E}\Big[ \big\| \big(I_d - \Pi_{\mathrm{range}(\nabla g(X)^T)}\big)\, \nabla u(X) \big\|^2 \Big] \overset{(40)}{=} \mathbb{E}\left[ \sum_{i=m+1}^d \left\langle \frac{\nabla \phi_i(X)}{\|\nabla \phi_i(X)\|}, \nabla u(X) \right\rangle^2 \right] = \mathcal{L}_\omega(\phi),$$
where the last equality is obtained by letting
$$\omega = (\underbrace{0, \ldots, 0}_{m \text{ times}}, \underbrace{1, \ldots, 1}_{d-m \text{ times}}).$$
In [47], the loss function $\mathcal{L}_\omega(\phi)$ is used without ensuring the orthogonality condition (40), and no theoretical justification is provided. For instance, without condition (40), it is unclear whether $\mathcal{L}_\omega(\phi) = 0$ implies $u(x) = f \circ g(x)$ or, more critically, whether $u(x) = f \circ g(x)$ implies $\mathcal{L}_\omega(\phi) = 0$.

B Proof of Proposition 3.2
We use the notation $M^{\mathrm{sym}} = (M + M^T)/2$ for any square matrix $M$. For any $\|\delta G\| \leq \varepsilon$ we can write
$$(G + \delta G)^T A(X) (G + \delta G) = G^T A(X) G + 2\big(\delta G^T A(X) G\big)^{\mathrm{sym}} + \mathcal{O}(\|\delta G\|^2),$$
and
$$\big((G + \delta G)^T B(X) (G + \delta G)\big)^{-1} = \big(G^T B(X) G + 2(\delta G^T B(X) G)^{\mathrm{sym}} + \mathcal{O}(\|\delta G\|^2)\big)^{-1} = (G^T B(X) G)^{-1} - 2\,(G^T B(X) G)^{-1} (\delta G^T B(X) G)^{\mathrm{sym}} (G^T B(X) G)^{-1} + \mathcal{O}(\|\delta G\|^2).$$
Multiplying the two above quantities yields
$$\big((G + \delta G)^T A(X)(G + \delta G)\big)\big((G + \delta G)^T B(X)(G + \delta G)\big)^{-1} = (G^T A(X) G)(G^T B(X) G)^{-1} + 2(\delta G^T A(X) G)^{\mathrm{sym}} (G^T B(X) G)^{-1} - 2\,(G^T A(X) G)(G^T B(X) G)^{-1} (\delta G^T B(X) G)^{\mathrm{sym}} (G^T B(X) G)^{-1} + \mathcal{O}(\|\delta G\|^2).$$
Taking the expectation of the trace yields
$$\mathcal{R}(G + \delta G) = \mathcal{R}(G) + 2\,\mathbb{E}\big[\mathrm{trace}\big(\delta G^T A(X) G (G^T B(X) G)^{-1}\big)\big] - 2\,\mathbb{E}\big[\mathrm{trace}\big((G^T A(X) G)(G^T B(X) G)^{-1} (\delta G^T B(X) G)(G^T B(X) G)^{-1}\big)\big] + \mathcal{O}(\|\delta G\|^2).$$
Here we used the fact that $\mathrm{trace}(M^{\mathrm{sym}} S) = \mathrm{trace}(M S)$ holds for any square matrix $M$ and any symmetric matrix $S$. Using the notation $\langle M, N \rangle = \mathrm{trace}(M N^T)$, we can write
$$\mathcal{R}(G + \delta G) = \mathcal{R}(G) + \langle \nabla \mathcal{R}(G), \delta G \rangle + \mathcal{O}(\|\delta G\|^2),$$
where
$$\nabla \mathcal{R}(G) = 2\,\mathbb{E}\big[ A(X) G (G^T B(X) G)^{-1} \big] - 2\,\mathbb{E}\big[ B(X) G (G^T B(X) G)^{-1} G^T A(X) G (G^T B(X) G)^{-1} \big].$$
This shows that $\mathcal{R}(\cdot)$ is differentiable at $G$. Finally, the expression (22) of $\nabla \mathcal{R}(G)$ is obtained by using the definitions of $H(G)$ and $\Sigma(G)$ (see (23) and (24)) and by using the fact that $(S_1 G S_2)^{\mathrm{vec}} = (S_2 \otimes S_1) G^{\mathrm{vec}}$ for any symmetric matrices $S_1$, $S_2$. Both $H(G)$ and $\Sigma(G)$ are symmetric positive semidefinite, as expectations of Kronecker products of symmetric positive semidefinite matrices.
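The closed-form expression for $\nabla\mathcal{R}(G)$ is easy to sanity-check numerically. The sketch below replaces the expectation by an empirical mean over a few randomly drawn symmetric positive definite matrices (hypothetical stand-ins for $A(X)$ and $B(X)$) and compares the gradient formula with a directional finite difference; it is a verification aid only, not part of the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n_samples = 6, 2, 20

def random_spd(k):
    Q = rng.standard_normal((k, k))
    return Q @ Q.T + 1e-3 * np.eye(k)               # symmetric positive definite sample

A = [random_spd(d) for _ in range(n_samples)]       # plays the role of A(x^(i))
B = [random_spd(d) for _ in range(n_samples)]       # plays the role of B(x^(i))

def R(G):
    """Empirical version of R(G) = E[trace((G^T A G)(G^T B G)^{-1})]."""
    return np.mean([np.trace((G.T @ a @ G) @ np.linalg.inv(G.T @ b @ G))
                    for a, b in zip(A, B)])

def grad_R(G):
    """Closed-form gradient derived in the proof of Proposition 3.2."""
    terms = []
    for a, b in zip(A, B):
        P = np.linalg.inv(G.T @ b @ G)
        terms.append(2 * a @ G @ P - 2 * b @ G @ P @ (G.T @ a @ G) @ P)
    return np.mean(terms, axis=0)

G = rng.standard_normal((d, m))
dG = rng.standard_normal((d, m))
eps = 1e-6
fd = (R(G + eps * dG) - R(G - eps * dG)) / (2 * eps)    # directional finite difference
exact = np.sum(grad_R(G) * dG)                          # <grad R(G), dG> = trace(grad R(G) dG^T)
print(abs(fd - exact))                                  # should be small (finite-difference accuracy)
```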