Probabilistic Learning on Manifolds (PLoM) with Partition
C. Soize a,∗, R. Ghanem b
a Université Gustave Eiffel, MSME UMR 8208 CNRS, 5 bd Descartes, 77454 Marne-la-Vallée, France
b University of Southern California, 210 KAP Hall, Los Angeles, CA 90089, United States
Abstract
The probabilistic learning on manifolds (PLoM) introduced in 2016 has solved difficult supervised problems for the “small data” limit where the number N of points in the training set is small. Many extensions have since been proposed, making it possible to deal with increasingly complex cases. However, the performance limit has been observed and explained for applications for which N is very small (50 for example) and for which the dimension of the diffusion-map basis is close to N. For these cases, we propose a novel extension based on the introduction of a partition in independent random vectors. We take advantage of this novel development to present improvements of the PLoM such as a simplified algorithm for constructing the diffusion-map basis and a new mathematical result for quantifying the concentration of the probability measure in terms of a probability upper bound. The analysis of the efficiency of this novel extension is presented through two applications.

Keywords: probabilistic learning, PLoM, partition in independent random vectors, machine learning, data driven, uncertainty quantification
Notations
The following notations are used:

A lower case letter such as x, η, or u is a real deterministic variable.
A boldface lower case letter such as x, η, or u is a real deterministic vector.
An upper case letter such as X, H, or U is a real random variable (except for E).
A boldface upper case letter such as X, H, or U is a real random vector.
A letter between brackets such as [x], [η], [u], or [C] is a real deterministic matrix.
A boldface upper case letter between brackets such as [X], [H], or [U] is a real random matrix.

n, n_q, n_u, n_w: dimensions of random vectors X, Q, U, W.
n_MC: number of additional realizations of random matrix [H^N_{m_o}].
n_p: number of groups in the partition of H.
m, m_i: dimensions of the reduced-order diffusion-map bases [g_m], [g^i_{m_i}].
m_o, m_{i,o}: optimal values of m, m_i.
E: mathematical expectation.
N, N_ar: numbers of points in the training and learned sets.
ν, ν_i: dimensions of random vectors H, Y^i.
R, R^n: real line, Euclidean vector space of dimension n.
M_n, M_{n,N}: sets of all the (n × n) and (n × N) real matrices.
M⁺_n: set of all the positive-definite symmetric (n × n) real matrices.
‖x‖: Euclidean norm when x is a vector, Frobenius norm when [x] is a matrix.

∗ Corresponding author: C. Soize, [email protected]
Email addresses: [email protected] (C. Soize), [email protected] (R. Ghanem)
1. Introduction

(i) About the PLoM. The PLoM (probabilistic learning on manifolds) method was proposed in 2016 [1] as a complementary approach to existing methods in machine learning. It allows for solving unsupervised and supervised problems under uncertainty for which the training sets are small. This situation is encountered in many problems of physics and engineering sciences with expensive function evaluations. The exploration of the admissible solution space in these situations is thus hampered by available computational resources. The PLoM was successfully adapted to tackle these challenges for several related problems including nonconvex optimization under uncertainty [2, 3] and the calculation of Sobol's indices [4].

(ii) Brief discussion on the hypotheses on which the PLoM method has been built.
The PLoM approach starts from a training set made up of a relatively small number N of points (initial realizations). For the supervised case, it is assumed that the training set is related to an underlying stochastic manifold related to a R^n-valued random variable X = (Q, W) with Q = f(U, W) = F(W). The measurable mapping f is unknown and F is also an unknown stochastic mapping. The probability distributions of the vector-valued random variables W (control parameter) and U (non-controlled parameter) are given. The stochastic manifold is defined by the unknown graph {w, F(w)} for w belonging to the support of the probability distribution of W. In the PLoM construction, it is not assumed that this stochastic manifold can directly be described; for instance, it is not assumed that there exist properties of local differentiability (moreover, the manifold is stochastic). Under these conditions, the probability measure of X is concentrated in a region of R^n for which the only available information is the cloud of the points of the training set. The PLoM method makes it possible to generate the learned set whose n_ar ≫ N points (additional realizations) are generated by the probability measure that is estimated from the training set. The concentration of the probability measure is preserved thanks to the use of the diffusion-map basis, which enriches the available information coming from the training set. It should also be noted that the estimate of this unknown probability measure cannot be performed from the training set by using an arbitrary estimator. It must be parameterized in a manner that permits convergence to any probability measure as its number of points goes towards infinity. The PLoM method therefore does not only consist in generating points that belong to the region in which the measure is concentrated, but also allows these additional points to be realizations of the estimated probability measure with the convergence properties evoked above. The choice of the kernel estimation method for estimating the probability measure from the training set guarantees that this required fundamental property is satisfied (see [1] and, in particular, Section 5.3 of [5]).

(iii) A difficulty of the PLoM method for certain applications. Since its introduction in 2016, extensions of the original method [1] have been developed in order to address increasingly complex problems for the case of small data: sampling of Bayesian posteriors with a non-Gaussian probabilistic learning on manifolds in very high dimension [6], physics-constrained non-Gaussian probabilistic learning on manifolds, for instance to integrate information coming from experimental measurements during the learning [7], and probabilistic learning on manifolds constrained by nonlinear partial differential equations for small data [8]. During this period, a number of applications were addressed, making it possible to refine the method, to validate it, and to better assess and relax its limitations. However, some challenges have remained. These are cases where the number of points (realizations) in the training set is very small (for example 50) and for which the dimension of the subspace generated by the diffusion-map basis is very close to this number. In this case, the PLoM may not be more efficient than a standard MCMC algorithm that is agnostic to any concentration of the probability measure.
One possible way to improve the PLoM for these very challenging cases is to partition the random vector, of which the training set is a realization, into statistically independent groups in a non-Gaussian framework. In this manner, statistical knowledge about the data set, beyond its localization to a manifold, is relied upon to enhance information extraction and representation. One difficulty with standard grouping into independent components is that it typically divvies up the available samples into separate groups, with each group consisting of a smaller number of samples. Such an approach would not be suitable in the present setting given the already small size of the training set. A more useful approach, which is adopted in this paper, results in groups that are each equal in size to the training set, but for which the dimension of the diffusion-map basis is significantly reduced. This approach corresponds to the novel extension of the PLoM method that we present in this paper.

(iv) A novel extension of the PLoM method to get around the difficulty.
One of the ingredients of this novel extension of the PLoM method is the construction of a partition in non-Gaussian independent random vectors, assuming that it exists. Indeed, there may very well be applications for which the partition yields a single group, identical to the initial random vector. For such cases the present approach of PLoM with partitions affords no further reduction. Concerning the construction of a partition of independent random vectors, a popular method for testing the statistical independence of the ν components of a random vector from a given set of N realizations is the use of the frequency distribution [9] coupled with the use of the Pearson chi-squared (χ²) test [10, 11]. For high dimensions (ν large) and a relatively small value of N, such an approach does not give sufficiently accurate results. In addition, even when this type of method permits testing for independence, the need remains for a fast algorithm for constructing the optimal partition. The independent component analysis (ICA) [12, 13, 14, 15, 16, 17, 18, 19] is a method that consists of extracting independent source signals as a linear mixture of mutually statistically dependent signals, and is often used for source-separation problems. The fundamental hypothesis introduced in the ICA methodology is that the observed vector-valued signal is a linear transformation of statistically independent real-valued signals (that is to say, a linear transformation of a vector-valued signal whose components are mutually independent), and the objective of the ICA algorithms is to identify the best linear operator. In this paper, for the PLoM with partition, we use the procedure proposed in [20], which is an ICA by mutual information and which does not use the construction of a linear transformation. This direct algorithm, which is based on Information Theory, permits the identification of an optimal partition in terms of independent random vectors for any non-Gaussian vector in high dimension that is defined by a relatively small number N of realizations.

(v) Organization of the paper. In Section 2, we present the methodology of the PLoM method with partition. In the process, we reintroduce some necessary key notions of the PLoM method. Sections 3 and 4 are both devoted to applications. Finally, a discussion of the method is presented in the conclusions.

(vi) Novelties presented in the paper.
The main novelty is the development of the PLoM methodology with partition. We also propose a novel algorithm for identifying the optimal values of the hyperparameters related to the construction of the reduced-order diffusion-map basis. For covering the cases for which the normalization introduced by the PLoM is lost with the use of a partition, we introduce constraints by using the Kullback-Leibler minimum cross-entropy principle [7]. The quantification of the concentration of the probability measure for the PLoM, which is performed with the distance introduced in [5], is extended for the PLoM with partition, and is completed by a novel result formulated in terms of a probability upper bound of the measure of concentration.
2. Methodology
2.1. Definition of the training set

Let (w, u) ↦ f(w, u) be any measurable mapping on R^{n_w} × R^{n_u} with values in R^{n_q} representing a mathematical/computational model. Let W be the R^{n_w}-valued random control parameter and let U be a R^{n_u}-valued random non-controlled parameter, defined on a probability space (Θ, T, P). The random vectors W and U are assumed to be statistically independent and they are generally non-Gaussian. The probability distributions P_W(dw) = p_W(w) dw and P_U(du) = p_U(u) du are defined by the probability density functions p_W and p_U with respect to the Lebesgue measures dw and du on R^{n_w} and R^{n_u}. Let Q be the R^{n_q}-valued random variable (QoI) defined on (Θ, T, P) such that Q = f(W, U). It is assumed that N independent realizations {q^j_d, j = 1, ..., N} of Q have been computed such that q^j_d = f(w^j_d, u^j_d), in which {w^j_d, j = 1, ..., N} and {u^j_d, j = 1, ..., N} are N independent realizations of W and U (subscript d refers to the training set). We then consider the random variable X with values in R^n, such that X = (Q, W) with n = n_q + n_w. The training set (initial data set) related to random vector X is then made up of the N independent realizations {x^j_d, j = 1, ..., N} in which x^j_d = (q^j_d, w^j_d) ∈ R^n (note that U is not included in X). Since the data generally pertain to heterogeneous features with potentially wildly distinct supports, it is assumed that the training set has been suitably scaled for the purpose of computational statistics. Let us assume that the measurable mapping f is such that the conditional probability distribution P_{Q|W}(dq|w) given W = w admits a conditional probability density function. It can be deduced (see [7]) that the probability distribution P_X(dx) of X admits a density x ↦ p_X(x) with respect to the Lebesgue measure dx on R^n. The PLoM [1, 5] allows for generating the learned set made up of N_ar ≫ N realizations {x^ℓ_ar, ℓ = 1, ..., N_ar}, which allows for deducing {(q^ℓ_ar, w^ℓ_ar) = x^ℓ_ar, ℓ = 1, ..., N_ar} without using the computational model, but using only the training set (subscript ar refers to the learned set).

2.2. Principal component analysis (PCA) of random vector X

Let x_d ∈ R^n and [C_X] ∈ M⁺_n be the mean vector and the covariance matrix of X estimated with the training set. Let µ_1 ≥ µ_2 ≥ ... ≥ µ_ν > 0 be the ν largest eigenvalues and let φ^1, ..., φ^ν be the associated orthonormal eigenvectors of [C_X]. The integer ν ≤ n is such that, for a given ε_PCA > 0, we have err_PCA(ν) = 1 − Σ_{α=1}^{ν} µ_α / tr[C_X] ≤ ε_PCA. The PCA of X allows for representing X by X^ν = x_d + [Φ] [µ]^{1/2} H such that E{‖X − X^ν‖²} ≤ ε_PCA E{‖X‖²}, in which [Φ] = [φ^1 ... φ^ν] ∈ M_{n,ν} is such that [Φ]^T [Φ] = [I_ν] and [µ] is the diagonal (ν × ν) matrix such that [µ]_{αβ} = µ_α δ_{αβ}. From a numerical point of view, if N < n, then matrix [C_X] is not assembled and ν, [µ], and [Φ] are directly computed using a thin SVD [21] of the matrix whose N columns are (x^j_d − x_d) for j = 1, ..., N. The R^ν-valued random variable H is obtained by projection, H = [µ]^{−1/2} [Φ]^T (X − x_d), and its N independent realizations {η^j_d, j = 1, ..., N} are such that η^j_d = [µ]^{−1/2} [Φ]^T (x^j_d − x_d) ∈ R^ν. Using {η^j_d, j = 1, ..., N}, the estimates of the mean vector and the covariance matrix of H verify 0_ν and [C_H] = [I_ν]. We define the matrix [η_d] = [η^1_d ... η^N_d] ∈ M_{ν,N} whose columns are the N realizations of H, which is such that

‖[η_d]‖² = tr{[η_d]^T [η_d]} = Σ_{j=1}^{N} ‖η^j_d‖² = ν (N − 1).   (1)
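A minimal sketch of this normalization step is given below, assuming NumPy; all variable names are ours and not from the paper. The scaled training matrix is centered, a thin SVD provides [Φ] and [µ] without assembling [C_X], and the realizations η^j_d are obtained by projection.

    import numpy as np

    def pca_normalize(x_d, eps_pca=1e-3):
        """Return eta_d (nu x N) with zero empirical mean and identity covariance.

        x_d : (n, N) array whose columns are the N training points x_d^j.
        A thin SVD of the centered data replaces the assembly of [C_X].
        """
        n, N = x_d.shape
        x_mean = x_d.mean(axis=1, keepdims=True)
        xc = x_d - x_mean                            # columns x_d^j - x_mean
        # thin SVD: xc = U S V^T, eigenvalues of [C_X] are mu_alpha = S^2/(N-1)
        U, S, Vt = np.linalg.svd(xc, full_matrices=False)
        mu = S**2 / (N - 1)
        err = 1.0 - np.cumsum(mu) / mu.sum()         # err_PCA(nu) for nu = 1, 2, ...
        nu = min(int((err > eps_pca).sum()) + 1, len(mu))
        Phi = U[:, :nu]                              # [Phi], n x nu
        eta_d = (Phi.T @ xc) / np.sqrt(mu[:nu])[:, None]   # [mu]^(-1/2) [Phi]^T (x - x_mean)
        return eta_d, Phi, mu[:nu], x_mean

    # usage sketch: eta_d has empirical mean ~0 and covariance ~[I_nu]
    # eta_d, Phi, mu, x_mean = pca_normalize(x_training)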
2.3. PLoM without partition (No-Group PLoM)

2.3.1. Nonparametric estimate of the pdf of random vector H

The PLoM analysis used is the one presented in [1], whose complete mathematical analysis is performed in [5]. A modification of the multidimensional Gaussian kernel-density estimation method [22, 23, 24, 25] is used for constructing the nonparametric estimate p^(N)_H on R^ν of the pdf p_H of random vector H, which is written (see Theorem 3.1 of [5] for the convergence with respect to N) as

p^(N)_H(η) = (1/N) Σ_{j=1}^{N} ( 1/(√(2π) ŝ) )^ν exp{ −(1/(2ŝ²)) ‖ (ŝ/s) η^j_d − η ‖² } ,  ∀ η ∈ R^ν,   (2)

in which s = ( 4/(N(ν + 2)) )^{1/(ν+4)} is the usual Silverman bandwidth (since [C_H] = [I_ν], see for instance [26]) and where ŝ = s ( s² + (N − 1)/N )^{−1/2} has been introduced in order that ∫_{R^ν} η p^(N)_H(η) dη = 0_ν and ∫_{R^ν} η ⊗ η p^(N)_H(η) dη = [I_ν].
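As an illustration, a small sketch of the estimate (2) is given below (NumPy assumed; this is not code from the paper):

    import numpy as np

    def pdf_H_estimate(eta, eta_d):
        """Evaluate p_H^(N)(eta) of Eq. (2) at a point eta in R^nu.

        eta_d : (nu, N) array of training realizations eta_d^j.
        """
        nu, N = eta_d.shape
        s = (4.0 / (N * (nu + 2.0)))**(1.0 / (nu + 4.0))    # Silverman bandwidth
        s_hat = s / np.sqrt(s**2 + (N - 1.0) / N)            # modified bandwidth
        diff = (s_hat / s) * eta_d - eta[:, None]            # columns (s_hat/s) eta_d^j - eta
        q = np.sum(diff**2, axis=0)                          # squared norms
        const = (1.0 / (np.sqrt(2.0 * np.pi) * s_hat))**nu
        return const * np.mean(np.exp(-q / (2.0 * s_hat**2)))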
2.3.2. Reduced-order diffusion-map basis (ROB-DM)

To identify the subset around which the initial data are concentrated, the PLoM relies on the diffusion-map method [27, 28]. The Gaussian kernel is used. Let [K] and [b] be the matrices such that, for all i and j in {1, ..., N}, [K]_{ij} = exp{ −(4 ε_DM)^{−1} ‖η^i_d − η^j_d‖² } and [b]_{ij} = δ_{ij} b_i with b_i = Σ_{j=1}^{N} [K]_{ij}, in which ε_DM > 0 is a smoothing parameter. The matrix [P] = [b]^{−1} [K] ∈ M_N is the transition matrix of a Markov chain that yields the probability of transition in one step. The eigenvalues λ_1, ..., λ_N and the associated eigenvectors ψ^1, ..., ψ^N of the right-eigenvalue problem [P] ψ^α = λ_α ψ^α are such that 1 = λ_1 > λ_2 ≥ ... ≥ λ_N and are computed by solving the generalized eigenvalue problem [K] ψ^α = λ_α [b] ψ^α with the normalization ⟨[b] ψ^α, ψ^β⟩ = δ_{αβ}. The eigenvector ψ^1 associated with λ_1 = 1 is a constant vector. For a given integer κ ≥ 0, the diffusion-map basis {g^1, ..., g^α, ..., g^N} is a vector basis of R^N defined by g^α = λ_α^κ ψ^α. For a given integer m with 3 ≤ m ≤ N, the reduced-order diffusion-map basis of order m is defined as the family {g^1, ..., g^m}, which is represented by the matrix [g_m] = [g^1 ... g^m] ∈ M_{N,m} with g^α = (g^α_1, ..., g^α_N) and [g_m]_{ℓα} = g^α_ℓ. This ROB-DM depends on two parameters, ε_DM and m, which have to be identified. It is proven in [5] that the PLoM method does not depend on κ, which can therefore be chosen equal to 0.

It should be noted that, if ν = 1, then there is no real interest in using the ROB-DM and, in this case, we propose to take m = N and [g_N] = [I_N]. For non-trivial applications analyzed with the PLoM without partition, we always have ν > 1 (with, generally, 1 ≪ ν ≤ n). However, for the PLoM method with partition, optimal partitions can be found for which some groups may have dimension 1 (hence the consideration of possible cases of this type).
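A compact sketch of this construction is given below (SciPy assumed; κ is taken equal to 0 as indicated above; names are ours):

    import numpy as np
    from scipy.linalg import eigh

    def diffusion_map_basis(eta_d, eps_dm, m):
        """Return the eigenvalues and [g_m] in M_{N,m} for the training matrix eta_d (nu x N)."""
        N = eta_d.shape[1]
        d2 = np.sum((eta_d[:, :, None] - eta_d[:, None, :])**2, axis=0)  # ||eta^i - eta^j||^2
        K = np.exp(-d2 / (4.0 * eps_dm))
        b = K.sum(axis=1)                                # diagonal of [b]
        # generalized eigenvalue problem [K] psi = lambda [b] psi, with <[b] psi^a, psi^b> = delta_ab
        lam, psi = eigh(K, np.diag(b))
        order = np.argsort(lam)[::-1]                    # sort eigenvalues decreasingly
        lam, psi = lam[order], psi[:, order]
        g = psi                                          # kappa = 0: g^alpha = psi^alpha
        return lam, g[:, :m]                             # eigenvalues and [g_m]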
2.3.3. Novel algorithm for identifying the optimal values ε_o and m_o of ε_DM and m

Let us assume that ν ≥ 2. For estimating the optimal value ε_o of ε_DM and the optimal value m_o of m, the criterion on the eigenvalues given in Section 5.2 of [5] must be satisfied for the PLoM method to be applicable. This criterion can be summarized as follows. We have to find the value m_o ≤ N of m and the smallest value ε_o > 0 of ε_DM such that

1 = λ_1 > λ_2(ε_o) ≃ ... ≃ λ_{m_o}(ε_o) ≫ λ_{m_o+1}(ε_o) ≥ ... ≥ λ_N(ε_o) > 0,   (3)

with a jump in amplitude equal to 10 between λ_{m_o}(ε_o) and λ_{m_o+1}(ε_o). This property means that we have to find m_o ≤ N and the smallest positive value ε_o in order (i) to have λ_2(ε_o) < 1, (ii) to obtain an almost constant plateau from λ_2(ε_o) to λ_{m_o}(ε_o), and (iii) to have a jump of amplitude 10 between λ_{m_o}(ε_o) and λ_{m_o+1}(ε_o). A further in-depth analysis makes it possible to state the following new criterion and algorithm to easily estimate ε_o and m_o. Let ε_DM ↦ Jump(ε_DM) be the function on ]0, +∞[ defined by Jump(ε_DM) = λ_{m_o+1}(ε_DM)/λ_2(ε_DM). The novel algorithm is given in Algorithm 1.

Algorithm 1: Algorithm for estimating the optimal values m_o of m and ε_o of ε_DM
  if ν = 1 then
     Set the value of m to m_o = N and [g_N] = [I_N].
  end if
  if ν ≥ 2 then
     Set the value of m to m_o = ν + 1.
     Identify the smallest possible value ε_o of ε_DM in order that Jump(ε_o) ≤ 0.1.
  end if
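To illustrate Algorithm 1, a minimal search for ε_o could look as follows; it builds on the diffusion_map_basis sketch given above, the tolerance 0.1 follows the criterion Jump(ε_o) ≤ 0.1, and the candidate grid is our choice.

    import numpy as np

    def find_eps_o(eta_d, eps_grid, jump_tol=0.1):
        """Return (m_o, eps_o): m_o = nu + 1 and the smallest eps_DM with Jump <= jump_tol."""
        nu = eta_d.shape[0]
        m_o = nu + 1
        for eps in sorted(eps_grid):                 # scan increasing candidate values
            lam, _ = diffusion_map_basis(eta_d, eps, m_o)
            jump = lam[m_o] / lam[1]                 # Jump = lambda_{m_o+1} / lambda_2 (0-based indexing)
            if jump <= jump_tol:
                return m_o, eps
        return m_o, None                             # no candidate satisfied the criterion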
Figure 1 shows an illustration: we have m_o = ν + 1 = 61; the optimal value of ε_DM that satisfies the criterion is ε_o = 65 and yields Figure 1a; if a smaller value than 65 is chosen, for instance the value 5, then there are many eigenvalues close to 1, as shown in Figure 1b; if the smallest value for ε_DM is not selected, for example taking the value 100, then the plateau is not obtained, as shown in Figure 1c. For these two bad values of ε_DM, the calculated diffusion-map basis is not adapted to the PLoM procedure.

Figure 1: Illustration of the criterion defined by Eq. (3) for identifying ε_o. (a) Correct value ε_o of ε_DM; (b) bad value of ε_DM that is smaller than ε_o; (c) bad value of ε_DM that is larger than ε_o.

2.3.4. Random matrices [H^N], [H^N_m], [H^N_{m_o}], and MCMC generator

Let H^(N) be the R^ν-valued random variable defined on (Θ, T, P) for which the pdf is p^(N)_H defined by Eq. (2). Let [H^N] be the random matrix with values in M_{ν,N} such that [H^N] = [H^1 ... H^N], in which H^1, ..., H^N are N independent copies of H^(N). It can be seen that E{H^(N)} = 0_ν and E{H^(N) ⊗ H^(N)} = [I_ν]. Note that H^1, ..., H^N are not taken as N independent copies of H, whose pdf p_H is unknown, but are taken as N independent copies of H^(N), whose pdf p^(N)_H is known. The PLoM method introduces the M_{ν,N}-valued random matrix [H^N_m] = [Z_m] [g_m]^T with 3 ≤ m ≤ N, corresponding to a data-reduction representation of random matrix [H^N], in which [g_m] is the ROB-DM and where [Z_m] is a M_{ν,m}-valued random matrix whose probability measure p_{[Z_m]}([z]) d[z] is explicitly described by Proposition 2 of [5]. In the PLoM method, the MCMC generator of random matrix [Z_m] belongs to the class of Hamiltonian Monte Carlo methods [29], is explicitly described in [1], and is mathematically detailed in Theorem 6.3 of [5]. For generating the learned set, the best probability measure of [H^N_m] is obtained for m = m_o and using the previously defined [g_{m_o}]. For these optimal quantities m_o and [g_{m_o}], the generator allows for computing n_MC realizations {[z^ℓ_ar], ℓ = 1, ..., n_MC} of [Z_{m_o}] and therefore, for deducing the n_MC realizations {[η^ℓ_ar], ℓ = 1, ..., n_MC} of [H^N_{m_o}]. The reshaping of matrix [η^ℓ_ar] ∈ M_{ν,N} allows for obtaining N_ar = n_MC × N additional realizations {η^{ℓ'}_ar, ℓ' = 1, ..., N_ar} of H. These additional realizations allow for estimating converged statistics on H and then on X, such as pdf, moments, or conditional expectations of the type E{ξ(Q) | W = w} for w given in R^{n_w} and for any given vector-valued function ξ defined on R^{n_q}.

2.3.5. Quantifying the concentration of the probability measure of random matrix [H^N_{m_o}]

In [5], for 3 ≤ m ≤ N, we have introduced an L²-distance d_N(m) of random matrix [H^N_m] to matrix [η_d] in order to quantify the concentration of the probability measure of random matrix [H^N_m], which is informed by the initial data set represented by matrix [η_d]. The square of this distance is defined by

d²_N(m) = E{ ‖[H^N_m] − [η_d]‖² } / ‖[η_d]‖².   (4)

Let M_o = {m_o, m_o + 1, ..., N}, in which m_o is the optimal value of m previously defined. Theorem 7.8 of [5] shows that min_{m ∈ M_o} d²_N(m) ≤ 1 + m_o/(N − 1) < d²_N(N), which means that the PLoM method, for m = m_o and [g_{m_o}], is a better method than the usual one corresponding to d²_N(N) = 1 + N/(N − 1). Using the n_MC realizations {[η^ℓ_ar], ℓ = 1, ..., n_MC} of [H^N_{m_o}], we have the estimate d²_N(m_o) ≃ (1/n_MC) Σ_{ℓ=1}^{n_MC} ‖[η^ℓ_ar] − [η_d]‖² / ‖[η_d]‖².
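For completeness, this estimate of d²_N(m_o) from the n_MC learned matrices can be computed as in the following sketch (NumPy assumed; array names are ours):

    import numpy as np

    def concentration_distance_sq(eta_learned, eta_d):
        """Estimate d_N^2(m_o) of Eq. (4).

        eta_learned : (n_MC, nu, N) array of realizations [eta_ar^l] of [H^N_{m_o}].
        eta_d       : (nu, N) training matrix.
        """
        norm_d2 = np.sum(eta_d**2)                    # ||[eta_d]||^2 (Frobenius)
        diffs = eta_learned - eta_d[None, :, :]
        return np.mean(np.sum(diffs**2, axis=(1, 2))) / norm_d2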
2.4. PLoM with partition (With-Group PLoM)

In this section, for ν ≥ 2, we present the extension of the PLoM analysis for which statistically independent groups are constructed using an optimal partition of random vector H.
2.4.1. Optimal partition of random vector H

From the training set {η^j_d, j = 1, ..., N}, the optimal partition of H = (H_1, ..., H_ν) is performed using the algorithm proposed in [20]. Such a partition is composed of n_p groups consisting of n_p mutually independent random vectors Y^1, ..., Y^{n_p}. Since H is a normalized random vector (zero mean vector and covariance matrix equal to the identity matrix), for i = 1, ..., n_p, Y^i is a normalized R^{ν_i}-valued random variable Y^i = (Y^i_1, ..., Y^i_{ν_i}) = (H_{r^i_1}, ..., H_{r^i_{ν_i}}), in which 1 ≤ r^i_1 < r^i_2 < ... < r^i_{ν_i} ≤ ν, with ν = ν_1 + ... + ν_{n_p}, and where Y^i_k = H_{r^i_k}. Random vector Y^i is non-Gaussian and such that the estimate of its mean vector is 0_{ν_i} and the estimate of its covariance matrix is [C_{Y^i}] = [I_{ν_i}]. We then have H = perm(Y^1, ..., Y^{n_p}), in which perm is the permutation operator acting on the components of vector H̃ = (Y^1, ..., Y^{n_p}) in order to reconstitute H = perm(H̃). For each group i, the training set is represented by the matrix [η^i_d] ∈ M_{ν_i,N}, whose columns are the N realizations {η^{i,j}_d, j = 1, ..., N} of the R^{ν_i}-valued random variable Y^i, which are deduced from an adapted extraction (due to the permutations) of the components of the vectors {η^j_d, j = 1, ..., N}. The partition is identified by constructing the function i_ref ↦ τ(i_ref) of the mutual information defined by Eq. (3.44) of [20] and then by deducing the optimal level i°_ref defined by Eq. (3.46) of [20].
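The algorithm of [20] is not reproduced here. Purely as an illustration of the underlying idea (grouping components whose estimated statistical dependence exceeds a level), one could proceed as in the following sketch, which uses a histogram-based pairwise mutual-information estimate and connected components; the estimator, the threshold, and the bin count are our illustrative choices, not those of [20].

    import numpy as np

    def pairwise_mutual_information(eta_d, bins=8):
        """Histogram-based MI estimate for every pair of components of H (nu x N data)."""
        nu, N = eta_d.shape
        mi = np.zeros((nu, nu))
        for a in range(nu):
            for b in range(a + 1, nu):
                pab, _, _ = np.histogram2d(eta_d[a], eta_d[b], bins=bins, density=False)
                pab = pab / N
                pa, pb = pab.sum(axis=1), pab.sum(axis=0)
                mask = pab > 0
                mi[a, b] = mi[b, a] = np.sum(pab[mask] * np.log(pab[mask] / np.outer(pa, pb)[mask]))
        return mi

    def groups_from_mi(mi, threshold):
        """Connected components of the graph whose edges are pairs with MI above the threshold."""
        nu = mi.shape[0]
        unvisited, groups = set(range(nu)), []
        while unvisited:
            stack, comp = [unvisited.pop()], []
            while stack:
                k = stack.pop()
                comp.append(k)
                for j in [j for j in list(unvisited) if mi[k, j] > threshold]:
                    unvisited.remove(j)
                    stack.append(j)
            groups.append(sorted(comp))
        return groups   # list of index groups defining Y^1, ..., Y^{n_p}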
2.4.2. PLoM analysis for each group Y^i (With-Group PLoM)

Let i be fixed in {1, ..., n_p}. The PLoM method (summarized in Section 2.3) is applied to the R^{ν_i}-valued random variable Y^i of the optimal partition Y^1, ..., Y^{n_p} of H = perm(Y^1, ..., Y^{n_p}). The parameters of the PLoM are thus the following.

1) The Silverman bandwidth is s_i = ( 4/(N(ν_i + 2)) )^{1/(ν_i+4)} (since [C_{Y^i}] = [I_{ν_i}]) and the modified bandwidth is ŝ_i = s_i ( s_i² + (N − 1)/N )^{−1/2}.

2) Algorithm 1 is used. If ν_i = 1, then m_{i,o} = N and [g^i_N] = [I_N]. If ν_i ≥ 2, the optimal parameter m_{i,o} of the dimension m_i of the ROB-DM of group i is such that m_{i,o} = ν_i + 1. The optimal parameter ε_{i,o} of ε_{i,DM} is calculated as explained in Section 2.3.3. The ROB-DM of group i of order m_{i,o} is represented by the matrix [g^i_{m_{i,o}}] ∈ M_{N,m_{i,o}}.

3) The learned set of the random matrix [Y^{N,i}_{m_{i,o}}] = [Z^i_{m_{i,o}}] [g^i_{m_{i,o}}]^T is computed for m_i = m_{i,o} and by using [g^i_{m_{i,o}}]. Finally, the n_MC realizations {[η^{i,ℓ}_ar], ℓ = 1, ..., n_MC} of [Y^{N,i}_{m_{i,o}}] are computed with the MCMC generator and, by reshaping, we obtain the N_ar = n_MC × N additional realizations {η^{i,ℓ'}_ar, ℓ' = 1, ..., N_ar}.

2.4.3. Possible loss of the normalization

Numerical experiments have been carried out for numerous cases with respect to the number of groups and the dimension of each group. These experiments have shown the following. In general, the mean value of Y^i, estimated using the N_ar additional realizations {η^{i,ℓ'}_ar, ℓ' = 1, ..., N_ar}, is sufficiently close to zero. Likewise, the estimate of the covariance matrix is sufficiently close to a diagonal matrix. However, sometimes the diagonal entries of the estimated covariance matrix can be lower than 1 (for instance 0.6 or 0.7) for some values of ν_i (but not systematically; this behavior is application-dependent). In these situations, normalization and structure can be recovered by imposing constraints in the PLoM method.

2.4.4. Constraints on the components of Y^i if loss of normalization occurs

As explained in Section 2.4.3, if appropriate for group i, the constraints {E{(Y^i_k)²} = 1, k = 1, ..., ν_i} can readily be introduced in the PLoM. For that, we use the method and the iterative algorithm presented in Sections 5.5 and 5.6 of [7]. The method consists of constructing the generator using the PLoM for each independent group, defined in Section 2.4.2, and the Kullback-Leibler minimum cross-entropy principle. The resulting optimization problem is formulated using Lagrange multipliers associated with the constraints. The optimal solution for the Lagrange multipliers is computed using an efficient iterative algorithm. At each iteration, the MCMC generator of the PLoM is used. The constraints are rewritten as

E{ h^i(Y^i) } = b^i,   (5)

in which the function h^i = (h^i_1, ..., h^i_{ν_i}) and the vector b^i = (b^i_1, ..., b^i_{ν_i}) are such that h^i_k(Y^i) = (Y^i_k)² and b^i_k = 1 for k in {1, ..., ν_i}. Eqs. (71) and (72) of [7] involve the Lagrange multiplier λ = (λ_1, ..., λ_{ν_i}) ∈ R^{ν_i} associated with the constraints defined by Eq. (5). These two equations, which define the nonlinear mapping [u] ↦ [L^i_λ([u])] from M_{ν_i,N} into M_{ν_i,N} (drift of the Itô stochastic differential equation of the PLoM generator), have to be modified as follows. For α = 1, ..., ν_i, for ℓ = 1, ..., N, and for [u] = [u^1 ... u^N] in M_{ν_i,N}, we have

[L^i_λ([u])]_{αℓ} = (1/ρ_i(u^ℓ)) ∂ρ_i(u^ℓ)/∂u^ℓ_α − λ_α u^ℓ_α ,
ρ_i(u^ℓ) = (1/N) Σ_{j=1}^{N} ( 1/(√(2π) ŝ_i) )^{ν_i} exp{ −(1/(2ŝ_i²)) ‖ (ŝ_i/s_i) η^{i,j}_d − u^ℓ ‖² }.

The iterative algorithm computes the sequence {λ^ι}_{ι≥1}, which is convergent. If difficulties of convergence appear, a relaxation factor (less than 1) is introduced for computing λ^{ι+1} as a function of λ^ι. For controlling the convergence as a function of the iteration number ι, we use the error function ι ↦ err_i(ι) defined by

err_i(ι) = ‖ b^i − E{h^i(Y^i_{λ^ι})} ‖ / ‖ b^i ‖.   (6)

At each iteration ι, E{h^i(Y^i_{λ^ι})} is estimated with the N_ar additional realizations deduced by reshaping of the n_MC realizations of the M_{ν_i,N}-valued random matrix [Y^{N,i}_{m_{i,o}}(λ^ι)] that depends on λ^ι. These realizations are generated by the MCMC algorithm of the PLoM under the constraints.
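The outer iteration on the Lagrange multipliers can be pictured with the following sketch; sample_group_plom stands for the (not shown) constrained PLoM generator of group i and is a placeholder, and the simple relaxed fixed-point update is only illustrative of the kind of scheme referred to above, not the exact update of [7].

    import numpy as np

    def iterate_multipliers(sample_group_plom, b_i, lam0, relax=0.5, tol=1e-3, max_iter=50):
        """Illustrative outer loop for the constraints E{(Y_k^i)^2} = b_k^i = 1.

        sample_group_plom(lam) -> (N_ar, nu_i) array of realizations of Y^i generated
        with the drift modified by the multiplier lam (placeholder for the PLoM generator).
        """
        lam = np.asarray(lam0, dtype=float)
        for iteration in range(max_iter):
            y = sample_group_plom(lam)                  # learned realizations for current lam
            moments = np.mean(y**2, axis=0)             # estimate of E{h^i(Y^i)}
            err = np.linalg.norm(b_i - moments) / np.linalg.norm(b_i)   # Eq. (6)
            if err <= tol:
                break
            # relaxed correction: move lam in the direction that reduces the moment mismatch
            lam = lam + relax * (moments - b_i)
        return lam, err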
2.4.5. Learned set of H constructed with the groups (With-Group PLoM)

We have seen above (see Section 2.4.2-(3)) how the learned set {[η^{i,ℓ}_ar], ℓ = 1, ..., n_MC} of random matrix [Y^{N,i}_{m_{i,o}}] is generated using With-Group PLoM for each group i = 1, ..., n_p (using or not the constraints). From this information, we can directly deduce the learned set {[η^{wg,ℓ}_ar], ℓ = 1, ..., n_MC} of [H^{wg,N}_{m_o}], which corresponds to the concatenation, with an adapted extraction of the rows (due to the permutations), of the matrices {[Y^{N,i}_{m_{i,o}}], i = 1, ..., n_p}, and where m_o = (m_{1,o}, ..., m_{n_p,o}). We have introduced a superscript wg for distinguishing With-Group PLoM from No-Group PLoM. The reshaping of matrix [η^{wg,ℓ}_ar] ∈ M_{ν,N} allows for obtaining the N_ar = n_MC × N additional realizations {η^{wg,ℓ'}_ar, ℓ' = 1, ..., N_ar} of H, computed using With-Group PLoM.

2.4.6. Quantifying the concentration of the probability measure of random matrices [H^{wg,N}_{m_o}] and {[Y^{N,i}_{m_{i,o}}], i = 1, ..., n_p}

For m_o = (m_{1,o}, ..., m_{n_p,o}), the square of the distance of the random matrix [H^{wg,N}_{m_o}] to matrix [η_d] is directly given by Eq. (4), which is rewritten here as

d²_{wg,N}(m_o) = E{ ‖[H^{wg,N}_{m_o}] − [η_d]‖² } / ‖[η_d]‖².   (7)

The mathematical expectation is estimated using the n_MC realizations {[η^{wg,ℓ}_ar], ℓ = 1, ..., n_MC}. Using again Eq. (4), for i ∈ {1, ..., n_p}, the square of the distance of random matrix [Y^{N,i}_{m_{i,o}}] to matrix [η^i_d] is given by

d²_{i,N}(m_{i,o}) = E{ ‖[Y^{N,i}_{m_{i,o}}] − [η^i_d]‖² } / ‖[η^i_d]‖²,   (8)

which is estimated using the n_MC realizations {[η^{i,ℓ}_ar], ℓ = 1, ..., n_MC}. Eq. (7) contains the information defined by Eq. (8). Indeed, it is easy to verify that we have the relation

d²_{wg,N}(m_o) = Σ_{i=1}^{n_p} (ν_i / ν) d²_{i,N}(m_{i,o}).   (9)

This relation follows from the fact that the Frobenius norm splits over the row groups, ‖[H^{wg,N}_{m_o}] − [η_d]‖² = Σ_{i=1}^{n_p} ‖[Y^{N,i}_{m_{i,o}}] − [η^i_d]‖², together with Eq. (1) applied to each group, which gives ‖[η^i_d]‖² = ν_i (N − 1) and ‖[η_d]‖² = ν (N − 1). For n_p > 1, the gain obtained with the partition can be quantified by comparing d²_{wg,N}(m_o) defined by Eq. (7) with d²_N(m_o) defined by Eq. (4). If there is a gain, we must have

d²_{wg,N}(m_o) < d²_N(m_o).   (10)

This expected inequality is assessed, for the applications for which n_p > 1, in Sections 3 and 4.
Proposition 1 (Probability upper bound of the measure of concentration). Let ε be a given real number such that 0 < ε < 1. Let d²_N(m_o) be defined by Eq. (4) for m = m_o. We then have

Proba{ ‖[H^N_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } ≤ d²_N(m_o)/ε².   (11)

Let r² be the positive real number (geometric mean) such that r² = { Π_{i=1}^{n_p} d²_{i,N}(m_{i,o}) }^{1/n_p}, in which d²_{i,N}(m_{i,o}) is defined by Eq. (8). We then have

Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } ≤ (r²/ε²)^{n_p}.   (12)

PROOF (Proof of Proposition 1). (i) Eq. (11) directly results from the Markov inequality applied to the random variable ‖[H^N_{m_o}] − [η_d]‖² / ‖[η_d]‖². (ii) Let Ξ_i = ‖[Y^{N,i}_{m_{i,o}}] − [η^i_d]‖² and ζ_i = ‖[η^i_d]‖². Therefore, Eq. (8) can be rewritten as d²_{i,N}(m_{i,o}) = E{Ξ_i}/ζ_i. If, for all i ∈ {1, ..., n_p}, we have Ξ_i ≥ ε² ζ_i a.s., then ‖[H^{wg,N}_{m_o}] − [η_d]‖² = Σ_{i=1}^{n_p} Ξ_i ≥ ε² Σ_{i=1}^{n_p} ζ_i = ε² ‖[η_d]‖² a.s., that is to say, ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε a.s. (iii) Using result (ii) above, it can be deduced that Proba{ ∩_{i=1}^{n_p} {Ξ_i/ζ_i ≥ ε²} } = Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε }. (iv) Due to the partition, the random matrices [Y^{N,1}_{m_{1,o}}], ..., [Y^{N,n_p}_{m_{n_p,o}}] are statistically independent, and thus Ξ_1, ..., Ξ_{n_p} are statistically independent. Therefore, we can write Proba{ ∩_{i=1}^{n_p} {Ξ_i/ζ_i ≥ ε²} } = Π_{i=1}^{n_p} Proba{ Ξ_i/ζ_i ≥ ε² }. (v) The results (iii) and (iv) above yield Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } = Π_{i=1}^{n_p} Proba{ Ξ_i/ζ_i ≥ ε² }. The use of the Markov inequality allows us to write Proba{ Ξ_i/ζ_i ≥ ε² } ≤ E{Ξ_i}/(ε² ζ_i) = d²_{i,N}(m_{i,o})/ε². Substituting this inequality into the right-hand side of the last equality allows us to write Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } ≤ Π_{i=1}^{n_p} { d²_{i,N}(m_{i,o})/ε² } = (r²/ε²)^{n_p}, which is Eq. (12).
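A small helper illustrating how Eqs. (9) and (12) can be evaluated once the per-group distances d²_{i,N}(m_{i,o}) are available is given below (the numbers in the usage comment are placeholders, not values from the paper):

    import numpy as np

    def partition_concentration(d2_groups, nu_groups, eps):
        """Return (d2_wg, bound): Eq. (9) and the upper bound of Eq. (12).

        d2_groups : per-group squared distances d2_{i,N}(m_{i,o}) of Eq. (8).
        nu_groups : group dimensions nu_i.
        eps       : the level 0 < eps < 1 used in Eq. (12).
        """
        d2 = np.asarray(d2_groups, dtype=float)
        nu = np.asarray(nu_groups, dtype=float)
        d2_wg = float(np.sum(nu * d2) / nu.sum())            # Eq. (9)
        r2 = float(np.exp(np.mean(np.log(d2))))              # geometric mean r^2
        bound = (r2 / eps**2) ** len(d2)                      # (r^2/eps^2)^{n_p}, Eq. (12)
        return d2_wg, bound

    # usage sketch (placeholder values):
    # d2_wg, bound = partition_concentration([0.02, 0.03, 0.01], [10, 20, 30], eps=0.5)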
3. Application 1
The probabilistic model is chosen so that the partition in terms of statistically independent groups is known. This will serve to validate the proposed methodology. This application can easily be reproduced. We directly construct the normalized non-Gaussian R^ν-valued random variable H = (H_1, ..., H_ν) with ν = 60. Its probabilistic model is described in Appendix A. The random vector X from which H is deduced by a PCA is not constructed. It should be noted that this application is very difficult for learning methods, taking into account the high degree of the polynomials in the model, which induces a complex geometry of the support of the probability measure of H.

A reference data set of N_ref points and a training set {η^j_d, j = 1, ..., N} of N points are generated using the probabilistic model of H. The learned set {η^ℓ_ar, ℓ = 1, ..., N_ar} is generated by the PLoM method (without or with groups) with N_ar = n_MC × N realizations. The mean vector and the covariance matrix [C_H] of H, which are estimated with the N realizations of the training set, are such that the mean vector is 0_ν and [C_H] = [I_ν].

3.1. PLoM analysis without group (No-Group PLoM)

Algorithm 1 is used for the calculation of the reduced-order diffusion-map basis [g_{m_o}] of the R^ν-valued random variable H. The optimal dimension is m_o = ν + 1 = 61. Figure 2a displays the function ε_DM ↦ Jump(ε_DM) and shows that the optimal value ε_o of the smoothing parameter ε_DM is ε_o = 65, for which Jump(ε_o) = 0.1. For this value ε_o of ε_DM, Figure 2b shows the graph of the function α ↦ λ_α(ε_o). It can be seen that the criterion defined by Eq. (3) is satisfied. The PLoM algorithm with no group is then used for generating the learned set {η^ℓ_ar, ℓ = 1, ..., N_ar}.
Figure 2: Diffusion-map basis of the standard PLoM (No group). (a) Identification of the value ε_o of ε_DM; (b) eigenvalues of the transition matrix.

Figure 3 shows the pdf of four of the components of H estimated with the learned set. Each pdf is estimated (i) with the N realizations of the training set, (ii) with the N_ref realizations of the reference data set, (iii) with the N_ar additional realizations generated with the Hamiltonian MCMC algorithm corresponding to the PLoM with m_o = N and [g_{m_o}] = [I_N], referenced as "No-PLoM", and finally, (iv) with the N_ar realizations of the learned set constructed with the PLoM for which the partition in groups is not taken into account, referenced as "No-Group PLoM" (in this case no constraints are applied). It can be seen that the No-PLoM estimation yields a large scattering with an important increase of the dispersion (and thus a loss of the concentration of the probability measure), while No-Group PLoM preserves the concentration of the probability measure (as expected) and the pdf estimates are good enough. These estimations will be improved by using the PLoM with groups, referenced as "With-Group PLoM".
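For reproducing this type of marginal-pdf comparison, a one-dimensional kernel estimate from any of these sets of realizations can be obtained as in the following sketch (SciPy's Gaussian KDE is used here for plotting convenience; it is not the estimator of Eq. (2), and the array eta_learned_flat is a placeholder name):

    import numpy as np
    from scipy.stats import gaussian_kde

    def marginal_pdf(samples_k, grid):
        """Estimate the pdf of one component H_k from its realizations.

        samples_k : 1-D array of realizations of H_k (training, reference, or learned set).
        grid      : 1-D array of points where the pdf is evaluated.
        """
        return gaussian_kde(samples_k)(grid)

    # usage sketch:
    # grid = np.linspace(-3.0, 3.0, 200)
    # pdf_learned = marginal_pdf(eta_learned_flat[k, :], grid)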
3.2. Optimal partition of H

The optimal partition is computed as explained in Section 2.4.1. Figure 4a displays the graph of i_ref ↦ τ(i_ref), which shows the optimal level i°_ref. The optimal partition is composed of n_p = 3 groups with ν_1 = 10, ν_2 = 20, and ν_3 = 30 and with Y^1 = (H_1, ..., H_10), Y^2 = (H_11, ..., H_30), and Y^3 = (H_31, ..., H_60), which correspond to the model introduced in Appendix A for generating the training set. This result constitutes an additional validation of the optimal partition algorithm that is used for non-Gaussian random vectors. For illustration, Figure 4b displays the graph of the joint pdf of two components of H.

Figure 3: pdf of four components of H estimated with (i) the training set (black thin), (ii) the reference data set (red thick), (iii) No-PLoM (dashed), and (iv) No-Group PLoM (blue thin).

Figure 4: Partition of H in n_p mutually independent random vectors Y^1, ..., Y^{n_p}. (a) Graph of function i_ref ↦ τ(i_ref); (b) joint pdf of two components of H.

3.3. PLoM analysis with groups (With-Group PLoM)

Algorithm 1 is used for each group i = 1, 2, 3.
We then have m_{i,o} = ν_i + 1. The training set {η^{i,j}_d, j = 1, ..., N} of Y^i is used. A graph similar to the one shown in Figure 2a is constructed for identifying the optimal value ε_{i,o} of the smoothing parameter ε_{i,DM}, yielding the values ε_{1,o}, ε_{2,o}, and ε_{3,o}. Figure 5a displays the eigenvalues of the transition matrix of each group i computed for ε_{i,DM} = ε_{i,o}. It can be seen that all the required criteria are satisfied. For each group i = 1, 2, 3, the PLoM method with groups is used for generating the learned set {η^{i,j}_ar, j = 1, ..., N_ar}. The constraints E{(Y^i_k)²} = 1, k ∈ {1, ..., ν_i}, are applied and the iterative algorithm introduced in Section 2.4.4 is used. Figure 5b displays the error function ι ↦ err_i(ι) of group i, defined by Eq. (6), which shows the convergence of the iterative algorithm. It should be noted that the convergence could have been pushed further, but the numerical experiments showed that the additional gain obtained is negligible. In addition, numerical experiments have been carried out to compare the efficiency of the type of constraints. We have verified that taking into account all the constraints (mean of Y^i equal to 0_{ν_i} and covariance matrix [C_{Y^i}] = [I_{ν_i}]) did not provide significant improvements on the preservation of the concentration of the probability measure compared to the sole application of the constraints E{(Y^i_k)²} = 1, k ∈ {1, ..., ν_i}.

Figure 5: Diffusion-map basis and error function of the iterative algorithm for each one of the 3 groups, i = 1, 2, 3. (a) Eigenvalues of the transition matrix of group i computed for ε_{i,DM} = ε_{i,o}; (b) error function ι ↦ err_i(ι) of group i as a function of the iteration number ι.

Figure 6 shows the mean value and the standard deviation of the components H_k, k = 1, ..., ν, of H estimated using the learned set generated by No-Group PLoM (see Section 3.1) and by With-Group PLoM. Figure 6a shows that the mean values are reasonably small with respect to 1 and therefore that it is not necessary to improve them by introducing constraints on the mean. Figure 6b shows that the standard deviations are improved by using With-Group PLoM, for which the constraints are taken into account.

Figure 6: Mean value and standard deviation of the components H_k, k = 1, ..., ν, of H estimated using the learned set generated by No-Group PLoM (thin line) and by With-Group PLoM (thick line).

Figure 7 shows the pdf of four components of H estimated with the learned set. Similarly to Section 3.1, each pdf is estimated (i) with the N realizations of the training set, (ii) with the N_ref realizations of the reference data set, (iii) with the N_ar additional realizations computed by No-PLoM, and (iv) with the N_ar realizations computed by With-Group PLoM. Comparing Figure 3 with Figure 7, it can be seen that the use of groups improves the pdf estimates as expected. It can also be noted that the estimates are excellent for this very difficult case, in particular by comparison with the usual approach (see the dashed lines corresponding to No-PLoM).

Figure 7: pdf estimated with (i) the training set (black thin), (ii) the reference data set (red thick), (iii) No-PLoM (dashed), and (iv) With-Group PLoM (blue thin).
3.4. Quantifying the concentration of the probability measure

For No PLoM, the computations are performed as explained in Section 3.1, for No-Group PLoM as in Sections 2.3 and 3.1, and for With-Group PLoM as in Sections 3.3 and 2.4.

(i) The results concerning the concentration of the probability measure are summarized in Table 1. For No PLoM, d²_N(N) is computed with Eq. (4), for which m = N. For No-Group PLoM, d²_N(m_o) is also computed with Eq. (4) but with m = m_o = 61. For With-Group PLoM, d²_{wg,N}(m_o) is computed with Eq. (7), for which m_o = (m_{1,o}, m_{2,o}, m_{3,o}) with m_{1,o} = 11, m_{2,o} = 21, and m_{3,o} = 31, and where the d²_{i,N}(m_{i,o}) are computed using Eq. (8). We obtain d²_N(N) ≃ 2 and d²_N(m_o) = 0.094 ≪ 2, which shows that the usual PLoM method (without group) effectively preserves the concentration of the probability measure, unlike the usual MCMC method that does not. For the PLoM with groups, an improvement is observed relative to the PLoM without group, as indicated by the evaluation d²_{wg,N}(m_o) = 0.016 ≪ d²_N(m_o) = 0.094. The quantification in terms of the probability of the random relative distance defined by Eq. (12) confirms this improvement. Note that the probability (r²/ε²)^{n_p} corresponds to an upper bound, the probability being certainly smaller.
Table 1: Concentration of the probability measure for Application 1.

  No PLoM:            d²_N(N) = 2.00
  No-Group PLoM:      d²_N(m_o) = 0.094   (m_o = 61)
  With-Group PLoM:    d²_{wg,N}(m_o) = 0.016
  Proba by Eq. (12):  Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } ≤ (r²/ε²)^{n_p}

(ii) Figure 8 displays the clouds of the N_ar points {η^ℓ_ar, ℓ = 1, ..., N_ar} for three components of H, generated (a) without using PLoM (No-PLoM), (b) using the PLoM method without group (No-Group PLoM), and (c) using the PLoM method with groups (With-Group PLoM). The three figures confirm the analysis presented in point (i) above.

Figure 8: Clouds of the N_ar points for three components of H computed with (a) No-PLoM, (b) No-Group PLoM, and (c) With-Group PLoM.
4. Application 2
The second application is devoted to a supervised learning problem Q = f(W, U) (see Section 2.1) in high dimension, for which the uncontrolled random parameter is the R^{n_u}-valued random variable U with n_u = 420 000, the random control parameter is the R^{n_w}-valued random variable W with n_w = 2, and the QoI is the R^{n_q}-valued random variable Q with n_q = 10 098.
This application relates to a linear elastic system modeled by an elliptic stochastic boundary value problem (BVP) in a 3D bounded domain Ω, described in SI units. The generic point of Ω is ζ = (ζ_1, ζ_2, ζ_3) in an orthonormal Cartesian coordinate system (O, ζ_1, ζ_2, ζ_3) with O = (0, 0, 0). The outward unit normal to the boundary ∂Ω = Γ_0 ∪ Γ is denoted by n(ζ). There is a zero Dirichlet condition on Γ_0 and a Neumann condition on Γ. Domain Ω is occupied by a random linear elastic medium (heterogeneous material). The uncontrolled parameter of the system is the fourth-order tensor-valued non-Gaussian elasticity random field K = {K_{ijkh}(ζ)}_{1≤i,j,k,h≤3}, ζ ∈ Ω (random coefficients of the partial differential operator), for which the mean value is isotropic and the statistical fluctuations are anisotropic. The control parameter w = (w_1, w_2) of the system consists of w_1 = log(L_corr), in which L_corr is a spatial correlation length, and of w_2 = log(δ_G), in which δ_G is a dispersion parameter; these two parameters allow the statistical fluctuations of K to be controlled. The observation of the system is the R³-valued random displacement field V = (V_1, V_2, V_3) on Ω, which is the random solution of the stochastic BVP,

−div Σ = 0 in Ω,   V = 0 on Γ_0,   Σ n = G_Γ on Γ.

The stress tensor Σ = {Σ_{ij}}_{1≤i,j≤3} is related to the strain tensor E(V) = {E_{kh}(V)}_{1≤k,h≤3} by the constitutive equation Σ_{ij}(ζ) = K_{ijkh}(ζ) E_{kh}(V(ζ)), in which the strain tensor is such that E_{kh}(V) = (∂V_k/∂ζ_h + ∂V_h/∂ζ_k)/2. The geometry, the surface force field G_Γ, the probabilistic model of the elasticity random field K that depends on parameter w, and the finite element discretization of the weak formulation of the stochastic BVP are detailed in Appendix B. The control parameter w is modeled by a R²-valued random variable W = (W_1, W_2). The random vectors U, W, and Q, for which the dimensions are n_u = 420 000, n_w = 2, and n_q = 10 098, are defined in Appendix B. The random vectors W and U are statistically independent. The dimension n = n_q + n_w of random vector X = (Q, W) is thus n = 10 100.
The training set is generated as explained in Section 2.1, for which N = 100 independent realizations, u^j_d and w^j_d, of U and W are generated using the probabilistic model detailed in Appendix B. For each j ∈ {1, ..., N}, the realization q^j_d of Q is computed by solving the BVP using the computational model (finite element discretization of the BVP), which is such that q^j_d = f(w^j_d, u^j_d) (note that f is not explicitly known and results from the solution of the BVP). The training set related to random vector X = (Q, W) is then made up of the N independent realizations {x^j_d, j = 1, ..., N}, in which x^j_d = (q^j_d, w^j_d) ∈ R^n.

The reference data set {x^ℓ_ref, ℓ = 1, ..., N_ref} for X is generated as the training set but with N_ref = 20 000 independent realizations. Computations have been made for N_ref = 15 000, 18 000, and 20 000, which have shown that the pdf of each observed component of Q was converged for N_ref = 20 000 (note that the construction of the reference data set has been very CPU-time consuming).
The learned sets generated without using the PLoM method (No PLoM), using the PLoM method with no group (No-Group PLoM), or with groups (With-Group PLoM) are all made up of N_ar = 200 000 realizations {η^ℓ_ar, ℓ = 1, ..., N_ar} (N_ar = n_MC × N with n_MC = 2 000).

In this section, we give the main results without too many details (paper length limitation), knowing that we have already presented a detailed analysis for Application 1.

(i) PCA of random vector X. Since n = 10 100 ≫ N = 100, the eigenvalue problem related to [C_X] is solved using a thin SVD of the matrix [x_d] = [x^1_d ... x^N_d] ∈ M_{n,N}, which thus does not require the assembling of [C_X] (as explained in Section 2.2). Figure 9a shows the distribution of the eigenvalues µ_α. For constructing the PCA representation X^ν = x_d + [Φ] [µ]^{1/2} H of X, we have chosen ε_PCA = 0.001, which yields ν = 27. Following Section 2.2, the matrix [η_d] ∈ M_{ν,N} is constructed with the N realizations η^1_d, ..., η^N_d of the R^ν-valued random variable H.

(ii) Reduced-order diffusion-map basis for No-Group PLoM. Algorithm 1 is used for the calculation of the reduced-order diffusion-map basis [g_{m_o}] of the R^ν-valued random variable H. For the optimal values m_o = ν + 1 = 28 and ε_o of ε_DM, Figure 9b displays the distribution of the eigenvalues λ_α of the transition matrix.

Figure 9: PCA eigenvalues and diffusion-map eigenvalues of the PLoM without group (No-Group PLoM). (a) Eigenvalues of the covariance matrix of X; (b) eigenvalues of the transition matrix.

(iii) Construction of the optimal partition of H. The optimal partition is computed as explained in Section 2.4.1. Figure 10a displays the graph of i_ref ↦ τ(i_ref), which shows the optimal level i°_ref. The optimal partition is composed of n_p = 9 groups: a group Y^1 of dimension ν_1 = 5, a group Y^2 of dimension ν_2 = 1, a group Y^3 of dimension ν_3 = 15, and six further groups Y^4, ..., Y^9 of dimension 1, each of the one-dimensional groups being made up of a single component of H. For each of the two groups i = 1 and i = 3, for which ν_i ≥ 2, we have m_{1,o} = ν_1 + 1 = 6 and m_{3,o} = ν_3 + 1 = 16, and the optimal values ε_{1,o} and ε_{3,o} of ε_{1,DM} and ε_{3,DM} are identified as explained in Section 2.3.3; for ε_{i,DM} = ε_{i,o}, Figure 10b shows the distribution of the eigenvalues λ_{i,α} of the transition matrix. For the groups i ≠ 1 and i ≠ 3, for which ν_i = 1, we have m_{i,o} = N (see Algorithm 1).

Figure 10: Partition of H in n_p = 9 mutually independent random vectors Y^1, ..., Y^{n_p} and diffusion-map eigenvalues of the PLoM for groups 1 and 3. (a) Graph of function i_ref ↦ τ(i_ref); (b) eigenvalues of the transition matrix.
(iv) Influence of the constraints on all the components of Y^i. No-Group PLoM is performed without any constraints applied to random vector H. With-Group PLoM is performed, group by group, applying, for i = 1, ..., n_p = 9, the constraints E{(Y^i_k)²} = 1, k ∈ {1, ..., ν_i}. For all the components k = 1, ..., ν = 27 of H, Figure 11 shows the mean value E{H_k} and the standard deviation σ_{H_k} that are estimated by No-Group PLoM and by With-Group PLoM. We can see that the mean values remain much lower than 1 although no constraint is applied to the mean, both for No-Group PLoM and for With-Group PLoM. We can also see that the standard deviations of the components are already close to 1 for No-Group PLoM although no constraint is applied to the second-order moments. As expected, for With-Group PLoM, for which the constraints are applied to the second-order moments, the standard deviations are almost equal to 1.

Figure 11: Mean value and standard deviation of the components H_k, k = 1, ..., ν, of H estimated using the learned set generated with No-Group PLoM (thin line) and With-Group PLoM (thick line).

(v) pdf of observations estimated by the PLoM. The pdfs of components Q_17 and Q_7740 of Q are presented in Figure 12; these two components correspond to displacement components of the random field V at two distinct points of Ω. Each pdf is estimated (i) with the N = 100 points of the training set, (ii) with the N_ref = 20 000 points of the reference data set, (iii) with the N_ar = 200 000 additional realizations generated with a usual MCMC generator (without using the PLoM method), (iv) with the N_ar = 200 000 additional realizations of the learned set generated by No-Group PLoM, and finally, (v) with the N_ar = 200 000 additional realizations of the learned set generated by With-Group PLoM, for which a partition in n_p = 9 groups is used.

Figure 12: pdf estimated with (i) the training data set (black thin), (ii) the reference data set (red thick), (iii) No-PLoM (dashed), (iv) No-Group PLoM (blue thin), and (v) With-Group PLoM (blue thick). (a) pdf of Q_17; (b) pdf of Q_7740.

(vi) Quantifying the concentration of the probability measure. The analysis is carried out as in Section 3.4. The results concerning the concentration of the probability measure are summarized in Table 2 and in Figure 13.
Table 2: Concentration of the probability measure for Application 2.

  No PLoM:            d²_N(N) = 2.00
  No-Group PLoM:      d²_N(m_o) = 0.16   (m_o = 28)
  With-Group PLoM:    d²_{wg,N}(m_o) = 0.044
  Proba by Eq. (12):  Proba{ ‖[H^{wg,N}_{m_o}] − [η_d]‖ / ‖[η_d]‖ ≥ ε } ≤ (r²/ε²)^{n_p}

For No PLoM, d²_N(N) is computed with Eq. (4), for which m = N = 100. For No-Group PLoM, d²_N(m_o) is also computed with Eq. (4) but with m = m_o = 28. For With-Group PLoM, d²_{wg,N}(m_o) is computed with Eq. (7), for which m_o = (m_{1,o}, ..., m_{9,o}). The graph i ↦ d²_{i,N}(m_{i,o}) is computed using Eq. (8) and is plotted in Figure 13. Without using the PLoM method, we find numerically d²_N(N) = 2.00. For the PLoM without group, d²_N(m_o) = 0.16 ≪ 2, which shows that the usual PLoM method (without group) effectively preserves the concentration of the probability measure, unlike the usual MCMC methods that do not. For the PLoM with groups, an improvement with respect to the PLoM without group can be seen because d²_{wg,N}(m_o) = 0.044 ≪ d²_N(m_o) = 0.16. The quantification of the probability of the random relative distance defined by Eq. (12) confirms this improvement. Note that the probability (r²/ε²)^{n_p} corresponds to an upper bound, the probability being certainly smaller.

Figure 13: Concentration of the probability measure for each group i ∈ {1, ..., 9}: graph of i ↦ d²_{i,N}(m_{i,o}) computed using Eq. (8).
5. Discussion and conclusion
The implementation of a partition in the PLoM method has provided an opportunity to revisit, improve the efficiency of, and simplify the algorithm used to identify the optimal values of the hyperparameters of the reduced-order diffusion-map basis. This was made necessary for the PLoM method with partition, because the number of groups identified can be large and, for each group of dimension greater than 1, the reduced-order diffusion-map basis must be constructed. This new efficient algorithm is common to PLoM with or without partition.

Still within the framework of the PLoM carried out with partition, we have made the following observations. If a group of the partition has a relatively small dimension (a few units, or even one or two dozen) and if the support of its probability measure has a complex geometry, one can obtain a significant loss of normalization compared to 1 (for instance 0.6 or 0.7 instead of 0.9 or 1). For instance, such a situation can be encountered in the presence of numerous non-Gaussian stochastic germs that generate strong statistical fluctuations (for example, up to ten times the standard deviation for some components). For these cases, we have proposed to introduce constraints on the second-order moments of the components of such a group, by reusing the Kullback-Leibler minimum cross-entropy principle that we have previously used for taking into account physics constraints in the PLoM method. For instance, Application 1 is very difficult due to the high degree of the polynomials in the model; although the realizations of the training set are centered and have a covariance matrix equal to the identity matrix, the fluctuations vary between −10 and +10 for some components (which must be compared to a magnitude of 1). It should be noted that we have also developed, tested, and implemented the general case of introducing constraints on the mean vector (zero mean) and on the covariance matrix (identity matrix). One then considerably increases the number of Lagrange multipliers to be calculated by the iterative method, which induces a significant additional numerical cost. We have not seen any significant improvement compared to the sole constraint related to the diagonal of the covariance matrix (second-order moments equal to 1, knowing that the centering is reasonably well obtained without a constraint on the mean vector). Under these conditions, we have only presented the simplest case of constraints and we have demonstrated it on two applications.

In the recently published mathematical foundations of PLoM [5], to establish the main theorem, we introduced a distance between the random matrix defined by PLoM and the deterministic matrix that represents all the given points of the training set. In the present paper, and in order to facilitate the quantification of the preservation of the concentration of the probability measure between the usual MCMC method (No PLoM), the PLoM method without partition (No-Group PLoM), and the PLoM method with partition (With-Group PLoM), we apply this distance to each group of the partition. We have assessed it numerically for the two applications. The results obtained confirm the theoretical results: there is a significant loss of concentration of the probability measure for the usual MCMC method (No PLoM), while the PLoM method without partition (No-Group PLoM) preserves well the concentration of the probability measure. In addition, this distance shows that the PLoM method with partition further improves the preservation of the concentration of the probability measure compared to the PLoM method without partition, which was hoped for. Finally, to complete the quantification of the concentration of the probability measure by the distance, we have also proven a mathematical result for this quantification in terms of probability. This result shows that if the number of groups of the partition increases, then the gain of With-Group PLoM can be significantly improved compared to No-Group PLoM. We have numerically quantified these probabilities for the two applications.

This work contributes to the improvement of the PLoM method. The results presented are very satisfactory for the two applications which, while quite distinct, present significant challenges to other statistical learning methods.
Appendix A. Probabilistic model of the random generator for Application 1
In this Appendix, any second-order random quantity S is defined on a probability space (Θ, T, P) and its mathematical expectation E{S} is estimated by s = (1/N) Σ_{j=1}^{N} s^j using the N independent realizations {s^j = S(θ_j), j = 1, …, N} of S, with θ_j ∈ Θ.
The R^ν-valued random variable H = (H_1, …, H_ν) is written as a partition into n_p = 3 random vectors Y^1, …, Y^{n_p} such that, for i = 1, …, n_p, the normalized R^{ν_i}-valued random variable Y^i = (Y^i_1, …, Y^i_{ν_i}) is non-Gaussian, the estimate of its mean vector is the zero vector 0_{ν_i}, and the estimate of its covariance matrix is [C_{Y^i}] = [I_{ν_i}]. We have ν = ν_1 + … + ν_{n_p} and we choose ν_1 = ν_2 = 20 (so that ν_3 = ν − 40). For i = 1, 2, 3, let [b^i] be the deterministic matrix in M_{ν_i} defined, in Matlab syntax, by rng('default'); [b^i] = (0.… × rand(ν_i, ν_i) + 0.…)/ν_i (in which rng and rand are the Matlab functions). Let Û^i = [b^i] U^i − 1_{ν_i} be the R^{ν_i}-valued random variable, in which 1_{ν_i} = (1, …, 1) and where U^i = (U^i_1, …, U^i_{ν_i}) is the random vector constituted of ν_i independent uniform random variables on [0, 1], whose N independent realizations are {U^i(θ_j), j = 1, …, N}. The random vectors U^1, U^2, and U^3 are statistically independent. Let M^i = (M^i_1, …, M^i_{ν_i}) be the R^{ν_i}-valued random variable in which, for k = 1, …, ν_i, M^i_k is the random monomial M^i_k = √(k!) (Û^i_k)^k (the degree of this monomial is thus k). Let M^i_c = M^i − m^i, in which m^i is the estimate of the mean value of M^i. Let [C_{M^i}] be the estimate of the covariance matrix of M^i and let [L^i] be the upper triangular matrix computed from the Cholesky factorization [C_{M^i}] = [L^i]^T [L^i]. Therefore, the random vector Y^i is constructed as Y^i = ([L^i]^T)^{-1} M^i_c, whose N independent realizations {η_d^{i,j}, j = 1, …, N} are such that η_d^{i,j} = ([L^i]^T)^{-1} M^i_c(θ_j). The N independent realizations {η_d^j, j = 1, …, N} of H are such that η_d^j = (η_d^{1,j}, η_d^{2,j}, η_d^{3,j}) ∈ R^ν = R^{ν_1} × R^{ν_2} × R^{ν_3}. Using these N realizations, the estimate of the mean vector of H is the zero vector 0_ν and the estimate of its covariance matrix is [C_H] = [I_ν]. By construction, we have Y^1 = (H_1, …, H_{20}), Y^2 = (H_{21}, …, H_{40}), and Y^3 = (H_{41}, …, H_ν).
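The following NumPy sketch illustrates the structure of this generator for one group Y^i. The coefficients written as 0.… above are not recoverable here, so a_coef, b_coef, and seed below are purely illustrative placeholders, as is the use of NumPy instead of the Matlab rng/rand calls of the paper.

```python
import numpy as np
from math import factorial

def generate_group_realizations(nu_i, N, a_coef=0.1, b_coef=0.5, seed=0):
    """Sketch of the Appendix A generator for one group Y^i (output shape (nu_i, N)).

    a_coef, b_coef, and seed are illustrative placeholders: the paper fixes
    [b^i] with Matlab's rng('default') and rand, and the digits of the affine
    map are not reproduced here.
    """
    rng = np.random.default_rng(seed)
    # Deterministic mixing matrix [b^i], analogue of (a*rand(nu_i, nu_i) + b)/nu_i.
    b_mat = (a_coef * rng.random((nu_i, nu_i)) + b_coef) / nu_i
    # nu_i independent uniform germs on [0, 1], one column per realization.
    U = rng.random((nu_i, N))
    U_hat = b_mat @ U - 1.0                          # mixed and shifted germs
    # Random monomials M^i_k = sqrt(k!) (U_hat_k)^k, k = 1, ..., nu_i.
    k = np.arange(1, nu_i + 1)
    scale = np.sqrt(np.array([factorial(int(kk)) for kk in k], dtype=float))
    M = scale[:, None] * U_hat ** k[:, None]
    # Empirical centering, then whitening by the Cholesky factor of the covariance.
    M_c = M - M.mean(axis=1, keepdims=True)
    L_up = np.linalg.cholesky(np.cov(M_c)).T         # upper triangular, C = L^T L
    return np.linalg.solve(L_up.T, M_c)              # Y^i = ([L^i]^T)^{-1} M^i_c

# Usage (N must be well above nu_i for the covariance estimate to be invertible):
# eta_1 = generate_group_realizations(nu_i=20, N=100)
```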
Appendix B. Model and data for Application 2
(i) Geometry and surface force field G^Γ. The 3D bounded domain is defined by Ω = ]0, …[ × ]0, …[ × ]0, …[ and its boundary ∂Ω is written as ∂Ω = Γ_0 ∪ Γ, in which Γ = ∂Ω \ Γ_0 with Γ_0 = {ζ_1 = …, 0 ≤ ζ_2 ≤ …, 0 ≤ ζ_3 ≤ …}. The surface force field G^Γ = (G^Γ_1, G^Γ_2, G^Γ_3) is zero on all Γ except on the part {ζ_1 = …, … ≤ ζ_2 ≤ …, … ≤ ζ_3 ≤ …}, for which G^Γ_1(ζ) = −… × 10^… N/m^2, G^Γ_2(ζ) = −… × 10^… N/m^2, and G^Γ_3(ζ) = −… × 10^… N/m^2.
(ii) Probabilistic modeling of the elasticity random field. Random field K is rewritten as K_{ijkh}(ζ) = [K(ζ)]_{IJ} with I = (i, j) and J = (k, h), in which indices I and J belong to {1, …, 6}, and where {[K(ζ)], ζ ∈ R^3} is a second-order M^+_6-valued non-Gaussian random field indexed by R^3, which is assumed to be statistically homogeneous. Its mean function [K̲] ∈ M^+_6 is thus independent of ζ and corresponds to the elasticity tensor of a homogeneous isotropic elastic medium whose Young modulus is 10^… N/m^2 and whose Poisson coefficient is 0.15.
The statistical fluctuations of random field [K] around [K̲] are those of a heterogeneous anisotropic elastic medium. The non-Gaussian M^+_6-valued random field {[K(ζ)], ζ ∈ Ω} is constructed using the stochastic model [30, 31] of random elasticity fields for heterogeneous anisotropic elastic media that are isotropic in statistical mean and exhibit anisotropic statistical fluctuations. Its parameterization consists of three spatial-correlation lengths, one dispersion parameter, and a positive-definite lower bound. Random field [K] is written, for all ζ in R^3, as [K(ζ)] = [K_ℓ] + [L_ε]^T [G(ζ)] [L_ε]. The lower-bound matrix is defined by [K_ℓ] = ε (1 + ε)^{-1} [K̲] ∈ M^+_6, in which ε is chosen equal to 10^{-…}. The upper triangular matrix [L_ε] ∈ M_6 is written as [L_ε] = (1 + ε)^{-1/2} [L̲], in which [K̲] = [L̲]^T [L̲] (Cholesky factorization). The non-Gaussian random field {[G(ζ)], ζ ∈ R^3}, which is indexed by R^3 with values in M^+_6, is homogeneous in R^3 and is a second-order random field whose modeling and generator are detailed on pages 272 to 274 of [31]. For all ζ in R^3, the random matrix [G(ζ)] is written as [G(ζ)] = [L_G(ζ)]^T [L_G(ζ)], in which [L_G(ζ)] is an upper (6 × 6) real triangular random matrix that depends on n_G = 21 independent normalized Gaussian random variables. Random field [G] depends on 3 spatial correlation lengths, L_{corr,1}, L_{corr,2}, and L_{corr,3}, relative to the three directions of the ζ_1-, ζ_2-, and ζ_3-axes. It also depends on the dispersion parameter δ_G > 0. The hyperparameters of the model are thus L_corr and δ_G, for which we have chosen L_{corr,1} = L_{corr,2} = L_{corr,3} = L_corr.
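The two formulas above can be read as the following small NumPy sketch, which assembles a realization of [K(ζ)] at one point from a realization of [G(ζ)]. The mean matrix K_mean, the value of eps, and the choice [G] = I in the usage line are illustrative placeholders; the actual generator of the random field [G] follows [31] and is not reproduced here.

```python
import numpy as np

def assemble_K(K_mean, G, eps=1e-6):
    """Assemble [K] = [K_l] + [L_eps]^T [G] [L_eps] at one point.

    K_mean : (6, 6) mean elasticity matrix (symmetric positive definite).
    G      : (6, 6) realization of the random matrix [G(zeta)].
    eps    : lower-bound parameter (illustrative value).
    """
    K_l = eps / (1.0 + eps) * K_mean            # lower-bound matrix [K_l]
    L_bar = np.linalg.cholesky(K_mean).T        # upper triangular, K_mean = L^T L
    L_eps = L_bar / np.sqrt(1.0 + eps)          # [L_eps] = (1 + eps)^(-1/2) [L_bar]
    return K_l + L_eps.T @ G @ L_eps

# Illustrative check with [G] = identity, which returns K_mean exactly:
# K = assemble_K(K_mean=1e10 * np.eye(6), G=np.eye(6))
```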
(iii) Finite element approximation of the stochastic boundary value problem and definition of random vector U. Domain Ω is meshed with 50 × … × … = 2 500 finite elements. The displacement is zero on Γ_0 and therefore, there are 198 zero Dirichlet conditions. There are 8 integration points in each finite element. Consequently, there are N_i = 20 000 integration points ζ^1, …, ζ^{N_i}. Let us consider the R^{n_u}-valued random variable U constituted of all the n_u = N_i × n_G = 20 000 × 21 = 420 000 independent normalized Gaussian random variables that allow the set {[L_G(ζ^1)], …, [L_G(ζ^{N_i})]} of random matrices to be generated.
(iv) Construction of random vectors Q and W. The R^{n_q}-valued random variable Q of the QoIs is constituted of the 10 098 dofs of the discretization of the random displacement field V. The random vector W = (W_1, W_2) is such that W_1 = log(L_corr) and W_2 = log(δ_G). The random variables L_corr and δ_G are independent and uniform on [0.…, …]; they are written as L_corr = … U_1 + … and δ_G = … U_2 + …, in which U_1 and U_2 are two independent uniform random variables on [0, 1].
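The bookkeeping in (iii) and (iv) can be summarized by the following NumPy sketch, in which the Gaussian germ vector U is organized as one block of n_G values per integration point and W is obtained from two uniform control variables. The interval bounds and the use of the natural logarithm are assumptions made for illustration only.

```python
import numpy as np

n_i, n_g = 20_000, 21                  # integration points, germs per point
rng = np.random.default_rng()

# One realization of the germ vector U, of dimension n_u = n_i * n_g = 420 000,
# reshaped so that row l gathers the n_g germs used to build [L_G(zeta^l)].
U = rng.standard_normal(n_i * n_g).reshape(n_i, n_g)

# One realization of W = (log(L_corr), log(delta_G)); bounds are placeholders.
lcorr_min, lcorr_max = 0.1, 0.5
delta_min, delta_max = 0.2, 0.6
u1, u2 = rng.random(2)
l_corr = lcorr_min + (lcorr_max - lcorr_min) * u1
delta_g = delta_min + (delta_max - delta_min) * u2
W = np.array([np.log(l_corr), np.log(delta_g)])
```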
Acknowledgments
Support for this work was partially provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research.
References
[1] C. Soize, R. Ghanem, Data-driven probability concentration and sampling on manifold, Journal of Computational Physics 321 (2016) 242–258. doi:10.1016/j.jcp.2016.05.044.
[2] R. Ghanem, C. Soize, L. Mehrez, V. Aitharaju, Probabilistic learning and updating of a digital twin for composite material systems, International Journal for Numerical Methods in Engineering (2020). doi:10.1002/nme.6430.
[3] R. Ghanem, C. Soize, C. Safta, X. Huan, G. Lacaze, J. C. Oefelein, H. N. Najm, Design optimization of a scramjet under uncertainty using probabilistic learning on manifolds, Journal of Computational Physics 399 (2019) 108930. doi:10.1016/j.jcp.2019.108930.
[4] M. Arnst, C. Soize, K. Bulthies, Computation of Sobol indices in global sensitivity analysis from small data sets by probabilistic learning on manifolds, International Journal for Uncertainty Quantification, online 19 August 2020 (2020). doi:10.1615/Int.J.UncertaintyQuantification.2020032674.
[5] C. Soize, R. Ghanem, Probabilistic learning on manifolds, Foundations of Data Science (2020) 1–29. doi:10.3934/fods.2020013.
[6] C. Soize, R. Ghanem, C. Desceliers, Sampling of Bayesian posteriors with a non-Gaussian probabilistic learning on manifolds from a small dataset, Statistics and Computing 30 (5) (2020) 1433–1457. doi:10.1007/s11222-020-09954-6.
[7] C. Soize, R. Ghanem, Physics-constrained non-Gaussian probabilistic learning on manifolds, International Journal for Numerical Methods in Engineering 121 (1) (2020) 110–145. doi:10.1002/nme.6202.
[8] C. Soize, R. Ghanem, Probabilistic learning on manifolds constrained by nonlinear partial differential equations for small datasets, arXiv:2010.14324 [stat.ML] (2020) 1–38. URL http://arxiv.org/abs/2010.14324
[9] J. L. Fleiss, B. Levin, M. C. Paik, Statistical Methods for Rates and Proportions, John Wiley & Sons, 2013.
[10] P. E. Greenwood, M. S. Nikulin, A Guide to Chi-Squared Testing, Vol. 280, John Wiley & Sons, 1996.
[11] K. Pearson, X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, Series 5 50 (302) (1900) 157–175. doi:10.1080/14786440009463897.
[12] R. Boscolo, H. Pan, V. P. Roychowdhury, Independent component analysis based on nonparametric density estimation, IEEE Transactions on Neural Networks 15 (1) (2004) 55–65. doi:10.1109/TNN.2003.820667.
[13] P. Comon, Independent component analysis, a new concept?, Signal Processing 36 (3) (1994) 287–314. doi:10.1016/0165-1684(94)90029-9.
[14] P. Comon, C. Jutten, J. Herault, Blind separation of sources, part II: Problems statement, Signal Processing 24 (1) (1991) 11–20. doi:10.1016/0165-1684(91)90080-3.
[15] J. Herault, C. Jutten, Space or time adaptive signal processing by neural network models, in: AIP Conference Proceedings, Vol. 151, American Institute of Physics, 1986, pp. 206–211.
[16] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Transactions on Neural Networks 10 (3) (1999) 626–634. doi:10.1109/72.761722.
[17] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (4-5) (2000) 411–430. doi:10.1016/S0893-6080(00)00026-5.
[18] C. Jutten, J. Herault, Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture, Signal Processing 24 (1) (1991) 1–10. doi:10.1016/0165-1684(91)90079-X.
[19] T.-W. Lee, M. Girolami, A. J. Bell, T. J. Sejnowski, A unifying information-theoretic framework for independent component analysis, Computers & Mathematics with Applications 39 (11) (2000) 1–21. doi:10.1016/S0898-1221(00)00101-2.
[20] C. Soize, Optimal partition in terms of independent random vectors of any non-Gaussian vector defined by a set of realizations, SIAM/ASA Journal on Uncertainty Quantification 5 (1) (2017) 176–211. doi:10.1137/16M1062223.
[21] G. H. Golub, C. F. Van Loan, Matrix Computations, Second Edition, Johns Hopkins University Press, Baltimore and London, 1993.
[22] T. Duong, A. Cowling, I. Koch, M. Wand, Feature significance for multivariate kernel density estimation, Computational Statistics & Data Analysis 52 (9) (2008) 4225–4242. doi:10.1016/j.csda.2008.02.035.
[23] T. Duong, M. L. Hazelton, Cross-validation bandwidth matrices for multivariate kernel density estimation, Scandinavian Journal of Statistics 32 (3) (2005) 485–506. doi:10.1111/j.1467-9469.2005.00445.x.
[24] M. Filippone, G. Sanguinetti, Approximate inference of the bandwidth in multivariate kernel density estimation, Computational Statistics & Data Analysis 55 (12) (2011) 3104–3122. doi:10.1016/j.csda.2011.05.023.
[25] N. Zougab, S. Adjabi, C. C. Kokonendji, Bayesian estimation of adaptive bandwidth matrices in multivariate kernel density estimation, Computational Statistics & Data Analysis 75 (2014) 28–38. doi:10.1016/j.csda.2014.02.002.
[26] A. Bowman, A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach With S-Plus Illustrations, Vol. 18, Oxford University Press, Oxford: Clarendon Press, New York, 1997. doi:10.1007/s001800000033.
[27] R. Coifman, S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis 21 (1) (2006) 5–30. doi:10.1016/j.acha.2006.04.006.
[28] S. Lafon, A. B. Lee, Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (9) (2006) 1393–1403. doi:10.1109/TPAMI.2006.184.
[29] R. Neal, MCMC using Hamiltonian dynamics, in: S. Brooks, A. Gelman, G. Jones, X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, Chapman and Hall/CRC Press, Boca Raton, 2011, Ch. 5. doi:10.1201/b10905-6.
[30] C. Soize, Non-Gaussian positive-definite matrix-valued random fields for elliptic stochastic partial differential operators, Computer Methods in Applied Mechanics and Engineering 195 (1-3) (2006) 26–64. doi:10.1016/j.cma.2004.12.014.
[31] C. Soize, Uncertainty Quantification. An Accelerated Course with Advanced Applications in Computational Engineering, Springer, New York, 2017. doi:10.1007/978-3-319-54339-0.