Generalized Score Matching for General Domains
GGeneralized Score Matching for General Domains
Shiqing Yu , Mathias Drton , and Ali Shojaie Department of Statistics, University of Washington, Seattle, Washington, 98195, U.S.A. Department of Mathematics, Technical University of Munich, 85748 Garching beiM¨unchen, Germany Department of Biostatistics, University of Washington, Seattle, Washington, 98195, U.S.A.September 25, 2020
Abstract
Estimation of density functions supported on general domains arises when the data is nat-urally restricted to a proper subset of the real space. This problem is complicated by typi-cally intractable normalizing constants. Score matching provides a powerful tool for estimatingdensities with such intractable normalizing constants, but as originally proposed is limited todensities on R m and R m + . In this paper, we offer a natural generalization of score matching thataccommodates densities supported on a very general class of domains. We apply the frameworkto truncated graphical and pairwise interaction models, and provide theoretical guarantees forthe resulting estimators. We also generalize a recently proposed method from bounded to un-bounded domains, and empirically demonstrate the advantages of our method.KEY WORDS: Density estimation, graphical model, normalizing constant, sparsity, truncateddistributions Probability density functions, especially in multivariate graphical models, are often defined onlyup to a normalizing constant. In higher dimensions, computation of the normalizing constant istypically an intractable problem that becomes worse when the distributions are defined only on aproper subset of the real space R m . For example, even truncated multivariate Gaussian densitieshave intractable normalizing constants except for special situations, e.g., with diagonal covariance1 a r X i v : . [ s t a t . M E ] S e p atrices. This inability to calculate normalizing constants makes density estimation for generaldomains very challenging. Score matching
Hyv¨arinen (2005) is a computational efficient solution to density estimationthat bypasses the calculation of normalizing constants and has enabled, in particular, large-scaleapplications of non-Gaussian graphical models Forbes and Lauritzen (2015); Lin et al. (2016); Yuet al. (2016); Sun et al. (2015); Yu et al. (2019a). Its original formulation targets distributionssupported on R m . It was extended to treat the non-negative orthant R m + in (Hyv¨arinen, 2007),with more recent generalizations in Yu et al. (2018, 2019b). An extension to products of intervalslike [0 , m was given in (Janofsky, 2015, 2018; Tan et al., 2019), and more general bounded domainswere considered in (Liu and Kanamori, 2019). Despite this progress, the existing approaches haveimportant limitations: The method in Liu and Kanamori (2019) only allows for bounded support,and earlier methods for R m + and [0 , m offer ad-hoc solutions that cannot be directly extended tomore general domains. This paper addresses these limitations by developing a unifying frameworkthat encompasses the existing methods and applies to unbounded domains. The framework enablesnew applications for more complicated domains yet retains the computational efficiency of theoriginal score matching. The remainder of this introduction provides a more detailed review of thescore matching estimator and the contributions of this paper. The original score matching estimator introduced in Hyv¨arinen (2005) is based on the idea ofminimizing the Fisher distance given by the expected (cid:96) distance between the gradients of the truelog density log p on R m and a proposed log density log p , that is, (cid:90) R m p ( x ) (cid:107)∇ x log p ( x ) − ∇ x log p ( x ) (cid:107) d x . (1)Integration by parts leads to an associated empirical loss, in which an additive constant termdepending only on p is ignored. This loss avoids calculations of the normalizing constant throughdealing with the derivatives of the log-densities only. Minimizing the empirical loss to derive anestimator is particularly convenient if p belongs to an exponential family because the loss is then aquadratic function of the family’s canonical parameters. The latter property holds, in particular,for the Gaussian case, where the methods proposed by Liu and Luo (2015); Zhang and Zou (2014)constitute a special case of score matching.In (Hyv¨arinen, 2007), the approach was generalized to densities on R m + = [0 , ∞ ) m by minimizing2nstead (cid:90) R m + p ( x ) (cid:107)∇ x log p ( x ) (cid:12) x − ∇ x log p ( x ) (cid:12) x (cid:107) d x . (2)The element-wise multiplication (“ (cid:12) ”) with x dampens discontinuities at the boundary of R m + andfacilitates integration by parts for deriving an empirical loss that does not depend on the true p .In recent work, we proposed a generalized score matching approach for densities on R m + byusing the square-root of slowly growing and preferably bounded functions h ( x ) in place of x inthe element-wise multiplication (Yu et al., 2018, 2019b). This modification improves performance(theoretically and empirically) as it avoids higher moments in the empirical loss. Another recentwork extended score matching to supports given by bounded open subsets D ⊂ R m with piecewisesmooth boundaries (Liu and Kanamori, 2019). The idea there is to minimizesup g ∈G (cid:90) D p ( x ) g ( x ) (cid:107)∇ x log p ( x ) − ∇ x log p ( x ) (cid:107) d x , (3)where G ≡ { g | g ( x ) = 0 ∀ x ∈ ∂ D and g is 1-Lipschitz continuous } , and ∂ D is the boundary of D .The supremum in the loss is achieved at g ( x ) ≡ min x (cid:48) ∈ ∂ D (cid:107) x − x (cid:48) (cid:107) , the distance of x to ∂ D . In this paper, we further extend generalized score matching with the aim of avoiding the limitationsof existing work and allowing for general and possibly unbounded domains D with positive Lebesguemeasure. We require merely that all sections of D ⊆ R m , i.e., the sets of values of any component x j fixing all other components x − j , are countable disjoint unions of intervals in R . This level ofgenerality ought to cover all practical cases. To handle such domains, we compose the function h in the generalized score matching loss of Yu et al. (2018, 2019b) with a component-wise distancefunction ϕ = ( ϕ , . . . , ϕ m ) : D → R m + . To define ϕ j ( x ), we consider the interval in the section givenby x − j that contains x j and compute the distance between x j and the boundary of this interval.The function ϕ j ( x ) is then defined as the minimum of the distance and a user-selected constant C j . The loss resulting from this extension, with the composition h ◦ ϕ in place of h , can againbe approximated by an empirical loss that is quadratic in the canonical parameters of exponentialfamilies.As an application of the proposed framework, we study a class of pairwise interaction modelsfor an m -dimensional random vector X = ( X i ) mi =1 that was considered in Yu et al. (2018, 2019b)and in special cases in earlier literature. These a - b models postulate a probability density function3roportional to exp (cid:26) − a x a K x a + 1 b η (cid:62) x b (cid:27) , x ∈ D . (4)Where past work assumes D = R m or D = R m + , we here allow a general domain D ⊂ R m . In (4), a ≥ b ≥ K ∈ R m × m and η ∈ R m are unknown parametersto be estimated. For a = 0 we define x a (cid:62) K x a /a ≡ (log x ) (cid:62) K (log x ) and for b = 0 we define η (cid:62) x b /b ≡ η (cid:62) (log x ). The case where a = 0 was not considered in Yu et al. (2018, 2019b). Thismodel class provides a simple yet rich framework for pairwise interaction models. In particular, if D = D ×· · ·× D m is a product set, then X i and X j are conditionally independent given all others ifand only if κ ij = κ ji = 0 in the interaction matrix K ; i.e., the a - b models become graphical models(Maathuis et al., 2019). When a = b = 1, model (4) is a (truncated) Gaussian graphical model,with Σ ≡ K − the covariance matrix and Σ − η the mean parameter. The case where a = b = 1 / D = R m + is the exponential square root graphical model from Inouye et al. (2016).For estimation of a sparse interaction matrix K in high-dimensional a - b models, we take up an (cid:96) regularization approach considered in Lin et al. (2016) and improved in Yu et al. (2018, 2019b).In Yu et al. (2018, 2019b), we showed that this approach permits recovery of the support of K under sample complexity n = Ω(log m ) for Gaussians truncated to D = R m + . Here, we prove thatthe same sample complexity is achieved for Gaussians truncated to any domain D that is a finitedisjoint union of convex sets with n = Ω(log m ) samples. In addition, we derive similar resultsfor general a - b models on bounded subsets of R m + with positive measure for a >
0, or if log D isbounded for a = 0. On unbounded domains for a > D and a = 0, we require n to be Ω(log m ) times a factor that may weakly depend on m . The rest of the paper is structured as follows. We provide the necessary background on scorematching in Section 2. In Section 3, we introduce and detail our new methodology, along with theregularized generalized estimator for exponential families. In Section 4, we define the a - b interactionmodels and focus on application of our method to these models on domains with positive Lebesguemeasure. Theoretical results and numerical experiments are given in Sections 5 and 6, respectively.We apply our method to a DNA methylation dataset in Section 7. Longer proofs are included inthe Appendix. An implementation that incorporates various types of domain D is available in the genscore R package. 4 .4 Notation
We use lower-case letters for constant scalars, vectors and functions and upper-case letters forrandom scalars and vectors (except some special cases). We reserve regular font for scalars (e.g. a , X ) and boldface for vectors (e.g. a , X ), and m = (1 , . . . , ∈ R m . For two vectors u , v ∈ R m , wewrite u (cid:31) v if u j > v j for j = 1 , . . . , m . Matrices are in upright bold, with constant matrices inupper-case ( K , M ) and random data matrices in lower-case ( x , y ). Superscripts index rows andsubscripts index columns in a data matrix x , so, X ( i ) is the i -th row, and X ( i ) j is its j -th feature.For vectors u , v ∈ R m , u (cid:12) v ≡ ( u v , . . . , u m v m ) denotes the Hadamard product (element-wisemultiplication), and the (cid:96) a -norm for a ≥ (cid:107) u (cid:107) a = ( (cid:80) mj =1 | u j | a ) /a , with (cid:107) u (cid:107) ∞ =max j =1 ,...,m | u j | . For a ∈ R , let v a ≡ ( v a , . . . , v am ). Similarly, for function f : R m → R m , x (cid:55)→ ( f ( x ) , . . . , f m ( x )), we write f a ( x ) ≡ ( f a ( x ) , . . . , f am ( x )). Similarly, we also write f (cid:48) ( x ) ≡ ( ∂f ( x ) /∂x , . . . , ∂f m ( x ) /∂x m ).For a matrix K = [ κ ij ] i,j ∈ R n × m , its vectorization is obtained by stacking its columns into an R nm vector. Its Frobenius norm is ||| K ||| F = (cid:107) vec( K ) (cid:107) , its max norm is (cid:107) K (cid:107) ∞ ≡ (cid:107) vec( K ) (cid:107) ∞ ≡ max i,j | κ ij | , and its (cid:96) a – (cid:96) b operator norm is ||| K ||| a,b ≡ max x (cid:54) = (cid:107) K x (cid:107) b / (cid:107) x (cid:107) a , with ||| K ||| a ≡ ||| K ||| a,a .For a vector x ∈ R m and an index j ∈ { , . . . , m } , we write x − j for the subvector that hasthe j th component removed. For a function f of a vector x , we may also write f ( x j ; x − j ) tostress the dependency on x j , especially when x − j is fixed and only x j is varied, and write ∂ j f ( x ) = ∂ j f ( y ; x − j ) /∂y | y ≡ x j . For two compatible functions f and g , f ◦ g denotes their function composition.Unless otherwise noted, the considered probability density functions are densities with respect tothe Lebesgue measure on R m . Suppose X ∈ R m is a random vector with distribution function P supported on domain D ⊆ R m and a twice continuously differentiable probability density function p with respect to the Lebesguemeasure restricted to D . Let P ( D ) be a family of distributions of interest with twice continuouslydifferentiable densities on D . The goal is to estimate p by picking the distribution P from P ( D )with density p minimizing an empirical loss that measures the distance between p and p .5 .1 Original Score Matching on R m The original score matching loss proposed by Hyv¨arinen (2005) for D ≡ R m is given by J R m ( P ) ≡ (cid:90) R m p ( x ) (cid:107)∇ log p ( x ) − ∇ log p ( x ) (cid:107) d x , in which the gradients can be thought of as gradients with respect to a hypothetical locationparameter and evaluated at the origin (Hyv¨arinen, 2005). The log densities enable estimationwithout calculating the normalizing constants of p and p . Under mild conditions, using integrationby parts, the loss can be rewritten as J R m ( P ) ≡ (cid:90) R m p ( x ) m (cid:88) j =1 (cid:20) ∂ jj log p ( x ) + 12 ( ∂ j log p ( x )) (cid:21) d x plus a constant independent of p . One can thus use a sample average to approximate the losswithout knowing the true density p . R m + Consider D ≡ R m + . Let h : R m + → R m + , x (cid:55)→ ( h ( x ) , . . . , h m ( x m )) (cid:62) , where h , . . . , h m : R + → R + are almost surely positive functions that are absolutely continuous in every bounded sub-intervalof R + . The generalized h -score matching loss proposed by Yu et al. (2018, 2019b) is J h , R m + ( P ) ≡ (cid:90) R m + p ( x ) (cid:13)(cid:13)(cid:13) ∇ log p ( x ) (cid:12) h / ( x ) − ∇ log p ( x ) (cid:12) h / ( x ) (cid:13)(cid:13)(cid:13) d x . (5)The score matching loss for R m + originally proposed by Hyv¨arinen (2007) is a special case of (5) with h ( x ) = x . In Yu et al. (2018, 2019b) we proved that by choosing slowly growing and preferablybounded h , . . . , h m , the estimation efficiency can be significantly improved. Under assumptionsthat for all P ∈ P ( R m + ) with density p ,(A0.1) p ( x j ; x − j ) h j ( x j ) ∂ j log p ( x j ; x − j ) (cid:12)(cid:12)(cid:12) x j (cid:37) + ∞ x j (cid:38) + = 0 , ∀ x − j ∈ R m − ∀ j ;(A0.2) E p (cid:13)(cid:13) ∇ log p ( X ) (cid:12) h / ( X ) (cid:13)(cid:13) < + ∞ , E p (cid:13)(cid:13) ( ∇ log p ( X ) (cid:12) h ( X )) (cid:48) (cid:13)(cid:13) < + ∞ ,where f ( x ) (cid:12)(cid:12)(cid:12) x j (cid:37) + ∞ x j (cid:38) + ≡ lim x j (cid:37) + ∞ f ( x ) − lim x j (cid:38) + f ( x ), the loss (5) can be rewritten as J h , R m + ( P ) ≡ (cid:90) R m + p ( x ) m (cid:88) j =1 (cid:20) h (cid:48) j ( x j ) ∂ j (log p ( x )) + h j ( x j ) ∂ jj (log p ( x )) + 12 h j ( x j )[ ∂ j (log p ( x ))] (cid:21) d x plus a constant independent of p . One can thus estimate p by minimizing the empirical loss J h , R m + ( P ). 6 .3 Score Matching on Bounded Open Subsets of R m The method proposed in Liu and Kanamori (2019) estimates a density p on a bounded open subset D ⊂ R m with a piecewise smooth boundary ∂ D by minimizing the following “maximally weightedscore matching” loss J g , D ( P ) ≡ sup g ∈G (cid:90) R m + g ( x ) p ( x ) (cid:107)∇ log p ( x ) − ∇ log p ( x ) (cid:107) d x . (6)with G ≡ { g | g ( x ) = 0 , ∀ x ∈ ∂ D and g is L -Lipschitz continuous } for some constant L >
0. Theauthors show that the maximum is obtained with g ( x ) ≡ L · inf x (cid:48) ∈ ∂ D (cid:107) x − x (cid:48) (cid:107) , i.e. the (cid:96) distanceof x to the boundary of D ; using integration by parts similar to the previous methods, (6) can beestimated using the empirical loss which can be calculated with a closed form. For x ∈ R m and any index j = 1 , . . . , m , write C j, D ( x − j ) ≡ { y ∈ R : ( y ; x − j ) ∈ D } for the sectionof D obtained by fixing the coordinates in x − j . This j th section is the projection of the intersectionbetween D and the line { ( y ; x − j ) : y ∈ R } . A non-empty j th section is obtained from the vectors x − j in the set S − j, D ≡ { x − j : C j, D ( x − j ) (cid:54) = ∅ } ⊂ R m − . For notational simplicity, we drop theirdependency on D . Definition 1.
We say that a domain D ⊆ R m is a component-wise countable union of intervals ifit is measurable, and for any index j = 1 , . . . , m and any x − j ∈ S − j, D , the section C j, D ( x − j ) is acountable union of disjoint intervals, meaning that C j, D ( x − j ) ≡ K j ( x − j ) ∪ k =1 I k ( x − j ) , (7) where K j ( x − j ) ∈ N ∪ {∞} , and each set I k ( x − j ) is an interval (closed, open, or half-open) withendpoints −∞ ≤ a k,j ( x − j ) ≤ b k,j ( x − j ) ≤ + ∞ , with the I k ( x − j ) ’s being the connected componentsof C j, D ( x − j ) . The last point rules out constructions like I = (0 , and I = (1 , but allows I = (0 , and I = (1 , . We define the component-wise boundary set of such a component-wisecountable union of intervals as ∂ D ≡ (cid:26) x ∈ R m : ∃ j = 1 , . . . , m, x − j ∈ S − j, D , x j ∈ K j ( x − j ) ∪ k =1 { a k,j ( x − j ) , b k,j ( x − j ) }\{±∞} (cid:27) . (8)7 .2 Generalized Score Matching Loss for General Domains We first define a truncated component-wise distance , which is based on distances within connectedcomponents of sections.
Definition 2.
Let C = ( C , . . . , C m ) be comprised of positive constants, so C (cid:31) . Let D ⊆ R m bea non-empty component-wise countable union of intervals whose sections are presented as in (7) .For any vector x ∈ D , define the truncated component-wise distance of x to the boundary of D as ϕ C , D ( x ) ≡ ( ϕ C , D , ( x ) , . . . , ϕ C m , D ,m ( x )) ∈ R m + , (9) ϕ C j , D ,j ( x ) ≡ C j , a k,j = −∞ , b k,j = + ∞ , min( C j , b k,j − x j ) , a k,j = −∞ , x j ≤ b k,j < + ∞ , min( C j , x j − a k,j , b k,j − x j ) , −∞ < a k,j ≤ x j ≤ b k,j < + ∞ , min( C j , x j − a k,j ) , −∞ < a k,j ≤ x j , b k,j = + ∞ , (10) where k is the index for which x j ∈ I k ( x − j ) and a k,j ≤ b k,j are the endpoints of I k ( x − j ) . Our idea for defining a score matching loss suitable for general domains is now to use thegeneralized score matching framework from (5) but apply the function h to ϕ C , D ( x ) instead of to x . Definition 3.
Suppose the true distribution P has a twice continuously differentiable density p supported on D ⊆ R m , a non-empty component-wise countable union of intervals. Given positiveconstants C (cid:31) , and h : R m + → R m + , y (cid:55)→ ( h ( y ) , . . . , h m ( y m )) with h , . . . , h m : R + → R + , the generalized ( h , C , D )-score matching loss for P ∈ P ( D ) with density p is defined as J h , C , D ( P ) ≡ (cid:90) D p ( x ) (cid:13)(cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ C , D ) / ( x ) − ∇ log p ( x ) (cid:12) ( h ◦ ϕ C , D ) / ( x ) (cid:13)(cid:13)(cid:13) d x . (11)In (11), we apply the loss from (5) with the choice ( h ◦ ϕ C , D ) in place of h . The function ϕ C , D transforms a point x ∈ D into the component-wise distance vector in R m + . The loss fromDefinition 3 is thus a natural extension of our work in Yu et al. (2018, 2019b), with the appealthat ϕ C , D usually has a closed-form solution and can be computed efficiently. For D = R m + and C = (+ ∞ , . . . , + ∞ ) it holds that ϕ C , R m + ( x ) = x , and the generalized score matching loss from (5)becomes a special case of (11). In Yu et al. (2018, 2019b), we suggested taking the components of8 a) g (b) ϕ , D , ∨ ϕ , D , (c) ϕ , D , (d) ϕ , D , Figure 1: Comparison of g , ϕ , D , and ϕ , D , on D ≡ { x ∈ R : (cid:107) x (cid:107) < } . (a) g (b) ϕ , D , ∨ ϕ , D , (c) ϕ , D , (d) ϕ , D , Figure 2: Comparison of g , ϕ , D , and ϕ , D , on D ≡ { x ∈ R : (cid:107) x (cid:107) < } . h as bounded functions, which may now also be incorporated via finite truncation points C for ϕ .If h ( x ) = x , C = (+ ∞ , . . . , + ∞ ) and D ≡ R m + , then ( h ◦ ϕ C , D ) / ( x ) ≡ x gives the estimatorin Hyv¨arinen (2007); Lin et al. (2016); see (2). When choosing h ( x ) = m , i.e., constant one, wehave ( h ◦ ϕ C , D ) / ( x ) ≡ m and recover the original score-matching for R m from Hyv¨arinen (2005);see (1).For a bounded domain D , our approach is different from directly using a distance in R m asproposed in Liu and Kanamori (2019); recall (3) with the optimizer g ( x ) being the (cid:96) distance of x to the usual boundary of D . Instead, we decompose the distance for each component and applyan extra transformation via the function h .Figure 1 illustrates the case of the 2-d unit disk given by x + x <
1. While the function fromLiu and Kanamori (2019) is g ( x ) = 1 − (cid:112) x + x , our method uses ϕ ( x ) = (cid:112) − x − | x | and ϕ ( x ) = (cid:112) − x −| x | , assuming that C = ( C , C ) has C , C ≥
1. In Figure 2 we consider the 2-dunit disk restricted to R , where ϕ ( x ) = min (cid:110) x , (cid:112) − x − x (cid:111) , ϕ ( x ) = min (cid:110) x , (cid:112) − x − x (cid:111) , g ( x ) = min (cid:110) x , x , − (cid:112) x + x (cid:111) . 9 .3 Examples of component-wise distances We give the form of the component-wise distance ϕ C , D for different examples of domains. Example 1 ( R m and R m + ) . Two frequently encountered domains are the real space and its nonnega-tive orthant. These are the original settings considered in Hyv¨arinen (2005, 2007); Lin et al. (2016);Yu et al. (2018, 2019b). We have ϕ C , R m ( x ) = C and ϕ C , R m + ( x ) = (min( C , x ) , . . . , min( C m , x m )) . Example 2 (Unit hypercube) . Now consider the unit hypercube D = [ − / , / m as an exampleof a compact set and encountered in applications like (Janofsky, 2015, 2018; Tan et al., 2019).Every non-empty section C j, D ( x − j ) equals [ − / , / , and so ϕ C j , D ,j ( x ) = min { C j , / − | x j |} .Since the hypercube is bounded by nature, it is natural to drop the truncation by C j and simply use ϕ D ( x ) = m / − | x | . Example 3 ( L q ball) . As a compact domain that is difficult for previously proposed approaches,consider the L q ball with radius r > and q ≥ , so D ≡ { x ∈ R m : (cid:107) x (cid:107) q ≤ r } . Given a point x ∈ D and an index j ∈ { , . . . , m } , the section C j, D ( x − j ) is the interval [ − ( r q − (cid:62) | x − j | q ) /q , ( r q − (cid:62) | x − j | q ) /q ] , and so ϕ C j , D ,j ( x ) = min (cid:8) C j , ( r q − (cid:62) | x − j | q ) /q − | x j | (cid:9) . Example 4 ( L q ball restricted to R m + ) . Further restricted, consider D ≡ { x ∈ R m + : (cid:107) x (cid:107) q ≤ r } , thenonnegative part of the L q ball with radius r > and q ≥ . Given a point x ∈ D and an index j , thesection C j, D ( x − j ) is [0 , ( r q − (cid:62) | x − j | q ) /q ] , so ϕ C j , D ,j ( x ) = min (cid:8) C j , x j , ( r q − (cid:62) | x − j | q ) /q − x j (cid:9) . Example 5 (Complement of L q ball) . Now consider D ≡ { x ∈ R m : (cid:107) x (cid:107) q > r } , the complementof the L q ball with radius r > and q ≥ . Given x ∈ D and j , we now have C j, D ( x − j ) = R if (cid:62) m − | x − j | q > r q , (cid:0) −∞ , − ( r q − (cid:62) m − | x − j | q ) /q (cid:1) ∪ (cid:0) ( r q − (cid:62) m − | x − j | q ) /q , + ∞ (cid:1) otherwise ; ϕ C j , D ,j ( x ) = C j if (cid:62) m − | x − j | q > r q , min (cid:8) C j , | x j | − ( r q − (cid:62) m − | x − j | q ) /q (cid:9) otherwise . Example 6 (Complement of L q ball restricted to R m + ) . Next consider D ≡ { x ∈ R m + : (cid:107) x (cid:107) q > r } ,the complement of the nonnegative part of the L q ball with radius r > and q ≥ . Given x ∈ D and j , C j, D ( x − j ) = R + if (cid:62) m − | x − j | q > r q , (cid:0) −∞ , − ( r q − (cid:62) m − | x − j | q ) /q (cid:1) ∪ (cid:0) ( r q − (cid:62) m − | x − j | q ) /q , + ∞ (cid:1) otherwise ;10 C j , D ,j ( x ) = min { C j , x j } if (cid:62) m − | x − j | q > r q , min (cid:8) C j , x j − ( r q − (cid:62) m − | x − j | q ) /q (cid:9) otherwise . Example 7 (Complicated domains defined by inequality constraints) . More generally, a domain D may be determined by a series of intersections/unions of regions determined by inequality con-straints, e.g., D = { x ∈ R m : ( f ( x ) ≤ c ∧ f ( x ) ≤ c ) ∨ f ( x ) ≥ c } . In this case, to calculate ϕ C , D we may plug in x − j as given and solve numerically f i ( x j ; x − j ) = c i for i = 1 , , , andobtain the boundary points for x j using simple algorithms for interval unions/intersections. Thisis implemented in the package genscore for some types of polynomials f i and arbitrary intersec-tions/unions. From this section on, we simplify notation by dropping the dependence of ϕ on C and D . Lemma 1.
Suppose C (cid:31) , p ( x ) > for almost every x ∈ D and h ( y ) , . . . , h m ( y ) > for all y > . Then J h , C , D ( P ) = 0 if and only if p = p for a.e. x ∈ D .Proof of Lemma 1. By the measurability and definition of D , and using the Fubini-Tonelli theorem,the component-wise boundary set ∂ D from Definition 1 is a Lebesgue-null set. Thus, ϕ ( x ) (cid:31) for almost every x ∈ D , so that ( h ◦ ϕ )( x ) (cid:31) for a.e. x ∈ D . So J h , C , D ( P ) = 0 if and onlyif ∇ log p ( x ) = ∇ log p ( x ) for a.e. x ∈ D , i.e., log p ( x ) = log p ( x ) + c for a.e. x ∈ D for someconstant c , or p ( x ) = c · p ( x ) for a.e. x ∈ D for some non-zero constant c ≡ exp( c ). Since p and p both integrate to 1 over D , we have c = 1 and p = p for a.e. x ∈ D .According to Lemma 1 the proposed score matching method requires the domain D to havepositive Lebesgue measure in R m . For some null sets, e.g, a probability simplex, an appropriatetransformation to a lower-dimensional set with positive measure can be given but we defer discussionof such domains to future work, and assume D has positive Lebesgue measure for the rest of thispaper. Lemma 2.
Similar to (A0.1) and (A0.2) in Section 2.2, assume the following to hold for all p ∈ P ( D ) ,(A.1) p ( x j ; x − j ) h j ( ϕ j ( x )) ∂ j log p ( x j ; x − j ) (cid:12)(cid:12)(cid:12) x j (cid:37) b k ( x − j ) − x j (cid:38) a k ( x − j ) + = 0 for all k = 1 , . . . , K j ( x − j ) and x − j ∈ S − j for all j ; A.2) (cid:82) D p ( x ) (cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:13)(cid:13) d x < + ∞ , (cid:82) D p ( x ) (cid:13)(cid:13) [ ∇ log p ( x ) (cid:12) ( h ◦ ϕ )( x )] (cid:48) (cid:13)(cid:13) d x < + ∞ .Also assume that(A.3) ∀ j = 1 , . . . , m and a.e. x − j ∈ S − j , the component function h j of h is absolutely continuousin any bounded sub-interval of the section C j ( x − j ) . This implies the same for ( h j ◦ ϕ j ) andalso that ∂ j ( h j ◦ ϕ j ) exists a.e.Then the loss J h , C , D ( P ) is equal to a constant depending on p only (i.e., independent of p ) plus m (cid:88) j =1 (cid:90) D p ( x ) · ( h j ◦ ϕ j )( x ) · [ ∂ j log p ( x )] d x + m (cid:88) j =1 (cid:90) D p ( x ) · ∂ j [( h j ◦ ϕ j )( x ) · ∂ j log p ( x )] d x . (12)The proof of the lemma is given in Appendix B. The lemma enables us to estimate the popu-lation loss, or rather those parts that are relevant for estimation of P , using the empirical lossˆ J h , C , D ( P ) = 12 m (cid:88) j =1 n (cid:88) i =1
12 ( h j ◦ ϕ j ) (cid:16) x ( i ) (cid:17) · (cid:104) ∂ j log p (cid:16) x ( i ) (cid:17)(cid:105) + ∂ j (cid:104) ( h j ◦ ϕ j ) (cid:16) x ( i ) (cid:17) · ∂ j log p (cid:16) x ( i ) (cid:17)(cid:105) . (13)As the canonical choices of h are power functions in x , we give the following sufficient conditionsfor the assumptions in the lemma. Proposition 3.
Suppose for all j = 1 , . . . , m , h j ( x j ) = x α j j for some α j > . Suppose in additionthat for all j and x − j ∈ S − j and all p ∈ P we have(1) p ( x j ; x − j ) ∂ j log p ( x j ; x − j ) = o (1 / ( x j − c k,j ) α j ) as x j (cid:37) c k,j ≡ b k,j ( x − j ) < + ∞ or as x j (cid:38) c k,j ≡ a k,j ( x − j ) > −∞ for all k , and(2) p ( x j ; x − j ) ∂ j log p ( x j ; x − j ) → as x j (cid:37) + ∞ if C j ( x − j ) is unbounded from above, and as x j (cid:38) −∞ if C j ( x − j ) is unbounded from below.Then (A.1) and (A.3) are satisfied.Proof of Proposition 3. Condition (A.3) is satisfied by the property of h j . (A.1) is also satisfiedsince by construction ( h j ◦ ϕ j )( x ) becomes | x j − c k,j | α j as x j → c k,j ∈ ∪ K j k =1 { a k,j ( x − j ) , b k,j ( x − j ) } and also ( h j ◦ ϕ j ) is bounded by C α j as x j (cid:37) + ∞ or x j (cid:38) −∞ , if applicable.12 .5 Exponential Families and Regularized Score Matching Consider the case where P ( D ) ≡ { p θ : θ ∈ Θ ⊂ R r } for some r = 1 , , . . . is an exponential familycomprised of continuous distributions supported on D with densities of the formlog p θ ( x ) = θ (cid:62) t ( x ) − ψ ( θ ) + b ( x ) , x ∈ D , (14)where θ ∈ R r is the unknown canonical parameter of interest, t ( x ) ∈ R r are the sufficient statistics, ψ ( θ ) is the normalizing constant, and b ( x ) is the base measure. The empirical loss ˆ J h , C , D (13) canthen be written as a quadratic function in the canonical parameter:ˆ J h , C , D ( p θ ) = 12 θ (cid:62) Γ ( x ) θ − g ( x ) (cid:62) θ + const , with (15) Γ ( x ) = 1 n n (cid:88) i =1 m (cid:88) j =1 ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) ∂ j t ( X ( i ) ) (cid:16) ∂ j t (cid:16) X ( i ) (cid:17)(cid:17) (cid:62) and (16) g ( x ) = − n n (cid:88) i =1 m (cid:88) j =1 (cid:104) ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) ∂ j b (cid:16) X ( i ) (cid:17) ∂ j t (cid:16) X ( i ) (cid:17) +( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) ∂ jj t (cid:16) X ( i ) (cid:17) + ∂ j ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) ∂ j t (cid:16) X ( i ) (cid:17)(cid:105) , (17)where ∂ j t ( x ) = ( ∂ j t ( x ) , . . . , ∂ j t r ( x )) (cid:62) ∈ R r . Note that (16) and (17) are sample averages offunctions in the data matrix x only. These forms are an exact analog of those in Theorem 5 in Yuet al. (2019b). As expected, we can thus obtain the following consistency result similar to Theorem6 in Yu et al. (2019b): Theorem 4 (Theorem 6 of Yu et al. (2019b)) . Suppose the true density is p ≡ p θ and that(C1) Γ is almost surely invertible, and(C2) Σ ≡ E p (cid:104) ( Γ ( x ) θ − g ( x )) ( Γ ( x ) θ − g ( x )) (cid:62) (cid:105) , Γ ≡ E p Γ ( x ) , Γ − , and g ≡ E p g ( x ) existand are component-wise finite.Then the minimizer of (15) is almost surely unique with closed form solution ˆ θ ≡ Γ ( x ) − g ( x ) with ˆ θ → a.s. θ and √ n ( ˆ θ − θ ) → d N r (cid:0) , Γ − Σ Γ − (cid:1) as n → ∞ . Estimation in high-dimensional settings, in which the number of parameters r may exceed thesample size n , usually benefits from regularization. For exponential families, as in Yu et al. (2019b),we add an (cid:96) penalty on θ to the loss in (15), while multiplying the diagonals of the Γ by a diagonalmultiplier δ >
1: 13 efinition 4.
Let δ > , and Γ δ ( x ) be Γ ( x ) with diagonal entries multiplied by δ . For exponentialfamily distributions in (14), the regularized generalized ( h , C , D )-score matching loss is defined as ˆ J h , C , D ,λ,δ ( p θ ) ≡ θ (cid:62) Γ δ ( x ) θ − g ( x ) (cid:62) θ + λ (cid:107) θ (cid:107) . (18)The multiplier δ >
1, together with the (cid:96) penalty, resembles an elastic net penalty and preventsthe loss in (18) from being unbounded from below for smaller λ , in which case there can be infinitelymany minimizers. This is discussed in Section 4 in Yu et al. (2019b), where a default for δ is given,so that no tuning for this parameter is necessary. Minimization of (18) can be efficiently done usingcoordinate-descent with warm starts, which along with other computational details is discussed inSection 5.3 of Yu et al. (2019b). A key ingredient to our treatment of unbounded domains is truncation of distances. This idea canalso be applied to the method proposed for general bounded domains in Liu and Kanamori (2019);recall Section 2.3. For any component-wise countable union of intervals D as in Definition 1, wemay modify the loss in (6) to J g ,C, D ( P ) ≡ sup g ∈G (cid:90) R m + g ( x ) p ( x ) (cid:107)∇ log p ( x ) − ∇ log p ( x ) (cid:107) d x , but with G ≡ { g | g ( x ) = 0 , ∀ x ∈ ∂ D , g is L -Lipschitz continuous and g ≤ C } instead. Here, we usethe same Lipschitz constant L >
C >
0. Moreover, we usethe component-wise boundary set ∂ D (8) instead of the usual boundary set ∂ D used in Liu andKanamori (2019). Following the same proof as for their Proposition 1 and dropping the Lipschitzconstant L by replacing C with C/L (or equivalently choosing L = 1), it is easy to see that themaximum is obtained at g ( x ) ≡ min (cid:26) C, inf x (cid:48) ∈ ∂ D (cid:107) x − x (cid:48) (cid:107) (cid:27) , (19)the (cid:96) distance of x to ∂ D truncated above by C , which naturally extends the method in Liu andKanamori (2019) to unbounded domains. In the special case where ∂ D = ∅ , we must have D ≡ R m and g ( x ) ≡ C by the expression above (with the convention of inf ∅ = + ∞ ), which coincides withthe original score matching in Hyv¨arinen (2005).14ssuming that (A.1) and (A.2) from Lemma 2 hold when replacing ( h ◦ ϕ )( x ) by g ( x ) m and h j ( ϕ j ( x )) by g ( x ), the same conclusion there applies, i.e. J g ,C, D ( P ) ≡ m (cid:88) j =1 (cid:90) D p ( x ) g ( x ) [ ∂ j log p ( x )] d x + m (cid:88) j =1 (cid:90) D p ( x ) ∂ j [ g ( x ) ∂ j log p ( x )] d x plus a constant depending on p but not on p ; this is the same loss as in Equation (13) in Liu andKanamori (2019) with a truncation by C applied to g . The proof is in the same spirit of that forLemma 2 and is thus omitted. a - b Models
As one realm of application of the proposed estimation method, we consider exponential familymodels that postulate pairwise interactions between power-transformations of the observed vari-ables, as in (4): p η , K ( x ) ∝ exp (cid:18) − a x a (cid:62) K x a + 1 b η (cid:62) x b (cid:19) D ( x ) (20)for which we treat x ≡ log x and 1 / ≡
1, on a domain D ⊆ R m with a positive measure. Here a ≥ b ≥ K ∈ R m × m and the linearvector η ∈ R m are unknown parameters of interest. As in Yu et al. (2019b), our focus will beon the support of K , S ( K ) = { ( i, j ) : κ ij (cid:54) = 0 } , that for product domains defines the conditionalindependence graph of X ∼ p η , K . However, we simultaneously estimate the nuisance parameter η unless it is explicitly assumed to be .When a = b = 1, model (20) is a truncated Gaussian model. From a = b = 1 /
2, we may obtainthe exponential square-root graphical model in Inouye et al. (2016). The gamma model in Yu et al.(2019b) has a = 1 / b = 0. The following theorem gives detailed sufficient conditions for the a - b density p η , K in (20) to be aproper density on a domain D ⊆ R m with positive Lebesgue measure. Theorem 5 (Sufficient conditions for finite normalizing constant) . Denote ρ j ( D ) ≡ { x j : x ∈ D } the closure of the range of x j in the domain D . If any of the following conditions holds, the densityin (20) is a proper density, i.e., the right-hand of (20) is integrable over D : CC1) a > , b > , D is bounded;(CC2) a > , b > , v a (cid:62) K v a > ∀ v ∈ D \{ } , and either a > b or η (cid:62) v b ≤ ∀ v ∈ D ;(CC3) a > , b = 0 , η j > − for all j s.t. ∈ ρ j ( D ) , and one of the following holds:(i) D is bounded;(ii) D is unbounded and v a (cid:62) K v a > ∀ v ∈ D \{ } ;(iii) D is unbounded, v a (cid:62) K v a ≥ ∀ v ∈ D and η j < − for all j s.t. ρ j ( D ) is unbounded (whichimplies that ρ j ( D ) = [0 , + ∞ ) is not allowed for any j );(CC4) a = 0 , D is bounded and (cid:54)∈ ρ j ( D ) for all j ;(CC5) a = 0 , b = 0 , log( x ) (cid:62) K log( x ) > ∀ x ∈ D ;(CC6) a = 0 , b > , log( x ) (cid:62) K log( x ) > ∀ x ∈ D and η j ≤ for all j s.t. ρ j ( D ) is unbounded;(CC7) a = 0 , b > , log( x ) (cid:62) K log( x ) ≥ ∀ x ∈ D and η j < for all j s.t. ρ j ( D ) is unbounded.In the centered case where η = is known, any condition in terms of b and η can be ignored. To simplify our discussion, the following corollary gives a simpler set of sufficient conditions forintegrability of the density.
Corollary 6 (Sufficient conditions for finite normalizing constant; simplified) . Suppose(CC0*) K is positive definiteand one of the following conditions holds:(CC1*) a > b > or a = b = 0 ;(CC2*) a > , b = 0 , η (cid:31) − m ;(CC3*) a = 0 , b > , η j ≤ for any j such that x j is unbounded in D .Then the right-hand side of (20) is integrable over D . In the case where η ≡ , (CC0*) is sufficient. For simplicity, we use conditions (CC0*)–(CC3*) throughout the paper. The following theoremgives sufficient conditions on h that satisfy conditions (A.1)–(A.3) in Lemma 2 for score matching. Theorem 7 (Sufficient conditions that satisfy assumptions for score matching) . Suppose (CC0*)and one of (CC1*) through (CC3*) holds, and h ( x ) = ( x α , . . . , x α m m ) , where
1) if a > and b > , α j > max { , − a, − b } ;(2) if a > and b = 0 , α j > − η ,j ;(3) if a = 0 , α j ≥ .Then conditions (A.1), (A.2) and (A.3) in Lemma 2 are satisfied and the equivalent form of thegeneralized score matching loss (12) holds, and the empirical loss (13) is valid. In the centered casewith η ≡ , it suffices to have a > and α j > max { , − a } or a = 0 and α j ≥ . Let Ψ ≡ (cid:2) K (cid:62) η (cid:3) (cid:62) ∈ R ( m +1) × m . In this section, we give the form of Γ ∈ R ( m +1) m × ( m +1) m and g ∈ R ( m +1) m in the unpenalized loss vec( Ψ ) (cid:62) Γ vec( Ψ ) − g (cid:62) vec( Ψ ) following (16)–(17). Γ isblock-diagonal, with the j -th R ( m +1) × ( m +1) block Γ j ( x ) ≡ Γ K ,j γ K , η ,j γ (cid:62) K , η ,j γ η ,j ≡ n n (cid:88) i =1 ( h j ◦ ϕ j ) (cid:0) X ( i ) (cid:1) X ( i ) j a − X ( i ) a X ( i ) a (cid:62) − ( h j ◦ ϕ j ) (cid:0) X ( i ) (cid:1) X ( i ) j a + b − X ( i ) a − ( h j ◦ ϕ j ) (cid:0) X ( i ) (cid:1) X ( i ) j a + b − X ( i ) a (cid:62) ( h j ◦ ϕ j ) (cid:0) X ( i ) (cid:1) X ( i ) j b − . On the other hand, g ≡ vec (cid:16)(cid:2) g (cid:62) K g η (cid:3) (cid:62) (cid:17) ∈ R ( m +1) m , where g K ∈ R m × m and g η ∈ R m correspondto K and η , respectively. The j -th column of g K is1 n n (cid:88) i =1 (cid:18) ∂ j ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) X ( i ) j a − + ( a − h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) X ( i ) j a − (cid:19) X ( i ) a + a ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) X ( i ) j a − e j,m , (21)where e j,m ∈ R m has 1 at the j -th component and 0 elsewhere, and the j -th entry of g η is1 n n (cid:88) i =1 − ∂ j ( h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) X ( i ) j b − − ( b − h j ◦ ϕ j ) (cid:16) X ( i ) (cid:17) X ( i ) j b − . (22)If a = 0, set the coefficients ( a −
1) to − a to 1 in (21); for b = 0 set ( b −
1) to − δ to the diagonals of Γ K ,j ∈ R m × m ,not γ η ,j ∈ R . Note that each block of Γ and g correspond to each column of Ψ , i.e. ( κ ,j , η j ) ∈ R m +1 .In the penalized generalized score-matching loss (18), the penalty λ for K and η can be different,17 K and λ η , respectively, as long as the ratio λ η /λ K is fixed. For K we follow the convention thatwe penalize its off-diagonal entries only. That is,12 vec( Ψ ) (cid:62) Γ δ ( x )vec( Ψ ) − g ( x ) (cid:62) vec( Ψ ) + λ K (cid:107) K off (cid:107) + λ η (cid:107) η (cid:107) . (23)In the case where we do not penalize η , i.e. λ η = 0, we can simply profile out η , solve for ˆ η = Γ − η (cid:16) g η − Γ (cid:62) K , η vec( ˆ K ) (cid:17) , plug this back in and rewrite the loss in K only. Let Γ δ, K ∈ R m × m bethe block-diagonal matrix with blocks Γ K ,j and diagonal multiplier δ , and let Γ K , η ∈ R m × m and Γ η ∈ R m × m be the (block-)diagonal matrices with blocks γ K , η ,j and γ η ,j , respectively. Denote Γ δ, profiled as the Schur complement of Γ δ, K of (cid:104) Γ δ, K Γ K , η Γ (cid:62) K , η Γ η (cid:105) , i.e. Γ δ, K − Γ K , η Γ − η Γ (cid:62) K , η , which isguaranteed to be positive definite for δ >
1. Then the profiled loss isˆ J h , C , D ,λ,δ, profiled ( p K ) ≡
12 vec( K ) (cid:62) Γ δ, profiled vec( K ) − (cid:0) g K − Γ K , η Γ − η g η (cid:1) (cid:62) vec( K )+ λ K (cid:107) K (cid:107) . (24) To illustrate our generalized score matching framework, we first present univariate Gaussian modelson a general domain D ⊂ R that is a countable union of intervals and has positive Lebesgue measure.In particular, we estimate one of µ and σ assuming the other is known, given that the true densityis p µ ,σ ( x ) ∝ exp (cid:26) − ( x − µ ) σ (cid:27) , x ∈ D , with µ ∈ R and σ >
0. Let X (1) , . . . , X ( n ) ∼ p µ ,σ be i.i.d. samples. Without any regularizationby an (cid:96) or (cid:96) penalty and assuming the true σ is known, we have similar to Example 3.1 in Yuet al. (2019b) that our generalized score matching estimator for µ isˆ µ ≡ (cid:80) ni =1 ( h ◦ ϕ C ) (cid:0) X ( i ) (cid:1) · X ( i ) − σ ( h ◦ ϕ C ) (cid:48) (cid:0) X ( i ) (cid:1)(cid:80) ni =1 ( h ◦ ϕ C ) (cid:0) X ( i ) (cid:1) . By Theorem 7, it suffices to choose h ( x ) = x α with α >
0. Similar to Yu et al. (2019b), we have √ n (ˆ µ − µ ) → d N , E (cid:104) σ ( h ◦ ϕ C ) ( X ) + σ ( h ◦ ϕ C ) (cid:48) ( X ) (cid:105) E [( h ◦ ϕ C )( X )] , (25)if the expectations exist. On the other hand, assuming the true µ is known, the estimator for σ is ˆ σ ≡ (cid:80) ni =1 ( h ◦ ϕ C ) (cid:0) X ( i ) (cid:1) · (cid:0) X ( i ) − µ (cid:1) (cid:80) ni =1 ( h ◦ ϕ C ) (cid:0) X ( i ) (cid:1) + ( h ◦ ϕ C ) (cid:48) (cid:0) X ( i ) (cid:1) · (cid:0) X ( i ) − µ (cid:1) , with limiting distribution18 n (cid:0) ˆ σ − σ (cid:1) → d N , E (cid:104) σ ( h ◦ ϕ C ) ( X ) · ( X − µ ) + σ ( h ◦ ϕ C ) (cid:48) ( X ) · ( X − µ ) (cid:105) E [( h ◦ ϕ C )( X ) · ( X − µ ) ] . (26)Figure 3 shows a standard normal distribution N (0 ,
1) restricted to three univariate domains: D ≡ ( −∞ , − / ∪ [3 / , + ∞ ), D ≡ [ − , − / ∪ [3 / , D ≡ ( −∞ , − / ∪ [ − , − / ∪ [3 / , ∪ [3 / , + ∞ ). The endpoints are chosen so that the probability of the variable ly-ing in each interval is roughly the same: N (0 ,
1) (( −∞ , − / ≈ . N (0 ,
1) ([ − , − / ≈ . C for the distance ϕ C, D , we choose π ∈ (0 ,
1] and let C be the π quantile of the distribution of ϕ + ∞ , D ( X ), where the random variable X follows thetruncation of N (0 ,
1) to the domain D . So, C is such that P ( ϕ + ∞ , D ( X ) ≤ C ) = π . Here, ϕ + ∞ , D ( X ) = | X | − / | X | > /
2, or min( | X | − / , − | X | ) otherwise, ϕ + ∞ , D ( X ) = | X | − / ϕ + ∞ , D ( X ) = min( | X | − / , − | X | ).The first subfigure in each row of Figure 3 shows the density on each domain, along with thecorresponding ϕ + ∞ ( X ) in red, whose y axis is on the right. The second plot in each row showsthe log asymptotic variance for the corresponding ˆ µ , as on the right-hand side of (25), and thethird shows that for ˆ σ as in (26). Each curve represents a different α in h ( x ) = x α , and the x axis represents the quantiles π associated with the truncation point C as above. Finally, the reddotted curve shows C versus π for each domain. The “bumps” in the variance for x . are due tonumerical instability in integration.As we show in Section 5, for the purpose of edge recovery for graphical models, we recommendusing h ( x ) = ( x α , . . . , x α m m ) with α ≥ D that is a finite disjoint union of convex subsets of R m . Although minimizing the asymptotic variance in the univariate case is a different task, α = 1also seems to be consistently the best performing choice.For D and D , all variance curves are U-shaped, while for D = D ∪ D we see two suchcurves piecewise connected at C = max ϕ + ∞ , D ( x ) = (1 − / / . C ,the truncation is applied to the two unbounded intervals (i.e. D ) only. The first segments ofmost var(ˆ µ ) curves for D as well as most curves for D indicate there might still be benefit fromtruncating the distances ϕ within the bounded intervals, although the var (cid:0) ˆ σ (cid:1) curves for D aswell as both curves for x . on D suggest otherwise. On the other hand, the curves for D and D imply that a truncation constant larger than C is favorable; the ticks on the right-hand sideindicate that the curves for D reach their minimum at C ≥ .
5. Hence, a separate truncation point C for each connected component of D could be beneficial, especially for unbounded sets. However,19 D en s i t y U n t r un c a t ed d i s t an c e (a) Density on D . . . . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (b) Asymptotic log var (ˆ µ ) for D . . . . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (c) Asymptotic log var (cid:0) ˆ σ (cid:1) for D −3 −2 −1 0 1 2 30.00.20.40.60.81.0 x D en s i t y U n t r un c a t ed d i s t an c e (d) Density on D . . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (e) Asymptotic log var (ˆ µ ) for D . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (f) Asymptotic log var (cid:0) ˆ σ (cid:1) for D −1.0 −0.5 0.0 0.5 1.00.00.51.01.52.0 x D en s i t y U n t r un c a t ed d i s t an c e (g) Density on D . . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (h) Asymptotic log var (ˆ µ ) for D . . . . Truncation quantile l og ( A sy m p t o t i c v a r) T r un c a t i on c on s t an t (i) Asymptotic log var (cid:0) ˆ σ (cid:1) for D Figure 3: Univariate Gaussian example. Each row represents a domain, with the first subfigureplotting the density, the second the asymptotic log variance of ˆ µ , the third that of ˆ σ . In each firstsubfigure the red lines show the untruncated distances ϕ + ∞ for each domain, and the red dottedlines in the second and third show the truncation point C versus the π quantile.20he necessary tuning of multiple parameters becomes infeasible for m (cid:29) This section presents theoretical guarantees for our generalized method applied to the pairwiseinteraction power a - b models. We first state a result analogous to Yu et al. (2019b) for truncatedGaussian densities on a general domain D , and then present a general result for a - b models. Inparticular, similar to and as a generalization of Yu et al. (2019b), we give a high probability boundon the deviation of our estimates ˆ K and ˆ η from their true values K and η . The main challengein deriving the results lies in obtaining marginal probability tail bounds of each observed value in Γ and g . We first restate Definition 12 from Yu et al. (2019b). Definition 5.
Let Γ ≡ E Γ ( x ) and g ≡ E g ( x ) be the population versions of Γ ( x ) and g ( x ) underthe distribution given by a true parameter matrix Ψ ≡ [ K , η ] (cid:62) ∈ R m ( m +1) , or Ψ ≡ K ∈ R m in the centered case with η ≡ . The support of a matrix Ψ is S ( Ψ ) ≡ { ( i, j ) : ψ ij (cid:54) = 0 } ,and we let S = S ( Ψ ) . We define d Ψ to be the maximum number of non-zero entries in anycolumn of Ψ , and c Ψ ≡ ||| Ψ ||| ∞ , ∞ . Writing Γ ,AB for the A × B submatrix of Γ , we define c Γ ≡ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( Γ ,S S ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ , ∞ . Finally, Γ satisfies the irrepresentability condition with incoherenceparameter ω ∈ (0 ,
1] and edge set S if (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Γ ,S c S ( Γ ,S S ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ , ∞ ≤ (1 − ω ) . (27) Truncated Gaussian models are covered by our a - b models described in Section 4.1 with a = b = 1.When the domain D is a finite disjoint union of convex sets with a positive Lebesgue measure, wehave the following theorem similar to Theorem 17 in Yu et al. (2019b), which bounds the errorsas long as one uses finite truncation points C for ϕ C , D and each component in h ( x ) is a powerfunction with a positive exponent.Specifically, we consider the truncated Gaussian distribution on D with inverse covariance pa-rameter K ∈ R m × m and mean parameter µ , namely with density p η , K ( x ) ∝ exp (cid:18) − x (cid:62) K x + η (cid:62) x (cid:19) D ( x )21ith K positive definite and η ≡ K µ . We assume D ⊂ R m to be a component-wise countableunion of intervals (Def 1) with positive Lebesgue measure, and assume it is a finite disjoint unionof convex sets ∆ ≡ (cid:8) D , . . . , D | ∆ | (cid:9) , i.e. D ≡ D (cid:116) · · · (cid:116) D | ∆ | . Theorem 8.
Suppose the data matrix contains n i.i.d. copies of X following a truncated Gaus-sian distribution on D as above with parameters K ∈ R m × m and µ , Let Ψ ≡ [ K , η ] (cid:62) ≡ [ K , K µ ] (cid:62) . Assume that (A.1)–(A.3) in Lemma 2 hold, and in addition that h and the trunca-tion points C in the truncated component-wise distance ϕ C , D satisfy ≤ ( h j ◦ ϕ C j , D ,j )( x ) ≤ M and ≤ ∂ j ( h j ◦ ϕ C j , D ,j )( x ) ≤ M (cid:48) almost surely for all j = 1 , . . . , m for some constants < M, M (cid:48) < + ∞ . Note that h ( x ) = ( x α , . . . , x α m m ) with α , . . . , α m ≥ satisfies all these assumptions, accord-ing to Theorem 7. Let the diagonal multiplier δ introduced in Section 3.5 satisfy < δ < C ( n, m ) ≡ − (cid:16) e max (cid:110) (6 log m + 2 log | ∆ | ) /n, (cid:112) (6 log m + 2 log | ∆ | ) /n (cid:111)(cid:17) − and suppose further that Γ ,S S is invertible and satisfies the irrepresentability condition (27) with ω ∈ (0 , . Define c X ≡ D (cid:48) ∈ ∆ max j (cid:12)(cid:12)(cid:12) (cid:113) ( K − ) jj + √ e E X j D (cid:48) ( X ) (cid:12)(cid:12)(cid:12) . Suppose for τ > thesample size and the regularization parameter satisfy n > O (cid:32) ( τ log m + log | ∆ | ) max (cid:40) M c Γ c X d Ψ ω , M c Γ c X d Ψ ω (cid:41)(cid:33) , (28) λ > O (cid:34) ( M c Ψ c X + M (cid:48) c X + M ) (cid:32)(cid:114) τ log m + log | ∆ | n + τ log m + log | ∆ | n (cid:33)(cid:35) . (29) Then the following statements hold with probability − m − τ :(a) The regularized generalized h -score matching estimator ˆ Ψ that minimizes (18) is unique, hasits support included in the true support, ˆ S ≡ S ( ˆ Ψ ) ⊆ S , and satisfies (cid:107) ˆ K − K (cid:107) ∞ ≤ c Γ − ω λ, (cid:107) ˆ η − η (cid:107) ∞ ≤ c Γ − ω λ, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ K − K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F ≤ c Γ − ω λ (cid:112) | S | , ||| ˆ η − η ||| F ≤ c Γ − ω λ (cid:112) | S | , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ K − K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ c Γ − ω λ min (cid:16)(cid:112) | S | , d Ψ (cid:17) , ||| ˆ η − η ||| ≤ c Γ − ω λ min (cid:16)(cid:112) | S | , d Ψ (cid:17) . (b) If in addition min j,k :( j,k ) ∈ S | κ ,jk | > c Γ − ω and min j :( m +1 ,j ) ∈ S | η ,j | > c Γ − ω λ , then we have ˆ S = S , sign(ˆ κ jk ) = sign( κ ,jk ) for all ( j, k ) ∈ S and sign(ˆ η j ) = sign( η j ) for ( m + 1 , j ) ∈ S .In the centered setting, the same bounds hold by removing the dependencies on ˆ η and η . D , . . . , D | ∆ | and combine the results with a union bound; (ii) in the proof ofLemma 22.1 of Yu et al. (2019b), replace D ≡ R m + by each D (cid:48) = D , . . . , D | ∆ | and replace X by X D (cid:48) ( X ), as the proof there only uses the convexity of the domain. R m + with Positive Measure In this section we present results for general a - b models on bounded domains with positive measure. Theorem 9. (1) Suppose a > and b ≥ . Let D be a bounded subset of R m + with positive Lebesguemeasure with D ⊆ [ u , v ] × · · · × [ u m , v m ] for finite nonnegative constants u , v , . . . , u m , v m , andsuppose that the true parameters K and η satisfy the conditions in Corollary 6 (for a well-defineddensity). Assume h ( x ) ≡ ( x α , . . . , x α m m ) with α , . . . , α m ≥ max { , − a, − b } , and suppose ϕ C , D has truncation points C = ( C , . . . , C m ) with < C j < + ∞ for j = 1 , . . . , m . Define ζ j ( α j , p j ) ≡ min { C j , ( v j − u j ) / } α j ( u j + v j ) p j / p j , p j < , v j − u j ≤ C j , min { C j , ( v j − u j ) / } α j ( u j + C j ) p j , p j < , v j − u j > C j , min { C j , ( v j − u j ) / } α j v p j j , p j ≥ ,ς Γ ≡ max j,k =1 ,...,m max (cid:8) ζ j ( α j , a − v ak , ζ j ( α j , b − (cid:9) ,ς g ≡ max j,k =1 ,...,m max (cid:8) α j ζ j ( α j − , a − v ak + | a − | ζ j ( α j , a − v ak + aζ j ( α j , a − ,α j ζ j ( α j − , b −
1) + | b − | ζ j ( α j , b − (cid:9) . Suppose that Γ ,S S is invertible and satisfies the irrepresentability condition (27) with ω ∈ (0 , .Suppose that for τ > the sample size, the regularization parameter and the diagonal multipliersatisfy n > c Γ d Ψ ς Γ ( τ log m + log 4) /ω , (30) λ > − ω ) ω max (cid:110) c Ψ ς Γ (cid:112) τ log m + log 4) /n, ς g (cid:112) ( τ log m + log 4) / (2 n ) (cid:111) , (31)1 < δ < C bounded ( n, m, τ ) ≡ (cid:112) ( τ log m + log 4) / (2 n ) . (32) Then the statements (a) and (b) in Theorem 8 hold with probability at least − m − τ .
2) For a = 0 and b ≥ , if u , . . . , u m > , letting w j ≡ max {| log u j | , | log v j |} , the above holdswith ς Γ ≡ max j,k =1 ,...,m max { ζ j ( α j , − w k , ζ j ( α j , b − } ,ς g ≡ max j,k =1 ,...,m max { α j ζ j ( α j − , − w k + | a − | ζ j ( α j , − w k + aζ j ( α j , − ,α j ζ j ( α j − , b −
1) + | b − | ζ j ( α j , b − } . We note that the requirement on α j ≥ ∂ j ( h j ◦ ϕ j ) terms in g ( x )and might not be necessary in practice as we see in the simulation studies. R m + with Positive Measure For unbounded domains D ⊂ R m + in the non-negative orthant, we are able to give consistencyresults but only with a sample complexity that includes an additional unknown constant factorthat may depend on m . For simplicity we only show the results for a >
0. The following lemmaenables us to bound each row of the data matrix x by a finite cube with high probability and thenproceed as for Theorem 9. Lemma 10.
Suppose D has positive measure, and the true parameters K and η satisfy theconditions in Corollary 6. Then for all j = 1 , . . . , m , X aj is sub-exponential if a > and log X j issub-exponential if a = 0 . We have the following corollary of Theorem 9. The result involves an unknown constant, namelythe sub-exponential norm (cid:107) · (cid:107) ψ of X aj , and n = Ω (log m ) · max j O (cid:16)(cid:13)(cid:13) X aj (cid:13)(cid:13) ψ (cid:17) ( α j +max { a, b }− /a becomes the required sample complexity. We conjecture that the sub-exponential norm scales likeΩ ((log m ) c ) for c small, but leave the exact dependency on m for further research. Corollary 11.
Suppose a > and D is a subset of R m + with positive measure and suppose thatthe true parameters K and η satisfy the conditions in Corollary 6. Let ρ j ( D ) ≡ { x j : x ∈ D } and ρ ∗ D ≡ { j = 1 , . . . , m : sup ρ j ( D ) < + ∞} , and suppose ρ j ( D ) ⊆ [ u j , v j ] for j ∈ ρ ∗ D . ThenTheorem 9 holds with log 4 replaced by log 6 in (30)–(32), and u j = max (cid:110) E X aj − (cid:15) ,j , (cid:111) / (2 a ) and v j = (cid:16) E X aj + (cid:15) ,j (cid:17) / (2 a ) for j (cid:54)∈ ρ ∗ D , where (cid:15) ,j ≡ max (cid:26) √ e (cid:13)(cid:13) X aj (cid:13)(cid:13) ψ (cid:113) log 3 + log n + τ log m + log (cid:0) m − | ρ ∗ D | (cid:1) , e (cid:13)(cid:13) X aj (cid:13)(cid:13) ψ (log 3 + log n + τ log m + log ( m − | ρ ∗ D | )) (cid:111) , (cid:13)(cid:13) X aj (cid:13)(cid:13) ψ ≡ sup q ≥ (cid:0) E | X j | aq (cid:1) /q /q ≥ E X aj . In this section we present results of numerical experiments using our method from Sections 3.2 and3.4, as well as our extension of Liu and Kanamori (2019) from Section 3.6. h and C Multiplying the gradient ∇ log p ( x ) with functions ( h ◦ ϕ C , D ) / ( x ) is key to our method, wherethe j -th component of ϕ C , D ( x ) = ( ϕ C , D , ( x ) , . . . , ϕ C m , D ,m ( x )) is the distance of x j to the bound-ary of its domain holding x − j fixed, with this distance truncated from above by some constant C j >
0. We use a single function h for all components (so, h ( x ) = ( h ( x ) , . . . , h ( x m ))), whichwe choose as h ( x ) = x c with exponent c = i/ i = 0 , , . . . ,
8. Instead of pre-specifyingtruncation points in C , we select 0 < π ≤ C j to be the π th sample quantile of { ϕ + ∞ , D ,j ( x (1) ) , . . . , ϕ + ∞ , D ,j ( x ( n ) ) } , where x ( i ) is the i th row of the data matrix x . Infinite valuesof ϕ + ∞ , D ,j are ignored, and ϕ j ≡ ϕ + ∞ , D ,j ( x (1) ) = · · · = ϕ + ∞ , D ,j ( x ( n ) ) = + ∞ . This empiricalchoice of the truncation points automatically adapts to the scale of data, and we found it to bemore effective than fixing the constant to a grid from 0.5 to 3 as done in Yu et al. (2019b). Ourexperiments consider π = 0 . , . , . , . ,
1, where π = 1 means no truncation of finite distances.Note that for c = 0, ( h ◦ ϕ C , D )( x ) ≡ R m of Hyv¨arinen (2005). With c = 2, C = (+ ∞ , . . . , + ∞ ) and D ≡ R m + , ( h ◦ ϕ C , D )( x ) ≡ x corresponds to the estimator of Hyv¨arinen (2007); Lin et al. (2016). The case where D ≡ R m + corresponds to Yu et al. (2018, 2019b).We also consider our extension of the method from Liu and Kanamori (2019), for which we use g ( x ) from (19) as opposed to ( h ◦ ϕ C , D ) / ( x ); see Section 3.6. The constant C in this case isalso determined using quantiles, but now of the untruncated (cid:96) distances of the given data sampleto ∂ D . For C = + ∞ ( π = 1) there is no truncation and the estimator corresponds to Liu andKanamori (2019). We present results for general a - b models restricted to domains with positive Lebesgue measure.25 .2.1 Experimental Setup Throughout, we choose dimension m = 100 and sample sizes n = 80 and 1000. For brevity, weonly present results for the centered case (assuming η ≡ ) where the b power does not come intoplay, i.e., the density is proportional to exp (cid:8) − x a (cid:62) K x a / (2 a ) (cid:9) for a > (cid:0) − log x (cid:62) K log x / (cid:1) for a = 0. Indeed, the experiments in Yu et al. (2019b) suggest that the results for non-centeredsettings are similar with the best choice of h mainly depending on a but not b .We consider six settings: (1) a = 0 (log), (2) a = 1 / a = 1 (Gaussian), (4) a = 3 / a = 2 and (6) a = 3. For all settings, we consider the following subsets of R m + as our domain D :i) non-negative (cid:96) ball { x ∈ R m + : (cid:107) x (cid:107) ≤ c } , which we call (cid:96) -nn (“non-negative”),ii) complement of (cid:96) ball in R m + : { x ∈ R m + : (cid:107) x (cid:107) ≥ c } , which we call (cid:96) (cid:123) -nn, andiii) [ c , + ∞ ) m , which we call unif-nn ,for some c >
0. For the Gaussian ( a = 1) case consider in addition the following subsets of R m :iv) the entire (cid:96) ball { x ∈ R m : (cid:107) x (cid:107) ≤ c } , which we call (cid:96) ,v) the complement of (cid:96) ball in R m : { x ∈ R m : (cid:107) x (cid:107) ≥ c } , which we call (cid:96) (cid:123) , andvi) (( −∞ , c ] ∪ [ c , + ∞ )) m , which we call unif .The constant c in each setting above is determined in the following way. We first generate n samples from the corresponding untruncated distribution on R m + for i)–iii) or R m for iv)–vi), thendetermine the c so that exactly half of the samples would fall inside the truncated boundary.The true interaction matrices K are taken block-diagonal as in Lin et al. (2016) and Yu et al.(2019b), with 10 blocks of size m/
10 = 10. In each block, each lower-triangular element is set to 0with probability 1 − ρ for some ρ ∈ (0 , . , K has minimum eigenvalue 0 .
1. We generate 5 different K , and run 10 trialsfor each of them. We choose ( ρ, n ) = (0 . , . ,
80) so that n/ ( d K log m ) is roughlyconstant, recall our theory in Section 5. 26 .2.2 Results Our focus is on recovery of the support of K = ( κ ,i,j ), i.e., the set S , off ≡ { ( i, j ) : i (cid:54) = j ∧ κ ,i,j (cid:54) = 0 } which corresponds to an undirected graph with S , off as edge set. We use the area under the ROCcurve (AUC) as the measure of performance. Let ˆ K be an estimate with support ˆ S off ≡ { ( i, j ) : i (cid:54) = j ∧ ˆ κ i,j (cid:54) = 0 } . Then the ROC curve plots the true positive rate (TPR) against the false positiverate (FPR), with FPR ≡ | ˆ S off \ S , off | m ( m − − | S , off | and TPR ≡ | ˆ S off ∩ S , off || S , off | . We plot the AUC averaged over all 50 trials in each setting against the probability π used to set thetruncation points C . Each plotted curve is for one choice of the function h ( x ), or for g ( x ). The y -ticks on the right-hand side are the original AUC values, whereas those on the left are the AUCsdivided by the AUC for h ( x ) = 1, measuring the relative performance of each method compared tothe original score matching in Hyv¨arinen (2005); h ( x ) = 1 does not depend on the truncation andis constant in each plot.Plots for a ≤ a > h ( x ) = x c with c ≈ max { − a, } works the best, as we also observedin Yu et al. (2019b). In most settings the truncated g function does not work well (Liu andKanamori (2019) corresponds to π = 1). The only notable exceptions are the domains iv)–vi),i.e., Gaussian models on subsets of R m not restricted to R m + , see Figure 7. The original scorematching in Hyv¨arinen (2005) seems to work the best in these settings, suggesting that estimationof Gaussians on such domains might not be challenging enough to warrant switching to the morecomplex generalized methods. However, Figure 7, for the iv) (cid:96) and v) (cid:96) (cid:123) domains, shows onlyinsignificant differences in the performance of all estimators. We illustrate the use of our generalized score matching for inference of conditional independencerelations among DNA methylations based on data for 500 patients. The dataset contains methy-lation levels of CpG islands associated with head and neck cancer from The Cancer Genome Atlas(TCGA) (Weinstein et al., 2013). Methylation levels are associated with epigenetic regulation ofgenes and, according to (Du et al., 2010), are commonly reported as Beta values, a value in [0 , . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 4: AUCs averaged over 50 trials for support recovery using generalized score matching forlog models ( a = 0). Each curve represents either our extension to g ( x ) from Liu and Kanamori(2019) or a choice of power function h ( x ) = x c . The x axes mark the probabilities π that determinethe truncation points C for the truncated component-wise distances. The colors are sorted by thepower c . 28 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 5: AUCs averaged over 50 trials for support recovery using generalized score matching forexponential square-root models ( a = 1 / g ( x )from Liu and Kanamori (2019) or a choice of power function h ( x ) = x c .29 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 6: AUCs averaged over 50 trials for support recovery using generalized score matchingfor Gaussian models ( a = 1) on domains being subsets of R m + . Each curve represents either ourextension to g ( x ) from Liu and Kanamori (2019) or a choice of power function h ( x ) = x c .30 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif domain Figure 7: AUCs averaged over 50 trials for support recovery using generalized score matchingfor Gaussian models ( a = 1) on domains being subsets of R m (not restricted to R m + ). Each curverepresents either our extension to g ( x ) from Liu and Kanamori (2019) or a choice of power function h ( x ) = x c . 31robe intensities, or M values, defined as the base 2-logit of the Beta values. Supported on R , Mvalues can be analyzed using traditional methods, e.g., via Gaussian graphical models. In contrast,our new methodology allows direct analysis of Beta values using generalized score matching for the a - b model framework.We focus on a subset of CpG sites corresponding to genes known to belong to the pathway forThyroid cancer according to the Kyoto Encyclopedia of Genes and Genomes (KEGG). Furthermore,we remove sites with clearly bimodal methylations, which we assess using the methods from the Rpackage Mclust . This results in n = 500 samples and m = 478 sites belonging to 36 genes.When considering M values, we estimate the graph encoding the support of the interactionmatrix (and hence the conditional dependence structure) in a Gaussian model on R m , i.e., the a - b model with a = b = 1. In doing so, we use the profiled estimator in (24), and choose the upper-bound diagonal multiplier 2 − (1 + 80 (cid:112) log m/n ) − as suggested in Section 6.2 of Yu et al. (2019b).The support being all of R m we simply use the original score matching with ( h ◦ ϕ )( x ) = m .For Beta values, we assume a log-log model ( a = b = 0) on [0 , m , and use the profiled estimatorwith the upper-bound diagonal multiplier 1 + (cid:112) ( τ log m + log 4) / (2 n ) as in (32) with the choice of τ = 3. Suggested by our theory, we use h ( x ) = x , and choose the truncation points in ϕ to bethe 40th sample percentile, as suggested by the simulation results in Figure 4. For our illustration,the λ parameter that defines the (cid:96) penalty on K is chosen so that the number of edges is equal to478, the number of sites, following Lin et al. (2016) and Yu et al. (2019b).The estimated graphs are presented in Figure 8, where panel (a) is for Beta values, (c) is for Mvalues, and (b) shows their common edges, i.e., the intersection graph. The plots in (a), (b), and(c) exclude isolated nodes and the layout is optimized for each graph. Figure 13 in the appendixincludes isolated nodes where the layout is optimized for the graph for Beta values. Figure 14 inthe appendix shows the graphs in Figure 8 aggregated by the genes associated with the sites. In(a) and (c), red points indicate nodes with degree at least 10. Sites with the highest node degreesare listed in Table 1, where those shared by the two graphs are highlighted in bold.We quantify the similarity between the two site graphs (not aggregated) by their Hammingdistance and their DeltaCon similarity score (Koutra et al., 2013). The Hamming distance countsthe number of edge differences, and thus decreases as two graphs become more similar. Conversely,DeltaCon (Koutra et al., 2013) generates a similarity score in [0 , a) Beta values (b) Common edges (c) M values Figure 8: Graphs for CpG sites estimated by regularized generalized score matching estimator usingBeta values (a) and M values (c), and their intersection graph (b).
Beta values M values Beta values M valuesCDH1—4 (28) RXRB—24 (25) LEF1—2 (20) PAX8—9 (20)
TCF7L1—18 (22) MAPK3—8 (22)
TCF7L1—13 (20) TCF7—3 (18)
RXRA—19 (21)
PAX8—6 (21) CDKN1A—10 (20) TCF7L1—9 (18)RXRA—22 (21) CCND1—19 (20) CDKN1A—6 (19)
TCF7L1—18 (18)
RET—22 (21) RXRA—10 (20)
MAPK3—8 (17)
TCF7L2—63 (18)RXRB—82 (21)
RXRA—19 (20)
PAX8—28 (17) TPM3—12 (18)NTRK1—40 (21) RXRB—18 (20)
PAX8—29 (17) PAX8—29 (17)
Table 1: List of sites with the highest node degrees in each estimated graph.the minimal Hamming distance between the graph for Beta values and 10000 randomly generatedgraphs with the same number of edges, and 940, that value using the graph for M values. Onthe other hand, the DeltaCon similarity score between the two original graphs is 0.114, while themaximal score between the Beta graph and 10000 randomly generated graphs is only 0.0781, whilethat for the M graph is 0.0761. In Figure 9, we compare the distribution of node degrees for bothgraphs, with interlaced histogram on the left and Q-Q plot on the right. All these results suggestthat the two estimated graphs are similar to each other, but that the two analyses also revealcomplementary features.
Generalized score matching as proposed in Yu et al. (2019b) is an extension of the method ofHyv¨arinen (2007) that estimates densities supported on R m + using a loss, in which the log-gradientof the postulated density, ∇ log p ( x ), is multiplied component-wise with a function h ( x ). The33 C oun t o f s i t e s graphBetaM (a) Beta values Sample quantiles of node degrees, Beta values S a m p l e quan t il e s o f node deg r ee s , M v a l ue s (b) Common edges Figure 9: Interlaced histogram (left) and Q-Q plot (right) showing the node degree distributionsfor both site graphs.resulting estimator avoids the often costly calculation of normalizing constants and has a closed-form solution in exponential family models.In this paper, we further extend generalized score matching to be applicable to more generaldomains. Specifically, we allow for domains D that are component-wise countable union of intervals D (see Definition 1). We accomplish this by composing the function h with a distance function ϕ C = ( ϕ C , , . . . , ϕ C m ,m ) : D → R m + , where ϕ C j ,j ( x ) is a truncated distance of x j to the boundary ofthe relevant interval in the section of D given by x − j . The resulting loss can again be approximatedby an empirical loss, which is quadratic in the canonical parameters for exponential families.In our applications we focus on a - b pairwise interaction models supported on domains D withpositive Lebesgue measure. For these models we give a concrete choice of the function h and extendthe consistency theory for support recovery in Yu et al. (2019b) to Gaussian models on domainsthat are finite disjoint unions of convex sets, and on bounded domains with positive Lebesguemeasure, requiring the sample size to be n = Ω(log m ). For unbounded domains with a > m . Deriving a moreexplicit requirement on the sample size would be an interesting topic for future work. Finally, inour simulations we adaptively select the truncation points C of ϕ C using the sample quantiles ofthe untruncated distances. Developing a method to choose the best truncation points remains atopic for further research. 34 cknowledgments This work was supported by grants DMS/NIGMS-1561814 from the National Science Foundation(NSF) and R01-GM114029 from the National Institutes of Health (NIH).
References
Pan Du, Xiao Zhang, Chiang-Ching Huang, Nadereh Jafari, Warren A Kibbe, Lifang Hou, andSimon M Lin. Comparison of beta-value and m-value methods for quantifying methylation levelsby microarray analysis.
BMC Bioinformatics , 11(1):587, 2010.Peter G. M. Forbes and Steffen Lauritzen. Linear estimating equations for exponential familieswith application to Gaussian linear concentration models.
Linear Algebra Appl. , 473:261–283,2015.Aapo Hyv¨arinen. Estimation of non-normalized statistical models by score matching.
J. Mach.Learn. Res. , 6:695–709, 2005.Aapo Hyv¨arinen. Some extensions of score matching.
Comput. Statist. Data Anal. , 51(5):2499–2512,2007.David Inouye, Pradeep Ravikumar, and Inderjit Dhillon. Square root graphical models: Multi-variate generalizations of univariate exponential families that permit positive dependencies. In
Proceedings of the 33rd International Conference on Machine Learning , volume 48 of
Proceedingsof Machine Learning Research , pages 2445–2453, 2016.Eric Janofsky. Exponential series approaches for nonparametric graphical models. arXiv preprintarXiv:1506.03537 , 2015.Eric Janofsky. Learning high-dimensional graphical models with regularized quadratic scoring. arXiv preprint arXiv:1809.05638 , 2018.Danai Koutra, Joshua T Vogelstein, and Christos Faloutsos. Deltacon: A principled massive-graphsimilarity function. In
Proceedings of the 2013 SIAM International Conference on Data Mining ,pages 162–170. SIAM, 2013.Lina Lin, Mathias Drton, and Ali Shojaie. Estimation of high-dimensional graphical models usingregularized score matching.
Electron. J. Stat. , 10(1):806–854, 2016.35ong Liu and Takafumi Kanamori. Estimating density models with complex truncation boundaries. arXiv preprint arXiv:1910.03834 , 2019.Weidong Liu and Xi Luo. Fast and adaptive sparse precision matrix estimation in high dimensions.
J. Multivariate Anal. , 135:153–162, 2015.Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright, editors.
Handbookof graphical models . Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRCPress, Boca Raton, FL, 2019.Siqi Sun, Mladen Kolar, and Jinbo Xu. Learning structured densities via infinite dimensional expo-nential families. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advances in Neural Information Processing Systems 28 , pages 2287–2295. Curran Associates,Inc., 2015.Kean Ming Tan, Junwei Lu, Tong Zhang, and Han Liu. Layer-wise learning strategy for nonpara-metric tensor product smoothing spline regression and graphical models.
Journal of MachineLearning Research , 20(119):1–38, 2019.Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In
CompressedSensing , pages 210–268. Cambridge Univ. Press, Cambridge, 2012.Martin J Wainwright.
High-dimensional statistics: A non-asymptotic viewpoint , volume 48. Cam-bridge University Press, 2019.John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger,Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, Cancer Genome Atlas ResearchNetwork, et al. The cancer genome atlas pan-cancer analysis project.
Nature genetics , 45(10):1113, 2013.Ming Yu, Mladen Kolar, and Varun Gupta. Statistical inference for pairwise graphical models usingscore matching. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Information Processing Systems 29 , pages 2829–2837. Curran Associates,Inc., 2016.Ming Yu, Varun Gupta, and Mladen Kolar. Simultaneous inference for pairwise graphical modelswith generalized score matching. arXiv preprint arXiv:1905.06261 , 2019a.36hiqing Yu, Mathias Drton, and Ali Shojaie. Graphical models for non-negative data using gener-alized score matching. In
International Conference on Artificial Intelligence and Statistics , pages1781–1790, 2018.Shiqing Yu, Mathias Drton, and Ali Shojaie. Generalized score matching for non-negative data.
Journal of Machine Learning Research , 20(76):1–70, 2019b.Teng Zhang and Hui Zou. Sparse precision matrix estimation via lasso penalized D-trace loss.
Biometrika , 101(1):103–120, 2014.
A Additional Plots
Below we present additional plots for our simulations in Sections 6 and 7.37 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 10: AUCs averaged over 50 trials for support recovery using generalized score matching forthe a = 3 / g ( x ) from Liu and Kanamori(2019) or a choice of power function h ( x ) = x c . 38 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 11: AUCs averaged over 50 trials for support recovery using generalized score matching forthe a = 2 models. Each curve represents either our extension to g ( x ) from Liu and Kanamori(2019) or a choice of power function h ( x ) = x c . The x axes mark the probabilities π that determinethe truncation points C for the truncated component-wise distances. The colors are sorted by thepower c . 39 . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (a) n = 80, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (b) n = 1000, (cid:96) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (c) n = 80, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (d) n = 1000, (cid:96) (cid:123) -nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (e) n = 80, unif-nn domain . . . . . . A UC Truncation quantile A UC , c o m pa r ed t o h ( x ) = . . . . . . FunctionConstg0(x)x^0.25x^0.5x^0.75x^1x^1.25x^1.5x^1.75x^2 (f) n = 1000, unif-nn domain Figure 12: AUCs averaged over 50 trials for support recovery using generalized score matching forthe a = 3 models. Each curve represents either our extension to g ( x ) from Liu and Kanamori(2019) or a choice of power function h ( x ) = x c . The x axes mark the probabilities π that determinethe truncation points C for the truncated component-wise distances. The colors are sorted by thepower c . 40 a) Beta values (b) Common edges (c) M values Figure 13: Graphs for CpG sites estimated by regularized generalized score matching estimatorusing Beta values (a) and M values (c), and their intersection graph (b). Isolated nodes areincluded and the layout is optimized for the graph for Beta values; red nodes have degree at least10 (“hub nodes”).
BRAFCCDC6CCND1 CDH1CDKN1ACTNNB1 GADD45AGADD45BGADD45G HRASKRASLEF1 MAP2K1MAPK1 MAPK3MYCNCOA4 NTRK1PAX8RETRXRARXRB RXRGTCF7TCF7L1TCF7L2TP53 TPM3TPR (a) Beta values
BRAF CCDC6CCND1CDH1CDKN1A GADD45BKRASLEF1MAPK3NCOA4 NTRK1 PAX8RET RXRA RXRBRXRGTCF7 TCF7L1TCF7L2 TP53TPM3 (b) Common edges
BRAF CCDC6CCND1CDH1CDKN1A CTNNB1 GADD45B HRASKRAS LEF1MAPK3NCOA4NRAS NTRK1 PAX8PPARGRET RXRA RXRBRXRGTCF7TCF7L1 TCF7L2 TP53TPM3 (c) M values(d) Beta values (e) Common edges (f) M values
Figure 14: Graphs in Figure 8 (equivalently Figure 13) aggregated by the genes associated withthe CpG sites, with Beta values (a, d) and M values (c, f), and their intersection graph (b, e). Redpoints indicate genes that are connected to at least 10 other genes in the case of Figure 1441
Proofs
Before proving Lemma 2 we first prove the following lemma.
Lemma 12.
Suppose h , . . . , h m are absolutely continuous in any bounded sub-interval of R + . Thenfor any j = 1 , . . . , m and any x − j ∈ S − j, D , ( h j ◦ ϕ j ) is absolutely continuous in x j in any boundedsub-interval of C j, D ( x − j ) .Proof of Lemma 12. In the proof we drop the dependency on x − j in notation. By assumption,under Equation 7 any bounded sub-interval [ a, b ] of C j, D ( x − j ) must be a sub-interval of [ a k,j , b k,j ]for some k (for simplicity we do not differentiate among [ a, b ], ( a, b ], [ a, b ) and ( a, b ) here).(1) If a k,j > −∞ and b k,j < + ∞ , denote C ≡ min { C j , ( b k,j − a k,j ) / } and rewrite( h j ◦ ϕ j )( x )= h j (min( C j , x j − a k,j , b k,j − x j ))= h j ( x j − a k,j ) x j ∈ [ a k,j ,a k,j + C ] + h j ( C j ) x j ∈ [ a k,j + C ,b k,j − C ] + h j ( b k,j − x j ) x j ∈ [ b k,j − C ,b k,j ] . Then by absolute continuity of h j in [ a k,j , b k,j ] it is apparent that ( h j ◦ ϕ j ) is differentiablein x j a.e. with partial derivative h (cid:48) j ( x j − a k,j ) x j ∈ [ a k,j ,a k,j + C ] − h (cid:48) j ( b k,j − x j ) x j ∈ [ b k,j − C ,b k,j ] . Then by the absolute continuity of h j again, for x j ∈ [ a k,j , b k,j ], (cid:90) x j a k,j ∂ j ( h j ◦ ϕ j )( t j ; x − j ) d t j = h j ( x j − a k,j ) x j ∈ [ a k,j ,a k,j + C ] + h j ( C j ) x j ∈ [ a k,j + C ,b k,j − C ] + h j ( b k,j − x j ) x j ∈ [ b k,j − C ,b k,j ] = ( h j ◦ ϕ j )( x ) , which proves that ( h j ◦ ϕ j )( x ) is absolutely continuous in x j in [ a k,j , b k,j ], and hence in[ a, b ] ⊂ [ a k,j , b k,j ].(2) If a k,j > −∞ and b k,j = + ∞ , on [ a, b ] ( h j ◦ ϕ j )( x ) = h j (min( C j , x j − a k,j )) is an absolutelycontinuous function in a linear function of x j truncated above by C j , and is thus triviallyabsolutely continuous in [ a, b ].(3) If a k,j = −∞ and b k,j < + ∞ , on [ a, b ] ( h j ◦ ϕ j )( x ) = h j (min( C j , b k,j − x j )) is an absolutelycontinuous function in a linear function of x j truncated above by C j , and is thus triviallyabsolutely continuous in [ a, b ]. 424) If a k,j = −∞ and b k,j = + ∞ , ( h j ◦ ϕ j )( x ) = h j ( C j ) is constant and hence trivially absolutelycontinuous in [ a, b ]. Proof of Lemma 2.
By simple manipulation J h , C , D ( p ) ≡ (cid:90) D p ( x ) (cid:13)(cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) − ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:13)(cid:13)(cid:13) d x (33)= 12 m (cid:88) j =1 (cid:90) D p ( x )( h j ◦ ϕ j )( x ) ( ∂ j log p ( x ) − ∂ j log p ( x )) d x = 12 m (cid:88) j =1 (cid:90) D p ( x )( h j ◦ ϕ j )( x ) ( ∂ j log p ( x )) d x − m (cid:88) j =1 (cid:90) D p ( x )( h j ◦ ϕ j )( x ) ∂ j log p ( x ) ∂ j log p ( x ) d x + const . (34)By (34) it suffices to prove for all j = 1 , . . . , m that (cid:90) D p ( x )( h j ◦ ϕ j )( x ) ∂ j log p ( x ) ∂ j log p ( x ) d x = − (cid:90) D p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x . (35)Since (cid:82) D p ( x ) (cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:13)(cid:13) d x and (cid:82) D p ( x ) (cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:13)(cid:13) d x areboth finite under assumption, by | ab | ≤ a + b the integrand in the left-hand side of (35) isintegrable. Then by Fubini-Tonelli (cid:90) D p ( x )( h j ◦ ϕ j )( x ) ∂ j log p ( x ) ∂ j log p ( x ) d x = (cid:90) S − j (cid:90) C j ( x − j ) ( h j ◦ ϕ j )( x ) ∂ j p ( x ) ∂ j log p ( x ) (cid:124) (cid:123)(cid:122) (cid:125) ≡ f ( x ) d x j d x − j = (cid:90) S − j (cid:90) R C j ( x − j ) ( x j ) f ( x ) d x j d x − j = (cid:90) S − j (cid:90) R K j ( x − j ) (cid:88) k =1 [ a k,j ( x − j ) , b k,j ( x − j )] ( x j ) f ( x j ; x − j ) d x j d x − j = (cid:90) S − j K j ( x − j ) (cid:88) k =1 (cid:90) b k,j ( x − j ) a k,j ( x − j ) f ( x j ; x − j ) d x j d x − j (36)where the interchangeability of integration and (potentially infinite) summation is justified byFubini-Tonelli again. Then using the decomposition of the domain in (7) while omitting the de-pendency of a k,j and b k,j on x − j in notation, for a.e. x − j ∈ S − j and any k = 1 , . . . , K j ( x − j ) we43ave (cid:90) b k,j a k,j f ( x ) d x j = (cid:90) b k,j a k,j ( h j ◦ ϕ j )( x ) ∂ j p ( x ) ∂ j log p ( x ) d x j = lim x j (cid:37) b − kj ( h j ◦ ϕ j )( x ) p ( x ) ∂ j log p ( x ) − lim x j (cid:38) a + kj ( h j ◦ ϕ j )( x ) p ( x ) ∂ j log p ( x ) − (cid:90) b k,j a k,j p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x j = − (cid:90) b k,j a k,j p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x j , by integration by parts and by Assumption (A1) on the limits going to 0. The integration by partsis justified by the fundamental theorem of calculus for absolutely continuous functions (Lemma 12)as well as the product rule (cf. proof of Lemma 19 in Yu et al. (2019b)). Thus, by going backwardsusing Fubini-Tonelli twice again, (36) becomes (cid:90) S − j − K j ( x − j ) (cid:88) k =1 (cid:90) b k,j ( x − j ) a k,j ( x − j ) p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x j d x − j = − (cid:90) S − j (cid:90) C j ( x − j ) p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x j d x − j = − (cid:90) D p ( x ) ∂ j [( h j ◦ ϕ j )( x ) ∂ j log p ( x )] d x , proving (35). Proof of Theorem 5.
Note that the condition v a (cid:62) K v a > ∀ v ∈ D \{ } implies that v a (cid:62) K v a > ∀ v ∈ D + ≡ { v / (cid:107) v (cid:107) : v ∈ D \{ }} ⊆ { v ∈ R m : (cid:107) v (cid:107) = 1 } ≡ S m − with S m − compact, so N K ≡ inf v ∈ D \{ } v a (cid:62) K v a / v a (cid:62) v a = inf v ∈ D + v a (cid:62) K v a / v a (cid:62) v a ≥ inf v ∈ S m − v a (cid:62) K v a / v a (cid:62) v a > . (1) Case a > and b > (CC1, CC2): Since p is bounded everywhere, it is integrable over abounded D (proving (CC1)). Otherwise, assume D is unbounded. If either a or b is non-integer,then D ⊂ R m + and a sufficient condition is v a K v a > ∀ v ∈ D \{ } , and either η (cid:62) v b ≤ ∀ v ∈ D or 2 a > b >
0, corresponding to (i) and (ii) in the Proof of Theorem 9 in Section A.3 of Yu et al.(2019b), respectively. If a and b are both integers, D ⊂ R m and the same sufficient condition canbe implied following the same proof in Yu et al. (2019b), with integration over ( −∞ , + ∞ ) insteadof (0 , + ∞ ). This proves (CC2). 442) Case a > and b = 0 (CC3): By definition D ⊆ R m + . If D is bounded, − a x a (cid:62) K x a as a con-tinuous function is bounded, and so it suffices to bound (cid:82) D exp (cid:0) η (cid:62) log( x ) (cid:1) d x = (cid:82) D (cid:81) mj =1 x η j j d x ≤ (cid:81) mj =1 (cid:82) ρ j ( D ) x η j j d x j < + ∞ if η j > − j such that 0 ∈ ρ j ( D ), where for the ≤ step we usedthe fact that x j >
0. This proves (CC3) (i).If D is unbounded and v a (cid:62) K v a > v ∈ D \{ } , using the fact that exp( · · · ) > (cid:90) D p η , K ( x ) d x = (cid:90) D exp (cid:16) − x a (cid:62) K x a / (2 a ) + η (cid:62) log( x ) (cid:17) d x ≤ m (cid:89) j =1 (cid:90) ρ j ( D ) exp (cid:0) − N K x aj / (2 a ) + η j log( x j ) (cid:1) d x j . Note that the indefinite integral of the last display is − a x η j (cid:18) N K a x a (cid:19) − (1+ η j ) / (2 a ) Γ (cid:20) η j a , N K x a a (cid:21) so the definite integral is finite if and only if η j > − j s.t. 0 ∈ ρ j ( D ). This proves (CC3)(ii).If D is unbounded and v a (cid:62) K v a ≥ v ∈ D , then (cid:82) D p η , K ( x ) d x ≤ (cid:81) mj =1 (cid:82) ρ j ( D ) x η j j d x j < ∞ if η j > − j s.t. 0 ∈ ρ j ( D ) and η j < − j s.t. ρ j ( D ) is unbounded. This proves(CC3) (iii).(3) Case a = 0 , D is bounded and (cid:54)∈ ρ j ( D ) for all j (CC4): If D is bounded and 0 (cid:54)∈ ρ j ( D )for all j , then log( D ) is bounded, and since the integrand is continuous and bounded, the integralis finite without any further requirements.(4) Case a = 0 and b = 0 (CC5): Assume log( x ) (cid:62) K log( x ) > x ∈ D , then (cid:90) D p η , K ( x ) d x = (cid:90) D exp (cid:18) −
12 log( x ) (cid:62) K log( x ) + η (cid:62) log( x ) (cid:19) d x = (cid:90) log( D ) exp (cid:18) − x (cid:62) K x + ( η + m ) (cid:62) x (cid:19) d x < m (cid:89) j =1 (cid:90) log( ρ j ( D )) exp (cid:0) − N K x j / η j + 1) x j (cid:1) d x j < m (cid:89) j =1 (cid:90) ∞−∞ exp (cid:0) − N K x j / η j + 1) x j (cid:1) d x j < + ∞ since the integrand is proportional to a univariate Gaussian density.(5) Case a = 0 and b > (CC6, CC7): Assume log( x ) (cid:62) K log( x ) > x ∈ D and η j ≤ j s.t. ρ j ( D ) is unbounded (from above). Then (cid:90) D p η , K ( x ) d x = (cid:90) D exp (cid:18) −
12 log( x ) (cid:62) K log( x ) + η (cid:62) x b (cid:19) d x (cid:90) log( D ) exp (cid:18) − x (cid:62) K x + (cid:62) m x + η (cid:62) exp( b x ) (cid:19) d x < m (cid:89) j =1 (cid:90) log( ρ j ( D )) exp (cid:0) − N K x j / x j + η j exp( bx j ) (cid:1) d x j ≤ m (cid:89) j =1 (cid:90) ∞−∞ c j exp (cid:0) − N K x j / x j (cid:1) d x j < + ∞ , where c j ≡ η j ≤ c j ≡ exp (cid:16) η j (sup ρ j ( D )) b (cid:17) > + ∞ otherwise. This proves (CC6).Finally, if log( D ) is unbounded and log( x ) (cid:62) K log( x ) ≥ x ∈ D , the integral is boundedby m (cid:89) j =1 (cid:90) log( ρ j ( D )) exp ( x j + η j exp( bx j )) d x j which is finite if and only if η j < j s.t. ρ j ( D ) is unbounded (from above). This proves(CC7). Proof of Theorem 7.
It suffices to consider the case D = R m + for general a and b as well as D = R m for integer a > b > R m ): For (A.1), the irregularitiesonly occur at the boundary points, but with the composition ( h j ◦ ϕ j )( x ) with x j approachingany finite boundary point behaves like h j ( x j ) with x j (cid:38) + in D = R m + , and ( h j ◦ ϕ j )( x ) with x j → ∞ behaves like h j ( x j ) with x j → ∞ in D = R m + (or R m if applicable). For (A.2), obviouslyintegrability over D follows from that over D = R m + or R m . (A.3) is trivially satisfied by a powerfunction h j .As in the proof of Theorem 5, N K ≡ inf v ∈ D v a (cid:62) K v a / v a (cid:62) v a > a > b ≥ D = R m + is covered in Yu et al. (2019b). The proof forthe case for a > b > D = R m is analogous and omitted.(2) Case a = 0 and b = 0 : | p ( x ) ∂ j log p ( x ) |∝ exp (cid:18) −
12 log( x ) (cid:62) K log( x ) + η (cid:62) log( x ) (cid:19) × (cid:12)(cid:12)(cid:12) x − j (cid:16) η j − κ (cid:62) j, − j log( x − j ) (cid:17) − κ jj x − j log x j (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) (cid:16) η j − κ (cid:62) j, − j log x − j (cid:17) exp (cid:2) − N K (log x j ) / η j −
1) log x j (cid:3) − κ jj exp (cid:2) − N K (log x j ) / η j −
1) log x j (cid:3) log x j (cid:12)(cid:12)(cid:12) (cid:89) k (cid:54) = m exp (cid:0) − N K (log x k ) / η j log x k (cid:1) ∝ O (cid:2) exp (cid:0) − N K y j / η j − y j (cid:1)(cid:3) + O (cid:2) exp (cid:0) − N K y j / η j − y j (cid:1) y j (cid:3) which apparently vanishes as x j (cid:38) + and x j (cid:37) + ∞ with y j ≡ log( x j ) since it is dominated by aconstant times a Gaussian density in y j . Thus, by Proposition 3, (A.1) is satisfied with any α j ≥ (cid:90) R m + p ( x ) (cid:13)(cid:13)(cid:13) ∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:13)(cid:13)(cid:13) d x ≤ const · m (cid:88) j =1 (cid:90) R m + m (cid:89) k =1 exp (cid:2) − N K (log x k ) / η k log( x k ) (cid:3) × h j ( x j ) (cid:104) x − j (cid:16) η j − κ (cid:62) j, − j log( x − j ) (cid:17) − κ jj x − j log x j (cid:105) d x , which can be decomposed into a sum of products of univariate integrals of the formconst · exp (cid:0) − N K (log x j ) / A log( x j ) (cid:1) (log x j ) B ( h j ( x j )) C with B = 0 , , C = 0 ,
1, and constants A . With h j ( x j ) = x α j j for any α j ≥ x j , so (cid:82) R m + p ( x ) (cid:107)∇ log p ( x ) (cid:12) ( h ◦ ϕ ) / ( x ) (cid:107) d x < + ∞ . Similarly,we have (cid:82) R m + p ( x ) (cid:107) [ ∇ log p ( x ) (cid:12) ( h ◦ ϕ )( x )] (cid:48) (cid:107) d x < + ∞ and the proof is omitted.(3) Case a = 0 and b > : Recall ρ j ( D ) ≡ { x j : x ∈ D } . Let ρ ∗ j ( D ) ≡ sup ρ j ( D ). Since weassume that η j ≤ j such that ρ ∗ j ( D ) < + ∞ , p ( x ) ∂ j log p ( x ) ∝ exp (cid:18) −
12 log( x ) (cid:62) K log( x ) + 1 b η (cid:62) x b (cid:19) × (cid:104) η j x b − j − x − j κ (cid:62) j, − j log( x − j ) − κ jj x − j log x j (cid:105) ≤ exp −
12 log( x ) (cid:62) K log( x ) + 1 b (cid:88) j : ρ ∗ j ( D ) < + ∞ η j ( ρ ∗ j ( D )) b × (cid:104) − x − j κ (cid:62) j, − j log( x − j ) − κ jj x − j log x j (cid:105) ∝ exp (cid:18) −
12 log( x ) (cid:62) K log( x ) (cid:19) (cid:104) − x − j κ (cid:62) j, − j log( x − j ) − κ jj x − j log x j (cid:105) is bounded by the corresponding quantity in the a = b = 0 case with η = m , and (A.1) is thussatisfied. Similarly, the two quantities for (A.2) are bounded by a constant times those in the a = b = 0 case with η = m and (A.2) is thus also satisfied.47 roof of Theorem 9. It suffices to bound Γ and g using their forms in Section 4.3 and apply Theo-rem 1 in Lin et al. (2016). Thus, we first find the bounds of ( h j ◦ ϕ j )( x ) x p j j x p k k x p (cid:96) (cid:96) with h j ( x ) = x α j , α j ≥ α j ≥ − p j , p j ∈ R , p k ≥ p (cid:96) ≥ x i ∈ [ u i , v i ] for i = 1 , . . . , m . Suppose without loss ofgenerality that j , k , (cid:96) are all different, asmax x j f j, ( x j ) max x j f j, ( x j ) max x j f j, ( x j ) ≥ max x j ( f j, ( x j ) f j, ( x j ) f j, ( x j )) ≥ f j, , f j, , f j, .As x j approaches its boundary, ϕ j ( x ) (cid:38) + and hence ( h j ◦ ϕ j )( x ) x p j j (cid:38) + if α j > − p j . Thelower bound 0 for ( h j ◦ ϕ j )( x ) x p j j x p k k x p (cid:96) (cid:96) is thus tight enough.As for the upper bound, the only way for the quantity to be unbounded from above is when x j (cid:38) + and p j <
0, but as x j (cid:38) + , ( h j ◦ ϕ j )( x ) = x α j j so this cannot happen with the choice of α j ≥ − p j . Noting that h j is monotonically increasing, we consider the following cases:(1) Suppose x j ≥ ( u j + v j ) /
2. Then( h j ◦ ϕ j )( x ) ≤ h j (min { C j , v j − x j } ) ≤ h j (min { C j , ( v j − u j ) / } ) ≤ min { C α j j , ( v j − u j ) α j / α j } , and x p j j ≤ ( u j + v j ) p j / p j if p j < x p j j ≤ v p j j if p j ≥ x j ≤ ( u j + v j ) /
2. Then( h j ◦ ϕ j )( x ) x p j j ≤ h j (min { C j , x j − u j } ) x p j j = min { C α j j , ( x j − u j ) α j } x p j j . Now let f ( x ) = (min { C j , x − u j } ) α j x p j . Then (log f ( x )) (cid:48) = α j / ( x − u j ) x u j , α j ≥ − p j and α j ≥
0. This implies that if p j ≥ v j − u j ≤ C j , f is increasing on ( u j , ( u j + v j ) / h j ◦ ϕ j )( x ) x p j j ≤ min { C j , ( v j − u j ) / } α j ( u j + v j ) p j / p j ; otherwise, f is increasing on ( u j , u j + C j ) and decreasing on ( u j + C j , ( u j + v j ) / h j ◦ ϕ j )( x ) x p j j ≤ C α j j ( u j + C j ) p j .Thus, defining ζ j ( α j , p j ) 48 min { C j , ( v j − u j ) / } α j ( u j + v j ) p j / p j , p j < , v j − u j ≤ C j , min { C j , ( v j − u j ) / } α j ( u j + C j ) p j , p j < , v j − u j > C j , min { C j , ( v j − u j ) / } α j v p j j , p j ≥ , we have 0 ≤ ( h j ◦ ϕ j )( x ) x p j j x p k k x p (cid:96) (cid:96) ≤ ζ j ( α j , p j ) v p k k v p (cid:96) (cid:96) . Now assume additionally that α j ≥ max { , − p j } , then by h (cid:48) j ( x ) = α j x α j − j , 0 ≤ ∂ j ( h j ◦ ϕ j )( x ) x p j j x p k k ≤ α j ζ j ( α j − , p j ) v p k k .First assume a >
0. Then assuming α , . . . , α m ≥ max { , − a, − b, − a, − a, − b } =max { , − a, − b } , using the form of Γ and g in Section 4.3, for all j, k, (cid:96) we have0 ≤ γ j,k,(cid:96) ( x ) ≤ ς Γ ≡ max j,k =1 ,...,m max { ζ j ( α j , a − v ak , ζ j ( α j , b − } and 0 ≤ g j,k ( x ) ≤ ς g ≡ max j,k =1 ,...,m max { α j ζ j ( α j − , a − v ak + | a − | ζ j ( α j , a − v ak + aζ j ( α j , a − ,α j ζ j ( α j − , b −
1) + | b − | ζ j ( α j , b − } .Then by Hoeffding’s inequality, P (cid:18) max j,k,(cid:96) | γ j,k,(cid:96) − E γ j,k,(cid:96) | ≥ (cid:15) / (cid:19) ≤ (cid:0) − n(cid:15) / (2 ς Γ ) (cid:1) , (37) P (cid:18) max j,k | g j,k − E g j,k | ≥ (cid:15) (cid:19) ≤ (cid:0) − n(cid:15) /ς g (cid:1) . (38)Let (cid:15) ≡ ς Γ (cid:112) m τ + log 4) /n and (cid:15) ≡ ς g (cid:112) (log m τ + log 4) / (2 n ). With the choice of δ ≤ (cid:112) (log m τ + log 4) / (2 n ) and using the fact that 0 ≤ max j,k,(cid:96) γ j,k,(cid:96) ≤ ς Γ = (cid:15) / (2 δ − P (cid:18) max j,k,(cid:96) | δγ j,k,(cid:96) − E γ j,k,(cid:96) | ≥ (cid:15) (cid:19) (39) ≤ P (cid:18) max j,k,(cid:96) | γ j,k,(cid:96) − E γ j,k,(cid:96) | + ( δ −
1) max j,k,(cid:96) γ j,k,(cid:96) ≥ (cid:15) (cid:19) ≤ P (cid:18) max j,k,(cid:96) | γ j,k,(cid:96) − E γ j,k,(cid:96) | ≥ (cid:15) / (cid:19) ≤ m − τ / , (40) P (cid:18) max j,k | g j,k − E g j,k | ≥ (cid:15) (cid:19) ≤ m − τ / . (41)The results then follow by applying Theorem 1 in Lin et al. (2016).49n the case where a = 0, and u k > k , | ( h j ◦ ϕ j )( x ) x p j j log( x k ) log( x (cid:96) ) | ≤ ζ j ( α j , p j ) · max {| log( u k ) log( u (cid:96) ) | , | log( v k ) log( v (cid:96) ) |} and everything else follows similarly as for a > Proof of Lemma 10.
We show that X aj for a > X j for a = 0 is sub-exponential by showingits moment-generating function is finite. Then the sub-exponentiality follows from Theorem 2.13of Wainwright (2019).First consider the case where a = 0. In Corollary 6, we only require K to be positive definitewithout any restrictions on η , and thus for any t ∈ R , E exp( t log X j ) is the inverse normalizingconstant for the model with parameters K and η + t e j , where e j is the vector with the j -thcoordinate equal to 1 and the rest equal to 0, and is thus finite.Next, consider a >
0. Corollary 6 requires K to be positive definite, and in addition η (cid:31) − m if b = 0. Then, again writing x / x for the b part, p ( x ) exp (cid:0) tx aj (cid:1) ∝ exp (cid:18) − a x a (cid:62) K x a + 1 b η (cid:62) x b + tx aj (cid:19) ≤ exp (cid:32) m (cid:88) k =1 (cid:16) ( − λ min ( K ) + 2 at k = j ) x ak / (2 a ) + η ,k x bk /b (cid:17)(cid:33) , a constant times the density for parameters diag ( λ min ( K ) m − at e j ) and η . Thus, for t ∈ ( −∞ , λ min ( K ) / (2 a )) (cid:51) E exp (cid:16) tX aj (cid:17) is finite. Proof of Corollary 11.
Let the sub-exponential norm of X aj be (cid:13)(cid:13)(cid:13) X aj (cid:13)(cid:13)(cid:13) ψ ≡ sup q ≥ ( E | X j | aq ) /q /q ,then by Lemma 21.6) of Yu et al. (2019b) or Corollary 5.17 of Vershynin (2012), P (cid:0)(cid:12)(cid:12) X aj − E X aj (cid:12)(cid:12) ≥ (cid:15) ,j (cid:1) ≤ exp − min (cid:15) e (cid:13)(cid:13)(cid:13) X aj (cid:13)(cid:13)(cid:13) ψ , (cid:15) e (cid:13)(cid:13)(cid:13) X aj (cid:13)(cid:13)(cid:13) ψ . Letting (cid:15) ,j ≡ max (cid:110) √ e (cid:13)(cid:13) X aj (cid:13)(cid:13) ψ (cid:113) log 3 + log n + τ log m + log (cid:0) m − (cid:12)(cid:12) ρ ∗ D (cid:12)(cid:12)(cid:1) , e (cid:13)(cid:13) X aj (cid:13)(cid:13) ψ (log 3 + log n + τ log m + log ( m − | ρ ∗ D | )) (cid:111) , then max (cid:110) E X aj − (cid:15) ,j , (cid:111) / (2 a ) ≤ X ( i ) j ≤ (cid:16) E X aj + (cid:15) ,j (cid:17) / (2 a ) for all j (cid:54)∈ ρ ∗ D and i = 1 , . . . , n with probability at least 1 − / (3 m ττ