Monomial-agnostic computation of vanishing ideals
Hiroshi Kera∗ and Yoshihiko Hasegawa∗

Abstract
In recent years, the approximate basis computation of vanishing ideals has been studied extensively and adopted both in computer algebra and in data-driven applications such as machine learning. However, symbolic computation and the dependency on monomial ordering remain essential gaps between the two above-mentioned fields. In this paper, we propose the first efficient monomial-agnostic approximate basis computation of vanishing ideals, where polynomials are manipulated without any information on monomials; this can be implemented in a fully numerical manner and is thus desirable for data-driven applications. In particular, we propose gradient normalization, which not only achieves the first efficient and monomial-agnostic normalization of polynomials but also provides significant advantages such as consistency under translation and scaling of data points, which cannot be realized by existing basis computation algorithms. During the basis computation, the gradients of polynomials at the given points are proven to be efficiently and exactly obtained without performing differentiation. By exploiting the gradient information, we further propose a basis reduction method to remove redundant polynomials in a monomial-agnostic manner. Finally, we also propose a regularization method using gradients to avoid overfitting of the basis to the given perturbed points.
1 Introduction

In the last decade, the basis computation of vanishing ideals has attracted interest in a variety of data-driven applications such as machine learning, signal processing, and computer vision [3, 8, 9, 10, 14, 17, 21, 24, 26, 27, 28]. This connection between computer algebra and data-driven applications has been established via three breakthroughs. The first breakthrough is the Buchberger–Möller (BM) algorithm [22]. In contrast to the standard computation of the Gröbner bases of polynomial ideals, the BM algorithm takes not a set of polynomials but a set of points as input and computes a Gröbner basis of the vanishing ideal of the points in polynomial time. This suits the settings of many applications that aim to extract information from large datasets and learn the underlying structure in a reasonable time. The second breakthrough is the development of border bases [11, 12, 19] and their efficient, stable, and approximate computation [2, 7, 20] of the vanishing ideal of a set of perturbed points. In particular, the approximately vanishing ideal (AVI) algorithm [7] is a seminal algorithm. As the data available in most applications are typically noisy, stable approximate basis computation bridges a critical gap between computer algebra and data-driven applications.

However, another critical gap remains between these two fields: computer algebra is based on symbolic computation, whereas data-driven applications are based on numerical computation. In computer algebra, the basis computation of vanishing ideals handles polynomials through symbolic computation. Fixing a monomial ordering leads to efficient and well-defined operations, accompanied by solid theoretical analysis.

∗ Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo. Corresponding author: Hiroshi Kera (e-mail: [email protected]).
However, in most data-driven applications, algorithms and computational libraries have been developed for numerical computation. The dependency on the monomial ordering is also problematic. The choice of the monomial ordering affects the output and runtime of a basis computation; however, it cannot be determined in a data-driven manner. Furthermore, powerful results realized by the Gröbner bases or border bases (computed with a fixed monomial ordering), such as the unique normal form computed by the polynomial division, are not attractive on the application side because the polynomial division is not commonly used, and in the approximate setting, such theoretical guarantees no longer hold.

To circumvent symbolic computation and monomial ordering, several basis computation methods have been developed in the field of machine learning [15, 18, 21]. In particular, vanishing component analysis (VCA) provides the third breakthrough, whereby polynomials are generated from linear combinations of polynomials (not monomials). As such, monomials, and thus monomial ordering, need not be handled explicitly, resulting in much wider applications compared to the AVI algorithm and its variants, such as classification, dimensionality reduction, and blind source separation [8, 24, 27]. However, it is difficult to say that VCA achieves a monomial-agnostic basis computation of vanishing ideals because VCA considers unnormalized polynomials, i.e., a polynomial h obtained by VCA does not have a normalized coefficient norm. Consequently, in the approximate case, h can be considered approximately vanishing merely because the coefficient norm of h is sufficiently small, regardless of its roots.
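This problem can be reproduced numerically in a few lines: scaling a polynomial by a small constant shrinks its evaluation vector below any threshold without changing its roots. A toy sketch with numpy (the polynomial and points are illustrative, not from the paper):

```python
import numpy as np

# Illustrative polynomial h = x1 + x2 + 1, which does not vanish on X.
def h(p):
    return p[0] + p[1] + 1.0

X = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]])

h_X = np.array([h(x) for x in X])          # evaluation vector of h on X
print(np.linalg.norm(h_X))                 # sqrt(34) ~ 5.83: far from vanishing

# Scaling h by 1e-6 shrinks the evaluation norm below any practical threshold,
# even though the scaled polynomial has exactly the same roots as h.
print(np.linalg.norm(1e-6 * h_X))          # ~ 5.83e-6: "spuriously" vanishing
```

Any unnormalized polynomial can thus masquerade as approximately vanishing, which is why a normalization (coefficient-based or, as proposed here, gradient-based) is needed.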
This issue was first discussed as the spurious vanishing problem in [13], and an optimal normalization framework was proposed. However, in VCA, polynomials are constructed by linear combinations of products of lower-degree polynomials, each of which is a linear combination of products of further lower-degree polynomials. Thus, accessing the coefficients of monomials requires symbolic computation. Moreover, without monomial ordering, one has to handle all the monomials, the number of which grows sharply as (n+d choose d) for n-variate polynomials of degree d.

In summary, no algorithm is available for the efficient monomial-agnostic basis computation of vanishing ideals. The present paper resolves this crucial issue toward full integration of the vanishing ideal computation in computer algebra and numerical computation in data-driven applications.

2 Our contributions
We develop theory and algorithms to realize the first efficient monomial-agnostic basis computation of vanishing ideals, which can circumvent the spurious vanishing problem without symbolic computation or monomial ordering. We also realize new fascinating properties in basis computation that have not been achieved by existing basis computation algorithms of vanishing ideals. In the following subsections, we present an overview of our contributions.

We propose gradient normalization of polynomials, by which the spurious vanishing problem can be efficiently avoided in a monomial-agnostic manner. Gradient normalization of a polynomial g ∈ ℝ[x_1, ..., x_n] with respect to X ⊂ ℝⁿ is given by

    g / √( Σ_{x ∈ X} ‖∇g(x)‖² ),    (1)

where ∇g is the gradient of g and ‖·‖ is the Euclidean norm. Although there is a natural concern regarding this normalization, i.e., the possibility that the denominator vanishes, we will prove that this does not matter in the basis computation of a vanishing ideal; if g is necessary for the basis, then the denominator never vanishes. The gradient normalization scheme has the following advantages over the conventional coefficient normalization scheme.

Time efficiency
A naive implementation of gradient normalization requires a computational cost of only O(n|X|T_pd), where |·| and T_pd denote the cardinality of a set and the evaluation cost of a partial derivative of a polynomial, respectively. Further, we show that in the basis computation, the gradient can be efficiently and exactly evaluated without differentiation. Thus, the computational cost of gradient normalization is much lower than that of coefficient normalization, i.e., O((n+d choose d)), where d is the degree of a polynomial, especially when the number of variables and the degree of polynomials are large and the dataset size is moderate. Even for a large dataset, as a simple heuristic, one can perform thinning out [1], clustering, or coreset construction to find representative points in order to restrict the number of points.

Data dependency
Gradient normalization is a data-dependent normalization, which makes it essentially different from coefficient normalization. For instance, the data-dependent nature of gradient normalization leads to (a sort of) translation and scaling consistency: suppose bases are computed for X, a translated X, and a scaled X. Then, there is a one-to-one correspondence across these bases; there is the same number of polynomials at each degree. This provides a theoretical justification for the preprocessing (i.e., translation and scaling) before the basis computation. To the best of our knowledge, such a property has never been realized in the literature of vanishing ideal computation. The numerical and symbolic relations of the corresponding polynomials are also revealed. The data-dependent nature of gradient normalization also leads to the robustness of the normalized polynomials; the change in the extent of vanishing can be upper bounded by the magnitude of the perturbation (and negligible higher-order terms).

The present paper is an extended version of a conference paper [16]. Additional contents are as follows: (i) complete proofs of the relevant theorems, propositions, and lemmas, (ii) extensive discussion and analysis of the application of a proposed scheme (gradient normalization) to computer-algebraic algorithms (in particular, border basis computation), (iii) a new bound on the perturbation in given points, and (iv) a new heuristic to terminate the basis computation for positive-dimensional ideals.

The computed basis of a vanishing ideal I(X) can contain redundant basis polynomials. For example, a basis G can contain g and g′ such that g′ = gh for some polynomial h. More generally, it is likely that g ∈ ⟨G − {g}⟩, where ⟨·⟩ denotes the ideal generated by a polynomial set.
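For intuition, this kind of redundancy leaves a numerical trace: if g′ = gh and g vanishes on X, the product rule gives ∇g′(x) = g(x)∇h(x) + h(x)∇g(x) = h(x)∇g(x) for every x ∈ X, so ∇g′(x) is a scalar multiple of ∇g(x) at each data point. A minimal numpy sketch with illustrative polynomials:

```python
import numpy as np

# Illustrative redundant pair: g = x1^2 + x2^2 - 1 and g' = x1 * g.
def grad_g(p):
    return np.array([2 * p[0], 2 * p[1]])

def grad_gp(p):  # g' = x1^3 + x1*x2^2 - x1  =>  grad g' = (3x1^2 + x2^2 - 1, 2*x1*x2)
    return np.array([3 * p[0]**2 + p[1]**2 - 1, 2 * p[0] * p[1]])

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # points where g vanishes

# At each x in X, grad g'(x) = x1 * grad g(x), so a one-column least-squares
# fit of grad g'(x) against grad g(x) leaves a (near-)zero residual.
residuals = []
for x in X:
    A = grad_g(x).reshape(-1, 1)
    c, *_ = np.linalg.lstsq(A, grad_gp(x), rcond=None)
    residuals.append(np.linalg.norm(A @ c - grad_gp(x)))
print(residuals)   # all ~0: g' is redundant given g
```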
We propose a monomial-agnostic basis reduction method that uses gradient information. Specifically, we show that if g ∈ ⟨G − {g}⟩, then the gradient ∇g is linearly dependent on the gradients of lower-degree polynomials in G − {g} at every x ∈ X. Consequently, the basis reduction can be addressed by solving least-squares problems, which can be achieved efficiently and works stably even in the approximate case.

Vanishing ideals are zero-dimensional ideals, i.e., the associated affine varieties are sets of finitely many points. However, in many applications, the affine variety where data are distributed is often a positive-dimensional variety. Therefore, it is sometimes preferable to terminate the basis computation at an early stage; otherwise, the computed system of polynomials overfits the observed data. We propose using the codimension of the tangent space as a new termination criterion. Compared to common approaches such as restricting the maximum degree or the number of polynomials, the proposed approach has two advantages: (i) the potential range of the parameter is evident and (ii) the redundancy of the basis does not affect the result of the regularization. (For example, for n-dimensional points, the codimension of a tangent space ranges from 0 to n, while the range of the degree or the number of polynomials is unknown.)

Throughout the paper, we focus on the polynomial ring R_n = ℝ[x_1, x_2, ..., x_n], where x_i is the i-th indeterminate. When we handle a set (say, H), H is sometimes described as H = {h_1, h_2, ..., h_|H|} using the cardinality |H|.

Definition 3.1.
The vanishing ideal I(X) ⊂ R_n of X ⊂ ℝⁿ is the set of polynomials that take the zero value (i.e., vanish) at every point in X:

    I(X) = { g ∈ R_n | ∀x ∈ X, g(x) = 0 }.

Definition 3.2.
Given a set of points X = {x_1, x_2, ..., x_|X|} ⊂ ℝⁿ, the evaluation vector of a polynomial h ∈ R_n with respect to X is defined as follows:

    h(X) = ( h(x_1)  h(x_2)  ···  h(x_|X|) )ᵀ ∈ ℝ^|X|.

For a set of polynomials H = {h_1, h_2, ..., h_|H|} ⊂ R_n, the evaluation matrix of H with respect to X is defined as

    H(X) = ( h_1(X)  h_2(X)  ···  h_|H|(X) ) ∈ ℝ^{|X|×|H|}.

Definition 3.3.
A polynomial g ∈ R_n is an ε-approximately vanishing polynomial for a set of points X ⊂ ℝⁿ if ‖g(X)‖ ≤ ε, where ε ≥ 0 and ‖·‖ denotes the Euclidean norm. When ε > 0 is not specified, an ε-approximately vanishing polynomial is also referred to as an approximately vanishing polynomial. When ε = 0, g is referred to as a vanishing polynomial.

As indicated by the definition of the vanishing ideal (Definition 3.1), we are interested in only the evaluation values of polynomials at the given set of points X. Hence, a polynomial h can be represented by its evaluation vector h(X), which links the product and weighted sum of polynomials with linear algebra operations. Given a polynomial set H = {h_1, h_2, ..., h_|H|} ⊂ R_n, the product of h_i, h_j ∈ H corresponds to h_i(X) ⊙ h_j(X), where ⊙ denotes the element-wise product. A weighted sum Σ_{i=1}^{|H|} w_i h_i, where w_i ∈ ℝ, corresponds to Σ_{i=1}^{|H|} w_i h_i(X). For convenience, we define the following notations.

Definition 3.4.
Given a polynomial set H = {h_1, h_2, ..., h_|H|} ⊂ R_n and a matrix W = ( w_1  w_2  ···  w_s ) ∈ ℝ^{|H|×s}, the products Hw_i and HW are defined as

    Hw_i = Σ_{j=1}^{|H|} w_i^{(j)} h_j,    HW = { Hw_1, Hw_2, ..., Hw_s },

where w_i^{(j)} is the j-th element of w_i. Note that by the definition of the evaluation vector and evaluation matrix, (Hw)(X) = H(X)w and (HW)(X) = H(X)W.

Definition 3.5.
The span of a polynomial set F ⊂ R_n is defined as span(F) = { Σ_{f ∈ F} a_f f | a_f ∈ ℝ }. The polynomial ideal generated by a set of vanishing polynomials G ⊂ R_n is defined as ⟨G⟩ = { Σ_{g ∈ G} h_g g | h_g ∈ R_n }.

Other notations. We denote by span(·) the column space of a given matrix. The total degree of a polynomial h ∈ R_n is denoted by deg(h). The degree with respect to the k-th indeterminate x_k is denoted by deg_k(h). Given a polynomial set H ⊂ R_n, we denote by H_t ⊂ H the set of degree-t polynomials of H. The restriction of H to polynomials of degree at most t is denoted by H^t = ∪_{τ=0}^{t} H_τ.

Our idea of using gradients is general enough to be integrated with many basis computation algorithms of the vanishing ideal. However, to avoid an unnecessarily abstract discussion, we focus on the simple basis construction (SBC) algorithm [13], which extends VCA by incorporating normalization. The SBC algorithm is free of monomial ordering and the spurious vanishing problem, albeit not efficient when using coefficient normalization.
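As a concrete illustration of Definitions 3.2 and 3.4, the evaluation matrix and the product Hw reduce to plain array operations. A minimal numpy sketch (the polynomials and points are illustrative choices):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
H = [lambda p: p[0] + p[1],    # h1 = x1 + x2
     lambda p: p[0] * p[1]]    # h2 = x1 * x2

# Evaluation matrix H(X) of Definition 3.2: one column per polynomial.
H_X = np.column_stack([[h(x) for x in X] for h in H])
print(H_X)        # [[3, 2], [7, 12], [11, 30]]

# Definition 3.4: the weighted sum Hw evaluates to the product H(X) w.
w = np.array([2.0, -1.0])     # Hw = 2*h1 - h2
print(H_X @ w)    # [4, 2, -8]
```

This correspondence is what lets the basis computation replace symbolic polynomial arithmetic with linear algebra on evaluation matrices.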
Algorithm 3.6. (SBC algorithm)
Let X ⊂ ℝⁿ be a finite set of points and ε ≥ 0 be a fixed threshold. We set F_0 = {m}, G_0 = {} ⊂ R_n, where m is any nonzero constant polynomial. For t = 1, 2, ..., consider the following procedures. Below, we use the notations F^t = ∪_{τ=0}^{t} F_τ and G^t = ∪_{τ=0}^{t} G_τ.

S1 For t > 1, generate pre-candidate polynomials of degree t by multiplying nonvanishing polynomials across F_1 and F_{t−1}:

    C_t^pre = { pq | p ∈ F_1, q ∈ F_{t−1} }.

For t = 1, define C_1^pre = {x_1, x_2, ..., x_n}. The candidate polynomials are then generated through an orthogonal projection with respect to the evaluation matrix:

    C_t = C_t^pre − F^{t−1} F^{t−1}(X)† C_t^pre(X),    (2)

where ·† denotes the pseudo-inverse of a matrix.

S2 Solve the following generalized eigenvalue problem:

    C_t(X)ᵀ C_t(X) V = N(C_t) V Λ,    (3)

where V is a matrix that has the generalized eigenvectors v_1, v_2, ..., v_|C_t| for its columns, Λ is a diagonal matrix with the generalized eigenvalues λ_1, λ_2, ..., λ_|C_t| along its diagonal, and N(C_t) ∈ ℝ^{|C_t|×|C_t|} is the normalization matrix, which will be introduced shortly (refer to Example 3.11 and the preceding description).

S3 Generate basis polynomials by linearly combining polynomials in C_t with {v_1, v_2, ..., v_|C_t|}:

    F_t = { C_t v_i | √λ_i > ε },    G_t = { C_t v_i | √λ_i ≤ ε }.

If |F_t| = 0, return F = F^t, G = G^t and terminate. Otherwise, increment t by 1 and go to S1.

Remark 3.7. At S1, Eq. (2) projects span(C_t^pre(X)) onto a subspace that is orthogonal to span(F^{t−1}(X)). Consequently, span(C_t(X)) is a subspace that cannot be spanned by the evaluation vectors of polynomials of degree less than t.

Remark 3.8.
At S3, a polynomial C_t v_i is classified as an ε-approximately vanishing polynomial if √λ_i ≤ ε, because √λ_i equals the extent of vanishing of C_t v_i for X, i.e.,

    ‖(C_t v_i)(X)‖ = √( v_iᵀ C_t(X)ᵀ C_t(X) v_i ) = √λ_i.

Remark 3.9.
When the SBC algorithm runs with ε = 0, the output F, G forms the bases of nonvanishing polynomials and vanishing polynomials, respectively, i.e., span(F(X)) = ℝ^|X| and ⟨G⟩ = I(X).

Before giving the strict description of the properties of F and G, we introduce the normalization mapping, which directly relates to the normalization matrix in Eq. (3).

Definition 3.10.
Let n : R_n → V be a linear mapping, where V is a vector space over ℝ. Let A be a basis construction algorithm that returns F, G ⊂ R_n such that span(F(X)) = ℝ^|X| and I(X) = ⟨G⟩, where X ⊂ ℝⁿ. Let F^t and G^t be the restrictions of F and G to polynomials of degree up to t. When the following conditions hold, n is called a valid normalization mapping for A.

• For any f ∈ F, if the norm of n(f) equals zero, then f(X) ∈ span(F^{deg(f)−1}(X)).
• For any g ∈ G, if the norm of n(g) equals zero, then g ∈ ⟨G^{deg(g)−1}⟩.

Definition 3.10 is a subtle generalization of that in [13]. Note that since n is a linear mapping, we have n(af + bg) = a·n(f) + b·n(g) for any f, g ∈ R_n and a, b ∈ ℝ. Furthermore, since V is a vector space, the inner product ⟨n(f), n(g)⟩ is defined for any f, g ∈ R_n. Throughout this paper, we consider V = ℝ^ℓ, where ℓ is a positive integer defined for each case. For convenience, we hereinafter write n(H) = ( n(h_1)  n(h_2)  ···  n(h_|H|) ) ∈ ℝ^{ℓ×|H|}, where H = {h_1, h_2, ..., h_|H|} ⊂ R_n. The normalization matrix in Eq. (3) is defined by N(H) = n(H)ᵀ n(H).

Example 3.11.
Let n_c : R_n → ℝ^ℓ be a mapping that returns the coefficient vector of a polynomial. Obviously, n_c is a linear mapping. One can see that n_c is a valid normalization mapping for any algorithm for the basis construction of vanishing ideals because ‖n_c(h)‖ = 0 implies that h is the zero polynomial. The normalization matrix in Eq. (3) becomes N_c(C_t) = n_c(C_t)ᵀ n_c(C_t), the (i, j)-th element of which is the inner product between the coefficient vectors of the i-th and j-th polynomials in C_t.

Theorem 3.12.
Let n be a valid normalization mapping for the SBC algorithm. When the SBC algorithm runs with n and ε = 0 for a set of points X ⊂ ℝⁿ, the output bases F, G ⊂ R_n satisfy the following.

• The vanishing ideal I(X) is generated by G, i.e., I(X) = ⟨G⟩.
• Any polynomial h ∈ R_n can be represented as h = f′ + g′, where f′ ∈ span(F) and g′ ∈ ⟨G⟩.
• For any g ∈ I(X), g ∈ ⟨G^{deg(g)}⟩.
• Any polynomial h ∈ R_n can be represented as h = f′ + g′, where f′ ∈ span(F^{deg(h)}) and g′ ∈ ⟨G^{deg(h)}⟩.

Proof. The claim is identical to Theorem 2 in [13], except for the replacement of “a normalization mapping” with “a valid normalization mapping”. The original proof is still applicable with this replacement.

In the literature, the basis computation of vanishing ideals deals with polynomials through their evaluation vectors at a set of points X. However, two vanishing polynomials, say g_1 and g_2, have identical evaluation vectors g_1(X) = g_2(X) = 0; thus, no clues regarding their symbolic forms can be retrieved from their evaluation vectors. In this study, we propose exploiting the gradient as a key tool for handling polynomials by numerical computation. Specifically, given a polynomial h, we consider the evaluation of its partial derivatives ∇h := (∂h/∂x_1, ∂h/∂x_2, ..., ∂h/∂x_n) at the given set of data points X ⊂ ℝⁿ. From the definition, the evaluation matrix of the gradient ∇h is given by

    ∇h(X) = ( (∂h/∂x_1)(X)  (∂h/∂x_2)(X)  ···  (∂h/∂x_n)(X) ) ∈ ℝ^{|X|×n}.

We can also show that in the basis construction, the gradient of polynomials can be efficiently and exactly calculated without performing differentiation using the iterative framework. Notably, one can infer the symbolic structure of a vanishing polynomial g from ∇g(X).
For example, if (∂g/∂x_k)(X) ≈ 0, then x_k is unlikely to be dominant in g; if (∂g/∂x_k)(X) ≈ 0 for all k, then g can be close to the zero polynomial. One may argue that a nonzero vanishing polynomial g can have ∇g(X) = O, where O denotes the zero matrix. However, such a g is later revealed to be redundant in the basis; thus, it can be excluded from our consideration (cf. Lemmas 4.2 and 4.3). It can also be shown that some symbolic relation between vanishing polynomials g_1 and g_2 is reflected in the linear dependency between ∇g_1(X) and ∇g_2(X); if g_2 is a polynomial multiple of g_1, i.e., g_2 = g_1 h for some h ∈ R_n, then for any x ∈ X, ∇g_1(x) and ∇g_2(x) are identical up to a constant scale. A more general symbolic relation between polynomials is discussed in Conjecture 5.1.

The spurious vanishing problem can be resolved by normalizing a polynomial by some scale (e.g., the norm of its coefficient vector). In this paper, we propose gradient normalization, which normalizes a polynomial using the norm of its gradient. Specifically, a polynomial h is normalized with the norm of the following vector:

    n_g(h; X) = vec(∇h(X)) ∈ ℝ^{|X|n},    (4)

where vec(·) denotes the vectorization of a given matrix. We refer to the norm ‖n_g(h; X)‖ as the gradient norm of h with respect to X. By solving Eq. (3) in S2, each basis polynomial (say, h) of vanishing polynomials and nonvanishing polynomials is normalized so that ‖n_g(h; X)‖ = 1. Conceptually, this rescales h with respect to the gradient norm as h/‖n_g(h; X)‖, albeit in an optimal way [13].

In Section 4.1, we show that gradient normalization is always valid for the problem setting of our interest, namely the basis computation of the vanishing ideal. Next, we discuss the advantages of using gradient normalization. In Section 4.2, we show the computational efficiency of gradient normalization. In particular, we show that (i) as is obvious from Eq.
(4), the length of n_g(h; X) increases linearly with n (the dimension of the data) and is independent of the degree of polynomials; and (ii) the calculation of n_g(h; X) can be performed without differentiation; thus, no symbolic computation or approximation (e.g., the finite difference method) is required. Furthermore, in Section 4.3, we show that gradient normalization equips the SBC algorithm with a sort of consistency in the translation and scaling of the input data points (Theorem 4.10), which, to the best of our knowledge, has not been realized by existing basis computation algorithms. We also show in Section 4.4 that polynomials with a unit gradient norm are robust against the perturbation of points owing to the bounded gradient around the given points.

The gradient norm ‖n_g(h; X)‖ can be equal to zero even for a nonzero polynomial h. In other words, all partial derivatives ∂h/∂x_k can be simultaneously vanishing for X, i.e., (∂h/∂x_k)(X) = 0. In such a case, gradient normalization does not yield a valid polynomial. Solving the generalized eigenvalue problem Eq. (3) provides only polynomials with a nonzero gradient norm; however, this may drop some basis polynomials that have a zero gradient norm. Notably, we can show that this does not matter in our case; it is sufficient for the basis construction to collect only polynomials with a nonzero gradient norm. We start by proving the following lemmas.

Lemma 4.1.
Any g ∈ R_n of degree at least one can be represented as

    g = Σ_{k=1}^{n} h_k (∂g/∂x_k) + r,

where h_k, r ∈ R_n and deg_k(r) < deg_k(g) for k = 1, 2, ..., n.

Proof. We present a constructive proof. For simplicity of notation, we write d_k = deg_k(g). At k = 1, if d_1 = 0, we set h_1 = 0 and proceed to k = 2. Otherwise, we rearrange g according to the degree of x_1 as follows:

    g = x_1^{d_1} g_1^{(0)} + x_1^{d_1−1} g_1^{(1)} + ··· + g_1^{(d_1)},

where g_1^{(τ)} denotes an (n−1)-variate polynomial that does not contain x_1 (τ = 0, 1, ..., d_1). Then,

    ∂g/∂x_1 = d_1 x_1^{d_1−1} g_1^{(0)} + (d_1 − 1) x_1^{d_1−2} g_1^{(1)} + ··· + 0.

By setting h_1 = x_1/d_1, we obtain

    g = h_1 (∂g/∂x_1) + r_1,

where r_1 = ( x_1^{d_1−1} g_1^{(1)} + 2 x_1^{d_1−2} g_1^{(2)} + ··· + d_1 g_1^{(d_1)} ) / d_1. Note that deg_1(r_1) ≤ d_1 − 1 and deg_l(r_1) ≤ d_l for l ≠ 1. Next, we perform the same procedure for k = 2 and r_1; if d_2 = 0, then we set h_2 = 0 and r_2 = r_1 and proceed to k = 3; otherwise, we rearrange r_1 according to the degree of x_2 as

    r_1 = x_2^{d_2} r_2^{(0)} + x_2^{d_2−1} r_2^{(1)} + ··· + r_2^{(d_2)},

where r_2^{(τ)} denotes a polynomial that does not contain x_2 (τ = 0, 1, ..., d_2). By setting h_2 = x_2/d_2, we obtain

    g = h_1 (∂g/∂x_1) + h_2 (∂g/∂x_2) + r_2,

where r_2 = ( x_2^{d_2−1} r_2^{(1)} + 2 x_2^{d_2−2} r_2^{(2)} + ··· + d_2 r_2^{(d_2)} ) / d_2. Note that deg_1(r_2) ≤ d_1 − 1, deg_2(r_2) ≤ d_2 − 1, and deg_l(r_2) ≤ d_l for l ≠ 1, 2. Repeating this procedure until k = n, the remainder r := r_n satisfies deg_l(r) ≤ d_l − 1 for every l.

Lemma 4.2.
Suppose that G^t ⊂ R_n is a basis of vanishing polynomials of degree at most t for a set of points X such that for any g̃ ∈ I(X) of degree at most t, g̃ ∈ ⟨G^t⟩. Then, for any g ∈ I(X) of degree t + 1, if (∂g/∂x_k)(X) = 0 for all k = 1, 2, ..., n, then g ∈ ⟨G^t⟩.

Proof. From Lemma 4.1, we can represent g as

    g = Σ_{k=1}^{n} h_k (∂g/∂x_k) + r,

where h_k, r ∈ R_n and deg(r) ≤ t. From g(X) = (∂g/∂x_k)(X) = 0 for k = 1, 2, ..., n, we obtain r(X) = 0. Thus, ∂g/∂x_k and r are all vanishing polynomials of degree at most t. Now, from the definition of G^t, it follows that ∂g/∂x_k, r ∈ ⟨G^t⟩ for k = 1, 2, ..., n. From the absorption property of the ideal, we can conclude that g ∈ ⟨G^t⟩.

Lemma 4.3.
Suppose that F^t ⊂ R_n is a basis of nonvanishing polynomials of degree at most t for a set of points X such that for any nonvanishing polynomial f̃ of degree at most t, f̃(X) ∈ span(F^t(X)). Then, for any nonvanishing polynomial f ∈ R_n of degree t + 1, if (∂f/∂x_k)(X) = 0 for all k = 1, 2, ..., n, then f(X) ∈ span(F^t(X)).

Proof. From Lemma 4.1, we can represent f as

    f = Σ_{k=1}^{n} h_k (∂f/∂x_k) + r,

where h_k, r ∈ R_n and deg(r) ≤ t. From (∂f/∂x_k)(X) = 0 for k = 1, 2, ..., n, we obtain f(X) = r(X). Since the column space of F^t(X) spans the evaluation vectors of any polynomial of degree at most t, we have r(X) ∈ span(F^t(X)), and hence f(X) ∈ span(F^t(X)).

These two lemmas imply that we do not need polynomials with zero gradient norm for constructing bases because such polynomials can be described by basis polynomials of lower degrees. Therefore, it is valid to use n_g for the normalization in the SBC algorithm.

Theorem 4.4.
The mapping n_g of Eq. (4) is a valid normalization mapping for the SBC algorithm.

Proof. It is trivial that n_g is a linear mapping because of the linearity of the gradient operator ∇. The image of n_g is a vector space ℝ^{|X|n}. From Lemmas 4.2 and 4.3, the two requirements for the validity of the normalization mapping (cf. Definition 3.10) are satisfied.

In contrast to coefficient normalization, the proposed gradient normalization is much more efficient in both theory and practice in the monomial-agnostic basis computation of vanishing ideals. We discuss its efficiency and advantages from three viewpoints: the length of the normalization vector, the computational cost of its construction, and the ease of approximation.
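To make the normalization vector of Eq. (4) concrete, here is a minimal numpy sketch of n_g(h; X) = vec(∇h(X)) for a toy polynomial. The gradient is written out by hand here; in the algorithm it is obtained iteratively, without differentiation:

```python
import numpy as np

# Illustrative polynomial h = x1*x2, whose gradient is (x2, x1).
def grad_h(p):
    return np.array([p[1], p[0]])

X = np.array([[1.0, 2.0], [3.0, 4.0]])

grad_h_X = np.stack([grad_h(x) for x in X])   # evaluation matrix of the gradient, |X| x n
n_g = grad_h_X.flatten()                      # vec(grad h(X)), length |X|*n
print(n_g)                                    # [2. 1. 4. 3.]
print(np.linalg.norm(n_g))                    # sqrt(30): the gradient norm of h w.r.t. X
```

Dividing h by this norm realizes the normalization of Eq. (1); note the vector length |X|·n does not depend on deg(h).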
4.2.1 Length of normalization vectors

Given a polynomial h ∈ R_n, the coefficient vector n_c(h) is of length (n+deg(h) choose n), whereas the gradient normalization vector n_g(h) is of length |X|n, where X ⊂ ℝⁿ. The length of a gradient normalization vector is independent of the degree of the polynomial and depends linearly on the dataset size. One may be concerned that for a large dataset, the gradient normalization vector can be longer than the coefficient vector. However, in the basis computation of vanishing ideals, a larger dataset implies the need to consider higher-degree polynomials, especially when the approximation threshold ε is close to zero. This is because during the basis computation, a basis of nonvanishing polynomials (say, F) is constructed so that the column space of F(X) (approximately) spans ℝ^|X|. To this end, one needs to collect O(|X|) polynomials for F, which implies the need for O(|X|) monomials out of a large set of monomials. Therefore, the length of the coefficient vector is at least O(|X|), and in the collection process, even more monomials are required to be handled. In particular, when |X| is large, we have to access higher-degree monomials, the number of which sharply increases with the degree.

4.2.2 Computational cost of normalization vectors

The essence of efficient monomial-ordering-free basis computation is to consider a linear combination of polynomials instead of monomials. In both the SBC algorithm and VCA, we first generate a set C_t of candidate polynomials of degree t and then consider a linear combination of its elements:

    h = Σ_{c ∈ C_t} v_c c,    (5)

where v_c ∈ ℝ. To obtain the coefficient vector n_c(h), one needs to access the coefficients of monomials, which requires an expansion of h into a linear combination of monomials. The expansion is significantly expensive because h is a highly nested sum-product of polynomials.
Specifically, c ∈ C_t is calculated from a product of a linear polynomial p and a degree-(t−1) polynomial q, while q is itself a linear combination of C_{t−1}, each of whose elements is calculated from a product of linear and degree-(t−2) polynomials.

In contrast, a gradient normalization vector can be calculated in a monomial-agnostic way. In particular, we show that the gradient can be evaluated efficiently and exactly without differentiation by exploiting the iterative nature of the SBC algorithm (below, we use the symbols of Algorithm 3.6). We start by rewriting Eq. (5). Noting that C_t is generated from the linear combinations of C_t^pre and F^{t−1}, we can describe any h ∈ span(C_t) as

    h = Σ_{c ∈ C_t^pre} u_c c + Σ_{f ∈ F^{t−1}} v_f f,

where u_c, v_f ∈ ℝ. Note that c ∈ C_t^pre is a product of a polynomial in F_1 and a polynomial in F_{t−1}. Let p_c ∈ F_1 and q_c ∈ F_{t−1} be such polynomials, i.e., c = p_c q_c. Using the product rule, we evaluate ∂h/∂x_k for x ∈ X as

    (∂h/∂x_k)(x) = Σ_{c ∈ C_t^pre} u_c (∂(p_c q_c)/∂x_k)(x) + Σ_{f ∈ F^{t−1}} v_f (∂f/∂x_k)(x)
                 = Σ_{c ∈ C_t^pre} u_c q_c(x) (∂p_c/∂x_k)(x) + Σ_{c ∈ C_t^pre} u_c p_c(x) (∂q_c/∂x_k)(x) + Σ_{f ∈ F^{t−1}} v_f (∂f/∂x_k)(x).    (6)

All of p_c(x), q_c(x), (∂p_c/∂x_k)(x), (∂q_c/∂x_k)(x), and (∂f/∂x_k)(x) are the evaluations at x ∈ X of polynomials of degree at most t − 1; thus, these have already been calculated in the previous iterations up to degree t − 1. For degree t = 1, the gradients of the linear polynomials are the combination vectors {v_i} obtained at S2. Thus, ∇h(X) can be exactly calculated without differentiation using the results at lower degrees.

Proposition 4.5.
Given X, at degree t in the SBC algorithm, for any polynomial h ∈ span(C_t ∪ F_{t−1}) and any point x ∈ X, we can compute ∇h(x) without differentiation with a computational cost of O(n rank(X)|X|).

Proof. Equation (6) suggests that the evaluation of a partial derivative (∂h/∂x_k)(x) is the sum of |C_t^pre| + |F_{t−1}| terms (note that |C_t| = |C_t^pre|). Hence, ∇h(x) requires a computational cost of O(n(|C_t| + |F_{t−1}|)). Note that a final output F satisfies |F| ≤ |X|. This is because F(X) is full rank owing to the orthogonal projection in Eq. (2), and rank(F_t(X)) ≤ |X| because span(F(X)) ⊆ R^{|X|} (the equalities hold at ε = 0), where rank(·) denotes the matrix rank of the given matrix. Hence, |F_{t−1}| = O(|X|). Furthermore, by its construction, |F_1| ≤ rank(X). Therefore, |C_t| = |F_1||F_{t−1}| ≤ rank(X)|X|; hence, O(|C_t|) = O(rank(X)|X|). Thus, O(n(|C_t| + |F_{t−1}|)) = O(n|C_t|) = O(n rank(X)|X|).

Remark 4.6.
One can approximate the gradient by the finite difference method. Although this does not reduce the computational complexity, it might simplify the implementation.
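The product-rule recurrence of Eq. (6) can be sketched as follows (a minimal illustration with toy cached values; the function name and the toy polynomials are ours):

```python
import numpy as np

def product_gradient(p_val, grad_p, q_val, grad_q):
    # For a candidate c = p * q, Eq. (6) uses
    #   grad(c)(x) = q(x) * grad(p)(x) + p(x) * grad(q)(x),
    # so only cached evaluations and gradients from earlier (lower-degree)
    # iterations are combined; no differentiation is performed.
    return q_val * grad_p + p_val * grad_q

# Toy check with p = x1 + 2*x2 and q = x1 - x2 at x = (1, 2):
x = np.array([1.0, 2.0])
p_val, grad_p = x[0] + 2 * x[1], np.array([1.0, 2.0])
q_val, grad_q = x[0] - x[1], np.array([1.0, -1.0])

g = product_gradient(p_val, grad_p, q_val, grad_q)
# Analytic gradient of c = (x1 + 2*x2)(x1 - x2) = x1^2 + x1*x2 - 2*x2^2 is
# (2*x1 + x2, x1 - 4*x2) = (4, -7) at x = (1, 2).
print(g)  # [ 4. -7.]
```

In the SBC algorithm, the cached values come from the evaluation vectors and gradients stored at degrees up to t−1, so the full gradient ∇h(X) is assembled from such terms.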
The computational cost O(n rank(X)|X|) is quite acceptable, considering that generating C_t already requires a computational cost of O(rank(X)|X|) and solving Eq. (3) requires a computational cost of O(|C_t|) = O(rank(X)|X|). Furthermore, in this analysis, we use the very rough relation |F_t| = O(|X|), whereas |F_t| ≪ |X| in practice.

Gradient normalization is also practically convenient because there is a simple heuristic to reduce the runtime. By giving up the exact calculation, one can reduce the runtime by restricting the variables and points to be considered; i.e., a normalization vector of a polynomial h ∈ R_n can be approximated by

    n̂_g(h; Y, Ω) = vec(∇_Ω h(Y)),

where Y ⊂ R^n, Ω ⊂ {1, 2, ..., n}, and ∇_Ω h = {∂h/∂x_i | i ∈ Ω}. For example, Ω and Y can be defined as an index set of variables with large variance and the set of centroids of clusters of X, respectively. Note that with this heuristic, n̂_g is no longer guaranteed to be a valid normalization mapping. However, by properly selecting Y such that it represents X well, we can expect the algebraic varieties of I(Y) and I(X) to be similar. In this case, it is unlikely that n̂_g(h; Y) = 0 when n_g(h; X) ≠ 0. To find such Y, we can exploit methods developed in machine learning, such as thinning out, clustering, and coreset construction.

We emphasize that gradient normalization is essentially different from coefficient normalization because it is a data-dependent normalization. Its data-dependent nature equips the basis computation with a sort of consistency with respect to data transformations. We start with the following intuitive example.
Example 4.7. Let us consider h ∈ R_n and x ∈ R^n. For translation of x by β ∈ R^n,

    h(x)/∥∇h(x)∥ = h̃(x − β)/∥∇h̃(x − β)∥,    (7)

where h̃ is a polynomial whose variables are translated by β, i.e., h̃(x_1, x_2, ..., x_n) = h(x_1 + β_1, x_2 + β_2, ..., x_n + β_n). For scaling of x by α ≠ 0,

    α h(x)/∥∇h(x)∥ = ĥ(αx)/∥∇ĥ(αx)∥,    (8)

where ĥ = ∑_{τ=0}^{deg(h)} α^{1−τ} h^{(τ)} and h^{(τ)} is the degree-τ part of h. Even if h and ĥ are nonlinear, their evaluation vectors relate linearly under gradient normalization.

It might seem trivial that, by selecting h̃ and ĥ, Eqs. (7) and (8) hold; however, we emphasize that it is not trivial whether the basis computation gives these polynomials for the transformed data. Interestingly, we can show that this is the case for the SBC algorithm under gradient normalization. First, we define the following, which accounts for ĥ in Example 4.7.

Definition 4.8 ((t, α)-degree-wise identical). Let α ≠ 0 and let t be an integer. A polynomial ĥ ∈ R_n is (t, α)-degree-wise identical to a polynomial h ∈ R_n if ĥ = ∑_{τ=0}^{deg(h)} α^{t−τ} h^{(τ)}, where h^{(τ)} is the degree-τ part of h.

Example 4.9. ĥ = x²y + 4x + 8y is (3, 2)-degree-wise identical to h = x²y + x + 2y.

Now, we claim that the SBC algorithm with gradient normalization (SBC-n_g) holds a sort of consistency for the input translation and scaling.

Theorem 4.10.
Suppose that SBC-n_g outputs (G, F) for input (X, ε), (G̃, F̃) for input (X − β, ε), and (Ĝ, F̂) for input (αX, |α|ε), where X − β and αX, respectively, denote the translation by β and the scaling by α ≠ 0 of each point in X. Then, the following hold.

• G, G̃, and Ĝ have the same number of basis polynomials at each degree.
• F, F̃, and F̂ have the same number of basis polynomials at each degree.
• Any pair of corresponding polynomials h̃ ∈ G̃ ∪ F̃ and h ∈ G ∪ F satisfies h(x_1, x_2, ..., x_n) = h̃(x_1 + β_1, x_2 + β_2, ..., x_n + β_n), where h(x_1, x_2, ..., x_n) denotes a polynomial in the n variables x_1, x_2, ..., x_n (i.e., a polynomial in R_n) and β = (β_1, β_2, ..., β_n)^⊤.
• For any pair of corresponding polynomials ĥ ∈ Ĝ ∪ F̂ and h ∈ G ∪ F, ĥ is (1, α)-degree-wise identical to h.

Since the proof is a little lengthy, we first briefly discuss Theorem 4.10. The first two statements argue that translation and scaling of the input points do not affect the essential structure of the inferred affine varieties where the noisy data approximately lie; that is, the affine varieties inferred from data with and without these transformations are described by the same number of polynomials of the same degrees. This intuition seems very natural; however, to the best of our knowledge, no existing basis construction algorithm has this property. The third statement of Theorem 4.10 argues that basis polynomials for translated data can be retrieved from basis polynomials for nontranslated data by a variable translation. We emphasize that it is not trivial that the algorithm outputs these translated polynomials. VCA has this translation consistency property, whereas most other basis construction algorithms do not. The last statement of Theorem 4.10 is the most intriguing property, and it is not exhibited by any other basis construction algorithm. The (1, α)-degree-wise identicality between the corresponding ĥ ∈ Ĝ ∪ F̂ and h ∈ G ∪ F implies the following relation:

    ĥ(αX) = α h(X).    (9)

In words, scaling the input X of SBC-n_g by α only linearly affects the evaluation vectors of the nonlinear polynomials. Thus, we only need a linearly scaled threshold |α|ε for αX.

Remark 4.11.
Without the property shown by Eq. (9), a linear scaling of the input leads to a nonlinear scaling of the evaluation vectors of the output polynomials. Thus, the consistency in input scaling (the second statement in Theorem 4.10) cannot generally be realized regardless of how well ε > 0 is chosen for the scaled data.

Symbolically, (1, α)-degree-wise identicality means that h and ĥ consist of the same terms up to scale, and the corresponding terms m of h and m̂ of ĥ are related as m̂ = α^{1−τ} m, where τ = deg(m). Hence, with larger α, the coefficients of higher-degree terms decrease more sharply. This is quite natural because higher-degree terms increase sharply as the input value increases, and α^{1−τ} cancels this effect. One may argue that translation and scale consistency can be realized by preprocessing, such as mean centralization and normalization. Although preprocessing may be useful in some practical scenarios, it discards the mean and scale information, and the output bases do not reflect this information. By contrast, the output polynomials of SBC-n_g reflect the mean and scale in a convenient form.

Now, we present the proof of Theorem 4.10. First, we prepare lemmas for the proof.

Lemma 4.12.
Consider sets of polynomials H = {h_1, h_2, ..., h_s} ⊂ R_n and Ĥ = {ĥ_1, ĥ_2, ..., ĥ_s} ⊂ R_n, where ĥ_i is (t, α)-degree-wise identical to h_i for i = 1, 2, ..., s. Then, any nonzero vectors ŵ, w ∈ R^s such that ŵ = α^τ w yield a polynomial Ĥŵ that is (t + τ, α)-degree-wise identical to Hw.

Proof. The proof is trivial from Definition 4.8.
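As a quick numeric sanity check of degree-wise identicality (Definition 4.8), the pair from Example 4.9 satisfies ĥ(αx) = α³ h(x) and ∇ĥ(αx) = α² ∇h(x) with α = 2, t = 3 (a sketch; the lambda-based encoding of the polynomials is ours):

```python
import numpy as np

# Example 4.9: h = x^2*y + x + 2*y and h_hat = x^2*y + 4*x + 8*y are
# (3, 2)-degree-wise identical.
h     = lambda x, y: x**2 * y + x + 2 * y
h_hat = lambda x, y: x**2 * y + 4 * x + 8 * y
grad_h     = lambda x, y: np.array([2 * x * y + 1, x**2 + 2])
grad_h_hat = lambda x, y: np.array([2 * x * y + 4, x**2 + 8])

alpha, t = 2.0, 3
for (x, y) in [(1.0, -0.5), (0.3, 2.0)]:
    # h_hat(alpha * x) = alpha^t * h(x)
    assert np.isclose(h_hat(alpha * x, alpha * y), alpha**t * h(x, y))
    # grad h_hat(alpha * x) = alpha^(t-1) * grad h(x)
    assert np.allclose(grad_h_hat(alpha * x, alpha * y),
                       alpha**(t - 1) * grad_h(x, y))
print("degree-wise identicality scaling relations hold on toy points")
```

These are exactly the evaluation and gradient scaling relations established below for general (t, α)-degree-wise identical pairs.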
Lemma 4.13.
Let a polynomial ĥ ∈ R_n be (t, α)-degree-wise identical to a polynomial h ∈ R_n, and let X ⊂ R^n be a set of points. Then, ĥ(αX) = α^t h(X) and ∇ĥ(αX) = α^{t−1} ∇h(X). Thus,

    ĥ(αX)/∥∇ĥ(αX)∥ = α h(X)/∥∇h(X)∥.

Proof. Let m and m̂ be any corresponding terms of h and ĥ, which satisfy deg(m̂) = deg(m) and m̂ = α^{t−deg(m̂)} m. Then,

    m̂(αX) = α^{t−deg(m̂)} m(αX) = α^{t−deg(m̂)} α^{deg(m)} m(X) = α^t m(X).

Similarly, for any k,

    (∂m̂/∂x_k)(αX) = α^{t−deg(m̂)} (∂m/∂x_k)(αX) = α^{t−deg(m̂)} α^{deg(m)−1} (∂m/∂x_k)(X) = α^{t−1} (∂m/∂x_k)(X),

resulting in ∇m̂(αX) = α^{t−1} ∇m(X).

Lemma 4.14.
Suppose that we perform SBC-n_g for (X, ε) and (αX, |α|ε) (α ≠ 0) up to degree t = τ, and obtain (F_τ, G_τ) and (F̂_τ, Ĝ_τ), respectively. Suppose that for all t ≤ τ, there is a one-to-one correspondence between the elements of F_t and F̂_t (or the elements of G_t and Ĝ_t) such that for any corresponding polynomials h ∈ F_t and ĥ ∈ F̂_t (or h ∈ G_t and ĥ ∈ Ĝ_t), ĥ is (1, α)-degree-wise identical to h. Then, the same claim holds for t = τ + 1.

Proof. For t = τ + 1, C_{τ+1} and Ĉ_{τ+1} are generated from (F_1, F_τ) and (F̂_1, F̂_τ), respectively. From the one-to-one correspondence between F_t and F̂_t for t ≤ τ, there is also a one-to-one correspondence between C^pre_{τ+1} and Ĉ^pre_{τ+1}. The (1, α)-degree-wise identicality between the elements of F_t and F̂_t leads to the (2, α)-degree-wise identicality between the elements of C^pre_{τ+1} and Ĉ^pre_{τ+1}. Let c^pre ∈ C^pre_{τ+1} and ĉ^pre ∈ Ĉ^pre_{τ+1} be such corresponding polynomials; that is, ĉ^pre is (2, α)-degree-wise identical to c^pre. Suppose that c^pre and ĉ^pre become c ∈ C_{τ+1} and ĉ ∈ Ĉ_{τ+1}, respectively, after the orthogonalization by Eq. (2). As we will show in Lemma 4.15, ĉ is (2, α)-degree-wise identical to c. From Lemma 4.13, ĉ(αX) = α² c(X) and ∇ĉ(αX) = α ∇c(X). Therefore, Ĉ_{τ+1}(αX) = α² C_{τ+1}(X) and ∇Ĉ_{τ+1}(αX) = α ∇C_{τ+1}(X). Furthermore, n_g(Ĉ_{τ+1}; αX) = α n_g(C_{τ+1}; X); thus, N_g(Ĉ_{τ+1}; αX) = α² N_g(C_{τ+1}; X). Using these results, we now compare the two generalized eigenvalue problems

    C_{τ+1}(X)^⊤ C_{τ+1}(X) V = N_g(C_{τ+1}; X) V Λ,
    Ĉ_{τ+1}(αX)^⊤ Ĉ_{τ+1}(αX) V̂ = N_g(Ĉ_{τ+1}; αX) V̂ Λ̂.

Note that V^⊤ N_g(C_{τ+1}; X) V = I and V̂^⊤ N_g(Ĉ_{τ+1}; αX) V̂ = I hold, respectively. Thus, from these relations and V̂^⊤ N_g(Ĉ_{τ+1}; αX) V̂ = α² V̂^⊤ N_g(C_{τ+1}; X) V̂, we obtain V̂ = α^{−1} V. Furthermore, we obtain

    Λ̂ = V̂^⊤ Ĉ_{τ+1}(αX)^⊤ Ĉ_{τ+1}(αX) V̂ = α⁴ V̂^⊤ C_{τ+1}(X)^⊤ C_{τ+1}(X) V̂ = α² V^⊤ C_{τ+1}(X)^⊤ C_{τ+1}(X) V = α² Λ.

Let v_i and v̂_i be the i-th columns of V and V̂, respectively. From v̂_i = α^{−1} v_i and Lemma 4.12, Ĉ_{τ+1} v̂_i is (1, α)-degree-wise identical to C_{τ+1} v_i. Hence, any corresponding polynomials h ∈ C_{τ+1}V and ĥ ∈ Ĉ_{τ+1}V̂ satisfy ĥ(αX) = α h(X). This fact is also supported by Λ̂ = α² Λ (recall that the square root of an eigenvalue corresponds to the extent of vanishing). Note that the polynomials in Ĉ_{τ+1}V̂ are assorted into F̂_{τ+1} or Ĝ_{τ+1} by the threshold |α|ε, which leads to the same classification as F_{τ+1} and G_{τ+1} by ε. Thus, the one-to-one correspondence is maintained between F_{τ+1} and F̂_{τ+1} and also between G_{τ+1} and Ĝ_{τ+1}.

Lemma 4.15.
Consider the same setting as in Lemma 4.14: suppose that we perform SBC with gradient normalization for (X, ε) and for (αX, |α|ε) (α ≠ 0), and obtain (F_τ, G_τ) and (F̂_τ, Ĝ_τ), respectively, up to degree t = τ. Suppose that for t ≤ τ, F_t ∪ G_t and F̂_t ∪ Ĝ_t have a one-to-one correspondence, say, h ∈ F_t ∪ G_t and ĥ ∈ F̂_t ∪ Ĝ_t, where ĥ is (1, α)-degree-wise identical to h. Now, suppose that c^pre ∈ C^pre_{τ+1} and ĉ^pre ∈ Ĉ^pre_{τ+1} become c ∈ C_{τ+1} and ĉ ∈ Ĉ_{τ+1}, respectively, after the orthogonalization by Eq. (2). Then, ĉ is (2, α)-degree-wise identical to c.

Proof. The element-wise description of the orthogonalization by Eq. (2) for C_{τ+1} and Ĉ_{τ+1} is respectively as follows:

    c = c^pre − F_τ F_τ(X)† c^pre(X),
    ĉ = ĉ^pre − F̂_τ F̂_τ(αX)† ĉ^pre(αX).

Let w = F_τ(X)† c^pre(X) and ŵ = F̂_τ(αX)† ĉ^pre(αX). We will now show that ŵ = α w. If this holds, then by Lemma 4.12, F̂_τ ŵ is (2, α)-degree-wise identical to F_τ w, and thus ĉ is (2, α)-degree-wise identical to c.

First, note that the column vectors of F̂_τ(αX) are mutually orthogonal by construction, because the orthogonalization makes span(F̂_t(αX)) and span(F̂_{t′}(αX)) mutually orthogonal for any t ≠ t′, and the generalized eigenvalue decomposition makes the columns of F̂_t(αX) mutually orthogonal for each t ≤ τ. Therefore,

    D̂ := F̂_τ(αX)^⊤ F̂_τ(αX) = α² F_τ(X)^⊤ F_τ(X) =: α² D,

where D̂ and D are diagonal matrices with positive diagonal elements. Hence, the pseudo-inverse becomes

    F̂_τ(αX)† = D̂^{−1} F̂_τ(αX)^⊤ = (α^{−2} D^{−1})(α F_τ(X)^⊤) = α^{−1} D^{−1} F_τ(X)^⊤ = α^{−1} F_τ(X)†.

Therefore,

    ŵ = F̂_τ(αX)† ĉ^pre(αX) = α^{−1} F_τ(X)† (α² c^pre(X)) = α F_τ(X)† c^pre(X) = α w.

Proof of Theorem 4.10.
We consider the three runs of the basis construction for (X, ε), (X − β, ε), and (αX, |α|ε). We use notations such as H, H̃, and Ĥ for the corresponding symbols across the three runs.

First, we consider the case of translation. At t = 1, C^pre_1 = C̃^pre_1. Owing to the orthogonalization by Eq. (2) with respect to a constant vector F_0(X), both C_1(X) and C̃_1(X − β) are mean centralized. Thus, the effect of translation is canceled in terms of the evaluation matrices, and the succeeding basis construction proceeds in the same way. Symbolically, the mean centralization is equivalent to the shift of variables.

Next, we consider the case of scaling. From the argument for translation, we can assume that the mean vectors of X and αX are the zero vector. Thus, C_1 = Ĉ_1 and α C_1(X) = Ĉ_1(αX) because C_1 and Ĉ_1 consist of linear polynomials. Furthermore, n_g(C_1; X) = n_g(Ĉ_1; αX) because the partial derivatives of linear polynomials are constant polynomials. Thus,

    Ĉ_1(αX)^⊤ Ĉ_1(αX) V̂ = N_g(Ĉ_1; αX) V̂ Λ̂
    ⟺ C_1(X)^⊤ C_1(X) V̂ = (1/α²) N_g(C_1; X) V̂ Λ̂,

which implies that V̂ = V and Λ̂ = α² Λ (refer to the proof of Lemma 4.14). Therefore, C_1 V and Ĉ_1 V̂ consist of the same polynomials; for any corresponding pair h ∈ C_1 V and ĥ ∈ Ĉ_1 V̂, it follows that h = ĥ and ĥ is (1, α)-degree-wise identical to h. Since ĥ(αX) = α h(X), C_1 with threshold ε and Ĉ_1 with threshold |α|ε result in the same classifications C_1 = F_1 ∪ G_1 and Ĉ_1 = F̂_1 ∪ Ĝ_1, where F_1 = F̂_1 and G_1 = Ĝ_1. By induction with Lemma 4.14, we can conclude the proof.

Given a set of perturbed points X ⊂ R^n, the approximate computation of a basis of I(X) returns a set of polynomials that are approximately vanishing for X. However, this does not necessarily imply that these polynomials are also approximately vanishing for the set of unperturbed points X*. Here, we show that with gradient normalization, a vanishing polynomial for X is also approximately vanishing for X*, and the extent of vanishing increases linearly with the perturbation level. We also show that this is not the case for coefficient normalization. First, we consider the following example.

Example 4.16.
Let us consider the polynomial g = (x − p)x²/√(1 + p²), which is an exact vanishing polynomial for the set of a single one-dimensional point X = {p} ⊂ R. Note that g is normalized by its coefficient norm. Let p* = p − δ be the unperturbed point of p, where δ is the perturbation. Then, g(p*) = δ(p*)²/√(1 + p²) = O(δ p*). Therefore, the effect of the noise can be made arbitrarily large by increasing the value of p.

In contrast, if we normalize g in Example 4.16 by its gradient (say, g̃), the value of p has no effect on the evaluation of g̃ at p*, which is determined only by δ.

Example 4.17. Let us consider the same setting as that in Example 4.16. If g is normalized by its gradient, i.e., g̃ := g/(dg/dx)(p) = (x − p)x²/p², then its evaluation at p* is g̃(p*) = δ(p*)²/p² = O(δ), which does not grow with the value of p.

This difference between coefficient normalization and gradient normalization arises from the fact that the latter is a data-dependent normalization. Although Examples 4.16 and 4.17 cannot be directly generalized to multivariate and multiple-point cases, we can still prove a similar statement. The following proposition argues that when the perturbation is sufficiently small, the extent of vanishing at the unperturbed points is also small. Moreover, it is bounded only by the largest perturbation and not by the norm of the points, which is not the case with coefficient normalization, as shown in Example 4.16.
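The contrast between the two normalizations can be illustrated numerically (a sketch; we use the hypothetical univariate polynomial g0(x) = (x − p)x², which vanishes exactly at X = {p}, so the exact formulas here are our own reconstruction of the example):

```python
import numpy as np

def eval_normalized(p, delta):
    # Evaluate the exactly vanishing polynomial g0 = (x - p) * x^2 at the
    # unperturbed point p* = p - delta, under the two normalizations.
    p_star = p - delta
    g0 = lambda x: (x - p) * x**2
    dg0 = lambda x: 3 * x**2 - 2 * p * x         # derivative of g0
    g_coef = g0(p_star) / np.sqrt(1 + p**2)      # coefficient-norm scaling
    g_grad = g0(p_star) / abs(dg0(p))            # gradient scaling at p
    return g_coef, g_grad

delta = 0.01
for p in [1.0, 10.0, 100.0]:
    g_coef, g_grad = eval_normalized(p, delta)
    print(p, abs(g_coef), abs(g_grad))
# The coefficient-normalized value grows roughly linearly in p,
# while the gradient-normalized value stays close to delta.
```

Increasing p inflates the coefficient-normalized evaluation at the unperturbed point, whereas the gradient-normalized evaluation remains on the order of the perturbation itself.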
Proposition 4.18.
Let X ⊂ R^n be a set of points generated by perturbing X* ⊂ R^n. We denote the corresponding perturbed point in X and unperturbed point in X* by x and x*, respectively. Let g be an ε-approximately vanishing polynomial for X with a unit gradient norm, i.e., ∥n_g(g; X)∥ = 1. Then,

    ∥g(X*)∥ ≤ ε + n_max + o(n_max),

where n_max = max_{x ∈ X} ∥x* − x∥ and o(·) is Landau's small o.

Proof. Let us denote the perturbation by n_x = x* − x. By the Taylor expansion,

    ∥g(X*)∥ = √( ∑_{x ∈ X} g(x + n_x)² )
            = √( ∑_{x ∈ X} (g(x) + ∇g(x) n_x + o(∥n_x∥))² )
            ≤ √( ∑_{x ∈ X} g(x)² ) + √( ∑_{x ∈ X} (∇g(x) n_x)² ) + √( ∑_{x ∈ X} o(∥n_x∥)² )
            ≤ ε + √( ∑_{x ∈ X} (∇g(x) n_x)² ) + o(n_max).

Here, ∇g(x) is a row vector from the definitions of ∇g := (∂g/∂x_1, ..., ∂g/∂x_n) and the evaluation matrix. By the Cauchy–Schwarz inequality, the second term becomes

    √( ∑_{x ∈ X} (∇g(x) n_x)² ) ≤ √( ∑_{x ∈ X} ∥∇g(x)∥² ∥n_x∥² ) ≤ max_{x ∈ X} ∥n_x∥ · √( ∑_{x ∈ X} ∥∇g(x)∥² ) = max_{x ∈ X} ∥n_x∥,

where we used ∥n_g(g; X)∥ = 1 in the last equality. Therefore, we obtain

    ∥g(X*)∥ ≤ ε + max_{x ∈ X} ∥n_x∥ + o( max_{x ∈ X} ∥n_x∥ ).

This result implies that gradient-normalized approximately vanishing polynomials for X behave as linear functions in the regions around the points in X. This behavior is significantly important in practical scenarios where one has to adjust a proper ε empirically.
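The bound of Proposition 4.18 can be sanity checked on a toy instance (our own construction: a gradient-normalized polynomial that approximately vanishes on perturbed circle points):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 20)
X_star = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # unperturbed points
X = X_star + 0.01 * rng.standard_normal(X_star.shape)      # perturbed input

# Hypothetical polynomial g(x, y) = c * (x^2 + y^2 - 1 + 0.005*x),
# rescaled so that ||n_g(g; X)|| = 1 (unit gradient norm on X).
g_raw = lambda P: P[:, 0] ** 2 + P[:, 1] ** 2 - 1 + 0.005 * P[:, 0]
grad_raw = lambda P: np.stack([2 * P[:, 0] + 0.005, 2 * P[:, 1]], axis=1)

c = 1.0 / np.linalg.norm(grad_raw(X))
eps = np.linalg.norm(c * g_raw(X))                 # extent of vanishing on X
n_max = np.linalg.norm(X_star - X, axis=1).max()   # largest perturbation

lhs = np.linalg.norm(c * g_raw(X_star))
print(lhs, eps + n_max)  # lhs stays below eps + n_max (up to the o() term)
assert lhs <= eps + n_max + 1e-3
```

The extent of vanishing on the unperturbed points is controlled by ε and the largest perturbation magnitude, not by the norm of the points.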
In the Bayesian setting, the prior on the perturbation is typically a Gaussian distribution N(0, ε²I), which assumes that the (Euclidean) distance from a perturbed point to the unperturbed point is at most ε with a moderate probability. This can be considered a soft thresholding based on the geometrical distance, which can be estimated empirically, e.g., by repeated measurements of the data. In the approximate computation of vanishing ideals, however, the threshold ε measures the evaluations of polynomials at points, which are difficult to estimate without knowing the form of the polynomials. Proposition 4.18 provides a (locally) linear relation between the evaluation of polynomials and the geometrical distance. Thus, it is reasonable to set ε directly based on the estimated perturbation magnitude.

The geometrical distance of approximately vanishing polynomials was also considered in [6]. As in our analysis, they used a first-order approximation. The main differences between their work and ours are as follows: (i) they focus on finding a single polynomial rather than a basis, and (ii) with their algorithm, there is a possibility that no polynomial is found. This is because with coefficient normalization, one cannot ensure a small gradient of polynomials around the given points. In contrast, our approach with gradient normalization always achieves local linearity between the geometric distance and the evaluation, which is important as mentioned above. Note that this does not necessarily mean that the approach in [6] or coefficient normalization is inferior to our approach, because in computer algebra, many theories, operations, and algorithms implicitly or explicitly consider the information on the coefficients of monomials. For example, the δ-approximate border bases [7] are defined based on the coefficient norm of the normal remainder of the S-polynomials of a basis.
In summary, gradient normalization is particularly designed for, and effective in, our setting (i.e., monomial-agnostic computation), which aims at data-driven applications.

4.5 Connection to the monomial-aware computation

In computer algebra, the basis computation of vanishing ideals is conducted by considering linear combinations of monomials, i.e., in a monomial-aware manner. In particular, border bases are commonly computed owing to their better numerical stability compared to Gröbner bases [25, 5]. We can readily prove that gradient normalization is also valid for the computation of border bases. Let O ⊂ R_n be an order ideal and ∂O = {x_k o | o ∈ O, k = 1, 2, ..., n} \ O be its border. Note that, by definition, if a monomial m divides a monomial o ∈ O, then m ∈ O. Now, let us consider an O-border basis B of a vanishing ideal I(X). Any border basis polynomial g ∈ B is represented by

    g = b − ∑_{o ∈ O} c_o o,    (10)

where b ∈ ∂O and c_o ∈ R. Then, the partial derivative of g with respect to x_k becomes a linear combination of the monomials in O. By using the fact that the column vectors of O(X) are linearly independent, we can show the following lemma.

Lemma 4.19.
Let us consider a border basis computation of the vanishing ideal I(X) ⊂ R_n of a set of points X ⊂ R^n, and let O ⊂ R_n be an order ideal obtained in the computation. Then, for any nonzero polynomial g ∈ span(∂O_t ∪ O_t), there exists x_k for which ∂g/∂x_k is a nonvanishing polynomial for X.

Proof. Note that b ∈ ∂O_t can be represented as b = x_i o_b for some x_i and o_b ∈ O_t. Thus, ∂b/∂x_k = δ_ik o_b + x_i ∂o_b/∂x_k, where δ_ik is the Kronecker delta. Furthermore, note that for any o ∈ O_t, its partial derivative with respect to any x_k is identical to a monomial in O_t (up to a constant factor). Thus, ∂g/∂x_k is a linear combination of distinct monomials in O_t. By construction, any linear combination of the monomials in O_t is nonvanishing for X unless the combination coefficients are all zero, which happens only when g is not a function of x_k (i.e., ∂g/∂x_k = 0). However, since g is a nonzero polynomial, there exists k ∈ {1, 2, ..., n} for which ∂g/∂x_k ≠ 0.

Theorem 4.20.
Let O ⊂ R_n be an order ideal and B ⊂ R_n be an O-border basis of a vanishing ideal I(X). For any g ∈ B, its gradient norm takes a nonzero value, i.e., n_g(g; X) ≠ 0.

Proof. By Lemma 4.19, for some x_k, (∂g/∂x_k)(X) ≠ 0. This concludes the proof.

Although Theorem 4.20 ensures that one can use gradient normalization in the border basis computation, it is still unclear whether specific algorithms (e.g., the approximate vanishing ideal algorithm [7] and the approximate Buchberger–Möller (ABM) algorithm [20]) retain the advantages of gradient normalization shown in Theorem 4.10 and Proposition 4.18. In fact, a direct application does not seem fruitful. For example, the ABM algorithm with gradient normalization solves the following generalized eigenvalue problem:

    (b(X) O(X))^⊤ (b(X) O(X)) v = λ n_g(b ∪ O; X)^⊤ n_g(b ∪ O; X) v.    (11)

The polynomial g = ({b} ∪ O)v is √λ-approximately vanishing for X. However, since each term of g has a different nonlinearity, ĝ = ({b} ∪ O)v̂, where v̂ is obtained by solving Eq. (11) with αX, is not degree-wise identical to g in general. Furthermore, in computer algebra, the coefficients of monomials or the coefficient norms of polynomials are of interest in many applications; however, it is difficult to relate the coefficient norm to the gradient norm. Finding a useful data-dependent normalization and realizing the properties of Theorem 4.10 and Proposition 4.18 in the border basis computation are left as topics for future work.

In addition to normalization, we can exploit the gradient information to further enhance the basis computation of vanishing ideals. Here, we introduce two applications of gradient information. The first is basis reduction, the goal of which is to remove redundant basis polynomials from a basis. The second is a regularization method for the basis computation based on the dimension of the variety.
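Generalized eigenvalue problems of the form of Eq. (11) (and of the SBC algorithm's Eq. (3)) can be solved by Cholesky whitening; the following sketch (our own, with random stand-in matrices) also verifies the scaling behavior used in the proof of Lemma 4.14: when the left matrix scales by α⁴ and the right by α², the eigenvalues scale by α² and the eigenvectors by 1/α.

```python
import numpy as np

def gen_eigh(A, B):
    # Generalized eigensolver for A v = lam * B v with the normalization
    # V.T @ B @ V = I, via the Cholesky whitening B = L @ L.T.
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    lam, U = np.linalg.eigh(Linv @ A @ Linv.T)
    return lam, Linv.T @ U

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # stand-in for C(X)^T C(X) (PSD)
B = A + 5.0 * np.eye(5)          # stand-in for N_g(C; X) (positive definite)

alpha = 3.0
lam, V = gen_eigh(A, B)
lam_s, V_s = gen_eigh(alpha**4 * A, alpha**2 * B)

assert np.allclose(lam_s, alpha**2 * lam)            # Lambda_hat = alpha^2 * Lambda
assert np.allclose(np.abs(V_s), np.abs(V) / alpha)   # V_hat = V / alpha (up to sign)
print("eigenvalues scale by alpha^2; eigenvectors by 1/alpha")
```

The whitening step realizes exactly the normalization constraint V^⊤ N_g V = I assumed in the proofs above.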
A computed basis G of a vanishing ideal can be redundant in the sense that a basis polynomial g ∈ G may be generated by other basis polynomials of lower degree, i.e., g ∈ ⟨G_{deg(g)−1}⟩. Here, we provide a numerical approach to remove such redundant basis polynomials in order to reduce the size of a basis. This is sometimes useful; for example, a reduced basis provides compact feature vectors for classification [15, 21, 28, 29]. To determine whether g ∈ ⟨G_{deg(g)−1}⟩ for a given g, a standard approach in computer algebra is to divide g by a Gröbner basis of ⟨G_{deg(g)−1}⟩ and solve an ideal membership problem. However, the complexity of computing a Gröbner basis is known to be doubly exponential [4]. In addition, the polynomial division requires an expanded form of g, which is also computationally expensive to obtain in our case. Moreover, this polynomial-division-based approach is not favorable in the approximate setting, where g can be only approximately represented by the polynomials in G_{deg(g)−1}. Thus, we would like to handle the redundancy in a non-symbolic and noise-tolerant manner using the evaluation values at the points. However, all vanishing polynomials have the same evaluation vector (the zero vector), so the evaluations of the polynomials themselves are uninformative. We can again resort to the gradients of the polynomials, whose evaluations are proven to be nonvanishing for the given set of points (Lemma 4.2). We consider g redundant if, for every x ∈ X, the gradient ∇g(x) is linearly dependent on the gradients of the polynomials in G_{deg(g)−1}.

Conjecture 5.1.
Let G be a basis of the vanishing ideal I(X) of a set of points X ⊂ R^n, output by the SBC algorithm with ε = 0. Then, g ∈ G satisfies g ∈ ⟨G_{deg(g)−1}⟩ if and only if for any x ∈ X,

    ∇g(x) = ∑_{g′ ∈ G_{deg(g)−1}} α_{g′,x} ∇g′(x),    (12)

for some α_{g′,x} ∈ R.

Proof of sufficiency. From the assumption g ∈ ⟨G_{deg(g)−1}⟩, we can represent g as g = ∑_{g′ ∈ G_{deg(g)−1}} g′ h_{g′} for some polynomials {h_{g′}} ⊂ R_n. Thus,

    ∇g(x) = ∑_{g′ ∈ G_{deg(g)−1}} ( h_{g′}(x) ∇g′(x) + g′(x) ∇h_{g′}(x) ) = ∑_{g′ ∈ G_{deg(g)−1}} h_{g′}(x) ∇g′(x),

where we used g′(x) = 0, ∀x ∈ X, in the last equality.

From the sufficiency, we can remove all the redundant polynomials from a basis by checking whether Eq. (12) holds. We may accidentally remove some basis polynomials that are not redundant because the necessity (the "only if" part) remains to be proven. Conceptually, the necessity implies that the global (symbolic) relation g ∈ ⟨G_{deg(g)−1}⟩ can be inferred from the local relation Eq. (12) at finitely many points X. This is not true for general g and G_{deg(g)−1}. However, g and G_{deg(g)−1} are constructed in a very restrictive way; hence, we suspect that our conjecture may be true.

We also support the validity of the use of Conjecture 5.1 from another perspective. When Eq. (12) holds, with the basis polynomials of lower degrees, one can generate a polynomial ĝ that takes the same value and gradient as g at all the given points; in other words, ĝ behaves identically to g up to the first order at all the points. According to the essence of the vanishing ideal computation—identifying a polynomial only by its behavior at the given points—it is reasonable to consider such g "redundant" for practical use.

Now, we describe the removal of redundant polynomials based on Conjecture 5.1. Given g and G_{deg(g)−1}, we solve the following least-squares problem for each x ∈ X:

    min_{v ∈ R^{|G_{deg(g)−1}|}} ∥∇g(x) − v^⊤ ∇G_{deg(g)−1}(x)∥²,    (13)

where ∇G_{deg(g)−1}(x) ∈ R^{|G_{deg(g)−1}|×n} is the matrix that stacks ∇g′(x) for g′ ∈ G_{deg(g)−1} in its rows (note that ∇g(x) ∈ R^{1×n}). This problem has the closed-form solution v^⊤ = ∇g(x) ∇G_{deg(g)−1}(x)†. If the residual error equals zero at all the points in X, then g is removed as a redundant polynomial. In the approximately vanishing case (ε > 0), for g ∈ G_t (not G_{t−1}), it holds that g ∉ ⟨G_t \ {g}⟩; thus, it is not necessary to consider the reduction within the same degree.

Proposition 5.2.
Let us consider SBC-n_g and use the notation of Algorithm 3.6. Let h_i = C_t v_i for i = 1, 2, ..., |C_t|. Then, if n_g(h_i; X)^⊤ n_g(h_j; X) = δ_ij, there is x ∈ X such that ∇h_i(x) and ∇h_j(x) are mutually linearly independent for i ≠ j.

Proof. We prove the claim by contradiction. Assume that ∀x ∈ X, ∃k_x ∈ R such that ∇h_i(x) = k_x ∇h_j(x), where k_x ≠ 0 for ∇h_i(x) ≠ 0. Then,

    n_g(h_i; X)^⊤ n_g(h_j; X) = ∑_{x ∈ X} ∇h_i(x) ∇h_j(x)^⊤ = ∑_{x ∈ X} k_x ∥∇h_j(x)∥² = 0.

Noting that ∥n_g(h_i; X)∥² = ∑_{x ∈ X} ∥∇h_i(x)∥² = 1, for some x ∈ X, ∥∇h_i(x)∥ ≠ 0 and thus k_x = 0. This contradicts the assumption.

It is worth noting that the redundancy of a basis does not necessarily imply an imperfection of the basis computation. In fact, the fascinating property of border bases—their better numerical stability compared to Gröbner bases—stems from the redundancy of border bases [23]. Thus, we did not alter the basis computation framework (here, the SBC algorithm); instead, we provide a postprocessing method to remove the redundancy.

A vanishing ideal is by definition a zero-dimensional ideal; that is, the affine variety associated with the ideal is a finite set of points. However, in several applications, data points are assumed to come from a positive-dimensional ideal, whose associated affine variety is smooth (nonsingular) except for zero-measure sets. Therefore, the algorithms for the basis computation of vanishing ideals sometimes over-specify the affine variety for the given data points and do not generalize well to unobserved data. For better generalization, a typical regularization method is to restrict the maximum degree of basis polynomials or to restrict the size of a basis.
However, it is difficult to choose proper values for these parameters: it is not obvious which degrees of polynomials will vanish for the given data, nor how many polynomials will be obtained, and the potential range of these values is unknown. Instead of restricting the degree or the basis size, we propose using the dimension of the affine variety for regularization, where the gradient information is again exploited.
Definition 5.3 (Dimension of affine variety). Let $V \subset \mathbb{R}^n$ be an affine variety that is the zero set of a polynomial set $G$. The dimension $\dim(V)$ of $V$ is defined as
\[
\dim(V) = n - \min_{x \in V^*} \operatorname{rank}(\nabla G(x)),
\]
where $V^*$ is the set of nonsingular points of $V$ and $\nabla G(x) \in \mathbb{R}^{n \times |G|}$ is the matrix whose $i$-th column is the gradient at $x$ of the $i$-th polynomial of $G$.

We provide an example in Fig. 1. The union of a line and a plane is the zero set of $G = \{xz, yz\}$. At a point on the vertical line, the rank of $\nabla G(x)$ is two, whereas at a point on the plane, it is one. Therefore, the dimension of this variety is $3 - \min_{x \in V^*} \operatorname{rank}(\nabla G(x)) = 3 - 1 = 2$.
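As a sanity check of Definition 5.3, the rank computation for the example $G = \{xz, yz\}$ can be reproduced numerically. The following NumPy sketch (the sample points and the helper name `jacobian` are ours, chosen for illustration) evaluates $\operatorname{rank}(\nabla G(x))$ at one nonsingular point of each component:

```python
import numpy as np

# Gradients of G = {xz, yz} at a point (x, y, z):
# grad(xz) = (z, 0, x), grad(yz) = (0, z, y).
def jacobian(p):
    x, y, z = p
    return np.array([[z, 0.0, x],
                     [0.0, z, y]]).T  # shape (n, |G|) = (3, 2)

# One nonsingular point on the vertical line and one on the plane.
line_point = (0.0, 0.0, 1.0)   # on the z-axis
plane_point = (1.0, 2.0, 0.0)  # on the plane z = 0

ranks = [int(np.linalg.matrix_rank(jacobian(p)))
         for p in (line_point, plane_point)]
print(ranks)  # prints [2, 1]: rank 2 on the line, rank 1 on the plane

# dim(V) = n - min rank over nonsingular points = 3 - 1 = 2
dim_V = 3 - min(ranks)
print(dim_V)  # prints 2
```

The minimum rank over the two components is one, so the sketch recovers $\dim(V) = 2$, the dimension of the largest component (the plane).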
Figure 2: Contour plots of vanishing polynomials obtained by VCA (left panel) and SBC-$n_g$ (right panel). Both bases contain redundant basis polynomials (the last three in the left panel and the last two in the right panel), which can be efficiently removed by the proposed method based on Conjecture 5.1.

Table 1: Comparison of bases obtained by SBC with different normalizations ($n_c$ and $n_g$). Here, $n$-ratio denotes the ratio of the largest norm to the smallest norm of the polynomials in the basis with respect to $n$.

              basis size   n_c-ratio   n_g-ratio   runtime (ms)
  D_1  n_c        41          1.00      12.2e+2      48.0e+1
       n_g        30          46.6       1.00           —
  D_2  n_c        70          1.00      19.6e+2      17.5e+3
       n_g        33          76.9       1.00           —
We compared three basis construction algorithms: VCA, SBC-$n_c$, and SBC-$n_g$. The experiments were performed using Julia implementations on a desktop machine with an eight-core processor and 32 GB of memory.

We first demonstrate that the redundancy of bases can be removed by our basis reduction method. We consider the vanishing ideal of $X = \{(1, 0), (0, 1), (-1, 0), (0, -1)\}$ with no perturbation, where the exact Gröbner basis and polynomial division can be computed for verification. As shown in Fig. 2, the VCA basis and the SBC-$n_g$ basis consist of five and four vanishing polynomials, respectively. The bases share $g_1 = x^2 + y^2 - 1$ and $g_2 = xy$ (up to constant scale). A simple calculation using the Gröbner basis of $\{g_1, g_2\}$ reveals that the other polynomials in each basis can be generated by $\{g_1, g_2\}$. With our basis reduction method, both bases were successfully reduced to $\{g_1, g_2\}$.

We constructed two datasets (denoted by $D_1$ and $D_2$, respectively) from two affine varieties: (i) triple concentric ellipses and (ii) the zero set of a pair of polynomials in three variables. We randomly sampled 75 points and 100 points from them, respectively. To the former, we introduced five additional variables $y_i = k_i x_1 + (1 - k_i) x_2$ with five distinct values of $k_i$, and to the latter, we introduced nine additional variables $y_i = k_i x_1 + l_i x_2 + (1 - k_i - l_i) x_3$ with nine distinct pairs $(k_i, l_i)$. The sampled points were then mean-centralized and perturbed by additive zero-mean Gaussian noise whose standard deviation was set to 5% of the average absolute value of the points.
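To make the test of Eq. (13) concrete, the following Python sketch applies it to the four points of the toy example above, with $\{g_1, g_2\}$ as the lower-degree set and a redundant candidate $g = x(x^2 + y^2 - 1) \in \langle g_1 \rangle$ that we choose for illustration (a sketch in Python, whereas the experiments in this paper are implemented in Julia):

```python
import numpy as np

X = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

# Gradients of the lower-degree basis polynomials:
# grad(g1) = grad(x^2 + y^2 - 1) = (2x, 2y)
# grad(g2) = grad(xy)            = (y, x)
def grad_G(x, y):
    return np.array([[2 * x, 2 * y],
                     [y, x]])  # rows stack the gradients: shape (|G|, n)

# Hypothetical redundant candidate: g = x(x^2 + y^2 - 1) = x^3 + xy^2 - x,
# grad(g) = (3x^2 + y^2 - 1, 2xy).
def grad_g(x, y):
    return np.array([3 * x**2 + y**2 - 1, 2 * x * y])

# Eq. (13): v^T = grad(g)(x) @ pinv(grad_G(x)); g is flagged as redundant
# if the least-squares residual vanishes at every point of X.
residuals = []
for (x, y) in X:
    G = grad_G(x, y)
    v = grad_g(x, y) @ np.linalg.pinv(G)
    residuals.append(float(np.linalg.norm(grad_g(x, y) - v @ G)))

is_redundant = all(r < 1e-9 for r in residuals)
print(is_redundant)  # prints True: g can be removed from the basis
```

At every point the residual vanishes, so the candidate is flagged as redundant, in line with Conjecture 5.1; in the approximately vanishing case, the zero-residual check would be relaxed to a small threshold.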
The parameter $\epsilon$ was selected such that (i) the number of linear vanishing polynomials in the basis agrees with the number of additional variables $y_i$, and (ii) except for these linear polynomials, the lowest degree (say, $d_{\min}$) of the polynomials agrees with that of the Gröbner basis of the target variety, and the number of degree-$d_{\min}$ polynomials in the basis agrees with or exceeds that of the Gröbner basis. As shown in Table 1, SBC-$n_g$ runs substantially faster than SBC-$n_c$ on both $D_1$ and $D_2$. Here, $n$-ratio denotes the ratio of the largest norm to the smallest norm of the polynomials in a basis with respect to $n$; hence, the $n_c$-ratio and the $n_g$-ratio are unity for SBC-$n_c$ and SBC-$n_g$, respectively. VCA was not considered for comparison because no proper $\epsilon$ value was found: when $\epsilon$ was set such that the correct number of linear vanishing polynomials was found by VCA, the degree-$d_{\min}$ polynomials could not be found, and vice versa. This indicates the importance of circumventing the spurious vanishing problem by normalization.

In this paper, we proposed exploiting the gradients of polynomials and realized the first efficient monomial-agnostic approximate computation of vanishing ideals, aiming to bridge the gap between computer algebra and recent data-driven applications such as machine learning. We introduced gradient normalization as an efficient and monomial-agnostic normalization and showed that its data-dependent nature provides novel properties that have not been achieved in the literature, such as translation and scaling consistency, by which the basis of non-translated or non-scaled points can be retrieved from that of translated or scaled points. We also showed that gradient normalization leads to a bounded increase in the extent of vanishing, which enables us to select the threshold $\epsilon$ based on geometric intuition about the perturbation.
Moreover, we demonstrated that although the gradients of polynomials at points are numerical entities, they can reveal symbolic relations between polynomials; this result was used to remove the redundancy of the basis. We believe that this work could open up new directions for research on data-driven, monomial-agnostic computer algebra, where existing notions, operations, and algorithms based on symbolic computation could be redefined or reformulated in a fully numerical manner.
This work was supported by JST, ACT-X Grant Number JPMJAX200F, Japan.

References

[1] John Abbott, Claudia Fassino, and Maria-Laura Torrente. Thinning out redundant empirical data. Mathematics in Computer Science, 1(2):375–392, 2007.
[2] John Abbott, Claudia Fassino, and Maria-Laura Torrente. Stable border bases for ideals of points. Journal of Symbolic Computation, 43(12):883–894, 2008.
[3] Rika Antonova, Maksim Maydanskiy, Danica Kragic, Sam Devlin, and Katja Hofmann. Analytic manifold learning: Unifying and evaluating representations for continuous control. arXiv preprint arXiv:2006.08718, 2020.
[4] David Cox, John Little, and Donal O'Shea. Ideals, Varieties, and Algorithms, volume 3. Springer, 1992.
[5] Claudia Fassino. Almost vanishing polynomials for sets of limited precision points. Journal of Symbolic Computation, 45(1):19–37, 2010.
[6] Claudia Fassino and Maria-Laura Torrente. Simple varieties for limited precision points. Theoretical Computer Science, 479:174–186, 2013.
[7] Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie Poulisse. Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation, 44(11):1566–1591, 2009.
[8] Chenping Hou, Feiping Nie, and Dacheng Tao. Discriminative vanishing component analysis. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pages 1666–1672. AAAI Press, 2016.
[9] Reza Iraji and Hamidreza Chitsaz. Principal variety analysis. In Proceedings of the 1st Annual Conference on Robot Learning (ACRL), pages 97–108. PMLR, 2017.
[10] Artur Karimov, Erivelton G. Nepomuceno, Aleksandra Tutueva, and Denis Butusov. Algebraic method for the reconstruction of partially observed nonlinear systems using differential and integral embedding. Mathematics, 8(2):300–321, 2020.
[11] Achim Kehrein and Martin Kreuzer. Characterizations of border bases. Journal of Pure and Applied Algebra, 196(2):251–270, 2005.
[12] Achim Kehrein and Martin Kreuzer. Computing border bases. Journal of Pure and Applied Algebra, 205(2):279–295, 2006.
[13] Hiroshi Kera and Yoshihiko Hasegawa. Spurious vanishing problem in approximate vanishing ideal. IEEE Access, 7:178961–178976, 2019.
[14] Hiroshi Kera and Yoshihiko Hasegawa. Noise-tolerant algebraic method for reconstruction of nonlinear dynamical systems. Nonlinear Dynamics, 85(1):675–692, 2016.
[15] Hiroshi Kera and Yoshihiko Hasegawa. Approximate vanishing ideal via data knotting. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), pages 3399–3406. AAAI Press, 2018.
[16] Hiroshi Kera and Yoshihiko Hasegawa. Gradient boosts the approximate vanishing ideal. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pages 4428–4435. AAAI Press, 2020.
[17] Hiroshi Kera and Hitoshi Iba. Vanishing ideal genetic programming. In Proceedings of the 2016 IEEE Congress on Evolutionary Computation (CEC), pages 5018–5025. IEEE, 2016.
[18] Franz J. Király, Martin Kreuzer, and Louis Theran. Dual-to-kernel learning with ideals. arXiv preprint arXiv:1402.0099, 2014.
[19] Martin Kreuzer and Lorenzo Robbiano. Computational Commutative Algebra 2, volume 2. Springer, 2005.
[20] Jan Limbeck. Computation of Approximate Border Bases and Applications. PhD thesis, Universität Passau, 2013.
[21] Roi Livni, David Lehavi, Sagi Schein, Hila Nachliely, Shai Shalev-Shwartz, and Amir Globerson. Vanishing component analysis. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML), pages 597–605. PMLR, 2013.
[22] H. M. Möller and B. Buchberger. The construction of multivariate polynomials with preassigned zeros. In Computer Algebra. EUROCAM 1982. Lecture Notes in Computer Science, pages 24–31. Springer Berlin Heidelberg, 1982.
[23] Lorenzo Robbiano and John Abbott. Approximate Commutative Algebra, volume 1. Springer, 2010.
[24] Yunxue Shao, Guanglai Gao, and Chunheng Wang. Nonlinear discriminant analysis based on vanishing component analysis. Neurocomputing, 218:172–184, 2016.
[25] Hans J. Stetter. Numerical Polynomial Algebra. Society for Industrial and Applied Mathematics, 2004.
[26] Maria-Laura Torrente. Application of Algebra in the Oil Industry. PhD thesis, Scuola Normale Superiore, Pisa, 2008.
[27] L. Wang and T. Ohtsuki. Nonlinear blind source separation unifying vanishing component analysis and temporal structure. IEEE Access, 6:42837–42850, 2018.
[28] Zhichao Wang, Qian Li, Gang Li, and Guandong Xu. Polynomial representation for persistence diagram. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6116–6125, 2019.
[29] Yan-Guo Zhao and Zhan Song. Hand posture recognition using approximate vanishing ideal generators. In