Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata
Borja Balle, Clara Lacroce, Prakash Panangaden, Doina Precup, Guillaume Rabusseau
arXiv preprint [cs.FL]
DeepMind, London, United Kingdom
School of Computer Science, McGill University, Montréal, Canada
Mila, Montréal, Canada
DIRO, Université de Montréal, Montréal, Canada
Abstract
We address the approximate minimization problem for weighted finite automata (WFAs) over a one-letter alphabet: to compute the best possible approximation of a WFA given a bound on the number of states. This work is grounded in Adamyan-Arov-Krein (AAK) approximation theory, a remarkable collection of results on the approximation of Hankel operators. In addition to its intrinsic mathematical relevance, this theory has proven to be very effective for model reduction. We adapt these results to the framework of weighted automata over a one-letter alphabet. We provide theoretical guarantees and bounds on the quality of the approximation in the spectral and ℓ² norm. We develop an algorithm that, based on the properties of Hankel operators, returns the optimal approximation in the spectral norm.

1 Introduction

Weighted finite automata (WFAs) are an expressive class of models representing functions defined over sequences. The approximate minimization problem is concerned with finding an automaton that approximates the behaviour of a given minimal WFA, while being smaller in size. This second automaton recognizes a different language, and the objective is to minimize the approximation error [10, 11]. Approximate minimization becomes particularly useful in the context of spectral learning algorithms [5, 7, 9, 22]. When applied to a learning task, such algorithms can be viewed as working in two steps. First, they compute a minimal WFA that explains the training data exactly. Then, they obtain a model that generalizes to the unseen data by producing a smaller approximation to the minimal WFA. It is not just a question of saving space by having a smaller state space: the exact machine will overfit the data and generalize poorly. To obtain accurate results it is crucial to guess correctly the size of the minimal WFA, in particular when the data is generated by this type of machine.

The minimization task is greatly shaped by the way we decide to measure the approximation error.
It is thus natural to wonder if there are norms that are preferable to others. We believe that the chosen norm should be computationally reasonable to minimize. For instance, the distance between WFAs can be computed using a metric based on bisimulation [8]. While exploring this approach could still be interesting, the fact that this metric is hard to compute makes it unsuitable for our purposes.

Our main contributions are the following:

• We use AAK theory to study the approximate minimization problem of WFAs. To connect those areas, we establish a correspondence between the parameters of a WFA and the coefficients of a complex function on the unit circle. To the best of our knowledge, this paper represents the first attempt to apply AAK theory to WFAs.

• We present a theoretical analysis of the optimal spectral-norm approximate minimization problem for WFAs, based on their connection with finite-rank infinite Hankel matrices. We provide a closed form solution for a class of weighted automata over a one-letter alphabet, and bounds on the approximation error, both in terms of the Hankel matrix (spectral norm) and of the rational function computed by the WFA (ℓ² norm).

• We propose a self-contained algorithm that returns the unique optimal spectral-norm approximation of a given size.

• We tighten the connection, made in [11], between WFAs and discrete dynamical systems, by adapting some of the control theory concepts, like the allpass system [20], to the formalism of WFAs.

∗ Corresponding author: [email protected]
† The names of the authors appear in alphabetical order.
We denote with N, Z and R the sets of natural, integer and real numbers, respectively. We use bold letters for vectors and matrices; all vectors considered are column vectors. We denote with I the identity matrix, specifying its dimension only when not clear from the context. We refer to the i-th row and the j-th column of M by M(i, :) and M(:, j). Given a matrix M ∈ R^{p×q} of rank n, a rank factorization is a factorization M = PQ, where P ∈ R^{p×n}, Q ∈ R^{n×q} and rank(P) = rank(Q) = n. Let M ∈ R^{p×q} be of rank n; the compact singular value decomposition (SVD) of M is the factorization M = UDV⊤, where U ∈ R^{p×n}, D ∈ R^{n×n}, V ∈ R^{q×n} are such that U⊤U = V⊤V = I, and D is a diagonal matrix. The columns of U and V are called left and right singular vectors, while the entries of D are the singular values. The Moore-Penrose pseudo-inverse M⁺ of M is the unique matrix such that MM⁺M = M, M⁺MM⁺ = M⁺, with M⁺M and MM⁺ Hermitian [42]. The spectral radius ρ(M) of a matrix M is the largest modulus among its eigenvalues.

A Hilbert space is a complete normed vector space where the norm arises from an inner product. A linear operator T : X → Y between Hilbert spaces is bounded if it has finite operator norm, i.e. ‖T‖_op = sup_{‖g‖_X ≤ 1} ‖Tg‖_Y < ∞. We denote by T the (infinite) matrix associated with T by some (canonical) orthonormal basis on the Hilbert space. An operator is compact if the image of the unit ball in X is relatively compact. Given Hilbert spaces X, Y and a compact operator T : X → Y, we denote its adjoint by T*. The singular numbers {σ_n}_{n≥0} of T are the square roots of the non-negative eigenvalues of the self-adjoint operator T*T, arranged in decreasing order. A σ-Schmidt pair {ξ, η} for T is a couple of norm-1 vectors such that: Tξ = ση and T*η = σξ.
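These notions can be illustrated concretely. The following short sketch (Python with numpy, a choice made here purely for illustration) checks the rank factorization, the compact SVD, and the Moore-Penrose identities on a small random low-rank matrix:

```python
import numpy as np

# A rank-2 matrix M in R^{4x3}, built as a rank factorization M = PQ.
rng = np.random.default_rng(0)
P = rng.standard_normal((4, 2))
Q = rng.standard_normal((2, 3))
M = P @ Q

# Compact SVD: keep only the n = rank(M) nonzero singular values.
U, d, Vt = np.linalg.svd(M, full_matrices=False)
n = np.linalg.matrix_rank(M)
U, D, V = U[:, :n], np.diag(d[:n]), Vt[:n, :].T
assert np.allclose(M, U @ D @ V.T)        # M = U D V^T
assert np.allclose(U.T @ U, np.eye(n))    # orthonormal left singular vectors
assert np.allclose(V.T @ V, np.eye(n))    # orthonormal right singular vectors

# Moore-Penrose pseudo-inverse and its defining identities.
Mp = np.linalg.pinv(M)
assert np.allclose(M @ Mp @ M, M)
assert np.allclose(Mp @ M @ Mp, Mp)
```

The same decompositions, applied to (truncations of) infinite Hankel matrices, are the computational core of everything that follows.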
The Hilbert-Schmidt decomposition provides a generalization of the compact SVD for the infinite matrix of a compact operator T using singular numbers and orthonormal Schmidt pairs: Tx = Σ_{n≥0} σ_n⟨x, ξ_n⟩η_n [42]. The spectral norm ‖T‖ of the matrix representing the operator T is the largest singular number of T. Note that the spectral norm of T corresponds to the operator norm of T.

Let ℓ² be the Hilbert space of square-summable sequences over Σ*, with norm ‖f‖² = Σ_{x∈Σ*} |f(x)|² and inner product ⟨f, g⟩ = Σ_{x∈Σ*} f(x)g(x) for f, g ∈ R^{Σ*}. Let T = {z ∈ C : |z| = 1} be the complex unit circle and D = {z ∈ C : |z| < 1} the (open) complex unit disc. For 1 < p < ∞, let L^p(T) be the space of measurable functions on T for which the p-th power of the absolute value is Lebesgue integrable. For p = ∞, we denote with L^∞(T) the space of measurable functions that are bounded, with norm ‖f‖_∞ = sup{|f(x)| : x ∈ T}.

Let Σ be a fixed finite alphabet and Σ* the set of all finite strings with symbols in Σ. We use ε to denote the empty string. Given p, s ∈ Σ*, we denote with ps their concatenation.

A weighted finite automaton (WFA) of n states over Σ is a tuple A = ⟨α, {A_a}, β⟩, where α, β ∈ R^n are the vectors of initial and final weights, respectively, and A_a ∈ R^{n×n} is the matrix containing the transition weights associated with each symbol a ∈ Σ. Every WFA A realizes a function f_A : Σ* → R, i.e. given a string x = x_1 ··· x_t ∈ Σ*, it returns f_A(x) = α⊤A_{x_1} ··· A_{x_t}β = α⊤A_xβ. A function f : Σ* → R is called rational if there exists a WFA A that realizes it. The rank of the function is the size of the smallest WFA realizing f. Given f : Σ* → R, we can consider a matrix H_f ∈ R^{Σ*×Σ*} having rows and columns indexed by strings and defined by H_f(p, s) = f(ps) for p, s ∈ Σ*.

Definition 2.1.
A (bi-infinite) matrix H ∈ R^{Σ*×Σ*} is Hankel if for all p, p′, s, s′ ∈ Σ* such that ps = p′s′, we have H(p, s) = H(p′, s′). Given a Hankel matrix H ∈ R^{Σ*×Σ*}, there exists a unique function f : Σ* → R such that H_f = H.

Theorem 2.1 ([15, 19]). A function f : Σ* → R can be realized by a WFA if and only if H_f has finite rank n. In that case, n is the minimal number of states of any WFA A such that f = f_A.

Given a WFA A = ⟨α, {A_a}, β⟩, the forward matrix of A is the infinite matrix F_A ∈ R^{Σ*×n} given by F_A(p, :) = α⊤A_p for any p ∈ Σ*, while the backward matrix of A is B_A ∈ R^{Σ*×n}, given by B_A(s, :) = (A_sβ)⊤ for any s ∈ Σ*. Let H_f be the Hankel matrix of f; its forward-backward (FB) factorization is H_f = FB⊤. A WFA with n states is reachable if rank(F_A) = n, while it is observable if rank(B_A) = n. A WFA is minimal if it is reachable and observable. If A is minimal, the FB factorization is a rank factorization [7].

We recall the definition of the singular value automaton, a canonical form for WFAs [10].

Definition 2.2. Let f be a rational function and suppose H_f admits an SVD, H_f = UDV⊤. A singular value automaton (SVA) for f is the minimal WFA A realizing f such that F_A = UD^{1/2} and B_A = VD^{1/2}.

The SVA can be computed with an efficient algorithm relying on the following matrices [11].
Definition 2.3.
Let f be a rational function and H_f = FB⊤ a FB factorization. If the matrices P = F⊤F and Q = B⊤B are well defined, we call P the reachability Gramian and Q the observability Gramian.

Note that if A is an SVA, then the Gramians associated with its FB factorization satisfy P_A = Q_A = D, where D is the matrix of singular values of its Hankel matrix. The Gramians can alternatively be characterized (and computed [11]) using fixed point equations, corresponding to Lyapunov equations when |Σ| = 1 [27].

Theorem 2.2.
Let |Σ| = 1 and let A = ⟨α, A, β⟩ be a WFA with n states and well-defined Gramians P, Q. Then X = P and Y = Q solve:

X − AXA⊤ = ββ⊤,   (1)
Y − A⊤YA = αα⊤.   (2)

Finally, we recall the following definition.

Definition 2.4.
A WFA A = ⟨α, {A_a}, β⟩ is a generative probabilistic automaton (GPA) if f_A(x) ≥ 0 for every x, and Σ_{x∈Σ*} f_A(x) = 1, i.e. if f_A computes a probability distribution over Σ*.

Example 2.3.
Let |Σ| = 1, Σ = {x}. The WFA A = ⟨α, A, β⟩, with:

A = [ 0    1/2
      1/2  0 ],   α = β = [ √3/2
                            0 ],

is a GPA. Indeed, f_A(x) ≥ 0 and Σ_{x∈Σ*} f_A(x) = 1, since the rational function is:

f_A(x ··· x) = f_A(k) = α⊤A^kβ = { 0 if k is odd
                                   (3/4)2^{−k} if k is even,

where k corresponds to the string where x is repeated k times. We remark that A is minimal and in its SVA form, with Gramians P = Q = [ 4/5 0 ; 0 1/5 ], and f_A has rank 2. Finally, the corresponding Hankel matrix, also of rank 2, is:

H = [ f_A(0) f_A(1) f_A(2) ...     [ 3/4  0    3/16 ...
      f_A(1) f_A(2) f_A(3) ...  =    0    3/16 0    ...
      f_A(2) f_A(3) f_A(4) ...       3/16 0    3/64 ...
      ...    ...    ...    ... ]     ...  ...  ...  ... ].   (3)

Theorem 2.1 provides us with a way to associate a minimal WFA A with n states to a Hankel matrix H of rank n. The approach we propose to approximate A is to find the WFA corresponding to the Hankel matrix of rank k that is closest to H in the spectral norm. We recall the fundamental result of Schmidt, Eckart, Young and Mirsky [17].

Theorem 2.4 ([17]). Let H be a Hankel matrix corresponding to a compact Hankel operator of rank n, and σ_m, with 0 ≤ m < n and σ_0 ≥ ··· ≥ σ_{n−1} > 0, its singular numbers. Then, if R is a matrix of rank k, we have:

‖H − R‖ ≥ σ_k.   (4)

The equality is attained when R corresponds to the truncated SVD of H.

Note that a low-rank approximation obtained by truncating the SVD is not in general a Hankel matrix. This is problematic, since the approximation needs to be Hankel in order to be the Hankel matrix of a WFA. Surprisingly, we can obtain a result comparable to the one of Theorem 2.4 while preserving the Hankel property. This is possible thanks to a theory of optimal approximation called AAK theory [1]. To apply this theory, we will need to rewrite the approximation problem in functional analysis terms. First, we will associate a linear operator to the Hankel matrix. Then, we will use Fourier analysis to reformulate the problem in a function space.
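Example 2.3 and Theorem 2.4 can be checked numerically. The sketch below (Python/numpy, with the matrix entries as reconstructed above) verifies that f_A is a probability distribution, that the Gramians solve the fixed point equations of Theorem 2.2, that the truncated Hankel matrix has singular numbers approaching 4/5 and 1/5, and that the rank-1 SVD truncation, while attaining the error bound, fails to be Hankel:

```python
import numpy as np

# Example 2.3: a 2-state GPA over a one-letter alphabet, in SVA form.
A = np.array([[0.0, 0.5], [0.5, 0.0]])
alpha = np.array([np.sqrt(3) / 2, 0.0])
beta = alpha.copy()                         # symmetric case: beta = alpha

f = lambda k: alpha @ np.linalg.matrix_power(A, k) @ beta
vals = np.array([f(k) for k in range(200)])
assert abs(vals.sum() - 1.0) < 1e-12        # GPA: the values sum to 1
assert (vals >= 0).all()

# Theorem 2.2: P solves X - A X A^T = beta beta^T.  Summing the quickly
# converging series P = sum_p A^p beta beta^T (A^p)^T gives diag(4/5, 1/5).
P = sum(np.linalg.matrix_power(A, p) @ np.outer(beta, beta)
        @ np.linalg.matrix_power(A, p).T for p in range(100))
assert np.allclose(P, np.diag([0.8, 0.2]))
assert np.allclose(P - A @ P @ A.T, np.outer(beta, beta))

# Truncated Hankel matrix H(i, j) = f(i + j): rank 2, singular numbers
# converging to the SVA Gramian diagonal 4/5 and 1/5.
N = 64
H = np.array([[vals[i + j] for j in range(N)] for i in range(N)])
U, s, Vt = np.linalg.svd(H)
assert abs(s[0] - 0.8) < 1e-9 and abs(s[1] - 0.2) < 1e-9 and s[2] < 1e-12

# Theorem 2.4: the rank-1 SVD truncation attains error sigma_1, but it is
# not Hankel: its entries no longer depend only on i + j.
R1 = s[0] * np.outer(U[:, 0], Vt[0, :])
assert abs(np.linalg.norm(H - R1, 2) - s[1]) < 1e-9
assert abs(R1[0, 2] - R1[1, 1]) > 0.1       # a Hankel matrix would force equality
```

The last two assertions make the tension concrete: the Eckart-Young optimum is easy to compute but leaves the class of Hankel matrices, which is exactly the gap AAK theory closes.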
A comprehensive presentation of the concepts recalled in this section can be found in [30, 33, 28].

Let f : Σ* → R be a rational function; we interpret the corresponding Hankel matrix H_f as the expression of a linear (Hankel) operator H_f : ℓ² → ℓ² in terms of the canonical basis. We recall that the Hankel operator H_f is bounded if and only if f ∈ ℓ² [11]. This property, together with the fact that we only consider finite rank operators (corresponding to the Hankel matrices of WFAs), is sufficient to guarantee compactness.

To introduce AAK theory, we need to consider a second realization of Hankel operators on complex spaces. Since in this paper we work with two classes of functions – functions over sequences and complex functions – to avoid any confusion we will make explicit the dependence on the complex variable z = e^{it}. We start by recalling a few fundamental definitions from the theory of complex functions. Note that a function φ(z) ∈ L²(T) can be represented, using the orthonormal basis {z^n}_{n∈Z}, by means of its Fourier series: φ(z) = Σ_{n∈Z} φ̂(n)z^n, with Fourier coefficients φ̂(n) = ∫_T φ(z)z̄^n dz, n ∈ Z. This establishes an isomorphism between the function φ(z) and the sequence of its Fourier coefficients φ̂. Thus, we can partition the function space L²(T) into two subspaces.

Definition 2.5.
For 0 < p ≤ ∞, the Hardy space H^p on T is the subspace of L^p(T) defined as:

H^p = {φ(z) ∈ L^p(T) : φ̂(n) = 0, n < 0},   (5)

while the negative Hardy space on T is the subspace of L^p(T)

H^p_− = {φ(z) ∈ L^p(T) : φ̂(n) = 0, n ≥ 0}.   (6)

It is possible to define Hardy spaces also on the open unit disc D.

Definition 2.6.
The Hardy space H^p(D) on D for 0 < p < ∞ consists of the functions φ(z) analytic in D and such that:

‖φ‖_p := sup_{0<r<1} ( (1/2π) ∫_0^{2π} |φ(re^{it})|^p dt )^{1/p} < ∞.

Definition 2.7. Let φ(z) be a function in the space L²(T). A Hankel operator is an operator H_φ : H² → H²_− defined by H_φ f(z) = P_− φf(z), where P_− is the orthogonal projection from L²(T) onto H²_−. The function φ(z) is called a symbol of the Hankel operator H_φ.

If H_φ is a bounded operator, we can consider without loss of generality φ(z) ∈ L^∞(T). This is a consequence of Nehari's theorem [29], whose formulation can be found in Appendix A, together with more details about the two definitions of Hankel operators. We remark that a Hankel operator has infinitely many different symbols, since H_φ = H_{φ+ψ} for ψ(z) ∈ H^∞.

Remark 1. In the standard orthonormal bases, {z^k}_{k≥0} in H² and {z^{−(j+1)}}_{j≥0} in H²_−, the Hankel operator H_φ has Hankel matrix H(j, k) = φ̂(−j−k−1) for j, k ≥ 0.

Example 2.5. In the case of the Hankel matrix in Example 2.3, since H(j, k) = φ̂(−j−k−1), we have:

H = [ 3/4  0    3/16 ...     [ φ̂(−1) φ̂(−2) φ̂(−3) ...
      0    3/16 0    ...  =    φ̂(−2) φ̂(−3) φ̂(−4) ...
      3/16 0    3/64 ...       φ̂(−3) φ̂(−4) φ̂(−5) ...
      ...  ...  ...  ... ]     ...    ...    ...    ... ].

Hence, we can recover the corresponding symbol:

P_− φ = Σ_{n≥0} φ̂(−n−1)z^{−n−1} = Σ_{n≥0} (3/4)4^{−n}z^{−2n−1} = 3z/(4z² − 1).

Definition 2.8. The complex function f(z) is rational if f(z) = p(z)/q(z), with p(z) and q(z) polynomials. The rank of f(z) is the maximum between the degrees of p(z) and q(z). A rational function is strictly proper if the degree of p(z) is strictly smaller than that of q(z).

We remark that the poles of a complex function f correspond to the zeros of 1/f. The following result of Kronecker relates finite-rank infinite Hankel matrices to rational functions.

Theorem 2.6 ([24]). Let H_φ be a bounded Hankel operator with matrix H. Then H has finite rank if and only if P_− φ is a strictly proper rational function.
Moreover, the rank of H is equal to the number of poles (with multiplicities) of P_− φ inside the unit disc.

Example 2.7. The function in Example 2.5 is rational with degree 2 and has two poles inside the unit disc, at z = ±1/2. Thus, the associated Hankel matrix has rank 2.

We state as a remark an important takeaway from this section.

Remark 2. Given a rank-n Hankel matrix H, we can look at it in two alternative ways. On the one hand we can consider the Hankel operator over sequences H_f : ℓ² → ℓ², associated to a function f : Σ* → R. In this case H(i, j) = f(i + j) for i, j ≥ 0, and f is rational in the sense that it is realized by a WFA of size n. On the other hand, we can consider the Hankel operator over complex (Hardy) spaces H_φ : H² → H²_−, associated to a function φ(z) ∈ L²(T), the symbol. In this case H(j, k) = φ̂(−j−k−1) for j, k ≥ 0, and P_− φ is rational of rank n in the sense of Definition 2.8.

We can introduce now the main result of Adamyan, Arov and Krein [1]. The theorem, stated for Hankel operators over Hardy spaces, shows that for infinite dimensional Hankel matrices the constraint of preserving the Hankel property does not affect the achievable approximation error.

Theorem 2.8 (AAK-1 [1]). Let H_φ be a compact Hankel operator of rank n and singular numbers σ_m, with 0 ≤ m < n and σ_0 ≥ ··· ≥ σ_{n−1} > 0. Then there exists a unique Hankel operator G of rank k < n such that:

‖H − G‖ = σ_k.   (11)

We denote with R_k ⊂ H^∞_− the set of strictly proper rational functions of rank k, and we consider the set:

H^∞_k = {ψ(z) ∈ L^∞(T) : ∃ g(z) ∈ R_k, ∃ l(z) ∈ H^∞, ψ(z) = g(z) + l(z)}.   (12)

We can reformulate the theorem in terms of symbols.

Theorem 2.9 (AAK-2 [1]). Let φ(z) ∈ L^∞(T). Then, there exists ψ(z) ∈ H^∞_k such that:

‖φ(z) − ψ(z)‖_∞ = σ_k(H_φ).   (13)

The solutions of Theorem 2.8 and 2.9 are strictly related (proof in Appendix A).

Corollary 1.
Let ψ(z) = g(z) + l(z) ∈ H^∞_k, with g(z) ∈ R_k and l(z) ∈ H^∞. If ψ(z) solves Equation 13, then G = H_g is the unique Hankel operator from Theorem 2.8.

We state as a corollary the key point of the proof of the AAK theorem, which provides us with a practical way to find the best approximating symbol.

Corollary 2. Let φ(z) and {ξ_k, η_k} be a symbol and a σ_k-Schmidt pair for H_φ. A function ψ(z) ∈ L^∞(T) is the best AAK approximation according to Theorem 2.9 if and only if:

(φ(z) − ψ(z))ξ⁺_k(z) = σ_k η⁻_k(z).   (14)

Moreover, the function ψ(z) does not depend on the particular choice of the pair {ξ_k, η_k}.

3 Approximate Minimization

To apply AAK theory to the approximate minimization problem, we consider only automata defined over a one-letter alphabet. In this case, the free monoid generated by the single letter can be identified with N, and canonically embedded into Z. This passage is fundamental to use Fourier analysis and the isomorphism that leads to Theorem 2.8 and 2.9. Unfortunately, this idea cannot be directly generalized to bigger alphabets, since in this case we would obtain a free non-abelian monoid (not identifiable with Z).

Theorem 2.8 requires the Hankel operator H to be bounded. To ensure that a minimal WFA A = ⟨α, A, β⟩ satisfies this condition, we assume ρ(A) < 1, where ρ is the spectral radius of A [11]. As a matter of fact, to guarantee the boundedness of the Hankel operator it is enough that the considered WFA computes a function f ∈ ℓ² [11]. However, the stricter assumption on the spectral radius is needed when computing the symbol associated to a WFA. This condition directly implies the existence of the SVA, and of the Gramian matrices P and Q, with P = Q diagonal [11]. We assume that A = ⟨α, A, β⟩ is in SVA form.
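Before developing the construction, the two views of Remark 2 can be sanity-checked numerically on the running example (Python/numpy; the FFT grid size is an arbitrary choice): sampling the symbol φ(z) = 3z/(4z²−1) of Example 2.5 on the unit circle recovers f(n) as its negative Fourier coefficients, and Kronecker's theorem matches the rank with the pole count:

```python
import numpy as np

# The symbol of Example 2.5, sampled on N equispaced points of T.
N = 128
z = np.exp(2j * np.pi * np.arange(N) / N)
phi = 3 * z / (4 * z**2 - 1)

# c[m] approximates the Fourier coefficient phi_hat(m); negative indices
# wrap around, and aliasing is negligible since f decays geometrically.
c = np.fft.fft(phi) / N
f = lambda n: 0.75 * 2.0**-n if n % 2 == 0 else 0.0
for n in range(8):
    assert abs(c[-n - 1] - f(n)) < 1e-12   # phi_hat(-n-1) = f(n), Remark 2

# Kronecker (Theorem 2.6): P_- phi = 3z/(4z^2 - 1) is strictly proper
# rational, with poles at the roots of 4z^2 - 1 = 0, i.e. z = +-1/2,
# both inside T, matching rank(H) = 2.
poles = np.roots([4, 0, -1])
assert np.allclose(sorted(np.abs(poles)), [0.5, 0.5])
```

This is exactly the correspondence the one-letter restriction buys: strings indexed by N embed into Z, so Hankel entries become Fourier coefficients of a function on the circle.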
In this case, given the size of the alphabet, the Hankel matrix is symmetric, so β = α and A = A⊤. Note that, for example, a minimal GPA computes a function f ∈ ℓ², so the condition on ρ(A) is automatically satisfied by this class of WFAs [11]. Possible relaxations of the spectral radius assumption are discussed in Appendix C, together with an alternative method to find the optimal spectral-norm approximation of a Hankel matrix without extracting a WFA. Finally, in this paper we consider automata with weights in R, though the results remain true for complex numbers. The method we present can be easily extended to vector-valued automata [35], but the solution to the optimal approximation problem will not be unique [33].

Let A = ⟨α, A, α⟩ be a minimal WFA with n states in SVA form, defined over a one-letter alphabet. Let H be the Hankel matrix of A; we denote with σ_i, for 0 ≤ i < n, its singular numbers. Given a target number of states k < n, we say that a WFA Â_k with k states solves the optimal spectral-norm approximate minimization problem if the Hankel matrix G of Â_k satisfies ‖H − G‖ = σ_k(H). Note that the "optimal spectral-norm approximate minimization" problem is equivalent to the problem solved by Theorem 2.8, with the exception that here we insist on representing the inputs and outputs of the problem effectively by means of WFAs. Based on the AAK theory sketched in Section 2.3, we draw the following steps:

1. Compute a symbol φ(z) for H using Remark 2. We obtain the negative Fourier coefficients of φ(z) and derive its Fourier series.

2. Compute the optimal symbol ψ(z) using Corollary 2. The main challenge here is to find a suitable representation for the functions ψ(z) and e(z) = φ(z) − ψ(z). We define them in terms of two auxiliary WFAs. The key point is to select constraints on their parameters to leverage the properties of weighted automata, while still keeping the formulation general.

3.
Extract the rational component by solving for g(z) in Corollary 1. This step is arguably the most conceptually challenging, as it requires identifying the position of the function's poles. In fact, we know from Theorem 2.6 that g(z) has k poles, all inside the unit disc.

4. Find a WFA representation for g(z). Since in Step 2 we parametrized the functions using WFAs, the expression of g(z) directly reveals the WFA Â_k.

3.3 Spectral-Norm Approximate Minimization

In the following sections we will consider a minimal WFA A = ⟨α, A, α⟩ with n states in SVA form, defined over a one-letter alphabet Σ = {a}, its Hankel matrix H, corresponding to the bounded operator H, and the singular numbers σ_i, for 0 ≤ i < n. Let f : Σ* → R be the function realized by A. We denote by x the string where a is repeated x times, so we have f(x) = α⊤A^xα.

To determine the symbol φ(z) of H, we recall that each entry of the Hankel matrix corresponds simultaneously to the values of f and to the negative Fourier coefficients of φ(z). In fact, as seen in Remark 2, we have:

H = [ f_A(0) f_A(1) f_A(2) ...     [ φ̂(−1) φ̂(−2) φ̂(−3) ...
      f_A(1) f_A(2) f_A(3) ...  =    φ̂(−2) φ̂(−3) φ̂(−4) ...
      f_A(2) f_A(3) f_A(4) ...       φ̂(−3) φ̂(−4) φ̂(−5) ...
      ...    ...    ...    ... ]     ...    ...    ...    ... ].   (15)

We obtain:

P_− φ(z) = Σ_{k≥0} f(k)z^{−k−1} = Σ_{k≥0} α⊤A^kα z^{−k−1} = α⊤(zI − A)^{−1}α,   (16)

where we use the fact that ρ(A) < 1. We can thus consider φ(z) = α⊤(zI − A)^{−1}α as a symbol for H.

Example 3.1. If we apply the formula in Equation 16 to the GPA in Example 2.3, we recover the rational function φ(z) = 3z/(4z² − 1) found in Example 2.5.

We consider two auxiliary WFAs. Let Â = ⟨α̂, Â, β̂⟩ be a WFA with j ≥ k states, satisfying the following properties:

1. 1 is not an eigenvalue of Â;

2. the automaton E = ⟨α_e, A_e, β_e⟩ is minimal, with

A_e = [ A 0      α_e = [ α        β_e = [ α
        0 Â ],           −α̂ ],            β̂ ].
(17)

Using the parameters of the automaton Â and a constant C, we define a function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C. We remark that the poles of ψ(z) correspond to the eigenvalues of Â, counted with their multiplicities. By assumption, 1 is not an eigenvalue of Â, so ψ(z) does not have any poles on the unit circle, and therefore ψ(z) ∈ L^∞(T). Analogously, the function e(z) = φ(z) − ψ(z) = α_e⊤(zI − A_e)^{−1}β_e − C is also bounded on the circle.

Our objective is to compute the parameters of Â = ⟨α̂, Â, β̂⟩ that make ψ(z) the best approximation of φ(z) according to Theorem 2.9. In particular, we will use Corollary 2 to find the triple α̂, Â, β̂ such that ψ(z) satisfies Equation 14. Note that, with this purpose, the constant term C ∈ H^∞ becomes necessary to characterize ψ(z). In fact, while the H^∞-component of the symbol does not affect the Hankel norm, it plays a role in the computation of the L^∞ norm (in Equation 13) according to Nehari's theorem (Theorem A.1), so it cannot be dismissed.

The following theorem provides us with an explicit expression for the functions in the Hardy space corresponding to a σ_k-Schmidt pair.

Theorem 3.2. Let σ_k be a singular number of the Hankel operator H. The singular functions associated with the σ_k-Schmidt pair {ξ_k, η_k} of H are:

ξ⁺_k(z) = σ_k^{−1/2} α⊤(I − zA)^{−1}e_k,   (18)
η⁻_k(z) = σ_k^{−1/2} α⊤(zI − A)^{−1}e_k.   (19)

If ψ(z) is the best approximation to the symbol, then σ_k^{−1}e(z) has modulus 1 almost everywhere on the unit circle (i.e. it is unimodular).

Proof. Let F and B be the forward and backward matrices, respectively, with H = FB⊤, P = F⊤F, Q = B⊤B. We consider the σ_k-Schmidt pair {ξ_k, η_k}.
By definition, H⊤Hξ_k = σ_k²ξ_k. Rewriting in terms of the FB factorization, we obtain:

H⊤Hξ_k = σ_k²ξ_k   (20)
BF⊤FB⊤ξ_k = σ_k²ξ_k   (21)
BPB⊤ξ_k = σ_k²ξ_k   (22)
BPe_k = σ_k²ξ_k   (23)

where in the last step we set e_k = B⊤ξ_k, to reduce the SVD problem of H to the one of QP. Note that, since P and Q are diagonal, e_k is the k-th coordinate vector (0, ..., 0, 1, 0, ..., 0)⊤. Since e_k is an eigenvector of QP for σ_k², we get:

BQ^{−1}QPe_k = σ_k²ξ_k   (24)
BQ^{−1}e_k = ξ_k.   (25)

Moreover, H is symmetric, so we have that η_k = ξ_k. We obtain:

ξ⁺_k(z) = Σ_{j≥0} σ_k^{−1/2} α⊤A^j e_k z^j = σ_k^{−1/2} α⊤(I − zA)^{−1}e_k,   (26)
η⁻_k(z) = Σ_{j≥0} σ_k^{−1/2} α⊤A^j e_k z^{−j−1} = σ_k^{−1/2} α⊤(zI − A)^{−1}e_k,   (27)

where the singular functions have been computed following Equation 9.

If r is the multiplicity of σ_k, from Corollary 2 we get the following fundamental equation:

(φ(z) − ψ(z)) α⊤(I − zA)^{−1}V = σ_k α⊤(zI − A)^{−1}V,

where V = (0 I_r)⊤ is a n × r matrix. Consequently, we obtain the unimodular function:

σ_k^{−1}e(z) = (α⊤(zI − A)^{−1}V) / (α⊤(I − zA)^{−1}V).

It is reasonable to wonder how the fact that σ_k^{−1}e(z) is unimodular reflects on the structure of the WFA E = ⟨α_e, A_e, β_e⟩ associated with it. We remark that, a priori, the controllability and observability Gramians of E might not be well defined. The following theorem provides us with two matrices P_e and Q_e satisfying properties similar to those of the Gramians. This theorem is the analogue of a control theory result [16], rephrased in terms of WFAs. A sketch of the proof, which relies on the minimality of the WFA E [38], can be found in Appendix B. For the detailed version of the proof and the original theorem we refer the reader to [16].

Theorem 3.3 ([16]). Consider the function e(z) = α_e⊤(zI − A_e)^{−1}β_e − C and the corresponding minimal WFA E = ⟨α_e, A_e, β_e⟩ associated with it.
If σ_k^{−1}e(z) is unimodular, then there exists a unique pair of symmetric invertible matrices P_e and Q_e satisfying:

(a) P_e − A_e P_e A_e⊤ = β_e β_e⊤
(b) Q_e − A_e⊤ Q_e A_e = α_e α_e⊤
(c) P_e Q_e = σ_k² I

We can now derive the parameters of the WFA Â = ⟨α̂, Â, β̂⟩.

Theorem 3.4. Let A = ⟨α, A, α⟩ be a minimal WFA with n states in its SVA form, and let φ(z) = α⊤(zI − A)^{−1}α be a symbol for its Hankel operator H. Let σ_k be a singular number of multiplicity r for H, with σ_0 ≥ ··· > σ_k = ··· = σ_{k+r−1} > σ_{k+r} ≥ ··· ≥ σ_{n−1} > 0. We can partition the Gramian matrices P, Q as follows:

P = Q = [ Σ  0
          0  σ_k I_r ],   (28)

where Σ ∈ R^{(n−r)×(n−r)} is the diagonal matrix containing the remaining singular numbers, and partition A and α conformally to the Gramians:

A = [ A_11   A_12        α = [ α_1
      A_12⊤  A_22 ],           α_2 ].   (29)

Let R = σ_k² I_{n−r} − Σ², and denote by (·)⁺ the Moore-Penrose pseudo-inverse. If the function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C is the best approximation of φ(z), then:

• If α_2 ≠ 0:

β̂ = −Â A_12 (α_2⊤)⁺
α̂ = Â⊤ R A_12 (α_2⊤)⁺
Â (A_11 − A_12 (α_2⊤)⁺ α_1⊤) = I   (30)

• If α_2 = 0:

β̂ = (I − Â A_11)(α_1⊤)⁺
α̂ = (Â⊤ R A_11 − R)(α_1⊤)⁺
Â A_12 = 0   (31)

Proof of Theorem 3.4. Since σ_k^{−1}e(z) = σ_k^{−1}(φ(z) − ψ(z)) is unimodular, from Theorem 3.3 there exist two symmetric nonsingular matrices P_e, Q_e satisfying the fixed point equations:

P_e − A_e P_e A_e⊤ = β_e β_e⊤   (32)
Q_e − A_e⊤ Q_e A_e = α_e α_e⊤   (33)

and such that P_e Q_e = σ_k² I. We can partition P_e and Q_e according to the definition of A_e (see Eq. 17):

P_e = [ P     P_12       Q_e = [ Q     Q_12
        P_12⊤ P_22 ],            Q_12⊤ Q_22 ].

Here P and Q correspond to the controllability and observability Gramians of A:

P = Q = [ Σ  0
          0  σ_k I_r ].

Moreover, since P_e Q_e = σ_k² I, we get P_12 Q_12⊤ = σ_k² I − P². It follows that P_12 Q_12⊤ has rank n − r.
Without loss of generality we can set dim Â = j = n − r, and choose an appropriate basis for the state space such that P_12 = (I 0)⊤ and Q_12 = (R 0)⊤, with R = σ_k² I − Σ². Once P_12 and Q_12 are fixed, the values of P_22 and Q_22 are automatically determined. We obtain:

P_e = [ Σ  0        I
        0  σ_k I_r  0
        I  0        −ΣR^{−1} ],   Q_e = [ Σ  0        R
                                          0  σ_k I_r  0
                                          R  0        −ΣR ].   (34)

Now that we have an expression for the matrices P_e and Q_e of Theorem 3.3, we can rewrite the fixed point equations to derive the parameters α̂, Â and β̂. We obtain the following systems:

{ P − APA⊤ = αα⊤                    { Q − A⊤QA = αα⊤
{ N − AN Â⊤ = αβ̂⊤                   { M − A⊤M Â = −αα̂⊤
{ −ΣR^{−1} + Â ΣR^{−1} Â⊤ = β̂β̂⊤     { −ΣR + Â⊤ ΣR Â = α̂α̂⊤   (35)

where N = (I 0)⊤ and M = (R 0)⊤. We can rewrite the second equation of each system as follows:

{ I − A_11 Â⊤ = α_1 β̂⊤       { R − A_11 R Â = −α_1 α̂⊤
{ −A_12⊤ Â⊤ = α_2 β̂⊤         { A_12⊤ R Â = α_2 α̂⊤   (36)

If α_2 ≠ 0, we have:

β̂ = −Â A_12 (α_2⊤)⁺
α̂ = Â⊤ R A_12 (α_2⊤)⁺
Â (A_11 − A_12 (α_2⊤)⁺ α_1⊤) = I   (37)

with (α_2⊤)⁺ = α_2 / (α_2⊤α_2).

If α_2 = 0, we have Â A_12 = 0. We remark that Â has size (n−r) × (n−r), while A_12 is (n−r) × r, so the system of equations corresponding to Â A_12 = 0 is underdetermined if r < n/2, in which case we can find an alternative set of solutions:

β̂ = (I − Â A_11)(α_1⊤)⁺
α̂ = (Â⊤ R A_11 − R)(α_1⊤)⁺
Â A_12 = 0   (38)

with Â ≠ 0. On the other hand, if r ≥ n/2, i.e. if the multiplicity of the singular number σ_k is at least half the size of the original WFA, the system might not have any solution unless Â = 0 (or unless A_12 was zero to begin with). In this setting the method proposed returns Â = 0. An alternative in this case is to search for an approximation of size k − 1 or k + 1, so that the multiplicity of the corresponding singular number satisfies r < n/2 and the system in Equation 38 is underdetermined.

3.3.3 Extraction of the Rational Component

The function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C found in Theorem 3.4 corresponds to the solution of Theorem 2.9.
To find the solution to the approximation problem we only need to "isolate" the function g(z) ∈ R_k, i.e. the rational component. To do this, we study the position of the poles of ψ(z), since the poles of a strictly proper rational function lie inside the unit disc (Theorem 2.6). As noted before, we parametrized ψ(z) so that its poles correspond to the eigenvalues of Â. After a change of basis (detailed in Paragraph 3.3.3.1), we can rewrite Â in block-diagonal form:

Â = [ Â₊  0
      0   Â₋ ],   (39)

where the modulus of the eigenvalues of Â₊ (resp. Â₋) is smaller (resp. greater) than one. We apply the same change of coordinates on α̂ and β̂.

To conclude the study of the eigenvalues of Â, we need the following auxiliary result from Ostrowski [32]. A proof of this theorem can be found in [41].

Theorem 3.5 ([32]). Let |Σ| = 1, and let P be a solution to the fixed point equation X − AXA⊤ = ββ⊤ for the WFA A = ⟨α, A, β⟩. If A is reachable, then:

• The number of eigenvalues λ of A such that |λ| < 1 is equal to the number of positive eigenvalues of P.

• The number of eigenvalues λ of A such that |λ| > 1 is equal to the number of negative eigenvalues of P.

We can finally find the rational component of the function ψ(z), i.e. the function g(z) from Corollary 1 necessary to solve the approximate minimization problem.

Theorem 3.6. Let Â₊, α̂₊, β̂₊ be as in Theorem 3.4. The rational component of ψ(z) is the function g(z) = α̂₊⊤(zI − Â₊)^{−1}β̂₊.

Proof. Clearly ψ(z) = g(z) + l(z), with l(z) = α̂₋⊤(zI − Â₋)^{−1}β̂₋ + C, l ∈ H^∞. To conclude the proof we need to show that g(z) has k poles inside the unit disc, and therefore has rank k. We do this by studying the position of the eigenvalues of Â₊. Since E is minimal, by definition Â is reachable, so we can use Theorem 3.5 and solve the problem by directly examining the eigenvalues of −ΣR^{−1}, the solution of the corresponding fixed point equation.
From the proof of Theorem 3.4 we have −Σ̄R⁻¹ = Σ̄(Σ̄² − σ_k²1)⁻¹, where Σ̄ is the diagonal matrix having as elements the singular numbers of H different from σ_k. It follows that −Σ̄R⁻¹ has exactly k strictly positive eigenvalues, and Â has k eigenvalues with modulus smaller than 1. Thus, Â₊ has k eigenvalues, corresponding to the poles of g(z).

3.3.3.1 Block-diagonalization of Â. In this paragraph we detail the technical steps necessary to rewrite Â in block-diagonal form. The problem of computing the Jordan form of a matrix is ill-conditioned, hence it is not suitable for our algorithmic purposes. Instead, we compute the Schur decomposition, i.e. we find an orthogonal matrix U such that U⊤ÂU is upper triangular, with the eigenvalues of Â on the diagonal. We obtain:

T = U⊤ Â U = ( Â₊  Â₁₂
               0   Â₋ )   (40)

where the eigenvalues are arranged in increasing order of modulus, and the modulus of those in Â₊ (resp. Â₋) is smaller (resp. greater) than one. To transform this upper triangular matrix into a block-diagonal one, we use the following result.

Theorem 3.7 ([37]). Let T be the matrix defined in Equation 40. The matrix X is a solution of the equation Â₊X − XÂ₋ + Â₁₂ = 0 if and only if the matrices

M = ( 1  X
      0  1 ) ,   M⁻¹ = ( 1  −X
                         0   1 )   (41)

satisfy:

M⁻¹ T M = ( Â₊  0
            0   Â₋ ) .   (42)

Setting Γ = (1_k 0), the k × (n − r) matrix that selects the first k coordinates, we can now derive the rational component of the WFA:

Â₊ = Γ M⁻¹ U⊤ Â U M Γ⊤   (43)
α̂₊ = Γ M⊤ U⊤ α̂   (44)
β̂₊ = Γ M⁻¹ U⊤ β̂   (45)

In the previous sections, we have derived the rational function g(z) corresponding to the symbol of G, the operator that solves Theorem 2.8. To find the solution to the approximation problem we only need to find the parameters of Â_k, the optimal approximating WFA. These are directly revealed by the expression of g(z), thanks to the way we parametrized the functions.

Theorem 3.8. Let A = ⟨α, A, α⟩ be a minimal WFA with n states over a one-letter alphabet. Let A be in its SVA form.
The optimal spectral-norm approximation of rank k is given by the WFA Â_k = ⟨α̂₊, Â₊, β̂₊⟩.

Proof. From Corollary 1 we know that g(z) is the rational function associated with the Hankel matrix of the best approximation. Given the correspondence between the Fourier coefficients of g(z) and the entries of the matrix (Remark 2), we have:

g(z) = α̂₊⊤ (z1 − Â₊)⁻¹ β̂₊ = Σ_{j≥0} α̂₊⊤ Â₊^j β̂₊ z^{−j−1} = Σ_{j≥0} f̄(j) z^{−j−1}   (46)

where f̄ : Σ* → R is the function computed by Â_k and α̂₊, Â₊, β̂₊ are its parameters.

The theoretical foundations of AAK theory guarantee that the construction detailed in Section 3.3 produces the rank k optimal spectral-norm approximation of a WFA satisfying our assumptions, and the singular number σ_k provides the exact error. Similarly to the case of SVA truncation [11], due to the ordering of the singular numbers, the error decreases when k increases, meaning that allowing Â_k to have more states guarantees a better approximation of A. We remark that, while the solution we propose is optimal in the spectral norm, the same is not necessarily true in other norms. Nonetheless, we have the following bound between ℓ²-norm and spectral-norm (proof in Appendix B).

Algorithm 1: AAKapproximation
input : A minimal WFA A, with α₂ ≠ 0, n states and in SVA form, its Gramian P, a target number of states k < n
output: A WFA Â_k with k states
1: Let α₁, α₂, A₁₁, A₁₂, A₂₂, Σ̄ be the blocks defined in Eq. 28
2: Let (α₂⊤)⁺ = α₂ (α₂⊤ α₂)⁻¹
3: Let R = σ_k²1 − Σ̄²
4: Let Â = (A₁₁ − A₁₂ (α₂⊤)⁺ α₁⊤)⁻¹
5: Let α̂ = −Â⊤ R A₁₂ (α₂⊤)⁺
6: Let β̂ = −Â A₁₂ (α₂⊤)⁺
7: Let Â = ⟨α̂, Â, β̂⟩
8: Let Â_k ← BlockDiagonalize(Â)
9: return Â_k

Theorem 3.9. Let A be a minimal WFA computing f : Σ* → R, with Hankel matrix H. Let Â_k be its optimal spectral-norm approximation, computing f̄ : Σ* → R, with Hankel matrix G. Then:

‖f − f̄‖_{ℓ²} ≤ ‖H − G‖ = σ_k.   (47)

Proof.
Let e₁ = (1 0 0 · · ·)⊤, and let f, f̄ : Σ* → R be the functions computed by A and Â_k, with Hankel matrices H and G, respectively. We have:

‖f − f̄‖_{ℓ²} = ( Σ_{n≥0} |f(n) − f̄(n)|² )^{1/2} = ‖(H − G) e₁‖_{ℓ²} ≤ sup_{‖x‖_{ℓ²}=1} ‖(H − G) x‖_{ℓ²} = ‖H − G‖ = σ_k

where the second equality follows by definition and by observing that the matrix difference is computed entry-wise.

In this section we describe the algorithm for spectral-norm approximate minimization. The algorithm takes as input a target number of states k < n, a minimal WFA A with ρ(A) < 1, α₂ ≠ 0, n states and in SVA form, and its Gramian P. Note that, in the case of α₂ = 0, it is enough to substitute Steps 4, 5 and 6 of Algorithm 1 with the equations in (38). The assumptions on A to be minimal and in SVA form are not essential. In fact a WFA with n states can be minimized in time O(n³) [14], and the SVA computed in O(n³) [11]. Using the results of Theorem 3.4, we outline in Algorithm 1, AAKapproximation, the steps necessary to extract the best spectral-norm approximation of a WFA. The algorithm involves a call to Algorithm 2, BlockDiagonalize. In particular, this corresponds to the steps, outlined in Paragraph 3.3.3.1, necessary to derive the WFA Â_k corresponding to the rational function g(z). We remark that Step 2 in BlockDiagonalize can be performed using the Bartels-Stewart algorithm [13]. To compute the computational cost we recall the following facts [39]:
• The product of two n × n matrices can be computed in time O(n³) using a standard iterative algorithm, but can be reduced to O(n^ω) with ω < 2.4.

Algorithm 2: BlockDiagonalize
input : A WFA Â
output: A WFA Â_k with ρ(Â_k) < 1
1: Compute the Schur decomposition Â = U T U⊤, with the eigenvalues on the diagonal of T arranged so that |T₁₁| ≤ |T₂₂| ≤ · · ·
2: Solve Â₊X − XÂ₋ + Â₁₂ = 0 for X
3: Let M = ( 1  X
             0  1 ) and M⁻¹ = ( 1  −X
                                0   1 )
4: Let Γ = (1_k 0)
5: Let Â₊ = Γ M⁻¹ U⊤ Â U M Γ⊤
6: Let α̂₊ = Γ M⊤ U⊤ α̂
7: Let β̂₊ = Γ M⁻¹ U⊤ β̂
8: Let Â_k = ⟨α̂₊, Â₊, β̂₊⟩
9: return Â_k

• The inversion of an n × n matrix can be computed in time O(n³) using Gauss-Jordan elimination, but can be reduced to O(n^ω) with ω < 2.4.
• The computation of the Schur decomposition of an n × n matrix can be done with a two-step algorithm, where each step takes O(n³), using the Hessenberg form of the matrix.
• The Bartels-Stewart algorithm applied to upper triangular matrices to find an m × n matrix takes O(mn² + nm²).

The running time of BlockDiagonalize with input a WFA Â with (n − r) states is thus in O((n − r)³), where r is the multiplicity of the singular value considered. The running time of AAKapproximation for an input WFA A with n states is in O(n³).

The study of approximate minimization for WFAs is very recent, and only a few works have been published on the subject. In [10, 11] the authors present an approximate minimization technique using a canonical expression for WFAs, and provide bounds on the error in the ℓ² norm. The result is supported by strong theoretical guarantees, but it is not optimal in any norm. An extension of this method to the case of Weighted Tree Automata can be found in [12]. A similar problem is addressed in [25], with less general results. In [26], the authors connect spectral learning to the approximate minimization problem of a small class of Hidden Markov models, bounding the error in terms of the total variation distance. The control theory community has largely studied approximate minimization in the context of linear time-invariant systems, and several methods have been proposed [3].
A parallel can be drawn between those results and ours, by noting that the impulse response of a discrete time-invariant Single-Input-Single-Output (SISO) system can be parametrized as a WFA over a one-letter alphabet. In [20] Glover presents a state-space solution for the case of continuous Multi-Input-Multi-Output (MIMO) systems. Glover's method led to a widespread application of these results, thanks to its computational and theoretical simplicity. This stems from the structure of the Lyapunov equations for continuous systems. It is however not the case for discrete control systems, where the Lyapunov equations have a quadratic form. As noted in [16], there is not a simple closed form formula for the state space solution of a discrete system. Thus, most of the results for the discrete case work with a suboptimal version of the problem, with restrictions on the multiplicity of the singular values [6, 2, 23]. A solution for the SISO case can be found, without additional assumptions, using a polynomial approach, but it does not provide an explicit representation of the state space, nor does it generalize to the MIMO setting. The first to actually extend Glover's results to the discrete case is Gu, who provides an elegant solution for the MIMO discrete problem [21]. Glover's and Gu's solutions rely on embedding the initial system into an extension of it, the all-pass system, equivalent to the WFA E in our method. Part of our contribution is the adaptation of some of the control theory tools to our setting.

In this paper we applied the AAK theory for Hankel operators and complex functions to the framework of WFAs in order to construct the best possible approximation to an automaton given a bound on the size. We provide an algorithm to find the parameters of the best WFA approximation in the spectral norm, and bounds on the error. Our method applies to WFAs A = ⟨α, A, β⟩, defined over a one-letter alphabet, with ρ(A) < 1.
While this setting is certainly restricted, we believe that this work constitutes a first fundamental step in the direction of optimal approximation. Furthermore, the use of AAK techniques has proven to be very fruitful in related areas like control theory; we think that automata theory can also benefit from it. The use of such methods can help deepen the understanding of the behaviour of rational functions. This paper highlights and strengthens the interesting connections between functional analysis, automata theory and control theory, unifying tools from different domains in one formalism.

A compelling direction for future work is to extend our results to the multi-letter case. The work of Adamyan, Arov and Krein provides us with a powerful theory connecting sequences to the study of complex functions. We note that, unfortunately, this approach cannot be directly generalized to the multi-letter case because of the non-commutative nature of the monoid considered. Extending this work would require tools from harmonic analysis that are not available for non-abelian structures. A recent line of work in functional analysis is centered around the extension of operator theory to the non-commutative case, and in [34] a non-commutative version of the AAK theorem is presented. However, those results are non-constructive, making this direction, already challenging, even harder to pursue.

Acknowledgments

This research has been supported by NSERC Canada (C. Lacroce, P. Panangaden, D. Precup) and the Canada CIFAR AI chairs program (G. Rabusseau). The authors would like to thank Tianyu Li, Harsh Satija and Alessandro Sordoni for feedback on earlier drafts of this work, Gheorghe Comanici for a detailed review, and Maxime Wabartha for fruitful discussions and comments on the proofs.

References

[1] Vadim M. Adamyan, Damir Zyamovich Arov, and Mark Grigorievich Krein. Analytic properties of Schmidt pairs for a Hankel operator and the generalized Schur-Takagi problem.
Mathematics of the USSR-Sbornik, 15:31-73, 1971.
[2] M.M. Al-Hussari, I.M. Jaimoukha, and D.J.N. Limebeer. A descriptor approach for the solution of the one-block distance problem. In Proceedings of the IFAC World Congress, 1993.
[3] Athanasios C. Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
[4] Stephane Ayache, Remy Eyraud, and Noe Goudian. Explaining black boxes on sequential data using weighted automata. In Proceedings of the 14th International Conference on Grammatical Inference, volume 93 of Proceedings of Machine Learning Research, pages 81-103, 2018.
[5] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 33-40, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553379.
[6] Joseph A. Ball and Andre C.M. Ran. Optimal Hankel norm model reductions and Wiener-Hopf factorization I: The canonical case. SIAM Journal on Control and Optimization, 25(2):362-382, 1987.
[7] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 96(1-2):33-63, 2014.
[8] Borja Balle, Pascale Gourdeau, and Prakash Panangaden. Bisimulation metrics for weighted finite automata. In Proceedings of the 44th International Colloquium on Automata, Languages and Programming, Warsaw, pages 103:1-14, 2017.
[9] Borja Balle, William Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In International Conference on Machine Learning, pages 1386-1394. PMLR, 2014.
[10] Borja Balle, Prakash Panangaden, and Doina Precup. A canonical form for weighted automata and applications to approximate minimization.
In Proceedings of the Thirtieth Annual ACM/IEEE Symposium on Logic in Computer Science, July 2015.
[11] Borja Balle, Prakash Panangaden, and Doina Precup. Singular value automata and approximate minimization. Math. Struct. Comput. Sci., 29(9):1444-1478, 2019. doi:10.1017/S0960129519000094.
[12] Borja Balle and Guillaume Rabusseau. Approximate minimization of weighted tree automata. Information and Computation, page 104654, 2020.
[13] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM, 15(9):820-826, 1972.
[14] Jean Berstel and Christophe Reutenauer. Noncommutative rational series with applications, volume 137. Cambridge University Press, 2011.
[15] J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26-40, 1971.
[16] Charles K. Chui and Guanrong Chen. Discrete H∞ Optimization With Applications in Signal Processing and Control Systems. Springer-Verlag, 1997.
[17] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218, 1936. doi:10.1007/BF02288367.
[18] Remi Eyraud and Stephane Ayache. Distillation of weighted automata from recurrent neural networks using a spectral approach. arXiv preprint arXiv:2009.13101, 2020.
[19] M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197-222, 1974.
[20] Keith Glover. All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. International Journal of Control, 39(6):1115-1193, 1984. doi:10.1080/00207178408933239.
[21] Guoxiang Gu. All optimal Hankel-norm approximations and their error bounds in discrete-time. International Journal of Control, 78(6):408-423, 2005. doi:10.1080/00207170500110988.
[22] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460-1480, September 2012.
doi:10.1016/j.jcss.2011.12.025.
[23] Vlad Ionescu and Cristian Oara. The four-block Adamjan-Arov-Krein problem for discrete-time systems. In Linear Algebra and its Applications, pages 95-119. Elsevier, 2001.
[24] L. Kronecker. Zur Theorie der Elimination einer Variablen aus zwei algebraischen Gleichungen. Monatsber. Königl. Preussischen Akad. Wiss., pages 535-600, 1881.
[25] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-Rank Spectral Learning with Weighted Loss Functions. In Artificial Intelligence and Statistics, pages 517-525. PMLR, 2015.
[26] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-Rank Spectral Learning. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 522-530, Reykjavik, Iceland, 22-25 April 2014. PMLR.
[27] A. M. Lyapunov. The General Problem of the Stability of Motion [in Russian]. Gostekhizdat, Moscow, 1950.
[28] Jean Meinguet. A Simplified Presentation of the Adamjan-Arov-Krein Approximation Theory. In H. Werner, L. Wuytack, E. Ng, and H. J. Bünger, editors, Computational Aspects of Complex Analysis, pages 217-248, Dordrecht, 1983. Springer Netherlands. doi:10.1007/978-94-009-7121-9_9.
[29] Zeev Nehari. On Bounded Bilinear Forms. Annals of Mathematics, 65(1):153-162, 1957.
[30] Nikolai K. Nikol'skii. Operators, Functions and Systems: An Easy Reading, volume 92 of Mathematical Surveys and Monographs. American Mathematical Society, 2002.
[31] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5306-5314, 2020.
[32] Alexander Ostrowski and Hans Schneider. Some Theorems on the Inertia of General Matrices. J. Math. Anal. Appl., 4(1):72-84, 1962.
[33] Vladimir Peller. Hankel Operators and their Applications.
Springer Science & Business Media, 2012.
[34] Gelu Popescu. Multivariable Nehari Problem and Interpolation. Journal of Functional Analysis, 200:536-581, 2003. doi:10.1016/S0022-1236(03)00078-8.
[35] Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask Spectral Learning of Weighted Automata. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2585-2594, 2017.
[36] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning. In Proceedings of AISTATS, pages 1630-1639, 2019.
[37] William E. Roth. The equations AX − YB = C and AX − XB = C in matrices. Proceedings of the American Mathematical Society, 3(3):392-396, 1952.
[38] B. De Schutter. Minimal State-Space Realization in Linear System Theory: an Overview. Journal of Computational and Applied Mathematics, 121(1):331-354, 2000. doi:10.1016/S0377-0427(00)00341-1.
[39] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.
[40] Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning Deterministic Weighted Automata with Queries and Counterexamples. In Advances in Neural Information Processing Systems, pages 8560-8571, 2019.
[41] H.K. Wimmer. On the Ostrowski-Schneider Inertia Theorem. Journal of Mathematical Analysis and Applications, 41(1):164-169, 1973. doi:10.1016/0022-247X(73)90190-X.
[42] Kehe Zhu. Operator Theory in Function Spaces, volume 138. American Mathematical Society, 1990.

A Hankel Operators

For more details on the content of this section we refer the reader to [30]. We recall the first definition of Hankel operator.

Definition A.1. A Hankel operator is a mapping H : ℓ² → ℓ² with matrix H = {α_{j+k}}_{j,k≥0}. In other words, given a = {a_n}_{n≥0} ∈ ℓ², we have H(a) = b, where b = {b_n}_{n≥0} is defined by:

b_k = Σ_{j≥0} α_{j+k} a_j,   k ≥ 0.   (48)

This property of the Hankel matrix can be rephrased as an operator identity.
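Definition A.1 can be illustrated on a finite truncation. The sequence α_n = 2^{−n} below is an assumption chosen for the example; S and S* are the matrix truncations of the right shift and its left inverse, for which the Hankel structure gives HS = S*H away from the truncation boundary:

```python
import numpy as np

# Truncated illustration of Definition A.1; alpha_n = 2^{-n} is an assumed example.
N = 50
alpha = 0.5 ** np.arange(2 * N)
H = np.array([[alpha[j + k] for k in range(N)] for j in range(N)])  # H[j,k] = alpha_{j+k}

# Action on a in ell^2: (Ha)_k = sum_j alpha_{j+k} a_j  (Equation 48)
a = 0.3 ** np.arange(N)
b = H @ a
b0 = sum(alpha[j] * a[j] for j in range(N))   # direct evaluation for k = 0
print(np.isclose(b[0], b0))                   # True

# The identity HS = S*H, with S the right shift and S* its left inverse,
# holds exactly away from the truncation boundary:
S = np.eye(N, k=-1)       # S x = (0, x_0, x_1, ...)
S_star = np.eye(N, k=1)   # S* y = (y_1, y_2, ...)
print(np.allclose((H @ S)[:N-1, :N-1], (S_star @ H)[:N-1, :N-1]))  # True
```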
Defining the shift operator by S(x₀, x₁, . . .) = (0, x₀, x₁, . . .) and denoting its left inverse by S*(y₀, y₁, . . .) = (y₁, y₂, . . .), we have that H is a Hankel operator if and only if:

H S = S* H   (49)

The correspondence between Definition A.1 and Definition 2.7 can easily be made explicit. First, we note that using the isomorphism φ ↦ (φ̂(n))_{n∈Z}, introduced with the Fourier series, we can identify H² with ℓ². Moreover, the operator of multiplication by z acts as the right shift S on the space of Fourier coefficients, in fact (zφ)^(n) = φ̂(n − 1), while S* corresponds to the truncated multiplication operator. Now, let B₊ = {z^k}_{k≥0} and B₋ = {z^j}_{j<0} be bases for H² and H²₋, respectively, and let

I z^n = z^{−n−1}   (50)

be the involution on L²(T). Note that I H² = H²₋. Let H : ℓ² → ℓ² be a Hankel operator with matrix H. Using the bases B₊, B₋ and the Fourier identification map, we can obtain an operator acting between Hardy spaces. Following this interpretation, the operator H̃ = IH : H² → H²₋ has matrix H with respect to B₊, B₋, and satisfies:

H̃ S = P₋ S H̃.   (51)

In particular, HS = S*H if and only if H̃S = P₋SH̃. It is now easy to see that the characterization of the Hankel operator given in Definition 2.7 satisfies Equation 51. The following theorem, due to Nehari [29], is of great importance as it highlights a correspondence between bounded Hankel operators and functions in L∞(T).

Theorem A.1 ([29]). A Hankel operator H : ℓ² → ℓ² with matrix H = {α_{j+k}}_{j,k≥0} is bounded on ℓ² if and only if there exists a function ψ ∈ L∞(T) such that

α_m = ψ̂(m),   m ≥ 0.   (52)

In this case:

‖H‖ = inf{ ‖ψ‖_∞ : ψ̂(m) = α_m, m ≥ 0 },   (53)

where ψ̂(n) is the n-th Fourier coefficient of ψ. We can now reformulate the theorem using the characterization of Hankel operators in Hardy spaces.

Theorem A.2 ([29]). Let φ ∈ L²(T) be a symbol of the Hankel operator on Hardy spaces H_φ : H² → H²₋.
The following are equivalent:
(1) H_φ is bounded on H²;
(2) there exists ψ ∈ L∞(T) such that ψ̂(m) = φ̂(m) for all m < 0.
If the conditions above are satisfied, then:

‖H_φ‖ = inf{ ‖ψ‖_∞ : ψ̂(m) = φ̂(m), m < 0 },   (54)

or equivalently:

‖H_φ‖ = inf_{f(z)∈H∞} ‖φ(z) − f(z)‖_∞.   (55)

Nehari's Theorem is at the core of the proof of Corollary 1.

Proof of Corollary 1. Let H_φ be a Hankel operator with symbol φ(z) ∈ L∞(T) and matrix H. Let ψ(z) = g(z) + l(z) ∈ H∞_k be the solution of Equation 13. We have:

‖H_φ − H_ψ‖ = ‖H_{φ−ψ}‖   (56)
= ‖H_{σ_k η_k⁻(z)/ξ_k⁺(z)}‖   (57)
≤ σ_k ‖η_k⁻(z)/ξ_k⁺(z)‖_∞ = σ_k   (58)

where first we used Corollary 2 and then Equation 55. Now, using the definition of Hankel operator, we have:

‖H_φ − H_ψ‖ = ‖H_φ − H_g‖ = ‖H − G‖ ≤ σ_k.   (59)

Since ‖H − G‖ ≥ σ_k (from the Eckart-Young theorem [17]), it follows that ‖H − G‖ = σ_k. Note that G has rank k, as required, because g ∈ R_k (Theorem 2.6).

B Proofs from Section 3

Proof of Theorem 3.3. In order to prove Theorem 3.3 we need an auxiliary lemma. These are the analogues of a control theory result, rephrased in terms of WFAs. The original theorem and lemma, together with the corresponding proofs, can be found in [16]. Hence, we only provide a sketch of the proofs.

Lemma B.1 ([16]). Let E = ⟨α_e, A_e, β_e⟩ be a minimal WFA, and let e(z) = α_e⊤(z1 − A_e)⁻¹β_e − C. If σ_k⁻¹e(z) is unimodular, then there exists a unique invertible symmetric matrix T satisfying:
(a) A_e⊤ T β_e = α_e C
(b) σ_k² α_e⊤ T⁻¹ A_e⊤ = C β_e⊤
(c) A_e⊤ T A_e − C⁻¹ A_e⊤ T β_e α_e⊤ = T

Proof. Since σ_k⁻¹e(z) is unimodular, we have that:

e(z) e*(z̄⁻¹) = σ_k²   (60)

where we denote with e* the adjoint function. From the equation above, we obtain:

e*(z̄⁻¹) = σ_k² e⁻¹(z) = σ_k² (−C + α_e⊤(z1 − A_e)⁻¹β_e)⁻¹   (61)
= −σ_k² C⁻¹ − σ_k² C⁻¹ α_e⊤ (z1 − (A_e + C⁻¹ β_e α_e⊤))⁻¹ β_e C⁻¹   (62)

where we used the matrix inversion lemma.
On the other hand we have:

e*(z̄⁻¹) = −C + β_e⊤ (z⁻¹1 − A_e⊤)⁻¹ α_e   (63)
= −C + β_e⊤ (−A_e⁻⊤ (1 − z A_e⊤) + A_e⁻⊤)(1 − z A_e⊤)⁻¹ α_e   (64)
= −(C + β_e⊤ A_e⁻⊤ α_e) − β_e⊤ A_e⁻⊤ (z1 − A_e⁻⊤)⁻¹ A_e⁻⊤ α_e   (65)

where we used again the matrix inversion lemma before grouping the terms. If the quantities in Equation 62 and Equation 65 have to be equal, we need their constant terms to be the same. Then we want the H∞₋-components to correspond, so we consider the corresponding Hankel matrices. It is easy to see that we can once again associate the coefficients of these complex functions to the parameters of a WFA. From the minimality of E we obtain:

σ_k² C⁻¹ α_e⊤ = β_e⊤ A_e⁻⊤ T
A_e + C⁻¹ β_e α_e⊤ = T⁻¹ A_e⁻⊤ T
β_e C⁻¹ = T⁻¹ A_e⁻⊤ α_e   (66)

where T is an invertible matrix [7]. This system is equivalent to:

σ_k² α_e⊤ T⁻¹ A_e⊤ = C β_e⊤
A_e⊤ T A_e − C⁻¹ A_e⊤ T β_e α_e⊤ = T
A_e⊤ T β_e = α_e C   (67)

To conclude the proof it remains to check that T is symmetric, and this can be verified by direct computation.

Proof of Theorem 3.3. This proof follows easily from Lemma B.1 by setting P_e = −σ_k² T⁻¹ and Q_e = −T. We obtain point (c) by direct multiplication. Then, we substitute the last equation in 67 into the second one, and we obtain:

A_e⊤ T A_e − α_e α_e⊤ = T   (68)

which verifies point (b) with Q_e = −T. Point (a) can be obtained analogously combining the first and second equations in 67.

C Possible Extensions

C.1 Relaxing the Spectral Radius Assumption

It is possible to extend part of our method to WFAs over a one-letter alphabet with ρ(A) = 1, but the approximation recovered is not optimal in the spectral norm. Let A = ⟨α, A, α⟩, with ρ(A) = 1, be a WFA with n states that we want to minimize. The idea is to block-diagonalize A like we did in Section 3.3.3, and tackle each component separately. The case of A₊ = ⟨α₊, A₊, α₊⟩, the component having ρ(A₊) < 1, can be dealt with in the way presented in the previous sections.
This means that we can find an optimal spectral-norm approximation of the desired size for A₊. Now we can consider the second component, A₋ = ⟨α₋, A₋, α₋⟩. The key idea is to apply the transformation z^j ↦ z^{−j}, j ≥ 0, to the function φ″(z). Then, the function

φ″(z⁻¹) = Σ_{k≥0} α₋⊤ A₋^k α₋ z^k = α₋⊤ (1 − z A₋)⁻¹ α₋   (69)

is well defined, as the series converges for z with small enough modulus. Using this transformation we obtain a function with poles inside the unit disc and we can apply the method presented in the paper. An important choice to make is the size of the approximation of A₋, as it can influence the quality of the approximation. Analyzing the effects of this parameter on the approximation error constitutes an interesting direction for future work, on both the theoretical and experimental side. Some theoretical work has been done in the control theory literature to study an analogous approach for continuous time systems and its approximation error [20].

C.2 Polynomial method

We remark that Equation 14 from Corollary 2 can be rewritten as

ψ(z) = φ(z) − (Hξ_k⁺)(z) / ξ_k⁺(z),   (70)

where ξ_k⁺(z) is the function in H² associated to the vector ξ_k ∈ Ker(H*H − σ_k²1) (and ψ(z) does not depend on the choice of the specific ξ_k). There is an alternative way to find the best approximation, particularly useful when the objective is to approximate a finite-rank infinite Hankel matrix with another Hankel matrix, without necessarily extracting a WFA. We can consider the adjoint operator H* and its matrix H*. The singular numbers and singular vectors of H correspond to the eigenvalues and eigenvectors of R = (H*H)^{1/2}. Hence, it is possible to compute σ_k and a corresponding singular vector ξ_k. The function ξ_k⁺(z) is then obtained following Equation 9.
The Hankel matrix G that best approximates H is given by G = H − M, where M is the Hankel matrix having as symbol the function (Hξ_k⁺)(z) / ξ_k⁺(z).
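To make the role of σ_k concrete, the following sketch (entirely illustrative; the truncation level and the example WFA are our own assumptions) computes the singular numbers of a truncated finite-rank Hankel matrix, the quantities on which both the WFA construction and the polynomial method are built:

```python
import numpy as np

# f is computed by an assumed 2-state WFA with rho(A) < 1, so its Hankel
# matrix has rank 2; a large truncation approximates the infinite operator.
A = np.diag([0.6, -0.3])
alpha = np.array([1.0, 0.5])
N = 200
f = np.array([alpha @ np.linalg.matrix_power(A, n) @ alpha for n in range(2 * N)])
H = np.array([[f[j + k] for k in range(N)] for j in range(N)])

sigma = np.linalg.svd(H, compute_uv=False)
print(sigma[0] > sigma[1] > 1e-6 > sigma[2])  # True: rank 2, then numerically zero
# Eckart-Young gives sigma[1] as a lower bound on the error of any rank-1
# approximation of H; AAK theory states that the bound is attained by a
# matrix G = H - M that is itself Hankel.
```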