Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata
Borja Balle, Clara Lacroce, Prakash Panangaden, Doina Precup, Guillaume Rabusseau
arXiv preprint [cs.FL]
DeepMind, London, United Kingdom
School of Computer Science, McGill University, Montréal, Canada
Mila, Montréal, Canada
DIRO, Université de Montréal, Montréal, Canada
Abstract
We address the approximate minimization problem for weighted finite automata (WFAs) over a one-letter alphabet: to compute the best possible approximation of a WFA given a bound on the number of states. This work is grounded in Adamyan-Arov-Krein (AAK) approximation theory, a remarkable collection of results on the approximation of Hankel operators. In addition to its intrinsic mathematical relevance, this theory has proven to be very effective for model reduction. We adapt these results to the framework of weighted automata over a one-letter alphabet. We provide theoretical guarantees and bounds on the quality of the approximation in the spectral and ℓ² norm. We develop an algorithm that, based on the properties of Hankel operators, returns the optimal approximation in the spectral norm.

1 Introduction

Weighted finite automata (WFAs) are an expressive class of models representing functions defined over sequences. The approximate minimization problem is concerned with finding an automaton that approximates the behaviour of a given minimal WFA, while being smaller in size. This second automaton recognizes a different language, and the objective is to minimize the approximation error [10, 11]. Approximate minimization becomes particularly useful in the context of spectral learning algorithms [5, 7, 9, 22]. When applied to a learning task, such algorithms can be viewed as working in two steps. First, they compute a minimal WFA that explains the training data exactly. Then, they obtain a model that generalizes to the unseen data by producing a smaller approximation to the minimal WFA. It is not just a question of saving space by having a smaller state space: the exact machine will overfit the data and generalize poorly. To obtain accurate results it is crucial to guess correctly the size of the minimal WFA, in particular when the data is generated by this type of machine.

The minimization task is greatly shaped by the way we decide to measure the approximation error.
It is thus natural to wonder if there are norms that are preferable to others. We believe that the chosen norm should be computationally reasonable to minimize. For instance, the distance between WFAs can be computed using a metric based on bisimulation [8]. While exploring this approach could still be interesting, the fact that this metric is hard to compute makes it unsuitable for our purposes.

Our main contributions are the following:

• We use AAK theory to study the approximate minimization problem of WFAs. To connect those areas, we establish a correspondence between the parameters of a WFA and the coefficients of a complex function on the unit circle. To the best of our knowledge, this paper represents the first attempt to apply AAK theory to WFAs.

• We present a theoretical analysis of the optimal spectral-norm approximate minimization problem for WFAs, based on their connection with finite-rank infinite Hankel matrices. We provide a closed form solution for a class of weighted automata over a one-letter alphabet, and bounds on the approximation error, both in terms of the Hankel matrix (spectral norm) and of the rational function computed by the WFA (ℓ² norm).

• We propose a self-contained algorithm that returns the unique optimal spectral-norm approximation of a given size.

• We tighten the connection, made in [11], between WFAs and discrete dynamical systems, by adapting some of the control theory concepts, like the allpass system [20], to the formalism of WFAs.

∗ Corresponding author: [email protected]
† The names of the authors appear in alphabetical order.
We denote with N, Z and R the sets of natural, integer and real numbers, respectively. We use bold letters for vectors and matrices; all vectors considered are column vectors. We denote with I the identity matrix, specifying its dimension only when not clear from the context. We refer to the i-th row and the j-th column of M by M(i, :) and M(:, j). Given a matrix M ∈ R^{p×q} of rank n, a rank factorization is a factorization M = PQ, where P ∈ R^{p×n}, Q ∈ R^{n×q} and rank(P) = rank(Q) = n. Let M ∈ R^{p×q} be of rank n; the compact singular value decomposition (SVD) of M is the factorization M = UDV⊤, where U ∈ R^{p×n}, D ∈ R^{n×n}, V ∈ R^{q×n} are such that U⊤U = V⊤V = I, and D is a diagonal matrix. The columns of U and V are called left and right singular vectors, while the entries of D are the singular values. The Moore-Penrose pseudo-inverse M⁺ of M is the unique matrix such that MM⁺M = M, M⁺MM⁺ = M⁺, with M⁺M and MM⁺ Hermitian [42]. The spectral radius ρ(M) of a matrix M is the largest modulus among its eigenvalues.

A Hilbert space is a complete normed vector space where the norm arises from an inner product. A linear operator T : X → Y between Hilbert spaces is bounded if it has finite operator norm, i.e. ‖T‖_op = sup_{‖g‖_X ≤ 1} ‖Tg‖_Y < ∞. We denote by T the (infinite) matrix associated with T by some (canonical) orthonormal basis on the Hilbert space. An operator is compact if the image of the unit ball in X is relatively compact. Given Hilbert spaces X, Y and a compact operator T : X → Y, we denote its adjoint by T*. The singular numbers {σ_n}_{n≥0} of T are the square roots of the non-negative eigenvalues of the self-adjoint operator T*T, arranged in decreasing order. A σ-Schmidt pair {ξ, η} for T is a couple of norm-1 vectors such that: Tξ = ση and T*η = σξ.
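These notions can be illustrated concretely. The following short sketch (Python with numpy, a choice made here purely for illustration) checks the rank factorization, the compact SVD, and the Moore-Penrose identities on a small random low-rank matrix:

```python
import numpy as np

# A rank-2 matrix M in R^{4x3}, built as a rank factorization M = PQ.
rng = np.random.default_rng(0)
P = rng.standard_normal((4, 2))
Q = rng.standard_normal((2, 3))
M = P @ Q

# Compact SVD: keep only the n = rank(M) nonzero singular values.
U, d, Vt = np.linalg.svd(M, full_matrices=False)
n = np.linalg.matrix_rank(M)
U, D, V = U[:, :n], np.diag(d[:n]), Vt[:n, :].T
assert np.allclose(M, U @ D @ V.T)        # M = U D V^T
assert np.allclose(U.T @ U, np.eye(n))    # orthonormal left singular vectors
assert np.allclose(V.T @ V, np.eye(n))    # orthonormal right singular vectors

# Moore-Penrose pseudo-inverse and its defining identities.
Mp = np.linalg.pinv(M)
assert np.allclose(M @ Mp @ M, M)
assert np.allclose(Mp @ M @ Mp, Mp)
```

The same decompositions, applied to (truncations of) infinite Hankel matrices, are the computational core of everything that follows.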
The Hilbert-Schmidt decomposition provides a generalization of the compact SVD for the infinite matrix of a compact operator T using singular numbers and orthonormal Schmidt pairs: Tx = Σ_{n≥0} σ_n⟨x, ξ_n⟩η_n [42]. The spectral norm ‖T‖ of the matrix representing the operator T is the largest singular number of T. Note that the spectral norm of T corresponds to the operator norm of T.

Let ℓ² be the Hilbert space of square-summable sequences over Σ*, with norm ‖f‖² = Σ_{x∈Σ*} |f(x)|² and inner product ⟨f, g⟩ = Σ_{x∈Σ*} f(x)g(x) for f, g ∈ R^{Σ*}. Let T = {z ∈ C : |z| = 1} be the complex unit circle and D = {z ∈ C : |z| < 1} the (open) complex unit disc. For 1 < p < ∞, let L^p(T) be the space of measurable functions on T for which the p-th power of the absolute value is Lebesgue integrable. For p = ∞, we denote with L^∞(T) the space of measurable functions that are bounded, with norm ‖f‖_∞ = sup{|f(x)| : x ∈ T}.

Let Σ be a fixed finite alphabet and Σ* the set of all finite strings with symbols in Σ. We use ε to denote the empty string. Given p, s ∈ Σ*, we denote with ps their concatenation.

A weighted finite automaton (WFA) of n states over Σ is a tuple A = ⟨α, {A_a}, β⟩, where α, β ∈ R^n are the vectors of initial and final weights, respectively, and A_a ∈ R^{n×n} is the matrix containing the transition weights associated with each symbol a ∈ Σ. Every WFA A realizes a function f_A : Σ* → R, i.e. given a string x = x_1 ··· x_t ∈ Σ*, it returns f_A(x) = α⊤A_{x_1} ··· A_{x_t}β = α⊤A_xβ. A function f : Σ* → R is called rational if there exists a WFA A that realizes it. The rank of the function is the size of the smallest WFA realizing f. Given f : Σ* → R, we can consider a matrix H_f ∈ R^{Σ*×Σ*} having rows and columns indexed by strings and defined by H_f(p, s) = f(ps) for p, s ∈ Σ*.

Definition 2.1.
A (bi-infinite) matrix H ∈ R^{Σ*×Σ*} is Hankel if for all p, p′, s, s′ ∈ Σ* such that ps = p′s′, we have H(p, s) = H(p′, s′). Given a Hankel matrix H ∈ R^{Σ*×Σ*}, there exists a unique function f : Σ* → R such that H_f = H.

Theorem 2.1 ([15, 19]). A function f : Σ* → R can be realized by a WFA if and only if H_f has finite rank n. In that case, n is the minimal number of states of any WFA A such that f = f_A.

Given a WFA A = ⟨α, {A_a}, β⟩, the forward matrix of A is the infinite matrix F_A ∈ R^{Σ*×n} given by F_A(p, :) = α⊤A_p for any p ∈ Σ*, while the backward matrix of A is B_A ∈ R^{Σ*×n}, given by B_A(s, :) = (A_sβ)⊤ for any s ∈ Σ*. Let H_f be the Hankel matrix of f; its forward-backward (FB) factorization is H_f = FB⊤. A WFA with n states is reachable if rank(F_A) = n, while it is observable if rank(B_A) = n. A WFA is minimal if it is reachable and observable. If A is minimal, the FB factorization is a rank factorization [7].

We recall the definition of the singular value automaton, a canonical form for WFAs [10].

Definition 2.2. Let f be a rational function and suppose H_f admits an SVD, H_f = UDV⊤. A singular value automaton (SVA) for f is the minimal WFA A realizing f such that F_A = UD^{1/2} and B_A = VD^{1/2}.

The SVA can be computed with an efficient algorithm relying on the following matrices [11].
Definition 2.3.
Let f be a rational function and H_f = FB⊤ a FB factorization. If the matrices P = F⊤F and Q = B⊤B are well defined, we call P the reachability Gramian and Q the observability Gramian.

Note that if A is an SVA, then the Gramians associated with its FB factorization satisfy P_A = Q_A = D, where D is the matrix of singular values of its Hankel matrix. The Gramians can alternatively be characterized (and computed [11]) using fixed point equations, corresponding to Lyapunov equations when |Σ| = 1 [27].

Theorem 2.2.
Let |Σ| = 1 and let A = ⟨α, A, β⟩ be a WFA with n states and well-defined Gramians P, Q. Then X = P and Y = Q solve:

X − AXA⊤ = ββ⊤,   (1)
Y − A⊤YA = αα⊤.   (2)

Finally, we recall the following definition.

Definition 2.4.
A WFA A = ⟨α, {A_a}, β⟩ is a generative probabilistic automaton (GPA) if f_A(x) ≥ 0 for every x, and Σ_{x∈Σ*} f_A(x) = 1, i.e. if f_A computes a probability distribution over Σ*.

Example 2.3.
Let |Σ| = 1, Σ = {x}. The WFA A = ⟨α, A, β⟩, with:

A = [ 0    1/2
      1/2  0 ],   α = β = [ √3/2
                            0 ],

is a GPA. Indeed, f_A(x) ≥ 0 and Σ_{x∈Σ*} f_A(x) = 1, since the rational function is:

f_A(x ··· x) = f_A(k) = α⊤A^kβ = { 0 if k is odd
                                   (3/4)2^{−k} if k is even,

where k corresponds to the string where x is repeated k times. We remark that A is minimal and in its SVA form, with Gramians P = Q = [ 4/5 0 ; 0 1/5 ], and f_A has rank 2. Finally, the corresponding Hankel matrix, also of rank 2, is:

H = [ f_A(0) f_A(1) f_A(2) ...     [ 3/4  0    3/16 ...
      f_A(1) f_A(2) f_A(3) ...  =    0    3/16 0    ...
      f_A(2) f_A(3) f_A(4) ...       3/16 0    3/64 ...
      ...    ...    ...    ... ]     ...  ...  ...  ... ].   (3)

Theorem 2.1 provides us with a way to associate a minimal WFA A with n states to a Hankel matrix H of rank n. The approach we propose to approximate A is to find the WFA corresponding to the Hankel matrix of rank k that is closest to H in the spectral norm. We recall the fundamental result of Schmidt, Eckart, Young and Mirsky [17].

Theorem 2.4 ([17]). Let H be a Hankel matrix corresponding to a compact Hankel operator of rank n, and σ_m, with 0 ≤ m < n and σ_0 ≥ ··· ≥ σ_{n−1} > 0, its singular numbers. Then, if R is a matrix of rank k, we have:

‖H − R‖ ≥ σ_k.   (4)

The equality is attained when R corresponds to the truncated SVD of H.

Note that a low-rank approximation obtained by truncating the SVD is not in general a Hankel matrix. This is problematic, since the approximation needs to be Hankel in order to be the Hankel matrix of a WFA. Surprisingly, we can obtain a result comparable to the one of Theorem 2.4 while preserving the Hankel property. This is possible thanks to a theory of optimal approximation called AAK theory [1]. To apply this theory, we will need to rewrite the approximation problem in functional analysis terms. First, we will associate a linear operator to the Hankel matrix. Then, we will use Fourier analysis to reformulate the problem in a function space.
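Example 2.3 and Theorem 2.4 can be checked numerically. The sketch below (Python/numpy, with the matrix entries as reconstructed above) verifies that f_A is a probability distribution, that the Gramians solve the fixed point equations of Theorem 2.2, that the truncated Hankel matrix has singular numbers approaching 4/5 and 1/5, and that the rank-1 SVD truncation, while attaining the error bound, fails to be Hankel:

```python
import numpy as np

# Example 2.3: a 2-state GPA over a one-letter alphabet, in SVA form.
A = np.array([[0.0, 0.5], [0.5, 0.0]])
alpha = np.array([np.sqrt(3) / 2, 0.0])
beta = alpha.copy()                         # symmetric case: beta = alpha

f = lambda k: alpha @ np.linalg.matrix_power(A, k) @ beta
vals = np.array([f(k) for k in range(200)])
assert abs(vals.sum() - 1.0) < 1e-12        # GPA: the values sum to 1
assert (vals >= 0).all()

# Theorem 2.2: P solves X - A X A^T = beta beta^T.  Summing the quickly
# converging series P = sum_p A^p beta beta^T (A^p)^T gives diag(4/5, 1/5).
P = sum(np.linalg.matrix_power(A, p) @ np.outer(beta, beta)
        @ np.linalg.matrix_power(A, p).T for p in range(100))
assert np.allclose(P, np.diag([0.8, 0.2]))
assert np.allclose(P - A @ P @ A.T, np.outer(beta, beta))

# Truncated Hankel matrix H(i, j) = f(i + j): rank 2, singular numbers
# converging to the SVA Gramian diagonal 4/5 and 1/5.
N = 64
H = np.array([[vals[i + j] for j in range(N)] for i in range(N)])
U, s, Vt = np.linalg.svd(H)
assert abs(s[0] - 0.8) < 1e-9 and abs(s[1] - 0.2) < 1e-9 and s[2] < 1e-12

# Theorem 2.4: the rank-1 SVD truncation attains error sigma_1, but it is
# not Hankel: its entries no longer depend only on i + j.
R1 = s[0] * np.outer(U[:, 0], Vt[0, :])
assert abs(np.linalg.norm(H - R1, 2) - s[1]) < 1e-9
assert abs(R1[0, 2] - R1[1, 1]) > 0.1       # a Hankel matrix would force equality
```

The last two assertions make the tension concrete: the Eckart-Young optimum is easy to compute but leaves the class of Hankel matrices, which is exactly the gap AAK theory closes.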
A comprehensive presentation of the concepts recalled in this section can be found in [30, 33, 28].

Let f : Σ* → R be a rational function; we interpret the corresponding Hankel matrix H_f as the expression of a linear (Hankel) operator H_f : ℓ² → ℓ² in terms of the canonical basis. We recall that the Hankel operator H_f is bounded if and only if f ∈ ℓ² [11]. This property, together with the fact that we only consider finite rank operators (corresponding to the Hankel matrices of WFAs), is sufficient to guarantee compactness.

To introduce AAK theory, we need to consider a second realization of Hankel operators on complex spaces. Since in this paper we work with two classes of functions – functions over sequences and complex functions – to avoid any confusion we will make explicit the dependence on the complex variable z = e^{it}. We start by recalling a few fundamental definitions from the theory of complex functions. Note that a function φ(z) ∈ L²(T) can be represented, using the orthonormal basis {z^n}_{n∈Z}, by means of its Fourier series: φ(z) = Σ_{n∈Z} φ̂(n)z^n, with Fourier coefficients φ̂(n) = ∫_T φ(z)z̄^n dz, n ∈ Z. This establishes an isomorphism between the function φ(z) and the sequence of its Fourier coefficients φ̂. Thus, we can partition the function space L²(T) into two subspaces.

Definition 2.5.
For 0 < p ≤ ∞, the Hardy space H^p on T is the subspace of L^p(T) defined as:

H^p = {φ(z) ∈ L^p(T) : φ̂(n) = 0, n < 0},   (5)

while the negative Hardy space on T is the subspace of L^p(T)

H^p_− = {φ(z) ∈ L^p(T) : φ̂(n) = 0, n ≥ 0}.   (6)

It is possible to define Hardy spaces also on the open unit disc D.

Definition 2.6.
The Hardy space H^p(D) on D for 0 < p < ∞ consists of the functions φ(z) analytic in D and such that:

‖φ‖_p := sup_{0<r<1} ( (1/2π) ∫_0^{2π} |φ(re^{it})|^p dt )^{1/p} < ∞.

Definition 2.7. Let φ(z) be a function in the space L²(T). A Hankel operator is an operator H_φ : H² → H²_− defined by H_φ f(z) = P_− φf(z), where P_− is the orthogonal projection from L²(T) onto H²_−. The function φ(z) is called a symbol of the Hankel operator H_φ.

If H_φ is a bounded operator, we can consider without loss of generality φ(z) ∈ L^∞(T). This is a consequence of Nehari's theorem [29], whose formulation can be found in Appendix A, together with more details about the two definitions of Hankel operators. We remark that a Hankel operator has infinitely many different symbols, since H_φ = H_{φ+ψ} for ψ(z) ∈ H^∞.

Remark 1. In the standard orthonormal bases, {z^k}_{k≥0} in H² and {z^{−(j+1)}}_{j≥0} in H²_−, the Hankel operator H_φ has Hankel matrix H(j, k) = φ̂(−j−k−1) for j, k ≥ 0.

Example 2.5. In the case of the Hankel matrix in Example 2.3, since H(j, k) = φ̂(−j−k−1), we have:

H = [ 3/4  0    3/16 ...     [ φ̂(−1) φ̂(−2) φ̂(−3) ...
      0    3/16 0    ...  =    φ̂(−2) φ̂(−3) φ̂(−4) ...
      3/16 0    3/64 ...       φ̂(−3) φ̂(−4) φ̂(−5) ...
      ...  ...  ...  ... ]     ...    ...    ...    ... ].

Hence, we can recover the corresponding symbol:

P_− φ = Σ_{n≥0} φ̂(−n−1)z^{−n−1} = Σ_{n≥0} (3/4)4^{−n}z^{−2n−1} = 3z/(4z² − 1).

Definition 2.8. The complex function f(z) is rational if f(z) = p(z)/q(z), with p(z) and q(z) polynomials. The rank of f(z) is the maximum between the degrees of p(z) and q(z). A rational function is strictly proper if the degree of p(z) is strictly smaller than that of q(z).

We remark that the poles of a complex function f correspond to the zeros of 1/f. The following result of Kronecker relates finite-rank infinite Hankel matrices to rational functions.

Theorem 2.6 ([24]). Let H_φ be a bounded Hankel operator with matrix H. Then H has finite rank if and only if P_− φ is a strictly proper rational function.
Moreover, the rank of H is equal to the number of poles (with multiplicities) of P_− φ inside the unit disc.

Example 2.7. The function in Example 2.5 is rational with degree 2 and has two poles inside the unit disc, at z = ±1/2. Thus, the associated Hankel matrix has rank 2.

We state as a remark an important takeaway from this section.

Remark 2. Given a rank-n Hankel matrix H, we can look at it in two alternative ways. On the one hand we can consider the Hankel operator over sequences H_f : ℓ² → ℓ², associated to a function f : Σ* → R. In this case H(i, j) = f(i + j) for i, j ≥ 0, and f is rational in the sense that it is realized by a WFA of size n. On the other hand, we can consider the Hankel operator over complex (Hardy) spaces H_φ : H² → H²_−, associated to a function φ(z) ∈ L²(T), the symbol. In this case H(j, k) = φ̂(−j−k−1) for j, k ≥ 0, and P_− φ is rational of rank n in the sense of Definition 2.8.

We can introduce now the main result of Adamyan, Arov and Krein [1]. The theorem, stated for Hankel operators over Hardy spaces, shows that for infinite dimensional Hankel matrices the constraint of preserving the Hankel property does not affect the achievable approximation error.

Theorem 2.8 (AAK-1 [1]). Let H_φ be a compact Hankel operator of rank n and singular numbers σ_m, with 0 ≤ m < n and σ_0 ≥ ··· ≥ σ_{n−1} > 0. Then there exists a unique Hankel operator G of rank k < n such that:

‖H − G‖ = σ_k.   (11)

We denote with R_k ⊂ H^∞_− the set of strictly proper rational functions of rank k, and we consider the set:

H^∞_k = {ψ(z) ∈ L^∞(T) : ∃ g(z) ∈ R_k, ∃ l(z) ∈ H^∞, ψ(z) = g(z) + l(z)}.   (12)

We can reformulate the theorem in terms of symbols.

Theorem 2.9 (AAK-2 [1]). Let φ(z) ∈ L^∞(T). Then, there exists ψ(z) ∈ H^∞_k such that:

‖φ(z) − ψ(z)‖_∞ = σ_k(H_φ).   (13)

The solutions of Theorem 2.8 and 2.9 are strictly related (proof in Appendix A).

Corollary 1.
Let ψ(z) = g(z) + l(z) ∈ H^∞_k, with g(z) ∈ R_k and l(z) ∈ H^∞. If ψ(z) solves Equation 13, then G = H_g is the unique Hankel operator from Theorem 2.8.

We state as a corollary the key point of the proof of the AAK theorem, which provides us with a practical way to find the best approximating symbol.

Corollary 2. Let φ(z) and {ξ_k, η_k} be a symbol and a σ_k-Schmidt pair for H_φ. A function ψ(z) ∈ L^∞(T) is the best AAK approximation according to Theorem 2.9 if and only if:

(φ(z) − ψ(z))ξ⁺_k(z) = σ_k η⁻_k(z).   (14)

Moreover, the function ψ(z) does not depend on the particular choice of the pair {ξ_k, η_k}.

3 Approximate Minimization

To apply AAK theory to the approximate minimization problem, we consider only automata defined over a one-letter alphabet. In this case, the free monoid generated by the single letter can be identified with N, and canonically embedded into Z. This passage is fundamental to use Fourier analysis and the isomorphism that leads to Theorem 2.8 and 2.9. Unfortunately, this idea cannot be directly generalized to bigger alphabets, since in this case we would obtain a free non-abelian monoid (not identifiable with Z).

Theorem 2.8 requires the Hankel operator H to be bounded. To ensure that a minimal WFA A = ⟨α, A, β⟩ satisfies this condition, we assume ρ(A) < 1, where ρ is the spectral radius of A [11]. As a matter of fact, to guarantee the boundedness of the Hankel operator it is enough that the considered WFA computes a function f ∈ ℓ² [11]. However, the stricter assumption on the spectral radius is needed when computing the symbol associated to a WFA. This condition directly implies the existence of the SVA, and of the Gramian matrices P and Q, with P = Q diagonal [11]. We assume that A = ⟨α, A, β⟩ is in SVA form.
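Before developing the construction, the two views of Remark 2 can be sanity-checked numerically on the running example (Python/numpy; the FFT grid size is an arbitrary choice): sampling the symbol φ(z) = 3z/(4z²−1) of Example 2.5 on the unit circle recovers f(n) as its negative Fourier coefficients, and Kronecker's theorem matches the rank with the pole count:

```python
import numpy as np

# The symbol of Example 2.5, sampled on N equispaced points of T.
N = 128
z = np.exp(2j * np.pi * np.arange(N) / N)
phi = 3 * z / (4 * z**2 - 1)

# c[m] approximates the Fourier coefficient phi_hat(m); negative indices
# wrap around, and aliasing is negligible since f decays geometrically.
c = np.fft.fft(phi) / N
f = lambda n: 0.75 * 2.0**-n if n % 2 == 0 else 0.0
for n in range(8):
    assert abs(c[-n - 1] - f(n)) < 1e-12   # phi_hat(-n-1) = f(n), Remark 2

# Kronecker (Theorem 2.6): P_- phi = 3z/(4z^2 - 1) is strictly proper
# rational, with poles at the roots of 4z^2 - 1 = 0, i.e. z = +-1/2,
# both inside T, matching rank(H) = 2.
poles = np.roots([4, 0, -1])
assert np.allclose(sorted(np.abs(poles)), [0.5, 0.5])
```

This is exactly the correspondence the one-letter restriction buys: strings indexed by N embed into Z, so Hankel entries become Fourier coefficients of a function on the circle.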
In this case, given the size of the alphabet, the Hankel matrix is symmetric, so β = α and A = A⊤. Note that, for example, a minimal GPA computes a function f ∈ ℓ², so the condition on ρ(A) is automatically satisfied by this class of WFAs [11]. Possible relaxations of the spectral radius assumption are discussed in Appendix C, together with an alternative method to find the optimal spectral-norm approximation of a Hankel matrix without extracting a WFA. Finally, in this paper we consider automata with weights in R, though the results remain true for complex numbers. The method we present can be easily extended to vector-valued automata [35], but the solution to the optimal approximation problem will not be unique [33].

Let A = ⟨α, A, α⟩ be a minimal WFA with n states in SVA form, defined over a one-letter alphabet. Let H be the Hankel matrix of A; we denote with σ_i, for 0 ≤ i < n, its singular numbers. Given a target number of states k < n, we say that a WFA Â_k with k states solves the optimal spectral-norm approximate minimization problem if the Hankel matrix G of Â_k satisfies ‖H − G‖ = σ_k(H). Note that the "optimal spectral-norm approximate minimization" problem is equivalent to the problem solved by Theorem 2.8, with the exception that here we insist on representing the inputs and outputs of the problem effectively by means of WFAs. Based on the AAK theory sketched in Section 2.3, we draw the following steps:

1. Compute a symbol φ(z) for H using Remark 2. We obtain the negative Fourier coefficients of φ(z) and derive its Fourier series.

2. Compute the optimal symbol ψ(z) using Corollary 2. The main challenge here is to find a suitable representation for the functions ψ(z) and e(z) = φ(z) − ψ(z). We define them in terms of two auxiliary WFAs. The key point is to select constraints on their parameters to leverage the properties of weighted automata, while still keeping the formulation general.

3.
Extract the rational component by solving for g(z) in Corollary 1. This step is arguably the most conceptually challenging, as it requires identifying the position of the function's poles. In fact, we know from Theorem 2.6 that g(z) has k poles, all inside the unit disc.

4. Find a WFA representation for g(z). Since in Step 2 we parametrized the functions using WFAs, the expression of g(z) directly reveals the WFA Â_k.

3.3 Spectral-Norm Approximate Minimization

In the following sections we will consider a minimal WFA A = ⟨α, A, α⟩ with n states in SVA form, defined over a one-letter alphabet Σ = {a}, its Hankel matrix H, corresponding to the bounded operator H, and the singular numbers σ_i, for 0 ≤ i < n. Let f : Σ* → R be the function realized by A. We denote by x the string where a is repeated x times, so we have f(x) = α⊤A^xα.

To determine the symbol φ(z) of H, we recall that each entry of the Hankel matrix corresponds simultaneously to the values of f and to the negative Fourier coefficients of φ(z). In fact, as seen in Remark 2, we have:

H = [ f_A(0) f_A(1) f_A(2) ...     [ φ̂(−1) φ̂(−2) φ̂(−3) ...
      f_A(1) f_A(2) f_A(3) ...  =    φ̂(−2) φ̂(−3) φ̂(−4) ...
      f_A(2) f_A(3) f_A(4) ...       φ̂(−3) φ̂(−4) φ̂(−5) ...
      ...    ...    ...    ... ]     ...    ...    ...    ... ].   (15)

We obtain:

P_− φ(z) = Σ_{k≥0} f(k)z^{−k−1} = Σ_{k≥0} α⊤A^kα z^{−k−1} = α⊤(zI − A)^{−1}α,   (16)

where we use the fact that ρ(A) < 1. We can thus consider φ(z) = α⊤(zI − A)^{−1}α as a symbol for H.

Example 3.1. If we apply the formula in Equation 16 to the GPA in Example 2.3, we recover the rational function φ(z) = 3z/(4z² − 1) found in Example 2.5.

We consider two auxiliary WFAs. Let Â = ⟨α̂, Â, β̂⟩ be a WFA with j ≥ k states, satisfying the following properties:

1. 1 is not an eigenvalue of Â;

2. the automaton E = ⟨α_e, A_e, β_e⟩ is minimal, with

A_e = [ A 0      α_e = [ α        β_e = [ α
        0 Â ],           −α̂ ],            β̂ ].
(17)

Using the parameters of the automaton Â and a constant C, we define a function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C. We remark that the poles of ψ(z) correspond to the eigenvalues of Â, counted with their multiplicities. By assumption, 1 is not an eigenvalue of Â, so ψ(z) does not have any poles on the unit circle, and therefore ψ(z) ∈ L^∞(T). Analogously, the function e(z) = φ(z) − ψ(z) = α_e⊤(zI − A_e)^{−1}β_e − C is also bounded on the circle.

Our objective is to compute the parameters of Â = ⟨α̂, Â, β̂⟩ that make ψ(z) the best approximation of φ(z) according to Theorem 2.9. In particular, we will use Corollary 2 to find the triple α̂, Â, β̂ such that ψ(z) satisfies Equation 14. Note that, with this purpose, the constant term C ∈ H^∞ becomes necessary to characterize ψ(z). In fact, while the H^∞-component of the symbol does not affect the Hankel norm, it plays a role in the computation of the L^∞ norm (in Equation 13) according to Nehari's theorem (Theorem A.1), so it cannot be dismissed.

The following theorem provides us with an explicit expression for the functions in the Hardy space corresponding to a σ_k-Schmidt pair.

Theorem 3.2. Let σ_k be a singular number of the Hankel operator H. The singular functions associated with the σ_k-Schmidt pair {ξ_k, η_k} of H are:

ξ⁺_k(z) = σ_k^{−1/2} α⊤(I − zA)^{−1}e_k,   (18)
η⁻_k(z) = σ_k^{−1/2} α⊤(zI − A)^{−1}e_k.   (19)

If ψ(z) is the best approximation to the symbol, then σ_k^{−1}e(z) has modulus 1 almost everywhere on the unit circle (i.e. it is unimodular).

Proof. Let F and B be the forward and backward matrices, respectively, with H = FB⊤, P = F⊤F, Q = B⊤B. We consider the σ_k-Schmidt pair {ξ_k, η_k}.
By definition, H⊤Hξ_k = σ_k²ξ_k. Rewriting in terms of the FB factorization, we obtain:

H⊤Hξ_k = σ_k²ξ_k   (20)
BF⊤FB⊤ξ_k = σ_k²ξ_k   (21)
BPB⊤ξ_k = σ_k²ξ_k   (22)
BPe_k = σ_k²ξ_k   (23)

where in the last step we set e_k = B⊤ξ_k, to reduce the SVD problem of H to the one of QP. Note that, since P and Q are diagonal, e_k is the k-th coordinate vector (0, ..., 0, 1, 0, ..., 0)⊤. Since e_k is an eigenvector of QP for σ_k², we get:

BQ^{−1}QPe_k = σ_k²ξ_k   (24)
BQ^{−1}e_k = ξ_k.   (25)

Moreover, H is symmetric, so we have that η_k = ξ_k. We obtain:

ξ⁺_k(z) = Σ_{j≥0} σ_k^{−1/2} α⊤A^j e_k z^j = σ_k^{−1/2} α⊤(I − zA)^{−1}e_k,   (26)
η⁻_k(z) = Σ_{j≥0} σ_k^{−1/2} α⊤A^j e_k z^{−j−1} = σ_k^{−1/2} α⊤(zI − A)^{−1}e_k,   (27)

where the singular functions have been computed following Equation 9.

If r is the multiplicity of σ_k, from Corollary 2 we get the following fundamental equation:

(φ(z) − ψ(z)) α⊤(I − zA)^{−1}V = σ_k α⊤(zI − A)^{−1}V,

where V = (0 I_r)⊤ is a n × r matrix. Consequently, we obtain the unimodular function:

σ_k^{−1}e(z) = (α⊤(zI − A)^{−1}V) / (α⊤(I − zA)^{−1}V).

It is reasonable to wonder how the fact that σ_k^{−1}e(z) is unimodular reflects on the structure of the WFA E = ⟨α_e, A_e, β_e⟩ associated with it. We remark that, a priori, the controllability and observability Gramians of E might not be well defined. The following theorem provides us with two matrices P_e and Q_e satisfying properties similar to those of the Gramians. This theorem is the analogue of a control theory result [16], rephrased in terms of WFAs. A sketch of the proof, which relies on the minimality of the WFA E [38], can be found in Appendix B. For the detailed version of the proof and the original theorem we refer the reader to [16].

Theorem 3.3 ([16]). Consider the function e(z) = α_e⊤(zI − A_e)^{−1}β_e − C and the corresponding minimal WFA E = ⟨α_e, A_e, β_e⟩ associated with it.
If σ_k^{−1}e(z) is unimodular, then there exists a unique pair of symmetric invertible matrices P_e and Q_e satisfying:

(a) P_e − A_e P_e A_e⊤ = β_e β_e⊤
(b) Q_e − A_e⊤ Q_e A_e = α_e α_e⊤
(c) P_e Q_e = σ_k² I

We can now derive the parameters of the WFA Â = ⟨α̂, Â, β̂⟩.

Theorem 3.4. Let A = ⟨α, A, α⟩ be a minimal WFA with n states in its SVA form, and let φ(z) = α⊤(zI − A)^{−1}α be a symbol for its Hankel operator H. Let σ_k be a singular number of multiplicity r for H, with σ_0 ≥ ··· > σ_k = ··· = σ_{k+r−1} > σ_{k+r} ≥ ··· ≥ σ_{n−1} > 0. We can partition the Gramian matrices P, Q as follows:

P = Q = [ Σ  0
          0  σ_k I_r ],   (28)

where Σ ∈ R^{(n−r)×(n−r)} is the diagonal matrix containing the remaining singular numbers, and partition A and α conformally to the Gramians:

A = [ A_11   A_12        α = [ α_1
      A_12⊤  A_22 ],           α_2 ].   (29)

Let R = σ_k² I_{n−r} − Σ², and denote by (·)⁺ the Moore-Penrose pseudo-inverse. If the function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C is the best approximation of φ(z), then:

• If α_2 ≠ 0:

β̂ = −Â A_12 (α_2⊤)⁺
α̂ = Â⊤ R A_12 (α_2⊤)⁺
Â (A_11 − A_12 (α_2⊤)⁺ α_1⊤) = I   (30)

• If α_2 = 0:

β̂ = (I − Â A_11)(α_1⊤)⁺
α̂ = (Â⊤ R A_11 − R)(α_1⊤)⁺
Â A_12 = 0   (31)

Proof of Theorem 3.4. Since σ_k^{−1}e(z) = σ_k^{−1}(φ(z) − ψ(z)) is unimodular, from Theorem 3.3 there exist two symmetric nonsingular matrices P_e, Q_e satisfying the fixed point equations:

P_e − A_e P_e A_e⊤ = β_e β_e⊤   (32)
Q_e − A_e⊤ Q_e A_e = α_e α_e⊤   (33)

and such that P_e Q_e = σ_k² I. We can partition P_e and Q_e according to the definition of A_e (see Eq. 17):

P_e = [ P     P_12       Q_e = [ Q     Q_12
        P_12⊤ P_22 ],            Q_12⊤ Q_22 ].

Here P and Q correspond to the controllability and observability Gramians of A:

P = Q = [ Σ  0
          0  σ_k I_r ].

Moreover, since P_e Q_e = σ_k² I, we get P_12 Q_12⊤ = σ_k² I − P². It follows that P_12 Q_12⊤ has rank n − r.
Without loss of generality we can set dim Â = j = n − r, and choose an appropriate basis for the state space such that P_12 = (I 0)⊤ and Q_12 = (R 0)⊤, with R = σ_k² I − Σ². Once P_12 and Q_12 are fixed, the values of P_22 and Q_22 are automatically determined. We obtain:

P_e = [ Σ  0        I
        0  σ_k I_r  0
        I  0        −ΣR^{−1} ],   Q_e = [ Σ  0        R
                                          0  σ_k I_r  0
                                          R  0        −ΣR ].   (34)

Now that we have an expression for the matrices P_e and Q_e of Theorem 3.3, we can rewrite the fixed point equations to derive the parameters α̂, Â and β̂. We obtain the following systems:

{ P − APA⊤ = αα⊤                    { Q − A⊤QA = αα⊤
{ N − AN Â⊤ = αβ̂⊤                   { M − A⊤M Â = −αα̂⊤
{ −ΣR^{−1} + Â ΣR^{−1} Â⊤ = β̂β̂⊤     { −ΣR + Â⊤ ΣR Â = α̂α̂⊤   (35)

where N = (I 0)⊤ and M = (R 0)⊤. We can rewrite the second equation of each system as follows:

{ I − A_11 Â⊤ = α_1 β̂⊤       { R − A_11 R Â = −α_1 α̂⊤
{ −A_12⊤ Â⊤ = α_2 β̂⊤         { A_12⊤ R Â = α_2 α̂⊤   (36)

If α_2 ≠ 0, we have:

β̂ = −Â A_12 (α_2⊤)⁺
α̂ = Â⊤ R A_12 (α_2⊤)⁺
Â (A_11 − A_12 (α_2⊤)⁺ α_1⊤) = I   (37)

with (α_2⊤)⁺ = α_2 / (α_2⊤α_2).

If α_2 = 0, we have Â A_12 = 0. We remark that Â has size (n−r) × (n−r), while A_12 is (n−r) × r, so the system of equations corresponding to Â A_12 = 0 is underdetermined if r < n/2, in which case we can find an alternative set of solutions:

β̂ = (I − Â A_11)(α_1⊤)⁺
α̂ = (Â⊤ R A_11 − R)(α_1⊤)⁺
Â A_12 = 0   (38)

with Â ≠ 0. On the other hand, if r ≥ n/2, i.e. if the multiplicity of the singular number σ_k is at least half the size of the original WFA, the system might not have any solution unless Â = 0 (or unless A_12 was zero to begin with). In this setting the method proposed returns Â = 0. An alternative in this case is to search for an approximation of size k − 1 or k + 1, so that the multiplicity of the corresponding singular number satisfies r < n/2 and the system in Equation 38 is underdetermined.

3.3.3 Extraction of the Rational Component

The function ψ(z) = α̂⊤(zI − Â)^{−1}β̂ + C found in Theorem 3.4 corresponds to the solution of Theorem 2.9.
To find the solution to the approximation problem we only need to "isolate" the function g(z) ∈ R_k, i.e. the rational component. To do this, we study the position of the poles of ψ(z), since the poles of a strictly proper rational function lie inside the unit disc (Theorem 2.6). As noted before, we parametrized ψ(z) so that its poles correspond to the eigenvalues of Â. After a change of basis (detailed in Paragraph 3.3.3.1), we can rewrite Â in block-diagonal form:

Â = [ Â₊  0
      0   Â₋ ],   (39)

where the modulus of the eigenvalues of Â₊ (resp. Â₋) is smaller (resp. greater) than one. We apply the same change of coordinates on α̂ and β̂.

To conclude the study of the eigenvalues of Â, we need the following auxiliary result from Ostrowski [32]. A proof of this theorem can be found in [41].

Theorem 3.5 ([32]). Let |Σ| = 1, and let P be a solution to the fixed point equation X − AXA⊤ = ββ⊤ for the WFA A = ⟨α, A, β⟩. If A is reachable, then:

• The number of eigenvalues λ of A such that |λ| < 1 is equal to the number of positive eigenvalues of P.

• The number of eigenvalues λ of A such that |λ| > 1 is equal to the number of negative eigenvalues of P.

We can finally find the rational component of the function ψ(z), i.e. the function g(z) from Corollary 1 necessary to solve the approximate minimization problem.

Theorem 3.6. Let Â₊, α̂₊, β̂₊ be as in Theorem 3.4. The rational component of ψ(z) is the function g(z) = α̂₊⊤(zI − Â₊)^{−1}β̂₊.

Proof. Clearly ψ(z) = g(z) + l(z), with l(z) = α̂₋⊤(zI − Â₋)^{−1}β̂₋ + C, l ∈ H^∞. To conclude the proof we need to show that g(z) has k poles inside the unit disc, and therefore has rank k. We do this by studying the position of the eigenvalues of Â₊. Since E is minimal, by definition Â is reachable, so we can use Theorem 3.5 and solve the problem by directly examining the eigenvalues of −ΣR^{−1}, the solution of the corresponding fixed point equation.
From the proof of Theorem 3.4 we have −Σ̄R⁻¹ = Σ̄(Σ̄² − σ_k²1)⁻¹, where Σ̄ is the diagonal matrix having as elements the singular numbers of H different from σ_k. It follows that −Σ̄R⁻¹ has exactly k strictly positive eigenvalues, and Â has k eigenvalues with modulus smaller than 1. Thus, Â₊ has k eigenvalues, corresponding to the poles of g(z).

3.3.3.1 Block-diagonalization of Â. In this paragraph we detail the technical steps necessary to rewrite Â in block-diagonal form. The problem of computing the Jordan form of a matrix is ill-conditioned, hence it is not suitable for our algorithmic purposes. Instead, we compute the Schur decomposition, i.e. we find an orthogonal matrix U such that U⊤ÂU is upper triangular, with the eigenvalues of Â on the diagonal. We obtain:

T = U⊤ Â U = ( Â₊  Â₁₂
               0   Â₋ )   (40)

where the eigenvalues are arranged in increasing order of modulus, and the modulus of those in Â₊ (resp. Â₋) is smaller (resp. greater) than one. To transform this upper triangular matrix into a block-diagonal one, we use the following result.

Theorem 3.7 ([37]). Let T be the matrix defined in Equation 40. The matrix X is a solution of the equation Â₊X − XÂ₋ + Â₁₂ = 0 if and only if the matrices

M = ( 1  X
      0  1 ) ,   M⁻¹ = ( 1  −X
                         0   1 )   (41)

satisfy:

M⁻¹ T M = ( Â₊  0
            0   Â₋ ) .   (42)

Setting Γ = (1_k 0), the k × (n − r) matrix that selects the first k coordinates, we can now derive the rational component of the WFA:

Â₊ = Γ M⁻¹ U⊤ Â U M Γ⊤   (43)
α̂₊ = Γ M⊤ U⊤ α̂   (44)
β̂₊ = Γ M⁻¹ U⊤ β̂   (45)

In the previous sections, we have derived the rational function g(z) corresponding to the symbol of G, the operator that solves Theorem 2.8. To find the solution to the approximation problem we only need to find the parameters of Â_k, the optimal approximating WFA. These are directly revealed by the expression of g(z), thanks to the way we parametrized the functions.

Theorem 3.8. Let A = ⟨α, A, α⟩ be a minimal WFA with n states over a one-letter alphabet. Let A be in its SVA form.
The optimal spectral-norm approximation of rank k is given by the WFA Â_k = ⟨α̂₊, Â₊, β̂₊⟩.

Proof. From Corollary 1 we know that g(z) is the rational function associated with the Hankel matrix of the best approximation. Given the correspondence between the Fourier coefficients of g(z) and the entries of the matrix (Remark 2), we have:

g(z) = α̂₊⊤ (z1 − Â₊)⁻¹ β̂₊ = Σ_{j≥0} α̂₊⊤ Â₊^j β̂₊ z^{−j−1} = Σ_{j≥0} f̄(j) z^{−j−1}   (46)

where f̄ : Σ* → R is the function computed by Â_k and α̂₊, Â₊, β̂₊ are its parameters.

The theoretical foundations of AAK theory guarantee that the construction detailed in Section 3.3 produces the rank k optimal spectral-norm approximation of a WFA satisfying our assumptions, and the singular number σ_k provides the exact error. Similarly to the case of SVA truncation [11], due to the ordering of the singular numbers, the error decreases when k increases, meaning that allowing Â_k to have more states guarantees a better approximation of A. We remark that, while the solution we propose is optimal in the spectral norm, the same is not necessarily true in other norms. Nonetheless, we have the following bound between ℓ²-norm and spectral-norm (proof in Appendix B).

Algorithm 1: AAKapproximation
input : A minimal WFA A, with α₂ ≠ 0, n states and in SVA form, its Gramian P, a target number of states k < n
output: A WFA Â_k with k states
1: Let α₁, α₂, A₁₁, A₁₂, A₂₂, Σ̄ be the blocks defined in Eq. 28
2: Let (α₂⊤)⁺ = α₂ (α₂⊤ α₂)⁻¹
3: Let R = σ_k²1 − Σ̄²
4: Let Â = (A₁₁ − A₁₂ (α₂⊤)⁺ α₁⊤)⁻¹
5: Let α̂ = −Â⊤ R A₁₂ (α₂⊤)⁺
6: Let β̂ = −Â A₁₂ (α₂⊤)⁺
7: Let Â = ⟨α̂, Â, β̂⟩
8: Let Â_k ← BlockDiagonalize(Â)
9: return Â_k

Theorem 3.9. Let A be a minimal WFA computing f : Σ* → R, with Hankel matrix H. Let Â_k be its optimal spectral-norm approximation, computing f̄ : Σ* → R, with Hankel matrix G. Then:

‖f − f̄‖_{ℓ²} ≤ ‖H − G‖ = σ_k.   (47)

Proof.
Let e₁ = (1 0 0 · · ·)⊤, and let f, f̄ : Σ* → R be the functions computed by A and Â_k, with Hankel matrices H and G, respectively. We have:

‖f − f̄‖_{ℓ²} = ( Σ_{n≥0} |f(n) − f̄(n)|² )^{1/2} = ‖(H − G) e₁‖_{ℓ²} ≤ sup_{‖x‖_{ℓ²}=1} ‖(H − G) x‖_{ℓ²} = ‖H − G‖ = σ_k

where the second equality follows by definition and by observing that the matrix difference is computed entry-wise.

In this section we describe the algorithm for spectral-norm approximate minimization. The algorithm takes as input a target number of states k < n, a minimal WFA A with ρ(A) < 1, α₂ ≠ 0, n states and in SVA form, and its Gramian P. Note that, in the case of α₂ = 0, it is enough to substitute Steps 4, 5 and 6 of Algorithm 1 with the equations in (38). The assumptions on A to be minimal and in SVA form are not essential. In fact a WFA with n states can be minimized in time O(n³) [14], and the SVA computed in O(n³) [11]. Using the results of Theorem 3.4, we outline in Algorithm 1, AAKapproximation, the steps necessary to extract the best spectral-norm approximation of a WFA. The algorithm involves a call to Algorithm 2, BlockDiagonalize. In particular, this corresponds to the steps, outlined in Paragraph 3.3.3.1, necessary to derive the WFA Â_k corresponding to the rational function g(z). We remark that Step 2 in BlockDiagonalize can be performed using the Bartels-Stewart algorithm [13]. To compute the computational cost we recall the following facts [39]:
• The product of two n × n matrices can be computed in time O(n³) using a standard iterative algorithm, but can be reduced to O(n^ω) with ω < 2.4.

Algorithm 2: BlockDiagonalize
input : A WFA Â
output: A WFA Â_k with ρ(Â_k) < 1
1: Compute the Schur decomposition Â = U T U⊤, with the eigenvalues on the diagonal of T arranged so that |T₁₁| ≤ |T₂₂| ≤ · · ·
2: Solve Â₊X − XÂ₋ + Â₁₂ = 0 for X
3: Let M = ( 1  X
             0  1 ) and M⁻¹ = ( 1  −X
                                0   1 )
4: Let Γ = (1_k 0)
5: Let Â₊ = Γ M⁻¹ U⊤ Â U M Γ⊤
6: Let α̂₊ = Γ M⊤ U⊤ α̂
7: Let β̂₊ = Γ M⁻¹ U⊤ β̂
8: Let Â_k = ⟨α̂₊, Â₊, β̂₊⟩
9: return Â_k

• The inversion of an n × n matrix can be computed in time O(n³) using Gauss-Jordan elimination, but can be reduced to O(n^ω) with ω < 2.4.
• The computation of the Schur decomposition of an n × n matrix can be done with a two-step algorithm, where each step takes O(n³), using the Hessenberg form of the matrix.
• The Bartels-Stewart algorithm applied to upper triangular matrices to find an m × n matrix takes O(mn² + nm²).

The running time of BlockDiagonalize with input a WFA Â with (n − r) states is thus in O((n − r)³), where r is the multiplicity of the singular value considered. The running time of AAKapproximation for an input WFA A with n states is in O(n³).

The study of approximate minimization for WFAs is very recent, and only a few works have been published on the subject. In [10, 11] the authors present an approximate minimization technique using a canonical expression for WFAs, and provide bounds on the error in the ℓ² norm. The result is supported by strong theoretical guarantees, but it is not optimal in any norm. An extension of this method to the case of Weighted Tree Automata can be found in [12]. A similar problem is addressed in [25], with less general results. In [26], the authors connect spectral learning to the approximate minimization problem of a small class of Hidden Markov models, bounding the error in terms of the total variation distance. The control theory community has largely studied approximate minimization in the context of linear time-invariant systems, and several methods have been proposed [3].
A parallel can be drawn between those results and ours, by noting that the impulse response of a discrete time-invariant Single-Input-Single-Output (SISO) system can be parametrized as a WFA over a one-letter alphabet. In [20] Glover presents a state-space solution for the case of continuous Multi-Input-Multi-Output (MIMO) systems. Glover's method led to a widespread application of these results, thanks to its computational and theoretical simplicity. This stems from the structure of the Lyapunov equations for continuous systems. It is however not the case for discrete control systems, where the Lyapunov equations have a quadratic form. As noted in [16], there is not a simple closed form formula for the state space solution of a discrete system. Thus, most of the results for the discrete case work with a suboptimal version of the problem, with restrictions on the multiplicity of the singular values [6, 2, 23]. A solution for the SISO case can be found, without additional assumptions, using a polynomial approach, but it does not provide an explicit representation of the state space, nor does it generalize to the MIMO setting. The first to actually extend Glover's results to the discrete case is Gu, who provides an elegant solution for the MIMO discrete problem [21]. Glover's and Gu's solutions rely on embedding the initial system into an extension of it, the all-pass system, equivalent to the WFA E in our method. Part of our contribution is the adaptation of some of the control theory tools to our setting.

In this paper we applied the AAK theory for Hankel operators and complex functions to the framework of WFAs in order to construct the best possible approximation to an automaton given a bound on the size. We provide an algorithm to find the parameters of the best WFA approximation in the spectral norm, and bounds on the error. Our method applies to WFAs A = ⟨α, A, β⟩, defined over a one-letter alphabet, with ρ(A) < 1.
While this setting is certainly restricted, we believe that this work constitutes a first fundamental step in the direction of optimal approximation. Furthermore, the use of AAK techniques has proven to be very fruitful in related areas like control theory; we think that automata theory can also benefit from it. The use of such methods can help deepen the understanding of the behaviour of rational functions. This paper highlights and strengthens the interesting connections between functional analysis, automata theory and control theory, unifying tools from different domains in one formalism.

A compelling direction for future work is to extend our results to the multi-letter case. The work of Adamyan, Arov and Krein provides us with a powerful theory connecting sequences to the study of complex functions. We note that, unfortunately, this approach cannot be directly generalized to the multi-letter case because of the non-commutative nature of the monoid considered. Extending this work would require tools from harmonic analysis that are not available for non-abelian structures. A recent line of work in functional analysis is centered around the extension of operator theory to the non-commutative case, and in [34] a non-commutative version of the AAK theorem is presented. However, those results are non-constructive, making this direction, already challenging, even harder to pursue.

Acknowledgments

This research has been supported by NSERC Canada (C. Lacroce, P. Panangaden, D. Precup) and the Canada CIFAR AI chairs program (G. Rabusseau). The authors would like to thank Tianyu Li, Harsh Satija and Alessandro Sordoni for feedback on earlier drafts of this work, Gheorghe Comanici for a detailed review, and Maxime Wabartha for fruitful discussions and comments on the proofs.

References

[1] Vadim M. Adamyan, Damir Zyamovich Arov, and Mark Grigorievich Krein. Analytic properties of Schmidt pairs for a Hankel operator and the generalized Schur-Takagi problem.
Mathematics of the USSR-Sbornik, 15:31-73, 1971.
[2] M.M. Al-Hussari, I.M. Jaimoukha, and D.J.N. Limebeer. A descriptor approach for the solution of the one-block distance problem. In Proceedings of the IFAC World Congress, 1993.
[3] Athanasios C. Antoulas. Approximation of Large-Scale Dynamical Systems. SIAM, 2005.
[4] Stephane Ayache, Remy Eyraud, and Noe Goudian. Explaining black boxes on sequential data using weighted automata. In Proceedings of the 14th International Conference on Grammatical Inference, volume 93 of Proceedings of Machine Learning Research, pages 81-103, 2018.
[5] Raphaël Bailly, François Denis, and Liva Ralaivola. Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 33-40, New York, NY, USA, 2009. Association for Computing Machinery. doi:10.1145/1553374.1553379.
[6] Joseph A. Ball and Andre C.M. Ran. Optimal Hankel norm model reductions and Wiener-Hopf factorization I: The canonical case. SIAM Journal on Control and Optimization, 25(2):362-382, 1987.
[7] Borja Balle, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. Spectral learning of weighted automata: A forward-backward perspective. Machine Learning, 96(1-2):33-63, 2014.
[8] Borja Balle, Pascale Gourdeau, and Prakash Panangaden. Bisimulation metrics for weighted finite automata. In Proceedings of the 44th International Colloquium on Automata, Languages and Programming, Warsaw, pages 103:1-14, 2017.
[9] Borja Balle, William Hamilton, and Joelle Pineau. Methods of moments for learning stochastic languages: Unified presentation and empirical comparison. In International Conference on Machine Learning, pages 1386-1394. PMLR, 2014.
[10] Borja Balle, Prakash Panangaden, and Doina Precup. A canonical form for weighted automata and applications to approximate minimization.
In Proceedings of the Thirtieth Annual ACM/IEEE Symposium on Logic in Computer Science, July 2015.
[11] Borja Balle, Prakash Panangaden, and Doina Precup. Singular value automata and approximate minimization. Math. Struct. Comput. Sci., 29(9):1444-1478, 2019. doi:10.1017/S0960129519000094.
[12] Borja Balle and Guillaume Rabusseau. Approximate minimization of weighted tree automata. Information and Computation, page 104654, 2020.
[13] R. H. Bartels and G. W. Stewart. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM, 15(9):820-826, 1972.
[14] Jean Berstel and Christophe Reutenauer. Noncommutative rational series with applications, volume 137. Cambridge University Press, 2011.
[15] J.W. Carlyle and A. Paz. Realizations by stochastic finite automata. Journal of Computer and System Sciences, 5(1):26-40, 1971.
[16] Charles K. Chui and Guanrong Chen. Discrete H∞ Optimization With Applications in Signal Processing and Control Systems. Springer-Verlag, 1997.
[17] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218, 1936. doi:10.1007/BF02288367.
[18] Remi Eyraud and Stephane Ayache. Distillation of weighted automata from recurrent neural networks using a spectral approach. arXiv preprint arXiv:2009.13101, 2020.
[19] M. Fliess. Matrices de Hankel. Journal de Mathématiques Pures et Appliquées, 53:197-222, 1974.
[20] Keith Glover. All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. International Journal of Control, 39(6):1115-1193, 1984. doi:10.1080/00207178408933239.
[21] Guoxiang Gu. All optimal Hankel-norm approximations and their error bounds in discrete-time. International Journal of Control, 78(6):408-423, 2005. doi:10.1080/00207170500110988.
[22] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460-1480, September 2012.
doi:10.1016/j.jcss.2011.12.025.
[23] Vlad Ionescu and Cristian Oara. The four-block Adamjan-Arov-Krein problem for discrete-time systems. In Linear Algebra and its Applications, pages 95-119. Elsevier, 2001.
[24] L. Kronecker. Zur Theorie der Elimination einer Variablen aus zwei algebraischen Gleichungen. Monatsber. Königl. Preussischen Akad. Wiss., pages 535-600, 1881.
[25] Alex Kulesza, Nan Jiang, and Satinder Singh. Low-Rank Spectral Learning with Weighted Loss Functions. In Artificial Intelligence and Statistics, pages 517-525. PMLR, 2015.
[26] Alex Kulesza, N. Raj Rao, and Satinder Singh. Low-Rank Spectral Learning. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 522-530, Reykjavik, Iceland, 22-25 April 2014. PMLR.
[27] A. M. Lyapunov. The General Problem of the Stability of Motion [in Russian]. Gostekhizdat, Moscow, 1950.
[28] Jean Meinguet. A Simplified Presentation of the Adamjan-Arov-Krein Approximation Theory. In H. Werner, L. Wuytack, E. Ng, and H. J. Bünger, editors, Computational Aspects of Complex Analysis, pages 217-248, Dordrecht, 1983. Springer Netherlands. doi:10.1007/978-94-009-7121-9_9.
[29] Zeev Nehari. On Bounded Bilinear Forms. Annals of Mathematics, 65(1):153-162, 1957.
[30] Nikolai K. Nikol'skii. Operators, Functions and Systems: An Easy Reading, volume 92 of Mathematical Surveys and Monographs. American Mathematical Society, 2002.
[31] Takamasa Okudono, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. Weighted Automata Extraction from Recurrent Neural Networks via Regression on State Spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5306-5314, 2020.
[32] Alexander Ostrowski and Hans Schneider. Some Theorems on the Inertia of General Matrices. J. Math. Anal. Appl., 4(1):72-84, 1962.
[33] Vladimir Peller. Hankel Operators and their Applications.
Springer Science & Business Media, 2012.
[34] Gelu Popescu. Multivariable Nehari Problem and Interpolation. Journal of Functional Analysis, 200:536-581, 2003. doi:10.1016/S0022-1236(03)00078-8.
[35] Guillaume Rabusseau, Borja Balle, and Joelle Pineau. Multitask Spectral Learning of Weighted Automata. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2585-2594, 2017.
[36] Guillaume Rabusseau, Tianyu Li, and Doina Precup. Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning. In Proceedings of AISTATS, pages 1630-1639, 2019.
[37] William E. Roth. The equations AX − YB = C and AX − XB = C in matrices. Proceedings of the American Mathematical Society, 3(3):392-396, 1952.
[38] B. De Schutter. Minimal State-Space Realization in Linear System Theory: an Overview. Journal of Computational and Applied Mathematics, 121(1):331-354, 2000. doi:10.1016/S0377-0427(00)00341-1.
[39] Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.
[40] Gail Weiss, Yoav Goldberg, and Eran Yahav. Learning Deterministic Weighted Automata with Queries and Counterexamples. In Advances in Neural Information Processing Systems, pages 8560-8571, 2019.
[41] H.K. Wimmer. On the Ostrowski-Schneider Inertia Theorem. Journal of Mathematical Analysis and Applications, 41(1):164-169, 1973. doi:10.1016/0022-247X(73)90190-X.
[42] Kehe Zhu. Operator Theory in Function Spaces, volume 138. American Mathematical Society, 1990.

A Hankel Operators

For more details on the content of this section we refer the reader to [30]. We recall the first definition of Hankel operator.

Definition A.1. A Hankel operator is a mapping H : ℓ² → ℓ² with matrix H = {α_{j+k}}_{j,k≥0}. In other words, given a = {a_n}_{n≥0} ∈ ℓ², we have H(a) = b, where b = {b_n}_{n≥0} is defined by:

b_k = Σ_{j≥0} α_{j+k} a_j,   k ≥ 0.   (48)

This property of the Hankel matrix can be rephrased as an operator identity.
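Definition A.1 can be illustrated on a finite truncation. The sequence α_n = 2^{−n} below is an assumption chosen for the example; S and S* are the matrix truncations of the right shift and its left inverse, for which the Hankel structure gives HS = S*H away from the truncation boundary:

```python
import numpy as np

# Truncated illustration of Definition A.1; alpha_n = 2^{-n} is an assumed example.
N = 50
alpha = 0.5 ** np.arange(2 * N)
H = np.array([[alpha[j + k] for k in range(N)] for j in range(N)])  # H[j,k] = alpha_{j+k}

# Action on a in ell^2: (Ha)_k = sum_j alpha_{j+k} a_j  (Equation 48)
a = 0.3 ** np.arange(N)
b = H @ a
b0 = sum(alpha[j] * a[j] for j in range(N))   # direct evaluation for k = 0
print(np.isclose(b[0], b0))                   # True

# The identity HS = S*H, with S the right shift and S* its left inverse,
# holds exactly away from the truncation boundary:
S = np.eye(N, k=-1)       # S x = (0, x_0, x_1, ...)
S_star = np.eye(N, k=1)   # S* y = (y_1, y_2, ...)
print(np.allclose((H @ S)[:N-1, :N-1], (S_star @ H)[:N-1, :N-1]))  # True
```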
Defining the shift operator by S(x₀, x₁, . . .) = (0, x₀, x₁, . . .) and denoting its left inverse by S*(y₀, y₁, . . .) = (y₁, y₂, . . .), we have that H is a Hankel operator if and only if:

H S = S* H   (49)

The correspondence between Definition A.1 and Definition 2.7 can easily be made explicit. First, we note that using the isomorphism φ ↦ (φ̂(n))_{n∈Z}, introduced with the Fourier series, we can identify H² with ℓ². Moreover, the operator of multiplication by z acts as the right shift S on the space of Fourier coefficients, in fact (zφ)^(n) = φ̂(n − 1), while S* corresponds to the truncated multiplication operator. Now, let B₊ = {z^k}_{k≥0} and B₋ = {z^j}_{j<0} be bases for H² and H²₋, respectively, and let

I z^n = z^{−n−1}   (50)

be the involution on L²(T). Note that I H² = H²₋. Let H : ℓ² → ℓ² be a Hankel operator with matrix H. Using the bases B₊, B₋ and the Fourier identification map, we can obtain an operator acting between Hardy spaces. Following this interpretation, the operator H̃ = IH : H² → H²₋ has matrix H with respect to B₊, B₋, and satisfies:

H̃ S = P₋ S H̃.   (51)

In particular, HS = S*H if and only if H̃S = P₋SH̃. It is now easy to see that the characterization of the Hankel operator given in Definition 2.7 satisfies Equation 51. The following theorem, due to Nehari [29], is of great importance as it highlights a correspondence between bounded Hankel operators and functions in L∞(T).

Theorem A.1 ([29]). A Hankel operator H : ℓ² → ℓ² with matrix H = {α_{j+k}}_{j,k≥0} is bounded on ℓ² if and only if there exists a function ψ ∈ L∞(T) such that

α_m = ψ̂(m),   m ≥ 0.   (52)

In this case:

‖H‖ = inf{ ‖ψ‖_∞ : ψ̂(m) = α_m, m ≥ 0 },   (53)

where ψ̂(n) is the n-th Fourier coefficient of ψ. We can now reformulate the theorem using the characterization of Hankel operators in Hardy spaces.

Theorem A.2 ([29]). Let φ ∈ L²(T) be a symbol of the Hankel operator on Hardy spaces H_φ : H² → H²₋.
The following are equivalent:
(1) H_φ is bounded on H²;
(2) there exists ψ ∈ L∞(T) such that ψ̂(m) = φ̂(m) for all m < 0.
If the conditions above are satisfied, then:

‖H_φ‖ = inf{ ‖ψ‖_∞ : ψ̂(m) = φ̂(m), m < 0 },   (54)

or equivalently:

‖H_φ‖ = inf_{f(z)∈H∞} ‖φ(z) − f(z)‖_∞.   (55)

Nehari's Theorem is at the core of the proof of Corollary 1.

Proof of Corollary 1. Let H_φ be a Hankel operator with symbol φ(z) ∈ L∞(T) and matrix H. Let ψ(z) = g(z) + l(z) ∈ H∞_k be the solution of Equation 13. We have:

‖H_φ − H_ψ‖ = ‖H_{φ−ψ}‖   (56)
= ‖H_{σ_k η_k⁻(z)/ξ_k⁺(z)}‖   (57)
≤ σ_k ‖η_k⁻(z)/ξ_k⁺(z)‖_∞ = σ_k   (58)

where first we used Corollary 2 and then Equation 55. Now, using the definition of Hankel operator, we have:

‖H_φ − H_ψ‖ = ‖H_φ − H_g‖ = ‖H − G‖ ≤ σ_k.   (59)

Since ‖H − G‖ ≥ σ_k (from the Eckart-Young theorem [17]), it follows that ‖H − G‖ = σ_k. Note that G has rank k, as required, because g ∈ R_k (Theorem 2.6).

B Proofs from Section 3

Proof of Theorem 3.3. In order to prove Theorem 3.3 we need an auxiliary lemma. These are the analogues of a control theory result, rephrased in terms of WFAs. The original theorem and lemma, together with the corresponding proofs, can be found in [16]. Hence, we only provide a sketch of the proofs.

Lemma B.1 ([16]). Let E = ⟨α_e, A_e, β_e⟩ be a minimal WFA, and let e(z) = α_e⊤(z1 − A_e)⁻¹β_e − C. If σ_k⁻¹e(z) is unimodular, then there exists a unique invertible symmetric matrix T satisfying:
(a) A_e⊤ T β_e = α_e C
(b) σ_k² α_e⊤ T⁻¹ A_e⊤ = C β_e⊤
(c) A_e⊤ T A_e − C⁻¹ A_e⊤ T β_e α_e⊤ = T

Proof. Since σ_k⁻¹e(z) is unimodular, we have that:

e(z) e*(z̄⁻¹) = σ_k²   (60)

where we denote with e* the adjoint function. From the equation above, we obtain:

e*(z̄⁻¹) = σ_k² e⁻¹(z) = σ_k² (−C + α_e⊤(z1 − A_e)⁻¹β_e)⁻¹   (61)
= −σ_k² C⁻¹ − σ_k² C⁻¹ α_e⊤ (z1 − (A_e + C⁻¹ β_e α_e⊤))⁻¹ β_e C⁻¹   (62)

where we used the matrix inversion lemma.
On the other hand we have:

e*(z̄⁻¹) = −C + β_e⊤ (z⁻¹1 − A_e⊤)⁻¹ α_e   (63)
= −C + β_e⊤ (−A_e⁻⊤ (1 − z A_e⊤) + A_e⁻⊤)(1 − z A_e⊤)⁻¹ α_e   (64)
= −(C + β_e⊤ A_e⁻⊤ α_e) − β_e⊤ A_e⁻⊤ (z1 − A_e⁻⊤)⁻¹ A_e⁻⊤ α_e   (65)

where we used again the matrix inversion lemma before grouping the terms. If the quantities in Equation 62 and Equation 65 have to be equal, we need their constant terms to be the same. Then we want the H∞₋-components to correspond, so we consider the corresponding Hankel matrices. It is easy to see that we can once again associate the coefficients of these complex functions to the parameters of a WFA. From the minimality of E we obtain:

σ_k² C⁻¹ α_e⊤ = β_e⊤ A_e⁻⊤ T
A_e + C⁻¹ β_e α_e⊤ = T⁻¹ A_e⁻⊤ T
β_e C⁻¹ = T⁻¹ A_e⁻⊤ α_e   (66)

where T is an invertible matrix [7]. This system is equivalent to:

σ_k² α_e⊤ T⁻¹ A_e⊤ = C β_e⊤
A_e⊤ T A_e − C⁻¹ A_e⊤ T β_e α_e⊤ = T
A_e⊤ T β_e = α_e C   (67)

To conclude the proof it remains to check that T is symmetric, and this can be verified by direct computation.

Proof of Theorem 3.3. This proof follows easily from Lemma B.1 by setting P_e = −σ_k² T⁻¹ and Q_e = −T. We obtain point (c) by direct multiplication. Then, we substitute the last equation in 67 into the second one, and we obtain:

A_e⊤ T A_e − α_e α_e⊤ = T   (68)

which verifies point (b) with Q_e = −T. Point (a) can be obtained analogously combining the first and second equations in 67.

C Possible Extensions

C.1 Relaxing the Spectral Radius Assumption

It is possible to extend part of our method to WFAs over a one-letter alphabet with ρ(A) = 1, but the approximation recovered is not optimal in the spectral norm. Let A = ⟨α, A, α⟩, with ρ(A) = 1, be a WFA with n states that we want to minimize. The idea is to block-diagonalize A like we did in Section 3.3.3, and tackle each component separately. The case of A₊ = ⟨α₊, A₊, α₊⟩, the component having ρ(A₊) < 1, can be dealt with in the way presented in the previous sections.
This means that we can find an optimal spectral-norm approximation of the desired size for A₊. Now we can consider the second component, A₋ = ⟨α₋, A₋, α₋⟩. The key idea is to apply the transformation z^j ↦ z^{−j}, j ≥ 0, to the function φ″(z). Then, the function

φ″(z⁻¹) = Σ_{k≥0} α₋⊤ A₋^k α₋ z^k = α₋⊤ (1 − z A₋)⁻¹ α₋   (69)

is well defined, as the series converges for z with small enough modulus. Using this transformation we obtain a function with poles inside the unit disc and we can apply the method presented in the paper. An important choice to make is the size of the approximation of A₋, as it can influence the quality of the approximation. Analyzing the effects of this parameter on the approximation error constitutes an interesting direction for future work, on both the theoretical and experimental side. Some theoretical work has been done in the control theory literature to study an analogous approach for continuous time systems and its approximation error [20].

C.2 Polynomial method

We remark that Equation 14 from Corollary 2 can be rewritten as

ψ(z) = φ(z) − (Hξ_k⁺)(z) / ξ_k⁺(z),   (70)

where ξ_k⁺(z) is the function in H² associated to the vector ξ_k ∈ Ker(H*H − σ_k²1) (and ψ(z) does not depend on the choice of the specific ξ_k). There is an alternative way to find the best approximation, particularly useful when the objective is to approximate a finite-rank infinite Hankel matrix with another Hankel matrix, without necessarily extracting a WFA. We can consider the adjoint operator H* and its matrix H*. The singular numbers and singular vectors of H correspond to the eigenvalues and eigenvectors of R = (H*H)^{1/2}. Hence, it is possible to compute σ_k and a corresponding singular vector ξ_k. The function ξ_k⁺(z) is then obtained following Equation 9.
The Hankel matrix G that best approximates H is given by G = H − M, where M is the Hankel matrix having as symbol the function (Hξ_k⁺)(z) / ξ_k⁺(z).
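To make the role of σ_k concrete, the following sketch (entirely illustrative; the truncation level and the example WFA are our own assumptions) computes the singular numbers of a truncated finite-rank Hankel matrix, the quantities on which both the WFA construction and the polynomial method are built:

```python
import numpy as np

# f is computed by an assumed 2-state WFA with rho(A) < 1, so its Hankel
# matrix has rank 2; a large truncation approximates the infinite operator.
A = np.diag([0.6, -0.3])
alpha = np.array([1.0, 0.5])
N = 200
f = np.array([alpha @ np.linalg.matrix_power(A, n) @ alpha for n in range(2 * N)])
H = np.array([[f[j + k] for k in range(N)] for j in range(N)])

sigma = np.linalg.svd(H, compute_uv=False)
print(sigma[0] > sigma[1] > 1e-6 > sigma[2])  # True: rank 2, then numerically zero
# Eckart-Young gives sigma[1] as a lower bound on the error of any rank-1
# approximation of H; AAK theory states that the bound is attained by a
# matrix G = H - M that is itself Hankel.
```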