Multitape automata and finite state transducers with lexicographic weights
Aleksander Mendoza-Drosik
Abstract—Finite state transducers, multitape automata and weighted automata have a lot in common. By studying their universal foundations, it's possible to discover new insights into all of them. The main result presented here is the introduction of lexicographic finite state transducers, which could be seen as an intermediate model between multitape automata and weighted transducers. Their most significant advantage is being equivalent to, but often exponentially smaller than, even the smallest nondeterministic automata without weights. Lexicographic transducers were discovered by taking inspiration from Eilenberg's algebraic approach to automata and Solomonoff's treatment of a priori probability. Therefore, a quick and concise survey of those topics is presented prior to introducing lexicographic transducers.
Index Terms—Mealy machines, transducers, sequential machines, computability, complexity
I. INTRODUCTION
A. Preliminaries
The product of sets B and C is the set B × C of all ordered pairs (b, c) such that b ∈ B and c ∈ C. A (partial) function B → C is a subset of B × C such that (b, c), (b, c′) ∈ B → C implies c = c′. Given some function A ⊂ B → C, we say that A is total if for every b there exists some c such that (b, c) ∈ A. We shall not differentiate between (B × C) × D and B × (C × D). We also assume that → binds weaker than ×, hence B × C → D stands for (B × C) → D. One can easily check that B → (C → D) is the same as B × C → D, but it is different from (B → C) → D.

Suppose ∘ is some total function ∘ ⊂ A × A → A; then the set A together with ∘ is called a monoid if two criteria are met. First, there must exist some element 1_A ∈ A (called the identity element) such that ∘(1_A, a) = ∘(a, 1_A) = a for all a ∈ A. Second, it must always hold that ∘(∘(a₁, a₂), a₃) = ∘(a₁, ∘(a₂, a₃)). Instead of writing ∘(a₁, a₂), one can also use the infix notation a₁ ∘ a₂. Thanks to the second criterion, the order of brackets doesn't matter and we can omit them, as in a₁ ∘ a₂ ∘ a₃.

If A contains two elements a₁ and a₂ such that a₁ ∘ a₂ = 1_A, then we call them invertible. a₂ can be denoted as a₁⁻¹ and called the inverse of a₁. A monoid in which every element has some inverse is called a group.

If B × C is a monoid, then B and C must be monoids themselves, with 1_{B×C} = (1_B, 1_C). This is called the direct product of monoids.

Basic understanding of measure theory is assumed.

B. Algebraic foundations
Suppose A is some set of labels, Q is a set of vertices and δ ⊂ Q × A × Q a set of edges. Then (Q, A, δ) is a labelled directed graph. Define a path to be a finite sequence of edges (q_{k₁}, x₁, q_{k₂}), (q_{k₂}, x₂, q_{k₃}), ..., (q_{kₘ}, xₘ, q_{kₘ₊₁}) where q_{kᵢ}, q_{kᵢ₊₁} ∈ Q, xᵢ ∈ A and (q_{kᵢ}, xᵢ, q_{kᵢ₊₁}) ∈ δ for every index i.

If A together with an operation · (which we call "multiplication") is a monoid, then define the signature [1] of a path as the result of multiplying consecutive labels x₁ · x₂ · ... · xₘ. An automaton [2] is defined as a tuple (Q, I, A, δ, F) where Q and δ are finite, A is finitely generated and both I and F are subsets of Q. It's common to refer to elements of Q as states instead of vertices. Similarly A is called the set of strings or words instead of labels. All elements belonging to some (usually fixed and known from context) generator of A are called symbols or letters. Elements of δ are called transitions instead of edges. States that belong to I are called initial and those belonging to F are final. Sometimes ε is used instead of 1_A to put emphasis on the neutral element being an empty string.

A path is accepting if it starts in some initial q_{k₁} and ends in a final q_{kₘ₊₁}. An automaton accepts a string x ∈ A if it is a signature of some accepting path.

The elements of A need not be "actual strings". For instance they might be pairs or triples of elements from other sets. In cases when A = B × C, the automaton is said to be multitape. Note that B or C itself might be a nested product of other sets. When A = (B₁ × B₂) × C, the automaton has 3 tapes. The distinction between single-tape and multitape automata is blurry.
Indeed, a pair of letters (b, c) could always be encoded as a single letter a_{bc} (if B has n letters and C has m, then B × C has n · m), hence one tape can be used to encode multiple other tapes within.

There is not much distinction between A and A × {ε}. A tape that can only read an empty string isn't read at all and doesn't make any difference to the overall computation. The singleton set {ε} is a trivial tape. Usually there is also not much difference between A = B × C and A = C × B. The order of tapes can be switched and a nearly identical automaton can always be built.

Given A = B × C, the automaton is said to be sequential up to B if b ≠ 1_B, b′ ≠ 1_B implies (q, (bb′, c), q′) ∉ δ. This ensures that as consecutive symbols are read from the input tape, a transition that wasn't taken before doesn't suddenly "become valid". An automaton is sequential if it is sequential up to the entire A. (Note that A is the same as A × {ε}.) For instance, automata that allow entire strings on their edges are not sequential, while automata that only allow individual symbols are sequential.

An automaton with A = B × C is deterministic up to B if it is sequential up to B and |I| = 1 and δ ⊂ Q × (B \ {1_B}) → C × Q (when the function is partial, then the automaton is called partial, otherwise it's called complete). An automaton is deterministic when it is deterministic up to the entire A. An automaton with A = B × C is ε-free up to B if δ is a subset of Q × (B \ {1_B}) × C × Q. An automaton is ε-free if it is ε-free up to the entire A.

A monoidal language [3] is any subset of A. If A is a free monoid then its subset is called a classical language. Monoidal and classical languages are jointly known under the name of formal languages, or simply languages for short. A language L is rational if and only if there exists some automaton accepting all strings in L and rejecting all those not in L.
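As a concrete illustration (not from the paper; all names are ad hoc), an automaton over B × C that is sequential and ε-free up to B can be run symbol by symbol, collecting the C-outputs of all accepting paths:

```python
# Transitions stored as tuples (q, b, c, q2): read input symbol b in state q,
# append output string c, move to state q2.
def accept_outputs(delta, I, F, x):
    # Reachable (state, accumulated output) pairs, starting from I.
    reached = {(q, "") for q in I}
    for b in x:                        # sequential: one input symbol per step
        reached = {(q2, out + c)
                   for (q, out) in reached
                   for (q1, b1, c, q2) in delta
                   if q1 == q and b1 == b}
    return {out for (q, out) in reached if q in F}

delta = {(0, 'a', 'x', 1), (0, 'a', 'y', 2), (1, 'b', 'z', 3)}
# On input "ab" only the branch through state 1 reaches the final state 3.
```

The set of (state, output) pairs tracked here is exactly what the paper later formalises as a superposition.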
If A is a direct product of several monoids, then we call L a rational relation. If a rational relation is a function, then the automaton recognizing it is called functional.

If M = (Q, I, A, δ, F) is some automaton, then L(M) is used to denote the language recognized by M. If L(M) is a relation B × C, then M(b) is used to denote all c such that (b, c) is in L(M). If an automaton has 3 tapes, say A = B × C × D, then M(b) treats it like B × (C × D). Similarly M(b, c) treats A as if it was (B × C) × D. Moreover, because the order of tapes makes little difference, it can be implicitly switched, hence M(b, d) denotes the accepted subset of (B × D) × C.

An automaton with A = B × C is k-valued up to B if the number of outputs M(b) for any b is bounded by a constant k (precisely |M(b)| ≤ k). Functional automata are exactly those that are 1-valued. An automaton is k-ambiguous up to B if for every accepted b there are at most k distinct accepting paths with signature b. An ambiguous automaton may still be functional if all the accepting paths generate the same outputs. It's possible to decide functionality of an automaton in polynomial time [4][5].

The notion of automata can be generalized to the syntactic transformation semigroup [6]. The δ function can be seen as the (right) action Q × A → Q of the monoid A on an arbitrary (possibly infinite) set Q. Every element a of A determines some (partial) function a: Q → Q. When A is a subset of Q → Q, then (Q, A) is called the transformation semigroup. This generalizes the notion of states and alphabet symbols.

A subset L (language) of strings A = B × C is prefix free up to B if there is no element b₁ that would be a prefix of another b₂ (the notion of string prefix makes the most sense when B is a free monoid). More formally, if (b₁, c₁) and (b₂, c₂) are both in L and b₁ is a prefix of b₂, then b₁ = b₂. An automaton is subsequential up to B if it is sequential up to B and L(M) is prefix free up to B.
This definition is very different from those found in other papers [7][8][9]. Usually most authors extend their automata with an additional output function for accepting states. If the automaton ends in such a state, then some additional final output is appended before accepting. Such functionality can be emulated by adding a special symbol as an end marker [10]. If seen from the perspective of the transformation semigroup, all strings and symbols are functions, therefore the end marker is in a sense the same as the "state output function". Moreover, any language with an end marker is indeed prefix free. The resemblance is analogous to that between plain Kolmogorov complexity and prefix-free complexity [11]. In essence, sequential automata continue working as long as there is input to read, whereas subsequential automata can "decide on their own" when input should end and can take some additional action. It's easy to prove that any subsequential machine on a minimal number of states can have at most one accepting state. Automata with a "state output function" also have such a unique accepting state, but it's "secretly hidden" in the definition of the automaton rather than explicitly specified in Q.

There is no formal distinction between input and output tapes. For instance, given L(M) ⊂ B × C × D, the tape B could be seen as input and C × D as the output, when M(b) is used. If instead M(b, c) is used, then B × C become input tapes and D becomes output. In the case of M(b, c, d) all tapes are input. Automata having 2 tapes, with the first one designated as input, are often called transducers.

A norm |·| is a function that assigns a real number to every element of some set. If A is a free monoid, then define the norm |a| to be the length of the string a. If A is not free, then it's much less obvious what the length should be.
(For instance, if a₁a₂ = a₃, then a₁a₂a₁a₂ might have length 4 or it might have length 2 because a₁a₂a₁a₂ = a₃a₁a₂ = a₃a₃.) When A is a direct product of several free monoids, then one can study the relationship between their lengths. The most notable property is that if A = Σ* × Γ* and δ ⊂ Q × Σ × Γ × Q, then for every accepted (σ, γ) ∈ A the lengths |σ| and |γ| are equal. This can be further generalized to A = Σ₁* × Σ₂* × ... × Σₙ*. If δ is of the form such that at least one Σᵢ is required to be of length exactly 1 on each transition (that is, δ ⊂ Q × Σ₁* × ... × Σᵢ × ... × Σₙ* × Q) then on the accepted subset of A define the induced norm |(σ₁, ..., σᵢ, ..., σₙ)| = |σᵢ|. Such a norm also coincides with the length of the accepting path, therefore it allows one to generalise and apply the pumping lemma to multitape automata.

If A contains some elements with inverses then the automaton cannot be sequential. In particular, suppose aa⁻¹ = 1_A and (q, a′, q′) ∈ δ; then a′ = a′1_A = a′(aa⁻¹) = (a′a)a⁻¹ but a′a ≠ 1_A and a⁻¹ ≠ 1_A, hence sequentiality is violated.

Every input tape can be seen as a read-only tape and every output tape can be thought of as write-only. Just as there is no formal distinction between input and output, there is no distinction between read-only and write-only. The difference becomes significant only when we allow read-write tapes, also known as stacks. In particular, if B is a group, then pushing b onto the stack is the same as reading b from the tape. Popping b off of the stack can be seen as reading b⁻¹.

A may or may not contain commuting elements. If A = B × C then all the elements of the form (1_B, c) and (b, 1_C) commute. This phenomenon characterizes nonsequential machines, which are used to encode concurrent systems (see the theory of traces [12]). For this reason, all multitape automata with ε-transitions are in a sense "concurrent" machines.
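The commutation of (1_B, c) with (b, 1_C) can be checked directly on the direct product of two free (string) monoids; a throwaway sketch, with ad hoc names:

```python
# Componentwise product of two free string monoids; identities are "".
def op(x, y):
    return (x[0] + y[0], x[1] + y[1])

left  = op(("", "c"), ("b", ""))   # (1_B, c) . (b, 1_C)
right = op(("b", ""), ("", "c"))   # (b, 1_C) . (1_B, c)
assert left == right == ("b", "c")
```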
A configuration is defined to be a subset of Q. Given a configuration K and x ∈ A, define δ̂ to be the transitive closure of δ, that is, δ̂(K, x) is the set of all states q for which there exists a path starting in K and ending in q with signature x. In cases when A = B × C the concept of configuration can be extended to include C, that is, define a superposition as a subset of Q × C. Given some superposition S define δ̂_C such that (q′, yy′) ∈ δ̂_C(S, x) whenever there exists (q, y) ∈ S and a path starting in q and ending in q′ with signature (x, y′).

A configuration is a way of capturing which states are "active" at a particular moment of computation. A superposition keeps track of the outputs associated with each "active" state. Every element of a superposition represents one possible branch of nondeterministic computation and the output accumulated along the way.

If some automaton M is sequential up to B and ε-free up to B, then we can define the image of a configuration

δ_C(K, b) = {q′ ∈ Q : ∃_{q∈K} (q, (b, c′), q′) ∈ δ}

and the image of a superposition

δ_C(S, b) = {(q′, cc′) ∈ Q × C : ∃_{(q,c)∈S} (q, (b, c′), q′) ∈ δ}

and then δ̂_C becomes

δ̂_C(S, ε) = S
δ̂_C(S, bx) = δ̂_C(δ_C(S, b), x)

In all the equations above, b is an element of the smallest generator of B (which is usually fixed and known from context). This gives an effective way of computing the output of M, that is, for all x in B there is c ∈ M(x) whenever (q, c) ∈ δ̂_C(I × {1_C}, x) for some final state q.

Theorem 1 (Deterministic superposition).
If an automaton over A = B × C is deterministic up to B then |δ̂_C(S, x)| ≤ 1 for all x ∈ B and all initial superpositions |S| = 1.

Proof: Determinism states that δ ⊂ Q × B → C × Q, so there is at most one transition that can be taken at each step. Therefore the number of elements in the superposition cannot increase.

As a direct consequence it can be shown that in deterministic automata the initial state and signature uniquely determine the path. This leads to the introduction of the following theorem.

Theorem 2 (Preservation of prefixes). Let M be some automaton over A = B × C deterministic up to B. For all strings x, x′ ∈ B, if M(xx′) = y′ ≠ ∅ and M(x) = y ≠ ∅, then y is a prefix of y′.

Proof: It follows directly from the uniqueness of the path that corresponds to the signature xx′.

Theorem 1 also applies to single-tape automata, because A can be treated like A × {ε}. A superposition belonging to Q × {ε} is the same as a configuration.

Theorem 3 (Infinite superposition). Let M be an automaton over A = B × C sequential up to B. |M(x)| = ∞ for some x ∈ B only if M contains an ε-cycle (q_{k₁}, (1_B, y₁), q_{k₂}), ..., (q_{kₘ}, (1_B, yₘ), q_{k₁}) where yᵢ ∈ C and (1_B, y₁...yₘ) ≠ 1_A.

Proof: Every time a non-ε-transition from δ ⊂ Q × (B \ {1_B}) × C × Q is taken, it increases the length of x in the corresponding signature (x, y) ∈ A. Only ε-transitions of the form δ ⊂ Q × {1_B} × C × Q do not increase the length of x. There are only finitely many elements y of a specific finite length |y|. Therefore in order to obtain an infinite subset of C it must contain strings of unbounded length. The only way to have unbounded y, while keeping x bounded, is by taking infinitely many Q × {1_B} × C × Q transitions. If there is no ε-cycle then only a finite number of ε-transitions can be taken before having to take some non-ε-transition.
Therefore there must be an ε-cycle.

Theorem 4 (Functional superposition). Let M be a functional automaton over A = B × C, sequential up to B and whose recognized language is of the form L ⊂ B → C. Then there exists an equivalent automaton such that δ̂_C(S, x) ⊂ Q → C for all S ⊂ Q → C and x ∈ B.

Proof: Suppose to the contrary that there is x and q for which δ̂_C(S, x) returns a relation Q × C that is not a function Q → C. Then there are two possibilities: either there is a path that starts in q and ends in F, or there is not. If the first case is true, then M is not functional, because we might follow that path and accept with multiple C outputs. If the second case applies, then the state q is redundant and we are free to delete it.

C. Stochastic languages and weighted automata
Suppose that B is a tape of some automaton. If B is a complete semiring then it is called the tape of weights and the automaton itself is weighted. Completeness is required because infinite sums may arise, although this requirement can be relaxed for ε-free automata (theorem 3).

A probabilistic automaton is any automaton whose A is a measure space with total measure µ(A) equal to 1. Every measurable subset of A is called a stochastic language and can be treated like a random event.

Now a way of constructing automata with probabilistic weights can be presented. Let N ≥ 1 be some natural number. Take the segment (0, 1] of the real number line and split it into N equally sized intervals. Let Ω = {ω₀, ω₁, ω₂, ..., ω_N} be the set of all those intervals ((i−1)/N, i/N], including ω₀ representing (0, 1]. The set Ω generates a monoid with multiplication ω_x · ω_y defined as

ω_x · ω_y = (x₁, x₂] · ω_y = x₁ + (x₂ − x₁) · ω_y

In other words, ω_x determines a linear transformation that treats ω_x as the new unit interval (0, 1] and ω_y is made relative to it (for instance, if ω₁ = (0, 0.5] and ω₂ = (0.5, 1], then ω₁ω₂ = (0.25, 0.5]). The norm |ωᵢ| is equal to the length of the interval. It holds that |ω_x · ω_y| = |ω_x| · |ω_y|. Define a complete semiring B generated by Ω with union of intervals as the additive operation (hence ω₁ + ω₂ω₀ = ω₀). The norm of b is equal to summing and multiplying the norms of the individual elements of Ω (note |b₁ + b₂| ≠ |b₁| + |b₂|). As N approaches ∞, the accuracy of Ω increases and their sums can approximate any real number. Consider Ω_b to be the set of all infinite strings starting with b and Ω_ε the set of all possible infinite strings. The set B can be turned into a measure space by mapping every b into the corresponding measurable set Ω_b. Such a definition of measure space corresponds to Solomonoff's a priori prefix complexity [11]. The norm |b| coincides with the measure µ(Ω_b).
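The interval composition above can be rendered numerically as a small sketch (a hypothetical illustration, not from the paper; intervals are represented as pairs (lo, hi) of floats standing for (lo, hi]):

```python
def compose(wx, wy):
    """w_x . w_y: rescale w_y into w_x, treating w_x as the new unit interval."""
    (x1, x2), (y1, y2) = wx, wy
    return (x1 + (x2 - x1) * y1, x1 + (x2 - x1) * y2)

def norm(w):
    """|w| is the length of the interval, i.e. its probability mass."""
    return w[1] - w[0]

w1, w2 = (0.0, 0.5), (0.5, 1.0)   # the N = 2 partition of (0, 1]
w12 = compose(w1, w2)             # (0.25, 0.5), as in the example above
# The norm is multiplicative: |w1 . w2| = |w1| * |w2|
```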
For any subset B′ of B the measure of B′ is equal to the sum µ(Ω_{B′}) = |Σ_{b∈B′} b|. Note that the subset {ω₁, ..., ω_N} itself has a uniform distribution, but if (0, 1] was partitioned in some irregular way, different distributions could be obtained. Moreover, this subset can be seen as a random variable, and every sequence of random variables "falls into" some b in B with probability |b|.

Consider an automaton M with a single initial state and transitions of the form Q × (Ω \ {ω₀}) → C × Q. For any input c take the set M(c) ⊂ B and turn it into a prefix-free set B′ (that is, if b₁, b₂ ∈ M(c) and b₁ is a prefix of b₂, then don't include b₂ in B′). Such a set is a random event with probability P(c) = µ(Ω_{B′}). To prove that P(c) never exceeds 1, notice that in every prefix-free subset of B, no segments of (0, 1] overlap, so they can be summed without double-counting. This also implies that |Σ_{b∈B′} b| = Σ_{b∈B′} |b|. Every string b uniquely determines some path, so if both (c, b₁) and (c, b₂) belong to L(M) but b₁ is a prefix of b₂, that means there is an ε-cycle starting and ending in some final state (so it's only natural and intuitive to discard b₂ when counting P(c)).

If the automaton has transitions of the form Q × (Ω \ {ω₀}) → C × D × Q, then the probability of any output d ∈ D can be calculated for a given input c ∈ C. The probability P(c, d) is the same as µ(Ω_{M(c,d)}) and P(c) equals the sum Σ_{d∈D} P(c, d) over all possible outputs d. Then the conditional probability P(d | c) is obtained from P(c, d)/P(c).

The construction described above is called the probabilistic semiring. Those familiar with the theory of weighted automata [13][14] might notice that this definition is completely different from the "standard" one.
No formal power series [15] or weight function for transitions [14] were used. Apart from assuming that B is a measure space, the definition of automata wasn't extended in any way. Perhaps the most significant difference is that everything was defined in terms of formal languages and strings, instead of resorting to summation over all possible paths. This presents an alternative approach to weighted automata, one that lies much closer to the theory of formal languages. The tropical semiring (and all others) can be introduced in a similar approach.

Suppose that A = B × C (the order doesn't matter much) and C is a complete semiring. Given some language L, introduce the quotient of L, denoted with L\B and defined as

(c, b) ∈ L\B ⟺ b = Σ_{(b′,c)∈L} b′

L can be any subset of A but L\B is specifically a function C → B. In the case of the probabilistic semiring, the probability P(c) = µ(Ω_{M(c)}) is the same as P(c) = |(L(M)\B)(c)|. This will be the starting point for defining the tropical semiring in terms of strings and languages.

Consider an automaton M over A = B × C, where ≤_B is some relation of total order on B. Then B can be turned into a semiring with max (or min) as the additive operation. Hence B can be treated as a tape of weights. This should be called the max semiring (or min semiring). If additionally B commutes under multiplication, then it can be called the arctic semiring (or tropical semiring). In other papers [8][7][13], B is required to represent real numbers, but such an assumption is very restricting and would require infinitary alphabets [16]. If B is a free monoid with ≤_B representing lexicographic order, then B is a special case of the max semiring (min semiring), called the lexicographic arctic semiring (or lexicographic tropical semiring). The lexicographic order itself might be defined by comparing strings from left to right or right to left.
Because each time the automaton takes a transition the weight is appended rather than prepended, it makes more sense to consider the right-to-left order (otherwise only the first transition would matter and the remaining steps of computation would be of little relevance). Therefore this paper considers the definition

b₁w₁ > b₂w₂ ⟺ w₁ > w₂ or (w₁ = w₂ and b₁ > b₂)

where w₁, w₂ belong to the generator of B. This semiring is a new discovery, which will be investigated in depth in the next part of this paper.

Let M = (Q, I, C × D, δ, F) be some automaton that may or may not be deterministic. Define δ′ ⊂ Q × B × C → D × Q to be a disambiguation up to C for M if (Q, I, B × C × D, δ′, F) is deterministic and δ coincides with δ′ in the following sense:

(q, (c, d), q′) ∈ δ ⟺ ∃_{b∈B} (q, (b, c, d), q′) ∈ δ′

Note that weighted automata can often be seen as disambiguations of some otherwise nondeterministic automata. Given any L ⊂ C × D → B, the maximization of B with respect to D, written as max_{D→B} L, is defined as

(c, d) ∈ max_{D→B} L ∧ (c, d′, b) ∈ L ⟹ b ≤ L(c, d)

Analogously, also define the minimization min_{D→B} L. Once a quotient of some weighted automaton is obtained, the weights can be completely erased by either minimising or maximizing them.

Consider an automaton over A = B × C with C = C₁ × C₂ × ... × Cₙ where every Cᵢ is a max (min) semiring. Then C can be turned into a max (min) semiring by treating the Cᵢ to the left as "more important" than those to the right. More formally, (c₁, (c₂, ..., cₙ)) > (c′₁, (c′₂, ..., c′ₙ)) if and only if either c₁ > c′₁, or c₁ = c′₁ and recursively (c₂, ..., cₙ) > (c′₂, ..., c′ₙ). Such a construction of C is known as the lexicographic semiring [17].

II. LEXICOGRAPHIC TROPICAL SEMIRING
Consider an automaton M over A = W* × Σ* × D with transitions δ ⊂ Q × W × Σ × D × Q and a total order ≤_W, which induces the lexicographic order on W*, making it a lexicographic tropical semiring. This guarantees that for any b ∈ W*, c ∈ Σ* and d ∈ D, if (b, c, d) is in L(M), then the lengths |b| and |c| are equal. For any input c the output M(c) can be computed and, after dividing it by W*, the quotient M(c)\W* is a function D → W* assigning the (lexicographically) lowest possible path to every obtainable output d ∈ D. Because min is used as the semiring addition, the quotient M(c)\W* becomes

(d, b) ∈ M(c)\W* ⟺ b = min_{(b′,d)∈M(c)} b′

and all b′ are of equal lengths (same as |c|). Because there are only finitely many strings of any fixed length, there is no need to require W* to be a complete semiring. Moreover, the lexicographic order ≤_{W*} need not be total because comparison will never occur for strings of different lengths. Such M will be referred to as lexicographic transducers.

An interesting property emerges when studying superpositions Q × W* × D. Suppose that S is some superposition obtained on a lexicographic transducer by reading a string σ₁σ₂...σₖ ∈ Σ*. Let (q, b, d) ∈ S and imagine that the automaton reads the next symbol σₖ₊₁ and enters a new superposition S′. As it takes some transition (q, σₖ₊₁, w, d′, q′), it causes the element (q′, bw, dd′) to be included in S′. If there were two elements (q, b₁, d₁) and (q, b₂, d₂) in S and b₁ < b₂, then the inequality would still be preserved for b₁w < b₂w in S′.
In that sense, the superpositions are monotonous and (q, b₂, d₂) can be safely removed from S without making any difference to M(c)\W*. On the other hand, suppose that (q₁, b₁, d₁) and (q₂, b₂, d₂) are in S and then the automaton takes transitions (q₁, σₖ₊₁, w₁, d′₁, q′) and (q₂, σₖ₊₁, w₂, d′₂, q′) both leading to the same q′ over the same σₖ₊₁. Such states are said to be conflicting. In order to determine whether b₁w₁ > b₂w₂, all that's needed to know is whether b₁ > b₂ and whether w₁ > w₂, but it's not necessary to know the actual strings, because by definition

b₁w₁ > b₂w₂ ⟺ w₁ > w₂ or (w₁ = w₂ and b₁ > b₂)

This introduces everything that's necessary for the following theorem.
Theorem 5 (Weights can be erased). Let M be some lexicographic transducer over A = W* × Σ* × D. Then there exists an automaton N over Σ* × D equivalent to min_{D→W*} L(M)\W*.

Proof: Suppose that M = (Q, I, A, δ, F) and N = (Q′, I′, Σ* × D, δ′, F′). The conversion can be carried out using an "extended" powerset construction. Instead of using Q′ = 2^Q, which can keep track of the current configuration in Q, we need to keep track of the superposition Q × W*. However, because there are infinitely many strings in W*, such a powerset would result in an infinite Q′. To make Q′ bounded, we abstract the exact strings W* away and only focus on the order relationship between them. More precisely, let S ⊂ Q → W* be some superposition and let φ_S be a formula of the following form:

ε < S(q₁) < S(q₂) = S(q₃) < ... < ... = ... < ... = S(qₙ)

where q₁...qₙ are all the states included in the given S. Let Φ be the set of formulas for all possible superpositions. We can treat Φ as equivalence classes for Q → W*. Note that it's enough to only consider Q → W* instead of Q × W*, because (as shown a few paragraphs before) the weights are monotonous and we can remove all but the smallest one. Having said all this, put Q′ = Q × Φ. The extra Q is needed because we want to pick one representative state q from every formula φ. We can immediately remove all those elements (q, φ) of Q′ for which q cannot be found in φ. The state q will be used to keep track of D. Hence a superposition S′ in N that corresponds to Q′ × D translates to Q × {φ} × D, which can be seen as an entire class of superpositions Q × W* × D.

The set of states used in any given φ determines some configuration K_φ.
For every two states (q₁, φ₁), (q₂, φ₂) ∈ Q′ we put a transition from (q₁, φ₁) to (q₂, φ₂) over symbol σ with output d, whenever the configuration K_{φ₁} transitions to K_{φ₂} over σ (formally δ_{W*×D}(K_{φ₁}, σ) = K_{φ₂}) and the state q₁ itself also transitions to q₂ (formally (q₁, w, σ, d, q₂) ∈ δ) and the formula φ₂ indeed holds true (after transitioning from φ₁ over w). In a moment some of those transitions will need to be removed in order to simulate the effect of erased weights. Before that, we should first add one more extra state f to Q′, which will be the only accepting state of N. Every time we put a transition from (q₁, φ₁) to (q₂, φ₂) and q₂ is an accepting state, we need to put the exact same transition from (q₁, φ₁) to f. (This way we can simulate an ε-transition from (q₂, φ₂) to f.) For every initial state qᵢ of M, we designate (qᵢ, φ) as an initial state of N, where φ is the formula

ε = S(q₁) = ... = S(qᵢ) = ... = S(qₙ)

and {q₁, ..., qₙ} is the set of initial states I. If any initial state is also an accepting state, then we additionally set f as an initial state of N.

Finally, the last step of the conversion is to find all conflicting states and remove the transitions with higher weights. Recall that if there are two states (q₁, φ₁), (q₂, φ₂) ∈ Q′ transitioning to the same third state q′ ∈ Q′ over the same symbol σ, then we call (q₁, φ₁) and (q₂, φ₂) conflicting. Remember that every transition in δ′ is a "copy" of some weighted transition in δ (including those leading to f). Let's say that w₁ is the weight that "would be" put between (q₁, φ₁) and q′ if we hadn't erased it. Similarly for w₂ and (q₂, φ₂). Next we need to look up whether, according to φ, the state q₁ carries a lower or higher weight than q₂. This, together with w₁ and w₂, gives us enough information to decide which of the transitions should be erased (if any).

This concludes the construction of N.
Note that N is nondeterministic and it's not possible to reach a configuration of Q′ in which two states would have a different φ. (This does not include f, which doesn't have any φ associated with it. There is no problem, because f has no outgoing transitions.)

A lexicographic transducer M is said to be functional when the relation min_{D→W*} L(M)\W* is functional. Theorem 4 together with theorem 5 tells us that every functional M (after removing dead-end states) has all reachable superpositions of the form Q × W* → D. After removing all but the lowest weight (due to monotonicity), that leaves only Q → W* × D. This implies that any time there are two conflicting states, either the weights on the transitions are different, or they are the same but then the associated D outputs are also the same. If there was a conflicting pair of states with equal weights but different D, that would break the functional nature of the automaton and lead to ambiguous output.

Suppose that the automaton M has no reachable conflicting states with equal weights and only one single accepting state. Then M is guaranteed to be functional. Such automata are called strongly functional. The lexicographic tropical semiring becomes "unnecessary" because the formula

b₁w₁ > b₂w₂ ⟺ w₁ > w₂ or (w₁ = w₂ and b₁ > b₂)

always falls into the left side of the "or" and the recursion on the right never happens (the weights w₁, w₂ of conflicting transitions are never equal). Therefore it's not necessary to keep the history of weights W* in the superposition Q → W* × D. They can be dropped altogether and all computation can be carried out with only Q → D.
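The evaluation of a strongly functional transducer can be sketched directly: keep only a map from states to accumulated outputs, and resolve every conflict by the transition weights alone, since those are guaranteed to differ. This is a hypothetical Python rendering (not the paper's construction); δ holds tuples (q, w, σ, d, q′) and min is the semiring addition:

```python
def step(delta, S, sym):
    """One step on superposition S, a dict state -> accumulated output."""
    best = {}   # target state -> (weight, output); minimal weight wins
    for (q, w, s, d, q2) in delta:
        if s == sym and q in S:
            if q2 not in best or w < best[q2][0]:
                best[q2] = (w, S[q] + d)   # conflicting weights never tie
    return {q2: out for q2, (w, out) in best.items()}

def run(delta, I, final, x):
    """Output of the (unique) accepting state after reading x, or None."""
    S = {q: "" for q in I}
    for sym in x:
        S = step(delta, S, sym)
    return S.get(final)
```

For example, two transitions over 'a' into the same state with weights 1 and 2 conflict, and only the output carried by the weight-1 transition survives.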
Another special property of such M is that it's "deterministic in reverse", that is, given a sequence of configurations K₁, K₂, ..., Kₙ for each step σ₁, σ₂, ..., σₙ of computation, it can be backtracked from the accepting state back to the initial state in a deterministic way, because when given a particular state qᵢ in Kᵢ (qₙ is the unique final state) and σᵢ, there is always only one smallest weight that could lead to qᵢ from some state of Kᵢ₋₁. Moreover, the unique final state is not a limitation, because nondeterminism can be used to simulate ε-transitions (similarly to the way it was done in theorem 5) or a special end-marker could be introduced. Now it can be shown that the weights of such automata can be erased in a simpler way than in theorem 5.

Theorem 6 (Weights can be erased - strongly functional case). Let M be some lexicographic transducer over A = W* × Σ* × D that has only one final state and no conflicting states with equal weights. Then there exists an automaton N over Σ* × D equivalent to min_{D→W*} L(M)\W*.

Proof: Similar to the previous case, but this time we put Q′ = 2^Q × Q ∪ {f}. If S is a superposition in M, we know that we only need to keep track of Q → D, instead of Q × W* × D. Therefore we can convert S to a superposition S′ of N by setting ((K_S, q), d) ∈ S′ for every (q, d) ∈ S, where K_S is the configuration corresponding to S.

We put a transition from (K₁, q₁) to (K₂, q₂) in Q′ over σ with output d, whenever K₁ transitions to K₂ over σ (formally δ_{W*×D}(K₁, σ) = K₂) and q₁ transitions to q₂ over σ and d (formally (q₁, w, σ, d, q₂) ∈ δ).

We make a state (K, q) of N final whenever q is final. The initial states of N are all of the form (I, q) for each q in I. The last step is to find all conflicting states, that is, two states (K, q₁), (K, q₂) ∈ Q′ having the same configuration K and transitioning to some q′ ∈ Q′ over the same σ.
The transitions ((K, q₁), σ, d, q′) and ((K, q₂), σ, d, q′) are called conflicting transitions. Every time we encounter them, we delete the one with the higher weight. We can find out their weights by looking up which transitions from δ led to their creation. It will never happen that two conflicting transitions have equal weights.

Lexicographic transducers (even the strongly functional ones) can be exponentially smaller than the smallest nondeterministic equivalent 2-tape automata. To show this we will need the help of the Myhill-Nerode theorem.

Let L ⊂ B → C be some (partial) function and let b₁, b₂ ∈ B. An element (b, c) ∈ A is a distinguishing extension up to B of b₁ and b₂ if exactly one of (b₁b, L(b₁)c) or (b₂b, L(b₂)c) belongs to L. Define an equivalence relation =_L on A such that a₁ =_L a₂ if and only if there is no distinguishing extension up to B for a₁ and a₂.

Theorem 7 (Generalized Myhill-Nerode theorem). Let L ⊂ B → C. Assume that L(bb′) = c′ ≠ ∅ and L(b) = c ≠ ∅ implies that c is a prefix of c′ (preservation of prefixes holds). L can be recognized by an automaton deterministic up to B if and only if there are only finitely many equivalence classes induced by =_L.

Proof: (⟸) First assume there are finitely many equivalence classes. Let G be the smallest generator of B. Then build an automaton by treating every class as a state of Q. Put a transition from class q to q′ over b′ ∈ G \ {1_B} whenever there exist (b, c) ∈ q and (bb′, c′) ∈ q′. Preservation of prefixes guarantees the existence of a suffix s such that cs = c′. This suffix shall be used as the transition output. By taking b′ only from G \ {1_B} we ensure that the automaton is sequential up to B and has no ε-transitions. The class that contains 1_A is designated as the unique initial state (hence the automaton is deterministic up to B).
All the classes intersecting L are accepting states (note that if q ∩ L ≠ ∅ then q ⊂ L).

(⟹) Conversely, if there is an automaton deterministic up to B and recognizing L, then one can find a homomorphism from the states of the machine to the equivalence classes. (The exact proof is well covered in most introductory courses on automata theory and this generalized version is largely analogous, so we won't elaborate on it much further.)

The above theorem no longer works when the automaton is not deterministic at least up to B.

Theorem 8.
There exists a family of strongly functional lexicographic transducers such that their equivalent minimal 2-tape nondeterministic automata (after erasing weights) are exponentially larger.

Proof:
Define a family of strongly functional lexicographic transducers in such a way that for every i ≥ 1 there is one defined on i states. The number of states of the minimal equivalent 2-tape automaton is O(2ⁱ). Figure 1 presents a way to build such automata. State q₀ is initial. Using strings from {0,1}* one can obtain any configuration of the states q₁ to qₙ. Let's associate each configuration with a string z ∈ {0,1}ⁿ (for instance z = 011 would be the configuration {q₂, q₃}). That gives 2ⁿ possible strings. State qₙ₊₁ is accepting and all the states q₁ ... qₙ are connected to it. Essentially the relation described by this automaton is a subset of {0,1}⁺ × {y₁, ..., yₙ}. Also suppose that the weights w₁ ... wₙ are in strictly ascending order. Then the automaton maps every z determined by x ∈ {0,1}⁺ to some (x, yₖ) such that k indicates the least significant set bit in z. For instance, suppose n = 4 and x = 0011 ∼ z = 1101; then we obtain a sequence of pairs (x₁, yₖ₁), (x₂, yₖ₂), (x₃, yₖ₃), (x₄, yₖ₄), (x₅, ∅). Notice that one can reconstruct z from such a sequence of y's.

Every 2-tape automaton that has disjoint alphabets on each tape can be simulated by a single-tape automaton reading the union of those alphabets. In this case such a union is D = {0, 1, y₁, ..., yₙ}. This way the language becomes a subset of {0,1}⁺{y₁, ..., yₙ} and every pair (x, yₖ) becomes a string xyₖ. Using the Myhill-Nerode theorem, it can be seen that no two x strings that map to two different z are equivalent, hence the smallest deterministic FSA must have at least 2ⁿ states. Call this minimal automaton A. The most difficult problem is to show that no nondeterministic automaton polynomially smaller than A can be built. The rigorous proof can be obtained with the help of Theorem 7 and Lemma 7 presented by Kameda and Weiner [18]. We can build the RAM using D(A) and D(←A) (all defined in [18]). All configurations of the states q₁ ...
qₙ have different succeeding events [18] and none of them is a subset of the other (because there exists a bijection between z and the sequence of y's produced by (x₁, yₖ₁), (x₂, yₖ₂), ...). Hence the minimal legitimate grid cannot be extended for any of them and the nondeterministic FSA cannot be much smaller than 2ⁿ.

Fig. 1. In this sketch of a lexicographic transducer all the weights w₁, ..., wₙ are distinct. On many other transitions, weights were omitted, as they don't play any role and could be arbitrary.

One can easily notice that every 2-tape automaton can be treated like a lexicographic transducer with all weights equal, therefore the opposite of Theorem 8 doesn't hold (there is no family of 2-tape automata such that the lexicographic transducers would be larger).

It's possible to decide whether lexicographic transducers are functional using a quadratic procedure analogous to the one for unweighted transducers [4]. In the case of strongly functional lexicographic transducers the procedure becomes even simpler, as instead of using "Advance & Delay" [4], it's enough to square the automaton and make sure that it has no weight-conflicting transitions.

When introducing the transformation semigroup, the strings in A were treated as partial functions Q → Q. Every such function takes some configuration and produces a new one. When A = B × C, superpositions can be used instead. Every string b in B becomes a function b : Q × C → Q × C. In the case of nondeterministic automata, there is b : 2^{Q×C} → 2^{Q×C}. In the particular case of strongly functional lexicographic transducers over A = W* × Σ* × D, every x in Σ* becomes a function x : 2^{Q→D} → 2^{Q→D}. (Notice how W wasn't included, because the attempt is not to model L(M) but rather the language after minimization min_{D→W*} L(M) \ W*.)

This leads to the conclusion that lexicographic transducers are not a specialization, but rather a generalization of transducers, because the weights W are merely a tool for taking greater control over the transformation semigroup. This could not be said about other types of weighted transducers, as their transformation semigroups cannot be expressed without also including information about the weights accumulated in each superposition (just using Q × D without the tape of weights would not be enough).

This can shed some additional light on Theorem 8. Even though nondeterministic single-tape automata can have exponentially fewer states than deterministic ones, the number of all reachable configurations would still be equal in both. The smallest subset of 2^Q for the transformation semigroup 2^Q → 2^Q would be isomorphic to the smallest set Q in the deterministic Q → Q. This situation is different for multitape automata, and it's what lexicographic transducers try to take advantage of.

III. CONCLUSIONS
Weighted automata don't have to be an alien concept that requires special extensions. All weights can be viewed as tapes over alphabets with some particular properties. This more general approach gives us the necessary foundations for defining lexicographic transducers. They were invented by trying to generalize and simplify weighted automata. There is yet a lot to discover. Theorem 8 gives certain clues that perhaps they could be inferred [9] more efficiently, or at least generalize better. Solomonoff's theory of inductive inference [19][11][20] says that simpler and shorter automata should be the preferred solution to inference problems. Lexicographic transducers can express complex "replace-all" functions in simpler and more reliable ways than other weighted automata, specifically thanks to the lack of commutativity in the lexicographic tropical semiring. They seem perfectly suited for tasks that require the automaton to "forget the history" of its weights.

ACKNOWLEDGMENT
The author would like to thank Piotr Radwan for all the great inspiration.

REFERENCES

[1] J.-E. Pin, Mathematical Foundations of Automata Theory. American Mathematical Society, 2017.
[2] S. Eilenberg, Automata, Languages and Machines, Vol. A. Academic Press, 1974.
[3] S. Mihov and K. U. Schulz, Finite-State Techniques: Automata, Transducers and Bimachines, ser. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 2019.
[4] M.-P. Béal, O. Carton, C. Prieur, and J. Sakarovitch, "Squaring transducers: An efficient procedure for deciding functionality and sequentiality of transducers," in LATIN 2000: Theoretical Informatics, G. H. Gonnet and A. Viola, Eds. Berlin, Heidelberg: Springer, 2000, pp. 397–406.
[5] E. M. Gurari and O. H. Ibarra, "A note on finite-valued and finitely ambiguous transducers," Math. Systems Theory, 1983.
[6] S. Eilenberg, Automata, Languages and Machines, Vol. B. Academic Press, 1976.
[7] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," AT&T Labs Research, 2008.
[8] M. Mohri, Weighted Finite-State Transducer Algorithms: An Overview. Springer, 2004.
[9] C. de la Higuera, Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.
[10] H. I. Akram, C. de la Higuera, and C. Eckert, "Actively learning probabilistic subsequential transducers," JMLR: Workshop and Conference Proceedings, 2012.
[11] A. Shen, V. A. Uspensky, and N. Vereshchagin, Kolmogorov Complexity and Algorithmic Randomness. IRIF, 2019.
[12] V. Diekert, The Book of Traces. WSPC, 1995.
[13] M. Droste, W. Kuich, and H. Vogler, Handbook of Weighted Automata, 2009.
[14] M. Droste and D. Kuske, "Weighted automata," Institut für Informatik, Universität Leipzig, 2010.
[15] A. Salomaa and M. Soittola, Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, New York.
[16] K. Meer and A. Naif, "Generalized finite automata over real and complex numbers," vol. 591, 2014.
[17] B. Roark, R. Sproat, and I. Shafran, "Lexicographic semirings for exact automata encoding of sequence models," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
[18] T. Kameda and P. Weiner, "On the state minimization of nondeterministic finite automata," IEEE Transactions on Computers, 1970.
[19] R. J. Solomonoff, "A formal theory of inductive inference. Part I," Information and Control, 1964.
[20] R. J. Solomonoff, "A formal theory of inductive inference. Part II," Information and Control, 1964.