Online Non-Additive Path Learning under Full and Partial Information

Abstract

We study the problem of online path learning with non-additive gains, which is a central problem appearing in several applications, including ensemble structured prediction. We present new online algorithms for path learning with non-additive count-based gains for the three settings of full information, semi-bandit and full bandit with very favorable regret guarantees. A key component of our algorithms is the definition and computation of an intermediate context-dependent automaton that enables us to use existing algorithms designed for additive gains. We further apply our methods to the important application of ensemble structured prediction. Finally, beyond count-based gains, we give an efficient implementation of the EXP3 algorithm for the full bandit setting with an arbitrary (non-additive) gain.


arXiv preprint, cs.LG, March

Corinna Cortes

Google Research, New York, NY

Vitaly Kuznetsov

Google Research, New York, NY

Mehryar Mohri

Google Research and Courant Institute, New York, NY

Holakou Rahmanian ∗

Microsoft Corporation, Redmond, WA

Manfred K. Warmuth

Google Inc., Zürich and UC Santa Cruz, CA


Keywords: online learning, non-additive gains, finite-state automaton

1. Introduction

One of the core combinatorial online learning problems is that of learning a minimum loss path in a directed graph. Examples can be found in structured prediction problems such as machine translation, automatic speech recognition, optical character recognition and computer vision. In these problems, predictions (or predictors) can be decomposed into possibly overlapping substructures that may correspond to words, phonemes, characters, or image patches. They can be represented in a directed graph where each edge represents a different substructure.

The number of paths, which serve as experts, is typically exponential in the size of the graph. Extensive work has been done to design efficient algorithms when the loss is additive, that is, when the loss of the path is the sum of the losses of the edges along that path.

∗ Research done in part while the author was visiting Courant Institute and interning at Google Research, NYC as a Ph.D. student from UCSC.

© C. Cortes, V. Kuznetsov, M. Mohri, H. Rahmanian & M.K. Warmuth.

[Figure 1 graph: six consecutive positions, each with two parallel edges labeled He/She, would/would, like/love, to/to, have/drink, tea/chai.]

Figure 1: Combining outputs of two different translators (blue and red). There are 64 interleaved translations represented as paths. The BLEU score measures the overlap in n-grams between sequences. Here, an example of a 4-gram is “like-to-drink-tea”.

Several efficient algorithms with favorable guarantees have been designed both for the full information setting (Takimoto and Warmuth, 2003; Kalai and Vempala, 2005; Koolen et al., 2010) and different bandit settings (György et al., 2007; Cesa-Bianchi and Lugosi, 2012) by exploiting the additivity of the loss.

However, in modern machine learning applications such as machine translation, speech recognition and computational biology, the loss of each path is often not additive in the edges along the path. For instance, in machine translation, the BLEU score similarity determines the loss. The BLEU score can be closely approximated by the inner product of the count vectors of the n-gram occurrences in two sequences, where typically n = 4 (see Figure 1). In computational biology tasks, the losses are determined based on the inner product of the (discounted) count vectors of occurrences of n-grams with gaps (gappy n-grams). In other applications, such as speech recognition and optical character recognition, the loss is based on the edit-distance. Since the performance of the algorithms in these applications is measured via non-additive loss functions, it is natural to seek learning algorithms optimizing these losses directly. This motivates our study of online path learning for non-additive losses.

One of the applications of our algorithm is ensemble structured prediction. Online learning of ensembles of structured prediction experts can significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision (Cortes et al., 2014).
In general, ensemble structured prediction is motivated by the fact that one particular expert may be better at predicting one substructure while some other expert may be more accurate at predicting another substructure. Therefore, it is desirable to interleave the substructure predictions of all experts to obtain a more accurate prediction. This application becomes particularly important in the bandit setting. Suppose one wishes to combine the outputs of different translators as in Figure 1. Instead of comparing oneself to the outputs of the best translator, the comparator is the best “interleaved translation” where each word in the translation can come from a different translator. However, computing the loss or the gain (such as BLEU score) of each path can be costly and may require the learner to resort to learning from partial feedback only.

Online path learning with non-additive losses has been previously studied by Cortes et al. (2015). That work focuses on the full information case, providing efficient implementations of the Expanded Hedge (Takimoto and Warmuth, 2003) and Follow-the-Perturbed-Leader (Kalai and Vempala, 2005) algorithms under some technical assumptions on the outputs of the experts.

In this paper, we design algorithms for online path learning with non-additive gains or losses in the full information setting, as well as in several bandit settings specified in detail in Section 2. In the full information setting, we design an efficient algorithm that enjoys regret guarantees that are more favorable than those of Cortes et al. (2015), while not requiring any additional assumption. In the bandit settings, our algorithms, to the best of our knowledge, are the first efficient methods for learning with non-additive losses.

The key technical tools used in this work are weighted automata and transducers (Mohri, 2009). We transform the original path graph $A$ (e.g. Figure 1) into an intermediate graph $A'$.
The paths in $A$ are mapped to the paths in $A'$, but now the losses in $A'$ are additive along the paths. Remarkably, the size of $A'$ does not depend on the size of the alphabet (word vocabulary in translation tasks) from which the output labels of edges are drawn. The construction of $A'$ is highly non-trivial and is our primary contribution. This alternative graph $A'$, in which the losses are additive, enables us to extend many well-known algorithms in the literature to the path learning problem.

The paper is organized as follows. We introduce the path learning setup in Section 2. In Section 3, we explore the wide family of non-additive count-based gains and introduce the alternative graph $A'$ using automata and transducer tools. We present our algorithms in Section 4 for the full information, semi-bandit and full bandit settings for the count-based gains. Next, we extend our results to gappy count-based gains in Section 5. The application of our method to ensemble structured prediction is detailed in Appendix A. In Appendix B, we go beyond count-based gains and consider arbitrary (non-additive) gains. Even with no assumption about the structure of the gains, we can efficiently implement the EXP3 algorithm in the full bandit setting. Naturally, however, the regret bounds for this algorithm are weaker, since no special structure of the gains can be exploited in the absence of any assumption.

2. Basic Notation and Setup

We describe our path learning setup in terms of finite automata. Let $A$ denote a fixed acyclic finite automaton. We call $A$ the expert automaton. $A$ admits a single initial state and one or several final states, which are indicated by bold and double circles, respectively (see Figure 2(a)). Each transition of $A$ is labeled with a unique name. Denote the set of all transition names by $E$. An automaton with a single initial state is deterministic if no two outgoing transitions from a given state admit the same name. Thus, our automaton $A$ is deterministic by construction since the transition names are unique. An accepting path is a sequence of transitions from the initial state to a final state. The expert automaton $A$ can be viewed as an indicator function over strings in $E^*$ such that $A(\pi) = 1$ iff $\pi$ is an accepting path. Each accepting path serves as an expert and we equivalently refer to it as a path expert. The set of all path experts is denoted by $P$.

At each round $t = 1, \ldots, T$, each transition $e \in E$ outputs a symbol from a finite non-empty alphabet $\Sigma$, denoted by $\mathrm{out}_t(e) \in \Sigma$. The prediction of each path expert $\pi \in E^*$ at round $t$ is the sequence of output symbols along its transitions at that round and is denoted by $\mathrm{out}_t(\pi) \in \Sigma^*$. We also denote by $\mathrm{out}_t(A)$ the automaton with the same topology as $A$ where each transition $e$ is labeled with $\mathrm{out}_t(e)$ (see Figure 2(b)). At each round $t$, a target sequence $y_t \in \Sigma^*$ is presented to the learner. The gain/loss of each path expert $\pi$ is $U(\mathrm{out}_t(\pi), y_t)$, where $U \colon \Sigma^* \times \Sigma^* \to \mathbb{R}_{\ge 0}$. Our focus is on functions $U$ that are not necessarily additive along the transitions in $A$. For example, $U$ can be either a distance function (e.g. edit-distance) or a similarity function (e.g. $n$-gram gain with $n \ge 2$).
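The setup above can be made concrete with a minimal sketch. The toy automaton, the names `TRANSITIONS`, `accepting_paths` and `out_path`, and the output symbols are all hypothetical and chosen only for illustration; they are not taken from the paper's figures.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy expert automaton: transition name -> (source state, destination state).
TRANSITIONS: Dict[str, Tuple[int, int]] = {
    "e1": (0, 1), "e2": (0, 1),
    "e3": (1, 2), "e4": (1, 2),
}
INITIAL: int = 0
FINAL: Set[int] = {2}


def accepting_paths(transitions: Dict[str, Tuple[int, int]],
                    initial: int, final: Set[int]) -> List[List[str]]:
    """Enumerate all accepting paths (path experts) of an acyclic automaton."""
    paths: List[List[str]] = []

    def extend(state: int, prefix: List[str]) -> None:
        if state in final:
            paths.append(list(prefix))
        for name, (src, dst) in sorted(transitions.items()):
            if src == state:
                prefix.append(name)
                extend(dst, prefix)
                prefix.pop()

    extend(initial, [])
    return paths


def out_path(path: List[str], out_t: Dict[str, str]) -> str:
    """Prediction out_t(pi): concatenation of the output symbols along the path."""
    return "".join(out_t[e] for e in path)


paths = accepting_paths(TRANSITIONS, INITIAL, FINAL)
out_t = {"e1": "a", "e2": "b", "e3": "b", "e4": "a"}  # outputs at one round t
predictions = {tuple(p): out_path(p, out_t) for p in paths}
```

Explicit enumeration is only feasible for tiny examples, since the number of path experts is typically exponential in the size of the automaton.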

[Figure 2, panels (a), (b), (c): an expert automaton over states 0 to 4; graphics not reproduced.]

Figure 2: (a) The expert automaton denoted by $A$, labeled with transition names. (b) The output of the expert automaton at round $t$, denoted by $\mathrm{out}_t(A)$, labeled with the outputs $\mathrm{out}_t(e)$ for each transition $e$. (c) The name and output of each transition together, separated by a ‘:’.

[Figure 3, panels (a), (b), (c): the automaton of Figure 2 with revealed outputs shown and unrevealed outputs marked ‘?’; graphics not reproduced.]

Figure 3: Information revealed in different settings: (a) full information, (b) semi-bandit, (c) full bandit. The name of each transition $e$ and its output symbol (if revealed) are shown next to it, separated by a ‘:’. The blue path indicates the path expert predicted by the learner at round $t$.

We consider standard online learning scenarios of prediction with path experts. At each round $t \in [T]$, the learner picks a path expert $\pi_t$ and predicts with its prediction $\mathrm{out}_t(\pi_t)$. The learner receives the gain $U(\mathrm{out}_t(\pi_t), y_t)$. Depending on the setting, the adversary may reveal some information about $y_t$ and the output symbols of the transitions (see Figure 3). In the full information setting, $y_t$ and $\mathrm{out}_t(e)$ are revealed to the learner for every transition $e$ in $A$. In the semi-bandit setting, the adversary reveals $y_t$ and $\mathrm{out}_t(e)$ for every transition $e$ along $\pi_t$. In the full bandit setting, $U(\mathrm{out}_t(\pi_t), y_t)$ is the only information that is revealed to the learner. The goal of the learner is to minimize the regret, which is defined as the cumulative gain of the best path expert chosen in hindsight minus the cumulative expected gain of the learner.
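Written out, the regret described above takes the following form. The symbol $R_T$ is a name introduced here for illustration only, and the expectation is assumed to be taken over the learner's internal randomization.

```latex
R_T \;=\; \max_{\pi^* \in P} \sum_{t=1}^{T} U\big(\mathrm{out}_t(\pi^*), y_t\big)
\;-\; \sum_{t=1}^{T} \mathbb{E}\Big[ U\big(\mathrm{out}_t(\pi_t), y_t\big) \Big]
```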

3. Count-Based Gains

Many of the most commonly used non-additive gains in applications belong to the broad family of count-based gains, which are defined in terms of the number of occurrences of a fixed set of patterns, $\theta_1, \theta_2, \ldots, \theta_p$, in the sequence output by a path expert. These patterns may be $n$-grams, that is sequences of $n$ consecutive symbols, as in a common approximation of the BLEU score in machine translation, a set of relevant subsequences of variable length in computational biology, or patterns described by complex regular expressions in pronunciation modeling.

For any sequence $y \in \Sigma^*$, let $\Theta(y) \in \mathbb{R}^p$ denote the vector whose $k$th component is the number of occurrences of $\theta_k$ in $y$, $k \in [p]$. The count-based gain function $U$ at round $t$ for a path expert $\pi$ in $A$ given the target sequence $y_t$ is then defined as a dot product:

$$U(\mathrm{out}_t(\pi), y_t) := \Theta(\mathrm{out}_t(\pi)) \cdot \Theta(y_t) \ge 0. \qquad (1)$$

Such gains are not additive along the transitions and the standard online path learning algorithms for additive gains cannot be applied. Consider, for example, the special case of 4-gram-based gains in Figure 1. These gains cannot be expressed additively if the target sequence is, for instance, “He would like to eat cake” (see Appendix F). The challenge of learning with non-additive gains is even more apparent in the case of gappy count-based gains, which allow for gaps of varying length in the patterns of interest. We defer the study of gappy count-based gains to Section 5.

How can we design algorithms for online path learning with such non-additive gains? Can we design algorithms with favorable regret guarantees for all three settings of full information, semi-bandit and full bandit? The key idea behind our solution is to design a new automaton $A'$ whose paths can be identified with those of $A$ and, crucially, whose gains are additive. We will construct $A'$ by defining a set of context-dependent rewrite rules, which can be compiled into a finite-state transducer $T_A$ defined below.
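The count-based gain of Equation (1) can be sketched in a few lines. This is a minimal illustration for patterns given as plain strings; the function names `count_vector` and `count_gain` and the example sequences are assumptions made here, not part of the paper.

```python
from itertools import product
from typing import List, Sequence


def count_vector(y: str, patterns: Sequence[str]) -> List[int]:
    """Theta(y): number of (possibly overlapping) occurrences of each pattern in y."""
    return [
        sum(1 for i in range(len(y) - len(p) + 1) if y[i:i + len(p)] == p)
        for p in patterns
    ]


def count_gain(prediction: str, target: str, patterns: Sequence[str]) -> int:
    """U(prediction, target) = Theta(prediction) . Theta(target), as in Equation (1)."""
    theta_pred = count_vector(prediction, patterns)
    theta_target = count_vector(target, patterns)
    return sum(a * b for a, b in zip(theta_pred, theta_target))


# All bigrams over the alphabet {a, b}, in the order aa, ab, ba, bb.
bigrams = ["".join(s) for s in product("ab", repeat=2)]
theta_target = count_vector("aba", bigrams)   # counts of aa, ab, ba, bb in "aba"
gain = count_gain("abab", "aba", bigrams)     # dot product of the two count vectors
```

Note that the vector has one component per pattern, so its dimension $p$ grows as $|\Sigma|^n$ for $n$-grams; this direct representation is only practical for small alphabets.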
The context-dependent automaton $A'$ can then be obtained by composition of the transducer $T_A$ with $A$. In addition to playing a key role in the design of our algorithms (Section 4), $A'$ provides a compact representation of the gains since its size is substantially less than the dimension $p$ (number of patterns).

3.1. Context-Dependent Rewrite Rules

We will use context-dependent rewrite rules to map $A$ to the new representation $A'$. These are rules that admit the following general form:

$$\phi \to \psi \,/\, \lambda \,\_\, \rho,$$

where $\phi$, $\psi$, $\lambda$, and $\rho$ are regular expressions over the alphabet of the rules. These rules must be interpreted as follows: $\phi$ is to be replaced by $\psi$ whenever it is preceded by $\lambda$ and followed by $\rho$. Thus, $\lambda$ and $\rho$ represent the left and right contexts of application of the rules. Several types of rules can be considered depending on their being obligatory or optional, and on their direction of application, from left to right, right to left or simultaneous application (Kaplan and Kay, 1994). We will only be considering rules with simultaneous applications. Such context-dependent rules can be efficiently compiled into a finite-state transducer (FST), under the technical condition that they do not rewrite their non-contextual part (Mohri and Sproat, 1996; Kaplan and Kay, 1994). An FST $T$ over an input alphabet $\Sigma$ and output alphabet $\Sigma'$ defines an indicator function over the pairs of strings in $\Sigma^* \times \Sigma'^*$. Given $x \in \Sigma^*$ and $y \in \Sigma'^*$, we have $T(x, y) = 1$ if there exists a path from an initial state to a final state with input label $x$ and output label $y$, and $T(x, y) = 0$ otherwise.

1. This can be extended to the case of weighted occurrences where more emphasis is assigned to some patterns $\theta_k$ whose occurrences are then multiplied by a factor $\alpha_k > 1$, and less emphasis to others.

2. Additionally, the rules can be augmented with weights, which can help us cover the case of weighted count-based gains, in which case the result of the compilation is a weighted transducer (Mohri and Sproat, 1996). Our algorithms and theory can be extended to that case.

Figure 4: (a) An expert automaton $A$; (b) associated context-dependent transducer $T_A$ for bigrams. $\epsilon$ denotes the empty string. Inputs and outputs are written next to the transitions, separated by a ‘:’.

To define our rules, we first introduce the alphabet $E'$ as the set of transition names for the target automaton $A'$. These capture all possible contexts of length $r$, where $r$ is the length of pattern $\theta_k$:

$$E' = \big\{ e_1 \cdots e_r \;\big|\; e_1 \cdots e_r \text{ is a path segment of length } r \text{ in } A,\ r \in \{|\theta_1|, \ldots, |\theta_p|\} \big\},$$

where the transition names $e_1, \ldots, e_r \in E$ are joined together to form one single symbol in $E'$. We will have one context-dependent rule of the following form for each element $e_1 \cdots e_r \in E'$:

$$e_1 \cdots e_r \to e_1 \cdots e_r \,/\, \epsilon \,\_\, \epsilon, \qquad (2)$$

where the left-hand side is a sequence of $r$ symbols of $E$ and the right-hand side is the corresponding single symbol of $E'$. Thus, in our case, the left- and right-contexts are the empty strings, meaning that the rules can apply (simultaneously) at every position. In the special case where the patterns $\theta_k$ are the set of $n$-grams, $r$ is fixed and equal to $n$. Figure 4 shows the result of the rule compilation in that case for $n = 2$. This transducer outputs the single symbol $e_i e_j$ whenever two transitions $e_i$ and $e_j$ are found consecutively, and otherwise outputs the empty string. We will denote the resulting FST by $T_A$.

3.2. Context-Dependent Automaton $A'$

To construct the context-dependent automaton $A'$, we will use the composition operation. The composition of $A$ and $T_A$ is an FST denoted by $A \circ T_A$ and defined as the following product of two 0/1 values:

$$\forall x \in E^*,\ \forall y \in E'^*\colon\quad (A \circ T_A)(x, y) := A(x) \cdot T_A(x, y).$$
There is an efficient algorithm for the composition of FSTs and automata (Pereira and Riley, 1997; Mohri et al., 1996; Mohri, 2009), whose worst-case complexity is in $O(|A|\,|T_A|)$. The automaton $A'$ is obtained from the FST $(A \circ T_A)$ by projection, that is by simply omitting the input label of each transition and keeping only the output label. Thus, if we denote by $\Pi$ the projection operator, then $A'$ is defined as

$$A' = \Pi(A \circ T_A).$$

Observe that $A'$ admits a fixed topology (states and transitions) at any round $t \in [T]$. It can be constructed in a pre-processing stage using the FST operations of composition and projection. Additional FST operations such as $\epsilon$-removal and minimization can help further optimize the automaton obtained after projection (Mohri, 2009). Proposition 1, proven in Appendix D, ensures that for every accepting path $\pi$ in $A$, there is a unique corresponding accepting path in $A'$. Figure 5 shows the automata $A$ and $A'$ in a simple case and how a path $\pi$ in $A$ is mapped to another path $\pi'$ in $A'$.

3. Context-dependent rewrite rules are powerful tools for identifying different patterns using their left- and right-contexts. For our application of count-based gains, however, identifying these patterns is independent of their context and we do not need to fully exploit the strength of these rewrite rules.

Figure 5: (a) An example of the expert automaton $A$. (b) The associated context-dependent automaton $A'$ with bigrams as patterns. A path $\pi$ of three transitions in $A$ and its corresponding path $\pi'$ of two bigram transitions in $A'$ are marked in blue.

Proposition 1

Let $A$ be an expert automaton and let $T_A$ be a deterministic transducer representing the rewrite rules (2). Then, for each accepting path $\pi$ in $A$, there exists a unique corresponding accepting path $\pi'$ in $A' = \Pi(A \circ T_A)$.

The size of the context-dependent automaton $A'$ depends on the expert automaton $A$ and the lengths of the patterns. Notice that, crucially, its size is independent of the size of the alphabet $\Sigma$. Appendix A analyzes more specifically the size of $A'$ in the important application of ensemble structured prediction with $n$-gram gains.

At any round $t \in [T]$ and for any $e_1 \cdots e_r \in E'$, let $\mathrm{out}_t(e_1 \cdots e_r)$ denote the sequence $\mathrm{out}_t(e_1) \cdots \mathrm{out}_t(e_r)$, that is the sequence obtained by concatenating the outputs of $e_1, \ldots, e_r$. Let $\mathrm{out}_t(A')$ be the automaton with the same topology as $A'$ where each label $e' \in E'$ is replaced by $\mathrm{out}_t(e')$. Once $y_t$ is known, the representation $\Theta(y_t)$ can be found, and consequently, the additive contribution of each transition of $A'$ can be computed. The following theorem, which is proved in Appendix D, shows the additivity of the gains in $A'$. See Figure 6 for an example.

Theorem 2

At any round $t \in [T]$, define the gain $g_{e',t}$ of the transition $e' \in E'$ in $A'$ by $g_{e',t} := [\Theta(y_t)]_k$ if $\mathrm{out}_t(e') = \theta_k$ for some $k \in [p]$ and $g_{e',t} := 0$ if no such $k$ exists. Then, the gain of each path $\pi$ in $A$ at trial $t$ can be expressed as an additive gain of the corresponding unique path $\pi'$ in $A'$:

$$\forall t \in [T],\ \forall \pi \in P\colon\quad U(\mathrm{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$
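The additivity stated in Theorem 2 can be checked numerically in the special case where the patterns are all $n$-grams. The sketch below does not build $A'$ with FST operations; it assumes that the path $\pi'$ corresponding to $\pi = e_1 \cdots e_m$ is simply the sequence of its length-$n$ segments, and the function names and toy outputs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All length-n windows of a sequence, in order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def direct_gain(path: List[str], out_t: Dict[str, str], target: str, n: int) -> int:
    """U(out_t(pi), y_t) computed from Equation (1) with all n-grams as patterns."""
    pred_counts = Counter(ngrams([out_t[e] for e in path], n))
    target_counts = Counter(ngrams(list(target), n))
    return sum(c * target_counts[g] for g, c in pred_counts.items())


def additive_gain(path: List[str], out_t: Dict[str, str], target: str, n: int) -> int:
    """Sum of per-transition gains g_{e',t} along the corresponding path in A'."""
    target_counts = Counter(ngrams(list(target), n))
    total = 0
    for segment in ngrams(path, n):                 # one transition e' of pi' per segment
        label = tuple(out_t[e] for e in segment)    # out_t(e')
        total += target_counts[label]               # [Theta(y_t)]_k, or 0 if no pattern matches
    return total


out_t = {"e1": "a", "e2": "b", "e3": "a", "e4": "b"}  # hypothetical outputs at round t
path = ["e1", "e2", "e3", "e4"]                        # predicts "abab"
```

For this path and the target `"abab"`, both functions count each bigram of the prediction against the target counts, so the two values coincide, as the theorem states.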

4. Algorithms

In this section, we present algorithms and associated regret guarantees for online path learning with non-additive count-based gains in the full information, semi-bandit and full bandit settings. The key component of our algorithms is the context-dependent automaton $A'$. In what follows, we denote the length of the longest path in $A'$ by $K$, an upper bound on the gain of each transition in $A'$ by $B$, the number of path experts by $N$, and the number of transitions and states in $A'$ by $M$ and $Q$, respectively. We note that $K$ is at most the length of the longest path in $A$ since each transition in $A'$ admits a unique label.

Figure 6: (a) The automaton $\mathrm{out}_t(A)$ of $A$ in Figure 5(a), with bigram gains and $\Sigma = \{a, b\}$. (b) The automaton $\mathrm{out}_t(A')$ given $y_t = aba$. Here, the patterns are $(\theta_1, \theta_2, \theta_3, \theta_4) = (aa, ab, ba, bb)$, and thus $\Theta(y_t) = [0, 1, 1, 0]^\top$. The additive gain contributed by each transition $e' \in E'$ in $A'$ is written on it, separated by a comma from $\mathrm{out}_t(e')$.

Remark. The number of accepting paths in $A'$ is often equal to but sometimes less than the number of accepting paths in $A$. In some degenerate cases, several paths $\pi_1, \ldots, \pi_k$ in $A$ may correspond to one single path $\pi'$ in $A'$. This implies that $\pi_1, \ldots, \pi_k$ in $A$ will always consistently have the same gains in every round, namely the additive gain of $\pi'$ in $A'$. Thus, if $\pi'$ is predicted by the algorithm in $A'$, any of the paths $\pi_1, \ldots, \pi_k$ can be equivalently used for prediction in the original expert automaton $A$.

4. For example, in the case of $n$-gram gains, all the paths in $A$ with a length less than $n$ correspond to a path with empty output in $A'$ and will always have a gain of zero.

Koolen et al. (2010) gave an algorithm for online path learning with non-negative additive losses in the full information setting, the Component Hedge (CH) algorithm. For count-based losses, Cortes et al. (2015) provided an efficient Rational Randomized Weighted Majority (RRWM) algorithm. This algorithm requires the use of determinization (Mohri, 2009), which is only shown to have polynomial computational complexity under some additional technical assumptions on the outputs of the path experts. In this section, we present an extension of CH, the Context-dependent Component Hedge (CDCH), for the online path learning problem with non-additive count-based gains. CDCH admits more favorable regret guarantees than RRWM and can be efficiently implemented without any additional assumptions.

Our CDCH algorithm requires a modification of $A'$ such that all paths admit an equal number $K$ of transitions (same as the longest path). This modification can be done by adding at most $(K - 2)(Q - 2) + 1$ states and zero-gain transitions (György et al., 2007). Abusing the notation, we will denote this new automaton by $A'$ in this subsection. At each iteration $t$, CDCH maintains a weight vector $w_t$ in the unit-flow polytope $P$ over $A'$, which is a set of vectors $w \in \mathbb{R}^M$ satisfying the following conditions: (1) the weights of the outgoing transitions from the initial state sum up to one, and (2) for every non-final state, the sums of the weights of incoming and outgoing transitions are equal. For each $t \in \{1, \ldots, T\}$, we observe the gain of each transition $g_{t,e'}$ (written $g_{e',t}$ in Theorem 2), and define the loss of that transition as $\ell_{t,e'} = B - g_{t,e'}$. After observing the loss of each transition $e'$ in $A'$, CDCH updates each component of $w$ as $\hat{w}(e') \leftarrow w_t(e') \exp(-\eta\, \ell_{t,e'})$ (where $\eta$ is a specified learning rate), and sets $w_{t+1}$ to the relative entropy projection of the updated $\hat{w}$ back to the unit-flow polytope, i.e.

$$w_{t+1} = \operatorname*{argmin}_{w \in P} \sum_{e' \in E'} \Big( w(e') \ln \frac{w(e')}{\hat{w}(e')} + \hat{w}(e') - w(e') \Big).$$

CDCH predicts by decomposing $w_t$ into a convex combination of at most $|E'|$ paths in $A'$ and then sampling a single path according to this mixture as described below. Recall that each path in $A'$ identifies a path in $A$ which can be recovered in time $K$. Therefore, the inference step of the CDCH algorithm takes time at most polynomial in $|E'|$. To determine a decomposition, we find a path from the initial state to a final state with non-zero weights on all transitions, subtract from each transition on that path the smallest transition weight on the path (the largest amount that can be removed while keeping all weights non-negative), and use that amount as the mixture weight for that path. The algorithm proceeds in this way until the outflow from the initial state is zero. The following theorem from Koolen et al. (2010) gives a regret guarantee for the CDCH algorithm.

Theorem 3

With proper tuning of the learning rate $\eta$, the regret of CDCH is bounded as below:

$$\forall \pi^* \in P\colon\quad \sum_{t=1}^{T} U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t) \le \sqrt{2\, T\, B^2 K^2 \log(KM)} + B K \log(KM).$$

The regret bounds of Theorem 3 are in terms of the count-based gain $U(\cdot, \cdot)$. Cortes et al. (2015) gave regret guarantees for the RRWM algorithm with count-based losses defined by $-\log U(\cdot, \cdot)$. In Appendix E, we show that the regret associated with $-\log U$ is upper-bounded by the regret bound associated with $U$. Observe that, even with this approximation, the regret guarantees that we provide for CDCH are tighter by a factor of $K$. In addition, our algorithm does not require additional assumptions for an efficient implementation, in contrast to the RRWM algorithm of Cortes et al. (2015).

György et al. (2007) gave an efficient algorithm for online path learning with additive losses in the semi-bandit setting. In this section, we present a Context-dependent Semi-Bandit (CDSB) algorithm extending that work to the problem of online path learning with count-based gains in a semi-bandit setting. To the best of our knowledge, this is the first efficient algorithm with favorable regret bounds for this problem.

As with the algorithm of György et al. (2007), CDSB makes use of a set $C$ of covering paths with the property that, for each $e' \in E'$, there is an accepting path $\pi'$ in $C$ such that $e'$ belongs to $\pi'$. At each round $t$, CDSB keeps track of a distribution $p_t$ over all $N$ path experts by maintaining a weight $w_t(e')$ on each transition $e'$ in $A'$ such that the weights of outgoing transitions for each state sum up to 1 and $p_t(\pi') = \prod_{e' \in \pi'} w_t(e')$, for all accepting paths $\pi'$ in $A'$. Therefore, we can sample a path $\pi'$ from $p_t$ in at most $K$ steps by selecting a random transition at each state according to the distribution defined by $w_t$. To make a prediction, we sample a path in $A'$ according to a mixture distribution $(1 - \gamma) p_t + \gamma \mu$, where $\mu$ is a uniform distribution over paths in $C$. We select $p_t$ with probability $1 - \gamma$ or $\mu$ with probability $\gamma$ and sample a random path $\pi'$ from the randomly chosen distribution.

Once a path $\pi'_t$ in $A'$ is sampled, we observe the gain of each transition $e'$ of $\pi'_t$, denoted by $g_{t,e'}$. CDSB sets $\hat{w}_t(e') = w_t(e') \exp(\eta\, \tilde{g}_{t,e'})$, where $\tilde{g}_{t,e'} = (g_{t,e'} + \beta)/q_{t,e'}$ if $e' \in \pi'_t$ and $\tilde{g}_{t,e'} = \beta/q_{t,e'}$ otherwise. Here, $\eta, \beta, \gamma > 0$ are parameters of the algorithm, and $q_{t,e'}$ is the flow through $e'$ in $A'$, which can be computed using a standard shortest-distance algorithm over the probability semiring (Mohri, 2009). The updated distribution is $p_{t+1}(\pi') \propto \prod_{e' \in \pi'} \hat{w}_t(e')$. Next, the weight pushing algorithm (Mohri, 1997) is applied (see Appendix C), which results in new transition weights $w_{t+1}$ such that the total outflow out of each state is again one and the updated probabilities are $p_{t+1}(\pi') = \prod_{e' \in \pi'} w_{t+1}(e')$, thereby facilitating sampling. The computational complexity of each of the steps above is polynomial in the size of $A'$. The following theorem from György et al. (2007) provides a regret guarantee for the CDSB algorithm.

Theorem 4

Let $C$ denote the set of “covering paths” in $A'$. For any $\delta \in (0, 1)$, with proper tuning of the parameters $\eta$, $\beta$, and $\gamma$, the regret of the CDSB algorithm can be bounded as follows with probability $1 - \delta$:

$$\forall \pi^* \in P\colon\quad \sum_{t=1}^{T} U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t) \le 2 B \sqrt{T K} \left( \sqrt{4 K |C| \ln N} + \sqrt{M \ln \frac{M}{\delta}} \right).$$

Here, we present an algorithm for online path learning with count-based gains in the full bandit setting. Cesa-Bianchi and Lugosi (2012) gave an algorithm for online path learning with additive gains, ComBand. Our generalization, called Context-dependent ComBand (CDCB), is the first efficient algorithm with favorable regret guarantees for learning with count-based gains in this setting. For the full bandit setting with arbitrary gains, we develop an efficient execution of EXP3, called EXP3-AG, in Appendix B.

As with CDSB, CDCB maintains a distribution $p_t$ over all $N$ path experts using weights $w_t$ on the transitions such that the outflow of each state is one and the probability of each path expert is the product of the weights of the transitions along that path. To make a prediction, we sample a path in $A'$ according to a mixture distribution $q_t = (1 - \gamma) p_t + \gamma \mu$, where $\mu$ is a uniform distribution over the paths in $A'$. Note that this sampling can be efficiently implemented as follows. As a pre-processing step, define $\mu$ using a separate set of weights $w^{(\mu)}$ over the transitions of $A'$ in the same form. Set all the weights $w^{(\mu)}$ to one and apply the weight pushing algorithm to obtain a uniform distribution over the path experts. Next, we select $p_t$ with probability $1 - \gamma$ or $\mu$ with probability $\gamma$ and sample a random path $\pi'$ from the randomly chosen distribution.

After observing the scalar gain $g_{\pi'}$ of the chosen path, CDCB computes a surrogate gain vector for all transitions in $A'$ via $\tilde{g}_t = g_{\pi'} P v_{\pi'}$, where $P$ is the pseudo-inverse of $\mathbb{E}[v_{\pi'} v_{\pi'}^\top]$ and $v_{\pi'} \in \{0, 1\}^M$ is a bit representation of the path $\pi'$. As for CDSB, we set $\hat{w}(e') = w_t(e') \exp(-\eta\, \tilde{g}_{t,e'})$ and update $A'$ via weight pushing to compute $w_{t+1}$. We obtain the following regret guarantees from Cesa-Bianchi and Lugosi (2012) for CDCB:

Theorem 5

Let $\lambda_{\min}$ denote the smallest non-zero eigenvalue of $\mathbb{E}[v_{\pi'} v_{\pi'}^\top]$, where $v_{\pi'} \in \{0, 1\}^M$ is the bit representation of the path $\pi'$, which is distributed according to the uniform distribution $\mu$. With proper tuning of the parameters $\eta$ and $\gamma$, the regret of CDCB can be bounded as follows:

$$\forall \pi^* \in P\colon\quad \sum_{t=1}^{T} U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t) \le 2 B \sqrt{\left( \frac{2K}{M \lambda_{\min}} + 1 \right) T M \ln N}.$$
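The sampling machinery shared by CDSB and CDCB can be sketched as follows: weight pushing over the probability semiring renormalizes the transition weights so that outgoing weights sum to one at every state, and a path is then drawn one transition at a time. This is a simplified illustration rather than the algorithm of Appendix C; it assumes an acyclic automaton whose final states have no outgoing transitions, and the names `push_weights`, `sample_path` and the toy automaton are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Transitions = Dict[str, Tuple[int, int]]  # transition name -> (source, destination)


def push_weights(transitions: Transitions, weights: Dict[str, float],
                 final: Set[int]) -> Dict[str, float]:
    """Rescale weights so that path probabilities are proportional to the
    product of the original weights and outgoing weights sum to one."""
    outgoing: Dict[int, List[str]] = {}
    for name, (src, _) in transitions.items():
        outgoing.setdefault(src, []).append(name)

    total: Dict[int, float] = {}  # total[q]: summed weight of all paths from q to a final state

    def mass(state: int) -> float:
        if state not in total:
            if state in final:
                total[state] = 1.0  # assumption: final states have no outgoing transitions
            else:
                total[state] = sum(weights[e] * mass(transitions[e][1])
                                   for e in outgoing.get(state, []))
        return total[state]

    return {e: weights[e] * mass(dst) / mass(src)
            for e, (src, dst) in transitions.items() if mass(src) > 0}


def sample_path(transitions: Transitions, pushed: Dict[str, float],
                initial: int, final: Set[int], rng: random.Random) -> List[str]:
    """Draw one accepting path by choosing a transition at each visited state."""
    outgoing: Dict[int, List[str]] = {}
    for name, (src, _) in transitions.items():
        if name in pushed:
            outgoing.setdefault(src, []).append(name)
    state, path = initial, []
    while state not in final:
        names = outgoing[state]
        choice = rng.choices(names, weights=[pushed[e] for e in names])[0]
        path.append(choice)
        state = transitions[choice][1]
    return path


# Toy automaton with three accepting paths: e1 e3, e1 e4, e2 e5.
TOY: Transitions = {"e1": (0, 1), "e2": (0, 2), "e3": (1, 3), "e4": (1, 3), "e5": (2, 3)}
uniform = push_weights(TOY, {e: 1.0 for e in TOY}, final={3})  # all-ones weights -> uniform mu
drawn = sample_path(TOY, uniform, initial=0, final={3}, rng=random.Random(0))
```

Starting from all-ones weights, the mass of a state is the number of paths from it to a final state, so each accepting path of the toy automaton receives probability 1/3. The mixture $(1 - \gamma) p_t + \gamma \mu$ can then be sampled by first picking one of the two weight sets with a biased coin and calling `sample_path` on it.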

5. Extension to Gappy Count-Based Gains

Here, we generalize the results of Section 3 to a broader family of non-additive gains called gappy count-based gains: the gain of each path depends on the discounted counts of gappy occurrences of a fixed set of patterns $\theta_1, \ldots, \theta_p$ in the sequence output by that path. In a gappy occurrence, there can be “gaps” between symbols of the pattern. The count of a gappy occurrence is discounted multiplicatively by $\gamma^k$, where $\gamma \in [0, 1]$ is a fixed discount rate and $k$ is the total length of gaps. For example, the gappy occurrences of the pattern $\theta = aab$ in a sequence $y = babbaabaa$ with discount rate $\gamma$ are

• symbols 5, 6 and 7 of $y$: length of gap = 0, discount factor = 1;
• symbols 2, 5 and 7 of $y$: length of gap = 3, discount factor = $\gamma^3$;
• symbols 2, 6 and 7 of $y$: length of gap = 3, discount factor = $\gamma^3$,

which makes the total discounted count of gappy occurrences of $\theta$ in $y$ equal to $1 + 2\gamma^3$. Each sequence of symbols $y \in \Sigma^*$ can be represented as a discounted count vector $\Theta(y) \in \mathbb{R}^p$ of gappy occurrences of the patterns, whose $i$th component is the discounted number of gappy occurrences of $\theta_i$ in $y$. The gain function $U$ is defined in the same way as in Equation (1). A typical instance of such gains is gappy $n$-gram gains, where the patterns are all $|\Sigma|^n$ many $n$-grams.

The key to extending our results in Section 3 to gappy $n$-grams is an appropriate definition of the alphabet $E'$, the rewrite rules, and a new context-dependent automaton $A'$. Once $A'$ is constructed, the algorithms and regret guarantees presented in Section 4 can be extended to gappy count-based gains. To the best of our knowledge, this provides the first efficient online algorithms with favorable regret guarantees for gappy count-based gains in the full information, semi-bandit and full bandit settings.

Context-Dependent Rewrite Rules.

We extend the deﬁnition of E ′ so that it also en-codes the total length k of the gaps: E ′ = n ( e · · · e r ) k | e · · · e r ∈ E, r ∈ {| θ | , . . . , | θ p |} ,k ∈ Z , k ≥ o . Note that the discount factor in gappy occurrences does not depend on theposition of the gaps. Exploiting this fact, for each pattern of length n and total gap length k , we reduce the number of output symbols by a factor of (cid:0) k + n − k (cid:1) by encoding the numberof gaps as opposed to the position of the gaps.We extend the rewrite rules in order to incorporate the gappy occurrences. Given e ′ = ( e i e i . . . e i n ) k , for all path segments e j e j . . . e j n + k of length n + k in A where { i s } ns =1 is a subsequence of { j r } n + kr =1 with i = j and i n = j n + k , we introduce the rule: e j e j . . . e j n + k −→ ( e i e i . . . e i n ) k /ǫ ǫ.
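The discounted counts and the encoding by total gap length can be illustrated with a small brute-force sketch. This is not the transducer construction of the paper: it simply enumerates position tuples, and the function names (`gappy_occurrences`, `gap_histogram`, `discounted_count`, `placements`) are hypothetical helpers introduced for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gappy_occurrences(pattern, sequence):
    """Yield the total gap length k of every gappy occurrence of `pattern`.

    An occurrence is an increasing tuple of positions spelling `pattern`;
    its total gap length is the span it covers minus the pattern length.
    """
    n = len(pattern)
    for positions in combinations(range(len(sequence)), n):
        if all(sequence[p] == s for p, s in zip(positions, pattern)):
            yield positions[-1] - positions[0] + 1 - n


def gap_histogram(pattern, sequence):
    """Number of gappy occurrences per total gap length k."""
    return Counter(gappy_occurrences(pattern, sequence))


def discounted_count(pattern, sequence, gamma):
    """Sum of gamma**k over all gappy occurrences (one component of Theta)."""
    hist = gap_histogram(pattern, sequence)
    return sum(count * gamma ** k for k, count in hist.items())


def placements(n, k):
    """Ways to place the gaps of an occurrence of length n and total gap k,
    with the first and last pattern symbols pinned to the segment ends."""
    return sum(1 for pos in combinations(range(n + k), n)
               if pos[0] == 0 and pos[-1] == n + k - 1)
```

For the running example, `gap_histogram("aab", "babbaabaa")` groups the three occurrences by gap length, and `placements(n, k)` can be compared against `comb(k + n - 2, k)`, the factor saved by recording only the total gap length.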

5. The regular count-based gain can be recovered by setting $\gamma = 0$.

As with the non-gappy case in Section 3, the simultaneous application of all these rewrite rules can be efficiently compiled into a FST $T_A$. The context-dependent transducer $T_A$ maps any sequence of transition names in $E$ into a sequence of corresponding gappy occurrences. The example below shows how $T_A$ outputs the gappy trigrams given a path segment of length 5 as input:

$$e_1, e_2, e_3, e_4, e_5 \;\xrightarrow{\;T_A\;}\; (e_1 e_2 e_3)_0,\ (e_1 e_2 e_4)_1,\ (e_1 e_2 e_5)_2,\ (e_1 e_3 e_4)_1,\ (e_1 e_3 e_5)_2,\ (e_1 e_4 e_5)_2,\ (e_2 e_3 e_4)_0,\ (e_2 e_3 e_5)_1,\ (e_2 e_4 e_5)_1,\ (e_3 e_4 e_5)_0.$$

Context-Dependent Automaton $A'$. As in Section 3.2, we construct the context-dependent automaton as $A' := \Pi(A \circ T_A)$, which admits a fixed topology through trials. The rewrite rules are constructed in a way such that different paths in $A$ are rewritten differently. Therefore, $T_A$ assigns a unique output to a given path expert in $A$. Proposition 1 ensures that for every accepting path $\pi$ in $A$, there is a unique corresponding accepting path in $A'$.

For any round $t \in [T]$ and any $e' = (e_{i_1} e_{i_2} \cdots e_{i_n})_k$, define $\mathrm{out}_t(e') := \mathrm{out}_t(e_{i_1}) \ldots \mathrm{out}_t(e_{i_n})$. Let $\mathrm{out}_t(A')$ be the automaton with the same topology as $A'$ where each label $e' \in E'$ is replaced by $\mathrm{out}_t(e')$. Given $y_t$, the representation $\Theta(y_t)$ can be found, and consequently, the additive contribution of each transition of $A'$. Again, we show the additivity of the gain in $A'$ (see Appendix D for the proof).

Theorem 6

Given the trial $t$ and discount rate $\gamma \in [0, 1]$, for each transition $e' \in E'$ in $A'$, define the gain $g_{e',t} := \gamma^k [\Theta(y_t)]_i$ if $\mathrm{out}_t(e') = (\theta_i)_k$ for some $i$ and $k$, and $g_{e',t} := 0$ if no such $i$ and $k$ exist. Then, the gain of each path $\pi$ in $A$ at trial $t$ can be expressed as an additive gain of $\pi'$ in $A'$:

$$\forall t \in [1, T],\ \forall \pi \in \mathcal{P}: \quad U(\mathrm{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$

We can extend the algorithms and regret guarantees presented in Section 4 to gappy count-based gains.
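The additivity statement of Theorem 6 can be checked numerically on a toy instance. The sketch below is an illustration under stated assumptions: all patterns have the same length $n$, the gain is taken to be the inner product $\Theta(y_t) \cdot \Theta(\mathrm{out}_t(\pi))$ used in the Appendix D proofs, and the labels of $\pi'$ are enumerated by brute force instead of by composing with $T_A$. All function and variable names are hypothetical.

```python
from itertools import combinations, product


def gappy_ngrams(symbols, n):
    """All (n-gram, total gap length k) pairs of a sequence, i.e. the
    labels that the rewrite rules would attach to the rewritten path."""
    for pos in combinations(range(len(symbols)), n):
        yield tuple(symbols[p] for p in pos), pos[-1] - pos[0] + 1 - n


def theta(symbols, patterns, gamma):
    """Discounted count vector of gappy occurrences of each pattern."""
    index = {p: i for i, p in enumerate(patterns)}
    counts = [0.0] * len(patterns)
    for gram, k in gappy_ngrams(symbols, len(patterns[0])):
        if gram in index:
            counts[index[gram]] += gamma ** k
    return counts


def gain(output, target, patterns, gamma):
    """Non-additive gain: inner product of the two count vectors."""
    a = theta(output, patterns, gamma)
    b = theta(target, patterns, gamma)
    return sum(x * y for x, y in zip(a, b))


def additive_gain(output, target, patterns, gamma):
    """Sum of per-transition gains gamma**k * [Theta(y)]_i over the path."""
    index = {p: i for i, p in enumerate(patterns)}
    theta_y = theta(target, patterns, gamma)
    total = 0.0
    for gram, k in gappy_ngrams(output, len(patterns[0])):
        if gram in index:  # transitions matching no pattern contribute 0
            total += gamma ** k * theta_y[index[gram]]
    return total


patterns = list(product("ab", repeat=3))  # all gappy trigrams over {a, b}
target = tuple("babbaabaa")
output = tuple("abaab")
direct = gain(output, target, patterns, gamma=0.5)
additive = additive_gain(output, target, patterns, gamma=0.5)
```

The two quantities `direct` and `additive` agree up to floating-point rounding, and `gappy_ngrams` applied to a five-symbol segment with `n = 3` yields the ten gappy trigrams of the example above.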

6. Conclusion and Open Problems

We presented several new algorithms for online non-additive path learning with very favorable regret guarantees for the full information, semi-bandit, and full bandit scenarios. We conclude with two open problems:

(1) Non-acyclic expert automata: we assumed here that the expert automaton $A$ is acyclic and the language of patterns $L = \{\theta_1, \ldots, \theta_p\}$ is finite. Solving the non-additive path learning problem with a cyclic expert automaton together with an (infinite) regular language $L$ of patterns remains an open problem.

(2) Incremental construction of $A'$: in this work, regardless of the data and the setting, the context-dependent automaton $A'$ is constructed in advance as a pre-processing step. Is it possible to construct $A'$ gradually as the learner goes through trials? Can we build $A'$ incrementally in different settings and keep it as small as possible as the algorithm is exploring the set of paths and learning about the revealed data?

Acknowledgments

The work of MM was partly funded by NSF CCF-1535987 and NSF IIS-1618662. Part of this work was done while MKW was at UC Santa Cruz, supported by NSF grant IIS-1619271.

References

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of CIAA, pages 11–23. Springer, 2007.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.

Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of ICML, 2014.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Manfred K. Warmuth. On-line learning algorithms for path experts with non-additive losses. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 424–447, 2015.

András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Ronald M. Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378, 1994.

Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In Proceedings of COLT, pages 93–105, 2010.

Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.

Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

Mehryar Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer, 2009.

Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 231–238. Association for Computational Linguistics, 1996.

Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted automata in text and speech processing. In Proceedings of ECAI-96 Workshop on Extended finite state models of language, 1996.

Fernando Pereira and Michael Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press, 1997.

Gilles Stoltz. Information incomplete et regret interne en prédiction de suites individuelles. PhD thesis, Univ. Paris Sud, 2005.

Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4:773–818, 2003.

Appendix A. Applications to Ensemble Structured Prediction

The algorithms discussed in Section 4 can be used for the online learning of ensembles of structured prediction experts, and consequently, significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision. In structured prediction problems, the output associated with a model $h$ is a structure $y$ that can be decomposed and represented by $\ell$ substructures $y_1, \ldots, y_\ell$. For instance, $h$ may be a machine translation system and $y_i$ a particular word.

The problem of ensemble structured prediction can be described as follows. The learner has access to a set of $r$ experts $h_1, \ldots, h_r$ to make an ensemble prediction. Therefore, at each round $t \in [1, T]$, the learner can use the outputs of the $r$ experts $\mathrm{out}_t(h_1), \ldots, \mathrm{out}_t(h_r)$. As illustrated in Figure 7(a), each expert $h_j$ consists of $\ell$ substructures $h_j = (h_{j,1}, \ldots, h_{j,\ell})$.

[Figure 7: (a) the structured experts $h_1, \ldots, h_r$. (b) the expert automaton $A$ allowing all combinations.]

Represented by paths in an automaton, the substructures of these experts can be combined together. Allowing all combinations, Figure 7(b) illustrates the expert automaton $A$ induced by $r$ structured experts with $\ell$ substructures. The objective of the learner is to find the best path expert, which is the combination of substructures with the best expected gain. This is motivated by the fact that one particular expert may be better at predicting one substructure while some other expert may be more accurate at predicting another substructure. Therefore, it is desirable to combine the substructure predictions of all experts to obtain a more accurate prediction.

Consider the online path learning problem with expert automaton $A$ in Figure 7(b) with non-additive $n$-gram gains described in Section 3 for typical small values of $n$ (e.g. $n = 4$). We construct the context-dependent automaton $A'$ via a set of rewrite rules. The rewrite rules are as follows:

$$h_{j_1,i+1}, h_{j_2,i+2}, \ldots, h_{j_n,i+n} \;\to\; h_{j_1,i+1} h_{j_2,i+2} \ldots h_{j_n,i+n} \;/\; \epsilon \,\underline{\hspace{1em}}\, \epsilon,$$

for all $j_1, \ldots, j_n \in [1, r]$, $i \in [0, \ell - n]$. The number of rewrite rules is $(\ell - n + 1) r^n$. We compile these rewrite rules into the context-dependent transducer $T_A$, and then construct the context-dependent automaton $A' = \Pi(A \circ T_A)$.

The context-dependent automaton $A'$ is illustrated in Figure 8. The transitions in $A'$ are labeled with $n$-grams of transition names $h_{i,j}$ in $A$. The context-dependent automaton $A'$ has $\ell - n + 1$ layers of states, each of which acts as a "memory" indicating the last observed $(n-1)$-gram of transition names $h_{i,j}$. With each intermediate state (i.e. a state which is neither the initial state nor a final state), an $(n-1)$-gram is associated. Each layer contains $r^{n-1}$ many states encoding all combinations of $(n-1)$-grams ending at that state. Each intermediate state has $r$ incoming transitions, which are the $n$-grams ending with the $(n-1)$-gram associated with the state. Similarly, each state has $r$ outgoing transitions, which are the $n$-grams starting with the $(n-1)$-gram associated with the state.

[Figure 8: The context-dependent automaton $A'$ for the expert automaton $A$ depicted in Figure 7(b).]

The number of states and transitions in $A'$ are $Q = 1 + r^n(\ell - n)$ and $M = r^n(\ell - n + 1)$, respectively. Note that the size of $A'$ does not depend on the size of the output alphabet $\Sigma$. Also notice that all paths in $A'$ have equal length of $K = \ell - n + 1$. Furthermore, the number of paths in $A'$ and $A$ are the same and equal to $N = r^\ell$.

We now apply the algorithms introduced in Section 4.

A.1. Full Information: Context-dependent Component Hedge Algorithm

We apply the CDCH algorithm to this application in the full information setting. The context-dependent automaton $A'$ introduced in this section is highly structured. We can exploit this structure and obtain better bounds compared to the general bounds of Theorem 3 for CDCH.

Theorem 7

Let $B$ denote an upper bound for the gains of all the transitions in $A'$, and $T$ be the time horizon. The regret of the CDCH algorithm on ensemble structured prediction with $r$ predictors consisting of $\ell$ substructures with $n$-gram gains can be bounded as

$$\mathrm{Regret}_{\mathrm{CDCH}} \le \sqrt{2\, T\, B^2 (\ell - n + 1)^2\, n \log r} + B (\ell - n + 1)\, n \log r.$$

Proof

First, note that all paths in $A'$ have equal length of $K = \ell - n + 1$. Therefore there is no need of modifying $A'$ to make all paths of the same length. At each trial $t \in [T]$, we define the loss of each transition as $\ell_{t,e'} := B - g_{t,e'}$. Extending the results of Koolen et al. (2010), the general regret bound of CDCH is

$$\mathrm{Regret}_{\mathrm{CDCH}} \le \sqrt{2\, T\, K\, B^2\, \Delta(v_{\pi^*} \| w_1)} + B\, \Delta(v_{\pi^*} \| w_1), \qquad (3)$$

where $v_{\pi^*} \in \{0,1\}^M$ is a bit vector representation of the best comparator $\pi^*$, $w_1 \in [0,1]^M$ is the initial weight vector in the unit-flow polytope, and

$$\Delta(w \| \widehat{w}) := \sum_{e' \in E'} \left( w_{e'} \ln \frac{w_{e'}}{\widehat{w}_{e'}} + \widehat{w}_{e'} - w_{e'} \right).$$

Since the initial state has $r^n$ outgoing transitions, and all the intermediate states have $r$ incoming and outgoing transitions, the initial vector $w_1 = \frac{1}{r^n} \mathbf{1}$ falls into the unit-flow polytope, where $\mathbf{1}$ is a vector of all ones. Also, $v_{\pi^*}$ has exactly $K = \ell - n + 1$ many ones. Each of these $K$ transitions contributes $\ln r^n = n \ln r$ to the first term of $\Delta$, while the remaining terms cancel because $\sum_{e'} w_{1,e'} = M / r^n = \ell - n + 1 = \sum_{e'} v_{\pi^*,e'}$. Therefore:

$$\Delta(v_{\pi^*} \| w_1) = (\ell - n + 1)\, n \log r. \qquad (4)$$

Combining Equations (3) and (4) gives us the desired regret bound.

A.2. Semi-Bandit: Context-dependent Semi-Bandit Algorithm

In order to apply the algorithm CDSB in the semi-bandit setting in this application, we need to introduce a set of "covering paths" $C$ in $A'$. We introduce $C$ by partitioning all the transitions in $A'$ into $r^n$ paths of length $\ell - n + 1$ iteratively as follows. At each iteration, choose an arbitrary path $\pi$ from the initial state to a final state. Add $\pi$ to the set $C$ and remove all its transitions from $A'$. Notice that the number of incoming and outgoing transitions for each intermediate state are always equal throughout the iterations. Also note that in each iteration, the number of outgoing edges from the initial state decreases by one. Therefore, after $r^n$ iterations, $C$ contains a set of $r^n$ paths that partition the set of transitions in $A'$.

Furthermore, observe that the number of paths in $A'$ and $A$ are the same and equal to $N = r^\ell$. The Corollary below is a direct result of Theorem 4 with $|C| = r^n$.

Corollary 8

For any $\delta \in (0, 1)$, with proper tuning, the regret of the CDSB algorithm can be bounded, with probability $1 - \delta$, as:

$$\mathrm{Regret}_{\mathrm{CDSB}} \le 2 B (\ell - n + 1) \sqrt{T} \left( \sqrt{4\, r^n\, \ell \ln r} + \sqrt{r^n \ln \frac{r^n (\ell - n + 1)}{\delta}} \right).$$

A.3. Full Bandit: Context-dependent ComBand Algorithm

We apply the CDCB algorithm to this application in the full bandit setting. The Corollary below, which is a direct result of Theorem 5, gives a regret guarantee for the CDCB algorithm.

Corollary 9

Let $\lambda_{\min}$ denote the smallest non-zero eigenvalue of $\mathbb{E}[v_{\pi} v_{\pi}^{\top}]$, where $v_{\pi} \in \{0,1\}^M$ is the bit representation of the path $\pi$, which is distributed according to the uniform distribution $\mu$. With proper tuning, the regret of CDCB can be bounded as follows:

$$\mathrm{Regret}_{\mathrm{CDCB}} \le 2B \sqrt{T \left( \frac{2(\ell - n + 1)}{r^n (\ell - n + 1) \lambda_{\min}} + 1 \right) r^n (\ell - n + 1)\, \ell \ln r}.$$

Appendix B. Path Learning for Full Bandit and Arbitrary Gain

So far, we have discussed the count-based gains as a wide family of non-additive gains in full information, semi- and full bandit settings. We developed learning algorithms that exploit the special structure of count-based gains. In this section, we go beyond the count-based gains in the full bandit setting and consider the scenario where the gain function is arbitrary and admits no known structure. In other words, any two paths can have completely independent gains regardless of the number of overlapping transitions they may share. Clearly, the CDCB algorithm of Section 4 cannot be applied to this case as it is specialized to (gappy) count-based gains. However, we present a general algorithm for path learning in the full bandit setting, when the gain function is arbitrary. This algorithm (called EXP3-AG) admits weaker regret bounds since no special structure of the gains can be exploited in the absence of any assumption. The algorithm is essentially an efficient implementation of EXP3 for path learning with arbitrary gains using weighted automata and graph operations.

We start with a brief description of the EXP3 algorithm of Auer et al. (2002), which is an online learning algorithm designed for the full bandit setting over a set of $N$ experts. The algorithm maintains a distribution $w_t$ over the set of experts, with $w_1$ initialized to the uniform distribution. At each round $t \in [T]$, the algorithm samples an expert $I_t$ according to $w_t$ and receives (only) the gain $g_{t,I_t}$ associated to that expert. It then updates the weights multiplicatively via the rule $w_{t+1,i} \propto w_{t,i} \exp(\eta\, \widetilde{g}_{t,i})$ for all $i \in [N]$, where $\widetilde{g}_{t,i} = \frac{g_{t,i}}{w_{t,i}} 1\{I_t = i\}$ is an unbiased surrogate gain associated with expert $i$. The weights $w_{t+1,i}$ are then normalized to sum to one.

In our learning scenario, each expert is a path in $A$. Since the number of paths is exponential in the size of $A$, maintaining a weight per path is computationally intractable. We cannot exploit the properties of the gain function since it does not admit any known structure. However, we can make use of the graph representation of the experts. We will show that the weights of the experts at round $t$ can be compactly represented by a deterministic weighted finite automaton (WFA) $W_t$. We will further show that sampling a path from $W_t$ and updating $W_t$ can be done efficiently.

A deterministic WFA $W$ is a deterministic finite automaton whose transitions and final states carry weights. Let $w(e)$ denote the weight of a transition $e$ and $w_f(q)$ the weight at a final state $q$. The weight $W(\pi)$ of a path $\pi$ ending in a final state is defined as the product of its constituent transition weights and the weight at the final state: $W(\pi) := \left( \prod_{e \in \pi} w(e) \right) \cdot w_f(\mathrm{dest}(\pi))$, where $\mathrm{dest}(\pi)$ denotes the destination state of $\pi$.
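The EXP3 update described above can be sketched with one explicit weight per expert. This is the naive version whose cost is linear in $N$, which is exactly what the automaton-based representation is meant to avoid; it omits the uniform mixing step, and the names `distribution`, `exp3`, and `gain_fn` are hypothetical and introduced only for this illustration.

```python
import math
import random


def distribution(log_w):
    """Normalize log-weights into a probability vector (numerically stable)."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [x / z for x in w]


def exp3(gain_fn, num_experts, horizon, eta, seed=0):
    """Run EXP3 without mixing; gain_fn(t, i) is the gain of expert i at round t.

    Only the gain of the sampled expert is queried, as in the full bandit
    setting. Returns the collected gain and the final distribution.
    """
    rng = random.Random(seed)
    log_w = [0.0] * num_experts  # uniform initial distribution
    total_gain = 0.0
    for t in range(horizon):
        probs = distribution(log_w)
        i = rng.choices(range(num_experts), weights=probs)[0]
        gain = gain_fn(t, i)
        total_gain += gain
        # Importance-weighted surrogate gain: non-zero only for the sampled expert.
        log_w[i] += eta * gain / probs[i]
    return total_gain, distribution(log_w)
```

A call such as `exp3(lambda t, i: 1.0 if i == 2 else 0.0, num_experts=4, horizon=200, eta=0.05)` runs the procedure on a toy problem in which a single expert has non-zero gain.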

6. The original EXP3 algorithm of Auer et al. (2002) mixes the weight vector with the uniform distribution in each trial. Later Stoltz (2005) showed that the mixing step is not necessary.

[Figure 9: The update WFA $V_t$. The weight of each state and transition is written next to its name, separated by "|" (a transition is drawn as e | weight).]

1: $W_1 \leftarrow A$
2: For $t = 1, \ldots, T$
3:     $W_t \leftarrow$ WeightPush($W_t$)
4:     $\pi_t \leftarrow$ Sample($W_t$)
5:     $g_{t,\pi_t} \leftarrow$ ReceiveGain($\pi_t$)
6:     $V_t \leftarrow$ UpdateWFA($\pi_t$, $W_t(\pi_t)$, $g_{t,\pi_t}$)
7:     $W_{t+1} \leftarrow W_t \circ V_t$

Figure 10: Algorithm EXP3-AG

Sampling paths from a deterministic WFA $W$ is straightforward when it is stochastic, that is, when the weights of all outgoing transitions and the final state weight (if the state is final) sum to one at every state: starting from the initial state, we can randomly draw a transition according to the probability distribution defined by the outgoing transition weights and proceed similarly from the destination state of that transition, until a final state is reached. The WFA we obtain after an update may not be stochastic, but we can efficiently compute an equivalent stochastic WFA $W'$ from any $W$ using the weight-pushing algorithm (Mohri, 1997, 2009; Takimoto and Warmuth, 2003): $W'$ admits the same states and transitions as $W$ and assigns the same weight to a path from the initial state to a final state; but the weights along paths are redistributed so that $W'$ is stochastic. For an acyclic input WFA such as those we are considering, the computational complexity of weight-pushing is linear in the sum of the number of states and transitions of $W$; see Appendix C for details.

We now show how $W_t$ can be efficiently updated using the standard WFA operation of intersection (or composition) with a WFA $V_t$ representing the multiplicative weights, which we will refer to as the update WFA at time $t$.

$V_t$ is a deterministic WFA that assigns weight $\exp(\eta\, \widetilde{g}_{t,\pi})$ to path $\pi$. Thus, since $\widetilde{g}_{t,\pi} = 0$ for all paths but the path $\pi_t$ sampled at time $t$, $V_t$ assigns weight 1 to all paths $\pi \neq \pi_t$ and weight $\exp\left( \frac{\eta\, g_{t,\pi_t}}{W_t(\pi_t)} \right)$ to $\pi_t$. $V_t$ can be constructed deterministically as illustrated in Figure 9, using $\rho$-transitions (marked with $\rho$ in green). A $\rho$-transition admits the semantics of the rest: it matches any symbol that is not labeling an existing outgoing transition at that state. For example, the $\rho$-transition at state 1 matches any symbol other than $e_2$. $\rho$-transitions lead to a more compact representation not requiring the knowledge of the full alphabet. This further helps speed up subsequent intersection operations (Allauzen et al., 2007).

To update the weights $W_t$, we use the intersection (or composition) of WFAs. By definition, the intersection of $W_t$ and $V_t$ is a WFA denoted by $(W_t \circ V_t)$ that assigns to each path expert $\pi$ the product of the weights assigned by $W_t$ and $V_t$:

$$\forall \pi \in \mathcal{P}: \quad (W_t \circ V_t)(\pi) = W_t(\pi) \cdot V_t(\pi).$$
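The weight-pushing and sampling steps described above can be sketched on a small acyclic WFA. The dictionary-based representation and the function names below are assumptions made for illustration and do not correspond to any particular library; the sketch also assumes that every state can reach a final state. With the update formulas of Appendix C, the pushed weight of each path equals its original weight divided by the total weight $d$ of the initial state.

```python
import random


def shortest_distance(trans, final):
    """d[q]: total weight of all paths from q to a final state (acyclic WFA).

    trans maps a state to a list of (label, weight, destination) arcs;
    final maps a final state to its final weight.
    """
    d = {}

    def visit(q):
        if q not in d:
            d[q] = final.get(q, 0.0) + sum(
                w * visit(dst) for _, w, dst in trans.get(q, []))
        return d[q]

    for q in set(trans) | set(final):
        visit(q)
    return d


def weight_push(trans, final):
    """Return an equivalent stochastic WFA (same states and arcs)."""
    d = shortest_distance(trans, final)
    pushed_trans = {
        q: [(a, w * d[dst] / d[q], dst) for a, w, dst in arcs]
        for q, arcs in trans.items() if d[q] != 0
    }
    pushed_final = {q: w / d[q] for q, w in final.items() if d[q] != 0}
    return pushed_trans, pushed_final


def sample_path(trans, final, start, rng):
    """Draw a path from a stochastic WFA by following the local weights."""
    q, labels = start, []
    while True:
        arcs = trans.get(q, [])
        # None stands for "stop at this final state".
        options = arcs + [None] if q in final else arcs
        weights = [w for _, w, _ in arcs] + ([final[q]] if q in final else [])
        choice = rng.choices(options, weights=weights)[0]
        if choice is None:
            return labels
        labels.append(choice[0])
        q = choice[2]
```

After `weight_push`, the outgoing weights and the final weight sum to one at every state, so `sample_path` draws each path with probability proportional to its weight in the original WFA.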

7. The terminology of intersection is motivated by the case where the weights are either 0 or 1, in which case the set of paths with non-zero weights in $W_t \circ V_t$ is the intersection of the sets of paths with weight 1 in $W_t$ and $V_t$.

There exists a general and efficient algorithm for computing the intersection of two WFAs (Pereira and Riley, 1997; Mohri et al., 1996; Mohri, 2009): the states of the intersection WFA are formed by pairs of a state of the first WFA and a state of the second WFA, and the transitions are obtained by matching pairs of transitions from the original WFAs, with their weights multiplied; see Appendix C for more details. Since both $W_t$ and $V_t$ are deterministic, their intersection $(W_t \circ V_t)$ is also deterministic (Cortes et al., 2015).

The following lemma (proven in Appendix D) shows that the weights assigned by EXP3-AG to each path expert coincide with those defined by EXP3.

Lemma 10

At each round $t \in [T]$ in EXP3-AG, the following properties hold for $W_t$ and $V_t$:

$$W_{t+1}(\pi) \propto \exp\left( \eta \sum_{s=1}^{t} \widetilde{g}_{s,\pi} \right), \qquad V_t(\pi) = \exp(\eta\, \widetilde{g}_{t,\pi}), \qquad \text{s.t. } \widetilde{g}_{s,\pi} = \left( g_{s,\pi} / W_s(\pi) \right) \cdot 1\{\pi = \pi_s\}.$$

Figure 10 gives the pseudocode of EXP3-AG. The time complexity of EXP3-AG is dominated by the cost of the intersection operation (line 7). The worst-case space and time complexity of the intersection of two deterministic WFAs is linear in the size of the automaton the algorithm returns. Due to the specific structure of $V_t$, the size of $W_t \circ V_t$ can be shown to be at most $O(|W_t| + |V_t|)$, where $|W_t|$ is the sum of the number of states and transitions in $W_t$. This is significantly better than the worst case size of the intersection in general (i.e. $O(|W_t||V_t|)$). Recall that $W_{t+1}$ is deterministic. Thus, unlike the algorithms of Cortes et al. (2015), no further determinization is required. The following Lemma guarantees the efficiency of the EXP3-AG algorithm. See Appendix D for the proof.

Lemma 11

The time complexity of EXP3-AG at round $t$ is in $O(|W_t| + |V_t|)$. Moreover, in the worst case, the growth of $|W_t|$ over time is at most linear in $K$, where $K$ is the length of the longest path in $A$.

The following upper bound holds for the regret of EXP3-AG, as a direct consequence of existing guarantees for EXP3 (Auer et al., 2002).

Theorem 12

Let $U > 0$ be an upper bound on all path gains: $g_{t,\pi} \le U$ for all $t \in [T]$ and all paths $\pi$. Then, the regret of EXP3-AG with $N$ path experts is upper bounded by $U \sqrt{2\, T N \log N}$.

The $\sqrt{N}$ dependency of the bound suggests that the guarantee will not be informative for large values of $N$. However, the following known lower bound shows that, in the absence of any assumption about the structure of the gains, this dependency cannot be improved in general (Auer et al., 2002).

Theorem 13

Let $U > 0$ be an upper bound on all path gains: $g_{t,\pi} \le U$ for all $t \in [T]$ and all paths $\pi$. Then, for any number of path experts $N \ge 2$, there exists a distribution over the assignment of gains to path experts such that the regret of any algorithm is at least $\frac{1}{20}\, U \min\{\sqrt{T N}, T\}$.

Appendix C. Weighted Finite Automata

In this section, we formally describe several WFA operations relevant to this paper, as well as their properties.

C.1. Intersection of WFAs

The intersection of two WFAs $A_1$ and $A_2$ is a WFA denoted by $A_1 \circ A_2$ that accepts the set of sequences accepted by both $A_1$ and $A_2$ and is defined for all $\pi$ by

$$(A_1 \circ A_2)(\pi) = A_1(\pi) \cdot A_2(\pi).$$

There exists a standard efficient algorithm for computing the intersection WFA (Pereira and Riley, 1997; Mohri et al., 1996; Mohri, 2009). States $Q \subseteq Q_1 \times Q_2$ of $A_1 \circ A_2$ are identified with pairs of states $Q_1$ of $A_1$ and $Q_2$ of $A_2$, as are the set of initial and final states. Transitions are obtained by matching pairs of transitions from each WFA and multiplying their weights:

$$\left( q_1 \xrightarrow{a | w_1} q_1',\ q_2 \xrightarrow{a | w_2} q_2' \right) \;\Rightarrow\; (q_1, q_2) \xrightarrow{a | w_1 \cdot w_2} (q_1', q_2').$$

The worst-case space and time complexity of the intersection of two deterministic WFAs is linear in the size of the automaton the algorithm returns. In the worst case, this can be as large as the product of the sizes of the WFAs that are intersected (i.e. $O(|A_1||A_2|)$). This corresponds to the case where every transition of $A_1$ can be paired up with every transition of $A_2$. In practice, far fewer transitions can be matched.

Notice that, when both $A_1$ and $A_2$ are deterministic, then $A_1 \circ A_2$ is also deterministic, since there is a unique initial state (pair of initial states of each WFA) and since there is at most one transition leaving $q_1 \in Q_1$ or $q_2 \in Q_2$ labeled with a given symbol $a \in \Sigma$.

C.2. Weight Pushing

Given a WFA $W$, the weight pushing algorithm (Mohri, 1997, 2009) computes an equivalent stochastic WFA. The weight pushing algorithm is defined as follows. For any state $q$ in $W$, let $d[q]$ denote the sum of the weights of all paths from $q$ to final states:

$$d[q] = \sum_{\pi \in P(q)} \left( \prod_{e \in \pi} w(e) \right) \cdot w_f(\mathrm{dest}(\pi)),$$

where $P(q)$ denotes the set of paths from $q$ to final states in $W$. The weights $d[q]$ can be computed simultaneously for all $q$ using standard shortest-distance algorithms over the probability semiring (Mohri, 2002). The weight pushing algorithm performs the following steps. For any transition $(q, q') \in E$ such that $d[q] \neq 0$, its weight is updated as below:

$$w(q, q') \leftarrow d[q]^{-1}\, w(q, q')\, d[q'].$$

For any final state $q$, the weight is updated as follows:

$$w_f(q) \leftarrow w_f(q)\, d[q]^{-1}.$$

The resulting WFA is guaranteed to preserve the path expert weights and to be stochastic (Mohri, 2009).

Appendix D. Proofs

Lemma 10

At each round $t \in [T]$, the following properties hold for $W_t$ and $V_t$:

$$W_{t+1}(\pi) \propto \exp\left( \eta \sum_{s=1}^{t} \widetilde{g}_{s,\pi} \right), \qquad V_t(\pi) = \exp(\eta\, \widetilde{g}_{t,\pi}), \qquad \text{s.t. } \widetilde{g}_{s,\pi} = \begin{cases} \dfrac{g_{s,\pi}}{W_s(\pi)} & \pi = \pi_s \\ 0 & \text{otherwise.} \end{cases}$$

Proof

Consider $V_t$ in Figure 9 and let $\pi_t = e_1 e_2 \ldots e_k$ be the path chosen by the learner. Every state in $V_t$ is a final state. Therefore, $V_t$ accepts any sequence of transition names. Moreover, since the weights of all transitions are 1, the weight of any accepting path is simply the weight of its final state. The construction of $V_t$ ensures that the weight of every sequence of transition names is 1, except for $\pi_t = e_1 e_2 \ldots e_k$. Thus, the property of $V_t$ is achieved:

$$V_t(\pi) = \begin{cases} \exp\left( \dfrac{\eta\, g_{t,\pi}}{W_t(\pi)} \right) & \pi = \pi_t \\ 1 & \text{otherwise.} \end{cases}$$

The proof of the property of $W_{t+1}$ is by induction on $t$. Consider the base case of $t = 0$. $W_1$ is initialized to the automaton $A$ with all weights being one. Thus, the weights of all paths are equal to 1 before weight pushing (i.e. $W_1(\pi) \propto 1$). For the inductive step, we have

$$\begin{aligned} W_{t+1}(\pi) &\propto W_t(\pi) \cdot V_t(\pi) && \text{(definition of composition)} \\ &= \exp\left( \eta \sum_{s=1}^{t-1} \widetilde{g}_{s,\pi} \right) \cdot \exp(\eta\, \widetilde{g}_{t,\pi}) && \text{(induction hypothesis)} \\ &= \exp\left( \eta \sum_{s=1}^{t} \widetilde{g}_{s,\pi} \right), \end{aligned}$$

which completes the proof.

Lemma 11

The time complexity of EXP3-AG at round $t$ is in $O(|W_t| + |V_t|)$. Moreover, in the worst case, the growth of $|W_t|$ over time is at most linear in $K$, where $K$ is the length of the longest path in $A$.

Proof

Figure 10 gives the pseudocode of EXP3-AG. The time complexity of the weight-pushing step is in $O(|W_t|)$, where $|W_t|$ is the sum of the number of states and transitions in $W_t$. Lines 4 and 6 in Figure 10 take $O(|V_t|)$ time. Finally, regarding line 7, the worst-case space and time complexity of the intersection of two deterministic WFAs is linear in the size of the automaton the algorithm returns. However, the size of the intersection automaton $W_t \circ V_t$ is significantly smaller than the general worst case (i.e. $O(|W_t||V_t|)$) due to the state "else" with all incoming $\rho$-transitions (see Figure 9). Since $W_t$ is deterministic, in the construction of $W_t \circ V_t$, each state of $V_t$ except the "else" state is paired up with only one state of $W_t$. For example, if the state is the one reached by $e_1 e_2 e_3$, then it is paired up with the single state of $W_t$ reached when reading $e_1 e_2 e_3$ from the initial state. Thus $|W_t \circ V_t| \le |W_t| + |V_t|$, and therefore, the intersection operation in line 7 takes $O(|W_t| + |V_t|)$ time, which also dominates the time complexity of the EXP3-AG algorithm.

Additionally, observe that the size $|V_t|$ is in $O(K)$, where $K$ is the length of the longest path in $A$. Since $|W_{t+1}| = |W_t \circ V_t| \le |W_t| + |V_t|$, in the worst case, the growth of $|W_t|$ over time is at most linear in $K$.

Proposition 1

Let $A$ be an expert automaton and let $T_A$ be a deterministic transducer representing the rewrite rules (2). Then, for each accepting path $\pi$ in $A$ there exists a unique corresponding accepting path $\pi'$ in $A' = \Pi(A \circ T_A)$.

Proof

To establish the correspondence, we introduce $T_A$ as a mapping from the accepting paths in $A$ to the accepting paths in $A'$. Since $T_A$ is deterministic, for each accepting path $\pi$ in $A$ (i.e. $A(\pi) = 1$), $T_A$ assigns a unique output $\pi'$, that is $T_A(\pi, \pi') = 1$. We show that $\pi'$ is an accepting path in $A'$. Observe that

$$(A \circ T_A)(\pi, \pi') = A(\pi) \cdot T_A(\pi, \pi') = 1 \times 1 = 1,$$

which implies that $A'(\pi') = \Pi(A \circ T_A)(\pi') = 1$. Thus for each accepting path $\pi$ in $A$ there is a unique accepting path $\pi'$ in $A'$.

Theorem 2

Given the trial $t$, for each transition $e' \in E'$ in $A'$ define the gain $g_{e',t} := [\Theta(y_t)]_i$ if $\mathrm{out}_t(e') = \theta_i$ for some $i$ and $g_{e',t} := 0$ if no such $i$ exists. Then, the gain of each path $\pi$ in $A$ at trial $t$ can be expressed as an additive gain of $\pi'$ in $A'$:

$$\forall t \in [1, T],\ \forall \pi \in \mathcal{P}: \quad U(\mathrm{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$

Proof

By definition, the $i$th component of $\Theta(\mathrm{out}_t(\pi))$ is the number of occurrences of $\theta_i$ in $\mathrm{out}_t(\pi)$. Also, by construction of the context-dependent automaton based on rewrite rules, $\pi'$ contains all path segments of length $|\theta_i|$ of $\pi$ in $A$ as transition labels in $A'$. Thus every occurrence of $\theta_i$ in $\mathrm{out}_t(\pi)$ will appear as a transition label in $\mathrm{out}_t(\pi')$. Therefore the number of occurrences of $\theta_i$ in $\mathrm{out}_t(\pi)$ is

$$[\Theta(\mathrm{out}_t(\pi))]_i = \sum_{e' \in \pi'} 1\{\mathrm{out}_t(e') = \theta_i\}, \qquad (5)$$

where $1\{\cdot\}$ is the indicator function. Thus, we have that

$$\begin{aligned} U(\mathrm{out}_t(\pi), y_t) &= \Theta(y_t) \cdot \Theta(\mathrm{out}_t(\pi)) && \text{(definition of } U \text{)} \\ &= \sum_i [\Theta(y_t)]_i\, [\Theta(\mathrm{out}_t(\pi))]_i \\ &= \sum_i [\Theta(y_t)]_i \sum_{e' \in \pi'} 1\{\mathrm{out}_t(e') = \theta_i\} && \text{(Equation (5))} \\ &= \sum_{e' \in \pi'} \underbrace{\sum_i [\Theta(y_t)]_i\, 1\{\mathrm{out}_t(e') = \theta_i\}}_{= g_{e',t}}, \end{aligned}$$

which concludes the proof.

Theorem 14

Given the trial $t$ and discount rate $\gamma \in [0, 1]$, for each transition $e' \in E'$ in $A'$ define the gain $g_{e',t} := \gamma^k [\Theta(y_t)]_i$ if $\mathrm{out}_t(e') = (\theta_i)_k$ for some $i$ and $k$ and $g_{e',t} := 0$ if no such $i$ and $k$ exist. Then, the gain of each path $\pi$ in $A$ at trial $t$ can be expressed as an additive gain of $\pi'$ in $A'$:

$$\forall t \in [1, T],\ \forall \pi \in \mathcal{P}: \quad U(\mathrm{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$

Proof

By definition, the $i$th component of $\Theta(\mathrm{out}_t(\pi))$ is the discounted count of gappy occurrences of $\theta_i$ in $\mathrm{out}_t(\pi)$. Also, by construction of the context-dependent automaton based on rewrite rules, $\pi'$ contains all gappy path segments of length $|\theta_i|$ of $\pi$ in $A$ as transition labels in $A'$. Thus every gappy occurrence of $\theta_i$ with $k$ gaps in $\mathrm{out}_t(\pi)$ will appear as a transition label $(\theta_i)_k$ in $\mathrm{out}_t(\pi')$. Therefore the discounted count of gappy occurrences of $\theta_i$ in $\mathrm{out}_t(\pi)$ is

$$[\Theta(\mathrm{out}_t(\pi))]_i = \sum_{e' \in \pi'} \sum_k \gamma^k\, 1\{\mathrm{out}_t(e') = (\theta_i)_k\}. \qquad (6)$$

Therefore, the following holds:

$$\begin{aligned} U(\mathrm{out}_t(\pi), y_t) &= \Theta(y_t) \cdot \Theta(\mathrm{out}_t(\pi)) && \text{(definition of } U \text{)} \\ &= \sum_i [\Theta(y_t)]_i\, [\Theta(\mathrm{out}_t(\pi))]_i \\ &= \sum_i [\Theta(y_t)]_i \sum_{e' \in \pi'} \sum_k \gamma^k\, 1\{\mathrm{out}_t(e') = (\theta_i)_k\} && \text{(Equation (6))} \\ &= \sum_{e' \in \pi'} \underbrace{\sum_i \sum_k \gamma^k [\Theta(y_t)]_i\, 1\{\mathrm{out}_t(e') = (\theta_i)_k\}}_{= g_{e',t}}, \end{aligned}$$

and the proof is complete.

Appendix E. Gains $U$ vs Losses $-\log(U)$

Let $U$ be a non-negative gain function. Also let $\pi^* \in \mathcal{P}$ be the best comparator over the $T$ rounds. The regrets associated with $U$ and $-\log U$, which are respectively denoted by $R_G$ and $R_L$, are defined as below:

$$R_G := \sum_{t=1}^{T} U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t),$$

$$R_L := \sum_{t=1}^{T} -\log(U(\mathrm{out}_t(\pi_t), y_t)) - \left( -\log(U(\mathrm{out}_t(\pi^*), y_t)) \right).$$

Observe that if $U(\mathrm{out}_t(\pi_t), y_t) = 0$ for any $t$, then $R_L$ is unbounded. Otherwise, let us assume that there exists a positive constant $\alpha > 0$ such that $U(\mathrm{out}_t(\pi_t), y_t) \ge \alpha$ for all $t \in [T]$. Note, for count-based gains, we have $\alpha \ge 1$, since all components of the representation $\Theta(\cdot)$ are non-negative integers. Thus, the next proposition shows that for count-based gains we have $R_L \le R_G$.

Proposition 15

Let $U$ be a non-negative gain function. Assume that there exists $\alpha > 0$ such that $U(\mathrm{out}_t(\pi_t), y_t) \geq \alpha$ for all $t \in [T]$. Then, the following inequality holds:

\[
R_L \leq \frac{1}{\alpha} R_G.
\]

Proof

The following chain of inequalities holds:

\[
\begin{aligned}
R_L &= \sum_{t=1}^{T} -\log\big(U(\mathrm{out}_t(\pi_t), y_t)\big) - \big(-\log\big(U(\mathrm{out}_t(\pi^*), y_t)\big)\big) \\
&= \sum_{t=1}^{T} \log\left[\frac{U(\mathrm{out}_t(\pi^*), y_t)}{U(\mathrm{out}_t(\pi_t), y_t)}\right] \\
&= \sum_{t=1}^{T} \log\left[1 + \frac{U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t)}{U(\mathrm{out}_t(\pi_t), y_t)}\right] \\
&\leq \sum_{t=1}^{T} \frac{U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t)}{U(\mathrm{out}_t(\pi_t), y_t)} && \text{(since } \log(1 + x) \leq x\text{)} \\
&\leq \frac{1}{\alpha} \sum_{t=1}^{T} U(\mathrm{out}_t(\pi^*), y_t) - U(\mathrm{out}_t(\pi_t), y_t) && \text{(since } U(\mathrm{out}_t(\pi_t), y_t) \geq \alpha\text{)} \\
&\leq \frac{1}{\alpha} R_G,
\end{aligned}
\]

which completes the proof.

Appendix F. Non-Additivity of the Count-Based Gains

Here we show that the count-based gains are not additive in general. Consider the special case of 4-gram-based gains in Figure 1. Suppose, for the sake of contradiction, that the count-based gain defined in Equation (1) can be expressed additively along the transitions. Let the target sequence $y$ be as below:

y = He would like to eat cake

Then the only 4-gram in Figure 1 with positive gain is “He-would-like-to”. Thus, if a path contains this 4-gram, it will have a gain of 1. Otherwise, its gain will be 0. Suppose that the transitions are labeled as depicted in Figure 11 and each transition $e \in E$ carries an additive gain of $g(e)$. Consider the following four paths:

\[
\begin{aligned}
\pi_1 &= e_1 \, e_2 \, e_3 \, e_4 \, e_5 \, e_6, \\
\pi_2 &= e_1' \, e_2 \, e_3 \, e_4 \, e_5 \, e_6, \\
\pi_3 &= e_1 \, e_2 \, e_3' \, e_4 \, e_5 \, e_6, \\
\pi_4 &= e_1' \, e_2 \, e_3' \, e_4 \, e_5 \, e_6.
\end{aligned}
\]

[Figure 11: Additive gains carried by the transitions. Transition labels: $e_1$: He, $e_1'$: She; $e_2$: would, $e_2'$: would; $e_3$: like, $e_3'$: love; $e_4$: to, $e_4'$: to; $e_5$: have, $e_5'$: drink; $e_6$: tea, $e_6'$: chai.]

Due to the additivity of gains, we can obtain:

\[
\begin{aligned}
U(\mathrm{out}_t(\pi_1), y) + U(\mathrm{out}_t(\pi_4), y) &= g(e_1) + g(e_1') + g(e_3) + g(e_3') + 2g(e_2) + 2g(e_4) + 2g(e_5) + 2g(e_6) \\
&= U(\mathrm{out}_t(\pi_2), y) + U(\mathrm{out}_t(\pi_3), y).
\end{aligned}
\]

This, however, contradicts the definition of the count-based gains in Equation (1), under which the left-hand side equals 1 while the right-hand side equals 0:

\[
\underbrace{U(\mathrm{out}_t(\pi_1), y)}_{=1} + \underbrace{U(\mathrm{out}_t(\pi_4), y)}_{=0} \;\neq\; \underbrace{U(\mathrm{out}_t(\pi_2), y)}_{=0} + \underbrace{U(\mathrm{out}_t(\pi_3), y)}_{=0}.
\]
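The counterexample can also be checked numerically. The following Python sketch is an illustration rather than code from the paper: it assumes that the count-based gain is the inner product of the 4-gram count vectors of the path output and the target, and that the four paths spell out the sentences suggested by the transition labels of Figure 11. The helper names `ngram_counts` and `count_gain` and the edge names `e1`, `e1p`, and so on are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_gain(output, target, n=4):
    """Inner product of the n-gram count vectors of output and target."""
    out_counts = ngram_counts(output, n)
    tgt_counts = ngram_counts(target, n)
    return sum(c * tgt_counts[g] for g, c in out_counts.items())


target = "He would like to eat cake".split()

# Each path is a list of (edge name, output word) pairs. The edge names are
# illustrative stand-ins for the transitions in Figure 11 ("p" marks a prime).
tail = [("e4", "to"), ("e5", "have"), ("e6", "tea")]
paths = {
    "pi1": [("e1", "He"), ("e2", "would"), ("e3", "like")] + tail,
    "pi2": [("e1p", "She"), ("e2", "would"), ("e3", "like")] + tail,
    "pi3": [("e1", "He"), ("e2", "would"), ("e3p", "love")] + tail,
    "pi4": [("e1p", "She"), ("e2", "would"), ("e3p", "love")] + tail,
}

# Count-based gain of every path with respect to the target.
gains = {name: count_gain([w for _, w in p], target) for name, p in paths.items()}

# Multisets of edges: pi1 and pi4 together use exactly the same edges as
# pi2 and pi3 together, so any additive gain assigns both pairs the same total.
edges = {name: Counter(e for e, _ in p) for name, p in paths.items()}
same_edges = edges["pi1"] + edges["pi4"] == edges["pi2"] + edges["pi3"]

print(gains)
print(same_edges)
print(gains["pi1"] + gains["pi4"], gains["pi2"] + gains["pi3"])
```

Because the two pairs of paths share the same multiset of edges, any edge-additive gain would give them equal totals, whereas the count-based totals computed here differ.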