A reaction network scheme which implements inference and learning for Hidden Markov Models
Abhinav Singh, Carsten Wiuf, Abhishek Behera, Manoj Gopalkrishnan
aa r X i v : . [ c s . ET ] A ug A reaction network scheme which implements inference andlearning for Hidden Markov Models
Abhinav Singh , Carsten Wiuf , Abhishek Behera , and Manoj Gopalkrishnan UM-DAE Centre for Excellence in Basic Sciences, Mumbai, India Department of Mathematical Sciences, University of Copenhagen, Denmark Indian Institute of Technology Bombay, Mumbai, India { abhinavsns7, abhishek.enlightened, manoj.gopalkrishnan } @[email protected] Abstract.
With a view towards molecular communication systems and molecular multi-agent systems, we propose the Chemical Baum-Welch Algorithm, a novel reaction network scheme that learns parameters for Hidden Markov Models (HMMs). Each reaction in our scheme changes only one molecule of one species to one molecule of another. The reverse change is also accessible, but via a different set of enzymes, in a design reminiscent of futile cycles in biochemical pathways. We show that every fixed point of the Baum-Welch algorithm for HMMs is a fixed point of our reaction network scheme, and every positive fixed point of our scheme is a fixed point of the Baum-Welch algorithm. We prove that the "Expectation" step and the "Maximization" step of our reaction network separately converge exponentially fast. We simulate mass-action kinetics for our network on an example sequence, and show that it learns the same parameters for the HMM as the Baum-Welch algorithm.
1 Introduction

The sophisticated behavior of living cells on short timescales is powered by biochemical reaction networks. One may say that evolution has composed the symphony of the biosphere, genetic machinery conducts the music, and reaction networks are the orchestra. Understanding the capabilities and limits of this molecular orchestra is key to understanding living systems, as well as to engineering molecular systems that are capable of sophisticated life-like behavior.

The technology of implementing abstract reaction networks with molecules is a subfield of molecular systems engineering that has witnessed rapid advances in recent times. Several researchers [1-8] have proposed theoretical schemes for implementing arbitrary reaction networks with DNA oligonucleotides. There is a growing body of experimental demonstrations of such schemes [2, 7, 9-11]. A stack of tools is emerging to help automate the design process. We can now compile abstract reaction networks to a set of DNA oligonucleotides that will implement the dynamics of the network in solution [12]. We can computationally simulate the dynamics of these oligonucleotide molecular systems [13] to allow debugging prior to experimental implementation. In view of these rapid advances, the study of reaction networks from the point of view of their computational capabilities has become even more urgent.

It has long been known that reaction networks can compute any computable function [14]. The literature has several examples of reaction network schemes that have been inspired by known algorithms [15-22]. Our group has previously described reaction network schemes that solve statistical problems like maximum likelihood [23], sampling a conditional distribution and inference [24], and learning from partial observations [25].
These schemes exploit the thermodynamic nature of the underlying molecular systems that will implement these reaction networks, and can be expressed in terms of variational ideas involving minimization of Helmholtz free energy [26-28].

In this paper, we consider situations where partial information about the environment is available to a cell in the form of a sequence of observations. For example, this might happen when an enzyme is acting processively on a polymer, or a molecular walker [29-31] is trying to locate its position on a grid. In situations like this, multiple observations are not independent. Such sequences cannot be summarized merely by the type of the sequence [32], i.e., the number of times various symbols occur. Instead, the order of the observations carries information about state changes in the process producing the sequence. The number of sequences grows exponentially with length, and our previously proposed schemes are algorithmically inadequate. Dealing with such situations requires a pithy representation of sequences, and a way of doing inference and learning directly on such representations. In Statistics and Machine Learning, this problem is solved by
Hidden Markov Models (HMMs) [33].

HMMs are a widely used model in Machine Learning, powering sequence analysis applications like speech recognition [34], handwriting recognition, and bioinformatics. They are also essential components of communication systems as well as of intelligent agents trained by reinforcement learning methods. In this article, we describe a reaction network scheme which implements the Baum-Welch algorithm, an iterative algorithm for learning HMM parameters. Reaction networks that can do such statistical analysis on sequences are likely to be an essential component of molecular communication systems, enabling cooperative behavior among a population of artificial cells. Our main contributions are:

1. In Section 2, we describe what the reader needs to know about HMMs and the Baum-Welch algorithm to be able to follow the subsequent constructions. No prerequisites are assumed beyond familiarity with matrices and probability distributions.
2. In Section 3, we describe a novel reaction network scheme to learn parameters for an HMM.
3. We prove in Theorem 1 that every fixed point of the Baum-Welch algorithm is also a fixed point of the continuous dynamics of this reaction network scheme.
4. In Theorem 2, we prove that every positive fixed point of the dynamics of our reaction network scheme is a fixed point of the Baum-Welch algorithm.
5. In Theorem 3 and Theorem 4, we prove that the subsets of our reaction network scheme which correspond to the Expectation step and the Maximization step of the Baum-Welch algorithm both separately converge exponentially fast.
6. In Example 1, we simulate our reaction network scheme on an input sequence and show that the network dynamics successfully learns the same parameters as the Baum-Welch algorithm.
7. We show in Example 2 that when the Baum-Welch fixed point is on the boundary, our scheme need not converge to a Baum-Welch fixed point. However, we conjecture that if there exists a positive Baum-Welch fixed point, then our scheme must always converge to a Baum-Welch fixed point. In particular, there would be a positive Baum-Welch fixed point if the true HMM generating the sequence has all parameters positive and the observed sequence is long enough.
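Before the formal definitions in Section 2, it may help to see an HMM's generative process as code. The sketch below is our own illustrative Python (function and variable names are our choices, not notation from the paper): it runs a Markov chain over hidden states and emits a visible symbol at each step.

```python
import random

def sample_hmm(pi, theta, psi, L, seed=0):
    """Generate (hidden, visible) sequences of length L from an HMM.
    pi: initial distribution; theta: transition matrix (rows sum to 1);
    psi: emission matrix (rows sum to 1). States and symbols are 0-based."""
    rng = random.Random(seed)

    def draw(dist):
        # Sample an index from a discrete distribution by inverse CDF.
        r, acc = rng.random(), 0.0
        for i, p in enumerate(dist):
            acc += p
            if r < acc:
                return i
        return len(dist) - 1

    hidden, visible = [], []
    x = draw(pi)
    for _ in range(L):
        hidden.append(x)
        visible.append(draw(psi[x]))   # emit from the current hidden state
        x = draw(theta[x])             # then take a hidden transition
    return hidden, visible
```

With deterministic (0/1) parameters the output is forced; with generic parameters only the visible sequence would be available to a learner, which is exactly the setting of this paper.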
2 Hidden Markov Models and the Baum-Welch Algorithm

Fix two finite sets P and Q. A stochastic map is a |P| × |Q| matrix A = (a_{pq}) such that a_{pq} ≥ 0 for all p ∈ P and q ∈ Q, and ∑_{q ∈ Q} a_{pq} = 1 for all p ∈ P. Intuitively, stochastic maps represent conditional probability distributions.

An HMM (H, V, θ, ψ, π) consists of finite sets H (for 'hidden') and V (for 'visible'), a stochastic map θ from H to H called the transition matrix, a stochastic map ψ from H to V called the emission matrix, and an initial probability distribution π = (π_h)_{h ∈ H} on H, i.e., π_h ≥ 0 for all h ∈ H and ∑_{h ∈ H} π_h = 1. See Figure 1a for an example.

Fig. 1. Learning HMMs from sequences. (a) HMM: The hidden states H_1 and H_2 are not directly observable. Instead, what are observed are elements V_1, V_2 from the set V = {V_1, V_2} of "visible states." The parameters θ_{11}, θ_{12}, θ_{21}, θ_{22} denote the probabilities of transitions between the hidden states. The probability of observing V_1 or V_2 depends on the parameters ψ_{11}, ψ_{12}, ψ_{21}, ψ_{22} as indicated in the figure. (b) The forward algorithm computes the position-(l+1) likelihood α_{l+1,1} = α_{l,1} θ_{11} ψ_{1 v_{l+1}} + α_{l,2} θ_{21} ψ_{1 v_{l+1}} by forward-propagating the position-l likelihoods α_{l,1} and α_{l,2}. Here v_l, v_{l+1} ∈ V are the observed emissions at positions l and l+1. (c) The backward algorithm computes the position-(l−1) conditional probability β_{l−1,1} = θ_{11} ψ_{1 v_l} β_{l,1} + θ_{12} ψ_{2 v_l} β_{l,2} by propagating the position-l conditional probabilities β_{l,1} and β_{l,2} backwards. (d) The Baum-Welch Algorithm is a fixed-point Expectation-Maximization computation. The E step calls the forward and backward algorithms as subroutines and, conditioned on the entire observed sequence (v_1, v_2, ..., v_L) ∈ V^L, computes the probabilities γ_{lg} of being in state g ∈ H at position l and the probabilities ξ_{lgh} of taking the transition (g, h) ∈ H × H at position l. The M step updates the parameters θ and ψ to maximize their likelihood given the observed sequence.

Suppose a length-L sequence (v_1, v_2, ..., v_L) ∈ V^L of visible states is observed due to a hidden sequence (x_1, x_2, ..., x_L) ∈ H^L. The following questions related to an HMM are commonly studied:
1. Likelihood:
For fixed θ, ψ, compute the likelihood Pr(v_1, v_2, ..., v_L | θ, ψ). This problem is solved by the forward-backward algorithm.
2. Learning: Estimate the parameters θ̂, ψ̂ which maximize the likelihood of the observed sequence (v_1, v_2, ..., v_L) ∈ V^L. This problem is solved by the Baum-Welch algorithm, which is an Expectation-Maximization (EM) algorithm; it uses the forward-backward algorithm as a subroutine to compute the E step of EM.
3. Decoding: For fixed θ, ψ, find the sequence (ĥ_1, ĥ_2, ..., ĥ_L) ∈ H^L that has the highest probability of producing the given observed sequence (v_1, v_2, ..., v_L). This problem is solved by the Viterbi algorithm.

The forward algorithm (Figure 1b) takes as input an HMM (H, V, θ, ψ, π) and a length-L observation sequence (v_1, ..., v_L) ∈ V^L, and outputs the L × |H| likelihoods α_{lh} = Pr[v_1, v_2, ..., v_l, x_l = h | θ, ψ] of observing the symbols v_1, ..., v_l and being in the hidden state h ∈ H at time l. It does so using the following recursion:
– Initialisation: α_{1h} = π_h ψ_{h v_1} for all h ∈ H,
– Recursion: α_{lh} = ∑_{g ∈ H} α_{l−1,g} θ_{gh} ψ_{h v_l} for all h ∈ H and l = 2, ..., L.

The backward algorithm (Figure 1c) takes as input an HMM (H, V, θ, ψ, π) and a length-L observation sequence (v_1, ..., v_L) ∈ V^L, and outputs the L × |H| conditional probabilities β_{lh} = Pr[v_{l+1}, v_{l+2}, ..., v_L | x_l = h, θ, ψ] of observing the symbols v_{l+1}, ..., v_L given that the hidden state x_l at time l has label h ∈ H:
– Initialisation: β_{Lh} = 1 for all h ∈ H,
– Recursion: β_{lh} = ∑_{g ∈ H} θ_{hg} ψ_{g v_{l+1}} β_{l+1,g} for all h ∈ H and l = 1, ..., L − 1.

The E step of the Baum-Welch algorithm takes as input an HMM (H, V, θ, ψ, π) and a length-L observation sequence (v_1, ..., v_L). It outputs the L × |H| conditional probabilities γ_{lh} = Pr[x_l = h | θ, ψ, v] of being in hidden state h at time l conditioned on the observed sequence v, by:

γ_{lh} = α_{lh} β_{lh} / ∑_{g ∈ H} α_{lg} β_{lg}   for all h ∈ H and l = 1, 2, ..., L.

It also outputs the (L − 1) × |H| × |H| probabilities ξ_{lgh} = Pr[x_l = g, x_{l+1} = h | θ, ψ, v] of transitioning along the edge (g, h) at time l conditioned on the observed sequence v, by:

ξ_{lgh} = α_{lg} θ_{gh} ψ_{h v_{l+1}} β_{l+1,h} / ∑_{f ∈ H} α_{lf} β_{lf}   for all g, h ∈ H and l = 1, ..., L − 1.

Remark 1.
Note that the E step uses the forward and backward algorithms as subroutines to first compute the α and β values. Further note that we do not need the forward and backward algorithms to return the actual values α_{lh} and β_{lh}. To be precise, let α_l = (α_{lh})_{h ∈ H} ∈ R^H denote the vector of forward likelihoods at time l. Then for the E step to work, we only need the direction of α_l and not the magnitude: the numerator and denominator in the updates are both linear in α_{lh}, and the magnitude cancels out. Similarly, if β_l = (β_{lh})_{h ∈ H} ∈ R^H denotes the vector of backward likelihoods at time l, then the E step only cares about the direction of β_l and not the magnitude. This scale symmetry is a useful property for numerical solvers. We will also make use of this freedom when we implement a lax forward-backward algorithm using reaction networks in the next section.

The M step of the Baum-Welch algorithm takes as input the values γ and ξ output by the E step as a reconstruction of the dynamics on hidden states, and outputs new maximum likelihood estimates of the parameters θ, ψ that best explain these values. The update rule turns out to be very simple. For all g, h ∈ H and w ∈ V:

θ_{gh} ← ∑_{l=1}^{L−1} ξ_{lgh} / ∑_{l=1}^{L−1} ∑_{f ∈ H} ξ_{lgf},   ψ_{hw} ← ∑_{l=1}^{L} γ_{lh} δ_{w,v_l} / ∑_{l=1}^{L} γ_{lh},

where δ_{w,v_l} = 1 if w = v_l and 0 otherwise is the Kronecker delta. Remark 2.
Like in Remark 1, note that the M step does not require its inputs to be the actual values γ and ξ. There is a scaling symmetry, so that we only need the directions of the vectors γ(h) = (γ_{lh})_{l=1,...,L} ∈ R^L for all h ∈ H and ξ(g) = (ξ_{lgh})_{1 ≤ l ≤ L−1, h ∈ H} ∈ R^{(L−1) × H} for all g ∈ H. This gives us the freedom to implement a lax E projection without affecting the M projection, and we will exploit this freedom when designing our reaction network.

The Baum-Welch algorithm (Figure 1d) is a fixed-point EM computation that alternately runs the E step and the M step until the updates become small enough. It is guaranteed to converge to a fixed point (θ̂, ψ̂). However, the fixed point need not always be a global optimum.

Following [25], we recall some concepts from reaction network theory [24, 35-39]. Fix a finite set S of species. An S-reaction, or simply a reaction when S is understood from context, is a formal chemical equation

∑_{X ∈ S} y_X X → ∑_{X ∈ S} y'_X X,

where the numbers y_X, y'_X ∈ Z_{≥0} are the stoichiometric coefficients of species X on the reactant side and product side respectively. A reaction network is a pair (S, R) where R is a finite set of S-reactions. A reaction system is a triple (S, R, k) where (S, R) is a reaction network and k : R → R_{>0} is called the rate function. As is common when specifying reaction networks, we will find it convenient to explicitly specify only a set of chemical equations, leaving the set of species to be inferred by the reader.

Fix a reaction system (S, R, k). Deterministic mass-action kinetics describes a system of ordinary differential equations on the concentration variables {x_i(t) | i ∈ S} according to:

ẋ_i(t) = ∑_{a → b ∈ R} k_{a → b} (b_i − a_i) ∏_{j ∈ S} x_j(t)^{a_j}.

3 The Baum-Welch Reaction Network

In this section we describe a reaction network for each part of the Baum-Welch algorithm. Fix an HMM M = (H, V, θ, ψ, π). Pick an arbitrary hidden state h* ∈ H and an arbitrary visible state v* ∈ V.
Picking these states h* ∈ H and v* ∈ V is merely an artifice to break symmetry, akin to selecting leaders, and our results hold independent of these choices. Also fix a length L ∈ Z_{>0} of observed sequence.

We first work out in full detail how the forward algorithm of the Baum-Welch algorithm may be translated into chemical reactions. Suppose a length-L sequence (v_1, v_2, ..., v_L) ∈ V^L of visible states is observed. Recall that the forward algorithm uses the following recursion:
– Initialisation: α_{1h} = π_h ψ_{h v_1} for all h ∈ H,
– Recursion: α_{lh} = ∑_{g ∈ H} α_{l−1,g} θ_{gh} ψ_{h v_l} for all h ∈ H and l = 2, ..., L.

Notice this implies
– α_{1h} × π_{h*} ψ_{h* v_1} = α_{1h*} × π_h ψ_{h v_1} for all h ∈ H \ {h*},
– α_{lh} × (∑_{g ∈ H} α_{l−1,g} θ_{g h*} ψ_{h* v_l}) = α_{l h*} × (∑_{g ∈ H} α_{l−1,g} θ_{gh} ψ_{h v_l}) for all h ∈ H \ {h*} and l = 2, ..., L.

This prompts the use of the following reactions for the initialization step:

α_{1h} + π_{h*} + ψ_{h* v_1} → α_{1h*} + π_{h*} + ψ_{h* v_1}
α_{1h*} + π_h + ψ_{h v_1} → α_{1h} + π_h + ψ_{h v_1}

for all h ∈ H \ {h*}. By design, α_{1h} × π_{h*} ψ_{h* v_1} = α_{1h*} × π_h ψ_{h v_1} is the balance equation for the pair of reactions corresponding to each h ∈ H \ {h*}.

Similarly, for the recursion step we use the following reactions:

α_{lh} + α_{l−1,g} + θ_{g h*} + ψ_{h* v_l} → α_{l h*} + α_{l−1,g} + θ_{g h*} + ψ_{h* v_l}
α_{l h*} + α_{l−1,g} + θ_{gh} + ψ_{h v_l} → α_{lh} + α_{l−1,g} + θ_{gh} + ψ_{h v_l}

for all g ∈ H, h ∈ H \ {h*}, and l = 2, ..., L. Again by design,

α_{lh} × ∑_{g ∈ H} α_{l−1,g} θ_{g h*} ψ_{h* v_l} = α_{l h*} × ∑_{g ∈ H} α_{l−1,g} θ_{gh} ψ_{h v_l}

is the balance equation for the set of reactions corresponding to each (h, l) ∈ (H \ {h*}) × {2, ..., L}.

The above reactions depend on the observed sequence (v_1, v_2, ..., v_L) ∈ V^L of visible states. This is a problem, because one would have to design different reaction networks for different observed sequences. To solve this problem, we introduce the species E_{lw} with l = 1, ..., L and w ∈ V. With the E_{lw} species, we use the following reactions for the forward algorithm:

α_{1h} + π_{h*} + ψ_{h* w} + E_{1w} → α_{1h*} + π_{h*} + ψ_{h* w} + E_{1w}
α_{1h*} + π_h + ψ_{hw} + E_{1w} → α_{1h} + π_h + ψ_{hw} + E_{1w}

for all h ∈ H \ {h*} and w ∈ V, and

α_{lh} + α_{l−1,g} + θ_{g h*} + ψ_{h* w} + E_{lw} → α_{l h*} + α_{l−1,g} + θ_{g h*} + ψ_{h* w} + E_{lw}
α_{l h*} + α_{l−1,g} + θ_{gh} + ψ_{hw} + E_{lw} → α_{lh} + α_{l−1,g} + θ_{gh} + ψ_{hw} + E_{lw}

for all g ∈ H, h ∈ H \ {h*}, l = 2, ..., L, and w ∈ V. The E_{lw} species are to be initialized such that E_{lw} = 1 iff w = v_l. Different observed sequences can now be processed by the same reaction network, by appropriately initializing the species E_{lw}.

The other parts of the Baum-Welch algorithm may be translated into chemical reactions using similar logic. We call the resulting reaction network the Baum-Welch Reaction Network BW(M, h*, v*, L). It consists of four subnetworks corresponding to the four parts of the Baum-Welch algorithm, as shown in Table 1.

Table 1: Baum-Welch Reaction Network. The steps and reactions are for all g, h ∈ H, w ∈ V, and l = 1, ..., L − 1. Notice there are some null reactions of the form ∑ a_i X_i → ∑ a_i X_i. As these null reactions have no effect on the dynamics, they can be ignored.
Forward initialisation, α_{1h} = π_h ψ_{h v_1}:
    α_{1h} + π_{h*} + ψ_{h* w} + E_{1w} → α_{1h*} + π_{h*} + ψ_{h* w} + E_{1w}
    α_{1h*} + π_h + ψ_{hw} + E_{1w} → α_{1h} + π_h + ψ_{hw} + E_{1w}

Forward recursion, α_{l+1,h} = ∑_{g ∈ H} α_{lg} θ_{gh} ψ_{h v_{l+1}}:
    α_{l+1,h} + α_{lg} + θ_{g h*} + ψ_{h* w} + E_{l+1,w} → α_{l+1,h*} + α_{lg} + θ_{g h*} + ψ_{h* w} + E_{l+1,w}
    α_{l+1,h*} + α_{lg} + θ_{gh} + ψ_{hw} + E_{l+1,w} → α_{l+1,h} + α_{lg} + θ_{gh} + ψ_{hw} + E_{l+1,w}

Backward, β_{Lh} = 1 and β_{lh} = ∑_{g ∈ H} θ_{hg} ψ_{g v_{l+1}} β_{l+1,g}:
    β_{lh} + β_{l+1,g} + θ_{h* g} + ψ_{gw} + E_{l+1,w} → β_{lh*} + β_{l+1,g} + θ_{h* g} + ψ_{gw} + E_{l+1,w}
    β_{lh*} + β_{l+1,g} + θ_{hg} + ψ_{gw} + E_{l+1,w} → β_{lh} + β_{l+1,g} + θ_{hg} + ψ_{gw} + E_{l+1,w}

E step, γ_{lh} = α_{lh} β_{lh} / ∑_{g ∈ H} α_{lg} β_{lg}:
    γ_{lh} + α_{lh*} + β_{lh*} → γ_{lh*} + α_{lh*} + β_{lh*}
    γ_{lh*} + α_{lh} + β_{lh} → γ_{lh} + α_{lh} + β_{lh}

E step, ξ_{lgh} = α_{lg} θ_{gh} ψ_{h v_{l+1}} β_{l+1,h} / ∑_{f ∈ H} α_{lf} β_{lf}:
    ξ_{lgh} + α_{lh*} + θ_{h* h*} + β_{l+1,h*} + ψ_{h* w} + E_{l+1,w} → ξ_{l h* h*} + α_{lh*} + θ_{h* h*} + β_{l+1,h*} + ψ_{h* w} + E_{l+1,w}
    ξ_{l h* h*} + α_{lg} + θ_{gh} + β_{l+1,h} + ψ_{hw} + E_{l+1,w} → ξ_{lgh} + α_{lg} + θ_{gh} + β_{l+1,h} + ψ_{hw} + E_{l+1,w}

M step, θ_{gh} ← ∑_{l=1}^{L−1} ξ_{lgh} / ∑_{l=1}^{L−1} ∑_{f ∈ H} ξ_{lgf}:
    θ_{gh} + ξ_{l g h*} → θ_{g h*} + ξ_{l g h*}
    θ_{g h*} + ξ_{lgh} → θ_{gh} + ξ_{lgh}

M step, ψ_{hw} ← ∑_{l=1}^{L} γ_{lh} δ_{w,v_l} / ∑_{l=1}^{L} γ_{lh}:
    ψ_{hw} + γ_{lh} + E_{l v*} → ψ_{h v*} + γ_{lh} + E_{l v*}
    ψ_{h v*} + γ_{lh} + E_{lw} → ψ_{hw} + γ_{lh} + E_{lw}

The Baum-Welch reaction network described above has a special structure. Every reaction is a monomolecular transformation catalyzed by a set of species. The reverse transformation is also present, catalyzed by a different set of species, giving the network a "futile cycle" [40] structure. In addition, each connected component in the undirected graph representation of the network has a topology with all transformations happening to and from a central species. This prompts the following definitions.
Fig. 2. The Baum-Welch Reaction Network represented as an undirected graph. (a) Each reaction is a unimolecular transformation driven by catalysts. The nodes represent the species undergoing transformation, and an edge represents a pair of reactions which drive this transformation backwards and forwards in a futile cycle; for example, the pair

α_{lh} + α_{l−1,g} + θ_{g h*} + ψ_{h* w} + E_{lw} →[k^{α+}_{lh}] α_{l h*} + α_{l−1,g} + θ_{g h*} + ψ_{h* w} + E_{lw}
α_{l h*} + α_{l−1,g} + θ_{gh} + ψ_{hw} + E_{lw} →[k^{α−}_{lh}] α_{lh} + α_{l−1,g} + θ_{gh} + ψ_{hw} + E_{lw}

is drawn as a single edge between α_{lh} and α_{l h*} with rate k^α_{lh} = k^{α+}_{lh} = k^{α−}_{lh}; catalysts are omitted for clarity. (b) The network decomposes into a collection of disjoint flowers. Nodes represent species and edges represent pairs of reactions; the α and γ species form L flowers each, the β and ξ species form L − 1 flowers each, and the θ and ψ species form |H| flowers each (not shown in figure). All reactions in the same petal have the same specific rate constant, so the dimension of the space of permissible rate constants is equal to the number of petals in the graph.

Definition 1 (Flowers, petals, gardens). A graph is a triple (Nodes, Edges, π) where Nodes and Edges are finite sets and π is a map from Edges to unordered pairs of elements of Nodes. A flower is a graph with a special node n* such that for every edge e ∈ Edges we have n* ∈ π(e). A garden is a graph which is a union of disjoint flowers. A petal is a set of all edges e which have the same π(e), i.e., which are incident between the same pair of nodes.

Figure 2 shows how the Baum-Welch reaction network can be represented as a garden graph, in which species are nodes and reactions are edges. A collection of specific rates is permissible if all reactions in a petal have the same rate. However, different petals may have different rates.
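For concreteness, the forward subnetwork can be enumerated programmatically: one pair of reactions per (h ≠ h*, w) for the initialization, and one pair per (g, h ≠ h*, l, w) for the recursion, giving 2(|H|−1)|V| + 2|H|(|H|−1)(L−1)|V| reactions. The sketch below is our own illustrative code (the species-naming scheme is our choice, not notation from the paper); each reaction is a (reactants, products) pair, with catalysts repeated unchanged on both sides.

```python
def forward_subnetwork(H, V, L, h_star=0):
    """Enumerate the forward-step reactions of the Baum-Welch reaction
    network as (reactants, products) tuples of species-name strings."""
    rxns = []
    # Initialization petals: one pair per h != h*, per w in V.
    for h in range(H):
        if h == h_star:
            continue
        for w in range(V):
            cat = (f"pi_{h_star}", f"psi_{h_star}_{w}", f"E_1_{w}")
            rxns.append(((f"alpha_1_{h}",) + cat, (f"alpha_1_{h_star}",) + cat))
            cat = (f"pi_{h}", f"psi_{h}_{w}", f"E_1_{w}")
            rxns.append(((f"alpha_1_{h_star}",) + cat, (f"alpha_1_{h}",) + cat))
    # Recursion petals: one pair per g, per h != h*, per l = 2..L, per w.
    for l in range(2, L + 1):
        for g in range(H):
            for h in range(H):
                if h == h_star:
                    continue
                for w in range(V):
                    cat = (f"alpha_{l-1}_{g}", f"theta_{g}_{h_star}",
                           f"psi_{h_star}_{w}", f"E_{l}_{w}")
                    rxns.append(((f"alpha_{l}_{h}",) + cat,
                                 (f"alpha_{l}_{h_star}",) + cat))
                    cat = (f"alpha_{l-1}_{g}", f"theta_{g}_{h}",
                           f"psi_{h}_{w}", f"E_{l}_{w}")
                    rxns.append(((f"alpha_{l}_{h_star}",) + cat,
                                 (f"alpha_{l}_{h}",) + cat))
    return rxns
```

For |H| = 2, |V| = 2, L = 5 this yields 2·1·2 + 2·2·1·4·2 = 36 reactions, matching the count formula above.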
We will denote the specific rate for a petal by superscripting the type of the species and subscripting its indices. For example, the specific rate for reactions in the petal for α_{lh} is denoted k^α_{lh}. The notation for the remaining rate constants can be read from Figure 2.

Remark 3. The "flower" topology we have employed for the Baum-Welch reaction network is only one among several possibilities. The important thing is to achieve connectivity between the nodes that ought to be connected, ensuring that the ratio of the concentrations of the species denoted by adjacent nodes takes the value prescribed by the Baum-Welch algorithm. Many other connection topologies can therefore be imagined, for example a ring topology where each node is connected to two other nodes while maintaining the same connected components. In fact, there are obvious disadvantages to the star topology, with a single point of failure, whereas the ring topology appears more resilient. How network topology affects basins of attraction, rates of convergence, and the emergence of spurious equilibria in our algorithm is a compelling question beyond the scope of the present work.

The Baum-Welch reaction network with a permissible choice of rate constants k defines a Baum-Welch reaction system (BW(M, h*, v*, L), k) whose deterministic mass-action kinetics equations perform a continuous-time version of the Baum-Welch algorithm. We call this the Chemical Baum-Welch Algorithm, and describe it in Algorithm 1.
Algorithm 1: Chemical Baum-Welch Algorithm

Input: An HMM M = (H, V, θ, ψ, π) and an observed sequence v ∈ V^L.
Output: Parameters θ̂ ∈ R^{|H|×|H|}_{≥0} and ψ̂ ∈ R^{|H|×|V|}_{≥0}.
Initialization of concentrations at t = 0:
1. For w ∈ V and l = 1, ..., L, initialize E_{lw}(0) = 1 if w = v_l and E_{lw}(0) = 0 otherwise.
2. For g ∈ H, initialize β_{Lg}(0) = β_{L h*}(0), i.e., all the β_{Lg} concentrations equal.
3. For every other species, initialize its concentration arbitrarily in R_{>0}.
Algorithm: Run the Baum-Welch reaction system with mass-action kinetics until convergence.
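Algorithm 1 can be prototyped with a generic mass-action integrator. The sketch below is our own illustrative code (not the solver used in the paper): a forward-Euler step for the mass-action ODEs defined in Section 2, demonstrated on a single futile-cycle pair of the kind used throughout the network. X is converted to Y under catalyst C1 and back under catalyst C2, so the balance equation x·c1 = y·c2 drives the ratio Y/X to C1/C2.

```python
def mass_action_step(x, reactions, dt):
    """One forward-Euler step of dx_i/dt = sum_r k_r (b_i - a_i) prod_j x_j^{a_j}.
    Each reaction is (reactants, products, k), with stoichiometry dicts."""
    dx = {i: 0.0 for i in x}
    for reactants, products, k in reactions:
        flux = k
        for j, a in reactants.items():
            flux *= x[j] ** a
        for j, a in reactants.items():
            dx[j] -= flux * a
        for j, b in products.items():
            dx[j] += flux * b
    return {i: x[i] + dt * dx[i] for i in x}

# A single futile-cycle pair: X -> Y catalyzed by C1, Y -> X catalyzed by C2.
# The balance equation x * c1 = y * c2 forces Y/X -> C1/C2 at equilibrium.
reactions = [
    ({"X": 1, "C1": 1}, {"Y": 1, "C1": 1}, 1.0),
    ({"Y": 1, "C2": 1}, {"X": 1, "C2": 1}, 1.0),
]
state = {"X": 1.0, "Y": 1.0, "C1": 2.0, "C2": 1.0}
for _ in range(20000):
    state = mass_action_step(state, reactions, 0.001)
# state["Y"] / state["X"] is now close to C1 / C2 = 2
```

Note that X + Y is conserved, as it must be: every reaction converts one molecule of one species into one molecule of another, with the catalysts unchanged.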
The Baum-Welch reaction network has the following number of species of each type:

Type     π     α      β      θ      ψ        E      γ      ξ
Number   |H|   |H|L   |H|L   |H|²   |H||V|   L|V|   |H|L   |H|²(L−1)

The total number of species is |H| + 3|H|L + |H|² + |H||V| + L|V| + |H|²(L−1), which is O(L|H|² + |H||V| + L|V|). The number of reactions (ignoring the null reactions of the form ∑ a_i X_i → ∑ a_i X_i) in each part is:

Forward:        2(|H|−1)|V| + 2|H|(|H|−1)(L−1)|V|
Backward:       2|H|(|H|−1)(L−1)|V|
Expectation:    2L(|H|−1) + 2(|H|²−1)(L−1)|V|
Maximization:   2|H|(|H|−1)(L−1) + 2|H|L(|V|−1)

so that the total number of reactions is O(|H|²L|V|).

The first theorem shows that the Chemical Baum-Welch Algorithm recovers all of the Baum-Welch equilibria.

Theorem 1.
Every fixed point of the Baum-Welch algorithm for an HMM M = (H, V, θ, ψ, π) is a fixed point for the corresponding Chemical Baum-Welch Algorithm with permissible rates k.

Call a fixed point positive if all its components are strictly greater than 0. The next theorem shows that positive equilibria of the Chemical Baum-Welch Algorithm are also fixed points of the Baum-Welch algorithm.

Theorem 2.
Every positive fixed point for the Chemical Baum-Welch Algorithm on a Baum-Welch reaction system (BW(M, h*, v*, L), k) with permissible rates k is a fixed point for the Baum-Welch algorithm for the HMM M = (H, V, θ, ψ, π). See Appendix A.1 for proof.

The Baum-Welch algorithm is an iterative algorithm: the E step and the M step are run alternately. In contrast, the Chemical Baum-Welch Algorithm is a generalized EM algorithm [41] in which all the reactions are run at the same time in a single-pot reaction.

The next theorem shows that the E step, consisting of the forward network, the backward network, and the E network, converges exponentially fast to the correct equilibrium if the θ and ψ species are held fixed at a positive point.

Theorem 3.
For the Baum-Welch reaction system (BW(M, h*, v*, L), k) with permissible rates k, if the concentrations of the θ and ψ species are held fixed at a positive point, then the Forward, Backward, and Expectation step reaction systems on the α, β, γ, and ξ species converge to equilibrium exponentially fast. See Appendix A.2 for proof.

The next theorem shows that if the α, β, γ, ξ species are held fixed at a positive point, then the M step, consisting of the reactions modifying the θ and ψ species, converges exponentially fast to the correct equilibrium.

Theorem 4.
For the Baum-Welch reaction system (BW(M, h*, v*, L), k) with permissible rates k, if the concentrations of the α, β, γ, and ξ species are held fixed at a positive point, then the Maximization step reaction system on θ and ψ converges to equilibrium exponentially fast. See Appendix A.2 for proof.

The following examples demonstrate the behavior of the Chemical Baum-Welch Algorithm.
Example 1.
Consider an HMM (H, V, θ, ψ, π) with two hidden states H = {H_1, H_2} and two emitted symbols V = {V_1, V_2}, with a given starting probability π, initial transition probability θ, and initial emission probability ψ. Suppose we wish to learn (θ, ψ) for a given observed sequence (v_1, v_2, ..., v_25) ∈ V^25. We initialize the corresponding reaction system by setting species E_{l,v_l} = 1 and E_{l,w} = 0 for w ≠ v_l, and run the dynamics according to deterministic mass-action kinetics. Initial concentrations of all species not mentioned are chosen to be nonzero, but otherwise at random. For our numerical solution, we observe that the reaction network equilibrium point coincides with the Baum-Welch steady state (θ̂, ψ̂) (see Figure 3).

The next example shows that the Chemical Baum-Welch Algorithm can sometimes get stuck at points that are not equilibria for the Baum-Welch algorithm. This is a problem especially for very short sequences, and is probably happening because in such settings the best model sets many parameters to zero. When many species concentrations are set to 0, many reactions get turned off,
and the network gets stuck away from the desired equilibrium. We believe this will not happen if the HMM generating the observed sequence has nonzero parameters and the observed sequence is long enough.

Fig. 3. Simulation of Example 1. Panels (a) Reaction Network Dynamics and (b) Baum-Welch Algorithm plot the concentrations of the π, θ, and ψ species over time. Both simulations are started at exactly the same initial vector; this may not be apparent in the figure because the concentrations of some species change rapidly at the start in the reaction network.
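For reference, the classical Baum-Welch iteration against which the reaction network is compared in Figure 3 can be sketched in plain Python. This is our own illustrative implementation (list-of-lists matrices, 0-based symbols; names are our choices, not notation from the paper):

```python
def forward(pi, theta, psi, obs):
    """alpha[l][h] = Pr[v_1..v_{l+1}, x_{l+1} = h] (0-based l and symbols)."""
    H = len(pi)
    alpha = [[pi[h] * psi[h][obs[0]] for h in range(H)]]
    for v in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[g] * theta[g][h] for g in range(H)) * psi[h][v]
                      for h in range(H)])
    return alpha

def backward(theta, psi, obs):
    """beta[l][h] = Pr[v_{l+2}..v_L | x_{l+1} = h]."""
    H, L = len(theta), len(obs)
    beta = [[1.0] * H for _ in range(L)]
    for l in range(L - 2, -1, -1):
        beta[l] = [sum(theta[h][g] * psi[g][obs[l + 1]] * beta[l + 1][g]
                       for g in range(H)) for h in range(H)]
    return beta

def e_step(pi, theta, psi, obs):
    """gamma[l][h] and xi[l][g][h]; only the directions of alpha, beta matter."""
    H, L = len(pi), len(obs)
    alpha, beta = forward(pi, theta, psi, obs), backward(theta, psi, obs)
    gamma, xi = [], []
    for l in range(L):
        Z = sum(alpha[l][f] * beta[l][f] for f in range(H))  # normalizer
        gamma.append([alpha[l][h] * beta[l][h] / Z for h in range(H)])
        if l < L - 1:
            xi.append([[alpha[l][g] * theta[g][h] * psi[h][obs[l + 1]]
                        * beta[l + 1][h] / Z for h in range(H)]
                       for g in range(H)])
    return gamma, xi

def m_step(gamma, xi, obs, n_visible):
    """Re-estimate theta and psi from the E-step outputs."""
    L, H = len(gamma), len(gamma[0])
    theta, psi = [], []
    for g in range(H):
        row = [sum(xi[l][g][h] for l in range(L - 1)) for h in range(H)]
        total = sum(row)
        theta.append([x / total for x in row])
    for h in range(H):
        denom = sum(gamma[l][h] for l in range(L))
        psi.append([sum(gamma[l][h] for l in range(L) if obs[l] == w) / denom
                    for w in range(n_visible)])
    return theta, psi

def baum_welch(pi, theta, psi, obs, iters=50):
    """Alternate E and M steps a fixed number of times."""
    for _ in range(iters):
        gamma, xi = e_step(pi, theta, psi, obs)
        theta, psi = m_step(gamma, xi, obs, len(psi[0]))
    return theta, psi
```

Note that, as in Remarks 1 and 2, `e_step` and `m_step` are invariant to rescaling their respective inputs; the reaction network exploits exactly this freedom.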
Example 2.
Consider an HMM (H, V, θ, ψ, π) with two hidden states H = {H_1, H_2} and two emitted symbols V = {V_1, V_2}, where the initial distribution π is fixed, and the initial transition probability θ and initial emission probability ψ are given. Suppose we wish to learn (θ, ψ) for an observed sequence (v_1, ..., v_5) ∈ V^5. We again simulate the corresponding reaction network and also perform the Baum-Welch algorithm for comparison.

Fig. 4. Simulation of Example 2. Panels (a) Reaction Network Dynamics and (b) Baum-Welch Algorithm plot the concentrations of the π, θ, and ψ species over time. Both simulations are started at exactly the same initial vector; this may not be apparent in the figure because the concentrations of some species change rapidly at the start in the reaction network.
Figure 4 shows that the reaction network equilibrium point does not coincide with the Baum-Welch steady state: both converge to the same ψ̂, but the reaction network converges to a θ̂ different from the one found by the Baum-Welch algorithm.

In previous work, our group has shown that reaction networks can perform the Expectation-Maximization (EM) algorithm for partially observed log-linear statistical models [25, 42]. That algorithm also applies "out of the box" to learning HMM parameters. The problem with that algorithm is that the size of the reaction network becomes exponentially large in the length of the sequence, so that even examples like Example 1, with an observation sequence of length 25, would become impractical. In contrast, the scheme we have presented in this paper requires only linear growth with sequence length. We have obtained the savings by exploiting the graphical structure of HMMs. This allows us to compute the likelihoods α in a "dynamic programming" manner, instead of having to explicitly represent each path as a separate species.

Napp and Adams [20] have shown how to compute marginals on graphical models with reaction networks. They exploit graphical structure by mimicking belief propagation. Hidden Markov Models can be viewed as a special type of graphical model with 2L random variables X_1, X_2, ..., X_L, Y_1, Y_2, ..., Y_L, the X random variables taking values in H and the Y random variables in V. The X random variables form a Markov chain X_1 → X_2 → · · · → X_L. In addition, there are L edges from X_l to Y_l for l = 1 to L, denoting observations. Specialized to HMMs, the scheme of Napp and Adams would compute the equivalent of steady-state values of the γ species, performing a version of the E step. They are able to show that true marginals are fixed points of their scheme, which is similar to our Theorem 1.
Thus their work may be viewed as the first example of a reaction network scheme that exploits graphical structure to compute E projections. Our E step goes further by proving correctness as well as exponential convergence. Their work also raises the challenge of extending our scheme to all graphical models.

Poole et al. [43] have described Chemical Boltzmann Machines, which are reaction network schemes whose dynamics reconstructs inference in Boltzmann Machines. This inference can be viewed as a version of E projection; no scheme for learning is presented. The exact schemes presented there are exponentially large, and the more realistically sized schemes are presented without proof. In comparison, our schemes are polynomially sized, provably correct if the equilibrium is positive, and perform both inference and learning for HMMs.

Zechner et al. [11] have shown that Kalman filters can be implemented with reaction networks. Kalman filters can be thought of as a version of Hidden Markov Models with continuous hidden states [44]. It would be instructive to compare their scheme with ours, and note similarities and differences. In passing from position l to position l + 1 along the sequence, our scheme repeats the same reaction network that updates the α_{l+1} values using the α_l values. It is worth examining whether this can be done "in place," so that the same species can be reused and a reaction network can be described that is not tied to the length L of the sequence to be observed.

Recently, Cherry et al. [10] have given a brilliant experimental demonstration of learning with DNA molecules: they have empirically demonstrated a DNA molecular system that can classify handwritten digits from the MNIST database. Their approach is based on the notion of "winner-takes-all" circuits due to Maass [45], which was originally a proposal for how neural networks in the brain work.
Winner-take-all might also be capable of approximating HMM learning, at least in theory [46], and it is worth understanding precisely how such schemes relate to the kind of scheme we have described here. It is conceivable that our scheme could be converted to winner-take-all by having different species in the same flower give negative feedback to each other. This might well lead to sampling the most likely path, performing a decoding task similar to the Viterbi algorithm.

We have described a one-pot, one-shot reaction network implementation of the Baum-Welch algorithm. Firstly, this involves proposing a reaction system whose positive fixed points correspond to equilibria of the Baum-Welch algorithm. Secondly, it involves establishing the conditions under which convergence of solutions to these equilibria can be accomplished. Further, from a practical perspective, it is essential to obtain an implementation that does not rely on repeated iteration of a reaction scheme but only needs to be run once.

As we observe in Remark 3, there is a whole class of reaction networks that implements the Baum-Welch algorithm. We have proposed one such network, and are aware that there are other networks, potentially with more efficient dynamical and analytical properties than the one proposed here. Finding and characterizing efficient reaction networks that can carry out complicated statistical tasks will likely be of future concern.

We have only discussed deterministic dynamical properties of the network. However, in realistic biological contexts one might imagine that the network is implemented by relatively few molecules, so that stochastic effects are significant. Consequently, the study of the Baum-Welch reaction system under stochastic mass-action kinetics is likely to be of interest.

Lastly, we have mentioned the Viterbi algorithm, but have made no attempt to describe how the maximum likelihood sequence can be recovered from our reaction network.
This decoding step is likely to be of as much interest for molecular communication systems and molecular multi-agent systems as it is in more traditional domains of communications and multi-agent reinforcement learning. Because of the inherent stochasticity of reaction networks, there might even be opportunities for list decoding, by sampling different paths through the hidden states that have high probability conditioned on the observations. This might give an artificial cell the ability to "imagine" different possible realities and act assuming one of them to be the case, lending an intrinsic stochasticity and unpredictability to its behavior.
References
1. David Soloveichik, Georg Seelig, and Erik Winfree. DNA as a universal substrate for chemical kinetics. Proceedings of the National Academy of Sciences, 107(12):5393–5398, 2010.
2. Niranjan Srinivas. Programming chemical kinetics: engineering dynamic reaction networks with DNA strand displacement. PhD thesis, California Institute of Technology, 2015.
3. Lulu Qian, David Soloveichik, and Erik Winfree. Efficient Turing-universal computation with DNA polymers. In DNA Computing and Molecular Programming, pages 123–140, 2011.
4. Luca Cardelli. Strand algebras for DNA computing. Natural Computing, 10:407–428, 2011.
5. Matthew R. Lakin, Simon Youssef, Luca Cardelli, and Andrew Phillips. Abstractions for DNA circuit design. Journal of The Royal Society Interface, 9(68):470–486, 2011.
6. Luca Cardelli. Two-domain DNA strand displacement. Mathematical Structures in Computer Science, 23:02, 2013.
7. Yuan-Jyue Chen, Neil Dalchau, Niranjan Srinivas, Andrew Phillips, Luca Cardelli, David Soloveichik, and Georg Seelig. Programmable chemical controllers made from DNA. Nature Nanotechnology, 8(10):755–762, 2013.
8. Matthew R. Lakin, Darko Stefanovic, and Andrew Phillips. Modular verification of chemical reaction network encodings via serializability analysis. Theoretical Computer Science, 632:21–42, 2016.
9. Niranjan Srinivas, James Parkin, Georg Seelig, Erik Winfree, and David Soloveichik. Enzyme-free nucleic acid dynamical systems. Science, 358:6369, 2017.
10. Kevin M. Cherry and Lulu Qian. Scaling up molecular pattern recognition with DNA-based winner-take-all neural networks. Nature, 559:7714, 2018.
11. Christoph Zechner, Georg Seelig, Marc Rullan, and Mustafa Khammash. Molecular circuits for dynamic noise filtering. Proceedings of the National Academy of Sciences, 113(17):4729–4734, 2016.
12. Stefan Badelt, Seung Woo Shin, Robert F. Johnson, Qing Dong, Chris Thachuk, and Erik Winfree. A general-purpose CRN-to-DSD compiler with formal verification, optimization, and simulation capabilities. In International Conference on DNA-Based Computers, pages 232–248, 2017.
13. Matthew R. Lakin, Simon Youssef, Filippo Polo, Stephen Emmott, and Andrew Phillips. Visual DSD: a design and analysis tool for DNA strand displacement systems. Bioinformatics, 27(22):3211–3213, 2011.
14. Allen Hjelmfelt, Edward D. Weinberger, and John Ross. Chemical implementation of neural networks and Turing machines. Proceedings of the National Academy of Sciences, 88(24):10983–10987, 1991.
15. H. J. Buisman, Huub M. M. ten Eikelder, Peter A. J. Hilbers, and Anthony M. L. Liekens. Computing algebraic functions with biochemical reaction networks. Artificial Life, 15(1):5–19, 2009.
16. Kevin Oishi and Eric Klavins. Biomolecular implementation of linear I/O systems. Systems Biology, IET, 5(4):252–260, 2011.
17. David Soloveichik, Matthew Cook, Erik Winfree, and Jehoshua Bruck. Computation with finite stochastic chemical reaction networks. Natural Computing, 7(4):615–633, 2008.
18. Ho-Lin Chen, David Doty, and David Soloveichik. Deterministic function computation with chemical reaction networks. Natural Computing, 13(4):517–534, 2014.
19. Lulu Qian and Erik Winfree. Scaling up digital circuit computation with DNA strand displacement cascades. Science, 332(6034):1196–1201, 2011.
20. Nils E. Napp and Ryan P. Adams. Message passing inference with chemical reaction networks. In Advances in Neural Information Processing Systems, pages 2247–2255, 2013.
21. Lulu Qian, Erik Winfree, and Jehoshua Bruck. Neural network computation with DNA strand displacement cascades. Nature, 475(7356):368–372, 2011.
22. Luca Cardelli, Marta Kwiatkowska, and Max Whitby. Chemical reaction network designs for asynchronous logic circuits. Natural Computing, 17(1):109–130, 2018.
23. Manoj Gopalkrishnan. A scheme for molecular computation of maximum likelihood estimators for log-linear models. In International Conference on DNA Computing and Molecular Programming, pages 3–18, 2016.
24. Muppirala Viswa Virinchi, Abhishek Behera, and Manoj Gopalkrishnan. A stochastic molecular scheme for an artificial cell to infer its environment from partial observations. In International Conference on DNA Computing and Molecular Programming, 2017.
25. Muppirala Viswa Virinchi, Abhishek Behera, and Manoj Gopalkrishnan. A reaction network scheme which implements the EM algorithm. In International Conference on DNA Computing and Molecular Programming, pages 189–207, 2018.
26. Shunichi Amari. Information Geometry and Its Applications. Springer, 2016.
27. Imre Csiszár and Frantisek Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
28. Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:4, 1957.
29. Jong-Shik Shin and Niles A. Pierce. A synthetic DNA walker for molecular transport. Journal of the American Chemical Society, 126(35):10834–10835, 2004.
30. John Reif. The design of autonomous DNA nano-mechanical devices: walking and rolling DNA. DNA Computing, pages 439–461, 2003.
31. William Sherman and Nadrian Seeman. A precisely controlled DNA biped walking device. Nano Letters, 4:1203–1207, 2004.
32. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
33. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.
34. Biing Hwang Juang and Laurence R. Rabiner. Hidden Markov models for speech recognition. Technometrics, 33(3):251–272, 1991.
35. Martin Feinberg. On chemical kinetics of a certain class. Arch. Rational Mech. Anal., 46, 1972.
36. Friedrich J. M. Horn. Necessary and sufficient conditions for complex balancing in chemical kinetics. Arch. Rational Mech. Anal., 49, 1972.
37. Martin Feinberg. Lectures on chemical reaction networks. Lecture notes, 1979.
38. Manoj Gopalkrishnan. Catalysis in reaction networks. Bulletin of Mathematical Biology, 73(12):2962–2982, 2011.
39. David F. Anderson, Gheorghe Craciun, and Thomas G. Kurtz. Product-form stationary distributions for deficiency zero chemical reaction networks. Bulletin of Mathematical Biology, 72(8):1947–1970, 2010.
40. Benjamin P. Tu and Steven L. McKnight. Metabolic cycles as an underlying basis of biological oscillations. Nature Reviews Molecular Cell Biology, 7:9, 2006.
41. Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions, volume 382. John Wiley & Sons, 2007.
42. Abhinav Singh and Manoj Gopalkrishnan. EM algorithm with DNA molecules. In Poster Presentations of the 24th International Conference on DNA Computing and Molecular Programming, 2018.
43. William Poole, Andrés Ortiz-Munoz, Abhishek Behera, Nick S. Jones, Thomas E. Ouldridge, Erik Winfree, and Manoj Gopalkrishnan. Chemical Boltzmann machines. In International Conference on DNA-Based Computers, pages 210–231, 2017.
44. S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.
45. Wolfgang Maass. On the computational power of winner-take-all. Neural Computation, 12(11):2519–2535, 2000.
46. David Kappel, Bernhard Nessler, and Wolfgang Maass. STDP installs in winner-take-all circuits an online approximation to hidden Markov model learning. PLoS Computational Biology, 10:3, 2014.
Appendix A
Appendix A.1 Comparing points of equilibria
We will now prove Theorem 1. For the sake of convenience, we first recall the statement.
Theorem 1.
Every fixed point of the Baum-Welch algorithm for an HMM $\mathcal{M} = (H, V, \theta, \psi, \pi)$ is a fixed point for the corresponding Chemical Baum-Welch Algorithm with permissible rates $k$.

Proof. Consider a point $\Phi = (\alpha', \beta', \gamma', \xi', \theta', \psi')$ with $\alpha', \beta', \gamma' \in \mathbb{R}^{L \times H}_{\geq 0}$, $\xi' \in \mathbb{R}^{(L-1) \times H \times H}_{\geq 0}$, $\theta' \in \mathbb{R}^{H \times H}_{\geq 0}$ and $\psi' \in \mathbb{R}^{H \times V}_{\geq 0}$. If $\Phi$ is a fixed point of the Baum-Welch algorithm then it must satisfy:

– $\alpha'_{1h} = \pi_h \psi'_{h v_1}$ for all $h \in H$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\alpha_{1h}\big|_\Phi = k_{\alpha 1h}\left( \alpha'_{1h^*}\, \pi_h \psi'_{h v_1} - \alpha'_{1h}\, \pi_{h^*} \psi'_{h^* v_1} \right) = 0 \]
for all $h \in H \setminus \{h^*\}$, and $\dot\alpha_{1h^*}\big|_\Phi = -\sum_{h \neq h^*} \dot\alpha_{1h}\big|_\Phi = 0$.

– $\alpha'_{lh} = \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{gh}\, \psi'_{h v_l}$ for all $h \in H$ and $l = 2, \ldots, L$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\alpha_{lh}\big|_\Phi = k_{\alpha lh}\Big( \alpha'_{lh^*} \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{gh}\, \psi'_{h v_l} - \alpha'_{lh} \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{g h^*}\, \psi'_{h^* v_l} \Big) = 0 \]
for all $h \in H \setminus \{h^*\}$ and $l = 2, \ldots, L$, and $\dot\alpha_{lh^*}\big|_\Phi = -\sum_{h \neq h^*} \dot\alpha_{lh}\big|_\Phi = 0$.

– $\beta'_{lh} = \sum_{g \in H} \theta'_{hg}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g}$ for all $h \in H$ and $l = 1, \ldots, L-1$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\beta_{lh}\big|_\Phi = k_{\beta lh}\Big( \beta'_{lh^*} \sum_{g \in H} \theta'_{hg}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g} - \beta'_{lh} \sum_{g \in H} \theta'_{h^* g}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g} \Big) = 0 \]
for all $h \in H \setminus \{h^*\}$ and $l = 1, \ldots, L-1$, and $\dot\beta_{lh^*}\big|_\Phi = -\sum_{h \neq h^*} \dot\beta_{lh}\big|_\Phi = 0$.

– $\gamma'_l(h) = \frac{\alpha'_{lh} \beta'_{lh}}{\sum_{g \in H} \alpha'_{lg} \beta'_{lg}}$ for all $h \in H$ and $l = 1, \ldots, L$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\gamma_{lh}\big|_\Phi = k_{\gamma lh}\left( \gamma'_{lh^*}\, \alpha'_{lh} \beta'_{lh} - \gamma'_{lh}\, \alpha'_{lh^*} \beta'_{lh^*} \right) = 0 \]
for all $h \in H \setminus \{h^*\}$ and $l = 1, \ldots, L$, and $\dot\gamma_{lh^*}\big|_\Phi = -\sum_{h \neq h^*} \dot\gamma_{lh}\big|_\Phi = 0$.

– $\xi'_l(g,h) = \frac{\alpha'_{lg}\, \theta'_{gh}\, \psi'_{h v_{l+1}}\, \beta'_{l+1,h}}{\sum_{f \in H} \alpha'_{lf} \beta'_{lf}}$ for all $g, h \in H$ and $l = 1, \ldots, L-1$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\xi_{lgh}\big|_\Phi = k_{\xi lgh}\left( \xi'_{l h^* h^*}\, \alpha'_{lg}\, \theta'_{gh}\, \psi'_{h v_{l+1}}\, \beta'_{l+1,h} - \xi'_{lgh}\, \alpha'_{l h^*}\, \theta'_{h^* h^*}\, \psi'_{h^* v_{l+1}}\, \beta'_{l+1,h^*} \right) = 0 \]
for all $(g, h) \in H \times H \setminus \{(h^*, h^*)\}$ and $l = 1, \ldots, L-1$, and $\dot\xi_{l h^* h^*}\big|_\Phi = -\sum_{(g,h) \neq (h^*, h^*)} \dot\xi_{lgh}\big|_\Phi = 0$.

– $\theta'_{gh} = \frac{\sum_{l=1}^{L-1} \xi'_l(g,h)}{\sum_{l=1}^{L-1} \sum_{f \in H} \xi'_l(g,f)}$ for all $g, h \in H$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\theta_{gh}\big|_\Phi = k_{\theta gh}\Big( \theta'_{g h^*} \sum_{l=1}^{L-1} \xi'_{lgh} - \theta'_{gh} \sum_{l=1}^{L-1} \xi'_l(g, h^*) \Big) = 0 \]
for all $g \in H$ and $h \in H \setminus \{h^*\}$, and $\dot\theta_{g h^*}\big|_\Phi = -\sum_{h \neq h^*} \dot\theta_{gh}\big|_\Phi = 0$.

– $\psi'_{hw} = \frac{\sum_{l=1}^{L} \gamma'_l(h)\, \delta_{w, v_l}}{\sum_{l=1}^{L} \gamma'_l(h)}$ for all $h \in H$ and $w \in V$. Then for the Chemical Baum-Welch Algorithm we have
\[ \dot\psi_{hw}\big|_\Phi = k_{\psi hw}\Big( \psi'_{h v^*} \sum_{l=1}^{L} \gamma'_l(h)\, \delta_{w, v_l} - \psi'_{hw} \sum_{l=1}^{L} \gamma'_l(h)\, \delta_{v^*, v_l} \Big) = 0 \]
for all $h \in H$ and $w \in V \setminus \{v^*\}$, and $\dot\psi_{h v^*}\big|_\Phi = -\sum_{w \neq v^*} \dot\psi_{hw}\big|_\Phi = 0$.

So $\Phi$ is a fixed point of the Chemical Baum-Welch Algorithm. ⊓⊔

We will now prove Theorem 2. For the sake of convenience, we first recall the statement.
Theorem 2.
Every positive fixed point for the Chemical Baum-Welch Algorithm on a Baum-Welch reaction system $(BW(\mathcal{M}, h^*, v^*, L), k)$ with permissible rates $k$ is a fixed point for the Baum-Welch algorithm for the HMM $\mathcal{M} = (H, V, \theta, \psi, \pi)$.

Proof. Consider a positive point $\Phi = (\alpha', \beta', \gamma', \xi', \theta', \psi')$ with $\alpha', \beta', \gamma' \in \mathbb{R}^{L \times H}_{> 0}$, $\xi' \in \mathbb{R}^{(L-1) \times H \times H}_{> 0}$, $\theta' \in \mathbb{R}^{H \times H}_{> 0}$ and $\psi' \in \mathbb{R}^{H \times V}_{> 0}$. If $\Phi$ is a fixed point for the Chemical Baum-Welch Algorithm then we must have:

– $\dot\alpha_{1h}\big|_\Phi = 0$ for all $h \in H$. This implies $\alpha'_{1h} \cdot \pi_{h^*} \psi'_{h^* v_1} = \alpha'_{1h^*} \cdot \pi_h \psi'_{h v_1}$ for all $h \in H \setminus \{h^*\}$. Since $\Phi$ is positive, this implies
\[ \alpha'_{1h} = \left( \pi_h \psi'_{h v_1} \right) \frac{\sum_{f \in H} \alpha'_{1f}}{\sum_{f \in H} \pi_f \psi'_{f v_1}} \quad \text{for all } h \in H. \]

– $\dot\alpha_{lh}\big|_\Phi = 0$ for all $h \in H$ and $l = 2, \ldots, L$. This implies $\alpha'_{lh} \cdot \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{g h^*}\, \psi'_{h^* v_l} = \alpha'_{lh^*} \cdot \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{gh}\, \psi'_{h v_l}$ for all $h \in H \setminus \{h^*\}$ and $l = 2, \ldots, L$. Since $\Phi$ is positive, this implies
\[ \alpha'_{lh} = \Big( \sum_{g \in H} \alpha'_{l-1,g}\, \theta'_{gh}\, \psi'_{h v_l} \Big) \frac{\sum_{f \in H} \alpha'_{lf}}{\sum_{f, g \in H} \alpha'_{l-1,g}\, \theta'_{gf}\, \psi'_{f v_l}} \quad \text{for all } h \in H \text{ and } l = 2, \ldots, L. \]

– $\dot\beta_{lh}\big|_\Phi = 0$ for all $h \in H$ and $l = 1, \ldots, L-1$. This implies $\beta'_{lh} \cdot \sum_{g \in H} \theta'_{h^* g}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g} = \beta'_{lh^*} \cdot \sum_{g \in H} \theta'_{hg}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g}$ for all $h \in H \setminus \{h^*\}$ and $l = 1, \ldots, L-1$. Since $\Phi$ is positive, this implies
\[ \beta'_{lh} = \Big( \sum_{g \in H} \theta'_{hg}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g} \Big) \frac{\sum_{f \in H} \beta'_{lf}}{\sum_{f, g \in H} \theta'_{fg}\, \psi'_{g v_{l+1}}\, \beta'_{l+1,g}} \quad \text{for all } h \in H \text{ and } l = 1, \ldots, L-1. \]

– $\dot\gamma_{lh}\big|_\Phi = 0$ for all $h \in H$ and $l = 1, \ldots, L$. This implies $\gamma'_l(h) \cdot \alpha'_{lh^*} \beta'_{lh^*} = \gamma'_l(h^*) \cdot \alpha'_{lh} \beta'_{lh}$ for all $h \in H \setminus \{h^*\}$ and $l = 1, \ldots, L$. Since $\Phi$ is positive, this implies
\[ \gamma'_{lh} = \left( \frac{\alpha'_{lh} \beta'_{lh}}{\sum_{g \in H} \alpha'_{lg} \beta'_{lg}} \right) \sum_{g \in H} \gamma'_{lg} \quad \text{for all } h \in H \text{ and } l = 1, \ldots, L. \]

– $\dot\xi_{lgh}\big|_\Phi = 0$ for all $g, h \in H$ and $l = 1, \ldots, L-1$. This implies $\xi'_l(g,h) \cdot \alpha'_{lh^*}\, \theta'_{h^* h^*}\, \psi'_{h^* v_{l+1}}\, \beta'_{l+1,h^*} = \xi'_l(h^*, h^*) \cdot \alpha'_{lg}\, \theta'_{gh}\, \psi'_{h v_{l+1}}\, \beta'_{l+1,h}$ for all $(g, h) \in H \times H \setminus \{(h^*, h^*)\}$ and $l = 1, \ldots, L-1$. Since $\Phi$ is positive, this implies
\[ \xi'_{lgh} = \left( \frac{\alpha'_{lg}\, \theta'_{gh}\, \psi'_{h v_{l+1}}\, \beta'_{l+1,h}}{\sum_{e, f \in H} \alpha'_{le}\, \theta'_{ef}\, \psi'_{f v_{l+1}}\, \beta'_{l+1,f}} \right) \sum_{e, f \in H} \xi'_{lef} \quad \text{for all } g, h \in H \text{ and } l = 1, \ldots, L-1. \]

– $\dot\theta_{gh}\big|_\Phi = 0$ for all $g, h \in H$. This implies $\theta'_{gh} \cdot \sum_{l=1}^{L-1} \xi'_l(g, h^*) = \theta'_{gh^*} \cdot \sum_{l=1}^{L-1} \xi'_l(g, h)$ for all $g \in H$ and $h \in H \setminus \{h^*\}$. Since $\Phi$ is positive, this implies
\[ \theta'_{gh} = \left( \frac{\sum_{l=1}^{L-1} \xi'_{lgh}}{\sum_{f \in H} \sum_{l=1}^{L-1} \xi'_{lgf}} \right) \sum_{f \in H} \theta'_{gf} \quad \text{for all } g, h \in H. \]

– $\dot\psi_{hw}\big|_\Phi = 0$ for all $h \in H$ and $w \in V$. This implies $\psi'_{hw} \cdot \sum_{l=1}^{L} \gamma'_l(h)\, \delta_{v^*, v_l} = \psi'_{h v^*} \cdot \sum_{l=1}^{L} \gamma'_l(h)\, \delta_{w, v_l}$ for all $h \in H$ and $w \in V \setminus \{v^*\}$. Since $\Phi$ is positive, $E_{lv} = \delta_{v, v_l}$ and $\sum_{v \in V} \delta_{v, v_l} = 1$, this implies
\[ \psi'_{hw} = \left( \frac{\sum_{l=1}^{L} \gamma'_{lh}\, \delta_{w, v_l}}{\sum_{l=1}^{L} \gamma'_{lh}} \right) \sum_{v \in V} \psi'_{hv} \quad \text{for all } h \in H \text{ and } w \in V. \]

Because of the relaxation we get by Remark 1, the point $\Phi$ qualifies as a fixed point of the Baum-Welch algorithm. ⊓⊔

Appendix A.2 Rate of convergence analysis

In this section we will prove Theorem 3 and Theorem 4, but first we will state and prove two useful lemmas.
Lemma 1.
Let $A$ be an arbitrary $n \times n$ matrix. Let $W$ be an $r \times n$ matrix comprising $r$ linearly independent left kernel vectors of $A$, so that $W A = 0_{r,n}$, where $0_{i,j}$ denotes an $i \times j$ matrix with all entries zero. Further suppose $W$ is in row reduced form, that is, $W = \begin{pmatrix} W' & I_r \end{pmatrix}$, where $I_j$ denotes the $j \times j$ identity matrix and $W'$ is an $r \times (n-r)$ matrix. Let $A$ be given as
\[ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \]
where $A_{11}$ is an $(n-r) \times (n-r)$ matrix, $A_{12}$ is an $(n-r) \times r$ matrix, $A_{21}$ is an $r \times (n-r)$ matrix, and $A_{22}$ is an $r \times r$ matrix. Then the $n-r$ eigenvalues (with multiplicity) of the matrix $A_{11} - A_{12} W'$ are the same as the eigenvalues of $A$ except for $r$ zero eigenvalues.

Proof. Consider the $n \times n$ invertible matrix $P$ given by
\[ P = \begin{pmatrix} I_{n-r} & 0_{n-r,r} \\ W' & I_r \end{pmatrix}, \qquad P^{-1} = \begin{pmatrix} I_{n-r} & 0_{n-r,r} \\ -W' & I_r \end{pmatrix}, \]
with determinant $\det(P) = \det(P^{-1}) = 1$. We have
\[ P A P^{-1} = \begin{pmatrix} I_{n-r} & 0_{n-r,r} \\ W' & I_r \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} I_{n-r} & 0_{n-r,r} \\ -W' & I_r \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ 0_{r,n-r} & 0_{r,r} \end{pmatrix} \begin{pmatrix} I_{n-r} & 0_{n-r,r} \\ -W' & I_r \end{pmatrix} = \begin{pmatrix} A_{11} - A_{12} W' & A_{12} \\ 0_{r,n-r} & 0_{r,r} \end{pmatrix}. \]
This implies that the characteristic polynomial of $A$ fulfils
\[ \det(A - \lambda I_n) = \det(P) \det(A - \lambda I_n) \det(P^{-1}) = \det(P A P^{-1} - \lambda I_n) = \det \begin{pmatrix} A_{11} - A_{12} W' - \lambda I_{n-r} & A_{12} \\ 0_{r,n-r} & -\lambda I_r \end{pmatrix} = (-1)^r \lambda^r \det(A_{11} - A_{12} W' - \lambda I_{n-r}), \]
and the statement follows. ⊓⊔

Now we revisit the observation that every reaction in the Baum-Welch reaction network is a monomolecular transformation catalyzed by a set of species. For the purposes of our analysis, each reaction can be abstracted as a monomolecular reaction with time-varying rate constants. This prompts us to consider the following monomolecular reaction system with $n$ species $X_1, \ldots, X_n$ and $m$ reactions
\[ X_{r_j} \xrightarrow{\; k_j(t) \;} X_{p_j}, \qquad j = 1, \ldots, m, \]
where $r_j \neq p_j$, $r_j, p_j \in \{1, \ldots, n\}$, and $k_j(t)$, $j = 1, \ldots, m$, are mass-action reaction rate constants, possibly depending on time.
We assume $k_j(t) > 0$ for $t \geq 0$ and let $k(t) = (k_1(t), \ldots, k_m(t))$ be the vector of reaction rate constants. Furthermore, we assume there is at most one reaction $j$ with $(r_j, p_j) = (r, p)$ for each pair $(r, p)$, and that the reaction network is strongly connected. The latter means there is a reaction path from any species $X_i$ to any other species $X_{i'}$. (In reaction network terms it means the network is weakly reversible and deficiency zero.) The mass-action kinetics of this reaction system is given by the ODE system
\[ \dot x_i = - x_i \sum_{j=1 \colon r_j = i}^{m} k_j(t) + \sum_{j=1 \colon p_j = i}^{m} k_j(t)\, x_{r_j}, \qquad i = 1, \ldots, n. \]
Define the $n \times n$ matrix $A(t) = (a_{ii'}(t))_{i,i'=1,\ldots,n}$ by
\[ a_{ii}(t) = - \sum_{j=1 \colon r_j = i}^{m} k_j(t), \qquad a_{ii'}(t) = k_j(t) \text{ if there is } j \in \{1, \ldots, m\} \text{ such that } (r_j, p_j) = (i', i), \tag{1} \]
and $a_{ii'}(t) = 0$ otherwise. Then the ODE system may be written as
\[ \dot x = A(t)\, x. \tag{2} \]
Note that the column sums of $A(t)$ are zero, implying that $\sum_{i=1}^{n} x_i$ is conserved.

Lemma 2.
Assume $k(t)$, $t \geq 0$, converges exponentially fast towards $\bar k = (\bar k_1, \ldots, \bar k_m) \in \mathbb{R}^m_{>0}$ as $t \to \infty$, that is, there exist $\gamma_0 > 0$ and $K_0 > 0$ such that $\| k(t) - \bar k \| \leq K_0 e^{-\gamma_0 t}$ for $t \geq 0$. Let $A(t)$ be the matrix defined in equation (1), and let $\bar A$ be the matrix obtained by inserting $\bar k$ for $k(t)$ in $A(t)$, that is, $\bar A = \lim_{t \to \infty} A(t)$. Then the solution of the ODE system $\dot x = A(t) x$ starting at $x(0) \in \mathbb{R}^n_{\geq 0}$ converges exponentially fast towards the equilibrium $a \in \mathbb{R}^n_{>0}$ of the ODE system $\dot x = \bar A x$ starting at $x(0)$, that is, there exist $\gamma > 0$ and $K > 0$ such that $\| x(t) - a \| \leq K e^{-\gamma t}$ for $t \geq 0$.

Proof.
We will first rephrase the ODE system so that standard theory is applicable. Let the rank of $\bar A$ be $n - r$. Let $W$ be as defined in Lemma 1, that is, an $r \times n$ matrix comprising $r$ linearly independent left kernel vectors of $\bar A$, so that $W \bar A = 0$. Since the rank of $\bar A$ is $n - r$, the rows of $W$ form a basis for the left kernel of $\bar A$. As in Lemma 1, further suppose $W$ is in row reduced form, that is, $W = \begin{pmatrix} W' & I_r \end{pmatrix}$. Then
\[ \dot x = \bar A x \tag{3} \]
is a linear dynamical system with $r$ conservation laws (one for each row of $W$). Let $W x(0) = T \in \mathbb{R}^r$ be the vector of conserved amounts. Let $\hat x = (x_1, \ldots, x_{n-r})$ and $\tilde x = (x_{n-r+1}, \ldots, x_n)$. We will consider the (equivalent) dynamical system in which $r$ variables are eliminated, expressed through the conservation laws
\[ T = W x = \begin{pmatrix} W' & I_r \end{pmatrix} x, \quad \text{or} \quad \tilde x = T - W' \hat x. \]
As in Lemma 1, let $\bar A$ be given as
\[ \bar A = \begin{pmatrix} \bar A_{11} & \bar A_{12} \\ \bar A_{21} & \bar A_{22} \end{pmatrix}, \]
where $\bar A_{11}$ is an $(n-r) \times (n-r)$ matrix, $\bar A_{12}$ is an $(n-r) \times r$ matrix, $\bar A_{21}$ is an $r \times (n-r)$ matrix, and $\bar A_{22}$ is an $r \times r$ matrix. This yields
\[ \dot{\hat x} = \begin{pmatrix} \bar A_{11} & \bar A_{12} \end{pmatrix} \begin{pmatrix} \hat x \\ \tilde x \end{pmatrix} = \begin{pmatrix} \bar A_{11} & \bar A_{12} \end{pmatrix} \begin{pmatrix} \hat x \\ T - W' \hat x \end{pmatrix} = (\bar A_{11} - \bar A_{12} W') \hat x + \bar A_{12} T = \bar C \hat x + \bar D T, \tag{4} \]
with $\bar C = \bar A_{11} - \bar A_{12} W'$ and $\bar D = \bar A_{12}$. We call this the reduced ODE system. Note that this reduced system has only $n - r$ variables and that the conservation laws are built directly into it. This implies that the differential equation changes if $T$ is changed. The role of this construction is that it allows us to work only with the non-zero eigenvalues of $\bar A$.

As we also have $W A(t) = 0$ for all $t \geq 0$, the ODE $\dot x = A(t) x$ can similarly be reduced to
\[ \dot{\hat x} = C(t) \hat x + D(t) T, \tag{5} \]
with $C(t) = A_{11}(t) - A_{12}(t) W'$ and $D(t) = A_{12}(t)$, where, analogously to $\bar A_{11}$ and $\bar A_{12}$, we define $A_{11}(t)$ to be the top-left $(n-r) \times (n-r)$ sub-matrix of $A(t)$ and $A_{12}(t)$ to be the top-right $(n-r) \times r$ sub-matrix of $A(t)$.

Now if $a$ is the equilibrium of the ODE system $\dot x = \bar A x$ starting at $x(0)$, then $\hat a = (a_1, \ldots, a_{n-r})$ is an equilibrium of the reduced ODE system $\dot{\hat x} = \bar C \hat x + \bar D T$ starting at $\hat x(0) = (x_1(0), \ldots, x_{n-r}(0))$. Suppose we are able to prove that the solution of the reduced ODE $\dot{\hat x} = C(t) \hat x + D(t) T$ starting at $\hat x(0)$ converges exponentially fast towards $\hat a$; then, because of the conservation laws $\tilde x = T - W' \hat x$, the solution of $\dot x = A(t) x$ starting at $x(0)$ also converges exponentially fast towards $a$. So henceforth we work only with the reduced ODE systems. For notational convenience, we drop the hats off $\hat x$ and $\hat a$ and simply refer to them as $x$ and $a$ respectively. By subtracting and adding terms in the reduced ODE system (equation (5)), we have
\[ \dot x = C(t) x + D(t) T = \bar C x + \bar D T + (C(t) - \bar C) x + (D(t) - \bar D) T. \]
As $a$ is an equilibrium of the ODE system $\dot x = \bar C x + \bar D T$, we have $\bar C a + \bar D T = 0$. Define $y = x - a$. Then
\[ \dot y = \bar C x + \bar D T + (C(t) - \bar C) x + (D(t) - \bar D) T = \bar C y + \bar C a + \bar D T + (C(t) - \bar C) x + (D(t) - \bar D) T = \bar C y + E(t), \]
where it is used that $\bar C a + \bar D T = 0$, and $E(t) = (C(t) - \bar C) x(t) + (D(t) - \bar D) T$. The solution to the above ODE system is known to be
\[ y(t) = e^{\bar C t} y(0) + \int_0^t e^{\bar C (t-s)} E(s)\, ds. \tag{6} \]
Using (6) and the triangle inequality, we have
\[ \| y(t) \| \leq \| e^{\bar C t} y(0) \| + \int_0^t \| e^{\bar C (t-s)} \| \, \| E(s) \| \, ds. \]
Now $\bar A$ as defined (see equation (1) with $\bar k$ inserted for $k(t)$) forms a Laplacian matrix over a strongly connected graph, and so all the eigenvalues of $\bar A$ are either zero or have negative real part. Using $\bar C = \bar A_{11} - \bar A_{12} W'$ and Lemma 1, it follows that all eigenvalues of $\bar C$ have negative real part. Hence it follows that
\[ \| e^{\bar C t} \| \leq K_1 e^{-\gamma_1 t}, \]
where $0 < \gamma_1 < -\Re(\lambda_1)$ and $K_1 > 0$. Here $\lambda_1$ is the eigenvalue of $\bar C$ with the largest real part. The matrices $C(t)$ and $D(t)$ are linear in $k(t)$. As $k(t)$ converges exponentially fast towards $\bar k$, it follows that the matrices $C(t)$ and $D(t)$ converge exponentially fast towards $\bar C$ and $\bar D$, respectively.
Hence it follows that
\[ \| E(t) \| = \| (C(t) - \bar C) x(t) + (D(t) - \bar D) T \| \leq \| C(t) - \bar C \| \, \| x(t) \| + \| D(t) - \bar D \| \, \| T \| \leq K_2 e^{-\gamma_2 t} + K_3 e^{-\gamma_3 t} \leq K_4 e^{-\gamma_4 t}, \]
where
– $\| C(t) - \bar C \| \, \| x(t) \| \leq K_2 e^{-\gamma_2 t}$ for some $K_2, \gamma_2 > 0$, as $C(t)$ converges exponentially fast towards $\bar C$ and $x(t)$ is bounded (in the original ODE $\sum_{i=1}^n x_i$ is conserved),
– $\| D(t) - \bar D \| \, \| T \| \leq K_3 e^{-\gamma_3 t}$ for some $K_3, \gamma_3 > 0$, as $D(t)$ converges exponentially fast towards $\bar D$, and
– $K_4 = \max(K_2, K_3)$ and $\gamma_4 = \min(\gamma_2, \gamma_3)$.

Collecting all terms we have, for all $t \geq 0$,
\[ \| y(t) \| \leq \| y(0) \| K_1 e^{-\gamma_1 t} + \int_0^t K_1 e^{-\gamma_1 (t-s)} K_4 e^{-\gamma_4 s}\, ds \leq K_5 e^{-\gamma_5 t} + K_5 \int_0^t e^{-\gamma_5 t}\, ds = K_5 e^{-\gamma_5 t} (1 + t) \leq K e^{-\gamma t}, \]
by choosing $K_5 = \max(K_1 K_4, \| y(0) \| K_1)$ and $\gamma_5 = \min(\gamma_1, \gamma_4)$. In the last step, $\gamma$ is chosen such that $0 < \gamma < \gamma_5$ and $K$ is sufficiently large. Since $y(t) = x(t) - a$ we have
\[ \| x(t) - a \| \leq K e^{-\gamma t}, \]
as required. ⊓⊔
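Lemma 2 can be illustrated numerically. The sketch below integrates a hypothetical three-species cycle X1 → X2 → X3 → X1 whose rates k_j(t) = k̄_j(1 + e^{−t}) converge exponentially to k̄ = (1, 2, 3); the model, the rate schedule, and the crude Euler integration are all illustrative choices, not part of the paper.

```python
# xdot = A(t) x for a monomolecular cycle; the columns of A sum to zero,
# so the total sum(x) is conserved, and x(t) should converge to the
# equilibrium of the limiting system xdot = Abar x with the same total.
import math

def rate_matrix(k):
    k12, k23, k31 = k
    return [[-k12, 0.0,  k31],
            [ k12, -k23, 0.0],
            [ 0.0,  k23, -k31]]

def euler_step(x, M, dt):
    return [x[i] + dt * sum(M[i][j] * x[j] for j in range(3)) for i in range(3)]

kbar = (1.0, 2.0, 3.0)
x = [1.0, 0.0, 0.0]
t, dt, T = 0.0, 1e-3, 40.0
while t < T:
    k_t = tuple(kb * (1.0 + math.exp(-t)) for kb in kbar)  # k(t) -> kbar exponentially
    x = euler_step(x, rate_matrix(k_t), dt)
    t += dt

# Equilibrium of the limiting cycle: the stationary flux is equal across
# all three reactions, so x_i is proportional to 1 / kbar_i (the outflow
# rate of species i), normalized to the conserved total sum(x(0)) = 1.
w = [1.0 / kb for kb in kbar]
a = [wi / sum(w) for wi in w]
```

At the end of the run, x should agree with a up to the Euler discretization error, while sum(x) stays at its initial value throughout.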
We will now prove Theorem 3. For the sake of convenience, we first recall the statement.
Theorem 3.
For the Baum-Welch Reaction System $(BW(\mathcal{M}, h^*, v^*, L), k)$ with permissible rates $k$, if the concentrations of the $\theta$ and $\psi$ species are held fixed at a positive point, then the Forward, Backward and Expectation step reaction systems on the $\alpha$, $\beta$, $\gamma$ and $\xi$ species converge to equilibrium exponentially fast.

Proof. It follows by repeated use of Lemma 2. For $l = 1$ the forward reaction network can be interpreted as the following monomolecular reactions:
\[ \alpha_{1h} \xrightarrow{\; \pi_{h^*} \psi_{h^* w} E_{1w} \;} \alpha_{1h^*}, \qquad \alpha_{1h^*} \xrightarrow{\; \pi_h \psi_{hw} E_{1w} \;} \alpha_{1h} \]
for all $h \in H \setminus \{h^*\}$ and $w \in V$, as they are dynamically equivalent. Here the effective rate constants ($\pi_{h^*} \psi_{h^* w} E_{1w}$ or $\pi_h \psi_{hw} E_{1w}$) are independent of time, and so the conditions of Lemma 2 are fulfilled. Thus this portion of the reaction network converges exponentially fast.

The rest of the forward reaction network can similarly be interpreted as the monomolecular reactions
\[ \alpha_{lh} \xrightarrow{\; \alpha_{l-1,g}\, \theta_{g h^*}\, \psi_{h^* w}\, E_{lw} \;} \alpha_{lh^*}, \qquad \alpha_{lh^*} \xrightarrow{\; \alpha_{l-1,g}\, \theta_{gh}\, \psi_{hw}\, E_{lw} \;} \alpha_{lh} \]
for all $g \in H$, $h \in H \setminus \{h^*\}$, $l = 2, \ldots, L$ and $w \in V$. For layers $l = 2, \ldots, L$ of the forward reaction network, we observe that the effective rate constants ($\alpha_{l-1,g}\, \theta_{g h^*}\, \psi_{h^* w}\, E_{lw}$ or $\alpha_{l-1,g}\, \theta_{gh}\, \psi_{hw}\, E_{lw}$) for layer $l$ depend on time only through $\alpha_{l-1,g}$. If we suppose that the concentration of $\alpha_{l-1,g}$ converges exponentially fast, then we can use Lemma 2 to conclude that the concentration of $\alpha_{lh}$ also converges exponentially fast. Thus, using Lemma 2 inductively layer by layer, we conclude that the forward reaction network converges exponentially fast. The backward reaction network converges exponentially fast by a similar argument.

For the expectation reaction network the claim likewise follows by induction. But here, notice that if we interpret the expectation network similarly as monomolecular reactions, the effective rate constants depend on time through products such as $\alpha_{lh} \beta_{lh}$ or $\alpha_{lh} \beta_{l+1,h}$. So to apply Lemma 2 we need the following: if $\alpha_l(t)$ and $\beta_l(t)$ converge exponentially fast towards $a_l$ and $b_l$, then the product $\alpha_l(t) \beta_l(t)$ converges exponentially fast towards $a_l b_l$, since
\[ \| \alpha_l(t) \beta_l(t) - a_l b_l \| = \| (\alpha_l(t) - a_l)(\beta_l(t) - b_l) + a_l (\beta_l(t) - b_l) + b_l (\alpha_l(t) - a_l) \| \leq \| \alpha_l(t) - a_l \| \, \| \beta_l(t) - b_l \| + K \| \beta_l(t) - b_l \| + K \| \alpha_l(t) - a_l \|, \]
where $K$ is some suitably large constant. We can further observe that $\alpha_l(t) \beta_l(t)$ converges exponentially fast towards $a_l b_l$ at rate $\gamma = \min(\gamma_a, \gamma_b)$, where $\gamma_a$ and $\gamma_b$, respectively, are the exponential convergence rates of $\alpha_l(t)$ and $\beta_l(t)$. A similar argument applies to products of the form $\alpha_l(t) \beta_{l+1}(t)$. Thus the expectation reaction network also converges exponentially fast. ⊓⊔

We will now prove Theorem 4. For the sake of convenience, we first recall the statement.
Theorem 4.
For the Baum-Welch Reaction System $(BW(\mathcal{M}, h^*, v^*, L), k)$ with permissible rates $k$, if the concentrations of the $\alpha$, $\beta$, $\gamma$ and $\xi$ species are held fixed at a positive point, then the Maximization step reaction system on the $\theta$ and $\psi$ species converges to equilibrium exponentially fast.

Proof. Exponential convergence of the maximization network follows by a similar layer-by-layer inductive use of Lemma 2. ⊓⊔
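The M-step half of the fixed-point correspondence (Theorems 1 and 2) can also be checked mechanically: after one Baum-Welch M step, θ is proportional to the ξ row sums, so the net flux of each θ (and, analogously, ψ) reaction cancels exactly. The small HMM, the observation sequence, and the choice h* = v* = first index below are all hypothetical illustrations.

```python
# Check of the M-step fixed-point algebra behind Theorems 1 and 2:
# after an M step, theta[g][h] is proportional to S[g][h] = sum_l xi_l(g, h),
# so the net chemical flux theta[g][h*] * S[g][h] - theta[g][h] * S[g][h*]
# cancels identically, and similarly for psi with the gamma tallies.

def e_step(pi, theta, psi, obs):
    H, L = range(len(pi)), len(obs)
    alpha = [[pi[h] * psi[h][obs[0]] for h in H]]
    for l in range(1, L):
        alpha.append([sum(alpha[-1][g] * theta[g][h] for g in H) * psi[h][obs[l]]
                      for h in H])
    beta = [[1.0] * len(pi) for _ in range(L)]
    for l in range(L - 2, -1, -1):
        beta[l] = [sum(theta[h][g] * psi[g][obs[l + 1]] * beta[l + 1][g] for g in H)
                   for h in H]
    Z = sum(alpha[-1])  # likelihood of obs
    gamma = [[alpha[l][h] * beta[l][h] / Z for h in H] for l in range(L)]
    xi = [[[alpha[l][g] * theta[g][h] * psi[h][obs[l + 1]] * beta[l + 1][h] / Z
            for h in H] for g in H] for l in range(L - 1)]
    return gamma, xi

# Hypothetical model and observations.
pi = [0.5, 0.5]
theta = [[0.6, 0.4], [0.3, 0.7]]
psi = [[0.8, 0.2], [0.25, 0.75]]
obs = [0, 1, 1, 0]
H, V = range(2), range(2)

gamma, xi = e_step(pi, theta, psi, obs)

# M step.
S = [[sum(xi[l][g][h] for l in range(len(obs) - 1)) for h in H] for g in H]
theta_new = [[S[g][h] / sum(S[g]) for h in H] for g in H]
G = [[sum(gamma[l][h] for l in range(len(obs)) if obs[l] == w) for w in V] for h in H]
psi_new = [[G[h][w] / sum(G[h]) for w in V] for h in H]

# Net fluxes of the theta and psi reactions at the updated point (h* = v* = 0).
theta_flux = max(abs(theta_new[g][0] * S[g][h] - theta_new[g][h] * S[g][0])
                 for g in H for h in H)
psi_flux = max(abs(psi_new[h][0] * G[h][w] - psi_new[h][w] * G[h][0])
               for h in H for w in V)
```

Both flux residuals vanish up to floating-point rounding, consistent with the M-step reactions being at equilibrium immediately after a Baum-Welch update.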