[PDF] Glycan processing in the Golgi -- optimal information coding and constraints on cisternal number and enzyme specificity

Abstract

Many proteins that undergo sequential enzymatic modification in the Golgi cisternae are displayed at the plasma membrane as cell identity markers. The modified proteins, called glycans, represent a molecular code. The fidelity of this glycan code is measured by how accurately the glycan synthesis machinery realises the desired target glycan distribution for a particular cell type and niche. In this paper, we quantitatively analyse the tradeoffs between the number of cisternae and the number and specificity of enzymes, in order to synthesize a prescribed target glycan distribution of a certain complexity. We find that to synthesize complex distributions, such as those observed in real cells, one needs to have multiple cisternae and precise enzyme partitioning in the Golgi. Additionally, for fixed number of enzymes and cisternae, there is an optimal level of specificity of enzymes that achieves the target distribution with high fidelity. Our results show how the complexity of the target glycan distribution places functional constraints on the Golgi cisternal number and enzyme specificity.

Full PDF

GGlycan processing in the Golgi – optimal information coding and constraints oncisternal number and enzyme speciﬁcity

Alkesh Yadav, Quentin Vagne, Pierre Sens, Garud Iyengar, and Madan Rao Raman Research Institute, Bangalore 560080, India Laboratoire Physico Chimie Curie, Institut Curie, CNRS UMR168, 75005 Paris, France Industrial Engineering and Operations Research, Columbia University, New York 10027, USA Simons Centre for the Study of Living Machines, National Centre for Biological Sciences (TIFR), Bangalore 560065, India ∗ (Dated: May 19, 2020)Many proteins that undergo sequential enzymatic modiﬁcation in the Golgi cisternae are displayedat the plasma membrane as cell identity markers. The modiﬁed proteins, called glycans, represent amolecular code. The ﬁdelity of this glycan code is measured by how accurately the glycan synthesismachinery realises the desired target glycan distribution for a particular cell type and niche. In thispaper, we quantitatively analyse the tradeoﬀs between the number of cisternae and the number andspeciﬁcity of enzymes, in order to synthesize a prescribed target glycan distribution of a certaincomplexity. We ﬁnd that to synthesize complex distributions, such as those observed in real cells,one needs to have multiple cisternae and precise enzyme partitioning in the Golgi. Additionally,for ﬁxed number of enzymes and cisternae, there is an optimal level of speciﬁcity of enzymes thatachieves the target distribution with high ﬁdelity. Our results show how the complexity of thetarget glycan distribution places functional constraints on the Golgi cisternal number and enzymespeciﬁcity. ∗ [email protected] a r X i v : . [ q - b i o . S C ] M a y I. INTRODUCTION

A majority of the proteins synthesized in the endoplasmic reticulum (ER) are transferred to the Golgi cisternae forfurther chemical modiﬁcation by glycosylation [1], a process that sequentially and covalently attaches sugar moietiesto proteins, catalyzed by a set of enzymatic reactions within the ER and the Golgi cisternae. These enzymes,called glycosyltransferases, are localized in the ER and cis-medial and trans Golgi cisternae in a speciﬁc manner [2, 3].Glycans, the ﬁnal products of this glycosylation assembly line are delivered to the plasma membrane (PM) conjugatedwith proteins, whereupon they engage in multiple cellular functions, including immune recognition, cell identitymarkers, cell-cell adhesion and cell signalling [2–6]. This glycan code [7, 8], representing information [9] about the cell,is generated dynamically, following the biochemistry of sequential enzymatic reactions and the biophysics of secretorytransport [4, 10, 11].In this paper, we will focus on the role of glycans as markers of cell identity. For the glycans to play this role, theymust inevitably represent a molecular code [4, 7, 11]. While the functional consequences of glycan alterations havebeen well studied, the glycan code has remained an enigma [7, 11–13]. In this paper, we study one aspect of molecularcoding, namely the ﬁdelity of this molecular code generation. While it has been recognised that ﬁdelity of the glycancode is necessary for reliable cellular recognition [14], a quantitative measure of ﬁdelity of the code and its implicationsfor cellular structure and organization are lacking.There are two aspects of the cell-type speciﬁc glycan code that have an important bearing on quantifying ﬁdelity.The ﬁrst is that extant glycan distributions have high complexity , owing to evolutionary pressures arising from (a)reliable cell type identiﬁcation amongst a large set of diﬀerent cell types in a complex organism, the preservationand diversiﬁcation of “self-recognition” [5], (b) pathogen-mediated selection pressures [2, 4, 6], and (c) herd immunity within a heterogenous population of cells of a community [15] or within a single organism [5]. Here, we will interpretthis to mean that the target distribution of glycans of a given cell type is complex; in Sect. II we deﬁne a quantitativemeasure for complexity and demonstrate its implications in the context of human

T-cells. The second is that thecellular machinery for the synthesis of glycans, which involves sequential chemical processing via cisternal residentenzymes and cisternal transport, is subject to variation and noise [4, 10, 11]; the synthesized glycan distribution is,therefore, a function of cellular parameters such as the number and speciﬁcity of enzymes, inter-cisternal transferrates, and number of cisternae. We will discuss an explicit model of the cellular synthesis machinery in Sect. III.In this paper, we deﬁne ﬁdelity as the minimum achievable Kullback-Leibler (KL) divergence [16, 17] between thesynthesized distribution of glycans and the target glycan distribution. This KL divergence is a function of the cellularparameters governing glycan synthesis, such as the number and speciﬁcity of enzymes, inter-cisternal transfer rates,and number of cisternae (Sect. V). We analyze the tradeoﬀs between the number of cisternae and the number andspeciﬁcity of enzymes, in order to achieve a prescribed target glycan distribution with high ﬁdelity (Sect. VI). Ouranalysis leads to a number of interesting results, of which we list a few here: (i) In order to construct an accuraterepresentation of a complex target distribution, such as those observed in real cells, one needs to have multiplecisternae and precise enzyme partitioning. Low complexity target distributions can be achieved with fewer cisternae.(ii) This deﬁnition of ﬁdelity of the glycan code, allows us to provide a quantitative argument for the evolutionaryrequirement of multiple-compartments. (iii) For ﬁxed number of enzymes and cisternae, there is an optimal levelof speciﬁcity of enzymes that achieves the complex target distribution with high ﬁdelity. (iv) Keeping the numberof enzymes ﬁxed, having low speciﬁcity or sloppy enzymes and larger cisternal number could give rise to a diverserepertoire of functional glycans, a strategy used in organisms such as plants and algae.Stated another way, our results imply that the pressure to achieve the target glycan code for a given cell type, placesstrong constraints on the cisternal number and enzyme speciﬁcity [18]. This would suggest that a description of thenonequilibrium assembly of a ﬁxed number of Golgi cisternae must combine the dynamics of chemical processing withmembrane dynamics involving ﬁssion, fusion and transport [19, 20], opening up a new direction for future research.

II. COMPLEXITY OF GLYCAN CODE IN REAL CELLS

Since each cell type (in a niche) is identiﬁed with a distinct glycan proﬁle [4, 7, 11], and this glycan proﬁle is noisybecause of the stochastic noise associated with the synthesis and transport [11–13], a large number of diﬀerent celltypes can be diﬀerentiated only if the cells are able to produce a large set of glycan proﬁles that are distinguishable (a) (b)

FIG. 1. Real cells display a complex glycan distribution. (a) Here we take the MSMS data from mouse neutrophils, human neutrophils and human

T-cells and approximate these using Gaussian Mixture Models (GMM) of less complexity 3-GMM (left)and more complexity 20-GMM (right). (b) The change in log likelihood with increase in the number of GMM components for mouse and human neutrophils and human

T-cells, shows a saturation at large enough values of k , indicating that these glycandistribution are complex. Details appear in Appendix G. in the presence of this noise. A more complex or richer class of glycan proﬁles is able to support a larger number ofwell separated proﬁles, and therefore, a larger number cell types, or equivalently, a more complex organism In order to implement a quantitative measure of complexity, we ﬁrst need a consistent way of smoothening or coarse-graining the discrete glycan distribution to remove measurement and synthesis noise. In this paper, we approximatethe glycan proﬁle as mixture of Gaussian densities with speciﬁed number of components that are supported on a ﬁniteset of indices [21]. Since the complexity of k -component Gaussian is an increasing function of k , we use the numberof component k and complexity interchangeably.Using this deﬁnition we demonstrate that the glycan proﬁles of typical mammalian cells are very complex. We obtaintarget proﬁles for a given cell type from Mass Spectrometry coupled with determination of molecular structure (MSMS)measurements [22]. Fig. 1 shows the the MSMS data from human T-cells and human and mouse neutrophils [22], andtheir coarse-grained representations using Gaussian mixture models (GMM) of diﬀering complexity - a low complexity k = 3 GMM and high complexity k = 20 GMM. It is clear from Fig. 1, that the more complex k = 20 GMM is a betterrepresentation of the MSMS data as compared to the less complex k = 3 GMM. Indeed the k = 20 Gaussian mixturemodel is the best compromise between faithfulness of the representation and cost of an additional component, as seenfrom the saturation of the likelihood function [17]. Details of this systematic coarse-graining procedure appear inSect. VI B and Appendix G.Having demonstrated the complexity of the typical glycan distributions associated with a given cell type, we will nowdescribe a general model of the cellular machinery that is capable of synthesizing glycans of the desired complexity.We expect that cells need a more elaborate mechanism to produce proﬁles from a more complex set. III. SYNTHESIS OF GLYCANS IN THE GOLGI CISTERNAE

The glycan display at the cell surface is a result of proteins that ﬂux through and undergo sequential chemicalmodiﬁcation in the secretory pathway, comprising an array of Golgi cisternae situated between the ER and the PM, A rigorous deﬁnition of complexity can be given in terms of the Kullback-Leibler metric [16, 17] between two glycan proﬁles. We declarethat two proﬁles are distinguishable only if the Kullback-Leibler distance between the proﬁles is more than a given tolerance. Thistolerance is an increasing function of the noise. We deﬁne the complexity of a set of possible glycan proﬁles as the size of the largestsubset such that the Kullback-Leibler distance of any pair of proﬁles is larger than the tolerance.

FIG. 2. Enzymatic reaction and transport network in the secretory pathway. Represented here is the array of Golgi cisternae(blue) indexed by j = 1 , . . . , N C situated between the ER and PM. Glycan-binding proteins P c (1)1 are injected from the ERto cisterna-1 at rate q . Superimposed is transition network of chemical reactions (column) - intercisternal transfer (rows), thelatter with rates µ ( j ) . P c ( j ) k denotes the acceptor substrate in compartment j and the glycosyl donor c is chemostated in eachcisterna. This results in a frequency distribution of glycans displayed at the PM (red curve), that is representative of the celltype. as depicted in Fig. 2. Glycan-binding proteins (GBPs) are delivered from the ER to the ﬁrst cisterna, whereupon theyare processed by the resident enzymes in a sequence of steps that constitute the N-glycosylation process [2]. A genericenzymatic reaction in the cisterna involves the catalysis of a group transfer reaction in which the monosaccharidemoiety of a simple sugar donor substrate, e.g. UDP-Gal, is transferred to the acceptor substrate, by a Michaelis-Menten (MM) type reaction [2] Acceptor + glycosyl donor + Enzyme ω f −− (cid:42)(cid:41) −− ω b [ Acceptor · glycosyl donor · Enzyme ] ω c −−→ glycosylated acceptor + nucleotide + Enzyme (1)From the ﬁrst cisterna, the proteins with attached sugars are delivered to the second cisterna at a given inter-cisternaltransfer rate, where further chemical processing catalysed by the enzymes resident in the second cisterna occurs.This chemical processing and inter-cisternal transfer continues until the last cisterna, thereupon the fully processedglycans are displayed at the PM [2]. The network of chemical processing and inter-cisternal transfer forms the basisthe physical model that we will describe next.Any physical model of such a network of enzymatic reactions and cisternal transfer needs to be augmented by reactionand transfer rates and chemical abundances. To obtain the range of allowable values for the reaction rates and chemicalabundances, we use the elaborate enzymatic reaction models, such as the KB2005 model [23–25] (with a network of22 ,

871 chemical reactions and 7565 oligosaccharide structures) that predict the N-glycan distribution based on theactivities and levels of processing enzymes distributed in the Golgi-cisternae of mammalian cells, and compare thesepredictions with N-glycan mass spectrum data. For the allowable rates of cisternal transfer, we rely on the recentstudy by Ungar and coworkers [26, 27], whose study shows how the overall Golgi transit time and cisternal number,can be tuned to engineer a homogeneous glycan distribution.

IV. MODEL DEFINITIONA. Chemical reaction and transport network in cisternae

With this background, we now deﬁne our quantitative model for chemical processing and transport in the secretorypathway. We consider an array of N C Golgi cisternae, labelled by j = 1 , . . . , N C , between the ER and the PM (Fig. 2).Glycan-binding proteins (GBPs), denoted as P c (1)1 , are delivered from the ER to cisterna-1 at an injection rate q . Itis well established that the concentration of the glycosyl donor in the j -th cisterna is chemostated [2, 28–30], thusin our model we hold its concentration c ( j )0 constant in time for each j . The acceptor P c (1)1 reacts with c (1)0 to formthe glycosylated acceptor P c (1)2 , following an MM-reaction (1) catalysed by the appropriate enzyme. The acceptor P c (1)2 has the potential of being transformed into P c (1)3 , and so on, provided the requisite enzymes are present in thatcisterna. This leads to the sequence of enzymatic reactions P c (1)1 → P c (1)2 → . . . P c (1) k → . . . , where k enumerates thesequence of glycosylated acceptors, using a consistent scheme (such as in [23]). The glycosylated GBPs are transportedfrom cisterna-1 to cisterna-2 at an inter-cisternal transfer rate µ (1) , whereupon similar enzymatic reactions proceed.The processes of intra-cisternal chemical reactions and inter-cisternal transfer continue to the other cisternae and forma network as depicted in Fig. 2. Although, in this paper, we focus on a sequence of reactions that form a line-graph,the methodology we propose extends to tree-like reaction sequences, and more generally to reaction sequences thatform a directed acyclic graphs [31].Let N s denote the maximum number of possible glycosylation reactions in each cisterna j , catalysed by enzymeslabelled as E ( j ) α , with α = 1 , . . . , N E , where N E is the total number of enzyme species in each cisterna. Since manysubstrates can compete for the substrate binding site on each enzyme, one expects in general that N s (cid:29) N E . Theconﬁguration space of the network Fig. 2 is N s × N C . For the N-glycosylation pathway in a typical mammalian cell, N s = 2 × , N E = 10 −

20 and N C = 4 − E ( j ) α on the substrate P c ( j ) k in cisterna j is given by P c ( j ) k + E ( j ) α ω f ( j,k,α ) c ( j )0 −−−−−−−− (cid:42)(cid:41) −−−−−−−− ω b ( j,k,α ) (cid:104) E ( j ) α − P c ( j ) k − c ( j )0 (cid:105) ω c ( j,k,α ) −−−−−−→ P c ( j ) k +1 + E ( j ) α (2)In general, the forward, backward and catalytic rates ω f , ω b and ω c , respectively, depend on the cisternal label j , thereaction label k , and the enzyme label α , that parametrise the MM-reactions [32]. For instance, structural studieson glycosyltransferase-mediated synthesis of glycans [33], would suggest that the forward rate ω f to depend on thebinding energy of the enzyme E ( j ) α to acceptor substrate P c ( j ) k and a physical variable that characterises cisterna - j .A potential candidate for such a cisternal variable is pH [34], whose value is maintained homeostatically in eachcisterna [35]; changes in pH can aﬀect the shape of an enzyme (substrate) or their charge properties, and in generalthe reaction eﬃciency of an enzyme has a pH optimum [32]. Another possible candidate for a cisternal variable ismembrane bilayer thickness [36] - indeed both pH [37] and membrane thickness are known to have a gradient acrossthe Golgi cisternae. We take ω f ( j, k, α ) ∝ P ( j ) ( k, α ), where P ( j ) ( k, α ) ∈ (0 , E ( j ) α with substrate P c ( j ) k , and deﬁne the binding probability P ( j ) ( k, α ) using a biophysical model, similar in spirit tothe Monod-Wyman-Changeux model of enzyme kinetics [38, 39], of enzyme-substrate induced ﬁt.Let l ( j ) α and l k denote, respectively, the optimal “shape” for enzyme E ( j ) α and the substrate P c ( j ) k . We assume thatthe mismatch (or distortion) energy between the substrate k and enzyme E ( j ) α is (cid:107) l k − l ( j ) α (cid:107) , with a binding probabilitygiven by, P ( j ) ( k, α ) = exp (cid:16) − σ ( j ) α (cid:107) l k − l ( j ) α (cid:107) (cid:17) (3)where (cid:107) . (cid:107) is a distance metric deﬁned on the space of l ( j ) α (e.g., the square of the (cid:96) -norm would be related to anelastic distortion model [40]) and the vector σ ≡ [ σ ( j ) α ] parametrises enzyme speciﬁcity . A large value of σ ( j ) α indicatesa highly speciﬁc enzyme, a small value of σ ( j ) α indicates a promiscuous or sloppy enzyme. It is recognised that thedegree of enzyme speciﬁcity or sloppiness is an important determinant of glycan distribution [2, 41–43].As in [23–25], our synthesis model is mean-ﬁeld, in that we ignore stochasticity in glycan synthesis that may arisefrom low copy numbers of substrates and enzymes, multiple substrates competing for the same enzymes, and kineticsof inter-cisternal transfer. Then the usual MM-steady state condition on (2), which assumes that the concentrationof the intermediate enzyme-substrate complex does not change with time, implies (cid:104) E ( j ) α − P c ( j ) k − c ( j )0 (cid:105) = ω f ( j, k, α ) c ( j )0 ω b ( j, k, α ) + ω c ( j, k, α ) E ( j ) α c ( j ) k . where c ( j ) k is the concentration of the acceptor substrate P c ( j ) k in compartment j .Together with the constancy of the total enzyme concentration, (cid:104) E ( j ) α (cid:105) tot = E ( j ) α + (cid:80) N s k =1 (cid:104) E ( j ) α − P c ( j ) k − c ( j )0 (cid:105) , thisimmediately ﬁxes the kinetics of product formation (not including inter-cisternal transport), dc ( j ) k +1 dt = N E (cid:88) α =1 V ( j, k, α ) P ( j ) ( k, α ) c ( j ) k M ( j, k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) (4)where M ( j, k, α ) = ω b ( j, k, α ) + ω c ( j, k, α ) ω f ( j, k, α ) c ( j )0 P ( j ) ( k, α )and V ( j, k, α ) = ω c ( j, k, α ) (cid:104) E ( j ) α (cid:105) tot . From the above, the experimentally measurable parameters V max and MM-constant K M , for each ( j, k, α ) can beeasily read out. As is the usual case, the maximum velocity V max is not an intrinsic property of the enzyme,because it is dependent on the enzyme concentration (cid:104) E ( j ) α (cid:105) tot ; while K M ( j, k, α ) = M ( j, k, α ) c ( j )0 /P ( j ) ( k, α ) is anintrinsic parameter of the enzyme and the enzyme-substrate interaction. The enzyme catalytic eﬃciency, the so-called“ k cat /K M ” ∝ P ( j ) ( k, α ) and is high for perfect enzymes [44] with minimum mismatch.We now add to this chemical reaction kinetics, the rates of injection ( q ) and inter-cisternal transport µ ( j ) from thecisterna j to j + 1; in Appendix A we display the complete set of equations that describe the changes in the substrateconcentrations c ( j ) k with time. These kinetic equations automatically obey the conservation law for the proteinconcentration ( p ). Rescaling the kinetic parameters in terms of the injection rate q , i.e. V ( j, k, α ) = V ( j, k, α ) /q and µ ( j ) = µ ( j ) /q , we see that the steady state concentrations of the glycans in each cisterna satisfy the following recursionrelations (see, Appendix A). In the ﬁrst cisterna, c (1)1 = 1 µ (1) + (cid:80) N E α =1 V (1 , ,α ) P (1) (1 ,α ) c (1)1 M (1 , ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P (1)( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:33) c (1) k = (cid:80) N E α =1 V (1 ,k − ,α ) P (1) ( k − ,α ) c (1) k − M (1 ,k − ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P (1)( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:33) µ (1) + (cid:80) N E α =1 V (1 ,k,α ) P (1) ( k,α ) c (1) k M (1 ,k,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P (1)( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:33) (5) c (1) N s = (cid:80) N E α =1 V (1 ,N s − ,α ) P (1) ( N s − ,α ) c (1) Ns − M (1 ,N s − ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P (1)( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:33) µ (1) and in cisternae j ≥ c ( j )1 = µ ( j − c ( j − µ ( j ) + (cid:80) N E α =1 V ( j, ,α ) P ( j ) (1 ,α ) c ( j )1 M ( j, ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P ( j )( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:33) (6) c ( j ) k = µ ( j − c ( j − k + (cid:80) N E α =1 V ( j,k − ,α ) P ( j ) ( k − ,α ) c ( j ) k − M ( j,k − ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P ( j )( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:33) µ ( j ) + (cid:80) N E α =1 V ( j,k,α ) P ( j ) ( k,α ) c ( j ) k M ( j,k,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P ( j )( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:33) c ( j ) N s = µ ( j − c ( j − N s + (cid:80) N E α =1 V ( j,N s − ,α ) P ( j ) ( N s − ,α ) c ( j ) Ns − M ( j,N s − ,α ) (cid:32) (cid:80) Nsk (cid:48) =1 P ( j )( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:33) µ ( j ) Equations (5)-(6) automatically imply that the total steady state glycan concentration in each cisterna j = 1 , . . . , N c is given by N s (cid:88) k =1 c ( j ) k = 1 µ ( j ) . These nonlinear recursion equations (5)-(6) have to be solved numerically to obtain the steady state glycan concen-trations, c ≡ c ( j ) k , as a function of the independent vectors M ≡ [ M ( j, k, α )], V ≡ [ V ( j, k, α )], and L ≡ [ P ( j ) ( k, α )],the transport rates µ ≡ [ µ ( j ) ] and speciﬁcity, σ ≡ [ σ ( j ) α ]. V. OPTIMIZATION PROBLEM

Now, with both the protocol for determining the target glycan distribution and the sequential chemical processingmodel in hand, we can precisely deﬁne the optimization problem referred to in the Introduction. Let c ∗ denotethe “target” concentration distribution for a particular cell type, i.e. the goal of the sequential synthesis mechanismdescribed in Sect. IV A is to approximate c ∗ . Let ¯ c denote the steady state glycan concentration distribution displayedon the PM - (6) implies that ¯ c k = µ ( N C ) c ( N C ) k , k = 1 , . . . , N s . We measure the ﬁdelity between the c ∗ and ¯ c by theKullback-Leibler metric [16, 17], D KL ( c ∗ (cid:107) ¯ c ) = N s (cid:88) k =1 c ∗ k ln (cid:16) c ∗ k ¯ c k (cid:17) = N s (cid:88) k =1 c ∗ k ln (cid:16) c ∗ k c ( N C ) k µ ( N C ) (cid:17) (7)Thus, the problem of designing a sequential synthesis mechanism that approximates c ∗ for a given enzyme speciﬁcity σ , transport rate µ , number of enzymes N E , and number of cisternae N C is given by Optimization A : min M , V , L ≥ D KL ( c ∗ (cid:107) ¯ c ) (8)There is separation of time scales implicit in Optimization A – the chemical kinetics of the production of glycans andtheir display on the PM happens over cellular time scales, while the issues of tradeoﬀs and changes of parameters aredriven over evolutionary timescales. We normalize the distribution so that (cid:80) N s k =1 c ∗ k = 1. Optimization A, though well-deﬁned, is a hard problem, since the steady state concentrations (6) are not explicitly known in terms of the parameters ( M , V , L ). In Appendix B, we formulate an alternative problem Optimization B inwhich the steady state concentrations are deﬁned explicitly in terms of a new parameters R and L , and in Appendix Cwe prove that Optimization A and Optimization B are exactly equivalent. This is a crucial insight that allows us toobtain all the results that follow.In Appendix H, we describe the variant of the Sequential Quadratic Programming (SQP) [45], that we use to numer-ically solve the optimization problem. VI. RESULTS OF OPTIMIZATION

To start with, the dimension of the optimization search space is extremely large ≈ O ( N s × N E × N C ). To make theoptimization search more manageable, we ignore the k -dependence of the vectors ( M , V ), (or, alternatively of R ,see Appendix B for details). The dependence on the reaction rates on the glycosyl substrate is still present in theforward reactions via the enzyme-substrate binding probability P ( j ) ( k, α ). We further assume that shape function isa number, l ( j ) α = l ( j ) α and that l k = k . Finally we will drop the dependence of the speciﬁcity on α and j , and take it tobe a scalar σ . To ﬁx our model, we will take the distortion energy that appears in (3) to be the linear form | l k − l ( j ) α | .Other metrics, such as | l k − l ( j ) α | , corresponding to the elastic distortion model [40], do not pose any computationaldiﬃculties, and we see that the results of our optimization remain qualitatively unchanged.These restrictions signiﬁcantly reduce the dimension of the optimization search, so much so that in certain limits,we can solve the problem analytically . This helps us obtain some useful heuristics (Appendix E) on how to tunethe parameters, e.g. N E , N C , σ , and others, in order to generate glycan distributions c of a given complexity. Theseheuristics inform our more detailed optimization using “realistic” target distributions.The calculations in Appendix E imply, as one might expect, that the synthesis model needs to be more elaborate, i.e.,needs a larger number of cisternae N C or a larger number of enzymes N E , in order to produce a more complex glycandistribution. For a real cell type in a niche, the speciﬁc elaboration of the synthesis machinery, would depend on avariety of control costs associated with increasing N E and N C . While an increase in the number of enzymes wouldinvolve genetic and transcriptional costs, the costs involved in increasing the number of cisternae could be rathersubtle.Notwithstanding the relative control costs of increasing N E and N C , it is clear from the special case, that increasingthe number of cisternae achieves the goal of obtaining an accurate representation of the target distribution. Let usassume that the target distribution is c ∗ k = δ ( k − M ) for a ﬁxed M (cid:29)

1, i.e. c ∗ k = 1 when k = M , and 0 otherwise,and that the N E enzymes that catalyse the reactions are highly speciﬁc. In this limit, Optimization A reduces to asimple enumeration exercise [46]: clearly one needs N E = M , with one enzyme species for each of the k = 1 , . . . , M reactions, in order to generate P c M . For a single Golgi cisterna with a ﬁnite cisternal residence time (ﬁnite µ ), thechemical synthesis network will generate a signiﬁcant steady state concentration of lower index glycans P c k with k < M , contributing to a low ﬁdelity. To obtain high ﬁdelity, one needs multiple Golgi cisternae with a speciﬁcenzyme partitioning ( E , E , . . . , E M ) with E j enzymes in cisterna j = 1 , . . . , N c . This argument can be generalisedto the case where the target distribution is a ﬁnite sum of delta-functions. The more general case, where the enzymesare allowed to have variable speciﬁcity, needs a more detailed study, to which we turn to below. A. Target distribution from coarse-grained MSMS

As discussed in Sect. II, we obtain the target glycan distribution from glycan proﬁles for real cells obtained using MassSpectrometry coupled with determination of molecular structure (MSMS) measurements [22]. The raw MSMS data, In Appendix D we show that (B4) can be solved analytically in the limit N s (cid:29)

1, since the glycan index k can be approximated by acontinuous variable, and the recursion relations for the steady state glycan concentrations (5)-(6) can be cast as a matrix diﬀerentialequation. This allows us to obtain an explicit expression for the steady state concentration in terms of the parameters ( R , L ). however, is not suitable as a target distribution. This is because it is very noisy, with chemical noise in the sampleand Poisson noise associated with detecting discrete events being the most relevant [47]. This means that many ofthe small peaks in the raw data are not part of the signal, and one has to “smoothen” the distribution to remove theimpact of noise.We use MSMS data from human T-cells [22] for our analysis. As discussed in Sect. II, the Gaussian mixture models(GMM) are often used to approximate distributions with a mixed number of modes or peaks [17], or in our setting,a given ﬁxed complexity. Here, we use a variation of the Gaussian mixture models (see Appendix G for details) tocreate a hierarchy of increasingly complex distributions to approximate the MSMS raw data. Thus, the 3-GMM and20-GMM approximations represent the low and high complexity benchmarks, respectively. In Appendix G, we showthat the likelihood for the glycan distribution of the human

T-cell saturates at 20 peaks. Thus, statistically speaking,the human

T-cell glycan distribution is accurately approximated by 20 peaks.This hierarchy allows us to study the trade-oﬀ between the complexity of the target distribution and the complexityof the synthesis model needed to generate the distribution as follows. Let T ( i ) denote the i th -GMM approximationfor the human T-cell MSMS data. We sample this target distribution at indices k = 1 , . . . , N s , that representthe glycan indices, and then renormalize to obtain the discrete distribution { T ( i ) k , k = 1 , . . . , N s } . Let H ( T ( i ) ) := − (cid:80) N s k =1 T ( i ) k log T ( i ) k denote the entropy [16] of the i th -GMM approximation. H ( T ( i ) ) quantiﬁes statistical informationin the target distribution T ( i ) . We evaluate the ﬁdelity of the distribution generated by the synthesis model to thistarget distribution by the ratio of the Kullback-Leibler distance to the entropy of the target distribution:¯ D ( σ, N E , N C , T ( i ) ) := D ( σ, N E , N C , T ( i ) ) H ( T ( i ) ) (9)This normalization allows us evaluate the ﬁdelity of the synthesis model to the target distribution T ( i ) as a fractionof the total statistical information in the target distribution T ( i ) . B. Tradeoﬀs between number of enzymes, number of cisternae and enzyme speciﬁcity to achieve givencomplexity

We are now in a position to catalogue the main results that follow from an optimization of the parameters of theglycan synthesis machinery to a given target distribution, Figs. 3-41. The normalized KL-distance ¯ D ( σ, N E , N C , c ∗ ) is a convex function of σ for ﬁxed values for other parameters(Fig. 3), i.e. it ﬁrst decreases with σ and then increases beyond a critical value of σ min . ¯ D ( σ, N E , N C , c ∗ )is decreasing in N C and N E for ﬁxed values of the other parameters, and increasing in the complexity of c ∗ for ﬁxed ( σ, N C ). The marginal contribution of N C and N E in reducing the normalised distance ¯ D isapproximately equal (Figs. 4a, 4b). The lower complexity distributions can be synthesized with high ﬁdelity withsmall ( N E , N C ), whereas higher complexity distributions require signiﬁcantly larger ( N E , N C ), Figs. 4a, 4b. For atypical mammalian cell, the number of enzymes in the N-glycosylation pathway are in the range N E = 10 −

20 [23–25, 27], Fig. 4b would then suggest that the optimal cisternal number would range from N C = 3 − σ min ( c ∗ , N C ) = argmin σ { ¯ D ( σ, ¯ N E , N C , c ∗ ) } , that minimises the error as functionof ( N C , c ∗ ) with N E ﬁxed at ¯ N E , is an increasing function of N C and the complexity of the target distribution c ∗ (Figs. 3a, 3b, 4c, 4d). This is consistent with the results in Appendix E where we established that the widthof the synthesized distribution is inversely dependent on the speciﬁcity σ : since a GMM approximation withfewer peaks has wider peaks, σ min is low, and vice versa. Similar results hold when N C is ﬁxed at ¯ N C , and N E is varied (Figs. 3c, 3d, 4c, 4d).3. Let σ min ( N C , N E , c ∗ ) denote the value of σ that minimizes ¯ D ( σ, N C , N E , c ∗ ). Then the second-derivative ∇ σ min ¯ D ( N C , N E , c ∗ ) = d dσ ¯ D ( σ, N C , N E , c ∗ ) | σ = σ min denotes the curvature at σ min , and is measure of the sen-sitivity of ¯ D ( σ, N E , N C , c ∗ ) to σ for values close to σ min ( N E , N C , c ∗ ). ∇ σ min ¯ D ( N C , N E , c ∗ ) is a decreasingfunction of N C (resp. N E ) for ﬁxed values of ( N E , c ∗ ) (resp. ( N C , c ∗ )), see Figs. 3, 4e, 4f. Thus, for any targetdistribution c ∗ there is a minimal value of ( N E , N C ) such that the target can be synthesized with high ﬁdelity0 (a) Less complex target, 3-GMMapproximation (b) More complex target, 20-GMMapproximation(c) Less complex target, 3-GMMapproximation (d) More complex target, 20-GMMapproximation FIG. 3. Tradeoﬀs amongst the glycan synthesis parameters, enzyme speciﬁcity σ , cisternal number N C and enzyme number N E , to achieve a complex target distribution c ∗ ). (a)-(b) Normalised Kullback-Leibler distance ¯ D ( σ, N E , N C , c ∗ ) as functionof σ and N C (for ﬁxed N E = 3), (c)-(d) ¯ D ( σ, N E , N C , c ∗ ) as function of σ and N E (for ﬁxed N C = 3), with the targetdistribution c ∗ set to the 3-GMM (less complex) and 20-GMM (more complex) approximations for the human T-cell MSMSdata. ¯ D ( σ, N E , N C , c ∗ ) is a convex function of σ for each ( N E , N C , c ∗ ), decreasing in N C , N E for each ( σ, c ∗ ), increasing inthe complexity of c ∗ for ﬁxed ( σ, N E , N C ). The speciﬁcity σ min ( c ∗ , N E , N C ) = argmin σ { ¯ D ( σ, N E , N C , c ∗ ) } that minimises theerror for given ( N E , N C , c ∗ ) is an increasing function of N C , N E and the complexity of the target distribution c ∗ . Furthermore,the curvature of ¯ D ( σ, N E , N C , c ∗ ) at σ min ( N E , N C , c ∗ ), related to sensitivity , is a decreasing function of N C , N E . provided the sensitivity σ is tightly controlled at σ min ( N C , N E , c ∗ ), and there is larger value of ( N E , N C ) suchthat the target can be synthesized even if the control on σ is less tight.Ungar et al. [26] optimize incoming glycan ratio, transport rate and eﬀective reaction rates in order to synthesize anarrow target distribution centred around a desired glycan. The ability to produce speciﬁc glycans without muchheterogeneity is an important goal in pharma industry. They deﬁne heterogeneity as the total number of glycanssynthesized, and show that increasing the number of compartments N C decreases heterogeneity, and increases theconcentration of the speciﬁc glycan. They also show that changing transport rate does not aﬀect the heterogeneity.Our results are entirely consistent with theirs - we have shown that ¯ D decreases as we increase N C . Thus, if thetarget distribution has a single sharp peak, increasing N C will reduce the heterogeneity in the distribution. C. Optimal partitioning of enzymes in cisternae

Having studied the optimum N E , N C , σ to attain a given target distribution with high ﬁdelity, we ask what is theoptimal partitioning of the N E enzymes in these N C cisternae? Answering this within our chemical reaction model(Sect. IV A) requires some care, since it incorporates the following enzymatic features: (a) enzymes with a ﬁnitespeciﬁcity σ can catalyse several reactions, although with an eﬃciency that varies with both the substrate index k and cisternal index j , and (b) every enzyme appears in each cisternae; however their reaction eﬃciencies depend on1 (a) Fidelity for less complex target, c ∗ = 3-GMMapproximation (b) Fidelity for more complex target c ∗ = 20-GMMapproximation(c) Optimal enzyme speciﬁcity for less complex target, c ∗ = 3-GMM approximation (d) Optimal enzyme speciﬁcity for more complextarget c ∗ = 20-GMM approximation(e) Sensitivity for less complex target, c ∗ = 3-GMMapproximation (f) Sensitivity for more complex target c ∗ = 20-GMMapproximation FIG. 4. Fidelity of glycan distribution and optimal enzyme properties to achieve a complex target distribution. The target c ∗ is taken from 3-GMM (less complex) and 20-GMM (more complex) approximations of the human T-cell MSMS data. (a)-(b)Minimum normalised KL divergence min σ { ¯ D ( σ, N C , N E , c ∗ ) } as a function of ( N E , N C ). More complex distributions requireeither a larger value N E or N C . The marginal impact of increasing N E and N C on ¯ D is approximately equal. (c)-(d) Optimumenzyme speciﬁcity σ min obtained from min σ { ¯ D ( σ, N C , N E , c ∗ ) } as a function of ( N E , N C ). σ min increases with increasing N E or N C . To synthesize the more complex 20 GMM approximation with high ﬁdelity requires enzymes with higher speciﬁcity σ min compared to those needed to synthesize the broader, less complex 3-GMM approximation. (e) -(f) Sensitivity ln d dσ ¯ D (cid:12)(cid:12)(cid:12) σ min of the normalised Kullback-Leibler distance ¯ D ( σ, N C , N E , c ∗ ) as a function of ( N E , N C ). Increasing N E or N C decreases thissensitivity implying the speciﬁcity σ does not need to be tuned very carefully if N E , N C are high. (a)(b) FIG. 5. Optimal enzyme partitioning in cisternae. (a) Heat map of the eﬀective reaction rates in each cisterna (representingthe optimal enzyme partitioning) and the steady state concentration in the last compartment ( c ( N C ) ) for the 20-GMM targetdistribution. Here N E = 5, N C = 7, normalised D KL ( T (20) (cid:107) c ( N C ) ) /H ( T (20) ) = 0 .

11. (b) Eﬀective Reaction rates afterswapping the optimal enzymes of the fourth and second cisternae. The displayed glycan proﬁle is considerably altered fromthe original proﬁle. the enzyme levels, the enzymatic reaction rates and the enzyme matching function L , all of which depend on thecisternal index j .Thus, rather than determining the cisternal partitioning of enzymes, we instead identify chemical reactions that occurwith high propensity in each cisternae. For this we deﬁne an eﬀective reaction rate ¯ R ( j, k ) for P c k → P c k +1 in the3 (a) (b) FIG. 6. Strategies for achieving high glycan diversity. Diversity versus N C and transport rate µ at various values of speciﬁcity σ for ﬁxed N E = 3. (a) Diversity vs. N C at optimal transport rate µ . Diversity initially increases with N C , but eventuallylevels oﬀ. The levelling oﬀ starts at a higher N C when σ is increased. These curves are bounded by the σ = 0 curve. (b)Diversity vs. cisternal residence time ( µ − ) in units of the reaction time ( R − ) at various value of σ , for ﬁxed N C = 4 and N E = 10. This has implications for glycoengineering (see text) where the task is to produce a particular glycan proﬁle withlow heterogeneity [26, 46]. j -th cisterna as ¯ R ( j, k ) = N E (cid:88) α =1 R ( j ) α P ( j ) ( k, α ) . (10)According to our model presented in Sect. IV A, the list of reactions with high eﬀective reaction rates in each cisterna,corresponds to a cisternal partitioning of the perfect enzymes. In a future study, we will consider a Boolean versionof a more complex chemical model, to address more clearly, the optimal enzyme partitioning amongst cisternae.Figure 5 (a) (i) shows the heat map of the eﬀective reaction rates in each cisterna for the optimal N E , N C , σ thatminimises the normalised KL-distance to the 20 GMM target distribution T (20) (see Fig. 5 (a) (ii)). The optimizedglycan proﬁle displayed in Fig. 5 (a) (iii) is very close to the target. An interesting observation from Fig. 5a(i) is thatthe same reaction can occur in multiple cisternae.Keeping everything else ﬁxed at the optimal value, we ask whether simply repartitioning the optimal enzymes amongstthe cisternae, alters the displayed glycan distribution. In Fig. 5 (b) (i), we have exchanged the enzymes of the fourthand second cisterna. The glycan proﬁle after enzyme partitioning (see Fig. 5 (b) ((iii)) is now completely altered(compare Fig. 5 (b) (ii) with Fig. 5 (b) (iii)). Thus one may achieve a diﬀerent glycan distribution by repartitioningenzymes amongst the same number of cisternae [46]. VII. STRATEGIES TO ACHIEVE HIGH GLYCAN DIVERSITY

So far we have studied how the complexity of the target glycan distribution places constraints on the evolution ofGolgi cisternal number and enzyme speciﬁcity. We now take up another issue, namely, how the physical properties ofthe Golgi cisternae, namely cisternal number and inter-cisternal transport rate, may drive diversiﬁcation of glycans[48, 49]. There is substantial correlative evidence to support the idea that cell types that carry out extensive glycanprocessing employ larger numbers of Golgi cisternae. For example, the salivary Brunners gland cells secrete mucousthat contains heavily O-glycosylated mucin as its major component [50]. The Golgi complex in these specialized cellscontain 9 −

11 cisternae per stack. Additionally, several organisms such as plants and algae secrete a rather diverserepertoire of large, complex glycosylated proteins, for a variety of functions [51–60]. These organisms possess enlargedGolgi complexes with multiple cisternae per stack [61–65].In this section, we study how changing the physical parameters in our chemical synthesis model can lead to changesin the diversity of glycan distributions.4We deﬁne diversity as the total number of glycan species produced above a speciﬁed threshold abundance c th . Thislast condition is necessary because very small peaks will not be distinguishable in the presence of noise. In computingthe diversity from our chemical synthesis model, we have chosen the threshold to be c th = 1 /N s , where N s is the totalnumber of glycan species. We have checked that the qualitative results do not depend on this choice, Fig. A6.Using the sigmoid function (1 + e − x/τ ) − as a continuous approximation to the Heaviside function Θ( x ), we deﬁnethe following optimization to achieve the maximal diversity for a given set of parameter values, N E , N C , σ ,Diversity( σ, N C , N E ) := max µ , R , L (cid:80) N s i =1 (cid:0) e − N s ( c i − c th ) (cid:1) − s.t. R min ≤ R ( j ) α ≤ R max ,µ min ≤ µ ( j ) ≤ µ max , where, as before, ( µ max , µ min ) = (1 , . R max , R min ) = (20 , . c th = 1 /N s is the threshold.See Appendix F for details on the parameter estimation.The results displayed in Fig. 6(a), show that for a ﬁxed speciﬁcity σ , the diversity at ﬁrst increases with the numberof cisternae N C , and then saturates at a value that depends on σ . For very high speciﬁcity enzymes, one can achievevery high diversity by appropriately increasing N C . This establishes the link between glycan diversity and cisternalnumber. However, this link is correlative at best, since there are many ways to achieve high glycan diversity - notablyby increasing the number of enzymes.On the other hand, one of the goals of glycoengineering is to produce a particular glycan proﬁle with low heterogene-ity [26, 46]. For low speciﬁcity enzymes, the diversity remains unchanged upon increasing the cisternal residence time.For enzymes with high speciﬁcity, the diversity typically shows a non-monotonic variation with the cisternal residencetime. At small cisternal residence time, the diversity decreases from the peak because of early exit of incompleteoligomers. At large cisternal residence time the diversity again decreases as more reactions are taken to completion.Note that the peak is generally very ﬂat, this is consistent with the results of [26]. To get a sharper peak, as advocatedfor instance by [46], one might need to increase the number of high speciﬁcity enzymes N E further. VIII. DISCUSSION

The precision of the stereochemistry and enzymatic kinetics of these N-glycosylation reactions [2], has inspired anumber of mathematical models [23–25] that predict the N-glycan distribution based on the activities and levels ofprocessing enzymes distributed in the Golgi-cisternae of mammalian cells, and compare these predictions with N-glycanmass spectrum data. Models such as the KB2005 model [23–25] are extremely elaborate (with a network of 22 , chemical parameters on the glycan distribution, and to evaluate appropriate metabolic strategies to recover the original glycoproﬁle.Additionally, a recent study by Ungar and coworkers [26, 27] shows how physical parameters , such as overall Golgitransit time and cisternal number, can be tuned to engineer a homogeneous glycan distribution. Overall, such modelsmay help predict glycosylation patterns and direct glycoengineering projects to optimize glycoform distributions.In this paper, we have been interested in the role of glycans as a marker or molecular code of cell identity [4, 7, 11].In particular, we have studied one aspect of molecular coding, namely the ﬁdelity of this glycan code generated byenzymatic and transport processes located in the secretory apparatus of the cell. This involves a method of analysisthat draws on many diﬀerent ﬁelds, and so it might be useful to provide a short summary of the assumptions, methodsand results of the paper:1. The distribution of glycans at the cell surface is a marker of cell-type identity [2, 4, 7, 11]. This glycan distributioncan be very complex; it is believed that there is an evolutionary drive for having glycan distributions of high complexity arising from the following considerations,5(a) Reliable cell type identiﬁcation amongst a large set of diﬀerent cell types in a complex organism, thepreservation and diversiﬁcation of “self-recognition” [5].(b) Consequence of pathogen-mediated selection pressures [2, 4, 6].(c) Consequence of herd immunity within a heterogenous population of cells of a community [15] or within asingle organism [5].2. The glycans at the cell surface are the end product of a sequential chemical processing via a set of enzymesresident in the Golgi cisternae, and transport across cisternae [4, 10, 11]. Using a fairly general and tractablemodel for chemical synthesis and transport, we compute the synthesized glycan distribution at the cell surface.Parameters of our synthesis model include the number of enzymes N E , speciﬁcity of enzymes σ , number ofcisternae N C and transport rates µ .3. We measure the reliability or ﬁdelity of cell identity [10, 11, 66] in terms of the error between synthesized glycandistribution on the cell surface from the its internal “target” distribution using the Kullback- Leibler distance D KL [16, 17]. In our numerical study, we obtain the target distribution for the given cell type by suitablecoarse-graining of the MSMS data for the human T-cells [22]. We solve a constrained optimization problem forminimising D KL , and study the tradeoﬀs between N E , N C and σ .4. The results of the optimization to achieve a given target complexity are summarised in Figs. 3-4. Here, wehighlight some its direct consequences:(a) Keeping the number of enzymes ﬁxed, a more elaborate transport mechanism (via control of N C and µ )is essential for synthesising high complexity target distributions (Figs. 4a, 4b). Fewer cisternae cannot becompensated for by optimising the enzymatic synthesis (via control of parameters R , L and σ ).(b) Thus, our study suggests that ﬁdelity of the glycan code generation provides a functional control of Golgicisternal number. It also provides a quantitative argument for the evolutionary requirement of multiple-compartments, by demonstrating that the ﬁdelity and sensitivity of the glycan code arising from a chemicalsynthesis that involves multiple cisternae is higher than one that involves a single cisterna (keeping ev-erything else ﬁxed) (Figs. 4a, 4b, 4e, 4f). This feature that with multiple cisternae and precise enzymepartitioning, one may generically achieve a highly accurate representation of the target distribution, hasbeen highlighted in an algorithmic model of glycan synthesis [46].(c) Our study shows that for a ﬁxed N C and N E , there is an optimal enzyme speciﬁcity that achieves thelowest distance from a given target distribution. As we see in Fig. 4d, this optimal enzyme speciﬁcity canbe very high for highly complex target distributions.(d) Organisms such as plants and algae, have a diverse repertoire of glycans that are utilised in a variety offunctions [51–60]. Our study shows that it is optimal to use low speciﬁcity enzymes to synthesize targetdistributions with high diversity (Fig. 6). However, this compromises on the complexity of the glycandistribution, revealing a tension between complexity and diversity. One way of relieving this tension is tohave larger N E and N C .(e) Consider a situation where the environment, and hence the target glycan distribution, ﬂuctuates rapidly.When synthesis parameters cannot change rapidly enough to track the environment, high speciﬁcity en-zymes can lead to a lowering of the cell’s ﬁtness [67, 68]. Having slightly sloppy enzymes may give the bestselective advantage in a time varying environment. This compromise, between robustness in a changingenvironment and the demand for complexity, is achieved by having sloppy enzymes, that allows the systemto be more evolvable [67, 68]. However, sloppy enzymes are subject to errors from synthesising the wrongreaction products. In this case, error correcting mechanisms must be in place to ensure ﬁdelity of theglycan code. We leave the role of intra-cellular transport in providing non-equilibrium proof-reading mech-anisms to reduce such coding errors, and its optimal adaptive strategies and plasticity in a time varyingenvironment, as a task for the future.Admittedly the chemical network that we have considered here is much simpler than the chemical network associatedwith all possible protein modiﬁcations in the secretory pathway. For instance, typical N-glycosylation pathways wouldinvolve the glycosylation of a variety of GBPs. Further, apart from N-glycosylation, there are other glycoprotein,proteoglycan and glycolipid synthesis pathways [1, 2, 11]. We believe our analysis is generalisable and that thequalitative results we have arrived at would still hold.To conclude, our work establishes the link between the cisternal machinery (chemical and transport) and optimalcoding. We ﬁnd that the pressure to achieve the target glycan code for a given cell type, places strong constraints on6the cisternal number and enzyme speciﬁcity [18]. An important implication is that a description of the nonequilibriumself-assembly of a ﬁxed number of Golgi cisternae must combine the dynamics of chemical processing and membranedynamics involving ﬁssion, fusion and transport [18–20]. We believe this is a promising direction for future research. IX. ACKNOWLEDGMENTS

We thank M. Thattai, A. Jaiman, S. Ramaswamy, A. Varki for discussions, and S. Krishna and R. Bhat for very usefulsuggestions on the manuscript. We thank our group members at the Simons Centre for many incisive inputs. Weare very grateful to P. Babu for consultations on the MSMS data and literature. We acknowledge the computationalfacilities at NCBS. MR acknowledges a JC Bose Fellowship from DST (Government of India), and thanks InstitutCurie for hosting a visit under the Labex program. This work has received support under the program InvestissementsdAvenir launched by the French Government and implemented by ANR with the references ANR-10-LABX-0038 andANR-10-IDEX-0001-02 PSL. QV thanks the Simons Centre (NCBS) for hosting his visit.7

Appendix

Appendix A: Kinetics of sequential chemical reactions and transport

On including the rates of injection ( q ) and inter-cisternal transport µ ( j ) from the cisterna j to j + 1, into the chemicalreaction kinetics, the substrate concentrations c ( j ) k change with time as, dc (1)1 dt = q − N E (cid:88) α =1 V (1 , , α ) P (1) (1 , α ) c (1)1 M (1 , , α ) (cid:18) (cid:80) N s k (cid:48) =1 P (1) ( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:19) − µ (1) c (1)1 dc (1) k dt = N E (cid:88) α =1 V (1 , k − , α ) P (1) ( k − , α ) c (1) k − M (1 , k − , α ) (cid:18) (cid:80) N s k (cid:48) =1 P (1) ( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:19) − N E (cid:88) α =1 V (1 , k, α ) P (1) ( k, α ) c (1) k M (1 , k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P (1) ( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:19) − µ (1) c (1) k dc (1) N s dt = N E (cid:88) α =1 V (1 , N s − , α ) P (1) ( N s − , α ) c (1) N s − M (1 , N s − , α ) (cid:18) (cid:80) N s k (cid:48) =1 P (1) ( k (cid:48) ,α ) c (1) k (cid:48) M (1 ,k (cid:48) ,α ) (cid:19) − µ (1) c (1) N s (A1)for cisterna-1, and dc ( j )1 dt = µ ( j − c ( j − − N E (cid:88) α =1 V ( j, , α ) P ( j ) (1 , α ) c ( j )1 M ( j, , α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) − µ ( j ) c ( j )1 dc ( j ) k dt = µ ( j − c ( j − k + N E (cid:88) α =1 V ( j, k − , α ) P ( j ) ( k − , α ) c ( j ) k − M ( j, k − , α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) − N E (cid:88) α =1 V ( j, k, α ) P ( j ) ( k, α ) c ( j ) k M ( j, k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) − µ ( j ) c ( j ) k dc ( j ) N s dt = µ ( j − c ( j − N s + N E (cid:88) α =1 V ( j, N s − , α ) P ( j ) ( N s − , α ) c ( j ) N s − M ( j, N s − , α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) − µ ( j ) c ( j ) N s (A2)for cisternae j = 2 , , . . . , N C . These set of dynamical equations (A1)-(A2), with initial conditions, can be solved toobtain the concentration c ( j ) k ( t ) for t ≥ p ), i.e., denoting theprotein concentration in the j -th cisterna as p ( j ) = (cid:80) N s k (cid:48) =1 c ( j ) k (cid:48) , we automatically obtain, dp (1) dt = q − µ (1) p (1) dp ( j ) dt = µ ( j − p ( j − − µ ( j ) p ( j ) for j = 2 , , . . . N C .At steady state, the left hand side of the above equations is set to zero, which after rescaling, gives the nonlinearrecursion relations displayed in (5) and (6) of the main text.8 Appendix B: A computationally simpler optimization equivalent to Optimization A

Deﬁne a new set of parameters, R ( j, k, α ) = N E (cid:88) α =1 V ( j, k, α ) M ( j, k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) (B1)where c denotes the steady state glycan concentration, corresponding to a speciﬁc ( M , V , L ). Deﬁne v by the followingset of linear equations: v (1)1 = 1 µ (1) + (cid:80) N E α =1 R (1 , , α ) P (1) (1 , α ) v (1) k = v (1) k − (cid:80) N E α =1 R (1 , k − , α ) P (1) ( k − , α ) µ (1) + (cid:80) N E α =1 R (1 , k, α ) P (1) ( k, α ) v (1) N s = v (1) N s − (cid:80) N E α =1 R (1 , N s − , α ) P (1) ( N s − , α ) µ (1) (B2)for j = 1, and v ( j )1 = v ( j − µ ( j − µ ( j ) + (cid:80) N E α =1 R ( j, , α ) P ( j ) (1 , α ) v ( j ) k = v ( j − k µ ( j − µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) P ( j ) ( k, α )+ v ( j ) k − (cid:80) N E α =1 R ( j, k − , α ) P ( j ) ( k − , α ) µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) P ( j ) ( k, α ) v ( j ) N s = v ( j − N s (cid:80) N E α =1 R ( j, N s − , α ) P ( j ) ( N s − , α ) µ ( j ) + v ( j − N s µ ( j − µ ( j ) (B3)for j = 2 , . . . , N C . Then, by the deﬁnition of R in (B1), it trivially follows that the steady state concentration c corresponding to ( M , V , L ) is a solution for (B2)-(B3).In Appendix C we show that for v obtained from (B2)-(B3) for any parameter ( R , L ), there exists parameter ( M , V , L )such that (5)-(6) are automatically satisﬁed when we set c = v , i.e. v is the steady state concentration for ( M , V , L ).Thus, the set of all concentration proﬁles deﬁned by (B2)-(B3) as a function of all possible values of the parameters( R , L ) is identical to the set deﬁned by (5)-(6) as function of ( M , V , L ). This is a crucial insight, since it allows usto search the entire parameter space using (B2)-(B3), where the concentration is known explicitly in terms of ( R , L ).See Figure A1 for a ﬂow chart of the two optimization schemes.To pose this new optimization problem, it is convenient to deﬁne ¯ v i = µ ( N c ) v ( N c ) i . Then, it follows that Optimization B : D ( σ, N E , N C , c ∗ ) := min R ≥ , L D KL ( c ∗ (cid:107) ¯ v ) (B4)9 FIG. A1. Flow chart showing the optimization schemes for Optimization A and B. We prove that D A min = D B min by showingthe set of all c is equal to the set of all v . We additionally establish that the optimum v min = c min . is equivalent to (8). Since v is explicitly known as a function of ( R , L ), optimization B (B4) is a more tractableoptimization problem than (8). Note that in this setting, the function D ( σ, N E , N C , c ∗ ) (B4) is independent of therates µ .While this optimization is easy to implement, we note that the parameters (e.g., reaction rates, speciﬁcity) are notconstrained to take only physically relevant values; a legitimate concern is that the absence of such physico-chemicalconstraints might drive this optimization to physically unrealistic solutions.There are two possible ways to impose these parameter constraints. One is to impose constraints on the “micro-scopic” chemical parameters, such as the rate of individual reactions R ( j, k, α ) and the inter-cisternal transport rate µ ( j ) . These take into consideration constraints arising from molecular enzymatic processes. The other is to imposeconstraints on “global” physical parameters, such as the total transport time across the Golgi cisternae and theaverage enzymatic reaction time. Here, we impose constraints on the microscopic reaction and transport parameters. Optimization C : D ( σ, N C , N E , c ∗ ) := min µ , R , L D KL ( c ∗ (cid:107) ¯ v )s.t. R min ≤ R ( j, k, α ) ≤ R max ,µ min ≤ µ ( j ) ≤ µ max . The upper and lower bounds on the rates R and µ are estimated in Appendix F : µ max = 1/min (resp. µ min = . R max = 20/min (resp. R min = . Appendix C: Equivalence of Optimizations A and B Let A = (cid:40) [ c ( j ) k ] j,k : M ( j, k, α ) ≥ , V ( j, k, α ) ≥ , l ( j ) α ≥ , [ c ( j ) k ] jk given by (5) and (6) (cid:41) denote the set of concentrations achievable in Optimization A . Similarly, let B = (cid:40) [ v ( j ) k ] j,k : R ( j, k, α ) ≥ , l ( j ) α ≥ v ( j ) k ] j,k given by (B2) and (B3) (cid:41) denote the set of concentrations achievable in the Optimization B .Our task is to show that A = B . Suppose [ c ( j ) k ] j,k ∈ A . Let [ M ( j, k, α )], [ V ( j, k, α )] and [ l ( j ) α ] be the correspondingparameters. Deﬁne R ( j, k, α ) = N E (cid:88) α =1 V ( j, k, α ) M ( j, k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) c ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) ≥ c ( j ) k ] j,k ∈ B .Suppose [ v ( j ) k ] j,k ∈ B . Let [ R ( j, k, α )], [ l ( j ) α ] denote the corresponding parameters. Since (cid:80) N s k =1 v ( j ) k = 1 /µ ( j ) < ∞ , itfollows that (cid:80) N s k =1 P ( j ) ( k, α ) v ( j ) k < /µ ( j ) < ∞ . Thus, there exists parameters [ M ( j, k, α )], [ V ( j, k, α )] and [ l ( j ) α ] suchthat R ( j, k, α ) = N E (cid:88) α =1 V ( j, k, α ) M ( j, k, α ) (cid:18) (cid:80) N s k (cid:48) =1 P ( j ) ( k (cid:48) ,α ) v ( j ) k (cid:48) M ( j,k (cid:48) ,α ) (cid:19) (C1)Therefore, [ v ( j ) k ] j,k satisfy (5) and (6), i.e. [ v ( j ) k ] j,k ∈ A .Moreover, suppose v satisﬁes (B2)-(B3) for a given set of parameters ( R , L ). Then there exist ( M , V , L ) such that v satisﬁes (5)-(6), i.e. v is the steady state concentration for ( M , V , L ). Appendix D: Analytical solution when N s (cid:29) It is possible to obtain analytical expressions for the steady state glycan distribution, in the limit N s (cid:29)

1, whenthe glycan index k can be approximated by a continuous variable. In this case, (5)-(6) can be cast as diﬀerentialequations, dc (1) k dk ≈ c (1) k − c (1) k − = (cid:32) (cid:80) N E α =1 R (1 , k − , α ) exp( − σ | k − − l (1) α | ) µ (1) + (cid:80) N E α =1 R (1 , k, α ) exp( − σ | k − l (1) α | ) − (cid:33) c (1) k − ≈ − (cid:32) µ (1) + ddk (cid:80) N E α =1 R (1 , k, α ) exp( − σ | k − l (1) α | ) µ (1) + (cid:80) N E α =1 R (1 , k, α ) exp( − σ | k − l (1) α | ) (cid:33) c (1) k , (D1)1and dc ( j ) k dk ≈ c ( j ) k − c ( j ) k − = µ ( j − µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) c ( j − k − (cid:32) µ ( j ) + ddk (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) (cid:33) c ( j ) k (D2)for j = 2 , . . . , N C . In (D1) and (D2), ddk N E (cid:88) α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | )= N E (cid:88) α =1 R ( j, k, α ) σ exp( − σ | k − l ( j ) α | )(1 − I ( k ≥ l α )) + R (cid:48) ( j, k, α ) exp( − σ | k − l ( j ) α | ) (D3)where the indicator function I ( · ) is equal to 1 if the argument is true, and zero otherwise and R (cid:48) ( j, k, α ) is the derivativeof R ( j, k, α ) with respect to k .Deﬁne a vector function C ( k ) ∈ R Nc of the continuous variable k by C ( k ) = [ c (1) k , c (2) k , . . . c ( N C ) k ]. Then (D1) and (D2)can be written as: dC ( k ) dk = M ( k ) C ( k ) (D4)where the matrix M ( k ) is given by M ( k ) =  A (1) ( k ) 0 0 0 . . . B (2) ( k ) A (2) ( k ) 0 0 . . . B (3) ( k ) A (3) ( k ) 0 . . . . . . B ( N C ) ( k ) A ( N C ) ( k )  (D5)with A ( j ) ( k ) = − µ ( j ) + ddk (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) B ( j ) ( k ) = µ ( j − µ ( j ) + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | )The functions A ( j ) ( k ) and B ( j ) ( k ) involve absolute value and indicator functions; therefore, the diﬀerential equationhas to be solved in a piecewise manner assuming continuity of solution C ( k ).The general solution of (D4) C ( k ) = C exp (Ω( k )) (D6)2is written in terms of the Magnus Function Ω( k ) = (cid:80) ∞ n =1 Ω( n, k ), obtained from the Baker-Campbell-Hausdorﬀformula [69], Ω(1 , k ) = (cid:90) k M ( k ) dk Ω(2 , k ) = 12 (cid:90) k dk (cid:90) k dk [ M ( k ) , M ( k )]Ω(3 , k ) = 16 (cid:90) k dk (cid:90) k dk (cid:90) k dk [ M ( k ) , [ M ( k ) , M ( k )]] + [ M ( k ) , [ M ( k ) , M ( k )]] . . . . . . where [ M ( k ) , M ( k )] := M ( k ) M ( k ) − M ( k ) M ( k ) is the commutator, and the higher order terms in . . . containhigher order nested commutators.Here, we establish conditions under which the series (cid:80) ∞ n =1 Ω( n, k ) that deﬁnes solution C ( k ) to the diﬀerentialequation (D4) converges. We also solve (D4) for some special cases.The commutator [ M ( k ) , M ( k )] =  . . . a . . . a a . . . a a . . . . . . a n,n − a n,n −  where a i,i − = A ( i − ( k ) B ( i ) ( k ) + A ( i ) ( k ) B ( i ) ( k ) − A ( i − ( k ) B ( i ) ( k ) + A ( i ) ( k ) B ( i ) ( k ) a i,i − = B ( i − ( k ) B ( i ) ( k ) − B ( i − ( k ) B ( i ) ( k )The general form of Ω( n, k ) is given by [69]Ω( n, k ) = z n n ! (cid:90) k dk (cid:90) k dk . . . (cid:90) k n − dk n − (cid:90) k n − dk n (cid:88) l W l M ( k p l ) M ( k p l ) . . . M ( k p ln ) (D7)where ( p ( l )1 , p ( l )2 . . . p ( l ) n ) is a permutation of (1 , , , . . . n ), W l ∈ {− , } , and z n ∈ , . . .n .Let ¯ A = max k,l,m | M l,m ( k ) | . Deﬁne ¯ M =  ¯ A . . . A ¯ A . . .

00 ¯ A ¯ A . . . . . . A ¯ A  We can bound all the matrix elements of Ω( n, k ) in the following wayΩ lm ( n, k ) ≤ z n ¯ M nl,m (cid:90) k dk (cid:90) k dk . . . (cid:90) k n − dk n = z n ¯ M n (cid:12)(cid:12)(cid:12) lm k n n ! (D8)3The matrix ¯ M n =  a . . . a a . . . a a a . . . a a a a . . . a n . . . a n,n − a n,n − a nn  where a lm = S lm ( n ) ¯ A n for appropriately deﬁned polynomials S l,m ( n ). Thus, it follows that Ω lm ≤ z n S lm ( n )( A ∗ ) n k n n ! and Ω l,m ( k ) ≤ (cid:80) ∞ n =1 z n S l,m ( n )( A ∗ ) n k n n ! . Consequently, the series will converge if ¯ Ak <

1, i.e. k ≤ A . Assuming µ ( j ) = µ ∀ j , we can bound ¯ A as¯ A ≤ max j,k (cid:32) µ + σ (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) µ + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) + (cid:80) N E α =1 R (cid:48) ( j, k, α ) exp( − σ | k − l ( j ) α | ) µ + (cid:80) N E α =1 R ( j, k, α ) exp( − σ | k − l ( j ) α | ) (cid:33) (D9)Since the parameters µ, σ, R ( j, k, α ) , l ( j ) α and N E are ﬁnite and positive, and R (cid:48) ( j, k, α ) is ﬁnite, ¯ A has a ﬁnite upperbound, implying k is always greater than zero, and the series has ﬁnite radius of convergence.While in principle we can obtain the glycan proﬁle for any N E and N C with arbitrary accuracy, assuming R ( j, k, α ) = R ( j ) α , we provide explicit formulae for a few representative cases : (i) ( N E = 1 , N C = 1) and (ii) ( N E = 1 , N C = 2).(i) N E = 1 , N C = 1: The solution of the diﬀerential equation is given by c ( k ) =  c e − k (cid:16) µ + R exp( − σ ( l − k )) µ + R exp( − σl ) (cid:17) (1 /σ ) − k ≤ lc ( l ) e − ( k − l ) (cid:16) µ + Rµ + R exp( − σ ( k − l )) (cid:17) (1 /σ )+1 k > l (D10)A representative concentration proﬁle is plotted in Fig. A2(a). The concentration proﬁle consists of two distinctcomponents: an initial exponential decay, and then an exponential rise and fall concentrated around l . The relativeweight of these two components is controlled by the sensitivity σ and the rate R . Such explicit formulae can beobtained for any N E >

1, as long as N C = 1.(ii) N E = 1 , N C = 2: The concentration proﬁle c (2) in cisterna 2 can be obtained from the following calculation. Let l ( j ) denote the “length” of the enzyme in cisterna j = 1 ,

2. For k ≤ min { l (1) , l (2) } c (2) ( k ) = c µ (1) e − k (cid:16) µ (2) + R (2) exp( − σ ( l (2) − k )) µ (1) + R (1) e − σl (1) (cid:17) (1 /σ ) − (cid:90) k ( µ (1) + R (1) exp( − σ ( l (1) − k ))) (1 /σ ) − ( µ (2) + R (2) exp( − σ ( l (2) − k ))) /σ dk + c (2) (0) e − k (cid:32) µ (2) + R (2) e − σ ( l (2) − k ) µ (2) + R (2) e − σl (2) (cid:33) (1 /σ ) − (D11)Next, consider the case where l (1) ≤ l (2) . Then, for l (1) < k ≤ l (2) c (2) ( k ) = c (1) ( l (1) ) µ (1) e − ( k − l (1) ) ( µ (1) + R (1) ) (1 /σ )+1 ( µ (2) + R (2) exp( − σ ( l (2) − k ))) (1 /σ ) − (cid:90) kl (1) ( µ (2) + R (2) exp( − σ ( l (2) − k ))) − /σ ( µ (1) + R (1) exp( − σ ( k − l (1) ))) (1 /σ )+1 dk + c (2) ( l (1) ) e − ( k − l (1) ) (cid:32) µ (2) + R (2) e − σ ( l (2) − k ) µ (2) + R (2) e − σ ( l (2) − l (1) ) (cid:33) (1 /σ ) − (D12)4 (a) (b) FIG. A2. Glycan concentration proﬁle calculated from the model using (a) formula (D10) for N E = N C = 1 and (b) formulae(D11)-(D15) for N E = 1 , N C = 2. and for l (1) ≤ l (2) < k , c (2) ( k ) = c (1) ( l (1) ) µ (1) e − ( k − l (1) ) (cid:18) µ (1) + R (1) µ (2) + R (2) exp( − σ ( k − l (2) )) (cid:19) (1 /σ )+1 (cid:90) kl (2) ( µ (2) + R (2) exp( − σ ( k − l (2) ))) /σ ( µ (1) + R (1) exp( − σ ( k − l (1) ))) (1 /σ )+1 dk + c (2) ( l (2) ) e − ( k − l (2) ) (cid:18) µ (2) + R (2) µ (2) + R (2) e − σ ( k − l (2) ) (cid:19) (1 /σ )+1 (D13)Next, the case where l (1) ≥ l (2) . For l (2) < k ≤ l (1) , c (2) ( k ) = c µ (1) e − k ( µ (1) + R (1) e − σl (1) ) − (1 /σ ) ( µ (2) + R (2) exp( − σ ( k − l (2) ))) (1 /σ )+1 (cid:90) kl (2) ( µ (1) + R (1) exp( − σ ( l (1) − k ))) (1 /σ ) − ( µ (2) + R (2) exp( − σ ( k − l (2) ))) − /σ dk + c (2) ( l (2) ) e l (2) − k (cid:18) µ (2) + R (2) µ (2) + R (2) e − σ ( k − l (2) ) (cid:19) (1 /σ )+1 (D14)For (2) ≤ l (1) < k , c (2) ( k ) = c (1) ( l (1) ) µ (1) e − ( k − l (1) ) (cid:18) µ (1) + R (1) µ (2) + R (2) exp( − σ ( k − l (2) )) (cid:19) (1 /σ )+1 (cid:90) kl (2) ( µ (2) + R (2) exp( − σ ( k − l (2) ))) /σ ( µ (1) + R (1) exp( − σ ( k − l (1) ))) (1 /σ )+1 dk + c (2) ( l (1) ) e − ( k − l (1) ) (cid:32) µ (2) + R (2) e − σ ( l (1) − l (2) ) µ (2) + R (2) e − σ ( k − l (2) ) (cid:33) (1 /σ )+1 (D15)The integrals in (D11) to (D15) can evaluated numerically. The result of the numerical computation is shown inFig. A2. Appendix E: Capability of the chemical network model to generate complex distributions

Is our glycan synthesis model capable of generating concentration distributions of arbitrary complexity? In what waydo we need to change the parameters N E , N C , σ, . . . , in order to generate glycan distributions c of a given complexity?5The purpose of this section, is to obtain some heuristics for this task.We show in Appendix D that (B4) can be solved analytically in the limit N s (cid:29)

1, because in this limit the glycan index k can be approximated by a continuous variable, and the recursion relations for the steady state glycan concentrations(5)-(6) can be cast as a matrix diﬀerential equation. This allows us to obtain an explicit expression for the steadystate concentration in terms of the parameters ( R , L ).We derive our heuristics from a semi-analytical treatment in the limit N s (cid:29) c k for the case N E = N C = 1 (D10). Figures A3(a)-(d)show the glycan proﬁle c k vs. k as one varies the enzyme speciﬁcity σ , the reaction rates R and transport rates µ , fortwo diﬀerent values of N E and N C . The results in the plots lead us to the following general observations: • Very low speciﬁcity enzymes cannot generate complex glycan distributions. Keeping everything else ﬁxed,intermediate or high speciﬁcity enzymes can generate glycan distributions of higher complexity by increasing N E or N C (Figs. A3(a),(c)). • Decreasing the speciﬁcity σ or increasing the rates R increases the proportion of higher index glycans. Keepingeverything else ﬁxed, changes in the rate R have a stronger impact on the relative weights of the higher indexglycans to lower index glycans. The relative weight of the higher index glycans increases with increasing N E and N C (Figs. A3(b)-(d)). • Keeping everything else ﬁxed, decreasing enzyme speciﬁcity increases the spread of the distribution around thepeaks (Figs. A3(a),(c)).

Appendix F: Parameter estimation

The typical transport time of glycoproteins across the Golgi complex is estimated to be in the range 15-20 mins. [23],which corresponds to the transport rate, µ = . q is in the range 100 − cell 24 h [23, 24]. For our calculation we set q = 387 .

30 pmol/10 cells 24 hr = 0 .

27 pmol/10 cells min,the geometric mean of 100 and 1500. We set the range for the enzymatic rate R to be R min = min α (cid:110) V ( α )max /νK ( α ) M + ν qµ (cid:111) ≤ R ≤ R max = max α (cid:110) V ( α )max /νK ( α ) M (cid:111) . where K ( α ) M and V ( α )max denote the Michaelis constants and V max of the α -th enzyme. The conversion from 1 pmoles/10 cells to concentration can be obtained by taking cisternal volume ( ν ) to be 2.5 µm [23, 24]. This gives1 pmoles/10 cells = 10 − moles10 × . × − × litre = 400 µM. (F1)In Table I we report the parameters for the 8 enzymes taken from Table 3 in [23]. From these parameters it followsthat6 (a) (b)(c) (d) FIG. A3. Glycan proﬁle { c k : k = 1 , . . . , N s } as a function of speciﬁcity σ (Fig. (a), (c)), and reaction rates R (Fig. (b), (d)).Fig. (a): N E = N C = 1 , ( R = 50 , µ = 1 , l = 10). c k decreases exponentially with k for very low and very high σ ; however, thedecay rate is lower at low σ . For intermediate values of σ , the distribution has exactly two peaks, one of which is at k = 0, andeventually decays exponentially. The width of the distribution is a decreasing function of σ .Fig. (b): N E = N C = 1 , ( σ = 0 . , µ = 1 , l = 10). At low R , c k is concentrated at low k . The proportion of higher index glycansin an increasing function of R .Fig. (c): N E = N C = 2 , ( R = 40 , µ = 1 , [ l (1)1 , l (1)2 , l (2)1 , l (2)2 ] = [10 , , , σ increases, the distribution becomes morecomplex – from a single peaked distribution at low σ to a maximum of four-peaked distribution at high σ . The peaks getssharper, and more well deﬁned as σ increases.Fig. (d): N E = N C = 2 , ( R = 40 , µ = 1 , [ l (1)1 , l (1)2 , l (2)1 , l (2)2 ] = [10 , , , R shifts thepeaks towards higher index glycans and the proportion of higher index glycan increases. R min = min α (cid:110) V ( α )max /νK ( α ) M + ν qµ (cid:111) = V (7)max /νK (7) M + ν qµ = . × µM/ min3400 µM + 149 . µM = 0 . − R max = max α (cid:110) V ( α )max /νK ( α ) M (cid:111) = V (1)max /νK (1) M = 5 × µM/ min100 µM = 20min − α K ( α ) M V ( α )max ( µ mol) (pmol/10 cell-min)1 100 52 260 7.53 200 54 100 55 190 2.336 130 .167 3400 .168 4000 9.66TABLE I. Enzyme parameters taken from Table 3 in [23] that we use to calculate the bounds on the reaction rate R . Here K ( α ) M and V ( α )max denote the Michaelis constant and V max of the α -th enzyme. Appendix G: Constructing target distributions for glycans of a given cell type

The target distribution of the glycans on the cell surface is obtained via mass spectrometry. The x-axis of mass spec-troscopy (MS) graphs is mass/charge of the ionised sample molecules and the y-axis is relative intensity correspondingto each mass/charge value, taking the highest intensity as 100%.This relative intensity roughly correlates with the relative abundances of the molecules in the sample.This raw MS data is noisy and cannot be directly used as the target distribution in our optimization problem. Thereare three major sources of noise in the MS data [47]: the chemical noise in the sample, the Poisson noise associatedwith detecting discrete events, and the Nyquist-Johnson noise associated with any charge system. We propose asimple model that accounts for the chemical noise and the Poisson sampling noise. Using this noise model and theavailable MS data, we generate parametric bootstrap samples of glycan measurements, and ﬁt a Gaussian MixtureModel (GMM) on this sample to approximate the glycan distribution. This GMM probability distribution is used asthe target distribution in our numerical experiments.The MS data obtained from [22] had mass ranging between 500 to 5000 Daltons with intensity reported at every0.0153 Daltons. We ﬁrst bin this MS data into 180 bins and take the maximum value within each bin as the value ofintensity for that bin. Fig. A4 shows the raw MS data and the binned distribution.Let ¯ I k represents the relative intensity of the k -th bin in the binned MS graph. We generate a sample population ofglycans using the MS data in the following way:1. Poisson sampling noise: The MS data does not have absolute count information. We assume an arbitrarymaximum count I max , and deﬁne the intensity I k = I max ¯ I k . The plots in Fig. A5(a) show that the results arenot sensitive to the speciﬁc value of I max .2. Chemical noise: The sample used for MS analysis also contains small amounts of molecules that are not glycans.These appear as the very small peaks in the MS data. We assume that the probability p k that the peak at index k corresponds to a glycan is given by p k = 1 − e − IkImax = 1 − e − ¯ I k which adequately suppresses this chemical noise.3. Bootstrapped glycan data: The count n k at the glycan index k is distributed according to the following distri-bution: n k = (cid:26) − p k ) n = 0 n p k e − I k ( I k ) n n ! n ≥ . We assume that the MS data was generated from N diﬀerent cells. Thus, the total count at glycan index k isgiven by the sum of N i i.i.d. samples distributed according to the distribution above. We in Fig. A5 (b) showthat results are insensitive to N i .8 FIG. A4. The binned MS data (blue) approximates the raw MS data (red) very well. We use this binned data for GMMapproximation of the MS data. (a) (b)

FIG. A5. Log likelihood vs. number of components ( N ) in the GMM. We see that the log likelihood saturates at around N = 20,thus 20-GMM is a very good representation of the MS-data from human T-cells. The diﬀerent symbols are for (a) diﬀerentvalues of the maximum intensity I max = 50 , ,

200 and (b) diﬀerent values of the number of i.i.d samples N i = 500 , , I max and N i . Next, we interpret the counts as samples from a “spatial” distribution f . We approximate this distribution as aGaussian mixture, i.e. f ( x ) = (cid:80) N(cid:96) =1 γ (cid:96) η ( x | µ (cid:96) , σ (cid:96) ), where η ( x | µ, σ ) denotes the density of a normally distributedrandom variable with mean µ and standard deviation σ at the location x . In this setting, we assume that each countis a sample from the distribution η ( x | µ (cid:96) , σ (cid:96) ) with probability γ (cid:96) . Thus, each count is classiﬁed as coming from oneof the Gaussian components. Appendix H: Numerical scheme for performing the non-convex optimization

We solve Optimization C using the numerical scheme detailed below. The optimization problem consists of minimisinga non-convex objective with linear box constraints. We use the MATLAB FMINCON function to solve this optimiza-tion. We use Sequential Quadratic Programming (SQP), a gradient based iterative optimization scheme for solvingoptimizations with non-linear diﬀerentiable objective and constraints. Since our problem is non-convex and SQP onlygives local minima, we initialise the algorithm with many random initial points. We use SOBOLSET function ofMATLAB to generate space ﬁlling pseudo random numbers. We have taken 1000 initialisations for each N E , N C and σ value. We have taken 50 equally spaced points between 0 and 1 to explore the σ -space for Fig. 3. Some minorﬂuctuations in D due to non-convexity of the objective function in the ﬁnal results were smoothed out by taking theconvex hull of the D vs. σ graph. The results for σ min ( N E , N C ) and D ( σ min , N E , N C ) (Fig. 4) were obtained by9 (a) c th = N s (b) c th = N s (c) c th = N s FIG. A6. Diversity vs. N C for diﬀerent values of σ keeping N E = 1 ﬁxed, for three diﬀerent values of the threshold, c th = N s , N s , N s . Changing the value of the threshold c th , only changes the saturation value of the diversity curve. adding σ to the optimization vector and then performing the optimization. The sensitivity results (Figs. 4e and 4e)were obtained by approximating the D vs σ graph around σ min with a parabola, the coeﬃcient of the quadratic termbeing the curvature of the graph at σ min .A similar numerical scheme was used to optimize diversity. [1] B. Alberts et al. , Molecular Biology of the Cell (Garland Science, 2002).[2] A. Varki et al. , Essentials of Glycobiology (Cold Spring Harbor Laboratory Press, 2009).[3] R. D. Cummings and J. M. Pierce, Chemistry & biology , 1 (2014).[4] A. Varki, Glycobiology , 3 (2017).[5] K. Drickamer and M. E. Taylor, Trends in biochemical sciences , 321 (1998).[6] P. Gagneux and A. Varki, Glycobiology , 747 (1999).[7] H.-J. Gabius, BioSystems , 102 (2018).[8] R. A. Dwek, Chemical reviews , 683 (1996).[9] P. Winterburn and C. Phelps, Nature , 147 (1972).[10] A. Varki, Trends in cell biology , 34 (1998).[11] P. Pothukuchi, I. Agliarulo, D. Russo, R. Rizzo, F. Russo, and S. Parashuraman, FEBS letters , 2390 (2019).[12] F. Bard and J. Chia, Trends in cell biology , 379 (2016).[13] G. D’Angelo, S. Capasso, L. Sticco, and D. Russo, The FEBS journal , 6338 (2013).[14] M. Demetriou, M. Granovsky, S. Quaggin, and J. W. Dennis, Nature , 733 (2001).[15] C. WILLS and D. R. GREEN, Immunological reviews , 263 (1995).[16] T. M. Cover and J. A. Thomas, Elements of information theory (John Wiley & Sons, 2012).[17] D. J. MacKay,

Information theory, inference and learning algorithms (Cambridge university press, 2003).[18] D. Sengupta and A. D. Linstedt, Annual review of cell and developmental biology , 57 (2011).[19] H. Sachdeva, M. Barma, and M. Rao, Scientiﬁc reports , 1 (2016).[20] P. Sens and M. Rao, in Methods in cell biology , Vol. 118 (Elsevier, 2013) pp. 299–310.[21] A. Bacharoglou, Proceedings of the American Mathematical Society , 2619 (2010).[22] R. D. Cummings and P. Crocker,

Functional Glycomics Database, Consortium for Functional Glycomics, (2020).[23] P. Uma˜na and J. E. Bailey, Biotechnology and bioengineering , 890 (1997).[24] F. J. Krambeck, S. V. Bennun, S. Narang, S. Choi, K. J. Yarema, and M. J. Betenbaugh, Glycobiology , 1163 (2009).[25] F. J. Krambeck and M. J. Betenbaugh, Biotechnology and Bioengineering , 711 (2005).[26] P. Fisher, H. Spencer, J. Thomas-Oates, A. J. Wood, and D. Ungar, Cell reports , 1231 (2019).[27] P. Fisher and D. Ungar, Frontiers in cell and developmental biology , 15 (2016).[28] C. B. Hirschberg, P. W. Robbins, and C. Abeijon, Transporters of nucleotide sugars, ATP, and nucleotide sulfate in theendoplasmic reticulum and Golgi apparatus, (1998).[29] C. E. Caﬀaro and C. B. Hirschberg, Accounts of chemical research , 805 (2006).[30] P. M. Berninsone and C. B. Hirschberg, Current opinion in structural biology , 542 (2000).[31] N. Trinajstic, Chemical graph theory (Routledge, 2018).[32] N. Price and L. Stevens,

Fundamentals of Enzymology: The cell and molecular biology of catalytic proteins (OxfordUniversity Press, 1999).[33] K. W. Moremen and R. S. Haltiwanger, Nature chemical biology , 853 (2019). [34] S. Kellokumpu, Frontiers in cell and developmental biology , 93 (2019).[35] J. R. Casey, S. Grinstein, and J. Orlowski, Nature reviews Molecular cell biology , 50 (2010).[36] S. Dmitrieﬀ, M. Rao, and P. Sens, Proceedings of the National Academy of Sciences , 15692 (2013).[37] J. Llopis, J. M. McCaﬀery, A. Miyawaki, M. G. Farquhar, and R. Y. Tsien, Proceedings of the National Academy ofSciences , 6803 (1998).[38] J. Monod, J. Wyman, and J.-P. Changeux, J Mol Biol , 88 (1965).[39] J.-P. Changeux and S. J. Edelstein, Science , 1424 (2005).[40] Y. Savir and T. Tlusty, PloS one , e468 (2007).[41] S. Roseman, Journal of Biological Chemistry , 41527 (2001).[42] P. Hossler, B. C. Mulukutla, and W.-S. Hu, PloS one (2007).[43] M. Yang, C. Fehl, K. V. Lees, E.-K. Lim, W. A. Oﬀen, G. J. Davies, D. J. Bowles, M. G. Davidson, S. J. Roberts, andB. G. Davis, Nature chemical biology , 1109 (2018).[44] A. Bar-Even, R. Milo, E. Noor, and D. S. Tawﬁk, Biochemistry , 4969 (2015).[45] S. Boyd and L. Vandenberghe, Convex optimization (Cambridge university press, 2004).[46] A. Jaiman and M. Thattai, BioRxiv , 440792 (2018).[47] P. Du, G. Stolovitzky, P. Horvatovich, R. Bischoﬀ, J. Lim, and F. Suits, Bioinformatics , 1070 (2008).[48] A. Varki, Cold Spring Harbor perspectives in biology , a005462 (2011).[49] J. W. Dennis, I. R. Nabi, and M. Demetriou, Cell , 1229 (2009).[50] H. van Halbeek, G. J. Gerwig, J. F. Vliegenthart, H. L. Smits, P. J. Van Kerkhof, and M. F. Kramer, Biochimica etBiophysica Acta (BBA)-Protein Structure and Molecular Enzymology , 107 (1983).[51] H. E. McFarlane, A. D¨oring, and S. Persson, Annual review of plant biology , 69 (2014).[52] B. E. Koch, J. Stougaard, and H. P. Spaink, Glycobiology , 469 (2015).[53] M. A. O’Neill, T. Ishii, P. Albersheim, and A. G. Darvill, Annu. Rev. Plant Biol. , 109 (2004).[54] T. Hayashi and R. Kaida, Molecular Plant , 17 (2011).[55] P. Kumar, M. Yang, B. C. Haynes, M. L. Skowyra, and T. L. Doering, Current opinion in structural biology , 597(2011).[56] N. A. Gow and B. Hube, Current opinion in microbiology , 406 (2012).[57] M. A. Atmodjo, Z. Hao, and D. Mohnen, Annual review of plant biology (2013).[58] S. J. Free, in Advances in genetics , Vol. 81 (Elsevier, 2013) pp. 33–82.[59] M. Pauly, S. Gille, L. Liu, N. Mansoori, A. de Souza, A. Schultink, and G. Xiong, Planta , 627 (2013).[60] R. A. Burton and G. B. Fincher, Frontiers in plant science , 456 (2014).[61] B. Becker and M. Melkonian, Microbiol. Mol. Biol. Rev. , 697 (1996).[62] A. A. Mironov, I. S. Sesorova, E. V. Seliverstova, and G. V. Beznoussenko, Tissue and Cell , 186 (2017).[63] B. S. Donohoe, B.-H. Kang, and L. A. Staehelin, Proceedings of the National Academy of Sciences , 163 (2007).[64] S. Mogelsvang, N. Gomez-Ospina, J. Soderholm, B. S. Glick, and L. A. Staehelin, Molecular biology of the cell , 2277(2003).[65] M. S. Ladinsky, C. C. Wu, S. McIntosh, J. R. McIntosh, and K. E. Howell, Molecular biology of the cell , 2810 (2002).[66] P. Stanley, Cold Spring Harbor perspectives in biology , a005199 (2011).[67] H. Nam, N. E. Lewis, J. A. Lerman, D.-H. Lee, R. L. Chang, D. Kim, and B. O. Palsson, Science , 1101 (2012).[68] A. Peracchi, Trends in biochemical sciences (2018).[69] S. Blanes, F. Casas, J. Oteo, and J. Ros, Physics Reports470