Signal metrics analysis of oscillatory patterns in bacterial multi-omic networks
aa r X i v : . [ q - b i o . M N ] A ug Signal metrics analysis of oscillatory patterns inbacterial multi-omic networks
Francesco Bardozzo ∗ , Pietro Li´o ∗ , Roberto Tagliaferri ∗ ∗ - DISA-MIS - University of Salerno (IT), ∗ - Computer Laboratory, University of Cambridge (UK)August 2020 AbstractMotivation:
One of the branches of Systems Biology is focused on adeep understanding of underlying regulatory networks through the analy-sis of the biomolecules oscillations and their interplay. Synthetic Biologyexploits gene or/and protein regulatory networks towards the design of os-cillatory networks for producing useful compounds. Therefore, at differentlevels of application and for different purposes, the study of biomolecularoscillations can lead to different clues about the mechanisms underly-ing living cells. It is known that network-level interactions involve morethan one type of biomolecule as well as biological processes operating atmultiple omic levels. Combining network/pathway-level information withgenetic information it is possible to describe well-understood or unknownbacterial mechanisms and organism-specific dynamics.
Results:
Network multi-omic integration has led to the discovery of in-teresting oscillatory signals. Following the methodologies used in signalprocessing and communication engineering, a new methodology is intro-duced to identify and quantify the extent of the multi-omic oscillations ofthe signal. New signal metrics are designed to allow further biotechnologi-cal explanations and provide important clues about the oscillatory natureof the pathways and their regulatory circuits. Our algorithms designed forthe analysis of multi-omic signals are tested and validated on 11 differentbacteria for thousands of multi-omic signals perturbed at the network levelby different experimental conditions. Information on the order of genes,codon usage, gene expression, and protein molecular weight is integratedat three different functional levels. Oscillations show interesting evidencethat network-level multi-omic signals present a synchronized response toperturbations and evolutionary relations along with taxa.
Availability:
The algorithms, the code (written in R), the tool, thepipeline and the whole dataset of multi-omic signal metrics are available ata GitHub repository: https://github.com/lodeguns/Multi-omicSignals
Contact: [email protected] Introduction
An oscillating multi-omic network is a complex interlacing of interact-ing elements, which could be defined gene/protein oscillators. Theseelements are able to produce oscillations of a certain frequency poten-tially involving several and different cell processes [13]. It is proved thatbiomolecules, such as genes or proteins, could exhibit oscillatory behav-iors. Moreover, if combined, they could generate oscillatory circuits ob-servable on the temporal axis. The gene/protein oscillations are shown tobe controlled by underlying regulatory networks and impact to differentscales. [23, 33]. In [19] a comprehensive review is reported, showing that,in general, living cells have a pervasive dynamic behaviour where the keytranscription and regulatory factors oscillate on and off repeatedly evenwhen cells are in steady states. These oscillations can be detected incircuits of genes/proteins involved in stress responses, signaling, and celldevelopment [18]. Many biomolecules exhibit network-level metabolic in-teractions coordinated with cell growth, chromosome replication and celldivision [39]. For example, the Min oscillation is discovered to be fun-damental in the E. coli cell division [20]. In addition, biomolecule oscilla-tions are found in very complex and multi-periodic signals. [1] generatea de-noised waveform from multiple significant frequencies, to provideoscillation statistics including signal metrics and multi-periodicity quan-tification. Furthemore, genetic circuits present some interesting intrinsicdynamics [14]; for example, in signalling pathways, they are responsiveto feedback loops and show functional plasticity [18]. Recently, pseudo-temporal estimations based on the level of mRNA and/or proteins havebeen introduced to detect oscillatory gene networks from single-snapshotexperiments [7]. Unfortunately, even if the complex dynamics of bacte-rial processes could be predicted by keeping track of bacterial functionaladaptation to single-snapshot perturbations (control vs treatment), theoscillatory dynamics have not yet been explored enough. Thus, a com-plete mapping of regulatory and control mechanisms is not yet known.Furthermore, the lack of experiments along temporal axis make difficultto recognise oscillating biomolecules and their circuital interactions. [26].However, network-level synchronisation could be outlined by the hypothe-ses that every biomolecule in a network could interact with any other; thiscauses that the shared biomolecule oscillators synchronise the signals oncommon fluctuations. There are few examples in nature for which theseassumptions are fully verified, such as circadian oscillations [12]. Onthe other hand, in synthetic biology, artificial oscillators , although oftenshowing poor accuracy [25, 15], are one of the most promising researchfields. In particular, artificial oscillators allow the creation of genetic cir-cuits focused on the execution of logical programming in living cells. TheE. coli repressillator experiment represents a clear example of how geneticregulatory networks can be designed and implemented to perform newfunctions [3, 9]. In our previous works [4, 5], we investigated the E. coli response to ≈
70 perturbations by monitoring network-level oscillationchanges from controls to treatments and discovering that at the networklevel, there is another type of inter molecular multi-omic oscillation as-sociated to each single pathway and experiment. This type of oscillation, s far as we know, has not yet been sufficiently investigated on fixed-timeanalysis. In particular, our multi-omic oscillations could be described asa multi-periodic signal given by the variation of interacting biomolecu-lar multi-omics. This variation should be intended in terms of sequencelow-hight alternations of multi-omic values [5]. To clarify these points, inSection 2.1 and Figure 1, the identification procedure of these multi-omicsignal is described. With respect to our previous works, the number of or-ganisms is extended to 11. Also, the experiment cardinality has increasedto the order of thousands. Previously, we demonstrated how to measurethe structural relations between the genomic and proteomic layers andhow these led to oscillatory variations in response to perturbations. Onthe contrary, in this paper, taking advantage of signal theory and com-munication engineering, ad hoc metrics to better quantify network-leveloscillatory features are designed and the algorithms for their computationare provided on an online repository. In detail, two new change pointdetection algorithms ( CP D ) [34, 37] are introduced. These algorithmsare capable of managing both the complexity of the variable amplitude ofthe multi-omic signal and the multiple periodicity. Further analyses onnetwork-level synchronisations based on our novel signal metrics are pro-vided. In particular, the analyses are focused on the interactions throughdifferent pathways of the same organism and modulated by different condi-tion contrasts ( CC ) (single-snapshot mRNA experiments) [22]. Throughour approach, it is possible to recognise the oscillating networks, eventu-ally evaluating if they are synchronised (periodically and simultaneouslyactivated) or not synchronised, and how this feature changes across taxa.Integrated multi-omics are created from the following single omics: thecodon usage [32], mRNA amount contrasts [22] and the protein molecu-lar weight. Our final dataset is composed of 2 . .
722 multi-omic signalsfrom 11 different bacteria on thousands of environmental experiments.Network-level oscillatory variations are analysed with three functionallevels of granularity from KEGG orthology [16]. The results confirmedand extended our previous findings, by showing that network-level multi-omic oscillations exist in bacteria. Moreover, we found additional cluesto support that the oscillatory networks are synchronised showing a com-bined dynamic response to perturbations. Furthermore, the comparisonsbetween the various bacteria succeed in highlighting, in a completely in-novative way, how network-level oscillations could reflect the effects ofevolutionary pressure. Moreover, even if the maintenance of the geneorder is not well understood [29], this research could give new clues toits meaning, underlining its dynamical relations with the proteomic layerunder the evolutionary pressure [36].
In subsection 2.1 the procedure for multi-omic signal identification is de-scribed. In Figure 1 an overview of this task is provided. In detail,the mRNA condition contrasts and protein weight information are ex-tracted, normalised and aligned with the codon usage information. InFigure 1 - Box (c) , the multi-omic signal identification in condition con- j -th pathway and considering the k -th perturbation (indicated as the k -th experiment, or better condition contrast CC k ). As it is described in Section 2.1.1 and shown in Box (a) and Box (b) , themulti-omic information is combined and grouped by molecular networks (KEGGpathways) and ordered with respect to the gene order information. The dashedlines between pathway nodes and genes indicate a correspondence betweenthe n selected pathway proteins and their related genes ( ∀ i ∈ [1 , n ] , p j,i ⇔ g j,i ).As shown in Box (c) , the multi-omic values ( ~mv ij ) are represented as outputsof gene/protein oscillators at time t ≤ k ≤ t f , which is the time when a singlecondition contrast is taken. Then, the ( ~mv ij ) are discretized into N quantisationlevels (see also Section 2.1.2), and rotated of 90 ◦ with respect to the multi-omicspace. In this way, as it is shown in Box (d) , a discrete multi-omic signal s ij isobtained, which could be eventually operon compressed (see Box (e) and alsoSection 2.1.3). The ∗ and ∗ asterisks represent the proximal and the distalpositions of possible promoters or repressors neighbouring the operon g j , , g j , . trast CC at a fixed time k is indicated. From this single experiment, thecollected multi-omic values ~mv are selected with respect to the KEGGpathway composition (Box (a) ) and ordered considering the related DNAspatial positions (Box (b) ). Then, they are figuratively rotated of 90 ◦ and normalised generating a network-level multi-omic signal (Box (d-e) ).As shown in Box (d) , the multi-omics are discretised by considering N common levels of discretisation obtained through a between-organismsanalysis ( IOAC procedure) as defined in subsection 2.1.1.The identifiedsignals have been found to be almost periodic, then in subsection 2.2,two novel algorithms are introduced for their analyses. Furthermore, byapplying Algorithm 1 across the organisms and for all the multi-omic com- inations, the periodicity in the multi-omic signals is identified throughan estimate ˆ θ of the search window where it can be found (Figure 2 -Box (a) and Section2.2.1). Next, another CPD algorithm (Algorithm 2)is applied to obtain two signal indices: osc s and osc k (Figure 2 - Box (b) ). In particular, they are used to identify network-level oscillations foreach pathway, for each experiment and for each multi-omic combination(Figure 4- Box (a) ). Furthermore, the oscillations between networks andamong the experiments, as described in the pipeline of Figure 4- Box (a-b-c) are computed within and between organisms. Finally, this step iswell described in Section 2.3. Given ~P j,n as the j -th bacterial pathway of n proteins, the multi-omicsignal s j is composed by a finite vector of n multi-omic values ~mv j as-sociated to a subset of genes ~G j,n = { g j, , g j, , . . . g j,n } . The g j,i arecollected in ~G j,n considering the exact correspondence with respectto the proteins p j,i that compose the ~P j,n = { p j, , p j, , . . . p j,n } . Eachmulti-omic value ~mv j,i is arranged on the multi-omic signal s j,i consid-ering the relative position of its associated gene g j,i with respect to theorigin of replication. In order to describe the signal we adopt the form s j [ k ] with the index k ∈ Z (such for example, the i -th multi-omic valueof a signal is equal to s j [ i ] = mv j,i ). The multi-omic values ~mv j,i arecombined averaging their associated single-omics ~sv j,i , ∀ i ∈ Z . Since the ~sv j,i are not defined in the same range, a normalisation is applied to maketheir values comparable. In particular the single-omics are three: (I) themRNA condition contrasts ( CC ) and (II) the molecular weights (MW)and (III) the codon adaptation index. MW and CC are normalised intothe interval [0,1], that is the same range in which the codon adaptationindex (CAI) is already defined [32]. In section 2.4 there is an accuratedescription of these sources. In order to compare the signals between different organisms, an amplitudediscretisation process was applied. The whole procedure is called: Inter-Organisms Amplitude Consensus (IOAC) and discretisation. The signalamplitude is divided into N bins ( C = { , , ..b − , b, b + 1 ...N } , ∀ b ∈ Z )and the original ~mv j,i was replaced by the bin label it belongs to througha function map: f m ( s ) := s [ i ] → c ( s [ i ]) , ∀ i ∈ h , n i . Thus, if the totalnumber of classes ( N ) is equal to the cardinality of C , then each class c ( s [ i ]) represents the c -th interval in which the ~mv j,i falls. However, thecorrect estimation of N for the discretisation follows a study on the singleomic distributions trough the IOAC. In particular, the distribution of the ~sv j,i on the whole genome and for each organism was investigated bymeans of the Anderson-Darling test (A-D Test, [28]). In the case of CAI,for the 91 . H with a significance of 0 .
05 . The W and CC have the same significance with a percentage near to 100%.We conclude that the single omics do not follow a normal distribution.As a consequence, the optimal number of N bins is computed for non-normal distributions applying the Doane’s formula [38]. The N values areestimated between-organisms. In particular, N is fixed equal to 9 troughsa frequency based consensus (bold column in Supplementary MaterialSection 1). Consequently, in our set-up, the multi-omic signals generatedfor each experiment are discrete non-deterministic signals that representgene-ordered multi-omic values that fall into 9 possible class intervals(from 0 to 8). The extent of the multi-omic signal dataset is increased by their operoncompressed versions. In this case, if a set of multi-omic values in a signal s j of length n are part of an operon in position r of length m , this setis defined as s j [ r : r + m ] = { mv j,r , mv j,r +1 , mv j,r + m } , with m < n . Inthis case, each mv j,i follows its natural adjacent disposition on the DNAsequence. For this reason, we can represent the signal as a concatenation(indicated as ⊕ ) of the original signal with respect the operon: s j [1 : n ] = s j [1 : r − ⊕ s j [ r : r + m ] ⊕ s j [ n − m + r : n ]. According to theirnatural regulatory functions, in order to apply the compression, the mv j,i that composes an operon could be seen as a single averaged value : s j [ r ] = f m ( | mv i + mv i +1 + . . . + mv m /m | ). In this way we obtain additional signalswith this shape: s j [1 : n ] = s j [1 : r − ⊕ s j [ r ] ⊕ s j [ n − r : n ]. Obviously,the compression is applied more times if occurs, thus shortening the signallength. In this section, we deal with multiple-periodicity signals characterisedfrom different mRNA condition contrasts (see Section 2.4). These couldrepresent more replications of the same experiment. Thus, due to theexperimental intrinsic and extrinsic noise [35], it is very rare that thesesignals follow an ideal shape with a fixed periodicity; on the contrary, theperiodicities are more variable making difficult the oscillation detection.After all, if there are oscillations, then their half-periods (from peak tolows or vice versa) are localised in windows of variable length. In orderto obtain an estimation of half-periods, we introduce a novel localisedand non-parametric change-point detection algorithm (CPD). The ideabehind Algorithm 1 is based on the analysis of the median multi-omicvariations estimating the window lengths θ s in which the half-periodsoccur. A change-point is detected if and only if the median value ofthe previous variations ( ~m k ) is less or equal to the new coming multi-omic variation ( d k ) respecting the genes order. In general, the changepoint detectors look for changes in the statistical characteristics of thesignal (i.e. the median values ~m k ), therefore considering the signal as a ollection of different distributions arranged in adjacent windows [8]. InFigure 2 - Box (a) an atomic example of a half-period length estimation( θ ) is shown. Algorithm 1 collects in a vector ~θ all the θ s computed alongthe signal. Then, for each signal, the median values of ~θ are collected.Next, a common ˆ θ is defined as the maximum between the median valuesof ~θ between all the organisms and for each multi-omic combination. Inparticular, the ˆ θ s have a double functionality; they are indices of thedifferent multi-omic interplay within and between organisms and they arehalt condition parameters of Algorithm 2. In our case, we are focusing onthe variable half-period lengths for all the pathways. It is discovered thatall the organisms, on overall experiments and for each pathway, show amedian θ of about 3.0 and an average comprised between 3.0 and 3.7 witha low standard deviation. Also, the max and min θ s values are very similarbetween organisms. These statistics depend on the different multi-omiccombinations ( MOC ) and on the presence of operon compression. Thetable of the ˆ θ s is shown in the Supplementary Material Section 8, whilein Section 2 the source code of Algorithm 1 is provided. Algorithm 1
Multi-omic median window periodicity with CPD
Require:
A multi-omic signal: s [ n ] of length n and N the between organismsbins estimation function change-point-median-window ( s [ n ] , N ) ~θ ← N U LL ⊲
Array of enstimated half-period lenghts θ s for k ← n do θ ← ⊲ Index of the current window. d k ← | s [ k ] − s [ k + 1] | ~m k ← [ d k ] ⊲ Trace the multi-omic variation if k + 1 < n thenwhile d k ≤ median ( ~m k ) do ⊲ Change-point detection if k+2 ¡ n-1 then d k ← | d k − s [ k + 2] | k ← k + 1 ~m k ← [ ~m k ⊕ d k ] end if θ ← θ + 1 end whileend if ~m k ← N U LL~θ ← [ ~θ ⊕ θ ] end forreturn ~θ end function .2.2 Multi-omic signal indices: osc s and osc k In this section, we introduce algorithm 2, which is a multi-level changepoint detector, capable of detecting multi-omic variations in relation tothe half-periods to which they belong. The algorithm returns in output osc s and osc k . In particular, osc s is an oscillation index which relates thelength of the half-periods (conditioned by the multi-omic variations) withtheir signal amplitude. Instead, osc k is an index describing the relativelength of the half-periods with respect to the signal length. The algorithmis divided into five steps, as they are summarised in Figure 2 - Box (b) and detailed as follows: I) Collect multi-omic adjacent change-points:
In this step al-gorithm 2 collects the adjacent multi-omics change-points on the signal s tracking the adjacent multi-omic class variations ~am as described inEquation 1: ~am [ i ] ← +1 if s[i] ¿ s[i+1]0 if s[i] = s[i+1]-1 otherwise ∀ i ∈ | s | . (1) II) Collect adjacent multi-omic variations:
For each signal s , algorithm 2 collects the adjacent multi-omic variations ~amv as the absolute difference between two adjacent multi-omic values,as described in Equation 2: ~amv [ i ] ← | s [ i ] − s [ i + 1] | ∀ i ∈ | s | . (2) III) Quantification of multi-omic variations per half-periods:
The multi-omic variations of ~amv [ i ] are summed from a starting point ( p es )to a halt point ( p eh ) at time step e . The procedure is repeated iterativelyfrom a p eh to another halt point until the length of the signal is reached.The results are given in output in the vector ~mvq , as shown in Equation3: ~mvq = (cid:20) p eh X i = p es amv [ i ] , p e +1 h X i = p eh amv [ i ] , . . . , p e + nh X i = p e + n − h amv [ i ] (cid:21) . (3)Therefore, the last p e − h become the new p es . In particular, the halt pointis computed dynamically with two stop conditions: the former is givenwhen a change point ~am [ i ] = 0, the latter is given when p eh − p es ≥ ˆ θ . Thenumber of the summed multi-omic variations for each step e are collectedin the vector ~mvl as shown in Equation 4. ~mvl = (cid:20) | p e + th − p e + ts | (cid:21) , ∀ t ∈ | s | . (4)The calculation of ~mvq and ~mvl is the central point for the developmentof our change point detector, therefore Step III has been described indetail in the pseudocode of Algorithm 2 and is illustrated in the exampleof Figure 2 for an halt condition.
IV) Computation of osc s : The two vectors ~mvq and the ~mvl have the same length d , where at most ≤ | s | −
1. The first one traces the quantified multi-omic variation foreach half-period. The second one traces the window lengths in which thevariations are computed. Thus, the oscillation index is defined in Equation5 as the product of the two vectors with respect to the sum of ~mvl for the N bins: osc s = P di =1 ~mvq [ i ] · ~mvl [ i ]( | N | − P d ~mvl . (5)The osc s is defined in the interval [0 , osc s areindices of the signal oscillations. In order to give a proof of the algorithmcorrectness, we prove the following theorem and some associated corol-laries, as deepened in the Supplementary Materials Section 3. TheoremI : Algorithm 2 gives in output an oscillation index osc s equal to 1 if andonly if ( ⇔ ) the observed signal presents a perfect oscillation. V) Computation of osc k : Algorithm 2 computes, also, the oscillation index osc k in Equation 6: osc k = | ~mvl || s | . (6)For each signal, this index represents a relation between the length of s and the number of periods along the signal, described as the cardinality of ~mvl . As we will see, osc k ∈ [0 ,
1] remains defined in a certain interval anddescribes some interesting relations in the analysis of the pathways phasesynchronisations (Section 3). In the Section 5 of Supplementary Materialwe provide the source code related to Algorithm 2 in order to compute osc s and osc k . The robustness of Algorithm 2 was tested by defining two types of pertur-bations. Without loss of generality, we assumed that the perturbations aredefined by a random distribution with zero mean and unit variance. As aconsequence, the first type of perturbation applied to the discrete signal s is a stochastic additive noise. We decided to add the 5% of the generatednoise, in the following way: s [ i ] + ( N (0 , ∗ . ∀ i ∈ [1 , n ]. The secondtype of perturbation consists of random shuffling the elements of the orig-inal signal, thus testing the importance of the information deriving fromthe gene order. The t-test p-value of the obtained osc s on random shuffleddistributions is equal to 0 . . e −
16. We selected a subset of original signalswith at least 70% of significant oscillating multi-omics. This means thatwe selected only the signals with oscillation index osc s ≥ φ , with φ = 0 . osc s . In Figure 3, the PDF of the original signal oscillation index osc s (solid line) against the perturbed ones (dashed lines) are shown. In par-ticular, we can observe that the osc s of the original signals remains intothe interval from 0.7 to 1, while the noisy and shuffled signals interceptan interval from 0.4 to 1, thickening the area of interest (Figure 3 - ∗ , ∗ ) lgorithm 2 : Oscillation indices: osc s and osc k . The procedure is dividedinto 5 steps as described in Section 2.2.2. Here, in pseudocode, Step III inrelation with the other steps is shown. Require:
A multi-omic signal: s [ n ] of length n and estimated ˆ θ function compute- ~mvq -and- ~mvl ( s [ n ] , ˆ θ ) ~am ← Step I ⊲ Collect multi-omic adjacent change-points. ~amv ← Step II ⊲ Collect adjacent multi-omic variations. p s ← p h ← j ← for e ← n do ⊲ Change-point detection clauses: c , c c ← am [ e ] < ∧ am [ e − > c ← am [ e ] > ∧ am [ e − < if c ∨ c then p eh ← e~mvq [ j ] ← P p eh i = p es amv [ i ] ~mvl [ j ] ← | p eh − p es | j ← j + 1 p es ← e else ⊲ No-change-point detection clauses: c , c c ← am [ e ] ≤ ∧ am [ e − ≤ c ← am [ e ] ≥ ∧ am [ e − ≥ if c ∨ c thenif ( e − p es ) < ˆ θ then e ← e + 1 else p eh ← e~mvq [ j ] ← P p eh i = p es amv [ i ] ~mvl [ j ] ← | p e + th − p e + ts | j ← j + 1 p es ← e end ifend ifend ifend forreturn ( ~mvq, ~mvl ) end function osc s ← Step IV ⊲ Compute the oscillation index osc s . osc k ← Step V ⊲ Compute the oscillation index osc k .10 o lower values than those defining the original area (Figure 3 - ∗ ). Notethat Algorithm 1 and Algorithm 2 have linear complexity O(n) over thesignal length n . As expected, small variations in multi-omic values or arandom arrangement clearly lower the oscillation index osc s . The multi-omic oscillation indices: osc s and osc k , for all the pathways ∀ P j ∈ O on the whole collection of COLOMBOS v3.0 condition contrasts( CC s) are computed. We summarised the pipeline of this section in Fig-ure 4. Having set φ = 0 .
8, the pathways are separated from the othersby splitting those with an oscillation index osc s < φ from those with osc s ≥ φ . A binary function a on P j is designed defining the pathwayswith oscillatory behaviours greater than φ as active pathways ( a ( P x ) = 1),and those less than the threshold as inactive pathways a ( P y ) = 0, with x = y . The active ones are sets of pathways with very relevant oscillatorybehaviours. Our hypothesis is that there is a network-level synchroni-sation only if the r pathways are all active in the same i-th experiment a exp i := { P ,i , P ,i , . . . , P r,i } . On the other hand, the asynchronised path-ways are those that, in the same experiment, are inactive ( P j,i a exp i ).For each experiment, the a exp i are grouped with 3 levels of functionalgranularities, following their KEGG orthology ( KO ) classifications, byKEGG pathway names ( KO Level 1 ), KEGG molecular network func-tionalities (
KO Level 2 ) and KEGG maps (
KO Level 3 ). Without lossof generalisation, the rows of CC s that represent the same within-studiesmicroarray replications are merged. In this way it is possible to quantifythe presence of oscillatory networks on the whole microarray experimentand not only on one of its replications. The next step in the pipelineconsists of an analysis within and between-organisms of the co-occurrencematrices, through the 3 KO levels, in order to understand if the syn-chronised pathways appear as common scheme overall the experiments( a exp i ∩ a exp j , ∀ i = j ) and to what extent (see also Figure 4). Note thatthe between-organisms cardinality of the experiments is not homogeneousand it depends on the collection of CC s provided by COLOMBOS v3.0.Thus, in Figure 5, box b , in order to carry on the information about thesynchronisation as much as possible, under the heatmap, the cardinalityof the experiments ( EC ) and their relative representation percentage onthe between-organisms intersections scheme ( EE ) are underlined. Thecomplete co-occurrence matrices with a fixed threshold to φ = 0 .
8, theircircuital intersections, for each organism and for every multi-omic com-binations are provided in the Supplementary Materials - Section 7. Atool capable to visualise these scheme varying the threshold and the otherparameters is provided as Supplementary Material Section 7.
The dataset analysed in this work is composed of 2.830.722 multi-omicsignals of 11 different bacteria on thousands of environmental experi-ments The bacteria included in the study are:
Bacillus cereus (ATCC bce ], Bacillus subtilis (168) [KEGG ID: bsu ], Bac-teroides thetaiotaomicron (VPI-5482) [KEGG ID: bth ], Clostridium ace-tobutylicum (ATCC 824) [KEGG ID: cac ], Campylobacter jejuni (NCTC11168) [KEGG ID: cje ], Escherichia coli (K-12 MG1655) [KEGG ID: eco ], Helicobacter pylori (26695) [KEGG ID: hpy ], Mycobacterium tuberculosis (H37Rv) [KEGG ID: mtu ], Pseudomonas aeruginosa (PAO1) [KEGG ID: pae ], Sinorhizobium meliloti sme ] and
Salmonella en-terica (serovar Typhimurium LT2) [KEGG ID: stm ]. We define the set ofthese organisms as O := { bce, bsu, bth, cac, cje, eco, hpy, mtu, pae, sme, stm } . The multi-omicsignals were integrated for every organism ( ∀ O ) and for every P j follow-ing the combinations of single omic values ~sv j,i with and without operoncompression. The interplay between these omic layers is described by[2]. In particular, these multi-omic values ( ~mv j,i ) are combinations ofdynamic ~sv j,i and static ~sv j,i or only a combination of static ~sv j,i . Thestatic ~sv j,i are the CAI (genomic layer) and the MW (proteomic layer).The ~sv j,i of the CAI was computed as described by [32]. The MW wascomputed considering the molecular weight of the amino-acidic compo-sition of each protein in P j . These values are called static because theydo not change when perturbations occur. The dynamic ~sv j,i were rep-resented as ‘condition contrast’ ( CC ) and represented the mRNA ex-pression changes between microarray experiments. From a certain pointof view, the CC is a glue between the two static layers. The datafor the condition contrast were downloaded from the COLOMBOS v3.0dataset and they were already normalised within and between exper-iments by the creators of the dataset [22]. For this reason, throughthe CC was possible to compare the between organisms signals s ∀ O based on multi-platform experimental setups without losing generalisa-tion. The four multi-omic combinations considered in this work were: MOC := { ( CC, CAI ) , ( CC, MW ) , ( CAI, MW ) , ( CAI, CC, MW ) } .The molecular networks pathway information and their KEGG Orthol-ogy are extracted from KEGG through a REST service [16, 21, 17]. Theinformation about the operons localisation was determined through theOperonDB dataset [24]. The order of the genes was obtained through theNCBI dataset [6] and aligned to the KEGG microbial genome informa-tion. The whole dataset, with their respective labels, is provided in an Rdata format in the Supplementary Materials Section 6. The phylogenetictree is reconstructed considering the NCBI taxonomy dataset [10]. I) Multi-omic signal oscillations : The mean absolute error (MAE)is computed in order to quantify the distance of osc s obtained from theoriginal signals and the noised/perturbed ones (see also SupplementaryMaterial - Section 4) [31] . The two types of perturbations adopted hereare the same described in paragraph 2.2.3. MAE is computed for all the s ∈ O ( see Section 2.4) with osc s ≥ φ with φ = 0 .
7. Algorithm 2 com-putes the osc s of the original signal and of the noisy one in the sameway. A cutoff is applied to the wavelength, considering that the biological eaning of sequences makes sense for signals with n ≥
6. Nevertheless,very long signals present low oscillation indices with respect to medianwavelengths, thus, respecting the short memory property of [30]. Withthese constraints, we are still considering the 95% of the pathways and63% of pathways with operon compression. The oscillation index osc s iscomputed on signals with and without operon compression consideringall the possible MOC separately. The MAE between the original signalsand the perturbed ones is shown in Table 1. As it is shown, the distancebetween the noisy signals and the original ones is more pronounced consid-ering the combination of CAI, molecular weight and condition contrasts(bold cells). Instead, the distance between the random shuffled signalsand the original ones is more pronounced in the combination of CAI andMW (bold cells). In this case, the involved omics are both static anddeeply related to the gene order. In these analyses, it is highlighted thatthe multi-omic signals preserve the oscillation behaviours, described byour oscillation indices, representing effectively recurrent patterns presentin nature. As it is shown also in Figure 3), slight or massive multi-omicvariations lower dramatically the oscillation index osc s . II) Network-level synchronisations : The synchronised signals, foreach experiment, could give a meaningful picture of the network-level in-terplay of multi-omics. With the methodology described in section 2.3,important clues to support the hypothesis that there are groups of path-ways/oscillatory networks in synchronisation are provided. For example,in Figure 4, the most frequent network-level interplay based on multi-omicsignals
CAI-MW-CC for the functional feature of
KO Level 2 is shown.The intersections reported in Figure 4 involve the pathway signals of all 11bacteria, shaping the general behavior of the dynamics of oscillations atthe network level and outlining a similar response to several experimentalperturbations. According to [11], if we look at, for example, Figure 5 box (a-b) we can see the central role of
Carbohydrate Metabolism (CMT) . Inparticular, the oscillatory networks belonging to the CMT class are insynchronisation in practically all the experiments and all the organisms(Figure 4 - see CMT row). Under the dictates of evolutionary pressure,in Figure 5 box (a-b) , it is possible to see the percentages of reciprocalinfluence that the signals have in synchronisation also for other importantfunctions:
Drug Resistance (DRA),
Cellular Motility (CMY), etc. More-over, we investigated the behaviour of the oscillation index osc k computedfor each organism for each pathway with KO level 1 (Figure 5, box c ).The distributions of the osc k were studied by separating the synchronisedsignals (gray boxplots) to those with a lower oscillation behaviour (whiteboxplots). In this case, the oscillation index osc k of the synchronised sig-nals is lower than in those not synchronised even if they show a higheroscillation index osc s . It is evident that the two distributions are sepa-rable and preserved along with the organisms, except in rare cases wheresome few oscillatory networks seem to overlap the not synchronised ones.In these particular cases, we observed a higher osc k for these functionalclasses: Drug Resistance, Cell Motility, Xenobiotic Metabolism, DNA Re-pair, Amino Acid Metabolism . Instead, although the high values of osc s in Table 1, the distances between the osc k are considerably reduced onnetwork-level oscillations of signals with operon compressions. This evi- MOC ): osc s osc s vs noise osc s vs shuffle Signals set-size S-s % CAI-MW-CC 0.81
CAI-MW
204 on 779 26%MAE with all the possible
MOC with operon compression: osc s osc s vs noise osc s vs shuffle Signals set-size S-s % CAI-MW-CC 0.93
CAI-MW
73 on 328 22%Table 1: : Table of distances between the original osc s and the perturbedones. In Table 1 the mean absolute error (MAE) between the original signaloscillation indices ( osc s ≥ φ , with φ = 0 .
7) and the perturbed ones are shown.The additive noise and the random shuffle are the two perturbations listed as noise and shuffle . The distances are computed for each
MOC . In
Signals set-size , the size of the signals with the in-text described constraints is reported.Despite the set-size that remains equal for each organism, every pathway andevery MOC, the number of multi-omic operon compressed signals is differentfrom the original ones because some pathways during the operon compressionbecame too short to be considered as biological sequences. dence could allow us to assume that network level synchronisations have asignificant oscillation in amplitude with longer periods due to the under-lying synchronisation mechanism that can be derived from the regulatoryand control circuits. Conversely, a high osc k could indicate that networksshow a rapid biological response to external stress. However, these hy-potheses will have to be explored in future research work. In this article, two new signal metrics have been introduced to studymulti-omic oscillations at the network level and on single-snapshot exper-iments, defined in particular as condition contrasts. From the analysis ofthese metrics, it is possible to provide interesting clues about the char-acteristics of the signal proving that multi-omic network-level oscillationsexist in nature. Furthermore, clear evidence has been provided that theseoscillations could show a synchronised interaction in response to perturba-tions. Multi-omic signal analyses have been extended to multiple organ-isms and related to their phylogenetic tree to provide better comparisons.Algorithmic methodologies are provided and accompanied by a tool onsupplementary material and on the online repository. This work could be seful in the fields of synthetic biology and systems biology with the goalof mapping the organism regulation and control circuits, for example, incase of lack of time series experiments. Furthermore, there is a growingamount of whole-genome and longitudinal data and these metrics answerto the need to detect complex patterns of changes. References [1] Amariei, C. et al. (2014). Quantifying periodicity in omics data.
Fron-tiers in cell and developmental biology , , 40.[2] Angione, C. et al. (2016). Multiplex methods provide effective integra-tion of multi-omic data in genome-scale models. BMC bioinformatics , (4), 83.[3] Arenas, A. et al. (2008). Synchronization in complex networks. Physicsreports , (3), 93–153.[4] Bardozzo, F. et al. (2015). Multi omic oscillations in bacterial path-ways. In , pages 1–8. IEEE.[5] Bardozzo, F. et al. (2018). A study on multi-omic oscillations in es-cherichia coli metabolic networks. BMC bioinformatics , (7), 194.[6] Barrett, T. et al. (2008). Ncbi geo: archive for high-throughput func-tional genomic data. Nucleic acids research , (suppl 1), D885–D890.[7] Boukouvalas, A. et al. (2019). Osconet: Inferring oscillatory genenetworks. bioRxiv , page 600049.[8] Darkhovski, B. S. (1994). Nonparametric methods in change-pointproblems: A general approach and some concrete algorithms. LectureNotes-Monograph Series , pages 99–107.[9] Elowitz, M. B. and Leibler, S. (2000). A synthetic oscillatory networkof transcriptional regulators.
Nature , (6767), 335–338.[10] Federhen, S. (2011). The ncbi taxonomy database. Nucleic acidsresearch , (D1), D136–D143.[11] Fuhrer, T. et al. (2005). Experimental identification and quantifi-cation of glucose metabolism in seven bacterial species. Journal ofbacteriology , (5), 1581–1590.[12] Golden, S. S. (2003). Timekeeping in bacteria: the cyanobacterialcircadian clock. Current opinion in microbiology , (6), 535–540.[13] Govindarajan, S. et al. (2012). Compartmentalization and spatiotem-poral organization of macromolecules in bacteria. FEMS microbiologyreviews , (5), 1005–1022.
14] Guantes, R. et al. (2010). Trade-offs and noise tolerance in signaldetection by genetic circuits.
PLoS One , (8), e12314.[15] Hawe, J. S. et al. (2019). Inferring interaction networks from multi-omics data. Frontiers in genetics , , 535.[16] Kanehisa, M. and Goto, S. (2000). Kegg: kyoto encyclopedia of genesand genomes. Nucleic acids research , (1), 27–30.[17] Kanehisa, M. et al. (2013). Data, information, knowledge and prin-ciple: back to metabolism in kegg. Nucleic acids research , (D1),D199–D205.[18] Lenz, P. and Søgaard-Andersen, L. (2011). Temporal and spatialoscillations in bacteria. Nature Reviews Microbiology , (8), 565.[19] Levine, J. H. et al. (2013). Functional roles of pulsing in geneticcircuits. Science , (6163), 1193–1200.[20] Lutkenhaus, J. (2008). Min oscillation in bacteria. In Cellular Oscil-latory Mechanisms , pages 49–61. Springer.[21] Mao, X. et al. (2005). Automated genome annotation and pathwayidentification using the kegg orthology (ko) as a controlled vocabulary.
Bioinformatics , (19), 3787–3793.[22] Meysman, P. et al. (2013). Colombos v2. 0: an ever expanding collec-tion of bacterial expression compendia. Nucleic acids research , (D1),D649–D653.[23] Michalodimitrakis, K. and Isalan, M. (2008). Engineering prokaryoticgene circuits. FEMS microbiology reviews , (1), 27–37.[24] Pertea, M. et al. (2008). Operondb: a comprehensive databaseof predicted operons in microbial genomes. Nucleic acids research , (suppl 1), D479–D482.[25] Potvin-Trottier, L. et al. (2016). Synchronous long-term oscillationsin a synthetic gene circuit. Nature , (7626), 514–517.[26] Prokop, A. and Csuk´as, B. (2013). Systems biology: integrative biol-ogy and simulation tools , volume 1. Springer Science & Business Media.[27] R (2018).
R: A Language and Environment for Statistical Computing .R Foundation for Statistical Computing, Vienna, Austria.[28] Razali, N. M. et al. (2011). Power comparisons of shapiro-wilk,kolmogorov-smirnov, lilliefors and anderson-darling tests.
Journal ofstatistical modeling and analytics , (1), 21–33.[29] Rocha, E. P. (2003). Dna repeats lead to the accelerated loss of geneorder in bacteria. TRENDS in Genetics , (11), 600–603.
30] Ron, D. et al. (1996). The power of amnesia: Learning probabilisticautomata with variable memory length.
Machine learning , (2-3),117–149.[31] Ruckdeschel, P. and Kohl, M. (2018). distrmod—an s4-class basedpackage for statistical models. Robust Inference in Generalized LinearModels , , 159.[32] Sharp, P. M. and Li, W.-H. (1987). The codon adaptation index-ameasure of directional synonymous codon usage bias, and its potentialapplications. Nucleic acids research , (3), 1281–1295.[33] Shis, D. L. et al. (2018). Dynamics of bacterial gene regulatory net-works. Annual review of biophysics , , 447–467.[34] Siegmund, D. (2013). Change-points: from sequential detection tobiology and back. Sequential analysis , (1), 2–14.[35] Singh, A. and Soltani, M. (2013). Quantifying intrinsic and extrin-sic variability in stochastic gene expression models. Plos one , (12),e84301.[36] Tamames, J. (2001). Evolution of gene order conservation in prokary-otes. Genome biology , (6), research0020–1.[37] Unakafov, A. and Keller, K. (2018). Change-point detection usingthe conditional entropy of ordinal patterns. Entropy , (9), 709.[38] Venables, W. N. and Ripley, B. D. (2013). Modern applied statisticswith S-PLUS . Springer Science & Business Media.[39] Wang, J. D. and Levin, P. A. (2009). Metabolism, cell growth andthe bacterial cell cycle.
Nature Reviews Microbiology , (11), 822–827. Box(a) . In this case, a half-period is recognised with a length of θ = 3. In general,Algorithm 1 collects in a vector ~θ all the θ s computed along the signal. The ˆ θ are computed between organisms as the max of the half-period median lengthsof ~θ . The ˆ θ s have the same value in all the organisms but vary depending onthe multi-omic combination considered with or without operon compression (seeSection 2.2.1. Next, Algorithm 2 is divided into a 5-step pipeline as describedin Section 2.2.2. In Box (b) , these steps are summarised. Algorithm 2 takes ininput a signal and its associated ˆ θ . Then, in Step I-II both adjacent change-points and multi-omic variations are computed along the signals. In
Step III the adjacent multi-omic variations ( amv [ p e ] , amv [ p e +1 ] , . . . ]) are added togetherfor each window until a halt condition occurs. In Box (b)-Step III , the haltcondition is represented by p e + i − p e = 3 greater than ˆ θ . Finally, in Step IV-V ,Algorithm 2 gives in output the two metrics osc s and osc k . All the variablesshown in the Figure are defined and described in the respective sections.. 18igure 3: In this Figure three probability density plots (PDF) are shown. Thearea under the three PDFs are indicated with the asterisks ∗ , ∗ , ∗ . The valuesof the density functions are on the y-axes. The oscillation index osc s are on thex-axes. It is plotted the PDF of the original multi-omic signals with osc s ≥ φ with φ =0.7 (solid line). As it is proved, the underlying area ∗ is comprised intothe interval of osc s from 0.7 to the upper bound of 1.0. Then, this subset oforiginal signals ( osc s ≥ .
7) is perturbed in two ways. The PDF of the originalsignals perturbed with noise are shown as dashed lines and the underlying areais indicated with ∗ . Those random shuffled are shown in 3 dots dashed linesand the underlying area is indicated with ∗ . It is possible to observe that, whenthe original signals are perturbed, the PDFs area ∗ and ∗ move mostly on osc s values comprised between 0.4 and 0.7. Thus, the perturbed signals lower the osc s proving the Algorithm 2 correctness, in terms of robustness and sensitivityanalysis. The osc s is computed for all the signals with a length of at least 6.19igure 4: In Figure 4 - Box (a-b) the pipeline of the oscillation extractionis shown. In particular,in Box (a) multi-omic networks are extracted, whilein Box (b) the oscillations with a high score ( osc s ≥ φ, φ = 0 .
8) are selectedas active pathways and considered as synchronised. In Box (c) , for each i -th experiment the synchronised active pathways are superimposed on those ofthe j -th experiment and their intersections are computed (common activationscheme) with respect to KO Level 1 (pathways with the same functions). Inparticular, for the entire collection of experiments, intersections are calculatedwithin organisms and between organisms. In Box (d) an example of between or-ganisms pathway-level synchronisation scheme. Their intersection schemes arereorganised by their
KO Level 2 functionalities as follows: DRA:
Drug resis-tance: Antimicrobial , CMY:
Cell motility ,TR :
Translation , MAA :
Metabolismof other amino acids , MTR:
Membrane transport , FSD:
Folding, sorting anddegradation , RR:
Replication and repair , EM:
Energy metabolism , LM:
Lipidmetabolism , AAM:
Amino acid metabolism , MCV:
Metabolism of cofactors andvitamins , CMT :
Carbohydrate metabolism . The black dots represent the activa-tion scheme that is shared across the between-organism experiments. For exam-ple, in the first column we can find 487 between-organism experiments whose theturned-on-simultaneous functionalities are FSD,RR,EM,LM,AAM,MCV,CMT.This scheme is the most frequent in the between-organism experiments as sug-gested by the horizontal bars. The coverage percentage of the first-columnscheme is very significant and overlays with at least the 72% of the detectedschemes. 20igure 5: In this Figure, in box a , a phylogenetic tree is projected onto theheatmap between-organims and KO Level 2 functionalities. In box b the rowlabels are specified as in Figure 4, while the column labels are the organismslisted in Section 2.4. Under the heatmap the cardinality of the experiments( EC ) and their relative effort ( EE ) are shown in order to represent the relativeinfluence of the oscillatory networks in phase synchronisations of Figure 4. Forexample, E. coli (eco) influences the 46% of the intersections of Figure 4. InFigure 5, in box c , between-organisms boxplot comparisons with respect to the osc k distributions are shown. The phase synchronised oscillatory networks arerepresented by the gray boxplots and the inactive ones by the white boxplots.The plotted values represent their average osc k value.21value.21