[PDF] Structural representations of DNA regulatory substrates can enhance sequence-based algorithms by associating functional sequence variants

Abstract

The nucleotide sequence representation of DNA can be inadequate for resolving protein-DNA binding sites and regulatory substrates, such as those involved in gene expression and horizontal gene transfer. Considering that sequence-like representations are algorithmically very useful, here we fused over 60 currently available DNA physicochemical and conformational variables into compact structural representations that can encode single DNA binding sites to whole regulatory regions. We find that the main structural components reflect key properties of protein-DNA interactions and can be condensed to the amount of information found in a single nucleotide position. The most accurate structural representations compress functional DNA sequence variants by 30% to 50%, as each instance encodes from tens to thousands of sequences. We show that a structural distance function discriminates among groups of DNA substrates more accurately than nucleotide sequence-based metrics. As this opens up a variety of implementation possibilities, we develop and test a distance-based alignment algorithm, demonstrating the potential of using the structural representations to enhance sequence-based algorithms. Due to the bias of most current bioinformatic methods to nucleotide sequence representations, it is possible that considerable performance increases might still be achievable with such solutions.

Full PDF

SStructural representations of DNA regulatorysubstrates can enhance sequence-based algorithmsby associating functional sequence variants

Jan ZrimecChalmers University of TechnologyGothenburg, Sweden [email protected]

13 June 2020

Abstract

The nucleotide sequence representation of DNA can be inadequate forresolving protein-DNA binding sites and regulatory substrates, such asthose involved in gene expression and horizontal gene transfer. Consider-ing that sequence-like representations are algorithmically very useful, herewe fused over 60 currently available DNA physicochemical and conforma-tional variables into compact structural representations that can encodesingle DNA binding sites to whole regulatory regions. We ﬁnd that themain structural components reﬂect key properties of protein-DNA interac-tions and can be condensed to the amount of information found in a singlenucleotide position. The most accurate structural representations com-press functional DNA sequence variants by 30% to 50%, as each instanceencodes from tens to thousands of sequences. We show that a structuraldistance function discriminates among groups of DNA substrates moreaccurately than nucleotide sequence-based metrics. As this opens up a va-riety of implementation possibilities, we develop and test a distance-basedalignment algorithm, demonstrating the potential of using the structuralrepresentations to enhance sequence-based algorithms. Due to the biasof most current bioinformatic methods to nucleotide sequence representa-tions, it is possible that considerable performance increases might still beachievable with such solutions.

Besides the popular yet simplistic representation of DNA as a polymer chain of4 diﬀerent nucleotide bases A, C, G and T, the molecule in its double strandeddsDNA form possesses certain conformational and physicochemical properties.These are especially important in relation to interactions of the DNA withproteins, which drive many essential cellular processes [1, 2, 3]. These include1 a r X i v : . [ q - b i o . GN ] J u l ut are not limited to transcription [1], replication [4] as well as horizontal genetransfer (HGT) [3, 5]. The processes are commonly initiated and regulated atspeciﬁc DNA regulatory regions, which are the substrates for protein bindingand enzymatic activity, and include promoters [6], origins of replication [4] andorigins of transfer (transfer regions, plasmid conjugation in HGT) [7]. TheDNA substrates contain either one or multiple protein binding sites, such asfor instance transcription factor binding sites (TFBS) in promoters, as wellas additional sequence and structure context that is related to protein-DNArecognition and binding preceding the main enzymatic processing [8, 9, 3].As such, the nucleotide sequence representation is often not suﬃciently de-tailed for discriminating DNA regulatory substrates, as it encodes speciﬁc con-served structural properties that are not directly obvious from the mere sequencecontext (Table 1) [3, 10, 11]. To uncover the encoded structural properties, manyDNA structure models and prediction tools have been developed, including (i)DNA thermodynamic stability and its potential for destabilization and meltingbubble formation, modelled with the nearest neighbor (NN) framework [12] andthermally induced duplex destabilization [2], (ii) DNA major and minor grooveproperties that describe their size and thus accessibility by proteins, which arethe focus of the DNAshape and ORChID2 models [13, 14] among others, (iii)DNA intrinsic curvature and ﬂexibility, modeled by measuring DNAzeI digestionproﬁles [15] and DNA persistence lengths [16], (vi) DNA twisting and supercoil-ing, captured by multiple DNA conformational models and variables [17, 18, 19],(v) diﬀerences in DNA spacing and orientation between binding and enzymaticsites, described for instance by DNA helical repeats [16], and (vi) the propen-sity for transitions between DNA forms, such as from B-DNA to A-DNA or toZ-DNA given with respective variables [20, 21, 22, 23]. However, due to thelarge variety and diﬀerences among the DNA structural models and variables,it is not simple to choose among them or integrate them within existing DNAbioinformatic frameworks. Current studies thus focus on speciﬁc groups of DNAstructural properties and do not span the whole possible structural repertoire[14, 3, 24, 25, 26, 4]. A common DNA structural representation could thus helpcircumvent these problems and facilitate the development of novel, improvedDNA algorithms.Hence, the aim of the present study was to analyse and engineer novel DNAstructural representations for use with bioinformatic frameworks, such as se-quence alignment and motif ﬁnding algorithms, and to compare them with thestandard nucleotide sequence-based methods. First, based on DNA physico-chemical and conformational properties that are involved in DNA-protein in-teractions and using dimensionality reduction techniques we constructed diﬀer-ent structural DNA representations. To explore the smallest amount of struc-tural information that could suﬃciently describe functional DNA regions wecompressed the DNA representations using clustering algorithms to encompassfrom an estimated 2 to 8 bits of information. Next, we explored the capabilityof each representation to encode multiple DNA sequence variants using TFBSmotif datasets. To facilitate the usefulness of the representations with existingDNA algorithms via comparison of the (dis)similarity of encoded sequences, we2able 1: Overview of DNA structural properties and representative variables inprotein-DNA interaction. DNA structural prop-erties Facilitated protein-DNA interactions Representative struc-tural variables ReferencesDNA stability, desta-bilizations and meltingbubble formation Enzymatic processingof substrates, e.g. nick-ing of transfer regions,leads to secondarystructure formation Duplex stability, Ther-mally induced duplexdestabilization (TIDD) [12, 27,2, 28]Major and minorgroove properties Readout of chemicalinformation, e.g. TFsin promoters DNAShape, ORChID2 [1, 14,13, 6]Intrinsically curved orﬂexible regions Binding and topologi-cal changes, e.g. IHFbinding in promoters DNAzeI cleavage fre-quency, Persistencelength [15, 16,29]DNA twist and super-coiling Topological changesrecognized by proteins,e.g. histones, andaﬀect other properties Twist and other con-formational variables [17, 18,19, 6]Diﬀerences in DNAspacing and orienta-tion in binding andenzymatic sites Aﬀect binding withmultiple contact pointsand protein complexformation Helical repeats [30, 16,6]Propensity for tran-sitions between DNAforms B-DNA, A-DNA, Z-DNA Aﬀect overall featuresrecognized by proteinsand their accessibility B-A and B-Z transitionpropensities [20, 23,21, 22] developed a structural distance function and tested it on datasets of transferand promoter regions. Finally, we explored the applicability of the structuralrepresentations within a sequence alignment framework and discussed ideas forhybrid sequence and structure-based approaches speciﬁcally for analysing regu-latory DNA substrates.

We obtained 64 published DNA structure models that were shown to be in-formative for analysis of DNA-protein interactions [5] (Table 1). These modelswere based on nearest neighbor dinucleotide (56), trinucleotide (4) and pentanu-cleotide (4) models and included physicochemical and conformational proper-ties and properties attributed to DNA-protein interactions. Of these, 44 modelswere derived experimentally and the rest computationally based on experimen-tal data. The set included the widely used models DNA shape [1, 14], Orchid[13], DNA stability and thermally induced duplex destabilization [12, 2].A dataset of transcription factor binding site (TFBS) motifs was obtainedfrom the Jaspar database [31] (sites ﬁle) and ﬁltered to contain only sequenceswith { A,C,G,T } characters, of equal length as the median length in each motif3roup but at least 9 bp long and containing at least 10 motif sequence variants.A published dataset of transfer regions from 4 mobility (Mob) groups [3] wasused as positive examples and expanded with 64 negative examples. Negativeexample sequences were selected randomly from a region 200 to 800 bp aroundthe enzymatic nicking sites [3], thus containing diﬀerent non-regulatory codingand non-coding regions, and low sequence similarity was veriﬁed among thesequences (p-distance > To develop DNA structural representations we computed the structural proper-ties of all permutations of k-mers 3, 5, 7 and 9 bp in length (1 to 4 neighboringregions around a speciﬁc nucleotide) and performed dimensionality reductionfollowed by clustering (Fig. 1A). The structurally deﬁned groups of k-mers weretermed s-mers. Structural properties were calculated in windows of 5 bp or atdefault values as described [2, 3]. Dimensionality reduction and analysis of themain components of variance was performed using Principal Component Analy-sis (PCA). The k-means clustering algorithm was used (Matlab), where clusterswith a lowest total sum of distances were chosen from 10 runs of up to 1000iterations at default settings. The optimal amount of clusters was analysed withthe Elbow, average Silhouette [34] and GAP [35] methods with Matlab func-tion evalclusters at default settings with triplicate runs. The tested numbers ofclusters included 4, 8, 16, 32, 64, 128 and 256 clusters, chosen considering that(i) positions with 2 x possible states (clusters) can carry a maximum of x bitsof information [36], (ii) 1 bp of DNA carries up to 2 bits of information (iii) upto 4 bp neighboring regions deﬁned the structural eﬀect in the s-mers, and (iv)s-mers are overlapping, meaning information is distributed among all of them.Thus, up to 8 bits of information (256 clusters) was expected per s-mer (andless with decreasing s-mer size).For a DNA substrate, the length of the structural representation was equalto the length of the nucleotide sequence minus the leftover nucleotides at theborders equal to ( s − /

2, due to the neighboring nucleotides in s-mers (Fig. 1B).The s-distance between two DNA substrates was the sum of squared Euclideandistances between the cluster centroids of all equally positioned s-mers in their4igure 1: Schematic depiction of the (A) construction and (B) usage of struc-tural representations. In a structural representation of a given DNA sequence,each central nucleotide position and its neighboring regions deﬁne a k-mer from3 to 9 bp in length, and are encoded as an s-mer with n structural dimensions(S. dim.) that can be deﬁned as a sequence of s-mer cluster centroids.structural representations of length n , s − distance = n (cid:88) i =1 d ( C i , C i ) , (1)where C i = ( c , c , ..., c k ) is the cluster centroid of the s-mer at position i ofthe ﬁrst sequence and C i = ( c , c , ..., c k ) is the cluster centroid of s-mer atposition i of the second sequence. For statistical hypothesis testing, the Python package Scipy v1.1.0 was usedwith default settings. To evaluate the explained variation, the coeﬃcient ofdetermination was deﬁned as R = 1 − SS Residual SS T otal , (2)where SS Residual is the within group residual sum of squares and SS T otal is thetotal sum of squares. The compression ratio was calculated as

Compression ratio = Num. unique k-mersNum. unique s-mers . (3)Precision and recall were deﬁned as P recision = T PT P + F P , (4)5 ecall = T PT P + F N , (5)where TP , FP and FN denote the number of true positive, false positive andfalse negative elements, respectively. The F -score was deﬁned as F − score = 2 · precision · recallprecision + recall . (6)The F -test was performed using permutational multivariate analysis of variancewith sequence bootstraps [37, 3]. The distribution of the F function under thenull hypothesis of no diﬀerences among group means was evaluated by perform-ing 1e4 bootstrap repetitions, with p -values calculated as p = Num. F Bootstrap ≥ F Total num. F Bootstrap . (7) We developed and tested a simple ungapped DNA sequence alignment frame-work (Algorithm 1) that ﬁnds the most similar segments to query sequencesin target sequences using a given distance function. The assessment of algo-rithm performance included (i) locating the transfer regions to within +/- 1bp of their known locations in the target plasmids and (ii) correct typing ofMob groups in the target plasmids. For this, true and false positive and neg-ative counts were obtained from the alignment tests by considering only thelowest scoring hit per alignment. A true or false positive value was assigned ifthe result was below a speciﬁed signiﬁcance cutoﬀ and corresponded or did notcorrespond, respectively, to the known value (region location or Mob group),and alternatively, a false or true negative value was assigned to results abovethe signiﬁcance cutoﬀ that corresponded or did not correspond, respectively,to the known value. The statistical signiﬁcance of distance scores (at p -valuecutoﬀs from 1e-6 to 1e-1) was evaluated using bootstrap permutations ( n = 1e6per sequence) of 10 randomly selected query sequences (Eq. 7) and a mappingfunction between the distance scores and p -values was then obtained by leastsquares curve ﬁtting (Matlab) to a second order polynomial function (distancescore of 0 corresponded to the theoretical limit of 1e-132) [5]. Algorithm 1:

Sequence alignment algorithm. input query set, target set; for i = 1 : size(query set) dofor j = 1 : size(target set) dofor k = 1 : length(target set(j)) do dist(i,j,k) = distance(query set(i),target set(j)(k)); endendendreturn min(dist(:,:)). .5 Software . Considering that the ﬁrst 4 neighboring nucleotides have the largest eﬀect on thestructural state of a given nucleotide base pair [2, 38], we designed nucleotide-position speciﬁc DNA structural representations that included the eﬀects of 1to 4 neighboring nucleotides, termed s-mers (Table 2: sizes 3, 5, 7 and 9 bp,Fig. 1). The s-mers were based on calculating up to 64 of the most widely usedDNA structural properties for all permutations of the corresponding equallysized k-mers (Table 2, Methods M1), followed by dimensionality reduction andclustering (Fig. 1A, Methods M2). The calculated DNA structural propertiesincluded physicochemical, conformational and protein-DNA binding variablesof experimental origin [5] (Table 1).Dimensionality reduction with Principal Component Analysis (PCA) yieldeda speciﬁc number of principal components (PC) with each s-mer size s that ex-plained over 99% of the data variance (Fig. 2). The amount of PCs increasedwith the s-mer size and with the number of initial structural variables from 14 to18 PCs, which was an over 3.5-fold decrease in the amount of variables requiredto describe almost all of the original information (Table 2). On average, 3, 6,9 and 17 components were required to explain over 60, 80, 90, and 99% of thevariance, respectively. The coeﬃcient of variation ( σ/µ ) of the ﬁrst 6 most infor-mative PCs across the s-mer sizes was below 0.637 and lowest with the ﬁrst andsixth components with 0.160 and 0.466, respectively, showing that these PCscarried similar structural information with each s . Analysis of PCA loadingsshowed that each PC mainly comprised a number of distinct structural variables,which enabled us to determine the key protein-DNA binding features deﬁnedby the 6 most informative components (in the order of decreasing importance):thermodynamic stability [12, 39], horizontal ﬂexibility [18, 15, 40, 41, 17, 14],torsional ﬂexibility [19, 42, 18, 17], conformational stability [12, 20, 19, 43, 16],major and minor groove accessibility [39, 42, 41, 18] and A-DNA to B-DNAtransition potential [20, 19, 41, 17]. Moreover, the top 20 sorted structural vari-ables contained 3 of the well known DNAshape functions [14, 1] and ORChID2[13], with nucleosome positioning (phase) [44] being the most important.Although the dimensionality reduced structural data was not expected toform strong clusters, we considered that a limit must exist to the resolution ofthe structural representations, above which there is no measurable inﬂuence on7able 2: Summary of s-mer construction. The groups of s-mers comprised allpermutations of nucleotide k-mers of 3 to 9 bps in length. Due to sequencelength constraints, not all DNA structural models could be computed with alls-mers. The amount of structural models used is given under ’Structures’. Thegiven amount of principal components (PC) describing over 99% of the varianceof the data was chosen to deﬁne each structural representation. S-mer size s Neighbors Permut. Structures PC > k Figure 2: The ﬁrst two dimensions (principal components, PC) of the structuralrepresentation with s-mer size s = 5 and number of clusters k = 16, where eachpoint represents a diﬀerent 5-mer nucleotide permutation { AAAAA, AAAAC,,TTTTT } , colors depict the clusters and black points denote the cluster cen-troids.the achievable computational accuracy. Additionally, we wanted to achieve anadditional level of compactness of the DNA codes as well as test the clusteringof the structural data points. Clustering was performed with the number ofclusters k varied between 4 and 256 clusters (2 to 8 bits), with the exception ofusing up to 32 clusters with 3-mers (Table 2, Methods M2). Standard clusterevaluation methods, including Elbow and Silhouette [34], showed that with adecreasing k , the overall accuracy of the clustered data representations decreasedcompared to using a higher number of clusters. At the highest k (256), theexplained variance ( R ) was over 80% with both s-mer sizes 5 and 7 ( s = 9not fully tested due to memory restrictions). With a decreasing number ofclusters, progressively larger clusters were obtained (Fig. 3) with a decreasein the percentage of explained variance down to 40% with k = 4 clusters,and similarly a decrease in the average Silhouette ratio. The cluster sizes were8pproximately normally distributed (Fig. 3), with the variation of cluster sizesincreasing with an increasing k , although the coeﬃcient of variation did notsurpass 0.52.Figure 3: The distributions of cluster sizes across the structural representationswith diﬀerent s-mer sizes s and number of clusters k . We next explored whether the s-mers could encode groups of conserved func-tional DNA motifs [8], and if our structural DNA representations could morecompactly but also more accurately encapsulate DNA motifs than bare k-mers.We used a dataset of 595 Jaspar [31] transcription factor binding site (TFBS)motifs comprising 1,296,654 unique DNA sequences from multiple model organ-isms (Methods M1), which contained at least 10 motifs in each group and wereat least 9 bp long. To test the capacity of the structural representations to en-code TFBS motifs we ﬁrst measured the compression ratio (Eq. 3), which wasthe ratio of the amount of unique s-mers versus the amount of unique k-mersobserved with a given TFBS motif. As a note, due to computational complexityand memory limitations, only certain parameter combinations could be tested.We observed an increase in the average compression ratio across motifs withan increasing s-mer size (Fig. 4), as it increased from 1.132 with s = 5 ( k =256) to 1.296 with s = 9 ( k = 256). Similarly and as expected, the averagecompression ratio increased with decreasing amount of clusters (Fig. 4), from1.132 with k = 256 ( s = 5) to 3.152 with k = 4 ( s = 3). The variation of thecompression ratio across the diﬀerent motifs was approximately constant (SDbetween 0.056 with s = 3, k = 4 and 0.092 with s = 7, k = 128). Moreover,we measured a signiﬁcant negative correlation (Pearson’s r , p -value < compression ratio and the TFBS sequence length increasing from-0.166 to -0.667 with an increasing s-mer size s = 3 ( k = 32) to s = 9 ( k =128), respectively, and up to -0.724 with a decreasing cluster size k (s = 3, k =9). Weak correlation (Pearson’s r , p -value < compression ratio and the number of unique sequences in a TFBS motif,similarly as above increasing from -0.091 to -0.311 with an increasing s = 3 ( k = 32) to s = 9 ( k = 128), respectively, and up to -0.382 with a decreasing k ( s = 3, k = 4). This suggested that the capacity for compression decreases withmore abundant DNA sequence space, such as with longer motifs or ones with amore diverse set of sequence variants.Figure 4: Compression ratios of Jaspar TFBS motifs obtained with diﬀerentstructural representations. Due to computation and memory limitations onlycertain parameter combinations of s-mer size s and number of clusters k wereanalysed. The black line denotes no compression.Since the structural representations indeed compressed the TFBS motifs upto 3-fold, we next explored how accurate the encodings were at describing diﬀer-ent functional motif variants. We selected the most sequence-abundant motif,the 18 bp Human MAFF motif (Jaspar: MA0495.1, class of Basic leucine zip-per factors) that contained 49,462 unique sequence variants. Using a randomlyselected 1% ( n = 495) subset of these sequence variants to deﬁne their s-mers,we measured how many of the remaining 99% of the motifs were described by(or rather, could be predicted from) these structural encodings. This meantreconstructing all the possible motif sequence variants from each structural rep-resentation instance, and gave an estimate of the encoding accuracy. The initial precision and recall (Methods M3) obtained without any encoding were 1 and0.01, respectively. Unsurprisingly, an inverse relation was observed between pre-cision and recall (Fig. 5). Precision was highest (0.838) with a low s-mer size( s = 3, k = 32) and decreased 6.5-fold with an increasing s (0.129, s = 9, k =256), whereas recall increased by 9% from 0.0102 to 0.0111 at the equal param-eter values, respectively. A reason for the decrease in precision was likely thatthe average amount of distinct sequence variants encapsulated by the diﬀerentstructural representations increased with an increasing s-mer size as well as witha decreasing k (Fig. 6). This related to an increasing coverage of the correctTFBS motifs (true positives) as well as an increasing number of variants not10n the given TFBS sequence set (false positives). However, it is also possiblethat the TFBS sequence set is incomplete and does not contain all the possiblefunctional sequence variants of that motif, meaning that the true positive andnegative space is unknown. With an 18 bp DNA sequence, such as is the lengthof the MAFF motif, we estimated that up to 6.9e10 sequence variants could ex-ist, whereas the set of sequence variants in the TFBS represented merely 7.2e-7of this diversity and was likely undersampled. This suggested that for furtheranalysis, datasets with a more complete coverage of the functional sequencespace should be used.Figure 5: Precision versus recall when comparing an encoding of 1% of sequencevariants of the Human MAFF motif (Jaspar: MA0495.1) with the remaining99% of sequence variants, at diﬀerent structural representation parameters s-mer size s and number of clusters k . The precision and recall obtained withnucleotide k-mers were 1 and 0.01, respectively.Figure 6: Amount of unique DNA sequence variants encoded by structuralrepresentations of the 18 bp long Human MAFF motif (Jaspar: MA0495.1),results with 1% ( n = 100) of motif variants shown.11 .3 A structure-based distance function resolves regula-tory DNA more accurately than sequence-based ones To facilitate the comparison of two diﬀerent structural representations (Fig. 1:s-mer vectors), such as is done with nucleotide sequences using for instancethe p-distance (equal to the Hamming distance corrected for sequence length),we deﬁned a structural distance, termed s-distance, as the sum of squared Eu-clidean distances between s-mer cluster centroids of two representations (Eq.1). To test the s-distance and compare it to the p-distance, we used a datasetof whole DNA regulatory regions that control the initiation of DNA transfer inplasmid conjugation (origin-of-transfer regions) [7] (Methods 1). These transferregions are 220 bp long and comprise binding sites for the relaxase as well asthose for accessory proteins that regulate transfer. The dataset contained 64unique transfer sequences [3] from 4 distinct mobility groups (Mob, deﬁned byamino acid homology of the main transfer enzyme relaxase) [45], with approx-imately equal group sizes of 16 elements. We previously determined that theMob groups of these regions could be clearly discriminated based on 6 DNAphysicochemical and conformational properties [3]. In order to additionally testthe possibility of discriminating between functional and non-functional transfersequences, here we expanded the dataset to include, besides the positive se-quences (Pos) also an equal amount negative counterparts (Neg, Methods M1).Using the speciﬁc distance functions as the measure of variation across the datawithin a permutational MANOVA framework (bootstrap n = 1e4, Methods M3)[46], we observed signiﬁcant ( p -value ≤ p -value < We tested whether the structural representations and s-distance could be ap-plied to existing algorithm frameworks, such as DNA sequence alignment. The12able 3: Comparison of distance functions based on nucleotide sequence (p-distance) and structural representations (s-distance). F -scores were obtainedwith the alignment algorithm (Algorithm 1). Func. Params. R ANOVA p -value F -scorePos/Neg Mob Pos/Neg Mob Pos/Neg Mobp-distance / 0.017 0.113 0.472 0.473 0.892 0.834s-distance s = 3, k = 32 0.046 0.237 < < s = 5, k = 128 0.049 0.248 < < s = 7, k = 128 0.054 0.253 1e-4 < s = 9, k = 128 0.057 0.264 1e-4 < Figure 7: Amount of unique DNA sequence variants encoded by structuralrepresentations of 220 bp long plasmid transfer regions, results with 10 regionvariants shown.DNA alignment framework that we developed (Algorithm 1) enabled the use ofdiﬀerent metrics such as the s-distance and could align a query and a target se-quence based on ﬁnding the minimal distance between them. Since equal sizedgroups were no longer required, we used an expanded dataset of 112 distincttransfer regions as the query dataset and a target dataset of 52 plasmids withknown locations of the transfer regions as well as their Mob groups [5] (Meth-ods M1). By counting the amount of lowest-distance alignments in the targetdataset that were below a speciﬁed distance threshold, we could deﬁne similarmetrics as for a binary classiﬁcation problem, namely amounts of true and falsepositive and negative results (Methods M4) as well as the harmonic mean ofprecision and recall, the F -score (Eq. 6). Although both distance functionsproved accurate, we observed, on average, an over 7% improvement of the F -score with the s-distance compared to the p-distance, both when discriminatingfunctional regions (Pos/Neg) as well as the Mob groups ( p -value < s = 7, k = 128) thus correctly uncovered132 (62%) transfer regions in the target set with 30 of them (58%) correctly Mobtyped, compared to 29 (56%) and 26 (50%) with the sequence based p-distance,respectively. This suggested that existing DNA sequence-based algorithms couldindeed be enhanced with structural representations [9, 8, 11, 1] (Fig. 8).Figure 8: Possibilities for combined sequence structure algorithms based on howproteins recognize and bind their active sites in the regulatory DNA. Interactionsof lower speciﬁcity with the surrounding DNA (corresponding to DNA withless conserved nucleotide sequence but deﬁned structural properties) guide theproteins towards their speciﬁc binding sites (highly conserved or exact sequence). Here we used a number of functionally-relevant DNA physicochemical and con-formational properties (Table 2: 57 to 64 structural variables) and fused theminto compact DNA structural representations containing the most importantstructural information (e.g. 6 components carried over 80% of the initial datavariance). Recovery of the key distinguishing properties of the ﬁrst 6 structuralcomponents showed that they indeed reﬂected the main properties involvedin protein-DNA interactions (Table 1). The amount of structural informationcould be further reﬁned with clustering down to 2 bits (Fig. 3), where the com-pression ratio of functional DNA sequence motifs could be as high as 3:1 (Fig.4). Nevertheless, the most promising results were obtained with a number ofclusters corresponding to 6 to 8 bits of information (64 to 256 clusters, Fig. 5).These structural representations could compress functional sequence variantsby around 30% to 50%, where each instance of the structural representationencapsulated up to 20 sequence variants of a TFBS motif (Fig. 6) and upto 2000 variants of a whole DNA regulatory region (Fig. 7). Further test-ing is required, however, using datasets with a more complete coverage of thefunctional sequence space or experimentally, to properly investigate the capac-ity to encode groups of functional DNA sequence variants and their conservedfunctional properties. 14evising the s-distance function opens up a plethora of possibilities for test-ing the structural codes and implementing them into DNA algorithms, as it wasfound to resolve regulatory DNA more accurately than sequence-based met-rics [5, 3] (Table 3). One can also think of additional metrics that can proveuseful, such as for instance a Jaccard distance using s-mers instead of k-mers[5], that can distinguish coding and non-coding sequences across genomes orhelp bin species in metagenomic data. Furthermore, by developing and test-ing a distance-based alignment algorithm (Algorithm 1), we demonstrated thepotential of the structural representations to enhance existing sequence-basedalgorithms. Here, the usefulness of sequence-like codes (similar to the nucleotidecode ACGT) stood out, as the structural representations could be realized asmere sequences of cluster indices with precomputed pairwise distances, abstract-ing from, and altogether disposing of, the structural component space. This cansimplify their implementation in existing sequence-based algorithms and alsoimproves the accuracy of pinpointing enzymatic sites, such as nicking sites intransfer regions, down to a resolution of 1 bp [5]. Accordingly, the solutioncan uncover many new variants of transfer regions in natural plasmids, helpingresearchers investigate the potential for plasmid mobility and its global eﬀects[5]. The use of whole sets of structural variables is more frequently found in thedomain of machine learning (ML), where ML models trained on speciﬁc sets oftarget variables then learn to use only speciﬁc subsets of the structural features[2, 3, 24]. For instance, when training ML models for discriminating Pos/Negexamples or MOB groups (Table 3), diﬀerent variables at diﬀerent locationswere found to be most informative [3]. Although this leads to eﬃcient discrimi-native or predictive models for speciﬁc tasks, DNA sequence analysis frequentlyrequires generalizing across multiple tasks and using all of the features (e.g.alignment). Indeed, we have found that sequence based models of structuralproperties outperform ML models [5, 3]. Deep learning algorithms, however,might prove to be much more capable as they can interpret new data repre-sentations themselves [47] and, in our experience, outperform any structuralfeature-based model using mere nucleotide sequence [48].Besides sequence alignment [5], the potential uses of the structural repre-sentations include: (i) phylogenetic analysis of regulatory DNA regions [45],(ii) analysis of single nucleotide variations [6], as the structural representationscontain variants with position-speciﬁc nucleotide substitutions, (iii) motif iden-tiﬁcation [25], where, for instance, the initialization stage of graph-based algo-rithms could be performed using structural representations [49], and (iv) designof DNA substrates with a modiﬁed DNA sequence but conserved functionality.Other, combined approaches could potentially mimic the protein-DNA searchand binding dynamics in DNA regulatory regions [9, 8, 11, 1] and adopt a com-bination of both structural and sequence features (Fig. 8). Another possibilityis to use diﬀerent representations for the non-coding (structural) and coding(sequence) regions, e.g. for whole-gene identiﬁcation. Indeed, due to the bias ofthe current bioinformatic methods to nucleotide sequence based approaches, itis possible that considerable performance increases might still be achievable in15ertain ﬁelds with such novel solutions [5, 25].

Acknowledgements

This work was supported by the Slovenian Research Agency under grant agree-ment no. [Z2-7257]. I kindly thank Filip Buric (Chalmers Univ. of Tech., Swe-den), Tomaz Pisanski and Nino Basic (UP-FAMNIT, Slovenia), Chrats Melko-nian (VU, Netherlands), Maria Pilar Garcillan-Barcia and Fernando de la Cruz(UNICAN, Spain), Joshua Ramsey (Curtin Univ., Australia) and Ziva Stepan-cic and Ales Lapanje (IJS, Slovenia) for their discussions and support of thisresearch.

References [1] Remo Rohs, Sean M West, Alona Sosinsky, Peng Liu, Richard S Mann, andBarry Honig. The role of DNA shape in protein–DNA recognition.

Nature ,461(7268):1248–1253, October 2009.[2] Jan Zrimec and Ales Lapanje. Fast prediction of DNA melting bubblesusing DNA thermodynamic stability.

IEEE/ACM Trans. Comput. Biol.Bioinform. , 12(5):1137–1145, September 2015.[3] Jan Zrimec and Aleˇs Lapanje. DNA structure at the plasmid origin-of-transfer indicates its potential transfer range.

Sci. Rep. , 8(1):1820, January2018.[4] Wei Chen, Pengmian Feng, and Hao Lin. Prediction of replication originsby calculating DNA structural properties.

FEBS Lett. , 586(6):934–938,March 2012.[5] Jan Zrimec. Multiple plasmid origin-of-transfer substrates enable thespread of natural antimicrobial resistance to human pathogens. bioRxiv ,page 2020.04.20.050401, April 2020.[6] James D Watson, Tania A Baker, Stephen P Bell, Alexander Gann,Michael K Levine, and Richard Losick.

Molecular Biology of the Gene.6th. ed . Pearson/Benjamin Cummings, 2008.[7] Fernando De La Cruz, Laura S Frost, Richard J Meyer, and Ellen L Zech-ner. Conjugative DNA metabolism in gram-negative bacteria.

FEMS Mi-crobiol. Rev. , 34(1):18–40, January 2010.[8] Michal Levo, Einat Zalckvar, Eilon Sharon, Ana Carolina Dantas Machado,Yael Kalma, Maya Lotam-Pompan, Adina Weinberger, Zohar Yakhini,Remo Rohs, and Eran Segal. Unraveling determinants of transcription fac-tor binding outside the core binding site.

Genome Res. , 25(7):1018–1029,July 2015. 169] Amir Marcovitz and Yaakov Levy. Weak frustration regulates sliding andbinding kinetics on rugged Protein–DNA landscapes.

The Journal of Phys-ical Chemistry B , 117(42):13005–13014, 2013.[10] Valentina Tosato, Nicole West, Jan Zrimec, Dmitri V Nikitin, GianninoDel Sal, Roberto Marano, Michael Breitenbach, and Carlo V Bruschi.Bridge-induced translocation between NUP145 and TOP2 yeast genes mod-els the genetic fusion between the human orthologs associated with acutemyeloid leukemia.

Front. Oncol. , 7:231, 2017.[11] Matthew Slattery, Tianyin Zhou, Lin Yang, Ana Carolina Dantas Machado,Raluca Gordˆan, and Remo Rohs. Absence of a simple code: how tran-scription factors read the genome.

Trends Biochem. Sci. , 39(9):381–399,September 2014.[12] J SantaLucia, Jr. A uniﬁed view of polymer, dumbbell, and oligonucleotideDNA nearest-neighbor thermodynamics.

Proc. Natl. Acad. Sci. U. S. A. ,95(4):1460–1465, February 1998.[13] Eric P Bishop, Remo Rohs, Stephen C J Parker, Sean M West, Peng Liu,Richard S Mann, Barry Honig, and Thomas D Tullius. A map of minorgroove shape and electrostatic potential from hydroxyl radical cleavagepatterns of DNA.

ACS Chem. Biol. , 6(12):1314–1320, December 2011.[14] Tsu-Pei Chiu, Federico Comoglio, Tianyin Zhou, Lin Yang, Renato Paro,and Remo Rohs. DNAshapeR: an R/Bioconductor package for DNA shapeprediction and feature encoding.

Bioinformatics , 32(8):1211–1213, April2016.[15] Ivan Brukner, Roberto Sanchez, Dietrich Suck, and Sandor Pongor.Sequence-dependent bending propensity of DNA as revealed by DNase i:parameters for trinucleotides.

EMBO J. , 14(8):1812–1818, 1995.[16] Stephanie Geggier and Alexander Vologodskii. Sequence dependence ofDNA bending rigidity.

Proc. Natl. Acad. Sci. U. S. A. , 107(35):15421–15426, August 2010.[17] H Karas, R Kn¨uppel, W Schulz, H Sklenar, and E Wingender. Combin-ing structural analysis of DNA with search routines for the detection oftranscription regulatory elements.

Comput. Appl. Biosci. , 12(5):441–446,October 1996.[18] Wilma K Olson, Andrey A Gorin, Xiang-Jun Lu, Lynette M Hock,and Victor B Zhurkin. DNA sequence-dependent deformability deducedfrom protein–DNA crystal complexes.

Proc. Natl. Acad. Sci. U. S. A. ,95(19):11163–11168, September 1998.[19] Alberto Perez, Agnes Noy, Filip Lankas, F Javier Luque, and ModestoOrozco. The relative ﬂexibility of B-DNA and A-RNA duplexes: databaseanalysis.

Nucleic Acids Res. , 32(20):6144–6151, 2004.1720] Misako Aida. An ab initio molecular orbital study on the sequence-dependency of DNA conformation: An evaluation of intra- and inter-strandstacking interaction energy.

Journal of Theoretical Biology , 130(3):327–335,1988.[21] B Hartmann, B Malfoy, and R Lavery. Theoretical prediction of base se-quence eﬀects in DNA. experimental reactivity of Z-DNA and B-Z transi-tion enthalpies.

J. Mol. Biol. , 207(2):433–444, May 1989.[22] Mandar Kulkarni and Arnab Mukherjee. Sequence dependent free energyproﬁles of localized B- to a-form transition of DNA in water.

J. Chem.Phys. , 139(15):155102, October 2013.[23] Pui S Ho, Michael J Ellison, Gary J Quigley, and Alexander Rich. Acomputer aided thermodynamic approach for predicting the formation ofZ-DNA in naturally occurring sequences.

EMBO J. , 5(10):2737–2744, 1986.[24] Jan Zrimec, Rok Kopinˇc, Tomaˇz Rijavec, Tatjana Zrimec, and Aleˇs La-panje. Band smearing of PCR ampliﬁed bacterial 16S rRNA genes: depen-dence on initial PCR target diversity.

J. Microbiol. Methods , 95(2):186–194,November 2013.[25] Md Abul Hassan Samee, Benoit G Bruneau, and Katherine S Pollard. A denovo shape motif discovery algorithm reveals preferences of transcriptionfactors for DNA shape beyond sequence motifs.

Cell Systems , 8(1):27–42.e6, 2019.[26] Manju Bansal, Aditya Kumar, and Venkata Rajesh Yella. Role of DNAsequence based structural features of promoters in transcription initiationand gene expression.

Curr. Opin. Struct. Biol. , 25:77–85, April 2014.[27] Mar´ıa Lucas, Blanca Gonz´alez-P´erez, Matilde Cabezas, Gabriel Mon-calian, Germ´an Rivas, and Fernando de la Cruz. Relaxase DNA bindingand cleavage are two distinguishable steps in conjugative DNA processingthat involve diﬀerent sequence elements of the nic site.

J. Biol. Chem. ,285(12):8918–8926, March 2010.[28] Marta V Sut, Sanja Mihajlovic, Silvia Lang, Christian J Gruber, andEllen L Zechner. Protein and DNA eﬀectors control the TraI conjuga-tive helicase of plasmid R1.

J. Bacteriol. , 191(22):6888–6899, November2009.[29] Gabriel Moncalian, Mikel Valle, Jose Maria Valpuesta, and Fernando de laCruz. IHF protein inhibits cleavage but not assembly of plasmid R388relaxosomes.

Molecular Microbiology , 31(6):1643–1652, 1999.[30] Sarah L Williams and Joel F Schildbach. TraY and integration host factororit binding sites and F conjugal transfer: sequence variations, but notaltered spacing, are tolerated.

J. Bacteriol. , 189(10):3813–3823, May 2007.1831] Aziz Khan, Oriol Fornes, Arnaud Stigliani, Marius Gheorghe, Jaime ACastro-Mondragon, Robin van der Lee, Adrien Bessy, Jeanne Ch`eneby,Shubhada R Kulkarni, Ge Tan, Damir Baranasic, David J Arenillas, AlbinSandelin, Klaas Vandepoele, Boris Lenhard, Benoˆıt Ballester, Wyeth WWasserman, Fran¸cois Parcy, and Anthony Mathelier. JASPAR 2018: up-date of the open-access database of transcription factor binding proﬁles andits web framework.

Nucleic Acids Res. , 46(D1):D1284, January 2018.[32] E G Gusm¨ao and M C P de Souto. Issues on sampling negative examples forpredicting prokaryotic promoters. In , pages 494–501. IEEE, July 2014.[33] Socorro Gama-Castro, Heladia Salgado, Alberto Santos-Zavaleta, DanielaLedezma-Tejeida, Luis Mu˜niz-Rascado, Jair Santiago Garc´ıa-Sotelo, KevinAlquicira-Hern´andez, Irma Mart´ınez-Flores, Lucia Pannier, Jaime Abra-ham Castro-Mondrag´on, Alejandra Medina-Rivera, Hilda Solano-Lira,C´esar Bonavides-Mart´ınez, Ernesto P´erez-Rueda, Shirley Alquicira-Hern´andez, Liliana Porr´on-Sotelo, Alejandra L´opez-Fuentes, AnastasiaHern´andez-Koutoucheva, V´ıctor Del Moral-Ch´avez, Fabio Rinaldi, andJulio Collado-Vides. RegulonDB version 9.0: high-level integration of generegulation, coexpression, motif clustering and beyond.

Nucleic Acids Res. ,44(D1):D133–43, January 2016.[34] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation andvalidation of cluster analysis.

J. Comput. Appl. Math. , 20:53–65, 1987.[35] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating thenumber of clusters in a data set via the gap statistic.

J. R. Stat. Soc. SeriesB Stat. Methodol. , 63(2):411–423, May 2001.[36] T D Schneider, G D Stormo, L Gold, and A Ehrenfeucht. Informationcontent of binding sites on nucleotide sequences.

J. Mol. Biol. , 188(3):415–431, April 1986.[37] Marti J Anderson. A new method for non-parametric multivariate analysisof variance.

Austral Ecol. , 26(1):32–46, 2001.[38] M Peyrard, S Cuesta-L´opez, and D Angelov. Experimental and theoreticalstudies of sequence eﬀects on the ﬂuctuation and melting of short DNAmolecules.

J. Phys. Condens. Matter , 21(3):034103, January 2009.[39] Ekaterina Protozanova, Peter Yakovchuk, and Maxim D Frank-Kamenetskii. Stacked–Unstacked equilibrium at the nick site of DNA.

Journal of Molecular Biology , 342(3):775–785, 2004.[40] A Bolshoy, P McNamara, and others. Curved DNA without AA: ex-perimental estimation of all 16 DNA wedge angles.

Proceedings of the ,6(88):2312–6, 1991. 1941] A A Gorin, V B Zhurkin, and W K Olson. B-DNA twisting correlates withbase-pair morphology.

J. Mol. Biol. , 247(1):34–48, March 1995.[42] W Kabsch, C Sander, and E N Trifonov. The ten helical twist angles ofB-DNA.

Nucleic Acids Research , 10(3):1097–1104, 1982.[43] M J Packer, M P Dauncey, and C A Hunter. Sequence-dependent DNAstructure: dinucleotide conformational maps.

J. Mol. Biol. , 295(1):71–83,January 2000.[44] S C Satchwell, H R Drew, and A A Travers. Sequence periodicities inchicken nucleosome core DNA.

J. Mol. Biol. , 191(4):659–675, October1986.[45] Mar´ıa Pilar Garcill´an-Barcia, Mar´ıa Victoria Francia, and Fernando de laCruz. The diversity of conjugative relaxases and its application in plasmidclassiﬁcation.

FEMS Microbiol. Rev. , 33(3):657–687, May 2009.[46] Marti J Anderson. Permutational multivariate analysis of variance.

De-partment of Statistics, University of Auckland, Auckland , 26:32–46, 2005.[47] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learn-ing: a review and new perspectives.

IEEE Trans. Pattern Anal. Mach.Intell. , 35(8):1798–1828, August 2013.[48] J Zrimec, F Buric, A S Muhammad, R Chen, V Verendel, and others.Gene expression is encoded in all parts of a co-evolving interacting generegulatory structure. bioRxiv , page 10.1101/792531, 2019.[49] ˇZiva Stepanˇciˇc. Enhancing gibbs sampling method for motif ﬁnding in DNAwith initial graph representation of sequences.