A word recurrence based algorithm to extract genomic dictionaries
Vincenzo Bonnici, Giuditta Franco, Vincenzo Manca
Abstract
Genomes may be analyzed from an information viewpoint as very long strings, containing functional elements of variable length, which have been assembled by evolution. In this work an innovative information theory based algorithm is proposed, to extract significant (relatively small) dictionaries of genomic words. Namely, conceptual analyses are here combined with empirical studies, to open up a methodology for the extraction of variable length dictionaries from genomic sequences, based on the information content of some factors. Its application to human chromosomes highlights an original inter-chromosomal similarity in terms of factor distributions.
Human genome computational analysis is one of the most important and intriguing research challenges we are currently facing. Genomes carry the main information underlying the life of organisms and their evolution, including a system of molecular (reading, writing, and signal transmission) rules which orchestrate all cell functions. Most of these rules, and especially the way they cooperate, are unknown, and understanding them is a problem of great scientific and medical interest [1], due mainly to currently incurable, widespread genetic diseases. Our work here follows and outlines some trends of research which analyze and interpret genomic information, by assuming the genome to be a book encrypted in a language to decipher (see for example [2, 3, 4, 5, 6, 7]). Namely, this analysis may be developed by sequence alignment-free methods based on information theoretical concepts, in order to convert the genomic information into a comprehensible mathematical form, such as a dictionary of variable-length factors that collects words of the unknown genomic language.

According to a common approach in computational genomics (e.g., [8, 9, 10, 11, 12]), a genome is represented by a string over the alphabet Γ = {A, C, T, G}, when secondary and tertiary structures of the DNA double helix are neglected; that is, a genome G is an element of Γ*. This representation easily leads to affinities with a text, written in a natural language, which is comprehensible by means of its vocabulary, giving both syntax and semantics of words. Chomsky taught us that the elaboration of words (as sequences of symbols generated or recognized by a computational model) is crucial for formal languages, which paved the base of computer science, while Shannon gave birth to information theory working on codes (systems of words) equipped with a probability distribution.
The concept (definition, characterization) of word is indeed central to understand the language in which information is organized within a text or a (genomic) string. An example of genomic dictionary may be found within the structure surrounding eukaryotic genes. The coding region of a gene is composed of a starting untranslated region (5′-UTR), a specific starting codon (usually ATG), an interleaving of exons and introns followed by a termination codon (TAG, TAA or TGA), and an untranslated tailing region (3′-UTR). However, in
Homo sapiens genes cover a relatively small fraction of the entire genome, while the rest of it, first considered junk DNA, is either transcribed into regulatory elements or associated with some other biochemical activity: hence, it is covered by (generally long) functional elements. According to recent advancements, the concept of functional element is central, defined as a genomic segment that codes for a defined biochemical product or displays a reproducible biochemical signature [13, 6]. Furthermore, the distinct distribution of transcribed RNA species across segments suggests that underlying biological activities are captured in some genome segmentation. In such a context, catching most of the 'significant' words (from these segments) which were naturally selected during evolution would be a first step to understand at least the syntax of a hypothetical genomic language, whose semantics may possibly be studied with the support of epigenomics. An information theory based analysis clearly plays an important role in deciphering such a language, and in the literature there are several examples regarding information theory applications to biological sequence analysis, for example reviewed in [14], that confirm the linkage between DNA fragments and their information content [8, 4, 13, 15, 16, 17].

Also fixed length dictionaries show some interesting properties. Namely, in [18, 19] the authors applied a methodology developed for literary text to extract fixed length genomic dictionaries. An analysis regarding the intersection of fixed length dictionaries coming from human chromosomes is reported in this paper. Examples of fixed length dictionary extraction procedures could be provided by applying notions such as word multiplicity or word length distributions. On the other hand, graphical investigative analyses, based on expected frequency gaps, show the unpredictable behaviour of genomic sequences and help to detect peculiar words [20].
Following the terminology from our previous work [12], given a genome G we call D_k(G) ⊆ Γ^k the k-dictionary of all k-mers occurring in the genome G. If we think of a book, semantically significant words have a fairly medium number of occurrences (they are not over-represented, as conjunctions and prepositions are, and only some of them are under-represented, as signatures, neologisms, and specialist words are), and they are clustered according to the topic described in that part of the book. Analogously, it is clear why several works are focused on finding genomic words exhibiting some special kind of (somehow clustered) repetitiveness, with a global frequency quite different from the expected frequency in purely random sequences having the same length as the investigated genome [21, 22, 8, 23, 16, 15]. A very relevant and peculiar word periodicity is revealed by the Recurrence Distance Distribution (RDD), which measures the frequency at which a given word occurs at given distances [24] (for an application to real genomes, see [25]).

In this paper, we start from a modified version of the algorithm introduced in [18], in order to apply it to real genomes (e.g., human chromosomes). We call it V-algorithm, from the first name of the authors who designed it. Both the original and the modified algorithms are aimed at finding words forming local clusters (the approach is explained in section 2.1). Then we propose a new RDD-based algorithm, which we call W-algorithm, which extracts variable length dictionaries of interest from several real genomic sequences and collects words having a recurrence distribution maximally different from their random distribution. Such a selection is developed by computing the (locally) maximum divergence, from random sequences, of the RDD of each string obtained by elongating an initial seed word over the genome. The divergence from random sequences is a crucial issue in information analysis of strings [26, 27] and in analyzing mathematical properties of dictionaries.
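To make these notions concrete, the k-dictionary D_k(G) and the RDD of a word can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation (their IGtools software relies on suffix arrays); function names are ours.

```python
from collections import Counter

def k_dictionary(genome: str, k: int) -> set:
    """D_k(G): the set of all k-mers occurring in the genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def rdd(genome: str, word: str) -> Counter:
    """Recurrence Distance Distribution: how often each distance occurs
    between consecutive (possibly overlapping) occurrences of `word`."""
    positions = []
    start = genome.find(word)
    while start != -1:
        positions.append(start)
        start = genome.find(word, start + 1)
    return Counter(b - a for a, b in zip(positions, positions[1:]))
```

For instance, in the toy genome "ACGTACGTTACG" the word ACG occurs at positions 0, 4 and 9, so its RDD assigns one occurrence each to the distances 4 and 5.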
The methodology in [18] to find dictionaries is therefore here improved by the V-algorithm, and a more general approach is proposed (in section 2.3) by means of the RDD based W-algorithm, which works with the global word recurrence distance distribution rather than with only a first slice of it.

Several studies from the state of the art define properties for words which turn out to be salient features in analysing genomic sequences [28]. Minimal absent words and maximal or palindromic repeated words are some examples [29, 30, 31]. Comparison of sequences for finding common substrings has been used for detecting protein domains via Markov chains [32]. Other analyses recognize words that are statistically significant to compare two sequences [33], or to discriminate sequence motifs [34]. These approaches are focused on finding specific words to be used as key features of a string for analysing its properties or for comparing it to another sequence [35]. The extracted words are often sparsely located in the analysed sequence [36], thus they do not constitute a real linguistic analysis of genomic strings. In [24], on the other hand, an alignment-free distance measure (based on the return time distribution of k-mers) is employed for sequence clustering and phylogeny purposes. The approach presented in our study aims at extracting a set of words that represent, in a statistical way (that is, having a recurrence distribution maximally different from that in a random sequence), the factors of the hidden language of a given genome. By analogy to linguistics, the extracted set of words constitutes the dictionary of the unknown language. In fact, such dictionaries are shown to cover a high percentage of the sequence with a (forced) minimal overlap of their occurrences.

In brief, we focus on two specific algorithms to extract genomic dictionaries with genomic words owning desired recurrence properties.
Such automatically generated dictionaries were further i) selected according to their genome coverage properties (that is, analysed in terms of contained words, their lengths, and their sequence and positional coverage over the source genome), ii) biologically validated, iii) filtered by elimination of infix and suffix words, and iv) employed to cluster human chromosomes. Our methodology to extract and evaluate genomic dictionaries is explained in the next three sections (where, respectively, the algorithmic approaches together with a dictionary validation criterion, the results obtained, and a discussion are presented) and illustrated in Figure 4.

The specific software IGtools was developed in [37] for extracting k-dictionaries, computing distributions and set-theoretic operations, and evaluating empirical entropies and informational indexes. Both the V-algorithm and the W-algorithm were implemented within the Infogenomics tools framework, which is based on an engineered suffix array suitable for analyzing genomic sequences. The software is also available at https://bitbucket.org/infogenomics/igtools/wiki/Home. However, our work presented here is mainly a proof of concept, focused on the idea underlying the algorithms design, also supported by empirical results (namely on clustering human chromosomes), whereas algorithmic efficiency and implementation technology were not investigated. In this respect, advanced programming paradigms such as MapReduce could support our study with more computational analyses on genomic dictionaries [38], from both informational and linguistic viewpoints [7].
RDD plays an important role in computational analysis of genomic sequences. Inspired by the fact that keywords are clustered in literary text, in [18] RDD was used as a basis in defining a clustering coefficient of words, while in [25] its application to coding regions shows the informational evidence of the codon language, and in [39, 40, 41] some characterizations of recurrence behaviours were pointed out for very short k-mers. However, only fixed length dictionaries were extracted from real genomes by means of such a distribution [19].

In this section, we first summarize the genomic word extraction methodology reported in [18], which was our starting point to develop a variant of it, the V-algorithm, and then introduce a novel RDD-based extraction algorithm, called W-algorithm, giving a special emphasis to issues regarding their applicability in extracting variable length dictionaries from human chromosomes. Here we also propose some criteria (based on suffix presence, biological relevance, and covering properties) to evaluate genomic dictionaries extracted by the W-algorithm, in order to optimize the whole methodology.

The authors of [18] used RDD to identify keywords by applying a methodology that associates a clustering coefficient C to k-mers. The main idea is based on the fact that keywords are not uniformly distributed in a literary text; instead, they are clustered. Their approach combines the information provided by the spatial distribution of a word along the text (via the clustering coefficient) and its frequency, since the statistical fluctuation depends on the frequency. This basic approach has been used in [19] to assign a relevance to 6-mers and 8-mers in Homo sapiens and
Mus musculus. The 8-mers were sorted by their normalized clustering coefficient (called σ_nor), and it has been shown that part of the top-200 clustered words (about 70%) appears in known functional biological elements, like coding regions and TFBSs.

The whole recurrence distribution is synthesised with a single parameter σ, which quantifies the clustering level, previously presented in [9] for studying the energy levels of quantum disordered systems [42], and a clustering degree σ_nor is assigned to words, for the identification of keywords in literary texts, obtained by means of the relation between the σ of a real word and the theoretical expected one (coming from a theoretically hypothesized distribution), as in the following. For a given word, the parameter σ is the standard deviation of its normalized set of recurrence distances, σ = s / d̄, where s is the standard deviation of the recurrence distance distribution, and d̄ is the average recurrence distance. When the RDD is a geometric distribution, the parameter is denoted by σ_geo and it is equal to √(1 − p), since s = √(1 − p) / p and d̄ = 1 / p, where p is the word frequency. Thus, the resulting normalized clustering measure σ_nor of the given word is given by σ / σ_geo = s / (d̄ √(1 − p)). For values of σ_nor near to 1, the recurrence distribution of the word is close to the geometric one, thus indicating a randomness of the word. In fact, a random sequence is generated by a Bernoullian process, so different occurrences of a given word are independent events, and the event of having k occurrences of a word (in a segmentation unit) follows a Poisson distribution.
Therefore, according to probability theory [43], its waiting time, that is the distance at which a word recurs, follows an exponential distribution (having the geometric distribution as its discrete counterpart).

For words with low multiplicity the statistical fluctuations are much larger, and it is possible to obtain a higher σ_nor for rare words placed at random, which would then be misidentified as keywords. Thus, the authors applied a correction by a Z-score measure that combines the clustering of a word and its multiplicity n. The resulting clustering measure C is given by the following equation: C(σ_nor, n) = (σ_nor − ⟨σ_nor⟩(n)) / sd(σ_nor)(n), where ⟨σ_nor⟩(n) = (2n − 1) / (2n + 2) and sd(σ_nor)(n) = 1 / (√n (1 + 2.8 n^(−0.865))). Parameter values were obtained via extensive simulations, by taking into account the distribution of σ_nor in random texts. They represent the mean value and the standard deviation of such an empirical distribution. The C coefficient measures the deviation of σ_nor with respect to the expected value in a random text, in units of the expected standard deviation: the larger is C, the more clustered is the word. For each length k, some of the words in D_k(G) are selected, according to their C measure, which must be greater than the C measure corresponding to a fixed percentile. Selected words are then iteratively elongated, by keeping only elongations having a C measure greater than a fixed threshold, and up to a fixed maximal word length: these are precisely the two points we changed in the V-algorithm presented in the next section. In [18], the longest visited lineage is selected as a word with semantic meaning, and the process is repeated for different values of k (ranging in 2-35), until a dictionary is obtained by discarding repeated words.

The application of the above algorithm to wide genomic sequences comes with some issues. Indeed, in literary text the maximal length of words is known a priori, and parameters, such as the percentile threshold, can be calculated empirically.
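The clustering measures just described can be sketched directly from their definitions. This is an illustrative sketch, not the implementation of [18]; the normalisation constants for ⟨σ_nor⟩(n) and sd(σ_nor)(n) follow the keyword-detection literature cited above and should be treated as assumptions of this sketch, as should all names.

```python
import math

def clustering_coefficient(distances, p):
    """C(sigma_nor, n) for a word with recurrence `distances` (list of
    distances between consecutive occurrences) and frequency p."""
    n = len(distances) + 1                  # number of occurrences
    mean_d = sum(distances) / len(distances)
    var = sum((d - mean_d) ** 2 for d in distances) / len(distances)
    sigma = math.sqrt(var) / mean_d         # sigma = s / d_bar
    sigma_nor = sigma / math.sqrt(1 - p)    # sigma_geo = sqrt(1 - p)
    # empirical mean and sd of sigma_nor in random texts (assumed constants)
    mean_nor = (2 * n - 1) / (2 * n + 2)
    sd_nor = 1 / (math.sqrt(n) * (1 + 2.8 * n ** -0.865))
    return (sigma_nor - mean_nor) / sd_nor  # Z-score
```

A word recurring at perfectly regular distances has σ = 0 and hence a strongly negative C, while a word whose occurrences bunch together gets a positive C, in line with the interpretation above.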
Working on genomes with the aim to discover an unknown genomic language requires a different approach. Although some words with biological meaning are already known, it would be a limitation to assign a value for the maximal word length in the dictionary we are extracting. Therefore, the V-algorithm does not use any maximal word length threshold. Moreover, word elongations are selected by comparing their C measure with the one of their longest proper prefix, rather than with a prefixed C, and the elongated word (from length k to k +
1) is selected to be part of the output dictionary only if it has a local maximum of the C-measure (see Figure 1).

Figure 1: Expansion procedure.
Word elongation is carried on as long as the measure r(α) of the current word α does not decrease. In (b) the algorithm produces as output the word ATGCGCGTATG, by starting from each seed, of length 1 and of length 4, while in (a) seeds of different lengths (1 and 14) allow it to produce two different words (corresponding to the two peaks of r(α)). In (c) a suffix of ATGCGCGTATG is generated, due to a different position of the seed of length 1 with respect to (b).

In the above procedure we cannot exclude the fact that both a word αβ and its proper prefix α may own the desired properties. Indeed, there may exist significant words which are part of significant longer words, as illustrated in Figure 1 (a), and to catch them it is fundamental to evaluate dictionaries achieved by starting from different seed lengths. By definition, if we run the elongation procedure by starting from the first symbol of αβ, then α is discovered as a genomic word, and the elongation procedure stops without discovering the further elongation up to αβ. Hence, we run the elongation procedure over several sets of seed words, including also αx (to be evaluated as seed words, for x ∈ Γ), where α corresponds to the first peak of the parameter C, hence to a genomic word selected to be an element of the dictionary we are extracting. Therefore, by our algorithms we allow roots (or prefixes) to be part of the genomic language, as it holds for natural languages.

As seed words we consider the sets D_k(G), with values of k ranging from 1 to the minimal forbidden length, an informational index widely used in sequence analysis [44, 45, 46, 47].
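The minimal forbidden length used as the upper bound for seed lengths follows directly from its definition: the smallest k for which some k-mer over Γ does not occur in G, i.e. |D_k(G)| < 4^k. A naive, illustration-only sketch (names are ours; an efficient implementation would use a suffix structure):

```python
def minimal_forbidden_length(genome, alphabet="ACGT"):
    """Smallest k such that not all k-mers over the alphabet occur in G."""
    k = 1
    while True:
        kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
        if len(kmers) < len(alphabet) ** k:
            return k                      # some k-mer is absent from G
        k += 1
```

For example, a string containing all four nucleotides but not all sixteen dinucleotides has minimal forbidden length 2.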
Our choice for seed lengths, in order to elongate 1-mers as well as longer seeds chosen according to a specific property of the analysed genome, is based on our previous works [12, 20, 6], where we have analyzed cardinalities of genomic dictionaries of k-mers, and informational indexes such as k-lexicality and k-dictionary selectivity, which take into account also the number of occurrences (i.e., the multiplicity of repeats) by varying the value of k. Namely, in [48, 49] the value k = lg(n), where n is the genome length, allowed us to define some indexes based on information entropies, helpful to find a new genome complexity measure.

2.3 The RDD based W-algorithm

Here we use RDD to calculate the divergence of the real distribution of a word within the genome from its frequency over a random string with the same genome length [50, 23]. Such a divergence is used as a measure of the information content of a word. Low expressive words are elongated by an expansion procedure, until they reach a reasonable level of "significance" according to which they are classified as genomic words of the extracted dictionary. We assume that the higher is the entropic divergence from the above exponential distribution, the more specialized and evolutionarily selected is the genomic element. In this sense, low multiplicity words already represent elements owning a high level of significance. As for words with multiple occurrences (i.e., repeats), we associate their "meaning" with their repetitiveness profile, as it is revealed by a "good" RDD. A good RDD means that a great number of recurrence distances have to occur, and the number of times such distances occur has to fall in a wide numeric interval. Roughly speaking, a repeat has to occur widely along the genomic sequence but it also has to show a reasonable level of specificity. In other terms, a word has to occur along the sequence several times and at different distances.
See an example in Figure 2, where the exponential distribution represents the random recurrence behaviour of the word. The RDD of words along a real genome is often sparse, meaning that several distances (of recurrence) actually do not appear in the genome. This is why we evaluate the sound (i.e., more fitting) exponential distribution after removing peaks, which are absent in exponential functions, and by imposing a normalization ensuring the overall unitary probability.

Figure 2: RDD of word CGC (the jagged curve) in human chromosome 22, up to distance 300, with the "best fitting" exponential curve (the regular curve), which represents the waiting time of a Poisson distribution [43] ruling a word random occurrence.

The degree of significance of a word to be selected for our dictionary is its random deviation, measured by the function in equation (1), based on the entropic divergence (such as the Kullback-Leibler divergence [21]) between the real RDD of a word (over the analysed genome) and its expected exponential distribution. More technically, given a word α, which occurs in a genome G, we calculate its random deviation as the entropic divergence between its RDD and a suitable exponential distribution. To this aim, we first extract the real RDD of α over G, which we refer to as R_α. Then we estimate a two-parameter exponential distribution E_α, by making use of the Nelder and Mead simplex algorithm [51], a commonly used nonlinear optimization technique for problems for which derivatives may not be known. A denoised distribution is used as input for the estimation procedure: it is obtained by applying a low-pass filter (over R_α) in order to attenuate peaks. Afterwards we remove from E_α the domain values which are not present in R_α, namely the gaps of R_α. Successively, both R_α and E_α are normalized in order to become probability distributions.
Finally, the random deviation of α is defined as:

r(α) = max(KL(R_α, E_α), KL(E_α, R_α)),    (1)

where KL is the asymmetric Kullback-Leibler entropic divergence.

In our algorithm (reported in Listing 1) the estimation of the information content of a word α is computed (at every step) by the function r(α). Word elongation is carried on until the (current) random deviation starts to decrease.

Listing 1: Extraction algorithm
    W := ∅
    ForEach α ∈ D:
        Elongate(α, W)
    W := W \ D
    Return W

Listing 2: Elongation procedure Elongate(α, W)
    if r(αx) ≤ r(α), ∀x ∈ Γ then
        W := W ∪ {α}
    else
        ForEach x ∈ Γ:
            if r(αx) > r(α) then Elongate(αx, W)

As it may be seen in Figure 1, smaller seeds allow the algorithm to generate words α corresponding to the first peak (local maximum) of r(α). To produce a longer significant word α, corresponding to the second peak of r(α), a longer seed has to be taken as a starting string. In all our computational experiments, r(α) showed only two peaks, whose localization depends on the genome length.

We would like to extract all the words α such that both α[1, |α|−1] and αx (where αx is any elongation of α occurring in G at least once) own a lower level of significance, namely a lower random deviation, with respect to α. This goal could be reached by examining all the words within G from monomers up to a word length equal to the maximum repeat length of G, and by discarding hapaxes. However, such an approach turns out to be highly expensive, and it cannot be applied efficiently to long genomes. Thus, we developed an expansion procedure with the aim of elongating seed words, say monomers, up to more meaningful words.
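The computation of r(α) in equation (1) can be sketched as follows. To keep the sketch dependency-free we replace the Nelder-Mead fit of E_α with a least-squares fit in log space, and use a 3-point moving average as the low-pass filter; these substitutions, and all names, are our assumptions, not the paper's implementation.

```python
import math

def random_deviation(rdd):
    """Sketch of r(alpha): divergence of an observed RDD from a fitted
    two-parameter exponential a * exp(-b * d), evaluated only on the
    distances actually occurring (the gaps of R_alpha are removed).
    Assumes the RDD contains at least two distinct distances."""
    ds = sorted(rdd)
    counts = [float(rdd[d]) for d in ds]
    m = len(ds)
    # crude low-pass filter (3-point moving average) to attenuate peaks
    smooth = [(counts[max(i - 1, 0)] + counts[i] + counts[min(i + 1, m - 1)]) / 3
              for i in range(m)]
    # least-squares fit of log(count) ~ log(a) - b * d on the denoised RDD
    ys = [math.log(max(s, 1e-12)) for s in smooth]
    mx = sum(ds) / m
    my = sum(ys) / m
    b = -sum((x - mx) * (y - my) for x, y in zip(ds, ys)) \
        / sum((x - mx) ** 2 for x in ds)
    a = math.exp(my + b * mx)
    expo = [a * math.exp(-b * x) for x in ds]
    # normalise both curves into probability distributions
    r = [max(c, 1e-12) / sum(counts) for c in counts]
    e = [max(v, 1e-12) / sum(expo) for v in expo]
    kl = lambda p, q: sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return max(kl(r, e), kl(e, r))          # equation (1)
```

An RDD that already decays exponentially yields a random deviation close to zero, while an RDD with pronounced recurrence peaks yields a larger one, which is the behaviour the W-algorithm exploits.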
The (variable length dictionary) extraction algorithm, combining word elongation and the random deviation test (in the expansion procedure), is given by the two recursive functions in Listings 1 and 2, where D denotes the set of seeds D_k(G). The main idea of the algorithm shown above is to compare the random deviation of a word with those of its elongations. If an elongation results in a word more significant than its root (i.e., its longest proper prefix), then the root word is discarded and the elongated word is selected. The process is applied recursively over the word branching of the selected elements (see Listing 2). Seeds are discarded from the output dictionary.

Three steps are implemented to compute random deviations. For all factors α of the genome: i) the RDD of the current word α is computed, by also removing distribution noise (peaks) and transforming R_α into a probability distribution; ii) an exponential distribution E_α is computed from R_α and normalized to be a probability distribution; iii) the random deviation r(α) is computed by means of the Kullback-Leibler (entropic) divergence.

For the applications, we employ two elongating functions (along the two different directions of the genome double string), and the resulting dictionary is the union of the dictionaries obtained with the two elongations. We refer to W_LR and W_RL as the dictionaries extracted by following the 5′−3′ and 3′−5′ verses, respectively, and to W = W_LR ∪ W_RL as the resulting dictionary. Indeed, the concept of root expressed above is strictly related to the verse in which a text is written and read. However, DNA is a double helix where information resides on both strands, each one having its own reading verse.
Therefore, the increasing random deviation must be investigated in both verses. An application of the algorithm to real genomes shows interesting results, which are evaluated via information measurements on the extracted dictionaries, as described in the next subsection.

Extracted dictionaries are evaluated by means of information measurements such as the word length distribution of their elements (see Tables 2, 3), and its deviation from the minimal forbidden length of the genome. Two parameters we used to evaluate a dictionary D are the sequence coverage, which is the percentage of positions i in the genome such that G[j, k] is a word of the dictionary D for some j < i < k, and the average positional coverage, which is the average over positions i of the number of words G[j, k] of the dictionary D with j < i < k. Intuitively, the first measures the portion of genome occupied by at least one word of the dictionary, and the second the average number of words from the dictionary that occupy each single position of the genome. They are denoted by cov(G, D) and avg(covp(G, D)), respectively.

Dictionaries are tested in terms of both sequence coverage and positional coverage (see Tables 4, 5). A "good" dictionary must have a high sequence coverage, but also a low overlapping degree among its elements. In fact, if we consider D_k(G) as a language, for a certain value of k, then it has the maximum sequence coverage (all positions of the genome would be involved by at least one k-mer) but also the maximum positional coverage, since each position of the sequence is involved by up to k different words of the dictionary. In an ideally good dictionary, both parameters are close to one, meaning that its words cover almost the entire genome and tend not to overlap.
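The two evaluation parameters just defined admit a direct (quadratic-time, illustration-only) computation; the names below are ours, not the paper's.

```python
def coverage_metrics(genome, dictionary):
    """Return (sequence coverage, average positional coverage) of a
    dictionary over a genome, scanning every occurrence of every word."""
    depth = [0] * len(genome)   # number of word occurrences covering each position
    for word in dictionary:
        start = genome.find(word)
        while start != -1:      # overlapping occurrences are included
            for i in range(start, start + len(word)):
                depth[i] += 1
            start = genome.find(word, start + 1)
    seq_cov = sum(1 for d in depth if d > 0) / len(genome)   # cov(G, D)
    avg_pos_cov = sum(depth) / len(genome)                   # avg(covp(G, D))
    return seq_cov, avg_pos_cov
```

On the toy genome "AAAATTTT" with the one-word dictionary {"AAAA"}, both parameters equal 0.5: half the positions are covered, each by exactly one word.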
These parameters were computed both on single and union dictionaries, and on sub-dictionaries W_k of fixed length words, in order to focus our analysis on groups of words having avg(covp(G, W_k)) as close as possible to one. We also checked that our dictionaries are informative enough to contain most of the biologically annotated sequences, and we have filtered them by elimination of suffixes and infixes in order to provide more significant languages. Moreover, by intersection of suitable extracted dictionaries from all human chromosomes we have developed a clustering which confirmed known similarities among them.

Both the algorithms described in the previous section were run over all human chromosomes, by taking them into account individually, and then merging the 24 sets into a single one. Analyses were run over the human reference assembly hg19. Table 1 shows the number of extracted words (that is, dictionary sizes), for each single human chromosome, and their union at the bottom, for both the algorithm in [18] and the V-algorithm, by starting from different seed lengths, and by implementing two filters as redundancy strategies: one discarding duplicates (same words coming from different seed lengths) and the other discarding prefixes (in order to estimate the relative amount of prefixes).
Chr     Original               Original   V-algorithm            V-algorithm
        no dup.     no pref.   ratio      no dup.     no pref.   ratio
1       276,178     210,728    0.763       57,064      57,055    1.000
2       281,698     227,544    0.808      119,582     118,368    0.990
3       259,805     203,888    0.785      102,640     101,142    0.985
4       251,067     201,760    0.804      108,229     106,879    0.988
5       259,167     207,300    0.800      112,846     111,581    0.989
6       255,025     198,487    0.778      106,193     104,510    0.984
7       269,392     208,465    0.774      113,139     111,840    0.989
8       259,586     206,241    0.794      118,551     117,295    0.989
9       212,362     152,523    0.718       33,886      33,878    1.000
10      234,663     186,844    0.796      100,616      99,595    0.990
11      249,374     188,012    0.754       94,484      93,417    0.989
12      247,842     187,931    0.758       99,147      97,579    0.984
13      176,546     149,563    0.847       81,634      78,868    0.966
14      209,881     162,515    0.774       94,312      90,313    0.958
15      207,173     177,125    0.855      107,114     103,917    0.970
16      229,208     166,653    0.727       62,732      62,673    0.999
17      204,905     160,475    0.783       85,091      84,303    0.991
18      161,710     131,900    0.816       65,985      65,558    0.994
19      258,781     197,822    0.764      123,913     122,541    0.989
20      171,474     131,434    0.766       66,320      65,597    0.989
21      130,763     100,427    0.768       50,698      50,233    0.991
22      147,002     120,259    0.818       77,797      74,511    0.958
X       279,938     213,093    0.761      124,793     123,006    0.986
Y       194,014     137,284    0.708       66,088      65,986    0.998
union   4,281,701   3,737,766  0.873      1,813,776   1,798,241  0.991
Table 1: Number of extracted words for each human chromosome and the entire set union, both for the original and the modified clustering based algorithms. Ratios indicate the percentage of single sets, obtained by removing prefixes and duplicates, coming from the different seed length values.

The higher is k, the lower are the C measures of k-mers. Therefore, comparing the C measure of a word, relatively longer than k, with the measure of its proper prefix is more restrictive than a comparison with the measure of the initial word of length k. From this behaviour, we can speculate that the V-algorithm selects words with a higher semantic meaning. In Table 1 it is evident that the V-algorithm extracts a smaller amount of duplicates and prefixes than the algorithm in [18] (even when starting from seeds with different lengths). Indeed, smaller variable length dictionaries were extracted by the V-algorithm, with fewer duplicate discarding steps, and a smaller amount of prefixes (which needed to be discarded in the original algorithm).

The RDD-based W-algorithm was applied (with values for seed length in the range 1−
12) to extract genomic dictionaries from each human chromosome, and some analyses were performed also on the union of such 24 dictionaries. However, here we show data only for some (more explicable) chromosomes, for (more significant) seed lengths up to 8.
word length | dictionary sizes by seed length
 …          63     349     517     995   1,261
 …         201     326     794   1,391   9,126
12          64      91     198     323     973   24,275   97,646
13          21      30      51      81     225    4,592   20,670
14           2       3      10      18      40      875    3,525
15           2       2       5       6      11      190      724
16           4       5       5       5       9       54      165
17           1       1       2       2       3       17       54
18           5      19
19           5
20           6
21           3
22           6
23           1
(seed-length column labels and the rows for word lengths below 12 are not recoverable)
Table 2: Word Length Distribution of human chromosome 1.
The RDD-based W-algorithm run on chromosome 1 produces dictionaries here partitioned according to the word length. Seed length is given as the k value, extracted word length as k̄. The cardinality of fixed length extracted words shows a bimodal trend for seeds shorter than 7 (bold numbers are local maximum values). The longest word extracted by the algorithm is 24 symbols long.

The word length distributions (WLD) related to human chromosomes 1 and 22 are shown respectively in Table 2 and Table 3, which report the cardinality of words both having a given length and being generated by the W-algorithm starting from a given seed length. A common feature (also for the other chromosomes) is to have two modes in the k̄-dictionary sizes, that is, two local maximum values (indicated in bold) for some lengths k̄. In Table 2 such values are 6 (for seeds of length from 1 to 5) and 10-11 (for seeds of length from 2 to 8), while in Table 3 these values are 7 (for seed lengths from 1 to 6) and 10-11 (for seed lengths from 2 to 7). Although they do not have fixed values, they are not very variable, if we consider that chromosomes 22 and 1 are respectively the shortest and the longest ones (see Table 1, where chromosomes are ordered by length). Let us remark here that 9 is the minimal forbidden length for all chromosomes, and 6 is a word length k such that human chromosomes own all the possible k-mers (i.e., hexamers), with high multiplicity. In E. coli, as an example, the two modes have values 5 and 9.

Another empirical result, which was confirmed on all the other chromosomes, is that the dictionary generated by starting from seeds of length k−1 is a proper subset of that generated by starting from seeds of length k, apart from the words of length k. In fact, words with the same length as the seed are eliminated by the algorithm and do not appear in the WLD tables.
In other terms, the numbers in one column of Table 2 count some of the words counted on the same row of the next column (only for rows whose index is greater than the seed length, due to the seed discarding policy of the algorithm). As described in the Material and Methods section, the extracted dictionaries are evaluated according to both their sequence coverage and their (average) positional coverage.
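The word length distributions tabulated above are immediate to compute once a dictionary has been extracted. A minimal sketch in Python (the function name and the toy data are ours, not taken from the paper's software):

```python
from collections import Counter

def word_length_distribution(dictionary):
    # Count how many extracted words there are for each word length;
    # one such distribution per seed length gives a column of a WLD table.
    return Counter(len(w) for w in dictionary)

# Toy dictionary standing in for the output of one extraction run.
words = {"ACGTAG", "TTGACA", "ACGTAGGCAT", "TTGACAGGTT", "ACG"}
wld = word_length_distribution(words)
```

Running one such tabulation per seed length, and laying the results side by side, reproduces the structure of Tables 2 and 3.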
[Table 3, data rows: dictionary cardinalities for word lengths up to 24, one row per word length and one column per seed length.]
Table 3:
Word Length Distribution of human chromosome 22.
RDD-algorithm run on chromosome 22 produces dictionaries, here partitioned according to the word length. Columns correspond to seed lengths, rows to the length k of the extracted words. A bimodal trend for k-mers shorter than 7 may be observed, with local maxima in bold. The longest word extracted by the algorithm is 23 characters long.

Sequence coverage and (average) positional coverage data for chromosome 1 are reported in Table 4 and Table 5 respectively; notably, the goodness of these parameters does not increase with the word or seed length k.
[Table 4, data rows: sequence coverage values, one row per word length and one column per seed length.]
Table 4:
Human chromosome 1: sequence coverage values.
RDD-algorithm run on human chromosome 1 produces dictionaries, here partitioned according to the word length and the seed length. Values in the table give the portion of the genomic sequence covered by each dictionary; bold numbers are the local maxima in each column (showing a bimodal trend).

By observing the data in Table 4, the best coverage of the chromosome (value 0.84) is obtained by the examers derived from 5-mers as seeds, while the average positional coverage of this dictionary is 2.7715 (see Table 5), which is far from one. Nevertheless, this dictionary was our choice for the chromosome clustering analysis described below, because we gave priority to sequence coverage. Regarding positional coverage alone, Table 5 shows that words of length 10 (or longer, for instance 15) exhibit good values (i.e., less than 2) for any seed length up to 7, while examers have good positional coverage only with shorter seeds (of length up to 3).
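The two evaluation parameters can be sketched as follows; this is our reading of them (sequence coverage as the fraction of genome positions covered by at least one word occurrence, average positional coverage as the mean number of occurrences per covered position), and the quadratic scan is for illustration only, not the paper's implementation:

```python
def coverage_stats(genome, dictionary):
    # depth[i] = number of word occurrences covering position i.
    depth = [0] * len(genome)
    for w in dictionary:
        start = genome.find(w)
        while start != -1:
            for i in range(start, start + len(w)):
                depth[i] += 1
            start = genome.find(w, start + 1)  # allow overlapping occurrences
    covered = [d for d in depth if d > 0]
    seq_cov = len(covered) / len(genome)       # fraction of covered positions
    avg_pos_cov = sum(covered) / len(covered) if covered else 0.0
    return seq_cov, avg_pos_cov

# Toy example: two overlapping 4-mers over an 8-base string.
seq_cov, avg_pos_cov = coverage_stats("ACGTACGT", {"ACGT", "CGTA"})
```

A dictionary with sequence coverage close to one and average positional coverage close to one covers almost the whole genome with almost no redundancy, which is the optimum pursued in the text.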
[Table 5, data rows: average positional coverage values, one row per word length and one column per seed length.]
Table 5:
Human chromosome 1: average positional coverage.
RDD-algorithm run on human chromosome 1 produces dictionaries, here partitioned according to the word length and the seed length. Values in the table give the average number of words (counted in the dictionary with their multiplicity on the chromosome) covering single positions; bold numbers are the local maxima in each column.

We extracted dictionaries of examers from each single human chromosome and, from their pairwise intersections, in absolute and relative terms, we found interesting results, reported in Figure 3: four groups of chromosomes may be identified at the second level of the dendrogram, having cardinalities of dictionary intersection of the same order as that of the dictionary extracted from each single chromosome (see the leaves of the dendrogram). Our dictionary based method was thus capable of discriminating, by structure similarity, four clusters of human chromosomes, of sizes 6, 6, 7, and 3, as reported in Figure 3.

Figure 3: Human chromosome clusters.
Starting from the non-reduced dictionary extracted with seed length 5, the number (and percentage) of conserved examers in the pairwise intersection of human chromosomes is reported as a dendrogram in the top figure. The known similarity among chromosomes was recovered by our clustering, reported in more detail as a heatmap in the bottom figure. Four groups were identified, where the dictionary intersection has a size comparable with that of the single chromosomes. These (absolute and percentage) numbers are visible at the second level of the binary tree.

If we look back at Figure 1, we understand that prefixes of other words have to be kept in our dictionary, as roots and morphemes of the genomic language. As discussed in the previous section, indeed, when starting from seeds longer than the first local maximum (of the word random deviation), we miss all the corresponding (shorter) significant words obtained by starting with shorter seeds. On the other hand, we point out here that the word extraction forming the language must not depend on the single point of the genome where the seed is located; hence we need to discard all suffixes from a dictionary (even in natural languages, suffixes are not proper words), as can be seen in Figure 1 (b) and (c). The same argument holds for proper substrings, which may be generated by advancing the initial position of a seed; thus suffixes and substrings have been considered a computational bias and were both discarded. This is the motivation for a final filtering of our dictionaries, producing the data reported in Table 6, where suffixes have been isolated in order to form suffix-free dictionaries.

k | suffixes: |W_k|, cov(G, W_k), avg(covp(G, W_k)) | non-suffixes: |W_k|, cov(G, W_k), avg(covp(G, W_k))
[Table 6, data rows for word lengths from 6 to 18; totals:]
All | 1,683 | 0.867 | 2.344 | 10,096 | 0.923 | 2.928
Table 6: Dictionaries extracted by the RDD-algorithm from human chromosome 1, starting from seeds of length 5, thus from all the 5-mers occurring in G. Dictionary sizes, sequence coverage and (average) positional coverage values are reported, grouped by word length starting from 6. The first three columns count the suffix words, the last three the non-suffix words (bold numbers denote the local maxima in each column).

A refined result of applying the RDD-based extraction algorithm to human chromosomes is a reduced dictionary, generated by the suffix-free and infix-free examers obtained with seed length 5. This final selection of words turned out to include several well known biological sequences. We developed an empirical validation protocol for dictionaries, in order to show their inclusion of biologically annotated regions, such as transcripts, lincRNA (long intergenic non-coding RNA), CpG islands (often occurring close to the TSS, hence overlapping some transcripts), sno miRNA (small nucleolar microRNA), TFBS (Transcription Factor Binding Sites), enhancers (with lengths from 200 to 2000) and other regulatory elements. For space limitations, we omit these results here.

The non-reduced dictionary, instead, is the dictionary extracted by the RDD algorithm without any suffix or infix filtering. In particular, the non-reduced dictionary of examers obtained by the algorithm from seeds of length five was employed here to cluster all human chromosomes, as in Figure 3, where known similarities between human chromosomes were confirmed as a reverse-engineering check of our method. Surprisingly, however, all chromosomes share very few examers (159 are common to all, out of the 1,666 extracted words), which we exhibit as informative conserved sequences, a sort of product of evolutionary selection, to be further analyzed for their biological characterization.
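The suffix- and infix-filtering that produces a reduced dictionary can be sketched as follows; we assume, as a simplification of the policy described above, that a word is discarded whenever it occurs inside another dictionary word at a position greater than zero (which removes suffixes and internal factors while keeping prefixes):

```python
def reduce_dictionary(words):
    # Keep a word only if it never occurs inside another dictionary word
    # at a position > 0: suffixes and internal factors (infixes) are
    # discarded, while prefixes ("roots and morphemes") are kept.
    words = set(words)
    return {w for w in words
            if not any(v != w and w in v[1:] for v in words)}

# Toy dictionary: "GTAG" is a suffix and "CGTA" an infix of "ACGTAG",
# while "ACGT" is a prefix of it, so only "ACGTAG" and "ACGT" survive.
reduced = reduce_dictionary({"ACGTAG", "GTAG", "CGTA", "ACGT"})
```

This quadratic membership test is only illustrative; on real chromosome-scale dictionaries an indexed structure (e.g., a suffix automaton or trie) would be needed.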
Given a genome, we extract a specific set of its factors which represent the building blocks, or semantic units, of a dictionary significant for the genome language. In this work we have described an information theoretical methodology to extract relatively small genomic dictionaries with good properties in terms of genome coverage, as illustrated in Figure 4.

Namely, three methods were presented: one from the literature, introduced in [18], which was our starting point in terms of basic ideas; a variant of it, called V-algorithm, more efficient and more appropriate for extracting genomic dictionaries; and, finally, our RDD based W-algorithm, which originally combines a criterion of anti-randomness with a criterion of seed elongation to select variable length factors.

More precisely, both the V- and W-algorithms are based on the concepts of word elongation and anti-randomness, though in complementary ways. Both consider a word elongation significant if the distribution is far from random according to a given parameter, which is C for the V-algorithm and the value of r in equation (1) for the W-algorithm. Parameter C measures the clustering of word occurrences, so its deviation from randomness increases for words that are (on average) densely distributed, while r measures the difference between the average recurrence distance of a word (i.e., the distance between two of its occurrences) and the distance expected in a Poisson process, the suitable distribution for rare events such as word occurrences in (long) genomes [12]. We may roughly say that the V-algorithm is based on high word density, while the W-algorithm is based on non-random word recurrence.

Figure 4: The RDD based algorithm extracts a certain number of dictionaries, by starting from a given genomic sequence and by applying a two-direction word elongation to all seeds of initial length k (that is, to all k-mers). The union of such dictionaries may be analyzed, filtered, and partitioned according to different strategies (Word Length Distributions, coverage properties, etc.). Reduced dictionaries are suffix-free and infix-free, and provide possible genomic languages. Non-reduced dictionaries of examers have been employed to identify chromosome similarity and clustering.

In section 3.1, performance results compare the V-algorithm with the original method introduced in [18]. Here we may add that both applied methodologies show a weakness in elongating words: often they are not able to extend seed words. On the other hand, when they succeed in word elongation, they extract very long words (up to 1,
941 characters). When the seed length increases, the same semantic prefixes are kept.

The point of our approach is to find the best seed lengths to optimize the performance of the extraction algorithm, and eventually appropriate word lengths giving good dictionaries to investigate. The best performance is obtained by producing relatively small dictionaries with both sequence and average positional coverage as close as possible to one. In section 3.2 we have exhibited dictionaries extracted by the W-algorithm, which show an internal bimodal trend and the property that dictionaries starting from shorter seeds are contained (with respect to words of the same length) in those obtained starting from longer ones (see Tables 2, 3). Preferred seed lengths, providing a better coverage, emerge from an observation of sequence and positional genome coverage (see Tables 4, 5). Dictionaries of examers were identified that reveal a clear similarity pattern among human chromosomes. Then genomic dictionaries were extracted and filtered (by elimination of suffixes and substrings), which resulted in languages composed of roots and morphemes, including known biological sequences.

Other numerical analyses were performed on single chromosomes. As an example, we considered comparisons with examers extracted from other seed lengths, or with the examers composing the extracted longer words. Namely, for human chromosome 22 we analyzed all the examers contained in the 40 words of length greater than 17, extracted by the algorithm starting from seed length 8, to check their intersection with the 1,261 examers reported in Figure 3. The surprising result is that 139 out of the 256 (i.e., a majority of the) examers extracted from the 40 words do not belong to the dictionary of 1,261 examers (extracted with seed length 5). Longer words obtained with longer seeds are composed of generally different examers than those in the dictionary obtained with shorter seeds, and this is promising for the validity of our method, producing almost infix-free dictionaries.
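The check described in the last paragraph, decomposing long extracted words into their examers and intersecting them with a seed-5 dictionary, can be sketched as follows (the words shown are toy stand-ins, not actual extracted sequences):

```python
def examers_of(word):
    # All distinct 6-mers occurring inside a longer extracted word.
    return {word[i:i + 6] for i in range(len(word) - 5)}

long_words = ["ACGTACGTACGTACGTAC"]   # toy stand-in for words longer than 17
seed5_dict = {"ACGTAC", "CGTACG"}     # toy stand-in for the seed-5 examer dictionary
found = set().union(*(examers_of(w) for w in long_words))
missing = found - seed5_dict          # examers absent from the seed-5 dictionary
```

The size of `missing` relative to `found` is the quantity compared in the text (139 out of 256 for chromosome 22).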
Annotated region       cov(A, D)
transcripts            0.258
lincRNA                0.249
CpG islands            0.491
sno miRNA              0.248
TFBS                   0.287
enhancers              0.309
regulatory elements    0.264

Table 7: Coverage of annotated regions by the dictionary extracted from chromosome 22, starting from seeds of length 1.

For future developments of our methodology, we will consider further concepts useful to evaluate dictionaries, and extend our extraction methodology to patterns of words. For example, it is reasonable to require that selected words present a low percentage of possible anagrams effectively appearing within the genomic sequence.

As a further validation of our method, it turned out that the extracted informational dictionaries include several well known biological sequences. We developed an empirical validation protocol for dictionaries that helped us to find several pieces of evidence, which we report here for human chromosome 22, especially for dictionaries D of examers obtained from seeds with length in the range 1-5.

To collect some of the main biologically annotated regions into a dictionary, which we call A, we used the UCSC Genome Browser [52]. Our dictionary A includes: transcripts (from RefSeq [53]), lincRNA (long intergenic non-coding RNA), CpG islands (often occurring close to the TSS, hence overlapping some transcripts), sno miRNA (small nucleolar microRNA), TFBS (Transcription Factor Binding Sites), enhancers (with lengths from 200 to 2000) and regulatory elements (from the ORegAnno database: Open Regulatory Annotation). The class of lincRNA is included in that of lncRNAs (long non-coding RNAs), which, together with ncRNAs (non-coding RNAs shorter than 200 bp), have several biological functions, in transcription, splicing, translation, cell cycle, and apoptosis. lincRNA elements show a tissue specific expression [54] and have an average length of 1000 bases, hence shorter than protein coding transcripts (on average 2900 bases long).

In chromosome 22 (130,481,394 bp long) the above annotated regions count 23,209,047 bases.
However, since they share overlapping words of the chromosome (specifically, 7.1% of the bases of the annotated regions), the dictionary A involves 21,567,860 bases of the whole chromosome 22. Therefore, A has a chromosome coverage equal to 61.8%, while 38.2% remains non-annotated. After the extraction of the dictionary D, there are 14 words which occur only in the non-annotated region and, more generally, 114 words having more than 80% of their occurrences in such a non-annotated region. This is notable, if we consider that this threshold is twice the expected percentage, and we presume an informative value for these specific words.

In Table 7 some annotated regions of A are reported (in the first column) with their coverage by the dictionary of examers extracted with seed length 1. We may notice that the words of the dictionary cover 25.80% of the transcripts, 24.9% of the lincRNAs, 25.77% of the whole chromosome, and 28.09% of the exons. Indeed, if we separate data on introns and exons (as in Table 8), the protein coding (25.8%) and non-coding transcript (25.7%) ratios do not change. From the data in Table 7 we conclude that the dictionary extracted with seed length 1 is present as expected in all the annotated regions we considered, with the exception of CpG islands, where it is 77% more present than expected; however, this is due to the high GC content of short words (which indeed exhibit higher coverage parameters).

The CpG islands are the region most covered by the dictionary (0.491); however, this does not mean they constitute the most informative region. Indeed, words of the extracted dictionary with length from 4 to 6 have a high proportion of G+C (between 0.64 and 0.81), even though they were mostly extracted outside the CpG islands (which cover only 2% of the chromosome).
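The G+C proportion of short dictionary words, used in the observation above, is a one-line computation (toy words, not the extracted ones):

```python
def gc_content(word):
    # Proportion of G and C symbols in a word.
    return sum(1 for c in word if c in "GC") / len(word)

ratios = [gc_content(w) for w in ["GCGC", "CCGTA", "ACGCGT"]]  # toy short words
```

Values in the 0.64 to 0.81 range, as reported for the extracted words of length 4 to 6, are well above the genome-wide GC background, which explains the inflated coverage of CpG islands.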
Words of length 5 and 6, in particular, exhibit a large chromosome coverage, and mainly contribute to the corresponding value in Table 7.

Chromosome 22 reaches a maximum positional coverage (equal to 0.52) with the examers obtained by elongation of seeds of length 1 to 5 (no other seed length can produce examers, since words as long as the seed are discarded and elongation only lengthens them; hence these are all the examers produced by the RDD-algorithm). The coverage of this dictionary over the exons and introns of chromosome 22 may be seen in Table 8.

Seed (new examers)        cov(exons, D)   cov(introns, D)   ratio
Seed 1 (63)               0.1503          0.1370            1.10
Seed 2 (349−63=286)       0.5246          0.4930            1.06
Seed 3 (517−349=168)      0.3444          0.3205            1.07
Seed 4 (995−517=478)      0.6434          0.6275            1.03
Seed 5 (1,261−995=266)    0.4169          0.4010            1.04

Table 8: New examers are added with each row (i.e., with the seed length), and their positional coverage is measured over exons and introns. The values represent the percentage of exons and introns (in the transcripts) covered by the dictionary of examers extracted from seeds with length in the range 1-5; the last column reports the ratio between the two coverage values.

Finally, we verified that the words (among the examers obtained with seed length 1) which occur (about 200 times) across transcripts and exons coincide with known splicing recognition sites: they are the examers (G
CAG|GC, CAG|GGA and CAG|GGC; the consensus sequence CAG|G is denoted in bold). Other examers of the dictionary occur at most 121 times across exons. Transcript coverage and chromosome coverage turn out to coincide for fixed-length extracted dictionaries (subdictionaries partitioned by word length, as in the previous WLD tables), and the same holds for lincRNA. If we focus on words of length 4, we found a couple of them (CG|GT, G|GAG) occurring three times as often as the others (600 occurrences versus a maximum of 200), which coincide with known sequences across exons and introns.
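A coverage value such as cov(A, D) in Table 7 can be sketched as the fraction of bases inside annotated intervals that are covered by at least one dictionary word occurrence; this is our reading of the parameter, illustrated on toy data:

```python
def annotated_coverage(genome, regions, dictionary):
    # Mark every position covered by at least one word occurrence.
    covered = [False] * len(genome)
    for w in dictionary:
        start = genome.find(w)
        while start != -1:
            for i in range(start, start + len(w)):
                covered[i] = True
            start = genome.find(w, start + 1)
    # Fraction of covered bases among those inside annotated intervals.
    annotated = [i for (s, e) in regions for i in range(s, e)]
    return sum(covered[i] for i in annotated) / len(annotated)

# Toy genome with one annotated interval [0, 7) and one dictionary word.
cov = annotated_coverage("AAAACGTACGTTTT", [(0, 7)], {"ACGT"})
```

On real data the annotated intervals would come from UCSC Genome Browser tracks, and the quadratic scan would be replaced by an indexed search.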
More in general, one possible future development of the present approach is the definition of genomic entropic divergences, as a way of comparing genomes and of discovering genomic differences and similarities at a global structural level. Indeed, when two dictionaries D1 and D2 are extracted from two genomes G1 and G2, it is interesting to associate two probability distributions with these dictionaries. An idea (see [49]) is to define a common dictionary D formed by the longest common prefixes of D1 and D2. The intersection Pref(D1) ∩ Pref(D2), after the elimination of words which are prefixes of others, is a good candidate for such a dictionary D, because two distributions may then be defined for the two genomes, by p1(α) = Σ_{αβ ∈ D1} p1(αβ) and p2(α) = Σ_{αβ ∈ D2} p2(αβ) respectively, where p1(αβ) is the frequency of the string αβ in G1 and p2(αβ) its frequency in G2. The entropic divergence between p1 and p2 is a measure of their overall difference, and the more the two dictionaries are expressive for the two genomes, the more this measure is accurate.

References

[1] G. S. Ginsburg and H. F. Willard, editors.
Genomic and Precision Medicine – Foundations, Translation, and Implementation. Elsevier, Third Edition, 2017.
[2] R.N. Mantegna et al. Linguistic features of noncoding DNA sequences. Physical Review Letters, 73(23):3169–72, 1994.
[3] D. B. Searls. The language of genes. Nature, 420:211–217, 2002.
[4] M. Sadovsky, J.A. Putintseva, and A. S. Shchepanovsky. Genes, information and sense: Complexity and knowledge retrieval. Theory in Biosciences, 127(2):69–78, 2008.
[5] S. Neph, J. Vierstra, A. Stergachis, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature, 489:83–90, 2012.
[6] G. Franco and V. Manca. Decoding genomic information. In S. Stepney, S. Rasmussen, and M. Amos, editors,
Computational Matter, chapter 9, pages 129–149. Springer, Cham, 2018.
[7] U. Ferraro Petrillo, G. Roscigno, G. Cattaneo, and R. Giancarlo. Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. Bioinformatics, 34(11):1826–1833, 2018.
[8] Z.D. Zhang, A. Paccanaro, Y. Fu, et al. Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res., 17(6):787–97, 2007.
[9] M. Ortuno, P. Carpena, P. Bernaola-Galván, E. Munoz, and A.M. Somoza. Keyword detection in natural languages and DNA. EPL (Europhysics Letters), 57(5):759, 2007.
[10] The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–72, 2012.
[11] F. Zambelli, G. Pesole, and G. Pavesi. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Briefings in Bioinformatics, page bbs016, 2012.
[12] A. Castellini, G. Franco, and V. Manca. A dictionary based informational genome analysis.
BMC Genomics, 13(1):485, 2012.
[13] F. Zhou, V. Olman, and Y. Xu. Barcodes for genomes and applications. BMC Bioinformatics, 9:546, 2008.
[14] S. Vinga. Information theory applications for biological sequence analysis. Briefings in Bioinformatics, 15(3):376–389, 2013.
[15] G. E. Sims, S.R. Jun, G. A. Wu, and S.H. Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. PNAS, 106(8):2677–82, 2009.
[16] B. Chor, D. Horn, N. Goldman, et al. Genomic DNA k-mer spectra: models and modalities. Genome Biology, 10:R108, 2009.
[17] Y. Zheng, H. Li, Y. Wang, et al. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Research, [Epub ahead of print]:1–17, 2017.
[18] P. Carpena, P. Bernaola-Galván, M. Hackenberg, A.V. Coronado, and J.L. Oliver. Level statistics of words: Finding keywords in literary texts and symbolic sequences. Physical Review E, 79(3):035102, 2009.
[19] M. Hackenberg, A. Rueda, P. Carpena, P. Bernaola-Galván, G. Barturen, and J. L. Oliver. Clustering of DNA words and biological function: A proof of principle.
Journal of Theoretical Biology, 297:127–136, 2012.
[20] G. Franco and A. Milanese. An investigation on genomic repeats. In Conference on Computability in Europe – CiE, volume 7921 of Lecture Notes in Computer Science, pages 149–160. Springer, 2013.
[21] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.
[22] J. H. Holland. Emergence: From Chaos to Order. Perseus Books, Cambridge, Massachusetts, 1998.
[23] S. G. Kong, H.-D. Chen, W.-L. Fan, et al. Quantitative measure of randomness and order for complete genomes. Phys Rev E, 79(6):061911, 2009.
[24] P. Kolekar, M. Kale, and U. Kulkarni-Kale. Alignment-free distance measure based on return time distribution for sequence analysis: Applications to clustering, molecular phylogeny and subtyping.
Molecular Phylogenetics and Evolution, 65:510–22, 2012.
[25] V. Bonnici and V. Manca. Recurrence distance distributions in computational genomics. Am. J. Bioinformatics and Computational Biology, 3(1):5–23, 2015.
[26] L. Gatlin et al. Information Theory and the Living System. Columbia University Press, 1972.
[27] S. P. Harter. A Probabilistic Approach to Automatic Keyword Indexing. PhD thesis, University of Chicago, 1974.
[28] V. Mäkinen, D. Belazzougui, F. Cunial, and A. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015.
[29] S. P. Garcia, A. J. Pinho, J. M. O. S. Rodrigues, C. A. C. Bastos, and P. J. S. G. Ferreira. Minimal absent words in prokaryotic and eukaryotic genomes. PLoS One, 6(1), 2011.
[30] A. L. Price, N. C. Jones, and P. A. Pevzner. De novo identification of repeat families in large genomes.
Bioinformatics, 21(suppl 1):i351–i358, 2005.
[31] I. Grissa, G. Vergnaud, and C. Pourcel. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Research, 35(suppl 2):W52–W57, 2007.
[32] G. Bejerano, Y. Seldin, H. Margalit, and N. Tishby. Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics, 17(10):927–934, 2001.
[33] A. Apostolico. Maximal words in sequence comparisons based on subword composition. In Algorithms and Applications, pages 34–44. Springer, 2010.
[34] L. Parida, C. Pizzi, and S.E. Rombo. Irredundant tandem motifs. Theoretical Computer Science, 525:89–102, 2014.
[35] J. Qian and M. Comin. MetaCon: unsupervised clustering of metagenomic contigs with probabilistic k-mers statistics and coverage. BMC Bioinformatics, 20(367), 2019.
[36] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation.
IEEE Transactions on Signal Processing, 54:4311–4322, 2006.
[37] V. Bonnici and V. Manca. Infogenomics tools: a computational suite for informational analysis of genomes. Bioinform. Proteomics Rev., 1(1):7–14, 2015.
[38] U. Ferraro Petrillo, M. Sorella, G. Cattaneo, R. Giancarlo, and S.E. Rombo. Analyzing big datasets of genomic sequences: Fast and scalable collection of k-mer statistics. BMC Bioinformatics, 20(138):138–151, 2019.
[39] A. S. Nair and T. Mahalakshmi. Visualization of genomic data using inter-nucleotide distance signals. Proceedings of IEEE Genomic Signal Processing, 408, 2005.
[40] V. Afreixo, C. A. C. Bastos, A. J. Pinho, S. P. Garcia, and P. J. S. G. Ferreira. Genome analysis with inter-nucleotide distances. Bioinformatics, 25(23):3064–3070, 2009.
[41] C. A. C. Bastos, V. Afreixo, A. J. Pinho, S. P. Garcia, J. M. O. S. Rodrigues, and P. J. S. G. Ferreira. Inter-dinucleotide distances in the human genome: an analysis of the whole-genome and protein-coding distributions. Journal of Integrative Bioinformatics, 8(3):172, 2011.
[42] P. Carpena, P. Bernaola-Galván, and P. Ch. Ivanov. New class of level statistics in correlated disordered chains.
Physical Review Letters, 93(17):176804, 2004.
[43] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. John Wiley & Sons, 1968.
[44] G. Hampikian and T. Andersen. Absent sequences: nullomers and primes. Pacific Symposium on Biocomputing, 12:355–366, 2007.
[45] J. Herold, S. Kurtz, and R. Giegerich. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9:167, 2008.
[46] F. Mignosi, A. Restivo, and M. Sciortino. Words and forbidden factors. Theoretical Computer Science, 273(1):99–117, 2002.
[47] G. Fici, F. Mignosi, A. Restivo, and M. Sciortino. Word assembly through minimal forbidden words. Theoretical Computer Science, 359(1):214–230, 2006.
[48] V. Bonnici and V. Manca. Informational laws of genome structures.
Scientific Reports, 6:28840, 2016.
[49] V. Manca. The principles of informational genomics. Theoretical Computer Science, 2017.
[50] A. Kolmogorov. On tables of random numbers. Theoretical Computer Science, 207(2):387–395, 1998.
[51] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[52] W. J. Kent et al. The human genome browser at UCSC. Genome Research, 12(6):996–1006, 2002.
[53] K. Pruitt, T. Tatusova, and D. Maglott. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 33:D501–D504, 2005.
[54] M. Cabili et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.