SPRISS: Approximating Frequent k-mers by Sampling Reads, and Applications
Diego Santoro†‡ ([email protected])   Leonardo Pellegrina†‡ ([email protected])   Fabio Vandin†§ ([email protected])

Abstract
The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. In this work we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS employs a simple yet powerful read sampling scheme, which extracts a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets and the identification of discriminative k-mers, to extract insights in a fraction of the time required by the analysis of the whole dataset.

Keywords: k-mer analysis; frequent k-mers; read sampling; pseudodimension

* Part of this work was supported by the MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data) and the initiative "Departments of Excellence" (Law 232/2016), and by the Univ. of Padova under project SEED 2020 RATED-X.
† Department of Information Engineering, University of Padova, Padova (Italy).
‡ These authors contributed equally to this work.
§ Corresponding author.
1 Introduction
The study of substrings of length k, or k-mers, is a fundamental task in the analysis of large next-generation sequencing datasets. The extraction of k-mers, and of the frequencies with which they appear in a dataset of reads, is a crucial step in several applications, including the comparison of datasets and reads classification in metagenomics [59], the characterization of variation in RNA-seq data [3], the analysis of structural changes in genomes [22, 23], RNA-seq quantification [40, 62], fast search-by-sequence over large high-throughput sequencing repositories [53], genome comparison [51], and error correction for genome assembly [19, 50].

k-mers and their frequencies can be obtained with a linear scan of a dataset. However, due to the massive size of modern datasets and the exponential growth of the number of k-mers (with respect to k), the extraction of k-mers is an extremely computationally intensive task, both in terms of running time and memory [13], and several algorithms have been proposed to reduce the running time and memory requirements (see Section 1.2). Nonetheless, the extraction of all k-mers and their frequencies from a reads dataset is still highly demanding in terms of time and memory (e.g., KMC 3 [20], one of the currently best performing tools for k-mer counting, requires more than . hours, GB of memory, and GB of space on disk on a sequence of Gbases [20], and from our experiments more than minutes, GB of memory, and GB of disk space for counting k-mers from the Mo17 dataset¹).

While some applications, such as error correction [19, 50] or reads classification [59], require identifying all k-mers, even the ones that appear only once or a few times in a dataset, other analyses, such as the comparison of abundances in metagenomic datasets [4, 11, 12, 41] or the discovery of k-mers discriminating between two datasets [37, 23], hinge on the identification of frequent k-mers, which are k-mers appearing with a (relatively) high frequency in a dataset. For the latter analyses, tools capable of efficiently extracting only frequent k-mers would be extremely beneficial and much more efficient than tools reporting all k-mers, given that a large fraction of k-mers appear with extremely low frequency. However, the efficient identification of frequent k-mers and their frequencies is still relatively unexplored (see Section 1.2).

A natural approach to speed up the identification of frequent k-mers is to analyze only a sample of the data, since frequent k-mers appear with high probability in a sample, while infrequent k-mers appear with lower probability. A major challenge in sampling approaches is how to rigorously relate the results obtained by analyzing the sample to the results that would be obtained by analyzing the whole dataset. Tackling such a challenge requires identifying a minimum sample size which guarantees that the results on the sample well represent the results that would be obtained on the whole dataset. An additional challenge in the use of sampling for the identification of frequent k-mers is due to the fact that, for values of k of interest in modern applications (e.g., k ∈ [20, )), even the most frequent k-mers appear in a relatively low portion of the data (e.g., − - − ).
The net effect is that the application of standard sampling techniques to rigorously approximate frequent k-mers results in sample sizes larger than the initial dataset.

1.1 Our Contributions

In this work we study the problem of approximating frequent k-mers in a dataset of reads. In this regard, our contributions are:

• We propose SPRISS, a SamPling Reads algorIthm to eStimate frequent k-merS². SPRISS is based on a simple yet powerful read sampling approach, which renders SPRISS very flexible and suitable to be used in combination with any k-mer counter. In fact, the read sampling scheme of SPRISS returns a representative subset of a dataset of reads, which can be used to obtain representative results for downstream analyses based on frequent k-mers.

• We prove that SPRISS provides rigorous guarantees on the quality of the approximation of the frequent k-mers. In this regard, our main technical contribution is the derivation of the sample size required by SPRISS, obtained through the study of the pseudodimension [42], a key concept from statistical learning theory, of k-mers in reads.

• We show on several real datasets that SPRISS approximates frequent k-mers with high accuracy, while requiring a fraction of the time needed by approaches that analyze all k-mers in a dataset.

• We show the benefits of using the approximation of frequent k-mers obtained by SPRISS in two applications: the comparison of metagenomic datasets, and the extraction of discriminative k-mers. In both applications SPRISS significantly speeds up the analysis, while providing the same insights obtained by the analysis of the whole data.

¹ Using k = 31, workers, and maximum RAM of GB. See Supplemental Table 3 for the size of Mo17.
² https://vec.wikipedia.org/wiki/Spriss
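The read sampling scheme at the core of these contributions is deliberately simple, which is what makes it compatible with any k-mer counter. A minimal sketch (hypothetical helper name, not the paper's actual implementation):

```python
import random

def sample_reads(reads, m, seed=None):
    """Draw m reads independently and uniformly at random, with replacement.

    The returned bag of reads can be fed to any k-mer counter to estimate
    k-mer frequencies on the sample instead of on the whole dataset."""
    rng = random.Random(seed)
    n = len(reads)
    return [reads[rng.randrange(n)] for _ in range(m)]
```

Because each read is drawn independently with replacement, such a sample can also be materialized once and reused across several downstream analyses.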
1.2 Related Work

The problem of exactly counting k-mers in datasets has been extensively studied, with several methods proposed for its solution [21, 26, 32, 46, 2, 47, 20, 39]. Such methods are typically highly demanding in terms of time and memory when analyzing large high-throughput sequencing datasets [13]. For this reason, many methods have recently been developed to compute approximations of the k-mer abundances to reduce the computational cost of the task (e.g., [31, 52, 34, 8, 61, 39]). However, such methods do not provide guarantees on the accuracy of their approximations that are simultaneously valid for all (or the most frequent) k-mers. In recent years other problems closely related to the task of counting k-mers have been studied, including how to efficiently index [38, 15, 30, 28], represent [7, 10, 1, 14, 14, 29, 17, 44], query [53, 54, 60, 55, 5, 27], and store [18, 35, 16, 43] the massive collections of sequences or of k-mers that are extracted from the data.

A natural approach to reduce computational demands is to analyze a small sample instead of the entire dataset. To this end, methods that perform a downsampling of massive datasets have recently been proposed [6, 58, 9]. These methods focus on discarding reads of the dataset that are very similar to the reads already included in the sample, computing approximate similarity measures (e.g., the Jaccard similarity) as each read is considered. Such measures are designed to maximise the diversity of the content of the reads in the sample. This approach is well suited for applications where rare k-mers are important, but it is less relevant for analyses, of interest to this work, where the most frequent k-mers carry the major part of the information. Furthermore, these methods have a heuristic nature, and do not provide guarantees on the relation between the accuracy of the analysis performed on the sample and the analysis performed on the entire dataset.
SAKEIMA [41] is the first sampling method that provides an approximation of the set of frequent k-mers (together with their estimated frequencies) with rigorous guarantees, based on counting only a subset of all occurrences of k-mers, chosen at random. SAKEIMA performs a full scan of the entire dataset, in a streaming fashion, and processes each k-mer occurrence according to the outcome of its random choices. SPRISS, the algorithm we present in this work, is instead the first sampling algorithm to approximate frequent k-mers (and their frequencies), with rigorous guarantees, by sampling reads from the dataset. In fact, SPRISS does not require to receive in input and to scan the entire dataset; instead, it needs in input only a small sample of reads drawn from the dataset, a sample that may be obtained, for example, at the time of the physical creation of the whole dataset. While the sampling strategy of SAKEIMA could be analyzed using the concept of VC dimension [57], the read-sampling strategy of SPRISS requires the more sophisticated concept of pseudodimension [42] for its analysis.

In this work we consider the use of SPRISS to speed up the computation of the Bray-Curtis distance between metagenomic datasets and the identification of discriminative k-mers. Computational tools for these problems have been recently proposed [4, 49]. These tools are based on exact k-mer counting strategies, and the approach we propose with SPRISS could be applied to such strategies as well.
2 Preliminaries

Let Σ be an alphabet of σ symbols. A dataset D = {r_1, . . . , r_n} is a bag of |D| = n reads, where, for i ∈ {1, . . . , n}, a read r_i is a string of length n_i built from Σ. For a given integer k, a k-mer K is a string of length k on Σ, that is K ∈ Σ^k. Given a k-mer K, a read r_i of D, and a position j ∈ {0, . . . , n_i − k}, we define the indicator function φ_{r_i,K}(j) to be 1 if K appears in r_i at position j, that is K[h] = r_i[j + h] ∀h ∈ {0, . . . , k − 1}, while φ_{r_i,K}(j) is 0 otherwise. The size t_{D,k} of the multiset of k-mers that appear in D is t_{D,k} = ∑_{r_i ∈ D} (n_i − k + 1). The average size of the multiset of k-mers that appear in a read of D is ℓ_{D,k} = t_{D,k}/n, while the maximum value of such quantity is ℓ_{max,D,k} = max_{r_i ∈ D} (n_i − k + 1). The support o_D(K) of k-mer K in dataset D is the number of distinct positions of D where k-mer K appears, that is o_D(K) = ∑_{r_i ∈ D} ∑_{j=0}^{n_i−k} φ_{r_i,K}(j). The frequency f_D(K) of a k-mer K in D is the fraction of all positions in D where K appears, that is f_D(K) = o_D(K)/t_{D,k}.

The task of finding frequent k-mers (FKs) is defined as follows: given a dataset D, a positive integer k, and a minimum frequency threshold θ ∈ (0, 1], find the set FK(D, k, θ) of all the k-mers whose frequency in D is at least θ, together with their frequencies, that is FK(D, k, θ) = {(K, f_D(K)) : K ∈ Σ^k, f_D(K) ≥ θ}.

The set of frequent k-mers can be computed by scanning the dataset and counting the number of occurrences of each k-mer. However, when dealing with a massive dataset D, the exact computation of the set FK(D, k, θ) requires a large amount of time and memory. For this reason, one could instead focus on finding an approximation of FK(D, k, θ) with rigorous guarantees on its quality. In this work we consider the following approximation, introduced in [41].

Definition 1.
Given a dataset D, a positive integer k, a frequency threshold θ ∈ (0, 1], and an accuracy parameter ε ∈ (0, θ), an ε-approximation C = {(K, f_K) : K ∈ Σ^k, f_K ∈ [0, 1]} of FK(D, k, θ) is a set of pairs (K, f_K) with the following properties:

• C contains a pair (K, f_K) for every (K, f_D(K)) ∈ FK(D, k, θ);
• C contains no pair (K, f_K) such that f_D(K) < θ − ε;
• for every (K, f_K) ∈ C, it holds |f_D(K) − f_K| ≤ ε/2.

Intuitively, the approximation C contains no false negatives (i.e., all the frequent k-mers in FK(D, k, θ) are in C) and no k-mer whose frequency in D is much smaller than θ. In addition, the frequencies in C are good approximations of the actual frequencies in D, i.e., within a small error ε/2.

Definition 2.
Given a dataset D of n reads, we define a reads sample S of D as a bag of m reads, sampled independently and uniformly at random, with replacement, from the bag of reads in D.

A natural way to compute an approximation of the set of frequent k-mers is by processing a sample, i.e., a small portion of the dataset D, instead of the whole dataset. While previous work [41] considered samples obtained by drawing k-mers independently from D, we consider samples obtained by drawing entire reads. As explained in Section 1.1, our approach has several advantages, including the fact that it can be combined with any efficient k-mer counting procedure, and that it can be used to extract a representative subset of the data on which to conduct downstream analyses, obtaining the same insights in a fraction of the time required to process the whole dataset. Such representative subsets could be stored and used for exploratory analyses, with a gain in terms of space and time requirements compared to using the whole dataset. However, the development of an efficient scheme to effectively approximate the frequency of all frequent k-mers by sampling reads is highly nontrivial, due to dependencies among k-mers appearing in the same read. In the next sections, we develop and analyze algorithms to approximate FK(D, k, θ) by read sampling, starting from a straightforward, but inefficient, approach (Section 3), then showing how pseudodimension can be used to improve the sample size required by such approach (Section 4), and culminating in our algorithm SPRISS, the first efficient algorithm to approximate frequent k-mers by read sampling (Section 5).

3 Approximating Frequent k-mers by Sampling Reads

A first, simple approach to approximate the set
FK(D, k, θ) of frequent k-mers consists of taking a sample S of m reads, with m large enough, and reporting in output the set FK(S, k, θ − ε/2) of k-mers that appear with frequency at least θ − ε/2 in the sample S. The following result, obtained by combining Hoeffding's inequality [33] and a union bound, provides an upper bound to the number m of reads required to have guarantees on the quality of the approximation (see Supplemental Material Section A for the full analysis).

Proposition 1.
Consider a sample S of m reads from D. For fixed frequency threshold θ ∈ (0, 1], error parameter ε ∈ (0, θ), and confidence parameter δ ∈ (0, 1), if

m ≥ (2/ε²) (ℓ_{max,D,k}/ℓ_{D,k})² (ln(2σ^k) + ln(1/δ))

then, with probability ≥ 1 − δ, FK(S, k, θ − ε/2) is an ε-approximation of FK(D, k, θ).

While the result above provides a first bound to the number m of reads required to obtain a rigorous approximation of the frequent k-mers, it usually results in a sample size m larger than |D| (this is due to the need for ε to be small in order to obtain meaningful approximations, see Section 6.2), making the sampling approach useless. Thus, in the next sections we propose advanced methods to reduce the sample size m.

4 Pseudodimension to Improve Frequent k-mers Approximation by Sampling Reads

In this section we introduce the notion of pseudodimension and we use it to improve the bound on the sample size m of Proposition 1.

Let F be a class of real-valued functions from a domain X to [a, b] ⊂ R. Consider, for each f ∈ F, the subset of X′ = X × [a, b] defined as R_f = {(x, t) : t ≤ f(x)}, and call it a range. Let F⁺ = {R_f , f ∈ F} be a range set on X′, and let its corresponding range space Q′ be Q′ = (X′, F⁺). We say that a subset D ⊂ X′ is shattered by F⁺ if the size of the projection set proj_{F⁺}(D) = {r ∩ D : r ∈ F⁺} is equal to 2^{|D|}. The VC dimension
VC(Q′) of Q′ is the maximum size of a subset of X′ shattered by F⁺. The pseudodimension PD(X, F) is then defined as the VC dimension of Q′: PD(X, F) = VC(Q′).

Let π be the uniform distribution on X, and let S be a sample of X of size |S| = m, with every element of S sampled independently and uniformly at random from X. We define, ∀f ∈ F, f_S = (1/m) ∑_{x ∈ S} f(x) and f_X = E_{x∼π}[f(x)]. Note that E[f_S] = f_X. The following result relates the accuracy and confidence parameters ε, δ and the pseudodimension with the probability that the expected values of the functions in F are well approximated by their averages computed from a finite random sample.

Proposition 2 ([56, 25]). Let X be a domain and F be a class of real-valued functions from X to [a, b]. Let PD(X, F) = VC(Q′) ≤ v. There exists an absolute positive constant c such that, for fixed ε, δ ∈ (0, 1), if S is a random sample of m samples drawn independently and uniformly at random from X with

m ≥ (c(b − a)²/ε²) (v + ln(1/δ))

then, with probability ≥ 1 − δ, it holds simultaneously ∀f ∈ F that |f_S − f_X| ≤ ε. The universal constant c has been experimentally estimated to be at most 0.5 [24].

We now define the range space associated to k-mers, derive an upper bound to its pseudodimension, and use the result above to derive an improved bound on the number m of reads to be sampled in order to obtain a rigorous approximation of the frequent k-mers. Let k be a positive integer and D be a bag of n reads. Define the domain X as the set of integers {1, . . . , n}, where every i ∈ X corresponds to the i-th read of D.
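Before the k-mer function family is defined formally, a small numeric sketch (toy data, Section 2 notation, hypothetical helper names) of the quantity being approximated: averaging per-read occurrence counts of K, normalized by ℓ_{D,k}, over uniformly drawn read indices gives an unbiased estimate of f_D(K).

```python
import random
from collections import Counter

def exact_frequency(reads, k, K):
    """f_D(K) = o_D(K) / t_{D,k} (Section 2 definitions)."""
    counts = Counter(r[j:j + k] for r in reads for j in range(len(r) - k + 1))
    return counts[K] / sum(counts.values())

def sampled_estimate(reads, k, K, m, seed=0):
    """Average of f_K(i) over m uniform read indices i; E[estimate] = f_D(K)."""
    rng = random.Random(seed)
    n = len(reads)
    l_avg = sum(len(r) - k + 1 for r in reads) / n        # l_{D,k}
    total = 0.0
    for _ in range(m):
        r = reads[rng.randrange(n)]                       # uniform read index
        occ = sum(r[j:j + k] == K for j in range(len(r) - k + 1))
        total += occ / l_avg                              # the value f_K(i)
    return total / m
```

On toy data the empirical average concentrates around f_D(K) as m grows, which is exactly the uniform-convergence behaviour that Proposition 2 quantifies simultaneously over all k-mers.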
Then define the family of real-valued functions F = {f_K , ∀K ∈ Σ^k} where, for every i ∈ X and for every f_K ∈ F, the function f_K(i) is the number of distinct positions in read r_i where k-mer K appears divided by the average size of the multiset of k-mers that appear in a read of D: f_K(i) = (∑_{j=0}^{n_i−k} φ_{r_i,K}(j)) / ℓ_{D,k}. Therefore f_K(i) ∈ [0, ℓ_{max,D,k}/ℓ_{D,k}]. For each f_K ∈ F, the subset of X′ = X × [0, ℓ_{max,D,k}/ℓ_{D,k}] defined as R_{f_K} = {(i, t) : t ≤ f_K(i)} is the associated range. Let F⁺ = {R_{f_K} , f_K ∈ F} be the range set on X′, and let its corresponding range space Q′ be Q′ = (X′, F⁺).

A trivial upper bound to PD(X, F) is given by PD(X, F) ≤ ⌊log₂ |F|⌋ = ⌊log₂ σ^k⌋. The following result provides an improved upper bound to PD(X, F) (the proof is in Supplemental Material Section B; see Proposition 12).

Proposition 3. Let D be a bag of n reads, k a positive integer, X = {1, . . . , n} be the domain, and let the family F of real-valued functions be F = {f_K , ∀K ∈ Σ^k}. Then the pseudodimension PD(X, F) satisfies PD(X, F) ≤ ⌊log₂(ℓ_{max,D,k})⌋ + 1.

Combining Proposition 2 and Proposition 3, we derive the following (see Supplemental Material Section B for the full analysis).
Proposition 4.
Let S be a sample of m reads from D. For fixed threshold θ ∈ (0, 1], error parameter ε ∈ (0, θ), and confidence parameter δ ∈ (0, 1), if

m ≥ (2/ε²) (ℓ_{max,D,k}/ℓ_{D,k})² (⌊log₂ min(2ℓ_{max,D,k} , σ^k)⌋ + ln(1/δ))

then, with probability ≥ 1 − δ, FK(S, k, θ − ε/2) is an ε-approximation of FK(D, k, θ).

This bound significantly improves on the one in Proposition 1, since the factor ln(2σ^k) is reduced to ⌊log₂ min(2ℓ_{max,D,k} , σ^k)⌋. However, even the bound from Proposition 4 results in a sample size m larger than |D|. In the following section we propose a method to further reduce the sample size m, which results in a practical sampling approach.

5 SPRISS: Sampling Reads Algorithm to Estimate Frequent k-mers

We now introduce
SPRISS, which approximates the frequent k-mers by sampling bags of reads. We define I_ℓ = {i_1, i_2, . . . , i_ℓ} as a bag of ℓ indexes of reads of D chosen uniformly at random, with replacement, from the set {1, . . . , n}. Then we define an ℓ-reads sample S_ℓ as a collection of m bags of ℓ reads: S_ℓ = {I_{ℓ,1}, . . . , I_{ℓ,m}}.

Let k be a positive integer and D be a bag of n reads. Define the domain X as the set of bags of ℓ indexes of reads of D. Then define the family of real-valued functions F = {f_{K,ℓ} , ∀K ∈ Σ^k} where, for every I_ℓ ∈ X and for every f_{K,ℓ} ∈ F, we have f_{K,ℓ}(I_ℓ) = min(1, o_{I_ℓ}(K)) / (ℓℓ_{D,k}), where o_{I_ℓ}(K) = ∑_{i ∈ I_ℓ} ∑_{j=0}^{n_i−k} φ_{r_i,K}(j) counts the number of occurrences of K in all the ℓ reads of I_ℓ. Therefore f_{K,ℓ}(I_ℓ) ∈ {0, 1/(ℓℓ_{D,k})} ∀f_{K,ℓ} and ∀I_ℓ. For each f_{K,ℓ} ∈ F, the subset of X′ = X × {0, 1/(ℓℓ_{D,k})} defined as R_{f_{K,ℓ}} = {(I_ℓ, t) : t ≤ f_{K,ℓ}(I_ℓ)} is the associated range. Let F⁺ = {R_{f_{K,ℓ}} , f_{K,ℓ} ∈ F} be the range set on X′, and let its corresponding range space Q′ be Q′ = (X′, F⁺).

Note that, for a given bag I_ℓ, the functions f_{K,ℓ} are biased if K appears more than once in all the ℓ reads of I_ℓ. We prove the following upper bound to the pseudodimension PD(X, F) (see Proposition 14 of Supplemental Material Section C).

Proposition 5.
The pseudodimension
PD(X, F) satisfies PD(X, F) ≤ ⌊log₂(ℓℓ_{max,D,k})⌋ + 1.

We define the frequency f̂_{S_ℓ}(K) of a k-mer K obtained from the sample S_ℓ of bags of reads as f̂_{S_ℓ}(K) = (1/m) ∑_{I_{ℓ,i} ∈ S_ℓ} f_{K,ℓ}(I_{ℓ,i}). Note that f̂_{S_ℓ}(K) is a "biased" version of f_{S_ℓ}(K) = (1/m) ∑_{I_{ℓ,i} ∈ S_ℓ} o_{I_{ℓ,i}}(K)/(ℓℓ_{D,k}), which is an unbiased estimator of f_D(K) (i.e., E[f_{S_ℓ}(K)] = f_D(K)).

The following is our main technical result, and establishes a rigorous relation between the number m of bags of ℓ reads and the guarantees obtained by approximating the frequency f_D(K) of a k-mer K with its (biased) estimate f̂_{S_ℓ}(K). (The full analysis is in Supplemental Material Section C; see Proposition 17.)

Proposition 6.
Let k and ℓ be two positive integers. Consider a sample S_ℓ of m bags of ℓ reads from D. For fixed frequency threshold θ ∈ (0, 1], error parameter ε ∈ (0, θ), and confidence parameter δ ∈ (0, 1), if

m ≥ (2/ε²) (1/(ℓℓ_{D,k}))² (⌊log₂ min(2ℓℓ_{max,D,k} , σ^k)⌋ + ln(1/δ))    (1)

then, with probability at least 1 − δ:

• for any k-mer K ∈ FK(D, k, θ) such that f_D(K) ≥ θ̃ = (ℓ_{max,D,k}/ℓ_{D,k}) (1 − (1 − ℓℓ_{D,k}θ)^{1/ℓ}) it holds f̂_{S_ℓ}(K) ≥ θ − ε/2;
• for any k-mer K with f̂_{S_ℓ}(K) ≥ θ − ε/2 it holds f_D(K) ≥ θ − ε;
• for any k-mer K ∈ FK(D, k, θ) it holds f_D(K) ≥ f̂_{S_ℓ}(K) − ε/2;
• for any k-mer K with ℓℓ_{D,k}(f̂_{S_ℓ}(K) + ε/2) ≤ 1 it holds f_D(K) ≤ (ℓ_{max,D,k}/ℓ_{D,k}) (1 − (1 − ℓℓ_{D,k}(f̂_{S_ℓ}(K) + ε/2))^{1/ℓ}).

Given a sample S_ℓ of m bags of ℓ reads from D, with m satisfying the condition of Proposition 6, the set A = {(K, f_{S_ℓ}(K)) : f̂_{S_ℓ}(K) ≥ θ − ε/2} is almost an ε-approximation of FK(D, k, θ): Proposition 6 ensures that all k-mers in A have frequency f_D(K) ≥ θ − ε with probability at least 1 − δ, but it does not guarantee that all k-mers with frequency in [θ, θ̃) will be in output. However, we show in Section 6.2 that, in practice, almost all of them are reported by SPRISS. We further remark that the derivations of [41] to obtain tight confidence intervals for f_D(K) using multiple values of ℓ are relevant also for the sampling scheme we employ in SPRISS; we will extend our analysis in this direction in the full version of this work.
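To see why the bag-based bound makes sampling practical, the three sample sizes can be compared numerically. The sketch below assumes the reconstructed formulas of Propositions 1 and 4 and Eq. (1), illustrative HMP-like parameters (reads with ≈ 70 k-mer positions, k = 31), and an assumed bag size ℓ ≈ 0.9/(θℓ_{D,k}); every concrete number here is hypothetical and for orders of magnitude only.

```python
import math

def m_hoeffding(eps, delta, k, sigma, l_avg, l_max):
    # Proposition 1: Hoeffding's inequality + union bound over all sigma^k k-mers.
    return math.ceil((2 / eps**2) * (l_max / l_avg)**2
                     * (math.log(2 * sigma**k) + math.log(1 / delta)))

def m_pseudodim(eps, delta, k, sigma, l_avg, l_max):
    # Proposition 4: pseudodimension bound, samples of single reads.
    v = math.floor(math.log2(min(2 * l_max, sigma**k)))
    return math.ceil((2 / eps**2) * (l_max / l_avg)**2 * (v + math.log(1 / delta)))

def m_spriss_bags(eps, delta, k, sigma, l, l_avg, l_max):
    # Eq. (1): number m of bags of l reads (total reads sampled: m * l).
    v = math.floor(math.log2(min(2 * l * l_max, sigma**k)))
    return math.ceil((2 / eps**2) * (1 / (l * l_avg))**2 * (v + math.log(1 / delta)))

# Illustrative parameters (assumed, not taken from the paper's tables).
theta, eps, delta, k, sigma = 1e-7, 5e-8, 0.1, 31, 4
l_avg = l_max = 70.0
l = int(0.9 / (theta * l_avg))          # assumed bag size
total_reads = m_spriss_bags(eps, delta, k, sigma, l, l_avg, l_max) * l
```

With these numbers the first two bounds exceed any realistic dataset size, while the bag-based bound requires only a few tens of millions of reads in total, i.e., a fraction of an HMP-sized dataset.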
Algorithm 1: SPRISS(D, k, θ, δ, ε, ℓ)
Data: D, k, θ ∈ (0, 1], δ ∈ (0, 1), ε ∈ (0, θ), integer ℓ ≥ 1
Result: Approximation A of FK(D, k, θ) with probability at least 1 − δ
1  m ← ⌈(2/ε²) (1/(ℓℓ_{D,k}))² (⌊log₂ min(2ℓℓ_{max,D,k}, σ^k)⌋ + ln(1/δ))⌉;
2  S ← sample of exactly mℓ reads drawn from D;
3  T ← exact_counting(S, k);
4  S_ℓ ← random partition of S into m bags of ℓ reads each;
5  A ← ∅;
6  forall (K, o_S(K)) ∈ T do
7      S_K ← number of bags of S_ℓ where K appears;
8      f̂_{S_ℓ}(K) ← S_K/(mℓℓ_{D,k});
9      f_{S_ℓ}(K) ← o_S(K)/(mℓℓ_{D,k});
10     if f̂_{S_ℓ}(K) ≥ θ − ε/2 then A ← A ∪ {(K, f_{S_ℓ}(K))};
11 return A;

Our algorithm SPRISS (Alg. 1) builds on Proposition 6, and returns the approximation of FK(D, k, θ) defined by the set A = {(K, f_{S_ℓ}(K)) : f̂_{S_ℓ}(K) ≥ θ − ε/2}. Therefore, with probability at least 1 − δ the output of SPRISS provides the guarantees stated in Proposition 6.
SPRISS starts by computing the number m of bags of ℓ reads as in Eq. (1), based on the input parameters k, θ, δ, ε, ℓ and on the characteristics (ℓ_{D,k}, ℓ_{max,D,k}, σ) of dataset D. It then draws a sample S of exactly mℓ reads, uniformly and independently at random, from D (with replacement). Next, it computes for each k-mer K the number of occurrences o_S(K) of K in sample S, using any exact k-mer counting algorithm. We denote the call of this method by exact_counting(S, k) (line 3), which returns a collection T of pairs (K, o_S(K)). The sample S is then partitioned into m bags, where each bag contains exactly ℓ reads (line 4). For each k-mer K, SPRISS computes the biased frequency f̂_{S_ℓ}(K) (line 8) and the unbiased frequency f_{S_ℓ}(K) (line 9), reporting in output only k-mers with biased frequency at least θ − ε/2 (line 10). Note that the estimated frequency of a k-mer K reported in output is always given by the unbiased frequency f_{S_ℓ}(K). In practice, the partition of S into m bags (line 4) and the computation of S_K (line 7) could be highly demanding in terms of running time and space, since one has to compute and store, for each k-mer K, the exact number S_K of bags where K appears at least once among all reads of the bag.

We now describe an alternative, and much more efficient, approach to approximate the values S_K without the need to explicitly compute the bags (line 4). The number of reads in a given bag where K appears is well approximated by a Poisson distribution Poisson(R[K]/m), where R[K] is the number of reads of S where k-mer K appears at least once. Therefore, the number S_K of bags where K appears at least once is approximated by a binomial distribution Binomial(m, 1 − e^{−R[K]/m}).
Thus, one can avoid explicitly creating the bags and exactly counting S_K by removing line 4, and replacing lines 7 and 8 with "f̂_{S_ℓ}(K) ← Binomial(m, 1 − e^{−R[K]/m})/(mℓℓ_{D,k})". Corollary 5.11 of [33] guarantees that, by using this Poisson distribution to approximate S_K, the output of SPRISS satisfies the properties of Proposition 6 with probability at least 1 − δ. This leads to the replacement of "ln(1/δ)" with "ln(2/δ)" in line 1. However, this approach requires computing, for each k-mer K, the number of reads R[K] of S where K appears at least once. We believe such computation can be obtained with minimal effort within the implementation of most k-mer counters, but we now describe a simple idea to approximate R[K]. Since most k-mers appear at most once in a read, the number of reads R[K] where a k-mer K appears is well approximated by the number of occurrences T[K] of K in the sample S. Thus, we can replace lines 7 and 8 with "f̂_{S_ℓ}(K) ← Binomial(m, 1 − e^{−T[K]/m})/(mℓℓ_{D,k})", which only requires the counts T[K] obtained from the exact counting procedure exact_counting(S, k) of line 3 (see Algorithm 2 in Supplemental Material). Note that approximating R[K] with T[K] leads to overestimating the frequencies of the few k-mers that reside in very repetitive sequences, e.g., k-mers composed of the same k consecutive nucleotides, for which T[K] ≫ R[K]. However, since the majority of k-mers reside in non-repetitive sequences, we can assume R[K] ≈ T[K].

6 Experimental Evaluation

In this section we present results of our experimental evaluation. In particular:

• We assess the performance of
SPRISS in approximating the set of frequent k-mers from a dataset of reads. In particular, we evaluate the accuracy of estimated frequencies and the false negatives in the approximation, and compare SPRISS with the state-of-the-art sampling algorithm SAKEIMA [41] in terms of sample size and running time.

• We evaluate
SPRISS's performance for the comparison of metagenomic datasets. We use
SPRISS's approximations to estimate abundance-based distances (e.g., the Bray-Curtis distance) between metagenomic datasets, and show that the estimated distances can be used to obtain informative clusterings of metagenomic datasets (from the Sorcerer II Global Ocean Sampling Expedition [48]) in a fraction of the time required by the exact distance computation (i.e., based on exact k-mer frequencies).

• We test SPRISS to discover discriminative k-mers between pairs of datasets. We show that SPRISS identifies almost all discriminative k-mers from pairs of metagenomic datasets from [23] and the Human Microbiome Project (HMP), with a significant speed-up compared to standard approaches.
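The first application above relies on abundance-based distances computed from k-mer counts, which SPRISS estimates from a sample instead of the whole dataset. A minimal sketch of the Bray-Curtis dissimilarity on k-mer count profiles (the paper's exact pipeline and normalization may differ):

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two k-mer count profiles:
    BC(A, B) = 1 - 2 * sum_K min(a_K, b_K) / (sum_K a_K + sum_K b_K)."""
    shared = sum(min(c, counts_b.get(K, 0)) for K, c in counts_a.items())
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1.0 - 2.0 * shared / total
```

Identical profiles give distance 0 and disjoint profiles give 1; running it on sampled counts trades a small, bounded error in each entry for a large speed-up.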
SPRISS as a combination of a Python script, which performs the reads sampling and savesthe sample on a file, and C++, as a modification of KMC 3 [20] , a fast and efficient counting k -mersalgorithm. Note that our flexible sampling technique can be combined with any k -mer counting algorithm.(See Supplemental Material for results, e.g. Figure S1, obtained using J ELLYFISH v. as k -mer counterin SPRISS ). We use the variant of
SPRISS that employs the Poisson approximation for computing S K (seeend of Section 5). SPRISS implementation and scripts for reproducing all results are publicity available .We compared SPRISS with the exact k -mer counter KMC and with SAKEIMA [41] , the state-of-the-art sampling-based algorithm for approximating frequent k -mers. In all experiments we fix δ = 0 . and ε = θ − /t D ,k . If not stated otherwise, we considered k = 31 and (cid:96) = (cid:98) . / ( θ(cid:96) D ,k ) (cid:99) in our experiments.When comparing running times, we did not consider the time required by SPRISS to materialize the samplein a file, since this step is not explicitly performed in SAKEIMA and could be easily done at the timeof creation of the reads dataset. For SAKEIMA, as suggested in [41] we set the number (cid:96) SK of k -mersin a bag to be (cid:96) SK = (cid:98) . /θ (cid:99) . We remark that a bag of reads of SPRISS contains the same (expected)number of k -mers positions of a bag of SAKEIMA; this guarantees that both algorithms provide outputswith the same guarantees, thus making the comparison between the two methods fair. To assess SPRISS in https://hmpdacc.org/HMASM/ Available at https://github.com/refresh-bio/KMC Available at https://github.com/gmarcais/Jellyfish Available at https://github.com/VandinLab/SPRISS Available at https://github.com/VandinLab/SAKEIMA k -mers, we considered large metagenomic datasets from HMP, each with ≈ reads and average read length ≈ (see Supplemental Table 1). For the evaluation of SPRISS in comparingmetagenomic datasets, we also used small metagenomic datasets from the Sorcerer II Global OceanSampling Expedition [48], each with ≈ - reads and average read length ≈ (see SupplementTable 2). For the assessment of SPRISS in the discovery of discriminative k -mers we used two large datasetsfrom [23], B73 and
Mo17, each with ≈ · reads and average read length = 250 (see Supplemental Table 3), and we also experimented with the HMP datasets. All experiments were performed on a machine with 512 GB of RAM and 2 Intel(R) Xeon(R) CPU E5-2698 v3 @2.3GHz, using one worker if not stated otherwise. All reported results are averages over runs.

Approximation of frequent k-mers

In this section we first assess the quality of the approximation of
FK(D, k, θ) provided by SPRISS, and then compare SPRISS with SAKEIMA.

We use SPRISS to extract approximations of frequent k-mers on 6 datasets from HMP for values of the minimum frequency threshold θ ∈ { . · 10⁻ , · 10⁻ , . · 10⁻ , 10⁻ }. The output of SPRISS satisfied the guarantees from Proposition 6 for all 5 runs of every combination of dataset and θ. In all cases the estimated frequencies provided by SPRISS are close to the exact ones (see Figure 1a for an example). In fact, the average (across all reported k-mers) absolute deviation of the estimated frequency w.r.t. the true frequency is always small, i.e. one order of magnitude smaller than θ (Figure 1b), and the maximum deviation is very small as well (Figure S2b). In addition, SPRISS results in a very low false negative rate (i.e., the fraction of k-mers of FK(D, k, θ) not reported by SPRISS), which was always below . in our experiments. In terms of running time, SPRISS required at most of the time required by the exact approach KMC (Figure 1c). This is due to SPRISS needing to analyze at most of the entire dataset (Figure 1d). Note that the use of collections of bags of reads is crucial to achieve a useful sample size, i.e. one lower than the whole dataset: the sample size from Hoeffding's inequality and the union bound (Proposition 1) and the one from pseudodimension without collections of bags (Proposition 4) are ≈ and ≈ , respectively, which are useless for datasets of ≈ reads. These results show that SPRISS obtains very accurate approximations of frequent k-mers in a fraction of the time required by exact counting approaches.

We then compared SPRISS with SAKEIMA. In terms of quality of approximation,
SPRISS reports approximations with an average deviation lower than SAKEIMA's, while SAKEIMA's approximations have a lower maximum deviation. However, the ratio between the maximum deviation of SPRISS and that of SAKEIMA is always below 2. Overall, the quality of the approximations provided by SPRISS and SAKEIMA is, thus, comparable. In terms of running time, SPRISS significantly improves over SAKEIMA (Figure 1c), while processing slightly smaller portions of the dataset than SAKEIMA (Figure 1d). Summarizing, SPRISS is able to report most of the frequent k-mers and to estimate their frequencies with small errors, by analyzing small samples of the datasets and with significant improvements in running time compared to exact approaches and to state-of-the-art sampling algorithms.

We evaluated
SPRISS for comparing metagenomic datasets by computing an approximation of the Bray-Curtis (BC) distance between pairs of read datasets, and using such approximations to cluster the datasets. Let D₁ and D₂ be two datasets of reads. Let F₁ = FK(D₁, k, θ) and F₂ = FK(D₂, k, θ) be the sets of frequent k-mers of D₁ and D₂, respectively, where θ is a minimum frequency threshold. The BC distance between D₁ and D₂ considering only frequent k-mers is defined as BC(D₁, D₂, F₁, F₂) = 1 − 2I/U, where I = Σ_{K ∈ F₁ ∩ F₂} min{o_{D₁}(K), o_{D₂}(K)} and U = Σ_{K ∈ F₁} o_{D₁}(K) + Σ_{K ∈ F₂} o_{D₂}(K). Conversely, the BC similarity is defined as 1 − BC(D₁, D₂, F₁, F₂).

We considered 6 datasets from HMP, and estimated the BC distances among them by using SPRISS to approximate the sets of frequent k-mers F₁ = FK(D₁, k, θ) and F₂ = FK(D₂, k, θ) for the values
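The Bray-Curtis computation on frequent k-mers defined above can be sketched as follows. This is a toy illustration, not the SPRISS code: it assumes the counts of the frequent k-mers of the two datasets are already available as dictionaries, and uses the standard 2I/U form of the Bray-Curtis dissimilarity.

```python
# Sketch: Bray-Curtis distance restricted to frequent k-mers.
# o1, o2 map each frequent k-mer of D1 (resp. D2) to its count; the
# dictionaries and the function name are ours, for illustration only.

def bray_curtis(o1, o2):
    """Return BC(D1, D2, F1, F2) = 1 - 2I/U from frequent k-mer counts."""
    shared = o1.keys() & o2.keys()
    i = sum(min(o1[kmer], o2[kmer]) for kmer in shared)  # I: shared abundance
    u = sum(o1.values()) + sum(o2.values())              # U: total abundance
    return 1.0 - 2.0 * i / u

o1 = {"ACG": 10, "CGT": 5}  # toy counts for D1
o2 = {"ACG": 6, "GTA": 4}   # toy counts for D2
print(bray_curtis(o1, o2))  # I = 6, U = 25 -> 1 - 12/25 = 0.52
```

In practice the two dictionaries would come from a k-mer counter run on the (sampled) datasets.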
Figure 1: (a) k-mers' exact frequency and the frequency estimated by SPRISS for dataset SRS024075 and θ = 2.5 · 10⁻⁸. (b) Average deviations between exact frequencies and frequencies estimated by SPRISS (SP) and SAKEIMA (SK), for various datasets and values of θ. (c) Running time of SPRISS (SP), SAKEIMA (SK), and the exact computation (E); see also the legend of panel (b). (d) Fraction of the dataset analyzed by SPRISS (SP) and by SAKEIMA (SK).

of θ as in Section 6.2. We compared such estimated distances with the exact BC distances and with the estimates obtained using SAKEIMA. Both SPRISS and SAKEIMA provide accurate estimates of the BC distances (Figure 2a and Figure S3), which can be used to assess the relative similarity of pairs of datasets. However, to obtain such approximations
SPRISS requires at most of the time required by SAKEIMA and usually of the time required by the exact computation with KMC (Figure 2b). Therefore SPRISS provides accurate estimates of metagenomic distances in a fraction of the time required by other approaches.

As an example of the impact of accurately estimating distances among metagenomic datasets, we used the sampling approach of
SPRISS to approximate all pairwise BC distances among small datasets from the Sorcerer II Global Ocean Sampling Expedition (GOS) [48], and used such distances to cluster the datasets using average-linkage hierarchical clustering. The k-mer based clustering of metagenomic datasets is often performed using presence-based distances, such as the Jaccard distance [36], which estimates the similarity between two datasets by computing the fraction of k-mers the two datasets have in common. Abundance-based distances, such as the BC distance [4, 11, 12], provide more detailed measures based also on the k-mers' abundance, but are often not used due to the heavy computational requirements to extract all
Figure 2: (a) Comparison of the approximations of the Bray-Curtis (BC) distances using the approximations of frequent k-mers provided by SPRISS (×) and by SAKEIMA (•), and the exact distances, for θ = 2.5 · 10⁻⁸. (b) Running time to approximate the BC distances for all pairs of datasets with SPRISS, with SAKEIMA, and with the exact approach. (c) Average-linkage hierarchical clustering of the GOS datasets using the Jaccard similarity. (d) Same as (c), using the BC similarity estimated by SPRISS with half of the data. (See also the larger Figures S4-S6 in the Supplemental Material for better readability of the datasets' labels and of the computed clusters.)

k-mers' counts. However, the sampling approach of SPRISS can significantly speed up the computation of all BC distances, and, thus, the entire clustering analysis. In fact, for this experiment, the use of
SPRISS reduces the time required to analyze the datasets (i.e., obtain the k-mers' frequencies, compute all pairwise distances, and obtain the clustering) by . We then compared the clustering obtained using the Jaccard distance (Figure 2c) and the clustering obtained using the estimates of the BC distances (Figure 2d) computed using only half of the reads in the GOS datasets, which are assigned to groups and macro-groups according to the origin of the sample [48]. Even if the BC distance is computed using only a sample of the datasets, while the Jaccard distance is computed using the entirety of all datasets, the use of the approximate BC distances leads to a better clustering in terms of correspondence of clusters to groups, and to the correct cluster separation for macro-groups. In addition, the similarities among datasets in the same group and the dissimilarities among datasets in different groups are more accentuated using the approximated BC distance. In fact, the ratio between the average BC similarity among datasets in the same group and the analogous average Jaccard similarity is in the interval [1. , . ] for all groups. In addition, the ratio between i) the difference between the average BC similarity within the tropical macro-group and the average BC similarity between the tropical and temperate groups, and ii) the analogous difference using the Jaccard similarity, is ≈ . . These results tell us that the approximate BC distances, computed using only half of the reads in each dataset, increase by ≈ the similarity signal inside all groups defined by the original study [48], and the dissimilarity between the two macro-groups (tropical and temperate).

To conclude, the estimates of the BC similarities obtained using the sampling scheme of SPRISS allow to cluster metagenomic datasets better than the Jaccard similarity, while requiring less than of the time needed by the exact computation of the BC similarities, even for fairly small metagenomic datasets.

Approximation of discriminative k-mers

In this section we assess
SPRISS for approximating discriminative k-mers in metagenomic datasets. In particular, we consider the following definition of discriminative k-mers [23]. Given two datasets D₁, D₂ and a minimum frequency threshold θ, we define the set DK(D₁, D₂, k, θ, ρ) of D₁-discriminative k-mers as the collection of k-mers K for which the following conditions both hold: 1. K ∈ FK(D₁, k, θ); 2. f_{D₁}(K) ≥ ρ f_{D₂}(K), with ρ = 2. Note that the computation of DK(D₁, D₂, k, θ, ρ) requires extracting FK(D₁, k, θ) and FK(D₂, k, θ/ρ). SPRISS can be used to approximate the set DK(D₁, D₂, k, θ, ρ) by computing approximations F̃K(Dᵢ, k, θ) of the sets FK(Dᵢ, k, θ), i = 1, 2, of frequent k-mers in D₁, D₂, and then reporting a k-mer K as D₁-discriminative if the following conditions both hold: 1. K ∈ F̃K(D₁, k, θ); 2. K ∉ F̃K(D₂, k, θ), or f_{S₁}(K) ≥ ρ f_{S₂}(K) when K ∈ F̃K(D₂, k, θ), where S₁ and S₂ denote the samples drawn from D₁ and D₂, respectively.

To evaluate this approach, we considered the two datasets from [23], and θ = 2 · 10⁻ and ρ = 2, which are the parameters used in [23]. We used the sampling approach of SPRISS with ℓ₁ = ⌊ . /(θ ℓ_{D₁,k})⌋ and ℓ₂ = ⌊ . /(θ ℓ_{D₂,k})⌋, resulting in analyzing of and of all reads, to approximate the sets of D₁-discriminative and of D₂-discriminative k-mers. When of the reads are used, the false negative rate is < . , while when of the reads are used, the false negative rate is < . . The running times are ≈ sec. and ≈ sec., respectively, while the exact computation of the discriminative k-mers with KMC requires ≈ sec. (we used 32 workers for both SPRISS and KMC). Similar results are obtained when analyzing pairs of HMP datasets, for various values of θ (Figure S7). These results show that SPRISS can identify discriminative k-mers with small false negative rates while providing a remarkable improvement in running time compared to the exact approach.

We presented
SPRISS, an efficient algorithm to compute rigorous approximations of frequent k-mers and their frequencies by sampling reads. SPRISS builds on pseudodimension, an advanced concept from statistical learning theory. Our extensive experimental evaluation shows that SPRISS provides high-quality approximations and can be employed to speed up exploratory analyses in various applications, such as the analysis of metagenomic datasets and the identification of discriminative k-mers. Overall, the sampling approach used by SPRISS provides an efficient way to obtain a representative subset of the data that can be used to perform complex analyses more efficiently than examining the whole data, while obtaining representative results.

References

[1] Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, and Rob Patro. A space and time-efficient index for the compacted colored de Bruijn graph.
Bioinformatics , 34(13):i169–i177, 2018.[2] Peter Audano and Fredrik Vannberg. Kanalyze: a fast versatile pipelined k-mer toolkit.
Bioinformatics, 30(14):2070–2072, 2014. [3] Jérôme Audoux, Nicolas Philippe, Rayan Chikhi, Mikaël Salson, Mélina Gallopin, Marc Gabriel, Jérémy Le Coz, Emilie Drouineau, Thérèse Commes, and Daniel Gautheret. De-kupl: exhaustive capture of biological variation in rna-seq data through k-mer decomposition.
Genome biology, 18(1):243, 2017. [4] Gaëtan Benoit, Pierre Peterlongo, Mahendra Mariadassou, Erwan Drezen, Sophie Schbath, Dominique Lavenier, and Claire Lemaitre. Multiple comparative metagenomics using multiset k-mer counting.
PeerJ ComputerScience , 2:e94, 2016.[5] Phelim Bradley, Henk C Den Bakker, Eduardo PC Rocha, Gil McVean, and Zamin Iqbal. Ultrafast search of alldeposited bacterial and viral genomic data.
Nature biotechnology , 37(2):152–159, 2019.[6] C Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B Pyrkosz, and Timothy H Brom. A reference-freealgorithm for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802 , 2012.[7] Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul Medvedev. On the representationof de bruijn graphs. In
International conference on Research in computational molecular biology , pages 35–55.Springer, 2014.[8] Rayan Chikhi and Paul Medvedev. Informed and automated k-mer size selection for genome assembly.
Bioin-formatics , 30(1):31–37, 2013.[9] Benjamin Coleman, Benito Geordie, Li Chou, RA Leo Elworth, Todd J Treangen, and Anshumali Shrivastava.Diversified race sampling on data streams applied to metagenomic sequence analysis. bioRxiv , page 852889,2019.[10] Temesgen Hailemariam Dadi, Enrico Siragusa, Vitor C Piro, Andreas Andrusch, Enrico Seiler, Bernhard YRenard, and Knut Reinert. Dream-yara: An exact read mapper for very large databases with short update time.
Bioinformatics , 34(17):i766–i772, 2018.[11] Roberto Danovaro, Miquel Canals, Michael Tangherlini, Antonio Dell’Anno, Cristina Gambi, Galderic Lastras,David Amblas, Anna Sanchez-Vidal, Jaime Frigola, Antoni M Calafat, et al. A submarine volcanic eruptionleads to a novel microbial habitat.
Nature ecology & evolution , 1(6):0144, 2017.[12] Laura B Dickson, Davy Jiolle, Guillaume Minard, Isabelle Moltini-Conclois, Stevenn Volant, Amine Ghozlane,Christiane Bouchier, Diego Ayala, Christophe Paupy, Claire Valiente Moro, et al. Carryover effects of larvalexposure to different environmental bacteria drive adult trait variation in a mosquito vector.
Science advances ,3(8):e1700585, 2017.[13] RA Leo Elworth, Qi Wang, Pavan K Kota, CJ Barberan, Benjamin Coleman, Advait Balaji, Gaurav Gupta,Richard G Baraniuk, Anshumali Shrivastava, and Todd J Treangen. To petabytes and beyond: recent advancesin probabilistic and signal processing algorithms and their application to metagenomics.
Nucleic Acids Research ,48(10):5217–5234, 2020.[14] Hongzhe Guo, Yilei Fu, Yan Gao, Junyi Li, Yadong Wang, and Bo Liu. degsm: memory scalable constructionof large scale de bruijn graph.
IEEE/ACM transactions on computational biology and bioinformatics , 2019.[15] Robert S Harris and Paul Medvedev. Improved representation of sequence bloom trees.
Bioinformatics ,36(3):721–727, 2020.[16] Mikel Hernaez, Dmitri Pavlichin, Tsachy Weissman, and Idoia Ochoa. Genomic data compression.
Annual Review of Biomedical Data Science, 2:19–37, 2019. [17] Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs.
Genome biology , 21(1):1–20, 2020.[18] Morteza Hosseini, Diogo Pratas, and Armando J Pinho. A survey on data compression methods for biologicalsequences.
Information , 7(4):56, 2016.[19] David R Kelley, Michael C Schatz, and Steven L Salzberg. Quake: quality-aware detection and correction ofsequencing errors.
Genome biology , 11(11):R116, 2010.[20] Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. Kmc 3: counting and manipulating k-mer statistics.
Bioinformatics , 33(17):2759–2761, 2017.[21] Stefan Kurtz, Apurva Narechania, Joshua C Stein, and Doreen Ware. A new method to compute k-mer frequen-cies and its application to annotate large repetitive plant genomes.
BMC genomics , 9(1):517, 2008.
[22] Xiaoman Li and Michael S Waterman. Estimating the repeat structure and length of DNA sequences using ℓ-tuples.
Genome research , 13(8):1916–1922, 2003.[23] Sanzhen Liu, Jun Zheng, Pierre Migeon, Jie Ren, Ying Hu, Cheng He, Hongjun Liu, Junjie Fu, Frank F White,Christopher Toomajian, et al. Unbiased k-mer analysis reveals changes in copy number of highly repetitivesequences during maize domestication and improvement.
Scientific reports, 7:42444, 2017. [24] Maarten Löffler and Jeff M Phillips. Shape fitting on point sets with probability distributions. In
EuropeanSymposium on Algorithms , pages 313–324. Springer, 2009.[25] Philip M Long. The complexity of learning according to two models of a drifting environment.
Machine Learning, 37(3):337–354, 1999. [26] Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.
Bioinformatics, 27(6):764–770, 2011. [27] Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv, page 866756, 2019. [28] Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. Reindeer: efficient indexing of k-mer presence and abundance in sequencing datasets. bioRxiv, 2020. [29] Camille Marchet, Maël Kerbiriou, and Antoine Limasset. Indexing de Bruijn graphs with minimizers.
BioRxiv ,page 546309, 2019.[30] Camille Marchet, Lolita Lecompte, Antoine Limasset, Lucie Bittner, and Pierre Peterlongo. A resource-frugalprobabilistic dictionary and applications in bioinformatics.
Discrete Applied Mathematics, 274:92–102, 2020. [31] Páll Melsted and Bjarni V Halldórsson. Kmerstream: streaming algorithms for k-mer abundance estimation.
Bioinformatics , 30(24):3541–3547, 2014.[32] Pall Melsted and Jonathan K Pritchard. Efficient counting of k-mers in dna sequences using a bloom filter.
BMCbioinformatics , 12(1):333, 2011.[33] Michael Mitzenmacher and Eli Upfal.
Probability and computing: Randomization and probabilistic techniquesin algorithms and data analysis . Cambridge university press, 2017.[34] Hamid Mohamadi, Hamza Khan, and Inanc Birol. ntcard: a streaming algorithm for cardinality estimation ingenomics data.
Bioinformatics, 33(9):1324–1330, 2017. [35] Ibrahim Numanagić, James K Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, and S Cenk Sahinalp. Comparison of high-throughput sequencing data compression tools. Nature Methods, 13(12):1005–1008, 2016. [36] Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using minhash.
Genome biology ,17(1):132, 2016.[37] Rachid Ounit, Steve Wanamaker, Timothy J Close, and Stefano Lonardi. Clark: fast and accurate classificationof metagenomic and genomic sequences using discriminative k-mers.
BMC genomics , 16(1):236, 2015.[38] Prashant Pandey, Fatemeh Almodaresi, Michael A Bender, Michael Ferdman, Rob Johnson, and Rob Patro.Mantis: A fast, small, and exact large-scale sequence-search index.
Cell systems , 7(2):201–207, 2018.[39] Prashant Pandey, Michael A Bender, Rob Johnson, and Rob Patro. Squeakr: an exact and approximate k-mercounting system.
Bioinformatics , 2017.[40] Rob Patro, Stephen M Mount, and Carl Kingsford. Sailfish enables alignment-free isoform quantification fromrna-seq reads using lightweight algorithms.
Nature biotechnology , 32(5):462, 2014.[41] Leonardo Pellegrina, Cinzia Pizzi, and Fabio Vandin. Fast approximation of frequent k-mers and applications tometagenomics.
Journal of Computational Biology , 27(4):534–549, 2020.[42] David Pollard.
Convergence of Stochastic Processes. Springer-Verlag, 1984. [43] Amatur Rahman, Rayan Chikhi, and Paul Medvedev. Disk compression of k-mer sets. In . Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020. [44] Amatur Rahman and Paul Medvedev. Representation of k-mer sets using spectrum-preserving string sets. In International Conference on Research in Computational Molecular Biology, pages 152–168. Springer, 2020. [45] Matteo Riondato and Eli Upfal. Abra: Approximating betweenness centrality in static and dynamic graphs with rademacher averages.
ACM Transactions on Knowledge Discovery from Data (TKDD) , 12(5):61, 2018.[46] Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. Dsk: k-mer counting with very low memory usage.
Bioinformatics , 29(5):652–653, 2013.
[47] Rajat Shuvro Roy, Debashish Bhattacharya, and Alexander Schliep. Turtle: Identifying frequent k-mers with cache-efficient algorithms.
Bioinformatics, 30(14):1950–1957, 2014. [48] Douglas B Rusch, Aaron L Halpern, Granger Sutton, Karla B Heidelberg, Shannon Williamson, Shibu Yooseph, Dongying Wu, Jonathan A Eisen, Jeff M Hoffman, Karin Remington, Karen Beeson, Bao Tran, Hamilton Smith, Holly Baden-Tillson, Clare Stewart, Joyce Thorpe, Jason Freeman, Cynthia Andrews-Pfannkoch, Joseph E Venter, Kelvin Li, Saul Kravitz, John F Heidelberg, Terry Utterback, Yu-Hui Rogers, Luisa I Falcón, Valeria Souza, Germán Bonilla-Rosso, Luis E Eguiarte, David M Karl, Shubha Sathyendranath, Trevor Platt, Eldredge Bermingham, Victor Gallardo, Giselle Tamayo-Castillo, Michael R Ferrari, Robert L Strausberg, Kenneth Nealson, Robert Friedman, Marvin Frazier, and J. Craig Venter. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific.
PLOS Biology, 5(3):1–34, 03 2007. [49] Antonio Saavedra, Hans Lehnert, Cecilia Hernández, Gonzalo Carvajal, and Miguel Figueroa. Mining discriminative k-mers in dna sequences using sketches and hardware acceleration.
IEEE Access , 8:114715–114732,2020.[50] Leena Salmela, Riku Walve, Eric Rivals, and Esko Ukkonen. Accurate self-correction of errors in long readsusing de bruijn graphs.
Bioinformatics , 33(6):799–806, 2016.[51] Gregory E Sims, Se-Ran Jun, Guohong A Wu, and Sung-Hou Kim. Alignment-free genome comparison withfeature frequency profiles (ffp) and optimal resolutions.
Proceedings of the National Academy of Sciences ,106(8):2677–2682, 2009.[52] Naveen Sivadasan, Rajgopal Srinivasan, and Kshama Goyal. Kmerlight: fast and accurate k-mer abundanceestimation. arXiv preprint arXiv:1609.05626 , 2016.[53] Brad Solomon and Carl Kingsford. Fast search of thousands of short-read sequencing experiments.
Naturebiotechnology , 34(3):300, 2016.[54] Brad Solomon and Carl Kingsford. Improved search of large transcriptomic sequencing databases using splitsequence bloom trees.
Journal of Computational Biology , 25(7):755–765, 2018.[55] Chen Sun, Robert S Harris, Rayan Chikhi, and Paul Medvedev. Allsome sequence bloom trees.
Journal ofComputational Biology , 25(5):467–479, 2018.[56] Michel Talagrand. Sharper bounds for gaussian and empirical processes.
The Annals of Probability , pages28–76, 1994.[57] Vladimir Vapnik.
Statistical learning theory.
Wiley, New York, 1998.[58] Axel Wedemeyer, Lasse Kliemann, Anand Srivastav, Christian Schielke, Thorsten B Reusch, and Philip Rosen-stiel. An improved filtering algorithm for big read datasets and its application to single-cell assembly.
BMCbioinformatics , 18(1):324, 2017.[59] Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exactalignments.
Genome biology , 15(3):R46, 2014.[60] Ye Yu, Jinpeng Liu, Xinan Liu, Yi Zhang, Eamonn Magner, Erik Lehnert, Chen Qian, and Jinze Liu. Seqothello:querying rna-seq experiments at scale.
Genome biology , 19(1):167, 2018.[61] Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, and C Titus Brown. These arenot the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.
PloSone , 9(7):e101271, 2014.[62] Zhaojun Zhang and Wei Wang. Rna-skim: a rapid method for rna-seq quantification at transcript level.
Bioinformatics, 30(12):i283–i292, 2014.

Supplemental Material

A Analysis of the Simple Reads Sampling Algorithm
In this section we prove Proposition 1, which here corresponds to Proposition 10. To this aim, we need to introduce and prove some preliminary results.
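Before the formal proofs, the unbiasedness stated by Proposition 8 below can be checked empirically. The following small simulation (toy reads and our own variable names, not taken from the paper) averages f_S(K) over many independent samples of m reads and compares the result to f_D(K):

```python
import random

# Toy check of the read-sampling estimator: f_S(K) = o_S(K) / (m * l_avg)
# should be an unbiased estimate of f_D(K) = o_D(K) / t_{D,k}.
random.seed(0)
reads = ["ACGTAC", "GTACG", "ACGA", "TTTTT"]  # toy dataset D
k, K, m = 3, "ACG", 5                          # k-mer length, target k-mer, sample size

def occ(read):
    # number of distinct positions where K appears in one read
    return sum(read[j:j + k] == K for j in range(len(read) - k + 1))

l_avg = sum(len(r) - k + 1 for r in reads) / len(reads)          # l_{D,k}
f_D = sum(occ(r) for r in reads) / sum(len(r) - k + 1 for r in reads)

trials = 20000
est = 0.0
for _ in range(trials):
    S = random.choices(reads, k=m)             # m reads, uniform with replacement
    est += sum(occ(r) for r in S) / (m * l_avg)
print(abs(est / trials - f_D) < 0.01)          # empirical mean is close to f_D
```

Here f_D = 3/12 = 0.25, and the empirical average of f_S over 20000 samples lands within 0.01 of it.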
Proposition 7.
The expectation $\mathbb{E}[t_{S,k}]$ of the size of the multiset of k-mers that appear in $S$ is $m\ell_{\mathcal{D},k}$.

Proof. Let $X(r_i) = n_i - k + 1$ be the number of starting positions for k-mers in a read $r_i$ sampled uniformly at random from $\mathcal{D}$, $i \in \{1, \dots, n\}$. Then $\mathbb{E}[X(r_i)] = \sum_{r_i \in \mathcal{D}} \frac{1}{n}(n_i - k + 1) = \ell_{\mathcal{D},k}$. Combining this with the linearity of expectation, we have:
$$\mathbb{E}[t_{S,k}] = \mathbb{E}\left[\sum_{r_i \in S} (n_i - k + 1)\right] = \sum_{r_i \in S} \mathbb{E}[n_i - k + 1] = m\,\mathbb{E}[X(r_i)] = m\ell_{\mathcal{D},k}.$$

Given a k-mer $K$, its support $o_S(K)$ in $S$ is defined as $o_S(K) = \sum_{r_i \in S} \sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)$. We define the frequency of $K$ in $S$ as $f_S(K) = o_S(K)/(m\ell_{\mathcal{D},k})$, that is, the ratio between the support of $K$ and the expectation $\mathbb{E}[t_{S,k}] = m\ell_{\mathcal{D},k}$ of the size of the multiset of k-mers that appear in $S$. This definition of $f_S(K)$ gives us an unbiased estimator for $f_{\mathcal{D}}(K)$.

Proposition 8.
The frequency $f_S(K) = o_S(K)/(m\ell_{\mathcal{D},k})$ is an unbiased estimator for $f_{\mathcal{D}}(K) = o_{\mathcal{D}}(K)/t_{\mathcal{D},k}$.

Proof. Let $X_{r_i}(K) = \sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)$ be the number of distinct positions where the k-mer $K$ appears in a read $r_i$ sampled uniformly at random from $\mathcal{D}$, $i \in \{1, \dots, n\}$. Then $\mathbb{E}[X_{r_i}(K)] = \sum_{r_i \in \mathcal{D}} \left(\frac{1}{n} \sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)\right) = o_{\mathcal{D}}(K)/n$. Combining this with the linearity of expectation, we have:
$$\mathbb{E}[f_S(K)] = \frac{\mathbb{E}[o_S(K)]}{m\ell_{\mathcal{D},k}} = \frac{\mathbb{E}\left[\sum_{r_i \in S} \sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)\right]}{m\ell_{\mathcal{D},k}} = \frac{\mathbb{E}[X_{r_i}(K)]}{\ell_{\mathcal{D},k}} = \frac{o_{\mathcal{D}}(K)}{n\ell_{\mathcal{D},k}} = \frac{o_{\mathcal{D}}(K)}{t_{\mathcal{D},k}} = f_{\mathcal{D}}(K).$$

By using the sampling framework based on reads and the Hoeffding inequality [33], we prove the following bound on the probability that $f_S(K)$ is not within $\varepsilon/2$ of $f_{\mathcal{D}}(K)$, for an arbitrary k-mer $K$.

Proposition 9.
Consider a sample $S$ of $m$ reads from $\mathcal{D}$. Let $\ell_{\max,\mathcal{D},k} = \max_{r_i \in \mathcal{D}}(n_i - k + 1)$. Let $K \in \Sigma^k$ be an arbitrary k-mer. For a fixed accuracy parameter $\varepsilon \in (0,1)$, we have:
$$\Pr\left(|f_S(K) - f_{\mathcal{D}}(K)| \geq \frac{\varepsilon}{2}\right) \leq 2\exp\left(-\frac{m\varepsilon^2}{2}\left(\frac{\ell_{\mathcal{D},k}}{\ell_{\max,\mathcal{D},k}}\right)^2\right). \quad (2)$$

Proof.
The frequency $f_S(K) = o_S(K)/(m\ell_{\mathcal{D},k})$ of $K$ in $S$ can be rewritten as:
$$f_S(K) = \frac{\sum_{r_i \in S} \sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)}{m\ell_{\mathcal{D},k}} = \sum_{r_i \in S} \hat{\phi}_K(r_i), \quad (3)$$
where the random variable (r.v.) $\hat{\phi}_K(r_i) = \frac{\sum_{j=0}^{n_i-k} \phi_{r_i,K}(j)}{m\ell_{\mathcal{D},k}}$ is the number of times $K$ appears in read $r_i$ divided by $m\ell_{\mathcal{D},k}$. Thus, $f_S(K)$ can be rewritten as a sum of $m$ independent r.v. that take values in $[0, \frac{\ell_{\max,\mathcal{D},k}}{m\ell_{\mathcal{D},k}}]$. Combining this fact with Proposition 8, and by applying the Hoeffding inequality [33], we have:
$$\Pr\left(|f_S(K) - f_{\mathcal{D}}(K)| \geq \frac{\varepsilon}{2}\right) \leq 2\exp\left(-\frac{2(\varepsilon/2)^2}{m\left(\frac{\ell_{\max,\mathcal{D},k}}{m\ell_{\mathcal{D},k}}\right)^2}\right) = 2\exp\left(-\frac{m\varepsilon^2}{2}\left(\frac{\ell_{\mathcal{D},k}}{\ell_{\max,\mathcal{D},k}}\right)^2\right).$$

Since the maximum number of k-mers is $\sigma^k$, by combining the result above with the union bound we have the following result.

Proposition 10.
Consider a sample $S$ of $m$ reads from $\mathcal{D}$. For a fixed frequency threshold $\theta \in (0,1)$, error parameter $\varepsilon \in (0,\theta)$, and confidence parameter $\delta \in (0,1)$, if
$$m \geq \frac{2}{\varepsilon^2}\left(\frac{\ell_{\max,\mathcal{D},k}}{\ell_{\mathcal{D},k}}\right)^2\left(\ln(\sigma^k) + \ln\left(\frac{2}{\delta}\right)\right) \quad (4)$$
then, with probability $\geq 1 - \delta$, $FK(S, k, \theta - \varepsilon/2)$ is an $\varepsilon$-approximation of $FK(\mathcal{D}, k, \theta)$.

Proof. Let $E_K$ be the event "$|f_S(K) - f_{\mathcal{D}}(K)| \leq \varepsilon/2$" for a k-mer $K$. By the choice of $m$ and Proposition 9, the probability of the complementary event $\bar{E}_K$ of $E_K$ is
$$\Pr(\bar{E}_K) = \Pr\left(|f_S(K) - f_{\mathcal{D}}(K)| \geq \frac{\varepsilon}{2}\right) = 2\exp\left(-\frac{m\varepsilon^2}{2}\left(\frac{\ell_{\mathcal{D},k}}{\ell_{\max,\mathcal{D},k}}\right)^2\right) \leq \frac{\delta}{\sigma^k}.$$
Now, by applying the union bound, the probability that the event $\bar{E}_K$ holds for at least one k-mer $K$ of $\Sigma^k$ is bounded by $\sum_{K \in \Sigma^k} \Pr(\bar{E}_K) \leq \delta$. Thus, the events $E_K$ simultaneously hold for all k-mers $K$ in $\Sigma^k$ with probability at least $1 - \delta$.

Now we prove that $FK(S, k, \theta - \varepsilon/2)$ is an $\varepsilon$-approximation of $FK(\mathcal{D}, k, \theta)$ when "$|f_S(K) - f_{\mathcal{D}}(K)| \leq \varepsilon/2$" holds for all k-mers $K$, which happens with probability at least $1 - \delta$. Note that the third property of Definition 1 is already satisfied. Let $K$ be a k-mer of $FK(\mathcal{D}, k, \theta)$, that is, $f_{\mathcal{D}}(K) \geq \theta$. Given that $f_S(K) \geq f_{\mathcal{D}}(K) - \varepsilon/2$, we have $f_S(K) \geq \theta - \varepsilon/2$ and the first property of Definition 1 holds. Combining $f_{\mathcal{D}}(K) \geq f_S(K) - \varepsilon/2$ and $f_S(K) \geq \theta - \varepsilon/2$, we have $f_{\mathcal{D}}(K) \geq \theta - \varepsilon$ and the second property of Definition 1 holds.

The previous theorem gives us the following simple procedure for approximating the set of frequent k-mers with guarantees on the quality of the solution: build a sample $S$ of $m \geq \frac{2}{\varepsilon^2}\left(\frac{\ell_{\max,\mathcal{D},k}}{\ell_{\mathcal{D},k}}\right)^2\left(\ln(\sigma^k) + \ln\left(\frac{2}{\delta}\right)\right)$ reads from $\mathcal{D}$, and output the set $FK(S, k, \theta - \varepsilon/2)$, which is an $\varepsilon$-approximation of $FK(\mathcal{D}, k, \theta)$ with probability at least $1 - \delta$.
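The simple procedure above can be sketched as follows. This is an illustrative implementation under our own naming, using the sample-size bound reconstructed above with toy parameters; it is not the authors' code.

```python
import math
import random
from collections import Counter

# Sketch of the simple reads-sampling procedure: draw m reads uniformly at
# random with replacement, count k-mers in the sample, and report those with
# estimated frequency at least theta - eps/2. All names are illustrative.

def approx_frequent_kmers(reads, k, theta, eps, delta, sigma=4, rng=random):
    pos = [len(r) - k + 1 for r in reads]
    l_avg = sum(pos) / len(reads)                    # l_{D,k}
    l_max = max(pos)                                 # l_{max,D,k}
    m = math.ceil((2 / eps**2) * (l_max / l_avg)**2
                  * (k * math.log(sigma) + math.log(2 / delta)))
    sample = rng.choices(reads, k=m)                 # m reads, with replacement
    counts = Counter(r[j:j + k] for r in sample
                     for j in range(len(r) - k + 1))
    return {K: c / (m * l_avg) for K, c in counts.items()
            if c / (m * l_avg) >= theta - eps / 2}

random.seed(42)
reads = ["AAAA"] * 50 + ["ACGT"] * 50                # toy dataset
out = approx_frequent_kmers(reads, k=2, theta=0.3, eps=0.25, delta=0.1)
print("AA" in out)                                   # "AA" has f_D = 0.5, so it is reported
```

With these toy parameters the bound gives a manageable $m$; with realistic values of $\varepsilon$ the required $m$ explodes, which is exactly the limitation discussed next.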
Since the frequencies of the k-mers we are estimating are small, $\varepsilon$ must be set to a small value. This typically results in a sample size $m$ larger than $|\mathcal{D}|$, making the sampling approach useless.

B Analysis of the First Improvement: A Pseudodimension-based Algorithm for k-mers Approximation by Sampling Reads

In this section we prove Proposition 3 and Proposition 4, which here correspond to Proposition 12 and Proposition 13, respectively. In order to help the reader avoid too many jumps to the main text, we reintroduce some important definitions and results.

Let $\mathcal{F}$ be a class of real-valued functions from a domain $X$ to $[a,b] \subset \mathbb{R}$. Consider, for each $f \in \mathcal{F}$, the subset of $X' = X \times [a,b]$ defined as $R_f = \{(x,t) : t \leq f(x)\}$, and call it a range. Let $\mathcal{F}^+ = \{R_f, f \in \mathcal{F}\}$ be the range set on $X'$, and let its corresponding range space $Q'$ be $Q' = (X', \mathcal{F}^+)$. We say that a subset $D \subset X'$ is shattered by $\mathcal{F}^+$ if the size of the projection set $\mathrm{proj}_{\mathcal{F}^+}(D) = \{r \cap D : r \in \mathcal{F}^+\}$ is equal to $2^{|D|}$. The VC dimension
$VC(Q')$ of $Q'$ is the maximum size of a subset of $X'$ shattered by $\mathcal{F}^+$. The pseudodimension $PD(X, \mathcal{F})$ is then defined as the VC dimension of $Q'$: $PD(X, \mathcal{F}) = VC(Q')$. Let $\pi$ be the uniform distribution on $X$, and let $S$ be a sample of $X$ of size $|S| = m$, with every element of $S$ sampled independently and uniformly at random from $X$. We define, $\forall f \in \mathcal{F}$, $f_S = \frac{1}{m}\sum_{x \in S} f(x)$ and $f_X = \mathbb{E}_{x \sim \pi}[f(x)]$. Note that $\mathbb{E}[f_S] = f_X$. The following result relates the accuracy and confidence parameters $\varepsilon, \delta$ and the pseudodimension with the probability that the expected values of the functions in $\mathcal{F}$ are well approximated by their averages computed from a finite random sample.

Proposition 11 ([56, 25]). Let $X$ be a domain and $\mathcal{F}$ be a class of real-valued functions from $X$ to $[a,b]$. Let $PD(X, \mathcal{F}) = VC(Q') \leq v$. There exists an absolute positive constant $c$ such that, for fixed $\varepsilon, \delta \in (0,1)$, if $S$ is a random sample of $m$ samples drawn independently and uniformly at random from $X$ with
$$m \geq \frac{c(b-a)^2}{\varepsilon^2}\left(v + \ln\left(\frac{1}{\delta}\right)\right) \quad (5)$$
then, with probability $\geq 1 - \delta$, it holds simultaneously $\forall f \in \mathcal{F}$ that $|f_S - f_X| \leq \varepsilon$.

The universal constant $c$ has been experimentally estimated to be at most 0.5 [24]. Here we define the range space associated to k-mers and derive an upper bound to its pseudodimension. Finally, we derive a tighter sample size compared to the one proposed in Proposition 10. The definition of the range space $Q' = (X', \mathcal{F}^+)$ associated to k-mers requires defining the domain $X$ and the class of real-valued functions $\mathcal{F}$.

Definition 3.
Let k be a positive integer and D be a bag of n reads. Define the domain X as the set of integers {1, . . . , n}, where every i ∈ X corresponds to the i-th read of D. Then define the family of real-valued functions F = {f_K, ∀K ∈ Σ^k} where, for every i ∈ X and every f_K ∈ F, the function f_K(i) is the number of distinct positions of read r_i where the k-mer K appears, divided by the average size of the multiset of k-mers that appear in a read of D:

f_K(i) = (Σ_{j=0}^{n_i − k} φ_{r_i,K}(j)) / ℓ_{D,k}.

Therefore f_K(i) ∈ [0, ℓ_{max,D,k}/ℓ_{D,k}]. For each f_K ∈ F, the subset of X′ = X × [0, ℓ_{max,D,k}/ℓ_{D,k}] defined as R_{f_K} = {(i, t) : t ≤ f_K(i)} is the associated range. Let F⁺ = {R_{f_K} : f_K ∈ F} be the range set on X′, and let Q′ = (X′, F⁺) be its corresponding range space.

A trivial upper bound to
PD(X, F) is given by PD(X, F) ≤ ⌊log₂ |F|⌋ = ⌊log₂ σ^k⌋. Before proving a tighter bound on PD(X, F), we first state a technical lemma (Lemma 3.8 from [45]).

Lemma 1.
Let B ⊆ X′ be a set that is shattered by F⁺. Then B does not contain any element of the form (i, 0), for any i ∈ X.

Proposition 12.
Let D be a bag of n reads, k a positive integer, and let X be the domain and F the family of real-valued functions defined in Definition 3. Then the pseudodimension PD(X, F) satisfies

PD(X, F) ≤ ⌊log₂(ℓ_{max,D,k})⌋ + 1.    (6)

Proof.
From the definition of pseudodimension we have PD(X, F) = VC(Q′), therefore showing VC(Q′) = v ≤ ⌊log₂(ℓ_{max,D,k})⌋ + 1 suffices for the proof. An immediate consequence of Lemma 1 is that, for all elements (i, t) of any set B that is shattered by F⁺, it holds t ≥ 1/ℓ_{D,k}. Now let v be an integer and suppose that VC(Q′) = v. Then there must exist a set B ⊆ X′ with |B| = v which is shattered by F⁺, that is, all 2^v subsets of B must be in the projection of F⁺ on B. If this is true, then every element of B belongs to exactly 2^{v−1} such sets. This means that, for a given (i, t) of B, the projections of 2^{v−1} elements of F⁺ contain (i, t). Since t ≥ 1/ℓ_{D,k}, there must exist 2^{v−1} distinct k-mers appearing at least once in the read r_i. More formally, it must hold n_i − k + 1 ≥ 2^{v−1}, which implies v ≤ ⌊log₂(n_i − k + 1)⌋ + 1 for all (i, t) ∈ B. Since n_i − k + 1 ≤ ℓ_{max,D,k} for each (i, t) ∈ B, we have v ≤ ⌊log₂(ℓ_{max,D,k})⌋ + 1, and the thesis holds.

Based on the previous result, we obtain the following.

Proposition 13.
Consider a sample S of m reads from D. For fixed frequency threshold θ ∈ (0, 1), error parameter ε ∈ (0, θ), and confidence parameter δ ∈ (0, 1), if

m ≥ (2/ε²) (ℓ_{max,D,k}/ℓ_{D,k})² (⌊log₂ min(2 ℓ_{max,D,k}, σ^k)⌋ + ln(1/δ))    (7)

then, with probability ≥ 1 − δ, FK(S, k, θ − ε/2) is an ε-approximation of FK(D, k, θ).

Proof. Consider the domain X and the class of real-valued functions F defined in Definition 3. For a given function f ∈ F (i.e., for a given k-mer K), we have for f_X = E_{x∼π}[f(x)] that

f_X = E_{r_i∼D}[f_K(i)] = E_{r_i∼D}[(Σ_{j=0}^{n_i − k} φ_{r_i,K}(j)) / ℓ_{D,k}] = (1/ℓ_{D,k}) Σ_{r_i∈D} (1/n) Σ_{j=0}^{n_i − k} φ_{r_i,K}(j) = o_D(K)/(n ℓ_{D,k}) = f_D(K),

and for f_S = (1/m) Σ_{x∈S} f(x) that

f_S = (1/m) Σ_{r_i∈S} f_K(i) = (1/m) Σ_{r_i∈S} (Σ_{j=0}^{n_i − k} φ_{r_i,K}(j)) / ℓ_{D,k} = o_S(K)/(m ℓ_{D,k}) = f_S(K).
P D ( X, F ) ≤ (cid:98) log σ k (cid:99) with Propositions 2 and 3 we have that, with proba-bility at least − δ , | f S ( K ) − f D ( K ) | ≤ ε/ simultaneously holds for every k -mer K .Now, as for Proposition 10, we prove that, with probability at least − δ , F K ( S, k, θ − ε/ is an ε -approximation of F K ( D , k, θ ) , when, with probability at least − δ , “ | f S ( K ) − f D ( K ) | ≤ ε ” for all k -mers K . Note that the third property of Definition 1 is already satisfied. Let K be a k -mer of F K ( D , k, θ ) , thatis f D ( K ) ≥ θ . Given that f S ( K ) ≥ f D ( K ) − ε/ , we have f S ( K ) ≥ θ − ε/ and the first property ofDefinition 1 holds. Combining f D ( K ) ≥ f S ( K ) − ε/ and f S ( K ) ≥ θ − ε/ , we have f D ( K ) ≥ θ − ε and the second property of Definition 1 holds.This bound significantly improves on the result of Proposition 10, since the factor ln(2 σ k ) has beenreduced to (cid:98) log min(2 (cid:96) max , D ,k , σ k ) (cid:99) . Finally, by taking a sample S of size m according to Proposition 13and by extracting the set F K ( S, k, θ − ε/ we get an ε -approximation of F K ( D , k, θ ) with probability atleast − δ . However, also this approach typically results in a sample size m larger than |D| . C Analysis of the Main Technical Result (Proposition 6)
This section is dedicated to proving the main technical result on which SPRISS is built, i.e., Proposition 6 (which corresponds here to Proposition 17). We also prove Proposition 5 of the main text (here Proposition 14) and some additional, necessary results. As in the previous section, we reintroduce some important definitions and results.

We define I_ℓ = {i_1, i_2, . . . , i_ℓ} as a bag of ℓ indexes of reads of D chosen uniformly at random, with replacement, from the set {1, . . . , n}. Then we define an ℓ-reads sample S_ℓ as a bag of m bags of ℓ reads, S_ℓ = {I_{ℓ,1}, . . . , I_{ℓ,m}}. The definition of a new range space Q′ = (X′, F⁺) associated to k-mers requires defining a new domain X and a new class of real-valued functions F.

Definition 4. Let k be a positive integer and D be a bag of n reads. Define the domain X as the set of bags of ℓ indexes of reads of D. Then define the family of real-valued functions F = {f_{K,ℓ}, ∀K ∈ Σ^k} where, for every I_ℓ ∈ X and every f_{K,ℓ} ∈ F, we have f_{K,ℓ}(I_ℓ) = min(1, o_{I_ℓ}(K)) / (ℓ ℓ_{D,k}), where o_{I_ℓ}(K) = Σ_{i∈I_ℓ} Σ_{j=0}^{n_i − k} φ_{r_i,K}(j) counts the number of occurrences of K in the ℓ reads of I_ℓ. Therefore f_{K,ℓ}(I_ℓ) ∈ {0, 1/(ℓ ℓ_{D,k})} for all f_{K,ℓ} and all I_ℓ. For each f_{K,ℓ} ∈ F, the subset of X′ = X × {0, 1/(ℓ ℓ_{D,k})} defined as R_{f_{K,ℓ}} = {(I_ℓ, t) : t ≤ f_{K,ℓ}(I_ℓ)} is the associated range. Let F⁺ = {R_{f_{K,ℓ}} : f_{K,ℓ} ∈ F} be the range set on X′, and let Q′ = (X′, F⁺) be its corresponding range space.

Note that, for a given bag I_ℓ, the function f_{K,ℓ} is biased whenever K appears more than once across the ℓ reads of I_ℓ. We now prove an upper bound on the pseudodimension PD(X, F).

Proposition 14.
Let D be a bag of n reads, k a positive integer, and let X be the domain and F the family of real-valued functions defined in Definition 4. Then the pseudodimension PD(X, F) satisfies

PD(X, F) ≤ ⌊log₂(ℓ ℓ_{max,D,k})⌋ + 1.    (8)

Proof.
From the definition of pseudodimension we have PD(X, F) = VC(Q′), therefore showing VC(Q′) = v ≤ ⌊log₂(ℓ ℓ_{max,D,k})⌋ + 1 suffices for the proof. Since Lemma 1 is also valid for the new definition of the range space Q′ = (X′, F⁺), an immediate consequence is that, for all elements (I_ℓ, t) of any set B that is shattered by F⁺, it holds t ≥ 1/(ℓ ℓ_{D,k}). Now let v be an integer and suppose that VC(Q′) = v. Then there must exist a set B ⊆ X′ with |B| = v which is shattered by F⁺, that is, all 2^v subsets of B must be in the projection of F⁺ on B. If this is true, then every element of B belongs to exactly 2^{v−1} such sets. This means that, for a given (I_ℓ, t) of B, the projections of 2^{v−1} elements of F⁺ contain (I_ℓ, t). Since t ≥ 1/(ℓ ℓ_{D,k}), there must exist 2^{v−1} distinct k-mers appearing at least once in the bag of ℓ reads associated with I_ℓ. More formally, it must hold Σ_{i∈I_ℓ}(n_i − k + 1) ≥ 2^{v−1}, which implies v ≤ ⌊log₂ Σ_{i∈I_ℓ}(n_i − k + 1)⌋ + 1 for all (I_ℓ, t) ∈ B. Since n_i − k + 1 ≤ ℓ_{max,D,k} for each (I_ℓ, t) ∈ B and i ∈ I_ℓ, we have v ≤ ⌊log₂(ℓ ℓ_{max,D,k})⌋ + 1, and the thesis holds.

Before showing an improved bound on the sample size, we need to define the frequency f̂_{S_ℓ}(K) of a k-mer K computed from the sample S_ℓ:

f̂_{S_ℓ}(K) = (1/m) Σ_{I_{ℓ,i}∈S_ℓ} f_{K,ℓ}(I_{ℓ,i}),    (9)

which is the biased version of

f_{S_ℓ}(K) = (1/m) Σ_{I_{ℓ,i}∈S_ℓ} o_{I_{ℓ,i}}(K) / (ℓ ℓ_{D,k}).    (10)

Note that E[f_{S_ℓ}(K)] = f_D(K). In order to relate E[f̂_{S_ℓ}(K)] and f_D(K), we need the following proposition.

Proposition 15.
Let f̃_D(K) = Σ_{r_i∈D} 1[K ∈ r_i] / n and f_D(K) = o_D(K)/t_{D,k}. It holds that

(ℓ_{D,k}/ℓ_{max,D,k}) f_D(K) ≤ f̃_D(K) ≤ ℓ_{D,k} f_D(K).    (11)

Proof. Let us rewrite f̃_D(K):

f̃_D(K) = Σ_{r_i∈D} 1[K ∈ r_i] / n = 1[K ∈ r_1] ℓ_{D,k} / (ℓ_{D,k} n) + · · · + 1[K ∈ r_n] ℓ_{D,k} / (ℓ_{D,k} n).    (12)

Since 1[K ∈ r_i] ≤ o_{r_i}(K) for every i ∈ {1, . . . , n}, we have

f̃_D(K) ≤ o_{r_1}(K) ℓ_{D,k} / (ℓ_{D,k} n) + · · · + o_{r_n}(K) ℓ_{D,k} / (ℓ_{D,k} n) = ℓ_{D,k} f_D(K).    (13)

Since 1[K ∈ r_i] ≥ o_{r_i}(K)/ℓ_{max,D,k} for every i ∈ {1, . . . , n}, we have

f̃_D(K) ≥ o_{r_1}(K) ℓ_{D,k} / (ℓ_{D,k} n ℓ_{max,D,k}) + · · · + o_{r_n}(K) ℓ_{D,k} / (ℓ_{D,k} n ℓ_{max,D,k}) = (ℓ_{D,k}/ℓ_{max,D,k}) f_D(K).    (14)

Now we show a relation between E[f̂_{S_ℓ}(K)] and f_D(K).

Proposition 16.
Let f̃_D(K) = Σ_{r_i∈D} 1[K ∈ r_i] / n and f_D(K) = o_D(K)/t_{D,k}. Let S_ℓ be a bag of m bags of ℓ reads drawn from D. Then:

E[f̂_{S_ℓ}(K)] ≥ (1/(ℓ ℓ_{D,k})) (1 − (1 − (ℓ_{D,k}/ℓ_{max,D,k}) f_D(K))^ℓ).    (15)

Proof. Let us rewrite E[f̂_{S_ℓ}(K)]:

E[f̂_{S_ℓ}(K)] = (1/(ℓ ℓ_{D,k})) E[min(1, o_{I_ℓ}(K))] = (1/(ℓ ℓ_{D,k})) Pr(o_{I_ℓ}(K) > 0).    (16)

Then, we have

E[f̂_{S_ℓ}(K)] = (1/(ℓ ℓ_{D,k})) Pr(o_{I_ℓ}(K) > 0) = (1/(ℓ ℓ_{D,k})) (1 − Pr(o_{I_ℓ}(K) = 0))    (17)
= (1/(ℓ ℓ_{D,k})) (1 − Π_{i∈I_ℓ} Pr(o_{r_i}(K) = 0)) = (1/(ℓ ℓ_{D,k})) (1 − (1 − f̃_D(K))^ℓ),    (18)

and since f̃_D(K) ≥ (ℓ_{D,k}/ℓ_{max,D,k}) f_D(K) by Proposition 15, the thesis holds.

Let θ be a minimum frequency threshold. Using the previous proposition, if

f_D(K) ≥ (ℓ_{max,D,k}/ℓ_{D,k}) (1 − (1 − ℓ ℓ_{D,k} θ)^{1/ℓ})    (19)

with ℓ ≤ 1/(ℓ_{D,k} θ), then E[f̂_{S_ℓ}(K)] ≥ θ.

Proposition 17.
Let k and ℓ be two positive integers. Consider a sample S_ℓ of m bags of ℓ reads from D. For fixed frequency threshold θ ∈ (0, 1), error parameter ε ∈ (0, θ), and confidence parameter δ ∈ (0, 1), if

m ≥ (2/ε²) (1/(ℓ ℓ_{D,k}))² (⌊log₂ min(2 ℓ ℓ_{max,D,k}, σ^k)⌋ + ln(1/δ))    (20)

then, with probability at least 1 − δ:

• for any k-mer K ∈ FK(D, k, θ) such that f_D(K) ≥ (ℓ_{max,D,k}/ℓ_{D,k})(1 − (1 − ℓ ℓ_{D,k} θ)^{1/ℓ}) it holds f̂_{S_ℓ}(K) ≥ θ − ε/2;
• for any k-mer K with f̂_{S_ℓ}(K) ≥ θ − ε/2 it holds f_D(K) ≥ θ − ε;
• for any k-mer K ∈ FK(D, k, θ) it holds f_D(K) ≥ f̂_{S_ℓ}(K) − ε/2;
• for any k-mer K with ℓ ℓ_{D,k}(f̂_{S_ℓ}(K) + ε/2) ≤ 1 it holds f_D(K) ≤ (ℓ_{max,D,k}/ℓ_{D,k})(1 − (1 − ℓ ℓ_{D,k}(f̂_{S_ℓ}(K) + ε/2))^{1/ℓ}).

Proof. Consider f̂_{S_ℓ}(K) = (1/m) Σ_{I_{ℓ,i}∈S_ℓ} f_{K,ℓ}(I_{ℓ,i}) and its expectation E[f̂_{S_ℓ}(K)] = E[f_{K,ℓ}(I_ℓ)], which is taken with respect to the uniform distribution over bags of ℓ reads. By Proposition 11, Proposition 14, and the choice of m, we have that with probability at least 1 − δ it holds |E[f̂_{S_ℓ}(K)] − f̂_{S_ℓ}(K)| ≤ ε/2 for every k-mer K, which implies f̂_{S_ℓ}(K) ≥ E[f̂_{S_ℓ}(K)] − ε/2. Using Proposition 16, when f_D(K) ≥ (ℓ_{max,D,k}/ℓ_{D,k})(1 − (1 − ℓ ℓ_{D,k} θ)^{1/ℓ}), then E[f̂_{S_ℓ}(K)] ≥ θ and the first part holds.

By the definitions of f̂_{S_ℓ}(K) and f_{S_ℓ}(K) in Equation 9 and Equation 10 we have E[f̂_{S_ℓ}(K)] ≤ E[f_{S_ℓ}(K)] = f_D(K). From the proof of the first part we have |E[f̂_{S_ℓ}(K)] − f̂_{S_ℓ}(K)| ≤ ε/2 for every k-mer K.
If we consider a k-mer K with f_D(K) < θ − ε, we have f̂_{S_ℓ}(K) ≤ E[f̂_{S_ℓ}(K)] + ε/2 ≤ f_D(K) + ε/2 < θ − ε/2, and the second part holds.

Since f_D(K) ≥ E[f̂_{S_ℓ}(K)] and |E[f̂_{S_ℓ}(K)] − f̂_{S_ℓ}(K)| ≤ ε/2 for every k-mer K, we have f_D(K) ≥ E[f̂_{S_ℓ}(K)] ≥ f̂_{S_ℓ}(K) − ε/2, and the third part holds.

By Proposition 16 we have f_D(K) ≤ (ℓ_{max,D,k}/ℓ_{D,k})(1 − (1 − ℓ ℓ_{D,k} E[f̂_{S_ℓ}(K)])^{1/ℓ}). Using the fact that E[f̂_{S_ℓ}(K)] ≤ f̂_{S_ℓ}(K) + ε/2 for every k-mer K, the last part holds.

Algorithm 2:
SPRISS(D, k, θ, δ, ε, ℓ)
Data: D, k, θ ∈ (0, 1), δ ∈ (0, 1), ε ∈ (0, θ), integer ℓ ≥ 1
Result: Approximation A of FK(D, k, θ) with probability at least 1 − δ
1  m ← ⌈(2/ε²)(1/(ℓ ℓ_{D,k}))²(⌊log₂ min(2 ℓ ℓ_{max,D,k}, σ^k)⌋ + ln(1/δ))⌉;
2  S ← sample of exactly mℓ reads drawn from D;
3  T ← exact counting(S, k);
4  A ← ∅;
5  forall k-mers K ∈ T do
6      f̂_{S_ℓ}(K) ← Binomial(m, 1 − e^{−T[K]/m}) / (m ℓ ℓ_{D,k});  // biased frequency
7      f_{S_ℓ}(K) ← T[K] / (m ℓ ℓ_{D,k});  // unbiased frequency
8      if f̂_{S_ℓ}(K) ≥ θ − ε/2 then A ← A ∪ {(K, f_{S_ℓ}(K))};
9  return A;

D Additional Figures
Figure S1:
As a function of θ and for each dataset D, running times to approximate FK(D, k, θ) with SPRISS using JELLYFISH (SP-JELLYFISH), with the state-of-the-art sampling algorithm SAKEIMA (SK), and to exactly compute FK(D, k, θ) with JELLYFISH (E-JELLYFISH).
Figure S2:
As a function of θ and for every dataset: (a) false negatives rates, i.e., the fraction of k-mers of FK(D, k, θ) not reported by the approximation sets, obtained using SPRISS (SP) and SAKEIMA (SK); (b) maximum deviations between exact and unbiased observed frequencies provided by the approximation sets of SPRISS (SP) and SAKEIMA (SK).
Figure S3:
Comparison of the approximations of the Bray-Curtis distances, obtained using the approximations of frequent k-mers sets provided by SPRISS (×) and by SAKEIMA (•), with the exact distances, for: (a) θ = 5 · 10⁻⁸; (b) θ = 7.5 · 10⁻⁸; (c) θ = 1 · 10⁻⁷.
GOS clustering with exact Jaccard similarity
Figure S4:
Average linkage hierarchical clustering of GOS datasets using Jaccard similarity. Prefix ID of the GOS datasets: TO = Tropical Open ocean, TG = Tropical Galapagos, TN = Temperate North, TS = Temperate South, E = Estuary, NC = Non-Classified as datasets from marine environments.
GOS clustering with exact BC similarity
Figure S5:
Average linkage hierarchical clustering of GOS datasets using Bray-Curtis similarity. Prefix ID of the GOS datasets: TO = Tropical Open ocean, TG = Tropical Galapagos, TN = Temperate North, TS = Temperate South, E = Estuary, NC = Non-Classified as datasets from marine environments.
GOS clustering with BC similarity from SPRISS
Figure S6:
Average linkage hierarchical clustering of GOS datasets using the Bray-Curtis similarity estimated by SPRISS from a sample of the data. Prefix ID of the GOS datasets: TO = Tropical Open ocean, TG = Tropical Galapagos, TN = Temperate North, TS = Temperate South, E = Estuary, NC = Non-Classified as datasets from marine environments.
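For reference, the Jaccard and Bray-Curtis measures underlying Figures S4–S6 can be computed from k-mer sets and k-mer counts as in the following minimal sketch; this is the standard formulation of the two measures, with function names of our choosing, not the exact implementation used in the experiments:

```python
def jaccard_similarity(kmers_a, kmers_b):
    """Jaccard similarity between two collections of k-mers: |A ∩ B| / |A ∪ B|."""
    a, b = set(kmers_a), set(kmers_b)
    return len(a & b) / len(a | b)

def bray_curtis_distance(counts_a, counts_b):
    """Bray-Curtis distance between two k-mer count dictionaries:
    1 − 2 · (sum of shared counts) / (sum of all counts)."""
    shared = sum(min(counts_a[k], counts_b[k]) for k in counts_a.keys() & counts_b.keys())
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 1.0 - 2.0 * shared / total
```

The Bray-Curtis similarity used for the clustering is then one minus the distance; when SPRISS is used, the exact count dictionaries are simply replaced by the unbiased counts estimated from the sample.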
E Datasets
Table 1:
HMP datasets for our experimental evaluation. For each dataset D the table shows: the dataset name and site ((s) for stool, (t) for tongue dorsum); its corresponding label on the figures; the total number t_{D,k} of k-mers (k = 31) in D; the number |D| of reads it contains; the maximum read length max n_i = max_i {n_i | r_i ∈ D}; the average read length avg n_i = Σ_{i=1}^n n_i / n.

dataset        label  t_{D,k}  |D|  max n_i  avg n_i
SRS024075 (s)  HMP1   —        —    95       93.88
SRS024388 (s)  HMP2   —        —    101      96.21
SRS011239 (s)  HMP3   —        —    101      95.69
SRS075404 (t)  HMP4   —        —    101      93.51
SRS043663 (t)  HMP5   —        —    100      100.00
SRS062761 (t)  HMP6   —        —    100      100.00

Table 2: GOS datasets for our experimental evaluation. For each dataset D the table shows: the dataset name; its corresponding label for the clustering results in Figures 2c, 2d, and S5; the total number t_{D,k} of k-mers (k = 21) in D; the number |D| of reads it contains; the maximum read length max n_i = max_i {n_i | r_i ∈ D}; the average read length avg n_i = Σ_{i=1}^n n_i / n. Prefix IDs of the GOS datasets: TO = Tropical Open ocean, TG = Tropical Galapagos, TN = Temperate North, TS = Temperate South, E = Estuary, NC = Non-Classified.

dataset  label  t_{D,k}  |D|  max n_i  avg n_i
GS02     TN1    —        —    —        —
GS03     TN2    —        —    —        —
GS04     TN3    —        —    —        —
GS05     TN4    —        —    —        —
GS06     TN5    —        —    —        —
GS07     TN6    —        —    —        —
GS08     TS1    —        —    —        —
GS09     TS2    —        —    —        —
GS10     TS3    —        —    —        —
GS11     E1     —        —    —        —
GS12     E2     —        —    —        —
GS13     TS4    —        —    —        —
GS14     TG1    —        —    —        —
GS15     TO1    —        —    —        —
GS16     TO2    —        —    —        —
GS17     TO3    —        —    —        —
GS18     TO4    —        —    —        —
GS19     TO5    —        —    —        —
GS20     NC1    —        —    —        —
GS21     TG2    —        —    —        —
GS22     TG3    —        —    —        —
GS23     TO6    —        —    —        —
GS25     NC2    —        —    —        —
GS26     TO7    —        —    —        —
GS27     TG4    —        —    —        —
GS28     TG5    —        —    —        —
GS29     TG6    —        —    —        —
GS30     TG7    —        —    —        —
GS31     TG8    —        —    —        —
GS32     NC3    —        —    —        —
GS33     NC4    —        —    —        —
GS34     TG11   —        —    —        —
GS35     TG12   —        —    —        —
GS36     TG13   —        —    —        —
GS37     TG14   —        —    —        —
GS47     TG15   —        —    —        —
GS51     TG16   —        —    —        —

Table 3:
B73 and Mo17 datasets for our experimental evaluation. For each dataset D the table shows: the dataset name; the total number t_{D,k} of k-mers (k = 31) in D; the number |D| of reads it contains; the maximum read length max n_i = max_i {n_i | r_i ∈ D}; the average read length avg n_i = Σ_{i=1}^n n_i / n.

dataset  t_{D,k}  |D|  max n_i  avg n_i
B73      —        —    250      250
Mo17     —        —    250      250
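Given read-length statistics such as those reported in the tables above (from which ℓ_{D,k} and ℓ_{max,D,k} are obtained as the average and the maximum of n_i − k + 1), the number m of read bags prescribed by Proposition 17 (line 1 of Algorithm 2) can be computed as in this sketch. It is a minimal illustration assuming the sample size bound of Equation 20; function and parameter names are ours:

```python
import math

def spriss_num_bags(eps, delta, ell, ell_avg, ell_max, sigma=4, k=31):
    """Sample size m of Proposition 17 (line 1 of Algorithm 2).

    eps     -- error parameter, in (0, theta)
    delta   -- confidence parameter, in (0, 1)
    ell     -- number of reads per bag
    ell_avg -- average number of k-mers per read (ell_{D,k})
    ell_max -- maximum number of k-mers per read (ell_{max,D,k})
    """
    # pseudodimension term: floor(log2(min(2 * ell * ell_max, sigma^k)))
    v = math.floor(math.log2(min(2 * ell * ell_max, sigma ** k)))
    return math.ceil((2.0 / eps ** 2) * (1.0 / (ell * ell_avg)) ** 2
                     * (v + math.log(1.0 / delta)))
```

The total number of sampled reads is then m · ℓ (line 2 of Algorithm 2); note that m decreases as the bag size ℓ grows.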
Figure S7:
As a function of θ, false negatives rates, i.e., the fraction of k-mers of DK(D₁, D₂, k, θ, ρ) not included in its approximation, which is obtained using SPRISS, for all pairs of datasets D₁ and D₂. Figure S7g shows the running times to compute the approximation of DK(D₁, D₂, k, θ, ρ) using SPRISS against the one required to compute the exact set DK(D₁, D₂, k, θ, ρ), cumulative over all pairs of datasets D₁ and D₂.