Adaptive Exact Learning in a Mixed-Up World: Dealing with Periodicity, Errors and Jumbled-Index Queries in String Reconstruction
Ramtin Afshar, Amihood Amir, Michael T. Goodrich, Pedro Matias
arXiv: [cs.DS], Aug
Ramtin Afshar (1), Amihood Amir (2), Michael T. Goodrich (1), and Pedro Matias (1)
(1) Dept. of Computer Science, Univ. of California Irvine, USA, {afsharr,goodrich,pmatias}@uci.edu
(2) Dept. of Computer Science, Bar Ilan Univ., Israel, [email protected]
Abstract.
We study the query complexity of exactly reconstructing a string from adaptive queries, such as substring, subsequence, and jumbled-index queries. Such problems have applications, e.g., in computational biology. We provide a number of new and improved bounds for exact string reconstruction for settings where either the string or the queries are “mixed-up”. For example, we show that a periodic (i.e., “mixed-up”) string, $S = p^k p'$, of smallest period $p$, where $|p'| < |p|$, can be reconstructed using $O(\sigma|p| + \lg n)$ substring queries, where $\sigma$ is the alphabet size, if $n = |S|$ is unknown. We also show that we can reconstruct $S$ after it has been corrupted by a small number of errors $d$, measured by Hamming distance. In this case, we give an algorithm that uses $O(d\sigma|p| + d|p| \lg \frac{n}{d+1})$ queries. In addition, we show that a periodic string can be reconstructed using $2\sigma\lceil \lg n \rceil + 2|p|\lceil \lg \sigma \rceil$ subsequence queries, and that general strings can be reconstructed using $2\sigma\lceil \lg n \rceil + n\lceil \lg \sigma \rceil$ subsequence queries, without knowledge of $n$ in advance. This latter result improves the previous best, decades-old result, by Skiena and Sundaram. Finally, we believe we are the first to study the exact-learning query complexity for string reconstruction using jumbled-index queries, which are a “mixed-up” type of query that has received much attention of late.

Keywords: Exact Learning · String Reconstruction · Jumbled-Index Queries · Periodicity · DNA Sequencing · Stringology · Substrings · Hybridization · Information Security
Exact learning involves asking a series of queries so as to learn a configuration or concept uniquely and without errors, e.g., see [13]. For example, imagine a game where a player, Alice, is trying to exactly learn a secret string, $S$, such as $S$ = "rumpelstiltskin", which is known only to a magic fairy. Alice may ask the fairy questions about $S$, but only if they are in a form allowed by the fairy, such as “Is $X$ a substring of $S$?”. Any allowable question that Alice asks must be answered truthfully by the fairy. Alice’s goal is to learn $S$ by asking the fewest number of allowable questions. Her strategy is adaptive if her questions can depend on the answers to previous queries. This exact-learning string-reconstruction problem might at first seem like a contrived game, but it actually has a number of applications.

For example, the magic fairy could represent a corporation with a document database, $S$, that supports an API allowing users to perform certain online query operations on $S$, such as keyword searches. Further, this corporation may receive financial compensation for each of its database responses (either directly or through advertisements); hence, the corporation might not want the database’s entire contents leaking out. In this case, Alice could represent a rival corporation that is interested in learning the contents of the database, by asking legal queries from its API, so that Alice can set up a competing online query service. An optimal solution to the fairy-querying game would allow Alice to steal the database by asking the fewest number of questions necessary.

As another example, in interactive DNA sequencing, the fairy’s string is an unknown DNA sequence, $S$, and allowable queries are “Is $X$ a substring of $S$?” Each such question can be answered by a hybridization experiment that exposes copies of $S$ to a mixture containing specific primers to see which ones bind to $S$, e.g., see [87]. An efficient scheme for Alice to play this fairy-querying game results in an efficient method for sequencing the unknown DNA sequence.

Yet another application comes from computer security and cryptography, dealing with searchable encryption (e.g., [36, 89]), where a database returns encrypted answers in response to queries. In this case, so long as Alice can, for instance, tell encryptions of “yes” apart from encryptions of “no,” the fairy-querying game corresponds to a type of side-channel attack, e.g., see [27, 57, 65, 67, 78, 101].

Thus, we are interested in the exact-learning complexity of adaptively learning an unknown string via queries of various given types, that is, of exactly reconstructing a string from queries. Formally, we are interested in minimizing a query-complexity measure, $Q(n)$, which, in our case, is the number of queries of certain types needed in order to exactly learn a string, $S$. This query-complexity concept comes from machine learning and complexity theory, e.g., see [3, 13, 22, 31, 40, 91, 99].

Motivated by DNA sequencing, Skiena and Sundaram [87] were the first to study exact string reconstruction from adaptive queries. For substring queries, of the form “Is $X$ a substring of $S$?”, they give a bound for $Q(n)$ of $(\sigma - 1)n + 2\log n + O(\sigma)$, where $\sigma$ is the alphabet size. For subsequence queries, of the form “Is $X$ a subsequence of $S$?”, they prove a bound for $Q(n)$ of $\Theta(n \log \sigma + \sigma \log n)$. Recently, Iwama et al. [52] studied the problem for binary alphabets, which removes the additive logarithmic term in this case. These papers do not consider “mixed-up” strings, however, such as strings that are periodic or periodic with errors. The abundance of repetitions and periodic runs in genomic sequences is well known and has been exploited in recent decades for biological and medical information (see, e.g., [19, 20, 38, 41, 43, 64, 79, 80, 88, 97]). It is somewhat surprising that this phenomenon has not been used to achieve more efficient algorithms. Margaritis and Skiena [73] study a parallel version of exact string reconstruction from queries, which are hybrids of adaptive and non-adaptive strategies, showing, e.g., that a length-$n$ string can be reconstructed in $O(\log n)$ rounds using $n$ substring queries per round. Tsur [92] gives a polynomial approximation algorithm for the 1-round case. As in [87], these papers do not consider bounds for $Q(n)$ based on properties of the string such as its periodicity. Cleve et al. [34] study string reconstruction in a quantum-computing model, showing, for example, that a sublinear number of queries is sufficient for a binary alphabet. This result does not seem to carry over to a classical computing model, however, which is the subject of our paper.

Another type of query we consider is the jumbled-index (or histogram-index) query, first considered in [24, 25, 32, 45] and studied more recently in, e.g., [4, 8, 10, 11, 63, 75]. Jumbled indexing has many applications. It can be used as a tool for de novo peptide identification (as in, e.g., [53, 60, 61]), and has been used as a filter for searching an image database [33, 39, 90, 96, 102]. In this query, which has received much study of late, but has not been studied before for adaptive string reconstruction, one is given a Parikh vector, i.e., a vector of frequency counts for each character in an alphabet, and asked if there is a substring of the reference string, $S$, having these frequency counts and, if so, where it occurs in $S$.
Such reconstruction may aid in narrowing down peptide identification, or focusing on image retrieval.

Another model for string reconstruction, tangential to ours and studied extensively, is the one defined by a non-adaptive oracle, where we are given a set of answers to queries in advance, and we aim to understand sufficient and necessary conditions on the answers that enable the exact reconstruction of the string. This model differs from the adaptive one considered in this paper in that it focuses on the study of combinatorial properties of strings, rather than on minimizing the number of queries. Below, we give a detailed review of existing literature on this model, for each type of query considered in this paper.

Non-adaptive Substring Queries
There is an extensive line of work focusing on the ability to reconstruct a string given the multiset of all its length-$L$ substrings. For $L \geq a \lg n$, with the constant $a$ above a fixed threshold, and as $n$ approaches infinity, almost every length-$n$ string can be recovered. The following variants have also been studied: (i) only a subset of the length-$L$ substrings is given, or each substring is subject to substitution errors of fixed Hamming distance [58, 72]; (ii) the hidden string is an i.i.d. DNA string [15], combined with a random subset of the length-$L$ substrings [76], subject to probabilistic substitution errors [77] or edit errors (of fixed maximum amount) [50]; (iii) the hidden string satisfies several constraints based on its repeat statistics [23, 93] and input substrings are subject to erasure errors, in which a letter in the substring is replaced by an $\varepsilon$ [83]; and (iv) partial reconstruction of the hidden string is sufficient [84]. On a different note, the authors of [26, 46] consider instead the case where the input is a special set of substrings which is derived from the set of maximal substrings.

Non-adaptive Subsequence Queries
Perhaps the most studied problem in this category is the $k$-deck problem: given the multiset of all length-$k$ subsequences of a length-$n$ string $S$, what is the smallest value of $k$ that enables the unique reconstruction of $S$? This problem was introduced in [55], which showed an upper bound of $\lfloor n/2 \rfloor$. This bound was improved to $(1 + o(1))\sqrt{n \ln n}$ in [82] and, in the same year, to $\lfloor \frac{16}{7}\sqrt{n} \rfloor + 5$ in [66]. The first non-trivial lower bound, of lg / lg lg n, was given in [100] and was later improved to $\lg n$ in [71] and to $e^{\Omega(\sqrt{\lg n})}$ in [42]. Recently, Gabrys et al. [48] considered an extension of the $k$-deck problem, where one is also given a number of special subsequences of length $n - t$, $t > 0$; they provide lower and upper bounds that depend on $t$. Also related to the $k$-deck problem is the work of Simon [85], in which subsequences are considered to be of length at most $k$.

Another relevant problem is trace reconstruction. The input to this problem is a set of traces, i.e., distorted versions of the hidden string obtained by deletions (yielding subsequences) or other types of errors, when sending it through a noisy channel. Similarly, the goal is to recover the hidden string $S$, either exactly or with some accuracy or probability, using the smallest number of traces. To the best of our knowledge, this problem was first studied in [69], which provided bounds for the number of input traces, when subject to a worst-case fixed number of substitution, transposition, deletion or insertion errors. In the case of exclusively dealing with deletions, where each letter is deleted with some fixed probability $q$, Batu et al. [18] showed that reconstruction is possible w.h.p. for $q = O(1/\lg n)$ and $O(\lg n)$ traces, when $S$ is chosen uniformly at random. Moreover, they show that, for arbitrary $S$ and for $q = O(1/n^{1/2+\epsilon})$, $O(1/\epsilon)$ traces are sufficient to reconstruct a close approximation of $S$ and $O(n \lg n)$ traces are sufficient to recover $S$ exactly. Later, Kannan et al. [56] extended these results to the case where insertion errors are also allowed, showing that for deletion/insertion error probabilities of $q = O(1/\lg n)$ and $O(\lg n)$ traces, $S$ can be recovered w.h.p. assuming it is chosen uniformly at random. Similarly, they show that an arbitrary $S$ can be recovered w.h.p., for $q = O(1/n^{1/2+\epsilon})$ and $O(1)$ traces of length at most $n^{\epsilon}$. Later, Viswanathan et al. [94] improved on this, by showing that deletion/insertion error probabilities of $q = O(1/\lg n)$ are sufficient to reconstruct $S$, chosen uniformly at random. They also show that $\Omega(\lg n)$ traces are necessary to reconstruct a $1 - o(1)$ fraction of length-$n$ strings w.h.p.
In [51], the authors showed that, for the case of deletion errors only, of probability $q = O(1)$, reconstruction is possible w.h.p. using $\mathrm{poly}(n)$ traces, when $S$ is chosen uniformly at random. Finally, Sala et al. [81] studied lower bounds on the number of input traces formed from a worst-case number of insertion errors, where $S$ is a member of specific error-correcting codes, i.e., sets of strings constructed strategically so that they can be recovered after being modified by a noisy channel.

Non-adaptive Jumbled-Index Queries
In [1, 2], Acharya et al. study a non-adaptive version of the problem of enumerating candidate strings from the composition multiset of the underlying string. The composition multiset corresponds to the set of answers to all possible queries of the following type: given a Parikh vector, how many times does a matching substring occur in the hidden string? Under this model, they extend polynomial techniques used for the turnpike problem (see [37, 86]) to give: (i) sufficient (but not necessary) conditions for the ability to uniquely reconstruct a string, (ii) a sufficient characterization of unreconstructable strings and (iii) a backtracking algorithm that enumerates the set of all candidate strings, whose cardinality they lower and upper bound.
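To make the composition multiset concrete, the following is a minimal Python sketch that computes it by brute force. The function names `parikh_vector` and `composition_multiset` are hypothetical and are not taken from [1, 2]; the quadratic enumeration is only meant to illustrate the definition, not the techniques of those papers.

```python
from collections import Counter


def parikh_vector(s, alphabet):
    """Frequency count of each alphabet letter in s, as a tuple."""
    return tuple(s.count(a) for a in alphabet)


def composition_multiset(s, alphabet):
    """Map each Parikh vector to the number of substrings of s matching it.

    Brute force over all O(n^2) non-empty substrings (illustration only).
    """
    counts = Counter()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[parikh_vector(s[i:j], alphabet)] += 1
    return counts


# A string and its reversal always share the same composition multiset,
# so this information alone cannot tell them apart.
same = composition_multiset("aab", "ab") == composition_multiset("baa", "ab")
```

Here `same` evaluates to `True`, which is one simple reason why conditions for unique reconstruction in this model are stated up to such symmetries.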
We provide new and improved results for exactly reconstructing strings from adaptive substring, subsequence, and jumbled-index queries. For example, we believe we are the first to characterize query complexities for exactly reconstructing periodic strings from adaptive queries, including the following results for reconstructing a length-$n$ periodic (i.e., “mixed-up”) string, $S = p^k p'$, of smallest period $p$, where $p'$ is a prefix of $p$ and the alphabet has size $\sigma$:
– It requires at least $|p| \lg \sigma$ substring or subsequence queries.
– It can be done with $\sigma|p| + \lceil \lg |p| \rceil$ substring queries, if $n$ is known.
– It can be done with $O(\sigma|p| + \lg n)$ substring queries, if $n$ is unknown.
– It can be done with $\sigma\lceil \lg n \rceil + 2|p|\lceil \lg \sigma \rceil$ subsequence queries, for known $n$.
– It can be done with $2\sigma\lceil \lg n \rceil + 2|p|\lceil \lg \sigma \rceil$ subsequence queries, if $n$ is unknown.

Perhaps our most technical result is that we can reconstruct a length-$n$ string, $S$, within Hamming distance $d$ of a periodic string $S' = p^k p'$, of smallest period $p$, using $O(\min(\sigma n, d\sigma|p| + d|p| \lg \frac{n}{d+1}))$ substring queries, if $n$ is unknown. We also show that we can exactly reconstruct a general length-$n$ string, $S$, using $2\sigma\lceil \lg n \rceil + n\lceil \lg \sigma \rceil$ subsequence queries, if $n$ is unknown. Such queries are another “mixed-up” setting, since there can be multiple subsequence matches for a given string. Our bound improves the previous best, decades-old result, by Skiena and Sundaram [87], who prove a query complexity of $2\sigma \lg n + 1.59\,n \lg \sigma + 5\sigma$ for this case. If $n$ is known, then $\sigma\lceil \lg n \rceil + n\lceil \lg \sigma \rceil$ subsequence queries suffice. We believe we are the first to study string reconstruction using jumbled-index queries, which are yet another “mixed-up” setting, since they simply count the frequency of each character occurring in a substring.
We prove the following results:
– We can reconstruct a length-$n$ string with $O(\sigma n)$ yes/no extended jumbled-index queries, which include a count for an end-of-string character, \$.
– For jumbled-index queries that return an index of a matching substring, string reconstruction is not possible if this index is chosen adversarially, but is possible using $O(\sigma + n \lg n)$ queries if it is chosen uniformly at random.
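As an illustration of the index-returning jumbled-index query, here is a minimal Python sketch of an oracle for it. The names `jumbled_matches` and `jumbled_index_query` are hypothetical, the query is passed as a letter-to-count dictionary by assumption, and the extended variant with the end-of-string character \$ is not modeled. The uniformly random choice among matches mirrors the setting of the second result above.

```python
import random
from collections import Counter


def jumbled_matches(s, parikh):
    """All start indices i such that a substring of s starting at i has
    exactly the letter counts given by `parikh` (dict: letter -> count)."""
    m = sum(parikh.values())
    if m == 0 or m > len(s):
        return []  # empty or over-long queries have no meaningful match
    target = Counter({a: c for a, c in parikh.items() if c > 0})
    window = Counter(s[:m])  # counts of the current length-m window
    matches = []
    for i in range(len(s) - m + 1):
        if i > 0:
            # Slide the window one position to the right.
            left = s[i - 1]
            window[left] -= 1
            if window[left] == 0:
                del window[left]  # keep zero counts out of the comparison
            window[s[i + m - 1]] += 1
        if window == target:
            matches.append(i)
    return matches


def jumbled_index_query(s, parikh, rng=None):
    """Return a uniformly random matching index, or None if there is none."""
    matches = jumbled_matches(s, parikh)
    if not matches:
        return None
    return (rng or random).choice(matches)
```

For instance, on the string "abcab" the vector with one a and one b matches at indices 0 and 3, so repeated queries may return either index; an adversarial oracle could instead always return the same one.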
We consider strings over the alphabet $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$ of $\sigma$ letters. The size of a string $X$ is denoted by $|X|$. We use $X[i]$ to denote the $i$th letter of $X$ and $X[i..j]$ to refer to the substring of $X$ starting at its $i$th and ending at its $j$th letter (e.g., $X = X[1..|X|]$). We may omit $i$ when expressing a prefix $X[..j]$ of $X$. Similarly, $X[i..]$ is a suffix of $X$. Occasionally, we will express concatenation of strings $X$ and $Y$ by $X \cdot Y$ (instead of $XY$) to emphasize some property of the string. A string $X$ concatenated with itself $k$ (resp. infinitely many) times can be expressed as $X^k$ (resp. $X^\infty$). The reversal of a string $X$ is denoted by $X^R$.

A string, $S$, has period $p$ if $S = p^k p'$, such that $k > 0$ and $p'$ is a (possibly empty) prefix of $p$. Further, a string $S$ is periodic if it has a period that repeats at least twice, i.e., $S = p^k p'$ and $k > 1$. The following is a well-known result concerning the periodicity of a string, due to Fine and Wilf [47], which we will need later on.

Lemma 1 (Periodicity Lemma [47]). If $p, q$ are periods of a string $X$ of length $|X| \geq |p| + |q| - \gcd(|p|, |q|)$, then $X$ also has a period of size $\gcd(|p|, |q|)$.

A doubling search is the operation used to determine a number $n$ from a (typically unbounded) range of possibilities. It involves doubling a query value, $m$, until it is greater than $n$, followed by a binary search to determine $n$ itself. Its time complexity is $2\lfloor \lg n \rfloor + 1$. A more sophisticated version of this procedure exists (see [21]) that improves the time complexity to

$$\lfloor \lg L^{(0)}(n) \rfloor + \lfloor \lg L^{(1)}(n) \rfloor + \cdots + \lfloor \lg L^{(t-1)}(n) \rfloor + 2\lfloor \lg L^{(t)}(n) \rfloor + 1,$$

where $L^{(j)}(n) = \lfloor \lg L^{(j-1)}(n) \rfloor + 1$ and $L^{(0)}(n) = n$, for which there exists an optimized value of $t$. For simplicity, we use the traditional algorithm, which is asymptotically equivalent.

In this section, we study query complexities for a string, $S$, subject to yes/no substring queries, IsSubstr, i.e., queries of the form “Is $X$ a substring of $S$?”. We focus on the cases where $S$ corresponds to an originally periodic string that may have lost its periodicity property due to error corruption. (Footnote: our algorithms assume that $S$ is periodic, i.e., $k > 1$.) The nature of the errors is context-dependent. For example, corruption may be caused by transmission errors, measurement errors, malicious tampering, or even by the aging process of a natural phenomenon. There are multiple ways to model errors in strings. Examples include:
– Hamming distance model: corruption is caused by allowing the substitution of a letter in the string by a different letter of the alphabet
– Edit distance model: a generalization of the Hamming distance that also allows the insertion or deletion of a letter (see [68])
– Swap distance model: the operation allowed consists of swapping two adjacent letters in the string (see [70, 95])
– Interchange (or Cayley) distance model: it generalizes swap distance, by allowing the swap of any two letters, not necessarily adjacent (see [9, 12, 28, 54])

In this paper, we consider Hamming distance. We say that $S$ is a $d$-corrupted periodic string if there exists a periodic string $S'$ of period $p$, such that $|S| = |S'|$ and $\delta(S', S) \leq d$, where $\delta$ is the Hamming distance. We refer to $p$ as an approximate period of $S$. Notice that, depending on $d$, there might exist multiple possible strings $S'$ from which $S$ originates. We are interested in reconstructing $S$, as opposed to $S'$, since we can use one of the existing algorithms to enumerate all possible strings $S'$ (see [5, 7, 62]), without incurring additional queries.

Our main result in this section is the following.

Theorem 1.
We can reconstruct a length-$n$ $d$-corrupted periodic string $S$ using

$$O\left(\min\left(\sigma n,\; d\sigma|p| + d|p| \lg \frac{n}{d+1}\right)\right)$$

queries, for known $d$, unknown $|p|$, regardless of whether we know $n$, where $p$ is a smallest approximate period of $S$.

The algorithm of Theorem 1 is a more elaborate version of a reconstruction algorithm for the special case of $d = 0$, i.e., when no errors occurred and $S = S'$, and when $n$ is not known in advance.

Theorem 2.
We can reconstruct a length-$n$ periodic string, $S = p^k p'$, of smallest period $p$, using $O(\sigma|p| + \lg n)$ substring queries, assuming both $n$ and $|p|$ are unknown in advance.

The algorithm of Theorem 2, in turn, builds on a simple reconstruction algorithm that handles the case where $n$ is known in advance and $d = 0$. For clarity, we will present our results in increasing order of complexity, from the least general result of $d = 0$ and known $n$, to the most general result of arbitrary $d$ and unknown $n$.

We first give a simple algorithm to reconstruct a periodic string $S = p^k p'$ of smallest period $p$ and known size with query complexity $O(\sigma|p|)$, and then show how to improve this algorithm to have query complexity $\sigma|p|$ plus lower-order terms. Our algorithms use a primitive developed by Skiena and Sundaram [87], which we call “append (resp., prepend) a letter.” In the append (resp., prepend) primitive, we start with a known substring $q$ of $S$, and we ask queries IsSubstr($q a_i$) (resp., IsSubstr($a_i q$)), for each $a_i \in \Sigma$. Note that if we know that one of the $q a_i$ (resp., $a_i q$) strings must be a substring, we can save one query, so that appending or prepending a letter uses at most $\sigma - 1$ queries.

Our algorithm grows a candidate period $q$, using the append primitive, until $q^{g(q)-1}$ is a substring, where $g(x) = \lfloor n/|x| \rfloor$. Notice that $q$ may be an “unlucky” cyclic rotation of $p$, which only repeats $g(p) - 1$ times in $S$. Once we have $q^{g(q)-1}$, we then append/prepend letters until we recover all of $S$. For reference, see Algorithm 1, where the number of queries is shown in parentheses for steps involving queries.

Algorithm 1: Reconstructing a periodic string $S = p^k p'$ of known size $n$ and smallest period $p$
    Let $q = \varepsilon$
    repeat
        Append a letter to $q$   ($\sigma - 1$)
    until IsSubstr($q^{g(q)-1}$)   (1 per iteration; $|p|$ iterations)
    Let $T = q^{g(q)-1}$
    While $T$ is a substring of $S$, append a letter to $T$
    While $|T| < n$ and $T$ is a substring of $S$, prepend a letter to $T$   ($\sigma(2|p| - 1)$)
    Output $T$

Theorem 3.
We can reconstruct a length-$n$ periodic string $S = p^k p'$, of smallest period $p$, using $O(\sigma|p|)$ substring queries, assuming $n$ is known in advance and $|p|$ is unknown.

Proof. The main loop in Algorithm 1 will always terminate, because $S$ is periodic and any cyclic permutation of $p$ is a substring of $S$ when concatenated with itself $g(p) - 1$ times; hence, after $|p|$ iterations, $q$ must be a cyclic permutation of $p$, unless the main loop stops earlier. After the main loop, there are at most $2|p| - 1$ letters left to recover. Thus, the total number of queries is at most $\sigma|p| + \sigma(2|p| - 1) = O(\sigma|p|)$. ⊓⊔

With a little more effort, we can improve the constant factor in the query complexity. The main challenge to achieving this improvement is that, after the main loop in Algorithm 1, $q$ may not correspond to a cyclic rotation of $p$. For example, in $S = abababaab \cdot abababaab \cdot abababaab$, we may get $q = abababa$, while the actual period is $p = abababaab$. However, we show that, when $k = \lfloor n/|p| \rfloor > 3$ and $q^{g(q)-1}$ is a substring, then $q$ must be a cyclic rotation of $p$.

We begin by giving the details for our improved algorithm for reconstructing a periodic length-$n$ string $S$, when $n$ is known, which is shown in Algorithm 2.

Remark 1.
A string, $p$, is a period of a string $X$ of length $|X| \geq i|p|$ if and only if $p^j$ is a period of $X$, for all $j \in \{1, 2, \ldots, i\}$.

Algorithm 2:
Reconstructing a periodic string $S = p^k p'$ of known size $n$ and smallest period $p$, for $k > 3$
    start
        Let $q = \varepsilon$
        repeat
            Append a letter to $q$   ($\sigma - 1$)
        until IsSubstr($q^{g(q)-1}$)   (1 per iteration; $|p|$ iterations)
        Let $p$ = TrueRotation($q$)   ($\lceil \lg |q| \rceil$)
        Determine $p'$ and output $p^k p'$
    function TrueRotation($q$)
        Find, using binary search, the largest suffix $q[j..]$, such that IsSubstr($q[j..] \cdot q^{g(q)-1}$)   ($\lceil \lg |q| \rceil$)
        Return $q[j..] \cdot q[..j-1]$

Theorem 4.
We can reconstruct a length-$n$ periodic string $S = p^k p'$, of smallest period $p$, using at most $\sigma|p| + \lceil \lg |p| \rceil$ substring queries, assuming that: $n$ is known in advance, $k > 3$ and $|p|$ is unknown.

Proof. Consider Algorithm 2. We claim that, immediately after the main loop, the candidate period $q$ is indeed a cyclic rotation of the true period $p$. The remainder of the proof then follows from this.

So let us prove our claim. Let $q$ be the string immediately after the main loop and let $T = q^{\lfloor n/|q| \rfloor - 1}$. If $|q| = |p|$, then $q$ is clearly a cyclic rotation of $p$. Besides, $|q|$ cannot be greater than $|p|$, because the letter-by-letter construction of $q$ would have implied a halt of the main loop when $q$ had size $|p|$: any cyclic rotation of $p$ must repeat at least $\lfloor n/|p| \rfloor - 1$ times in $S$. So suppose, for the sake of contradiction, that $|q| < |p|$. Since $k > 3$, we have that $n \geq 4|p|$. Moreover, since $T = q^{\lfloor n/|q| \rfloor - 1}$, we know that $|T| \geq n - (2|q| - 1)$ and, thus, $|T| \geq 2|p|$. Since $T$ is a substring of $S$, $T$ must have a second period of size $|p|$. Moreover,

$$|T| \geq 2|p| \geq |p| + |q| \geq |p| + |q| - \gcd(|p|, |q|).$$

Thus, by the Periodicity Lemma (Lemma 1), $T$ has a period $p_T$ of size $\gcd(|p|, |q|)$. Therefore, $S$ must have a period of size $|p_T|$, and thus, $S$ must have a period of size $|q|$ (by Remark 1), which contradicts the fact that $p$ is the smallest period of $S$. ⊓⊔

Our analysis above is tight in the sense that, for $k = 3$, it no longer holds: recall the example given above, where $S = abababaab \cdot abababaab \cdot abababaab$ and $q = abababa$.

Notice that any reconstruction algorithm requires at least $|p| \lg \sigma$ queries; this follows from an information-theoretic argument.

Theorem 5.
Reconstructing a length-$n$ string, $S = p^k p'$, of smallest period $p$, requires at least $|p| \lg \sigma$ IsSubstr queries, even if $n$ and $|p|$ are known.

Proof. There are $\sigma^{|p|}$ possible periods for $S$. Since each period corresponds to a different output of a reconstruction algorithm, $A$, and each query is binary, we can model any such algorithm, $A$, as a binary decision tree, where each internal node corresponds to an IsSubstr query. Each of the $\sigma^{|p|}$ possible periods must correspond to at least one leaf of $A$; hence, the minimum height of $A$ is $\lg(\sigma^{|p|}) = |p| \lg \sigma$. ⊓⊔

In the next section we consider the case where the underlying string is of unknown size.
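Returning to the known-length setting, the simple procedure of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the class `SubstringOracle` and the helper names are hypothetical, the oracle is simulated with Python's `in` operator, and the input is assumed to be periodic ($k > 1$) so that an append in the main loop is always guaranteed to succeed.

```python
class SubstringOracle:
    """Simulated fairy: holds the hidden string and counts IsSubstr queries."""

    def __init__(self, secret):
        self._secret = secret
        self.queries = 0

    def is_substr(self, x):
        self.queries += 1
        return x in self._secret


def append_letter(oracle, q, alphabet, guaranteed=False):
    """Append primitive: return q + a for a letter a with q + a in S, else None.
    If an extension is guaranteed to exist, the last letter needs no query."""
    for i, a in enumerate(alphabet):
        if guaranteed and i == len(alphabet) - 1:
            return q + a  # inferred: all other letters were rejected
        if oracle.is_substr(q + a):
            return q + a
    return None


def prepend_letter(oracle, t, alphabet):
    """Prepend primitive: return a + t for a letter a with a + t in S, else None."""
    for a in alphabet:
        if oracle.is_substr(a + t):
            return a + t
    return None


def reconstruct_known_n(oracle, n, alphabet):
    """Sketch of Algorithm 1 for a periodic hidden string of known length n."""
    q = ""
    while True:  # grow the candidate period q letter by letter
        q = append_letter(oracle, q, alphabet, guaranteed=True)
        if oracle.is_substr(q * (n // len(q) - 1)):
            break
    t = q * (n // len(q) - 1)
    while True:  # extend to the right until the end of S is reached
        ext = append_letter(oracle, t, alphabet)
        if ext is None:
            break
        t = ext
    while len(t) < n:  # t is now a suffix of S; extend to the left
        t = prepend_letter(oracle, t, alphabet)
    return t
```

A usage example: `reconstruct_known_n(SubstringOracle("abcabcabcab"), 11, "abc")` recovers the hidden string, and the `queries` attribute of the oracle can be inspected afterwards to compare the number of queries against the $O(\sigma|p|)$ bound of Theorem 3. The sketch does not include the TrueRotation refinement of Algorithm 2.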
As in Section 2.1, we iteratively grow a candidate period $q$ and attempt to recover $S$ by concatenating $q$ with itself in the appropriate way. The difficulty when $n$ is unknown is that we can no longer confidently predict $g(q)$. Thus, we can no longer issue a single query to test if $q$ is the right period. An immediate solution is to use a doubling search. Unfortunately, this introduces a multiplicative $O(\lg n)$ term into the query complexity. To avoid it, we show how we can take advantage of the Periodicity Lemma (Lemma 1) to amortize the extra work needed to recover $S$.

Let us describe the algorithm (see Algorithm 3 for reference). We start with an empty candidate period $q$. At each iteration, we add a letter to $q$, using the append primitive and, using a doubling search, determine the run-length $t$ of $q$, i.e., the maximum integer $t$ such that $q^t$ is a substring of $S$. If $t = 1$, we advance to the next iteration and repeat this process. If, on the other hand, $t > 1$, we use $q$ to determine the largest substring $T$ that has a period of size $|q|$. This can be done efficiently, using doubling searches, by determining the largest suffix $l$ of $q$ and the largest prefix $r$ of $q$, such that IsSubstr($l \cdot q^t \cdot r$). Once $T$ is determined, we check whether it corresponds to $S$ by checking if there is any letter preceding or succeeding $T$ (see the IsValid subroutine). If $T$ corresponds to $S$, we output it. Otherwise, we update $q$ to be any largest substring of $T$ whose size is assuredly less than $|p|$: using the Periodicity Lemma (Lemma 1), we argue in Lemma 2 below that, if $q$ is not a cyclic rotation of $p$, then $p$ must be almost as large as the entire substring $T$; more specifically, it must be the case that $|p| > |T| - |q| + 1$. Thus, we update $q$ to be a length-$(|T| - |q| + 1)$ prefix of $T$ (any other substring of $T$ of that length would also work). We use this fact to get a faster convergence to a cyclic rotation of $p$, while making sure that we do not overshoot $|p|$. Indeed, this observation will enable us to incur an $O(\lg n)$ additive term, instead of a multiplicative one. After updating $q$, we advance to the next iteration, where a new letter is appended to $q$, and repeat this process until $T = S$.

Lemma 2.
Let $T$ be the largest proper substring of $S = p^k p'$, of smallest period $p$, such that $|q|$ is the length of the smallest period of $T$. Then, $|p| > |T| - |q| + 1$.

Algorithm 3:
Reconstructing a periodic string $S = p^k p'$, of smallest period $p$ and unknown size $n$
    start
     1  Let $q = \varepsilon$
     2  repeat
     3      Append or prepend a letter to $q$   ($\sigma - 1$; potentially up to $2\sigma$ when $k \leq 2$)
     4      Determine the run-length $t$ of $q$   ($2\lfloor \lg t \rfloor + 1$)
     5      if $t = 1$ then Let $T = q$
     6      else
     7          Let $l$ be the largest suffix of $q$ such that IsSubstr($l \cdot q^t$)   ($2\lfloor \lg |l| \rfloor + 1$)
     8          Let $r$ be the largest prefix of $q$ such that IsSubstr($l \cdot q^t \cdot r$)   ($2\lfloor \lg |r| \rfloor + 1$)
     9          Let $T = l \cdot q^t \cdot r$
    10          Let $q = T[..|T| - |q| + 1]$
    11  until IsValid($T$)   ($2\sigma$)
    12  Output $T$
    function IsValid($T$)   ($2\sigma$)
        Let $x$ be the letter to the left of $T$ or $\varepsilon$ if there is none   ($\sigma$)
        Let $y$ be the letter to the right of $T$ or $\varepsilon$ if there is none   ($\sigma$)
        Return $x = \varepsilon$ and $y = \varepsilon$

Proof.
Let us assume, by contradiction, that $|p| \leq |T| - |q| + 1$. Then, $|T| \geq |q| + |p| - 1$, which implies $|T| \geq |q| + |p| - \gcd(|q|, |p|)$. In addition, if $p$ is a period of $S$, then $T$ must have a period of size $|p|$. So, by the Periodicity Lemma (Lemma 1), $T$ also has a period of size $\gcd(|q|, |p|)$. Moreover, since $T$ is the largest proper substring of $S$, $|p|$ is not a multiple of $|q|$. Therefore, $T$ must have a period shorter than $|q|$, a contradiction. ⊓⊔

When $k \leq 2$, our algorithm behaves similarly to the letter-by-letter algorithm of Skiena and Sundaram [87]: after finding a cyclic rotation $q$ of $p$, our algorithm will continue adding letters to $q$ until $q = S$, this time using both the append and prepend primitives.

Next, we give the details of the correctness and query complexity of Algorithm 3. Let $q_1, q_2, \ldots, q_m$ be the sequence of $m$ candidate periods of increasing length, each of which is the result of the append/prepend primitive at the beginning of every iteration (line 3 of Algorithm 3), e.g., $|q_1| = 1$. Notice that each $q_i$ may be expanded (in line 10), so the difference $|q_i| - |q_{i-1}|$ may not necessarily be 1. In addition, let us use $t_i$ to denote the run-length of $q_i$ computed in line 4.

Lemma 3.
Algorithm 3 successfully returns $S = p^k p'$, of smallest period $p$, if there exists an iteration $i \in \{1, 2, \ldots, m\}$, such that $q_i$ is a cyclic rotation of $p$.

Proof. If $t_i > 1$, then it is easy to see that the string $T$, computed in line 9 in iteration $i$, must correspond to $S$. If $t_i = 1$, then the algorithm essentially switches to the letter-by-letter algorithm, appending or prepending letters until the end, when $q_m = S$. Correctness of the stopping condition follows from the correctness of IsValid. ⊓⊔

We now show that, indeed, at some iteration $i$, the candidate period $q_i$ is a cyclic rotation of $p$.

Lemma 4.
There exists an iteration $i \in \{1, 2, \ldots, m\}$, such that $q_i$ is a cyclic rotation of $p$.

Proof. Let us assume that there is no such iteration $i$. Then, since all the $q_i$’s are increasing in length, it must be the case that there exists an iteration $j \in \{1, 2, \ldots, m-1\}$, such that $|q_j| < |p|$, but $|q_{j+1}| > |p|$. However, it follows from Lemma 2 (when $t_j > 1$) and the fact that we add a single letter to $q_j$ (when $t_j = 1$) that $p$ must be at least as large as $q_{j+1}$, a contradiction. ⊓⊔

Let us now argue about query complexity. The following lemma shows that we can charge the logarithmic factors, incurred in each iteration $j$, to the work that would have been required to find the letters introduced in $q_{j+1}$. This establishes the amortization in query complexity. We denote the number of queries in iteration $j$ of Algorithm 3 by $Q(j)$.

Lemma 5.
The number of queries $Q(j)$ performed in iteration $j$ of Algorithm 3 is at most $\sigma(|q_{j+1}| - |q_j|) + O(\sigma)$, for $j < m$, or $O(\sigma + \lg n)$, for $j = m$.

Proof. Let $l_j$ and $r_j$ denote, respectively, the lengths of the suffix $l$ and prefix $r$ computed in lines 7 and 8 of Algorithm 3 in iteration $j$. The query complexity in any iteration $j$ is

$$Q(j) \leq 2\lfloor \lg t_j \rfloor + 1 + 2\lfloor \lg l_j \rfloor + 1 + 2\lfloor \lg r_j \rfloor + 1 + 4\sigma.$$

Let us assume that $t_j > 1$, since otherwise the query complexity is $O(\sigma)$ and, therefore, agrees with the query complexity that is stated in the lemma.

When $j = m$, it must be the case that $q_m$ is a cyclic rotation of $p$, and therefore has size $|p|$. Thus, we spend at most: (i) $\sigma$ queries when appending the $|p|$th letter, (ii) $2\lfloor \lg (n/|p|) \rfloor + 1$ queries to determine the run-length $t_m$, and (iii) $2(2\lfloor \lg |p| \rfloor + 1)$ queries to determine the suffix and prefix of lengths $l_m$ and $r_m$, respectively. Notice that the combined logarithmic terms amount to $O(\lg n)$. Thus, when $j = m$, the overall query complexity is $O(\sigma + \lg n)$.

When $j < m$, we have the following:

$$
\begin{aligned}
& \lg t_j \leq t_j - 1 && (t_j > 1) \\
\implies\ & \lg t_j \leq (t_j - 2)|q_j| + 1 && (|q_j| \geq 1) \\
\implies\ & \lg t_j + \lg l_j + \lg r_j \leq (t_j - 2)|q_j| + l_j + r_j + 1 && (\lg x < x) \\
\implies\ & 2(\lg t_j + \lg l_j + \lg r_j) \leq 2\big((t_j - 2)|q_j| + l_j + r_j + 1\big) \\
\implies\ & Q(j) \leq 2\big((t_j - 2)|q_j| + l_j + r_j\big) + 2 + 3 + O(\sigma) && (\text{def. of } Q(j)) \\
\implies\ & Q(j) \leq 2\big((t_j - 2)|q_j| + l_j + r_j + 2\big) + O(\sigma) \\
\implies\ & Q(j) \leq 2\big(|q_{j+1}| - |q_j|\big) + O(\sigma) && (\star) \\
\implies\ & Q(j) \leq \sigma\big(|q_{j+1}| - |q_j|\big) + O(\sigma) && (\sigma \geq 2),
\end{aligned}
$$

where $(\star)$ follows from the fact that, when $t_j > 1$, $|q_{j+1}| = (t_j - 1)|q_j| + l_j + r_j + 2$. ⊓⊔

Finally, we are in a position to prove Theorem 2, recalled below for convenience:
Theorem 2. We can reconstruct a length-$n$ periodic string, $S = p^k p'$, of smallest period $p$, using $O(\sigma|p| + \lg n)$ substring queries, assuming both $n$ and $|p|$ are unknown in advance.

Proof. Correctness follows from Lemmas 3 and 4. As for the query complexity, it follows from Lemma 5 that the overall query complexity of Algorithm 3 is $\sum_{j=1}^{m} Q(j)$. Let $i$ be the iteration in which $|q_i| = |p|$ (see Lemma 4) and let us consider separately the queries done up to and after iteration $i - 1$. By Lemma 5:

$$\sum_{j=1}^{m} Q(j) \le \sum_{j=1}^{i-1} \Big( \sigma\big(|q_{j+1}| - |q_j|\big) + O(\sigma) \Big) + \sum_{j=i}^{m} Q(j) = O(\sigma|p|) + \sum_{j=i}^{m} Q(j),$$

where the last equality follows from the telescoping nature of the first summation. As for the second summation, regarding the queries done after iteration $i - 1$: if $i = m$, then we spend either $O(\sigma)$ queries if $t_i = 1$, or $O(\sigma + \lg n)$ queries if $t_i > 1$, by Lemma 5. If, on the other hand, $i < m$, then it must have been the case that $t_j = 1$ for all $j \in \{i, i+1, \ldots, m\}$. Thus, only $O(|p|)$ letters of $S$ are left to recover at the end of iteration $i - 1$, each of which is added during one of the iterations $j \in \{i, i+1, \ldots, m\}$, using $O(\sigma|p|)$ queries in total. Thus, whether or not $i = m$, the overall query complexity is

$$\sum_{j=1}^{m} Q(j) = O(\sigma|p| + \lg n). \qquad ⊓⊔$$

Let us assume throughout the remainder of this section that $S$ is a $d$-corrupted periodic string of approximate period $p$. Recall that $S$ is a $d$-corrupted periodic string if there exists a periodic string $S'$ of period $p$, such that $|S| = |S'|$ and $\delta(S', S) \le d$, where $\delta$ is the Hamming distance. Again, the main idea of the algorithm described in this section consists of: (1) determining a cyclic rotation of a true period (in this case, there might be multiple true periods), by iteratively growing a candidate period $q$, and (2) using $q$ to recover $S$ accordingly. However, in the presence of errors, each of these steps becomes more difficult to realize efficiently. For example, in the first step, we might be growing a candidate period $q$ that includes an error. So, in order to rightfully reject the hypothesis that $q$ is at most as large as some approximate period $p$, our algorithm should be able to tell the difference between (i) $|p| = |q|$ and $q$ includes an error and (ii) $|p| > |q|$. Otherwise, the algorithm will keep on growing $q$ until it is equal to $S$, possibly incurring $\sigma n$ queries. In addition, the second step of using $q$ to determine $S$ requires more work, since the presence of errors rules out simply concatenating $q$ with itself the required number of times. Because of these issues, it is crucial that our algorithm can tell whether or not a candidate period is free of errors. Thus, the algorithm relies on the following.

Lemma 6.
Let $A$ be any length-$(2d+1)|p|$ substring of a $d$-corrupted periodic string $S$ of approximate period $p$, corresponding to the concatenation of length-$|p|$ substrings $q_1, q_2, \ldots, q_{2d+1}$. Then, a cyclic rotation of $p$ must be the only substring $q_j$ appearing at least $d + 1$ times in $q_1, q_2, \ldots, q_{2d+1}$.

Proof. Clearly, there is some $q_i$ that is a cyclic rotation of $p$. Moreover, there is some $q_j$ that appears at least $d + 1$ times in $q_1, q_2, \ldots, q_{2d+1}$, or the number of errors would exceed $d$, by the pigeonhole principle. If $i \ne j$, then each occurrence of $q_j$ contributes at least 1 error, resulting in at least $d + 1$ errors, a contradiction. Finally, $q_j$ must be the only string with $d + 1$ appearances in $q_1, q_2, \ldots, q_{2d+1}$, by the pigeonhole principle. ⊓⊔

Let us give the details for our algorithm, which is able to recover $S$, even when its size $n$ is unknown (see Algorithm 4 for reference). We maintain an initially empty substring, $A$, of $S$, by extending it with $2d + 1$ letters in each iteration, using the append and prepend primitives (as described in Section 2.1), potentially incurring an extra $\sigma$ queries for detecting a left or right endpoint of $S$. In the case that $n = |S| < |p|(2d + 1)$, the last iteration requires only $\min(2d + 1, |S| - |A|)$ new letters. Thus, after adding letters to $A$ in the $i$th iteration, $A$ is a substring of $S$ of size at most $i(2d + 1)$.

Algorithm 4: Reconstructing a $d$-corrupted periodic string $S$. (Query costs are shown in parentheses.)
  1  Let $A = \varepsilon$
  2  repeat
  3      Append/prepend $\min(2d + 1, |S| - |A|)$ letters to $A$    ($\sigma(2d + 2)$)
  4      Let $q$ be the candidate period that is a substring of $A$, as determined by Lemma 6
  5      (success, $T$) = Expand($q$)    ($O(d\sigma + d \lg \frac{n}{d+1})$)
  6  until success
  7  Output $T$

Function Expand($q$)    ($O(d\sigma + d \lg \frac{n}{d+1})$)
  1  Let $T = q$, done = False
  2  while $\delta(T, q^{\infty}[..|T|]) \le d$ and not done do
  3      Find the largest substring $R$, such that IsSubstr($T \cdot R$)    ($2\lfloor \lg |R| \rfloor + 1$)
  4      Find the largest substring $L$, such that IsSubstr($L \cdot T \cdot R$)    ($2\lfloor \lg |L| \rfloor + 1$)
  5      Let $r$ ($l$) be the letter to the right (left) of $L \cdot T \cdot R$, or $\varepsilon$ if there is none    ($2\sigma$)
  6      Let done = ($r == \varepsilon$ and $l == \varepsilon$)
  7      Let $T = l \cdot L \cdot T \cdot R \cdot r$
  8  if $\delta(T, q^{\infty}[..|T|]) > d$ then return (False, )
  9  return (True, $T$)

Before advancing to the next iteration, we determine the only possible length-$i$ candidate period $q$ that could have originated $A$ with at most $d$ errors (by Lemma 6). At this point we do not know if some approximate period $p$ has size $|p| = i$, so we try to use $q$ to recover the rest of the string, halting whenever the total number of errors exceeds $d$, in which case we advance to the next iteration and repeat this process for a new candidate period of size $i + 1$. This logic is in the subroutine Expand($q$), described next (see the pseudo-code for reference). It initializes a string $T$ to $q$ and expands it by doing the following at each iteration:

1. Appending to $T$ the largest periodic substring of period $\overrightarrow{q}$, where $\overrightarrow{q}$ is the appropriate cyclic rotation of $q$ that aligns with the right endpoint of $T$. This can be done efficiently by determining the maximum value of $x$, using a doubling search, for which IsSubstr($T \cdot (\overrightarrow{q}^{\infty}[..x])$), incurring $2\lfloor \lg x \rfloor + 1$ queries. The cyclic rotation $\overrightarrow{q}$ can be determined with no additional queries, by maintaining the value $x'$, which is the value of $x$ in the previous iteration, i.e., $\overrightarrow{q}$ is the cyclic rotation of $q$ starting at the index $(x' \bmod |q| + 2)$ of $q$.
2. Prepending to $T$ the largest periodic substring of period $\overleftarrow{q}$, where $\overleftarrow{q}$ is the appropriate cyclic rotation of $q$ that aligns with the left endpoint of $T$. This can be done efficiently by determining the maximum value of $y$, using a doubling search, for which IsSubstr($((\overleftarrow{q}^{R})^{\infty}[..y])^{R} \cdot T$), incurring $2\lfloor \lg y \rfloor + 1$ queries. The cyclic rotation $\overleftarrow{q}$ can be determined with no additional queries in a similar fashion to $\overrightarrow{q}$.
3. Determining, if they exist, the letters immediately to the left and to the right of $T$, using $2\sigma$ queries, and adding them to $T$.

The expansion process in Expand($q$) halts when either the total number of errors with respect to $q$, $\delta(T, q^{\infty}[..|T|])$, exceeds $d$ (in which case we advance to the next iteration), or when $T = S$ (in which case we return $T$).

Remark 2.
Expand($q$) successfully returns $S$ if and only if $q$ is a cyclic rotation of some approximate period.

Lemma 7. The number of queries performed during any call to Expand is $O\big(d\sigma + d \lg \frac{n}{d+1}\big)$.

Proof. Each call to Expand uses at most $2(d + 1)\sigma$ queries to determine the corrupted letters, as well as the left/right endpoints of $S$: the total number of iterations of the while loop in Expand is at most $d + 1$, since every iteration except the last introduces at least 2 errors in $T$, and each iteration incurs $2\sigma$ queries.

In addition, the number of queries used by Expand($q$) during the doubling searches is

$$\sum_{j=1}^{|q|} \big( 2\lfloor \lg R_j \rfloor + 2\lfloor \lg L_j \rfloor + 2 \big),$$

where $R_j$ and $L_j$ denote, respectively, the lengths of the substrings determined via doubling searches in lines 3 and 4, during the $j$th call to Expand. Since the total number of iterations is $d + 1$, there are at most $d + 2$ such $R_j$'s and $L_j$'s. Moreover, the above summation is maximized when all the $R_j$'s and $L_j$'s have the same average value of at most $(n - d)/(d + 1)$. This follows from Jensen's inequality and the concavity of the logarithm. Thus, the overall query complexity is

$$O\Big( d\sigma + d \lg \frac{n}{d+1} \Big). \qquad ⊓⊔$$

Correctness and query complexity of our algorithm follow from Remark 2 and Lemmas 6 and 7, giving us the following result:
Theorem 6. We can reconstruct a length-$n$ $d$-corrupted periodic string $S$ using $O\big(d\sigma|p| + d|p| \lg \frac{n}{d+1}\big)$ queries, for known $d$, unknown $|p|$, regardless of whether we know $n$, where $p$ is a smallest approximate period of $S$.

Proof. At the $|p|$th iteration of the main loop, $A$ has size $(2d + 1)|p|$ and, by Lemma 6, $q$ must correspond to a cyclic rotation of some approximate period $p$. Correctness of reconstruction then follows from Remark 2.

The overall query complexity consists of the queries used to expand $A$ in each iteration and the queries used in the calls to the subroutine Expand. The former requires at most $(2d + 2)\sigma|p|$ queries overall, and the latter requires at most $O\big(d\sigma|p| + d|p| \lg \frac{n}{d+1}\big)$, by Lemma 7. Thus, the overall query complexity is

$$O\Big( d\sigma|p| + d|p| \lg \frac{n}{d+1} \Big). \qquad ⊓⊔$$

If $n$ is known, we could save the queries used to check the left and right endpoints of $S$ in line 5 of Expand, but this does not alter the query complexity asymptotically.

We assume a small enough number of errors, following [7]. Algorithm 4 is an improvement over the $O(\sigma n)$ letter-by-letter algorithm of Skiena and Sundaram [87] for general strings of known and unknown size, when $d = O(\sigma n/(\sigma|p| + |p| \lg n))$. In particular, if $d = O(k/(1 + \lg n))$, where $k = \lfloor n/|p| \rfloor$, then Algorithm 4 is an improvement. Thus, our algorithm performs better if there is, on average, at most 1 error in every other $O(1 + \lg n)$th non-overlapping occurrence of $p$. If the number of errors is not small enough, then one should run the letter-by-letter algorithm interleaved with ours, to get an upper bound of $O(\sigma n)$ queries, giving us Theorem 1, which we referred to at the beginning of Section 2, recalled here for convenience.

Theorem 1. We can reconstruct a length-$n$ $d$-corrupted periodic string $S$ using

$$O\Big( \min\Big( \sigma n,\ d\sigma|p| + d|p| \lg \frac{n}{d+1} \Big) \Big)$$

queries, for known $d$, unknown $|p|$, regardless of whether we know $n$, where $p$ is a smallest approximate period of $S$.

We study the query complexity for a length-$n$ string, $S$, subject to yes/no subsequence queries, IsSubseq, i.e., queries of the form "Is $X$ a subsequence of $S$?" We begin with a simple lower bound.

Theorem 7.
Reconstructing a length-$n$ periodic string, $S = p^k p'$, of smallest period $p$, requires at least $|p| \lg \sigma$ IsSubseq queries, even if $n$ and $|p|$ are known.

Proof. The proof follows that of Theorem 5 for substring queries, which can be found in Section 2. ⊓⊔

Let us next describe an algorithm for reconstructing a length-$n$ periodic string, $S = p^k p'$, of smallest period $p$. We begin by performing either binary searches (if $n$ is known) or doubling searches (if $n$ is unknown), using queries of the form IsSubseq($a^i$) to determine the number of $a$'s in $S$, for each $a \in \Sigma$. From all of these queries, we can determine the value of $n$ if it was previously unknown. This part of our algorithm requires either $\sigma\lceil \lg n \rceil$ or $2\sigma\lceil \lg n \rceil$ queries in total, depending on whether we knew $n$ at the outset.

If the number of $a$'s in $S$ is $n$, for any $a \in \Sigma$, then we are done, so let us assume the number of $a$'s in $S$ is less than $n$, for each $a \in \Sigma$. Thus, when we complete all our doubling/binary searches, for each letter $a \in \Sigma$ that occurs a nonzero number of times in $S$, we have a maximal subsequence, $S_a$, of $S$, consisting of $a$'s. Moreover, since $S$ is periodic with a period that repeats $k$ times, each $S_a$ is periodic with a period that repeats $k$ times. Unfortunately, at this point in the algorithm, we may not be able to determine $k$. So next we create a binary merge tree, $T$, with each of its leaves associated with a nonempty subsequence, $S_a$, much in the style of the well-known merge-sort algorithm, so that $T$ has height $\lceil \lg \sigma \rceil$. We then perform a bottom-up merge-like procedure in $T$ using IsSubseq queries, as follows.

Let $v$ be an internal node in $T$, with children $x$ and $y$ for which we have inductively determined periodic subsequences, $S_x$ and $S_y$, respectively, of $S$. Let $n_x = |S_x|$ and $n_y = |S_y|$. To create the subsequence, $S_v$, for $v$, we need to perform a merge procedure to interleave $S_x$ and $S_y$. To do this, we maintain indices $i$ and $j$ in $S_x$ and $S_y$, respectively, such that we have already determined an interleaving, $S_v[..i+j]$, of $S_x[..i]$ and $S_y[..j]$. Initially, $i = j = 0$. We then perform the query IsSubseq($S_v[..i+j] \cdot S_x[i+1] \cdot S_y[j+1..n_y]$). Suppose the answer to this query is "yes". In this case, we set $S_v[..i+j+1] = S_v[..i+j] \cdot S_x[i+1]$ and we increment $i$. If, on the other hand, the answer to the above query is "no", then we set $S_v[..i+j+1] = S_v[..i+j] \cdot S_y[j+1]$, because in this case we know that IsSubseq($S_v[..i+j] \cdot S_y[j+1] \cdot S_x[i+1..n_x]$) would return "yes". If this latter condition occurs, then we increment $j$.

Let $q_v$ denote this new interleaving prefix, $S_v[..i+j]$, and let $\hat{k} = \lfloor n/|q_v| \rfloor$. If $q_v^{\hat{k}} q_v'$ is a plausible interleaving of $S_x$ and $S_y$, where $q_v'$ is a prefix of $q_v$, then we next ask the query IsSubseq($q_v^{\hat{k}} q_v'$). If the answer is "yes", then we set $S_v = q_v^{\hat{k}} q_v'$ and this completes the merge. Otherwise, we continue incrementally interleaving $S_x$ and $S_y$, using the current values of $i$ and $j$, by iterating the procedure described above. Clearly, this merge procedure asks at most $2|q_v|$ queries in total.

Lemma 8.
Let $p_v$ be the subsequence of $p$ consisting of the letters from $S_v$. Then $|q_v| \le |p_v|$.

Proof. The letter-by-letter construction of $q_v$ ensures that $q_v$ is the smallest period of $S_v$. Since $p_v$ is itself a period of $S_v$, it is no smaller than the smallest period, so $|p_v| \ge |q_v|$. ⊓⊔

Theorem 8. We can determine a length-$n$ periodic string, $S = p^k p'$, of smallest period $p$ of unknown size, using $2\sigma\lceil \lg n \rceil + 2|p|\lceil \lg \sigma \rceil$ IsSubseq queries, if $n$ is unknown. If $n$ is known, then $\sigma\lceil \lg n \rceil + 2|p|\lceil \lg \sigma \rceil$ IsSubseq queries suffice.

Proof. The total query complexity includes: (i) the letter decomposition $S_a$ for all $a \in \Sigma$, during the first stage, and (ii) the merge-like composition of all subsequences $S_a$, during the second stage. If $n$ is known, the first stage requires $|\Sigma|$ binary searches, incurring $\sigma\lceil \lg n \rceil$ queries. Otherwise, it requires $|\Sigma|$ doubling searches, amounting to $2\sigma\lceil \lg n \rceil$ queries. Regarding the second stage, we claim that any level $l$ of the binary merge tree, $T$, incurs a total of at most $2|p|$ queries, which amounts to a total of at most $2|p|\lceil \lg \sigma \rceil$ queries, when taking into account all the $\lceil \lg \sigma \rceil$ levels of $T$. Let $T(l)$ be the set of all nodes in $T$ at level $l$. Then,

$$\sum_{v \in T(l)} |q_v| \le \sum_{v \in T(l)} |p_v| = |p|.$$

This follows from Lemma 8 and the fact that all $\{S_v \mid v \in T(l)\}$ are pairwise letter-disjoint. Since the merge of an internal node $v$ requires a cost of $2|q_v|$, the total cost incurred in any level $l$ of $T$ is at most $2|p|$. ⊓⊔

A simple modification of our algorithm also implies the following.
Theorem 9. We can determine a length-$n$ string, $S$, using $2\sigma\lceil \lg n \rceil + n\lceil \lg \sigma \rceil$ IsSubseq queries, without knowing the value of $n$ in advance. If $n$ is known, then $\sigma\lceil \lg n \rceil + n\lceil \lg \sigma \rceil$ IsSubseq queries suffice.

Proof. Modify our subsequence-querying algorithm given in Section 3 to remove the queries for strings of the form $q_v^{\hat{k}} q_v'$. The proof follows by an analysis similar to that for Theorem 8. ⊓⊔

This latter theorem improves a result of Skiena and Sundaram [87], who prove a query bound of $2\sigma \lg n + 1.59\, n \lg \sigma + 5\sigma$ when $n$ is unknown.

Jumbled-indexing involves preprocessing a given string, $S$, so as to determine whether there exists a substring of $S$ whose letter frequencies match a given Parikh vector, i.e., a vector $\psi = (f_1, \ldots, f_\sigma)$ such that $f_i$ is the number of occurrences of $a_i \in \Sigma$ in the substring, e.g., see [4, 8, 10, 11, 63, 75]. In this section, we study the query complexity for reconstructing an unknown length-$n$ string, $S$, using jumbled-index queries. As observed by Acharya et al. [1, 2], strings and their reversals have the same "composition multiset". This immediately implies the following negative result, which we nevertheless prove for completeness.

Lemma 9. If $S$ is not a palindrome, then $S$ cannot be reconstructed by yes/no jumbled-index queries, which return whether there is a substring in $S$ with a given Parikh vector.

Proof. Suppose $S \ne S^R$, where $S^R$ denotes the reversal of $S$. For any substring, $T$, of $S$, there is, of course, a corresponding substring, $T^R$, of $S^R$. Moreover, $T$ and $T^R$ have the same Parikh vector. Thus, $S$ and $S^R$ have the same set of responses to yes/no jumbled-index queries; hence, no set of yes/no jumbled-index queries can distinguish $S$ from $S^R$. ⊓⊔

Given that simple yes/no jumbled-index queries are not sufficient for string reconstruction, let us consider an extended type of yes/no jumbled-index query.

– Jumbled-Indexing with End-of-string symbol "$" (JIE): given an extended Parikh vector, $\psi = (f_1, \ldots, f_\sigma, f_\$)$, for the letters in $\Sigma$ and an end-of-string symbol, \$, which is not in $\Sigma$, this query returns a yes/no response as to whether there is a substring of $S\$$ with extended Parikh vector $\psi$.

Unlike the yes/no jumbled-index queries, this variant enables full reconstruction.

Theorem 10.
We can reconstruct a length-$n$ string, $S$, using $(\sigma - 1)n$ JIE queries, if $n$ is known, or $\sigma(n + 1)$ JIE queries, if $n$ is unknown.

Proof. Our method is to use a letter-by-letter reconstruction algorithm via an adaptation of the prepend-a-letter primitive for substring queries. Suppose $n$ is unknown. Let $\psi$ be an extended Parikh vector for a known suffix, $s$, of $S\$$; initially, $\psi = (0, 0, \ldots, 0, 1)$ and $s = \$$. Then we perform a jumbled-index query for $\psi_i$, for each $a_i \in \Sigma$, where $\psi_i = \psi$ except that $\psi_i$ adds 1 to the $f_i$ value in $\psi$. If one of these, say, $\psi_i$, returns "yes", then we prepend $a_i$ to our known suffix and we repeat this procedure using $\psi_i$ for $\psi$. If all of these queries return "no", then we are done. If $n$ is known, on the other hand, then we can skip this last test of all-no responses and we can also save at least one query with each iteration, with the algorithm otherwise being the same. ⊓⊔

We can also consider jumbled-index queries that return an index of a matching substring for a given Parikh vector, if such a substring exists. Though related, notice that this type of query is not subsumed by the query studied in Acharya et al. [1, 2], which returns the number of occurrences (instead of the position) of matching substrings in $S$. There is some ambiguity, however, if there is more than one matching substring; hence, we should consider how to handle such multiple matches. For example, if a jumbled-index query returns the indices of all matching substrings, then $\sigma$ queries are clearly sufficient to reconstruct any length-$n$ string, for any $n$, without knowing the value of $n$ in advance. Thus, let us consider two more-interesting types of jumbled-index queries.

– Adversarial Jumbled-Indexing (AJI): given a Parikh vector, $\psi = (f_1, \ldots, f_\sigma)$, this query returns, in an adversarial manner, one of the starting indices of a matching substring, if such a substring exists. If there is no matching substring, this query returns False.

– Random Jumbled-Indexing (RJI): given a Parikh vector, $\psi = (f_1, \ldots, f_\sigma)$, this query returns, uniformly at random, one of the indices of a substring with Parikh vector $\psi$ if such a substring exists in $S$. If there is no such substring, this query returns False.

Unfortunately, for the AJI variant, there are some strings that cannot be fully reconstructed, but this is admittedly not obvious.
In fact, the unreconstructability characterization of [1, 2] fails for AJI queries, because the symmetry property used in their construction of pairwise "equicomposable" strings inherently yields matching substrings with symmetric (i.e., different) positions in $S$.

Nevertheless, we give a construction of an infinite family of pairwise indistinguishable strings, i.e., two strings such that, for every possible query, there exists an answer (positive or negative) that is common to both strings. Clearly, the adversarial strategy is to output these common answers when given either of these strings. In particular, for all b ≥
1, consider the two binary strings oflength 4 b + 14 given below, which differ only in the middle section, consisting of in the first string and in the second: S = b b S = b b Theorem 11.
The strings S and S cannot be distinguished using AJI queries,for b ≥ .Proof. Let n = 4 b + 14 be the size of the strings. We refer to responses thatwould be common to both S and S as helpless answers. Let us think of apositive answer i to a query ( k, l ) in terms of the space occupied by its matchingsubstring, denoted h i, i + k + l − i . We note that an answer that does not spanthe middle section or that spans it in its entirety must be helpless.Notice that the first half of either string is the symmetric complement ofthe second half. This implies the following: (i) an answer h i, j i to a query ( k, l )exists if and only if an answer h n − j + 1 , n − i + 1 i exists for the query ( l, k )and (ii) an answer is negative to ( k, l ) if and only if an answer is negative to( l, k ). Therefore, we can restrict ourselves to queries of the form ( k, k + c ), where c ≥
0. We break this down to the following cases:1.
Queries of type ( k, k ).We say that an answer is k -centered if it is of the type h n/ − ( k − , n/ k − i . Since any k -centered answer contains the middle section, itmust be helpless. Thus, it is enough to show, by induction, that all queries( k, k ) have k -centered answers. Clearly, this holds for the base case (1 , k − a to the query( k − , k − a mustbe the complement of each other. Thus, the k -centered answer must be validfor the query ( k, k ).2. Queries of type ( k, k + 1).Take the k -centered answer and either extend it with one letter to the left,or one letter to the right. Exactly one of these options is a valid answer to( k, k + 1) (by the symmetric-complement property of the strings) and eitherare helpless, since they span the middle section. Queries of type ( k, k + 2).Consider, as a base case, the answer h , i to the query (0 , j and 5 + 2 j are complements of each other, h , j i is a valid answer to the query( j, j + 2), for all 0 ≤ j ≤ b + 2. For greater values of j , the answer is helplessregardless, since it corresponds to a substring of length greater than 2 b + 7,half of the string length and, therefore, it spans the middle section.4. Queries of type ( k, k + c ) , for c ≥ ∆ i denote the number of ’s minus the number of ’s for the answer h i, n/ i ,with respect to S , for all 0 ≤ i ≤ n/ ∆ n/ = 1, corresponds to thefirst letter in the middle). A simple passage from right to left, for increasingvalues of i , reveals that there exist no value of i for which ∆ i = 4, so wedo not need to handle the case c ≥
4. Moreover, the only values of i forwhich ∆ i = 3 are i = 2 and i = 0, which correspond to answers for thequeries ( b + 1 , b + 4) and ( b + 2 , b + 5), respectively. However, these querieshave helpless answers: h , b + 4 i in the former and h , b + 8 i in the latter.For the first string, a similar exercise reveals that there exist no answersthat partially overlap the middle section and whose difference between thenumber of ’s and the number of ’s is at least 3. ⊓⊔ In contrast, the query variant RJI can be used to reconstruct any length- n string, S , without knowing the value of n in advance. In particular, it is possibleto reconstruct any length- n string, S , using O ( σ + n log n ) RJI queries with highprobability. Our algorithm for doing this involves a reduction to a multi-windowcoupon-collector problem.Let ψ i be a Parikh vector that is all 0’s except for a count of 1 for the letter a i ∈ Σ . Note that an RJI query using ψ i will return one of the n i locations in S with an a i uniformly at random (if n i > n i = 0, for any i = 1 , , . . . , σ , welearn this fact immediately after one RJI query for ψ i , so let us assume, w.l.o.g.,that n i >
0, for all $i = 1, 2, \ldots, \sigma$, after performing an initial $\sigma$ RJI queries.

Recall that in the coupon-collector problem, a collector visits a coupon window each day and requests a coupon from an agent, who chooses one of $n$ coupons uniformly at random and gives it to the collector, e.g., see [74]. The expected number of days required for the collector to get all $n$ coupons is $nH_n$, where $H_n$ is the $n$th Harmonic number. But this assumes the collector knows when they have received all $n$ coupons (i.e., the collector knows the value of $n$).

In a coupon-collector formulation of our reconstruction problem, we instead have $\sigma$ coupon windows, one for each letter $a_i \in \Sigma$, where each window $i$ has $n_i$ coupons that differ from the coupons for the other windows, and we do not know the value of any $n_i$. Each day the collector must choose one of the coupon windows, $i$, and request one of its coupons (corresponding to an RJI query for $\psi_i$), which is chosen uniformly at random from the $n_i$ coupons for window $i$. We are interested in a strategy and analysis for the collector to collect all $n = n_1 + n_2 + \cdots + n_\sigma$ coupons, with high probability (i.e., with probability at least $1 - 1/n$).

Note that although we do not know the value of any $n_i$, we can nonetheless test whether the collector has collected all $n$ coupons. In particular, suppose we have received RJI responses for all indices, $1, 2, \ldots, n$, for letters in $S$, and let $n_i$ be the number of $a_i$'s we have found so far. Let $\psi' = (n_1, n_2, \ldots, n_\sigma)$, and let $\psi'_i$ be equal to $\psi'$ except that we increment $n_i$ by 1. If an RJI query for each $\psi'_i$ returns False, then we know we have fully reconstructed $S$. Thus, if $n = 1$, then we can determine this and $S$ after $2\sigma$ RJI queries, so let us assume that $n \ge 2$. We also maintain an estimate $N \ge 2$, which is at least $n$ and at most twice $n$, by a simple doubling strategy, where we double $N$ any time a test for $n$ fails and we set $N$ equal to any RJI query response that is larger than $N$. Therefore, the remaining problem is to solve the multi-window coupon-collector problem.

Our strategy for the multi-window coupon-collector problem is simply to visit the coupon windows in phases, so that in phase $i$ we repeatedly visit window $i$ until we are confident we have all of its $n_i$ coupons, for which the following lemma will prove useful.

Lemma 10. Let $T_i$ be the number of trips to window $i$ needed to collect all its $n_i \ge 1$ coupons. Then, for any real number $\beta$:

$$\Pr(T_i > \beta n_i \ln N) \le \frac{n_i}{N^{\beta}}.$$

Proof. Adapting a proof from [98], let $Z_{j,r}$ denote the event that the $j$-th coupon was not picked in the first $r$ trips to window $i$. Then

$$\Pr(Z_{j,r}) = \Big( 1 - \frac{1}{n_i} \Big)^{r} \le e^{-r/n_i}.$$

Thus, for $r = \beta n_i \ln N$, we have $\Pr(Z_{j,r}) \le e^{-(\beta n_i \ln N)/n_i} = N^{-\beta}$. Therefore, by a union bound,

$$\Pr(T_i > \beta n_i \ln N) = \Pr\Big( \bigcup_{j} Z_{j,\, \beta n_i \ln N} \Big) \le n_i \cdot \Pr(Z_{1,\, \beta n_i \ln N}) \le \frac{n_i}{N^{\beta}}. \qquad ⊓⊔$$

Our strategy, then, is to let $\beta \ge 2$ and, for each phase $i$, implement a doubling strategy where we perform $\beta N_i \log N$ RJI queries for $\psi_i$, such that $N_i$ is an upper bound estimate for $n_i$, which we double each time we get more than $N_i$ distinct responses to our queries in this phase. So by the end of phase $i$, $n_i \le N_i \le 2n_i$. This gives us:

Theorem 12.
A string, S , of unknown size, n , can be reconstructed using O ( σ + n log n ) RJI queries, with high probability.
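As an illustration only, the phase-based strategy described before the theorem can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `make_rji_oracle` and `reconstruct`, the simulated oracle, and the default `beta=3` are hypothetical choices made for the example. The sketch also simplifies the bookkeeping by re-running all phases (with $N$ doubled) whenever the final completeness test fails, so the returned string is always correct and only the number of queries is random.

```python
import math
import random
from collections import Counter


def make_rji_oracle(hidden, alphabet, rng):
    """Simulated RJI oracle (hypothetical helper): given a Parikh vector,
    return a uniformly random 1-based start index of a matching substring
    of `hidden`, or None if no substring matches."""
    def rji(psi):
        length = sum(psi)
        target = Counter({a: c for a, c in zip(alphabet, psi) if c > 0})
        hits = [
            start + 1
            for start in range(len(hidden) - length + 1)
            if Counter(hidden[start:start + length]) == target
        ]
        return rng.choice(hits) if hits else None
    return rji


def reconstruct(rji, alphabet, beta=3):
    """Sketch of the multi-window coupon-collector strategy.
    Returns (reconstructed string, number of RJI queries used)."""
    sigma = len(alphabet)
    found = {}    # 1-based position -> letter learned so far
    queries = 0
    big_n = 2     # global estimate N >= 2
    while True:
        for i, letter in enumerate(alphabet):           # phase i
            psi_i = tuple(int(k == i) for k in range(sigma))
            n_i_est, seen = 1, set()                    # estimate N_i
            while True:
                absent = False
                rounds = math.ceil(beta * n_i_est * math.log(big_n))
                for _ in range(rounds):
                    queries += 1
                    pos = rji(psi_i)
                    if pos is None:                     # letter not in S
                        absent = True
                        break
                    seen.add(pos)
                    big_n = max(big_n, pos)             # raise N if needed
                if absent or len(seen) <= n_i_est:
                    break
                n_i_est *= 2    # more than N_i distinct answers: double N_i
            for pos in seen:
                found[pos] = letter
        # Completeness test: positions 1..m are all known and no Parikh
        # vector with one extra letter matches any substring of S.
        m = max(found, default=0)
        if len(found) == m:
            counts = Counter(found.values())
            psi = [counts[a] for a in alphabet]
            answers = []
            for i in range(sigma):
                ext = list(psi)
                ext[i] += 1
                queries += 1
                answers.append(rji(tuple(ext)))
            if all(a is None for a in answers):
                return "".join(found[k] for k in range(1, m + 1)), queries
        big_n *= 2    # test failed: double N and run the phases again


if __name__ == "__main__":
    rng = random.Random(0)
    oracle = make_rji_oracle("abacabcab", "abc", rng)
    print(reconstruct(oracle, "abc"))
```

The completeness test is what makes the sketch safe to run without knowing $n$: if some position beyond the largest collected index existed, the prefix extended by one letter would match one of the incremented Parikh vectors, so at least one of the $\sigma$ final queries would return an index.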
Proof. After an initial $O(\sigma)$ queries to determine which letters from $\Sigma$ appear in $S$, the total number of remaining queries performed by our method is at most

$$2 \sum_{i=1}^{\sigma} 2\beta N_i \ln N = 4 \sum_{i=1}^{\sigma} \beta N_i \ln N,$$

by the doubling strategy applied to each letter, $a_i \in \Sigma$, and then globally for $N$. Further,

$$\sum_{i=1}^{\sigma} \beta N_i \ln N \le \sum_{i=1}^{\sigma} 2\beta n_i \ln N \le \sum_{i=1}^{\sigma} 3\beta n_i \ln n = 3\beta n \ln n,$$

with probability at least

$$1 - \sum_{i=1}^{\sigma} \frac{n_i}{N^{\beta}} = 1 - \frac{\sum_{i=1}^{\sigma} n_i}{N^{\beta}} = 1 - \frac{n}{N^{\beta}} \ge 1 - \frac{1}{n},$$

by Lemma 10, since $\beta \ge 2$. ⊓⊔

We have studied the reconstruction of strings under the following settings, by giving efficient reconstruction algorithms and proving lower bounds: (i) periodic strings of known and unknown sizes, with and without mismatch errors, using substring queries; (ii) periodic strings of known and unknown sizes, using subsequence queries; and (iii) general strings, using variations of jumbled-indexing queries. For the non-optimal algorithms given here, it would be nice to know whether there exist matching lower bounds, or whether there exist faster algorithms.

Regarding corrupted periodic strings, different applications suffer from different types of corruption. In particular, the following error metrics have been considered in the literature: pseudo-local metrics such as swap distance [6] or interchange (Cayley) distance [7]; and the Levenshtein edit distance [68]. It would be interesting to see whether our reconstruction algorithms can be adapted to these more general error distances.

The next step is to reconstruct strings that have more complex syntactic regularities than periods, such as covers [14]. A length-$m$ substring $C$ of a string $T$ of length $n$ is said to be a cover of $T$ if $n > m$ and every letter of $T$ lies within some occurrence of $C$. We would like to efficiently reconstruct a coverable string, without knowing its cover a priori.

Data compression schemes such as Lempel-Ziv [103, 104] are known to compress any stationary and ergodic source down to the entropy rate of the source per source symbol, provided the input source sequence is sufficiently long. These schemes rely heavily on encoding repeated substrings by their starting index and length. In this sense, a periodic string is highly compressible. We would like to extend our ideas to reconstruct a general string in time proportional to its LZ compression.

The type of query used for reconstruction is a key factor in the reconstruction complexity. Much like the error distance, the query type is also application-dependent. A reasonable query type is the less-than matching. Let $S_1$ and $S_2$ be strings of length $n$ over an ordered alphabet. We say that $S_1$ is less than $S_2$ if $S_1[i] < S_2[i]$, $\forall i = 1, \ldots, n$. Other matchings that have been researched in the literature are the order-preserving matching [30, 35, 59] and the parameterized matching [16, 17]. In the order-preserving matching, we say that two strings match if the relative order of their elements is the same; for example, a length-5 pattern may match any string where the first element is smaller than the second, which is smaller than the third, where the fourth is equal to the second, and the fifth equals the first. Two equal-length strings $S_1$, $S_2$ over alphabet $\Sigma$ are said to parameterize match if there is a bijection $f : \Sigma \to \Sigma$ such that $S_1 = f(S_2)$. Using these more powerful queries, can we reconstruct a string more efficiently?

Finally, given the impossibility result on reconstructing strings using Adversarial Jumbled-Indexing queries, it would be interesting to know whether there exists an efficient algorithm that enumerates all of the indistinguishable strings.

Acknowledgments
This research was funded in part by the U.S. National Science Foundation under grant 1815073. Amihood Amir was partly supported by BSF grant 2018141 and ISF grant 1475-18.
References
1. Acharya, J., Das, H., Milenkovic, O., Orlitsky, A., Pan, S.: Quadratic-backtracking algorithm for string reconstruction from substring compositions. In: 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, June 29 - July 4, 2014. pp. 1296–1300. IEEE (2014). https://doi.org/10.1109/ISIT.2014.6875042
2. Acharya, J., Das, H., Milenkovic, O., Orlitsky, A., Pan, S.: String reconstruction from substring compositions. SIAM J. Discret. Math. (3), 1340–1371 (2015). https://doi.org/10.1137/140962486
3. Afshani, P., Agrawal, M., Doerr, B., Doerr, C., Larsen, K.G., Mehlhorn, K.: The query complexity of finding a hidden permutation. In: Brodnik, A., López-Ortiz, A., Raman, V., Viola, A. (eds.) Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday. Lecture Notes in Computer Science, vol. 8066, pp. 1–11. Springer (2013). https://doi.org/10.1007/978-3-642-40273-9_1
4. Afshani, P., van Duijn, I., Killmann, R., Nielsen, J.S.: A lower bound for jumbled indexing. In: 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 592–606 (2020). https://doi.org/10.1137/1.9781611975994.36
5. Amir, A., Amit, M., Landau, G.M., Sokol, D.: Period recovery of strings over the Hamming and edit distances. Theoretical Computer Science, 2–18 (2018)
6. Amir, A., Aumann, Y., Landau, G., Lewenstein, M., Lewenstein, N.: Pattern matching with swaps. Journal of Algorithms, 247–266 (2000), (Preliminary version appeared at FOCS 97.)
7. Amir, A., Eisenberg, E., Levy, A., Porat, E., Shapira, N.: Cycle detection and correction. ACM Trans. Alg. (1), 13 (2012)
8. Amir, A., Apostolico, A., Hirst, T., Landau, G.M., Lewenstein, N., Rozenberg, L.: Algorithms for jumbled indexing, jumbled border and jumbled square on run-length encoded strings. Theoretical Computer Science, 146–159 (2016). https://doi.org/10.1016/j.tcs.2016.04.030
9. Amir, A., Aumann, Y., Benson, G., Levy, A., Lipsky, O., Porat, E., Skiena, S., Vishne, U.: Pattern matching with address errors: Rearrangement distances. J. Comput. Syst. Sci. (6), 359–370 (2009). https://doi.org/10.1016/j.jcss.2009.03.001
10. Amir, A., Butman, A., Porat, E.: On the relationship between histogram indexing and block-mass indexing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences (2016), 20130132 (2014). https://doi.org/10.1098/rsta.2013.0132
11. Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) Int. Colloq. on Automata, Languages, and Programming (ICALP). pp. 114–125. Springer, Berlin, Heidelberg (2014)
12. Amir, A., Hartman, T., Kapah, O., Levy, A., Porat, E.: On the cost of interchange rearrangement in strings. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) Algorithms - ESA 2007, 15th Annual European Symposium, Eilat, Israel, October 8-10, 2007, Proceedings. Lecture Notes in Computer Science, vol. 4698, pp. 99–110. Springer (2007). https://doi.org/10.1007/978-3-540-75520-3_11
13. Angluin, D.: Queries and concept learning. Machine Learning (4), 319–342 (1988)
14. Apostolico, A., Iliopoulos, C., Farach, M.: Optimal superprimitivity testing for strings. Information Processing Letters, 17–20 (1991)
15. Arratia, R., Martin, D., Reinert, G., Waterman, M.S.: Poisson process approximation for sequence repeats and sequencing by hybridization. J. Comput. Biol. (3), 425–463 (1996). https://doi.org/10.1089/cmb.1996.3.425
16. Baker, B.S.: Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences (1), 28–42 (1996)
17. Baker, B.S.: Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing (5), 1343–1362 (1997)
18. Batu, T., Kannan, S., Khanna, S., McGregor, A.: Reconstructing strings from random traces. In: Munro, J.I. (ed.) Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, New Orleans, Louisiana, USA, January 11-14, 2004. pp. 910–918. SIAM (2004), http://dl.acm.org/citation.cfm?id=982792.982929
19. Benson, G.: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research (2), 573–580 (1999)
20. Benson, G., Waterman, M.: A method for fast database search for all k-nucleotide repeats. Nucleic Acids Research, 4828–4836 (1994)
21. Bentley, J.L., Yao, A.C.: An almost optimal algorithm for unbounded searching. Inf. Process. Lett. (3), 82–87 (1976). https://doi.org/10.1016/0020-0190(76)90071-5
22. Bernasconi, A., Damm, C., Shparlinski, I.: Circuit and decision tree complexity of some number theoretic problems. Information and Computation (2), 113–124 (2001). https://doi.org/10.1006/inco.2000.3017
23. Bresler, G., Bresler, M., Tse, D.: Optimal assembly for high throughput shotgun sequencing. In: BMC Bioinformatics. vol. 14, p. S18. Springer (2013)
24. Burcsi, P., Cicalese, F., Fici, G., Lipták, Z.: Algorithms for jumbled pattern matching in strings. Int. J. Found. Comput. Sci. (2), 357–374 (2012). https://doi.org/10.1142/S0129054112400175
25. Butman, A., Eres, R., Landau, G.M.: Scaled and permuted string matching. Inf. Process. Lett. (6), 293–297 (2004). https://doi.org/10.1016/j.ipl.2004.09.002
26. Carpi, A., de Luca, A.: Words and special factors. Theor. Comput. Sci. (1-2), 145–182 (2001). https://doi.org/10.1016/S0304-3975(99)00334-5
27. Cash, D., Grubbs, P., Perry, J., Ristenpart, T.: Leakage-abuse attacks against searchable encryption. In: Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS. pp. 668–679 (2015)
28. Cayley, A.: LXXVII. Note on the theory of permutations. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science (232), 527–529 (1849)
29. Chang, Z., Chrisnata, J., Ezerman, M.F., Kiah, H.M.: Rates of DNA sequence profiles for practical values of read lengths. IEEE Trans. Inf. Theory (11), 7166–7177 (2017). https://doi.org/10.1109/TIT.2017.2747557
30. Cho, S., Na, J.C., Park, K., Sim, J.S.: Fast order-preserving pattern matching. In: Proc. 7th Conf. Combinatorial Optimization and Applications COCOA. Lecture Notes in Computer Science, vol. 8287, pp. 295–305. Springer (2013)
31. Choi, S.S., Kim, J.H.: Optimal query complexity bounds for finding graphs. Artificial Intelligence (9), 551–569 (2010). https://doi.org/10.1016/j.artint.2010.02.003
32. Cicalese, F., Fici, G., Lipták, Z.: Searching for jumbled patterns in strings. In: Holub, J., Zdárek, J. (eds.) Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, August 31 - September 2, 2009. pp. 105–117. Prague Stringology Club, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague (2009)
33. Cieplinski, L.: MPEG-7 color descriptors and their applications. In: Skarbek, W. (ed.) Proc. 9th Intl. Conf. on Computer Analysis of Images and Patterns CAIP. LNCS, vol. 2124, pp. 11–20. Springer (2001)
34. Cleve, R., Iwama, K., Le Gall, F., Nishimura, H., Tani, S., Teruyama, J., Yamashita, S.: Reconstructing strings from substrings with quantum queries. In: Fomin, F.V., Kaski, P. (eds.) Scandinavian Workshop on Algorithm Theory (SWAT). pp. 388–397. Springer, Berlin, Heidelberg (2012)
35. Crochemore, M., Iliopoulos, C.S., Kociumaka, T., Kubica, M., Langiu, A., Pissis, S.P., Radoszewski, J., Rytter, W., Walen, T.: Order-preserving indexing. Theoretical Computer Science, 122–135 (2016)
36. Curtmola, R., Garay, J.A., Kamara, S., Ostrovsky, R.: Searchable symmetric encryption: Improved definitions and efficient constructions. Journal of Computer Security (5), 895–934 (2011)
37. Dakic, T.: On the turnpike problem. Simon Fraser University BC, Canada (2000)
38. Deininger, P.: SINEs: short interspersed repeated DNA elements in higher eukaryotes. In: Berg, D., Howe, M. (eds.) Mobile DNA, chap. 27, pp. 619–636. American Society for Microbiology (1989)
39. Deselaers, T., Keysers, D., Ney, H.: Features for image retrieval: an experimental comparison. Inf. Retr. (2), 77–107 (2008)
40. Dobzinski, S., Vondrak, J.: From query complexity to computational complexity. In: Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing. pp. 1107–1116. STOC ’12, ACM, New York, NY, USA (2012). https://doi.org/10.1145/2213977.2214076
41. Domaniç, N.O., Preparata, F.P.: A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. Journal of Computational Biology (7), 873–891 (2007)
42. Dudík, M., Schulman, L.J.: Reconstruction from subsequences. J. Comb. Theory, Ser. A (2), 337–348 (2003). https://doi.org/10.1016/S0097-3165(03)00103-1
43. Dudley, J., Lin, M.T., Le, D., Eshleman, J.R.: Microsatellite instability as a biomarker for PD-1 blockade. Clinical Cancer Research (4), 813–820 (2016)
44. Elishco, O., Gabrys, R., Médard, M., Yaakobi, E.: Repeat-free codes. In: IEEE International Symposium on Information Theory, ISIT 2019, Paris, France, July 7-12, 2019. pp. 932–936. IEEE (2019). https://doi.org/10.1109/ISIT.2019.8849483
45. Eres, R., Landau, G.M., Parida, L.: Permutation pattern discovery in biosequences. J. Comput. Biol. (6), 1050–1060 (2004). https://doi.org/10.1089/cmb.2004.11.1050
46. Fici, G., Mignosi, F., Restivo, A., Sciortino, M.: Word assembly through minimal forbidden words. Theor. Comput. Sci. (1-3), 214–230 (2006). https://doi.org/10.1016/j.tcs.2006.03.006
47. Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society (1), 109–114 (1965)
48. Gabrys, R., Milenkovic, O.: The hybrid k-deck problem: Reconstructing sequences from short and long traces. In: 2017 IEEE International Symposium on Information Theory, ISIT 2017, Aachen, Germany, June 25-30, 2017. pp. 1306–1310. IEEE (2017). https://doi.org/10.1109/ISIT.2017.8006740
49. Gabrys, R., Milenkovic, O.: Unique reconstruction of coded sequences from multiset substring spectra. In: 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, June 17-22, 2018. pp. 2540–2544. IEEE (2018). https://doi.org/10.1109/ISIT.2018.8437909
50. Ganguly, S., Mossel, E., Rácz, M.Z.: Sequence assembly from corrupted shotgun reads. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, July 10-15, 2016. pp. 265–269. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541302
51. Holenstein, T., Mitzenmacher, M., Panigrahy, R., Wieder, U.: Trace reconstruction with constant deletion probability and related results. In: Teng, S. (ed.) Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008. pp. 389–398. SIAM (2008), http://dl.acm.org/citation.cfm?id=1347082.1347125
52. Iwama, K., Teruyama, J., Tsuyama, S.: Reconstructing strings from substrings: Optimal randomized and average-case algorithms (2018)
53. Jeong, K., Bandeira, N., Kim, S., Pevzner, P.A.: Gapped spectral dictionaries and their applications for database searches of tandem mass spectra. Mol Cell Proteomics (2011), m110.002220
54. Jerrum, M.: The complexity of finding minimum-length generator sequences. Theor. Comput. Sci., 265–289 (1985). https://doi.org/10.1016/0304-3975(85)90047-7
55. Kalashnik, L.: The reconstruction of a word from fragments. Numerical mathematics and computer technology pp. 56–57 (1973)
56. Kannan, S., McGregor, A.: More on reconstructing strings from random traces: insertions and deletions. In: Proceedings of the 2005 IEEE International Symposium on Information Theory, ISIT 2005, Adelaide, South Australia, Australia, 4-9 September 2005. pp. 297–301. IEEE (2005). https://doi.org/10.1109/ISIT.2005.1523342
(6), 3125–3146 (2016). https://doi.org/10.1109/TIT.2016.2555321
59. Kim, J., Amir, A., Na, J.C., Park, K., Sim, J.S.: On representations of ternary order relations in numeric strings. In: Proc. 2nd International Conference on Algorithms for Big Data (ICABD). CEUR Workshop Proceedings, vol. 1146, pp. 46–52 (2014)
60. Kim, S., Bandeira, N., Pevzner, P.A.: Spectral profiles: A novel representation of tandem mass spectra and its applications for de novo peptide sequencing and identification. Mol Cell Proteomics, 1391–1400 (2009)
61. Kim, S., Gupta, N., Bandeira, N., Pevzner, P.A.: Spectral dictionaries: Integrating de novo peptide sequencing with database search of tandem mass spectra. Mol Cell Proteomics (1), 53–69 (2009)
62. Kociumaka, T., Radoszewski, J., Rytter, W., Straszyński, J., Waleń, T., Zuba, W.: Faster recovery of approximate periods over edit distance. In: Proc. 25th International Symposium on String Processing and Information Retrieval (SPIRE). pp. 233–240. LNCS, Springer (2018)
63. Kociumaka, T., Radoszewski, J., Rytter, W.: Efficient indexes for jumbled pattern matching with constant-sized alphabet. In: Bodlaender, H.L., Italiano, G.F. (eds.) European Symp. on Algorithms (ESA). pp. 625–636. Springer, Berlin, Heidelberg (2013)
64. Kolpakov, R., Kucherov, G.: mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res.
k-nearest neighbor query leakage. In: Proc. IEEE Symposium on Security and Privacy, SP. pp. 245–262 (2019)
66. Krasikov, I., Roditty, Y.: On a reconstruction problem for sequences. J. Comb. Theory, Ser. A (2), 344–348 (1997). https://doi.org/10.1006/jcta.1997.2732
67. Lacharité, M., Minaud, B., Paterson, K.G.: Improved reconstruction attacks on encrypted data using range query leakage. In: Proc. IEEE Symposium on Security and Privacy, SP. pp. 297–314 (2018)
68. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl., 707–710 (1966)
69. Levenshtein, V.I.: Efficient reconstruction of sequences. IEEE Trans. Inf. Theory (1), 2–22 (2001). https://doi.org/10.1109/18.904499
70. Lowrance, R., Wagner, R.A.: An extension of the string-to-string correction problem. J. ACM (2), 177–183 (1975). https://doi.org/10.1145/321879.321880
71. Manvel, B., Meyerowitz, A., Schwenk, A.J., Smith, K., Stockmeyer, P.K.: Reconstruction of sequences. Discret. Math. (3), 209–219 (1991). https://doi.org/10.1016/0012-365X(91)90026-X
72. Marcovich, S., Yaakobi, E.: Reconstruction of strings from their substrings spectrum. CoRR abs/1912.11108 (2019), http://arxiv.org/abs/1912.11108
73. Margaritis, D., Skiena, S.S.: Reconstructing strings from substrings in rounds. In: IEEE 36th Symp. on Foundations of Computer Science (FOCS). pp. 613–620 (Oct 1995). https://doi.org/10.1109/SFCS.1995.492591
74. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2 edn. (2017)
75. Moosa, T.M., Rahman, M.S.: Indexing permutations for binary strings. Information Processing Letters (18), 795–798 (2010). https://doi.org/10.1016/j.ipl.2010.06.012
76. Motahari, A.S., Bresler, G., Tse, D.N.C.: Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory (10), 6273–6289 (2013). https://doi.org/10.1109/TIT.2013.2270273
77. Motahari, A.S., Ramchandran, K., Tse, D., Ma, N.: Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads. In: Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, July 7-12, 2013. pp. 1640–1644. IEEE (2013). https://doi.org/10.1109/ISIT.2013.6620505
78. Naveed, M., Kamara, S., Wright, C.V.: Inference attacks on property-preserving encrypted databases. In: Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS. pp. 644–655 (2015)
79. Parisi, V., Fonzo, V.D., Aluffi-Pentini, F.: STRING: finding tandem repeats in DNA sequences. Bioinformatics (14), 1733–1738 (2003)
80. Pellegrini, M., Renda, M.E., Vecchio, A.: TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics [ISMB] (12), 358–366 (2010)
81. Sala, F., Gabrys, R., Schoeny, C., Mazooji, K., Dolecek, L.: Exact sequence reconstruction for insertion-correcting codes. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, July 10-15, 2016. pp. 615–619. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541372
82. Scott, A.D.: Reconstructing sequences. Discret. Math. (1-3), 231–238 (1997). https://doi.org/10.1016/S0012-365X(96)00153-7
83. Shomorony, I., Courtade, T.A., Tse, D.N.C.: Do read errors matter for genome assembly? In: IEEE International Symposium on Information Theory, ISIT 2015, Hong Kong, China, June 14-19, 2015. pp. 919–923. IEEE (2015). https://doi.org/10.1109/ISIT.2015.7282589
84. Shomorony, I., Kamath, G.M., Xia, F., Courtade, T.A., Tse, D.N.C.: Partial DNA assembly: A rate-distortion perspective. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, July 10-15, 2016. pp. 1799–1803. IEEE (2016). https://doi.org/10.1109/ISIT.2016.7541609
85. Simon, I.: Piecewise testable events. In: Barkhage, H. (ed.) Automata Theory and Formal Languages, 2nd GI Conference, Kaiserslautern, May 20-23, 1975. Lecture Notes in Computer Science, vol. 33, pp. 214–222. Springer (1975). https://doi.org/10.1007/3-540-07407-4_23
https://doi.org/10.1145/98524.98598
87. Skiena, S., Sundaram, G.: Reconstructing strings from substrings. Journal of Computational Biology (2), 333–353 (1995). https://doi.org/10.1089/cmb.1995.2.333
88. Sokol, D.: Tredd - a database for tandem repeats over the edit distance. Database: The Journal of Biological Databases and Curation (baq003) (2010). https://doi.org/10.1093/database/baq003
89. Stefanov, E., Papamanthou, C., Shi, E.: Practical dynamic searchable encryption with small leakage. In: 21st Annual Network and Distributed System Security Symposium, NDSS 2014, San Diego, California, USA, February 23-26, 2014 (2014)
90. Tan, K., Ooi, B.C., Yee, C.Y.: An evaluation of color-spatial retrieval techniques for large image databases. Multim. Tools Appl. (1), 55–78 (2001)
91. Tardos, G.: Query complexity, or why is it difficult to separate NP^A ∩ coNP^A from P^A by random oracles A? Combinatorica (4), 385–392 (Dec 1989). https://doi.org/10.1007/BF02125350
92. Tsur, D.: Tight bounds for string reconstruction using substring queries. In: Chekuri, C., Jansen, K., Rolim, J.D.P., Trevisan, L. (eds.) Algorithms and Techniques for Approximation, Randomization and Combinatorial Optimization. pp. 448–459. Springer (2005)
93. Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. (1), 191–211 (1992). https://doi.org/10.1016/0304-3975(92)90143-4
94. Viswanathan, K., Swaminathan, R.: Improved string reconstruction over insertion-deletion channels. In: Teng, S. (ed.) Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008. pp. 399–408. SIAM (2008), http://dl.acm.org/citation.cfm?id=1347082.1347126
95. Wagner, R.A.: On the complexity of the extended string-to-string correction problem. In: Rounds, W.C., Martin, N., Carlyle, J.W., Harrison, M.A. (eds.) Proceedings of the 7th Annual ACM Symposium on Theory of Computing, May 5-7, 1975, Albuquerque, New Mexico, USA. pp. 218–223. ACM (1975). https://doi.org/10.1145/800116.803771
96. Wang, J., Hua, X.: Interactive image search by color map. ACM Trans. Intell. Syst. Technol. (1), 12:1–12:23 (2011)
97. Wexler, Y., Yakhini, Z., Kashi, Y., Geiger, D.: Finding approximate tandem repeats in genomic sequences. In: RECOMB. pp. 223–232 (2004)
98. Wikipedia contributors: Coupon collector’s problem. https://en.wikipedia.org/wiki/Coupon_collector%27s_problem (2019), accessed 30-Jan-2020
99. Yao, A.C.C.: Decision tree complexity and Betti numbers. In: Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing. pp. 615–624. STOC ’94, ACM, New York, NY, USA (1994). https://doi.org/10.1145/195058.195414
100. Zenkin, A., Leont’ev, V.K.: On a non-classical recognition problem. USSR Computational Mathematics and Mathematical Physics (3), 189–193 (1984)
101. Zhang, Y., Katz, J., Papamanthou, C.: All your queries are belong to us: The power of file-injection attacks on searchable encryption. In: 25th USENIX Security Symposium, USENIX Security 16. pp. 707–720 (2016)
102. Zhou, W., Li, H., Tian, Q.: Recent advance in content-based image retrieval: A literature survey. CoRR abs/1706.06064 (2017), http://arxiv.org/abs/1706.06064
IT-24, 530–536 (1978)
104. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory (3), 337–343 (1977). https://doi.org/10.1109/TIT.1977.1055714