The Minimal Compression Rate for Similarity Identification
Amir Ingber and Tsachy Weissman
Abstract
Traditionally, data compression deals with the problem of concisely representing a data source, e.g. a sequence of letters, for the purpose of eventual reproduction (either exact or approximate). In this work we are interested in the case where the goal is to answer similarity queries about the compressed sequence, i.e. to identify whether or not the original sequence is similar to a given query sequence. We study the fundamental tradeoff between the compression rate and the reliability of the queries performed on compressed data. For i.i.d. sequences, we characterize the minimal compression rate that allows query answers that are reliable in the sense of having a vanishing false-positive probability, when false negatives are not allowed. The result is partially based on a previous work by Ahlswede et al. [1], and the inherently typical subset lemma plays a key role in the converse proof. We then characterize the compression rate achievable by schemes that use lossy source codes as a building block, and show that such schemes are, in general, suboptimal. Finally, we tackle the problem of evaluating the minimal compression rate, by converting the problem to a sequence of convex programs that can be solved efficiently.
I. INTRODUCTION
Traditionally, data compression deals with concisely representing data, e.g., a sequence of letters, for the purpose of eventual reproduction (either exact or approximate). More generally, one wishes to know something about the source from its compressed representation. In this work, we are interested in compression when the goal is to identify whether the original source sequence is similar to a given query sequence.

A typical scenario where this problem arises is in database queries. Here, a large database containing many sequences $\{\mathbf{x}_1, \ldots, \mathbf{x}_M\}$ is required to answer queries of the sort "what are the sequences in the database that are close to the sequence $\mathbf{y}$?". Such a scenario (see Fig. 1) appears, for example, in computational biology (where the sequences can be, e.g., DNA sequences), forensics (where the sequences represent fingerprints) and internet search. Specifically, our interest is in the case where for each sequence $\mathbf{x}$ in the database we keep a short signature $T(\mathbf{x})$, and the similarity queries are performed based on $T(\mathbf{x})$ and $\mathbf{y}$. Our setting differs from classical compression in that we do not require that the original data be reproducible from the signatures, so the signatures are not meant to replace the original database. There are many instances where such compression is desirable. For example, the set of signatures can be thought of as a cached version of the original database which, due to its smaller size, can be stored on a faster medium (e.g. RAM), or even hosted on many locations in order to reduce the burden on the main database. Typically, the user will eventually request the relevant sequences (and only them) from the original database; see Fig. 2.

(The authors are with the Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305. Email: {ingber,tsachy}@stanford.edu. This work is supported in part by the NSF Center for Science of Information under grant agreement CCF-0939370, and by a Google research award.)

Fig. 1. Similarity queries in a database.

Fig. 2. Answering similarity queries using a compressed database: first the user receives a set of potential matches, and then asks the original database for the actual sequences. In the example, $\mathbf{x}$ is a false positive (FP).

Naturally, when the queries are answered from the compressed data, one cannot expect to get accurate answers all the time. There are two error events that may occur: the first is a false positive (FP), where the query returns a positive answer ("the sequences are similar") but the answer is wrong (the sequences are in fact dissimilar). The second is a false negative (FN), where the query returns a negative answer ("the sequences are not similar") but the sequences are actually similar. (In the statistics literature, the FP and FN events are also known as type 1 and type 2 errors, respectively, while in the engineering literature they are known as false alarm and misdetection events, respectively.) Therefore the interesting tradeoff to consider is between the compression rate (the amount of space required to represent $T(\mathbf{x})$) and the reliability of the
query answers (measured by the FP and FN probabilities, under an appropriately defined probabilistic model).

The problem was first studied from an information-theoretic viewpoint in the seminal work by Ahlswede et al. [1]. In this work, the source and query sequences are assumed to be drawn i.i.d. from a known distribution, and both false negatives and false positives are allowed. In [1], the authors first considered the case where the probability of false positives and the probability of false negatives are only required to vanish with the dimension $n$ (in the same spirit of the definition of an achievable rate for communication as a rate for which the error probability can be made to vanish). However, it was shown in [1] that this definition leads to degeneracy: there exist schemes for compression at rates that are arbitrarily close to zero while the two error probabilities vanish. The authors in [1] then moved on to consider the case where the FP and FN probabilities are required to vanish exponentially, with prescribed exponents $\alpha$ and $\beta$, respectively, and were able to find the optimal compression rate in that setting (this optimal rate, however, is uncomputable, since the expression depends on an auxiliary random variable with unbounded cardinality). We note that this case is atypical in information theory: in the channel coding setting, the highest achievable rate (the channel capacity) is the same, regardless of whether an achievable rate is defined by an error probability vanishing exponentially or just vanishing. The same holds for the lowest compression rate for lossy reproduction (the rate-distortion function).

In this paper, we consider the case where no false negatives are allowed. The main motivation is that false negatives cause an undetected error in the system where, in contrast, false positives can be easily detected (after retrieving the sequences 'flagged' as potential matches, it is easy to filter out the false positives). This is important in several applications where one cannot compromise on the accuracy of the results (e.g. in a forensic database), but still would like to enjoy the benefits of compression. While it is natural to ask what can be gained when the FN probability is nonzero but 'very very small', it is important to recall that this probability is based on a probabilistic model of the data, which may not be accurate (in fact, it is rarely the case, especially in source coding settings, that the probabilistic model matches the actual data very closely).

The contributions of the current paper are as follows.

1) We find the minimal compression rate for reliable similarity identification with no false negatives. This rate, called the identification rate and denoted $R_{\mathrm{ID}}(D)$, turns out to be the infimal rate at which the "false-positive" exponent of [1] is positive. Our result holds for both fixed and variable length compression schemes.

2) In the case where $\mathbf{x}$ and $\mathbf{y}$ have the same alphabet, and the similarity measure satisfies the triangle inequality, we analyze two schemes for compression that are based on the notion of lossy compression. In those schemes the signature is used for producing a reconstruction $\hat{\mathbf{x}}$ of the source, and the decision whether $\mathbf{x}$ and $\mathbf{y}$ are similar is made according to the distance between $\hat{\mathbf{x}}$ and $\mathbf{y}$. We show that those schemes, although simpler for analysis and implementation, attain rates that are generally suboptimal, i.e. strictly greater than $R_{\mathrm{ID}}(D)$.

3) The identification rate $R_{\mathrm{ID}}(D)$ is stated as a non-convex optimization program with an auxiliary random variable. We provide two results that facilitate the computation of $R_{\mathrm{ID}}(D)$. First, we improve a bound on the cardinality of the auxiliary RV. Then, we propose a method of transforming the said non-convex optimization program into a
sequence of convex optimization programs, thereby allowing efficient computation of $R_{\mathrm{ID}}(D)$ for small alphabet sizes. We demonstrate the effectiveness of this approach by calculating $R_{\mathrm{ID}}(D)$ for several sources.

The paper is organized as follows. In the next section we provide an extended literature survey, which compares the setting discussed in the paper with other ideas, including different hashing schemes. In Sec. III we formulate the problem, and in Sec. IV we state and discuss our main results. In Sec. V we prove the result for the identification rate, Sec. VI contains the analysis of the schemes based on lossy compression, and in Sec. VII we describe the results enabling the computation of the identification rate. Sec. VIII delivers concluding remarks.

II. RELATED LITERATURE
A. Directly related work
In the current paper we focus on discrete alphabets only, following [1]. A parallel result, with complete characterization of the identification rate (and exponent) for the Gaussian case and quadratic distortion, appears in [2], [3]. The identification exponent problem was originally studied in [1] for the variable length case, where the resulting exponent depends on an auxiliary random variable with unbounded cardinality. A bound on the cardinality has been obtained recently in [4], where the exponent for fixed-length schemes is also found (and is different from that of the variable length schemes, unlike the identification rate; see Prop. 1 below). In the special case of exact match queries (i.e. identification with $D = 0$ for Hamming distance), the exponent was studied in [5].

B. Other work in information theory
Another closely related work is the one by Tuncel et al. [6], where a similar setting of searching in a database was considered. In that work the search accuracy was addressed by a reconstruction requirement with a single-letter distortion measure that is side-information dependent (and the tradeoff between compression and accuracy is of a Wyner-Ziv [7] type). In contrast, in the current paper the search accuracy is measured directly by the probability of false positives.

A different line of work geared at identifying the fundamental performance limits of database retrieval includes [8], [9], which characterize the maximum rate of entries that can be reliably identified in a database. These papers were extended in [10], [11], allowing compression of the database, and in [12] to the case where sequence reconstruction is also required. In all of these papers, the underlying assumption is that the original sequences are corrupted by noise before being enrolled in the database, the query sequence is one of those original sequences, and the objective is to identify which one. There are two fundamental differences between this line of work and the one in the current paper. First, in our case the query sequence is random (i.e. generated by nature) and does not need to be a sequence that has already been enrolled in the database. Second, in our problem we are searching for sequences that are similar to the query sequence (rather than an exact match).
C. Hashing and related concepts
The term hashing (see, e.g. [13]) generally refers to the process of representing a complex data entity with a short signature, or hash. Classically, hashing is used for quickly identifying exact matches, by simply comparing the hash of the source sequence with that of the query sequence. Hashing has been extended to detect membership in sets, a method known as the Bloom filter [14] (with many subsequent improvements, e.g. [15]). Here, however, we are interested in similarities, or "approximate" matches.

The extension of the hashing concept to similarity search is called Locality Sensitive Hashing (LSH), which is a framework for data structures and algorithms for finding similar items in a given set (see [16] for a survey). LSH trades off accuracy with computational complexity and space, and false negatives are allowed. Several fundamental points are different in our setting. First, we study the information-theoretic aspect of the problem, i.e. concentrate on space only (compression rate) and ignore computational complexity, in an attempt to understand the amount of information relevant to querying that can be stored in the short signatures. Second, we do not allow false negatives, which, as discussed above, are inherent to LSH. Third, in the general framework of LSH (and also in hashing), the general assumption is that the data is fixed and that the hashing functions are random. This means that the performance guarantees are given as low failure probabilities, where the probability space is that of the random functions. However, for a given database, the hashing function is eventually fixed, which means that there always exist source and/or query sequences for which the scheme will always fail. In our case, the scheme is deterministic, false negatives never occur (by design), and the probability of false positives depends on the probabilistic assumptions on the data.

Another related idea is that of dimensionality reduction techniques that preserve distances, namely those based on Johnson-Lindenstrauss type embeddings [17]. Such embeddings take a set of points in space, and transform each point to a point in a lower-dimensional space, with a guarantee that the distance between these points is approximately preserved. Note, however, that such mappings generally depend on the elements in the database, so the distance preservation property cannot apply to an arbitrary query element outside the database, making a guarantee of zero false negatives impossible without further assumptions. In fact, the original proof of the lemma in [17] results in a guarantee for any two points in space, but this guarantee is probabilistic, and therefore cannot match our setting (similarly to LSH).

The process of compressing a sequence to produce a short signature can also be thought of as a type of sketching (see, e.g., [18]), which is a computational framework for succinct data representation that still allows performing different operations with the data.
D. Practical examples of compression for similarity identification
The idea of using compression for accelerating similarity search in databases is not new. Earlier practical examples include the VA-file scheme [19], which uses scalar quantization of each coordinate of the source sequence in order to form the signature. The VA-file approach demonstrates that compression-based similarity search systems can outperform tree-based systems for similarity search, providing further motivation to study the fundamental tradeoff between compression and search accuracy. The VA-file scheme has been generalized to vector quantization in [20], showing further improvements in both computational time and number of disk I/O operations.
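To make the VA-file idea concrete, here is a minimal sketch (ours, not the construction of [19]): a signature formed by a uniform scalar quantizer per coordinate, together with a conservative decision rule that answers no only when even the smallest distance consistent with the stored cells exceeds the threshold, so no false negatives can occur. The absolute-difference distortion and all parameter values are illustrative assumptions.

```python
import numpy as np

def signature(x, edges):
    """VA-file-style signature: the quantization-cell index of every
    coordinate, under a uniform scalar quantizer defined by 'edges'."""
    return np.digitize(x, edges)

def maybe_similar(sig, y, edges, D):
    """Decide from the signature alone. For each coordinate, compute the
    smallest |x_i - y_i| consistent with x_i lying in its cell; if even
    this lower bound on the normalized distance exceeds D, answering
    'no' can never be a false negative."""
    lo = np.concatenate(([-np.inf], edges))[sig]   # lower edge of each cell
    hi = np.concatenate((edges, [np.inf]))[sig]    # upper edge of each cell
    gap = np.maximum(lo - y, 0) + np.maximum(y - hi, 0)
    return bool(gap.mean() <= D)

rng = np.random.default_rng(0)
edges = np.linspace(-2.0, 2.0, 17)                 # 16 finite cells + 2 tails
x, y = rng.normal(size=512), rng.normal(size=512)
print(maybe_similar(signature(x, edges), y, edges, D=0.5))
```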
In the machine learning literature, the term 'semantic hashing' [21] refers to a transformation that maps similar elements to bit strings that have low Hamming distance. Extensions of this concept include [22], [23]. We comment that neither of these papers provides a guarantee of zero false negatives, as in the setting considered in the current paper.

We emphasize again that the results in the current paper are concerned with the amount of compression only, and ignore the computational complexity (as is typical for information-theoretic results). Nevertheless, the fundamental limits, such as $R_{\mathrm{ID}}(D)$, characterize the playing field at which practical schemes should be evaluated.

III. PROBLEM FORMULATION
A. Notation
Throughout this paper, boldface notation $\mathbf{x}$ denotes a column vector of elements $[x_1, \ldots, x_n]^T$. Capital letters denote random variables (e.g. $X, Y$), and $\mathbf{X}, \mathbf{Y}$ denote random vectors. We use calligraphic fonts (e.g. $\mathcal{X}, \mathcal{Y}$) to represent the finite alphabets. $\log(\cdot)$ denotes the base-2 logarithm, while $\ln(\cdot)$ is used for the natural logarithm.

We measure the similarity between symbols with an arbitrary per-letter distortion measure $\rho: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$. For length-$n$ vectors, the distortion is given by

$$d(\mathbf{x}, \mathbf{y}) \triangleq \frac{1}{n} \sum_{i=1}^{n} \rho(x_i, y_i). \quad (1)$$

We say that $\mathbf{x}$ and $\mathbf{y}$ are $D$-similar when $d(\mathbf{x}, \mathbf{y}) \le D$, or simply similar when $D$ is clear from the context.
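As a small illustration of (1), here is a sketch of ours with Hamming $\rho$:

```python
def d(x, y, rho):
    """The normalized distortion (1): average per-letter distortion."""
    assert len(x) == len(y)
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)

hamming = lambda a, b: 0 if a == b else 1
x, y = [0, 1, 1, 0, 2], [0, 1, 2, 0, 2]
print(d(x, y, hamming))          # 0.2
print(d(x, y, hamming) <= 0.25)  # True: x and y are 0.25-similar
```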
B. Identification Systems

A rate-$R$ identification system $(T, g)$ consists of a signature assignment

$$T: \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\} \quad (2)$$

and a decision function

$$g: \{1, 2, \ldots, 2^{nR}\} \times \mathcal{Y}^n \to \{\textsf{no}, \textsf{maybe}\}. \quad (3)$$

A system $(T, g)$ is said to be $D$-admissible if for any $\mathbf{x}, \mathbf{y}$ satisfying $d(\mathbf{x}, \mathbf{y}) \le D$, we have

$$g(T(\mathbf{x}), \mathbf{y}) = \textsf{maybe}. \quad (4)$$

This notion of $D$-admissibility motivates the use of "no" and "maybe" in describing the output of $g$:

- If $g(T(\mathbf{x}), \mathbf{y}) = \textsf{no}$, then $\mathbf{x}$ and $\mathbf{y}$ cannot be $D$-similar.
- If $g(T(\mathbf{x}), \mathbf{y}) = \textsf{maybe}$, then $\mathbf{x}$ and $\mathbf{y}$ are possibly $D$-similar.

Stated another way, a $D$-admissible system $(T, g)$ does not produce false negatives. Thus, a natural figure of merit for a $D$-admissible system $(T, g)$ is the frequency at which false positives occur (i.e., where $g(T(\mathbf{x}), \mathbf{y}) = \textsf{maybe}$ and $d(\mathbf{x}, \mathbf{y}) > D$). To this end, let $P_X$ and $P_Y$ be probability distributions on $\mathcal{X}$ and $\mathcal{Y}$ respectively, and assume that the vectors $\mathbf{X}$ and $\mathbf{Y}$ are independent of each other and drawn i.i.d. according to $P_X$ and $P_Y$, respectively. Define the false positive event

$$\mathcal{E} = \{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}, \ d(\mathbf{X}, \mathbf{Y}) > D\}, \quad (5)$$

and note that, for any $D$-admissible system $(T, g)$, we have

$$\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}\} = \Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe} \mid d(\mathbf{X}, \mathbf{Y}) \le D\} \Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\} + \Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}, \ d(\mathbf{X}, \mathbf{Y}) > D\} \quad (6)$$

$$= \Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\} + \Pr\{\mathcal{E}\}, \quad (7)$$

where (7) follows since $\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe} \mid d(\mathbf{X}, \mathbf{Y}) \le D\} = 1$ by $D$-admissibility of $(T, g)$. Since $\Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\}$ does not depend on what scheme is employed, minimizing the false positive probability $\Pr\{\mathcal{E}\}$ over all $D$-admissible schemes $(T, g)$ is equivalent to minimizing $\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}\}$. Also note that the only interesting case is when $\Pr\{d(\mathbf{X}, \mathbf{Y}) \le D\} \to 0$ as $n$ grows, since otherwise almost all the sequences in the database will be similar to the query sequence, making the problem degenerate (since almost all the database needs to be retrieved, regardless of the compression). In this case, it is easy to see from (6) that $\Pr\{\mathcal{E}\}$ vanishes if and only if the conditional probability

$$\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe} \mid d(\mathbf{X}, \mathbf{Y}) > D\} \quad (8)$$

vanishes as well. In view of the above, we henceforth restrict our attention to the behavior of $\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}\}$. In particular, we study the tradeoff between the rate $R$ and $\Pr\{g(T(\mathbf{X}), \mathbf{Y}) = \textsf{maybe}\}$. This motivates the following definitions:

Definition 1:
For given distributions $P_X, P_Y$ and a similarity threshold $D$, a rate $R$ is said to be $D$-achievable if there exists a sequence of admissible schemes $(T^{(n)}, g^{(n)})$ with rates at most $R$, satisfying

$$\lim_{n \to \infty} \Pr\left\{g^{(n)}\left(T^{(n)}(\mathbf{X}), \mathbf{Y}\right) = \textsf{maybe}\right\} = 0. \quad (9)$$

Definition 2:
For given distributions $P_X, P_Y$ and a similarity threshold $D$, the identification rate $R_{\mathrm{ID}}(D, P_X, P_Y)$ is the infimum of $D$-achievable rates. That is,

$$R_{\mathrm{ID}}(D) \triangleq \inf\{R : R \text{ is } D\text{-achievable}\}, \quad (10)$$

where an infimum over the empty set is equal to $\infty$.

It is not hard to see that $R_{\mathrm{ID}}(D)$ must be nondecreasing. To see this, note that any sequence of schemes at rate $R$ that achieves a vanishing probability of maybe for similarity threshold $D$ is also admissible for any threshold $D' \le D$, so if $R$ is $D$-achievable, then it is also $D'$-achievable. In other words, a higher similarity threshold is a more difficult task (i.e. requires a higher compression rate). Therefore, analogously to the definition of $R_{\mathrm{ID}}(D)$, we define $D_{\mathrm{ID}}(R)$ as the maximal achievable similarity threshold for fixed-rate schemes.

The definitions of an achievable rate and the identification rate are in the same spirit as those of the rate-distortion function (the rate above which a vanishing probability of excess distortion is achievable), and also of the channel capacity (the rate below which a vanishing probability of error can be obtained). See, for example, Gallager [24].

C. Variable Length Identification Systems
In [1], the authors study a similar setting, where the compression is of variable length. In that spirit, we define the corresponding variable-length quantities.

A variable-length identification system $(T_{vl}, g_{vl})$ consists of a signature assignment

$$T_{vl}: \mathcal{X}^n \to \mathcal{B}, \quad (11)$$

where $\mathcal{B} \subseteq \{0, 1\}^*$ is a prefix-free set, and a decision function

$$g_{vl}: \mathcal{B} \times \mathcal{Y}^n \to \{\textsf{no}, \textsf{maybe}\}. \quad (12)$$

The rate of the system is given by

$$R = \frac{1}{n} \mathbb{E}[\mathrm{length}(T_{vl}(\mathbf{X}))]. \quad (13)$$

As before, a scheme is said to be admissible if for any $\mathbf{x}, \mathbf{y}$ satisfying $d(\mathbf{x}, \mathbf{y}) \le D$, we have

$$g_{vl}(T_{vl}(\mathbf{x}), \mathbf{y}) = \textsf{maybe}. \quad (14)$$

Analogously to Definitions 1 and 2, we define the variable-length identification rate, denoted $R^{vl}_{\mathrm{ID}}$, as the infimum of achievable rates for variable-length identification systems. Clearly, any rate $R$ that is achievable with fixed-length schemes is also achievable with variable-length schemes, and therefore $R_{\mathrm{ID}} \ge R^{vl}_{\mathrm{ID}}$. It turns out that both quantities are actually equal:

Proposition 1:
The identification rate for variable rate is the same as that for fixed rate, i.e.

$$R_{\mathrm{ID}}(D) = R^{vl}_{\mathrm{ID}}(D). \quad (15)$$

The proof of the proposition, given in detail in Appendix A, is based on a simple meta-argument, which essentially says that any variable-length scheme can be used as a building block to construct a fixed-length scheme. The argument is based on the concatenation of several input sequences into a larger one, and then applying a variable-length scheme to each of the sequences. This results in a variable-length scheme, but with high probability most of the signatures will have overall length bounded by some fixed length, enabling the conversion to a fixed-length scheme.

There are two direct consequences of Prop. 1 that will enable the evaluation of $R_{\mathrm{ID}}(D)$: in order to show that a given rate is achievable with fixed-rate schemes, it is possible to consider variable-length schemes, as in [1]. On the other hand, in order to prove a converse, it suffices to consider fixed-length schemes, slightly simplifying the proof. This is the path we take in the paper.

IV. MAIN RESULTS
A. The Identification Rate
Define the following distance between distributions $P_X, P_Y$:

$$\bar{\rho}(P_X, P_Y) \triangleq \min \mathbb{E}[\rho(X, Y)], \quad (16)$$

where the minimization is w.r.t. all jointly distributed random variables $X, Y$ with marginal distributions $P_X$ and $P_Y$, respectively. This distance goes by many names, such as the Wasserstein (or Vasershtein) distance, the Kantorovich distance, and the transport distance (see [25], and also [26] for a survey); a short sketch of computing it as a linear program appears at the end of this subsection.

Define the (informational) identification rate as

$$\bar{R}_{\mathrm{ID}}(D) = \min_{P_{U|X}: \ \sum_{u \in \mathcal{U}} P_U(u) \bar{\rho}(P_{X|U}(\cdot|u), P_Y) \ge D} I(X; U), \quad (17)$$

where $U$ is any random variable with finite alphabet $\mathcal{U}$ that is independent of $Y$ (the cardinality of $\mathcal{U}$ can be taken as $|\mathcal{X}| + 2$, according to [1, Lemma 3]; in the sequel we improve this bound, see Subsection IV-C).

It follows from [1, Theorem 2] that when $R > \bar{R}_{\mathrm{ID}}(D)$, there exist (variable-length) identification schemes with FP probability that vanishes exponentially with $n$ (the explicit connection to the limiting rate $\bar{R}_{\mathrm{ID}}(D)$ is made in [1, Eq. (2.21)]). This fact, combined with Prop. 1 above, implies that $R_{\mathrm{ID}}(D) \le \bar{R}_{\mathrm{ID}}(D)$. However, it remains open whether $R_{\mathrm{ID}}(D)$ is even strictly positive. In the related case studied in [1], where the probabilities of both FN and FP events are required to vanish, it was shown [1, Thm. 1] that the achievable rate in this sense is equal to zero, so according to [1], "the only problem left to investigate is the tradeoff between the rate $R$ and the two error exponents [of the FP and FN events]". Our first result below shows that the restriction to the case of no FN (also called a 'one-sided error') is, in fact, very interesting.

Theorem 2 (The Identification Rate Theorem):

$$R_{\mathrm{ID}}(D) = \bar{R}_{\mathrm{ID}}(D), \quad (18)$$

i.e. the identification rate is given in (17). Moreover, if $R < R_{\mathrm{ID}}(D)$, then the probability of maybe converges to 1 exponentially fast.

A few comments are in order regarding the theorem, whose proof is given in detail in Sec. V:

- Theorem 2 states that the case where the probability of FN events is equal to zero is inherently different from the case first discussed in [1, Thm. 1], where the FN probability is only required to vanish with $n$: here the rate problem is not degenerate (since the minimal achievable rate question does not always give zero), and is, in fact, more along the lines of the classical results in information theory, such as channel capacity and classical rate-distortion compression.

- As discussed above, the direct part of the theorem follows from [1] and Prop. 1. Nevertheless, in Sec. V we shall also outline how to prove the achievability part directly, based on a version of the type covering lemma.

- The techniques used in [1] for proving a converse result on the error exponents are not strong enough for proving the converse for Thm. 2. In the proof here, we utilize some of the tools developed in [1], namely the inherently typical subset lemma, and augment them with the blowing-up lemma ([27], see also [28, Lemma 12]). The purpose of the blowing-up lemma in this context is to take an event whose probability is exponentially small, but with a very small exponent, and transform it into a related event whose probability is very close to 1. See Sec. V for details.
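For finite alphabets, the distance (16) is a small linear program over couplings of $P_X$ and $P_Y$. A minimal sketch (ours, using scipy); for Hamming $\rho$ it recovers the total variation distance:

```python
import numpy as np
from scipy.optimize import linprog

def rho_bar(p_x, p_y, rho):
    """Minimal expected distortion over couplings of p_x and p_y
    (the transport distance in (16)), as a linear program over the
    joint distribution pi(x, y)."""
    nx, ny = len(p_x), len(p_y)
    c = rho.reshape(-1)                       # objective: sum_{x,y} pi * rho
    A_eq = []
    for i in range(nx):                       # row marginals equal p_x
        r = np.zeros((nx, ny)); r[i, :] = 1; A_eq.append(r.reshape(-1))
    for j in range(ny):                       # column marginals equal p_y
        r = np.zeros((nx, ny)); r[:, j] = 1; A_eq.append(r.reshape(-1))
    b_eq = np.concatenate([p_x, p_y])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun

p_x = np.array([0.5, 0.3, 0.2])
p_y = np.array([0.2, 0.3, 0.5])
rho = 1.0 - np.eye(3)                         # Hamming distortion
print(rho_bar(p_x, p_y, rho))                 # 0.3 = total variation here
```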
B. Special Case: Distortion Measures Satisfying the Triangle Inequality
Consider the special case where $\mathcal{X} = \mathcal{Y}$, and for all $x, y, z \in \mathcal{X}$ we have

$$\rho(x, y) + \rho(y, z) \ge \rho(x, z). \quad (19)$$

In other words, the distortion measure satisfies the triangle inequality, which is a common property of distance / similarity measures. For simplicity, also assume that the measure is symmetric, i.e. that $\rho(x, y) = \rho(y, x)$ for all $x, y \in \mathcal{X}$ (later on we discuss the generalization to the non-symmetric case). In this case there are intuitive compression schemes that naturally come to mind. The main idea is to use the compressed representation for reconstructing an approximation $\hat{\mathbf{x}}$ of the source, and then to use this reconstruction to decide whether $\mathbf{x}$ and $\mathbf{y}$ are similar or not. For example, suppose that the source was indeed reconstructed as $\hat{\mathbf{x}}$, and also assume that we know the value of $d(\mathbf{x}, \hat{\mathbf{x}})$ (adding this value to the compressed representation of $\hat{\mathbf{x}}$ is a negligible addition to the rate). Consider the following decision rule, based on the pair $(\hat{\mathbf{x}}, d(\mathbf{x}, \hat{\mathbf{x}}))$:

$$g((\hat{\mathbf{x}}, d(\mathbf{x}, \hat{\mathbf{x}})), \mathbf{y}) = \begin{cases} \textsf{no}, & d(\hat{\mathbf{x}}, \mathbf{y}) > d(\mathbf{x}, \hat{\mathbf{x}}) + D; \\ \textsf{maybe}, & \text{otherwise.} \end{cases} \quad (20)$$

This rule is admissible because in cases where $\mathbf{x}$ and $\mathbf{y}$ are $D$-similar, it follows by the triangle inequality that

$$d(\hat{\mathbf{x}}, \mathbf{y}) \le d(\hat{\mathbf{x}}, \mathbf{x}) + d(\mathbf{x}, \mathbf{y}) \quad (21)$$
$$\le d(\hat{\mathbf{x}}, \mathbf{x}) + D, \quad (22)$$

resulting in the decision function in (20) returning a maybe. The process is illustrated in Fig. 3; a code sketch of this rule appears at the end of this subsection.

Fig. 3. Illustration of the triangle-inequality decision rule. If $d(\hat{\mathbf{x}}, \mathbf{y}) > D + d(\mathbf{x}, \hat{\mathbf{x}})$, then it is certain that $d(\mathbf{x}, \mathbf{y}) > D$, and we can therefore safely declare that $\mathbf{x}$ and $\mathbf{y}$ are not similar.

The remaining question is, then, how to choose the lossy representation $\hat{\mathbf{x}}$, and whether this results in an optimal scheme (i.e. whether the optimal $R_{\mathrm{ID}}(D)$ is achieved). We survey two schemes based on the triangle inequality principle, and discuss the compression rate achievable with each scheme. In general, neither can be shown to achieve $R_{\mathrm{ID}}(D)$.

The naive scheme: LC-$\triangle$ (Lossy Compression signatures and triangle ineq. decision rule). In this scheme, we use standard lossy compression in order to represent $\hat{\mathbf{x}}$ with fixed rate $R$, i.e. we optimize for a reconstruction that minimizes $d(\mathbf{x}, \hat{\mathbf{x}})$. When the compression rate is $R$, it is known that the attained distortion $d(\mathbf{x}, \hat{\mathbf{x}})$ is close, with high probability and for long enough sequences, to the distortion-rate function $D(R)$.

The full details of the LC-$\triangle$ scheme are given in Section VI, along with the proof of the following theorem, which characterizes the similarity threshold supported by any given compression rate:

Theorem 3:
Any similarity threshold below $D^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(R)$ can be attained with an LC-$\triangle$ compression scheme of rate $R$, where

$$D^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(R) \triangleq \mathbb{E}[\rho(\hat{X}, Y)] - D(R), \quad (23)$$

where $D(R)$ is the classical distortion-rate function of the source $X$, and $\hat{X}$, which is independent of $Y$, is distributed according to the marginal distribution of the $D(R)$-achieving distribution (if there are several $D(R)$-achieving distributions, take one that maximizes $\mathbb{E}[\rho(\hat{X}, Y)]$).

We denote by $R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D)$ the inverse function of $D^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(R)$, i.e. the compression rate that is achieved for a similarity threshold $D$. This scheme is suboptimal in general, i.e. there are cases for which $R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D) > R_{\mathrm{ID}}(D)$, but in some cases they are equal. For example, the symmetric binary source with Hamming distortion is one case in which this naive scheme is optimal. This case is discussed in detail in [29], where an actual scheme is implemented based on lossy compression algorithms.

An improved scheme: TC-$\triangle$ ("Type Covering" signatures and triangle ineq. decision rule). The expression in (23) gives rise to the following intuitive idea: in the distortion-rate case, we wish to minimize the distortion, with a constraint on the mutual information (that controls the compression rate). The free variable in the optimization is the transition probability $P_{\hat{X}|X}(\hat{x}|x)$. So far this is in agreement with (23), as we wish the similarity threshold to be maximized. However, the expectation term in (23) also depends on the transition probability $P_{\hat{X}|X}(\hat{x}|x)$, raising the question whether both terms should be optimized together. The answer to this question is positive. The key step is to use a general type covering lemma for generating $\hat{\mathbf{x}}$ (using a distribution that does not necessarily minimize the distortion between $X$ and $\hat{X}$). This idea is made concrete in the following theorem (whose proof is given in Section VI):

Theorem 4:
Any similarity threshold below $D^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(R)$ can be attained by a TC-$\triangle$ compression scheme of rate $R$, where

$$D^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(R) \triangleq \max_{P_{\hat{X}|X}: \ I(X; \hat{X}) \le R} \mathbb{E}[\rho(\hat{X}, Y)] - \mathbb{E}[\rho(X, \hat{X})], \quad (24)$$

where on the RHS, $\hat{X}$ and $Y$ are independent.

We denote by $R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D)$ the inverse function of $D^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(R)$, i.e. the compression rate that is achieved by a TC-$\triangle$ scheme for a similarity threshold $D$. It can also be written as

$$R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D) = \min_{P_{\hat{X}|X}: \ \mathbb{E}[\rho(\hat{X}, Y)] - \mathbb{E}[\rho(X, \hat{X})] \ge D} I(X; \hat{X}); \quad (25)$$

a numerical sketch of this optimization appears after the special cases listed below. We also note that the TC-$\triangle$ scheme is a natural extension of the scheme given in [3, Theorem 3], which applies to continuous sources and quadratic distortion.

It is not hard to see that $R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D) \le R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D)$, since the distortion-rate achieving distribution in (23) is a feasible transition probability for the expression in (24). However, the TC-$\triangle$ scheme is not optimal in general, as we shall see later on.

So far we have, in general, that

$$R_{\mathrm{ID}}(D) \le R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D) \le R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D). \quad (26)$$

There are special cases, however, where some of the above inequalities are actually equalities (proved in Sec. VI):

- For the binary-Hamming case, we have

$$R_{\mathrm{ID}}(D) = R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D). \quad (27)$$

- If $\sum_{y \in \mathcal{Y}} P_Y(y) \rho(\hat{x}, y)$ is constant for all $\hat{x} \in \mathcal{X}$, then

$$R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D) = R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D) = R(D_0 - D), \quad (28)$$

where $R(\cdot)$ is the standard rate-distortion function and $D_0 \triangleq \sum_{y \in \mathcal{Y}} P_Y(y) \rho(\hat{x}, y)$.

- As a consequence, in the binary-Hamming case where $Y$ is symmetric, both (27) and (28) hold, and we have

$$R_{\mathrm{ID}}(D) = R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D) = R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D) = R\left(\tfrac{1}{2} - D\right), \quad (29)$$

where $R(\cdot)$ is the rate-distortion function for the source $X$ and Hamming distortion.
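As promised above: since both expectations in the constraint of (25) are linear in $P_{\hat{X}|X}$ while the mutual information is convex in it, (25) is a single convex program. A minimal sketch (ours), assuming cvxpy with a solver that supports the exponential cone:

```python
import numpy as np
import cvxpy as cp

def rate_tc_triangle(p_x, p_y, rho, D):
    """Solve (25): minimize I(X; Xhat) over the test channel
    Q[x, xh] = P(Xhat = xh | X = x), subject to
    E[rho(Xhat, Y)] - E[rho(X, Xhat)] >= D (Xhat independent of Y)."""
    n = len(p_x)
    Q = cp.Variable((n, n), nonneg=True)
    J = cp.multiply(np.outer(p_x, np.ones(n)), Q)     # joint P(x, xh)
    m = p_x @ Q                                       # marginal of Xhat
    prod = cp.vstack([p_x[i] * m for i in range(n)])  # P_X(x) * P_Xhat(xh)
    I_nats = cp.sum(cp.kl_div(J, prod))               # I(X; Xhat) in nats
    cons = [cp.sum(Q, axis=1) == 1,
            m @ (rho @ p_y) - cp.sum(cp.multiply(J, rho)) >= D]
    cp.Problem(cp.Minimize(I_nats), cons).solve()
    return I_nats.value / np.log(2)                   # bits

# symmetric binary source with Hamming distortion: (29) predicts
# R(1/2 - D), e.g. about 1 - h_2(0.4) ~ 0.029 bits for D = 0.1
p = np.array([0.5, 0.5])
rho = 1.0 - np.eye(2)
print(rate_tc_triangle(p, p, rho, D=0.1))
```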
Next, we provide an easily computable lower bound on the identification rate that holds for the case of Hamming distortion:

Theorem 5: For Hamming distortion, the identification rate is always lower bounded by

$$R_{\mathrm{ID}}(D) \ge \left[2D \log e - D(P_X \| P_Y)\right]^+. \quad (30)$$

In particular, when $P_X = P_Y$, we have

$$R_{\mathrm{ID}}(D) \ge 2D \log e. \quad (31)$$

Proof:
Appendix B.

In Fig. 4 we plot the three rates $R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D)$, $R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D)$ and $R_{\mathrm{ID}}(D)$ for the case of $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$, $P_X = P_Y$ the ternary distribution given in Fig. 4, and Hamming distortion. As seen in the figure, the three rates are different, indicating that neither the LC-$\triangle$ nor the TC-$\triangle$ scheme is optimal. We also plot the lower bound from Theorem 5.
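Here is a minimal sketch (ours) of the triangle-inequality decision rule (20) for Hamming distortion, as promised above; the 'reconstruction' below is a stand-in rather than an actual lossy code:

```python
import numpy as np

def hamming(a, b):
    return float(np.mean(np.asarray(a) != np.asarray(b)))

def g(xhat, d_x_xhat, y, D):
    """The triangle-inequality decision rule (20): from the signature we
    know only the reconstruction xhat and the stored value d(x, xhat);
    answering 'no' is safe exactly when d(xhat, y) > d(x, xhat) + D."""
    return "no" if hamming(xhat, y) > d_x_xhat + D else "maybe"

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=1000)
xhat = np.where(rng.random(1000) < 0.1, 1 - x, x)  # stand-in reconstruction
y = rng.integers(0, 2, size=1000)                  # an independent query
print(g(xhat, hamming(x, xhat), y, D=0.25))
```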
The identification rate $R_{\mathrm{ID}}(D)$ is given in (17) as an optimization program of single-letter information-theoretic quantities, with an auxiliary random variable with alphabet $\mathcal{U}$. In order to actually compute the value of $R_{\mathrm{ID}}(D)$, one has to (a) obtain a bound on the cardinality of $\mathcal{U}$, and (b) be able to efficiently solve the optimization program, which is non-convex.

In order to tackle the cardinality issue, let us define

$$R^k_{\mathrm{ID}}(D) \triangleq \min_{P_{U|X}: \ \sum_{u \in \mathcal{U}} P_U(u) \bar{\rho}(P_{X|U}(\cdot|u), P_Y) \ge D} I(X; U), \quad (32)$$

where the alphabet size $|\mathcal{U}|$ is given by $k$. Clearly, $R^k_{\mathrm{ID}}(D)$ is a nonincreasing function of $k$ (since a lower value of $k$ can be considered a special case of a higher $k$). In [1, Lemma 3], the standard support lemma (see, e.g. [30, Lemma 15.4]) is used to show that $R_{\mathrm{ID}}(D) = R^{|\mathcal{X}|+2}_{\mathrm{ID}}(D)$. In other words, taking $|\mathcal{U}| = |\mathcal{X}| + 2$ suffices in order to obtain the true value of $R_{\mathrm{ID}}(D)$. We improve this result as follows:

Theorem 6:
For any $D$, we have

$$R_{\mathrm{ID}}(D) = R^{|\mathcal{X}|+1}_{\mathrm{ID}}(D). \quad (33)$$

Furthermore, the entire curve $R_{\mathrm{ID}}(D)$ can be obtained by calculating the curve $R^{|\mathcal{X}|}_{\mathrm{ID}}(D)$ for all $D$, and then taking the lower convex envelope.

Fig. 4. The three compression rates $R^{\mathrm{LC}\text{-}\triangle}_{\mathrm{ID}}(D)$, $R^{\mathrm{TC}\text{-}\triangle}_{\mathrm{ID}}(D)$ and $R_{\mathrm{ID}}(D)$ (in bits), along with the lower bound $2D \log e$, vs. the similarity threshold $D$, for a ternary source with Hamming distortion. The distribution $P_Y$ of the query sequence is the same as $P_X$.

Remarks:

- The proof, given in detail in Sec. VII, is based on the support lemma, but uses it in a more refined way than in [1, Lemma 3], in a manner similar in spirit to [31].

- The second part of the result follows from the fact that whenever $(R_{\mathrm{ID}}(D), D)$ is an exposed point of the convex region of achievable pairs,

$$R_{\mathrm{ID}}(D) = R^{|\mathcal{X}|}_{\mathrm{ID}}(D). \quad (34)$$

See Sec. VII for details.

- Taking the lower convex envelope is also necessary. In other words, we sometimes have a strict inequality of the form $R^{|\mathcal{X}|+1}_{\mathrm{ID}}(D) < R^{|\mathcal{X}|}_{\mathrm{ID}}(D)$. For example, in the case of ternary alphabet and Hamming distortion as in Fig. 4 above, we have such a strict inequality at a certain value of $D$.

The harder problem is the non-convexity of the optimization in (17) (not to be confused with the fact that the function $R_{\mathrm{ID}}(D)$ itself is a convex function of $D$; see [1, Lemma 3]). While the target function (the mutual information) is convex, the feasibility region is not, which makes the optimization hard. In order to tackle this issue, we show that this region is the complement of a convex polytope. Then, we propose a method for reducing the optimization program (17) to a sequence of convex optimization programs that can be solved easily (e.g. via cvx [32]), and the minimum among the solutions of those programs is equal to $R_{\mathrm{ID}}(D)$.

To illustrate this idea, consider the optimization of a convex function $f: \mathbb{R}^n \to \mathbb{R}$ over the set $\mathbb{R}^n \setminus \Xi$, where $\Xi$ is a convex polytope:

$$\inf_{\mathbf{z} \in \mathbb{R}^n \setminus \Xi} f(\mathbf{z}). \quad (35)$$

Any polytope can be written as

$$\Xi = \{\mathbf{z} \in \mathbb{R}^n : \mathbf{a}_i^T \mathbf{z} \le b_i \text{ for all } 1 \le i \le m\}, \quad (36)$$

where $\{\mathbf{a}_i\}$ are $m$ length-$n$ vectors and $\{b_i\}$ are $m$ scalars, and $m$ corresponds to the number of facets of the polytope $\Xi$. Rewriting the optimization program gives

$$\inf_{\mathbf{z} \in \mathbb{R}^n \setminus \Xi} f(\mathbf{z}) = \min_{\mathbf{z} \in \mathbb{R}^n: \ \exists i: \mathbf{a}_i^T \mathbf{z} \ge b_i} f(\mathbf{z}) \quad (37)$$
$$= \min_{1 \le i \le m} \ \min_{\mathbf{z} \in \mathbb{R}^n: \ \mathbf{a}_i^T \mathbf{z} \ge b_i} f(\mathbf{z}). \quad (38)$$

Each of the new optimization programs is a minimization of the original convex function, but now with a single linear constraint, and therefore can be computed easily; a short sketch appears below.

While this method does not scale well with the alphabet size (since the number of facets of the polytope grows very quickly with $|\mathcal{X}|$), it still provides a method for calculating $R_{\mathrm{ID}}(D)$ for small values of $|\mathcal{X}|$. For example, the $R_{\mathrm{ID}}(D)$ curve in Fig. 4 was obtained with this method. The full details of the reduction method, along with simplifications for the Hamming distortion case, are given in Sec. VII.
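A minimal sketch (ours, in cvxpy) of the facet-by-facet reduction (37)-(38) on a toy instance:

```python
import numpy as np
import cvxpy as cp

def min_outside_polytope(f, A, b):
    """The reduction (37)-(38): minimize a convex f over the complement
    of the polytope {z : A z <= b} by solving one convex program per
    facet, each with the single reversed constraint a_i^T z >= b_i."""
    z = cp.Variable(A.shape[1])
    best = np.inf
    for a_i, b_i in zip(A, b):
        prob = cp.Problem(cp.Minimize(f(z)), [a_i @ z >= b_i])
        prob.solve()
        best = min(best, prob.value)
    return best

# toy instance: f(z) = ||z||^2 outside the box [-1, 1]^2; the answer is 1
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
print(min_outside_polytope(lambda z: cp.sum_squares(z), A, b))
```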
V. PROOF OF THE IDENTIFICATION RATE THEOREM

In this section we prove Theorem 2. After giving a high-level overview of the proof, we introduce additional notation and review the basic tools that are used in the proof. The proof itself is given in Subsection V-F.
A. Proof roadmap
We start with an informal overview of the proof. First, note that it suffices to consider only typical sequences $\mathbf{x}$, i.e. sequences whose (first-order) empirical distribution is close to the true one $P_X$ (this set is denoted $T_{P_X}$; see a formal definition below). The same holds for the query sequences $\mathbf{y}$.

The achievability scheme is based on constructing a code, which is a set of sequences from another alphabet $\mathcal{U}^n$, that "covers" the typical set $T_{P_X}$. The covering is in the sense that for each $\mathbf{x} \in T_{P_X}$, there exists a word $\mathbf{u}$ in the code s.t. $\mathbf{x}, \mathbf{u}$ have a first-order empirical distribution that is close to some given joint distribution (which is given as a design parameter $P_{U|X}$, along with the size of the alphabet $\mathcal{U}$). Such a covering is guaranteed to exist by a version of the type covering lemma (stated below), and the code rate is given by the mutual information between $X$ and $U$, which define the joint distribution. The signature of a sequence $\mathbf{x}$, $T(\mathbf{x})$, is defined as the index of the sequence $\mathbf{u}$ in the code that covers $\mathbf{x}$. The decision process $g(\cdot, \cdot)$ simply declares maybe (given $\mathbf{u}$ and $\mathbf{y}$) if there exists a typical sequence $\mathbf{x}$ that is mapped to $\mathbf{u}$ and that is also similar to $\mathbf{y}$. The scheme is admissible, and all that remains is to evaluate the probability of maybe, i.e. the probability that $\mathbf{Y}$ falls in the $D$-'expansion' of the set of sequences $\mathbf{x}$ that are mapped to $\mathbf{u}$. See Fig. 5 for an illustration. It can be shown that if the similarity threshold $D$ satisfies

$$D < \sum_{u} P_U(u) \bar{\rho}(P_{X|U}(\cdot|u), P_Y), \quad (39)$$
then the fraction of sequences for which a maybe is declared vanishes with $n$. This leads to the achievability of the rate $\bar{R}_{\mathrm{ID}}(D)$.

Fig. 5. Illustration of the achievability scheme. Each sequence in the typical set $T_{P_X}$ is mapped to (the index of) a single code point $\mathbf{u}$. A maybe is declared if $\mathbf{y}$ falls in the $D$-expansion of the set of $\mathbf{x}$-sequences that are mapped to $\mathbf{u}$.

For the converse part, we follow the steps of [1] and essentially show that any compression scheme performs as well as a scheme that was considered in the achievability part. In particular, we show that there exists a distribution $P_{U|X}$ for which each of the sets of $\mathbf{x}$-sequences mapped to the same $i$ contains a subset which is called 'inherently typical' w.r.t. $P_{UX}$ (see below). While in the achievability part we claimed that if (39) holds then the probability of maybe vanishes, here we need to claim the opposite, i.e. that if (39) does not hold, then the probability of maybe cannot go to zero. We note that in [1] the argument can only lead to a claim of the sort "if (39) does not hold, then the probability of maybe cannot vanish exponentially with $n$". Here, however, we require a stronger result. In order to proceed, we use the blowing-up lemma (see below), and show that if (39) does not hold, then the probability of maybe converges to 1, and this convergence is exponential in $n$. This can be regarded as an 'exponentially strong converse' (see, e.g. [33] for an overview of converse types).

B. Additional Notation
We shall use the method of types [30]. Let $\mathcal{P}(\mathcal{X})$ denote the set of all probability distributions on the alphabet $\mathcal{X}$. We denote by $\mathcal{P}(\mathcal{X} \to \mathcal{Y})$ the set of all stochastic matrices (or, equivalently, conditional distributions, or channels) from alphabet $\mathcal{X}$ to $\mathcal{Y}$. On occasion, we deal with vectors of different lengths. For this purpose, we use the notation $x^k$ as shorthand for the vector $[x_1, \ldots, x_k]$, so, for example, $x^n$ and $\mathbf{x}$ shall denote the same thing. For a sequence $\mathbf{x} \in \mathcal{X}^n$ and $a \in \mathcal{X}$, let $N(a|\mathbf{x})$ denote the number of occurrences of $a$ in $\mathbf{x}$. The type of the sequence $\mathbf{x}$, denoted $P_{\mathbf{x}}$, is defined as the vector $\frac{1}{n}[N(1|\mathbf{x}), N(2|\mathbf{x}), \ldots, N(|\mathcal{X}|\,|\,\mathbf{x})]$. For any sequence length $n$, let $\mathcal{P}_n(\mathcal{X})$ denote the set of all possible types of sequences of length $n$ (also called $n$-types):

$$\mathcal{P}_n(\mathcal{X}) \triangleq \{P \in \mathcal{P}(\mathcal{X}) : \forall x \in \mathcal{X}, \ nP(x) \in \mathbb{Z}^+\}. \quad (40)$$

For a type $P \in \mathcal{P}_n(\mathcal{X})$, the type class $T_P$ is defined as the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ with type $P$ (or, equivalently, with $P_{\mathbf{x}} = P$). More generally, for a distribution $P \in \mathcal{P}(\mathcal{X})$ (not necessarily an $n$-type), and a constant $\gamma > 0$, the set of typical sequences $T_{P,\gamma}$ is defined as the set of all sequences $\mathbf{x} \in \mathcal{X}^n$ for which:

1) $|P(x) - P_{\mathbf{x}}(x)| \le \gamma$ for all $x \in \mathcal{X}$;
2) $P_{\mathbf{x}}(x) = 0$ whenever $P(x) = 0$.

If $X$ is a random variable distributed according to $P$, we shall sometimes write $T_{X,\gamma}$ for $T_{P,\gamma}$. Similarly, for $V \in \mathcal{P}(\mathcal{X} \to \mathcal{Y})$ and $\mathbf{x} \in \mathcal{X}^n$, we denote by $T_{V,\gamma}(\mathbf{x})$ the set of conditionally typical sequences, i.e. those $\mathbf{y}$ for which:

1) $|N(x', y'|\mathbf{x}, \mathbf{y}) - N(x'|\mathbf{x}) V(y'|x')| \le n \cdot \gamma$ for all $x' \in \mathcal{X}, y' \in \mathcal{Y}$;
2) $N(x', y'|\mathbf{x}, \mathbf{y}) = 0$ whenever $V(y'|x') = 0$.

For random variables $X, Y$ where $Y$ is the output of the channel $V$ with input $X$, we'll sometimes use the notation $T_{Y|X,\gamma}(\mathbf{x})$ for $T_{V,\gamma}(\mathbf{x})$.

Let $\delta_n$ be defined according to the delta convention [30]; i.e. $\delta_n \to 0$, but also $\sqrt{n}\,\delta_n \to \infty$ (e.g. $\delta_n \triangleq n^{-\alpha}$, with some $\alpha \in (0, 0.5)$). With this convention, we have (see [30, Lemma 2.12]):

- If $\mathbf{X}$ is distributed i.i.d. according to a distribution $P$, then

$$\Pr\{\mathbf{X} \in T_{P,\delta_n}\} \ge 1 - \frac{|\mathcal{X}|}{4n\delta_n^2}. \quad (41)$$

- If $\mathbf{Y}$ is the output of the DMC $V$ with input $\mathbf{x} \in \mathcal{X}^n$, then

$$\Pr\{\mathbf{Y} \in T_{V,\delta_n}(\mathbf{x})\} \ge 1 - \frac{|\mathcal{X}||\mathcal{Y}|}{4n\delta_n^2}. \quad (42)$$

Recall that we have defined an arbitrary distortion measure $\rho: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$. The maximal possible distortion is denoted by $\rho_{\max}$:

$$\rho_{\max} \triangleq \max_{x \in \mathcal{X}, y \in \mathcal{Y}} \rho(x, y). \quad (43)$$

For a set $A \subseteq \mathcal{X}^n$, we denote its $D$-expansion by

$$\Gamma^D(A) \triangleq \{\mathbf{y} \in \mathcal{Y}^n : \exists \mathbf{x} \in A \text{ s.t. } d(\mathbf{x}, \mathbf{y}) \le D\}. \quad (44)$$

When we use the Hamming distance as the distortion measure, we denote the expansion by $\Gamma^D_H(A)$.
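A minimal sketch (ours) of the type of a sequence and of membership in $T_{P,\gamma}$ as just defined:

```python
import numpy as np
from collections import Counter

def type_of(x, alphabet):
    """The type (empirical distribution) P_x of a sequence."""
    counts = Counter(x)
    return np.array([counts[a] for a in alphabet]) / len(x)

def in_typical_set(x, P, alphabet, gamma):
    """Membership in T_{P,gamma}: each letter probability matched to
    within gamma, and no letter outside the support of P appears."""
    P = np.asarray(P, dtype=float)
    Px = type_of(x, alphabet)
    return bool(np.all(np.abs(P - Px) <= gamma) and np.all(Px[P == 0] == 0))

alphabet = [0, 1, 2]
P = [0.5, 0.3, 0.2]
rng = np.random.default_rng(2)
x = rng.choice(alphabet, size=2000, p=P)
print(type_of(x, alphabet), in_typical_set(x, P, alphabet, gamma=0.05))
```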
C. A covering lemma
The main building block for the achievability proofs in the paper is the following version of the covering lemma (see e.g. [34, Prop. 1]):
Lemma 1 (A covering lemma):
For any distribution $P_X \in \mathcal{P}(\mathcal{X})$ and any channel $V \in \mathcal{P}(\mathcal{X} \to \mathcal{U})$, there exists a mapping $\omega: T_{P_X,\delta_n} \to \mathcal{U}^n$ s.t.

$$\omega(\mathbf{x}) \in T_{V,\delta_n}(\mathbf{x}) \text{ for all } \mathbf{x} \in T_{P_X,\delta_n}, \quad (45)$$

and

$$|\{\omega(\mathbf{x}) : \mathbf{x} \in T_{P_X,\delta_n}\}| \le 2^{n(I(P_X,V) + \varepsilon_n)}, \quad (46)$$

where $\varepsilon_n = O(\delta_n \log(1/\delta_n))$, with constants that depend only on $|\mathcal{X}|$ and $|\mathcal{U}|$.

Essentially, this lemma says that there exists a code $\mathcal{C} \subseteq \mathcal{U}^n$ of rate $I(P_X, V)$ that covers the typical set $T_{P_X,\delta_n}$, in the sense that for every $\mathbf{x} \in T_{P_X,\delta_n}$, there exists a 'codeword' $\mathbf{u} \in \mathcal{C}$ such that the sequences $\mathbf{x}, \mathbf{u}$ have a joint type that is very close to $(P_X, V)$.

D. The inherently typical subset lemma
The proof of the converse of Theorem 2 relies heavily on the inherently typical subset lemma, due to Ahlswede et al. [1]. An inherently typical set is a generalization of the conditional type class concept, as detailed below. Loosely speaking, the lemma says that every set contains an inherently typical subset of essentially the same size. Before stating the lemma we require several definitions.
Definition 3 (Prefix set):
For $A \subseteq \mathcal{X}^n$ and $0 \le k \le n$, let $A_k$ denote the set of prefixes of sequences of $A$, i.e.

$$A_k \triangleq \{x^k \in \mathcal{X}^k : \exists \tilde{x}^n \in A \text{ s.t. } x^k = \tilde{x}^k\}. \quad (47)$$

We denote $A_0 = \{\Lambda\}$, where $\Lambda$ denotes the empty string. Note that, by our convention, for any $\mathbf{x} \in A$, $x^0 = \Lambda$.

Definition 4 (Causal mapping):
For $A \subseteq \mathcal{X}^n$ and an arbitrary alphabet $\mathcal{U}$, a mapping $\phi: A \to \mathcal{U}^n$ is said to be causal if there exist mappings $\{\phi_i: A_i \to \mathcal{U}\}_{i=0}^{n-1}$ s.t. for all $\mathbf{x} \in A$ we have

$$\phi(\mathbf{x}) = [\phi_0(x^0), \phi_1(x^1), \ldots, \phi_{n-1}(x^{n-1})]. \quad (48)$$

Let $\mathcal{U}_m$ denote a discrete alphabet of size

$$|\mathcal{U}_m| = |\mathcal{P}_m(\mathcal{X})| = \binom{|\mathcal{X}| + m - 1}{m}. \quad (49)$$

The alphabet $\mathcal{U}_m$ is exactly the set of $m$-types over the alphabet $\mathcal{X}$. For example, if $\mathcal{X} = \{0, 1\}$, we have $|\mathcal{U}_m| = m + 1$.

Definition 5 (Inherently typical set [1]):
A set $A \subseteq \mathcal{X}^n$ is said to be $m$-inherently typical if there exist a causal mapping $\phi: A \to \mathcal{U}_m^n$ and an $n$-type $Q \in \mathcal{P}_n(\mathcal{X} \times \mathcal{U}_m)$ s.t.:

1) For every $\mathbf{x} \in A$, the sequences $\mathbf{x}, \phi(\mathbf{x})$ have the joint type $Q$, i.e.

$$P_{\mathbf{x},\phi(\mathbf{x})} = Q. \quad (50)$$

2) If $\bar{X}, \bar{U}$ are jointly distributed according to $Q$, then

$$\frac{1}{n} \log|A| \le H(\bar{X}|\bar{U}) \le \frac{1}{n} \log|A| + \frac{\log m}{m}. \quad (51)$$

Lemma 2 (The inherently typical subset lemma [1]):
Let $m > |\mathcal{X}|$, and let $n$ be large enough s.t. $(m+1)^{|\mathcal{X}|+4} \ln(n+1)/n \le 1$. Then for any set $A \subseteq \mathcal{X}^n$ there exists a subset $\tilde{A} \subseteq A$ that is $m$-inherently typical, whose size satisfies

$$\frac{1}{n} \log \frac{|A|}{|\tilde{A}|} \le \frac{|\mathcal{X}|(m+1)^{|\mathcal{X}|} \log(n+1)}{n}. \quad (52)$$

Note that the lemma only becomes effective at extremely large values of $m$, even when $\mathcal{X}$ is binary. It is therefore expected that finite blocklength versions of results based on the lemma will be very loose. Nevertheless, it is sufficient for proving a converse for the error exponent (as in [1]), and also for proving the converse for the rate (as we show next).

E. The Blowing Up Lemma
The blowing-up lemma (see, e.g., [28, Lemma 12]) will play a key role in proving the converse part of Theorem 2.
Lemma 3 (Blowing-up Lemma):
Let $\mathbf{Z}$ be distributed i.i.d. according to $P_Z$ and let $B \subseteq \mathcal{Z}^n$ be a given set. Then

$$\Pr\{\mathbf{Z} \notin \Gamma^\delta_H(B)\} \le \exp\left(-2n\left(\delta - \sqrt{\frac{1}{2n} \ln \frac{1}{\Pr\{\mathbf{Z} \in B\}}}\right)^2\right), \quad (53)$$

for all $\delta > \sqrt{\frac{1}{2n} \ln \frac{1}{\Pr\{\mathbf{Z} \in B\}}}$.
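To see the lemma's effect quantitatively, the following sketch (ours) evaluates the bound (53) when $\Pr\{\mathbf{Z} \in B\}$ decays exponentially with a very small exponent, which is exactly the situation exploited in the converse:

```python
import numpy as np

def blow_up_bound(n, delta, p_B):
    """Right-hand side of (53): an upper bound on the probability of
    falling outside the delta-blow-up of B, given Pr{Z in B} = p_B."""
    t = np.sqrt(np.log(1.0 / p_B) / (2.0 * n))
    return 1.0 if delta <= t else float(np.exp(-2.0 * n * (delta - t) ** 2))

# a set of exponentially small probability (tiny exponent, 0.01 bits)
# blows up, with delta = 0.1, to capture Z with probability -> 1:
for n in (10**3, 10**4, 10**5):
    print(n, 1.0 - blow_up_bound(n, delta=0.1, p_B=2.0 ** (-0.01 * n)))
```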
F. Proof of Theorem 2

As discussed above, the direct part of the theorem follows from the combination of [1, Thm. 2] and Prop. 1 above. Nevertheless, for completeness, we give a direct proof here. It will be simpler than the proof of [1, Thm. 2], since it only involves the achievable rate (and not error exponents). After reading through the direct part, it should be easier for the reader to follow the converse proof.
Proof of Theorem 2, direct part:
Let $W \in \mathcal{P}(\mathcal{X} \to \mathcal{U})$ be an arbitrary channel into an arbitrary (discrete) alphabet $\mathcal{U}$. Following Lemma 1 (with $V = W$), let $\omega: T_{P_X,\delta_n} \to \mathcal{U}^n$ be the mapping that covers the typical set $T_{P_X,\delta_n}$. Let $\mathcal{C} = \{\omega(\mathbf{x}) : \mathbf{x} \in T_{P_X,\delta_n}\}$ denote the set of codewords, and let $\mathbf{u} = \omega(\mathbf{x})$. The mapping $T(\cdot)$ is defined as follows:

$$T(\mathbf{x}) = \begin{cases} [\mathbf{u}, P_{\mathbf{xu}}], & \mathbf{x} \in T_{P_X,\delta_n}; \\ e, & \text{otherwise.} \end{cases} \quad (54)$$

In other words, the signature $T(\mathbf{x})$ consists of (the index of) the codeword $\mathbf{u}$, and also the joint type of $\mathbf{x}$ and $\mathbf{u}$. The special symbol $e$ denotes 'erasure'.

The decision function $g(T(\mathbf{x}), \mathbf{y})$ shall return maybe only if one of the following occurs:

- The erasure symbol was received, i.e. $T(\mathbf{x}) = e$, or
- $\mathbf{y}$ was not typical, i.e. $\mathbf{y} \notin T_{P_Y,\delta_n}$, or
- $\mathbf{y}$ is typical, $T(\mathbf{x}) \ne e$, and there exists a joint type $Q \in \mathcal{P}(\mathcal{X} \times \mathcal{U} \times \mathcal{Y})$ s.t.:

1) the marginal of $Q$ w.r.t. $U, Y$ is $P_{\mathbf{uy}}$ (i.e. the empirical joint distribution of the sequences $\mathbf{u}, \mathbf{y}$),
2) the marginal of $Q$ w.r.t. $X, U$ is $P_{\mathbf{xu}}$, and
3) the marginal of $Q$ w.r.t. $X, Y$ satisfies $\mathbb{E}[\rho(X, Y)] \le D$.

First, let us see why this scheme is admissible, i.e. there are no false negatives. Note that the only case where the scheme may return no is when the signature consists of $[\mathbf{u}, P_{\mathbf{xu}}]$. If $d(\mathbf{x}, \mathbf{y}) \le D$, then the joint type $Q = P_{\mathbf{xuy}}$ satisfies the conditions 1)-3) above, and therefore the scheme would return maybe, as required.

Next, note that the rate of the scheme is arbitrarily close to $I(P_X, W)$, since the joint $n$-type $P_{\mathbf{x},\mathbf{u}}$ adds $O(\log(n)/n)$ to the rate, and the signature $e$ adds a negligible amount to the rate.

The last point that needs to be proved is that the probability of maybe vanishes. Consider the three error events:

- $T(\mathbf{x}) = e$. This happens if and only if $\mathbf{x} \notin T_{P_X,\delta_n}$, which, following (41) and the definition of $\delta_n$, vanishes with $n$.
- For a similar reason, the probability that $\mathbf{y} \notin T_{P_Y,\delta_n}$ vanishes with $n$.
- The remaining event is when $\mathbf{x}$ and $\mathbf{y}$ are both typical (i.e. have empirical distributions near $P_X$ and $P_Y$, respectively), and there exists a joint type $Q$ that satisfies 1)-3) above.

Fix a sequence $\mathbf{x} \in T_{P_X,\delta_n}$ and let $P'_Y \in \mathcal{P}_n(\mathcal{Y})$ be a given $n$-type. Conditioned on $\mathbf{Y} \in T_{P'_Y}$, we know that $\mathbf{Y}$ is distributed uniformly on $T_{P'_Y}$. Among those sequences, the fraction that trigger a maybe are the ones residing in the $Q_{Y|U}(\cdot|\mathbf{u})$-shell of some distribution $Q$ that satisfies 1)-3). Note that the requirement 3) is on the distribution $Q$, and not on the sequence $\mathbf{y}$. Therefore the probability of maybe, conditioned on $\mathbf{X} = \mathbf{x}$ and $\mathbf{Y} \in T_{P'_Y}$, is upper bounded, for any $\varepsilon > 0$ and large enough $n$, by

$$\sum_{Q: \ Q_{XU} = P_{\mathbf{xu}}, \ \mathbb{E}_Q[\rho(X', Y')] \le D, \ Q_Y = P'_Y} 2^{-n(I(Y'; U') - \varepsilon)}, \quad (55)$$

where $X', U', Y'$ are jointly distributed according to $Q$. Since $P'_Y$ is arbitrarily close to $P_Y$ and $P_{\mathbf{xu}}$ is arbitrarily close to $(P_X, W)$, we can sum over the (polynomial number of) types $P'_Y$, and the probability of maybe, now conditioned on $\mathbf{x} \in T_{P_X,\delta_n}$ and $\mathbf{y} \in T_{P_Y,\delta_n}$, is upper bounded, for any $\varepsilon > 0$ and large enough $n$, by

$$\max_{Q: \ Q_{XU} = (P_X, W), \ \mathbb{E}_Q[\rho(X', Y')] \le D, \ Q_Y = P_Y} 2^{-n(I(Y'; U') - \varepsilon)}. \quad (56)$$

It therefore remains to show that if $W$ is chosen according to (17), then there are no types $Q$ satisfying 1)-3) for which $U'$ and $Y'$ are independent. This fact will make the exponent in (56) strictly positive, completing the proof.

In order to show this, let $\Delta_R, \Delta_D > 0$ be arbitrarily small constants, s.t.

$$R = \Delta_R + \bar{R}_{\mathrm{ID}}(D + \Delta_D). \quad (57)$$

In other words, we know that $(P_X, W)$ satisfies

$$\sum_{u \in \mathcal{U}} P_U(u) \bar{\rho}(P_{X|U}(\cdot|u), P_Y) \ge D + \Delta_D, \quad (58)$$

and we need to prove that for all joint distributions $Q$ that satisfy $Q_{XU} = (P_X, W)$, $Q_Y = P_Y$, and $\mathbb{E}_Q[\rho(X, Y)] \le D$, we must have $I(Y; U) > 0$. To prove this, assume, for contradiction, that $Q$ is such a distribution, where $Y$ and $U$ are independent. Then we can write the following:

$$D \ge \mathbb{E}_Q[\rho(X, Y)] \quad (59)$$
$$= \sum_{u \in \mathcal{U}} P_U(u) \sum_{x,y} Q_{XY|U}(x, y|u) \rho(x, y) \quad (60)$$
$$= \sum_{u \in \mathcal{U}} P_U(u) \sum_{x,y} Q_{Y|U}(y|u) Q_{X|YU}(x|y, u) \rho(x, y) \quad (61)$$
$$= \sum_{u \in \mathcal{U}} P_U(u) \sum_{x,y} P_Y(y) Q_{X|YU}(x|y, u) \rho(x, y). \quad (62)$$

Note that for all $u$, the term $P_Y(y) Q_{X|YU}(x|y, u)$ defines a joint distribution on $X, Y$, with marginals $P_Y$ and $Q_{X|U}(\cdot|u) = P_{X|U}(\cdot|u)$.
Therefore the inner summation is an upper bound, by definition, for the term $\bar{\rho}(P_{X|U}(\cdot|u), P_Y)$, and we get that

$$D \ge \sum_{u \in \mathcal{U}} P_U(u) \bar{\rho}(P_{X|U}(\cdot|u), P_Y), \quad (63)$$

which contradicts (58). Therefore $U$ and $Y$ can never be independent, and the exponent in (56) is strictly positive. This makes sure that the third error event also has a vanishing probability, and the proof is concluded by using the union bound on the error events.

To prove the converse part, we start by following steps similar to those of the converse of [1, Thm. 2]. In our case these steps are actually slightly simpler than those of [1, Thm. 2] because of the restriction to fixed-length compression schemes. As mentioned before, the key step that is missing from [1, Thm. 2] is the ability to transform an event whose probability vanishes with an arbitrarily small exponent into an event whose probability goes to 1. The key to achieving this is the blowing-up lemma, as detailed in what follows.

Proof of Theorem 2, converse part:
Let $\Delta_R, \Delta_D > 0$, and let $(T, g)$ be a sequence of schemes with rate
$R < \bar{R}_{\mathrm{ID}}(D - \Delta_D) - \Delta_R$. Our goal is to show that the probability of maybe cannot vanish with $n$, for arbitrarily small $\Delta_R$ and $\Delta_D$.

Let $i \in [1 : 2^{nR}]$, and let

$$T^{-1}(i) \triangleq \{\mathbf{x} \in \mathcal{X}^n : T(\mathbf{x}) = i\}. \quad (64)$$

Since the given scheme is admissible, we must have

$$\Pr\{\textsf{maybe}\} = \sum_{i=1}^{2^{nR}} \Pr\{T(\mathbf{X}) = i\} \Pr\{\textsf{maybe} \mid T(\mathbf{X}) = i\} \quad (65)$$
$$\ge \sum_{i=1}^{2^{nR}} \Pr\{T(\mathbf{X}) = i\} \Pr\{\mathbf{Y} \in \Gamma^D(T^{-1}(i))\}. \quad (66)$$

For some $\gamma > 0$, define the set

$$A_i \triangleq T^{-1}(i) \cap T_{P_X,\gamma}. \quad (67)$$
Since by definition $A_i \subseteq T^{-1}(i)$, we also have

$$\Pr\{\textsf{maybe}\} \ge \sum_{i=1}^{2^{nR}} \Pr\{T(\mathbf{X}) = i\} \Pr\{\mathbf{Y} \in \Gamma^D(A_i)\}. \quad (68)$$

It appears that it is enough to consider only $A_i$'s that are 'large', by the following lemma:

Lemma 4 (Most $A_i$'s are large): There exists $n_0 = n_0(\gamma) > 0$, s.t. for all $n > n_0$,

$$\sum_{i: \ |A_i| \le 2^{n[H(P_X) - R - \gamma']}} \Pr\{\mathbf{X} \in A_i\} \le 2^{-\gamma' n}, \quad (69)$$

where $\gamma' \triangleq \gamma |\mathcal{X}| \log(1/\gamma)$.

Proof:
Appendix C.

Next, consider a specific $A = A_i$, and suppose that $|A| \ge 2^{n(H(X) - R - \gamma')}$ (by the previous lemma we know that this occurs with high probability). We invoke the inherently typical subset lemma and conclude that for any $m > |\mathcal{X}|$ and large enough $n$, there exists a subset $\tilde{A} \subseteq A$ for which:

1) The size of the set $\tilde{A}$ is essentially the same as that of $A$:

$$\frac{1}{n} \log \frac{|A|}{|\tilde{A}|} \le \frac{|\mathcal{X}|(m+1)^{|\mathcal{X}|} \log(n+1)}{n}. \quad (70)$$

2) There exist an $n$-type $Q \in \mathcal{P}_n(\mathcal{X} \times \mathcal{U}_m)$ and a causal mapping $\phi: \mathcal{X}^n \to \mathcal{U}_m^n$, s.t. for every $\mathbf{x} \in \tilde{A}$,

$$P_{\mathbf{x},\phi(\mathbf{x})} = Q. \quad (71)$$

3) If $\bar{X}, \bar{U}$ are jointly distributed according to $Q$, then

$$\frac{1}{n} \log|\tilde{A}| \le H(\bar{X}|\bar{U}) \le \frac{1}{n} \log|\tilde{A}| + \frac{\log m}{m}. \quad (72)$$

Since $A \subseteq T_{P_X,\gamma}$, we must have that the marginal of $Q$ w.r.t. $\mathcal{X}$ is a type $P$ that satisfies $\|P - P_X\|_\infty \le \gamma$. Since $\tilde{A} \subseteq A$, we have

$$\Pr\{\mathbf{Y} \in \Gamma^D(A)\} \ge \Pr\{\mathbf{Y} \in \Gamma^D(\tilde{A})\}. \quad (73)$$

Let $\varepsilon > 0$ and let $V: \mathcal{X} \times \mathcal{U}_m \to \mathcal{Y}$ be a stochastic matrix s.t.

$$\sum_{x,y,u} Q(x, u) V(y|x, u) \rho(x, y) \le D - \varepsilon. \quad (74)$$

In other words, if $\bar{X}, \bar{U}, \bar{Y}$ are distributed according to $(Q, V)$, then $\mathbb{E}[\rho(\bar{X}, \bar{Y})] \le D - \varepsilon$. Following (74), for all $\mathbf{x} \in \tilde{A}$ and large enough $n$, $\mathbf{y} \in T_{\bar{Y}|\bar{X}\bar{U},\delta_n}(\mathbf{x}, \phi(\mathbf{x}))$ implies $d(\mathbf{x}, \mathbf{y}) \le D - \varepsilon$. Next, define the set $F$ to be the union of all such conditional type classes:

$$F \triangleq \bigcup_{\mathbf{x} \in \tilde{A}} T_{\bar{Y}|\bar{X}\bar{U},\delta_n}(\mathbf{x}, \phi(\mathbf{x})). \quad (75)$$

Clearly, from the above it follows that $F \subseteq \Gamma^{D-\varepsilon}(\tilde{A})$. Let $\varepsilon' \triangleq \varepsilon / \rho_{\max}$. Note the following fact:

$$\Gamma^{\varepsilon'}_H\left(\Gamma^{D-\varepsilon}(\tilde{A})\right) \subseteq \Gamma^D(\tilde{A}). \quad (76)$$

To see this, suppose $\mathbf{y} \in \Gamma^{\varepsilon'}_H(\Gamma^{D-\varepsilon}(\tilde{A}))$. Then there exists $\mathbf{y}' \in \Gamma^{D-\varepsilon}(\tilde{A})$ s.t. $d_H(\mathbf{y}, \mathbf{y}') \le \varepsilon'$. Also, there exists $\mathbf{x} \in \tilde{A}$ s.t. $d(\mathbf{x}, \mathbf{y}') \le D - \varepsilon$. Now write

$$n \cdot d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \rho(x_i, y_i) \quad (77)$$
$$= \sum_{i: \ y_i = y'_i} \rho(x_i, y'_i) + \sum_{i: \ y_i \ne y'_i} \rho(x_i, y_i) \quad (78)$$
$$\le \sum_{i=1}^n \rho(x_i, y'_i) + \sum_{i: \ y_i \ne y'_i} \rho_{\max} \quad (79)$$
$$\le n(D - \varepsilon) + n\varepsilon' \rho_{\max} \quad (80)$$
$$= nD. \quad (81)$$

With (76) above we have

$$\Pr\{\mathbf{Y} \in \Gamma^D(A)\} \ge \Pr\{\mathbf{Y} \in \Gamma^{\varepsilon'}_H(F)\}. \quad (82)$$

Next, we apply the blowing-up lemma to the set $F$, and have

$$\Pr\{\mathbf{Y} \in \Gamma^{\varepsilon'}_H(F)\} \ge 1 - \exp\left(-2n\left(\varepsilon' - \sqrt{\frac{1}{2n} \ln \frac{1}{\Pr\{\mathbf{Y} \in F\}}}\right)^2\right), \quad (83)$$

whenever

$$\varepsilon' > \sqrt{\frac{1}{2n} \ln \frac{1}{\Pr\{\mathbf{Y} \in F\}}}. \quad (84)$$

Since $\varepsilon' > 0$ is a constant, all that is left is to prove that $\Pr\{\mathbf{Y} \in F\}$ does not vanish exponentially. Note that we only need to show the existence of a single channel $V$ for which $\Pr\{\mathbf{Y} \in F\}$ does not vanish exponentially. To show this, we follow the steps of [1]. First, since $F$ is a union of $V$-shells, we deduce that all the sequences in it are typical:

$$F \subseteq T_{\bar{Y}, \ \delta_n \cdot |\mathcal{X}||\mathcal{U}_m|}, \quad (85)$$

where $\bar{Y}$ is an RV with the $\mathcal{Y}$-marginal of $(Q, V)$. It follows that

$$\Pr\{\mathbf{Y} \in F\} \ge |F| \, 2^{-n H(\bar{Y}) - n(D(\bar{Y} \| Y) + \xi_n)}, \quad (86)$$

where $\xi_n \to 0$ with $n$. We see that the key to evaluating $\Pr\{\mathbf{Y} \in F\}$ is evaluating $|F|$.

Let $\tilde{\mathbf{X}}$ be uniformly distributed on $\tilde{A}$, and let $\tilde{\mathbf{U}} \triangleq \phi(\tilde{\mathbf{X}})$. Note that $\tilde{\mathbf{X}}, \tilde{\mathbf{U}}$ are random vectors which are not i.i.d., but have the joint type $Q$ with probability 1. Next, define $\tilde{\mathbf{Y}}$ to be the output of the memoryless channel $V$ with inputs $\tilde{\mathbf{X}}, \tilde{\mathbf{U}}$. Here are several facts regarding the random vector $\tilde{\mathbf{Y}}$, expressed in terms of the random variables $\bar{X}, \bar{U}, \bar{Y}$ that are distributed according to $(Q, V)$:

- Since $\tilde{\mathbf{Y}}$ is conditionally typical w.h.p.
(given ˜ X = x ∈ ˜ A ), Pr { ˜ Y ∈ F | ˜ X = x } ≥ Pr { ˜ Y ∈ T Y | X U ,δ n ( x , φ ( x )) | ˜ X = x } (87) ≥ − |X ||Y||U m | nδ n . (88) • Since the above holds for any x ∈ ˜ A , we also have Pr { ˜ Y ∈ F } = X x ∈ ˜ A Pr { ˜ X = x } Pr { ˜ Y ∈ F | ˜ X = x }≥ − |X ||Y||U m | nδ n . (89)For convenience, let γ n , |X ||Y| nδ n . Note that by the delta convention, we have γ n → .Define the indicator RV χ F as χ F ( ˜ Y ) , { ˜ Y ∈ F } (90)It follows that H ( ˜ Y ) = H (cid:16) ˜ Y , χ F ( ˜ Y ) (cid:17) (91) = H (cid:16) ˜ Y | χ F ( ˜ Y ) (cid:17) + H ( χ F ( ˜ Y )) (92) = H (cid:16) ˜ Y | χ F ( ˜ Y ) (cid:17) + h (Pr { ˜ Y ∈ F } ) (93) ( a ) ≤ H (cid:16) ˜ Y | χ F ( ˜ Y ) (cid:17) + h ( γ n ) (94) = H (cid:16) ˜ Y | ˜ Y ∈ F (cid:17) Pr { ˜ Y ∈ F } + H (cid:16) ˜ Y | ˜ Y / ∈ F (cid:17) Pr { ˜ Y / ∈ F } + h ( γ n ) (95) ≤ H (cid:16) ˜ Y | ˜ Y ∈ F (cid:17) + n log |Y | Pr { ˜ Y / ∈ F } + h ( γ n ) (96) ≤ log | F | + n log |Y | Pr { ˜ Y / ∈ F } + h ( γ n ) (97) ≤ log | F | + γ n n log |Y | + h ( γ n ) . (98)where ( a ) follows from (89) for n large enough s.t. γ n < / .The last derivation reveals that we can bound H ( ˜ Y ) to get a lower bound on | F | (andby that a lower bound on Pr { Y ∈ Γ D ( A ) } ).The next step follows [1, (4.26)-(4.30)] almost verbatim. For completeness, we pack theargument into a lemma, and prove it in the appendix (with simpler notation than in [1]). Lemma 5:
With $\tilde{\mathbf{Y}}$ defined above, we have
$$\frac{1}{n}H(\tilde{\mathbf{Y}}) \ge H(\bar{Y}|U) - \frac{\log m}{m}. \tag{99}$$

Proof:
Appendix D.

We conclude that
$$\frac{1}{n}\log|F| \ge H(\bar{Y}|U) - \frac{\log m}{m} - \frac{1}{n}h(\gamma_n) - \gamma_n\log|\mathcal{Y}|. \tag{100}$$
Substituting back into (86) gives
$$\Pr\{\mathbf{Y}\in F\} \ge |F|\,2^{-nH(\bar{Y})-n(D(P_{\bar{Y}}\|P_Y)+\xi_n)} \ge 2^{-n\left[I(\bar{Y};U)+D(P_{\bar{Y}}\|P_Y)+\xi_n+\frac{\log m}{m}+\frac{1}{n}h(\gamma_n)+\gamma_n\log|\mathcal{Y}|\right]}. \tag{101--102}$$
Our next step is to show that the exponent in (102) can be made arbitrarily small by a proper selection of the channel $V$. For that purpose, write:
$$I(X;U) = H(X) - H(X|U) \overset{(a)}{\le} H(P_X) - H(X|U) + \gamma' \overset{(b)}{\le} H(P_X) - \tfrac{1}{n}\log|\tilde{A}| + \gamma' \overset{(c)}{\le} H(P_X) - \tfrac{1}{n}\log|A| + \gamma' + \zeta_n \overset{(d)}{\le} R + 3\gamma' + \zeta_n \overset{(e)}{\le} \bar{R}_{\mathrm{ID}}(D-\Delta_D) - \Delta_R + 3\gamma' + \zeta_n. \tag{103--108}$$
In the above, $(a)$ follows from [30, Lemma 2.7] (cf. also the proof of Lemma 4), since the $\mathcal{X}$-marginal of $Q$ is $\gamma$-close to $P_X$; $(b)$ and $(c)$ follow from (72) and (70) respectively, where $\zeta_n \triangleq \frac{|\mathcal{X}|(m+1)^{|\mathcal{X}|}\log(n+1)}{n}$; $(d)$ follows from the assumption $|A| \ge 2^{n(H(P_X)-R-2\gamma')}$; and $(e)$ follows from the assumption made at the beginning of the proof.

Next, we use the fact that the $\mathcal{X}$-marginal of $Q$ and the source distribution $P_X$ are very close (with closeness quantified by $\gamma$), and write:
$$\bar{R}_{\mathrm{ID}}(D-\Delta_D) = \min_{P_{U|X}:\ \sum_{u\in\mathcal{U}}P_U(u)\bar\rho(P_{X|U}(\cdot|u),P_Y)\,\ge\,D-\Delta_D} I(X;U) \le \min_{P_{U|X}:\ \sum_{u\in\mathcal{U}}P_U(u)\bar\rho(P_{X|U}(\cdot|u),P_Y)\,\ge\,D-\Delta_D+\gamma''} I(X;U) + \gamma'', \tag{109--110}$$
for some $\gamma'' > 0$ that vanishes with $\gamma$.

Next, let $\varepsilon$, defined above to be arbitrarily small but positive, take the value $\Delta_D/2$. Also, let $\gamma$ be small enough s.t. $\gamma'' < \varepsilon$ and $3\gamma' + \gamma'' < \Delta_R$. This way, whenever $n$ is large enough so that $\zeta_n \le \Delta_R - 3\gamma' - \gamma''$, we have
$$I(X;U) < \min_{P_{U|X}:\ \sum_{u\in\mathcal{U}}P_U(u)\bar\rho(P_{X|U}(\cdot|u),P_Y)\,\ge\,D-\varepsilon} I(X;U). \tag{111}$$
Since the conditional distribution $P_{U|X}$ induced by $Q$ attains the left-hand side, it cannot be feasible for the minimization on the right-hand side; we therefore deduce that
$$\sum_{u\in\mathcal{U}^m} P_U(u)\,\bar\rho(P_{X|U}(\cdot|u),P_Y) < D - \varepsilon. \tag{112}$$
By the definition of $\bar\rho(\cdot,\cdot)$, there exist distributions $\Psi_u(x,y)$, one for each $u\in\mathcal{U}^m$, s.t.
$$\sum_{u\in\mathcal{U}^m} P_U(u)\,E_{\Psi_u}[\rho(X,Y)] \le D - \varepsilon. \tag{113}$$
Furthermore, the marginals of $\Psi_u$ are $P_{X|U}(\cdot|u)$ and $P_Y$. Together with $P_U$, the collection $\{\Psi_u\}$ defines a joint distribution with the following properties:

- The marginal of the distribution w.r.t. $X, U$ is exactly $Q$. Therefore the conditional distribution w.r.t. $Y$ is a feasible choice for $V$ according to (74). Denote the RV at the output of this channel by $\bar{Y}$.
- The marginal distribution w.r.t. $U, Y$ is given by $P_U\times P_Y$, i.e. $U$ and $\bar{Y}$ are independent, and we also have $P_{\bar{Y}} = P_Y$.

Indeed, we choose $V$ to be defined according to $P_U, \{\Psi_u\}$, and conclude that
$$I(\bar{Y};U) + D(P_{\bar{Y}}\|P_Y) = 0. \tag{114}$$
Substituting back into (102), we see that the resulting exponent is arbitrarily close to $\frac{\log m}{m}$. We then set $m$ large enough s.t. the condition (84) holds.

To summarize, so far we have shown that if $|A| \ge 2^{n(H(P_X)-R-2\gamma')}$, then
$$\Pr\{\mathbf{Y}\in\Gamma_D(A)\} \ge \eta_n, \tag{115}$$
where $\eta_n$ approaches $1$ (exponentially fast), as a result of the blowing-up lemma.

Finally, we repeat this for each of the sets $A_i$, and write:
$$\Pr\{\mathsf{maybe}\} \ge \sum_{i=1}^{2^{nR}}\Pr\{T(\mathbf{X})=i\}\Pr\{\mathbf{Y}\in\Gamma_D(A_i)\} \ge \sum_{i:|A_i|\ge 2^{n(H(P_X)-R-2\gamma')}}\Pr\{T(\mathbf{X})=i\}\Pr\{\mathbf{Y}\in\Gamma_D(A_i)\} \overset{(a)}{\ge} \eta_n\!\!\sum_{i:|A_i|\ge 2^{n(H(P_X)-R-2\gamma')}}\!\!\Pr\{T(\mathbf{X})=i\} \overset{(b)}{\ge} \eta_n\left[1 - 2^{-n\gamma'}\right]. \tag{116--121}$$
In the above, $(a)$ follows from (115) and $(b)$ follows from Lemma 4. Since $\eta_n$ approaches $1$ exponentially fast, we conclude that the probability of maybe approaches $1$, also exponentially fast. This concludes the proof of the converse.

VI. SCHEMES BASED ON THE TRIANGLE INEQUALITY
In this section we discuss the triangle-inequality based schemes: the lossy compression-triangle inequality (LC-△) scheme and the type covering-triangle inequality (TC-△) scheme.

A. Lossy compression with triangle inequality (LC-△)

Here we prove that any compression rate above $R^{\text{LC-}\triangle}_{\mathrm{ID}}(D)$ [defined as the inverse function of $D^{\text{LC-}\triangle}_{\mathrm{ID}}(R)$, see (23)] can be attained via a scheme that employs standard lossy compression for the signature assignment and the triangle inequality for the decision rule.

Proof of Theorem 3:
We will show that any pair $(R,D)$ s.t. $D < D^{\text{LC-}\triangle}_{\mathrm{ID}}(R)$ is achievable, where
$$D^{\text{LC-}\triangle}_{\mathrm{ID}}(R) = E[\rho(\hat{X},Y)] - D(R), \tag{122}$$
$D(R)$ is the classical distortion-rate function of the source $X$, and $\hat{X}$, which is independent of $Y$, is distributed according to the $\hat{X}$-marginal of the $D(R)$-achieving distribution.

Let $P_{\hat{X}|X}$ be an achieving conditional distribution for the standard distortion-rate function at rate $R$, and let $P_{\hat{X}}$ be the corresponding marginal distribution. Next, use the covering lemma (Lemma 1) to show the existence of a code $\mathcal{C}$ that covers the typical set $T_{P_X,\delta_n}$ with the distribution $P_{X\hat{X}}$. (Recall that $\delta_n$ is defined according to the delta convention [30]; see also Sec. V.) In other words, for each sequence in the typical set $\mathbf{x}\in T_{P_X,\delta_n}$, there exists a sequence $\omega(\mathbf{x}) = \hat{\mathbf{x}}\in\mathcal{C}$ s.t. $\mathbf{x},\hat{\mathbf{x}}$ are strongly jointly typical according to the distribution $P_{X\hat{X}}$ (formally, $\hat{\mathbf{x}}\in T_{V,\delta_n}(\mathbf{x})$). We also know by the covering lemma that the code rate is upper bounded by $I(X;\hat{X})+\varepsilon_n$, where $\varepsilon_n$ vanishes as $\delta_n\to 0$. Since $P_{X\hat{X}}$ is the distortion-rate achieving distribution, we know that $d(\mathbf{x},\hat{\mathbf{x}}) \le D(R)+\varepsilon'_n$, with some $\varepsilon'_n$ that vanishes as $\delta_n\to 0$.

So far, we have constructed a standard code for lossy compression: for an input $\mathbf{x}$, if it is typical, its compressed representation is the index of $\hat{\mathbf{x}}$; if $\mathbf{x}$ is not typical, we declare an 'erasure' $e$. Therefore, with probability approaching one, we have a guarantee that the distortion between the source and the reconstruction is at most $D(R)+\varepsilon'_n$.

Next, we use this code in order to construct a compression scheme for identification. We only need to specify the decision process $g(\cdot,\cdot)$, which proceeds as follows. If $T(\mathbf{x}) = e$, we set $g(T(\mathbf{x}),\mathbf{y}) = \mathsf{maybe}$. Otherwise, we reconstruct $\hat{\mathbf{x}}$, and have
$$g(T(\mathbf{x}),\mathbf{y}) = \begin{cases}\mathsf{maybe}, & \text{if } d(\hat{\mathbf{x}},\mathbf{y}) \le D + D(R) + \varepsilon'_n;\\ \mathsf{no}, & \text{otherwise.}\end{cases} \tag{123}$$
It follows from the triangle inequality that whenever $d(\mathbf{x},\mathbf{y}) \le D$ and $d(\mathbf{x},\hat{\mathbf{x}}) \le D(R)+\varepsilon'_n$, then $d(\hat{\mathbf{x}},\mathbf{y}) \le D + D(R) + \varepsilon'_n$, triggering a maybe. Therefore the scheme is admissible.

Since, by construction, the rate of the scheme is arbitrarily close to $R$, we only need to verify that the probability of maybe vanishes. Assume that the similarity threshold satisfies $D = D^{\text{LC-}\triangle}_{\mathrm{ID}}(R) - \Delta_D$ for some $\Delta_D > 0$. We analyze the probability of maybe as follows:
$$\Pr\{\mathsf{maybe}\} \le \Pr\{\mathbf{X}\notin T_{P_X,\delta_n}\} + \Pr\{\mathbf{Y}\notin T_{P_Y,\delta_n}\} + \Pr\{\mathsf{maybe}\mid \mathbf{X}\in T_{P_X,\delta_n}, \mathbf{Y}\in T_{P_Y,\delta_n}\}. \tag{124}$$
The first two terms in the summation vanish with $n$. To bound the third term, we need to evaluate the probability that a sequence $\mathbf{Y}$ falls in a 'ball' of radius $D + D(R) + \varepsilon'_n$ centered at $\hat{\mathbf{x}}$. Let $\hat{\mathbf{x}}$ be the reconstruction sequence, with type $P'_{\hat{X}}$; we know that this type is close to $P_{\hat{X}}$, the marginal of the $D(R)$-achieving distribution. Also, let $P'_Y\in\mathcal{P}_n(\mathcal{Y})$ be a given $n$-type. Conditioned on $\mathbf{Y}\in T_{P'_Y}$, $\mathbf{Y}$ is distributed uniformly on $T_{P'_Y}$. Among those sequences, the fraction that triggers a maybe is given (up to sub-exponential factors) by
$$\frac{\sum 2^{nH(Y|\hat{X})}}{2^{nH(Y)}}, \tag{125}$$
where the summation is over all joint $n$-types for $(Y,\hat{X})$ with marginals $P'_Y, P'_{\hat{X}}$ s.t. $E[\rho(\hat{X},Y)] \le D + D(R) + \varepsilon'_n$. The exponent of this expression is given by
$$\min I(Y;\hat{X}), \tag{126}$$
where the minimization is over the same set of joint types as in (125).
We can see that this exponent can be made strictly positive: if it were zero, this would imply that there exist independent $\hat{X},Y$ (with marginals essentially equal to $P_{\hat{X}}, P_Y$) s.t. $E[\rho(\hat{X},Y)] \le D + D(R) + \varepsilon'_n$; but for such independent $\hat{X},Y$ we have $E[\rho(\hat{X},Y)] = D^{\text{LC-}\triangle}_{\mathrm{ID}}(R) + D(R) > D + D(R) + \varepsilon'_n$ for large $n$, contradicting the assumption that $D < D^{\text{LC-}\triangle}_{\mathrm{ID}}(R)$. The next step follows by standard type arguments showing that $P'_Y$ and $P'_{\hat{X}}$ are arbitrarily close to the true $P_Y, P_{\hat{X}}$ (see also the direct part of the proof of Theorem 2). Finally, we see that all three terms in the expression (124) for the probability of maybe vanish with $n$, as required.
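To make the LC-△ construction concrete, here is a minimal sketch — ours, not from the paper — of the signature assignment and the decision rule (123) for Hamming distortion. The argument `codebook` is a hypothetical stand-in for any rate-$R$ lossy code whose codewords cover the typical set within distortion roughly $D(R)$:

```python
# Sketch (ours) of the LC-triangle scheme of Theorem 3, under Hamming distortion.
# `codebook` stands in for any rate-R lossy code covering the typical set.
import numpy as np

def hamming(a, b):
    """Normalized Hamming distance between two equal-length sequences."""
    return float(np.mean(np.asarray(a) != np.asarray(b)))

def signature(x, codebook, D_R, eps):
    """T(x): index of a codeword within distortion D(R)+eps, or 'e' (erasure)."""
    dists = [hamming(x, c) for c in codebook]
    i = int(np.argmin(dists))
    return i if dists[i] <= D_R + eps else 'e'

def decide(sig, y, codebook, D, D_R, eps):
    """g(T(x), y): answer 'maybe' unless the triangle inequality rules out d(x,y) <= D."""
    if sig == 'e':
        return 'maybe'
    x_hat = codebook[sig]
    # If d(x,y) <= D and d(x,x_hat) <= D(R)+eps, then d(x_hat,y) <= D+D(R)+eps,
    # so answering 'no' outside this ball can never cause a false negative.
    return 'maybe' if hamming(x_hat, y) <= D + D_R + eps else 'no'
```

The sketch only illustrates the decision logic; building an actual covering codebook is the standard lossy-compression design problem discussed in the proof.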
B. Type covering with triangle inequality (TC-△)

Here we prove that for any similarity threshold $D < D^{\text{TC-}\triangle}_{\mathrm{ID}}(R)$, there exists a sequence of rate-$R$ schemes that are $D$-admissible.

Proof of Theorem 4:
The proof follows the steps of the proof of Theorem 3 above, with the following key exception: the conditional distribution determining the code that describes $X$ remains a design parameter and is optimized at the end (rather than at the beginning of the proof, as in Theorem 3, where it is set to be the one that minimizes the expected distortion between $X$ and $\hat{X}$).

More formally, let $P_{\hat{X}|X}$ be a conditional distribution s.t. $I(X;\hat{X}) \le R$. Again, we use the covering lemma (Lemma 1) to show the existence of a code $\mathcal{C}$ that covers the typical set $T_{P_X,\delta_n}$ with the distribution $P_{X\hat{X}}$. In other words, for each $\mathbf{x}\in T_{P_X,\delta_n}$, there exists a sequence $\omega(\mathbf{x}) = \hat{\mathbf{x}}\in\mathcal{C}$ s.t. $\mathbf{x},\hat{\mathbf{x}}$ are strongly jointly typical according to the distribution $P_{X\hat{X}}$ (formally, $\hat{\mathbf{x}}\in T_{V,\delta_n}(\mathbf{x})$). Again, we know by the covering lemma that the code rate is upper bounded by $I(X;\hat{X})+\varepsilon_n$, where $\varepsilon_n$ vanishes as $\delta_n\to 0$. Since $\mathbf{x},\hat{\mathbf{x}}$ are strongly jointly typical, we know that $d(\mathbf{x},\hat{\mathbf{x}}) \le E[\rho(X,\hat{X})]+\varepsilon'_n$, with some $\varepsilon'_n$ that vanishes as $\delta_n\to 0$.

The rest of the proof is identical to that of Theorem 3, with $D(R)$ replaced by $E[\rho(X,\hat{X})]$. If we choose $P_{\hat{X}|X}$ s.t. $E[\rho(\hat{X},Y)] - E[\rho(X,\hat{X})] > D$, we can verify that the exponent of the third term in the equivalent of (124) is positive, proving that the overall probability of maybe vanishes, as required.

Remarks:

- It is obvious that $R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) \le R^{\text{LC-}\triangle}_{\mathrm{ID}}(D)$, since the distortion-rate achieving distribution in (23) is a feasible transition probability for the expression in (24). Therefore the TC-△ scheme can be regarded as a generalization of LC-△.
- In order to simplify the discussion, we have assumed that the distortion measure $\rho(\cdot,\cdot)$ is symmetric. Similar results can be obtained for the non-symmetric case, where the only difference is that the triangle inequality is applied in the following form:
$$d(\hat{\mathbf{x}},\mathbf{y}) \le d(\hat{\mathbf{x}},\mathbf{x}) + d(\mathbf{x},\mathbf{y}). \tag{127}$$
Therefore the compression of $\mathbf{x}$ needs to be done (in the LC-△ scheme) under the distortion measure $\rho'(x,\hat{x}) \triangleq \rho(\hat{x},x)$. The decision rule (in the typical case) is given by
$$g(T(\mathbf{x}),\mathbf{y}) = \begin{cases}\mathsf{maybe}, & \text{if } d(\hat{\mathbf{x}},\mathbf{y}) \le D + E[\rho(\hat{X},X)] + \varepsilon'_n;\\ \mathsf{no}, & \text{otherwise.}\end{cases} \tag{128}$$
- If, in addition to the symmetry condition, we required that $\rho(x,y) = 0$ if and only if $x = y$, the measure would be a metric. However, there is no need for this third condition in order for the results to hold.
- In principle, one could add another condition that rules out additional sequences that are guaranteed not to be similar, using the modified decision rule:
$$g(T(\mathbf{x}),\mathbf{y}) = \begin{cases}\mathsf{maybe}, & \text{if } D(R) - D \le d(\hat{\mathbf{x}},\mathbf{y}) \le D + D(R);\\ \mathsf{no}, & \text{otherwise.}\end{cases} \tag{129}$$
This condition retains the admissibility of the scheme by another usage of the triangle inequality. In essence, it allows us to rule out sequences $\mathbf{y}$ that are too close to $\hat{\mathbf{x}}$, since we know that $\mathbf{x}$ is at distance approximately $D(R)$ from it. A similar argument holds for the LC-△ scheme as well. However, this condition does not improve the achievable rate, and even in practice the performance gain is generally negligible (see [29]).
C. Special cases

In general,
$$R_{\mathrm{ID}}(D) \le R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) \le R^{\text{LC-}\triangle}_{\mathrm{ID}}(D), \tag{130}$$
where the inequalities may be strict (see Fig. 4). There are cases, however, where some of the inequalities in (130) hold with equality. We review some of those cases here.

Theorem 7:
If there exists a constant $D_0$ (that may depend on $P_Y$), s.t. for all $\hat{x}\in\mathcal{X}$,
$$\sum_{y\in\mathcal{Y}} P_Y(y)\,\rho(\hat{x},y) = D_0, \tag{131}$$
then
$$R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) = R^{\text{LC-}\triangle}_{\mathrm{ID}}(D) = R(D_0 - D), \tag{132}$$
where $R(\cdot)$ is the rate-distortion function of the source $X$ under the distortion measure $\rho(\cdot,\cdot)$.

Proof:
Under the stipulation in the theorem, we have that for any RV $\hat{X}$ that is independent of $Y$,
$$E[\rho(\hat{X},Y)] = \sum_{\hat{x},y} P_Y(y)P_{\hat{X}}(\hat{x})\rho(\hat{x},y) = \sum_{\hat{x}} P_{\hat{X}}(\hat{x})\sum_y P_Y(y)\rho(\hat{x},y) = D_0. \tag{133--135}$$
It follows that
$$D^{\text{TC-}\triangle}_{\mathrm{ID}}(R) = \max_{P_{\hat{X}|X}:\,I(X;\hat{X})\le R}\left(D_0 - E[\rho(X,\hat{X})]\right) = D_0 - \min_{P_{\hat{X}|X}:\,I(X;\hat{X})\le R} E[\rho(X,\hat{X})] = D_0 - D(R). \tag{136--138}$$
The proof is concluded by noting that (138) is also equal to $D^{\text{LC-}\triangle}_{\mathrm{ID}}(R)$, and by inverting the relation $D = D_0 - D(R)$ to obtain (132).

The condition of the theorem holds, for example, in the following cases:

- If $Y$ is equiprobable on $\mathcal{Y} = \mathcal{X}$, and the columns of $\rho(\cdot,\cdot)$ are permutations of each other (e.g. if $\rho(\cdot,\cdot)$ is a 'difference' distortion measure), then $D_0$ is given by
$$D_0 = \frac{1}{|\mathcal{X}|}\sum_{y\in\mathcal{X}}\rho(\hat{x},y) \quad\text{(the same value for every } \hat{x}\text{)}. \tag{139}$$
- A special case of the above is the Hamming distortion. In this case (where $P_Y$ is still equiprobable),
$$D_0 = \frac{|\mathcal{X}|-1}{|\mathcal{X}|}. \tag{140}$$

Theorem 7 implies that in such simple cases the LC-△ scheme is equivalent (in the rate sense) to the TC-△ scheme. If $X$ and $Y$ are binary and equiprobable, and the distortion measure is Hamming, it follows from [29, Theorem 1] that $R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) = R_{\mathrm{ID}}(D)$, i.e. the TC-△ and LC-△ schemes are optimal. However, if $X$ and $Y$ are not equiprobable (and the distortion measure is still Hamming), the LC-△ scheme differs from the TC-△ scheme (see [35, Fig. 2]). Is the TC-△ scheme optimal for the binary-Hamming case? The following theorem answers this question in the affirmative.
Theorem 8: For the binary-Hamming case, i.e. $X\sim\mathrm{Ber}(p)$ and $Y\sim\mathrm{Ber}(q)$,
$$R_{\mathrm{ID}}(D) = R^{\text{TC-}\triangle}_{\mathrm{ID}}(D). \tag{141}$$

Proof:
We first show that it is sufficient to take the cardinality of $\mathcal{U}$ in (17) to be equal to $2$. To that end, note that for the binary-Hamming case we have $\bar\rho(p',q') = |p'-q'|$, identifying a distribution on $\{0,1\}$ with its probability of $1$.

Let $\mathcal{U}$ be an arbitrary (but finite) alphabet, and let $P_{U|X}$ be a given channel from $\mathcal{X}$ to $\mathcal{U}$ that attains the minimum in $R_{\mathrm{ID}}(D)$. It has to satisfy
$$\sum_{u\in\mathcal{U}} P_U(u)\,\bar\rho(P_{X|U}(\cdot|u),P_Y) \ge D, \tag{142}$$
i.e. it is feasible for the optimization in $R_{\mathrm{ID}}(D)$. Suppose there exist $u_1$ and $u_2$ for which both $P_{X|U}(1|u_1) \ge q$ and $P_{X|U}(1|u_2) \ge q$ hold. Next, define a new channel $P'_{U|X}$ that is the result of the channel $P_{U|X}$, followed by a merge of $u_1$ and $u_2$ into a new symbol $u^*$. By the data processing inequality, the mutual information does not increase (in fact, it decreases unless $P_{X|U}(1|u_1) = P_{X|U}(1|u_2)$). The new reverse channel given $u^*$ is easily calculated as
$$P'_{X|U}(1|u^*) = \frac{P_U(u_1)P_{X|U}(1|u_1) + P_U(u_2)P_{X|U}(1|u_2)}{P_U(u_1)+P_U(u_2)}, \tag{143}$$
and the new prior is
$$P'_U(u^*) = P_U(u_1)+P_U(u_2). \tag{144}$$
Next, observe that
$$P_U(u_1)\bar\rho(P_{X|U}(\cdot|u_1),P_Y) + P_U(u_2)\bar\rho(P_{X|U}(\cdot|u_2),P_Y) = P_U(u_1)\left|P_{X|U}(1|u_1)-q\right| + P_U(u_2)\left|P_{X|U}(1|u_2)-q\right| = P_U(u_1)\left(P_{X|U}(1|u_1)-q\right) + P_U(u_2)\left(P_{X|U}(1|u_2)-q\right) = P'_U(u^*)\left(P'_{X|U}(1|u^*)-q\right).$$
Since for all the other values of $u\in\mathcal{U}$, $P_{X|U}(x|u) = P'_{X|U}(x|u)$, we conclude that the new channel is also feasible, and attains a lower mutual information. Therefore the cardinality of $\mathcal{U}$ can be safely reduced by $1$, without increasing the value of the optimization defining $R_{\mathrm{ID}}(D)$. The same holds for the case where both $P_{X|U}(1|u_1) \le q$ and $P_{X|U}(1|u_2) \le q$ hold. This process can be applied to any channel with more than one value of $u$ for which $P_{X|U}(1|u)$ lies on the same side of $q$. Therefore, for the optimal channel, it suffices to check only channels with $|\mathcal{U}| = 2$.

Next, we aim to show that
$$\min_{P_{U|X}:\ \sum_{u\in\mathcal{U}}P_U(u)\bar\rho(P_{X|U}(\cdot|u),P_Y)\ge D} I(X;U) = \min_{P_{U|X}:\ E[\rho(U,Y)]-E[\rho(X,U)]\ge D} I(X;U). \tag{145}$$
Since it suffices to look at binary $U$, the feasibility condition on the LHS of (145) can be rewritten as
$$P_U(0)\left|P_{X|U}(1|0)-q\right| + P_U(1)\left|P_{X|U}(1|1)-q\right| \ge D. \tag{146}$$
Let $P^*_{X|U}$ be a minimizing channel for the LHS of (145), and denote the reverse channel by $P^*_{U|X}$. Next, assume that $P^*_{X|U}(1|0)$ and $P^*_{X|U}(1|1)$ are on different sides of $q$. Then there is no loss of generality in assuming that $P^*_{X|U}(1|0) \le q$ and $P^*_{X|U}(1|1) \ge q$ (if this is not the case, it can be achieved by reversing the roles of $u=1$ and $u=0$). In this case, the feasibility condition can be rewritten as
$$P_U(0)\left(q - P^*_{X|U}(1|0)\right) + P_U(1)\left(P^*_{X|U}(1|1) - q\right) \ge D. \tag{147}$$
On the other hand, the feasibility condition for the RHS of (145) is
$$E[\rho(U,Y)] - E[\rho(U,X)] \ge D, \tag{148}$$
which is the same as
$$P_U(0)\left(P_Y(1)-P_{X|U}(1|0)\right) + P_U(1)\left(P_Y(0)-P_{X|U}(0|1)\right) \ge D, \tag{149}$$
or, equivalently,
$$P_U(0)\left(q - P_{X|U}(1|0)\right) + P_U(1)\left(P_{X|U}(1|1) - q\right) \ge D. \tag{150}$$
We conclude that the channel $P^*_{U|X}$ is also feasible for the RHS of (145), thereby proving (145), i.e. that $R_{\mathrm{ID}}(D) = R^{\text{TC-}\triangle}_{\mathrm{ID}}(D)$.

If $P^*_{X|U}(1|0)$ and $P^*_{X|U}(1|1)$ are on the same side of $q$, then $u=0$ and $u=1$ can be merged, following the steps of the merging process at the beginning of the proof. After the merge, $U$ is no longer a random variable but a constant, and $X$ and $U$ are therefore independent. This implies that $R_{\mathrm{ID}}(D) = 0$, and also that $|p-q| \ge D$. In this special case, we show that $R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) = 0$ directly.

Our goal is to find a channel that makes $X,U$ independent, and at the same time is feasible for the minimization of $R^{\text{TC-}\triangle}_{\mathrm{ID}}(D)$ according to (150). For this purpose, choose the channel
$$P_{U|X}(0|x) = \alpha;\quad P_{U|X}(1|x) = 1-\alpha, \quad\text{for all } x\in\{0,1\}. \tag{151}$$
It is easy to see that $U$ and $X$ are independent, i.e. $I(X;U) = 0$, and that $P_U(0) = \alpha$. Next, in order to satisfy (150), choose either $\alpha = 0$ or $\alpha = 1$, according to whether $q < p$ or $p < q$ (if $p = q$, this implies that $D = 0$ and then any choice of $\alpha$ works).
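To make the merging step concrete, the following small sketch — ours, in array-based notation — implements the construction (143)-(144):

```python
# Sketch (ours) of the symbol-merging step (143)-(144): merge u1, u2 into u*
# while keeping the joint distribution with X consistent.
import numpy as np

def merge(P_U, P_X_given_U, u1, u2):
    """P_U: prior over U; P_X_given_U: rows indexed by u. Returns merged pair."""
    w1, w2 = P_U[u1], P_U[u2]
    p_u_star = w1 + w2                                                  # (144)
    row_star = (w1 * P_X_given_U[u1] + w2 * P_X_given_U[u2]) / p_u_star  # (143)
    keep = [u for u in range(len(P_U)) if u not in (u1, u2)]
    P_U_new = np.append(P_U[keep], p_u_star)
    P_X_given_U_new = np.vstack([P_X_given_U[keep], row_star])
    return P_U_new, P_X_given_U_new
```

Repeatedly applying `merge` to pairs of symbols whose posteriors lie on the same side of $q$ terminates with $|\mathcal{U}| \le 2$, as argued in the proof.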
Theorems 7 and 8, when combined, result in the following corollary.
Corollary 1: If $X\sim\mathrm{Ber}(p)$, $Y\sim\mathrm{Ber}(1/2)$, and the distortion measure is Hamming, then
$$R_{\mathrm{ID}}(D) = R^{\text{TC-}\triangle}_{\mathrm{ID}}(D) = R^{\text{LC-}\triangle}_{\mathrm{ID}}(D) = R(1/2 - D), \tag{152}$$
where $R(\cdot)$ is the rate-distortion function of the source $X$ under Hamming distortion.

Note that this result is slightly more general than [29, Theorem 1], since here $X$ is not restricted to be symmetric.
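As a concrete illustration of Corollary 1, the following sketch — ours — evaluates $R_{\mathrm{ID}}(D) = R(1/2 - D)$ using the standard binary rate-distortion formula $R(d) = h(p) - h(d)$ for $d < \min(p, 1-p)$ (and $0$ otherwise):

```python
# Sketch (ours): R_ID(D) for X ~ Ber(p), Y ~ Ber(1/2), Hamming distortion,
# via Corollary 1: R_ID(D) = R(1/2 - D), with R(d) = h(p) - h(d) clipped at 0.
import numpy as np

def h(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q*np.log2(q) - (1-q)*np.log2(1-q)

def R_binary(p, d):
    """Rate-distortion function of a Ber(p) source under Hamming distortion."""
    return max(0.0, h(p) - h(d)) if d < min(p, 1-p) else 0.0

def R_ID_binary(p, D):
    """Identification rate of Corollary 1 (0 <= D <= 1/2)."""
    return R_binary(p, 0.5 - D)

print(R_ID_binary(0.5, 0.1))  # 1 - h(0.4), approximately 0.029 bits
```

Note that the clipping at zero is consistent with the proof of Theorem 8: for $D \le |p - 1/2|$ the identification rate is zero.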
VII. COMPUTING THE IDENTIFICATION RATE

Calculating the value of the achievable rates $R^{\text{LC-}\triangle}_{\mathrm{ID}}(D)$ and $R^{\text{TC-}\triangle}_{\mathrm{ID}}(D)$ is relatively easy. The term $D^{\text{LC-}\triangle}_{\mathrm{ID}}(R)$, shown in (23), can be calculated from the distortion-rate function (and the achieving reconstruction distribution), which can be computed with, e.g., the well-known Blahut-Arimoto algorithm, or simply as a minimization of a linear function over a convex set. The term $D^{\text{TC-}\triangle}_{\mathrm{ID}}(R)$, shown in (24), is also given as a minimization of a linear function under convex constraints, and therefore can be solved easily. The general term $R_{\mathrm{ID}}(D)$, however, is posed as a minimization problem with nonconvex constraints, making its computation a challenge. In this section we give two results that facilitate the computation of this quantity. In Subsection VII-A we improve the bound on the cardinality of $\mathcal{U}$, which reduces the dimension of the optimization problem. In Subsection VII-B we describe the process of transforming the (non-convex) problem into a sequence of convex problems that can be solved efficiently.

A. Cardinality of the auxiliary RV U

For the evaluation of $R_{\mathrm{ID}}(D)$, it was already shown in [1, Lemma 3] that it suffices to consider only $|\mathcal{U}| = |\mathcal{X}| + 2$. Here we prove an improvement of the cardinality bound, stated in Theorem 6.

Proof of Theorem 6:
We start by proving that taking $|\mathcal{U}| = |\mathcal{X}|+1$ suffices to calculate $R_{\mathrm{ID}}(D)$. The proof follows the idea from [36], i.e. using the strengthened version of Carathéodory's theorem due to Fenchel and Eggleston.

Define $|\mathcal{X}|+1$ functions $\Psi_i: \mathcal{P}(\mathcal{X}) \to \mathbb{R}$; the functions $\Psi_i$ take a distribution on $\mathcal{X}$ (such as $P_{X|U}(\cdot|u)$ for a given $u$) and return a real number. The functions are given by:
$$\Psi_x(Q) = Q(x), \quad\text{for } x = 1,\ldots,|\mathcal{X}|-1; \tag{153}$$
$$\Psi_{|\mathcal{X}|}(Q) = \bar\rho(Q, P_Y); \tag{154}$$
$$\Psi_{|\mathcal{X}|+1}(Q) = H(P_X) - H(Q). \tag{155}$$
Note that in the optimization defining $R_{\mathrm{ID}}(D)$, the objective function can be written as
$$I(X;U) = \sum_{u\in\mathcal{U}} P_U(u)\,\Psi_{|\mathcal{X}|+1}(P_{X|U}(\cdot|u)), \tag{156}$$
and the constraint can be written as
$$\sum_{u\in\mathcal{U}} P_U(u)\,\Psi_{|\mathcal{X}|}(P_{X|U}(\cdot|u)) \ge D. \tag{157}$$
Define the set $A$ to be the set of tuples $(\Psi_1(Q),\ldots,\Psi_{|\mathcal{X}|+1}(Q))$ for all $Q\in\mathcal{P}(\mathcal{X})$. Note that $A$ is a closed and connected set, and therefore any point in the convex hull of $A$ can be represented as a convex combination of at most $|\mathcal{X}|+1$ elements of $A$ (this is due to the Fenchel-Eggleston-Carathéodory theorem; see, e.g., [37, Theorem 18]). Define $B$ to be the convex hull of $A$. Further, define $B_{P_X}$ as
$$B_{P_X} \triangleq \left\{(\psi_1,\ldots,\psi_{|\mathcal{X}|+1}) \in B : \psi_i = P_X(i),\ 1\le i\le|\mathcal{X}|-1\right\}. \tag{158}$$
In other words, the set $B_{P_X}$ contains only convex combinations of $A$ that correspond to collections of distributions on $\mathcal{X}$ which, when averaged with the convex weights (which represent the distribution of $U$), result in the distribution $P_X$.

Finally, let $P^*_{U|X}$ be an achieving distribution for $R_{\mathrm{ID}}(D)$, and let $P^*_U$ and $P^*_{X|U}$ be the induced marginal on $U$ and the reverse conditional distribution, respectively. $P^*_U$ and $P^*_{X|U}$ can be associated with a point in the set $B_{P_X}$, where the attained $(R,D)$ pair is given by the last two coordinates of the vector in $B_{P_X}$. As claimed above, any point in $B$ can be represented as a convex combination of at most $|\mathcal{X}|+1$ points of $A$; in other words, the same pair $(R,D)$ can be achieved with a distribution that averages only $|\mathcal{X}|+1$ distributions on $\mathcal{X}$, i.e. the cardinality of $\mathcal{U}$ can be limited to $|\mathcal{X}|+1$.

For the second part, recall the following facts:

- $R_{\mathrm{ID}}(D)$ is a convex function of $D$ [1, Lemma 3]. Denote the region of achievable pairs $(R,D)$ as
$$\mathcal{R} \triangleq \{(R,D) : R \ge R_{\mathrm{ID}}(D)\}, \tag{159}$$
where $D\in[0,\rho_{\max}]$ and $R \le \log|\mathcal{X}|$. The set is therefore closed, bounded and convex. (The restriction on the values of $D$ and $R$ is harmless: any rate above $\log|\mathcal{X}|$ is trivially achievable by using the sequence $\mathbf{x}$ itself as the signature, and a distortion threshold above $\rho_{\max}$ renders all sequences similar to each other, making the problem degenerate.)
- For a convex set, an extreme point is a point in the set that cannot be represented as a nontrivial convex combination of other points in the set (a convex combination is considered trivial if all the coefficients are zero except for one, which is equal to one). It is a well-known theorem that any compact convex set is equal to the convex hull of its extreme points (e.g. [38, Corr. 18.5.1]). For our proof we require the more delicate notion of exposed points.
- An exposed point $p$ of a convex set is a point on the boundary of the set s.t. there exists a supporting hyperplane of the set at $p$ (a hyperplane that touches the set at $p$, with the set entirely on one side of it), whose intersection with the set itself is equal to $\{p\}$. Any exposed point is also an extreme point. The most useful fact about exposed points is that any closed and bounded convex set is equal to the closure of the convex hull of its exposed points (a special case of [38, Theorem 18.7]). We shall use this fact directly.

To begin the proof, let $(R_0,D_0)$ (where $R_0 = R_{\mathrm{ID}}(D_0)$) be an exposed point of the achievable region. Our goal is to show that the point $(D_0,R_0)$ is achievable with a conditional distribution $P_{U|X}$ s.t. the distribution of $U$ is supported on at most $|\mathcal{X}|$ elements. Next, let $(c,\lambda)$ be constants s.t. $R = c + \lambda D$ is a supporting line (a hyperplane in 2D) of the achievable region at $(D_0,R_0)$, s.t. the intersection of $\mathcal{R}$ and the line contains this point only. Such a line is guaranteed to exist by the assumption that $(R_0,D_0)$ is an exposed point. A typical picture is shown in Fig. 6.

[Fig. 6: The achievable region $\mathcal{R}$, an exposed point $(D_0,R_0)$ and a supporting line.]

Recall that $R_{\mathrm{ID}}(D)$ is given by the minimization expression (17). Let $P^*_{U|X}$ be an achieving conditional distribution at $D_0$, i.e. one that minimizes (17). If $R_{\mathrm{ID}}(D_0) = 0$, this implies that $U$ and $X$ are independent, and therefore $\bar\rho(P_X,P_Y) \ge D_0$. This means that $R_{\mathrm{ID}}(D_0)$ can be attained by a trivial distribution of $U$ (where $U$ is a constant). In the general case where $R_{\mathrm{ID}}(D_0) > 0$, we conclude that the constraint in (17) is active, and therefore
$$R_0 = I(X;U) \quad\text{and}\quad D_0 = E\left[\bar\rho(P^*_{X|U}(\cdot|U), P_Y)\right], \tag{160}$$
where $X,U$ are distributed according to $P_X, P^*_{U|X}$.

Next, note that $P^*_{U|X}$ also minimizes the expression
$$I(X;U) - \lambda\,E\left[\bar\rho(P_{X|U}(\cdot|U), P_Y)\right], \tag{161}$$
where the minimization is without constraints (other than the fact that $P_{U|X}$ is a conditional distribution). This fact follows since the existence of a better minimum would imply a distribution $P'_{U|X}$ for which $(I(X;U), E[\bar\rho(P'_{X|U}(\cdot|U),P_Y)])$ falls outside the achievable region (due to the supporting hyperplane property), leading to a contradiction.

Next, we claim that from $P^*_{U|X}$ we can construct a distribution $P^{**}_{U|X}$ that attains the same minimum in (161), for which the distribution of $U$ is supported on at most $|\mathcal{X}|$ elements. This can be shown by following the same steps as in the first part of the proof (where the cardinality was shown to be bounded by $|\mathcal{X}|+1$), but now we replace the two functions $\Psi_{|\mathcal{X}|}$ and $\Psi_{|\mathcal{X}|+1}$ by a single function $\Psi_{|\mathcal{X}|}$ that is equal to the objective of (161).

Since the new distribution $P^{**}_{U|X}$ attains the same minimum in (161) as $P^*_{U|X}$ does, we conclude that the point $(D_1,R_1)$, given by
$$R_1 = I(X;U) \quad\text{and}\quad D_1 = E\left[\bar\rho(P^{**}_{X|U}(\cdot|U), P_Y)\right], \tag{162}$$
where $X,U$ are distributed according to $P_X, P^{**}_{U|X}$, satisfies the same line equation $R = c + \lambda D$. However, since we assumed that $(D_0,R_0)$ is an exposed point of the achievable region, then by definition
$$(D_1,R_1) = (D_0,R_0). \tag{163}$$
In other words, the distribution $P^{**}_{U|X}$ attains the minimum of the original minimization problem (17).

The proof is concluded since, as noted before, a bounded, closed and convex set is equal to the closure of the convex hull of its exposed points. As a result, the achievable region $\mathcal{R}$ can be calculated by computing $R_{\mathrm{ID}}(D)$ with $|\mathcal{U}| = |\mathcal{X}|$, and then taking the lower convex envelope (the closure operation has no practical effect).

B. Conversion to a set of convex functions
Consider first the case where the distortion measure is Hamming, and $P_X, P_Y$ are arbitrary distributions on $\mathcal{X}$. In this case, it is not hard to verify that
$$\bar\rho(P_X,P_Y) = \frac{1}{2}\|P_X - P_Y\|_1 = \frac{1}{2}\sum_{x\in\mathcal{X}}|P_X(x) - P_Y(x)|. \tag{164--165}$$
With this fact, we can rewrite the identification rate as
$$\min_{P_{U|X}} I(X;U), \tag{166}$$
where the minimization is w.r.t. all conditional distributions $P_{U|X}$ s.t.
$$\frac{1}{2}\sum_{u\in\mathcal{U}} P_U(u)\sum_{x\in\mathcal{X}}\left|P_{X|U}(x|u) - P_Y(x)\right| \ge D. \tag{167}$$
Define $\mathcal{F}$ to be the set of all functions that take a pair in $\mathcal{X}\times\mathcal{U}$ and return a binary value. With this, the condition (167) is equivalent to
$$\max_{f\in\mathcal{F}}\ \frac{1}{2}\sum_{u\in\mathcal{U}} P_U(u)\sum_{x\in\mathcal{X}}(-1)^{f(x,u)}\left(P_{X|U}(x|u) - P_Y(x)\right) \ge D. \tag{168}$$
Alternatively, we may require $P_{U|X}$ to satisfy
$$\frac{1}{2}\sum_{u\in\mathcal{U}} P_U(u)\sum_{x\in\mathcal{X}}(-1)^{f(x,u)}\left(P_{X|U}(x|u) - P_Y(x)\right) \ge D \tag{169}$$
for some function $f\in\mathcal{F}$. Define the LHS of (169) as $L_f(P_{U|X})$, and rewrite it as
$$L_f(P_{U|X}) \triangleq \frac{1}{2}\sum_{u\in\mathcal{U}} P_U(u)\sum_{x\in\mathcal{X}}(-1)^{f(x,u)}\left(P_{X|U}(x|u) - P_Y(x)\right) = \frac{1}{2}\sum_{u\in\mathcal{U}}\sum_{x\in\mathcal{X}}(-1)^{f(x,u)}\left(P_{U|X}(u|x)P_X(x) - P_U(u)P_Y(x)\right), \tag{170--172}$$
which shows that $L_f(P_{U|X})$ is a linear function of the optimization variable $P_{U|X}$. Finally, we can rewrite the optimization problem as
$$R_{\mathrm{ID}}(D) = \min_{f\in\mathcal{F}}\ \min_{P_{U|X}:\ L_f(P_{U|X})\ge D} I(X;U). \tag{173}$$
The expression (173) gives rise to the following scheme for computing $R_{\mathrm{ID}}(D)$: since $L_f(\cdot)$ is a linear function, the inner optimization in (173) is that of a convex objective under linear constraints, and can be solved efficiently (e.g. via CVX [32]). To get the value of $R_{\mathrm{ID}}(D)$, simply repeat the inner optimization for all $f\in\mathcal{F}$, and take the overall minimal value.

The main problem with this approach is that the number of optimization problems can be very large. Assume that $\mathcal{U} = \mathcal{X}$, following the previous subsection. The size of $\mathcal{F}$ is then $2^{|\mathcal{X}|^2}$. For $|\mathcal{X}| = 5$, for example, one would need to solve $2^{25} \approx 3.4\times 10^7$ optimization problems. We can slightly improve the situation by utilizing symmetries in the expression (169).

[Table I: Number of convex optimization problems to be solved in order to calculate $R_{\mathrm{ID}}(D)$ — values of $|\mathcal{F}|$ and $|\mathcal{F}'|$ for small alphabet sizes $|\mathcal{X}|$.]
Theorem 9: Define the set $\mathcal{F}' \subseteq \mathcal{F}$ as follows. The set $\mathcal{F}'$ shall contain only functions $f(\cdot,\cdot)$ where:

- For all $u$, the function $f(\cdot,u)$ is not constant (in $x$). In other words, in the inner summation in (169), some of the summands must be flipped and some not.
- There are no $u_1 \ne u_2 \in \mathcal{U}$ s.t. $f(x,u_1) = f(x,u_2)$ for all $x\in\mathcal{X}$.

Then in the double optimization of the form (173), it suffices to consider only functions $f\in\mathcal{F}'$ as defined above. In addition, the number of such functions is given by
$$|\mathcal{F}'| = \binom{2^{|\mathcal{X}|}-2}{|\mathcal{U}|}. \tag{174}$$

Proof: Appendix E.
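To make the computation scheme concrete, here is a minimal sketch — ours, not from the paper — of the enumeration in (173) restricted to $\mathcal{F}'$ per Theorem 9, for Hamming distortion. It assumes the numpy and cvxpy packages; all names are our own. The joint pmf $J(x,u) = P_{U|X}(u|x)P_X(x)$ is used as the optimization variable, which keeps both the objective $I(X;U)$ and the constraint $L_f \ge D$ (with the $\tfrac12$ factor of (167)-(170)) convex in the DCP sense:

```python
# Sketch (ours): compute R_ID(D) for Hamming distortion via (173) and Theorem 9.
# One convex program is solved per sign pattern f in F'; the variable is the
# joint pmf J(x,u) = P_{U|X}(u|x) P_X(x).
import itertools
import numpy as np
import cvxpy as cp

def R_ID_hamming(P_X, P_Y, D):
    nX = len(P_X)
    nU = nX  # |U| = |X| suffices by Theorem 6
    # Admissible rows f(., u): all non-constant binary patterns over X.
    rows = [r for r in itertools.product([0, 1], repeat=nX) if 0 < sum(r) < nX]
    best = np.inf
    for combo in itertools.combinations(rows, nU):  # |F'| = C(2^|X|-2, |U|)
        S = 1.0 - 2.0 * np.array(combo).T           # signs (-1)^f(x,u), shape (nX, nU)
        J = cp.Variable((nX, nU), nonneg=True)      # joint pmf of (X, U)
        P_U = cp.sum(J, axis=0)                     # U-marginal (linear in J)
        Q = cp.vstack([P_X[x] * P_U for x in range(nX)])   # P_X(x) P_U(u)
        QY = cp.vstack([P_Y[x] * P_U for x in range(nX)])  # P_Y(x) P_U(u)
        mi_nats = cp.sum(cp.rel_entr(J, Q))         # I(X;U) in nats, convex in J
        constraints = [cp.sum(J, axis=1) == P_X,                    # X-marginal fixed
                       0.5 * cp.sum(cp.multiply(S, J - QY)) >= D]   # L_f >= D
        prob = cp.Problem(cp.Minimize(mi_nats), constraints)
        prob.solve()
        if prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
            best = min(best, prob.value)
    return best / np.log(2)  # nats -> bits

# Example: ternary alphabet -> C(6,3) = 20 convex programs per value of D.
print(R_ID_hamming(np.array([0.5, 0.3, 0.2]), np.ones(3) / 3, D=0.2))
```

The inner problem involves the relative-entropy cone, so an exponential-cone-capable solver (e.g. SCS or ECOS) is assumed; for $|\mathcal{X}| = 3$ the sketch solves 20 programs per point, matching (174).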
For quick reference, Table I shows the resulting reduction in the number of optimization problems that suffice, as implied by Theorem 9. For example, the identification rate in Fig. 4 above, for a ternary alphabet, was calculated using the method above: at each point we solved $\binom{2^3-2}{3} = 20$ convex optimization programs and took the minimal value.

As seen in Table I, the proposed method for the computation of $R_{\mathrm{ID}}(D)$, although it improves on the naive use of (173), is only effective for small values of $|\mathcal{X}|$. It therefore remains an open problem how to calculate $R_{\mathrm{ID}}(D)$ efficiently for larger alphabets.

Finally, we briefly note how to extend the process described here to arbitrary distortion measures. The key fact in the computation of $R_{\mathrm{ID}}(D)$ in the Hamming case is that the function $f(P) = \bar\rho(P,P_Y)$ can be represented as a maximum of linear functions of $P$. This fact holds in the general case as well, following the fact that the epigraph of the function $f$, defined as
$$\mathrm{epi}\,f \triangleq \{(P,D)\in\mathcal{P}(\mathcal{X})\times\mathbb{R} : f(P) \le D\}, \tag{175}$$
is always a polyhedron (for the proof of this fact, see [4]). Once the $\bar\rho$-distance has been represented as a maximum of linear functions, the process described in (164)-(173) can be followed. Note that, as in the Hamming case, this approach is expected to allow easy computation of $R_{\mathrm{ID}}(D)$ only in cases where the alphabet $\mathcal{X}$ is small.
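Concretely, $\bar\rho(P,P_Y)$ is itself the optimal value of a small transportation LP — the minimum of $E[\rho(X,Y)]$ over couplings with the prescribed marginals, as used in (113) — which is one way to see that its epigraph is polyhedral. A minimal sketch (ours, again assuming cvxpy):

```python
# Sketch (ours): rho-bar as a transportation LP -- min E[rho(X,Y)] over all
# couplings Psi with marginals P and Q. The optimal value is a polyhedral
# (piecewise-linear) function of P, which underlies the extension above.
import numpy as np
import cvxpy as cp

def rho_bar(P, Q, rho):
    """rho: |X| x |Y| cost matrix; P, Q: marginal pmfs."""
    Psi = cp.Variable((len(P), len(Q)), nonneg=True)   # coupling Psi(x, y)
    prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(rho, Psi))),
                      [cp.sum(Psi, axis=1) == P,       # X-marginal
                       cp.sum(Psi, axis=0) == Q])      # Y-marginal
    prob.solve()
    return prob.value

# Hamming cost recovers (164): rho_bar = 0.5 * ||P - Q||_1
P, Q = np.array([0.5, 0.3, 0.2]), np.ones(3) / 3
print(rho_bar(P, Q, 1.0 - np.eye(3)), 0.5 * np.abs(P - Q).sum())
```

The Hamming check in the last line reproduces the total-variation identity (164)-(165).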
VIII. CONCLUSION AND FUTURE WORK

In this paper we have established the fundamental limit of compression for similarity identification: the minimal compression rate that allows reliable answers to the query "is the compressed sequence similar to the query sequence?". While the achievability part was mostly derived in previous work (namely Ahlswede et al. [1]), for the converse part we combined the approach of [1] with the blowing-up lemma. We then investigated the achievable performance when using lossy compression as a building block, and provided a method for efficiently computing $R_{\mathrm{ID}}(D)$ for small alphabets.

There are several natural directions for future research given the results in this paper, some theoretical and some on the practical side. Future work related to theory includes the following:

- Symmetric compression schemes: how does $R_{\mathrm{ID}}(D)$ change when the query sequence is also compressed, possibly at a different rate than the source sequence? While the achievability part of this question is rather similar in spirit to the one presented here, the converse seems to be more complicated.
- Characterization of the "identification exponent": how fast does the probability of maybe (or, similarly, the false-positive probability) go to zero as the sequence length grows? Results for variable-length compression have been presented in [1]. However, they depend on an auxiliary random variable with unbounded cardinality, making the result uncomputable. Recently [4], the cardinality issue has been resolved, along with the exponent for the fixed-length compression case (which is different from that of the variable-length case).
- In addition to the error exponent, in lossy source coding (and also in channel coding) there exist additional results that characterize the tradeoff between rate, reliability and sequence length. Such results include refined asymptotics (i.e. "dispersion"-type results [39]) and also explicit bounds for finite sequence lengths (e.g. [40]). It would be interesting to discover similar results for the setting of the current paper.
- More complicated source and query models: how do the results change when the source and/or query sequences are no longer i.i.d.? For finite-order Markov-type sources, it is expected that an approach based on the method of types (namely its extension to sources with memory, e.g. [41, Sec. VII]) will lead in the right direction. For the case where the source and query sequences are statistically dependent, Ahlswede et al. [1] provide partial results, as the dependent case seems to be more difficult.

On the practical side, these are possible directions for future research, some of which are already being pursued:

- Practical schemes for compression for similarity queries: Shannon's classical rate-distortion theorem [42] is now over 50 years old, but practical schemes for approaching
the rate-distortion limit have appeared only in roughly the last two decades. It would be interesting to study how to harness the vast amount of work that has been done on practical source coding systems for the related (but different) task of compression for identification. This direction is already being pursued, with preliminary results reported in [29] and [35].
- Computation of $R_{\mathrm{ID}}(D)$: As discussed in Section VII, the computation of $R_{\mathrm{ID}}(D)$ is a challenge, mainly because it is given as a non-convex optimization problem. While for the Hamming case and small alphabets we have presented an efficient way to calculate $R_{\mathrm{ID}}(D)$, the general problem remains open. It would be interesting to study other approaches, perhaps in the spirit of the well-known Blahut-Arimoto iterative algorithm, for efficiently computing $R_{\mathrm{ID}}(D)$.

ACKNOWLEDGEMENT
The authors would like to thank Thomas Courtade for fruitful discussions and for proving an earlier version of Theorem 5, and Golan Yona for introducing us to the world of biological databases, which provided the initial motivation for this work.

APPENDIX A
EQUIVALENCE OF FIXED AND VARIABLE LENGTH IDENTIFICATION RATE
Proof of Prop. 1:
Our goal is to show that $R_{\mathrm{ID}}(D) \le R^{\mathrm{vl}}_{\mathrm{ID}}(D)$. Let $T^{(i)}_{\mathrm{vl}}, g^{(i)}_{\mathrm{vl}}$ be a sequence of variable-length schemes of rate $R$ that achieve a vanishing probability of maybe. We will construct a sequence of fixed-length schemes with rate arbitrarily close to $R$ that also attains a vanishing probability of maybe.

The fixed-length scheme is constructed as a concatenation of $M$ variable-length schemes, operating on a single sequence $\mathbf{x}$ of length $nM$ (each instance of the variable-length scheme operates on a separate block of the input sequence).

Define $L_m$ to be the normalized length of the binary codeword of the $m$-th block of $\mathbf{X}$, i.e. the output of the $m$-th instance of the variable-length scheme. Since the source sequence $\mathbf{X}$ is i.i.d., the random variables $L_m$ are i.i.d. as well.

Let $\varepsilon > 0$ and $\Delta_R > 0$ be arbitrarily small constants. By a standard Chebyshev-type argument, we have
$$\Pr\left\{\frac{1}{M}\sum_{m=1}^M L_m > R + \Delta_R\right\} \le \frac{\mathrm{Var}(L_1)}{M\,\Delta_R^2}. \tag{176}$$
Choose $M$ s.t. the RHS of the above inequality is equal to $\varepsilon/2$.

The new fixed-length scheme works as follows.

Encoding:

- Encode each sub-block (of length $n$) with the underlying variable-length scheme.
- Calculate $\frac{1}{M}\sum_m L_m$. If it is larger than $R + \Delta_R$, set the signature of the entire sequence to $e$, denoting 'erasure'.
- Otherwise, the signature is the concatenation of the variable-length codewords that correspond to each of the sub-blocks. Note that the number of different signatures is guaranteed to be at most $2^{nM(R+\Delta_R)} + 1$, i.e. the rate is arbitrarily close to $R$ (the added one is due to the erasure symbol).

Decision function: Given a signature and a query sequence $\mathbf{y}$, the decision function $g(\cdot,\cdot)$ is defined as follows.

- If the signature is equal to $e$, answer maybe.
- Otherwise, compute the answers of the $M$ sub-schemes, where the inputs to the $m$-th sub-scheme are the $m$-th signature (corresponding to the $m$-th block of $\mathbf{x}$) and the $m$-th block of $\mathbf{y}$.
- Finally, answer no if all the sub-schemes returned a no; otherwise, return a maybe.

Analysis:
The probability of maybe in the overall scheme can be bounded, by the union bound, as
$$\Pr\{\mathsf{maybe}\} \le \Pr\{\text{the signature is } e\} + M\times\Pr\{\text{a sub-scheme returned a maybe}\}. \tag{177}$$
Recall that $M$ was chosen to be equal to $c/\varepsilon$, where $c$ is a constant (independent of $n$), and that the probability of erasure is bounded by $\varepsilon/2$. Overall, we have
$$\Pr\{\mathsf{maybe}\} \le \varepsilon/2 + \frac{c}{\varepsilon}\times\Pr\{\text{a sub-scheme returned a maybe}\}. \tag{178}$$
Finally, note that the probability of maybe in the underlying scheme can be made arbitrarily small (by letting $n$ grow); specifically, it can be made smaller than $\varepsilon^2/(2c)$. With this choice, the overall probability of maybe is upper bounded by $\varepsilon$, which was chosen to be arbitrarily small. Since the rate of the fixed-length scheme is arbitrarily close to $R$, this completes the proof of the proposition.
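The construction above is simple enough to state in a few lines of code; the following sketch — ours — uses hypothetical stand-ins `vl_encode`/`vl_decide` for the underlying variable-length scheme:

```python
# Sketch (ours) of the Appendix A construction: a fixed-length scheme built
# from M instances of a variable-length scheme.
def fl_signature(x_blocks, vl_encode, R, delta_R):
    """Concatenated signature; erasure 'e' if the empirical rate exceeds R + delta_R."""
    sigs = [vl_encode(xb) for xb in x_blocks]     # one signature per sub-block
    n, M = len(x_blocks[0]), len(x_blocks)
    if sum(len(s) for s in sigs) / (M * n) > R + delta_R:
        return 'e'
    return sigs

def fl_decide(sig, y_blocks, vl_decide, D):
    """Answer 'no' only if every sub-scheme answers 'no' (no false negatives)."""
    if sig == 'e':
        return 'maybe'
    answers = (vl_decide(s, yb) for s, yb in zip(sig, y_blocks))
    return 'no' if all(a == 'no' for a in answers) else 'maybe'
```

Note how the erasure branch is exactly what caps the number of distinct signatures at $2^{nM(R+\Delta_R)} + 1$.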
APPENDIX B

Proof of Theorem 5: Let $\rho(\cdot,\cdot)$ denote the Hamming distortion. The proof relies on a bound on the $\bar\rho$ distance due to Marton [27], where it is called a $\bar{d}$-distance, in the context of the Hamming distance only. The result in [27, Prop. 1] says that for any two distributions $(P_A,P_B)$,
$$\bar\rho(P_A,P_B) \le \left[\frac{1}{2}D_e(P_A\|P_B)\right]^{1/2}, \tag{179}$$
where $D_e(\cdot\|\cdot)$ is the KL divergence, given in nats. (Note that since the $\bar\rho$ distance under Hamming distortion is equal to half the $\ell_1$ distance between the distributions, (179) is nothing but Pinsker's inequality.) With this result, consider the constraint on $P_{XU}$ in the expression for $R_{\mathrm{ID}}(D)$:
$$\sum_u P_U(u)\,\bar\rho(P_{X|U}(\cdot|u),P_Y) \le \sum_u P_U(u)\left[\frac{1}{2}D_e(P_{X|U}(\cdot|u)\|P_Y)\right]^{1/2} \le \left[\frac{1}{2}\sum_u P_U(u)\,D_e(P_{X|U}(\cdot|u)\|P_Y)\right]^{1/2}, \tag{180--181}$$
where the second inequality follows from Jensen's inequality and the concavity of $\sqrt{\cdot}$. By writing the explicit expression for the divergence, we obtain:
$$\sum_u P_U(u)\,D_e(P_{X|U}(\cdot|u)\|P_Y) = \frac{1}{\log e}\sum_u P_U(u)\sum_x P_{X|U}(x|u)\log\frac{P_{X|U}(x|u)}{P_Y(x)} = \frac{1}{\log e}\sum_u P_U(u)\sum_x P_{X|U}(x|u)\log\left(\frac{P_{X|U}(x|u)}{P_X(x)}\cdot\frac{P_X(x)}{P_Y(x)}\right) = \frac{I(X;U) + D(P_X\|P_Y)}{\log e}. \tag{182--185}$$
Therefore the constraint
$$\sqrt{\frac{I(X;U) + D(P_X\|P_Y)}{2\log e}} \ge D \tag{186}$$
is looser than that of the identification rate, and therefore we obtain
$$R_{\mathrm{ID}}(D) \ge \min_{I(X;U)+D(P_X\|P_Y)\,\ge\,2D^2\log e} I(X;U) \ge 2D^2\log e - D(P_X\|P_Y). \tag{187--188}$$
Since $R_{\mathrm{ID}}(D)$ is nonnegative, the proof is concluded.
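For numerical evaluation, the bound (188) is immediate to compute; a small sketch (ours, all quantities in bits):

```python
# Sketch (ours): the Theorem 5 lower bound (188),
# R_ID(D) >= 2 D^2 log2(e) - D(P_X || P_Y), clipped at zero.
import numpy as np

def R_ID_lower_bound(P_X, P_Y, D):
    kl = sum(px * np.log2(px / py) for px, py in zip(P_X, P_Y) if px > 0)
    return max(0.0, 2.0 * D**2 * np.log2(np.e) - kl)

print(R_ID_lower_bound([0.5, 0.5], [0.5, 0.5], 0.25))  # = 0.125*log2(e) ~ 0.18
```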
APPENDIX C

Proof of Lemma 4: We first give an upper bound on $\Pr\{\mathbf{X}\in A_i\}$ in terms of $|A_i|$. Since, conditioned on its type, an i.i.d. sequence is uniformly distributed within the type class,
$$\Pr\{\mathbf{X}\in A_i\} = \sum_{P\in\mathcal{P}_n(\mathcal{X}):\,\|P-P_X\|_\infty\le\gamma}\Pr\{\mathbf{X}\in A_i\cap T_P\} = \sum_P \frac{|A_i\cap T_P|}{|T_P|}\Pr\{\mathbf{X}\in T_P\} \le \sum_P |A_i\cap T_P|\,(n+1)^{|\mathcal{X}|}2^{-nH(P)} \tag{189--191}$$
$$= \sum_P |A_i\cap T_P|\,2^{-nH(P)+|\mathcal{X}|\log(n+1)} \le \sum_P |A_i\cap T_P|\,2^{-n[H(P_X)-\gamma|\mathcal{X}|\log(1/\gamma)]+|\mathcal{X}|\log(n+1)} \le \sum_P |A_i\cap T_P|\,2^{-n[H(P_X)-\gamma']} = |A_i|\cdot 2^{-n(H(P_X)-\gamma')}. \tag{192--198}$$
The first two inequalities in the above derivation follow from [30, Lemma 2.3] and [30, Lemma 2.7], respectively. The last inequality follows from the definition of $\gamma' = 2\gamma|\mathcal{X}|\log(1/\gamma)$, by setting $n_0$ to be the smallest $n$ s.t. $\frac{1}{n}\log(n+1) \le \gamma\log(1/\gamma)$.

Next, we have (for any $R'$):
$$\sum_{i:|A_i|\le 2^{nR'}}\Pr\{\mathbf{X}\in A_i\} \overset{(a)}{\le} \sum_{i:|A_i|\le 2^{nR'}} |A_i|\cdot 2^{-n(H(P_X)-\gamma')} \le \sum_{i:|A_i|\le 2^{nR'}} 2^{nR'}\cdot 2^{-n(H(P_X)-\gamma')} \overset{(b)}{\le} 2^{n(R'+R-H(P_X)+\gamma')}, \tag{199--201}$$
where $(a)$ follows from (198) and $(b)$ follows since the sum contains at most $2^{nR}$ elements. The proof of the lemma is concluded by choosing $R' = H(P_X) - R - 2\gamma'$.

APPENDIX D

Proof of Lemma 5:
We start with
$$H(\tilde{\mathbf{Y}}) = \sum_{i=1}^n H(\tilde{Y}_i\mid\tilde{Y}^{i-1}) \ge \sum_{i=1}^n H(\tilde{Y}_i\mid\tilde{Y}^{i-1},\tilde{X}^{i-1},\tilde{U}^{i-1}) \overset{(a)}{=} \sum_{i=1}^n H(\tilde{Y}_i\mid\tilde{X}^{i-1},\tilde{U}^{i-1}) \overset{(b)}{=} \sum_{i=1}^n H(\tilde{Y}_i\mid\tilde{X}^{i-1}) \tag{202--205}$$
$$= \sum_{i=1}^n\sum_{\mathbf{x}^{i-1}}\Pr\{\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\sum_{y\in\mathcal{Y}}\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{1}{\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}}, \tag{206--207}$$
where $(a)$ follows since $\tilde{Y}^{i-1} - (\tilde{X}^{i-1},\tilde{U}^{i-1}) - \tilde{Y}_i$ form a Markov chain, and $(b)$ follows since $\tilde{U}^{i-1}$ is a function of $\tilde{X}^{i-1}$.

Next, it is not hard to verify that
$$\frac{1}{n}\sum_{i=1}^n\Pr\{\tilde{X}_i = x\} = P_X(x), \tag{208}$$
where (with slight abuse of notation) $P_X$ here denotes the $\mathcal{X}$-marginal of $Q$, and
$$\frac{1}{n}\sum_{i=1}^n\Pr\{\tilde{Y}_i=y,\tilde{U}_i=u\} = \Pr\{\bar{Y}=y,U=u\} = \sum_{x\in\mathcal{X}} Q(x,u)V(y|x,u). \tag{209--210}$$
We remind the reader that $X,U,\bar{Y}$ are random variables distributed according to $(Q,V)$, i.e.
$$\Pr\{X=x,U=u\} = Q(x,u), \qquad \Pr\{\bar{Y}=y\mid X=x,U=u\} = V(y|x,u). \tag{211--212}$$
Next, write (with $\phi_i(\cdot)$ denoting the $i$-th output of the causal mapping $\phi$):
$$nH(\bar{Y}|U) = n\sum_{u\in\mathcal{U}^m,y\in\mathcal{Y}}\Pr\{\bar{Y}=y,U=u\}\log\frac{1}{P_{\bar{Y}|U}(y|u)} = \sum_{i=1}^n\sum_{u,y}\Pr\{\tilde{Y}_i=y,\tilde{U}_i=u\}\log\frac{1}{P_{\bar{Y}|U}(y|u)} = \sum_{i=1}^n E\left[\log\frac{1}{P_{\bar{Y}|U}(\tilde{Y}_i\mid\phi_i(\tilde{X}^{i-1}))}\right] \tag{213--216}$$
$$= \sum_{i=1}^n\sum_{\mathbf{x}^{i-1}}\Pr\{\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\sum_{y\in\mathcal{Y}}\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{1}{P_{\bar{Y}|U}(y\mid\phi_i(\mathbf{x}^{i-1}))}. \tag{217}$$
Combined with (207), we can write
$$nH(\bar{Y}|U) - H(\tilde{\mathbf{Y}}) \le \sum_{i=1}^n\sum_{\mathbf{x}^{i-1}}\Pr\{\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\sum_{y\in\mathcal{Y}}\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}}{P_{\bar{Y}|U}(y\mid\phi_i(\mathbf{x}^{i-1}))}. \tag{218}$$
Next, note that
$$P_{\bar{Y}|U}(y\mid\phi_i(\mathbf{x}^{i-1})) = \sum_{x\in\mathcal{X}} P_{X|U}(x\mid\phi_i(\mathbf{x}^{i-1}))\,V(y|x,\phi_i(\mathbf{x}^{i-1})), \tag{219}$$
$$\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\} = \sum_{x\in\mathcal{X}}\Pr\{\tilde{X}_i=x\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\,V(y|x,\phi_i(\mathbf{x}^{i-1})). \tag{220}$$
With (219) and (220), we can use the log-sum inequality and write:
$$\sum_{y\in\mathcal{Y}}\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{\Pr\{\tilde{Y}_i=y\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}}{P_{\bar{Y}|U}(y\mid\phi_i(\mathbf{x}^{i-1}))} \le \sum_{x\in\mathcal{X}}\Pr\{\tilde{X}_i=x\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{\Pr\{\tilde{X}_i=x\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}}{P_{X|U}(x\mid\phi_i(\mathbf{x}^{i-1}))}. \tag{221--222}$$
Combined with (218), we have
$$nH(\bar{Y}|U) - H(\tilde{\mathbf{Y}}) \le \sum_{i=1}^n\sum_{\mathbf{x}^{i-1}}\Pr\{\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\sum_{x\in\mathcal{X}}\Pr\{\tilde{X}_i=x\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}\log\frac{\Pr\{\tilde{X}_i=x\mid\tilde{X}^{i-1}=\mathbf{x}^{i-1}\}}{P_{X|U}(x\mid\phi_i(\mathbf{x}^{i-1}))} \tag{223}$$
$$= nH(X|U) - H(\tilde{\mathbf{X}}) = nH(X|U) - \log|\tilde{A}| \le n\,\frac{\log m}{m}, \tag{224--226}$$
where (224) follows from derivations similar to (213)-(217), the next equality holds since $\tilde{\mathbf{X}}$ is uniform on $\tilde{A}$, and the last inequality follows from (72). This concludes the proof of the lemma.

APPENDIX E

Proof of Theorem 9:
Our goal is to show that
$$\min_{f\in\mathcal{F}}\ \min_{L_f(P_{U|X})\ge D} I(X;U) = \min_{f\in\mathcal{F}'}\ \min_{L_f(P_{U|X})\ge D} I(X;U). \tag{227}$$
Let $P^*_{U|X}$ be a minimizer of the LHS of (227). Our goal is to construct some $f^*\in\mathcal{F}'$ s.t. $P^*_{U|X}$ is a minimizer of $\min_{L_{f^*}(P_{U|X})\ge D} I(X;U)$. To this end, define $f^*(x,u)$ as follows:
$$f^*(x,u) = \begin{cases}0, & \text{if } P^*_{U|X}(u|x)P_X(x) > P^*_U(u)P_Y(x);\\ 1, & \text{if } P^*_{U|X}(u|x)P_X(x) < P^*_U(u)P_Y(x),\end{cases} \tag{228}$$
where $P^*_U$ is the marginal of $U$ that results from $P_X, P^*_{U|X}$. Whenever $P^*_{U|X}(u|x)P_X(x) = P^*_U(u)P_Y(x)$, break the tie arbitrarily, in a way that keeps $f^*\in\mathcal{F}'$. With this definition, it is obvious that
$$L_{f^*}(P^*_{U|X}) = \frac{1}{2}\sum_{u\in\mathcal{U}}\sum_{x\in\mathcal{X}}\left|P^*_{U|X}(u|x)P_X(x) - P^*_U(u)P_Y(x)\right|, \tag{229}$$
and also that $L_{f^*}(P^*_{U|X}) \ge L_f(P^*_{U|X})$ for all $f\in\mathcal{F}$. The conclusion is that if $P^*_{U|X}$ is feasible for some $f\in\mathcal{F}$ (i.e. satisfies $L_f(P^*_{U|X})\ge D$), then it is surely also feasible for $f^*$, and hence we have equality in (227).

For the second part, it is convenient to view a function $f\in\mathcal{F}$ as a $|\mathcal{U}|\times|\mathcal{X}|$ binary matrix, with $(f(x_1,u), f(x_2,u),\ldots)$ in the $u$-th row. The first claim of the theorem says that no row of the matrix may be constant; in other words, there are $2^{|\mathcal{X}|}-2$ admissible values for every row of the matrix. The second claim is that no two rows are equal to each other. This immediately shows that the number of such matrices, which is equal to $|\mathcal{F}'|$, is simply the number of combinations of $|\mathcal{U}|$ different rows out of the $2^{|\mathcal{X}|}-2$ admissible values. The order of the rows does not matter, since reordering is equivalent to relabeling the values $u\in\mathcal{U}$, and hence we arrive at (174).

To show that repeated rows in the matrix are indeed unnecessary, note the following. Suppose that $P^*_{U|X}$ is a minimizer for the LHS of (227) at some $f$ with two equal rows in the corresponding matrix, and denote by $P^*_{X|U}$ the reverse channel. Let $u_1, u_2$ correspond to the two identical rows. Then, construct the following conditional distribution $P^{**}_{U|X}$, by merging the outputs $u_1$ and $u_2$ into a new symbol $u'$ (a process similar to that in the proof of Theorem 8). The new distribution results in mutual information $I(P_X, P^{**}_{U|X})$ that is not larger than the one obtained by $P^*_{U|X}$, because of the data processing inequality. Next, rename $u'$ as $u_1$, and add a fictitious new symbol $u_2$ with probability zero. Then, define a new function $f'(\cdot,\cdot)$ to be equal to $f(\cdot,\cdot)$ for all $u\ne u_2$, and for $u_2$ choose $f'(\cdot,u_2)$ to be a row that does not already occur in the matrix; such a row is guaranteed to exist, since there are $2^{|\mathcal{X}|}-2 > |\mathcal{U}|-1$ admissible values. Finally, note that by construction,
$$L_f(P^*_{U|X}) = L_{f'}(P^{**}_{U|X}). \tag{230}$$
In other words, if $P^*_{U|X}$ is feasible for some $f\in\mathcal{F}$, then there exists a distribution $P^{**}_{U|X}$, with no larger mutual information, that is feasible for some $f'\in\mathcal{F}'$.

The conclusion is, then, that it suffices to consider only functions $f\in\mathcal{F}'$.
REFERENCES

[1] R. Ahlswede, E.-h. Yang, and Z. Zhang, "Identification via compressed data," IEEE Trans. on Information Theory, vol. 43, no. 1, pp. 48-70, Jan. 1997.
[2] A. Ingber, T. Courtade, and T. Weissman, "Quadratic similarity queries on compressed data," in Data Compression Conference (DCC), 2013, pp. 441-450.
[3] A. Ingber, T. A. Courtade, and T. Weissman, "Compression for quadratic similarity queries," submitted to IEEE Trans. on Information Theory, 2013. [Online]. Available: http://arxiv.org/abs/1307.6609
[4] A. Ingber and T. Weissman, "The error exponent in compression for similarity identification," in preparation.
[5] A. Ingber, T. Courtade, and T. Weissman, "Compression for exact match identification," in IEEE International Symposium on Information Theory Proceedings (ISIT), 2013, pp. 654-658.
[6] E. Tuncel, P. Koulgi, and K. Rose, "Rate-distortion approach to databases: Storage and content-based retrieval," IEEE Trans. on Information Theory, vol. 50, no. 6, pp. 953-967, 2004.
[7] A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1-10, Jan. 1976.
[8] J. O'Sullivan and N. A. Schmid, "Large deviations performance analysis for biometrics recognition," in Allerton Conference on Communication, Control, and Computing, Oct. 2002.
[9] F. Willems, T. Kalker, S. Baggen, and J.-P. Linnartz, "On the capacity of a biometrical identification system," in Proc. 2003 IEEE Int. Symp. on Information Theory, 2003, p. 82.
[10] M. Westover and J. O'Sullivan, "Achievable rates for pattern recognition," IEEE Trans. on Information Theory, vol. 54, no. 1, pp. 299-320, Jan. 2008.
[11] E. Tuncel, "Capacity/storage tradeoff in high-dimensional identification systems," IEEE Trans. on Information Theory, vol. 55, no. 5, pp. 2097-2106, May 2009.
[12] E. Tuncel and D. Gündüz, "Identification and lossy reconstruction in noisy databases," IEEE Trans. on Information Theory, to appear, 2013.
[13] A. G. Konheim, Hashing in Computer Science: Fifty Years of Slicing and Dicing. John Wiley & Sons, 2010.
[14] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, Jul. 1970.
[15] E. Porat, "An optimal Bloom filter replacement based on matrix solving," in CSR, ser. Lecture Notes in Computer Science, A. E. Frid, A. Morozov, A. Rybalchenko, and K. W. Wagner, Eds., vol. 5675. Springer, 2009, pp. 263-273.
[16] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, pp. 117-122, 2008.
[17] W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Conf. in Modern Analysis and Probability, vol. 26, pp. 189-206, 1984.
[18] P. Indyk, Sketching, Streaming and Sublinear-Space Algorithms. Lecture Notes, Mass. Inst. of Tech., 2007, available at http://stellar.mit.edu/S/course/6/fa07/6.895/.
[19] R. Weber, H.-J. Schek, and S. Blott, "A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces," in VLDB, vol. 98, 1998, pp. 194-205.
[20] S. Ramaswamy and K. Rose, "Adaptive cluster-distance bounding for nearest neighbor search in image databases," in IEEE International Conference on Image Processing (ICIP), vol. 6, 2007, pp. VI-381.
[21] R. Salakhutdinov and G. Hinton, "Semantic hashing," RBM, vol. 500, no. 3, p. 500, 2007.
[22] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Advances in Neural Information Processing Systems, 2008, pp. 1753-1760.
[23] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems, 2009, pp. 1509-1517.
[24] R. G. Gallager, Information Theory and Reliable Communication. New York, NY, USA: John Wiley & Sons, Inc., 1968.
[25] R. M. Gray, D. L. Neuhoff, and P. C. Shields, "A generalization of Ornstein's d-bar distance with applications to information theory," The Annals of Probability, pp. 315-328, 1975.
[26] C. Villani, Optimal Transport: Old and New. Springer, 2009, vol. 338.
[27] K. Marton, "Bounding d-bar-distance by informational divergence: a method to prove measure concentration," The Annals of Probability, vol. 24, no. 2, pp. 857-866, 1996.
[28] M. Raginsky and I. Sason, "Concentration of measure inequalities in information theory, communications and coding," CoRR, vol. abs/1212.4663, 2012.
[29] I. Ochoa, A. Ingber, and T. Weissman, "Efficient similarity queries via lossy compression," in Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sep. 2013.
[30] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
[31] S. Jana, "Alphabet sizes of auxiliary random variables in canonical inner bounds," 2009, pp. 67-71.
[32] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.0 beta," http://cvxr.com/cvx, Sep. 2013.
[33] D. Wang, A. Ingber, and Y. Kochman, "A strong converse for joint source-channel coding," in Proc. IEEE International Symposium on Information Theory, 2012, pp. 2117-2121.
[34] A. Dembo and T. Weissman, "The minimax distortion redundancy in noisy source coding," IEEE Trans. on Information Theory, vol. 49, no. 11, pp. 3020-3030, 2003.
[35] I. Ochoa, A. Ingber, and T. Weissman, "Compression schemes for similarity queries," submitted to the 2014 Data Compression Conference.
[36] M. Salehi, "Cardinality bounds on auxiliary variables in multiple-user theory via the method of Ahlswede and Körner," Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep. 33, 1978.
[37] H. G. Eggleston, J. A. Todd, and F. Smithies, Convexity. Cambridge University Press, 1966.
[38] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[39] A. Ingber and Y. Kochman, "The dispersion of lossy source coding," in Proc. of the Data Compression Conference, Snowbird, Utah, Mar. 2011.
[40] V. Kostina and S. Verdú, "Fixed-length lossy compression in the finite blocklength regime," IEEE Trans. on Information Theory, vol. 58, no. 6, pp. 3309-3338, 2012.
[41] I. Csiszár, "The method of types," IEEE Trans. on Information Theory, vol. 44, no. 6, pp. 2505-2523, 1998.
[42] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Nat. Conv. Rec., part 4, pp. 142-163, 1959.