The Capacity of Private Information Retrieval with Partially Known Private Side Information
Yi-Peng Wei, Karim Banawan, Sennur Ulukus
Department of Electrical and Computer Engineering
University of Maryland, College Park, MD 20742
November 28, 2017
Abstract
We consider the problem of private information retrieval (PIR) of a single message out of $K$ messages from $N$ replicated and non-colluding databases where a cache-enabled user (retriever) of cache-size $M$ possesses side information in the form of full messages that are partially known to the databases. In this model, the user and the databases engage in a two-phase scheme, namely, the prefetching phase where the user acquires side information and the retrieval phase where the user downloads desired information. In the prefetching phase, the user receives $m_n$ full messages from the $n$th database, under the cache memory size constraint $\sum_{n=1}^{N} m_n \le M$. In the retrieval phase, the user wishes to retrieve a message such that no individual database learns anything about the identity of the desired message. In addition, the identities of the side information messages that the user did not prefetch from a database must remain private against that database. Since the side information provided by each database in the prefetching phase is known by the providing database and the side information must be kept private against the remaining databases, we coin this model as partially known private side information. We characterize the capacity of the PIR with partially known private side information to be $C = \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-M-1}}\right)^{-1} = \frac{1 - \frac{1}{N}}{1 - \left(\frac{1}{N}\right)^{K-M}}$. Interestingly, this result is the same if none of the databases knows any of the prefetched side information, i.e., when the side information is obtained externally, a problem posed by Kadhe et al. and settled by Chen-Wang-Jafar recently. Thus, our result implies that there is no loss in using the same databases for both prefetching and retrieval phases.

∗ This work was supported by NSF Grants CNS 13-14733, CCF 14-22111, CNS 15-26608 and CCF 17-13977.

Introduction
The private information retrieval (PIR) problem is a canonical problem to study privacy issues that arise when information is downloaded (retrieved) from public databases. Since its first formulation by Chor et al. in [1], the PIR problem has become a central research topic in the computer science literature, see e.g., [2–5]. In the classical setting of PIR in [1], a user wishes to retrieve a single message (or a file) out of $K$ messages replicated across $N$ non-communicating databases without leaking any information about the identity of the retrieved message. To that end, the user submits a query to each database. Each database responds truthfully with an answer string. The user reconstructs the desired message from the collected answer strings. Trivially, the user can download the entire database and incur a linear (in number of messages) download cost, but this retrieval strategy is highly inefficient. The efficiency of a PIR scheme is measured by the normalized download cost, which is the cost of privately downloading one bit of the desired message. The goal of the PIR problem is to devise the most efficient retrieval strategy under the privacy and decodability constraints.

The PIR problem has received attention in recent years in the information and coding theory literatures, see e.g., [6–11]. In the leading work of Sun-Jafar [12], the classical PIR problem is re-formulated to conform with the conventional information-theoretic arguments, and the notion of PIR capacity is introduced, which is defined as the supremum of retrieval rates over all achievable retrieval schemes. Reference [12] characterizes the capacity of the classical PIR model to be $C = \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-1}}\right)^{-1}$ using a greedy achievable scheme that is closely related to blind interference alignment [13] and an induction-based converse argument.
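As a small numerical illustration of these expressions, the sketch below evaluates the classical capacity of [12] and, through an optional cache size `M`, the expression stated in the abstract. It is only a sketch: the function names are ours, exact rationals are used to avoid rounding, and `M = 0` recovers the classical setting.

```python
from fractions import Fraction


def normalized_download_cost(N: int, K: int, M: int = 0) -> Fraction:
    """Return 1 + 1/N + ... + 1/N^(K-M-1) as an exact fraction."""
    if N < 1 or not 0 <= M < K:
        raise ValueError("need N >= 1 and 0 <= M < K")
    return sum((Fraction(1, N**i) for i in range(K - M)), Fraction(0))


def pir_capacity(N: int, K: int, M: int = 0) -> Fraction:
    """Capacity is the reciprocal of the normalized download cost."""
    return 1 / normalized_download_cost(N, K, M)


if __name__ == "__main__":
    print(pir_capacity(2, 2))       # classical PIR with N = 2, K = 2
    print(pir_capacity(2, 4, M=2))  # same value: K - M = 2 effective messages
```

Using `Fraction` keeps the geometric sum exact, which makes it easy to compare the result against the closed form $\frac{1 - 1/N}{1 - (1/N)^{K-M}}$.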
Following the work of Sun-Jafar [12], the capacity of many interesting variants of the classical PIR model has been investigated, such as PIR from colluding databases, robust PIR, symmetric PIR, PIR from MDS-coded databases, PIR for arbitrary message lengths, multi-round PIR, multi-message PIR, PIR from Byzantine databases, secure symmetric PIR with adversaries, and their several combinations [14–28].

In this paper, we consider the problem of PIR with partially known private side information. Our work is most closely related to [29–32]. These works investigate the PIR problem when the user (retriever) possesses some form of side information about the contents of the databases. However, the models of [29–32] differ in three important aspects, namely, 1) the structure of the side information, 2) the presence or absence of privacy constraints on the side information, and 3) the databases' awareness of the side information at its initial acquisition. Here, structure of the side information refers to whether the side information is in the form of full messages or parts of messages or whether messages are mixed through functions (coded/uncoded); privacy of the side information refers to whether the user further aims to keep the side information private from the databases; and databases' awareness of the side information refers to whether the databases knew the initially prefetched side information.

Footnote: A parallel line of work that studies privacy issues of requests and side information in index coding based broadcast systems can be found in [33, 34].

In [29], the user caches $rLK$ bits in the form of any arbitrary function of the $K$ messages, where $L$ is the message size and $0 \le r \le 1$, and the databases are fully aware of the cached content. The optimal normalized download cost in this setting is $D^*(r) = \frac{1}{C(r)} = (1 - r)\left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-1}}\right)$, obtained using a memory-sharing achievable scheme and a converse that utilizes Han's inequality.
This conclusion is somewhat pessimistic in that the user cannot exploit the cached content as useful side information during PIR to reduce the download cost, since the databases are fully aware of it; the optimum $D^*(r)$ formula indicates that the user should download the uncached part of the content, i.e., $(1 - r)$, via the optimum PIR scheme in [12]. This result motivates [30, 31] to study the other extreme when the databases are completely unaware of the side information at its initial acquisition. References [30] and [31] differ in terms of the structure of the cached content: [30] considers the case where $rK$ full messages are cached, and [31] considers the case where a random $r$ fraction of the symbols of each of $K$ messages is cached. In this case, [31] shows a significant reduction in the download cost over [29], as the user can now leverage the cached bits as side information, since they are unknown to the databases. In [31], there is no privacy constraint on the cached content.

Reference [30] further introduces another model where the cached content (in the form of full messages), which is unknown to the databases at the time of initial prefetching, must remain unknown throughout the PIR, i.e., the queries of the user should not leak any information about the cached content to the databases. The exact capacity for this problem is settled in [32] to be $C = \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-M-1}}\right)^{-1}$. The optimal achievable scheme in this case starts from the traditional achievable scheme without side information in [12] and reduces the download cost by utilizing the reconstruction property of MDS codes.

In this paper, we take a deeper look at the issue of awareness or otherwise unawareness of the databases about the cached content at its initial acquisition. We first note that it is practically challenging to make the side information completely unknown to the databases at its initial acquisition as assumed in [30–32].
One way to do this could be to employ one of the databases for prefetching the side information and exclude it from the retrieval process. Therefore, for the remaining $N -$ … $C$ in the previous paragraph) is monotonically decreasing in $N$. An alternative solution could be to devise a refreshing mechanism that ensures that the cached content is essentially random from the perspective of each database [29], which may be challenging to implement. We also note that the other extreme of the problem, where the databases are fully aware of the cached content [29], is discouraging as the user cannot benefit from the cached side information. Therefore, a natural model is to use the databases for both prefetching and retrieval phases, such that the databases gain partial knowledge about the side information available to the user, which makes it possible for the user to exploit the remaining side information that is unknown to each individual database to reduce the download cost during the retrieval process. This poses the following questions: Can we propose efficient joint prefetching-retrieval strategies that exploit the limited knowledge of each database to drive down the download cost? How much is the loss from the fully unknown case in [30, 32]?

In this paper, we investigate the PIR problem when the user and the databases engage in a two-phase scheme, namely, prefetching phase and retrieval phase. In the prefetching phase, the user caches $m_n$ full messages out of the $K$ messages from the $n$th database under a total cache memory size constraint $\sum_{n=1}^{N} m_n \le M$. Hence, each database has a partial knowledge about the side information possessed by the user, namely, the part of the side information that this database has provided during the prefetching phase.
In the retrieval phase, the user wants to retrieve a message (which is not present in its memory) without leaking any information to any individual database about the desired message or the remaining side information messages that are unknown to each database. The goal of this work is to design a joint prefetching-retrieval scheme that minimizes the download cost in the retrieval phase.

To that end, we first derive a general lower bound for the normalized download cost that is independent of the prefetching strategy. Then, we prove that this bound is attainable using two achievable schemes. The first achievable scheme, which is proposed in [32] for completely unknown side information, is a valid achievable scheme for our problem with partially known side information for any prefetching strategy. We provide a second achievable scheme for the case of uniform prefetching, i.e., $m_n = \frac{M}{N} \in \mathbb{N}$, which requires smaller sub-packetization and smaller field size for realizing MDS codes. While the first achievable scheme [32] requires a message size of $L = N^K$, the second achievable scheme proposed here requires a message size of $L = N^{K - \frac{M}{N}}$, which scales down the message size by an exponential factor $N^{\frac{M}{N}}$, which in turn simplifies the achievable scheme and minimizes the total number of downloaded bits without sacrificing capacity. We prove that the exact capacity of this problem is $C = \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-M-1}}\right)^{-1}$. Surprisingly, this is the same capacity expression for the PIR problem when the databases are completely unaware of the side information possessed by the user as found in [32] recently. Therefore, our result implies that there is no loss in the capacity if the same databases are employed in both prefetching and retrieval phases.

Footnote: We thank Dr. Hua Sun for pointing this out.

We consider a classic PIR problem with $K$ independent messages $W_1, \ldots, W_K$, where each message consists of $L$ symbols,

$$H(W_1) = \cdots = H(W_K) = L, \qquad H(W_1, \ldots, W_K) = H(W_1) + \cdots + H(W_K). \tag{1}$$

There are $N$ non-communicating databases, and each database stores all the $K$ messages. The user (retriever) has a local cache memory which can store up to $M$ messages.

There are two phases: a prefetching phase and a retrieval phase. In the prefetching phase, $\forall n \in [N]$, where $[N] = \{1, 2, \ldots, N\}$, the user caches $m_n$ out of total $K$ messages from the $n$th database. We denote the indices of the cached messages from the $n$th database as $\mathbb{H}_n$. Therefore, $|\mathbb{H}_n| = m_n$. We denote the indices of all cached messages as $\mathbb{H}$,

$$\mathbb{H} = \bigcup_{n=1}^{N} \mathbb{H}_n, \tag{2}$$

where $\mathbb{H}_{n_1} \cap \mathbb{H}_{n_2} = \emptyset$, if $n_1 \neq n_2$. Due to the cache memory size constraint, we require

$$|\mathbb{H}| = \sum_{n=1}^{N} m_n \le M. \tag{3}$$

Since the user caches $m_n$ messages from the $n$th database, $\mathbb{H}_n$ is known to the $n$th database. Since the databases do not communicate with each other, $\mathbb{H}_n$ is unknown to the other databases. We use $\mathbf{m} = (m_1, \ldots, m_N)$ to represent the prefetching phase. After the prefetching phase, the user learns $|\mathbb{H}|$ messages, denoted as $W_{\mathbb{H}} = \{W_{i_1}, \ldots, W_{i_{|\mathbb{H}|}}\}$. We refer to $W_{\mathbb{H}}$ as partially known private side information.

In the retrieval phase, the user privately generates a desired message index $\theta \in [K] \setminus \mathbb{H}$, and wishes to retrieve message $W_\theta$ such that no database knows which message is retrieved. Since the desired message index $\theta$ and cached message indices $\mathbb{H}$ are independent of the message contents, for random variables $\theta$, $\mathbb{H}$, and $W_1, \ldots, W_K$, we have

$$H(\theta, \mathbb{H}, W_1, \ldots, W_K) = H(\theta, \mathbb{H}) + H(W_1) + \cdots + H(W_K). \tag{4}$$

In order to retrieve $W_\theta$, the user sends $N$ queries $Q_1^{[\theta, \mathbb{H}]}, \ldots, Q_N^{[\theta, \mathbb{H}]}$ to the $N$ databases, where $Q_n^{[\theta, \mathbb{H}]}$ is the query sent to the $n$th database for message $W_\theta$ given the user has partially known private side information $W_{\mathbb{H}}$. The queries are generated according to $\mathbb{H}$, which is independent of the realizations of the $K$ messages. Therefore, we have

$$I\left(W_1, \ldots, W_K; Q_1^{[\theta, \mathbb{H}]}, \ldots, Q_N^{[\theta, \mathbb{H}]}\right) = 0. \tag{5}$$

To ensure that individual databases do not know which message is retrieved and also do not know the cached messages from other databases, i.e., to guarantee the privacy of $(\theta, \mathbb{H} \setminus \mathbb{H}_n)$, we need to satisfy the following privacy constraint, $\forall n \in [N]$, $\forall \mathbb{H}, \mathbb{H}'$ such that $|\mathbb{H}| = |\mathbb{H}'| \le M$, $\mathbb{H}_n \subset \mathbb{H}$, $\mathbb{H}_n \subset \mathbb{H}'$, and $\forall \theta \in [K] \setminus \mathbb{H}$, $\forall \theta' \in [K] \setminus \mathbb{H}'$,

$$\left(Q_n^{[\theta, \mathbb{H}]}, A_n^{[\theta, \mathbb{H}]}, W_1, \ldots, W_K, \mathbb{H}_n\right) \sim \left(Q_n^{[\theta', \mathbb{H}']}, A_n^{[\theta', \mathbb{H}']}, W_1, \ldots, W_K, \mathbb{H}_n\right), \tag{6}$$

where $A \sim B$ means that $A$ and $B$ are identically distributed.

Upon receiving the query $Q_n^{[\theta, \mathbb{H}]}$, the $n$th database replies with an answering string $A_n^{[\theta, \mathbb{H}]}$, which is a function of $Q_n^{[\theta, \mathbb{H}]}$ and all the $K$ messages. Therefore, $\forall \theta \in [K] \setminus \mathbb{H}$, $\forall n \in [N]$,

$$H\left(A_n^{[\theta, \mathbb{H}]} \,\middle|\, Q_n^{[\theta, \mathbb{H}]}, W_1, \ldots, W_K\right) = 0. \tag{7}$$

After receiving the answering strings $A_1^{[\theta, \mathbb{H}]}, \ldots, A_N^{[\theta, \mathbb{H}]}$ from all the $N$ databases, the user needs to decode the desired message $W_\theta$ reliably. By using Fano's inequality, we have the following reliability constraint

$$H\left(W_\theta \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_1^{[\theta, \mathbb{H}]}, \ldots, Q_N^{[\theta, \mathbb{H}]}, A_1^{[\theta, \mathbb{H}]}, \ldots, A_N^{[\theta, \mathbb{H}]}\right) = o(L), \tag{8}$$

where $o(L)$ denotes a function such that $\frac{o(L)}{L} \to 0$ as $L \to \infty$.

For fixed $N$, $K$, and prefetching scheme $\mathbf{m} = (m_1, \ldots, m_N)$, a pair $(D(\mathbf{m}), L(\mathbf{m}))$ is achievable if there exists a PIR scheme for messages of size $L(\mathbf{m})$ symbols long with partially known private side information satisfying the privacy constraint (6) and the reliability constraint (8), where $D(\mathbf{m})$ represents the expected number of downloaded symbols (over all the queries) from the $N$ databases via the answering strings $A_{1:N}^{[\theta, \mathbb{H}]}$, where $A_{1:N}^{[\theta, \mathbb{H}]} = (A_1^{[\theta, \mathbb{H}]}, \ldots, A_N^{[\theta, \mathbb{H}]})$, i.e.,

$$D(\mathbf{m}) = \sum_{n=1}^{N} H\left(A_n^{[\theta, \mathbb{H}]}\right). \tag{9}$$

In this work, for fixed $N$, $K$, and $M$, we aim to characterize the optimal normalized download cost $D^*$, where

$$D^* = \inf_{\mathbf{m} : (3)} \left\{ \frac{D(\mathbf{m})}{L(\mathbf{m})} : (D(\mathbf{m}), L(\mathbf{m})) \text{ is achievable} \right\}. \tag{10}$$

We characterize the exact normalized download cost for the PIR problem with partially known private side information as shown in the following theorem.
Theorem 1
In the PIR problem with partially known private side information under the cache memory size constraint $|\mathbb{H}| \le M$, the optimal normalized download cost is

$$D^* = 1 + \frac{1}{N} + \cdots + \frac{1}{N^{K-M-1}} \tag{11}$$

$$= \frac{1 - \left(\frac{1}{N}\right)^{K-M}}{1 - \frac{1}{N}}. \tag{12}$$

The converse proof for Theorem 1 is given in Section 4, and the achievability proof for Theorem 1 is given in Section 5. Theorem 1 does not assume any particular property for the prefetching strategy, i.e., $\mathbf{m}$ is arbitrary except for satisfying the memory size constraint. We have a few remarks.

Remark 1
Theorem 1 implies that $C = \frac{1}{D^*} = \frac{1 - \frac{1}{N}}{1 - \left(\frac{1}{N}\right)^{K-M}}$. Surprisingly, this capacity expression is exactly the same as the capacity for the PIR problem with completely unknown private side information in [32]. This implies that there is no loss in capacity due to employing the same databases for both prefetching and retrieval phases. The reason for this phenomenon is that although each database has a partial knowledge about some of the cached messages at the user, the privacy constraint on this known side information is relaxed.

Remark 2
The normalized download cost in Theorem 1 is the same as the normalized download cost for the classical PIR problem [12] if the number of messages is $K - M$. That is, a cache of size $M$ messages effectively reduces the total number of messages by $M$. Noting that the download cost in [12] monotonically increases in the number of messages, the effective reduction in the number of messages by the cache size results in a significant reduction in the download cost due to the presence of side information at the user even though it is partially known by the databases and it needs to be kept private against other databases.

Remark 3
The optimal prefetching strategy exploits the entire cache memory of the user as the capacity expression is monotonically increasing in $M$.

Remark 4
In Section 5, we present the capacity achieving schemes for the partially known private side information. We note that, in general, the PIR scheme in [32] is a valid achievable scheme for our problem as well. Nevertheless, in the special case of uniform prefetching, i.e., $m_n = \frac{M}{N} = m \in \mathbb{N}$, we provide a different achievable scheme that exploits the prefetching uniformity to work with message size $L = N^{K-m} = N^{K - \frac{M}{N}}$ in contrast to $L = N^K$ needed for the scheme in [32], i.e., the message size is decreased by an exponential factor $N^{\frac{M}{N}}$. Furthermore, although both schemes need an MDS code to reduce the number of downloaded equations, the field size needed to realize this MDS code is significantly smaller with our scheme (if $\frac{M}{N} \in \mathbb{N}$) compared with the field size needed in the scheme in [32]. This implies that although uniform prefetching does not affect the PIR capacity, it significantly simplifies the achievable scheme.

In this section, we derive a general lower bound for the normalized download cost $D^*$ given in (10). We extend the techniques presented in [12, 32] to the PIR problem with partially known private side information.

For the prefetching vector $\mathbf{m} = (m_1, \ldots, m_N)$ satisfying (3), we note that satisfying the memory size constraint with equality leads to a valid lower bound on (10). Consequently, we first consider the case $\sum_{n=1}^{N} m_n = \tilde{M} \le M$, i.e., we study the case when the user learns $\tilde{M}$ … $\mathbf{m}$ in advance, the following lower bound is valid for all $\mathbf{m}$ such that $\sum_{n=1}^{N} m_n = \tilde{M}$. Without loss of generality, we relabel the $\tilde{M}$ cached messages as $W_1, W_2, \ldots, W_{\tilde{M}}$, i.e., $\mathbb{H} = \{1, 2, \ldots, \tilde{M}\}$ and $W_{\mathbb{H}} = W_{1:\tilde{M}}$. We first need the following lemma, which characterizes a lower bound on the length of the undesired portion of the answering strings as a consequence of the privacy constraint on the retrieved message.

Lemma 1 (Interference lower bound)
For the PIR with partially known private side information, the interference from undesired messages within the answering strings, $D - L$, is lower bounded by,

$$D - L + o(L) \ge I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1}\right). \tag{13}$$

If the privacy constraint is absent, the user downloads only $L$ symbols for the desired message; however, when the privacy constraint is present, it should download $D$ symbols. The difference between $D$ and $L$, i.e., $D - L$, corresponds to the undesired portion of the answering strings. Note that Lemma 1 is an extension of [12, Lemma 5] if $\tilde{M} = 0$, i.e., the user has no partially known private side information. Lemma 1 differs from its counterpart in [31, Lemma 1] in two aspects, namely, the left hand side is $D(r) - L(1 - r)$ in [31] as the user requests to download the uncached bits only, and the bound in [31, Lemma 1] constructs $K -$ … $k$ in contrast to one bound here as it always starts from $W_{\tilde{M}+2}$. Finally, we note that a similar argument to Lemma 1 can be implied from [32].

Proof:
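We start with the right hand side of (13),

$$I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1}\right) = I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}\right) - I\left(W_{\tilde{M}+2:K}; W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}\right). \tag{14}$$

For the first term on the right hand side of (14), we have

\begin{align}
& I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}\right) \notag \\
&= I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}\right) + I\left(W_{\tilde{M}+2:K}; W_{\tilde{M}+1} \,\middle|\, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\mathbb{H}}\right) \tag{15} \\
&\overset{(8)}{=} I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}\right) + o(L) \tag{16} \\
&\overset{(4),(5)}{=} I\left(W_{\tilde{M}+2:K}; A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) + o(L) \tag{17} \\
&= H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+2:K}\right) + o(L) \tag{18} \\
&\overset{(8)}{=} H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+2:K}\right) + o(L) \tag{19} \\
&\le H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - H\left(W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\tilde{M}+2:K}\right) + o(L) \tag{20} \\
&\overset{(4),(5)}{=} H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - H\left(W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+2:K}\right) + o(L) \tag{21} \\
&= H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - L + o(L) \tag{22} \\
&\le H\left(A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right) - L + o(L) \tag{23} \\
&\le D - L + o(L), \tag{24}
\end{align}

where (16), (19) follow from the decodability of $W_{\tilde{M}+1}$ given $\left(\mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, W_{\mathbb{H}}\right)$, (17) follows from the independence of $W_{\tilde{M}+2:K}$ and $\left(\mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right)$, (21) follows from the independence of $W_{\tilde{M}+1}$ and $\left(\mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}\right)$, and (24) follows from the independence bound.

For the second term on the right hand side of (14), we have

\begin{align}
I\left(W_{\tilde{M}+2:K}; W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}\right) &= H\left(W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}\right) - H\left(W_{\tilde{M}+1} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+2:K}\right) \tag{25} \\
&= L - L = 0. \tag{26}
\end{align}

Combining (14), (24), and (26) yields (13). ∎

In the following lemma, we prove an inductive relation for the mutual information term on the right hand side of (13).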
Lemma 2 (Induction lemma)
For all $k \in \{\tilde{M}+2, \ldots, K\}$, the mutual information term in Lemma 1 can be inductively lower bounded as,

$$I\left(W_{k:K}; \mathbb{H}, Q_{1:N}^{[k-1, \mathbb{H}]}, A_{1:N}^{[k-1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) \ge \frac{1}{N} I\left(W_{k+1:K}; \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k}\right) + \frac{L - o(L)}{N}. \tag{27}$$

Lemma 2 is a generalization of [12, Lemma 6] to our setting. The main difference between Lemma 2 and [32] is that in order to apply the partial privacy constraint, the random variable $\mathbb{H}$ should be used in its local form $\mathbb{H}_n$ as it corresponds to the partial knowledge of the $n$th database.

Proof:
We start with the left hand side of (27),

\begin{align}
& I\left(W_{k:K}; \mathbb{H}, Q_{1:N}^{[k-1, \mathbb{H}]}, A_{1:N}^{[k-1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) \notag \\
&= \frac{1}{N} \times N \times I\left(W_{k:K}; \mathbb{H}, Q_{1:N}^{[k-1, \mathbb{H}]}, A_{1:N}^{[k-1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) \tag{28} \\
&\ge \frac{1}{N} \sum_{n=1}^{N} I\left(W_{k:K}; \mathbb{H}_n, Q_n^{[k-1, \mathbb{H}]}, A_n^{[k-1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) \tag{29} \\
&\ge \frac{1}{N} \sum_{n=1}^{N} I\left(W_{k:K}; Q_n^{[k-1, \mathbb{H}]}, A_n^{[k-1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}_n\right) \tag{30} \\
&\overset{(6)}{=} \frac{1}{N} \sum_{n=1}^{N} I\left(W_{k:K}; Q_n^{[k, \mathbb{H}]}, A_n^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}_n\right) \tag{31} \\
&\overset{(4),(5)}{=} \frac{1}{N} \sum_{n=1}^{N} I\left(W_{k:K}; A_n^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}_n, Q_n^{[k, \mathbb{H}]}\right) \tag{32} \\
&\overset{(7)}{=} \frac{1}{N} \sum_{n=1}^{N} H\left(A_n^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}_n, Q_n^{[k, \mathbb{H}]}\right) \tag{33} \\
&\ge \frac{1}{N} \sum_{n=1}^{N} H\left(A_n^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:n-1}^{[k, \mathbb{H}]}\right) \tag{34} \\
&\overset{(7)}{=} \frac{1}{N} \sum_{n=1}^{N} I\left(W_{k:K}; A_n^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:n-1}^{[k, \mathbb{H}]}\right) \tag{35} \\
&= \frac{1}{N} I\left(W_{k:K}; A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}, \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}\right) \tag{36} \\
&\overset{(4),(5)}{=} \frac{1}{N} I\left(W_{k:K}; \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) \tag{37} \\
&\overset{(8)}{=} \frac{1}{N} I\left(W_{k:K}; W_k, \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) - \frac{o(L)}{N} \tag{38} \\
&= \frac{1}{N} I\left(W_{k:K}; W_k \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k-1}\right) + \frac{1}{N} I\left(W_{k:K}; \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k}\right) - \frac{o(L)}{N} \tag{39} \\
&= \frac{1}{N} I\left(W_{k+1:K}; \mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}, A_{1:N}^{[k, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:k}\right) + \frac{L - o(L)}{N}, \tag{40}
\end{align}

where (29) follows from the non-negativity of mutual information, (31) follows from the privacy constraint, (32) follows from the independence of the messages and the queries, (33), (35) follow from the fact that answer strings are deterministic functions of the messages and the queries, (34) follows from the fact that conditioning reduces entropy, (37) follows from the independence of $W_{k:K}$ and $\left(\mathbb{H}, Q_{1:N}^{[k, \mathbb{H}]}\right)$, (38) follows from the reliability constraint on $W_k$, and (40) follows from the independence of $W_k$ and $(W_{\mathbb{H}}, W_{\tilde{M}+1:k-1})$. ∎

Now, we are ready to derive the lower bound for arbitrary $K$, $N$, and $\tilde{M}$. This can be obtained by applying Lemma 1 and Lemma 2 successively.

Lemma 3
For fixed $N$, $K$, and $\tilde{M} \le M$, we have

$$D \ge L \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K - \tilde{M} - 1}}\right) - o(L). \tag{41}$$

Proof:
We have

\begin{align}
D &\overset{(13)}{\ge} L + I\left(W_{\tilde{M}+2:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+1, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+1, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1}\right) - o(L) \tag{42} \\
&\overset{(27)}{\ge} L + \frac{L}{N} + \frac{1}{N} I\left(W_{\tilde{M}+3:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+2, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+2, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:\tilde{M}+2}\right) - o(L) \tag{43} \\
&\overset{(27)}{\ge} L + \frac{L}{N} + \frac{L}{N^2} + \frac{1}{N^2} I\left(W_{\tilde{M}+4:K}; \mathbb{H}, Q_{1:N}^{[\tilde{M}+3, \mathbb{H}]}, A_{1:N}^{[\tilde{M}+3, \mathbb{H}]} \,\middle|\, W_{\mathbb{H}}, W_{\tilde{M}+1:\tilde{M}+3}\right) - o(L) \tag{44} \\
&\overset{(27)}{\ge} \cdots \tag{45} \\
&\overset{(27)}{\ge} L \left(1 + \frac{1}{N} + \cdots + \frac{1}{N^{K - \tilde{M} - 1}}\right) - o(L), \tag{46}
\end{align}

where (42) follows from Lemma 1, and (43)-(46) follow from applying Lemma 2 starting from $k = \tilde{M} + 2$ to $k = K$, which differs from [12] in terms of the starting point of the induction. ∎

We conclude the converse proof by dividing by $L$ and taking $L \to \infty$ in (41), to have

$$D^* \ge 1 + \frac{1}{N} + \cdots + \frac{1}{N^{K - \tilde{M} - 1}}. \tag{47}$$

Finally, we note that the right hand side of (47) is monotonically decreasing in $\tilde{M}$. Since $\tilde{M} \le M$, the lowest lower bound is obtained by taking $\tilde{M} = M$, which yields the final converse bound,

$$D^* \ge 1 + \frac{1}{N} + \cdots + \frac{1}{N^{K - M - 1}}. \tag{48}$$

Remark 5
We note that if (48) is tight, any prefetching strategy $\mathbf{m}$ such that $\sum_{n=1}^{N} m_n < M$ is strictly suboptimal. Furthermore, the lower bound in (48) is the same for all prefetching strategies $\mathbf{m}$ satisfying $\sum_{n=1}^{N} m_n = M$. In Section 5, we show that this lower bound is tight.

We first note that the achievability scheme proposed in [32] for the PIR problem with completely unknown private side information also works for the PIR problem with partially known private side information here. The PIR scheme in [32] is based on MDS codes and consists of two stages. The first stage determines the systematic part of the MDS code according to the queries generated in [12], which protects the privacy of the desired message, i.e., in the first stage, the user designs the queries such that no information is leaked about which message out of the $K$ messages is the desired one. In the second stage, the user reduces the number of the downloaded equations by downloading the parity part of the MDS code only. For the case of partially known private side information here, two privacy constraints should be satisfied: the desired message privacy constraint and the side information privacy constraint. For the desired message, we note that the user should guarantee that the queries designed to retrieve any of the $K - m_n$ messages should be indistinguishable at the $n$th database (i.e., with the exception of the $m_n$ messages that the $n$th database has provided). Due to the first stage, the privacy of the desired message holds as it was designed to protect the privacy of all $K$ messages, which is more restricted. Furthermore, the PIR scheme in [32] also protects the privacy of the side information. The scheme in [32] ensures that the queries do not reveal the identity of the $M$ messages that are possessed by the user as side information. In our model, we note that we need to protect the privacy of $M - m_n$ messages from the $n$th database, as the remaining $m_n$ messages are known to the $n$th database.
Since the privacy constraint imposed on the side information in our model is less restricted than [32], using the scheme in [32] satisfies the privacy constraint of the side information in our case as well. That is, the $n$th database cannot infer which other $M - m_n$ messages the user holds. The PIR scheme in [32] achieves the normalized download cost in Theorem 1. The PIR scheme in [32] requires a message size of $N^K$ symbols. In the following, we propose another achievability scheme which requires a message size of $N^{K - \frac{M}{N}}$, if $m_n = \frac{M}{N} \in \mathbb{N}$. Thus, this scheme requires smaller sub-packetization and smaller field size for the MDS code.

Our PIR scheme for partially known private side information is based on the PIR schemes in [12, 32]. To protect the privacy of the partially known private side information and the privacy of the desired message, similar to [12], we apply the following three principles recursively: 1) database symmetry, 2) message symmetry within each database, and 3) exploiting undesired messages as side information. We reduce the download cost by utilizing the reconstruction property of MDS codes by exploiting partially known private side information as in [32]. The side information enables the user to request a reduced number of equations as a consequence of the user's knowledge of $M$ messages from the prefetching phase. Nevertheless, to protect the privacy of the side information, the user actually queries MDS coded symbols which are mixtures of $K - m_n$ messages. The main difference between our achievability scheme and that in [12, 32] is that since the $n$th database knows that the user has prefetched $m_n$ messages, the user does not need to protect the privacy for these $m_n$ messages from the $n$th database. This effectively reduces the number of messages that the scheme in [32] needs to operate on to $K - m_n$ messages in contrast to $K$ in [32].
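The reconstruction property invoked above can be illustrated with a toy systematic MDS code. The sketch below is not the construction of [32]; it assumes a Reed-Solomon-style code over a prime field, where the $k$ systematic symbols are the values of a polynomial of degree less than $k$ at $k$ points and the parity symbols are its values at further points. Any $k$ known values determine the remaining ones, so a user who already knows some systematic symbols only needs enough parity symbols to reach $k$ in total. The field size `P = 17`, the evaluation points, the toy symbol values, and all function names are assumptions made for illustration.

```python
P = 17  # prime field size; must be at least the code length (assumption)


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode_parity(systematic, n_parity):
    """Systematic symbols sit at x = 0..k-1; parities are evaluations at x = k..k+n_parity-1."""
    k = len(systematic)
    pts = list(enumerate(systematic))
    return [lagrange_eval(pts, x) for x in range(k, k + n_parity)]


def reconstruct(k, known_systematic, parity):
    """Recover all k systematic symbols from known ones (index -> value) plus parity symbols."""
    pts = list(known_systematic.items())
    pts += [(k + i, v) for i, v in enumerate(parity)]
    if len(pts) < k:
        raise ValueError("need at least k known symbols in total")
    pts = pts[:k]  # any k evaluations suffice
    return [lagrange_eval(pts, x) for x in range(k)]


if __name__ == "__main__":
    symbols = [3, 14, 15, 9, 2, 6, 5]    # 7 queried symbols (toy values)
    parity = encode_parity(symbols, 6)    # 6 parity symbols, code length 13
    side_info = {2: symbols[2]}           # one symbol the user already knows
    assert reconstruct(7, side_info, parity) == symbols
```

Here the user downloads only the 6 parity symbols and combines them with the one symbol known from the cache, which mirrors how the side information reduces the number of downloaded equations without revealing which symbol was known.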
When $\frac{M}{N} \in \mathbb{N}$, we show that if the user caches the same number of messages from each database, i.e., $m_n = \frac{M}{N}$, for all $n$, then the lower bound in (11) is achievable by this scheme. This scheme reduces the message size requirement from $L = N^K$ in [32] to $L = N^{K - \frac{M}{N}}$ here, simplifying the achievable scheme.

$N = 2$ Databases, $K = 4$ Messages, and $M = 2$ Cached Messages
Assume that each message is of size 8 symbols. We use $a_i$, $b_i$, $c_i$ and $d_i$, for $i = 1, \dots, 8$, to denote the symbols of messages $W_1$, $W_2$, $W_3$ and $W_4$, respectively. In this example, in the prefetching phase, the user caches message $W_3$ from database 1 and message $W_4$ from database 2; in the retrieval phase, the user wishes to retrieve message $W_1$ privately.

The user first generates the query table in Table 1, which contains 7 symbols per database. Since the user knows the single symbol $d$ in the DB1 column from the cached message $W_4$, the user could in principle exploit the partially known private side information by ignoring this symbol and reducing the number of queries to 6 equations per database. However, if the user simply does not download $d$, this compromises the privacy of $W_4$ at database 1. Instead, the user queries the MDS-coded version of the 7 symbols. Using these 7 symbols as the systematic part, we can use a $(13, 7)$ MDS code. By downloading the 6 parity symbols, the user can reconstruct all 7 symbols by utilizing the knowledge of $d$. Therefore, the normalized download cost of our achievability scheme is $\frac{12}{8} = \frac{3}{2}$, which matches the lower bound in (11) for this case.

For database 1, the query table in Table 1 induces the same distribution on the messages $W_1$, $W_2$ and $W_4$. Therefore, we guarantee the privacy of the desired message. The reliability constraint can also be verified. The $b$ symbol in $a + b$ is downloaded from database 2, and the $d$ symbol in $a + d$ is obtained in the prefetching phase; therefore, the corresponding two symbols of $W_1$ are decodable. From $b + c$ at database 2, the user obtains the $b$ symbol by using the private side information $W_3$, and can therefore decode the remaining symbol of $W_1$ from $a + b + d$. Similar arguments follow for database 2.

Table 1: Query table for $K = 4$, $N = 2$, $M = 2$ (symbol indices omitted).

DB1              DB2
a                a
b                b
d                c
a + b            a + b
a + d            a + c
b + d            b + c
a + b + d        a + b + c
W_H = {W_3}      W_H = {W_4}

N = 2 Databases, K = 5 Messages, and M = 2 Cached Messages
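The symbol counts quoted for Table 1 above, and those for the $K = 5$ example in this subsection, can be reproduced mechanically. The following is a minimal sketch under stated assumptions: it covers only $N = 2$, where each combination of messages appears once per database, and it treats both databases as symmetric. The names `query_combinations` and `example_counts` are hypothetical helpers introduced for illustration, not part of the paper.

```python
from fractions import Fraction
from itertools import combinations


def query_combinations(messages):
    """Every non-empty combination of the given message labels.

    Assumption: N = 2, so each combination appears once per database,
    which is the structure of one column of the query table.
    """
    return [
        combo
        for r in range(1, len(messages) + 1)
        for combo in combinations(messages, r)
    ]


def example_counts(messages, cached_elsewhere, message_size, num_databases=2):
    """Count queried symbols and derive the MDS parameters for one database.

    messages         -- labels of the K - m messages appearing in the column
    cached_elsewhere -- labels the user already holds from the other database
    """
    table = query_combinations(messages)
    p = len(table)  # queried symbols per database
    # A query is determined by side information when all its messages are cached.
    q = sum(1 for combo in table if set(combo) <= set(cached_elsewhere))
    parity = p - q  # symbols actually downloaded per database
    cost = Fraction(num_databases * parity, message_size)
    return {"p": p, "q": q, "mds": (2 * p - q, p), "parity": parity, "cost": cost}


# Database 1 in the K = 5 example: its column uses a, b, c, e, and e is cached.
k5 = example_counts("abce", "e", message_size=16)
# Database 1 in the K = 4 example: its column uses a, b, d, and d is cached.
k4 = example_counts("abd", "d", message_size=8)
print(k5)
print(k4)
```

The two calls give a $(29, 15)$ code with 14 parity symbols and cost $\frac{7}{4}$ for $K = 5$, and a $(13, 7)$ code with 6 parity symbols and cost $\frac{3}{2}$ for $K = 4$.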
Assume that each message is of size 16 symbols. We use $a_i$, $b_i$, $c_i$, $d_i$ and $e_i$, for $i = 1, \dots, 16$, to denote the symbols of messages $W_1$, $W_2$, $W_3$, $W_4$ and $W_5$, respectively. In this example, in the prefetching phase, the user caches message $W_4$ from database 1 and message $W_5$ from database 2; in the retrieval phase, the user wishes to retrieve message $W_1$ privately.

The user first generates the query table in Table 2, which contains 15 symbols per database. Since the user knows the single symbol $e$ in the DB1 column from the cached message $W_5$, in order to use the partially known private side information, the user in fact queries the MDS-coded version of the 15 symbols. Using these 15 symbols as the systematic part, we can use a $(29, 15)$ MDS code. By downloading the 14 parity symbols, the user can reconstruct all 15 symbols. Therefore, the normalized download cost of our achievability scheme is $\frac{28}{16} = \frac{7}{4}$, which matches the lower bound in (11) for this case.

For database 1, the query table in Table 2 induces the same distribution on the messages $W_1$, $W_2$, $W_3$ and $W_5$. Therefore, we guarantee the privacy of the desired message. The reliability constraint can also be verified. The $b$ and $c$ symbols in $a + b$ and $a + c$ are downloaded from database 2, and the $e$ symbol in $a + e$ is obtained in the prefetching phase; therefore, the corresponding three symbols of $W_1$ are decodable. From $b + d$ at database 2, the user obtains the $b$ symbol by using the private side information $W_4$. Similarly, the $c$ symbol is also decodable. Therefore, the user can decode one symbol of $W_1$ from $a + b + e$ and another from $a + c + e$. From $b + c + d$ at database 2, the user obtains $b + c$ by using the private side information $W_4$, and can therefore decode a symbol of $W_1$ from $a + b + c + e$. Similar arguments follow for database 2.

Table 2: Query table for $K = 5$, $N = 2$, $M = 2$ (symbol indices omitted).

DB1                  DB2
a                    a
b                    b
c                    c
e                    d
a + b                a + b
a + c                a + c
a + e                a + d
b + c                b + c
b + e                b + d
c + e                c + d
a + b + c            a + b + c
a + b + e            a + b + d
a + c + e            a + c + d
b + c + e            b + c + d
a + b + c + e        a + b + c + d
W_H = {W_4}          W_H = {W_5}

$\frac{M}{N} \in \mathbb{N}$

Let $\frac{M}{N} = m$. In the prefetching phase, the user caches $m$ messages from each database. To achieve the lower bound shown in (11), in the retrieval phase, we choose the message size as $L = N^{K-m}$ symbols. The details of the achievable scheme are as follows:

1. Initialization:
The user permutes each message randomly and independently. After the random permutation, we use $U_i(j)$ to denote the $j$th symbol of the permuted message $W_i$. Suppose the user wishes to retrieve $W_\theta$ privately. We then prepare the query table by first querying $U_\theta(1)$ from database 1. Set the round index to $r = 1$.

2. Symmetry across databases:
The user queries the same number of equations with the same structure as database 1 from the remaining databases.

3. Message symmetry:
For each database, to satisfy the privacy constraint, the user should query an equal number of symbols from all other $K - m$ messages. Since the user has cached $m$ messages from each database in the prefetching phase, the user does not need to protect the privacy of these $m$ messages against that database. For the $r$th round, the user queries sums of every $r$-combination of the $K - m$ messages.

4. Exploiting side information:
For database 1, the user exploits the side information equations obtained from the other $(N - 1)$ databases to query sums of $(r+1)$-combinations of the $K - m$ messages, where the sum of an $r$-combination is the side information. If the $r$-combination contains the cached message from database 1, we replace the overlapping symbols with the symbols cached from the other databases.

5. Repeat steps 2, 3, 4 after setting $r = r + 1$ until $r = K - m + 1$.

6. Shuffling the order of queries:
By shuffling the order of queries uniformly, all possible queries can be made equally likely regardless of the message index. This guarantees the privacy of the desired message.

7. Downloading MDS parity parts:
Now, the query table is finished. For each database, let $p$ be the number of queried symbols in the query table, and let $q$ be the number of queried symbols which are determined by the side information the user cached in the prefetching phase. Apply a $(2p - q, p)$ MDS code to the queried symbols by letting the $p$ symbols be the systematic part. Finally, the user downloads the parity parts of the MDS-coded answering strings, which are $p - q$ symbols for each database.

We now calculate the total number of downloaded symbols. We first calculate $p$, which is the number of queried symbols in the query table for each database,

\begin{align}
p &= \binom{K-m}{1} + \binom{K-m}{2}(N-1) + \cdots + \binom{K-m}{K-m}(N-1)^{K-m-1} \tag{49} \\
&= \frac{1}{N-1}\left[\binom{K-m}{1}(N-1) + \binom{K-m}{2}(N-1)^2 + \cdots + \binom{K-m}{K-m}(N-1)^{K-m}\right] \tag{50} \\
&= \frac{1}{N-1}\left(N^{K-m} - 1\right), \tag{51}
\end{align}

where $\binom{K-m}{r}$ in (49) corresponds to the queries of sums of every $r$-combination of the $K - m$ messages, and $(N-1)^{r-1}$ corresponds to the number of sets of the available side information from the other $(N - 1)$ databases.

We then calculate $q$, which is the number of queried symbols which are determined by the side information the user cached in the prefetching phase,

\begin{align}
q &= \binom{(N-1)m}{1} + \binom{(N-1)m}{2}(N-1) + \cdots + \binom{(N-1)m}{(N-1)m}(N-1)^{(N-1)m-1} \tag{52} \\
&= \frac{1}{N-1}\left[\binom{(N-1)m}{1}(N-1) + \cdots + \binom{(N-1)m}{(N-1)m}(N-1)^{(N-1)m}\right] \tag{53} \\
&= \frac{1}{N-1}\left(N^{(N-1)m} - 1\right), \tag{54}
\end{align}

where $\binom{(N-1)m}{r}$ in (52) corresponds to the queries which can be determined by the partially known private side information, and $(N-1)^{r-1}$ corresponds to the number of sets of queries consisting of $r$-combinations.

Next, we calculate the number of symbols for the desired message,

\begin{align}
L &= N\left[\binom{K-m-1}{0} + \binom{K-m-1}{1}(N-1) + \cdots + \binom{K-m-1}{K-m-1}(N-1)^{K-m-1}\right] \tag{55} \\
&= N \times N^{K-m-1} = N^{K-m}, \tag{56}
\end{align}

where $\binom{K-m-1}{r-1}$ in (55) corresponds to the queries containing the desired message, and $(N-1)^{r-1}$ corresponds to the number of sets of queries consisting of $r$-combinations.

Therefore, the normalized download cost becomes

\begin{align}
\frac{D}{L} &= \frac{N(p - q)}{L} \tag{57} \\
&= \frac{\frac{N}{N-1}\left(N^{K-m} - 1\right) - \frac{N}{N-1}\left(N^{(N-1)m} - 1\right)}{N^{K-m}} \tag{58} \\
&= \frac{N}{N-1} \times \frac{N^{K-m} - N^{(N-1)m}}{N^{K-m}} \tag{59} \\
&= \frac{1}{1 - \frac{1}{N}} \times \left[1 - \left(\frac{1}{N}\right)^{K-M}\right], \tag{60}
\end{align}

which matches the lower bound in (11).

Remark 6
Note that although our achievable scheme and the scheme in [32] both use MDS coding to exploit the available side information, the field size requirements for realizing the MDS codes are different. For the scheme of [32], a $(2\tilde{p} - \tilde{q}, \tilde{p})$ MDS code is used, where $\tilde{p} = \frac{1}{N-1}(N^K - 1)$ and $\tilde{q} = \frac{1}{N-1}(N^M - 1)$. This requires a larger field size than the $(2p - q, p)$ MDS code used in our scheme (if $\frac{M}{N} \in \mathbb{N}$), since $(2\tilde{p} - \tilde{q}) > (2p - q)$.

In this paper, we have introduced a new PIR model, namely, PIR with partially known private side information, as a natural model for studying practical PIR problems with cached side information. In this model, the user and the databases engage in a caching/PIR scenario which consists of two phases, namely, the prefetching phase and the retrieval phase. The $n$th database provides the user with $m_n$ side information messages in the prefetching phase such that $\sum_{n=1}^{N} m_n \leq M$; hence, each database has partial knowledge about the side information, in contrast to full knowledge in [29] and no knowledge in [30–32]. Based on this side information, the user designs a retrieval scheme that does not reveal the identity of the desired message or the identities of the remaining $M - m_n$ messages to the $n$th database. For this model, we determined the exact capacity to be $C = \frac{1 - \frac{1}{N}}{1 - \left(\frac{1}{N}\right)^{K-M}}$. The capacity is attained for any prefetching strategy that satisfies the cache memory size constraint with equality. The achievable scheme in [32] can also be used for this model. We further proposed another PIR scheme which requires smaller sub-packetization and field size for the case of uniform prefetching. Uniform prefetching, when feasible, is optimal. Interestingly, the capacity expression we derive for this problem is exactly the same as the capacity expression for the PIR problem with completely unknown side information [32]. Therefore, our result implies that there is no loss in employing the same databases for prefetching and retrieval purposes.

References

[1] B.
Chor, E. Kushilevitz, O. Goldreich, and M. Sudan. Private information retrieval. Journal of the ACM, 45(6):965–981, 1998.
[2] W. Gasarch. A survey on private information retrieval. In Bulletin of the EATCS, 2004.
[3] C. Cachin, S. Micali, and M. Stadler. Computationally private information retrieval with polylogarithmic communication. In International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1999.
[4] R. Ostrovsky and W. Skeith III. A survey of single-database private information retrieval: Techniques and applications. In International Workshop on Public Key Cryptography, pages 393–411. Springer, 2007.
[5] S. Yekhanin. Private information retrieval. Communications of the ACM, 53(4):68–73, 2010.
[6] N. B. Shah, K. V. Rashmi, and K. Ramchandran. One extra bit of download ensures perfectly private information retrieval. In IEEE ISIT, June 2014.
[7] G. Fanti and K. Ramchandran. Efficient private information retrieval over unsynchronized databases. IEEE Journal of Selected Topics in Signal Processing, 9(7):1229–1239, October 2015.
[8] T. Chan, S. Ho, and H. Yamamoto. Private information retrieval for coded storage. In IEEE ISIT, June 2015.
[9] A. Fazeli, A. Vardy, and E. Yaakobi. Codes for distributed PIR with low storage overhead. In IEEE ISIT, June 2015.
[10] R. Tajeddine and S. El Rouayheb. Private information retrieval from MDS coded data in distributed storage systems. In IEEE ISIT, July 2016.
[11] H. Sun and S. A. Jafar. The capacity of private information retrieval. In IEEE Globecom, December 2016.
[12] H. Sun and S. A. Jafar. The capacity of private information retrieval. IEEE Trans. on Info. Theory, 63(7):4075–4088, July 2017.
[13] H. Sun and S. Jafar. Blind interference alignment for private information retrieval. 2016. Available at arXiv:1601.07885.
[14] H. Sun and S. Jafar. The capacity of robust private information retrieval with colluding databases. 2016. Available at arXiv:1605.00635.
[15] R. Tajeddine, O. W. Gnilke, D. Karpuk, R. Freij-Hollanti, C. Hollanti, and S. El Rouayheb. Private information retrieval schemes for coded data with arbitrary collusion patterns. 2017. Available at arXiv:1701.07636.
[16] H. Sun and S. Jafar. The capacity of symmetric private information retrieval. 2016. Available at arXiv:1606.08828.
[17] K. Banawan and S. Ulukus. The capacity of private information retrieval from coded databases. IEEE Trans. on Info. Theory. Submitted September 2016. Also available at arXiv:1609.08138.
[18] H. Sun and S. Jafar. Optimal download cost of private information retrieval for arbitrary message length. 2016. Available at arXiv:1610.03048.
[19] H. Sun and S. Jafar. Multiround private information retrieval: Capacity and storage overhead. 2016. Available at arXiv:1611.02257.
[20] K. Banawan and S. Ulukus. Multi-message private information retrieval: Capacity results and near-optimal schemes. IEEE Trans. on Info. Theory. Submitted February 2017. Also available at arXiv:1702.01739.
[21] K. Banawan and S. Ulukus. The capacity of private information retrieval from Byzantine and colluding databases. IEEE Trans. on Info. Theory. Submitted June 2017. Also available at arXiv:1706.01442.
[22] Q. Wang and M. Skoglund. Symmetric private information retrieval for MDS coded distributed storage. 2016. Available at arXiv:1610.04530.
[23] R. Freij-Hollanti, O. Gnilke, C. Hollanti, and D. Karpuk. Private information retrieval from coded databases with colluding servers. 2016. Available at arXiv:1611.02062.
[24] H. Sun and S. Jafar. Private information retrieval from MDS coded data with colluding servers: Settling a conjecture by Freij-Hollanti et al. 2017. Available at arXiv:1701.07807.
[25] Y. Zhang and G. Ge. A general private information retrieval scheme for MDS coded databases with colluding servers. 2017. Available at arXiv:1704.06785.
[26] Y. Zhang and G. Ge. Multi-file private information retrieval from MDS coded databases with colluding servers. 2017. Available at arXiv:1705.03186.
[27] Q. Wang and M. Skoglund. Linear symmetric private information retrieval for MDS coded distributed storage with colluding servers. 2017. Available at arXiv:1708.05673.
[28] Q. Wang and M. Skoglund. Secure symmetric private information retrieval from colluding databases with adversaries. 2017. Available at arXiv:1707.02152.
[29] R. Tandon. The capacity of cache aided private information retrieval. 2017. Available at arXiv:1706.07035.
[30] S. Kadhe, B. Garcia, A. Heidarzadeh, S. El Rouayheb, and A. Sprintson. Private information retrieval with side information. 2017. Available at arXiv:1709.00112.
[31] Y.-P. Wei, K. Banawan, and S. Ulukus. Fundamental limits of cache-aided private information retrieval with unknown and uncoded prefetching. 2017. Available at arXiv:1709.01056.
[32] Z. Chen, Z. Wang, and S. Jafar. The capacity of private information retrieval with private side information. 2017. Available at arXiv:1709.03022.
[33] M. Karmoose, L. Song, M. Cardone, and C. Fragouli. Private broadcasting: an index coding approach. 2017. Available at arXiv:1701.04958.
[34] M. Karmoose, L. Song, M. Cardone, and C. Fragouli. Preserving privacy while broadcasting: kk