Multi-Message Private Information Retrieval: Capacity Results and Near-Optimal Schemes∗

Karim Banawan    Sennur Ulukus

Department of Electrical and Computer Engineering
University of Maryland, College Park, MD 20742

February 7, 2017

Abstract

We consider the problem of multi-message private information retrieval (MPIR) from $N$ non-communicating replicated databases. In MPIR, the user is interested in retrieving $P$ messages out of $M$ stored messages without leaking the identity of the retrieved messages. The information-theoretic sum capacity of MPIR, $C_s^P$, is the maximum number of desired message symbols that can be retrieved privately per downloaded symbol. For the case $P \geq \frac{M}{2}$, we determine the exact sum capacity of MPIR as $C_s^P = \frac{1}{1+\frac{M-P}{PN}}$. The achievable scheme in this case is based on downloading MDS-coded mixtures of all messages. For $P \leq \frac{M}{2}$, we develop lower and upper bounds for all $M, P, N$. These bounds match if the total number of messages $M$ is an integer multiple of the number of desired messages $P$, i.e., $\frac{M}{P} \in \mathbb{N}$. In this case, $C_s^P = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^{M/P}}$. The achievable scheme in this case generalizes the single-message capacity achieving scheme to have an unbalanced number of stages per round of download. For all the remaining cases, the difference between the lower and upper bounds is at most 0.0082, which occurs for $M = 5$, $P = 2$, $N = 2$. Our results indicate that joint retrieval of desired messages is more efficient than successive use of single-message retrieval schemes.

∗ This work was supported by NSF Grants CNS 13-14733, CCF 14-22111, CCF 14-22129, and CNS 15-26608. A shorter version is submitted to IEEE ISIT 2017.

The privacy of the contents of the downloaded information from curious public databases has attracted considerable research within the computer science community [1–4]. The problem is motivated by practical examples such as: ensuring privacy of investors as they download […]

[…] $N$ non-communicating databases without leaking any information about the identity of the downloaded message. The contents of the databases are identical. The user performs this operation by preparing and submitting queries to all databases. The databases respond truthfully with answer strings which are functions of the queries and the messages.
The user needs to reconstruct the desired message from these answer strings. A trivial solution for this seemingly difficult problem is for the user to download the contents of all databases. This solution, however, is extremely inefficient. The efficiency is measured by the retrieval rate, which is the ratio of the number of retrieved desired message symbols to the number of total downloaded symbols. The capacity of PIR is the maximum retrieval rate over all possible PIR schemes.

The computer science formulation of this problem assumes that the messages are of length one. The metrics in this case are the download cost, i.e., the sum of lengths of the answer strings, and the upload cost, i.e., the size of the queries. Most of this work is computational PIR as it ensures only that a server cannot get any information about user intent unless it solves a certain computationally hard problem [2, 5]. The information-theoretic re-formulation of the problem considers arbitrarily large message sizes, and ignores the upload cost. This formulation provides an absolute, i.e., information-theoretic, guarantee that no server participating in the protocol gets any information about the user intent. Towards that end, recently, [6] has drawn a connection between the PIR problem and the blind interference alignment scheme proposed in [7]. Then, [8] has determined the exact capacity of the classical PIR problem. The retrieval scheme in [8] is based on three principles: message symmetry, symmetry across databases, and exploiting side information from the undesired messages through alignment.

The basic PIR setting has been extended in several interesting directions. The first extension is the coded PIR (CPIR) problem [9–11]. The contents of the databases in this problem are coded by an $(N, K)$ storage code instead of being replicated. This is a natural extension since most storage systems nowadays are in fact coded to achieve reliability against node failures and erasures with manageable storage cost.
In [12], the exact capacity of the MDS-coded PIR is determined. Another interesting extension is PIR with colluding databases (TPIR). In this setting, $T$ databases can communicate and exchange the queries to identify the desired message. The exact capacity of colluded PIR is determined in [13]. The case of coded colluded PIR is investigated in [14]. The robust PIR problem (RPIR) extension considers the case when some databases are not responsive [13]. Lastly, in the symmetric PIR problem (SPIR) the privacy of the remaining records should be maintained against the user in addition to the usual privacy constraint on the databases, i.e., the user should not learn any messages other than the one it wished to retrieve. The exact capacity of symmetric PIR is determined in [15]; and the exact capacity of symmetric PIR from coded databases is determined in [16].

In some applications, the user may be interested in retrieving multiple messages from the databases without revealing the identities of these messages. Returning to the examples presented earlier: the investor may be interested in comparing the values of multiple records at the same time, and the inventor may be looking up several patents that are closely related to their work. One possible solution to this problem is to use the single-message retrieval scheme in [8] successively. We show in this work that multiple messages can be retrieved more efficiently than retrieving them one-by-one in a sequence. This resembles the superiority of joint decoding in multiple access channels over multiple simultaneous single-user transmissions [17].

To motivate the multi-message private information retrieval problem (MPIR), consider the example in [8, Section 4.3] where the number of messages is $M = 3$, the number of databases is $N = 2$, and the user is interested in retrieving only $P = 1$ message. Here the optimal scheme retrieves 8 desired bits in 14 downloads, hence with a rate of $4/7$. When the user wishes to retrieve $P = 2$ messages, if we use the scheme in [8] twice in a row, we retrieve 16 bits in 28 downloads, hence again a sum rate of $4/7$. Even considering the fact that the scheme in [8] retrieves 2 bits of the second message for free in downloading the first message, i.e., it actually retrieves 10 bits in 14 downloads, hence a sum rate of $5/7$, we show in this paper that a better sum rate of $4/5$ can be achieved.

[…] $k$-safe binary matrices that uses XOR operations. [20] extends the scheme in [1] to multiple blocks. [21] designs an efficient non-trivial multi-query computational PIR protocol and gives a lower bound on the communication of any multi-query information retrieval protocol. These works do not consider determining the information-theoretic capacity.

In this paper, we formulate the MPIR problem with non-colluding repeated databases from an information-theoretic perspective. Our goal is to characterize the sum capacity of the MPIR problem $C_s^P$, which is defined as the maximum ratio of the number of retrieved symbols from the $P$ desired messages to the number of total downloaded symbols. When the number of desired messages $P$ is at least half of the total number of messages $M$, i.e., $P \geq \frac{M}{2}$, we determine the exact sum capacity of MPIR as $C_s^P = \frac{1}{1+\frac{M-P}{PN}}$. We use a novel achievable scheme which downloads MDS-coded mixtures of all messages. We show that joint retrieval of the desired messages strictly outperforms successive use of single-message retrieval $P$ times. Additionally, we present an achievable rate region to characterize the trade-off between the retrieval rates of the desired $P$ messages.

For the case of $P \leq \frac{M}{2}$, we derive lower and upper bounds that match if the total number of messages $M$ is an integer multiple of the number of desired messages $P$, i.e., $\frac{M}{P} \in \mathbb{N}$. In this case, the sum capacity is $C_s^P = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^{M/P}}$. The result resembles the single-message capacity with the number of messages equal to $\frac{M}{P}$. In other cases, although the exact capacity is still an open problem, we show numerically that the gap between the lower and upper bounds is monotonically decreasing in $N$ and is upper bounded by 0.0082. The achievable scheme for $P \leq \frac{M}{2}$ is inspired by the greedy algorithm in [8], which retrieves all possible combinations of messages. The main difference of our scheme from the scheme in [8] is the number of stages required in each download round.
For example, round $M-P+1$ to round $M-1$, which correspond to retrieving the sum of $M-P+1$ to sum of $M-1$ […] $P$-order IIR filter [22]. Our converse proof generalizes the proof in [8] for $P \geq 1$. The essence of the proof is captured in two lemmas: the first lemma lower bounds the uncertainty of the interference for the case $P \geq \frac{M}{2}$, and the second lemma upper bounds the remaining uncertainty after conditioning on $P$ interfering messages.

Consider a classical PIR setting storing $M$ messages (or files). Each message is a vector $W_i \in \mathbb{F}_q^L$, $i \in \{1, \cdots, M\}$, whose elements are picked uniformly and independently from a sufficiently large field $\mathbb{F}_q$. Denote the contents of message $W_m$ by the vector $[w_m(1), w_m(2), \cdots, w_m(L)]^T$. The messages are independent and identically distributed, and thus,

$$ H(W_i) = L, \quad i \in \{1, \cdots, M\} \tag{1} $$
$$ H(W_{1:M}) = ML \tag{2} $$

where $W_{1:M} = (W_1, W_2, \cdots, W_M)$. The messages are stored in $N$ non-colluding (non-communicating) databases. Each database stores an identical copy of all $M$ messages, i.e., the databases encode the messages via an $(N, 1)$ repetition storage code [12].

In the MPIR problem, the user aims to retrieve a subset of messages indexed by the index set $\mathcal{P} = \{i_1, \cdots, i_P\} \subseteq \{1, \cdots, M\}$ out of the available messages, where $|\mathcal{P}| = P$, without leaking the identity of the subset $\mathcal{P}$. We assume that the cardinality of the potential message set, $P$, is known to all databases. To retrieve $W_{\mathcal{P}} = (W_{i_1}, W_{i_2}, \cdots, W_{i_P})$, the user generates a query $Q_n^{[\mathcal{P}]}$ and sends it to the $n$th database. The user does not have any knowledge about the messages in advance, hence the messages and the queries are statistically independent,

$$ I\left(W_1, \cdots, W_M; Q_1^{[\mathcal{P}]}, \cdots, Q_N^{[\mathcal{P}]}\right) = I\left(W_{1:M}; Q_{1:N}^{[\mathcal{P}]}\right) = 0 \tag{3} $$

The privacy is satisfied by ensuring statistical independence between the queries and the message index set $\mathcal{P} = \{i_1, \cdots, i_P\}$, i.e., the privacy constraint is given by,

$$ I\left(Q_n^{[i_1, \cdots, i_P]}; i_1, \cdots, i_P\right) = I\left(Q_n^{[\mathcal{P}]}; \mathcal{P}\right) = 0, \quad n \in \{1, \cdots, N\} \tag{4} $$

The $n$th database responds with an answer string $A_n^{[\mathcal{P}]}$, which is a deterministic function of the queries and the messages, hence

$$ H\left(A_n^{[\mathcal{P}]} \mid Q_n^{[\mathcal{P}]}, W_{1:M}\right) = 0 \tag{5} $$

We further note that by the data processing inequality and (4),

$$ I\left(A_n^{[\mathcal{P}]}; \mathcal{P}\right) = 0, \quad n \in \{1, \cdots, N\} \tag{6} $$

In addition, the user should be able to reconstruct the messages $W_{\mathcal{P}}$ reliably from the collected answers from all databases given the knowledge of the queries. Thus, we write the reliability constraint as,

$$ H\left(W_{i_1}, \cdots, W_{i_P} \mid A_1^{[\mathcal{P}]}, \cdots, A_N^{[\mathcal{P}]}, Q_1^{[\mathcal{P}]}, \cdots, Q_N^{[\mathcal{P}]}\right) = H\left(W_{\mathcal{P}} \mid A_{1:N}^{[\mathcal{P}]}, Q_{1:N}^{[\mathcal{P}]}\right) = 0 \tag{7} $$

We denote the retrieval rate of the $i$th message by $R_i$, where $i \in \mathcal{P}$. The retrieval rate of the $i$th message is the ratio between the length of message $i$ and the total download cost of the message set $\mathcal{P}$ that includes $W_i$.
Hence,

$$ R_i = \frac{H(W_i)}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}]}\right)} \tag{8} $$

The sum retrieval rate of $W_{\mathcal{P}}$ is given by,

$$ \sum_{i=1}^{P} R_i = \frac{H(W_{\mathcal{P}})}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}]}\right)} = \frac{PL}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}]}\right)} \tag{9} $$

The sum capacity of the MPIR problem is given by

$$ C_s^P = \sup \sum_{i=1}^{P} R_i \tag{10} $$

where the sup is over all private retrieval schemes.

In this paper, we follow the information-theoretic assumptions of large enough message size, large enough field size, and ignore the upload cost as in [8, 11–13]. A formal treatment of the capacity under message and field size constraints for $P = 1$ can be found in [23]. We note that the MPIR problem described here reduces to the classical PIR problem when $P = 1$, whose capacity is characterized in [8].

Our first result is the exact characterization of the sum capacity for the case $P \geq \frac{M}{2}$, i.e., when the user wishes to privately retrieve at least half of the messages stored in the databases.

Theorem 1

For the MPIR problem with non-colluding and replicated databases, if the number of desired messages $P$ is at least half of the number of overall stored messages $M$, i.e., if $P \geq \frac{M}{2}$, then the sum capacity is given by,

$$ C_s^P = \frac{1}{1+\frac{M-P}{PN}} \tag{11} $$

The achievability proof for Theorem 1 is given in Section 4, and the converse proof is given in Section 6.1. We note that when $P = 1$, the constraint of Theorem 1 is equivalent to $M = 2$, and the result in (11) reduces to the known result of [8] for $P = 1$, $M = 2$, which is $\frac{1}{1+\frac{1}{N}}$. We observe that the sum capacity in (11) is a strictly increasing function of $N$, and $C_s^P \to 1$ as $N \to \infty$. We also observe that the sum capacity in this regime is a strictly increasing function of $P$, and approaches 1 as $P \to M$.

The following corollary compares our result and the rate corresponding to the repeated use of the single-message retrieval scheme [8].

Corollary 1

For the MPIR problem with $P \geq \frac{M}{2}$, the repetition of the single-message retrieval scheme of [8] $P$ times in a row, which achieves a sum rate of,

$$ R_s^{rep} = \frac{(N-1)\left(N^{M-1}+P-1\right)}{N^M-1} \tag{12} $$

is strictly sub-optimal with respect to the exact capacity in (11).

Proof:

In order to use the single-message capacity achieving PIR scheme as an MPIR scheme, the user repeats the single-message achievable scheme for each individual message that belongs to $\mathcal{P}$. We note that at each repetition, the scheme downloads extra decodable symbols from other messages. By this argument, the following rate $R_s^{rep}$ is achievable using a repetition of the single-message scheme,

$$ R_s^{rep} = C + \Delta(M, P, N) \tag{13} $$

where $C$ is the single-message capacity, which is given by $C = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^M}$ [8], and $\Delta(M, P, N)$ is the rate of the extra decodable symbols that belong to $\mathcal{P}$. To calculate $\Delta(M, P, N)$, we note that the total download cost $D$ is given by $D = \frac{L}{C}$ by definition. Since $L = N^M$ in the single-message scheme, $D = \frac{N^M\left(1-\left(\frac{1}{N}\right)^M\right)}{1-\frac{1}{N}} = \frac{N^{M+1}-N}{N-1}$. The single-message scheme downloads one symbol from every message from every database, i.e., the scheme downloads extra $(P-1)N$ symbols from the remaining desired messages that belong to $\mathcal{P}$, thus,

$$ \Delta(M, P, N) = \frac{(P-1)N(N-1)}{N^{M+1}-N} = \frac{(P-1)(N-1)}{N^M-1} \tag{14} $$

Substituting (14) and $C = \frac{N^{M-1}(N-1)}{N^M-1}$ into (13) gives the $R_s^{rep}$ expression in (12).

Now, the difference between the capacity in (11) and the achievable rate in (12) is,

\begin{align}
C_s^P - R_s^{rep} &= \frac{PN}{P(N-1)+M} - \frac{(N-1)\left(N^{M-1}+P-1\right)}{N^M-1} \tag{15} \\
&= \frac{\eta(P, M, N)}{\left(N^M-1\right)\left(P(N-1)+M\right)} \tag{16}
\end{align}

It suffices to prove that $\eta(P, M, N) \geq 0$ for all $P$, $M$, $N$ when $P \geq \frac{M}{2}$ and $N \geq 2$. Note,

$$ \eta(P, M, N) = (2P-M)N^M + (M-P)N^{M-1} - P(P-1)N^2 + \left((P-1)(2P-M)-P\right)N + (M-P)(P-1) \tag{17} $$

In the regime $P \geq \frac{M}{2}$, the coefficients of $N^M$, $N^{M-1}$, $N^0$ are non-negative. Denote the negative terms in $\eta(\cdot)$ by $\nu(P, N)$, which is $\nu(P, N) = P(P-1)N^2 + PN$. We note $\nu(P, N) < P^2N^2$ when $N > 1$, which is the case here. Thus,

\begin{align}
\eta(P, M, N) &\geq (2P-M)N^M + (M-P)N^{M-1} + (P-1)(2P-M)N + (M-P)(P-1) - P^2N^2 \tag{18} \\
&> (2P-M)N^M + (M-P)N^{M-1} - P^2N^2 \tag{19} \\
&= N^2\left((2P-M)N^{M-2} + (M-P)N^{M-3} - P^2\right) \tag{20} \\
&\geq N^2\left((2P-M)2^{M-2} + (M-P)2^{M-3} - P^2\right) \tag{21} \\
&= N^2\left(2^{M-3}(3P-M) - P^2\right) \tag{22} \\
&\geq N^2\left(2^{M-3} \cdot \frac{M}{2} - M^2\right) \tag{23} \\
&= MN^2\left(2^{M-4} - M\right) \tag{24}
\end{align}

where (21) follows from the fact that $(2P-M)N^{M-2} + (M-P)N^{M-3} - P^2$ is monotone increasing in $N \geq 2$ for $M \geq 3$, and (23) follows from $\frac{M}{2} \leq P \leq M$. From (24), we conclude that $\eta(P, M, N) > 0$ for all $M \geq 7$, $P \geq \frac{M}{2}$ and $N \geq 2$. Examining the expression in (17) for the remaining cases manually, i.e., when $M \leq 6$, we note that $\eta(P, M, N) > 0$ […]. $\blacksquare$

For the example in the introduction, where $M = 3$, $P = 2$, $N = 2$, our MPIR scheme achieves a sum capacity of $\frac{4}{5}$ in (11), which is strictly larger than the repetition-based achievable sum rate of $\frac{5}{7}$ in (12).

The following corollary gives an achievable rate region for the MPIR problem.

Corollary 2

For the MPIR problem, for the case $P \geq \frac{M}{2}$, the following rate region is achievable,

$$ \mathcal{C} = \mathrm{conv}\left\{ (C, \delta, \cdots, \delta), (\delta, C, \cdots, \delta), \cdots, (\delta, \cdots, \delta, C), (C, 0, 0, \cdots, 0), (0, C, 0, \cdots, 0), \cdots, (0, 0, \cdots, C), (0, 0, \cdots, 0), \left(C_P, C_P, \cdots, C_P\right) \right\} \tag{25} $$

where $C = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^M}$, $C_P = \frac{C_s^P}{P} = \frac{N}{PN+(M-P)}$, $\delta = \frac{\Delta(M, P, N)}{P-1} = \frac{N-1}{N^M-1}$, and where conv denotes the convex hull, and all corner points lie in the $P$-dimensional space.

Proof:

This is a direct consequence of Theorem 1 and Corollary 1. The corner point $\left(C, \frac{\Delta(M,P,N)}{P-1}, \cdots, \frac{\Delta(M,P,N)}{P-1}\right) = \left(\frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^M}, \frac{N-1}{N^M-1}, \cdots, \frac{N-1}{N^M-1}\right)$ is achievable from the single-message achievable scheme. Due to the symmetry of the problem, any other permutation of the coordinates of this corner point is also achievable by changing the roles of the desired messages. Theorem 1 gives the symmetric sum capacity corner point for the case of $P \geq \frac{M}{2}$, namely $\left(\frac{C_s^P}{P}, \cdots, \frac{C_s^P}{P}\right) = \left(\frac{N}{PN+(M-P)}, \cdots, \frac{N}{PN+(M-P)}\right)$. By time-sharing of these corner points along with the origin, the region in (25) is achievable. $\blacksquare$

As an example for this achievable region, consider again the example in the introduction, where $M = 3$, $P = 2$, $N = 2$. In this case, we have a two-dimensional rate region with three corner points: $\left(\frac{4}{7}, \frac{1}{7}\right)$, which corresponds to the single-message capacity achieving point that aims at retrieving $W_1$; $\left(\frac{1}{7}, \frac{4}{7}\right)$, which corresponds to the single-message capacity achieving point that aims at retrieving $W_2$; and $\left(\frac{2}{5}, \frac{2}{5}\right)$, which corresponds to the symmetric sum capacity point. The convex hull of these corner points together with the points on the axes gives the achievable region in Fig. 1.

Figure 1: The achievable rate region of $M = 3$, $P = 2$, $N = 2$.

For the case $P \leq \frac{M}{2}$, we have the following result, where the lower and upper bounds match if $\frac{M}{P} \in \mathbb{N}$.

Theorem 2

For the MPIR problem with non-colluding and replicated databases, when $P \leq \frac{M}{2}$, the sum capacity is lower and upper bounded as,

$$ \underline{R}_s \leq C_s^P \leq \bar{R}_s \tag{27} $$

where the upper bound $\bar{R}_s$ is given by,

\begin{align}
\bar{R}_s &= \frac{1}{1 + \frac{1}{N} + \cdots + \frac{1}{N^{\lfloor M/P \rfloor - 1}} + \left(\frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor\right)\frac{1}{N^{\lfloor M/P \rfloor}}} \tag{28} \\
&= \frac{1}{\frac{1-\left(\frac{1}{N}\right)^{\lfloor M/P \rfloor}}{1-\frac{1}{N}} + \left(\frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor\right)\frac{1}{N^{\lfloor M/P \rfloor}}} \tag{29}
\end{align}

For the lower bound, define $r_i$ as,

$$ r_i = \frac{e^{j2\pi(i-1)/P}}{N^{1/P} - e^{j2\pi(i-1)/P}}, \quad i = 1, \cdots, P \tag{30} $$

where $j = \sqrt{-1}$, and denote $\gamma_i$, $i = 1, \cdots, P$, to be the solutions of the linear equations $\sum_{i=1}^{P} \gamma_i r_i^{-P} = (N-1)^{M-P}$, and $\sum_{i=1}^{P} \gamma_i r_i^{-k} = 0$, $k = 1, \cdots, P-1$, then $\underline{R}_s$ is given by,

$$ \underline{R}_s = \frac{\sum_{i=1}^{P} \gamma_i r_i^{M-P}\left[\left(1+\frac{1}{r_i}\right)^M - \left(1+\frac{1}{r_i}\right)^{M-P}\right]}{\sum_{i=1}^{P} \gamma_i r_i^{M-P}\left[\left(1+\frac{1}{r_i}\right)^M - 1\right]} \tag{31} $$

The achievability lower bound in Theorem 2 is shown in Section 5 and the upper bound is derived in Section 6.2. The following corollary states that the bounds in Theorem 2 match if the total number of messages is an integer multiple of the number of desired messages.

Corollary 3

For the MPIR problem with non-colluding and replicated databases, if $\frac{M}{P}$ is an integer, then the bounds in (27) match, and hence,

$$ C_s^P = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^{M/P}}, \quad \frac{M}{P} \in \mathbb{N} \tag{32} $$

Proof:

For the upper bound, observe that if $\frac{M}{P} \in \mathbb{N}$, then $\frac{M}{P} = \left\lfloor \frac{M}{P} \right\rfloor$. Hence, (28) becomes

$$ \bar{R}_s = \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^{M/P}} \tag{33} $$

For the lower bound, consider the case $\frac{M}{P} \in \mathbb{N}$. From (30),

$$ \left(1+\frac{1}{r_i}\right)^M = \left(\frac{N^{1/P}}{e^{j2\pi(i-1)/P}}\right)^M = N^{M/P} \tag{34} $$

since $e^{j2\pi(i-1)M/P} = 1$ for $\frac{M}{P} \in \mathbb{N}$. Similarly, $\left(1+\frac{1}{r_i}\right)^{M-P} = N^{\frac{M}{P}-1}$. Hence, if $\frac{M}{P} \in \mathbb{N}$,

\begin{align}
\underline{R}_s &= \frac{\sum_{i=1}^{P} \gamma_i r_i^{M-P}\left[N^{\frac{M}{P}} - N^{\frac{M}{P}-1}\right]}{\sum_{i=1}^{P} \gamma_i r_i^{M-P}\left[N^{\frac{M}{P}} - 1\right]} \tag{35} \\
&= \frac{N^{\frac{M}{P}} - N^{\frac{M}{P}-1}}{N^{\frac{M}{P}} - 1} \tag{36} \\
&= \frac{1-\frac{1}{N}}{1-\left(\frac{1}{N}\right)^{M/P}} \tag{37}
\end{align}

Thus, $\underline{R}_s = C_s^P = \bar{R}_s$ if $\frac{M}{P} \in \mathbb{N}$, and we have an exact capacity result in this case. $\blacksquare$

Examining the result, we observe that when the total number of messages is an integer multiple of the number of desired messages, the sum capacity of the MPIR is the same as the capacity of the single-message PIR with the number of messages equal to $\frac{M}{P}$. Note that, although at first the result may seem as if every $P$ messages can be lumped together as a single message, and the achievable scheme in [8] can be used, this is not the case. The reason for this is that we need to ensure the privacy constraint for every subset of messages of size $P$. That is why, in this paper, we develop a new achievable scheme.

The state of the results is summarized in Fig. 2: Consider the $(M, P)$ plane, where naturally $M \geq P$. The valid part of the plane is divided into two regions. The first region is confined between the lines $P = \frac{M}{2}$ and $P = M$; the sum capacity in this region is exactly characterized (Theorem 1). The second region is confined between the lines $P = 1$ and $P = \frac{M}{2}$; the sum capacity in this region is characterized only for the cases when $\frac{M}{P} \in \mathbb{N}$ (Corollary 3). The line $P = 1$ corresponds to the previously known result for the single-message PIR [8].
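The closed-form expressions in (11), (12), (28) and (31) are straightforward to evaluate numerically. The following Python sketch is one possible way to do so; the function names are hypothetical, the rational expressions are computed exactly with `fractions.Fraction`, and the lower bound (31) is evaluated in floating-point complex arithmetic with a small hand-written Gaussian elimination (an implementation choice made here to avoid external dependencies, not something prescribed by the paper). It assumes $N \geq 2$.

```python
import cmath
from fractions import Fraction

def capacity_half_or_more(M, P, N):
    """Sum capacity (11); meaningful when 2 * P >= M."""
    return Fraction(P * N, P * N + M - P)

def repetition_rate(M, P, N):
    """Sum rate (12) of repeating the single-message scheme P times."""
    return Fraction((N - 1) * (N ** (M - 1) + P - 1), N ** M - 1)

def upper_bound(M, P, N):
    """Upper bound (28), computed exactly."""
    q, rem = divmod(M, P)                      # q = floor(M / P), rem / P = M / P - q
    denom = sum(Fraction(1, N ** k) for k in range(q)) + Fraction(rem, P * N ** q)
    return 1 / denom

def _solve(A, b):
    """Solve the square complex system A g = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))   # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def lower_bound(M, P, N):
    """Lower bound (31), using r_i from (30) and gamma_i from the linear equations."""
    w = [cmath.exp(2j * cmath.pi * i / P) for i in range(P)]
    r = [wi / (N ** (1 / P) - wi) for wi in w]
    # Row k (k = 1..P) encodes sum_i gamma_i * r_i^(-k); only k = P has a nonzero right side.
    A = [[ri ** (-k) for ri in r] for k in range(1, P + 1)]
    b = [0] * (P - 1) + [(N - 1) ** (M - P)]
    g = _solve(A, b)
    num = sum(gi * ri ** (M - P) * ((1 + 1 / ri) ** M - (1 + 1 / ri) ** (M - P))
              for gi, ri in zip(g, r))
    den = sum(gi * ri ** (M - P) * ((1 + 1 / ri) ** M - 1) for gi, ri in zip(g, r))
    return (num / den).real

if __name__ == "__main__":
    print(capacity_half_or_more(3, 2, 2), repetition_rate(3, 2, 2))
    print(upper_bound(5, 2, 2), lower_bound(5, 2, 2))
```

For $M = 3$, $P = 2$, $N = 2$ the first two functions return $\frac{4}{5}$ and $\frac{5}{7}$, matching the example from the introduction. For $M = 5$, $P = 2$, $N = 2$ the upper bound is $\frac{8}{13}$ and the lower bound evaluates to $\frac{17}{28}$, so the gap is $\frac{3}{364} \approx 0.0082$.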
The exact capacity for the rest of the cases is still an open problem; however, the achievable scheme in Theorem 2 yields near-optimal sum rates for all the remaining cases with the largest difference of 0.0082.

Figure 2: Summary of the state of the results.

Fig. 3 shows the difference between the achievable rate $\underline{R}_s$ and the upper bound $\bar{R}_s$ in Theorem 2. The figure shows that the difference decreases as $N$ increases. This difference in all cases is small and is upper bounded by 0.0082, which occurs for $M = 5$, $P = 2$, $N = 2$. In addition, the difference is zero for the cases $P \geq \frac{M}{2}$ (Theorem 1) or $\frac{M}{P} \in \mathbb{N}$ (Corollary 3). Fig. 4 shows the effect of changing $M$ for fixed $(P, N)$. We observe that as $M$ increases, the sum rate monotonically decreases and has a limit of $1 - \frac{1}{N}$. In addition, Fig. 5 shows the effect of changing $N$ for fixed $(P, M)$. We observe that as $N$ increases, the sum rate increases and approaches 1, as expected.

4 Achievability Proof for the Case $P \geq \frac{M}{2}$

In this section, we present the achievable scheme for the case $P \geq \frac{M}{2}$. The scheme applies the concepts of message symmetry, database symmetry, and exploiting side information as in [8]. However, our scheme requires the extra ingredient of MDS coding of the desired symbols and the side information in its second stage.

4.1 $M = 3$, $P = 2$ Messages, $N = 2$ Databases

We start with a simple motivating example in this sub-section. The scheme operates over message size $N^2 = 4$. For sake of clarity, we assume that the three messages after interleaving their indices are $W_1 = (a_1, \cdots, a_4)^T$, $W_2 = (b_1, \cdots, b_4)^T$, and $W_3 = (c_1, \cdots, c_4)^T$. We use a $G_{2 \times 3}$ Reed-Solomon generator matrix over $\mathbb{F}_q$ as

$$ G_{2 \times 3} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \tag{38} $$

The user picks a random permutation for the columns of $G_{2 \times 3}$ from the 6 possible permutations, e.g., in this example we use the permutation 2, 1, 3. In the first round, the user starts by downloading one symbol from each database and each message, i.e., the user downloads $(a_1, b_1, c_1)$ from the first database, and $(a_2, b_2, c_2)$ from the second database. In the second round, the user encodes the side information from database 2, which is $c_2$, with two new symbols from $W_1$, $W_2$, which are $(a_3, b_3)$, using the permuted generator matrix, i.e., the user downloads two equations from database 1 in the second round,

$$ GS \begin{bmatrix} a_3 \\ b_3 \\ c_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1 & 3 \end{bmatrix} \begin{bmatrix} a_3 \\ b_3 \\ c_2 \end{bmatrix} = \begin{bmatrix} a_3 + b_3 + c_2 \\ 2a_3 + b_3 + 3c_2 \end{bmatrix} \tag{39} $$

The user repeats this operation for the second database with $(a_4, b_4)$ as desired symbols and $c_1$ as the side information from the first database.

Figure 3: Deviation of the achievable sum rate from the upper bound, plotted against the number of desired messages $P$ and the total number of messages $M$, for (a) $N = 2$, (b) $N = 5$, (c) $N = 10$, (d) $N = 20$.

For the decodability: The user subtracts out $c_2$ from round two in the first database, then the user can decode $(a_3, b_3)$ from $a_3 + b_3$ and $2a_3 + b_3$. Similarly, by subtracting out $c_1$ from round two in the second database, the user can decode $(a_4, b_4)$ from $a_4 + b_4$ and $2a_4 + b_4$.

For the privacy: Single bit retrievals of $(a_1, b_1, c_1)$ and $(a_2, b_2, c_2)$ from the two databases in the first round satisfy message symmetry and database symmetry, and do not leak any information. In addition, due to the private shuffling of bit indices, the different coefficients of 1, 2 and 3 in front of the bits in the MDS-coded summations in the second round do not leak any information either; see a formal proof in Section 4.3. To see the privacy constraint intuitively from another angle, we note that the user can alter the queries for the second database when the queries for the first database are fixed, when the user wishes to retrieve another set of two messages. For instance, if the user wishes to retrieve $(W_1, W_3)$ instead of $(W_1, W_2)$, it can alter the queries for the second database by changing the indices of the $c$ and $b$ symbols that appear in the queries of the second database […].

Figure 4: Effect of changing $M$ for fixed $P = 5, 6, 10$ and fixed $N = 2$.

The query table for this case is shown in Table 1 below. The scheme retrieves $a_1, \cdots, a_4$ and $b_1, \cdots, b_4$, i.e., 8 bits in 10 downloads (5 from each database). Thus, the achievable sum rate for this scheme is $\frac{8}{10} = \frac{4}{5} = \frac{1}{1+\frac{M-P}{PN}}$. If we use the single-message optimal scheme in [8] instead, the sum rate is $\frac{10}{14} = \frac{5}{7} < \frac{4}{5}$, as discussed in the introduction.

Figure 5: Effect of changing $N$ for fixed $(M, P) = (5, 2), (10, 5), (20, 3)$.

Table 1: The query table for the case $M = 3$, $P = 2$, $N = 2$.

    Database 1              Database 2
    ---------------------   ---------------------
    a_1, b_1, c_1           a_2, b_2, c_2
    a_3 + b_3 + c_2         a_4 + b_4 + c_1
    2a_3 + b_3 + 3c_2       2a_4 + b_4 + 3c_1

The scheme requires $L = N^2$, and is completed in two rounds. The main ingredient of the scheme is MDS coding of the desired symbols and side information in the second round. The details of the scheme are as follows.

1. Index preparation:

The user interleaves the contents of each message randomly and independently from the remaining messages using a random interleaver $\pi_m(\cdot)$ which is known privately to the user only, i.e.,

$$ x_m(i) = w_m(\pi_m(i)), \quad i \in \{1, \cdots, L\} \tag{40} $$

where $X_m = [x_m(1), \cdots, x_m(L)]^T$ is the interleaved message. Thus, the downloaded symbol $x_m(i)$ at any database appears to be chosen at random and independent from the desired message subset $\mathcal{P}$.

2. Round one:

As in [8], the user downloads one symbol from every message from every database, i.e., the user downloads $(x_1(n), x_2(n), \cdots, x_M(n))$ from the $n$th database. This implements message symmetry and symmetry across databases, and satisfies the privacy constraint.

3. Round two:

The user downloads a coded mixture of new symbols from the desired messages and the undesired symbols downloaded from the other databases. Specifically,

(a) The user picks an MDS generator matrix $G \in \mathbb{F}_q^{P \times M}$, which has the property that every $P \times P$ submatrix is full-rank. This implies that if the user can cancel out any $M - P$ symbols from the mixture, the remaining symbols can be decoded. One explicit MDS generator matrix is the Reed-Solomon generator matrix over $\mathbb{F}_q$, where $q > M$, [24, 25]

$$ G = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 3 & \cdots & M \\ 1 & 2^2 & 3^2 & \cdots & M^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 2^{P-1} & 3^{P-1} & \cdots & M^{P-1} \end{bmatrix}_{P \times M} \tag{41} $$

(b) The user picks uniformly and independently at random the permutation matrices $S_1, S_2, \cdots, S_{N-1}$ of size $M \times M$. These matrices shuffle the order of columns of $G$ to be independent of $\mathcal{P}$.

(c) At the first database, the user downloads an MDS-coded version of $P$ new symbols from the desired set $\mathcal{P}$ and $M - P$ undesired symbols that are already decoded from the second database in the first round, i.e., the user downloads $P$ equations of the form

$$ G S_1 \left[ x_{i_1}(n+1) \;\; x_{i_2}(n+1) \;\; \cdots \;\; x_{i_P}(n+1) \;\; x_{j_1}(2) \;\; x_{j_2}(2) \;\; \cdots \;\; x_{j_{M-P}}(2) \right]^T \tag{42} $$

where $\mathcal{P} = \{i_1, i_2, \cdots, i_P\}$ are the indices of the desired messages and $\bar{\mathcal{P}} = \{j_1, j_2, \cdots, j_{M-P}\}$ are the indices of the undesired messages. In this case, the user can cancel out the undesired messages and be left with a $P \times P$ invertible system of equations that it can solve to get $[x_{i_1}(n+1), x_{i_2}(n+1), \cdots, x_{i_P}(n+1)]$. This implements exploiting side information as in [8].

(d) The user repeats the last step for each set of side information from database 3 to database $N$, each with a different permutation matrix.

(e) By database symmetry, the user repeats all steps of round two at all other databases.

4.3 Decodability, Privacy, and Calculation of the Achievable Rate

Now, we verify that this achievable scheme satisfies the reliability and privacy constraints.

For the reliability: The user gets individual symbols from all databases in the first round, and hence they are all decodable by definition. In the second round, the user can subtract out all the undesired message symbols using the undesired symbols downloaded from all other databases during the first round.
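The index preparation, the two download rounds, and the cancellation step just described can be simulated end to end. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it works over the prime field $\mathbb{F}_7$ (any prime $q > M$ would do; the value 7 is chosen here only for illustration), uses 0-based indices, assigns fresh symbol indices in a fixed order that is hidden by the private interleaver, and uses hypothetical function names (`rs_generator`, `solve_mod`, `retrieve`). A fresh column permutation is drawn for every side-information set.

```python
import random
from fractions import Fraction

Q = 7  # assumed prime field size for this sketch; the scheme needs q > M

def rs_generator(P, M):
    """P x M Reed-Solomon (Vandermonde) generator as in (41), entries j^i mod Q."""
    return [[pow(j, i, Q) for j in range(1, M + 1)] for i in range(P)]

def solve_mod(A, b):
    """Solve the square system A x = b over F_Q by Gauss-Jordan elimination."""
    n = len(b)
    aug = [[v % Q for v in row] + [b[i] % Q] for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c])   # nonzero pivot exists (MDS property)
        aug[c], aug[p] = aug[p], aug[c]
        inv = pow(aug[c][c], -1, Q)                      # modular inverse (Python 3.8+)
        aug[c] = [v * inv % Q for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(x - f * y) % Q for x, y in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def retrieve(W, desired, N, rng=random):
    """Simulate the two-round scheme; return (decoded desired messages, download count)."""
    M, P, L = len(W), len(desired), N * N
    G = rs_generator(P, M)
    undesired = [m for m in range(M) if m not in desired]
    # Index preparation: one private random interleaver per message, as in (40).
    pi = [rng.sample(range(L), L) for _ in range(M)]
    x = [[W[m][pi[m][i]] for i in range(L)] for m in range(M)]
    known = {}       # (message, interleaved index) -> symbol known to the user
    downloads = 0
    # Round one: database n returns x_m(n) for every message m.
    for n in range(N):
        for m in range(M):
            known[(m, n)] = x[m][n]
            downloads += 1
    # Round two: MDS-coded mixtures of new desired symbols and side information.
    for n in range(N):
        for c, k in enumerate(d for d in range(N) if d != n):
            t = N + n * (N - 1) + c          # fresh index for the new desired symbols
            s = rng.sample(range(M), M)      # private column permutation of G
            idx = {m: (t if m in desired else k) for m in range(M)}
            # Database side: P coded answers.
            y = [sum(G[r][s[m]] * x[m][idx[m]] for m in range(M)) % Q for r in range(P)]
            downloads += P
            # User side: cancel the side information, then invert the P x P system.
            rhs = [(y[r] - sum(G[r][s[m]] * known[(m, k)] for m in undesired)) % Q
                   for r in range(P)]
            A = [[G[r][s[m]] for m in desired] for r in range(P)]
            for m, v in zip(desired, solve_mod(A, rhs)):
                known[(m, t)] = v
    # Undo the interleaving for the desired messages.
    out = {}
    for m in desired:
        w = [None] * L
        for i in range(L):
            w[pi[m][i]] = known[(m, i)]
        out[m] = w
    return out, downloads

if __name__ == "__main__":
    M, P, N = 3, 2, 2
    W = [[random.randrange(Q) for _ in range(N * N)] for _ in range(M)]
    decoded, d = retrieve(W, [0, 1], N)
    print(decoded[0] == W[0], decoded[1] == W[1], Fraction(P * N * N, d))
```

Each database is asked for $M + P(N-1)$ symbols, so the simulated download count is $N(M + P(N-1))$ for $PN^2$ desired symbols, which for $M = 3$, $P = 2$, $N = 2$ is 10 downloads for 8 desired symbols.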
Consequently, the user is left with a $P \times P$ system of equations which is guaranteed to be invertible by the MDS property, hence all symbols that belong to $W_{\mathcal{P}}$ are decodable.

For the privacy: At each database, for every message subset $\mathcal{P}$ of size $P$, the achievable scheme retrieves randomly interleaved symbols which are encoded by the following matrix:

$$ H_{\mathcal{P}} = \begin{bmatrix} I_P & \mathbf{0}_P & \mathbf{0}_P & \cdots & \mathbf{0}_P \\ \mathbf{0}_P & G^1_{\mathcal{P}} & \mathbf{0}_P & \cdots & \mathbf{0}_P \\ \mathbf{0}_P & \mathbf{0}_P & G^2_{\mathcal{P}} & \cdots & \mathbf{0}_P \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_P & \mathbf{0}_P & \mathbf{0}_P & \cdots & G^{N-1}_{\mathcal{P}} \end{bmatrix} \tag{43} $$

where $G^n_{\mathcal{P}} = G S_n(:, \mathcal{P})$ are the columns of the encoding matrix that correspond to the message subset $\mathcal{P}$ after applying the random permutation $S_n$. Since the permutation matrices are chosen uniformly and independently from each other, the probability distribution of $H_{\mathcal{P}}$ is uniform irrespective of $\mathcal{P}$ (the probability of realizing such a matrix is $\left(\frac{(M-P)!}{M!}\right)^{N-1}$). Furthermore, the symbols are chosen randomly and uniformly by applying the random interleaver. Hence, the retrieval scheme is private.

To calculate the achievable rate: We note that at each database, the user downloads $M$ individual symbols in the first round, which include $P$ desired symbols. The user exploits the side information from the remaining $(N-1)$ databases to generate $P$ equations for each side information set. Each set of $P$ equations in turn generates $P$ desired symbols. Hence, the achievable rate is calculated as,

\begin{align}
\sum_{i=1}^{P} R_i &= \frac{\text{total number of desired symbols}}{\text{total downloaded equations}} \tag{44} \\
&= \frac{N\left(P + P(N-1)\right)}{N\left(M + P(N-1)\right)} \tag{45} \\
&= \frac{PN}{(M-P) + PN} \tag{46} \\
&= \frac{1}{1+\frac{M-P}{PN}} \tag{47}
\end{align}

4.4 Further Examples for the Case $P \geq \frac{M}{2}$

In this section, we illustrate our achievable scheme with two more basic examples. In Section 4.1, we considered the case $M = 3$, $P = 2$, $N = 2$. In the next two sub-sections, we will consider examples with larger $M$, $P$ (Section 4.4.1), and larger $N$ (Section 4.4.2).

4.4.1 $M = 5$ Messages, $P = 3$ Messages, $N = 2$ Databases

Let $\mathcal{P} = \{1, 2, 3\}$, and $a$ to $e$ denote the contents of $W_1$ to $W_5$, respectively. The achievable scheme is similar to the example in Section 4.1. The difference is the use of a $5 \times 5$ permutation matrix $S$ and a $G_{3 \times 5}$ Reed-Solomon generator matrix over $\mathbb{F}_q$ as:

$$ G_{3 \times 5} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 1 & 4 & 4 & 1 & 0 \end{bmatrix} \tag{48} $$

The query table is shown in Table 2 below with the following random permutation for the columns: 2, 5, 1, 3, 4. The reliability and privacy constraints are satisfied due to the MDS property that implies that any subset of 3 messages corresponds to a $3 \times 3$ invertible submatrix. The scheme retrieves $a_1, \cdots, a_4$, $b_1, \cdots, b_4$ and $c_1, \cdots, c_4$, hence 12 bits in 16 downloads (8 from each database). Thus, the achievable sum rate is $\frac{12}{16} = \frac{3}{4}$, which equals the sum capacity $\frac{1}{1+\frac{M-P}{PN}}$ in (11). This strictly outperforms the repetition-based achievable sum rate in (12).

Table 2: The query table for the case $M = 5$, $P = 3$, $N = 2$.

    Database 1                              Database 2
    -------------------------------------   -------------------------------------
    a_1, b_1, c_1, d_1, e_1                 a_2, b_2, c_2, d_2, e_2
    a_3 + b_3 + c_3 + d_2 + e_2             a_4 + b_4 + c_4 + d_1 + e_1
    2a_3 + 5b_3 + c_3 + 3d_2 + 4e_2         2a_4 + 5b_4 + c_4 + 3d_1 + 4e_1
    4a_3 + c_3 + 4d_2 + e_2                 4a_4 + c_4 + 4d_1 + e_1

4.4.2 $M = 4$ Messages, $P = 2$ Messages, $N = 3$ Databases

Next, we give an example with a larger $N$. Here, the message size is $N^2 = 9$. We take the generator matrix $\mathbf{G}_{2 \times 4} = \mathbf{G}_{3 \times 5}([1:2], [1:4])$ to be the upper-left submatrix of the generator matrix of the previous example, together with two random permutations (corresponding to $\mathbf{S}_1$, $\mathbf{S}_2$). The query table is shown in Table 3 below. This scheme retrieves $a_1, \cdots, a_9$ and $b_1, \cdots, b_9$, hence 18 bits in 24 downloads (8 from each database). Thus, the achievable rate is $\frac{18}{24} = \frac{3}{4} = \frac{1}{1 + \frac{M-P}{PN}}$. This strictly outperforms the repetition-based achievable sum rate in (12).

Table 3: The query table for the case $M = 4$, $P = 2$, $N = 3$. Each database answers the $M = 4$ individual symbols $a$, $b$, $c$, $d$, followed by $P(N-1) = 4$ MDS-coded mixtures of all four messages.

5 The Case $P \leq \frac{M}{2}$

In this section, we present an achievable scheme for the case $P \leq \frac{M}{2}$. We show that this scheme is optimal when the total number of messages $M$ is an integer multiple of the number of desired messages $P$. The scheme incurs a small loss from the upper bound in all other cases. The scheme generalizes the ideas in [8]. Unlike [8], our scheme uses an unequal number of stages for each round of download. Interestingly, the number of stages at each round can be thought of as the output of an all-poles IIR filter. Our scheme reduces to [8] if we let $P = 1$. In the sequel, we define the $i$th round as the download queries that retrieve sums of $i$ different symbols. We define a stage as a block of queries that exhausts all $\binom{M}{i}$ combinations of the sum of $i$ symbols in the $i$th round.

5.1 M = 5, P = 2 Messages, N = 2 Databases

To motivate our achievable scheme, consider the case of retrieving two messages denoted by letters $(a, b)$ from five stored messages denoted by letters $(a, b, c, d, e)$. Instead of designing the queries beginning from the top as usual, i.e., beginning by downloading individual symbols, we design the scheme backwards, starting from the last round that corresponds to downloading sums of all five messages, and trace back to identify the side information needed at each round from the other database. The steps described below can be followed in the query table in Table 4.

Now, let us fix the number of stages in the 5th round to be 1 as in [8], since $N = 2$. Round 5 corresponds to downloading the sum of all five messages and contains one combination of symbols, $a + b + c + d + e$; see the last line in Table 4. Since we wish to retrieve $(a, b)$, we need one side information equation in the form of $c + d + e$ from earlier rounds. The combination $c + d + e$ can be created directly from round 3 without using round 4. Hence, we suppress round 4, as it does not create any useful side information in our case, and download one stage from round 3 to generate one side information equation $c + d + e$.

In round 3, we download sums of 3 messages. Each stage of round 3 consists of $\binom{5}{3} = 10$ equations. One of those 10 equations is in the desired $c + d + e$ form, and the remaining 9 have $a$, $b$, or both in them. In tabulating these 9 combinations, we recognize two categories of side information equations needed from earlier rounds. The first category corresponds to equations of the form $a + b + (c, d, e)$, where $(c, d, e)$ denotes the possible choices for the rest of the equation, i.e., these equations contain both $a$ and $b$ plus one more symbol in the form of $c$ or $d$ or $e$. This category requires downloading one stage of individual symbols (i.e., an individual $c$ or $d$ or $e$), that is, one stage of round 1. We note also that one of the symbols $(a, b)$ should be known as side information from the second database in order to solve for the remaining new symbol. The second category corresponds to equations of the form $a + (c + d, c + e, d + e)$ and $b + (c + d, c + e, d + e)$, i.e., these equations have only one of $a$ or $b$ but not both. This category requires two stages of round 2, as we need different side information equations that contain sums of twos, e.g., $c + d$, $c + e$, $d + e$.

In round 2, we download sums of 2 messages. Each stage of the second round contains $\binom{5}{2} = 10$ equations. In each stage, we need one category of side information equations, which is $a + (c, d, e)$ and $b + (c, d, e)$. This necessitates two different stages of individual symbols, i.e., two stages of round 1 for each stage of round 2.

Denoting by $\alpha_i$ the number of stages needed for the $i$th round, we sum all the required stages for round 1 to be $\alpha_1 = 2\alpha_2 + \alpha_3 = 2 \cdot 2 + 1 = 5$. In summary, $\alpha_1 = 5$, $\alpha_2 = 2$, $\alpha_3 = 1$, $\alpha_4 = 0$, $\alpha_5 = 1$. These can be observed in the query table in Table 4. Note that we have $\alpha_1 = 5$ stages in round 1, where we download individual bits; then we have $\alpha_2 = 2$ stages in round 2, where we download sums of two symbols; then we have $\alpha_3 = 1$ stage in round 3, where we download sums of three symbols; we skip round 4 as $\alpha_4 = 0$; and we have $\alpha_5 = 1$ stage of round 5, where we download the sum of all five symbols.

Now, after designing the structure of the queries and the number of stages needed for each round, we apply the rest of the scheme described in [8]. The user randomly interleaves the messages as usual. In the first round, the user downloads one symbol from each message at each database. This is repeated $\alpha_1 = 5$ times for each database. Hence, the user downloads $a_{1:10}$, $b_{1:10}$, $c_{1:10}$, $d_{1:10}$, $e_{1:10}$ from the two databases. In the second round, the user downloads sums of two messages. Each stage contains $\binom{5}{2} = 10$ equations. This is repeated $\alpha_2 = 2$ times. For example, in database 1, the user exploits one set of symbols $c$, $d$, $e$ to get three new symbols of $a$, and another set of symbols $c$, $d$, $e$ to obtain three new symbols of $b$. These side information symbols are from round 1. Round 2 generates the sums $c + d$, $c + e$, $d + e$ from stage 1, and a second set of sums $c + d$, $c + e$, $d + e$ from stage 2, as side information for round 3.

In round 3, the user downloads sums of three symbols. There are $\binom{5}{3} = 10$ of them. Symbols $c$, $d$, $e$ downloaded in round 1 from database 2 are summed with mixtures of $a + b$. The two sets of side information generated in the second round are exploited in the equations that have one $a$ or $b$. Note that in each equation that contains both $a$ and $b$, one of them is new and the other one is decoded from database 2. Round 3 generates one side information equation $c + d + e$ that is used in round 5. This last round includes the sum of all five messages.

Therefore, as seen in Table 4, we have retrieved $a_1, \cdots, a_{34}$ and $b_1, \cdots, b_{34}$, i.e., 68 bits in a total of 112 downloads (56 from each database). Thus, the achievable sum rate is $\frac{68}{112} = \frac{17}{28}$. This is $\underline{R}_s$ in Theorem 2, whereas the upper bound $\bar{R}_s$ in Theorem 2 is $\frac{1}{1 + \frac{1}{N} + \frac{1}{2N^2}} = \frac{8}{13}$. The gap between $\bar{R}_s$ and $\underline{R}_s$ is equal to $\frac{3}{364} \simeq 0.0082$, which is the largest gap between $\underline{R}_s$ and $\bar{R}_s$ over all possible values of $M$, $P$ and $N$.

5.2 Calculation of the Number of Stages

The main new ingredient of our scheme in comparison to the scheme in [8] is the unequal number of stages in each round. In [8], the scheme is completed in $M$ rounds, and each round contains only 1 stage when $N = 2$. To generalize the ideas in Section 5.1 and calculate the number of stages needed per round, we use Vandermonde's identity

$$\binom{M}{i} = \sum_{k=0}^{P} \binom{P}{k} \binom{M-P}{i-k} \tag{49}$$

The relation in (49) states that any combination of $i$ objects from a group of $M$ objects must have $k$ objects from a group of size $P$ and $i - k$ objects from a group of size $M - P$. In our context, the first group is the subset of the desired messages and the second group is the subset of the undesired messages.
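The counting behind (49) can be checked numerically. The following is a minimal sketch in Python; the helper names `category_sizes` and `check_identity` are hypothetical and chosen here only for illustration. For round $i$ it lists, for each number $k$ of desired terms, how many query subgroups there are and how many queries each subgroup contains.

```python
from math import comb


def category_sizes(M, P, i):
    """Sort the C(M, i) sums of round i by the number k of desired terms.

    Each entry is (k, subgroups, queries_per_subgroup) with
    subgroups = C(P, k) and queries_per_subgroup = C(M - P, i - k).
    """
    sizes = []
    for k in range(P + 1):
        # A category is empty when it would need a negative number of
        # undesired terms; comb() already returns 0 when i - k > M - P.
        per_subgroup = comb(M - P, i - k) if i - k >= 0 else 0
        sizes.append((k, comb(P, k), per_subgroup))
    return sizes


def check_identity(M, P, i):
    """True when the categories add up to C(M, i), as (49) states."""
    total = sum(s * q for _, s, q in category_sizes(M, P, i))
    return total == comb(M, i)


if __name__ == "__main__":
    # Round 3 of the M = 5, P = 2 example: one undesired-only sum (c+d+e),
    # 2 x 3 sums with exactly one of a or b, and 1 x 3 sums containing a+b.
    print(category_sizes(5, 2, 3))
    print(check_identity(5, 2, 3))
```

For round 3 of the $M = 5$, $P = 2$ example, the three categories contain $1$, $2 \cdot 3$ and $1 \cdot 3$ equations, which matches the split of the 10 equations into one $c + d + e$ equation and 9 equations that involve $a$ or $b$.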
Then, the relation can be interpreted in our setting as follows: In the $i$th round, the $\binom{M}{i}$ combinations of all possible sums of $i$ terms can be sorted into $P + 1$ categories: The first category (i.e., $k = 0$) contains no terms from the desired messages, the second category ($k = 1$) contains 1 term from the desired messages and $i - 1$ terms from the undesired messages, and so on. The number of query subgroups in category $k$ is $\binom{P}{k}$, and the number of queries in each subgroup is $\binom{M-P}{i-k}$. Let us consider the following concrete example for clarification:

Consider that we have 6 messages denoted by $(a, b, c, d, e, f)$, and the desired group to be retrieved is $(a, b)$. Consider round 4, which consists of all combinations of sums of 4 symbols. From Vandermonde's identity, we know that $\binom{6}{4} = \binom{2}{0}\binom{4}{4} + \binom{2}{1}\binom{4}{3} + \binom{2}{2}\binom{4}{2}$, which means that there are three categories of sums. The first category contains only undesired messages; we have $\binom{2}{0} = 1$ query subgroup of the form $c + d + e + f$. The second category has 1 term from the desired group and the remaining terms are undesired; we have $\binom{2}{1} = 2$ query subgroups, one corresponding to $a$ with combinations of 3 terms from $c, d, e, f$, and the other to $b$ with combinations of 3 terms from $c, d, e, f$. Each query subgroup contains $\binom{4}{3}$ queries, i.e., the first query subgroup is of the form $a + (c + d + e, c + d + f, c + e + f, d + e + f)$ and the second query subgroup is of the form $b + (c + d + e, c + d + f, c + e + f, d + e + f)$. The third category has 2 terms from the desired group and 2 terms from the undesired group; we have $\binom{2}{2} = 1$ query subgroup of this category, which takes the form $a + b + (c + d, c + e, \cdots)$. The number of queries in this subgroup is $\binom{4}{2}$, corresponding to all combinations of 2 undesired symbols.

Back to the calculation of the number of stages:

To be able to cancel the undesired symbols from an $i$-term sum, the user needs to download these undesired symbols as side information in the previous rounds. Hence, round $i$ requires downloading $\binom{P}{1}$ stages in round $(i-1)$, $\binom{P}{2}$ stages in round $(i-2)$, and so on. These stages are downloaded from the remaining $(N-1)$ databases. Then, each database needs to download $\frac{1}{N-1}\binom{P}{1}$ stages in round $(i-1)$, $\frac{1}{N-1}\binom{P}{2}$ stages in round $(i-2)$, and so on.

Table 4: The query table for the case $M = 5$, $P = 2$, $N = 2$. Both databases receive queries of the same form, with different symbol indices.
Round 1, stages 1 to 5 (each stage): $a$, $b$, $c$, $d$, $e$
Round 2, stages 1 to 2 (each stage): $a+b$, $a+c$, $a+d$, $a+e$, $b+c$, $b+d$, $b+e$, $c+d$, $c+e$, $d+e$
Round 3, stage 1: $a+b+c$, $a+b+d$, $a+b+e$, $a+c+d$, $a+c+e$, $a+d+e$, $b+c+d$, $b+c+e$, $b+d+e$, $c+d+e$
Round 5, stage 1: $a+b+c+d+e$

Denote by $\alpha_i$ the number of stages in round $i$. Fix the number of stages in the last round (round $M$) to be $\alpha_M = (N-1)^{M-P}$. This choice ensures that the number of stages in any round is an integer. Note that in round $M$, the user downloads a sum of all $M$ messages; this requires side information in the form of the sum of the undesired $M - P$ messages. Hence, we suppress the rounds $M - P + 1$ through $M - 1$. The side information equations needed in round $M$ at each database are collected from the remaining $(N-1)$ databases. Then, the number of stages in round $(M - P)$ should be $(N-1)^{M-P-1}$. Therefore, we write

$$\alpha_M = (N-1)^{M-P} \tag{50}$$

$$\alpha_{M-1} = \cdots = \alpha_{M-P+1} = 0 \tag{51}$$

$$\alpha_{M-P} = (N-1)^{M-P-1} = \frac{1}{N-1}\alpha_M = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}\alpha_{M-P+i} \tag{52}$$

Now, in round $(M - P)$, each stage requires $\binom{P}{1}$ stages from round $(M - P - 1)$, $\binom{P}{2}$ stages from round $(M - P - 2)$, and so on, collected from the remaining $(N-1)$ databases. Continuing with the same argument, for each round, we write

$$\alpha_{M-P-1} = \frac{1}{N-1}\binom{P}{1}\alpha_{M-P} = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}\alpha_{M-P-1+i} \tag{53}$$

$$\alpha_{M-P-2} = \frac{1}{N-1}\binom{P}{1}\alpha_{M-P-1} + \frac{1}{N-1}\binom{P}{2}\alpha_{M-P} = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}\alpha_{M-P-2+i} \tag{54}$$

$$\vdots$$

$$\alpha_k = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}\alpha_{k+i} \tag{55}$$

Interestingly, this pattern closely resembles the output of an IIR filter $y[n]$ [22], with the difference equation

$$y[n] = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}y[n-i] \tag{56}$$

and with the initial conditions $y[-P] = (N-1)^{M-P}$, $y[-P+1] = \cdots = y[-1] = 0$. Note that the only difference between the two seemingly different settings is the orientation of the time axis: the number of stages is calculated backwards, in contrast to the output of this IIR filter. Hence, we can systematically obtain the number of stages at each round by observing the output of the IIR filter characterized by (56), and mapping it to the number of stages via $\alpha_k = y[(M-P)-k]$.

We note that for the special case $P = 1$, the number of stages can be obtained from the first-order filter

$$y[n] = \frac{1}{N-1}y[n-1] \tag{57}$$

whose output is $y[n] = (N-1)^{M-2-n}$. Then, the number of stages in round $k$ is $\alpha_k = y[M-1-k] = (N-1)^{k-1}$, which is exactly the number of stages used in [8]; in particular, if $N = 2$, then $\alpha_k = 1$ for all $k$.

5.3 General Achievable Scheme

Index preparation:

We calculate the number of stages needed in each round. This can be done systematically by finding the output of the IIR filter characterized by

$$y[n] = \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}y[n-i] \tag{58}$$

with the initial conditions $y[-P] = (N-1)^{M-P}$, $y[-P+1] = \cdots = y[-1] = 0$. The number of stages in round $i$ is $\alpha_i = y[(M-P)-i]$, as discussed in Section 5.2.

3. Initialization: From the first database, the user downloads one symbol from each message that belongs to the desired message set $\mathcal{P}$. The user sets the round index to $i = 1$.

4. Message symmetry: In round $i$, the user downloads sums of $i$ terms from different symbols from the first database. To satisfy the privacy constraint, the user should download an equal number of symbols from all messages. Therefore, the user downloads the remaining $\binom{M-P}{i}$ combinations in round $i$ from the undesired symbol set $\bar{\mathcal{P}}$. For example, in round 1, the user downloads one symbol from every undesired message, for a total of $\binom{M-P}{1} = M - P$ such symbols.

5. Repetition of stages: In the first database, the user repeats the operation in round $i$ according to the number of calculated stages $\alpha_i$. This results in downloading a total of $\alpha_i\binom{M-P}{i}$ undesired equations and $\alpha_i\left(\binom{M}{i} - \binom{M-P}{i}\right)$ desired equations.

6. Symmetry across databases: The user implements symmetry across databases by downloading $\alpha_i\binom{M-P}{i}$ new undesired equations and $\alpha_i\left(\binom{M}{i} - \binom{M-P}{i}\right)$ new desired equations from each database. These undesired equations will be used as side information in subsequent rounds. For example, in round 1, each database generates $\alpha_1(M-P)$ undesired equations in the form of individual symbols. Hence, each database can exploit up to $\alpha_1(N-1)(M-P)$ side information equations from the other $(N-1)$ databases.

7. Exploiting side information:

Until now, we did not specify how the desired equations are constructed. Since each stage in round $i$ can be categorized using Vandermonde's identity as in the previous section, we form the desired equations as a sum of the desired symbols and the undesired symbols that can be decoded from other databases in the former $(i-1)$ rounds. If the user sums two or more symbols from $\mathcal{P}$, the user downloads one new symbol from one message only, and the remaining symbols from $\mathcal{P}$ should be derived from other databases. Thus, in round $(i+1)$, the user mixes one symbol of $\mathcal{P}$ with the sum of $i$ undesired symbols from round $i$. This should be repeated for all $\binom{P}{1}$ desired symbols. Then, the user mixes each sum of 2 desired symbols with the sum of $(i-1)$ undesired symbols generated in the $(i-1)$th round. This should be repeated for all $\binom{P}{2}$ combinations of the desired symbols, and so on.

8. Repeating steps: Repeat steps 4, 5, 6, 7 by setting $i = i + 1$ until $i = M - P - 1$.

9. Last round: We note that rounds $M - P + 1$ to $M - 1$ are skipped, since $\alpha_{M-P+1} = \cdots = \alpha_{M-1} = 0$. In round $M$, which corresponds to summing all $M$ messages, the user mixes $P$ symbols from $\mathcal{P}$ (only one of them is new and the remaining are previously decoded from the other $(N-1)$ databases) and the mixture of $M - P$ undesired symbols that was generated in round $(M - P)$.

10. Shuffling the order of queries:

After preparing the query table, the order of the queries is shuffled uniformly, so that all possible orders of queries are equally likely regardless of $\mathcal{P}$.

Now, we verify that the proposed scheme satisfies the reliability and privacy constraints.

For the reliability: The scheme is designed to download the exact number of undesired equations that will be used as side information equations at subsequent rounds in other databases. Hence, each desired symbol at any round is mixed with a known mixture of symbols that can be decoded from other databases. Note that if the scheme encounters the case of having a mixture of desired symbols, only one of them is chosen to be new and the remaining symbols are downloaded previously from other databases. Thus, the reliability constraint is satisfied by canceling out the side information. (Check, for instance, in Table 4 that all of the downloads involving only undesired symbols from database 2 are used in database 1: the five sets of singles $c$, $d$, $e$; the two sets of sums of twos $c + d$, $c + e$, $d + e$; and the sum of threes $c + d + e$, all downloaded from database 2, are used as side information in database 1.)

For the privacy: The queries take the same form for any subset $\mathcal{P}$ of messages. This is true since all combinations of messages are generated by our scheme.

To calculate the achievable rate: From Vandermonde's identity $\binom{M}{i} = \sum_{p=0}^{P}\binom{P}{p}\binom{M-P}{i-p}$, round $i$ requires downloading $\binom{P}{p}$ stages in round $(i-p)$. These stages should be downloaded from the remaining $(N-1)$ databases. Hence, as shown in the previous section, the number of stages at each round is calculated as the output of an IIR filter whose input-output relation is given in (56), with the initial conditions $y[-P] = (N-1)^{M-P}$, $y[-P+1] = \cdots = y[-1] = 0$, and with the conversion of the time index of the filter to the round index of the scheme as $\alpha_i = y[(M-P)-i]$. These initial conditions imply that the user downloads $(N-1)^{M-P}$ stages in the last round, which corresponds to downloading the sum of all messages. The $(P-1)$ rounds preceding the last round are suppressed, since only sums of the $(M-P)$ undesired messages are to be used in the last round.

Now, to calculate the number of stages for round $i$, we first solve for the roots of the characteristic equation of (56) [22],

$$r^P - \frac{1}{N-1}\sum_{i=1}^{P}\binom{P}{i}r^{P-i} = 0 \tag{59}$$

which is equivalent to

$$r^P - \frac{r^P}{N-1}\sum_{i=1}^{P}\binom{P}{i}r^{-i} = 0 \tag{60}$$

which further reduces to

$$r^P - \frac{r^P}{N-1}\left[\left(1 + \frac{1}{r}\right)^P - 1\right] = 0 \tag{61}$$

using the binomial theorem. Simplifying (61), we have

$$N r^P - (r+1)^P = 0 \tag{62}$$

By applying the bijective mapping $t = N^{1/P} \cdot \frac{r}{r+1}$, (62) is equivalent to $t^P = 1$. The roots of this equation are the roots of unity, i.e., $t_k = e^{j2\pi(k-1)/P}$, $k = 1, \cdots, P$, where $j = \sqrt{-1}$. Hence, the roots of the characteristic equation are given by

$$r_k = \frac{t_k}{N^{1/P} - t_k} = \frac{e^{j2\pi(k-1)/P}}{N^{1/P} - e^{j2\pi(k-1)/P}}, \quad k = 1, \cdots, P \tag{63}$$

Thus, the complete response of the IIR filter is given by $y[n] = \sum_{i=1}^{P}\gamma_i r_i^n$, where $\gamma_i$ are constants that result from solving the initial conditions, i.e., $\gamma = (\gamma_1, \cdots, \gamma_P)^T$ is the solution of the system of equations

$$\begin{bmatrix} r_1^{-P} & r_2^{-P} & \cdots & r_P^{-P} \\ r_1^{-P+1} & r_2^{-P+1} & \cdots & r_P^{-P+1} \\ \vdots & \vdots & \cdots & \vdots \\ r_1^{-1} & r_2^{-1} & \cdots & r_P^{-1} \end{bmatrix} \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_P \end{bmatrix} = \begin{bmatrix} (N-1)^{M-P} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{64}$$

Now, we are ready to calculate the number of stages $\alpha_k$ in round $k$. Since $\alpha_k = y[(M-P)-k]$ by construction,

$$\alpha_k = \sum_{i=1}^{P}\gamma_i r_i^{M-P-k} \tag{65}$$

In round $k$, the user downloads sums of $k$ symbols. The user repeats this round for $\alpha_k$ stages. Each stage contains all the combinations of any $k$ symbols, of which there are $\binom{M}{k}$. Hence, the total download cost $D$ is

$$D = \sum_{k=1}^{M}\binom{M}{k}\alpha_k \tag{66}$$

$$= \sum_{k=1}^{M}\sum_{i=1}^{P}\binom{M}{k}\gamma_i r_i^{M-P-k} \tag{67}$$

$$= \sum_{i=1}^{P}\gamma_i r_i^{M-P}\sum_{k=1}^{M}\binom{M}{k}r_i^{-k} \tag{68}$$

$$= \sum_{i=1}^{P}\gamma_i r_i^{M-P}\left[\left(1 + \frac{1}{r_i}\right)^M - 1\right] \tag{69}$$

Considering the undesired equations: in round $k$, the user downloads all combinations of the $(M-P)$ undesired messages, of which there are $\binom{M-P}{k}$. Therefore, similar to the above calculation, the total number of undesired equations $U$ is

$$U = \sum_{i=1}^{P}\gamma_i r_i^{M-P}\left[\left(1 + \frac{1}{r_i}\right)^{M-P} - 1\right] \tag{70}$$

Hence, the achievable rate $\underline{R}_s$ is

$$\underline{R}_s = \frac{D - U}{D} \tag{71}$$

$$= \frac{\sum_{i=1}^{P}\gamma_i r_i^{M-P}\left[\left(1 + \frac{1}{r_i}\right)^M - \left(1 + \frac{1}{r_i}\right)^{M-P}\right]}{\sum_{i=1}^{P}\gamma_i r_i^{M-P}\left[\left(1 + \frac{1}{r_i}\right)^M - 1\right]} \tag{72}$$

which is (31) in Theorem 2.

5.5 Further Examples for the Case $P \leq \frac{M}{2}$

In this section, we illustrate our proposed scheme with a few additional basic examples. In Section 5.1, we considered the case $M = 5$, $P = 2$, $N = 2$. In the next three sub-sections, we consider three more examples.
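The stage counts and sum rates of these examples can be reproduced without solving for the characteristic roots, by running the backward recursion (55) directly and then summing as in (66), (70) and (71). The following is a minimal sketch in Python; the function names `stage_counts` and `sum_rate` are hypothetical, and exact rational arithmetic is used here only to confirm that every stage count comes out as an integer.

```python
from fractions import Fraction
from math import comb


def stage_counts(M, P, N):
    """Number of stages alpha_1..alpha_M via the backward recursion (55)."""
    alpha = [Fraction(0)] * (M + 1)            # alpha[0] is unused padding
    alpha[M] = Fraction((N - 1) ** (M - P))    # last round, as in (50)
    # Rounds M-P+1 .. M-1 stay at zero, as in (51).
    for k in range(M - P, 0, -1):
        total = sum(comb(P, i) * alpha[k + i] for i in range(1, P + 1))
        alpha[k] = total / (N - 1)
    return alpha[1:]


def sum_rate(M, P, N):
    """Return (rate, D, U) per database, following (66), (70) and (71)."""
    alpha = stage_counts(M, P, N)
    # D: all sums of k symbols, repeated alpha_k times, over every round.
    D = sum(comb(M, k) * alpha[k - 1] for k in range(1, M + 1))
    # U: the sums that contain undesired symbols only.
    U = sum(comb(M - P, k) * alpha[k - 1] for k in range(1, M - P + 1))
    return (D - U) / D, D, U


if __name__ == "__main__":
    for M, P, N in [(5, 2, 2), (4, 2, 2), (5, 2, 3), (7, 3, 3)]:
        rate, D, U = sum_rate(M, P, N)
        stages = [int(a) for a in stage_counts(M, P, N)]
        print((M, P, N), stages, "D =", D, "U =", U, "rate =", rate)
```

For $M = 5$, $P = 2$, $N = 2$ this gives the stage counts $5, 2, 1, 0, 1$ and $D = 56$ downloads per database, in agreement with Section 5.1. The closed form (65) should return the same counts; the recursion is used in this sketch because it avoids complex arithmetic.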
In the example in Section 5.5.1, the ratio $\frac{M}{P}$ is exactly equal to 2; thus, both the achievable scheme here and the achievable scheme in Section 4 can be used, and we comment on the differences and advantages of both schemes. In the example in Section 5.5.2, we present the case of a larger $N$ for the example in Section 5.1. In the example in Section 5.5.3, we present a case with larger $M$, $P$ and $N$.

5.5.1 M = 4 Messages, P = 2 Messages, N = 2 Databases

The first step of the achievable scheme is to identify the number of stages needed for each round of download. The IIR filter in (56) that determines the number of stages reduces in this case to

$$y[n] = 2y[n-1] + y[n-2] \tag{73}$$

with the initial conditions $y[-2] = 1$, $y[-1] = 0$. The number of stages in round $k$ is $\alpha_k = y[2-k]$. Since $M$ is small, we can calculate the output iteratively without using the canonical filter output as

$$\alpha_4 = y[-2] = 1 \tag{74}$$

$$\alpha_3 = y[-1] = 0 \tag{75}$$

$$\alpha_2 = y[0] = 2y[-1] + y[-2] = 1 \tag{76}$$

$$\alpha_1 = y[1] = 2y[0] + y[-1] = 2 \tag{77}$$

Hence, we should download 2 stages of individual symbols (round 1), and 1 stage of sums of two symbols (round 2). We should suppress the round that retrieves sums of three symbols (round 3), and have 1 stage of sums of all four symbols (round 4).

The user initializes the scheme by randomly and independently interleaving the symbols of each message. The query table for this example is shown in Table 5. In round 1, the user downloads individual symbols from all messages at each database. The user downloads two symbols from each of $a$, $b$, $c$, $d$ from database 1, as $\alpha_1 = 2$. This is repeated for database 2.

Table 5: The query table for the case $M = 4$, $P = 2$, $N = 2$. Both databases receive queries of the same form, with different symbol indices.
Round 1, stages 1 to 2 (each stage): $a$, $b$, $c$, $d$
Round 2, stage 1: $a+b$, $a+c$, $a+d$, $b+c$, $b+d$, $c+d$
Round 4, stage 1: $a+b+c+d$

In round 2, the user downloads sums of two symbols. There are $\binom{4}{2} = 6$ such equations. At database 1, the undesired symbols from database 2 in the first round are exploited in some of these sums. These equations are either in the form $a + (c, d)$ or in the form $b + (c, d)$. This necessitates two sets of different individual symbols to be downloaded from database 2 in the first round; otherwise, the symbols are repeated and privacy is compromised. Moreover, we note that the user downloads a sum $a + b$ that uses a previously decoded symbol of $b$ as side information even though $W_2$ is desired; this is reversed in database 2, where the user downloads $a + b$ with a symbol of $a$ as side information, to have a symmetric scheme. Round 2 concludes with downloading a sum of the form $c + d$ at each of the two databases, which will be used as side information in the last round. Round 3 is skipped and the user proceeds to round 4 (the last round) directly. In round 4, the user downloads the sum of four symbols, and uses the side information downloaded in round 2 and any decoded symbols for the other desired message.
For example, in database 1, the user downloads $a + b + c + d$; hence, the side information $c + d$ is exploited in this round, as well as a previously decoded symbol of $a$. The user finishes the scheme by shuffling the order of all queries randomly. The user retrieves $a_1, \cdots, a_{10}$ and $b_1, \cdots, b_{10}$ privately in 30 downloads (15 from each database) and achieves a sum rate of $\frac{20}{30} = \frac{2}{3} = \frac{1}{1 + \frac{1}{N}}$, which matches the upper bound in Theorem 2. This sum rate outperforms the repetition-based achievable rate in (12).

We note that this case can be solved using the achievable scheme presented in Section 4 as well, since $\frac{M}{P} = 2$ in this case. In fact, this is equivalent to the case considered in Section 4.4.2, if the number of databases is reduced from $N = 3$ to $N = 2$. Starting from Table 3 in Section 4.4.2 and removing the downloads from database 3, we obtain the query table that uses MDS-coded queries, shown in Table 6 below. Via the scheme in Table 6, the user retrieves $a_1, \cdots, a_4$ and $b_1, \cdots, b_4$ privately in 12 downloads (6 from each database), therefore achieving the same optimal sum rate of $\frac{8}{12} = \frac{2}{3} = \frac{1}{1 + \frac{1}{N}}$.

Table 6: Alternative query table for the case $M = 4$, $P = 2$, $N = 2$. Each database answers the $M = 4$ individual symbols $a$, $b$, $c$, $d$, followed by $P = 2$ MDS-coded mixtures of all four messages.

We presented this case here, even though it could be solved using the scheme in Section 4, in order to give an example where the second achievable scheme achieves the upper bound in Theorem 2 and yields a capacity result since $\frac{M}{P}$ is an integer. Interestingly, we observe that for all cases where $P = \frac{M}{2}$, the two achievable schemes are both optimal. The two schemes present an interesting trade-off between the field size and the upload cost: The first achievable scheme in Section 4 requires using an MDS code with field size $q \geq M$, but the number of queries for each database is limited to $M + P$. On the other hand, the second achievable scheme here in Section 5 does not use any coding and can work with the storage field size; however, the number of queries increases exponentially, since the number of stages for each round is related to an unstable IIR filter.

5.5.2 M = 5 Messages, P = 2 Messages, N = 3 Databases

In this example, we show an explicit query structure for $N > 2$. In this case, the corresponding difference equation for the IIR filter is

$$y[n] = y[n-1] + \frac{1}{2}y[n-2] \tag{78}$$

with the initial conditions $y[-1] = 0$, $y[-2] = (N-1)^{M-P} = 8$. Thus, the numbers of stages in each round are $\alpha_1 = 6$, $\alpha_2 = 4$, $\alpha_3 = 4$, $\alpha_4 = 0$, $\alpha_5 = 8$. The query table is shown in Tables 7, 8 and 9. This scheme retrieves $a_1, \cdots, a_{126}$ and $b_1, \cdots, b_{126}$ privately in 354 downloads (118 from each database), therefore achieving a sum rate of $\frac{252}{354} = \frac{42}{59} < \frac{1}{1 + \frac{1}{N} + \frac{1}{2N^2}} = \frac{18}{25}$. The gap is $\simeq 0.0081$.

5.5.3 M = 7 Messages, P = 3 Messages, N = 3 Databases

Finally, in this section, we consider an example with $N = 3$ databases and larger $M$ and $P$ than in the previous examples, where we describe the structure and the calculation of the number of queries without specifying the explicit query table, as it grows quite large. We first calculate the number of stages at each round. The corresponding IIR filter is

$$y[n] = \frac{1}{2}\left(3y[n-1] + 3y[n-2] + y[n-3]\right) \tag{79}$$

with the initial conditions $y[-3] = (N-1)^{M-P} = 16$, $y[-2] = 0$, $y[-1] = 0$. Hence, the numbers of stages for each round, $\alpha_k = y[4-k]$, $k = 1, \cdots, 7$, are calculated iteratively as $\alpha_1 = 67$, $\alpha_2 = 30$, $\alpha_3 = 12$, $\alpha_4 = 8$, $\alpha_5 = 0$, $\alpha_6 = 0$, $\alpha_7 = 16$.

In round 1, the user downloads 67 individual symbols from each message and from each database. Each database can use the side information generated by the other two databases. Hence, each database has $67 \cdot 2 = 134$ side information stages from round 1.

In round 2, the user downloads sums of two symbols. Each stage of round 2 requires 3 stages from round 1, for the $a + (d, e, f, g)$, $b + (d, e, f, g)$ or $c + (d, e, f, g)$ cases. Then, round 2 requires $30 \cdot 3 = 90$ stages from round 1, which leaves $134 - 90 = 44$ more stages of round 1. Each database can use the side information stages from the other two databases, i.e., each can use up to $2 \cdot 30 = 60$ stages of side information in the form of sums of two.

In round 3, the user downloads sums of three symbols, which can be of the form $a + b + (d, e, f, g)$, $a + c + (d, e, f, g)$, $b + c + (d, e, f, g)$, or $a + (d + e, d + f, \cdots)$, and similarly for $b$, $c$. Therefore, each stage in round 3 requires 3 stages from round 2, and 3 stages from round 1. This in total requires $12 \cdot 3 = 36$ stages from round 2 and $12 \cdot 3 = 36$ stages from round 1. Each database can use up to $2 \cdot 12 = 24$ stages of side information in the form of sums of threes.

In round 4, the user downloads sums of 4 symbols, which can be of the form $a + b + (d + e, d + f, \cdots)$, and similarly for $b + c$ and $a + c$; $a + (d + e + f, d + e + g, \cdots)$, and similarly for $b$, $c$; or $a + b + c + (d, e, f, g)$. This means that for each stage of round 4, the user needs 1 stage of round 1, 3 stages of round 2, and 3 stages of round 3. This in total requires $8 \cdot 1 = 8$ stages of round 1, $8 \cdot 3 = 24$ stages of round 2, and $8 \cdot 3 = 24$ stages of round 3. Summing over the rounds as in (66) and (70), each database answers 1815 queries, 504 of which contain only undesired symbols. Hence, the achievable sum rate is $\frac{1311}{1815} = \frac{437}{605} < \frac{1}{1 + \frac{1}{N} + \frac{1}{3N^2}} = \frac{27}{37}$. The gap is $\simeq 0.0074$.

Tables 7, 8 and 9: The query table for the case $M = 5$, $P = 2$, $N = 3$. All three databases receive queries of the same form, with different symbol indices.
Round 1, stages 1 to 6 (each stage): $a$, $b$, $c$, $d$, $e$
Round 2, stages 1 to 4 (each stage): $a+b$, $a+c$, $a+d$, $a+e$, $b+c$, $b+d$, $b+e$, $c+d$, $c+e$, $d+e$
Round 3, stages 1 to 4 (each stage): $a+b+c$, $a+b+d$, $a+b+e$, $a+c+d$, $a+c+e$, $a+d+e$, $b+c+d$, $b+c+e$, $b+d+e$, $c+d+e$
Round 5, stages 1 to 8 (each stage): $a+b+c+d+e$

6 Converse Proof

In this section, we derive an upper bound for the MPIR problem. The derived upper bound is tight when $P \geq \frac{M}{2}$ and when $\frac{M}{P} \in \mathbb{N}$. We follow the notations and simplifications in [8, 12], and we define

$$\mathcal{Q} \triangleq \left\{Q_n^{[\mathcal{P}]} : \mathcal{P} \subseteq \{1, \cdots, M\},\ |\mathcal{P}| = P,\ n \in \{1, \cdots, N\}\right\} \tag{80}$$

$$A_{n_1:n_2}^{[\mathcal{P}]} \triangleq \left\{A_{n_1}^{[\mathcal{P}]}, A_{n_1+1}^{[\mathcal{P}]}, \cdots, A_{n_2}^{[\mathcal{P}]}\right\}, \quad n_1 \leq n_2,\ n_1, n_2 \in \{1, \cdots, N\} \tag{81}$$

Without loss of generality, the following simplifications hold for the MPIR problem:

1. We can assume that the MPIR scheme is symmetric, since for every asymmetric scheme, there exists an equal-rate symmetric scheme that can be constructed by replicating all permutations of databases and messages.

2. To invoke the privacy constraint, we fix the response of one database to be the same irrespective of the desired set of messages $\mathcal{P}$, i.e., $A_n^{[\mathcal{P}_i]} = A_n$, where $|\mathcal{P}_i| = P$ for every $i \in \{1, 2, \cdots, \beta\}$, for some $n \in \{1, \cdots, N\}$, and $\beta = \binom{M}{P}$. No loss of generality is incurred, due to the fact that the queries and the answers are statistically independent of $\mathcal{P}$. In the sequel, we fix the answer string of the first database, i.e.,

$$A_1^{[\mathcal{P}]} = A_1, \quad \forall \mathcal{P} \tag{82}$$

The following lemma is a consequence of the symmetry assumption; its proof can be found in [8].

Lemma 1 (Symmetry [8])

For any W S = { W i : i ∈ S} H ( A [ P ] n | W S , Q ) = H ( A [ P ]1 | W S , Q ) , n ∈ { , · · · , N } (83) H ( A |Q ) = H ( A [ P ] n |Q ) , n ∈ { , · · · , N } , ∀P (84)We construct the converse proof by induction over ⌊ MP ⌋ in a similar way to [8, 12]. Thebase induction step is obtained for 1 ≤ MP ≤ P ≥ M as it was referred to sofar, where the user wants to retrieve at least half of the messages). We obtain an inductiverelation for the case MP >

2. The converse proof extends the proof in [8] for

P > .1 Converse Proof for the Case ≤ MP ≤ To prove the converse for the case 1 ≤ MP ≤

2, we need the following lemma which gives alower bound on the interference within an answer string.

Lemma 2 (Interference Lower Bound)

For the MPIR problem with $P \geq \frac{M}{2}$, the uncertainty of the interfering messages $W_{P+1:M}$ within the answer string $A_1^{[1:P]}$ is lower bounded as

\begin{align}
H(A_1^{[1:P]} \mid W_{1:P}, \mathcal{Q}) \geq \frac{(M-P)L}{N} \tag{85}
\end{align}

Furthermore, (85) is true for any set of desired messages $\mathcal{P}$ with $|\mathcal{P}| = P$, i.e.,

\begin{align}
H(A_1^{[\mathcal{P}]} \mid W_{\mathcal{P}}, \mathcal{Q}) \geq \frac{(M-P)L}{N} \tag{86}
\end{align}

Proof:

For clarity of presentation, we assume that $\mathcal{P} = \{1, \cdots, P\}$ without loss of generality. Hence,

\begin{align}
(M-P)L &= H(W_{P+1:M}) \tag{87} \\
&= H(W_{P+1:M} \mid W_{1:P}, \mathcal{Q}) \tag{88} \\
&= H(W_{P+1:M} \mid W_{1:P}, \mathcal{Q}) - H(W_{P+1:M} \mid A_{1:N}^{[M-P+1:M]}, W_{1:P}, \mathcal{Q}) \tag{89} \\
&= I(W_{P+1:M}; A_{1:N}^{[M-P+1:M]} \mid W_{1:P}, \mathcal{Q}) \tag{90} \\
&= H(A_{1:N}^{[M-P+1:M]} \mid W_{1:P}, \mathcal{Q}) \tag{91} \\
&\leq \sum_{n=1}^{N} H(A_n^{[M-P+1:M]} \mid W_{1:P}, \mathcal{Q}) \tag{92} \\
&= N H(A_1 \mid W_{1:P}, \mathcal{Q}) \tag{93}
\end{align}

where (88) follows from the independence of the messages $W_{P+1:M}$ from the messages $W_{1:P}$ and the queries as in (2) and (3); (89) follows from the reliability constraint (7), since messages $W_{P+1:M}$ can be decoded correctly from the answer strings $A_{1:N}^{[M-P+1:M]}$ if $P \geq \frac{M}{2}$, as $\{P+1, \cdots, M\} \subseteq \{M-P+1, \cdots, M\}$ in this regime; (91) follows from the fact that the answer strings are deterministic functions of all messages and queries $(\mathcal{Q}, W_{1:M})$; and (93) follows from the independence bound and Lemma 1.

Consequently, $H(A_1 \mid W_{1:P}, \mathcal{Q}) \geq \frac{(M-P)L}{N}$. The proof of the general statement can be done by replacing $W_{1:P}$ by $W_{\mathcal{P}}$, $W_{P+1:M}$ by $W_{\bar{\mathcal{P}}}$, which corresponds to the complement set of messages of $W_{\mathcal{P}}$, and the answer strings $A_{1:N}^{[M-P+1:M]}$ by $A_{1:N}^{[\mathcal{P}^*]}$, where $\bar{\mathcal{P}} \subseteq \mathcal{P}^*$, $|\mathcal{P}^*| = P$. $\blacksquare$

Now, we are ready to prove the converse of the case $P \geq \frac{M}{2}$.
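We use a similar converse technique to the case of $M = 2$, $P = 1$ in [8],

\begin{align}
ML &= H(W_{1:M}) \tag{94} \\
&= H(W_{1:M} \mid \mathcal{Q}) \tag{95} \\
&= H(W_{1:M} \mid \mathcal{Q}) - H(W_{1:M} \mid A_{1:N}^{[\mathcal{P}_1]}, A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]}, \mathcal{Q}) \tag{96} \\
&= I(W_{1:M}; A_{1:N}^{[\mathcal{P}_1]}, A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid \mathcal{Q}) \tag{97} \\
&= H(A_{1:N}^{[\mathcal{P}_1]}, A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid \mathcal{Q}) \tag{98} \\
&= H(A_1, A_{2:N}^{[\mathcal{P}_1]}, A_{2:N}^{[\mathcal{P}_2]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid \mathcal{Q}) \tag{99} \\
&= H(A_1, A_{2:N}^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, \mathcal{Q}) \tag{100} \\
&= H(A_1, A_{2:N}^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, W_{\mathcal{P}_1}, \mathcal{Q}) \tag{101} \\
&\leq \sum_{n=1}^{N} H(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, W_{\mathcal{P}_1}, \mathcal{Q}) \tag{102} \\
&= \sum_{n=1}^{N} H(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) - H(A_1 \mid W_{\mathcal{P}_1}, \mathcal{Q}) \tag{103}
\end{align}

where (95) follows from the independence between the messages and the queries; (96) follows from the reliability constraint in (7), with $A_{1:N}^{[\mathcal{P}_1]}, A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]}$ representing all answer strings from all databases to any subset of messages $\mathcal{P} \subseteq \{1, \cdots, M\}$; (98) follows from the fact that answer strings are deterministic functions of the messages and the queries; (99) follows from simplification (82) without loss of generality; (101) follows from the fact that the messages $W_{\mathcal{P}_1} = (W_{i_1}, W_{i_2}, \cdots, W_{i_P})$ can be reconstructed from $A_{1:N}^{[\mathcal{P}_1]}$; and (102) is a consequence of the fact that conditioning does not increase entropy and Lemma 1.

Now, every message appears in $\binom{M-1}{P-1}$ different message subsets of size $P$, therefore the answer strings $(A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]})$ are sufficient to construct all messages $W_{1:M}$ irrespective of $\mathcal{P}_1$.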
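Therefore,

\begin{align}
H(A_{1:N}^{[\mathcal{P}_2]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) = (M-P)L \tag{104}
\end{align}

Using this and Lemma 2 in (103) yields

\begin{align}
ML \leq \sum_{n=1}^{N} H(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}) + (M-P)L - \frac{(M-P)L}{N} \tag{105}
\end{align}

which can be written as

\begin{align}
PL + \frac{(M-P)L}{N} \leq \sum_{n=1}^{N} H(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}) \tag{106}
\end{align}

which further can be written as

\begin{align}
\left(1 + \frac{M-P}{PN}\right) PL \leq \sum_{n=1}^{N} H(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}) \tag{107}
\end{align}

which leads to the desired converse result,

\begin{align}
\sum_{i=1}^{P} R_i = \frac{PL}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}_1]}\right)} \leq \frac{PL}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}_1]} \mid \mathcal{Q}\right)} \leq \frac{1}{1 + \frac{M-P}{PN}} \tag{108}
\end{align}

Converse Proof for the Case $\frac{M}{P} > 2$

In the sequel, we derive an inductive relation that can be used in addition to the base induction step of $1 \leq \frac{M}{P} \leq 2$. The single-message converse in [8] developed a base converse step for $M = 2$ messages, and developed an induction over the number of messages $M$ for the case $M > 2$. Here, we have developed a base converse step for $1 \leq \frac{M}{P} \leq 2$, and now develop an induction over $\lfloor \frac{M}{P} \rfloor$ for the case $\frac{M}{P} > 2$. We first need the following lemma, which bounds the uncertainty that remains in the answer strings after conditioning on $P$ of the interference messages.

Lemma 3 (Interference Conditioning Lemma)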

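The remaining uncertainty in the answer strings $A_{2:N}^{[\mathcal{P}_2]}$ after conditioning on the messages indexed by $\mathcal{P}_1$, such that $\mathcal{P}_1 \cap \mathcal{P}_2 = \emptyset$, $|\mathcal{P}_1| = |\mathcal{P}_2| = P$, is upper bounded by

\begin{align}
H(A_{2:N}^{[\mathcal{P}_2]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) \leq (N-1)\left[N H(A_1 \mid \mathcal{Q}) - PL\right] \tag{109}
\end{align}

Proof:

We begin with

\begin{align}
H(A_{2:N}^{[\mathcal{P}_2]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) &\leq \sum_{n=2}^{N} H(A_n^{[\mathcal{P}_2]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) \tag{110} \\
&\leq \sum_{n=2}^{N} H(A_{1:n-1}^{[\mathcal{P}_1]}, A_n^{[\mathcal{P}_2]}, A_{n+1:N}^{[\mathcal{P}_1]} \mid W_{\mathcal{P}_1}, \mathcal{Q}) \tag{111} \\
&= \sum_{n=2}^{N} H(A_{1:n-1}^{[\mathcal{P}_1]}, A_n^{[\mathcal{P}_2]}, A_{n+1:N}^{[\mathcal{P}_1]}, W_{\mathcal{P}_1} \mid \mathcal{Q}) - H(W_{\mathcal{P}_1} \mid \mathcal{Q}) \tag{112} \\
&= \sum_{n=2}^{N} H(A_{1:n-1}^{[\mathcal{P}_1]}, A_n^{[\mathcal{P}_2]}, A_{n+1:N}^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(W_{\mathcal{P}_1} \mid A_{1:n-1}^{[\mathcal{P}_1]}, A_n^{[\mathcal{P}_2]}, A_{n+1:N}^{[\mathcal{P}_1]}, \mathcal{Q}) - H(W_{\mathcal{P}_1}) \tag{113} \\
&\leq \sum_{n=2}^{N} N H(A_1 \mid \mathcal{Q}) - H(W_{\mathcal{P}_1}) \tag{114} \\
&= (N-1)\left[N H(A_1 \mid \mathcal{Q}) - PL\right] \tag{115}
\end{align}

where (110) follows from the independence bound; (111) follows from the non-negativity of entropy; (113) follows from the statistical independence between the messages and the queries; and (114) follows from the decodability of $W_{\mathcal{P}_1}$ given the answer strings $(A_{1:n-1}^{[\mathcal{P}_1]}, A_n^{[\mathcal{P}_2]}, A_{n+1:N}^{[\mathcal{P}_1]})$, which is tantamount to the privacy constraint as in the second simplification. $\blacksquare$

Now, we derive the inductive relation for $\frac{M}{P} > 2$. Without loss of generality, let $\mathcal{P}_1 = \{1, \cdots, P\}$ and $\mathcal{P}_2 = \{P+1, \cdots, 2P\}$. Then, starting from (99), we write

\begin{align}
ML &= H(A_1, A_{2:N}^{[\mathcal{P}_1]}, A_{2:N}^{[\mathcal{P}_2]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid \mathcal{Q}) \tag{116} \\
&= H(A_1, A_{2:N}^{[\mathcal{P}_1]} \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_3]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, A_{2:N}^{[\mathcal{P}_2]}, \mathcal{Q}) \tag{117} \\
&\leq N H(A_1 \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, W_{1:P}, \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_3]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, A_{2:N}^{[\mathcal{P}_1]}, A_{2:N}^{[\mathcal{P}_2]}, W_{1:2P}, \mathcal{Q}) \tag{118} \\
&\leq N H(A_1 \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]} \mid W_{1:P}, \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_3]}, \cdots, A_{2:N}^{[\mathcal{P}_\beta]} \mid A_1, W_{1:2P}, \mathcal{Q}) \tag{119} \\
&= N H(A_1 \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]} \mid W_{1:P}, \mathcal{Q}) + H(A_{1:N}^{[\mathcal{P}_3]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid W_{1:2P}, \mathcal{Q}) - H(A_1 \mid W_{1:2P}, \mathcal{Q}) \tag{120} \\
&= N H(A_1 \mid \mathcal{Q}) + H(A_{2:N}^{[\mathcal{P}_2]} \mid W_{1:P}, \mathcal{Q}) + (M-2P)L - H(A_1 \mid W_{1:2P}, \mathcal{Q}) \tag{121} \\
&\leq N H(A_1 \mid \mathcal{Q}) + (N-1)\left[N H(A_1 \mid \mathcal{Q}) - PL\right] + (M-2P)L - H(A_1 \mid W_{1:2P}, \mathcal{Q}) \tag{122}
\end{align}

where (118) follows from the decodability of $W_{1:2P}$ given $(A_1, A_{2:N}^{[\mathcal{P}_1]}, A_{2:N}^{[\mathcal{P}_2]})$, the symmetry lemma and the independence bound; (119) follows from the fact that conditioning does not increase entropy. In (121), we note that subsets $(\mathcal{P}_3, \cdots, \mathcal{P}_\beta)$ include all messages $(W_1, \cdots, W_M)$ because every message appears in $\binom{M-1}{P-1}$ subsets. Hence, $H(A_{1:N}^{[\mathcal{P}_3]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]} \mid W_{1:2P}, \mathcal{Q}) = (M-2P)L$, since $W_{2P+1:M}$ is decodable from $(A_{1:N}^{[\mathcal{P}_3]}, \cdots, A_{1:N}^{[\mathcal{P}_\beta]})$ after knowing $W_{1:2P}$. Finally, (122) follows from the interference conditioning lemma.

Consequently, (122) can be written as

\begin{align}
N^2 H(A_1 \mid \mathcal{Q}) \geq (N+1) PL + H(A_1 \mid W_{1:2P}, \mathcal{Q}) \tag{123}
\end{align}

which is equivalent to

\begin{align}
N H(A_1 \mid \mathcal{Q}) \geq \left(1 + \frac{1}{N}\right) PL + \frac{1}{N^2} N H(A_1 \mid W_{1:2P}, \mathcal{Q}) \tag{124}
\end{align}

Now, (124) constructs an inductive relation, since evaluating $N H(A_1 \mid W_{1:2P}, \mathcal{Q})$ is the same as $N H(A_1 \mid \mathcal{Q})$ with $(M-2P)$ messages, i.e., the problem of MPIR with $M$ messages for fixed $P$ is reduced to an MPIR problem with $(M-2P)$ messages for the same fixed $P$. We note that (124) generalizes the inductive relation in [8] for $P = 1$.

We can write the induction hypothesis for MPIR with $M$ messages as

\begin{align}
N H(A_1 \mid \mathcal{Q}) \geq PL \left[ \sum_{i=0}^{\lfloor \frac{M}{P} \rfloor - 1} \frac{1}{N^i} + \left( \frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M}{P} \rfloor}} \right] \tag{125}
\end{align}

Next, we proceed with proving this relation for $M+1$ messages. From the induction hypothesis, we have

\begin{align}
N H(A_1 \mid W_{1:2P}, \mathcal{Q}) &\geq PL \left[ \sum_{i=0}^{\lfloor \frac{M-2P+1}{P} \rfloor - 1} \frac{1}{N^i} + \left( \frac{M-2P+1}{P} - \left\lfloor \frac{M-2P+1}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M-2P+1}{P} \rfloor}} \right] \tag{126} \\
&= PL \left[ \sum_{i=0}^{\lfloor \frac{M+1}{P} \rfloor - 3} \frac{1}{N^i} + \left( \frac{M+1}{P} - \left\lfloor \frac{M+1}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M+1}{P} \rfloor - 2}} \right] \tag{127}
\end{align}

Substituting this in (124),

\begin{align}
N H(A_1 \mid \mathcal{Q}) &\geq \left(1 + \frac{1}{N}\right) PL + \frac{PL}{N^2} \left[ \sum_{i=0}^{\lfloor \frac{M+1}{P} \rfloor - 3} \frac{1}{N^i} + \left( \frac{M+1}{P} - \left\lfloor \frac{M+1}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M+1}{P} \rfloor - 2}} \right] \tag{128} \\
&= PL \left[ \sum_{i=0}^{\lfloor \frac{M+1}{P} \rfloor - 1} \frac{1}{N^i} + \left( \frac{M+1}{P} - \left\lfloor \frac{M+1}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M+1}{P} \rfloor}} \right] \tag{129}
\end{align}

which concludes the induction argument.

Consequently, the upper bound for the MPIR problem can be obtained as

\begin{align}
\sum_{i=1}^{P} R_i &= \frac{PL}{\sum_{n=1}^{N} H\left(A_n^{[\mathcal{P}]}\right)} \tag{130} \\
&\leq \frac{PL}{N H(A_1 \mid \mathcal{Q})} \tag{131} \\
&\leq \frac{1}{\sum_{i=0}^{\lfloor \frac{M}{P} \rfloor - 1} \frac{1}{N^i} + \left( \frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M}{P} \rfloor}}} \tag{132} \\
&= \left( \frac{1 - \left(\frac{1}{N}\right)^{\lfloor \frac{M}{P} \rfloor}}{1 - \frac{1}{N}} + \left( \frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor \right) \frac{1}{N^{\lfloor \frac{M}{P} \rfloor}} \right)^{-1} \tag{133}
\end{align}

where (132) follows from (129); and (133) follows from evaluating the sum in (132).

In this paper, we introduced the multi-message private information retrieval (MPIR) problem from an information-theoretic perspective. The problem generalizes the PIR problem in [8], which retrieves a single message privately.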
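As a numerical sanity check on the upper bound in (132), the following Python sketch evaluates it with exact fractions and compares it against the two closed-form capacity expressions stated in this paper. The function names are illustrative and do not come from the paper, and $N \geq 2$ is assumed throughout.

```python
from fractions import Fraction


def mpir_upper_bound(M, P, N):
    """Upper bound (132) on the MPIR sum rate, as an exact fraction.

    Hypothetical helper: M stored messages, P desired messages, N databases.
    """
    k = M // P  # floor(M / P)
    geometric = sum(Fraction(1, N**i) for i in range(k))
    remainder = (Fraction(M, P) - k) * Fraction(1, N**k)
    return 1 / (geometric + remainder)


def capacity_half(M, P, N):
    """Closed form 1 / (1 + (M - P) / (P N)), stated for P >= M/2."""
    return 1 / (1 + Fraction(M - P, P * N))


def capacity_multiple(M, P, N):
    """Closed form (1 - 1/N) / (1 - (1/N)^(M/P)), stated when P divides M."""
    r = Fraction(1, N)
    return (1 - r) / (1 - r ** (M // P))


# The bound collapses to the first closed form whenever P >= M/2 ...
for M in range(1, 9):
    for P in range((M + 1) // 2, M + 1):
        for N in range(2, 6):
            assert mpir_upper_bound(M, P, N) == capacity_half(M, P, N)

# ... and to the second closed form whenever M is an integer multiple of P.
for P in range(1, 5):
    for q in range(1, 5):
        for N in range(2, 6):
            assert mpir_upper_bound(q * P, P, N) == capacity_multiple(q * P, P, N)

print(mpir_upper_bound(5, 2, 2))  # 8/13
```

For the case $M = 5$, $P = 2$, $N = 2$, which falls in neither tight regime, the sketch gives an upper bound of $8/13$. Exact rational arithmetic is used here so that the equality checks are not affected by floating-point rounding.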
We determined the exact sum capacity for this problem when the number of desired messages is at least half of the number of total stored messages to be $C_s^P = \frac{1}{1 + \frac{M-P}{PN}}$. We showed that joint retrieval of the desired messages strictly outperforms repeating the single-message capacity achieving scheme for each message. Furthermore, we showed that if the total number of messages is an integer multiple of the number of desired messages, then the sum capacity is $C_s^P = \frac{1 - \frac{1}{N}}{1 - \left(\frac{1}{N}\right)^{M/P}}$, which resembles the single-message PIR capacity expression when the number of messages is $\frac{M}{P}$. For the remaining cases, we derived lower and upper bounds. We observed numerically that the gap between the lower and upper bounds decreases monotonically in $N$, and that the worst-case gap occurs for $N = 2$ when $M = 5$, $P = 2$.

References

[1] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan. Private information retrieval. Journal of the ACM, 45(6):965–981, 1998.
[2] S. Yekhanin. Private information retrieval. Communications of the ACM, 53(4):68–73, 2010.
[3] W. Gasarch. A survey on private information retrieval. In Bulletin of the EATCS, 2004.
[4] R. Ostrovsky and W. Skeith III. A survey of single-database private information retrieval: Techniques and applications. In International Workshop on Public Key Cryptography, pages 393–411. Springer, 2007.
[5] C. Cachin, S. Micali, and M. Stadler. Computationally private information retrieval with polylogarithmic communication. In International Conference on the Theory and Applications of Cryptographic Techniques. Springer, 1999.
[6] H. Sun and S. Jafar. Blind interference alignment for private information retrieval. 2016. Available at arXiv:1601.07885.
[7] S. Jafar. Blind interference alignment. IEEE Journal of Selected Topics in Signal Processing, 6(3):216–227, June 2012.
[8] H. Sun and S. Jafar. The capacity of private information retrieval. 2016. Available at arXiv:1602.09134.
[9] N. B. Shah, K. V. Rashmi, and K. Ramchandran. One extra bit of download ensures perfectly private information retrieval. In IEEE ISIT, June 2014.
[10] R. Tajeddine and S. El Rouayheb. Private information retrieval from MDS coded data in distributed storage systems. In IEEE ISIT, July 2016.
[11] T. Chan, S. Ho, and H. Yamamoto. Private information retrieval for coded storage. In IEEE ISIT, June 2015.
[12] K. Banawan and S. Ulukus. The capacity of private information retrieval from coded databases. IEEE Trans. on Info. Theory. Submitted September 2016. Also available at arXiv:1609.08138.
[13] H. Sun and S. Jafar. The capacity of robust private information retrieval with colluding databases. 2016. Available at arXiv:1605.00635.
[14] R. Freij-Hollanti, O. Gnilke, C. Hollanti, and D. Karpuk. Private information retrieval from coded databases with colluding servers. 2016. Available at arXiv:1611.02062.
[15] H. Sun and S. Jafar. The capacity of symmetric private information retrieval. 2016. Available at arXiv:1606.08828.
[16] Q. Wang and M. Skoglund. Symmetric private information retrieval for MDS coded distributed storage. 2016. Available at arXiv:1610.04530.
[17] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[18] R. Henry, Y. Huang, and I. Goldberg. One (block) size fits all: PIR and SPIR with variable-length records via multi-block queries. In NDSS, 2013.
[19] D. Demmler, A. Herzberg, and T. Schneider. RAID-PIR: Practical multi-server PIR. In ACM Workshop on Cloud Computing Security, pages 45–56, 2014.
[20] L. Wang, T. K. Kuppusamy, Y. Liu, and J. Cappos. A fast multi-server, multi-block private information retrieval protocol. In IEEE Globecom, Dec 2015.
[21] J. Groth, A. Kiayias, and H. Lipmaa. Multi-query computationally-private information retrieval with constant communication rate. In International Workshop on Public Key Cryptography. Springer, 2010.
[22] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Pearson Higher Education, 2010.
[23] H. Sun and S. Jafar. Optimal download cost of private information retrieval for arbitrary message length. 2016. Available at arXiv:1610.03048.
[24] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. SIAM, 8(2):300–304, 1960.
[25] S. B. Wicker and V. K. Bhargava.