Counting Short Vector Pairs by Inner Product and Relations to the Permanent
ANDREAS BJÖRKLUND AND PETTERI KASKI
Abstract.
Given as input two n-element sets A, B ⊆ {0,1}^d with d = c log n ≤ (log n)^2/(log log n)^3 and a target t ∈ {0, 1, ..., d}, we show how to count the number of pairs (x, y) ∈ A × B with integer inner product ⟨x, y⟩ = t deterministically, in n^2/2^{Ω(√(log n log log n/(c log^2 c)))} time. This demonstrates that one can solve this problem in deterministic subquadratic time almost up to log^2 n dimensions, nearly matching the dimension bound of a subquadratic randomized detection algorithm of Alman and Williams [FOCS 2015]. We also show how to modify their randomized algorithm to count the pairs w.h.p., to obtain a fast randomized counting algorithm. Our deterministic algorithm builds on a novel technique of reconstructing a function from its sum-aggregates by prime residues, which can be seen as an additive analog of the Chinese Remainder Theorem. As our second contribution, we relate the fine-grained complexity of the task of counting vector pairs by inner product to the task of computing a zero-one matrix permanent over the integers.

1. Introduction
1.1. The Inner Product and the Size of Preimages.
The inner product map ⟨x, y⟩ = sum_{i=1}^d x_i y_i of two d-dimensional vectors x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) is one of the cornerstones of linear algebra and its applications. For example, when x and y are vectors of observations normalized to zero mean and unit standard deviation, then ⟨x, y⟩ is the Pearson correlation between x and y. As such, it is a fundamentally important computational and data-analytical task to efficiently gain information about the preimages of the inner product map; for example, to highlight pairs of similar or dissimilar observables between two families of n observables.

Accordingly, the protagonist of this paper is the following counting problem (InnerProduct): Given as input a target t ∈ {0, 1, ..., d} and two n-element sets A ⊆ {0,1}^d and B ⊆ {0,1}^d, count the number of vector pairs (x, y) ∈ A × B with integer inner product ⟨x, y⟩ = t.

From a complexity-theoretic standpoint, this problem generalizes many conjectured-hard problems in the study of fine-grained complexity, such as the t = 0 special case, the orthogonal vector counting (OV) problem, as well as generalizing fundamental application settings, such as similarity search in Hamming spaces. While it is immediate that subquadratic scalability in n is obtainable when d = o(log n), our interest in this paper is to obtain an improved understanding of the fine-grained complexity landscape for moderately short vectors, specifically for d at most poly-logarithmic in n.

1.2. Subquadratic Scaling for Moderately Short Vectors.
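As the baseline that the results below improve upon, InnerProduct admits a trivial quadratic-time counter; a minimal sketch (the example sets are illustrative):

```python
from itertools import product

def count_inner_product_pairs(A, B, t):
    """Count pairs (x, y) in A x B with integer inner product <x, y> = t.

    Naive O(|A| * |B| * d) reference; the algorithms of this paper beat
    this quadratic baseline for d up to roughly (log n)^2 dimensions.
    """
    return sum(
        1
        for x, y in product(A, B)
        if sum(xi * yi for xi, yi in zip(x, y)) == t
    )

# Example with d = 3 and target t = 1: all four pairs have inner product 1.
A = [(1, 0, 1), (0, 1, 1)]
B = [(1, 1, 0), (0, 0, 1)]
print(count_inner_product_pairs(A, B, 1))  # 4
```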
Our main positive result establishes deterministic subquadratic scalability for
InnerProduct up to d growing essentially as the square of the logarithm of n:

This work was carried out while AB was employed as a researcher at Lund University, Department of Computer Science, and the major part of the writeup was carried out while AB was employed as a researcher at Ericsson Research.

Aalto University, Department of Computer Science, Finland
E-mail addresses: [email protected], [email protected].

Theorem 1 (Main; Subquadratic Scaling for
InnerProduct). There exists a deterministic algorithm that, given as input a target t ∈ {0, 1, ..., c log n} and two n-element sets A, B ⊆ {0,1}^{c log n} with 1 ≤ c ≤ log n/(log log n)^3, outputs the number of pairs (x, y) ∈ A × B with ⟨x, y⟩ = t in time

(1)    n^2 / 2^{Ω(√(log n log log n / (c log^2 c)))}.

The algorithm in Theorem 1 is based on a novel technique of reconstructing a function from its sum-aggregates by prime residue, which can be seen as an additive analog of the Chinese Remainder Theorem and may be of independent interest (cf. Sect. 2). We also show how a randomized algorithm for the decision problem of checking for a pair of vectors whose Hamming distance is less than a target, due to Alman and Williams [5], can with a small modification be turned into an algorithm for
InnerProduct.

Theorem 2 (Randomized Subquadratic Scaling for
InnerProduct). There exists a randomized algorithm that w.h.p., given as input a target t ∈ {0, 1, ..., c log n} and two n-element sets A, B ⊆ {0,1}^{c log n} with 1 ≤ c ≤ log n/(log log n)^3, outputs the number of pairs (x, y) ∈ A × B with ⟨x, y⟩ = t in time

(2)    n^2 / 2^{Ω(log n / (c log^2 c))}.

While the randomized algorithm in Theorem 2 is faster than the deterministic one in Theorem 1, we stress that as far as we know no deterministic subquadratic-time algorithm was previously known for
InnerProduct, even for O(log n) dimensions. In particular, derandomizing Theorem 2 while retaining subquadratic time seems challenging, even though some progress on the amount of randomness needed in the algorithm has been made, cf. Theorem 1.1 in [3].

Our further objective is to better understand the fine-grained complexity of InnerProduct in relation to that of OV. For d = O(log n), it is known that these problems are truly-subquadratically related; indeed, Chen and Williams [14] give a parsimonious reduction for the detection variants of these two problems. That is, if OV can be solved in truly subquadratic time n^{2 − Ω(1)}, then so can InnerProduct. However, while there is a subquadratic-time algorithm for OV whose running time scales as well as n^{2 − Ω(1/log c)} [13], the reduction of Chen and Williams [14] does not immediately give a non-trivial algorithm for InnerProduct. Indeed, the fastest known algorithm for the decision version
of InnerProduct utilizes probabilistic polynomials for symmetric Boolean functions with optimal dependence on the degree and error [5], and does not go via fast OV algorithms and the reduction above. In Theorem 2, we show how a simple modification to the algorithm of Alman and Williams [5] can turn their algorithm into a counting one. We note that while Alman, Chan, and Williams [3] later presented a deterministic algorithm based on Chebyshev polynomials over the reals for the minimum/maximum Hamming weight pair, with the same running time as the randomized one in [5], that deterministic algorithm, or the even faster randomized one they presented, cannot be turned into one for InnerProduct by our suggested modification alone.
1.3. Lower Bounds via the Permanent.
The running times (1) and (2) would, at least at first, appear to leave room for improvement. Indeed, the running time (2) is considerably worse than the running time n^{2 − Ω(1/log c)} obtained by Chan and Williams [13] for OV. We proceed to show that this intuition might be misleading, since such scalability would imply the existence of considerably faster algorithms for a canonical hard problem in exponential-time complexity. Accordingly, to gain insight into the complexity of InnerProduct and OV when d = ω(log n), we introduce our second protagonist (R-Permanent): Given as input an n × n matrix M with entries m_{ij} in a ring R for i, j ∈ [n], compute the permanent

per M = sum_{σ ∈ S_n} prod_{i ∈ [n]} m_{i,σ(i)},

where S_n is the group of all permutations of [n] = {1, 2, ..., n}.

Ryser's algorithm from 1963 computes the permanent with O(2^n n) arithmetic operations in R [21]. It is a major open problem whether this can be improved to O(c^n) for some constant c <
2. Even improving the running time to less than 2^n operations has been noted as a challenge by Knuth in the Art of Computer Programming [18, Exercise 4.6.4.11]. Valiant in 1979 famously proved that the permanent is #P-hard to compute already when m_{ij} ∈ {0,1} and evaluated over the ring of integers [23]; this version of the problem can be interpreted as counting the perfect matchings in a balanced bipartite graph having the matrix as its biadjacency matrix. For zero-one inputs over the integers, somewhat faster algorithms are known (cf. Sect. 1.5); to the best of our knowledge, the current champion for zero-one matrices computes the permanent in 2^{n − Ω(√(n/log log n))} time [11].

As our second contribution, we relate the fine-grained scalability of solving InnerProduct and OV to the task of computing the permanent of a zero-one matrix over the integers. In particular, our first result shows that if we could solve InnerProduct as fast as the fastest currently known algorithms for OV [13], then we would immediately obtain a much faster algorithm for the permanent:

Theorem 3 (Lower Bound for
InnerProduct via Integer Permanent). If there exists an algorithm for solving
InnerProduct for N vectors from {0,1}^{c log N} in time N^{2 − Ω(1/log c)}, then there exists an algorithm computing the permanent of an n × n zero-one matrix over the integers in time 2^{n − Ω(n/log n)}.

Thus, despite the truly-subquadratic equivalence for d = O(log n) [14], it would appear that InnerProduct and OV have different complexity characteristics when d = ω(log n). Our next result shows that a modest improvement in the fine-grained scalability of OV would likewise imply much faster algorithms for the permanent.

Theorem 4 (Lower Bound for OV via Integer Permanent). If there exists an algorithm for solving OV for N vectors from {0,1}^{c log N} in time N^{2 − Ω(1/log^{1−ε} c)} for some ε > 0, then there exists an algorithm computing the permanent of an n × n zero-one matrix over the integers in time 2^{n − Ω(n/log^{1/ε − 1} n)}.

We note that such fast algorithms for OV would already disprove the so-called Super Strong ETH, the hypothesis that k-CNFSAT on n variables admits no 2^{n − n/o(k)} time algorithm, by the reduction to OV by Williams [24] after sparsification [16]. The present result merely adds to the list of consequences of faster algorithms for OV.

1.4. Methodology and Organization of the Paper.
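Since both lower-bound reductions manipulate Ryser's formula for the permanent, we record it concretely; a direct transcription as a sketch, practical only for small n:

```python
def ryser_permanent(M):
    """Permanent of an n x n integer matrix M via Ryser's formula:

        per(M) = (-1)^n * sum over nonempty column subsets S of
                 (-1)^{|S|} * prod_i sum_{j in S} M[i][j].

    Uses 2^n - 1 subset terms; a simple O(2^n * n^2)-time transcription
    of the O(2^n * n)-operation algorithm mentioned in the text.
    """
    n = len(M)
    total = 0
    for mask in range(1, 1 << n):                 # nonempty subsets S of columns
        size = bin(mask).count("1")
        prod = 1
        for i in range(n):                        # product of row sums over S
            prod *= sum(M[i][j] for j in range(n) if mask >> j & 1)
        total += (-1) ** size * prod
    return (-1) ** n * total

# Permanent of the all-ones 3 x 3 matrix is 3! = 6; the identity gives 1.
print(ryser_permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 6
print(ryser_permanent([[1, 0], [0, 1]]))                   # 1
```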
The key methodological contribution underlying our main algorithmic result (Theorem 1) is a novel additive analog of the Chinese Remainder Theorem (Lemma 5, developed independently of the application in Sect. 2), which enables us to recover the number of pairs (x, y) ∈ A × B with ⟨x, y⟩ = t from counts of pairs (x, y) satisfying ⟨x, y⟩ ≡ r (mod p) for multiple small primes p and residues r ∈ {0, 1, ..., p − 1}. In particular, the crux of the algorithmic speedup lies in the observation that to recover the count associated with a target 0 ≤ t ≤ d, primes up to roughly √d suffice by Lemma 5. To obtain the counts of pairs in each residue class r modulo p, we employ the polynomial method with modulus-amplifying polynomials of Beigel and Tarui [8] to accommodate the counts under a prime-power modulus, with fast rectangular matrix multiplication of Coppersmith [15] as the key subroutine implementing the count; this latter part of the algorithm design, developed in Sect. 3, follows well-known techniques in fine-grained algorithm design (e.g. [3]). Similarly, the randomized algorithm design in Theorem 2 follows by a minor adaptation of the probabilistic-polynomial techniques of Alman and Williams [5] to a counting context; a proof is relegated to Sect. 4.

Our two lower-bound reductions, Theorem 3 and Theorem 4, rely on reducing an m × m integer permanent first via the Chinese Remainder Theorem into permanents modulo multiple primes p with p = O(m log m), and then using algebraic splitting via Ryser's formula [21] to obtain short-vector instances of InnerProduct and
OV, respectively. For InnerProduct and Theorem 3, the split employs a novel discrete-logarithm version of Ryser's formula modulo p to arrive at two collections of vectors whose counts of pairs with specific inner products enable recovery of the permanent modulo p; the proof is presented in Sect. 5. For OV and Theorem 4, the split is similar but with a more intricate vector-coding of group residues modulo p to obtain the desired correspondence with counts of pairs of orthogonal vectors; we relegate the proof to Sect. 6.

1.5. Related Work and Further Applications.
Exact and approximate inner products.
Abboud, Williams, and Yu [1] used the polynomial method to construct a randomized subquadratic-time algorithm for OV. Chan and Williams [13] derandomized the algorithm and showed that it could also solve the counting problem OV. The first result that addressed an inner product different from zero was the randomized algorithm for the minimum Hamming weight pair by Alman and Williams [5]. Subsequently, Alman, Chan, and Williams [3] found an even faster randomized as well as a deterministic subquadratic algorithm matching [5].

A number of studies address approximate versions of inner-product counting in subquadratic time, such as the detection of outlier correlations and offline computation of approximate nearest neighbors, including Valiant [22], Karppa, Kaski, and Kohonen [17], Alman [2], and Alman, Chan, and Williams [4]. All the algorithms above utilize fast rectangular matrix multiplication.

Permanents.
Bax and Franklin presented a randomised 2^{n − Ω(n^{1/3}/log n)} expected-time algorithm for the 0/1 matrix permanent; a 2^{n − Ω(√(n/log n))} time algorithm came later. The algorithm was subsequently improved to a deterministic 2^{n − Ω(√(n/log log n))} time algorithm by Björklund, Kaski, and Williams [11]. For the computation of an integer matrix permanent modulo a prime power p^{λn/p} for any constant λ <
1, Björklund, Husfeldt, and Lyckberg [10] derived a 2^{n − Ω(n/(p log p))} time algorithm. For the computation of a matrix permanent over an arbitrary ring R on r elements, Björklund and Williams [12] gave a deterministic 2^{n − Ω(n/r)} time algorithm.

The problem InnerProduct has various applications in combinatorial algorithms. To mention two in particular, it can be used to count the satisfying assignments to a
Sym ◦ And formula (cf. Sect. 7.1), or to compute the weight enumerator polynomial of a linear code (cf. Sect. 7.2).
2. Reconstruction from Sum-Aggregates by Prime Residue
This section develops the main methodological contribution of this work. Namely, we show that a complex-valued function f : D → C can be reconstructed from its sum-aggregates by prime residue when the domain D is a prefix of the set of nonnegative integers. In essence, reconstruction of a function from its sum-aggregates can be viewed as an additive analog of the Chinese Remainder Theorem; that is, we obtain reconstruction up to the sum of the prime moduli, in the precise sense of (3) below, whereas the Chinese Remainder Theorem enables reconstruction up to the product of the moduli. Here it should be noted that the scope of the Chinese Remainder Theorem is also somewhat more restricted than our present setting; indeed, in our setting the Chinese Remainder Theorem does not enable the reconstruction of an arbitrary function f but rather is restricted to reconstruction in the case when f is known to vanish in all but one point of D.

In our application of counting pairs of vectors by inner product, we let f be a counting function such that f(ℓ) counts the number of pairs (x, y) ∈ A × B with ⟨x, y⟩ = ℓ. Reconstruction from sum-aggregates then enables us to recover f by counting the number of pairs (x, y) with ⟨x, y⟩ ≡ r (mod p) for small primes p and residues r ∈ {0, 1, ..., p − 1}; we postpone the details of this application to Sect. 3 and first proceed to study reconstructibility.

2.1. Sum-Aggregation by Prime Residue.
Let p_1, p_2, ..., p_m be distinct prime numbers and let us assume that D ⊆ {0, 1, ..., s_m − 1}, where

(3)    s_m = 1 + sum_{b=1}^{m} (p_b − 1).

Letting f_ℓ be shorthand for f(ℓ), we show that we can recover f from the sequence of its sum-aggregates

(4)    F_{br} = sum_{ℓ ∈ D, ℓ ≡ r (mod p_b)} f_ℓ

for each residue r ∈ {0, 1, ..., p_b − 1} and each b ∈ {1, 2, ..., m}.

To start with, let us observe that this sequence is linearly redundant. Indeed, define the sum

(5)    F = sum_{ℓ ∈ D} f_ℓ

and observe that for each b ∈ {1, 2, ..., m} we have the linear relation F = sum_{r=0}^{p_b − 1} F_{br}. To obtain an equivalent and, as we will shortly show, linearly irredundant sequence, take the sequence formed by the sum F followed by F_{br} for each nonzero residue r ∈ {1, 2, ..., p_b − 1} and each b ∈ {1, 2, ..., m}. Let us write F for this sequence of length s_m. By extending the domain of the function f with zero-values f_ℓ = 0 as needed, we can also assume that D = {0, 1, ..., s_m − 1} in what follows.

2.2. Sum-Aggregation as a Linear System.
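Concretely, the aggregates pin down f through a linear system, as developed next; a small exact-arithmetic sketch for the primes 2, 3, 5 (the vector f below is illustrative):

```python
from fractions import Fraction

def aggregation_matrix(primes):
    """Nonzero-residue aggregation matrix A of (7): one all-ones row for the
    total F, then for each prime p a band of p - 1 rows, where row i marks
    the positions l in {0, ..., s - 1} with l = i (mod p), i = 1, ..., p - 1."""
    s = 1 + sum(p - 1 for p in primes)
    rows = [[1] * s]                       # band b = 0: the plain sum F
    for p in primes:
        for i in range(1, p):
            rows.append([1 if l % p == i else 0 for l in range(s)])
    return rows

def solve(A, F):
    """Solve A f = F exactly by Gauss-Jordan elimination over the rationals
    (A is invertible by Lemma 5, so a pivot always exists)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, F)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [a - M[r][c] * b for a, b in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

primes = [2, 3, 5]
A = aggregation_matrix(primes)             # 8 x 8, since s = 1 + 1 + 2 + 4 = 8
f = [3, 1, 4, 1, 5, 9, 2, 6]               # an arbitrary "count" vector
F = [sum(a * x for a, x in zip(row, f)) for row in A]
print(solve(A, F) == f)                    # reconstruction succeeds: True
```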
Let us now study reconstruction of f from F. From (4) and (5) we observe that the task of reconstructing f from F is equivalent to solving the linear system

(6)    F = A f,

where A is the s_m × s_m nonzero residue aggregation matrix whose entries are defined for all b ∈ {0, 1, 2, ..., m}, i ∈ {1, 2, ..., p_b − 1}, and ℓ ∈ {0, 1, ..., s_m − 1} by the rule

(7)    A_{bi,ℓ} = 1 if b = 0;  1 if b ≥ 1 and i ≡ ℓ (mod p_b);  0 if b ≥ 1 and i ≢ ℓ (mod p_b),

where we have assumed for convenience that p_1 = 2. Indeed, we readily verify from (4), (5), and (7) that

F_{bi} = sum_{ℓ=0}^{s_m − 1} A_{bi,ℓ} f_ℓ

holds for each b ∈ {1, 2, ..., m} and i ∈ {1, 2, ..., p_b − 1}; the single row with b = 0 similarly aggregates to the sum (5). When we want to stress the m selected primes, we write A_{p_1,p_2,...,p_m} for the matrix A.

The row-banded structure given by (7) is perhaps easiest illustrated with a small example. Below we display the matrix A for the primes p_1 = 2, p_2 = 3, and p_3 = 5:

A_{2,3,5} =
[ 1 1 1 1 1 1 1 1 ]
[ 0 1 0 1 0 1 0 1 ]
[ 0 1 0 0 1 0 0 1 ]
[ 0 0 1 0 0 1 0 0 ]
[ 0 1 0 0 0 0 1 0 ]
[ 0 0 1 0 0 0 0 1 ]
[ 0 0 0 1 0 0 0 0 ]
[ 0 0 0 0 1 0 0 0 ].

Observe in particular that the first band b = 0 corresponds to the sum (5) and the subsequent bands b ∈ {1, 2, ..., m} each correspond to one of the primes p_1, p_2, ..., p_m, so that the p_b − 1 rows in band b correspond to the p_b − 1 nonzero residue classes modulo p_b. Our main technical lemma establishes that the matrix A is invertible, thus enabling reconstruction of f from F.

Lemma 5 (Reconstruction from Sum-Aggregates by Prime Residue). The nonzero residue aggregation matrix A_{p_1,p_2,...,p_m} is invertible whenever p_1, p_2, ..., p_m are distinct primes.

The key idea in the proof is to decompose A_{p_1,p_2,...,p_m} over the complex numbers into the product of a near-block-diagonal matrix with near-Vandermonde blocks and a Vandermonde matrix, both of which are then shown to have nonzero determinant. The rest of this section is devoted to a proof of Lemma 5.

2.3. Preliminaries on Complex Roots of Unity.
We will need the following standard facts about complex roots of unity. For a positive integer N, let us write ω_N = exp(2πi/N), where i = √(−1). For all k ∈ Z we have

(8)    (1/N) sum_{j=0}^{N−1} ω_N^{kj} = 1 if k ≡ 0 (mod N);  0 if k ≢ 0 (mod N).

2.4. Reconstruction from Sum-Aggregates—Proof of Lemma 5.
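The factorization A = U V at the heart of the proof below can be checked numerically; a sketch for the primes 2, 3, 5, following the definitions (7), (9), and (10):

```python
import cmath

def factors(primes):
    """Matrices U as in (9) and V as in (10) for the given primes.
    Rows/columns are indexed by pairs (band, offset): one index (0, 0)
    for the band d = 0 and p - 1 indices (d, k) for each prime."""
    idx = [(0, 0)] + [(d, k) for d, p in enumerate(primes, 1) for k in range(1, p)]
    s = len(idx)
    w = lambda p, e: cmath.exp(2j * cmath.pi * e / p)   # ω_p^e
    U = [[0.0] * s for _ in range(s)]
    V = [[0.0] * s for _ in range(s)]
    for r, (b, i) in enumerate(idx):
        for c, (d, k) in enumerate(idx):
            if b == 0:
                U[r][c] = 1.0 if d == 0 else 0.0       # first row of U is e_0
            elif d == 0:
                U[r][c] = 1.0 / primes[b - 1]          # first column: 1/p_b
            elif d == b:
                U[r][c] = w(primes[b - 1], -i * k) / primes[b - 1]
        for l in range(s):                             # row (d, k) of V: ω_{p_d}^{k l}
            V[r][l] = 1.0 if b == 0 else w(primes[b - 1], i * l)
    return idx, U, V

primes = [2, 3, 5]
idx, U, V = factors(primes)
s = len(idx)                                           # s = 8 for these primes
# A per rule (7): all-ones row, then residue-indicator rows per band.
A = [[1.0 if b == 0 else float(l % primes[b - 1] == i) for l in range(s)]
     for (b, i) in idx]
P = [[sum(U[r][t] * V[t][c] for t in range(s)) for c in range(s)]
     for r in range(s)]
ok = all(abs(P[r][c] - A[r][c]) < 1e-9 for r in range(s) for c in range(s))
print(ok)  # True: A = U V numerically, as the proof claims
```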
We show that for distinct primes p_1, p_2, ..., p_m the matrix A = A_{p_1,p_2,...,p_m} is invertible over the rational numbers. Our strategy is to show that A = U V for two complex matrices U and V that both have nonzero determinant. Indeed, the near-cyclic banded structure of A suggests that one should pursue a decomposition in terms of block-structured near-Vandermonde matrices. Let us first define the matrices U and V, then present a small example, and then complete the proof.

The matrix U will use an (m + 1) × (m + 1) block structure that is similar to the (m + 1)-band structure of A, but now the structure is used both for rows and columns. Again for convenience we assume p_1 = 2. The matrix U is defined for all b ∈ {0, 1, 2, ..., m}, i ∈ {1, 2, ..., p_b − 1}, d ∈ {0, 1, ..., m}, and k ∈ {1, 2, ..., p_d − 1} by the rule

(9)    U_{bi,dk} = 1 if d = 0 and b = 0;  1/p_b if d = 0 and b ≥ 1;  0 if d ≥ 1 and b ≠ d;  (1/p_b) ω_{p_b}^{−ik} if d ≥ 1 and b = d.

The matrix V is a Vandermonde matrix with (m + 1)-banded structure defined for all d ∈ {0, 1, ..., m}, k ∈ {1, 2, ..., p_d − 1}, and ℓ ∈ {0, 1, ..., s_m − 1} by the rule

(10)    V_{dk,ℓ} = 1 if d = 0;  ω_{p_d}^{kℓ} if d ≥ 1.

Before proceeding with the proof that A = U V, let us present an example for the primes p_1 = 2, p_2 = 3, and p_3 = 5. We have

(11)    A_{2,3,5} = U_{2,3,5} V_{2,3,5},

where

U_{2,3,5} =
[ 1    0              0              0              0              0              0              0              ]
[ 1/2  (1/2)ω_2^{−1}  0              0              0              0              0              0              ]
[ 1/3  0              (1/3)ω_3^{−1}  (1/3)ω_3^{−2}  0              0              0              0              ]
[ 1/3  0              (1/3)ω_3^{−2}  (1/3)ω_3^{−4}  0              0              0              0              ]
[ 1/5  0              0              0              (1/5)ω_5^{−1}  (1/5)ω_5^{−2}  (1/5)ω_5^{−3}  (1/5)ω_5^{−4}  ]
[ 1/5  0              0              0              (1/5)ω_5^{−2}  (1/5)ω_5^{−4}  (1/5)ω_5^{−6}  (1/5)ω_5^{−8}  ]
[ 1/5  0              0              0              (1/5)ω_5^{−3}  (1/5)ω_5^{−6}  (1/5)ω_5^{−9}  (1/5)ω_5^{−12} ]
[ 1/5  0              0              0              (1/5)ω_5^{−4}  (1/5)ω_5^{−8}  (1/5)ω_5^{−12} (1/5)ω_5^{−16} ]

and

V_{2,3,5} =
[ 1  1        1        1        1        1        1        1        ]
[ 1  ω_2      ω_2^2    ω_2^3    ω_2^4    ω_2^5    ω_2^6    ω_2^7    ]
[ 1  ω_3      ω_3^2    ω_3^3    ω_3^4    ω_3^5    ω_3^6    ω_3^7    ]
[ 1  ω_3^2    ω_3^4    ω_3^6    ω_3^8    ω_3^{10} ω_3^{12} ω_3^{14} ]
[ 1  ω_5      ω_5^2    ω_5^3    ω_5^4    ω_5^5    ω_5^6    ω_5^7    ]
[ 1  ω_5^2    ω_5^4    ω_5^6    ω_5^8    ω_5^{10} ω_5^{12} ω_5^{14} ]
[ 1  ω_5^3    ω_5^6    ω_5^9    ω_5^{12} ω_5^{15} ω_5^{18} ω_5^{21} ]
[ 1  ω_5^4    ω_5^8    ω_5^{12} ω_5^{16} ω_5^{20} ω_5^{24} ω_5^{28} ].

To pass between the flat row index and the band structure, define for each c ∈ {0, 1, ..., m} the prefix sum

s_c = 0 if c = 0;   s_c = 1 + sum_{b=1}^{c−1} (p_b − 1) if c ≥ 1.

For each ℓ ∈ {1, 2, ..., s_m − 1}, we observe that there exist unique c ∈ {1, 2, ..., m} and j ∈ {1, 2, ..., p_c − 1} such that

(12)    ℓ = j + s_c − 1.

We are now ready to show that A = U V. Let b ∈ {0, 1, ..., m}, i ∈ {1, 2, ..., p_b − 1}, and ℓ ∈ {0, 1, ..., s_m − 1} be arbitrary. For b = 0, the only nonzero entry of the corresponding row of U is U_{00,00} = 1, so sum_{d,k} U_{00,dk} V_{dk,ℓ} = V_{00,ℓ} = 1 = A_{00,ℓ}. For b ≥ 1, from (9), (10), (8), and (7) we observe that

sum_{d=0}^{m} sum_{k} U_{bi,dk} V_{dk,ℓ} = 1/p_b + (1/p_b) sum_{k=1}^{p_b − 1} ω_{p_b}^{−ik} ω_{p_b}^{kℓ} = (1/p_b) sum_{k=0}^{p_b − 1} ω_{p_b}^{k(ℓ−i)} = [1 if ℓ ≡ i (mod p_b);  0 if ℓ ≢ i (mod p_b)] = A_{bi,ℓ}.

Thus, A = U V holds. It remains to show that both matrices U and V have nonzero determinant over the complex numbers. Starting with the Vandermonde matrix V, let ν_0 = 1 and ν_ℓ = ω_{p_c}^{j} for ℓ ∈ {1, 2, ..., s_m − 1}, where c and j are uniquely determined from ℓ by (12). By (10), row ℓ of V is the geometric progression (1, ν_ℓ, ν_ℓ^2, ..., ν_ℓ^D) with D = s_m − 1; that is, V is the Vandermonde matrix

V = [ 1  ν_0  ν_0^2  ...  ν_0^D ]
    [ 1  ν_1  ν_1^2  ...  ν_1^D ]
    [ ...                       ]
    [ 1  ν_D  ν_D^2  ...  ν_D^D ].

The Vandermonde determinant formula thus gives

det V = prod_{0 ≤ k < ℓ ≤ D} (ν_ℓ − ν_k).
Furthermore, this determinant is nonzero because p_1, p_2, ..., p_m are distinct primes and thus ν_0, ν_1, ..., ν_D are distinct. Next, let us consider the matrix U defined by (9). At this point it may be useful to revisit the structure of U via the example (11). Expanding the determinant along the first row, whose only nonzero entry is U_{00,00} = 1, we observe that det U equals the determinant of the block-diagonal part with b = d ≥ 1. For each b ≥ 1, the corresponding diagonal block is the (p_b − 1) × (p_b − 1) matrix with entries (1/p_b) ω_{p_b}^{−ik} for i, k ∈ {1, 2, ..., p_b − 1}; this is a scaled Vandermonde-type matrix whose determinant is nonzero because the values ω_{p_b}^{−i} for i ∈ {1, 2, ..., p_b − 1} are nonzero and distinct. Thus, since the determinant of U is the product of the determinants of the block matrices on the diagonal, each of which is nonzero, the determinant of U is nonzero. It follows that A is invertible and thus given F we can solve for f via (6). This completes the proof of Lemma 5. □

3. Counting Pairs of Zero-One Vectors by Inner Product
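As an end-to-end preview of this section's strategy, the following sketch aggregates pair counts by prime residue (by brute force, standing in for the polynomial-method subroutine developed below) and checks them against the true inner-product histogram; the vector sets are illustrative:

```python
from itertools import product

# Two small sets of 0-1 vectors with d = 7 <= s_m - 1 = 7 for primes 2, 3, 5.
A_set = [(1, 0, 1, 1, 0, 1, 0), (1, 1, 1, 1, 1, 0, 1), (0, 0, 1, 0, 1, 1, 1)]
B_set = [(1, 1, 0, 1, 0, 1, 1), (0, 1, 1, 1, 1, 1, 0), (1, 0, 1, 0, 1, 0, 1)]
d = 7

# True histogram f_t = #{(x, y) : <x, y> = t}, the output sought in Theorem 1.
f = [0] * (d + 1)
for x, y in product(A_set, B_set):
    f[sum(a * b for a, b in zip(x, y))] += 1

# Sum-aggregates F_{p,r} by prime residue (brute force here; the real
# algorithm obtains these via the polynomial method of Sects. 3.1-3.5).
primes = [2, 3, 5]
F = {(p, r): sum(sum(a * b for a, b in zip(x, y)) % p == r
                 for x, y in product(A_set, B_set))
     for p in primes for r in range(p)}

# Consistency with the linear system of Lemma 5: each aggregate F_{p,r}
# sums f_t over t = r (mod p); since d + 1 <= s_m, the aggregates
# determine f by solving (6).
assert all(F[p, r] == sum(f[t] for t in range(d + 1) if t % p == r)
           for p in primes for r in range(p))
print(f, sum(f))  # histogram of the 3 * 3 = 9 pairs; the total is 9
```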
This section documents our main algorithm and proves Theorem 1. Let κ be a parameter that satisfies, with foresight,

(13)    4 ≤ κ ≤ log n/(log log n)^3.

Let a^(1), a^(2), ..., a^(n) ∈ {0,1}^d and b^(1), b^(2), ..., b^(n) ∈ {0,1}^d be given as input with d ≤ κ log n. We want to compute for each t ∈ {0, 1, ..., d} the count

f_t = |{(i, j) ∈ {1, 2, ..., n}^2 : ⟨a^(i), b^(j)⟩ = t}|.

Our high-level approach will be to use Lemma 5 and (6) to solve for the counts f_0, f_1, ..., f_d using as input counts that have been sum-aggregated by prime residue. More precisely, we will work with prime moduli p_1, p_2, ..., p_m and develop an algorithm that computes, for given further input p ∈ {p_1, p_2, ..., p_m} and r ∈ {0, 1, ..., p − 1}, the sum-aggregated count

F_{pr} = |{(i, j) ∈ {1, 2, ..., n}^2 : ⟨a^(i), b^(j)⟩ ≡ r (mod p)}|.

The detailed choices for m and the primes p_1, p_2, ..., p_m will be presented later.

3.1. The Residue-Indicator Polynomial.
Assume p and r have been given. We will rely on the polynomial method, and accordingly we first build a standard polynomial that indicates the residue r modulo p in a pair of vectors. Let x = (x_1, x_2, ..., x_d) and y = (y_1, y_2, ..., y_d) be two vectors of indeterminates. By Fermat's little theorem, the 2d-indeterminate polynomial

(14)    G_{p,r}(x, y) = 1 − (sum_{k=1}^{d} x_k y_k − r)^{p−1}

satisfies for all i, j ∈ {1, 2, ..., n} the indicator property

(15)    G_{p,r}(a^(i), b^(j)) ≡ 1 (mod p) if ⟨a^(i), b^(j)⟩ ≡ r (mod p);  0 (mod p) if ⟨a^(i), b^(j)⟩ ≢ r (mod p).

We observe that G_{p,r} has degree 2p − 2.

3.2. Modulus Amplification for Zero-One Residues.
To enable taking the sum of a large number of indicators, we make use of the modulus-amplifying polynomials of Beigel and Tarui [8].
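Their amplifier, stated as Theorem 6 below, is easy to check numerically; a sketch:

```python
from math import comb

def amplifier(h, s):
    """Beigel-Tarui modulus-amplifying polynomial (16),

        A_h(z) = 1 - (1 - z)^h * sum_{j=0}^{h-1} C(h + j - 1, j) z^j,

    evaluated at the integer s."""
    return 1 - (1 - s) ** h * sum(comb(h + j - 1, j) * s ** j for j in range(h))

# Amplification: residues 0/1 modulo m become residues 0/1 modulo m^h.
m, h = 7, 3
for s in range(-20, 21):
    if s % m == 0:
        assert amplifier(h, s) % m ** h == 0
    if s % m == 1:
        assert amplifier(h, s) % m ** h == 1
print("amplified 0/1 residues mod", m, "to mod", m ** h)
```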
Theorem 6 (Modulus amplification; Beigel and Tarui [8]). For h ∈ Z≥1, define the polynomial

(16)    A_h(z) = 1 − (1 − z)^h sum_{j=0}^{h−1} C(h + j − 1, j) z^j.

Then, for all m ∈ Z≥1 and s ∈ Z, we have (i) s ≡ 1 (mod m) implies A_h(s) ≡ 1 (mod m^h), and (ii) s ≡ 0 (mod m) implies A_h(s) ≡ 0 (mod m^h).

We observe that A_h has degree 2h −
1. Composing (16) and (14), we obtain the amplified residue-indicator polynomial

(17)    G^h_{p,r}(x, y) = A_h(G_{p,r}(x, y)).

From (15) and Theorem 6, we observe the amplified indicator property

(18)    G^h_{p,r}(a^(i), b^(j)) ≡ 1 (mod p^h) if ⟨a^(i), b^(j)⟩ ≡ r (mod p);  0 (mod p^h) if ⟨a^(i), b^(j)⟩ ≢ r (mod p).

Furthermore, we observe that G^h_{p,r} has degree (2h − 1)(2p − 2).

3.3. Multilinear Reduct and Bounding the Number of Monomials.
For a nonnegative integer e, define ē = 0 if e = 0 and ē = 1 if e ≥
1. For a monomial x_1^{e_1} x_2^{e_2} ··· x_d^{e_d} y_1^{f_1} y_2^{f_2} ··· y_d^{f_d}, define the multilinear reduct as the monomial x_1^{ē_1} x_2^{ē_2} ··· x_d^{ē_d} y_1^{f̄_1} y_2^{f̄_2} ··· y_d^{f̄_d}. For a polynomial Q(x, y), define the multilinear reduct Q̄(x, y) by taking the multilinear reduct of each monomial of Q(x, y) and simplifying. Since a^(i) and b^(j) are {0,1}-valued vectors, over the integers we have

(19)    Q̄(a^(i), b^(j)) = Q(a^(i), b^(j)).

Furthermore, if Q has degree D, then Q̄ has at most sum_{j=0}^{D} (2d choose j) monomials. In particular, we observe that Ḡ^h_{p,r} has at most sum_{j=0}^{4hp} (2d choose j) monomials.

3.4. Split-Monomial Form of the Multilinear Reduct.
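On the monomial-list representation used throughout this section, the multilinear reduct is a one-pass operation; a minimal sketch (the example polynomial is illustrative):

```python
def multilinear_reduct(poly):
    """Multilinear reduct of a polynomial given as a dict mapping exponent
    tuples to coefficients: every positive exponent is lowered to 1 (valid
    on 0-1 inputs, since b^e = b for b in {0, 1} and e >= 1), and monomials
    that become equal are merged by summing their coefficients."""
    out = {}
    for exps, coef in poly.items():
        key = tuple(min(e, 1) for e in exps)
        out[key] = out.get(key, 0) + coef
    return {k: c for k, c in out.items() if c != 0}

# Example in two variables: x1^2 x2 + 3 x1 x2^3 reduces to 4 x1 x2.
poly = {(2, 1): 1, (1, 3): 3}
print(multilinear_reduct(poly))  # {(1, 1): 4}
```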
Suppose that the multilinear reduct Ḡ^h_{p,r}(x, y) has exactly M monomials with the representation

(20)    Ḡ^h_{p,r}(x, y) = sum_{k=1}^{M} γ(k) x_1^{e_1(k)} x_2^{e_2(k)} ··· x_d^{e_d(k)} y_1^{f_1(k)} y_2^{f_2(k)} ··· y_d^{f_d(k)}.

For
I, J ⊆ {1, 2, ..., n} and k ∈ {1, 2, ..., M}, define

(21)    L_{I,k} = sum_{i ∈ I} γ(k) (a_1^(i))^{e_1(k)} (a_2^(i))^{e_2(k)} ··· (a_d^(i))^{e_d(k)},
        R_{J,k} = sum_{j ∈ J} (b_1^(j))^{f_1(k)} (b_2^(j))^{f_2(k)} ··· (b_d^(j))^{f_d(k)}.

From (21), (20), (19), and (18), we have

(22)    sum_{k=1}^{M} L_{I,k} R_{J,k} = sum_{i ∈ I} sum_{j ∈ J} G^h_{p,r}(a^(i), b^(j)) ≡ |{(i, j) ∈ I × J : ⟨a^(i), b^(j)⟩ ≡ r (mod p)}| (mod p^h).

In particular, assuming that |I||J| ≤ p^h −
1, from (22) it follows that sum_{k=1}^{M} L_{I,k} R_{J,k} computed modulo p^h recovers the number of pairs (i, j) ∈ I × J with ⟨a^(i), b^(j)⟩ ≡ r (mod p). We now move from deriving the polynomial and its properties to describing the algorithm.

3.5. Algorithm for the Prime-Residue Count.
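To preview the algorithm of this subsection, the following toy instance exercises the split-monomial identity (22) with a simple stand-in polynomial Q(x, y) = ⟨x, y⟩^2 in place of Ḡ^h_{p,r}, and brute-force sums in place of fast rectangular matrix multiplication (all data illustrative):

```python
from itertools import product

# Toy instance: Q(x, y) = (sum_k x_k y_k)^2 = sum_{k,l} x_k x_l y_k y_l,
# given in explicit monomial form with all coefficients gamma(k) = 1.
d = 4
a = [(1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 0, 1), (1, 0, 0, 0)]
b = [(1, 1, 0, 1), (0, 1, 1, 1), (1, 0, 1, 0), (0, 0, 0, 1)]

# Monomial list: (x-exponent support, y-exponent support) per monomial.
monomials = [((k, l), (k, l)) for k in range(d) for l in range(d)]

def L(I, m):
    """L_{I,m} as in (21): sum over i in I of the x-part of monomial m."""
    (k, l), _ = m
    return sum(a[i][k] * a[i][l] for i in I)

def R(J, m):
    """R_{J,m} as in (21): sum over j in J of the y-part of monomial m."""
    _, (k, l) = m
    return sum(b[j][k] * b[j][l] for j in J)

# Block the index set into groups of g = 2; the (I, J) block sum of
# products reproduces sum over I x J of Q, exactly as in (22).
groups = [[0, 1], [2, 3]]
lhs = sum(L(I, m) * R(J, m) for I in groups for J in groups for m in monomials)
rhs = sum(sum(xi * yi for xi, yi in zip(x, y)) ** 2 for x, y in product(a, b))
print(lhs == rhs)  # the split-monomial sums match sum_{i,j} Q(a_i, b_j)
```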
The algorithm will rely on (22) via fast rectangular matrix multiplication to count the number of pairs (i, j) ∈ {1, 2, ..., n}^2 that satisfy ⟨a^(i), b^(j)⟩ ≡ r (mod p).

The algorithm first computes the explicit M-monomial representation of the polynomial Ḡ^h_{p,r} in (20). More precisely, the algorithm evaluates (14), (16), and (17) in explicit monomial representation, taking multilinear reducts with respect to the variables x_1, x_2, ..., x_d, y_1, y_2, ..., y_d whenever possible. This results in the set

(23)    {(k, γ(k), e_1(k), e_2(k), ..., e_d(k), f_1(k), f_2(k), ..., f_d(k)) : k ∈ {1, 2, ..., M}}.

Next, the algorithm constructs two rectangular matrices S and T, with the objective of making use of the following algorithm of Coppersmith [15].

Theorem 7 (Coppersmith [15]). Given an N × ⌊N^{0.172}⌋ matrix S and an ⌊N^{0.172}⌋ × N matrix T as input, the matrix product ST over the integers can be computed in O(N^2 log^2 N) arithmetic operations.

Towards this end, let g be a positive integer whose value we will fix later. Introduce two set partitions of {1, 2, ..., n} with cells I_1, I_2, ..., I_{⌈n/g⌉} ⊆ {1, 2, ..., n} and J_1, J_2, ..., J_{⌈n/g⌉} ⊆ {1, 2, ..., n}, respectively, so that |I_u| ≤ g and |J_v| ≤ g for u, v ∈ {1, 2, ..., ⌈n/g⌉}. Indeed, we thus have |I_u||J_v| ≤ g^2 for all u, v ∈ {1, 2, ..., ⌈n/g⌉}, so (22) applied to I_u and J_v modulo p^h recovers the number of pairs (i, j) ∈ I_u × J_v with ⟨a^(i), b^(j)⟩ ≡ r (mod p), assuming that g^2 ≤ p^h −
1, which will be justified by our eventual choice of g. Now let N = ⌈n/g⌉ and define the N × M and M × N matrices S and T by setting, for u, v ∈ {1, 2, ..., ⌈n/g⌉} and k ∈ {1, 2, ..., M},

S_{uk} = L_{I_u,k}   and   T_{kv} = R_{J_v,k}.

Concretely, the algorithm computes S and T from the given input one entry at a time using the computed monomial list (23) and the formulas (21) for I = I_u and J = J_v for each u, v ∈ {1, 2, ..., ⌈n/g⌉} and k = 1, 2, ..., M. The algorithm then multiplies S and T to obtain the product matrix ST modulo p^h, where we assume that each entry of ST is reduced to {0, 1, ..., p^h − 1}. Finally, the algorithm outputs the sum

F_{pr} = sum_{u=1}^{⌈n/g⌉} sum_{v=1}^{⌈n/g⌉} (ST)_{uv}.

3.6. Parameterizing the Algorithm.
Let us now start parameterizing the algorithm. First, to apply the algorithm in Theorem 7 to the matrices S and T, we need M ≤ N^{0.172} = ⌈n/g⌉^{0.172}. Subject to the assumption g ≤ n^{0.1}, to be justified later, it will be sufficient to show that M ≤ n^{0.1}. We recall that M ≤ sum_{j=0}^{4hp} (2d choose j) and d ≤ κ log n. With foresight, let us set

(24)    β_κ = K/log κ,

where K > 0 is a sufficiently small constant whose value will be fixed later; for all κ ≥
4, we have the upper bound

(25)    β_κ log(κ/β_κ) = K − (K log K)/log κ + (K log log κ)/log κ ≤ √K.

Let us assume, to be justified later, that p = o(β_κ log n). Taking

(26)    h = ⌊β_κ log n / p⌋,

we have, for all large enough n,

(27)    M ≤ sum_{j=0}^{4hp} (2d choose j) ≤ (4hp + 1) (ed/(2hp))^{4hp} ≤ 8 β_κ log n · (eκ/β_κ)^{4 β_κ log n} ≤ n^{0.1},

where the last inequality follows by (25) and choosing K small enough. Thus, Theorem 7 applies, subject to the assumptions g ≤ n^{0.1}, g^2 ≤ p^h −
1, and p = o(β_κ log n), which still need to be established. Before this, we digress to further preliminaries to enable reconstruction.

3.7. Preliminaries on Asymptotics of Primes.
In what follows let us write p_j for the j-th prime number with j = 1, 2, ...; that is, p_1 = 2, p_2 = 3, p_3 = 5, and so forth. Asymptotically, from the Prime Number Theorem we have p_m ∼ m ln m (e.g. Rosser and Schoenfeld [20]), and the sum of the first m primes satisfies sum_{j=1}^{m} p_j ∼ (1/2) m^2 ln m (cf. Bach and Shallit [6]), where we write f(m) ∼ g(m) if lim_{m→∞} f(m)/g(m) = 1. When evaluated for the first m primes, the reconstruction parameter (3) thus satisfies

(28)    s_m = 1 + sum_{j=1}^{m} (p_j − 1) ∼ (1/2) m^2 ln m ∼ p_m^2/(2 ln p_m).

We are now ready to continue parameterization of the algorithm.

3.8. Further Parameterization of the Algorithm.
Let m be a positive integer whose value will be fixed shortly. The algorithm will work with p_1, p_2, ..., p_m, the first m prime numbers. To reconstruct inner products of length-d zero-one vectors over the integers, we need d + 1 ≤ s_m, which for d ≤ κ log n and (28) means that it suffices to have p_m²/(2 ln p_m) ≥ 2κ log n for all large enough n. From Bertrand's postulate it thus follows that choosing the least m so that

(29) 2 √(κ (ln n) ln ln n) ≤ p_m ≤ 4 √(κ (ln n) ln ln n)

implies that we have d + 1 ≤ s_m for all large enough n and thus reconstruction is feasible. The choice (29) also justifies our earlier assumption made in the context of (26) and (27) that p_j = o(β_κ log n) for all j ∈ {1, 2, ..., m}; indeed, from (13) and (24), we have β_κ log n = K log n / log κ, and thus from (13) and (29) we observe that

p_j / (β_κ log n) ≤ 4 κ^{1/2} (log κ) (ln n)^{1/2} (ln ln n)^{1/2} / (K log n) = o(1).

Let us next choose the parameter g. Using p_j = o(β_κ log n) again, we have

p_j^{h_j} = p_j^{⌊β_κ log n / p_j⌋} ≥ p_j^{β_κ log n / p_j − 1} ≥ p_j^{β_κ log n / (2 p_j)} = 2^{(β_κ log n / (2 p_j)) log p_j} = n^{β_κ log p_j / (2 p_j)}.

Since p_1 < p_2 < ··· < p_m, for j ∈ {1, 2, ..., m} thus p_j^{h_j} ≥ n^{β_κ log p_m / (2 p_m)}. It follows that choosing

(30) g = ⌊ √( n^{β_κ log p_m / (2 p_m)} ) ⌋ − 1

justifies our assumption g ≤ p_j^{h_j} − 1 for all j ∈ {1, 2, ..., m}. The final assumption g ≤ n^{0.1} is justified by observing that (log p_m)/p_m is a decreasing function of m and observing that β_κ = o(1) by (13) and (24). The algorithm is now parameterized. Let us proceed to analyse its running time.

3.9. Running Time Analysis.
First, let us seek control on N as a function of n. From (29) and (30), we have

g ≥ n^{β_κ (2 + log κ + log ln n + log ln ln n) / (32 √(κ (ln n) ln ln n))} − 2.

This together with (13) gives us the crude lower bound

g = exp( Ω( β_κ √((ln n) ln ln n / κ) ) ).

We thus have N = ⌈n/g⌉ = n^{1 − Ω(β_κ √(ln ln n / (κ ln n)))}. Recalling (27), we observe that the time to compute the M-monomial list (23) can be bounded by n^{0.3+o(1)} because the algorithm is careful to take multilinear reducts and thus at no stage of evaluating (14), (16), and (17) does the number of monomials increase above (n^{0.15})² = n^{0.3}. Since

log p_j^{h_j} = h_j log p_j = ⌊ β_κ log n / p_j ⌋ log p_j = O(log n),

the arithmetic over the integers and modulo p_j^{h_j} for each j = 1, 2, ..., m runs in time polylogarithmic in n for each arithmetic operation executed by the algorithm. Because the algorithm in Theorem 7 runs in O(N² log² N) arithmetic operations, we observe that the polylogarithmic terms are subsumed by the asymptotic notation and the entire algorithm for computing F_p^{(r)} for given p ∈ {p_1, p_2, ..., p_m} and r ∈ {0, 1, ..., p − 1} runs in time

(31) n^{2 − Ω(β_κ √(log log n / (κ log n)))} = n^{2 − Ω(√(log log n / (κ (log κ)² log n)))}.

From (29) we observe that the required repeats for different p and r result in multiplicative polylogarithmic terms in n and are similarly subsumed to result in total running time of the form (31). This completes the proof of Theorem 1. □

4. A Faster Randomized Algorithm for InnerProduct
This section sketches a proof for Theorem 2. We follow the algorithm outlined in Alman and Williams [5]. We note that by their Theorem 1.2, there are probabilistic polynomials over any field with error ε of degree O(√(n log(1/ε))). In their Theorem 4.2, they have a probabilistic OR-construction that takes the disjunction of a random set of pairs of vector inner products as

q(x_1, y_1, x_2, y_2, ..., x_s, y_s) = 1 + ∏_k ( Σ_{(i,j) ∈ R_k} p(x_{i,1} + y_{j,1}, x_{i,2} + y_{j,2}, ..., x_{i,d} + y_{j,d}) ),

where p is a probabilistic threshold polynomial over F₂ of error ε = s^{−3}, and the R_k ⊆ [s]² are random subsets of the pairs; w.h.p., q indicates whether the s²-sized batch contains a pair whose difference Hamming weight is less than the threshold. By repeated computations with new p's and R_k's, a majority vote for the batch can be chosen as the correct answer, again w.h.p. for all batches.

We implement the following change of q to get an InnerProduct algorithm. We take p to be a probabilistic polynomial of error ε = s^{−3} for the symmetric function [[Σ_{i=1}^{n} z_i = t]], over a field of characteristic > s². We then construct q as

(32) q(x_1, y_1, x_2, y_2, ..., x_s, y_s) = Σ_{(i,j) ∈ [s]²} p(x_{i,1} y_{j,1}, x_{i,2} y_{j,2}, ..., x_{i,d} y_{j,d}).

Since the characteristic of the field is large enough, (32) is equal to the number of pairs in the s²-sized batch that have inner product equal to t with probability at least 1 − s²ε ≥ 1 − s^{−1}, a similar bound on the probability as in Theorem 4.2. Also, the degree of the polynomials is only a factor 2 larger. As with the original algorithm, if we repeat this enough times and take the majority in each batch, we get the correct number of pairs with t as inner product in all batches. By summing these final majority numbers over the integers, we obtain the output.
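To isolate the counting principle behind (32): if the probabilistic polynomial p is replaced by the exact indicator [[⟨x, y⟩ = t]], then the sum over the pairs of a batch, taken over a field of characteristic q exceeding the number of pairs, equals the exact number of pairs with inner product t, since no modular wraparound can occur. A toy Python sketch (ours, illustrative only, not the construction of [5]):

```python
def batch_count(xs, ys, t, q):
    """Sum of the exact indicators [[<x, y> = t]] over all pairs of the batch,
    computed in Z_q. Since the true count is at most len(xs) * len(ys) < q,
    the modular value equals the exact number of pairs with inner product t."""
    assert len(xs) * len(ys) < q
    acc = 0
    for x in xs:
        for y in ys:
            ip = sum(a * b for a, b in zip(x, y))
            acc = (acc + (1 if ip == t else 0)) % q
    return acc
```

For example, with s = 4 vectors per side, any prime q = 17 > s² suffices and the returned value is the exact pair count for the batch.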
We note that the parameters of the error and the degree have only changed by a constant factor, and hence all calculations of the running time and the error bound of the original algorithm carry through also for our modification of the algorithm. This completes the proof sketch. □

5. A Lower Bound for InnerProduct
This section proves Theorem 3; the proof of Theorem 4 is presented in Sect. 6. Throughout this section we let M be an n × n matrix with entries m_{ij} ∈ {0, 1} for i, j ∈ {1, 2, ..., n}. For convenience, let us write [n] = {1, 2, ..., n}. Recalling Ryser's formula, we have

(33) per M = (−1)^n Σ_{S ⊆ [n]} (−1)^{|S|} ∏_{i ∈ [n]} Σ_{j ∈ S} m_{ij}.

5.1. First Reduction: Chinese Remaindering.
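The plan of this subsection—evaluate Ryser's formula (33) modulo several small primes and reassemble the integer value with the Chinese Remainder Theorem—can be illustrated by the following toy Python sketch (ours, for illustration only; the function names are not from the paper, and a real reduction would obtain each residue via InnerProduct rather than by direct evaluation):

```python
import math

def per_ryser(M):
    """Ryser's formula (33): per M = (-1)^n sum over column subsets S of
    (-1)^{|S|} prod_i sum_{j in S} m_ij, with subsets S as bitmasks."""
    n = len(M)
    total = 0
    for S in range(1 << n):
        prod = 1
        for i in range(n):
            prod *= sum(M[i][j] for j in range(n) if S >> j & 1)
        total += (-1) ** bin(S).count("1") * prod
    return (-1) ** n * total

def per_by_crt(M):
    """Compute per M mod each prime p <= n ln n and reassemble by the CRT;
    for large enough n the product of these primes exceeds n! >= per M."""
    n = len(M)
    bound = max(2, int(n * math.log(n)) if n > 1 else 2)
    primes = [p for p in range(2, bound + 1)
              if all(p % q for q in range(2, math.isqrt(p) + 1))]
    residue, modulus = 0, 1
    for p in primes:
        rp = per_ryser(M) % p      # stand-in for a mod-p permanent algorithm
        # CRT merge of (residue mod modulus) with (rp mod p)
        k = (rp - residue) * pow(modulus, -1, p) % p
        residue += modulus * k
        modulus *= p
    return residue
```

For a 4 × 4 all-ones matrix, the primes up to 4 ln 4 are {2, 3, 5} with product 30 > 4! = 24, and the CRT reassembly returns the exact permanent 24.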
Since it is immediate that 0 ≤ per M ≤ n!, it suffices to compute the permanent modulo small primes p and then assemble the result over the integers via the Chinese Remainder Theorem. Let us first state and prove a crude upper bound on the size of the primes needed. For a positive integer m, let us write m# for the product of all primes at most m.

Lemma 8.
For all sufficiently large n, we have n! ≤ (n ln n)#.

Proof. Recall that for a positive integer m we write m# for the product of all primes at most m. For m ≥ 563, we have (cf. Rosser and Schoenfeld [20])

ln m# > m (1 − 1/(2 ln m)).

For the factorial function, for n ≥ 1, we have (cf. Robbins [19])

n! = √(2πn) (n/e)^n e^{α_n} with 1/(12n + 1) < α_n < 1/(12n),

which gives us the comparatively crude upper bound, for n ≥ 2,

ln n! < (n + 1/2) ln n − n + 1.

We want ln n! < ln m#, so it suffices to have m ≥ 563 and

(n + 1/2) ln n − n + 1 < m (1 − 1/(2 ln m)).

It is immediate that m ≥ n ln n suffices for all large enough n. □

Thus, it suffices to work with all primes p with p ≤ n ln n in what follows.

5.2. A Reduction from Zero-One Permanent to InnerProduct. This section starts our work towards Theorem 3 without yet parameterizing the reduction in detail. Let a prime 2 ≤ p ≤ n ln n be given. We seek to compute per M modulo p. Fix a primitive root g ∈ {1, 2, ..., p − 1} modulo p. For an integer a with a ≢ 0 (mod p), let us write dlog_{p,g} a for the discrete logarithm of a relative to g modulo p. That is, dlog_{p,g} a is the unique integer in {0, 1, ..., p − 2} that satisfies g^{dlog_{p,g} a} ≡ a (mod p). Working modulo p and collecting the outer sum in (33) by the sign σ ∈ {−1, 1} and the nonzero products by their discrete logarithm (subsets whose product vanishes modulo p contribute nothing), we have

per M ≡ (−1)^n Σ_{e=0}^{p−2} g^e ( w_1^{(e)} − w_{−1}^{(e)} ) (mod p),

where

w_σ^{(e)} = |{ S ⊆ [n] : (−1)^{|S|} = σ and dlog_{p,g} ∏_{i ∈ [n]} Σ_{j ∈ S} m_{ij} ≡ e (mod p − 1) }|

for σ ∈ {−1, 1} and e ∈ {0, 1, ..., p − 2}. Thus, to compute per M modulo p it suffices to compute the coefficients w_σ^{(e)}.

Towards this end, suppose that n ≥ 2 is even and let L = {1, 2, ..., n/2} and R = {n/2 + 1, n/2 + 2, ..., n}. For σ_L, σ_R ∈ {1, −1}, let

w_{σ_L,σ_R}^{(e)} = |{ S ⊆ [n] : (−1)^{|S ∩ L|} = σ_L, (−1)^{|S ∩ R|} = σ_R, and dlog_{p,g} ∏_{i ∈ [n]} Σ_{j ∈ S} m_{ij} ≡ e (mod p − 1) }|.

Clearly w_σ^{(e)} = Σ_{σ_L, σ_R ∈ {−1,1} : σ_L σ_R = σ} w_{σ_L,σ_R}^{(e)}, so it suffices to focus on computing w_{σ_L,σ_R}^{(e)} in what follows. Define the set families

L_{σ_L} = { A ⊆ L : (−1)^{|A|} = σ_L } and R_{σ_R} = { B ⊆ R : (−1)^{|B|} = σ_R }

with |L_{σ_L}| = |R_{σ_R}| = 2^{n/2 − 1}. Next we will define two families of length-d zero-one vectors whose pair counts by inner product will enable us to recover the coefficients w_{σ_L,σ_R}^{(e)}. The structure of the vectors will be slightly elaborate, so let us first define an index set D for indexing the |D| = d dimensions. Let

D = { (i, ℓ, r, k) ∈ [n] × {0, 1, ..., p − 1} × {0, 1, ..., p − 1} × [n²p] : ℓ + r ≢ 0 (mod p) implies k ≤ dlog_{p,g}(ℓ + r) }.

We have d = n³p² + n p (p − 1)(p − 2)/2 < n⁵ (ln n)³ for all large enough n. For A ∈ L_{σ_L} and B ∈ R_{σ_R}, define the vectors λ(A) ∈ {0,1}^D and ρ(B) ∈ {0,1}^D for all (i, ℓ, r, k) ∈ D by the rules

(34) λ(A)_{iℓrk} = 1 if ℓ ≡ Σ_{j ∈ A} m_{ij} (mod p), and 0 otherwise; ρ(B)_{iℓrk} = 1 if r ≡ Σ_{j ∈ B} m_{ij} (mod p), and 0 otherwise.

To study the inner product ⟨λ(A), ρ(B)⟩ it will be convenient to work with Iverson's bracket notation. Namely, for a logical proposition P, let [[P]] = 1 if P is true and [[P]] = 0 if P is false. Over the integers, from (34) we now have

⟨λ(A), ρ(B)⟩ = Σ_{(i,ℓ,r,k) ∈ D} λ(A)_{iℓrk} ρ(B)_{iℓrk}
= Σ_{i ∈ [n]} Σ_{ℓ, r : ℓ + r ≢ 0 (mod p)} [[ℓ ≡ Σ_{j ∈ A} m_{ij} (mod p)]] [[r ≡ Σ_{j ∈ B} m_{ij} (mod p)]] dlog_{p,g}(ℓ + r)
+ Σ_{i ∈ [n]} Σ_{ℓ=0}^{p−1} [[ℓ ≡ Σ_{j ∈ A} m_{ij} (mod p)]] [[p − ℓ ≡ Σ_{j ∈ B} m_{ij} (mod p)]] n²p

(35) = Σ_{i ∈ [n]} dlog_{p,g} Σ_{j ∈ A ∪ B} m_{ij} if ∏_{i ∈ [n]} Σ_{j ∈ A ∪ B} m_{ij} ≢ 0 (mod p); and ≥ n²p if ∏_{i ∈ [n]} Σ_{j ∈ A ∪ B} m_{ij} ≡ 0 (mod p).

In particular, letting

f_{σ_L,σ_R,t} = |{ (A, B) ∈ L_{σ_L} × R_{σ_R} : ⟨λ(A), ρ(B)⟩ = t }|,

it follows immediately from (35) that we have w_{σ_L,σ_R}^{(e)} = Σ_{t = 0, t ≡ e (mod p−1)}^{n(p−2)} f_{σ_L,σ_R,t}, which enables us to recover per M from the counts of pairs in L_{σ_L} × R_{σ_R} by inner product.

5.3. Completing the Proof of Theorem 3.
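Before completing the proof, the reduction of Sect. 5.2 admits a quick numerical sanity check. The following Python sketch (ours, illustrative only) evaluates the inner products (35) directly from the row-sum residues—rather than materializing the d-dimensional vectors λ(A) and ρ(B)—and recovers per M modulo p:

```python
import math
from itertools import permutations

def dlog_table(p, g):
    """Discrete logarithms dlog_{p,g}: maps a to e with g^e = a (mod p)."""
    table, a = {}, 1
    for e in range(p - 1):
        table[a] = e
        a = a * g % p
    return table

def per_mod_p_via_inner_products(M, p, g):
    """Recover per M (mod p) as in Sect. 5.2. For each pair (A, B), A a subset
    of the left columns and B of the right columns, the inner product of (35)
    is sum_i dlog(a_i + b_i) when all row sums are nonzero mod p, else >= n^2 p;
    g must be a primitive root modulo the prime p."""
    n = len(M)
    dlog = dlog_table(p, g)
    half, big = n // 2, n * n * p
    acc = 0
    for bitsA in range(1 << half):
        A = [j for j in range(half) if bitsA >> j & 1]
        for bitsB in range(1 << half):
            B = [j + half for j in range(half) if bitsB >> j & 1]
            t = 0
            for i in range(n):
                a = sum(M[i][j] for j in A) % p
                b = sum(M[i][j] for j in B) % p
                t += big if (a + b) % p == 0 else dlog[(a + b) % p]
            if t < big:  # otherwise the Ryser product vanishes mod p
                acc += (-1) ** (len(A) + len(B)) * pow(g, t % (p - 1), p)
    return (-1) ** n * acc % p

def per_brute(M):
    n = len(M)
    return sum(math.prod(M[i][s[i]] for i in range(n))
               for s in permutations(range(n)))
```

With the primitive roots g = 2 modulo 5 and g = 3 modulo 7, the result agrees with a brute-force permanent on small even n.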
Suppose we have an algorithm for InnerProduct that runs in N^{2 − Ω(1/log c)} time when given an input of N vectors from {0,1}^{c log N}. Take N = 2^{n/2 − 1} and observe that log N = n/2 − 1. The reduction from the previous section has d ≤ n⁵ (ln n)³ and thus we can take c ≤ (n ln n)⁵; the polynomially many invocations—over the choices of the target t, the signs σ_L, σ_R, and the primes p ≤ n ln n—are absorbed by the Ω(·) notation. We can thus solve the n × n zero-one permanent in time N^{2 − Ω(1/log c)} = 2^{n − Ω(n/log n)}. This completes the proof of Theorem 3.

6. A Lower Bound for OV
This section continues our work towards relations to zero-one permanents started in Sect. 5; in particular, we prove Theorem 4.

6.1. A Reduction from Zero-One Permanent to OV. This section starts our work towards Theorem 4 without yet parameterizing the reduction in detail. As in Sect. 5, it suffices to describe how to compute per M modulo a given prime p with 2 ≤ p ≤ n ln n. Let g be a positive integer parameter, which we assume divides n. For h ∈ [g], let

V_h = { i ∈ [n] : (h − 1) n/g + 1 ≤ i ≤ h n/g },

so that V_1, V_2, ..., V_g partition the rows of M into g groups, each of size n/g. Again from Ryser's formula, we observe that

per M = (−1)^n Σ_{S ⊆ [n]} (−1)^{|S|} ∏_{h ∈ [g]} ∏_{i ∈ V_h} Σ_{j ∈ S} m_{ij}.

Grouping by sign σ ∈ {−1, 1} and per-group residues r ∈ {0, 1, ..., p − 1}^g modulo p, we thus have

(36) per M ≡ (−1)^n Σ_{r ∈ {0,1,...,p−1}^g} ( t_{1,r} − t_{−1,r} ) ∏_{h=1}^{g} r_h (mod p),

where

t_{σ,r} = |{ S ⊆ [n] : (−1)^{|S|} = σ and ∏_{i ∈ V_h} Σ_{j ∈ S} m_{ij} ≡ r_h (mod p) for each h ∈ [g] }|.

Observe that given all the counts t_{σ,r}, it takes O(p^g g) operations modulo p to compute the permanent modulo p via (36), which is less than 2^n when g < n/(2 log p). We continue to describe how to get the counts t_{σ,r} via orthogonal-vector counting.

Assuming that n ≥ 2 is even, let L = {1, 2, ..., n/2} and R = {n/2 + 1, n/2 + 2, ..., n}. Let the residue vector r ∈ {0, 1, ..., p − 1}^g be fixed. For σ_L, σ_R ∈ {1, −1}, let

t_{σ_L,σ_R,r} = |{ S ⊆ [n] : (−1)^{|S ∩ L|} = σ_L, (−1)^{|S ∩ R|} = σ_R, and ∏_{i ∈ V_h} Σ_{j ∈ S} m_{ij} ≡ r_h (mod p) for each h ∈ [g] }|.

Clearly t_{σ,r} = Σ_{σ_L, σ_R ∈ {−1,1} : σ_L σ_R = σ} t_{σ_L,σ_R,r}, so it suffices to focus on computing t_{σ_L,σ_R,r} in what follows. We again work with the set families

L_{σ_L} = { A ⊆ L : (−1)^{|A|} = σ_L } and R_{σ_R} = { B ⊆ R : (−1)^{|B|} = σ_R }.

Let D = [g] × {0, 1, ..., p − 1}^{n/g}. We have d = |D| = g p^{n/g}.
For A ∈ L_{σ_L} and B ∈ R_{σ_R}, define the vectors λ(A) ∈ {0,1}^D and ρ(B) ∈ {0,1}^D for all (h, u) ∈ D by the rules

(37) λ(A)_{hu} = 1 if Σ_{j ∈ A} m_{ij} ≡ u_{i − (h−1)n/g} (mod p) for all i ∈ V_h, and 0 otherwise; ρ(B)_{hu} = 0 if ∏_{i ∈ V_h} ( u_{i − (h−1)n/g} + Σ_{j ∈ B} m_{ij} ) ≡ r_h (mod p), and 1 otherwise.

Over the integers, from (37) we now have

⟨λ(A), ρ(B)⟩ = Σ_{(h,u) ∈ D} λ(A)_{hu} ρ(B)_{hu}
= Σ_{h ∈ [g]} Σ_{u ∈ {0,1,...,p−1}^{n/g}} ( ∏_{i ∈ V_h} [[Σ_{j ∈ A} m_{ij} ≡ u_{i − (h−1)n/g} (mod p)]] ) [[∏_{i ∈ V_h} ( u_{i − (h−1)n/g} + Σ_{j ∈ B} m_{ij} ) ≢ r_h (mod p)]]
= Σ_{h ∈ [g]} [[∏_{i ∈ V_h} ( Σ_{j ∈ A} m_{ij} + Σ_{j ∈ B} m_{ij} ) ≢ r_h (mod p)]]
= 0 if ∏_{i ∈ V_h} Σ_{j ∈ A ∪ B} m_{ij} ≡ r_h (mod p) for each h ∈ [g], and ≥ 1 otherwise.

In particular, we have

t_{σ_L,σ_R,r} = |{ (A, B) ∈ L_{σ_L} × R_{σ_R} : ⟨λ(A), ρ(B)⟩ = 0 }|,

which enables us to recover per M from the counts of orthogonal pairs in L_{σ_L} × R_{σ_R}.

6.2. Completing the Proof of Theorem 4.
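Before parameterizing the reduction, the construction (37) can be exercised explicitly on a toy instance. The following Python sketch (ours, illustrative only) materializes λ(A) and ρ(B) and confirms that ⟨λ(A), ρ(B)⟩ counts the groups h whose residue misses r_h, so that orthogonality is equivalent to matching the whole residue vector r:

```python
import math
from itertools import product

def ov_vectors(M, p, g, r):
    """Build lambda(A) and rho(B) of (37) over D = [g] x {0,...,p-1}^{n/g}."""
    n = len(M)
    size = n // g
    groups = [range(h * size, (h + 1) * size) for h in range(g)]
    D = [(h, u) for h in range(g) for u in product(range(p), repeat=size)]

    def lam(A):
        # 1 iff u records the row-sum residues of A on every row of group h
        return [1 if all(sum(M[i][j] for j in A) % p == u[i - h * size]
                         for i in groups[h]) else 0
                for (h, u) in D]

    def rho(B):
        # 0 iff the shifted group product hits the prescribed residue r_h
        return [0 if math.prod(u[i - h * size] + sum(M[i][j] for j in B)
                               for i in groups[h]) % p == r[h] else 1
                for (h, u) in D]

    return lam, rho

def group_residues(M, p, g, cols):
    """prod_{i in V_h} sum_{j in cols} m_ij (mod p), for each group h."""
    size = len(M) // g
    return tuple(math.prod(sum(M[i][j] for j in cols) for i in rng) % p
                 for rng in (range(h * size, (h + 1) * size) for h in range(g)))
```

For every A ⊆ L, B ⊆ R, and residue vector r, the dot product of λ(A) and ρ(B) vanishes exactly when the residue vector of A ∪ B equals r.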
Suppose now that we have an algorithm for OV that runs in N^{2 − Ω(1/log^{1−ε} c)} time for some 0 < ε < 1 when given an input of N vectors from {0,1}^{c log N}. Take N = 2^{n/2 − 1} and observe that log N = n/2 − 1. Let K > 1 be a constant whose value is chosen large enough based on ε and the constant hidden by the Ω(·) notation in the running time of the OV algorithm. Take g = ⌊K^{−1/ε} n (log p)^{1 − 2/ε}⌋ and recall that the prime p is in the range 2 ≤ p ≤ n ln n. To compute the parameters t_{σ,r} using the reduction in the previous section, for each prime p we need 4p^g invocations of the OV algorithm on an input of N vectors of dimension d = g p^{n/g}. Thus, for all large enough n, since K^{−1/ε} n (log p)^{1−2/ε} ≤ 2g, we have

d = g p^{n/g} ≤ n 2^{2 K^{1/ε} (log p)^{2/ε}}.

Since clearly d = c log N = c(n/2 − 1) and 2^{1/ε} > 2, for all large enough n, we have

log c ≤ 2 + 2 K^{1/ε} (log p)^{2/ε} ≤ 4 K^{1/ε} (log p)^{2/ε},

where the last inequality depends on choosing a large enough K so that the inequality is true for p = 2. Thus,

(log c)^{ε−1} ≥ 4^{ε−1} K^{1−1/ε} (log p)^{2 − 2/ε}.

One invocation of the OV algorithm thus runs in

N^{2 − Ω((log c)^{ε−1})} = 2^{n − Ω(n 4^{ε−1} K^{1−1/ε} (log p)^{2−2/ε})}

time. For each prime 2 ≤ p ≤ n ln n, we need

4 p^g ≤ 2^{2 + K^{−1/ε} n (log p)^{2−2/ε}}

invocations of the OV algorithm. Thus, the running time of all the invocations for the prime p is bounded by

4 p^g N^{2 − Ω((log c)^{ε−1})} ≤ 2^{n − Ω(n 4^{ε−1} K^{1−1/ε} (log p)^{2−2/ε}) + 2 + K^{−1/ε} n (log p)^{2−2/ε}}.

By choosing a large enough K to dominate the constant hidden by the Ω(·) notation in the running time of the OV algorithm, we thus have, for all large enough n,

4 p^g N^{2 − Ω((log c)^{ε−1})} ≤ 2^{n − Ω(n K^{1−1/ε} (log p)^{2−2/ε})} ≤ 2^{n − Ω(n K^{1−1/ε} (log n + log ln n)^{2−2/ε})} ≤ 2^{n − Ω(n (log n)^{2−2/ε})}.

Since there are at most n ln n primes p to consider, the total running time to compute per M is bounded by 2^{n − Ω(n/log^{2/ε−2} n)}. This completes the proof of Theorem 4.

7. Further Applications
7.1. Counting Satisfying Assignments to a Sym∘And Circuit via InnerProduct. We describe how to embed a Sym∘And circuit, i.e., a circuit of s And gates working on n Boolean inputs, connected by a top gate that is an arbitrary symmetric gate, in an InnerProduct instance of size N = 2^{n/2} and d = s. Assuming n even, we divide the n inputs in two equal halves L and R. We let A have one vector u for each assignment to the inputs in L, with one coordinate in u for each And gate, representing the truth value of that gate restricted to the inputs in L. Likewise, we let B have one vector v for each assignment to the inputs in R, with each coordinate set to the truth value of the represented gate restricted to the inputs in R. It is readily verified that ⟨u, v⟩ counts the number of And gates that are satisfied by the assignment represented by (u, v). Hence, knowing the number of assignments that satisfy exactly t of the And gates, for t = 0, 1, ..., s, which is what the solution to the InnerProduct instance gives us, we can count the total number of assignments that also satisfy the top symmetric gate. Variations where the circuit instead is a Sym∘Or or a Sym∘Xor are also possible.

7.2.
Computing the Weight Enumerator Polynomial via InnerProduct. A binary linear code of length n and rank k is a linear subspace C with dimension k of the vector space F₂ⁿ. The weight enumerator polynomial is

W(C; x, y) = Σ_{w=0}^{n} A_w x^w y^{n−w},

where A_w = |{ c ∈ C : ⟨c, c⟩ = w }|, for w = 0, 1, ..., n, is the weight distribution; that is, A_w equals the number of codewords of C having exactly w ones. We will reduce the computation of the weight distribution, and hence the weight enumerator polynomial, to (k/2 + 1)² instances of InnerProduct with N ≤ 2^{k/2} and d = 2(n − k) when k is even. Let the k × n matrix G be the generating matrix of the code; that is, the codewords of C are exactly the row span of G. We can assume without loss of generality that the generator matrix has the standard form G = [I_k | P], where I_k is the k × k identity matrix.

For each s_A = 0, 1, ..., k/2 and s_B = 0, 1, ..., k/2, we make one instance of InnerProduct. We let the set A have one vector u for each codeword c obtained as the linear combination of exactly s_A of the first k/2 rows of G. Each of the n − k last coordinates in the codeword c is described by a block of two coordinates in u: if c_i = 0 we encode this as 01 in u, and if c_i = 1 we encode this as 10 in u. We concatenate all n − k encoded blocks to obtain u. Likewise, we let the set B have one vector v for each codeword c obtained as a linear combination of exactly s_B of the last k/2 rows of G. Again each of the n − k last coordinates in the codeword c is described by a block of two coordinates in v, but the encoding is opposite the one for A: if c_i = 0 we encode this as 10 in v, and if c_i = 1 we encode this as 01 in v. We again concatenate all n − k encoded blocks to obtain v. With this design, it is readily verified that for (u, v) ∈ A × B, the inner product ⟨u, v⟩ is equal to the number of ones in the last n − k coordinates of the codeword obtained as the sum of the codeword represented by u and the codeword represented by v. Also, by design, the number of ones in the first k coordinates of this sum equals s_A + s_B. Hence, by summing over all pairs that have the same inner product t, and aggregating over all s_A and s_B, we can compute the weight distribution.

Acknowledgment
We thank Virginia Vassilevska Williams and Ryan Williams for many useful discussions.
References [1] A. Abboud, R. R. Williams, and H. Yu. More applications of the polynomial method to algorithm design. InP. Indyk, editor,
Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 218–230. SIAM, 2015.
[2] J. Alman. An illuminating algorithm for the light bulb problem. In J. T. Fineman and M. Mitzenmacher, editors, 2nd Symposium on Simplicity in Algorithms, SOSA 2019, volume 69 of
OASIcs, pages 2:1–2:11. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019.
[3] J. Alman, T. M. Chan, and R. R. Williams. Polynomial representations of threshold functions and algorithmic applications. In I. Dinur, editor,
IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA , pages 467–476. IEEE ComputerSociety, 2016.[4] J. Alman, T. M. Chan, and R. R. Williams. Faster deterministic and Las Vegas algorithms for offline approximatenearest neighbors in high dimensions. In S. Chawla, editor,
Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 637–649. SIAM, 2020.
[5] J. Alman and R. Williams. Probabilistic polynomials and Hamming nearest neighbors. In V. Guruswami, editor,
IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20October, 2015 , pages 136–150. IEEE Computer Society, 2015.[6] E. Bach and J. Shallit.
Algorithmic Number Theory. Vol. 1. Foundations of Computing Series. MIT Press, Cambridge, MA, 1996. Efficient algorithms.
[7] E. T. Bax and J. Franklin. A permanent algorithm with exp[Ω(N^{1/3}/(2 ln N))] expected speedup for 0-1 matrices. Algorithmica, 32(1):157–162, 2002.
[8] R. Beigel and J. Tarui. On ACC.
Computational Complexity, 4:350–366, 1994.
[9] A. Björklund. Below all subsets for some permutational counting problems. In R. Pagh, editor, 15th Scandinavian Symposium and Workshops on Algorithm Theory, SWAT 2016, volume 53 of
LIPIcs, pages 17:1–17:11. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
[10] A. Björklund, T. Husfeldt, and I. Lyckberg. Computing the permanent modulo a prime power.
Inf. Process. Lett., 125:20–25, 2017.
[11] A. Björklund, P. Kaski, and R. Williams. Generalized Kakeya sets for polynomial evaluation and faster computation of fermionants.
Algorithmica, 81(10):4010–4028, 2019.
[12] A. Björklund and R. Williams. Computing permanents and counting Hamiltonian cycles by listing dissimilar vectors. In C. Baier, I. Chatzigiannakis, P. Flocchini, and S. Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, volume 132 of LIPIcs, pages 25:1–25:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2019.
[13] T. M. Chan and R. Williams. Deterministic APSP, orthogonal vectors, and more: Quickly derandomizing Razborov-Smolensky. In R. Krauthgamer, editor,
Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1246–1255. SIAM, 2016.
[14] L. Chen and R. Williams. An equivalence class for orthogonal vectors. In T. M. Chan, editor,
Proceedings of theThirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA,January 6-9, 2019 , pages 21–40. SIAM, 2019.[15] D. Coppersmith. Rapid multiplication of rectangular matrices.
SIAM J. Comput. , 11(3):467–471, 1982.[16] R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity?
J. Comput. Syst. Sci., 63(4):512–530, 2001.
[17] M. Karppa, P. Kaski, and J. Kohonen. A faster subquadratic algorithm for finding outlier correlations.
ACM Trans. Algorithms, 14(3):31:1–31:26, 2018.
[18] D. E. Knuth.
The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, 1998.
[19] H. Robbins. A remark on Stirling's formula.
The American Mathematical Monthly , 62(1):26–29, 1955.[20] J. B. Rosser and L. Schoenfeld. Approximate formulas for some functions of prime numbers.
Ill. J. Math. , 6:64–94,1962.[21] H. J. Ryser.
Combinatorial Mathematics . The Carus Mathematical Monographs, No. 14. Published by TheMathematical Association of America; distributed by John Wiley and Sons, Inc., New York, 1963.[22] G. Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pairproblem.
J. ACM , 62(2):13:1–13:45, 2015.[23] L. G. Valiant. The complexity of computing the permanent.
Theor. Comput. Sci. , 8:189–201, 1979.[24] R. Williams. A new algorithm for optimal 2-constraint satisfaction and its implications.