Time-Optimal Interactive Proofs for Circuit Evaluation
Justin Thaler∗

Abstract
Several research teams have recently been working toward the development of practical general-purpose protocols for verifiable computation. These protocols enable a computationally weak verifier to offload computations to a powerful but untrusted prover, while providing the verifier with a guarantee that the prover performed the requested computations correctly. Despite substantial progress, existing implementations require further improvements before they become practical for most settings. The main bottleneck is typically the extra effort required by the prover to return an answer with a guarantee of correctness, compared to returning an answer with no guarantee.

We describe a refinement of a powerful interactive proof protocol due to Goldwasser, Kalai, and Rothblum [21]. Cormode, Mitzenmacher, and Thaler [14] show how to implement the prover in this protocol in time O(S log S), where S is the size of an arithmetic circuit computing the function of interest. Our refinements apply to circuits with sufficiently "regular" wiring patterns; for these circuits, we bring the runtime of the prover down to O(S). That is, our prover can evaluate the circuit with a guarantee of correctness, with only a constant-factor blowup in work compared to evaluating the circuit with no guarantee.

We argue that our refinements capture a large class of circuits, and we complement our theoretical results with experiments on problems such as matrix multiplication and determining the number of distinct elements in a data stream. Experimentally, our refinements yield a 200x speedup for the prover over the implementation of Cormode et al., and our prover is less than 10x slower than a C++ program that simply evaluates the circuit. Along the way, we describe a special-purpose protocol for matrix multiplication that is of interest in its own right.

Our final contribution is the design of an interactive proof protocol targeted at general data parallel computation.
Compared to prior work, this protocol can more efficiently verify complicated computations as long as that computation is applied independently to many different pieces of data.

∗Harvard University, School of Engineering and Applied Sciences. Supported by an NSF Graduate Research Fellowship and NSF grants CNS-1011840 and CCF-0915922.
1 Introduction
Protocols for verifiable computation enable a computationally weak verifier V to offload computations to a powerful but untrusted prover P. These protocols aim to provide the verifier with a guarantee that the prover performed the requested computations correctly, without requiring the verifier to perform the computations herself.

Surprisingly powerful protocols for verifiable computation were discovered within the computer science theory community several decades ago, in the form of interactive proofs (IPs) and their brethren, interactive arguments (IAs) and probabilistically checkable proofs (PCPs). In these protocols, the prover P solves a problem using her (possibly vast) computational resources, and tells V the answer. P and V then have a conversation, i.e., they engage in a randomized protocol involving the exchange of one or more messages. During this conversation, P's goal is to convince V that the answer is correct.

Results quantifying the power of IPs, IAs, and PCPs represent some of the most celebrated results in all of computational complexity theory, but until recently they were mainly of theoretical interest, far too inefficient for actual deployment. In fact, the main applications of these results have traditionally been negative: showing that many problems are just as hard to approximate as they are to solve exactly.

However, the surging popularity of cloud computing has brought renewed interest in positive applications of protocols for verifiable computation. A typical motivating scenario is as follows. A business processes billions or trillions of transactions a day. The volume is sufficiently high that the business cannot or will not store and process the transactions on its own. Instead, it offloads the processing to a commercial cloud computing service.
The offloading of any computation raises issues of trust: the business may be concerned about relatively benign events like dropped transactions, buggy algorithms, or uncorrected hardware faults, or the business may be more paranoid and fear that the cloud operator is deliberately deceptive or has been externally compromised. Either way, each time the business poses a query to the cloud, the business may demand that the cloud also provide a guarantee that the returned answer is correct.

This is precisely what protocols for verifiable computation accomplish, with the cloud acting as the prover in the protocol, and the business acting as the verifier. In this paper, we describe a refinement of an existing general-purpose protocol originally due to Goldwasser, Kalai, and Rothblum [14, 21]. When they are applicable, our techniques achieve asymptotically optimal runtime for the prover, and we demonstrate that they yield protocols that are significantly closer to practicality than those achieved by prior work.

We also make progress toward addressing another issue of existing interactive proof implementations: their applicability. The protocol of Goldwasser, Kalai, and Rothblum (henceforth the GKR protocol) applies in principle to any problem computed by a small-depth arithmetic circuit, but this is not the case when more fine-grained considerations of prover and verifier efficiency are taken into account. In brief, existing implementations of interactive proof protocols for circuit evaluation all require that the circuit have a highly regular wiring pattern [14, 40]. If this is not the case, then these implementations require the verifier to perform an expensive (though data-independent) preprocessing phase to pull out information about the wiring of the circuit, and they require a substantial factor blowup (logarithmic in the circuit size) in runtime for the prover relative to evaluating the circuit without a guarantee of correctness.
Developing a protocol that avoids these pitfalls and applies to more general computations remains an important open question.

Our approach is the following. We do not have a magic bullet for dealing with irregular wiring patterns; if we want to avoid an expensive pre-processing phase for the verifier and minimize the blowup in runtime for the prover, we do need to make an assumption about the structure of the circuit we are verifying. Acknowledging this, we ask whether there is some general structure in real-world computations that we can leverage for efficiency gains.

To this end, we design a protocol that is highly efficient for data parallel computation. By data parallel computation, we mean any setting in which one applies the same computation independently to many pieces of data. Many outsourced computations are data parallel, with Amazon Elastic MapReduce being one prominent example of a cloud computing service targeted specifically at data parallel computations. Crucially, we do not want to make significant assumptions on the sub-computation that is being applied, and in particular we want to handle sub-computations computed by circuits with highly irregular wiring patterns.

The verifier in our protocol still has to perform an offline phase to pull out information about the wiring of the circuit, but the cost of this phase is proportional to the size of a single instance of the sub-computation, avoiding any dependence on the number of pieces of data to which the sub-computation is applied. Similarly, the blowup in runtime suffered by the prover is the same as it would be if the prover had run the basic GKR protocol on a single instance of the sub-computation.

Our final contribution is to describe a new protocol specific to matrix multiplication that is of interest in its own right. It avoids circuit evaluation entirely, and reduces the overhead of the prover (relative to running any unverifiable algorithm) to an additive low-order term.
Goldwasser, Kalai, and Rothblum described a powerful general-purpose interactive proof protocol in [21]. This protocol is framed in the context of circuit evaluation. Given a layered arithmetic circuit C of depth d, size S(n), and fan-in 2, the GKR protocol allows a prover to evaluate C with a guarantee of correctness in time poly(S(n)), while the verifier runs in time Õ(n + d log S(n)), where n is the length of the input and the Õ notation hides polylogarithmic factors in n.

Cormode, Mitzenmacher, and Thaler showed how to bring the runtime of the prover in the GKR protocol down from poly(S(n)) to O(S(n) log S(n)) [14]. They also built a full implementation of the protocol and ran it on benchmark problems. These results demonstrated that the protocol does indeed save the verifier significant time in practice (relative to evaluating the circuit locally); they also demonstrated surprising scalability for the prover, although the prover's runtime remained a major bottleneck. With the implementation of [14] as a baseline, Thaler et al. [38] described a parallel implementation of the GKR protocol that achieved 40x-100x speedups for the prover and 100x speedups for the (already fast) implementation of the verifier.

Vu, Setty, Blumberg, and Walfish [40] further refine and extend the implementation of Cormode et al. [14]. In particular, they combine the GKR protocol with a compiler from a high-level programming language, so that programmers do not have to explicitly express computations in the form of arithmetic circuits as was the case in the implementation of [14]. This substantially extends the reach of the implementation, but it should be noted that their approach generates circuits with irregular wiring patterns, and hence only works in a batching model, where the cost of a fairly expensive offline setup phase is amortized by verifying many instances of a single computation in batch.
They also build a hybrid system that statically evaluates whether it is better to use the GKR protocol or a different, cryptography-based argument system called Zaatar (see Section 1.1.2), and runs the more efficient of the two protocols in an automated fashion.

A growing line of work studies protocols for verifiable computation in the context of data streaming. In this context, the goal is not just to save the verifier time (compared to doing the computation without a prover), but also to save the verifier space. The protocols developed in this line of work allow the client to verify the prover's answers while storing only a small summary of the data stream.

http://aws.amazon.com/elasticmapreduce/

There has been a lot of work on the development of efficient interactive arguments, which are essentially interactive proofs that are secure only against dishonest provers that run in polynomial time. A substantial body of work in this area has focused on the development of protocols targeted at specific problems (e.g. [2, 5, 16]). Other works have focused on the development of general-purpose argument systems. Several papers in this direction (e.g. [8, 10, 11, 18]) have used fully homomorphic encryption, which unfortunately remains impractical despite substantial recent progress. Work in this category by Chung et al. [10] focuses on streaming settings, and is therefore particularly relevant.

Several research teams have been pursuing the development of general-purpose argument systems that might be suitable for practical use. Theoretical work by Ben-Sasson et al. [4] focuses on the development of short PCPs that might be suitable for use in practice; such PCPs can be compiled into efficient interactive arguments. As short PCPs are often a bottleneck in the development of efficient argument systems, other works have focused on avoiding their use [3, 6, 7, 19]. In particular, Gennaro et al. [19] and Bitansky et al. [9] develop argument systems with a clear focus on implementation potential. Very recent work by Parno et al.
[30] describes a near-practical general-purpose implementation, called Pinocchio, of an argument system based on [19]. Pinocchio is additionally non-interactive and achieves public verifiability.

Another line of implementation work focusing on general-purpose interactive argument systems is due to Setty et al. [34-36]. This line of work begins with a base argument system due to Ishai et al. [25], and substantially refines the theory to achieve an implementation that approaches practicality. The most recent system in this line of work is called Zaatar [36], and is also based on the work of Gennaro et al. [19]. An empirical comparison of the GKR-based approach and Zaatar performed by Vu et al. [40] finds the GKR approach to be significantly more efficient for quasi-straight-line computations (e.g. programs with relatively simple control flow), while Zaatar is appropriate for programs with more complicated control flow.
Our primary contributions are three-fold. Our first contribution addresses one of the biggest remaining obstacles to achieving a truly practical implementation of the GKR protocol: the logarithmic factor overhead for the prover. That is, Cormode et al. show how to implement the prover in time O(S(n) log S(n)), where S(n) is the size of the arithmetic circuit to which the GKR protocol is applied, down from the Ω(S(n)^3) time required for a naive implementation. The hidden constant in the Big-Oh notation is at least 3, and the log S(n) factor translates to well over an order of magnitude, even for circuits with a few million gates. We remove this logarithmic factor, bringing P's runtime down to O(S(n)) for a large class of circuits. Informally, our results apply to any circuit whose wiring pattern is sufficiently "regular". We formalize the class of circuits to which our results apply in Theorem 1.

We experimentally demonstrate the generality and effectiveness of Theorem 1 via two case studies. Specifically, we apply an implementation of the protocol of Theorem 1 to a circuit computing matrix multiplication (MATMULT), as well as to a circuit computing the number of distinct items in a data stream
(DISTINCT). Experimentally, our refinements yield a 200x-250x speedup for the prover over the state of the art implementation of Cormode et al. [14]. A serial implementation of our prover is less than 10x slower than a C++ program that simply evaluates the circuit sequentially, a slowdown that is tolerable in realistic outsourcing scenarios where cycles are plentiful for the prover. Moreover, a parallel implementation of our prover using a graphics processing unit (GPU) is roughly 30x faster than our serial implementation, and therefore takes less time than that required to evaluate the circuit in serial.

Our second contribution is to specify a highly efficient protocol for verifiably outsourcing arbitrary data parallel computation. Compared to prior work, this protocol can more efficiently verify complicated computations, as long as that computation is applied independently to many different pieces of data. We formalize this protocol and its efficiency guarantees in Theorem 2.

Our third contribution is to describe a new protocol specific to matrix multiplication that we believe to be of interest in its own right. This protocol is formalized in Theorem 3. Given any unverifiable algorithm for n × n matrix multiplication that requires time T(n) using space s(n), Theorem 3 allows the prover to run in time T(n) + O(n^2) using space s(n) + o(n^2). Note that Theorem 3 (which is specific to matrix multiplication) is much less general than Theorem 1 (which applies to any circuit with a sufficiently regular wiring pattern). However, Theorem 3 achieves optimal runtime and space usage for the prover up to leading constants, assuming there is no O(n^2) time algorithm for matrix multiplication. While these properties are also satisfied by a classic protocol due to Freivalds [17], the protocol of Theorem 3 is significantly more amenable for use as a primitive when verifying computations that repeatedly invoke matrix multiplication.
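For intuition, Freivalds' classic check mentioned above can be sketched as follows. This is the standard textbook algorithm (not the protocol of Theorem 3), and the function name is ours; it illustrates why a claimed product can be verified in O(n^2) time per repetition.

```python
import random

def freivalds_check(A, B, C, p=2**31 - 1, reps=10):
    """Probabilistically check that A*B == C over the integers mod p.

    Each repetition multiplies by a random 0/1 vector x and compares
    A(Bx) with Cx, costing O(n^2) work instead of a full O(n^3)
    matrix multiplication. A wrong C is caught with probability
    at least 1 - 2^{-reps}.
    """
    n = len(A)

    def matvec(M, x):
        return [sum(M[i][j] * x[j] for j in range(n)) % p for i in range(n)]

    for _ in range(reps):
        x = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, x)) != matvec(C, x):
            return False
    return True
```

Note that, unlike the protocol of Theorem 3, this check gives the verifier no help when matrix multiplication is an intermediate step inside a larger computation, which is precisely the setting where Theorem 3 is useful as a primitive.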
For example, using the protocol of Theorem 3 as a primitive, we give a natural protocol for computing the diameter of an unweighted directed graph G. V's runtime in this protocol is O(m log n), where m is the number of edges in G, P's runtime matches the best known unverifiable diameter algorithm up to a low-order additive term [33, 42], and the total communication is just polylog(n). We know of no other protocol achieving this. We complement Theorem 3 with experimental results demonstrating its efficiency.

Section 2 presents preliminaries. We give a high-level overview of the ideas underlying our main results in Section 3. Section 4 gives a detailed overview of prior work, including the standard sum-check protocol as well as the GKR protocol. Section 5 contains the details of our time-optimal protocol for circuit evaluation as formalized in Theorem 1. Section 6 describes our experimental case studies of the protocol described in Theorem 1. Section 7 describes our protocol for arbitrary data parallel computation. Section 8 describes some additional optimizations that apply to specific important wiring patterns. In particular, this section describes our special-purpose protocol for
MATMULT that achieves optimal prover efficiency up to leading constants. Section 9 concludes.
We begin by defining a valid interactive proof protocol for a function f.

Definition 1.
Consider a prover P and verifier V who both observe an input x and wish to compute a function f : {0,1}^n → R for some set R. After the input is observed, P and V exchange a sequence of messages. Denote the output of V on input x, given prover P and V's random bits R, by out(V, x, R, P). V can output ⊥ if V is not convinced that P's claim is valid.

We say P is a valid prover with respect to V if for all inputs x, Pr_R[out(V, x, R, P) = f(x)] = 1. The property that there is at least one valid prover P with respect to V is called completeness. We say V is a valid verifier for f with soundness probability δ if there is at least one valid prover P with respect to V, and for all provers P′ and all inputs x, Pr[out(V, x, R, P′) ∉ {f(x), ⊥}] ≤ δ. We say a prover-verifier pair (P, V) is a valid interactive proof protocol for f if V is a valid verifier for f with soundness probability 1/3, and P is a valid prover with respect to V. If P and V exchange r messages in total, we say the protocol has ⌈r/2⌉ rounds.

Informally, the completeness property guarantees that an honest prover will convince the verifier that the claimed answer is correct, while the soundness property ensures that a dishonest prover will be caught with high probability. An interactive argument is an interactive proof where the soundness property holds only against polynomial-time provers P′. We remark that the constant 1/3 used for the soundness probability is not essential; it can be driven down to any desired constant by independent repetition of the protocol.

Whenever we work over a finite field F, we assume that a single field operation can be computed in a single machine operation. For example, when we say that the prover P in our interactive protocols requires time O(S(n)), we mean that P must perform O(S(n)) additions and multiplications within the finite field over which the protocol is defined.

Input Representation.
Following prior work [12, 14, 15], all of the protocols we consider can handle inputs specified in a general data stream form. Each element of the stream is a tuple (i, δ), where i ∈ [n] and δ is an integer. The δ values may be negative, thereby modeling deletions. The data stream implicitly defines a frequency vector a, where a_i is the sum of all δ values associated with i in the stream. For simplicity, we assume throughout the paper that the number of stream updates m is related to n by a constant factor, i.e., m = Θ(n).

When checking the evaluation of a circuit C, we consider the inputs to C to be the entries of the frequency vector a. We emphasize that in all of our protocols, V only needs to see the raw stream and not the aggregated frequency vector a (see Lemma 2 for details). Notice that we may interpret the frequency vector a as an object other than a vector, such as a matrix or a string. For example, in MATMULT, the data stream defines two matrices to be multiplied.

When we refer to a streaming verifier with space usage s(n), we mean that the verifier can make a single pass over the stream of tuples defining the input, regardless of their ordering, while storing at most s(n) elements in the finite field over which the protocol is defined.

To focus our discussion in this paper, we give special attention to two problems also considered in prior work [14, 38].

1. In the
MATMULT problem, the input consists of two n × n matrices A, B ∈ Z^{n×n}, and the goal is to compute the matrix product A · B.

2. In the DISTINCT problem, also denoted F_0, the input is a data stream consisting of m tuples (i, δ) from a universe of size n. The stream defines a frequency vector a, and the goal is to compute |{i : a_i ≠ 0}|, the number of items with non-zero frequency.

Additional Notation. Throughout, [n] will denote the set {1, . . . , n}, while [[n]] will denote the set {0, . . . , n − 1}. Let F be a field, and F* = F \ {0} its multiplicative group. For any d-variate polynomial p(x_1, . . . , x_d) : F^d → F, we use deg_i(p) to denote the degree of p in variable i. A d-variate polynomial p is said to be multilinear if deg_i(p) ≤ 1 for all i ∈ [d]. Given a function V : {0,1}^d → {0,1} whose domain is the d-dimensional Boolean hypercube, the multilinear extension (MLE) of V over F, denoted Ṽ, is the unique multilinear polynomial F^d → F that agrees with V on all Boolean-valued inputs. That is, Ṽ is the unique multilinear polynomial over F satisfying Ṽ(x) = V(x) for all x ∈ {0,1}^d.

We begin by describing the methodology underlying the GKR protocol before summarizing the ideas underlying our improved protocols.
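Before walking through the protocol, here is a small illustrative sketch (ours, not the paper's implementation) of the multilinear extension defined above: it evaluates Ṽ at an arbitrary point of F_p^d directly from the truth table of V, using the standard identity Ṽ(x) = Σ_{b ∈ {0,1}^d} V(b) · Π_j (x_j b_j + (1 − x_j)(1 − b_j)).

```python
def mle_eval(V, x, p):
    """Evaluate the multilinear extension V~ of V: {0,1}^d -> F_p at x in F_p^d.

    V is a list of 2^d values; V[b] is the value at the Boolean point whose
    coordinates are the bits of b, least-significant bit first. The loop
    computes sum_b V(b) * prod_j ( x_j*b_j + (1-x_j)*(1-b_j) ) mod p.
    """
    d = len(x)
    total = 0
    for b in range(len(V)):
        term = V[b]
        for j in range(d):
            bit = (b >> j) & 1
            # Lagrange basis factor: 1 if x_j == bit on Boolean inputs
            term = term * ((x[j] * bit + (1 - x[j]) * (1 - bit)) % p) % p
        total = (total + term) % p
    return total
```

This brute-force evaluation takes O(d · 2^d) field operations; the point of the protocols below is that neither party ever needs to evaluate an MLE this naively at more than a few points.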
In the GKR protocol, P and V first agree on an arithmetic circuit C of fan-in 2 over a finite field F computing the function of interest (C may have multiple outputs). Each gate of C performs an addition or multiplication over F. C is assumed to be in layered form, meaning that the circuit can be decomposed into layers, and wires only connect gates in adjacent layers. Suppose the circuit has depth d; we will number the layers from 1 to d, with layer d referring to the input layer, and layer 1 referring to the output layer.

In the first message, P tells V the (claimed) output of the circuit. The protocol then works its way in iterations towards the input layer, with one iteration devoted to each layer. The purpose of iteration i is to reduce a claim about the values of the gates at layer i to a claim about the values of the gates at layer i + 1, in the sense that it is safe for V to assume that the first claim is true as long as the second claim is true. This reduction is accomplished by applying the standard sum-check protocol [29] to a certain polynomial.

More concretely, the GKR protocol starts with a claim about the values of the output gates of the circuit, but V cannot check this claim without evaluating the circuit herself, which is precisely what she wants to avoid. So the first iteration uses a sum-check protocol to reduce this claim about the outputs of the circuit to a claim about the gate values at layer 2 (more specifically, to a claim about an evaluation of the multilinear extension (MLE) of the gate values at layer 2). Once again, V cannot check this claim herself, so the second iteration uses another sum-check protocol to reduce the latter claim to a claim about the gate values at layer 3, and so on.
Eventually, V is left with a claim about the inputs to the circuit, and V can check this claim on her own.

In summary, the GKR protocol uses a sum-check protocol at each level of the circuit to enable V to go from verifying a randomly chosen evaluation of the MLE of the gate values at layer i to verifying a (different) evaluation of the MLE of the gate values at layer i +
1. Importantly, apart from the input layer and output layer, V does not ever see all of the gate values at a layer (in particular, P does not send these values in full). Instead, V relies on P to do the hard work of actually evaluating the circuit, and uses the power of the sum-check protocol as the main tool to force P to be consistent and truthful over the course of the protocol.

Achieving Optimal Prover Runtime for Regular Circuits

In Theorem 1, we describe an interactive proof protocol for circuit evaluation that brings P's runtime down to O(S(n)) for a large class of circuits, while maintaining the same verifier runtime as in prior implementations of the GKR protocol. Informally, Theorem 1 applies to any circuit whose wiring pattern is sufficiently "regular".

This protocol follows the same general outline as the GKR protocol, in that we proceed in iterations from the output layer of the circuit to the input layer, using a sum-check protocol at iteration i to reduce a claim about the gate values at layer i to a claim about the gate values at layer i +
1. However, at each iteration i we apply the sum-check protocol to a carefully chosen polynomial that differs from the one used by GKR. In each round j of the sum-check protocol, our choice of polynomial allows P to reuse work from prior rounds in order to compute the prescribed message for round j, allowing us to shave a log S(n) factor from the runtime of P relative to the O(S(n) log S(n))-time implementation due to Cormode et al. [14].

Specifically, at iteration i, the GKR protocol uses a polynomial f_z^{(i)} defined over log S_i + 2 log S_{i+1} variables, where S_i is the number of gates at layer i. The "truth table" of f_z^{(i)} is sparse on the Boolean hypercube, in the sense that f_z^{(i)}(x) is non-zero for at most S_i of the S_i · S_{i+1}^2 inputs x ∈ {0,1}^{log S_i + 2 log S_{i+1}}. Cormode et al. leverage this sparsity to bring the runtime of P in iteration i down to O(S_i log S_i) from a naive bound of Ω(S_i · S_{i+1}^2). However, this same sparsity prevents P from reusing work from prior iterations as we seek to do. In contrast, we use a polynomial g_z^{(i)} defined over only log S_i variables rather than log S_i + 2 log S_{i+1} variables. Moreover, the truth table of g_z^{(i)} is dense on the Boolean hypercube, in the sense that g_z^{(i)}(x) may be non-zero for all of the S_i Boolean inputs x ∈ {0,1}^{log S_i}. This density allows P to reuse work from prior iterations in order to speed up her computation in iteration i of the sum-check protocol.

In more detail, in each round j of the sum-check protocol, the prover's prescribed message is defined via a sum over a large number of terms, where the number of terms falls geometrically fast with the round number j.
Moreover, it can be shown that in each round j, each gate at layer i + 1 contributes to exactly one term of the sum, so the total work that P needs to do is proportional to the number of terms in the sum rather than the number of gates S_i at layer i.

We remark that a similar "reuse of work" technique was implicit in an analysis by Cormode, Thaler, and Yi [15, Appendix B] of an efficient protocol for a specific streaming problem known as the second frequency moment. This frequency moment protocol was the direct inspiration for our refinements, though we require additional insights to apply the reuse of work technique in the context of evaluating general arithmetic circuits.

It is worth clarifying why our methods do not yield savings when applied to the polynomial f_z^{(i)} used in the basic GKR protocol. The reason is that, since f_z^{(i)} is defined over log S_i + 2 log S_{i+1} variables instead of just log S_i variables, the sum defining P's message in round j is over a much larger number of terms when using f_z^{(i)}. It is still the case that each gate contributes to only one term of the sum, but until the number of terms in the sum falls below S_{i+1} (which does not happen until round j = log S_i + log S_{i+1} of the sum-check protocol), it is possible for each gate to contribute to a different term. Before this point, grouping gates by the term of the sum to which they contribute is not useful, since each group can have size 1.

Verifying General Data Parallel Computations

Theorem 1 only applies to circuits with regular wiring patterns, as do other existing implementations of interactive proof protocols for circuit evaluation [14, 40].
For circuits with irregular wiring patterns, these implementations require the verifier to perform an expensive preprocessing phase (requiring time proportional to the size of the circuit) to pull out information about the wiring of the circuit, and they require a substantial factor blowup (logarithmic in the circuit size) in runtime for the prover relative to evaluating the circuit without a guarantee of correctness.

To address these bottlenecks, we do need to make an assumption about the structure of the circuit we are verifying. Ideally our assumption will be satisfied by many real-world computations. To this end, Theorem 2 will describe a protocol that is highly efficient for any data parallel computation, by which we mean any setting in which one applies the same computation independently to many pieces of data. See Figure 2 in Section 7 for a schematic of a data parallel computation.

The idea behind Theorem 2 is as follows. Let C be a circuit of size S with an arbitrary wiring pattern, and let C* be a "super-circuit" that applies C independently to B different inputs before possibly aggregating the results in some fashion. If one naively applied the basic GKR protocol to the super-circuit C*, V might have to perform a pre-processing phase that requires time proportional to the size of C*, which is Ω(B · S). Moreover, when applying the basic GKR protocol to C*, P would require time Θ(B · S · log(B · S)).

In order to improve on this, the key observation is that although each sub-computation C can have a very complicated wiring pattern, the circuit is "maximally regular" between sub-computations, as the sub-computations do not interact at all. Therefore, each time the basic GKR protocol would apply the sum-check protocol to a polynomial derived from the wiring predicate of C*, we instead use a simpler polynomial derived only from the wiring predicate of C.
This immediately brings the time required by V in the pre-processing phase down to O(S), which is proportional to the cost of executing a single instance of the sub-computation. By using the reuse of work technique underlying Theorem 1, we are also able to bring P's runtime down from Θ(B · S · log(B · S)) to Θ(B · S · log S). That is, P requires a factor of O(log S) more time to evaluate the circuit with a guarantee of correctness, compared to evaluating the circuit without such a guarantee. This O(log S) factor overhead does not depend on the batch size B.

Our improvements are most significant when B ≫ S, i.e., when a (relatively) small but potentially complicated sub-computation is applied to a very large number of pieces of data. For example, given any very large database, one may ask "How many people in the database satisfy Property P?" Our protocol allows one to verifiably outsource such counting queries with overhead that depends minimally on the size of the database, but that necessarily depends on the complexity of the property P.

MATMULT
We describe a special-purpose protocol for n × n MATMULT in Theorem 3. The idea behind this protocol is as follows. The GKR protocol, as well as the protocols of Theorems 1 and 2, only make use of the multilinear extension Ṽ_i of the function V_i mapping gate labels at layer i of the circuit to their values. In some cases, there is something to be gained by using a higher-degree extension of V_i, and this is precisely what we exploit here.

In more detail, our special-purpose protocol can be viewed as an extension of our circuit-checking techniques applied to a circuit C performing naive matrix multiplication, but using a quadratic extension of the gate values in this circuit. This allows us to verify the computation using a single invocation of the sum-check protocol. More importantly, P can evaluate this higher-degree extension at the necessary points without explicitly materializing all of the gate values of C, which would not be possible if we had used the multilinear extension of the gate values of C.

In the protocol of Theorem 3, P just needs to compute the correct output (possibly using an algorithm that is much more sophisticated than naive matrix multiplication), and then perform O(n^2) additional work to prove the output is correct. Since P does not have to evaluate C in full, this protocol is perhaps best viewed outside the lens of circuit evaluation. Still, the idea underlying Theorem 3 can be thought of as a refinement of our circuit evaluation protocols, and we believe that similar ideas may yield further improvements to general-purpose protocols in the future.

We will often make use of the following basic property of polynomials.
Lemma 1 ([32])
Let F be any field, and let f : F^m → F be a nonzero polynomial of total degree d. Then on any finite set S ⊆ F,

Pr_{x ← S^m}[f(x) = 0] ≤ d/|S|.

In words, if x is chosen uniformly at random from S^m, then the probability that f(x) = 0 is at most d/|S|. In particular, any two distinct polynomials of total degree d can agree on at most a d/|S| fraction of points in S^m.

Our main technical tool is the sum-check protocol [29], and we present a full description of this protocol for completeness. See also [1, Chapter 8] for a complete exposition and proof of soundness.

Suppose we are given a v-variate polynomial g defined over a finite field F. The purpose of the sum-check protocol is to compute the sum:

H := ∑_{b_1 ∈ {0,1}} ∑_{b_2 ∈ {0,1}} · · · ∑_{b_v ∈ {0,1}} g(b_1, . . . , b_v).

In order to execute the protocol, the verifier needs to be able to evaluate g(r_1, . . . , r_v) for a randomly chosen vector (r_1, . . . , r_v) ∈ F^v – see the paragraph preceding Proposition 1 below.

The protocol proceeds in v rounds as follows. In the first round, the prover sends a polynomial g_1(X_1), and claims that

g_1(X_1) = ∑_{(x_2, . . . , x_v) ∈ {0,1}^{v−1}} g(X_1, x_2, . . . , x_v).

Observe that if g_1 is as claimed, then H = g_1(0) + g_1(1). Also observe that the polynomial g_1(X_1) has degree deg_1(g), the degree of variable x_1 in g. Hence g_1 can be specified with deg_1(g) + 1 field elements. P will specify g_1 by sending the evaluation of g_1 at each point in the set {0, 1, . . . , deg_1(g)}.

Then, in round j > 1, V chooses a value r_{j−1} uniformly at random from F and sends r_{j−1} to P. We will often refer to this step by saying that variable j − 1 gets bound to value r_{j−1}. In return, the prover sends a polynomial g_j(X_j), and claims that

g_j(X_j) = ∑_{(x_{j+1}, . . . , x_v) ∈ {0,1}^{v−j}} g(r_1, . . . , r_{j−1}, X_j, x_{j+1}, . . . , x_v).    (1)

The verifier compares the two most recent polynomials by checking that g_{j−1}(r_{j−1}) = g_j(0) + g_j(1), and rejecting otherwise.
The verifier also rejects if the degree of g_j is too high: each g_j should have degree at most deg_j(g), the degree of variable x_j in g.

In the final round, the prover has sent g_v(X_v), which is claimed to be g(r_1, . . . , r_{v−1}, X_v). V now checks that g_v(r_v) = g(r_1, . . . , r_v) (recall that we assumed V can evaluate g at this point). If this test succeeds, and so do all previous tests, then the verifier accepts, and is convinced that H = g_1(0) + g_1(1).

Proposition 1
Let g be a v-variate polynomial defined over a finite field F, and let (P, V) be the prover-verifier pair in the above description of the sum-check protocol. (P, V) is a valid interactive proof protocol for the function H = ∑_{b_1 ∈ {0,1}} ∑_{b_2 ∈ {0,1}} · · · ∑_{b_v ∈ {0,1}} g(b_1, . . . , b_v).

Observe that there is one round in the sum-check protocol for each of the v variables of g. The total communication is ∑_{i=1}^{v} (deg_i(g) + 1) = v + ∑_{i=1}^{v} deg_i(g) field elements. In all of our applications, deg_i(g) = O(1) for all i, and so the communication cost is O(v) field elements.

The running time of the verifier over the entire execution of the protocol is proportional to the total communication, plus the amount of time required to compute g(r_1, . . . , r_v).

Determining the running time of the prover is less straightforward. Recall that P can specify g_j by sending for each i ∈ {0, . . . , deg_j(g)} the value:

g_j(i) = ∑_{(x_{j+1}, . . . , x_v) ∈ {0,1}^{v−j}} g(r_1, . . . , r_{j−1}, i, x_{j+1}, . . . , x_v).    (2)

An important insight is that the number of terms defining the value g_j(i) in Equation (2) falls geometrically with j: in the jth sum, there are only 2^{v−j} terms, each corresponding to a Boolean vector in {0,1}^{v−j}. The total number of terms that must be evaluated over the course of the protocol is therefore O(∑_{j=1}^{v} 2^{v−j}) = O(2^v). Consequently, if P is given oracle access to the truth table of the polynomial g, then P will require just O(2^v) time.

Unfortunately, in our applications P will not have oracle access to the truth table of g. The key to our results is to show that in our applications P can nonetheless evaluate g at all of the necessary points in O(2^v) total time.

We describe the details of the GKR protocol for completeness, as well as to simplify the exposition of our refinements.
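Before turning to the GKR protocol, the round structure of the sum-check protocol above can be made concrete with a minimal sketch. This is an illustrative sketch, not the paper's implementation: it assumes g is multilinear (so each g_j has degree 1 and two evaluations determine it), takes the field F to be the integers modulo a Mersenne prime, and computes the honest prover's messages by brute-force summation.

```python
import random

P = 2**61 - 1  # the field F: integers modulo a Mersenne prime (illustrative choice)

def sumcheck(g, v, p=P):
    """Sum-check for a v-variate *multilinear* polynomial g, given as a
    black-box function on lists of v field elements. Returns the sum H over
    the Boolean hypercube; the asserts are the verifier's round checks."""
    bits = lambda b, m: [(b >> k) & 1 for k in range(m)]
    H = sum(g(bits(b, v)) for b in range(2 ** v)) % p
    r, claim = [], H
    for j in range(v):
        # Honest prover: g_j(t) = sum over Boolean suffixes of
        # g(r_1, ..., r_{j-1}, t, x_{j+1}, ..., x_v), as in Equation (1).
        gj = lambda t: sum(g(r + [t] + bits(b, v - j - 1))
                           for b in range(2 ** (v - j - 1))) % p
        g0, g1 = gj(0), gj(1)
        assert claim == (g0 + g1) % p          # V checks g_{j-1}(r_{j-1}) = g_j(0) + g_j(1)
        rj = random.randrange(p)               # V binds variable j to a random field element
        claim = (g0 * (1 - rj) + g1 * rj) % p  # g_j(r_j), by degree-1 interpolation
        r.append(rj)
    assert claim == g(r) % p                   # final check: one evaluation of g by V
    return H

# g(x1, x2, x3) = x1*x2 + 2*x3; its sum over {0,1}^3 is 2 + 8 = 10
print(sumcheck(lambda x: (x[0] * x[1] + 2 * x[2]) % P, 3))  # -> 10
```

The brute-force prover here takes Θ(2^v) work per round; the point of the refinements below is that a careful prover can achieve O(2^v) work across all rounds combined.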
Suppose we are given a layered arithmetic circuit C of size S(n), depth d(n), and fan-in two. Let S_i denote the number of gates at layer i of the circuit C. Assume S_i is a power of 2 and let S_i = 2^{s_i}. In order to explain how each iteration of the GKR protocol proceeds, we need to introduce several functions, each of which encodes certain information about the circuit.

To this end, number the gates at layer i from 0 to S_i − 1, and let V_i : {0,1}^{s_i} → F denote the function that takes as input a binary gate label, and outputs the corresponding gate’s value at layer i. The GKR protocol makes use of the multilinear extension Ṽ_i of the function V_i (see Section 2.1.3).

The GKR protocol also makes use of the notion of a “wiring predicate” that encodes which pairs of wires from layer i + 1 are connected to a given gate at layer i in C. We define two functions, add_i and mult_i, mapping {0,1}^{s_i + 2s_{i+1}} to {0,1}, which together constitute the wiring predicate of layer i of C. Specifically, these functions take as input three gate labels (j_1, j_2, j_3), and return 1 if gate j_1 at layer i is the addition (respectively, multiplication) of gates j_2 and j_3 at layer i + 1, and return 0 otherwise. Let ãdd_i and m̃ult_i denote the multilinear extensions of add_i and mult_i respectively.

Finally, let β_{s_i}(z, p) denote the function

β_{s_i}(z, p) = ∏_{j=1}^{s_i} ((1 − z_j)(1 − p_j) + z_j p_j).

It is straightforward to check that β_{s_i} is the multilinear extension of the function B(x, y) : {0,1}^{s_i} × {0,1}^{s_i} → {0,1} that evaluates to 1 if x = y, and evaluates to 0 otherwise.

The GKR protocol consists of d(n) iterations, one for each layer of the circuit. Each iteration starts with P claiming a value for Ṽ_i(z) for some field element z ∈ F^{s_i}. In the first iteration, for circuits with a single output gate, z = 0, and Ṽ_0(0) corresponds to the output value of the circuit.

For circuits with many output gates, Vu et al. [40] observe that in the first iteration, P may simply send V the (claimed) values of all output gates, thereby specifying a function V′_0 : {0,1}^{s_0} → F claimed to equal V_0. V can pick a random point z ∈ F^{s_0} and evaluate Ṽ′_0(z) on her own in O(S_0) time (see Remark 1 in Section 4.3.5). The Schwartz-Zippel Lemma (Lemma 1) implies that it is safe for V to believe that V′_0 indeed equals V_0 as claimed, as long as Ṽ_0(z) = Ṽ′_0(z) (which will be checked in the remainder of the protocol).

The purpose of iteration i is to reduce the claim about the value of Ṽ_i(z) to a claim about Ṽ_{i+1}(ω) for some ω ∈ F^{s_{i+1}}, in the sense that it is safe for V to assume that the first claim is true as long as the second claim is true. To accomplish this, the iteration applies the sum-check protocol described in Section 4.2 to a specific polynomial derived from Ṽ_{i+1}, ãdd_i, m̃ult_i, and β_{s_i}. It can be shown that for any z ∈ F^{s_i},

Ṽ_i(z) = ∑_{(p, ω_1, ω_2) ∈ {0,1}^{s_i + 2s_{i+1}}} f_z^{(i)}(p, ω_1, ω_2),

where

f_z^{(i)}(p, ω_1, ω_2) = β_{s_i}(z, p) · ( ãdd_i(p, ω_1, ω_2)(Ṽ_{i+1}(ω_1) + Ṽ_{i+1}(ω_2)) + m̃ult_i(p, ω_1, ω_2) Ṽ_{i+1}(ω_1) · Ṽ_{i+1}(ω_2) ).    (3)
Iteration i therefore applies the sum-check protocol of Section 4.2 to the polynomial f_z^{(i)}. There remains the issue that V can only execute her part of the sum-check protocol if she can evaluate the polynomial f_z^{(i)} at a random point f_z^{(i)}(r_1, . . . , r_{s_i + 2s_{i+1}}). This is handled as follows.

Let p* denote the first s_i entries of the vector (r_1, . . . , r_{s_i + 2s_{i+1}}), ω_1* the next s_{i+1} entries, and ω_2* the last s_{i+1} entries. Evaluating f_z^{(i)}(p*, ω_1*, ω_2*) requires evaluating β_{s_i}(z, p*), ãdd_i(p*, ω_1*, ω_2*), m̃ult_i(p*, ω_1*, ω_2*), Ṽ_{i+1}(ω_1*), and Ṽ_{i+1}(ω_2*).

V can easily evaluate β_{s_i}(z, p*) in O(s_i) time. For many circuits, particularly those with “regular” wiring patterns, V can evaluate ãdd_i(p*, ω_1*, ω_2*) and m̃ult_i(p*, ω_1*, ω_2*) on her own in poly(s_i, s_{i+1}) time as well. V cannot however evaluate Ṽ_{i+1}(ω_1*) and Ṽ_{i+1}(ω_2*) on her own without evaluating the circuit. Instead, V asks P to simply tell her these two values, and uses iteration i + 1 to verify that these values are as claimed.

However, one complication remains: the precondition for iteration i + 1 is that P claims a value for Ṽ_{i+1}(z) for a single z ∈ F^{s_{i+1}}. So V needs to reduce verifying both Ṽ_{i+1}(ω_1*) and Ṽ_{i+1}(ω_2*) to verifying Ṽ_{i+1}(ω*) at a single point ω* ∈ F^{s_{i+1}}, in the sense that it is safe for V to accept the claimed values of Ṽ_{i+1}(ω_1*) and Ṽ_{i+1}(ω_2*) as long as the value of Ṽ_{i+1}(ω*) is as claimed. This is done as follows.

Reducing to Verification of a Single Point.
Let ℓ : F → F^{s_{i+1}} be some canonical line passing through ω_1* and ω_2*. For example, we can let ℓ be the unique line such that ℓ(0) = ω_1* and ℓ(1) = ω_2*. P sends a degree-s_{i+1} polynomial h claimed to be Ṽ_{i+1} ◦ ℓ, the restriction of Ṽ_{i+1} to the line ℓ. V checks that h(0) and h(1) equal the claimed values of Ṽ_{i+1}(ω_1*) and Ṽ_{i+1}(ω_2*) (rejecting if this is not the case), picks a random point r* ∈ F, and asks P to prove that Ṽ_{i+1}(ℓ(r*)) = h(r*). By the Schwartz-Zippel Lemma (Lemma 1), as long as V is convinced that Ṽ_{i+1}(ℓ(r*)) = h(r*), it is safe for V to believe that the values of Ṽ_{i+1}(ω_1*) and Ṽ_{i+1}(ω_2*) are as claimed by P. This completes iteration i; P and V then move on to the iteration for layer i + 1, in which P must establish that Ṽ_{i+1}(ℓ(r*)) has the claimed value h(r*).

The Final Iteration.
Finally, at the final iteration d, V must evaluate Ṽ_d(ω*) on her own. But the vector of gate values at layer d of C is simply the input x to C. It can be shown that V can compute Ṽ_d(ω*) on her own in O(n log n) time, with a single streaming pass over the input [15]. Moreover, Vu et al. show how to bring V’s time cost down to O(n) [40], but this methodology does not work in a general streaming model. For completeness, we present details of both of these observations in Section 4.3.5.

Observe that the polynomial f_z^{(i)} defined in Equation (3) is an (s_i + 2s_{i+1})-variate polynomial of degree at most 2 in each variable, and so the invocation of the sum-check protocol at iteration i requires s_i + 2s_{i+1} rounds, with three field elements transmitted per round. Thus, the total communication cost is O(d(n) log S(n)) field elements, where d(n) is the depth of the circuit C. The time cost to V is O(n log n + d(n) log S(n)), where the n log n term is due to the time required to evaluate Ṽ_d(ω*) (see Lemma 2 below), and the d(n) log S(n) term is the time required for V to send messages to P and process and check the messages from P.

As for P’s runtime, for any iteration i of the GKR protocol, a naive implementation of the prover in the corresponding instance of the sum-check protocol would require time Ω(2^{s_i + 2s_{i+1}}), as the sum defining each of P’s messages is over as many as 2^{s_i + 2s_{i+1}} terms. This cost can be Ω(S(n)^3), which is prohibitively large in practice. However, Cormode, Mitzenmacher, and Thaler showed in [14] that each gate at layers i and i + 1 of C contributes to only a single term of the sum, and exploit this to bring the runtime of P down to O(S(n) log S(n)). Various suggestions have been put forth for handling circuits for which V cannot efficiently evaluate the wiring predicates ãdd_i and m̃ult_i on her own.
For example, these computations can always be done by V in O(log S(n)) space as long as the circuit is log-space uniform, which is sufficient in streaming applications where the space usage of the verifier is paramount [14]. Moreover, these computations can be done offline before the input is even observed, because they only depend on the wiring of the circuit, and not on the input [14, 21]. Finally, [40] notes that the cost of this computation can be effectively amortized in a batching model, where many identical computations on different inputs are verified simultaneously. See Section 7 for further discussion, and a protocol that mitigates this issue in the context of data parallel computation.

4.3.5 Making V Fast vs. Making V Streaming
We describe how V can efficiently evaluate Ṽ_d(ω*) on her own, as required in the final iteration of the GKR protocol. Prior work has identified two methods for performing this computation. The first method is due to Cormode, Thaler, and Yi [15]. It requires O(n log n) time, and allows V to make a single streaming pass over the input using O(log n) space.

Lemma 2 ([15])
Given an input x ∈ F^n and a vector ω* ∈ F^{log n}, V can compute Ṽ_d(ω*) in O(n log n) time and O(log n) space with a single streaming pass over the input, where Ṽ_d is the multilinear extension of the function that maps i ∈ {0,1}^{log n} to the value of the ith entry of x.

Proof:
We exploit the following explicit expression for Ṽ_d. For a vector b ∈ {0,1}^{log n}, let

χ_b(x_1, . . . , x_{log n}) = ∏_{k=1}^{log n} χ_{b_k}(x_k),

where χ_0(x_k) = 1 − x_k and χ_1(x_k) = x_k. Notice that χ_b is the unique multilinear polynomial that takes b ∈ {0,1}^{log n} to 1 and all other values in {0,1}^{log n} to 0, i.e., it is the multilinear extension of the indicator function for the Boolean vector b. With this definition in hand, we may write:

Ṽ_d(p_1, . . . , p_{log n}) = ∑_{b ∈ {0,1}^{log n}} V_d(b) χ_b(p_1, . . . , p_{log n}).    (4)

Indeed, it is easy to check that the right hand side of Equation (4) is a multilinear polynomial, and that it agrees with V_d on all Boolean inputs. Hence, the right hand side must equal the multilinear extension of V_d. In particular, by letting (p_1, . . . , p_{log n}) = ω* in Equation (4), we see that

Ṽ_d(ω*) = ∑_{b ∈ {0,1}^{log n}} V_d(b) χ_b(ω*).    (5)

Given any stream update (i, δ), let (i_1, . . . , i_{log n}) denote the binary representation of i. Notice that update (i, δ) has the effect of increasing V_d(i_1, . . . , i_{log n}) by δ, and does not affect V_d(x_1, . . . , x_{log n}) for any (x_1, . . . , x_{log n}) ≠ (i_1, . . . , i_{log n}). Thus, V can compute Ṽ_d(ω*) incrementally from the raw stream by initializing Ṽ_d(ω*) ← 0, and processing each update (i, δ) via:

Ṽ_d(ω*) ← Ṽ_d(ω*) + δ · χ_{(i_1, . . . , i_{log n})}(ω*).

V only needs to store Ṽ_d(ω*) and ω*, which requires O(log n) words of memory. Moreover, for any i, χ_{(i_1, . . . , i_{log n})}(ω*) can be computed in O(log n) field operations, and thus V can compute Ṽ_d(ω*) with one pass over the raw stream, using O(log n) words of space and O(log n) field operations per update.

The second method is due to Vu et al. [40]. It enables V to compute Ṽ_d(ω*) in O(n) time, but requires V to use O(n) space.

Lemma 3 (Vu et al. [40]) V can compute Ṽ_d(ω*) in O(n) time and O(n) space.

Proof:
We again exploit the expression for Ṽ_d(ω*) in Equation (5). Notice the right hand side of Equation (5) expresses Ṽ_d(ω*) as the inner product of two n-dimensional vectors, where the bth entry of the first vector is V_d(b) and the bth entry of the second vector is χ_b(ω*). This inner product can be computed in O(n) time given a table of size n whose bth entry contains the quantity χ_b(ω*). Vu et al. show how to build such a table in time O(n) using memoization.

The memoization procedure consists of log n stages, where Stage j constructs a table A^{(j)} of size 2^j, such that for any (b_1, . . . , b_j) ∈ {0,1}^j, A^{(j)}[(b_1, . . . , b_j)] = ∏_{i=1}^{j} χ_{b_i}(ω*_i). Notice A^{(j)}[(b_1, . . . , b_j)] = A^{(j−1)}[(b_1, . . . , b_{j−1})] · χ_{b_j}(ω*_j), and so the jth stage of the memoization procedure requires time O(2^j). The total time across all log n stages is therefore O(∑_{j=1}^{log n} 2^j) = O(2^{log n}) = O(n). This completes the proof.
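As an illustration, both methods for evaluating Ṽ_d(ω*) can be sketched as follows. This is a sketch under assumed conventions (the field is the integers modulo a Mersenne prime, and gate-label bits are read most-significant first); the function names are ours.

```python
P = 2**61 - 1  # illustrative field modulus (a Mersenne prime)

def chi(bits, w, p=P):
    """chi_b(w) = prod_k (w_k if b_k else 1 - w_k), the multilinear
    extension of the indicator of the Boolean vector b (Equation (4))."""
    out = 1
    for bk, wk in zip(bits, w):
        out = out * (wk if bk else (1 - wk)) % p
    return out

def streaming_eval(updates, w, p=P):
    """Lemma 2: one pass over stream updates (i, delta) using O(log n) space
    and O(log n) field operations per update (O(n log n) total)."""
    logn = len(w)
    acc = 0
    for i, delta in updates:
        b = [(i >> (logn - 1 - k)) & 1 for k in range(logn)]  # MSB first
        acc = (acc + delta * chi(b, w, p)) % p
    return acc

def table_eval(x, w, p=P):
    """Lemma 3: memoize all chi_b(w) by doubling the table once per
    coordinate of w, then take an inner product. O(n) time, O(n) space."""
    T = [1]
    for wk in w:  # stage j: extend each prefix by bit 0, then bit 1
        T = [t * f % p for t in T for f in ((1 - wk) % p, wk % p)]
    return sum(v * t for v, t in zip(x, T)) % p
```

For example, the stream [(0, 3), (2, 5), (0, 1)] over n = 4 entries yields the vector x = (4, 0, 5, 0); both methods agree at every ω*, and at the Boolean point ω* = (1, 0) both return the entry x_2 = 5.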
Remark 1

In [41], Vu et al. further observe that if the input is presented in a specific order, then V can evaluate Ṽ_d(ω*) in O(n) time using only O(log n) space. Compare this result to Lemma 2, which requires O(n log n) time for V, but allows V to use O(log n) space regardless of the order in which the input is presented.

As with the GKR protocol, our protocol consists of d(n) iterations, one for each layer of the circuit. Each iteration starts with P claiming a value for Ṽ_i(z) for some value z ∈ F^{s_i}. The purpose of the iteration is to reduce this claim to a claim about Ṽ_{i+1}(ω) for some ω ∈ F^{s_{i+1}}, in the sense that it is safe for V to assume that the first claim is true as long as the second claim is true. As in the GKR protocol, this is done by invoking the sum-check protocol on a certain polynomial.

In order to improve on the costs of the GKR protocol implementation of Cormode et al. [14], we replace the polynomial f_z^{(i)} in Equation (3) with a different polynomial g_z^{(i)} defined over a much smaller domain. Specifically, g_z^{(i)} is defined over only s_i variables, rather than s_i + 2s_{i+1} variables as is the case for f_z^{(i)}. Using g_z^{(i)} in place of f_z^{(i)} allows P to reuse work across iterations of the sum-check protocol, thereby reducing P’s runtime by a logarithmic factor relative to [14], as formalized in Theorem 1 below.

The remainder of the presentation leading up to Theorem 1 proceeds as follows. After stating a preliminary lemma, we describe the polynomial g_z^{(i)} that we use in the context of three specific circuits: a binary tree of addition or multiplication gates, and a circuit computing the number of non-zero entries of an n-dimensional vector a. The purpose of this exposition is to showcase the ideas underlying Theorem 1 in concrete scenarios. Second, we explain the algorithmic insights that allow P to reuse work across iterations of the sum-check protocol applied to g_z^{(i)}.
Finally, we state and prove Theorem 1, which formalizes the class of circuits to which our methods apply.

We will repeatedly invoke the following lemma, which allows us to express the value Ṽ_i(z) in a manner amenable to verification via the sum-check protocol. This is essentially a restatement of [31, Lemma 3.2.1].

Lemma 4
Let W be any polynomial F^{s_i} → F that extends V_i, in the sense that for all p ∈ {0,1}^{s_i}, W(p) = V_i(p). Then for any z ∈ F^{s_i},

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} β_{s_i}(z, p) W(p).    (6)

Proof:
It is easy to check that the right hand side of Equation (6) is a multilinear polynomial in z, and that it agrees with V_i on all Boolean inputs. Thus, the right hand side of Equation (6), viewed as a polynomial in z, must be the multilinear extension Ṽ_i of V_i. This completes the proof.

5.3 Polynomials for Specific Circuits

Consider a circuit C that computes the product of all n of its inputs by multiplying them together via a binary tree. Label the gates at layers i and i + 1 so that the first input to gate p = (p_1, . . . , p_{s_i}) ∈ {0,1}^{s_i} at layer i is the gate with label (p, 0) at layer i + 1, and the second input to gate p has label (p, 1). Here and throughout, (p, 0) denotes the (s_i + 1)-bit label obtained by appending the bit 0 to p. Interpreting p = (p_1, . . . , p_{s_i}) ∈ {0,1}^{s_i} as an integer between 0 and 2^{s_i} − 1, with p_1 as the high-order bit and p_{s_i} as the low-order bit, this says that the first in-neighbor of p is 2p and the second is 2p + 1. It follows immediately that for any gate p ∈ {0,1}^{s_i} at layer i, V_i(p) = Ṽ_{i+1}(p, 0) · Ṽ_{i+1}(p, 1). Invoking Lemma 4, we obtain the following proposition.

Proposition 2
Let C be a circuit consisting of a binary tree of multiplication gates. Then

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^{(i)}(p),  where  g_z^{(i)}(p) = β_{s_i}(z, p) · Ṽ_{i+1}(p, 0) · Ṽ_{i+1}(p, 1).

Remark 2
Notice that the polynomial g_z^{(i)} in Proposition 2 is a degree three polynomial in each variable of p. When applying the sum-check protocol to g_z^{(i)}, the prover therefore needs to send 4 field elements per round.

In the case of Proposition 2, the line ℓ : F → F^{s_i + 1} in the “Reducing to Verification of a Single Point” step has an especially simple expression. Let r ∈ F^{s_i} be the vector of random field elements chosen by V over the execution of the sum-check protocol. Notice that ℓ(0) must equal the point (r, 0) ∈ F^{s_i + 1}, i.e., the point whose first s_i coordinates equal r and whose last coordinate equals 0. Similarly, ℓ(1) must equal (r, 1). We may therefore express the line ℓ via the equation ℓ(t) = (r, t). In this case, Ṽ_{i+1} ◦ ℓ has degree 1 and is implicitly specified when P sends the claimed values of Ṽ_{i+1}(r, 0) and Ṽ_{i+1}(r, 1).

The case of a binary tree of addition gates is similar to the case of multiplication gates.
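As a concrete sanity check of Proposition 2 (Proposition 3 below is the analogue for addition gates), the following sketch verifies the identity Ṽ_i(z) = ∑_p β_{s_i}(z, p) · Ṽ_{i+1}(p, 0) · Ṽ_{i+1}(p, 1) for a tiny multiplication tree at a random point z. The helper `mle`, the bit conventions, and the modulus are our own illustrative choices.

```python
import random

P = 2**61 - 1  # illustrative field modulus

def mle(vals, z, p=P):
    """Evaluate the multilinear extension of vals : {0,1}^s -> F at z in F^s
    (index bits read most-significant first), via sum_b vals[b] * chi_b(z)."""
    T = [1]
    for zk in z:
        T = [t * f % p for t in T for f in ((1 - zk) % p, zk % p)]
    return sum(v * t for v, t in zip(vals, T)) % p

# Layer i+1 holds four leaves; layer i multiplies adjacent pairs (gates 2p, 2p+1).
leaves = [3, 5, 7, 11]
layer_i = [leaves[0] * leaves[1] % P, leaves[2] * leaves[3] % P]

z = [random.randrange(P)]  # s_i = 1, so z is a single field element
lhs = mle(layer_i, z)
rhs = 0
for pb in (0, 1):  # sum over Boolean p; here beta_{s_i}(z, p) = chi_p(z)
    beta = (pb * z[0] + (1 - pb) * (1 - z[0])) % P
    rhs = (rhs + beta * mle(leaves, [pb, 0]) % P * mle(leaves, [pb, 1])) % P
assert lhs == rhs  # Proposition 2's identity holds at the random z
```

At Boolean p, the points (p, 0) and (p, 1) are exactly the two children 2p and 2p + 1, so the right-hand side reduces to Lemma 4 applied to W(p) = Ṽ_{i+1}(p, 0) · Ṽ_{i+1}(p, 1).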
Proposition 3
Let C be a circuit consisting of a binary tree of addition gates. Then

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^{(i)}(p),  where  g_z^{(i)}(p) = β_{s_i}(z, p) · (Ṽ_{i+1}(p, 0) + Ṽ_{i+1}(p, 1)).

Remark 3
The polynomial g_z^{(i)} of Proposition 3 has degree 2 in each variable, rather than degree 3 as in Proposition 2.

DISTINCT
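The computation performed by the circuit described in this section can be sketched directly: by Fermat's Little Theorem, over a field of prime order q every non-zero a satisfies a^{q−1} = 1, and for a Mersenne prime q = 2^k − 1 the exponent q − 1 = 2 + 4 + · · · + 2^{k−1} can be assembled with k − 1 squarings folded into a running product. The following sketch (with an illustrative choice of k) mirrors that squaring-and-multiplying structure; the function names are ours.

```python
K = 61
Q = 2**K - 1  # a Mersenne prime, so a^{q-1} = 1 for every a != 0 mod q

def fermat_power(a, q=Q, k=K):
    """Compute a^{q-1} mod q the way the circuit does: since
    q - 1 = 2^k - 2 = 2 + 4 + ... + 2^{k-1}, repeatedly square a and fold
    each square a^{2^j} into a running partial product."""
    square = a % q   # holds a^{2^j} (the squaring chain)
    partial = 1      # holds prod_{l=1}^{j} a^{2^l} (the partial products)
    for _ in range(k - 1):
        square = square * square % q   # a^{2^j} from a^{2^{j-1}}
        partial = partial * square % q
    return partial   # equals a^{q-1} mod q: 1 if a != 0, else 0

def distinct(a, q=Q):
    """Number of non-zero entries of the vector a, as the circuit computes it:
    sum a_i^{q-1} over all i (the final binary tree of addition gates)."""
    return sum(fermat_power(ai, q) for ai in a) % q
```

For instance, `distinct([0, 3, 0, 7, 7])` evaluates to 3, the number of non-zero entries.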
We now describe a circuit C for computing the number of non-zero entries of a vector a ∈ F^n (this vector should be interpreted as the frequency vector of a data stream). A similar circuit was used in conjunction with the GKR protocol in [14] to yield an efficient protocol with a streaming verifier for DISTINCT, and we borrow heavily from the presentation there. We remark that our refinements enable us to slightly simplify the circuit used in [14] by avoiding the awkward use of a constant-valued input wire with value set to 1. This causes some gates in our circuit to have fan-in 1 rather than fan-in 2, which is easily supported by our protocol.

The circuit C is tailored for use over the field of cardinality equal to a Mersenne prime q = 2^k − 1. Fields of cardinality equal to a Mersenne prime can support extremely fast arithmetic, and as discussed later in Section 6.2, there are several Mersenne primes of appropriate magnitude for use within our protocols.

Figure 1: The first several layers of a circuit for F_0 on four inputs over the field F_q with q = 2^k − 1. The bottom layer computes a_i^2 for each input entry a_i. The second layer from the bottom computes a_i^4 and a_i^2 for all i. The third layer computes a_i^8 and a_i^6 = a_i^4 × a_i^2, while the fourth layer computes a_i^{16} and a_i^{14} = a_i^8 × a_i^6. The remaining layers (not shown) have structure identical to the third and fourth layers until the value a_i^{q−1} is computed for all i, and the circuit culminates in a binary tree of addition gates.

The circuit C exploits Fermat’s Little Theorem, computing a_i^{q−1} for each input entry a_i before summing the results. As described in [14], verifying the summation sub-circuit can be handled with a single invocation of the sum-check protocol, or less efficiently by running our protocol for a binary tree of addition gates described in Proposition 3.

We now turn to describing the part of the circuit computing a_i^{q−1} for each input entry a_i. We may write q − 1 = 2^k − 2, whose binary representation is k − 1 ones followed by a zero, so that a_i^{q−1} = ∏_{j=1}^{k−1} a_i^{2^j}. To compute a_i^{q−1}, the circuit repeatedly squares a_i, and multiplies together the results “as it goes”. In more detail, for j > 1, there are two gates devoted to a_i at the jth layer from the bottom of this part of the circuit: the first computes a_i^{2^j} by squaring the corresponding gate at the previous layer, and the second computes the partial product ∏_{ℓ=1}^{j−1} a_i^{2^ℓ}. See Figure 1 for a depiction.

Thus, at each of these layers there are 2n gates. Number the gates from 0 to 2n − 1, and abuse notation by using p to refer to both a gate number as well as its binary representation. An even-numbered gate p at layer i has both in-wires connected to gate p at layer i + 1, while an odd-numbered gate p has one in-wire connected to gate p and another connected to gate p − 1. Thus, the connectivity information of the circuit is a simple function of the binary representation p of each gate at layer i. If the low-order bit p_{s_i} of p is 0 (i.e., it is an even-numbered gate), then both in-neighbors of p at layer i + 1 have binary representation p. If the low-order bit p_{s_i} is 1 (i.e., it is an odd-numbered gate), then the first in-neighbor of gate p has binary representation p, and the second has binary representation (p_{−s_i}, 0), where p_{−s_i} denotes p with the coordinate p_{s_i} removed. Invoking Lemma 4, the following proposition is easily verified.

Proposition 4
Let C be the circuit described above. For every layer i of the repeated-squaring portion of the circuit other than the two layers closest to the input,

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^{(i)}(p),

where

g_z^{(i)}(p) = β_{s_i}(z, p) · ( (1 − p_{s_i}) Ṽ_{i+1}(p_{−s_i}, 0) · Ṽ_{i+1}(p_{−s_i}, 0) + p_{s_i} Ṽ_{i+1}(p_{−s_i}, 1) · Ṽ_{i+1}(p_{−s_i}, 0) ),

where p_{−s_i} denotes p with the coordinate p_{s_i} removed.

Remark 4

To check P’s claim in the final round of the sum-check protocol applied to g_z^{(i)}, V needs to know Ṽ_{i+1}(r, 0) and Ṽ_{i+1}(r, 1) for some random vector r ∈ F^{s_i − 1}. This is identical to the situation in the case of a binary tree of addition or multiplication gates, where the “Reducing to Verification of a Single Point” step had an especially simple implementation.

At the second layer from the input, an even-numbered gate p has both in-wires connected to gate p/2 at the layer below, while an odd-numbered gate p has its unique in-wire connected to gate (p − 1)/2. Thus, for a gate at this layer i, if the low-order bit p_{s_i} of the gate’s binary representation p is 0 (i.e., it is an even-numbered gate), then both in-neighbors of p at layer i + 1 have binary representation p_{−s_i}. If the low-order bit p_{s_i} is 1 (i.e., it is an odd-numbered gate), then the unique in-neighbor of p at layer i + 1 has binary representation p_{−s_i}. Invoking Lemma 4, the following is easily verified.

Proposition 5
Let C be the circuit described above. For the layer i just described,

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^{(i)}(p),

where

g_z^{(i)}(p) = β_{s_i}(z, p) · ( (1 − p_{s_i}) Ṽ_{i+1}(p_{−s_i}) · Ṽ_{i+1}(p_{−s_i}) + p_{s_i} Ṽ_{i+1}(p_{−s_i}) ),

where p_{−s_i} denotes p with coordinate p_{s_i} removed.

Finally, at the layer adjacent to the input, each gate p has both in-wires connected to gate p at the input layer.

Proposition 6
Let C be the circuit described above. For the layer i adjacent to the input,

Ṽ_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^{(i)}(p),  where  g_z^{(i)}(p) = β_{s_i}(z, p) · Ṽ_{i+1}(p) · Ṽ_{i+1}(p).

Recall that our analysis of the costs of the sum-check protocol in Section 4.2.1 revealed that, when applying a sum-check protocol to an s_i-variate polynomial g_z^{(i)}, P only needs to evaluate g_z^{(i)} at O(2^{s_i}) points across all rounds of the protocol. Our goal in this section is to show how P can do this in time O(2^{s_i} + 2^{s_{i+1}}) = O(S_i + S_{i+1}) for all of the polynomials g_z^{(i)} described in Section 5.3. This is sufficient to ensure that P takes O(∑_i S_i) = O(S(n)) time across all iterations of our circuit-checking protocol.

To this end, notice that all of the polynomials g_z^{(i)} described in Propositions 2-6 have the following property: for any r ∈ F^{s_i}, evaluating g_z^{(i)}(r) can be done in constant time given β_{s_i}(z, r) and the evaluations of Ṽ_{i+1} at a constant number of points. For example, consider the polynomial g_z^{(i)} described in Proposition 4: g_z^{(i)}(r) can be computed in constant time given β_{s_i}(z, r), Ṽ_{i+1}(r_{−s_i}, 0), and Ṽ_{i+1}(r_{−s_i}, 1).

Moreover, the points at which P must evaluate g_z^{(i)} within the sum-check protocol are highly structured: in round j of the sum-check protocol, the points are all of the form (r_1, . . . , r_{j−1}, t, b_{j+1}, . . . , b_{s_i}) with t ∈ {0, 1, . . . , deg_j(g_z^{(i)})} and (b_{j+1}, . . . , b_{s_i}) ∈ {0,1}^{s_i − j}.

β(z, p) Values
Pre-processing.
We begin by explaining how P can, in O(2^{s_i}) time, compute an array C^{(0)} of length 2^{s_i} containing all values β(z, p) = ∏_{k=1}^{s_i} (p_k z_k + (1 − p_k)(1 − z_k)) for p ∈ {0,1}^{s_i}. P can do this computation in preprocessing before the sum-check protocol begins, as this computation does not depend on any of V’s messages. Naively, computing all entries of C^{(0)} would require O(s_i 2^{s_i}) time, as there are 2^{s_i} values to compute, and each involves Ω(s_i) multiplications. However, this can be improved using dynamic programming.

The dynamic programming algorithm proceeds in stages. In stage j, P computes an array C^{(0,j)} of length 2^j. Abusing notation, we identify a number p in [2^j] with its binary representation in {0,1}^j. P computes

C^{(0,j)}[p] = ∏_{k=1}^{j} (p_k z_k + (1 − p_k)(1 − z_k))

via the recurrence

C^{(0,j)}[(p_1, . . . , p_j)] = C^{(0,j−1)}[(p_1, . . . , p_{j−1})] · (p_j z_j + (1 − p_j)(1 − z_j)).

Clearly C^{(0,s_i)} equals the desired array C^{(0)}, and the total number of multiplications required over the entire procedure is O(∑_{j=1}^{s_i} 2^j) = O(2^{s_i}). We remark that our dynamic programming procedure is similar to the method used by Vu et al. to reduce the verifier’s runtime in the GKR protocol from O(n log n) to O(n) in Lemma 3.

Overview of Online Processing.
In round j of the sum-check protocol, P needs to evaluate the polynomial β(z, p) at O(2^{s_i − j}) points of the form (r_1, . . . , r_{j−1}, t, b_{j+1}, . . . , b_{s_i}) for t ∈ {0, . . . , deg_j(g_z^{(i)})} and (b_{j+1}, . . . , b_{s_i}) ∈ {0,1}^{s_i − j}. P will do this using the help of intermediate arrays C^{(j)} defined as follows.

Define C^{(j)} to be the array of length 2^{s_i − j} such that for (p_{j+1}, . . . , p_{s_i}) ∈ {0,1}^{s_i − j}:

C^{(j)}[(p_{j+1}, . . . , p_{s_i})] = ( ∏_{k=1}^{j} (r_k z_k + (1 − r_k)(1 − z_k)) ) · ( ∏_{k=j+1}^{s_i} (p_k z_k + (1 − p_k)(1 − z_k)) ).

Efficiently Constructing C^{(j)} Arrays.
Inductively, assume P has computed the array C^{(j−1)} in the previous round. As the base case, we explained how P can compute C^{(0)} in O(2^{s_i}) time in pre-processing. Now observe that P can compute C^{(j)} from C^{(j−1)} in O(2^{s_i − j}) time using the following recurrence:

C^{(j)}[(p_{j+1}, . . . , p_{s_i})] = z_j^{−1} · C^{(j−1)}[(1, p_{j+1}, . . . , p_{s_i})] · (r_j z_j + (1 − r_j)(1 − z_j)).    (7)

Remark 5
Equation (7) is only valid when z_j ≠ 0. To avoid this issue, we can have V choose z_j at random from F* rather than from F, and this will affect the soundness probability by at most an additive O(d(n) · log S(n)/|F|) term.

Remark 6
Since computing multiplicative inverses in a finite field is not a constant-time operation, it is important to note that z_j^{−1} only needs to be computed once when determining the entries of C^{(j)}, i.e., it need not be recomputed for each entry of C^{(j)}. Therefore, across all s_i rounds of the sum-check protocol, only Õ(s_i) time in total is required to compute these multiplicative inverses, which does not affect the asymptotic costs for P. We discount the costs of computing z_j^{−1} for the remainder of the discussion.

Thus, at the end of round j of the sum-check protocol, when V sends P the value r_j, P can compute C^{(j)} from C^{(j−1)} using Equation (7) in O(2^{s_i − j}) time.

Using the C^{(j)} Arrays.
Observe that given any point of the form p = (r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_i}) with (b_{j+1}, ..., b_{s_i}) ∈ {0,1}^{s_i − j}, β(z, p) can be evaluated in constant time using the array C^{(j−1)}, via the equality

β(z, p) = C^{(j−1)}[(1, b_{j+1}, ..., b_{s_i})] · z_j^{−1} · (t z_j + (1 − t)(1 − z_j)).

As above, note that z_j^{−1} can be computed just once and used for all points p, and this does not affect the asymptotic costs for P.

Putting Things Together.
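Putting the preprocessing and per-round rules above into code, the following is a hedged C++ sketch of P's β(z, ·) bookkeeping (our own illustration, not the implementation of [39]; it assumes the field F_q with q = 2^61 − 1 and stores coordinate k of a point in bit position s − k of an array index, i.e., earlier coordinates occupy higher-order bits):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the prover's beta(z, .) bookkeeping. Assumptions (ours): field
// F_q with q = 2^61 - 1; earlier coordinates are higher-order index bits.
using u64 = std::uint64_t;
static const u64 Q = (1ULL << 61) - 1;

u64 mulq(u64 a, u64 b) { return (u64)((unsigned __int128)a * b % Q); }
u64 subq(u64 a, u64 b) { return a >= b ? a - b : a + Q - b; }
u64 powq(u64 b, u64 e) {               // used for z_j^{-1} = z_j^{q-2};
    u64 r = 1;                         // computed once per round (Remark 6)
    for (; e; e >>= 1, b = mulq(b, b)) if (e & 1) r = mulq(r, b);
    return r;
}

// Preprocessing: C^{(0)}[p] = prod_k (p_k z_k + (1-p_k)(1-z_k)) for all
// p in {0,1}^s, built stage by stage in O(2^s) total multiplications.
std::vector<u64> beta_table(const std::vector<u64>& z) {
    std::vector<u64> C{1};  // empty product
    for (u64 zk : z) {
        std::vector<u64> next(2 * C.size());
        for (std::size_t p = 0; p < C.size(); ++p) {
            next[2 * p]     = mulq(C[p], subq(1, zk));  // new coordinate = 0
            next[2 * p + 1] = mulq(C[p], zk);           // new coordinate = 1
        }
        C.swap(next);
    }
    return C;
}

// Round j: evaluate beta(z, p) at p = (r_1..r_{j-1}, t, b) in O(1) via
// beta = C^{(j-1)}[(1, b)] * z_j^{-1} * (t z_j + (1-t)(1-z_j)).
u64 beta_eval(const std::vector<u64>& C_prev, u64 zj, u64 zj_inv,
              u64 t, std::size_t b) {
    u64 line = (mulq(t, zj) + mulq(subq(1, t), subq(1, zj))) % Q;
    return mulq(mulq(C_prev[C_prev.size() / 2 + b], zj_inv), line);
}

// End of round j, Equation (7):
// C^{(j)}[b] = z_j^{-1} * C^{(j-1)}[(1, b)] * (r_j z_j + (1-r_j)(1-z_j)).
std::vector<u64> beta_update(const std::vector<u64>& C_prev, u64 zj, u64 rj) {
    u64 f = mulq(powq(zj, Q - 2),      // requires z_j != 0 (Remark 5)
                 (mulq(rj, zj) + mulq(subq(1, rj), subq(1, zj))) % Q);
    std::size_t half = C_prev.size() / 2;
    std::vector<u64> C(half);
    for (std::size_t b = 0; b < half; ++b) C[b] = mulq(C_prev[half + b], f);
    return C;
}
```

Note that the inverse z_j^{−1} is computed once per call to beta_update, matching Remark 6, and each beta_eval call charges O(1) field operations per point.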
In round j of the sum-check protocol, P uses the array C^{(j−1)} to evaluate the O(2^{s_i − j}) required β(z, p) values in O(2^{s_i − j}) time. At the end of round j, V sends P the value r_j, and P computes C^{(j)} from C^{(j−1)} in O(2^{s_i − j}) time. In total across all rounds of the sum-check protocol, P spends O(∑_{j=1}^{s_i} 2^{s_i − j}) = O(2^{s_i}) time to compute the required β(z, p) values.

Computing the Ṽ_{i+1}(p) Values
For concreteness and clarity, we restrict our presentation within this subsection to the polynomial g_z^{(i)} described in Proposition 4. Theorem 1 abstracts this analysis into a general result capturing a large class of wiring patterns.

Recall that all of the polynomials g_z^{(i)} described in Propositions 2-6 have the following property: for any p ∈ F^{s_i}, evaluating g_z^{(i)}(p) can be done in constant time given β(z, p) and the evaluations of Ṽ_{i+1} at a constant number of points. We have already shown how P can evaluate all of the necessary β(z, p) values in O(2^{s_i}) time. It remains to show how P can evaluate all of the necessary Ṽ_{i+1} values in time O(2^{s_i} + 2^{s_{i+1}}). We remark that in the context of Proposition 4, s_i = s_{i+1}; however, we still distinguish between these two quantities throughout this subsection in order to ensure maximal consistency with the general derivation of Theorem 1.

Recall that the polynomial g_z^{(i)} in Proposition 4 was defined as follows:

g_z^{(i)}(p) = β_{s_i}(z, p) · ( (1 − p_{s_i}) · Ṽ_{i+1}(p_{−s_i}, 0) · Ṽ_{i+1}(p_{−s_i}, 0) + p_{s_i} · Ṽ_{i+1}(p_{−s_i}, 0) · Ṽ_{i+1}(p_{−s_i}, 1) ).

In round j of the sum-check protocol, P needs to evaluate g_z^{(i)} at all points in the set

S^{(j)} = { (r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_i}) : t ∈ {0, ..., deg_j(g_z^{(i)})} and (b_{j+1}, ..., b_{s_i}) ∈ {0,1}^{s_i − j} }.

By inspection of g_z^{(i)}, it suffices for P to evaluate Ṽ_{i+1} at a constant number of points for each point in S^{(j)}. To show how to accomplish this efficiently, we exploit the following explicit expression for Ṽ_{i+1}. This expression was derived for the case i + 1 = d in Equation (4) within Lemma 2; we re-derive it here in the general case. For a vector b ∈ {0,1}^{s_{i+1}}, let χ_b(x_1, ..., x_{s_{i+1}}) = ∏_{k=1}^{s_{i+1}} χ_{b_k}(x_k), where χ_0(x_k) = 1 − x_k and χ_1(x_k) = x_k. With this definition in hand, we may write:

Ṽ_{i+1}(p_1, ..., p_{s_{i+1}}) = ∑_{b ∈ {0,1}^{s_{i+1}}} V_{i+1}(b) · χ_b(p_1, ..., p_{s_{i+1}}).    (8)

To see that Equation (8) holds, notice that the right hand side of Equation (8) is a multilinear polynomial in the variables (p_1, ..., p_{s_{i+1}}), and that it agrees with V_{i+1} at all points p ∈ {0,1}^{s_{i+1}}. Hence, it must be the unique multilinear extension of V_{i+1}.

The intuition behind our optimizations is the following. In round j of the sum-check protocol, there are |S^{(j)}| points at which Ṽ_{i+1} must be evaluated. Equation (8) can be exploited to show that each gate at layer i + 1 contributes to Ṽ_{i+1}(p) for at most one point p ∈ S^{(j)}: namely, the point p whose last s_{i+1} − j coordinates agree with those of the gate's binary label. This observation alone is enough to achieve an O(S_{i+1} log S_i) runtime for P in total across all iterations of the sum-check protocol, because there are S_{i+1} gates at layer i + 1, and only s_i = log S_i rounds of the sum-check protocol. However, we need to go further in order to shave off the last log S_i factor from P's runtime. Essentially, what we do is group the gates at layer i + 1 by the point p ∈ S^{(j)} to which they contribute. Each such group can be treated as a single unit, ensuring that the work P has to do in any round of the sum-check protocol in order to evaluate Ṽ_{i+1} at all points in S^{(j)} is proportional to |S^{(j)}| rather than to S_{i+1}. Since the size of S^{(j)} falls geometrically with j, our desired time bounds follow.

Pre-processing. P will begin by computing an array V^{(0)}, which is simply defined to be the vector of gate values at layer i + 1. That is, identifying a number 0 ≤ j < S_{i+1} with its binary representation in {0,1}^{s_{i+1}}, P sets V^{(0)}[(j_1, ..., j_{s_{i+1}})] = V_{i+1}(j_1, ..., j_{s_{i+1}}) for each (j_1, ..., j_{s_{i+1}}) ∈ {0,1}^{s_{i+1}}. The right hand side of this equation is simply the value of the jth gate at layer i + 1 of C. So P can fill in the array V^{(0)} when she evaluates the circuit C, before receiving any messages from V.

Overview of Online Processing.
In round j of the sum-check protocol, P needs to evaluate the polynomial Ṽ_{i+1} at the O(2^{s_i − j}) points in the set S^{(j)}. P will do this using the help of intermediate arrays V^{(j)} defined as follows. Define V^{(j)} to be the array of length 2^{s_{i+1} − j} such that for (p_{j+1}, ..., p_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − j}:

V^{(j)}[(p_{j+1}, ..., p_{s_{i+1}})] = ∑_{(b_1, ..., b_j) ∈ {0,1}^j} V_{i+1}(b_1, ..., b_j, p_{j+1}, ..., p_{s_{i+1}}) · ∏_{k=1}^{j} χ_{b_k}(r_k).

Efficiently Constructing the V^{(j)} Arrays.
Inductively, assume P has computed in the previous round the array V^{(j−1)} of length 2^{s_{i+1} − j + 1}. As the base case, we explained how P can fill in V^{(0)} in the process of evaluating the circuit C. Now observe that P can compute V^{(j)} given V^{(j−1)} in O(2^{s_{i+1} − j}) time using the following recurrence:

V^{(j)}[(p_{j+1}, ..., p_{s_{i+1}})] = V^{(j−1)}[(0, p_{j+1}, ..., p_{s_{i+1}})] · χ_0(r_j) + V^{(j−1)}[(1, p_{j+1}, ..., p_{s_{i+1}})] · χ_1(r_j).

Thus, at the end of round j of the sum-check protocol, when V sends P the value r_j, P can compute V^{(j)} from V^{(j−1)} in O(2^{s_{i+1} − j + 1}) time.

Using the V^{(j)} Arrays.
We now show how to use the array V^{(j−1)} to evaluate Ṽ_{i+1}(p) in constant time for any point of the form p = (r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_{i+1}}) with (b_{j+1}, ..., b_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − j}. We exploit the following sequence of equalities:

Ṽ_{i+1}(r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_{i+1}})
= ∑_{c ∈ {0,1}^{s_{i+1}}} V_{i+1}(c) · χ_c(r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_{i+1}})
= ∑_{(c_1, ..., c_j) ∈ {0,1}^j} ∑_{(c_{j+1}, ..., c_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − j}} V_{i+1}(c) · χ_c(r_1, ..., r_{j−1}, t, b_{j+1}, ..., b_{s_{i+1}})
= ∑_{(c_1, ..., c_j) ∈ {0,1}^j} ∑_{(c_{j+1}, ..., c_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − j}} V_{i+1}(c) · ( ∏_{k=1}^{j−1} χ_{c_k}(r_k) ) · χ_{c_j}(t) · ( ∏_{k=j+1}^{s_{i+1}} χ_{c_k}(b_k) )
= ∑_{(c_1, ..., c_j) ∈ {0,1}^j} V_{i+1}(c_1, ..., c_j, b_{j+1}, ..., b_{s_{i+1}}) · ( ∏_{k=1}^{j−1} χ_{c_k}(r_k) ) · χ_{c_j}(t)
= V^{(j−1)}[(0, b_{j+1}, ..., b_{s_{i+1}})] · χ_0(t) + V^{(j−1)}[(1, b_{j+1}, ..., b_{s_{i+1}})] · χ_1(t).

Here, the first equality holds by Equation (8). The third holds by definition of the function χ_c. The fourth holds because for Boolean values b_k, c_k ∈ {0,1}, χ_{c_k}(b_k) = 1 if c_k = b_k, and χ_{c_k}(b_k) = 0 otherwise. The final equality holds by definition of the array V^{(j−1)}.

Putting Things Together.
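The V^{(j)} bookkeeping admits an equally short sketch (again our own hedged illustration, with the field F_q, q = 2^61 − 1, and the first listed coordinate stored in the high-order bit of the index — both our assumptions). Note that the round-end recurrence is exactly the constant-time evaluation above specialized to t = r_j:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the prover's V^{(j)} bookkeeping over F_q, q = 2^61 - 1.
// Indexing assumption (ours): V_prev[(c, b)] lives at index c*(len/2) + b.
using u64 = std::uint64_t;
static const u64 Q = (1ULL << 61) - 1;

u64 mulq(u64 a, u64 b) { return (u64)((unsigned __int128)a * b % Q); }
u64 addq(u64 a, u64 b) { u64 s = a + b; return s >= Q ? s - Q : s; }
u64 subq(u64 a, u64 b) { return a >= b ? a - b : a + Q - b; }

// Constant-time evaluation derived in the text:
//   Vtilde_{i+1}(r_1..r_{j-1}, t, b) =
//     V^{(j-1)}[(0, b)] * chi_0(t) + V^{(j-1)}[(1, b)] * chi_1(t),
// with chi_0(t) = 1 - t and chi_1(t) = t.
u64 V_eval(const std::vector<u64>& V_prev, u64 t, std::size_t b) {
    std::size_t half = V_prev.size() / 2;
    return addq(mulq(V_prev[b], subq(1, t)), mulq(V_prev[half + b], t));
}

// Round-end recurrence: V^{(j)}[b] = V^{(j-1)}[(0,b)] chi_0(r_j)
//                                  + V^{(j-1)}[(1,b)] chi_1(r_j).
std::vector<u64> V_update(const std::vector<u64>& V_prev, u64 rj) {
    std::size_t half = V_prev.size() / 2;
    std::vector<u64> V(half);
    for (std::size_t b = 0; b < half; ++b) V[b] = V_eval(V_prev, rj, b);
    return V;
}
```

Starting from V^{(0)} equal to the vector of gate values at layer i + 1, repeatedly applying V_update with r_1, r_2, ... leaves a single entry equal to Ṽ_{i+1}(r_1, ..., r_{s_{i+1}}).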
In round j of the sum-check protocol, P uses the array V^{(j−1)} to evaluate Ṽ_{i+1}(p) for all O(2^{s_i − j}) points p ∈ S^{(j)}. This requires constant time per point, and hence O(2^{s_i − j}) time across all points. At the end of round j, V sends P the value r_j, and P computes V^{(j)} from V^{(j−1)} in O(2^{s_{i+1} − j}) time. In total across all rounds of the sum-check protocol, P spends O(∑_{j=1}^{s_i} (2^{s_i − j} + 2^{s_{i+1} − j})) = O(2^{s_i} + 2^{s_{i+1}}) time to evaluate Ṽ_{i+1} at the relevant points. When combined with our O(2^{s_i})-time algorithm for computing all the relevant β(z, p) values, we see that P takes O(2^{s_i} + 2^{s_{i+1}}) = O(S_i + S_{i+1}) time to run the entire sum-check protocol for iteration i of our circuit-checking protocol.

In this section we formalize a large class of circuits to which our refinements yield asymptotic savings relative to prior implementations of the GKR protocol. Our protocol makes use of the following functions that capture the wiring structure of an arithmetic circuit C.

Definition 2
Let C be a layered arithmetic circuit of depth d(n) and size S(n) over finite field F. For every i ∈ {0, ..., d − 1}, let in_1^{(i)}: {0,1}^{s_i} → {0,1}^{s_{i+1}} and in_2^{(i)}: {0,1}^{s_i} → {0,1}^{s_{i+1}} denote the functions that take as input the binary label p of a gate at layer i of C, and output the binary label of the first and second in-neighbor of gate p respectively. Similarly, let type^{(i)}: {0,1}^{s_i} → {0,1} denote the function that takes as input the binary label p of a gate at layer i of C, and outputs 0 if p is an addition gate, and 1 if p is a multiplication gate.

Intuitively, the following definition captures functions whose outputs are simple bit-wise transformations of their inputs.
Definition 3
Let f be a function mapping {0,1}^v to {0,1}^{v′}. Number the v input bits from 1 to v, and the v′ output bits from 1 to v′. Assume that one machine word contains Ω(v + v′) bits. We say that f is regular if f can be evaluated on any input in constant time, and there is a subset of input bits S ⊆ [v] with |S| = O(1) such that:

1. Each input bit in [v] \ S affects O(1) of the output bits of f. Moreover, given input bit j ∈ [v] \ S, the set S_j of output bits affected by x_j can be enumerated in constant time.

2. Each output bit of f depends on at most one input bit.

Our protocol applied to C proceeds in d(n) iterations, where iteration i consists of an application of the sum-check protocol to an appropriate polynomial derived from type^{(i)}, in_1^{(i)}, and in_2^{(i)}, followed by a phase for “reducing to verification of a single point”. For any layer i of C such that in_1^{(i)}, in_2^{(i)}, and type^{(i)} are all regular, we can show that P can execute the sum-check protocol at iteration i in O(S_i + S_{i+1}) time. To ensure that P can execute the “reducing to verification of a single point” phase in O(S_{i+1}) time, we need to place one additional condition on in_1^{(i)} and in_2^{(i)}.

Definition 4
We say that in_1^{(i)} and in_2^{(i)} are similar if there is a set of output bits T ⊆ [s_{i+1}] with |T| = O(1) such that for all inputs x, the jth output bit of in_1^{(i)} equals the jth output bit of in_2^{(i)} for all j ∈ [s_{i+1}] \ T.

We are finally in a position to state the class of circuits to which our refinements apply.
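As a toy illustration of Definitions 3 and 4 (our own example), consider the wiring functions of a binary summation tree, in which gate p at layer i has in-neighbors 2p and 2p + 1 at layer i + 1 — the wiring pattern of the MATMULT circuit discussed below:

```cpp
#include <cassert>
#include <cstdint>

// Wiring functions of a binary summation tree: in_1 appends a 0 bit to the
// gate label and in_2 appends a 1 bit. Both are evaluable in constant time,
// every input bit affects exactly one output bit, every output bit depends
// on at most one input bit (so both functions are regular with S empty),
// and in_1 and in_2 agree on all output bits except the last one (so they
// are similar with |T| = 1).
using u64 = std::uint64_t;

u64 in1(u64 p) { return p << 1; }        // (p_1, ..., p_s, 0), i.e. gate 2p
u64 in2(u64 p) { return (p << 1) | 1; }  // (p_1, ..., p_s, 1), i.e. gate 2p+1
int type(u64 /*p*/) { return 0; }        // every gate is an addition gate
```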
Theorem 1
Let C be an arithmetic circuit, and suppose that for all layers i of C, in_1^{(i)}, in_2^{(i)}, and type^{(i)} are regular. Suppose moreover that in_1^{(i)} is similar to in_2^{(i)} for all but O(1) layers i of C. Then there is a valid interactive proof protocol (P, V) for the function computed by C, with the following costs. The total communication cost is |O| + O(d(n) log S(n)) field elements, where |O| is the number of outputs of C. The time cost to V is O(n log n + d(n) log S(n)), and V can make a single streaming pass over the input, storing O(log(S(n))) field elements. The time cost to P is O(S(n)).

The asymptotic costs of the protocol whose existence is guaranteed by Theorem 1 are identical to those of the implementation of the GKR protocol due to Cormode et al. in [14], except that in Theorem 1 P runs in time O(S(n)) rather than the O(S(n) log S(n)) achieved by [14]. We defer the proof to Appendix A.

Theorem 1 applies to circuits computing functions from a wide range of applications, with the following implications.
MATMULT. Consider the following circuit C of size O(n^3) for multiplying two n × n matrices A and B. Let the input gate labelled (0, i, j) correspond to A_{ij}, and the input labelled (1, i, j) correspond to B_{ij}. The layer of C adjacent to the input consists of n^3 gates, where the gate labeled (i, j, k) ∈ ({0,1}^{log n})^3 computes A_{ik} · B_{kj}. All subsequent layers constitute a binary tree of addition gates summing up the results and thereby computing ∑_k A_{ik} B_{kj} for all (i, j) ∈ [n] × [n].

For layers i ∈ {1, ..., log n} of this circuit, in_1^{(i)}, in_2^{(i)}, and type^{(i)} are all regular, and moreover in_1^{(i)} is similar to in_2^{(i)} (see Section 5.3.1 for a careful treatment of this wiring pattern). The remaining layer of the circuit, layer i = log n + 1, is regular, though in_1^{(log n + 1)} and in_2^{(log n + 1)} are not similar. We obtain the following immediate corollary.

Corollary 1
There is a valid interactive proof protocol for n × n MATMULT with the following costs. The total communication cost is n^2 + O(d(n) log n) field elements, where the n^2 term is required to specify the answer. The time cost to V is O(n^2 log n), and V can make a single streaming pass over the input in time O(n^2 log n), storing O(log n) field elements. The time cost to P is O(n^3).
We note that the costs of Corollary 1 are subsumed by our special-purpose matrix multiplication protocol presented later in Theorem 3. We included Corollary 1 to demonstrate the applicability of Theorem 1.
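For concreteness, the MATMULT circuit just described can be evaluated layer by layer as follows (a hedged sketch with names of our own choosing; n must be a power of two, and arithmetic is over F_q with q = 2^61 − 1, an assumption of this sketch):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using u64 = std::uint64_t;
static const u64 Q = (1ULL << 61) - 1;
u64 mulq(u64 a, u64 b) { return (u64)((unsigned __int128)a * b % Q); }
u64 addq(u64 a, u64 b) { u64 s = a + b; return s >= Q ? s - Q : s; }

// Evaluate the O(n^3)-size MATMULT circuit: one layer of n^3 products
// A[i][k] * B[k][j] (gate label (i, j, k)), then a binary tree of addition
// gates summing over k. Returns the output layer, i.e., A*B over F_q.
std::vector<u64> eval_matmult_circuit(const std::vector<u64>& A,
                                      const std::vector<u64>& B,
                                      std::size_t n) {
    // Products layer, with k in the low-order bits of the gate label so
    // that the addition tree pairs gates that differ in their last bit.
    std::vector<u64> layer(n * n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                layer[(i * n + j) * n + k] = mulq(A[i * n + k], B[k * n + j]);
    // log n layers of the addition tree: gate p sums gates 2p and 2p + 1.
    while (layer.size() > n * n) {
        std::vector<u64> next(layer.size() / 2);
        for (std::size_t p = 0; p < next.size(); ++p)
            next[p] = addq(layer[2 * p], layer[2 * p + 1]);
        layer.swap(next);
    }
    return layer;  // entry (i, j) holds sum_k A[i][k] * B[k][j]
}
```

Because k occupies the low-order label bits, each addition gate p simply sums gates 2p and 2p + 1 of the layer below, matching the binary-tree wiring pattern analyzed above.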
DISTINCT. Recall the circuit C over the field of size q = 2^k − 1 that takes a vector a ∈ F^n as input and outputs the number of non-zero entries of a. This circuit has k + O(log n) layers. For layers i ∈ [k − 1], an even-numbered gate p at layer i has both in-wires connected to gate p at layer i + 1, while an odd-numbered gate p at layer i has one in-wire connected to gate p at layer i + 1 and the other connected to gate p − 1 at layer i + 1 (i.e., to the gate (p_{−s_i}, 0), where p_{−s_i} denotes the binary representation of p with the coordinate p_{s_i} removed). For these layers, in_1^{(i)}, in_2^{(i)}, and type^{(i)} are all regular, and in_1^{(i)} is similar to in_2^{(i)}.

At layer k, an even-numbered gate p has both in-wires connected to gate p/2 at layer k + 1, while an odd-numbered gate p at layer k has its unique in-wire connected to gate (p − 1)/2 at layer k + 1. In the former case, both in-neighbors of gate p have binary representation p_{−s_i}. In the latter case the unique in-neighbor of gate p has binary representation p_{−s_i}. It is therefore easily seen that in_1^{(k)}, in_2^{(k)}, and type^{(k)} are all regular, and in_1^{(k)} is similar to in_2^{(k)}. Finally, at layer k + 1, both in-wires for gate p are connected to gate p at layer k + 2. It is easily seen that in_1^{(k+1)}, in_2^{(k+1)}, and type^{(k+1)} are all regular, and in_1^{(k+1)} is similar to in_2^{(k+1)}. With all layers of C satisfying the requirements of Theorem 1, we obtain the following corollary.

Corollary 2
Let q > max{m, n} be a Mersenne Prime. There is a valid interactive proof protocol over the field F_q for DISTINCT with the following costs. The total communication cost is O(log n · log q) field elements. The time cost to V is O(m log n), and V can make a single streaming pass over the input, storing O(log n) field elements. The time cost to P is O(n log q).

To our knowledge, Corollary 2 yields the fastest known prover of any streaming interactive proof protocol for DISTINCT that also has total communication and space usage for V that is sublinear in both m and n. The fastest result previously was the O(n · log(n) · log(q))-time prover obtained by the implementation of Cormode et al. [14]. We remark however that for a data stream with F_0 distinct items, the prover in [14] actually can be made to run in time O(n + F_0 · log(n) · log(q)), where the O(n) term is due to the time required to simply observe the entire input stream. Therefore, for streams where F_0 = o(n/log n), the implementation of [14] achieves an asymptotically faster prover than the one implied by Corollary 2.

Remark 7
Cormode et al. in [14, Section 3.2] describe how to extend the GKR protocol to handle circuits with gates that compute more general operations than just addition and multiplication. At a high level, [14] shows that gates computing any “low-degree” operation can be handled, and they demonstrate analytically and experimentally that these more general gates can achieve cost savings for the DISTINCT problem. These same optimizations are also applicable in conjunction with our refinements. We omit further details for brevity, and did not implement these optimizations in conjunction with our refinements.
Other Problems.
In order to demonstrate its generality, we describe two other non-trivial applications of Theorem 1.

• Pattern Matching. In the Pattern Matching problem, the input consists of a stream of text T = (t_0, ..., t_{n−1}) ∈ [n]^n and pattern P = (p_0, ..., p_{m−1}) ∈ [n]^m. The pattern P is said to occur at location i in T if, for every position k in P, p_k = t_{i+k}. The pattern-matching problem is to determine the number of locations at which P occurs in T. For example, one might want to determine the number of times a given phrase appears in a corpus of emails stored in the cloud.

Cormode et al. describe the following circuit C for Pattern Matching over the finite field F_q. The circuit first computes the quantity I_i = ∑_{j=0}^{m−1} (t_{i+j} − p_j)^2 for each i ∈ [[n]], and then exploits Fermat's Little Theorem (FLT) by computing M = ∑_{i=0}^{n−m} I_i^{q−1}. The number of occurrences of the pattern equals n − m + 1 − M.

Computing I_i for each i can be done in log m + O(1) layers: the first layer computes t_{i+k} − p_k for each pair (i, k) ∈ [[n]] × [[m]], the next layer squares each of the results, and the circuit then sums the results via a depth-log m binary tree of addition gates. The total size of the circuit C is O(nm + n log q), where the nm term is due to the computation of the I_i values, and the n log q term is due to the FLT computation. The total depth of the circuit is O(log m + log q) = O(log q).

We have already demonstrated that Theorem 1 applies to the squaring layer, the binary tree sub-circuit, and the FLT computation. The only remaining layer of the circuit is the one that computes t_{i+k} − p_k for each pair (i, k) ∈ [[n]] × [[m]]. Unfortunately, Theorem 1 does not apply to this layer of the circuit. This is because the first in-neighbor of a gate with label (i_1, ..., i_{log n}, k_1, ..., k_{log m}) ∈ {0,1}^{log n + log m} has label equal to the binary representation of the integer i + k, and a single bit i_j can affect many bits in the binary representation of i + k (likewise, each bit in the binary representation of i + k may be affected by many bits in the binary representations of i and k).

However, in Appendix B, we describe how to extend the ideas underlying Theorem 1 to handle this wiring pattern. The extensions in Appendix B may be more broadly useful, as the wiring pattern analyzed there is an instance of a common paradigm, in that it interprets binary gate labels as a pair of integers and performs a simple arithmetic operation (namely addition) on those integers.

We also remark that, instead of going through the analysis of Appendix B, a more straightforward approach is to simply apply the implementation of [14] to this layer; the runtime for P in the corresponding sum-check protocol is O(nm log n). This does not affect the asymptotic costs of the protocol if m is constant, since in this case nm log n = O(n log q), and the total runtime of P over all other layers of the circuit is Θ(n log q).

This analysis highlights the following point: our refinements can be applied to a circuit on a layer-by-layer basis, so they can still yield speedups even if some but not all layers of a circuit are sufficiently “regular” for our refinements to apply.

A similar analysis applies to a closely related circuit that solves a more general problem known as Pattern Matching with Wildcards. We omit these details for brevity.

• Fast Fourier Transform.
Cormode et al. [14] also describe a circuit over ℂ for computing the standard radix-two decimation-in-time FFT. At a high level, this circuit proceeds in log n stages, where for k = (k_1, ..., k_{log n}) ∈ {0,1}^{log n}, the kth output of stage i is recursively defined as

V_i(k_1, ..., k_{log n}) = V_{i−1}(k_1, ..., k_{i−1}, 0, k_{i+1}, ..., k_{log n}) + e^{−2πik/n} · V_{i−1}(k_1, ..., k_{i−1}, 1, k_{i+1}, ..., k_{log n}).

Theorem 1 is easily seen to apply to the natural circuit executing this recurrence, and our refinements would therefore shave a logarithmic factor off the runtime of P applied to this circuit, relative to the implementation of [14] (since this circuit is defined over the infinite field ℂ, the protocol is only defined in a model where complex numbers can be communicated and operated on at unit cost).

We implemented the protocols implied by Theorem 1 as applied to circuits computing
MATMULT and DISTINCT. These experiments serve as case studies to demonstrate the feasibility of Theorem 1 in practice, and to quantify the improvements over prior implementations. While Section 8 describes a specialized protocol for MATMULT that is significantly more efficient than the protocol implied by Theorem 1, MATMULT serves as an important case study for the costs of the more general protocol described in Theorem 1, and allows for direct comparison with prior implementation work that also evaluated general-purpose protocols via their performance on the MATMULT problem [14, 30, 35, 36, 38, 40].

Our comparison point is the implementation of Cormode et al. [14], with some of the refinements of Vu et al. [40] included. In particular, our comparison point for matrix multiplication uses the refinement of [40] for circuits with multiple outputs described in Section 4.3.2. We did not include Vu et al.'s optimization from Lemma 3 that reduces the runtime of V from O(n log n) to O(n), because this optimization blows up the space usage of V to Ω(n), while we want to use a smaller-space verifier for streaming applications such as DISTINCT.

The main takeaways of our experiments are as follows. When Theorem 1 is applicable, the prover in the resulting protocol is 200x-250x faster than the previous state of the art implementation of the GKR protocol. The communication costs and the number of rounds required by our protocols are also 2x-3x smaller than the previous state of the art. The verifier in our implementation takes essentially the same amount of time as in prior implementations of the GKR protocol; this time is much smaller than the time to perform the computation locally without a prover.

Most of the observed 200x speedup can be attributed directly to our improvements in protocol design over prior work: the circuit for 512x512 matrix multiplication is of size 2^{28}, and hence our log S factor improvement in the runtime of P likely accounts for at least a 28x speedup. The 3x reduction in the number of rounds accounts for another 3x speedup. The remaining speedup factor of roughly 2x may be due to a more streamlined implementation relative to prior work, rather than improved protocol design per se.

We have both a serial implementation and a parallel implementation that leverages graphics processing units (GPUs). The prover in our parallel implementation runs roughly 30x faster than the prover in our serial implementation.
The ability to leverage GPUs to obtain robust speedups in our setting is not unexpected, as Thaler, Roberts, Mitzenmacher, and Pfister demonstrated substantial speedups for an earlier implementation of the GKR protocol using GPUs in [38].

All of our code is available online at [39]. All of our serial code was written in C++ and all experiments were compiled with g++ using the −O3 compiler optimization flag and run on a workstation with a 64-bit Intel Xeon architecture and 48 GBs of RAM. We implemented all of our GPU code in CUDA and Thrust [24] with all compiler optimizations turned on, and ran our GPU implementation on an NVIDIA Tesla C2070 GPU with 6 GBs of device memory.
Choice of Finite Field.
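All of our circuits work over F_q with q = 2^61 − 1, and the remarks below explain why. As a hedged illustration of the "bit-shift, addition, and bit-wise AND" reduction described there (our own sketch, not the code of [39]):

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;
static const u64 Q = (1ULL << 61) - 1;  // the Mersenne prime 2^61 - 1

// Reduce a value below 2^62 modulo q = 2^61 - 1 using the identity
// 2^61 = 1 (mod q): split off the high bits, shift them down, and add.
u64 reduce61(u64 x) {
    x = (x & Q) + (x >> 61);  // one AND, one shift, one addition
    return x >= Q ? x - Q : x;
}

// Full modular multiplication: take the 122-bit product and fold the high
// 61 bits back into the low 61 bits, again via the Mersenne identity.
u64 mulq(u64 a, u64 b) {
    unsigned __int128 prod = (unsigned __int128)a * b;
    u64 lo = (u64)(prod & Q), hi = (u64)(prod >> 61);
    return reduce61(lo + hi);
}
```

The identity 2^61 ≡ 1 (mod q) lets the high bits of a product be folded back into the low bits, so no division instruction is ever needed.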
All of our circuits work over the finite field of size q = 2^61 − 1. Several remarks are appropriate regarding our choice of field size. This field was used in our earlier work [14] because it supports fast arithmetic, as reducing an integer modulo q can be done with a bit-shift, an addition, and a bit-wise AND. (The same observation applies to any field whose size equals a Mersenne Prime, including 2^89 − 1, 2^107 − 1, and 2^127 − 1.) This field is also large enough that the probability the verifier is fooled by a dishonest prover is tiny for all of the problems we consider (this probability is proportional to d(n) log S(n)/q).

The main potential issue with our choice of field size is that “overflow” can occur for problems such as matrix multiplication if the entries of the input matrices can be very large. For example, with 512 × 512 matrix multiplication, if all entries of the input matrices A, B are larger than 2^26, an entry in the product matrix AB can be as large as 2^61, which is larger than our field size. If this is a concern, a larger field size is appropriate. (Notice that for a problem such as DISTINCT, there is no danger of overflow issues as long as the length of the stream is smaller than 2^61 − 2, which is larger than any stream encountered in practice.) A second reason to use larger field sizes is to handle floating-point or rational arithmetic as proposed by Setty et al. in [35]. All of our protocols can be instantiated over fields with more than q = 2^61 − 1 elements, at the cost of more expensive field arithmetic.

MATMULT. The costs of our serial
MATMULT implementation are displayed in Table 1. The prover in our matrix multiplication implementation is about 250x faster than the previous state of the art. For example, when multiplying two 512 x 512 matrices, our prover takes about 38 seconds, while our comparison implementation takes over 2.5 hours. A C++ program that simply evaluates the circuit without an integrity guarantee takes 6.07 seconds, so our prover experiences less than a 7x slowdown to provide the integrity guarantee relative to simply evaluating the circuit without such a guarantee.

When multiplying two 512 x 512 matrices A and B, the protocol requires 236 rounds, and the total communication cost of our protocol is 5.48 KBs (plus the amount of communication required to specify the answer AB). The previous state of the art required 767 rounds and close to 18 KBs of communication (plus the amount of communication required to specify AB). Notice that specifying a 512x512 matrix using 8 bytes per entry requires 2 MBs, which is more than 500 times larger than the 5.48 KBs of extra communication required to verify the answer.

A serial C++ program performing 512 x 512 matrix multiplication over the integers with floating point arithmetic (without going through the circuit representation of the computation) required 1.53 seconds, so our prover runs approximately 25 times slower than a standard unverifiable matrix multiplication algorithm. In applications where V need not run in small space, we could additionally apply Vu et al.'s optimization from Lemma 3 to reduce V's runtime from O(n log n) to O(n) and thereby further speed up the verifier.

DISTINCT. The costs of our serial
DISTINCT implementation are displayed in Table 2. The comparison of our implementation with prior work is similar to the case of matrix multiplication. Our prover is roughly 200 times faster than the comparison implementation. For example, when computing the number of non-zero entries of a vector of length 2^20, our prover takes about 17 seconds, while our comparison implementation takes about 57 minutes. A C++ program that simply evaluates the circuit without an integrity guarantee takes 1.88 seconds, so our prover experiences roughly a 10x slowdown to prove an integrity guarantee relative to simply evaluating the circuit. Our implementation required 1361 rounds and 40.76 KBs of total communication, compared to 3916 rounds and 91.3 KBs for the previous state of the art. This is essentially a 3x reduction in the number of rounds, and a 2.25x reduction in the total amount of communication.

Implementation             Problem Size  P Time   V Time  Rounds  Total Communication  Circuit Eval Time
Previous state of the art  256 x 256     1054 s   0.02 s  623     14.6 KBs             0.73 s
Theorem 1                  256 x 256     4.37 s   0.02 s  190     4.4 KBs              0.73 s
Previous state of the art  512 x 512     9759 s   0.10 s  767     17.97 KBs            6.07 s
Theorem 1                  512 x 512     37.85 s  0.10 s  236     5.48 KBs             6.07 s

Table 1: Experimental results for n × n MATMULT with our serial implementation. The Total Communication column does not count the communication required to specify the answer, only the “extra” communication required to run the verification protocol.

Implementation             P Time     V Time  Rounds  Total Communication  Circuit Eval Time
Previous state of the art  3400.23 s  0.20 s  3916    91.3 KBs             1.88 s
Theorem 1                  17.28 s    0.20 s  1361    40.76 KBs            1.88 s

Table 2: Experimental results for computing the number of non-zero entries of a vector of length 2^20 with our serial implementation.

A C++ program that (unverifiably) computes the number of non-zero entries in a vector x with 2^20 entries takes less than .01 seconds, and our prover implementation runs more than 1,
700 times longer thanthis. The reason that the slowdown for the prover relative to an unverifiable algorithm is larger for
DISTINCT than for
MATMULT is that
DISTINCT is a “less arithmetic” problem, in the sense that the size of the arithmeticcircuit we use for computing
DISTINCT is more than 100x larger than the runtime of an unverifiable serialalgorithm for the problem. We stress however that, as pointed out in [38], when solving the
DISTINCT problem in practice, an unverifiable algorithm would first aggregate a data stream into its frequency-vectorrepresentation before determining the number of non-zero frequencies. In reporting a time bound of .01seconds for unverifiably solving
DISTINCT , we are not taking the aggregation time cost into account. Forsufficiently long data streams, the slow-down for our prover relative to an unverifiable algorithm would bemuch smaller than 1 , Our serial implementation demonstrates that P experiences a 10x slowdown in order to evaluate the circuitwith an integrity guarantee relative to simply evaluating the circuit without such a guarantee. The purposeof this section is to demonstrate that parallelization can further mitigate this slowdown. To this end, weimplemented a parallel version of our prover in the context of the matrix multiplication protocol of Section5. Our parallel implementation uses a graphics processing unit (GPU).The high-level idea behind our parallel implementation is the following. Each time we apply the sum-check protocol to a polynomial g ( i ) z , it suffices for P to evaluate g ( i ) z at a large number of points r of the form p = ( r , . . . , r j − , t , b j + , . . . , b s i + ) with t ∈ { , . . . , deg j ( g ( i ) z ) } and ( b j + , . . . , b s i + ) ∈ { , } s i + − j . We canperform each of these evaluations independently. Thus, we devote a single thread on the GPU to each value30 mplementation Problem Size P Time Serial Circuit Eval TimeTheorem 1, Serial Implementation 256 x 256 4.37 s 0.73 sTheorem 1, Parallel Implementation 256 x 256 0.23 s 0.73 sTheorem 1, Serial Implementation 512 x 512 37.85 s 6.07 sTheorem 1, Parallel Implementation 512 x 512 1.29 s 6.07 s
Table 3: Experimental results for n × n MATMULT with our parallel prover implementation.of ( b j + , . . . , b s i + ) ∈ { , } s i + − j and have that thread evaluate g ( i ) z ( r ) at each of the deg j ( g ( i ) z ) + ( r , . . . , r j − , t , b j + , . . . , b s i + ) with the help of the C ( j − ) and V ( j − ) arrays described in Section 5.The one remaining issue is that after each round j of each invocation of the sum-check protocol, we need toupdate the arrays, i.e., we need to compute C ( j ) and V ( j ) . To accomplish this, we devote a single thread toeach entry of C ( j ) and V ( j ) .All steps of our parallel implementation achieve excellent memory coalescing, which likely plays asignificant role in the large speedups we were able to achieve. For example, if two threads are updatingadjacent entries of the array V ( j ) , the only memory accesses that the threads need to perform are to adjacententries of the array V ( j − ) .The results are shown in Table 3: we obtained about a 30x speedup for the prover relative to our serialimplementation. The reported prover runtime does count the time required to copy data between the host(CPU) and the device (GPU), but does not count the time required to evaluate the circuit, which our imple-mentation does in serial for simplicity. While our implementation evaluates the circuit serially, this step canin principle be done in parallel one layer at a time, as these circuits have only logarithmic depth. Notice thatwhen the circuit evaluation runtime is excluded, our parallel prover implementation runs faster in the caseof 512x512 matrix multiplication than the time required to evaluate the circuit sequentially.It is possible that we would observe slightly larger speedups at larger input sizes, but our parallel im-plementation exhausts the memory of the GPU at inputs larger than 512x512. 
This memory bottleneck was also experienced by Thaler, Roberts, Mitzenmacher, and Pfister [38], who used the GPU to obtain a parallel implementation of the protocol of Cormode et al. [14], and it helps motivate the importance of the improved space usage of the special-purpose MATMULT protocol we give later in Theorem 3. For comparison, the GPU implementation of [38] required 39.6 seconds for 256 x 256 matrix multiplication, which is about 175x slower than our parallel implementation.

We also mention that Thaler, Roberts, Mitzenmacher, and Pfister [38] demonstrate that equally large speedups via parallelization are achievable for the (already fast) computation of the verifier. These results directly apply to our protocols as well, as the verifier's runtime in both implementations is dominated by the time required to evaluate the MLE of the input at a random point [14, 38].
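The verifier's dominant cost, evaluating the multilinear extension (MLE) of the input at a random point, can be carried out in a single streaming pass while storing only the random point and a running sum. A minimal sketch of this standard computation (the function name `streaming_mle` and the toy modulus are our own choices):

```python
def streaming_mle(stream, r, p):
    # One-pass evaluation of the MLE of a length-2^v input, presented as a
    # stream of (index, value) pairs, over F_p. Only r and a running sum are
    # stored, i.e., O(v) = O(log n) field elements.
    v = len(r)
    acc = 0
    for idx, val in stream:
        # chi_idx(r) = product over the bits of idx of r_b or (1 - r_b)
        chi = 1
        for b in range(v):
            bit = (idx >> (v - 1 - b)) & 1   # variable b = high-order bit first
            chi = chi * (r[b] if bit else (1 - r[b])) % p
        acc = (acc + val * chi) % p
    return acc

p = 101
inputs = [3, 1, 4, 1]                            # values on {0,1}^2
r = [2, 5]                                       # random point in F_p^2
assert streaming_mle(enumerate(inputs), r, p) == 86
```

Each stream element contributes one Lagrange term $\mathrm{val} \cdot \chi_{\mathrm{idx}}(r)$, so the pass takes $O(n \log n)$ field operations in total, matching the verifier cost quoted above.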
In this section, our goal is to extend the applicability of the GKR protocol. While the GKR protocol applies in principle to any function computed by a small-depth circuit, this is not the case when fine-grained efficiency considerations are taken into account. The implementation of Cormode et al. [14] required the programmer to express a program as an arithmetic circuit, and moreover this circuit needed to have a regular wiring pattern, in the sense that the verifier could efficiently evaluate the polynomials $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ at a point. If this was not the case, the verifier would need to perform an expensive (though data-independent) pre-processing phase to carry out these evaluations. Moreover, even for circuits with regular wiring patterns, this implementation caused the prover to suffer an $O(\log(S(n)))$ factor blowup in runtime relative to evaluating the circuit without a guarantee of correctness. The results of Sections 5 and 8 asymptotically eliminate the blowup in runtime for the prover, but they also only apply when the circuit has a very regular wiring pattern.

The implementation of Vu et al. [40] allows the programmer to express a program in a high-level language, but compiles these programs into potentially irregular circuits that force the verifier to incur the expensive preprocessing phase mentioned above in order to evaluate the polynomials $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ at a point. They therefore propose to apply their system in a "batching" model, where multiple instances of the same sub-computation are applied independently to different pieces of data. More specifically, their system applies the GKR protocol independently to each instance of the computation, and relies on the ability of the verifier to use a single $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ evaluation for all instances of the sub-computation, thereby amortizing the cost of this evaluation across the instances.
To clarify, this use of a single $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ evaluation for all instances as in [40] is only sound if all of the instances are checked simultaneously. If the instances are instead verified one after the other, then P knows V's randomness in all but the first instance, and can use that knowledge to mislead V.

The batching model of Vu et al. is identical to the data parallel setting we consider here. However, a downside to the solution of Vu et al. is that the verifier's work, as well as the total communication cost of the protocol, grows linearly with the "batch size" – the number of applications of the sub-computation that are being outsourced. We wish to develop a protocol whose costs to both the prover and verifier grow much more slowly with the batch size.

As discussed above, existing interactive proof protocols for circuit evaluation either apply only to circuits with highly regular wiring patterns or incur large overheads for the prover and verifier. While we do not have a magic bullet for dealing with irregular wiring patterns, we do wish to mitigate the bottlenecks of existing protocols by leveraging some general structure underlying many real-world computations. Specifically, the structure we focus on exploiting is data parallelism.

By data parallel computation, we mean any setting in which the same sub-computation is applied independently to many pieces of data, before the results are possibly aggregated. Crucially, we do not want to make significant assumptions about the sub-computation that is being applied (in particular, we want to handle sub-computations computed by circuits with highly irregular wiring patterns), but we are willing to assume that the sub-computation is applied independently to many pieces of data. See Figure 2 for a schematic of a data parallel computation.

We have already seen a very simple example of a data parallel computation: the
DISTINCT problem. The circuit C from Section 5 used to solve this problem takes as input a vector $a$ and computes $a_i^{q-1} \bmod q$ for all $i$ (this is the data parallel phase of the computation), before summing the results (this is the aggregation phase). Notice that if the data stream consists of a sequence of words, then the DISTINCT problem becomes the word-count problem, a classic data parallel application.

By design, the protocol of this section also applies to more complicated data parallel computations. For example, it applies to arbitrary counting queries on a database. In a counting query, one applies some function independently to each row of the database and sums the results. For example, one may ask "How many people in the database satisfy property P?" Our protocol allows one to verifiably outsource such a counting query with overhead that depends minimally on the size of the database, but that necessarily depends on the complexity of the property P.
Figure 2: Schematic of a data parallel computation.
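To make the DISTINCT example concrete: by Fermat's little theorem, $a^{q-1} \bmod q$ equals 1 if $a \not\equiv 0 \pmod q$ and 0 if $a = 0$, so summing $a_i^{q-1}$ over the frequency vector counts the distinct items. A sketch of this (unverified) computation, assuming all frequencies lie in $[0, q)$ for a prime $q$:

```python
def distinct_count(freq, q):
    # Fermat's little theorem over the prime field F_q: a^(q-1) = 1 if a != 0
    # (mod q), and 0^(q-1) = 0, so each nonzero frequency contributes exactly 1.
    # The powering is the data parallel phase; the sum is the aggregation phase.
    return sum(pow(a, q - 1, q) for a in freq) % q

# freq[i] = number of occurrences of item i in the stream
freq = [3, 0, 1, 0, 7, 2]
assert distinct_count(freq, q=2**31 - 1) == 4   # four items occur at least once
```

Each exponentiation is applied independently to one entry, which is exactly the shape of computation Figure 2 depicts.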
Let C be a circuit of size $S(n)$ with an arbitrary wiring pattern, and let $C^*$ be a "super-circuit" that applies C independently to B different inputs before aggregating the results in some fashion. For example, in the case of a counting query, the aggregation phase simply sums the results of the data parallel phase. We assume that the aggregation step is sufficiently simple that the aggregation itself can be verified using existing techniques, and we focus on verifying the data parallel part of the computation.

If we naively apply the GKR protocol to the super-circuit $C^*$, V might have to perform an expensive pre-processing phase to evaluate the wiring predicate of $C^*$ at the necessary locations – this would require time $\Omega(B \cdot S)$. Moreover, when applying the basic GKR protocol to $C^*$, P would require time $\Theta(B \cdot S \cdot \log(B \cdot S))$. A different approach was taken by Vu et al. [40], who applied the GKR protocol B independent times, once for each copy of C. This causes both the communication cost and V's online check time to grow linearly with B, the number of sub-computations.

In contrast, our protocol achieves the best of both prior approaches. We observe that although each sub-computation C can have a complicated wiring pattern, the circuit is maximally regular between sub-computations, as the sub-computations do not interact at all. Therefore, each time the basic GKR protocol would apply the sum-check protocol to a polynomial derived from the wiring predicate of $C^*$, we can instead use a simpler polynomial derived only from the wiring predicate of C. By itself, this is enough to ensure that V's pre-processing phase requires time only $O(S)$, rather than $O(B \cdot S)$ as in a naive application of the basic GKR protocol.
That is, the cost of V's pre-processing phase is essentially proportional to the cost of applying the GKR protocol only to C, not to the super-circuit $C^*$.

Furthermore, by combining this observation with the methods of Section 5, we can bring the runtime of P down to $O(B \cdot S \cdot \log S)$. That is, the blowup in runtime suffered by the prover, relative to performing the computation without a guarantee of correctness, is just a factor of $\log S$ – the same as it would be if the prover had run the basic GKR protocol on a single instance of the sub-computation.

7.3 Technical Details

Let C be an arithmetic circuit over $\mathbb{F}$ of depth d and size S with an arbitrary wiring pattern, and let $C^*$ be the circuit of depth d and size $B \cdot S$ obtained by laying B copies of C side-by-side, where $B = 2^b$ is a power of 2. We assume that the in-neighbors of all of the $S_i$ gates at layer i can be enumerated in $O(S_i)$ time. We will use the same notation as in Section 5, using $*$'s to denote quantities referring to $C^*$. For example, layer i of C has size $S_i = 2^{s_i}$ and gate values specified by the function $V_i$, while layer i of $C^*$ has size $S^*_i = 2^{s^*_i}$ and gate values specified by the function $V^*_i$. We denote the length of the input to $C^*$ by $n^* = B \cdot n$.

Our main theorem gives a protocol for computing $\tilde{V}^*_0(z)$ for any point $z \in \mathbb{F}^{s^*_0}$. The idea is that the verifier would first apply simpler techniques (such as the protocol of Theorem 1) to the aggregation phase of the computation to obtain a claim about $\tilde{V}^*_0(z)$, and then use our main theorem to verify this claim. Hence, in principle V need not look at the entire output of the data parallel phase, only the output of the aggregation phase, which we anticipate to be much smaller.

Theorem 2
For any point $z \in \mathbb{F}^{s^*_0}$, there is a valid interactive proof protocol for computing $\tilde{V}^*_0(z)$ with the following costs. V spends $O(S)$ time in a pre-processing phase, and $O(n^* \log n^* + d \cdot \log(B \cdot S))$ time in an online verification phase, where the $n^* \log n^*$ term is due to the time required to evaluate the multilinear extension of the input to $C^*$ at a point. P runs in total time $O(S \cdot B \cdot \log S)$. The total communication is $O(d \cdot \log(B \cdot S))$ field elements.

Proof:
Consider layer i of $C^*$. Let $p = (p_1, p_2) \in \{0,1\}^{s_i} \times \{0,1\}^b$ be the label of a gate at layer i of $C^*$, where $p_2$ specifies which "copy" of C the gate is in, while $p_1$ designates the label of the gate within the copy. Similarly, let $\omega = (\omega_1, \omega_2) \in \{0,1\}^{s_{i+1}} \times \{0,1\}^b$ and $\gamma = (\gamma_1, \gamma_2) \in \{0,1\}^{s_{i+1}} \times \{0,1\}^b$ be the labels of two gates at layer $i+1$. For all $(p_1, p_2) \in \{0,1\}^{s_i} \times \{0,1\}^b$,

$$V^*_i(p_1, p_2) = \sum_{\omega_1 \in \{0,1\}^{s_{i+1}}} \sum_{\gamma_1 \in \{0,1\}^{s_{i+1}}} h^{(i)}(p_1, p_2, \omega_1, \gamma_1),$$

where

$$h^{(i)}(p_1, p_2, \omega_1, \gamma_1) = \widetilde{\mathrm{add}}_i(p_1, \omega_1, \gamma_1) \left( \tilde{V}^*_{i+1}(\omega_1, p_2) + \tilde{V}^*_{i+1}(\gamma_1, p_2) \right) + \widetilde{\mathrm{mult}}_i(p_1, \omega_1, \gamma_1) \left( \tilde{V}^*_{i+1}(\omega_1, p_2) \cdot \tilde{V}^*_{i+1}(\gamma_1, p_2) \right).$$

Essentially, this equation says that an addition (respectively, multiplication) gate $p = (p_1, p_2) \in \{0,1\}^{s_i + b}$ is connected to gates $\omega = (\omega_1, \omega_2) \in \{0,1\}^{s_{i+1} + b}$ and $\gamma = (\gamma_1, \gamma_2) \in \{0,1\}^{s_{i+1} + b}$ if and only if $p$, $\omega$, and $\gamma$ are all in the same copy of C, and $p_1$ is connected to $\omega_1$ and $\gamma_1$ within the copy.

Lemma 4 then implies that for any $z \in \mathbb{F}^{s^*_i}$,

$$\tilde{V}^*_i(z) = \sum_{(p_1, p_2, \omega_1, \gamma_1) \in \{0,1\}^{s_i} \times \{0,1\}^b \times \{0,1\}^{s_{i+1}} \times \{0,1\}^{s_{i+1}}} \beta_{s^*_i}(z, (p_1, p_2)) \cdot h^{(i)}(p_1, p_2, \omega_1, \gamma_1).$$

In iteration i of our protocol, we apply the sum-check protocol to the polynomial $g^{(i)}_z$ given by $g^{(i)}_z(p_1, p_2, \omega_1, \gamma_1) = \beta_{s^*_i}(z, (p_1, p_2)) \cdot h^{(i)}(p_1, p_2, \omega_1, \gamma_1)$. The communication costs of this protocol are immediate.

Costs for V. In order to run her part of the sum-check protocol of iteration i, V only needs to perform the required checks on each of P's messages. V's check requires $O(1)$ time in each round of the sum-check protocol except the last. In the last round of the sum-check protocol, V must evaluate the polynomial $g^{(i)}_z$ at a single point. This requires evaluating $\beta_{s^*_i}$, $\widetilde{\mathrm{add}}_i$, $\widetilde{\mathrm{mult}}_i$, and $\tilde{V}^*_{i+1}$ at a constant number of points.
The $\tilde{V}^*_{i+1}$ evaluations are provided by P in all iterations i of the protocol except the last, while the $\beta_{s^*_i}$ evaluation can be done in $O(\log(B \cdot S))$ time.

The $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ evaluations can be done in pre-processing in time $O(S_i)$ by enumerating the in-neighbors of each of the $S_i$ gates at layer i [14, 40]. Adding up the pre-processing time across all iterations i of our protocol, V's pre-processing time is $O(\sum_i S_i) = O(S)$ as claimed.

In the final iteration of the protocol, P no longer provides the $\tilde{V}^*_{i+1}$ evaluation for V; instead, V must evaluate the multilinear extension of the input at a point on her own. This can be done in a streaming manner using space $O(\log n^*)$ and time $O(n^* \log n^*)$. The time cost for V in the online phase follows.

Costs for P. It remains to show that P can perform the required computations in iteration i of the protocol in time $O((S_i + S_{i+1}) \cdot B \cdot \log S)$. To this end, notice $g^{(i)}_z$ is a polynomial in $v := s_i + 2 s_{i+1} + b$ variables. We order the sum in this sum-check protocol so that the $s_i + 2 s_{i+1}$ variables in $p_1$, $\omega_1$, and $\gamma_1$ are bound first in arbitrary order, followed by the variables of $p_2$. P can compute the prescribed messages in the first $s_i + 2 s_{i+1} = O(\log S)$ rounds exactly as in the implementation of Cormode et al. [14]. They show that each gate at layers i and $i+1$ of $C^*$ contributes to exactly one term in the sum defining P's message in any given round of the sum-check protocol, and moreover that the contribution of a given gate can be determined in $O(1)$ time.
Hence the total time required by P to handle these rounds is $O(B \cdot (S_i + S_{i+1}) \cdot \log S)$. It remains to show how P can compute the prescribed messages in the final b rounds of the sum-check protocol while investing $O((S_i + S_{i+1}) \cdot B)$ time across all rounds of the protocol.

Recall that in order to compute P's message in round j of the sum-check protocol applied to the v-variate polynomial $g^{(i)}_z$, it suffices for P to evaluate $g^{(i)}_z$ at the $(\deg_j(g^{(i)}_z) + 1) \cdot 2^{v-j}$ points of the form $(r_1, \dots, r_{j-1}, t, b_{j+1}, \dots, b_v)$, with $t \in \{0, \dots, \deg_j(g^{(i)}_z)\}$ and $(b_{j+1}, \dots, b_v) \in \{0,1\}^{v-j}$. Each of these evaluations of $g^{(i)}_z$ can be computed in $O(1)$ time given the evaluations of $\beta_{s^*_i}$, $\widetilde{\mathrm{add}}_i$, $\widetilde{\mathrm{mult}}_i$, and $\tilde{V}^*_{i+1}$ at the relevant points.

Notice that once the variables in $p_1$, $\omega_1$, and $\gamma_1$ are bound to specific values, say $r^{(p_1)}$, $r^{(\omega_1)}$, and $r^{(\gamma_1)}$, the polynomials $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ are themselves bound to specific values, namely $\widetilde{\mathrm{add}}_i(r^{(p_1)}, r^{(\omega_1)}, r^{(\gamma_1)})$ and $\widetilde{\mathrm{mult}}_i(r^{(p_1)}, r^{(\omega_1)}, r^{(\gamma_1)})$. So P only needs to evaluate these polynomials once, and both of these evaluations can be computed by P in $O(S_i)$ time. Thus, the $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ evaluations in the last b rounds require just $O(S_i)$ time in total.

P can evaluate the function $\beta_{s^*_i}$ at the relevant points exactly as in the proof of Theorem 1, using the $C^{(j)}$ arrays to ensure that this computation is done quickly. The array $C^{(0)}$ has size $2^{s^*_i} = O(S_i \cdot B)$, and $C^{(j-1)}$ gets updated to $C^{(j)}$ whenever a variable in $p_1$ or $p_2$ becomes bound. This ensures that across all rounds of the sum-check protocol, the $\beta_{s^*_i}$ evaluations require $O(S_i \cdot B)$ time in total. Likewise, the $\tilde{V}^*_{i+1}$ evaluations can be handled exactly as in Theorem 1, using the $V^{(j)}$ arrays to ensure that this computation is done quickly.
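The array updates just described all follow a single pattern: when a sum-check round binds one more variable to a random field element, each entry of the new array is a linear combination of two entries of the old one, and the array halves in length, so the total update work across all rounds is linear in the initial array size. A sketch of one update step (the function name `bind_variable` and the toy modulus are ours):

```python
def bind_variable(V_prev, r, p):
    # Fold the table of values of a multilinear polynomial on {0,1}^v down to
    # {0,1}^(v-1) by fixing the first variable (= high-order index bit) to r:
    #   V_new[b] = (1 - r) * V_prev[(0, b)] + r * V_prev[(1, b)]  over F_p.
    # Since the array halves each round, the total work across all rounds is
    # O(len of the initial array), the source of the linear-time bounds above.
    half = len(V_prev) // 2
    return [((1 - r) * V_prev[i] + r * V_prev[half + i]) % p for i in range(half)]

p = 101
V0 = [1, 2, 3, 4]              # values of a function V on {0,1}^2
V1 = bind_variable(V0, 2, p)   # bind the first variable to 2
V2 = bind_variable(V1, 3, p)   # bind the second variable to 3
assert V2 == [8]               # the multilinear extension of V0, evaluated at (2, 3)
```

Folding all the way down to a single entry recovers the multilinear extension evaluated at the full sequence of bound values, which is exactly the quantity the prover must report in the final round.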
The array $V^{(0)}$ has size $2^{s^*_{i+1}} = O(S_{i+1} \cdot B)$, and $V^{(j-1)}$ gets updated to $V^{(j)}$ whenever a variable in $\omega_1$ becomes bound (and similarly for the variables in $\gamma_1$). This ensures that across all rounds of the sum-check protocol, the $\tilde{V}^*_{i+1}$ evaluations take $O((S_i + S_{i+1}) \cdot B)$ time in total.

Reducing to Verification of a Single Point. After executing the sum-check protocol at layer i as described above, V is left with a claim about $\tilde{V}^*_{i+1}(\omega_1, p_2)$ and $\tilde{V}^*_{i+1}(\gamma_1, p_2)$, for $\omega_1, \gamma_1 \in \mathbb{F}^{s_{i+1}}$ and $p_2 \in \mathbb{F}^b$. This requires P to send $\tilde{V}^*_{i+1}(\ell(t))$ for a canonical line $\ell(t)$ that passes through $(\omega_1, p_2)$ and $(\gamma_1, p_2)$. It is easily seen that $\tilde{V}^*_{i+1}(\ell(t))$ is a univariate polynomial of degree at most $s_{i+1}$; here, we are exploiting the fact that the final b coordinates of $(\omega_1, p_2)$ and $(\gamma_1, p_2)$ are equal. Hence P can specify $\tilde{V}^*_{i+1}(\ell(t))$ by sending $\tilde{V}^*_{i+1}(\ell(t_j))$ for $O(s_{i+1})$ points $t_j \in \mathbb{F}$. Using the method of Lemma 3, P can evaluate $\tilde{V}^*_{i+1}$ at each point $\ell(t_j)$ in $O(S_{i+1} \cdot B)$ time, and hence can perform all of the $\tilde{V}^*_{i+1}(\ell(t_j))$ evaluations in $O(S_{i+1} \cdot B \cdot \log S)$ time in total. This ensures that across all iterations of our protocol, P devotes at most $O(S \cdot B \cdot \log S)$ time to the "reducing to verification of a single point" phase of the protocol. This completes the proof.

In practice we would expect the results of the data parallel phase of computation represented by the super-circuit $C^*$ to be aggregated in some fashion. We assume this aggregation step is amenable to verification via other techniques. In the case of counting queries, the aggregation step simply sums the outputs of the data parallel step, which can be handled via Theorem 1, or slightly more efficiently via Proposition 7 described below in Section 8.
More generally, if this aggregation step is computed by a circuit $C'$ of size $O(S \cdot B \cdot \log S / \log B)$ such that V can efficiently evaluate the multilinear extension of the wiring predicate of $C'$, then we can simply apply the basic GKR protocol to $C'$ with asymptotic costs smaller than those of the protocol described in Theorem 2. This application of the GKR protocol to $C'$ ends with a claim about the value of $\tilde{V}^*_0(z)$ for some $z \in \mathbb{F}^{s^*_0}$. The verifier can then invoke the protocol of Theorem 2 to verify this claim. We stress that the protocol of Theorem 2 can also be applied if there are multiple data parallel stages interleaved with aggregation stages.

In this section we describe two final optimizations that are much more specialized than Theorems 1 and 2, but that have a significant effect in practice when they apply. In particular, Section 8.2 culminates in a protocol for matrix multiplication that is of interest in its own right. It is hundreds of times faster than the protocol implied by Theorem 1 and studied experimentally in Section 6.
Cormode et al. [14] describe an optimization that applies to any circuit C with a single output that culminates in a binary tree of addition gates; at a high level, they directly apply a single sum-check protocol to the entire binary tree, thereby treating the entire tree as a single addition gate with very large fan-in. In contrast, the optimization described here applies to circuits with multiple outputs, and it allows the binary tree of addition gates to occur anywhere in the circuit, not just at the layers immediately preceding the outputs.

At first blush, our optimization might seem quite specialized, since it only applies to circuits with a specific wiring pattern. However, this is one of the most commonly occurring wiring patterns, as evidenced by its appearance within the circuits computing MATMULT, DISTINCT, Pattern Matching, and counting queries. Notice that our optimization also applies to verifying multiple independent instances of any problem with a single output whose circuit ends with a binary tree of sum-gates, such as verifying the number of distinct items in multiple distinct data streams, or posing multiple separate counting queries to a database. This is because, similar to Theorem 2, one can lay the circuits for each of the individual problem instances

Implementation   Problem Size   P Time    V Time   Rounds   Total Communication   Circuit Eval Time
Theorem 1        256 x 256      4.37 s    0.02 s   190      4.4 KB                0.73 s
Proposition 7    256 x 256      2.52 s    0.02 s   35       0.76 KB               0.73 s
Theorem 1        512 x 512      37.85 s   0.10 s   236      5.48 KB               6.07 s
Proposition 7    512 x 512      22.98 s   0.10 s   39       0.86 KB               6.07 s
Table 4: Experimental results for n × n MATMULT, with and without the refinement of Section 8.1. As in Table 1, the Total Communication column does not count the $n^2$ field elements required to specify the answer.

side-by-side and treat the result as a single "super-circuit" culminating in a binary tree of addition gates with multiple outputs.

The starting point for our optimization is the observation of Vu et al. [40] mentioned in Section 4.3.2: in order to verify that P has correctly evaluated a circuit with many output gates, P may simply send V the (claimed) values of all output gates, thereby specifying a function $V'_0 : \{0,1\}^{s_0} \to \mathbb{F}$ claimed to equal $V_0$. V can pick a random point $z \in \mathbb{F}^{s_0}$ and evaluate $\tilde{V}'_0(z)$ on her own in $O(S_0)$ time. An application of the Schwartz-Zippel Lemma (Lemma 1) implies that it is safe for V to believe that $V_0$ is as claimed, as long as $\tilde{V}_0(z) = \tilde{V}'_0(z)$. Our protocol as described in Section 5 would then proceed in iterations, with one iteration per layer of the circuit and one application of the sum-check protocol per iteration. This would ultimately reduce P's claim about the value of $\tilde{V}_0(z)$ to a claim about $\tilde{V}_d(z')$ for some $z' \in \mathbb{F}^{s_d}$, where d is the input layer of the circuit. Instead, our final refinement uses a single sum-check protocol to directly reduce P's claim about $\tilde{V}_0(z)$ to a claim about $\tilde{V}_d(z')$ for a random point $z' \in \mathbb{F}^{s_d}$.

Proposition 7
Let C be a depth-d circuit consisting of a binary tree of addition gates, with $2^k$ inputs and $2^{k-d}$ outputs. For any point $z \in \mathbb{F}^{k-d}$,

$$\tilde{V}_0(z) = \sum_{p = (p_{k-d+1}, \dots, p_k) \in \{0,1\}^d} g_z(p), \quad \text{where} \quad g_z(p) = \tilde{V}_d(z, p_{k-d+1}, \dots, p_k).$$

Proof:
At layer i of C, the gate with label $p \in \{0,1\}^{s_i}$ is the sum of the gates with labels $(p, 0)$ and $(p, 1)$ at layer i +
1. It is then straightforward to observe that for any $p \in \{0,1\}^{k-d}$, the p-th output gate has value

$$V_0(p_1, \dots, p_{k-d}) = \sum_{(p_{k-d+1}, \dots, p_k) \in \{0,1\}^d} \tilde{V}_d(p_1, \dots, p_{k-d}, p_{k-d+1}, \dots, p_k). \quad (9)$$

Notice that the right hand side of Equation (9) is a multilinear polynomial in the variables $(p_1, \dots, p_{k-d})$ that agrees with $V_0(p_1, \dots, p_{k-d})$ at all Boolean inputs. Hence, the right hand side is the (unique) multilinear extension $\tilde{V}_0$ of the function $V_0 : \{0,1\}^{k-d} \to \mathbb{F}$. The proposition follows.

In applying the sum-check protocol to the polynomial $g_z$ in Proposition 7, it is straightforward to use the methods of Section 5.4.2 to implement the honest prover in time $O(2^k)$. We omit the details for brevity.

Experimental Results.
Let C be the circuit for naive matrix multiplication described in Section 5.5.1. To demonstrate the efficiency gains implied by Proposition 7, we modified our MATMULT implementation of Section 6.2.1 to use the protocol of Proposition 7 to verify the sub-circuit of C consisting of a binary tree of addition gates. The results are shown in Table 4. The optimizations of this section reduce P's runtime by a factor of 1.5x-2x, the total number of rounds by a factor of more than 5, and the total communication (not counting the cost of specifying the output of the circuit) by a factor of more than 5.

8.2 Optimal Space and Time Costs for MATMULT
We describe a final optimization here, on top of Proposition 7. While this optimization is specific to the MATMULT problem, its effects are substantial, and the underlying observation may be more broadly applicable.

Suppose we are given an unverifiable algorithm for n × n matrix multiplication that requires time $T(n)$ and space $s(n)$. Our refinements reduce the prover's runtime from $O(n^3)$ in the case of Sections 5 and 8.1 to $T(n) + O(n^2)$, and lower P's space requirement to $s(n) + o(n^2)$. That is, in the protocol the prover sends the correct output and performs just $O(n^2)$ additional work to provide a guarantee of correctness on top. It is irrelevant what algorithm the prover uses to arrive at the correct output – in particular, algorithms much more sophisticated than naive matrix multiplication are permitted. This runtime and space usage for P are optimal, even up to the leading constant, assuming matrix multiplication cannot be computed in $O(n^2)$ time. The final protocol is extremely natural, as it consists of a single invocation of the sum-check protocol. We believe this protocol is of interest in its own right. The proof and technical details are in Section 8.2.2.

Theorem 3
There is a valid interactive proof protocol for n × n matrix multiplication over the field $\mathbb{F}_q$ with the following costs. The communication cost is $n^2 + O(\log n)$ field elements. The runtime of the prover is $T(n) + O(n^2)$ and the space usage is $s(n) + o(n^2)$, where $T(n)$ and $s(n)$ are the time and space requirements of any (unverifiable) algorithm for n × n matrix multiplication. The verifier can make a single streaming pass over the input as well as over the claimed output in time $O(n^2 \log n)$, storing $O(\log n)$ field elements.

Using the observation of Vu et al. described in Lemma 3, the runtime of the verifier can be brought down to $O(n^2)$ at the cost of increasing V's space usage to $O(n^2)$. Furthermore, by Remark 1, the runtime of the verifier can be brought down to $O(n^2)$ while maintaining the streaming property if the input matrices are presented in row-major order.

The prover's runtime in Theorem 3 is within an additive low-order term of that of any unverifiable algorithm for matrix multiplication; this is essential in the many practical scenarios where even a 2x slowdown is too steep a price to pay for verifiability. Notice also that the space usage bounds in Theorem 3 are in stark contrast to protocols based on circuit-checking: the prover in a general circuit-checking protocol may have to store the entire circuit, and this can result in space requirements that are much larger than those of an unverifiable algorithm for the problem. For example, naive matrix multiplication requires time $O(n^3)$ but only $O(n^2)$ space, while the provers in our MATMULT protocols of Sections 5 and 8.1 require space and time $O(n^3)$. As implementations of interactive proofs become faster, the prover is likely to run out of space long before she runs out of time.

It is worth comparing Theorem 3 to a well-known protocol due to Freivalds [17]. Let $D^*$ denote the claimed output matrix.
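For reference, Freivalds' check, recalled in the next paragraph, can be sketched as follows; the modulus, the repetition count, and the function name are our own illustrative choices:

```python
import random

def freivalds_check(A, B, D, p, reps=20):
    # Accept iff A(Bx) == Dx for `reps` random vectors x over F_p.
    # If D != AB, a single repetition catches the error with probability at
    # least 1 - 1/p, so the soundness error is at most p^(-reps). Each
    # repetition costs only three matrix-vector products, i.e., O(n^2) time.
    n = len(A)
    for _ in range(reps):
        x = [random.randrange(p) for _ in range(n)]
        Bx = [sum(B[i][j] * x[j] for j in range(n)) % p for i in range(n)]
        ABx = [sum(A[i][j] * Bx[j] for j in range(n)) % p for i in range(n)]
        Dx = [sum(D[i][j] * x[j] for j in range(n)) % p for i in range(n)]
        if ABx != Dx:
            return False   # caught a wrong claimed product
    return True

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]   # the true product AB
bad = [[19, 22], [43, 51]]    # one corrupted entry
assert freivalds_check(A, B, good, p=2**31 - 1)
assert not freivalds_check(A, B, bad, p=2**31 - 1)
```

Note that the verifier here must hold the full vector x, which is the $\Omega(n)$-space drawback discussed under "Small-Space Streaming Verifiers" later in this section.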
In Freivalds' algorithm, the verifier stores a random vector $x \in \mathbb{F}_q^n$, and computes $D^* x$ and $ABx$, accepting if and only if
$ABx = D^* x$. Freivalds showed that this is a valid protocol. In both Freivalds' protocol and that of Theorem 3, the prover runs in time $T(n) + O(n^2)$ (in the case of Freivalds' algorithm, the $O(n^2)$ term is 0), and the verifier runs in linear or quasilinear time. We now highlight several properties of our protocol that are not achieved by prior work.

Utility as a Primitive.
A major advantage of Theorem 3 relative to prior work is its utility as a primitive that can be used to verify more complicated computations. This is important, as many algorithms repeatedly invoke matrix multiplication as a subroutine. For concreteness, consider the problem of computing $A^{2^k}$ via repeated squaring. By iterating the protocol of Theorem 3 k times, we obtain a valid interactive proof protocol for computing $A^{2^k}$ with communication cost $n^2 + O(k \log n)$. The $n^2$ term is due simply to specifying the output $A^{2^k}$, and can often be avoided in applications – see for example the diameter protocol described two paragraphs hence. The i-th iteration of the protocol for computing $A^{2^k}$ reduces a claim about an evaluation of the multilinear extension of $A^{2^{k-i+1}}$ to an analogous claim about an evaluation of the multilinear extension of $A^{2^{k-i}}$. Crucially, the prover in this protocol never needs to send the verifier the intermediate matrices $A^{2^{k'}}$ for $k' < k$. In contrast, applying Freivalds' algorithm to this problem would require $O(k n^2)$ communication, as P must specify each of the intermediate matrices $A^{2^i}$.

The ability to avoid having P explicitly send intermediate matrices is especially important in settings where an algorithm repeatedly invokes matrix multiplication, but the desired output of the algorithm is smaller than the size of the matrices involved. In these cases, it is not necessary for P to send any matrices; P can instead send just the desired output, and V can use Theorem 3 to check the validity of the output with only a polylogarithmic amount of additional communication. This is analogous to how the verifier in the GKR protocol can check the values of the output gates of a circuit without ever seeing the values of the "interior" gates of the circuit.

As a concrete example illustrating the power of our matrix multiplication protocol, consider the fundamental problem of computing the diameter of an unweighted (possibly directed) graph G on n vertices.
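The diameter protocol rests on the characterization, developed next, that the diameter is the least $d \geq 1$ for which every entry of $(A+I)^d$ is nonzero. A sketch of the underlying unverifiable computation (naive repeated multiplication over the integers for clarity; a real prover would use repeated squaring and its preferred multiplication algorithm):

```python
def diameter(adj):
    # Diameter of a strongly connected graph with diameter >= 1: the entry
    # ((A+I)^d)[i][j] counts walks of length <= d from i to j (with repeats),
    # so it is nonzero exactly when dist(i, j) <= d. Return the least such d.
    n = len(adj)
    M = [[adj[i][j] + (i == j) for j in range(n)] for i in range(n)]  # A + I
    P = [row[:] for row in M]
    d = 1
    while any(P[i][j] == 0 for i in range(n) for j in range(n)):
        P = [[sum(P[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        d += 1
    return d

# Path graph 0 - 1 - 2 - 3: the farthest pair is (0, 3), at distance 3.
path = [[1 if abs(i - j) == 1 else 0 for j in range(4)] for i in range(4)]
assert diameter(path) == 3
```

In the interactive protocol, the prover performs essentially this computation and then certifies the $O(\log d)$ matrix products via Theorem 3 rather than shipping any intermediate matrix.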
Let A denote the adjacency matrix of G, and let I denote the n × n identity matrix. Then it is easily verified that the diameter of G is the least positive number d such that $((A+I)^d)_{ij} \neq 0$ for all pairs $(i, j)$. We therefore obtain the following natural protocol for diameter. P sends the claimed output d to V, as well as a pair $(i, j)$ such that $((A+I)^{d-1})_{ij} =$
0. To confirm that d is the diameter of G, it suffices for V to check two things: first, that all entries of $(A+I)^d$ are non-zero, and second, that $((A+I)^{d-1})_{ij}$ is indeed zero.

The first task is accomplished by combining our matrix multiplication protocol of Theorem 3 with our DISTINCT protocol from Theorem 1. Indeed, let $d_j$ denote the j-th bit in the binary representation of d. Then $(A+I)^d = \prod_{j=0}^{\lceil \log d \rceil} \left( (A+I)^{2^j} \right)^{d_j}$, so the number of non-zero entries of $(A+I)^d$ can be computed via a sequence of $O(\log d)$ matrix multiplications, followed by a DISTINCT computation. The second task, of verifying that $((A+I)^{d-1})_{ij} =$
0, is similarly accomplished using $O(\log d)$ invocations of the matrix multiplication protocol of Theorem 3 – since V is only interested in one entry of $(A+I)^{d-1}$, P need not send the matrix $(A+I)^{d-1}$ in full, and the total communication here is just $\mathrm{polylog}(n)$.

V's runtime in this diameter protocol is $O(m \log n)$, where m is the number of edges in G. P's runtime in the above diameter protocol matches the best known unverifiable diameter algorithm up to a low-order additive term [33, 42], and the communication is just $\mathrm{polylog}(n)$. We know of no other protocol achieving this. As discussed above, the fact that P's slowdown is a low-order additive term is critical in the many settings in which even a 2x slowdown to achieve verifiability is unacceptable. Moreover, if P had to send the matrices $(I+A)^d$ or $(I+A)^{d-1}$ explicitly (as required in prior work, e.g., Cormode et al. [13]), the communication cost would be at least $n^2$ words; for a graph on millions of nodes, this translates to terabytes of data.

Small-Space Streaming Verifiers.
In Freivalds' algorithm, V has to store the random vector x, which requires $\Omega(n)$ space. There are methods to reduce V's space usage by generating x with limited randomness: Kimbrel and Sinha [26] show how to reduce V's space to $O(\log n)$, but their solution does not work if V must make a streaming pass over arbitrarily ordered input. Chakrabarti et al. [12] extend the method of Kimbrel and Sinha to work with a streaming verifier, but this requires P to play back the input matrices A, B in a special order, increasing the proof length to $3n^2$. Our protocol works with a streaming verifier using $O(\log n)$ space, and our proof length is $n^2 + O(\log n)$, where the $n^2$ term is due to specifying AB and can be avoided in applications such as the diameter example considered above.

8.2.2 Protocol Details

The idea behind the optimization is as follows. All of our earlier circuit-checking protocols only make use of the multilinear extension $\tilde{V}_i$ of the function $V_i$ mapping gate labels at layer i of the circuit to their values. In some cases, there is something to be gained by using a higher-degree extension of $V_i$, and this is precisely what we exploit here. By using a higher-degree extension of the gate values in the circuit, we are able to apply the sum-check protocol to a polynomial that differs from the one used in Section 5. In particular, the polynomial we use here avoids referencing the $\beta_{s_i}$ polynomial used in Section 5. Details follow.

When multiplying matrices A and B such that $AB = D$, let $A(i, j)$, $B(i, j)$, and $D(i, j)$ denote the functions from $\{0,1\}^{\log n} \times \{0,1\}^{\log n} \to \mathbb{F}_q$ that map input $(i, j)$ to $A_{ij}$, $B_{ij}$, and $D_{ij}$ respectively. Let $\tilde{A}$, $\tilde{B}$, and $\tilde{D}$ denote their multilinear extensions.

Lemma 5
For all (p_1, p_2) ∈ F^{log n} × F^{log n},

˜D(p_1, p_2) = ∑_{p_3 ∈ {0,1}^{log n}} ˜A(p_1, p_3) · ˜B(p_3, p_2).

Proof:
For all (p_1, p_2) ∈ {0,1}^{log n} × {0,1}^{log n}, the right hand side is easily seen to equal D(p_1, p_2), using the fact that D_{ij} = ∑_k A_{ik} B_{kj} and the fact that ˜A and ˜B agree with the functions A(i, j) and B(i, j) at all Boolean inputs. Moreover, the right hand side is a multilinear polynomial in the variables of (p_1, p_2). Putting these facts together implies that the right hand side is the unique multilinear extension of the function D(i, j).

Lemma 5 implies the following valid interactive proof protocol for matrix multiplication: P sends a matrix D* claimed to equal the product D = AB. V evaluates ˜D*(r_1, r_2) at a random point (r_1, r_2) ∈ F^{log n} × F^{log n}. By the Schwartz-Zippel lemma, it is safe for V to believe D* is as claimed, as long as ˜D*(r_1, r_2) = ˜D(r_1, r_2) (formally, if D* ≠ D, then ˜D*(r_1, r_2) ≠ ˜D(r_1, r_2) with probability at least 1 − 2 log(n)/q). In order to check that ˜D*(r_1, r_2) = ˜D(r_1, r_2), we invoke the sum-check protocol on the polynomial g_{r_1,r_2}(p_3) = ˜A(r_1, p_3) · ˜B(p_3, r_2).

V's final check in this protocol requires her to compute g_{r_1,r_2}(r_3) for a random point r_3 ∈ F^{log n}. V can do this by evaluating both of ˜A(r_1, r_3) and ˜B(r_3, r_2) with a single streaming pass over the input, and then multiplying the results.

The prover can be made to run in time T(n) + O(n^2) across all rounds of the sum-check protocol, where T(n) is the time P spends computing the answer D = AB, using the V^(j) arrays described in Section 5 to quickly evaluate ˜A and ˜B at all of the necessary points. The V^(j) arrays are initialized in round 0 to equal the input matrices themselves, and there is no need for P to maintain an "uncorrupted" copy of the original input (though in practice this may be desirable).
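The identity in Lemma 5 is easy to verify computationally. The following Python sketch is purely illustrative (the helper names are ours, and the field modulus is an arbitrary choice); it evaluates multilinear extensions by the Lagrange-basis formula over a prime field and confirms that ˜D(r_1, r_2) equals the Boolean sum of ˜A(r_1, p_3) · ˜B(p_3, r_2) for small random matrices.

```python
import itertools, random

Q = 2**61 - 1  # an illustrative prime field modulus

def chi(bits, point):
    # Multilinear Lagrange basis: chi_b(x) = prod(x_k if b_k == 1 else 1 - x_k).
    out = 1
    for c, x in zip(bits, point):
        out = out * (x if c else (1 - x) % Q) % Q
    return out

def mle(M, p1, p2, b):
    # Evaluate the multilinear extension of the 2^b x 2^b matrix M at (p1, p2).
    total = 0
    for i in itertools.product((0, 1), repeat=b):
        for j in itertools.product((0, 1), repeat=b):
            row = int(''.join(map(str, i)), 2)
            col = int(''.join(map(str, j)), 2)
            total = (total + M[row][col] * chi(i, p1) % Q * chi(j, p2)) % Q
    return total

b, n = 2, 4
rng = random.Random(0)
A = [[rng.randrange(Q) for _ in range(n)] for _ in range(n)]
B = [[rng.randrange(Q) for _ in range(n)] for _ in range(n)]
D = [[sum(A[i][k] * B[k][j] for k in range(n)) % Q
      for j in range(n)] for i in range(n)]

r1 = [rng.randrange(Q) for _ in range(b)]
r2 = [rng.randrange(Q) for _ in range(b)]

# Lemma 5: tilde{D}(r1, r2) = sum over Boolean p3 of tilde{A}(r1, p3) * tilde{B}(p3, r2).
lhs = mle(D, r1, r2, b)
rhs = sum(mle(A, r1, list(p3), b) * mle(B, list(p3), r2, b)
          for p3 in itertools.product((0, 1), repeat=b)) % Q
assert lhs == rhs
```

In the actual protocol, of course, V never computes the right hand side directly; the sum-check protocol forces P to do that work.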
Thus, the V^(j) arrays can be computed using the storage P initially devoted to the inputs, and P needs to store just O(1) additional field elements over the course of the protocol (P does not even need to store the messages sent by V, as P need not refer to the jth message once the array V^(j) is computed). The space usage bound for P claimed in Theorem 3 follows.

Remark 8
Let C be the circuit for naive matrix multiplication described in Section 5. Notice that the (3 log n)-variate polynomial h(p_1, p_2, p_3) = ˜A(p_1, p_3) · ˜B(p_3, p_2) extends the function V_i mapping gate labels at layer i = log n of C to their values. However, h is not the multilinear extension of V_i, as h has degree two in the variables of p_3.

Informally, Theorem 3 cannot be said to perform "circuit checking" on C, since it is not necessary for P to evaluate all of the gates in C; indeed, the prover in Theorem 3 can run in sub-cubic time using fast matrix multiplication algorithms. However, the use of a low-degree extension of the gate values at layer log n of C allows one to view the protocol of Theorem 3 as a direct extension of the circuit-checking methodology.

Table 5: Experimental results for the n × n MatMult protocol of Theorem 3. (The numeric entries of this table did not survive extraction; its columns are Implementation, Problem Size, Naive Matrix Multiplication Time, Additional Time for P, V Time, and Rounds, with one row for the protocol run over 64-bit integers (Z) and one over F_q.)
Remark 9
Consider the problem of computing the matrix power M^{2^k} via repeated squaring. We may apply the protocol of Theorem 3 in k iterations, with the ith iteration applied to the inputs A = B = M^{2^{k−i}}. The ith iteration of this protocol reduces a claim about an evaluation of the multilinear extension of M^{2^{k−i+1}} to an analogous claim about the multilinear extension of M^{2^{k−i}} at two points of the form (r_1, r_3), (r_3, r_2) ∈ F^{log n} × F^{log n}. We can further reduce the claims about (r_1, r_3) and (r_3, r_2) to a claim about a single point exactly as in the "Reducing to Verification of a Single Point" step of the GKR protocol. We then move on to iteration i + 1. Notice in particular that the verifier only needs to observe the output matrix M^{2^k} and the input matrix M to run this protocol; in particular, P does not need to explicitly send the intermediate matrices M^{2^{k−i}} to V.

We implemented the protocol just described (our implementation is sequential). The results are shown in Table 5, where the column labelled "Additional Time for P" denotes the time required to compute P's prescribed messages after P has already computed the correct answer. We report the naive matrix multiplication time both when the computation is done using standard multiplication of 64-bit integers, as well as when the computation is done using finite field arithmetic over the field with q = 2^61 − 1. The verifier's runtime is consistent with the O(n^2 log n) time reported in Theorem 3. The verifier's runtime could be improved using Lemma 3 at the cost of increasing V's space usage to O(n^2), but we did not implement this optimization. Moreover, if the input matrices are presented in row-major order, then the observation of Vu et al. described in Remark 1 improves V's runtime with no increase in space usage.

The main takeaways from Table 5 are that the verifier does indeed save substantial time relative to performing matrix multiplication locally, and that the runtime of the prover is hugely dominated by the time required simply to compute the answer.
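To illustrate the point of Remark 9 that V needs only M and the final power, the following Python sketch (illustrative only; it performs the nested Boolean sums directly instead of running the sum-check protocol, so it does not achieve the protocol's efficiency, and the helper names are ours) checks a claimed evaluation of the multilinear extension of M^4 using only entries of M, never materializing M^2.

```python
import itertools, random

Q = 2**61 - 1  # an illustrative prime field modulus
B = 2          # matrices are 2^B x 2^B

def chi(bits, point):
    # Multilinear Lagrange basis polynomial chi_bits evaluated at point.
    out = 1
    for c, x in zip(bits, point):
        out = out * (x if c else (1 - x)) % Q
    return out

def mle(M, p1, p2):
    # Multilinear extension of the matrix M evaluated at (p1, p2).
    total, n = 0, 1 << B
    for i in range(n):
        for j in range(n):
            ib = [(i >> (B - 1 - t)) & 1 for t in range(B)]
            jb = [(j >> (B - 1 - t)) & 1 for t in range(B)]
            total = (total + M[i][j] * chi(ib, p1) % Q * chi(jb, p2)) % Q
    return total

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) % Q
             for j in range(n)] for i in range(n)]

def mle_sq(p1, p2):
    # tilde{M^2}(p1, p2), computed from M alone via the Lemma 5 identity.
    return sum(mle(M, p1, list(u)) * mle(M, list(u), p2)
               for u in itertools.product((0, 1), repeat=B)) % Q

rng = random.Random(3)
n = 1 << B
M = [[rng.randrange(Q) for _ in range(n)] for _ in range(n)]
M4 = matmul(matmul(M, M), matmul(M, M))
r1 = [rng.randrange(Q) for _ in range(B)]
r2 = [rng.randrange(Q) for _ in range(B)]

claim = mle(M4, r1, r2)   # the claim about the output matrix
check = sum(mle_sq(r1, list(p)) * mle_sq(list(p), r2)
            for p in itertools.product((0, 1), repeat=B)) % Q
assert claim == check     # verified without ever materializing M^2
```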
We believe our results substantially advance the goal of achieving a truly practical general-purpose implementation of interactive proofs. The O(log S(n)) factor overhead in the runtime of the prover within prior implementations of the GKR protocol is too steep a price to pay in practice, and our refinements (formalized in Theorem 1) remove this logarithmic factor overhead for circuits with regular wiring patterns. Our experiments demonstrate that this protocol yields a prover that is less than 10x slower than a C++ program that simply evaluates the circuit, and that our protocols are highly amenable to parallelization. Exploiting similar ideas, we have also extended the reach of prior interactive proof protocols by describing an efficient protocol (formalized in Theorem 2) for general data parallel computation, and given a protocol for matrix multiplication in which the prover's overhead (relative to any unverifiable algorithm) is just a low-order additive term. The latter is a powerful primitive for verifying the many algorithms that repeatedly invoke matrix multiplication. A major message of our results is that the more structure that exists in a computation, the more efficiently it can be verified, and that this structure exists in many real-world computations.

We believe two directions in particular are worthy of future work. The first direction is to build a full-fledged system implementing our protocol for data parallel computation. Our vision is to combine our protocol with a high-level programming language allowing the programmer to easily specify data parallel computations, analogous to frameworks such as MapReduce. Any such program could be automatically compiled in the manner of Vu et al. [40] into a circuit, and our protocol could be run automatically on that circuit. The second direction is to further enable such a compiler to automatically take advantage of our other refinements, which are targeted at computations that are not necessarily data parallel.
These refinements apply to a circuit on a layer-by-layer basis, so they may yield substantial speedups in practice even if they apply only to a subset of the layers of a circuit.

Acknowledgements.
The author is grateful to Frank McSherry for raising the question of outsourcinggeneral data parallel computations, and to Michael Mitzenmacher and Graham Cormode for discussionsand feedback that greatly improved the quality of this manuscript.
References [1] S. Arora and B. Barak.
Computational Complexity: A Modern Approach. Cambridge University Press, 2009.
[2] S. Benabbas, R. Gennaro, and Y. Vahlis. Verifiable delegation of computation over large datasets. In CRYPTO, pages 111-131, 2011.
[3] E. Ben-Sasson, A. Chiesa, D. Genkin, and E. Tromer. Fast reductions from RAMs to delegatable succinct constraint satisfaction problems. In ITCS, pages 401-414, 2013.
[4] E. Ben-Sasson, A. Chiesa, D. Genkin, and E. Tromer. On the concrete-efficiency threshold of probabilistically-checkable proofs. In STOC, 2013.
[5] D. Boneh and D. Freeman. Homomorphic signatures for polynomial functions. In EUROCRYPT, pages 149-168, 2011.
[6] N. Bitansky, R. Canetti, A. Chiesa, and E. Tromer. From extractable collision resistance to succinct non-interactive arguments of knowledge, and back again. In ITCS, pages 326-349, 2012.
[7] N. Bitansky, R. Canetti, A. Chiesa, and E. Tromer. Recursive composition and bootstrapping for SNARKs and proof-carrying data. In STOC, 2013.
[8] N. Bitansky and A. Chiesa. Succinct arguments from multi-prover interactive proofs and their efficiency benefits. In CRYPTO, pages 255-272, 2012.
[9] N. Bitansky, A. Chiesa, Y. Ishai, R. Ostrovsky, and O. Paneth. Succinct non-interactive arguments via linear interactive proofs. In TCC, pages 315-333, 2013.
[10] K.-M. Chung, Y. Tauman Kalai, F.-H. Liu, and R. Raz. Memory delegation. In CRYPTO, pages 151-168, 2011.
[11] K.-M. Chung, Y. Tauman Kalai, and S. P. Vadhan. Improved delegation of computation using fully homomorphic encryption. In CRYPTO, pages 483-501, 2010.
[12] A. Chakrabarti, G. Cormode, A. McGregor, and J. Thaler. Annotations in data streams.
Electronic Colloquium on Computational Complexity (ECCC), 19:22, 2012. A preliminary version of this paper by A. Chakrabarti, G. Cormode, and A. McGregor appeared in ICALP, 2009.
[13] G. Cormode, M. Mitzenmacher, and J. Thaler. Streaming graph computations with a helpful advisor. Algorithmica, 65(2):409-442, 2013.
[14] G. Cormode, M. Mitzenmacher, and J. Thaler. Practical verified computation with streaming interactive proofs. In
ITCS, pages 90-112, 2012.
[15] G. Cormode, J. Thaler, and K. Yi. Verifying computations with streaming interactive proofs. PVLDB, 5(1):25-36, 2011.
[16] D. Fiore and R. Gennaro. Publicly verifiable delegation of large polynomials and matrix computations, with applications. In CCS, pages 501-512, 2012.
[17] R. Freivalds. Fast probabilistic algorithms. In MFCS, pages 57-69, 1979.
[18] R. Gennaro, C. Gentry, and B. Parno. Non-interactive verifiable computing: outsourcing computation to untrusted workers. In CRYPTO, pages 465-482, 2010.
[19] R. Gennaro, C. Gentry, B. Parno, and M. Raykova. Quadratic span programs and succinct NIZKs without PCPs. In EUROCRYPT, pages 626-645, 2013.
[20] C. Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University, 2009.
[21] S. Goldwasser, Y. T. Kalai, and G. N. Rothblum. Delegating computation: interactive proofs for muggles. In STOC, pages 113-122, 2008.
[22] J. Groth. Short pairing-based non-interactive zero-knowledge arguments. In ASIACRYPT, pages 321-340, 2010.
[23] T. Gur and R. Raz. Arthur-Merlin streaming complexity. In ICALP (1), 2013.
[24] J. Hoberock and N. Bell. Thrust: a parallel template library, 2011. Version 1.3.0.
[25] Y. Ishai, E. Kushilevitz, and R. Ostrovsky. Efficient arguments without short PCPs. In
CCC, pages 278-291, 2007.
[26] T. Kimbrel and R. K. Sinha. A probabilistic algorithm for verifying matrix products using O(n^2) time and log_2 n + O(1) random bits. Inf. Process. Lett., 45(2):107-110, 1993.
[27] H. Klauck and V. Prakash. Streaming computations with a loquacious prover. In ITCS, pages 305-320, 2013.
[28] H. Lipmaa. Progression-free sets and sublinear pairing-based non-interactive zero-knowledge arguments. In
TCC, pages 169-189, 2012.
[29] C. Lund, L. Fortnow, H. Karloff, and N. Nisan. Algebraic methods for interactive proof systems. J. ACM, 39:859-868, 1992.
[30] B. Parno, C. Gentry, J. Howell, and M. Raykova. Pinocchio: nearly practical verifiable computation. In IEEE Symposium on Security and Privacy (Oakland), 2013.
[31] G. Rothblum. Delegating computation reliably: paradigms and constructions. Ph.D. thesis. Available online at http://hdl.handle.net/1721.1/54637, 2009.
[32] J. Schwartz. Fast probabilistic algorithms for verification of polynomial identities. J. ACM, 27(4):701-717, 1980.
[33] R. Seidel. On the all-pairs-shortest-path problem in unweighted undirected graphs. JCSS, 51(3):400-403, 1995.
[34] S. Setty, R. McPherson, A. J. Blumberg, and M. Walfish. Making argument systems for outsourced computation practical (sometimes). In NDSS, 2012.
[35] S. Setty, V. Vu, N. Panpalia, B. Braun, A. J. Blumberg, and M. Walfish. Taking proof-based verified computation a few steps closer to practicality. In USENIX Security, 2012.
[36] S. Setty, B. Braun, V. Vu, A. J. Blumberg, B. Parno, and M. Walfish. Resolving the conflict between generality and plausibility in verified computation. In EuroSys, pages 71-84, 2013.
[37] A. Shamir. IP = PSPACE. J. ACM, 39:869-877, October 1992.
[38] J. Thaler, M. Roberts, M. Mitzenmacher, and H. Pfister. Verifiable computation with massively parallel interactive proofs. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2012.
[39] J. Thaler. Source code for time-optimal interactive proofs for circuit evaluation. Available online at http://people.seas.harvard.edu/~jthaler/Tcode.htm.
[40] V. Vu, S. Setty, A. J. Blumberg, and M. Walfish. A hybrid architecture for interactive verifiable computation. In IEEE Symposium on Security and Privacy (Oakland), May 2013. Preprint, November 2012.
[41] V. Vu, S. Setty, A. J. Blumberg, and M. Walfish. Personal communication, January 2013.
[42] R. Yuster. Computing the diameter polynomially faster than APSP. CoRR, abs/1011.6181, 2010.
A Proof of Theorem 1
Proof:
Consider layer i of the circuit C. Since in_1^(i) and in_2^(i) are regular, there is a subset of input bits S_i ⊆ [s_i] with |S_i| = c_i for some constant c_i such that each input bit in [s_i] \ S_i affects O(1) of the output bits of in_1^(i) and in_2^(i). Number the input variables so that the numbers {1, . . . , c_i} correspond to variables in S_i. Let ρ ∈ {0,1}^{c_i} be an assignment to the variables in S_i, and let I_ρ : {0,1}^{s_i} → {0,1} denote the indicator function for ρ. For example, if c_i = 3 and ρ = (1, 0, 1), then I_ρ(x) = 1 if and only if x_1 = 1, x_2 = 0, and x_3 = 1, and I_ρ(x) = 0 otherwise. Let ˜I_ρ denote the multilinear extension of I_ρ. In the previous example, ˜I_ρ = x_1(1 − x_2)x_3. Finally, let in_1^(i),ρ and in_2^(i),ρ denote the functions in_1^(i) and in_2^(i) with the variables in S_i fixed to the assignment ρ, and for k ∈ {1, 2}, let b_{ρ,k,j} denote the jth output bit of in_k^(i),ρ.

By regularity, for each assignment ρ ∈ {0,1}^{c_i} to the variables in S_i, the jth output bit b_{ρ,k,j} of in_k^(i),ρ depends on only one variable x_{q(ρ,k,j)}, with q(ρ,k,j) ∈ [s_i] \ S_i, for some function q(ρ,k,j). Let ˜b_{ρ,k,j}(x_{q(ρ,k,j)}) : F → F denote the multilinear extension of the function b_{ρ,k,j}(x_{q(ρ,k,j)}) : {0,1} → {0,1}. If b_{ρ,k,j} is not identically 0 or identically 1, then either ˜b_{ρ,k,j}(x_{q(ρ,k,j)}) = x_{q(ρ,k,j)} or ˜b_{ρ,k,j} = 1 − x_{q(ρ,k,j)}.

For any ρ ∈ {0,1}^{c_i}, define ˜in_1^(i),ρ to be the concatenation of the ˜b_{ρ,1,j} functions for all j ∈ [s_{i+1}]. Under this definition, ˜in_1^(i),ρ is a collection of s_{i+1} linear polynomials, where each of the polynomials depends on a single variable, and we may view ˜in_1^(i),ρ as a single function mapping F^{s_i} to F^{s_{i+1}}. We define ˜in_2^(i),ρ and ˜type_ρ^(i) analogously.

Now let

W^(i)(p) = ∑_{ρ ∈ {0,1}^{c_i}} ˜I_ρ(p) · ( ˜type_ρ^(i)(p) · ˜V_{i+1}(˜in_1^(i),ρ(p)) · ˜V_{i+1}(˜in_2^(i),ρ(p)) + (1 − ˜type_ρ^(i)(p)) · ( ˜V_{i+1}(˜in_1^(i),ρ(p)) + ˜V_{i+1}(˜in_2^(i),ρ(p)) ) ).

It is easily checked that for all p ∈ {0,1}^{s_i}, V_i(p) = W^(i)(p). Lemma 4 then implies that ˜V_i(z) = ∑_{p ∈ {0,1}^{s_i}} g_z^(i)(p), where g_z^(i)(p) = β_{s_i}(z, p) · W^(i)(p). Our protocol follows precisely the description of Section 5.1, with P and V applying the sum-check protocol to the polynomial g_z^(i) at iteration i.
Communication Costs and Costs to V. Notice that our polynomial g_z^(i)(p) = β_{s_i}(z, p) · W^(i)(p) has degree O(1) in each variable. Indeed, β_{s_i}(z, p) has degree 1 in each variable. Moreover, W^(i)(p) is a sum of polynomials that each have degree O(1) in each variable, and hence W^(i)(p) itself has degree O(1) in each variable. This latter fact can be seen by observing that for each assignment ρ ∈ {0,1}^{c_i} to the variables in S_i, it holds that ˜I_ρ(p), ˜type_ρ^(i)(p), ˜V_{i+1}(˜in_1^(i),ρ(p)), and ˜V_{i+1}(˜in_2^(i),ρ(p)) all have constant degree in each variable. That ˜V_{i+1}(˜in_1^(i),ρ(p)) and ˜V_{i+1}(˜in_2^(i),ρ(p)) have constant degree in each variable follows from the facts that ˜V_{i+1} is a multilinear polynomial, and that each input variable j ∈ [s_i] \ S_i affects at most a constant number of outputs of ˜in_1^(i),ρ and ˜in_2^(i),ρ by Property 1 of Definition 3. Since g_z^(i)(p) has degree O(1) in each variable, the claimed communication cost and the costs to the verifier follow immediately by summing the corresponding costs of the sum-check protocols over all iterations i ∈ {1, . . . , d(n)} (see Section 4.2).

Time Cost for P. It remains to demonstrate how P can compute her prescribed messages when applying the sum-check protocol to the polynomial g_z^(i) in time O(S_i + S_{i+1}). It will follow that P's runtime over all d(n) invocations of the sum-check protocol is O(∑_{i=1}^{d(n)} S_i) = O(S(n)). As in our analysis of Section 5.4, it suffices to show how P can quickly evaluate g_z^(i) at all points in S^(j), where S^(j) consists of all points of the form p = (r_1, . . . , r_{j−1}, t, p_{j+1}, . . . , p_{s_i}) with t ∈ {0, 1, . . . , deg_j(g_z^(i))} and (p_{j+1}, . . . , p_{s_i}) ∈ {0,1}^{s_i − j}.
As g_z^(i)(p) = β_{s_i}(z, p) · W^(i)(p), it suffices for P to evaluate β_{s_i}(z, ·) and W^(i)(·) at all such points p. The β_{s_i}(z, ·) computations can be done in O(S_i) total time across all iterations of the sum-check protocol, exactly as in Section 5.4.1.

To see how P can evaluate all of the W^(i)(p) values efficiently, notice that for any fixed point p ∈ F^{s_i}, W^(i)(p) can be computed efficiently given ˜type_ρ^(i)(p), ˜V_{i+1}(˜in_1^(i),ρ(p)), and ˜V_{i+1}(˜in_2^(i),ρ(p)) for all ρ ∈ {0,1}^{c_i}. As |S_i| = c_i = O(1), modulo a constant-factor blowup in runtime it suffices to explain how to perform these evaluations for a fixed restriction ρ ∈ {0,1}^{c_i} to the variables in S_i. It is easy to see that ˜type_ρ^(i)(p) can be evaluated in constant time, since this function depends on only one input variable. All that remains is to show how P can evaluate ˜V_{i+1}(˜in_1^(i),ρ(p)) quickly; the case of ˜V_{i+1}(˜in_2^(i),ρ(p)) is similar. To this end, we follow the approach of Section 5.4.2.

Pre-processing. P will begin by computing an array V^(0), which is simply defined to be the vector of gate values at layer i + 1. That is, identifying each gate label at layer i + 1 with its binary representation in {0,1}^{s_{i+1}}, P sets V^(0)[(j_1, . . . , j_{s_{i+1}})] = V_{i+1}(j_1, . . . , j_{s_{i+1}}) for each (j_1, . . . , j_{s_{i+1}}) ∈ {0,1}^{s_{i+1}}. The right hand side of this equation is simply the value of the jth gate at layer i + 1 of C. So P can fill in the array V^(0) when she evaluates the circuit C, before receiving any messages from V.

Overview of Online Processing.
Assume without loss of generality that the output bits of ˜in_1^(i),ρ(p) are labelled in increasing order of the input bits they are affected by. So, for example, if p_1 affects 2 output bits of ˜in_1^(i),ρ and p_2 affects 3 output bits, then the bits affected by p_1 are labelled 1 and 2, while the bits affected by p_2 are labelled 3, 4, and 5. In round j of the sum-check protocol, P needs to evaluate the polynomial ˜V_{i+1} at the O(2^{s_{i+1} − j}) points in the sets ˜in_1^(i),ρ(S^(j)) and ˜in_2^(i),ρ(S^(j)). P will do this with the help of intermediate arrays as follows.

Efficiently Constructing V^(j) Arrays.
Let a_{j−1} denote the total number of output bits affected by the first j − 1 input variables. Assume that P has computed in the previous round an array V^(j−1) of length 2^{s_{i+1} − a_{j−1}}, such that for each p = (p_{a_{j−1}+1}, . . . , p_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − a_{j−1}}, the pth entry of V^(j−1) equals

V^(j−1)[(p_{a_{j−1}+1}, . . . , p_{s_{i+1}})] = ∑_{(c_1, . . . , c_{a_{j−1}}) ∈ {0,1}^{a_{j−1}}} V_{i+1}(c_1, . . . , c_{a_{j−1}}, p_{a_{j−1}+1}, . . . , p_{s_{i+1}}) · ∏_{k=1}^{a_{j−1}} χ_{c_k}(˜b_{ρ,1,k}(r_{q(ρ,1,k)})),

where recall that q(ρ,1,k) is the input bit that output bit k of in_1^(i),ρ depends on. As the base case, we explained how P can fill in V^(0) in the process of evaluating the circuit C.

Let x_1, . . . , x_{s_i} denote the input variables to in_1^(i),ρ, and let b_1, . . . , b_{s_{i+1}} denote its output bits. Intuitively, at the end of round j of the sum-check protocol, P must "bind" input variable x_j to value r_j ∈ F. This has the effect of binding the output variables affected by x_j, since each such output variable depends only on x_j. For illustration, suppose the variable x_1 affects output variable b_1; specifically, suppose that b_1 = 1 − x_1. Then binding x_1 to value r_1 has the effect of binding b_1 to value 1 − r_1. V^(j) is obtained from V^(j−1) by taking this into account. We formalize this as follows.

Assume that variable x_j affects only one output variable b_{ρ,1,a_{j−1}+1}, and thus a_j = a_{j−1} + 1; if this is not the case, we can compute V^(j) by applying the following update once for each output variable affected by x_j. Observe that P can compute V^(j) given V^(j−1) in O(2^{s_{i+1} − a_{j−1}}) time using the following recurrence:

V^(j)[(p_{a_j+1}, . . . , p_{s_{i+1}})] = V^(j−1)[(0, p_{a_j+1}, . . . , p_{s_{i+1}})] · χ_0(˜b_{ρ,1,a_j}(r_j)) + V^(j−1)[(1, p_{a_j+1}, . . . , p_{s_{i+1}})] · χ_1(˜b_{ρ,1,a_j}(r_j)).

Thus, at the end of round j of the sum-check protocol, when V sends P the value r_j, P can compute V^(j) from V^(j−1) in O(2^{s_{i+1} − a_{j−1}}) time.

Using the V^(j) Arrays.
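Under the simplifying assumption that every ˜b_{ρ,1,k} is the identity map, so that each round binds exactly one output bit and the recurrence reduces to the weights χ_0(r_j) = 1 − r_j and χ_1(r_j) = r_j, the array update can be sketched in Python as follows (the helper names and the field modulus are our own choices for illustration):

```python
import random

Q = 2**61 - 1  # an illustrative prime field modulus

def bind_first_variable(V, r):
    # One round's update: V'[p] = V[(0,p)]*chi_0(r) + V[(1,p)]*chi_1(r),
    # with chi_0(r) = 1 - r and chi_1(r) = r. The array halves in size,
    # so the total work over all rounds is O(len(V)).
    half = len(V) // 2
    return [(V[p] * (1 - r) + V[half + p] * r) % Q for p in range(half)]

def mle_direct(V, point):
    # Reference implementation: evaluate the multilinear extension of the
    # table V at `point` by the Lagrange formula (exponential-time check).
    total = 0
    for c in range(len(V)):
        bits = [(c >> (len(point) - 1 - k)) & 1 for k in range(len(point))]
        w = 1
        for bit, x in zip(bits, point):
            w = w * (x if bit else (1 - x)) % Q
        total = (total + V[c] * w) % Q
    return total

rng = random.Random(1)
b = 4
V = [rng.randrange(Q) for _ in range(2 ** b)]   # gate values at layer i+1
r = [rng.randrange(Q) for _ in range(b)]        # verifier's round challenges

arr = V
for rj in r:            # after all rounds, one entry remains
    arr = bind_first_variable(arr, rj)
assert len(arr) == 1 and arr[0] == mle_direct(V, r)
```

The assertion confirms that the halving updates compute exactly the multilinear extension of the original table at the verifier's random point.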
We now show how to use the array V^(j−1) to evaluate ˜V_{i+1}(˜in_1^(i),ρ(p)) in constant time for any point p of the form p = (r_1, . . . , r_{j−1}, t, p_{j+1}, . . . , p_{s_i}) with (p_{j+1}, . . . , p_{s_i}) ∈ {0,1}^{s_i − j}. In order to ease notation in the following derivation, we make the simplifying assumption that ˜b_{ρ,1,k}(x_{q(ρ,1,k)}) = x_{q(ρ,1,k)} for all output bits k ∈ [s_{i+1}]. The derivation when this assumption does not hold is similar. We exploit the following sequence of equalities:

˜V_{i+1}(˜in_1^(i),ρ(p))
= ∑_{c ∈ {0,1}^{s_{i+1}}} V_{i+1}(c) χ_c(˜in_1^(i),ρ(p))
= ∑_{(c_1, . . . , c_{a_{j−1}}) ∈ {0,1}^{a_{j−1}}} ∑_{(c_{a_{j−1}+1}, . . . , c_{s_{i+1}}) ∈ {0,1}^{s_{i+1} − a_{j−1}}} V_{i+1}(c) χ_c(˜in_1^(i),ρ(p))
= ∑_{(c_1, . . . , c_{a_{j−1}})} ∑_{(c_{a_{j−1}+1}, . . . , c_{s_{i+1}})} V_{i+1}(c) ( ∏_{k=1}^{a_{j−1}} χ_{c_k}(˜b_{ρ,1,k}(r_{q(ρ,1,k)})) ) ( ∏_{k=a_{j−1}+1}^{a_j} χ_{c_k}(˜b_{ρ,1,k}(t)) ) ( ∏_{k=a_j+1}^{s_{i+1}} χ_{c_k}(p_{q(ρ,1,k)}) )
= ∑_{(c_1, . . . , c_{a_j}) ∈ {0,1}^{a_j}} V_{i+1}(c_1, . . . , c_{a_j}, p_{q(ρ,1,a_j+1)}, . . . , p_{q(ρ,1,s_{i+1})}) ( ∏_{k=1}^{a_{j−1}} χ_{c_k}(r_k) ) · ( ∏_{k=a_{j−1}+1}^{a_j} χ_{c_k}(t) )
= ∑_{(p_{a_{j−1}+1}, . . . , p_{a_j}) ∈ {0,1}^{a_j − a_{j−1}}} V^(j−1)[(p_{a_{j−1}+1}, . . . , p_{a_j}, p_{q(ρ,1,a_j+1)}, . . . , p_{q(ρ,1,s_{i+1})})] · ∏_{k=a_{j−1}+1}^{a_j} χ_{p_k}(t).

Here, the first equality holds by Equation (8). The third holds by the definition of the functions χ_c and ˜in_1^(i),ρ, as well as the assumption that ˜b_{ρ,1,k}(x_{q(ρ,1,k)}) = x_{q(ρ,1,k)} for all k ∈ [s_{i+1}]. The fourth holds because, for Boolean values c_k, p_{q(ρ,1,k)} ∈ {0,1}, χ_{c_k}(p_{q(ρ,1,k)}) = 1 if c_k = p_{q(ρ,1,k)}, and χ_{c_k}(p_{q(ρ,1,k)}) = 0 otherwise. The final equality holds by the definition of V^(j−1). The final expression above can be computed in O(2^{a_j − a_{j−1}}) time given the array V^(j−1). Since a_j − a_{j−1} is constant by Property 1 of Definition 3, O(2^{a_j − a_{j−1}}) = O(1).
In round j of the sum-check protocol, P uses the array V^(j−1) to evaluate ˜V_{i+1}(˜in_1^(i),ρ(p)) for all O(2^{s_i − j}) points p ∈ S^(j), which requires constant time per point and hence O(2^{s_i − j}) time over all points in S^(j). At the end of round j, V sends P the value r_j, and P computes V^(j) from V^(j−1) in O(2^{s_{i+1} − a_{j−1}}) time. By ordering the input variables in such a way that a_j > a_{j−1} for all j, we ensure that, in total across all rounds of the sum-check protocol, P spends O(∑_{j=1}^{s_i} (2^{s_i − j} + 2^{s_{i+1} − j})) = O(2^{s_i} + 2^{s_{i+1}}) time to evaluate ˜V_{i+1} at the relevant points. When combined with our O(2^{s_i})-time algorithm for computing all the relevant β_{s_i}(z, p) values, we see that P takes O(2^{s_i} + 2^{s_{i+1}}) = O(S_i + S_{i+1}) time to run the entire sum-check protocol at iteration i of our circuit-checking protocol.

Reducing to Verification of a Single Point.
After executing the sum-check protocol at layer i as described above, V is left with claims about ˜V_{i+1}(ω_1) and ˜V_{i+1}(ω_2) for two points ω_1, ω_2 ∈ F^{s_{i+1}}. If i is a layer for which in_1^(i) and in_2^(i) are similar (see Definition 4), we run the reducing to verification of a single point phase exactly as in the basic GKR protocol. This requires P to send ˜V_{i+1}(ℓ(t)) for a canonical line ℓ(t) that passes through the points ω_1 and ω_2. Because in_1^(i) and in_2^(i) are similar, it is easily seen that ˜V_{i+1}(ℓ(t)) is a univariate polynomial of constant degree. Hence P can specify ˜V_{i+1}(ℓ(t)) by sending ˜V_{i+1}(ℓ(t_j)) for O(1) many points t_j ∈ F. Using the method of Lemma 3, P can evaluate ˜V_{i+1} at each point ℓ(t_j) in O(S_{i+1}) time, and hence can perform all ˜V_{i+1}(ℓ(t_j)) evaluations in O(S_{i+1}) time in total.

Let c = O(1) be the number of layers i for which in_1^(i) and in_2^(i) are not similar. At each such layer i, we skip the "reducing to verification at a single point" phase of the protocol. Each time we do this, it doubles the number of points ω ∈ F^{s_{i+1}} that must be considered at the next iteration. However, we only skip the "reducing to verification at a single point" phase c times, and thus at all layers i of the circuit, V needs to check ˜V_i(ω_j) for at most 2^c = O(1) points. This affects P's and V's runtimes by at most a 2^c = O(1) factor, and the O(S(n)) time bound for P and the O(n log n + d(n) log S(n)) time bound for V follow.

B Analysis for Pattern Matching
Let C be the circuit for pattern matching described in Section 5.5.1. Our goal in this appendix is to handle the layer of the circuit adjacent to the input layer. Call this layer ℓ. Layer ℓ computes t_{i+k} − p_k for each pair (i, k) ∈ [[n]] × [[m]]. We want to show how to use a sum-check protocol to reduce a claim about the value of ˜V_ℓ(z) for some z ∈ F^{s_ℓ} to a claim about ˜V_{ℓ+1}(r) for some r ∈ F^{s_{ℓ+1}}, while ensuring that P runs in time O(S_ℓ) = O(nm).

The idea underlying our analysis here is the following. The reason Theorem 1 does not apply to layer ℓ is that the first in-neighbor of a gate with label p = (i_1, . . . , i_{log n}, k_1, . . . , k_{log m}) ∈ {0,1}^{log n + log m} has label equal to the binary representation of the integer i + k, and a single bit i_j can affect many bits in the binary representation of i + k (likewise, each bit in the binary representation of i + k may be affected by many bits in the binary representations of i and k). In order to ensure that each bit of p affects only a single bit of y = in_1^(ℓ)(p), we introduce log n dummy variables (c_1, . . . , c_{log n}) and force the jth dummy variable c_j to have value equal to the jth carry bit when adding the numbers i and k in binary. Now each bit of p affects only one output bit, and each output bit y_j is affected by at most three "input bits": i_j, k_j, and c_j if j ≤ log m, and just i_j and c_j if j > log m.

To this end, let φ : {0,1}^4 → {0,1} be the function that evaluates to 1 on input (i_1, k_1, c_0, c_1) if and only if either c_1 = 0 and i_1 + k_1 + c_0 < 2, or c_1 = 1 and i_1 + k_1 + c_0 ≥ 2. That is, φ outputs 1 if and only if c_1 is equal to the carry bit when adding i_1, k_1, and c_0. Let ˜φ be the multilinear extension of φ. Notice ˜φ can be evaluated at any point r ∈ F^4 in O(1) time.

Now let (i, k, c) denote a vector in F^{log n} × F^{log m} × F^{log n}, and define

Φ(i, k, c) := ∏_{j=1}^{log n} ˜φ(i_j, k_j, c_{j−1}, c_j),

where it is understood that c_0 = 0 and k_j = 0 for j > log m. For any Boolean vector (i, k, c) ∈ {0,1}^{log n} × {0,1}^{log m} × {0,1}^{log n}, it is easily verified that Φ(i, k, c) = 1 if and only if, for all j, c_j equals the jth carry bit when adding the numbers i and k in binary.

Finally, let γ : {0,1}^3 → {0,1} be the function that evaluates to 1 on input (i_1, k_1, c_0) if and only if i_1 + k_1 + c_0 is odd, i.e., γ outputs the corresponding sum bit when adding i and k in binary. Let ˜γ be the multilinear extension of γ. Notice ˜γ can be evaluated at any point r ∈ F^3 in O(1) time.

Now consider the following (log n + log m)-variate polynomial over the field F:

W^(ℓ)(i, k) = ∑_{(c_1, . . . , c_{log n}) ∈ {0,1}^{log n}} Φ(i, k, c) · ( ˜T(˜γ(i_1, k_1, c_0), . . . , ˜γ(i_{log n}, k_{log n}, c_{log n − 1})) − ˜P(k_1, . . . , k_{log m}) ),

where again it is understood that c_0 = 0 and k_j = 0 for j > log m. Here, ˜T is the multilinear extension of the input text T, viewed as a function from {0,1}^{log n} to [n], and ˜P is the multilinear extension of the input pattern P, viewed as a function from {0,1}^{log m} to [n].

It can be seen that for all Boolean vectors (i, k) ∈ {0,1}^{log n} × {0,1}^{log m}, W^(ℓ)(i, k) = V_ℓ(i, k). This is because for any (i, k) ∈ {0,1}^{log n} × {0,1}^{log m}, Φ(i, k, c) will be zero for all c except the c consisting of the correct carry bits for i and k, and for this input c, ˜T(˜γ(i_1, k_1, c_0), . . . , ˜γ(i_{log n}, k_{log n}, c_{log n − 1})) will equal T(i + k) when interpreting i and k as integers in the natural way.

Lemma 4 then implies that for all z ∈ F^{log n + log m},

˜V_ℓ(z) = ∑_{(i, k) ∈ {0,1}^{log n} × {0,1}^{log m}} β_{log n + log m}(z, (i, k)) · W^(ℓ)(i, k)
= ∑_{(i, k, c) ∈ {0,1}^{log n} × {0,1}^{log m} × {0,1}^{log n}} β_{log n + log m}(z, (i, k)) · Φ(i, k, c) · ( ˜T(˜γ(i_1, k_1, c_0), . . . , ˜γ(i_{log n}, k_{log n}, c_{log n − 1})) − ˜P(k_1, . . . , k_{log m}) ).

Therefore, in order to reduce a claim about ˜V_ℓ(z) to a claim about ˜T(r_1) and ˜P(r_2) for random vectors r_1 ∈ F^{log n} and r_2 ∈ F^{log m}, it suffices to apply the sum-check protocol to the (2 log n + log m)-variate polynomial

g_z(i, k, c) = β_{log n + log m}(z, (i, k)) · Φ(i, k, c) · ( ˜T(˜γ(i_1, k_1, c_0), . . . , ˜γ(i_{log n}, k_{log n}, c_{log n − 1})) − ˜P(k_1, . . . , k_{log m}) ).

It remains to show how to extend the techniques underlying Theorem 1 to allow P to compute all of the required messages in this sum-check protocol in O(nm) time. For brevity, we restrict ourselves to a sketch of the techniques.

The first obvious complication is that the sum defining P's message in a given round of the sum-check protocol has as many as 2^{2 log n + log m} = n^2 m ≫ nm terms. Fortunately, the Φ polynomial ensures that almost all of these terms are zero: when considering any Boolean setting of the variables i_j, k_j, and c_{j−1}, the only setting of c_j that P must consider is the one corresponding to the carry bit of i_j + k_j + c_{j−1}, i.e., the unique setting of c_j such that φ(i_j, k_j, c_{j−1}, c_j) = 1. This ensures that at rounds 3j, 3j + 1, and 3j + 2 of the sum-check protocol applied to g_z, P must only evaluate g_z at O(2^{log n + log m − j}) points, a quantity which falls geometrically with j.

We now turn to explaining how P can evaluate g_z at all necessary points in rounds 3j, 3j + 1, and 3j + 2 in time O(2^{log n + log m − j}). To accomplish this, it is sufficient for P to evaluate β_{log n + log m} at the necessary points, as well as Φ, ˜T, and ˜P at the necessary points. The β_{log n + log m} evaluations are handled exactly as in Theorem 1, i.e., by using C^(j) arrays (but these arrays only get updated each time a variable i_j or k_j gets bound within the sum-check protocol; no update is necessary when a variable c_j gets bound). The ˜P evaluations are also handled exactly as in Theorem 1, using V^(j) arrays that only need to be updated when a variable k_j gets bound.

The ˜T evaluations require some additional explanation on top of the analysis of Theorem 1. We want P to be able to use V^(j) arrays as in Theorem 1 to evaluate ˜T at the necessary points in constant time per point, but we need to make sure that P can compute the array V^(j) from V^(j−1) in time that falls geometrically with j. In order to do this, it is essential to choose a specific ordering for the sum in the sum-check protocol. Specifically, we write the sum as:

∑_{i_1} ∑_{k_1} ∑_{c_1} ∑_{i_2} ∑_{k_2} ∑_{c_2} · · · ∑_{i_{log n}} ∑_{c_{log n}} g_z(i, k, c).

This ensures that, e.g., (i_1, k_1, c_1) are the first three variables in the sum-check protocol to become bound to random values in F. The reason we must do this is so that every 3 rounds, another value ˜γ(i_j, k_j, c_{j−1}) feeding into ˜T becomes bound to a specific value (and moreover the outputs of ˜γ(i_{j′}, k_{j′}, c_{j′−1}) are unaffected by the bound variables for all j′ > j). This is precisely the property we exploited in the protocol of Theorem 1 to ensure that the V^(j) arrays there halved in size every round, and that V^(j) could be computed from V^(j−1) in time proportional to its size. So we can use V^(j) arrays to efficiently perform the ˜T evaluations, updating the arrays every time another value ˜γ(i_j, k_j, c_{j−1}) feeding into ˜T becomes bound to a specific value.

Finally, the Φ evaluations can be handled as follows. Consider for simplicity round 3j of the protocol. Recall that P only needs to evaluate Φ at points for which φ(i_{j′}, k_{j′}, c_{j′−1}, c_{j′}) = 1 for all j′ > j. Thus, for all j′ > j, the j′th factor does not affect the product defining Φ. So in order to evaluate Φ at the relevant points, it suffices for P to evaluate the factors for j′ ≤ j. Now at round 3j of the protocol, all triples (i_{j′}, k_{j′}, c_{j′}) for j′ < j are already bound, say to the values (r_{j′}^(i), r_{j′}^(k), r_{j′}^(c)), and hence all the factors for j′ < j are themselves already bound to specific values. So in order to quickly determine their contribution to the product defining Φ, it suffices for P to maintain the running product of the bound factors over the course of the protocol, which takes just O(log n) time in total. Finally, the contribution of the jth factor to the product defining Φ can be computed in constant time per point. This completes the sketch of how Φ can be evaluated by P at all of the necessary points in O(nm) total time.
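The Boolean behavior of φ, γ, and Φ described above can be checked directly. The following Python sketch works over the integers for Boolean inputs only (the protocol, of course, works with the multilinear extensions ˜φ and ˜γ over F, and the helper names here are ours); it verifies that for random i and k exactly one carry vector c makes Φ nonzero, and that the corresponding γ sum bits spell out i + k.

```python
from itertools import product
from random import Random

def phi(i1, k1, c0, c1):
    # 1 iff c1 is the carry-out when adding bits i1 + k1 with carry-in c0
    return 1 if c1 == (i1 + k1 + c0) // 2 else 0

def gamma(i1, k1, c0):
    # the sum bit: 1 iff i1 + k1 + c0 is odd
    return (i1 + k1 + c0) % 2

def Phi(i_bits, k_bits, c_bits):
    # Product of phi over all positions (bit 0 = least significant);
    # nonzero only when c_bits is exactly the carry sequence.
    out, carry_in = 1, 0
    for j in range(len(i_bits)):
        out *= phi(i_bits[j], k_bits[j], carry_in, c_bits[j])
        carry_in = c_bits[j]
    return out

rng = Random(2)
bits = 4
i = rng.randrange(2 ** (bits - 1))
k = rng.randrange(2 ** (bits - 1))
i_bits = [(i >> j) & 1 for j in range(bits)]
k_bits = [(k >> j) & 1 for j in range(bits)]

# Exactly one carry vector c makes Phi nonzero...
good = [c for c in product((0, 1), repeat=bits)
        if Phi(i_bits, k_bits, list(c)) == 1]
assert len(good) == 1
c_bits = list(good[0])

# ...and the gamma sum bits of that vector spell out i + k in binary.
carries = [0] + c_bits[:-1]
sum_bits = [gamma(i_bits[j], k_bits[j], carries[j]) for j in range(bits)]
assert sum(s << j for j, s in enumerate(sum_bits)) == i + k
```

This uniqueness is exactly what makes the number of nonzero terms in P's sums fall geometrically as the sum-check protocol proceeds.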