A Unified Evaluation of Two-Candidate Ballot-Polling Election Auditing Methods
Zhuoqun Huang, Ronald L. Rivest, Philip B. Stark, Vanessa Teague, Damjan Vukcevic
School of Mathematics and Statistics, University of Melbourne, Parkville, Australia
Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA
Department of Statistics, University of California, Berkeley, USA
Thinking Cybersecurity Pty. Ltd.
College of Engineering and Computer Science, Australian National University
Melbourne Integrative Genomics, University of Melbourne, Parkville, Australia
[email protected]
Abstract.
Counting votes is complex and error-prone. Several statistical methods have been developed to assess election accuracy by manually inspecting randomly selected physical ballots. Two 'principled' methods are risk-limiting audits (RLAs) and Bayesian audits (BAs). RLAs use frequentist statistical inference while BAs are based on Bayesian inference. Until recently, the two have been thought of as fundamentally different. We present results that unify and shed light upon 'ballot-polling' RLAs and BAs (which only require the ability to sample uniformly at random from all cast ballot cards) for two-candidate plurality contests, which are the building blocks for auditing more complex social choice functions, including some preferential voting systems. We highlight the connections between the methods and explore their performance.
First, building on a previous demonstration of the mathematical equivalence of classical and Bayesian approaches, we show that BAs, suitably calibrated, are risk-limiting. Second, we compare the efficiency of the methods across a wide range of contest sizes and margins, focusing on the distribution of sample sizes required to attain a given risk limit. Third, we outline several ways to improve performance and show how the mathematical equivalence explains the improvements.
Keywords:
Statistical audit · Risk-limiting · Bayesian
Even if voters verify their ballots and the ballots are kept secure, the counting process is prone to errors from malfunction, human error, and malicious intervention. For this reason, the US National Academy of Sciences [4] and the American Statistical Association (amstat.org/asa/files/pdfs/POL-ASARecommendsRisk-LimitingAudits.pdf) have recommended the use of risk-limiting audits to check reported election outcomes.
The simplest audit is a manual recount, which is usually expensive and time-consuming. An alternative is to examine a random sample of the ballots and test the result statistically. Unless the margin is narrow, a sample far smaller than the whole election may suffice. For more efficiency, sampling can be done adaptively: stop when there is strong evidence supporting the reported outcome [7]. Risk-limiting audits (RLAs) have become the audit method recommended for use in the USA. Pilot RLAs have been conducted for more than 50 elections in 14 US states and Denmark since 2008. Some early pilots are discussed in a report from the California Secretary of State to the US Election Assistance Commission. In 2017, the state of Colorado became the first to complete a statewide RLA. The defining feature of RLAs is that, if the reported outcome is incorrect, they have a large, pre-specified minimum probability of discovering this and correcting the outcome. Conversely, if the reported outcome is correct, then they will eventually certify the result. This might require only a small random sample, but the audit may lead to a complete manual tabulation of the votes if the result is very close or if tabulation error was an appreciable fraction of the margin.
RLAs exploit frequentist statistical hypothesis testing. There are by now more than half a dozen different approaches to conducting RLAs [8]. Election audits can also be based on Bayesian inference [6].
With so many methods, it may be hard to understand how they relate to each other, which perform better, which are risk-limiting, etc. Here, we review and compare the statistical properties of existing methods in the simplest case: a two-candidate, first-past-the-post contest with no invalid ballots. This allows us to survey a wide range of methods and more clearly describe the connections and differences between them. Most real elections have more than two candidates, of course.
However, the methods designed for this simple context are often adapted for more complex elections by reducing them to pairwise contests (see below for further discussion of this point). Therefore, while we only explore a simple scenario, it sheds light on how the various approaches compare, which may inform future developments in more complex scenarios. There are many other aspects of auditing that matter greatly in practice; we do not attempt to cover all of these, but we comment on some below.
For two-candidate, no-invalid-vote contests, we explain the connections and differences among many audit methods, including frequentist and Bayesian approaches. We evaluate their efficiency across a range of election sizes and margins. We also explore some natural extensions and variations of the methods. We ensure that the comparisons are 'fair' by numerically calibrating each method to attain a specified risk limit.
We focus on ballot-polling audits, which involve selecting ballots at random from the pool of cast ballots. Each sampled ballot is interpreted manually; those interpretations comprise the audit data. (Ballot-polling audits do not rely on the voting system's interpretation of ballots, in contrast to comparison audits.)
(The early-pilot report mentioned above is available at https://votingsystems.cdn.sos.ca.gov/oversight/risk-pilot/final-report-073014.pdf.)
Paper outline:
Section 2 provides context and notation. Section 3 sketches the auditing methods we consider and points out the relationships among them and to other statistical methods. Section 4 explains how we evaluate these methods. Our benchmarking experiments are reported in Section 5. We finish with a discussion and suggestions for future work in Section 6.
We consider contests between two candidates, where each voter votes for exactly one candidate. The candidate who receives more votes wins. Ties are possible if the number of ballots is even.
Real elections may have invalid votes, for example, ballots marked in favour of both candidates or neither; for multipage ballots, not every ballot paper contains every contest. Here we assume every ballot has a valid vote for one of the two candidates. See Section 6.
Most elections have more than two candidates and can involve complex algorithms ('social choice functions') for determining who won. A common tactic for auditing these is to reduce them to a set of pairwise contests such that certifying all of the contests suffices to confirm the reported outcome [3,1,8]. These contests can be audited simultaneously using methods designed for two candidates that can accommodate invalid ballots, which most of the methods considered below do. Therefore, the methods we evaluate form the building blocks for many of the more complex methods, so our results are more widely relevant.
We do not consider stratified audits, which account for ballots cast across different locations or by different voting methods within the same election.
We use the terms 'ballot' and 'ballot card' interchangeably, even though typical ballots in the US consist of more than one card (and the distinction does matter for workload and for auditing methods). We consider unweighted ballot-polling audits, which require only the ability to sample uniformly at random from all ballot cards.
The sampling is typically sequential. We draw an initial sample and assess the evidence for or against the reported outcome. If there is sufficient evidence that the reported outcome is correct, we stop and 'certify' the winner. Otherwise, we inspect more ballots and try again, possibly continuing to a full manual tabulation. At any time, the auditor can choose to conduct a full hand count rather than continue to sample at random. That might occur if the work of continuing the audit is anticipated to be higher than that of a full hand count or if the audit data suggest that the reported outcome is wrong. One reasonable rule is to set a maximum sample size (number of draws, not necessarily the number of distinct ballots) for the audit; if the sample reaches that size but the
outcome has not been confirmed, there is a full manual tabulation. The outcome according to that manual tabulation becomes official.
There are many choices to be made, including:
How to assess evidence.
Each stage involves calculating a statistic from the sample. What statistic do we use? This is one key difference amongst auditing methods; see Section 3.
Threshold for evidence.
The decision of whether to certify or keep sampling is made by comparing the statistic to a reference value. Often the value is chosen such that it limits the probability of certifying the outcome if the outcome is wrong, i.e. limits the risk (see below).
Sampling with or without replacement.
Sampling may be done with or without replacement. Sampling without replacement is more efficient; sampling with replacement often yields simpler mathematics. The difference in efficiency is small unless a substantial fraction (e.g. 20% or more) of the ballots are sampled.
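The size of this difference can be checked directly by comparing the two sampling distributions of the number of winner votes in the sample. The following is a minimal sketch with illustrative numbers, not figures taken from the paper:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k winner votes in n draws) when sampling with replacement."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeom_pmf(k, n, N, T):
    """P(k winner votes in n draws) when sampling without replacement
    from N ballots of which T are for the winner."""
    return comb(T, k) * comb(N - T, n - k) / comb(N, n)

# Illustrative contest: 20,000 ballots, 55% for the winner, 500 draws (2.5%).
N, T, n = 20_000, 11_000, 500
diff = max(abs(binom_pmf(k, n, T / N) - hypergeom_pmf(k, n, N, T))
           for k in range(n + 1))
# With only 2.5% of the ballots sampled, the two pmfs agree closely.
```

As the text notes, the gap only becomes appreciable once a substantial fraction of the ballots is sampled (try increasing `n` above).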
Sampling increments.
By how much do we increase the sample size if the current sample does not confirm the outcome? We could enlarge the sample one ballot at a time, but it is usually more efficient to have larger 'rounds'. The methods described here can accommodate rounds of any size.
We assume that the auditors read votes correctly, which generally requires retrieving the correct ballots and correctly applying legal rules for interpreting voters' marks.
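The sequential procedure sketched above can be written in a few lines. This is a toy illustration only: the 'lead' statistic and the threshold below are placeholders, not one of the methods evaluated in the paper.

```python
import random

def ballot_polling_audit(ballots, statistic, threshold, max_draws, increment=1):
    """Generic sequential ballot-polling audit (sampling with replacement).

    After each round of `increment` draws, compute the statistic on the
    sample so far; certify if it exceeds `threshold`, otherwise keep
    sampling up to `max_draws`, then fall back to a full hand count.
    """
    sample = []
    while len(sample) < max_draws:
        for _ in range(increment):                 # one sampling round
            sample.append(random.choice(ballots))  # uniform random draw
        if statistic(sample) > threshold:
            return "certify", len(sample)
    return "full hand count", len(sample)

# Toy statistic: the reported winner's lead within the sample.
def lead(sample):
    return 2 * sum(sample) - len(sample)

random.seed(1)
toy_ballots = [1] * 700 + [0] * 300   # a wide-margin toy contest
outcome, draws = ballot_polling_audit(toy_ballots, lead, threshold=50,
                                      max_draws=2000)
```

The methods in Section 3 differ essentially in what `statistic` computes and how `threshold` is chosen.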
Let X_1, X_2, … ∈ {0, 1} denote the sampled ballots, with X_i = 1 representing a vote in favour of the reported winner and X_i = 0 a vote for the reported loser. Let n denote the number of (not necessarily distinct) ballots sampled at a given point in the audit, m the maximum sample size (i.e. number of draws) for the audit, and N the total number of cast ballots. We necessarily have n ≤ m, and if sampling without replacement we also have m ≤ N.
Each audit method summarizes the evidence in the sample using a statistic of the form S_n(X_1, X_2, …, X_n, n, m, N). For brevity, we suppress n, m and N in the notation.
Let Y_n = Σ_{i=1}^{n} X_i be the number of sampled ballots that are in favour of the reported winner. Since the ballots are by assumption exchangeable, the statistics used by most methods can be written in terms of Y_n.
Let T be the true total number of votes for the winner and p_T = T/N the true proportion of such votes. Let p_r be the reported proportion of votes for the winner. We do not know T nor p_T, and it is not guaranteed that p_r ≈ p_T.
For sampling with replacement, conditional on n, Y_n has a binomial distribution with parameters n and p_T. For sampling without replacement, conditional on n, Y_n has a hypergeometric distribution with parameters n, T and N.
Risk-limiting audits amount to statistical hypothesis tests. The null hypothesis H_0 is that the reported winner(s) did not really win. The alternative H_1 is that the reported winners really won. For a single-winner contest,
    H_0: p_T ≤ 1/2  (reported winner is false),
    H_1: p_T > 1/2  (reported winner is true).
If we reject H_0, we certify the election without a full manual tally. The certification rate is the probability of rejecting H_0. Hypothesis tests are often characterized by their significance level (false positive rate) and power.
Both have natural interpretations in the context of election audits by reference to the certification rate. The power is simply the certification rate when H_1 is true. Higher power reduces the chance of an unnecessary recount. A false positive is a miscertification: rejecting H_0 when in fact it is true. The probability of miscertification depends on p_T and the audit method, and is known as the risk of the method. In a two-candidate plurality contest, the maximum possible risk is typically attained when p_T = 1/2.
For many auditing methods we can find an upper bound on the maximum possible risk, and can also set their evidence threshold such that the risk is limited to a given value. Such an upper bound is referred to as a risk limit, and methods for which this is possible are called risk-limiting. Some methods are explicitly designed to have a convenient mechanism to set such a bound, for example via a formula. We call such methods automatically risk-limiting.
Audits with a sample size limit m become full manual tabulations if they have not stopped after drawing the mth ballot. Such a tabulation is assumed to find the correct outcome, so the power of a risk-limiting audit is 1. We use the term 'power' informally to refer to the chance the audit stops after drawing m or fewer ballots.
We describe Bayesian audits in some detail because they provide a mathematical framework for many (but not all) of the other methods. We then describe the other methods, many of which can be viewed as Bayesian audits for a specific choice of the prior distribution. Some of these connections were previously described by [11]. These connections can shed light on the performance or interpretation of the other methods. However, our benchmarking experiments are frequentist, even for the Bayesian audits (for example, we calibrate the methods to limit the risk).
Table 1 lists the methods described here; the parameters of the methods are defined below.
Table 1:
Summary of auditing methods.
The methods in the first part of the table are benchmarked in this report.
Method                 Quantities to set      Automatically risk-limiting
Bayesian               f(p)                   —
Bayesian (risk-max.)   f(p), for p > 1/2      ✓
BRAVO                  p_1                    ✓
MaxBRAVO               None                   —
ClipAudit              None                   — †
KMart                  g(γ) ‡                 ✓
Kaplan–Wald            γ                      ✓
Kaplan–Markov          γ                      ✓
Kaplan–Kolmogorov      γ                      ✓

† Provides a pre-computed table for approximate risk-limiting thresholds
‡ Extension introduced here
Bayesian audits quantify evidence in the sample as a posterior distribution of the proportion of votes in favour of the reported winner. In turn, that distribution induces a (posterior) probability that the outcome is wrong, Pr(H_0 | Y_n), the upset probability.
The posterior probabilities require positing a prior distribution, f, for the reported winner's vote share p. (For clarity, we denote the fraction of votes for the reported winner by p when we treat it as random for Bayesian inference and by p_T to refer to the actual true value.)
We represent the posterior using the posterior odds,
    Pr(H_1 | X_1, …, X_n) / Pr(H_0 | X_1, …, X_n) = [Pr(X_1, …, X_n | H_1) / Pr(X_1, …, X_n | H_0)] × [Pr(H_1) / Pr(H_0)].
The first term on the right is the Bayes factor (BF) and the second is the prior odds. The prior odds do not depend on the data: the information from the data is in the BF. We shall use the BF as the statistic, S_n. It can be expressed as
    S_n = Pr(X_1, …, X_n | H_1) / Pr(X_1, …, X_n | H_0) = ∫_{p > 1/2} Pr(Y_n | p) f(p) dp / ∫_{p ≤ 1/2} Pr(Y_n | p) f(p) dp.
The term Pr(Y_n | p) is the likelihood. The BF is similar to a likelihood ratio, but the likelihoods are integrated over p rather than evaluated at specific values (in contrast to classical approaches, see Section 3.2).
Understanding priors.
The prior f determines the relative contributions of possible values of p to the BF. It can be continuous, discrete or neither. A conjugate prior is often used [6], which has the property that the posterior distribution is in the same family, which has mathematical and practical advantages. For sampling with replacement the conjugate prior is beta (which is continuous), while for sampling without replacement it is a beta-binomial (which is discrete).
Vora [11] showed that a prior that places a probability mass of 0.5 on the value p = 1/2 and spreads the remaining mass over (1/2, 1] is risk-maximizing: for such a prior, limiting the upset probability to α also limits the risk to α.
We explore several priors below, emphasizing a uniform prior (an example of a 'non-partisan prior' [6]), which is a special case within the family of conjugate priors used here.
Bayesian audit procedure.
A Bayesian audit proceeds as follows. At each stage of sampling, calculate S_n and then:
    if S_n > h, terminate and certify;
    if S_n ≤ h, continue sampling.    (*)
If the audit does not terminate and certify for n ≤ m, there is a full manual tabulation of the votes.
The threshold h is equivalent to a threshold on the upset probability: Pr(H_0 | Y_n) < υ corresponds to h = [(1 − υ)/υ] × Pr(H_0)/Pr(H_1). If the prior places equal probability on the two hypotheses (a common choice), this simplifies to h = (1 − υ)/υ.
Interpretation.
The upset probability, Pr(H_0 | Y_n), is not the risk, which we write informally as max_{H_0} Pr(certify | H_0). The procedure outlined above limits the upset probability. This is not the same as limiting the risk. Nevertheless, in the election context considered here, Bayesian audits are risk-limiting, but with a risk limit that is in general larger than the upset probability threshold. For a given prior, sampling scheme, and risk limit α, we can calculate a value of h for which the risk of the Bayesian audit with threshold h is bounded by α. (This is a consequence of the fact that the risk is maximized when p_T = 1/2, a fact that we can use to bound the risk by choosing an appropriate value for the threshold. The mathematical details are shown in Section A.) For risk-maximizing priors, taking h = (1 − α)/α yields an audit with risk limit α.
The basic sequential probability ratio test (SPRT) [12], adapted slightly to suit the auditing context here, tests the simple hypotheses
    H_0: p_T = p_0,    H_1: p_T = p_1,
(The SPRT allows rejection of either H_0 or H_1, but we only allow the former here. This aligns it with the broader framework for election audits described earlier. Also, we impose a maximum sample size, as per that framework.)
5, a factthat we can use to bound the risk by choosing an appropriate value for the threshold.The mathematical details are shown in Section A. The SPRT allows rejection of either H or H , but we only allow the former here.This aligns it with the broader framework for election audits described earlier. Also,we impose a maximum sample size, as per that framework. Huang et al. using the likelihood ratio: (cid:40) if S n = Pr( Y n | p )Pr( Y n | p ) > α , terminate and certify (reject H ),otherwise, continue sampling.This is equivalent to (*) for a prior with point masses of 0.5 on the values p and p with h = 1 /α . This procedure has a risk limit of α .The test statistic can be tailored to sampling with or without replacement byusing the appropriate likelihood. The SPRT has the smallest expected samplesize among all level α tests of these same hypotheses. This optimality holds onlywhen no constraints are imposed on the sampling (such as a maximum samplesize).The SPRT statistic is a nonnegative martingale when H holds; Kolmogorov’sinequality implies that it is automatically risk-limiting. Other martingale-basedtests are discussed in Section 3.4.The statistic from a Bayesian audit can also be a martingale, if the prioris the true data generating process under H . This occurs, for example, for arisk-maximizing prior if p T = 0 . BRAVO.
In a two-candidate contest, BRAVO [3] applies the SPRT with:
    p_0 = 0.5,    p_1 = p_r − ε,
where ε is a pre-specified small value for which p_1 > 0.5. (The SPRT can perform poorly when p_T ∈ (p_0, p_1); taking ε > 0 guards against this.) Because it is the SPRT, BRAVO has a risk limit no larger than α.
BRAVO requires picking p_1 (analogous to setting a prior for a Bayesian audit). The recommended value is based on the reported winner's share, but the SPRT can be used with any alternative. Our numerical experiments do not involve a reported vote share; we simply set p_1 to various values.
MaxBRAVO.
As an alternative to specifying p_1, we experimented with replacing the likelihood, Pr(Y_n | p_1), with the maximized likelihood, max_p Pr(Y_n | p), leaving other aspects of the test unchanged. This same idea has been used in other contexts, under the name MaxSPRT [2]. We refer to our version as MaxBRAVO. Because of the maximization, the method is not automatically risk-limiting, so we calibrate the stopping threshold h numerically to attain the desired risk limit, as we do for Bayesian audits.
Rivest [5] introduces
ClipAudit, a method that uses a statistic that is very easy to calculate, S_n = (A_n − B_n)/√(A_n + B_n), where A_n = Y_n and B_n = n − Y_n. Approximately risk-limiting thresholds for this statistic were given (found numerically), along with formulae that give approximate thresholds. We used ClipAudit with the 'best fit' formula [5, equation (6)].
As far as we can tell, ClipAudit is not related to any of the other methods we describe here, but S_n is the test statistic commonly used to test the hypothesis H_0: p_T = 0.5 against H_1: p_T > 0.5:
    S_n = (A_n − B_n)/√(A_n + B_n) = (2Y_n − n)/√n = (Y_n/n − 0.5)/√(0.5 × (1 − 0.5)/n) = (p̂_T − p_0)/√(p_0 × (1 − p_0)/n).
Several martingale-based methods have been developed for the general problem of testing hypotheses about the mean of a non-negative random variable. SHANGRLA exploits this generality to allow auditing of a wide class of elections [8]. While we did not benchmark these methods in our study (they are better suited for other scenarios, such as comparison audits, and will be less efficient in the simple case we consider here), we describe them here in order to usefully point out some of the connections between methods.
For each of the methods below, the essential difference is in the definition of the statistic, S_n. The procedure in each case is the same: we certify the election if S_n > 1/α, otherwise we keep sampling. All of the procedures can be shown to have a risk limit of α.
All the procedures have a 'padding' parameter γ that prevents degenerate values of S_n. This parameter either needs to be set to a specific value or is integrated out.
The statistics below that are designed for sampling without replacement depend on the order in which ballots are sampled.
None of the other statistics (in this section or earlier) have that property.
We use t to denote the value of E(X_i) under the null hypothesis. In the two-candidate context discussed in this paper, this would be set to t = p_0 = 0.5. We also define Y_0 = 0.
KMart.
This method was described online (https://github.com/pbstark/MartInf/blob/master/kmart.ipynb) under the name KMart and is implemented in SHANGRLA [8]. There are two versions of the test statistic, designed for sampling with or without replacement, respectively:
    S_n = ∫_0^1 ∏_{i=1}^{n} ( γ[X_i/t − 1] + 1 ) dγ,    and
    S_n = ∫_0^1 ∏_{i=1}^{n} ( γ[ X_i ((N − i + 1)/N) / (t − Y_{i−1}/N) − 1 ] + 1 ) dγ.
(When sampling without replacement, if we ever observe Y_n > Nt then we ignore the statistic and terminate the audit since H_1 is guaranteed to be true.)
This method is related to Bayesian audits for two-candidate contests: for sampling with replacement and no invalid votes, we have shown that KMart is equivalent to a Bayesian audit with a risk-maximizing prior that is uniform over p > 1/2. The same analysis shows how to extend KMart to be equivalent to using an arbitrary risk-maximizing prior, by inserting an appropriately constructed weighting function g(γ) into the integrand.
There is no direct relationship of this sort for the version of KMart that uses sampling without replacement, since this statistic depends on the order the ballots are sampled but the statistic for Bayesian audits does not.
Kaplan–Wald.
This method is similar to KMart but involves picking a value of γ rather than integrating over γ [10]. The previous proof shows that: for sampling with replacement, Kaplan–Wald is equivalent to BRAVO with p_1 = (γ + 1)/2; for sampling without replacement, there is no such relationship.
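The with-replacement equivalence is easy to confirm numerically: with t = 0.5, each Kaplan–Wald factor γ(X_i/t − 1) + 1 equals 1 + γ for X_i = 1 and 1 − γ for X_i = 0, which are exactly the BRAVO likelihood-ratio factors 2p_1 and 2(1 − p_1) when p_1 = (γ + 1)/2. A sketch with a toy sample (not from the paper):

```python
def kaplan_wald(sample, gamma, t=0.5):
    """Kaplan-Wald statistic (sampling with replacement) for a fixed gamma."""
    s = 1.0
    for x in sample:
        s *= gamma * (x / t - 1) + 1
    return s

def bravo_lr(sample, p1, p0=0.5):
    """BRAVO / SPRT likelihood ratio for 0/1 ballots, with replacement."""
    s = 1.0
    for x in sample:
        s *= (p1 / p0) if x else ((1 - p1) / (1 - p0))
    return s

toy_sample = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
kw = kaplan_wald(toy_sample, gamma=0.1)   # gamma = 0.1 ...
lr = bravo_lr(toy_sample, p1=0.55)        # ... corresponds to p1 = 0.55
```

The two values agree (up to floating-point rounding) for any 0/1 sample.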
Kaplan–Markov.
This method applies Markov's inequality to the martingale ∏_{i ≤ n} X_i / E(X_i), where the expectation is calculated assuming sampling with replacement [9]. This gives the statistic, S_n = ∏_{i=1}^{n} (X_i + γ)/(t + γ).
Kaplan–Kolmogorov.
This method is the same as Kaplan–Markov but with the expectation calculated assuming sampling without replacement [8]. This gives the statistic,
    S_n = ∏_{i=1}^{n} [ (X_i + γ)((N − i + 1)/N) ] / [ t − Y_{i−1}/N + ((N − i + 1)/N)γ ].
(As for KMart, if Y_n > Nt we ignore the statistic and terminate the audit.)
We evaluated the methods using simulations; see the first part of Table 1. For each method, the termination threshold h was calibrated numerically to yield maximum risk as close as possible to 5%. This makes comparisons among the methods 'fair'. We calibrated even the automatically risk-limiting methods, resulting in a slight performance boost. We also ran some experiments without calibration, to quantify this difference.
We use three quantities to measure performance: maximum risk and 'power', defined in Section 2.3, and the mean sample size.
Choice of auditing methods.
Most of the methods require choosing the form of statistics, tuning parameters, or a prior. Except where stated, our benchmarking experiments used sampling without replacement. Except where indicated, we used the version of each statistic designed for the method of sampling used. For example, we used a hypergeometric likelihood when sampling without replacement. For Bayesian audits we used a beta-binomial prior (conjugate to the hypergeometric likelihood) with shape parameters a and b. For BRAVO, we tried several values of p_1. (The mathematical details are shown in Section B.)
The tests labelled 'BRAVO' are tests of a method related to but not identical to BRAVO, because there is no notion of a 'reported' vote share in our experiments. Instead, we set p_1 to several fixed values to explore how the underlying test statistic (from the SPRT) performs in different scenarios.
For MaxBRAVO and Bayesian audits with risk-maximizing prior, due to time constraints we only implemented statistics for the binomial likelihood (which assumes sampling with replacement). While these are not exact for sampling without replacement, we believe this choice has only a minor impact when m ≪ N (based on our results for the other methods when using different likelihoods). For Bayesian audits with a risk-maximizing prior, we used a beta distribution prior (conjugate to the binomial likelihood) with shape parameters a and b.
ClipAudit only has one version of its statistic. It is not optimized for sampling without replacement (for example, if you sample all of the ballots, it will not 'know' this fact), but the stopping thresholds are calibrated for sampling without replacement.
Election sizes and sampling designs.
We explored combinations of election sizes N ∈ {…} and maximum sample sizes m ∈ {…}. Most of our experiments used a sampling increment of 1 (i.e. check the stopping rule after each ballot is drawn). We also varied the sampling increment (values in {…}) and tried sampling with replacement.
Benchmarking via dynamic programming.
We implemented an efficient method for calculating the performance measures using dynamic programming. This exploits the Markovian nature of the sampling procedure and the low dimensionality of the (univariate) statistics. This approach allowed us to calculate exact values of each of the performance measures for elections with up to tens of thousands of votes, including the tail probabilities of the sampling distributions, which require large sample sizes to estimate accurately by Monte Carlo. (Our code is available at: https://github.com/Dovermore/AuditAnalysis)
We expect that with some further optimisations our approach would be computationally feasible for larger elections (up to 1 million votes). The complexity largely depends on the maximum sample size, m. As long as this is moderate (thousands) our approach is feasible. For more complex audits (beyond two-candidate contests), a Monte Carlo approach is likely more practical.
Different methods have different distributions of sample sizes; Figure 1 shows these for a few methods when p_T = 0.5. Some methods tend to stop early; others take many more samples. Requiring a minimum sample size might improve performance of some of the methods; see Section 5.3.
[Figure 1 appears here.]
Fig. 1: Sample size distributions. Audits of elections with N = 20,000 ballots, maximum sample size m = 2,000 and p_T = 0.5. An audit that reaches n = m stops and progresses to a full manual tabulation. 'Bayesian (r.m.)' refers to the Bayesian audit with a risk-maximizing prior. The sawtooth pattern is due to the discreteness of the statistics.
Mean sample sizes.
We focus on average sample sizes as a measure of audit efficiency. Table 2 shows the results of experiments with N = 20,000 and m = 2,000. BRAVO with p_1 = 0.55 or a Bayesian audit with a moderately constrained prior (a = b = 100) were optimal when p_T was closer to 0.5. Methods with substantial prior weight on wider margins, such as BRAVO with larger values of p_1, were better when the true margin was wide (the SPRT is optimal when p_1 = p_T). However, our experiments violate the theoretical assumptions because we imposed a maximum sample size, m. (Indeed, when p_1 = p_T = 0.51, BRAVO is no longer optimal in our experiments.)
Table 2: Results from benchmarking experiments. Audits of elections with N = 20,000 ballots and a maximum sample size m = 2,000. Each column corresponds to a value of p_T; the corresponding margin of victory (mov) is also reported. Each row refers to a specific auditing method. For calibrated methods, we report the threshold obtained. For easier comparison, we present these on the nominal risk scale for BRAVO, MaxBRAVO and ClipAudit (e.g. α = 1/h for BRAVO), and on the upset probability scale for the Bayesian methods (υ = 1/(h + 1)). For the experiments without calibration, we report the maximum risk of each method when set to a 'nominal' risk limit of 5%. We only report uncalibrated results for methods that are automatically risk-limiting, as well as ClipAudit using its 'best fit' formula to set the threshold. 'Bayesian (r.m.)' refers to the Bayesian audit with a risk-maximizing prior. The numbers in bold are those that are (nearly) best for the given experiment and choice of p_T. The section labelled 'n ≥ …' reports results with a minimum sample size imposed.
[Table 2 body: for each method (Bayesian audits with a = b ∈ {1, 100, 500}, a Bayesian audit with a risk-maximizing prior, BRAVO with several values of p_1 including 0.55 and 0.51, MaxBRAVO, and ClipAudit), the calibrated threshold (α or υ, %), the power (%) and the mean sample size at p_T ∈ {0.52, 0.55, 0.60, 0.64, 0.70}, in three sections: calibrated; calibrated with a minimum sample size; uncalibrated (reporting maximum risk, %).]
Two methods were consistently poor: BRAVO with p_1 = 0.51 and a Bayesian audit with a = b = 500. Both place substantial weight on a very close election.
MaxBRAVO and ClipAudit, the two methods without a direct match to Bayesian audits, performed similarly to a Bayesian audit with a uniform prior (a = b = 1). All three are 'broadly' tuned: they perform reasonably well in most scenarios, even when they are not the best.
Effect of calibration on the uncalibrated methods.
For most of the automatically risk-limiting methods, calibration had only a small effect on performance. BRAVO with p_1 = 0.51 is an exception: it was very conservative because it normally requires more than m samples.
Other election sizes and performance measures.
The broad conclusions are the same for a range of values of m and N, and when performance is measured by quantiles of sample size or the probability of stopping without a full hand count, rather than by average sample size.

Sampling with vs without replacement.
There are two ways to change our experiments to explore sampling with replacement: (i) construct versions of the statistics specifically for sampling with replacement; (ii) leave the methods alone but sample with replacement. We explored both options, separately and combined; differences were minor when m ≪ N.

Consider the following two methods, which were the most efficient for different election margins: (i) BRAVO with p = 0.55; (ii) ClipAudit. For p_T = 0.52, the mean sample sizes are 1,549 vs 1,630 (BRAVO saved 81 draws on average). For p_T = 0.7, the equivalent numbers are 85 vs 45 (ClipAudit saved 40 draws on average).

Picking a method requires trade-offs involving resources, workload predictability, and jurisdictional idiosyncrasies in ballot handling and storage—as well as the unknown true margin. Differences in expected sample size across ballot-polling methods might be immaterial in practice compared to other desiderata.
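Such comparisons can be reproduced in outline with a small exact computation. The following sketch (a minimal illustration, not our experimental code: it assumes sampling with replacement, one-ballot sampling rounds, a much smaller maximum sample size than in our experiments, and the uniform-prior Bayesian audit calibrated to a 5% risk limit) computes power and mean sample size by dynamic programming rather than simulation:

```python
import numpy as np
from scipy.special import beta as beta_fn, betainc

M = 200  # illustrative maximum sample size (far smaller than in our experiments)

def upset(n, y):
    # Upset probability after n draws with y winner ballots, under the
    # risk-maximising uniform prior: mass 1/2 at t = 1/2, density 2 on (1/2, 1].
    null = 0.5 ** (n + 1)
    alt = beta_fn(y + 1, n - y + 1) * (1 - betainc(y + 1, n - y + 1, 0.5))
    return null / (null + alt)

# Precompute the upset probability for every reachable (n, y) state.
U = [None] + [np.array([upset(n, y) for y in range(n + 1)]) for n in range(1, M + 1)]

def audit(t, threshold):
    """Exact power and mean sample size (by dynamic programming) when the
    winner's true share is t, checking the stopping rule after each draw."""
    p = np.array([1.0])                  # probability mass over the tally y
    power = mean_draws = 0.0
    for n in range(1, M + 1):
        q = np.zeros(n + 1)
        q[:-1] += (1 - t) * p            # next ballot is for the loser
        q[1:] += t * p                   # next ballot is for the winner
        stop = U[n] < threshold
        power += q[stop].sum()
        mean_draws += n * q[stop].sum()
        q[stop] = 0.0
        p = q
    # Audits that never certify draw all M ballots (then go to a full count).
    return power, mean_draws + M * p.sum()

def calibrate(limit=0.05):
    # Largest threshold whose risk (the certification probability at the
    # worst case, t = 1/2) stays within the limit; the risk is monotone
    # in the threshold, so bisection applies.
    lo, hi = 0.0, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2
        if audit(0.5, mid)[0] <= limit:
            lo = mid
        else:
            hi = mid
    return lo

thr = calibrate()
for p_T in (0.52, 0.70):
    power, mean_n = audit(p_T, thr)
    print(f"p_T = {p_T:.2f}: power {power:.3f}, mean sample size {mean_n:.1f}")
```

As expected, the wide-margin audit stops far sooner on average than the close one; the exact numbers differ from those above because of the smaller m and the with-replacement assumption.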
Increasing the sampling increment. Increasing the number of ballots sampled in each 'round' increases the chance that the audit will stop without a full hand count, but increases the mean sample size. This is as expected; the limiting version is a single fixed sample of size n = m, which has the highest power but loses the efficiency that early stopping can provide.

Increasing the sampling increment had the most impact on methods that tend to stop early, such as Bayesian audits with a = b = 1, and less on methods that do not, such as BRAVO with p = 0.51. Increasing the increment also decreases the differences among the methods. This makes sense because when the sample size is m, the methods are identical (since all are calibrated to attain the risk limit).

Considering the trade-off discussed in the previous section: since increasing the sampling increment improves power but increases the mean sample size, it reduces effort when the election is close, but increases it when the margin is wide.

Increasing the maximum sample size (m). Increasing m has the same effect as increasing the sampling increment: higher power at the expense of more work on average. This effect is stronger for closer elections, since sampling will likely stop earlier when the margin is wide.

Requiring/encouraging more samples.
The Bayesian audit with a = b = 1 tends to stop too early, so we tried two potential improvements, shown in Table 2. The first was to impose a minimum sample size. The second was to increase a and b. When a = b = 100, we obtain largely the same benefit for close elections with a much milder penalty when the margin is wide. The overall performance profile becomes closer to BRAVO with p = 0.55.

Conclusion

We compared several ballot-polling methods, both analytically and numerically, to elucidate the relationships among the methods. We focused on two-candidate contests, which are building blocks for auditing more complex elections. We explored modifications and extensions to existing procedures. Our benchmarking experiments calibrated the methods to attain the same maximum risk.

Many 'non-Bayesian' auditing methods are special cases of a Bayesian procedure for a suitable prior, and Bayesian methods can be calibrated to be risk-limiting (at least, in the two-candidate, all-valid-vote context investigated here). Differences among such methods amount to technical details, such as choices of tuning parameters, rather than something more fundamental. Of course, upset probability is fundamentally different from risk.

No method is uniformly best, and most can be 'tuned' to improve performance for elections with either closer or wider margins—but not both simultaneously. If the tuning is not extreme, performance will be reasonably good for a wide range of true margins. In summary:

1. If the true margin is known approximately, BRAVO is best.
2. Absent reliable information on the margin, ClipAudit and Bayesian audits with a uniform prior (calibrated to attain the risk limit) are efficient.
3. Extreme settings, such as p ≈ 0.5 for BRAVO, perform poorly unless the election is extremely close.

Future work: While we tried to be comprehensive in examining ballot-polling methods for two-candidate contests with no invalid votes, there are many ways to extend the analysis to cover more realistic scenarios. Some ideas include: (i) more than two candidates and non-plurality social choice functions; (ii) invalid votes; (iii) larger elections; (iv) stratified samples; (v) batch-level audits; (vi) multi-page ballots.
References
1. Blom, M., Stuckey, P.J., Teague, V.J.: Ballot-polling risk limiting audits for IRV elections. In: Electronic Voting. pp. 17–34. Springer, Cham (2018)
2. Kulldorff, M., Davis, R.L., Kolczak, M., Lewis, E., Lieu, T., Platt, R.: A maximized sequential probability ratio test for drug and vaccine safety surveillance. Sequential Analysis 30(1), 58–78 (2011). https://doi.org/10.1080/07474946.2011.539924
3. Lindeman, M., Stark, P.B., Yates, V.S.: BRAVO: Ballot-polling risk-limiting audits to verify outcomes. In: 2012 Electronic Voting Technology Workshop/Workshop on Trustworthy Elections (EVT/WOTE '12) (2012)
4. National Academies of Sciences, Engineering, and Medicine: Securing the Vote: Protecting American Democracy. The National Academies Press, Washington, DC (Sep 2018). https://doi.org/10.17226/25120
5. Rivest, R.L.: ClipAudit: A simple risk-limiting post-election audit. arXiv e-prints arXiv:1701.08312 (Jan 2017)
6. Rivest, R.L., Shen, E.: A Bayesian method for auditing elections. In: 2012 Electronic Voting Technology Workshop/Workshop on Trustworthy Elections (EVT/WOTE '12) (2012)
7. Stark, P.B.: Conservative statistical post-election audits. Ann. Appl. Stat. 2(2), 550–581 (2008), http://arxiv.org/abs/0807.4005
8. Stark, P.B.: Sets of half-average nulls generate risk-limiting audits: SHANGRLA. In: Voting '20, in press (2020), preprint: http://arxiv.org/abs/1911.10035
9. Stark, P.B.: Risk-limiting postelection audits: Conservative P-values from common probability inequalities. IEEE Transactions on Information Forensics and Security 4(4), 1005–1014 (Dec 2009). https://doi.org/10.1109/TIFS.2009.2034190
10. Stark, P.B., Teague, V.: Verifiable European elections: Risk-limiting audits for D'Hondt and its relatives. USENIX Journal of Election Technology and Systems (JETS) 2(3) (2014)
11. Vora, P.L.: Risk-limiting Bayesian polling audits for two-candidate elections. arXiv e-prints arXiv:1902.00999 (2019)
12. Wald, A.: Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16(2), 117–186 (June 1945). https://doi.org/10.1214/aoms/1177731118

A Risk-limiting Bayesian audits with arbitrary priors
Vora [11] provides a construction of a risk-limiting Bayesian audit: taking a Bayesian audit with an arbitrary prior ($f_X$) and constructing a new prior based on it ($f^*_X$) that has the property that a threshold on the upset probability is also a risk limit.

The argument can be extended to show that any prior has a bounded risk and can therefore be used to conduct a risk-limiting audit. Such a usage would involve calculating an appropriate threshold on the upset probability that results in a particular specified bound on the risk.

A.1 Lemma
In a two-candidate election, the risk of an audit is given by the (mis)certification probability when the true tally is equal votes for each candidate, or the closest possible such non-winning tally (notionally $p = 0.5$; in the notation of Vora [11] this would be the case of $x = (N-1)/2$ for odd $N$, and $x = N/2$ for even $N$).

Proof: This assertion can be proved by the same monotonicity argument used in the proof of Theorem 2 of Vora [11]: $h_g(k, n, x, N)$ is a monotone increasing function of $x$ for $x \in [0, N-1]$, and applying this termwise to the formula for the risk at $x$, $P_T(\Lambda, x)$, leads to the conclusion.
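The lemma can be checked numerically. The sketch below (a minimal illustration with hypothetical parameters: a small contest, a cap of $m$ draws, and a stopping rule $\Lambda$ defined by a with-replacement upset probability even though the draws are without replacement) computes the certification probability exactly for several losing tallies $x$, which increases towards the tie:

```python
from scipy.special import beta as beta_fn, betainc

N, m, threshold = 1000, 120, 0.05   # illustrative contest and stopping rule

def upset(n, y):
    # Upset probability under the uniform risk-maximising prior (a
    # with-replacement formula, used here only to define some audit rule).
    null = 0.5 ** (n + 1)
    alt = beta_fn(y + 1, n - y + 1) * (1 - betainc(y + 1, n - y + 1, 0.5))
    return null / (null + alt)

def certification_prob(x):
    """Exact P_T(Lambda, x): the probability the audit certifies when the
    reported winner truly holds x of the N votes, sampling without replacement."""
    p = {0: 1.0}            # probability mass over the running tally y
    certified = 0.0
    for n in range(1, m + 1):
        q = {}
        for y, mass in p.items():
            pw = (x - y) / (N - (n - 1))   # chance the next ballot is a winner vote
            if pw > 0:
                q[y + 1] = q.get(y + 1, 0.0) + mass * pw
            if pw < 1:
                q[y] = q.get(y, 0.0) + mass * (1 - pw)
        p = {}
        for y, mass in q.items():
            if upset(n, y) < threshold:
                certified += mass          # certify and stop
            else:
                p[y] = mass                # keep sampling
    return certified

# The miscertification probability grows as the losing tally x approaches
# a tie, so the risk is attained at the closest-to-tied losing tally.
risks = [certification_prob(x) for x in (400, 450, 500)]
print(risks)
```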
A.2 Corollary
For any prior, the risk of the Bayesian audit with this prior is given by $P_T(\Lambda, \lfloor N/2 \rfloor)$, i.e. evaluated at the tally identified in the lemma above.

A.3 Lemma
The risk of a Bayesian audit is a monotone increasing function of $\gamma$, the threshold on the upset probability. (In other words, relaxing the threshold leads to higher risk.)

Proof: If $\gamma$ is increased, then:
– Any sequence in $\Lambda$ remains in $\Lambda$, with the sample size at which the audit stops possibly reducing (i.e. the audit terminates earlier).
– Some sequences in $\bar{\Lambda}$ move to $\Lambda$, due to the relaxed threshold.

Therefore, overall there will be a shift of probability from $\bar{\Lambda}$ to $\Lambda$. This is true for any given true $x$, and in particular for the value which gives the largest miscertification probability ($x = \lfloor N/2 \rfloor$). Therefore, the risk has increased.

A.4 Corollary
The monotonic relationship implies that we can reduce the risk by imposing a stricter threshold on the upset probability. In particular, we can reduce it until it is less than any pre-specified limit. Thus, we can use any Bayesian audit in a risk-limiting fashion.

Note that to implement this in practice we need to be able to calculate the risk for any given threshold and optimise the threshold value to bring the risk under the specified limit. This should be straightforward enough for the two-candidate case via either simulation or exact calculation, since we know which value of $x$ gives rise to the maximum miscertification probability. Note that such a calculation would need to be done separately for any given choice of sampling scheme and prior.

B KMart as a Bayesian audit
This appendix shows a proof that KMart, assuming sampling with replacement, is equivalent to a Bayesian audit with a risk-maximising uniform prior for the reported winner's true vote tally. It also introduces a more general version of the test statistic that corresponds to an arbitrary risk-maximising prior. Both results are shown for a simple two-candidate contest.
B.1 KMart is equivalent to a Bayesian audit
Suppose we are auditing a simple two-candidate election, using sampling with replacement. We observe iid $X_1, X_2, \dots \in \{0, 1\}$, where $X_j = 1$ is a vote for the reported winner and $X_j = 0$ is a vote for the reported loser. Let $\mathbb{E}[X_j] = t$, the true tally of the reported winner. In other words, the $X_j$ are a sequence of Bernoulli trials with success probability $t$.

The null hypothesis for the audit is that the reported winner actually lost, i.e. that $t \leq 1/2$. To carry out a test, we usually set this to the 'hardest' case, which is $H_0 : t = t_0 = 1/2$. The alternative hypothesis is that the winning candidate was reported correctly, i.e. $H_1 : t > 1/2$.

In practice we will always have a finite number of total votes, and thus a realistic model would have the support of $t$ be a discrete set (i.e. values of the form $k/N$ where $N$ is the total number of votes). However, for mathematical convenience here we will allow the support of $t$ to be the unit interval, which is continuous.

KMart audits.
KMart is a risk-limiting election auditing method based on martingale theory. For the context described above, it uses the following test statistic:
$$A_n = \int_0^1 \prod_{j=1}^{n} \left( \gamma \left[ \frac{X_j}{t_0} - 1 \right] + 1 \right) d\gamma.$$
Since we are working with $t_0 = 1/2$, we can rewrite this expression as
$$A_n = 2^n \int_0^1 \prod_{j=1}^{n} \left( \gamma \left[ X_j - \frac{1}{2} \right] + \frac{1}{2} \right) d\gamma.$$
For a specified risk limit, $\alpha$, the audit proceeds until $A_n > 1/\alpha$, at which point the election is certified ($H_0$ is rejected), or is otherwise terminated in favour of doing a full recount.

('Hardest' means that it is the case that leads to the largest false positive rate (miscertification probability), i.e. the risk.)

Bayesian audits.
A Bayesian audit is based on standard Bayesian inference. The verdict of the audit is based on the posterior probability that the reported winner actually won (or lost, in which case this is called the upset probability). Typically, a threshold will be placed on this probability for deciding whether to certify the election or carry on sampling.

Bayesian audits can be represented in terms of the posterior odds, which gives a similar formulation to other risk-limiting audits [11]. For the context described above, they would use the following test statistic:
$$B_n = \frac{\Pr(H_1 \mid X_1, \dots, X_n)}{\Pr(H_0 \mid X_1, \dots, X_n)} = \frac{\Pr(X_1, \dots, X_n \mid H_1)}{\Pr(X_1, \dots, X_n \mid H_0)} \times \frac{\Pr(H_1)}{\Pr(H_0)}.$$
We will limit our discussion to risk-maximising prior distributions. These place a probability mass of $1/2$ on the value $t = 1/2$, and the remaining probability is distributed over the set $t \in (1/2, 1]$. This gives $\Pr(H_0) = \Pr(H_1) = 1/2$, meaning that the prior odds drop out of the above equation. The remaining term is the Bayes factor (BF). Let's write this out more explicitly.

Let $Y_n = \sum_{j=1}^{n} X_j$. The denominator of the BF is simple: the likelihood of the sample at the (point) null value,
$$\Pr(X_1, \dots, X_n \mid H_0) = \Pr\left(X_1, \dots, X_n \,\middle|\, t = \frac{1}{2}\right) = \left(\frac{1}{2}\right)^{Y_n} \left(\frac{1}{2}\right)^{n - Y_n} = \frac{1}{2^n}.$$
The numerator requires integrating over the prior under $H_1$. Letting this be $f(t)$, where $t \in (1/2, 1]$,
$$\Pr(X_1, \dots, X_n \mid H_1) = \int_{1/2}^{1} t^{Y_n} (1 - t)^{n - Y_n} f(t) \, dt.$$
Putting these together gives
$$B_n = 2^n \int_{1/2}^{1} t^{Y_n} (1 - t)^{n - Y_n} f(t) \, dt.$$
Similar to KMart, a Bayesian audit proceeds until $B_n > 1/\alpha$. (See Vora [11] for an example with a discrete support.)
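For the uniform risk-maximising prior ($f(t) = 2$ on $(1/2, 1]$), the Bayes factor and upset probability have a simple closed form via the regularised incomplete beta function. A minimal sketch (the function names are ours, not from any audit library):

```python
from scipy.special import beta as beta_fn, betainc

def bayes_factor(sample):
    # B_n for the uniform risk-maximising prior f(t) = 2 on (1/2, 1]:
    # B_n = 2^(n+1) * B(Y+1, n-Y+1) * (1 - I_{1/2}(Y+1, n-Y+1)),
    # where I is the regularised incomplete beta function.
    n, y = len(sample), sum(sample)
    return (2 ** (n + 1) * beta_fn(y + 1, n - y + 1)
            * (1 - betainc(y + 1, n - y + 1, 0.5)))

def upset_probability(sample):
    # Posterior Pr(H0 | data) when Pr(H0) = Pr(H1) = 1/2.
    return 1 / (1 + bayes_factor(sample))

# A single ballot for the reported winner moves the upset probability
# from 1/2 down to 0.4; more winner ballots push it further down.
print(upset_probability([1]), upset_probability([1] * 10))
```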
Equivalence.
Both $A_n$ and $B_n$ are expressed as integrals but with the $X_j$ in different 'places' in the integrand. The key to showing they are equivalent is to notice that the $X_j$ are binary variables, which allows us to set up an identity that relates the two ways of writing the integral. Specifically, we have the following identity:
$$\gamma \left( X_j - \frac{1}{2} \right) + \frac{1}{2} = \left( \frac{1 + \gamma}{2} \right)^{X_j} \left( \frac{1 - \gamma}{2} \right)^{1 - X_j}.$$
This allows us to rewrite $A_n$,
$$A_n = 2^n \int_0^1 \left( \frac{1 + \gamma}{2} \right)^{Y_n} \left( \frac{1 - \gamma}{2} \right)^{n - Y_n} d\gamma = \int_0^1 (1 + \gamma)^{Y_n} (1 - \gamma)^{n - Y_n} \, d\gamma.$$
Next, substituting $\gamma = 2t - 1$,
$$A_n = \int_{1/2}^{1} (2t)^{Y_n} (2 - 2t)^{n - Y_n} \cdot 2 \, dt = 2^{n+1} \int_{1/2}^{1} t^{Y_n} (1 - t)^{n - Y_n} \, dt.$$
Finally, note that this is identical to $B_n$ if we set the prior to be uniform over $H_1$, i.e. $f(t) = 2$. In other words, a KMart audit is equivalent to a Bayesian audit that uses a risk-maximising uniform prior.

B.2 Extending KMart to arbitrary priors
From the above result, we can see that $\gamma$ plays a similar role to $t$. The somewhat arbitrary integral over $\gamma$ used to define $A_n$ can be generalised by specifying a weighting function $g(\gamma)$:
$$A_n = \int_0^1 \prod_{j=1}^{n} \left( \gamma \left[ \frac{X_j}{t_0} - 1 \right] + 1 \right) g(\gamma) \, d\gamma.$$
Applying the same transformations as above gives
$$A_n = 2^{n+1} \int_{1/2}^{1} t^{Y_n} (1 - t)^{n - Y_n} \, g(2t - 1) \, dt.$$
In other words, this generalised version of KMart is equivalent to a Bayesian audit with the following risk-maximising prior: $f(t) = 2 \, g(2t - 1)$. The original KMart is the special case where $g(\cdot) = 1$.

B.3 Efficient computation by exploiting the equivalence
We can use the above equivalence to develop fast ways to compute the KMart statistic, by relating it to standard Bayesian calculations using conjugate priors. First, we show that if we take a conjugate prior distribution, truncate it, and add some point masses, the resulting distribution is still conjugate. Then we use this result to write a formula for the posterior distribution for the same case as above (a simple two-candidate election, sampling with replacement).
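Before deriving the closed form, the equivalences of B.1 and B.2 can be verified numerically: the KMart integrand is a polynomial in $\gamma$ and can be integrated exactly, while the matching Bayes factors reduce to incomplete beta functions. A sketch (the weight $g(\gamma) = 2\gamma$ is an arbitrary illustrative choice of ours):

```python
import numpy as np
from scipy.special import beta as beta_fn, betainc

def tail(a, b):
    # Integral of t^(a-1) (1-t)^(b-1) over (1/2, 1], via the regularised
    # incomplete beta function.
    return beta_fn(a, b) * (1 - betainc(a, b, 0.5))

def kmart(sample, g_poly=None):
    # A_n = integral over [0, 1] of prod_j (gamma(2 X_j - 1) + 1) * g(gamma),
    # with t0 = 1/2, computed exactly by expanding the product into a
    # polynomial in gamma (g defaults to the constant 1, i.e. original KMart).
    poly = np.polynomial.Polynomial([1.0]) if g_poly is None else g_poly
    for x in sample:
        poly = poly * np.polynomial.Polynomial([1.0, 2 * x - 1.0])
    anti = poly.integ()
    return anti(1.0) - anti(0.0)

def bayes_uniform(sample):
    # B_n with the uniform risk-maximising prior, f(t) = 2 on (1/2, 1].
    n, y = len(sample), sum(sample)
    return 2 ** (n + 1) * tail(y + 1, n - y + 1)

def bayes_g(sample):
    # B_n with f(t) = 2 g(2t - 1) for the illustrative weight g(gamma) = 2 gamma,
    # i.e. f(t) = 8t - 4 on (1/2, 1].
    n, y = len(sample), sum(sample)
    return 2 ** n * (8 * tail(y + 2, n - y + 1) - 4 * tail(y + 1, n - y + 1))

g = np.polynomial.Polynomial([0.0, 2.0])   # g(gamma) = 2 gamma
rng = np.random.default_rng(0)
for _ in range(25):
    s = rng.integers(0, 2, size=int(rng.integers(1, 40))).tolist()
    assert np.isclose(kmart(s), bayes_uniform(s), rtol=1e-9)
    assert np.isclose(kmart(s, g), bayes_g(s), rtol=1e-9)
print("KMart matches the equivalent Bayes factors on random samples")
```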
Truncation and point masses preserve conjugacy. (The proofs shown here are not too hard to derive and may well be described elsewhere.)

Suppose we have a single parameter, $\theta$, some data, $D$, a likelihood function, $L(\theta \mid D)$, and a conjugate prior distribution, $f(\theta)$. That means we have
$$f(\theta \mid D) \propto L(\theta \mid D) \, f(\theta).$$
Let the normalising constant be
$$k = \int L(\theta \mid D) \, f(\theta) \, d\theta.$$
This allows us to express the posterior as
$$f(\theta \mid D) = \frac{1}{k} L(\theta \mid D) \, f(\theta).$$
The sections that follow each start with these definitions and transform the prior in various ways.
Truncation.
Truncate the prior to a subset $S$ (i.e. we only allow $\theta \in S$). Write this truncated prior as
$$f^*(\theta) = \frac{f(\theta) \, I_S(\theta)}{z_S},$$
where $I_S(\theta)$ is the indicator function that takes value 1 when $\theta \in S$, and $z_S = \int f(\theta) I_S(\theta) \, d\theta = \int_S f(\theta) \, d\theta$ is the normalising constant due to truncation.

If we use this prior, what posterior do we get? It will be
$$f^*(\theta \mid D) = \frac{1}{k^*} L(\theta \mid D) \, f^*(\theta), \quad \text{where} \quad k^* = \int L(\theta \mid D) \, f^*(\theta) \, d\theta.$$
Expanding this out gives
$$f^*(\theta \mid D) = \frac{1}{k^* z_S} L(\theta \mid D) \, f(\theta) \, I_S(\theta) = \frac{k}{k^* z_S} f(\theta \mid D) \, I_S(\theta).$$
This is the original posterior truncated to $S$. Thus, the truncation results in staying within the same family of (truncated) probability distributions, which means this family is conjugate.

Adding a point mass.
Define a 'spiked' prior where we add a point mass at $\theta_0$:
$$f^*(\theta) = a \, \delta_{\theta_0}(\theta) + b \, f(\theta),$$
where $a + b = 1$. In other words, a mixture distribution with mixture weights $a$ and $b$. The normalising constant is
$$k^* = \int L(\theta \mid D) \, f^*(\theta) \, d\theta = a \, L(\theta_0 \mid D) + b k.$$
We can write the posterior as
$$f^*(\theta \mid D) = \frac{1}{k^*} L(\theta \mid D) \, f^*(\theta) = a \, \frac{L(\theta_0 \mid D)}{k^*} \, \delta_{\theta_0}(\theta) + \frac{b k}{k^*} f(\theta \mid D).$$
This is a 'spiked' version of the original posterior. You can see this more clearly by defining
$$a^* = a \, \frac{L(\theta_0 \mid D)}{k^*}, \qquad b^* = \frac{b k}{k^*},$$
where $a^* + b^* = 1$. Thus, 'spiking' a distribution results in a conjugate family. Note that the mixture weights get updated as we go from the prior to the posterior.

Truncating and adding point masses.
We can combine both of the previous operations and we will still retain conjugacy. In fact, due to the generality of the proofs, we can apply each one an arbitrary number of times, e.g. to add many point masses.
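The combined result can be checked numerically. The sketch below (the Beta(2, 3) prior, spike weight and data are hypothetical choices of ours) compares the closed-form posterior spike weight $a^*$ with a brute-force grid calculation:

```python
import numpy as np
from scipy.stats import beta
from scipy.special import beta as beta_fn
from scipy.integrate import trapezoid

# Hypothetical prior: weight a at theta0 = 1/2 plus weight (1 - a) on a
# Beta(2, 3) truncated to S = (1/2, 1]; Bernoulli data, y successes in n trials.
a, alpha, b_, y, n = 0.5, 2.0, 3.0, 8, 12

L0 = 0.5 ** n                                   # L(theta0 | D)
zS = 1 - beta(alpha, b_).cdf(0.5)               # prior mass of Beta(2, 3) on S

# Closed form: a* = a L(theta0) / k*, where
# k* = a L(theta0) + (1 - a) * (integral of L f over S) / zS,
# using the conjugate update for the truncated beta part.
kprime = (beta_fn(y + alpha, n - y + b_) / beta_fn(alpha, b_)
          * (1 - beta(y + alpha, n - y + b_).cdf(0.5)))
a_star = a * L0 / (a * L0 + (1 - a) * kprime / zS)

# Brute-force check of the same weight on a grid over S.
theta = np.linspace(0.5, 1.0, 400_001)
f_trunc = beta(alpha, b_).pdf(theta) / zS       # truncated prior density
L = theta ** y * (1 - theta) ** (n - y)
a_num = a * L0 / (a * L0 + (1 - a) * trapezoid(L * f_trunc, theta))
assert np.isclose(a_star, a_num, rtol=1e-6)
print(a_star)
```

With these (winner-heavy) data the spike weight shrinks from its prior value of 0.5, as expected.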
Application to KMart.
When sampling with replacement, the conjugate prior for $t$ (the true tally of the reported winner) is a beta distribution.

We showed earlier that KMart is equivalent to using a risk-maximising prior. Starting with any beta distribution, we can form the corresponding risk-maximising prior by truncating to $t \in (1/2, 1]$ and adding a probability mass of $1/2$ at $t = 1/2$. Based on the argument presented above, this prior is conjugate. Moreover, we can express the posterior in closed form.

Let the original prior be $t \sim \mathrm{Beta}(\alpha, \beta)$. Note that this $\alpha$ is just a hyperparameter and not a specified risk limit. The risk-maximising prior retains the functional form of this prior for $t > 1/2$ and also has a mass of $1/2$ at $t = 1/2$. After we observe a sample of size $n$ from the audit, we have a posterior with an updated probability mass at $t = 1/2$. This mass will be the upset probability. We can derive an expression for it using equations similar to those above (it will correspond to $a^*$ in the notation from above).

Let $f(t)$ be the pdf of the original beta prior, $F(t)$ be its cdf, $S = (1/2, 1]$ the truncation region, $F'(t)$ the cdf of the beta-distributed portion of the posterior (i.e. the posterior distribution if we use the original beta prior), and $B(\cdot, \cdot)$ be the beta function. We have
$$k^* = \frac{1}{2} \left( \frac{1}{2} \right)^n + \frac{1}{2} \cdot \frac{k'}{z_S},$$
where
$$z_S = \int_{1/2}^{1} f(t) \, dt = 1 - F\left(\frac{1}{2}\right)$$
and
$$k' = \int_{1/2}^{1} L(t \mid D) \, f(t) \, dt = \frac{B(Y_n + \alpha, n - Y_n + \beta)}{B(\alpha, \beta)} \left( 1 - F'\left(\frac{1}{2}\right) \right).$$
Putting these together gives
$$k^* = \frac{1}{2^{n+1}} + \frac{1}{2} \times \frac{B(Y_n + \alpha, n - Y_n + \beta)}{B(\alpha, \beta)} \times \frac{1 - F'\left(\frac{1}{2}\right)}{1 - F\left(\frac{1}{2}\right)}.$$
The upset probability is
$$a^* = \frac{1}{2^{n+1} \, k^*}.$$
These quantities will be straightforward to calculate as long as we have efficient ways to calculate:
1. the beta function;
2. the cdf of a beta distribution.
Both have fast implementations in R.
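The same calculation is equally straightforward in other languages; for instance, a Python sketch of the closed form (function and parameter names are ours):

```python
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

def upset_probability(alpha, beta_, y, n):
    """Closed-form upset probability a* = 1 / (2^(n+1) k*) for the
    risk-maximising prior built from a Beta(alpha, beta_) distribution:
    a mass of 1/2 at t = 1/2 plus the beta density truncated to (1/2, 1]."""
    zS = 1 - beta_dist(alpha, beta_).cdf(0.5)                         # 1 - F(1/2)
    kprime = (beta_fn(y + alpha, n - y + beta_) / beta_fn(alpha, beta_)
              * (1 - beta_dist(y + alpha, n - y + beta_).cdf(0.5)))   # uses 1 - F'(1/2)
    kstar = 0.5 ** (n + 1) + 0.5 * kprime / zS
    return 0.5 ** (n + 1) / kstar

# With alpha = beta_ = 1 this is the uniform risk-maximising prior of B.1:
# a single winner ballot gives an upset probability of 0.4 (since B_1 = 1.5).
print(upset_probability(1, 1, 1, 1))
```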