Regulating Accuracy-Efficiency Trade-Offs in Distributed Machine Learning Systems
A. Feder Cooper
Cornell University, Department of Computer Science, [email protected]
Karen Levy
Cornell University, Department of Information Science & Cornell Law School, [email protected]
Christopher De Sa
Cornell University, Department of Computer Science, [email protected]
Abstract
In this paper we discuss the trade-off between accuracy and efficiency in distributed machine learning (ML) systems and analyze its resulting policy considerations. This trade-off is in fact quite common in multiple disciplines, including law and medicine, and it applies to a wide variety of subfields within computer science. Accuracy and efficiency trade-offs have unique implications in ML algorithms because, being probabilistic in nature, such algorithms generally exhibit error tolerance. After describing how the trade-off takes shape in real-world distributed computing systems, we show the interplay between such systems and ML algorithms, explaining in detail how accuracy and efficiency interact particularly in distributed ML systems. We close by making specific calls to action for approaching regulatory policy for the emerging technology of real-time distributed ML systems.
Engineering is defined by trade-offs—by competing goals that need to be negotiated in order to meet system design requirements. One of the central trade-offs in engineering, particularly in the field of computer science, is between accuracy and efficiency. More specifically, there is an inherent tension between how correct computations are and how long it takes to compute them.

Framing the accuracy-efficiency trade-off as a cardinal trade-off in computing importantly differs from how Ohm and Frankle [69] discuss efficiency. Their work calls efficiency the "cardinal virtue" of computing in order to discuss what they view as exceptional cases of inserting inefficiency into computer systems—what they term "desirable inefficiency." Instead, viewing the accuracy-efficiency trade-off as central enables us to not refer to "inefficient" computing models as exceptional, and strikes us as a more precise and generalizable statement of the issues at stake. Therefore, rather than casting particularly inefficient computing models (e.g., cryptography) as exceptional—as Ohm and Frankle do—we can conceive of them as implementing the trade-off at one end of the accuracy-efficiency spectrum (privileging accuracy).

While this trade-off represents a general problem, it plays out in various ways across different subfields of computing. For example, in computer hardware, circuits can use approximation techniques to relax constraints on accuracy—on how they perform bitwise computations—in order to speed up performance. In image processing, varying numbers of pixels can be used to represent a given image; using fewer pixels causes a loss in accuracy of the image being represented, but also furthers space-efficiency by requiring less memory to store the image. These kinds of examples are abundant in computing. In fact, the trade-off is so ubiquitous and well-known to computer scientists that it has even given rise to its own subfield, approximate computing, which resides within the programming languages (PL) discipline [62, 63]. This subfield has shown that it is useful to analyze the accuracy-efficiency trade-off in the context of how error tolerant an application is—that is, how different domains resolve the question of how much inaccuracy can be permitted, while retaining guarantees about quality and safety [79].

While commonly acknowledged in some areas of computer science—perhaps to the point of mundanity—the policy implications of this trade-off remain relatively unexamined. We therefore focus this paper on analyzing its regulatory implications in the context of a particularly urgent technological domain—distributed machine learning. We argue that the accuracy-efficiency trade-off exposes a high-level abstraction that policymakers should use to provide regulatory interventions. That is, rather than operating at one of two extremes—either solely having policymakers rely on technical experts to make high-stakes policy decisions or inundating policymakers with underlying low-level technical details—we advocate for something in between: ML systems researchers should focus on providing guarantees concerning correctness and performance, and should build associated tools to help policymakers reason about these guarantees.
These tools should effectively expose the degree of uncertainty in distributed ML systems, thus facilitating lawmakers' ability to reason about and regulate the resulting risk of their deployment in high-stakes domains, such as autonomous cars and military drones.

To make this case, we organize the remainder of this paper as follows. In Section 2, we outline the general trade-off between accuracy and efficiency. We discuss how this trade-off is salient in disciplines other than computing (Section 2.1) and then pay particular attention to the various ways it can be used to analyze different subfields of computer science (Section 2.2). We then provide grounding for understanding how accuracy and efficiency are in tension with each other in distributed ML systems. To understand the specifics of the trade-off in this setting, we first outline separately how it unfolds in ML algorithms (Section 3) and distributed computing systems (Section 4.1), and then bring this discussion together to clarify emergent tensions in the subset of ML systems that serve high-stakes, real-time applications (Section 4.2). Based on the overarching themes we extract in our discussion, we then close with specific calls to action regarding these systems, both in ML systems research and in policy (Section 5).

Gathering increasingly accurate information is a computationally expensive task that is in tension with acting efficiently. In Section 2.1 we discuss how this tension plays out in various domains outside of computing, each with their own normative concerns. We then turn our attention in Section 2.2 to the various accuracy-efficiency trade-offs that are inherent throughout CS. Studying this trade-off in computing generally falls under the area of approximate computing. We use this discussion to ground our treatment in later sections of how such trade-offs play out uniquely in the context of machine learning (Section 3) and distributed machine learning systems (Section 4).
The trade-off at the heart of this paper is not unique to computing. It can be observed in a diverse range of disciplines, including economics, law, and medicine. In decision theory, the time-value of information is an important concept for making choices. There is a cost to gathering increasingly accurate information. Waiting to act based on information is itself an action—one that can have potentially more negative consequences than acting earlier on imperfect information. Kahneman and Tversky elaborate on this idea in their well-known cognitive psychology research concerning reasoning about uncertainty [50]. They argue that humans use various heuristics to make decisions more efficiently, often acting on biases they have due to incomplete information. There is a tension between taking the time to gather more information and making a more informed decision—between the speed of making a decision and the quality of information used to make it.

Sunstein connects this idea directly to the potential hazards of using heuristics in legal decision-making [84]. Nevertheless, he observes that heuristics are common—and necessary—in the law to obtain a suitable balance between efficient resolution and the "best" (i.e., most accurate) adjudicative outcomes. A number of rules in US civil and criminal procedure (e.g., speedy trial requirements, local rules imposing filing deadlines, statutes of limitations) impose time constraints for the sake of efficient case resolution; these values must be balanced against needs for thorough fact-finding and argumentation. Due process implicates both values. Some areas of law explicitly consider how to balance between them. For two examples, the standard for preliminary injunctive relief in the United States explicitly considers whether irreparable injury will occur because of the passage of time, if relief is not granted before the (often lengthy) full resolution of a case. Federal Rule of Evidence 403 allows for the exclusion of relevant evidence if "its probative value is substantially outweighed by a danger of ... undue delay, wasting time, or needlessly presenting cumulative evidence."

Debates about the merits of the so-called "precautionary principle" in policymaking also capture the accuracy-efficiency trade-off. The precautionary principle advises extreme caution around new innovations when there is substantial unknown risk; in operation, it effectively places the burden of proof on risk-creating actors (like chemical plants) to provide sufficient evidence that they are not producing significant risk of harm, rather than vice versa. As with speedy trials, there is a trade-off between the time it takes to gather evidence—to highlight the risk landscape—and capturing that landscape effectively. There are legal rationales on both sides of the spectrum with regard to how this trade-off should be implemented. For example, critics of the precautionary principle could be said to favor efficiency. They find the principle to be too stringent with regard to the burden it places on accuracy; it is "literally paralyzing" in its attempts to regulate risk [85]. On the other side, others argue that the precautionary principle provides a valuable way to reason about preventing harm by shifting the burden of proof of safety to potential risk creators. They are supportive of the fact that the principle requires actors to justify the risks they create: it is worth the time cost to gather information, such that it is possible to better manage risk in the context of scientific uncertainty [78].
Another urgent example of the trade-off in action arises in public health—specifically, in the context of the COVID-19 pandemic and antibody tests. The World Health Organization (WHO) has recently argued that, prior to certifying COVID-19 antibodies for treatment, it is necessary to guarantee that such antibodies confer immunity to the virus. Several medical professionals have challenged this mandate from WHO, highlighting the time-sensitive nature of taking action in the pandemic: "Demanding incontrovertible evidence may be appropriate in the rarefied world of scholarly scientific inquiry. But in the context of a raging pandemic, we simply do not have the luxury of holding decisions in abeyance until all the relevant evidence can be assembled. Failing to take action is itself an action that carries profound costs and health consequences" [93]. More generally, the authors claim that it is the norm for healthcare practitioners to act on incomplete information—to balance potential inaccuracies in available data with the urgency to treat serious conditions. To see this, consider an example in which a patient has a small growth on their lungs, and it is uncertain whether the growth is benign or malignant. A doctor may then be faced with the following choice: they can either perform a biopsy—a very invasive procedure—now, with the possibility that such an early stage test could yield inconclusive results; or, they can wait and see if it grows, and then biopsy it at a later stage where there can be more certain results regarding potential malignancy. The doctor is faced with a trade-off between potentially inaccurate information in the short term and a higher certainty of accuracy (with the potential danger of not having acted quickly enough) in the longer term.

In addition to these examples from the law, the accuracy-efficiency trade-off is salient in other aspects of governance. In particular, it plays an important role in wartime intelligence gathering. There is an inherent tension between gathering more accurate intelligence about an opponent or enemy and acting on that intelligence before it becomes stale and loses its usefulness. This is the so-called "fog of war" notion, which attempts to capture the relationship between how the uncertainty of information changes over time [90].

These examples all concern reasoning under uncertainty, and they get at the relationship between uncertainty and externalities of risk. Before we discuss this relationship more formally in the context of machine learning systems (Section 4), we demonstrate how the accuracy-efficiency trade-off is widely relevant across various domains in computing.
In computing, the accuracy-efficiency trade-off is a spectrum, not a binary decision, and it has implications in almost all computer science subfields (see Figure 1). To capture the intuition, let us consider a familiar example—image compression to generate JPEG images. Raw images tend to be very high resolution, meaning that they contain many pixels per inch. In order to capture each pixel, such images tend to require a lot of storage space. However, it is often not necessary to use so many pixels for a high-quality image. To the human eye, a compressed version of the raw image often suffices; removing some pixels or averaging and combining the values of neighboring pixels often is not detectable. Moreover, compressing a raw image to a JPEG takes up less computer storage space and can lead to faster image processing when doing photo editing, since there are fewer pixels to consider. In other words, reducing the accuracy of the image can lead to greater computational efficiencies when manipulating it.

The notion of this trade-off between accuracy and efficiency forms the basis of approximate computing. The main idea of this field is that a computer system can achieve certain performance benefits if it exerts less computational effort to compute perfectly accurate answers. In other words, it is possible to relax accuracy in order to yield efficiency improvements [62, 63, 79]. As with JPEGs, relaxing the accuracy does not necessarily have negative consequences; rather, it is possible that the decreased accuracy has no observable impact for a particular application. In other words, some applications are tolerant of inaccuracy; they are error resilient.
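To make the image example concrete, the short sketch below (our own, purely illustrative; the random array stands in for a raw image and the block size is an arbitrary choice) averages neighboring pixels into blocks, cutting storage at the cost of fine detail:

```python
import numpy as np

def downsample(image: np.ndarray, block: int = 4) -> np.ndarray:
    """Average block x block groups of pixels into one pixel.

    This trades accuracy (fine detail is lost) for efficiency
    (the result uses roughly 1/block**2 of the original storage).
    """
    h, w = image.shape[:2]
    h_trim, w_trim = h - h % block, w - w % block
    trimmed = image[:h_trim, :w_trim]
    # Reshape so each block gets its own axes, then average over them.
    blocks = trimmed.reshape(h_trim // block, block, w_trim // block, block, -1)
    return blocks.mean(axis=(1, 3))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    raw = rng.integers(0, 256, size=(512, 512, 3)).astype(float)  # stand-in for a raw image
    small = downsample(raw, block=4)
    print("raw storage (values):       ", raw.size)
    print("compressed storage (values):", small.size)
```

Whether the lost detail matters is exactly the application-specific error-tolerance question that approximate computing formalizes.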
Figure 1. Implementing the accuracy-efficiency trade-off in approximate computing: image compression (Section 2.2), bit precision representing numbers (Sections 2.2 & 3.2), distributed systems (Section 4.1), and sampling (Section 3.4).

While the term "approximate computing" is a fairly recent innovation, these ideas are as old as computers. One of the oldest examples comes from how computer hardware represents numerical values—particularly floating-point numbers, which, as opposed to integers, can have arbitrary precision; they may require (potentially infinite) decimal places to express. However, computers require discretization to store numbers in binary encoding; floating-point numbers are expressed in a finite number of bits, limiting their precision and how accurately they can reflect the values of the numbers they represent [35, 63].

Approximate computing not only demonstrates the existence of this trade-off, but also provides ways of formally characterizing it. In turn, this characterization can enable computer scientists to leverage the "right" trade-off in different application domains. For example, formal reasoning around the trade-off can yield application-specific quality metrics. Quality can be thought of as how a program conceives of "good enough" results. Often, the quality of "good enough" cannot be guaranteed with complete certainty, but can be verified with high probability. Leaving room for uncertainty allows for edge-case behaviors, which might fall below the specified quality threshold. Quality metrics therefore capture how much an approximation is allowed to deviate from the precise version's results. Computer scientists can therefore design software that requires a certain degree of program quality with a certain (high) probability [79]. (A popular example of this comes from Amazon's cloud computing services (AWS). Their cloud storage service provides "11 9's" of reliability with regard to storing data objects, meaning that 99.999999999% of the time saving such objects to the cloud occurs without error [1].)

Quality metrics are particularly salient in high-impact application areas. Consider an autonomous surgical robot or autonomous car. Both application domains require both accuracy and efficiency in order to be safe and reliable, but cannot maximize both properties at the same time due to the inherent trade-off between them. Instead, in each application domain, we need to have a sense of how much error we can tolerate in order to meet certain speed demands.

The same can be said for police use of facial recognition software. For example, as described in Dietterich [28], a study in South Wales found a false positive rate of 91.3% in a facial recognition application that tries to match faces with outstanding arrest warrants at public events. (There have been similar findings in other cities, such as Detroit [56].) While this application is not necessarily efficiency-sensitive—a human could, in theory, intervene to verify the accuracy of the results prior to acting—this would not necessarily be the case if such technology were integrated into police body cameras for the purpose of making in-the-moment (and potentially life-or-death) decisions. Advocates for this technology argue that in situations of "imminent danger," efficiency is crucial; for example, they contend that it is necessary to speedily identify a person-of-interest. (We do not argue in favor of this "need for speed" in law enforcement. Advocates for police reform in the US have argued for years that such a "need" is in fact constructed to benefit police in cases of misconduct [74].)
Accuracy in this case is equally important; in heightened-stress environments, mistaking someone for a person-of-interest has repeatedly proven catastrophic, particularly in the United States. Because of these competing technological goals, it is not clear exactly how approximate computing could be safe in this context, as the high stakes involved do not lend themselves to error resilience; it may not be safe to use such technology at all. In other high-impact legal contexts, the trade-off can potentially be reasoned about safely. Consider automated risk assessment tools [20, 83]. Accuracy in assessing risk is critical, but is not necessarily time-sensitive. Operating on the scale of minutes, hours, or even days might suffice, particularly if such time spans entail increases in accuracy. (While this observation speaks to the trade-off between accuracy and efficiency, we do not intend for it to be taken as an endorsement for using risk assessment tools in criminal law domains. We instead apply this example narrowly to explicate accuracy-efficiency considerations, without commenting on the normative implications of what accuracy means in this context or the desirability of the use of such tools.)

Several influential papers on artificial intelligence (AI) from the 1980s and 1990s also demonstrate the potentially high impact of appropriately dealing with accuracy-efficiency trade-offs [13, 47]. In particular, in a classic paper, Horvitz poses the question of how autonomous agents can effectively perform computations under tight computational resource constraints [47]. He discusses how approximations or heuristics can lead to more efficient resource utilization—at the cost of potentially less-correct computation. He frames this as a "time-precision tradeoff," in order to indicate how there is an inherent tension between the utility of a correct computation and how fast that computation is completed, in the context of evaluating reasoning under uncertainty for autonomous agents. (While "precision" and "accuracy" are different, there is a relationship between them. For our purposes, it is useful to think of the degree of precision as a mechanism for controlling how much accuracy is possible to achieve when performing a computation. For example, using fewer bits (i.e., low bit precision) to represent numbers can drastically affect the degree of accuracy of calculations done with those numbers, since this is effectively the same as doing computation on (potentially very highly) rounded numbers; see Section 3.2.)

This trade-off persists beyond classical AI to contemporary work in statistical ML, as ML's probabilistic nature has important implications for the relationship between accuracy and efficiency in ML models. Trained ML models perform inference that is not always correct, often tolerating a certain degree of inaccuracy. Being resilient to errors is necessary for producing robust models. This notion of error resilience (or inaccuracy tolerance) varies for different types of ML algorithms. Regardless of particular differences, there is a general tension between correctness and performance. The correctness of an ML algorithm can be understood as whether or not the algorithm converged to the distribution we set out to learn, i.e., Did we learn the right model?
Its performance indicates whether convergence to the distribution—whether correct or incorrect—happened in a timely manner, i.e., How fast did we learn the model? As with other approximate computing problems, ML can relax its demands on accuracy in order to achieve increases in efficiency. In fact, this relaxation is a requirement in many learning domains. Without it, inference computations can be so inefficient to perform that they become intractable. We describe five such cases below.
Performance directly relates to the size of the task on which we perform learning. Intuitively, if a learning algorithm is slow on small tasks—that is, tasks with small datasets—then that algorithm will be slow, if not computationally intractable, on much larger ones. More concretely, this relationship between runtime and task size often exists due to coupling between the computation done by the learning procedure's optimization algorithm and the task's dataset size. For example, when computing the gradient needed to determine which direction the learning algorithm should step for its next iteration, it is often necessary to sum over every data point in the dataset. As we show in Figure 2 with the Gradient Descent (GD) algorithm, for larger datasets this summation becomes increasingly costly.

Figure 2. The runtime for GD (a full-batch method using the whole dataset to compute gradients) is coupled with dataset size: as dataset size increases, so does runtime per iteration of the algorithm. In contrast, a subsampled, minibatch method like SGD (which here uses only 1 data point to compute the gradient) is decoupled from the dataset size: it maintains a relatively constant runtime per iteration.

A very common approach for improving efficiency is to use a subsample or minibatch of the dataset, rather than the whole dataset, when performing calculations. In the case of computing gradients, instead of using a "full batch" (i.e., the whole dataset), we use a randomly sampled subset of the data points, which entails spending less time on computation. Stochastic Gradient Descent (SGD) is an example of an algorithm that takes this approach. Using a minibatch can often have minimal impact on the overall accuracy of the learned model. A particular iteration of the algorithm will have less accuracy when computing the gradient (Figure 2); but, when run for lots of iterations, the final result can still be statistically correct. In expectation, we can learn the same distribution as if we had been using the whole dataset in each iteration; we can often theoretically guarantee robustness [14, 51].

Moreover, the decision to subsample is not all-or-nothing; it is a spectrum. It is possible to vary the minibatch size the algorithm uses. Larger minibatches—especially those that approach the size of the full dataset—require more time but are also more accurate per iteration. Conversely, smaller batch sizes make each iteration faster and more scalable to larger datasets, but in doing so sacrifice accuracy per iteration.
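To make the coupling concrete, the sketch below (ours, not the paper's released code; the synthetic least-squares objective and dataset are assumptions for illustration) computes one full-batch gradient, whose cost grows with the number of data points, and one minibatch gradient, whose cost depends only on the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)

def full_batch_gradient(w, X, y):
    # Sums per-example gradients of a least-squares loss over the entire
    # dataset: cost grows linearly with the number of data points.
    residuals = X @ w - y
    return X.T @ residuals / len(y)

def minibatch_gradient(w, X, y, batch_size=32):
    # Uses a random subset of the data: cost depends on batch_size,
    # not on the dataset size, at the price of a noisier gradient.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    residuals = X[idx] @ w - y[idx]
    return X[idx].T @ residuals / batch_size

# Synthetic regression task (placeholder data).
n, d = 100_000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

g_full = full_batch_gradient(w, X, y)   # touches all n points
g_mini = minibatch_gradient(w, X, y)    # touches only 32 points
# The minibatch gradient is an unbiased but noisy estimate of the full one.
print(np.linalg.norm(g_full - g_mini))
```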
Another common approach involves using low-precision representations of the numerical values on which the computer performs computations. This method, sometimes called quantization, is similar to the idea of floating-point precision—how much accuracy the computer can capture based on how many bits it uses to represent numbers—that we discussed in Section 2.2. Computing with more precise floating-point numbers is more computationally expensive; it tends to take more time (i.e., sacrifices efficiency) but can capture a more accurate range of results.

Much work in machine learning explores using low-precision numbers to achieve faster results. This work relaxes requirements on the accuracy of the trained model in order to achieve these speed-ups [4, 24, 26, 38, 39, 41]. As with the minibatching example in Section 3.1, this sacrifice in accuracy does not necessarily require sacrificing overall correctness if, in expectation, the algorithm can still theoretically guarantee learning the right distribution. There is also a spectrum at play here; similar to varying the minibatch size to tune the trade-off between accuracy and efficiency, it is possible to vary the number of bits of precision. More bits yield higher accuracy and slowdowns, while fewer bits require less time per computation and thus potentially sacrifice some correctness. Depending on a particular application's tolerance to error, this sacrifice in accuracy can be worth the speed-ups it creates [75].
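As a rough illustration of what reducing precision does to individual values (a toy fixed-point rounding scheme of our own, not any particular low-precision training method from the cited work), the snippet below quantizes numbers onto grids with different bit widths and reports the resulting rounding error:

```python
import numpy as np

def quantize(x, num_bits=8, x_min=-1.0, x_max=1.0):
    """Round values in [x_min, x_max] to the nearest of 2**num_bits levels.

    Fewer bits mean fewer representable levels, which is cheaper to store
    and compute with, but introduces larger rounding error on each value.
    """
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)

for bits in (16, 8, 4, 2):
    err = np.abs(quantize(x, bits) - x).mean()
    print(f"{bits:2d} bits -> mean absolute rounding error {err:.2e}")
```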
The prior examples discuss the cost of running computations. Specifically, they discuss how differently-sized batches of data (Section 3.1) and how differing degrees of numerical precision (Section 3.2) directly impact how long it takes a computer to execute a computation. Even though these examples concern a computer's behavior, we have not yet considered how the hardware specifications of the computer running the algorithm might also impact that behavior. Surely this is important, as different computers have different computing capabilities due to varying hardware; a NASA supercomputer has more computational resources than a personal laptop.

Recent years have seen an increase in the variety of computational devices available and a corresponding increase in the variety of computations we wish to run on them. For example, Internet of Things (IoT) devices and sensors, such as Google Home or Amazon Echo, perform inference. They serve up answers to spoken-language questions; however, they also have limited on-board capabilities to perform computations locally. These limitations take several forms. For example, such devices might not have a lot of power to process data quickly or might lack storage capacity for large amounts of data.

Often, these devices can communicate with more sophisticated computers over the Internet, offloading computation or storage to those computers. However, this communication exposes another trade-off between accuracy and efficiency; it takes time to send the data to a remote computer, perform some computation, and then return a response to the device [11]. That computation may be more accurate, but achieving that accuracy comes with a cost in speed. Conversely, doing the computation locally on the device would be faster; however, due to the device's more limited computational resources, it will not necessarily be very accurate. (Aside from being faster, there are several reasons why such local computation and storage might be desirable for a mobile application, as opposed to communicating with and offloading these requirements to more powerful remote computers. Notably, local computation can ensure privacy, as the learned model and collected data never leave the mobile device.)

Prior work considers how computer vision models can be learned and stored on a mobile device like a smartphone. For such resource-constrained devices, different applications have different needs in terms of how to trade off between how accurately and how quickly a computation is performed. Some prior work has explored these application-specific needs, providing an interface for flexibly implementing different points along the accuracy-efficiency trade-off spectrum. For example, MobileNets contains manually-tunable parameters that allow the model developer to strike the right balance for particular learning problems [48]. Depending on the application domain, the developer can tune a larger model that uses more resources (i.e., a model that is slower but more accurate) or one that is smaller and uses fewer resources (i.e., a model that is faster but less accurate).
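A back-of-the-envelope sketch of this choice appears below; all of the latency and accuracy numbers are illustrative assumptions rather than measurements of any particular device, network, or model:

```python
# Toy comparison of on-device vs. offloaded inference.
# Every number below is an illustrative assumption.

def end_to_end_latency_ms(compute_ms, network_round_trip_ms=0.0):
    """Total time the user waits: compute time plus any network round trip."""
    return compute_ms + network_round_trip_ms

# Assumed profile of a small on-device model: fast locally, less accurate.
local_latency = end_to_end_latency_ms(compute_ms=40.0)
local_accuracy = 0.88   # assumed

# Assumed profile of a large cloud-hosted model: faster hardware, but the
# request and response must cross the network.
remote_latency = end_to_end_latency_ms(compute_ms=10.0, network_round_trip_ms=120.0)
remote_accuracy = 0.95  # assumed

print(f"on-device: {local_latency:.0f} ms at ~{local_accuracy:.0%} accuracy")
print(f"offloaded: {remote_latency:.0f} ms at ~{remote_accuracy:.0%} accuracy")
# Which option is "right" depends on the application's latency budget and
# on how much error it can tolerate.
```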
We now delve into a slightly more sophisticated example. We consider a branch of ML that has recently proven particularly useful in the probabilistic modeling of data for Bayesian inference: Markov chain Monte Carlo (MCMC) sampling. To understand MCMC, it is first important to have an intuition regarding how sampling works. We will explain sampling by way of a simple, familiar example: flipping a coin.

When flipping a normal coin, it can either result in "heads" or "tails," with a 50% chance of yielding each. Let us consider that it is possible for a coin to be biased—that the coin is weighted in such a way that, when flipped, it yields heads 60% of the time. In order to figure out how biased the coin is—the probability that it yields heads—we flip the coin repeatedly to generate samples of the coin's behavior and record the results. That is, we flip the coin for multiple trials, and after each trial we update the estimated probability that the coin yields a heads result. We can view this updating probability as the information we are learning—we are generating a model of the coin's behavior, which we store as the probability of flipping heads. When we begin flipping the coin, there are not many generated samples. As a result, as shown in Figure 3, our estimation of the probability of heads might change a lot; it can update fairly erratically. Over time, as we generate more samples, the probability estimate becomes more stable. We converge to a probability that does not change very much, giving us a fairly good estimate of the coin's bias.
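A minimal sketch of this running estimate (our own illustration; the 60% bias matches the example in Figure 3) is:

```python
import numpy as np

rng = np.random.default_rng(0)
true_bias = 0.60          # probability of heads for the hypothetical weighted coin
num_flips = 5_000

flips = rng.random(num_flips) < true_bias                       # True means heads
running_estimate = np.cumsum(flips) / np.arange(1, num_flips + 1)

# Early estimates jump around; later ones settle near the true bias.
for n in (10, 100, 1_000, 5_000):
    print(f"after {n:5d} flips, estimated P(heads) = {running_estimate[n - 1]:.3f}")
```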
Figure 3. Generating samples of a coin flip to determine a coin's bias (in this case, 60% heads).

MCMC can be thought of as a more complicated instantiation of a sampling method like this. Instead of learning the probability of a biased coin, we are trying to learn the parameters of a desired probability distribution. To do this, we construct a Markov chain, from which we iteratively produce samples. Similar to updating the estimated bias of the coin after each sampling iteration, in each iteration of MCMC we update our estimation of the distribution parameters. Eventually, the values of the parameters become stable; we converge to an equilibrium in which the samples we continue to draw reflect the desired distribution [16].

While this technique is very powerful for accurately performing Bayesian inference, it comes with significant performance drawbacks. In particular, when the learning problem's dataset is large, the performance of an MCMC algorithm often suffers. Just as in Section 3.1, efficiency and scalability become limited due to computations that require summing over every data point in the dataset; performance is therefore (loosely speaking) inversely proportional to the size of the dataset. Additionally, just as before, we can lessen these limitations by introducing subsampling—by using a randomly selected minibatch of data instead of the whole dataset. However, as we have seen with the accuracy-efficiency trade-off, there is no free lunch; the speed-ups from minibatching can introduce inaccuracy. More specifically, we can lose the guarantee of converging to the correct desired distribution, which can yield potentially disastrous inference results [97]. Instead of yielding exact results, the randomness from using minibatches can introduce bias that entails inexact results.

Prior work makes the case that inexactness can be worth its performance gains—that it is better to be faster even if there is a risk of losing accuracy, since it can enable scaling up MCMC to big-data inference problems. As a result, there is a rich scholarly literature concerning inexact minibatch MCMC methods [19, 52, 80]. However, in practice, data science practitioners often do not use inexact methods; for reliability, they find that it is better to be slow and correct than fast and wrong. Recent work therefore attempts to construct new minibatch MCMC methods that retain exactness—methods that have theoretical guarantees regarding accuracy while also incorporating certain tricks and statistical insights that enable preserving some of the speed-ups minibatching provides [23, 61, 96–98]. In other words, these exact methods lean toward the accuracy side of the accuracy-efficiency trade-off; they guarantee converging to the correct, desired distribution, but to do so they sacrifice speed in relation to their inexact counterparts, particularly on some types of learning tasks.

Finally, we examine the trade-off in machine learning in asynchronous settings. The examples we have discussed so far are synchronous: there is one computer process that does all of the computation, one step at a time. In contrast, it is possible to run computations asynchronously, in which different computer processes or threads perform computations side-by-side. (A computer can run multiple processes at once. Each process is an instance of a running program—this is why one can run both an Internet browser and a text editor at the same time; processes allow for parallel tasks to run on one computer [6]. A thread is a further mechanism for parallelization, which operates below the level of a process: a process can have multiple threads running at the same time. For example, this is what allows a text editor, which runs in a process, to simultaneously display typing and syntax-error highlighting in real time; each of these functions happens in its own thread of computation.) This facilitates dividing computationally intensive tasks into parts, such that different portions can happen in parallel and then can be combined to compute the final result.

In other words, the parallelization from asynchrony can lead to speed-ups in ML since multiple parts of the learning problem can be computed at once. However, depending on how the parallel results are combined, it can also lead to decreases in accuracy.
That is, if different processes end up working on overlapping parts of the overarching computation, the process that finishes its computation second can overwrite the value computed by the one that finished first, causing inaccuracies in the results [5, 26, 58, 68]. This can be avoided by forcing the different processes to coordinate their updates, such that they do not overwrite each other. However, such coordination takes time; it enables more accuracy, but decreases efficiency. In some cases, this overwriting is worth the speed-ups it enables; it is still possible to compute good-quality learning estimates [25, 76]. Asynchrony is also complementary to the other examples in this section: it can be used in combination with minibatching, low precision, and MCMC to implement other types of accuracy-efficiency trade-offs.
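The toy example below (ours; it only mimics, rather than actually runs, concurrent workers) shows how two uncoordinated updates to a shared parameter can clobber one another, and how coordination preserves both at the cost of waiting:

```python
# Toy illustration of uncoordinated asynchronous updates (not a real
# multi-process implementation): two simulated workers read the same shared
# parameter, compute their updates from that stale copy, and write back one
# after the other, so the second write clobbers the first.

shared_param = 1.0
updates = {"worker_a": -0.3, "worker_b": +0.5}

# Uncoordinated: both workers read the parameter before either has written.
reads = {worker: shared_param for worker in updates}
for worker, delta in updates.items():
    shared_param = reads[worker] + delta   # overwrites the other worker's write
print("uncoordinated result:", shared_param)   # 1.5 -- worker_a's update is lost

# Coordinated: each worker re-reads the latest value before updating, which
# preserves both updates but (in a real system) costs waiting time.
shared_param = 1.0
for worker, delta in updates.items():
    shared_param = shared_param + delta
print("coordinated result:  ", shared_param)   # 1.2 -- both updates applied
```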
So far, our discussion does not take into consideration how the accuracy-efficiency trade-off behaves for ML in real-world, deployed systems—systems that often consist of multiple computers that communicate and work together to solve large, complex problems. Such systems often communicate asynchronously: instead of one computer doing multiple sub-computations at the same time (Section 3.5), there are multiple computers operating in parallel on the same problem. In the next section, we discuss how such real-world distributed ML systems raise unique concerns with regard to accuracy and efficiency.

Our overarching aim is to understand the particular tensions between accuracy and efficiency for distributed machine learning systems, and how these tensions differ from those we discussed regarding machine learning algorithms in Section 3. To make these distinctions clear, we first clarify some key ideas from distributed computing in Section 4.1. From this basis, we can then layer on more complexity in Section 4.2. We weave in our understanding of the accuracy-efficiency trade-off for ML algorithms from Section 3 and observe how the different tensions interact with each other. Considered together, we demonstrate how machine learning and distributed systems trade-offs present especially challenging problems for real-time, high-impact systems like autonomous vehicles. These real-time domains inform our policy discussion in Section 5.
In contrast with a single, solitary computer, a distributed system is a network of computers that communicate with each other. Via this communication, the computers can work together to solve problems. Each computer in the network has its own data and performs its own computations, and it shares data and computation results with other computers in the network when necessary. For example, if a computer needs data from another computer in order to execute a computation, it can request the data from that computer. (We touch on this topic only briefly, since our main focus is the behavior of such systems in the context of machine learning; for more detailed treatment, see Cooper [21] and Cooper and Levy [22].)

Because the computers are in distributed locations and need to communicate, there are important considerations with regard to how efficiently information can be shared between them. That is, when a computer contacts another in the system to request its data, it takes time to complete the request and receive the data—in direct opposition to efficiency. There are also issues of accuracy between computers in the system. Each computer has its own data—its own snapshot of what it knows to be the state of the overarching system. However, that information is not complete; it is just a subset and can possibly contradict the information that other computers in the system have. Simply put, the computers can be inconsistent with each other.

In other words, in distributed systems we can frame the trade-off between accuracy and efficiency as a tension between consistency and latency—the speed with which the system updates. There is a trade-off between all of the computers in the system having the same understanding of the data in the system and the time it takes to propagate that understanding throughout the system [2, 15]. Due to this trade-off, in distributed systems that update their data frequently it is actually quite difficult to quickly build a consistent, holistic understanding of the environment across different computers in the network.
This is because consistency is a moving target; each computer processes information locally faster than it can share it with the entire network. Given that it takes time to communicate, it is hard for computers to stay completely up to date with each other.

Nevertheless, for the sake of efficiency, individual computers in the system often need to make decisions in the presence of inconsistency. Otherwise, because of the tension between consistency and latency, waiting for complete consistency across computers before a computer could make local changes would bring the entire system to a standstill. Instead, particular distributed system implementations need to answer the question of how much inconsistency and slowness they can each tolerate, which is often application-dependent. To understand this spectrum, we will consider a few examples of distributed systems that implement the trade-off differently [27, 43].

First, consider a social media website, which has computers hosting its data distributed all over the world. A user visiting the site from a personal device tends to access the geographically closest computer server hosting the site; different users across the world therefore access different computer servers. Such a system favors efficiency (i.e., low latency) over the different computer servers being consistent with each other. It is more important to return the website to each user quickly than it is to make sure that every user is accessing the website with exactly the same data. This is why on some social media sites it is possible to see out-of-order comments on a feed; the site is making a best effort to resolve its current state, which entails aggregating information from across the system. It attempts to build a consistent picture, but limits how much time it spends doing so—sacrificing consistency—so that it can remain fast [27, 60, 89]. The system implements this choice via its communication strategy. Rather than contacting every computer in the system to construct a coordinated, consistent picture (which would take a lot of time), a particular computer only communicates with a subset. It trades off the accuracy it would get from communicating with every computer for the efficiency of communicating with fewer computers [42, 54].

In contrast to an efficiency-favoring social media site, blockchain technology is a distributed system for storing a transaction ledger that favors consistency at the cost of being slow [65]. In short, it is a distributed system where each computer has its own copy of the entire ledger. When a computer wants to add a transaction to the system, it has to broadcast that information to every computer in the network. All of the computers need to agree on the validity of a transaction before it can be included. As a result, the system proceeds in lockstep, only when there is coordinated agreement. (This is a tremendous oversimplification for brevity, since the point of introducing this example is to explain trade-offs between accuracy and efficiency. A more detailed treatment appears in Narayanan et al. [66].)

These different implementation choices reflect different design goals. The cloud was designed for e-commerce applications, in which supplying (even potentially inaccurate) responses quickly to the user is critical for user engagement [9, 17].
For blockchain systems, consistency is paramount; it is crucial that all of the computers agree with each other about the state of the ledger, because it is this agreement that facilitates its reliability as a transaction record.

While these two examples seem to imply that there is an all-or-nothing choice in the trade-off between consistency and latency in distributed systems, this is not the case. Like accuracy and efficiency more generally in approximate computing (Section 2.2), the trade-off between consistency and latency is a spectrum [2, 94]. It is possible to quantify consistency and to measure and monitor its maintenance throughout a distributed system [60, 81]. Developers can reason about the degree of inconsistency their particular system can tolerate safely, and can detect and tune the system's implementation accordingly to also enforce an upper bound on latency [10, 30, 37, 73, 86, 95].
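As a toy illustration of such tuning (our own simulation, not drawn from the cited systems), the sketch below models a read that waits for some number of replicas: waiting for more replicas raises latency, since the read is gated on slower machines, but tends to return fresher data:

```python
import random

random.seed(0)
NUM_REPLICAS = 5

def simulated_read(wait_for):
    """Wait for `wait_for` of the replicas; return (latency_ms, staleness_s).

    Each replica gets a random response time and a random data staleness.
    The read's latency is set by the slowest replica we wait for, and its
    staleness by the freshest replica among those we actually heard from.
    """
    replicas = [(random.uniform(5, 200), random.uniform(0, 10))
                for _ in range(NUM_REPLICAS)]
    contacted = sorted(replicas)[:wait_for]        # the fastest responders
    latency_ms = max(delay for delay, _ in contacted)
    staleness_s = min(stale for _, stale in contacted)
    return latency_ms, staleness_s

for wait_for in (1, 3, 5):
    lat, stale = simulated_read(wait_for)
    print(f"wait for {wait_for}/{NUM_REPLICAS} replicas: "
          f"latency ~{lat:6.1f} ms, staleness ~{stale:.1f} s")
```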
Given this background on how accuracy and efficiency are in tension with each other in distributed systems in general (Section 4.1) and our earlier discussion of accuracy-efficiency trade-offs in ML (Section 3), we can now specifically consider real-time (i.e., latency-critical) distributed ML systems. (For coherence with the main framing of the trade-off in this paper, we will use the language of accuracy and efficiency, rather than consistency and latency, going forward. However, as noted in Section 4.1, consistency and latency can be viewed as cases of accuracy and efficiency, respectively.) As an example, consider a distributed system of autonomous vehicles. Numerous vehicles are potentially networked together and with other devices, such as smart traffic lights. Moreover, while each vehicle moves throughout the environment with its own local notion of the state of the environment, information that other vehicles possess could also prove useful. For example, if an accident is up ahead, a vehicle closer to the crash can communicate that information to the vehicles behind it, which in turn can apply pressure to their brakes and potentially prevent a pile-up.

In such real-time transportation domains, accuracy and efficiency are both critical. Some ML inference applications may be error tolerant, but in high-stakes domains this may not always be the case; it is unclear how much inaccuracy will be tolerable while still ensuring safety [12]. (For more on the normative values at play in such situations, and how this is not merely a hypothetical situation—it in fact played a crucial role in the Uber AV crash in 2018 [12]—please refer to Cooper and Levy [22].)
The way such systems will need to treat efficiency is similar. They will need to make decisions quickly and, much like the non-computing examples in Section 2.1, there is an inherent trade-off between waiting to make a completely informed decision and making a decision fast enough for it to be useful [2, 15]. What is different here is the degree of efficiency needed—in some cases, inference decisions will be necessary at subsecond speeds.

In short, it is not entirely clear what the right design goal is for real-time systems like autonomous vehicles and how the trade-off should be implemented for them [28]. Given the dynamic nature of the environment, the particular trade-off implementation may depend on context. Some environments will be more efficiency-critical: it would be catastrophic for a car to take an extra half-second to be certain that there is a pedestrian directly in front of it. In other cases, having an accurate sense of the environment may be more important than allowing the cars to operate quickly. For example, when detecting a deep pothole up ahead, it could be safer for a car to slow down to decide its course of action—to accurately determine if the hole is shallow enough for the car to continue on its course or if the hole is deep enough to warrant veering off the road to avoid it.

Distributed ML systems raise different accuracy-efficiency questions than either distributed systems that do not involve ML, or ML systems that are not distributed. With regard to the former, the kinds of coordination and consistency issues that distributed systems can tolerate while maintaining correctness are different in nature than what newer ML systems can tolerate (particularly around issues like numerical error and staleness) [9, 27, 95]. With regard to the latter, as we saw in Section 3, since ML models (necessarily imperfectly) approximate representations of the world, it is possible for ML models to operate on data that are not completely accurate and still yield results that are correct enough—that fall within the same bounds of imperfection that we deem tolerable when operating on accurate data. We can extend such data inaccuracies beyond things like subsampling to include the data staleness inherent in asynchronous distributed settings. Allowing for such staleness comes with the benefit of increasing efficiency, as the system would not need to wait to synchronize—to completely resolve staleness issues—before proceeding with its computation. Similar to the single-computer case, the overall output can still be correct even when operating on numerically imprecise or stale data in a distributed setting; however, existing work in this field does not necessarily guarantee that such output must be correct [5, 31, 38, 58, 68, 77, 99].

Instead, prior work has examined this phenomenon at a high level by looking at the correctness and the performance of end-to-end ML systems, rather than directly evaluating the underlying accuracy-efficiency trade-off. This work focuses on empirical results for tuning the staleness of the underlying data storage layer. Tuning has generally either been manual—curated to the particular problem domain—or absent, leaving the user to pick from a few predefined settings that enforce high accuracy, ignore accuracy altogether for efficiency, or attempt some middle-ground, "in-between" approach [3, 46, 53, 57, 71].
Attempts at more flexible trade-offs have entailed very domain- or algorithm-specific solutions [59, 70, 91]. While it is possible to implement any of these different points in the trade-off, current large-scale systems for distributed learning and inference tend to opt for efficiency. They focus on minimizing communication between computers in the system in order to be efficient enough to scale to larger problems. Some of these systems can achieve orders of magnitude in performance improvements by dropping updates without simultaneously destroying correctness [68, 87]; however, it is not clear these approaches will work for real-time distributed ML systems that are safety-critical, such as autonomous vehicles. It will not always be feasible for these systems to lose updates. Existing approaches to mitigate such losses in ML systems involve increasing communication between computers in the system. However, this then impacts the other side of the accuracy-efficiency trade-off, leading to inefficiencies from bottlenecks in coordination between computers. (This problem is similar to what exists in weakly consistent storage systems, which have side-stepped the issue by using semantic information to coordinate "only when necessary" [8, 29, 36, 44, 92].)

We have taken considerable space to clarify a variety of accuracy-efficiency trade-offs—from how they generally impact the field of computing to how they describe the range of possible behaviors for distributed machine learning systems. More specifically, it is necessary and urgent to expose the accuracy-efficiency trade-off because it is a potential lever for regulation. Though various manifestations of the trade-off are well-acknowledged in technical communities, they have not, to date, been legible to policymakers. We argue that policymakers must understand the implications of the accuracy-efficiency trade-off in order to responsibly regulate emerging technologies.

As we have documented in Sections 2.2-4.2, this trade-off is not binary; it is a spectrum and can be treated like a tunable dial set appropriately to the context. Our hope is that exposing this dial will provide a certain degree of technical transparency to lawmakers, such that high-stakes systems do not get deployed without sufficient public oversight [21, 22]. Contemporary policy debates about high-stakes, time-sensitive machine learning applications—in domains like policing, warfare, and public health—often involve concerns about what degree of accuracy we ought to demand from such systems. These concerns often arise in the course of attempting to minimize disparate outcomes across groups (e.g., differential accuracy rates for face recognition along dimensions of race and gender [18]). But debates about the harms of inaccuracy are incomplete if they fail to acknowledge and reckon with the technical trade-off between accuracy and efficiency. Accuracy may necessarily be limited when speed is essential, and as we have seen, the speed of decision-making can implicate important public values as well [22]. Informed policy debate about machine learning must pay attention to the limits imposed by this trade-off.

Beyond exposing this trade-off, we also propose a twofold call to action. The first portion of this call is for computer scientists.
While our work here exposes the trade-off between accuracy and efficiency and how to engage with it—to build systems that can prioritize application-dependent balances between the two—it also indicates gaps in existing approaches to real-time ML systems. These gaps imply that existing systems will likely not suffice for high-stakes, emerging applications such as autonomous vehicles. In particular, in Section 4.2 we explain the importance of making the accuracy-efficiency trade-off transparent in a system's implementation; a system's ability to be assessed with regard to this trade-off should be considered as important as every other technical feature. A potential future research direction could mathematically formalize the semantics of the trade-off in ML systems. This could enable building tools to optimally tune the trade-off between consistency and latency for different classes of distributed ML algorithms, balancing their individual accuracy and efficiency needs.

Such tools would also provide policymakers with insight into how certain implementation decisions impact overall system behavior. This is crucial because, as we have shown throughout this paper, low-level technical decisions are not trivial; they should not be dismissed as "just implementation details" left up to the whims of engineers without public oversight. To be clear, we are not claiming that policymakers need to understand the full extent of low-level technical details to provide this oversight. Rather, we are suggesting that surfacing the higher-level trade-offs that lower-level decisions entail clarifies valid sites for potential policy intervention [22, 49, 64]. One can then think of such trade-offs as the right layer of abstraction with which policymakers can engage. At this level, policymakers can reason about the normative values and policy goals implicated by the trade-off in different domains [22, 32, 34]. The case of the accuracy-efficiency trade-off, for example, can be used to clarify how lower-level engineering decisions relate to notions of safety and quality [79].

It is this reasoning that informs the second part of our call to action: policymakers should view the accuracy-efficiency trade-off as a regulable decision point at which they can meaningfully intervene. They can use these trade-offs to assess the expected behavior of real-time ML systems. As a result, we can fairly pose to policymakers questions like the following: At what point should we deem information of sufficiently high quality to justify the execution of potentially high-impact decisions by technical systems? When is it safe for a system to spend more time computing inference outcomes, particularly when more efficient heuristics do not sufficiently remove uncertainty from automated decision-making?

In other words, by giving policymakers the tools to reason about these higher-level trade-offs, we are able to take a step toward closing what Jasanoff terms the "responsibility gap." That is, policymakers will have a more complete understanding of technology and will be better equipped to gauge the range of possibilities for its governance. This way, when technological failures occur, rather than viewing them simply as "unintended consequences" or "normal accidents" [72], policymakers can more actively participate in the evaluation of how uncertainty in probabilistic, automated decision systems contributes to the construction of risk [45, 49].
This two-pronged call to action highlights the relationship between uncertainty and risk in distributed machine learning systems. By providing a mechanism to reason about the accuracy-efficiency trade-off, computer scientists expose a particular kind of decisional uncertainty that depends on time [13, 47]. Clarifying this uncertainty does not, however, identify specific risks that these automated decisions bring about. Given the uncertainties involved, it is up to regulators to frame potential risks and to identify the normative, domain-specific values at play [33, 49]. That is, while computer scientists can reason about how much error is tolerable due to the accuracy-efficiency trade-off (Section 2.2), we contend that policymakers and regulators need to determine how much of the resulting risk is tolerable.

In select cases, in which it is possible to deem the amount of predetermined risk to be intolerable, policymakers could disallow particular technical systems from widespread deployment [28, 72, 78]. However, in most cases, it may not be possible to preemptively fully analyze the risk landscape [82, 85]. Instead, this is where exposing the trade-off between accuracy and efficiency can lead to accountability after the fact [7, 40, 55]. In other words, when deployed in the wild for long enough, real-time, high-stakes ML systems are likely, due to their complexity, to incur harm [67, 72, 88]. Given that this is unavoidable, it is important to build tools like those we call for in Section 5. This way, it will be possible to determine if a system has deviated further than expected from its normal (what we consider to be acceptable) behavior [79]—cases in which policymakers and regulators can hold the appropriate stakeholders to account.

Acknowledgments
This work was made possible by generous funding from Adrian Sampson and the John D. and Catherine T. MacArthur Foundation. We would like to thank Jaime Ashander, Ken Birman, Em Feder Cooper, Thomas G. Dietterich, James Grimmelmann, Ido Kilovaty, Kristian Lum, Alan Mackworth, Helen Nissenbaum, Alec Pollak, Fred B. Schneider, and Matthew Sun for their comments and suggestions on various versions of this work—with particular appreciation given to Harry Auster for his incisive feedback.
Code
The code used to generate Figures 2 and 3 can be found at https://github.com/pasta41/lml-2020.

References

[1] 2020. Amazon S3 Storage Classes. https://aws.amazon.com/s3/storage-classes/
[2] Daniel Abadi. 2012. Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer.
[3] In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah, GA, USA) (OSDI'16). USENIX Association, Berkeley, CA, USA, 265–283.
[4] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 1709–1720.
[5] Dan Alistarh, Christopher De Sa, and Nikola Konstantinov. 2018. The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory. CoRR abs/1803.08841 (2018). arXiv:1803.08841
[6] Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. 2018. Operating Systems: Three Easy Pieces (1.00 ed.). Arpaci-Dusseau Books.
[7] Michael Backes, Peter Druschel, Andreas Haeberlen, and Dominique Unruh. 2009. CSAR: A Practical and Provable Technique to Make Randomized Systems Accountable. In Proceedings of the Network and Distributed System Security Symposium, NDSS 2009, San Diego, California, USA, 8th February - 11th February 2009.
[8] B. R. Badrinath and Krithi Ramamritham. 1992. Semantics-based Concurrency Control: Beyond Commutativity. ACM Trans. Database Syst. 17, 1 (March 1992), 163–199.
[9] Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. 2012. Probabilistically Bounded Staleness for Practical Partial Quorums. Proc. VLDB Endow. 5, 8 (April 2012), 776–787.
[10] D. Barbara and H. Garcia-Molina. 1990. The case for controlled inconsistency in replicated data. In Proceedings. Workshop on the Management of Replicated Data (1990). 35–38.
[11] Ken Birman, Bharath Hariharan, and Christopher De Sa. 2019. Cloud-Hosted Intelligence for Real-Time IoT Applications. SIGOPS Oper. Syst. Rev. 53, 1 (July 2019), 7–13.
[12] National Transportation Safety Board. 2019. Collision Between Vehicle Controlled by Developmental Automated Driving System and Pedestrian. Technical Report. Tempe, Arizona, USA.
[13] Mark Boddy and Thomas L. Dean. 1994. Deliberation Scheduling for Problem Solving in Time-Constrained Environments. Artif. Intell.
[14] SIAM Rev. 60, 2 (Jan 2018), 223–311.
[15] Eric Brewer. 2012. CAP Twelve Years Later: How the "Rules" Have Changed. Computer 45, 2 (2012), 23–29.
[16] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC Press.
[17] Jake Brutlag. 2009. Speed matters for Google web search.
[18] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, New York, NY, USA, 77–91.
[19] Tianqi Chen, Emily Fox, and Carlos Guestrin. 2014. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning. 1683–1691.
[20] Alexandra Chouldechova. 2017. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 5, 2 (Jun 2017), 153–163.
[21] A. Feder Cooper. 2018. Imperfection is the Norm: A Computer Systems Perspective on IoT and Enforcement. (2018). https://law.yale.edu/isp/events/imperfect-enforcement (Im)Perfect Enforcement Conference.
[22] A. Feder Cooper and Karen Levy. 2020. Distributing Accountability and Distributed Computing: Policy Implications in Real-Time Computer Systems. (2020). Under submission.
[23] Robert Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet. 2019. Scalable Metropolis-Hastings for exact Bayesian inference with large datasets. International Conference on Machine Learning (2019).
[24] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. arXiv:1511.00363 [cs.LG]
[25] Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. 2018. HOGWILD!-Gibbs Can Be PanAccurate. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (Montréal, Canada) (NIPS '18). Curran Associates Inc., Red Hook, NY, USA, 32–41.
[26] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 561–574.
[27] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's Highly Available Key-value Store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA) (SOSP '07). ACM, New York, NY, USA, 205–220.
[28] Thomas G. Dietterich. 2018. Robust artificial intelligence and robust human organizations. Frontiers of Computer Science 13, 1 (Dec 2018), 1–3.
[29] Lisa Cingiser DiPippo and Victor Fay Wolfe. 1997. Object-Based Semantic Real-Time Concurrency Control with Bounded Imprecision. IEEE Trans. on Knowl. and Data Eng. 9, 1 (Jan. 1997), 135–147.
[30] W. Du and A. Elmagarmid. 1989. Quasi Serializability: A Correctness Criterion for Global Concurrency Control in InterBase. In Proceedings of the 15th International Conference on Very Large Data Bases (Amsterdam, The Netherlands) (VLDB '89). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 347–355.
[31] Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. 2018. Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD. arXiv:1803.01113 [stat.ML]
[32] M. Flanagan, Daniel Howe, and H. Nissenbaum. 2008. Embodying values in technology: Theory and practice. Information Technology and Moral Philosophy (01 2008), 322–353.
[33] Mary Flanagan and Helen Nissenbaum. 2014. Values at Play in Digital Games. The MIT Press.
[34] Batya Friedman and David G. Hendry. 2019. Value Sensitive Design: Shaping Technology with Moral Imagination. The MIT Press.
[35] Batya Friedman and Helen Nissenbaum. 1996. Bias in Computer Systems. ACM Trans. Inf. Syst. 14, 3 (July 1996), 330–347.
[36] Hector Garcia-Molina. 1983. Using Semantic Knowledge for Transaction Processing in a Distributed Database. ACM Trans. Database Syst. 8, 2 (June 1983), 186–213.
[37] Wojciech Golab, Xiaozhou Li, and Mehul A. Shah. 2011. Analyzing Consistency Properties for Fun and Profit. In Proceedings of the 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (San Jose, California, USA) (PODC '11). ACM, New York, NY, USA, 197–206.
[38] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. 2014. Compressing Deep Convolutional Networks using Vector Quantization. CoRR abs/1412.6115 (2014). arXiv:1412.6115
[39] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (Lille, France) (ICML'15). JMLR.org, 1737–1746.
[40] Andreas Haeberlen, Petr Kouznetsov, and Peter Druschel. 2007. PeerReview: Practical Accountability for Distributed Systems. SIGOPS Oper. Syst. Rev. 41, 6 (Oct. 2007), 175–188.
[41] Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs.CV]
[42] Joseph M. Hellerstein and Peter Alvaro. 2019. Keeping CALM: When Distributed Consistency is Easy. CoRR abs/1901.01930 (2019). arXiv:1901.01930
[43] M. Herlihy. 1990. Apologizing Versus Asking Permission: Optimistic Concurrency Control for Abstract Data Types. ACM Trans. Database Syst. 15, 1 (March 1990), 96–124.
[44] Nathaniel Herman, Jeevana Priya Inala, Yihe Huang, Lillian Tsai, Eddie Kohler, Barbara Liskov, and Liuba Shrira. 2016. Type-aware Transactions for Faster Concurrent Code. In Proceedings of the Eleventh European Conference on Computer Systems (London, United Kingdom) (EuroSys '16). ACM, New York, NY, USA, Article 31, 16 pages.
[45] Stephen Hilgartner. 1992. The Social Construction of Risk Objects: Or, How to Pry Open Networks of Risk. In Organizations, Uncertainties, and Risk, James F. Short and Lee Clark (Eds.). 39–53.
[46] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and Eric P. Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1223–1231.
[47] Eric J. Horvitz. 1987. Reasoning about Beliefs and Actions under Computational Resource Constraints. In Proceedings of the Third Conference on Uncertainty in Artificial Intelligence (Seattle, WA) (UAI '87). AUAI Press, Arlington, Virginia, USA, 429–447.
[48] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs.CV]
[49] Sheila Jasanoff. 2016. The Ethics of Invention: Technology and the Human Future. New York: W.W. Norton & Company.
[50] Daniel Kahneman, Paul Slovic, and Amos Tversky. 1982. Judgment under Uncertainty: Heuristics and Biases. New York: Cambridge University Press.
[51] Jürgen Kiefer and Jacob Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function.
[52] Anoop Korattikara, Yutian Chen, and Max Welling. 2014. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning. 181–189.
[53] Jack Kosaian, K. V. Rashmi, and Shivaram Venkataraman. 2019. Parity Models: Erasure-Coded Resilience for Prediction Serving Systems. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 30–46.
[54] Sudha Krishnamurthy, William Sanders, and Michel Cukier. 2002. An Adaptive Framework for Tunable Consistency and Timeliness Using Replication. (05 2002).
[55] B. W. Lampson. 2004. Computer Security in the Real World. Computer 37, 6 (June 2004), 37–46.
[56] Timothy B. Lee. 2020. Detroit police chief cops to 96-percent facial recognition error rate. Ars Technica (June 2020). https://arstechnica.com/tech-policy/2020/06/detroit-police-chief-admits-facial-recognition-is-wrong-96-of-the-time
[57] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Broomfield, CO) (OSDI'14). USENIX Association, Berkeley, CA, USA, 583–598.
[58] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2017. Asynchronous Decentralized Parallel Stochastic Gradient Descent. arXiv:1710.06952 [math.OC]
[59] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. 2011. Don't Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (Cascais, Portugal) (SOSP '11). ACM, New York, NY, USA, 401–416.
[60] Haonan Lu, Kaushik Veeraraghavan, Philippe Ajoux, Jim Hunt, Yee Jiun Song, Wendy Tobagus, Sanjeev Kumar, and Wyatt Lloyd. 2015. Existential Consistency: Measuring and Understanding Consistency at Facebook. In Proceedings of the 25th Symposium on Operating Systems Principles (Monterey, California) (SOSP '15). ACM, New York, NY, USA, 295–310.
[61] Dougal Maclaurin and Ryan Prescott Adams. 2015. Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[62] Sparsh Mittal. 2016. A Survey of Techniques for Approximate Computing. ACM Comput. Surv. 48, 4, Article 62 (March 2016), 33 pages.
[63] Thierry Moreau, Joshua San Miguel, Mark Wyse, James Bornholt, Armin Alaghi, Luis Ceze, Natalie Enright Jerger, and Adrian Sampson. 2018. A Taxonomy of General Purpose Approximate Computing Techniques. IEEE Embed. Syst. Lett. 10, 1 (March 2018), 2–5.
[64] D. K. Mulligan and K. A. Bamberger. 2018. Saving governance-by-design. California Law Review 106 (06 2018), 697–784.
[65] Satoshi Nakamoto. 2009. Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf
[66] Arvind Narayanan, Joseph Bonneau, Edward Felten, Andrew Miller, and Steven Goldfeder. 2016. Bitcoin and Cryptocurrency Technologies: A Comprehensive Introduction. Princeton University Press, USA.
[67] Helen Nissenbaum. 1996. Accountability in a Computerized Society. Science and Engineering Ethics.
[68] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. 2011. HOGWILD!: A Lock-free Approach to Parallelizing Stochastic Gradient Descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS'11). Curran Associates Inc., USA, 693–701.
[69] Paul Ohm and Jonathan Frankle. 2019. Desirable Inefficiency. In Florida Law Review, Vol. 70, Issue 4. 777–836.
[70] Xinghao Pan, Joseph Gonzalez, Stefanie Jegelka, Tamara Broderick, and Michael I. Jordan. 2013. Optimistic Concurrency Control for Distributed Unsupervised Learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS'13). Curran Associates Inc., USA, 1403–1411.
[71] Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, Ce Zhang, Michael I. Jordan, Kannan Ramchandran, and Christopher Ré. 2016. Cyclades: Conflict-free Asynchronous Machine Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 2568–2576.
[72] Charles Perrow. 1999. Normal Accidents: Living with High Risk Technologies - Updated Edition. Princeton University Press, Princeton, New Jersey.
[73] K. Ramamritham and C. Pu. 1995. A formal characterization of epsilon serializability. IEEE Transactions on Knowledge and Data Engineering 7, 6 (1995), 997–1007.
[74] Rick Rojas and Richard Fausset. 2020. The New York Times (14 June 2020).
[75] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, and Christopher Ré. 2018. High-Accuracy Low-Precision Training. (2018). arXiv:1803.03383
[76] Christopher De Sa, Chris Ré, and Kunle Olukotun. 2016. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA, 1567–1576.
[77] Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. 2015. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms. CoRR abs/1506.06438 (2015). arXiv:1506.06438
[78] Noah Sachs. 2011. Rescuing the Strong Precautionary Principle from its Critics. University of Illinois Law Review.
[79] Hardware and Software for Approximate Computing. Ph.D. Dissertation. University of Washington.
[80] Daniel Seita, Xinlei Pan, Haoyu Chen, and John Canny. 2016. An efficient minibatch acceptance test for Metropolis-Hastings. (2016). arXiv:1610.06848
[81] Zechao Shang, Jeffrey Xu Yu, and Aaron J. Elmore. 2018. RushMon: Real-time Isolation Anomalies Monitoring. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). ACM, New York, NY, USA, 647–662.
[82] Henry E. Smith. 2015. Equity as Second-Order Law: The Problem of Opportunism. In Harvard Public Law Working Paper No. 15-13.
[83] Sonja B. Starr. 2014. Evidence-Based Sentencing and the Scientific Rationalization of Discrimination. Stanford Law Review 66 (2014), 803–872.
[84] Cass R. Sunstein. 2002. Hazardous Heuristics. U Chicago Law & Economics (2002).
[85] Cass R. Sunstein. 2003. Beyond the Precautionary Principle. U Chicago Law & Economics (2003).
[86] Francisco J. Torres-Rojas, Mustaque Ahamad, and Michel Raynal. 1999. Timed Consistency for Shared Distributed Objects. In PODC.
[87] J. Tsitsiklis, D. Bertsekas, and M. Athans. 1986. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Automat. Control 31, 9 (1986), 803–812.
[88] Diane Vaughan. 1996. The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press.
[89] Werner Vogels. 2009. Eventually Consistent. Commun. ACM 52, 1 (Jan. 2009), 40–44.
[90] Carl von Clausewitz. 1832. Vom Kriege. Ferdinand Dümmler.
[91] Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2015. Managed Communication and Consistency for Fast Data-parallel Iterative Analytics. In Proceedings of the Sixth ACM Symposium on Cloud Computing (Kohala Coast, Hawaii) (SoCC '15). ACM, New York, NY, USA, 381–394.
[92] W. E. Weihl. 1988. Commutativity-based concurrency control for abstract data types. IEEE Trans. Comput. 37, 12 (1988), 1488–1505.
[93] M. C. Weinstein, K. A. Freedberg, E. P. Hyle, and A. D. Paltiel. 2020. Waiting for Certainty on Covid-19 Antibody Tests - At What Cost? New England Journal of Medicine (2020).
[94] Haifeng Yu and Amin Vahdat. 2000. Design and Evaluation of a Continuous Consistency Model for Replicated Services. In Proceedings of the 4th Conference on Symposium on Operating System Design & Implementation - Volume 4 (San Diego, California) (OSDI'00). USENIX Association, Berkeley, CA, USA, Article 21.
[95] Haifeng Yu and Amin Vahdat. 2000. Efficient Numerical Error Bounding for Replicated Network Services. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 123–133.
[96] Ruqi Zhang, A. Feder Cooper, and Christopher De Sa. 2020. AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC. International Conference on Artificial Intelligence and Statistics (2020).
[97] Ruqi Zhang, A. Feder Cooper, and Christopher De Sa. 2020. Asymptotically Optimal Exact Minibatch Metropolis-Hastings. ArXiv preprint (2020).
[98] Ruqi Zhang and Christopher M. De Sa. 2019. Poisson-Minibatching for Gibbs Sampling with Convergence Rate Guarantees. In Advances in Neural Information Processing Systems. 4923–4932.
[99] Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. 2015. Staleness-aware Async-SGD for Distributed Deep Learning.