Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
Francisco Romero*
Stanford University
Mark Zhao*
Stanford University
Neeraja J. Yadwadkar
Stanford University
Christos Kozyrakis
Stanford University
Abstract
The proliferation of camera-enabled devices and large video repositories has given rise to a diverse set of video analytics applications. The video pipelines for these applications are DAGs of operations that transform videos, process extracted metadata, and answer questions such as, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation, such as the sampling rate, batch size, or type of hardware used. However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on the desired latency target and the input video contents, which may exercise different paths in the DAG and produce different volumes of intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. Hence, we observe that they often execute inefficiently and fail to meet latency and cost targets.

We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Llama optimizes the overall video pipeline latency by (a) dynamically calculating latency targets per operation invocation, and (b) dynamically running a cost-based optimizer to determine efficient configurations that meet the target latency for each invocation. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. It also allows us to dynamically target heterogeneous hardware and minimize execution cost. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.9× lower latency and 17.2× cost reduction on average.

PVLDB Reference Format:
Francisco Romero*, Mark Zhao*, Neeraja J. Yadwadkar, and Christos Kozyrakis. Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX
Video traffic is exploding in scale, predicted to account for over 82% of all internet traffic by 2022 [5]. A myriad of domains use video pipelines, with tens of video analytics and processing operations, to extract meaningful information from raw videos. For example, an AMBER Alert application will typically leverage traffic feed cameras across a city to pinpoint specific individuals and cars [6]. To do so, the application uses a pipeline to detect frames with people and/or cars, and match them to specific individuals' faces and car descriptions, respectively. As another example, an ornithologist (a scientist who studies birds) looking to study a specific bird species may deploy multiple cameras where the birds reside or eat [29, 45]. To analyze the feeds, the ornithologist can build a pipeline to detect the presence of birds, classify detected birds into specific bird species, and enhance frames corresponding to specific species for further study. As video analytics research continues to flourish, we expect a perpetual proliferation of emerging domains that depend on video pipelines, such as smart cities [20, 39, 42, 74], surveillance analytics [23], healthcare [41], and retail [19].

The pervasive use of video analytics applications has led to significant challenges. Video pipelines must meet a wide range of latency, throughput, and cost targets in order to be practical across applications. For example, a pipeline that detects cars and people in a traffic feed should be tuned to be more cost efficient for city traffic planners who have relaxed latency targets, while the same pipeline must be tuned to meet strict latency targets for AMBER Alert responders [6]. Video analytics and processing frameworks must tune pipeline operation knobs, such as the sampling rate, batch size, hardware target, and resource allocation, in order to meet the unique latency or cost requirements of diverse analytics applications. However, automatically tuning these knobs is difficult for the following reasons.
Tuning knobs is difficult in the presence of heterogeneous hardware.

Hardware accelerators, in the form of GPUs [26], FPGAs [36, 67], TPUs [44], and vision chips [1], can provide significant performance benefits for many video analytics and processing operations. Selecting the right values for tuning knobs across these heterogeneous accelerators can have a huge impact on the performance and efficiency of video pipelines. We observed a 3.7× latency variation by tuning CPU cores, GPU memory, and batch size for operations in a representative AMBER Alert pipeline processed using Scanner [57]. While recent research has proposed mechanisms to tune operation knobs based on resource usage [39, 42, 70], these mechanisms are limited to simple pipelines and homogeneous hardware platforms. Furthermore, they rely on hours to days of profiling and must repeat it for each new pipeline, video, and latency target [18, 74].

Video pipelines can have input-dependent execution flow.

An input video's contents influence the execution flow of a pipeline in two ways. First, the number of intermediate outputs of an operation may depend on the frame being processed. For the ornithologist's pipeline, the bird detection operation will output cropped bird images to be analyzed and enhanced by subsequent operations. Second, downstream operations may be conditionally executed based on the intermediate output. For example, we will only enhance frames with the specific bird species. Consequently, tuning configuration knobs and resource allocations dynamically based on the video content is critical for performance and efficiency. We found that the static configurations made by gg [34], a general-purpose serverless framework, degraded performance by as much as 64% for an AMBER Alert pipeline with an input-dependent execution flow. Existing systems, such as VideoStorm [74] and GrandSLAm [47], only support simple sequential pipelines with deterministic flow. Scanner [57] and Focus [39] provision resources statically.

Systems that use serverless platforms as their backend, such as ExCamera [35], gg [34], PyWren [43], and Sprocket [22], execute applications by using thousands of short-lived functions [2, 4, 8]. The function-level resource allocation offered by serverless platforms makes them an attractive option for processing video pipelines, as it enables dynamic tuning for each operation invoked. However, existing serverless offerings lack support for heterogeneous hardware accelerators and for application constraints such as tail latency targets. Thus, similar to cluster-based systems [57, 59, 73], meeting diverse latency targets for video pipelines on serverless platforms requires users to manually, and perhaps exhaustively, explore operation knobs and resource allocation options.

There are also several frameworks that enable users to run general analytics dataflows and query execution [32, 40, 54, 55, 73], as well as proposed optimizations and techniques for executing these frameworks' workflows on a broad range of execution engines [37] and hardware platforms [58]. These frameworks are not well-suited to the needs of dominant and emerging video analytics operations, such as encoding and decoding videos [35, 57], trading off cost for both latency and quality [42, 74], and querying large video datasets [29, 39]. They also rely on users to allocate the right amount of resources and configure operation knobs to meet diverse application targets.

We present Llama, a video analytics and processing framework that supports heterogeneous hardware and automatically tunes each operation invocation to meet diverse latency targets for the overall video pipeline. Llama is a full-fledged serverless framework that is not built on a commercial serverless platform. Instead, it provides a serverless experience to its users, who do not need to express the type or amount of resources needed to meet their overall pipeline latency targets.

Llama relies on two key ideas to meet diverse latency targets. First, Llama dynamically computes how much time can be spent on each invocation to meet the pipeline latency target (i.e., per-invocation latency targets). By computing a per-invocation latency target, Llama can dynamically explore the configuration space for each invocation and adapt to performance volatility and input-dependent execution flow.
Second, Llama introduces a cost-based optimizer that is run dynamically to determine the most efficient operation configuration that meets the per-invocation target. To do so, Llama (a) uses early speculation and late commit, a technique for choosing an initial operation knob configuration during pipeline processing and revisiting the configuration right before execution, (b) leverages priority-based commit to prioritize operations based on hardware affinity and DAG dependencies, and (c) employs safe delayed batching, which batches operations for efficiency as long as doing so does not violate per-invocation targets.

Figure 1: Simple DAGs that can be used to compose complex video pipelines: (a) sequential, (b) parallel, and (c) branching.

Figure 2: An AMBER Alert pipeline that finds faces and cars in a video.

We deploy Llama on Google Cloud Platform with serverless CPU and GPU backends and evaluate its efficiency and ability to meet latency targets for five video analytics and processing pipelines. By dynamically configuring operations for both CPU and GPU based on pipeline latency targets, Llama achieves on average 7.9× latency improvement and 17.2× cost reduction compared to three state-of-the-art cluster and serverless video analytics and processing systems: Nexus [59], Scanner [57], and gg [34].

Applications define video pipelines as directed acyclic graphs (DAGs), where vertices represent video analytics and processing operations, while edges represent dataflow. As described in the literature [3, 31, 47], video pipelines can be composed from three basic DAG patterns shown in Figure 1: (a) sequential, where each vertex has at most one input and one output, (b) parallel, where multiple vertices execute in parallel, and (c) branching, where the output of a vertex, called the branching vertex, conditionally determines the next vertex to execute. For example, the AMBER Alert pipeline [59, 74] for face and car recognition in Figure 2 begins with a sequential path of decoding and preprocessing operations, followed by a branching object detection operation. Depending on the output, people or cars are sent to parallel face and car recognition operations, respectively.

Table 1 categorizes video analytics and processing systems based on two key features: (a) Their ability to specify and meet performance targets.
User-facing systems typically require that the video pipeline meet a latency target, ideally while minimizing resource usage (cost). For example, the AMBER Alert pipeline needs to meet a strict latency target so that responders can take timely action. (b) Support for general video operations.
To compose video pipelines, a user combines video operations, such as inference models, video encoders, and image filters, as well as analytics operations that process extracted metadata. For example, the ornithologist's pipeline will contain video decoding, object detection, detection of the specific bird species, and image enhancement operations. Some systems, such as Scanner [57], VideoStorm [74], and gg [34], support general video operations. Others, such as Focus [39] and Nexus [59], focus on one facet of video pipelines (e.g., deep learning inference) and rely on external services for other operations.
Table 1: Comparison of existing video processing systems with Llama based on whether they (a) support performance targets and general operations, and (b) address the challenges of meeting performance targets for general video pipelines.

                                        GrandSLAm  VideoStorm  Focus     Nexus     Scanner  gg        Sprocket  Llama
                                        [47]       [74]        [39]      [59]      [57]     [34]      [22]
Features
  Performance targets                   Yes        Yes         Yes       Yes       No       No        No        Yes
  General operations                    No         Yes         No        No        Yes      Yes       No        Yes
Challenges
  Traverse large configuration space    Limited¶   Limited†    Limited¶  Limited¶  No       No        No        Yes
  Handle input-dependent exec. flow     No         No          No        Yes       No       Limited‡  No        Yes
  Dynamically adjust resource alloc.    No         Limited§    No        Limited§  No       Yes       Yes       Yes

¶ Limited to domain-specific knobs. † Large profiling overhead. ‡ Cannot handle branching. § Limited to a single hardware platform.
Figure 3: Execution latency on CPU (left) and GPU (right) for a face detection pipeline that identifies unique faces in a frame [49]. Latency varies up to 17.2× and 4× on CPU and GPU, respectively.

Large configuration space.
Pipeline operations offer a variety of knobs that can be used to improve latency and resource use, such as the batch size, sampling rate, and resolution. Other knobs select the hardware platform (e.g., CPU, GPU, FPGA, TPU, or vision processor) and set the resource allocation (e.g., number of CPU cores or amount of GPU memory). Our experiments with Scanner demonstrate a 3.7× execution latency variation across different operation configurations for the AMBER Alert pipeline of Figure 2. Determining configurations is challenging due to the exponential growth of the configuration space with the number of operations, knobs, and hardware platforms available. As shown in Table 1, Scanner, gg, and Sprocket do not auto-tune configuration knobs, putting the burden on the user to statically specify good operation configurations. Focus and Nexus are domain-specific to deep learning inference and are limited to configuring the inference models used and the batch size, respectively. GrandSLAm only leverages batch size as a configurable knob and considers only one hardware platform at a time. VideoStorm is able to support general knob configurations; however, it takes tens of CPU hours to profile pipelines and requires re-profiling when the pipeline, input video, or latency targets change [74].

Input-dependent execution flow.
Input-dependent execution flow occurs in two cases. First, inputs determine the conditional path in branching pipelines. In the AMBER Alert pipeline of Figure 2, a frame will only take the face-recognition path if object-detection finds a person in it. Since a conditional path is not resolved until the branching operation finishes, provisioning resources and selecting configurations to meet a pipeline's latency target is challenging. Existing systems either treat branching pipelines as parallel ones (i.e., by executing all conditional branches) [32, 57] or do not support non-sequential pipelines [22, 47, 74]. Second, operations can produce a variable number of outputs, and thus a variable load for downstream operations. If the number of intermediate outputs is unknown until the operation is executed, determining the parallelism or resources needed downstream to meet latency targets is difficult, especially if these operations are computationally expensive. Figure 3 shows that the latency of a pipeline that identifies the unique faces in a frame depends on the number of unique faces: a 17.2× and 4× difference between 60 faces and no faces on a CPU and a GPU, respectively. As noted by Jockey [32], the nondeterminism introduced by input-dependent behavior requires dynamic adaptation in the underlying system in order to meet a pipeline's latency target. Most existing video analytics and processing frameworks and systems do not account for input-dependent execution flow.

Dynamically adjusting resource allocation of operation invocations.
As a pipeline executes, the degree of available parallelism depends on the various knob settings (e.g., batching) and the number of intermediate outputs produced. Many existing systems require users to statically provision a cluster, which limits the resources available to exploit parallelism [32, 47] or leads to over-allocation and higher costs when parallelism is low. Some systems periodically adjust resources and bin-pack requests as the load changes, but are limited by how quickly hardware (e.g., GPUs) and VMs can be loaded and unloaded [59]. Systems such as gg [34] and Sprocket [22] leverage serverless platforms [2, 4, 8] to dynamically allocate resources for each operation invocation. While using serverless platforms for video pipelines that exhibit varying degrees of parallelism is attractive, it does not address all challenges on its own. A user must still manually select hardware types and configure knobs to ensure latency targets are met.
Llama is a heterogeneous and serverless framework for auto-tuning video analytics and processing pipelines. Llama's objective is to meet the overall pipeline latency target while minimizing cost (resource usage). As noted in Section 2, input-dependent execution flow and resource volatility preclude the use of static tuning approaches [32]. They also preclude designing and calculating a globally-optimal solution, either a priori or dynamically. Instead, Llama optimizes the overall pipeline execution by iteratively and dynamically optimizing each operation invocation using the most up-to-date information about the state of the execution flow and resource availability. Specifically, Llama (a) dynamically reduces the pipeline latency target to per-operation-invocation latency targets, values that we call slack, and (b) continuously configures each operation invocation to meet the slack at minimal cost. Dynamically allotting slack ensures the pipeline latency target is met without having to statically account for all possible conditional paths or sources of resource volatility in serverless environments. It also allows Llama to revisit configuration decisions as the resource environment evolves or as input-dependent operations are run. Llama finds a set of cost-efficient configurations for the entire pipeline because it minimizes cost at each configuration assignment subject to the overall latency target.

We address the challenges outlined in Section 2 as follows:

Traversing the large configuration space.
Llama profiles and makes configuration decisions on a per-operation, not per-pipeline, basis. New operations undergo a short (seconds to minutes) one-time profiling step, independent of the pipelines that include the operation. Operations are not re-profiled as the pipeline composition, video, or latency targets change. As the pipeline executes, Llama makes configuration decisions for one operation invocation at a time, reducing the exponential configuration space of an entire pipeline to that of an individual operation.
Handling input-dependent execution flow.
Llama uses three techniques to meet latency targets despite the nondeterminism that stems from input-dependent behavior and resource volatility (e.g., resource contention): (a) early speculation and late commit, which selects an initial configuration decision as soon as an invocation is available and then revisits the configuration right before execution, (b) priority-based commit, which prioritizes operations based on their affinity to hardware accelerators and their depth in the pipeline, and (c) safe delayed batching, which waits for additional inputs (video frames) to batch them, as long as doing so does not violate the invocation's allotted slack.
Dynamically adjusting resource allocations.
Making per-invocation configuration decisions also allows Llama to dynamically right-size resource allocations across heterogeneous serverless backends. Llama decides the hardware type and resource sizing (e.g., a GPU with 2 GB of memory) during dynamic configuration based on what is necessary to meet the slack. Early speculation and late commit, as well as priority-based commit, also allow Llama to balance resources between operations.
This section reviews Llama's architecture and workflow, while Section 4 focuses on the core techniques Llama uses to configure invocations to meet their latency target. Figure 4 shows that Llama uses an offline specification phase and an online optimization phase. The specification phase has two purposes. First, it allows the user to specify a pipeline with multiple, general operations using an SDK. Second, it extracts the following information: the set of all possible sequential paths through the pipeline, and the latency and resource footprint of each unique operation for each feasible configuration of knobs. The pipeline specification and the extracted metadata are stored for use during the online phase.

The online phase is triggered when users submit an input video and a latency target to Llama. Llama executes the pipeline by continuously generating and executing a set of invocations for each operation as their input dependencies are resolved. For example, if object-detection outputs a frame tagged with a person, a new invocation is generated for the preprocess operation in the AMBER Alert pipeline in Figure 2. The online phase configures each invocation by first estimating its slack. It then uses the respective operation's profiling data to determine the most efficient configuration that can complete the invocation within the allotted slack. The process repeats until all invocations in the pipeline are executed (see Figure 4).
Figure 4: Llama's architecture diagram.

Figure 5: Example configuration specification for one image blurring configuration (e.g., "engine": "gpu", "resource": 200, "latency": 320, "arguments": ["image_blur", ...]). Llama's configuration specifications allow for general operation knob configuration.
Application Interface.
Users define pipelines using the Llama SDK. They specify pipeline operations, dependencies between operations, and conditional flow. Llama provides a library of operations (e.g., decode and face recognition). Each operation consists of a binary executable, indexed by its SHA256 hash, and a configuration specification file that contains configuration options and performance statistics for the operation. Users can optionally bring their own operations by providing an executable and a configuration template that specifies tunable knobs (e.g., hardware type, batch size, or number of filters) and ranges for each knob. The Operation-Profiler uses these inputs in a one-time profiling step to generate a configuration specification. The operation and its configuration specification are then added to the Operation Library and re-used across pipelines without further profiling.
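To make the SDK's inputs concrete, the following sketch shows one way a branching pipeline such as the AMBER Alert DAG of Figure 2 could be described as operations, dependencies, and conditional flow. The dictionary structure and operation names are purely illustrative assumptions; this is not Llama's actual SDK interface.

# A minimal, self-contained sketch of describing a branching pipeline:
# operations, their dependencies, and conditional flow. Illustrative only.
amber_alert = {
    "decode":           {"after": [],                   "when": None},
    "preprocess":       {"after": ["decode"],           "when": None},
    "object-detection": {"after": ["preprocess"],       "when": None},
    # Conditional branches: these run only for frames whose detections match.
    "preprocess-face":  {"after": ["object-detection"],
                         "when": lambda out: "person" in out["labels"]},
    "face-recognition": {"after": ["preprocess-face"],  "when": None},
    "preprocess-car":   {"after": ["object-detection"],
                         "when": lambda out: "car" in out["labels"]},
    "car-recognition":  {"after": ["preprocess-car"],   "when": None},
}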
Operation-Profiler.
The Operation-Profiler collects performance and resource statistics for each operation. Using the operation executable and configuration template as inputs, it first enumerates all possible configurations specified by the template, then executes a short profiling step using one or more sample frames for each configuration (depending on the batch size). Statistics such as latency and resource footprint (e.g., peak memory utilization) are collected and stored as entries in a configuration specification file. The frame content does not affect these statistics (recall that input-dependent execution flow manifests between operations). Since slack calculation (Section 4.1) depends only on the relative performance of operation invocations across the pipeline, the Operation-Profiler designates a reference configuration for each operation to provide a measure of relative performance. We chose the smallest CPU configuration (1 core, batch size 1) as each operation's reference configuration. During runtime, operation invocation performance that differs from its profiled value, due to resource contention or profiling inaccuracy, is accounted for by leveraging feedback (Section 3.3).

An example configuration specification for an image blurring operation is shown in Figure 5. The configuration specification is structured so that arbitrary operation- and hardware-specific configuration knobs can be described by users and dynamically configured during runtime. This enables Llama to support general operations and arbitrary video pipelines for a myriad of video analytics and processing applications.
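The sketch below illustrates the shape of this one-time profiling step: enumerate every combination of knob values in a configuration template and record statistics for each. The knob names, value ranges, and the profile_once helper are illustrative assumptions, not Llama's actual profiler code.

# Sketch of the one-time profiling step: enumerate every configuration a
# template describes and record latency/resource statistics for each one.
import itertools
import json
import time

template = {
    "engine":     ["cpu", "gpu"],   # hardware type
    "resource":   [1, 2, 4],        # e.g., CPU cores or units of GPU memory
    "batch_size": [1, 8, 32],
}

def profile_once(config):
    # Placeholder: run the operation executable on sample frames under
    # `config` and measure latency and peak memory; returns dummy values here.
    start = time.time()
    return {"latency_ms": (time.time() - start) * 1000.0, "peak_mem_mb": 0}

keys = sorted(template)
spec = []
for values in itertools.product(*(template[k] for k in keys)):
    config = dict(zip(keys, values))
    spec.append({**config, **profile_once(config)})

# The entries form the operation's configuration specification, stored as
# JSON in the Operation Library and reused across pipelines.
print(json.dumps(spec[:2], indent=2))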
Pipeline-Decomposer.
To enable the online phase to dynamically compute slack, the Pipeline-Decomposer performs a one-time decomposition of the pipeline into all possible sequential paths through the pipeline. To do so, it performs a modified depth-first search on the pipeline DAG to enumerate all paths from an input operation (i.e., an operation with no upstream dependencies) to an output operation (i.e., an operation with no downstream dependencies). For example, the AMBER Alert pipeline in Figure 2 is decomposed into two sequential paths, ending in face-recognition and car-recognition, respectively (see the Pipeline-Decomposer block in Figure 4). The Pipeline-Decomposer optimizes this process by memoizing previously-seen paths during the DAG traversal, and it emits an intermediate representation of the decomposed set of paths.
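A minimal sketch of this decomposition is shown below, assuming the DAG is given as an adjacency list; the memoization plays the role of the previously-seen-path caching described above.

# Sketch of the decomposition: enumerate every sequential path from an input
# operation (no upstream dependencies) to an output operation (no downstream
# dependencies), memoizing suffixes so shared sub-paths are traversed once.
from functools import lru_cache

# Adjacency list for the AMBER Alert DAG of Figure 2 (illustrative).
children = {
    "decode": ["preprocess"],
    "preprocess": ["object-detection"],
    "object-detection": ["preprocess-face", "preprocess-car"],
    "preprocess-face": ["face-recognition"],
    "preprocess-car": ["car-recognition"],
    "face-recognition": [],
    "car-recognition": [],
}
roots = ["decode"]  # operations with no upstream dependencies

@lru_cache(maxsize=None)
def paths_from(op):
    """All sequential paths starting at `op` (memoized)."""
    if not children[op]:
        return ((op,),)
    return tuple((op,) + rest for c in children[op] for rest in paths_from(c))

for path in (p for r in roots for p in paths_from(r)):
    print(" -> ".join(path))
# Prints the two paths, ending in face-recognition and car-recognition.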
Manager.
Llama's Manager takes video inputs and latency targets and orchestrates the entire pipeline execution, maintaining execution state and generating new invocations. It tracks the progress of each frame as it flows through the pipeline, as well as any in-flight invocations. Whenever an invocation completes, the Manager records the invocation's runtime statistics (i.e., latency, cost, and configuration) and the path to intermediate outputs. The runtime statistics are used to update the configuration profiles obtained from the Operation-Profiler via a feedback loop. We use an exponential smoothing algorithm to update the profiles; other algorithms can be incorporated as well. The intermediate outputs are then used to resolve any conditional branches. The Manager then spawns invocations for downstream operations once all dependencies have been resolved. These invocations are then sent to the Configurator.

Algorithm 1 Operation invocation slack allotment
 1: paths ← set of all sequential paths in the pipeline
 2: t ← elapsed time
 3: target ← pipeline latency target
 4: ⊲ Given backend λ, allot slack to an invocation of operation op
 5: procedure ComputeSlack(op, λ)
 6:     slacks = {}
 7:     for all {path ∈ paths | op ∈ path} do
 8:         pLat = RemainingPathLatency(op, path)
 9:         remainingTime = target − t − queueingTime(λ)
10:         pSlack = (op.ReferenceLat() / pLat) × remainingTime
11:         slacks.append(pSlack)
12:     return min(slack ∈ slacks)
Configurator.
To meet the overall pipeline latency target, the Configurator decides (a) how much slack to allot to an operation invocation, and (b) what the most efficient configuration is to meet that slack. The Configurator works with the Scheduler to keep track of the available resources at each serverless backend as it makes configuration decisions. We discuss the decision techniques in Section 4.
Scheduler.
After the Configurator has configured an invocation's knobs, the invocation is sent to the Scheduler for execution. The Scheduler executes the configured invocations on the hardware platform specified by the Configurator. This includes creating and managing the necessary backend connections, mitigating stragglers, and handling invocation failures (Section 4.5). When an invocation successfully returns, the Scheduler provides the Manager with the invocation metadata and output results.
During the online phase, the goal of the Configurator is to assign each operation invocation a configuration so that the overall pipeline latency target is met. As described in Section 2, input-dependent execution flow and backend resource volatility mean that each operation invocation's most efficient configuration of hardware type, resource size, batch size, and operation-specific knobs cannot be statically determined. Instead, Llama's Configurator (a) determines how much slack can be spent on the invocation, and (b) uses a cost-based optimizer to select a configuration that meets that slack. This section explains how the Configurator makes these decisions.
Given a user-specified pipeline latency target, the Configurator first needs to compute a slack for each operation invocation. Existing systems, such as GrandSLAm [47], statically determine each operation invocation's slack by assuming that the pipeline is linear and that invocations are known in advance with predictable latencies (i.e., no nondeterminism). Our insight is to instead dynamically calculate each operation invocation's slack. The slack can then be used to select the best invocation configuration to use (Section 4.2).

Llama calculates an operation invocation's slack using Algorithm 1. Configuring invocations to meet their slack ultimately leads to meeting the pipeline latency target. Given an invocation of operation op and a configuration's backend λ, ComputeSlack begins by finding every sequential path through the DAG that contains op (Line 7). ComputeSlack then estimates the latency to complete the path, starting at op, using the reference configuration for each operation (see Section 3.2). By using the reference configuration's latency, Llama avoids the causal dilemma of needing a configuration to compute slack and needing a slack to select a configuration. The operation invocation's slack for that path is then determined based on the remaining time (Line 9), factoring in the estimated queueing time at λ, weighted by the relative latency of op to the remaining path (Line 10). The final slack for an invocation of op on λ is then the minimum slack value over all possible execution paths of op, ensuring that all input-dependent branch resolutions are accounted for (Line 12). We discuss how Llama reclaims overly-conservative slack in the next section.

Since slack is calculated for each operation invocation, Llama can quickly evaluate configurations in a smaller per-operation, not per-pipeline, configuration space. After calculating the invocation's slack for each available serverless backend λ (Algorithm 1), Llama applies the objective function shown in Equation 1 to all possible configurations x of op, using the invocation slack corresponding to the serverless hardware backend λ(x) targeted by x. C(x) and L(x) are the estimated cost and latency, respectively, to run configuration x. R(x) is the amount of resources requested by x (e.g., amount of memory), and R_total(λ(x)) is the resource limit of λ(x). B(x) is the batch size of configuration x, and α is a tunable parameter (in $/second).

    obj(x, slack) = C(x) / B(x),                                                   if L(x) < slack
    obj(x, slack) = C(x) / B(x) + α · (L(x) · R(x)) / (B(x) · R_total(λ(x))),      otherwise        (1)

Intuitively, this objective function evaluates the monetary cost to run x when the slack is feasible. If the slack cannot be met, the cost function weighs in favor of potentially more expensive configurations that achieve a higher throughput. α sets the balance between cost and throughput: high values of α aim to meet the slack at all costs, while lower values of α may leverage more cost-efficient configurations, potentially at the expense of exceeding the slack. The configuration objective function is independent of the input video or the overall pipeline.

Users who wish to optimize for a different metric (e.g., minimal latency subject to a cost budget, a quality constraint, etc.) can add their own objective function to Llama. Furthermore, since R is specific to each backend (e.g., concurrent invocation limits, GPU memory, or CPU cores), Llama can support other heterogeneous backends (e.g., serverless CPUs and GPUs or on-premise clusters).
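The sketch below illustrates how Equation 1 could be evaluated to pick a configuration for one invocation. The configuration field names (engine, cost, latency, batch, resource) and the value of α are illustrative assumptions that mirror the configuration-specification entries of Section 3.2; this is not Llama's actual optimizer code.

# Sketch of the per-invocation cost-based optimizer (Equation 1): score every
# candidate configuration against the invocation's slack and pick the lowest-
# scoring one.
ALPHA = 0.01  # alpha, in $/second: trades off cost against throughput (assumed value)

def objective(cfg, slack, backend_capacity):
    per_frame_cost = cfg["cost"] / cfg["batch"]            # C(x) / B(x)
    if cfg["latency"] < slack:
        return per_frame_cost
    # Slack cannot be met: add a penalty proportional to how long the
    # configuration occupies the backend per frame, favoring higher throughput.
    occupancy = (cfg["latency"] * cfg["resource"]) / (cfg["batch"] * backend_capacity)
    return per_frame_cost + ALPHA * occupancy

def choose_configuration(candidates, slack_per_backend, capacity_per_backend):
    """candidates: configuration-specification entries for one operation."""
    return min(
        candidates,
        key=lambda c: objective(c, slack_per_backend[c["engine"]],
                                capacity_per_backend[c["engine"]]),
    )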
Impact of errors in slack estimation.
Since conditional flow will not always resolve to the worst-case path, the allotted slack may be overly conservative and result in Llama choosing a configuration with a lower-than-necessary latency to meet the pipeline latency target. However, since Llama configures each invocation separately and dynamically, future invocations can be configured to recover from the earlier mis-predictions.
Table 2: Example configuration options for operations in the AMBER Alert pipeline (Figure 2). Assume preprocess has a latency of 1 second regardless of configuration.

          Object Detection      Face Recognition      Car Recognition
Config    Latency   Cost        Latency   Cost        Latency   Cost
λ_GPU     1s        0.02        2s        0.04        1s        0.02
λ_CPU     2s        0.01        6s        0.03        4s        0.02
Consider the AMBER Alert pipeline (Figure 2), in which we need to compute the slack for an object-detection invocation given that 8 seconds remain to meet the pipeline latency target. For simplicity, let the configuration choice be between a λ_CPU (using a CPU) and a λ_GPU (using a GPU), and assume there is no queueing at either backend (i.e., queueingTime(λ) is 0). The reference latency for face-recognition is 6 seconds, for car-recognition it is 4 seconds, and for preprocess it is 1 second. Llama sets the object-detection invocation's slack to 1 second based on the longer face-recognition path, which rules out choosing the λ_CPU configuration from Table 2. However, if execution proceeds down the car-recognition path, when execution reaches car-recognition there will be 6 seconds remaining. Now, Llama can select a slower, more cost-efficient configuration, λ_CPU in this case, without violating the pipeline latency target.
To manage invocations that cannot be run concurrently due to limited backend parallelism, Llama locally queues invocations and accounts for the queueing time when allotting slack (Line 9 in Algorithm 1). The queueing time depends on x_i: the selected configuration of each queued invocation i (i.e., it is not sufficient to use the number of queued operation invocations as a measure of wait time [32, 62]). Thus, invocations need to be assigned a configuration before they are queued. However, the initial configuration x_i is often chosen many seconds before it is actually invoked, leading to sub-optimal configurations for three reasons. (a) Invocations queued in front of i may experience execution times that vary from the profiled values. This can occur due to resource contention or input-dependent execution flow. (b) The estimated latency for x_i may be updated via feedback while it is queued. (c) The number of invocations queued behind i may quickly grow (e.g., many completed object-detection invocations may output a large number of car-recognition and face-recognition invocations); thus, x_i should be chosen to ensure that upstream invocations can meet their slack. Hence, by the time a queued invocation is ready to run, its selected configuration needs to be revisited to determine whether it is still the right configuration.

To solve this, Llama leverages a novel technique inspired by late binding [24, 51, 52, 56, 63] that we call early speculation and late commit. With early speculation and late commit, Llama maintains two queues per serverless backend λ: an unbounded speculative queue (SQ[λ]) and a small, bounded commit queue (CQ[λ]) set to hold enough invocations to saturate λ. Once an invocation i is ready to execute, the Configurator uses Algorithm 1 to assign it a slack and uses Equation 1 to select a speculative configuration. The configured invocation is then put into the appropriate speculative queue, thus enabling Llama to estimate the queueing time at each backend. Once i reaches the head of the speculative queue, as prior invocations are executed, Llama revisits the configuration of i using Algorithm 1 and Equation 1 again. It then commits the configuration into the appropriate commit queue. Doing so mitigates the queueing challenges noted above by delaying the binding of an invocation to a final configuration for as long as possible. This provides Llama with maximum flexibility and the most up-to-date state about pipeline dataflow and performance at each backend.

With early speculation and late commit, Llama can estimate the queueing time at each serverless backend λ using Equation 2, based on each configured invocation i in its queues. L(x_i) and R(x_i) are the estimated latency of, and resources requested by, x_i, respectively. R_total(λ(x_i)) is the total amount of resources or the concurrency limit at the serverless backend specified by the configuration x_i.

    Q_SQ[λ],CQ[λ] = Σ_{i ∈ SQ[λ] ∪ CQ[λ]} (L(x_i) · R(x_i)) / R_total(λ(x_i))        (2)

Intuitively, the queueing time is estimated as the sum of each x_i's profiled configuration latency, weighted by x_i's requested resources (to account for parallel execution). The cumulative queueing time over SQ[λ] and CQ[λ] is then used in ComputeSlack (Line 9). Note that SQ[λ] is also included when committing configured invocations to account for the invocations queued behind i.
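A short sketch of this estimate, assuming each queued invocation carries its currently speculated or committed configuration:

# Sketch of Equation 2's queueing-time estimate for one backend: each queued
# invocation contributes its profiled latency, scaled by the fraction of the
# backend's resources (or concurrency limit) that its configuration requests.
def estimated_queueing_time(speculative_queue, commit_queue, backend_capacity):
    total = 0.0
    for inv in list(speculative_queue) + list(commit_queue):
        cfg = inv["config"]  # the invocation's speculated or committed configuration
        total += cfg["latency"] * cfg["resource"] / backend_capacity
    return total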
The Configurator's decisions described in Section 4.2 assume that per-operation invocation decisions can be made independently of each other. However, Llama also needs to reason about the relationships between operations and their invocations, for the following reasons:
When to batch invocations.
As pipeline dataflow progresses, there can be moments when an operation has fewer invocations available than the most efficient configuration's batch size. For example, if a pipeline contains a slower face detection operation followed by a faster blurring operation, the blur operation's invocations will likely drain from the speculative queue faster than they can build up. In such cases, executing upstream operations first yields a larger batch size, amortizing RPC and I/O overheads. However, waiting for upstream operation invocations to complete their execution may result in a slack violation.
Under-allotting slack due to incorrect profiling.
As described in Section 4.1, the slack allotted to an invocation is a function of the reference configuration's profiled latency for downstream operations. Furthermore, a configuration's latency is updated using a feedback loop after execution (Section 3.3). However, slack can be under-allotted to operation invocations if the reference configuration's latency is significantly shorter than its actual latency and the feedback loop is not closed early on during pipeline execution. This is especially problematic for longer pipelines, and for pipelines in which the last operation's invocations need a longer slack than the rest. Thus, it is beneficial to prioritize invocations by pipeline depth early in the pipeline's execution so that feedback can update all reference configurations.
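For illustration, the feedback update of Section 3.3 could look like the following sketch, where the smoothing factor is an assumed value rather than the one Llama uses.

# Sketch of the profile feedback update (Section 3.3): blend the measured
# latency of a finished invocation into its configuration's profiled estimate
# with exponential smoothing. The smoothing factor 0.5 is an assumed value.
def update_profiled_latency(cfg, measured_latency, beta=0.5):
    cfg["latency"] = beta * measured_latency + (1.0 - beta) * cfg["latency"]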
Affinity of operations to heterogeneous hardware.
While prioritizing invocations by pipeline depth can help prevent under-allotting slack, the issue of prioritizing operation invocations on particular hardware platforms remains. For example, consider the case in which both an object-detection and a face-recognition invocation must be configured based on the options from Table 2. Resource limits force the two invocations to split their decision between λ_CPU and λ_GPU. Committing object-detection's invocation first may prevent face-recognition's invocation from accessing the GPU, forcing it to choose λ_CPU. Since face-recognition's invocation benefits more from the acceleration, the better decision is to assign the CPU to object-detection's invocation and the GPU to face-recognition's invocation, while maintaining the same cost. The relative benefit of running operation invocations on a particular hardware platform (i.e., its hardware affinity) must be incorporated into configuration decisions.
Llama addresses these challenges using two techniques, safe delayed batching and priority-based commit, implemented in conjunction with early speculation and late commit.
Safe delayed batching.
Safe delayed batching addresses the challenge of waiting for additional invocations to batch. During both early speculation and late commit, if Llama determines that the most cost-efficient configuration that meets the slack has a batch size larger than the number of invocations available for a given operation (using Equation 1), it waits until more invocations arrive before assigning a configuration. It does so safely: only if there are enough upstream invocations and the slack will not be violated. Otherwise, it uses the best feasible configuration.
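A sketch of this decision, with all quantities (ready invocations, pending upstream invocations, expected wait) treated as illustrative inputs:

# Sketch of the safe-delayed-batching check: delay committing only if a larger
# batch is actually reachable (enough upstream invocations remain) and the
# extra wait still leaves the configuration within the invocation's slack.
def should_wait_for_batch(best_cfg, num_ready, num_upstream_pending,
                          expected_wait, slack):
    needs_more = best_cfg["batch"] > num_ready
    reachable = num_upstream_pending >= best_cfg["batch"] - num_ready
    safe = expected_wait + best_cfg["latency"] <= slack
    return needs_more and reachable and safe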
Priority-based commit.
Priority-based commit addresses the challenges of under-allotting slack and of the affinity of operations to heterogeneous hardware. First, to address the challenge of under-allotting slack, the Configurator prioritizes invoking a certain number of reference invocations for each operation (see Section 5 for more details), favoring deeper operations in the pipeline. This ensures that the feedback loop for all reference configurations is closed as fast as possible to minimize under-allotted slack.

Second, to compute an operation's affinity to heterogeneous hardware, Llama compares the benefit an operation invocation receives from running on a specific backend to that of the other available backends. It computes the affinity of an invocation's operation op to hardware backend λ using Equation 3, where X_(op,λ) is the subset of configurations for op that run on λ and X^c_(op,λ) is the complementary set (i.e., all other configurations for op).

    affinity(op, λ) = min_{x ∈ X^c_(op,λ)} {obj(x, slack)}  /  min_{x ∈ X_(op,λ)} {obj(x, slack)}        (3)

Intuitively, Equation 3 determines whether a hardware backend provides more benefit (via Equation 1) to an invocation than other backends. Llama prioritizes invocations from operations with a higher affinity to a hardware backend λ when committing them to each CQ[λ]. This ensures each backend achieves its highest utility.

During execution, operation invocations may straggle or fail to execute [21, 30, 71]. The Scheduler keeps track of each invocation's execution time. If it exceeds a configurable time-out (discussed in Section 5) or the Scheduler receives an error, the Scheduler notifies the Manager to create a duplicate invocation. This duplicated invocation is then passed to the Configurator to begin the slack allotment and configuration process anew. The allotted slack will now be reduced, potentially resulting in a different configuration to still meet the pipeline latency target.
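To make Equation 3 and the resulting commit order concrete, the sketch below scores an operation's affinity to a backend from pre-computed Equation-1 objective values; the data layout is an assumption, and the numeric walk-through uses the illustrative values of Table 2.

# Sketch of Equation 3's affinity score and the resulting commit order.
# `scores` maps each backend to the Equation-1 objective values of the
# operation's configurations on that backend.
def affinity(scores, backend):
    best_on_backend = min(scores[backend])
    best_elsewhere = min(v for b, vals in scores.items() if b != backend
                         for v in vals)
    return best_elsewhere / best_on_backend

def commit_order(pending, backend):
    """pending: (operation_name, scores) pairs awaiting commit to CQ[backend]."""
    return sorted(pending, key=lambda p: affinity(p[1], backend), reverse=True)

# Walk-through with Table 2's illustrative values, assuming every
# configuration meets its slack (so the objective is just cost per frame):
face_recognition = {"gpu": [0.04], "cpu": [0.03]}
object_detection = {"gpu": [0.02], "cpu": [0.01]}
print(affinity(face_recognition, "gpu"))  # 0.03 / 0.04 = 0.75
print(affinity(object_detection, "gpu"))  # 0.01 / 0.02 = 0.50
# face-recognition has the higher GPU affinity, so it is committed to the
# GPU's commit queue ahead of object-detection, matching the example above.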
We implemented Llama as an extension to gg [34] in ∼4K lines of C++ code. We modified gg's C++ and Python SDKs to support complex pipelines and general knob configurations. Llama supports operations from any framework or library; we implemented non-deep-learning pipeline operations (e.g., blur and meanshift) using OpenCV [25] and FFmpeg [33], and deep learning pipeline operations with TensorFlow [16]. The source code will be available upon publication of this paper.

We implemented the online phase on top of gg's dispatcher and backend resource manager. The online phase is implemented on a single thread (we evaluate overheads in Section 6.5), but can scale out to multiple threads as needed. For straggler mitigation, we set each invocation's time-out value to 1.5× the invocation's profiled latency. Larger values wait too long to spawn a duplicate invocation, which may violate the pipeline latency target, while smaller values unnecessarily overload the speculation queues. For depth-first priority, we observed that 10 invocations of the reference configuration were sufficient to obtain enough feedback values to converge on a latency measurement. Smaller values do not collect enough feedback values to prevent under-allotted slack, while larger values unnecessarily prioritize invocations with configurations that may not be efficient.

For the offline specification phase, we implemented the Operation-Profiler as a client to the online phase that collects and stores the profiled metadata into configuration specifications. Configuration specifications, such as the one depicted in Figure 5, are implemented as JSON files. The Operation Library is implemented in an object store (e.g., Google Cloud Storage).

We deployed Llama with serverless CPUs and serverless GPUs as compute backends. For serverless CPUs, we provision and manage a cluster of CPUs similar to existing serverless offerings [68]. Each invocation requests a specific number of cores (up to 4). Llama also supports running on serverless computing services such as AWS Lambda [2] or Google Cloud Functions [8], where the resources requested per invocation would be an amount of DRAM.

Since no serverless GPU services or frameworks existed at the time of writing, we built our own implementation (about 1K lines of C++ code) that we believe is representative of a future production offering [26]. Similar to CPU serverless offerings, each invocation requests an amount of GPU memory (in MB). Our serverless GPU scheduler then allocates a proportional amount of GPU threads using Nvidia MPS [10], allowing multiple invocations to execute concurrently. Invocations are executed on a first-come, first-served basis. Our serverless GPU backend is also compatible with GPU generations that support concurrent job execution in hardware, such as the Nvidia A100 [11].

We answer the following questions: (a) How does Llama compare to state-of-the-art systems (Scanner, Nexus, and gg)? (b) How effective is Llama in meeting diverse latency targets? (c) How does each technique employed by Llama, such as early speculation and late commit and safe delayed batching, contribute to its ability to meet the latency target? (d) What is the impact of profiling errors on Llama's ability to meet latency targets? (e) What are the overheads of the various decisions Llama makes?
Metrics.
Unless otherwise noted, we use pipeline processing latency and cost as metrics for success. For each experiment, we report the mean of three runs.
Experimental setup.
We deployed Llama on Google Cloud Platform (GCP) [7]. The Llama runtime is single-threaded and ran on an n1-standard-8 instance (8 vCPUs, 30 GB of DRAM). We used the following setup for all experiments unless otherwise noted. For the serverless CPU backends, we used 10 instances of type n1-standard-64 (64 vCPUs, 240 GB of DRAM). For the serverless GPU backends, we used 2 instances of type custom-12-46080 (1 V100 GPU, 12 vCPUs, 45 GB of DRAM). All instances feature Intel Xeon Platinum E5-2620 CPUs operating at 2.20 GHz, Ubuntu 16.04 with a 5.3.0 kernel, and up to 32 Gbps networking speed.
Baseline systems.
We compared pipeline processing latency and cost with three state-of-the-art systems: Scanner [57], Nexus [59], and gg [34]. Scanner is a cluster-based video analytics system used by Facebook for processing 360° videos [13]. Nexus is a GPU cluster engine for accelerating deep-learning-based video analysis, and gg is a general-purpose serverless framework. We evaluated two common Scanner setups: one in which a user only provisions a cluster with CPUs (sc-cpu), and one in which, similar to Nexus, a user runs all operations on a GPU (sc-gpu). For gg, we also compared against a version augmented with Llama's branching support (gg-branch). sc-cpu, gg, and gg-branch do not support heterogeneous accelerators, while Nexus and sc-gpu require GPU VMs. To equalize the compute resources provided to all systems, we observe that custom-12-46080 and n1-standard-64 VMs are effectively priced the same on GCP (a difference of 1% at the time of writing). (This price equivalency also holds for an equivalent V100 (p3.2xlarge) and 64-vCPU (m5.16xlarge) VM on AWS.) Thus, we provisioned sc-cpu, gg, and gg-branch with 12 n1-standard-64 VMs, and Nexus and sc-gpu with 12 custom-12-46080 VMs.
Resource requests and cost model.
For Llama and gg, each invocation requests a set amount of resources (GPU memory or CPU cores), as is done in commercial serverless offerings. The respective backend then provisions the invocation with the requested resources and charges a price based on the amount of requested resources and the invocation latency. We calculate the price (in $/(resource-second)) by dividing the cost per second charged by GCP by the VM's total resources. For example, the price of a V100 GPU invocation is calculated by dividing the price of custom-12-46080 by 16 GB. Since Scanner and Nexus are cluster-based frameworks, we compute their cost using the time to rent the cluster for the duration of the execution; we do not include the cost of starting and maintaining a warm cluster.

Video pipeline.
Table 3 shows the pipelines, operations, and videos that we used.

Table 3: Details of video processing pipelines used for evaluating Llama, their operations, and video inputs. Operations with a † are non-deep-learning pipeline operations.

Pipeline       Description                                   Length (Form)    Operations                                                               Video input
AMBER Alert    detect cars and people                        5 (branching)    decode†, preprocess†, object detect., face recog., car recog.           (646) traffic camera [15], 10 min, 1080p
Face Blurring  detect indiv. face and blur from all frames   5 (branching)    decode†, preprocess†, face recog., template match†, blur†               (600) rally [12], 10 min, 720p
Denoising      detect indiv. face and denoise/segment        5 (branching)    decode†, preprocess†, face recog., template match†, meanshift†          (600) rally [12], 10 min, 720p
Toonify        apply cartoon effect to video                 4 (parallel)     decode†, edge detect.†, bilateral filter†, merge edge-filter†, encode†  (989) tears of steel [14], 10 min, 720p
Synthetic      synthetic pipeline for sensitivity analysis   7 (sequential)   decode†, blur†, preprocess†, face recog.                                 (596) rally [12], 10 min, 720p
Figure 6: Latency of baselines to execute each pipeline. Nexus only supports the AMBER Alert pipeline (unsupported pipelines are denoted by ×). Llama's fastest, but most expensive, execution is faster than all baselines.

Three of the five pipelines (AMBER Alert, Face Blurring, and Denoising) are branching pipelines in which only invocations satisfying the branching condition are executed. For AMBER Alert, only frames with faces and cars execute the "face-recognition" and "car-recognition" path, respectively. For Face Blurring and Denoising, frames with faces proceed through a template match operation in which the frame is compared against a pre-determined face. If a match is found, the face in the frame is correspondingly blurred, or denoised using meanshift. The Toonify pipeline is parallel; it executes the bilateral filtering and edge operations in parallel before merging and encoding the frames. Finally, the synthetic pipeline is a chain of 5 image blurring operations, with the last operation being a face recognition operation. The face recognition operation is the most compute-intensive operation of this pipeline, which allows us to evaluate Llama's ability to meet diverse pipeline latency targets, even when configurations were mis-profiled (Section 6.4). Since sc-cpu, sc-gpu, and gg do not support branches, they execute the three branching pipelines as parallel ones (i.e., both branches are executed).

We first show how Llama's ability to dynamically reconfigure operation invocations enables it to outperform existing systems, both in terms of latency and cost.
Experimental setup.
For Nexus, we set the pipeline latency target to be 2 seconds per frame, which we found to be the strictest latency that does not drop any requests [59]. Nexus then automatically configures the batch size and number of instances for each model. For sc-cpu and sc-gpu, we swept each operation's batch size from 1 to 64 (by powers of 2) and set each value based on the lowest pipeline execution latency (reported in Figure 6). For gg and gg-branch, we set each invocation's configuration based on the lowest, most cost-effective CPU latency reported by the Operation-Profiler. We configured Llama with two pipeline latency targets: an unachievably low target that forced Llama to minimize pipeline execution latency at the expense of cost, llama-fast, and an overly-loose target that allowed Llama to minimize the overall cost, llama-cheap. These two pipeline latency targets represent the execution latency target range Llama can meet by dynamically configuring operations on the available heterogeneous backends.

Figure 7: Cost incurred by the baselines for each pipeline. Nexus only supports the AMBER Alert pipeline (unsupported pipelines are denoted by ×). Llama's cheapest, but slowest, execution is cheaper than all baselines.

Results and discussion.
Figures 6 and 7 show the processing latency and total cost, respectively, to execute each of the four non-synthetic pipelines. The results demonstrate that Llama achieves lower latency, higher throughput, and lower cost than existing systems. Even when the cost of starting and maintaining a warm cluster is not considered, Llama is faster than sc-cpu (28× on average) and cheaper (up to 110×, 55× on average). Compared to sc-gpu, Llama is up to 11× faster (6× on average) and up to 27× cheaper (18× on average). Scanner cannot dynamically adjust and right-size invocation configurations, and thus cannot address the performance degradation caused by resource contention for compute-intensive operations (e.g., deep learning inference and meanshift) or memory-intensive operations (e.g., bilateral filtering).

Figure 8: Evaluating Llama given varied latency targets. 50%: mean of the measured latencies of llama-fast and llama-cheap; 25%: mean of llama-fast and 50%; 75%: mean of llama-cheap and 50%. The execution latency is normalized to the pipeline target.

Next, since Nexus focuses on inference-serving pipelines, we are only able to compare Llama against it with the AMBER Alert pipeline (other pipelines are denoted by × in Figures 6 and 7). While we provide Nexus with 12 GPUs,
Nexus's bin-packing algorithm [59] utilizes only 8; thus we report cost for 8 GPUs. By dynamically deciding when to use CPU versus GPU configurations, Llama achieves a 1.3× speedup and 2.8× lower cost compared to Nexus.

Finally, compared to gg, Llama is up to 3.1× faster (2.2× on average) and up to 8.2× cheaper (5.7× on average). Compared to gg-branch, Llama is up to 2.9× faster (1.8× on average) and up to 6.8× cheaper (4.7× on average). While gg-branch is able to reason about conditional flow, it cannot make dynamic invocation configuration decisions or adjust to resource volatility, resulting in higher latency and cost compared to Llama. By making dynamic invocation configurations, Llama is able to determine how well operations perform across heterogeneous backends and right-size resources depending on the pipeline latency target.

Processing a pipeline via llama-fast yields the fastest but most expensive pipeline execution on the available hardware, while llama-cheap yields the cheapest but slowest. We now show that Llama can also meet latency targets that lie between these two extremes.
Experimental setup.
For each pipeline, we provide three la-tency targets to Llama that lie between the times required to ex-ecute the pipeline using llama-fast and llama-cheap . The 50%latency target is the mean latency between the latencies achievedby llama-fast and llama-cheap . The 25% latency target (the moststringent of the three) is the mean latency between llama-fast and the 50% latency target. Finally, the 75% latency target (the leaststringent of the three) is the mean latency between llama-cheap and the 50% latency target. For example, llama-fast executed FaceBlurring in 155 seconds, and llama-cheap executed it in 423 sec-onds; the 25%, 50%, and 75% latency targets are 225, 290, and 380seconds respectively. -10% No FB No
Figure 9:
Impact of turning Llama's techniques off on the AMBER Alert and Toonify pipelines. Red borders and circled slashes indicate the pipeline latency target was violated. FB is feedback, DFP is depth-first priority, SDB is safe delayed batching, ESLC is early speculation and late commit, and PBC is priority-based commit.
Results and discussion.
Figure 8 shows the observed execution latency, normalized to each of the aforementioned pipeline latency targets (≤ 1 indicates the target was met).

We now show how each technique of Llama contributes to its ability to efficiently meet pipeline latency targets.
Experimental setup.
We performed an ablation study with two distinct pipelines: AMBER Alert and Toonify. The following is the list of techniques employed by Llama: feedback loop (FB, Section 3.3), depth-first priority (DFP, Section 4.4), safe delayed batching (SDB, Section 4.4), early speculation and late commit (ESLC, Section 4.3), and priority-based commit (PBC, Section 4.4). Note that priority-based commit includes both depth-first priority and hardware affinity. For each run, we turn off a single technique and record the pipeline execution latency and cost. For each pipeline, we use its 50% pipeline latency target specified in Section 6.2.
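A minimal sketch of how such an ablation can be driven is shown below. It assumes a hypothetical run_pipeline(pipeline, target_s, disabled=...) entry point that executes the pipeline with one technique turned off and returns its latency and cost; neither the function nor the technique identifiers are Llama's actual API.

# Ablation driver sketch: disable one technique at a time and record the outcome.
TECHNIQUES = ["FB", "DFP", "SDB", "ESLC", "PBC"]

def ablate(pipeline, target_s, run_pipeline):
    results = {}
    for technique in TECHNIQUES:
        latency_s, cost = run_pipeline(pipeline, target_s, disabled=technique)
        results[technique] = {
            "latency_s": latency_s,
            "cost": cost,
            "violated": latency_s > target_s,  # did we miss the latency target?
        }
    return results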
Results and discussion.
Figure 9 shows the results of our ablation study (red borders and circled slashes indicate the latency target was violated). For the AMBER Alert pipeline, disabling feedback, depth-first priority, or early speculation and late commit results in latency target violations. All three techniques allow Llama to accurately measure and adapt to performance volatility caused by input-dependent execution flow (branching operations) and resource contention. For example, disabling feedback causes Llama to miss the latency target because resource contention resulted in invocations taking longer than their profiled values. With feedback enabled, Llama is able to detect this and choose configurations with higher throughput at a small expense of cost-efficiency. On the other hand, disabling safe delayed batching or priority-based commit causes Llama to not use large batches for deep learning inference invocations on GPUs, resulting in reduced cost-efficiency.

Pipeline (target)    Llama            Llama w/o FB & DFP
Denoising (350s)     (348s, $1.20)    (369s, $1.64)
Synthetic (520s)     (520s, $2.31)    (487s, $3.14)
Table 4:
Impact of profiling errors. Latency and cost for the Denoising and Synthetic pipelines when profiled values are inaccurate (set to 50% of their measured latencies). FB is feedback and DFP is depth-first priority. Without these techniques, Llama cannot meet the latency target, or uses configurations that are not cost-effective.
For the Toonify pipeline, disabling feedback also causes a latency target violation, similar to the AMBER Alert pipeline. Disabling either safe delayed batching or early speculation and late commit results in Llama choosing less cost-efficient configurations. On the other hand, disabling depth-first priority and priority-based commit results in more cost-efficient configurations without violating the latency target. This is because these techniques led to Llama choosing configurations that are more throughput-intensive than necessary for merge edge-filter operation invocations in an effort to meet the pipeline latency target. However, as noted for the AMBER Alert pipeline, and as we will evaluate in Section 6.4, both depth-first priority and priority-based commit are important for Llama's robustness in right-sizing resources and meeting latency targets despite profiling errors.

The ablation study shows that all of Llama's techniques are important for meeting pipeline latency targets, even when pipelines exhibit input-dependent execution flow and introduce large configuration spaces.
Experimental setup.
To evaluate the impact of profiling errors on Llama's efficacy, we used the Denoising and Synthetic pipelines because they represent worst-case scenarios: an expensive operation at the end of the pipeline with an under-estimated latency. In addition, the Synthetic pipeline is the longest pipeline, which further exacerbates profiling errors. In such cases, Llama will under-allot slack to the last operation unless techniques are used to mitigate mis-profiling. All operation profiled latencies are "mis-profiled" by being set to 50% of their measured values, and we again use the respective 50% pipeline latency target, specified in Section 6.2, for each pipeline.
Results and discussion.
Table 4 shows the latency and cost to run both pipelines with all of Llama's techniques enabled compared to runs with both feedback and depth-first priority turned off (the two techniques Llama relies on to adjust for inaccurate profiling). For the Denoising pipeline, disabling feedback and depth-first priority causes Llama to under-allot slack to the last meanshift operation. This results in a missed pipeline latency target because Llama could not adjust to the profiling errors until late in the pipeline execution. For the Synthetic pipeline, when both techniques were off, Llama meets the latency target but at a 35% higher cost. This is because the 50% lower-than-profiled latencies cause Llama's objective function (Equation 1) to incorrectly calculate that the CPU, not the GPU, is the most cost-efficient choice for the meanshift operation. Since the CPU configuration was actually 50% slower and less cost-effective than a GPU, the cost ends up being higher than if a GPU had been used.
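To make the failure mode concrete, the sketch below shows one way an under-estimated profile can lead to under-allotted slack. It is our simplification: slack here is simply the target time remaining on an invocation's path minus the profiled latencies of the operations still to run, and the helper name and numbers are illustrative, not Llama's exact Equation 1 or measured values.

# Sketch: slack available to the operation at op_index, given the pipeline
# latency target, time already spent, and profiled latencies of later operations.
def slack_for(op_index, target_s, elapsed_s, profiled_s):
    remaining_estimate = sum(profiled_s[op_index + 1:])
    return (target_s - elapsed_s) - remaining_estimate

accurate = [20.0, 20.0, 60.0]   # last operation (e.g., meanshift) is the expensive one
halved   = [20.0, 20.0, 30.0]   # last profile under-estimated by 50%
target = 110.0

slack_for(0, target, 0.0, accurate)  # 30 s of slack for the first operation
slack_for(0, target, 0.0, halved)    # 60 s: earlier operations pick slower, cheaper
                                     # configurations, leaving too little time for the
                                     # final operation unless feedback and depth-first
                                     # priority correct the estimates early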
Phase           Action                Latency (% of exec.)
Specification   Profiling             257 ± 155 s
                Path decomposition    1.74 s
Online          Speculate             0.005 ms

Table 5: Llama's decision overheads. Mean and standard deviation of action latencies for the AMBER Alert pipeline. Latencies are per-invocation for online actions, per-operation for profiling, and per-pipeline for path decomposition. For each online action, we show the percent of the execution time spent on the action across all operation invocations (73K).
These results demonstrate that both depth-first priority and feedback are necessary to resolve profiling discrepancies early during execution.
Finally, we evaluate Llama's overheads and its ability to scale across backends. During the specification phase, Llama's Operation-Profiler generates configuration specifications for each new operation, and the Pipeline-Decomposer decomposes a pipeline into all possible sequential paths. During the online phase, the Configurator speculates and commits each invocation, calculating the invocation's slack and assigning it a configuration. The Scheduler then invokes the committed configuration. Once an invocation returns, the Manager takes the finalize action to record runtime statistics, resolve branches, and generate new invocations.

Table 5 shows the overhead for these decisions when specifying and running the AMBER Alert pipeline with the 50% intermediate latency target from Section 6.2; all other pipelines have similar overheads. For the specification phase, profiling each operation takes an average of 257 seconds and only needs to be performed the first time an operation is added to the Operation Library. The decomposition step, which is performed once per pipeline, takes only 1.7 seconds.

During the online phase, Llama spends only 483 microseconds, on average, to process (i.e., speculate, commit, invoke, and finalize) an invocation, allowing Llama to process over 2,000 invocations per second. Calculating slack and determining a configuration is efficient, as speculate requires only 5 microseconds. Most time is spent evaluating priority between operations during commit, connecting and sending invocations to backends during invoke, and updating global state once invocations complete during finalize.

Low action overheads also allow Llama to improve execution latency as the number of backend instances or maximum concurrency increases. Compared to llama-fast for the AMBER Alert pipeline run on 10 CPU and 2 GPU instances (Section 6.1), having 6 CPU and 1 GPU instances is 46% slower, while having 15 CPU and 3 GPU instances is 25% faster. Llama adjusts its configuration decisions based on the available backend resources.
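The per-invocation flow described above can be summarized with the sketch below. The class and method names are ours, and the real Configurator, Scheduler, and Manager logic (priority evaluation, batching, branch resolution) is elided; this is a structural illustration, not Llama's implementation.

# Sketch of the online phase actions taken for a single invocation.
def process(invocation, configurator, scheduler, manager):
    slack = configurator.speculate(invocation)       # ~5 us: compute remaining slack
    config = configurator.commit(invocation, slack)  # pick backend, batch size, etc.
    result = scheduler.invoke(invocation, config)    # dispatch to a serverless backend
    manager.finalize(invocation, result)             # record stats, resolve branches,
                                                     # generate downstream invocations

At roughly 483 microseconds across all four actions, this loop is fast enough to sustain the reported rate of over 2,000 invocations per second.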
Scheduling multiple pipelines.
In this paper, we focused on how Llama can meet the latency target of a single running pipeline. In future work, we will study how Llama's Configurator can be modified to support multiple concurrently-running pipelines that potentially share common operations. We foresee several challenges in supporting multiple concurrently-running pipelines, including: (a) computing slack for invocations across pipelines, and (b) choosing invocations to batch across pipelines given disparate latency targets.

Extensibility to other domains.
Llama was designed to address the challenges of running video analytics and processing pipelines (Section 2). However, Llama is also extensible to other operations and frameworks outside of the video analytics and processing domain. We designed Llama's operation configuration specification (Section 3.2) to support arbitrary operation- and hardware-specific configuration knobs (see Table 3 for examples of the various operations used in our evaluation). In addition, Llama's SDK supports pipelines being expressed as DAGs composed from sequential, parallel, and branching patterns. Two such example domains are natural language processing question-answer systems and database query engines. Both of these domains have diverse latency targets, input-dependent execution flow (or stateful models in the case of question-answer systems), and pipeline operations that can be configured with various knobs and across heterogeneous hardware platforms (e.g., database machine learning operations on CPUs versus custom accelerators [66]). Llama can be used for meeting pipeline latency targets for applications in these, and potentially other, domains.
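To illustrate how a pipeline might be expressed with such an SDK, the sketch below composes a small AMBER Alert-style DAG from sequential and branching patterns. The Operation and Branch types, their knob names, and the operation names are hypothetical stand-ins for the patterns described above, not Llama's actual SDK.

# Hypothetical pipeline definition built from sequential and branching patterns.
from dataclasses import dataclass, field

@dataclass
class Operation:
    name: str
    knobs: dict = field(default_factory=dict)  # e.g., sampling rate, batch size, hardware

@dataclass
class Branch:
    condition: str      # predicate evaluated on the upstream operation's output
    if_true: list       # operations executed only when the predicate holds

pipeline = [
    Operation("decode", {"sampling_rate": 1}),
    Operation("detect_people_cars", {"batch_size": 16, "hardware": ["cpu", "gpu"]}),
    Branch("person_detected", if_true=[Operation("match_face", {"hardware": ["gpu"]})]),
    Branch("car_detected", if_true=[Operation("match_car_description", {})]),
]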
Video and general-purpose analytics frameworks.
Nexus [59] uses latency targets to schedule and manage deep learning models on GPU clusters for video analytics, but only works on homogeneous clusters and does not support general video operations. Focus [39] uses latency targets to decide what deep learning model to use to trade off latency for accuracy, but statically provisions resources and does not support input-dependent execution flow. Similarly, NoScope [46] trains cheap proxy models for optimizing binary classification, and BlazeIt [45] provides a declarative language with optimizations for aggregation and limit queries for video analytics. NoScope and BlazeIt are orthogonal to Llama, and can be incorporated into Llama's online phase for deep learning models. Scanner [57] is a cluster-based video processing engine that scales to large clusters of CPUs and GPUs, but requires a user-set static hardware and resource configuration. VideoStorm [74] is a streaming video processing system that optimizes for quality and lag, but requires re-profiling as the video inputs and pipelines change (which can take hours to days).

Other cluster-based systems for both domain-specific and general-purpose applications [27, 32, 40, 54, 55, 70, 73] either do not support input-dependent execution flow, require extensive per-pipeline profiling, or require users to configure and right-size resources. Serverless frameworks such as gg [34], Sprocket [22], and PyWren [43] leverage the burst-parallel capability of existing serverless platforms to process video pipelines. Unlike Llama, no existing system meets diverse pipeline latency targets across complex video pipelines using heterogeneous serverless backends.
Dataflow optimizations and scheduling techniques.
GrandSLAm [47] uses a slack-forwarding technique to statically determine the batch size for sequential microservice graphs. Delayed batching is used by Clipper [28] to increase the efficiency of inference queries, but relies on users to statically set how long Clipper should wait for requests to batch. Late binding is a technique used by schedulers [24, 51, 52, 56, 63] to maximize the flexibility of the scheduling decision and knowledge of system state. However, these systems do not consider the need to configure operations for meeting pipeline latency targets. TetriSched [62] utilizes global scheduling that prevents tasks from being sent to a sub-optimal set of resources due to resources being held by earlier jobs, but only supports per-operation targets, not an end-to-end pipeline latency target. Early speculation and late commit, together with priority-based commit, allow Llama to compute slack and make configuration decisions for arbitrarily complex pipelines to meet overall pipeline latency targets. Musketeer [37] and Dandelion [58] optimize dataflow DAGs for execution on a broad range of execution engines or hardware platforms. These optimizations are compatible with Llama, and can be used to expand the backends and hardware platforms Llama supports.
Cost-based query optimization.
Several works have explored cost-based query optimization for relational databases [17, 38, 48, 60, 61, 65]. Recently, frameworks such as Tempura [69] have provided support for incremental data processing for queries whose optimal plan is input-dependent. Llama is compatible with these frameworks, and can leverage their cost-based optimizations as an extension to how the Configurator chooses configurations based on an invocation's slack, especially for database operations.
Auto-tuning configurations.
CherryPick [18] and Ernest [64] present performance prediction frameworks for recurring data analytics jobs; however, these systems require tens of executions of a job to set the configuration parameters. PARIS [72] focuses on VM-size selection; OptimusCloud [53] and Selecta [50] are domain-specific VM configuration systems for databases and storage technologies, respectively. Llama dynamically configures general video operations to meet diverse latency targets, and only requires one-time per-operation profiling.
We presented Llama, a heterogeneous and serverless video analytics and processing framework that executes general video pipelines, meeting user-specified performance targets at minimal cost. By dynamically configuring individual operation invocations, Llama efficiently traverses large configuration spaces, adapts to input-dependent execution flow, and dynamically allocates resources across heterogeneous serverless backends. Llama makes per-operation invocation decisions by first calculating invocation slack, then leveraging key techniques such as safe delayed batching, priority-based commit, and early speculation and late commit to efficiently and accurately select configurations that meet the slack. In doing so, Llama achieves an average improvement of 7.9× for latency and 17.2× for cost compared to state-of-the-art systems.

References

Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah, GA, USA) (OSDI '16). USENIX Association, USA, 265–283.
[17] Yanif Ahmad, Oliver Kennedy, Christoph Koch, and Milos Nikolic. 2012. DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views.
Proc. VLDB Endow.
5, 10 (June 2012), 968–979. https://doi.org/10.14778/2336664.2336670
[18] Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, and Ming Zhang. 2017. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In
Computer
50, 10 (2017), 58–67. https://doi.org/10.1109/MC.2017.3641638
[21] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In
Proceedings of the ACM Symposium on Cloud Computing (Carlsbad, CA, USA) (SoCC '18)
Proceedings of the Thirteenth EuroSys Conference (Porto, Portugal) (EuroSys '18). Association for Computing Machinery, New York, NY, USA, Article 20, 15 pages. https://doi.org/10.1145/3190508.3190532
[25] G. Bradski. 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).
[26] Jack Choquette and Wishwesh Gandhi. 2020. NVIDIA's A100 GPU: Performance and Innovation for GPU Computing. In . IEEE.
[27] Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: Latency-Aware Provisioning and Scaling for Prediction Serving Pipelines. In
Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC '20). Association for Computing Machinery, New York, NY, USA, 477–491. https://doi.org/10.1145/3419111.3421285
[28] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In
Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (San Francisco, CA) (OSDI '04). USENIX Association, USA, 10.
[31] T. Elgamal. 2018. Costless: Optimizing Cost of Serverless Computing through Function Fusion and Placement. In . 300–312. https://doi.org/10.1109/SEC.2018.00029
[32] Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. 2012. Jockey: Guaranteed Job Latency in Data Parallel Clusters. In Proceedings of the 7th ACM European Conference on Computer Systems (Bern, Switzerland) (EuroSys '12). Association for Computing Machinery, New York, NY, USA, 99–112. https://doi.org/10.1145/2168836.2168847
[33] FFmpeg 2021. FFmpeg. https://ffmpeg.org/.
[34] Sadjad Fouladi, Francisco Romero, Dan Iter, Qian Li, Shuvo Chatterjee, Christos Kozyrakis, Matei Zaharia, and Keith Winstein. 2019. From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers. In Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (Renton, WA, USA) (USENIX ATC '19). USENIX Association, USA, 475–488.
[35] Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. 2017. Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (Boston, MA, USA) (NSDI '17). USENIX Association, USA, 363–376.
[36] Ilya Ganusov and Mahesh Iyer. 2020. Agilex Generation of Intel FPGAs. In . IEEE.
[37] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. 2015. Musketeer: All for One, One for All in Data Processing Systems. In Proceedings of the Tenth European Conference on Computer Systems (Bordeaux, France) (EuroSys '15). Association for Computing Machinery, New York, NY, USA, Article 2, 16 pages. https://doi.org/10.1145/2741948.2741968
[38] Herodotos Herodotou and Shivnath Babu. 2011. Profiling, What-If Analysis, and Cost-Based Optimization of MapReduce Programs.
Proc. VLDB Endow.
4, 11 (Aug. 2011), 1111–1122. https://doi.org/10.14778/3402707.3402746
[39] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (Lisbon, Portugal) (EuroSys '07). Association for Computing Machinery, New York, NY, USA, 59–72. https://doi.org/10.1145/1272996.1273005
[41] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. 2017. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 4 (2017), 230–243. https://doi.org/10.1136/svn-2017-000101 arXiv:https://svn.bmj.com/content/2/4/230.full.pdf
[42] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. 2018. Chameleon: Scalable Adaptation of Video Analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (Budapest, Hungary) (SIGCOMM '18). Association for Computing Machinery, New York, NY, USA, 253–266. https://doi.org/10.1145/3230543.3230574
[43] Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. 2017. Occupy the Cloud: Distributed Computing for the 99%. In Proceedings of the 2017 Symposium on Cloud Computing (Santa Clara, California) (SoCC '17). Association for Computing Machinery, New York, NY, USA, 445–451. https://doi.org/10.1145/3127479.3128601
[44] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In . 1–12. https://doi.org/10.1145/3079856.3080246
[45] Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics.
Proc. VLDB Endow.
13, 4 (Dec. 2019), 533–546. https://doi.org/10.14778/3372716.3372725
[46] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proc. VLDB Endow.
10, 11 (Aug. 2017), 1586–1597. https://doi.org/10.14778/3137628.3137664
[47] Ram Srivatsa Kannan, Lavanya Subramanian, Ashwin Raju, Jeongseob Ahn, Jason Mars, and Lingjia Tang. 2019. GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks. In Proceedings of the Fourteenth EuroSys Conference 2019 (Dresden, Germany) (EuroSys '19). Association for Computing Machinery, New York, NY, USA, Article 34, 16 pages. https://doi.org/10.1145/3302424.3303958
[48] Sunghwan Kim, Taesung Lee, Seung-won Hwang, and Sameh Elnikety. 2018. List Intersection for Web Search: Algorithms, Cost Models, and Optimizations.
Proc. VLDB Endow.
12, 1 (Sept. 2018), 1–13. https://doi.org/10.14778/3275536.3275537
[49] Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755–1758.
[50] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. 2018. Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics. In
Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI '18). USENIX Association, USA, 253–267.
[53] Ashraf Mahgoub, Alexander Michaelson Medoff, Rakesh Kumar, Subrata Mitra, Ana Klimovic, Somali Chaterji, and Saurabh Bagchi. 2020. OPTIMUSCLOUD: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud. In
Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (Indianapolis, Indiana, USA) (SIGMOD '10). Association for Computing Machinery, New York, NY, USA, 135–146. https://doi.org/10.1145/1807167.1807184
[55] Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: A Universal Execution Engine for Distributed Data-Flow Computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Boston, MA) (NSDI '11). USENIX Association, USA, 113–126.
[56] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (Farmington, Pennsylvania) (SOSP '13). Association for Computing Machinery, New York, NY, USA, 69–84. https://doi.org/10.1145/2517349.2522716
[57] Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. 2018. Scanner: Efficient Video Analysis at Scale.
ACM Trans. Graph.
37, 4, Article 138 (July 2018), 13 pages. https://doi.org/10.1145/3197517.3201394
[58] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: A Compiler and Runtime for Heterogeneous Systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (Farmington, Pennsylvania) (SOSP '13). Association for Computing Machinery, New York, NY, USA, 49–68. https://doi.org/10.1145/2517349.2522715
[59] Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 322–337. https://doi.org/10.1145/3341301.3359658
[60] Ji Sun and Guoliang Li. 2019. An End-to-End Learning-Based Cost Estimator.
Proc. VLDB Endow.
13, 3 (Nov. 2019), 307–319. https://doi.org/10.14778/3368289.3368296
[61] Jian Tan, Tieying Zhang, Feifei Li, Jie Chen, Qixing Zheng, Ping Zhang, Honglin Qiao, Yue Shi, Wei Cao, and Rui Zhang. 2019. IBTune: Individualized Buffer Tuning for Large-Scale Cloud Databases.
Proc. VLDB Endow.
12, 10 (June 2019), 1221–1234. https://doi.org/10.14778/3339490.3339503
[62] Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A. Kozuch, Mor Harchol-Balter, and Gregory R. Ganger. 2016. TetriSched: Global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the 11th European Conference on Computer Systems (EuroSys 2016). Association for Computing Machinery. https://doi.org/10.1145/2901318.2901355
[63] Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, and Ion Stoica. 2014. The Power of Choice in Data-Aware Cluster Scheduling. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (Broomfield, CO) (OSDI '14). USENIX Association, USA, 301–316.
[64] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. 2016. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In
Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (Madison, Wisconsin) (SIGMOD '02). Association for Computing Machinery, New York, NY, USA, 37–48. https://doi.org/10.1145/564691.564697
[66] Matthew Vilim, Alexander Rucker, Yaqi Zhang, Sophia Liu, and Kunle Olukotun. 2020. Gorgon: Accelerating Machine Learning from Relational Data. IEEE Press, 309–321. https://doi.org/10.1109/ISCA45697.2020.00035
[67] Martin Voogel, Yohan Frans, and Matt Ouellette. 2020. Xilinx Versal Premium Series. In . IEEE.
[68] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. 2018. Peeking Behind the Curtains of Serverless Platforms. In
Proc. VLDB Endow.
14, 1 (Sept. 2020), 14–27. https://doi.org/10.14778/3421424.3421427
[70] Ran Xu, Jinkyu Koo, Rakesh Kumar, Peter Bai, Subrata Mitra, Sasa Misailovic, and Saurabh Bagchi. 2018. VideoChef: Efficient Approximation for Streaming Video Processing Pipelines. In
Proceedings of the ACM Symposium on Cloud Computing (Seattle, WA, USA) (SoCC '14). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/2670979.2671005
[72] Neeraja J. Yadwadkar, Bharath Hariharan, Joseph E. Gonzalez, Burton Smith, and Randy H. Katz. 2017. Selecting the Best VM Across Multiple Public Clouds: A Data-driven Performance Modeling Approach. In Proceedings of the 2017 Symposium on Cloud Computing (Santa Clara, California) (SoCC '17). ACM, 452–465. https://doi.org/10.1145/3127479.3131614
[73] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (Boston, MA) (HotCloud '10). USENIX Association, USA, 10.
[74] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live Video Analytics at Scale with Approximation and Delay-Tolerance. In