THIA: Accelerating Video Analytics using Early Inference and Fine-Grained Query Planning
Jiashen Cao
Georgia Institute of Technology
Ramyad Hadidi
Georgia Institute of Technology
Joy Arulraj
Georgia Institute of Technology
Hyesoon Kim
Georgia Institute of Technology
Abstract
To efficiently process visual data at scale, researchers have proposed two techniques for lowering the computational overhead associated with the underlying deep learning models. The first approach consists of leveraging a specialized, lightweight model to directly answer the query. The second approach focuses on filtering irrelevant frames using a lightweight model and processing the filtered frames using a heavyweight model. These techniques suffer from two limitations. With the first approach, the specialized model is unable to provide accurate results for hard-to-detect events. With the second approach, the system is unable to accelerate queries focusing on frequently occurring events, as the filter is unable to eliminate a significant fraction of frames in the video.

In this paper, we present Thia, a video analytics system for tackling these limitations. The design of Thia is centered around three techniques. First, instead of using a cascade of models, it uses a single object detection model with multiple exit points for short-circuiting the inference. This early inference technique allows it to support a range of throughput-accuracy tradeoffs. Second, it adopts a fine-grained approach to planning, and processes different chunks of the video using different exit points to meet the user's requirements. Lastly, it uses a lightweight technique for directly estimating the exit point for a chunk to lower the optimization time. We empirically show that these techniques enable Thia to outperform two state-of-the-art video analytics systems by up to 6.5×, while providing accurate results even on queries focusing on hard-to-detect events.

Researchers have proposed systems for quickly processing visual data with a tolerable drop in accuracy [3, 4, 7, 8, 10–13, 16, 23, 24]. These systems detect objects in videos using deep neural networks (DNNs) [5, 17]. The key challenge that these systems tackle is the computational overhead of the underlying object detection model.
Prior Work. To efficiently process visual data at scale, researchers have proposed two techniques. The first approach, presented in BlazeIt [10], consists of leveraging a specialized, lightweight model to directly answer the query. The second approach, introduced in PP [13], focuses on filtering irrelevant frames using a lightweight model. The frames that pass through the filtering model are then processed by the heavyweight object detection model (illustrated in Figure 2). So, these systems accelerate query processing by not processing a subset of video frames using the heavyweight model. However, these techniques suffer from two limitations. With the first approach, the specialized model is unable to provide accurate results for hard-to-detect events. With the second approach, the system is unable to accelerate queries focusing on frequently occurring events. This is because the filter is unable to eliminate a significant fraction of frames in the video.

Another line of research, illustrated in Tahoma [1], focuses on leveraging a collection of differently sized models to process the frames based on the complexity of the event. However, using such a cascade of models comes with two limitations. First, switching from one model to another in the GPU is expensive. This switching overhead is further exacerbated if we seek to frequently change the model to process different subsets (i.e., chunks) of the video to maximize performance. For instance, loading a Faster-RCNN model in PyTorch on an NVIDIA Titan Xp GPU takes 2 s (including framework initialization and model loading). Second, using a collection of models to support different throughput-accuracy tradeoffs does not scale well due to the large GPU memory footprint of these models.

Prior efforts have mostly focused on altering the design of the inference pipeline. However, they do not elaborate on how to adapt this pipeline (e.g., when to use a particular model) based on the chunk.
They choose a single plan for the entire video based on the profiling results obtained on a set of sampled frames. Such a coarse-grained approach to query planning does not leverage the variation in the frequency and detection difficulty of different events in a video. If objects are difficult to detect, this approach leads to less accurate results. On the other hand, if objects are easier to detect, a conservative coarse-grained query plan significantly increases the query processing time (but returns correct results). We defer a detailed discussion of these limitations to §3.
Our Approach. In this paper, we present Thia, a video analytics system for tackling the limitations highlighted above. Thia leverages three techniques to accelerate queries over visual data. First, it uses a single object detection model with multiple exit points for short-circuiting the inference. These exit points (EPs) offer a set of throughput-accuracy tradeoffs. While processing the query, Thia uses a shallow EP to quickly process frames that are irrelevant or contain easy-to-detect events. If the frames contain hard-to-detect events, then Thia falls back to a deeper EP in the model to deliver higher accuracy. This Early Inference technique eliminates the switching overhead and lowers the GPU memory footprint of Thia. Second, Thia adopts a fine-grained approach to planning. It processes different chunks of the video using different EPs to meet both the performance and accuracy requirements (elaborated in §6.1). This Fine-Grained Planning technique increases the optimization time of a query. To lower this overhead, we present a third technique to quickly decide which EP to use for a given chunk. Thia uses a shallow model for Exit Point Estimation instead of running inference on the sampled frames.

Figure 1: Object Detection Model – Components of the model.

We evaluate a set of queries focusing on events with different levels of frequency and detection difficulty on two traffic surveillance datasets: UA-DeTrac [21] and Jackson Town [10]. On all of the queries, Thia outperforms the state-of-the-art video analytics systems by up to 6.5× with a tolerable drop in accuracy.

Contributions. Our research makes the following contributions:
• We present the Early Inference technique to construct a single model that offers a set of throughput-accuracy tradeoffs for challenging vision tasks like object detection.
• We propose a Fine-Grained Planning technique that works in tandem with the Early Inference technique.
• We present the Exit Point Estimation technique to reduce the optimization overhead of the Fine-Grained Planning technique.
• We implement all of these techniques in Thia and show that it outperforms two state-of-the-art video analytics systems on a wide range of queries.
We now present an overview of object detection and sampling techniques used in video analytics systems (§2.1 and §2.2). We then discuss the key techniques used in state-of-the-art systems (§2.3).
Object detection models usually contain three components: (1) a backbone network, (2) a region proposal network (RPN), and (3) a region of interest (ROI) network, as illustrated in Figure 1. The backbone network extracts high-level features from a frame. Then, the RPN and ROI networks determine the location and type of the objects detected in the frame. Data flows from the backbone network to the RPN, and the ROI network returns the final prediction results (object category, location within the frame, and confidence score).

In the machine learning literature, the oracle model returns the correct answer to all queries. However, in practice, there is no ground truth for unseen data. Similar to prior efforts, we assume that the most accurate model, which also tends to be the most compute-intensive model, is the oracle model [9–11, 13, 24].
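The backbone → RPN → ROI flow described above can be sketched as a stub pipeline. The stage bodies, the box coordinates, and the confidence value below are hypothetical placeholders, not the actual Faster-RCNN internals:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    category: str                    # object category, e.g. "Car"
    box: Tuple[int, int, int, int]   # location within the frame
    confidence: float                # confidence score

def backbone(frame):
    """Stub backbone: extracts high-level features from a frame."""
    return {"features": frame}

def rpn(features) -> List[Tuple[int, int, int, int]]:
    """Stub region proposal network: proposes candidate object regions."""
    return [(100, 230, 20, 23)]

def roi(features, proposals) -> List[Detection]:
    """Stub ROI network: classifies and scores each proposed region."""
    return [Detection("Car", box, 0.9) for box in proposals]

def detect(frame) -> List[Detection]:
    feats = backbone(frame)
    return roi(feats, rpn(feats["features"]))

print(detect("frame-0"))
```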
Sampling is a frequently used technique for processing visual data at scale. By processing only a subset of frames using the object detection model, a video analytics system lowers the overall query processing time. For example, BlazeIt [10] uses uniformly random sampling to process aggregate queries (e.g., counting the average number of cars within a given period of time).
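A minimal sketch of this sampling strategy, assuming a hypothetical `detector` callback and synthetic frames in place of a real video and model:

```python
import random

def estimate_average_count(frames, detector, sample_fraction=0.1, seed=42):
    """Estimate an aggregate (here, the average object count) by running
    the detector on a uniformly random subset of frames."""
    rng = random.Random(seed)
    k = max(1, int(len(frames) * sample_fraction))
    sample = rng.sample(frames, k)     # uniform sampling without replacement
    counts = [detector(f) for f in sample]
    return sum(counts) / len(counts)

# Hypothetical stand-in: each "frame" directly stores its car count.
frames = [{"cars": i % 5} for i in range(1000)]
avg = estimate_average_count(frames, detector=lambda f: f["cars"])
print(avg)
```

Only 10% of the frames are passed to the (stub) detector, which is where the query-time savings come from.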
Figure 2: Architecture of Video Analytics Systems – Architecture of two state-of-the-art video analytics systems: (1) PP [13] and (2) BlazeIt [10].

System         MS   MC   FP✝   EI✝   EP-Est✝
Naive          –    –    –     –     –
PP             ✔    –    –     –     –
BlazeIt        ✔    –    –     –     –
Thia-Single★   –    –    ✔     –     –
Thia-Multi★    –    ✔    ✔     –     –
Thia-EI★       –    –    ✔     ✔     –
Thia★          –    –    ✔     ✔     ✔

MS: Model Specialization, MC: Model Cascade, FP: Fine-Grained Planning, EI: Early Inference, EP-Est: Exit Point Estimation. ✝: Techniques used in Thia. ★: Variants of Thia.

Table 1: Qualitative Comparison of Video Analytics Systems – Key characteristics of state-of-the-art video analytics systems.
Thia uses sampling for a different purpose (elaborated in §6). A chunk is a contiguous segment of frames within a video. Thia's Optimizer constructs plans at chunk-level granularity (instead of video-level granularity) to lower the query processing time. Query processing time consists of two components: (1) optimization time, and (2) execution time. During the optimization phase, Thia generates a plan for each chunk. During the execution phase, it runs these plans.
Table 1 lists the key characteristics of several state-of-the-art video analytics systems: (1) PP [13], (2) BlazeIt [10], (3) Miris [2], (4) Tahoma [1], and (5) Panorama [24]. We present the benefits and limitations of the first two systems in §1. Their architectures are illustrated in Figure 2.

Miris [2] is a video analytics system that focuses on multi-object tracking. It uses coarse-grained sampling to gain a high-level perspective of the video and then gradually increases the sampling rate to improve the accuracy of tracking. Thia differs from Miris in two ways. First, it is tailored for object detection. Second, it only samples for query planning (not for query execution). In §7.7, we illustrate the benefits of the other optimizations in Thia by comparing it against a variant of Thia that only uses Fine-Grained Planning (Thia-Single in Table 1).

Tahoma [1] is another closely related analytics system. It constructs a model cascade by combining a chain of image classification models and determines when to short-circuit the inference based on
the confidence score of the prediction of each model. Unlike Tahoma, Thia is geared toward object detection. So, the inference result consists of a set of confidence scores for all the objects present in the frame. It is challenging to short-circuit the inference pipeline based on a set of confidence scores. In Thia, the Early Inference technique is guided by query accuracy (not model accuracy). In §7.7 and §7.8, we illustrate the limitations of using a model cascade by comparing Thia against a variant of Thia that uses Fine-Grained Planning along with a model cascade (Thia-Multi in Table 1), instead of a single Early Inference model.

Panorama [24] is another state-of-the-art video analytics system that uses a single model to solve the unbounded vocabulary problem in object recognition. While this system also offers a set of throughput-accuracy tradeoffs similar to Thia, it is geared towards comparing embeddings from two input frames. So, it selects the EP based on the delta between two embeddings while extracting the embeddings. Lastly, it clusters these embeddings to recognize the objects in the input frames. In contrast, Thia uses Early Inference in the object detection model itself and seeks to reduce the optimization time using the Exit Point Estimation technique.

In this section, we discuss the limitations of PP and BlazeIt to motivate the need for Thia. We focus on the four queries described in Table 2. These queries differ in: (1) the frequency of appearance of target objects in the video, and (2) the level of difficulty in providing a correct answer to a query.

Query   SQL                                                        Predicate Frequency   Predicate Difficulty
Q1      Select frameID From UA-DeTrac Where Count(Car) >= 4;       Frequent              Easy
Q2      Select frameID From UA-DeTrac Where Count(Truck) >= 1;     Frequent              Hard
Q3      Select frameID From UA-DeTrac Where Count(Bus) >= 4;       Rare                  Hard
Q4      Select frameID From Jackson-Town Where Count(Car) >= 4;    Rare                  Hard

Table 2: List of Queries – Queries with varying frequency and levels of difficulty in detecting events.
Limitation I – Model specialization overhead. Both PP and BlazeIt rely on specialized models. PP uses a specialized model as a filter. Since each filter detects only one object category, it needs to train multiple lightweight models (i.e., filters) at runtime to support different object categories. BlazeIt uses a specialized model to directly return the results. A model may directly return the count of cars in an image, so it must maintain multiple models for different predicates (e.g., Count(Car) is a predicate). With this model specialization technique, these systems need to train and maintain models for different objects and predicates, respectively. We seek to reduce this model maintenance overhead by offering a range of accuracy and query execution time tradeoffs in a single model.

(Thia-Multi delivers better performance than a naive model cascade due to the Fine-Grained Planning technique. With a naive model cascade, the system cannot directly process frames with the optimal model. It must use all the smaller models before stopping the inference at the optimal model.)

Difficulty   Precision   Recall     Throughput
Easy         97.72%                 ×
Hard         100.00%     100.00%    ×

Table 3: BlazeIt vs Naive – Key metrics of BlazeIt with respect to a Naive system that only uses the heavyweight object detector.
Limitation II – Frequent events. The filtering technique used in the PP system [13] relies on data reduction by the filter to achieve a speedup. Let's assume the system is processing N frames and that the fraction of frames that is filtered out and discarded by the filter is r. Let the costs of running the filter and running the object detector be C_f and C_o per frame, respectively. To obtain a speedup, the data reduction rate must satisfy this constraint:

N · (C_f + (1 − r) · C_o) < N · C_o  ≡  r > C_f / C_o

This constraint is not met by frequent events (e.g., Q1 in Section 3). In this case, since r is small, the filter slows down the overall pipeline by adding additional overhead. As a result, PP is slower than Naive (i.e., naively running the object detector on every frame) for frequent queries like Q1. PP only provides a 0.93× speedup compared to Naive in this case. In contrast, for rare queries like Q3, PP is able to provide a 1.44× speedup compared to Naive. We seek to dynamically adjust the query execution pipeline based on the estimated frequency of the event.

Limitation III – Difficult-to-detect objects. BlazeIt [10] uses a specialized model to directly return aggregates (e.g., the number of cars in an image). This approach does not generalize to complex visual datasets. The reasons are twofold. First, the specialized model is designed to be shallow for fast execution, so it is unable to learn complex patterns. Second, it relies on an ad-hoc subset of videos for training, so the lack of positive examples greatly affects the quality of the model. As shown in Table 3, BlazeIt returns precise answers for easy-to-answer queries. However, it has a lower recall metric. For hard-to-answer queries (e.g., Q3), the specialized model does not offer useful results. So, the system instead falls back to the object detection model. In this case, BlazeIt runs the specialized model in addition to the object detector, resulting in lower performance than Naive. In contrast, Thia is capable of selecting an optimal plan with good accuracy and performance metrics.
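The break-even condition from Limitation II can be checked numerically. A minimal sketch, with hypothetical per-frame costs C_f and C_o (the 10x cost ratio below is illustrative, not measured):

```python
def filter_speedup(c_f, c_o, r):
    """Speedup of a filter-then-detect pipeline over running the object
    detector on every frame: C_o / (C_f + (1 - r) * C_o)."""
    return c_o / (c_f + (1.0 - r) * c_o)

def filtering_helps(c_f, c_o, r):
    """Filtering pays off only when the reduction rate r exceeds C_f / C_o."""
    return r > c_f / c_o

# Hypothetical costs: the filter is 10x cheaper than the detector.
c_f, c_o = 1.0, 10.0
for r in (0.05, 0.40):  # frequent event (low r) vs. rare event (high r)
    print(r, filtering_helps(c_f, c_o, r), round(filter_speedup(c_f, c_o, r), 2))
```

With these illustrative costs, a low reduction rate yields a sub-1× "speedup" (the filter is pure overhead), while a high reduction rate yields a genuine speedup, mirroring the Q1 vs. Q3 behavior described above.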
Our Approach. In Figure 3, we show two prediction results from our Early Inference technique (the oracle object detection EP and a shallow EP, respectively). We observe that the faster EP is still able to capture the presence of cars, but it is less accurate in two ways. First, the bounding boxes are not accurate, so multiple bounding boxes are returned for the same object. Second, it tends to miss hard-to-detect objects (e.g., objects far away or objects with lights). If a user queries for an image with exactly four cars, Thia uses the oracle exit point to satisfy the precision requirement. However, if the user is only interested in images with cars, Thia uses the faster exit point to obtain a 6× speedup. We design Thia so that it carefully chooses the optimal query execution plan for every chunk of the video to deliver higher accuracy and speedup.

Figure 3: Object Detection Results – Objects detected by (a) the oracle EP (1× speedup), and (b) a shallow EP (6× speedup).

Figure 4: System Overview – The two major components of Thia are: (1) the Optimizer and (2) the Execution Engine. While the Optimizer relies on the Fine-Grained Planning and Exit Point Estimation techniques, the Execution Engine performs Early Inference.

Figure 4 illustrates the architecture of Thia.

❶ Fine-Grained Planning.
When the system gets a query, the Optimizer uses the Fine-Grained Planning technique to construct a query execution plan. It first splits the entire video into a set of small chunks. The size of a chunk is determined dynamically at runtime (covered in §6). For each chunk, the Optimizer chooses the optimal plan (i.e., when to stop inference in the model). Such a fine-grained query plan enables Thia to deliver higher accuracy and throughput compared to a coarse-grained plan for the entire video. A naive technique for picking the plan consists of running the model on a set of sampled frames from the chunk. While the fine-grained plan reduces the query execution time (Thia-EI in Table 1), it increases the query optimization time, which hurts the overall query processing time (discussed in §7.5). To reduce the optimization time, Thia instead leverages a more lightweight Exit Point Estimation technique.

❷ Exit Point Estimation.
Thia uses the Exit Point Estimation and Fine-Grained Planning techniques in tandem to reduce the overhead of the Optimizer. The Optimizer uses a shallow neural network to directly estimate when to short-circuit the inference in an Early Inference model. It trains an EP estimator for every unique query executed in the system. We discuss how Thia obtains data for training the Exit Point Estimation model in §6.

❸ Early Inference.
The fine-grained query plan constructed by the Optimizer consists of a list of chunks and the model chosen for each chunk. For example, Thia may skip frames 0 through 100, run EP-1 on frames 100 through 300, and evaluate the oracle EP (i.e., EP-3) on frames 300 through 500. The Execution Engine takes this query plan and uses the Early Inference technique to deliver different accuracy-performance tradeoffs with a single model.
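A plan of this shape can be executed with a simple dispatch loop; `run_ep` is a hypothetical stub standing in for stopping the backbone at a real exit point:

```python
# Plan: list of (start_frame, end_frame, action), mirroring the example above.
plan = [(0, 100, "skip"), (100, 300, "EP-1"), (300, 500, "EP-3")]

def run_ep(name, frame):
    """Stub inference: a real system would stop the backbone at this exit point."""
    return {"frame": frame, "ep": name}

def execute(plan):
    results = []
    for start, end, action in plan:
        if action == "skip":
            continue  # chunk predicted to contain no relevant frames
        for frame in range(start, end):
            results.append(run_ep(action, frame))
    return results

results = execute(plan)
print(len(results))  # 400 frames processed, 100 skipped
```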
In this section, we present the Early Inference technique. We first provide an overview of this technique in §5.1. We then illustrate its utility using a case study with Faster-RCNN [17] in §5.2.
We seek to construct a single model with multiple exit points wherein the inference may be short-circuited to improve performance at the expense of accuracy. We do not want to construct a collection of models to accomplish this goal. The Optimizer dynamically adjusts the EP based on the query. If the query is relatively easy to answer, Thia delivers a higher speedup by stopping the inference earlier (while returning accurate results). We discuss how Thia estimates the correct EP for a chunk in §6. In this section, we focus on how we construct a model with multiple EPs.

As discussed in §2.1, object detection models usually rely on a backbone network that is based on a state-of-the-art image classification model (e.g., ResNet-50 [6] and VGG-16 [18]). Since these classification models are tailored for high accuracy, they consist of a stack of compute-intensive layers that lead to lower inference throughput. The layers in a backbone network are sequentially connected to each other. Our key idea is to provide faster detection results with lower accuracy by using the features from earlier layers in the backbone network.
Model Cascading vs Early Inference:
Researchers have proposed model cascades for face recognition [19, 20]. Similar to Early Inference, in a model cascade, the features from earlier layers in the backbone network are used for face recognition. However, these techniques differ in two ways. First, face recognition is a binary classification task (i.e., a face either exists in the image or it does not), so only additional classification layers are instrumented in this approach. Second, these efforts propose a bespoke architecture to construct the cascade. We instead seek to support early inference in widely used object detection models.
Figure 5: Early Inference in Faster-RCNN – Architecture of a Faster-RCNN model that supports early inference.
To support Early Inference with a general-purpose object detection model, we introduce additional RPN and ROI units in the earlier layers of the backbone network. We modify the number of parameters in these units so that they operate on the feature tensor emitted by the backbone network. As shown in Figure 5, for a given input, the Optimizer may choose to short-circuit the inference using the newly added units to speed up inference.
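The mechanism can be sketched as a single backbone whose per-stage features feed separately sized detection heads. The stage widths below follow ResNet-50's channel sizes (the first three match Table 4), and the unit `cost` per stage is an illustrative stand-in for real compute:

```python
# Illustrative per-stage output channel sizes of a ResNet-50-style backbone.
STAGE_CHANNELS = [64, 256, 512, 1024, 2048]

class EarlyExitDetector:
    def __init__(self, channels):
        # One detection head (RPN + ROI) is attached after every stage;
        # each head's first layer is sized to that stage's channel count.
        self.channels = channels

    def infer(self, frame, exit_point):
        """Run backbone stages only up to `exit_point`, then its head."""
        cost = 0
        features = frame
        for stage in range(exit_point):
            features = ("stage", stage, features)  # stub stage computation
            cost += 1
        head_channels = self.channels[exit_point - 1]
        return {"ep": exit_point, "head_channels": head_channels, "cost": cost}

model = EarlyExitDetector(STAGE_CHANNELS)
print(model.infer("frame", exit_point=1))  # cheap, coarse result
print(model.infer("frame", exit_point=5))  # full (oracle) inference
```

Because all exit points share one backbone, switching between them costs nothing beyond the stages actually executed, which is the point of avoiding a model cascade.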
Faster-RCNN is a state-of-the-art object detector [17]. We now discuss how we extend this model to support Early Inference. We then describe how to generalize the training process to other models.
Faster-RCNN with Early Inference: The backbone network of Faster-RCNN is the ResNet-50 [6] model. ResNet-50 consists of five stacked compute blocks, so we extend this model to support five EPs (we could support fewer or additional EPs if needed by instrumenting other layers of the backbone network). The default output of the model corresponds to the fifth EP. We add four additional EPs that provide a wide set of throughput-accuracy tradeoffs (i.e., EP-1, EP-2, EP-3, and EP-4 in Figure 5). We refer to EP-5 as the oracle (since it is the output of the original model). We preserve the structure of the RPN and ROI units as is the case for the oracle. However, we modify the first layer in these units to work with the output tensors of the early EPs, which vary in size. Table 4 lists the layer configuration of each EP. (We found that upsampling the input channel size does not improve accuracy since the features from earlier EPs are coarse.)

Top-down training: We adopt a novel top-down training technique for constructing models that support early inference. We start the training process with the following multi-loss function:

L(x; Y) = (1/|E|) · Σ_{e ∈ E} L({y_c}, {y_t}; Y)

Here, E represents the set of all object detection EPs, including the oracle EP, and L denotes the object detection loss function used in Faster-RCNN [17]. This training step tunes all EPs. We then fine-tune, beginning with the oracle EP. The reasons for doing this are twofold. First, the oracle gets the features emitted by the last stage of the backbone network, so training this EP ensures that all layers converge to the optimal state. Second, we seek to ensure that the oracle EP in the Early Inference model delivers the same accuracy as that of the original model. After training the oracle EP, we freeze all the layer parameters in the backbone network. This ensures that fine-tuning the shallow EPs later does not affect the previously tuned EPs. We gradually fine-tune the RPN and ROI units starting from EP-4 through EP-1.

Exit Point       Channel   Kernel   Speedup   TPr   FNr
EP-1             64        3×3      ×
EP-2             256       3×3      ×
EP-3             512       3×3      ×
EP-4                                ×
EP-5 (Oracle)                       ×

TPr: true positive ratio. FNr: false negative ratio.

Table 4: Early Inference Model Knobs – The layer configuration and performance of each object detection knob in the Early Inference model.

Figure 6: Generalization of Early Inference – Application of the Early Inference technique to a VGG-16 model for image classification.
Throughput-Accuracy Tradeoffs. Table 4 lists the speedup of the shallow EPs with respect to the oracle EP. For a given video frame, this Early Inference model delivers its largest speedup when we stop the inference at the first EP. Table 4 also summarizes the true positive and false negative percentages of each EP on the training dataset with respect to the oracle EP. These metrics are averaged across all categories. Shallower EPs return more false negatives and fail to return a few true positives. In other words, they are more likely to miss a positive frame than to misclassify a negative frame. If the system were to use a shallow EP for the entire video or sequence of images, the impact on query accuracy would be significant. Instead, it must use the oracle EP on some difficult chunks of the video. We cover this Fine-Grained Planning technique in §6. In §7, we demonstrate that Thia has a tolerable accuracy loss when using both the Early Inference and Fine-Grained Planning techniques.

Generalization.
The Early Inference technique generalizes to other models (e.g., VGG-16 [18]) and other vision tasks (e.g., image classification). This is because most of these deep learning models contain similar backbone networks that benefit from the Early Inference technique. Furthermore, the number of EPs may be increased or decreased based on the complexity of the model. Figure 6 illustrates another Early Inference model based on VGG-16 for an image classification task that is trained on the Flower-102 dataset [14]. Here, EP-1 provides a substantial speedup compared to EP-8 (i.e., the oracle EP). By using all eight EPs together, the system achieves a significant speedup compared to the oracle EP with minimal accuracy loss.

Figure 7: Variation of Optimal EP – The fastest accurate EP in an Early Inference model for a sequence of chunks in a video.

We present the Fine-Grained Planning technique in this section. In §6.1, we make the case for Fine-Grained Planning. In §6.2, we discuss how Thia samples frames and constructs chunks to apply this technique. Lastly, in §6.3, we introduce the Exit Point Estimation technique for reducing the optimization time.
As we discussed in §5.2, the Early Inference model contains a set of EPs. The goal of the Optimizer is to choose an optimal (accurate and fast) EP for every fine-grained chunk of the video at runtime. Our key observation is that the optimal EP changes at chunk granularity. Figure 7 illustrates the chunk-level query plan for Q4 in Section 3. The triangles in Figure 7 represent the fastest (but still accurate enough) EP for every chunk in the video. This example shows that the optimal EP constantly changes. So, it is essential to dynamically adjust the query plan at runtime to achieve both good accuracy and performance.

State-of-the-art systems (e.g., BlazeIt [10] and PP [13]) take a coarse-grained approach to planning. They choose a single plan for the entire video based on the accuracy of the model on a set of sampled frames. The limitations of this technique are twofold.

Performance degradation. Positive events tend not to appear in every chunk of the video (i.e., the selectivity of the predicate is typically high). If we pick a static plan for the entire video, video chunks that are less likely to contain positive events or that contain easy-to-detect events are passed to a more compute-intensive EP. Thus, the system does not leverage the opportunity to further improve performance by either skipping those chunks or using less compute-intensive EPs for those chunks. With Fine-Grained Planning, Thia uses a faster EP or directly skips the entire chunk (❶ in Figure 7).

Accuracy loss. The distribution of the target event and the accuracy of the model vary across the video. A statically selected, shallow EP will hurt accuracy by missing hard-to-detect events. As shown in Figure 7, some chunks require deeper EPs to make
Algorithm 1:
Fine-grained query planning.
Input : V - Video data. EP-List - The list of EPs in the Early Inference model. P - Precision constraint of the query. R - Recall constraint of the query. Output :
Return a list of fine-grained plans. video_length ← Length( V )// estimate initial sampling rate. sampling_rate ← EstimateSamplingRate( V )// optimize the query plan. return GetQueryPlan( V , EP-List , P , R , sampling_rate ) Function
PickBestEP(
V_sub,
EP-List , P , R ) Output :
Return the optimal EP under P and R constraints, andthe rate of positive frames in the sampled subset. Function
GetQueryPlan( V , EP-List , P , R , sampling_rate ) Output :
A collection of fine-grained plans. // divide into smaller chunks. sampling_span ← 𝑠𝑎𝑚𝑝𝑙𝑖𝑛𝑔 _ 𝑟𝑎𝑡𝑒 V_sub ← V [ , · 𝑠𝑎𝑚𝑝𝑙𝑖𝑛𝑔 _ 𝑠𝑝𝑎𝑛 , · 𝑠𝑎𝑚𝑝𝑙𝑖𝑛𝑔 _ 𝑠𝑝𝑎𝑛 ...] best_ep, posi_ratio ← PickBestEP(
V_sub,
EP-List , P , R )// the collection of fine-grained plans. plan ← {} if (posi_ratio is sufficient and best_ep is fastest) or ( Length( V ) is small) then plan += { V : best_ep } else if posi_ratio is insufficient then plan += { V : skip } else for V_chunk ∈ V do // double the sampling rate. plan += GetQueryPlan(
V_chunk,
EP-List , P , R , · sampling_rate ) return plan accurate predictions ( ❷ in Figure 7). With Fine-Grained Plan-ning, Thia dynamically adjusts the plan based on the difficulty ofdetecting the target event. When Thia gets a query, it first splits the given video into a set ofchunks. It then samples a set of frames from each chunk and thenevaluates the accuracy of all the EPs in the Early Inference modelon these sampled frames. Using these results, the system selectsthe best EP for each chunk. Lastly, it executes the query using theselected plan. The key components of the algorithm that Thia usesfor chunking videos are as follows: ❶ Hierarchical Chunking . The two key decisions made bythe Optimizer are: (1) chunk size, and (2) sampling rate ( i.e., thenumber of frames to pick from a chunk). The system delivers higheraccuracy with a higher sampling rate since more samples allow it tobetter estimate the optimal EP for each chunk. However, this hurtsthroughput since the system must evaluate the model’s behavior onmore frames, thereby increasing optimization time. Choosing thechunk size is also a challenging task. This is because the duration
HIA: Accelerating Video Analytics usingEarly Inference and Fine-Grained Query Planning of an event varies based on the video, so Thia must dynamicallyadjust the chunk size at runtime.To tackle these challenges, Thia takes a hierarchical approachfor picking the chunk size and the sampling rate for each chunk. Itinitially uses a large chunk size and a low sampling rate. This allowsthe Optimizer to gain a rough understanding of the contents of thevideo based on the inference results collected using the sampledframes. Based on this knowledge, it recursively adjusts the chunksize and sampling rate.Algorithm 1 presents the hierarchical, recursive technique usedby the Optimizer. As shown in Line 9, the recursive algorithmstops when the chunk size is smaller than a threshold or if thefastest EP has been chosen for a given chunk that contains enoughpositive frames. In Line 8, the
PickBestEP function returns the rate of positive frames in a chunk (i.e., posi_ratio) that is obtained from the oracle EP. It is important to ensure that the chunk has sufficient positive frames, since the calculated precision and recall metrics of the EPs do not generalize well without sufficient positive frames. These constraints bound the optimization time. As shown in Line 11, if there are very few positive frames, then Thia skips the entire chunk to reduce both optimization time and execution time. Lastly, it gradually reduces the chunk size and increases the sampling rate, as shown in Line 15. The intuition is that if the system is not able to select a plan based on its coarse-grained understanding of the chunk, it must sample more frames from that chunk in the next iteration. By using a small chunk size, the Optimizer is able to quickly adjust the plans to transient events.

❷ Sampling Rate Bounds. The Optimizer gradually increases the sampling rate to improve the quality of its plan. However, this increases the optimization time and may hurt the overall throughput obtained with the plan. This is because the decrease in the query execution time is not sufficient to justify the increase in the optimization time. To overcome this limitation, the Optimizer uses the following constraint to bound the initial sampling rate (Line 2):

sampling_rate × 2^⌈log |V|⌉ ≤ τ

Here, we assume that chunks must contain at least 100 frames, and we seek to bound the final sampling rate to a fixed threshold τ even in the worst-case setting. The maximum depth of the recursive algorithm is ⌈log |V|⌉ (the sampling rate is doubled in each iteration).

❸ Memoization of Inference Results. In Line 7, the newly picked samples could be different from those that have already been evaluated. Evaluating all the EPs on a sample is expensive. To reduce this overhead, the Optimizer memoizes the inference results and reuses the results of nearby frames. This technique is illustrated in Figure 8.
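The hierarchical planning loop described above (recursive chunking, skipping low-positive chunks, chunk halving with sampling-rate doubling, and memoization) can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 1: all names and thresholds are assumptions, and `evaluate_ep(frame) -> (best_ep, is_positive)` stands in for evaluating every EP on a frame against the oracle EP.

```python
MIN_CHUNK_SIZE = 4     # assumed recursion stop; the paper uses a larger value
MIN_POS_RATIO = 0.25   # assumed minimum rate of positive frames (posi_ratio)
FASTEST_EP = 0         # EPs are numbered from shallow (fast) to deep (slow)

def plan(frames, lo, hi, chunk_size, rate, evaluate_ep, cache):
    """Assign an EP to each chunk of frames[lo:hi]; None means skip the chunk."""
    plans = {}
    for start in range(lo, hi, chunk_size):
        stop = min(start + chunk_size, hi)
        samples = range(start, stop, max(1, int(1 / rate)))
        results = []
        for f in samples:
            if f not in cache:
                # Memoization: reuse the result of an adjacent, already
                # evaluated frame in the same chunk instead of re-running
                # inference.
                near = next((g for g in (f - 1, f + 1)
                             if g in cache and start <= g < stop), None)
                cache[f] = cache[near] if near is not None else evaluate_ep(frames[f])
            results.append(cache[f])
        posi_ratio = sum(pos for _, pos in results) / len(results)
        if posi_ratio < MIN_POS_RATIO:
            plans[(start, stop)] = None          # skip: too few positive frames
        elif max(ep for ep, _ in results) == FASTEST_EP or chunk_size <= MIN_CHUNK_SIZE:
            plans[(start, stop)] = max(ep for ep, _ in results)
        else:
            # Recurse: halve the chunk size and double the sampling rate.
            plans.update(plan(frames, start, stop, chunk_size // 2,
                              min(1.0, rate * 2), evaluate_ep, cache))
    return plans
```

The shared `cache` dictionary is what makes the sampling-rate doubling affordable: frames evaluated at a coarse level are never re-evaluated at a finer level.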
Without memoization, the results for the second and fourth frames must be obtained again when the sampling rate is increased in the next iteration. Thia instead reuses the results of nearby frames within the same chunk. With memoization, it picks the cached result for the third frame instead of running inference on the fourth frame. Thus, it evaluates the EPs only on the second frame.

❹ Evaluation-Based EP Selection. To select the best EP, as shown in Line 8, Thia evaluates all the EPs on the sampled frames and compares them with the oracle EP. It picks the fastest EP that
provides precision and recall above the required thresholds. These constraints empirically offer maximal speedup with minimal accuracy loss (§7.2). Even with all of these optimizations, evaluating the EPs on a frame incurs non-trivial optimization time. We next present the Exit Point Estimation technique for further reducing the optimization time.

Figure 8: Sampling results, no reuse vs. reuse – an illustrative example of the performance savings from reusing sampling results.

We seek to reduce the optimization time associated with the Early Inference technique. As we present in §7.5, it is important to balance the tradeoff between optimization time and execution time to improve the overall query processing time. The Exit Point Estimation technique consists of using a shallow, two-layer neural network instead of the evaluation step in query planning. The neural network directly returns the optimal EP based on the backbone features. This allows the Optimizer to eliminate the compute-intensive evaluation of all EPs. For example, with Faster-RCNN, the inputs to the Exit Point Estimation model are the features emitted by the fifth stage of the Early Inference model. To train this neural network, the Optimizer uses images from the training dataset of the Early Inference model along with the associated EP decision. For robust results, Thia must train a separate Exit Point Estimation model for each query. However, the overhead of training this model is tolerable because: (1) it is a one-time overhead for each query; and (2) the training time for the Exit Point Estimation model is negligible compared to the total query processing time due to the simple structure of the model. We defer an empirical analysis of this optimization to §7.6.
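The shallow, two-layer estimation network described above can be sketched as follows. The paper implements it in PyTorch on pooled backbone features; here is a dependency-light NumPy sketch in which the class name and the feature, hidden, and output dimensions are all assumptions for illustration.

```python
import numpy as np

class EPEstimator:
    """Two-layer network mapping backbone features to an exit point (EP).

    Hypothetical sketch: sizes and interface are assumptions; the paper
    trains an equivalent shallow network in PyTorch per query.
    """

    def __init__(self, feat_dim=256, hidden_dim=64, num_eps=5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (feat_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.1, (hidden_dim, num_eps))
        self.b2 = np.zeros(num_eps)

    def logits(self, feats):
        # feats: (batch, feat_dim) pooled backbone features
        hidden = np.maximum(0.0, feats @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2

    def predict_ep(self, feats):
        # The estimated optimal EP is the highest-scoring class.
        return np.argmax(self.logits(feats), axis=1)
```

In the paper's setup, such a network would be trained offline per query with a standard classification loss on (backbone feature, fastest-accurate-EP) pairs.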
Estimation-Based EP Selection. Since the Exit Point Estimation model directly estimates the optimal EP for a video frame, it does not return precision and recall metrics for all the EPs. So, the Optimizer extrapolates these metrics based on the estimated EP. We next discuss how this extrapolation is done. Let us split the set of sampled frames into two subsets: (1) those that contain positive events as reported by the oracle EP (Line 9 in Algorithm 1), and (2) those that do not contain positive events. We define these two subsets as S and Ŝ, respectively. For a given video frame x, let us denote the output of the EP estimation model for that frame by OPT_EP(x). The Optimizer estimates the number of true positives (TP), false positives (FP), and false negatives (FN) for any EP k in the Early Inference model as:

TP_k = Σ_{x ∈ S} g(x), where g(x) = 1 if k ≥ OPT_EP(x), and 0 otherwise

FP_k = Σ_{x ∈ Ŝ} g′(x), where g′(x) = 1 if k < OPT_EP(x), and 0 otherwise

FN_k = Σ_{x ∈ S} g″(x), where g″(x) = 1 if k < OPT_EP(x), and 0 otherwise

Our intuition is that a shallow EP is less accurate than a deep EP. So, for a video frame x, the estimated optimal EP (i.e., OPT_EP(x)) returns correct results. Then, all EPs at or after the estimated optimal EP (i.e., k ≥ OPT_EP(x)) should also return correct results, and vice versa. Hence, in the case of positive events, an EP k deeper than the estimated optimal EP provides a true positive prediction, while a shallower EP provides a false negative prediction. In the case of negative events, an EP k shallower than the estimated optimal EP likely results in a false positive.
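The extrapolation above translates directly into code. In this sketch, the function names are illustrative, `opt_ep` stands for the Exit Point Estimation model, and the precision/recall thresholds are symbolic parameters rather than the paper's values.

```python
def projected_metrics(k, S, S_hat, opt_ep):
    """Extrapolated precision/recall of EP k from per-frame estimated EPs.

    S: frames the oracle labels positive; S_hat: negative frames;
    opt_ep(x): estimated optimal EP for frame x.
    """
    tp = sum(1 for x in S if k >= opt_ep(x))     # deep-enough EP is correct
    fn = sum(1 for x in S if k < opt_ep(x))      # too-shallow EP misses event
    fp = sum(1 for x in S_hat if k < opt_ep(x))  # too-shallow EP false alarm
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def pick_fastest_ep(num_eps, S, S_hat, opt_ep, p_min, r_min):
    """Fastest (shallowest) EP meeting both constraints; falls back to deepest."""
    for k in range(num_eps):
        p, r = projected_metrics(k, S, S_hat, opt_ep)
        if p >= p_min and r >= r_min:
            return k
    return num_eps - 1
```

Note that the loop visits EPs from shallow to deep, so the first EP satisfying both constraints is also the fastest one.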
With these projected metrics, the Optimizer derives the precision and recall metrics for an EP k as:

Precision_k = TP_k / (TP_k + FP_k), Recall_k = TP_k / (TP_k + FN_k)

Lastly, the Optimizer picks the fastest EP that meets the precision and recall constraints.
We seek to answer the following questions in our evaluation:
• How effective is the Early Inference technique in reducing the query processing time (§7.2)?
• How much does each technique contribute to the overall performance (§7.3)?
• How effective is Fine-Grained Planning compared to coarse-grained planning (§7.4)?
• What is the time spent on query planning and execution (§7.5)?
• How effective is the Exit Point Estimation technique in reducing the optimization time (§7.6)?
• How effective is Thia compared to other state-of-the-art systems (§7.7)?
• How does the Early Inference technique compare against the model cascade technique (§7.8)?
Evaluated Systems. Table 1 lists all the video analytics systems that we compare in our analysis (including the variants of Thia). In the Naive system, we apply the oracle EP on every frame. We normalize the accuracy metrics of the other systems against those of the Naive system. We reimplement two other state-of-the-art systems in our framework for comparative analysis: (1) PP [13], and (2) BlazeIt [10]. In our implementation, the PP system uses ResNet-34 [6] to filter out unrelated frames. The BlazeIt system uses a specialized model (ResNet-34) to accelerate queries. To better understand the performance of Thia, we examine three variants of our system: (1) Thia-Single uses only the Fine-Grained Planning method with the oracle EP. (2) Thia-Multi also uses the Fine-Grained Planning method, along with multiple EPs. Specifically, we use Faster-RCNN models with three backbone networks (ResNet-18, ResNet-34, and ResNet-50 [6]) as three EPs. (3) Thia-EI is the closest variant of Thia. It uses Fine-Grained Planning along with the Early Inference technique (but does not use the Exit Point Estimation technique).
Datasets. We evaluate these systems on two datasets: (1) UA-DeTrac [21], and (2) the Jackson-Town dataset from [10]. Both datasets are obtained from traffic surveillance cameras. We focus on four vehicle categories in both datasets: Car, Truck, Bus, and Others.

Evaluation Metrics. Similar to other video analytics systems [1, 9–11, 13], our evaluation normalizes the results with respect to the oracle model (a Faster-RCNN model backed by ResNet-50). So, we provide the F-1 score calculated relative to the results of the oracle model. We also report separate precision and recall metrics for each query. This is important since a user might have fine-grained accuracy requirements (e.g., low precision and high recall). We assume that the decoded video is present on disk.
Queries. To evaluate these systems, we use the four queries listed in Table 2. Based on the predicate, the frequency of true positive events and the difficulty of detecting those events vary.
Software and Hardware. We implement Thia with the Detectron2 [22] framework in PyTorch [15]. We evaluate these systems on a server with 44 CPU cores and 256 GB memory, along with one Titan Xp GPU with 12 GB memory.
Model Training. As discussed in §5.2, we construct the Early Inference model based on Faster-RCNN. We split the UA-DeTrac dataset into two parts: training and validation subsets. We train the Early Inference model on the training subset. We first warm up the training process, and then each EP in the Early Inference model is trained in a top-down manner. Since the Jackson-Town dataset does not have ground-truth labels, we directly apply the Early Inference model, which is tailored for the UA-DeTrac dataset. For Thia-Multi, we train three models: Faster-RCNN based on ResNet-18, Faster-RCNN based on ResNet-34, and Faster-RCNN based on ResNet-50. Each model is trained for 10 epochs. To train the Exit Point Estimation model, we use images from the UA-DeTrac training set. We split these images into training and validation subsets. We construct the training set so that the distribution of different EPs is balanced. The training data for this model consists of backbone features for those video frames, and the output is the fastest EP that is accurate enough. We quickly train this shallow network.

In this experiment, we compare the query processing time of Thia-EI to that of other video analytics systems. The results are shown in Figure 9. The bottom right corner represents the ideal case (faster execution with accurate predictions).
Thia-EI. The most notable observation is that Thia-EI outperforms the other systems on most queries. Thia-EI uses both Early Inference and Fine-Grained Planning. On Q1 and Q2, since the fraction of frames filtered out is limited, using an extra specialized model
before the object detector adds additional execution overhead. Thia-EI consistently reduces the total runtime and also delivers a higher F-1 score compared to the other systems. In particular, it is 2–6× faster than Naive with a tolerable drop in F-1 score.

Figure 9: Impact of Early Inference – Query processing time and F-1 scores delivered by Thia-EI, PP, and BlazeIt (the bottom right corner represents the ideal system).

Figure 10: Usage of EPs in the Early Inference model – Percentage of frames processed using the EPs in the Early Inference model.

BlazeIt. On Q1, BlazeIt outperforms the other systems with respect to query processing time. However, as we discussed in §3, its specialized model delivers a lower F-1 score. On the other queries, since the F-1 score of the specialized model is too low to be useful, BlazeIt falls back to the oracle model. Even though the specialized model is not effective, BlazeIt still evaluates the query with the specialized model, so the processing time of BlazeIt is higher than that of Naive for Q2, Q3, and Q4.

PP. PP reduces the processing time on Q3 and Q4. This is because these two queries focus on relatively rare events. So, the model in PP is able to filter out a significant fraction of frames to accelerate query processing.

We next examine how the EPs in a model are used by Thia-EI while processing queries. The results of this experiment are shown in Figure 10. To better understand the contribution of each technique to the performance of Thia-EI, we also conduct an ablation study.

Figure 11: Ablation study – Contribution of the Fine-Grained Planning and Early Inference techniques to the performance of Thia-EI.
We measure the execution time of Thia-Single and Thia-EI to illustrate the benefits of the Fine-Grained Planning and Early Inference techniques, respectively. The results of this study are illustrated in Figure 11. These two experiments demonstrate that: (1) Fine-Grained Planning is able to adaptively choose the appropriate EP for each chunk based on the query and video frames, and (2) the Fine-Grained Planning and Early Inference techniques have significant impact for rare events, but in the case of frequent events, the speedup mainly comes from Early Inference.

On Q1, since positive events appear in the majority of the video frames, the impact of Fine-Grained Planning is minimal. When we add in the Early Inference technique, Thia-EI delivers higher speedup by using shallow EPs for easy-to-detect events, as shown in Figure 10. Q1 demonstrates an extreme scenario wherein the first EP (EP-1) provides correct predictions on all video frames. In contrast, in the case of Q2, Thia-EI must use multiple EPs due to harder-to-detect events. Here, the Optimizer reduces the execution time by carefully choosing the EPs to use. As shown in Figure 10, while some frames are assigned to the oracle EP, other frames are assigned to shallow EPs to reduce execution time. By reducing the execution time, Thia-EI delivers a 4× speedup over Naive.

Unlike queries focusing on frequent events, the Fine-Grained Planning technique provides more performance benefits for queries related to rare events, because the Optimizer decides to skip some chunks during execution (Line 11). On Q3 and Q4, as shown in Figure 11, Fine-Grained Planning leads to a 5× speedup. Nevertheless, the Early Inference technique is still useful for these queries. Thia-EI carefully assigns certain video frames to shallow EPs to improve the performance without losing accuracy. The system is thus accelerated by a further 3× when the Early Inference technique is enabled.
We demonstrate the benefits of Fine-Grained Planning by comparing Thia-EI against a system that uses coarse-grained planning with the same Early Inference model. With coarse-grained planning, we evaluate all EPs on a fixed fraction of the sampled video frames and pick the EP that meets the precision and recall constraints. We show a breakdown of the query processing time in Figure 12. On all queries, Fine-Grained Planning provides better query plans than coarse-grained planning, so the execution time is consistently lower. Moreover, with optimizations like sampling rate bounds and memoization in Fine-Grained Planning, it does
Figure 12: Impact of Fine-Grained Planning – Breakdown of query processing time with the fine-grained and coarse-grained planning techniques.
Figure 13: Variation of Execution Time – Variation of query execution time across the chunks in the video.
Figure 14: Breakdown of query processing time – Components of query processing time (optimization time and execution time) associated with Naive, Thia-Single, Thia-Multi, Thia-EI, and Thia.

not incur higher optimization time than the naive coarse-grained planning approach. We next measure the distribution of query execution time over the fraction of the video being analyzed in Figure 13. An even distribution (e.g.,
Q1 and Q2) suggests that the same query plan is used for a large chunk. In contrast, an uneven distribution (e.g.,
Q3 and Q4) suggests that the plan changes frequently across the video.
Figure 15: Accuracy of the Exit Point Estimation model – Variation in validation accuracy over training time.
Table 5: Impact of the Exit Point Estimation technique – Accuracy of the Exit Point Estimation technique relative to Thia-EI (Under (%) and Over (%) per query, Q1–Q4).
We now provide a breakdown of the processing time of Thia-EI and its variants and compare it against Naive. Recall that all systems except Naive use the Fine-Grained Planning technique. Though Thia-Single is only able to use the oracle EP, it is able to skip frames with no relevant events using Fine-Grained Planning. The results are shown in Figure 14. Access to a set of EPs allows both Thia-EI and Thia-Multi to reduce execution time in comparison to Thia-Single and Naive. The reduction in execution time relative to Thia-Single is more prominent for queries focusing on more frequent events. While Thia-Multi supports multiple EPs similar to Thia-EI, Thia-EI supports multiple EPs within a single model. So, it has a lower GPU memory footprint, as shown in Figure 18. In addition, Thia-EI offers more flexibility in creating and selecting different EPs. On Q2, due to the limited flexibility of Thia-Multi, it has lower execution time but also lower accuracy than Thia-EI.

The drawback of using multiple EPs with Fine-Grained Planning is the increase in optimization time. This is because the Optimizer has to evaluate all EPs to choose an optimal EP. As illustrated in Figure 14, the optimization time of Thia-EI and Thia-Multi is consistently higher than that of Thia-Single. Increasing this flexibility (i.e., adding more EPs) leads to higher sampling overhead. Thus, Thia-EI has higher optimization time than Thia-Multi, and both have higher optimization time than Thia-Single. This motivates the need for reducing the optimization time.
Training Time. Since the Optimizer needs to train an estimation model for every unique query, we first quantify the training overhead of the Exit Point Estimation technique. Figure 15 shows the variation in validation accuracy over training time. The model quickly converges since Thia uses a small training set (200 samples) and a two-layer neural network for Exit Point Estimation.
Figure 16: End-to-end Comparison – Comparative analysis of speedup, precision, and recall metrics against state-of-the-art video analytics systems.
It takes only a few seconds (a small fraction of the total processing time) to train each of these models for all queries.

Optimization Time. We next investigate the efficacy of the Exit Point Estimation technique in reducing the optimization time. We integrate the Exit Point Estimation technique into Thia-EI to construct Thia. Using the Exit Point Estimation technique, Thia is able to directly predict an appropriate EP to use for a chunk. In contrast, Thia-EI runs inference using all EPs during the optimization phase to select the EP. As shown in Figure 14, Thia cuts the optimization time in half. Recall that Thia uses the object detection EPs for Fine-Grained Planning. So, the Exit Point Estimation technique uses the backbone features for choosing a plan for each chunk. We also measure the overhead of using Exit Point Estimation. This technique introduces a minimal additional overhead (18 s) even under the highest sampling rate (the processing time is on the order of thousands of seconds).

Execution Time. Lastly, we discuss the impact of the Exit Point Estimation technique on planning accuracy (i.e., choosing the optimal EP) and execution time. We measure the planning accuracy relative to Thia-EI. In Table 5, Under and
Over represent the percentage of chunks for which the Exit Point Estimation technique returns a shallower EP or a deeper EP, respectively, than that returned by Thia-EI. While shallower estimates hurt query accuracy, deeper estimates increase execution time. As shown in Figure 14, execution time increases only negligibly for all queries except Q1. Since the Exit Point Estimation technique reduces optimization time, the total processing time of Thia is lower than that of Thia-EI on all queries except Q1. This is because Q1 can be accurately answered using the first EP (Figure 10), so deeper estimates increase execution time. We discuss the impact on query accuracy in §7.7.
We report the speedup, precision, and recall metrics with respect to other state-of-the-art systems in Figure 16. The bars on the left side represent three systems: (1) Naive, (2) BlazeIt, and (3) PP. The latter two video analytics systems use specialized models. The bars on the right side represent variants of Thia that use one or more techniques presented in this paper.

Figure 17: Model Cascade vs. Early Inference – Comparison of the query processing time taken by Naive, Model Cascade, Thia-EI, and Thia.

The most notable observation is that systems that have access to a set of EPs deliver higher performance than those that have access to a single EP. Unlike Thia-Multi, which maintains a collection of separate models, the Early Inference technique offers more flexibility in choosing the optimal EP. However, this technique increases the optimization time. We overcome this limitation using the Exit Point Estimation technique. Thia consistently delivers higher speedup than the other systems. Due to the inaccuracy of the Exit Point Estimation technique, Thia has a minimal drop in accuracy. The drop in recall is more prominent because shallow EPs are unable to recognize hard-to-detect events, which leads to more false negatives. In contrast, the drop in precision is minimal. On Q1, Thia improves both precision and recall. This is because the Optimizer augmented with the Exit Point Estimation technique overestimates the EPs for this query, leading to an improvement in accuracy.

Model Cascade vs. Early Inference
As we mentioned in §2.3, Thia-Multi uses a model cascade [1]. It differs from the naive model cascade technique in that it uses Fine-Grained Planning to select EPs. In contrast, a naive approach
Figure 18: Memory Footprint – Comparison of the memory footprint of Thia-EI and Thia-Multi (i.e., a model cascade).

Figure 19: Optimality of Fine-Grained Planning – Comparison of the execution time of Thia-EI and Thia against that with the optimal plan.

determines whether to stop at an EP based on the confidence score of the prediction from the previous EP. Thia-Multi outperforms the naive approach since shallower EPs in the model cascade are always executed with the latter technique. As a result, our Early Inference based systems (i.e.,
Thia-EI and Thia) outperform the model cascade approach. Since it is challenging to construct a confidence-score based system in the case of object detection, we show the projected performance of the model cascade approach compared to Thia-EI and Thia in Figure 17. The Early Inference based system delivers a 2× speedup compared to the model cascade. Using multiple models to construct a model cascade also increases the memory footprint of the system. As shown in Figure 18, the real-time memory usage of a model cascade increases as we increase the number of EPs. It has a 5 GB memory footprint with 5 EPs. In contrast, since the Early Inference model shares parameters and inference features, it only incurs a 2 GB memory footprint for the same number of EPs.

We now examine the quality of the query plans relative to the optimal plan by comparing the execution speedup. The optimal plan is constructed using brute-force EP selection on every frame (i.e., chunk size = 1). The optimal plan is unachievable in practice because brute-force selection on every frame significantly increases optimization time. The results are shown in Figure 19. The plans constructed by the Optimizer are 0.3× slower than the optimal plan. So, there is still potential for improving the quality of the query plans. We plan to explore techniques for doing so with a tolerable impact on optimization time in the future.

Accuracy of the Exit Point Estimation model. The Exit Point Estimation technique uses a simple neural network to model the distribution of optimal EPs. However, the inaccuracy of this model leads to a minimal drop in overall query accuracy. We plan to study other techniques to reduce the optimization time in the future. For example, instead of using a deep learning model, a lightweight statistical estimator may be sufficient.
A challenge with this approach is that this estimator must accurately map all of the parameters returned by the object detection model (e.g., a set of bounding boxes and confidence scores) to the appropriate EP.
Query Support. Currently, Thia supports a limited set of queries. To support general-purpose video analytics, we will need to add support for additional types of queries (e.g., aggregate queries). We plan to integrate the Early Inference and Fine-Grained Planning techniques into the query execution engine and the query optimizer of a full-featured video analytics system in the future.
Model Cascade. Researchers in the area of face detection have proposed models that support a set of EPs geared for different accuracy and speed trade-offs [19, 20]. These models return a binary decision and a confidence score (i.e., whether a face exists). Based on the confidence score, the model chooses the appropriate EP. In contrast, Thia uses the estimator to directly pick the EP.
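The confidence-score based cascade described above can be sketched as follows. This is an illustrative sketch, not the implementation of the cited systems: the function names and the 0.8 threshold are assumptions, and each model is a callable returning a (prediction, confidence) pair.

```python
def cascade_infer(frame, models, threshold=0.8):
    """Run a confidence-score model cascade on one frame.

    models: callables frame -> (prediction, confidence), ordered from the
    shallowest (fastest) EP to the deepest (slowest). Returns the prediction
    and the index of the EP that produced it.
    """
    for i, model in enumerate(models):
        pred, conf = model(frame)
        # Stop at this EP once the prediction is confident enough;
        # the deepest EP always answers.
        if conf >= threshold or i == len(models) - 1:
            return pred, i
```

The key contrast with Thia: the cascade always executes the shallower EPs before escalating, whereas Thia's estimator jumps straight to the chosen EP, avoiding that wasted work.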
Query Planning. The authors of Chameleon [9] observe that an appropriate query plan is critical for high performance and accuracy. Similar to the Fine-Grained Planning technique, it adjusts the execution plan at runtime. To reduce the cost of picking the correct plan, it exploits the temporal locality of nearby frames in the video, thereby reducing the profiling cost. To further reduce this cost, it uses a clustering algorithm to exploit correlations across videos. Thia instead uses a shallow neural network to directly estimate the optimal EP to use for a chunk.
10 Conclusion
We presented Thia, a video analytics system for efficiently processing visual data at scale. Thia leverages the early inference technique to support a range of throughput-accuracy tradeoffs. It then adopts a fine-grained approach to query planning and processes different chunks of the video with different exit points to meet the user's requirements. Lastly, Thia uses a lightweight technique for directly estimating the exit point with a shallow deep learning model to lower the optimization time. We empirically show that these techniques enable Thia to outperform two state-of-the-art video analytics systems by up to 6.5×, while providing accurate results even on queries focusing on hard-to-detect events.
References
[1] Michael R. Anderson, Michael Cafarella, German Ros, and Thomas F. Wenisch. 2019. Physical Representation-Based Predicate Optimization for a Visual Analytics Database. ICDE (2019).
[2] Favyen Bastani, Songtao He, Arjun Balasingam, Karthik Gopalakrishnan, Mohammad Alizadeh, Hari Balakrishnan, Michael Cafarella, Tim Kraska, and Sam Madden. 2020. MIRIS: Fast Object Track Queries in Video. SIGMOD (2020).
[3] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. SIGMOD (1998).
[4] Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, and Subramanya R. Dulloor. 2019. Scaling Video Analytics on Constrained Edge Nodes. SysML (2019).
[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2018. Mask R-CNN. ICCV (2018).
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CVPR (2015).
[7] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. OSDI (2018).
[8] Ramesh Jain and Arun Hampapur. 1994. Metadata in Video Databases. SIGMOD (1994).
[9] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodík, Siddhartha Sen, and Ion Stoica. 2018. Chameleon: Scalable Adaptation of Video Analytics. SIGCOMM (2018).
[10] Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. VLDB (2019).
[11] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale. VLDB (2017).
[12] Yao Lu, Aakanksha Chowdhery, and Srikanth Kandula. 2016. Optasia: A Relational Platform for Efficient Large-Scale Video Analytics. SoCC (2016).
[13] Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018. Accelerating Machine Learning Inference with Probabilistic Predicates. SIGMOD (2018).
[14] M-E. Nilsback and A. Zisserman. 2008. Automated Flower Classification over a Large Number of Classes. ICVGIP (2008).
[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic Differentiation in PyTorch. NIPS-W (2017).
[16] Luis Remis, Vishakha Gupta-Cledat, Christina Strong, and Ragaad Altarawneh. 2018. VDMS: Efficient Big-Visual-Data Access for Machine Learning Workloads. (2018).
[17] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS (2015).
[18] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR (2015).
[19] Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2013. Deep Convolutional Network Cascade for Facial Point Detection. CVPR (2013).
[20] P. Viola and M. Jones. 2001. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR (2001).
[21] Longyin Wen, Dawei Du, Zhaowei Cai, Zhen Lei, Ming-Ching Chang, Honggang Qi, Jongwoo Lim, Ming-Hsuan Yang, and Siwei Lyu. 2020. UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. CVIU (2020).
[22] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https://github.com/facebookresearch/detectron2.
[23] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live Video Analytics at Scale with Approximation and Delay-Tolerance. NSDI (2017).
[24] Yuhao Zhang and Arun Kumar. 2019. Panorama: A Data System for Unbounded Vocabulary Querying over Video.