ExSample: Efficient Searches on Video Repositories through Adaptive Sampling
Oscar Moll (MIT CSAIL), Favyen Bastani (MIT CSAIL), Sam Madden (MIT CSAIL), Mike Stonebraker (MIT CSAIL), Vijay Gadepally (MIT Lincoln Laboratory), Tim Kraska (MIT CSAIL)
ABSTRACT
Capturing and processing video is increasingly common as cameras and networks improve and become cheaper. At the same time, algorithms for rich scene understanding and object detection have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art vision algorithms to interpret scenes and recognize objects are slow and expensive, our ability to process even simple ad-hoc selection queries ('find 100 example traffic lights in dashboard camera video') over this accumulated data lags far behind our ability to collect it. Sampling image frames from the videos is a reasonable default strategy for these types of queries; however, the ideal sampling rate is both data and query dependent. We introduce ExSample, a low-cost framework for ad-hoc, unindexed video search which quickly processes selection queries by adapting the amount and location of sampled frames to the data and the query being processed. ExSample prioritizes which frames within a video repository are processed in order to quickly identify portions of the video that contain objects of interest. ExSample continually re-prioritizes which regions of video to sample from based on feedback from previous samples. On large, real-world video datasets ExSample reduces processing time by up to 4x over an efficient random sampling baseline and by several orders of magnitude versus state-of-the-art methods which train specialized models for each query. ExSample is a key component in building cost-efficient video data management systems.
1. INTRODUCTION
Video cameras have become incredibly affordable over the last decade, and are ubiquitously deployed in static and mobile settings, such as smartphones, vehicles, surveillance cameras, and drones. These video datasets are enabling a new generation of applications. For example, video data from vehicle dashboard-mounted cameras (dashcams) is used to train object detection and tracking models for autonomous driving systems [21], to annotate map datasets like OpenStreetMap with locations of traffic lights, stop signs, and other infrastructure [13], and to analyze the scenes of collisions from dashcam footage to automate insurance claims processing [14].

However, these applications must process large amounts of video to extract useful information. Consider the basic task of finding examples of traffic lights (to, for example, annotate a map) within a large collection of dashcam video collected from many vehicles. The most basic approach to evaluate this query is to run an object detector frame by frame over the dataset. Because state-of-the-art object detectors run at about 10 frames per second (fps) on state-of-the-art GPUs, one third of the typical video recording rate of 30 fps, scanning through a collection of 1000 hours of video with a detector on a GPU would take 3x that time: 3000 GPU hours. In the offline query case, which is the case we focus on in this paper, we can parallelize our scan over the video across many GPUs, but, at typical GPU rental prices [?], our bill for this one ad-hoc query would be $10K regardless of parallelism. Hence, this workload presents challenges in both time and monetary cost. Note that accumulating 1000 hours of video represents just 10 cameras recording for less than a week.

A practical means of coping with this issue is to skip frames: for example, only run object detection on one frame for every second of video. After all, we might think it reasonable to assume all traffic lights are visible for longer than that, and the savings are large compared to inspecting every frame: processing only one frame per second decreases costs by 30x for a video recorded at 30 fps. Unfortunately, this strategy has limitations. First, the 1 frame out of 30 that we look at may not show the light clearly, causing the detector to miss it completely, while neighboring frames may show it more clearly. Second, lights that remain visible in the video for a long time, say 30 seconds, would be seen multiple times unnecessarily; worse, for other types of objects that remain visible for shorter times the appropriate sampling rate is unknown, and will vary across datasets depending on factors such as whether the camera is moving or static, or the angle and distance to the object.

In this paper we introduce ExSample, a video sampling technique designed to reduce the number of frames that need to be processed by an expensive object detector for search queries on video datasets. ExSample frames this problem as one of deciding which frame from the dataset to look at next based on what it has seen in the past. ExSample starts by conceptually splitting the dataset into temporal chunks (e.g., half-hour chunks), and frames the problem as deciding which chunk to sample from next. As it does this, ExSample keeps a per-chunk estimate of the probability of finding a new result if the next frame we process through the object detector were sampled randomly from that chunk. As it samples more frames, ExSample's estimates become more accurate.

Recent related work, such as probabilistic predicates [12], NoScope [9], and BlazeIt [8], partially overlaps with ExSample in its aim of reducing the cost of processing a variety of queries over video. At a high level, these systems train cheaper surrogate models for each query which they use to approximate the behavior of the object detector. They then prioritize which frames to actually inspect based on how the surrogate scores them. This general approach can yield large savings in specific scenarios.

However, approaches relying on training cheap surrogates have two important shortcomings in the context of ad-hoc object queries, especially when the number of desired results is limited. The first has to do with the extra work needed: for highly selective queries, seeking objects that appear rarely in video requires building and labelling a training set ahead of time, which can be as hard as solving the search problem in the first place. Conversely, for common objects that appear frequently throughout the dataset, the surrogate models introduce an additional inference cost that outweighs the limited savings they provide. Finally, when users only need results up to a fixed number, such as in a limit query or when building a training set, surrogate-based approaches still require an upfront dataset scan in order to score the video frames in the dataset, which can be more expensive than simply sampling frames randomly.
Unlike existing work, ExSample imposes no preprocessing overhead. The second shortcoming lies in the user's need to avoid near-duplicate results in search queries over video. ExSample is designed to give higher weight to areas of video likely to have results which are both new and different, rather than areas where the object detector would simply score high. A key challenge here is that, to be general, ExSample makes no assumptions about how long objects remain visible on screen or how often they appear. Instead, ExSample is guided by feedback from the outputs of the object detector on previous samples.

Our contributions are: 1) ExSample, an adaptive sampling algorithm to facilitate ad-hoc searches over video repositories; 2) a formal justification of ExSample's design; and 3) an empirical evaluation showing ExSample is effective on real datasets and under real system constraints, and outperforms existing approaches to the search problem.

We evaluate ExSample on a variety of search queries spanning different objects, different kinds of video, and different numbers of desired results. We show savings in the number of frames processed ranging from 1.1x to 4x, with a geometric average of 2x across all settings, in comparison to efficient random sampling. Additionally, in comparison to a surrogate-model based approach inspired by BlazeIt [8], our method processes fewer frames to find the same number of distinct results in many cases, and in the remaining cases ExSample still requires one to two orders of magnitude less clock time, because ExSample does not require an upfront preprocessing phase, and so avoids the preprocessing costs of surrogate-based approaches.
2. BACKGROUND
In this section we review object detection, introduce distinct object queries as opposed to plain object queries, explain our main cost metric (frames processed by the object detector), and justify our main baseline: random sampling.
An object detector is an algorithm that operates on still images, taking an image frame as input and outputting a set of boxes within the image containing the objects of interest. The number of objects found can range from zero to arbitrarily many. Well-known examples of object detectors include YOLO [16] and Mask R-CNN [4]. In Figure 1, we show two example frames that have been processed by an object detector, with the traffic light detections surrounded by boxes.

Object detectors with state-of-the-art accuracy on benchmarks such as COCO [11] typically process around 10 frames per second on modern hardware, though it is possible to achieve real-time rates by sacrificing accuracy [6, 16].

In this paper we do not seek to improve on state-of-the-art object detection approaches. Instead, we treat object detectors as a black box with a costly runtime, and aim to substantially reduce the number of video frames processed by the detector.
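Because the detector is treated as a black box, the interface ExSample relies on is small enough to state in a few lines. The following Python sketch is our own illustration; the names (Detection, detect) are assumptions, not the API of any particular library:

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Detection:
    # box corners in pixel coordinates
    x1: float
    y1: float
    x2: float
    y2: float
    label: str    # e.g. "traffic light"
    score: float  # detector confidence

def detect(rgb_frame: Any) -> List[Detection]:
    """Black-box object detector: one still image in, zero or more boxes out.
    In practice this wraps an expensive model such as Faster R-CNN or YOLO."""
    raise NotImplementedError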
In this paper we are interested in processing higher-level queries on video enabled by the availability of object detectors. In particular, we are concerned with object queries such as "find 20 traffic lights in my dataset" over collections of multiple videos from multiple cameras. In natural video, any object of interest lingers within view over some length of time. For example, the frames in Figure 1 contain the same traffic light a few seconds apart. While either frame would be an acceptable result for our traffic light query, an application such as OpenStreetMap would not benefit from having both frames. We are therefore interested in returning distinct object results, and we refer to this specific variation as a distinct object query. Similarly, for an application such as constructing a training set for a classifier or detector, rich sets of diverse examples are preferable to near duplicates. Note that an application could always find near duplicates if desired by traversing the video backwards and forwards starting from a given result; the more difficult part is finding initial results in the first place.

Figure 1: Two video frames showing the same traffic light instance several seconds apart. A distinct object query counts these two boxes as only one result.

The goal of this paper is to reduce the cost of processing such queries. Moreover, we want to do this on ad-hoc distinct object queries on ad-hoc datasets, where there are diverse videos in our dataset and where object detections have not been computed ahead of time for the type of objects we are looking for. This distinction affects multiple decisions in the paper, including not only the results we return but also the main design of ExSample and how we measure result recall. Related work [8, 9] in video processing from the database community aims to reduce cost as well, but as far as we know no existing work targets this kind of query.
A straightforward method to process a distinct object query is to scan all frames. In the traffic light query, for example, we can process every video in the dataset, sequentially evaluating an object detector on each frame. If one or more lights are detected, they are matched with detections from the previous frame, and only truly new, unmatched ones contribute to the result. If there is a limit clause in our query, such as the limit of 20 in our example query 'find 20 traffic lights in our dataset', we can stop scanning as soon as we accumulate 20 results.

Off-the-shelf object detectors act on a frame-by-frame basis, and have no notion of time, memory, or deduplication. So, we assume that in addition to specifying an object detector, the application specifies a discriminator function that decides which detections are new instances and which correspond to previous instances of an already processed detection. This is how we define the full set of unique instances in the video and how we generate the ground truth. For example, the discriminator function may apply an object tracking algorithm like Median Flow [7] or SORT [1] that computes the position of an object over a sequence of frames; then, by tracking new instances backwards and forwards through the video around the sampled frame in which they were first detected, we can determine whether future instances correspond to prior ones by comparing them against previously computed tracks. We expand on this in our evaluation section. Alternatively, the application could use simple heuristics, such as declaring that two detections of traffic lights occurring within 30 seconds of each other must be the same light; a sketch of such a discriminator appears at the end of this section.

As explained in the introduction, an alternative to simple scanning is to process frames sequentially but skip forward 1 second at every step, effectively only processing one frame per second of video. A better strategy still is picking frames from the dataset uniformly at random. By better, we mean that the time to find the number of results requested in the query is smaller when inspecting frames at random than for a strategy that skips 1 second forward at a time. The main reason is that random sampling explores more areas of the data more quickly: moving sequentially means that future samples stay close to our past samples. If we just found a result, frames one second apart are less likely to contain novel results; likewise, if there were no results, frames one second apart are also less likely to have any. Moreover, a fixed skip of one second may miss some results completely, while random sampling will eventually come back to nearby frames.
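To make the discriminator concrete, here is a minimal sketch of the simple-heuristic variant: a detection is declared new unless it overlaps an earlier detection of the same class within a fixed time window. This is our own illustration (thresholds and names are assumptions), not the tracker-based matcher used in our evaluation:

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class SimpleDiscriminator:
    """A stand-in for trackers such as SORT or Median Flow: a detection is a
    repeat if a same-class detection overlapped it within a time window."""
    def __init__(self, iou_thresh=0.5, window_sec=30.0):
        self.seen = []  # (timestamp, label, box) of previously kept results
        self.iou_thresh = iou_thresh
        self.window_sec = window_sec

    def is_new(self, t, label, box):
        for (t0, l0, b0) in self.seen:
            if l0 == label and abs(t - t0) < self.window_sec \
               and iou(box, b0) > self.iou_thresh:
                return False
        self.seen.append((t, label, box))
        return True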
3. ExSample
In this section we explain how our approach, ExSample, minimizes the time needed to find results. Due to the high compute demands of object detection models, even when run on GPUs, runtime and compute cost when using ExSample are a function of the number of frames sampled and then processed by the object detector. When discussing ExSample we use the terms frames sampled and frames processed interchangeably, because every sampled frame is also a processed frame.

In order to minimize the number of frames processed, ExSample works at a high level by estimating which files and temporal segments within a video are more likely to yield new results, and sampling frames more often from those regions. Importantly, because its goal is to find distinct results, ExSample accounts for both result abundance and variability, rather than purely for raw hits. This is implemented by conceptually splitting each file into chunks (for example, half-hour or shorter chunks), and scoring each chunk separately based on what the object detector returns on previously sampled frames from that region. ExSample scores chunks based not just on past hits but also on past repetitions: the more repeated results there are, the lower the score, regardless of past hits. This scoring system allows ExSample to allocate resources to more promising areas initially while also diversifying, over time, where it looks next. In our evaluation, we show this technique helps ExSample outperform purely greedy strategies even when they are equipped with ad-hoc heuristics to avoid duplicates.

To make it practical, ExSample is composed of two core parts: an estimate of future results per chunk, explained in subsection 3.1, and a robust mechanism to translate these estimates into a decision while accounting for estimate errors, explained in subsection 3.3. In those two sections we focus on quantifying the types of error. Later, in Algorithm 1, we explain the algorithm step by step.
In this section we derive our estimate for the future value of sampling a chunk. To facilitate keeping up with the notation introduced in the following sections, we restate all definitions in Appendix A.

In order to make the optimal decision of which chunk to sample next, ExSample estimates $R(n+1)$ for each chunk, which represents the number of new results we expect to find on the $(n+1)$th sample. By new, we mean $R$ does not count results already found in the previous $n$ samples, even if they also appear in the $(n+1)$th frame. The chunk with the largest $R(n+1)$ is a good location to sample next.

In our traffic light example, the chance of finding a traffic light by random sampling intuitively depends on the number of traffic lights in the data, which we will call $N$, and on the fraction of video frames each light is visible for, which we will call $p_i$; the chance of finding a new traffic light depends additionally on how many samples we have drawn already, which we will call $n$. Note that $p_i$ varies from light to light: for example, in video collected from a moving vehicle that stops at a red light, red lights will tend to have large $p_i$, lasting on the order of minutes, while green and yellow lights are likely to have much smaller $p_i$, perhaps on the order of a few seconds.

Technically, $R(n+1) = \sum_{i=1}^{N} [i \notin \mathrm{seen}(n)] \cdot p_i$, where $\mathrm{seen}(n)$ is the set of results we have already seen in previous frames. $R(n+1)$ is a single number fully determined by $\mathrm{seen}(n)$, but it is random if we only know $n$. Both the total number of instances $N$ and their durations $p_i$ are unknown to us unless we have scanned and processed the whole dataset, so our ability to estimate $R(n+1)$ may seem hopeless. Fortunately, there are tools to estimate $R(n+1)$ which do not require us to first estimate either $N$ or the individual $p_i$. The estimate instead relies on counting the number of results seen exactly once so far, which we will represent with $N_1(n)$. $N_1(n)$ is a quantity we can observe and track as we sample frames. The estimate is:

$$R(n+1) \approx N_1(n)/n \quad (1)$$

The formula $N_1(n)/n$ appears in other contexts as the Good-Turing estimator [2], but $N_1(n)$ has a different meaning in those contexts. In our video search application we sample frames, not symbols, at random, and a single frame sample can indirectly sample arbitrarily many objects, or no object at all. This means that in our application, $N$ and $N_1(n)$ could range from 0 to far larger than $n$ itself, for example in a crowded scene. In the typical setting (such as the one explained in the original paper [2]) there is exactly one instance per sample and the only question is whether it is new; in that situation $N_1(n)$ will always be smaller than $n$.

In the remainder of this section we show that the estimate $N_1(n)/n$ applies in our problem setting as well, and in particular that we can bound its relative bias using high-level properties of the data and the query: the number of result instances $N$, the average duration of a result $\mu_p = \sum p_i / N$, and the standard deviation of the durations $\sigma_p = \sqrt{\sum (p_i - \mu_p)^2 / N}$. Note that the error we discuss in this section is the bias of our estimate $\mathrm{E}[N_1(n)]/n$, an error that will occur even in the absence of randomness. In a later section we deal with errors that arise from randomness in the sample.

In particular we will focus on the relative error:

$$\text{rel. err} = \frac{\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]}{\mathrm{E}[N_1(n)]/n}$$

The following inequalities bound the relative bias error of our estimate:
Theorem (Bias).

$$0 \le \text{rel. err} \quad (2)$$
$$\text{rel. err} \le \max_i p_i \quad (3)$$
$$\text{rel. err} \le \sqrt{N}\,(\mu_p + \sigma_p) \quad (4)$$

Intuitively, Equation 2 tells us that $N_1(n)/n$ tends to overestimate, Equation 3 that the size of the overestimate is guaranteed to be less than the largest probability, which is likely small, and Equation 4 that even if a few $p_i$ outliers were large, as long as the durations within one $\sigma_p$ of the average $\mu_p$ are small the error will remain small. A large $N$ or a large $\mu_p$ or $\sigma_p$ may seem problematic for Equation 4, but we note that a large number of results $N$ or a long average duration $\mu_p$ implies many results will be found after only a few samples, so the end goal of the search problem is easy in the first place and having guarantees of accurate estimates is less important.

In a later experiment with skewed data and many instances we show the estimate works well, and the real data in our evaluation has natural skew, for which we obtain consistently good experimental results.
Proof. We now prove Equation 2, Equation 3, and Equation 4.

The chance that object $i$ is seen on the $(n+1)$th try after being missed on the first $n$ tries is $p_i (1-p_i)^n$. For the rest of the paper, we will name this quantity $\pi_i(n+1)$. By linearity of expectation,

$$\mathrm{E}[R(n+1)] = \sum_{i=1}^{N} \pi_i(n+1)$$

For the rest of this proof, we will avoid explicitly writing summation indexes $i$, since our summations are always over the result instances, from $i = 1$ to $N$.

$\mathrm{E}[N_1(n)]$ can also be expressed directly. The chance of having seen instance $i$ exactly once after $n$ samples is $n p_i (1-p_i)^{n-1}$, with the extra factor $n$ coming from the possibility of the instance having shown up at any of the first $n$ samples. We can rewrite this as $n \pi_i(n)$. So $\mathrm{E}[N_1(n)] = n \sum \pi_i(n)$, giving:

$$\mathrm{E}[N_1(n)]/n = \sum \pi_i(n)$$

Now we focus on the numerator $\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]$:

$$\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)] = \sum \pi_i(n) - \pi_i(n+1) = \sum p_i^2 (1-p_i)^{n-1} = \sum p_i \, \pi_i(n)$$

We can see here that each term in the error is positive, hence we always overestimate, which proves Equation 2. Now we want to bound this overestimate. Intuitively we know the overestimate is small because each term is a scaled-down version of the terms in $\mathrm{E}[N_1(n)]/n = \sum \pi_i(n)$; more precisely:

$$\sum p_i \pi_i(n) \le (\max_i p_i) \cdot \sum \pi_i(n) = (\max_i p_i) \cdot \mathrm{E}[N_1(n)]/n \quad (5)$$

For example, if all the $p_i$ are less than 1%, then the overestimate is also less than 1% of $N_1(n)/n$. However, if there may be a few outliers with large $p_i$ which are not representative of the rest, unavoidable in real data, then we would like to know our estimate will still be useful. We can show this is still true as follows:

$$\sum p_i \pi_i(n) \le \sqrt{\left(\sum p_i^2\right)\left(\sum \pi_i^2\right)} \qquad \text{(Cauchy-Schwarz)}$$
$$\le \sqrt{\left(\sum p_i^2\right)\left(\sum \pi_i\right)^2} \qquad (\pi_i \text{ are positive})$$
$$= \sqrt{\sum p_i^2}\; \sum \pi_i = \sqrt{\sum p_i^2}\;\; \mathrm{E}[N_1(n)]/n = \sqrt{N}\, \sqrt{\sum p_i^2 / N}\;\; \mathrm{E}[N_1(n)]/n$$

$\sum p_i^2 / N$ is the second moment of the $p_i$, and can be rewritten as $\sigma_p^2 + \mu_p^2$, where $\mu_p$ and $\sigma_p$ are the mean and standard deviation of the underlying result durations:

$$= \sqrt{N}\, \sqrt{\sigma_p^2 + \mu_p^2}\;\; \mathrm{E}[N_1(n)]/n \le \sqrt{N}\, (\sigma_p + \mu_p)\; \mathrm{E}[N_1(n)]/n \quad (6)$$

Putting together Equation 5 and Equation 6 we get:

$$\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)] \le \begin{cases} (\max_i p_i)\; \mathrm{E}[N_1(n)]/n \\ \sqrt{N}(\sigma_p + \mu_p)\; \mathrm{E}[N_1(n)]/n \end{cases}$$

And dividing both sides by $\mathrm{E}[N_1(n)]/n$, we get:

$$\text{rel. err} = \frac{\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]}{\mathrm{E}[N_1(n)]/n} \le \begin{cases} \max_i p_i \\ \sqrt{N}(\sigma_p + \mu_p) \end{cases}$$

which justifies the remaining two bounds, Equation 3 and Equation 4. ∎
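The bias bound is easy to probe numerically. The sketch below (our own illustration, not the paper's artifact) draws frames from a synthetic set of instances and compares the estimate $N_1(n)/n$ against the directly computed $\mathrm{E}[R(n+1)]$:

import numpy as np

rng = np.random.default_rng(0)
p = rng.lognormal(mean=-9.0, sigma=1.0, size=200)  # per-instance visibility fractions
counts = np.zeros(len(p), dtype=int)               # times each instance was seen

for n in range(1, 5001):
    counts += rng.random(len(p)) < p   # instance i appears with probability p_i
    if n % 1000 == 0:
        n1 = int(np.sum(counts == 1))
        true_r = p[counts == 0].sum()  # E[R(n+1)]: mass of still-unseen instances
        print(f"n={n:5d}  N1/n={n1 / n:.6f}  E[R(n+1)]={true_r:.6f}")

As the theorem predicts, the printed estimate slightly overestimates the true value, with a relative error far below $\max_i p_i$ in typical runs.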
In principle, if we knew $R_j(n_j+1)$ for every chunk $j$, the algorithm could simply take the next sample from the chunk with the largest estimate. However, if we simply use the raw estimate

$$R_j(n_j+1) \approx N_{1j}(n_j)/n_j \quad (7)$$

and pick the chunk with the largest estimate, then we run into two potential problems. The first is that each estimate may be off due to the randomness of the sample, especially for chunks with smaller $n_j$; we address this problem in the next section. The second potential problem is the issue of instances spanning multiple chunks, which we address afterwards. In reality we assign scores to multiple chunks as we sample them, and each chunk will have been sampled a different number of times $n_j$ and will have its own $N_{1j}(n_j)$. In fact, we want different chunks to be sampled a very different number of times, because that is what ExSample must do to outperform random sampling.

We have shown how to estimate which chunk is promising, but for this to be practical we still need to handle the problem that our observed $N_1(n)$ will fluctuate randomly due to randomness in our sampling. This is especially true early in the sampling process, when only a few samples have been collected but we need to make a sampling decision. Because the quality of the estimates themselves is tied to the number of samples we have taken, and we do not want to stop sampling a chunk due to a small amount of bad luck early on, it is important that we estimate how noisy our estimate is. The usual way to do this is by estimating the variance of our estimator: $\mathrm{Var}[N_1(n)/n]$. Once we have a good idea of how this variance error depends on the number of samples taken, we can make informed decisions about which chunk to pick next, balancing both the raw score and the number of samples it is based on.

Theorem (Variance). If instances occur independently of each other, then
$$\mathrm{Var}[N_1(n)/n] \le \mathrm{E}[N_1(n)]/n^2 \quad (8)$$

Note that this bound also implies that the variance error is more than a factor of $1/n$ smaller than the value of the raw estimate, because we can rewrite it as $(\mathrm{E}[N_1(n)]/n)/n$. Note, however, that the independence assumption is necessary for proving this bound. While in reality different results may not occur and co-occur truly independently, our experimental results in the evaluation show our estimate works well enough in practice.
Proof. We estimate the variance of $N_1(n)$ assuming independence of the different instances. We can express $N_1(n)$ as a sum of binary indicator variables $X_i$, which are 1 if instance $i$ has shown up exactly once: $N_1(n) = \sum_i X_i$, where $X_i = 1$ with probability $n\pi_i(n) = n p_i (1-p_i)^{n-1}$. Because of our independence assumption, $\mathrm{Var}[N_1(n)] = \sum_i \mathrm{Var}[X_i]$. Because $X_i$ is a Bernoulli random variable, its variance is $n\pi_i(n)(1 - n\pi_i(n))$, which is bounded by $n\pi_i(n)$ itself. Therefore $\mathrm{Var}[N_1(n)] \le \sum n\pi_i(n)$, and this latter sum we know from before is $\mathrm{E}[N_1(n)]$. Therefore $\mathrm{Var}[N_1(n)/n] \le \mathrm{E}[N_1(n)]/n^2$. ∎

In fact, we can go further and fully characterize the distribution of values $N_1(n)$ takes.

Theorem (Sampling distribution of $N_1(n)$). Assuming the $p_i$ are small or $n$ is large, and assuming independent occurrence of instances, $N_1(n)$ follows a Poisson distribution with parameter $\lambda = \sum_i n\pi_i(n)$.

Proof. We prove this by showing that $N_1(n)$'s moment generating function (MGF) matches that of a Poisson distribution with parameter $\lambda$: $M(t) = \exp\left(\lambda [e^t - 1]\right)$.

As in the proof of Equation 8, we think of $N_1(n)$ as a sum of independent binary random variables $X_i$, one per instance. Each of these variables has moment generating function $M_{X_i}(t) = 1 + n\pi_i(n)(e^t - 1)$. Because $e^x \approx 1 + x$ for small $x$, and $n\pi_i(n)(e^t - 1)$ will be small, we have $1 + n\pi_i(n)(e^t - 1) \approx \exp(n\pi_i(n)(e^t - 1))$. The term $\pi_i(n+1)$ is always eventually small for large $n$ because $\pi_i(n+1) = p_i(1-p_i)^n \le p_i e^{-np_i} \le 1/(en)$. Because the MGF of a sum of independent random variables is the product of the terms' MGFs, we arrive at:

$$M_{N_1(n)}(t) = \prod_i M_{X_i}(t) = \exp\left(\left[\sum_i n\pi_i(n)\right] \left[e^t - 1\right]\right)$$ ∎

Now we can use this information to design a decision-making strategy. The goal is to meaningfully pick between $(N_{1j}, n_j)$ and $(N_{1k}, n_k)$ instead of only between $N_{1j}$ and $N_{1k}$, where the only reasonable answer would be to pick the largest. One way to implement this comparison is to randomize it, which is what Thompson sampling [18] does. Thompson sampling works by modeling unknown parameters such as $R_j$ not with point estimates such as $N_{1j}(n_j)/n_j$ but with a wider distribution over their possible values, whose width depends on our uncertainty. Then, whenever we would have used the point estimate to make a decision, we instead draw a sample from its distribution and use that number, effectively adding noise to our estimate in proportion to how uncertain it is. In our implementation, we choose to model the uncertainty around $R_j(n+1)$ as following a Gamma distribution:

$$R_j(n+1) \sim \Gamma(\alpha = N_{1j},\; \beta = n_j) \quad (9)$$

Although less common than the Normal distribution, the Gamma distribution is shaped much like the Normal when $N_1/n$ is large, but behaves more like a single-tailed distribution when $N_1/n$ is near 0, which is desirable because $N_1/n$ will become very small over time, while we know our hidden $\lambda$ is always non-negative. The Gamma distribution is a common way to model the uncertainty around an unknown (but positive) parameter $\lambda$ of a Poisson distribution whose samples we observe. This choice is especially suitable for our use case, as we have shown that $N_1(n)$ does in fact follow a Poisson distribution. The mean of the Gamma distribution in Equation 9 is $\alpha/\beta = N_{1j}/n_j$, which is by design consistent with Equation 7, and its variance is $\alpha/\beta^2 = N_{1j}/n_j^2$, which is by design consistent with the variance bound of Equation 8.

Finally, the Gamma distribution is not defined when $\alpha$ or $\beta$ are 0, so we need a way to deal with the scenario where $N_1(n) = 0$, which could happen often due to objects being rare or due to having exhausted all results, as well as at initialization, when both $N_1$ and $n$ are 0. We do this by adding small quantities $\alpha_0$ and $\beta_0$ to the two terms, obtaining:

$$R_j(n_j+1) \sim \Gamma(\alpha = N_{1j}(n_j) + \alpha_0,\; \beta = n_j + \beta_0) \quad (10)$$

We used $\alpha_0 = 0.1$ and $\beta_0 = 1$ in practice, though we did not observe a strong dependence on these values.
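In code, the chunk-choice rule of Equation 10 is only a few lines. Below is a minimal sketch (variable names ours); note that numpy's gamma sampler is parameterized by shape and scale, so the rate $\beta$ becomes $1/\beta$:

import numpy as np

def choose_chunk(N1, n, alpha0=0.1, beta0=1.0, rng=np.random.default_rng()):
    """Thompson sampling step: draw R_j ~ Gamma(N1[j] + alpha0, rate=n[j] + beta0)
    for every chunk j and return the index of the largest draw."""
    N1 = np.asarray(N1, dtype=float)
    n = np.asarray(n, dtype=float)
    draws = rng.gamma(N1 + alpha0, 1.0 / (n + beta0))  # scale = 1 / rate
    return int(np.argmax(draws))

# example: chunk 1 has yielded more once-seen results per sample,
# so it wins most (but not all) of the draws
print(choose_chunk(N1=[1, 5, 0], n=[40, 40, 40]))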
The next question is whether these techniques work when applied to skewed data. Here we provide an empirical validation of the estimates from the previous sections, including Equation 7 and Equation 10. The question we are interested in is: given an observed $N_1$ and $n$, what is the true expected $R(n+1)$, and how does it compare to the belief distribution $\Gamma(N_1, n)$ which we propose using?

We ran a series of simulation experiments. We first generate 1000 values $p_1, p_2, \ldots, p_{1000}$ at random to represent 1000 results with different durations. To ensure the duration skew we observe in real data, we use a lognormal distribution to generate the $p_i$, so the generated durations span several orders of magnitude. To decide which of the instances show up in a sampled frame, we simulate tossing 1000 coins independently, each with its own $p_i$; the positive draws give us the subset of instances visible in that frame. We then draw these samples sequentially, tracking the number of frames we have sampled, $n$, and how many instances we have seen exactly once, $N_1$, and we also record $\mathrm{E}[R(n+1)]$: the expected number of new instances in a newly sampled frame. This is possible because in the simulation we know the remaining unseen instances and their hidden probabilities $p_i$, so we can compute $\mathrm{E}[R(n+1)] = \sum_i^N [i \notin \mathrm{seen}(n)] \cdot p_i$ directly. We sequentially sample frames up to $n = 180000$, and repeat the experiment 10K times, obtaining hundreds of millions of tuples of the form $(n, N_1, R(n+1))$ for our fixed set of $p_i$. Using this data, we can answer our original question (given an observed $N_1$ and $n$, what is the true expected $R(n+1)$?) by selecting different observed $n$ and $N_1$ at random, conditioning (filtering) on them, and plotting a histogram of the actual $R(n+1)$ values that occurred across all our simulations. We show these histograms for 10 pairs of $n$ and $N_1$ in Figure 2, alongside our belief distribution.

Figure 2 shows a mix of 3 important scenarios. The first 3 subplots, with small $n$, show belief distributions wider than the observed spread of $R(n+1)$. This is intuitively expected, because early on both the bias and the variance of our estimate are bottlenecked by the number of samples, and not by the inherent uncertainty of $R(n+1)$. As $n$ grows to mid-range values (next 4 plots), we see that the curve fits the histograms very well, and also that the curve keeps shifting left to lower and lower orders of magnitude on the x axis. Here we see that the one-sided nature of the Gamma distribution fits the data better than a bell-shaped curve would. The final 3 subplots show scenarios where $n$ has grown large and $N_1$ is potentially very small, including a case where $N_1 = 0$. In that last subplot, we see the effect of having the extra $\alpha_0$ in Equation 10, which means Thompson sampling will continue producing non-zero values at random and we will eventually correct our estimate when we find a new instance. In 3 of the subplots there is a clear bias to overestimate, though not a large one despite the large skew. This empirical validation was based on simulated data; in our evaluation we show these modeling choices work well in practice on real datasets as well, where our assumption of independence is not guaranteed and where durations may not follow the same distribution law.

If instances can span multiple chunks, for example a traffic light that spans the boundary of two neighboring chunks, Equation 7 is still accurate with the caveat that $N_{1j}(n_j)$ is interpreted as the number of instances seen exactly once globally which were found in chunk $j$. The same object found once in each of two chunks $j$ and $k$ does not contribute to either $N_{1j}$ or $N_{1k}$, even though each chunk has only seen it once. The derivation of this rule is similar to that in the previous section, and is given in Appendix B. In practice, if only a few rare instances span multiple chunks, then results are almost the same and this adjustment does not need to be implemented.

At runtime, the numerator $N_{1j}$ increases in value the first time we find a new result globally, decreases back as soon as we find it again, either in the same chunk or elsewhere, and finding it a third time does not change it anymore. Meanwhile $n_j$ increases upon sampling a frame from that chunk. This natural relative increase and decrease of $N_{1j}$ and $n_j$ with respect to each other allows ExSample to seamlessly shift where it allocates samples over time.
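A scaled-down version of the validation experiment described above can be sketched in a few lines (our own illustration; the real experiment uses $n$ up to 180000 and 10K repetitions):

import numpy as np

rng = np.random.default_rng(1)
p = rng.lognormal(mean=-10.0, sigma=1.5, size=1000)  # skewed instance durations

def run_once(n_samples):
    """Sample n_samples frames; return observed N1 and the true E[R(n+1)]."""
    counts = np.zeros(len(p), dtype=int)
    for _ in range(n_samples):
        counts += rng.random(len(p)) < p
    return int(np.sum(counts == 1)), p[counts == 0].sum()

n = 2000
for _ in range(5):
    n1, true_r = run_once(n)
    # Gamma(N1 + 0.1, rate = n + 1) belief: mean a/b, standard deviation sqrt(a)/b
    mean = (n1 + 0.1) / (n + 1)
    sd = (n1 + 0.1) ** 0.5 / (n + 1)
    print(f"N1={n1:3d}  E[R(n+1)]={true_r:.5f}  belief {mean:.5f} +/- {sd:.5f}")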
Meanwhile n j increases upon sampling a frame fromthat chunk. This natural relative increase and decrease of N j with respect to each other allows ExSample to seamlesslyshift where it allocates samples over time. : 172085N1: 5 n: 179601N1: 0n: 120911N1: 4 n: 131315N1: 7n: 41013N1: 18 n: 48094N1: 17n: 100N1: 116 n: 14093N1: 58n: 82N1: 127 n: 97N1: 1210e+00 3e−05 6e−05 9e−05 0.0e+00 2.5e−05 5.0e−05 7.5e−050.00000 0.00005 0.00010 0.00015 0.00000 0.00005 0.00010 0.000150.00025 0.00050 0.00075 0.00100 0.001250.00000 0.00025 0.00050 0.00075 0.001001.0 1.2 1.4 0.002 0.003 0.004 0.005 0.006 0.0071.1 1.3 1.5 1.7 1.9 1.0 1.2 1.4 R(n+1) actual R(n+1) point estimate Thompson samples Gamma(N1+.1,n+1)
Estimates, real values and Thompson sampling
Figure 2: Comparing our Gamma heuristic of Equation 10 with a histogram of the true values $R(n+1)$ from a simulation with heavily skewed $p_i$. The details of the simulation are discussed in subsubsection 3.3.2. We picked 10 $(N_1, n)$ pairs from the data to include multiple important edge scenarios: where $n$ is small as well as where $n$ is very large (in this case up to 20% of the total frames). We also show $N_1$ close to 0 due to bad luck in the last subplot. Note we are using the noisy observed $N_1$ and not the idealized $\mathrm{E}[N_1]$, which would be a lot more accurate but is not directly observable. The histograms show the range of values seen for $R(n+1)$ given the observed $N_1$ and $n$. The point estimate $N_1/n$ of Equation 7 is shown as a vertical line. The belief distribution density is plotted as a thicker orange line, and 5 samples drawn from it using Thompson sampling are shown with dashed lines.

In practice, the chunk-and-sample approach of ExSample works well when different chunks have different scores and these differences persist after more than a few samples. This is the case if different files are very different in content, for example a car driving in one city vs. another city or on a highway, or, within a single file, if the camera moves or there is a strong temporal pattern. For example, if after a few samples we find that 50% of the chunks likely have no results, we can expect ExSample to focus sampling on the rest of the dataset, with savings bounded by 2x compared to random sampling, which would keep allocating samples everywhere. In contrast, if all chunks have essentially the same score, then random sampling will be just as good (or as bad) at finding results.

Chunking based on time is likely to work well because many types of results show some amount of temporal locality. For example, traffic lights appear in cities and are likely to appear one block after the next. Making intervals too long (for example, the whole video) means fewer opportunities for scores to differ across intervals. On the other hand, making intervals very short (for example, one second long) means a lot of sampling is spent estimating which chunks are better, and the payoff of this information is smaller because we quickly run out of frames within each chunk.

For our evaluation, simply using chunks based on files, split into video intervals of up to 30 minutes, worked well across our benchmarks.
In this section we lay out how the intuition of the previous sections translates into pseudocode, which we show in Algorithm 1.

    input : video, chunks, detector, matcher, result_limit
    output: ans
     1  ans ← []
        // arrays for stats of each chunk
     2  N1 ← [0, 0, ..., 0]
     3  n  ← [0, 0, ..., 0]
     4  while len(ans) < result_limit do
        // 1) choice of chunk and frame
     5      for j ← 1 to M do
     6          R_j ← Γ(N1[j] + α0, n[j] + β0).sample()
     7      end
     8      j* ← argmax_j R_j
     9      frame_id ← chunks[j*].sample()
        // 2) io, decode, detection and matching
    10      rgb_frame ← video.read_and_decode(frame_id)
    11      dets ← detector(rgb_frame)
        // d0 are the unmatched (new) detections
        // d1 are detections with exactly one previous match
    12      d0, d1 ← matcher.get_matches(frame_id, dets)
        // 3) update state
    13      N1[j*] ← N1[j*] + len(d0) − len(d1)
    14      n[j*]  ← n[j*] + 1
    15      matcher.add(frame_id, dets)
    16      ans.add(d0)
    17  end

    Algorithm 1: ExSample

The inputs to the algorithm are:

- video: the video dataset itself, which may be a single video or a collection of files.
- chunks: the collection of logical chunks that we have split our video dataset into. One natural way is to split by time, so we can think of them as splitting each file in our dataset into 30-minute units. There are M chunks total.
- detector: an object detector provided by the user, for detecting the objects of interest to the application.
- matcher: an algorithm that matches detections to suggest which are new and which may be duplicates. The notion of new is application specific, but the matcher can be implemented based on feature-vector appearance, for example. We note that the matcher does not need to be accurate: a dummy matcher could say any two instances more than 1 second apart are distinct, which is effectively what much current work does. Its function is simply to signal that we are repeating results so we can better discount chunks.
- result_limit: an indication of when to stop.

After initializing arrays to hold per-chunk statistics, the code can be understood in three parts: choosing a frame, processing the frame, and updating state. The frame choice part is where ExSample makes a decision about which frame to process next. It starts with the Thompson sampling step in line 6, where we draw a separate sample R_j from the belief distribution of Equation 10 for each of the chunks, which is then used in line 8 to pick the highest scoring chunk. The j in the code is used as the index variable for any loop over the chunks. During the first execution of the while loop all the belief distributions are identical, but their samples will not be, breaking ties at random. Once we have decided on a best chunk index j*, we sample a frame index at random from it in line 9.

The second part includes all the heavy work involved in video processing: reading and decoding the frame we chose, and applying the object detector to it (line 11). After that is done, we pass the detections on to a matcher algorithm, which compares them with those we have returned before in other frames and decides if they are distinct enough to be considered separate results. For it to be useful to our task, the matcher needs to give us the subset of detections which did not match any previous results, and the ones that matched exactly once with a detection from a previous iteration. The length of each of those lists is all we need to update our statistics in part 3. It is important to note that this part of the algorithm is the main bottleneck, with the detector call in line 11 being most of the work, followed in second place by the random read and decode of line 10. In comparison, the overhead of the first part is negligible and fully parallelizable; it only grows with the number of chunks.

The third part updates the state of our algorithm, updating N1 and n for the chunk we sampled from. Additionally, we store detections in the matcher and append the truly new detections to the answer. We note that the amount of state we need to track only grows with the number of results found so far, and not with the size of the dataset.
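For concreteness, here is Algorithm 1 transcribed into runnable Python, assuming detector and matcher objects with the interfaces sketched earlier (read_and_decode, get_matches, and add are assumed method names, not a specific library's API):

import numpy as np

def exsample(video, chunks, detector, matcher, result_limit,
             alpha0=0.1, beta0=1.0, seed=0):
    """Sampling loop of Algorithm 1. chunks[j].sample() is assumed to return
    a random frame id within chunk j."""
    rng = np.random.default_rng(seed)
    M = len(chunks)
    N1 = np.zeros(M)   # per chunk: results currently seen exactly once
    n = np.zeros(M)    # per chunk: frames sampled so far
    ans = []
    while len(ans) < result_limit:
        # 1) choice of chunk and frame (Thompson sampling, Equation 10)
        draws = rng.gamma(N1 + alpha0, 1.0 / (n + beta0))
        j = int(np.argmax(draws))
        frame_id = chunks[j].sample()
        # 2) io, decode, detection and matching (the expensive part)
        dets = detector(video.read_and_decode(frame_id))
        # d0: detections matching no previous result (truly new)
        # d1: detections matching exactly one previous detection
        d0, d1 = matcher.get_matches(frame_id, dets)
        # 3) update state: new singletons enter N1, second sightings leave
        N1[j] += len(d0) - len(d1)
        n[j] += 1
        matcher.add(frame_id, dets)
        ans.extend(d0)
    return ans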
Finally, we present several other optimizations to Algorithm 1. Algorithm 1 processes one frame at a time, but to make good use of modern GPUs we may want to run inference in batches. The code for a batched version is similar to Algorithm 1, but on line 6 we draw B samples per chunk j instead of one sample from each belief distribution, creating B cohorts of samples. In Figure 2 we show 5 different values drawn by Thompson sampling from the same distribution as dashed lines. Because each sample for the same chunk will be different, the chunk with the maximum value will also vary, and we will get B chunk indices, biased toward the more promising chunks. The logic for the state update only requires small modifications. In principle, we might fear that picking the next B frames at once instead of only 1 frame could lead to suboptimal decision making within that batch, but at least for small values of B up to 50, which is what we use in our evaluation, we saw no significant drop. This is likely enough to fully utilize a machine with multiple GPUs.

We do not implement or evaluate asynchronous, distributed execution in this paper, but the same reasoning suggests ExSample could be made to scale to an asynchronous setting, with workers processing a batch of frames at a time without waiting for other workers. Ultimately all the updates to N1j and nj are commutative because they are additive.

While random sampling is a good baseline, it allows samples to land very close to each other in quick succession: for example, in a 1000-hour video, random sampling is likely to start sampling frames within the same one-hour block after having sampled only about 30 different hours, instead of after having sampled most of the hours once. For this reason, we introduce a variation of random sampling, which we call random+, that deliberately avoids sampling temporally near previous samples when possible: it samples one random frame out of every hour, then one frame out of every not-yet-sampled half hour at random, and so on, until eventually sampling the full dataset. We evaluate the separate effect of this change in our evaluation, and we also use random+ to sample within each chunk when evaluating ExSample. This is implemented by modifying the internal implementation of the chunk.sample() method of line 9, and does not change the top-level algorithm. A sketch of this ordering appears below.
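The random+ ordering can be sketched as follows (our illustration; interval boundaries and the repeat-handling policy are implementation choices): visit one random frame from every interval at the current granularity, then halve the intervals and repeat:

import random

def random_plus_order(num_frames, top_interval, seed=0):
    """Yield frame indices spread out in time: one random frame from each
    top-level interval, then one from each half-interval, and so on.
    Sketch only: occasional repeated frames are possible and can be
    filtered by the caller."""
    rng = random.Random(seed)
    level = [(s, min(s + top_interval, num_frames))
             for s in range(0, num_frames, top_interval)]
    while level:
        order = level[:]
        rng.shuffle(order)           # visit intervals in random order
        for s, e in order:
            yield rng.randrange(s, e)
        # descend one level: split each interval in half
        level = [half for s, e in level if e - s > 1
                 for half in ((s, (s + e) // 2), ((s + e) // 2, e))]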
Generalized instance durations. Throughout the paper we have used $p_i$ as a proxy for duration, assuming we select frames with uniform random sampling. However, we could weight frames non-uniformly at random in the first place, for example by using some precomputed per-frame score. If we use non-uniform weights to sample frames, we effectively induce a different set of $p'_i$ for each result, ideally one with $\mu_{p'} \gg \mu_p$. The estimates for the relative value of different chunks will still be correct, since ExSample is designed to work with any underlying $p_i$.

Generalized chunks. We have introduced the idea of a chunk as a partitioning of the data. However, it would be possible for chunks to overlap, and this is equivalent to having instances that span multiple chunks. This choice is only meaningful if the different chunks differ in some way; for example, one chunk can be a half hour of video sampled uniformly at random, while a different chunk can be the same half hour sampled under a different set of weights, using the idea of generalized $p_i$.
4. EVALUATION
Our goals for this evaluation are to demonstrate the benefits of ExSample on real data, comparing it to alternatives including random sampling as well as existing work in video processing. We show that on these challenging datasets ExSample achieves savings of up to 4x with respect to random sampling, and orders of magnitude with respect to approaches based on related work.

Even though existing work does not explicitly optimize for distinct object queries like ExSample does, existing work such as BlazeIt and NoScope optimizes searching for frames that satisfy expensive predicates. These systems also recognize the need to avoid redundant results, and implement a basic form of duplicate avoidance by skipping a fixed amount of time. It is therefore a reasonable question whether existing work already inadvertently processes distinct object queries efficiently; in this evaluation we show this is not the case, and that the two kinds of query require different approaches. The main line of existing work in video processing uses lightweight convolutional networks to assign a preliminary score to each frame and then processes frames from highest to lowest score. BlazeIt is the state-of-the-art representative of this surrogate model approach.
Both our implementation of BlazeIt and ExSample are at their core a sampling loop where the choice of which frame to process next is based on an algorithm-specific score. Based on this score, the system fetches a batch of frames from the video and runs that batch through an object detector. We implement this sampling loop in Python, using PyTorch to run inference on a GPU. The object detector model, which we refer to as the full model, is the Faster R-CNN with a ResNet-101 backbone used for ground truth. To reach reasonable random-access frame decoding rates, we use the Hwang library from the Scanner project [15], and re-encode our datasets to insert keyframes every 20 frames. BlazeIt requires extra upfront costs prior to sampling to train the surrogate model; we describe our implementation of that part of BlazeIt and its associated fixed costs in subsection 4.6.

We implemented the subset of BlazeIt for limit queries with simple predicates, based on the description in the paper [8] as well as their published code. We opted for our own implementation to make sure the I/O and decoding components of both ExSample and BlazeIt were equally optimized, and also because extending BlazeIt to handle our own datasets, ground truth, and metrics is more involved. For the cheaper surrogate (aka specialized model) in the BlazeIt paper we use an ImageNet pre-trained ResNet-18. This model is more heavyweight than the ones used in that paper, but also more accurate. We note that our timed results do not depend strongly on the runtime cost of ResNet-18.
For this evaluation we use two datasets, which we refer to as Dashcam and BDD. The Dashcam dataset consists of 10 hours of video, or over 1.1 million video frames, collected from a vehicle-mounted dashboard camera over several drives in cities and highways. Each drive ranges from around 20 minutes to several hours. The BDD dataset used for this evaluation is a subset of the Berkeley Deep Drive dataset [21], which consists of 40-second clips of dashboard camera video; our subset is made of 1000 randomly chosen clips. In both datasets the camera moves at variable speeds depending on the drive, and the BDD dataset in particular includes clips from many cities.
Both the Dashcam and BDD datasets include similar types of objects that we expect to see in cities and highways. These include stationary objects such as traffic lights and signs, and moving objects such as bicycles and trucks. Each type of object is a different search query, making our evaluation consist of 8 queries per dataset.

In addition to searching for different object classes, we also vary the limit parameter to achieve 0.1, 0.5 and 0.9 recall, where recall is measured as the fraction of distinct results found. These recall rates are meant to represent different kinds of applications: 0.1 (10%) represents a scenario where an autonomous vehicle data scientist is looking for a few test examples, whereas a higher recall like 0.9 would be more useful in an urban planning or mapping scenario where finding many instances is desired.
Neither the Dashcam nor the BDD dataset has human-generated object instance labels that both identify and track multiple objects over time. Therefore, we approximate ground truth by sequentially scanning every video in the dataset and running each frame through an object detector. If any objects are detected, we match the bounding boxes with those from previous frames and resolve which correspond to the same instance. For object detection we use a reference implementation of Faster R-CNN [17] from Facebook's Detectron2 library [20], one of the higher-accuracy object detection models, pre-trained on the COCO [11] dataset; in particular, we use the version with a ResNet-101 [5] backbone. To match object instances across neighboring frames, we employ an IoU matching approach similar to SORT [1]. IoU matching is a simple baseline for multi-object tracking that leverages the output of an object detector and matches detection boxes based on overlap across adjacent frames.
Here we evaluate the cost of processing distinct object queries, in both time and frames, using ExSample, random+, and BlazeIt. Because some classes like parking meter are extremely rare, and some such as truck are much more common, the absolute times and frame counts needed to process a query vary by many orders of magnitude across different object classes. It is easier to get a global view of cost savings by normalizing the query processing times of ExSample and BlazeIt against those of random+. That is, if random+ takes 1 hour to find 20% of traffic lights, and ExSample takes 0.5 hours to do the same, time savings are 2x. We apply the same normalization when comparing frames processed. Results for Dashcam are shown in Figure 3 and for BDD in Figure 4.

Overall, ExSample saves more than 2x in frames processed vs. random+, averaged across all classes and recall levels. Savings do vary by class, reaching above 4x for the person class. Savings in time are much larger when comparing with BlazeIt: although BlazeIt does reduce the number of frames the expensive detector is run on, especially at low recall, it performs very poorly overall because its high overhead costs prior to processing the query cancel the early wins from better prediction. This is especially evident in Figure 5, top row. Note that random+ is a better baseline than BlazeIt for this query, demonstrating that current techniques developed for other types of queries do not necessarily transfer to this query type.

Figure 3 shows two main trends. The first is that models trained with BlazeIt do succeed in finding objects faster when measured in number of frames processed. However, the total time it takes to sample using ExSample is orders of magnitude shorter, because ExSample does not require a prior training phase and only runs on a subset of frames, whereas the surrogate model needs to be run on every frame (we evaluate the relative costs of these two overheads in BlazeIt in the next section). The impact of these overheads shrinks as more samples are taken; however, in none of our experiments is the gain enough to amortize the costs of training.

Figure 4 shows similar trends. This is a challenging scenario for ExSample because each clip is only 40 seconds long, meaning we get fewer samples per chunk (and more chunks overall). This decreases our performance compared to the larger chunks in the Dashcam dataset; however, we still achieve a 2x improvement over random. Surrogate models have a harder time learning with high accuracy on this dataset, as shown in Figure 5. Counterintuitively, this helps in Figure 4, because random scoring is a better baseline for distinct object queries than very precise scoring.

Note that BlazeIt-trained models are not able to beat random search for this task. Counterintuitively, this result is not because the surrogate models fail to learn: BlazeIt's error rates are much better than random on the validation set during training. However, the surrogate is only better when used as a classifier, not as a way to rank promising frames for sampling. The main issue is that while BlazeIt models do predict which frames are likely to contain an object of interest, they pick high-scoring frames regardless of whether the objects in them are new or not. Because of the greedy nature of this approach, the extra accuracy from BlazeIt models is only helpful early on, at low numbers of samples, when every positive result is likely new.
The experiments in Figure 3 show BlazeIt is able to find 10% of all bicycles after only about 10 seconds of sampling, whereas ExSample reaches this level after 100 seconds, and random+ after 300 seconds. However, the training and surrogate scoring phases add an overhead of 10000 seconds to BlazeIt; hence ExSample does 100x better, and random+ 30x better, early on, since they have no fixed overhead at the start.
This section aims to break down in more detail the costs underlying the results of the previous section. In short, after BlazeIt surrogate models are trained, they must be run over the full dataset, and the scores are used to identify the highest scoring frames for sampling. While BlazeIt surrogate models are indeed much faster than the full models, scanning the full dataset is expensive even if the cheap model were free, because loading and decoding video is not free.
[Figure 3 plot: savings relative to random+, by object class (bicycle, person, traffic light, fire hydrant, stop sign, bus, truck) and recall level (.1, .5, .9), for BlazeIt, random, and this work. Title: Savings on Dashcam dataset.]

Figure 3: Results on the Dashcam dataset. Comparing time and frames processed for different recall levels (each row) on the dashcam dataset. Different methods (color) are compared relative to the cost of using random+ to process the same query and reach the same level of instance recall. The left column shows savings computed in terms of time, which includes the initial surrogate training overheads for BlazeIt. The right column shows savings comparing frames processed, which excludes any frames processed for training purposes.
BlazeIt [8] prioritizes sampling the highest scoring frames, where the score is computed with a cheaper surrogate model. Answering queries with such systems involves four stages, whose throughputs on our data are shown in Figure 6:

1. labelling phase: requires labelling a fraction of the dataset with the expensive object detector. Its runtime grows linearly with training set size, and because the object detector is involved, the throughput is as low as that of the sampling phase.
2. training phase: once the labels are generated, a cheaper surrogate model is fit to the dataset. This phase can be relatively cheap if the surrogate is itself cheap and the training set fits in memory, avoiding any need for I/O or decoding. Figure 6 shows the throughput of ResNet-18 is indeed much higher and unlikely to be the bottleneck.
3. scoring phase: the surrogate model runs over the dataset, producing a score for each frame. Even if the surrogate model were virtually free, Figure 6 shows that the I/O and decode for the remainder of the dataset dominate the runtime.
4. sampling phase: we fetch and process frames in descending order of surrogate score. This phase ends when we have found enough results for the user. This is the only phase for ExSample and for baselines such as random+. Regardless of access pattern, this phase is dominated by the cost of inference with the full model.

The first three phases can be seen as a fixed cost paid prior to finding results for surrogate-based methods. The promise of these surrogate-based techniques is that by paying the upfront cost we can greatly save on the rest of the processing. But our results in the previous section show that ExSample is often more effective than the surrogate, without these high up-front costs. Furthermore, these up-front costs have to be paid multiple times if, for example, the user wants to look for a new class of object or process a new dataset. It is unlikely that users will want to look for the same class of object on the same frames of video again, which is all pre-training a surrogate makes more efficient.
Savings on BDD dataset
Figure 4: Results on BDD dataset. The surrogatemodel has a harder time learning on this dataset,with lower accuracy scores. Counterintuitively, thismakes its savings comparable to those of random sam-pling, improving its results compared to Dashcamdatasest to be paid multiple times, if, for example, the user wantsto look for a new class of object or process a new data set.It’s unlikely that users will want to look for the same classof object on the same frames of video again, which is allpre-training a surrogate makes more efficient.
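To make the fixed-cost argument concrete, the following back-of-the-envelope sketch (Python) compares end-to-end query time for a surrogate-based pipeline against a sampling-only method. The combined per-phase throughputs come from Figure 6; the dataset size and labelled fraction are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope cost model: surrogate pipeline (label -> train ->
# score -> sample) vs. a sampling-only method such as ExSample or random+.

DATASET_FRAMES = 3_000_000   # assumed repository size (illustrative)
LABEL_FRACTION = 0.01        # assumed fraction labelled for training

LABEL_FPS = 17    # expensive detector, sequential I/O (Figure 6)
SCORE_FPS = 100   # surrogate scan, bounded by I/O + decode (Figure 6)
SAMPLE_FPS = 11   # expensive detector, random I/O (Figure 6)

def surrogate_query_seconds(frames_sampled: int) -> float:
    """Time including the fixed up-front phases (training time omitted,
    since Figure 6 suggests it is unlikely to be the bottleneck)."""
    label_time = DATASET_FRAMES * LABEL_FRACTION / LABEL_FPS
    score_time = DATASET_FRAMES / SCORE_FPS  # full-dataset surrogate scan
    return label_time + score_time + frames_sampled / SAMPLE_FPS

def sampling_only_query_seconds(frames_sampled: int) -> float:
    """Sampling-only methods pay no fixed cost before the first result."""
    return frames_sampled / SAMPLE_FPS

for k in (1_000, 10_000, 100_000):
    print(f"{k:>7} frames: surrogate {surrogate_query_seconds(k):>9.0f}s, "
          f"sampling-only {sampling_only_query_seconds(k):>8.0f}s")
```

Under these assumed numbers the surrogate pipeline carries roughly 30,000 seconds of fixed labelling and scanning cost before its cheaper per-frame ordering can begin to pay off, which matches the qualitative picture in Figure 3.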
5. RELATED WORK
Several approaches have recently proposed optimizations to address the cost of mining video data. A common idea in these approaches is to use cascaded classifiers, such as the Viola-Jones cascade detector [19], which enables real-time object detection in video streams with a cascade framework that considers additional Haar-like features in successive cascade layers. Lu et al. propose applying cascaded classifiers to efficiently process video queries containing probabilistic predicates that specify desired precision and recall levels [12]. They employ SVM, KDE, and deep neural network classifiers that take as input features from dimension-reduction approaches such as principal component analysis and feature hashing, to efficiently skip processing of video frames that the classifiers are confident do not contain objects relevant to the query. NoScope [9] trains specialized approximations to expensive CNNs while maintaining accuracy levels within a user-defined window. However, these approaches do not generalize well to diverse types of video data. For example, they require a costly training process to evaluate classifier accuracy (and, in some cases, to construct the classifiers), which may differ from video to video. Similarly, NoScope uses a difference detector specifically designed for video from static cameras, which is ineffective on video captured in mobile settings, e.g. by dashboard cameras. Moreover, for datasets where cascaded classifiers perform well, our approach is complementary, as the classifiers can be applied over sampled frames to obtain additional speedup (a minimal cascade sketch appears at the end of this section).

Figure 5 ("Surrogate model AP vs random AP"): The surrogate models trained for each query have different effectiveness. Here we compare their Average Precision to that of a randomly assigned score (plotted as surrogate AP / random AP, per object class, for the BDD and Dashcam datasets). For the BDD dataset, the surrogate model does only slightly better than random in precision, which causes it to perform similarly to random when measured in savings. For the Dashcam dataset, the surrogate models are more accurate than random. Counterintuitively, higher accuracy in scores hurts savings in Figure 3 due to the greediness of the algorithm.

Unlike cascaded classifier methods, BlazeIt [8] proposes training specialized models to evaluate specific query clauses, and applies random sampling when specialized models are not available. BlazeIt also proposes a declarative, SQL-like language for querying objects with several constraint types. As with other methods, our approach complements BlazeIt: substituting our sampling algorithm for random sampling may yield substantial performance gains. It is possible that the methods proposed here could be integrated into systems such as BlazeIt.

Other techniques focus on improving neural network inference speed. Deep Compression [3] prunes connections in the network that have a negligible impact on the inference result. ShrinkNets [10] proposes a dynamic network resizing scheme that extends the pruning process to neural network training, reducing not only inference time but training time as well. Again, these techniques can be combined with ExSample to further improve query processing speed.
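To make the cascade idea from this section concrete, here is a minimal two-stage sketch in Python. The function names, the threshold, and the cheap/expensive split are illustrative assumptions rather than any specific system's API.

```python
# Minimal two-stage cascade: a cheap per-frame score gates the expensive
# detector, so frames the cheap stage confidently rules out are skipped.
# `cheap_score` stands in for, e.g., a tiny CNN or difference detector;
# `expensive_detector` for a full object detector.

def cascade_filter(frames, cheap_score, expensive_detector, threshold=0.1):
    results = []
    for frame in frames:
        if cheap_score(frame) < threshold:
            continue  # cheap stage is confident the frame is irrelevant
        detections = expensive_detector(frame)  # runs only on survivors
        if detections:
            results.append((frame, detections))
    return results
```

As noted above, such a filter is complementary to sampling: it can be applied to the frames a sampler chooses, rather than to every frame in the repository.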
6. CONCLUSION
Over the next decade, workloads that process video to extract useful information may become a standard data mining task for analysts in application areas such as government, real estate, and autonomous vehicles. Such pipelines present a new systems challenge due to the cost of applying state-of-the-art machine vision techniques.

Figure 6: Throughput of each processing phase in our implementation of BlazeIt. The yellow boxes show the overall throughput reached by each processing phase in our implementation. Additionally, to distinguish the bottlenecks of the object detector from those of video decoding, we show the maximum throughput achievable by I/O and video decoding in purple, and by inference in cyan. Inference with surrogate models is marked to distinguish it from inference with the expensive model. For the scoring phase, which accounts for the bulk of the overhead in BlazeIt, scanning through the dataset bounds the throughput to 100 fps on our dataset. Although labelling throughput is low, labelling only happens on a fraction of the dataset, so it represents a small fraction of the overall runtime. (Recoverable values, in fps: labelling: io+decode 107 (seq. io), inference 21, combined 17; scoring: io+decode 107 (seq. io), surrogate inference 1805, combined 100; sampling: io+decode 24 (random io), inference 21, combined 11.)

In this paper we introduced ExSample, an approach for processing instance-finding queries on large video repositories through chunk-based adaptive sampling. Specifically, the aim of the approach is to find frames of video that contain instances of objects of interest without running an object detection algorithm on every frame, which could be prohibitively expensive. Instead, in ExSample, we sample frames and run the detector on just the sampled frames, tuning the sampling process based on whether a new instance of an object of interest is found in the sampled frames. To do this tuning, ExSample partitions the data into chunks and dynamically adjusts the frequency with which it samples from different chunks, based on the rate at which new instances are sampled from each chunk. We formulate this sampling process as an instance of Thompson sampling, using a Good-Turing estimator to compute the likelihood of finding a new object instance in each video chunk. In this way, as new instances in a particular chunk are exhausted, ExSample naturally refocuses its sampling on other, less frequently sampled chunks. A minimal sketch of this loop appears below.

Our evaluation of ExSample on a real-world dataset of dashcam data shows that it substantially improves on both the number of frames it samples and the total runtime relative to both random sampling and methods based on lightweight "surrogate" models, such as BlazeIt [8], that are designed to estimate which frames are likely to contain objects of interest with lower overhead. In particular, these surrogate-based methods are much slower because they require running the surrogate model on all frames.
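To summarize the method operationally, the sketch below (Python) implements the chunk-based sampling loop described above. The Beta-posterior form of the Thompson draw and the helper functions run_detector and match_instance are our illustrative assumptions based on this description, not the paper's exact implementation.

```python
import random
from collections import defaultdict

def exsample(chunks, run_detector, match_instance, budget, goal):
    """Adaptive sampling over video chunks, sketching the loop above.
    chunks: list of frame collections; run_detector: frame -> detections;
    match_instance: detection -> instance id (dedupes across frames)."""
    M = len(chunks)
    n = [0] * M                # frames sampled per chunk
    n1 = [0] * M               # instances seen exactly once per chunk
    per_chunk = [defaultdict(int) for _ in range(M)]
    found = set()

    while sum(n) < budget and len(found) < goal:
        # Thompson step: draw each chunk's new-instance rate from an
        # assumed Beta posterior around the Good-Turing estimate N1/n,
        # then sample a random frame from the highest-drawing chunk.
        j = max(range(M),
                key=lambda k: random.betavariate(n1[k] + 1, n[k] - n1[k] + 1))
        frame = random.choice(chunks[j])
        n[j] += 1
        for det in run_detector(frame):
            inst = match_instance(det)
            found.add(inst)
            per_chunk[j][inst] += 1
            if per_chunk[j][inst] == 1:
                n1[j] += 1     # newly seen exactly once in this chunk
            elif per_chunk[j][inst] == 2:
                n1[j] -= 1     # no longer seen exactly once
    return found
```

As a chunk's new instances are exhausted, its N1/n statistic falls and its Beta draws shrink, so sampling naturally shifts toward less-explored chunks.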
APPENDIX

A. NOTATION

$n$: number of frames sampled so far. We treat "frames sampled" and "frames processed" as synonyms.
$N$: number of distinct results in the data. We treat the terms result and instance as synonyms.
$i$: index variable over results, $i \in [1, N]$.
$N_1(n)$: number of results seen exactly once up to the $n$-th sampled frame. We omit $n$ when it is clear from context.
$\mathrm{seen}(n)$: set of results $i$ seen after $n$ frames have been processed.
$p_i$: probability of seeing result $i$ in a randomly drawn frame; it is proportional to the result's duration in the video. We treat duration and probability as synonyms.
$R(n+1)$: number of new results we expect to find in the next sampled frame: $R(n+1) = \sum_i [i \notin \mathrm{seen}(n)] \cdot p_i$.
$\mu_p$: mean duration, $\sum_i p_i / N$.
$\sigma_p$: standard deviation of the durations $p_i$.
$\pi_i(n)$: $p_i (1 - p_i)^{n-1}$, the chance that result $i$ appears first at the $n$-th sampled frame. We may leave the $n$ implicit.
$M$: number of chunks.
$j$: index variable over chunks, $j \in [1, M]$.
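As a sanity check on the estimator implied by this notation, that $N_1(n)/n$ tracks $R(n+1)$, the following Monte Carlo sketch (Python) compares the two quantities; the skewed duration distribution is an arbitrary assumption chosen for illustration.

```python
import random

# Compare the Good-Turing statistic N1(n)/n against the true expected
# number of new results R(n+1) = sum of p_i over unseen results i.
# The Pareto-distributed durations are an arbitrary assumption.

random.seed(0)
N = 200
weights = [random.paretovariate(1.5) for _ in range(N)]
total = sum(weights)
p = [w / total for w in weights]   # p_i: chance result i is in a frame

def trial(n):
    counts = [0] * N
    for _ in range(n):
        for i in range(N):         # each frame reveals result i w.p. p_i
            if random.random() < p[i]:
                counts[i] += 1
    n1_over_n = sum(1 for c in counts if c == 1) / n
    r_next = sum(p[i] for i in range(N) if counts[i] == 0)
    return n1_over_n, r_next

for n in (50, 200, 800):
    est, truth = trial(n)
    print(f"n={n:>4}: N1/n={est:.4f}, R(n+1)={truth:.4f}")
```

Consistent with the error analysis below (Equation 13 with $q_i = 1$), $N_1(n)/n$ overestimates $R(n+1)$ in expectation by $\sum_i p_i^2 (1 - p_i)^{n-1}$, a gap that shrinks as $n$ grows.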
B. OBJECTS SPANNING MULTIPLE CHUNKS

Here we prove Equation 7 is also valid when different chunks may share instances. Assume we have sampled $n_1$ frames from chunk 1, $n_2$ from chunk 2, and so on, and that instance $i$ can appear in multiple chunks: with probability $p_{i1}$ of being seen in a frame sampled from chunk 1, $p_{i2}$ in a frame sampled from chunk 2, and so on. We will work with chunk 1, without loss of generality. The expected number of new instances if we sample once more from chunk 1 is:

$$R_1(n_1 + 1) = \sum_{i=1}^{N} \left[ p_{i1} (1 - p_{i1})^{n_1} \prod_{j=2}^{M} (1 - p_{ij})^{n_j} \right] \quad (11)$$

Similarly, the expected number of instances seen exactly once in chunk 1, and in no other chunk up to this point, is:

$$N_1 = \sum_{i=1}^{N} \left[ n_1 \, p_{i1} (1 - p_{i1})^{n_1 - 1} \prod_{j=2}^{M} (1 - p_{ij})^{n_j} \right] \quad (12)$$

In both equations, the factor $\prod_{j=2}^{M} (1 - p_{ij})^{n_j}$ accounts for the requirement that instance $i$ has not been seen while sampling chunks 2 through $M$. We abbreviate this factor as $q_i$. When instances only show up in one chunk, $q_i = 1$, and everything is the same as in Equation 1. Dividing Equation 12 by $n_1$ and subtracting Equation 11, each term shares the common factor $p_{i1} (1 - p_{i1})^{n_1 - 1} q_i$, scaled by $1 - (1 - p_{i1}) = p_{i1}$, so the expected error is:

$$N_1(n_1)/n_1 - R_1(n_1 + 1) = \sum_{i=1}^{N} \left[ p_{i1}^2 (1 - p_{i1})^{n_1 - 1} q_i \right] \quad (13)$$

which again is term-by-term smaller than $N_1(n_1)/n_1$ by a factor of $p_{i1}$.

REFERENCES

[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. CoRR, abs/1602.00763, 2016.
[2] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
[3] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.
[4] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[7] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In International Conference on Pattern Recognition, pages 2756–2759. IEEE, 2010.
[8] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Fast exploratory video queries using neural networks. CoRR, abs/1805.01046, 2018.
[9] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 10(11):1586–1597, Aug. 2017.
[10] G. Leclerc, R. C. Fernandez, and S. Madden. Learning network size while training with ShrinkNets. In Conference on Systems and Machine Learning, 2018.
[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[12] Y. Lu, A. Chowdhery, S. Kandula, and S. Chaudhuri. Accelerating machine learning inference with probabilistic predicates. In ACM SIGMOD, June 2018.
[13] T. Murray. Help improve imagery in your area with our new camera lending program. , 2018.
[14] Nexar. Nexar. , 2018.
[15] A. Poms, W. Crichton, P. Hanrahan, and K. Fatahalian. Scanner: Efficient video analysis at scale. ACM Trans. Graph., 37(4):138:1–138:13, July 2018.
[16] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[17] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[18] D. Russo, B. V. Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson sampling. CoRR, abs/1707.02038, 2017.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, 2001.
[20] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[21] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.