ExSample: Efficient Searches on Video Repositories through Adaptive Sampling
Oscar Moll (MIT CSAIL), Favyen Bastani (MIT CSAIL), Sam Madden (MIT CSAIL), Mike Stonebraker (MIT CSAIL), Vijay Gadepally (MIT Lincoln Laboratory), Tim Kraska (MIT CSAIL)
ABSTRACT
Capturing and processing video is increasingly common as cameras and networks improve and become cheaper. At the same time, algorithms for rich scene understanding and object detection have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art vision algorithms to interpret scenes and recognize objects are slow and expensive, our ability to process even simple ad-hoc selection queries ('find 100 example traffic lights in dashboard camera video') over this accumulated data lags far behind our ability to collect it. Sampling image frames from the videos is a reasonable default strategy for these types of queries; however, the ideal sampling rate is both data and query dependent. We introduce ExSample, a low-cost framework for ad-hoc, unindexed video search which quickly processes selection queries by adapting the amount and location of sampled frames to the data and the query being processed. ExSample prioritizes which frames within a video repository are processed in order to quickly identify portions of the video that contain objects of interest. ExSample continually re-prioritizes which regions of video to sample from based on feedback from previous samples. On large, real-world video datasets ExSample reduces processing time by up to 4x over an efficient random sampling baseline and by several orders of magnitude versus state-of-the-art methods which train specialized models for each query. ExSample is a key component in building cost-efficient video data management systems.
1. INTRODUCTION
Video cameras have become incredibly affordable over the last decade, and are ubiquitously deployed in static and mobile settings, such as smartphones, vehicles, surveillance cameras, and drones. These video datasets are enabling a new generation of applications. For example, video data from vehicle dashboard-mounted cameras (dashcams) is used to train object detection and tracking models for autonomous driving systems [21], to annotate map datasets like OpenStreetMap with locations of traffic lights, stop signs, and other infrastructure [13], and to analyze the scenes of collisions from dashcam footage to automate insurance claims processing [14].

However, these applications must process large amounts of video to extract useful information. Consider the basic task of finding examples of traffic lights (to, for example, annotate a map) within a large collection of dashcam video collected from many vehicles. The most basic approach to evaluate this query is to run an object detector frame by frame over the dataset. Because state-of-the-art object detectors run at about 10 frames per second (fps) on state-of-the-art GPUs, one third of the typical video recording rate of 30 fps, scanning through a collection of 1000 hours of video with a detector on a GPU would take 3x that time: 3000 GPU hours. In the offline query case, which is the case we focus on in this paper, we can parallelize our scan over the video across many GPUs, but, at typical GPU rental prices [?], our bill for this one ad-hoc query would be $10K regardless of parallelism. Hence, this workload presents challenges in both time and monetary cost. Note that accumulating 1000 hours of video represents just 10 cameras recording for less than a week.

A practical means of coping with this issue is to skip frames: for example, only run object detection on one frame for every second of video. After all, we might think it reasonable to assume all traffic lights are visible for longer than that, and the savings are large compared to inspecting every frame: processing only one frame per second decreases costs by 30x for a video recorded at 30 fps. Unfortunately, this strategy has limitations. First, the 1 frame out of 30 that we look at may not show the light clearly, causing the detector to miss it completely, while neighboring frames may show it more clearly. Second, lights that remain visible in the video for a long time, say 30 seconds, would be seen multiple times unnecessarily; worse, for other types of objects that remain visible for shorter times the appropriate sampling rate is unknown, and will vary across datasets depending on factors such as whether the camera is moving or static, or the angle and distance to the object.

In this paper we introduce ExSample, a video sampling technique designed to reduce the number of frames that need to be processed by an expensive object detector for search queries on video datasets. ExSample frames this problem as one of deciding which frame from the dataset to look at next based on what it has seen in the past. ExSample starts by conceptually splitting the dataset into temporal chunks (e.g., half-hour chunks), and frames the problem as deciding which chunk to sample from next. As it does this, ExSample keeps a per-chunk estimate of the probability of finding a new result if the next frame we process through the object detector were sampled randomly from that chunk. As it samples more frames, ExSample's estimates become more accurate.

Recent related work, such as probabilistic predicates [12], NoScope [9], and BlazeIt [8], partially overlaps with ExSample in its aim of reducing the cost of processing a variety of queries over video. At a high level, these systems train cheaper surrogate models for each query which they use to approximate the behavior of the object detector. They then prioritize which frames to actually inspect based on how the surrogate scores them. This general approach can yield large savings in specific scenarios.

However, approaches relying on training cheap surrogates have two important shortcomings in the context of ad-hoc object queries, especially when the number of desired results is limited. The first has to do with the extra work needed: for highly selective queries, seeking objects that appear rarely in video requires building and labelling a training set ahead of time, which can be as hard as solving the search problem in the first place. Conversely, for common objects that appear frequently throughout the dataset, the surrogate models introduce an additional inference cost that outweighs the limited savings they provide. Finally, when users only need results up to a fixed number, such as in a limit query or when building a training set, surrogate-based approaches still require an upfront dataset scan in order to score the video frames in the dataset, which can be more expensive than simply sampling frames randomly.
Unlike existing work, ExSample imposes no preprocessing overhead. The second shortcoming lies in the user's need to avoid near-duplicate results in search queries over video. ExSample is designed to give higher weight to areas of video likely to have results which are both new and different, rather than areas where the object detector would simply score high. A key challenge here is that, to be general, ExSample makes no assumptions about how long objects remain visible on screen or how often they appear. Instead, ExSample is guided by feedback from the outputs of the object detector on previous samples.

Our contributions are: 1) ExSample, an adaptive sampling algorithm to facilitate ad-hoc searches over video repositories; 2) a formal justification of ExSample's design; and 3) an empirical evaluation showing ExSample is effective on real datasets and under real system constraints, and outperforms existing approaches to the search problem.

We evaluate ExSample on a variety of search queries spanning different objects, different kinds of video, and different numbers of desired results. We show savings in the number of frames processed ranging from 1.1x to 4x, with a geometric average of 2x across all settings, in comparison to efficient random sampling. Additionally, in comparison to a surrogate-model based approach inspired by BlazeIt [8], our method processes fewer frames to find the same number of distinct results in many cases, and in the remaining cases ExSample still requires one to two orders of magnitude less clock time, because ExSample does not require an upfront preprocessing phase, and so avoids the preprocessing costs of surrogate-based approaches.
2. BACKGROUND
In this section we review object detection, introduce distinct object queries as opposed to plain object queries, explain our main cost metric (frames processed by the object detector), and justify our main baseline: random sampling.
An object detector is an algorithm that operates on still images, taking an image frame as input and outputting a set of boxes within the image containing the objects of interest. The number of objects found can range from zero to arbitrarily many. Well-known examples of object detectors include YOLO [16] and Mask R-CNN [4]. In Figure 1, we show two example frames that have been processed by an object detector, with the traffic light detections surrounded by boxes.

Object detectors with state-of-the-art accuracy on benchmarks such as COCO [11] typically process around 10 frames per second on modern hardware, though it is possible to achieve real-time rates by sacrificing accuracy [6, 16].

In this paper we do not seek to improve on state-of-the-art object detection approaches. Instead, we treat object detectors as a black box with a costly runtime, and aim to substantially reduce the number of video frames processed by the detector.
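Because the detector is treated as a black box, the interface ExSample relies on is small enough to state in a few lines. The following Python sketch is our own illustration; the names (Detection, detect) are assumptions, not the API of any particular library:

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Detection:
    # box corners in pixel coordinates
    x1: float
    y1: float
    x2: float
    y2: float
    label: str    # e.g. "traffic light"
    score: float  # detector confidence

def detect(rgb_frame: Any) -> List[Detection]:
    """Black-box object detector: one still image in, zero or more boxes out.
    In practice this wraps an expensive model such as Faster R-CNN or YOLO."""
    raise NotImplementedError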
In this paper we are interested in processing higher-level queries on video enabled by the availability of object detectors. In particular, we are concerned with object queries such as "find 20 traffic lights in my dataset" over collections of multiple videos from multiple cameras. In natural video, any object of interest lingers within view over some length of time. For example, the frames in Figure 1 contain the same traffic light a few seconds apart. While either frame would be an acceptable result for our traffic light query, an application such as OpenStreetMap would not benefit from having both frames. We are therefore interested in returning distinct object results, and we refer to this specific variation as a distinct object query. Similarly, for an application such as constructing a training set for a classifier or detector, rich sets of diverse examples are preferable to near duplicates. Note that an application could always find near duplicates if desired by traversing the video backwards and forwards starting from a given result; the more difficult part is finding initial results in the first place.

Figure 1: Two video frames showing the same traffic light instance several seconds apart. A distinct object query counts these two boxes as only one result.

The goal of this paper is to reduce the cost of processing such queries. Moreover, we want to do this on ad-hoc distinct object queries on ad-hoc datasets, where there are diverse videos in our dataset and where object detections have not been computed ahead of time for the type of objects we are looking for. This distinction affects multiple decisions in the paper, including not only the results we return but also the main design of ExSample and how we measure result recall. Related work [8, 9] in video processing from the database community aims to reduce cost as well, but as far as we know no existing work targets this kind of query.
A straightforward method to process a distinct object query is to scan all frames. In the traffic light query, for example, we can process every video in the dataset, sequentially evaluating an object detector on each frame. If one or more lights are detected, they are matched with detections from the previous frame, and only truly new, unmatched ones contribute to the result. If there is a limit clause in our query, such as the limit of 20 in our example query 'find 20 traffic lights in our dataset', we can stop scanning as soon as we accumulate 20 results.

Off-the-shelf object detectors act on a frame-by-frame basis, and have no notion of time, memory, or deduplication. So, we assume that in addition to specifying an object detector, the application specifies a discriminator function that decides which detections are new instances and which correspond to previous instances of an already processed detection. This is how we define the full set of unique instances in the video and how we generate the ground truth. For example, the discriminator function may apply an object tracking algorithm like Median Flow [7] or SORT [1] that computes the position of an object over a sequence of frames; then, by tracking new instances backwards and forwards through the video around the sampled frame in which they were first detected, we can determine whether future instances correspond to prior ones by comparing them against previously computed tracks. We expand on this in our evaluation section. Alternatively, the application could use simple heuristics, such as declaring that two detections of traffic lights occurring within 30 seconds of each other must be the same light; a sketch of such a discriminator appears at the end of this section.

As explained in the introduction, an alternative to simple scanning is to process frames sequentially but skip forward 1 second at every step, effectively only processing one frame per second of video. A better strategy still is picking frames from the dataset uniformly at random. By better, we mean that the time to find the number of results requested in the query is smaller when inspecting frames at random than for a strategy that skips 1 second forward at a time. The main reason is that random sampling explores more areas of the data more quickly: moving sequentially means that future samples stay close to our past samples. If we just found a result, frames one second apart are less likely to contain novel results; likewise, if there were no results, frames one second apart are also less likely to have any. Moreover, a fixed skip of one second may miss some results completely, while random sampling will eventually come back to nearby frames.
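To make the discriminator concrete, here is a minimal sketch of the simple-heuristic variant: a detection is declared new unless it overlaps an earlier detection of the same class within a fixed time window. This is our own illustration (thresholds and names are assumptions), not the tracker-based matcher used in our evaluation:

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class SimpleDiscriminator:
    """A stand-in for trackers such as SORT or Median Flow: a detection is a
    repeat if a same-class detection overlapped it within a time window."""
    def __init__(self, iou_thresh=0.5, window_sec=30.0):
        self.seen = []  # (timestamp, label, box) of previously kept results
        self.iou_thresh = iou_thresh
        self.window_sec = window_sec

    def is_new(self, t, label, box):
        for (t0, l0, b0) in self.seen:
            if l0 == label and abs(t - t0) < self.window_sec \
               and iou(box, b0) > self.iou_thresh:
                return False
        self.seen.append((t, label, box))
        return True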
3. ExSample
In this section we explain how our approach, ExSample, minimizes the time needed to find results. Due to the high compute demands of object detection models, even when run on GPUs, runtime and compute cost when using ExSample are a function of the number of frames sampled and then processed by the object detector. When discussing ExSample we use the terms frames sampled and frames processed interchangeably, because every sampled frame is also a processed frame.

In order to minimize the number of frames processed, ExSample works at a high level by estimating which files and temporal segments within a video are more likely to yield new results, and sampling frames more often from those regions. Importantly, because its goal is to find distinct results, ExSample accounts for both result abundance and variability, rather than purely for raw hits. This is implemented by conceptually splitting each file into chunks (for example, half-hour or shorter chunks), and scoring each chunk separately based on what the object detector returns on previously sampled frames from that region. ExSample scores chunks based not just on past hits but also on past repetitions: the more repeated results there are, the lower the score, regardless of past hits. This scoring system allows ExSample to allocate resources to more promising areas initially while also diversifying, over time, where it looks next. In our evaluation, we show this technique helps ExSample outperform purely greedy strategies even when they are equipped with ad-hoc heuristics to avoid duplicates.

To make it practical, ExSample is composed of two core parts: an estimate of future results per chunk, explained in subsection 3.1, and a robust mechanism to translate these estimates into a decision while accounting for estimate errors, explained in subsection 3.3. In those two sections we focus on quantifying the types of error. Later, in Algorithm 1, we explain the algorithm step by step.
In this section we derive our estimate for the future value of sampling a chunk. To facilitate keeping up with the notation introduced in the following sections, we restate all definitions in Appendix A.

In order to make the optimal decision of which chunk to sample next, ExSample estimates $R(n+1)$ for each chunk, which represents the number of new results we expect to find on the $(n+1)$th sample. By new, we mean $R$ does not count results already found in the previous $n$ samples, even if they also appear in the $(n+1)$th frame. The chunk with the largest $R(n+1)$ is a good location to sample next.

In our traffic light example, the chance of finding a traffic light by random sampling intuitively depends on the number of traffic lights in the data, which we will call $N$, and on the fraction of video frames each light is visible for, which we will call $p_i$; the chance of finding a new traffic light depends additionally on how many samples we have drawn already, which we will call $n$. Note that $p_i$ varies from light to light: for example, in video collected from a moving vehicle that stops at a red light, red lights will tend to have large $p_i$, lasting on the order of minutes, while green and yellow lights are likely to have much smaller $p_i$, perhaps on the order of a few seconds.

Technically, $R(n+1) = \sum_{i=1}^{N} [i \notin \mathrm{seen}(n)] \cdot p_i$, where $\mathrm{seen}(n)$ is the set of results we have already seen in previous frames. $R(n+1)$ is a single number fully determined by $\mathrm{seen}(n)$, but it is random if we only know $n$. Both the total number of instances $N$ and their durations $p_i$ are unknown to us unless we have scanned and processed the whole dataset, so our ability to estimate $R(n+1)$ may seem hopeless. Fortunately, there are tools to estimate $R(n+1)$ which do not require us to first estimate either $N$ or the individual $p_i$. The estimate instead relies on counting the number of results seen exactly once so far, which we will represent with $N_1(n)$. $N_1(n)$ is a quantity we can observe and track as we sample frames. The estimate is:

$$R(n+1) \approx N_1(n)/n \quad (1)$$

The formula $N_1(n)/n$ appears in other contexts as the Good-Turing estimator [2], but $N_1(n)$ has a different meaning in those contexts. In our video search application we sample frames, not symbols, at random, and a single frame sample can indirectly sample arbitrarily many objects, or no object at all. This means that in our application, $N$ and $N_1(n)$ could range from 0 to far larger than $n$ itself, for example in a crowded scene. In the typical setting (such as the one explained in the original paper [2]) there is exactly one instance per sample and the only question is whether it is new; in that situation $N_1(n)$ will always be smaller than $n$.

In the remainder of this section we show that the estimate $N_1(n)/n$ applies in our problem setting as well, and in particular that we can bound its relative bias using high-level properties of the data and the query: the number of result instances $N$, the average duration of a result $\mu_p = \sum p_i / N$, and the standard deviation of the durations $\sigma_p = \sqrt{\sum (p_i - \mu_p)^2 / N}$. Note that the error we discuss in this section is the bias of our estimate $\mathrm{E}[N_1(n)]/n$, an error that will occur even in the absence of randomness. In a later section we deal with errors that arise from randomness in the sample.

In particular we will focus on the relative error:

$$\text{rel. err} = \frac{\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]}{\mathrm{E}[N_1(n)]/n}$$

The following inequalities bound the relative bias error of our estimate:
Theorem (Bias).

$$0 \le \text{rel. err} \quad (2)$$
$$\text{rel. err} \le \max_i p_i \quad (3)$$
$$\text{rel. err} \le \sqrt{N}\,(\mu_p + \sigma_p) \quad (4)$$

Intuitively, Equation 2 tells us that $N_1(n)/n$ tends to overestimate, Equation 3 that the size of the overestimate is guaranteed to be less than the largest probability, which is likely small, and Equation 4 that even if a few $p_i$ outliers were large, as long as the durations within one $\sigma_p$ of the average $\mu_p$ are small the error will remain small. A large $N$ or a large $\mu_p$ or $\sigma_p$ may seem problematic for Equation 4, but we note that a large number of results $N$ or a long average duration $\mu_p$ implies many results will be found after only a few samples, so the end goal of the search problem is easy in the first place and having guarantees of accurate estimates is less important.

In a later experiment with skewed data and many instances we show the estimate works well, and the real data in our evaluation has natural skew, for which we obtain consistently good experimental results.
Proof. We now prove Equation 2, Equation 3, and Equation 4.

The chance that object $i$ is seen on the $(n+1)$th try after being missed on the first $n$ tries is $p_i (1-p_i)^n$. For the rest of the paper, we will name this quantity $\pi_i(n+1)$. By linearity of expectation,

$$\mathrm{E}[R(n+1)] = \sum_{i=1}^{N} \pi_i(n+1)$$

For the rest of this proof, we will avoid explicitly writing summation indexes $i$, since our summations are always over the result instances, from $i = 1$ to $N$.

$\mathrm{E}[N_1(n)]$ can also be expressed directly. The chance of having seen instance $i$ exactly once after $n$ samples is $n p_i (1-p_i)^{n-1}$, with the extra factor $n$ coming from the possibility of the instance having shown up at any of the first $n$ samples. We can rewrite this as $n \pi_i(n)$. So $\mathrm{E}[N_1(n)] = n \sum \pi_i(n)$, giving:

$$\mathrm{E}[N_1(n)]/n = \sum \pi_i(n)$$

Now we focus on the numerator $\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]$:

$$\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)] = \sum \pi_i(n) - \pi_i(n+1) = \sum p_i^2 (1-p_i)^{n-1} = \sum p_i \, \pi_i(n)$$

We can see here that each term in the error is positive, hence we always overestimate, which proves Equation 2. Now we want to bound this overestimate. Intuitively we know the overestimate is small because each term is a scaled-down version of the terms in $\mathrm{E}[N_1(n)]/n = \sum \pi_i(n)$; more precisely:

$$\sum p_i \pi_i(n) \le (\max_i p_i) \cdot \sum \pi_i(n) = (\max_i p_i) \cdot \mathrm{E}[N_1(n)]/n \quad (5)$$

For example, if all the $p_i$ are less than 1%, then the overestimate is also less than 1% of $N_1(n)/n$. However, if there may be a few outliers with large $p_i$ which are not representative of the rest, unavoidable in real data, then we would like to know our estimate will still be useful. We can show this is still true as follows:

$$\sum p_i \pi_i(n) \le \sqrt{\left(\sum p_i^2\right)\left(\sum \pi_i^2\right)} \qquad \text{(Cauchy-Schwarz)}$$
$$\le \sqrt{\left(\sum p_i^2\right)\left(\sum \pi_i\right)^2} \qquad (\pi_i \text{ are positive})$$
$$= \sqrt{\sum p_i^2}\; \sum \pi_i = \sqrt{\sum p_i^2}\;\; \mathrm{E}[N_1(n)]/n = \sqrt{N}\, \sqrt{\sum p_i^2 / N}\;\; \mathrm{E}[N_1(n)]/n$$

$\sum p_i^2 / N$ is the second moment of the $p_i$, and can be rewritten as $\sigma_p^2 + \mu_p^2$, where $\mu_p$ and $\sigma_p$ are the mean and standard deviation of the underlying result durations:

$$= \sqrt{N}\, \sqrt{\sigma_p^2 + \mu_p^2}\;\; \mathrm{E}[N_1(n)]/n \le \sqrt{N}\, (\sigma_p + \mu_p)\; \mathrm{E}[N_1(n)]/n \quad (6)$$

Putting together Equation 5 and Equation 6 we get:

$$\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)] \le \begin{cases} (\max_i p_i)\; \mathrm{E}[N_1(n)]/n \\ \sqrt{N}(\sigma_p + \mu_p)\; \mathrm{E}[N_1(n)]/n \end{cases}$$

And dividing both sides by $\mathrm{E}[N_1(n)]/n$, we get:

$$\text{rel. err} = \frac{\mathrm{E}[N_1(n)]/n - \mathrm{E}[R(n+1)]}{\mathrm{E}[N_1(n)]/n} \le \begin{cases} \max_i p_i \\ \sqrt{N}(\sigma_p + \mu_p) \end{cases}$$

which justifies the remaining two bounds, Equation 3 and Equation 4. ∎
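The bias bound is easy to probe numerically. The sketch below (our own illustration, not the paper's artifact) draws frames from a synthetic set of instances and compares the estimate $N_1(n)/n$ against the directly computed $\mathrm{E}[R(n+1)]$:

import numpy as np

rng = np.random.default_rng(0)
p = rng.lognormal(mean=-9.0, sigma=1.0, size=200)  # per-instance visibility fractions
counts = np.zeros(len(p), dtype=int)               # times each instance was seen

for n in range(1, 5001):
    counts += rng.random(len(p)) < p   # instance i appears with probability p_i
    if n % 1000 == 0:
        n1 = int(np.sum(counts == 1))
        true_r = p[counts == 0].sum()  # E[R(n+1)]: mass of still-unseen instances
        print(f"n={n:5d}  N1/n={n1 / n:.6f}  E[R(n+1)]={true_r:.6f}")

As the theorem predicts, the printed estimate slightly overestimates the true value, with a relative error far below $\max_i p_i$ in typical runs.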
In principle, if we knew $R_j(n_j+1)$ for every chunk $j$, the algorithm could simply take the next sample from the chunk with the largest estimate. However, if we simply use the raw estimate

$$R_j(n_j+1) \approx N_{1j}(n_j)/n_j \quad (7)$$

and pick the chunk with the largest estimate, then we run into two potential problems. The first is that each estimate may be off due to the randomness of the sample, especially for chunks with smaller $n_j$; we address this problem in the next section. The second potential problem is the issue of instances spanning multiple chunks, which we address afterwards. In reality we assign scores to multiple chunks as we sample them, and each chunk will have been sampled a different number of times $n_j$ and will have its own $N_{1j}(n_j)$. In fact, we want different chunks to be sampled a very different number of times, because that is what ExSample must do to outperform random sampling.

We have shown how to estimate which chunk is promising, but for this to be practical we still need to handle the problem that our observed $N_1(n)$ will fluctuate randomly due to randomness in our sampling. This is especially true early in the sampling process, when only a few samples have been collected but we need to make a sampling decision. Because the quality of the estimates themselves is tied to the number of samples we have taken, and we do not want to stop sampling a chunk due to a small amount of bad luck early on, it is important that we estimate how noisy our estimate is. The usual way to do this is by estimating the variance of our estimator: $\mathrm{Var}[N_1(n)/n]$. Once we have a good idea of how this variance error depends on the number of samples taken, we can make informed decisions about which chunk to pick next, balancing both the raw score and the number of samples it is based on.

Theorem (Variance). If instances occur independently of each other, then
$$\mathrm{Var}[N_1(n)/n] \le \mathrm{E}[N_1(n)]/n^2 \quad (8)$$

Note that this bound also implies that the variance error is more than a factor of $1/n$ smaller than the value of the raw estimate, because we can rewrite it as $(\mathrm{E}[N_1(n)]/n)/n$. Note, however, that the independence assumption is necessary for proving this bound. While in reality different results may not occur and co-occur truly independently, our experimental results in the evaluation show our estimate works well enough in practice.
Proof. We estimate the variance of $N_1(n)$ assuming independence of the different instances. We can express $N_1(n)$ as a sum of binary indicator variables $X_i$, which are 1 if instance $i$ has shown up exactly once: $N_1(n) = \sum_i X_i$, where $X_i = 1$ with probability $n\pi_i(n) = n p_i (1-p_i)^{n-1}$. Because of our independence assumption, $\mathrm{Var}[N_1(n)] = \sum_i \mathrm{Var}[X_i]$. Because $X_i$ is a Bernoulli random variable, its variance is $n\pi_i(n)(1 - n\pi_i(n))$, which is bounded by $n\pi_i(n)$ itself. Therefore $\mathrm{Var}[N_1(n)] \le \sum n\pi_i(n)$, and this latter sum we know from before is $\mathrm{E}[N_1(n)]$. Therefore $\mathrm{Var}[N_1(n)/n] \le \mathrm{E}[N_1(n)]/n^2$. ∎

In fact, we can go further and fully characterize the distribution of values $N_1(n)$ takes.

Theorem (Sampling distribution of $N_1(n)$). Assuming the $p_i$ are small or $n$ is large, and assuming independent occurrence of instances, $N_1(n)$ follows a Poisson distribution with parameter $\lambda = \sum_i n\pi_i(n)$.

Proof. We prove this by showing that $N_1(n)$'s moment generating function (MGF) matches that of a Poisson distribution with parameter $\lambda$: $M(t) = \exp\left(\lambda [e^t - 1]\right)$.

As in the proof of Equation 8, we think of $N_1(n)$ as a sum of independent binary random variables $X_i$, one per instance. Each of these variables has moment generating function $M_{X_i}(t) = 1 + n\pi_i(n)(e^t - 1)$. Because $e^x \approx 1 + x$ for small $x$, and $n\pi_i(n)(e^t - 1)$ will be small, we have $1 + n\pi_i(n)(e^t - 1) \approx \exp(n\pi_i(n)(e^t - 1))$. The term $\pi_i(n+1)$ is always eventually small for large $n$ because $\pi_i(n+1) = p_i(1-p_i)^n \le p_i e^{-np_i} \le 1/(en)$. Because the MGF of a sum of independent random variables is the product of the terms' MGFs, we arrive at:

$$M_{N_1(n)}(t) = \prod_i M_{X_i}(t) = \exp\left(\left[\sum_i n\pi_i(n)\right] \left[e^t - 1\right]\right)$$ ∎

Now we can use this information to design a decision-making strategy. The goal is to meaningfully pick between $(N_{1j}, n_j)$ and $(N_{1k}, n_k)$ instead of only between $N_{1j}$ and $N_{1k}$, where the only reasonable answer would be to pick the largest. One way to implement this comparison is to randomize it, which is what Thompson sampling [18] does. Thompson sampling works by modeling unknown parameters such as $R_j$ not with point estimates such as $N_{1j}(n_j)/n_j$ but with a wider distribution over their possible values, whose width depends on our uncertainty. Then, whenever we would have used the point estimate to make a decision, we instead draw a sample from its distribution and use that number, effectively adding noise to our estimate in proportion to how uncertain it is. In our implementation, we choose to model the uncertainty around $R_j(n+1)$ as following a Gamma distribution:

$$R_j(n+1) \sim \Gamma(\alpha = N_{1j},\; \beta = n_j) \quad (9)$$

Although less common than the Normal distribution, the Gamma distribution is shaped much like the Normal when $N_1/n$ is large, but behaves more like a single-tailed distribution when $N_1/n$ is near 0, which is desirable because $N_1/n$ will become very small over time, while we know our hidden $\lambda$ is always non-negative. The Gamma distribution is a common way to model the uncertainty around an unknown (but positive) parameter $\lambda$ of a Poisson distribution whose samples we observe. This choice is especially suitable for our use case, as we have shown that $N_1(n)$ does in fact follow a Poisson distribution. The mean of the Gamma distribution in Equation 9 is $\alpha/\beta = N_{1j}/n_j$, which is by design consistent with Equation 7, and its variance is $\alpha/\beta^2 = N_{1j}/n_j^2$, which is by design consistent with the variance bound of Equation 8.

Finally, the Gamma distribution is not defined when $\alpha$ or $\beta$ are 0, so we need a way to deal with the scenario where $N_1(n) = 0$, which could happen often due to objects being rare or due to having exhausted all results, as well as at initialization, when both $N_1$ and $n$ are 0. We do this by adding small quantities $\alpha_0$ and $\beta_0$ to the two terms, obtaining:

$$R_j(n_j+1) \sim \Gamma(\alpha = N_{1j}(n_j) + \alpha_0,\; \beta = n_j + \beta_0) \quad (10)$$

We used $\alpha_0 = 0.1$ and $\beta_0 = 1$ in practice, though we did not observe a strong dependence on these values.
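In code, the chunk-choice rule of Equation 10 is only a few lines. Below is a minimal sketch (variable names ours); note that numpy's gamma sampler is parameterized by shape and scale, so the rate $\beta$ becomes $1/\beta$:

import numpy as np

def choose_chunk(N1, n, alpha0=0.1, beta0=1.0, rng=np.random.default_rng()):
    """Thompson sampling step: draw R_j ~ Gamma(N1[j] + alpha0, rate=n[j] + beta0)
    for every chunk j and return the index of the largest draw."""
    N1 = np.asarray(N1, dtype=float)
    n = np.asarray(n, dtype=float)
    draws = rng.gamma(N1 + alpha0, 1.0 / (n + beta0))  # scale = 1 / rate
    return int(np.argmax(draws))

# example: chunk 1 has yielded more once-seen results per sample,
# so it wins most (but not all) of the draws
print(choose_chunk(N1=[1, 5, 0], n=[40, 40, 40]))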
The next question is whether these techniques work when applied to skewed data. Here we provide an empirical validation of the estimates from the previous sections, including Equation 7 and Equation 10. The question we are interested in is: given an observed $N_1$ and $n$, what is the true expected $R(n+1)$, and how does it compare to the belief distribution $\Gamma(N_1, n)$ which we propose using?

We ran a series of simulation experiments. We first generate 1000 values $p_1, p_2, \ldots, p_{1000}$ at random to represent 1000 results with different durations. To ensure the duration skew we observe in real data, we use a lognormal distribution to generate the $p_i$, so the generated durations span several orders of magnitude. To decide which of the instances show up in a sampled frame, we simulate tossing 1000 coins independently, each with its own $p_i$; the positive draws give us the subset of instances visible in that frame. We then draw these samples sequentially, tracking the number of frames we have sampled, $n$, and how many instances we have seen exactly once, $N_1$, and we also record $\mathrm{E}[R(n+1)]$: the expected number of new instances in a newly sampled frame. This is possible because in the simulation we know the remaining unseen instances and their hidden probabilities $p_i$, so we can compute $\mathrm{E}[R(n+1)] = \sum_i^N [i \notin \mathrm{seen}(n)] \cdot p_i$ directly. We sequentially sample frames up to $n = 180000$, and repeat the experiment 10K times, obtaining hundreds of millions of tuples of the form $(n, N_1, R(n+1))$ for our fixed set of $p_i$. Using this data, we can answer our original question (given an observed $N_1$ and $n$, what is the true expected $R(n+1)$?) by selecting different observed $n$ and $N_1$ at random, conditioning (filtering) on them, and plotting a histogram of the actual $R(n+1)$ values that occurred across all our simulations. We show these histograms for 10 pairs of $n$ and $N_1$ in Figure 2, alongside our belief distribution.

Figure 2 shows a mix of 3 important scenarios. The first 3 subplots, with small $n$, show belief distributions wider than the observed spread of $R(n+1)$. This is intuitively expected, because early on both the bias and the variance of our estimate are bottlenecked by the number of samples, and not by the inherent uncertainty of $R(n+1)$. As $n$ grows to mid-range values (next 4 plots), we see that the curve fits the histograms very well, and also that the curve keeps shifting left to lower and lower orders of magnitude on the x axis. Here we see that the one-sided nature of the Gamma distribution fits the data better than a bell-shaped curve would. The final 3 subplots show scenarios where $n$ has grown large and $N_1$ is potentially very small, including a case where $N_1 = 0$. In that last subplot, we see the effect of having the extra $\alpha_0$ in Equation 10, which means Thompson sampling will continue producing non-zero values at random and we will eventually correct our estimate when we find a new instance. In 3 of the subplots there is a clear bias to overestimate, though not a large one despite the large skew. This empirical validation was based on simulated data; in our evaluation we show these modeling choices work well in practice on real datasets as well, where our assumption of independence is not guaranteed and where durations may not follow the same distribution law.

If instances can span multiple chunks, for example a traffic light that spans the boundary of two neighboring chunks, Equation 7 is still accurate with the caveat that $N_{1j}(n_j)$ is interpreted as the number of instances seen exactly once globally which were found in chunk $j$. The same object found once in each of two chunks $j$ and $k$ does not contribute to either $N_{1j}$ or $N_{1k}$, even though each chunk has only seen it once. The derivation of this rule is similar to that in the previous section, and is given in Appendix B. In practice, if only a few rare instances span multiple chunks, then results are almost the same and this adjustment does not need to be implemented.

At runtime, the numerator $N_{1j}$ increases in value the first time we find a new result globally, decreases back as soon as we find it again, either in the same chunk or elsewhere, and finding it a third time does not change it anymore. Meanwhile $n_j$ increases upon sampling a frame from that chunk. This natural relative increase and decrease of $N_{1j}$ and $n_j$ with respect to each other allows ExSample to seamlessly shift where it allocates samples over time.
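A scaled-down version of the validation experiment described above can be sketched in a few lines (our own illustration; the real experiment uses $n$ up to 180000 and 10K repetitions):

import numpy as np

rng = np.random.default_rng(1)
p = rng.lognormal(mean=-10.0, sigma=1.5, size=1000)  # skewed instance durations

def run_once(n_samples):
    """Sample n_samples frames; return observed N1 and the true E[R(n+1)]."""
    counts = np.zeros(len(p), dtype=int)
    for _ in range(n_samples):
        counts += rng.random(len(p)) < p
    return int(np.sum(counts == 1)), p[counts == 0].sum()

n = 2000
for _ in range(5):
    n1, true_r = run_once(n)
    # Gamma(N1 + 0.1, rate = n + 1) belief: mean a/b, standard deviation sqrt(a)/b
    mean = (n1 + 0.1) / (n + 1)
    sd = (n1 + 0.1) ** 0.5 / (n + 1)
    print(f"N1={n1:3d}  E[R(n+1)]={true_r:.5f}  belief {mean:.5f} +/- {sd:.5f}")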
Meanwhile n j increases upon sampling a frame fromthat chunk. This natural relative increase and decrease of N j with respect to each other allows ExSample to seamlesslyshift where it allocates samples over time. : 172085N1: 5 n: 179601N1: 0n: 120911N1: 4 n: 131315N1: 7n: 41013N1: 18 n: 48094N1: 17n: 100N1: 116 n: 14093N1: 58n: 82N1: 127 n: 97N1: 1210e+00 3e−05 6e−05 9e−05 0.0e+00 2.5e−05 5.0e−05 7.5e−050.00000 0.00005 0.00010 0.00015 0.00000 0.00005 0.00010 0.000150.00025 0.00050 0.00075 0.00100 0.001250.00000 0.00025 0.00050 0.00075 0.001001.0 1.2 1.4 0.002 0.003 0.004 0.005 0.006 0.0071.1 1.3 1.5 1.7 1.9 1.0 1.2 1.4 R(n+1) actual R(n+1) point estimate Thompson samples Gamma(N1+.1,n+1)
Estimates, real values and Thompson sampling
Figure 2: Comparing our Gamma heuristic of Equation 10 with a histogram of the true values $R(n+1)$ from a simulation with heavily skewed $p_i$. The details of the simulation are discussed in subsubsection 3.3.2. We picked 10 $(N_1, n)$ pairs from the data to include multiple important edge scenarios: where $n$ is small as well as where $n$ is very large (in this case up to 20% of the total frames). We also show $N_1$ close to 0 due to bad luck in the last subplot. Note we are using the noisy observed $N_1$ and not the idealized $\mathrm{E}[N_1]$, which would be a lot more accurate but is not directly observable. The histograms show the range of values seen for $R(n+1)$ given the observed $N_1$ and $n$. The point estimate $N_1/n$ of Equation 7 is shown as a vertical line. The belief distribution density is plotted as a thicker orange line, and 5 samples drawn from it using Thompson sampling are shown with dashed lines.

In practice, the chunk-and-sample approach of ExSample works well when different chunks have different scores and these differences persist after more than a few samples. This is the case if different files are very different in content, for example a car driving in one city vs. another city or on a highway, or, within a single file, if the camera moves or there is a strong temporal pattern. For example, if after a few samples we find that 50% of the chunks likely have no results, we can expect ExSample to focus sampling on the rest of the dataset, with savings bounded by 2x compared to random sampling, which would keep allocating samples everywhere. In contrast, if all chunks have essentially the same score, then random sampling will be just as good (or as bad) at finding results.

Chunking based on time is likely to work well because many types of results show some amount of temporal locality. For example, traffic lights appear in cities and are likely to appear one block after the next. Making intervals too long (for example, the whole video) means fewer opportunities for scores to differ across intervals. On the other hand, making intervals very short (for example, one second long) means a lot of sampling is spent estimating which chunks are better, and the payoff of this information is smaller because we quickly run out of frames within each chunk.

For our evaluation, simply using chunks based on files, split into video intervals of up to 30 minutes, worked well across our benchmarks.
In this section we lay out how the intuition of the previous sections translates into pseudocode, which we show in Algorithm 1.

    input : video, chunks, detector, matcher, result_limit
    output: ans
     1  ans ← []
        // arrays for stats of each chunk
     2  N1 ← [0, 0, ..., 0]
     3  n  ← [0, 0, ..., 0]
     4  while len(ans) < result_limit do
        // 1) choice of chunk and frame
     5      for j ← 1 to M do
     6          R_j ← Γ(N1[j] + α0, n[j] + β0).sample()
     7      end
     8      j* ← argmax_j R_j
     9      frame_id ← chunks[j*].sample()
        // 2) io, decode, detection and matching
    10      rgb_frame ← video.read_and_decode(frame_id)
    11      dets ← detector(rgb_frame)
        // d0 are the unmatched (new) detections
        // d1 are detections with exactly one previous match
    12      d0, d1 ← matcher.get_matches(frame_id, dets)
        // 3) update state
    13      N1[j*] ← N1[j*] + len(d0) − len(d1)
    14      n[j*]  ← n[j*] + 1
    15      matcher.add(frame_id, dets)
    16      ans.add(d0)
    17  end

    Algorithm 1: ExSample

The inputs to the algorithm are:

- video: the video dataset itself, which may be a single video or a collection of files.
- chunks: the collection of logical chunks that we have split our video dataset into. One natural way is to split by time, so we can think of them as splitting each file in our dataset into 30-minute units. There are M chunks total.
- detector: an object detector provided by the user, for detecting the objects of interest to the application.
- matcher: an algorithm that matches detections to suggest which are new and which may be duplicates. The notion of new is application specific, but the matcher can be implemented based on feature-vector appearance, for example. We note that the matcher does not need to be accurate: a dummy matcher could say any two instances more than 1 second apart are distinct, which is effectively what much current work does. Its function is simply to signal that we are repeating results so we can better discount chunks.
- result_limit: an indication of when to stop.

After initializing arrays to hold per-chunk statistics, the code can be understood in three parts: choosing a frame, processing the frame, and updating state. The frame choice part is where ExSample makes a decision about which frame to process next. It starts with the Thompson sampling step in line 6, where we draw a separate sample R_j from the belief distribution of Equation 10 for each of the chunks, which is then used in line 8 to pick the highest scoring chunk. The j in the code is used as the index variable for any loop over the chunks. During the first execution of the while loop all the belief distributions are identical, but their samples will not be, breaking ties at random. Once we have decided on a best chunk index j*, we sample a frame index at random from it in line 9.

The second part includes all the heavy work involved in video processing: reading and decoding the frame we chose, and applying the object detector to it (line 11). After that is done, we pass the detections on to a matcher algorithm, which compares them with those we have returned before in other frames and decides if they are distinct enough to be considered separate results. For it to be useful to our task, the matcher needs to give us the subset of detections which did not match any previous results, and the ones that matched exactly once with a detection from a previous iteration. The length of each of those lists is all we need to update our statistics in part 3. It is important to note that this part of the algorithm is the main bottleneck, with the detector call in line 11 being most of the work, followed in second place by the random read and decode of line 10. In comparison, the overhead of the first part is negligible and fully parallelizable; it only grows with the number of chunks.

The third part updates the state of our algorithm, updating N1 and n for the chunk we sampled from. Additionally, we store detections in the matcher and append the truly new detections to the answer. We note that the amount of state we need to track only grows with the number of results found so far, and not with the size of the dataset.
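For concreteness, here is Algorithm 1 transcribed into runnable Python, assuming detector and matcher objects with the interfaces sketched earlier (read_and_decode, get_matches, and add are assumed method names, not a specific library's API):

import numpy as np

def exsample(video, chunks, detector, matcher, result_limit,
             alpha0=0.1, beta0=1.0, seed=0):
    """Sampling loop of Algorithm 1. chunks[j].sample() is assumed to return
    a random frame id within chunk j."""
    rng = np.random.default_rng(seed)
    M = len(chunks)
    N1 = np.zeros(M)   # per chunk: results currently seen exactly once
    n = np.zeros(M)    # per chunk: frames sampled so far
    ans = []
    while len(ans) < result_limit:
        # 1) choice of chunk and frame (Thompson sampling, Equation 10)
        draws = rng.gamma(N1 + alpha0, 1.0 / (n + beta0))
        j = int(np.argmax(draws))
        frame_id = chunks[j].sample()
        # 2) io, decode, detection and matching (the expensive part)
        dets = detector(video.read_and_decode(frame_id))
        # d0: detections matching no previous result (truly new)
        # d1: detections matching exactly one previous detection
        d0, d1 = matcher.get_matches(frame_id, dets)
        # 3) update state: new singletons enter N1, second sightings leave
        N1[j] += len(d0) - len(d1)
        n[j] += 1
        matcher.add(frame_id, dets)
        ans.extend(d0)
    return ans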
Finally, we present several other optimizations to Algorithm 1. Algorithm 1 processes one frame at a time, but to make good use of modern GPUs we may want to run inference in batches. The code for a batched version is similar to Algorithm 1, but on line 6 we draw B samples per chunk j instead of one sample from each belief distribution, creating B cohorts of samples. In Figure 2 we show 5 different values drawn by Thompson sampling from the same distribution as dashed lines. Because each sample for the same chunk will be different, the chunk with the maximum value will also vary, and we will get B chunk indices, biased toward the more promising chunks. The logic for the state update only requires small modifications. In principle, we might fear that picking the next B frames at once instead of only 1 frame could lead to suboptimal decision making within that batch, but at least for small values of B up to 50, which is what we use in our evaluation, we saw no significant drop. This is likely enough to fully utilize a machine with multiple GPUs.

We do not implement or evaluate asynchronous, distributed execution in this paper, but the same reasoning suggests ExSample could be made to scale to an asynchronous setting, with workers processing a batch of frames at a time without waiting for other workers. Ultimately all the updates to N1j and nj are commutative because they are additive.

While random sampling is a good baseline, it allows samples to land very close to each other in quick succession: for example, in a 1000-hour video, random sampling is likely to start sampling frames within the same one-hour block after having sampled only about 30 different hours, instead of after having sampled most of the hours once. For this reason, we introduce a variation of random sampling, which we call random+, that deliberately avoids sampling temporally near previous samples when possible: it samples one random frame out of every hour, then one frame out of every not-yet-sampled half hour at random, and so on, until eventually sampling the full dataset. We evaluate the separate effect of this change in our evaluation, and we also use random+ to sample within each chunk when evaluating ExSample. This is implemented by modifying the internal implementation of the chunk.sample() method of line 9, and does not change the top-level algorithm. A sketch of this ordering appears below.
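The random+ ordering can be sketched as follows (our illustration; interval boundaries and the repeat-handling policy are implementation choices): visit one random frame from every interval at the current granularity, then halve the intervals and repeat:

import random

def random_plus_order(num_frames, top_interval, seed=0):
    """Yield frame indices spread out in time: one random frame from each
    top-level interval, then one from each half-interval, and so on.
    Sketch only: occasional repeated frames are possible and can be
    filtered by the caller."""
    rng = random.Random(seed)
    level = [(s, min(s + top_interval, num_frames))
             for s in range(0, num_frames, top_interval)]
    while level:
        order = level[:]
        rng.shuffle(order)           # visit intervals in random order
        for s, e in order:
            yield rng.randrange(s, e)
        # descend one level: split each interval in half
        level = [half for s, e in level if e - s > 1
                 for half in ((s, (s + e) // 2), ((s + e) // 2, e))]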
Generalized instance durations. Throughout the paper we have used $p_i$ as a proxy for duration, assuming we select frames with uniform random sampling. However, we could weight frames non-uniformly at random in the first place, for example by using some precomputed per-frame score. If we use non-uniform weights to sample frames, we effectively induce a different set of $p'_i$ for each result, ideally one with $\mu_{p'} \gg \mu_p$. The estimates for the relative value of different chunks will still be correct, since ExSample is designed to work with any underlying $p_i$.

Generalized chunks. We have introduced the idea of a chunk as a partitioning of the data. However, it would be possible for chunks to overlap, and this is equivalent to having instances that span multiple chunks. This choice is only meaningful if the different chunks differ in some way; for example, one chunk can be a half hour of video sampled uniformly at random, while a different chunk can be the same half hour sampled under a different set of weights, using the idea of generalized $p_i$.
4. EVALUATION
Our goals for this evaluation are to demonstrate the benefits of ExSample on real data, comparing it to alternatives including random sampling as well as existing work in video processing. We show that on these challenging datasets ExSample achieves savings of up to 4x with respect to random sampling, and orders of magnitude with respect to approaches based on related work.

Even though existing work does not explicitly optimize for distinct object queries like ExSample does, existing work such as BlazeIt and NoScope optimizes searching for frames that satisfy expensive predicates. These systems also recognize the need to avoid redundant results, and implement a basic form of duplicate avoidance by skipping a fixed amount of time. It is therefore a reasonable question whether existing work already inadvertently processes distinct object queries efficiently; in this evaluation we show this is not the case, and that the two kinds of query require different approaches. The main line of existing work in video processing uses lightweight convolutional networks to assign a preliminary score to each frame and then processes frames from highest to lowest score. BlazeIt is the state-of-the-art representative of this surrogate model approach.
Both our implementation of BlazeIt and ExSample are at their core a sampling loop where the choice of which frame to process next is based on an algorithm-specific score. Based on this score, the system fetches a batch of frames from the video and runs that batch through an object detector. We implement this sampling loop in Python, using PyTorch to run inference on a GPU. The object detector model, which we refer to as the full model, is the Faster R-CNN with a ResNet-101 backbone used for ground truth. To reach reasonable random-access frame decoding rates, we use the Hwang library from the Scanner project [15], and re-encode our datasets to insert keyframes every 20 frames. BlazeIt requires extra upfront costs prior to sampling to train the surrogate model; we describe our implementation of that part of BlazeIt and its associated fixed costs in subsection 4.6.

We implemented the subset of BlazeIt for limit queries with simple predicates, based on the description in the paper [8] as well as their published code. We opted for our own implementation to make sure the I/O and decoding components of both ExSample and BlazeIt were equally optimized, and also because extending BlazeIt to handle our own datasets, ground truth, and metrics is more involved. For the cheaper surrogate (aka specialized model) in the BlazeIt paper we use an ImageNet pre-trained ResNet-18. This model is more heavyweight than the ones used in that paper, but also more accurate. We note that our timed results do not depend strongly on the runtime cost of ResNet-18.
For this evaluation we use two datasets, which we refer to as Dashcam and BDD. The Dashcam dataset consists of 10 hours of video, or over 1.1 million video frames, collected from a vehicle-mounted dashboard camera over several drives in cities and highways. Each drive ranges from around 20 minutes to several hours. The BDD dataset used for this evaluation is a subset of the Berkeley Deep Drive dataset [21], which consists of 40-second clips of dashboard camera video; our subset is made of 1000 randomly chosen clips. In both datasets the camera moves at variable speeds depending on the drive, and the BDD dataset in particular includes clips from many cities.
Both the Dashcam and BDD datasets include similar types of objects that we expect to see in cities and highways. These include stationary objects such as traffic lights and signs, and moving objects such as bicycles and trucks. Each type of object is a different search query, making our evaluation consist of 8 queries per dataset.

In addition to searching for different object classes, we also vary the limit parameter to achieve 0.1, 0.5 and 0.9 recall, where recall is measured as the fraction of distinct results found. These recall rates are meant to represent different kinds of applications: 0.1 (10%) represents a scenario where an autonomous vehicle data scientist is looking for a few test examples, whereas a higher recall like 0.9 would be more useful in an urban planning or mapping scenario where finding many instances is desired.
Neither the Dashcam nor the BDD dataset has human-generated object instance labels that both identify and track multiple objects over time. Therefore, we approximate ground truth by sequentially scanning every video in the dataset and running each frame through an object detector. If any objects are detected, we match the bounding boxes with those from previous frames and resolve which correspond to the same instance. For object detection we use a reference implementation of Faster R-CNN [17] from Facebook's Detectron2 library [20], one of the higher-accuracy object detection models, pre-trained on the COCO [11] dataset; in particular, we use the version with a ResNet-101 [5] backbone. To match object instances across neighboring frames, we employ an IoU matching approach similar to SORT [1]. IoU matching is a simple baseline for multi-object tracking that leverages the output of an object detector and matches detection boxes based on overlap across adjacent frames.
Here we evaluate the cost of processing distinct object queries, in both time and frames, using ExSample, random+, and BlazeIt. Because some classes like parking meter are extremely rare, and some such as truck are much more common, the absolute times and frame counts needed to process a query vary by many orders of magnitude across different object classes. It is easier to get a global view of cost savings by normalizing the query processing times of ExSample and BlazeIt against those of random+. That is, if random+ takes 1 hour to find 20% of traffic lights, and ExSample takes 0.5 hours to do the same, time savings are 2x. We apply the same normalization when comparing frames processed. Results for Dashcam are shown in Figure 3 and for BDD in Figure 4.

Overall, ExSample saves more than 2x in frames processed vs. random+, averaged across all classes and recall levels. Savings do vary by class, reaching above 4x for the person class. Savings in time are much larger when comparing with BlazeIt: although BlazeIt does reduce the number of frames the expensive detector is run on, especially at low recall, it performs very poorly overall because its high overhead costs prior to processing the query cancel the early wins from better prediction. This is especially evident in Figure 5, top row. Note that random+ is a better baseline than BlazeIt for this query, demonstrating that current techniques developed for other types of queries do not necessarily transfer to this query type.

Figure 3 shows two main trends. The first is that models trained with BlazeIt do succeed in finding objects faster when measured in number of frames processed. However, the total time it takes to sample using ExSample is orders of magnitude shorter, because ExSample does not require a prior training phase and only runs on a subset of frames, whereas the surrogate model needs to be run on every frame (we evaluate the relative costs of these two overheads in BlazeIt in the next section). The impact of these overheads shrinks as more samples are taken; however, in none of our experiments is the gain enough to amortize the costs of training.

Figure 4 shows similar trends. This is a challenging scenario for ExSample because each clip is only 40 seconds long, meaning we get fewer samples per chunk (and more chunks overall). This decreases our performance compared to the larger chunks in the Dashcam dataset; however, we still achieve a 2x improvement over random. Surrogate models have a harder time learning with high accuracy on this dataset, as shown in Figure 5. Counterintuitively, this helps in Figure 4, because random scoring is a better baseline for distinct object queries than very precise scoring.

Note that BlazeIt-trained models are not able to beat random search for this task. Counterintuitively, this result is not because the surrogate models fail to learn: BlazeIt's error rates are much better than random on the validation set during training. However, the surrogate is only better when used as a classifier, not as a way to rank promising frames for sampling. The main issue is that while BlazeIt models do predict which frames are likely to contain an object of interest, they pick high-scoring frames regardless of whether the objects in them are new or not. Because of the greedy nature of this approach, the extra accuracy from BlazeIt models is only helpful early on, at low numbers of samples, when every positive result is likely new.
The experiments in Figure 3 show BlazeIt is able to find 10% of all bicycles after only about 10 seconds of sampling, whereas ExSample reaches this level after 100 seconds, and random+ after 300 seconds. However, the training and surrogate scoring phases add an overhead of 10000 seconds to BlazeIt; hence ExSample does 100x better, and random+ 30x better, early on, since they have no fixed overhead at the start.
This section aims to break down in more detail the costs underlying the results of the previous section. In short, after BlazeIt surrogate models are trained, they must be run over the full dataset, and the scores are used to identify the highest scoring frames for sampling. While BlazeIt surrogate models are indeed much faster than the full models, scanning the full dataset is expensive even if the cheap model were free, because loading and decoding video is not free.
[Figure 3 plot: savings relative to random+, by object class (bicycle, person, traffic light, fire hydrant, stop sign, bus, truck) and recall level (.1, .5, .9), for BlazeIt, random, and this work. Title: Savings on Dashcam dataset.]

Figure 3: Results on the Dashcam dataset. Comparing time and frames processed for different recall levels (each row) on the dashcam dataset. Different methods (color) are compared relative to the cost of using random+ to process the same query and reach the same level of instance recall. The left column shows savings computed in terms of time, which includes the initial surrogate training overheads for BlazeIt. The right column shows savings comparing frames processed, which excludes any frames processed for training purposes.
BlazeIt [8] prioritizes sampling the highest scoring frames, where the score is computed with a cheaper surrogate model. Answering queries with such systems involves four stages, whose throughputs on our data are shown in Figure 6:

1. labelling phase: requires labelling a fraction of the dataset with the expensive object detector. Its runtime grows linearly with training set size, and because the object detector is involved, the throughput is as low as that of the sampling phase.
2. training phase: once the labels are generated, a cheaper surrogate model is fit to the dataset. This phase can be relatively cheap if the surrogate is itself cheap and the training set fits in memory, avoiding any need for I/O or decoding. Figure 6 shows the throughput of ResNet-18 is indeed much higher and unlikely to be the bottleneck.
3. scoring phase: the surrogate model runs over the dataset, producing a score for each frame. Even if the surrogate model were virtually free, Figure 6 shows that the I/O and decode for the remainder of the dataset dominate the runtime.
4. sampling phase: we fetch and process frames in descending order of surrogate score. This phase ends when we have found enough results for the user. This is the only phase for ExSample and for baselines such as random+. Regardless of access pattern, this phase is dominated by the cost of inference with the full model.

The first three phases can be seen as a fixed cost paid prior to finding results for surrogate-based methods. The promise of these surrogate-based techniques is that by paying the upfront cost we can greatly save on the rest of the processing. But our results in the previous section show that ExSample is often more effective than the surrogate, without these high up-front costs. Furthermore, these up-front costs have to be paid multiple times if, for example, the user wants to look for a new class of object or process a new dataset. It is unlikely that users will want to look for the same class of object on the same frames of video again, which is all pre-training a surrogate makes more efficient.
Savings on BDD dataset
Figure 4: Results on BDD dataset. The surrogatemodel has a harder time learning on this dataset,with lower accuracy scores. Counterintuitively, thismakes its savings comparable to those of random sam-pling, improving its results compared to Dashcamdatasest to be paid multiple times, if, for example, the user wantsto look for a new class of object or process a new data set.It’s unlikely that users will want to look for the same classof object on the same frames of video again, which is allpre-training a surrogate makes more efficient.
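To make the fixed-cost argument concrete, the following back-of-the-envelope sketch (Python) compares end-to-end query time for a surrogate-based pipeline against a sampling-only method. The combined per-phase throughputs come from Figure 6; the dataset size and labelled fraction are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope cost model: surrogate pipeline (label -> train ->
# score -> sample) vs. a sampling-only method such as ExSample or random+.

DATASET_FRAMES = 3_000_000   # assumed repository size (illustrative)
LABEL_FRACTION = 0.01        # assumed fraction labelled for training

LABEL_FPS = 17    # expensive detector, sequential I/O (Figure 6)
SCORE_FPS = 100   # surrogate scan, bounded by I/O + decode (Figure 6)
SAMPLE_FPS = 11   # expensive detector, random I/O (Figure 6)

def surrogate_query_seconds(frames_sampled: int) -> float:
    """Time including the fixed up-front phases (training time omitted,
    since Figure 6 suggests it is unlikely to be the bottleneck)."""
    label_time = DATASET_FRAMES * LABEL_FRACTION / LABEL_FPS
    score_time = DATASET_FRAMES / SCORE_FPS  # full-dataset surrogate scan
    return label_time + score_time + frames_sampled / SAMPLE_FPS

def sampling_only_query_seconds(frames_sampled: int) -> float:
    """Sampling-only methods pay no fixed cost before the first result."""
    return frames_sampled / SAMPLE_FPS

for k in (1_000, 10_000, 100_000):
    print(f"{k:>7} frames: surrogate {surrogate_query_seconds(k):>9.0f}s, "
          f"sampling-only {sampling_only_query_seconds(k):>8.0f}s")
```

Under these assumed numbers the surrogate pipeline carries roughly 30,000 seconds of fixed labelling and scanning cost before its cheaper per-frame ordering can begin to pay off, which matches the qualitative picture in Figure 3.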
5. RELATED WORK
Several approaches have recently proposed optimizations to address the cost of mining video data. A common idea in these approaches is to use cascaded classifiers, such as the Viola-Jones cascade detector [19], which enables real-time object detection in video streams with a cascade framework that considers additional Haar-like features in successive cascade layers. Lu et al. propose applying cascaded classifiers to efficiently process video queries containing probabilistic predicates that specify desired precision and recall levels [12]. They employ SVM, KDE, and deep neural network classifiers that take as input features from dimension-reduction approaches such as principal component analysis and feature hashing, to efficiently skip processing of video frames that the classifiers are confident do not contain objects relevant to the query. NoScope [9] trains specialized approximations to expensive CNNs while maintaining accuracy levels within a user-defined window. However, these approaches do not generalize well to diverse types of video data. For example, they require a costly training process to evaluate classifier accuracy (and, in some cases, to construct the classifiers), which may differ from video to video. Similarly, NoScope uses a difference detector specifically designed for video from static cameras, which is ineffective on video captured in mobile settings, e.g. by dashboard cameras. Moreover, for datasets where cascaded classifiers perform well, our approach is complementary, as the classifiers can be applied over sampled frames to obtain additional speedup (a minimal cascade sketch appears at the end of this section).

Figure 5 ("Surrogate model AP vs random AP"): The surrogate models trained for each query have different effectiveness. Here we compare their Average Precision to that of a randomly assigned score (plotted as surrogate AP / random AP, per object class, for the BDD and Dashcam datasets). For the BDD dataset, the surrogate model does only slightly better than random in precision, which causes it to perform similarly to random when measured in savings. For the Dashcam dataset, the surrogate models are more accurate than random. Counterintuitively, higher accuracy in scores hurts savings in Figure 3 due to the greediness of the algorithm.

Unlike cascaded classifier methods, BlazeIt [8] proposes training specialized models to evaluate specific query clauses, and applies random sampling when specialized models are not available. BlazeIt also proposes a declarative, SQL-like language for querying objects with several constraint types. As with other methods, our approach complements BlazeIt: substituting our sampling algorithm for random sampling may yield substantial performance gains. It is possible that the methods proposed here could be integrated into systems such as BlazeIt.

Other techniques focus on improving neural network inference speed. Deep Compression [3] prunes connections in the network that have a negligible impact on the inference result. ShrinkNets [10] proposes a dynamic network resizing scheme that extends the pruning process to neural network training, reducing not only inference time but training time as well. Again, these techniques can be combined with ExSample to further improve query processing speed.
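To make the cascade idea from this section concrete, here is a minimal two-stage sketch in Python. The function names, the threshold, and the cheap/expensive split are illustrative assumptions rather than any specific system's API.

```python
# Minimal two-stage cascade: a cheap per-frame score gates the expensive
# detector, so frames the cheap stage confidently rules out are skipped.
# `cheap_score` stands in for, e.g., a tiny CNN or difference detector;
# `expensive_detector` for a full object detector.

def cascade_filter(frames, cheap_score, expensive_detector, threshold=0.1):
    results = []
    for frame in frames:
        if cheap_score(frame) < threshold:
            continue  # cheap stage is confident the frame is irrelevant
        detections = expensive_detector(frame)  # runs only on survivors
        if detections:
            results.append((frame, detections))
    return results
```

As noted above, such a filter is complementary to sampling: it can be applied to the frames a sampler chooses, rather than to every frame in the repository.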
6. CONCLUSION
Over the next decade, workloads that process video to extract useful information may become a standard data mining task for analysts in application areas such as government, real estate, and autonomous vehicles. Such pipelines present a new systems challenge due to the cost of applying state-of-the-art machine vision techniques.

Figure 6: Throughput of each processing phase in our implementation of BlazeIt. The yellow boxes show the overall throughput reached by each processing phase in our implementation. Additionally, to distinguish the bottlenecks of the object detector from those of video decoding, we show the maximum throughput achievable by I/O and video decoding in purple, and by inference in cyan. Inference with surrogate models is marked to distinguish it from inference with the expensive model. For the scoring phase, which accounts for the bulk of the overhead in BlazeIt, scanning through the dataset bounds the throughput to 100 fps on our dataset. Although labelling throughput is low, labelling only happens on a fraction of the dataset, so it represents a small fraction of the overall runtime. (Recoverable values, in fps: labelling: io+decode 107 (seq. io), inference 21, combined 17; scoring: io+decode 107 (seq. io), surrogate inference 1805, combined 100; sampling: io+decode 24 (random io), inference 21, combined 11.)

In this paper we introduced ExSample, an approach for processing instance-finding queries on large video repositories through chunk-based adaptive sampling. Specifically, the aim of the approach is to find frames of video that contain instances of objects of interest without running an object detection algorithm on every frame, which could be prohibitively expensive. Instead, in ExSample, we sample frames and run the detector on just the sampled frames, tuning the sampling process based on whether a new instance of an object of interest is found in the sampled frames. To do this tuning, ExSample partitions the data into chunks and dynamically adjusts the frequency with which it samples from different chunks, based on the rate at which new instances are sampled from each chunk. We formulate this sampling process as an instance of Thompson sampling, using a Good-Turing estimator to compute the likelihood of finding a new object instance in each video chunk. In this way, as new instances in a particular chunk are exhausted, ExSample naturally refocuses its sampling on other, less frequently sampled chunks. A minimal sketch of this loop appears below.

Our evaluation of ExSample on a real-world dataset of dashcam data shows that it substantially improves on both the number of frames it samples and the total runtime relative to both random sampling and methods based on lightweight "surrogate" models, such as BlazeIt [8], that are designed to estimate which frames are likely to contain objects of interest with lower overhead. In particular, these surrogate-based methods are much slower because they require running the surrogate model on all frames.
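To summarize the method operationally, the sketch below (Python) implements the chunk-based sampling loop described above. The Beta-posterior form of the Thompson draw and the helper functions run_detector and match_instance are our illustrative assumptions based on this description, not the paper's exact implementation.

```python
import random
from collections import defaultdict

def exsample(chunks, run_detector, match_instance, budget, goal):
    """Adaptive sampling over video chunks, sketching the loop above.
    chunks: list of frame collections; run_detector: frame -> detections;
    match_instance: detection -> instance id (dedupes across frames)."""
    M = len(chunks)
    n = [0] * M                # frames sampled per chunk
    n1 = [0] * M               # instances seen exactly once per chunk
    per_chunk = [defaultdict(int) for _ in range(M)]
    found = set()

    while sum(n) < budget and len(found) < goal:
        # Thompson step: draw each chunk's new-instance rate from an
        # assumed Beta posterior around the Good-Turing estimate N1/n,
        # then sample a random frame from the highest-drawing chunk.
        j = max(range(M),
                key=lambda k: random.betavariate(n1[k] + 1, n[k] - n1[k] + 1))
        frame = random.choice(chunks[j])
        n[j] += 1
        for det in run_detector(frame):
            inst = match_instance(det)
            found.add(inst)
            per_chunk[j][inst] += 1
            if per_chunk[j][inst] == 1:
                n1[j] += 1     # newly seen exactly once in this chunk
            elif per_chunk[j][inst] == 2:
                n1[j] -= 1     # no longer seen exactly once
    return found
```

As a chunk's new instances are exhausted, its N1/n statistic falls and its Beta draws shrink, so sampling naturally shifts toward less-explored chunks.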
APPENDIX

A. NOTATION

$n$: number of frames sampled so far. We treat "frames sampled" and "frames processed" as synonyms.
$N$: number of distinct results in the data. We treat the terms result and instance as synonyms.
$i$: index variable over results, $i \in [1, N]$.
$N_1(n)$: number of results seen exactly once up to the $n$-th sampled frame. We omit $n$ when it is clear from context.
$\mathrm{seen}(n)$: set of results $i$ seen after $n$ frames have been processed.
$p_i$: probability of seeing result $i$ in a randomly drawn frame; it is proportional to the result's duration in the video. We treat duration and probability as synonyms.
$R(n+1)$: number of new results we expect to find in the next sampled frame: $R(n+1) = \sum_i [i \notin \mathrm{seen}(n)] \cdot p_i$.
$\mu_p$: mean duration, $\sum_i p_i / N$.
$\sigma_p$: standard deviation of the durations $p_i$.
$\pi_i(n)$: $p_i (1 - p_i)^{n-1}$, the chance that result $i$ appears first at the $n$-th sampled frame. We may leave the $n$ implicit.
$M$: number of chunks.
$j$: index variable over chunks, $j \in [1, M]$.
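As a sanity check on the estimator implied by this notation, that $N_1(n)/n$ tracks $R(n+1)$, the following Monte Carlo sketch (Python) compares the two quantities; the skewed duration distribution is an arbitrary assumption chosen for illustration.

```python
import random

# Compare the Good-Turing statistic N1(n)/n against the true expected
# number of new results R(n+1) = sum of p_i over unseen results i.
# The Pareto-distributed durations are an arbitrary assumption.

random.seed(0)
N = 200
weights = [random.paretovariate(1.5) for _ in range(N)]
total = sum(weights)
p = [w / total for w in weights]   # p_i: chance result i is in a frame

def trial(n):
    counts = [0] * N
    for _ in range(n):
        for i in range(N):         # each frame reveals result i w.p. p_i
            if random.random() < p[i]:
                counts[i] += 1
    n1_over_n = sum(1 for c in counts if c == 1) / n
    r_next = sum(p[i] for i in range(N) if counts[i] == 0)
    return n1_over_n, r_next

for n in (50, 200, 800):
    est, truth = trial(n)
    print(f"n={n:>4}: N1/n={est:.4f}, R(n+1)={truth:.4f}")
```

Consistent with the error analysis below (Equation 13 with $q_i = 1$), $N_1(n)/n$ overestimates $R(n+1)$ in expectation by $\sum_i p_i^2 (1 - p_i)^{n-1}$, a gap that shrinks as $n$ grows.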
B. OBJECTS SPANNING MULTIPLE CHUNKS

Here we prove Equation 7 is also valid when different chunks may share instances. Assume we have sampled $n_1$ frames from chunk 1, $n_2$ from chunk 2, and so on, and that instance $i$ can appear in multiple chunks: with probability $p_{i1}$ of being seen in a frame sampled from chunk 1, $p_{i2}$ in a frame sampled from chunk 2, and so on. We will work with chunk 1, without loss of generality. The expected number of new instances if we sample once more from chunk 1 is:

$$R_1(n_1 + 1) = \sum_{i=1}^{N} \left[ p_{i1} (1 - p_{i1})^{n_1} \prod_{j=2}^{M} (1 - p_{ij})^{n_j} \right] \quad (11)$$

Similarly, the expected number of instances seen exactly once in chunk 1, and in no other chunk up to this point, is:

$$N_1 = \sum_{i=1}^{N} \left[ n_1 \, p_{i1} (1 - p_{i1})^{n_1 - 1} \prod_{j=2}^{M} (1 - p_{ij})^{n_j} \right] \quad (12)$$

In both equations, the factor $\prod_{j=2}^{M} (1 - p_{ij})^{n_j}$ accounts for the requirement that instance $i$ has not been seen while sampling chunks 2 through $M$. We abbreviate this factor as $q_i$. When instances only show up in one chunk, $q_i = 1$, and everything is the same as in Equation 1. Dividing Equation 12 by $n_1$ and subtracting Equation 11, each term shares the common factor $p_{i1} (1 - p_{i1})^{n_1 - 1} q_i$, scaled by $1 - (1 - p_{i1}) = p_{i1}$, so the expected error is:

$$N_1(n_1)/n_1 - R_1(n_1 + 1) = \sum_{i=1}^{N} \left[ p_{i1}^2 (1 - p_{i1})^{n_1 - 1} q_i \right] \quad (13)$$

which again is term-by-term smaller than $N_1(n_1)/n_1$ by a factor of $p_{i1}$.

REFERENCES

[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. CoRR, abs/1602.00763, 2016.
[2] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
[3] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, 2016.
[4] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[7] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In International Conference on Pattern Recognition, pages 2756–2759. IEEE, 2010.
[8] D. Kang, P. Bailis, and M. Zaharia. BlazeIt: Fast exploratory video queries using neural networks. CoRR, abs/1805.01046, 2018.
[9] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 10(11):1586–1597, Aug. 2017.
[10] G. Leclerc, R. C. Fernandez, and S. Madden. Learning network size while training with ShrinkNets. In Conference on Systems and Machine Learning, 2018.
[11] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[12] Y. Lu, A. Chowdhery, S. Kandula, and S. Chaudhuri. Accelerating machine learning inference with probabilistic predicates. In ACM SIGMOD, June 2018.
[13] T. Murray. Help improve imagery in your area with our new camera lending program. , 2018.
[14] Nexar. Nexar. , 2018.
[15] A. Poms, W. Crichton, P. Hanrahan, and K. Fatahalian. Scanner: Efficient video analysis at scale. ACM Trans. Graph., 37(4):138:1–138:13, July 2018.
[16] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[17] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[18] D. Russo, B. V. Roy, A. Kazerouni, and I. Osband. A tutorial on Thompson sampling. CoRR, abs/1707.02038, 2017.
[19] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, 2001.
[20] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[21] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.