ReXCam: Resource-Efficient, Cross-Camera Video Analytics at Scale
Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Joseph Gonzalez
University of California Berkeley, University of Chicago, Microsoft Research
Abstract
Enterprises are increasingly deploying large camera networks for video analytics. Many target applications entail a common problem template: searching for and tracking an object or activity of interest (e.g., a speeding vehicle, a break-in) through a large camera network in live video. Such cross-camera analytics is compute and data intensive, with cost growing with the number of cameras and with time. To address this cost challenge, we present ReXCam, a new system for efficient cross-camera video analytics. ReXCam exploits spatial and temporal locality in the dynamics of real camera networks to guide its inference-time search for a query identity. In an offline profiling phase, ReXCam builds a cross-camera correlation model that encodes the locality observed in historical traffic patterns. At inference time, ReXCam applies this model to filter out frames that are not spatially and temporally correlated with the query identity's current position. In the case of an occasional missed detection, ReXCam performs a fast-replay search on recently filtered video frames, enabling graceful recovery. Together, these techniques allow ReXCam to reduce compute workload by 8.3× on an 8-camera dataset, and by 23×–38× on a simulated 130-camera dataset. ReXCam has been implemented and deployed on a testbed of 5 AWS DeepLens cameras.

1 Introduction

The Internet of Things (IoT) has led to an explosion of data sources, and of applications that rely on real-time inferences over these data. In parallel, the models making these inferences have improved in accuracy, even surpassing humans on certain vision tasks, but at increased resource cost. This work addresses the systems challenges of scaling up IoT applications to enable live video analytics on a fleet of cameras. Live video analytics over a fleet of camera feeds embodies two key trends: massive data sources and compute-intensive inference (e.g., neural nets).
On the one hand, enterprises deploy large camera networks for public safety and business intelligence [11]. For instance, Chicago and London police access footage from 30,000 and 12,000 cameras to respond to crimes in real time [4, 5]. On the other hand, many applications rely crucially on cross-camera video analytics, i.e., detecting, associating, and tracking queried "identities" in the live streams as these identities move across the camera feeds over time (e.g., high-value shoppers in a store [8, 34] or suspects in a city [46, 66]).

Figure 1: Spatio-temporal correlations for video inference. The cameras (on the y-axis) are plotted according to their mutual distances; e.g., c1 and c2 are spatially closer than c1 and c3. In searching for a query identity starting at frame t (marked in dark red), ReXCam eliminates some cameras entirely (spatial filtering), as well as frames t + 1 and t + 4 (temporal filtering). In this example, ReXCam searches first on c1, c2, and c3 (but not c4), finds the target vehicle in c3, and then searches only on c2 and c4 (but not c1 and c3). The cameras and the times at which they are searched are marked in green. The unmarked portions represent compute savings.

However, cross-camera analytics applications are computationally more challenging than "stateless" single-camera vision tasks (such as object detection in one camera feed), as they entail discovering associations across frames and across cameras. Their compute cost thus grows with the number of cameras.

Prior work falls short of addressing this challenge. Work in computer vision improves the accuracy of cross-camera analytics (e.g., [55, 58, 70]), but it has largely ignored the prohibitive compute costs. Recent systems have accelerated analytics on live videos via frame sampling and/or cascaded filters for discarding frames [25, 28, 37, 40, 63, 65]. However, they share a key drawback: they optimize the execution of analytics on single video feeds, independent of the other streams. Thus, the compute cost of cross-camera analytics still grows with more deployed cameras and longer activity time.

Spatio-temporal correlations:
Our main insight is that the cost of cross-camera analytics can be drastically reduced by exploiting the physical correlations of objects among the camera streams. We develop ReXCam, a cross-camera analytics system that leverages inherent spatio-temporal correlations to aggressively prune the set of camera streams to be processed, thus decreasing compute costs. In the ideal case, ReXCam reduces cost to the number of cameras that the queried object appears in at any point in time, not the total number of deployed cameras. A key property of cross-camera applications is that objects of interest appear only in a small number of cameras at any time, even in large camera deployments.

Spatial correlations indicate geographical association between cameras: the probability that objects seen in a source camera will move next into a particular destination camera's field of view. Temporal correlations indicate association between cameras over time: the probability that objects seen in a source camera will move next into a destination camera's view at a particular time. These spatio-temporal correlations enable ReXCam to guide its cross-camera inference search toward the cameras and frames most likely to contain the query identity (see Figure 1). ReXCam's use of spatio-temporal correlations to cut the cost of cross-camera analytics is fundamentally different from the cross-camera correlations used by recent work (e.g., [37]), which optimizes resource-accuracy profiling but not the live video analytics itself, which still executes on each stream independently.

Challenges:
ReXCam, at its core, applies physical properties of the IoT world (spatio-temporal correlations across cameras) to high-level AI applications (cross-camera video analytics). This leads to three main challenges. First, automatically obtaining spatio-temporal correlations is expensive on unlabeled video data. Second, applying spatio-temporal correlations to existing single-camera inference modules (e.g., object trackers) is non-trivial and requires clean abstractions with the necessary system support. Finally, any spatio-temporal profile is bound to have errors that will lead to missed objects, which need to be detected and rectified efficiently.

To tackle these challenges, ReXCam operates in three distinct phases. 1) In an offline profiling phase, it constructs a cross-camera spatio-temporal correlation model from unlabeled video data, which encodes the locality observed in historical traffic patterns. This is an expensive one-time operation that requires detecting entities with an offline tracker, and then converting them into an aggregate profile of cross-camera correlations. 2) At inference time, ReXCam uses this spatio-temporal model to filter out cameras that are not correlated with the query identity's current position (camera) and are thus unlikely to contain its next instance. 3) Occasionally, this filtering will cause ReXCam to miss query detections. In these cases, ReXCam performs a fast-replay search on recently filtered frames (which it stores), uncovers the missed query instances, and gracefully recovers into its live search.
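To make the first two phases concrete, here is a minimal sketch, under simplifying assumptions, of how a spatio-temporal profile could be built from offline tracker output and then applied as a camera filter at inference time. The data layout (identity → time-ordered sightings), the 5% traffic cutoff, and the travel-time slack are hypothetical illustration choices, not ReXCam's actual parameters.

```python
from collections import defaultdict

def build_correlation_model(tracks):
    """Offline profiling: aggregate tracked identities into a cross-camera
    profile. `tracks` maps identity -> time-ordered list of
    (camera_id, frame) sightings (hypothetical format). Returns
    model[src][dst] -> list of observed travel times in frames."""
    travel_times = defaultdict(lambda: defaultdict(list))
    for sightings in tracks.values():
        for (cam_a, t_a), (cam_b, t_b) in zip(sightings, sightings[1:]):
            if cam_a != cam_b:
                travel_times[cam_a][cam_b].append(t_b - t_a)
    return travel_times

def candidate_cameras(model, src, elapsed, min_prob=0.05, slack=0.25):
    """Inference-time filtering: keep destination cameras that (a) receive
    at least `min_prob` of src's outbound traffic (spatial filter) and
    (b) whose historical travel-time range, widened by `slack`, contains
    `elapsed` frames since the last sighting (temporal filter)."""
    total = sum(len(ts) for ts in model[src].values())
    keep = []
    for dst, ts in model[src].items():
        if total and len(ts) / total >= min_prob:
            lo, hi = min(ts), max(ts)
            if lo * (1 - slack) <= elapsed <= hi * (1 + slack):
                keep.append(dst)
    return keep
```

Everything outside the cameras (and time windows) returned by `candidate_cameras` is skipped by the live search; on a miss, those skipped frames would be revisited by the fast-replay step.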
Evaluation Highlights:
We evaluate ReXCam using the well-studied DukeMTMC video dataset [55] from the Duke campus. On this 8-camera dataset, ReXCam saves compute cost by 8.3× over a correlation-agnostic baseline (∼90% of the ideal savings). These savings come at a drop in recall of only 1.×–×. Interestingly, ReXCam improves precision by 39%, perhaps because the spatio-temporal pruning acts as a "low-pass filter". Finally, we have implemented and deployed ReXCam on a small testbed of 5 AWS DeepLens smart cameras [13].

Contributions:
Our work makes three main contributions.
1) We quantify the potential for harnessing spatio-temporal correlations in cross-camera video analytics.
2) We build a cross-camera video analytics system that learns and applies spatio-temporal profiles on live videos.
3) We develop robust error-handling mechanisms to avoid missed detections by storing and searching on recent videos.
We explain some example cross-camera video analytics applications (§2.1), the modules in their analytics pipelines (§2.2), and then the compute models for video analytics (§2.3).
Large camera networks are installed in cities (such as London, Beijing, and Chicago), transport facilities (traffic intersections, airports), and enterprise campuses (corporate offices, retail shops) [1, 5, 12, 66]. A common class of applications in these camera deployments relies on re-identifying and following objects (e.g., people or vehicles) as they move across the views of the different cameras. The focus is on following select "objects of interest" that are typically provided by external entities (such as law enforcement). A key characteristic of cross-camera applications is that objects of interest occur only in a small fraction of the cameras at any given time.
1) Public safety.
Cross-camera video analytics helps localize suspects after a security breach. For example, after a reported incident of a person pulling out a gun inside an office building, we will want to track that person (whose image can be obtained from the camera footage) across the cameras in the building while security personnel are dispatched.

Alternatively, after a major public attack (e.g., in a train), law enforcement may track the accomplices of the identified perpetrator, who may be obtained from police databases that store people frequently associated with the perpetrator [66]. Following these accomplices across the thousands of cameras in the city allows for effective police apprehension.
2) Vehicle tracking in traffic cameras.
In the U.S. and Europe, AMBER alerts are raised on suspected child abductions [2]. The license plate and vehicle details are obtained from investigations, and alerts are broadcast to citizens in the area [2]. Tracking the suspect's vehicle across the thousands of cameras on highways and city streets can keep tabs on the suspect and victim, even as police intervene [46].

Likewise, when traffic police notice a vehicle speeding or making a dangerous maneuver, they will note its details and will be interested in tracking the vehicle as it moves across the city using cross-camera analytics to assess its behavior.
3) Retail store cameras.
Using computer vision to improve the shopping experience is a big thrust among retailers. "Special" shoppers (e.g., loyal customers, or customers in wheelchairs) are identified as they enter the store, and cross-camera analytics can be used to track them across the hundreds of cameras in the store to make sure they are provided timely attention (e.g., dispatching a store representative) when necessary.

Figure 2: Illustration of identity re-identification.
Video analytics pipelines for cross-camera applications (§2.1) typically consist of a series of modules run on the decoded frames of the video stream: (1) an object detection module, which extracts and classifies objects of interest in each video frame (e.g., people, guns), and (2) a re-identification module, which, given a query image (e.g., of a person), returns positions of co-identical instances of the query in subsequent frames (if present). Cross-camera analytics pipelines detect objects in each camera and track the objects across cameras.

Core to this pipeline is the vision primitive of identity re-identification [39, 50, 56]. Given an image of a query identity q, a re-identification (re-id) algorithm ranks every image in a gallery G based on its feature distance to q; the lower the distance, the higher the similarity (Figure 2). Typically, features are the intermediate representation of a neural network trained to associate instances of co-identical entities. Object detection and re-id are the most challenging steps of cross-camera video analytics, in terms of both cost and accuracy, and our work focuses on improving both of them.

Cost.
Tracking in large camera networks is computationally expensive. Tracking even a single object of interest through a camera network, after an initial detection, can potentially require analyzing every subsequent frame in every camera (without good heuristics for geographic localization).

Accuracy.
Re-id is a non-trivial problem in computer vision [59, 68], and is particularly difficult in crowded scenes and in large camera networks due to significant differences in lighting and viewpoint across cameras. Often, re-id models must rely on weak signals (like clothing), making re-identification difficult among a large gallery of objects in a frame.

Our use of spatio-temporal correlations to prune the video frames to analyze, i.e., those on which we run object detection and re-id, significantly cuts down the inference space, thus improving both cost and accuracy. While our focus is on cross-camera applications, we also show how spatio-temporal correlations improve the cost of even single-camera applications (§5.4).

Optimizations using frame sampling in each camera stream [28, 40] are orthogonal to our idea of using spatio-temporal correlations across cameras, and we will quantify this aspect in our experiments in §8.2.
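The re-id primitive described above, ranking a gallery by feature distance to the query, can be sketched in a few lines. The Euclidean distance and the (label, feature) gallery layout are illustrative assumptions; in practice the features would come from a re-id network, and the distance metric varies by model.

```python
import math

def rank_gallery(query_feat, gallery):
    """Rank gallery entries by feature distance to the query q, as in the
    re-identification primitive: lower distance means higher similarity,
    so the best match comes first. `gallery` is a hypothetical list of
    (label, feature_vector) pairs."""
    def dist(a, b):
        # Euclidean distance between two feature vectors (one common choice).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(gallery, key=lambda item: dist(query_feat, item[1]))
```

A system then treats the top-ranked entries (or those under a distance threshold) as candidate co-identical instances; each frame that feeds this gallery costs one detection pass plus one feature extraction, which is exactly the per-frame work the spatio-temporal pruning avoids.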
Consistent with existing deployments [23, 29, 47], our focus is on "edge" computation of video analytics. In our setup, all the cameras are in a high-speed local network with sufficient bandwidth to an edge compute box (e.g., Azure Data Box Edge [3]) that is managed by the enterprise that has deployed the cameras. For example, cameras in an office building are analyzed in an edge box located in the same building. Traffic cameras in a city are analyzed in the local traffic command center [45]. Videos are streamed to this edge box, and the pipeline modules (§2.2), including object detection and re-id, are run on this edge. Reducing the compute load enables more video feeds to be processed on the edge box, or alternatively reduces the resources to be provisioned.

Our ideas also readily apply to a network of AI cameras (as we implement and deploy in §7), each of which has on-board compute, accelerators (e.g., GPUs), and storage [13, 53]. Our techniques will enable each camera to be provisioned with much lower resources, thus lowering their cost.
We analyze the potential of using spatio-temporal correlations for cross-camera video analytics using the DukeMTMC dataset [55]. We study cross-camera identity tracking, which involves tracking an object of interest, in real time, through a camera network. In particular, given an instance of a query identity q (e.g., a person) flagged in camera c_q at frame f, we return all subsequent frames, across all cameras, in which q appears as it moves around. We measure the reduction in compute, i.e., the number of frames on which object detection and re-id operations (§2.2) are executed.

We now present an empirical study to quantify the cross-camera correlations in the DukeMTMC dataset [55], one of the most popular benchmarks in computer vision for person re-id and tracking [60, 67]. This quantification motivates our design of a video analytics system that leverages such correlations to improve the performance of cross-camera analytics. The DukeMTMC dataset contains footage from eight cameras placed on the Duke University campus (see Figure 3), in an area with significant pedestrian traffic. The cameras' fields of view mostly do not intersect, but the cameras are placed close enough that people frequently appear in multiple cameras, as is typical in camera deployments. The dataset contains over 2,700 unique identities across 85 minutes of footage, recorded at 60 frames per second [55].
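The pairwise "traffic" statistics used in this study can be estimated with a simple aggregation over per-identity sighting logs. This is a minimal sketch; the input format (identity → time-ordered list of camera ids) is a hypothetical simplification of what an offline tracker would emit.

```python
from collections import Counter, defaultdict

def traffic_percentages(sightings):
    """Estimate the pairwise traffic matrix: for each source camera, the
    percentage of individuals whose *next* sighting is at each destination
    camera. `sightings` maps identity -> time-ordered list of camera ids
    (hypothetical log format). Consecutive same-camera sightings are not
    cross-camera transitions and are skipped."""
    counts = defaultdict(Counter)
    for cams in sightings.values():
        for src, dst in zip(cams, cams[1:]):
            if src != dst:
                counts[src][dst] += 1
    return {src: {dst: 100.0 * n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}
```

Note this counts only the immediate next camera, matching the definition of traffic used below: an individual moving A → C → B contributes to A → C and C → B, not to A → B.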
Cross-camera movement of individuals (or "traffic") demonstrates a high degree of spatial correlation. Here, "traffic" between cameras A and B is defined as the set of unique individuals detected in camera A that are next detected in camera B. (Note that a person who moves from A to B via camera C is excluded from the traffic count of A → B and is instead included in the A → C traffic count.) We find that individuals seen at a camera c_q move next to only a small number of c_q's peer cameras. On the 8-camera DukeMTMC dataset, only 1.9 of 7 potential peer cameras, on average, receive even 5% of the total outbound traffic (individuals) from a given camera. Figure 4 shows the full pair-wise spatial correlations.

Figure 3: DukeMTMC camera network [55]. Marked regions show the visual field of view of each camera.

Figure 4: Spatial correlations in the DukeMTMC dataset [55]. Cells display the % of outbound traffic (individuals) from each camera that appears at other cameras. Each row corresponds to a particular source camera and each column to a destination camera; each row's values add up to 100%. The final column represents traffic that exits the camera network.

Exploiting this insight can significantly reduce our compute workload, at little cost to accuracy, when searching for a query identity q (e.g., a person) that was first detected in camera c_q. In comparison to a scheme that searches all n − 1 peer cameras, searching only the peers that receive at least 5% of the traffic from c_q reduces our compute by 3.7× (we search only 1.9 cameras instead of 7, or 3.7× fewer frames on which to run object detection and re-id; see §2.2), while still capturing 95% of all detections as per our experiments.

An interesting aspect is that geographical proximity is not necessarily a good spatial filter. Consider camera 5 (Figure 4), out of which a significant fraction of individuals (traffic) go to cameras 2 and 6 but not to