ReXCam: Resource-Efficient, Cross-Camera Video Analytics at Scale
Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, Joseph Gonzalez
University of California Berkeley, University of Chicago, Microsoft Research
Abstract
Enterprises are increasingly deploying large camera networks for video analytics. Many target applications entail a common problem template: searching for and tracking an object or activity of interest (e.g., a speeding vehicle, a break-in) through a large camera network in live video. Such cross-camera analytics is compute and data intensive, with cost growing with the number of cameras and with time. To address this cost challenge, we present ReXCam, a new system for efficient cross-camera video analytics. ReXCam exploits spatial and temporal locality in the dynamics of real camera networks to guide its inference-time search for a query identity. In an offline profiling phase, ReXCam builds a cross-camera correlation model that encodes the locality observed in historical traffic patterns. At inference time, ReXCam applies this model to filter out frames that are not spatially and temporally correlated with the query identity's current position. In the case of an occasional missed detection, ReXCam performs a fast-replay search on recently filtered video frames, enabling graceful recovery. Together, these techniques allow ReXCam to reduce compute workload by 8.3× on an 8-camera dataset, and by 23×–38× on a simulated 130-camera dataset. ReXCam has been implemented and deployed on a testbed of 5 AWS DeepLens cameras.

1 Introduction

The Internet of Things (IoT) has led to an explosion of data sources, and of applications that rely on real-time inferences over these data. In parallel, the models making these inferences have improved in accuracy, even surpassing humans on certain vision tasks, but at increased resource cost. This work addresses the systems challenges of scaling up IoT applications to enable live video analytics on a fleet of cameras. Live video analytics over a fleet of camera feeds embodies two key trends: massive data sources and compute-intensive inference (e.g., neural nets).
On the one hand, enterprises deploy large camera networks for public safety and business intelligence [11]. For instance, Chicago and London police access footage from 30,000 and 12,000 cameras to respond to crimes in real time [4, 5]. On the other hand, many applications rely crucially on cross-camera video analytics, i.e., detecting, associating, and tracking queried "identities" in the live streams as these identities move across the camera feeds over time (e.g., high-value shoppers in a store [8, 34] or suspects in a city [46, 66]).

Figure 1: Spatio-temporal correlations for video inference. The cameras (on the y-axis) are plotted according to their mutual distances; e.g., c1 and c2 are spatially closer than c1 and c3. In searching for a query identity starting at frame t (marked in dark red), ReXCam eliminates some cameras entirely (spatial filtering), as well as frames t + 1 and t + 4 (temporal filtering). In this example, ReXCam searches first on c1, c2, and c3 (but not c4), finds the target vehicle in c3, and then searches only on c2 and c4 (but not c1 and c3). The cameras and the times at which they are searched are marked in green. The unmarked portions represent compute savings.

However, cross-camera analytics applications are computationally more challenging than "stateless" single-camera vision tasks (such as object detection in one camera feed), as they entail discovering associations across frames and across cameras. Their compute cost thus grows with the number of cameras.

Prior work falls short of addressing this challenge. Work in computer vision improves the accuracy of cross-camera analytics (e.g., [55, 58, 70]), but it has largely ignored the prohibitive compute costs. Recent systems have accelerated analytics on live videos via frame sampling and/or cascaded filters for discarding frames [25, 28, 37, 40, 63, 65]. However, they share a key drawback: they optimize the execution of analytics on single video feeds, independent of the other streams. Thus, the compute cost of cross-camera analytics still grows with more deployed cameras and longer activity time.

Spatio-temporal correlations:
Our main insight is that the cost of cross-camera analytics can be drastically reduced by exploiting the physical correlations of objects among the camera streams. We develop ReXCam, a cross-camera analytics system that leverages inherent spatio-temporal correlations to aggressively prune the set of camera streams to be processed, thus decreasing compute costs. In the ideal case, ReXCam reduces cost to the number of cameras that the queried object appears in at any point in time, not the total number of deployed cameras. A key property of cross-camera applications is that objects of interest appear only in a small number of cameras at any time, even in large camera deployments.

Spatial correlations indicate geographical association between cameras: the probability that objects seen in a source camera will move next into a particular destination camera's field of view. Temporal correlations indicate association between cameras over time: the probability that objects seen in a source camera will move next into a destination camera's view at a particular time. These spatio-temporal correlations enable ReXCam to guide its cross-camera inference search toward the cameras and frames most likely to contain the query identity (see Figure 1). ReXCam's use of spatio-temporal correlations to cut the cost of cross-camera analytics is fundamentally different from the cross-camera correlations used by recent work (e.g., [37]), which optimizes resource-accuracy profiling but not the live video analytics itself, which still executes on each stream independently.

Challenges:
ReXCam, at its core, applies physical properties of the IoT world (spatio-temporal correlations across cameras) to high-level AI applications (cross-camera video analytics). This leads to three main challenges. First, automatically obtaining spatio-temporal correlations is expensive on unlabeled video data. Second, applying spatio-temporal correlations to existing single-camera inference modules (e.g., object trackers) is non-trivial and requires clean abstractions with the necessary system support. Finally, any spatio-temporal profile is bound to have errors that will lead to missed objects, which need to be detected and rectified efficiently.

To tackle these challenges, ReXCam operates in three distinct phases. 1) In an offline profiling phase, it constructs a cross-camera spatio-temporal correlation model from unlabeled video data, which encodes the locality observed in historical traffic patterns. This is an expensive one-time operation that requires detecting entities with an offline tracker, and then converting them into an aggregate profile of cross-camera correlations. 2) At inference time, ReXCam uses this spatio-temporal model to filter out cameras that are not correlated with the query identity's current position (camera) and are thus unlikely to contain its next instance. 3) Occasionally, this filtering will cause ReXCam to miss query detections. In these cases, ReXCam performs a fast-replay search on recently filtered frames (which it stores), uncovers the missed query instances, and gracefully recovers into its live search.
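To make the first two phases concrete, here is a minimal sketch, under simplifying assumptions, of how a spatio-temporal profile could be built from offline tracker output and then applied as a camera filter at inference time. The data layout (identity → time-ordered sightings), the 5% traffic cutoff, and the travel-time slack are hypothetical illustration choices, not ReXCam's actual parameters.

```python
from collections import defaultdict

def build_correlation_model(tracks):
    """Offline profiling: aggregate tracked identities into a cross-camera
    profile. `tracks` maps identity -> time-ordered list of
    (camera_id, frame) sightings (hypothetical format). Returns
    model[src][dst] -> list of observed travel times in frames."""
    travel_times = defaultdict(lambda: defaultdict(list))
    for sightings in tracks.values():
        for (cam_a, t_a), (cam_b, t_b) in zip(sightings, sightings[1:]):
            if cam_a != cam_b:
                travel_times[cam_a][cam_b].append(t_b - t_a)
    return travel_times

def candidate_cameras(model, src, elapsed, min_prob=0.05, slack=0.25):
    """Inference-time filtering: keep destination cameras that (a) receive
    at least `min_prob` of src's outbound traffic (spatial filter) and
    (b) whose historical travel-time range, widened by `slack`, contains
    `elapsed` frames since the last sighting (temporal filter)."""
    total = sum(len(ts) for ts in model[src].values())
    keep = []
    for dst, ts in model[src].items():
        if total and len(ts) / total >= min_prob:
            lo, hi = min(ts), max(ts)
            if lo * (1 - slack) <= elapsed <= hi * (1 + slack):
                keep.append(dst)
    return keep
```

Everything outside the cameras (and time windows) returned by `candidate_cameras` is skipped by the live search; on a miss, those skipped frames would be revisited by the fast-replay step.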
Evaluation Highlights:
We evaluate ReXCam using the well-studied DukeMTMC video dataset [55] from the Duke campus. On this 8-camera dataset, ReXCam saves compute cost by 8.3× over a correlation-agnostic baseline (∼90% of the ideal savings). These savings come at a drop in recall of only 1.×–×. Interestingly, ReXCam improves precision by 39%, perhaps because the spatio-temporal pruning acts as a "low-pass filter". Finally, we have implemented and deployed ReXCam on a small testbed of 5 AWS DeepLens smart cameras [13].

Contributions:
Our work makes three main contributions.
1) We quantify the potential for harnessing spatio-temporal correlations in cross-camera video analytics.
2) We build a cross-camera video analytics system that learns and applies spatio-temporal profiles on live videos.
3) We develop robust error-handling mechanisms to avoid missed detections by storing and searching on recent videos.
We explain some example cross-camera video analytics applications (§2.1), the modules in their analytics pipelines (§2.2), and then the compute models for video analytics (§2.3).
Large camera networks are installed in cities (such as London, Beijing, and Chicago), transport facilities (traffic intersections, airports), and enterprise campuses (corporate offices, retail shops) [1, 5, 12, 66]. A common class of applications in these camera deployments relies on re-identifying and following objects (e.g., people or vehicles) as they move across the views of the different cameras. The focus is on following select "objects of interest" that are typically provided by external entities (such as law enforcement). A key characteristic of cross-camera applications is that objects of interest occur only in a small fraction of the cameras at any given time.
1) Public safety.
Cross-camera video analytics helps localize suspects after a security breach. For example, after a reported incident of a person pulling out a gun inside an office building, we will want to track that person (whose image can be obtained from the camera footage) across the cameras in the building while security personnel are dispatched.

Alternatively, after a major public attack (e.g., in a train), law enforcement may track the accomplices of the identified perpetrator, who may be obtained from police databases that store people frequently associated with the perpetrator [66]. Following these accomplices across the thousands of cameras in the city allows for effective police apprehension.
2) Vehicle tracking in traffic cameras.
In the U.S. and Europe, AMBER alerts are raised on suspected child abductions [2]. The license plate and vehicle details are obtained from investigations, and alerts are broadcast to citizens in the area [2]. Tracking the suspect's vehicle across the thousands of cameras on highways and city streets can keep tabs on the suspect and victim, even as police intervene [46].

Likewise, when traffic police notice a vehicle speeding or making a dangerous maneuver, they will note its details and will be interested in tracking the vehicle as it moves across the city using cross-camera analytics to assess its behavior.
3) Retail store cameras.
Using computer vision to improve the shopping experience is a big thrust among retailers. "Special" shoppers (e.g., loyal customers, or customers in wheelchairs) are identified as they enter the store, and cross-camera analytics can be used to track them across the hundreds of cameras in the store to make sure they are provided timely attention (e.g., dispatching a store representative) when necessary.

Figure 2: Illustration of identity re-identification.
Video analytics pipelines for cross-camera applications (§2.1) typically consist of a series of modules run on the decoded frames of the video stream: (1) an object detection module, which extracts and classifies objects of interest in each video frame (e.g., people, guns), and (2) a re-identification module, which, given a query image (e.g., of a person), returns positions of co-identical instances of the query in subsequent frames (if present). Cross-camera analytics pipelines detect objects in each camera and track the objects across cameras.

Core to this pipeline is the vision primitive of identity re-identification [39, 50, 56]. Given an image of a query identity q, a re-identification (re-id) algorithm ranks every image in a gallery G based on its feature distance to q; the lower the distance, the higher the similarity (Figure 2). Typically, features are the intermediate representation of a neural network trained to associate instances of co-identical entities. Object detection and re-id are the most challenging steps of cross-camera video analytics, in terms of both cost and accuracy, and our work focuses on improving both of them.

Cost.
Tracking in large camera networks is computationally expensive. Tracking even a single object of interest through a camera network, after an initial detection, can potentially require analyzing every subsequent frame in every camera (without good heuristics for geographic localization).

Accuracy.
Re-id is a non-trivial problem in computer vision [59, 68], and is particularly difficult in crowded scenes and in large camera networks due to significant differences in lighting and viewpoint across cameras. Often, re-id models must rely on weak signals (like clothing), making re-identification difficult among a large gallery of objects in a frame.

Our use of spatio-temporal correlations to prune the video frames to analyze, i.e., those on which we run object detection and re-id, significantly cuts down the inference space, thus improving both cost and accuracy. While our focus is on cross-camera applications, we also show how spatio-temporal correlations improve the cost of even single-camera applications (§5.4).

Optimizations using frame sampling in each camera stream [28, 40] are orthogonal to our idea of using spatio-temporal correlations across cameras, and we will quantify this aspect in our experiments in §8.2.
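The re-id primitive described above, ranking a gallery by feature distance to the query, can be sketched in a few lines. The Euclidean distance and the (label, feature) gallery layout are illustrative assumptions; in practice the features would come from a re-id network, and the distance metric varies by model.

```python
import math

def rank_gallery(query_feat, gallery):
    """Rank gallery entries by feature distance to the query q, as in the
    re-identification primitive: lower distance means higher similarity,
    so the best match comes first. `gallery` is a hypothetical list of
    (label, feature_vector) pairs."""
    def dist(a, b):
        # Euclidean distance between two feature vectors (one common choice).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(gallery, key=lambda item: dist(query_feat, item[1]))
```

A system then treats the top-ranked entries (or those under a distance threshold) as candidate co-identical instances; each frame that feeds this gallery costs one detection pass plus one feature extraction, which is exactly the per-frame work the spatio-temporal pruning avoids.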
Consistent with existing deployments [23, 29, 47], our focus is on "edge" computation of video analytics. In our setup, all the cameras are in a high-speed local network with sufficient bandwidth to an edge compute box (e.g., Azure Data Box Edge [3]) that is managed by the enterprise that has deployed the cameras. For example, cameras in an office building are analyzed in an edge box located in the same building. Traffic cameras in a city are analyzed in the local traffic command center [45]. Videos are streamed to this edge box, and the pipeline modules (§2.2), including object detection and re-id, are run on this edge. Reducing the compute load enables more video feeds to be processed on the edge box, or alternatively reduces the resources to be provisioned.

Our ideas also readily apply to a network of AI cameras (as we implement and deploy in §7), each of which has on-board compute, accelerators (e.g., GPUs), and storage [13, 53]. Our techniques will enable each camera to be provisioned with much lower resources, thus lowering their cost.
We analyze the potential of using spatio-temporal correlations for cross-camera video analytics using the DukeMTMC dataset [55]. We study cross-camera identity tracking, which involves tracking an object of interest, in real time, through a camera network. In particular, given an instance of a query identity q (e.g., a person) flagged in camera c_q at frame f, we return all subsequent frames, across all cameras, in which q appears as it moves around. We measure the reduction in compute, i.e., the number of frames on which object detection and re-id operations (§2.2) are executed.

We now present an empirical study to quantify the cross-camera correlations in the DukeMTMC dataset [55], one of the most popular benchmarks in computer vision for person re-id and tracking [60, 67]. This quantification motivates our design of a video analytics system that leverages such correlations to improve the performance of cross-camera analytics. The DukeMTMC dataset contains footage from eight cameras placed on the Duke University campus (see Figure 3), in an area with significant pedestrian traffic. The cameras' fields of view mostly do not intersect, but the cameras are placed close enough that people frequently appear in multiple cameras, as is typical in camera deployments. The dataset contains over 2,700 unique identities across 85 minutes of footage, recorded at 60 frames per second [55].
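The pairwise "traffic" statistics used in this study can be estimated with a simple aggregation over per-identity sighting logs. This is a minimal sketch; the input format (identity → time-ordered list of camera ids) is a hypothetical simplification of what an offline tracker would emit.

```python
from collections import Counter, defaultdict

def traffic_percentages(sightings):
    """Estimate the pairwise traffic matrix: for each source camera, the
    percentage of individuals whose *next* sighting is at each destination
    camera. `sightings` maps identity -> time-ordered list of camera ids
    (hypothetical log format). Consecutive same-camera sightings are not
    cross-camera transitions and are skipped."""
    counts = defaultdict(Counter)
    for cams in sightings.values():
        for src, dst in zip(cams, cams[1:]):
            if src != dst:
                counts[src][dst] += 1
    return {src: {dst: 100.0 * n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}
```

Note this counts only the immediate next camera, matching the definition of traffic used below: an individual moving A → C → B contributes to A → C and C → B, not to A → B.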
Cross-camera movement of individuals (or "traffic") demonstrates a high degree of spatial correlation. Here, "traffic" between cameras A and B is defined as the set of unique individuals detected in camera A that are next detected in camera B. (Note that a person who moves from A to B via camera C is excluded from the traffic count of A → B and is instead included in the A → C traffic count.) We find that individuals seen at a camera c_q move next to only a small number of c_q's peer cameras. On the 8-camera DukeMTMC dataset, only 1.9 of 7 potential peer cameras, on average, receive even 5% of the total outbound traffic (individuals) from a given camera. Figure 4 shows the full pair-wise spatial correlations.

Figure 3: DukeMTMC camera network [55]. Marked regions show the visual field of view of each camera.

Figure 4: Spatial correlations in the DukeMTMC dataset [55]. Cells display the % of outbound traffic (individuals) from each camera that appears at other cameras. Each row corresponds to a particular source camera and each column to a destination camera; each row's values add up to 100%. The final column represents traffic that exits the camera network.

Exploiting this insight can significantly reduce our compute workload, at little cost to accuracy, when searching for a query identity q (e.g., a person) that was first detected in camera c_q. In comparison to a scheme that searches all n − 1 peer cameras, searching only the peers that receive at least 5% of the traffic from c_q reduces our compute by 3.7× (we search only 1.9 cameras instead of 7, or 3.7× fewer frames on which to run object detection and re-id; see §2.2), while still capturing 95% of all detections as per our experiments.

An interesting aspect is that geographical proximity is not necessarily a good spatial filter. Consider camera 5 (Figure 4), out of which a significant fraction of individuals (traffic) go to cameras 2 and 6 but not to