Scanner: Efficient Video Analysis at Scale
ALEX POMS, Carnegie Mellon University
WILL CRICHTON, PAT HANRAHAN, and KAYVON FATAHALIAN, Stanford University
A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires efficient systems for pixel data access and parallel processing across large numbers of machines. Few programmers have the capability to operate efficiently at these scales, limiting the field's ability to explore new applications that leverage big video data. In response, we have created Scanner, a system for productive and efficient video analysis at scale. Scanner organizes video collections as tables in a data store optimized for sampling frames from compressed video, and executes pixel processing computations, expressed as dataflow graphs, on these frames. Scanner schedules video analysis applications expressed using these abstractions onto heterogeneous throughput computing hardware, such as multi-core CPUs, GPUs, and media processing ASICs, for high-throughput pixel processing. We demonstrate the productivity of Scanner by authoring a variety of video processing applications including the synthesis of stereo VR video streams from multi-camera rigs, markerless 3D human pose reconstruction from video, and data-mining big video datasets such as hundreds of feature-length films or over 70,000 hours of TV news. These applications achieve near-expert performance on a single machine and scale efficiently to hundreds of machines, enabling formerly long-running big video data analysis tasks to be carried out in minutes to hours.
CCS Concepts: • Computing methodologies → Graphics systems and interfaces; Image processing.

Additional Key Words and Phrases: large-scale video processing
ACM Reference format:
Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. 2018. Scanner: Efficient Video Analysis at Scale. ACM Trans. Graph. 37, 4, Article 138 (August 2018), 14 pages. https://doi.org/10.1145/3197517.3201394
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3197517.3201394.

The world is increasingly instrumented with sources of video: cameras are commonplace on people (smartphone cameras, GoPros), on vehicles (automotive cameras, drone videography), and in urban environments (traffic cameras, security cameras). Extracting value from these high-resolution video streams is a key research and commercial challenge, and a growing number of applications in fields like computer graphics, vision, robotics, and basic science are based on analyzing large amounts of video.

The challenge is that scaling video analysis tasks to large video collections (thousands of hours of cable TV or YouTube clips, the output of a modern VR video capture rig) requires optimized systems for managing pixel data as well as efficient, parallel processing on accelerated computing hardware (clusters of multi-core CPUs, GPUs, and ASICs). Unfortunately, very few programmers have the skill set to implement efficient software for processing large video datasets, inhibiting the field's ability to explore new applications that leverage this data. Inspired by the impact of data analytics frameworks such as MapReduce [Dean and Ghemawat 2004] and Spark [Zaharia et al. 2010], which facilitate rapid development of scalable big-data analytics applications, we have created Scanner, a system for productive and efficient big video data analysis.
Scanner provides integrated system support for two performance-critical aspects of video analysis: storing and accessing pixel data from large video collections, and executing expensive pixel-level operations in parallel on large numbers of video frames. Scanner addresses the first need by organizing video collections and derived raster data (depth maps, activation maps, flow fields, etc.) as tables in a data store whose implementation is optimized for compressed video. It addresses the second need by organizing pixel-analysis tasks as dataflow graphs that operate on sequences of frames sampled from tables. Scanner graphs support features useful for video processing, such as sparse sampling of video frames, access to temporal windows of frames, and state propagation across computations on successive frames. Scanner schedules these computations efficiently onto heterogeneous computing hardware such as multi-core CPUs, GPUs, and media processing ASICs.

We demonstrate that applications using Scanner for expensive, pixel-level video processing operations achieve near-expert performance when deployed on workstations with high core count CPUs and multiple GPUs. The same applications also scale efficiently to hundreds of machines without source-level change. We report on experiences using Scanner to implement several large-scale video analysis applications including VR video processing, 3D human pose reconstruction from multi-viewpoint video, and data mining large video datasets of TV news. In these cases, Scanner enabled video analysis tasks that previously required days of processing (when implemented by researchers and data scientists using ad hoc solutions) to be carried out efficiently in hours to minutes. Scanner is available as open-source code at https://github.com/scanner-research/scanner.
Executing pixel-analysis pipelines (e.g., feature extraction, face/object detection, image similarity and alignment) on large image collections is the performance-critical component of many big visual data applications such as data-driven image manipulation and enhancement [Hays and Efros 2007; Kemelmacher-Shlizerman 2016], novel techniques for organizing and browsing photo collections [Sivic et al. 2008; Snavely et al. 2006], and exploratory data mining of the visual world [Chen et al. 2013; Doersch et al. 2012; Ginosar et al. 2017; Matzen et al. 2017; Zhu et al. 2014]. While these early applications analyzed collections of images, a growing class of applications now seek to manipulate large video datasets. To better understand the challenges and requirements of these video analysis workloads, we selected a diverse set of video analysis applications to guide the design of Scanner. Fig. 1 summarizes the structure of these applications, which are implemented in Scanner and evaluated at scale in Section 5.2.
Fig. 1. We have implemented a set of video analysis applications in Scanner by expressing key pixel processing operations as dataflow graphs (Section 3.2). Each application contributes unique challenges to Scanner's design, such as stateful processing, combining information across video streams, sparse frame access, and the need to process large numbers of video clips. Image credit, left to right: The Rachel Maddow Show © MSNBC 2015-2017, "Palace of Fine Arts Take 1" © Facebook 2017, "Run 5K" clip (top) and Figure 1 (bottom) from [Joshi et al. 2015], "160422_mafia2" scene from [Joo et al. 2016].
Large-scale video data mining.
Many applications now seek to perform labeling and data-mining of large video collections. Examples include autonomous vehicle development [Bojarski et al. 2016], surveillance, smart-city monitoring, and everyday egocentric video capture [Singh et al. 2016]. These computations require both traditional computer vision operations (optical flow, object tracking, etc.) and DNN inference (object detection, frame segmentation, activity recognition) to be executed on millions to billions of video frames. To keep costs manageable, it is common to sparsely sample frames from the video (e.g., every n-th frame, or a list of frames likely to contain interesting objects). In Section 5.2.3 we report on experiences labeling and data mining two large video datasets: a dataset containing over 600 feature-length films (106 million frames) and a dataset of 70,000 hours of TV news (12 billion frames, 20 TB).

VR video synthesis.

Software for generating omnidirectional stereo (ODS) video, 360-degree stereo panoramas, provides a solution for authoring VR video. We ported the Surround 360 pipeline [Facebook 2017] for producing ODS video from 14 synchronized 2K video streams. This application involves per-frame operations (warping input frames to a spherical projection), cross-video-stream operations (depth estimation between frames from adjacent cameras), within-stream frame-to-frame dependencies (stateful temporal smoothing of computed flow fields), and the ability to output a final compressed high-resolution video stream. Surround 360 processing is computationally intense; it can take over twelve seconds to produce a single output frame on a 32-core server. The Jump VR Video processing pipeline has similar characteristics [Anderson et al. 2016].
Hyperlapse generation.
Hyperlapses are stabilized timelapse videos synthesized from long videos captured with moving cameras. The challenge of generating a high-quality hyperlapse involves selecting source video frames that approximate a desired timelapse playback speed while minimizing apparent camera movement. We have implemented two variants of the frame-selection computation described by Joshi et al. [2015], which performs SIFT feature extraction and matching over sliding windows of frames from a video stream (temporal stencil computations).
3D human pose estimation.
Recent computer vision advances make it possible to estimate temporally consistent human joint locations from dense multi-viewpoint video. This offers the promise of markerless human motion capture, even in high-occlusion scenarios, but comes at the cost of processing many video streams. For example, human motion capture sessions from the CMU Panoptic Dataset [Joo et al. 2015] feature 480 synchronized streams of 640 × 480 video (see visualization in Fig. 12). The dominant cost of a top-performing method for 3D pose reconstruction from these streams [Joo et al. 2016] involves evaluating a DNN on every frame of all streams to estimate 2D pose. The 2D poses are subsequently fused to obtain a 3D pose estimate.
Scanner's goal is to enable rapid development and scaling of applications such as those described above. This required a system with abstractions flexible enough to span a range of video analysis tasks, yet sufficiently constrained to allow efficient, highly parallel implementations. Specifically, our experiences implementing the applications in Section 2.1 suggest that the size and temporal nature of video introduce several unique system requirements and challenges:
Organize and store compressed video.
Managing tens of thousands of video clips, as well as per-frame raster products derived from their analysis (e.g., multiple resolutions of frames, flow fields, depth maps, feature maps, etc.), can be tedious and error prone without clear abstractions for organizing this data. The relational data model [Codd 1970] provides a natural representation for organizing video collections (e.g., a table per video, a row per video frame); however, we are not aware of a modern database system optimized for managing, indexing, and providing efficient frame-level access to data stored compactly using video-specific compression (e.g., H.264). While some applications require video data to be maintained in a lossless form, in most cases it is not practical to store large video datasets as individual frames (even if frames are individually compressed). Video collections can fill TBs of storage even when encoded compactly using video-specific compression schemes. Ignoring inter-frame compression opportunities can increase storage footprint by an order of magnitude or more.
Support a flexible set of frame access patterns.
Video compression schemes are designed for sequential frame access during video playback; however, video analysis tasks exhibit a rich set of frame access patterns. While some applications access all video frames, others sample frames sparsely, select frame ranges, operate on sliding windows (e.g., optical flow, stabilization), or require joining frames from multiple videos (e.g., multi-view stereo). A system for video analysis must provide a rich set of streaming frame-level access patterns and implement these patterns efficiently on compressed video representations.
Support frame-to-frame (temporal) dependencies.
Reasoning about a sequence of video frames as a whole (rather than considering individual frames in isolation) is fundamental to algorithms such as object tracking, optical flow, or activity recognition. Sequence-level reasoning is also key to achieving greater algorithmic efficiency when executing per-frame computations on a video stream, since it is possible to exploit frame-to-frame coherence to accelerate analysis. Therefore, the system must permit video analysis computations to maintain state between processing of frames, but also constrain frame-to-frame dependencies to preserve opportunities for efficient data streaming and parallel execution.
Schedule pixel-processing pipelines (with black-box kernels) onto heterogeneous, parallel hardware.
Authoring high-performance implementations of low-level image processing kernels (e.g., DNN evaluation, feature extraction, optical flow, object tracking) is difficult, so application developers typically construct analysis pipelines from pre-existing kernels provided by state-of-the-art performance libraries (e.g., cuDNN, OpenCV) or synthesized by high-performance DSLs (e.g., Halide [Ragan-Kelley et al. 2012]). Therefore, a video analysis system must assume responsibility for automatically scheduling these pipelines onto parallel, heterogeneous machines, and orchestrate efficient data movement between kernels. (The 3D human pose reconstruction pipeline presented in Section 5.2.1 involves computation on the CPU, GPU, and video decoding ASICs.) Although a single system for both kernel code generation and distributed execution provides opportunities for global optimization, it is not practical to force applications to use a specific kernel code generation framework. For reasons of productivity and performance, Scanner should minimally constrain which third-party kernels applications can use.
Scaling video analysis.
Designing abstractions to address the above challenges is difficult because they must also permit an implementation which is able to scale from a workstation packed with GPUs underneath a researcher's desk to a cluster of thousands of machines, and from a dataset of a few 4K video streams to millions of 480p videos. Specifically, our examples require Scanner to scale in a number of ways:

• Number of videos. Scanner applications should scale to video datasets of arbitrary size (in our cases: millions or billions of frames), consisting of either long videos (many feature-length films or long-running vehicle capture sessions) or a large number of short video clips (e.g., millions of YouTube video clips).

• Number of concurrent video streams. We seek to handle applications that must process and combine a large number of video streams capturing a similar subject, scene, or event, such as the VR video (14 streams) and 3D pose reconstruction (480 streams) applications discussed in Section 2.1. Scanner should accelerate computationally intensive pipelines to enable processing these streams at near-real-time rates.

• Number of throughput-computing cores. Scanner applications should efficiently utilize throughput computing hardware (multi-core CPUs, multiple GPUs, media processing ASICs, and future DNN accelerators [Jouppi et al. 2017]) to achieve near-expert performance on a single machine, and also scale out to large numbers of compute-rich machines (thousands of CPUs or GPUs) with little-to-no source-level change.

We have designed Scanner to address these challenges. When our goals of productivity, scope, and performance conflict, we opted in favor of maintaining a scalable and performant system. This philosophy resulted in a number of clear non-goals for Scanner. For example, Scanner does not seek to aid with processing the results of pixel- or feature-level analysis (image metadata, object labels, histograms, etc.). Post-processing these smaller derived data sets often involves a diverse set of algorithms that are well supported by existing data analysis frameworks. Also, Scanner does not seek to define its own programming language for authoring high-performance kernel functions. Many domain-specific programming frameworks exist for this purpose today, and Scanner aims to inter-operate with and augment these best-in-class tools, not replicate their functionality.
In this section we describe the primary abstractions used to construct Scanner applications. Scanner adopts two dataflow programming concepts familiar to users of existing data analytics frameworks and stream processing systems [Abadi et al. 2016; Chen et al. 2015; Dean and Ghemawat 2004; Zaharia et al. 2010], but extends and implements these concepts uniquely for the needs of efficient video processing.
Videos as logical tables.
Scanner represents video collections and the pixel-level products of video frame analysis (e.g., flow fields, depth maps, activations) as tables in a data store. Scanner's data store features first-class support for video frame column types to facilitate key performance optimizations.
Video processing operations as dataflow graphs.
Scanner structures video analysis tasks as dataflow graphs whose nodes produce and consume sequences of per-frame data. Scanner's embodiment of the dataflow model includes operators useful for common video processing tasks such as sparse frame sampling, stenciled frame access, and stateful processing across frames.
Fig. 2. Scanner computation graphs (blue) operate on sequences of per-frame data extracted from data store tables (tan), and produce outputs that are stored as new tables (pink). This graph performs expensive face detection every 10th frame, and uses these detections to seed an object tracker run on each frame. The application code shown in the figure:

    videos = ['vid_00.mp4',...,'vid_99.mp4']
    db = scanner.Database()
    video_tables = db.ingest_videos(videos)

    frame = db.ops.FrameInput()
    sparse_frames = frame.stride(10)
    resized = db.ops.Resize(
        frame=sparse_frames, width=496, height=398, device=GPU)
    detections = db.ops.DNN(
        frame=resized, model='face_dnn.prototxt', batch=8, device=GPU)
    frame_detections = detections.space(10)
    face_bboxes = db.ops.Track(
        frame=frame, detections=frame_detections, warmup=20, device=CPU)

    jobs = []
    for table in video_tables:
        jobs.append(Job(op_args={
            frame: table.column('frame'),
            face_bboxes: table.name + '_faces'
        }))
    pose_tables = db.run(
        BulkJob(outputs=[face_bboxes], jobs=jobs))
Fig. 2 illustrates a simple video analysis application (implemented using Scanner's Python API) that annotates a video with bounding boxes for the faces in each frame.

First, the application ingests a collection of videos into the Scanner data store, shown in yellow. Logically, each video is represented by a table, with one row per video frame. In the example, ingest produces 100 tables, each with 18,000 rows, corresponding to 10-minute 30 FPS videos. The Scanner data store provides first-class support for table columns of video frame type, which facilitates compact storage and efficient frame-level access to compressed video data (Section 4.3). (See supplemental material for additional detail on how first-class video support enables Scanner's storage formats to be optimized to specific access patterns without needing application-level change.)

Next, the application defines a five-stage computation graph that specifies what processing to perform on the video frames (code shaded in blue). Since accurate face detection is costly, the application samples every 10th frame from the input video (Stride), downsamples the resulting frames (Resize), then evaluates a DNN to detect faces in each downsampled frame to produce a per-frame list of bounding boxes (DNN). The 3 FPS (sparse-in-time) detections are then re-aligned (Space) with the original high-resolution, 30 FPS image sequence from the data store, and used to seed an object tracker (Track) that augments the original detections with additional detections produced by tracking on the original frames. The computation graph outputs a sequence of per-frame face bounding boxes that is stored as a new table with a column named face_bboxes.

A Scanner job specifies a computation graph to execute and the tables it consumes and produces. In this example, the application defines one job for each video (code shaded in pink). Scanner automatically schedules all jobs onto a target machine (potentially exploiting parallelism across jobs, frames in a job, and computations in the graph), resulting in the creation of new database tables (shown to the right in pink in Fig. 2). After using Scanner to perform the expensive pixel processing operations on video frames, an application typically exports results from Scanner, and uses existing data analysis frameworks to perform less performance-critical post-processing of the face bounding box locations.
Scanner applications express video processing tasks in the dataflow model by defining computation graphs. For consistency with [Abadi et al. 2016; Chen et al. 2015], we refer to graph nodes, which define stages of computation, as operations. Graph edges are sequences whose elements contain per-frame data communicated between operations. Figures 1 and 2 illustrate Scanner computation graphs for our example applications. These graphs range from simple pipelines defining stages of processing on a single video to complex DAGs of many operations on multiple input video streams.
Sequences.
Scanner sequences are finite-length, 1D collections that are streamed element-by-element (or in small batches) to graph operations [Buck et al. 2004; Thies et al. 2002; Zaharia et al. 2010]. Each element in a length-N sequence is associated with a point in the [0, N) domain. It is typical for sequence elements in Scanner applications to be video frames, or derived structures produced by graph operations, such as transformed images, flow fields, depth maps, or frame metadata (e.g., lists of per-frame object bounding boxes).

Graph Operations.
A major challenge in Scanner's design was selecting a set of graph operations that could be composed to express a rich set of video processing applications, but was sufficiently constrained to enable a streaming, data-parallel implementation. Scanner supports the following classes of graph operations, which are characterized by their input stream access patterns, and whether state is propagated between invocations on consecutive stream elements.
Fig. 3. Scanner analyzes operation dependencies to reduce computation during graph execution. White boxes denote elements of a sequence, and are labeled with their corresponding sequence domain point. Black boxes denote the execution of a graph operation on an element. Grayed elements are not required to produce the graph's required outputs and need not be computed. Panels: (a) map (with element batching), (b) stencil, (c) strided sampling, (d) strided spacing, (e) dense strided stencil, (f) sparse strided stencil, (g) bounded state (warmup = 2), (h) bounded state after slice.
Maps.
Scanner operations may be mapped (Fig. 3-a) onto input sequences or onto multiple sequences of the same length (e.g., resizing an input frame or evaluating a DNN to generate per-frame activations).
Sampling/spacing operations.
Sampling and spacing operations (Fig. 3-c,d) modify the length of sequences by selecting a subset of elements from the input sequence (sampling) or adding "fill" elements to it (spacing). Sampling operations enable computation on a sparse set of frames for computational efficiency or when specific frames must be selected for processing. For example, sampling every 30th row from a table representing a one-minute-long, 30 FPS video (1800 frames) yields a length-60 sequence representing the video sampled at one frame per second. Spacing operations invert sampling and are used to align sequences representing data sampled at different frame rates. For example, in Fig. 2 a spacing operation was used to convert face detections computed at 3 FPS back into a 30 FPS stream. Both sampling and spacing operations can be defined by strides, ranges, or index lists.
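To make the index arithmetic concrete, the following self-contained Python sketch (an illustration of the mapping described above, not Scanner code) shows how strided sampling selects input indices and how a subsequent spacing operation re-aligns the sparse results with the original 30 FPS timeline:

    # Illustrative only: plain Python mirroring the stride/space semantics
    # described above, not Scanner's implementation.

    def stride_sample(n_frames, stride):
        """Indices selected by a stride sampling operation."""
        return list(range(0, n_frames, stride))

    def space(sparse_values, stride, n_frames, fill=None):
        """Invert stride sampling: place sparse results back on the dense
        timeline, inserting 'fill' elements in between."""
        dense = [fill] * n_frames
        for i, v in enumerate(sparse_values):
            dense[i * stride] = v
        return dense

    n_frames = 1800                        # one minute of 30 FPS video
    sampled = stride_sample(n_frames, 30)  # 60 elements, one per second
    results = ['result_%d' % i for i in sampled]  # stand-in per-frame outputs
    aligned = space(results, 30, n_frames)        # back to 1800 elements
    assert len(sampled) == 60 and len(aligned) == n_frames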
Stencil operations.
Stencil operations gain access to a window of elements from the input sequence defined by a constant-offset stencil. For example, the optical flow operation in Fig. 3-b requires elements i and i + 1 (the stencil is denoted by S[0,1] next to the operation). Composing stencil and sampling operations yields a rich set of frame access patterns. For example, performing stride-N sampling prior to optical flow with stencil (i, i + 1) yields flow vectors computed on a low-frame-rate video sequence (Fig. 3-f), whereas sampling after the flow operation yields a sparse set of flow fields computed from differences between original video frames (Fig. 3-e).

Bounded State Operations.
Video processing requires operations that maintain state from frame to frame, either because it is fundamental to the operation being performed (e.g., tracking) or as a compute optimization when there is little frame-to-frame change. However, if unconstrained, stateful processing would force serialization of graph execution. As a compromise, Scanner allows stateful operations, but limits the extent to which the processing of one sequence element can affect processing of later ones. Specifically, Scanner guarantees that prior to invoking an instance of a bounded state operation to generate output element i, the operation will have previously been invoked to produce at least the previous W elements of its output sequence. (The "warmup" value W is provided to Scanner by the stateful operation.) As a result, the operation is guaranteed that effects of processing element i will be visible when processing elements (i + 1), ..., (i + W) (Fig. 3-g: horizontal arrows). In Figs. 1 and 3, we denote the warmup size of bounded state operations (in elements) using the notation W(). An operation may have an infinite warmup, indicating that it must process input sequences serially (zero parallelism).

Warmup allows operations to benefit from element-to-element state propagation, while the bound on information flow provides Scanner flexibility to parallelize stateful operators at the cost of a small amount of redundant computation. For example, it is valid to execute a bounded state operation (W = 2) with a length-100 output sequence by producing output elements [0, 50) on one machine independently from elements [48, 100) on a second. Scanner automatically discards warmup elements 48 and 49 from the second worker (it does not include them in the output sequence), although effects of their processing may impact the value of subsequent elements (e.g., 50) generated by this worker. Bounded state operations use warmup to approximate the output of unbounded (fully serial) stateful execution when the influence of an operation's effects is known to be localized in the video stream. For example, warmup of a few elements can be used to prime an object tracker prior to producing required outputs, or to minimize temporal discontinuities in the outputs of a stateful operation at the boundary of two independently computed regions.
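A small, self-contained sketch of the work-splitting arithmetic described above (plain Python for illustration, not Scanner's scheduler): each worker's assignment is extended backward by the warmup amount W, and only the non-warmup elements are kept as required output.

    # Illustrative only: split a length-n output sequence into packets,
    # extending each packet backward by the warmup W as described above.

    def warmup_packets(n, packet_size, warmup):
        packets = []
        for start in range(0, n, packet_size):
            end = min(start + packet_size, n)
            compute_start = max(0, start - warmup)    # extra warmup elements
            packets.append({
                'compute': range(compute_start, end), # elements the worker runs
                'emit': range(start, end),            # elements kept as output
            })
        return packets

    # Matches the example above: W = 2, length-100 output, two packets of 50.
    for p in warmup_packets(100, 50, 2):
        print(list(p['compute'])[:2], '->', list(p['emit'])[:2])
    # The first packet computes and emits [0, 50); the second computes
    # [48, 100) but emits only [50, 100), discarding warmup elements 48 and 49.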
Slicing/unslicing operations.
Slicing and unslicing operations insert and remove boundaries that affect stenciling and state propagation in a sequence. For example, slicing a video sequence at intervals according to shot boundaries would reset stencil operation access patterns and stateful processing to avoid information flow between shots (Fig. 3-h illustrates the use of slicing to partition a sequence into three independent slices). Unslicing removes these boundaries for all subsequent operations. Slicing and unslicing can be viewed as a constrained form of sequence nesting.
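A brief continuation of the previous sketch shows how a slice boundary bounds warmup (again plain Python, simplified; not Scanner code): within each slice, a bounded state operation's warmup is clamped to the start of the slice, so no elements from a previous shot are consumed.

    # Illustrative only: warmup dependencies clamped at slice boundaries.

    def warmup_range(i, W, slice_starts):
        """Elements that must precede output i for warmup W, never crossing
        the most recent slice boundary."""
        slice_start = max(s for s in slice_starts if s <= i)
        return range(max(slice_start, i - W), i)

    # Three slices of sizes 4, 1, and 3 (cf. Fig. 3-h): boundaries at 0, 4, 5.
    slice_starts = [0, 4, 5]
    print(list(warmup_range(6, 2, slice_starts)))  # [5]: warmup stops at the slice start
    print(list(warmup_range(3, 2, slice_starts)))  # [1, 2]: fully inside the first slice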
Computation Graph Limitations.
Scanner's design constrains the data flow expressible in computation graphs to permit two performance-critical graph scheduling optimizations: parallel graph execution and efficient graph scheduling in conditions of sparse sampling (Section 4). For similar reasons, Scanner currently disallows computation graphs with loops or operations that perform data-dependent filtering (discarding elements that do not pass a predicate) or amplification. Although Scanner operations are not provided mechanisms for dynamically modifying sequence length, sequence elements can be of tuple or list type (e.g., operations can produce variable-length lists of face bounding boxes per frame).
Consistent with the goals from Section 2, Scanner does not provide mechanisms for defining the implementation of graph operations. With the exception of system-provided sampling, spacing, and slicing operations, Scanner operation definitions are implemented in 3rd-party languages, externally compiled, and exposed to applications as Scanner graph operations using an operation definition API inspired by that of modern dataflow frameworks for machine learning [Abadi et al. 2016; Chen et al. 2015]. In the face detection example from Fig. 2, Resize is implemented in Halide, DNN by the Caffe library [Jia et al. 2014] in CUDA, and Track as multi-threaded C++. For bounded state operations, the allocation and management of mutable state carried across invocations is encapsulated entirely within the operation's definition and is opaque to Scanner (e.g., internal object tracker state).

Although Scanner is oblivious to the details of an operation's implementation, to facilitate efficient graph scheduling, all operations must declare their processing resource requirements (e.g., requires a GPU, requires N CPU cores) and data dependencies (warmup amount for stateful operations, stencil offsets for stencil operations) to Scanner. For efficiency, Scanner also supports operations that generate a batch of output elements (rather than a single element) per invocation (e.g., DNN inference on a batch of frames). We denote the batch size of operations as B() in computation graph illustrations.
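As a rough sketch of the kind of metadata an operation must declare, the snippet below bundles a kernel with resource and dependency declarations. The class and field names are hypothetical illustrations of the information listed above, not Scanner's actual operation definition API.

    # Hypothetical sketch: the declarations Scanner needs from an operation
    # (device placement, CPU core count, stencil offsets, warmup, batch size).
    # Names are illustrative, not Scanner's real API.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class OpSpec:
        name: str
        kernel: Callable            # externally compiled kernel entry point
        device: str = 'CPU'         # 'CPU' or 'GPU'
        cpu_cores: int = 1          # cores reserved per graph instance
        stencil: List[int] = field(default_factory=lambda: [0])  # e.g., [0, 1]
        warmup: int = 0             # W, for bounded state operations
        batch: int = 1              # elements produced per invocation

    # Example mirroring the Track operation of Fig. 2: a CPU tracker with a
    # warmup of 20 elements (the 4-core requirement is an assumed value).
    track_spec = OpSpec(name='Track',
                        kernel=lambda frames, dets: dets,  # placeholder kernel
                        device='CPU', cpu_cores=4, warmup=20)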
Scanner jobs are executed by a high-performance runtime that provides applications high-throughput access to video frames and efficiently schedules computation graphs onto a parallel machine. While aspects of Scanner's implementation constitute intelligent application of parallel systems design principles, the challenges of efficiently accessing compressed video data and executing compositions of sampling, stenciling, and bounded state graph operations led to unique implementation choices detailed here.
The Scanner scheduler is responsible for efficiently distributing Scanner jobs onto the parallel processing resources within a machine and across large clusters of machines. Scanner implements data-parallel execution in the presence of stateful kernels by spawning multiple instances of the computation graph. In each instance, bounded state graph operations can maintain mutable state buffers, and all graph operations can preallocate a unique copy of read-only buffers (e.g., DNN weights, lookup tables).
Fig. 4. Left: Scanner creates multiple computation graph instances to process sequence elements in parallel. Here, three instances of the pose estimation graph (from Fig. 1-f) are distributed to single-GPU (left) and dual-GPU (right) machines. Instances of I/O and video decode stages that deliver data to and from application-defined graphs are shown in gray. Right: Scanner streams data through an execution graph at different bulk granularities to maximize data movement throughput and keep memory footprint low.
Scanner determines the maximum number of instances that can be created per machine by querying graph operations for their resource requirements, then maximizes parallelism without oversubscribing the machine. Fig. 4-left depicts a heterogeneous cluster of two machines, each containing an eight-core CPU and at least one GPU (worker 1 contains a single GPU, and worker 2 has two GPUs). To map the three-stage pose estimation pipeline (Fig. 1-f), which contains graph operations that require GPU execution and one operation that requires four CPU cores, onto this cluster, Scanner creates one computation graph instance on worker 1 and two instances of the pipeline on worker 2.

Scanner computation graphs can be statically analyzed to determine each sequence element's dependencies before graph execution. This allows Scanner to partition the elements of a job's output sequence into smaller work packets without violating graph operation dependencies. Work packets are then distributed to computation graph instances, enabling parallelization within a single video and better load balancing (evaluated in Section 5.1.4). In addition to parallel work distribution, the Scanner runtime provides fault tolerance by automatically reassigning and restarting individual work packets (not entire jobs) assigned to failed workers. Scanner also distributes work to new worker machines that are added to a cluster while a job is running (supporting elasticity).

Scanner implements many common throughput-computing optimizations to sustain high-performance graph execution on machines with many cores and multiple GPUs. These include bulk transfer of sequence data between the data store and video decoders (particularly important in high-latency cloud storage scenarios), bulk-granularity time-multiplexing of graph operations onto available machine compute resources, pipelining of CPU-GPU data transfers
and data store I/O with graph operation execution (Fig. 4-right), and using custom GPU memory pools to reduce driver entry-point contention in multi-GPU environments.

In addition to processing work packets in parallel using multiple graph instances (data parallelism), Scanner also parallelizes computation within each graph instance by executing operations simultaneously on different CPU cores and GPU devices (pipeline parallelism). Scanner's current implementation does not distribute the execution of a single graph instance across different machines. (We have not yet encountered applications that benefit from this functionality.) Multi-field elements are provided to operations in struct-of-arrays format to enable SIMD processing by batched operations without additional data shuffling. The granularities of bulk I/O (I/O packet size) and parallel work distribution (work packet size) are system parameters that can be tuned manually by a Scanner application developer to maximize performance, although auto-tuning solutions are possible. We evaluate the benefit of each of these key runtime optimizations in Section 5.1.3.
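To illustrate the instance-count calculation described above (how many computation graph instances fit on a machine without oversubscribing it), here is a small, self-contained Python sketch; the resource bookkeeping is deliberately simplified and is not Scanner's scheduler.

    # Illustrative only: compute how many computation graph instances fit on
    # a machine, given each operation's declared resource requirements.

    def max_instances(machine, op_requirements):
        """machine: {'cpu_cores': int, 'gpus': int}
        op_requirements: list of {'cpu_cores': int, 'gpus': int}, one per op."""
        cpus = sum(op.get('cpu_cores', 0) for op in op_requirements)
        gpus = sum(op.get('gpus', 0) for op in op_requirements)
        limits = []
        if cpus > 0:
            limits.append(machine['cpu_cores'] // cpus)
        if gpus > 0:
            limits.append(machine['gpus'] // gpus)
        return min(limits) if limits else 1

    # Pose estimation pipeline from Fig. 1-f: one op needs 4 CPU cores, and the
    # GPU stages are lumped into a single 1-GPU requirement for simplicity
    # (an assumption made for this example).
    pipeline = [{'gpus': 1}, {'cpu_cores': 4}]
    print(max_instances({'cpu_cores': 8, 'gpus': 1}, pipeline))  # worker 1 -> 1
    print(max_instances({'cpu_cores': 8, 'gpus': 2}, pipeline))  # worker 2 -> 2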
Scanner's sequences are logically dense; however, when a computation graph contains sampling operations, only a sparse set of intermediate sequence elements must be computed to generate a job's required outputs. Since dependencies during graph execution do not depend on the values of sequence elements, Scanner determines which elements are required upfront through per-element graph dependency analysis. Interval analysis methods used to analyze stencil dependencies in image processing systems [Ragan-Kelley et al. 2013] are of little value when required graph outputs span the entire output domain but are sparse (for example, generating every N-th frame of an output sequence yields interval bounds that span the entire domain of all upstream sequences). Instead, given the set of output sequence points a job must produce, Scanner analyzes computation graph dependencies to determine the exact set of required points for all graph sequences. During graph execution, Scanner sparsely computes only the necessary sequence points. During dependency analysis, a bounded state operation with warmup size W is treated like a stencil operation with the footprint (i − W, ..., i − 1).

Fig. 3 illustrates the results of per-element dependency analysis for various example computation graphs. Gray boxes indicate sequence elements that are not required to compute the requested computation graph output elements and do not need to be computed by Scanner. Performing per-element dependency analysis to identify and eliminate unnecessary computation is unusual in a throughput-oriented system. However, Scanner graph operations typically involve expensive processing at the scale of entire frames, so the overhead of computing exact per-element liveness is negligible compared to the cost of invoking graph operations on elements that are not needed for the final job result.

To avoid the storage overhead of fully materializing lists of required sequence domain points, Scanner performs dependency analysis incrementally (at work packet granularity) as graph computation proceeds. Scanner also coalesces input sequence elements into dense batches to retain the efficiency of batch processing even when dependency analysis yields execution that is sparse.
Fig. 5. The Scanner data store maintains an index of keyframe locations for video frame columns. The index is used to reduce I/O and video decode work when accessing a sparse set of frames from the video.
For all stateless graph operations, sparse execution is a system implementation detail that does not influence the output of a Scanner application. It is valid, but inefficient, for Scanner to generate all sequence elements, even if they are never consumed. However, since prior invocations of a bounded state operation may impact future output, the values output by a bounded state operation may depend on which elements the Scanner runtime chooses to produce. (Different work distributions or conservative dependency analysis could yield different operation output.) However, Scanner applications are robust to this behavior since bounded state operations by definition are required to produce "acceptable" output provided their warmup condition is met.
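The following self-contained sketch illustrates the per-element dependency analysis described in this section: starting from the required output indices, each operation maps the indices it must produce back to the input indices it needs, with strided sampling, stencils, and warmup (treated as a backward-looking stencil) handled uniformly. This is a simplified illustration in plain Python, not Scanner's implementation.

    # Illustrative only: propagate required element indices backward through a
    # chain of operations (last operation first), as in the per-element
    # dependency analysis described above.

    def required_inputs(required_outputs, op):
        needed = set()
        for i in required_outputs:
            if op['kind'] == 'sample':       # stride sampling: output i <- input i*stride
                needed.add(i * op['stride'])
            elif op['kind'] == 'stencil':    # stencil offsets, e.g., [0, 1]
                needed.update(i + off for off in op['offsets'])
            elif op['kind'] == 'warmup':     # warmup W behaves like stencil (i-W, ..., i)
                needed.update(range(max(0, i - op['W']), i + 1))
            else:                            # map: one-to-one
                needed.add(i)
        return sorted(needed)

    # Stride-3 sampling followed by optical flow with stencil (i, i+1), as in
    # Fig. 3-f: to produce flow outputs 0..2, only a sparse set of the original
    # frames must be decoded.
    outputs = [0, 1, 2]
    flow_in = required_inputs(outputs, {'kind': 'stencil', 'offsets': [0, 1]})
    frames = required_inputs(flow_in, {'kind': 'sample', 'stride': 3})
    print(frames)  # [0, 3, 6, 9]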
Scanner presents the abstraction that videos are tables of individual frames, but internally stores video frame columns as compressed H.264 byte streams [Marpe et al. 2006] to minimize footprint and to reduce I/O. For example, the footprint of the 12 billion frame tvnews dataset (used in Section 5.2.3) is 20 TB when stored as H.264 byte streams, but exceeds 6 PB when expanded to an uncompressed N-D array of 24-bit pixels.

The cost of supporting compressed video storage in a system that must also support sparse frame-level data access is two-fold. First, the byte stream must be decoded on the fly prior to graph execution. Second, video decode involves inherently sequential computation since most frames are encoded as deltas on data in prior frames. Therefore, to materialize a requested video frame, a decoder must locate the preceding "keyframe" (the last self-contained frame in the bytestream) then decode all frames up to the requested frame.

To accelerate access and decode of individual frames, the Scanner data store maintains an index of the byte stream offsets of keyframes in video columns, similar to indices maintained by video container formats to support scrubbing [ISO/IEC 2015]. The data store uses this index to minimize the amount of I/O and decode performed when servicing a sparse set of frame requests. For example, consider the sequence of elements in Fig. 5. To process this sequence, Scanner loads bytes from storage beginning from the keyframe preceding frame 130 (at byte offset 4,840). Decoding begins at this point, and continues until frame 192. Then, decoder state is reset to keyframe 310, and the process continues. When frames must be decoded but are not required by graph execution (e.g., frames 131-133, 135-191), Scanner skips decoder post-processing (extracting frames from the decoder, performing format conversion, etc.).
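As an illustration of how a keyframe index reduces decode work, the following plain-Python sketch (not Scanner's decoder; the reset heuristic is simplified) plans where decoding must start for a sparse set of requested frames, using the keyframe positions from the Fig. 5 example.

    import bisect

    # Illustrative only: given keyframe positions and a sparse set of requested
    # frames, decide where decoding must (re)start for each request.

    def plan_decode(keyframes, requested):
        keyframes = sorted(keyframes)
        plan = []          # list of (decode_start, requested_frame) pairs
        last_decoded = -1  # last frame the decoder has already produced
        for f in sorted(requested):
            kf = keyframes[bisect.bisect_right(keyframes, f) - 1]
            if kf > last_decoded:
                last_decoded = kf - 1      # reset decoder state to this keyframe
            start = last_decoded + 1
            plan.append((start, f))        # decode frames [start, f], emit only f
            last_decoded = f
        return plan

    keyframes = [0, 120, 310, 340]         # keyframe positions as in Fig. 5
    requested = [130, 134, 192, 320, 321]
    for start, f in plan_decode(keyframes, requested):
        print('decode %d..%d, emit %d' % (start, f, f))
    # Decoding starts at keyframe 120 to reach frame 130, continues through 192,
    # then resets to keyframe 310 for frames 320 and 321.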
Fig. 6. When executing a single graph instance, Scanner's sparse video decode optimizations improve throughput compared to OpenCV baselines on both the CPU and GPU. Scanner further improves CPU decode throughput by using multiple graph instances to more efficiently utilize all CPU cores.
Scanner's data store implements a number of additional optimizations to maximize throughput, such as avoiding unnecessary reset of video decoder state when multiple required frames fall between two keyframes, and time-multiplexing decoders at bulk granularity to avoid unnecessary state resets when jobs draw video data from multiple tables. When available, Scanner also leverages ASIC hardware decode capabilities to accelerate video decode. For example, use of GPU-resident video decoding hardware frees programmable resources to execute other graph operations and also allows compressed video data to be communicated over the CPU-GPU bus.
The goal of Scanner is to create a system that is sufficiently expressive to enable a rich set of video processing applications while also maintaining high performance. We evaluated Scanner's performance in terms of the efficiency of video frame access, efficiency in scheduling computation graphs onto a single machine, and scalability of applications to large numbers of CPUs and GPUs and very large video datasets. We evaluated Scanner's utility and expressiveness by implementing the video analysis workloads from Section 2.1 and deploying them at scale.
One of Scanner's goals is to provide applications with high-throughput access to compressed video frames, even when requested access patterns are sparse. We evaluated Scanner's H.264 decode performance against an OpenCV baseline under a varying set of frame access patterns drawn from our workloads:

• stride-1. All video frames.

• stride-24. Every 24th frame.

• gather. A random list of frames that sparsely samples the video (0.25% of the video).

• range. Blocks of 2,000 consecutive frames, each spread out by 20,000 frames.

• keyframe. Only the keyframes from the video.

Figure 6 presents Scanner's decode throughput under these access patterns on a 2.2-hour, 202,525-frame, 1920 × 1080 video.

Fig. 7. Scanner executes graphs implemented using well-optimized kernels with nearly no overhead, matching or exceeding baseline implementations on both the CPU and GPU. Better orchestration of the compute graph produces modest improvements in hist and dnn.
The CPU version of this baseline delivers single-machine throughput that is similar to prior work on systems for large-scale video processing [Yang and Wu 2015]. For the CPU and GPU, we include results for a single graph instance to isolate the effect of sparse video decode optimizations. For the CPU, we also evaluate multiple graph instances to exploit Scanner's ability to decode different parts of the stream in parallel (we evaluate multiple graph instances on multiple GPUs in Section 5.1.3).

In all cases, Scanner's throughput matches or exceeds that of the baselines. For a single graph instance, Scanner realizes higher throughput than the baselines when frame access is sparse (as much as 17× on the GPU). This speedup comes from Scanner avoiding post-decode processing of frames which must be decoded but that are not needed for graph execution (Section 4.3). Scanner uses the machine's 16 CPU cores more efficiently when executing multiple graph instances (Multi-instance in Fig. 6) since multiple instances of the decoder run in parallel (in addition to the parallelization available in H.264 decode, which the baseline also exploits).

Even though Scanner's throughput can be higher than that of the CPU and GPU OpenCV baselines in sparse access scenarios, overall throughput (FPS) of sparse access is fundamentally lower. If an application is flexible in which frames it can sample, such as accessing only a video's keyframes (keyframe), it is possible to obtain higher throughput compared to other sparse access patterns (stride-24 or gather), particularly when decoding on the GPU.

In conjunction with video frame access, Scanner is also responsible for scheduling computation graphs of optimized kernels to machines with CPUs and GPUs. To test this, we chose three highly optimized kernels drawn from the applications in Section 2.1 and compared their native performance (when invoked from C++ and using OpenCV for video decode as in Section 5.1.1) to Scanner implementations using a single compute graph instance. These are:

hist. Compute and store the pixel color histogram for all frames (video decode bound). The histogram is computed via OpenCV's cv::calcHist / cv::cuda::histEven routines on the CPU/GPU respectively.

flow. Compute optical flow for all frames using a 2-frame stencil (OpenCV's CPU and GPU FarnebackOpticalFlow routines).

dnn. Downsample and transform an input frame, then evaluate the Inception-v1 image DNN [Szegedy et al. 2015] for all frames. Image transformation is performed in Halide and DNN evaluation is performed using Caffe [Jia et al. 2014].
Fig. 8. Scanner computation graphs can be scaled to machines featuring multi-core CPUs and multiple GPUs without code modification. On the CPU, Scanner improves utilization of the 16 cores by parallelizing across multiple frames. On the GPU, Scanner achieves near-linear speedup (at least 3.7×) when moving from one to four GPUs.
Fig. 9. Scanner's runtime optimizations result in a 5 to 19× speedup of three microbenchmarks on a four-GPU machine. Each microbenchmark benefits differently from the optimizations, but the combination of all optimizations produces the best performance.

Figure 7 presents the throughput of CPU and GPU versions of the Scanner implementations of the microbenchmarks (using the libraries given above) normalized to their native implementations. We use the same multi-core CPU + single GPU machine from Section 5.1.1. In all cases, the Scanner implementations execute the kernels without incurring significant overhead, nearly matching or exceeding the native implementations. The Scanner implementations of hist on the CPU and dnn on the GPU achieve a modest improvement in throughput due to better orchestration of the computation graph (pipelining of video decode, data transfers, and kernel execution).
It is common for high-end workstations and modern servers to be packed densely with multiple GPUs and CPUs. We evaluated Scanner's scalability on multi-core CPU and multi-GPU platforms by running the microbenchmarks from Section 5.1.2 on a server with the same CPU but now with four Titan Xp GPUs. Figure 8 compares the microbenchmarks using multiple graph instances against their single graph instance counterparts from Section 5.1.2. Since OpenCV's hist and flow are not parallelized on the CPU, Scanner benefits from parallelization across video frames, providing 5.1× and 12.5× speedups respectively. Although the Caffe library is internally parallelized, Scanner still benefits from processing multiple frames simultaneously for dnn. The GPU benchmarks realize near-linear scaling (at least 3.7×) from one to four GPUs. The Scanner benchmarks realize these throughput improvements without requiring modification to the Scanner application. Achieving good multi-GPU scaling required the runtime optimizations discussed in Section 4. Figure 9 depicts a factor analysis of these optimizations for the three pipelines used in the four-GPU scalability evaluation. The baseline configuration is Scanner with all optimizations disabled. Each data point adds one of the optimizations mentioned in Section 4:

(1) Using multiple GPUs
(2) Pipelining CPU-GPU computations and data transfer
(3) GPU HW ASIC decode
(4) GPU memory pool
(5) Increased work packet size
(6) Batching input elements to kernels

Fig. 10. Scanner reduces the latency of analyzing a single video by using hundreds of GPUs and thousands of CPU cores. Scaling out reduces processing times from multiple minutes to seconds.
Fig. 11. Scanner applications efficiently scale to hundreds of GPUs and thousands of CPU cores when processing large datasets. Speedup is nearly linear until stragglers cause reduced scaling at high machine counts.
GPU Memory Pool becauseeliminating per-video frame memory allocations enables the GPUhardware video decoders (enabled by
GPU Decode ) to operate athigh throughput. In the case of dnn, speedups from
Batching areonly possible after enabling a
Work Packet Size that is greaterthan the batch size.
The true benefitof Scanner is the ability to scale video processing applications tolarge numbers of machines and to very large video datasets. Toevaluate Scanner’s scalability, we executed two benchmarks, thehist computation graph from Section 5.1.2, and pose, the OpenPosehuman pose estimation benchmark [Cao et al. 2016] which is centralto several larger applications in Section 5.2, at scale on GoogleCompute Engine (GCE). We perform CPU scaling experiments oninstances with 32 vCPUs (the unit of CPU hardware allocation onGCE, usually one hyper-thread), and GPU scaling experiments oninstances with 16 vCPUs and two NVIDIA K80 GPUs. Since thepose benchmark does not support CPU execution, we only evaluateit in GPU scaling experiments.
Fig. 12. Surround 360: Scanner's port of Surround 360 fuses 14 video streams into a panoramic video for VR display. 3D Pose: Views of a social scene by 72 of the 480 cameras in the CMU Panoptic Studio (Joo et al. [2016]). Scanner performs pose estimation on all 480 camera streams, which are then fused into 3D poses (shown projected onto a single view). Cinematography: A montage of one frame from each shot in Star Wars and Mean Girls computed using Scanner pipelines (Figure 1-a and -b). TV News: Scanner was used to calculate screen time given to people in 70,000 hours of TV news. Here we show instances of Rachel Maddow, a popular news host. Image credit, left to right: "Palace of Fine Arts Take 1", © Facebook 2017; top image from [Joo et al. 2016] Figure 1, © Hanbyul Joo; Star Wars: Episode IV - A New Hope, © Lucasfilm Ltd. 1977; Mean Girls, © Paramount Pictures 2004; The Rachel Maddow Show, © MSNBC 2015-2017.
Single Video Scaling.
One use of scaling to large machines is to deliver video processing results back to the user rapidly (e.g., for quick preview or analysis). Figure 10 shows Scanner executing hist and pose on a single 2.2-hour feature-length film on a cluster of 2,400 cores and a cluster of 75 GPUs. Executing hist on this video took 4.3 minutes on a single machine (32 vCPUs) and nearly 15 minutes on a single GPU. These times were reduced to 20 and 31 seconds respectively when parallelizing this computation to large CPU and GPU clusters. Scaling pose to the large GPU cluster reduced pose estimation processing time from 55 minutes (1 GPU) to two minutes (75 GPUs).
Large Dataset Scalability.
Scanner facilitates scaling to large video datasets that would be impractical to process without the use of large numbers of machines. Figure 11 shows the speedup achieved running the hist and pose benchmarks on datasets used by the video data mining applications in Section 5.2.3: cinema, a collection of 657 feature-length films (107 million frames, 2.3 TB), and tvnews, a collection of short clips (approximately 10 seconds each) from 60K TV news videos (these shots total 86 million frames). Scanner scales linearly up to 3000 vCPUs and 150 GPUs while continuing to scale near-linearly up to 250 GPUs. Speedups are sublinear at higher machine counts since a single slow machine (straggler) can delay job completion. Techniques for mitigating the effect of stragglers are well-studied and could be implemented by a future version of Scanner [Ananthanarayanan et al. 2013].
We have used Scanner to scale a range of video processing applications (Section 2.1), enabling us to use many machines to obtain results faster, and to scale computations to much larger video datasets than previously practical. Each application presented a unique combination of frame access patterns, usage of Scanner computation graph features, and computational demands.
The video-based 3D pose reconstruction algorithm by Joo et al. [2016] requires efficient scheduling of compute graphs with both CPU and GPU operations to fully utilize machines packed densely with GPUs. The algorithm involves evaluating a DNN on every frame of the 480 video streams in the Panoptic Studio (Figure 1-f). (Per-frame results from each video are then fused to estimate a per-frame 3D pose as in Figure 12, 3D Pose.) An optimized implementation of the per-frame algorithm took 16.1 hours to process a 40-second sequence of captured video on a single Titan Xp GPU (frames 13,500 to 14,500 of the "160422_mafia2" scene from the CMU Panoptic Dataset). A version of this algorithm was previously parallelized onto four Titan Xp GPUs, reducing processing time to seven hours [Cao et al. 2016]. Using the exact same kernels, the Scanner implementation reduces runtime on the same 4-GPU machine to 2.6 hours due to more efficient graph scheduling (better pipelining and data transfer optimizations as discussed in Section 5.1.3).

Using Scanner, it was also simple to further accelerate the application using a large cluster of multi-GPU machines in the cloud. The same Scanner application scheduled onto 200 K80 GPUs (25 8-GPU machines on GCE) completed processing of the same video sequence in only 25 minutes. Dramatically reducing pose reconstruction time to minutes stands to enable researchers to capture longer and richer social interactions using emerging video-based capture infrastructure such as the Panoptic Studio.
The real-time hyperlapse algorithm of Joshi et al. [2015], which computes stabilized timelapses, makes use of computations that stencil over temporal windows. The computational bottleneck in the hyperlapse algorithm is feature extraction from the input images and pairwise feature matching between neighboring images. We implemented those portions of the algorithm as kernels in Scanner (Figure 1-d), using a GPU kernel to extract SIFT features from each frame and a second GPU kernel with a stencil window of size w to perform feature matching. Scanner's stenciling mechanism simplified the implementation of the feature matching kernel (the runtime handles storing intermediate video frames and results) and made the pipeline easy to extend. For example, Joshi et al. [2015] suggest a performance optimization that approximates the reconstruction cost between two frames as the sum of successive costs, falling back to the full windowed feature matching when necessary. The corresponding Scanner pipeline (Figure 1-e) reduces the matching kernel's stencil to the adjacent frames, capturing the adjacent reconstruction costs, and adds a new kernel, CondMatch, which stencils over both the derived matching costs and the original features, conditionally determining whether the full windowed feature matching is necessary.
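As a concrete (if simplified) picture of what the stencil expresses, the sketch below matches per-frame descriptor sets within a sliding window of size w. The descriptor arrays, matching cost, and window bookkeeping are pure-Python stand-ins, not the GPU kernels used in the paper; in Scanner, the runtime supplies the window of inputs to the kernel rather than the kernel iterating over the whole video.

```python
import numpy as np

def match_cost(desc_a, desc_b):
    """Cost between two frames' descriptor sets (each an N x D float array):
    mean distance of mutual nearest-neighbor matches (a simple stand-in)."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    ab = d.argmin(axis=1)  # best match in b for each descriptor in a
    ba = d.argmin(axis=0)  # best match in a for each descriptor in b
    mutual = [(i, j) for i, j in enumerate(ab) if ba[j] == i]
    return float(np.mean([d[i, j] for i, j in mutual])) if mutual else np.inf

def stencil_match(features, w):
    """Windowed matching: for frame i, compute costs against the next w frames,
    mirroring a stencil window of size w over the per-frame feature stream."""
    costs = {}
    for i in range(len(features)):
        for j in range(i + 1, min(i + w + 1, len(features))):
            costs[(i, j)] = match_cost(features[i], features[j])
    return costs
```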
We have also used Scanner as the compute engine for two big video data mining research projects, requiring sparse sampling of videos, bounded state, and fault tolerance when scaling to hundreds of machines. The first involves visual analysis of a corpus of 657 feature-length films (107 million frames, 2.3 TB). For example, Scanner applications are used to detect shot boundaries (via histogram differences, Figure 1-a), produce film summaries via montage (as in Figure 12, middle, with the Scanner pipeline in Figure 1-b), and detect faces. The second is a large-scale analysis of video from over three years of US TV news (FOX, MSNBC, and CNN), which includes over 70,000 hours of video (20 TB, 12 billion frames, six petapixels). In this project Scanner is being used to perform large-scale data mining tasks to discover trends in media bias and culture. These tasks involve visual analyses on video frames such as segmenting news video into shots, identifying the gender and identity of persons on screen, estimating the screen time of various individuals, and understanding the movement of anchors on screen via pose estimation. Use of Scanner to manage and process billions of video frames was essential.
The large size of the feature-length films and the TV news dataset stress-tested Scanner's ability to scale. For example, to estimate the screen time allotted to male-presenting versus female-presenting individuals, we used Scanner to compute color histograms on every frame of the dataset (to detect shot boundaries), and then sparsely computed face bounding boxes and embeddings on a single frame per shot. To execute these tasks, we used a GCE cluster of 100 64-vCPU preemptible machines, relying on Scanner's fault tolerance mechanism to handle preemption. The size of the dataset also required the use of cloud storage for both the videos and the derived metadata. Each computation took less than a day to complete, and Scanner maintained over 90% utilization of the 6,400 vCPUs throughout each run.
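For a sense of what the dense per-frame pass and the sparse per-shot pass compute, here is a minimal NumPy sketch of histogram-difference shot boundary detection followed by picking one representative frame per shot. It assumes frames are HxWx3 uint8 arrays; the bin count and threshold are illustrative values, not the settings used in the project.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Per-channel color histogram of an HxWx3 uint8 frame, L1-normalized."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()

def shot_boundaries(frames, threshold=0.3):
    """Indices where the L1 histogram difference between consecutive frames
    exceeds a threshold (a simple stand-in for the project's shot detector)."""
    hists = [color_histogram(f) for f in frames]
    diffs = [np.abs(hists[i] - hists[i - 1]).sum() for i in range(1, len(hists))]
    return [i for i, d in enumerate(diffs, start=1) if d > threshold]

def one_frame_per_shot(num_frames, boundaries):
    """Pick a single representative frame index per shot, e.g. for the sparse
    face detection and embedding pass described above."""
    starts = [0] + boundaries
    ends = boundaries + [num_frames]
    return [(s + e) // 2 for s, e in zip(starts, ends)]
```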
Fig. 13. Single-machine scalability of the Surround 360 pipeline implemented in Scanner vs. the open source implementation. The Scanner implementation utilizes a single machine well, scaling better to higher core counts.
We ported the Facebook Surround 360 Open Edition VR video stitching pipeline to Scanner [Facebook 2017]. The application requires simultaneously accessing 14 input video streams, scheduling up to 44 computation graph operations on a large number of CPU cores, employing kernels with temporal dependencies (the Flow kernel is configured as a bounded state operation since it depends on the output of previous frames), and compressing output video frames to produce the final stereo panorama output (Figure 1-c). Given Scanner's current scheduler implementation, we found it most efficient to execute each Warp, Flow, Synth block (the kernels surrounded by the blue box in Figure 1-c) as a separate job in Scanner and then feed each of those jobs' outputs into the Concat stages using a second bulk launch. The Scanner implementation uses the same kernels as Facebook's reference implementation.
In contrast to the reference Surround 360 implementation, which is parallelized across the 14 input video streams (but outputs frames serially), our Scanner implementation is also parallelized across segments of output frames, making use of bounded state operations with a warmup of size 10 to maintain temporal coherence across segments of the video. Figure 13 plots the relative speedup of the reference and Scanner Surround 360 implementations on a machine with 32 CPUs (64 hyper-threaded). The Scanner implementation scales more efficiently on the large machine (5.3 seconds per frame versus 13.3 seconds per frame for the reference) due to the change in parallelization strategy. It is also faster due to pipelining (overlapping data movement and compute) and decreased IO, since the Scanner implementation performs compression of the large output frames on the fly before writing out to disk.
We ran the Scanner version of the Surround 360 implementation on a one-minute sequence (28 GB, 25k total frames) over eight machines with 32 vCPU cores each (256 cores total) on Google Compute Engine and achieved a rate of 1.5 FPS. As was the case with our other applications, we were able to scale Surround 360 without any changes to the Scanner application.
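The sketch below illustrates the warmup idea behind bounded state operations: each output segment re-processes a few preceding frames to rebuild the stateful kernel's context, so that segments processed independently (e.g., on different workers) line up with a serial run. It assumes the stateful per-frame operation is passed in as a Python callable; the helper names and pure-Python structure are illustrative, and Scanner's runtime performs the equivalent bookkeeping internally.

```python
def run_segment(frames, start, end, warmup, stateful_op):
    """Process frames[start:end] with a stateful per-frame op, re-running up to
    `warmup` extra frames before `start` to rebuild the op's state."""
    state = None
    out = []
    for i in range(max(0, start - warmup), end):
        result, state = stateful_op(frames[i], state)
        if i >= start:  # outputs produced during warmup are discarded
            out.append(result)
    return out

def run_segmented(frames, segment_size, warmup, stateful_op):
    """Split the video into segments; each loop iteration is independent work
    that a scheduler could dispatch to a different worker."""
    outputs = []
    for start in range(0, len(frames), segment_size):
        end = min(start + segment_size, len(frames))
        outputs.extend(run_segment(frames, start, end, warmup, stateful_op))
    return outputs
```

Larger warmup values reduce any seam between segments for operations whose state decays slowly (such as optical flow), at the cost of redundant computation at each segment boundary.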
Scanner contributes a unique integration of dataflow programming abstractions and systems implementation components that meet the productivity and performance needs of video analysis applications. However, many individual components of Scanner's design were influenced by prior systems for big data processing, databases, and machine learning.
Distributed data analytics frameworks.
Frameworks such as MapReduce [Dean and Ghemawat 2004] and Spark [Zaharia et al. 2010] enable concise and productive expression of data analytics applications using data-parallel operations on large collections. While these platforms handle the "scale-out" scheduling challenges of distributed computing (e.g., work distribution and fault tolerance), as identified in Section 2.2, they require new primitives and significant changes to their internal implementation to meet a broad set of video analysis needs. For example, while it is possible to use Spark to process video, prior implementations [Yang and Wu 2015] do not implement intra-video parallelism (precluding single-video speedups), do not target heterogeneous machines, and do not implement the video decode optimizations shown to provide significant benefits in Section 5.1.1. Scanner features such as bounded state operations (needed for intra-video parallelization in applications like VR video synthesis) and unneeded element elimination (needed for efficient sparse sampling common in data mining, Sec. 5.2.3) do not yet exist in popular distributed data-parallel systems.
Also, as we demonstrate in Fig. 9, Scanner execution graphs require a high-performance, heterogeneous (CPU, GPU, ASIC) runtime to be executed efficiently. While recent efforts have exposed popular GPU-accelerated machine learning libraries [Caf 2016; DataBricks 2016] to Spark applications, the Spark runtime, including its task scheduling, resource management, and data partitioning decisions, operates with no knowledge of the heterogeneous capabilities of the platform. Extending Spark to schedule tasks onto high-throughput accelerated computing platforms is known to require significant runtime redesign and extensions to application-visible abstractions (e.g., the ability for kernels to specify resource requirements and data layouts, and to maintain local state) [Bordawekar 2016; Rosen and Xin 2015]. We hope that the design and implementation of Scanner influences ongoing development of the Spark runtime to better support video processing applications and accelerated computing.
Distributed machine learning frameworks.
Modern machine learning frameworks [Abadi et al. 2016; Chen et al. 2015; Microsoft 2017] adopt a general dataflow programming model suitable for distributing GPU-accelerated training and inference pipelines across clusters of machines. While it may be possible to implement Scanner's functionality as a library built upon these frameworks, doing so would require implementing new operations, runtime support for media accelerators, and integration with a pixel storage system providing the desired relational model and efficient video access; in other words, reimplementing most of Scanner itself. We elected to implement Scanner from the ground up as a lightweight runtime for simplicity and to achieve high performance.
Databases for raster and array data.
Scanner models image and video collections as relations, and echoes the design of SparkSQL [Armbrust et al. 2015] in that row selection and joins on relations are used to define distributed datasets streamed to processing pipelines. Like relational Geographic Information Systems (GIS) [PostGIS Project 2016] or science/engineering-oriented Array Databases (ADBMS) such as SciDB [Cudre-Mauroux et al. 2009] or RasDaMan [Baumann et al. 1998], which extend traditional database management systems with raster or multi-dimensional array types, Scanner natively supports image and video column types for efficiency. While GIS and ADBMS are optimized for operations such as range queries that extract pixel regions from high-resolution images, Scanner's storage layer is designed to efficiently decompress and sample sequences of frames for delivery to computation graphs. As stated in Section 2.2, in contrast to array database designs, we intentionally avoided creating a new language for processing pixels in-database (e.g., SciDB's Array Functional Language or RasDaMan's RASCAL [Rasdaman.org 2015]). Instead we chose to support efficient delivery of video frame data to execution graphs with operations written in existing, well-understood languages like CUDA, C++, or Halide [Ragan-Kelley et al. 2012].
As large video collections become increasingly pervasive, and algorithms for interpreting their contents improve in capability, there will be an increasing number of applications that require efficient video analysis at scale. We view Scanner as an initial step towards establishing efficient parallel computing infrastructure to support these emerging applications. Future work should address higher-level challenges such as the design of query languages for visual data mining (what is SQL for video?), the cost of per-frame image analysis for the case of video (e.g., exploiting temporal coherence to accelerate DNN evaluation on a video stream), and integration of large-scale computation, visualization, and human effort to more rapidly label and annotate large video datasets [Ratner et al. 2018].
While the current version of Scanner achieves high efficiency, it requires the application developer to choose target compute platforms (CPU vs. GPU), video storage data formats, and key scheduling granularities (e.g., task size). It would be interesting to consider the extent to which these decisions could be made automatically for the developer as an application runs. Also, simple extensions of Scanner could expand system scope to provide high-throughput delivery of sampled video frames in model training scenarios (not just model inference) and to deliver regions of video frames rather than full frames (e.g., to support iteration over scene objects rather than video frames).
Most importantly, we are encouraged that Scanner has already proven to be useful. Our collaborations with video data analysts, film cinematographers, human pose reconstruction experts, and computer vision researchers show Scanner has enabled these researchers to iterate on big video datasets much faster than before, or attempt analyses that were simply not feasible given their level of parallel systems experience and existing tools. We hope that Scanner will enable many more researchers, scientists, and data analysts to explore new applications based on large-scale video analysis.
This work was supported by the NSF (IIS-1422767, IIS-1539069), the Intel Science and Technology Center for Visual Cloud Computing, a Google Faculty Fellowship, and the Brown Institute for Media Innovation. TV News datasets were provided by the Internet Archive.
REFERENCES
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, GA, 265-283.
Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. 2013. Effective Straggler Mitigation: Attack of the Clones. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX Association.
ACM Trans. Graph. 35, 6, Article 198 (Nov. 2016), 13 pages.
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383-1394.
P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. SIGMOD Rec. 27, 2 (June 1998), 575-577.
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316 (2016).
O'Reilly Media, Inc (2016).
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. 2004. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Trans. Graph. 23, 3 (Aug. 2004), 777-786.
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2016. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv preprint arXiv:1611.08050 (2016).
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv preprint arXiv:1512.01274 (2015).
X. Chen, A. Shrivastava, and A. Gupta. 2013. NEIL: Extracting Visual Knowledge from Web Data. In 2013 IEEE International Conference on Computer Vision (ICCV). 1409-1416.
E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (June 1970), 377-387.
P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, and S. Zdonik. 2009. A Demonstration of SciDB: A Science-oriented DBMS. Proc. VLDB Endow. 2, 2 (Aug. 2009), 1534-1537.
DataBricks. 2016. TensorFrames. GitHub web site: https://github.com/databricks/tensorframes. (2016).
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI '04). USENIX Association, Berkeley, CA, USA, 10-10.
Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. 2012. What Makes Paris Look Like Paris? ACM Trans. Graph. 31, 4, Article 101 (July 2012), 9 pages.
Facebook, Inc. 2017. Facebook Surround 360. Web site: https://facebook360.fb.com/facebook-surround-360/. (2017).
S. Ginosar, K. Rakelly, S. M. Sachs, B. Yin, C. Lee, P. Krahenbuhl, and A. A. Efros. 2017. A Century of Portraits: A Visual Historical Record of American High School Yearbooks. IEEE Transactions on Computational Imaging PP, 99 (2017).
James Hays and Alexei A. Efros. 2007. Scene completion using millions of photographs. ACM Trans. Graph. 26, 3, Article 4 (July 2007). https://doi.org/10.1145/1276377.1276382
ISO/IEC 2015. ISO/IEC 14496-12:2015: Coding of audio-visual objects - Part 12: ISO base media file format. Standard. International Organization for Standardization, Geneva, CH.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. 2015. Panoptic Studio: A Massively Multiview System for Social Motion Capture. In 2015 IEEE International Conference on Computer Vision (ICCV). 3334-3342.
Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2016. Panoptic Studio: A Massively Multiview System for Social Interaction Capture. arXiv preprint arXiv:1612.03153 (2016).
Neel Joshi, Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and Michael F. Cohen. 2015. Real-time hyperlapse creation via optimal frame selection. ACM Trans. Graph. 34, 4 (2015), 63.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 1-12. https://doi.org/10.1145/3079856.3080246
Ira Kemelmacher-Shlizerman. 2016. Transfiguring Portraits. ACM Trans. Graph. 35, 4, Article 94 (July 2016), 8 pages.
D. Marpe, T. Wiegand, and G. J. Sullivan. 2006. The H.264/MPEG4 advanced video coding standard and its applications. IEEE Communications Magazine.
PostGIS Project 2016. PostGIS 2.3.2dev Manual. PostGIS Project.
Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2012. Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines. ACM Trans. Graph. 31, 4, Article 32 (July 2012), 12 pages.
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. SIGPLAN Not. 48, 6 (June 2013), 519-530.
Rasdaman.org 2015. Rasdaman Version 9.2 Query Language Guide. Rasdaman.org.
Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2018. Snorkel: Rapid Training Data Creation with Weak Supervision. Proc. VLDB Endow. (to appear) 12, 1 (2018).
Josh Rosen and Reynold Xin. 2015. Project Tungsten: Bringing Apache Spark Closer to Bare Metal. Databricks Engineering Blog: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html. (2015).
Krishna Kumar Singh, Kayvon Fatahalian, and Alexei Efros. 2016. KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).
Josef Sivic, Biliana Kaneva, Antonio Torralba, Shai Avidan, and William T. Freeman. 2008. Creating and Exploring a Large Photorealistic Virtual Space. In First IEEE Workshop on Internet Vision.
Noah Snavely, Steven M. Seitz, and Richard Szeliski. 2006. Photo Tourism: Exploring Photo Collections in 3D. ACM Trans. Graph. 25, 3 (July 2006), 835-846.
C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1-9.
William Thies, Michal Karczmarek, and Saman P. Amarasinghe. 2002. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction (CC '02). Springer-Verlag, London, UK, 179-196.
S. Yang and B. Wu. 2015. Large Scale Video Data Analysis Based on Spark. 209-212.
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10). USENIX Association, Berkeley, CA, USA, 10-10.
Jun-Yan Zhu, Yong Jae Lee, and Alexei A. Efros. 2014. AverageExplorer: Interactive Exploration and Alignment of Visual Data Collections. ACM Trans. Graph. 33, 4, Article 160 (July 2014), 11 pages.
Fig. 14. Left: Effective throughput of various video representations at increasing stride. The evaluation was run on a machine with two 8-core Intel Xeon E5-2620 CPUs and four Pascal Titan Xp GPUs. Right: A table with the on-disk size of each video representation.
As discussed in Section 3.1, representing videos as tables allows Scanner to decouple the logical representation of a video (each frame a distinct row in a table) from the physical storage format. In this section, we show how the table representation enables high-throughput video decoding and eases management of the video representation by exploring a variety of physical video formats that are all accessed through the same Scanner table interface. Due to the flexibility of the execution engine, we were able to perform all of the following storage format transformations directly within Scanner. Figure 14 shows the throughput in frames per second of decoding frames from different physical video formats of the same videos. The evaluation was run on three 1920x1080 feature-length films (a total of 600k frames). The size of each representation is listed in the table of Figure 14. In the following paragraphs, we walk through the tradeoffs associated with each format under different access patterns.
Images. imgcpu represents reading 95%-quality JPEG images pre-extracted from the video. Images can be read and decoded independently of each other, so they provide good performance for sparse access patterns. However, images have a significantly larger storage footprint (170 GB vs. 1.2 GB for H.264 video) and are thus bound by I/O throughput.
Video. vidcpu-base and vidgpu-base show H.264 video decode on the original video format. Performance at low strides is high and the storage footprint is low. However, since decoding a specific frame in a video can require decoding all preceding frames in a keyframe sequence (tens to hundreds of frames), Scanner must decode an increasing percentage of unused frames as the stride increases.
Video with shorter keyframe intervals.
Video decode throughput at higher strides can be improved by decreasing the distance between keyframes, trading off an increase in file size (more keyframes consume more storage space). This is shown by the improvement in throughput at large strides, and the increase in file size, of vidcpu-smallgop and vidgpu-smallgop, which perform decode on a video table that was re-encoded using Scanner with a keyframe interval of 24.
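To make the stride/keyframe tradeoff concrete, the following is a rough counting model (a simplification of our own, not Scanner's decoder): it assumes every GOP begins with a keyframe, that decoding frame i requires decoding every frame since the preceding keyframe, and that decoder state is reused when consecutive sampled frames fall in the same GOP.

```python
def frames_decoded(num_frames, stride, gop):
    """Count frames a decoder must touch to return every stride-th frame of a
    video, given a keyframe (GOP) interval of `gop` frames."""
    decoded = 0
    pos = None  # index of the last decoded frame, or None if no decoder state
    for target in range(0, num_frames, stride):
        gop_start = (target // gop) * gop
        # Reuse decoder state only if it lies within the target's GOP.
        if pos is not None and gop_start <= pos <= target:
            start = pos + 1
        else:
            start = gop_start
        decoded += target - start + 1
        pos = target
    return decoded
```

For example, comparing frames_decoded(600_000, 64, 250) with frames_decoded(600_000, 64, 24) (the 250-frame interval is an assumed stand-in for the original encode; the films' true intervals are not listed here) shows the short-GOP re-encode touching several times fewer frames per sampled frame in this toy model, which is the effect visible at large strides in Figure 14.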
Strided Video.