Clockwork Convnets for Video Semantic Segmentation

Abstract

Recent years have seen tremendous progress in still-image segmentation; however the naïve application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: 1) while pixels may change rapidly from frame to frame, the semantic content of a scene evolves more slowly, and 2) execution can be viewed as an aspect of architecture, yielding purpose-fit computation schedules for networks. We define a novel family of "clockwork" convnets driven by fixed or adaptive clock signals that schedule the processing of different layers at different update rates according to their semantic stability. We design a pipeline schedule to reduce latency for real-time recognition and a fixed-rate schedule to reduce overall computation. Finally, we extend clockwork scheduling to adaptive video processing by incorporating data-driven clocks that can be tuned on unlabeled video. The accuracy and efficiency of clockwork convnets are evaluated on the Youtube-Objects, NYUD, and Cityscapes video datasets.


Evan Shelhamer*, Kate Rakelly*, Judy Hoffman*, Trevor Darrell
{shelhamer,rakelly,jhoffman,trevor}@cs.berkeley.edu
UC Berkeley


Semantic segmentation is a central visual recognition task. End-to-end convolutional network approaches have made progress on the accuracy and execution time of still-image semantic segmentation, but video semantic segmentation has received less attention. Potential applications include UAV navigation, autonomous driving, archival footage recognition, and wearable computing. The computational demands of video processing are a challenge to the simple application of image methods on every frame, while the temporal continuity of video offers an opportunity to reduce this computation.

Fully convolutional networks (FCNs) [1,2,3] have been shown to obtain remarkable results, but the execution time of repeated per-frame processing limits application to video. Adapting these networks to make use of the temporal continuity of video reduces inference computation while suffering minimal loss in recognition accuracy. The temporal rate of change of features, or feature "velocity", across frames varies from layer to layer. In particular, deeper layers in the feature hierarchy change more slowly than shallower layers over video sequences. We propose that network execution can be viewed as an aspect of architecture and define the "clockwork" FCN (cf. clockwork recurrent networks [4]). Combining these two insights, we group the layers of the network into stages, and set separate update rates for these levels of representation. The execution of a stage on a given frame is determined by either a fixed clock rate ("fixed-rate") or data-driven ("adaptive"). The prediction for the current frame is then the fusion (via the skip layer architecture of the FCN) of these computations on multiple frames, thus exploiting the lower resolution and slower rate-of-change of deeper layers.

We demonstrate the efficacy of the architecture for both fixed and adaptive schedules. We show results on multiple datasets for a pipelining schedule designed to reduce latency for real-time recognition as well as a fixed-rate schedule designed to reduce computation and hence time and power. Next we learn the clock-rate adaptively from the data, and demonstrate computational savings when little motion occurs in the video without sacrificing recognition accuracy during dynamic scenes. We verify our approach on synthetic frame sequences made from PASCAL VOC [5] and evaluate on videos from the NYUDv2 [6], YouTube-Objects [7], and Cityscapes [8] datasets.

Fig. 1: Our adaptive clockwork method illustrated with the famous The Horse in Motion [9], captured by Eadweard Muybridge in 1878 at the Palo Alto racetrack. The clock controls network execution: past the first stage, computation is scheduled only at the time points indicated by the clock symbol. During static scenes cached representations persist, while during dynamic scenes new computations are scheduled and output is combined with cached representations. [Figure panels: video frames, modular network with executed or persisted stages and clock firings, and segmentation over time.]

* Authors contributed equally. arXiv: [cs.CV]

We extend fully convolutional networks for image semantic segmentation to video semantic segmentation. Convnets have been applied to video to learn spatiotemporal representations for classification and detection but rarely for dense pixelwise, frame-by-frame inference. Practicality requires network acceleration, but generic techniques do not exploit the structure of video. There is a large body of work on video segmentation, but the focus has not been on semantic segmentation, nor are methods computationally feasible beyond short video shots.

Fully Convolutional Networks

A fully convolutional network (FCN) is a model designed for pixelwise prediction [1]. Every layer in an FCN computes a local operation, such as convolution or pooling, on relative spatial coordinates. This locality makes the network capable of handling inputs of any size while producing output of corresponding dimensions. Efficiency is preserved by computing single, dense forward inference and backward learning passes. Current classification architectures (AlexNet [10], GoogLeNet [11], and VGG [12]) can be cast into corresponding fully convolutional forms. These networks are learned end-to-end, are fast at inference and learning time, and can be generalized with respect to different image-to-image tasks. FCNs yield state-of-the-art results for semantic segmentation [1], boundary prediction [13], and monocular depth estimation [2]. While these tasks process each image in isolation, FCNs extend to video. As more and more visual data is captured as video, the baseline efficiency of fully convolutional computation will not suffice.

Video Networks and Frame Selection

Time can be incorporated into a network by spatiotemporal filtering or recurrence. Spatiotemporal filtering, i.e. 3D convolution, can capture motion for activity recognition [14,15]. For video classification, networks can integrate over time by early, late, or slow fusion of frame features [15]. Recurrence can capture long-term dynamics and propagate state across time, as in the popular long short-term memory (LSTM) [16]. Joint convolutional-recurrent networks filter within frames and recur across frames: the long-term recurrent convolutional network [17] fuses frame features by LSTM for activity recognition and captioning. Frame selection reduces computation by focusing computational resources on important frames identified by the model: space-time interest points [18] are video keypoints engineered for sparsity, and a whole frame selection and recognition policy can be learned end-to-end for activity detection [19]. For optical flow, an intrinsically temporal task, a cross-frame FCN is state-of-the-art among fast methods [3]. None of these video recognition approaches address frame-by-frame output.

Network Acceleration

Although FCNs are fast, video demands computation that is faster still, particularly for real-time inference. The spatially dense operation of the FCN amortizes the computation of overlapping receptive fields common to contemporary architectures. However, the standard FCN does nothing to temporally amortize the computation of sequential inputs. Computational concerns can drive architectural choices. For instance, GoogLeNet requires less computation and memory than VGG, although its segmentation accuracy is worse [1]. Careful but time-consuming model search can improve networks within a fixed computational budget [20]. Methods to reduce computation and memory include reduced precision by weight quantization [21], low-rank approximations with clustering [22], low-rank approximations with end-to-end tuning [23], and kernel approximation methods like the fast food transformation [24]. None of these generic acceleration techniques harness the frame-to-frame structure of video. The proposed clockwork speed-up is orthogonal and compounds any reductions in absolute inference time. Our clockwork insight holds for all layered architectures whatever the speed/quality operating point chosen.

Semantic Segmentation

Much work has been done to address the problem of segmentation in video. However, the focus has not been on semantic segmentation. Instead research has addressed spatio-temporal "supervoxels" [25,26], unsupervised and motion-driven object segmentation [27,28,29], or weakly supervising the segmentation of tagged videos [30,31,32]. These methods are not suitable for real-time use or for the complex multi-class, multi-object scenes encountered in semantic segmentation settings. Fast Object Segmentation in Unconstrained Videos [28] infers only figure-ground segmentation at 0.5s/frame with offline computed optical flow and superpixels. Although its proposals have high recall, even when perfectly parallelized [29] this method takes > s/frame and a separate recognition step is needed for semantic segmentation. In contrast the standard FCN computes a full semantic segmentation in 0.1s/frame.

Our approach is inspired by observing the time course of learned, hierarchical features over video sequences. Drawing on the local-to-global idea of skip connections for fusing global, deep layers with local, shallow layers, we reason that the semantic representation of deep layers is relevant across frames whereas the shallow layers vary with more local, volatile details. Persisting these features over time can be seen as a temporal skip connection.

Measuring the relative difference of features across frames confirms the temporal coherence of deeper layers. Consider a given score layer (a linear predictor of pixel class from features), $\ell$, with outputs $S_\ell \in [K \times H \times W]$, where $K$ is the number of categories and $H$, $W$ are the output dimensions for layer $\ell$. We can compute the difference at time $t$ with a score map distance function $d_{sm}$, chosen to be the Hamming distance of one-hot encodings $\phi$:

$$d_{sm}(S_\ell^{t}, S_\ell^{t-1}) = d_{\mathrm{hamming}}(\phi(S_\ell^{t}), \phi(S_\ell^{t-1}))$$

Table 1 reports the average of these temporal differences for the score layers, as computed over all videos in the YouTube-Objects dataset [7]. It is perhaps unsurprising that the deepest score layer changes an order of magnitude less than the shallower layers on average. We therefore hypothesize that caching deeper layer scores from past frames can inform the inference of the current frame with relatively little reduction in accuracy.

The slower rate of change of deep layers can be attributed to architectural and learned invariances. More pooling affords more robustness to translation and noise, and learned features may be tuned to the supervised classes instead of general appearance.
score layer   temporal difference   depth   semantic accuracy
pixels        .26 ± .18             0       -
pool3         .11 ± .06             9       9.6%
pool4         .11 ± .06             13      20.7%
fc7           .02 ± .02             19      65.5%

Table 1: The average temporal difference over all YouTube-Objects videos of the respective pixelwise class score outputs from a spectrum of network layers. The deeper layers are more stable across frames; that is, we observe supervised convnet features to be "slow" features [33]. The temporal difference is measured as the proportion of label changes in the output. The layer depth counts the distance from the input in the number of parametric and non-linear layers. Semantic accuracy is the intersection-over-union metric on PASCAL VOC of our frame processing network fine-tuned for separate output predictions (Section 5).
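The temporal difference reported in Table 1 can be illustrated with a minimal NumPy sketch. This is not the authors' code: the helper names `one_hot` and `score_map_distance` are hypothetical, and normalizing to the fraction of changed pixels is an assumption taken from the Table 1 caption ("the proportion of label changes"). Each changed pixel contributes 2 to the raw Hamming distance between one-hot encodings, so the two measures differ only by a constant factor.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode an H x W integer label map as a K x H x W one-hot score map."""
    return np.eye(num_classes)[labels].transpose(2, 0, 1)

def score_map_distance(scores_t, scores_prev):
    """Fraction of pixels whose predicted label (argmax over K) changes."""
    labels_t = scores_t.argmax(axis=0)
    labels_prev = scores_prev.argmax(axis=0)
    return float(np.mean(labels_t != labels_prev))

# Toy 2 x 2 score maps with K = 3 classes (illustrative data only).
prev_labels = np.array([[0, 0], [0, 0]])
curr_labels = np.array([[0, 1], [0, 0]])  # one of four pixels changes label
difference = score_map_distance(one_hot(curr_labels, 3), one_hot(prev_labels, 3))
print(difference)  # 0.25
```

Averaging this quantity over adjacent frame pairs of a video would give one entry of the "temporal difference" column, under the assumptions above.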

Fig. 2: The proportional difference between adjacent frames of semantic predictions from a mid-level layer (pool4, green) and the deepest layer (fc7, blue) are shown for the first 75 frames of two videos. We see that for a video with lots of motion (left) the difference values are large while for a relatively static video (right) the difference values are small. In both cases, the differences of the deeper fc7 are smaller than the differences of the shallower pool4. The "velocity" of deep features is slow relative to shallow features and most of all the input. At the same time, the differences between shallow and deep layers are dependent since the features are compositional.

While deeper layers are more stable than shallower layers, for videos with enough motion the score maps throughout the network may change substantially. For example, in Figure 2 we show the differences for the first 75 frames of a video with large motion (left) and with small motion (right). We would like our network to adaptively update only when the deepest, most semantic layer (fc7) score map is likely to change. We notice that though the intermediate layer (pool4) difference is always larger than the deepest layer difference for any given frame, the pool4 differences are much larger for the video with large motion than for the video with relatively small motion. This observation forms the motivation for using the intermediate differences as an indicator to determine the firing of an adaptive clock.

We adapt the fully convolutional network (FCN) approach for image-to-image mapping [1] to video frame processing. While it is straightforward to perform inference with a still-image segmentation network on every video frame, this naïve computation is inefficient. Furthermore, disregarding the sequential nature of the input not only sacrifices efficiency but discards potential temporal recognition cues. The temporal coherence of video suggests the persistence of visual features from prior frames to inform inference on the current frame. To this end we define the clockwork FCN, inspired by the clockwork recurrent network [4], to carry temporal information across frames. A generalized notion of clockwork relates both of these networks.

We consider both throughput and latency in the execution of deep networks across video sequences. The inference time of the regular FCN-8s at ∼ ms per frame of size × on a standard GPU can be too slow for video. We first define fixed clocks then extend to adaptive and potentially learned clockwork to drive network processing. Whatever the task, any video network can be accelerated by our clockwork technique.

Fig. 3: The clockwork FCN with its stages and corresponding clocks. [Figure labels: Conv1 to Conv5, fc6, and fc7 (widths as labeled: 96, 256, 384, 384, 384, 4096, 4096) with Score, Deconv, and Fuse layers, grouped into Stages 1 to 3 driven by Clocks 1 to 3. Frame timing of FCN-8s: Stage 1 (60.0 ms), Stage 2 (18.7 ms), Stage 3 (23.0 ms).]

A schematic of our clockwork FCN is shown in Figure 3. There are several choice points in defining a clockwork architecture. We define a novel, generalized clockwork framework, which can purposely schedule deeper layers more slowly than shallower layers. We form our modules by grouping the layers of a convnet to span the feature hierarchy. Our networks persist both state and output across time steps. The clockwork recurrent network of [4], designed for long-term dependency modeling of time series, is an instance of our more general scheme for clockwork computation. The differences in architecture and outputs over time between clockwork recurrence and our clockwork are shown in Figure 4.

While different, these nets can be expressed by generalized clockwork equations

$$y_H^{(t)} = f_T\left( C_H^{(t)} \odot f_H\left(y_H^{(t-1)}\right) + C_I^{(t)} \odot f_I\left(x^{(t)}\right) \right) \quad (1)$$

$$y_O^{(t)} = f_O\left( C_O^{(t)} \odot f_H\left(y_H^{(t)}\right) \right) \quad (2)$$

with the state update defined by Equation 1 and the output defined by Equation 2. The data $x^{(t)}$, hidden state $y_H^{(t)}$, and output $y_O^{(t)}$ vary with time $t$. The functions $f_I$, $f_H$, $f_O$, $f_T$ define input, hidden state, output, and transition operations respectively and are fixed across time. The input, hidden, and output clocks $C_I^{(t)}$, $C_H^{(t)}$, $C_O^{(t)}$ modulate network operations by the elementwise product $\odot$ with the corresponding function evaluations. We recover the standard recurrent network (SRN), clockwork recurrent network (clock RN), and our network (clock FCN) in this family of equations. The settings of functions and clocks are collected in Table 2.

Inspired by the clockwork RN, we investigate persisting features and scheduling layers to process video with a semantic segmentation convnet.
Recalling the lessened semantic rate of deeper layers observed in Section 3, the skip layers in FCNs originally included to preserve resolution by fusing outputs are repurposed for this staged computation. We cache features and outputs over time at each step to harness the continuity of video. In contrast, the clockwork RN persists state but output is only made according to the clock, and each clockwork RN module is connected to itself and all slower modules across time whereas a module in our network is only connected to itself across time.

network     f_I   f_H   f_O    f_T    C_I   C_H   C_O
SRN         W_I   W_H   TanH   TanH   1     1     1
clock RN    W_I   W_H   TanH   TanH   C     C     C
clock FCN   ∘     I     ReLU   I      C     C̄     1

Table 2: The standard recurrent network (SRN), clockwork recurrent network (clock RN), and our network (clock FCN) in generalized clockwork form. The recurrent networks have learned hidden weights $W_H$ and non-linear transition functions $f_T$, while clock FCN persists state by the identity $I$. Both recurrent modules are flat with linear input weights $W_I$, while clock FCN modules have hierarchical features by layer composition ∘. The SRN has trivial constant, all-ones clocks. The clock RN has a shared input, hidden, and output clock with exponential rates. Our clock FCN has alternating input and hidden clocks $C$, $\bar{C}$ to compute or cache and has a constant, all-ones output clock to fuse output on every frame.

Clockwork architectures partition a network into modules or stages that are executed according to different schedules. In the standard view, the execution of an architecture is an all-or-nothing operation that follows from the definition of the network. Relaxing the strict identification of architecture and execution instead opens up a range of potential schedules. These schedules can be encompassed by the introduction of one first-class architectural element: the clock.

A clock defines a dynamic cut in the computation graph of a network. As clocks mask state in the representation, as detailed in Equations 1 and 2, clocks likewise mask execution in the computation. When a clock is on, its edges are intact and execution traverses to the next nodes/modules. When a clock is off, its edges are cut and execution is blocked. Alternatives such as computing the next stage or caching a past stage can be scheduled by a paired clock $C$ and counter-clock $\bar{C}$ with complementary sets of edges. Any layer (or composition of layers) with binary output can serve as a clock. As a layer, a clock can be fixed or learned.
For instance, the following are simple clocks of the form $f(x, t)$ for features $x$ and time $t$:

– $1$ to always execute
– $t \equiv 0 \pmod{2}$ to execute every other time
– $\|x_t - x_{t-1}\| > \theta$ to execute for a difference threshold

Having incorporated scheduling into the network with clocks, we can optimize the schedule for various tasks by altering the clocks.
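The three simple clocks, together with the compute-or-cache behavior of a clock and its counter-clock, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: `ClockedStage`, `make_threshold_clock`, and the toy stage functions are hypothetical names, and forcing execution on the first frame (when nothing is cached yet) is an assumption.

```python
import numpy as np

def always(x_t, x_prev, t):
    """Constant clock: execute on every frame."""
    return True

def every_other(x_t, x_prev, t):
    """Fixed clock: execute when t is even."""
    return t % 2 == 0

def make_threshold_clock(theta):
    """Difference clock: execute when ||x_t - x_{t-1}|| exceeds theta."""
    def clock(x_t, x_prev, t):
        if x_prev is None:  # assumption: fire when there is no previous input
            return True
        return float(np.linalg.norm(x_t - x_prev)) > theta
    return clock

class ClockedStage:
    """Run `fn` when the clock fires, otherwise return the cached output."""

    def __init__(self, fn, clock):
        self.fn = fn
        self.clock = clock
        self.cache = None
        self.prev_input = None
        self.executions = 0

    def __call__(self, x, t):
        # Clock on: compute and refresh the cache. Clock off: persist the cache.
        if self.cache is None or self.clock(x, self.prev_input, t):
            self.cache = self.fn(x)
            self.executions += 1
        self.prev_input = x
        return self.cache

# Fixed clock: the stage output is refreshed on even frames only.
double = ClockedStage(lambda x: 2 * x, every_other)
outputs = [float(double(np.array(float(t + 1)), t)) for t in range(4)]

# Threshold clock: the stage re-executes only when its input moves enough.
gated = ClockedStage(lambda x: x.copy(), make_threshold_clock(0.5))
for t, value in enumerate([0.0, 0.1, 2.0]):
    gated(np.array([value]), t)
```

In this toy run `outputs` is `[2.0, 2.0, 6.0, 6.0]` and `gated.executions` is 2: the cached value persists whenever the clock is off, which mirrors the role of the counter-clock in Table 2.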

Pipelining

To reduce latency for real-time recognition we pipeline the computation of sequential frames analogously to instruction pipelining in processors. We instantiate a three-stage pipeline, in which stage 1 reflects frame $i$, stage 2 frame $i-1$, and stage 3 frame $i-2$. The total time to process the frame is the time of the longest stage, stage 1 in our pipeline, plus the time for interpolating and fusing outputs. Our 3-stage pipeline FCN reduces latency by . A 2-stage variation further balances latency and accuracy.

Fig. 4: A comparison of the layer connectivity and time course of outputs in the clockwork recurrent network [4] and in our clockwork FCN. Module color marks the time step of evaluation, and blank modules are disconnected from the network output. The clock RN is flat with respect to the input while our network has a hierarchical feature representation. Each clock RN module is temporally connected to itself and slower modules while in our network each module is only temporally connected to itself. Features persist over time in both architectures, but in our architecture they contribute to the network output at each step. [Figure panels: Clockwork RN (recurrent) and Clockwork FCN (feedforward), each showing Modules 1 to 3 over Inputs 1 to 4.]

Fixed-Rate

To reduce overall computation we limit the execution rates of stages and persist features across frames for skipped stages. Given the learned invariance and slow semantics of deep layers observed in Section 3, the deeper layers can be executed at a lower rate to save computation while other stages update. These clock rates are free parameters in the schedule for exchanging inference speed and accuracy. We again divide the network into three stages, and compare rates for the stages. The exponential clockwork schedule is the natural choice of halving the rate at each stage for more efficiency. The alternating clockwork schedule consolidates the earlier stages to execute these on every frame and executes the last stage on every other frame for more accuracy. These different sets of rates cover part of the accuracy/efficiency spectrum.

The current stages are divided into the original score paths of the FCN-8s architecture, but they need not be. One could prioritize latency, spatial refinement, or certain output classes by rebalancing the computation. It is possible to partially compute a span of layers and defer their full execution to a following stage; this can be accomplished by sparse evaluation through dynamic striding and dilation [34]. In principle the stage progression can be decided online in lieu of fixing a schedule for all inference. We turn to adaptive clockwork for deciding execution.
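To make the fixed-rate trade-off concrete, the sketch below estimates the average per-frame compute of a schedule from the stage timings labeled in Fig. 3 (60.0, 18.7, and 23.0 ms). It is an illustration under stated assumptions, not a measurement from the paper: the function names are hypothetical, a stage with rate r is assumed to fire when t mod r is 0, and the cost of caching, interpolation, and fusion is ignored.

```python
STAGE_MS = (60.0, 18.7, 23.0)  # stage timings of FCN-8s as labeled in Fig. 3

def fires(rates, t):
    """Which stages execute on frame t when stage k updates every rates[k] frames."""
    return tuple(t % r == 0 for r in rates)

def mean_time_fraction(rates, stage_ms=STAGE_MS):
    """Average per-frame compute relative to running every stage on every frame."""
    full = sum(stage_ms)
    scheduled = sum(ms / r for ms, r in zip(stage_ms, rates))
    return scheduled / full

schedules = {
    "skip frame (2,2,2)": (2, 2, 2),
    "exponential (1,2,4)": (1, 2, 4),
    "alternating (1,1,2)": (1, 1, 2),
    "frame oracle (1,1,1)": (1, 1, 1),
}
for name, rates in schedules.items():
    print(f"{name}: {100 * mean_time_fraction(rates):.1f}% of full compute")
```

Under these assumptions the exponential schedule costs roughly 74% of full computation and the alternating schedule roughly 89%, while evaluating the whole network on every other frame costs 50%. The gap arises because the first stage dominates the execution time and runs on every frame in both clockwork schedules.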

All of the clocks considered thus far have been fixed functions of time but not the data. Setting these clocks gives rise to many schedules that can be tuned to a given task or video, but this introduces a tedious dimension of model search. Much of the video captured in the wild is static and dynamic in turn with a variable amount of motion and semantic progression at any given time. Choosing many stages or a slow clock rate may reduce computation, but will likewise result in a steep decline in accuracy for dynamic scenes. Conversely, faster update rates or fewer stages may capture transitory details but will needlessly compute and re-compute stable scenes. Adaptive clocks fire based on the input and network state, resulting in a responsive schedule that varies with the dynamism of the scene. The clock can fire according to any function of the input and network state. A difference clock can fire on the temporal difference of a feature across frames. A confidence clock can fire on peaks in the score map for a single frame. This approach extends inference from a pre-determined architecture to a set of architectures to choose from for each frame, relying on the full FCN for high accuracy in dynamic scenes while taking advantage of cached representations in more static scenes.

threshold clock   $\|x_t - x_{t-1}\| > \theta$
learned clock     $f_\theta(x_t, x_{t-1})$

The simplest adaptive clock is a threshold, but adaptive clocks could likewise be learned (for example as a temporal convolution across frames). The threshold can be optimized for a specific tradeoff along the accuracy/efficiency curve. Given the hierarchical dependencies of layers and the relative stability of deep features observed in Section 3, we threshold differences at a shallower stage for adaptive scheduling of deeper stages. The sensitivity of the adaptive clock can even be set on unannotated video by thresholding the proportional temporal difference of output labels as in Table 1. Refer to Section 5.3 for the results of threshold-adaptive clockwork with regard to clock rate and accuracy.
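A threshold-adaptive clock of this kind can be sketched as below. The sketch assumes that the shallower stage's predicted labels are available for every frame and that the deeper stage always executes on the first frame; `adaptive_schedule` and `pick_threshold` are hypothetical helpers, and selecting the threshold by matching a target firing rate on unlabeled frames is one possible tuning rule, not necessarily the procedure used in the paper.

```python
import numpy as np

def label_change_fraction(labels_t, labels_prev):
    """Proportion of pixels whose label differs between two frames."""
    return float(np.mean(labels_t != labels_prev))

def adaptive_schedule(shallow_labels, theta):
    """For each frame, decide whether the deeper stage executes."""
    fire = [True]  # assumption: nothing is cached before the first frame
    for prev, curr in zip(shallow_labels, shallow_labels[1:]):
        fire.append(label_change_fraction(curr, prev) > theta)
    return fire

def pick_threshold(shallow_labels, target_rate, candidates):
    """Choose the candidate threshold whose firing rate is closest to the target."""
    def rate(theta):
        fire = adaptive_schedule(shallow_labels, theta)
        return sum(fire) / len(fire)
    return min(candidates, key=lambda theta: abs(rate(theta) - target_rate))

# Toy sequence of 2 x 2 label maps: static, then a large change, then a small one.
frames = [
    np.array([[0, 0], [0, 0]]),
    np.array([[0, 0], [0, 0]]),
    np.array([[1, 1], [0, 0]]),  # 50% of labels change
    np.array([[1, 1], [0, 1]]),  # 25% of labels change
]
schedule = adaptive_schedule(frames, theta=0.3)
theta = pick_threshold(frames, target_rate=0.5, candidates=[0.1, 0.3, 0.6])
```

Here `schedule` is `[True, False, True, False]`, so the deeper stage runs on half of the frames, and `pick_threshold` returns 0.3 for a target firing rate of 0.5. No annotations are needed since only predicted labels are compared.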

Our base network is FCN-8s, the fully convolutional network of [1]. The architecture is adapted from the VGG16 architecture [12] and fine-tuned from ILSVRC pre-training. The net is trained with batch size one, high momentum, and all skip layers at once.

In our experiments we report two common metrics for semantic segmentation that measure the region intersection over union (IU):

– mean IU: $(1/n_{cl}) \sum_i n_{ii} / \left( t_i + \sum_j n_{ji} - n_{ii} \right)$
– frequency weighted IU: $\left( \sum_k t_k \right)^{-1} \sum_i t_i n_{ii} / \left( t_i + \sum_j n_{ji} - n_{ii} \right)$

for $n_{ij}$ the number of pixels of class $i$ predicted to belong to class $j$, where there are $n_{cl}$ different classes, and for $t_i = \sum_j n_{ij}$ the total number of pixels of class $i$.

We evaluate our clockwork FCN on four video semantic segmentation datasets.

Synthetic sequences of translated scenes

We first validate our method by evaluating on synthetic videos of moving crops of PASCAL VOC images [5] in order to score on a ground truth annotation at every frame. For source data, we select the 736 image subset of the PASCAL VOC 2011 segmentation validation set used for FCN-8s validation in [1]. Video frames are generated by sliding a crop window across the image by a predetermined number of pixels, and generated translations are vertical or horizontal according to the portrait or landscape aspect of the chosen image. Each synthetic video is six frames long. For each seed image, a "fast" and "slow" video is made with 32 pixel and 16 pixel frame-to-frame displacements respectively.

NYU-RGB clips

The NYUDv2 dataset [6] collects short RGB-D clips and includes a segmentation benchmark with high-quality but temporally sparse pixel annotations (every tenth video frame is labeled). We run on video from the "raw" clips subsampled 10X and evaluate on every labeled frame. We consider RGB input alone as the depth frames of the full clips are noisy and uncurated. Our pipelined and fixed-rate clockwork FCNs are run on the entire clips and accuracy is reported for those frames included in the segmentation test set.
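The mean IU and frequency weighted IU metrics defined at the start of this section can be computed from a confusion matrix, as in the following NumPy sketch. The function names are hypothetical, and one detail is an assumption rather than part of the definition: classes that appear in neither the ground truth nor the prediction are skipped, so the mean is taken over the classes present instead of all $n_{cl}$ classes.

```python
import numpy as np

def confusion_matrix(gt, pred, n_cl):
    """n[i, j] = number of pixels of class i predicted to belong to class j."""
    idx = n_cl * gt.ravel() + pred.ravel()
    return np.bincount(idx, minlength=n_cl * n_cl).reshape(n_cl, n_cl)

def iu_metrics(n):
    """Return (mean IU, frequency weighted IU) from a confusion matrix."""
    n = n.astype(float)
    n_ii = np.diag(n)
    t = n.sum(axis=1)                 # t_i: total pixels of class i
    union = t + n.sum(axis=0) - n_ii  # t_i + sum_j n_ji - n_ii
    present = union > 0               # assumption: ignore absent classes
    iu = n_ii[present] / union[present]
    mean_iu = float(iu.mean())
    fw_iu = float((t[present] * iu).sum() / t.sum())
    return mean_iu, fw_iu

# Toy example with four pixels and two classes.
gt = np.array([0, 0, 0, 1])
pred = np.array([0, 0, 1, 1])
mean_iu, fw_iu = iu_metrics(confusion_matrix(gt, pred, n_cl=2))
```

For this toy input the per-class IUs are 2/3 and 1/2, giving a mean IU of 7/12 and a frequency weighted IU of 0.625. The weighted variant favors the more frequent class.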

YouTube-Objects

The YouTube-Objects dataset [7] provides videos collected from YouTube that contain objects from ten PASCAL classes. We restrict our attention to a subset of the videos that have pixelwise annotations by [35] as the original annotations include only initial frame bounding boxes. This subset was drawn from all object classes, and contains 10,167 frames from 126 shots, for which every 10th frame is human-annotated. We run on only annotated frames, effectively 10X subsampling the video. We directly apply our networks derived from PASCAL VOC supervision and do not fine-tune to the video annotations.

Cityscapes

The Cityscapes dataset [8] collects frames from video recorded at 17 Hz by a car-mounted camera while driving through German cities. While annotations are temporally sparse, the preceding and following input frames are provided. Our network is learned on the train split and then all schedules are evaluated on val.

Pipelined execution schedules reduce latency by producing an output each time the first stage is computed. Later stages are persisted from previous frames and their outputs are fused with the output of the first stage computed on the current frame. The number of stages is determined by the number of clocks. We consider a full 3-stage pipeline and a condensed 2-stage pipeline where the stages are defined by the modules in Figure 3. In the pipelined schedule, all clock rates are set to 1, but clocks fire simultaneously to update every stage in parallel. This is made possible by asynchrony in stage state, so that a later stage is independent of the current frame but not past frames.

To assess our pipelined accuracy and speed, we compare to reference methods that bound both recognition and time. A frame oracle evaluates the full FCN on every frame to give the best achievable accuracy for the network independent of timing. As latency baselines for our pipelines, we truncate the FCN to end at the given stage. Both of our staged, pipelined schedules execute at lower latency than the oracle with better accuracy for fixed latency than the baselines. We verify these results on synthetic PASCAL sequences as reported in Table 3. Results on PASCAL, NYUD, and YouTube are reported in Table 4.
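The asynchrony of the pipelined schedule can be sketched with a small simulation in which every stage consumes what the preceding stage produced on the previous time step. This is a toy model rather than the authors' implementation: `run_pipeline` and `latency_fraction` are hypothetical names, the stages are stand-in identity functions, fusion time is ignored, and grouping stages 1 and 2 for the 2-stage variant is an assumption.

```python
STAGE_MS = (60.0, 18.7, 23.0)  # stage timings of FCN-8s as labeled in Fig. 3

def run_pipeline(frames, stages):
    """Each step, stage k consumes what stage k-1 produced on the previous step."""
    outputs = [None] * len(stages)  # persisted stage outputs
    history = []
    for frame in frames:
        prev = list(outputs)
        outputs[0] = stages[0](frame)  # the first stage sees the current frame
        for k in range(1, len(stages)):
            # Later stages only see the previous step's output of stage k-1.
            outputs[k] = stages[k](prev[k - 1]) if prev[k - 1] is not None else None
        history.append(tuple(outputs))
    return history

def latency_fraction(stage_ms):
    """Latency of the longest stage relative to running all stages in sequence."""
    return max(stage_ms) / sum(stage_ms)

identity = lambda x: x
# With identity stages, each output records which frame index it reflects.
history = run_pipeline([0, 1, 2, 3], [identity, identity, identity])
three_stage = latency_fraction(STAGE_MS)
two_stage = latency_fraction((STAGE_MS[0] + STAGE_MS[1], STAGE_MS[2]))
```

From the third step onward, `history` shows stage 1 reflecting frame i, stage 2 frame i-1, and stage 3 frame i-2, for example `(3, 2, 1)` at the last step. Under the stated assumptions the latency fractions round to 59% and 77%, which agrees with the "Time (% of full)" column of Table 3.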

16 pixel shift     Time (% of full)   Mean IU   fwIU   Mean IU-bdry   fwIU-bdry
3-Stage Baseline   59%                9.2       52.6   6.1            9.4
3-Stage Pipeline   59%                56.0      76.5   44.6           42.9
2-Stage Baseline   77%                22.5      64.7   16.6           21.9
2-Stage Pipeline   77%
Frame Oracle       100%               65.9      83.6   57.0           56.3

32 pixel shift     Time (% of full)   Mean IU   fwIU   Mean IU-bdry   fwIU-bdry
3-Stage Baseline   59%                9.2       52.6   6.0            9.4
3-Stage Pipeline   59%                45.5      67.4   37.7           36.0
2-Stage Baseline   77%                22.4      62.8   16.2           21.7
2-Stage Pipeline   77%
Frame Oracle       100%               65.6      82.6   55.8           55.3

Table 3: Pipelined segmentation of translated PASCAL sequences. Synthesized video of translating PASCAL scenes allows for assessment of the pipeline at every frame. The pipelined FCN segments with higher accuracy in the same time envelope as the every-other-frame evaluation of the full FCN. Metrics are computed on the standard masks and a 10-pixel band at boundaries.

Our pipeline scheduled networks reduce latency with minimal accuracy loss relative to the standard FCN run on each frame without time restriction. These quantitative results demonstrate that the deeper layer representations from previous frames contain useful information that can be effectively combined with low-level predictions for the current frame.

                                      NYUD              YouTube           PASCAL Shift 16
Schedule           Time (% of full)   Mean IU   fwIU    Mean IU   fwIU    Mean IU   fwIU
3-Stage Baseline   59%                8.1       22.2    12.2      74.2    9.2       54.7
3-Stage Pipeline   59%                25.1      38.0    58.1      87.0    56.0      76.5
2-Stage Baseline   77%                16.5      32.1    21.5      7.8     22.5      64.7
2-Stage Pipeline   77%
Frame Oracle       100%               31.1      45.5    70.0      91.5    65.9      83.6

Table 4: Pipelined execution of semantic segmentation on three different datasets. Inference approaches include pipelines of different lengths and a full FCN frame oracle. We also show baselines with comparable latency to the pipeline architectures. Our pipelined network offers the best accuracy of computationally comparable approaches running near frame rate. The loss in accuracy relative to the frame oracle is less than the relative speed-up.

We show a qualitative result for our pipelined FCN on a sequence from the YouTube-Objects dataset [7]. Figure 5 shows one example where our pipeline FCN is particularly useful. Our network quickly detects the occlusion of the car while the baseline lags and does not immediately recognize the occlusion or reappearance.

Fig. 5: Pipelined vs. standard FCN on YouTube video (rows: ours, baseline; horizontal axis: time). Our method is able to detect the occlusion of the car as it is happening, unlike the lagging baseline computed on every other frame.

Fixed-rate clock schedules reduce overall computation relative to full, every-frame evaluation by assigning different update rates to each stage such that later stages are executed less often. Rates can be set aggressively low for extreme efficiency or conservatively high to maintain accuracy while sparing computation. The exponential clockwork schedule executes the first stage on every frame then updates following stages exponentially less often by halving with each stage. The alternating clockwork schedule combines stages 2 and 3, executes the first stage on every frame, then schedules the following combined stage every other frame.

A frame oracle that evaluates the full FCN on every frame is the reference model for accuracy. Evaluating the full FCN on every other frame is the reference model for computation. Due to the distribution of execution time over stages, this is faster than either clockwork schedule, though clockwork offers higher accuracy. Alternating clockwork achieves higher accuracy than the every-other-frame reference. See Table 5.
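To make the comparison of fixed-rate schedules concrete, the following sketch enumerates which stages fire on each frame for a tuple of clock rates and estimates the average per-frame cost. The helper names are hypothetical, clocks are assumed to fire on frame 0, and `STAGE_COST` is an illustrative assumption: it is loosely modeled on the Time column of Tables 3 and 4 under the assumption that a truncated baseline's time equals the cumulative cost of its stages, and it is not a measurement reported by the paper.

```python
def stage_fires(rate, t):
    """A clock with rate r is assumed to fire on frames 0, r, 2r, ..."""
    return t % rate == 0


def schedule(rates, num_frames):
    """List, per frame, the indices of the stages that are recomputed."""
    return [[k for k, r in enumerate(rates) if stage_fires(r, t)]
            for t in range(num_frames)]


def relative_cost(rates, stage_cost):
    """Average per-frame cost as a fraction of full every-frame evaluation."""
    return sum(c / r for c, r in zip(stage_cost, rates)) / sum(stage_cost)


# Hypothetical per-stage costs as fractions of one full forward pass.
STAGE_COST = (0.59, 0.18, 0.23)

print(schedule((1, 2, 4), 4))  # -> [[0, 1, 2], [0], [0, 1], [0]]
for name, rates in [("skip frame", (2, 2, 2)),
                    ("exponential", (1, 2, 4)),
                    ("alternating", (1, 1, 2)),
                    ("frame oracle", (1, 1, 1))]:
    print(name, rates, round(relative_cost(rates, STAGE_COST), 4))
```

Under these assumed costs the first stage dominates, so skipping every other frame is cheaper than either clockwork schedule, which matches the qualitative statement above about the distribution of execution time over stages.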

16 pixel shift

| Schedule | Clock Rates | Mean IU | fwIU | Mean IU-bdry | fwIU-bdry |
|---|---|---|---|---|---|
| Skip Frame Baseline | (2,2,2) | 63.0 | 81.5 | 60.2 | 52.2 |
| Exponential | (1,2,4) | 61.4 | 80.4 | 50.5 | 49.1 |
| Alternating | (1,1,2) | | | | |
| Frame Oracle | (1,1,1) | 65.9 | 83.6 | 57.0 | 56.3 |

32 pixel shift

| Schedule | Clock Rates | Mean IU | fwIU | Mean IU-bdry | fwIU-bdry |
|---|---|---|---|---|---|
| Skip Frame Baseline | (2,2,2) | 59.5 | 77.9 | 49.4 | 48.2 |
| Exponential | (1,2,4) | 55.5 | 74.7 | 46.3 | 44.8 |
| Alternating | (1,1,2) | | | | |
| Frame Oracle | (1,1,1) | 65.6 | 82.6 | 55.8 | 55.3 |

Table 5: Fixed-rate segmentation of translated PASCAL sequences. We evaluate the network on synthesized video of translating PASCAL scenes to assess the effect of persisting layer features across frames. Metrics are computed on the standard masks and a 10-pixel band at boundaries.

| Schedule | NYUD Mean IU | NYUD fwIU | Youtube Mean IU | Youtube fwIU | Cityscapes Mean IU | Cityscapes fwIU |
|---|---|---|---|---|---|---|
| Skip Frame Baseline | 27.7 | 41.3 | 65.6 | 89.7 | 62.1 | 87.4 |
| Alternating | 28.5 | 42.4 | 67.0 | 90.3 | | |
| Adaptive | | | | | | |

Table 6: Fixed-rate and adaptive clockwork FCN evaluation. We score our network on three datasets with an alternating schedule that executes the later stage every other frame and an adaptive schedule that executes according to a frame-by-frame threshold on the difference in output. The adaptive threshold is tuned so that the fraction of frames on which the full network executes equalizes computation between the alternating and adaptive schedules.

Exponential clockwork shows degraded accuracy yet still requires more computation than evaluation on every other frame, so we discard this fixed schedule in favor of adaptive clockwork. Although exponential rates suffice for the time series modeled by the clockwork recurrent network [4], these rates deliver unsatisfactory results for the task of video semantic segmentation. See Table 6 for alternating clockwork results on NYUD, YouTube-Objects, and Cityscapes.

The best clock schedule can be data-dependent and unknown before segmenting a video. Therefore, we next evaluate our adaptive clock rate as described in Section 4.3. In this case the adaptive clock only fully processes a frame if the relative difference in pool4 score is larger than some threshold θ. This threshold may be interpreted as the fraction of the score map that must switch labels before the clock updates the upper layers of the network. See Table 6 for adaptive clockwork results on NYUD, YouTube-Objects, and Cityscapes.

We experiment with varying thresholds on the Youtube-Objects dataset to measure accuracy and efficiency. We sweep a range of thresholds θ and also include the setting that updates unconditionally on every frame.

In Figure 6 (left) we report mean IU accuracy as a function of our adaptive clock firing rate; that is, the percentage of frames the clock decides to fully process in the network. The thresholds which correspond to a few points on this curve are indicated with mean IU (right). Notice that our adaptive clockwork is able to fully process only 52% of the frames while suffering a minimal loss in mean IU (68.3 versus 70.0 for the frame oracle; see Figure 6, right). This indicates that our adaptive clockwork is capable of discovering semantically stationary scenes and saves significant computation by only updating when the output score map is predicted to change.

For a closer inspection, we study one Youtube video in more depth in Figure 7.
We first visualize the clock updates for our adaptive method (top left) and for a simple pixel difference baseline (bottom left), where black indicates the clock is on and the corresponding frame is fully computed. This video has significant change in certain sections (for example, a zoom and a stretch of motion) with long periods of relatively little motion. While the pixel difference metric is susceptible to the changes in minor image statistics from frame to frame, resulting in very frequent updates, our method only updates during periods of semantic change and can cache deep features with minimal loss in segmentation accuracy: compare adaptive clock segmentations to ground truth (right).

| Method | % Full Frames | Mean IU |
|---|---|---|
| Adaptive [θ = 0.10] | 93% | 70.0 |
| Adaptive [θ = 0.25] | 52% | 68.3 |
| Adaptive [θ = 0.35] | 21% | 59.0 |
| Frame Oracle | 100% | 70.0 |

Fig. 6: Adaptive Clockwork performance across the Youtube-Objects dataset. We examine various adaptive difference thresholds θ and plot accuracy (mean IU) against the percentage of frames that the adaptive clock chooses to fully compute. A few corresponding thresholds are indicated.

Fig. 7: An illustrative example of our adaptive clockwork method on a video from Youtube-Objects (panels: Adaptive Clock Updates, Pixel Diff Clock Updates, Ground Truth, Adaptive Clock). On the left, we compare clock updates over time (shown in black) of our adaptive clock as well as a clock based on pixel differences. Our adaptive clock updates the full network on only 26% of the frames, determined by the threshold θ on the proportional output label change across frames, while scheduling updates based on pixel difference alone results in updating 90% of the frames. On the right we show output segmentations from the adaptive clockwork network as well as ground truth segments for select frames from dynamic parts of the scene (second and third frames shown) and relatively static periods (first and second frames shown).

Generalized clockwork architectures encompass many kinds of temporal networks, and incorporating execution into the architecture opens up many strategies for scheduling computation. We define a clockwork fully convolutional network for video semantic segmentation in this framework. Motivated by the stability of deep features across sequential frames, our network persists features across time in a temporal skip architecture. By exploring fixed and adaptive schedules, we are able to tune processing for latency, overall computation time, and recognition performance. With adaptive, data-driven clock rates the network is scheduled online to segment dynamic and static scenes alike while maintaining accuracy. In this way our adaptive clockwork network is a bridge between convnets and event-driven vision architectures. The clockwork perspective on temporal networks suggests further architectural variations for spatiotemporal video processing.
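As a rough illustration of the adaptive clock evaluated in the experiments, the sketch below fires a full update only when the fraction of score-map positions that switch labels exceeds a threshold θ. It is a simplified sketch under stated assumptions rather than the authors' code: `label_change` and `adaptive_clock` are hypothetical names, label maps are plain nested lists standing in for the argmax of the pool4 score, consecutive frames are compared, and the first frame is assumed to always be fully processed.

```python
def label_change(prev_labels, cur_labels):
    """Fraction of score-map positions whose label differs between frames."""
    flat_prev = [v for row in prev_labels for v in row]
    flat_cur = [v for row in cur_labels for v in row]
    changed = sum(a != b for a, b in zip(flat_prev, flat_cur))
    return changed / len(flat_cur)


def adaptive_clock(label_maps, theta):
    """Return, per frame, whether the upper layers are recomputed."""
    fired = []
    prev = None
    for labels in label_maps:
        # Assumption: the first frame has no reference, so it always fires.
        fire = prev is None or label_change(prev, labels) > theta
        fired.append(fire)
        prev = labels
    return fired


static = [[0, 0], [0, 0]]
moved = [[0, 1], [1, 1]]    # 3 of 4 positions change relative to `static`
settled = [[0, 1], [1, 0]]  # 1 of 4 positions changes relative to `moved`
video = [static, static, moved, settled]

fired = adaptive_clock(video, theta=0.25)
print(fired)                    # -> [True, False, True, False]
print(sum(fired) / len(fired))  # firing rate: fraction of fully processed frames
```

Lowering θ makes the clock fire more often: with `theta=0.1` the last frame in this toy video also triggers a full update, which mirrors the trade-off between firing rate and accuracy reported in Figure 6.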

References

1. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. In: PAMI. (2016)
2. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV. (2015) 2650–2658
3. Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: Learning optical flow with convolutional networks. In: ICCV. (2015)
4. Koutník, J., Greff, K., Gomez, F., Schmidhuber, J.: A Clockwork RNN. In: ICML. (2014)
5. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV (2) (June 2010) 303–338
6. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV. (2012)
7. Prest, A., Leistner, C., Civera, J., Schmid, C., Ferrari, V.: Learning object class detectors from weakly annotated video. In: CVPR, IEEE (2012) 3282–3289
8. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR. (2016)
9. Muybridge, E.: The horse in motion. Library of Congress Prints and Photographs Division (1882)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. (2012)
11. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
13. Xie, S., Tu, Z.: Holistically-nested edge detection. In: ICCV. (2015)
14. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. PAMI (1) (2013) 221–231
15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR. (2014) 1725–1732
16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation (8) (1997) 1735–1780
17. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR. (2015) 2625–2634
18. Laptev, I.: On space-time interest points. IJCV (2-3) (2005) 107–123
19. Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: CVPR. (2016)
20. He, K., Sun, J.: Convolutional neural networks at constrained time cost. In: CVPR. (2015)
21. Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on cpus. In: Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop. Volume 1. (2011)
22. Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: NIPS. (2014) 1269–1277
23. Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: BMVC. (2014)
24. Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., Wang, Z.: Deep fried convnets. In: ICCV. (2015)
25. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: CVPR, IEEE (2010) 2141–2148
26. Xu, C., Corso, J.J.: Evaluation of super-voxel methods for early video processing. In: CVPR, IEEE (2012) 1202–1209
27. Shi, J., Malik, J.: Motion segmentation and tracking using normalized cuts. In: ICCV, IEEE (1998) 1154–1160
28. Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: ICCV. (December 2013)
29. Fragkiadaki, K., Arbelaez, P., Felsen, P., Malik, J.: Learning to segment moving objects in videos. In: CVPR. (June 2015)
30. Hartmann, G., Grundmann, M., Hoffman, J., Tsai, D., Kwatra, V., Madani, O., Vijayanarasimhan, S., Essa, I., Rehg, J., Sukthankar, R.: Weakly supervised learning of object segmentations from web-scale video. In: ECCV-W, Springer (2012) 198–208
31. Tang, K., Sukthankar, R., Yagnik, J., Fei-Fei, L.: Discriminative segment annotation in weakly labeled video. In: CVPR, IEEE (2013) 2483–2490
32. Liu, X., Tao, D., Song, M., Ruan, Y., Chen, C., Bu, J.: Weakly supervised multiclass video segmentation. In: CVPR, IEEE (2014) 57–64
33. Wiskott, L., Sejnowski, T.J.: Slow feature analysis: Unsupervised learning of invariances. Neural computation 14