Joint Forecasting of Features and Feature Motion for Dense Semantic Future Prediction
Josip Šarić · Sacha Vražić · Siniša Šegvić
Abstract
We present a novel dense semantic forecasting approach which is applicable to a variety of architectures and tasks. The approach consists of two modules. A feature-to-motion (F2M) module forecasts a dense deformation field which warps past features into their future positions. A feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The compound F2MF approach decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF forecasting to the most subsampled and the most abstract representation of a desired single-frame model. Our implementations take advantage of deformable convolutions and pairwise correlation coefficients across neighbouring time instants. We perform experiments on three dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal state-of-the-art forecasting accuracy across all three modalities on the Cityscapes dataset.
This work has been funded by Rimac Automobili. This work has also been supported by the Croatian Science Foundation under the grant ADEPT and European Regional Development Fund under the grant KK.01.1.1.01.0009 DATACROSS.

Josip Šarić and Siniša Šegvić
University of Zagreb, Faculty of Electrical Engineering and Computing
E-mail: [email protected]
E-mail: [email protected]

Sacha Vražić
Rimac Automobili, Sveta Nedelja, Croatia
E-mail: [email protected]
Fig. 1
Overview of the proposed F2MF forecasting. Low-resolution features $X_\tau$ are extracted from observed RGB images $I_\tau$, $\tau \in \{t-9, t-6, t-3, t\}$, by a pre-trained recognition module. The features are enriched with their spatio-temporal correlations and forwarded to F2M and F2F modules which specialize for forecasting previously observed and novel scenery. The forecasted future features $\hat{X}_{t+\Delta t}$ are a blend (B) of F2M and F2F outputs. Dense semantic predictions $\hat{S}_{t+\Delta t}$ are finally formed by a pre-trained upsampling module.

1 Introduction

Anticipation of the future is important to the development of autonomous robots and vehicles. The recent discovery of deep learning methods triggered great progress in single-image tasks such as instance segmentation (He et al., 2017) or panoptic segmentation (Cheng et al., 2020). However, the reasoning of an intelligent agent must not be limited to the present moment in time, since the consequences of our current actions and the goals of our missions are going to occur in the future. Consequently, accurate anticipation of future events could be an important ingredient towards making our present systems better and more intelligent.

Early visual forecasting methods relied on handcrafted models of scene dynamics (Davison et al., 2007). However, this approach is prone to systematic errors due to insufficiently accurate modeling. For instance, the popular camera models (Zhang, 2016) are unable to express arbitrary lens distortions. Likewise, typical expressions of location uncertainty (Reuter et al., 2014) are prone to propagation of errors between recognition and forecasting. Consequently, recent work attempts to implicitly capture the laws of scene dynamics through deep learning in video (Oprea et al., 2020). This can be carried out either by forecasting future RGB frames (Mathieu et al., 2016; Gao et al., 2019) or by directly forecasting the corresponding semantic content (Alahi et al., 2016; Luc et al., 2017, 2018).

Semantic forecasting is an appealing solution since decision-making systems mainly care about the content of future scenes rather than about their appearance. Experiments from Luc et al. (2017) suggest that direct semantic forecasting leads to better accuracy than analyzing the forecasted RGB frame. Moreover, it comes as no surprise that direct semantic forecasting requires significantly less computational effort than independent prediction in the forecasted RGB frame. Thus, empirical evidence shows that semantic forecasting is a method of choice with respect to processing speed and prediction accuracy. Intuition leads to similar conclusions. It seems appropriate to first figure out what is going to happen at an abstract level before moving on to the details of lighting, reflectance and surface normals. Indeed, it appears that semantic forecasting should be a prerequisite for RGB forecasting rather than vice versa.

There are multiple approaches for expressing semantic forecasting. S2S forecasting maps observed semantic maps into their future counterparts (Luc et al., 2017). M2M forecasting receives the optical flow between observed images and produces the optical flow between the future image and the last observed image (Terwilliger et al., 2019). Finally, F2F forecasting operates on the level of abstract convolutional features (Luc et al., 2018). There are several advantages of F2F forecasting with respect to the remaining two approaches. In comparison with S2S, it offers better spatio-temporal correspondence, more expressive features, and task-agnosticism.
In comparison with M2M, it offers semantic reasoning and the capability to imagine novel scenery.

This paper builds on previous feature-level forecasting approaches (Luc et al., 2018; Sun et al., 2019) and enriches them with the following contributions. We propose to regularize F2F forecasting by complementing it with F2M forecasting which expresses future features by warping their current counterparts through a dense deformation field. This formulation regularizes F2F forecasting by expressing it as a causal relationship. We construct the final F2MF forecast by blending F2M and F2F forecasts with densely regressed weight factors, as illustrated in Figure 1. This allows our method to foresee the emergence of unobserved scenery, and to exploit that information for decoupling variation caused by novelty from variation due to motion. We account for the geometric nature of our task by employing spatio-temporal correlation coefficients and deformable convolutions. We show that semantic segmentation forecasting can be expressed in terms of the most compressed representation of the single-frame model with no loss of accuracy. The resulting generalization performance surpasses the state of the art by a wide margin. Our method is very well suited for real-time implementation due to low computational overhead with respect to single-frame prediction. The method can also be readily applied to forecasting other dense prediction tasks, which we explain next.

In addition to contributions from previous conference accounts (Šarić et al., 2019, 2020), this paper validates the suitability of the proposed forecasting method for various dense prediction tasks and model architectures. We present the first experimental account of panoptic segmentation forecasting by exploiting a single-frame model based on Panoptic DeepLab (Cheng et al., 2020). We also present an application of our method to instance segmentation forecasting by forecasting features of the single-frame Mask R-CNN model (He et al., 2017). Finally, we demonstrate that our approach is able to generalize over different cameras, resolutions and framerates, as well as to produce longer-term forecasts in an autoregressive manner.
2 Related work

Semantic forecasting anticipates the semantic contents of an unobserved future image. Conceptually, this task can be factorized into RGB forecasting (Mathieu et al., 2016) and semantic prediction in the forecasted image. However, we prefer to carry it out as a single processing step (Luc et al., 2017) due to the advantages stated in the introduction. In particular, we consider forecasting dense semantic predictions — semantic segmentation (Zhou et al., 2019), instance segmentation (He et al., 2017) and panoptic segmentation (Cheng et al., 2020). Our method warps features from observed images into their future positions, which makes it loosely related to optical flow (Liu et al., 2019). Our work is most related to previous approaches which forecast dense convolutional features (F2F, Luc et al., 2018) and optical flow (M2M, Terwilliger et al., 2019).

2.1 Optical flow

Optical flow is a dense 2D-motion field between neighbouring image frames $I_t$ and $I_{t+1}$. A future image $I_{t+1}$ can be approximated either by forward warping $I_t$ with the forward flow $f^{t+1}_t = \mathrm{flow}(I_t, I_{t+1})$, or by backward warping it with the backward flow $f^{t}_{t+1} = \mathrm{flow}(I_{t+1}, I_t)$ (Szeliski, 2010):

$$I_{t+1} \approx \mathrm{warp_{fw}}(I_t, f^{t+1}_t) \approx \mathrm{warp_{bw}}(I_t, f^{t}_{t+1}) \qquad (1)$$

The approximate equality reminds us that a bijective mapping cannot be established due to occlusions and disocclusions. Recent optical flow methods leverage deep convolutional models (Dosovitskiy et al., 2015; Sun et al., 2018) due to end-to-end training and the capability to guess motion where the correspondence is absent or ambiguous. These approaches exploit explicit 2D correlation across the local neighbourhood (Dosovitskiy et al., 2015), and local embeddings which act as a correspondence metric. Our method is especially related to self-supervised approaches (Liu et al., 2019) since they learn to reconstruct optical flow without any groundtruth information, which is also the case in our setup.
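For concreteness, backward warping in Eq. (1) amounts to bilinear sampling at flow-displaced coordinates. The following minimal PyTorch sketch illustrates the idea; the helper name and the exact sampling settings are our illustrative choice, not part of any published implementation:

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    """Warp a tensor x of shape (N, C, H, W) with a backward flow (N, 2, H, W).

    flow[:, 0] and flow[:, 1] hold horizontal and vertical displacements in
    pixels, so that output(q) = x(q + flow(q)), cf. warp_bw in Eq. (1).
    Requires PyTorch >= 1.10 for the meshgrid indexing argument.
    """
    n, _, h, w = x.shape
    # Base sampling grid in absolute pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(x.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                           # (N, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)
    return F.grid_sample(x, grid_norm, mode="bilinear", align_corners=True)
```

The same operation applies unchanged to feature tensors, which is how warping is used at the feature level later in the paper.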
2.2 RGB forecasting

RGB forecasting is also known as video prediction (Mathieu et al., 2016). The task is to predict one or more future video frames given a few recently observed frames of the same scene (Oprea et al., 2020). This is especially interesting due to the opportunity for self-supervised representation learning on practically unlimited data.

Mathieu et al. (2016) express RGB forecasting as image generation with a multiscale adversarial network. This approach ignores the geometrical structure of the scene and therefore incurs a high overfitting risk. Vukotić et al. (2017) perform the forecast by embedding the desired temporal offset in the latent representation of the observed scenery. Reda et al. (2018) warp observed frames by applying a regressed kernel at the location determined by the forecasted flow, which makes them unable to in-paint novel scenery.

Some works generate video from still images by warping observed frames with forecasted flow (Li et al., 2018; Pan et al., 2019). Hao et al. (2018) take a sparse motion trajectory and construct the forecast by combining reconstructed and in-painted output. Our approach also forecasts multiple flows, however it uses multiple previous frames to reconstruct a single future feature tensor. This allows it to resolve some disocclusions by warping from a suitable past image.

Some works decompose future prediction into reconstruction from the past and in-painting of the novel scenery (Li et al., 2018; Hao et al., 2018; Gao et al., 2019). Their forecast includes the future warp and the respective disocclusion map. The past frame is first warped with forecasted flow, and then the disocclusions are filled by in-painting. This setup is conceptually similar to the proposed F2MF approach, however there are several important differences, as follows. We exploit spatio-temporal correlation features and deformable convolutions. We perform the forecast on heavily subsampled (16× or 32×) abstract features instead of pixels. This allows our method to operate on megapixel images and to achieve state-of-the-art accuracy in semantic forecasting while incurring only a modest computational overhead with respect to single-frame prediction.

2.3 Dense semantic prediction

Several computer vision tasks address scene understanding at the pixel level. Semantic segmentation (Zhou et al., 2019) assigns each pixel to a suitable semantic class. This includes stuff classes such as road or vegetation, as well as object classes such as person or bicycle. Instance segmentation (He et al., 2017) detects instances of object classes and associates them with the respective image regions. Panoptic segmentation (Cheng et al., 2020) subsumes the previous two approaches by assigning each pixel the semantic class and the index of the respective instance. State-of-the-art algorithms for all these tasks are expressed as deep convolutional models for dense semantic prediction.

Several successful architectures for dense semantic prediction have an asymmetric hourglass-shaped structure consisting of a recognition backbone and a lean upsampling datapath (Lin et al., 2017; Kreso et al., 2017; Cheng et al., 2020). The recognition backbone usually corresponds to the fully-convolutional portion of a model designed for image classification. The role of this component is to convert the input image into a subsampled latent representation. Most authors exploit knowledge transfer from ImageNet since image-wide supervision requires much less effort than dense semantic supervision. The upsampling datapath recovers dense semantic predictions by blending the semantics of deep features with the location accuracy of their shallow counterparts. The blending is usually implemented by means of skip-connections from the backbone towards the upsampling path. Empirical studies show advantages of asymmetric designs which complement deep and thick recognition with shallow and thin upsampling (Krešo et al., 2020). This suggests that recognition requires much more capacity than guessing the borders when rough semantics is known.

2.4 Forecasting of semantic predictions (S2S)

Luc et al. (2017) were the first to propose direct semantic forecasting. Their S2S model maps past semantic segmentations into the future semantic segmentation. Bhattacharyya et al. (2019) try to account for the multimodal future with variational inference based on MC dropout, while conditioning the forecast on measurements from the vehicle odometer. Nabavi et al. (2018) formulate the forecasting in a recurrent fashion with shared parameters between each two frames. Chen and Han (2019) improve their work by leveraging deformable convolutions and enforcing temporal consistency between neighbouring feature tensors. Their attention-based blending is related to forward warping based on pairwise correlation features (Šarić et al., 2020). However, the forecasting accuracy of these approaches is considerably lower than in our ResNet-18 experiments despite greater forecasting capacity and better single-frame performance.
This suggests that ease of correspondence and avoiding error propagation may be important for successful forecasting.

2.5 Flow-based forecasting (M2M)

Direct semantic forecasting requires a lot of training data due to the necessity to learn all motion patterns one by one. This can be improved by allowing the forecasting model to access geometric features which reflect 2D motion in the image plane (Jin et al., 2017). Further development of that idea brings us to warping the last dense prediction according to forecasted optical flow. A prominent instance of this approach can be succinctly described as motion-to-motion (M2M) forecasting since it receives optical flows from the three observed frames and produces the optical flow for the future frame. The corresponding implementation based on convolutional LSTM had achieved state-of-the-art semantic forecasting accuracy prior to our work (Terwilliger et al., 2019). This work is related to our F2M module which also forecasts by warping with regressed flow. However, our F2M module operates on abstract convolutional features, and does not require external components and additional supervision. This discourages error propagation due to end-to-end training and implies very efficient inference due to subsampled resolution. Additionally, we take into account features from all past four frames instead of relying only on the last prediction. This allows our F2M module to detect complex disocclusion patterns and simply copy from the past where possible. Further, our module has access to raw semantic features which are complementary to flow patterns (Feichtenhofer et al., 2016), and often strongly correlated with future motion (consider for example cars vs pedestrians). Finally, we complement our F2M module with pure recognition-based F2F forecasting which outperforms F2M on previously unobserved scenery.

2.6 Feature-level forecasting (F2F)

Feature-to-feature (F2F) forecasting maps past features to their future counterparts. The first F2F approach operated on image-wide features from a fully connected AlexNet layer (Vondrick et al., 2015). Luc et al. (2018) propose dense F2F forecasting by regressing all features along the FPN-style (Lin et al., 2017) upsampling path. Sun et al. (2019) use a convolutional LSTM module at each level of the feature pyramid, and propose inter-level connections for context sharing. However, forecasting at fine resolution is computationally expensive (Couprie et al., 2018). Hence, some later work reverted to forecasting on the coarsest feature level (Šarić et al., 2019; Chiu et al., 2020). Such an approach is advantageous due to small inter-frame displacements, rich contextual information and a small computational footprint. It also has the intuitive appeal of prioritizing the big picture before moving on to the fine details.

Vora et al. (2018) formulate feature-level forecasting as reprojection of reconstructed features to the forecasted future ego-location. However, such a purely geometric approach is clearly suboptimal in the presence of (dis-)occlusions and large changes of perspective. Additionally, it makes it difficult to account for the independent motion of moving objects. The large empirical advantage of our method suggests that optimal forecasting performance requires a careful balance between reconstruction and recognition, and that explicit 3D reasoning may not be necessary.
3 Method

We present a novel feature forecasting approach which combines standard feature-level forecasting (F2F) with its regularized variant (F2M) which relies on warping the past representation into the future. Fig. 2 illustrates the structure of the proposed joint model. Our F2MF model takes T = 4 past feature tensors $X_{t-9}, X_{t-6}, X_{t-3}, X_t$ ($X_{t-9:t:3}$ for short) extracted with the front end of the desired single-frame model for dense semantic prediction. The F2MF model maps the past features into the future feature tensor $\hat{X}_{t+\Delta t}$. This feature tensor is transformed into the semantic predictions $\hat{S}_{t+\Delta t}$ through the back end of the chosen single-frame model, as shown in Figure 1.
Fig. 2
Details of F2MF forecasting. F2M and F2F heads receive a processed concatenation (||) of features from observed frames ($X_{t-9:t:3}$) and their spatio-temporal correlation coefficients. The F2M head regresses future feature flow which warps (W) past features into their future locations. The F2F head forecasts the future features directly. The compound forecast $\hat{X}_{t+\Delta t}$ is a weighted blend (B) of F2M and F2F forecasts.

F2MF forecasting proceeds as follows. Input features are concatenated across the semantic dimension and fused with a single convolutional layer in order to reduce the number of channels. In parallel, the correlation module computes pairwise correlation coefficients across a local neighborhood. The fused representation is concatenated with the correlation features. The result is further processed through six convolutional layers in order to obtain a shared representation which is used by the F2M and F2F heads. All convolutional layers of the forecasting module are implemented as BN-ReLU-dconv units, where dconv stands for deformable convolution (Zhu et al., 2018).

3.1 F2M head

The F2M head assumes that the future can be explained as a geometrical transformation of the observed past. Therefore, it outputs a dense field of motion vectors $\hat{f}^{\,\tau}_{t+\Delta t}$ for each of the T past feature tensors ($\tau \in \{t-9, t-6, t-3, t\}$). The resulting warped tensors are subsequently blended with regressed weights which we activate with per-pixel softmax. Thus, the F2M head constructs its forecast as a weighted sum of warped features from the observed images:

$$\hat{X}^{(\tau)}_{t+\Delta t} = \mathrm{warp}(X_\tau, \hat{f}^{\,\tau}_{t+\Delta t}) \qquad (2)$$
$$\hat{X}^{\mathrm{F2M}}_{t+\Delta t} = \sum_\tau \alpha_\tau \cdot \hat{X}^{(\tau)}_{t+\Delta t} \qquad (3)$$
$$\alpha = \mathrm{softmax}([\,w^{\mathrm{F2M}}_\tau\,]_{\tau \in \{t-9:t:3\}}) \qquad (4)$$

This allows the F2M head to choose the most suitable previous image for forecasting a particular region. Such an opportunity is particularly beneficial in some disocclusion-occlusion patterns, as we illustrate in Fig. 10.

However, the assumption that the future can be reconstructed from the past is only partially true, since the future often brings unpredictable novelty. Additionally, the correspondence will at times be hard to determine due to large ego-motion and changes in perspective. Consequently, some parts of the scene are going to be particularly hard to forecast by simple warping from the past. Accurate forecasting of such regions requires the imagination ability of the module which we describe next.
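The F2M combination of Eqs. (2)-(4) reduces to a per-pixel softmax over warped past features. A minimal sketch, reusing the backward_warp helper from the sketch in Sect. 2.1; the tensor layouts are our illustrative assumption:

```python
import torch

def f2m_forecast(feats, flows, logits):
    """Blend warped past features, cf. Eqs. (2)-(4).

    feats:  list of T tensors X_tau, each (N, C, H, W)
    flows:  (N, T*2, H, W) regressed flows, one 2-channel field per tau
    logits: (N, T, H, W) per-pixel blending scores w^F2M_tau
    """
    alpha = torch.softmax(logits, dim=1)               # Eq. (4), per pixel
    warped = [backward_warp(x, flows[:, 2*i:2*i+2])    # Eq. (2)
              for i, x in enumerate(feats)]
    # Eq. (3): weighted sum across the T warped tensors.
    return sum(alpha[:, i:i+1] * w for i, w in enumerate(warped))
```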
3.2 F2F head

The F2F head directly regresses future features from the shared representation. As it is not bound to reconstruction from the past, it has a chance to in-paint the features in novel regions. This is similar to previous work (Luc et al., 2018; Chiu et al., 2020; Sun et al., 2019), however there are three important differences. First, we aim at single-level F2F forecasting. Second, we use deformable convolutions. Third, our F2F module has access to spatio-temporal correlation features which relieve the need to learn correspondence from scratch. Our experiments show a clear advantage of these features, which suggests that correspondence is not easily learned on existing datasets.

3.3 Compound F2MF forecasting

We hypothesise that F2M and F2F forecasting might be complementary and that their combination might lead to improved accuracy. Consequently, our F2MF module formulates future features as a weighted average of the forecasts provided by the two individual heads:

$$\hat{X}^{\mathrm{F2MF}}_{t+\Delta t} = \beta^{\mathrm{F2F}} \cdot \hat{X}^{\mathrm{F2F}}_{t+\Delta t} + \sum_\tau \beta^{\mathrm{F2M}}_\tau \cdot \hat{X}^{(\tau)}_{t+\Delta t} \qquad (5)$$
$$\beta = \mathrm{softmax}([\,w^{\mathrm{F2F}}\,] \,\|\, [\,w^{\mathrm{F2M}}_\tau\,]_{\tau \in \{t-9:t:3\}}) \qquad (6)$$

This formulation encourages specialization of the F2F head in novel parts of the scene. It also relaxes the penalty of the F2M head in such regions and allows it to focus on parts where correspondence can be established. Nevertheless, this kind of separation might weaken the learning signals within the two individual heads. Consequently, we propose to train F2MF forecasting with three loss terms. The main loss $L_{\mathrm{F2MF}}$ involves the F2MF forecast (5). The two auxiliary losses $L_{\mathrm{F2F}}$ and $L_{\mathrm{F2M}}$ affect the corresponding outputs $\hat{X}^{\mathrm{F2F}}_{t+\Delta t}$ and $\hat{X}^{\mathrm{F2M}}_{t+\Delta t}$. All three losses are formulated as the mean squared L2 distance between the forecast and the actual features computed with the single-frame model in the future frame. Consequently, the model can be trained with self-supervision on unlabeled video (Luc et al., 2018).

3.4 Correlation module

Our correlation module determines spatio-temporal correspondence between neighbouring frames. On input, it receives convolutional features $X_{t-9:t:3}$ in the form of a T × C × H × W tensor. On output, it produces spatio-temporal correlation coefficients across a d × d neighborhood for each of the T − 1 pairs of neighbouring feature tensors. We first project the input features with a 1 × 1 × C′ convolution (C′ = 128). We hypothesize that this mapping might recover distinguishing information which is not needed for single-frame inference. Subsequently, we construct our metric embedding by normalizing the C′-dimensional features to unit norm so that cosine similarity becomes a dot product. This results in a T × C′ × H × W metric embedding F. Finally, we produce d² correspondence maps between features at time τ and their counterparts at τ − 3 within the local d × d neighborhood, for each τ ∈ {t−6, t−3, t}. We usually set d = 9. We denote the value of the correlation tensor $C^\tau$ at location q and feature map ud + v as $C^\tau_{ud+v,\,q}$. This value corresponds to a dot product between a metric feature at time τ and location $q \in D(F)$, and its counterpart at time τ − 3 and location $q + (u - \lfloor d/2 \rfloor,\, v - \lfloor d/2 \rfloor)$, where $u, v \in [0\,..\,d)$ (Dosovitskiy et al., 2015):

$$C^{\tau}_{ud+v,\,q} = F^{\tau\,\top}_{q}\, F^{\tau-3}_{q + [\,u - \lfloor d/2 \rfloor,\; v - \lfloor d/2 \rfloor\,]}, \quad \text{where } u, v \in [0\,..\,d). \qquad (7)$$
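The cost volume of Eq. (7) can be sketched with explicit shifts of the earlier embedding. We assume the inputs are already projected and L2-normalized; practical implementations use an optimized kernel (Pinard, 2020) rather than this illustrative loop:

```python
import torch
import torch.nn.functional as F

def local_correlation(f_cur, f_prev, d=9):
    """Pairwise correlation coefficients, cf. Eq. (7).

    f_cur, f_prev: metric embeddings F^tau and F^{tau-3} of shape
    (N, C', H, W), assumed L2-normalized along channels so that the
    dot product equals cosine similarity. Returns (N, d*d, H, W).
    """
    r = d // 2
    h, w = f_cur.shape[2], f_cur.shape[3]
    # Zero-pad so out-of-image neighbours correlate to zero (a sketch choice).
    f_prev = F.pad(f_prev, (r, r, r, r))
    maps = []
    for u in range(d):            # channel index ud + v, as in Eq. (7)
        for v in range(d):
            shifted = f_prev[:, :, u:u + h, v:v + w]     # offset (u-r, v-r)
            maps.append((f_cur * shifted).sum(dim=1))    # per-pixel dot product
    return torch.stack(maps, dim=1)
```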
Fig. 3
Details of our single-frame model for semantic segmentation. Notice the absence of skip connections from the backbone (green) towards the upsampling path (red). The black jigsaw indicates the feature tensor which we use for F2MF forecasting. Feature extraction and semantic formation modules are in correspondence with Fig. 1.

Our instance segmentation experiments are based on a Mask R-CNN model without FPN upsampling. The cost of this design decision is around 1 pp AP on COCO and 2.7 pp AP50 on Cityscapes val. Despite this handicap, our model performs favourably with respect to the state of the art. Figure 4 shows that our F2MF forecasting addresses 1024-dimensional features produced by the second-to-last residual block of the backbone. Subsequently, the RPN head extracts object candidates from the forecasted features. Instance segmentations are obtained by processing each object candidate with the last residual block and the two heads of the Mask R-CNN model. Future semantics is formed by resizing the inferred instance segmentations to the input resolution.
Fig. 4
Details of the single-frame model which we use in instance segmentation experiments. This is the basic Mask R-CNN model (He et al., 2017) which generates proposals and detections at 16× subsampled resolution. The black jigsaw symbol indicates the feature tensor which we use for F2MF forecasting. Feature extraction and semantic formation modules are in correspondence with Fig. 1.

Our panoptic segmentation experiments are based on a custom Panoptic DeepLab model. Figure 5 shows that our F2MF forecasting addresses 1024-dimensional features produced by the second-to-last residual block of the backbone. Subsequently, the forecasted features are processed by the last residual block and by the two upsampling paths. Future semantics is finally formed by postprocessing object centers and per-pixel semantic information.
Fig. 5
Details of the single-frame model which we use for panoptic segmentation. We train a custom Panoptic DeepLab (Cheng et al., 2020) with only one skip connection. This allows us to apply single-level forecasting to the feature tensor produced by the penultimate residual block, as indicated by the black jigsaw symbol. Feature extraction and semantic formation modules are in correspondence with Fig. 1.
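Across all three tasks, the forecasting pipeline of Figs. 3-5 reduces to the same three calls. A schematic sketch with hypothetical module handles (backbone, f2mf and head stand for feature extraction, forecasting and semantic formation; they are not names from the published code):

```python
import torch

@torch.no_grad()
def forecast_semantics(frames, backbone, f2mf, head):
    """Forecast dense predictions for an unobserved future frame.

    frames: list of 4 past images I_{t-9}, I_{t-6}, I_{t-3}, I_t
    backbone / head: front and back end of the single-frame model
    f2mf: module mapping past features to the future feature tensor
    """
    feats = [backbone(img) for img in frames]   # feature extraction
    future_feat = f2mf(feats)                   # F2MF forecasting
    return head(future_feat)                    # semantic formation
```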
4 Experiments

We evaluate our method on the Cityscapes dataset (Cordts et al., 2016). We consider short-term and mid-term forecasting experiments (Luc et al., 2017, 2018) which target 3 and 9 timesteps into the future, respectively. We estimate the forecasting performance by computing the usual dense prediction metrics in the future frame. We use mean intersection over union (mIoU) for semantic segmentation, and COCO average precision (AP) for instance segmentation. Finally, we consider panoptic quality, segmentation quality and recognition quality (PQ, SQ, RQ) for panoptic segmentation.

Our training procedure involves several separate steps. First, the backbone of the single-frame model is pretrained on ImageNet. Second, we train the single-frame model on labeled images in a task-specific setup. Then, we train the forecasting module on unlabeled images to map the past features to their future counterparts with the MSE loss. The features (both the past and the future) are extracted by applying the single-frame model to the corresponding video frames. Note that the forecasting module does not require any annotations in spite of being trained in a supervised fashion.

We train our model for 160 epochs with early stopping and the ADAM optimizer (Kingma and Ba, 2014) with cosine annealing of the learning rate. During training, we normalize the input and output of our F2MF forecasting module to dataset-wide zero mean and unit variance. During inference, we normalize the input and denormalize the output. This facilitates the training process and improves generalization. We set the weights of all three components of the F2MF loss to 1. Our data augmentation policy includes sliding the input tuple across the video clip and horizontal flipping. We refrain from using any auxiliary information such as vehicle odometry or stereoscopic reconstruction.
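A sketch of one training step under the stated normalization and unit loss weights; the assumption that the forecasting module returns the F2MF, F2M and F2F outputs as a triple is our illustrative interface, not the published one:

```python
import torch
import torch.nn.functional as F

def f2mf_training_step(f2mf, past_feats, future_feat, mu, sigma):
    """One self-supervised step; mu and sigma are dataset-wide statistics.

    past_feats: list of 4 past feature tensors from the frozen
    single-frame model; future_feat: its features at t + dt (the target).
    """
    past_n = [(x - mu) / sigma for x in past_feats]   # normalize inputs
    target = (future_feat - mu) / sigma               # and the target
    pred_f2mf, pred_f2m, pred_f2f = f2mf(past_n)      # assumed triple output
    loss = (F.mse_loss(pred_f2mf, target)             # main loss, Eq. (5)
            + F.mse_loss(pred_f2m, target)            # auxiliary F2M loss
            + F.mse_loss(pred_f2f, target))           # auxiliary F2F loss
    return loss
```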
4.1 Semantic segmentation forecasting on Cityscapes

We experiment with two single-frame models of different capacity. Both of them have an encoder-decoder architecture (Orsic et al., 2019) without skip connections. The encoder corresponds to standard ImageNet classification models. The decoder consists of spatial pyramid pooling and three upsampling modules, as shown in Figure 3. The two models differ in backbone architecture and in the width of the decoder. The weaker model is based on ResNet-18 and uses 128 feature maps along the upsampling path. It achieves 72.5% mIoU on Cityscapes val. The stronger model uses a DenseNet-121 backbone, 512 feature maps in the SPP, 256 maps in the first upsampling module, and 128 maps in the last two upsampling modules. It achieves 75.8% mIoU on Cityscapes val.

Table 1 shows the forecasting accuracy for semantic segmentation on Cityscapes val. The top section shows the performance of our single-frame models, which we denote as oracles since our forecasting models generate predictions for unobserved frames. We also show the performance of a simple baseline approach which simply copies the segmentation from the last observed input frame. The middle section shows experiments from the literature (Luc et al., 2017, 2018; Bhattacharyya et al., 2019; Nabavi et al., 2018; Chen and Han, 2019; Terwilliger et al., 2019; Chiu et al., 2020; Vora et al., 2018). The bottom section presents the performance of our F2MF models with different single-frame models and data augmentation policies. Our best model achieves state-of-the-art forecasting performance with 69.6% mIoU at short-term and 57.9% mIoU at mid-term. The table suggests that a better single-frame model leads to better forecasting, however the gap decreases in the forecasting setup. The difference in oracle performance is 3.3 mIoU points, but it drops to 1.2 pp mIoU at short-term, and 0.9 pp mIoU at mid-term forecasting. We retrain our best model on the trainval set and submit the test set predictions to the online benchmark. This resulted in 70.2% mIoU at short-term and 59.1% mIoU at mid-term forecasting. This strongly suggests that our performance in Table 1 has not been artificially improved through validation experiments on Cityscapes val.
Accuracy (mIoU)          Short term: ∆t = 3     Mid term: ∆t = 9
                         All     MO             All     MO
Oracle-DN121             75.8    75.2           75.8    75.2
Oracle-RN18              72.5    71.5           72.5    71.5
Copy last (DN121)        53.3    48.7           39.1    29.7
3Dconv-F2F               57.0    /              40.8    /
Dil10-S2S                59.4    55.3           47.8    40.8
LSTM S2S                 60.1    /              /       /
Mask-F2F                 /       61.2           /       41.2
FeatReproj3D             61.5    /              45.4    /
Bayesian S2S             65.1    /              51.2    /
LSTM AM S2S              65.8    /              51.3    /
LSTM M2M                 67.1    65.1           51.5    46.3
F2MF-RN18 w/o d.a.       66.9    65.6           55.9    52.4
F2MF-DN121 w/o d.a.      68.7    66.8           56.8    53.1
F2MF-DN121 w/ d.a.       69.6                   57.9
F2MF-DN121 w/ d.a. †     70.2                   59.1

Table 1
Evaluation of our F2MF model for semantic segmentationforecasting on Cityscapes val.
All denotes all classes, MO — moving objects, d.a. — data augmentation, and † — test set. We compare our work with 3Dconv-F2F (Chiu et al., 2020), Dil10-S2S (Luc et al., 2017), LSTM S2S (Nabavi et al., 2018), Mask-F2F (Luc et al., 2018), FeatReproj3D (Vora et al., 2018), Bayesian S2S (Bhattacharyya et al., 2019), LSTM AM S2S (Chen and Han, 2019) and LSTM M2M (Terwilliger et al., 2019).

4.2 Instance segmentation forecasting on Cityscapes

Our single-frame model is a Mask R-CNN (He et al., 2017); we forecast the 16× subsampled features produced by the penultimate residual block of its ResNet-50 backbone. We compare our approach with two previous forecasting approaches (Luc et al., 2018; Sun et al., 2019) which use Mask R-CNN with FPN upsampling (Lin et al., 2017). Consequently, the two previous approaches require forecasting all four levels of the feature pyramid. Table 2 shows that our approach prevails in spite of the considerably weaker oracle and significantly lower computational complexity.

                 Short term: ∆t = 3     Mid term: ∆t = 9
                 AP      AP50           AP      AP50
Oracle (ours)    36.3    63.1           36.3    63.1
Oracle FPN       37.3    65.8           37.3    65.8
Copy last         9.6    22.9            2.2     8.1
F2F 4×
ConvLSTM
F2MF (ours)
Table 2
Instance segmentation forecasting on Cityscapes val. We compare our work with F2F 4× (Luc et al., 2018) and ConvLSTM (Sun et al., 2019). All competing methods use the stronger oracle (FPN).

4.3 Panoptic segmentation forecasting on Cityscapes

Table 3 presents our performance on the panoptic segmentation task. Our single-frame model is a custom Panoptic DeepLab with a ResNet-50 backbone and a single skip connection from the backbone to the decoder/upsampling path. As in the case of instance segmentation, we target the features at 16× subsampled resolution from the penultimate residual block. To the best of our knowledge, this is the first attempt at forecasting panoptic segmentation. The table clearly shows that our F2MF model outperforms the copy-last baseline by a large margin.

                    Short term: ∆t = 3        Mid term: ∆t = 9
                    PQ      SQ      RQ        PQ      SQ      RQ
Oracle (ours)       57.5    79.7    70.8      57.5    79.7    70.8
Oracle (original)   59.8    80.0    73.5      59.8    80.0    73.5
Copy-last           32.3    70.9    42.4      22.3    68.1    30.2
F2MF                47.3    75.1    60.6      33.1    71.3    43.3

Table 3
Panoptic segmentation forecasting on Cityscapes val.
These three subsections have shown that F2MF forecasting can be applied to three different dense semantic prediction tasks. These experiments confirm that the proposed feature-level forecasting approach is indeed task-agnostic.

4.4 Qualitative results

Figure 6 visualizes our short-term and mid-term forecasts and compares them with the predictions of our single-frame model (oracle). The columns show the last observed frame, the oracle prediction overlayed on top of the future frame, our F2MF forecast overlayed on top of the future frame, and the F2M heatmap $\beta^{\mathrm{F2M}} = 1 - \beta^{\mathrm{F2F}} = \sum_\tau \beta^{\mathrm{F2M}}_\tau$. The red regions correspond to pixels forecasted through the F2M head while the blue regions correspond to pixels forecasted by the F2F head. The blue pixels usually correspond to disoccluded scenery which has to be imagined by the model since it was not visible in any of the observed frames. We observe that the forecasted F2M heatmaps are quite accurate.

The top two rows of Figure 6 correspond to semantic segmentation. Most pixels in row 1 are predicted by the F2M head because there is little motion in the scene.

Fig. 6
Qualitative results of dense semantic forecasting: semantic segmentation (top two rows), instance segmentation (middle two rows) and panoptic segmentation (bottom two rows). We show one short-term and one mid-term example for each modality. The columns show the last observed frame, oracle prediction in the future frame, F2MF forecast and the F2M heatmap.

On the other hand, there is a large blue region in the bottom left of the F2M heatmap in row 2. The blue region was disoccluded by the car passing by the camera. As it was never observed before, the F2M head stands no chance, so the F2F head has to in-paint novel content. We observe that the prediction is sound, although it misses some of the people behind. We also provide a video presentation of our mid-term forecasting performance on the Frankfurt video.

The middle two rows illustrate the performance of our approach on the instance segmentation task on Cityscapes val. We observe that the forecasts are reasonable and accurate, which suggests that F2MF is versatile and task-agnostic. We observe that our model accurately predicts the future feet stance for some of the pedestrians in row 3. Furthermore, the model correctly predicts the future position of the moving taxi in row 4. The corresponding F2M heatmap reveals that the F2F head in-paints the features in the disoccluded area, but is unable to forecast the objects behind the taxi.

The bottom two rows illustrate our panoptic segmentation forecasting on Cityscapes val. Notice that pixels corresponding to object classes are colored in different shades of the original class color. We observe that the forecasted segmentation appears sharp and accurate. Similarly as in semantic segmentation forecasting, we observe that the F2F head is in charge of novel regions. This is best seen in the last row, where the car on the right leaves the scene and disoccludes a large part of the background. The model correctly forecasts that the pixels behind the car correspond to buildings, sidewalk and road.

F2MF-RN18 configuration         Short-term mIoU     Mid-term mIoU
F2F    F2M    Correlation       All     MO          All     MO
 ✓
        ✓
 ✓              ✓
        ✓       ✓
 ✓      ✓
 ✓      ✓       ✓

Table 4
Ablation of correlation, F2F, and F2M on Cityscapes val. Standalone F2F and F2M models are trained independently.
We further investigate the complementary nature of the F2F and F2M approaches by comparing the accuracy of independent single-head models in stratified groups of pixels. Previous results showed that the individual F2F model performs slightly better in general. However, we know that predicting the future in novel regions is particularly hard for the F2M model. Therefore, we hypothesise that F2M might perform better in previously observed scenery. We divide pixels into bins according to the $\beta^{\mathrm{F2M}}$ weight predicted by the compound F2MF model, and then show the forecasting accuracy of the two independently trained F2F and F2M models in Fig. 7. On the left y-axis we show the per-bin forecasting accuracy. On the right y-axis we show the ratio of pixels in the total number of pixels. The x-axis shows the $\beta^{\mathrm{F2M}}$ weight value for each bin. The figure separately considers short-term (left) and mid-term (right) forecasting. The pixel histogram is skewed towards higher $\beta^{\mathrm{F2M}}$ values. This suggests that the F2MF model delegates the majority of the pixels to the F2M head. We also observe that this effect is less pronounced in the mid-term forecast. This makes sense because we observe more novelty there. The plot also shows that the F2M model performs better in pixels with higher $\beta^{\mathrm{F2M}}$ values, which justifies the delegating decision of the compound model. These plots support our hypothesis that the two approaches are complementary.
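The stratified analysis can be reproduced with simple binning; a sketch with illustrative inputs (per-pixel β^F2M weights and per-pixel correctness masks of the two independent models):

```python
import numpy as np

def binned_accuracy(beta_f2m, correct_f2f, correct_f2m, n_bins=10):
    """Per-bin accuracy of the independent F2F/F2M models, stratified by
    the beta^F2M weight of the compound model (cf. Fig. 7)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(beta_f2m.ravel(), edges) - 1, 0, n_bins - 1)
    share = np.bincount(idx, minlength=n_bins) / idx.size   # pixel incidence
    def per_bin(correct):
        ok = correct.ravel()
        return np.array([ok[idx == b].mean() if (idx == b).any() else np.nan
                         for b in range(n_bins)])
    return per_bin(correct_f2f), per_bin(correct_f2m), share
```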
Fig. 7
Histograms of F2F and F2M accuracy and overall pixel incidence over $\beta^{\mathrm{F2M}}$ bins as defined by the compound F2MF model for short-term (left) and mid-term (right) forecasting.

We also produce mid-term forecasts by applying our short-term model in an autoregressive manner. We fine-tune the autoregressive model by unrolling the forecasts at t + 3, t + 6 and t + 9, and backpropagating gradients through time. Figure 8 shows the mIoU accuracy for different forecasting times, both for the regular and the fine-tuned model. We observe that the fine-tuned model consistently performs better as the forecasting time increases.

Fig. 8
Dependence of the semantic segmentation accuracy of our autoregressive models on different forecasting offsets. We have carried out these experiments on Frankfurt videos from Cityscapes val.
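A minimal sketch of the autoregressive application, assuming the forecasting module returns the blended future features in the same format as its inputs:

```python
def autoregressive_forecast(feats, f2mf, steps=3):
    """Mid-term forecast (t+9) by re-applying the short-term model.

    feats: [X_{t-9}, X_{t-6}, X_{t-3}, X_t]; every call forecasts three
    timesteps ahead, so three iterations reach t+3, t+6 and t+9.
    """
    for _ in range(steps):
        feats = feats[1:] + [f2mf(feats)]  # slide the window, feed back
    return feats[-1]
```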
                      Oracle    Forecast    Rel. perf.    Drop
Luc et al. (2017)     55.4      46.8        84.5%         -15.5%
F2MF-DN121            62.8      51.3        81.7%         -18.3%
F2MF-DN121 ar.        62.8      53.4        85.0%         -15.0%
F2MF-DN121 ar. ft.    62.8

Table 5
Semantic segmentation forecasting on the CamVid dataset with models trained on Cityscapes.
Figure 9 shows the performance of our best model from Table 5 on two CamVid scenes. The rows show the last raw frame, the ground truth and the forecasted segmentation. We observe that the feature forecasting succeeds in overcoming the domain shift to a considerable degree.
Fig. 9
Mid-term forecast on two scenes from the CamVid dataset as produced by our best model from Table 5. The rows show the last observed frame, the future ground truth and our F2MF forecast.
5 Conclusion

Anticipation of future semantics is a prerequisite for intelligent planning of current actions. Recent work addresses this problem by implicitly capturing the laws of scene dynamics through deep learning in video.
Fig. 10
Interpretation of F2F (rows 1, 3) and F2MF (rows 2, 4) decisions in two pixels denoted with green squares. We consider a pixel on the bicycle (rows 1-2) and a pixel on disoccluded background (rows 3-4). The columns show the four input frames, the forecasted semantic map, and the groundtruth with the overlayed future frame. Red dots show top gradients of the green pixel max-log-softmax w.r.t. the input. These dots correspond to input pixels which are most responsible for the model decision in the green pixel.

However, the existing approaches are unable to distinguish disoccluded and emerging scenery from previously observed parts of the scene. This is clearly suboptimal, since the former requires pure recognition while the latter can be explained by warping. Different than all previous approaches, our method is able to predict the emergence of unobserved scenery, and to exploit that information for disentangling variation caused by novelty from variation due to motion.

Our method performs dense semantic forecasting on the feature level. Different than previous such approaches, we regularize the forecasting process by expressing it as a causal relationship between the past and the future. The proposed F2M (feature-to-motion) forecasting generalizes better than the classic F2F (feature-to-feature) approach at many (but not all) image locations. We achieve the best of both worlds by blending F2M and F2F predictions with densely regressed weight factors. We empirically confirm that low F2M weights occur at unobserved scenery. The resulting F2MF approach surpasses the state of the art in semantic segmentation forecasting on the Cityscapes dataset by a wide margin.

We complement convolutional features with their respective correlation coefficients organized within a cost volume over a small set of discrete displacements. Our forecasting models use deformable convolutions in order to account for the geometric nature of F2F forecasting. These two improvements bring a clear advantage in all three feature-level approaches: F2F, F2M, and F2MF. To the best of our knowledge, this is the first account of using these improvements for semantic forecasting.

Unlike previous methods, our single-frame model for semantic segmentation does not use skip connections along the upsampling path. Consequently, we are able to forecast condensed abstract features at the far end of the downsampling path with a single F2MF module. This greatly improves the inference speed and favors the forecasting accuracy due to the coarse resolution and high semantic content of the involved features. We were unable to outperform this approach with multi-level F2F forecasting in spite of significantly better single-frame accuracy.

We also evaluate the proposed F2MF method on two additional dense prediction tasks: instance segmentation and panoptic segmentation. These experiments use third-party single-frame models and therefore show that our method can be successfully used as a drop-in solution for converting any kind of dense prediction model into its competitive forecasting counterpart.

The proposed method offers many exciting directions for future work. In particular, our method does not address the multi-modal future, which is a key to long-term forecasting and worst-case reasoning. Other suitable extensions include overcoming obstacles towards end-to-end training, extension to RGB forecasting, as well as enforcing temporally consistent predictions in neighbouring video frames.
Acknowledgements
The authors would like to thank Marin Oršić for suggesting the usage of correlation features, Tonći Antunović for useful suggestions on an earlier version of this manuscript, as well as Pauline Luc and Jakob Verbeek for useful discussions during an early stage of this work.

The authors would like to thank all contributors to the GitHub repositories used in our experiments. In particular, we used third-party modules for deformable convolutions (Chen et al., 2019) and correlation features (Pinard, 2020). Our custom Panoptic DeepLab implementation is based on the reference implementation (Cheng et al., 2020). Our instance segmentation forecasting uses the reference Mask R-CNN implementation (Wu et al., 2019).
References
Alahi A, Goel K, Ramanathan V, Robicquet A, Fei-Fei L, Savarese S (2016) Social LSTM: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 961–971
Bhattacharyya A, Fritz M, Schiele B (2019) Bayesian prediction of future street scenes using synthetic likelihoods. In: International Conference on Learning Representations
Brostow GJ, Fauqueur J, Cipolla R (2009) Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters 30(2):88–97
Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, Sun S, Feng W, Liu Z, Xu J, Zhang Z, Cheng D, Zhu C, Cheng T, Zhao Q, Li B, Lu X, Zhu R, Wu Y, Dai J, Wang J, Shi J, Ouyang W, Loy CC, Lin D (2019) MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
Chen X, Han Y (2019) Multi-timescale context encoding for scene parsing prediction. In: 2019 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1624–1629
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Chiu HK, Adeli E, Niebles JC (2020) Segmenting the future. IEEE Robotics and Automation Letters 5(3):4202–4209
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3213–3223
Couprie C, Luc P, Verbeek J (2018) Joint future semantic and instance segmentation prediction. In: ECCV Workshop on Anticipating Human Behavior, pp 154–168
Davison AJ, Reid ID, Molton N, Stasse O (2007) MonoSLAM: Real-time single camera SLAM. IEEE Trans Pattern Anal Mach Intell 29(6):1052–1067
Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V, Van Der Smagt P, Cremers D, Brox T (2015) FlowNet: Learning optical flow with convolutional networks. In: Proc. ICCV, pp 2758–2766
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp 1933–1941
Gao H, Xu H, Cai QZ, Wang R, Yu F, Darrell T (2019) Disentangling propagation and generation for video prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp 9006–9015
Hao Z, Huang X, Belongie S (2018) Controllable video generation with sparse trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7854–7863
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
Jin X, Xiao H, Shen X, Yang J, Lin Z, Chen Y, Jie Z, Feng J, Yan S (2017) Predicting scene parsing and motion dynamics in the future. In: Advances in Neural Information Processing Systems, pp 6915–6924
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kreso I, Krapac J, Segvic S (2017) Ladder-style DenseNets for semantic segmentation of large natural images. In: ICCV CVRSUAD, pp 238–245
Krešo I, Krapac J, Šegvić S (2020) Efficient ladder-style DenseNets for semantic segmentation of large images. IEEE Transactions on Intelligent Transportation Systems
Li Y, Fang C, Yang J, Wang Z, Lu X, Yang MH (2018) Flow-grounded spatial-temporal video prediction from still images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 600–615
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2117–2125
Liu P, Lyu MR, King I, Xu J (2019) SelFlow: Self-supervised learning of optical flow. In: Proc. CVPR, Computer Vision Foundation / IEEE, pp 4571–4580
Luc P, Neverova N, Couprie C, Verbeek J, LeCun Y (2017) Predicting deeper into the future of semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp 648–657
Luc P, Couprie C, Lecun Y, Verbeek J (2018) Predicting future instance segmentation by forecasting convolutional features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 584–599
Mathieu M, Couprie C, LeCun Y (2016) Deep multi-scale video prediction beyond mean square error. In: ICLR
Nabavi SS, Rochan M, Wang Y (2018) Future semantic segmentation with convolutional LSTM. In: BMVC
Oprea S, Martinez-Gonzalez P, Garcia-Garcia A, Castro-Vargas JA, Orts-Escolano S, Rodríguez JG, Argyros AA (2020) A review on deep learning techniques for video prediction. CoRR abs/2004.05214
Orsic M, Kreso I, Bevandic P, Segvic S (2019) In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 12607–12616
Pan J, Wang C, Jia X, Shao J, Sheng L, Yan J, Wang X (2019) Video generation from single semantic label map. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3733–3742
Pinard C (2020) Pytorch-Correlation-extension. https://github.com/ClementPinard/Pytorch-Correlation-extension
Reda FA, Liu G, Shih KJ, Kirby R, Barker J, Tarjan D, Tao A, Catanzaro B (2018) SDC-Net: Video prediction using spatially-displaced convolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 718–733
Reuter S, Vo B, Vo B, Dietmayer K (2014) The labeled multi-Bernoulli filter. IEEE Trans Signal Process 62(12):3246–3260
Sun D, Yang X, Liu MY, Kautz J (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: Proc. CVPR, pp 8934–8943
Sun J, Xie J, Hu JF, Lin Z, Lai J, Zeng W, Zheng WS (2019) Predicting future instance segmentation with contextual pyramid ConvLSTMs. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 2043–2051
Szeliski R (2010) Computer vision: algorithms and applications. Springer Science & Business Media
Terwilliger A, Brazil G, Liu X (2019) Recurrent flow-guided semantic forecasting. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, pp 1703–1712
Vondrick C, Pirsiavash H, Torralba A (2015) Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023
Vora S, Mahjourian R, Pirk S, Angelova A (2018) Future semantic segmentation using 3D structure. In: ECCV 3D Reconstruction meets Semantics Workshop
Vukotić V, Pintea SL, Raymond C, Gravier G, van Gemert JC (2017) One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. In: International Conference on Image Analysis and Processing, Springer, pp 140–151
Wu Y, Kirillov A, Massa F, Lo WY, Girshick R (2019) Detectron2. https://github.com/facebookresearch/detectron2