Haar Wavelet based Block Autoregressive Flows for Trajectories
Apratim Bhattacharyya, Christoph-Nikolas Straehle, Mario Fritz, Bernt Schiele
Max Planck Institute for Informatics, Saarbrücken, Germany, [email protected]
Bosch Center for Artificial Intelligence, Renningen, Germany
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Abstract.
Prediction of trajectories such as that of pedestrians is crucial to the performance of autonomous agents. While previous works have leveraged conditional generative models like GANs and VAEs for learning the likely future trajectories, accurately modeling the dependency structure of these multimodal distributions, particularly over long time horizons, remains challenging. Normalizing flow based generative models can model complex distributions admitting exact inference. These include variants with split coupling invertible transformations that are easier to parallelize compared to their autoregressive counterparts. To this end, we introduce a novel Haar wavelet based block autoregressive model leveraging split couplings, conditioned on coarse trajectories obtained from Haar wavelet based transformations at different levels of granularity. This yields an exact inference method that models trajectories at different spatio-temporal resolutions in a hierarchical manner. We illustrate the advantages of our approach for generating diverse and accurate trajectories on two real-world datasets – Stanford Drone and Intersection Drone.
Anticipation is a key competence for autonomous agents such as self-driving vehicles to operate in the real world. Many such tasks involving anticipation can be cast as trajectory prediction problems, e.g. anticipation of pedestrian
behaviour in urban driving scenarios. To capture the uncertainty of the real world, it is crucial to model the distribution of likely future trajectories. Therefore, recent works [3,5,27,36] have focused on modeling the distribution of likely future trajectories using either generative adversarial networks (GANs, [15]) or variational autoencoders (VAEs, [22]). However, GANs are prone to mode collapse and the performance of VAEs depends on the tightness of the variational lower bound on the data log-likelihood, which is hard to control in practice [9,20]. This makes it difficult to accurately model the distribution of likely future trajectories. Normalizing flow based exact likelihood models [12,13,23] have been considered to overcome these limitations of GANs and VAEs in the context of image synthesis. Building on the success of these methods, recent approaches have extended flow models for density estimation of sequential data, e.g. video [25] and audio [21]. Yet, VideoFlow [25] is autoregressive in the temporal dimension, which results in prediction errors accumulating over time [26] and reduced efficiency in sampling. Furthermore, FloWaveNet [21] extends flows to audio sequences with odd-even splits along the temporal dimension, encoding only local dependencies [4,20,24]. We address these challenges of flow based models for trajectory generation and develop an exact inference framework to accurately model future trajectory sequences by harnessing long-term spatio-temporal structure in the underlying trajectory distribution.

Fig. 1: Our normalizing flow based model uses a Haar wavelet based decomposition to block autoregressively model trajectories at K coarse-to-fine scales.

In this work, we propose
HBA-Flow, an exact inference model with coarse-to-fine block autoregressive structure to encode long term spatio-temporal correlations for multimodal trajectory prediction. The advantage of the proposed framework is that multimodality can be captured over long time horizons by sampling trajectories at coarse-to-fine spatial and temporal scales (Fig. 1). Our contributions are: 1. we introduce a block autoregressive exact inference model using Haar wavelets where flows applied at a certain scale are conditioned on coarse trajectories from the previous scale. The trajectories at each level are obtained after the application of Haar wavelet based transformations, thereby modeling long term spatio-temporal correlations. 2. Our HBA-Flow model, by virtue of its block autoregressive structure, integrates a multi-scale block autoregressive prior which further improves modeling flexibility by encoding dependencies in the latent space. 3. Furthermore, we show that compared to fully autoregressive approaches [25], our HBA-Flow model is computationally more efficient as the number of sampling steps grows logarithmically in trajectory length. 4. We demonstrate the effectiveness of our approach for trajectory prediction on Stanford Drone and Intersection Drone, with improved accuracy over long time horizons.
Pedestrian Trajectory Prediction.
Work on traffic participant prediction dates back to the Social Forces model [18]. More recent works [1,18,38,35] consider the problem of traffic participant prediction in a social context, by taking into account interactions among traffic participants. Notably, Social LSTM [1] introduces a social pooling layer to aggregate interaction information of nearby traffic participants. An efficient extension of the social pooling operation is developed in [10] and alternate instance and category layers to model interactions in [28]. Weighted interactions are proposed in [7]. In contrast, a multi-agent tensor fusion scheme is proposed in [40] to capture interactions. An attention based model to effectively integrate visual cues in path prediction tasks is proposed in [37]. However, these methods mostly assume a deterministic future and do not directly deal with the challenges of uncertainty and multimodality.
Generative Modeling of Trajectories.
To deal with the challenges of uncertainty and multimodality in anticipating future trajectories, recent works employ either conditional VAEs or GANs to capture the distribution of future trajectories. This includes a conditional VAE based model with an RNN based refinement module [27], a VAE based model [14] that "personalizes" prediction to individual agent behavior, a diversity enhancing "Best of Many" loss [5] to better capture multimodality with VAEs, and an expressive normalizing flow based prior for conditional VAEs [3], among others. However, VAE based models only maximize a lower bound on the data likelihood, limiting their ability to effectively model trajectory data. Other works use GANs [16,40,36] to generate socially compliant trajectories, but GANs are prone to missing modes of the data distribution. Additionally, [34,11] introduce push-forward policies and motion planning for generative modeling of trajectories. Determinantal point processes are used in [39] to better capture the diversity of trajectory distributions. The work of [29] shows that additionally modeling the distribution of trajectory end points can improve accuracy. However, it is unclear if the model of [29] can be used for predictions across variable time horizons. In contrast to these approaches, in this work we directly maximize the exact likelihood of the trajectories, thus better capturing the underlying true trajectory distribution.
Autoregressive Models.
Autoregressive exact inference models like PixelCNN [31] have shown promise in generative modeling. Autoregressive models for sequential data include a convolutional autoregressive model [30] for raw audio and an autoregressive method for video frame prediction [25]. In particular, for sequential data involving trajectories, recent work [32] proposes an autoregressive method based on visual sources. The main limitation of autoregressive approaches is that the models are difficult to parallelize. Moreover, in the case of sequential data, errors tend to accumulate over time [26].
Normalizing Flows.
Split coupling normalizing flow models with affine transformations [12] offer computationally efficient, tractable Jacobians. Recent methods [13,23] have therefore focused on split coupling flows which are easier to parallelize. Flow models are extended in [13] to a multi-scale architecture and the modeling capacity of flow models is further improved in [23] by introducing invertible 1 × 1 convolutions. The 1 × 1 convolution operation of [23] is replaced with a Haar wavelet based downsampling scheme in [2] along the spatial dimensions. Although this leads to improved results on image data, this operation is not particularly effective in the case of sequential data as it does not influence temporal receptive fields for trajectories – crucial for modeling long-term temporal dependencies. Therefore, the Haar wavelet downsampling of [2] does not lead to significant improvement in performance on sequential data (also observed empirically). In this work, instead of employing Haar wavelets as a downsampling operation for reducing spatial resolution [2] in split coupling flows, we formulate a coarse-to-fine block autoregressive model where Haar wavelets produce trajectories at different spatio-temporal resolutions.
In this work, we propose a coarse-to-fine block autoregressive exact inference model,
HBA-Flow, for trajectory sequences. We first provide an overview of conditional normalizing flows which form the backbone of our HBA-Flow model. To extend normalizing flows for trajectory prediction, we introduce an invertible transformation based on Haar wavelets which decomposes trajectories into K coarse-to-fine scales (Fig. 1). This is beneficial for expressing long-range spatio-temporal correlations as coarse trajectories provide global context for the subsequent finer scales. Our proposed HBA-Flow framework integrates the coarse-to-fine transformations with invertible split coupling flows, where it block autoregressively models the transformed trajectories at K scales.

We base our HBA-Flow model on normalizing flows [12] which are a type of exact inference model. In particular, we consider the transformation of the conditional distribution p(y | x) of trajectories y to a distribution p(z | x) over z with conditional normalizing flows [2,3] using a sequence of n transformations g_i : h_{i−1} ↦ h_i, with h_0 = y and parameters θ_i,

    y ↔ h_1 ↔ h_2 ↔ ⋯ ↔ z, via g_1, g_2, …, g_n.    (1)

Given the Jacobians J_{θ_i} = ∂h_i / ∂h_{i−1} of the transformations g_i, the exact likelihood can be computed with the change of variables formula,

    log p_θ(y | x) = log p(z | x) + Σ_{i=1}^{n} log |det J_{θ_i}|.    (2)

Given that the density p(z | x) is known, the likelihood over y can be computed exactly. Recent works [12,13,23] consider invertible split coupling transformations g_i as they provide a good balance between efficiency and modeling flexibility. In (conditional) split coupling transformations, the input h_i is split into two halves l_i, r_i, and g_i applies an invertible transformation only on l_i, leaving r_i unchanged. The transformation parameters of l_i are dependent on r_i and x, thus h_{i+1} = [g_{i+1}(l_i | r_i, x), r_i].
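To make the split coupling operation concrete, the following is a minimal numpy sketch of a conditional affine split coupling step (our own illustration, not the implementation used in our experiments, which employ the more expressive non-linear squared transformations [41]; scale_net and shift_net are stand-ins for conditioning networks):

```python
import numpy as np

def affine_split_coupling(h, x, scale_net, shift_net, forward=True):
    """One conditional split coupling step: h is split into halves (l, r);
    only l is transformed, with parameters predicted from r and the context x."""
    l, r = np.split(h, 2, axis=0)              # split along the time dimension
    s, t = scale_net(r, x), shift_net(r, x)    # parameters depend on r and x only
    if forward:
        l_new = l * np.exp(s) + t              # invertible affine transformation
        log_det = np.sum(s)                    # log |det J| of this step
    else:
        l_new = (l - t) * np.exp(-s)           # closed-form inverse
        log_det = -np.sum(s)
    return np.concatenate([l_new, r], axis=0), log_det
```

Because r passes through unchanged, both directions are a single parallel computation over all time steps, which is the efficiency argument made above.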
Fig. 2: Left: HBA-Flow generative model with the Haar wavelet [17] based representation F_hba. Right: Our multi-scale HBA-Flow model with K scales of Haar based transformation.

The main advantage of (conditional) split coupling flows is that both inference and sampling are parallelizable when the transformations g_{i+1} have an efficient closed form expression of the inverse g_{i+1}^{-1}, e.g. affine [23] or non-linear squared [41], unlike residual flows [8]. As most of the prior work, e.g. [2,12,13,23], considers split coupling flows g_i that are designed to deal with fixed length data, these models are not directly applicable to data of variable length such as trajectories. Moreover, recall that for variable length sequences, while VideoFlow [25] utilizes split coupling based flows to model the distribution at each time-step, it is still fully autoregressive in the temporal dimension, thus offering limited computational efficiency. FloWaveNet [21] splits l_i and r_i along even-odd time-steps for audio synthesis. This even-odd formulation of the split operation, along with the inductive bias [24,20,4] of split coupling based flow models, is limited when expressing local and global dependencies which are crucial for capturing multimodality of the trajectories over long time horizons. Next, we introduce our invertible transformation based on Haar wavelets to model trajectories at various coarse-to-fine levels to address the shortcomings of prior flow based methods [25,21] for sequential data.

The Haar wavelet transform allows for a simple and easy to compute coarse-to-fine frequency decomposed representation with a finite number of components, unlike alternatives, e.g. Fourier transformations [33]. In our HBA-Flow framework, we construct a transformation F_hba comprising mappings f_hba recursively applied across K scales. With this transformation, trajectories can be encoded at different levels of granularity along the temporal dimension. We now formalize the invertible function f_hba and its multi-scale Haar wavelet based composition F_hba.
Single Scale Invertible Transformation.
Consider the trajectory at scale k as y_k = [y_k^1, ⋯, y_k^{T_k}], where T_k is the number of time steps of trajectory y_k. Here, at scale k = 1, y_1 = y is the input trajectory. Each element of the trajectory is a vector y_k^j ∈ R^d encoding spatial information of the traffic participant. Our proposed invertible transformation f_hba at any scale k is a composition f_hba = f_haar ∘ f_eo. First, f_eo transforms the trajectory into even (e_k) and odd (o_k) downsampled trajectories,

    f_eo(y_k) = (e_k, o_k)  where  e_k = [y_k^2, y_k^4, ⋯, y_k^{T_k}]  and  o_k = [y_k^1, y_k^3, ⋯, y_k^{T_k−1}].    (3)

Next, f_haar takes as input the even (e_k) and odd (o_k) downsampled trajectories and transforms them into coarse (c_k) and fine (f_k) downsampled trajectories using a scalar "mixing" parameter α. In detail,

    f_haar(e_k, o_k) = (f_k, c_k)  where  c_k = (1 − α) e_k + α o_k  and  f_k = o_k − c_k = (1 − α) o_k + (α − 1) e_k,    (4)

where the coarse (c_k) trajectory is the element-wise weighted average of the even (e_k) and odd (o_k) downsampled trajectories and the fine (f_k) trajectory is the element-wise difference to the coarse downsampled trajectory. The coarse trajectories (c_k) provide global context for finer scales in our block autoregressive approach, while the fine trajectories (f_k) encode details at multiple scales. We now discuss the invertibility of this transformation f_hba and compute its Jacobian.

Lemma 1.
The generalized Haar transformation f_hba = f_haar ∘ f_eo is invertible for α ∈ [0, 1) and the determinant of the Jacobian of the transformation f_hba = f_haar ∘ f_eo for a sequence of length T_k with y_k^j ∈ R^d is det J_hba = (1 − α)^{(d·T_k)/2}.

We provide the proof in Appendix A. This property allows our HBA-Flow model to exploit f_hba for spatio-temporal decomposition of the trajectories y while remaining invertible with a tractable Jacobian for exact inference. Next, we use this transformation f_hba to build the coarse-to-fine multi-scale Haar wavelet based transformation F_hba and discuss its properties.
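A minimal numpy sketch of the single-scale transformation and its inverse (our own illustration of Eqs. (3)–(4), not the model code; the trajectory is an array of shape (T_k, d) with even T_k):

```python
import numpy as np

def f_hba(y, alpha=0.5):
    """Single-scale transform f_hba = f_haar o f_eo: returns (fine, coarse)."""
    o, e = y[0::2], y[1::2]              # f_eo: odd / even time steps (Eq. 3)
    c = (1 - alpha) * e + alpha * o      # coarse: element-wise weighted average (Eq. 4)
    f = o - c                            # fine: difference to the coarse trajectory
    return f, c

def f_hba_inverse(f, c, alpha=0.5):
    """Exact inverse, valid for alpha in [0, 1)."""
    o = f + c                            # o_k = f_k + c_k
    e = (c - alpha * o) / (1 - alpha)    # solve c_k = (1 - alpha) e_k + alpha o_k
    y = np.empty((2 * len(c),) + c.shape[1:])
    y[0::2], y[1::2] = o, e
    return y

y = np.random.randn(8, 2)                    # T_k = 8 time steps, d = 2
f, c = f_hba(y)
assert np.allclose(f_hba_inverse(f, c), y)   # exact reconstruction
```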
Multi-scale Haar Wavelet based Transformation. To construct our generalized Haar wavelet based transformation F_hba, the mapping f_hba is applied recursively at K scales (Fig. 2, left). The transformation f_hba at a scale k applies a low and a high pass filter pair on the input trajectory y_k, resulting in the coarse trajectory c_k and the fine trajectory f_k with high frequency details. The coarse (spatially and temporally sub-sampled) trajectory (c_k) at scale k is then further decomposed by using it as the input trajectory y_{k+1} = c_k to f_hba at scale k + 1. This is repeated at K scales, resulting in the complete Haar wavelet transformation F_hba(y) = [f_1, ⋯, f_K, c_K] which captures details at multiple (K) spatio-temporal scales. The finest scale f_1 models high-frequency spatio-temporal information of the trajectory y. The subsequent scales f_k represent details at coarser levels, with c_K being the coarsest transformation which expresses the "high-level" spatio-temporal structure of the trajectory (Fig. 1).

Next, we show that the number of scales K in F_hba is upper bounded by the logarithm of the length of the sequence. This implies that F_hba, when integrated in the multi-scale block autoregressive model, provides a computationally efficient setup for generating trajectories.

Lemma 2.
The number of scales K of the Haar wavelet based representation F_hba is K ≤ log_2(T), for an initial input sequence y of length T.

Proof. The Haar wavelet based transformation f_hba halves the length of the trajectory y_k at each level k. Thus, for an initial input sequence y of length T, the length of the coarsest level K in F_hba(y) is |c_K| = T / 2^K ≥ 1. Thus, K ≤ log_2(T).
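A minimal sketch of the recursive decomposition F_hba and its inverse, assuming T is divisible by 2^K and reusing the hypothetical f_hba / f_hba_inverse from the sketch above:

```python
import numpy as np

def F_hba(y, K, alpha=0.5):
    """Decompose a trajectory into [f_1, ..., f_K, c_K]; the length halves per scale."""
    fines, c = [], y
    for _ in range(K):
        f, c = f_hba(c, alpha)
        fines.append(f)
    return fines, c

def F_hba_inverse(fines, c, alpha=0.5):
    """Reconstruct y from the K fine components and the coarsest trajectory c_K."""
    for f in reversed(fines):
        c = f_hba_inverse(f, c, alpha)
    return c

fines, c_K = F_hba(np.random.randn(16, 2), K=3)
print([f.shape for f in fines], c_K.shape)   # [(8, 2), (4, 2), (2, 2)] (2, 2)
```

Sampling proceeds through the same K = O(log T) levels in reverse, which is what makes the block autoregressive factorization below efficient.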
We illustrate our HBA-Flow model in Fig. 2. Our HBA-Flow model first transforms the trajectories y using F_hba, where the invertible transform f_hba is recursively applied on the input trajectory y to obtain f_k and c_k at scales k ∈ {1, ⋯, K}. Therefore, the log-likelihood of a trajectory y under our HBA-Flow model can be expressed using the change of variables formula as,

    log(p_θ(y | x)) = log(p_θ(f_1, c_1 | x)) + log |det (J_hba)_1|
                    = log(p_θ(f_1, ⋯, f_K, c_K | x)) + Σ_{i=1}^{K} log |det (J_hba)_i|.    (5)

Next, our HBA-Flow model factorizes the distribution of fine trajectories w.l.o.g. such that f_k at level k is conditionally dependent on the representations at scales k + 1 to K,

    log(p_θ(f_1, ⋯, f_K, c_K | x)) = log(p_θ(f_1 | f_2, ⋯, f_K, c_K, x)) + ⋯ + log(p_θ(f_K | c_K, x)) + log(p_θ(c_K | x)).    (6)

Finally, note that [f_{k+1}, ⋯, f_K, c_K] is the output of the (bijective) transformation F_hba(c_k) where f_hba is recursively applied to c_k = y_{k+1} at scales {k + 1, ⋯, K}. Thus HBA-Flow equivalently models p_θ(f_k | f_{k+1}, ⋯, c_K, x) as p_θ(f_k | c_k, x),

    log(p_θ(y | x)) = log(p_θ(f_1 | c_1, x)) + ⋯ + log(p_θ(f_K | c_K, x)) + log(p_θ(c_K | x)) + Σ_{i=1}^{K} log |det (J_hba)_i|.    (7)

Therefore, as illustrated in Fig. 2 (right), our HBA-Flow models the distribution of each of the fine components f_k block autoregressively, conditioned on the coarse representation c_k at that level. The distribution p_θ(f_k | c_k, x) at each scale k is modeled using invertible conditional split coupling flows (Fig. 2, right) [21], which transform the input distribution to the distribution over latent "priors" z_k. This enables our framework to model variable length trajectories. The log-likelihood with our HBA-Flow approach can be expressed using the change of variables formula as,

    log(p_θ(f_k | c_k, x)) = log(p_φ(z_k | c_k, x)) + log |det (J_sc)_k|,    (8)

where log |det (J_sc)_k| is the log determinant of the Jacobian (J_sc)_k of the split coupling flow at level k. Thus, the likelihood of a trajectory y under our HBA-Flow model can be expressed exactly using Eqs. (7) and (8).

The key advantage of our approach is that after spatial and temporal downsampling of coarse scales, it is easier to model long-term spatio-temporal dependencies. Moreover, conditioning the flows at each scale on the coarse trajectory provides global context as the downsampled coarse trajectory effectively increases the spatio-temporal receptive field. This enables our HBA-Flow to better capture multimodality in the distribution of likely future trajectories.
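Putting Eqs. (7) and (8) together, the exact log-likelihood can be sketched as follows (our own illustration, reusing the hypothetical f_hba from above; flow_log_probs stands in for the K conditional split coupling flows with their base densities, and coarse_log_prob for the model of c_K, neither of which is specified here):

```python
import numpy as np

def hba_flow_log_likelihood(y, x, flow_log_probs, coarse_log_prob, alpha=0.5):
    """Sum of per-scale terms log p(f_k | c_k, x), the Haar Jacobian terms from
    Lemma 1, and the coarsest term log p(c_K | x), as in Eqs. (7) and (8)."""
    log_p, c = 0.0, y
    for flow_log_prob_k in flow_log_probs:            # scales k = 1, ..., K
        T_k, d = c.shape
        f, c = f_hba(c, alpha)                        # Haar step at scale k
        log_p += flow_log_prob_k(f, c, x)             # log p(f_k | c_k, x), Eq. (8)
        log_p += (d * T_k / 2) * np.log(1 - alpha)    # log |det (J_hba)_k|, Lemma 1
    return log_p + coarse_log_prob(c, x)              # log p(c_K | x)
```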
HBA-Prior. Complex multimodal priors can considerably increase the modeling flexibility of generative models [3,21,25]. The block autoregressive structure of our HBA-Flow model allows us to introduce a Haar block autoregressive prior (HBA-Prior) over z = [z_1, ⋯, z_{f_K}, z_{c_K}] in Eq. (8), where z_k is the latent representation for scales k ∈ {1, ⋯, K − 1} and z_{f_K}, z_{c_K} are the latents for the fine and coarse representations at scale K. The log-likelihood of the prior factorizes as,

    log(p_φ(z | x)) = log(p_φ(z_1 | z_2, ⋯, z_{f_K}, z_{c_K}, x)) + ⋯ + log(p_φ(z_{f_K} | z_{c_K}, x)) + log(p_φ(z_{c_K} | x)).    (9)

Each coarse level representation c_k is the output of a bijective transformation of the latent variables [z_{k+1}, ⋯, z_{f_K}, z_{c_K}] through the invertible split coupling flows and the transformations f_hba at scales {k + 1, ⋯, K}. Thus, HBA-Prior models p_φ(z_k | z_{k+1}, ⋯, z_{f_K}, z_{c_K}, x) as p_φ(z_k | c_k, x) at every scale (Fig. 2, left). The log-likelihood of the prior can also be expressed as,

    log(p_φ(z | x)) = log(p_φ(z_1 | c_1, x)) + ⋯ + log(p_φ(z_{K−1} | c_{K−1}, x)) + log(p_φ(z_{f_K} | c_K, x)) + log(p_φ(z_{c_K} | x)).    (10)

We model p_φ(z_k | c_k, x) as conditional normal distributions which are multimodal as a result of the block autoregressive structure. In comparison to the fully autoregressive prior in [25], our HBA-Prior is efficient as it requires only O(log(T)) sampling steps.
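For illustration, one scale of such a prior could look as follows (a sketch with hypothetical mu_net / log_sigma_net conditioning networks, not our exact parameterization):

```python
import numpy as np

def hba_prior_log_prob(z_k, c_k, x, mu_net, log_sigma_net):
    """log p_phi(z_k | c_k, x) as a diagonal Gaussian whose mean and scale are
    predicted from the coarse trajectory c_k and the observed context x."""
    mu, log_sigma = mu_net(c_k, x), log_sigma_net(c_k, x)
    return float(np.sum(
        -0.5 * ((z_k - mu) / np.exp(log_sigma)) ** 2
        - log_sigma
        - 0.5 * np.log(2.0 * np.pi)
    ))
```

Summing such terms over the scales, coarsest first, gives the factorization in Eq. (10).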
Analysis of Sampling Time. From Eq. (6) and Fig. 2 (left), our HBA-Flow model autoregressively factorizes across the fine components f_k at K scales. From Lemma 2, K ≤ log(T). At each scale, our HBA-Flow samples the fine components f_k using split coupling flows, which are easy to parallelize. Thus, given enough parallel resources, our HBA-Flow model requires at most K ≤ log(T), i.e. O(log(T)), sampling steps and is significantly more efficient compared to fully autoregressive approaches, e.g. VideoFlow [25], which require O(T) steps.

Table 1: Five fold cross validation on the Stanford Drone dataset, reporting Euclidean error (top 10% of samples) at 1–4 seconds, negative conditional log-likelihood (-CLL) and sampling speed; compared methods include "Shotgun" [32], DESIRE-SI-IT4 [27], FloWaveNet [21] and FloWaveNet [21] + HWD [2]. Lower is better for all metrics. Visual refers to additional conditioning on the last observed frame. Top: state of the art, Middle: baselines and ablations, Bottom: our HBA-Flow.
We evaluate our approach for trajectory prediction on two challenging real world datasets – Stanford Drone [35] and Intersection Drone [6]. These datasets contain trajectories of traffic participants, including pedestrians, bicycles and cars, recorded from an aerial platform. The distribution of likely future trajectories is highly multimodal due to the complexity of the traffic scenarios, e.g. at intersections.
Evaluation Metrics.
We are primarily interested in measuring the match of the learned distribution to the true distribution. Therefore, we follow [3,5,27,32] and use the Euclidean error of the top 10% of samples (predictions) and the (negative) conditional log-likelihood (-CLL) metrics. The Euclidean error of the top 10% of samples measures the coverage of all modes of the target distribution and is relatively robust to random guessing as shown in [3].
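For concreteness, the following sketch shows the top 10% Euclidean error as we read it from [3,5] (our own reading, not the exact evaluation code; samples and ground truth are assumed to be in the same coordinate frame):

```python
import numpy as np

def top10_percent_error(samples, gt):
    """samples: (N, T, 2) predicted trajectories, gt: (T, 2) ground truth.
    Returns the average error of the best 10% of the N samples."""
    per_sample = np.linalg.norm(samples - gt[None], axis=-1).mean(axis=-1)  # (N,)
    k = max(1, int(0.1 * len(per_sample)))
    return float(np.sort(per_sample)[:k].mean())
```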
Architecture Details.
We provide additional architecture details in Appendix B.
We use the standard five-fold cross validation evaluation protocol [3,5,27,32] and predict the trajectory up to 4 seconds into the future. We report the Euclidean error of the top 10% of predicted trajectories at the standard (1/5) resolution using 50 samples and the -CLL metric in Table 1. We additionally report the sampling time for a batch of 128 samples in milliseconds.

We compare our HBA-Flow model to the following state-of-the-art models: the handcrafted "Shotgun" model [32], the conditional VAE based models of [5,3,27] and the autoregressive STCNN model [32]. We additionally include the various exact inference baselines for modeling trajectory sequences: the autoregressive flow model of VideoFlow [25], FloWaveNet [21] (without our Haar wavelet based block autoregressive structure), FloWaveNet [21] with the Haar wavelet downsampling of [2] (FloWaveNet + HWD), and our HBA-Flow model with a Gaussian prior (without our HBA-Prior). The FloWaveNet [21] baselines serve as ideal ablations to measure the effectiveness of our block autoregressive HBA-Flow model. For fair comparison, we use two scales (levels) K = 2 with eight non-linear squared split coupling flows [41] each, for both our HBA-Flow and FloWaveNet [21] models. Following [3,32] we additionally experiment with conditioning on the last observed frame using an attention based CNN (indicated by "Visual" in Table 1).

Fig. 3: Mean top 10% predictions (Blue - Groundtruth, Yellow - FloWaveNet [21], Red - Our HBA-Flow model) and predictive distributions on the Intersection Drone dataset. The predictions of our HBA-Flow model are more diverse and better capture the multimodality of the future trajectory distribution.
Table 2: Evaluation on the Stanford Drone dataset using the split of [11,36,40].

Method | mADE ↓ | mFDE ↓
SocialGAN [16] | 27.2 | 41.4
MATF GAN [40] | 22.5 | 33.5
SoPhie [36] | 16.2 | 29.3
Goal Prediction [11] | 15.7 | 28.1
CF-VAE [3] | 12.6 | 22.3
HBA-Flow + Prior (Ours) | |

We observe from Table 1 that our HBA-Flow model outperforms both state-of-the-art models and baselines. In particular, our HBA-Flow model outperforms the conditional VAE based models of [3,5,27] in terms of Euclidean distance and -CLL. Further, our HBA-Flow exhibits competitive sampling speeds. This shows the advantage of exact inference in the context of generative modeling of trajectories – leading to a better match to the groundtruth distribution. Our HBA-Flow model generates accurate trajectories compared to the VideoFlow [25] baseline. This is because, unlike VideoFlow, errors do not accumulate in the temporal dimension of HBA-Flow. Our HBA-Flow model outperforms the FloWaveNet model of [21] with comparable sampling speeds, demonstrating the effectiveness of the coarse-to-fine block autoregressive structure of our HBA-Flow model in capturing long-range spatio-temporal dependencies. This is reflected in the predictive distributions and the top 10% of predictions of our HBA-Flow model in comparison with FloWaveNet [21] in Fig. 5. The predictions of our HBA-Flow model are more diverse and can more effectively capture the multimodality of the trajectory distributions, especially in complex traffic situations, e.g. intersections and crossings. We provide additional examples in Appendix C. We also observe in Table 1 that the addition of Haar wavelet downsampling [2] to FloWaveNet [21] (FloWaveNet + HWD) does not significantly improve performance. This illustrates that Haar wavelet downsampling as used in [2] is not effective in the case of sequential trajectory data as it is primarily a spatial pooling operation for image data. Finally, our ablations with Gaussian priors (HBA-Flow) additionally demonstrate the effectiveness of our HBA-Prior (HBA-Flow + Prior) with improvements with respect to accuracy. We further include a comparison using the evaluation protocol of [35,37,36,11] in Table 2. Here, only a single train/test split is used. We follow [3,11] and use the minimum average displacement error (mADE) and minimum final displacement error (mFDE) as evaluation metrics. Similar to [3,11], the minimum is calculated over 20 samples. Our HBA-Flow model outperforms the state-of-the-art, demonstrating the effectiveness of our approach.
Fig. 4: Mean top 10% predictions (Blue - Groundtruth, Yellow - FloWaveNet [21], Red - Our HBA-Flow model) and predictive distributions on the Intersection Drone dataset. The predictions of our HBA-Flow model are more diverse and better capture the modes of the future trajectory distribution.
Method | Er @ 1sec | Er @ 2sec | Er @ 3sec | Er @ 4sec | Er @ 5sec | -CLL
BMS-CVAE [5] | 0.25 | 0.67 | 1.14 | 1.78 | 2.63 | 26.7
CF-VAE [3] | 0.24 | 0.55 | 0.93 | 1.45 | 2.21 | 21.2
FloWaveNet [21] | 0.23 | 0.50 | 0.85 | 1.31 | 1.99 | 19.8
FloWaveNet [21] + HWD [2] | 0.23 | 0.50 | 0.84 | 1.29 | 1.96 | 19.5
HBA-Flow + Prior (Ours) | | | | | |
Table 3: Five fold cross validation on the Intersection Drone dataset.
We further include experiments on the Intersection Drone dataset [6]. The dataset consists of trajectories of traffic participants recorded at German intersections. In comparison to the Stanford Drone dataset, the trajectories in this dataset are typically longer. Moreover, unlike the Stanford Drone dataset which is recorded at a university campus, this dataset covers more "typical" traffic situations. Here, we follow the same evaluation protocol as for the Stanford Drone dataset, perform a five-fold cross validation and evaluate up to 5 seconds into the future. We report the results in Table 3. We use the strongest baselines from Table 1 for comparison to our HBA-Flow + Prior model (with our HBA-Prior), with three scales, each having eight non-linear squared split coupling flows [41]. For fair comparison, we compare with a FloWaveNet [21] model with three levels and eight non-linear squared split coupling flows per level. We again observe that our HBA-Flow leads to a considerable improvement in accuracy over the FloWaveNet [21] model. Furthermore, the performance gap between HBA-Flow and FloWaveNet increases with longer time horizons. This shows that our approach can better encode spatio-temporal correlations. The qualitative examples in Fig. 6 from both models show that our HBA-Flow model generates diverse trajectories and can better capture the modes of the future trajectory distribution, thus demonstrating the advantage of the block autoregressive structure of our HBA-Flow model. We also see that our HBA-Flow model outperforms the CF-VAE model [3], again illustrating the advantage of exact inference.
In this work, we presented a novel block autoregressive
HBA-Flow framework taking advantage of the representational power of autoregressive models and the efficiency of invertible split coupling flow models. Our approach can better represent multimodal trajectory distributions, capturing long range spatio-temporal correlations. Moreover, the block autoregressive structure of our approach provides for efficient O(log(T)) inference and sampling. We believe that accurate and computationally efficient invertible models that allow exact likelihood computations and efficient sampling present a promising direction of research for anticipation problems in autonomous systems.

References
1. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: Human trajectory prediction in crowded spaces. In: CVPR (2016)
2. Ardizzone, L., Lüth, C., Kruse, J., Rother, C., Köthe, U.: Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392 (2019)
3. Bhattacharyya, A., Hanselmann, M., Fritz, M., Schiele, B., Straehle, C.N.: Conditional flow variational autoencoders for structured sequence prediction. In: BDL@NeurIPS (2019)
4. Bhattacharyya, A., Mahajan, S., Fritz, M., Schiele, B., Roth, S.: Normalizing flows with multi-scale autoregressive priors. In: CVPR (2020)
5. Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a best of many sample objective. In: CVPR (2018)
6. Bock, J., Krajewski, R., Moers, T., Vater, L., Runde, S., Eckstein, L.: The inD dataset: A drone dataset of naturalistic vehicle trajectories at German intersections. arXiv preprint arXiv:1911.07602 (2019)
7. Chandra, R., Bhattacharya, U., Bera, A., Manocha, D.: TraPHic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In: CVPR (2019)
8. Chen, T.Q., Behrmann, J., Duvenaud, D.K., Jacobsen, J.H.: Residual flows for invertible generative modeling. In: NeurIPS (2019)
9. Cremer, C., Li, X., Duvenaud, D.: Inference suboptimality in variational autoencoders. In: ICML (2018)
10. Deo, N., Trivedi, M.M.: Convolutional social pooling for vehicle trajectory prediction. In: CVPR Workshops (2018)
11. Deo, N., Trivedi, M.M.: Scene induced multi-modal trajectory forecasting via planning. In: ICRA Workshops (2019)
12. Dinh, L., Krueger, D., Bengio, Y.: NICE: Non-linear independent components estimation. In: ICLR (2015)
13. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: ICLR (2017)
14. Felsen, P., Lucey, P., Ganguly, S.: Where will they go? Predicting fine-grained adversarial multi-agent motion using conditional variational autoencoders. In: ECCV (2018)
15. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: NIPS (2014)
16. Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., Alahi, A.: Social GAN: Socially acceptable trajectories with generative adversarial networks. In: CVPR (2018)
17. Haar, A.: Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen 69(3), 331–371 (1910)
18. Helbing, D., Molnar, P.: Social force model for pedestrian dynamics. Physical Review E (1995)
19. Ho, J., Chen, X., Srinivas, A., Duan, Y., Abbeel, P.: Flow++: Improving flow-based generative models with variational dequantization and architecture design. In: ICML (2019)
20. Huang, C.W., Dinh, L., Courville, A.: Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. arXiv preprint arXiv:2002.07101 (2020)
21. Kim, S., Lee, S.G., Song, J., Kim, J., Yoon, S.: FloWaveNet: A generative flow for raw audio. In: ICML (2019)
22. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
23. Kingma, D.P., Dhariwal, P.: Glow: Generative flow with invertible 1x1 convolutions. In: NeurIPS (2018)
24. Kirichenko, P., Izmailov, P., Wilson, A.G.: Why normalizing flows fail to detect out-of-distribution data. arXiv preprint arXiv:2006.08545 (2020)
25. Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., Kingma, D.: VideoFlow: A flow-based generative model for video. In: ICLR (2020)
26. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 (2018)
27. Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H., Chandraker, M.: DESIRE: Distant future prediction in dynamic scenes with interacting agents. In: CVPR (2017)
28. Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W., Manocha, D.: TrafficPredict: Trajectory prediction for heterogeneous traffic-agents. In: AAAI (2019)
29. Mangalam, K., Girase, H., Agarwal, S., Lee, K.H., Adeli, E., Malik, J., Gaidon, A.: It is not the journey but the destination: Endpoint conditioned trajectory prediction. In: ECCV (2020)
30. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. In: ISCA Speech Synthesis Workshop (2016)
31. van den Oord, A., Kalchbrenner, N., Espeholt, L., Kavukcuoglu, K., Vinyals, O., Graves, A.: Conditional image generation with PixelCNN decoders. In: NIPS (2016)
32. Pajouheshgar, E., Lampert, C.H.: Back to square one: Probabilistic trajectory forecasting without bells and whistles. In: NeurIPS Workshops (2018)
33. Porwik, P., Lisowska, A.: The Haar-wavelet transform in digital image processing: Its status and achievements. Machine Graphics and Vision 13(1/2), 79–98 (2004)
34. Rhinehart, N., Kitani, K.M., Vernaza, P.: R2P2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In: ECCV (2018)
35. Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: Human trajectory understanding in crowded scenes. In: ECCV (2016)
36. Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, S.H., Savarese, S.: SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In: CVPR (2019)
37. Sadeghian, A., Legros, F., Voisin, M., Vesel, R., Alahi, A., Savarese, S.: CAR-Net: Clairvoyant attentive recurrent network. In: ECCV (2018)
38. Yamaguchi, K., Berg, A.C., Ortiz, L.E., Berg, T.L.: Who are you with and where are you going? In: CVPR (2011)
39. Yuan, Y., Kitani, K.: Diverse trajectory forecasting with determinantal point processes. In: ICLR (2020)
40. Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao, Y., Wang, Y., Nian Wu, Y.: Multi-agent tensor fusion for contextual trajectory prediction. In: CVPR (2019)
41. Ziegler, Z.M., Rush, A.M.: Latent normalizing flows for discrete sequences. In: ICML (2019)
Appendix A. Additional Details of Lemma 1

A.1 Proof of Lemma 1

Lemma 1 (restated).
The generalized Haar transformation f_hba = f_haar ∘ f_eo is invertible for α ∈ [0, 1) and the determinant of the Jacobian of the transformation f_hba = f_haar ∘ f_eo for a sequence of length T_k with y_k^j ∈ R^d is det J_hba = (1 − α)^{(d·T_k)/2}.

Proof. To compute the Jacobian of f_hba, note that each element of the output fine (f_k) and coarse (c_k) trajectories can be expressed in terms of the elements of the input trajectory y_k. From Eqs. (3) and (4) in the main paper, the coarse (c_k) trajectory at level k can be expressed as,

    c_k = (1 − α) e_k + α o_k
        = (1 − α) · [y_k^2, ⋯, y_k^{T_k}] + α · [y_k^1, ⋯, y_k^{T_k−1}]
        = [α y_k^1 + (1 − α) y_k^2, α y_k^3 + (1 − α) y_k^4, ⋯, α y_k^{T_k−1} + (1 − α) y_k^{T_k}].    (11)

Similarly, the fine (f_k) trajectory at level k can be expressed as,

    f_k = (1 − α) o_k + (α − 1) e_k
        = (1 − α) · [y_k^1, ⋯, y_k^{T_k−1}] + (α − 1) · [y_k^2, ⋯, y_k^{T_k}]
        = [(1 − α) y_k^1 + (α − 1) y_k^2, (1 − α) y_k^3 + (α − 1) y_k^4, ⋯, (1 − α) y_k^{T_k−1} + (α − 1) y_k^{T_k}].    (12)

We can now rearrange the elements of the output of f_hba by placing elements from f_k and c_k in an alternating fashion,

    f_hba(y_k) = [f_k, c_k]
               = [(1 − α) y_k^1 + (α − 1) y_k^2, α y_k^1 + (1 − α) y_k^2, ⋯, (1 − α) y_k^{T_k−1} + (α − 1) y_k^{T_k}, α y_k^{T_k−1} + (1 − α) y_k^{T_k}].    (13)

As each element y_k^j ∈ R^d, we can further simplify the output in terms of the individual dimensions of y_k^j. This results in a block diagonal Jacobian J_hba ∈ R^{d·T_k × d·T_k} of f_hba of the form,

    J_hba = [ (1 − α)  (α − 1)     0         0      ⋯
                 α     (1 − α)     0         0      ⋯
                 0        0     (1 − α)   (α − 1)   ⋯
                 0        0        α      (1 − α)   ⋯
                 ⋮        ⋮        ⋮         ⋮      ⋱ ].    (14)

The repeating 2 × 2 block in J_hba repeats (d · T_k)/2 times as the trajectory is of length T_k and each element of the trajectory has d dimensions. Each block has determinant (1 − α)^2 − α(α − 1) = (1 − α). Therefore, the determinant of the Jacobian J_hba is (1 − α)^{(d·T_k)/2}.

To show that f_hba = f_haar ∘ f_eo is invertible, first note that f_eo rearranges the elements of the input trajectory and is thus trivially invertible. Now, note that f_haar is a linear system. For α ∈ [0, 1) we see that det J_hba > 0. Thus, the linear system f_haar in Eq. (4) in the main paper is non-singular and invertible. Thus, f_hba is invertible.
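As a sanity check (our addition, not part of the original appendix), the determinant formula can be verified numerically for d = 1 by building the linear map of f_hba explicitly:

```python
import numpy as np

def f_hba_matrix(T, alpha):
    """Linear map of f_hba on a trajectory of length T (d = 1), with outputs
    ordered as in Eq. (13): fine and coarse entries alternate per odd/even pair."""
    M = np.zeros((T, T))
    for j in range(T // 2):
        o, e = 2 * j, 2 * j + 1                                # odd / even time step (0-indexed)
        M[2 * j, o], M[2 * j, e] = 1 - alpha, alpha - 1        # fine row
        M[2 * j + 1, o], M[2 * j + 1, e] = alpha, 1 - alpha    # coarse row
    return M

T, alpha = 8, 0.3
M = f_hba_matrix(T, alpha)
print(np.linalg.det(M), (1 - alpha) ** (T / 2))   # both approx. 0.2401
```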
Appendix B. Architecture and Optimization

Here, we provide additional architectural details of our HBA-Flow model in Fig. 2 (right), in particular the split coupling flows. The split coupling flows in our HBA-Flow model are based on those of FloWaveNet [21]. However, as mentioned in the main paper, we employ the more powerful non-linear squared flows [41] across baselines versus the affine flows used in [21]. The non-causal WaveNets in the split coupling flows are similar to the ones employed in [21], with 4 convolutional layers with 256 filters each. In practice, we do not find it necessary to employ activation normalization layers along with the more powerful non-linear squared flows. We use identical non-causal WaveNets to learn the parameters of our HBA-Prior. Finally, note that we train the full HBA-Flow model along with the prior using the AdaMax optimizer. The "mixing" parameter α in f_hba is learnable; α = 0.5 corresponds to the standard Haar average.

Appendix C. Qualitative Results
We provide additional qualitative results on Stanford Drone in Fig. 5 and Intersection Drone in Fig. 6, comparing to FloWaveNet [21]. These results further support the results in Figs. 4 and 5 in the main paper. We again see that the predictions of our HBA-Flow model are more diverse and can more effectively capture the modes of the trajectory distributions at complex traffic situations like intersections and crossings. Again, this is further supported by the top 10% of predictions, which are closer to the groundtruth trajectories.
Fig. 5: Mean top 10% predictions (Blue - Groundtruth, Yellow - FloWaveNet [21], Red - Our HBA-Flow model) and predictive distributions on the Stanford Drone dataset. The predictions of our HBA-Flow model are more diverse and better capture the modes of the future trajectory distribution.
Fig. 6: Mean top 10% predictions (Blue - Groundtruth, Yellow - FloWaveNet [21], Red - Our HBA-Flow model) and predictive distributions on the Intersection Drone dataset.