Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance

FEIXIANG HE, University of Leeds, United Kingdom

YUANHANG XIANG, Xi’an Jiaotong University, China

XI ZHAO∗, Xi’an Jiaotong University, China

HE WANG†, University of Leeds, United Kingdom

Fig. 1. Overview of our framework.

Crowd simulation is a central topic in several fields including graphics. To achieve high-fidelity simulations, data has been increasingly relied upon for analysis and simulation guidance. However, the information in real-world data is often noisy, mixed and unstructured, which makes effective analysis difficult, and it has therefore not been fully utilized. With the fast-growing volume of crowd data, such a bottleneck needs to be addressed. In this paper, we propose a new framework which comprehensively tackles this problem. It centers on an unsupervised method for analysis. The method takes as input raw and noisy data with highly mixed multi-dimensional (space, time and dynamics) information, and automatically structures it by learning the correlations among these dimensions. The dimensions together with their correlations fully describe the scene semantics, which consist of recurring activity patterns in a scene, manifested as space flows with temporal and dynamics profiles. The effectiveness and robustness of the analysis have been tested on datasets with great variations in volume, duration, environment and crowd dynamics. Based on the analysis, new methods for data visualization, simulation evaluation and simulation guidance are also proposed. Together, our framework establishes a highly automated pipeline from raw data to crowd analysis, comparison and simulation guidance. Extensive experiments and evaluations have been conducted to show the flexibility, versatility and intuitiveness of our framework.

CCS Concepts: • Computing methodologies → Animation; Topic modeling; Learning in probabilistic graphical models; Scene understanding; Activity recognition and understanding; Multi-agent planning; • Mathematics of computing → Probabilistic inference problems; Nonparametric statistics.

Additional Key Words and Phrases: Crowd Simulation, Simulation Evaluation, Bayesian Inference

∗ Corresponding author
† Corresponding author

Authors’ addresses: Feixiang He, University of Leeds, School of Computing, United Kingdom, [email protected]; Yuanhang Xiang, Xi’an Jiaotong University, School of Computer Science and Technology, China, [email protected]; Xi Zhao, Xi’an Jiaotong University, School of Computer Science and Technology, China, [email protected]; He Wang, University of Leeds, School of Computing, United Kingdom, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2020 Association for Computing Machinery. 0730-0301/2020/7-ART1 $15.00
https://doi.org/10.1145/3386569.3392407

ACM Reference Format:

Feixiang He, Yuanhang Xiang, Xi Zhao, and He Wang. 2020. Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance. ACM Trans. Graph. 39, 4, Article 1 (July 2020), 15 pages. https://doi.org/10.1145/3386569.3392407

ACM Trans. Graph., Vol. 39, No. 4, Article 1. Publication date: July 2020.

Crowd simulation has been intensively used in computer animation, as well as other fields such as architectural design and crowd management. The fidelity or realism of simulation has been a long-standing problem. The main complexity arises from its multifaceted nature. It could mean high-level global behaviors [Narain et al. 2009], mid-level flow information [Wang et al. 2016] or low-level individual motions [Guy et al. 2012]. It could also mean perceived realism [Ennis et al. 2011] or numerical accuracy [Wang et al. 2017]. In any case, analyzing real-world data is inevitable for evaluating and guiding simulations.

The main challenges in utilizing real-world data are data complexity, intrinsic motion randomness and the sheer volume. First, the data complexity makes structured analysis difficult. As the most prevalent form of crowd data, trajectories extracted from sensors contain rich but mixed and unstructured information of space, time and dynamics. Although high-level statistics such as density can be used for analysis, they are not well defined and cannot give structural insights [Wang et al. 2017]. Second, trajectories show intrinsic randomness of individual motions [Guy et al. 2012]. The randomness shows heterogeneity between different individuals and groups, and is influenced by internal factors such as state of mind and external factors such as collision avoidance. Hence a single representation is unlikely to capture all randomness for all people in a scene. This makes it difficult to guide simulation without systematically considering the randomness. Lastly, with more recording devices being installed and data being shared, the sheer volume of data in both space and time, with excessive noise, requires efficient and robust analysis.

Existing methods that use real-world data for purposes such as qualitative and quantitative comparisons [Wang et al. 2016], simulation guidance [Ren et al. 2018] or steering [López et al. 2019] mainly focus on one aspect of data, e.g. space, time or dynamics, and tend to ignore the structural correlations between them. Also, during simulation and analysis, motion randomness is often ignored or uniformly modelled for all trajectories [Guy et al. 2012; Helbing et al. 1995]. Ignoring the randomness (e.g. only assuming the least-effort principle) makes simulated agents walk in straight lines whenever possible, which is rarely observed in real-world data; uniformly modelling the randomness fails to capture the heterogeneity of the data. Besides, most existing methods are not designed to deal with massive data with excessive noise. Many of them require the full trajectories to be available [Wolinski et al. 2014], which cannot be guaranteed in the real world, and do not handle data at the scale of tens of thousands of people and several days long.

In this paper, we propose a new framework that addresses the three aforementioned challenges. This framework is centered on an analysis method which automatically decomposes a crowd scene of a large number of trajectories into a series of modes. Each mode comprehensively captures a unique pattern of spatial, temporal and dynamics information. Spatially, a mode represents a pedestrian flow which connects subspaces with specific functionalities, e.g. entrance, exit, information desk, etc.; temporally it captures when this flow appears, crescendos, wanes and disappears; dynamically it reveals the speed preferences on this flow. With space, time and dynamics information, each mode represents a unique recurring activity and all modes together describe the scene semantics. These modes serve as a highly flexible visualization tool for general and task-specific analysis. Next, they form a natural basis where explicable evaluation metrics can be derived for quantitatively comparing simulated and real crowds, both holistically and per dimension (space, time and dynamics). Lastly, they can easily automate simulation guidance, especially in capturing the heterogeneous motion randomness in the data.

The analysis is done by a new unsupervised clustering method based on non-parametric Bayesian models, because manual labelling would be extremely laborious.
Specifically, Hierarchical Dirichlet Processes (HDP) are used to disentangle the spatial, temporal and dynamics information. Our model consists of three intertwined HDPs and is thus named Triplet HDPs (THDP). The outcome is a (potentially infinite) number of modes with weights. Spatially, each mode is a crowd flow represented by trajectories sharing spatial similarities. Temporally, it is a distribution of when the flow appears, crescendos, peaks, wanes and disappears. Dynamically, it shows the speed distribution of the flow. The whole data is then represented by a weighted combination of all modes. The power of THDP comes with increased model complexity, which makes inference challenging. We therefore propose a new method based on Markov Chain Monte Carlo (MCMC). The method is a major generalization of the Chinese Restaurant Franchise (CRF) method, which was originally developed for HDP. We refer to the new inference method as Chinese Restaurant Franchise League (CRFL). THDP and CRFL are general and effective on datasets with great spatial, temporal and dynamics variations. They provide a versatile base for new methods for visualization, simulation evaluation and simulation guidance.

Formally, we propose the first, to our best knowledge, multi-purpose framework for crowd analysis, visualization, simulation evaluation and simulation guidance, which includes:
(1) a new activity analysis method by unsupervised clustering.
(2) a new visualization tool for highly complex crowd data.
(3) a set of new metrics for comparing simulated and real crowds.
(4) a new approach for automated simulation guidance.

To this end, we have technical contributions which include:
(1) the first, to our best knowledge, non-parametric method that holistically considers space, time and dynamics for crowd analysis, simulation evaluation and simulation guidance.
(2) a new Markov Chain Monte Carlo method which achieves effective inference on intertwined HDPs.
Empirical modelling and data-driven methods have been the two mainstreams in simulation. Empirical modelling dominates early research, where observations of crowd motions are abstracted into mathematical equations and deterministic systems. Crowds can be modelled as fields or flows [Narain et al. 2009], or as particle systems [Helbing et al. 1995], or by velocity and geometric optimization [van den Berg et al. 2008]. Social behaviors including queuing and grouping [Lemercier et al. 2012; Ren et al. 2016] have also been pursued. On the other hand, data-driven simulation has also been explored, using e.g. first-person vision to guide steering behaviors [López et al. 2019] or trajectories to extract features to describe motions [Karamouzas et al. 2018; Lee et al. 2007]. Our research is highly complementary to simulation research in providing analysis, guidance and evaluation metrics. It aims to work with existing steering and global planning methods.


Crowd analysis has been a trendy topic in computer vision [Wang and O’Sullivan 2016; Wang et al. 2008]. These methods aim to learn structured latent patterns in data, similar to our analysis method. However, they only consider limited information (e.g. space only or space/time), whereas our method explicitly models space, time, dynamics and their correlations. In contrast, another way of scene analysis is to focus on the anomalies [Charalambous et al. 2014]. Their perspective is different from ours and therefore complementary to our approach. Trajectory analysis also plays an important role in modern sports analysis [Sha et al. 2018, 2017], but these methods do not deal with a large number of trajectories as our method does. Recently, deep learning has been used for crowd analysis in trajectory prediction [Xu et al. 2018], people counting [Wang et al. 2019], scene understanding [Lu et al. 2019] and anomaly detection [Sabokrou et al. 2017]. However, these methods either do not model low-level behaviors or can only do short-horizon prediction (seconds). Our research is orthogonal to theirs by focusing on the analysis and its applications in simulations.

Besides computer vision, crowd analysis has also been investigated in physics. In [Ali and Shah 2007], Lagrangian Particle Dynamics is exploited for the segmentation of high-density crowd flows and detection of flow instabilities, where the target was similar to our analysis. But they only consider space when separating flows, while our research explicitly models more comprehensive information, including space, time and dynamics. Physics-inspired approaches have also been applied in abnormal trajectory detection for surveillance [Chaker et al. 2017; Mehran et al. 2009]. An approach based on the social force model [Mehran et al. 2009] is introduced to describe individual movement at the microscopic level by placing a grid of particles over the image. A local and a global social network are built by constructing a set of spatio-temporal cuboids in [Chaker et al. 2017] to detect anomalies. Compared with these methods, our anomaly detection is more informative and versatile in showing which attributes contribute to the abnormality.

How to evaluate simulations is a long-standing problem. One major approach is to compare simulated and real crowds. There are qualitative and quantitative methods. Qualitative methods include visual comparison [Lemercier et al. 2012] and perceptual experiments [Ennis et al. 2011]. Quantitative methods fall into model-based methods [Golas et al. 2013] and data-driven methods [Guy et al. 2012; Lerner et al. 2009; Wang et al. 2016, 2017]. Individual behaviors can be directly compared between simulation and reference data [Lerner et al. 2009]. However, this requires full trajectories to be available, which is difficult in practice. Our comparison is based on the latent behavioral patterns instead of individual behaviors and does not require full trajectories. The methods in [Wang et al. 2016, 2017] are similar to ours but only consider space. In contrast, our approach is more comprehensive by considering space, time and dynamics. Different combinations of these factors result in different metrics focusing on comparing different aspects of the data. The comparisons can be spatially focused or temporally focused. They can also compare general situations or specific modes. Overall, our method provides greater flexibility and more intuitive results.

Quantitative simulation guidance has been investigated before, through user control or real-world data. In the former, trajectory-based user control signals can be converted into guiding trajectories for simulation [Shen et al. 2018]. Predefined crowd motion ‘patches’ can be used to compose heterogeneous crowd motions [Jordao et al. 2014]. The purpose of this kind of guidance is to give the user full control to ‘sculpt’ crowd motions. The latter is to guide simulations using real-world data to mimic real crowd motions. Given data and a parameterized simulation model, optimizations are used to fit the model to the data [Wolinski et al. 2014]. Alternatively, features can be extracted and compared for different simulations, so that predictions can be made about different steering methods on a simulation task [Karamouzas et al. 2018]. Our approach also heavily relies on data and is thus similar to the latter. But instead of anchoring on the modelling of individual motions, it focuses on the analysis of scene semantics/activities. It also considers intrinsic motion randomness in a structured and principled way.

The overview of our framework is in Fig. 1. Without loss of generality, we assume that the input is raw trajectories/tracklets which can be extracted from videos by existing trackers, and from which we can estimate the temporal and velocity information. Naively modelling the trajectories/tracklets, e.g. by simple descriptive statistics such as average speed, will average out useful information and cannot capture the data heterogeneity. To capture the heterogeneity in the presence of noise and randomness, we seek an underlying invariant as the scene descriptor. Based on empirical observations, steady space flows, characterized by groups of geometrically similar trajectories, can be observed in many crowd scenes. Each flow is a recurring activity connecting subspaces with designated functionalities, e.g. a flow from the front entrance to the ticket office then to a platform in a train station. Further, this flow reveals certain semantic information, i.e. people buying tickets before going to the platforms. Overall, all flows in a scene form a good basis to describe the crowd activities and the basis is an underlying invariant. How to compute this basis is therefore vital in analysis.

However, computing such a basis is challenging. Naive statistics of trajectories are not descriptive enough because the basis consists of many flows, and is therefore highly heterogeneous and multi-modal. Further, the number of flows is not known a priori. Since the flows are formed by groups of geometrically similar trajectories/tracklets, a natural solution is to cluster them [Bian et al. 2018]. In this specific research context, unsupervised clustering is needed because the sheer data volume prohibits human labelling. In unsupervised clustering, popular methods such as K-means and Gaussian Mixture Models [Bishop 2007] require a pre-defined cluster number which is hard to know in advance.
Hierarchical Agglomerative Clustering [Kauffman and Rousseeuw 2005] does not require a predefined cluster number, but the user must decide when to stop merging, which is similarly problematic. Spectral-based clustering methods [Shi and Malik 2000] solve this problem, but require the computation of a similarity matrix whose space complexity is $O(n^2)$ in the number of trajectories. Too much memory is needed for large datasets and performance degrades quickly with increasing matrix size. Due to the aforementioned limitations, non-parametric Bayesian approaches were proposed [Wang et al. 2016, 2017]. However, a new approach is still needed because the previous approaches only consider space, and therefore cannot be reused or adapted for our purposes.

We propose a new non-parametric Bayesian method to cluster the trajectories with the time and velocity information in an unsupervised fashion, which requires neither manual labelling nor prior knowledge of the cluster number. The outcome of clustering is a series of modes, each being a unique distribution over space, time and speed. Then we propose new methods for data visualization, simulation evaluation and automated simulation guidance.

We first introduce the background of one family of non-parametric Bayesian models, Dirichlet Processes (DPs), and Hierarchical Dirichlet Processes (HDP) (Sec. 4.1). We then introduce our new model Triplet HDPs (Sec. 4.2) and new inference method Chinese Restaurant Franchise League (Sec. 5). Finally, new methods are proposed for visualization (Sec. 6.1), comparison (Sec. 6.2) and simulation guidance (Sec. 6.3).

Dirichlet Process. To understand DP, imagine there is a multi-modal 1D dataset with five high-density areas (modes). Then a classic five-component Gaussian Mixture Model (GMM) can fit the data via Expectation-Maximization [Bishop 2007]. Now further generalize the problem by assuming that there are an unknown number of high-density areas. In this case, an ideal solution would be to impose a prior distribution which can represent an infinite number of Gaussians, so that the number of Gaussians needed, their means and covariances can be automatically learnt.
DP is such a prior. A DP($\gamma$, $H$) is a probability measure on measures [Ferguson 1973], with a scaling parameter $\gamma > 0$ and a base probability measure $H$. A draw from DP, $G \sim DP(\gamma, H)$, is $G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$, where $\beta_k \in \beta$ is random and dependent on $\gamma$. $\phi_k \in \phi$ is a variable distributed according to $H$, $\phi_k \sim H$. $\delta_{\phi_k}$ is called an atom at $\phi_k$. Specifically for the example problem above, we can define $H$ to be a Normal-Inverse-Gamma (NIG) so that any draw, $\phi_k$, from $H$ is a Gaussian; then $G$ becomes an Infinite Gaussian Mixture Model (IGMM) [Rasmussen 1999]. In practice, $k$ is finite and computed during inference.

Hierarchical DPs. Now imagine that the multi-modal dataset in the example problem is observed in separate data groups. Although all the modes can be observed from the whole dataset, only a subset of the modes can be observed in any particular data group. To model this phenomenon, a parent DP is used to capture all the modes, with a child DP modelling the modes in each group:

$$G_j \sim DP(\alpha_j, G) \quad \text{or} \quad G_j = \sum_{i=1}^{\infty} \beta_{ji} \delta_{\psi_{ji}} \quad \text{where} \quad G = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k} \tag{1}$$

where $G_j$ is the modes in the $j$th data group, $\alpha_j$ is the scaling factor and $G$ is its base distribution. $\beta_{ji}$ is the weight and $\delta_{\psi_{ji}}$ is the atom. Now we have the Hierarchical DPs, or HDP [Teh et al. 2006] (Fig. 2 Left).

Fig. 2. Left: HDP. Right: Triplet HDP.

At the top level, the modes are captured by $G \sim DP(\gamma, H)$. In each data group $j$, the modes are captured by $G_j$, which is dependent on $\alpha_j$ and $G$. This way, the modes, $G_j$, in every data group come from the common set of modes $G$, i.e. $\psi_{ji} \in \{\phi_1, \phi_2, ..., \phi_k\}$. In Fig. 2 Left, there is also a variable $\theta_{ji}$ called factor, which indicates with which mode ($\psi_{ji}$ or equally $\phi_k$) the data sample $x_{ji}$ is associated. Finally, if $H$ is again a NIG prior, then the HDP becomes a Hierarchical Infinite Gaussian Mixture Model (HIGMM).

We now introduce THDP (Fig. 2 Right). There are three HDPs in THDP, to model space, time and speed. We name them Time-HDP (Green), Space-HDP (Yellow) and Speed-HDP (Blue). Space-HDP is to compute space modes. Time-HDP and Speed-HDP are to compute the time and speed modes associated with each space mode, which requires the three HDPs to be linked. The modeling choice of the links will be explained later. The only observed variable in THDP is $w$, an observation of a person in a frame. It includes a location-orientation ($x_{ji}$), timestamp ($y_{kd}$) and speed ($z_{kc}$). $\theta^s_{ji}$, $\theta^t_{kd}$ and $\theta^e_{kc}$ are their factor variables. Given a single observation denoted as $w$, we denote one trajectory as $\bar{w}$, a group of trajectories as $\check{w}$ and the whole data set as $\mathbf{w}$. Our final goal is to compute the space, time and speed modes, given $\mathbf{w}$:

$$G^s = \sum_{k=1}^{\infty} \beta_k \delta_{\phi^s_k} \qquad G^t = \sum_{l=1}^{\infty} \zeta_l \delta_{\phi^t_l} \qquad G^e = \sum_{q=1}^{\infty} \rho_q \delta_{\phi^e_q} \tag{2}$$

In THDP, a space mode is defined to be a group of geometrically similar trajectories $\check{w}$. Since these trajectories form a flow, we also refer to it as a space flow. A space flow’s timestamps ($y_{kd}$s) and speeds ($z_{kc}$s) are both 1D data and can be modelled in similar ways. We first introduce the Time-HDP. One space flow $\check{w}$ might appear, crescendo, peak, wane and disappear several times. If a Gaussian distribution is used to represent one time peak on the timeline, multiple Gaussians are needed.
Naturally, IGMM is used to model the $y_{kd} \in \check{w}$. A possible alternative is to use Poisson Processes to model the entry time. But IGMM is chosen due to its ability to fit complex multi-modal distributions. It can also model a flow for the entire duration. Next, since there are many space flows and the $y_{kd}$s of each space flow form a timestamp data group, we assume that there is a common set of time peaks shared by all space flows and each space flow shares only a subset. This way, we use a DP to represent all the time peaks and a child DP below the first DP to represent the peaks in each space flow. This is a HIGMM (for the Time-HDP) where $H^t$ is a NIG. Similarly for the speed, $z_{kc} \in \check{w}$ can also have multiple peaks on the speed axis, so we use IGMM for this. Further, there are many space flows. We again assume that there is a common set of speed peaks and each space flow only has a subset of these peaks, and use another HIGMM for the Speed-HDP.

Fig. 3. From left to right: 1. A space flow. 2. Discretization and flow cell occupancy, darker means more occupants. 3. Codebook with normalized occupancy as probabilities indicated by color intensities. 4. Five colored orientation subdomains (Pink indicates static).

After Time-HDP and Speed-HDP, we introduce the Space-HDP. The Space-HDP is different because, unlike time and speed, space data ($x_{ji}$s) is 4D (2D location + 2D orientation), which means its modes are also multi-dimensional. In contrast to time and speed, a 4D Gaussian cannot represent a group of similar trajectories well. So we need to use a different distribution. Similar to [Wang et al. 2017], we discretize the image domain (Fig. 3: 1) into an $m \times n$ grid (Fig. 3: 2). The discretization serves three purposes: 1. the cell occupancy serves as a good feature for a flow, since a space flow occupies a fixed group of cells. 2. it removes noise caused by frequent turns and tracking errors. 3. it eliminates the dependence on full trajectories. As long as instantaneous positions and velocities can be estimated, THDP can cluster observations. This is crucial in dealing with real-world data where full trajectories cannot be guaranteed. Next, since the position grid alone carries no orientation information and therefore cannot distinguish between flows from A to B and flows from B to A, we discretize the instantaneous orientation into 5 cardinal subdomains (Fig. 3: 4). This makes the grid $m \times n \times 5$, which serves as a codebook, and every 4D $x_{ji}$ can be converted into a cell occupancy. Note that although the grid resolution is problem-specific, it does not affect the validity of our method.

Next, since the cell occupancy on the grid (after normalization) can be seen as a Multinomial distribution, we use Multinomials to represent space flows. This way, a space flow has high probabilities in some cells and low probabilities in others (Fig. 3: 3). Further, we assume the data is observed in groups and any group could contain multiple flows. We use a DP to model all the space flows of the whole dataset, with child DPs representing the flows in individual data groups, e.g. video clips. This is an HDP (Space-HDP) with $H^s$ being a Dirichlet distribution.

Having introduced the three HDPs separately, we need to link them, which is the key of THDP. For a space flow $\check{w}_1$, all $x_{ji} \in \check{w}_1$ are associated with the same space mode, denoted by $\phi^s_1$, and all $y_{kd} \in \check{w}_1$ are associated with the time modes $\{\phi^t\}_1$, which form a temporal profile of $\phi^s_1$. This indicates that $y_{kd}$’s time mode association is dependent on $x_{ji}$’s space mode association. In other words, if $x_{ji} \in \check{w}_1$ ($\phi^s_1$) and $x_{j'i'} \in \check{w}_2$ ($\phi^s_2$), where $x_{ji} = x_{j'i'}$ but $\check{w}_1 \neq \check{w}_2$ (two flows can partially overlap), then their corresponding $y_{kd} \in \check{w}_1$ and $y_{k'd'} \in \check{w}_2$ should be associated with $\{\phi^t\}_1$ and $\{\phi^t\}_2$, where $\{\phi^t\}_1 \neq \{\phi^t\}_2$ when $\check{w}_1$ and $\check{w}_2$ have different temporal profiles. We therefore condition $\theta^t_{kd}$ on $\theta^s_{ji}$ (the left red arrow in Fig. 2 Right) so that $y_{kd}$’s time mode association is dependent on $x_{ji}$’s space mode association. Similarly, a conditioning is also added to $\theta^e_{kc}$ on $\theta^s_{ji}$. This way, $w$’s associations to space, time and speed modes are linked. This is the biggest feature that distinguishes THDP from a simple collection of HDPs, which would otherwise require doing analysis on space, time and dynamics separately, instead of holistically.
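The codebook construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the speed threshold for the static subdomain, the 90-degree sectors centred on the axes and the axis naming are all hypothetical choices, since the text only states that the grid is $m \times n$ and that orientation is split into five subdomains, one of which is static.

```python
import math

# Hypothetical labels for the five orientation subdomains (one is "static").
STATIC, EAST, NORTH, WEST, SOUTH = range(5)


def orientation_bin(vx, vy, static_speed=0.1):
    """Map a 2D velocity to one of five orientation subdomains.

    static_speed is an assumed threshold below which a person counts as static.
    """
    if math.hypot(vx, vy) < static_speed:
        return STATIC
    angle = math.atan2(vy, vx)
    # Shift by 45 degrees so each 90-degree sector is centred on an axis.
    sector = int(((angle + math.pi / 4) % (2 * math.pi)) // (math.pi / 2)) % 4
    return (EAST, NORTH, WEST, SOUTH)[sector]


def codeword(x, y, vx, vy, width, height, m, n):
    """Convert a 4D observation into an index of the m x n x 5 codebook."""
    col = min(int(x / width * n), n - 1)
    row = min(int(y / height * m), m - 1)
    return (row * n + col) * 5 + orientation_bin(vx, vy)


def occupancy(observations, width, height, m, n):
    """Normalised cell occupancy, usable as Multinomial probabilities."""
    counts = [0] * (m * n * 5)
    for x, y, vx, vy in observations:
        counts[codeword(x, y, vx, vy, width, height, m, n)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Because each observation is converted on its own, the sketch works on isolated position/velocity samples and never needs a complete trajectory. Two observations in the same cell but moving in opposite directions receive different codewords, which is what separates an A-to-B flow from a B-to-A flow.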
Given data $\mathbf{w}$, the goal is to compute the posterior distribution $p(\beta, \phi^s, \zeta, \phi^t, \rho, \phi^e | \mathbf{w})$. Existing inference methods for DPs include MCMC [Teh et al. 2006], variational inference [Hoffman et al. 2013] and geometric optimization [Yurochkin and Nguyen 2016]. However, they are designed for simpler models (e.g. a single HDP). Further, both variational inference and geometric optimization suffer from local minima. We therefore propose a new MCMC method for THDP. The method is a major generalization of Chinese Restaurant Franchise (CRF). Next, we first give the background of CRF, then introduce our method.

A single DP has a Chinese Restaurant Process (CRP) representation. CRF is its extension onto HDPs. We refer the readers to [Teh et al. 2006] for details on CRP. Here we directly follow the CRF metaphor on HDP (Eq. 1, Fig. 2 Left) to compute the posterior distribution $p(\beta, \phi | x)$. In CRF, each observation $x_{ji}$ is called a customer. Each data group is called a restaurant. Finally, since a customer is associated with a mode (indicated by $\theta_{ji}$), the mode is called a dish and is to be learned, as if the customer ordered this dish. CRF dictates that, in every restaurant, there is a potentially infinite number of tables, each with only one dish and many customers sharing that dish. There can be multiple tables serving the same dish. All dishes are on a global menu shared by all restaurants. The global menu can also contain an infinite number of dishes. In summary, we have multiple restaurants with many tables where customers order dishes from a common menu.

CRF is a Gibbs sampling approach. The sampling process is conducted at the customer level and the table level alternately. At the customer level, each customer is treated, in turn, as a new customer, given all the other customers sitting at their tables. Then she needs to choose a table in her restaurant. There are two criteria influencing her decision: 1. how many customers are already at the table (table popularity) and 2. how much she likes the dish on that table (dish preference). If she decides not to sit at any existing table, she can create a new table and then order a dish. This dish can be from the menu or she can create a new dish and add it to the menu. Next, at the table level, for each table, all the customers sitting at that table are treated as a new group of customers, and are asked to choose a dish together. Their collective dish preference and how frequently the dish is ordered in all restaurants (dish popularity) will influence their choice. They can choose a dish from the menu or create a new one and add it to the menu. We give the algorithm in Algorithm 1 and refer the readers to Appx. A for more details.

ALGORITHM 1:

Chinese Restaurant Franchise

Result: $\beta$, $\phi$ (Eq. 1)
Input: $x$;
while not converged do
    for every restaurant $j$ do
        for every customer $x_{ji}$ do
            Sample a table $t_{ji}$ (Eq. 11, Appx. A);
            if a new table is chosen then
                Sample a dish or create a new dish (Eq. 12, Appx. A)
            end
        end
        for every table and its customers $x_{jt}$ do
            Sample a new dish (Eq. 13, Appx. A)
        end
    end
    Sample hyper-parameters [Teh et al. 2006]
end

We generalize CRF by proposing a new method called Chinese Restaurant Franchise League. We first change the naming convention by adding the prefixes space-, time- and speed- to customers, restaurants and dishes to distinguish between corresponding variables in the three HDPs. For instance, an observation $w$ now contains a space-customer $x_{ji}$, a time-customer $y_{kd}$ and a speed-customer $z_{kc}$. CRFL is a Gibbs sampling scheme, shown in Algorithm 2. The differences between CRF and CRFL are on two levels. At the top level, CRFL generalizes CRF by running CRF alternately on three HDPs. This makes use of the conditional independence between the Time-HDP and the Speed-HDP when the Space-HDP is fixed. At the bottom level, there are three major differences in the sampling, between Eq. 11 and Eq. 3, Eq. 12 and Eq. 4, and Eq. 13 and Eq. 5.

ALGORITHM 2:

Chinese Restaurant Franchise League

Result: $\beta$, $\phi^s$, $\zeta$, $\phi^t$, $\rho$, $\phi^e$ (Eq. 2)
Input: $\mathbf{w}$;
while not converged do
    Fix all variables in Space-HDP;
    Do one CRF iteration (line 3-13, Algorithm 1) on Time-HDP;
    Do one CRF iteration (line 3-13, Algorithm 1) on Speed-HDP;
    for every space-restaurant $j$ in Space-HDP do
        for every space-customer $x_{ji}$ do
            Sample a table $t_{ji}$ (Eq. 3);
            if a new table is chosen then
                Sample a dish or create a new dish (Eq. 4);
            end
        end
        for every table and its space-customers $x_{jt}$ do
            Sample a new space-dish (Eq. 5);
        end
    end
    Sample hyper-parameters (Appx. B.3);
end

The first difference is when we do customer-level sampling (line 8 in Algorithm 2), where the left side of Eq. 11 in CRF becomes:

$$p(t_{ji} = t, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \tag{3}$$

where $t_{ji}$ is the new table for space-customer $x_{ji}$. $y_{kd}$ and $z_{kc}$ are the time- and speed-customers. $\mathbf{x}^{-ji}$ and $\mathbf{t}^{-ji}$ are the other customers (excluding $x_{ji}$) in the $j$th space-restaurant and their choices of tables. $\mathbf{k}$ denotes the space-dishes. Correspondingly, $\mathbf{y}^{-kd}$ and $\mathbf{o}^{-kd}$ are the other time-customers (excluding $y_{kd}$) in the $k$th time-restaurant and their choices of tables. $\mathbf{l}$ denotes the time-dishes. Similarly, $\mathbf{z}^{-kc}$ and $\mathbf{p}^{-kc}$ are the other speed-customers (excluding $z_{kc}$) in the $k$th speed-restaurant and their choices of tables. $\mathbf{q}$ denotes the speed-dishes. The intuitive interpretation of the differences between Eq. 3 and Eq. 11 is: when a space-customer $x_{ji}$ chooses a table, the popularity and preference are not the only criteria anymore. She also has to consider the preferences of her associated time-customer $y_{kd}$ and speed-customer $z_{kc}$. This is because when $x_{ji}$ orders a different space-dish, $y_{kd}$ and $z_{kc}$ will be placed into a different time-restaurant and speed-restaurant, because the organizations of time- and speed-restaurants are dependent on the space-dishes (the dependence of $\theta^t_{kd}$ and $\theta^e_{kc}$ on $\theta^s_{ji}$). Each space-dish corresponds to a time-restaurant and a speed-restaurant (see Sec. 4.2). Since a space-customer’s choice of space-dish can change during CRFL, the organization of time- and speed-restaurants becomes dynamic.
This is why CRF cannot be directly applied to THDP.

The second difference arises when we need to sample a dish (line 10 in Algorithm 2), where the left side of Eq. 12 in CRF becomes:

$$p(k_{jt^{new}} = k, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{k}^{-jt^{new}}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \propto \begin{cases} m_{\cdot k}\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k \text{ is an existing dish} \\ \gamma\, p(x_{ji} \mid \cdots)\, p(y_{kd} \mid \cdots)\, p(z_{kc} \mid \cdots) & \text{if } k \text{ is a new dish} \end{cases} \tag{4}$$

where $k_{jt^{new}}$ is the new dish for customer $x_{ji}$, and $\cdots$ stands for all the conditional variables for simplicity. The terms $p(y_{kd} \mid \cdots)$ and $p(z_{kc} \mid \cdots)$ are the major differences. We refer the readers to Appx. B regarding the computation of Eq. 3 and Eq. 4.

The last difference arises in table-level sampling (line 14 in Algorithm 2), where the left side of Eq. 13 in CRF changes to:

$$p(k_{jt} = k, \mathbf{x}_{jt}, \mathbf{y}_{kd_{jt}}, \mathbf{z}_{kc_{jt}} \mid \mathbf{k}^{-jt}, \mathbf{y}^{-kd_{jt}}, \mathbf{o}^{-kd_{jt}}, \mathbf{l}^{-ko}, \mathbf{z}^{-kc_{jt}}, \mathbf{p}^{-kc_{jt}}, \mathbf{q}^{-kp}) \propto \begin{cases} m^{-jt}_{\cdot k}\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k \text{ is an existing dish} \\ \gamma\, p(\mathbf{x}_{jt} \mid \cdots)\, p(\mathbf{y}_{kd_{jt}} \mid \cdots)\, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k \text{ is a new dish} \end{cases} \tag{5}$$

where $\mathbf{x}_{jt}$ are the space-customers at the $t$th table, and $\mathbf{y}_{kd_{jt}}$ and $\mathbf{z}_{kc_{jt}}$ are the associated time- and speed-customers. $\mathbf{k}^{-jt}$, $\mathbf{y}^{-kd_{jt}}$, $\mathbf{o}^{-kd_{jt}}$, $\mathbf{l}^{-ko}$, $\mathbf{z}^{-kc_{jt}}$, $\mathbf{p}^{-kc_{jt}}$ and $\mathbf{q}^{-kp}$ are the remaining customers and their table and dish choices in the three HDPs. Again, $\cdots$ stands for all the conditional variables. $p(\mathbf{x}_{jt} \mid \cdots)$ is the Multinomial $f$ as in Eq. 13. Unlike in Eq. 4, $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ and $p(\mathbf{z}_{kc_{jt}} \mid \cdots)$ cannot be easily computed and need special treatment. We refer the readers to Appx. B for details.

Now we have fully derived CRFL. Given a data set $\mathbf{w}$, we can compute the posterior distribution $p(\beta, \phi^s, \zeta, \phi^t, \rho, \phi^e \mid \mathbf{w})$, where $\beta$, $\zeta$ and $\rho$ are the weights of the space, time and speed dishes $\phi^s$, $\phi^t$ and $\phi^e$ respectively. The $\phi^s$ are Multinomials; the $\phi^t$ and $\phi^e$ are Gaussians. As mentioned in Sec. 5.1, the number of $\phi^s$ is learnt automatically, so we do not need to know the number of space dishes in advance. Neither do we need it for $\phi^t$ and $\phi^e$. This makes THDP non-parametric. Further, since one $\phi^s$ can be associated with a potentially infinite number of $\phi^t$s and $\phi^e$s and vice versa, the many-to-many associations are also learnt automatically.

For each sampling iteration in Algorithm 2, the time complexities of sampling on Time-HDP, Speed-HDP and Space-HDP are $O[W(N+L)+KNL]$, $O[W(A+Q)+KAQ]$ and $O[W(M+K)+W(K+1)\eta+JMK]$ respectively, where $\eta = N+L+A+Q$. $W$ is the total number of observations. $K$, $L$ and $Q$ are the numbers of space, time and speed dishes. $J$ is the number of space-restaurants. $M$, $N$ and $A$ are the average numbers of tables in the space-, time- and speed-restaurants respectively. Note that $K$ appears in all three time complexities because the number of space-dishes is also the number of time- and speed-restaurants.

The time complexity of CRFL is therefore $O[W(N+L)+KNL] + O[W(A+Q)+KAQ] + O[W(M+K)+W(K+1)\eta+JMK]$. This is not high in practice. $W$ can be large, depending on the dataset, in which case subsampling can be used to reduce the number of observations. In addition, $K$ is normally smaller than 50 even for highly complex datasets, and $L$ and $Q$ are even smaller. $J$ is decided by the user and is in the range of 10-30. $M$, $N$ and $A$ are not large either, due to the high aggregation property of DPs: each table tends to be chosen by many customers, so the number of tables is low.

THDP provides a powerful and versatile base for new tools. In this section, we present three tools for structured visualization, quantitative comparison and simulation guidance.

After inference, the rich but originally mixed and unstructured data is structured. This is vital for visualization. The time and speed modes are immediately easy to visualize because they are mixtures of univariate Gaussians. The space modes require further treatment because they are multi-dimensional ($m \times n \times \ldots$) Multinomials. For each trajectory $\bar{w}$, we compute a softmax function:

$$p_k(\bar{w}) = \frac{e^{p_k(\bar{w})}}{\sum_{k'=1}^{K} e^{p_{k'}(\bar{w})}}, \quad k \in [1, K] \tag{6}$$

where $p_k(\bar{w}) = p(\bar{w} \mid \beta_k, \phi^s_k, \zeta_k, \phi^t, \rho_k, \phi^e)$. $\phi^s_k$ and $\beta_k$ are the $k$th space mode and its weight. The time and speed modes ($\phi^t$ and $\phi^e$) are associated with the space flow $\phi^s_k$ through the weights $\zeta_k$ and $\rho_k$. $K$ is the total number of space flows. This way, we classify every trajectory into a space flow. We can then visualize representative trajectories with high probabilities, or show anomalous trajectories with low probabilities.

In addition, since THDP captures space, time and dynamics together, a variety of visualizations is possible. A period of time can be represented by a weighted combination of time modes $\{\phi^t\}$. If the user wants to see which space flows are prominent during this period, we can visualize trajectories based on $\int_{\rho, \phi^e} p(\beta, \phi^s \mid \{\phi^t\})$, which gives the space flows with their weights. For instance, if $\{\phi^t\}$ covers the rush hours, $\int_{\rho, \phi^e} p(\beta, \phi^s \mid \{\phi^t\})$ shows which flows are prominent, and their relative importance, during the rush hours. Similarly, visualizing data based on $\int_{\zeta, \phi^t} p(\rho, \phi^e \mid \phi^s)$ tells us whether people walk fast or slowly on the space flow $\phi^s$. A more complex visualization is $p(\zeta, \phi^t, \rho, \phi^e \mid \phi^s)$, where the time-speed distribution is given for a space flow $\phi^s$. This gives the change of speed over time on this space flow, which could reveal congestion at certain times.

By marginalizing and conditioning on different variables (as above), there are many possible ways of visualizing crowd data, and each of them reveals a certain aspect of the data. We do not enumerate all the possibilities for simplicity, but it is clear that THDP can provide highly flexible and insightful visualizations.

Being able to quantitatively compare simulated and real crowds is vital in evaluating the quality of crowd simulation. Trajectory-based [Guy et al. 2012] and flow-based [Wang et al. 2016] methods have been proposed. The first flow-based metrics were proposed in [Wang et al. 2016], whose approach is similar to ours. That work proposed two metrics: average likelihood (AL) and distribution-pair distance (DPD) based on Kullback-Leibler (KL) divergence. The underlying idea is that a good simulation does not have to strictly reproduce the data but should be statistically similar to it. However, they only considered space. We show that THDP is a major generalization of their work and provides much more flexibility with a set of new AL and DPD metrics.
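The softmax classification of Eq. 6 can be illustrated with a short sketch. It assumes the per-flow scores $p_k(\bar{w})$ have already been computed from the learnt model; the function names and the example scores are hypothetical and only show how a trajectory is assigned to its most probable space flow.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-flow scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return (index of the most probable space flow, softmax probabilities)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical scores p_k(w) of one trajectory under K = 3 space flows.
flow_scores = [0.10, 0.70, 0.20]
best_flow, flow_probs = classify(flow_scores)
```

Trajectories whose largest softmax probability is high can serve as representatives of a flow, while trajectories that score poorly under every flow are candidates for anomalies.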

Given a simulation data set $\hat{\mathbf{w}} = (\hat{x}_{ji}, \hat{y}_{kd}, \hat{z}_{kc})$ and $p(\beta, \phi^s, \zeta, \phi^t, \rho, \phi^e \mid \mathbf{w})$ inferred from real-world data $\mathbf{w}$, we can compute the AL metric based on space only, essentially computing the average space likelihood while marginalizing time and speed:

$$\frac{1}{|\hat{\mathbf{w}}|} \sum_{j,i} \sum_{k=1}^{K} \beta_k \int_z \int_y p(\hat{x}_{ji} \mid \phi^s_k, \hat{y}_{kd}, \hat{z}_{kc})\, p(\hat{y}_{kd})\, p(\hat{z}_{kc})\, dy\, dz \tag{7}$$

where $|\hat{\mathbf{w}}|$ is the number of observations in $\hat{\mathbf{w}}$. The dependence on $\beta$, $\phi^s$, $\zeta$, $\phi^t$, $\rho$, $\phi^e$ is omitted for simplicity. If we completely discard time and speed, Eq. 7 reduces to the AL metric in [Wang et al. 2017], $\frac{1}{|\hat{\mathbf{w}}|} \sum_{j,i} \sum_{k} \beta_k\, p(\hat{x}_{ji} \mid \phi^s_k)$. That metric is therefore just a special case of THDP. We give a list of AL metrics in Table 1, all of which have forms similar to Eq. 7.

| # | Metric | To compare |
| 1 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{x}_{ji}, \hat{y}_{kd}, \hat{z}_{kc} \mid \bullet)$ | overall similarity |
| 2 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{x}_{ji}, \hat{y}_{kd} \mid \bullet)$ | space & time, ignoring speed |
| 3 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{x}_{ji}, \hat{z}_{kc} \mid \bullet)$ | space & speed, ignoring time |
| 4 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{y}_{kd}, \hat{z}_{kc} \mid \bullet)$ | time & speed, ignoring space |
| 5 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{x}_{ji} \mid \bullet)$ | space, ignoring time & speed |
| 6 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{y}_{kd} \mid \bullet)$ | time, ignoring space & speed |
| 7 | $\frac{1}{|\hat{\mathbf{w}}|} \sum p(\hat{z}_{kc} \mid \bullet)$ | speed, ignoring space & time |

Table 1. AL metrics. $\bullet$ represents $\{\beta, \phi^s, \zeta, \phi^t, \rho, \phi^e\}$.

AL metrics are based on average likelihoods and summarize the differences between two data sets in one number. To give more flexibility, we also propose distribution-pair metrics. We first learn two posterior distributions $p(\hat{\beta}, \hat{\phi}^s, \hat{\zeta}, \hat{\phi}^t, \hat{\rho}, \hat{\phi}^e \mid \hat{\mathbf{w}})$ and $p(\beta, \phi^s, \zeta, \phi^t, \rho, \phi^e \mid \mathbf{w})$. Then we can compare individual pairs of $\phi^s$ and $\hat{\phi}^s$, $\phi^t$ and $\hat{\phi}^t$, $\phi^e$ and $\hat{\phi}^e$. Since all space, time and speed modes are probability distributions, we propose to use the Jensen-Shannon divergence, as opposed to the KL divergence [Wang et al. 2017], because KL is asymmetric:

$$JSD(P \,\|\, Q) = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M) \tag{8}$$

where $D$ is the KL divergence, $M = \frac{1}{2}(P + Q)$, and $P$ and $Q$ are probability distributions. Again, in the DPD comparison, THDP provides many options, similar to the AL metrics in Table 1. We only give several examples here. Given two space flows $\phi^s$ and $\hat{\phi}^s$, $JSD(\phi^s \,\|\, \hat{\phi}^s)$ directly compares them. Further, $P$ and $Q$ can be conditional distributions. Computing $JSD(p(\phi^t \mid \phi^s) \,\|\, p(\hat{\phi}^t \mid \hat{\phi}^s))$, where $\phi^t$ and $\hat{\phi}^t$ are the associated time modes of $\phi^s$ and $\hat{\phi}^s$ respectively, compares the two temporal profiles. This is useful when $\phi^s$ and $\hat{\phi}^s$ are two spatially similar flows and we want to compare their temporal similarity. Similarly, we can also compare their speed profiles, $JSD(p(\phi^e \mid \phi^s) \,\|\, p(\hat{\phi}^e \mid \hat{\phi}^s))$, or their time-speed profiles, $JSD(p(\phi^t, \phi^e \mid \phi^s) \,\|\, p(\hat{\phi}^t, \hat{\phi}^e \mid \hat{\phi}^s))$. In summary, as with the AL metrics, different conditioning and marginalization choices result in different DPD metrics.

We propose a new method to automate simulation guidance with real-world data, which works with existing simulators including steering and global planning methods.
Assuming that we want to simulate crowds in a given environment based on data, there are still several key parameters that need to be estimated, including the starting/destination positions, the entry timing and the desired speed. After inference, we use GMMs to model both the starting and the destination regions for every space flow. This way, we completely eliminate the need for manual labelling, which is difficult in spaces with no designated entrances/exits (e.g. a square). We also remove the requirement of a one-to-one mapping between the agents in the simulation and in the data. We can sample any number of agents based on the space flow weights ($\beta$) and still keep agent proportions on the different flows similar to the data. In addition, since each flow comes with a temporal and a speed profile, we sample the entry timing and the desired speed for each agent to mimic the randomness in these parameters. It is difficult to set the timing manually when the duration is long, and sampling the speed is necessary to capture the speed variety within a flow caused by latent factors such as different physical conditions.

Next, even with the right setting of all the aforementioned parameters, existing simulators tend to simulate straight lines whenever possible, while the real data shows otherwise. This is because no intrinsic motion randomness is introduced. Intrinsic motion randomness can be observed in that people rarely walk in straight lines, and they generate slightly different trajectories even when asked to walk several times between the same starting position and destination [Wang et al. 2017]. This is related to the state of the person as well as external factors such as collision avoidance. Individual motion randomness can be modelled by assuming the randomness is Gaussian-distributed [Guy et al. 2012]. Here, we do not assume that all people have the same distribution. Instead, we propose a structured model. We observe that people on different space flows show different dynamics but share similar dynamics within the same flow. This is because people on the same flow share the same starting/destination regions and walk through the same part of the environment. In other words, they started in similar positions, had similar goals and made similar navigation decisions. Although individual motion randomness still exists, their randomness is likely to be similarly distributed. However, this is not necessarily true across different flows. We therefore assume that each space flow can be seen as generated by a unique dynamic system, which captures the within-group motion randomness and implicitly accounts for factors such as collision avoidance. Given a trajectory $\bar{w}$ from a flow $\check{w}$, we assume that there is an underlying dynamic system:

$$x^{\bar{w}}_t = A s_t + \omega_t, \quad \omega \sim N(0, \Omega)$$
$$s_t = B s_{t-1} + \lambda_t, \quad \lambda \sim N(0, \Lambda) \tag{9}$$

where $x^{\bar{w}}_t$ is the observed location of a person at time $t$ on trajectory $\bar{w}$, and $s_t$ is the latent state of the dynamic system at time $t$. $\omega_t$ and $\lambda_t$ are the observational and dynamics randomness; both are white Gaussian noise. $A$ and $B$ are the observation and transition matrices. We assume that $\Omega$ is a known diagonal covariance matrix because it is intrinsic to the device (e.g. a camera) and can be trivially estimated. We also assume that $A$ is an identity matrix, so that there is no systematic bias and the observation depends only on the state $s_t$ and the noise $\omega_t$. The dynamic system then becomes $x^{\bar{w}}_t \sim N(I s_t, \Omega)$ and $s_t \sim N(B s_{t-1}, \Lambda)$, where we need to estimate $s_t$, $B$ and $\Lambda$. Given the $U$ trajectories in $\check{w}$, the total likelihood is:

$$p(\check{w}) = \prod_{i=1}^{U} p(\bar{w}_i), \quad \text{where} \quad p(\bar{w}_i) = \prod_{t=2}^{T_i - 1} p(x^i_t \mid s_t)\, p(s_t \mid s_{t-1}), \quad s_1 = x^i_1, \; s_{T_i} = x^i_{T_i} \tag{10}$$

where $T_i$ is the length of trajectory $\bar{w}_i$. We maximize $\log p(\check{w})$ via Expectation-Maximization [Bishop 2007]. Details can be found in Appx. C.
After learning the dynamic system for a space flow, and given a starting and a destination location, $s_1$ and $s_T$, we can sample diversified trajectories that obey the flow dynamics. During simulation guidance, one target trajectory is sampled for each agent, and this trajectory reflects the motion randomness.

In this section, we first introduce the datasets, then show our highly informative and flexible visualization tool. Next, we give quantitative comparison results between simulated and real crowds using the newly proposed metrics. Finally, we show that our automated simulation guidance achieves high semantic fidelity. We only show representative results in the paper and refer the readers to the supplementary video and materials for details.

Fig. 4. Forum (top), CarPark (middle) and TrainStation (bottom) datasets. In each dataset, top left: original data; P1-P9: the top 9 space modes; top right: the time modes of P1-P9; bottom right: the speed modes of P1-P9. Both time and speed profiles are scaled by their respective space mode weights, with the y axis indicating the likelihood.

We choose three publicly available datasets, Forum [Majecka 2009], CarPark [Wang et al. 2008] and TrainStation [Yi et al. 2015], to cover different data volumes, durations, environments and crowd dynamics. Forum is an indoor environment in a school building, recorded by a top-down camera; it contains 664 trajectories and lasts 4.68 hours. Only people are tracked, and they mostly move slowly and casually. CarPark consists of videos of an outdoor car park with mixed pedestrians and cars, recorded by a far-distance camera, and contains 40,453 trajectories in total over five days. TrainStation is a large indoor environment with pedestrians and designated sub-spaces. It is from New York Grand Central Terminal and contains 120,000 frames in total, with 12,684 pedestrians within approximately 45 minutes. The speed varies among pedestrians.

We first show a general, full-mode visualization in Fig. 4. Due to the space limit, we only show the top 9 space modes and their corresponding time and speed profiles. Overall, THDP is effective in decomposing highly mixed and unstructured data into structured results across different data sets. The top 9 space modes (with time and speed) are the main activities. With the environment information (e.g. where the doors/lifts/rooms are), the semantic meanings of the activities can be inferred. In addition, the time and dynamics are captured well. A peak of a space flow (indicated by color) in the time profiles indicates that this flow is likely to appear around that time. Correspondingly, a peak of a space flow in the speed profile indicates a major speed preference of the people on that flow. Multiple space flows can peak near one point in both the time and speed profiles. The speed profiles of Forum and TrainStation are slightly different, with most of the former distributed in a smaller region. This is understandable because people in TrainStation generally walk faster. The speed profile of CarPark is quite different in that it ranges more widely, up to 10 m/s. This is because both pedestrians and vehicles were recorded.

Besides, we show conditioned visualization. If the user is interested in a period (e.g. rush hours) or a speed range (e.g. to see where people generally walk fast/slowly), the associated flow weights can be visualized (Fig. 5). This allows users to see which space flows are prominent in the chosen period or speed range. Conversely, given a space flow of interest, we can visualize the time-speed distribution (Fig. 6), showing how the speed changes over time, which could help identify congestion on that flow at certain times.

Fig. 5. Left: TrainStation, Right: CarPark. The space flow prominence (indicated by bar heights) of P1-P9 in Fig. 4 given a time period (blue bars) or a speed range (orange bars). The higher the bar, the more prominent the space flow is.

Fig. 6. Space flows from Forum, CarPark and TrainStation and their time-speed distributions. The y (up) axis is likelihood. The x and z axes are time and speed. The redder, the higher the likelihood is.

Last but not least, we can identify anomalous trajectories and show unusual activities. The anomalies here are statistical anomalies. Although they are not necessarily suspicious behaviors or events, they can help the user quickly reduce the number of cases that need to be investigated. Note that an anomaly is not necessarily a spatial anomaly: a spatially normal trajectory can be abnormal in time and/or speed. To distinguish between these cases, we first compute the probabilities of all trajectories and select the anomalies. Then, for each anomalous trajectory, we compute its relative probabilities (its probability divided by the maximal trajectory probability) in space, time and speed, resulting in three probabilities in [0, 1]. We use them (after normalization) as the barycentric coordinates of a point inside a colored triangle. This way, we can visualize what contributes to the abnormality (Fig. 7). Take T1 for example. It has a normal spatial pattern and is therefore close to the 'space' vertex. It is far away from both the 'time' and 'speed' vertices, indicating that T1's time and speed patterns are very different from the others'. THDP can thus be used as a versatile and discriminative anomaly detector.

Fig. 7. Representative anomalous trajectories. Every trajectory has a corresponding location in the triangle on the right, indicating which factors contribute more to its abnormality. For instance, T1 is close to the space vertex, which means its spatial probability is relatively high and the main contribution to its abnormality comes from its time and speed. For T2, the contribution mainly comes from its speed.

Non-parametric Bayesian approaches have been used for crowd analysis [Wang et al. 2016, 2017]. However, existing methods can be seen as variants of the Space-HDP and cannot decompose information in time and dynamics. Consequently, they cannot show any results related to time and speed, as opposed to Figs. 4-7. A naive alternative would be to use the methods in [Wang et al. 2016, 2017] to first cluster the data regardless of time and dynamics, then do per-cluster time and dynamics analysis, which is equivalent to using the Space-HDP first and the Time-HDP and Speed-HDP subsequently. However, this kind of sequential analysis fails because of one limitation: the spatial-only HDP misclassifies observations in the overlapping areas of flows [Wang and O'Sullivan 2016]. The subsequent time and dynamics analysis would then be based on wrong clustering. Considering all three types of information simultaneously, which is accomplished by the links (red arrows in Fig. 2 Right) among the three HDPs in THDP, is therefore essential.
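The triangle placement used for the anomaly visualization (Fig. 7) can be sketched as follows. This is a minimal illustration, assuming the three relative probabilities of a trajectory are already available; the vertex positions, the function name and the example values for a T1-like trajectory are hypothetical.

```python
import math

# Hypothetical 2D positions of the 'space', 'time' and 'speed' vertices.
VERTICES = {
    "space": (0.0, 0.0),
    "time": (1.0, 0.0),
    "speed": (0.5, math.sqrt(3) / 2),
}


def barycentric_point(p_space, p_time, p_speed):
    """Map three relative probabilities in [0, 1] to a point in the triangle.

    The probabilities are normalized to sum to one and used as barycentric
    weights, so a high relative probability pulls the point towards the
    corresponding vertex.
    """
    total = p_space + p_time + p_speed
    if total <= 0:
        raise ValueError("at least one relative probability must be positive")
    weights = {
        "space": p_space / total,
        "time": p_time / total,
        "speed": p_speed / total,
    }
    x = sum(weights[name] * VERTICES[name][0] for name in VERTICES)
    y = sum(weights[name] * VERTICES[name][1] for name in VERTICES)
    return x, y


# A T1-like trajectory: spatially normal, unusual in time and speed.
point = barycentric_point(p_space=0.9, p_time=0.05, p_speed=0.05)
distances = {name: math.dist(point, v) for name, v in VERTICES.items()}
```

With these example values the point lands next to the 'space' vertex and far from the other two, matching the reading of T1 given above.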

To compare simulated and real crowds, we asked participants (Master's and PhD students whose expertise is in crowd analysis and simulation) to simulate crowds in Forum and TrainStation. We left CarPark out because its excessively long duration makes it extremely difficult for participants to observe. We built a simple UI for setting up simulation parameters, including the starting/destination locations, the entry timing and the desired speed for every agent. Our approach is agnostic to the simulation method. We chose ORCA in Menge [Curtis et al. 2016] for our experiments, but other simulation methods could be used as well. Initially, we provided the participants with only videos and asked them to do their best to replicate the crowd motions. They found it difficult because they had to watch the videos and try to remember a lot of information, which is also a real-world problem for simulation engineers. This suggests that different levels of detail of the information are needed to set up simulations. The information includes variables such as entry timings and start/end positions, which are readily available, and descriptive statistics such as average speed, which can be computed relatively easily. We systematically investigate their roles in producing scene semantics. After several trials, we identified a set of key parameters: starting/ending positions, entry timing and desired speed. Different simulation methods require different parameters, but these are the key parameters shared by all. We also identified four typical settings in which we gradually provide more and more information about these parameters. This design helps us identify the qualitative and quantitative importance of the key parameters for the purpose of reproducing the scene semantics.

The first setting, denoted Random, is where only the starting/destination regions are given. The participants have to estimate the rest. Based on Random, we further give the exact starting/ending positions, denoted SDR. Next, we also give the entry timing for each agent on top of SDR, denoted SDRT. Finally, we give the average speed of each agent on top of SDRT, denoted SDRTS. Random is the least-informed scenario, where the users have to estimate many parameters, while SDRTS is the most-informed one. A comparison between the four settings is shown in Table 2.

| Information / Setting | Random | SDR | SDRT | SDRTS |
| Starting/Dest. Areas | ✓ | ✓ | ✓ | ✓ |
| Exact Starting/Dest. Positions | × | ✓ | ✓ | ✓ |
| Trajectory Entry Timing | × | × | ✓ | ✓ |
| Trajectory Average Speed | × | × | × | ✓ |

Table 2. Different simulation settings and the information provided.

We use four AL metrics to compare simulations with data, as they provide detailed and insightful comparisons: Overall (Table 1: 1), Space-Only (Table 1: 5), Space-Time (Table 1: 2) and Space-Speed (Table 1: 3). The comparisons are shown in Table 3.

Forum:

| Metric / Simulations | Random | SDR | SDRT | SDRTS | Ours |
| Overall (×10^−…) | 7.11 | 20.67 | 37.08 | 40.55 | |
| Space-Only (×10^−…) | 2.7 | 5.3 | 5.3 | 5.5 | 5.1 |
| Space-Time (×10^−…) | 1.23 | 2.96 | 5.56 | 5.77 | |
| Space-Speed (×10^−…) | 1.5 | 3.6 | 3.5 | 4.0 | |

TrainStation:

| Metric / Simulations | Random | SDR | SDRT | SDRTS | Ours |
| Overall (×10^−…) | 6.7 | 11.97 | 13.96 | 19.39 | |
| Space-Only (×10^−…) | 3.5 | 6.8 | 6.7 | 6.6 | |
| Space-Time (×10^−…) | 8.02 | 15.87 | 19.00 | 18.84 | |
| Space-Speed (×10^−…) | 2.9 | 5.0 | 4.9 | 6.9 | 6.7 |

Table 3. Comparison on Forum (Top) and TrainStation (Bottom) based on AL metrics. Higher is better. Numbers should only be compared within the same row.

In Random, the users had to guess the exact entrance/exit locations, the entry timing and the speed. This is very difficult to do by just watching videos, and Random thus has the lowest scores across the board. When exact entrance/exit locations are provided (SDR), the scores are boosted in Overall and Space-Only, but the scores in Space-Time and Space-Speed remain relatively low. As more information is provided (SDRT and SDRTS), the scores generally increase. This shows that our metrics are sensitive to space, time and dynamics information during comparisons. Further, each type of information is isolated in the comparison. The Space-Only scores are roughly the same for SDR, SDRT and SDRTS. The Space-Time scores do not change much between SDRT and SDRTS. This isolation makes our AL metrics well suited to evaluating simulations in different aspects, providing the flexibility that is necessary in practice.

Next, we show that it is possible to do more detailed comparisons using DPD metrics. Due to the space limit, we show one space flow from all simulation settings (Fig. 8), and compare them in space only (DPD-Space), time only (DPD-Time) and time-speed (DPD-TS) in Table 4.
In DPD-Space, all settings perform similarly because the space information is provided in all of them. In DPD-Time, SDRT and SDRTS are better because both are provided with the timing information. Interestingly, SDRTS is worse than SDRT on the two flows in DPD-TS. Their main difference is that the desired speed in SDRTS is set to the average speed of the trajectory, while the desired speed in SDRT is randomly drawn from a Gaussian estimated from real data. The latter achieves slightly better performance on both flows in DPD-TS.

Forum, space flow P2:

| Metric / Simulations | SDR | SDRT | SDRTS | Ours |
| DPD-Space | 0.4751 | 0.3813 | 0.4374 | |
| DPD-Time | 0.3545 | 0.0795 | 0.064 | |
| DPD-TS | 1.0 | 0.8879 | 1.0 | |

TrainStation, space flow P1:

| Metric / Simulations | SDR | SDRT | SDRTS | Ours |
| DPD-Space | 0.2753 | 0.2461 | 0.2423 | |
| DPD-Time | 0.0428 | 0.0319 | 0.0295 | |
| DPD-TS | 0.9970 | 0.8157 | 0.9724 | |

Table 4. Comparison on space flow P2 in Forum (Top) and space flow P1 in TrainStation (Bottom) based on DPD metrics, both shown in Fig. 4. Lower is better.

Quantitative metrics for comparing simulated and real crowds have been proposed before. However, they compare either only individual motions [Guy et al. 2012] or only space patterns [Wang et al. 2016, 2017]. Holistically considering space, time and speed has a combinatorial effect, leading to many explicable metrics that evaluate different aspects of crowds (AL and DPD metrics). This makes multi-faceted comparisons possible, which existing methods cannot achieve. Technically, the flexible design of THDP allows for different choices of marginalization, which greatly increases the evaluation versatility. This shows the theoretical superiority of THDP over existing methods.
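The DPD values above rest on the Jensen-Shannon divergence of Eq. 8. The following is a minimal sketch for discrete distributions, assuming both distributions are given over the same bins (for example, a time or speed profile evaluated on a common grid, or a space mode over grid cells). The base-2 logarithm is an assumption chosen so that the result lies in [0, 1]; the paper does not state which base it uses.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing. q_i is assumed positive
    wherever p_i is positive, which always holds when Q is the mixture M.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence of Eq. 8 with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical discretized temporal profiles of a real and a simulated flow.
real_profile = [0.1, 0.6, 0.3]
simulated_profile = [0.2, 0.5, 0.3]
distance = jsd(real_profile, simulated_profile)
```

Unlike the KL divergence, this quantity is symmetric in its two arguments and stays finite even when the two distributions have disjoint support.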

Our automated simulation guidance proves to be superior to careful manual settings. We first show the AL results in Table 3. Our guided simulation outperforms the other settings, which were carefully and manually set up, in the Overall comparisons as well as in most dimension-specific comparisons. Next, we show the same space flow of our guided simulation in Fig. 8, in comparison with the other settings. Qualitatively, SDR, SDRT and SDRTS generate narrower flows because straight lines are simulated. In contrast, our simulation shows more realistic intra-flow randomness, which leads to a wider flow that is much more similar to the real data. Quantitatively, we show the DPD results in Table 4. Again, our automated guidance outperforms all other settings.

Automated simulation guidance has only been attempted by a few researchers before [Karamouzas et al. 2018; Wolinski et al. 2014]. However, their methods aim to guide simulators to reproduce low-level motions for overall similarity with the data. Our approach aims to inform simulators with structured scene semantics. Moreover, it gives users the freedom to simulate crowds with the full semantics or with partial semantics (e.g. the top n flows), which no previous method can provide.
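The agent set-up step of the guidance can be sketched as below: each agent is assigned a space flow according to the flow weights $\beta$, then an entry time and a desired speed are drawn from that flow's Gaussian mixtures. This is only an illustration under stated assumptions. The dictionary layout, the function names and all numbers are hypothetical, and the sampling of starting/destination positions from the GMMs and of target trajectories from the dynamic system is left out.

```python
import random


def sample_mixture(modes, rng):
    """Draw one value from a 1D Gaussian mixture of (weight, mean, std) modes."""
    weights = [weight for weight, _, _ in modes]
    _, mean, std = rng.choices(modes, weights=weights)[0]
    return rng.gauss(mean, std)


def sample_agents(flows, n_agents, rng):
    """Sample a flow, an entry time and a desired speed for every agent."""
    betas = [flow["beta"] for flow in flows]
    agents = []
    for _ in range(n_agents):
        k = rng.choices(range(len(flows)), weights=betas)[0]
        entry_time = sample_mixture(flows[k]["time_modes"], rng)
        # Clamp to zero: a Gaussian draw can be negative, a walking speed cannot.
        speed = max(0.0, sample_mixture(flows[k]["speed_modes"], rng))
        agents.append({"flow": k, "entry_time": entry_time, "speed": speed})
    return agents


# Hypothetical learnt flows: weight, time modes (seconds) and speed modes (m/s).
flows = [
    {"beta": 0.7,
     "time_modes": [(0.6, 600.0, 120.0), (0.4, 3000.0, 300.0)],
     "speed_modes": [(1.0, 1.3, 0.2)]},
    {"beta": 0.3,
     "time_modes": [(1.0, 1800.0, 200.0)],
     "speed_modes": [(0.5, 0.9, 0.1), (0.5, 1.6, 0.2)]},
]
agents = sample_agents(flows, n_agents=200, rng=random.Random(0))
```

Because flows are drawn in proportion to $\beta$, any number of agents can be generated while keeping the proportions on the different flows close to those in the data.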

For space discretization, we divide the image space of Forum, CarPark and TrainStation uniformly into 40 × 40, 40 × 40 and 120 × 120 pixel grids respectively. Since Forum is recorded by a top-down camera, we directly estimate the velocity from two consecutive observations in time. For CarPark and TrainStation, we estimate the velocity by reconstructing a top-down view via perspective projection. THDP also has hyper-parameters such as the scaling factors of every DP (6 of them in total). Our inference method is not very sensitive to them because they are also sampled, as part of the CRFL sampling. Please refer to Appx. B.3 for details. In inference, we have a burn-in phase, during which we only use CRF on the Space-HDP and ignore the other two HDPs. After the burn-in phase, we use CRFL on the full THDP. We found that this greatly helps the inference converge. For crowd simulation, we use ORCA in Menge [Curtis et al. 2016].

We randomly select 664 trajectories in Forum, 1000 trajectories in CarPark and 1000 trajectories in TrainStation for performance tests. In each experiment, we split the data into segments in the time domain to mimic fragmented video observations. The number of segments is a user-defined hyper-parameter and depends on the nature of the dataset. We chose the segment numbers to be 384, 87 and 28 for Forum, CarPark and TrainStation respectively, to cover situations where the video is finely or roughly segmented. During training, we first run 5k CRF iterations on the Space-HDP only in the burn-in phase, then do the full CRFL on the whole THDP to speed up the mixing. After training, the numbers of space, time and speed modes are 25, 5 and 7 in Forum; 13, 6 and 6 in CarPark; and 16, 3 and 4 in TrainStation. The training took 85.1, 11.5 and 7.8 minutes on Forum, CarPark and TrainStation respectively, on a PC with an Intel i7-6700 3.4GHz CPU and 16GB memory.

Fig. 8. Space flow P2 in Forum (Top) and P1 in TrainStation (Bottom) in different simulations. The y axes of the time and speed profiles indicate likelihood.
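The pre-processing described above, uniform grid discretization and speed estimation from consecutive observations, can be sketched as follows. The image size, the function names and the sample track are hypothetical, and the speed estimate assumes positions are already in a top-down metric frame, as in Forum; the perspective correction used for CarPark and TrainStation is not shown.

```python
import math


def to_cell(x, y, width, height, n_cols, n_rows):
    """Map a position in a width x height space to a (row, col) grid cell."""
    col = min(max(int(x / width * n_cols), 0), n_cols - 1)
    row = min(max(int(y / height * n_rows), 0), n_rows - 1)
    return row, col


def speeds(track):
    """Estimate speed from consecutive (t, x, y) observations of one track."""
    result = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        result.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return result


# Hypothetical track: (time in s, x, y) in a 640 x 480 space, 40 x 40 grid.
track = [(0.0, 10.0, 10.0), (1.0, 13.0, 14.0), (2.0, 13.0, 20.0)]
cells = [to_cell(x, y, 640.0, 480.0, 40, 40) for _, x, y in track]
track_speeds = speeds(track)
```

Each observation then consists of a grid cell, a timestamp and a speed, which correspond to the space, time and speed dimensions the model works with.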

We chose MCMC to avoid the local minimum issue. (Stochastic) Variational Inference (VI) [Hoffman et al. 2013] and Geometric Optimization [Yurochkin and Nguyen 2016] are theoretically faster. However, VI for a single HDP is already prone to local minima [Wang et al. 2016], and we found the same issue with geometric optimization. One may also ask whether three independent HDPs could be used instead. Using independent HDPs essentially breaks the many-to-many associations between space, time and speed modes. It can cause mis-clustering because the clustering is done on the different dimensions separately [Wang and O'Sullivan 2016].

The biggest limitation of our method is that it does not consider cross-scene transferability. Since the analysis focuses on the semantics in a given scene, it is unclear how the results can inform simulation settings in unseen environments. In addition, our metrics do not directly reflect visual similarities on the individual level. We deliberately avoid agent-level one-to-one comparison to allow greater flexibility in the simulation setting while maintaining statistical similarities. Also, we currently do not model high-level behaviors such as grouping and queuing. This is because such information can only be obtained through human labelling, which would incur a massive workload and is therefore impractical on the chosen datasets. We intentionally chose unsupervised learning to deal with large datasets.

In this paper, we present the first, to the best of our knowledge, multi-purpose framework for comprehensive crowd analysis, visualization, comparison (between real and simulated crowds) and simulation guidance. To this end, we proposed a new non-parametric Bayesian model called Triplet-HDP and a new inference method called Chinese Restaurant Franchise League. We have shown the effectiveness of our method on datasets varying in volume, duration, environment and crowd dynamics.

In the future, we would like to extend the work to cross-environment prediction. It would be ideal if the modes learnt from given environments could be used to predict crowd behaviors in unseen environments. Preliminary results show that the semantics are tightly coupled with the layout of sub-spaces with designated functionalities. This means that a semantic transfer based on subspace functionality is possible. Besides, we will look into using semi-supervised learning to identify and learn high-level social behaviors, such as grouping and queuing.

ACKNOWLEDGEMENT

The project is partially supported by EPSRC (Ref: EP/R031193/1), the Fundamental Research Funds for the Central Universities (xzy012019048) and the National Natural Science Foundation of China (61602366).

REFERENCES

Saad Ali and Mubarak Shah. 2007. A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–6.
Jiang Bian, Dayong Tian, Yuanyan Tang, and Dacheng Tao. 2018. A survey on trajectory clustering analysis. CoRR abs/1802.06971 (2018). arXiv:1802.06971
Christopher Bishop. 2007. Pattern Recognition and Machine Learning. Springer, New York.
Rima Chaker, Zaher Al Aghbari, and Imran N Junejo. 2017. Social network model for crowd anomaly detection and localization. Pattern Recognition 61 (2017), 266–281.
Panayiotis Charalambous, Ioannis Karamouzas, Stephen J Guy, and Yiorgos Chrysanthou. 2014. A data-driven framework for visual crowd analysis. In Computer Graphics Forum, Vol. 33. Wiley Online Library, 41–50.
Sean Curtis, Andrew Best, and Dinesh Manocha. 2016. Menge: A Modular Framework for Simulating Crowd Movement. Collective Dynamics 1, 0 (2016).
Cathy Ennis, Christopher Peters, and Carol O’Sullivan. 2011. Perceptual Effects of Scene Context and Viewpoint for Virtual Pedestrian Crowds. ACM Transactions on Applied Perception 8, 2, Article 10 (Feb. 2011), 22 pages.
Thomas S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1, 2 (1973), 209–230.
Abhinav Golas, Rahul Narain, and Ming Lin. 2013. Hybrid Long-range Collision Avoidance for Crowd Simulation. In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. 29–36.
Stephen J. Guy, Jur van den Berg, Wenxi Liu, Rynson Lau, Ming C. Lin, and Dinesh Manocha. 2012. A Statistical Similarity Measure for Aggregate Crowd Dynamics. ACM Transactions on Graphics 31, 6 (2012), 190:1–190:11.
Dirk Helbing et al. 1995. Social Force Model for Pedestrian Dynamics. Physical Review E (1995).
Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic Variational Inference. Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
Kevin Jordao, Julien Pettré, Marc Christie, and Marie-Paule Cani. 2014. Crowd Sculpting: A Space-time Sculpting Method for Populating Virtual Environments. Computer Graphics Forum (2014).
Ioannis Karamouzas, Nick Sohre, Ran Hu, and Stephen J. Guy. 2018. Crowd Space: A Predictive Crowd Analysis Technique. ACM Transactions on Graphics 37, 6, Article 186 (Dec. 2018), 14 pages.
Leonard Kaufman and Peter J. Rousseeuw. 2005. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
Kang Hoon Lee, Myung Geol Choi, Qyoun Hong, and Jehee Lee. 2007. Group behavior from video: a data-driven approach to crowd simulation. In Proceedings of the 2007 ACM SIGGRAPH/Eurographics symposium on Computer animation. 109–118.
S. Lemercier, A. Jelic, R. Kulpa, J. Hua, J. Fehrenbach, P. Degond, C. Appert-Rolland, S. Donikian, and J. Pettré. 2012. Realistic Following Behaviors for Crowd Simulation. Computer Graphics Forum 31, 2 (2012), 489–498.
Alon Lerner, Yiorgos Chrysanthou, Ariel Shamir, and Daniel Cohen-Or. 2009. Data driven evaluation of crowds. In International Workshop on Motion in Games. Springer, 75–83.
A López, F Chaumette, E Marchand, and J Pettré. 2019. Character navigation in dynamic environments based on optical flow. In Proceedings of Eurographics 2019 (Eurographics 2019). Eurographics.
Ning Lu et al. 2019. ADCrowdNet: An Attention-injective Deformable Convolutional Network for Crowd Understanding. IEEE Conference on Computer Vision and Pattern Recognition (2019).
B. Majecka. 2009. Statistical models of pedestrian behaviour in the Forum. MSc Dissertation. School of Informatics, University of Edinburgh, Edinburgh.
Ramin Mehran, Alexis Oyama, and Mubarak Shah. 2009. Abnormal crowd behavior detection using social force model. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 935–942.
Rahul Narain, Abhinav Golas, Sean Curtis, and Ming C. Lin. 2009. Aggregate Dynamics for Dense Crowd Simulation. ACM Transactions on Graphics 28, 5 (2009), 122:1–122:8.
Carl Edward Rasmussen. 1999. The Infinite Gaussian Mixture Model. In International Conference on Neural Information Processing Systems (Denver, CO) (NIPS’99). MIT Press, Cambridge, MA, USA, 554–560.
Jiaping Ren, Wei Xiang, Yangxi Xiao, Ruigang Yang, Dinesh Manocha, and Xiaogang Jin. 2018. Heter-Sim: Heterogeneous multi-agent systems simulation by interactive data-driven optimization. CoRR abs/1812.00307 (2018). arXiv:1812.00307
Zeng Ren, P. Charalambous, J. Bruneau, Q. Peng, and J. Pettré. 2016. Group modelling: A unified velocity-based approach. Computer Graphics Forum (2016).
Mohammad Sabokrou et al. 2017. Deep-cascade: cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing (2017).
Long Sha, Patrick Lucey, Yisong Yue, Xinyu Wei, Jennifer Hobbs, Charlie Rohlf, and Sridha Sridharan. 2018. Interactive sports analytics: An intelligent interface for utilizing trajectories for interactive sports play retrieval and analytics. ACM Transactions on Computer-Human Interaction (TOCHI) 25, 2 (2018), 1–32.
Long Sha, Patrick Lucey, Stephan Zheng, Taehwan Kim, Yisong Yue, and Sridha Sridharan. 2017. Fine-grained retrieval of sports plays using tree-based alignment of trajectories. (2017). arXiv:1710.02255
Yijun Shen, Joseph Henry, He Wang, Edmond S. L. Ho, Taku Komura, and Hubert P. H. Shum. 2018. Data-Driven Crowd Motion Control With Multi-Touch Gestures. Computer Graphics Forum 37, 6 (2018), 382–394. https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.13333
Jianbo Shi and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8 (2000), 888–905.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101, 476 (2006), 1566–1581.
J. van den Berg, Ming C. Lin, and Dinesh Manocha. 2008. Reciprocal Velocity Obstacles for real-time multi-agent navigation. IEEE International Conference on Robotics and Automation (2008).
He Wang, Jan Ondřej, and Carol O’Sullivan. 2016. Path Patterns: Analyzing and Comparing Real and Simulated Crowds. In Proceedings of the 20th ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D ’16). ACM, New York, NY, USA, 49–57. https://doi.org/10.1145/2856400.2856410
He Wang, Jan Ondřej, and Carol O’Sullivan. 2017. Trending Paths: A New Semantic-level Metric for Comparing Simulated and Real Crowd Data. IEEE Transactions on Visualization and Computer Graphics 23, 5 (2017), 1454–1464.
He Wang and Carol O’Sullivan. 2016. Globally Continuous and Non-Markovian Crowd Activity Analysis from Videos. Springer International Publishing, Cham, 527–544.
Qi Wang et al. 2019. Learning from Synthetic Data for Crowd Counting in the Wild. IEEE Conference on Computer Vision and Pattern Recognition (2019).
Xiaogang Wang, Keng Teck Ma, Gee-Wah Ng, and W. E. L. Grimson. 2008. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE Conference on Computer Vision and Pattern Recognition. 1–8.
David Wolinski, Stephen J. Guy, Anne-Hélène Olivier, Ming C. Lin, Dinesh Manocha, and Julien Pettré. 2014. Parameter estimation and comparative evaluation of crowd simulations. Computer Graphics Forum 33, 2 (2014), 303–312.
Yanyu Xu et al. 2018. Encoding Crowd Interaction with Deep Neural Network for Pedestrian Trajectory Prediction. IEEE Conference on Computer Vision and Pattern Recognition (2018).
S. Yi, H. Li, and X. Wang. 2015. Understanding pedestrian behaviors from stationary crowd groups. In IEEE Conference on Computer Vision and Pattern Recognition. 3488–3496.
Mikhail Yurochkin and XuanLong Nguyen. 2016. Geometric Dirichlet Means Algorithm for topic inference. In International Conference on Neural Information Processing Systems.

A CHINESE RESTAURANT FRANCHISE

To give the mathematical derivation of the sampling process described in Sec. 5.1, we first give meanings to the variables in Fig. 2 Left. $\theta_{ji}$ is the dish choice made by $x_{ji}$, the $i$th customer in the $j$th restaurant. $G_j$ is the set of tables with dishes, and the dishes are from the global menu $G$. Since $\theta_{ji}$ indicates the choice of a table and therefore of a dish, we use some auxiliary variables to represent the process. We introduce $t_{ji}$ and $k_{jt}$ as the indices of the table and the dish on the table chosen by $x_{ji}$. We also denote $m_{jk}$ as the number of tables serving the $k$th dish in restaurant $j$ and $n_{jtk}$ as the number of customers at table $t$ in restaurant $j$ having the $k$th dish. A dot in a subscript denotes an accumulated count, such as $m_{\cdot k}$ representing the total number of tables serving the $k$th dish. A superscript indicates which customer or table is removed. If customer $x_{ji}$ is removed, then $n_{jtk}^{-ji}$ is the number of customers at table $t$ in restaurant $j$ having the $k$th dish without the customer $x_{ji}$.

Customer-level sampling. To choose a table for $x_{ji}$ (line 5 in Algorithm 1), we sample a table index $t_{ji}$:

$$p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n_{jt\cdot}^{-ji} \, f_{k_{jt}}^{-x_{ji}}(x_{ji}) & \text{if } t \text{ already exists} \\ \alpha_j \, p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k}) & \text{if } t = t^{new} \end{cases} \tag{11}$$

where $n_{jt\cdot}^{-ji}$ is the number of customers at table $t$ (table popularity), and $f_{k_{jt}}^{-x_{ji}}(x_{ji})$ is how much $x_{ji}$ likes the $k_{jt}$th dish, $f_{k_{jt}}$, served on that table (dish preference). $f_{k_{jt}}$ is the dish and thus is a problem-specific probability distribution. $f_{k_{jt}}^{-x_{ji}}(x_{ji})$ is the likelihood of $x_{ji}$ on $f_{k_{jt}}$. In our problem, $f_{k_{jt}}$ is Multinomial in the Space-HDP and Normal otherwise. $\alpha_j$ is the parameter in Eq. 1, so it controls how likely $x_{ji}$ is to create a new table, after which she needs to choose a dish according to $p(x_{ji} \mid \mathbf{t}^{-ji}, t_{ji} = t^{new}, \mathbf{k})$.
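The two cases of Eq. 11 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the argument layout and the example numbers are assumptions, and the likelihood values are taken as given instead of being computed from a Multinomial or Normal dish.

```python
import random

def table_probabilities(counts, likelihoods, alpha, new_table_likelihood):
    """Normalised version of the two cases in Eq. 11 (hypothetical helper).

    counts[t]            -- customers at table t, excluding the one being resampled
    likelihoods[t]       -- likelihood of the customer under the dish served at table t
    alpha                -- concentration parameter alpha_j
    new_table_likelihood -- likelihood of the customer if a new table is opened
    """
    # Existing tables: table popularity times dish preference.
    weights = [n * f for n, f in zip(counts, likelihoods)]
    # Last entry: probability mass for opening a new table.
    weights.append(alpha * new_table_likelihood)
    total = sum(weights)
    return [w / total for w in weights]

def sample_table(probabilities, rng=random):
    """Draw a table index; the last index means 'open a new table'."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Illustrative numbers only: two existing tables and one potential new table.
probs = table_probabilities([3, 1], [0.5, 0.2], alpha=1.0, new_table_likelihood=0.1)
choice = sample_table(probs, random.Random(0))
```

A popular table with a well-liked dish dominates the draw, while `alpha` scales how often the customer opens a new table.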
When a new table is created, $t_{ji} = t^{new}$, we need to sample a dish (line 7 in Algorithm 1), indexed by $k_{jt^{new}}$, according to:

$$p(k_{jt^{new}} = k \mid \mathbf{t}, \mathbf{k}^{-jt^{new}}) \propto \begin{cases} m_{\cdot k} \, f_{k}^{-x_{ji}}(x_{ji}) & \text{if } k \text{ already exists} \\ \gamma \, f_{k^{new}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{new} \end{cases} \tag{12}$$

where $m_{\cdot k}$ is the total number of tables across all restaurants serving the $k$th dish (dish popularity). $f_{k}^{-x_{ji}}(x_{ji})$ is how much $x_{ji}$ likes the $k$th dish, again the likelihood of $x_{ji}$ on $f_k$. $\gamma$ is the parameter in Eq. 1, so it controls how likely a new dish is to be created.

Table-level sampling. Next we sample a dish for a table (line 11 in Algorithm 1). We denote all customers at the $t$th table in the $j$th restaurant as $\mathbf{x}_{jt}$. Then we sample its dish $k_{jt}$ according to:

$$p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}) \propto \begin{cases} m_{\cdot k}^{-jt} \, f_{k}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k \text{ already exists} \\ \gamma \, f_{k^{new}}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt}) & \text{if } k = k^{new} \end{cases} \tag{13}$$

Similarly, $m_{\cdot k}^{-jt}$ is the total number of tables across all restaurants serving the $k$th dish, without $\mathbf{x}_{jt}$ (dish popularity). $f_{k}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})$ is how much the group of customers $\mathbf{x}_{jt}$ likes the $k$th dish (dish preference). This time, $f_{k}^{-\mathbf{x}_{jt}}(\mathbf{x}_{jt})$ is a joint probability of all $x_{ji} \in \mathbf{x}_{jt}$.

Finally, in both Eq. 12 and Eq. 13, we may need to sample a new dish. This is done by sampling a new distribution from the base distribution $H$, $\phi_k \sim H$. After inference, the weights $\beta$ can be computed as $\beta \sim \text{Dirichlet}(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma)$. The choice of $H$ is related to the data. In our metaphor, the dishes of the Space-HDP are flows, so we use Dirichlet. In the Time-HDP and Speed-HDP, the dishes are modes of time and speed, which are Normals, so we use Normal-Inverse-Gamma for $H$. These choices are made because Dirichlet and Normal-Inverse-Gamma are the conjugate priors of Multinomial and Normal respectively. The whole CRF sampling is done by iteratively computing Eq. 11 to Eq. 13.
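Because the table-level likelihood in Eq. 13 is a joint probability over every customer at a table, a direct product can underflow for large tables. The sketch below shows one common way around this, working in log space and normalising with a log-sum-exp shift. It is an illustration under stated assumptions, not the authors' code: `dish_posterior` and its arguments are hypothetical, and the joint log-likelihoods are supplied by the caller.

```python
import math

def dish_posterior(table_counts, log_liks, gamma, log_lik_new):
    """Normalised probabilities for the two cases of Eq. 13 (hypothetical helper).

    table_counts[k] -- tables serving dish k, excluding the table being resampled
    log_liks[k]     -- joint log-likelihood of the table's customers under dish k
    gamma           -- concentration parameter controlling new dishes
    log_lik_new     -- joint log-likelihood of the customers under a new dish
    """
    log_w = []
    for m, ll in zip(table_counts, log_liks):
        # A dish that no other table serves gets zero weight.
        log_w.append(math.log(m) + ll if m > 0 else float("-inf"))
    log_w.append(math.log(gamma) + log_lik_new)  # last entry = new dish
    peak = max(log_w)  # shift by the largest term so exp() stays in range
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative numbers only: exp(-1000) underflows to 0.0 in double precision,
# so the unshifted product form of Eq. 13 would give 0/0 here.
post = dish_posterior([2, 1], [-1000.0, -1001.0], gamma=0.5, log_lik_new=-1002.0)
```

The last entry of the returned list is the probability of creating a new dish, in which case a new $\phi_k$ would be drawn from $H$.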
The dish number will dynamically increase/decrease until the sampling mixes. In this way, we do not need to know in advance how many space flows, time modes or speed modes there are, because they will be automatically learnt.

B CHINESE RESTAURANT FRANCHISE LEAGUE

B.1 Customer Level Sampling

When we do customer-level sampling to sample a new table (line 8 in Algorithm 2), the left side of Eq. 11 becomes:

$$p(t_{ji} = t, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \tag{14}$$

So whether $y_{kd}$ and $z_{kc}$ like the new restaurants should be taken into consideration. After applying Bayes' rule and factorization to Eq. 14, we have:

$$p(t_{ji} = t, x_{ji}, y_{kd}, z_{kc} \mid \bullet) = p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k}) \; p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji} = t, k_{jt} = k, \bullet) \; p(y_{kd} \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}) \; p(z_{kc} \mid t_{ji} = t, k_{jt} = k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \tag{15}$$

where $\bullet$ is $\{\mathbf{x}^{-ji}, \mathbf{t}^{-ji}, \mathbf{k}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}\}$. The four probabilities on the right-hand side of Eq. 15 have intuitive meanings. $p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k})$ and $p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji} = t, k_{jt} = k, \bullet)$ are the table popularity and dish preference of $x_{ji}$ in the Space-HDP:

$$p(t_{ji} \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto \begin{cases} n_{jt\cdot}^{-ji} & \text{if } t \text{ already exists} \\ \alpha_j & \text{if } t = t^{new} \end{cases} \tag{16}$$

$$p(x_{ji} \mid y_{kd}, z_{kc}, t_{ji} = t, k_{jt} = k, \bullet) \propto \begin{cases} f_{k_{jt}}^{-x_{ji}}(x_{ji}) & \text{if } t \text{ exists} \\ m_{\cdot k} \, f_{k}^{-x_{ji}}(x_{ji}) & \text{else if } k \text{ exists} \\ \gamma \, f_{k^{new}}^{-x_{ji}}(x_{ji}) & \text{if } k = k^{new} \end{cases} \tag{17}$$

Eq. 16 and Eq. 17 are just a re-organization of Eq. 11 and Eq. 12. The remaining $p(y_{kd} \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ and $p(z_{kc} \mid t_{ji} = t, k_{jt} = k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$ can be seen as how much the time-customer $y_{kd}$ and the speed-customer $z_{kc}$ like the $k$th time-restaurant and speed-restaurant respectively (restaurant preference). This restaurant preference does not appear in single HDPs and thus needs special treatment. This is the first major difference between CRFL and CRF. Since we propose the same treatment for both, we only explain the time-restaurant preference treatment here.

If we computed $p(y_{kd} \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$ on every time-table in every time-restaurant each time we sample a $t_{ji}$, it would be prohibitively slow.
We therefore marginalize over all the time-tables in a time-restaurant, to get a general restaurant preference of $y_{kd}$:

$$p(y_{kd} \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}) = \sum_{o_{kd}=1}^{h_{k\cdot}} p(o_{kd} = o \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}) \; p(y_{kd} \mid o_{kd} = o, l_{ko} = l, \mathbf{l}) \tag{18}$$

where $o_{kd}$ is the table choice of $y_{kd}$ in the $k$th time-restaurant. $l_{ko}$ is the time-dish served on the $o$th table in the $k$th time-restaurant. $h_{k\cdot}$ is the total number of tables in the $k$th time-restaurant. Similar to Eq. 16 and Eq. 17:

$$p(o_{kd} = o \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}) \propto \begin{cases} s_{ko}^{-kd} & \text{if } o \text{ exists} \\ \epsilon_k & \text{if } o_{kd} = o^{new} \end{cases} \tag{19}$$

where $s_{ko}^{-kd}$ is the number of time-customers already at the $o$th table and $\epsilon_k$ is the scaling factor.

$$p(y_{kd} \mid o_{kd} = o, l_{ko} = l, \mathbf{l}) \propto \begin{cases} g_{l_{ko}}^{-y_{kd}}(y_{kd}) & \text{if } o \text{ exists} \\ h_{\cdot l} \, g_{l}^{-y_{kd}}(y_{kd}) & \text{else if } l \text{ exists} \\ \varepsilon \, g_{l^{new}}^{-y_{kd}}(y_{kd}) & \text{if } l = l^{new} \end{cases} \tag{20}$$

where $h_{\cdot l}$ is the total number of tables serving time-dish $l$ and $g$ is a posterior predictive distribution of Normal, a Student's t-distribution. $\varepsilon$ controls how likely a new time-dish is to be needed. Now we have finished deriving the sampling for $p(y_{kd} \mid t_{ji} = t, k_{jt} = k, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l})$. Similar derivations can be done for $p(z_{kc} \mid t_{ji} = t, k_{jt} = k, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q})$.

After table sampling, we need to do dish sampling (line 10 in Algorithm 2). The left side of Eq. 12 becomes:

$$p(k_{jt^{new}} = k, x_{ji}, y_{kd}, z_{kc} \mid \mathbf{k}^{-jt^{new}}, \mathbf{y}^{-kd}, \mathbf{o}^{-kd}, \mathbf{l}, \mathbf{z}^{-kc}, \mathbf{p}^{-kc}, \mathbf{q}) \propto \begin{cases} m_{\cdot k}^{-jt} \, p(x_{ji} \mid \cdots) \, p(y_{kd} \mid \cdots) \, p(z_{kc} \mid \cdots) & \text{if } k \text{ already exists} \\ \gamma \, p(x_{ji} \mid \cdots) \, p(y_{kd} \mid \cdots) \, p(z_{kc} \mid \cdots) & \text{if } k = k^{new} \end{cases} \tag{21}$$

The differences between Eq. 21 and Eq. 12 are $p(y_{kd} \mid \cdots)$ and $p(z_{kc} \mid \cdots)$. Both are Infinite Gaussian Mixture Models, so the likelihoods can be easily computed.
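As an illustration of why these likelihoods are cheap, the Student's t posterior predictive $g$ of a Normal dish under a Normal-Inverse-Gamma prior has a closed form. The sketch below uses the standard conjugate update; the function names and the default hyper-parameter values (`mu0`, `kappa0`, `alpha0`, `beta0`) are assumptions for illustration and are not the settings used in the paper.

```python
import math

def nig_posterior(data, mu0, kappa0, alpha0, beta0):
    """Standard Normal-Inverse-Gamma update given the data already assigned to a dish."""
    n = len(data)
    if n == 0:
        return mu0, kappa0, alpha0, beta0
    mean = sum(data) / n
    scatter = sum((x - mean) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * scatter + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def log_predictive(y, data, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log density of the Student's t posterior predictive at y (hypothetical defaults)."""
    mu_n, kappa_n, alpha_n, beta_n = nig_posterior(data, mu0, kappa0, alpha0, beta0)
    dof = 2.0 * alpha_n                                        # degrees of freedom
    scale_sq = beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n)  # squared scale
    z_sq = (y - mu_n) ** 2 / (dof * scale_sq)
    return (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
            - 0.5 * math.log(dof * math.pi * scale_sq)
            - (dof + 1.0) / 2.0 * math.log1p(z_sq))

# Illustrative data: two time stamps already assigned to a time-dish.
params = nig_posterior([1.0, 3.0], 0.0, 1.0, 1.0, 1.0)
```

Excluding the customer being resampled, as the superscript $-y_{kd}$ requires, amounts to leaving that observation out of `data` before calling `log_predictive`.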
We have therefore given the whole sampling process for the customer-level sampling (Eq. 14). We still need to deal with the table-level sampling.

B.2 Table Level Sampling

Similarly, when we do the table-level sampling (line 14 in Algorithm 2), the left side of Eq. 13 changes to:

$$p(k_{jt} = k, \mathbf{x}_{jt}, \mathbf{y}_{kd_{jt}}, \mathbf{z}_{kc_{jt}} \mid \mathbf{k}^{-jt}, \mathbf{y}^{-kd_{jt}}, \mathbf{o}^{-kd_{jt}}, \mathbf{l}^{-ko}, \mathbf{z}^{-kc_{jt}}, \mathbf{p}^{-kc_{jt}}, \mathbf{q}^{-kp}) \propto \begin{cases} m_{\cdot k}^{-jt} \, p(\mathbf{x}_{jt} \mid \cdots) \, p(\mathbf{y}_{kd_{jt}} \mid \cdots) \, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k \text{ already exists} \\ \gamma \, p(\mathbf{x}_{jt} \mid \cdots) \, p(\mathbf{y}_{kd_{jt}} \mid \cdots) \, p(\mathbf{z}_{kc_{jt}} \mid \cdots) & \text{if } k = k^{new} \end{cases} \tag{22}$$

where $\mathbf{x}_{jt}$ is the space-customers at table $t$, and $\mathbf{y}_{kd_{jt}}$ and $\mathbf{z}_{kc_{jt}}$ are the associated time and speed customers. $\mathbf{k}^{-jt}$, $\mathbf{y}^{-kd_{jt}}$, $\mathbf{o}^{-kd_{jt}}$, $\mathbf{l}^{-ko}$, $\mathbf{z}^{-kc_{jt}}$, $\mathbf{p}^{-kc_{jt}}$ and $\mathbf{q}^{-kp}$ are the remaining customers and their choices of tables and dishes in the three HDPs. For simplicity, $\cdots$ represents all the conditioning variables. $p(\mathbf{x}_{jt} \mid \cdots)$ is the Multinomial $f$ as in Eq. 13.

$p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ and $p(\mathbf{z}_{kc_{jt}} \mid \cdots)$ are not easy to compute. However, they can be treated in the same way, so we only explain how to compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ here. To fully compute $p(\mathbf{y}_{kd_{jt}} \mid \cdots) = p(\mathbf{y}_{kd_{jt}} \mid k_{jt} = k, \mathbf{o}^{-kd_{jt}}, \mathbf{l}^{-ko})$, one needs to consider it for every $y_{kd_{jt}} \in \mathbf{y}_{kd_{jt}}$, which is extremely expensive. This is because we deal with large datasets and there can easily be thousands, if not more, of customers in $\mathbf{y}_{kd_{jt}}$.

In Eq. 15, we already see how $y_{kd}$'s time-restaurant preference influences the table choice of $x_{ji}$. Given a group $\mathbf{y}_{kd_{jt}}$, their collective time-restaurant preference, $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$, will influence the dish choice of $\mathbf{x}_{jt}$. Since the distribution of individual time-restaurant preference is hard to compute analytically, we approximate it. We do a random sampling over $\mathbf{y}_{kd_{jt}}$ to approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$. The number of samples is a hyper-parameter, referred to as customer selection. For every single $y \in \mathbf{y}_{kd_{jt}}$ we can compute its probability in the same way as in Eq. 18. So we approximate $p(\mathbf{y}_{kd_{jt}} \mid \cdots)$ with the joint probability of the sampled time-customers.

B.3 Sampling for Hyper-parameters

A Dirichlet Process contains two parameters, a base distribution and a concentration parameter. To make THDP more robust to these parameters, we impose a prior, a Gamma distribution, onto the concentration parameter, $\gamma \sim \Gamma(\alpha, \varpi)$, where $\alpha$ is the shape parameter and $\varpi$ is the rate parameter. In total there are six $\alpha$s and $\varpi$s for the six DPs in THDP. They are initialized to 0.1 and then updated during the optimization using the method in [Teh et al. 2006]. The update is done in every iteration of CRFL, after sampling all the other parameters. The customer selection parameter is set to 1000 across all experiments. Finally, after CRFL, the inference is done for the three distributions in Eq. 2:

$$\phi_k^s \sim H_s, \quad \beta \sim \text{Dirichlet}(m_{\cdot 1}, m_{\cdot 2}, \cdots, m_{\cdot k}, \gamma) \tag{23}$$

$$\phi_l^t \sim H_t, \quad \zeta \sim \text{Dirichlet}(h_{\cdot 1}, h_{\cdot 2}, \cdots, h_{\cdot l}, \varepsilon) \tag{24}$$

$$\phi_q^e \sim H_e, \quad \rho \sim \text{Dirichlet}(a_{\cdot 1}, a_{\cdot 2}, \cdots, a_{\cdot q}, \lambda) \tag{25}$$

where $m_{\cdot k}$ is the total number of space-tables choosing space-dish $k$; $h_{\cdot l}$ is the total number of time-tables choosing time-dish $l$; $a_{\cdot q}$ is the total number of speed-tables choosing speed-dish $q$. $\gamma$, $\varepsilon$ and $\lambda$ are the scaling factors of $G_s$, $G_t$ and $G_e$.

C SIMULATION GUIDANCE

The dynamics of one trajectory, $\bar{w}$, is:

$$x_t^{\bar{w}} = A s_t + \omega_t, \quad \omega \sim \mathcal{N}(0, \Omega)$$

$$s_t = B s_{t-1} + \lambda_t, \quad \lambda \sim \mathcal{N}(0, \Lambda)$$

Given the $U$ trajectories from a space flow $\check{w}$, the total likelihood is:

$$p(\check{w}) = \prod_{i=1}^{U} p(\bar{w}_i), \quad \text{where } p(\bar{w}_i) = \prod_{t=2}^{T_i - 1} p(x_t^i \mid s_t) \, P(s_t \mid s_{t-1}), \quad s_1 = x_1^i, \; s_T = x_{T_i}^i \tag{26}$$

where $A$ is an identity matrix and $\Omega$ is a known diagonal matrix. $T_i$ is the length of trajectory $i$. We use homogeneous coordinates to represent both $x = [x_1, x_2, 1]^T$ and $s = [s_1, s_2, 1]^T$. Consequently, $A$ is an $\mathbb{R}^{3 \times 3}$ identity matrix. $\Omega$ is set to be an $\mathbb{R}^{3 \times 3}$ diagonal matrix with its non-zero entries set to 0.001. $B$ is an $\mathbb{R}^{3 \times 3}$ transition matrix and $\Lambda$ is an $\mathbb{R}^{3 \times 3}$ covariance matrix, both to be learned.

We apply Expectation-Maximization (EM) [Bishop 2007] to estimate the parameters $B$, $\Lambda$ and the states $S$ by maximizing the log likelihood $\log P(u)$. Each iteration of EM consists of an E-step and an M-step. In the E-step, we fix the parameters and sample the states $s$ via the posterior distribution of $x$. The posterior distribution and the expectation of the complete-data likelihood are denoted as

$$L = E_{S \mid X; \hat{B}, \hat{\Lambda}}\big(\log P(S, X; B, \Lambda)\big) = \sum_i \tau_i \, E_{s^i \mid x^i}\{p(s^i, x^i)\} \tag{27}$$

where $\tau_i$ is defined as

$$\tau_i = \frac{\frac{1}{T_i} \sum_{t=1}^{T_i} p(x_t^i \mid s_t^i)}{\sum_{i=1}^{U} \frac{1}{T_i} \sum_{t=1}^{T_i} p(x_t^i \mid s_t^i)}.$$

In the M-step, we maximize the complete-data likelihood and the model parameters are updated as:

$$B^{new} = \Big(\sum_i \tau_i \sum_{t=2}^{T_i} P_{t,t-1}^i\Big) \Big(\sum_i \tau_i \sum_{t=2}^{T_i} P_{t-1,t-1}^i\Big)^{-1} \tag{28}$$

$$\Lambda^{new} = \frac{\sum_i \tau_i \big(\sum_{t=2}^{T_i} P_{t,t}^i - B^{new} \sum_{t=2}^{T_i} P_{t,t-1}^i\big)}{\sum_i \tau_i (T_i - 1)} \tag{29}$$

$$P_{t,t}^i = E_{s^i \mid x^i}(s_t s_t^T) \tag{30}$$

$$P_{t,t-1}^i = E_{s^i \mid x^i}(s_t s_{t-1}^T) \tag{31}$$

During updating, we use $\Lambda = \frac{1}{2}(\Lambda + \Lambda^T)$ to ensure its symmetry.
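To make the role of the learned matrices concrete, the sketch below rolls out the noise-free part of the state equation, $s_t = B s_{t-1}$, in homogeneous coordinates and applies the symmetrisation step to a covariance estimate. It is only an illustration: the helper names and the matrix values are made up, the noise terms are omitted, and the EM updates of Eq. 28 and Eq. 29 are not implemented here.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rollout(B, s0, steps):
    """Noise-free states s_t = B s_{t-1}, starting from the homogeneous state s0."""
    states = [list(s0)]
    for _ in range(steps):
        states.append(mat_vec(B, states[-1]))
    return states

def symmetrise(Lam):
    """Return (Lam + Lam^T) / 2, the symmetry fix applied during updating."""
    n = len(Lam)
    return [[0.5 * (Lam[r][c] + Lam[c][r]) for c in range(n)] for r in range(n)]

# Hypothetical transition matrix: with s = [s1, s2, 1]^T, the last column of B
# acts as a constant per-step displacement.
B = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.25],
     [0.0, 0.0, 1.0]]
states = rollout(B, [0.0, 0.0, 1.0], steps=2)
Lam = symmetrise([[1.0, 2.0], [0.0, 1.0]])
```

With homogeneous coordinates a single matrix $B$ can encode both a linear map and a drift, which is one reason to append the constant 1 to the state.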