LCollision: Fast Generation of Collision-Free Human Poses using Learned Non-Penetration Constraints
Qingyang Tan, Zherong Pan, Dinesh Manocha
Department of Computer Science, University of Maryland at College Park; Department of Computer Science, University of Illinois
[email protected], [email protected], [email protected]
Abstract
We present LCollision, a learning-based method that synthesizes collision-free 3D human poses. At the crux of our approach is a novel deep architecture that simultaneously decodes new human poses from the latent space and predicts colliding body parts. These two components of our architecture are used as the objective function and surrogate hard constraints in a constrained optimization for collision-free human pose generation. A novel aspect of our approach is the use of a bilevel autoencoder that decomposes whole-body collisions into groups of collisions between localized body parts. By solving the constrained optimizations, we show that a significant number of collision artifacts can be resolved. Furthermore, on a large test set of randomized poses from SCAPE, our architecture achieves high collision-prediction accuracy with a large speedup over exact collision detection algorithms. To the best of our knowledge, LCollision is the first approach that accelerates collision detection and resolves penetrations using a neural network.

Introduction

There has been considerable work on developing learning algorithms for 3D objects represented as point clouds (Qi et al. 2017), meshes (Hanocka et al. 2019), volumetric grids (Wang, Liu, and Tong 2020), and physical objects (Li et al. 2019). Because these algorithms are used for different applications, a major challenge is accounting for user requirements and physics-based constraints. Considering these constraints can significantly improve test-time robustness by preserving known criteria of "correct" predictions. For example, differentiable simulation (Qiao et al. 2020) and cloth embedding (Tan et al. 2020) must account for various forces and dynamics constraints, and a reliable robot motion planner should preserve a clearance distance from obstacles (Pham, De Magistris, and Tachibana 2018). In this paper, we tackle the problem of collision-free human pose generation.
Recently, 3D mesh representations have been used for learning-based human pose synthesis (Tretschk et al. 2020; Bouritsas et al. 2019; Ranjan et al. 2018; Bagautdinov et al. 2018; Tan et al. 2018a). These methods learn a manifold of plausible human poses from a dataset, represented as the latent space of a deep autoencoder. Such autoencoders can be trained for applications including interactive rigging, human pose recognition from images and videos, and VR games. However, current learning-based methods do not account for physics-based requirements such as (self-)collision-free constraints, thereby producing penetrations or other artifacts (Tretschk et al. 2020; Bouritsas et al. 2019; Ranjan et al. 2018; Bagautdinov et al. 2018; Tan et al. 2018a). By comparison, non-learning-based methods for character rigging (Shi et al. 2007) and physics-based simulation (Barbič and James 2010) can detect and explicitly handle collisions using numerical methods. Our goal is to equip learning-based methods with similar collision-handling capabilities.

Although 3D data representations explicitly allow the modeling of collision-free constraints, satisfying these hard constraints in an end-to-end learning system is an open problem. Prior works have tried one of three ways to incorporate hard constraints in a learning system. First, classical second-order methods (Boggs and Tolle 1995) for constrained optimization can enforce exact hard constraints on the parameters of the neural network. Second, variants of stochastic projected gradient descent (Marquez Neila, Salzmann, and Fua 2017; Kervadec et al. 2019) have been proposed to approximately satisfy constraints on the neural network parameters. Finally, differentiable optimization layers (Pham, De Magistris, and Tachibana 2018; Agrawal et al. 2019) can modify the neural network output to satisfy such constraints. However, these methods are either limited to convex constraints, impractical for large networks, or insufficiently accurate in terms of constraint satisfaction.
Main Results:
We present LCollision, a new learning algorithm to generate human poses that satisfy collision-free constraints. Our approach incorporates the non-penetration constraints by solving a general constrained optimization at test time, where the feasible domain corresponding to these hard constraints is learned at training time. The novel components of our approach include:
• Constrained Optimization Using Neural Network Function Approximation: Instead of using an exact collision response, learning the feasible domain with a neural network provides approximate sub-gradients via back-propagation, which is much faster than exact collision-checking algorithms.
• Collision Decomposition: A collision only affects local regions of the human body, and we design our collision predictor to respect these local effects. Each point on the human body is softly assigned to a set of local body parts, and the collision loss is decomposed over these local domains accordingly.
• Hybrid Ranking, Potential Energy, and Entropy Loss:
Although exact hard constraints correspond to a binary loss (violation or non-violation), this loss should be differentiable so that constrained optimizations can be guided by gradient information. We propose a penetration-depth-based formulation (Zhang et al. 2007) as a collision metric to provide gradient directions, combined with a ranking loss that maintains the relative penetration depth between a pair of samples.

We have evaluated our method on the SCAPE dataset (Anguelov et al. 2005), the MIT Swing dataset (Vlasic et al. 2008), and the MIT Jumping dataset (Vlasic et al. 2008). Combining these techniques, we achieve high accuracy with low false positive and false negative rates when predicting collisions for large numbers of randomized human poses sampled from these datasets. After learning the feasible domain, solving a constrained optimization for a collision-free human pose takes only a few iterations and on the order of one second on average. Moreover, our learned collision detector is two orders of magnitude faster than prior exact collision detection methods running on a CPU (Pan, Chitta, and Manocha 2012).

Related Work

We review related works on human pose estimation and synthesis, collision detection and response, and deep network training with hard constraints.
Human Pose Estimation & Synthesis:
There is considerable work on human pose estimation and synthesis. Earlier methods (Leibe, Seemann, and Schiele 2005) represent a pedestrian as a bounding box. An improved algorithm was proposed in (Agarwal and Triggs 2005), which predicts the 55-D joint angles of a skeletal human pose. More accurate predictions have been obtained in (Rogez et al. 2008) using random forests and in (Toshev and Szegedy 2014) using convolutional neural networks. Our approach is based on recent learning methods (Tan et al. 2018a; Tretschk et al. 2020) that use 3D meshes to generate detailed human poses. Mesh-based representations are inherently difficult to learn due to their intrinsic high dimensionality, and such algorithms can produce sub-optimal results with various artifacts such as self-penetrations, noisy mesh surfaces, and flipped meshes. In view of these problems, (Villegas et al. 2018) only computes skeletal poses using learning and then uses skinning to recover the mesh-based representation. However, this approach requires additional skeleton-mesh correspondence information, which is typically unavailable in many datasets, including SCAPE (Anguelov et al. 2005).
Collision Detection & Response:
An important criterion of "correct" human body shapes is that they are (self-)collision-free, i.e., elements of the mesh do not penetrate each other. Collision detection and response computations have been well studied, with many practical algorithms proposed for large-scale 3D meshes (Pan, Chitta, and Manocha 2012; Kim, Lin, and Manocha 2018) that can be used to resolve penetrations. Collisions can be handled in a discrete or continuous manner. Discrete collision handling (Kim, Lin, and Manocha 2018) assumes that meshes can occasionally reach an invalid status with penetrations and therefore checks for collisions at fixed time intervals. In contrast, continuous collision detection algorithms estimate the time instance corresponding to the first contact and thereby maintain non-penetration configurations. These continuous collision detection (CCD) methods (Tang et al. 2009; Bridson, Fedkiw, and Anderson 2002; Tang, Manocha, and Tong 2010) make assumptions about the interpolating motion between two time instances and use analytic methods to predict the time of the collision. Many of these methods can be accelerated using GPU parallelism (Govindaraju, Lin, and Manocha 2005). In theory, we could use such collision handling methods to avoid penetrations in a 3D mesh of a human pose. However, there are two practical challenges. First, collision response is a physical behavior tightly coupled with a physics-based model of the human body, and modeling the physical deformations of a human body can be computationally expensive: simulating one timestep of a human body can take many seconds (Smith, Goes, and Kim 2018), which is intractable for interactive applications. Second, collision handling algorithms require a volumetric mesh, while many applications of human pose synthesis rely on surface meshes.
Techniques have also been proposed to estimate the extent of penetration between complex 3D geometric models, e.g., penetration depth (Zhang et al. 2007; Kim et al. 2002; Burgard, Brock, and Stachniss 2008). However, these formulations can be non-smooth and expensive to compute.

Training Deep Networks with Hard Constraints:
A further challenge is incorporating collision handling into a deep learning framework; in particular, state-of-the-art deep learning methods are unable to handle such hard constraints. Constraints on neural network parameters (Ravi et al. 2019) are used for regularizing network training and can be approximately enforced using variants of the projected gradient descent algorithm. On the other hand, constraints on the neural network output are used to model application-specific requirements such as collision-free constraints. Prior works (Pham, De Magistris, and Tachibana 2018; Agrawal et al. 2019; Marquez Neila, Salzmann, and Fua 2017; Nandwani et al. 2019) use a similar approach to enforce hard constraints: converting the constrained optimization into an unconstrained min-max optimization, which can be solved approximately by alternately updating the primal and dual variables. A special case arises when the hard constraints are convex; then the constrained optimization can be solved efficiently with exact constraint enforcement (Pham, De Magistris, and Tachibana 2018; Agrawal et al. 2019). However, the collision-free constraints in our application are neither convex nor smooth.
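As a concrete, deliberately simplified illustration of this primal-dual scheme, the sketch below applies a basic augmented Lagrangian loop to a toy one-dimensional problem. The objective, step sizes, and iteration counts are illustrative assumptions, not the solvers used in the works cited above:

```python
def augmented_lagrangian(rho=10.0, lr=0.02, outer=50, inner=200):
    """Toy problem: minimize f(x) = (x - 2)^2 subject to g(x) = x - 1 <= 0.

    The constrained optimum is x* = 1 with multiplier lambda* = 2.
    """
    x, lam = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):           # approximately minimize the AL in x
            g = x - 1.0
            mult = max(0.0, lam + rho * g)   # active multiplier estimate
            grad = 2.0 * (x - 2.0) + mult    # d/dx of the augmented Lagrangian
            x -= lr * grad                   # primal descent step
        lam = max(0.0, lam + rho * (x - 1.0))  # dual (multiplier) ascent step
    return x, lam
```

On this toy problem the loop converges to the constrained optimum x ≈ 1 with λ ≈ 2; the same alternation of primal minimization and dual updates underlies the min-max formulations discussed above, even when the constraint is a learned, non-convex network output.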
Recent methods (Tretschk et al. 2020; Bouritsas et al. 2019; Ranjan et al. 2018; Bagautdinov et al. 2018; Tan et al. 2018a) have used neural networks to generate new poses from a small set of examples via shape embedding. In this section, we give an overview of the process of computing the embedding space for human pose generation and highlight the collision-free constraints that LCollision tries to satisfy.

Figure 1: Our network architecture combines the domain-decomposed human pose embedding framework (green) and a novel collision state estimator (gray). Given an input pose, we use a weight-shared, level-1 autoencoder to learn a global shape embedding. The error on each domain is further reduced using a set of level-2 autoencoders. Both the level-1 and level-2 autoencoders' latent codes are used to predict a global collision state. Finally, the latent code of each level-2 autoencoder is compared against the global collision state to infer a localized penetration depth. These inferred penetrations are used as hard constraints in a constrained optimization framework for collision handling.

Figure 2: We show decomposed domains on the SCAPE dataset using the learned attention mask M_ki, highlighted in different colors. The darkness of a given color represents the weight of the soft assignment. These weights are used for localized collision computations.

Our method uses the algorithm in (Tan et al. 2018b; Yang et al. 2020), which has the ability to extract local deformation components (more details are given in the appendix). We represent human models as triangle meshes, i.e., a special graph
G = ⟨V, E⟩, where V is a set of vertices and E a set of edges. In our datasets, all the models share the same topology, i.e., E is the same over all meshes, while V differs. We transform V into the as-consistent-as-possible (ACAP) feature space (Gao et al. 2019), denoted as X ∈ R^{d×|V|} with d-dimensional ACAP features per vertex, to handle large deformations. We use a bilevel autoencoder to embed X in a latent space. Both levels of the autoencoder involve one graph convolutional layer and one fully connected layer. The fully connected layer maps the feature to a K-dimensional latent code, with weights denoted as C ∈ R^{K×d×|V|}. A sparsity loss is used to ensure that each dimension of C only accounts for a group of local points.

Domain Decomposition via Attention:
We use a bilevel architecture because we want the level-2 autoencoders to learn a decomposed domain of the original mesh, i.e., each level-2 autoencoder only reduces the level-1 residual on a subset of V. The learned domain decomposition not only enhances the reusability and explainability of the neural network but is also used to model the local collisions between body sub-parts, as explained in Section 3.3. Each autoencoder maps some input feature X to a latent code Z and then reconstructs Z to a feature X̄. We use subscripts to denote the index of an autoencoder, e.g., X_1, Z_1, and X̄_1 are the input, latent code, and output of the level-1 autoencoder, respectively. We assume that each entry of the level-1 latent code corresponds to a sub-domain of the human body on which the residual is further reduced using one level-2 autoencoder, so there are altogether |Z_1| + 1 autoencoders. The k-th level-2 autoencoder is responsible for representing a subset of the residual X_1 − X̄_1. To determine the subset, an attention mask is computed as M_ki = Σ_j C_kji / Σ_{k′} Σ_j C_{k′ji}, and the input to the k-th level-2 autoencoder is X_i^k = M_ki (X_i − X̄_i). The soft assignment induced by the attention mask performs the domain decomposition in our network. We illustrate some human body parts decomposed using M_ki in Figure 2.

A pivotal requirement of plausible human poses is that they are collision-free, i.e., triangles on the surface mesh do not penetrate each other. However, this constraint is ignored by previous neural-network-based human pose generation methods. We define a self-collision as an intersection between two topologically disjoint triangles, i.e., two triangles that do not share any edges. We use the following condition to indicate a collision: t_p ∩ t_q ≠ ∅, where t_p and t_q are two triangles. Penetration depth (PD) is a notion that measures the extent of collision constraint violations between two objects.
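The attention-mask normalization above can be sketched in a few lines of NumPy. The array shapes and the use of absolute weight magnitudes are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def attention_mask(C):
    # C: (K, d, V) decoder weights, one (d, V) slice per latent dimension k.
    # Sum the weight magnitudes over the feature axis, then normalize over
    # the latent axis so each vertex's soft assignment sums to one.
    per_vertex = np.abs(C).sum(axis=1)                       # (K, V)
    return per_vertex / per_vertex.sum(axis=0, keepdims=True)

def level2_input(M, X, X_bar, k):
    # The k-th level-2 autoencoder sees the masked level-1 residual;
    # X, X_bar: (d, V) features, broadcast against the per-vertex mask M[k].
    return M[k] * (X - X_bar)
```

By construction, sum_k M[k, i] == 1 for every vertex i, so the level-2 autoencoders jointly cover the whole residual while each focuses on its own sub-domain.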
We define the local PD for a triangle pair (t_p, t_q) as PD_{p,q} = min {‖d‖ : (t_p + d) ∩ t_q = ∅}, i.e., PD_{p,q} is the minimum distance to move t_p such that t_p and t_q no longer overlap. The collision-free constraint can then be reformulated as the constraint that PD_{p,q} = 0 for all disjoint pairs (p, q). Conceptually, collision constraints can be satisfied by solving the following constrained optimization:

min goal  s.t.  PD_{p,q} = 0 for all disjoint (p, q),

where goal is the objective (e.g., being as close as possible to a user-desired pose). Prior works solve the constrained optimization by computing PD_{p,q} for all (p, q) pairs and treating each colliding (t_p, t_q) as a standalone constraint, leading to large problem sizes and high computational costs. Instead, we use a neural network to speed up the computation. Our method is inspired by the subspace self-collision culling algorithm (SSCC) (Barbič and James 2010) and the learning-based collision simplification algorithm of (Teng, Otaduy, and Kim 2014). In SSCC, the authors observe that collisions usually occur between pairs of triangles that are originally close to one another on the template mesh; pairs of distant triangles penetrate only when the mesh has undergone sufficient deformation. This observation suggests the use of mesh decompositions as described in Section 3.1. It is worth noting that both works (Barbič and James 2010; Teng, Otaduy, and Kim 2014) use learned linear subspaces to accelerate collision detection and culling. However, the expressivity of linear subspaces is rather limited, so SSCC can only model deformations that are near the neutral pose and cannot represent larger deformations. Further, (Teng, Otaduy, and Kim 2014) assumes that a domain decomposition is provided by users. Our work unifies and extends these ideas into a collision prediction algorithm that works for large deformations and does not require any additional data from users.
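The triangle-level PD defined above is expensive to evaluate in general, but the same minimum-translation definition applied to two spheres has a closed form, which makes a convenient sanity check. The sphere case is our illustration only, not part of the paper's pipeline:

```python
import numpy as np

def sphere_penetration_depth(c1, r1, c2, r2):
    # PD = min { ||d|| : translating sphere 1 by d removes the overlap }.
    # For spheres this is the overlap of the radii along the center line.
    dist = np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float))
    return max(0.0, r1 + r2 - dist)
```

Two unit spheres whose centers are 1.5 apart overlap by 0.5, while disjoint spheres give PD = 0, matching the reformulated constraint that PD = 0 characterizes collision-free configurations.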
Our overall learning architecture is illustrated in Figure 1. Our method augments a standard mesh embedding autoencoder with an additional component that classifies the collision status. Given a latent code Z_all defined as the concatenation Z_all = (Z_1^T, ⋯, Z_{|Z_1|+1}^T)^T, we output a collision probability MLP_classifier. We assume that the 0.5 sub-level set of MLP_classifier corresponds to collision-free meshes, so that the many constraints of the form PD_{i,j} = 0 can be replaced by the single constraint MLP_classifier < 0.5, which reduces the computational cost. In this subsection, we explain our network architecture for coping with the locality of self-collisions, illustrated in the gray blocks of Figure 1, including the collision state encoder and the collision predictor.
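The data flow from latent codes to a single collision probability can be sketched as follows. Layer widths, code sizes, and the random weights are placeholders for illustration; the paper's trained networks are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    # Plain MLP forward pass: ReLU hidden layers, linear output.
    for W, b in weights[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = weights[-1]
    return W @ x + b

def make_mlp(sizes):
    # Random small weights, zero biases, one (W, b) pair per layer.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

# Illustrative sizes: a level-1 code of size 4 and K = 4 level-2 codes of size 8.
K, dz1, dz2, ds = 4, 4, 8, 16
z_all = rng.standard_normal(dz1 + K * dz2)
cse = make_mlp([z_all.size, 32, ds])                    # collision state encoder
cps = [make_mlp([ds + dz2, 32, 1]) for _ in range(K)]   # per-domain predictors
clf = make_mlp([K, 8, 1])                               # final binary classifier

s0 = mlp(z_all, cse)                                    # global collision state
z2 = z_all[dz1:].reshape(K, dz2)                        # level-2 codes
s = np.array([mlp(np.concatenate([s0, z2[k]]), cps[k])[0] for k in range(K)])
p = 1.0 / (1.0 + np.exp(-mlp(s, clf)[0]))               # collision probability
```

The per-domain indicators s feed the classifier, whose sigmoid output p is what the constrained optimization later thresholds at 0.5.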
Naive Subdivision:
Our level-2 autoencoders inherently decompose the mesh into |Z_1| sub-domains. Therefore, if collisions occur within the k-th sub-domain, then collisions should be inferred from Z_k alone, and we use a collision predictor (CP) in the form of a multilayer perceptron (MLP) to map Z_k to a collision indicator. If a pair of triangles belongs to two sub-domains, e.g., those of Z_k and Z_{k′}, then a possible solution is to use another MLP that takes (Z_k^T, Z_{k′}^T)^T. However, this approach requires O(|Z_1|^2) CPs with an excessively large number of weights, and the latent codes of the level-2 autoencoders only represent the relative residual X_1 − X̄_1, while the absolute information X_1 is lost.

Our Method:
To avoid the issues with naive subdivision, we propose a collision state encoder (CSE) that encodes both relative and absolute information over all mesh sub-domains. The CSE is an MLP that passes Z_all through three hidden layers with ReLU activation and outputs a latent code referred to as the global collision state, S_0 = CSE(Z_all) for short. S_0 and Z_k are then fed into a CP to obtain the collision indicator related to the k-th sub-domain, i.e., collisions between pairs of triangles where at least one triangle belongs to the k-th sub-domain. There are altogether |Z_1| CPs, where the k-th CP maps (S_0^T, Z_k^T)^T through four hidden layers with ReLU activation and outputs a scalar collision indicator S_k, i.e., S_k = CP_k(S_0, Z_k).

We need the collision indicators S_k and the ground-truth labels to be compatible with numerical optimization. Since we use gradient-based numerical optimization, we need to provide valid gradient information. To this end, S_k should not only be a collision indicator but also a collision violation metric; in other words, if S′_k > S_k ≥ 0, then S′_k must correspond to a mesh with more collisions than S_k, for which we use the notion of penetration depth. Given a mesh G, we use the FCL library (Pan, Chitta, and Manocha 2012) to compute the squared penetration depth PD_{p,q} of each colliding triangle pair. Each colliding pair involves 6 vertices in V, and we add PD_{p,q}/6 to each of these vertices as the vertex-wise collision violation. After processing all colliding triangle pairs, we obtain a penetration depth energy vector PDe ∈ R^{|V|}. The overall computation is described by Algorithm 1.

Algorithm 1
Generating Penetration Energy Vector PDe
1: Initialize PDe = 0⃗ ∈ R^{|V|}
2: Run FCL to find the set T̂ of all colliding disjoint triangle pairs
3: for each (t_p, t_q) ∈ T̂ with penetration depth PD_{p,q} do
4:   for each vertex i belonging to t_p or t_q do
5:     PDe_i += PD_{p,q}/6
6:   end for
7: end for

After computing PDe, we use the following domain-decomposed data loss to train the S_k:

L_PD = Σ_{k=1}^{|Z_1|} ‖S_k − Σ_{i=1}^{|V|} M_ki PDe_i‖^2 + ‖S_sum − PDe-sum‖^2,

where PDe-sum = Σ_{i=1}^{|V|} PDe_i is the ground-truth total penetration energy and S_sum = Σ_{k=1}^{|Z_1|} S_k is the neural network prediction. Here, we use the same attention mask M_ki defined in Section 3.1 to decompose the collision energy into body parts. Note that we do not have any loss terms related to S_0.

Figure 3: We illustrate 20 representative results of collision responses, where the poses on the left are the original poses directly generated using (Yang et al. 2020), and the poses on the right are the ones after collision responses. We highlight the adjusted body parts using black boxes. In all the examples, our method successfully avoids penetrations. However, in two cases (red boxes), our adjusted poses drift severely from the original poses.

A neural network is known to suffer from over-fitting when learning exact distance functions (Hoffer and Ailon 2015; Burges et al. 2005), including those corresponding to PD. Further, it is inherently difficult to train a perfect regression model for values such as collision penetration depth that follow a long-tail distribution (Wang, Ramanan, and Hebert 2017). We avoid over-fitting by using the marginal ranking loss. Given two meshes G and Ĝ (with approximated total penetration energies S_sum and Ŝ_sum) randomly sampled from the dataset, if Ĝ has a higher collision violation than G in terms of the total penetration energy, then we define:

L_rank = max(0, α − (Ŝ_sum − S_sum)),

and vice versa. Here, α is a margin that enforces ranking strictness; we choose α as the mean energy difference over the given dataset.

With the above training technique, we can predict S_1, ⋯, S_{|Z_1|} and use them as hard constraints by requiring S_k = 0, resulting in |Z_1| constraints. We can further reduce the online computational cost by reducing the number of constraints to only one. To do so, we train a single classifier MLP_classifier(S_1, ⋯, S_{|Z_1|}) that summarizes this information and predicts whether there are any collisions throughout the human body, i.e., MLP_classifier is an indicator of whether S_sum = 0. To make sure that the 0.5 sub-level set is the collision-free subset, we use the cross-entropy loss:

L_entropy = −I(PDe-sum > 0) log(MLP_classifier) − I(PDe-sum = 0) log(1 − MLP_classifier).

Our collision response solver solves a constrained optimization of the following form:

argmin_{Z_all} ‖Z_all − Z*_all‖^2  s.t.  MLP_classifier(S_1, ⋯, S_{|Z_1|}) ≤ 0.5.  (1)

The idea is to provide a desired pose Z*_all for the bilevel decoder; Equation 1 then solves for a collision-free Z_all that is as close to Z*_all as possible. We solve Equation 1 using the augmented Lagrangian method implemented in LOQO (Vanderbei 1999), with all the gradient information computed via back-propagation through the neural network. This augmented Lagrangian method can start from an infeasible domain, which means that LOQO allows the hard constraints to be temporarily violated between iterations; LOQO then uses gradient information to pull the solution back to the feasible sub-manifold.

We implement our method using PyTorch (Paszke et al. 2017). All training and testing are performed on a single desktop machine with a multi-core CPU and an NVIDIA GTX 1080Ti GPU. The training is decomposed into two stages. During the first stage, we train the two-level human pose embedding architecture using a set of N meshes. This stage optimizes only the |Z_1| + 1 autoencoders and the attention mechanism. After this first stage, we generate a much larger dataset of M ≫ N meshes by sampling the latent code Z_all uniformly in an enlarged range proportional to [min(Z_all), max(Z_all)]^{|Z_all|}, where min(Z_all_k) < 0, max(Z_all_k) > 0, and min, max are taken elementwise over all mesh samples. We train our collision predictor and classifier on the augmented dataset while fixing the |Z_1| + 1 autoencoders and the attention mechanism. This stage uses the loss

L = w_PD L_PD + w_rank L_rank + w_entropy L_entropy.

We evaluate our method on three datasets: the SCAPE dataset (Anguelov et al. 2005), the MIT Swing dataset (Vlasic et al. 2008), and the MIT Jumping dataset (Vlasic et al. 2008). For each dataset, we use all the meshes to train the embedding space during the first stage, with the latent size |Z_1| set per dataset. During the second stage, we use most of the augmented samples for training and hold out the rest for validation, and we use two settings, one with a smaller M and one with a larger M.

[Table 1: per-method results for Ours, L_entropy + L_PD, L_entropy + L_rank, L_entropy, and ND under the MSE, RANK, and CLASSIFY metrics.]

Table 1:
We compare our method (Ours) with 4 baselines: L_entropy + L_PD, L_entropy + L_rank, L_entropy, and ND (no collision decomposition). Each method is trained on the smaller dataset, and we compare their accuracy in terms of predicting penetration depth energies (MSE), ranking penetration depth energies (RANK), and classifying collision-free meshes (CLASSIFY). The results show that our hybrid loss improves the overall accuracy of collision predictions. In particular, the improvement over ND demonstrates the effectiveness of decomposing a collision into body parts.

Accuracy of Collision Prediction:
We consider several baselines that are essentially simplified variants of our pipeline in Figure 1. We notice that the constrained optimization in Equation 1 only needs the output of MLP_classifier to be correct, which is the goal of L_entropy. Therefore, we consider retaining L_entropy while removing L_rank and L_PD, leading to three baselines: L_entropy + L_PD, L_entropy + L_rank, and L_entropy, where we use the same weights for the retained terms. To demonstrate the power of collision decomposition, we also compare LCollision with a simplified network architecture that does not decompose the collision into body parts. For this baseline, we simply use S_0 to predict the total penetration energy S_sum and to classify the collision status, and we modify L_PD to contain only the term ‖S_sum − PDe-sum‖^2; the other two losses L_rank and L_entropy remain the same. This baseline is denoted ND (no decomposition).

In Table 1, we compare the accuracy of the baselines in terms of predicting penetration depth energies, ranking penetration depth energies, and classifying collision-free meshes. To measure the accuracy of the predicted penetration depth energies, we use the mean squared error (MSE) of the total penetration depth energy averaged over the test meshes. To measure the ranking accuracy, we randomly form a pair for each sample among the test meshes and record the average ranking margin (RANK). For classifying collision-free meshes, we use the success rate (CLASSIFY) over the test meshes. From this ablation study, comparing ND with our method shows that penetration decomposition improves the accuracy of collision predictions, which also suggests that using the two parts of L_PD better informs the network of collision locality. Using the penetration depth energy in the system not only provides gradient information for optimization but also boosts performance through L_PD.
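Algorithm 1 and the three losses compared in this ablation can be sketched as follows. The shapes, the per-vertex 1/6 split, and the reductions are assumptions based on the descriptions above, not the paper's code:

```python
import numpy as np

def pde_vector(num_vertices, colliding_pairs):
    # Algorithm 1: spread each colliding pair's penetration energy over
    # the pair's 6 vertices.
    pde = np.zeros(num_vertices)
    for tri_p, tri_q, pd in colliding_pairs:    # tri_*: 3 vertex indices
        for i in list(tri_p) + list(tri_q):
            pde[i] += pd / 6.0
    return pde

def l_pd(S, M, pde, s_sum):
    # Per-domain regression against the mask-decomposed energy, plus a
    # total-energy term (cf. L_PD). M: (K, V) attention mask.
    target = M @ pde
    return float(np.sum((S - target) ** 2) + (s_sum - pde.sum()) ** 2)

def l_rank(s_sum_low, s_sum_high, alpha):
    # Marginal ranking loss; assumes the second sample has the higher
    # ground-truth penetration energy (cf. L_rank).
    return max(0.0, alpha - (s_sum_high - s_sum_low))

def l_entropy(p, has_collision):
    # Binary cross entropy on the classifier output p in (0, 1) (cf. L_entropy).
    return float(-np.log(p) if has_collision else -np.log(1.0 - p))
```

For example, a single colliding pair ((0, 1, 2), (1, 2, 3)) with energy 6 contributes 1 to each unshared vertex and 2 to each shared one, and a prediction that matches the decomposed targets drives l_pd to zero.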
L_rank also helps improve performance, but its effect is relatively minor compared to L_PD.

[Table 2: results for the SCAPE, Swing, and Jumping datasets at the smaller and larger dataset size M, under the MSE, RANK, and CLASSIFY metrics.]

Table 2:
We study the robustness of our method with respect to dataset size. Increasing the dataset size M significantly boosts the collision detection accuracy (CLASSIFY). This result implies that learning to predict collisions is challenging and that a larger training dataset can improve the overall results.

Dataset   Time (Pan, Chitta, and Manocha 2012, CPU)   Speedup (ours, CPU)   Speedup (ours, GPU)
SCAPE     1min 23s                                    21x                   81x
Swing     5min 12s                                    82x                   342x
Jumping   5min 19s                                    75x                   282x

Table 3:
We compare LCollision with (Pan, Chitta, and Manocha 2012) in terms of the computational cost for collision detection. (Pan, Chitta, and Manocha 2012) only supports a CPU version, while we test both the CPU and the GPU versions of our method. All datasets use the same number of test samples. Meshes in the Swing and Jumping datasets have more vertices than SCAPE, and the complexity of (Pan, Chitta, and Manocha 2012) depends on the number of points, so exact collision checking takes more time on those datasets. However, they all share the same level of latent space size as SCAPE, so the running times of our method are almost identical across datasets.

Our second study inspects the robustness of our network architecture with respect to the size of the dataset. As shown in Table 2, we tested our method trained with two different values of M. Increasing M significantly boosts the collision detection accuracy (CLASSIFY). This result implies that learning to predict collisions is challenging and that a larger training dataset can improve the overall results.

Speedup Compared with Exact Collision Checking:
The goal of our method is to speed up the collision detection process over prior exact methods applied to mesh-based representations. We compare the running time with (Pan, Chitta, and Manocha 2012) on the test sets of the SCAPE, Swing, and Jumping datasets. The implementation of (Pan, Chitta, and Manocha 2012) only supports the CPU, while LCollision runs on both the CPU and the GPU. To achieve the best performance for (Pan, Chitta, and Manocha 2012), we run their method using 15 threads in parallel and stop when one collision occurs or the mesh is reported to be collision-free. For our method, we feed the network batches of models at the same time and optimize the hyper-parameters to obtain the best performance. We show the results in Table 3 and observe two orders of magnitude speedup.

Figure 4:
The joint histogram of the number of iterations (Y-axis) and the computational time in seconds (X-axis) used for solving the constrained optimization (Equation 1) on the Swing dataset. The average number of iterations is 5.44 and the average computation time is 1.29s.

Figure 5:
Figure 5: The histogram of the relative penetration energy change for successful examples in the Swing dataset. Our method achieves a high success rate and a clear average relative decrease in penetration energy.

The Collision Response Solver:
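The collision responses reported in this section are obtained by solving Equation 1 with an augmented Lagrangian method. As a minimal illustration of that scheme (not our actual solver), the sketch below runs the standard inequality-constrained augmented Lagrangian loop on a toy problem; `f_grad`, `g`, and `g_grad` are hypothetical stand-ins for the latent-space objective and the learned non-penetration constraint.

```python
import numpy as np

def solve_al(f_grad, g, g_grad, z0, rho=10.0, outer=8, inner=300, lr=0.05):
    # Augmented Lagrangian loop for: minimize f(z) subject to g(z) <= 0.
    # Each inner problem is minimized by plain gradient descent.
    z, lam = z0.astype(float).copy(), 0.0
    for _ in range(outer):
        for _ in range(inner):
            # gradient of f(z) + (rho / 2) * max(0, g(z) + lam / rho)**2
            slack = max(0.0, g(z) + lam / rho)
            z = z - lr * (f_grad(z) + rho * slack * g_grad(z))
        lam = max(0.0, lam + rho * g(z))  # multiplier update
    return z

# Toy problem: stay near the origin while keeping z[0] >= 1; the
# constraint g(z) = 1 - z[0] <= 0 plays the role of "no penetration".
f_grad = lambda z: 2.0 * z
g = lambda z: 1.0 - z[0]
g_grad = lambda z: np.array([-1.0, 0.0, 0.0])

z = solve_al(f_grad, g, g_grad, np.zeros(3))
print(round(z[0], 3))  # prints 1.0 (the constraint is active at the solution)
```

In the real system the constraint is a neural network, so its gradient comes from backpropagation; as in our evaluation, a solve counts as successful when the returned iterate is feasible.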
In Figure 3, we show 20 results with successful collision responses for the SCAPE dataset (more results on the other datasets are given in the appendix). To profile the collision response solver quantitatively, we sample a set of 3000 random human poses by randomizing Z_all for both the SCAPE and Swing datasets. Some of these models have self-collisions and are classified correctly by our learning-based collision detection algorithm. For each of these meshes, we solve Equation 1, and we consider a solution successful if the augmented Lagrangian algorithm returns a feasible solution. On both the SCAPE and Swing datasets, our method achieves a high success rate with a substantial relative decrease in penetration energy. In Figure 4, we plot the number of iterations and the computational time used by the constrained optimizer until convergence on the Swing dataset. The average number of iterations is 5.44 and the average time is 1.29 s. For the SCAPE dataset, the average number of iterations is 2.09 and the average time is 0.25 s. In Figure 5, we highlight the distribution of the relative penetration energy change for successful collision response models.

We present LCollision, a method for learning the collision-free human pose sub-manifold. We use a mesh embedding autoencoder to learn a full human pose manifold and augment it with additional components to classify collision-free meshes. Our method decomposes the mesh into several sub-domains and learns the decision boundary of the collision-free sub-manifold by reusing the decomposed sub-domains. Specifically, we learn to predict the penetration depths aggregated over each sub-domain and then use a binary classifier to predict whether a given mesh has any collisions. When evaluated on the SCAPE dataset, our method achieves high success rates both in predicting collisions and in collision responses.

Our method has some limitations.
Being a learning-based method, our collision predictor cannot achieve a perfect success rate, in contrast to exact collision detection algorithms. This could pose a problem when our method is used to generate computer animations, where a few missed collisions can have a considerable impact on the overall simulation accuracy. Moreover, our learning method can only be applied to models with fixed topology and requires additional data collection and training for different mesh topologies. In the future, we would like to use active learning to collect more data and improve the accuracy of the collision predictor in a self-supervised manner, thereby reducing the need for large training datasets; a similar approach is used in (Pan, Zhang, and Manocha 2013; He et al. 2015) for rigid objects. A second issue is the use of a continuous constrained optimizer (Vanderbei 1999) for collision responses. These solvers require twice-differentiable hard constraints, which is not the case in our application because we use non-differentiable ReLU activation units. It is worth exploring new constrained optimization solvers that can handle non-smooth constraints specified by a neural network. There are also open issues in incorporating hard constraints into a neural network. If only soft penalties are needed, we can reformulate the hard constraint in Equation 1 as a soft penalty term and solve the unconstrained problem via a Newton-type method, allowing users to adjust the amount of penetration allowed in the final result. We can extend our work by considering other types of hard constraints, such as dynamics and accurate collision response models. Finally, since our method uses a hybrid loss, it may compromise performance on some metrics, e.g., the regression loss of the penetration depth. Moreover, techniques based on parameter estimation can be used to improve the performance of such learning methods (Wolinski et al. 2014).

Acknowledgement
This work was supported in part by ARO under Grants W911NF1810313 and W911NF1910315, and in part by Intel.

References
Agarwal, A.; and Triggs, B. 2005. Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Advances in Neural Information Processing Systems, 9558–9570.
Anguelov, D.; Srinivasan, P.; Koller, D.; Thrun, S.; Rodgers, J.; and Davis, J. 2005. SCAPE: shape completion and animation of people. In ACM SIGGRAPH, 408–416.
Bagautdinov, T.; Wu, C.; Saragih, J.; Fua, P.; and Sheikh, Y. 2018. Modeling facial geometry using compositional VAEs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3877–3886.
Barbič, J.; and James, D. L. 2010. Subspace Self-Collision Culling. ACM Trans. on Graphics (SIGGRAPH 2010).
Acta Numerica 4: 1–51.
Bouritsas, G.; Bokhnyak, S.; Ploumpis, S.; Bronstein, M.; and Zafeiriou, S. 2019. Neural 3D morphable models: Spiral convolutional networks for 3D shape representation learning and generation. In Proceedings of the IEEE International Conference on Computer Vision, 7213–7222.
Bridson, R.; Fedkiw, R.; and Anderson, J. 2002. Robust treatment of collisions, contact and friction for cloth animation. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, 594–603.
Burgard, W.; Brock, O.; and Stachniss, C. 2008. A Fast and Practical Algorithm for Generalized Penetration Depth Computation, 265–272.
Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 89–96.
Gao, L.; Lai, Y.-K.; Yang, J.; Ling-Xiao, Z.; Xia, S.; and Kobbelt, L. 2019. Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics.
Govindaraju, N. K.; Lin, M. C.; and Manocha, D. 2005. Quick-CULLIDE: Fast inter- and intra-object collision culling using graphics hardware. In IEEE Proceedings. VR 2005. Virtual Reality, 59–66. IEEE.
Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; and Cohen-Or, D. 2019. MeshCNN: A network with an edge. ACM Transactions on Graphics.
IEEE Robotics and Automation Letters.
International Workshop on Similarity-Based Pattern Recognition, 84–92. Springer.
Kervadec, H.; Dolz, J.; Tang, M.; Granger, E.; Boykov, Y.; and Ayed, I. B. 2019. Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis 54: 88–99.
Kim, Y.; Lin, M.; and Manocha, D. 2018. Collision and proximity queries. Handbook of Discrete and Computational Geometry.
Kim, Y. J.; Otaduy, M. A.; Lin, M. C.; and Manocha, D. 2002. Fast penetration depth computation for physically-based animation. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 23–31.
Leibe, B.; Seemann, E.; and Schiele, B. 2005. Pedestrian detection in crowded scenes. In , volume 1, 878–885. IEEE.
Li, Y.; Wu, J.; Tedrake, R.; Tenenbaum, J. B.; and Torralba, A. 2019. Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids. In ICLR.
Marquez Neila, P.; Salzmann, M.; and Fua, P. 2017. Imposing Hard Constraints on Deep Networks: Promises and Limitations. URL http://infoscience.epfl.ch/record/262884.
Nandwani, Y.; Pathak, A.; Singla, P.; et al. 2019. A Primal Dual Formulation For Deep Learning With Constraints. In Advances in Neural Information Processing Systems, 12157–12168.
Pan, J.; Chitta, S.; and Manocha, D. 2012. FCL: A general purpose library for collision and proximity queries. In , 3859–3866. IEEE.
Pan, J.; Zhang, X.; and Manocha, D. 2013. Efficient penetration depth approximation using active learning. ACM Transactions on Graphics (TOG), 6236–6243.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
Qiao, Y.-L.; Liang, J.; Koltun, V.; and Lin, M. C. 2020. Scalable Differentiable Physics for Learning and Control. arXiv preprint arXiv:2007.02168.
Ranjan, A.; Bolkart, T.; Sanyal, S.; and Black, M. J. 2018. Generating 3D faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), 704–720.
Ravi, S. N.; Dinh, T.; Lokhande, V. S.; and Singh, V. 2019. Explicitly imposing constraints in deep networks via conditional gradients gives improved generalization and faster convergence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 4772–4779.
Rogez, G.; Rihan, J.; Ramalingam, S.; Orrite, C.; and Torr, P. H. 2008. Randomized trees for human pose detection. In , 1–8. IEEE.
Shi, X.; Zhou, K.; Tong, Y.; Desbrun, M.; Bao, H.; and Guo, B. 2007. Mesh puppetry: cascading optimization of mesh deformation with inverse kinematics. In ACM SIGGRAPH 2007 Papers, 81–es.
Smith, B.; Goes, F. D.; and Kim, T. 2018. Stable neo-hookean flesh simulation. ACM Transactions on Graphics (TOG).
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5841–5850.
Tan, Q.; Gao, L.; Lai, Y.-K.; Yang, J.; and Xia, S. 2018b. Mesh-based autoencoders for localized deformation component analysis. In Thirty-Second AAAI Conference on Artificial Intelligence.
Tan, Q.; Pan, Z.; Gao, L.; and Manocha, D. 2020. Realtime Simulation of Thin-Shell Deformable Materials Using CNN-Based Mesh Embedding. IEEE Robotics and Automation Letters.
IEEE Transactions on Visualization and Computer Graphics.
Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 7–13.
Teng, Y.; Otaduy, M. A.; and Kim, T. 2014. Simulating Articulated Subspace Self-Contact. ACM Trans. Graph.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1653–1660.
Tretschk, E.; Tewari, A.; Zollhöfer, M.; Golyanik, V.; and Theobalt, C. 2020. DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects. European Conference on Computer Vision (ECCV).
Vanderbei, R. J. 1999. LOQO user's manual—version 3.10. Optimization Methods and Software.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8639–8648.
Vlasic, D.; Baran, I.; Matusik, W.; and Popović, J. 2008. Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH 2008 Papers, 1–9.
Wang, P.-S.; Liu, Y.; and Tong, X. 2020. Deep Octree-based CNNs with Output-Guided Skip Connections for 3D Shape and Scene Completion.
Wang, Y.-X.; Ramanan, D.; and Hebert, M. 2017. Learning to model the tail. Advances in Neural Information Processing Systems 30: 7029–7039.
Wolinski, D.; J. Guy, S.; Olivier, A.-H.; Lin, M.; Manocha, D.; and Pettré, J. 2014. Parameter estimation and comparative evaluation of crowd simulations. In Computer Graphics Forum, volume 33, 303–312. Wiley Online Library.
Yang, J.; Gao, L.; Tan, Q.; Huang, Y.; Xia, S.; and Lai, Y.-K. 2020. Multiscale Mesh Deformation Component Analysis with Attention-based Autoencoders.
Zhang, L.; Kim, Y. J.; Varadhan, G.; and Manocha, D. 2007. Generalized penetration depth computation.