SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing

Garvita Tiwari, Bharat Lal Bhatnagar, Tony Tung, and Gerard Pons-Moll

MPI for Informatics, Saarland Informatics Campus, Germany
Facebook Reality Labs, Sausalito, USA
{gtiwari,bbhatnag,gpons}@mpi-inf.mpg.de, [email protected]

Fig. 1:
SIZER dataset of people with clothing size variation. (Left): 3D scans of people captured in different clothing styles and sizes. (Right): T-shirt and short pants in sizes small and large, registered to a common template.
Abstract.
While models of 3D clothing learned from real data exist, no method can predict clothing deformation as a function of garment size. In this paper, we introduce SizerNet to predict 3D clothing conditioned on human body shape and garment size parameters, and ParserNet to infer garment meshes and shape under clothing with personal details in a single pass from an input mesh. SizerNet allows us to estimate and visualize the dressing effect of a garment in various sizes, and ParserNet allows us to edit the clothing of an input mesh directly, removing the need for scan segmentation, which is a challenging problem in itself. To learn these models, we introduce the SIZER dataset of clothing size variation, which includes 100 different subjects wearing casual clothing items in various sizes, totaling approximately 2000 scans. This dataset includes the scans, registrations to the SMPL model, scans segmented into clothing parts, and garment category and size labels. Our experiments show better parsing accuracy and size prediction than baseline methods trained on SIZER. The code, model and dataset will be released for research purposes at: https://virtualhumans.mpi-inf.mpg.de/sizer/
Modeling how 3D clothing fits on the human body as a function of size has numerous applications in 3D content generation (e.g., AR/VR, movies, video games, sports), clothing size recommendation (e.g., e-commerce), computer vision for fashion, and virtual try-on. It is estimated that retailers lose up to $600 billion each year due to sales returns, as it is currently difficult to purchase clothing online without knowing how it will fit [3,2].

Predicting how clothing fits as a function of body shape and garment size is an extremely challenging task. Clothing interacts with the body in complex ways, and fit is a non-linear function of size and body shape. Furthermore, clothing fit differences with size are subtle, but they can make a difference when purchasing clothing online. Physics based simulation is still the most commonly used technique because it generalizes well, but unfortunately, it is difficult to adjust its parameters to achieve a realistic result, and it can be computationally expensive.

While there exist several works that learn how clothing deforms as a function of pose [30], or pose and shape [30,43,22,37,34], there are few works modeling how garments drape as a function of size. Recent works learn a space of styles [50,37] from physics simulations, but their aim is plausibility, and therefore they cannot predict how a real garment will deform on a real body.

What is lacking is (1) a 3D dataset of people wearing the same garments in different sizes and (2) a data-driven model learned from real scans which varies with sizing and body shape. In this paper, we introduce the
SIZER dataset, the first dataset of scans of people in different garment sizes, featuring approximately 2000 scans, 100 subjects and 10 garments worn by subjects in four different sizes. Using the
SIZER dataset, we learn a neural network model, which we refer to as
SizerNet, which, given a body shape and a garment, can predict how the garment drapes on the body as a function of size. Learning
SizerNet requires mapping scans to registered multi-layer meshes – separate meshes for body shape, and top and bottom garments. This requires segmenting the 3D scans, estimating the body shape under clothing, and registering the garments across the dataset, which we obtain using the method explained in [14,38]. From the multi-layer meshes, we learn an encoder to map the input mesh to a latent code, and a decoder which additionally takes the body shape parameters of SMPL [33], the size label (S, M, L, XL) of the input garment, and the desired size of the output, to predict the output garment as a displacement field on a template.

Although visualizing how an existing garment fits on a body as a function of size is already useful for virtual try-on applications, we would also like to change the size of garments in existing 3D scans. Scans, however, are just point clouds, and parsing them into a multi-layer representation at test time using [14,38] requires segmentation, which sometimes requires manual intervention. Therefore, we propose
ParserNet, which automatically maps a single mesh registration (SMPL deformed to the scan) to multi-layer meshes with a single feed-forward pass. ParserNet not only segments the single mesh registration, but also reparameterizes the surface so that it is coherent with common garment templates. The output multi-layer representation of
ParserNet is powerful, as it allows simulation and editing of the meshes separately. Additionally, the tandem of
SizerNet and
ParserNet allows us to edit the size of clothing directly on the mesh, allowing shape manipulation applications never explored before. In summary, our contributions are:

• SIZER dataset: A dataset of clothing size variation of approximately 2000 scans, including 100 subjects wearing 10 garment classes in different sizes, where we make available scans, clothing segmentation, SMPL+G registrations, body shape under clothing, and garment class and size labels.
• SizerNet: The first model learned from real scans to predict how clothing drapes on the body as a function of size.
• ParserNet: A data-driven model to map a single mesh registration into a multi-layered representation of clothing without the need for segmentation or non-linear optimization.

Fig. 2: We propose a model to estimate and visualize the dressing effect of a garment conditioned on body shape and garment size parameters. For this we introduce
ParserNet (f_w^U, f_w^L, f_w^B), which takes a SMPL registered mesh M(θ, β, D) as input, predicts the SMPL parameters (θ, β), parses 3D garments using predefined templates T_g(β, θ, 0), and predicts the body shape under clothing while preserving the personal details of the subject. We also propose SizerNet, an encoder-decoder (f_w^enc, f_w^dec) based network, that resizes the garment given as input with the desired size label (δ_in, δ_out) and drapes it on the body shape under clothing.

Clothing modeling.
Accurate reconstruction of 3D cloth with fine structures (e.g., wrinkles) is essential for realism while being notoriously challenging. Methods based on multi-view stereo can recover global shape robustly but struggle with high frequency details in non-textured regions [51,44,16,6,47,32]. The pioneering work of [9,8] demonstrated for the first time detailed body and clothing reconstruction from monocular video using a displacement from SMPL, which spearheaded recent developments [23,7,10,42,24,25]. These approaches do not separate body from clothing. In [38,30,14,26], the authors propose to reconstruct clothing as a layer separated from the body. These models are trained on 3D scans of real clothed people and produce realistic results. On the other hand, physics based simulation methods have also been used to model clothing [48,49,35,21,45,46,37,43,22]. Despite the potential gap with real-world data, they are a great alternative to obtain clean data, free of acquisition noise and holes. However, they still require manual parameter tuning (e.g., time step for better convergence, shear and stretch for better deformation effects, etc.), and can be slow or unstable. In [43,22,21] a pose and shape dependent clothing model is introduced, and [37,50] also model garment style dependent clothing using a lower-dimensional representation for style and size, like PCA and garment sewing parameters; however, there is no direct control over the size of clothing generated for a given body shape. In [53], the authors model garment fit on different body shapes from images. Our model
SizerNet automatically outputs realistic 3D cloth models conditioned on desired features (e.g., shape, size).
Shape under clothing.
In [11,60,57], the authors propose to estimate body shape under clothing by fitting a 3D body model to 3D reconstructions of people. An objective function typically forces the body to be inside the clothing while being close to the skin region. These methods cannot generalize well to complex or loose clothing without additional priors or supervision [17]. In [27,36,54,29,28,52], the authors propose learned models to estimate body shape from 2D images of clothed people, but shape accuracy is limited due to depth ambiguity. Our model
ParserNet takes a 3D mesh as input and outputs 3D bodies under clothing with high fidelity while preserving subject identity (e.g., face details).
Cloth parsing.
The literature has proposed several methods for clothed human understanding. In particular, efficient cloth parsing in 2D has been achieved using supervised learning and generative networks [55,56,58,18,19,20]. 3D clothing parsing of 3D scans has also been investigated [38,14]. The authors propose techniques based on MRF-GrabCut [41] to segment 3D clothing from 3D scans and transfer it to different subjects. However, the approach requires several steps, which is not optimal for scalability. We extend previous work with
SIZER, a fully automatic data-driven pipeline. In [13], the authors jointly predict clothing and the inner body surface, with semantic correspondences to SMPL. However, it does not have semantic clothing information.
3D datasets.
To date, only a few datasets consist of 3D models of subjects with segmented clothes. 3DPeople [40] and Cloth3D [12] are large datasets of synthetic 3D humans with clothing. None of the synthetic datasets contains realistic cloth deformations like the SIZER dataset. THUman [61] consists of sequences of clothed 3D humans in motion, captured with a consumer RGBD sensor (Kinect v2) and reconstructed using volumetric SDF fusion [59]. However, its 3D models are rather smooth compared to our 3D scans, and no ground truth segmentation of clothing is provided. Dyna and D-FAUST [39,15] consist of high-res 3D scans of 10 humans in motion with different shapes, but the subjects are only wearing minimal clothing. BUFF [60] contains high-quality 3D scans of 6 subjects with and without clothing. The dataset is primarily designed to train models to estimate body shape under clothing and does not contain garment segmentation. In [14], the authors create a digital wardrobe with 3D templates of garments to dress 3D bodies. In [26], the authors propose a mixture of synthetic and real data, which contains garment, body shape and pose variations. However, the fraction of real data (∼300 scans) is fairly small. DeepFashion3D [62] is a dataset of real scans of clothing containing various garment styles. None of these datasets contain garment sizing variation. Unlike our proposed
SIZER dataset, no existing dataset contains a large amount of pre-segmented clothing from 3D scans at different sizes, with corresponding body shapes under clothing.
In this paper, we address the very challenging problem of modeling garment fit as a function of body shape and garment size. As explained in Sec. 2, one of the key bottlenecks that hinder progress in this direction is the lack of real-world datasets that contain calibrated and well-annotated garments in different sizes draped on real humans. To this end, we present the
SIZER dataset, a dataset of over 2000 scans containing people with diverse body shapes in various garment styles and sizes. We describe our dataset in Sec. 3.1 and 3.2.
We introduce the
SIZER dataset that contains 100 subjects, each wearing the same garment in 2 or 3 garment sizes (S, M, L, XL). We include 10 garment classes, namely shirt, dress-shirt, jeans, hoodie, polo t-shirt, t-shirt, shorts, vest, skirt, and coat, which amounts to roughly 200 scans per garment class. We capture the subjects in a relaxed A-pose to avoid stretching or tension in the garments due to pose. Figure 1 shows some examples of people wearing a fixed set of garments in different sizes. We use a Treedy's static scanner [5], which has 130+ cameras, and reconstruct the scans using Agisoft's Metashape software [1]. Our scans are high resolution and are represented by meshes, which have different underlying graph connectivity across the dataset; hence it is challenging to use this dataset directly in any learning framework. We preprocess our dataset by registering the scans to SMPL [33]. We explain the structure of the processed data in the following section.
To improve general usability of the
SIZER dataset, we provide SMPL+G registrations [31,14]. Registering our scans to SMPL brings all our
scans into correspondence and provides more control over the data via the pose and shape parameters of the underlying SMPL. We briefly describe the SMPL and SMPL+G formulations below.

SMPL represents the human body as a parametric function M(·) of pose (θ) and shape (β). We add per-vertex displacements (D) on top of SMPL to model deformations corresponding to hair, garments, etc., thus resulting in the SMPL+D model. SMPL applies standard skinning W(·) to a base template T̄ in T-pose. Here, W denotes the blend weights, and B_p(·) and B_s(·) model pose and shape dependent deformations respectively:

M(β, θ, D) = W(T(β, θ, D), J(β), θ, W)   (1)
T(β, θ, D) = T̄ + B_s(β) + B_p(θ) + D   (2)

SMPL+G is a parametric formulation to represent the human body and garments as separate meshes. To register the garments, we first segment scans into garment and skin parts [14]. We refine the scan segmentation step used in [14] by fine-tuning the Human Parsing network [20] with a multi-view consistency loss. We then use the multi-mesh registration approach from [14] to register garments to the SMPL+G model. For each garment class, we obtain a template mesh which is defined as a subset of the SMPL template, given by T_g(β, θ, 0) = I_g T(β, θ, 0), where I_g ∈ Z^{m_g × n} is an indicator matrix, with I_g(i,j) = 1 if garment g vertex i ∈ {1 . . . m_g} is associated with body shape vertex j ∈ {1 . . . n}. m_g and n denote the number of vertices in the garment template and the SMPL mesh respectively. Similarly, we define a garment function G(β, θ, D^g) using Eq. (3), where D^g are the per-vertex offsets from the template:

G(β, θ, D^g) = W(T_g(β, θ, D^g), J(β), θ, W).   (3)

For every scan in the SIZER dataset, we will release the scan, the segmented scan, the SMPL+G registrations, and the garment category and garment size label. This dataset can be used in several applications, like virtual try-on, character animation, learning generative models, data-driven body shape under clothing, and size and/or shape sensitive clothing models. To stimulate further research in this direction, we will release the dataset, code and baseline models, which can be used as a benchmark in 3D clothing parsing and 3D garment resizing. We use this dataset to build a model for the tasks of garment extraction from a single mesh (ParserNet) and garment resizing (SizerNet), which we describe in the next section.
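The indicator-matrix template construction T_g = I_g T from Sec. 3.2 can be sketched as follows. The vertex counts and index set here are hypothetical stand-ins for illustration, not the released SIZER assets (a real SMPL mesh has n = 6890 vertices):

```python
import numpy as np

# Sketch of the garment template selection T_g = I_g T (Sec. 3.2).
# Hypothetical sizes: n body vertices, m_g garment vertices.
n = 20                                       # stand-in for the SMPL vertex count
garment_vert_ids = np.array([2, 3, 7, 11])   # body vertices belonging to garment g
m_g = len(garment_vert_ids)

# Indicator matrix I_g in Z^{m_g x n}: row i has a single 1 at the body
# vertex associated with garment vertex i.
I_g = np.zeros((m_g, n))
I_g[np.arange(m_g), garment_vert_ids] = 1.0

T = np.arange(n * 3, dtype=float).reshape(n, 3)  # stand-in for T(beta, theta, 0)
T_g = I_g @ T                                    # garment template vertices

print(T_g.shape)  # (4, 3)
```

In practice one would store only `garment_vert_ids` per garment class; the dense matrix product is shown purely to mirror the notation of Eq. (3).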
We introduce
ParserNet (Sec. 4.2), the first method for extracting garments directly from SMPL registered meshes. For parsing garments, we first predict the underlying SMPL body parameters using a pose and shape prediction network (Sec. 4.1) and use
ParserNet to extract garment layers and personal features
like hair and facial features to create the body shape under clothing. Next, we present
SizerNet (Sec. 4.3), an encoder-decoder based deep network for garment resizing. An overview of the method is shown in Fig. 2.
To estimate body shape under clothing, we first create the undressed SMPL body for a given clothed input single layer mesh M(β, θ, D), by predicting θ and β using f_w^θ and f_w^β respectively. We train f_w^θ and f_w^β with an L2 loss over the parameters and a per-vertex loss between the predicted SMPL body and the clothed input mesh, as shown in Eq. (4) and (5). Since the reference body-under-clothing parameters θ, β obtained via instance specific optimization (Sec. 3.2) can be inaccurate, we add an additional per-vertex loss between our predicted SMPL body vertices M(θ̂, β̂, 0) and the input clothed mesh M(β, θ, D). This brings the predicted undressed body closer to the input clothed mesh. We observe more stable results when initially training f_w^θ and f_w^β separately, using the reference θ and β respectively. Since the β components in SMPL are normalized to have σ = 1, we un-normalize them by scaling by their respective standard deviations [σ_1, σ_2, . . . , σ_10], as given in Eq. (5):

L_θ = w_pose ||θ̂ − θ||² + w_v ||M(β, θ̂, 0) − M(β, θ, D)||²   (4)
L_β = w_shape Σ_{i=1}^{10} σ_i² (β̂_i − β_i)² + w_v ||M(β̂, θ, 0) − M(β, θ, D)||²   (5)

Here, w_pose, w_shape and w_v are weights for the loss on pose, shape and the predicted SMPL surface, and (θ̂, β̂) denote the predicted parameters. The output is a smooth (SMPL model) body shape under clothing.

Parsing garments from a single mesh (M) can be done by segmenting it into separate garments for each class (G_seg^{g,k}), which leads to different underlying graph connectivity (G_seg^{g,k} = (G_seg^{g,k}, E_seg^{g,k})) across all instances (k) of a garment class g, shown in Fig. 3 (right). Hence, we propose to parse garments by deforming the vertices of a template T_g(β, θ, 0) with fixed connectivity E_g, obtaining vertices G^{g,k} ∈ G^{g,k}, where G^{g,k} = (G^{g,k}, E_g), shown in Fig. 3 (middle).

Fig. 3: Left to right: Input single mesh (M^k), garment template (T_g(β, θ, 0) = I_g T(β, θ, 0)), garment mesh extracted using G^{g,k} = I_g M^k, multi-layer meshes (G^{g,k}) registered to SMPL+G, all with garment class specific edge connectivity E_g, and segmented scan G_seg^{g,k} with instance specific edge connectivity E_seg^{g,k}.

Our key idea is to predict the deformed vertices G^g directly as a convex combination of vertices of the input mesh M = M(β, θ, D) with a learned sparse regressor matrix W^g, such that G^g = W^g M. Specifically, ParserNet predicts the sparse matrix (W^g) as a function of input mesh features (vertices and normals) and a predefined per-vertex neighborhood (N_i) for every vertex i of garment class g. We will henceforth drop (·)^{g,k} unless required. In this way, the output vertices G_i ∈ R³, where i ∈ {1, . . . , m_g}, are obtained as a convex combination of input mesh vertices M_j ∈ R³ in a predefined neighborhood (N_i):

G_i = Σ_{j ∈ N_i} W_ij M_j.   (6)

Parsing detailed body shape under clothing.
For generating detailed body shape under clothing, we first create a smooth body mesh using the SMPL parameters θ and β predicted by f_w^θ, f_w^β (Sec. 4.1). Using the same convex combination formulation, the body ParserNet transfers the visible skin vertices from the input mesh to the smooth body mesh, obtaining hair and facial features. We parse the input mesh into upper garment, lower garment and detailed shape under clothing using 3 sub-networks (f_w^U, f_w^L, f_w^B) of ParserNet, as shown in Fig. 2.
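The convex-combination parsing of Eq. (6) can be sketched with toy vertex data in place of a real SMPL mesh and network-predicted weights:

```python
import numpy as np

def parse_garment(M, W, neighborhoods):
    """Eq. (6): G_i = sum over j in N_i of W_ij * M_j.

    M: (n, 3) input mesh vertices.
    W: (m_g, n) weight matrix; each row sums to 1 and is zero outside the
       predefined neighborhood N_i given by `neighborhoods[i]`.
    """
    G = np.zeros((len(neighborhoods), 3))
    for i, N_i in enumerate(neighborhoods):
        G[i] = W[i, N_i] @ M[N_i]   # convex combination within N_i
    return G

# Toy example: 4 input vertices, 2 output garment vertices.
M = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
W = np.zeros((2, 4))
W[0, [0, 1]] = [0.5, 0.5]        # G_0 = midpoint of M_0 and M_1
W[1, [2, 3]] = [0.25, 0.75]
G = parse_garment(M, W, [np.array([0, 1]), np.array([2, 3])])
```

Because each output vertex is a local weighted average of input vertices, fine detail in the input survives in the output, which is exactly the property the paper credits for ParserNet's wrinkle preservation.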
We aim to edit the garment mesh based on garment size labels such as S, M, L, etc., to see the dressing effect of the garment in a new size. For this task, we propose an encoder-decoder based network, shown in Fig. 2 (right). The network f_w^enc encodes the garment mesh G_in to a lower-dimensional latent code x_gar ∈ R^d, as shown in Eq. (7). We append (β, δ_in, δ_out) to the latent code, where δ_in, δ_out are one-hot encodings of the input and desired output sizes and β is the SMPL shape parameter of the underlying body:

x_gar = f_w^enc(G_in),   f_w^enc(·) : R^{m_g × 3} → R^d   (7)

The decoder network, f_w^dec(·) : R^{|β|} × R^d × R^{|δ|} → R^{m_g × 3}, predicts the displacement field D^g = f_w^dec(β, x_gar, δ_in, δ_out) on top of the template. We obtain the output garment G_out in the new desired size δ_out using Eq. (3).
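The encoder-decoder of Eq. (7) can be sketched with single linear layers standing in for f_w^enc and f_w^dec. The dimensions and random weights below are placeholders for illustration, not the trained SizerNet:

```python
import numpy as np

SIZES = ("S", "M", "L", "XL")

def one_hot(label):
    v = np.zeros(len(SIZES))
    v[SIZES.index(label)] = 1.0
    return v

def sizer_forward(G_in, beta, size_in, size_out, W_enc, W_dec, T_g):
    # Eq. (7): encode the input garment to a latent code x_gar in R^d, ...
    x_gar = W_enc @ G_in.reshape(-1)
    # ... append shape and one-hot size labels, decode a displacement field
    # D_g over the template, and add it to the garment template (Eq. 3).
    z = np.concatenate([beta, x_gar, one_hot(size_in), one_hot(size_out)])
    D_g = (W_dec @ z).reshape(T_g.shape)
    return T_g + D_g

# Toy dimensions: m_g = 5 garment vertices, d = 3 latent, |beta| = 10.
rng = np.random.default_rng(0)
m_g, d = 5, 3
W_enc = rng.normal(size=(d, m_g * 3))
W_dec = rng.normal(size=(m_g * 3, 10 + d + 2 * len(SIZES)))
G_out = sizer_forward(rng.normal(size=(m_g, 3)), np.zeros(10),
                      "M", "L", W_enc, W_dec, np.zeros((m_g, 3)))
print(G_out.shape)  # (5, 3)
```

The actual networks are fully connected with skip connections (Sec. 4.4); the linear stand-ins only show how the latent code, shape parameters and size labels are concatenated before decoding.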
We train the networks,
ParserNet and
SizerNet with the training losses given by Eq. (8) and (9) respectively, where w_3D, w_norm, w_lap, w_interp and w_w are weights for the loss on vertices, normals, Laplacian, interpenetration and the weight regularizer term respectively. We explain each of the loss terms in this section.

L_parser = w_3D L_3D + w_norm L_norm + w_lap L_lap + w_interp L_interp + w_w L_w   (8)
L_sizer = w_3D L_3D + w_norm L_norm + w_lap L_lap + w_interp L_interp   (9)

•
3D vertex loss for garments.
We define L_3D as the L2 loss between predicted and ground truth vertices:

L_3D = ||G_P − G_GT||².   (10)

•
3D vertex loss for shape under clothing.
For training f_w^B (ParserNet for the body), we use the skin of the input mesh as supervision for predicting the personal details of the subject. We define a garment class specific, geodesic distance weighted loss term, as shown in Eq. (11), where I^s is the indicator matrix for the skin region and w_geo is a vector containing the sigmoid of the geodesic distances from vertices to the boundary between skin and non-skin regions. The loss term is high when the prediction is far from the input mesh M for the visible skin region, and lower for the cloth region, with a smooth transition regulated by the geodesic term. Let abs_ij(·) denote an element-wise absolute value operator. Then the loss is computed as

L_body3D = ||w_geo^T · abs_ij(G_P^s − I^s M)||.   (11)

• Normal loss.
We define L_norm as the difference in angle between the ground truth face normals (N_GT,i) and the predicted face normals (N_P,i):

L_norm = (1 / N_faces) Σ_{i=1}^{N_faces} (1 − (N_GT,i)^T N_P,i).   (12)

• Laplacian smoothness term.
This enforces the Laplacian of the predicted garment mesh to be close to the Laplacian of the ground truth mesh. Let L_g ∈ R^{m_g × m_g} be the graph Laplacian of the garment mesh G_GT, and ∆_init = L_g G_GT ∈ R^{m_g × 3} be the differential coordinates of G_GT; then we compute the Laplacian smoothness term for a predicted mesh G_P as

L_lap = ||∆_init − L_g G_P||².   (13)

• Interpenetration loss.
Since minimizing the per-vertex loss does not guarantee that the predicted garment lies outside the body surface, we use the interpenetration loss term in Eq. (14), proposed in GarNet [22]. For every vertex G_P,j, we find the nearest vertex in the predicted body shape under clothing (B_i) and define the body-garment correspondences as C(B, G_P). Let N_i be the normal of the i-th body vertex B_i. If the predicted garment vertex G_P,j penetrates the body, it is penalized with the following loss:

L_interp = Σ_{(i,j) ∈ C(B, G_P)} max(0, −N_i^T (G_P,j − B_i)).   (14)

• Weight regularizer. To preserve the fine details when parsing the input mesh, we want the weights predicted by the network to be sparse and confined to a local neighborhood. Hence, we add a regularizer which penalizes large values of W_ij if the distance between M_j and the vertex M_k with the largest weight, k = argmax_j W_ij, is large. Let d(·,·) denote the Euclidean distance between vertices; then the regularizer equals

L_w = Σ_{i=1}^{m_g} Σ_{j ∈ N_i} W_ij d(M_k, M_j),   k = argmax_j W_ij.   (15)

We implement the f_w^θ and f_w^β networks with 2 fully connected layers and a linear output layer. We implement the ParserNet networks f_w^U, f_w^L, f_w^B with 3 fully connected layers. We use a neighborhood (N_i) size of |N_i| = 50 for our experiments. We first train the network for garment classes which share the same garment template and then fine-tune separately for each garment class g. To speed up training of ParserNet, we pre-train the network to predict W^g = I_g, where I_g is the indicator matrix for garment class g, explained in Sec. 3.2. This initializes the network to parse the garment by cutting out a part of the input mesh based on the constant per-garment indicator matrix, shown in Fig. 3. For SizerNet, we use d = 30, and we implement f_w^enc, f_w^dec with fully connected layers and skip connections between the encoder and decoder networks.
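The geometric loss terms above can be sketched as follows, with toy meshes in place of network predictions. The interpenetration penalty follows the reconstruction in Eq. (14): a garment vertex matched to body vertex i is penalized only when it falls behind the body surface along the body normal N_i (an assumption consistent with the description, since the extracted equation was garbled):

```python
import numpy as np

def face_normals(V, F):
    # Unit normals of triangle faces F (m, 3) over vertices V (n, 3).
    n = np.cross(V[F[:, 1]] - V[F[:, 0]], V[F[:, 2]] - V[F[:, 0]])
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def normal_loss(V_pred, V_gt, F):
    # Eq. (12): mean of 1 - cos(angle) between GT and predicted face normals.
    n_gt, n_p = face_normals(V_gt, F), face_normals(V_pred, F)
    return np.mean(1.0 - np.sum(n_gt * n_p, axis=1))

def laplacian_loss(V_pred, V_gt, L):
    # Eq. (13): squared error between differential coordinates L*V.
    return np.sum((L @ V_gt - L @ V_pred) ** 2)

def interpenetration_loss(G_pred, B, N_body, corr):
    # Eq. (14): penalize garment vertex j matched to body vertex i only if
    # it lies behind the body surface, i.e. N_i . (G_j - B_i) < 0.
    return sum(max(0.0, -N_body[i] @ (G_pred[j] - B[i])) for i, j in corr)

# Toy check: identical meshes give zero normal loss; a garment vertex 0.1
# behind a body vertex with normal +z is penalized by 0.1.
F = np.array([[0, 1, 2]])
V = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0]])
B, N_body = np.array([[0., 0, 0]]), np.array([[0., 0, 1]])
```

These are per-mesh reference implementations; in training they would be evaluated batch-wise inside an autodiff framework.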
We held out 40 scans for testing in each garment class, which includes some cases with unseen subjects and some with unseen garment sizes for seen subjects. For the pose-shape prediction network, ParserNet and SizerNet, we use a batch size of 8.

To validate the choice of parsing the garments using a sparse regressor matrix (W), we compare the results of ParserNet with two baseline approaches: 1) a linearized version of ParserNet implemented with LASSO, and 2) a naive FC network, which has the same architecture as ParserNet; however, instead of predicting the weight matrix (W), the FC network directly predicts the deformation (D^g) from the garment template (T_g(β, θ, 0)) for a given input (M).

Fig. 4: Comparison of ParserNet with a FC network from front and lateral view.

We compare the per-vertex error of ParserNet with the aforementioned baselines in Tab. 1. Figure 4 shows that ParserNet can produce details, fine wrinkles, and large garment deformations, which is not possible with a naive FC network. This is attainable because ParserNet reconstructs the output garment mesh as a localized, sparse weighted sum of input vertex locations, and hence preserves the geometric details present in the input mesh. In the case of the naive FC network, however, the predicted displacement field (D^g) is smooth and does not explain large deformations. Hence, the naive FC network is not able to predict loose garments and does not preserve fine details. We show results of ParserNet for more garment classes in Fig. 5 and add more results in the supplementary material.

Editing garment meshes based on garment size labels is an unexplored problem, and hence there are no well defined quantitative metrics. We introduce two quantitative metrics, namely change in mesh surface area (A_err) and per-vertex error (V_err), for evaluating the resizing task.
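The two resizing metrics can be computed from registered meshes as follows; since the paper does not spell out the exact normalization, the mean-distance form of V_err and the percentage form of A_err below are assumptions:

```python
import numpy as np

def mesh_area(V, F):
    # Total surface area: sum over triangles of 0.5 * |e1 x e2|.
    e1 = V[F[:, 1]] - V[F[:, 0]]
    e2 = V[F[:, 2]] - V[F[:, 0]]
    return 0.5 * np.sum(np.linalg.norm(np.cross(e1, e2), axis=1))

def v_err(V_pred, V_gt):
    # Per-vertex error: mean Euclidean distance between corresponding vertices.
    return np.mean(np.linalg.norm(V_pred - V_gt, axis=1))

def a_err(V_pred, V_gt, F):
    # Surface area error: relative change in total area, in percent.
    a_gt = mesh_area(V_gt, F)
    return 100.0 * abs(mesh_area(V_pred, F) - a_gt) / a_gt

# Toy check: uniformly scaling a mesh by 2 multiplies its area by 4,
# so the relative area change is 300%.
F = np.array([[0, 1, 2]])
V = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0]])
print(a_err(2.0 * V, V, F))  # 300.0
```

Both metrics require the predicted and ground truth garments to share the template connectivity, which the SMPL+G registrations provide.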
Surface area accounts for the scale of a garment, which only changes with the garment size, and per-vertex error accounts for details and folds created due to the underlying body shape and the looseness/tightness of the garment. Moreover, subtle changes in garment shape with respect to size are difficult to evaluate. Hence, we use heat map visualizations for qualitative analysis of the results.

Table 1: Average per-vertex error V_err of the proposed method for parsing garment meshes for different garment classes (in mm).
Garment   Linear Model   FC   ParserNet
Polo      32.21   17.25
Shorts    29.78   20.12
Shirt     27.63   19.35
Pants     34.82   18.2
Vest      28.17   18.56
Coat      41.27   22.19
Hoodies   37.34   23.69
Shorts2   31.38   23.45
T-Shirt   26.94   15.98

Fig. 5: Input single mesh and ParserNet results for more garments.

Since there is no other existing work on the garment resizing task to compare with, we evaluate our method against the following three baselines:
1. Error margin in data: We define the error margin as the change in per-vertex location (V_err) and surface area (A_err) between garments of two consecutive sizes for a subject in the dataset. Our model should ideally produce a smaller error than this margin.
2. Average prediction: For every subject in the dataset, we create the average garment (G_avg) by averaging over all the available sizes for that subject.
3. Linear scaling + alignment: We linearly scale the garment mesh according to the desired size label, and then align the garment to the underlying body.

Table 2 shows the errors for each experiment. SizerNet results in lower errors compared to the linear scaling method, which reflects the need for modelling the non-linear relationship between garment shape, underlying body shape and garment size.

Fig. 6: (a) Input single mesh. (b) Parsed multi-layer mesh from ParserNet. (c), (d) Resized garment in two subsequent smaller sizes. (e), (f) Heatmap of change in per-vertex error on the original parsed garment for two new sizes.
We also see that the network predictions yield lower error compared to the average garment prediction, which suggests that the model is learning the size variation, even though the differences in the ground truth itself are subtle. We present the results of SizerNet for common garment classes in Tab. 2, Fig. 6, 7 and add more results in the supplementary material.

Table 2: Average per-vertex error (V_err, in mm) and surface area error (A_err, in %) of the proposed method for garment resizing.
Garment       Error-margin     Average-pred     Linear Scaling   Ours
              V_err   A_err    V_err   A_err    V_err   A_err    V_err   A_err
Polo t-shirt  33.25   24.56    23.86   3.63     35.05   8.45
Shirt         36.52   19.57    21.95   2.76     34.53   7.01
Shorts        43.21   27.21    24.79   5.41     35.77   4.99
Pants         30.83   15.15    21.54   4.73     38.16   7.13

Fig. 7: Results of ParserNet + SizerNet, where we parse the garments from an input single mesh and change the size of the garment to visualize the dressing effect.

We introduce SIZER, a clothing size variation dataset and model, which is the first real dataset to capture clothing size variation on different subjects. We also introduce ParserNet, a 3D garment parsing network, and SizerNet, a size sensitive clothing model. With this method, one can change a single mesh registration into multi-layer meshes of garments and body shape under clothing, without the need for scan segmentation, and can use the result for animation, virtual try-on, etc. SizerNet can drape a person with garments in different sizes. Since our dataset only consists of roughly aligned A-poses, we are limited to the A-pose. We only exploit geometry information (vertices and normals) for 3D clothing parsing. In future work, we plan to use the color information in ParserNet via texture augmentation, to improve the accuracy and generalization of the proposed method. We will release the model, dataset, and code to stimulate research in the direction of 3D garment parsing, segmentation, resizing and predicting body shape under clothing.

Acknowledgements.
This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and a Facebook research award. We thank Tarun, Navami, and Yash for helping us with the data capture, and the RVH team members [4] for their meticulous feedback on this manuscript.

References

1. Agisoft metashape
2. The high cost of retail returns
3. IHL group
4. Real virtual humans, Max Planck Institute for Informatics, https://virtualhumans.mpi-inf.mpg.de/people.html
5. Treedy's scanner
6. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H., Thrun, S.: Performance capture from sparse multi-view video. ACM Trans. Graph. (3), 98:1–98:10 (2008)
7. Alldieck, T., Magnor, M., Bhatnagar, B.L., Theobalt, C., Pons-Moll, G.: Learning to reconstruct people in clothing from a single RGB camera. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2019)
8. Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Detailed human avatars from monocular video. In: International Conference on 3D Vision (3DV) (Sep 2018)
9. Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Video based reconstruction of 3d people models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
10. Alldieck, T., Pons-Moll, G., Theobalt, C., Magnor, M.: Tex2shape: Detailed full human body geometry from a single image. In: IEEE International Conference on Computer Vision (ICCV). IEEE (Oct 2019)
11. Bălan, A.O., Black, M.J.: The naked truth: Estimating body shape under clothing. In: European Conf. on Computer Vision. pp. 15–29. Springer (2008)
12. Bertiche, H., Madadi, M., Escalera, S.: CLOTH3D: clothed 3d humans. vol. abs/1912.02792 (2019)
13. Bhatnagar, B.L., Sminchisescu, C., Theobalt, C., Pons-Moll, G.: Combining implicit function learning and parametric models for 3d human reconstruction. In: European Conference on Computer Vision (ECCV). Springer (August 2020)
14.
Bhatnagar, B.L., Tiwari, G., Theobalt, C., Pons-Moll, G.: Multi-garment net:Learning to dress 3d people from images. In: IEEE International Conference onComputer Vision (ICCV). IEEE (oct 2019)15. Bogo, F., Romero, J., Pons-Moll, G., Black, M.J.: Dynamic FAUST: Registeringhuman bodies in motion. In: IEEE Conf. on Computer Vision and Pattern Recog-nition (2017)16. Bradley, D., Popa, T., Sheffer, A., Heidrich, W., Boubekeur, T.: Markerless garmentcapture. In: ACM Transactions on Graphics. vol. 27, p. 99. ACM (2008)17. Chen, X., Pang, A., Zhu, Y., Li, Y., Luo, X., Zhang, G., Wang, P., Zhang, Y., Li, S.,Yu, J.: Towards 3d human shape recovery under clothing. CoRR abs/1904.02601 (2019)18. Dong, H., Liang, X., Wang, B., Lai, H., Zhu, J., Yin, J.: Towards multi-pose guidedvirtual try-on network. International Conference on Computer Vision (ICCV)(2019)19. Dong, H., Liang, X., Zhang, Y., Zhang, X., Xie, Z., Wu, B., Zhang, Z., Shen, X.,Yin, J.: Fashion editing with adversarial parsing learning. Conference on ComputerVision and Pattern Recognition (CVPR) (2020)20. Gong, K., Liang, X., Li, Y., Chen, Y., Yang, M., Lin, L.: Instance-level humanparsing via part grouping network. In: ECCV (2018)6 Tiwari et al.21. Guan, P., Reiss, L., Hirshberg, D., Weiss, A., Black, M.J.: DRAPE: DRessingAny PErson. ACM Trans. on Graphics (Proc. SIGGRAPH) (4), 35:1–35:10 (Jul2012)22. Gundogdu, E., Constantin, V., Seifoddini, A., Dang, M., Salzmann, M., Fua, P.:Garnet: A two-stream network for fast and accurate 3d cloth draping. In: IEEEInternational Conference on Computer Vision (ICCV). IEEE (oct 2019)23. Habermann, M., Xu, W., , Zollhoefer, M., Pons-Moll, G., Theobalt, C.: Livecap:Real-time human performance capture from monocular video (oct 2019)24. Habermann, M., Xu, W., , Zollhoefer, M., Pons-Moll, G., Theobalt, C.: Deepcap:Monocular human performance capture using weak supervision. In: IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2020)25. 
Huang, Z., Xu, Y., Lassner, C., Li, H., Tung, T.: Arch: Animatable reconstructionof clothed humans. In: Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition. pp. 3093–3102 (2020)26. Jiang, B., Zhang, J., Hong, Y., Luo, J., Liu, L., Bao, H.: Bcnet: Learning bodyand cloth shape from a single image. arXiv preprint arXiv:2004.00214 (2020)27. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of humanshape and pose. In: Computer Vision and Pattern Regognition (CVPR) (2018)28. Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct3D human pose and shape via model-fitting in the loop. In: International Confer-ence on Computer Vision (Oct 2019)29. Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression forsingle-image human shape reconstruction. In: CVPR (2019)30. Laehner, Z., Cremers, D., Tung, T.: Deepwrinkles: Accurate and realistic cloth-ing modeling. In: European Conference on Computer Vision (ECCV) (September2018)31. Lazova, V., Insafutdinov, E., Pons-Moll, G.: 360-degree textures of people in cloth-ing from a single image. In: International Conference on 3D Vision (3DV) (sep2019)32. Leroy, V., Franco, J., Boyer, E.: Multi-view dynamic shape refinement using lo-cal temporal integration. In: IEEE International Conference on Computer Vision,ICCV. pp. 3113–3122. Venice, Italy (oct 2017)33. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: Askinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) (6), 248:1–248:16 (Oct 2015)34. Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M.:Learning to dress 3d people in generative clothing. In: IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR). IEEE (jun 2020)35. Miguel, E., Bradley, D., Thomaszewski, B., Bickel, B., Matusik, W., Otaduy, M.A.,Marschner, S.: Data-driven estimation of cloth simulation models. Comput. Graph.Forum (2), 519–528 (2012)36. 
Omran, M., Lassner, C., Pons-Moll, G., Gehler, P., Schiele, B.: Neural body fitting:Unifying deep learning and model based human pose and shape estimation. In:International Conf. on 3D Vision (2018)37. Patel, C., Liao, Z., Pons-Moll, G.: The virtual tailor: Predicting clothing in 3das a function of human pose, shape and garment style. In: IEEE Conference onComputer Vision and Pattern Recognition (CVPR). IEEE (Jun 2020)38. Pons-Moll, G., Pujades, S., Hu, S., Black, M.: ClothCap: Seamless 4D clothingcapture and retargeting. ACM Transactions on Graphics (4) (2017)39. Pons-Moll, G., Romero, J., Mahmood, N., Black, M.J.: Dyna: a model of dynamichuman shape in motion. ACM Transactions on Graphics , 120 (2015)IZER 1740. Pumarola, A., Sanchez, J., Choi, G., Sanfeliu, A., Moreno-Noguer, F.: 3DPeople:Modeling the Geometry of Dressed Humans. In: International Conference in Com-puter Vision (ICCV) (2019)41. Rother, C., Kolmogorov, V., Blake, A.: Grabcut: Interactive foreground extractionusing iterated graph cuts. vol. 23 (2004)42. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu:Pixel-aligned implicit function for high-resolution clothed human digitization. In:Proceedings of the IEEE International Conference on Computer Vision. pp. 2304–2314 (2019)43. Santesteban, I., Otaduy, M.A., Casas, D.: Learning-Based Animation of Clothingfor Virtual Try-On. Computer Graphics Forum (Proc. Eurographics) (2019)44. Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEEComputer Graphics and Applications (3), 21–31 (2007)45. Stuyck, T.: Cloth Simulation for Computer Graphics. Synthesis Lectures on VisualComputing, Morgan & Claypool Publishers (2018)46. Tao, Y., Zheng, Z., Zhong, Y., Zhao, J., Quionhai, D., Pons-Moll, G., Liu, Y.:Simulcap : Single-view human performance capture with cloth simulation. In: IEEEConference on Computer Vision and Pattern Recognition (CVPR) (jun 2019)47. 
Tung, T., Nobuhara, S., Matsuyama, T.: Complete multi-view reconstruction ofdynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In:IEEE 12th International Conference on Computer Vision, ICCV. pp. 1709–1716.Kyoto, Japan (Sep 2009)48. Wang, H., Hecht, F., Ramamoorthi, R., O’Brien, J.F.: Example-based wrinklesynthesis for clothing animation. ACM Transactions on Graphics (Proceedings ofSIGGRAPH) (4), 107:1–8 (Jul 2010)49. Wang, H., Ramamoorthi, R., O’Brien, J.F.: Data-driven elastic models for cloth:Modeling and measurement. ACM Transactions on Graphics (Proceedings of SIG-GRAPH) (4), 71:1–11 (Jul 2011)50. Wang, T.Y., Ceylan, D., Popovic, J., Mitra, N.J.: Learning a shared shape spacefor multimodal garment design. ACM Trans. Graph. (6), 1:1–1:14 (2018)51. White, R., Crane, K., Forsyth, D.A.: Capturing and animating occluded cloth.ACM Trans. Graph.26