Capturing Detailed Deformations of Moving Human Bodies
HE CHEN, HYOJOON PARK, KUTAY MACIT, and LADISLAV KAVAN,
University of Utah
Fig. 1. (a) Our novel motion capture suit with a special pattern; (b) An example input image from our multi-camera capture system; (c) Raw 3D reconstruction of labeled corners from the suit (raw data, no body model was used); (d) The result after interpolating missing observations (here, a body model is used); (e) Our reconstructed mesh aligns very closely with the original input images; (f) Our method works even in uncommon poses with many self-occlusions, such as this yoga pose.
We present a new method to capture detailed human motion, sampling more than 1000 unique points on the body. Our method outputs highly accurate 4D (spatio-temporal) point coordinates and, crucially, automatically assigns a unique label to each of the points. The locations and unique labels of the points are inferred from individual 2D input images only, without relying on temporal tracking or any human body shape or skeletal kinematics models. Therefore, our captured point trajectories contain all of the details from the input images, including motion due to breathing, muscle contractions and flesh deformation, and are well suited to be used as training data to fit advanced models of the human body and its motion. The key idea behind our system is a new type of motion capture suit which contains a special pattern with checkerboard-like corners and two-letter codes. The images from our multi-camera system are processed by a sequence of neural networks which are trained to localize the corners and recognize the codes, while being robust to suit stretching and self-occlusions of the body. Our system relies only on standard RGB or monochrome sensors, fully passive lighting and a passive suit, making our method easy to replicate, deploy and use. Our experiments demonstrate highly accurate captures of a wide variety of
human poses, including challenging motions such as yoga, gymnastics, or rolling on the ground.

Additional Key Words and Phrases: human animation, motion capture, skin deformation

Authors’ address: He Chen, [email protected]; Hyojoon Park, [email protected]; Kutay Macit, [email protected]; Ladislav Kavan, [email protected], University of Utah.
ACM Reference Format:
He Chen, Hyojoon Park, Kutay Macit, and Ladislav Kavan. 2021. Capturing Detailed Deformations of Moving Human Bodies.
ACM Trans. Graph.
1, 1 (February 2021), 18 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
In most real-world images, the human body is occluded by clothing, making precise body measurements difficult or impossible. A significant amount of previous work focuses on approximate but robust pose estimation in the wild [Cao et al. 2018; Güler et al. 2018]. However, a small muscle twitch or the speed of breathing may contain signals that are critical in certain contexts; e.g., in the context of social interactions, minute body shape motion may reveal important information about the person’s emotional state or intent [Joo et al. 2018]. Detailed human body measurements are also highly relevant in orthopedics and rehabilitation [Zhou and Hu 2008], virtual cloth try-on [Giovanni et al. 2012; Ma et al. 2020], and building realistic avatars for telepresence and AR/VR [Barmpoutis 2013; Lombardi et al. 2018].

When precise measurements are needed, prior work utilized either 1) reflective markers attached to a motion capture suit or glued to the skin [Park and Hodgins 2006], or 2) colored patterns painted on the skin [Bogo et al. 2017]. The traditional reflective (“mocap”)
markers present certain limitations. Because all of the markers look alike (Fig. 2a), marker labeling relies strongly on temporal tracking and high frame-rate cameras. However, robust marker labeling is a hard problem [Song and Godøy 2016] which often requires manual corrections, especially for markers that have been occluded for too long. The difficulty of this problem grows with the number of markers [Park and Hodgins 2006], thus sparse marker sets are most common in the industry. Sparse marker sets are sufficient for fitting a low-dimensional skeletal body model, but not for capturing the details of flesh deformation or motion due to breathing.

To capture moving bodies with high detail, the DFAUST approach [Bogo et al. 2017] starts by geometrically registering a template body model to 3D scans [Hirshberg et al. 2012; Pons-Moll et al. 2015] and then uses colored patterns on the skin to obtain high-accuracy temporal correspondences via optical flow. These colored patterns serve a similar purpose as the checkerboard-like corners on our suit, i.e., they enable precise localization of points on the surface of the body. The key difference of our approach is that our suit also contains unique two-letter codes adjacent to each corner, allowing us to label the corners directly by recognizing the codes. This is not possible with DFAUST’s patterns, because they are self-similar, created by applying colored stamps to the skin. Instead, the DFAUST approach relies on the initial geometric registration and temporal tracking, which can suffer from error accumulation and may lead to incorrect local minima in more challenging poses or fast motions. The DFAUST dataset contains a variety of highly detailed human body animations, but is restricted to upright standing-type motions. In contrast, we demonstrate captures of a wider variety of motions, including gymnastics exercises, yoga poses or rolling on the ground; see Fig. 22 and our accompanying video and data.

Our new motion capture method was enabled by recent advances in deep learning and high-resolution camera sensors. The key idea is to use a new type of motion capture suit with special fiducial markers, consisting of checkerboard-like corners for precise localization and two-letter codes for unique labeling. Our localization and labeling process is very robust, because it does not rely on temporal tracking or any type of body model; in fact, our approach succeeds even if only a small part of the body is visible in the image, e.g., the zoom-ins in the right part of Fig. 17. A similar advantage exists also in the temporal domain. Because our localization and labeling approach can process each image independently, there are no issues due to occlusions and dis-occlusions which complicate traditional temporal tracking in both marker-based as well as marker-less methods.

Even though the automatic localization and labeling are very robust, achieving this functionality is non-trivial, because our methods need to be robust against significant stretching of the tight-fitting suit as well as projective distortion. Fortunately, checkerboard-like corners remain checkerboard-like even under significant stretching of the suit. We apply three convolutional neural networks combined with geometric algorithms.
The first of our convolutional neural networks (CNNs) is a corner detector which localizes all of the corners in a high-resolution input image. Candidate quads formed from the detected corners are then rectified; RejectorNet performs quality control on them, retaining only legible and upright-oriented codes. The remaining codes are passed to
RecogNet which reads the characters in the two-letter code. Because the orientation of our codes is unique (we avoided symmetric symbols such as “O” or “I” as well as ambiguous pairs like “6” and “9”), recognition of the code allows us to uniquely label each of the adjacent corners, see Fig. 2c.

The labels of our corners establish correspondences both in time and in space, i.e., between individual cameras in our multi-view system, which means that we can easily triangulate the 2D corner locations into 3D points. However, the 3D reconstructed (triangulated) points will inevitably miss observations due to self-occlusions and the limited number of our cameras, see Fig. 1c. To fill in (interpolate) these missing observations, we start by fitting the STAR model [Osman et al. 2020] and then refine it for each of our actors using point trajectories from a calisthenics-type motion sequence captured using our method. This refinement ensures that we obtain the best possible low-dimensional model for each of our actors, since high quality is our main objective. We use this refined body model to interpolate the missing corners in the rest pose, resulting in a final mesh without any holes, see Fig. 1d.

Our goal was to make each two-letter code as small as possible, so we can recover the highest possible number of points on the body. We created our special capture suits in two sizes, one “medium” (with 1487 corners) and one “small” (with 1119 corners), and we captured three actors: one male and two females (two of the actors used the “medium” suit).

We evaluated both the geometric accuracy, through the reprojection error of the 3D reconstruction, and the quality of the temporal correspondences, by computing the optical flow between a synthetic image and the real image. The results show that 99% of our reconstructed points have a reprojection error of less than 1.01 pixels, and 95% of the pixels have an optical flow norm of less than 1.2 pixels. In our camera setup, 1 pixel approximately converts to 1 mm on a person 2 meters away from the camera.
Contributions:

(1) We propose a new method to measure 3D marker locations at each frame and automatically infer the corresponding marker labels. This is achieved without any priors on human body shapes or kinematics, which means that our data are “raw measurements”, immune to any type of modeling or inductive bias.

(2) We introduce a novel type of fiducial marker and capture suit, which enables marker localization and unique labeling using only local image patches. Our approach does not utilize temporal tracking, which makes it robust to marker dis-occlusions and also invites parallel processing, because each frame can be processed independently.

(3) All results in this paper were obtained using an experimental multi-camera system utilizing 16 commodity RGB cameras and passive lighting. High-end multi-camera systems based on machine vision cameras such as those built by Facebook [Lombardi et al. 2018] or Google [Guo et al. 2019] require significant hardware investments and engineering expertise. In contrast, our system is easy to build from inexpensive off-the-shelf parts. We provide our data as supplemental material and we invite individual researchers, independent studios or makers to replicate our setup and capture new actors and motions.
Optical systems based on reflective markers [Menache 2000] are the most widely used approaches to capture the human body. While typically only sparse marker sets are used, [Park and Hodgins 2006, 2008] pushed the resolution of reflective-marker-based systems up to 350 markers to capture detailed skin deformation. However, difficulties in marker labeling [Song and Godøy 2016] complicate further increases of resolution by adding even more markers. Recent work utilizes self-similarity analysis [Aristidou et al. 2018] and deep learning [Han et al. 2018; Holden 2018] to reduce the expensive manual clean-up in the marker labeling procedure. An alternative to the classical reflective markers is the use of colored cloth, enabling the capture of certain types of garments such as pants [White et al. 2007] or hand tracking using colored gloves [Wang and Popović 2009].

Early work in markerless motion capture [Gavrila and Davis 1996] and [Bregler et al. 2004] inferred human poses directly from 2D images or videos. [Kehl and Van Gool 2006] integrates multiple image cues such as edges, color information and volumetric reconstruction to achieve higher accuracy. [Brox et al. 2009] tracks a 3D human body in 2D images by combining image segments, optical flow and SIFT features [Lowe 1999]. [De Aguiar et al. 2008] deforms a laser scan of the tracked subject under the constraints of multi-view videos to capture spatio-temporally coherent body deformation and textural surface appearance of the actors. Silhouettes [Liu et al. 2013; Vlasic et al. 2008] or visual hulls [Corazza et al. 2010] can be used to obtain more detailed human body deformations. [Stoll et al. 2011] model the human body via sums of Gaussians, representing both shape and appearance of the captured actors.

Deep learning enabled estimation of 2D human poses from monocular multi-person images [Newell et al. 2016; Pishchulin et al. 2016; Raaj et al. 2019; Wei et al. 2016], more recently also with hands and faces [Cao et al. 2018, 2017; Hidalgo et al. 2019]. 3D pose or even a dense 3D surface of the human body can also be predicted from a single image [Choutas et al. 2020; Güler et al. 2018; Mehta et al. 2017; Xiang et al. 2019; Xu et al. 2019]. Morphable human models can be learned from multi-person datasets [Pons-Moll et al. 2015; Robinette et al. 2002]. Models such as SCAPE [Anguelov et al. 2005], SMPL [Loper et al. 2015] and STAR [Osman et al. 2020] focus on the body, while models such as Adam [Joo et al. 2018] and SMPL-X [Pavlakos et al. 2019] also include the face and the hands. Focusing on high-quality rendering rather than geometry, [Guo et al. 2019; Meka et al. 2020] proposed methods for photo-realistic relighting of moving humans, including clothing and accessories such as backpacks.

The idea of a motion capture suit with a special texture is related to fiducial markers used, e.g., in robotics or augmented reality, such as ARTag [Fiala 2005], AprilTag [Olson 2011; Wang and Olson 2016], ArUco [Garrido-Jurado et al. 2014] and many others, but these fiducial markers are typically assumed to be non-deforming. They are also not easy to read for humans, which would complicate their annotation. The localization of our fiducial markers is related to corner detection. Many corner detection methods have been developed to meet different use-case scenarios. There are methods designed to detect general corner features that occur naturally, like [DeTone et al.
2018; Rosten and Drummond 2006]. Another class of corner detectors focuses on rigid calibration checkerboards [Bennett and Lasenby 2014; Chen et al. 2018; Donné et al. 2016; Hu et al. 2019], particularly useful in camera calibration. Because these methods assume the checkerboard pattern to be rigid, they will not work on our checkerboard-like suit, which can deform significantly (Fig. 2b). The code recognition component of our method is related to text recognition. As discussed in [Long et al. 2020], text recognizers generally perform poorly on text with large spatial transformations. One possible solution is based on generating region proposals [Jaderberg et al. 2014; Ma et al. 2018] to rectify the spatial transformation.

High-resolution temporal correspondences can be obtained by registering a template mesh to RGB-D images or 3D scans. The registration can be based solely on geometric information [Allain et al. 2015; Li et al. 2009], or combined with RGB images [Bogo et al. 2015] to reduce tangential sliding. Model-less approaches are also possible [Collet et al. 2015; Dou et al. 2015; Newcombe et al. 2015]. Those methods focus on registering sequential motions frame to frame, with the assumption of small displacements between subsequent frames. Therefore, they can suffer from error accumulation, resulting in drift over time [Casas et al. 2012]. Aligning non-sequential motions is also possible [Huang et al. 2011; Tung and Matsuyama 2010], but it is challenging to establish correspondences between very different poses [Boukhayma et al. 2016; Prada et al. 2016]. Deformation models can be trained from 3D scans [Allen et al. 2003; Anguelov et al. 2005], with non-rigid scan registration being the technical challenge [Hirshberg et al. 2012].

Similarly to our new motion capture suit, the FAUST [Bogo et al. 2014] and DFAUST [Bogo et al. 2017] methods paint high-frequency colored patterns directly on the skin. We chose to work with a suit because putting it on and taking it off is easy and fast compared to applying colored stamps and washing them off after the capture session. Even though we only experimented with basic tight-fitting suits in this paper, future improvements such as adhesive suits or non-permanent tattoos are possible, see Section 7. Our capture system is significantly simpler and less expensive: we use only 16 standard (RGB) cameras with passive uniform lights, while [Bogo et al. 2014, 2017] used 22 pairs of stereo cameras, 22 RGB cameras and 34 speckle projectors (active light). Perhaps more important are the technical differences between our approach and DFAUST, in particular the fact that our codes are unique as opposed to the self-repetitive patterns used in FAUST and DFAUST. Rather than creating a dataset, our goal was to create a universal and practical method to enable future research on advanced human body modeling and its applications in areas ranging from graphics to sports medicine.
Suit.
To create our special motion capture suit, we started by purchasing a tight-fitting unitard, originally intended for dance or performing arts. Fortuitously, one of the manufacturer-provided patterns was precisely the black-and-white checkerboard texture reminiscent of computer vision calibration boards (in fact this provided some of the original inspiration for this project). We purchased two suits, one “medium” and one “small”, and augmented them by writing codes into the white squares using a marker pen. The medium suit
Fig. 2. (a) A classical motion capture suit with reflective markers. (b) The design of our motion capture suit with fiducial markers. (c) Each of our markers consists of checkerboard-like corners and codes.
Fig. 3. (a) A photo of our 16-camera setup. (b) The cameras form a circle surrounding the capture volume.

contains 1487 corners and 625 two-letter codes; the small suit has 1119 corners and 456 codes. For our two-letter codes, we only used symbols whose upright orientation is unique and non-ambiguous, specifically: “1234567ABCDEFGJKLMPQRTUVY”.
Camera system.
Our multicamera setup contains 16 standard (RGB) cameras arranged into a circle surrounding the capture volume (Fig. 3b). Each camera captures images 4000 pixels wide at fast shutter speeds, guaranteeing sharp images even with the fastest human motions. The cameras are calibrated by waving a traditional calibration checkerboard in front of them. The intrinsic and extrinsic camera parameters are calibrated using the well-established method [Zhang 2000], for which we use OpenCV’s checkerboard corner detector for rigid calibration boards [Bradski 2000]. Next, the camera parameters and the 3D checkerboard corner positions in world coordinates are further refined using bundle adjustment [Triggs et al. 1999]. We use the Levenberg-Marquardt algorithm and the Ceres library [Agarwal and Mierle 2012].
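For illustration, the per-camera stage of this calibration could be implemented with OpenCV as in the following sketch; the board dimensions and square size are placeholder assumptions, and the subsequent bundle adjustment is not shown.

```python
# Minimal per-camera calibration sketch (OpenCV), assuming grayscale
# images of a rigid calibration checkerboard waved in front of the camera.
# BOARD (inner corners) and SQUARE_MM are placeholder values.
import cv2
import numpy as np

BOARD = (9, 6)     # inner corners per row/column (assumption)
SQUARE_MM = 30.0   # checker square size in millimeters (assumption)

def calibrate(gray_images):
    # 3D checkerboard corner positions in the board's coordinate frame
    obj = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
    obj[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE_MM
    obj_pts, img_pts = [], []
    for gray in gray_images:
        found, corners = cv2.findChessboardCorners(gray, BOARD)
        if not found:
            continue
        # sub-pixel refinement of the detected checkerboard corners
        term = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-4)
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), term)
        obj_pts.append(obj)
        img_pts.append(corners)
    h, w = gray_images[0].shape[:2]
    # Zhang's method: intrinsics, distortion, and per-view extrinsics
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, (w, h), None, None)
    return rms, K, dist, rvecs, tvecs
```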
Image processing pipeline. The calibrated cameras generate sequences of images, which are processed by our pipeline outlined in Fig. 4. We start by detecting checkerboard-like corners in the input image with sub-pixel accuracy (Fig. 4a, Section 3.1). Next, we need to uniquely label the detected corners by recognizing the adjacent two-letter codes. Because the codes are written in the white squares surrounded by four corners, we generate candidate quadrilaterals (quads) by connecting four-tuples of corners. Only a few four-tuples of corners correspond to the white squares, but it is okay to generate a quad that does not correspond to a white square, because it will be discarded later; hence we use the term “candidate quads”, see Fig. 4b. Since the quads are generated by connecting four corners, we naturally have correspondences between the corners and the quads. The candidate quads are rectified by mapping them into a regular square using a homography (Fig. 4c, Section 3.2) to remove suit stretching and perspective distortion, and then passed as input to
RejectorNet, which performs quality control and checks whether the quad actually corresponds to a white square with a code (Fig. 4d). The RejectorNet also ensures the correct upright orientation of the code. The images accepted by RejectorNet are then passed to RecogNet, which reads the two-letter code that finally enables us to uniquely label each corner (Fig. 4e).

We would like to point out that our method is local by design, i.e., each stage of the pipeline works with small patches of the input image. This gives us several advantages: a) Our method is capable of extracting reliable geometric information of the human body and, crucially, correspondences even from a small patch of the suit. This makes our method very robust to occlusions or partial views of the human body, e.g., due to zoomed-in cameras. b) By decomposing the suit into small quads and undistorting them using homographies, we can counteract much of the projective distortion and suit stretching (see Fig. 4c), simplifying the learning task. c) The CNN quad classifier includes a quality control mechanism, rejecting white squares of dubious quality and further improving the robustness of our method.

The corner detector’s task is to detect and localize all checkerboard-like corners in the input image. This task is non-trivial because there are corner-like features in the background, the suit stretches along with the skin, and there are significant lighting variations. Our corners have two key properties: a) The corners are sparsely and approximately uniformly distributed on the suit; b) The corners are defined locally, i.e., a small image patch is enough to identify and localize a corner. We divide our input image into a regular grid of 8 × 8-pixel cells (Fig. 5a) with the assumption that there can be at most one checkerboard-like corner in each 8 × 8 cell, and apply the CornerdetNet
CNN (Fig. 5c) to detect and localize a checkerboard-like corner in each cell separately. The design of CornerdetNet is inspired by single-shot detectors [Liu et al. 2016; Redmon and Farhadi 2017], which perform prediction and localization simultaneously. The input to
CornerdetNet is an 8 × 8 cell where a corner is being sought, with a 6-pixel margin added to each side (Fig. 5b), making the input crop size 20 × 20. These margins allow us to reliably detect even corners close to the boundaries of the 8 × 8 cell. The 6-pixel margins overlap with adjacent cells (Fig. 5b), but the 8 × 8 cells do not overlap.

Fig. 4. Corner localization and labeling pipeline: (a) the CornerdetNet CNN detects and localizes our suit corners; (b) four-tuples of corners are connected into candidate quads; (c) homography transformations using all four possible orientations (we selected a few quads and outlined them with different colors for illustration purposes); (d) RejectorNet CNN: a binary classifier accepting only valid white squares with upright-oriented two-letter codes; (e) the RecogNet CNN recognizes the two characters in the valid codes.

Fig. 5. (a) The input image is divided into a regular grid of 8 × 8 cells. (b) Each cell is expanded by a margin and the corresponding expanded 20 × 20 crop is passed as input to CornerdetNet. (c) The architecture of our CornerdetNet. (d) Example outputs; even though there is a corner in the second image, it is outside of the inner 8 × 8 cell, thus the detector correctly reports 0 (no corner).

The CornerdetNet outputs three floating-point numbers. The first one is the logit of a binary classifier predicting whether a corner is present or not, and the other two are normalized coordinates ($[0, 1] \times [0, 1]$) of the corner relative to the 8 × 8 cell. The training loss for CornerdetNet is:

$$\mathcal{L}_c(p^*, p; \mathbf{c}^*, \mathbf{c}) = \mathcal{L}_p(p^*, p) + \lambda_c\, p\, \lVert \mathbf{c} - \mathbf{c}^* \rVert^2 \qquad (1)$$

where $\lambda_c$ balances the prediction loss and the localization loss when training CornerdetNet, $p^*$ represents the logit of the binary classifier, $\mathbf{c}^*$ represents the predicted corner location, $p$ and $\mathbf{c}$ represent the respective ground truths, and $\mathcal{L}_p(p^*, p)$ is cross entropy.
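For concreteness, the loss in Eq. 1 could be written as follows in TensorFlow (which we use for training); the tensor shapes and the default value of lambda_c are illustrative assumptions, not the exact training configuration.

```python
# Sketch of the CornerdetNet loss (Eq. 1). The network emits a presence
# logit p_star and normalized coordinates c_star per 20x20 crop;
# lambda_c is a balancing weight (its exact value is an assumption).
import tensorflow as tf

def cornerdet_loss(p_star, c_star, p_true, c_true, lambda_c=1.0):
    # p_star: (B,) presence logits; c_star: (B, 2) predicted coordinates
    # p_true: (B,) 0/1 ground-truth presence; c_true: (B, 2) ground truth
    pred_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=p_true, logits=p_star)
    # localization is penalized only where a corner is actually present
    loc_loss = p_true * tf.reduce_sum(tf.square(c_star - c_true), axis=-1)
    return tf.reduce_mean(pred_loss + lambda_c * loc_loss)
```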
Corner Clustering and Refinement. When a corner lies exactly on the boundary of two 8 × 8 cells, it can be detected more than once (Fig. 6a). To fix such duplicate detections, we perform a clustering pass: if any two detected corners are too close (< 3 pixels), we discard the one with the lower logit value. Since this might introduce additional localization noise, we generate new crops randomly perturbed around the original corner positions, run localization on each of these crops, and average the results in global pixel coordinates, see Fig. 6b. This helps especially when corners cross the boundaries of the 8 × 8 cells (Fig. 6a).

Fig. 6. (a) A corner is detected twice because it is on the boundary between two cells. In the parentheses are the CornerdetNet outputs on each crop. (b) The re-crops generated for a detected corner. In the parentheses are the corner positions in global pixel coordinates.

At this point we have typically detected several hundred corners in each input image. The next step is to read the codes and link them to the corners, which will give us a unique label for each corner. As shown in Fig. 1, the deformations of the suit and its codes can be significant, including not only projective transformations, but also stretching and shearing, because the tight-fitting suit is highly elastic. First we have to detect the white squares with two-letter codes. We know that each such white square is surrounded by four corners. Therefore, we generate quadrilaterals (quads) by connecting four-tuples of corners. In theory we could connect any four-tuple of corners into a quad, but in practice we can immediately discard concave quads (which do not correspond to correct sequences of corners) or quads that would cover too few or too many pixels (which would make it impossible for them to contain a legible code). We call the resulting quads “candidate quads”, because they may, but are not guaranteed to, contain a correct two-letter code. We transform the four corners of each candidate quad to a standardized square using a homography transformation to simplify subsequent processing. The standardized square image includes a fixed pixel margin on each side, giving 104 × 104 pixels in total. The margin allows the RejectorNet to detect errors stemming from incorrect corner detections. Since we do not know the correct upright orientation of the two-letter code yet, we generate all four possible orientations, see Fig. 8c,d.
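A sketch of this rectification step using OpenCV follows; the 104 × 104 output size matches the classifier input described above, while the per-side margin value is an assumption.

```python
# Rectifying a candidate quad into the standardized square via homography.
# STD matches the classifier input size; MARGIN is an assumed split.
import cv2
import numpy as np

STD, MARGIN = 104, 12  # total size and per-side margin (MARGIN assumed)

def rectify_quad(image, quad_corners):
    # quad_corners: (4, 2) array, clockwise from the top-left corner of
    # the hypothesized upright code orientation
    src = np.asarray(quad_corners, dtype=np.float32)
    lo, hi = MARGIN, STD - 1 - MARGIN
    dst = np.array([[lo, lo], [hi, lo], [hi, hi], [lo, hi]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    # warpPerspective also fills the margin by sampling around the quad
    return cv2.warpPerspective(image, H, (STD, STD))

def four_orientations(image, quad):
    # cyclic shifts of the corners yield the four possible orientations
    return [rectify_quad(image, np.roll(quad, -k, axis=0)) for k in range(4)]
```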
Candidate quad generation.
It would be wasteful to enumerate all four-tuples of corners for further processing by neural networks. Therefore, we first apply simple criteria to filter out quads that cannot contain a valid code. We start by iterating over all the corners, and for each corner, we select three other corners within a bounding box. When connecting corners into a quad, we ensure that each quad is convex, clockwise oriented and unique. Additional filtering criteria include geometric criteria and image-based criteria: the geometric criteria constrain the area, maximum/minimum edge lengths and maximum/minimum angles of the generated candidate quad; the image-based criteria constrain the average intensity and standard deviation of all the pixels in the generated candidate quad. To obtain the range for each criterion, we gather statistics for each of those quantities in the training dataset (Section 4) and create conservative intervals to ensure that we cannot mistakenly reject any valid quad. The candidate quads that pass all of these early rejection filters are transformed using a homography and passed on to the quad classifier neural networks. Fig. 8c shows an example of an invalid quad and Fig. 8d demonstrates a valid one.
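The early-rejection filters could look like the following sketch; all numeric thresholds are placeholders standing in for the conservative intervals gathered from training-set statistics, and the angle criteria are omitted for brevity.

```python
# Early-rejection filters for a candidate quad. Thresholds are placeholders.
import numpy as np

def is_candidate_quad(img_gray, quad,
                      area_rng=(200.0, 5000.0), edge_rng=(10.0, 120.0),
                      inten_rng=(80.0, 230.0), std_max=90.0):
    q = np.asarray(quad, dtype=np.float64)         # (4, 2) corner positions
    e = np.roll(q, -1, axis=0) - q                 # edge vectors
    e2 = np.roll(e, -1, axis=0)
    turn = e[:, 0] * e2[:, 1] - e[:, 1] * e2[:, 0]  # 2D cross products
    if not (np.all(turn > 0) or np.all(turn < 0)):  # must be convex
        return False
    # shoelace formula for the quad's area
    area = 0.5 * abs(np.dot(q[:, 0], np.roll(q[:, 1], -1)) -
                     np.dot(q[:, 1], np.roll(q[:, 0], -1)))
    edges = np.linalg.norm(e, axis=1)
    if not (area_rng[0] < area < area_rng[1]):
        return False
    if edges.min() < edge_rng[0] or edges.max() > edge_rng[1]:
        return False
    # image-based criteria evaluated on the quad's bounding box
    # (a cheap stand-in for an exact polygon mask)
    x0, y0 = np.floor(q.min(axis=0)).astype(int)
    x1, y1 = np.ceil(q.max(axis=0)).astype(int)
    patch = img_gray[y0:y1 + 1, x0:x1 + 1]
    return inten_rng[0] < patch.mean() < inten_rng[1] and patch.std() < std_max
```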
Quad classifiers.
We trained two quad classifiers,
RejectorNet and
RecogNet. RejectorNet is a binary classifier predicting whether a candidate quad is valid, i.e., whether the four corners are at the correct locations and their order is correct relative to the upright code orientation, see Fig. 8b. Also, the white square surrounded by a valid quad needs to contain a clearly legible code. Invalid quads are discarded, and the valid ones are passed to RecogNet, which reads the codes, such as “U7” in Fig. 7. RecogNet is a multi-class classifier with two heads, one for each character of the two-letter code. The architectures of both networks are shown in Fig. 7. We use standard cross-entropy losses to train these classifiers. The training of our CNNs is discussed in Section 6.
Why separate RejectorNet and RecogNet?
We considered combining the two networks into one, but we found that network training is easier if we treat each problem separately. Specifically, the RejectorNet should perform quality control of a standardized 104 × 104 image, including rejection of errors made by CornerdetNet (Fig. 12b). Because we prefer missing observations to errors, we train RejectorNet to be conservative and reject any inputs of dubious quality. The second network, RecogNet, has to recognize two characters in any image. We can make RecogNet more reliable by training it even on very difficult input images, enhancing the robustness of the entire pipeline. The details of our training process and data augmentation are discussed in Section 4.2.
Corner labeling and 3D Reconstruction.
At this point, the two-letter codes of the valid quads have been recognized, including their upright orientation. The next step is to uniquely label each corner. We define a labeling function $l(code, i_q)$ which maps a two-letter $code$ and a corner index $i_q \in \{1, 2, 3, 4\}$ (see Fig. 8b) to an integer which represents a unique corner ID. The unique corner IDs are defined for each suit. Many corners have two two-letter codes adjacent to them. If both of the two-letter codes are visible, we can leverage this fact as a redundancy check, detecting potential errors of RecogNet. Given unique corner IDs, we can convert corresponding 2D corners in two or more views into labeled 3D points. Let $\mathcal{C}_i$ be the set of cameras that see corner $i$, $k \in \mathcal{C}_i$ a camera that sees corner $i$, $\mathbf{c}_{ki} \in \mathbb{R}^2$ the location of corner $i$ in the image coordinate system of camera $k$, and $f_k : \mathbb{R}^3 \to \mathbb{R}^2$ the projection function of camera $k$. We compute the 3D reconstructed corner $\mathbf{p}_i$ by minimizing the reprojection error:

$$\mathbf{p}_i = \operatorname*{arg\,min}_{\mathbf{p}_i \in \mathbb{R}^3} \sum_{k \in \mathcal{C}_i} \lVert f_k(\mathbf{p}_i) - \mathbf{c}_{ki} \rVert^2 \qquad (2)$$

This is a non-linear least squares optimization problem; we compute an initial guess of $\mathbf{p}_i$ using the Linear-LS method [Hartley and Sturm 1997] and optimize it using a non-linear least squares solver [Agarwal and Mierle 2012].
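The following sketch illustrates the refinement step of Eq. 2, with SciPy's Levenberg-Marquardt solver standing in for Ceres; it assumes undistorted pinhole cameras given by 3 × 4 projection matrices.

```python
# Multi-view triangulation of one labeled corner (Eq. 2), assuming
# undistorted pinhole cameras. SciPy stands in for the Ceres refinement.
import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    # P: (3, 4) camera projection matrix, X: (3,) world point
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(Ps, cs, X0):
    # Ps: list of (3, 4) matrices for the cameras that see the corner
    # cs: list of (2,) observed image coordinates; X0: (3,) initial guess
    def residuals(X):
        return np.concatenate([project(P, X) - c for P, c in zip(Ps, cs)])
    return least_squares(residuals, X0, method='lm').x
```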
Error Filtering. The label consistency check discussed above works only if two adjacent two-letter codes are present in the suit and visible in the images. If this is not the case, a corner can be assigned the wrong label if RejectorNet or RecogNet makes a mistake. This kind of labeling error will typically result in nonsensical correspondences with large reprojection errors, which we detect and correct by a RANSAC-type method, discussed below. Specifically, for a corner with label $i$, let $\mathcal{C}_i$ be the set of the cameras that claim to see this corner. We assume that outliers in $\mathcal{C}_i$, i.e., the cameras that mislabeled corner $i$, should only be a minority. We iterate over all pairs of cameras $(j, k)$ in $\mathcal{C}_i$, and 3D reconstruct the corner $i$ from each pair. Among all of the pairs, we pick the 3D reconstruction that has the lowest reprojection error averaged over all cameras in $\mathcal{C}_i$ and assume this is the correct 3D location $\mathbf{p}_i$. Next, we analyze the reprojection errors of $\mathbf{p}_i$ in all of the cameras $\mathcal{C}_i$. The reprojection error should be low in cameras with correct labeling, but high if there was a labeling error. We use the $1.5 \times IQR$ (interquartile range) rule [Upton and Cook 1996] to detect the outliers in terms of reprojection errors. We re-compute the triangulation of $\mathbf{p}_i$ after removing the outliers from the cameras $\mathcal{C}_i$. This RANSAC-type outlier filter does not work when there are only two cameras that see one corner. Therefore, we additionally discard reconstructed corners whose average reprojection error exceeds a fixed threshold in pixels. These tests are designed to be conservative, because mistakenly discarded points are not a major problem, just missing observations which can be inpainted as discussed in Section 5.
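A sketch of the outlier test follows; it assumes the per-camera reprojection errors of the best pairwise reconstruction have already been computed.

```python
# Interquartile-range outlier test on per-camera reprojection errors.
import numpy as np

def inlier_cameras(errors, cams):
    # errors: per-camera reprojection errors for the best pairwise
    # reconstruction; cams: the corresponding camera indices
    e = np.asarray(errors)
    q1, q3 = np.percentile(e, [25, 75])
    keep = e <= q3 + 1.5 * (q3 - q1)  # 1.5 x IQR rule
    return [c for c, k in zip(cams, keep) if k]
```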
Fig. 7. Our quad processing pipeline: (a) Example candidate quad. (b) The undistorted candidate quads with margin; these images are the input to our quad classifiers. (c) The architectures of the RejectorNet and RecogNet CNNs.
Fig. 8. Quad generation: (a) For a given corner (red), we find three other corners (yellow) within a bounding box and compute their convex hull to generate a candidate quad. Here we visualize two example candidate quads (green and blue); (b) our corner numbering convention with respect to the upright code orientation; (c) and (d): the four possible orientations of the candidate quads and their corresponding homography transformations. (c) is an invalid candidate, but (d) is valid.

Fig. 9. Example input images (top) from our multi-camera system and the corresponding raw 3D reconstructions (bottom). The reconstructed 3D points are meshed according to the patterns in our suits.

A key feature of our approach is that all of our networks are trained only on small image patches, e.g., see Fig. 10c and Fig. 11e. This allows our trained models to generalize to different suits, capture environments, camera configurations and body poses that are not in the training set, because our local fiducial markers exhibit significantly less variability than images of full human poses. This is quite different from deep-learning based methods that perform global pose prediction, looking for the body as a whole.
The training of our networks does not require large training sets. We have prepared our training data ourselves, without the use of any external annotation services or existing datasets.

Fig. 10. Generating training data for CornerdetNet: (a) The annotated image. (b) Sliding a 20 × 20 window across the annotated image. For each position of the window we generate a crop, an input for CornerdetNet. (c) The crop is a positive sample (1, x, y) for CornerdetNet if it contains a valid checkerboard corner in its center 8 × 8 pixels (red square); otherwise it is a negative sample (0, -, -).

Our dataset contains 24 manually annotated images, randomly selected from captures of our three actors. For each image, we apply two types of annotations: corner annotation (Fig. 10a) and quad annotation (Fig. 11a). In the corner annotation, we manually annotate all of our checkerboard-like corners on the suit with sub-pixel accuracy. In the quad annotation, we manually connect the corners annotated in the previous step into quads. Specifically, we create quads that correspond to valid white squares with two-letter codes on the suit, and the annotators also write down the code of each annotated quad. We ensure the quad vertices are in clockwise order and start from the top-left corner, defined by the upright orientation of the code (see Fig. 8b). These annotations are then automatically converted into training data for our networks as follows.
Corner Detector.
We generate the training data for CornerdetNet by sliding a 20 × 20 window with stride 1, as shown in Fig. 10b. Each of the 20 × 20 crops is an input to CornerdetNet, labeled positive if and only if an annotated corner lies inside its center 8 × 8 pixels. For positive samples we also compute the sub-pixel corner coordinates relative to the 8 × 8 cell.
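A direct (unoptimized) sketch of this training-data generation; function and variable names are illustrative.

```python
# Generating CornerdetNet training crops by sliding a 20x20 window with
# stride 1 over an annotated image. A crop is positive only if an
# annotated corner falls inside its central 8x8 pixels.
CROP, CELL, MARGIN = 20, 8, 6

def make_samples(image, corners):
    # corners: list of (x, y) sub-pixel corner annotations in pixels
    H, W = image.shape[:2]
    samples = []
    for y in range(0, H - CROP + 1):
        for x in range(0, W - CROP + 1):
            cx0, cy0 = x + MARGIN, y + MARGIN  # origin of the 8x8 cell
            inside = [(u, v) for u, v in corners
                      if cx0 <= u < cx0 + CELL and cy0 <= v < cy0 + CELL]
            crop = image[y:y + CROP, x:x + CROP]
            if inside:
                u, v = inside[0]
                # normalized coordinates relative to the 8x8 cell
                samples.append((crop, 1.0, ((u - cx0) / CELL, (v - cy0) / CELL)))
            else:
                samples.append((crop, 0.0, None))
    return samples
```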
Quad Classifiers. We start by generating candidate quads from the annotated corners in the 24 manually annotated images using the algorithm discussed in Section 3.2. Note that the same quad generation algorithm will be used during deployment, i.e., when processing new motion sequences. The quad generator is conservative and creates many quads that do not correspond to valid white squares, see Fig. 11c. However, we know which quads are valid, because all of the valid ones were manually annotated, see Fig. 11a. This allows us to automatically generate both positive and negative examples for a given candidate-quad generator, see Fig. 11d. The resulting 104 × 104 images are used to train RejectorNet. It is important that the quad generator used during deployment is identical to the quad generator used when generating the training data for RejectorNet. The two-letter code annotations of the valid quads are then used to train RecogNet.

Fig. 11. (a, b) Manual quad annotations. (c) The candidate quads generated using the algorithm from Section 3.2. (d) Selection of valid quads. (e) Homography-transformed quads with four possible rotations, including ground-truth labels for training RejectorNet and RecogNet.

Fig. 12. (a) Augmented positive training data for RejectorNet. (b) We generate negative training samples for RejectorNet by warping one or more corners of a positive sample away from its original location, which simulates the case that the quad’s corners were not correctly localized. (c) The data augmentation for RecogNet is aggressive, including significant blurring and large elastic deformations.

Data Augmentation.
All of the crops generated from annotated images as described in the previous sections are augmented by applying intensity perturbations (contrast, brightness, gamma). In addition, we also apply geometric deformations to each input image. For the corner detector, we additionally augment the training data by generating random rotations of each image, because checkerboard-like corners are rotation invariant.
Different data augmentation approaches need to be applied to RejectorNet and RecogNet. For RejectorNet, we blur the image using a Gaussian filter and add elastic deformations using thin-plate splines [Wood 2003] to simulate skin deformation. We constrain the elastic deformations to fix the checkerboard-like corners in place, see Fig. 12a, otherwise positive examples could be turned into negative ones. We also use this fact to our advantage: if we displace a checkerboard-like corner of a valid white square, we obtain a new (augmented) negative example, simulating the case when the quad’s corners have not been correctly localized, see Fig. 12b.

Since RecogNet is required to predict characters from any input image, we can afford to augment our data more aggressively. Specifically, we use much more significant geometric distortions, intensity variations, blurring and additional noise, see Fig. 12c. This aggressive data augmentation has an interesting effect: the performance on the training data becomes worse, since we made the recognition task more difficult. However, we obtain better performance on the test set, which is what matters. This agrees with human intuition: if students are given harder homework (training), they will likely perform better in their first job.
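A sketch of the photometric part of this aggressive augmentation follows; the perturbation ranges are illustrative assumptions, and the thin-plate-spline elastic warps are omitted for brevity.

```python
# Aggressive photometric augmentation for RecogNet training crops:
# gamma/contrast/brightness perturbations, Gaussian blur, and noise.
# All ranges are placeholder assumptions.
import cv2
import numpy as np

rng = np.random.default_rng()

def augment(crop):
    img = crop.astype(np.float32) / 255.0
    img = img ** rng.uniform(0.5, 2.0)                           # gamma
    img = img * rng.uniform(0.7, 1.3) + rng.uniform(-0.1, 0.1)   # contrast/brightness
    k = int(rng.choice([1, 3, 5, 7]))                            # blur strength
    img = cv2.GaussianBlur(img, (k, k), 0)
    img += rng.normal(0.0, rng.uniform(0.0, 0.05), img.shape)    # sensor noise
    return np.clip(img, 0.0, 1.0)
```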
Synthetic Data.
To further enhance the diversity of our training data, we also generated synthetic datasets by rendering an animated SMPL [Loper et al. 2015] model. We use synthetic data only for training RecogNet, because this was the bottleneck in the overall pipeline; see Section 6 for more details. We textured the body mesh with the same checkerboard-like pattern as used in the real suit and applied animations from a public motion capture database [Mahmood et al. 2019]. We randomly generated new two-letter codes, including variations in font types and sizes to emulate the handwriting of the codes. For each animation frame, we rendered images with virtual cameras, simulating our real capture setup by copying the intrinsic and extrinsic parameters from our real cameras. The visibility of corners in the rendered images is determined using ray tracing. To control the quality of the quads that are added to the training set, we check for corner visibility and use a classifier considering the quad’s 3D normal direction and the quad’s geometry in the rendered image.
Our method for 3D reconstruction of labeled points will inevitably produce missing observations, because the human body often self-occludes and is observed only by a limited number of cameras, see Fig. 9. In this section, we propose a method to interpolate (inpaint) the missing corners. Even though we could use any existing multi-person human body model [Loper et al. 2015; Osman et al. 2020] for this purpose, we can achieve higher quality, because our pipeline gives us highly accurate measurements of the actor’s body and its deformations. Therefore, instead of relying on previous statistical body shape models, we capture example motions of a given actor using our method and use this data to create a more precise refined body model, i.e., a model with parameters refined for a specific person.

Our body model has two types of parameters: shape parameters that are invariant in time, and pose parameters that change from frame to frame as the body moves. The shape parameters are only optimized during the model refinement process. After the body model refinement process is done, we fix the shape parameters and only allow the pose parameters to change. However, even after the refinement, the low-dimensional body model will not fit the 3D reconstructed corners exactly (Fig. 14c). We call the remaining residuals “non-articulated displacements”, because they correspond to motion that is not well explained by the articulated body model. The non-articulated displacements arise due to breathing, muscle activations, flesh deformation, etc. Therefore, in addition to our refined body model we also interpolate the non-articulated displacements mapped to the rest pose via inverse skinning. The combination of the refined body model with the non-articulated displacement interpolation enables us to achieve high-quality inpainting.
Our body model is based on linear blend skinning (LBS) [Magnenat-Thalmann et al. 1988]. Let $\mathbf{v}_i \in \mathbb{R}^4$ for $i = 1, 2, \ldots, N$ be the deformed vertices in homogeneous coordinates, where $N$ is the number of all the vertices of our body model. We denote the skinning model as $\mathbf{v}_i = D_i(\tilde{\mathbf{v}}_i, \mathbf{J}, \mathbf{W}, \theta)$, where $\tilde{\mathbf{V}} = (\tilde{\mathbf{v}}_1, \ldots, \tilde{\mathbf{v}}_N)$ are rest pose vertex positions, $\mathbf{J} = (\mathbf{j}_1, \ldots, \mathbf{j}_M)$ are joint locations, $M$ is the number of joints in our model, $\mathbf{W} \in \mathbb{R}^{N \times M}$ is the matrix of skinning weights, and $\theta \in \mathbb{R}^{M \times 4}$ are joint rotations represented by quaternions. In summary, $\tilde{\mathbf{V}}$, $\mathbf{J}$ and $\mathbf{W}$ are shape parameters (constant in time) and $\theta$ are pose parameters (varying in time). The deformed vertices can be computed as:

$$\mathbf{v}_i = D_i(\tilde{\mathbf{v}}_i, \mathbf{J}, \mathbf{W}, \theta) = \sum_{j=1}^{M} w_{i,j}\, T_j(\theta, \mathbf{J})\, \tilde{\mathbf{v}}_i \qquad (3)$$

Note that here $\mathbf{v}_i, \tilde{\mathbf{v}}_i \in \mathbb{R}^{4 \times 1}$ are the deformed and rest pose vertex in homogeneous coordinates and $T_j(\theta, \mathbf{J}) \in \mathbb{R}^{4 \times 4}$ represents the transformation matrix of joint $j$. In the following we will use homogeneous coordinates interchangeably with their 3D Cartesian counterparts.
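Eq. 3 translates to a few lines of NumPy; this sketch assumes the per-joint 4 × 4 transformations have already been composed from the joint rotations θ and joint locations J.

```python
# Minimal linear blend skinning (Eq. 3). T holds the 4x4 transformation
# of every joint for the current pose (assumed precomposed elsewhere).
import numpy as np

def skin_vertices(V_rest, W, T):
    # V_rest: (N, 3) rest pose vertices; W: (N, M) skinning weights;
    # T: (M, 4, 4) per-joint transformations for this frame
    N = V_rest.shape[0]
    Vh = np.hstack([V_rest, np.ones((N, 1))])    # homogeneous coordinates
    # blended transform per vertex: sum_j w_ij * T_j
    Tv = np.einsum('nm,mij->nij', W, T)          # (N, 4, 4)
    out = np.einsum('nij,nj->ni', Tv, Vh)        # apply to each vertex
    return out[:, :3]
```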
Initialization. We initialize our body model by registering our corners to the STAR model [Osman et al. 2020]. We start by selecting a frame $f_{\mathrm{init}}$ in a rest-like pose where most corners are visible, and fit the STAR model to our labeled 3D points in $f_{\mathrm{init}}$ using a non-rigid ICP scheme which finds correspondences between our suit corners and the STAR model’s mesh. The non-rigid ICP process is initialized with 10 to 20 hand-picked correspondences between the STAR model and the 3D reconstructed corners. During the ICP procedure, we optimize both pose and shape parameters of the STAR model and iteratively update correspondences by projecting each of our 3D reconstructed points to the closest triangle of the STAR model (the actual closest point is represented using barycentric coordinates). At this stage, we have registered most of our corners to the STAR model, but we still need to add corners that were unobserved in frame $f_{\mathrm{init}}$. We can fit the STAR model to subsequent frames of our training motion using non-rigid ICP initialized by the registered corners instead of hand-picked correspondences. These subsequent frames reveal corners unobserved in the initial frame, which we register against the STAR mesh by closest-point projection as before. We use the corners registered to the STAR model’s rest pose as the initial rest pose shape $\tilde{\mathbf{V}}$, and use barycentric interpolation to generate the initial skinning weights $\mathbf{W}$. Note that the number of vertices and the mesh connectivity of our body model are different from the STAR model’s mesh. We use each corner of the suit as a vertex of our model, and the rest pose vertex $\tilde{\mathbf{v}}_i$ corresponds to corner $i$ of our suit. The meshing of our body model is discussed below. We use the STAR model’s joints as the initial joint locations $\mathbf{J}$. We removed the joints that control the head, neck, toes and palms from the STAR model, reducing the number of joints $M$ accordingly. We call this model our initial body model.
Model refinement. After the initialization, we further optimize the shape parameters to obtain our refined body model that more accurately fits a specific actor. Specifically, we optimize the skinning weights $\mathbf{W}$, the joint locations $\mathbf{J}$ and the rest pose vertex positions $\tilde{\mathbf{V}}$. Unlike SMPL or STAR, we do not use pose-corrective blend shapes and instead correct the shape by interpolating non-articulated displacements, discussed in Section 5.2. If $P_k$, $k = 1, 2, \ldots, K$ is the set of 3D points that were reconstructed from frame $k$ and $K$ is the number of frames in the training set, we refine the body model by minimizing:

$$\mathcal{L}_A(\tilde{\mathbf{V}}, \mathbf{J}, \mathbf{W}, \Theta) = \mathcal{L}_f(\tilde{\mathbf{V}}, \mathbf{J}, \mathbf{W}, \Theta) + \lambda_g \mathcal{L}_g(\mathbf{W}) + \lambda_J \mathcal{L}_J(\mathbf{J}) \qquad (4)$$

where $\Theta = (\theta_1, \theta_2, \ldots, \theta_K)$ are the pose parameters of all the frames in the training set and $\mathcal{L}_f(\tilde{\mathbf{V}}, \mathbf{J}, \mathbf{W}, \Theta)$ is the fitting error term:

$$\mathcal{L}_f(\tilde{\mathbf{V}}, \mathbf{J}, \mathbf{W}, \Theta) = \frac{1}{\sum_{k=1}^{K} |P_k|} \sum_{k=1}^{K} \sum_{\mathbf{p}_{ki} \in P_k} \lVert D_i(\tilde{\mathbf{v}}_i, \mathbf{J}, \mathbf{W}, \theta_k) - \mathbf{p}_{ki} \rVert^2 \qquad (5)$$

$\mathcal{L}_J(\mathbf{J})$ is an $\ell_2$ loss penalizing joint locations moving too far away from their initial positions, and $\mathcal{L}_g(\mathbf{W})$ is a regularization term encouraging sparsity of the skinning weights:

$$\mathcal{L}_g(\mathbf{W}) = \sum_{i=1}^{N} \sum_{j=1}^{M} g_{i,j}\, w_{i,j} \qquad (6)$$

where $g_{i,j}$ is the geodesic distance from corner $i$ to the closest vertex that has a non-zero initial weight for joint $\mathbf{j}_j$ in the STAR model. The regularization weights $\lambda_g$ and $\lambda_J$ were set empirically, with our spatial units in millimeters.

We optimize $\mathcal{L}_A$ with an alternating optimization scheme. Starting with the initial LBS model, we first calculate pose parameters $\theta_k$ for each frame. Then we optimize $\mathbf{W}$, $\mathbf{J}$, and $\tilde{\mathbf{V}}$ one by one, while keeping the other parameters fixed. We iterate this procedure until the error decrease becomes negligible; in our results we needed between 50 and 100 iterations.

After the optimization is finished, we mesh the rest pose vertices $\tilde{\mathbf{V}}$. From the unique ID of each corner, we know how the corners were connected into quads in the suit. We manually add vertices to close the holes which come from areas of the suit such as the zipper and the seams (see Fig. 13a). The result is a quad-dominant mesh (Fig. 13b).

Fig. 13. (a) The quad structure corresponding to our suit has holes due to the zipper and the seams; (b) the completed rest pose mesh.
Fig. 14. Our hole-filling pipeline: (a) an input image; (b) 3D points $P$ reconstructed from input images and connected with quads; (c) our refined body model (gray) does not fit $P$ exactly (the transparent rendering shows discrepancies); (d) inverse skinning, mapping $P$ to the rest pose; (e) rest pose mesh interpolation, matching the inverse-skinned $P$ exactly; (f) the final result obtained by forward skinning of the interpolated rest pose mesh.

After the optimization, the fitting error (Eq. 5) drops from 13.5 mm to 7.1 mm on the test set; further results are reported in Section 6.3. The refined LBS body model is good for representing articulated skeletal motion of the actor’s body, but it does not represent well effects such as breathing or flesh deformation. However, the non-articulated component of the motion that cannot be represented
by LBS is relatively small. Therefore, we start by applying inverse skinning transformations (also known as “unposing”) to our observed 3D reconstructed points $\mathbf{p}_i$, see Fig. 14d. We denote the inverse skinning of point $i$ at pose $k$ as $D_i^{-1}(\mathbf{p}_i, \mathbf{J}, \mathbf{W}, \theta_k)$. As can be seen in Fig. 14d, $D_i^{-1}(\mathbf{p}_i, \mathbf{J}, \mathbf{W}, \theta_k)$ will not exactly match $\tilde{\mathbf{v}}_i$ due to the non-articulated residuals. Formally, the non-articulated displacements $\Delta\tilde{\mathbf{v}}_{ki}$ are defined by:

$$\mathbf{v}_{ki} = D_i(\tilde{\mathbf{v}}_i + \Delta\tilde{\mathbf{v}}_{ki}, \mathbf{J}, \mathbf{W}, \theta_k) \qquad (7)$$

The key problem of our inpainting consists in interpolating the values of $\Delta\tilde{\mathbf{v}}_{ki}$ from the observed points to the unobserved ones, in other words, predicting the unobserved non-articulated displacements, see Fig. 14e. The modified rest pose is then mapped back by (forward) skinning to produce the final mesh, see Fig. 14f.

Our method for predicting the unobserved non-articulated displacements in the rest pose is based on the assumption of spatio-temporal smoothness. We stack all of the rest pose displacements into a $KN \times 3$ matrix $\mathbf{X}$, where $K$ is the number of frames and $N$ the number of vertices (all vertices, both unobserved and observed). We find $\mathbf{X}$ by solving the following constrained optimization problem:

$$\min_{\mathbf{X}} \; \mathcal{L}_{\mathrm{spat}}(\mathbf{X}) + w_T\, \mathcal{L}_{\mathrm{temp}}(\mathbf{X}) \quad \text{s.t.} \quad \mathbf{C}\mathbf{X} = \mathbf{D} \qquad (8)$$

where $\mathcal{L}_{\mathrm{spat}}$ is a spatial Laplacian term that penalizes non-smooth deformations of the mesh and $\mathcal{L}_{\mathrm{temp}}$ is a temporal Laplacian term that penalizes non-smooth trajectories of the vertices. Both of the terms are positive semi-definite quadratic forms. The parameter $w_T$ is a weight balancing these two terms, which we empirically set to 100. The sparse selector matrix $\mathbf{C}$ represents the observed points (constraints) and $\mathbf{D}$ their unposed 3D positions for each frame (each frame may have a different set of observed points). Specifically, we define $\mathcal{L}_{\mathrm{spat}}$ as:

$$\mathcal{L}_{\mathrm{spat}} = \sum_{i=1}^{3} \mathbf{X}_i^{\mathsf{T}} \mathbf{L}\, \mathbf{X}_i \qquad (9)$$

where $\mathbf{L}$ is the cotangent-weighted Laplacian of the rest pose and $\mathbf{X}_i$ is the $i$-th column of $\mathbf{X}$. We found this quadratic deformation energy to be sufficient because our non-articulated displacements in the rest pose are small, though in future work it would be possible to explore non-linear thin shell deformation energies. For $\mathcal{L}_{\mathrm{temp}}$, we use a 1D temporal Laplacian which corresponds to acceleration: $\lVert \Delta\tilde{\mathbf{v}}_{(k-1)i} - 2\Delta\tilde{\mathbf{v}}_{ki} + \Delta\tilde{\mathbf{v}}_{(k+1)i} \rVert^2$. The $\mathcal{L}_{\mathrm{spat}}$ operator is applied to all frames independently, and $\mathcal{L}_{\mathrm{temp}}$ is applied to all vertices independently. However, their weighted combination in Eq. 8 introduces spatio-temporal coupling, allowing one observed point to affect unobserved points through both space and time.

The optimization problem in Eq. 8 is a convex quadratic problem subject to equality constraints, which we transform to a linear system (KKT matrix) and solve. The only complication is that when processing too many frames, the KKT system can become too large; even though the KKT matrix is sparse, the linear solve becomes costly. To avoid this problem, we observe that smoothing over too many frames is not necessary and introduce a windowing scheme, decomposing longer sequences into 150-frame windows and solving them independently. To avoid any non-smoothness when transitioning from one window to another, the 150-frame windows overlap by 50 frames. After solving the problem in Eq.
8 for each window separately, we smoothly blend the overlapping 50 frames to ensure smoothness when transitioning from one window to the next.

In this section we considered only off-line hole filling, where we can infer information from future frames. This approach would not be applicable to real-time settings where future frames are not available.
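A sketch of the per-window solve follows; it assumes the combined quadratic form Q = L_spat + w_T L_temp and the selector matrix C have been assembled elsewhere as sparse matrices.

```python
# Sketch of the constrained inpainting solve (Eq. 8): minimize a
# spatio-temporal Laplacian energy subject to interpolating the observed
# displacements, via the KKT system of the equality-constrained QP.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_inpainting(Q, C, D):
    # Q: (KN, KN) PSD quadratic form (assumed assembled elsewhere);
    # C: (m, KN) selector of observed points; D: (m, 3) their unposed
    # positions, one column per spatial axis
    m, n = C.shape
    KKT = sp.bmat([[Q, C.T], [C, None]], format='csc')
    X = np.zeros((n, 3))
    for axis in range(3):                 # x, y, z solved independently
        rhs = np.concatenate([np.zeros(n), D[:, axis]])
        sol = spla.spsolve(KKT, rhs)
        X[:, axis] = sol[:n]              # drop the Lagrange multipliers
    return X
```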
Table 1. Composition of our training data. The original training set is the training set before data augmentation, from 20 annotated images.

Type of Data              CornerdetNet   RejectorNet   RecogNet
Original Training Set           667320         21257       7402
Augmented Training Set         5678934        121060     118432
Synthetic Training Set              NA            NA     214471

We manually annotated 24 images of our actors. Out of the 24, we withheld 4 as a test set. Table 1 shows the total numbers of images used for training our CNNs. As shown in the first row of Table 1, the original training set (without data augmentation) for RecogNet is much smaller compared to
CornerdetNet and RejectorNet, because of the limited number of valid quads in each of our annotated images. To improve the classification performance of RecogNet, we used synthetically generated images to complement the real data, as discussed in Section 4.2. The synthetic data contain 214471 crops (104 × 104), which significantly improved the robustness of RecogNet, see Section 6.1.

We train our CNNs using TensorFlow [Abadi et al. 2015] on a single NVIDIA Titan RTX; for each of our CNNs, an overnight run is typically enough to converge to good results using the Adam optimizer. After our CNNs have been trained, we run inference on a PC with an i7-9700K CPU and an NVIDIA GTX 1080 GPU. On a full-resolution input image, an inference pass of CornerdetNet takes 300 ms, generating candidate quads takes 10 ms, RejectorNet takes 1-2 s to classify all of the candidate quads, and RecogNet takes 5 ms to recognize the valid quads. The computational bottleneck is the RejectorNet due to the large number of candidate quads; this could be improved in the future by more aggressive culling of candidate quads. For each frame, the time for 3D reconstruction is negligible, taking less than 1 ms for all points. Even though we used only one computer and processed our image sequences off-line, we would like to point out that our method for extracting 3D labeled points from multi-view images is embarrassingly parallel, because each frame and even each input crop for our CNNs can be processed independently. Coupling through time is introduced only in the final hole-filling step (Section 5.2). The time for solving the sparse linear system (Eq. 8) for a 150-frame window is about 10 s.
We captured motion sequences of three actors, one male and two females. One of the female actors wears the small suit and the other two actors wear the medium suit. For each actor, we captured about 12,000 frames (at 30 FPS) of raw image data consisting of 1) camera calibration, 2) 6000 frames of a calisthenics-type sequence intended for body model refinement (also serving as a warm-up for the actor), and 3) the main performance. Each frame consists of 16 images from our multicamera setup. It took about 300 hours to process all of the 576,000 images (4.6 TB) using one computer.
CornerdetNet.
There are two parts to CornerdetNet's output: 1) a classification response that predicts whether there is a valid corner in the center window of the input image crop, and 2) its subpixel coordinates (or arbitrary values if a corner is not present). In Table 2 we summarize the results for both classification and localization errors. The localization error is measured as the distance in pixels between the predicted corner location and the manually annotated corner location. The overall classification accuracy of CornerdetNet is 99.393% on the training set and 99.510% on the test set. The fact that CornerdetNet works better on the test set supports our hypothesis that more aggressive data augmentation results in worse performance on the training set but better performance on the test set. On the test set, the average localization error is 0.21 pixels and 99% of the corner localizations achieve an error of 0.6361 pixels or less, which is remarkably low. With our camera setup, a 1 pixel error corresponds to approximately 1mm of 3D error for an actor 2 meters away from the camera. In practice, this means that our 3D reconstructed points are highly accurate, allowing us to capture minute motions such as muscle twitches or flesh jiggling.
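As a back-of-the-envelope check (our own derivation; the effective focal length below is inferred from the stated numbers, not reported in the paper), a pinhole model maps a pixel error $\delta_{\mathrm{px}}$ at depth $z$ to a 3D error of approximately

\[
\delta_{\mathrm{3D}} \approx \frac{z}{f}\,\delta_{\mathrm{px}}, \qquad
1\,\mathrm{mm} \approx \frac{2\,\mathrm{m}}{f} \cdot 1\,\mathrm{px}
\;\Rightarrow\; f \approx 2000\,\mathrm{px},
\]

so the 99th-percentile localization error of 0.6361 px corresponds to roughly 0.64mm at a 2m distance.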
Table 2. Results for CornerdetNet on training and test sets.

(a) Confusion matrix on training set (n=5678934):
                    Actual True   Actual False
Prediction True     2.533%        0.587%
Prediction False    0.02%         96.86%

(b) Confusion matrix on test set (n=13650):
                    Actual True   Actual False
Prediction True     1.7%          0.44%
Prediction False    0.051%        97.81%

(c) Max/mean/median of corner localization error in pixels:
Data set       Max      Mean     Median
Training set   1.959    0.1793   0.1448
Test set       0.9485   0.2121   0.1904

(d) Percentiles of corner localization error in pixels:
Data set       95%      99%      99.9%    99.99%
Training set   0.4613   0.6611   0.9115   1.166
Test set       0.5151   0.6361   0.8912   0.9428
RejectorNet.
The confusion matrices of a trained RejectorNet network are reported in Table 3.
Fig. 15. Examples of errors made by RejectorNet. (a) False positives; even though the codes are legible, these samples were labeled negative due to slight image imperfections. (b) False negatives, labeled positive but close to the decision boundary.
The overall classification accuracy of RejectorNet is 99.723% on the training set and 99.704% on the test set. From the confusion matrix, we can observe that we have more false positives than false negatives. The reason is that we intentionally annotated the training data conservatively: as shown in Fig. 15a, quads with even slight imperfections were labeled as negative examples. This results in RejectorNet reporting more false positives, but RejectorNet actually inherits the conservative nature of the annotations; in practice, RejectorNet only rarely accepts a low-quality quad image.
Table 3. Confusion matrices for RejectorNet on training and test sets.

(a) Confusion matrix on training set (n=121060):
                    Actual True   Actual False
Prediction True     13.133%       0.243%
Prediction False    0.034%        86.59%

(b) Confusion matrix on test set (n=3372):
                    Actual True   Actual False
Prediction True     1.305%        0.297%
Prediction False    0.0%          98.399%
RecogNet.
We compare RecogNet trained with and without the synthetic training set in Table 4. Without the synthetic training set, RecogNet had a prediction accuracy of 99.522% on the test set. Although seemingly high, this accuracy was too low for our purposes, and it was the main source of errors in our pipeline. Enhanced with the synthetic training set, the prediction accuracy on the test set increased to 99.919%, which significantly improved our results.
Table 4. Classification accuracy for RecogNet trained with/without synthetic training data.

With synthetic   Classification accuracy
training set     Real Training   Synthetic Training   Test
No               99.940%         NA                   99.522%
Yes              99.967%         94.849%              99.919%
Fig. 16. Comparison of histograms of reprojection errors in 3D reconstruction and camera calibration. (a) Distribution of reprojection errors computed per camera for all the 3D reconstructed corners in 10000 consecutive frames. (b) Distribution of reprojection errors of camera calibration. We can see that the two histograms look very similar.
Overall performance.
In the previous sections we reported the results of each individual CNN. To evaluate our complete corner localization and labeling pipeline (Fig. 4), we use our test set of 4 manually annotated images where we know the ground-truth positions and labels of all corners. The 4 images in the test set collectively contain 1702 manually labeled corners. Our 2D pipeline detected 92% (1566 of 1702) of the ground-truth corners. The discarded corners corresponded to low-quality quads which were rejected by RejectorNet. Note that we intentionally trained RejectorNet to be conservative, i.e., to reject all borderline cases (but we do not argue that the same principle should be applied to SIGGRAPH technical paper submissions!). Missing observations do not represent a big problem because they can be fixed by inpainting (Section 5.2); rejecting low-quality quads is in fact beneficial, because we observed that they are often associated with inaccurate corner localization, which would increase the noise in the 3D reconstruction. The mean corner localization error in our test set is 0.4607 pixels and the maximum localization error is 1.854 pixels. Due to our conservative rejection approach, the final CNN, RecogNet, made zero mistakes on the test set, i.e., all of the 1566 corners were assigned the correct label.
Metrics.
Evaluating 3D reconstruction accuracy is hard, because we do not have any ground-truth measurements of a moving human body. To evaluate the accuracy of our 3D reconstructed corners, we compute their reprojection errors and compare them to the reprojection errors obtained in our camera calibration process (Section 3.2). Using $f_k$ to denote the projection function of camera $k$, if a reconstructed 3D point $\mathbf{p}_i$ is seen by camera $k$ and $\mathbf{c}_i^k$ is the pixel location of the corresponding 2D corner $i$ in camera $k$, the reprojection error for corner $i$ in camera $k$ is defined as:

$e_{i,k} = \|f_k(\mathbf{p}_i) - \mathbf{c}_i^k\|$ (10)

The reprojection error for camera calibration is defined analogously, except that we use 3D calibration boards with perfect, rigid checkerboard corners and a standard OpenCV corner detector. In contrast, our corners are painted on an elastic suit worn by an actor.
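For illustration, Eq. 10 can be evaluated per camera with OpenCV's pinhole projection routine; this is a sketch with hypothetical variable names (rvec, tvec, Kmat, dist for the calibrated extrinsics, intrinsics, and distortion), not the exact code of our pipeline:

    import numpy as np
    import cv2

    def reprojection_errors(P3d, rvec, tvec, Kmat, dist, c2d):
        """Eq. 10 for one camera: distances in pixels between projected
        reconstructed corners and their detected 2D locations.
        P3d: (n, 3) reconstructed corners visible in this camera
        c2d: (n, 2) detected subpixel corner locations in the same camera."""
        proj, _ = cv2.projectPoints(P3d.astype(np.float64), rvec, tvec, Kmat, dist)
        return np.linalg.norm(proj.reshape(-1, 2) - c2d, axis=1)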
Quantitative evaluation.
We report the histograms of reprojection errors of 3D reconstruction and camera calibration in Fig. 16. The 3D reconstruction reprojection error is computed per camera for all the reconstructed points in a consecutive sequence of 10000 frames. The calibration reprojection error was computed on the 448 frames that we use to calibrate the cameras, where we wave a calibration board in front of our cameras. In Fig. 16, we can see that the two error distributions look very similar, which means the reprojection errors of our 3D reconstruction have similar statistics to the reprojection errors in camera calibration. We cannot expect to obtain lower reprojection errors than camera calibration. Table 5 shows the percentiles of all the reprojection errors in the 10000 frames that we use to evaluate the 3D reconstruction. 99% of the reprojection errors are less than 1.009 pixels, which is remarkably accurate given the high resolution of our images.

Table 5. Percentiles of reprojection errors computed per camera for all the reconstructed points in a consecutive sequence of 10000 frames.

95%      99%     99.9%   99.99%
0.6979   1.009   1.409   3.376
Qualitative evaluation.
Fig. 17 shows challenging cases with significant self-occlusions. We mesh the reconstructed point cloud using the rest-pose mesh structure introduced in Section 5.1, preserving the observed faces of the rest-pose mesh (see Fig. 13b). Then we project the reconstructed mesh back to the image using the camera parameters, which gives us the green wireframe in Fig. 17. We can see that the mesh wireframe aligns very closely with the checkerboard pattern on the suit. Another important observation is that even with large occlusions, our method can still obtain correctly labeled corners as long as the entire two-letter code is visible; see, e.g., the foot and calf in Fig. 17a,b. In Fig. 17c, we can see that the conservative RejectorNet correctly rejects the wrinkled quads in the belly region, since reading the codes there would be difficult or impossible.
To refine our actor model, we record a 6000-frame training sequence. After the body model refinement, we select another 3000 frames corresponding to motions different from the ones in the training set. The fitting error is defined as the distance between the vertices of the deformed body model and the actual 3D reconstructed corners. We compare the fitting errors between the initial model, which is just a remeshing of the STAR model (see Section 5.1), and the refined body model, which was optimized on the training set (see Section 5.2).
Fig. 17. Our results in challenging poses. Our reconstructed mesh (visualized as a green wireframe) closely matches the checkerboard pattern in the original input images (background). Note the successful isolated code recognitions on the feet in (a) and (b).

Fig. 18. Histograms of fitting errors of the initial body model and the refined body model on the training set and the test set.
Fig. 19. The fitting errors (distance in mm) on the body model; left: an initial body model which is just a re-meshing of the STAR model; right: our final refined body model.
Fig. 18 shows the distribution of fitting errors per vertex of the initial body model and the refined body model on the training set and the test set. We can see that in both data sets, the refined body model is much more accurate. Specifically, the body model refinement reduces the average fitting error from 13.6mm to 5.2mm on the training set and from 13.5mm to 7.1mm on the test set. Fig. 19 visualizes the fitting errors on the body model before and after body model refinement in one example frame using a heat map.
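The fitting error defined above amounts to a per-point Euclidean distance; a minimal sketch with hypothetical variable names (and assuming a metric calibration in meters):

    import numpy as np

    def fitting_errors_mm(model_vertices, corners_3d, observed_idx):
        """Per-point fitting error: distance between the deformed body model's
        vertices and the corresponding observed 3D reconstructed corners.
        model_vertices: (N, 3) posed model vertices in meters
        corners_3d:     (m, 3) reconstructed corners for this frame
        observed_idx:   (m,) indices of the model vertices matching the corners."""
        d = np.linalg.norm(model_vertices[observed_idx] - corners_3d, axis=1)
        return d * 1000.0  # meters -> millimeters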
To quantify the accuracy of the 3D reconstruction of the entire body, we compare renderings of a textured mesh with the original images using optical flow [Bogo et al. 2014, 2017]. First, we need to create a suit-like texture for our body mesh (Fig. 13b). We create a standard UV parametrization for our mesh and generate the texture from 10 hand-picked frames using a differentiable renderer [Ravi et al. 2020]; though this is just one possible way to generate the texture [Bogo et al. 2017]. We render the textured body mesh with back-face culling enabled and overlay it over clean plates (i.e., images of the capture volume without any actor). The virtual camera parameters are set to our calibration of the real cameras. The optical flow is computed from the synthetic images to the undistorted real images using FlowNet2 [Ilg et al. 2017] with the default settings.

Because our mesh does not include the hands and the head, we first render a foreground mask of our body mesh (Fig. 20c). We only evaluate the optical flow on the region covered by the foreground mask to exclude the hands, the head, and the background. The foreground mask cannot exclude the hands and the head when they occlude the body (as in Fig. 20a,b) but, fortunately, the optical flow is robust to missing parts (see Fig. 20d).

We use optical flow to compare the original images with two types of renders: 1) our low-dimensional refined body model (the gray mesh in Fig. 14c, which does not fit the reconstructed corners
Fig. 20. (a) The input image. (b) The synthetic image rendered from our body mesh. (c) The foreground mask of our body mesh. (d) The optical flow between the synthetic image (b) and the real one (a). The angle of the flow is visualized by hue and its magnitude by value in the HSV color model.
Fig. 21. Optical flow errors. The blue curves correspond to the low-dimensional refined body model only; the red curves correspond to our final results including non-articulated displacements. (a) The plot of average optical flow norm in a motion sequence of 2000 frames (we show four example frames below the graph). (b) Histograms of per-pixel optical flow norms for the same sequence.

exactly), and 2) our final result after adding non-articulated displacements (Fig. 14f). Fig. 21a plots the average optical flow norm for each frame over 2000 consecutive frames, including various challenging poses and fast motions. We can see that the result with non-articulated displacements is much more accurate than the low-dimensional refined body model alone. This is mainly due to flesh deformation, which is not well explained by the refined body model, especially in more extreme poses, which correspond to the spikes in the blue curve in Fig. 21a. The red curve corresponds to our final result, which exhibits consistently low optical flow errors.

We also plot the distribution of the optical flow norm for each pixel in the foreground mask in Fig. 21b. With our final animated mesh, 95% of pixels have an optical flow norm less than 1.20 pixels and 99% of pixels have an optical flow norm less than 2.46 pixels.
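These per-pixel statistics are simple to reproduce from the flow field; a minimal sketch (the array layouts for flow and mask are our assumptions):

    import numpy as np

    def flow_error_stats(flow, mask):
        """Per-pixel flow norms inside the body-mesh foreground mask.
        flow: (H, W, 2) optical flow from the synthetic render to the real image
        mask: (H, W) boolean foreground mask of the rendered body mesh
        Returns the mean norm and the 95th/99th percentiles in pixels."""
        norms = np.linalg.norm(flow, axis=-1)[mask]
        return np.mean(norms), np.percentile(norms, [95, 99])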
An obvious limitation of our method is the necessity of wearing a special motion capture suit. A suit can in principle slide over the skin, but we did not observe any significant sliding in our experiments because our suits are tightly fitting. If this became a problem in the future, we could increase adhesion with internal silicone patches as in sportswear, or even apply spirit gum or medical adhesives. The suit needs to be made in various sizes, and fit may be a challenge for obese people. The holy grail of full-body capture is to get rid of suits and instead rely only on skin features such as pores, similarly to facial performance capture. We tried imaging the bare skin, but with our current camera resolution we were unable to get sufficient detail from the skin. We could obtain more detail with narrower fields of view and more cameras to cover the capture volume, but then there are issues with depth of field and hardware budgets. Additional complications of imaging bare skin are body hair and privacy concerns; our suit certainly has its disadvantages, but it mitigates these issues. A significant advantage of our suit compared to traditional motion capture suits is that we do not need to attach any markers (reflective spheres, see Fig. 2a). Traditional motion capture markers can impede motion or even fall off, e.g., when the actor is rolling on the ground. An intriguing direction for future work would be to enhance our suit with additional sensors, in particular EMG, IMU, or pressure sensors in the feet.

In this paper we focused on the body and ignored the motion of the face and the hands. Our actors wear sunglasses because our continuous passive lights are too bright; the perceived brightness could be reduced by lights which strobe in sync with the camera shutters, but this would require significant investments in hardware. In future work, our method could be directly combined with modern methods that capture the motion of the face and the hands [Choutas et al. 2020; Joo et al. 2018; Pavlakos et al. 2019; Xiang et al. 2019]. We note that our current system captures the motion of the feet, but not the individual toes.

Our current data processing is off-line only. In the future, we believe it should be possible to create a real-time version of our system. This would require machine vision cameras tightly integrated with dedicated GPUs or tensor processors for real-time neural network inference. Each such hardware unit could emit small amounts of data: only information about the corner locations and their labels,
Fig. 22. Results in challenging poses. From left to right: 1) input images, 2) raw 3D reconstructions with holes, 3) our final meshes after interpolating non-articulated displacements, and 4) wireframe renderings of the final meshes overlaid on the original images.
avoiding the high bandwidth requirements typically associated with high-resolution video streams.

Another avenue for future work involves researching different types of fiducial markers that can be printed on the suit. In fact, we made initial experiments with printing on textile and sewing our own suits, which gives us much more flexibility than the handwritten two-letter codes discussed in this paper. We postponed this line of research due to the COVID-19 pandemic. Our pipeline for reconstructing labeled 3D points does not make any assumptions about the human body, which means that we could also apply our method to capturing the motion of clothing or even loose textiles such as a curtain.
We have presented a method for capturing more than 1000 uniquely labeled points on the surface of a moving human body. This technology was enabled by our new type of motion capture suit with checkerboard-type corners and two-letter codes enabling unique labeling of each corner. Our results were obtained with a multi-camera system built from off-the-shelf components at a fraction of the cost of a full-body 3dMD setup, while demonstrating a wider variety of motions than the DFAUST dataset [Bogo et al. 2017], including gymnastics, yoga poses, and rolling on the ground. Our method for reconstructing labeled 3D points does not rely on temporal coherence, which makes it very robust to dis-occlusions and also invites parallel processing. We provide our code and data as supplementary materials and we will release an optimized version of our code as open source.
REFERENCES
Martín Abadi et al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Google Inc. Software available from tensorflow.org.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 268–276.
Brett Allen, Brian Curless, and Zoran Popović. 2003. The space of human body shapes: reconstruction and parameterization from range scans. In ACM Transactions on Graphics (TOG), Vol. 22. ACM, 587–594.
Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. 2005. SCAPE: shape completion and animation of people. In ACM Transactions on Graphics (TOG), Vol. 24. ACM, 408–416.
Andreas Aristidou, Daniel Cohen-Or, Jessica K Hodgins, and Ariel Shamir. 2018. Self-similarity analysis for motion capture cleaning. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 297–309.
Angelos Barmpoutis. 2013. Tensor body: Real-time reconstruction of the human body and avatar synthesis from RGB-D. IEEE Transactions on Cybernetics 43, 5 (2013), 1347–1356.
Stuart Bennett and Joan Lasenby. 2014. ChESS–Quick and robust detection of chessboard features. Computer Vision and Image Understanding 118 (2014), 197–210.
Federica Bogo, Michael J Black, Matthew Loper, and Javier Romero. 2015. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In Proceedings of the IEEE International Conference on Computer Vision. 2300–2308.
Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. 2014. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3794–3801.
Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2017. Dynamic FAUST: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6233–6242.
Adnane Boukhayma, Vagia Tsiminaki, Jean-Sébastien Franco, and Edmond Boyer. 2016. Eigen appearance maps of dynamic shapes. In European Conference on Computer Vision. Springer, 230–245.
G. Bradski. 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000).
Christoph Bregler, Jitendra Malik, and Katherine Pullen. 2004. Twist based acquisition and tracking of animal and human kinematics. International Journal of Computer Vision 56, 3 (2004), 179–194.
Thomas Brox, Bodo Rosenhahn, Juergen Gall, and Daniel Cremers. 2009. Combined region and motion-based 3D tracking of rigid and articulated objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 3 (2009), 402–415.
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2018. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018).
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7291–7299.
Dan Casas, Margara Tejera, Jean-Yves Guillemaut, and Adrian Hilton. 2012. 4D parametric motion graphs for interactive animation. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. 103–110.
Ben Chen, Caihua Xiong, and Qi Zhang. 2018. CCDN: Checkerboard corner detection network for robust camera calibration. In International Conference on Intelligent Robotics and Applications. Springer, 324–334.
Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J Black. 2020. Monocular expressive body regression through body-driven attention. arXiv preprint arXiv:2008.09062 (2020).
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. 2015. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–13.
Stefano Corazza, Lars Mündermann, Emiliano Gambaretto, Giancarlo Ferrigno, and Thomas P Andriacchi. 2010. Markerless motion capture through visual hull, articulated ICP and subject specific model generation. International Journal of Computer Vision 87, 1-2 (2010), 156–169.
Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. 2008. Performance capture from sparse multi-view video. In ACM SIGGRAPH 2008 papers. 1–10.
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2018. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 224–236.
Simon Donné, Jonas De Vylder, Bart Goossens, and Wilfried Philips. 2016. MATE: Machine learning for adaptive calibration template detection. Sensors 16, 11 (2016), 1858.
Mingsong Dou, Jonathan Taylor, Henry Fuchs, Andrew Fitzgibbon, and Shahram Izadi. 2015. 3D scanning deformable objects with a single RGBD sensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 493–501.
Mark Fiala. 2005. ARTag, a fiducial marker system using digital techniques. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 590–596.
Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Manuel Jesús Marín-Jiménez. 2014. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47, 6 (2014), 2280–2292.
D Gavrila and LS Davis. 1996. Tracking of humans in action: A 3-D model-based approach. In ARPA Image Understanding Workshop (Palm Springs). 737–746.
Stevie Giovanni, Yeun Chul Choi, Jay Huang, Eng Tat Khoo, and KangKang Yin. 2012. Virtual try-on using Kinect and HD camera. In International Conference on Motion in Games. Springer, 55–65.
Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7297–7306.
Kaiwen Guo, Peter Lincoln, Philip Davidson, Jay Busch, Xueming Yu, Matt Whalen, Geoff Harvey, Sergio Orts-Escolano, Rohit Pandey, Jason Dourgarian, et al. 2019. The Relightables: Volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–19.
Shangchen Han, Beibei Liu, Robert Wang, Yuting Ye, Christopher D Twigg, and Kenrick Kin. 2018. Online optical marker-based hand tracking with deep labels. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–10.
Richard I Hartley and Peter Sturm. 1997. Triangulation. Computer Vision and Image Understanding 68, 2 (1997), 146–157.
Gines Hidalgo, Yaadhav Raaj, Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2019. Single-Network Whole-Body Pose Estimation. arXiv preprint arXiv:1909.13423 (2019).
David A Hirshberg, Matthew Loper, Eric Rachlin, and Michael J Black. 2012. Coregistration: Simultaneous alignment and modeling of articulated 3D shape. In European Conference on Computer Vision. Springer, 242–255.
Daniel Holden. 2018. Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–12.
Danying Hu, Daniel DeTone, and Tomasz Malisiewicz. 2019. Deep ChArUco: Dark ChArUco marker pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8436–8444.
Peng Huang, Chris Budd, and Adrian Hilton. 2011. Global temporal registration of multiple non-rigid surface sequences. In CVPR 2011. IEEE, 3473–3480.
Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2462–2470.
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep features for text spotting. In European Conference on Computer Vision. Springer, 512–528.
Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2018. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8320–8329.
Roland Kehl and Luc Van Gool. 2006. Markerless tracking of complex human motions from multiple views. Computer Vision and Image Understanding.
ACM Transactions on Graphics (TOG) 28, 5 (2009), 1–10.
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
Yebin Liu, Juergen Gall, Carsten Stoll, Qionghai Dai, Hans-Peter Seidel, and Christian Theobalt. 2013. Markerless motion capture of multiple characters using multiview image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2720–2735.
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–13.
Shangbang Long, Xin He, and Cong Yao. 2020. Scene text detection and recognition: The deep learning era. International Journal of Computer Vision (2020), 1–24.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2015. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34, 6 (2015), 248.
David G Lowe. 1999. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2. IEEE, 1150–1157.
Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 20, 11 (2018), 3111–3122.
Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael Black. 2020. Learning to Dress 3D People in Generative Clothing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
Nadia Magnenat-Thalmann, Richard Laperrière, and Daniel Thalmann. 1988. Joint-dependent local deformations for hand animation and object grasping. In Proceedings on Graphics Interface '88. Citeseer.
Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. 2019. AMASS: Archive of Motion Capture as Surface Shapes. In International Conference on Computer Vision. 5442–5451.
Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) 36, 4 (2017), 44.
Abhimitra Meka, Rohit Pandey, Christian Haene, Sergio Orts-Escolano, Peter Barnum, Philip Davidson, Daniel Erickson, Yinda Zhang, Jonathan Taylor, Sofien Bouaziz, Chloe Legendre, Wan-Chun Ma, Ryan Overbeck, Thabo Beeler, Paul Debevec, Shahram Izadi, Christian Theobalt, Christoph Rhemann, and Sean Fanello. 2020. Deep Relightable Textures - Volumetric Performance Capture with Neural Rendering. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia) 39, 6. https://doi.org/10.1145/3414685.3417814
Alberto Menache. 2000. Understanding Motion Capture for Computer Animation and Video Games. Morgan Kaufmann.
Richard A Newcombe, Dieter Fox, and Steven M Seitz. 2015. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 343–352.
Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
Edwin Olson. 2011. AprilTag: A robust and flexible visual fiducial system. In IEEE International Conference on Robotics and Automation. IEEE, 3400–3407.
Ahmed A A Osman, Timo Bolkart, and Michael J. Black. 2020. STAR: A Sparse Trained Articulated Human Body Regressor. In European Conference on Computer Vision (ECCV). https://star.is.tue.mpg.de
Sang Il Park and Jessica K Hodgins. 2006. Capturing and animating skin deformation in human motion. ACM Transactions on Graphics (TOG) 25, 3 (2006), 881–889.
Sang Il Park and Jessica K Hodgins. 2008. Data-driven modeling of skin and muscle deformation. In ACM SIGGRAPH 2008 papers. 1–6.
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. 2019. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10975–10985.
Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. 2016. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4929–4937.
Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J Black. 2015. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG) 34, 4 (2015), 120.
Fabián Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. 2016. Motion graphs for unstructured textured meshes. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–14.
Yaadhav Raaj, Haroon Idrees, Gines Hidalgo, and Yaser Sheikh. 2019. Efficient Online Multi-Person 2D Pose Tracking with Recurrent Spatio-Temporal Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4620–4628.
Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. 2020. Accelerating 3D Deep Learning with PyTorch3D. arXiv:2007.08501 (2020).
Joseph Redmon and Ali Farhadi. 2017. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7263–7271.
Kathleen M Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, and Scott Fleming. 2002. Civilian American and European Surface Anthropometry Resource (CAESAR), Final Report, Volume 1: Summary. Technical Report. Sytronics Inc., Dayton, OH.
Edward Rosten and Tom Drummond. 2006. Machine learning for high-speed corner detection. In European Conference on Computer Vision. Springer, 430–443.
Min-Ho Song and Rolf Inge Godøy. 2016. How fast is your body motion? Determining a sufficient frame rate for an optical motion tracking system using passive markers. PLoS ONE 11, 3 (2016), e0150993.
Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. 2011. Fast articulated motion tracking using a sums of Gaussians body model. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, 951–958.
Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. 1999. Bundle adjustment—a modern synthesis. In International Workshop on Vision Algorithms. Springer, 298–372.
Tony Tung and Takashi Matsuyama. 2010. Dynamic surface matching by geodesic mapping for 3D animation transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1402–1409.
Graham Upton and Ian Cook. 1996. Understanding Statistics. Oxford University Press.
Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. 2008. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics (TOG), Vol. 27. ACM, 97.
John Wang and Edwin Olson. 2016. AprilTag 2: Efficient and robust fiducial detection. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4193–4198.
Robert Y Wang and Jovan Popović. 2009. Real-time hand-tracking with a color glove. ACM Transactions on Graphics (TOG) 28, 3 (2009), 1–8.
Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4724–4732.
Ryan White, Keenan Crane, and David A Forsyth. 2007. Capturing and animating occluded cloth. ACM Transactions on Graphics (TOG) 26, 3 (2007), 34.
Simon N Wood. 2003. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 1 (2003), 95–114.
Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. 2019. Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10965–10974.
Yuanlu Xu, Song-Chun Zhu, and Tony Tung. 2019. DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision. 7760–7770.
Zhengyou Zhang. 2000. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 11 (2000), 1330–1334.
Huiyu Zhou and Huosheng Hu. 2008. Human motion tracking for rehabilitation—A survey. Biomedical Signal Processing and Control 3, 1 (2008), 1–18.