A Low-Cost, Flexible and Portable Volumetric Capturing System
Vladimiros Sterzentsenko, Antonis Karakottas, Alexandros Papachristou, Nikolaos Zioulis, Alexandros Doumanoglou, Dimitrios Zarpalas, Petros Daras
Information Technologies Institute, Centre for Research and Technology - Hellas
Abstract - Multi-view capture systems are complex systems to engineer. They require technical knowledge to install and complex processes to set up. However, with the ongoing developments in new production methods, we are now in a position to generate high quality, realistic 3D assets. Nonetheless, the capturing systems developed with these methods are intertwined with them, relying on custom solutions, and are seldom - if at all - publicly available. We design, develop and publicly offer a multi-view capture system based on the latest RGB-D sensor technology. We also develop a portable and easy-to-use external calibration process to allow for its widespread use.
Keywords - Multi-view system, 3D Capture, RGB-D, Registration, Multi-sensor calibration, VR, AR, Intel RealSense
I. INTRODUCTION
The ongoing developments related to Virtual Reality (VR) and Augmented Reality (AR) technologies, and more importantly the availability of new presentation devices - head mounted displays (HMDs) - are also increasing the demand for new types of immersive media. Superseding traditional video, three-dimensional (3D) media are suited for both VR and AR and have been termed "Free Viewpoint Video (FVV)" [1], "Volumetric Video", "Holograms" [2] and/or "3D/4D media" (terms that are used interchangeably in this document). They offer the ability to select any viewpoint from which to watch the content, essentially allowing for unrestricted viewing, therefore greatly increasing the feeling of immersion.

Besides the expensive and laborious production of 3D media by artists using 3D modeling and animation software, there are various ways to 3D capture the real world and digitize it. Like typical video, 3D media can be consumed either in a live [2], [3] or in an on-demand manner [1], [4], with state-of-the-art systems allowing for deformations and topology changes. Offline systems typically use a pre-defined template that is fit to the data [5] or otherwise utilize lengthy reconstruction processes [1]. Consequently, 3D media production can either be real-time or post-processed. Either way, the backbone of realistic 3D content production is a multi-view capture system. Such systems are complex to develop due to the large number of choices associated with their design. This system complexity also translates to increased costs, specialized hardware (HW) requirements and technically demanding processes.

Initially, the multi-view capture system needs to be set up, a process that can vary greatly depending on the choice of the type and number of cameras/sensors. Using stereo pairs for the extraction of depth in a live setting requires extra processing power to be allocated to the disparity estimation task for each viewpoint (i.e. stereo pair) [2]. An offline system that operates on a template model fitting basis using the extracted silhouettes [6], [7] requires a larger number of cameras whose live feeds need to be recorded, thereby necessitating the use of large storage. The most suitable topology and architecture depend on the targeted use case. In the former case, besides setting up the stereo pairs, each one needs to be connected to a computer, with the processing offloaded to another workstation. In the latter case, depending on the frame rate, resolution, encoding performance and disk writing throughput, the setup of a multi-disk server or a distributed local storage topology is required.

Following the installation of the multi-view capture system, a number of preparatory steps are needed before its actual use. These potentially involve spatial (external and internal calibration) and temporal (synchronization) alignment of the sensors. These processes can introduce new HW requirements (e.g. external signal triggers for synchronization [1], [2], [6], or external optical tracking systems for calibration [8]) and are usually accomplished via complex procedures (e.g. moving a checkerboard [2] or intricate registration structures [1]).

Overall, as a combination of design decisions and operational complexity, most existing multi-view capture systems are hindered by high HW costs, stationarity due to being hard to relocate after installation, or considerable technical requirements, forbidding adaptability and non-expert use. Our goal in this work is to design and deliver a flexible and up-to-date consumer-level multi-view capture system to support affordable content creation for AR and VR. Our design is oriented towards improving cost expenditure, portability, re-use and ease-of-use. In summary, our contributions are the following:
• A publicly available volumetric capture system utilizing recent sensor technology, offered online at https://github.com/VCL3D/VolumetricCapture.
• The design of a low-cost, portable and flexible multi-view capture system.
• A quick, robust, user-friendly and affordable multi-sensor calibration process.

II. RELATED WORK
Multi-view capturing systems have mostly been developed for eventually producing three-dimensional content and are highly complex systems to design [9]. They typically require numerous sensors that need to be positioned, synchronized and calibrated, and functionally they need to support either, or both, live acquisition and recording. They capture full 3D, by extracting the geometrical information of the captured scene, or pseudo-3D, by estimating the scene's depth and offering limited free viewpoint selection. Two of the pioneering works in these directions are [10] and [11] respectively. The first one used a large number of cameras placed in a dome to surround the captured area and extracted complete geometric information, while the second one placed the cameras in front of the users and estimated the captured scene's depth.

A state-of-the-art multi-view capturing dome has recently been presented in [12] that comprises 480 VGA cameras, 31 HD cameras and 10 Microsoft Kinect 2.0 sensors. Its primary design goal was the social capture of multiple people. The system is calibrated using structure from motion and bundle adjustment, using a white tent with a pattern projected on it. While being a very impressive system to engineer, it is nonetheless a very rigid, complex and expensive one. A more recent work for frontal-facing multi-view capture [13] showcased 32 cameras placed in an arc configuration, which was calibrated by matching features found on the floor without the use of any pattern. Similarly, a system of 18 cameras in an array configuration that also leveraged the power of GPUs for real-time 3D reconstruction was presented in [14]. However, its calibration was accomplished by using Tsai's checkerboard method [15], a complex and cumbersome process which requires technical knowledge by the operator.

For full 3D capture, model-based performance capture methods [4], [6], [7] allowed for the reduction of the number of sensors, compared to the aforementioned dome placement approaches, by employing 8 cameras perimetrically pointing inwards. As depth sensors' quality started improving, their deployment in multi-view systems quickly followed as a way to address the issues of camera-based capturing systems, namely low 3D reconstruction quality and green screen requirements. However, preliminary attempts were still calibrated using the inefficient checkerboard process [16], limiting their flexibility.

As commercial grade depth sensors, and more importantly, integrated color and depth (RGB-D) sensors started becoming available, a surge of renewed interest in 3D real-time or 4D post-processed content production quickly followed. Nonetheless, preliminary systems using multiple Kinect sensors either for 3D reconstruction [17] or marker-less motion capture [18] still used checkerboard based calibration approaches, with custom materials even required for the latter one. However, in [18], an initial attempt at taking a step beyond the typical calibration process was made by offering an alternative calibration process using a moving point light source. At the same time, structure-based calibration systems started surfacing, typically using markers to either directly estimate each camera's pose with respect to the structure [19], [20] or as initial estimates to be densely refined [21]. However, even the state-of-the-art real-time 3D capturing system of [2], using 16 near-infrared cameras, 8 color cameras and 8 structured light projectors, still relies on the checkerboard method of [22] for calibration.
Similarly, the high quality 4D production system of [1], which consists of 106 cameras, relies on an octagonal tower structure for its calibration, albeit one that is very complex and hard to assemble and relocate.

As a result, there have been various works aiming to make the overall calibration process easier. The work of [8] utilized an expensive external optical tracking system to calibrate the multi-view system's captured volume area, using a checkerboard to further improve the accuracy of the solution and achieve an easier and more robust workflow. In [23] and [24], the authors utilize a colored ball that is moved within the capturing area to establish correspondences and calibrate the multi-sensor systems. Additionally, in [23], their method simultaneously synchronizes the sensors in addition to calibrating them. While HW synchronization is optimal, some sensors do not support it, necessitating the use of software (SW) based synchronization approaches. More recently, [25] presented a marker-less structure-based multi-sensor calibration using a CNN trained with synthetic structure renders. However, training was limited to specific angle intervals around the structure. Nonetheless, the presented multi-sensor calibration process was made significantly easier.

Overall, we find that most systems require complex calibration processes that need heavy human operation - usually with technical knowledge. This renders them hard to (re-)use for commercial purposes, due to heavy customization in materials and configurations, also limiting their portability. In addition, most - if not all - systems' implementations are not publicly available, with some being notoriously hard to assemble and/or develop. Our goal is to design, develop and publicly offer an easy to set up multi-view capturing system with low-cost components, and to minimize the technical requirements as well as the process complexity of operating it.

III. VOLUMETRIC CAPTURE
Our volumetric capture system is designed to orchestrate the capturing, streaming and recording of the data acquired from a multi-sensor infrastructure. While in principle it can be used for moving sensors too, our focus is oriented towards static inwards placement for capturing human performances within a predefined space. Our design choices strive to reach an optimal balance among affordability, modularity, portability, scalability and usability.
Figure 1: Capturing System Overview and Architecture. (a) Our basic system setup utilizes N = 4 acquisition modules (eyes) and a central orchestrator workstation. The orchestrator communicates with the eyes through LAN. (b) The acquisition module is composed of an Intel RealSense D415 sensor mounted on a tripod, connected to an Intel NUC processing unit, also mounted on the same tripod. (c) Example volumetric capturing station setup with the sensors looking inwards and capturing a 360° view of the subject.

Sensor: We employ the most recent version of the Intel RealSense technology [26], a consumer-grade RGB-D sensor which allows us to reap the advantages of integrated depth sensing. This reduces the complexity of our system, as we can deploy a single integrated RGB-D sensor instead of 4 (2 gray-scale sensors for stereo computation, 1 for color acquisition and 1 projector to improve depth estimation in uniformly colored regions) as in [2]. In addition, compared to approaches surrounding the captured area with monocular sensors [4], [5], [7], we can deploy fewer sensors due to the availability of depth information. More specifically, we use the D415 sensor (https://software.intel.com/en-us/realsense/d400), which, compared to its sibling, the D435, offers better quality at closer distances due to a denser projection pattern, and also supports HW synchronization between its color and depth sensors. In addition, this type of sensor offers inter-sensor HW synchronization. On the contrary, using Microsoft Kinects would require a soft synchronization solution, typically SW-based, like the audio synchronization of [3], adding yet another process when setting up the system. Further, the D415 sensors allow for setting up each sensor as a master or slave, and as a result, the requirement and added complexity and cost of using, and having to set up, external HW triggers is lifted.
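As an illustration of this master/slave capability, the following minimal Python sketch (assuming the pyrealsense2 bindings, and that on D400-series firmware an inter_cam_sync_mode value of 1 denotes master and 2 denotes slave) designates the first detected sensor as master and the rest as slaves:

```python
import pyrealsense2 as rs

MASTER, SLAVE = 1, 2  # assumed D400-series inter-camera sync mode values

ctx = rs.context()
for i, dev in enumerate(ctx.query_devices()):
    depth_sensor = dev.first_depth_sensor()
    # Only set the option where the firmware exposes it.
    if depth_sensor.supports(rs.option.inter_cam_sync_mode):
        role = MASTER if i == 0 else SLAVE
        depth_sensor.set_option(rs.option.inter_cam_sync_mode, role)
```

With the sensors wired on a shared sync cable, the slaves latch onto the master's trigger, which is what removes the need for an external HW trigger source.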
Architecture: Our building block is an acquisition module, called an eye, which represents a viewpoint positioned globally in relation to the capturing volume and serves an RGB-D data stream. We connect N eyes in a distributed fashion to work towards a common goal, providing fused colored point clouds or otherwise registered multi-view RGB-D streams. These are delivered to a client that is also the orchestrator, controlling the behavior and parameterization of the eye server units through message passing. Control messages as well as data streams are transferred by a broker using a publish-subscribe event-based architecture, with the system's data flow depicted in Figure 2. All these aforementioned components comprise a single, coherent Volumetric Capture system.
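To make the publish-subscribe flow concrete, here is a minimal Python sketch of an eye unit; the broker host, exchange and routing keys are hypothetical illustrations, not the actual identifiers of the system (which is implemented in C++, as detailed below):

```python
import pika  # RabbitMQ client, matching the broker choice described later

# Hypothetical endpoint and topology names, for illustration only.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="orchestrator.local"))
channel = connection.channel()
channel.exchange_declare(exchange="volcap", exchange_type="topic")

def publish_frame(eye_id: str, payload: bytes) -> None:
    """An eye publishes a serialized RGB-D frame under its own routing key."""
    channel.basic_publish(exchange="volcap",
                          routing_key=f"eye.{eye_id}.frames",
                          body=payload)

def on_control(ch, method, properties, body) -> None:
    """React to orchestrator control messages (start/stop, parameterization)."""
    print("control message:", body)

# Each eye subscribes to the orchestrator's control topic.
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="volcap", queue=queue, routing_key="control.#")
channel.basic_consume(queue=queue, on_message_callback=on_control, auto_ack=True)
channel.start_consuming()
```

The exchange decouples producers from consumers: the orchestrator can subscribe to eye.*.frames without the eyes needing to know where it runs.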
Hardware:
The physical interpretation of our eye acquisition module is illustrated in Figure 1b. A D415 sensor is mounted on a tripod and connected to an Intel NUC mini-PC, which is in turn mounted on a tripod VESA mount. These, and the orchestrator, are connected via Ethernet cables to a LAN switch, as seen in Figure 1a. The switch's bandwidth depends on the number of sensors and their streams' resolution and frame rates, but for typical 360° capture use, at least Gbps-class bandwidth is required. Another important specification is that it needs to be non-blocking, i.e. able to handle all of its ports' bandwidths at full capacity simultaneously. This is essential when using HW synchronization, as network traffic comes in bursts that would otherwise manifest as extra latency. Furthermore, through the use of mini-PCs, we distribute processing at a negligible effect on the system's portability. This way, we move the computational burden of compression and pre-processing onto the acquisition modules, allowing for more efficient recording and overall reduced computational complexity on the receiving client.

Figure 2: Volumetric Capture data flow. Multiple (N) acquisition modules (eyes) capture the scene's color and depth information. The acquired data are first compressed, serialized and published to the message broker over the network. The orchestrator client then deserializes and decompresses the received messages to visualize and/or store them.

An alternative to our distributed design would be to connect all sensors to a single workstation, which could arguably slightly increase the system's portability. However, this design choice requires the installation of additional USB 3.0 controllers, as each sensor consumes high bandwidth to stream data at higher resolutions and frame rates. Because of this, and since the cables of the D415s are very short, high quality USB 3.0 extension cables would be required. Depending on the distance and the data rate, optical repeaters might be needed, which greatly increase the cost, bringing it on par with our HW choices. Further, scalability would be limited to the USB 3.0 extension slots that a high-end motherboard can support and its input-output bandwidth.
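A back-of-the-envelope link budget shows why burst-tolerant, Gbps-class networking matters; the stream settings and compression ratio below are assumptions for illustration, not the system's fixed configuration:

```python
# Assumed per-eye streams: 1280x720 16-bit depth and 1280x720 24-bit color at 30 fps.
W, H, FPS = 1280, 720, 30
depth_bps = W * H * 2 * 8 * FPS   # raw depth bits per second
color_bps = W * H * 3 * 8 * FPS   # raw color bits per second
raw_mbps = (depth_bps + color_bps) / 1e6          # ~1106 Mbps per eye, uncompressed

# Even at an assumed 5:1 overall compression, four eyes emit ~885 Mbps,
# and HW-synchronized capture concentrates it into simultaneous bursts.
total_mbps = 4 * raw_mbps / 5
print(f"per eye raw: {raw_mbps:.0f} Mbps, four eyes compressed: {total_mbps:.0f} Mbps")
```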
Implementation Details: Our system's main components, the client (orchestrator) and server (eye), are natively implemented in C++. Since we utilize headless clients (mini-PCs), an automated way of discovering the acquisition modules is required. To that end, we deploy a service to each mini-PC, developed in C#, that manages the eye component process. For our message broker, we use RabbitMQ, which can be co-located with the orchestrator component. We use lossy compression for the color streams and lossless compression for the depth streams. Compression method choices aim at minimizing acquisition latency to enable use in real-time 3D reconstruction scenarios. To that end, we use intra-frame JPEG compression for the color streams and entropy-based compression for the depth streams. For the former, an SIMD optimized version [27] is used, while for the latter, a variety of algorithms are used under a blocking optimization technique [28]. This allows for more explicit control of the overall bandwidth that each eye unit produces, as the depth stream mostly dominates the encoding performance and the resulting compressed frame sizes.
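The encoding split can be sketched in Python as follows; cv2's JPEG encoder stands in for the libjpeg-turbo path [27], the Blosc codec choice is illustrative, and the helper names are ours:

```python
import blosc  # Python bindings for c-blosc, the blocked entropy compressor [28]
import cv2    # JPEG here stands in for the SIMD-optimized libjpeg-turbo [27]
import numpy as np

def encode_frame(color: np.ndarray, depth: np.ndarray) -> tuple[bytes, bytes]:
    """Lossy color / lossless depth, mirroring the choices described above."""
    ok, color_jpeg = cv2.imencode(".jpg", color, [int(cv2.IMWRITE_JPEG_QUALITY), 90])
    assert ok, "JPEG encoding failed"
    # typesize=2 exposes the 16-bit depth layout to Blosc's shuffle filter.
    depth_blob = blosc.compress(depth.tobytes(), typesize=2, cname="zstd")
    return color_jpeg.tobytes(), depth_blob

def decode_depth(blob: bytes, shape: tuple[int, int]) -> np.ndarray:
    """Lossless round-trip of the depth stream on the orchestrator side."""
    return np.frombuffer(blosc.decompress(blob), dtype=np.uint16).reshape(shape)
```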
Figure 3: 3D capture snapshot acquired from the Volumetric Capture application, showcasing the calibrated output when capturing a human subject. Each viewpoint's pose is also depicted via the camera frustum placements.
IV. PRACTICAL CALIBRATION
The cornerstone of multi-view systems is the spatial alignment, or otherwise external calibration, of the sensors with respect to a global coordinate system, as seen in Figure 5. Typical checkerboard calibration processes require heavy human intervention as well as technical knowledge to avoid ambiguous or error-prone checkerboard poses. In order to make this process more convenient and usable by non-technical personnel, we opt for a structure-based calibration that only requires users to assemble and place the structure within the capturing volume. While previous such approaches placed markers or patterns on the structures [1], [3], [29], we extend and improve the marker-less calibration of [25].
Structure:
Similar to [3] and [25], we use a structure assembled out of commercially available packaging boxes whose dimensions are standardized. This allows us to create a virtual replica of the calibration structure in the form of a 3D model. In practice, we use 4 boxes and deviate from the structure assembly of previous approaches so as to create a fully asymmetric structure that, at the same time, has no fully planar views. This way, we naturally resolve any difficulties in identifying each of the structure's sides and further guarantee that the extracted correspondences will not produce ill-formed or ambiguous solutions when used to estimate the sensor's pose. The updated structure can be seen in Figure 4, which also showcases the changes compared to the structure of [25].
Figure 4: Update of the calibration structure. In (a), the old calibration structure is presented, with its planar side shown with a green overlay. In (b), the updated calibration structure is presented, which no longer has any coplanar side. Each side segment of the calibration structure can be seen in (c).
Training Data:
Our goal is to use the structure's prior knowledge to establish correspondences between each sensor's viewpoint and the global coordinate system that the structure defines. Since we aim to use no markers, and therefore no color information, this is accomplished by training a CNN to identify these correspondences. The virtual 3D model can then be used to generate training pairs on-the-fly. By placing a virtual camera at a relative position around the 3D model, which defines the center of the coordinate space, we can render it and generate a depth map D(p) ∈ ℝ out of the resulting z-buffer, where p = (u, v) ∈ Ω ⊂ ℕ² represents pixel coordinates in the image domain Ω: u ∈ [1, …, W], v ∈ [1, …, H], with W and H its width and height respectively. Given also a material of the model, we can additionally output a texture map L(p) acquired from the resulting render buffer. By assigning a different material (i.e. color) to each of the four boxes' sides (24 distinct sides in total), these images then correspond to a depth map - semantic segmentation supervision pair {D(p), L(p)}. Our rendered data are generated at a resolution corresponding to a downscaled depth map of a D415 sensor. We also add noise to the resulting depth maps and augment them with random backgrounds, as in [25]; the result is later denoted as D̃(p). However, our approach differs in various ways that will thereafter be explained.
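A minimal numpy sketch of this perturbation step; the noise model and background range are illustrative stand-ins for the augmentation of [25]:

```python
import numpy as np

def augment_depth(render: np.ndarray, sigma: float = 0.01, max_depth: float = 5.0) -> np.ndarray:
    """Turn a clean rendered depth map D(p) into a noisy training sample ~D(p).

    Zeros in the render mark pixels that miss the structure; those receive
    random background depths instead of perturbed structure depths.
    """
    noisy = render + np.random.normal(0.0, sigma, render.shape)
    background = np.random.uniform(0.5, max_depth, render.shape)
    return np.where(render > 0.0, noisy, background).astype(np.float32)
```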
Pose Sampling: We sample poses using cylindrical coordinates t_c = (ρ, φ, z) defined on the virtual structure's coordinate system. These are then transformed to sensor poses as follows: i) we extract a Cartesian 3D position t from each t_c; ii) we estimate a rotation matrix R by estimating the view matrix from t to the origin (0, 0, 0) of the coordinate system - which is at the center of the virtual structure model - using the y axis as the up vector; iii) we augment the rotation R by adding rotational noise via the composition of random rotations R_i, i ∈ {x, y, z}, around each axis. Similar to [25], we sample these variables from uniform distributions U(a, b, c) in intervals [a, b] at steps c:

$$\phi_n \sim U\!\left(n\tfrac{360^\circ}{N} - 10^\circ,\; n\tfrac{360^\circ}{N} + 10^\circ,\; \delta_\phi\right), \quad z \sim U(z_a, z_b, \delta_z), \quad \rho \sim U(\rho_a, \rho_b, \delta_\rho), \quad e_{\{x,y,z\}} \sim U(-e_m, e_m, \delta_e), \tag{1}$$

with n ∈ [1, …, N], the height z and radius ρ bounds expressed in meters, the angular bounds in degrees, and e being an Euler angle around the axes {x, y, z} that is transformed into a rotation matrix R_{x,y,z}. An illustration of this sampling is available in Figure 5. Using this sampling, we try to cover a wide range of placements of each n-th sensor in a variety of capturing scenarios of N sensors, while also modeling realistic, imperfect approximate positioning. Contrary to [25], we sample across the whole circle around the structure, but enforce that groups of N sensors will be placed approximately at the appropriate φ angle intervals.
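The sampling can be sketched in numpy as follows; the (ρ, z, e) intervals used here are illustrative placeholders for the values of Eq. (1):

```python
import numpy as np

def axis_rotation(axis: np.ndarray, angle: float) -> np.ndarray:
    """Rodrigues' formula: rotation by `angle` (radians) around a unit `axis`."""
    x, y, z = axis
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def sample_pose(n: int, N: int = 4) -> tuple[np.ndarray, np.ndarray]:
    """Draw one pose (R, t) for the n-th of N sensors around the structure."""
    phi = np.deg2rad(n * 360.0 / N + np.random.uniform(-10.0, 10.0))
    rho = np.random.uniform(1.0, 3.0)   # assumed radius interval (m)
    z = np.random.uniform(0.5, 2.0)     # assumed height interval (m)
    t = np.array([rho * np.cos(phi), z, rho * np.sin(phi)])  # cylindrical -> Cartesian

    forward = -t / np.linalg.norm(t)    # view direction towards the origin
    right = np.cross([0.0, 1.0, 0.0], forward)  # y axis as the up vector
    right /= np.linalg.norm(right)
    up = np.cross(forward, right)
    R = np.stack([right, up, forward], axis=1)

    for axis in np.eye(3):              # compose rotational noise R_x, R_y, R_z
        R = axis_rotation(axis, np.deg2rad(np.random.uniform(-5.0, 5.0))) @ R
    return R, t
```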
Network: Instead of training a CNN to predict dense labels identifying each specific box side on a per-depth-map basis, we exploit the complementarity of the N viewpoints and train our CNN to receive all viewpoints jointly as input. As the goal is to achieve 360° coverage around a capturing volume, each viewpoint's depth view is related to the other viewpoints. Given that the viewpoints will be evenly placed around the structure, each viewpoint's depth map is a complementary input to the rest, as it will restrict their predictions. Consequently, we design a CNN that receives N depth map inputs D̃(p) and fuses their information to extract this relative and complementary information. As we cannot fix the sequence of the inputs, because doing so requires knowledge of the spatial relations (which is our final objective), we randomize the order of the inputs the CNN receives during training.
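As a rough PyTorch stand-in for this joint multi-view design (the actual model is a Caffe network whose exact depth and widths follow Figure 6; the branch sizes here are simplified):

```python
import torch
import torch.nn as nn

class MultiViewCalibNet(nn.Module):
    """N encoder branches, a bottleneck fusing concatenated features, and
    per-view two-headed decoders (25 side/background labels + 3 normal channels)."""
    def __init__(self, views: int = 4, feat: int = 32, labels: int = 25):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Sequential(
            nn.Conv2d(1, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
            for _ in range(views)])
        self.bottleneck = nn.Sequential(
            nn.Conv2d(views * feat, views * feat, 3, padding=1), nn.ReLU())
        def decoder(out_ch: int) -> nn.Module:
            return nn.Sequential(
                nn.ConvTranspose2d(views * feat, feat, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(feat, out_ch, 4, stride=2, padding=1))
        self.label_heads = nn.ModuleList([decoder(labels) for _ in range(views)])
        self.normal_heads = nn.ModuleList([decoder(3) for _ in range(views)])

    def forward(self, depths):  # depths: list of N tensors, each B x 1 x H x W
        fused = self.bottleneck(
            torch.cat([enc(d) for enc, d in zip(self.encoders, depths)], dim=1))
        return ([head(fused) for head in self.label_heads],
                [head(fused) for head in self.normal_heads])
```

During training, the list of input depth maps would be shuffled per sample, matching the input order randomization described above.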
Figure 5: Pose generation sampling parameters illustration. Each sample pose is randomly generated at a (ρ, φ, z) cylindrical coordinate defined around the calibration structure's origin. It is further randomly rotated around the (x, y, z) axes by respective Euler angles (e_x, e_y, e_z).

Multi-task Learning: Our task is to label each one of the structure's box sides, all of which are planar surfaces. Taking into account that a planar surface's orientation is defined by its normal, it is apparent that the observed scene's normal information is complementary to our box side labeling task. We exploit this complementarity by designing our network for multi-task learning, to take advantage of this inter-task relationship. During each render, we also output a normal map N(p) in another render buffer, which is used to supervise the CNN's secondary task: normal estimation from depth maps. Summarizing, our CNN, with its architecture presented in Figure 6, receives as input multiple - randomly perturbed - depth maps D̃_i(p), i ∈ [1, …, N], observing the same structure from different viewpoints, as well as their ground truth label maps L_i, and jointly estimates the semantic label probability distributions L̂_i(p) of the structure's box sides, as well as their normal maps N̂_i(p), for each input depth map, while fusing their multi-view information. We use a cross-entropy loss for the labels and an L2 loss for the normals, minimizing the overall loss:

$$E_{overall} = \sum_{i=1}^{N} E_i, \qquad E_i = E_{normal} + \lambda E_{semantic}, \tag{2}$$

over all dataset samples, with:

$$E_{normal} = \frac{1}{M} \sum_{p \in \Omega} \left\| \hat{N}(p) - N(p) \right\|_2, \tag{3}$$

$$E_{semantic} = -\frac{1}{M} \sum_{p \in \Omega} Pr(L(p)) \log\!\left( smax(\hat{L}(p)) \right), \tag{4}$$

where λ is a weight factor balancing the contributions of the regression (E_normal) and classification (E_semantic) losses, M = W × H equals the total number of pixels, Pr is a function that extracts the ground truth probability distribution from the rendered texture map L for each pixel p, and smax is the softmax function evaluated at each corresponding pixel p. We do not enforce normalized predictions, as it has been observed that the L2 loss alone suffices to produce normalized values [30].
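A compact PyTorch rendering of Equations (2)-(4); the λ default is a placeholder, since the paper's balancing value did not survive extraction:

```python
import torch
import torch.nn.functional as F

def per_view_loss(pred_normals: torch.Tensor, gt_normals: torch.Tensor,
                  pred_logits: torch.Tensor, gt_labels: torch.Tensor,
                  lam: float = 0.5) -> torch.Tensor:
    """E_i = E_normal + lambda * E_semantic.

    pred_normals, gt_normals: B x 3 x H x W; pred_logits: B x 25 x H x W;
    gt_labels: B x H x W integer side labels (Pr reduced to one-hot targets).
    """
    # Eq. (3): mean per-pixel L2 norm of the normal residual.
    e_normal = torch.linalg.vector_norm(pred_normals - gt_normals, dim=1).mean()
    # Eq. (4): softmax cross-entropy, averaged over the M pixels.
    e_semantic = F.cross_entropy(pred_logits, gt_labels)
    return e_normal + lam * e_semantic

def overall_loss(per_view_terms: list[torch.Tensor]) -> torch.Tensor:
    """Eq. (2): sum the per-view losses over the N viewpoints."""
    return torch.stack(per_view_terms).sum()
```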
Refinement: Finally, we refine the dense label predictions of the CNN using a dense fully-connected Conditional Random Field (CRF) model [31], formulated over the predicted label distributions L̂ and normal maps N̂.

Figure 6: Our CNN architecture comprises 4 input encoding branches, one for each view, a bottleneck, and 4 output decoding branches. The input branches' features are fused through concatenation and fed into the bottleneck. The four output branches decode the bottleneck into two separate predictions for each branch, densely estimating a normal and a label map for each branch. Each input branch comprises three blocks having two convolution (conv) layers each, with the second downscaling its output features (stride equal to 2). The bottleneck contains three blocks. The first comprises four convs, with the last one downscaling its output, while the second block comprises two convs. The bottleneck's third block updates its input features by using a deconvolution (deconv) layer (stride equal to 2). Each output branch contains four blocks of layers. The first block utilizes one conv, followed by two deconvs. The last deconv's output branches out to feed the two internal branches of the output branch. Both of these are composed of two deconvs, with the only difference being the number of predicted features. The segmentation branch's prediction layer classifies 25 features, while the normal prediction branch produces 3 output features. All (de-)convolutional layers use the same kernel size.

The per-pixel p energy function is:

$$E_{CRF}(p) = \sum_i \psi_{unary}(p_i) + \sum_i \sum_{j \neq i} \psi_{pairwise}(p_i, p_j). \tag{5}$$

The unary potential ψ_unary(p) is the densely predicted distribution over the label space, describing the cost of a pixel taking the corresponding label, as estimated by the CNN's output label probability distribution map L̂. The pairwise potentials are Gaussian edge potentials, describing the cost of variables i and j taking their corresponding labels respectively. For the pairwise term, we use a feature formulation f similar to [31], including the positions in the image space, but instead of the values in the image (RGB) domain, we use the values of the predicted normal map N̂. Therefore, the appearance kernel of [31] becomes:

$$k(f_i, f_j) = \exp\!\left( -\frac{|p_i - p_j|^2}{2\sigma_p^2} - \frac{|\hat{N}(p_i) - \hat{N}(p_j)|^2}{2\sigma_N^2} \right), \tag{6}$$

where σ_p and σ_N are the ranges over which the spatial and normal kernels operate, respectively. This is based on the same intuition: since we are labeling planar surfaces, the estimated normals define the labels, and therefore their edges and similarities help in improving the resulting label predictions.
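Such a refinement can be prototyped with the pydensecrf package, used here as a stand-in for the paper's CRF implementation; the kernel widths (sdims, schan), compatibility weight and iteration count are illustrative, not the paper's values:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax, create_pairwise_bilateral

def refine_labels(probs: np.ndarray, normals: np.ndarray, iters: int = 5) -> np.ndarray:
    """CRF refinement in the spirit of Eq. (5)-(6).

    probs: C x H x W softmax output of the CNN (the unary term);
    normals: H x W x 3 predicted normal map, replacing RGB in the bilateral kernel.
    Returns the refined H x W label image.
    """
    C, H, W = probs.shape
    crf = dcrf.DenseCRF(W * H, C)
    crf.setUnaryEnergy(unary_from_softmax(probs))
    feats = create_pairwise_bilateral(
        sdims=(5, 5), schan=(0.1, 0.1, 0.1),
        img=np.ascontiguousarray(normals, dtype=np.float32), chdim=2)
    crf.addPairwiseEnergy(feats, compat=3)
    q = np.asarray(crf.inference(iters))
    return q.reshape(C, H, W).argmax(axis=0)
```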
Correspondences and Optimization: Once we obtain our refined labels, we can extract a single correspondence from each labeled region. Similar to [25], we back-project the depths for each region and obtain 3D coordinates, from which we extract their median value to obtain a robust estimate of each box side's mid-point. We can then obtain an initial estimate of each sensor viewpoint's pose via Procrustes analysis [32], using the 3D-to-3D correspondences between the sensor's view and the virtual 3D model. Using this initial estimate, we then optimize over the dense point clouds from each back-projected depth map, using ICP formulated with a point-to-plane error under a graph-based optimization, in order to obtain a global solution (more details can be found in [25]).
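The Procrustes initialization admits a closed form; a numpy sketch of the rigid (rotation plus translation) case via SVD, without the subsequent dense ICP refinement:

```python
import numpy as np

def procrustes_rigid(src: np.ndarray, dst: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Rigid alignment (R, t) with dst ~= R @ src + t, from K x 3 corresponding
    3D points (here: box-side mid-points in the sensor view and virtual model).
    Assumes K >= 3 non-degenerate correspondences."""
    src_c = src - src.mean(axis=0)                 # center both point sets
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)      # cross-covariance factorization
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```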
V. RESULTS
We evaluate our proposed calibration method under a variety of different sensor placements to showcase its robustness to user sensor placement. Our system can support an arbitrary number of sensors, limited by the HW (network speed, HDD write speed) and the use case requirements (resolution, frame rate). However, in our evaluation we focus on an N = 4 sensor setup that achieves optimal coverage while keeping the HW requirements - and by extension the cost - to a minimum. We train our network using Caffe [33] on an Nvidia Titan X, with N = 4 input depth maps, using the ADAM optimizer [34] initialized with its default values. When rendering, we use average depth sensor intrinsic parameters obtained from 9 different factory D415 sensors, divided by the downscaling factor. We test our network's performance by generating a test set with pose samples drawn from uniform distributions with different parameters than those reported in Equation 1. These are selected to produce labeled images from different pose configurations than those used for training. In this test set, our model achieves a . mIoU.

We compare our calibration results against other similar methods, both structure-based and object-based. For the structure-based method comparisons, we use LiveScan3D [21] by attaching their markers on our structure, and we additionally compare against the marker-based method of [3], [20] by also attaching QR codes on our structure and enhancing it with the same graph-based dense optimization step we use, effectively evaluating only the correspondence extraction's effect on the initial pose estimation. For the object-based method comparisons, we use the approaches of [23] and [24], which utilize a ball that is moved within the capturing volume to establish correspondences and then optimize the sensors' poses. In order to use the same sequences for comparison, we updated the method of [24] to work with a green colored ball. Therefore, we first capture RGB-D data by moving a green ball of known diameter, attached on a stick, within the capturing volume, and then we place the structure and re-capture data to obtain the necessary input for all methods. We conduct these experiments for 5 different placements, as presented in Table I. For evaluating the accuracy of the calibration methods, we use the Root Mean Squared Euclidean (RMSE) distance between the closest points of adjacent views. The final error metric of each method is extracted by taking the mean RMSE distance over all pairs of adjacent views.

Table I: RMSE results (in mm) of our method and the compared ones, for five approximate sensor placements a-e that differ in radius ρ and height z (in placements d and e, the sensors additionally spanned a range of heights). Where results are not available (N/A), the method did not manage to converge.

Method    | a     | b     | c     | d     | e
[21]      | 21.8  | N/A   | N/A   | N/A   | N/A
[23]      | 20.82 | 18.41 | 20.79 | 21.83 | N/A
[24]      | 21.57 | N/A   | 18.52 | 20.67 | 21.54
[3], [20] | …     | …     | …     | …     | N/A
Ours      | 17.57 | 15.41 | 17.26 | 16.85 | …
From the results presented in Table I, we see that no method apart from ours manages to consistently converge to a good solution. While the marker-based approach produced better results in those placements where it managed to converge, it should be noted that it required a lot of fine-tuning of the SIFT detection parameters, on a per-experiment basis, to extract matching features; moreover, our marker-less method produces comparable accuracy. We can therefore conclude that our method robustly produces high quality external calibration results with minimal human intervention and technical knowledge.

We additionally offer some qualitative results to showcase the effect of the post-refinement dense CRF step. Figure 7 shows the output of the CNN for a quadruple of depth maps for experiment b, and then presents the output of the post-refinement step, which improves the quality of the labeled regions. This helps in establishing more accurate correspondences for the initial pose estimates and thus better drives the subsequent dense graph-based optimization step. Moreover, we also offer qualitative results of the accuracy of the registration for all the conducted experiments in Figure 8.

Figure 7: Segmentation results before (a) and after (b) applying the CRF post-refinement step.
Figure 8: Qualitative results of the obtained external calibration among the four sensors' viewpoints in our five (a-e) experiments. Each viewpoint is colored with a different color. On the top row, we offer top-down views, while on the bottom row, the respective side views are illustrated.
VI. CONCLUSION
Multi-view capturing is gaining traction with the recent developments in AR and VR; however, multi-view systems are difficult to engineer and develop, and are usually complex to set up and difficult to relocate. We have designed, and publicly offer, a multi-view system based on recent sensor technology that is significantly lower cost, easier to set up and more portable than other systems in the literature. This was achieved through careful design decisions and the development of a new calibration method that is easy to use and, at the same time, robust. Even though the demonstrated calibration process relies on learning a specific placement, this does not preclude training new networks for other setups as well (e.g. 3 sensors at 120° angle intervals, or 8 sensors arranged in two different 4-sensor perimeters at different heights). We believe that our system can be used as a basis for future research on production methods, as well as for 3D content creation by freelancers and professionals alike, enabling quicker workflows due to quicker and more flexible setup times.

ACKNOWLEDGMENT
This work was supported by the EU's H2020 Framework Programme funded project Hyper360. We are also grateful for a hardware donation by Nvidia.

REFERENCES
[1] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan, "High-quality streamable free-viewpoint video," ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 69, 2015.
[2] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou et al., "Holoportation: Virtual 3D teleportation in real-time," in Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 2016, pp. 741-754.
[3] D. S. Alexiadis, A. Chatzitofis, N. Zioulis, O. Zoidi, G. Louizis, D. Zarpalas, and P. Daras, "An integrated platform for live 3D human reconstruction and motion capturing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 4, pp. 798-813, April 2017.
[4] N. Robertini, D. Casas, E. De Aguiar, and C. Theobalt, "Multi-view performance capture of surface details," International Journal of Computer Vision, vol. 124, no. 1, pp. 96-113, Aug 2017. [Online]. Available: https://doi.org/10.1007/s11263-016-0979-1
[5] G. Ye, Y. Liu, Y. Deng, N. Hasler, X. Ji, Q. Dai, and C. Theobalt, "Free-viewpoint video of human actors using multiple handheld Kinects," IEEE Transactions on Cybernetics, vol. 43, no. 5, pp. 1370-1382, Oct 2013.
[6] E. De Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun, "Performance capture from sparse multi-view video," ACM Transactions on Graphics (TOG), vol. 27, no. 3, p. 98, 2008.
[7] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, "Motion capture using joint skeleton tracking and surface estimation," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1746-1753.
[8] S. Beck and B. Froehlich, "Sweeping-based volumetric calibration and registration of multiple RGBD-sensors for 3D capturing systems," in Virtual Reality (VR), 2017 IEEE. IEEE, 2017, pp. 167-176.
[9] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, and C. Zhang, "Multiview imaging and 3DTV," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 10-21, 2007.
[10] T. Kanade, P. Rander, and P. Narayanan, "Virtualized reality: Constructing virtual worlds from real scenes," IEEE MultiMedia, vol. 4, no. 1, pp. 34-47, 1997.
[11] P. Kauff and O. Schreer, "An immersive 3D video-conferencing system using shared virtual team user environments," in Proceedings of the 4th International Conference on Collaborative Virtual Environments. ACM, 2002, pp. 105-112.
[12] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. S. Godisart, B. Nabbe, I. Matthews et al., "Panoptic studio: A massively multiview system for social interaction capture," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[13] J.-G. Lou, H. Cai, and J. Li, "A real-time interactive multi-view video system," in Proceedings of the 13th Annual ACM International Conference on Multimedia. ACM, 2005, pp. 161-170.
[14] F. Marton, E. Gobbetti, F. Bettio, J. A. I. Guitián, and R. Pintus, "A real-time coarse-to-fine multiview capture system for all-in-focus rendering on a light-field display," IEEE, 2011, pp. 1-4.
[15] R. Tsai, "A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Journal on Robotics and Automation, vol. 3, no. 4, pp. 323-344, 1987.
[16] Y. M. Kim, D. Chan, C. Theobalt, and S. Thrun, "Design and calibration of a multi-view TOF sensor fusion system," in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on. IEEE, 2008, pp. 1-7.
[17] N. Ahmed and I. Junejo, "Using multiple RGB-D cameras for 3D video acquisition and spatio-temporally coherent 3D animation reconstruction," International Journal of Computer Theory and Engineering, vol. 6, no. 6, p. 447, 2014.
[18] K. Berger, K. Ruhl, Y. Schroeder, C. Bruemmer, A. Scholz, and M. A. Magnor, "Markerless motion capture using multiple color-depth sensors," in VMV, 2011, pp. 317-324.
[19] B. Kainz, S. Hauswiesner, G. Reitmayr, M. Steinberger, R. Grasset, L. Gruber, E. Veas, D. Kalkofen, H. Seichter, and D. Schmalstieg, "OmniKinect: Real-time dense volumetric data acquisition and applications," in Proceedings of the 18th ACM Symposium on Virtual Reality Software and Technology, ser. VRST '12. New York, NY, USA: ACM, 2012, pp. 25-32. [Online]. Available: http://doi.acm.org/10.1145/2407336.2407342
[20] N. Zioulis, D. Alexiadis, A. Doumanoglou, G. Louizis, K. Apostolakis, D. Zarpalas, and P. Daras, "3D tele-immersion platform for interactive immersive experiences between remote users," Sept 2016, pp. 365-369.
[21] M. Kowalski, J. Naruniec, and M. Daniluk, "LiveScan3D: A fast and inexpensive 3D data acquisition system for multiple Kinect v2 sensors," in 3D Vision (3DV), 2015 International Conference on. IEEE, 2015, pp. 318-325.
[22] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-1334, Nov 2000.
[23] A. Fornaser, P. Tomasin, M. De Cecco, M. Tavernini, and M. Zanetti, "Automatic graph based spatiotemporal extrinsic calibration of multiple Kinect v2 ToF cameras," Robotics and Autonomous Systems, vol. 98, pp. 105-125, 2017.
[24] P.-C. Su, J. Shen, W. Xu, S.-C. S. Cheung, and Y. Luo, "A fast and robust extrinsic calibration for RGB-D camera networks," Sensors, vol. 18, no. 1, p. 235, 2018.
[25] A. Papachristou, N. Zioulis, D. Zarpalas, and P. Daras, "Markerless structure-based multi-sensor calibration for free viewpoint video capture," in Proceedings of the 26th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, ser. WSCG '18, 2018, pp. 88-97. [Online]. Available: http://wscg.zcu.cz/WSCG2018/!!_CSRN-2801.pdf
[26] L. Keselman, J. Iselin Woodfill, A. Grunnet-Jepsen, A. Bhowmik, M. Gupta, A. Jauhari, K. Kulkarni, S. Jayasuriya, A. Molnar, P. Turaga et al., "Intel RealSense stereoscopic depth cameras," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
[27] "TurboJPEG," https://github.com/libjpeg-turbo/libjpeg-turbo, accessed: 2018-09-03.
[28] "Blosc," https://github.com/Blosc/c-blosc, accessed: 2018-09-03.
[29] M. Kowalski, J. Naruniec, and M. Daniluk, "LiveScan3D: A fast and inexpensive 3D data acquisition system for multiple Kinect v2 sensors," in 3D Vision (3DV), 2015 International Conference on, Oct 2015, pp. 318-325.
[30] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2938-2946.
[31] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011, pp. 109-117.
[32] D. G. Kendall, "A survey of the statistical theory of shape," Statistical Science, pp. 87-99, 1989.
[33] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[34] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.