A Robust Billboard-based Free-viewpoint Video Synthesizing Algorithm for Sports Scenes

Jun Chen, Ryosuke Watanabe, Keisuke Nonaka, Tomoaki Konno, Hiroshi Sankoh, and Sei Naito

J. Chen, R. Watanabe, K. Nonaka, T. Konno, H. Sankoh, and S. Naito are with the Ultra-realistic Communication Group, KDDI Research, Inc., Fujimino, Japan (corresponding author: J. Chen; Tel: +81-70-3825-9914; e-mail: [email protected]).
Abstract—We present a billboard-based free-viewpoint video synthesizing algorithm for sports scenes that can robustly reconstruct and render a high-fidelity billboard model for each object, including occluded ones, in each camera. Its contributions are (1) applicability to a challenging shooting situation in which a high-precision 3D model cannot be built because only a small number of wide-baseline cameras are available; (2) the ability to reproduce the appearance of occlusions, one of the most significant issues for billboard-based approaches owing to the ineffective detection of overlaps. To achieve these goals, the proposed method does not attempt to find a high-quality 3D model but utilizes a raw 3D model obtained directly from space carving. Although this model is insufficiently accurate to produce an impressive visual effect, precise object segmentation and occlusion detection can be performed by back-projecting it onto each camera plane. The billboard model of each object in each camera is rendered according to whether it is occluded, and its location in the virtual stadium is determined from the barycenter of its 3D model. We synthesized free-viewpoint videos of two soccer sequences recorded by five cameras, using the proposed and state-of-the-art methods, to demonstrate the effectiveness of the proposed method.
Index Terms—Free-viewpoint Video Synthesis, 3D Video, Multiple View Reconstruction, Image Processing.
I. INTRODUCTION

Free-viewpoint video synthesis is an active research field in computer vision, aimed at providing a beyond-3D experience in which audiences can view virtual media from any preferred angle and position. In a free-viewpoint video system, the virtual viewpoint can be interactively selected to see a part of the field from angles where a camera cannot be mounted. Moreover, the viewpoint can be moved around the stadium to allow audiences to have a walk-through or fly-through experience [1], [2], [3], [4].

The primary way to produce such a visual effect is to equip the observed scene with a synchronized camera network [5], [6], [7]. A free-viewpoint video is then created using multi-view geometry techniques, such as 3D reconstruction or view-dependent representation. The 3D model representation, by means of a 3D mesh or point cloud [8], [9], [10], [11], provides full freedom of virtual view and continuous appearance changes for objects; this representation is therefore close to the original concept of a free-viewpoint video. An example of this technology is "Intel True View" for Super Bowl LIII [12], which enables immersive viewing experiences by transforming video data captured from 38 5K ultra-high-definition cameras into a 3D video. This technology achieves impressive results. However, a camera network with many well-calibrated cameras is required to obtain a precise model, which makes these methods difficult to deploy cost-effectively. Moreover, the heavy computational cost of rendering leads to non-real-time video display, especially on portable devices such as smartphones. View-dependent representation techniques [13], [14], [15], [5] do not provide a consistent solution for all input cameras but compute a separate reconstruction for each viewpoint. In general, these techniques do not require a large number of cameras. As reported in [16], a novel view can be synthesized with only two cameras by using sparse point correspondences and a coarse-to-fine reconstruction method. The requirement for numerous physical devices is thus relaxed, but at the same time new challenges are introduced. The biggest challenge in these methods is the detection and rendering of "occlusion", the overlap of multiple objects in a camera view.

With the convergence of computer vision and deep learning [17], [18], an alternative way to create a free-viewpoint video is to convert a single-camera signal into a proper 3D representation [19], [20]. This approach makes creation easily controllable, flexible, convenient, and cheap. As noted in [19], a CNN can estimate a per-player body depth map to reconstruct a soccer game from just a single YouTube video. Despite their generality, however, such setups face numerous challenges. First, they cannot reproduce an appropriate appearance over the entire range of virtual views because of the limited information; for example, the surface texture of the side facing away from the camera is unlikely to produce a satisfactory visual effect. The detection and treatment of occlusions caused by overlaps of multiple objects in a single camera view also remain unsolved, and errors in occlusion detection lead to inaccurate depth estimation.

In this paper, we focus on a multi-camera setup to provide an immersive free-viewpoint video for a sports scene, such as soccer or rugby, that involves a large field. Our goal is to resolve the conflict between creating a high-fidelity free-viewpoint video and the requirement for many cameras. Specifically, we propose an algorithm that robustly reconstructs an accurate billboard model for each object, including occluded ones, in each camera. It can be applied to challenging shooting conditions where only a few wide-baseline cameras are present. Our key ideas are: (1) accurate depth estimation and object segmentation are achieved by projecting labelled 3D models, obtained from shape-from-silhouette without optimization, onto each camera plane; (2) the occlusion of each object is detected using the acquired 2D segmentation map, without the involvement of parameters, and the detection is robust against self-occlusion; (3) a reasonable 3D coordinate for each billboard model in a virtual stadium is calculated from the barycenter of the raw 3D model to provide a stereovision effect.

We present the synthesized results of two soccer sequences recorded by five cameras.
Our results can be viewed on a PC, smartphone, smart glasses, or head-mounted display, enabling free-viewpoint navigation to any virtual viewpoint. Comparative results are also provided to show the effectiveness of the proposed method in terms of the naturalness of the surface appearance of the synthesized billboard models.

II. RELATED WORKS
A. Free-viewpoint Video Creation from Multiple Views

1) 3D Model Representation:
The visual hull [21], [22], [23] is a 3D reconstruction technique that approximates the shape of observed objects from a set of calibrated silhouettes. It usually discretizes a pre-defined 3D volume into voxels and tests whether a voxel is occupied by determining whether it falls inside or outside the silhouettes. Coupled with the marching cubes algorithm [24], [25], the discrete voxel representation can be converted into a triangle mesh. Some approaches focus on the direct calculation of a mesh representation by analyzing the geometric relation between silhouettes and the visual hull surface based on the assumption of local smoothness or point-plane duality [26], [27]. Visual hull approaches suffer from two main limitations. First, many calibrated cameras need to be placed in a 360-degree circle to obtain a relatively precise model. Second, the visual hull gives the maximal volume consistent with the objects' silhouettes, failing to reconstruct concavities. More generally, visual hull approaches serve as initialization for more elaborate 3D reconstruction. The photo hull [28], [29], [30] approximates the maximum boundaries of objects using the photo-consistency of a set of calibrated images. It eliminates the process of silhouette extraction but introduces more restrictions, such as highly precise camera calibration, sufficient texture, and diffuse surface reflectance. As noted in [31], [32], [33], [34], [35], advanced approaches combine photo-consistency, silhouette-consistency, sparse feature correspondence, and more to solve the problem of high-quality reconstruction. However, the parameter tuning needed to balance the constraints takes time.
2) View-dependent Representation:
View-dependent representations can be classified into view-interpolation and billboard-based methods according to their procedures. View interpolation [14], [36], [37] utilizes the projective geometry between neighboring cameras to synthesize a view without explicit reconstruction of a 3D model. It has the advantage of avoiding camera calibration and 3D model estimation. However, the quality of a synthesized view is restricted by the accuracy of the correspondences among cameras, which means that the optimal baseline is constrained to a relatively narrow range. An interpolation method [38] that renders a scene using both pixel correspondence and a depth map was reported to improve the visual effect; nevertheless, it still suffers from a narrow baseline. Billboard-based methods [13], [39], [40], [41] construct a single planar billboard for each object in each camera. The billboards rotate around individual points of the virtual stadium as the viewpoint moves, providing walk-through and fly-through experiences. These methods cannot reproduce continuous changes in the appearance of an object, but the representation can easily be reconstructed. Our previous work [5] overcomes the problem of occlusion by utilizing conservative 3D models to segment objects. Its underlying assumption is that the back-projection area of a conservative 3D model in a camera is always larger than the input silhouette. It outperforms conventional methods in terms of robustness to camera setup and naturalness of texture. However, we find that the reconstruction of rough 3D models increases noise and degrades the final visual effect.
B. Free-viewpoint Video Creation from a Single View
Creating a free-viewpoint video from a single camera (generally a moving camera) is a delicate task, which involves automatic camera calibration, semantic segmentation, and monocular depth estimation. The calibration methods [42], [43], [44] are generally composed of three processes: field line extraction, cross-point calculation, and field model matching. Under the assumption of small movement between consecutive frames, [45] calibrates the first frame using conventional methods and propagates the parameters to the current frame from previous frames by estimating the homographic matrix. Semantic segmentation [46], [47], [17] is a pixel-level dense prediction task that labels each pixel of an image with the class it represents. In an application of free-viewpoint video creation, it works out what objects there are and where they are in an image, providing the information needed for further processing. Estimating depth is a crucial step in scene reconstruction. Unlike estimation from multiple views, which can use the correspondences among cameras, monocular depth estimation [48], [49], [50] estimates depth from a single RGB image. Many recent works [51], [52], [53] follow an end-to-end learning paradigm consisting of a convolutional network for 2D/3D body-joint localization and a subsequent optimization step that regresses to a 3D pose. The constraint on these methods is the requirement of images with 2D/3D pose ground truth for training. The study in [19] describes the first method that can transform a monocular video of a soccer game into a free-viewpoint video by combining the techniques mentioned above. It constructs a dataset of depth-map/image pairs from the FIFA video game for the restricted soccer scenario to improve the accuracy of depth estimation. The approach reported in [15] can also create a free-viewpoint video from a single video. The major deficiency of creation from a single view is that it cannot reproduce any surface appearance that the camera does not observe.

III. ALGORITHM FOR FREE-VIEWPOINT VIDEO CREATION
An overview of our proposed solution is shown in Fig. 1. It includes six steps: data capturing, silhouette segmentation, 3D reconstruction, depth estimation and 2D segmentation, billboard model creation, and free-viewpoint video rendering. Processes (b)-(e) work off-line on the server side, while the rendering is performed in real time on the client side according to the user's operation. The input data are captured using a synchronized camera network in which each camera's view is fixed during recording. Each camera is calibrated by the method reported in [43] to estimate the extrinsic parameters, intrinsic parameters, and lens distortion.

Fig. 1: Workflow of the proposed method.
A. Silhouette Segmentation
For a sports scene, it is reasonable to assume that the objects, including players and the ball, are moving. Therefore, objects can be extracted by a background subtraction method [54] that includes three processes: global extraction, classification, and local refinement. In the first process, a background image is obtained by averaging hundreds of consecutive video frames. The difference in pixels between each frame and the background image is then calculated. Pixel positions whose differences are less than a certain threshold are regarded as background, with the remaining pixels judged to be foreground. In the second process, we classify the shadow area into independent shadow and dependent shadow according to the shadow's luminance, shape, and size; the independent shadows are removed here. Finally, a refinement is conducted to remove the dependent shadow based on the assumption that the chrominance difference between objects and background is recognizable. The threshold is adjusted dynamically according to the chrominance in each local area.
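To make the three-stage pipeline concrete, the following Python sketch implements the global extraction and a simplified chrominance-based refinement. The threshold values and the fixed (rather than locally adaptive) chrominance cue are illustrative assumptions, not the exact classifier used in the paper.

```python
import numpy as np

def extract_silhouette(frame, background, diff_thresh=30.0, chroma_thresh=15.0):
    """Sketch of the background subtraction stage; thresholds are illustrative."""
    frame = frame.astype(np.float32)
    background = background.astype(np.float32)

    # Global extraction: pixels that deviate from the averaged background
    # by more than the threshold become foreground candidates.
    candidate = np.abs(frame - background).mean(axis=2) > diff_thresh

    # Local refinement (simplified): dependent shadows darken the background
    # without changing its chrominance much, so candidates whose chrominance
    # matches the background are dropped.
    chroma_frame = frame - frame.mean(axis=2, keepdims=True)
    chroma_bg = background - background.mean(axis=2, keepdims=True)
    shadow_like = np.abs(chroma_frame - chroma_bg).mean(axis=2) <= chroma_thresh
    return candidate & ~shadow_like

# The background image itself is the average of hundreds of frames:
# background = np.mean(np.stack(frames, axis=0), axis=0)
```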
B. Raw 3D Model Reconstruction
Our method to estimate the 3D shape of the observed objects from a wide-baseline camera network uses shape from silhouettes. It discretizes a pre-defined 3D volume into voxels, projects each voxel onto all the camera image planes, and removes the voxels that fall outside the silhouettes. The set of remaining voxels, called a volumetric visual hull [21], gives a shape approximation of the observed scene. After a volumetric visual hull is obtained, the individual objects are segmented by a connected-components labeling algorithm [55], and an identifier label is assigned to each object. We extract the 0th- and 1st-order moments $M_{\alpha,\beta,\gamma}(V_t)$, $(\alpha,\beta,\gamma) \in \{(0,0,0), (1,0,0), (0,1,0), (0,0,1)\}$, of each object with Eq. (1) to determine their sizes and locations with Eq. (2).

$$M_{\alpha,\beta,\gamma}(V_t) = \sum_{(x,y,z)\in V_t} x^{\alpha} y^{\beta} z^{\gamma}. \quad (1)$$

$$\{N(V_t), X(V_t), Y(V_t), Z(V_t)\} = \left\{ M_{0,0,0}(V_t), \frac{M_{1,0,0}(V_t)}{M_{0,0,0}(V_t)}, \frac{M_{0,1,0}(V_t)}{M_{0,0,0}(V_t)}, \frac{M_{0,0,1}(V_t)}{M_{0,0,0}(V_t)} \right\}. \quad (2)$$

Here, $V_t$ denotes the $t$th object, and $\{x, y, z\}$ denotes the 3D coordinates of an occupied voxel. $N(V_t)$ and $\{X(V_t), Y(V_t), Z(V_t)\}$ indicate the number of voxels in $V_t$ and its barycenter, respectively. In the next step, we build mesh models by coupling the volumetric visual hull with the marching cubes algorithm [24].

The visual hull may contain noise that comes from imperfect silhouettes. We remove such noisy regions by taking into account the number of voxels of an object, as illustrated in the following equation:

$$V_t = \begin{cases} ON, & \text{if } T_{min} < N(V_t) < T_{max} \\ OFF, & \text{otherwise} \end{cases}. \quad (3)$$

An object is removed if its number of voxels is less than a minimum threshold $T_{min}$ or exceeds a maximum threshold $T_{max}$. Our solution focuses on outdoor sports scenes, such as soccer or rugby matches, so it is practical to give reasonable assignments to $T_{min}$ and $T_{max}$ by considering the actual sizes of the ball and the athletes. The bottom image in Fig. 1 (c) presents an example of segmentation in which a unique color is assigned to each object, while the upright rectangles illustrate the minimum bounding box of each object.
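The whole reconstruction step can be summarized in a few dozen lines. The sketch below is a simplified Python rendering of Eqs. (1)-(3); the `projectors` callables, which map world points to pixel coordinates for each calibrated camera, are assumptions of the sketch rather than part of the paper's implementation.

```python
import numpy as np
from scipy import ndimage

def reconstruct_objects(silhouettes, projectors, xs, ys, zs, t_min, t_max):
    """Simplified rendering of Eqs. (1)-(3). Each entry of `projectors` is a
    hypothetical callable mapping (N, 3) world points to (N, 2) pixels."""
    X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
    pts = np.stack([X.ravel(), Y.ravel(), Z.ravel()], axis=1)

    # Space carving: a voxel survives only if it projects inside every silhouette.
    occupied = np.ones(len(pts), dtype=bool)
    for sil, proj in zip(silhouettes, projectors):
        uv = np.round(proj(pts)).astype(int)
        h, w = sil.shape
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(pts), dtype=bool)
        hit[ok] = sil[uv[ok, 1], uv[ok, 0]] > 0
        occupied &= hit

    # Connected-components labeling separates individual objects.
    labels, n = ndimage.label(occupied.reshape(X.shape))
    objects = []
    for t in range(1, n + 1):
        vox = pts[(labels == t).ravel()]
        size = len(vox)                      # N(V_t), the 0th-order moment of Eq. (1)
        if not (t_min < size < t_max):       # noise filter of Eq. (3)
            continue
        objects.append((vox, vox.mean(axis=0)))  # barycenter from 1st moments, Eq. (2)
    return objects
```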
Fig. 2: Depth estimation and 2D segmentation. (a) depth map; (b) segmentation map.
C. Depth Estimation and 2D Segmentation
To estimate the depth map in a camera view, we project the mesh models onto the camera plane to associate 2D image pixels with 3D triangles on the mesh surface. The projection of a 3D triangle is a 2D triangle, so we define the associations of a 3D triangle as the pixels bounded by its 2D projection. The depth of the $i$th pixel, $d_i$, is assigned as the depth of the nearest corresponding triangle, as expressed in the following equation:

$$d_i = \min \{ d_{i1}, d_{i2}, \cdots, d_{in} \}. \quad (4)$$

Here, $n$ indicates the number of 3D triangles that correspond with pixel $i$, and $d_{ij}$ $(j = 1, 2, \cdots, n)$ denotes the distance from the $j$th triangle to the camera center. Fig. 2 (a) presents a depth map, in which light gray coloration identifies objects nearer to the camera center, while objects that are farther away are darker. While estimating the depth map, we also record the label of the nearest corresponding triangle to indicate which object the pixel is associated with. This can be regarded as a process of segmentation in which each object is separated from the others. Fig. 2 (b) demonstrates the result of segmentation, in which pixels with the same color intensity correspond to the same object.
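As an illustration of Eq. (4), the following Python sketch rasterizes projected triangles into a depth map and a label map with a simple z-buffer. The per-triangle distance is taken at the centroid, and `project` is again a hypothetical projection callable; both are assumptions of the sketch.

```python
import numpy as np

def depth_and_label_maps(triangles, labels, project, cam_center, h, w):
    """Z-buffer sketch of Eq. (4); `project` maps (N, 3) points to (N, 2) pixels."""
    depth = np.full((h, w), np.inf)        # d_i, initialized to "infinitely far"
    seg = np.full((h, w), -1, dtype=int)   # label of the nearest triangle per pixel

    for tri, lab in zip(triangles, labels):
        d = np.linalg.norm(tri.mean(axis=0) - cam_center)  # distance to camera center
        p = project(tri)                                   # (3, 2) projected vertices
        x0, y0 = np.floor(p.min(axis=0)).astype(int)
        x1, y1 = np.ceil(p.max(axis=0)).astype(int)
        for y in range(max(y0, 0), min(y1 + 1, h)):
            for x in range(max(x0, 0), min(x1 + 1, w)):
                if _inside(p, x, y) and d < depth[y, x]:
                    depth[y, x] = d        # keep the nearest triangle (Eq. 4) ...
                    seg[y, x] = lab        # ... and record which object it belongs to
    return depth, seg

def _inside(p, x, y):
    """Barycentric point-in-triangle test for the pixel center (x, y)."""
    a, b, c = p
    v0, v1, v2 = b - a, c - a, np.array([x, y]) - a
    den = v0[0] * v1[1] - v1[0] * v0[1]
    if abs(den) < 1e-12:
        return False                       # degenerate (edge-on) projection
    u = (v2[0] * v1[1] - v1[0] * v2[1]) / den
    v = (v0[0] * v2[1] - v2[0] * v0[1]) / den
    return u >= 0 and v >= 0 and u + v <= 1
```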
D. Billboard Model Reconstruction

In a billboard free-viewpoint video, each object is represented as a planar image with texture, while the 3D visual effect is produced by placing the planar images at the proper positions in a virtual stadium. In our study, we created a billboard model in the three steps described below.
Fig. 3: Individual object extraction. (a) presents a cropped image where object 1 overlaps with object 2, and object 3 is isolated. (b), (c), and (d) are the extracted individual objects, where the gray color indicates an occluded region. We manually blocked the uniform numbers with black rectangles to avoid copyright issues; this applies to the following figures as well.
Fig. 4: Texture extraction. (a) object 1; (b) object 2; (c) object 3.
1) Individual Object Extraction:
We successively project the segmented mesh models onto a specific camera plane to extract an individual 2D region for each object and determine its state, visible or not. Regions that map to a single object are certainly visible to the camera, while regions associated with two or more objects are ambiguous. To judge the visibility of an ambiguous region, we compare the label of the projecting polygons with the label stored in the 2D segmentation map: the region is visible when the two labels are the same; otherwise, it is blocked by other objects. Fig. 3 shows a demonstration in which the visible and invisible regions are expressed in white and gray, respectively. Compared with the visibility detection method using a ray-casting algorithm [5], which introduces an intractable threshold, our proposed method runs without parameters and is robust against self-occlusion.
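In code, the visibility test reduces to a per-pixel label comparison, which is why no tuning parameter is involved. A minimal sketch, assuming each object's projected region is given as a boolean mask:

```python
import numpy as np

def split_visibility(seg_map, object_masks):
    """Parameter-free visibility test: a projected pixel of an object is
    visible iff the 2D segmentation map stores that object's own label there.
    `object_masks` maps an object label to its boolean projection mask."""
    result = {}
    for label, mask in object_masks.items():
        visible = mask & (seg_map == label)   # labels agree: the camera sees it
        occluded = mask & (seg_map != label)  # a nearer object wrote its label here
        result[label] = (visible, occluded)
    return result
```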
2) Texture Extraction:
For the visible pixels in an individual object region, surface textures can be reproduced directly by extracting the color of the same pixels from the input image. The invisible pixels are rendered from the neighboring cameras by coupling the depth map with the corresponding polygons. Fig. 4 presents the rendering result for the objects in Fig. 3. In the case of objects 1 and 3, our method produces a good appearance because their textures come from the facing camera without a blending process. Object 2, which is partially occluded, exhibits small but acceptable visual artifacts.
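A sketch of the texture step under the same assumptions; `fetch_from_neighbor` is a hypothetical callable standing in for the reprojection from a neighboring camera via the depth map and corresponding polygons:

```python
import numpy as np

def render_billboard_texture(image, visible, occluded, fetch_from_neighbor):
    """Visible pixels are copied from the facing camera; occluded pixels are
    filled from a neighboring camera (hypothetical `fetch_from_neighbor`)."""
    tex = np.zeros_like(image)
    tex[visible] = image[visible]                    # direct copy, no blending
    ys, xs = np.nonzero(occluded)
    if len(xs):
        tex[occluded] = fetch_from_neighbor(xs, ys)  # (K, 3) colors for K pixels
    return tex
```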
Fig. 5: Location determination. (a) mesh model; (b) billboard model.

Fig. 6: The selection of a reference camera.
3) Location Determination:
To accurately locate billboard models on the ground, we calculate the 2D barycenter of each object region and associate it with the 3D barycenter of its mesh model, as shown in Fig. 5. The red marks present the 2D and 3D barycenters, respectively, while the red rectangle indicates the 3D area occupied by the mesh model.
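The anchoring itself is a small computation; a sketch, assuming the object's region is a boolean mask and the 3D barycenter comes from Eq. (2):

```python
import numpy as np

def billboard_anchor(region_mask, barycenter_3d):
    """Pin the billboard: the 2D barycenter of the object's region is
    associated with the 3D barycenter of its raw model (Eq. 2)."""
    ys, xs = np.nonzero(region_mask)
    uv = np.array([xs.mean(), ys.mean()])   # 2D barycenter of the region
    return uv, np.asarray(barycenter_3d)    # pair used to place the plane
```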
E. Free-viewpoint Video Rendering
The free-viewpoint video is rendered on the client side, where the 3D coordinates and direction of a virtual viewpoint are obtained from the user's operation. We identify the reference camera for rendering as the nearest camera by calculating the Euclidean distance between the virtual viewpoint and each recording camera. Fig. 6 shows an example of the selection of a reference camera: the first camera is nearer to virtual view 1 than the second camera, so the billboards of camera 1 render its virtual image, and vice versa. In the rendering process, the billboards of the reference camera are placed in a virtual stadium and rotated according to the user-selected viewpoint.
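Reference-camera selection is a nearest-neighbor query in 3D; a minimal sketch (the positions are placeholder coordinates):

```python
import numpy as np

def nearest_reference_camera(virtual_pos, camera_positions):
    """Pick the recording camera with the smallest Euclidean distance
    to the virtual viewpoint."""
    d = np.linalg.norm(np.asarray(camera_positions) - np.asarray(virtual_pos), axis=1)
    return int(np.argmin(d))

# e.g., with five recording cameras:
# ref = nearest_reference_camera([0.0, 25.0, 10.0], cam_positions)
```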
IV. EXPERIMENTAL RESULTS
To demonstrate the performance of our method, we compare it with the following methods:
- RB [5], as the more recent representative of the billboard-based free-viewpoint video production approach, which extracts object regions in each camera by reconstructing a rough 3D model.
- FFVV [7], as a more recent and fast representative of full-model free-viewpoint video generation, which can produce a free-viewpoint video in real time.
- CVH [9], as a conventional full-model production method.

We applied the proposed and comparison methods to two types of soccer content to validate their usability under different shooting conditions. The cameras of the first content cover half of the pitch, while the observation area of the second content targets the penalty area. Both contents were captured with five synchronized cameras. The resolution of each camera was ×, and the frame rate was fps. Fig. 7 shows the camera configurations for the two contents, in which black and red symbols respectively show the positions of the recording cameras and the virtual cameras.

Fig. 7: Camera configuration. (a) the first content; (b) the second content.

For the first content, we define the 3D space for reconstruction as meters wide, meters high, and . meters deep. The camera threshold of RB and CVH for the construction of a rough 3D shape was , which remains the same in the production of the second content. The voxel size for shape approximation in all the methods was cm × cm × cm. The thresholds $T_{min}$ and $T_{max}$ for noise filtering were × and ×, respectively.

Fig. 8 (a) shows the cropped input image of each recording camera to highlight the region covered by the virtual cameras. Fig. 8 (b) presents comparisons of three virtual-viewpoint images produced by the proposed and reference methods. Fig. 8 (c) and (d) present the surface texture of two selected objects. First, let us focus on the reproduced images from the first virtual viewpoint, shown in the first row of Fig. 8 (b), (c), and (d). The viewpoint was set in the same direction as "cam01", so the methods employing view-dependent rendering techniques (the proposed method, RB, and FFVV) can produce a high-quality texture. Nevertheless, CVH, which utilizes global rendering techniques, fails to give a proper appearance due to the inaccurate 3D shape approximation. Next, let us look at the images constructed from the second virtual viewpoint, shown in the second row of Fig. 8 (b), (c), and (d). The virtual camera was set as a bird's-eye view from above, whose nearest reference camera is "cam02". It can be seen that the proposed method successfully recovers the color appearance of an occluded object, while the other techniques introduce severe artifacts or leave some important parts unrendered.
Fig. 8: Free-viewpoint video of the first content. (a) Cropped input images (we manually blurred the commercial billboards to avoid copyright issues; this applies to the other experiments as well). (b) Synthesized free-viewpoint video viewed from three virtual viewpoints. (c) Close-up view of a selected player. (d) Close-up view of another selected player.
Fig. 9: Projections of billboard models of the first content on capturing viewpoints. The red and blue masking respectively indicate the visible and occluded regions. (a) player 1; (b) player 2.
Fig. 10: Projections of the 3D model of the first content on the XY-plane. The red and yellow circles respectively highlight the noise and the segmentation faults. (a) proposed method; (b) RB.

Finally, let us observe the images (last row of Fig. 8 (b), (c), and (d)) rendered by a virtual camera placed on the opposite side of "cam01". Since the input images did not provide sufficient information for interpolation, the full-model expression methods, FFVV and CVH, were incapable of offering a suitable chromatic appearance. The billboard methods, however, have the potential to handle situations like this because they represent objects using planar billboard models obtained from the nearest camera.

To illustrate the differences between our method and RB, we projected the billboard models back to the capturing viewpoints. The region mapped with a billboard model is marked with orange or blue: orange means that the region is visible, while blue indicates an overlapped area. Fig. 9 presents examples of projections of two objects. Comparison shows that the billboard models of our method are reliable and accurate, while RB tends to expand the individual object region and make wrong judgments for occlusion. In the meantime, we projected the 3D models used in our method and RB onto the XY-plane to reveal the difference between the 3D models, as shown in Fig. 10. It can be seen that the model reconstructed by RB contains considerable noise, highlighted by the red circles in the figure. Moreover, RB mistakenly recognizes two separate objects as one, as demonstrated by the yellow circle in Fig. 10.

For the second content, the 3D space for production and the voxel size for shape approximation in all the methods were defined as m × m × m and . cm × . cm × . cm, respectively. The thresholds $T_{min}$ and $T_{max}$ for noise filtering were . × and . ×, respectively. Fig. 11 demonstrates the input images, synthesized images from three virtual viewpoints, and the highlighted surface texture of two selected objects. All the virtual viewpoints were set as bird's-eye views from above to evaluate the texture quality when the virtual facing directions are far from the recording directions. From the results in the figure, it can be observed that our method and RB, as billboard-based methods, outperform the full-model representation approaches in all the tests. The reason for this is that the shape approximated from five wide-baseline cameras is quite inaccurate: the horizontal slice of a reconstructed model is more likely to be a pentagon than a circle or ellipse with a smooth edge, so the rendering quality is far from satisfactory. Next, let us focus on the difference between the proposed method and RB. Besides the misalignment in rendering an occluded area, there are several artifacts or noisy regions in the result of RB (the second row of Fig. 11 (c) and the third row of Fig. 11 (d)). The relaxed shape-from-silhouette approach is likely to introduce noise of irregular shape and size, as shown in Fig. 13. Consequently, parts of regions that are visible in some cameras are judged to be occlusions, as demonstrated in Fig. 12. Even though RB includes noise-filtering approaches, it is challenging to remove all noise, especially when its shape resembles a ball.

V. DISCUSSION
In this section, we discuss some factors that may affect the visual quality of a reconstructed free-viewpoint video. First, camera calibration plays a vital role in free-viewpoint video creation. Most of the reported approaches work under the assumption that a sports field, such as a soccer or rugby field, matches its design drawing. However, this assumption fails in most cases, sometimes because of human error when marking an actual sports field. Moreover, sports associations usually provide rough guidelines rather than specific numbers with reliable precision. For example, soccer field dimensions are within the range found optimal by FIFA: 100-130 yards (90-120 m) long by 50-100 yards (45-90 m) wide. Thus the camera calibration is not accurate enough, leading to errors in 3D shape reconstruction and texture rendering. Second, a camera network should be laid out as carefully as possible to create a high-quality free-viewpoint video. The primary requirement is that the cameras should be distributed uniformly in a stadium.
Fig. 11: Free-viewpoint video of the second content. (a) Cropped input images. (b) Synthesized free-viewpoint video viewed from three virtual viewpoints. (c) Close-up view of a selected player. (d) Close-up view of another selected player.
Fig. 12: Projections of billboard models of the second content on capturing viewpoints. (a) player 1; (b) player 2.
Fig. 13: Projections of the 3D model of the second content on the XY-plane. (a) proposed method; (b) RB.

This uniform setup is more likely to obtain well-rounded texture information that enhances the quality of the reproduced surface appearance. In addition, it can provide continuous changes when switching viewpoints, because a virtual view is represented by the billboard models of its nearest recording camera. A third factor is the number of cameras. There is no doubt that the more cameras are equipped, the better. However, we recommend the full-model representation when an essentially unlimited number of cameras is available, while the proposed method should receive top priority when only a small number of cameras is provided. From our experience, five cameras are sufficient to create a high-fidelity free-viewpoint video. Finally, our method is appropriate for scenes involving many players, such as soccer, rugby, and basketball, but not for simple scenarios with few players, such as judo, taekwondo, and wrestling. The proposed method creates a stereo visual effect by placing 2D billboard models at different positions in a virtual stadium. Scenarios with fewer players create fewer billboard models in each camera; in particular, when the players grapple with each other, the proposed method constructs only one billboard model per camera. When all the players are represented by one billboard model, the spatial relationships among players are lost, weakening the 3D visual effect.
VI. CONCLUSION
In this paper, we presented a novel billboard-based synthesis approach suitable for free-viewpoint video production for sports scenes. It converts 2D images captured by a synchronized camera network into a high-quality 3D video. Our approach has high flexibility because only a few cameras are required; therefore, it can be applied to challenging shooting conditions where the cameras are sparsely placed around a wide area. We approximate 3D models of objects using a conventional shape-from-silhouette technique and then project them onto each image plane to extract individual object regions and discover occlusions. Each object region is rendered by a view-dependent approach in which the textures of non-occluded portions are taken from the nearest camera, while several cameras are used to reproduce the appearance of occlusions. Experimental results on soccer content have shown that the surface texture of each object, including occluded ones, can be reproduced more naturally than by the other state-of-the-art methods. In the future, we will parallelize our method and combine it with efficient data compression and streaming methods to deliver real-time free-viewpoint video.
REFERENCES

[1] Masayuki Tanimoto, "Overview of free viewpoint television," Signal Processing: Image Communication, vol. 21, no. 6, pp. 454–461, 2006.
[2] Aljoscha Smolic, "3D video and free viewpoint video - from capture to display," Pattern Recognition, vol. 44, no. 9, pp. 1958–1968, 2011.
[3] Takeo Kanade, Peter Rander, and P. J. Narayanan, "Virtualized reality: Constructing virtual worlds from real scenes," IEEE Multimedia, vol. 4, no. 1, pp. 34–47, 1997.
[4] T. Matsuyama and T. Takai, "Generation, visualization, and editing of 3D video," in Proceedings, First International Symposium on 3D Data Processing Visualization and Transmission, June 2002, pp. 234–245.
[5] Hiroshi Sankoh, Sei Naito, Keisuke Nonaka, Houari Sabirin, and Jun Chen, "Robust billboard-based, free-viewpoint video synthesis algorithm to overcome occlusions under challenging outdoor sport scenes," in Proceedings of the 26th ACM International Conference on Multimedia, MM '18, 2018, pp. 1724–1732, ACM.
[6] Keisuke Nonaka, Ryosuke Watanabe, Jun Chen, Houari Sabirin, and Sei Naito, "Fast plane-based free-viewpoint synthesis for real-time live streaming," in . IEEE, 2018, pp. 1–4.
[7] Jun Chen, Ryosuke Watanabe, Keisuke Nonaka, Tomoaki Konno, Hiroshi Sankoh, and Sei Naito, "A fast free-viewpoint video synthesis algorithm for sports scenes," in . IEEE, 2019 (accepted).
[8] Christian Theobalt, Gernot Ziegler, Marcus Magnor, and Hans-Peter Seidel, "Model-based free-viewpoint video: Acquisition, rendering, and encoding," in Proceedings of the Picture Coding Symposium, San Francisco, USA, 2004, pp. 1–6.
[9] J. Kilner, J. Starck, A. Hilton, and O. Grau, "Dual-mode deformable models for free-viewpoint video of sports events," in Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007), 2007, pp. 177–184.
[10] Yebin Liu, Qionghai Dai, and Wenli Xu, "A point-cloud-based multiview stereo algorithm for free-viewpoint video," IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 3, pp. 407–418, 2010.
[11] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan, "High-quality streamable free-viewpoint video," ACM Transactions on Graphics (ToG)
Computer Graphics Forum. Wiley Online Library, 2010, vol. 29, pp. 585–594.
[14] Naho Inamoto and Hideo Saito, "Virtual viewpoint replay for a soccer match by view interpolation from multiple cameras," IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1155–1166, 2007.
[15] Houari Sabirin, Qiang Yao, Keisuke Nonaka, Hiroshi Sankoh, and Sei Naito, "Toward real-time delivery of immersive sports content," IEEE MultiMedia, vol. 25, no. 2, pp. 61–70, 2018.
[16] Marcel Germann, Tiberiu Popa, Richard Keiser, Remo Ziegler, and Markus Gross, "Novel-view synthesis of outdoor sport events using an adaptive view-dependent geometry," in Computer Graphics Forum. Wiley Online Library, 2012, vol. 31, pp. 325–333.
[17] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
[18] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh, "Convolutional pose machines," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[19] Konstantinos Rematas, Ira Kemelmacher-Shlizerman, Brian Curless, and Steve Seitz, "Soccer on your tabletop," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4738–4747.
[20] Armin Mustafa and Adrian Hilton, "Semantically coherent co-segmentation and reconstruction of dynamic scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 422–431.
[21] Aldo Laurentini, "The visual hull concept for silhouette-based image understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 150–162, 1994.
[22] German K. M. Cheung, Takeo Kanade, J.-Y. Bouguet, and Mark Holler, "A real time system for robust 3D voxel reconstruction of human motions," in Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on. IEEE, 2000, vol. 2, pp. 714–720.
[23] Alexander Ladikos, Selim Benhimane, and Nassir Navab, "Efficient visual hull computation for real-time 3D reconstruction using CUDA," in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on. IEEE, 2008, pp. 1–8.
[24] William E. Lorensen and Harvey E. Cline, "Marching cubes: A high resolution 3D surface construction algorithm," in ACM SIGGRAPH Computer Graphics. ACM, 1987, vol. 21, pp. 163–169.
[25] Timothy S. Newman and Hong Yi, "A survey of the marching cubes algorithm," Computers & Graphics, vol. 30, no. 5, pp. 854–879, 2006.
[26] Chen Liang and K.-Y. K. Wong, "Complex 3D shape recovery using a dual-space approach," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. IEEE, 2005, vol. 2, pp. 878–884.
[27] Jean-Sébastien Franco and Edmond Boyer, "Efficient polyhedral modeling from silhouettes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 414–427, 2009.
[28] Greg Slabaugh, Ron Schafer, and Mat Hans, "Image-based photo hulls," in Proceedings, First International Symposium on 3D Data Processing Visualization and Transmission. IEEE, 2002, pp. 704–862.
[29] Shufei Fan and Frank P. Ferrie, "Photo hull regularized stereo," Image and Vision Computing, vol. 28, no. 4, pp. 724–730, 2010.
[30] Gregory G. Slabaugh, Ronald W. Schafer, et al., "Image-based photo hulls," Dec. 13, 2005, US Patent 6,975,756.
[31] Yasutaka Furukawa and Jean Ponce, "Carved visual hulls for image-based modeling," International Journal of Computer Vision, vol. 81, no. 1, pp. 53–67, 2009.
[32] V. Leroy, J. Franco, and E. Boyer, "Multi-view dynamic shape refinement using local temporal integration," in , Oct 2017, pp. 3113–3122.
[33] Jonathan Starck and Adrian Hilton, "Surface capture for performance-based animation," IEEE Computer Graphics and Applications, vol. 27, no. 3, 2007.
[34] Nadia Robertini, Dan Casas, Helge Rhodin, Hans-Peter Seidel, and Christian Theobalt, "Model-based outdoor performance capture," in . IEEE, 2016, pp. 166–175.
[35] Matthew Loper, Naureen Mahmood, and Michael J. Black, "MoSh: Motion and shape capture from sparse markers," ACM Transactions on Graphics (TOG), vol. 33, no. 6, pp. 220, 2014.
[36] Wenfeng Li, Jin Zhou, Baoxin Li, and M. Ibrahim Sezan, "Virtual view specification and synthesis for free viewpoint television," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 533–546, 2009.
[37] C. Verleysen, T. Maugey, P. Frossard, and C. De Vleeschouwer, "Wide-baseline foreground object interpolation using silhouette shape prior," IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5477–5490, Nov 2017.
[38] Christian Lipski, Felix Klose, and Marcus Magnor, "Correspondence and depth-image based rendering: a hybrid approach for free-viewpoint video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 942–951, 2014.
[39] Ryo Suenaga, Kazuyoshi Suzuki, Tomoyuki Tezuka, Mehrdad Panahpour Tehrani, Keita Takahashi, and Toshiaki Fujii, "A practical implementation of free viewpoint video system for soccer games," 2015.
[40] K. Nonaka, Q. Yao, H. Sabirin, J. Chen, H. Sankoh, and S. Naito, "Billboard deformation via 3D voxel by using optimization for free-viewpoint system," in , Aug 2017, pp. 1500–1504.
[41] Keisuke Nonaka, Houari Sabirin, Jun Chen, Hiroshi Sankoh, and Sei Naito, "Optimal billboard deformation via 3D voxel for free-viewpoint system," IEICE Transactions on Information and Systems, vol. 101, no. 9, pp. 2381–2391, 2018.
[42] Qiang Yao, Akira Kubota, Kaoru Kawakita, Keisuke Nonaka, Hiroshi Sankoh, and Sei Naito, "Fast camera self-calibration for synthesizing free viewpoint soccer video," in . IEEE, 2017, pp. 1612–1616.
[43] Qiang Yao, Hiroshi Sankoh, Keisuke Nonaka, and Sei Naito, "Automatic camera self-calibration for immersive navigation of free viewpoint sports video," in . IEEE, 2016, pp. 1–6.
[44] Dirk Farin, Susanne Krabbe, Wolfgang Effelsberg, et al., "Robust camera calibration for sport videos using court models," in Storage and Retrieval Methods and Applications for Multimedia 2004. International Society for Optics and Photonics, 2003, vol. 5307, pp. 80–92.
[45] Qiang Yao, Keisuke Nonaka, Hiroshi Sankoh, and Sei Naito, "Robust moving camera calibration for synthesizing free viewpoint soccer video," in . IEEE, 2016, pp. 1185–1189.
[46] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[47] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[48] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
[49] Dan Xu, Elisa Ricci, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe, "Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[50] Amlaan Bhoi, "Monocular depth estimation: A survey," arXiv preprint arXiv:1901.09402, 2019.
[51] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3D human pose," Jul 2017.
[52] Denis Tome, Chris Russell, and Lourdes Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," Jul 2017.
[53] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei, "Towards 3D human pose estimation in the wild: A weakly-supervised approach," Oct 2017.
[54] Qiang Yao, Hiroshi Sankoh, Houari Sabirin, and Sei Naito, "Accurate silhouette extraction of multiple moving objects for free viewpoint sports video synthesis," in . IEEE, 2015, pp. 1–6.
[55] Jun Chen, Keisuke Nonaka, Hiroshi Sankoh, Ryosuke Watanabe, Houari Sabirin, and Sei Naito, "Efficient parallel connected component labeling with a coarse-to-fine strategy,"