Sampling Based Scene-Space Video Processing
Felix Klose∗, Oliver Wang∗, Jean-Charles Bazin, Marcus Magnor, Alexander Sorkine-Hornung
Disney Research Zurich · TU Braunschweig
∗ denotes joint first authorship with equal contribution
Figure 1:
Single frames from video results created with our sampling-based scene-space video processing framework. It enables fundamental video applications such as denoising (left) as well as new artistic results such as action shots (center) and virtual aperture effects (right). Our approach is robust to unavoidable inaccuracies in 3D information, and can be used on casually recorded, moving video.
Abstract
Many compelling video processing effects can be achieved if per-pixel depth information and 3D camera calibrations are known. However, the success of such methods is highly dependent on the accuracy of this "scene-space" information. We present a novel, sampling-based framework for processing video that enables high-quality scene-space video effects in the presence of inevitable errors in depth and camera pose estimation. Instead of trying to improve the explicit 3D scene representation, the key idea of our method is to exploit the high redundancy of approximate scene information that arises due to most scene points being visible multiple times across many frames of video. Based on this observation, we propose a novel pixel gathering and filtering approach. The gathering step is general and collects pixel samples in scene-space, while the filtering step is application-specific and computes a desired output video from the gathered sample sets. Our approach is easily parallelizable and has been implemented on the GPU, allowing us to take full advantage of large volumes of video data and facilitating practical runtimes on HD video using a standard desktop computer. Our generic scene-space formulation is able to comprehensively describe a multitude of video processing applications such as denoising, deblurring, super resolution, object removal, computational shutter functions, and other scene-space camera effects. We present results for various casually captured, hand-held, moving, compressed, monocular videos depicting challenging scenes recorded in uncontrolled environments.
CR Categories:
I.3.7 [Computer Graphics]: Picture/Image Generation—Display Algorithms; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—3D/stereo scene analysis
Keywords:
Video processing, Sampling, Inpainting, Denoising, Computational Shutters
Scene-space video processing, where pixels are processed according to their 3D positions, has many advantages over traditional image-space processing. For example, handling camera motion, occlusions, and temporal continuity entirely in 2D image-space can in general be very challenging, while dealing with these issues in scene-space is simple. As scene-space information, e.g., in the form of depth maps and camera pose parameters, becomes more and more widely available due to advances in tools and mass market hardware devices (such as portable light field cameras, depth-enabled smart phones [Google 2015], and RGBD cameras), techniques that leverage depth information will play an important role in future video processing approaches.

Previous work has shown that accurate scene-space information makes fundamental video processing problems simple, automatic, and robust [Bhat et al. 2007; Zhang et al. 2009a], and even enables the creation of new, compelling video effects [Kholgade et al. 2014; Kopf et al. 2014]. However, the visual output quality of such methods is highly dependent on the quality of the available scene-space information. Despite considerable advances in 3D reconstruction over the last decades, computing exact 3D information for arbitrary video recorded under uncontrolled conditions remains an elusive goal (and will likely remain so for the foreseeable future, due to inherent ambiguities in these tasks).

We propose an alternative, general-purpose framework that facilitates robust scene-space video processing in the presence of incorrect 3D information by exploiting the high degree of redundancy in video that arises from multiple observations of the same scene point over time. Our approach takes as input a casually acquired, hand-held, moving, and uncalibrated video, possibly altered by compression artifacts, along with potentially incorrect depth maps and camera pose information, both of which are either derived from the input footage itself or acquired using other sensors. Our work on sample-based scene-space processing is inspired by recent advances in edge-aware filtering and consists of the following two steps. First, for each output pixel, an efficient, general-purpose gathering step collects a large set of input video pixels that potentially represent alternative observations of the same scene point. In the second step, an application-specific filtering operation efficiently reduces the contribution of outliers in the sample set and computes the output pixel color as a weighted combination of the gathered samples.

This sampling-based framework allows us to formulate a wide range of video effects as straightforward, transparent filtering operations in scene space, operating directly on noisy scene-space data without the need for accurate reconstruction or refinement techniques. We demonstrate a wide range of practical applications, from fundamental video processing tasks such as denoising, deblurring, and super resolution to advanced video effects such as action shots, virtual apertures, object removal, and computational shutter effects. We show results computed from different sources of depth information, including standard multi-view stereo reconstruction and one application that uses actively acquired depth from a Kinect. Because our approach is robust to spurious depth values, we can use simple and fast local depth estimation and deal with the resulting depth noise in the filtering step.
While our approach assumes the existence of a consistent scene-space representation over a number of frames, its robustness to outliers allows us to handle some degree of dynamic content, and we show a variety of results on scenes with moving objects.

The entire framework can be easily parallelized. We describe a GPU implementation that is capable of gathering billions of samples on-the-fly without the need for explicit storage in a dedicated data structure or quantization of scene-space, achieving practical runtime performance for real-world HD video footage on a desktop computer.

The key idea is that our framework relies on simple, fast, and transparent algorithms that achieve high-quality results from inaccurate 3D information by exploiting scene structure redundancy across multiple video frames.

Our algorithm applies to a wide range of applications in image and video processing. In this section, we therefore briefly review related work on a general level and discuss more specific related research in the context of the respective applications in Sec. 5.
Pre-processing
Our approach leverages recent advances in 3D reconstruction techniques. Among these are structure-from-motion (SFM) methods that simultaneously recover scene points and camera poses [Kolev et al. 2009; Furukawa and Ponce 2010], and real-time variants like simultaneous localization and mapping (SLAM) [Newcombe and Davison 2010; Tanskanen et al. 2013]. In all our experiments, we estimate camera pose parameters automatically using commonly available commercial tools (NUKEX 9.0v5). In addition, our scene-space framework requires the per-pixel depth needed to project each video pixel into scene-space. Depth from monocular video is customarily computed using stereo methods employed on pairs of subsequent video frames [Scharstein and Szeliski 2002; Seitz et al. 2006]. Other techniques for generating spatio-temporally consistent depth maps from video sequences rely on segmentation and bundle optimization [Zhang et al. 2009b] or filtering [Lang et al. 2012]. Such methods involve complex, computationally expensive global optimization routines to handle inherent ambiguities and/or to enforce smoothness. In contrast, for our framework we found it completely sufficient to compute dense depth using a much simpler, data-term only depth estimation algorithm, or to use raw depth data from sensors like the Kinect, since our sample collection and filtering step robustly eliminates outliers during rendering.
Depth-image based rendering
Scene-space information is frequently used in image-based rendering methods to synthesize views of recorded scenes from arbitrary perspectives. Most commonly, depth-image based rendering (DIBR) methods work by projecting pixels into virtual camera views using per-pixel depth information [Zitnick et al. 2004]. Recent extensions use additional image-based correspondences to mask depth projection errors [Lipski et al. 2014]. For a survey of depth image-based rendering approaches, we refer the reader to [Shum et al. 2007]. In contrast to the goals of image-based rendering, our work does not aim to synthesize novel views from virtual camera positions. We are instead focused on processing the recorded video frames themselves. This key distinction allows us to achieve high visual fidelity and realism by exploiting the original frames as a prior.
Point-based methods
Similar to our approach, point-based rendering techniques deal with large numbers of unstructured points. These methods work by "splatting" unstructured, oriented 3D point clouds into a virtual camera view for displaying complex scenes [Zwicker et al. 2001]. Similar to our approach, these methods accumulate surface samples in screen space. In order to cope with noisy outlier points, it is common in point-based rendering to resample and filter the 3D point cloud, e.g., by moving least squares [Alexa et al. 2003; Öztireli et al. 2009; Kuster et al. 2014]. These methods focus on rendering predominantly correct 3D points corrupted by spurious noise (often acquired by a laser scanner), and obtain robustness by assuming that the underlying geometry is a locally smooth manifold. Epipolar constraints have been shown to improve rendering quality for transitions between views, despite some inaccurate depth estimates [Goesele et al. 2010]. Our framework, in contrast, is designed to handle cases where the vast majority of samples are outliers that arise from incorrect 3D estimates. Our approach does not assume a spatially smooth model, but relies on redundancy from very large sample sets (on the order of billions of samples or more) and a bilateral weighting scheme to remove the contribution of outliers. Our target application is general video processing as opposed to 3D rendering.
Filtering
Our work is inspired in part by the recent successes of "edge-aware" filtering methods in image and video processing [Paris et al. 2007]. Such approaches are often used for image processing, where an image-space patch is filtered by a weighted combination of pixels based on a multivariate normal distribution centered around the input pixel. This simple idea has been shown to be successful in handling a large degree of outliers and has been used in a wide range of applications including tonemapping and style transfer [Aubry et al. 2014], upsampling [Kopf et al. 2007], colorization [Gastal and Oliveira 2011], and approximating global regularizers [Lang et al. 2012]. Zhang et al. [2009c] propose a method that leverages multiple view geometry to find image patches for denoising. This approach marginalizes over depth values, and is therefore somewhat robust to bad depth estimations; however, it works only for single images. Our filtering step resembles these methods, but instead of deriving weights from image patches we filter samples collected from scene-space frustums in a high-dimensional sample space.
Depth-aware video enhancement
Prior approaches have used depth to achieve various video effects such as stylization and relighting [Richardt et al. 2012], manipulating still images by registering stock 3D models to image content [Kholgade et al. 2014], or computing visually consistent hyperlapses [Kopf et al. 2014]. Other methods have proposed registering high-resolution stills to improve the quality of video effects such as HDR, super resolution, and object removal on static [Bhat et al. 2007] as well as dynamic [Gupta et al. 2009] scenes. Zhang et al. [2009a] generate transparency, bullet time, and depth-of-field effects using high-quality, computationally expensive depth maps. While these approaches generate compelling results, they rely on accurate scene information. Consequently, the quality of their result is directly affected by any errors in the estimated 3D scene correspondences. Our approach, in contrast, presents a new way of operating in scene-space that is robust to noisy depth.

Related to our work, Sunkavalli et al. [2012] generate novel still images from short videos by aligning video frames and computing importance-based pixel weights on the resulting image stack. They show applications such as super resolution, blur and noise reduction, as well as visual summaries (action shots). While this approach creates compelling snapshots, it operates entirely in image-space and does not straightforwardly extend to video. By regarding the task in scene-space instead of image-space, our framework may be considered a generalization of this earlier work that is also applicable to videos recorded by a moving camera.

Figure 2: Conceptual overview of our method. Samples are projected from video frames into scene-space (left), then a gathering step finds all samples that fall into the viewing frustum of an output pixel (middle), and finally a filtering operation reduces this set to the output pixel's color (right).
The intuition behind our method can be summarized as follows (Fig. 2). For each output pixel, we gather all samples that lie within a 3D region defined by the frustum of the output pixel, corresponding to all potential observations of the same scene point. We then filter this sample set to generate an output pixel color by weighting the samples appropriately. This intuition is conceptual only; in practice the gathering step is performed directly in the images, avoiding the need for a costly intermediate 3D point cloud representation.

Figure 3: Visualization of the quality of depth information our method is able to use. The point cloud (bottom) shows a side view of 5 images projected into scene-space. The amount of 3D outliers is clearly visible.

In the following, we write the color of a pixel p at frame f in the input video I as I_f(p). Our goal is to compute all output pixel colors O_f(p). For each O_f(p), we draw a set of samples S_f(p) directly from I. A sample s ∈ R^7 is composed of color (s_rgb ∈ R^3), scene-space position (s_xyz ∈ R^3), and frame time (s_f ∈ R). The color s_rgb is in the range [0, 255], s_xyz is given in scene-space units, where the scene is scaled such that 90% of the points lie in a cube, and s_f is in units corresponding to the frames of the input video. Samples are generated by projecting a pixel from an input frame I_f using camera matrix C_f and its respective depth value D_f(p), Sec. 4. Filtering is defined as a function Φ: P(R^7) → R^3 that takes a sample set and produces an output color per pixel, Sec. 5.
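To make this representation concrete, the following minimal Python/NumPy sketch shows how an input pixel could be turned into such a 7D sample. The Sample layout, function names, and the 4x4 camera-matrix convention are illustrative assumptions, not a description of our actual GPU implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    rgb: np.ndarray   # color s_rgb, 3 values in [0, 255]
    xyz: np.ndarray   # scene-space position s_xyz, 3 values
    f: float          # frame time s_f (index of the source frame)

def make_sample(frame_rgb, depth_map, C_f_inv, p, f):
    """Project input pixel I_f(p) with depth D_f(p) into scene space.

    C_f_inv is assumed to be the inverse 4x4 projection matrix of frame f,
    mapping homogeneous image coordinates (x*d, y*d, d, 1) to world space.
    """
    x, y = p
    d = float(depth_map[y, x])
    q = C_f_inv @ np.array([x * d, y * d, d, 1.0])
    return Sample(rgb=frame_rgb[y, x].astype(np.float64),
                  xyz=q[:3] / q[3],
                  f=float(f))
```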
In the following we drop the index f for clarity when considering individual, sequentially processed frames.

Preprocessing
As a preprocessing step, we derive camera calibration parameters (extrinsics and intrinsics) C and depth information D from the input video I. Images are processed in an approximately linear color space by gamma correction. Unless otherwise specified, we compute camera calibration parameters automatically using commonly available commercial tools (NUKEX 9.0v5). Dense depth is either derived from the input video and camera calibrations using multi-view stereo techniques or, in the case of the action shots example, acquired by a Kinect sensor. Unless otherwise mentioned, we use a simple, local depth estimation algorithm where the standard multi-view stereo data-term [Seitz et al. 2006] is computed over a temporal window around each frame. For each pixel this entails searching along a set of epipolar lines defined by C, and picking the depth value with the lowest average cost (we use the sum of squared RGB color differences on small patches). This simple approach does not include any smoothness term and therefore does not require any complex global optimization scheme, making it easy to implement and efficient to compute. As expected, it yields many local depth outliers, introducing high-frequency "salt-and-pepper" noise in the depth map. Fig. 3 shows an example of a noisy point cloud corresponding to a depth map computed using this approach. Sec. 6 provides timing details of the depth computation step and all subsequent components of our method.

Next we discuss the general, application-independent sample gathering step, describing how we compute a sample set S(p) for each output pixel p. The goal in constructing such a set is to collect multiple observations of the same scene point visible to p. For each output pixel p, a physical camera integrates information over a frustum-shaped 3D volume V in scene-space, Fig. 4. Therefore the input video pixels that project into V are the samples we want to collect.

The straightforward approach would be to project all video pixels into scene-space using their associated depth and camera matrices, and store the resulting 3D point cloud in a space partitioning structure (e.g., a kd-tree). Gathering S(p) could then be done by querying the data structure. However, a 1000-frame video at 720p resolution consists of nearly a billion 7D samples, and rendering an output video would require an equal number of frustum-shaped queries. Computing this lookup on a general-purpose, data-agnostic data structure would be computationally intractable, so instead we exploit the underlying geometric nature of our input data to drastically boost efficiency.

Figure 4: Sample gathering. A pixel in output image O defines a frustum V which projects to a convex polygon V_J in an input frame J (left). Each pixel in V_J is then checked to see whether its projection in O is inside V_O (right). Pixels that pass this test are added to the sample set. The green arrow indicates a gathered sample, while the red arrows indicate samples that were tested, but rejected.

The key idea behind our efficient gathering step is to exploit the duality between the scene and its 2D projections in the input video. In order to find which pixels project into the frustum V, we look at its projection into a single input frame J. All pixels that project into V must reside inside the respective 2D convex hull V_J (determined by projecting the frustum V into J), Fig. 4.
Therefore, rather than storing and validating 3D points, we instead directly operate on the pixels in V_J, looping through all J. More formally, given output camera matrix C_O, the 3D frustum volume V of a pixel p is simply defined as a standard truncated pyramid using the pixel location (p_x, p_y) and a frustum size l:

V = { C_O^{-1} · [p_x ± l, p_y ± l, {near, far}]^T },    (1)

where near and far are the depth values of the near and far clipping planes (0.01 and 1000, respectively). The 2D frustum hull V_J is obtained by individually projecting the 3D frustum vertices of V into J. Because pixels inside of V_J may not project into V, but may lie in front of or behind it, we cannot simply accept the entire region. Therefore, we rasterize all pixels q ∈ V_J and check whether their projection back into the output view lies within V_O, Fig. 4. Specifically, we check the distance from q projected back into O to the original pixel p:

‖ p − C_O · C_J^{-1} · [q_x, q_y, q_d]^T ‖ < l,    (2)

where q_d denotes the depth value of pixel q in frame J. Each pixel q passing this test is converted into a 7D sample and added to the sample set S(p).

In case of error-free depth maps, camera poses, and a static scene, the samples inside that pixel's frustum (l = 1) would be a complete set of all observations of that scene point (as well as any occluded scene points). However, inaccuracies in camera pose and depth inevitably lead to false positives, i.e., outlier samples wrongly gathered, and false negatives, i.e., scene point observations that are missed. To account for depth and camera calibration inaccuracies, we increase the per-pixel frustum size l to cover a wider range. In most cases, we use l = 3 pixels. In general, this produces on the order of a few hundred to a thousand samples per set. Note that the depth of the current output pixel p is irrelevant at this stage of the algorithm.

Since output sample sets are handled independently, we can compute all frustum volumes for one output frame concurrently (i.e., the approach is parallelized over output pixels p). This kind of local, independent parallelism is especially well suited for GPU implementation. To handle the large volumes of input data, and to additionally maximize data coherency during the computation, each pixel in O gathers all samples from I_f in parallel before moving on to I_{f+1}.

Now that we have computed the sample set S(p), the second step is to determine a final pixel color O(p) from this sample set. Among the samples in S(p), some will correspond to a scene point observed by the output camera, but others will come from occluded regions, incorrect 3D information, or moving objects. An application-specific weighting function is used to emphasize samples that we can trust, while reducing the influence of outliers. All results we show in the following are computed using a sample set filtering operation of the form

O(p) = Φ(S(p)) = (1/W) · Σ_{s ∈ S(p)} w(s) · s_rgb,    (3)

where w(s) is an application-specific weighting function and W = Σ_{s ∈ S(p)} w(s) is the sum of all weights. One key property of Eq. 3 is that it is very fast to compute. This allows us to take advantage of the large amount of redundant data present in video. To give an intuition for the importance of efficiency: for a 30 second 720p video at 30 fps, and a number of samples per set of |S(p)| ≈ 1000, the output video will be created from about a trillion weighted samples. We show how this can be done in a reasonable amount of time on a desktop computer.
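The following sketch illustrates the gathering test of Eq. (2) and the filtering of Eq. (3) for a single output pixel. It is a simplified CPU reference under the same assumed 4x4 camera-matrix convention as the earlier sketch; for clarity it tests every pixel of each input frame, whereas our implementation only rasterizes the pixels inside the projected hull V_J and runs on the GPU.

```python
import numpy as np

def unproject(C_inv, x, y, d):
    """Lift pixel (x, y) with depth d of one frame into scene space."""
    q = C_inv @ np.array([x * d, y * d, d, 1.0])
    return q[:3] / q[3]

def project(C, xyz):
    """Project a scene-space point into a camera; returns (x, y, depth)."""
    q = C @ np.append(xyz, 1.0)
    return q[0] / q[2], q[1] / q[2], q[2]

def gather(p, l, C_O, frames):
    """Gather the sample set S(p) for one output pixel p = (px, py).

    frames: list of (rgb, depth, C, C_inv, f) tuples for the input video.
    """
    samples = []
    for rgb, depth, C, C_inv, f in frames:
        h, w = depth.shape
        for qy in range(h):
            for qx in range(w):
                xyz = unproject(C_inv, qx, qy, float(depth[qy, qx]))
                x, y, _ = project(C_O, xyz)           # re-project into output view
                if np.hypot(x - p[0], y - p[1]) < l:  # Eq. (2)
                    samples.append((rgb[qy, qx].astype(np.float64), xyz, float(f)))
    return samples

def filter_samples(samples, w):
    """Eq. (3): normalized, weighted combination of the gathered sample colors.
    Assumes at least one sample receives a non-zero weight."""
    weights = np.array([w(s) for s in samples])
    colors = np.stack([s[0] for s in samples])
    return (weights[:, None] * colors).sum(axis=0) / weights.sum()
```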
In empirical tests, we observe that on average W/|S(p)| ≈ 0.18, indicating that around 82% of the information in S(p) is discarded by our approach.

A central feature of our method is the flexibility it provides in defining different weighting functions w(s) on the 7D samples. In particular, it is straightforward to specify effects based on scene-space coordinates by making w(s) depend on the scene-space position of a sample. We demonstrate this in several applications, such as action shots and inpainting, by defining approximate 3D regions (bounding boxes), each with their own set of parameters for w(s).

We demonstrate the general applicability of our method by first showing a number of difficult fundamental video processing operations, followed by advanced video effects. All results were computed using different variants of the weighting function w(s). We encourage the reader to watch the accompanying video to assess the results. All video results and datasets are available on the project website.
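As a sketch of how such scene-space bounding-box regions can enter w(s), a region-dependent factor might look as follows; the parameterization and names are our own illustrative choices.

```python
import numpy as np

def in_box(xyz, box_min, box_max):
    """True if a scene-space position lies inside an axis-aligned bounding box."""
    return bool(np.all(xyz >= box_min) and np.all(xyz <= box_max))

def region_weight(s_xyz, base_weight, box_min, box_max, w_inside, w_outside):
    """Scale a base weight w(s) depending on whether the sample's scene-space
    position falls into a user-defined region, so that, e.g., an action-shot
    shutter is applied only inside the region."""
    return base_weight * (w_inside if in_box(s_xyz, box_min, box_max) else w_outside)
```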
Figure 5: Results of our method on video denoising and deblurring (input, our scene-space result, and a task-specific baseline: BM3D [Dabov et al. 2007] for denoising, [Cho et al. 2012] for deblurring). We obtain similar quality results to existing state-of-the-art approaches specifically tailored to the respective task.

Denoising

As the same scene point is observed many times throughout a video, we can use these multiple observations to denoise input frames. Simply averaging all samples in S(p) by setting w(s) = 1 causes occluded and noisy samples to corrupt the result, Fig. 14. One key observation is that we have a reasonable prior on the expected color O_f(p) of an output pixel p, namely the input pixel color I_f(p) at the same frame and pixel location. We call the sample that originates from projecting I_f(p) into scene-space the reference sample s_ref. Filtering can then be done as a weighted sum of samples, where weights are computed as a multivariate normal distribution with mean s_ref:

w_denoise(s) = exp( −(s_ref − s)^2 / (2σ^2) ).    (4)

While we use the above notation for clarity, samples are actually represented in a 7D space and we use a diagonal covariance matrix. We call the diagonal entries σ_rgb for the three color dimensions, σ_xyz for the scene-space position, and σ_f for the frame time. For denoising, we use typical parameters σ_rgb = 40, σ_xyz = 10, σ_f = 6. The values of σ are set based on the expected variance of their respective modalities; due to the low quality of the input videos, depth estimates in this application are very noisy and σ_xyz is set to a high value (recall that scenes are scaled such that 90% of the points lie in a cube). For some applications we do not list all three parameters, denoting that some dimensions are not used.

We compare the result of our scene-space filter with a state-of-the-art video denoising method (BM3D) [Dabov et al. 2007] in Fig. 5. Despite a simple filtering operation, we achieve similar quality denoising results, demonstrating the power of working in scene-space with large volumes of data. Our method even produces reasonable results for scenes exhibiting lighting changes and limited amounts of foreground motion. We show examples of these in the supplemental video.

We conducted an additional quantitative evaluation to compare to spatiotemporal BM3D. In this experiment, we took a largely noise-free video sequence and added zero-mean, grayscale Gaussian noise with varying standard deviations. After adding increasing amounts of noise, we ran the noisy videos through our entire automatic pipeline, including depth computation, camera calibration, and denoising. Fig. 7 shows the resulting plot of PSNR values. In the case of a static scene and camera motion, we observe that our method is able to outperform BM3D by incorporating 3D information in the filtering.
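A minimal sketch of this weighting, written against the (rgb, xyz, f) sample tuples of the earlier sketches and assuming the Gaussian form of Eq. (4) with a diagonal covariance; the exact normalization is immaterial since Eq. (3) renormalizes by W.

```python
import numpy as np

def w_denoise(s, s_ref, sigma_rgb=40.0, sigma_xyz=10.0, sigma_f=6.0):
    """Gaussian weight around the reference sample s_ref (Eq. (4)).
    s and s_ref are (rgb, xyz, f) tuples; sigmas are the values quoted above."""
    d = (np.sum(((s[0] - s_ref[0]) / sigma_rgb) ** 2) +
         np.sum(((s[1] - s_ref[1]) / sigma_xyz) ** 2) +
         ((s[2] - s_ref[2]) / sigma_f) ** 2)
    return float(np.exp(-0.5 * d))
```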
Deblurring

We can also use our method for deblurring video frames that are blurry due to sudden camera movements, e.g., during hand-held capture. We use the same equation as above, but add to it a measure of frame blurriness:

w_deblur(s) = exp( −(s_ref − s)^2 / (2σ^2) ) · Σ_{q ∈ I_{s_f}} |∇ I_{s_f}(q)|,    (5)

where ∇ is the gradient operator, and I_{s_f} is the frame that the sample s originated from. The first part is the same multivariate normal distribution as Eq. (4), and blurriness is computed as the sum of gradient magnitudes in I_{s_f}. This down-weights the contribution from blurry frames when computing an output color. We use values σ_rgb = 200, σ_xyz = 10, σ_f = 20. We compare to a recent state-of-the-art approach that uses a patch search in space and time to find non-blurred patches and combines the content of these patches with the current frame [Cho et al. 2012]. We observe similar quality results, Fig. 5.
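A sketch of the additional sharpness term of Eq. (5); the grayscale conversion and gradient estimator are illustrative choices, and in practice the per-frame sharpness would be precomputed once per input frame.

```python
import numpy as np

def frame_sharpness(frame_rgb):
    """Sum of gradient magnitudes of a frame (the blurriness measure of Eq. (5))."""
    gray = frame_rgb.astype(np.float64).mean(axis=2)
    gy, gx = np.gradient(gray)
    return float(np.sum(np.hypot(gx, gy)))

def w_deblur(gaussian_term, source_frame_rgb):
    """Eq. (5): the Gaussian term of Eq. (4) scaled by the sharpness of the frame
    the sample originated from, so that blurry frames contribute less."""
    return gaussian_term * frame_sharpness(source_frame_rgb)
```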
Super resolution

Figure 6: A comparison of super resolution using our framework and a commercial tool: the input, our scene-space super resolution, and the Infognition super resolution plugin [Infognition 2015].

Figure 7: PSNR for denoising, plotted over the standard deviation of the added noise, for the noisy input, BM3D, and our scene-space method. Synthetic noise was removed from a real video sequence using BM3D and our scene-space method.

We can also perform a scene-space form of super resolution, where the goal is to create a high-resolution output video Ô from a low-resolution input video I. The traditional approach uses sub-pixel shifts derived from aligning multiple images and solves for the high-resolution image, e.g., using iteratively reweighted least squares [Sunkavalli et al. 2012], or using external priors, e.g., on image gradients [Sun et al. 2008]. Instead, we simply choose a weighting scheme that prefers observations of the scene point with the highest available resolution. We assume that each scene point is most clearly recorded when it is observed from as close as possible (i.e., the sample with the smallest projected area in scene space). To measure this, we introduce a new sample property that we call s_area. The scene-space area of a sample is computed by projecting its pixel corners into the scene and computing the area of the resulting quad. Assuming square pixels, it is sufficient to compute the length of one edge. Let p_l and p_r be the left and right edge pixel locations of a sample located at p, and C the camera matrix for the sample's frame s_f:

s_area = ‖ C^{-1} · [p_l, D(p)]^T − C^{-1} · [p_r, D(p)]^T ‖.    (6)

We then use the following weighting function:

w_sr(s) = exp( −(s_ref − s)^2 / (2σ^2) ) · exp( −s_area^2 / (2σ_area^2) ).    (7)

We use the parameter value σ_rgb = 50 together with a small σ_area. Intuitively, the latter term down-weights samples that were observed by cameras from farther away, preferring the samples with more detailed information. In order to generate reference samples s_ref in this case, we bilinearly upsample I to the output resolution. As our gathering step allows samples to be gathered from arbitrary pixel frustums, we simply gather samples from frustums corresponding to pixel coordinates from Ô, rather than O.
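A sketch of the sample-area term of Eqs. (6)-(7), under the same assumed camera convention as the earlier sketches; the Gaussian form of the area term is a reconstruction of the extracted formula.

```python
import numpy as np

def sample_edge_length(C_inv, p_left, p_right, d):
    """Eq. (6): scene-space length of one pixel edge; with square pixels this
    determines s_area."""
    a = C_inv @ np.array([p_left[0] * d,  p_left[1] * d,  d, 1.0])
    b = C_inv @ np.array([p_right[0] * d, p_right[1] * d, d, 1.0])
    return float(np.linalg.norm(a[:3] / a[3] - b[:3] / b[3]))

def w_sr(gaussian_term, s_area, sigma_area):
    """Eq. (7): favor samples with a small scene-space footprint, i.e., scene
    points observed from close up."""
    return gaussian_term * float(np.exp(-0.5 * (s_area / sigma_area) ** 2))
```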
In addition to fundamental video processing applications, our approach can also be used to create compelling, stylistic scene-space effects.

Inpainting and semi-transparency

We can use our method to "see through" objects by displaying content that is observed behind an object at another point in time. This application requires that a user specifies which objects should be made transparent, either by providing per-frame binary image masks M, where M(p) = 1 indicates that the pixel should be removed and M(p) = 0 otherwise, or a scene-space bounding region. In the latter case, we project all samples that fall into the scene-space bounding region back into the original images to create M. We show in the supplemental video how one might create a scene-space mask with substantially less interaction than drawing 2D image masks, and compare our result to a state-of-the-art video inpainting method [Granados et al. 2012] using masks provided by the authors, Fig. 8.

As we do not have a reference sample s_ref in S(p) for the mask region, we instead compute an approximate reference sample by taking the mean of all samples,

s_ref = (1/|S(p)|) · Σ_{s ∈ S(p)} s,    (8)

and weight samples with the following function:

w_inpaint(s) = exp( −(s_ref − s)^2 / (2σ^2) ) if M(s_p) = 0, and 0 otherwise.    (9)

This computes a weighted combination of samples based on their proximity to the mean sample. If we iterated this procedure, it would amount to a weighted mean-shift algorithm that converges on cluster centers in S(p); however, we found that after two steps the result visually converges. For inpainting, we use the parameter value σ_rgb = 55. To achieve semi-transparent results, we add the standard multivariate weighting to the input frame I(p) and use σ_rgb = 80, in order to emphasize similar color samples.
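A sketch of the inpainting weights of Eqs. (8)-(9); for brevity it computes the mean reference and the Gaussian term over the color dimensions only, which is the dimension whose σ is quoted above.

```python
import numpy as np

def w_inpaint(sample_rgb, masked, sigma_rgb=55.0):
    """Eqs. (8)-(9): the mean of the gathered samples acts as the reference, and
    samples originating from pixels inside the removal mask get zero weight.

    sample_rgb: (N, 3) colors of S(p); masked: (N,) bool, True where M(s_p) = 1."""
    s_ref = sample_rgb.mean(axis=0)                            # Eq. (8), color part
    d = np.sum(((sample_rgb - s_ref) / sigma_rgb) ** 2, axis=1)
    w = np.exp(-0.5 * d)                                       # Eq. (9)
    w[masked] = 0.0
    return w
```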
Computational scene-space shutters

A "computational shutter" replaces the process of a camera integrating photons that arrive at a pixel sensor with a controlled post-processing algorithm. By extending this concept into scene-space, we can generate compelling results that are fully consistent over camera motion. In this case, our weighting function is replaced by the shutter function that operates on the time of each sample:

w_compshutter(s) = ξ(s_f),    (10)

where ξ(s_f) is a box function in a typical camera. The most straightforward example is a scene-space long exposure shot, Fig. 9. As opposed to image-space long exposure shots, time-varying components become blurred but the static parts of the scene remain sharp, even with a moving camera.
Figure 8:
Scene-space samples can be used for video inpainting. Regions can be specified either in scene-space (top row: input, 3D mask, 3D mask reprojected, our semi-transparency result) or in image-space (bottom row: input, background mask, our inpainting result, [Granados et al. 2012]). The 2D background mask is inverted for clarity.
We define possible alternatives for ξ(s_f) visually in Fig. 10. If we consider ξ(s_f) to be an impulse train, and apply it only in a user-defined scene-space region, we can obtain "action shot" style videos. By using a long-tail decaying function, we can create trails of moving objects. These effects are related to video synopsis [Pritch et al. 2008], as they give an immediate impression of the motion of a scene. In both cases, the temporally offset content behaves correctly with respect to occlusions and perspective changes. When the computational shutter contains Dirac delta functions, this implies a projection of content from one frame into another (i.e., action shots in Fig. 9). As only a single time instance is used, we cannot leverage repeated observations of these scene points, and we lose some of the robustness to noisy depth, leading to visible reprojection errors typical of depth image-based rendering. Nonetheless, this example highlights the expressiveness of our method by its ability to model complex scene-space effects using simple shutter functions. For this, we require reasonable depth of moving foreground objects, which we acquire using a Kinect depth sensor, as it cannot be deduced from the video directly.

Until now, we have not explicitly addressed occlusions. This is because inaccurate depth makes reasoning about occlusions difficult. Instead, we have relied on s_ref and sample redundancy to prevent color bleeding artifacts. However, using this approach for dynamic foreground objects, our method can only capture a single observation at a given moment in time. Because we have neither a reference sample nor a significant number of samples with which to determine a reasonable prior, we use the following simple occlusion heuristic to prevent color bleed-through for scenes with reasonable depth values, e.g., from a Kinect. We introduce the notion of a sample depth order s_ord, that is, the number of samples in S(p) that are closer to p than the current sample s:

s_ord = |{ q ∈ S(p) : ‖p_xyz − q_xyz‖ < ‖p_xyz − s_xyz‖ }|.    (11)

Our weighting function becomes:

w_action(s) = ξ(s_f) · exp( −s_ord^2 / (2σ_ord^2) ).    (12)

We use σ_ord = 10. This weighting function emphasizes the samples in the gathered frustum that are the closest to the output camera.
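A sketch of the action-shot weighting of Eqs. (11)-(12) with an impulse-train shutter; the period of the train and the exact falloff of the depth-order term are illustrative choices.

```python
import numpy as np

def xi_impulse_train(s_f, period=15):
    """Impulse-train shutter: pass only every period-th frame (action shots).
    A box over a frame window, or a decaying tail, gives the other Fig. 10 variants."""
    return 1.0 if int(round(s_f)) % period == 0 else 0.0

def w_action(s_f, s_ord, sigma_ord=10.0, xi=xi_impulse_train):
    """Eq. (12): shutter response times a penalty on the sample's depth order
    s_ord (Eq. (11)), so that the samples closest to the output camera dominate."""
    return xi(s_f) * float(np.exp(-0.5 * (s_ord / sigma_ord) ** 2))
```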
Virtual aperture

With appropriate weighting functions, we can also represent complex effects such as virtual apertures, exploiting the existence of our samples in a coherent scene-space. To do this, we model an approximate physical aperture in scene-space and weight our samples accordingly, Fig. 11. This allows us to create arbitrary aperture effects, such as focus pulls and focus manifolds defined in scene-space.

We design a weighting function for an approximate virtual aperture as a double cone with its thinnest point a_0 at the focal point z_0, Fig. 12. The slope a_s of the cone defines the size of the aperture as a function of distance from the focal point:

a(z) = a_0 + |z − z_0| · a_s.    (13)

To avoid aliasing artifacts, we use the sample area s_area introduced previously to weight each sample by the ratio of its size and the aperture size at its scene-space position. The intuition is that samples carry the most information at their observed scale. With r as the distance of s_xyz along the camera viewing ray, and q as the distance from the ray to s, we use

w_va(s) = s_area / (π · a(r)^2) if q < a(r), and 0 otherwise.    (14)

Many other formulations for synthetic apertures are conceivable, and work on camera arrays has demonstrated how compelling virtual aperture images and videos can be created [Wilburn et al. 2005; Vaish et al. 2005]. In our case, we are not using multiple viewpoints at the same time instance, but rather samples captured from neighboring frames to compute aperture effects.

Our method was implemented in Python with CUDA bindings and tested on an Intel i7 3.2 GHz desktop computer with 24GB RAM and a GeForce GTX 980 graphics card.
Figure 9: Computational shutter effects. In a scene-space long exposure video, static objects remain sharp despite camera motion, while moving components like the fountain become blurred; image-space long exposure, on the other hand, blurs the whole frame (top row). Scene-space processing enables effects such as action shots or motion trails, where the filtering step selects distinct frames in time (bottom row). Because these effects live in scene-space, they behave correctly in terms of occlusions and perspective change.
Figure 10:
A selection of "computational shutter" weighting functions, from left to right: a box filter equivalent to a regular camera shutter, an impulse train used to generate multiple exposure video action shots, and a long falloff used for motion trails.

Datasets were captured using an iPhone 5s, a GoPro Hero 3, a Canon S100 point-and-shoot, and a Sony α7s camera. In general, we process videos ranging from 200-1000 frames (6-30 seconds), all at 720p resolution. Unless otherwise noted, all datasets were preprocessed with an automatic script that computes camera pose and data-term-only depth maps. Tab. 1 gives the processing times for the different steps of our method. The samples/pixel values are averages observed in our experiments and are determined by the required sampling regions: in inpainting we use a large temporal σ_f, accumulating samples over many frames, and with a virtual aperture we gather from a bigger frustum, accumulating many samples over a large spatial region. The cost of gathering vs. filtering also depends on the application, but in most cases we observed that roughly 80-95% of the time is spent gathering samples and the rest in filtering. Despite the large amount of data considered, our prototype implementation shows that sampling-based scene-space approaches can generate results in a reasonable amount of time. Further optimizations are possible for performance-critical applications.
Figure 11:
Frame from a video with an added virtual aperture effect (narrow depth-of-field).

We present a general framework for implementing a variety of effects efficiently and intuitively. We do not claim that our approach always performs better than the application-specific state-of-the-art methods that we compare to, only that we can generate compelling video processing effects. For this reason, we provide videos in the supplemental material as validation of our approach. We additionally analyzed the performance of our method over a range of different sources of 3D information. We experimented with alternate camera pose estimation tools (VisualSFM, PFTrack) with similar results. In Fig. 13, we show a denoising result using a variety of depth maps, from over-smoothed to highly noisy. This example shows that our method degrades gracefully as the quality of depth is reduced, leading to only subtle artifacts visible as residual noise and color bleeding from occluded scene objects. We further demonstrate the effect of our robust filtering step in Fig. 14, by comparing our filtering to averaging all gathered samples. With a simple mean, errors in scene-space information are clearly visible.

In most applications, we use moderate values of σ_f, which implies that we are only using information up to about 60 frames on either side of the output frame. This limitation is due to drift in the camera calibration. After more frames, we found that the reprojection errors become progressively worse and gathering these samples did not help much. By choosing a lower σ_f, our approach can be robust to this kind of drift, at the expense of having fewer samples available for filtering. However, given drift-free calibrations, e.g., from pose estimation techniques employing loop closure, we could take advantage of additional frames.

Figure 12: The shape of our virtual aperture weighting function is a double cone with the focal point at the center. The distance from this focal point along the viewing ray determines the size of the area from which samples are used.

                                   samples/pixel   sec./frame
  Preprocessing                    -               1.5
  Depth computation                -               28.5
  Denoising                        500             3.4
  Deblurring                       250             8.9
  Action shots                     100             4.7
  Video inpainting                 1000            16.0
  Virtual aperture                 12000           10.2
  Motion trails                    600             29.0
  Super resolution (× resolution)  800             140.1

Table 1: Processing time for different applications of our method. Preprocessing is the total time spent on automatic extraction of frames, lens undistortion, and camera pose estimation. Runtimes depend on scene structure, camera motion, and sampling parameters, so the timings are only given as an estimate of the runtime the algorithm can be expected to achieve.

Our approach has several limitations, and suggests numerous directions for promising future work. Our framework assumes that the color of scene points stays constant over a local temporal window, i.e., scene changes due to lighting variation or object motion are not explicitly modeled. Despite this, the robust nature of the filtering step enables us to generate realistic-looking results even when these assumptions are violated. However, for fast moving objects, our ability to collect samples is limited to the current frame only, and as a result our filtering step can at best recreate the original frame, Fig. 15. For the action shots example, samples are gathered from a single, distant frame and no valid reference samples exist, which can lead to standard reprojection errors, clearly visible in Fig. 9.

One solution would be to gather samples in object space rather than scene-space. Doing so, however, requires computing accurate dense scene flow (scene-space optical flow). Computing scene flow is a challenging problem, and while there has been exciting recent work showing high quality 3D trajectories [Joo et al. 2014], these are often achieved with specialized acquisition scenarios (in this case a 480-camera dome). Nonetheless, as scene flow becomes more available, our approach is well suited to take advantage of it; by integrating the motion of the scene into the sample gathering step we can support fully dynamic scenes as well.

Occlusions are another inherent difficulty when working with unstructured point clouds. As we gather samples throughout the entire output pixel frustum, we collect occluded samples that project behind objects if these regions become visible at another time in the video. If a good reference sample s_ref is not available, these can show up as color bleeding in the foreground, Fig. 15. With sufficiently accurate depth information, however (e.g., in the action shots application), the contribution of these occluded samples can be reduced, allowing us to prevent unwanted color bleeding.

Our approach computes neighboring output pixel colors independently. This is important for efficient parallelization, but means that it is not straightforward to enforce higher-level image-space consistency in the final result. This can lead to high frequency artifacts, for example in Fig. 8 and the HORSE sequence in the supplemental material, where speckling is visible in regions with high contrast between foreground and background colors. In cases where both the depth and reference sample color are incorrect, object boundaries can become distorted as the weighting of sample sets varies from pixel to pixel. This effect is visible in Fig. 16 and in the
KITCHEN sequence in the supplemental video, where the edge of the wall becomes distorted. We can mitigate this problem to some extent by controlling the size of the pixel frustums; by having overlapping frustums between neighboring pixels, neighboring sample sets will share samples, achieving a limited amount of spatial consistency. However, explicit methods that solve for image-space consistency, e.g., by solving for pixel colors as an MRF that treats neighboring output pixels as connected, would be an interesting area for future work.

Finally, our method requires scene-space information, which can be seen as an additional burden. In fact, for some videos, existing approaches are not sufficiently robust to determine camera pose and depth values automatically. As these technologies improve, however, more and more scenes will be suitable for automatic sampling-based processing. In professional environments, these problems are solved on a daily basis with a high degree of accuracy using mature tools and skilled operators, and our approach can directly be integrated into such production pipelines.
We have presented a general framework that allows for robust scene-space video processing in the presence of inaccurate 3D information. Our method is based on simple, transparent algorithms that can take advantage of the large volumes of data that come with HD video. We have demonstrated the generality of our approach for different fundamental video processing applications and compared the results to state-of-the-art methods specifically designed for the respective tasks. Additionally, we have shown several advanced video processing operations such as video inpainting, action shots, and virtual apertures. Each of these applications was computed on real-world, hand-held footage captured with consumer-grade cameras, demonstrating the robustness of our approach and its general applicability to both mass market and professional requirements. We believe that our novel scene-space processing approach will enable new video applications that were previously impossible, limited, or could not be fully explored because of inevitably unreliable depth information. Video results and datasets are available on the project website to facilitate future research.
Acknowledgments
We would like to thank Henning Zimmer, Changil Kim, Yağız Aksoy, Cengiz Öztireli, and Mario Botsch for helpful discussions throughout the project, as well as Sunghyun Cho and Miguel Granados for making their datasets public. Marcus Magnor acknowledges funding from an ERC Grant.
Figure 13:
Evaluation of denoising with varying quality depth maps: globally smooth (computed using Nuke) (a), data-term only depth (b), and (b) with added high frequency noise (c). While the denoised output is similar in many places, some artifacts are visible when zoomed in. For example, when denoised with the globally smoothed (a) or added noise (c) results, more of the background color bleeds in from behind the table legs, and in (c) the final result has substantially more high frequency noise.
Figure 14:
Averaging the color of all gathered samples (mean) generates poor quality results because it includes many incorrect samples due to 3D projection errors. Using our filtering, the contribution of these samples can be greatly reduced.
Figure 15:
The top row shows an example of a rapidly moving object. Our method cannot denoise the flapping bird as excessive motion prevents us from gathering reliable samples from other frames. Despite this, we can denoise the rest of the scene and avoid ghosting of the bird. Occlusions can be an additional problem, as we gather samples both on and behind objects. We can reduce their contribution using depth information, or the reference sample.
Figure 16: The blurred edge of the wall in the noisy input image (left) makes both depth estimation and the reference sample unreliable, which can cause high frequency artifacts (right).

References

ALEXA, M., BEHR, J., COHEN-OR, D., FLEISHMAN, S., LEVIN, D., AND SILVA, C. T. 2003. Computing and rendering point set surfaces. TVCG.

AUBRY, M., PARIS, S., HASINOFF, S. W., KAUTZ, J., AND DURAND, F. 2014. Fast local Laplacian filters: Theory and applications. ACM Trans. Graphics.

BHAT, P., ZITNICK, C. L., SNAVELY, N., AGARWALA, A., AGRAWALA, M., COHEN, M. F., CURLESS, B., AND KANG, S. B. 2007. Using photographs to enhance videos of a static scene. In EGSR.

CHO, S., WANG, J., AND LEE, S. 2012. Video deblurring for hand-held cameras using patch-based synthesis. ACM Trans. Graphics (Proc. SIGGRAPH).

DABOV, K., FOI, A., KATKOVNIK, V., AND EGIAZARIAN, K. O. 2007. Image denoising by sparse 3D transform-domain collaborative filtering. Trans. Image Processing.

FURUKAWA, Y., AND PONCE, J. 2010. Accurate, dense, and robust multiview stereopsis. TPAMI.

GASTAL, E. S. L., AND OLIVEIRA, M. M. 2011. Domain transform for edge-aware image and video processing. ACM Trans. Graphics (Proc. SIGGRAPH).

GOESELE, M., ACKERMANN, J., FUHRMANN, S., HAUBOLD, C., KLOWSKY, R., STEEDLY, D., AND SZELISKI, R. 2010. Ambient point clouds for view interpolation. ACM Trans. Graphics (Proc. SIGGRAPH).

GOOGLE, 2015. Project Tango.

GRANADOS, M., KIM, K. I., TOMPKIN, J., KAUTZ, J., AND THEOBALT, C. 2012. Background inpainting for videos with dynamic objects and a free-moving camera. In ECCV.

GUPTA, A., BHAT, P., DONTCHEVA, M., CURLESS, B., DEUSSEN, O., AND COHEN, M. 2009. Enhancing and experiencing spacetime resolution with videos and stills. In ICCP.

INFOGNITION, 2015. Infognition super resolution plugin.

JOO, H., PARK, H. S., AND SHEIKH, Y. 2014. Map visibility estimation for large-scale dynamic 3D reconstruction. In CVPR.

KHOLGADE, N., SIMON, T., EFROS, A. A., AND SHEIKH, Y. 2014. 3D object manipulation in a single photograph using stock 3D models. ACM Trans. Graphics (Proc. SIGGRAPH).

KOLEV, K., KLODT, M., BROX, T., AND CREMERS, D. 2009. Continuous global optimization in multiview 3D reconstruction. IJCV.

KOPF, J., COHEN, M. F., LISCHINSKI, D., AND UYTTENDAELE, M. 2007. Joint bilateral upsampling. ACM Trans. Graphics (Proc. SIGGRAPH).

KOPF, J., COHEN, M. F., AND SZELISKI, R. 2014. First-person hyper-lapse videos. ACM Trans. Graphics (Proc. SIGGRAPH).

KUSTER, C., BAZIN, J.-C., ÖZTIRELI, A. C., DENG, T., MARTIN, T., POPA, T., AND GROSS, M. 2014. Spatio-temporal geometry fusion for multiple hybrid cameras using moving least squares surfaces. CGF (Eurographics).

LANG, M., WANG, O., AYDIN, T. O., SMOLIC, A., AND GROSS, M. 2012. Practical temporal consistency for image-based graphics applications. ACM Trans. Graphics (Proc. SIGGRAPH).

LIPSKI, C., KLOSE, F., AND MAGNOR, M. A. 2014. Correspondence and depth-image based rendering: a hybrid approach for free-viewpoint video. T-CSVT.

NEWCOMBE, R. A., AND DAVISON, A. J. 2010. Live dense reconstruction with a single moving camera. In CVPR.

ÖZTIRELI, A. C., GUENNEBAUD, G., AND GROSS, M. 2009. Feature preserving point set surfaces based on non-linear kernel regression. CGF (Eurographics).

PARIS, S., KORNPROBST, P., TUMBLIN, J., AND DURAND, F. 2007. A gentle introduction to bilateral filtering and its applications. In ACM SIGGRAPH courses.

PRITCH, Y., RAV-ACHA, A., AND PELEG, S. 2008. Nonchronological video synopsis and indexing. TPAMI.

RICHARDT, C., STOLL, C., DODGSON, N. A., SEIDEL, H., AND THEOBALT, C. 2012. Coherent spatiotemporal filtering, upsampling and rendering of RGBZ videos. CGF (Eurographics).

SCHARSTEIN, D., AND SZELISKI, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV.

SEITZ, S. M., CURLESS, B., DIEBEL, J., SCHARSTEIN, D., AND SZELISKI, R. 2006. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR.

SHUM, H., CHAN, S., AND KANG, S. B. 2007. Image-based rendering. Springer.

SUN, J., XU, Z., AND SHUM, H. 2008. Image super-resolution using gradient profile prior. In CVPR.

SUNKAVALLI, K., JOSHI, N., KANG, S. B., COHEN, M. F., AND PFISTER, H. 2012. Video snapshots: Creating high-quality images from video clips. TVCG.

TANSKANEN, P., KOLEV, K., MEIER, L., CAMPOSECO, F., SAURER, O., AND POLLEFEYS, M. 2013. Live metric 3D reconstruction on mobile phones. In ICCV.

VAISH, V., GARG, G., TALVALA, E.-V., ANTUNEZ, E., WILBURN, B., HOROWITZ, M., AND LEVOY, M. 2005. Synthetic aperture focusing using a shear-warp factorization of the viewing transform. In CVPR Workshop.

WILBURN, B., JOSHI, N., VAISH, V., TALVALA, E., ANTÚNEZ, E. R., BARTH, A., ADAMS, A., HOROWITZ, M., AND LEVOY, M. 2005. High performance imaging using large camera arrays. ACM Trans. Graphics (Proc. SIGGRAPH).

ZHANG, G., DONG, Z., JIA, J., WAN, L., WONG, T.-T., AND BAO, H. 2009. Refilming with depth-inferred videos. TVCG.

ZHANG, G., JIA, J., WONG, T., AND BAO, H. 2009. Consistent depth maps recovery from a video sequence. TPAMI.

ZHANG, L., VADDADI, S., JIN, H., AND NAYAR, S. K. 2009. Multiple view image denoising. In CVPR.

ZITNICK, C. L., KANG, S. B., UYTTENDAELE, M., WINDER, S. A. J., AND SZELISKI, R. 2004. High-quality video view interpolation using a layered representation. ACM Trans. Graphics (Proc. SIGGRAPH).

ZWICKER, M., PFISTER, H., VAN BAAR, J., AND GROSS, M. 2001. Surface splatting. In Proc. SIGGRAPH.