A Bit Too Much? High Speed Imaging from Sparse Photon Counts

Paramanand Chandramouli, Samuel Burri, Claudio Bruschini, Edoardo Charbon, and Andreas Kolb
University of Siegen, Germany; Swiss Federal Institute of Technology (EPFL), Neuchâtel, Switzerland
Recent advances in photographic sensing technologies have made it possible to detect individual photons. Photon counting sensors are increasingly used in many diverse applications. We address the problem of jointly recovering spatial and temporal scene radiance from very few photon counts. Our ConvNet-based scheme effectively combines the spatial and temporal information present in the measurements to reduce noise. We demonstrate that with our method one can acquire videos at a high frame rate and still achieve a good signal-to-noise ratio. Experiments show that the proposed scheme performs well in challenging scenarios that existing approaches are unable to handle.
Index Terms — Photon Counting Sensors, SPAD Imaging, Convolutional Neural Network, High Speed Imaging
I. INTRODUCTION
We address the problem of imaging fast dynamic scenes with single photon counting sensors [1]. Single photon avalanche diode (SPAD) detectors are endowed with the ability of photon counting and time-stamping, and are becoming increasingly popular for a variety of imaging applications [2]–[6]. We attempt to enhance the fast motion capture capability of SPAD sensors by developing a robust video recovery algorithm. Consider the example shown in Fig. 1. An analog oscilloscope with a traversing sinusoid was imaged by a SwissSPAD camera [7]. Fig. 1 (a) shows a frame from the captured video obtained at 156k frames per second (fps). The value of each pixel in this frame is a binary number wherein positive values indicate the detection of a photon. Note that in a typical DSLR camera, the number of photons captured is of the order of thousands of photons per pixel [3], [8], while the frame in Fig. 1 (a) shows the detection of at most one photon per pixel. Due to the extremely low photon count, one can hardly infer any structure present in the scene. One could average consecutive frames to reduce the effect of noise, but only at the cost of temporal resolution (Fig. 1 (c)). To preserve temporal resolution as well as recover accurate scene reflectance, we develop a convolutional neural network (CNN) that takes the low-photon-count sequence as input and generates a high-photon-count estimate at the same frame rate. Our scheme effectively combines the spatio-temporal information present in the input sequence for video recovery. The resultant frame from our method is shown in Fig. 1 (b). Note that in Figs. 1 (a) and (b), one can observe the localization of the sinusoidal wave, while in Fig. 1 (c) the temporal information is lost. In Fig. 1 (b), we also observe that the details of the static regions have been recovered quite well.

Previously, in the context of time-correlated SPAD imaging, regularization-based approaches have been developed to reconstruct scene reflectivity [3], [9]. For oversampled binary observations, image reconstruction algorithms such as [10] are applicable. In this paper, our objective is to jointly recover spatial and temporal variations of radiance in dynamic scenes without spatial oversampling. Existing video denoising schemes devise methods for combining local and non-local structures present across space and time [11]–[13]. In extremely noisy scenarios, explicitly determining such information does not work well. Instead of "hand-crafted" approaches to combining structural information, our method uses convolutional neural networks consisting of 3D filtering across the spatio-temporal volume [14]. By accumulating a set of binary frames, one can obtain video sequences with less noise at a reduced frame rate. In this paper, we address different scenarios in which different numbers of binary frames are combined. Although we demonstrate our method on SwissSPAD cameras, the scheme can be applied to any other photon counting or binary imaging sensors [15], [16]. High speed consumer cameras typically require significantly bright illumination [17]. In contrast, we do not use any high intensity illumination and operate in normal lighting conditions.
A. Related work
Since our work is related to SPAD imaging, photon counting sensors, and video denoising, we briefly discuss relevant prior work on these topics.
SPAD-based imaging
SPAD sensors are photodetectors in which incident photons are detected via the large avalanche currents they trigger. SPAD sensor arrays are capable of photon counting at high speed with high timing resolution and are useful in a variety of applications such as fluorescence lifetime imaging microscopy (FLIM), positron emission tomography (PET), and time-of-flight imaging [2], [3], [18]. Recently, Burri et al. developed a SPAD array known as SwissSPAD [7]. The SwissSPAD is fabricated in a high-voltage CMOS process and features a large 512 × 128 array with global gating. In this paper, we use data from different SwissSPAD sensor arrays to demonstrate our high speed video recovery scheme. Recent works on range imaging with SPAD sensors include [3], [19], [20]. SPAD cameras have been used to perform challenging tasks such as transient imaging [4], [5], [21]–[23], non-line-of-sight imaging [6], [24]–[26] and imaging through fog [27].

Fig. 1. Imaging wave propagation in an oscilloscope: (a) A 1-bit frame from the input to our algorithm captured at 156,000 fps. (b) Corresponding resultant frame of the proposed scheme. (c) Average of 255 frames with the frame shown in (a) as the center (the dark pixels are due to sensor defects). Kindly refer to the supplementary material for the complete video.
Quanta Imaging
Closely related to the topic of single photon counting is that of binary/quanta image sensors [1], [28]. These sensors were developed with the objective of shrinking the pixel pitch. In quanta imaging, the densely packed sensor pixels oversample the scene radiance to generate binary measurements [28], [29]. Image reconstruction schemes have been proposed for these sensors [10], [30]–[32]. The main difference between these reconstruction methods and our scheme is that these algorithms assume the availability of a larger number of samples for estimating the image intensity at a pixel location. These methods have mainly focused on recovering a static image. In contrast, our focus is on recovering videos.
Video denoising
Video denoising has been widely studied for many years and different kinds of algorithms have been proposed. Because of the vastness of the topic, we restrict our discussion to some popular and recent works. Compared with applying image denoising to each frame, video denoising has an advantage because the high temporal coherence can be leveraged to make a better prediction. Consequently, denoising algorithms adopt different strategies such as non-local means [12], [33], motion estimation [34], [35], 3D dictionary representations [36], etc. Maggioni et al. [11] propose a denoising method popularly known as VBM4D, based on the collaborative filtering scheme of VBM3D [37]. They apply this filtering scheme to a stack of 3D spatio-temporal volumes that are obtained through non-local grouping. Sutour et al. [12] develop a variational denoising scheme by adaptively combining non-local means with total variation on spatio-temporal volumes. Their method adapts to different noise levels. The authors of [38] propose a denoising method for the scenario of mixed noise (such as Gaussian noise mixed with impulsive noise). They formulate denoising as a low-rank matrix completion problem. This work has been extended in [39] to avoid pre-detection of outliers. Recently, techniques based on deep learning have been proposed for video denoising. Chen et al. [40] propose a deep recurrent neural network for video denoising. Their method does not assume a specific noise model and performs close to the state-of-the-art VBM4D scheme [11]. Xue et al. [41] propose task-oriented flow to achieve a specific video processing objective such as denoising. Instead of trying to achieve a precise flow estimation, they train a model whose objective is to predict a motion field tailored for a specific task. In our experiments, we observe that video denoising methods fail in the scenario of SPAD imaging. This could be because the grouping of similar structures across the spatio-temporal volume fails when the number of photons in the observation is sparse.
Contributions
The contributions of this paper can be summarized as follows: i) We address the problem of joint spatio-temporal radiance estimation from single photon counting sensors. ii) We develop a CNN-based video recovery scheme that can handle very high noise levels while maintaining a high frame rate. iii) We devise methods to obtain appropriate real and synthetic datasets for training and evaluation. This data will be made available subsequently.

II. IMAGE FORMATION
In this section, we describe the image formation process for SwissSPAD cameras, which have a globally gated sensor array [7]. The imaging model also applies to other gated SPAD arrays [15], [16].

Each pixel in the SwissSPAD array is composed of a SPAD p-n junction suitably biased to enable photon-triggered avalanches. A one-bit counter is present at every pixel. The SPAD array has a global shutter in which all the pixels can be kept active for a duration as low as 3.8 ns. The pixel counter content is transferred from the sensor pixels via a fast digital readout, which takes about 6.4 µs for the whole 512 × 128 array. In effect, a 1-bit frame of size 128 × 512 indicating the detection of one photon is read out every 6.4 µs, resulting in a frame rate of 156 kfps [7].

Due to the dark counts generated in a SPAD by thermal events, the number of photon counts per unit time follows a Poisson distribution [42]. The probability of $k$ photon counts in unit time is given by

$$p_c(k) = \frac{\chi^k e^{-\chi}}{k!} \qquad (1)$$

where $\chi$ is the expected value of counts and is related to the impinging count rate, dark count rate and photon detection efficiency [42]. Within a particular time frame, the SwissSPAD sensor array can only report whether one or more photons were detected. Consequently, the probability of recording a detection in one readout time is $P(\mathrm{count} > 0) = 1 - e^{-\chi}$. Since the number of photons impinging at a pixel depends on the scene radiance corresponding to that pixel, the scene radiance is sensed non-linearly (according to $1 - e^{-\chi}$). This non-linear mapping has been experimentally verified in [42].

Since the SwissSPAD camera records only a 'binary pixel' in each frame, at a time instant $t$, the intensity $I_t(i, j)$ observed at a pixel $(i, j)$ is a Bernoulli sample whose probability depends on the intensity of the corresponding scene point. Fig. 2 (a) shows a single 1-bit image of a resolution chart obtained by the SwissSPAD camera. Fig. 2 (b) shows the result of capturing and averaging such frames: a 2-bit image corresponding to $2^2 - 1 = 3$ frames. Similarly, Figs. 2 (c), (d) and (e) show 4-bit, 8-bit and 14-bit images. As the number of samples increases, the noise in the observation reduces. The isolated bright pixels present in all the observations of Fig. 2 correspond to hot pixels.
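As an illustration of this model, the following NumPy sketch simulates one 1-bit frame by Bernoulli sampling with probability $1 - e^{-\chi}$; the function name and the scale between radiance and $\chi$ are our illustrative assumptions, not part of the camera interface.

```python
import numpy as np

def simulate_binary_frame(radiance, exposure_scale=0.1, rng=None):
    """Simulate one 1-bit SPAD frame from a radiance map in [0, 1].

    exposure_scale is an illustrative proportionality between radiance
    and the expected count chi within one gate (not a camera constant).
    """
    rng = np.random.default_rng() if rng is None else rng
    chi = exposure_scale * radiance          # expected photon count per pixel
    p_detect = 1.0 - np.exp(-chi)            # P(count > 0) = 1 - exp(-chi)
    return (rng.random(radiance.shape) < p_detect).astype(np.uint8)
```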
A. Objective

When a binary image sequence $I_t$ is averaged up to $b$ bit levels, one obtains another sequence, denoted by $u^b_\tau$, with the frame rate reduced by the factor $N_b = 2^b - 1$. The sequence with bit resolution $b$ is given by

$$u^b_\tau = \frac{1}{N_b} \sum_{t=0}^{N_b - 1} I_{t + \tau N_b} \qquad (2)$$

While imaging a static scene, one can afford to collect as many frames as possible and average them to obtain an accurate estimate of the scene reflectance map. However, as seen in Fig. 1, while imaging fast dynamic scenes, averaging many frames loses the temporal information. Our objective is to overcome this trade-off between frame rate and intensity resolution. For a particular scene, suppose one requires a specific frame rate, so that the bit resolution $b$ is fixed to a certain level. Let $u^b_\tau$ denote the corresponding sequence. Our aim is to estimate a high-bit intensity sequence $u^{\tilde{b}}_\tau$, where $\tilde{b}$ is much greater than $b$, while preserving the frame rate of the original sequence.

In this paper, we consider the scenarios where $N_b$ takes the values 1, 3, 7, and 15, i.e., 1-bit, 2-bit, 3-bit or 4-bit input sequences, respectively. From image sequences of such low photon counts, we attempt to recover sequences corresponding to a much higher bit resolution $\tilde{b}$. Note that, beyond a certain number of bits, any further increase would not add new information [43]. We observed that for sufficiently large $\tilde{b}$, the noise becomes imperceptible.
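Equation (2) translates directly into array code; below is a minimal sketch of ours, assuming the binary frames are stacked along the first axis of a NumPy array.

```python
import numpy as np

def accumulate_to_b_bits(binary_frames, b):
    """Average groups of N_b = 2**b - 1 binary frames (Eq. (2)).

    binary_frames: array of shape (T, H, W) with values in {0, 1}.
    Returns an array of shape (T // N_b, H, W) at 1/N_b the frame rate.
    """
    n_b = 2 ** b - 1
    t = (binary_frames.shape[0] // n_b) * n_b           # drop the remainder
    grouped = binary_frames[:t].reshape(-1, n_b, *binary_frames.shape[1:])
    return grouped.mean(axis=1)                         # u^b_tau in [0, 1]
```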
III. PROPOSED METHOD

We propose to learn a mapping function $f$ for generating a sequence with a high bit resolution, $\hat{u} = f(u^b; \theta)$, such that it is close to the noise-free sequence $u^{\tilde{b}}$. Note that we have dropped the time index $\tau$ to simplify the notation. The term $\theta$ denotes the network parameters.

A. Network architecture
Our network architecture is composed of 3D convolutional layers and residual blocks. The network design is motivated by the fact that many image restoration methods employ architectures with a similar structure, but with 2D convolutions [44]–[46]. Following such an approach also helps to avoid exploding/vanishing gradient problems [47]. Since 2D CNNs in the ResNet style have achieved significant success, we use 3D convolutional layers for processing videos. Fig. 3 shows one residual block of our network. In total, our network consists of K = 3 such residual blocks. Each residual block has units that are composed of 3D convolutional filters of a fixed size. The input convolutional unit maps the single input channel to multiple feature channels; the three intermediate 'conv' units keep the number of channels fixed, and the output unit reduces back to a single channel. Except for the output unit, each unit is followed by a Leaky ReLU to model non-linearities. For comparison, we also train with a larger filter size.

The input to the network is a spatio-temporal patch. At the end of each residual block indexed by $k$, the input is added to the output of the 'output conv' layer to arrive at $\hat{u}_k$, an estimate of the spatio-temporal patch corresponding to the clean high-bit sequence. For training the network, we minimize a loss function composed of the loss functions of each residual block. Since it is observed in [44] that the Charbonnier penalty function leads to better performance than the standard $\ell_2$ penalty, we also use the Charbonnier penalty to define our loss function, i.e., the final loss is given by

$$E\left(\hat{u}_k, u^{\tilde{b}}, \theta\right) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \rho\left(\hat{u}_k - u^{\tilde{b}}\right) \qquad (3)$$

where $N$ is the number of training samples in a batch and the penalty term is given by $\rho(x) = \sqrt{x^2 + \eta^2}$. The value of $\eta$ is chosen to be $10^{-3}$ following [44].
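For concreteness, a minimal PyTorch sketch of one residual block and the Charbonnier penalty is given below. The channel width, filter size and Leaky ReLU slope are our illustrative assumptions, since the text fixes only the overall structure; in training, the per-block losses of Eq. (3) would be summed over all K blocks.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """One residual block: input conv, three intermediate convs, output conv,
    all 3D, with a Leaky ReLU after every unit except the output one."""
    def __init__(self, channels=64, kernel=3):        # widths are illustrative
        super().__init__()
        pad = kernel // 2
        layers = [nn.Conv3d(1, channels, kernel, padding=pad), nn.LeakyReLU(0.2)]
        for _ in range(3):                             # three intermediate units
            layers += [nn.Conv3d(channels, channels, kernel, padding=pad),
                       nn.LeakyReLU(0.2)]
        layers += [nn.Conv3d(channels, 1, kernel, padding=pad)]  # output unit
        self.body = nn.Sequential(*layers)

    def forward(self, u):                              # u: (B, 1, T, H, W)
        return u + self.body(u)                        # skip connection

def charbonnier_loss(estimate, target, eta=1e-3):
    """rho(x) = sqrt(x^2 + eta^2), averaged over the batch."""
    return torch.sqrt((estimate - target) ** 2 + eta ** 2).mean()
```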
B. Training data

For each bit-level, we train our model using simulated data. For the case of 4-bit sequences, we also train with real pairs of low-bit and corresponding high-bit sequences obtained from the SPAD camera. For the 1-bit and 2-bit scenarios, the frame rates that can be achieved by SwissSPAD are 156 kfps and 78 kfps, respectively. At such high frame rates, one can expect the temporal variations to be quite low. Hence, for these two scenarios, we generate a video dataset with high temporal coherence. The raw videos used in the training of the video-deblurring scheme of [48] consist of image sequences at 240 fps with a spatial extent of 1280 × 720. We spatially downsample these sequences to obtain sequences with reduced variations across time, and randomly crop the downsampled sequences at different temporal locations to arrive at fixed-size spatio-temporal volumes. These sequences are directly considered to be the high-intensity-resolution sequences ($u^{\tilde{b}}$). To generate the low-bit sequences, for every frame, we average $N_b$ Bernoulli-sampled (binary) instances. In each Bernoulli instance, the probability of obtaining a one at a pixel is equal to the corresponding normalized true intensity value at that pixel, i.e., brighter pixels are more likely to generate a one. To generate such a sample, we draw a uniform random number at every pixel; if the randomly generated number is less than the true normalized image intensity value, the pixel is assigned one, and zero otherwise. We also randomly add salt-and-pepper noise at a few points to simulate hot pixels. For the 3-bit and 4-bit scenarios, we found that training with videos from the UCF101 dataset [49], which have a normal frame rate, was sufficient. We explain the procedure for obtaining real training data from the SPAD camera in the supplementary material.

Fig. 2. Resolution chart imaged by the SwissSPAD camera at different bits: (a) 1-bit frame. (b) 2-bit frame. (c) 4-bit frame. (d) 8-bit frame. (e) 14-bit frame.

Fig. 3. One residual unit from the proposed ConvNet architecture. Overall, the network consists of K such cascaded residual blocks.
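A sketch of the Bernoulli sampling procedure described above, generating one low-bit training observation from a normalized clean frame, might look as follows; the hot-pixel fraction is an illustrative choice of ours.

```python
import numpy as np

def make_low_bit_input(clean_frame, b, hot_fraction=1e-4, rng=None):
    """Generate a b-bit noisy observation from a clean frame in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    n_b = 2 ** b - 1
    # N_b Bernoulli instances: a pixel fires when a uniform sample
    # falls below its normalized intensity.
    samples = rng.random((n_b,) + clean_frame.shape) < clean_frame
    noisy = samples.mean(axis=0)
    # Set a few random pixels to full intensity to mimic hot pixels.
    hot = rng.random(clean_frame.shape) < hot_fraction
    noisy[hot] = 1.0
    return noisy
```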
Model training

We trained our network models on an NVIDIA GeForce GTX 1080 Ti using stochastic gradient descent without batch normalization. The models named 3DCNN1B, 3DCNN2B, 3DCNN3B and 3DCNN4B denote the CNNs trained with 1-bit, 2-bit, 3-bit and 4-bit sequences, respectively. The network trained using 4-bit real data is referred to as 3DCNNR. We use a subset of the sequences for training and hold out the rest for testing. The input in each batch of training is obtained by randomly cropping fixed-size 3D patches from the sequences of the training dataset. For data augmentation, we randomly resize and flip the patches, and rotate them by multiples of 90°.
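The augmentation can be written compactly; below is a sketch of ours for a (T, H, W) patch, assuming square spatial patches so that rotations preserve the shape.

```python
import numpy as np

def augment(patch, rng=None):
    """Random flip and rotation by a multiple of 90 degrees on a (T, H, W) patch."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        patch = patch[:, :, ::-1]                  # horizontal flip
    k = rng.integers(4)                            # 0, 90, 180 or 270 degrees
    return np.rot90(patch, k, axes=(1, 2)).copy()  # rotate the spatial axes
```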
TABLE I. Quantitative evaluation on simulated data.
IV. EXPERIMENTS
To run our algorithm on a GPU for videos of realistic size, we divide an input sequence into overlapping 3D spatio-temporal patches and subsequently merge the outputs. For sequences of a particular bit-level, we use the corresponding CNN. If the bit-level of an input is not known, it can be determined easily by counting the number of unique levels in the pixel intensities.
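This check reduces to counting gray levels, since a b-bit sequence obtained via Eq. (2) can take at most $2^b$ distinct values; a sketch of ours:

```python
import numpy as np

def infer_bit_level(sequence):
    """Infer b from the number of distinct gray levels: a b-bit sequence
    averaged from N_b = 2**b - 1 binary frames has at most 2**b levels."""
    levels = np.unique(sequence).size
    return int(np.ceil(np.log2(levels)))
```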
We used the test set of simulated data to evaluate the performance of our video reconstruction method. Table I shows the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) averaged over all the test sequences. Note that 3DCNN1B and 3DCNN2B had different inputs, but the ground-truth sequences were the same, as seen in the representative examples shown in Fig. 4. In Table I, values within parentheses indicate that the larger filter size was used in the CNN. To check whether any other denoising method works in this scenario, we applied the algorithms proposed in [11], [12], [41], and [50] to five of these image sequences. None of these methods were able to restore videos for these sequences. We varied the parameters of the algorithms and searched for the optimal values. Among these other algorithms, the best performance was obtained from [12] (when applied with Poisson noise statistics). However, even this is not satisfactory: the SSIM values of the outputs from [12] remained low throughout. For visual comparison, we show one example output of [12] in Fig. 6. The synthetic experiments clearly demonstrate that existing video denoising methods cannot be used for observations with sparse photon counts. However, we subsequently notice improvements in their performance as the bit-level increases.

We also checked whether a single network can be used to recover both 1-bit and 2-bit sequences. For this purpose, we trained another CNN wherein the inputs for training consisted of both 1-bit and 2-bit sequences (with equal probability). We observe a slight reduction in the performance of this jointly-trained CNN compared to the bit-specific networks of Table I, for both the 1-bit and the 2-bit test sequences. We have also tested the proposed networks on video sequences with smaller spatial downsampling factors. In these test sequences, the spatio-temporal variations are larger than in the training data. These results are reported in the supplementary material.
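For reference, the evaluation protocol can be sketched as below; we use scikit-image here as an assumed implementation of PSNR and SSIM, since the paper does not specify one.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_sequence(estimate, reference):
    """Mean PSNR/SSIM over the frames of a (T, H, W) sequence in [0, 1]."""
    psnrs = [peak_signal_noise_ratio(r, e, data_range=1.0)
             for r, e in zip(reference, estimate)]
    ssims = [structural_similarity(r, e, data_range=1.0)
             for r, e in zip(reference, estimate)]
    return float(np.mean(psnrs)), float(np.mean(ssims))
```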
Real experiments

We initially show an example of a 1-bit real sequence. Fig. 7 shows a scene wherein the SwissSPAD camera was placed close to a rotating tool. In this particular setup, because of limitations in the data-transfer rate, 1-bit frames were captured at the rate of 42 kfps. A binary image from the input sequence is shown in Fig. 7 (a), and its corresponding output in Fig. 7 (b). For comparison, we averaged 15 binary frames to generate a 4-bit sequence, which was fed as input to our 3DCNN4B network. In the videos included as supplementary material, one can clearly see the loss of temporal resolution in the output of 3DCNN4B when the averaged sequence is the input.

Fig. 8 shows another rotating tool present in a static background. The input in this scenario was a 2-bit sequence captured at about 25 kfps. While Fig. 8 (a) shows an input frame, Fig. 8 (b) shows the corresponding result from 3DCNN2B. By averaging five frames of this sequence, one can obtain a 4-bit sequence. Figs. 8 (c) and (d) show a 4-bit input and a resultant (3DCNN4B) frame, respectively. We observe that in the static regions, the output of 3DCNN2B comes close to that of 3DCNN4B. The supplementary material contains a comparison of our output with a high-intensity-resolution reference image captured when the scene was still.

Fig. 4. Representative examples from our test set: one-bit frame, two-bit frame, 3DCNN1B result, 3DCNN2B result, and ground truth (videos are in the supplementary material).

TABLE II. Quantitative restoration performance comparison on the real SPAD test dataset.

    Measure   [41]     VBM4D [11]   [12]     3DCNNR
    PSNR      28.53    30.51        31.41    35.54
    SSIM      0.7328   0.7816       0.8218   0.909
Fig. 5. (a) A 2-bit frame from the input to our algorithm captured at 78,000 fps, and (b) corresponding resultant frame (3DCNN2B). (c) Output of [12]. (d) A 4-bit frame, and (e) corresponding output from 3DCNNR. (f) Average of 120 4-bit frames.

Fig. 6. Result of the algorithm in [12] for the input shown in the last row of Fig. 4: output for (left) the 1-bit sequence and (right) the 2-bit sequence.
Fig. 7. High speed rotating tool (1-bit): (a) A frame from the input sequence and (b) its corresponding output from 3DCNN1B. (c) A 4-bit input frame generated by averaging. (d) Output of 3DCNN4B on the 4-bit input.
We next show results on additional oscilloscope sequences of [7] (Fig. 5). The 2-bit sequence was fed as input to 3DCNN2B, and the 4-bit sequence to 3DCNN4B. One can observe that the quality of our 2-bit output comes close to that of the 4-bit output and, in static regions, even to the high-bit observation (Fig. 5 (f)). This shows that our algorithm is capable of producing high-quality outputs from only 2-bit frames. The output of [12] on the 2-bit sequence (Fig. 5 (c)) is clearly inferior to ours. The supplementary material contains complete videos and comparisons; we have also shown the result from [10] there and compared it with our method.

Fig. 8. High speed rotating tool (2-bit): (a) A frame from the input sequence and (b) its corresponding output from 3DCNN2B. (c) A 4-bit input frame generated by averaging. (d) Output of 3DCNN4B on the 4-bit input.

Subsequently, we compare the performances of 3DCNN1B, 3DCNN2B and 3DCNN3B on the same scene. The image sequence corresponds to the breaking of glass with a resolution chart in the background. The scene was captured at 156 kfps and 1-bit resolution. We divided the sequence into groups of seven frames. For 1-bit observations, we keep the central frame in each group and drop the six other frames. For 2-bit observations, we average the middle three frames and drop the rest. For 3-bit observations, we average all seven frames in the group. Essentially, we arrive at 1-bit, 2-bit and 3-bit sequences with similar temporal variations, corresponding to about 22 kfps. With these inputs, we obtain the outputs from our restoration scheme. Fig. 9 shows inputs and outputs at two different instants of time. In the second instant, we see that the particles have been scattered after the breaking of the glass. Despite the inputs being highly noisy, the structural information has been recovered quite well. This shows that even with just one-bit measurements, our network model is able to reconstruct scene information robustly.
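The grouping used for this comparison can be sketched as follows (our code, for a (T, H, W) binary stack):

```python
import numpy as np

def matched_rate_inputs(binary_frames):
    """From groups of 7 binary frames, build 1-, 2- and 3-bit inputs
    with the same effective frame rate."""
    t = (binary_frames.shape[0] // 7) * 7
    g = binary_frames[:t].reshape(-1, 7, *binary_frames.shape[1:])
    one_bit = g[:, 3]                    # central frame only
    two_bit = g[:, 2:5].mean(axis=1)     # average of the middle three frames
    three_bit = g.mean(axis=1)           # average of all seven frames
    return one_bit, two_bit, three_bit
```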
We quantitatively evaluate different video denoising methods using our real SPAD dataset, which contains both 4-bit noisy sequences and the corresponding high-bit sequences. We evaluate the performance on ten randomly selected test image sequences. The performance of the different schemes is presented in Table II. The algorithms of [11] and [12] do not handle hot pixels. For these methods, to compute the scores, we replace the intensity of each hot pixel by the median of its local neighborhood (excluding the damaged pixels while calculating the median). The other algorithms can handle outliers, so this step is not necessary for them. The table shows that our 3DCNNR scheme clearly outperforms the other methods. A representative example from the SwissSPAD dataset is shown in Fig. 11. On close observation of different regions, we can see that the reconstruction is more faithful with the proposed CNN model.

We next show a result on a 4-bit sequence captured at 12 kfps. In Fig. 10, we show sample frames from the resultant sequences. From the videos, one can clearly observe that the proposed method performs well. There is an improvement in the performance of the video denoising schemes when compared to the 1-bit and 2-bit scenarios. However, in the results of [11], [12] and [41] (supplementary material), we clearly observe artifacts in regions where there is no motion. In the supplementary material, we show additional results and also include a quantitative evaluation of 3DCNN3B on a synthetic dataset.

Fig. 9. Each row indicates a different time instant. (Left pair) 1-bit observation and result. (Center) 2-bit observation and result. (Right) 3-bit observation and result.

Fig. 10. Real example of a balloon bursting sequence. Frame from (a) the input sequence, and results of (b) [11], (c) [12], (d) 3DCNN4B.
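The hot-pixel replacement used for the comparison in Table II can be sketched as below; the 3 × 3 window is our assumption, as the exact neighborhood size is not recoverable from the text.

```python
import numpy as np

def replace_hot_pixels(frame, hot_mask):
    """Replace each hot pixel by the median of its valid 3x3 neighbours.

    frame: 2D float array; hot_mask: boolean array marking damaged pixels.
    """
    out = frame.copy()
    h, w = frame.shape
    for i, j in zip(*np.nonzero(hot_mask)):
        i0, i1 = max(i - 1, 0), min(i + 2, h)
        j0, j1 = max(j - 1, 0), min(j + 2, w)
        patch = frame[i0:i1, j0:j1]
        valid = ~hot_mask[i0:i1, j0:j1]      # exclude damaged pixels
        if valid.any():
            out[i, j] = np.median(patch[valid])
    return out
```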
V. CONCLUSIONS
We proposed a video recovery scheme for single-photon counting cameras with sparse photon counts. The performance of our model is quite good in real scenarios despite training on simulated data. The level of performance achieved by our method in the 1-bit and 2-bit scenarios is not possible with any existing approach. Even in the 3-bit and 4-bit scenarios, our method outperforms existing video denoising schemes.

In other applications of SPAD cameras, such as range imaging and time-of-flight imaging, the number of photon counts is of much higher magnitude than the numbers seen in this paper. Our work could serve as a template for developing more photon-efficient techniques to perform these tasks. We will explore this direction in future work.
REFERENCES

[1] E. R. Fossum, N. Teranishi, A. Theuwissen, D. Stoppa, and E. Charbon, Photon-Counting Image Sensors. MDPI, 2018.
[2] E. Charbon, "SPAD based image sensors," in IEEE International Electron Devices Meeting (IEDM), 2014.
[3] D. Shin, F. Xu, D. Venkatraman, R. Lussana, F. Villa, F. Zappa, V. K. Goyal, F. N. Wong, and J. H. Shapiro, "Photon-efficient imaging with a single-photon camera," Nature Communications, vol. 7, 2016.
[4] G. Gariepy, N. Krstajić, R. Henderson, C. Li, R. R. Thomson, G. S. Buller, B. Heshmat, R. Raskar, J. Leach, and D. Faccio, "Single-photon sensitive light-in-flight imaging," Nature Communications, vol. 6, 2015.
[5] M. O'Toole, F. Heide, D. B. Lindell, K. Zang, S. Diamond, and G. Wetzstein, "Reconstructing transient images from single-photon sensors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1539–1547.
[6] F. Heide, M. O'Toole, K. Zhang, D. Lindell, S. Diamond, and G. Wetzstein, "Robust non-line-of-sight imaging with single photon detectors," arXiv preprint arXiv:1711.07134, 2017.
[7] S. Burri, Y. Maruyama, X. Michalet, F. Regazzoni, C. Bruschini, and E. Charbon, "Architecture and applications of a high resolution gated SPAD image sensor," Optics Express, vol. 22, no. 14, pp. 17573–17589, 2014.
[8] J. Nakamura, Image Sensors and Signal Processing for Digital Still Cameras. CRC Press, 2017.
[9] K. Yan, L. Lifei, D. Xuejie, Z. Tongyi, L. Dongjian, and Z. Wei, "Photon-limited depth and reflectivity imaging with sparsity regularization," Optics Communications, vol. 392, pp. 25–30, 2017.
[10] S. H. Chan, O. A. Elgendy, and X. Wang, "Images from bits: Non-iterative image reconstruction for quanta image sensors," Sensors, vol. 16, 2016.
[11] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian, "Video denoising, deblocking and enhancement through separable 4-D nonlocal spatiotemporal transforms," IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 3952–3966, 2012.
[12] C. Sutour, C.-A. Deledalle, and J.-F. Aujol, "Adaptive regularization of the NL-means: Application to image and video denoising," IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3506–3521, 2014.
[13] B. Wen, Y. Li, L. Pfister, and Y. Bresler, "Joint adaptive sparsity and low-rankness on the fly: An online tensor reconstruction scheme for video denoising," in IEEE International Conference on Computer Vision (ICCV), 2017.
[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[15] N. A. W. Dutton, L. Parmesan, A. J. Holmes, L. Grant, and R. K. Henderson, "320x240 oversampled digital single photon counting image sensor," in Symposium on VLSI Circuits, 2014, pp. 1–2.
[16] I. Gyongy, N. Calder, A. Davies, N. Dutton, P. Dalgarno, R. Duncan, C. Rickman, and R. Henderson, "A 256×256, 100 kfps, 61% fill-factor time-resolved SPAD image sensor for microscopy applications," in IEEE International Electron Devices Meeting (IEDM), 2016.
[18] "32 × 32 0.13 µm CMOS low dark-count single-photon avalanche diode array," Optics Express, vol. 18, no. 10, pp. 10257–10269, 2010.
[19] A. Kirmani, D. Venkatraman, D. Shin, A. Colaço, F. N. Wong, J. H. Shapiro, and V. K. Goyal, "First-photon imaging," Science, vol. 343, no. 6166, pp. 58–61, 2014.
[20] D. B. Lindell, M. O'Toole, and G. Wetzstein, "Single-photon 3D imaging with deep sensor fusion," ACM Transactions on Graphics (TOG), vol. 37, no. 4, 2018.
[21] D. B. Lindell, M. O'Toole, and G. Wetzstein, "Towards transient imaging at interactive rates with single-photon detectors," in IEEE International Conference on Computational Photography (ICCP), 2018.
[22] Q. Sun, X. Dun, Y. Peng, and W. Heidrich, "Depth and transient imaging with compressive SPAD array cameras," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 273–282.
[23] A. K. Pediredla, A. C. Sankaranarayanan, M. Buttafava, A. Tosi, and A. Veeraraghavan, "Signal processing based pile-up compensation for gated single-photon avalanche diodes," arXiv preprint arXiv:1806.07437, 2018.
[24] M. Buttafava, J. Zeman, A. Tosi, K. Eliceiri, and A. Velten, "Non-line-of-sight imaging using a time-gated single photon avalanche diode," Optics Express, vol. 23, no. 16, pp. 20997–21011, 2015.
[25] G. Gariepy, F. Tonolini, R. Henderson, J. Leach, and D. Faccio, "Detection and tracking of moving objects hidden from view," Nature Photonics, vol. 10, no. 1, pp. 23–26, 2016.
[26] M. O'Toole, D. B. Lindell, and G. Wetzstein, "Confocal non-line-of-sight imaging based on the light-cone transform," Nature, vol. 555, no. 7696, p. 338, 2018.
[27] G. Satat, M. Tancik, and R. Raskar, "Towards photography through realistic fog," in IEEE International Conference on Computational Photography (ICCP), 2018, pp. 1–10.
[28] E. R. Fossum, "What to do with sub-diffraction-limit (SDL) pixels? A proposal for a gigapixel digital film sensor (DFS)," in IEEE Workshop on Charge-Coupled Devices and Advanced Image Sensors, 2005.
[29] F. Yang, L. Sbaiz, E. Charbon, S. Susstrunk, and M. Vetterli, "Image reconstruction in the gigavision camera," in ICCV Workshops, 2009.
[30] J. H. Choi, O. Elgendy, and S. H. Chan, "Image reconstruction for quanta image sensors using deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
[31] T. Remez, O. Litany, and A. Bronstein, "A picture is worth a billion bits: Real-time image reconstruction from dense binary threshold pixels," in IEEE International Conference on Computational Photography (ICCP), 2016.
[32] R. A. Rojas, W. Luo, V. Murray, and Y. M. Lu, "Learning optimal parameters for binary sensing image reconstruction algorithms," in IEEE International Conference on Image Processing (ICIP), 2017, pp. 2791–2795.
[33] A. Buades, B. Coll, and J.-M. Morel, "Denoising image sequences does not require motion estimation," in IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pp. 70–74.
[34] C. Liu and W. T. Freeman, "A high-quality video denoising algorithm based on reliable motion estimation," in European Conference on Computer Vision, 2010, pp. 706–719.
[35] M. Werlberger, T. Pock, M. Unger, and H. Bischof, "Optical flow guided TV-L1 video interpolation and restoration," in International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2011, pp. 273–286.
[36] M. Protter and M. Elad, "Image sequence denoising via sparse and redundant representations," IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 27–35, 2009.
[37] K. Dabov, A. Foi, and K. O. Egiazarian, "Video denoising by sparse 3D transform-domain collaborative filtering," in European Signal Processing Conference (EUSIPCO), 2007, pp. 145–149.
[38] H. Ji, C. Liu, Z. Shen, and Y. Xu, "Robust video denoising using low rank matrix completion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1791–1798.
[39] H. Ji, S. Huang, Z. Shen, and Y. Xu, "Robust video restoration by joint sparse and low rank matrix approximation," SIAM Journal on Imaging Sciences, vol. 4, no. 4, pp. 1122–1142, 2011.
[40] X. Chen, L. Song, and X. Yang, "Deep RNNs for video denoising," in Applications of Digital Image Processing, International Society for Optics and Photonics, 2016.
[41] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, "Video enhancement with task-oriented flow," arXiv preprint arXiv:1711.09078, 2017.
[42] I. M. Antolovic, S. Burri, C. Bruschini, R. Hoebe, and E. Charbon, "Nonuniformity analysis of a 65-kpixel CMOS SPAD imager," IEEE Transactions on Electron Devices, vol. 63, no. 1, pp. 57–64, 2016.
[43] E. R. Fossum, J. Ma, S. Masoodian, L. Anzagira, and R. Zizza, "The quanta image sensor: Every photon counts," Sensors, vol. 16, no. 8, p. 1260, 2016.
[44] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[45] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley, "Removing rain from single images via a deep detail network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3855–3863.
[46] J. Jiao, W.-C. Tu, S. He, and R. W. Lau, "FormResNet: Formatted residual learning for image restoration," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1034–1042.
[47] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[48] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang, "Deep video deblurring for hand-held cameras," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 237–246.
[49] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[50] K. G. Lore, A. Akintayo, and S. Sarkar, "LLNet: A deep autoencoder approach to natural low-light image enhancement," Pattern Recognition, vol. 61, 2017.