Event-based High Dynamic Range Image and Very High Frame Rate Video Generation using Conditional Generative Adversarial Networks
S. Mohammad Mostafavi I., Lin Wang, Yo-Sung Ho, Kuk-Jin Yoon
Gwangju Institute of Science and Technology (GIST), Korea Advanced Institute of Science and Technology (KAIST)
Abstract
Event cameras have many advantages over traditional cameras, such as low latency, high temporal resolution, and high dynamic range. However, since the outputs of event cameras are sequences of asynchronous events over time rather than actual intensity images, existing algorithms cannot be directly applied. Therefore, it is demanding to generate intensity images from events for other tasks. In this paper, we unlock the potential of event camera-based conditional generative adversarial networks to create images/videos from an adjustable portion of the event data stream. Stacks of space-time coordinates of events are used as inputs, and the network is trained to reproduce images based on the spatio-temporal intensity changes. The usefulness of event cameras to generate high dynamic range (HDR) images even in extreme illumination conditions, and also non-blurred images under rapid motion, is also shown. In addition, the possibility of generating very high frame rate videos is demonstrated, theoretically up to 1 million frames per second (FPS), since the temporal resolution of event cameras is about 1 µs. The proposed methods are evaluated by comparing the results with the intensity images captured on the same pixel grid-line of the events, using online available real datasets and synthetic datasets produced by the event camera simulator.
1. Introduction
Event cameras are bio-inspired vision sensors that mimic the human eye in receiving visual information [14]. While traditional cameras transmit intensity frames at a fixed rate, event cameras transmit the changes of intensity at the time of the changes, in the form of asynchronous events that deliver the space-time coordinates of the intensity changes. They have many advantages over traditional cameras, e.g., low latency in the order of microseconds, high temporal resolution (around 1 µs), and high dynamic range. However, since the outputs of event cameras are sequences of asynchronous events over time rather than actual intensity images, most existing algorithms cannot be directly applied. Thus, although it has been recently shown that event cameras are sufficient to perform some tasks such as 6-DoF pose estimation [24] and 3D reconstruction [22, 11], it will be a great help if we can generate intensity images from events for other tasks such as object detection, tracking, and SLAM.

* These two authors contributed equally.

Figure 1. From left to right, input events, active pixel sensor (APS) images from the DAVIS camera, and our results. Our methods construct HDR images with more details that normal cameras could not reproduce, as in the APS frames. We will show high frame rate video generation results in the supplementary material.

Actually, it has been stated that event cameras, in principle, transfer all the information needed to reconstruct images or a full video stream [2, 25, 24]. However, this statement has never been thoroughly substantiated. Motivated by recent advances of deep learning in image reconstruction and translation, we tackle the problem of generating intensity images from events, and further unlock the potential of event cameras to produce high-quality HDR intensity images and high frame rate videos with no motion blur, which is especially important when robustness to fast motion and to extreme illumination conditions is critical, as in autonomous driving.

To the best of our knowledge, our work is the first attempt focusing on pure events-to-HDR-image and high frame rate video translation, and on proving that event cameras can produce high-quality non-blurred images and videos even under fast motion and extreme illumination conditions. We first propose an event-based domain translation framework that generates better quality images from events compared with active pixel sensor (APS) frames and other previous methods. For this framework, two novel event stacking methods are also proposed based on shifting over the event stream, stacking based on time (SBT) and stacking based on the number of events (SBE), such that we can reach high frame rate and HDR representation with no motion blur, which is, in contrast, impossible for normal cameras. It turns out that it is possible to generate a video with up to 1 million FPS using these stacking methods. To verify the robustness of the proposed methods, we conduct intensive experiments and evaluation/comparison. In the experiments, real datasets from a dynamic and active-pixel vision sensor, DAVIS, which is a joint event and intensity camera [20], are used. The sensor's pixel grid-line of the events and the intensity are at the same location, which helps reduce extra steps of rectification and warping for adjusting the two images to each other. We make an open dataset that includes more than 17K images captured by the DAVIS camera to learn a generic model for event-to-image/video translation. In addition, we make a synthetic dataset containing 17K images by using the event camera simulator [23] for experiments.
2. Related work
One of the early attempts at visually interpreting or reconstructing the intensity image from events is the work by Cook et al. [6], in which recurrently interconnected areas called maps were utilized to interpret intensity and optic flow. The model guides the network of relations between the maps of optical flow, intensity, spatial and temporal intensity derivative, camera calibration, and the 3D rotation to converge towards a global mutual consistency. Kim et al. [10] used pure events on rotation-only scenes to track the camera and also built a super-resolution accurate mosaic of the scene based on probabilistic filtering. In [3], intensity images were reconstructed using a patch-based sparse dictionary both on simulated and real event data in the presence of noise. Bardow et al. [2] took a few steps further by reconstructing the intensity image and the motion field for generic motion, in contrast to previous rotation-only schemes. They proposed to minimize a cost function defined with events and spatiotemporal regularization terms on a sliding window interval of the event stream. Moreover, they reached a near real-time implementation on GPU. Meanwhile, Reinbacher et al. [25] introduced a variational denoising framework that iteratively filters incoming events. They guided the events through a manifold regarding their timestamps to reconstruct the image. In comparison to [2], their method yields more grayscale variations in untextured areas and recovers more details, and their GPU-based algorithm can also perform in real time. Measurements and simulations on an event camera with RGBW color filters were proposed by Moeys et al. in [19]. They presented naive and computational methods for reconstructing the intensity image. The former requires an initial APS image from the event camera and updates the image with the incoming events, but does not produce sharp edges, and background noise has a negative effect on the outputs. The latter, on the other hand, creates better results based on an iterative scheme that creates a regularized image by solving the Poisson equation for the divergence of the intensity image, and can run in real time. The aforementioned methods did create intensity images mainly from pure events; however, the reconstructions were not photorealistic. Recently, Shedligeri et al. [28] introduced a hybrid method that fuses intensity images and events to create photorealistic images. Their method relies on a set of three autoencoders. This method produces promising results for normally illuminated scenes, but it fails in recovering HDR scenes under extreme illumination conditions since it only utilizes event data for finding the 6-DoF pose.
Although deep learning has not yet been widely applied to event-based vision, some recent studies have demonstrated that deep learning performs successfully with event data. Moeys et al. [18] utilized both event data and APS images to train a convolutional neural network (CNN) for controlling the steering of a predator robot. Other methods for steering prediction for self-driving cars, using pure events and/or incorporating the APS images in an end-to-end fashion, have also been studied in [4, 15]. On the other hand, a stacked spatial LSTM network was introduced in [22], which relocalizes the 6-DoF pose from events, and optical flow estimation based on a self-supervised encoder-decoder network was proposed in [33]. Supervised learning is adopted to create pseudo labels for detecting objects under ego-motion in [5]. The pseudo labels are transferred to the event image by training a CNN on APS images. And, as mentioned in the previous section, the fusion of event data and APS images was introduced in [28], which utilized autoencoders to create photorealistic images. To the best of our knowledge, we are the first to apply generative adversarial networks to event data.
Actually, there is no qualitative research showing the effectiveness of conditional GANs (cGANs) on event data. Prior works have focused on cGANs for image prediction from a normal map [29], future frame prediction [16], and image generation from sparse annotations [9]. The difference between using GANs for image-to-image translation conditionally and unconditionally is that unconditional GANs rely heavily on the constraining loss function to control the output to be conditioned. cGANs have been successfully applied to style transfer [13, 1, 8, 34, 12] in the frame image domain, and these applications mostly focused on converting an image from one representation to another in a supervised setting. Besides, this requires input-output pairs for graphics tasks while assuming some relationship between the domains. When it comes to event vision, cGANs have not yet been examined qualitatively and quantitatively, and therefore we seek to unlock the potential of cGANs for image reconstruction based on event data. However, since the general approach for frame-based image translation is typically different from the event-based one, we first propose a deep learning framework to accomplish this task and fully take advantage of an event camera, such as its low latency, high temporal resolution, and high dynamic range, with the proposed framework. We then qualitatively and quantitatively evaluate the proposed framework with real and synthetic datasets.
3. Proposed method
To reconstruct HDR and high temporal resolution images and videos from events, we exploit currently available deep learning models, such as cGANs, as potential solutions for event vision. cGANs are generative models that learn a mapping from an observed image x and a random noise vector z to the output image y, G : {x, z} → y. The generator G is trained to produce outputs that are not distinguishable from original images by an adversarially trained discriminator D [7]. The objective is to minimize the distance between the ground truth and the output of the generator, and to maximize the observation from the discriminator. cGANs such as Pix2Pix [8] and CycleGANs [34] have proved their capability in image-to-image translation, bringing breakthrough results. The key strength of cGANs is that there is no need to tailor the loss function to a given specific task; the network can generally adapt its own learned loss to the data domain where it is trained. However, event data is quite different from the data used for traditional vision approaches based on cGANs, so we first propose new methods that can provide off-the-shelf inputs for neural networks in Sec. 3.1 and then build a network in Sec. 3.2.

In an event camera, each event e is represented as a tuple (u, v, t, p), where u and v are the pixel coordinates, t is the timestamp of the event, and p = ±1 is the polarity of the intensity change. We consider the 3D space-time volume of events p = p(u, v, t) over some time duration, ensuring event data enough for image reconstruction. When denoting the temporal resolution of an event camera by δt and the time duration by t_d, the size of the 3D volume is (w, h, n), where w and h represent the spatial resolution of the event camera and n = t_d / δt. This is equivalent to having an n-channel image input for the network. This representation preserves all the information about the events. However, the problem is that the number of channels is very large. For example, when t_d is set to 10 ms, then n is about 10K, which is extraordinarily large, since the temporal resolution of an event camera is about 1 µs. For this reason, we construct the 3D event volume with a small n by forming each channel via merging and stacking the events within a small time interval. Event stacking can be done in different ways, but the temporal information of events is necessarily sacrificed in return.

In this approach (SBT), the streaming events in-between the time references of two consecutive intensity images (APS) of the event camera, denoted as ∆t, are merged. But not all events are merged into a single frame. Instead, the time duration of the event stream is divided into n equal-scale portions, and then n grayscale frames, S_p^i(u, v), i = 1, 2, .., n, are formed by merging the events in each time interval [(i−1)∆t/n, i∆t/n]. S_p^i(u, v) is the sum of the polarity (p) values at (u, v). These n grayscale frames are stacked together again to form one stack S_p(u, v, i) = S_p^i(u, v), i = 1, 2, .., n, which is fed to the network as the input. As mentioned, this stacking method loses the time information of events within the time interval ∆t/n. However, the stack itself, as the sequence of frames from one to n, still holds the temporal information to some extent. Therefore, a larger n can keep more temporal information. Fig. 2 illustrates how to merge and stack the events. When n = 3 (i.e., stacking frames F_A, F_B, and F_C into one stack), the stack can be visualized as a pseudo color frame, as shown in the left part of Fig. 2 above the APS image.
Based on the time shown at the event manifold in the middle of Fig. 2, starting from time zero on the 3D view, the location of the APS image is around the location of the third red rectangle near 0.03 sec (the frame rate of the APS image is 33 FPS). Unfortunately, SBT brings an intrinsic limitation originating from the event camera, which is the lack of events when there is no movement of the scene or the camera. When the event data within the time interval are not enough for the image reconstruction, it is inevitably hard to get good HDR images. This is the case for the fourth and fifth frames of the event stream at the left of Fig. 2. Furthermore, another flaw comes from the case of having too many events in one time frame, as in the third time frame.

Figure 2. The event stream and construction of stacks by SBT and SBE. Two main color tuples of (Red(+), Blue(-)) and (Green(+), Cyan(-)) express the event polarity (plus, minus) throughout this paper. In the main 3D view two types of stacking (SBT on the left and SBE on the right) are shown using the yellow highlighted time. The 3D view followed by its side view are color coded with (Red, Blue) and (Green, Cyan) periodically (every 5000 events) for better visualization. All the images and plotted data are from the "hdr boxes" sequence of [20].

SBE more closely matches the nature of an event camera, which is being asynchronous in time, and can overcome the aforementioned limitations of SBT. In this method, a frame is formed by merging the events based on the number of incoming events, as illustrated in Fig. 2. The first N_e events are merged into frame 1 and the next N_e events into frame 2, and this is continued up to frame n to create one stack of n frames. Then, this n-frame stack containing nN_e events in total is used as an input to the network. This method guarantees event data rich enough to reconstruct images, depending on the N_e value. F_E, F_F, F_G, and F_H in Fig. 2 are frames formed from different numbers of events. Since we count the number of events over time, we can adaptively adjust the number of events in each frame and also in one stack.

Both SBT and SBE can be applied for video reconstruction from events using the proposed network, and in both methods the frame rate of the output video can be adjusted by controlling the amount of time shift between two adjacent event stacks used as inputs to the network. When the events in the time interval [i − ∆t, i] are used for one input stack for the image I(i) in a video, the next input stack for the image I(i + t_s) in the video can be constructed by using the events in the time interval [i − ∆t′, i + t_s] (for SBT, ∆t′ = ∆t − t_s), with the time shift t_s. Then, the frame rate of the output video becomes 1/t_s. It is also worthy of notice that the two stacks have a large time overlap [i − ∆t′, i] with duration ∆t′. If ∆t′ ≫ t_s, temporal consistency is naturally enforced for nearby frames. Since the temporal resolution of an event camera is about 1 µs, we can reach up to one million FPS video with temporal consistency. This will be demonstrated in Sec. 4.
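To make the two stacking schemes concrete, the following is a minimal NumPy sketch of SBT and SBE (not the authors' released code); it follows the description above by summing signed polarities per pixel into one float channel per sub-interval or per event batch, and function names such as stack_sbt and stack_sbe are ours.

```python
import numpy as np

def stack_sbt(events, t_start, t_end, n, h, w):
    """Stacking Based on Time: split [t_start, t_end] into n equal
    sub-intervals and sum the signed polarities per pixel in each one."""
    stack = np.zeros((n, h, w), dtype=np.float32)
    dt = (t_end - t_start) / n
    for u, v, t, p in events:                      # one event = (x, y, timestamp, +/-1)
        if t_start <= t < t_end:
            i = min(int((t - t_start) / dt), n - 1)
            stack[i, int(v), int(u)] += p
    return stack

def stack_sbe(events, n, num_events, h, w):
    """Stacking Based on the number of Events: each of the n channels
    merges the next num_events events, regardless of how long they span."""
    stack = np.zeros((n, h, w), dtype=np.float32)
    for k, (u, v, t, p) in enumerate(events[: n * num_events]):
        stack[k // num_events, int(v), int(u)] += p
    return stack
```

With n = 3 and 20K events per channel, for instance, stack_sbe would produce 3-channel stacks of 60K events, the setting used later in the experiments.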
In this paper, we describe our generator and discriminator motivated by [13]. Details of the architectures, including the size of each layer, can be found in Fig. 3 and Fig. 4.

The core of the event-to-image translation is how to map a sparse event input to a dense HDR output with details, sharing the same structural image features, such as edges, corners, blobs, etc. An encoder-decoder network is the most commonly used network for image-to-image translation tasks. The input is continuously downsampled through the network, and then upsampled back to obtain the translated result. Since, in the event-to-image translation problem, a large amount of important high-frequency information from the event data passes through the network, it is likely that detailed event features are lost during this process and that noise is induced in the outputs. For that reason, we consider an approach similar to the one proposed in [8], where we further add skip connections to form a "U-net" network structure [26]. In Fig. 3, the detailed information, including the number of layers and the input/output dimensions, is depicted.

Figure 3. Generator network: a U-network [26, 8] architecture (with skip connections) that takes an input with the dimension of 256 × 256 × n.
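As an illustration of the generator just described, here is a compact PyTorch sketch of an encoder-decoder with skip connections in the U-net style; the exact depth, channel widths, and layer sizes of Fig. 3 are not reproduced, so the values below (three resolution levels, base width 64) are assumptions made for readability.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # two 3x3 convolutions with BatchNorm and ReLU, keeping the spatial size
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class UNetGenerator(nn.Module):
    """Maps an n-channel event stack (e.g., 256 x 256 x n) to a 1-channel intensity image."""
    def __init__(self, in_ch=3, out_ch=1, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)      # upsampled features + skip connection
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                       # full resolution
        e2 = self.enc2(self.pool(e1))                           # 1/2 resolution
        e3 = self.enc3(self.pool(e2))                           # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))    # skip from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))    # skip from e1
        return torch.tanh(self.out(d1))                         # intensity in [-1, 1]
```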
The function of the discriminator is to classify a generated image from events as real or fake. In other words, the generator is trying to maximize the chance of the discriminator misclassifying the image generated from events, and the discriminator is, in turn, trying to maximize its chances of correctly classifying the incoming generated image. Our network originates from the network in [31]. Fig. 4 illustrates the details of our network architecture. Our discriminator can be considered as a way to minimize the style transfer loss between events and intensity images. Mathematically, the objective function is defined as

L_eGAN(G, D) = E_{e,g}[log D(e, g)] + E_{e,ε}[log(1 − D(G(e, ε)))],   (1)

where e indicates the original events, g indicates the generated image, and ε indicates the Gaussian noise given as input to the generator. Meanwhile, G tries to minimize the difference of images from events, and D tries to maximize it. Here, for regularization, the L1 norm is used to shrink blurring as

L_L1(G) = E_{e,g,ε}[ ||g − G(e, ε)||_1 ].   (2)

This L1 norm is aimed at making the discriminator focus more on the high-frequency structure of the images generated from events. Eventually, the objective is to estimate the total loss for event-to-image translation as

G* = arg min_G max_D [ L_eGAN(G, D) + λ L_L1(G) ],   (3)

where λ is a parameter that weights the L1 term. With the noise ε, the network can learn a mapping from the events e and ε to g, which matches the distribution based on events and helps to produce more deterministic outputs.
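A minimal PyTorch sketch of one evaluation of Eqs. (1)-(3) is given below; the two networks stand for the generator and discriminator above, the noise ε is assumed to be injected inside the generator (e.g., via dropout), and the weight lam = 100 is a common Pix2Pix-style choice rather than a value reported by the paper.

```python
import torch
import torch.nn.functional as F

def gan_losses(generator, discriminator, event_stack, target_image, lam=100.0):
    """Adversarial loss of Eq. (1) plus the lambda-weighted L1 term of Eqs. (2)-(3)."""
    fake = generator(event_stack)                                    # G(e, eps)
    # The discriminator scores (event stack, image) pairs, as in D(e, g).
    d_real = discriminator(torch.cat([event_stack, target_image], dim=1))
    d_fake = discriminator(torch.cat([event_stack, fake.detach()], dim=1))
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # The generator tries to fool D and to stay close to the target in L1.
    d_fake_g = discriminator(torch.cat([event_stack, fake], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g)) \
           + lam * F.l1_loss(fake, target_image)
    return d_loss, g_loss
```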
Our training and test datasets are prepared from three kinds of sources. We create the first group of datasets by referring to [20], where many real-world scenes are included. We also make the second group of datasets by ourselves, for various training and test purposes and also for opening to the public afterwards. These datasets are captured using the DAVIS camera and cover many series of scenarios. The third type of dataset is generated with ESIM [23], an open-source event camera simulator. The real datasets contain many different indoor and outdoor scenes captured with various rotations and translations of the DAVIS camera. Our training data consist of pairs of stacked events, as explained in Sec. 3.1, together with the APS frames from the real-world scenes and the ground truth (GT) frames generated in ESIM. Here, to use real data for training the network, we carefully prepare the training data to refrain the network from learning improper properties of the APS frames. Actually, APS frames suffer from motion blur under fast motion, and also have a limited dynamic range resulting in the loss of details, as shown in Fig. 1. Therefore, directly using the real APS frames as ground truth is not a good way to train the network, since our goal is to produce HDR images with less blur by fully exploiting the advantages of event cameras. For that reason, the events relevant to the black and white regions of the training data are removed from the input to make the network learn to generate HDR images from events. In addition, the APS images are classified as blurred and non-blurred based on BRISQUE scores (explained later) and manual inspection, and we refrain from using the blurred APS images in the training set. The simulated sequences are mainly generated from ESIM, where events are produced while a virtual camera moves in all directions to capture different scenes in given images. Since the events and APS images are generated from a controlled simulation environment, the APS frames are counted directly as the ground truth for image reconstruction. Therefore, the aforementioned training data refinement is not required for simulated datasets.
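The pairing and filtering described above can be sketched as follows; brisque_score and the blur threshold are placeholders for the paper's BRISQUE-plus-manual-inspection step, and stack_sbt refers to the stacking sketch in Sec. 3.1, so this is only an assumed outline of the data preparation, not the authors' pipeline.

```python
def build_training_pairs(aps_frames, aps_times, events, n, h, w,
                         brisque_score, blur_threshold=45.0):
    """Pair each APS frame with the event stack ending at its timestamp,
    skipping frames that a no-reference quality score flags as blurred."""
    pairs = []
    for k in range(1, len(aps_frames)):
        if brisque_score(aps_frames[k]) > blur_threshold:     # assumed: higher score = worse
            continue                                          # drop blurred APS frames
        stack = stack_sbt(events, aps_times[k - 1], aps_times[k], n, h, w)
        pairs.append((stack, aps_frames[k]))
    return pairs
```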
4. Experiments and evaluation
To explore the capability of our method, we conduct intensive experiments on the datasets described in Sec. 3.3, and also use another open-source dataset with three real sequences (face, jumping, and ball) [2] for comparison. We create a training dataset of about 60K event stacks with corresponding APS image pairs based on their precise timestamps, and test our method both on scenes with normal illumination and on HDR scenes. From both the real and simulated datasets, we randomly chose 1,000 APS or ground truth images with corresponding event stacks, not used in the training step, for testing. Here, it is worthy of notice that, since the real datasets do not include ground truth images for training and testing, we use their APS images as ground truth for training purposes. However, the APS image itself suffers from motion blur and low dynamic range. Thus, using APS images might not be the best way for training and also for evaluating the results. For that reason, we prepare the training APS images as described in Sec. 3.3, and assess the results using the structural similarity (SSIM) [30] and feature similarity (FSIM) [32], computed by comparing the results with APS images, as well as the no-reference quality measure BRISQUE [17], which assesses the naturalness of images. On the other hand, to assess the similarity between ground truth and generated images for the synthetic datasets created using ESIM [23], each ground truth image is matched with the corresponding reconstructed image with the closest timestamp, as mentioned in [27]. The SSIM, FSIM, and the peak signal-to-noise ratio (PSNR) are adopted to evaluate non-HDR scenes and scenes for which we have reliable ground truth.

We compare the two event stacking methods, SBT and SBE, using our real datasets. 17K event stack-APS image pairs are used for training, where we set ∆t for SBT to 0.03 s and the number of events in one stack to 60K for SBE. To clearly see the effect of the stacking method, the number of frames (n) in one stack is set to 3 for both methods. Fig. 5 shows reconstructed images on our real-world datasets using SBE and SBT, respectively, for qualitative comparison. It is shown that our methods (both SBT and SBE) are robust enough to reconstruct the images on different sequences, and the generated images are quite close to the APS images considered as ground truth. Our methods could successfully reconstruct shapes, the appearance of humans, buildings, etc. When comparing SBT and SBE, SBE produces better results in general. Table 1 shows the quantitative evaluation results of using SBE. Note that large SSIM and FSIM values in Table 1 do not always mean better output quality, because they only measure similarity to APS images that themselves suffer from motion blur and low dynamic range.

Table 1. Quantitative evaluation of SBE on real-world datasets.
                BRISQUE      FSIM      SSIM
Ours (n = 3)    37.79 ±

Figure 5. Reconstruction results using input event stacks (visualized as pseudo color images) on different real-world sequences [20]. From top to bottom, APS images as ground truth, event stacks using SBE, reconstructed images with SBE, event stacks using SBT, and reconstructed images with SBT.

In Sec. 4.1, we investigate the potential of our method on real-world data. Based on the results in Sec. 4.1, we find that SBE is more robust than SBT. Therefore, we conduct the following experiments based on SBE and show the robustness of our method on datasets from ESIM [23], which can generate a large amount of reliable event data. Since the simulator produces noise-free APS images with corresponding events for a given image, the APS images can be regarded as ground truth, which allows evaluating the results quantitatively. In addition, although our method is capable of stacking any number of frames (n) into a stack, we choose the number of channels n = {1, 3} to examine the effect of different numbers of channels. The number of events in one stack is set to 60K.
3. It is shown that our method with n = n =
1, proving thathaving more frames in one stack really improves the per-formance since it can preserve more temporal informationas mentioned in Sec. 3.1. In Fig. 6, we show a few recon-structed images as well as input event stacks and groundtruth images. One thing needs to mention is that the facereconstructed with n = We also qualitatively compare our methods on the se-quences ( face, jumping and ball ) with the results of manifold
Table 2. Experiments on ESIM (simulator) datasets. Having more frames in one stack yields better results.
                PSNR (dB)    FSIM      SSIM
Ours (n = 1)    20.51 ±
Ours (n = 3)

Table 3. Quantitative comparison of our method to the methods in [2] and [21]. The reported numbers are the mean and standard deviation of the BRISQUE measure applied to all reconstructed frames of the sequences. Our method shows better BRISQUE scores for all sequences.
Sequence        Face         Jumping     Ball
Bardow [2]      22.27 ±      29.39 ±
MR [21]
Ours (n = 3)
We also quantitatively compare our method (n = 3) on the sequences (face, jumping and ball) to the results of MR [21] and IE [2] in Table 3. The results are quite consistent with the visual impression of Fig. 7. Our outputs on all of the face, jumping, and ball sequences show much more detail and result in relatively better BRISQUE scores.
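For reference, the evaluation protocol used in this section (matching each ground-truth frame to the reconstruction with the closest timestamp, then averaging full-reference metrics) can be sketched with scikit-image as follows; FSIM and BRISQUE are omitted because they have no single standard implementation in that package.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(recon_images, recon_times, gt_images, gt_times):
    """Match ground truth to the closest-timestamp reconstruction and average SSIM/PSNR."""
    ssim_vals, psnr_vals = [], []
    recon_times = np.asarray(recon_times)
    for gt, t in zip(gt_images, gt_times):
        idx = int(np.argmin(np.abs(recon_times - t)))          # closest timestamp
        rec = recon_images[idx]
        ssim_vals.append(structural_similarity(gt, rec, data_range=255))
        psnr_vals.append(peak_signal_noise_ratio(gt, rec, data_range=255))
    return float(np.mean(ssim_vals)), float(np.mean(psnr_vals))
```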
5. Discussion
Although creating intensity images from an event stream is itself challenging, the resulting images can also be used for other vision tasks such as object recognition, tracking, 3D reconstruction, self-driving, SLAM, etc. In that sense, the proposed method can be applied to many applications that use event cameras. Here, since the proposed method can fully exploit the advantages of event cameras, such as high temporal resolution and high dynamic range, it can generate HDR images, even better than APS images, and very high frame rate videos, as mentioned in Sec. 3.1.3, greatly increasing the usefulness of the proposed method.
Events to HDR image:
In this paper, it is clearly shown that event stacks carry rich information for HDR image reconstruction. In many cases, some parts of the scene are not visible in the APS image because of its low dynamic range, but many events really exist in those regions in the event camera, as in the region under the table in Fig. 1 or the checkerboard pattern at the top left part of the stacked image in Fig. 2. Although both examples are from dark illumination, normal cameras also fail in rather bright illumination. Figure 8 shows the ability of the proposed method for HDR image generation in such cases.

Figure 6. Reconstructed outputs from the inputs generated by ESIM [23]. The first row shows the ground truth, and the second row shows the input events and reconstructed images using 1 frame per stack (n = 1).

Figure 7. Qualitative comparison with Reinbacher et al. [25] and Munda et al. [21], which both utilize the dataset from Bardow et al. [2]. Since Reinbacher et al. did not open their source codes, we directly get the results from Munda et al. [21]. The odd-number images are the results from Bardow et al. [2], and the even-number images are the results of our method. We can easily see that our method produces more details (e.g., face, beard, jumping pose, etc.) as well as more natural gray variations in less textured areas.

Figure 8. HDR imaging against direct sunlight (extreme illumination). Left to right: APS, event stack, our reconstruction result (sequence from [27]).

Events to high frame rate video:
Motion blur due to the fast motion of a camera or the scene is one of the challenging problems, and it makes vision methods unreliable. However, our method can actually generate very high frame rate (HFR) videos with much less motion blur under fast motion, as mentioned in Sec. 3.1.3. To prove this ability, we conducted tracking experiments using the reconstructed HFR video: with the event-based high frame rate video reconstruction framework, we can recover the clear motion of a star-shaped object attached to a fan with a rotation speed of 13000 RPM, and the result shows that it is capable of generating the motion at up to 1 million FPS. The qualitative results will be shown in the supplementary material.
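A sketch of the sliding-window video generation of Sec. 3.1.3 is given below: consecutive input stacks overlap heavily and are shifted by t_shift, so the output frame rate is 1/t_shift. Here generate_frame stands for any callable that maps a stack to an image (e.g., the trained generator with the appropriate tensor conversion), and stack_sbt is the stacking sketch from Sec. 3.1; names and defaults are ours.

```python
def generate_video(generate_frame, events, t_start, t_end, window, t_shift, n, h, w):
    """Slide a window of length `window` over the event stream in steps of t_shift;
    each shifted stack yields one output frame, so the frame rate is 1 / t_shift
    (e.g., t_shift = 1e-6 s would correspond to the 1 MFPS upper bound)."""
    frames = []
    t = t_start + window
    while t <= t_end:
        stack = stack_sbt(events, t - window, t, n, h, w)   # heavily overlapping stacks
        frames.append(generate_frame(stack))                # one reconstructed frame
        t += t_shift
    return frames
```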
6. Conclusion
We demonstrated how our cGANs-based approach can benefit from the properties of event cameras to accurately reconstruct HDR non-blurred intensity images and high frame rate videos from pure events. We first proposed two event stacking methods (SBT and SBE) for both image and video reconstruction from events using the network. We then showed the advantages of using event cameras to generate high dynamic range images and high frame rate videos through experiments based on our datasets made of online available real-world sequences and the simulator. In order to show the robustness of our method, we compared our cGANs-based event-to-image framework with other existing reconstruction methods and showed that our method outperforms them on publicly available datasets. We also showed that it is possible to generate high dynamic range images even in extreme illumination conditions and also non-blurred images under rapid motion.
References

[1] A. Atapour-Abarghouei and T. P. Breckon. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 18, page 1, 2018.
[2] P. Bardow, A. J. Davison, and S. Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 884-892, 2016.
[3] S. Barua, Y. Miyatani, and A. Veeraraghavan. Direct face detection and video reconstruction from event cameras. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pages 1-9. IEEE, 2016.
[4] J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD17: End-to-end DAVIS driving dataset. arXiv preprint arXiv:1711.01458, 2017.
[5] N. F. Chen. Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 644-653, 2018.
[6] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger. Interacting maps for fast visual interpretation. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 770-776. IEEE, 2011.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
[9] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
[10] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison. Simultaneous mosaicing and tracking with an event camera. J. Solid State Circ, 43:566-576, 2008.
[11] H. Kim, S. Leutenegger, and A. J. Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In European Conference on Computer Vision, pages 349-364. Springer, 2016.
[12] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, volume 2, page 4, 2017.
[13] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702-716. Springer, 2016.
[14] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128 × 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566-576, 2008.
[15] A. I. Maqueda, A. Loquercio, G. Gallego, N. Garcıa, and D. Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5419-5427, 2018.
[16] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[17] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695-4708, 2012.
[18] D. P. Moeys, F. Corradi, E. Kerr, P. Vance, G. Das, D. Neil, D. Kerr, and T. Delbrück. Steering a predator robot using a mixed frame/event-driven convolutional neural network. In Event-based Control, Communication, and Signal Processing (EBCCSP), 2016 Second International Conference on, pages 1-8. IEEE, 2016.
[19] D. P. Moeys, C. Li, J. N. Martel, S. Bamford, L. Longinotti, V. Motsnyi, D. S. S. Bello, and T. Delbruck. Color temporal contrast sensitivity in dynamic vision sensors. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pages 1-4. IEEE, 2017.
[20] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research, 36(2):142-149, 2017.
[21] G. Munda, C. Reinbacher, and T. Pock. Real-time intensity-image reconstruction for event cameras using manifold regularisation. International Journal of Computer Vision, 126(12):1381-1393, 2018.
[22] A. Nguyen, T.-T. Do, D. G. Caldwell, and N. G. Tsagarakis. Real-time 6DOF pose relocalization for event cameras with stacked spatial LSTM networks. arXiv preprint.
[23] H. Rebecq, D. Gehrig, and D. Scaramuzza. ESIM: an open event camera simulator. In Conference on Robot Learning, pages 969-982, 2018.
[24] H. Rebecq, T. Horstschaefer, G. Gallego, and D. Scaramuzza. EVO: A geometric approach to event-based 6-DoF parallel tracking and mapping in real time. IEEE Robotics and Automation Letters, 2(2):593-600, 2017.
[25] C. Reinbacher, G. Graber, and T. Pock. Real-time intensity-image reconstruction for event cameras using manifold regularisation. arXiv preprint arXiv:1607.06283, 2016.
[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[27] C. Scheerlinck, N. Barnes, and R. Mahony. Continuous-time intensity estimation using event cameras. arXiv preprint arXiv:1811.00386, 2018.
[28] P. A. Shedligeri, K. Shah, D. Kumar, and K. Mitra. Photorealistic image reconstruction from hybrid intensity and event based sensor. arXiv preprint arXiv:1805.06140, 2018.
[29] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318-335. Springer, 2016.
[30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.
[31] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, pages 2868-2876, 2017.
[32] L. Zhang, L. Zhang, X. Mou, D. Zhang, et al. FSIM: a feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378-2386, 2011.
[33] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis. EV-FlowNet: Self-supervised optical flow estimation for event-based cameras. arXiv preprint arXiv:1802.06898, 2018.
[34] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.