Deep Learning with Energy-efficient Binary Gradient Cameras
Suren Jayasuriya*,‡, Orazio Gallo*, Jinwei Gu*, Jan Kautz*
*NVIDIA, ‡Carnegie Mellon University
Abstract
Power consumption is a critical factor for the deployment of embedded computer vision systems. We explore the use of computational cameras that directly output binary gradient images to reduce the portion of the power consumption allocated to image sensing. We survey the accuracy of binary gradient cameras on a number of computer vision tasks using deep learning. These include object recognition, head pose regression, face detection, and gesture recognition. We show that, for certain applications, accuracy can be on par or even better than what can be achieved on traditional images. We are also the first to recover intensity information from binary spatial gradient images—useful for applications with a human observer in the loop, such as surveillance. Our results, which we validate with a prototype binary gradient camera, point to the potential of gradient-based computer vision systems.
1. Introduction
Recent advances in deep learning have significantly improved the accuracy of computer vision tasks such as visual recognition, object detection, segmentation, and others. Leveraging large datasets of RGB images and GPU computation, many of these algorithms now match, or even surpass, human performance. This accuracy increase makes it possible to deploy these computer vision algorithms in the wild. Power consumption, however, remains a critical factor for embedded and mobile applications, where battery life is a key design constraint. For instance, Google Glass operating a modern face recognition algorithm has a battery life of less than 40 minutes, with image sensing and computation each consuming roughly 50% of the power budget [24]. Moreover, research in computer architecture has focused on energy-efficient accelerators for deep learning, which reduce the power footprint of neural network inference to the mW range [13, 3], bringing them into the same range of power consumption as image sensing.

Figure 1: Two of the tasks we study in the context of binary gradient images. Insets (a) and (d) are traditional pictures of the scene. Inset (b) is a simulated spatial binary gradient, and (e) a simulated temporal binary gradient. From these we can reconstruct the original intensity image (c) or perform gesture recognition (f). We also used real data captured with the prototype shown on the right. Inset (f) is from Molchanov et al. [26].

When the computer vision algorithms are too computationally intensive, or would require too much power for the embedded system to provide, the images can be uploaded to the cloud for off-line processing. However, even when using image or video compression, the communication cost can still be prohibitive for embedded systems, sometimes by several orders of magnitude [30]. Thus an image sensing strategy that reduces the amount of captured data can have an impact on the overall power consumption that extends beyond just acquisition and processing.

A large component of the image sensing power is burned to capture dense images or videos, meaning that each pixel is associated with a value of luminance, color component, depth, or other physical measurement. Not all pixels, however, carry valuable information: pixels capturing edges tend to be more informative than pixels in flat areas. Recently, novel sensors have been used to feed gradient data directly to the computer vision algorithms [5, 36]. In addition, there has been growing interest in event-based cameras such as those proposed by Lichtsteiner et al. [23]. These cameras consume significantly less power than traditional cameras, record binary changes of illumination at the pixel level, and only output pixels when they become active. Another particularly interesting type of sensor was proposed by Gottardi et al. [11]. This sensor produces a binary image where only the pixels in high-gradient regions become active; depending on the modality of operation, only active pixels, or pixels that changed their activity status between consecutive frames, can then be read. The resulting images appear like binary edge images, see Figure 2.

Figure 2: A traditional image (left) and an example of real spatial binary gradient data (right). Note that these pictures were taken with different cameras and lenses and, thus, do not exactly match.

While these designs allow for a significant reduction of the power required to acquire, process, and transmit images, they also limit the information that can be extracted from the scene. The question, then, becomes whether this results in a loss of accuracy for the computer vision algorithms, and if such loss is justified by the power saving. In this paper, we focus on two aspects related to the use of binary gradient cameras for low-power, embedded computer vision applications.

First, we explore the tradeoff between energy and accuracy this type of data introduces on a number of computer vision tasks. To avoid having to hand-tune traditional computer vision algorithms to binary gradient data, we use deep learning approaches as benchmarks, and leverage the networks' ability to learn by example. We select a number of representative tasks, and analyze the change in accuracy of established neural network-based approaches when they are applied to binarized gradients.

Second, we investigate whether the intensity information can be reconstructed from these images in post-processing, for those tasks where it would be useful for a human to visually inspect the captured image, such as long-term video surveillance on a limited power budget.
Unlike other types of gradient-based sensors, intensity reconstruction is an ill-posed problem for our type of data because both the direction and the sign of the gradient are lost, see Section 5. To the best of our knowledge, in fact, we are the first to show intensity reconstruction from single-shot, spatial binary gradients.

We perform our formal tests simulating the output of the sensor on existing datasets, but we also validate our findings by capturing real data with the prototype developed by Gottardi et al. [11] and described in Section 3.1.

We believe that this paper presents a compelling reason for using binary gradient cameras in certain computer vision tasks, to reduce the power consumption of embedded systems.
2. Related Work
We describe the prior art in terms of the gradient cameras that have been proposed, and then in terms of computer vision algorithms developed for this type of data.
Gradient cameras can compute spatial gradients either in the optical domain [5, 39, 18], or on-board the image sensor, a technique known as focal plane processing [4, 22, 28, 15]. The gradients can be either calculated using adjacent pixels [11] or using current-mode image sensors [12]. Some cameras can also compute temporal gradient images, i.e., images where the active pixels indicate a temporal change in local contrast [11, 23]. Most of these gradient cameras have the side benefits of fast frame rates and reduced data bandwidth/power due to the sparseness of gradients in a scene. In fact, the camera by Lichtsteiner et al. can read individual pixels when they become active [23]. Moreover, the fact that gradient cameras output a function of the difference of two or more pixels, rather than the pixel values themselves, allows them to deal with high-dynamic-range scenes.
Applications of gradient cameras were first exposited in the work by Tumblin et al., who described the advantages of reading pixel differences rather than absolute values [35]. A particular area of interest for temporal binary gradients and event-based cameras is SLAM (simultaneous localization and mapping) and intensity reconstruction. Researchers have shown SLAM [37], simultaneous intensity reconstruction and object tracking [16], combined optical flow and intensity reconstruction [2], and simultaneous depth, localization, and intensity reconstruction [17]. In addition, some early work has focused on using spiking neural networks for event-based cameras [29]. The common denominator to all of these techniques is that the camera, or at least the scene, must be dynamic: the sensor does not output any information otherwise. For tradeoffs between energy and visual recognition accuracy, recent work proposed optically computing the first layer of convolutional neural networks using Angle Sensitive Pixels [5]. However, the camera required slightly out-of-focus scenes to perform this optical convolution and did not work with binary gradient images.

In our work, we focus on the camera proposed by Gottardi et al. [11], which can produce spatial binary gradients, and can image static scenes as well as dynamic ones. Gasparini et al. showed that this camera can be used as a long-lifetime node in wireless networks [10]. This camera was also used to implement a low-power people counter [9], but only in the temporal gradient modality (see Section 3.1).
3. Binary Gradient Cameras
In this section, we define the types of binary gradient images we are considering, and we analyze the power and high-dynamic-range benefits of such cameras.
For spatial binary gradients, we refer to cameras where a pixel becomes active when a local measure of contrast is above threshold. Specifically, for two pixels i and j, we define the difference Δ_{i,j} = |I_i − I_j|, where I is the measured pixel brightness. We also define a neighborhood ν consisting of pixel P and the pixels to its left, L, and top, T (see inset). The output at pixel P is then

G_S(P) = \begin{cases} 1 & \text{if } \max_{i,j \in \nu} \Delta_{i,j} > T \\ 0 & \text{otherwise}, \end{cases}    (1)

where T is a threshold set at capture time. The output of this operation is a binary image where changes in local spatial contrast above threshold yield a 1, else a 0, see Figure 2. Note that this operation is an approximation of a binary local derivative: Δ_{T,L} alone can trigger an activation for P, even though the intensity at P is not significantly different from either of the neighbors'. It can be shown that the consequence of this approximation is a "fattening" of the image edges by a factor of roughly √2 when compared to the magnitude of a gradient computed with regular finite differences. The advantage of this formulation is that it can be implemented efficiently in hardware.

For temporal binary gradients, the sensor proposed by Lichtsteiner et al. [23], which works asynchronously, outputs +1 (−1) for a pixel whose intensity increases (decreases) by a certain threshold, and 0 otherwise. The sensor proposed by Gottardi et al. produces a slightly different image for temporal gradients, where the value of a pixel is the difference between its current and previous binary spatial gradient [11]:

G_T(P, t) = \max\big(0, \, |G_S(P, t) - G_S(P, t-1)|\big),    (2)

where we made the dependency on time t explicit. This is implemented by storing the previous value in a 1-bit memory collocated with the pixel to avoid unnecessary data transfer. An image produced by this modality can be seen in Figure 1(e).

Binary gradient cameras have numerous advantages in terms of power and bandwidth. A major source of power consumption in modern camera sensors is the analog-to-digital conversion and the transfer of the 12-16 bits of data off-chip, to subsequent image processing stages. Gradients that employ 1 or 2 bits can significantly reduce both the cost of the conversion and the amount of data to be encoded at the periphery of the array. In fact, the sensor only transfers the addresses of the pixels that are active, and when no pixels are active, no power is used for transferring data.

Comparing power consumption for sensors of different size, technology, and mode of operation is not easy. Our task is further complicated by the fact that the power consumption for a binary gradient sensor is a function of the contrast in the scene. However, here we make some assumptions to get a very rough figure. Gottardi et al. [11] report that the number of active pixels is usually below 25% (in the data we captured, we actually measured that slightly less than 10% of the pixels were active on average). The power consumption for the sensor by Gottardi et al. can be approximated by the sum of two components. The first, independent of the actual number of active pixels, is the power required to scan the sensor and amounts to 0.0024 µW/pixel. The second is the power required to deliver the addresses of the active pixels, and is 0.0195 µW/pixel [8]. At 30 fps, this power corresponds to 7.3 pJ/pixel. A modern image sensor, for comparison, is over 300 pJ/pixel [34]. Once again, these numbers are to be taken as rough estimates.
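As a concrete reference for Equations (1) and (2), the following NumPy sketch shows how one might simulate the sensor output from intensity frames. The function names and the default threshold are illustrative choices, not the exact simulator used for our experiments.

```python
import numpy as np

def spatial_binary_gradient(img, thresh=0.05):
    """Simulate G_S (Eq. 1): a pixel is active when the largest absolute
    difference within the {P, left, top} neighborhood exceeds the threshold.
    `img` is a single grayscale frame, assumed normalized to [0, 1]."""
    img = img.astype(np.float32)
    d_pl = np.abs(img[1:, 1:] - img[1:, :-1])   # |P - L|
    d_pt = np.abs(img[1:, 1:] - img[:-1, 1:])   # |P - T|
    d_lt = np.abs(img[1:, :-1] - img[:-1, 1:])  # |L - T|
    g = np.zeros_like(img)
    g[1:, 1:] = np.maximum(np.maximum(d_pl, d_pt), d_lt) > thresh
    return g

def temporal_binary_gradient(img_t, img_prev, thresh=0.05):
    """Simulate G_T (Eq. 2): the change of the binary spatial gradient
    between two consecutive frames."""
    gs_t = spatial_binary_gradient(img_t, thresh)
    gs_prev = spatial_binary_gradient(img_prev, thresh)
    return np.maximum(0.0, np.abs(gs_t - gs_prev))
```

In practice the threshold is tuned so that the simulated images visually match frames from the prototype sensor (see Section 6).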
4. Experiments
In this section, we describe the vision tasks we used to benchmark spatial and temporal binary gradients. For the benchmarks involving static scenes or single images, we could only test spatial gradients. We used TensorFlow and Keras to construct our networks. All experiments were performed on a cluster of GPUs with NVIDIA Titan X's or K80s. For all the experiments in this section, we picked a reference baseline network appropriate for the task, trained it on intensity or RGB images, and compared the performance of the same architecture on data that simulates the sensor by Gottardi et al. [11]. Examples of such data can be seen in Figures 1(b) and 1(e). Table 1 summarizes all the comparisons we describe below.
Object Recognition —
We used MNIST [21] and CIFAR-10 [19] to act as common baselines, and for easy comparison with other deep learning architectures, on object recognition tasks. MNIST comprises 60,000 28x28 images of handwritten digits. CIFAR-10 has 60,000 32x32 images of objects from 10 classes, with 10,000 additional images for validation. For these tasks we used LeNet [20].

On MNIST, using simulated binary gradient data degrades the accuracy by a mere 0.76%. For CIFAR-10, we trained the baseline on RGB images. The same network, trained on the simulated data, achieves a loss in accuracy of 11.33%. For reference, using grayscale instead of RGB images causes a loss of accuracy of 4.86%, which is roughly comparable to the difference in accuracy between using grayscale and gradient images—but without the corresponding power saving.

Table 1: Summary of the comparison between traditional images and binary gradient images on visual recognition tasks.

Task           | Dataset               | Traditional | Binary gradient           | Network used
Recognition    | MNIST [21]            | 99.19%      | 98.43%                    | LeNet [20]
               | CIFAR-10 [19]         | 77.01%      | 65.68%                    | LeNet [20]
               | NVGesture [26]        | 72.5%       | G_T: 74.79%; G_S: 65.42%  | Molchanov et al. [26]
Head pose      | 300VW [32]            | 0.6°        | 1.8°                      | LeNet [20]
               | BIWI Face Dataset [7] | 3.5°        | 4.3°                      | VGG16 [33]
Face detection | WIDER [38] Easy       | 89.2%       | 74.5%                     | Faster R-CNN [31]
               | WIDER [38] Medium     | 79.2%       | 60.5%                     |
               | WIDER [38] Hard       | 40.2%       | 28.3%                     |

(Head pose entries are mean angular errors; lower is better.)
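As an illustration of this pipeline (simulated gradient data fed to a standard architecture in Keras), here is a hedged sketch for MNIST. The exact LeNet variant, threshold, and hyperparameters behind the numbers in Table 1 may differ.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()

def to_binary_gradient(batch, thresh=0.1):
    """Simulated G_S (Eq. 1) for a batch of [0, 255] grayscale images."""
    b = batch.astype(np.float32) / 255.0
    d_pl = np.abs(b[:, 1:, 1:] - b[:, 1:, :-1])
    d_pt = np.abs(b[:, 1:, 1:] - b[:, :-1, 1:])
    d_lt = np.abs(b[:, 1:, :-1] - b[:, :-1, 1:])
    g = np.zeros_like(b)
    g[:, 1:, 1:] = np.maximum(np.maximum(d_pl, d_pt), d_lt) > thresh
    return g[..., None]  # add a channel axis

x_tr_g, x_te_g = to_binary_gradient(x_tr), to_binary_gradient(x_te)

# LeNet-style classifier; filter and unit counts are illustrative.
model = models.Sequential([
    layers.Conv2D(20, 5, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(50, 5, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(500, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_tr_g, y_tr, epochs=10, batch_size=128,
          validation_data=(x_te_g, y_te))
```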
Head Pose Regression —
We also explored single-shot head pose regression, an important use case for human-computer interaction and driver monitoring in vehicles. We used two datasets to benchmark the performance of gradient cameras on head pose regression. The first, the BIWI face dataset, contains 15,000 images of 20 subjects, each accompanied by a depth image, as well as the head's 3D location and orientation [7]. The second, the 300VW dataset, is a collection of 300 videos of faces annotated with 68 landmark points [32]. We used the landmark points to estimate the head orientation.

On the BIWI dataset, training a LeNet from scratch did not yield network convergence. Therefore, we used a pre-trained VGG16 [33] network on the RGB images. We then fine-tuned the network on the simulated binary gradient data. The network trained on simulated binary gradient data yields a degradation in estimation accuracy of 0.8 degrees of mean angular error. On the 300VW dataset, we trained LeNet on the simulated data. The mean angular error increases by 1.2 degrees, which is small when accounting for the corresponding power saving.
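A minimal sketch of this fine-tuning setup follows, assuming an ImageNet-pretrained VGG16 topped by a small regression head that predicts three head-pose angles; the input size, head layers, and learning rate are illustrative assumptions rather than our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# The single-channel binary gradient crops would be replicated to 3 channels
# to match the VGG16 input.
base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = True  # fine-tune the whole backbone on the simulated data

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dense(3),  # yaw, pitch, roll (degrees)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss='mse',
              metrics=['mae'])
# model.fit(gradient_crops, pose_angles, ...)  # hypothetical training arrays
```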
Face Detection —
Another traditional vision task is face detection. For this task we trained the network on the WIDER face dataset, a collection of 30,000+ images with 390,000+ faces, organized in three categories of face detection difficulty: easy, medium, and hard [38]. Figure 3 shows representative images of different levels of difficulty. Note that this dataset is designed to be very challenging, and includes pictures taken under extreme scale, illumination, pose, and expression changes, among other factors.

For this task, we used the network proposed by Ren et al. [31]. Once again, we trained it on both the RGB and the simulated binary gradient images. The results are summarized in Table 1. On this task, the loss in accuracy due to using the binary gradient data is more significant, ranging from 11.9% to 18.7%, depending on the category.

Figure 3: Face detection on binary spatial gradient images simulated from the WIDER dataset.
Gesture Recognition —
Our final task was gesture recognition. Unlike the previous benchmarks, whose task can be defined on a single image, this task has an intrinsic temporal component: the same hand position can be found in a frame extracted from two different gestures. Therefore, for this task we test both the spatial and temporal modalities. We used the dataset released by Molchanov et al., which contains 1,500+ hand gestures from 25 gesture classes, performed by 20 different subjects [26]. The dataset offers several acquisition modalities, including RGB, IR, and depth, and was randomly split between training (70%) and testing (30%) by the authors. The network for this task was based on [26], which used an RNN on top of 3D convolutional features. We limited our tests to RGB inputs, and did not consider the other types of data the dataset offers, see Figure 4.

As shown in Table 1, the simulated spatial binary gradient modality results in an accuracy degradation of 7.08% relative to RGB images and 5.41% relative to grayscale. However, as mentioned before, this task has a strong temporal component, and one would expect the temporal gradient input to perform better. Indeed, the temporal modality yields increased accuracy over both grayscale (+3.96%) and RGB (+2.29%) data. This is a significant result, because the additional accuracy is possible thanks to data that is actually cheaper to acquire from a power consumption standpoint. Note that the input to the network is a set of non-overlapping clips of 8 frames each, so the network can still "see" temporal information in modalities other than the temporal binary gradients. A simplified sketch of this type of architecture is shown below.
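The sketch below is a much-simplified stand-in for the architecture of Molchanov et al. [26]: 3D convolutions reduce each 8-frame clip to a feature vector, and a recurrent layer aggregates the clip features over time. The layer sizes, the GRU cell, and the input resolution are our assumptions, not the published configuration.

```python
from tensorflow.keras import layers, models

CLIP_LEN, H, W, C, NUM_CLASSES = 8, 112, 112, 1, 25

# 3D-convolutional encoder applied to one 8-frame clip.
clip_encoder = models.Sequential([
    layers.Conv3D(32, 3, activation='relu', padding='same',
                  input_shape=(CLIP_LEN, H, W, C)),
    layers.MaxPooling3D((1, 2, 2)),
    layers.Conv3D(64, 3, activation='relu', padding='same'),
    layers.MaxPooling3D((2, 2, 2)),
    layers.GlobalAveragePooling3D(),
])

# Sequence of clips -> per-clip features -> recurrent aggregation -> classes.
model = models.Sequential([
    layers.TimeDistributed(clip_encoder, input_shape=(None, CLIP_LEN, H, W, C)),
    layers.GRU(256),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```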
Figure 4: Two frames from the NVIDIA Dynamic Hand Gesture Dataset [26], (a), the corresponding spatial binary gradients, (b), and temporal binary gradients, (c).

Across a variety of tasks, we see that the accuracy on binary gradient information varies. It is sometimes comparable to, and sometimes better than, the accuracy obtained on traditional intensity data. Other times the accuracy loss is significant, as is the case with face detection. We think that this is due in part to the task, which can benefit from information that is lost in the binary gradient data, and in part to the challenging nature of the dataset. Our investigation suggests that the choice of whether a binary gradient camera can replace a traditional sensor should account for the task at hand and its accuracy constraints. Note that we did not investigate architectures that may better fit this type of data, and which may have an impact on accuracy. We leave this investigation for future work, see also Section 7.
In this paper, we study the tradeoff between power consumption and accuracy of binary gradient cameras. One factor that has a strong impact on both is the number of bits we use to quantize the gradient, which, so far, we have assumed to be binary. Designing a sensor with a variable number of quantization bits, while allowing for low power consumption, could be challenging. However, graylevel information can be extracted from a binary gradient camera by accumulating multiple frames, captured at a high frame rate, and by combining them into a sum weighted by the time of activation [8]. For the sensor proposed by Gottardi et al. [11], the power of computing this multi-bit gradient can be estimated as

P = 2N \cdot P_{\mathrm{scan}} + P_{\mathrm{deliver}},    (3)

where N is the number of quantization levels, P_scan is the power required to scan all the rows of the sensor, and P_deliver is the power to deliver the data out of the sensor, which depends on the number of active pixels [8]. Despite the fact that P_deliver is an order of magnitude larger than P_scan, Equation 3 shows that the total power quickly grows with the number of bits.

To study the compromise between power and number of bits, we simulated a multi-bit gradient sweep on CIFAR-10, and used Equation 3 to estimate the corresponding power consumption. Figure 5 shows that going from a binary gradient to an 8-bit gradient allows for a 3.89% increase in accuracy, but requires more than 80 times the power. However, a 4-bit gradient may offer a good compromise, seeing that it only requires 7% of the power needed to estimate an 8-bit gradient (6 times the power required for the binary gradient), at a cost of only a 0.34% loss of accuracy. This experiment points to the fact that the trade-off between power consumption and accuracy can be tuned based on the requirements of the task, and possibly the use-case itself. Moreover, because in the modality described above N can be changed at runtime, one can also devise a strategy where the quantization levels are kept low in some baseline operation mode, and increased when an event triggers the need for higher accuracy.

Figure 5: Quantization vs. power consumption vs. accuracy tradeoff on CIFAR-10. Note the significant drop in power consumption between 8 and 4 bits, which is not reflected by a proportional loss of accuracy, see Section 4.2.
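For a rough sense of the numbers, Equation (3) can be evaluated with the per-pixel figures reported in Section 3. The active-pixel fraction and the mapping from bits to quantization levels below are our assumptions, so this is only a back-of-the-envelope sketch, not a measured figure.

```python
def sensor_power_uw(bits, n_pixels=128 * 64, active_frac=0.10,
                    p_scan=0.0024, p_deliver=0.0195):
    """Rough estimate of Eq. (3) in microwatts for the 128x64 prototype.
    p_scan and p_deliver are the per-pixel figures from Section 3."""
    levels = 2 ** bits                            # N quantization levels
    scan = 2 * levels * p_scan * n_pixels         # scan term of Eq. (3)
    deliver = p_deliver * n_pixels * active_frac  # delivery of active pixels
    return scan + deliver

for b in (1, 4, 8):
    print(f"{b}-bit gradient: ~{sensor_power_uw(b):.0f} uW")
```

With these assumptions, the estimates follow the trend discussed above: the 8-bit gradient costs roughly two orders of magnitude more than the binary one, while the 4-bit gradient stays within an order of magnitude of the binary case.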
5. Recovering Intensity Information from Spatial Binary Gradients
In addition to the automated computer vision machinery, some applications may require a human observer to look at the data. An example is video surveillance: a low-power automatic system can run continuously to detect, for instance, a person coming into the field of view. When such an event is detected, it may be useful to have access to intensity data, which is more easily accessible to a human observer. One solution could be that a more power-hungry sensor, such as an intensity camera, is activated when the binary gradient camera detects an interesting event [14]. Another solution could be to attempt to recover the grayscale information from the binary data itself. In this section, we show that this is indeed possible.

We outlined previous work on intensity reconstruction from temporal gradients in Section 2. Currently available techniques, such as the method by Bardow et al. [2], use advanced optimization algorithms and perform a type of Poisson surface integration [1] to recover the intensity information. However, they focus on the temporal version of the gradients. As a consequence, these methods can only reconstruct images captured by a moving camera, which severely limits their applicability to real-world scenarios.

To the best of our knowledge, there has been no work on reconstructing intensity images from a single binary spatial gradient image, in part because this problem does not have a unique solution. Capturing a dark ball against a bright background, for instance, would yield the exact same binary spatial gradient as a bright ball on a dark background. This ambiguity prevents surface integration methods from working, even with known or estimated boundary conditions.

We take a deep learning approach to intensity reconstruction, so as to leverage the network's ability to learn priors about the data. For this purpose, we focus on the problem of intensity recovery from spatial gradients of faces. While we cannot hope to reconstruct the exact intensity variations of a face, we aim to reconstruct facial features from edge maps so that they can be visually interpreted by a human. Here we describe the network architecture we propose to use, and the synthetic data we used to train it. In Section 6 we show reconstructions from real data we captured using a binary gradient camera prototype.

Our network is inspired by the autoencoder architecture recently proposed by Mao et al. [25]. The encoding part consists of 5 units, each comprising two convolutional layers with leaky ReLU nonlinearities followed by a max pooling layer. The decoding part is symmetric, with 5 units consisting of upsampling, a merging layer for skip connections that combines the activations after the convolutions from the corresponding encoder unit, and two convolutional layers. See Figure 6 for our network structure. We trained this architecture on the BIWI and WIDER datasets.

For the BIWI dataset, we removed two subjects completely to be used for testing. Figure 7 shows an embedded animation of the two testing subjects. As mentioned above, the solution is not unique given the binarized nature of the gradient image, and indeed the network fails to estimate the shade of the first subject's sweater. Nevertheless, the quality is sufficient to identify the person in the picture, which is surprising given the sparseness of the input data.

The WIDER dataset does not contain repeated images of any one person, which guarantees that no test face is seen by the network during training.
We extracted face crops by running the face detection algorithm described in Section 4.1, and resized them to 96x96, by either downsampling or upsampling, unless the original size was too small. Figure 8 shows some results of the reconstruction. Note that the failure cases are those where the quality of the gradients is not sufficient (Figure 8(i)), or the face is occluded (Figure 8(j)). The rest of the faces are reconstructed unexpectedly well, given the input. Even for the face in Figure 8(j) the network is able to reconstruct the heavy makeup reasonably well.

Figure 6: The architecture of the autoencoder used to reconstruct intensity information from spatial binary gradient images. Building blocks: CONV (3x3) + Leaky ReLU + Max Pooling (encoder); Upsampling + Merge (skip connection) + CONV (3x3) + Leaky ReLU + CONV (3x3) + Leaky ReLU (decoder).

Figure 7: Embedded animation of the intensity reconstruction (middle pane) on the binary data (left pane) simulated from the BIWI dataset [7]. It can be viewed in Adobe Reader, or other media-enabled viewers, by clicking on the images. The ground truth is on the right.
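To make the encoder-decoder described above (Figure 6) concrete, here is a minimal Keras sketch. The filter counts, the use of concatenation for the skip merge, and the final output layer are assumptions on our part; the overall structure (5 encoder units of two 3x3 convolutions with leaky ReLU followed by max pooling, and a symmetric skip-connected decoder) follows the description in the text.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3x3 convolutions, each followed by a leaky ReLU."""
    x = layers.Conv2D(filters, 3, padding='same')(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(filters, 3, padding='same')(x)
    return layers.LeakyReLU(0.2)(x)

inp = layers.Input(shape=(96, 96, 1))     # binary spatial gradient crop
skips, x = [], inp
for f in (32, 64, 128, 128, 128):         # 5 encoder units
    x = conv_block(x, f)
    skips.append(x)                       # activations reused by the decoder
    x = layers.MaxPooling2D()(x)

for f, skip in zip((128, 128, 128, 64, 32), reversed(skips)):  # 5 decoder units
    x = layers.UpSampling2D()(x)
    x = layers.Concatenate()([x, skip])   # skip connection from the encoder
    x = conv_block(x, f)

out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)  # intensity
autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mae')
```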
6. Experiments with a Prototype Spatial Binary Gradient Camera
In this section we validate our findings by running experiments directly on real binary gradient images. As a reminder, all the comparisons and tests we described so far were performed on data obtained by simulating the behavior of the binary gradient camera. Specifically, we based our simulator on Equation 1, and tuned the threshold T to roughly match the appearance of the simulated and real data, which we captured with the prototype camera described by Gottardi et al. [11]. At capture time, we use the widest aperture setting possible to gain the most light, though at the cost of a shallower depth of field, which we did not find to affect the quality of the gradient image. We also captured a few grayscale images of the same scene with a second camera set up to roughly match the fields of view of the two. Figure 2 shows a comparison between a grayscale image and the (roughly) corresponding frame from the prototype camera. Barring resolution issues, at visual inspection we believe our simulations match the real data.

Figure 8: Intensity reconstruction (bottom row) on the binary data (middle row) simulated from the WIDER dataset [38]. The ground truth is in the top row. Note that our neural network is able to recover the fine details needed to identify the subjects. We observed that failure cases happen when the gradients are simply too poor (i) or the face is occluded (j).
To qualitatively validate the results of our deep learning experiments, we ran face detection on binary gradient data captured in both outdoor and indoor environments. We could not train a network from scratch, due to the lack of a large dataset, which we could not capture with the current prototype—and the lack of ground truth data would have made it impossible to measure performance quantitatively anyway. We trained the network described in Section 4.1 on simulated data resized to match the size of images produced by the camera prototype, and then we directly ran inference on the real data. We found that the same network worked well on the indoor scenes, missing a small fraction of the faces, and typically those whose pose deviated significantly from facing forward. On the other hand, the network struggled more when dealing with the cluttered background typical of the outdoor setting, where it missed a significant number of faces. We ascribe this issue to the low spatial resolution offered by the prototype camera, which is only 128x64 pixels. However, this is not a fundamental limitation of the technology, and thus we expect it to be addressed in future versions. Figure 9 shows a few detection results for both environments.
Another qualitative validation we performed was intensity reconstruction from data captured directly with the camera prototype. We trained the network on synthetic data generated from the WIDER dataset, and performed forward inference on the real data. Once again, we could not perform fine-tuning due to the lack of ground truth data—the data from an intensity camera captured from a slightly different position, and with different lenses, did not generalize well. While the quality of the reconstruction is slightly degraded with respect to that of the synthetic data, the faces are reconstructed well. See Figure 10 for a few examples. Note that despite the low resolution (these crops are 1.5 times smaller than those in Figure 8), the face features are still distinguishable.

Remember that here we are reconstructing intensity information from a single frame: we are not enforcing temporal consistency, nor do we use information from multiple frames to better infer intensity. We find that the quality of the reconstruction of any single frame varies: some reconstructions from real data allow the viewer to determine the identity of the subject, others are more similar to average faces.

Figure 9: Face detection task on spatial gradient images captured with the camera prototype. The top and bottom rows show frames from an indoor and an outdoor sequence, respectively. The misdetection rate is significantly higher in outdoor sequences, as seen in inset (d).

Figure 10: Intensity reconstruction results inferred by the network described in Section 5 and trained on the WIDER simulated data. The top row shows 64x64 face crops captured with the prototype camera, the bottom row the corresponding reconstructed images. While the quality is not quite on par with the reconstructions from synthetic data, it has to be noted that the resolution of the crops in Figure 8 is 96x96, i.e., 1.5x larger.
7. Discussion
To further decrease the power consumption in computer vision tasks, we could couple binary gradient images with binary neural networks. Recently, new architectures have been proposed that implement the elementary layers (convolutions, fully connected layers) using binary weights, yielding an additional 40% in power savings in computation [6]. We evaluated these binary neural networks (BNNs) on MNIST, CIFAR-10, and SVHN [27]. (The latter is a dataset of digits cropped from house-number images.)

Acknowledgements
We would like to thank Pavlo Molchanov for training and testing our data with the model described in [26], and Massimo Gottardi for lending us the camera prototype used in this paper.
References

[1] A. Agrawal, R. Raskar, and R. Chellappa. What is the range of surface reconstructions from a gradient field? In European Conference on Computer Vision, pages 578–591. Springer, 2006.
[2] P. Bardow, A. J. Davison, and S. Leutenegger. Simultaneous optical flow and intensity estimation from an event camera. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[3] A. S. Cassidy, P. Merolla, J. V. Arthur, S. K. Esser, B. Jackson, R. Alvarez-Icaza, P. Datta, J. Sawada, T. M. Wong, V. Feldman, et al. Cognitive computing building block: A versatile and efficient digital neuron model for neurosynaptic cores. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE, 2013.
[4] S. M. Chai, A. Gentile, W. E. Lugo-Beauchamp, J. Fonseca, J. L. Cruz-Rivera, and D. S. Wills. Focal-plane processing architectures for real-time hyperspectral image processing. Applied Optics, 39(5):835–849, 2000.
[5] H. G. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar. ASP vision: Optically computing the first layer of convolutional neural networks using angle sensitive pixels. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[6] M. Courbariaux, I. Hubara, C. D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1.
[7] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool. Random forests for real time 3D face analysis. Int. J. Comput. Vision, 101(3):437–458, February 2013.
[8] O. Gallo, I. Frosio, L. Gasparini, K. Pulli, and M. Gottardi. Retrieving gray-level information from a binary sensor and its application to gesture detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 21–26, 2015.
[9] L. Gasparini, R. Manduchi, and M. Gottardi. An ultra-low-power contrast-based integrated camera node and its application as a people counter. In Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on, pages 547–554. IEEE, 2010.
[10] L. Gasparini, R. Manduchi, M. Gottardi, and D. Petri. An ultralow-power wireless camera node: Development and performance analysis. IEEE Transactions on Instrumentation and Measurement, 60(12):3824–3832, 2011.
[11] M. Gottardi, N. Massari, and S. Jawed. A 100 µW 128×64 pixels contrast-based asynchronous binary vision sensor for sensor networks applications. IEEE Journal of Solid-State Circuits, 44(5):1582–1592, 2009.
[12] V. Gruev, R. Etienne-Cummings, and T. Horiuchi. Linear current mode imager with low fix pattern noise. In Circuits and Systems, 2004. ISCAS'04. Proceedings of the 2004 International Symposium on, volume 4, pages IV-860. IEEE, 2004.
[13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. arXiv preprint arXiv:1602.01528, 2016.
[14] S. Han, R. Nandakumar, M. Philipose, A. Krishnamurthy, and D. Wetherall. GlimpseData: Towards continuous vision-based personal analytics. In Proceedings of the 2014 Workshop on Physical Analytics, pages 31–36. ACM, 2014.
[15] P. Hasler. Low-power programmable signal processing. In Fifth International Workshop on System-on-Chip for Real-Time Applications (IWSOC'05), pages 413–418. IEEE, 2005.
[16] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison. Simultaneous mosaicing and tracking with an event camera. In British Machine Vision Conference (BMVC), 2014.
[17] H. Kim, S. Leutenegger, and A. J. Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In European Conference on Computer Vision, pages 349–364. Springer, 2016.
[18] S. J. Koppal, I. Gkioulekas, T. Young, H. Park, K. B. Crozier, G. L. Barrows, and T. Zickler. Toward wide-angle micro-vision sensors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2982–2996, 2013.
[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
[22] W. D. Leon-Salas, S. Balkir, K. Sayood, N. Schemm, and M. W. Hoffman. A CMOS imager with focal plane compression using predictive coding. IEEE Journal of Solid-State Circuits, 42(11):2555–2572, 2007.
[23] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
[24] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and L. Zhong. Draining our glass: An energy and heat characterization of Google Glass. In Proceedings of 5th Asia-Pacific Workshop on Systems. ACM, 2014.
[25] X.-J. Mao, C. Shen, and Y.-B. Yang. Image restoration using convolutional auto-encoders with symmetric skip connections. arXiv preprint arXiv:1606.08921, 2016.
[26] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4207–4215, 2016.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[28] A. Nilchi, J. Aziz, and R. Genov. Focal-plane algorithmically-multiplying CMOS computational image sensor. IEEE Journal of Solid-State Circuits, 44(6):1829–1839, 2009.
[29] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, and M. Pfeiffer. Real-time classification and sensor fusion with a spiking deep belief network. Neuromorphic Engineering Systems and Applications, page 61, 2015.
[30] J. M. Ragan-Kelley. Decoupling Algorithms from the Organization of Computation for High Performance Image Processing. PhD thesis, Ch. 2, pages 19–24, Massachusetts Institute of Technology, 2014.
[31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[32] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic. The first facial landmark tracking in-the-wild challenge: Benchmark and results. Pages 1003–1011. IEEE, 2015.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[34] A. Suzuki, N. Shimamura, T. Kainuma, N. Kawazu, C. Okada, T. Oka, K. Koiso, A. Masagaki, Y. Yagasaki, S. Gonoi, et al. A 1/1.7-inch 20Mpixel back-illuminated stacked CMOS image sensor for new imaging applications. Pages 1–3. IEEE, 2015.
[35] J. Tumblin, A. Agrawal, and R. Raskar. Why I want a gradient camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 103–110. IEEE, 2005.
[36] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt. Event-based 3D SLAM with a depth-augmented dynamic vision sensor. Pages 359–364. IEEE, 2014.
[37] D. Weikersdorfer, R. Hoffmann, and J. Conradt. Simultaneous localization and mapping for event-based vision systems. In International Conference on Computer Vision Systems, pages 133–142. Springer, 2013.
[38] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] A. Zomet and S. K. Nayar. Lensless imaging with a controllable aperture. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).