A Noise Filter for Dynamic Vision Sensors using Self-adjusting Threshold
Shasha Guo, Ziyang Kang, Lei Wang, Limeng Zhang, Xiaofan Chen, Shiming Li, Weixia Xu
College of Computer Science and Technology
National University of Defense Technology
Changsha, Hunan, China, 410073
e-mail: guoshasha13, [email protected]

Abstract— Neuromorphic event-based dynamic vision sensors (DVS) have much faster sampling rates and a higher dynamic range than frame-based imagers. However, they are sensitive to unwanted background activity (BA) events. We propose a new criterion, with little computation overhead, for separating real events from BA events; it exploits global space and time information rather than local information obtained by Gaussian convolution, and it can directly serve as a filter, which we denote GF. We demonstrate GF on three datasets, each recorded by a different DVS with a different output size. The experimental results show that our filter produces the clearest frames compared with the baseline filters and runs fast.

I. INTRODUCTION
Research on neuromorphic event-based sensors ("silicon retinae") started a few decades back [1]. Recently, the technology has matured to the point where several sensors are commercially available. Some of the popular dynamic vision sensors (DVS) are the DVS128 [2], the Dynamic and Active-pixel Vision Sensor (DAVIS) [3], the Asynchronous Time-based Image Sensor (ATIS) [4], and the CeleX-IV [5]. Different from conventional frame-based imagers, which sample the scene at a fixed temporal rate (typically 30 frames per second), these sensors detect dynamic changes in illumination. This results in a higher dynamic range, a higher sampling rate, and lower power consumption. These sensors have several possible applications.

However, these sensors produce background activity (BA) events under constant illumination, caused by temporal noise and junction leakage currents [2], [6], [7]. Multiple noise filtering methods for event-based data are already available. The Nearest Neighbor (NNb) filter based on spatiotemporal correlation [8], [9], [6], [10], [11], [12] is the most commonly employed method. Besides, there are variations of NNb filters as well as other filters based on differing polarity, refractory period, and inter-spike interval [11].

However, it is difficult to distinguish whether an event is a real event or noise from the event itself alone. When the track of the target is known, the temporal correlation of events generated by a single pixel can be measured through repeated recording. Higher correlation suggests a higher probability of real events, and vice versa. In the real world, however, it is unlikely that every target's motion information can be obtained in advance, so relying on this method of judging image quality is not feasible. Although [11] and [12] introduce criteria for real events, the computation is heavy and time-consuming. [11] uses Gaussian convolution on the time dimension to obtain the correlation, and [12] strengthens this by convolving on both the space dimension and the time dimension. These methods make each event participate in multiple calculations in addition to the key comparison operations for judging, which puts a heavy burden on the computation and increases processing time.

To tackle these challenges, we design a new criterion with little computation overhead for defining real events and BA events by utilizing global space and time information rather than the local information obtained by Gaussian convolution. It can naturally be used as a filter, since it decides whether an event is a real event or a BA event and thus whether to pass or discard the event. Therefore, we introduce a criterion for DVS BA event filtering as well as a new spatiotemporal BA filter.

Our contributions are as follows. First, we propose a criterion for defining real events and BA events with little computation overhead, which also acts as a BA filter and is called GF. Second, we demonstrate GF on three datasets, each recorded by a different DVS system with a different output size. The experimental results show that our filter produces the clearest frames compared with the baseline filters.

II. BACKGROUND
A. DVS128
The DVS128 [2] sensor is an event-based image sensor that generates asynchronous events when it detects changes in log intensity. Each pixel independently, and in continuous time, quantizes local relative intensity changes to generate spike events. If the change in light intensity detected by a pixel since its last event exceeds the upper threshold, the pixel generates an "ON" event; if the change passes the lower threshold, the pixel generates an "OFF" event; otherwise, the pixel generates no event. By this mechanism, the DVS128 only generates events when there is a change in light intensity; therefore, the sensor's output stream includes only the detected changes in the sensed signal and carries no redundant data.

To encode the event information for output, the DVS128 sensor uses the Address Event Representation (AER) protocol [13] to create a quadruple e(p, x, y, ts) for each event. Specifically, p is the polarity, i.e., ON or OFF; x is the x-position of the event's pixel; y is the y-position of the event's pixel; and ts is a 32-bit timestamp, i.e., the timing information of the event.
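In code, the AER quadruple maps naturally onto a small record type. Below is a minimal Python sketch; the class and field names and the microsecond unit are our assumptions, as the paper only fixes the tuple e(p, x, y, ts):

```python
from dataclasses import dataclass

@dataclass
class AEREvent:
    """One address-event: polarity, pixel address, and 32-bit timestamp."""
    p: int   # polarity: 1 for ON, 0 for OFF
    x: int   # x-position (column) of the firing pixel
    y: int   # y-position (row) of the firing pixel
    ts: int  # 32-bit timestamp (unit depends on the sensor, often microseconds)
```

B. DAVIS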
The DAVIS [3] combines the DVS with an active pixel sensor (APS) at the pixel level, allowing simultaneous output of asynchronous events and synchronous frames. The DAVIS sensor also uses the AER protocol to encode the event output.
C. CeleX-IV
The CeleX-IV is a high-resolution dynamic vision sensor from CelePixel Technology Co., Ltd. [5]. The resolution of the sensor is 768 × 640. As illustrated in Fig. 1, events are read out in a row-wise stream, so up to X events (one full row) can share the same timestamp; this special timestamp assignment matters for the threshold derivation in Section IV.

Fig. 1. The event readout stream of CeleX.

D. BA Events
BA events are caused by thermal noise and junction leakage currents [2], [6], [7]. These events degrade the quality of the data and further incur unnecessary communication bandwidth and computing resources. The BA events and the real activity events differ in that a BA event lacks temporal correlation with events in its spatial neighborhood, while real events, arising from moving objects or changes in illumination, are temporally correlated with events from their spatial neighbors. On the basis of this difference, BA events can be filtered out by detecting events that have no spatiotemporal correlation with events generated by the neighboring pixels. Such a filter is a spatiotemporal correlation filter. The filter decides whether an event is real or noise by checking the condition \( T_{NNb} - T_e < dT \). If the condition is met, the event is regarded as a real activity event. Here \( T_{NNb} \) denotes the timestamps from the neighboring pixels, which satisfy \( |x_p - x| \le 1 \) and \( |y_p - y| \le 1 \), where p stands for a pixel, and dT is the limit on the timestamp difference.
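As an illustration, the correlation test can be sketched as follows; this is a minimal rendering of the condition above under assumed array conventions, not the implementation of any cited filter:

```python
import numpy as np

def is_real_event(last_ts, x, y, ts, dT):
    """Spatiotemporal correlation test: pass the event at (x, y, ts) if any
    pixel in its 8-neighborhood (|xp - x| <= 1, |yp - y| <= 1) fired within
    dT.  last_ts is a (height, width) array of last-event timestamps."""
    h, w = last_ts.shape
    x0, x1 = max(0, x - 1), min(w, x + 2)
    y0, y1 = max(0, y - 1), min(h, y + 2)
    neighborhood = last_ts[y0:y1, x0:x1]
    return bool(np.any(ts - neighborhood < dT))
```

III. RELATED WORK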
In this section, we introduce three event-based spatiotemporal filters and one frame-based filter. There are also other filtering methods: researchers [15] proposed a filter with neuromorphic integrate-and-fire neurons, which integrate spikes not only from the corresponding pixel but also from its neighboring pixels before firing, and [16] assigns a lifetime to each event, where a noise event is assigned a lifetime of 0. Here we introduce three event-based filters with O(N²), O(N²/s²), and O(N) space complexity respectively; they are denoted Bs1, Bs2, and Bs3 in the rest of this paper.

Fig. 2. Three event-based filters [17]: (a) Bs1, (b) Bs2, (c) Bs3.

TABLE I
PARAMETERS OF THRESHOLD CALCULATION FOR GF

Para.  Description
TD     the time difference between the first event and the last event within a frame
ATD    the average time difference
ANEP   the average number of events per pixel
ANEM   the average number of events per memory cell
FN     the number of events of a frame
SF     the scaling factor
X      the image width of the frame
Y      the image height of the frame
s      the subsampling window, similar to Bs2 described in Section III

In the Bs1 filter [8], each pixel has a memory cell for storing the last event's timestamp. The stored timestamps are used for computing the spatiotemporal correlation (Fig. 2(a)).

The Bs2 filter uses subsampling groups to reduce the memory size [6]. Each subsampling group of factor s includes s × s pixels and uses one memory cell for storing the timestamp of the group's most recent event (Fig. 2(b)).

The Bs3 filter assigns two memory cells to each row and each column to store the most recent event in that row or column (Fig. 2(c)) [17]. This filter is designed to store all the information of an event, so both cells are 32 bits wide: one stores the timestamp, the other the polarity and the position along the other axis.
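To make the three memory layouts concrete, the sketch below shows one plausible way to allocate them; the sensor size, dtype choices, and helper names are illustrative assumptions, not the authors' code:

```python
import numpy as np

N, s = 128, 2  # sensor side length and subsampling factor (illustrative)

# Bs1: one 32-bit cell per pixel -> O(N^2) space.
bs1_mem = np.zeros((N, N), dtype=np.uint32)

# Bs2: one cell per s x s subsampling group -> O(N^2 / s^2) space.
bs2_mem = np.zeros((N // s, N // s), dtype=np.uint32)

def bs2_cell(x, y):
    return bs2_mem[y // s, x // s]  # one cell shared by s*s pixels

# Bs3: two cells per row and per column -> O(N) space.  For each row/column,
# one cell holds the timestamp and the other packs the polarity and the
# coordinate along the orthogonal axis of the most recent event.
bs3_row_ts = np.zeros(N, dtype=np.uint32)
bs3_row_meta = np.zeros(N, dtype=np.uint32)
bs3_col_ts = np.zeros(N, dtype=np.uint32)
bs3_col_meta = np.zeros(N, dtype=np.uint32)
```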
IV. PROPOSED FILTER

We propose a new method for separating real events and BA events using the time difference between two events at the same pixel. By separating real events and BA events, it can naturally be used as a BA filter. It utilizes both space information and time information from a global perspective, and we denote the filter as GF for simplicity. Table I gives the notation and explanations of the parameters that appear in the following description.

We introduce the time threshold for the GF filter, denoted \( T_{GF} \), as follows. For each pixel, the ANEP is

\[ ANEP = \frac{FN}{X \times Y}. \tag{1} \]

Intuitively, when each pixel has its own memory cell (s = 1), as in Bs1, the time threshold for separating real events and BA events should be the ratio of the whole time difference to the average number of events per pixel within a frame:

\[ T_{GF} = \frac{TD}{ANEP} = \frac{TD \times (X \times Y)}{FN}. \tag{2} \]

However, when s pixels share a memory cell, as in Bs2, the average number of events per pixel becomes the average number of events per memory cell, ANEM:

\[ ANEM = ANEP \times s = \frac{s \times FN}{X \times Y}. \tag{3} \]

Thus, the time threshold for GF becomes \( T_{GFs} \), described by Eq. 4:

\[ T_{GFs} = \frac{TD}{ANEM} = \frac{TD \times (X \times Y)}{s \times FN}. \tag{4} \]

This is not the end. Since the ATD of two BA events is supposed to be much larger than that of two real events according to the spatiotemporal correlation, BA events increase the ATD between any two events in the frame compared with the ideal condition of a frame with no BA events. In other words, they increase the TD of the frame compared with the ideal condition.
The \( T_{GFs} \) based on Eq. 4 could therefore be larger than expected, so we introduce the scaling factor SF and update \( T_{GFs} \) as

\[ T_{GFs} = \frac{TD}{ANEM} \times SF = \frac{TD \times (X \times Y)}{s \times FN} \times SF. \tag{5} \]

For CeleX, due to its special timestamp assignment as described in Section II-C, up to X events can have the same timestamp. We suppose that these events are regarded as one event when computing the ANEM, namely,

\[ ANEM = \frac{s \times FN}{X \times (X \times Y)}. \tag{6} \]

With the scaling factor SF, the time threshold \( T_{GFs} \) can then be described as Eq. 7:

\[ T_{GFs} = \frac{TD}{ANEM} \times SF = \frac{TD \times X \times (X \times Y)}{s \times FN} \times SF. \tag{7} \]
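A direct transcription of Eqs. 3-7 might look like the following sketch (the function name and the flag are ours):

```python
def gf_threshold(TD, FN, X, Y, s, SF, row_shared_ts=False):
    """Self-adjusting time threshold T_GFs from the previous frame's stats.

    TD: time span of the previous frame; FN: its event count;
    X, Y: sensor width/height; s: pixels per memory cell; SF: scaling factor.
    row_shared_ts: True for sensors like CeleX where up to X events can
    share one timestamp (Eqs. 6-7), False for the standard case (Eq. 5).
    """
    if row_shared_ts:
        anem = s * FN / (X * (X * Y))  # Eq. 6
    else:
        anem = s * FN / (X * Y)        # Eq. 3
    return TD / anem * SF              # Eqs. 5 and 7
```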
It is worth noticing that the \( T_{GFs} \) for CeleX is likely to be smaller than expected: although up to X events can share the same timestamp, the actual number is usually smaller than X.

For each frame's events, the time threshold \( T_{GFs} \) is calculated from the last frame; for the first frame of an event stream, it is initialized to a constant. Although the threshold is calculated from frame information, GF should always be seen as an event-oriented filter. The steps of the GF filter are outlined as follows, and a code sketch is given after the list. For each event:
• Fetch the corresponding memory cell and get the last recorded timestamp;
• Check whether the present timestamp is within \( T_{GFs} \) of the last timestamp. If the time difference is less than \( T_{GFs} \), pass the event to the output; otherwise discard it;
• Store the timestamp of the new event in the corresponding memory cell.
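Putting the steps together, a minimal per-event loop for GF could look like the sketch below; the Bs2-style cell layout, the (x, y, ts) tuple format, and the constant initial threshold value are our assumptions:

```python
import numpy as np

def gf_filter_frame(frame_events, mem, s, T):
    """Filter one frame's events against threshold T; mem holds the last
    timestamp of each s x s pixel group (Bs2-style layout)."""
    passed = []
    for x, y, ts in frame_events:
        cell = (y // s, x // s)        # step 1: fetch the memory cell
        if ts - mem[cell] < T:         # step 2: within T_GFs -> real event
            passed.append((x, y, ts))
        mem[cell] = ts                 # step 3: store the new timestamp
    return passed

# Illustrative driver: the threshold for each frame is derived from the
# previous frame; the first frame uses a constant (value is an assumption).
X, Y, s, SF, T = 128, 128, 2, 10, 500
mem = np.zeros((Y // s, X // s), dtype=np.int64)
frames = []  # fill with per-frame lists of (x, y, ts) tuples
for frame in frames:
    kept = gf_filter_frame(frame, mem, s, T)
    TD = frame[-1][2] - frame[0][2]    # TD of the frame just processed
    T = gf_threshold(TD, len(frame), X, Y, s, SF)  # Eq. 5, sketched above
```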
V. EXPERIMENT SETUP
A. Dataset
We use three datasets: DvsGesture [18], a dataset collected with a DVS; Roshambo [19], a DAVIS-240C dataset; and our own dataset recorded with the CeleX-IV. DvsGesture comprises 11 hand gesture categories performed by 29 subjects under 3 illumination conditions. Roshambo is a dataset of rock, paper, scissors, and background images; we use three sub-recordings of rock, paper, and scissors. Our own dataset is also a rock-paper-scissors dataset. The baseline filters are described in Section III.

To make the event stream visible, it is common to generate a picture frame from the events, covering either a fixed time length or a constant number of events. We choose the fixed number of events: a pixel in the picture is set to 255 if it receives an event.
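For instance, rendering one fixed-count frame takes only a few lines; this sketch assumes (x, y, ts) tuples and row-major frames:

```python
import numpy as np

def events_to_frame(events, X, Y, count):
    """Render one visualization frame from the next `count` events:
    a pixel is set to 255 if it received at least one event."""
    frame = np.zeros((Y, X), dtype=np.uint8)
    for x, y, _ in events[:count]:
        frame[y, x] = 255
    return frame
```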
B. Software Configuration
The fixed time threshold used for Bs1 is 0.5 ms. For Bs2 with subsampling window s (s > 1), the fixed time threshold is 0.5 × (s × s) ms. For Bs3, the time threshold is 0.5 × X ms. The choice of SF is related to the number of events in a frame and the BA frequency under the circumstances in which the data were recorded. We find that the proper SF for DVS128 and DAVIS is larger than 1, while the proper SF for CeleX is smaller than 1. In this work, the SF for Eq. 5 is set to 10 and the SF for Eq. 7 is set to 0.2.
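Collected in code form, the configuration might read as follows; this is a sketch, and the Bs3 value mirrors our reading of the text above:

```python
def baseline_thresholds_ms(s, X):
    """Fixed time thresholds (in ms) for the baseline filters."""
    return {
        "Bs1": 0.5,          # per-pixel filter
        "Bs2": 0.5 * s * s,  # one cell per s x s group
        "Bs3": 0.5 * X,      # row/column cells span a whole line
    }

SF_DVS = 10     # scaling factor for Eq. 5 (DVS128 / DAVIS recordings)
SF_CELEX = 0.2  # scaling factor for Eq. 7 (CeleX recordings)
```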
VI. EXPERIMENT RESULTS

First, we show that GF separates the real events and BA events well. Then we compare the runtime cost.

A. Denoising Effect
For the Roshambo dataset, we use 5k events to generate each frame. For our dataset, since the CeleX-IV has a very large output, namely 768 × 640 pixels, we choose 50k events per frame to make the images easy to distinguish by eye.

Fig. 3 shows the performance of the different filters on the Roshambo dataset. It can be seen that GF1 is clearer than Bs1 and GF2 is clearer than Bs2. Bs3 filters out the BA as well as many real events, leaving a dim outline.

For our dataset, we show two different cases: the object moving fast and moving slowly. Fig. 4 depicts the events when the hand is actively moving, so there are many real events within the frame. On the contrary, Fig. 5 shows the events when the hand barely moves, so the BA events account for a much higher percentage of the frame than in the case above. It can be seen that, in Fig. 4, the initial frame does not show many noise pixels. The effect of GF1 is less clear than Bs1 but clearer than Bs2. GF2 is clearer than Bs1 and still keeps the background area clean. Bs3 is the worst, as it still contains many BA noise pixels. In Fig. 5, although Bs3 shows the object, it shows many noise points as well. Among the other filters, Bs1 keeps relatively more information than Bs2 and GF1, and GF1 and Bs2 are still very similar. GF2 shows a clear outline of the object and gets rid of the noise effectively.

Fig. 3. The rock-paper-scissors recorded by DAVIS240 [3]: (a) INIT, (b) Bs1, (c) GF1, (d) Bs2, (e) GF2, (f) Bs3.

Fig. 4. The rock-paper-scissors from CeleX when the object moves fast: (a) INIT, (b) Bs1, (c) GF1, (d) Bs2, (e) GF2, (f) Bs3.

Fig. 5. The rock-paper-scissors from CeleX when the object moves slowly; the percentage of BA rises in the fixed-count frames: (a) INIT, (b) Bs1, (c) GF1, (d) Bs2, (e) GF2, (f) Bs3.

Fig. 6. Comparison of TPR and FPR: (a) Bs1, (b) Bs2, (c) Bs3, (d) GF1. One point in the figure represents a frame. The x-axis is the frame id. The y-axis is the ratio value, ranging from 0 to 1. The top line is TPR, and the bottom line is FPR.

Fig. 7. Comparison of TPR and FPR for Bs1 with different thresholds: (a) Thr = 0.5 ms, (b) Thr = 1 ms. One point in the figure represents a frame. The x-axis is the frame id. The y-axis is the ratio value, ranging from 0 to 1. The top line is TPR, and the bottom line is FPR.

It can be explained why GF1 does not keep many real events in this case: because the object's movement slows down, the time difference between real events becomes close to that between BA events, and when they are close it is hard to distinguish them using GF1. But the baseline filters do not show satisfactory results in such cases either. GF2 handles this problem better because it has more spatial information to draw on, as a group of pixels shares the same memory cell, while GF2 has the same memory cost as Bs2.
1) Quantitative Analysis based on GF: This experiment is carried out on the Gesture dataset recorded with the DVS128. Since the GF method shows good denoising performance at acceptable time cost, we use it as an evaluation reference for the other filters. We calculate the TPR and FPR of the event-based filters. TPR is the percentage of correct predictions for real events, and FPR is the percentage of BA events predicted as real events. Fig. 6 shows the results.
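Concretely, treating GF's decision as the ground-truth label, the two ratios can be computed as in this sketch (names are ours):

```python
def tpr_fpr(filter_pass, gf_real):
    """filter_pass, gf_real: parallel boolean lists, one entry per event.
    gf_real marks events GF labels as real; filter_pass marks events the
    evaluated filter lets through."""
    tp = sum(p and r for p, r in zip(filter_pass, gf_real))
    fp = sum(p and not r for p, r in zip(filter_pass, gf_real))
    real = sum(gf_real)
    noise = len(gf_real) - real
    tpr = tp / real if real else 0.0    # fraction of real events kept
    fpr = fp / noise if noise else 0.0  # fraction of BA events kept
    return tpr, fpr
```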
The FPR of all filters is low, which is good to see, especially for Bs2. These baseline filters rarely mistake the BA events defined by our criterion for real events, which suggests that our definition is consistent with the baseline filters. The TPR shows different distributions. Bs2 is the best. Bs3 shows the lowest TPR, which explains the light outlines of objects in the figures in Section VI-A. We found that Bs1 passes only about half of the real events. We suppose the reason might be that the threshold for Bs1 is too low, so we adjusted the threshold for Bs1 to 1 ms; Fig. 7 shows the result. After increasing the threshold, Bs1 also shows good TPR performance.

Fig. 8. Runtime comparison of the different filters. We use 3 million events and vary the number of events per frame. The x-axis is the number of events per frame. The y-axis is the total filtering time, in seconds.
B. Time Comparison
Fig. 8 shows the time consumption of the different filters. We repeat the experiment with different numbers of events per frame over a fixed number of events (3 million) from one event stream. The tendency is consistent: the Bs1 filter is about 2.5x more time-consuming than the GF filter. This reduction is achieved because GF only needs to write once and compute once per event, whereas Bs1 needs to write 9 times, updating the timestamps of the 8 neighbors and the pixel itself, and compute once, according to the process in [8]. We can see that GF is similar to Bs2 in time cost. GF2 is also similar to GF1 in average time consumption, because GF2 likewise writes once and computes once for each incoming event.

C. Discussion
One interesting behavior is demonstrated by the Bs3 filter. For the Roshambo dataset, where each event has its own timestamp, Bs3 works but filters out a large number of events, showing a relatively light outline of the object. However, for the CeleX dataset, where up to a row of events share the same timestamp, it still works for the pixels in the left part of the array, but for pixels at the right side of the output it has almost no filtering effect. This is especially clear in Fig. 5.

We also experiment with different subsampling windows, as shown in Fig. 9. We can see that the Time filter performs better than Bs2 for the different windows, especially in slow-movement scenarios.

VII. SUMMARY AND CONCLUSIONS
Neuromorphic event-based sensors have witnessed rapid development over the past few decades, especially dynamic vision sensors. These sensors allow much faster sampling rates and a higher dynamic range than frame-based imagers. However, they are sensitive to background activity events, which cost unnecessary communication and computing resources; moreover, improved noise filtering will enhance performance in many applications. We propose a new criterion with little computation overhead for defining real events and BA events based on the space and time information of the event stream, utilizing global information rather than the local information obtained by Gaussian convolution. The experimental results show that the proposed criterion achieves good denoising performance and runs very fast.

Fig. 9. Performance comparison between Bs2 and the Time Filter with s = 2 and s = 4 on hand frames: (a) fast-s2, (b) fast-s4, (c) slow-s2, (d) slow-s4, where s2 means s = 2. Fast means the hand is moving fast; slow indicates the hand is moving slowly. The subsampling window s is the same for Bs2 and the Time Filter. The top row is Bs2; the bottom row is the Time Filter.

REFERENCES
[1] Misha Mahowald. The silicon retina. In An Analog VLSI System for Stereoscopic Vision, pages 4–65. Springer, 1994.
[2] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
[3] Raphael Berner, Christian Brandli, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240×180 10 mW 12 μs latency sparse-output vision sensor for mobile applications. In 2013 Symposium on VLSI Circuits, pages C186–C187. IEEE, 2013.
[4] Christoph Posch, Daniel Matolin, and Rainer Wohlgenannt. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE Journal of Solid-State Circuits, 46(1):259–275, 2010.
[5] Menghan Guo, Jing Huang, and Shoushun Chen. Live demonstration: A 768×640 pixels 200 Meps dynamic vision sensor. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–1. IEEE, 2017.
[6] Hongjie Liu, Christian Brandli, Chenghan Li, Shih-Chii Liu, and Tobi Delbruck. Design of a spatiotemporal correlation filter for event-based sensors. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pages 722–725. IEEE, 2015.
[7] Hui Tian. Noise Analysis in CMOS Image Sensors. PhD thesis, Stanford University, 2000.
[8] Tobi Delbruck. Frame-free dynamic digital vision. In Proceedings of Intl. Symp. on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, pages 21–26, Tokyo, 2008.
[9] Sio-Hoi Ieng, Christoph Posch, and Ryad Benosman. Asynchronous neuromorphic event-driven image filtering. Proceedings of the IEEE, 102(10):1485–1499, 2014.
[10] Alejandro Linares-Barranco, Francisco Gómez-Rodríguez, Vicente Villanueva, Luca Longinotti, and Tobi Delbrück. A USB3.0 FPGA event-based filtering and tracking framework for dynamic vision sensors. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2417–2420. IEEE, 2015.
[11] Daniel Czech and Garrick Orchard. Evaluating noise filtering for event-based asynchronous change detection image sensors. In 2016 6th IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), pages 19–24. IEEE, 2016.
[12] Yang Feng, Hengyi Lv, Hailong Liu, Yisa Zhang, Yuyao Xiao, and Chengshan Han. Event density based denoising method for dynamic vision sensor. Applied Sciences, 10(6):2024, 2020.
[13] Alessandro Mortara and Eric A. Vittoz. A communication architecture tailored for analog VLSI artificial neural networks: intrinsic performance and limitations. IEEE Transactions on Neural Networks, 5(3):459–466, 1994.
[14] Jing Huang, Menghan Guo, and Shoushun Chen. A dynamic vision sensor with direct logarithmic output and full-frame picture-on-demand. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4. IEEE, 2017.
[15] Vandana Padala, Arindam Basu, and Garrick Orchard. A noise filtering algorithm for event-based asynchronous change detection image sensors on TrueNorth and its implementation on TrueNorth. Frontiers in Neuroscience, 12:118, 2018.
[16] Elias Mueggler, Christian Forster, Nathan Baumli, Guillermo Gallego, and Davide Scaramuzza. Lifetime estimation of events from dynamic vision sensors. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 4874–4881. IEEE, 2015.
[17] Alireza Khodamoradi and Ryan Kastner. O(N)-space spatiotemporal filter for reducing noise in neuromorphic vision sensors. IEEE Transactions on Emerging Topics in Computing, 2018.
[18] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7243–7252, 2017.
[19] Iulia-Alexandra Lungu, Federico Corradi, and Tobi Delbrück. Live demonstration: Convolutional neural network driven by dynamic vision sensor playing roshambo. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2017.