An Efficient Coding Method for Spike Camera using Inter-Spike Intervals
Siwei Dong, Lin Zhu, Daoyuan Xu, Yonghong Tian, Tiejun Huang
1 First two authors contributed equally. Corresponding author (Y. Tian, email: [email protected]).
Siwei Dong*, Lin Zhu*,1, Daoyuan Xu+, Yonghong Tian*,2 and Tiejun Huang*
* School of EECS, Peking University, Beijing, P.R. China
+ School of ECE, Shenzhen Graduate School, Peking University, Shenzhen, Guangdong, P.R. China
Abstract: Recently, a novel bio-inspired spike camera has been proposed, which continuously accumulates luminance intensity and fires a spike when the dispatch threshold is reached. Compared to conventional frame-based cameras and the emerging dynamic vision sensors, the spike camera has shown great advantages in capturing fast-moving scenes in a frame-free manner with full texture reconstruction capabilities. However, it is difficult to transmit or store the large amount of spike data. To address this problem, we first investigate the spatiotemporal distribution of inter-spike intervals and propose an intensity-based measurement of spike train distance. Then, we design an efficient spike coding method, which integrates adaptive temporal partitioning, intra-/inter-pixel prediction, quantization and entropy coding into a unified lossy coding framework. Finally, we construct a PKU-Spike dataset captured by the spike camera to evaluate the compression performance. The experimental results on the dataset demonstrate that the proposed approach is effective in compressing such spike data while maintaining the fidelity.

Introduction
In recent years, bio-inspired vision sensors have become very attractive in the fields of self-driving cars, unmanned aerial vehicles and autonomous mobile robots, due to their significant advantages over conventional frame-based cameras, such as high dynamic range and fast sensing capability. It is customary to call the output data of these sensors spikes, by analogy with biological systems. Typically, the pixels, also known as photoreceptors, respond independently to changes in luminance intensity by eliciting spikes. In contrast to frame-based cameras, what bio-inspired vision sensors have in common is that they are frame-free. From this perspective, the temporal resolution can be very high, because no shutter is needed. This is reasonable if we turn our attention to biological retina systems, in which the photoreceptors independently convert incoming light into electrical signals that are transmitted to the inner plexiform layer of the retina and further processed as spikes conveyed to the visual cortex via the optic nerve. Independent sampling removes the exposure limitation, which is commonly the bottleneck when increasing the frame rate of frame-based cameras.

To simulate biological vision, one of the most famous artificial "silicon retinas" is the dynamic vision sensor (DVS) [1]. It is capable of high-speed detection and tracking. However, as it responds only to the relative change of luminance intensity, it is very difficult to reconstruct the texture. To improve on this, Brandli et al. [2] proposed the dynamic and active-pixel vision sensor (DAVIS), which integrates the DVS with a frame-based active-pixel sensor (APS). This solution captures both frames and DVS spikes, but the mismatch is quite obvious due to the lower frame rate of the APS (60 frames per second) in contrast to the DVS with a temporal resolution of 1 μs. Posch et al.
[3] tried a different approach by introducing the asynchronous time-based image sensor (ATIS). The ATIS consists of a DVS circuit and a photo-measurement circuit. Once a DVS spike is fired (i.e., an intensity change is detected), the photo-measurement is triggered and the intensity is acquired by encoding the time from the change detection to the threshold crossing of a photocurrent integrator. Since the intensity is inversely proportional to the integration time, the visual texture of the scene can be reconstructed. However, the intensity is only measured when a DVS spike is generated, and the measurement takes place after the spike fires, so the mismatch still exists, especially in flexible motion. The key issue is the sampling mechanism of the DVS, which only responds to relative intensity changes. For stationary scenes, few spikes are fired, so it is difficult to recover the absolute intensity.

In our view, the ability to record the scene is indispensable for a vision sensor. With this goal, we investigated the sampling problem in depth and designed the spike camera [4], which is inspired by the integrate-and-fire neural model. The spike camera has a spatial resolution of 400 × 250, and the maximum spike firing rate per pixel is 40,000 Hz. In the spike camera, each pixel independently accumulates the luminance intensity, which is supplied by an analog-to-digital converter (ADC), and fires a spike as soon as the accumulated value reaches the dispatch threshold $\phi$:

$$\int_{0}^{t} I \, dt \ge \phi \qquad (1)$$

where $I$ refers to the luminance intensity (usually measured by the photocurrent in the circuit). Then the accumulator is reset and all the charge on it is drained. The accumulation speed of the luminance intensity differs from pixel to pixel. As shown in Fig. 1, a brighter pixel sends out spikes more frequently than a darker one, because more photons are collected at the pixel, so its accumulated intensity reaches the threshold sooner.

Compared to conventional frame-based cameras and the DVS, the spike camera successfully achieves a balance between high-speed motion capture and consistent texture reconstruction. Although its output data are also called spikes, their characteristics differ from those of the DVS. Under the DVS sampling mechanism, a spike represents a relative intensity change, so the temporal redundancy is almost entirely removed. For example, if the intensity is stable, the DVS outputs nothing. In the spike camera, however, both the static background and the moving foreground objects fire spikes, at various frequencies.

Figure 1: The workflow of the spike camera
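The sampling rule of Eq. (1) can be illustrated with a minimal sketch. It assumes discrete sampling periods and a constant intensity per pixel; the function name `fire_spikes` and the numeric values are hypothetical and are not the camera's actual parameters.

```python
def fire_spikes(intensity, threshold, n_steps):
    """Return the sampling steps at which one pixel fires a spike.

    Assumption: the intensity is constant and is added once per step.
    """
    accumulator = 0.0
    spike_times = []
    for step in range(1, n_steps + 1):
        accumulator += intensity        # integrate one sampling period
        if accumulator >= threshold:    # Eq. (1): accumulated intensity reaches the threshold
            spike_times.append(step)
            accumulator = 0.0           # reset: the accumulated charge is drained
    return spike_times


# Hypothetical values: the brighter pixel integrates four times faster.
bright_spikes = fire_spikes(intensity=4.0, threshold=8.0, n_steps=20)
dark_spikes = fire_spikes(intensity=1.0, threshold=8.0, n_steps=20)
print(len(bright_spikes), len(dark_spikes))
```

With these values the brighter pixel fires every 2 steps and the darker one every 8 steps, so the inter-spike interval is inversely proportional to the intensity.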
Due to the high temporal resolution and the spatial continuity of pixels, the spike data are redundant in both the temporal and the spatial domain. According to the statistics, the spike camera generates about 476 megabytes of data per second, which is consistent with one bit per pixel per sampling period (400 × 250 × 40,000 bits is about 476 MiB). Thus, compressing the spike data benefits both storage and transmission. In this paper, we design an efficient coding framework to address this problem. The rest of the paper is organized as follows: Section 2 briefly reviews related work, and Section 3 analyzes the distribution of the spike data. Section 4 presents the spike coding framework. Section 5 reports the experiments on compression performance and the distortion evaluation. Finally, Section 6 concludes the paper.

Related Works
Spike coding for bio-inspired cameras is a recently proposed research field. Spike data usually have a very high temporal resolution and therefore consume a large amount of storage space. Compressing spike data is challenging, since their characteristics are very different from those of conventional videos. In [5], Bi et al. first explicitly raised the issue of DVS data compression and proposed a lossless spike coding algorithm that achieved impressive coding performance. This is a good reference for our work. However, DVS data are quite different from our spike data. A DVS outputs spikes only when the luminance changes, so its data are much sparser than those of the spike camera, and the characteristics of the two kinds of data are completely different.

The spike camera is inspired by a biological neuron model, which drives us to explore coding technology from the field of neural computation. Early methods of applying entropy to neural coding treated spike trains as binary strings by quantizing the time axis [6]. Another category is interval methods, which focus on the intervals between consecutive spikes rather than on the spikes themselves. In [7], Bhumbra et al. proposed that, for a constant level of activity, the coding capacity of a single input is equal to its repertoire of inter-spike intervals. From the perspective of the neuron, interval methods are more convincing than methods based on spike counts [8]. Probability density functions such as the gamma and Gaussian distributions are often used in interval methods [9]. Generally, a simple Poisson integrate-and-fire model can be described by a gamma distribution, which closely fits recorded spike data [10].

On the other hand, some techniques from conventional video coding may also serve as references for spike coding, including motion compensation, quantization and entropy coding in AVC/H.264 [11] and HEVC/H.265 [12]. These technologies give us great inspiration in designing our coding strategies.
Spike Data Analysis
In this section, we investigate the probability density distribution of inter-spike interval (ISI), and the spatial and temporal distribution of spike data.
The Probability Density Distribution of Inter-Spike Interval
The distribution of the inter-spike interval can be derived from the photon arrival process, which is usually assumed to be a homogeneous Poisson process. It is parameterized by a single scalar $\lambda$, the mean rate at which events arrive. Each photon arrival event is independent of all the others. Therefore, the probability of $n$ photon arrival events is

$$\Pr(N(\delta) = n) = \frac{e^{-\lambda\delta} (\lambda\delta)^{n}}{n!} \qquad (2)$$

where $N(\delta)$ is the number of photons that arrive during a period of time $\delta$, and $\lambda$ is the photon arrival rate. Over a very short period of time, $\lambda$ is constant. Since the accumulated intensity is determined by the number of photons arriving, we assume that the arrival of $n$ photons reaches the dispatch threshold $\phi$ and generates a spike. If the firing time of the $i$-th spike is denoted as $t_i$, the inter-spike interval is $x = t_i - t_{i-1}$. As $N(\delta)$ is a Poisson process, the time from the beginning to the arrival of the $n$-th photon follows a gamma distribution [13]. Thus, the probability density function (PDF) is

$$f(x; \beta, \alpha) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}, \quad x > 0 \qquad (3)$$

where $\alpha$ is the shape parameter and $\beta$ is the scale parameter of the gamma distribution; here $\alpha = n$ and $\beta = 1/\lambda$, and $\Gamma(\cdot)$ is the gamma function. According to the mechanism of the spike camera, the dispatch threshold is predefined, so the shape parameter $\alpha$ is constant, whereas the luminance intensity may differ slightly between pixels, giving different scale parameters $\beta$. Fig. 2 shows that the actual ISI values are well fitted by the gamma PDF.
Figure 2: Histogram of the actual ISI values and the fitted gamma probability density function (red line). Horizontal axis: interval value (200 to 270); vertical axis: probability.
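The link between Poisson photon arrivals and gamma-distributed ISIs can be checked numerically. The sketch below is only an illustration under stated assumptions: the arrival rate, the number of photons per spike and the sample size are hypothetical values, and `simulate_isis` is not part of the proposed method. It draws each ISI as the waiting time until the $n$-th photon and compares the sample moments with the gamma moments $\alpha\beta$ and $\alpha\beta^2$ for $\alpha = n$, $\beta = 1/\lambda$.

```python
import random
import statistics


def simulate_isis(rate, photons_per_spike, n_spikes, seed=0):
    """Simulate ISIs of one pixel under a homogeneous Poisson photon process."""
    rng = random.Random(seed)
    isis = []
    for _ in range(n_spikes):
        # Inter-arrival times of a Poisson process are exponential, so the
        # time until the n-th photon is a sum of n exponential waits.
        isis.append(sum(rng.expovariate(rate) for _ in range(photons_per_spike)))
    return isis


rate = 2.0             # hypothetical photon arrival rate (lambda)
photons_per_spike = 20  # hypothetical photon count that reaches the threshold (n)
isis = simulate_isis(rate, photons_per_spike, n_spikes=4000)

alpha, beta = photons_per_spike, 1.0 / rate   # gamma parameters from Eq. (3)
sample_mean = statistics.mean(isis)
sample_var = statistics.variance(isis)
print("mean:", sample_mean, "expected:", alpha * beta)
print("variance:", sample_var, "expected:", alpha * beta ** 2)
```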
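Figure 3: The spatial distribution of spike data (distance to the start pixel versus pixel index, for the intensity-based distance and the kernel-method-based distance; the start pixel and the most similar pixel are marked).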
Based on the above, we can analyze the temporal and spatial correlations of the ISI. For a given pixel, when the luminance intensity is constant or changes steadily, the PDFs of the ISIs between consecutive spikes tend to be similar. For pixels in a spatially neighboring region, the photon arrival rates $\lambda$ are likely to be similar, and so are their PDFs. As a result, the ISI exhibits both temporal and spatial correlation.

The Spatial and Temporal Distribution of Spike Data
To explore the spatial correlation of the spike data, we first need a measure of the distance between two spike trains. In our previous work [5], a kernel method was used to measure the spike train distance for the DVS. However, it fails to measure spike trains from the spike camera accurately, which motivates us to explore a novel intensity-based distance. For a constant intensity, Eq. (1) can be simplified as $I t \ge \phi$, where $t$ corresponds exactly to the ISI. Therefore, the average intensity of the pixel in this period can be estimated by

$$\bar{I} = \frac{\phi}{t} \qquad (4)$$

For the visual system, the difference of intensities caused by a variation $\Delta t$ of the ISI can be defined as

$$\Delta I = \frac{\phi}{t} - \frac{\phi}{t + \Delta t} = \frac{\phi\,\Delta t}{t^{2} + t\,\Delta t} \qquad (5)$$

Thus, for two spike trains $s_i$ and $s_j$, we first convert them into two ISI sequences, whose elements are denoted as $t_i(k)$ and $t_j(k)$, respectively. Here $k = 1, 2, \ldots, K$, and $K$ is the number of ISIs. Then, the intensity-based distance between the two spike trains is

$$\| s_i - s_j \| = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( \frac{1}{t_i(k)} - \frac{1}{t_j(k)} \right)^{2}} \qquad (6)$$

Due to the high spatial correlation in actual objects, the photon arrival rates of spatially adjacent pixels are very similar, so the spatial distribution of the spike data should also be highly correlated. To explore this distribution, we select a fixed-size square area of the sequence "disk-pku" over a period of time. An arbitrary pixel, such as the one in the upper left corner, is chosen as the start pixel. Then the distances between the start pixel and the other pixels are measured by Eq. (6) in zigzag order. The distances are shown in Fig. 3. It can be seen that spatially adjacent spike trains have similar intensities and therefore small distances. With the proposed measure, the most similar pixel is found accurately. In contrast, the kernel method yields three candidates but fails to identify the right one.

Figure 4: The temporal distribution of spike data. (a) The intensity (1/ISI) distribution in the temporal domain; (b) the temporal distribution corresponding to the bright objects (the digits "5" and "8"); (c) the temporal distribution corresponding to the dark object (the black disc). Each panel plots 1/ISI against time.
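The intensity-based measure of Eq. (4) to (6) can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes the root-mean-square reading of Eq. (6), two ISI sequences of equal length, and hypothetical ISI values and function names.

```python
import math


def average_intensity(isi, threshold=1.0):
    """Eq. (4): average intensity over one ISI (threshold value is hypothetical)."""
    return threshold / isi


def intensity_distance(isis_a, isis_b):
    """Eq. (6): RMS difference of the reciprocal ISIs of two spike trains."""
    if len(isis_a) != len(isis_b) or not isis_a:
        raise ValueError("both ISI sequences must be non-empty and of equal length")
    k = len(isis_a)
    squared = sum((1.0 / a - 1.0 / b) ** 2 for a, b in zip(isis_a, isis_b))
    return math.sqrt(squared / k)


# Hypothetical ISI sequences: a start pixel, a neighbor and a much darker pixel.
start = [10, 10, 10, 10]
neighbor = [10, 11, 10, 10]
darker = [20, 20, 20, 20]

d_neighbor = intensity_distance(start, neighbor)
d_darker = intensity_distance(start, darker)
print(d_neighbor, d_darker)
```

Because the distance is computed on reciprocal ISIs, a one-unit ISI difference matters more for bright pixels (short ISIs) than for dark ones, which matches Eq. (5).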
As for the temporal correlation, the continuous arrival of photons gives rise to correlation between consecutive spikes of a pixel. To analyze the spike data intuitively, we plot the reciprocal of the ISI (which corresponds to the luminance intensity during each ISI) in the temporal domain. As shown in Fig. 4, the sequence "rotation" depicts a disc spinning at 2000 rpm. We select a pixel located in the red box (shown on the left). When the pixel is dark, we can assume that the photon arrival rate $\lambda$ is constant, so the PDF of the ISI can be modeled as a gamma distribution. The digits "5" and "8" are easily distinguished when they appear in the red box, since their ISIs do not fit this PDF. The results clearly show that the spike data are highly correlated in the temporal domain.

Spike Coding Framework
In this section, a spike coding framework is proposed to compress the spike data (Fig. 5). First, the spike train is adaptively partitioned into multiple segments in the temporal domain. Then, intra-pixel and inter-pixel coding with multiple prediction modes is performed to find the best reference candidate. Afterward, the prediction residuals are quantized to achieve lossy compression. Finally, the quantized residuals are fed into an adaptive context-based entropy coder. To achieve the best performance, every prediction mode is tried and the one with the minimum rate-distortion cost is chosen.
Figure 5: The framework of spike coding
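The mode decision at the end of this pipeline can be sketched as below. The text only states that the mode with the minimum rate-distortion cost is chosen; the Lagrangian form $J = D + \lambda_{rd} R$, the weight `lam` and the distortion and rate figures are assumptions made for illustration.

```python
def choose_mode(candidates, lam):
    """Pick the prediction mode with the smallest cost J = D + lam * R.

    candidates maps a mode name to (distortion, rate_in_bits).
    The Lagrangian cost is an assumed form of the rate-distortion cost.
    """
    return min(candidates, key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])


# Hypothetical measurements for one segment.
candidates = {
    "MVM": (0.040, 40),     # mean value mode: cheap but less accurate
    "FM": (0.010, 60),      # forward mode
    "inter": (0.005, 90),   # inter-pixel mode: most accurate, most bits
}

print(choose_mode(candidates, lam=0.001))  # small weight on rate favors accuracy
print(choose_mode(candidates, lam=0.01))   # large weight on rate favors cheap modes
```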
Adaptive Temporal Partitioning
Generally, the basic coding unit is a block or a cube. However, the number of ISIs differs considerably between pixels. As Fig. 6 shows, even in a background region, the ISI counts vary greatly from pixel to pixel. Consequently, in the proposed framework, the prediction and coding strategies are designed for individual pixels. For a given pixel, the ISI may change significantly because of luminance intensity variation and the object's movement, as also depicted in Fig. 6. In this case, it is reasonable to adaptively partition the ISIs of a pixel into multiple segments.
Figure 6: The distribution of ISI counts for all the pixels from the sequence "rotation".
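One way to realize such an adaptive partition is sketched below, under stated assumptions: the ISI sequence is cut into fixed-length basic segments (32 ISIs, as in the experiments), each segment is fitted with a gamma distribution by the method of moments, and adjacent segments are merged while their parameters are close. The relative tolerance `rel_tol`, the variance guard and the function names are hypothetical; the merging criterion actually used may differ.

```python
import statistics


def fit_gamma(isis):
    """Method-of-moments gamma fit: mean = alpha * beta, variance = alpha * beta**2."""
    mean = statistics.mean(isis)
    var = max(statistics.pvariance(isis), 1e-9)  # guard against constant segments
    return mean * mean / var, var / mean          # (alpha, beta)


def similar(params_a, params_b, rel_tol):
    """True when both gamma parameters agree within a relative tolerance."""
    return all(abs(a - b) <= rel_tol * max(abs(a), abs(b))
               for a, b in zip(params_a, params_b))


def partition(isis, basic_len=32, rel_tol=0.5):
    """Split into basic segments, then merge adjacent segments with similar fits."""
    segments = [isis[i:i + basic_len] for i in range(0, len(isis), basic_len)]
    merged = True
    while merged and len(segments) > 1:
        merged = False
        result = [segments[0]]
        for seg in segments[1:]:
            if similar(fit_gamma(result[-1]), fit_gamma(seg), rel_tol):
                result[-1] = result[-1] + seg   # same distribution: merge
                merged = True
            else:
                result.append(seg)              # different distribution: new segment
        segments = result
    return segments


# Hypothetical pixel: a dark phase (ISIs near 100) followed by a bright phase (near 10).
isis = [99, 100, 101, 100] * 16 + [9, 10, 11, 10] * 8
segments = partition(isis)
print([len(s) for s in segments])
```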
First, the whole ISI sequence is divided into basic segments (32 ISIs each in our experiments). Since the ISIs follow a gamma distribution over a short period of time, a basic segment can be well fitted by a gamma PDF, whose parameters $\alpha$ and $\beta$ are determined by the mean and variance of the ISIs (for a gamma distribution, the mean is $\alpha\beta$ and the variance is $\alpha\beta^{2}$). Then, by comparing the parameters $\alpha$ and $\beta$, we determine whether two adjacent basic segments can be merged. The merging is iterated until all adjacent segments follow different distributions. The pixel then contains multiple segments, which can be well predicted using temporal and spatial correlations.

Intra-pixel Coding
Since the pixels respond independently to luminance changes, for real-time compression scenarios the prediction and coding are restricted to segments of the same pixel, which we call intra-pixel coding. Two prediction modes are designed: mean value mode (MVM) and forward mode (FM), shown in Fig. 7.

a) Mean value mode
For a pixel in the background region, the ISIs are almost the same except for some noise. MVM is designed for this case. After the mean value of the ISIs within the segment is subtracted, the residuals are very close to zero and can be further quantized. The mean value is differentially coded between successive segments.
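A minimal sketch of mean value mode before quantization is given below. It assumes integer ISIs and an integer-rounded segment mean; the function names and the example values are hypothetical.

```python
def mvm_encode(segment, prev_mean=0):
    """Return (mean_delta, residuals) for one segment in mean value mode."""
    mean = round(sum(segment) / len(segment))    # assumption: integer mean
    residuals = [isi - mean for isi in segment]  # close to zero for background pixels
    mean_delta = mean - prev_mean                # mean coded differentially between segments
    return mean_delta, residuals


def mvm_decode(mean_delta, residuals, prev_mean=0):
    """Invert mvm_encode (exact here because no quantization is applied)."""
    mean = prev_mean + mean_delta
    return [mean + r for r in residuals]


segment = [100, 101, 100, 99, 100]   # hypothetical background ISIs
mean_delta, residuals = mvm_encode(segment, prev_mean=98)
print(mean_delta, residuals)
```

Only the small mean difference and near-zero residuals remain to be quantized and entropy coded.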
Figure 7: Mean value mode and forward mode in intra-pixel coding.

b) Forward mode
In the sequence "rotation", the digits "5" and "8" appear periodically at a given pixel. Because of the fast motion, the ISIs within a segment may differ from each other and are not well predicted by MVM. FM uses temporal motion estimation (TME) and temporal motion compensation (TMC) to find a better reference candidate. For accurate prediction, the reference candidate is searched among all the previously coded ISIs. During the TME process, Eq. (6) is used to measure the similarity. The temporal motion vector $mv_t$ is not coded directly. Instead, it is predicted from the motion vectors of the previous $m$ available segments, where available segments are those that were also coded in FM. Then the motion vector difference $mvd_t$ is coded:

$$mvpred_t = \frac{1}{m} \sum_{i=1}^{m} mv_t^{(i)} \quad \text{and} \quad mvd_t = mvpred_t - mv_t \qquad (7)$$

Inter-pixel Coding
To cope with complex situations such as high-speed motion, inter-pixel coding is proposed, which takes advantage of the spatial correlations of the spike data. In inter-pixel coding, TME and TMC are extended to spatiotemporal motion estimation (SME) and spatiotemporal motion compensation (SMC), which enable a spatiotemporal search for the best reference candidate.
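The temporal search that forward mode performs, and that inter-pixel coding extends to neighboring pixels, can be sketched as follows. This is an illustration under assumptions: the motion vector is taken to be the number of ISIs between the start of the reference and the end of the coded history, the search is exhaustive, and the similarity is the root-mean-square reading of Eq. (6). All names and values are hypothetical.

```python
import math


def intensity_distance(isis_a, isis_b):
    """Eq. (6): RMS difference of reciprocal ISIs (equal-length sequences)."""
    k = len(isis_a)
    return math.sqrt(sum((1.0 / a - 1.0 / b) ** 2 for a, b in zip(isis_a, isis_b)) / k)


def temporal_motion_estimation(coded_isis, segment):
    """Search previously coded ISIs for the reference closest to the segment.

    Returns (mv, residuals); mv counts back from the end of coded_isis.
    """
    n = len(segment)
    if len(coded_isis) < n:
        raise ValueError("not enough coded ISIs to form a reference")
    best_start = min(
        range(len(coded_isis) - n + 1),
        key=lambda s: intensity_distance(coded_isis[s:s + n], segment),
    )
    reference = coded_isis[best_start:best_start + n]
    residuals = [cur - ref for cur, ref in zip(segment, reference)]  # motion compensation
    return len(coded_isis) - best_start, residuals


# Hypothetical pixel history: a bright digit passed over the pixel earlier.
coded_isis = [50, 50, 10, 12, 10, 50, 50, 50]
segment = [10, 12, 11]
mv, residuals = temporal_motion_estimation(coded_isis, segment)
print(mv, residuals)
```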
Figure 8: Spatiotemporal motion estimation.

Figure 9: Motion vector prediction for current segment.
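The averaging rule of Eq. (7), applied per component to the motion vectors of the available reference segments, can be sketched as below. The tuple layout (two spatial components and one temporal component) and the example vectors are assumptions for illustration.

```python
def predict_mv(reference_mvs):
    """Component-wise average of the motion vectors of the available segments."""
    if not reference_mvs:
        raise ValueError("at least one reference motion vector is required")
    n = len(reference_mvs)
    return tuple(sum(component) / n for component in zip(*reference_mvs))


def mv_difference(mv, reference_mvs):
    """Eq. (7): mvd = mvpred - mv, computed for every component."""
    return tuple(p - v for p, v in zip(predict_mv(reference_mvs), mv))


def recover_mv(mvd, reference_mvs):
    """Decoder side: mv = mvpred - mvd, using the same prediction."""
    return tuple(p - d for p, d in zip(predict_mv(reference_mvs), mvd))


# Hypothetical motion vectors of already coded neighboring segments.
reference_mvs = [(1, 0, 32), (1, 2, 30), (1, 1, 34)]
mv = (1, 1, 33)                      # motion vector found for the current segment
mvd = mv_difference(mv, reference_mvs)
print(mvd)
```

Only the small difference `mvd` needs to be entropy coded, since the decoder can rebuild the same prediction from segments it has already decoded.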
The SME process is depicted in Fig. 8. Spatially, several previously coded pixels within a spatial range (SR) are selected as reference pixels. Temporally, the reference segments are determined by a temporal range (TR). By comparing the distance between the current segment and each reference segment, the motion vector is obtained, which consists of the two spatial components $mv_x$ and $mv_y$ and the temporal component $mv_t$ shown in Fig. 8. The motion vector prediction (MVP) is performed similarly to that in forward mode, except that in inter mode adjacent pixels are taken into account. Specifically, at most five pixels (A, B, C, D and E in Fig. 9) can be used for prediction. Here the current segment $seg_k(E)$ in pixel E is to be coded, so only its previously coded segments are available, such as $seg_{k-1}(E)$ and $seg_{k-2}(E)$, where $k$ denotes the $k$-th segment in the pixel. For the other four pixels, three segments of each are used. One is the segment corresponding to $seg_k(E)$, namely $seg_k(A)$, $seg_k(B)$, $seg_k(C)$ and $seg_k(D)$. The other two are the previous and the next segments relative to $seg_k(E)$. For example, in pixel A, $seg_{k-1}(A)$, $seg_k(A)$ and $seg_{k+1}(A)$ are available. The average of the motion vectors of all the available reference segments is the MVP of $seg_k(E)$, comprising $mvpred_x$, $mvpred_y$ and $mvpred_t$. Finally, the motion vector differences $mvd_x$, $mvd_y$ and $mvd_t$ are encoded, where $mvd_x = mvpred_x - mv_x$, $mvd_y = mvpred_y - mv_y$ and $mvd_t = mvpred_t - mv_t$. With the best reference segment as the prediction, the residuals of the ISIs are obtained and quantized.

Quantization
For spike data, the prediction residuals need to be quantized. Typically, a uniform quantizer is used. However, according to Eq. (5), the same ISI distortion may lead to completely different intensity changes. For instance, for a distortion $\Delta t = 1$, the intensity change is $\Delta I = \phi/2$ if $t = 1$, but when $t$ becomes 100, $\Delta I = \phi/10100$, which is much smaller. Thus, a uniform quantization step is not appropriate across different ISIs. To address the problem, the ISI itself is involved in the quantization, so that each ISI has its own quantization step. One may ask how the decoder obtains the ISI value. Indeed, the raw ISI is not available to the decoder, but thanks to the prediction modes, the predicted ISIs can be used in the quantization instead. Specifically, in MVM the mean value of the segment is available, and in FM and inter-prediction mode the ISIs of the best reference segment are available. Eq. (8) describes the quantization, in which $r$ denotes the prediction residual, $C$ the quantized residual, and $q_{step}(t)$ the quantization step for an ISI of $t$. The function $\mathrm{round}(\cdot)$ rounds its argument to the nearest integer.

$$C = \mathrm{round}\left( \frac{r}{q_{step}(t)} \right) \qquad (8)$$

Each quantization step $q_{step}(t)$ is defined as the maximum variation of the ISI that leads to a negligible intensity change. Considering the rounding, the intensity remains the same when $-0.5 < \Delta I < 0.5$. Substituting Eq. (5), the ISI variation $\Delta t$ satisfies $-\frac{t^{2}}{2\phi + t} < \Delta t < \frac{t^{2}}{2\phi - t}$. Since $\left| \frac{t^{2}}{2\phi - t} \right| \ge \left| \frac{t^{2}}{2\phi + t} \right|$, we select the bound with the larger denominator, which keeps $|\Delta I|$ below 0.5 in both directions, and the quantization step is

$$q_{step}(t) = \frac{t^{2}}{2\phi + t} \qquad (9)$$

So the quantization step at each ISI can be computed according to Eq. (9). In this paper, we use a 5-bit (32-level) quantizer.

Experimental Results
To evaluate the compression performance, we construct the PKU-Spike dataset, which contains six sequences in two categories: high-speed motion and normal-speed scenarios. Each sequence is captured by the spike camera at 40,000 Hz and has a length of 3.84 seconds. As discussed above, there are 32 quantization levels, with the quantization parameter (QP) ranging from 1 to 32. The spatial and temporal search ranges are set to 3 and 32, respectively. Table 1 reports the compression ratio relative to the raw data for various QPs.
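The ISI-dependent quantizer of Eq. (8) and (9) that underlies these experiments can be sketched as follows. The threshold value, the predicted ISIs and the function names are hypothetical, and the way the QP scales the step is not modeled, since only Eq. (8) and (9) are used here.

```python
def intensity_change(t, dt, threshold):
    """Eq. (5): intensity change caused by an ISI variation dt."""
    return threshold * dt / (t * t + t * dt)


def q_step(t, threshold):
    """Eq. (9): quantization step for a (predicted) ISI of t."""
    return t * t / (2 * threshold + t)


def quantize(residual, predicted_isi, threshold):
    """Eq. (8): the step depends on the predicted ISI, which the decoder also knows."""
    # Python's round() resolves exact ties to the nearest even integer.
    return round(residual / q_step(predicted_isi, threshold))


def dequantize(level, predicted_isi, threshold):
    """Reconstruct the residual from its quantized level."""
    return level * q_step(predicted_isi, threshold)


phi = 100  # hypothetical dispatch threshold
# The same residual of 3 is kept for a short ISI and discarded for a long one.
print(quantize(3, predicted_isi=10, threshold=phi))
print(quantize(3, predicted_isi=200, threshold=phi))
```

A residual of a few time units is significant for a bright pixel with a short ISI but negligible for a dark pixel with a long ISI, which is why the step grows with the predicted ISI.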
Table 1: Compression performance of the proposed coding method using different QPs.
Sequence | Compression ratio (compared to the raw data) at QP4, QP8, QP12, QP16, QP20, QP24, QP28, QP32
Normal speed: office, rolling, wavehand
High speed: fork, disk-pku, rotation
Fig. 10 depicts the distortion caused by the compression. Both the intensity-based distance proposed in Section 3.2 and the kernel-method-based distance are used to measure the distortion. The curves show that the intensity-based distance increases with the compression ratio, whereas in some sequences, such as "rolling" and "fork", the kernel-method-based distance does not clearly reflect the distortion.
Figure 10: The distortion evaluation using the intensity-based distance and the kernel method based distance.
In addition, we reconstruct 1,000 images according to Eq. (4) from both the raw data and the decoded data. The images are then evaluated with two commonly used metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), as shown in Fig. 11. The results reveal that the fidelity of the spikes is well maintained.

Conclusion
In this paper, we aim to compress the large amount of spike data generated by the spike camera. To better evaluate spike train distances, an intensity-based measurement is proposed according to the sampling mechanism. The spatiotemporal characteristics are then analyzed in depth by modelling the inter-spike intervals. On the basis of the probability density distribution of the ISIs, a unified lossy coding framework is designed, comprising adaptive temporal partitioning, intra-/inter-pixel coding with multiple prediction modes, quantization and entropy coding. Finally, the experimental evaluations on the PKU-Spike dataset show that the proposed coding method is efficient in compression while maintaining the fidelity of the spike data.
Figure 11: The measure of reconstructed images via PSNR and SSIM.
References
[1] P. Lichtsteiner, C. Posch and T. Delbruck, "A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor," IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566-576, Feb. 2008.
[2] C. Brandli, R. Berner, M. Yang, S. Liu and T. Delbruck, "A 240×180 130 dB 3 µs Latency Global Shutter Spatiotemporal Vision Sensor," IEEE Journal of Solid-State Circuits, vol. 49, no. 10, pp. 2333-2341, Oct. 2014.
[3] C. Posch, D. Matolin and R. Wohlgenannt, "An asynchronous time-based image sensor," 2008 IEEE International Symposium on Circuits and Systems, Seattle, WA, 2008, pp. 2130-2133.
[4] S. Dong, T. Huang and Y. Tian, "Spike Camera and Its Coding Methods," 2017 Data Compression Conference (DCC), Snowbird, UT, 2017, pp. 437-437.
[5] Z. Bi, S. Dong, Y. Tian and T. Huang, "Spike Coding for Dynamic Vision Sensors," 2018 Data Compression Conference, Snowbird, UT, 2018, pp. 117-126.
[6] D. M. MacKay and W. S. McCulloch, "The limiting information capacity of a neuronal link," Bulletin of Mathematical Biology, vol. 14, no. 2, pp. 127-135, 1952.
[7] G. S. Bhumbra and R. E. J. Dyball, "Measuring spike coding in the rat supraoptic nucleus," The Journal of Physiology, vol. 555, no. 1, pp. 281-296, 2004.
[8] G. S. Bhumbra and R. E. J. Dyball, "Spike coding from the perspective of a neurone," Cognitive Processing, vol. 6, no. 3, pp. 157-176, 2005.
[9] G. N. Reeke and A. D. Coop, "Estimating the Temporal Interval Entropy of Neuronal Discharge," Neural Computation, vol. 16, no. 5, pp. 941-970, 1 May 2004.
[10] P. H. E. Tiesinga, J. V. José, T. J. Sejnowski, "Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels," Physical Review E, vol. 62, no. 6, pp. 8413, 2000.
[11] T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, July 2003.
[12] G. J. Sullivan, J. Ohm, W. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[13] L. M. Leemis and J. T. McQueston, "Univariate distribution relationships,"