Neural Network-based Acoustic Vehicle Counting
Slobodan Djukanović, Yash Patel, Jiří Matas, Tuomas Virtanen

Czech Technical University, Faculty of Electrical Engineering, Prague, Czech Republic
Tampere University, Audio Research Group, Tampere, Finland
ABSTRACT
This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of a vehicle-to-microphone distance predicted from audio. The distance is predicted via a two-stage (coarse-fine) regression, with both stages realised using neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The confidence interval for the mean of the vehicle counting error is within [0. , −0. ]. Besides the minima-based counting, we propose a deep learning counting method which operates on the predicted distance without detecting local minima. Results also show that removing low frequencies in the features improves the counting performance.

Index Terms — Vehicle counting, log-mel spectrogram, neural network, peak detection, deep learning.
1. INTRODUCTION
Traffic monitoring (TM) systems use different traffic data to improve the use and performance of roadway systems, transportation safety, law enforcement, prediction of future transportation needs, etc. TM data include estimates of vehicle count, traffic volume, speed of vehicles and of various vehicle parameters (length, weight, class) [1].

Current TM systems use diverse sensors and technologies, including induction loops, vibration, piezoelectric, infrared, ultrasonic, magnetic and acoustic sensors, and cameras [1]. Vision-based TM systems have recently become popular due to breakthroughs in object detection, tracking and classification tasks provided by deep learning methods [2]. In addition, a single camera suffices to cover multiple lanes for the TM tasks, which is not the case for other sensor technologies. However, in addition to high computational complexity, issues like partial occlusion, shadows and illumination variation limit the performance of vision-based TM systems [1, 3].

Acoustic TM has several advantages in comparison with the vision-based one [1]. Microphones are less expensive than
cameras, consume less energy, require less storage space, and are easier to install and maintain, with low wear and tear. In addition, acoustic TM is not affected by visual occlusions and lighting conditions, and has fewer privacy issues.

This paper addresses acoustic vehicle counting using one-channel audio. The standard approach is to detect temporal variation of the sound power due to vehicles passing by the acoustic sensor [4–6], which is performed using state transitions of a hidden Markov model [4] or by a peak-picking algorithm [5, 6]. The maximal frequency at which the power of a time-frequency representation reaches a predefined threshold, a.k.a. the top-right frequency, also enables detecting vehicles passing by the microphone [7]. The method of [8] uses a prediction of a pseudo-distance between a vehicle and the microphone, referred to as the clipped vehicle-to-microphone distance. Vehicle counting in [8], carried out by counting local minima in the predicted distance, outperforms the approaches based on peak detection in the sound power and on the top-right frequency. However, the optimal, false negative - false positive compensating detection threshold in [8] can only be imprecisely estimated a priori. Moreover, the distance regression of [8] is computationally demanding.

In this paper, we significantly improve the distance regression, and thus the counting accuracy, compared to [8], with a computationally less demanding approach. We first overview the clipped vehicle-to-microphone distance (Section 2) and then propose a new counting method (Section 3). Experimental results are given in Section 4, and Section 5 concludes the paper.

The research was supported by the Research Center for Informatics (project CZ.02.1.01/0.0/0.0/16019/0000765 funded by OP VVV) and a CTU student grant (SGS OHK3-019/20). Slobodan Djukanović was supported by the OP RDE programme, project International Mobility of Researchers MSCA-IF III at CTU in Prague, No. CZ.02.2.69/0.0/0.0/19 074/0016255.
2. CLIPPED VEHICLE-TO-MICROPHONE DISTANCE AND VEHICLE COUNTING
In [8], the clipped vehicle-to-microphone distance of the k-th vehicle is defined as

    d^(k)(t) = |t − T^(k)|,  if |t − T^(k)| < T_d,
               T_d,          elsewhere,                           (1)

where T^(k) represents the pass-by instant and T_d is the distance threshold. The V-shape of d^(k)(t) models the approaching and receding of the vehicle from the microphone (see the dotted line around T^(1) in Fig. 1). We will refer to the clipped vehicle-to-microphone distance simply as the distance. When the audio contains N_v vehicles, only the distance of the closest vehicle is taken into account, so the overall distance is defined as the minimum of all separate distances (dotted line in Fig. 1):

    D(t) = min{ d^(1)(t), d^(2)(t), ..., d^(N_v)(t) }.            (2)

Fig. 1. Illustration of a reference distance, the distance predicted from audio, and the classification of predicted distance minima.

In [8], the vehicle count is equal to the number of detected local minima of the predicted distance (orange line in Fig. 1) that fall below a detection threshold. Not every local minimum below the threshold corresponds to a vehicle passing by the microphone. Only minima that occur within the true vehicle pass-by intervals (horizontal arrows in Fig. 1) represent true positives (TPs). Other minima below the threshold represent false positives (FPs), whereas minima that occur within the corresponding pass-by intervals but are above the threshold represent false negatives (FNs).

The method [8] is evaluated using the TP, FP and FN probabilities, p_TP, p_FP and p_FN, calculated for a variable detection threshold. The optimal threshold is obtained at the point where p_FP = p_FN, since then the FPs and the FNs cancel each other in a statistical sense and the total number of detected vehicles equals the true number of vehicles.
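Equations (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration only; the pass-by instants and the threshold value below are made-up example values, not from the paper's dataset:

```python
import numpy as np

def clipped_distance(t, T_k, T_d):
    """Clipped distance d^(k)(t) of one vehicle, Eq. (1):
    |t - T^(k)| where it is below T_d, clipped to T_d elsewhere."""
    return np.minimum(np.abs(t - T_k), T_d)

def overall_distance(t, pass_by_instants, T_d):
    """Overall distance D(t), Eq. (2): pointwise minimum of the
    per-vehicle clipped distances."""
    d = np.stack([clipped_distance(t, T_k, T_d) for T_k in pass_by_instants])
    return d.min(axis=0)

# Example with two hypothetical pass-by instants (in seconds)
t = np.linspace(0, 20, 2001)
D = overall_distance(t, pass_by_instants=[4.0, 11.5], T_d=1.0)
```

Each pass-by produces a V-shaped dip in D(t) that reaches zero at the pass-by instant; everywhere else D(t) saturates at T_d.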
In terms of counting error, the best generalization in [8] is obtained when the distance is predicted using the log-mel spectrogram (LMS) and the newly introduced high-frequency power (HFP) as input features. With HFP, the counting error remains low within a wide range of detection threshold values.

Although characterized by a low counting error within a wide range of detection thresholds, the optimal threshold of [8] is not known in advance. Our first objective is to extend the low-error threshold range, i.e., to make counting more robust to the choice of detection threshold. Another drawback of [8] is its computational complexity. For distance regression, it uses support vector regression (SVR), implemented in the libsvm library [9]. Its complexity scales between O(n_f n_s^2) and O(n_f n_s^3), where n_f and n_s represent the number of features and samples in the dataset, respectively. Therefore, our second objective is to perform distance regression in a computationally more efficient way, to enable scaling to larger datasets. A method which fulfills these two objectives is described in the sequel.
3. NEURAL NETWORK-BASED COUNTING
To address the low-counting-error objective, we propose to improve the accuracy of distance regression. To that end, we carry out a two-stage (coarse-fine) regression. To address the computational complexity objective, we propose to use fully-connected neural networks (FCNNs) instead of the originally used SVR. The block diagram of the proposed method is presented in Fig. 2 (top). x_i(t), i = 1, ..., M, are the input features at time instant t, and Ď(t) represents the predicted distance at t. The Vehicle counting block carries out counting based on the distance prediction of the whole audio file.
Fig. 2. Top: Block diagram of the proposed vehicle counting method.
Bottom: Distance regression in detail. Stage 2 improves the distance regression output of Stage 1.
A detailed representation of the proposed distance regression is given in Fig. 2 (bottom). The Stage 1 FCNN performs regression based on the input features, similarly to the SVR in [8]. A vector of 2K + 1 successive distances Ḋ(t − K), ..., Ḋ(t + K) predicted by the Stage 1 FCNN, centered at t, represents the input features to the Stage 2 FCNN. The task of the Stage 2 FCNN is to refine the output of Stage 1.

As input features, we use the HFP+LMS combination, as suggested in [8]. Since HFP represents the power of the high-frequency portion of the signal spectrum, we incorporate it into the LMS by leaving out a number of filters with the lowest central frequencies in the mel spectrogram filter bank [10]. The resulting LMS, referred to as the high-frequency LMS (HF-LMS), does not include the low-frequency portion of the spectrum, which contains the most significant part of the environment noise. To take into account the time dependence between adjacent D(t) values, the value D(t) is predicted using the samples of the HF-LMS spectrum at instant t and at Q preceding and following instants (for details, see Section 3.3).

Prior to vehicle counting, the predicted distance is low-pass filtered to eliminate high-frequency oscillations, which is discussed in Section 4. In this paper, we propose two vehicle counting approaches, both based on the final (Stage 2) predicted distance.
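The Stage 2 input construction described above (a sliding window of 2K + 1 Stage 1 predictions centered at t) can be sketched as follows. Edge handling by replication padding is our assumption; the paper does not specify how the file boundaries are treated:

```python
import numpy as np

def stage2_inputs(stage1_pred, K=15):
    """Build Stage 2 FCNN inputs: for each instant t, a vector of the
    2K+1 Stage 1 predictions centered at t (edges padded by replication,
    an assumption of this sketch)."""
    padded = np.pad(stage1_pred, K, mode="edge")
    # One window of length 2K+1 per original time instant
    return np.lib.stride_tricks.sliding_window_view(padded, 2 * K + 1)

pred = np.random.rand(500)       # hypothetical Stage 1 output for one file
X2 = stage2_inputs(pred, K=15)   # shape: (500, 31)
```

With K = 15 this matches the Stage 2 input dimensionality of 31 given in Section 3.3.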
3.2.1. Minima-based counting

In [8], local minima of the predicted distance were detected by detecting peaks (local maxima) of the inverted distance (see the dashed line in Fig. 3 (bottom)) based on their prominence. The prominence of a peak measures how much the peak stands out due to its height and location relative to other peaks, and is defined as the vertical distance between the peak and its lowest contour line [11]. Here, we extend this approach by introducing a peak magnitude criterion. If two close peaks have similar magnitudes (corresponding to two close vehicles), the prominence of the weaker one can be much smaller than that of the stronger one. The weaker peak can be left out if detection is based on prominence only. Therefore, we define vehicle detection as follows: a vehicle is detected if a detected peak of the inverted predicted distance has magnitude larger than M_p or prominence larger than P_p. The selection of M_p and P_p is discussed in Section 4.

3.2.2. Deep counting

A convolutional NN [12] is used for counting. The model operates on the raw predicted distance and estimates the vehicle count directly, without an intermediate local minima detection. It consists of 1D convolutional, global-average-pooling and fully-connected layers. The global average pooling allows the model to operate on distances of varying length [13]. The model is trained to predict the vehicle count directly. We experimented with three different loss functions: the L1, L2 and smooth L1 distances. The model trained with the smooth L1 distance gives the best performance. Training with smooth L1 is performed using a surrogate learned via a deep embedding, where the Euclidean distance between the prediction and the ground truth corresponds to the smooth L1 distance [14]. The deep embedding is realized using a shallow FCNN. The surrogate and the counting model are trained in parallel.

3.3. Implementation details

The HF-LMS feature is based on the spectrogram of the input signal. In this paper, the length of the sliding (Hamming) window used in the spectrogram calculation is N_w = 4096 and
the stride length is N_h = 1634 samples [15], which, with 20-second audio files sampled at f_s = 44100 Hz, gives the same feature time-length for every file. In addition, N_mel = 48 mel bands are used in the HF-LMS, with the lowest frequency f_min = 1000 Hz (and f_max = f_s/2). To form a vector of input features, we take the HF-LMS spectra at Q = 5 preceding and following instants, with a fixed stride. Therefore, the dimensionality of the input space is M = (2Q + 1) N_mel = 528. The input dimensionality of the Stage 2 FCNN is 2K + 1 = 31. For the distance threshold, we take T_d = 0. s [8].

The Stage 1 and Stage 2 FCNNs have four layers each. These configurations gave the best regression performance, cross-validated on the training data. Both FCNNs use the mean squared error loss, ReLU activation, and L2 kernel regularization. Batch normalization is applied at each layer, except for the output, after the activation. The model is implemented in Keras. Peak detection is based on the scipy.signal.find_peaks procedure (SciPy library for Python).
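The HF-LMS construction of Section 3.3 (a mel filter bank whose low-frequency filters are left out) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' exact implementation, which builds on the standard mel features of [10]; the triangular-filter construction below follows the common textbook recipe:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hf_mel_filterbank(n_mels=48, n_fft=4096, fs=44100, f_min=1000.0):
    """Triangular mel filter bank that starts at f_min, i.e. the
    low-frequency filters are left out, as in the HF-LMS feature."""
    f_max = fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def hf_lms(power_spectrogram, fb, eps=1e-10):
    """HF-LMS: log-mel spectrogram restricted to high frequencies.
    power_spectrogram has shape (n_fft//2 + 1, n_frames)."""
    return np.log(fb @ power_spectrogram + eps)
```

Because the filter bank starts at f_min = 1000 Hz, all spectrogram bins below that frequency receive zero weight, which is precisely the point of the feature: the noisiest low-frequency band is excluded.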
4. EXPERIMENTS
The first evaluation metric we use is the normalized area under the curve (NAUC) of p_TP(T_det), i.e., the average value of p_TP over the entire detection threshold T_det interval [8]. p_TP is calculated (as the percentage of detected minima within the pass-by intervals, with a maximum of one minimum per interval) at equidistant T_det points. The second metric is the relative vehicle counting error

    RVCE = (N_v^true − N_v^est) / N_v^true × 100 [%],             (3)

where N_v^true and N_v^est represent the true and the estimated vehicle count. As opposed to (3), the RVCE definition in [8] uses the absolute difference. Here, the signed RVCE enables distinguishing between counting underestimation (positive RVCE) and overestimation (negative RVCE).

We use the dataset from [8], which contains two parts: VC-PRG-1:5 (
250 20-second audio files) and VC-PRG-6 (a separate set of audio files). The proposed vehicle counting method (referred to as FCNN-VC) is trained and validated (with a training-validation split) using VC-PRG-1:5 and tested on VC-PRG-6. We run the method several times, with the training data shuffled each time. Along with the output of Stage 2, we also consider the output of the Stage 1 FCNN (denoted as FCNN_S).

Minima detection, and therefore the probabilities p_TP, p_FP and p_FN, as well as the RVCE, are affected by i) the low-pass filtering of the predicted distance, ii) the value of M_p, and iii) the value of P_p (Section 3.2.1). To ensure a fair comparison of performances, we determine an optimal set of parameters for every FCNN-based approach and report only the corresponding optimal results. The optimality criterion is the averaged absolute RVCE over the detection threshold range [50% T_d, T_d]. We consider all possible combinations of i) low-pass filters (successive moving-average filters (MAFs) of several lengths), ii) M_p values, and iii) P_p values, each expressed as a percentage of T_d. The optimal parameters for FCNN_S-VC and FCNN-VC are presented in the first two rows of Table 1.

Table 1. Optimal minima (vehicle) detection parameters

Setup       Optimal parameters
FCNN_S-VC   MAFs (7, ·), M_p = 45% T_d, P_p = 25% T_d
FCNN-VC     MAFs (5, ·), M_p = 40% T_d, P_p = 20% T_d
FCNN_f-VC   MAFs (5, ·), M_p = 45% T_d, P_p = 20% T_d

Fig. 3. Top: TP, FP and FN probabilities (see text below).
Bottom: Distance predictions for one audio file. Minima are detected by detecting peaks of the inverted predicted distance (blue dashed line).

Figure 3 (top) compares p_TP, p_FP and p_FN of the SVR-based counting [8] (carried out with HFP+LMS) and of one run of FCNN_S-VC and FCNN-VC. A significant improvement is reflected in the increase of NAUC from SVR-VC to FCNN_S-VC and further to FCNN-VC, the latter two averaged over all runs. The dots in Fig. 3 (top) represent the points of equal false probabilities (EFP) [8].

Figure 3 (bottom) compares the distance predictions of one audio file carried out via SVR-VC, FCNN_S-VC and FCNN-VC. The corresponding mean square regression errors on the testing set show that the proposed method significantly outperforms SVR-VC in terms of regression accuracy. In addition, Stage 2 improves the regression accuracy with respect to Stage 1.

The RVCE plots for the detection threshold range [30% T_d, T_d] are given in Fig. 4 (top). With the exception of SVR, the RVCEs are shown as confidence intervals for the mean (CIM). Here, (l_1, l_2) denotes filtering first with an MAF of length l_1, then with an MAF of length l_2. In addition to SVR-VC, FCNN_S-VC and FCNN-VC, we present the RVCE of the deep counting approach (Section 3.2.2) and of FCNN-VC when the full frequency range is used in the LMS input features (f_min = 0, see Section 3.3). The latter is denoted as FCNN_f-VC, and its optimal parameters are given in the bottom row of Table 1. The CIM of the Deep-VC RVCE is a narrow horizontal band. FCNN-VC outperforms the other approaches by a significant margin; its CIM is within [0. , −0. ] for thresholds above T_d.

The combined peak magnitude-prominence criterion in minima detection (Section 3.2.1) improves the counting performance with respect to detection based solely on the peak prominence [8], as illustrated in Fig. 4 (bottom).
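The signed RVCE of Eq. (3) and the confidence interval for the mean over repeated runs can be sketched as follows. The use of a Student-t interval is our assumption; the paper does not state how the CIM is computed:

```python
import numpy as np
from scipy import stats

def rvce(n_true, n_est):
    """Signed relative vehicle counting error, Eq. (3), in percent.
    Positive values mean undercounting, negative mean overcounting."""
    return (n_true - n_est) / n_true * 100.0

def cim(values, confidence=0.95):
    """Confidence interval for the mean (CIM) over repeated runs,
    computed as a Student-t interval (an assumption of this sketch)."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    half = stats.t.ppf((1 + confidence) / 2, df=len(v) - 1) * stats.sem(v)
    return m - half, m + half
```

For example, rvce(100, 90) returns +10.0 (ten vehicles missed), while rvce(100, 110) returns −10.0 (ten spurious detections).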
Peak prominences of several values are considered, after applying two successive MAFs.

Fig. 4. Relative vehicle counting error, RVCE, as a function of the detection threshold. With the exception of SVR, the RVCEs are shown as confidence intervals for the mean. The proposed FCNN-VC (in blue) is compared with the alternatives (for their description, see the text).
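The magnitude-or-prominence detection rule of Section 3.2.1 can be sketched with scipy.signal.find_peaks. Taking the union of two find_peaks calls is our way of expressing the "or"; it is not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_vehicles(distance, T_d, m_p=0.40, p_p=0.20):
    """Detect vehicles as peaks of the inverted predicted distance with
    magnitude > m_p*T_d OR prominence > p_p*T_d (Section 3.2.1).
    m_p and p_p are fractions of the distance threshold T_d, in the
    spirit of Table 1 (e.g. M_p = 40% T_d, P_p = 20% T_d for FCNN-VC)."""
    inverted = T_d - np.asarray(distance)     # minima become peaks
    by_magnitude, _ = find_peaks(inverted, height=m_p * T_d)
    by_prominence, _ = find_peaks(inverted, prominence=p_p * T_d)
    return np.union1d(by_magnitude, by_prominence)
```

The magnitude criterion recovers a weak peak sitting next to a strong one (two close vehicles), which a prominence-only rule would suppress.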
5. CONCLUSIONS
We proposed a method for acoustic vehicle counting based on the clipped vehicle-to-microphone distance. The distance was predicted using a two-stage NN-based regression. A significant improvement in regression accuracy with respect to the SVR-based approach resulted in highly accurate vehicle counting that does not depend on the detection threshold within a wide range of threshold values. Deep counting, an alternative to the local minima-based counting, estimates the vehicle count directly from the predicted distance, without detecting local minima. Although it is outperformed in accuracy by the minima-based approach, a significant advantage of deep counting is that it does not depend on minima detection parameters. Our future work will address the development of an end-to-end vehicle counting method.

6. REFERENCES

[1] Myounggyu Won, "Intelligent traffic monitoring systems for vehicle classification: A survey," IEEE Access, vol. 8, pp. 73340-73358, 2020.
[2] Milind Naphade et al., "The 2019 AI City challenge," in CVPR Workshops, 2019, pp. 452-460.
[3] Brendan Tran Morris and Mohan Manubhai Trivedi, "A survey of vision-based trajectory learning and analysis for surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1114-1127, 2008.
[4] Jien Kato, "An attempt to acquire traffic density by using road traffic sound," in Proceedings of the 2005 International Conference on Active Media Technology (AMT 2005). IEEE, 2005, pp. 353-358.
[5] Jobin George, Leena Mary, and K. S. Riyas, "Vehicle detection and classification from acoustic signal using ANN and KNN," IEEE, 2013, pp. 436-439.
[6] Jobin George, Anila Cyril, Bino I. Koshy, and Leena Mary, "Exploring sound signature for vehicle detection and classification using ANN," International Journal on Soft Computing, vol. 4, no. 2, p. 29, 2013.
[7] Sugang Li, Xiaoran Fan, Yanyong Zhang, Wade Trappe, Janne Lindqvist, and Richard E. Howard, "Auto++: Detecting cars using embedded microphones in real-time," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, p. 70, 2017.
[8] Slobodan Djukanović, Jiří Matas, and Tuomas Virtanen, "Robust audio-based vehicle counting in low-to-moderate traffic flow," 2020.
[9] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1-27, 2011.
[10] Romain Serizel, Victor Bisot, Slim Essid, and Gaël Richard, "Acoustic features for environmental sound analysis," in Computational Analysis of Sound Scenes and Events, pp. 71-101. Springer, 2018.
[11] "Topographic prominence," https://en.wikipedia.org/wiki/Topographic_prominence, Accessed: 2020-10-19.
[12] Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level convolutional networks for text classification," in NeurIPS, 2015.
[13] Min Lin, Qiang Chen, and Shuicheng Yan, "Network in network," ICLR, 2014.
[14] Yash Patel, Tomas Hodan, and Jiri Matas, "Learning surrogates via deep embedding," ECCV, 2020.
[15] Ljubiša Stanković, Igor Djurović, Srdjan Stanković, Marko Simeunović, Slobodan Djukanović, and Miloš Daković, "Instantaneous frequency in time-frequency analysis: Enhanced concepts and performance of estimation algorithms,"