Neural Network-based Acoustic Vehicle Counting
Slobodan Djukanović, Yash Patel, Jiří Matas, Tuomas Virtanen

Czech Technical University, Faculty of Electrical Engineering, Prague, Czech Republic
Tampere University, Audio Research Group, Tampere, Finland
ABSTRACT
This paper addresses acoustic vehicle counting using one-channel audio. We predict the pass-by instants of vehicles from local minima of a vehicle-to-microphone distance predicted from audio. The distance is predicted via a two-stage (coarse-fine) regression, with both stages realised using neural networks (NNs). Experiments show that the NN-based distance regression outperforms by far the previously proposed support vector regression. The confidence interval for the mean of the vehicle counting error is within [0. , −0. ]. Besides the minima-based counting, we propose a deep learning counting method which operates on the predicted distance without detecting local minima. Results also show that removing low frequencies in the features improves the counting performance.

Index Terms — Vehicle counting, log-mel spectrogram, neural network, peak detection, deep learning.
1. INTRODUCTION
Traffic monitoring (TM) systems use different traffic data to improve the use and performance of roadway systems, transportation safety, law enforcement, prediction of future transportation needs, etc. TM data include estimates of vehicle count, traffic volume, speed of vehicles and of various vehicle parameters (length, weight, class) [1].

Current TM systems use diverse sensors and technologies, including induction loops, vibration, piezoelectric, infrared, ultrasonic, magnetic and acoustic sensors, and cameras [1]. Vision-based TM systems have recently become popular due to breakthroughs in object detection, tracking and classification tasks provided by deep learning methods [2]. In addition, a single camera suffices to cover multiple lanes for the TM tasks, which is not the case for other sensor technologies. However, in addition to high computational complexity, issues like partial occlusion, shadows and illumination variation limit the performance of vision-based TM systems [1, 3].

Acoustic TM has several advantages in comparison with the vision-based one [1]. Microphones are less expensive than
cameras, consume less energy, require less storage space, and are easier to install and maintain, with low wear and tear. In addition, acoustic TM is not affected by visual occlusions and lighting conditions, and has fewer privacy issues.

This paper addresses acoustic vehicle counting using one-channel audio. The standard approach is to detect temporal variation of the sound power due to vehicles passing by the acoustic sensor [4–6], which is performed using state transitions of a hidden Markov model [4] or by a peak-picking algorithm [5, 6]. The maximal frequency at which the power of a time-frequency representation reaches a predefined threshold, a.k.a. the top-right frequency, also enables detecting vehicles passing by the microphone [7]. The method of [8] uses a prediction of a pseudo-distance between a vehicle and the microphone, referred to as the clipped vehicle-to-microphone distance. Vehicle counting in [8], carried out by counting local minima in the predicted distance, outperforms the approaches based on peak detection in the sound power and on the top-right frequency. However, the optimal, false negative - false positive compensating detection threshold in [8] can only be imprecisely estimated a priori. Moreover, the distance regression of [8] is computationally demanding.

In this paper, we significantly improve the distance regression, and thus the counting accuracy, compared to [8], with a computationally less demanding approach. We first overview the clipped vehicle-to-microphone distance (Section 2) and then propose a new counting method (Section 3). Experimental results are given in Section 4, and Section 5 concludes the paper.

The research was supported by the Research Center for Informatics (project CZ.02.1.01/0.0/0.0/16019/0000765 funded by OP VVV) and a CTU student grant (SGS OHK3-019/20). Slobodan Djukanović was supported by the OP RDE programme, project International Mobility of Researchers MSCA-IF III at CTU in Prague, No. CZ.02.2.69/0.0/0.0/19 074/0016255.
2. CLIPPED VEHICLE-TO-MICROPHONE DISTANCE AND VEHICLE COUNTING
In [8], the clipped vehicle-to-microphone distance of the k-th vehicle is defined as

    d^(k)(t) = |t − T^(k)|,  if |t − T^(k)| < T_d,
               T_d,          elsewhere,                           (1)

where T^(k) represents the pass-by instant and T_d is the distance threshold. The V-shape of d^(k)(t) models the approaching and receding of the vehicle from the microphone (see the dotted line around T^(1) in Fig. 1). We will refer to the clipped vehicle-to-microphone distance simply as the distance. When the audio contains N_v vehicles, only the distance of the closest vehicle is taken into account, so the overall distance is defined as the minimum of all separate distances (dotted line in Fig. 1):

    D(t) = min{ d^(1)(t), d^(2)(t), ..., d^(N_v)(t) }.            (2)

Fig. 1. Illustration of a reference distance, the distance predicted from audio, and the classification of predicted distance minima.

In [8], the vehicle count is equal to the number of detected local minima of the predicted distance (orange line in Fig. 1) that fall below a detection threshold. Not every local minimum below the threshold corresponds to a vehicle passing by the microphone. Only minima that occur within the true vehicle pass-by intervals (horizontal arrows in Fig. 1) represent true positives (TPs). Other minima below the threshold represent false positives (FPs), whereas minima that occur within the corresponding pass-by intervals but are above the threshold represent false negatives (FNs).

The method [8] is evaluated using the TP, FP and FN probabilities, p_TP, p_FP and p_FN, calculated for a variable detection threshold. The optimal threshold is obtained at the point where p_FP = p_FN, since then the FPs and the FNs cancel each other in a statistical sense and the total number of detected vehicles equals the true number of vehicles.
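Equations (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration only; the pass-by instants and the threshold value below are made-up example values, not from the paper's dataset:

```python
import numpy as np

def clipped_distance(t, T_k, T_d):
    """Clipped distance d^(k)(t) of one vehicle, Eq. (1):
    |t - T^(k)| where it is below T_d, clipped to T_d elsewhere."""
    return np.minimum(np.abs(t - T_k), T_d)

def overall_distance(t, pass_by_instants, T_d):
    """Overall distance D(t), Eq. (2): pointwise minimum of the
    per-vehicle clipped distances."""
    d = np.stack([clipped_distance(t, T_k, T_d) for T_k in pass_by_instants])
    return d.min(axis=0)

# Example with two hypothetical pass-by instants (in seconds)
t = np.linspace(0, 20, 2001)
D = overall_distance(t, pass_by_instants=[4.0, 11.5], T_d=1.0)
```

Each pass-by produces a V-shaped dip in D(t) that reaches zero at the pass-by instant; everywhere else D(t) saturates at T_d.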
In terms of counting error, the best generalization in [8] is obtained when the distance is predicted using the log-mel spectrogram (LMS) and the newly introduced high-frequency power (HFP) as input features. With HFP, the counting error remains low within a wide range of detection threshold values.

Although characterized by a low counting error within a wide range of detection thresholds, the optimal threshold of [8] is not known in advance. Our first objective is to extend the low-error threshold range, i.e., to make counting more robust to the choice of detection threshold. Another drawback of [8] is its computational complexity. For distance regression, it uses support vector regression (SVR), implemented in the libsvm library [9]. Its complexity scales between O(n_f n_s^2) and O(n_f n_s^3), where n_f and n_s represent the number of features and samples in the dataset, respectively. Therefore, our second objective is to perform distance regression in a computationally more efficient way, to enable scaling to larger datasets. A method which fulfills these two objectives is described in the sequel.
3. NEURAL NETWORK-BASED COUNTING
To address the low-counting-error objective, we propose to improve the accuracy of distance regression. To that end, we carry out a two-stage (coarse-fine) regression. To address the computational complexity objective, we propose to use fully-connected neural networks (FCNNs) instead of the originally used SVR. The block diagram of the proposed method is presented in Fig. 2 (top). x_i(t), i = 1, ..., M, are the input features at time instant t, and Ď(t) represents the predicted distance at t. The Vehicle counting block carries out counting based on the distance prediction of the whole audio file.
Fig. 2. Top: Block diagram of the proposed vehicle counting method.
Bottom: Distance regression in detail. Stage 2 improves the distance regression output of Stage 1.
A detailed representation of the proposed distance regression is given in Fig. 2 (bottom). The Stage 1 FCNN performs regression based on the input features, similarly to the SVR in [8]. A vector of 2K + 1 successive distances Ḋ(t − K), ..., Ḋ(t + K) predicted by the Stage 1 FCNN, centered at t, represents the input features to the Stage 2 FCNN. The task of the Stage 2 FCNN is to refine the output of Stage 1.

As input features, we use the HFP+LMS combination, as suggested in [8]. Since HFP represents the power of the high-frequency portion of the signal spectrum, we incorporate it into the LMS by leaving out a number of filters with the lowest central frequencies in the mel spectrogram filter bank [10]. The resulting LMS, referred to as the high-frequency LMS (HF-LMS), does not include the low-frequency portion of the spectrum, which contains the most significant part of the environment noise. To take into account the time dependence between adjacent D(t) values, the value D(t) is predicted using the samples of the HF-LMS spectrum at instant t and at Q preceding and following instants (for details, see Section 3.3).

Prior to vehicle counting, the predicted distance is low-pass filtered to eliminate high-frequency oscillations, which is discussed in Section 4. In this paper, we propose two vehicle counting approaches, both based on the final (Stage 2) predicted distance.
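The Stage 2 input construction described above (a sliding window of 2K + 1 Stage 1 predictions centered at t) can be sketched as follows. Edge handling by replication padding is our assumption; the paper does not specify how the file boundaries are treated:

```python
import numpy as np

def stage2_inputs(stage1_pred, K=15):
    """Build Stage 2 FCNN inputs: for each instant t, a vector of the
    2K+1 Stage 1 predictions centered at t (edges padded by replication,
    an assumption of this sketch)."""
    padded = np.pad(stage1_pred, K, mode="edge")
    # One window of length 2K+1 per original time instant
    return np.lib.stride_tricks.sliding_window_view(padded, 2 * K + 1)

pred = np.random.rand(500)       # hypothetical Stage 1 output for one file
X2 = stage2_inputs(pred, K=15)   # shape: (500, 31)
```

With K = 15 this matches the Stage 2 input dimensionality of 31 given in Section 3.3.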
3.2.1. Minima-based counting

In [8], local minima of the predicted distance were detected by detecting peaks (local maxima) of the inverted distance (see the dashed line in Fig. 3 (bottom)) based on their prominence. The prominence of a peak measures how much the peak stands out due to its height and location relative to other peaks, and is defined as the vertical distance between the peak and its lowest contour line [11]. Here, we extend this approach by introducing a peak magnitude criterion. If two close peaks have similar magnitudes (corresponding to two close vehicles), the prominence of the weaker one can be much smaller than that of the stronger one. The weaker peak can be left out if detection is based on prominence only. Therefore, we define vehicle detection as follows: a vehicle is detected if a detected peak of the inverted predicted distance has magnitude larger than M_p or prominence larger than P_p. The selection of M_p and P_p is discussed in Section 4.

3.2.2. Deep counting

A convolutional NN [12] is used for counting. The model operates on the raw predicted distance and estimates the vehicle count directly, without an intermediate local minima detection. It consists of 1D convolutional, global-average-pooling and fully-connected layers. The global average pooling allows the model to operate on distances of varying length [13]. The model is trained to predict the vehicle count directly. We experimented with three different loss functions: the L1, L2 and smooth L1 distances. The model trained with the smooth L1 distance gives the best performance. Training with smooth L1 is performed using a surrogate learned via a deep embedding, where the Euclidean distance between the prediction and the ground truth corresponds to the smooth L1 distance [14]. The deep embedding is realized using a shallow FCNN. The surrogate and the counting model are trained in parallel.

3.3. Implementation details

The HF-LMS feature is based on the spectrogram of the input signal. In this paper, the length of the sliding (Hamming) window used in the spectrogram calculation is N_w = 4096 and
the stride length is N_h = 1634 samples [15], which, with 20-second audio files sampled at f_s = 44100 Hz, gives the same feature time-length for every file. In addition, N_mel = 48 mel bands are used in the HF-LMS, with the lowest frequency f_min = 1000 Hz (and f_max = f_s/2). To form a vector of input features, we take the HF-LMS spectra at Q = 5 preceding and following instants, with a fixed stride. Therefore, the dimensionality of the input space is M = (2Q + 1) N_mel = 528. The input dimensionality of the Stage 2 FCNN is 2K + 1 = 31. For the distance threshold, we take T_d = 0. s [8].

The Stage 1 and Stage 2 FCNNs have four layers each. These configurations gave the best regression performance, cross-validated on the training data. Both FCNNs use the mean squared error loss, ReLU activation, and L2 kernel regularization. Batch normalization is applied at each layer, except for the output, after the activation. The model is implemented in Keras. Peak detection is based on the scipy.signal.find_peaks procedure (SciPy library for Python).
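The HF-LMS construction of Section 3.3 (a mel filter bank whose low-frequency filters are left out) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' exact implementation, which builds on the standard mel features of [10]; the triangular-filter construction below follows the common textbook recipe:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hf_mel_filterbank(n_mels=48, n_fft=4096, fs=44100, f_min=1000.0):
    """Triangular mel filter bank that starts at f_min, i.e. the
    low-frequency filters are left out, as in the HF-LMS feature."""
    f_max = fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def hf_lms(power_spectrogram, fb, eps=1e-10):
    """HF-LMS: log-mel spectrogram restricted to high frequencies.
    power_spectrogram has shape (n_fft//2 + 1, n_frames)."""
    return np.log(fb @ power_spectrogram + eps)
```

Because the filter bank starts at f_min = 1000 Hz, all spectrogram bins below that frequency receive zero weight, which is precisely the point of the feature: the noisiest low-frequency band is excluded.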
4. EXPERIMENTS
The first evaluation metric we use is the normalized area under the curve (NAUC) of p_TP(T_det), i.e., the average value of p_TP over the entire detection threshold T_det interval [8]. p_TP is calculated (as the percentage of detected minima within the pass-by intervals, with a maximum of one minimum per interval) at equidistant T_det points. The second metric is the relative vehicle counting error

    RVCE = (N_v^true − N_v^est) / N_v^true × 100 [%],             (3)

where N_v^true and N_v^est represent the true and the estimated vehicle count. As opposed to (3), the RVCE definition in [8] uses the absolute difference. Here, the signed RVCE enables distinguishing between counting underestimation (positive RVCE) and overestimation (negative RVCE).

We use the dataset from [8], which contains two parts: VC-PRG-1:5 (
250 20-second audio files) and VC-PRG-6 (a separate set of audio files). The proposed vehicle counting method (referred to as FCNN-VC) is trained and validated (with a training-validation split) using VC-PRG-1:5 and tested on VC-PRG-6. We run the method several times, with the training data shuffled each time. Along with the output of Stage 2, we also consider the output of the Stage 1 FCNN (denoted as FCNN_S).

Minima detection, and therefore the probabilities p_TP, p_FP and p_FN, as well as the RVCE, are affected by i) the low-pass filtering of the predicted distance, ii) the value of M_p, and iii) the value of P_p (Section 3.2.1). To ensure a fair comparison of performances, we determine an optimal set of parameters for every FCNN-based approach and report only the corresponding optimal results. The optimality criterion is the averaged absolute RVCE over the detection threshold range [50% T_d, T_d]. We consider all possible combinations of i) low-pass filters (successive moving-average filters (MAFs) of several lengths), ii) M_p values, and iii) P_p values, each expressed as a percentage of T_d. The optimal parameters for FCNN_S-VC and FCNN-VC are presented in the first two rows of Table 1.

Table 1. Optimal minima (vehicle) detection parameters

Setup       Optimal parameters
FCNN_S-VC   MAFs (7, ·), M_p = 45% T_d, P_p = 25% T_d
FCNN-VC     MAFs (5, ·), M_p = 40% T_d, P_p = 20% T_d
FCNN_f-VC   MAFs (5, ·), M_p = 45% T_d, P_p = 20% T_d

Fig. 3. Top: TP, FP and FN probabilities (see text below).
Bottom: Distance predictions for one audio file. Minima are detected by detecting peaks of the inverted predicted distance (blue dashed line).

Figure 3 (top) compares p_TP, p_FP and p_FN of the SVR-based counting [8] (carried out with HFP+LMS) and of one run of FCNN_S-VC and FCNN-VC. A significant improvement is reflected in the increase of NAUC from SVR-VC to FCNN_S-VC and further to FCNN-VC, the latter two averaged over all runs. The dots in Fig. 3 (top) represent the points of equal false probabilities (EFP) [8].

Figure 3 (bottom) compares the distance predictions of one audio file carried out via SVR-VC, FCNN_S-VC and FCNN-VC. The corresponding mean square regression errors on the testing set show that the proposed method significantly outperforms SVR-VC in terms of regression accuracy. In addition, Stage 2 improves the regression accuracy with respect to Stage 1.

The RVCE plots for the detection threshold range [30% T_d, T_d] are given in Fig. 4 (top). With the exception of SVR, the RVCEs are shown as confidence intervals for the mean (CIM). Here, (l_1, l_2) denotes filtering first with an MAF of length l_1, then with an MAF of length l_2. In addition to SVR-VC, FCNN_S-VC and FCNN-VC, we present the RVCE of the deep counting approach (Section 3.2.2) and of FCNN-VC when the full frequency range is used in the LMS input features (f_min = 0, see Section 3.3). The latter is denoted as FCNN_f-VC, and its optimal parameters are given in the bottom row of Table 1. The CIM of the Deep-VC RVCE is a narrow horizontal band. FCNN-VC outperforms the other approaches by a significant margin; its CIM is within [0. , −0. ] for thresholds above T_d.

The combined peak magnitude-prominence criterion in minima detection (Section 3.2.1) improves the counting performance with respect to detection based solely on the peak prominence [8], as illustrated in Fig. 4 (bottom).
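The signed RVCE of Eq. (3) and the confidence interval for the mean over repeated runs can be sketched as follows. The use of a Student-t interval is our assumption; the paper does not state how the CIM is computed:

```python
import numpy as np
from scipy import stats

def rvce(n_true, n_est):
    """Signed relative vehicle counting error, Eq. (3), in percent.
    Positive values mean undercounting, negative mean overcounting."""
    return (n_true - n_est) / n_true * 100.0

def cim(values, confidence=0.95):
    """Confidence interval for the mean (CIM) over repeated runs,
    computed as a Student-t interval (an assumption of this sketch)."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    half = stats.t.ppf((1 + confidence) / 2, df=len(v) - 1) * stats.sem(v)
    return m - half, m + half
```

For example, rvce(100, 90) returns +10.0 (ten vehicles missed), while rvce(100, 110) returns −10.0 (ten spurious detections).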
Peak prominences of several values are considered, after applying two successive MAFs.

Fig. 4. Relative vehicle counting error, RVCE, as a function of the detection threshold. With the exception of SVR, the RVCEs are shown as confidence intervals for the mean. The proposed FCNN-VC (in blue) is compared with the alternatives (for their description, see the text).
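The magnitude-or-prominence detection rule of Section 3.2.1 can be sketched with scipy.signal.find_peaks. Taking the union of two find_peaks calls is our way of expressing the "or"; it is not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_vehicles(distance, T_d, m_p=0.40, p_p=0.20):
    """Detect vehicles as peaks of the inverted predicted distance with
    magnitude > m_p*T_d OR prominence > p_p*T_d (Section 3.2.1).
    m_p and p_p are fractions of the distance threshold T_d, in the
    spirit of Table 1 (e.g. M_p = 40% T_d, P_p = 20% T_d for FCNN-VC)."""
    inverted = T_d - np.asarray(distance)     # minima become peaks
    by_magnitude, _ = find_peaks(inverted, height=m_p * T_d)
    by_prominence, _ = find_peaks(inverted, prominence=p_p * T_d)
    return np.union1d(by_magnitude, by_prominence)
```

The magnitude criterion recovers a weak peak sitting next to a strong one (two close vehicles), which a prominence-only rule would suppress.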
5. CONCLUSIONS
We proposed a method for acoustic vehicle counting based on the clipped vehicle-to-microphone distance. The distance was predicted using a two-stage NN-based regression. A significant improvement in regression accuracy with respect to the SVR-based approach resulted in highly accurate vehicle counting that does not depend on the detection threshold within a wide range of threshold values. Deep counting, an alternative to the local minima-based counting, estimates the vehicle count directly from the predicted distance, without detecting local minima. Although it is outperformed in accuracy by the minima-based approach, a significant advantage of deep counting is that it does not depend on minima detection parameters. Our future work will address the development of an end-to-end vehicle counting method.

6. REFERENCES

[1] Myounggyu Won, "Intelligent traffic monitoring systems for vehicle classification: A survey," IEEE Access, vol. 8, pp. 73340-73358, 2020.
[2] Milind Naphade et al., "The 2019 AI City challenge," in CVPR Workshops, 2019, pp. 452-460.
[3] Brendan Tran Morris and Mohan Manubhai Trivedi, "A survey of vision-based trajectory learning and analysis for surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1114-1127, 2008.
[4] Jien Kato, "An attempt to acquire traffic density by using road traffic sound," in Proceedings of the 2005 International Conference on Active Media Technology (AMT 2005). IEEE, 2005, pp. 353-358.
[5] Jobin George, Leena Mary, and K. S. Riyas, "Vehicle detection and classification from acoustic signal using ANN and KNN," IEEE, 2013, pp. 436-439.
[6] Jobin George, Anila Cyril, Bino I. Koshy, and Leena Mary, "Exploring sound signature for vehicle detection and classification using ANN," International Journal on Soft Computing, vol. 4, no. 2, p. 29, 2013.
[7] Sugang Li, Xiaoran Fan, Yanyong Zhang, Wade Trappe, Janne Lindqvist, and Richard E. Howard, "Auto++: Detecting cars using embedded microphones in real-time," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 3, p. 70, 2017.
[8] Slobodan Djukanović, Jiří Matas, and Tuomas Virtanen, "Robust audio-based vehicle counting in low-to-moderate traffic flow," 2020.
[9] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 1-27, 2011.
[10] Romain Serizel, Victor Bisot, Slim Essid, and Gaël Richard, "Acoustic features for environmental sound analysis," in Computational Analysis of Sound Scenes and Events, pp. 71-101. Springer, 2018.
[11] "Topographic prominence," https://en.wikipedia.org/wiki/Topographic_prominence, Accessed: 2020-10-19.
[12] Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level convolutional networks for text classification," in NeurIPS, 2015.
[13] Min Lin, Qiang Chen, and Shuicheng Yan, "Network in network," ICLR, 2014.
[14] Yash Patel, Tomas Hodan, and Jiri Matas, "Learning surrogates via deep embedding," ECCV, 2020.
[15] Ljubiša Stanković, Igor Djurović, Srdjan Stanković, Marko Simeunović, Slobodan Djukanović, and Miloš Daković, "Instantaneous frequency in time-frequency analysis: Enhanced concepts and performance of estimation algorithms,"