Convolutional Neural Networks for Passive Monitoring of a Shallow Water Environment using a Single Sensor
Eric L. Ferguson, Rishi Ramakrishnan, Stefan B. Williams, Craig T. Jin
arXiv [cs.SD], Dec
Eric L. Ferguson ∗, Rishi Ramakrishnan, Stefan B. Williams
Australian Centre for Field Robotics, The University of Sydney, Australia
Craig T. Jin
Computing and Audio Research Laboratory, The University of Sydney, Australia
ABSTRACT
A cost-effective approach to remote monitoring of protected areas such as marine reserves and restricted naval waters is to use passive sonar to detect, classify, localize, and track marine vessel activity (including small boats and autonomous underwater vehicles). Cepstral analysis of underwater acoustic data enables the time delay between the direct path arrival and the first multipath arrival to be measured, which in turn enables estimation of the instantaneous range of the source (a small boat). However, this conventional method is limited to ranges where the Lloyd’s mirror effect (the interference pattern formed between the direct and first multipath arrivals) is discernible. This paper proposes the use of convolutional neural networks (CNNs) for the joint detection and ranging of broadband acoustic noise sources such as marine vessels, in conjunction with a data augmentation approach for improving network performance in varied signal-to-noise ratio (SNR) situations. Performance is compared with a conventional passive sonar ranging method for monitoring marine vessel activity using real data from a single hydrophone mounted above the sea floor. It is shown that CNNs operating on cepstrum data are able to detect the presence and estimate the range of transiting vessels at greater distances than the conventional method.
Index Terms: passive sonar, convolutional neural network, acoustic ranging and detection, cepstral analysis
1. INTRODUCTION
Despite the long-term usage of traditional passive acoustics for sound-source localization, poor performance persists in some scenarios. Current conventional, single-sensor source localization methods are limited in their effective range, which is further degraded in low SNR situations. Time delay estimation aims to measure the time difference of arrival (TDOA) between propagation paths of an acoustic signal and is a fundamental approach for classifying, localizing and tracking sources of radiated acoustic noise. A common approach to the passive ranging of a sound source is to measure the TDOA of a signal at multiple, spatially distributed receivers [1, 2, 3, 4]. The TDOA measured between two coherent signal arrivals at a single receiver is geometrically equivalent to the TDOA measured by a single arrival propagating to two vertically-spaced receivers [5]. Passive acoustic ranging using a single sensor is achieved by measuring the TDOA of an acoustic signal as it arrives via direct and indirect underwater sound propagation paths. For example, the TDOA between the direct path signal and the multipath signal can be used to yield the instantaneous range of the acoustic source [6]. Passive acoustic ranging using a single sensor facilitates deployment, lowers hardware costs, and minimizes the equipment footprint when compared with multi-sensor arrays.
The acoustic characteristics of a shallow water environment such as a harbour or port are variable in both space and time, with high levels of clutter, background noise, and multipath reflection. Time delay estimation by cepstral analysis is able to outperform other methods (such as autocorrelation analysis) in these scenarios [7]; however, this method is limited to ranges where the Lloyd’s mirror effect is discernible, i.e. only at short ranges and when the SNR of the recorded source is sufficiently high.
A CNN is proposed that operates on cepstral inputs to detect and range an acoustic source passively in a shallow water environment. The CNN-based implementation has an important advantage over other methods in that the TDOA information for more complex multipaths can be exploited, rather than only the peak quefrency values used in conventional methods. This increases the range at which source tracking is possible. By considering additional propagation paths, such as paths with two or more boundary reflections, it is hypothesized that the source range can be estimated at greater distances, even when the Lloyd’s mirror interference pattern is not discernible by a human observer. The CNNs are trained using real, single-channel acoustic recordings of a surface vessel under way in a shallow water environment. CNNs operating on both cepstrum and cepstrogram inputs are considered and their performances compared. The proposed models are shown to detect and range sources successfully at greater distances and in varied SNR situations, and are compared with a conventional single-sensor passive sonar localization method. Generalization performance of the network is tested by ranging another, previously unseen vessel with different radiated noise characteristics. To the best of our knowledge, this is the first acoustic localization network to utilize the TDOA information in a reverberant environment to range and detect a source passively with just one sensor.
The contributions of this work are:
• Development of a CNN for the passive ranging of acoustic broadband noise sources in a shallow water environment at greater distances than conventional methods allow;
• Cepstral liftering of network inputs to improve ranging of other radiated noise sources;
• A data augmentation technique where colored noise is added to training data to improve robustness in varied SNR scenarios; and
• A unified, end-to-end network for the joint detection and ranging of acoustic sources.
∗ Work supported by Defence Science and Technology Group Australia and IEEE Oceanic Engineering Society Scholarships.
Fig. 1. a) Spectrogram (frequency in Hz against time in seconds) showing the Lloyd’s mirror for a surface vessel as it transits over a hydrophone at close range, and b) the corresponding cepstrogram (quefrency in ms against time in seconds).
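The single-sensor ranging idea described in the Introduction can be illustrated with a simple image-source calculation. The sketch below is only an illustration under stated assumptions and is not the method of [6]: it assumes an isovelocity water column, a flat reflecting boundary, straight-line propagation, and hypothetical values for the sound speed and for the source and sensor heights above the reflecting boundary. The function names are illustrative.

```python
import math

SOUND_SPEED = 1500.0  # m/s, assumed nominal value for sea water


def tdoa(horizontal_range, source_height, sensor_height, c=SOUND_SPEED):
    """TDOA (s) between the direct path and a path reflected once from a flat
    boundary. Heights are measured from that boundary (image-source model)."""
    direct = math.hypot(horizontal_range, source_height - sensor_height)
    reflected = math.hypot(horizontal_range, source_height + sensor_height)
    return (reflected - direct) / c


def range_from_tdoa(delay, source_height, sensor_height, c=SOUND_SPEED,
                    max_range=10_000.0, tol=1e-6):
    """Invert tdoa() for horizontal range by bisection.

    This works because, in this model, the TDOA decreases monotonically
    as the horizontal range increases."""
    lo, hi = 0.0, max_range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tdoa(mid, source_height, sensor_height, c) > delay:
            lo = mid  # delay at mid is too large, so the source is farther away
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical geometry: source 19 m and sensor 1 m above the reflecting boundary.
delay_at_100m = tdoa(100.0, 19.0, 1.0)
estimated_range = range_from_tdoa(delay_at_100m, 19.0, 1.0)
```

In this model the TDOA is largest when the source is directly above the sensor and tends to zero at long range, which is why peak-picking methods lose resolution as the vessel moves away.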
2. DETECTION AND RANGING CNN
A neural network is a machine learning technique that maps the input data to a label or continuous value through a multi-layer non-linear architecture, and has been successfully applied to applications such as image/object classification [8, 9] and terrain classification using acoustic sensors [10]. CNNs learn sets of filters that span small regions of the input data, enabling them to learn local correlations.
Since an acoustic source has an effect on the cepstrum, it is possible to create a unified network for classifying the presence/absence of a vessel and determining the range of the detected vessel. The network structure is as follows: The first layer consists of convolutional filters of size × n , where n refers to the input width, as is discussed further in Section 3.2. Both the second and third layers consist of convolutional filters of size × . The output of the third layer is then the input to a fully connected hidden layer of 200 neurons with a single regression output and a binary softmax classification output. All layers (excluding output layers) use rectified linear units as activation functions. Since resolution is important for the accurate ranging of an acoustic source, max pooling is not used in the network’s architecture.
A cepstrum can be derived from various spectra such as the complex or differential spectrum. For the current approach, the power cepstrum (referred to in this paper as the cepstrum) is used and is derived from the power spectrum of a recorded signal. Cepstral analysis is based on the principle that the logarithm of the power spectrum for a signal containing echoes has an additive periodic component due to the echoes from multipath reflections [11]. This additive periodic component is evident when examining the Lloyd’s mirror effect in the spectrogram when an acoustic source travels past the hydrophone at close range, as seen in Fig. 1a). The cepstral representation of the signal is neither in the time nor the frequency domain, but rather in the quefrency domain [12]. Where the original time waveform contains an echo, the cepstrum will contain a peak, and thus the TDOA between propagation paths of an acoustic signal can be measured by examining peaks in the cepstrum [13]. The cepstrogram (an ensemble of cepstra as they vary in time) is shown in Fig. 1b).
The cepstrum $\hat{x}(n)$ is obtained by the inverse Fourier transform:
$$\hat{x}(n) = \mathcal{F}^{-1}\left(\log |S(f)|^{2}\right), \qquad (1)$$
where $S(f)$ is the Fourier transform of a discrete time signal $x(n)$.
In order to detect and range a source using a single sensor, information about the time delay between signal propagation paths is required. Although such information is contained in the raw signals, it is beneficial to represent it in a way that can be learnt by the network easily. There are several ways to represent time delay information. Motivated by work in [7], the cepstrum is chosen as network input, since it provides TDOA information between signal propagation paths that can be used to passively range the vessel. The capability of cepstrum analysis in extracting TDOA information is superior to other methods (such as autocorrelation) in the presence of multipath reflections and strong transients found in a shallow water environment [7].
The first layer’s convolutional filter spans the entire input width in order to average neighbouring cepstral values and reduce the impact of shot noise and other short-duration clutter. By using filters that span the entire width of the input, networks can be robust to short-duration changes in the cepstrogram. The temporal difference of cepstra in the cepstrogram is not important for the task at hand, since for the present experiments only the instantaneous range and detection are of interest.
For each input into the network, the network classifies the presence or absence of a vessel using binary softmax classification. If the vessel is present, the range of the acoustic source is predicted with a regression output.
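The power cepstrum of Eq. (1) can be sketched in a few lines. The following is a minimal illustration using NumPy, not the authors' preprocessing code: the small constant `eps`, the synthetic white-noise source, the echo gain and the delay of 50 samples are all assumptions introduced for the example. It shows that an echo in the time waveform appears as a peak in the cepstrum at the quefrency equal to the echo delay.

```python
import numpy as np


def power_cepstrum(x, eps=1e-12):
    """Power cepstrum of a real 1-D signal: inverse FFT of the log power spectrum.

    eps is an added safeguard against log(0); it is not part of Eq. (1)."""
    spectrum = np.fft.fft(x)
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    return np.real(np.fft.ifft(log_power))


# Synthetic check: broadband noise plus a delayed, attenuated copy (an "echo").
rng = np.random.default_rng(0)
n_samples, delay = 4096, 50            # hypothetical values
direct = rng.standard_normal(n_samples)
echo = 0.5 * np.roll(direct, delay)    # circular delay keeps the example simple
cep = power_cepstrum(direct + echo)

# Skip the lowest quefrencies and search the first half of the cepstrum only,
# because the cepstrum of a real signal is symmetric.
peak = int(np.argmax(cep[10:n_samples // 2])) + 10
```

Here `peak` is the quefrency index (in samples) of the strongest cepstral component, which can be converted to a time delay by dividing by the sampling rate.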
For a given source-sensor geometry, there is a finite bounded range of possible TDOA values. Distant acoustic sources will have TDOA values that tend to zero, and as the source-sensor separation distance decreases the TDOA values will tend to a maximum value. TDOA values greater than this geometry-dependent maximum are not useful for the passive sonar ranging problem, hence upper bounds of the cepstrum can be discarded.
Cepstrum values near zero mostly contain pitch information for the broadband noise source, and not TDOA information for different signal propagation paths. Acoustic sources of interest are varied in their radiated noise characteristics; for example, the inception of propeller cavitation leads to a significant increase in the intensity and bandwidth of the radiated noise. For this reason, lower quefrency values are likely to be highly source dependent and are thus not useful for the passive sonar ranging problem. Hence lower bounds of the cepstrum can be discarded.
Similar to filtering in the frequency domain by windowing a spectral representation of a signal, liftering involves linear filtering of the log spectrum (in the quefrency domain) by windowing [12]. Only quefrencies within a limited range contain useful TDOA information for passive acoustic ranging, as described above. The cepstrum can be liftered (filtered in quefrency) to remove information not useful for passive ranging of the source. This has the added benefit of reducing computational complexity for forward and backward propagation through a network, since input dimensions are smaller and fewer convolutional filters are required.
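The liftering step described above amounts to keeping a band of quefrencies and discarding the rest. The sketch below is one simple way to do this with a rectangular window; the function name `lifter`, the sampling rate and the quefrency bounds are hypothetical values chosen for illustration and are not the settings used in the experiments.

```python
import numpy as np


def lifter(cepstrum, sample_rate, min_quefrency, max_quefrency):
    """Keep only cepstral samples whose quefrency (in seconds) lies in
    [min_quefrency, max_quefrency]; return them with their quefrencies."""
    quefrency = np.arange(len(cepstrum)) / sample_rate  # seconds per cepstral sample
    keep = (quefrency >= min_quefrency) & (quefrency <= max_quefrency)
    return cepstrum[keep], quefrency[keep]


# Hypothetical numbers, chosen only for illustration.
fs = 100_000.0
cep = np.arange(1024, dtype=float)   # stand-in for a computed cepstrum
liftered, q = lifter(cep, fs, min_quefrency=1e-4, max_quefrency=1e-3)
```

Because the liftered vector is much shorter than the full cepstrum, the network input dimension shrinks accordingly.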
The acoustic noise characteristics of a shallow water environment are variable in both space and time, with high levels of clutter, background noise and multipath reflection. For example, different times of day have varying levels of biological noise. Further, acoustic sources vary in the level of sound power they emit. For robust ranging and detection of other sources, it is important for the network to be invariant to changes in radiated or background noise levels. By performing transformations on recorded signals, the number of training examples is increased and the network develops invariance to particular signal variations.
Since acoustic classification can be strongly affected by environmental noise, Valada et al. [10] show that by augmenting raw acoustic data with additive white Gaussian noise, classification performance can increase in degraded SNR situations. This paper proposes augmenting raw acoustic data by adding colored noise with the same power spectral density (PSD) as background noise recordings during network training. The PSD is taken from background noise recorded by the same hydrophone when no surface vessel is present. Adding colored noise with the same PSD as background noise recordings simulates situations with either a quiet source or high levels of background noise. By injecting colored noise into training examples, the CNN performance can be improved by increasing robustness to SNR variations. Furthermore, when n > training examples can be flipped along the quefrency axis, providing additional training examples.
The objective of the network is to predict the presence or absence of an acoustic source from reverberant and noisy single-channel input signals. If the source is present, then the range relative to the hydrophone is predicted. Previously, it was found that ranging the vessel was a more difficult problem for the CNN and required more hidden units than vessel detection [14]. This is to be expected, since ranging is dependent on the location of cepstral features, whereas detection is only dependent on their presence. The total objective function $E$ minimized during network training is given by the weighted sum of the ranging regression loss $E_r$ and the detection loss $E_d$, such that:
$$E = \alpha E_d + (1 - \alpha) E_r, \qquad (2)$$
where $E_r$ is the L norm and $E_d$ is the log loss over two classes. The two terms are weighted by the parameter $\alpha$. Training is performed by initially setting $\alpha = 0$, such that only the regression term is significant. Training is stopped when the validation error does not decrease appreciably per epoch. Subsequently, due to the magnitude difference between $E_r$ and $E_d$, $\alpha$ is set to . during joint training. Training is again stopped when the validation error does not decrease appreciably per epoch. For training data with no vessel present, there is no range label and $E_r$ is ignored, i.e. gradients obtained from the regression output for training samples with no boat are masked out. In order to further prevent overfitting, regularization through dropout [15] is used at the final, fully connected layer when training. A dropout rate of 50% is used.
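The weighted objective of Eq. (2), including the masking of the regression term for examples with no vessel, can be sketched as follows. This is a minimal NumPy illustration rather than the MatConvNet implementation: the function name `joint_loss` and its arguments are hypothetical, and a squared-error regression loss is assumed for $E_r$ because the order of the norm is not legible in the text above.

```python
import numpy as np


def joint_loss(range_pred, range_true, p_vessel, vessel_present, alpha):
    """E = alpha * E_d + (1 - alpha) * E_r for a batch of examples.

    E_d: mean log loss over the two classes (vessel / no vessel).
    E_r: squared-error regression loss (an assumed choice of norm), averaged
         only over examples with a vessel present; the rest are masked out.
    """
    present = np.asarray(vessel_present, dtype=float)
    p = np.clip(np.asarray(p_vessel, dtype=float), 1e-12, 1.0 - 1e-12)
    e_d = -np.mean(present * np.log(p) + (1.0 - present) * np.log(1.0 - p))

    sq_err = (np.asarray(range_pred, dtype=float)
              - np.asarray(range_true, dtype=float)) ** 2
    masked = np.where(present > 0, sq_err, 0.0)  # ignore missing range labels
    n_present = present.sum()
    e_r = masked.sum() / n_present if n_present > 0 else 0.0
    return alpha * e_d + (1.0 - alpha) * e_r


# Two examples: one with a vessel at 90 m, one background-only (no range label).
regression_only = joint_loss([100.0, 50.0], [90.0, float("nan")],
                             [0.9, 0.2], [1, 0], alpha=0.0)
detection_only = joint_loss([100.0, 50.0], [90.0, float("nan")],
                            [0.9, 0.2], [1, 0], alpha=1.0)
```

Setting `alpha=0.0` reproduces the initial regression-only training stage, while intermediate values trade the two terms off during joint training.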
3. EXPERIMENTAL RESULTS
Passive ranging on a transiting vessel was conducted using a single-sensor algorithmic method described in [6], and using CNNs with both cepstrum ( n = 1 ) and cepstrogram ( n = 8 ) inputs. Their effectiveness is compared. Generalization of the CNNs is also demonstrated by detecting and ranging an additional, unseen vessel with different radiated noise and SNR characteristics.
Acoustic data of a motorised boat transiting in a shallow water environment over a hydrophone were recorded at a sampling rate of kHz. Recordings start when the vessel is up to m away from the sensor. The vessel then transits over the hydrophone and recording is terminated when the vessel is m away. The boat was equipped with a DGPS tracker, which logged its position relative to the recording hydrophone at . s intervals. 28 transits were recorded over a two day period. Background noise was also recorded when there was no vessel present, over the same period. 20,000 training examples were randomly chosen, with an equal number of vessel transit recordings and background noise recordings. A further 5,000 labelled examples were reserved for CNN training validation. The recordings were preprocessed as outlined in Section 2.1.1, 2.1.3 and 2.2. The networks are implemented in MatConvNet and are trained with stochastic gradient descent using an NVIDIA GeForce GTX 770 GPU. Due to GPU memory limitations, the gradient descent was calculated in batches of 256 training examples. The networks were trained with a learning rate of × − , weight decay of × − and momentum of . .
Additional recordings of the vessel were used to measure the performance of the methods. These recordings are referred to as the test dataset and contain labelled examples.
Additional acoustic data were recorded on a different date, using a different boat with different radiated noise characteristics. Acoustic recordings start when the transiting vessel is m away from the hydrophone, record the transit over the hydrophone, and end when the vessel is m away. This dataset is referred to as the generalization set and contains labelled examples.
Cepstral features were used as input to the CNN. The cepstral features have a dimension of m × n , where m is the number of quefrency bins in each cepstrum realization and n is the input width of the cepstrogram, and are computed as follows. For every training example, the data were further subdivided into n sections and the cepstrum values calculated for each section. For each calculated cepstrum, only some range of quefrencies contains relevant TDOA information and is retained, since the rest of the values are not useful for the task here (see Section 2.1.3). Cepstrum values more than . ms are discarded, since the shallow water environment geometry makes it unlikely that useful TDOA information is present. Cepstrum values less than µ s are discarded, since they mostly contain source-dependent pitch information. Thus, each cepstrogram input is liftered and only samples through are used as input to the network. This results in an m × n input size, with m = 330 . Colored noise was added to the recordings to change the SNR randomly between − dB to dB when training, as described in Section 2.2.
Multiple CNNs with variable input widths were produced and their performances compared. The n = 1 and n = 8 CNNs are compared in the following section. For n = 1 , a single realisation of the cepstrum is used. For n = 8 , an ensemble of cepstra (a cepstrogram) is used.
Fig. 2. A comparison of the two ranging methods (range in m against time in samples), as they range a transiting vessel over time. The plot shows the CNN range estimation, the algorithmic range estimation and the true range. The CNN range prediction refers to the estimated range given by the ‘ n = 8 , with data aug’ network. The true range shows the range of the vessel relative to the hydrophone, measured by the DGPS.
[Table: Comparison of detection performance for CNNs against the test dataset. Row: Average Precision. Columns: network input width n = 1 and n = 8, each without and with data augmentation.]
Algorithmic single sensor passive ranging was conducted using the methods outlined in [6], where the TDOA values are measured by examining peaks in the cepstrum. Fig. 2 compares algorithmic and CNN ranging over time for a vessel in transit. The algorithmic method is shown to successfully range a transiting vessel at ranges where the Lloyd’s mirror interference pattern is present. The CNN is shown to provide an estimate of the vessel range throughout the entire transit.
Table 3.3 shows the average precision for each network when operating on the test dataset. Additive colored noise data augmentation improved CNN detection precision. Increasing network input width n also improved the detection precision.
Fig. 3 a) shows the performance of ranging methods as a function of the true range of the vessel for the test dataset. Fig. 3 b) shows the performance of ranging methods as a function of the true range of the vessel for the generalization dataset. In the near field (ranges < m), the algorithmic ranging method outperforms CNN ranging methods, achieving a lower average relative error. CNN methods suffer from a significant bias in range estimates in the near field. At source ranges further than m the algorithmic method fails completely, whereas CNN methods are able to successfully estimate the range of the vessel. The CNN is able to range the new vessel in the generalization set with a small impact on performance at these ranges.
Fig. 4 shows the far field performance of the CNNs in estimating the vessel’s range under different SNR conditions. Test data were augmented with varying levels of colored noise, as described in Section 2.2. For the n = 1 case, data augmentation improved ranging performance in most cases. For the n = 8 case, additive colored noise data augmentation improved ranging performance when the SNR was changed to dB only.
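The colored-noise augmentation described in Section 2.2, and applied to the test data above, can be sketched as follows. This is an illustrative NumPy sketch under several assumptions, not the authors' implementation: the background spectrum is estimated from a single frame, the noise is shaped in the frequency domain, and the argument `snr_db` is interpreted here as the ratio of signal power to added-noise power. The function names and the synthetic signals are hypothetical.

```python
import numpy as np


def colored_noise_like(background, n_samples, rng):
    """Noise whose magnitude spectrum follows that of a background recording.

    White Gaussian noise is shaped in the frequency domain by the magnitude
    spectrum of the background excerpt (a crude single-frame PSD estimate)."""
    shape = np.abs(np.fft.rfft(background, n=n_samples))
    white = np.fft.rfft(rng.standard_normal(n_samples))
    return np.fft.irfft(white * shape, n=n_samples)


def add_colored_noise(signal, background, snr_db, rng):
    """Add background-shaped noise so that the signal-to-added-noise power
    ratio equals snr_db (an assumed definition of the SNR change)."""
    noise = colored_noise_like(background, len(signal), rng)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise


# Synthetic stand-ins for a vessel recording and a background-noise recording.
rng = np.random.default_rng(1)
signal = np.sin(2.0 * np.pi * 0.01 * np.arange(2048))
background = rng.standard_normal(2048)
augmented = add_colored_noise(signal, background, snr_db=10.0, rng=rng)
```

Drawing `snr_db` at random for each training example would give the varied SNR conditions that the augmentation is intended to cover.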
Fig. 3. Comparison of range estimation performance (average relative error against ground truth range in m) as a function of the vessel’s true range, for the algorithmic method and for the n = 1 and n = 8 CNNs with and without data augmentation. It is not possible to determine the range of a vessel past m using conventional algorithmic methods, since the Lloyd’s mirror interference pattern is not discernible. a) shows the performance when estimating the vessel’s range in the test dataset. b) shows the performance when estimating the vessel’s range in the generalization dataset.
Fig. 4. Comparison of far field ( < m) range estimation performance (relative average error) as a function of SNR, for each network type ( n = 1 and n = 8 , with and without data augmentation) and for SNR changes of none, 30, 20, 10 and 0 dB.
4. CONCLUSIONS
In this paper we introduce the use of a CNN for the detection and ranging of surface vessels in a shallow water environment. Using liftered cepstra as input, the CNN detects the presence of a vessel and estimates its range relative to the recording hydrophone. Several CNN architectures are evaluated. A novel data augmentation technique is introduced, where colored noise of a similar PSD to recorded background noise is added to raw acoustic data when training. This data augmentation improves performance in both vessel ranging and detection in some SNR scenarios. Whilst the CNNs are outperformed by a conventional algorithmic method at short ranges ( < m), the CNNs are able to estimate the vessel’s range at further distances, even when the Lloyd’s mirror interference pattern is not easily identified. The CNNs are robust to changes in the SNR and broadband spectral characteristics of marine vessels due to cepstral liftering of network inputs and novel data augmentation methods applied during network training.
5. REFERENCES
[1] G.C. Carter, “Time delay estimation for passive sonar signal processing,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, pp. 463–470, 1981.
[2] G.C. Carter, Ed., Coherence and Time Delay Estimation, IEEE Press, New York, 1993.
[3] Y.T. Chan and K.C. Ho, “A simple and efficient estimator for hyperbolic location,” IEEE Trans. Signal Proc., vol. 42, pp. 1905–1915, 1994.
[4] J. Benesty, J. Chen, and Y. Huang, “Time-delay estimation via linear interpolation and cross correlation,” IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 509–519, September 2004.
[5] M. Hamilton and P.M. Schultheiss, “Passive ranging in multipath dominant environments, part 1: Known multipath parameters,” IEEE Transactions on Signal Processing, vol. 40, no. 1, pp. 1–12, 1992.
[6] B.G. Ferguson, K.W. Lo, and R.A. Thuraisingham, “Sensor position estimation and source ranging in a shallow water environment,” IEEE Journal of Oceanic Engineering, 2005.
[7] Y. Gao, M. Clark, and P. Cooper, “Time delay estimate using cepstrum analysis in a shallow littoral environment,” in Undersea Defence Technology, Glasgow, Scotland, June 2008.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[10] A. Valada, L. Spinello, and W. Burgard, “Deep feature learning for acoustics-based terrain classification,” in Proceedings of the International Symposium on Robotics Research, Genova, Italy, 2015.
[11] K.W. Lo, B.G. Ferguson, Y. Gao, and A. Maguer, “Aircraft flight parameter estimation using acoustic multipath delays,” IEEE Trans. on Aero. and Elect. Systems, vol. 39, pp. 259–268, 2003.
[12] B.P. Bogert, M.J.R. Healy, and J.W. Tukey, “The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking,” in Proceedings of the symposium on time series analysis, New York, N.Y, 1963, vol. 15.
[13] A.V. Oppenheim and R.W. Schafer, “From frequency to quefrency: a history of the cepstrum,” IEEE Signal Processing Magazine, vol. 21, pp. 95–106, 2004.
[14] E.L. Ferguson, R. Ramakrishnan, S.B. Williams, and C.T. Jin, “Deep learning approach to passive monitoring of the underwater acoustic environment,” in Fifth Joint Acoustical Society of America/Acoustical Society of Japan Meeting, (accepted), Hawaii, USA, Dec. 2016.
[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,”