Physeter catodon localization by sparse coding
Sébastien Paris, Yann Doh, Hervé Glotin, Xanadu Halkias, Joseph Razik
Sébastien PARIS [email protected]
DYNI team, LSIS CNRS UMR 7296, Aix-Marseille University
Yann DOH [email protected]
DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Hervé GLOTIN [email protected]
DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Xanadu HALKIAS [email protected]
DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Joseph RAZIK [email protected]
DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Abstract
This paper presents a sperm whale localization architecture that jointly uses a bag-of-features (BoF) approach and a machine learning framework. BoF methods are known, especially in computer vision, to produce from a collection of local features a global representation that is invariant to the principal signal transformations. Our idea is to regress, in a supervised way, two rough estimates of the distance and azimuth from these local features, thanks to datasets in which both the acoustic events and the ground-truth positions are now available. Furthermore, these estimates can feed a particle filter system in order to obtain a precise sperm whale position, even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the intended applications of this work.
1. Introduction
Most efficient cetacean localisation systems are based on Time Delay Of Arrival (TDOA) estimation from the detected click/whistle signals of the animal. As click/whistle detector, a matched filter is often preferred (Nosal & Frazer, 2006; Bénard & Glotin, 2009). A long-base hydrophone array involves several fixed, efficient but expensive hydrophones (Giraudet & Glotin, 2006), while the short-base version requires a precise self-localization of the array to deliver accurate results. Recently (see (Glotin et al., 2011)), a range estimator based on Leroy's model of attenuation versus frequency (Leroy, 1965) has been proposed. This approach works on the most powerful pulse detected inside the click signal and delivers a rough range estimate that is robust to variations of the animal's head orientation. Our purpose is to use i) these hydrophone array measurements, recorded in diversified sea conditions, and ii) the associated ground-truth trajectories of sperm whales (obtained by precise TDOA and/or Dtag systems) to regress both the position and the azimuth of the animal (we assume that the velocity vector is collinear with the head's angle) from a third-party hydrophone (typically an on-board, standalone and cheap model).

We claim that, as in the computer vision field, the BoF approach can be successfully applied to extract a global and invariant representation of click signals. Basically, the BoF pipeline is composed of three parts: i) a local feature extractor, ii) a local feature encoder (given a dictionary pre-trained on data) and iii) a pooler aggregating the local representations into a more robust global one. Several choices for encoding local patches have been developed in recent years: from hard assignment to the closest dictionary basis (trained for example by the K-means algorithm) to a sparse local patch reconstruction (involving for example the Orthogonal Matching Pursuit (OMP) or LASSO algorithms).

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s). Preprint on arXiv [cs.LG].
2. Global feature extraction by sparse coding
Let us denote by $C \triangleq \{C^j\}$, $j = 1, \dots, H$, the collection of detected clicks, where $C^j$ is associated with the $j$-th hydrophone of an array composed of $H$ hydrophones. Each matrix $C^j$ is defined by $C^j \triangleq \{c_i^j\}$, $i = 1, \dots, N_j$, where $c_i^j \in \mathbb{R}^n$ is the $i$-th click of the $j$-th hydrophone. For our Bahamas2 dataset (Giraudet & Glotin, 2006), we typically choose $n = 2000$ samples surrounding the detected click. The total number of available clicks is $N = \sum_{j=1}^{H} N_j$.

As local features, we simply extract local signal patches of $p \le n$ samples (typically $p = 128$), denoted by $z_{i,l}^j \in \mathbb{R}^p$. Furthermore, all $z_{i,l}^j$ are $\ell_2$-normalized. For each click $c_i^j$, a total of $L$ local patches $Z_i^j \triangleq \{z_{i,l}^j\}$, $l = 1, \dots, L$, equally spaced by $\lceil n/L \rceil$ samples, are retrieved (see Fig. 1). All local patches associated with the $j$-th hydrophone are denoted by $Z^j \triangleq \{Z_i^j\}$, $i = 1, \dots, N_j$, while $Z \triangleq \{Z^j\}$ denotes the matrix of all local patches for all hydrophones. A final post-processing step decorrelates the local features by PCA training and projection onto $p' \le p$ dimensions.

In order to obtain a robust global representation of $c \in C$, each associated local patch $z \in Z$ is first linearly encoded by a vector $\alpha \in \mathbb{R}^k$ such that $z \approx D\alpha$, where $D \triangleq [d_1, \dots, d_k] \in \mathbb{R}^{p \times k}$ is a pre-trained dictionary matrix whose column vectors satisfy the constraint $d_j^T d_j = 1$. In a first attempt to solve this linear problem, $\alpha$ can be the solution of the Ordinary Least Squares (OLS) problem:

$$ l_{OLS}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \| z - D\alpha \|_2^2 \right\}. \qquad (1) $$

The OLS formulation can be extended with a regularization term that avoids overfitting. We obtain the ridge regression (RID) formulation:

$$ l_{RID}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \| z - D\alpha \|_2^2 + \beta \| \alpha \|_2^2 \right\}. \qquad (2) $$

This problem has the analytic solution $\alpha = (D^T D + \beta I_k)^{-1} D^T z$.
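The local feature extraction and the ridge encoding of Eq. 2 can be sketched as follows. This is only an illustration under stated assumptions: the function names `extract_patches` and `ridge_encode` are hypothetical, the value `beta=0.1` is arbitrary, and the way patch start positions are spread over the click (so that every patch fits inside it) is our choice, since the text only gives a spacing of $\lceil n/L \rceil$ samples. The PCA step is omitted.

```python
import numpy as np

def extract_patches(click, p=128, L=1000):
    """Return a (p, L) matrix of l2-normalized patches taken from one click."""
    n = click.shape[0]
    # Assumption: L start positions spread evenly over [0, n - p],
    # so that no patch runs past the end of the click.
    starts = np.linspace(0, n - p, L).round().astype(int)
    Z = np.stack([click[s:s + p] for s in starts], axis=1)
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    return Z / np.maximum(norms, 1e-12)  # guard against all-zero patches

def ridge_encode(Z, D, beta=0.1):
    """Solve (D^T D + beta I_k) A = D^T Z, i.e. Eq. 2 for every column of Z."""
    k = D.shape[1]
    G = D.T @ D + beta * np.eye(k)
    return np.linalg.solve(G, D.T @ Z)
```

Encoding all patches of a click at once, as above, reuses the same $(k \times k)$ system matrix for every column of `Z`.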
Since $D^T D + \beta I_k$ is symmetric positive definite for $\beta > 0$, a Cholesky factorization of this matrix can be used to solve this linear system efficiently. In order to decrease the reconstruction error and to obtain a sparse solution, this problem can be reformulated as a constrained Quadratic Problem (QP):

$$ l_{SC}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \| z - D\alpha \|_2^2 \quad \text{s.t.} \quad \| \alpha \|_1 = 1. \qquad (3) $$

To solve this problem, we can use a QP solver, which involves highly combinatorial computation to find the solution. Under RIP assumptions (Tibshirani, 1994), a greedy approach can be used to solve Eq. 3 efficiently, and the latter can be rewritten as:

$$ l_{SC}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \| z - D\alpha \|_2^2 + \lambda \| \alpha \|_1, \qquad (4) $$

where $\lambda$ is a regularization parameter which controls the level of sparsity. This problem is also known as basis pursuit (Chen et al., 1998) or the Lasso (Tibshirani, 1994). To solve it, we can use the popular Least Angle Regression (LARS) algorithm.

The objective of pooling (Boureau et al.; Feng et al.) is to transform the joint feature representation into a new, more usable one that preserves important information while discarding irrelevant detail. For each click signal, we usually compute $L$ codes denoted $V \triangleq \{\alpha_i\}$, $i = 1, \dots, L$. Let us define $v_j \in \mathbb{R}^L$, $j = 1, \dots, k$, as the $j$-th row vector of $V$. It is essential to use feature pooling to map the response vector $v_j$ into a statistic value $f(v_j)$ through some spatial pooling operation $f$. We use the response vector $v_j$ to summarize the joint distribution of the $j$-th components of the local features over the region of interest (ROI). We consider the $\ell_\mu$-norm pooling, defined by:

$$ f_n(v; \mu) = \left( \sum_{m=1}^{L} |v_m|^{\mu} \right)^{\frac{1}{\mu}} \quad \text{s.t.} \quad \mu \neq 0. \qquad (5) $$

The parameter $\mu$ determines the selection policy for locations. When $\mu = 1$, $\ell_\mu$-norm pooling is equivalent to sum-pooling and aggregates the responses over the entire region uniformly.
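A minimal sketch of the sparse encoding of Eq. 4 and of the $\ell_\mu$-norm pooling of Eq. 5 is given below. Note that it does not use LARS: it substitutes ISTA (iterative soft-thresholding), a different solver for the same objective, only because it fits in a few lines. The names `ista_encode` and `lmu_pool` and the default values of `lam`, `n_iter` and `mu` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_encode(z, D, lam=0.2, n_iter=200):
    """Approximate argmin_a ||z - D a||_2^2 + lam * ||a||_1 (Eq. 4) with ISTA."""
    # Step size 1 / Lip, where Lip = 2 * sigma_max(D)^2 bounds the
    # Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - z)
        a = soft_threshold(a - step * grad, step * lam)
    return a

def lmu_pool(V, mu=3.0):
    """l_mu-norm pooling (Eq. 5) of a (k, L) code matrix along its L columns."""
    return (np.abs(V) ** mu).sum(axis=1) ** (1.0 / mu)
```

With `mu=1.0` the pooling returns the sum of absolute responses of each row, and for large `mu` it tends towards the largest absolute response.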
When $\mu$ increases, $\ell_\mu$-norm pooling approaches max-pooling. The value of $\mu$ thus tunes the pooling operation from sum-pooling to max-pooling.

Figure 1. Left: Example of detected click with n = 2000. Right: extracted local features with p = 128, L = 1000 (one local feature per column).

In computer vision, Spatial Pyramid Matching (SPM) is a technique (introduced by (Lazebnik et al.)) which improves classification accuracy by performing a more robust local analysis. We adopt the same strategy in order to pool sparse codes over a temporal pyramid (TP) dividing each click signal into ROIs of different sizes and locations. Our TP is defined by the matrix $\Lambda$ of size $(P \times 3)$ (Paris et al.):

$$ \Lambda = [a, b, \Omega], \qquad (6) $$

where $a$, $b$, $\Omega$ are three $(P \times 1)$ vectors representing the subdivision ratios, the overlapping ratios and the weights respectively. $P$ denotes the number of layers in the pyramid. Each row of $\Lambda$ represents a temporal layer of the pyramid, i.e. indicates how to divide the entire signal into possibly overlapping sub-regions. For the $i$-th layer, the click signal is divided into $D_i = \left\lfloor \frac{1 - a_i}{b_i} + 1 \right\rfloor$ ROIs, where $a_i$, $b_i$ are the $i$-th elements of the vectors $a$, $b$ respectively. For the entire TP, we obtain a total of $D = \sum_{i=1}^{P} D_i$ ROIs. Each click signal $c$ $(n \times 1)$ is divided into temporal ROIs $R_{i,j}$, $i = 1, \dots, P$, $j = 1, \dots, D_i$, of size $(\lfloor a_i n \rfloor \times 1)$. All ROIs of the $i$-th layer have the same weight $\Omega_i$. For the $i$-th layer, ROIs are shifted by $\lfloor b_i n \rfloor$ samples. A TP with

$$ \Lambda = \begin{pmatrix} 1 & 1 & 1 \\ \frac{1}{2} & \frac{1}{4} & 1 \end{pmatrix} $$

defines a 2-layer pyramid with $D = 1 + 3$ ROIs according to the formula for $D_i$: the entire signal for the first layer, and 3 half-windows of $\lfloor n/2 \rfloor$ samples shifted by $\lfloor n/4 \rfloor$ samples (overlapping ratio $b_2 = 1/4$) for the second layer. At the end of the pooling stage over $\Lambda$, the global feature $x \in \mathbb{R}^d$, $d = D \cdot k$, is defined by the weighted concatenation (by factors $\Omega_i$) of the $D$ pooled vectors computed from the $L$ codes associated with $c$.

To encode each local feature by sparse coding (see Eq. 4), a dictionary $D$ is trained offline with a large collection of $M \le N \cdot L$ local features as input. One would minimize the regularized empirical risk $R_M$:

$$ R_M(V, D) \triangleq \frac{1}{M} \sum_{i=1}^{M} \| z_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \quad \text{s.t.} \quad d_j^T d_j = 1. \qquad (7) $$

Unfortunately, this problem is not jointly convex, but it can be optimized by an alternating method:

$$ R_M(V \mid \hat{D}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \| z_i - \hat{D}\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1, \qquad (8) $$

which can be solved in parallel by LASSO/LARS, and then:

$$ R_M(D \mid \hat{V}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \| z_i - D\hat{\alpha}_i \|_2^2 \quad \text{s.t.} \quad d_j^T d_j = 1. \qquad (9) $$

Eq. 9 has an analytic solution that involves the inversion of a large $(k \times k)$ matrix and a large memory footprint for storing the matrix $V$ $(k \times M)$. Since $M$ is potentially very large (up to 1 million), an online method for dictionary learning is preferred (Mairal et al.). Figure 2 depicts 3 dictionary basis vectors learned via sparse coding. As depicted, some elements represent more impulsive responses while others represent more harmonic responses.
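The layout of the temporal pyramid can be made concrete with a short sketch that enumerates the ROIs of every layer from the rows $(a_i, b_i, \Omega_i)$ of $\Lambda$, using the formula for $D_i$ and the ROI size and shift given in the text. The function name `pyramid_rois` and the small tolerance added before taking the floor (to absorb floating-point rounding of ratios such as 1/3) are assumptions of this illustration, not part of the paper.

```python
import numpy as np

def pyramid_rois(Lambda, n):
    """List (start, stop, weight) for every ROI of a temporal pyramid.

    Lambda is a (P, 3) array whose rows are (a_i, b_i, Omega_i);
    n is the number of samples of the click signal.
    """
    eps = 1e-9  # assumption: tolerance against floating-point rounding
    rois = []
    for a, b, omega in np.asarray(Lambda, dtype=float):
        n_rois = int(np.floor((1.0 - a) / b + 1.0 + eps))  # D_i
        size = int(np.floor(a * n + eps))                  # floor(a_i * n)
        shift = int(np.floor(b * n + eps))                 # floor(b_i * n)
        for j in range(n_rois):
            rois.append((j * shift, j * shift + size, omega))
    return rois
```

For a pyramid with rows (1, 1, 1) and (1/3, 1/3, 1), this enumeration gives one ROI covering the whole click plus 3 non-overlapping windows, i.e. the 1 + 3 = 4 ROIs used in the experiments.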
3. Range and azimuth logistic regression from global features
After the pooling stage, we have extracted, in an unsupervised way, $N$ global features $X \triangleq \{x_i\} \in \mathbb{R}^{d \times N}$. We propose to regress via logistic regression both the range $r$ and the azimuth $az$ (in the $x$-$y$ plane, when the animal reaches the surface to breathe) from the animal trajectory ground truth, denoted $y$. For the current train/test split of the data, such that $X = X_{train} \cup X_{test}$, $y = y_{train} \cup y_{test}$ and $N = N_{train} + N_{test}$, $\forall \{x_i, y_i\} \in X_{train} \times y_{train}$, we minimize:

$$ \hat{w}_\theta = \arg\min_{w_\theta} \left\{ w_\theta^T w_\theta + C \sum_{i=1}^{N_{train}} \log\left(1 + e^{-y_i w_\theta^T x_i}\right) \right\}, \qquad (10) $$

where $y_i$ denotes $r_i$ and $az_i$ for $\theta = r$ and $\theta = az$ respectively. Eq. 10 can be solved efficiently, for example with the LIBLINEAR software (Fan et al., 2008). In the test phase, the range and azimuth for any $x_i \in X_{test}$ are reconstructed linearly by $\hat{r}_i = \hat{w}_r^T x_i$ and $\widehat{az}_i = \hat{w}_{az}^T x_i$ respectively.

Figure 2. Example of trained dictionary basis with sparse coding.
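To make the objective of Eq. 10 concrete, the sketch below evaluates it together with its gradient and minimizes it with plain gradient descent. This is not how LIBLINEAR solves the problem; the solver, the names `logreg_objective` and `fit`, and the values of `step` and `n_iter` are assumptions chosen for brevity. The targets `y` are passed through exactly as they enter Eq. 10, and `X` holds one global feature per column, as in the text.

```python
import numpy as np

def logreg_objective(w, X, y, C=1.0):
    """Value and gradient of  w^T w + C * sum_i log(1 + exp(-y_i w^T x_i)).

    X has shape (d, N) with one global feature per column, y has shape (N,).
    """
    margins = y * (X.T @ w)
    value = w @ w + C * np.logaddexp(0.0, -margins).sum()
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), computed without overflow
    dloss = -np.exp(-np.logaddexp(0.0, margins))
    grad = 2.0 * w + C * (X @ (dloss * y))
    return value, grad

def fit(X, y, C=1.0, step=1e-2, n_iter=500):
    """Plain gradient descent on Eq. 10 (illustrative, not LIBLINEAR's solver)."""
    w = np.zeros(X.shape[0])
    for _ in range(n_iter):
        _, g = logreg_objective(w, X, y, C)
        w -= step * g
    return w
```

Once a weight vector has been fitted, the linear reconstruction of the test phase is simply `w @ x` for a test feature `x`.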
4. Experimental results
This dataset (Giraudet & Glotin, 2006) contains a total of N = 6134 detected clicks for H = 5 different hydrophones (with 1205, 1238, 1241, 1261 and 1189 clicks respectively).

To extract local features, we chose n = 2000, p = 128 and L = 1000 (tuned by model selection). For both the dictionary learning and the local feature encoding, we chose λ = 0 . and M = 400,000 local features drawn uniformly. We performed K = 10 cross-validation runs, in which the training sets represented 70% of all extracted global features and the remainder formed the testing sets. The logistic regression parameter C is tuned by model selection. We compute the average root mean square error (ARMSE) of the range/azimuth estimates per hydrophone:

$$ ARMSE(l) = \frac{1}{K} \sum_{i=1}^{K} \sqrt{ \frac{1}{N_{test}^{l}} \sum_{j=1}^{N_{test}^{l}} \left( y_{i,j}^{l} - \hat{y}_{i,j}^{l} \right)^2 }, $$

where $y_{i,j}^{l}$, $\hat{y}_{i,j}^{l}$ and $N_{test}^{l}$ represent the ground truth, the estimate and the number of test samples for the $l$-th hydrophone respectively. The global ARMSE is then calculated by $ARMSE = \frac{1}{H} \sum_{l=1}^{H} ARMSE(l)$.

Figure 3. The 2D trajectory (in the xy plane, axes in m) of the single sperm whale observed during 25 min and the corresponding hydrophone positions.

ℓµ-norm pooling case study

For preliminary results, we investigate the influence of the µ parameter during the pooling stage. We fix the number of dictionary basis vectors to k = 128 and the temporal pyramid to Λ = [1, 1, 1], i.e. we pool the sparse codes over the whole temporal click signal. A value of µ = { , } seems to be a good choice for this pooling procedure. For µ ≥ 20, results are similar to those obtained by max-pooling. For azimuth, we observe the same range of µ values.

Figure 4. ARMSE vs. µ for range estimation.

Here, we fixed the value of µ = 3 and varied the number of dictionary basis vectors k from 128 to 4096 elements. We also investigated the influence of the temporal pyramid, and we give results for two particular choices: $\Lambda_1 = [1, 1, 1]$ and $\Lambda_2 = \begin{pmatrix} 1 & 1 & 1 \\ \frac{1}{3} & \frac{1}{3} & 1 \end{pmatrix}$. For $\Lambda_2$, the sparse codes are first pooled over the whole signal, then pooled over 3 non-overlapping windows, for a total of 1 + 3 = 4 ROIs. In order to compare the results of our method, we also give results for a hand-crafted feature (Glotin et al., 2011) specialized for sperm whales and based on the spectrum of the most energetic pulse detected inside the click. This specialized feature, denoted Spectrum feature, is a 128-point vector.
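The error measure used throughout this section can be sketched as below. The sketch takes "root mean square" literally, i.e. each fold's squared errors are averaged over the test samples before the square root; the function names `armse_per_hydrophone` and `global_armse` and the list-of-folds input layout are assumptions made for illustration.

```python
import numpy as np

def armse_per_hydrophone(y_true_folds, y_pred_folds):
    """Average, over the K folds, of the per-fold RMSE for one hydrophone.

    Both arguments are sequences of K arrays (one array of test targets
    or estimates per cross-validation run).
    """
    rmses = [np.sqrt(np.mean((np.asarray(t) - np.asarray(p)) ** 2))
             for t, p in zip(y_true_folds, y_pred_folds)]
    return float(np.mean(rmses))

def global_armse(per_hydrophone):
    """Mean of the per-hydrophone ARMSE values over the H hydrophones."""
    return float(np.mean(per_hydrophone))
```

Averaging the per-fold RMSE values, rather than pooling all errors before taking a single square root, follows the order of operations in the ARMSE definition.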
Figure 5. ARMSE (in m) vs. k for range estimation with µ = 3, for the two temporal pyramids and the Spectrum feature.
Figure 6. ARMSE (in deg) vs. k for azimuth estimation with µ = 3, for the two temporal pyramids and the Spectrum feature.

For both range and azimuth estimates, from k = 2048 onward our method outperforms the Spectrum feature, particularly for the azimuth estimate. Pooling over a temporal pyramid also slightly improves the results.
5. Conclusions and perspectives
We introduced in this paper, for sperm whale localization, a BoF approach via sparse coding that delivers rough estimates of the range and azimuth of the animal, specifically targeted at the mono-hydrophone configuration. Our proposed method works directly on the click signal, without any prior pulse detection/analysis, while being robust to the signal transformations induced by propagation. Coupled with non-linear filtering such as particle filtering (Arulampalam et al., 2002), accurate animal position estimation could be performed even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the targeted applications of this work.

As a perspective, we plan to investigate other local features such as spectral features, MFCC (Davis & Mermelstein, 1980; Rabiner & Juang, 1993) and scattering transform features (Andén & Mallat). The latter can be considered as a hand-crafted first layer of a deep learning architecture with 2 layers.
References
Andén, Joakim and Mallat, Stéphane. Multiscale scattering for audio classification. In ISMIR '11.
Arulampalam, M. Sanjeev, Maskell, Simon, and Gordon, Neil. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. SP, 50:174–188, 2002.
Bénard, Frédéric and Glotin, Hervé. Whales localization using a large array: performance relative to Cramer-Rao bounds and confidence regions. In e-Business and Telecommunications, pp. 294–306. Springer-Verlag, Berlin Heidelberg, September 2009.
Boureau, Y-Lan, Ponce, Jean, and LeCun, Yann. A theoretical analysis of feature pooling in visual recognition. In ICML '10.
Chen, Scott Shaobing, Donoho, David L., and Saunders, Michael A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.
Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP, 28:357–366, 1980.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. JMLR, 2008.
Feng, Jiashi, Ni, Bingbing, Tian, Qi, and Yan, Shuicheng. Geometric ℓp-norm feature pooling for image classification. In CVPR '11.
Giraudet, Pascale and Glotin, Hervé. Real-time 3D tracking of whales by precise and echo-robust TDOAs of clicks extracted from 5 bottom-mounted hydrophones records of the AUTEC. Applied Acoustics, 67:1106–1117, 2006.
Glotin, H., Doh, Y., Abeille, R., and Monnin, A. Physeter distance estimation using sub-band Leroy transmission loss model. In , 2011.
Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR '06, pp. 2169–2178.
Leroy, C. Sound attenuation between 200 and 10000 cps measured along single paths. Technical Report 43, Saclant ASW Research Center, 1965.
Mairal, Julien, Bach, Francis, Ponce, Jean, and Sapiro, Guillermo. Online dictionary learning for sparse coding. In ICML '09.
Nosal, E.-M. and Frazer, L. Track of a sperm whale from delays between direct and surface-reflected clicks. Applied Acoustics, 67:1187–1201, 2006.
Paris, Sébastien, Halkias, Xanadu, and Glotin, Hervé. Efficient bag of scenes analysis for image categorization. In ICPRAM '13.
Rabiner, L. and Juang, B.H. Fundamentals of Speech Recognition. Prentice Hall PTR, 1993.
Tibshirani, Robert. Regression shrinkage and selection via the lasso.