Audio Content based Geotagging in Multimedia
Anurag Kumar*, Benjamin Elizalde+, Bhiksha Raj* — School of Computer Science*, Electrical and Computer Engineering+, Carnegie Mellon University, Pittsburgh, PA, USA - 15213. [email protected], [email protected], [email protected]
ABSTRACT
In this paper we propose methods to extract geographically relevant information from a multimedia recording using its audio. Our method is primarily based on the fact that urban acoustic environments consist of a variety of sounds. Hence, location information can be inferred from the composition of the sound events/classes present in the audio. More specifically, we adopt matrix factorization techniques to obtain the semantic content of a recording in terms of different sound classes. This semantic information is then combined to identify the location of the recording.
Index Terms — Location Identification, Geotagging, Matrix Factorization
1. INTRODUCTION
Extracting information from multimedia recordings has received a lot of attention due to the growing multimedia content on the web. A particularly interesting problem is the extraction of geo-locations or information relevant to geographical locations. This process of providing geographical identity information is usually termed Geotagging [1] and is gaining importance due to its role in several applications. It is useful not only in location based services and recommender systems [2][3][4] but also in general cataloguing, organization, search and retrieval of multimedia content on the web. Location specific information also allows a user to put his/her multimedia content into a social context, since it is human nature to associate with the geographical identity of any material. A nice survey on different aspects of geotagging in multimedia is provided in [1].

Although there are applications which allow users to add geographical information to their photos and videos, a large portion of multimedia content on the web is without any geographical identity. In these cases geotags need to be inferred from the multimedia content and the associated metadata. This problem of geotagging or location identification also features as the Placing Task in the yearly MediaEval [5] benchmark. The goal of the Placing Task [6] in MediaEval is to develop systems which can predict the location of videos based on different modalities of multimedia such as images, audio, text, etc. An important aspect of location prediction systems is the granularity at which the location needs to be predicted. The Placing Task recognizes a wide range of location hierarchies, starting from neighbourhoods and going up to continents. In this work we are particularly interested in obtaining city-level geographical tags, which is clearly one of the most important levels of location specification for any data.
City-level information is easily relatable and is well suited to location based services and recommender systems. Most of the current works on geotagging focus on using the visual/image component of multimedia and the associated text ([7][1][8][9] to cite a few). The audio component has largely been ignored and there is little work on predicting locations based on the audio content of multimedia. However, the authors in [10] argue that there are cases where the audio content might be extremely helpful in identifying the location. For example, speech based cues can aid in recognizing a location. Moreover, factors such as urban soundscapes and a location's acoustic environment can also help in location identification. Very few works have looked into audio based location identification in multimedia recordings [11][12]. The approaches proposed in these works have been simplistic, relying mainly on basic low level acoustic features. One way is to use well known acoustic features such as Mel-Frequency Cepstral Coefficients (MFCC) or Gammatone filter features directly for classification purposes. In other cases, audio-clip level features such as GMM supervectors or Bag of Audio Words (BoAW) histograms are first obtained and then classifiers are trained on these features.

In this work we show that geotagging using only the audio component of multimedia can be done with a reasonably good success rate by capturing the semantic content of the audio. Our primary assertion is that the semantic content of an audio recording in terms of different acoustic events can help in predicting locations. We argue that the soundtracks of different cities are composed of a set of acoustic events. If we can somehow capture the composition of an audio recording in terms of these acoustic events, then that composition can be used to train machine learning algorithms for geotagging purposes. We start with a set of base acoustic events or sound classes and then apply methods based on matrix factorization to find the composition of soundtracks in terms of these acoustic events. Once the weights corresponding to each base sound class have been obtained, we build higher level features using these weights, which are further used to obtain kernel representations. The kernels corresponding to each basis sound class are then combined to finally train Support Vector Machines for predicting the location of the recording.

The rest of the paper is organized as follows. In Section 2 we describe our proposed framework for audio based geotagging. In Section 3 we present our experiments and results. In Section 4 we discuss the scalability of our proposed method and also give concluding remarks.
2. AUDIO BASED GEOTAGGING
Audio based geotagging in multimedia can be performed by exploiting the audio content in several ways. One can possibly try to use automatic speech recognition (ASR) to exploit the speech information present in the audio. For example, speech might contain words or sentences which uniquely identify a place: "I am near the Eiffel Tower" clearly gives away the location as Paris, with high probability, irrespective of the presence or absence of any other cues. Other details such as the language used, mentions of landmarks, etc. in speech can also help in audio based geotagging.

2.1. Audio Semantic Content based Geotagging
In this work we take a more generic approach, where we try to capture the semantic content of audio through the occurrence of different meaningful sound events and scenes in the recording. We argue that it should be possible to train machines to capture the identity of a location by capturing the composition of audio recordings in terms of human recognizable sound events. This idea can be related to, and is in fact backed by, works on urban soundscapes [13][14]. Based on this idea of location identification through the semantic content of audio, we try to answer two important questions.
First, how to mathematically capture the composition of audio recordings, and second, how to use the information about the semantic content of a recording to train classifiers which can predict the identity of a location. We provide our answers to each of these questions one by one. It is worth noting that this overall framework is different from audio event recognition: our goal is not to identify acoustic events but to find the composition of acoustic events in a way which can further be used to obtain geographical locations.

Let E = {E_1, E_2, ..., E_L} be the set of base acoustic events or sound classes whose composition is to be captured in an audio recording. Let us assume that each of these sound classes can be characterized by a basis matrix M_l. For a given sound class E_l, the column vectors of its basis matrix M_l essentially span the space of sound class E_l. Mathematically, this span is in the space of some acoustic feature (e.g. MFCC) used to characterize audio recordings and over which the basis matrices have been learned. How we obtain M_l is discussed later. Any given soundtrack or audio recording is then decomposed with respect to the sound class E_l as

X ≈ M_l W_l^T    (1)

where X is a d × n dimensional representation of the audio recording using acoustic features such as MFCC. For MFCCs, this implies that each column of X is a d dimensional vector of mel-frequency cepstral coefficients and n is the total number of frames in the audio recording. The sound basis matrices M_l are d × k dimensional, where k represents the number of basis vectors in M_l. In principle k can vary with each sound class; however, for the sake of convenience we assume it is the same for all E_l, for l = 1 to L.

Equation 1 defines the relationship between the soundtrack and its composition in terms of sound classes. The weight matrix W_l captures how the sound class E_l is present in the recording. It is representative of the distribution of sound class E_l throughout the duration of the recording.
Hence, obtaining W_l for each l provides us information about the structural composition of the audio in terms of the sound classes in E. These W_l can then be used for differentiating locations. Thus, the first problem we need to address is to learn M_l for each E_l and then use it to compute W_l for any given recording.

2.2. Learning M_l and W_l using semi-NMF

Let us assume that for a given sound class E_l we have a collection of N_l audio recordings belonging to class E_l only. We parametrize each of these recordings through some acoustic features. In this work we use MFCC features augmented by delta and acceleration coefficients (denoted by MFCA) as basic acoustic features. These acoustic features are represented by the d × n_i dimensional matrix X^i_{E_l} for the i-th recording; d is the dimensionality of the acoustic features and each column represents the acoustic features of a frame. The basic features of all recordings are collected into one single matrix X_{E_l} = [X^1_{E_l}, ..., X^{N_l}_{E_l}] to get a large collective sample of acoustic features for sound class E_l. Clearly, X_{E_l} has d rows; let T be the number of columns in this matrix.

To obtain the basis matrix M_l for E_l we employ matrix factorization techniques. More specifically, we use the Non-negative Matrix Factorization (NMF) like methods proposed in [15]. [15] proposed two matrix factorization methods, named semi-NMF and convex-NMF, which are like NMF but do not require the data matrix to be non-negative. This is important in our case, since employing classical NMF [16] algorithms would require our basic acoustic features to be non-negative. This can be highly restrictive given the challenging task at hand. Even though we employ MFCCs as acoustic features, our proposed general framework based on semi-NMF can be used with other features as well. Moreover, semi-NMF offers other interesting properties, such as its interpretation in terms of K-means clustering. One of our higher level features is based on this interpretation of semi-NMF.
convex-NMF did not yield desirable results and hence we do not discuss it in this paper.

semi-NMF considers the factorization of a matrix X_{E_l} as X_{E_l} ≈ M_l W^T. For the factorization, the number of basis vectors k in M_l is fixed to a value less than min(d, T). semi-NMF does not impose any restriction on M_l, that is, its elements can have any sign. The weight matrix W, on the other hand, is restricted to be non-negative. The objective is to minimize ||X_{E_l} − M_l W^T||. Assuming that M_l and W have been initialized, M_l and W are updated iteratively in the following way. In each iteration:

• Fix W, update M_l as
  M_l = X_{E_l} W (W^T W)^{-1}    (2)

• Fix M_l, update W as
  W_rs = W_rs sqrt( [ (X^T_{E_l} M_l)^+_rs + [W (M^T_l M_l)^-]_rs ] / [ (X^T_{E_l} M_l)^-_rs + [W (M^T_l M_l)^+]_rs ] )    (3)

The process is iterated till the error drops below a certain tolerance. The + and − superscripts represent the positive and negative parts of a matrix, obtained as Z^+_rs = (|Z_rs| + Z_rs)/2 and Z^-_rs = (|Z_rs| − Z_rs)/2. Theoretical guarantees on the convergence of semi-NMF and other interesting properties, such as invariance with respect to scaling, can be found in the original paper. One interesting aspect of semi-NMF described by the authors is its analysis in terms of the K-means clustering algorithm. The objective function ||X − MW^T|| can be related to the K-means objective function, with M_l representing the k cluster centers. Hence, the basis matrix M_l also represents the centers of clusters. We exploit this interpretation in the next phase of our approach. The initialization of M_l and W is done as per the procedure described in [15].

Once M_l has been learned for each E_l, we can easily obtain W_l for any given audio recording X by fixing M_l and then applying Eq 3 to X for several iterations. For a given X, W_l contains information about E_l in X. With the K-means interpretation of semi-NMF, the non-negative weight matrix W_l can be interpreted as containing soft assignment posteriors to each cluster for all frames in X.
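As an illustration, the update rules in Eq 2 and Eq 3 can be sketched in NumPy. This is a minimal sketch under our own assumptions (random non-negative initialization, a fixed iteration budget, and a pseudo-inverse in place of the plain inverse for numerical safety); the paper itself initializes M_l and W as in [15].

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, tol=1e-6, seed=0):
    """Semi-NMF: factor X (d x T) as M W^T with W >= 0 and M unconstrained."""
    rng = np.random.default_rng(seed)
    d, T = X.shape
    W = np.abs(rng.standard_normal((T, k)))   # non-negative init (assumption)
    pos = lambda Z: (np.abs(Z) + Z) / 2.0     # Z^+ in the paper's notation
    neg = lambda Z: (np.abs(Z) - Z) / 2.0     # Z^-
    prev_err = np.inf
    for _ in range(n_iter):
        # Eq 2: fix W, update M = X W (W^T W)^{-1} (pseudo-inverse for safety)
        M = X @ W @ np.linalg.pinv(W.T @ W)
        # Eq 3: fix M, multiplicative update; keeps W non-negative
        XtM, MtM = X.T @ M, M.T @ M
        num = pos(XtM) + W @ neg(MtM)
        den = neg(XtM) + W @ pos(MtM) + 1e-12
        W *= np.sqrt(num / den)
        err = np.linalg.norm(X - M @ W.T)
        if prev_err - err < tol:              # stop when the error stabilizes
            break
        prev_err = err
    return M, W
```

Given a learned M_l, the same Eq 3 update with M_l held fixed yields W_l for a new recording.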
2.3. Location Prediction using W_l

We treat the problem of location prediction as a retrieval problem where we want to retrieve the most relevant recordings belonging to a certain location (city). Put more formally, we train binary classifiers for each location to retrieve the most relevant recordings belonging to the concerned location. Let us assume that we are concerned with a particular city C and that S = {s_i, i = 1 to N} is the set of available training audio recordings. The labels of the recordings are represented by y_i ∈ {−1, 1}, with y_i = 1 if s_i belongs to C and y_i = −1 otherwise. X_i (d × n_i) denotes the MFCA representation of s_i. For each X_i, weight composition matrices W_li are obtained with respect to all sound events E_l in E. W_li captures the distribution of sound event E_l in X_i, and we propose histogram based representations to characterize this distribution.

2.3.1. Direct characterization of W_l as posteriors

As we mentioned before, semi-NMF can be interpreted in terms of K-means clustering. For a given E_l, the learned basis matrix M_l can be interpreted as a matrix containing cluster centers. The weight matrix W_li (n_i × k) obtained for X_i using M_l can then be interpreted as posterior probabilities for each frame in X_i with respect to the cluster centers in M_l. Hence, we first normalize each row of W_li to sum to 1, to convert it into a probability space. Then, we obtain a k dimensional histogram representation for X_i corresponding to M_l as

  h_li = (1/n_i) Σ_{t=1}^{n_i} w_t ;  w_t = t-th row of W_li    (4)

This is done for all M_l, and hence for each training recording we obtain a total of L k-dimensional histograms, represented by h_li.

2.3.2. GMM based characterization of W_l

We also propose another way of capturing the distribution in W_l, where we actually fit a mixture model to it. For a given sound class E_l, we first collect W_li for all X_i in the training data. We then train a Gaussian Mixture Model G_l on the accumulated weight vectors.
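The direct characterization of Section 2.3.1 (Eq 4) amounts to a row-normalization of W_li followed by an average over frames; a minimal sketch (the function name is ours):

```python
import numpy as np

def histogram_feature(W):
    """Eq 4: row-normalize W (n_i x k) into per-frame cluster posteriors,
    then average over frames to get the k-dimensional histogram h_l."""
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)  # rows now sum to 1
    return P.mean(axis=0)
```

The resulting vector lives on the probability simplex, which is what makes the χ² kernels of Section 2.4 a natural fit.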
Let this GMM be G_l = {λ^l_g, N(μ^l_g, Σ^l_g), g = 1 to G_l}, where λ^l_g, μ^l_g and Σ^l_g are the mixture weight, mean and covariance parameters of the g-th Gaussian in G_l. Once G_l has been obtained, for any W_li we compute the probabilistic posterior assignment of the weight vectors w_t in W_li according to Eq 5 (Pr(g|w_t)); w_t are again the rows of W_li. These soft assignments are added over all t to obtain the total mass of weight vectors belonging to the g-th Gaussian (P(g)_li, Eq 5). Normalization by n_i is done to remove the effect of the duration of the recording.

  Pr(g|w_t) = λ^l_g N(w_t; μ^l_g, Σ^l_g) / Σ_{p=1}^{G_l} λ^l_p N(w_t; μ^l_p, Σ^l_p) ;  P(g)_li = (1/n_i) Σ_{t=1}^{n_i} Pr(g|w_t)    (5)

The final representation of W_li is v_li = [P(1)_li, ..., P(G_l)_li]^T. v_li is a G_l-dimensional feature representation of a given recording X_i with respect to E_l. The whole process is done for all E_l, to obtain L different soft assignment histograms for a given X_i.

2.4. Kernel Fusion for Location Prediction

The h_li or v_li features capture acoustic event information for any X_i. We then use kernel fusion methods in Support Vector Machines (SVM) to finally train classifiers for geotagging purposes. We explain the method here in terms of h_li; for v_li the steps followed are the same. For each l, we obtain a separate kernel representation K_l using h_li for all X_i. Since exponential χ² kernel SVMs are known to work well with histogram representations [17][18], we use kernels of the form K_l(h_li, h_lj) = exp(−D(h_li, h_lj)/γ), where D(h_li, h_lj) is the χ² distance between h_li and h_lj. γ is set to the average of all pairwise distances.
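The soft count features of Eq 5 can be sketched as below. For brevity we assume diagonal covariances and take the GMM parameters as given (in practice G_l would be fit with EM, e.g. scikit-learn's GaussianMixture); the log-domain computation is our addition, for numerical stability.

```python
import numpy as np

def gmm_soft_counts(Wl, weights, means, covs):
    """Eq 5: duration-normalized soft counts of W_l's rows under a GMM.

    Wl:      (n_i x k) weight matrix of one recording w.r.t. event E_l
    weights: (G,)      mixture weights lambda_g
    means:   (G x k)   component means mu_g
    covs:    (G x k)   diagonal covariances (diagonal assumed for brevity)
    """
    n, k = Wl.shape
    diff = Wl[:, None, :] - means[None, :, :]              # (n x G x k)
    # log N(w_t; mu_g, Sigma_g) for every frame/component pair -> (n x G)
    log_pdf = -0.5 * (np.sum(diff**2 / covs[None], axis=2)
                      + np.sum(np.log(2 * np.pi * covs), axis=1)[None, :])
    log_post = np.log(weights)[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)        # stabilize exp
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                # Pr(g | w_t)
    return post.sum(axis=0) / n                            # v_l = [P(1),...,P(G)]
```

By construction the entries of v_l are non-negative and sum to one, so v_l is again a histogram-like feature.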
Once we have all K_l, we use two simple kernel fusion methods:

• Average kernel fusion - the final kernel representation is given by K^h_S = (1/L) Σ_{l=1}^{L} K_l(:,:)

• Product kernel fusion - the final kernel representation is given by K^h_P = Π_{l=1}^{L} K_l(:,:)

Finally, K^h_S or K^h_P is used to train SVMs for prediction.
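The per-class kernels and their fusion can be sketched as follows. The 0.5 factor in the χ² distance is one common convention, and γ defaults to the average pairwise distance as described above; the function names are ours.

```python
import numpy as np

def chi2_exp_kernel(H, gamma=None):
    """Exponential chi-squared kernel between rows of H (N x k histograms)."""
    A, B = H[:, None, :], H[None, :, :]
    D = 0.5 * np.sum((A - B)**2 / (A + B + 1e-12), axis=2)  # chi^2 distances
    if gamma is None:
        gamma = D[np.triu_indices_from(D, k=1)].mean()      # avg pairwise dist
        gamma = gamma if gamma > 0 else 1.0
    return np.exp(-D / gamma)

def fuse_kernels(kernels, mode="avg"):
    """Average or product fusion of the per-sound-class kernels."""
    K = np.stack(kernels)
    return K.mean(axis=0) if mode == "avg" else K.prod(axis=0)
```

The fused Gram matrix can be passed directly to an SVM that accepts precomputed kernels (e.g. sklearn.svm.SVC(kernel="precomputed")).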
3. EXPERIMENTS AND RESULTS
As stated before, our goal is to perform city-level geotagging in multimedia. Hence, we evaluate our proposed method on the dataset used in [11], which provides city-level tags for Flickr videos. The dataset contains a total of 1079 Flickr videos, split into a training set and a testing set. We work with only the audio of each video and will alternatively refer to these videos as audio recordings. The videos belong to different cities, and several cities have very few examples in the training as well as the testing set (for example Bangkok and Beijing). We selected for evaluation the 10 cities for which the training as well as the test set contains a minimum number of examples. These cities are Berlin (B), Chicago (C), London (L), Los Angeles (LA), Paris (P), Rio (R), San Francisco (SF), Seoul (SE), Sydney (SY) and Tokyo (T). As stated before, the basic acoustic features used are MFCC features augmented by delta and acceleration coefficients. MFCCs are extracted for each audio recording over short overlapping windows; together with the delta and acceleration coefficients they form the frame level features, referred to as MFCA features.

We compare our proposed method with two methods, one based on GMM based bag of audio words (BoAW) and the other based on GMM supervectors. These are clip level feature representations built over the MFCA acoustic features of each recording. The first step in both is to train a background GMM G^s_b with G_b components over MFCA features, where each Gaussian represents an audio word. Then, for each audio recording, clip level histogram features are obtained using the GMM posteriors of each frame in the clip. The computation is similar to Eq 5, except that the process is done over MFCA features. These clip level representations are soft count bag of audio words representations. GMM supervectors are obtained by adapting the means of the background GMM G^s_b for a given recording using maximum a posteriori (MAP) adaptation [19]. We will use b to denote the G_b dimensional bag of audio words features and s to denote the G_b × d dimensional GMM supervectors. Exponential χ² kernel SVMs are used with the b features and linear SVMs are used with the GMM supervector features. Exponential χ² kernels are usually represented as K(x, y) = exp(−γ D(x, y)), where D(x, y) is the χ² distance between vectors x and y. Both of these kernels are known to work best for the corresponding features. All parameters, such as γ and the slack parameter C in the SVMs, are selected by cross validation over the training set.

For our proposed method we need a set of sound classes E. Studies on urban soundscapes have tried to categorize urban acoustic environments [13][14][20]. [21] came up with a refined taxonomy of urban sounds and also created a dataset, UrbanSounds8k, for urban sound events. This dataset contains 8732 audio recordings spread over 10 different sound events from the urban sound taxonomy. These sound events are car horn, children playing, dog barking, air conditioner noise, drilling, engine idling, gun shot, jackhammer, siren and street music. We use these sound classes as our set E and then obtain the basis matrices M_l for each E_l using the examples of these sound events provided in the UrbanSounds8k dataset. The number of basis vectors k is the same for all M_l and is fixed to either 20 or 40; we present results for both cases. Finally, in the classifier training stage, SVMs are trained using the fused kernel K^h_S (or K^h_P, K^v_S, K^v_P) as described in Section 2.4. The slack parameter C in the SVM formulation is set by performing cross validation over the training set.

We formulate the geotagging problem as a retrieval problem where the goal is to retrieve the most relevant audios for a city. We use the well known Average Precision (AP) as the metric to measure performance for each city, and Mean Average Precision (MAP) over all cities as the overall metric. Due to space constraints we are not able to show AP results in every case and will only present the overall MAP.

Table 1. MAP for different cases (b, s and h_l): the baseline methods are evaluated for G_b = 32, 64, 128, 256, and the proposed method (average kernel K^h_S and product kernel K^h_P) for k = 20, 40.

Table 1 shows MAP results for the BoAW and supervector based methods (top rows) and our proposed method (bottom rows) using the h_l features described in Section 2.3.1. For the baseline methods we experimented with different component sizes G_b for the GMM G^s_b; k represents the number of basis vectors in each M_l. K^h_S represents average kernel fusion and K^h_P product kernel fusion. First, we observe that our proposed method outperforms these state of the art methods by a significant margin. For BoAW, G_b = 256 gives the highest MAP, but the MAP saturates with increasing G_b, and hence any significant improvement in MAP by further increasing G_b is not expected. For supervectors, G_b = 64 gives the best result and the MAP decreases on further increasing G_b. Our proposed method with k = 40 and product kernel fusion gives the best overall MAP, an absolute improvement over both the BoAW and supervector based methods. The MAP in the other cases for our proposed method is also in general better than the best MAP using the state of the art methods. We also note that for the h_l features, product kernel fusion of the different sound class kernels performs better than average kernel fusion. Also, for h_l, k = 40 is better than k = 20.

Table 2 shows results for the v_l features of Section 2.3.2, which use GMM based characterization of the composition matrices W_l. We experimented with different values of the GMM component size G_l. Once again we observe that overall this framework gives superior performance, and once again the MAP obtained with v_l is
higher in absolute terms than the best MAP obtained with supervectors. This shows that the composition matrices W_l are actually capturing semantic information from the audio, and this semantic information, when combined, helps in location identification. If we compare the v_l and h_l methods, then overall h_l seems to give better results. This is worth noting, since it suggests that W_l on its own is extremely meaningful and sufficient. Another interesting observation is that for v_l, average kernel fusion is better than product kernel fusion.

Figure 1 shows city-wise results for all four methods. For each method, the AP shown corresponds to the configuration which results in the best MAP for that method. This implies a GMM component size of 256 in both BoAW and v_l, that is G_l = G_b = 256; for h_l, k = 40 with product kernel fusion; for v_l, k = 20 with average kernel fusion; for the supervector based method, G_b = 64. For convenience, city names are denoted by the indices introduced in the beginning paragraph of this section. Figure 1 also shows the MAP values at the extreme right. One can observe from Figure 1 that cities such as Rio (R), San Francisco (SF) and Seoul (SE) are much easier to identify, and all methods
Table 2. MAP for different cases for v_l

          Avg Ker. (K^v_S)      Prod. Ker. (K^v_P)
G_l ↓     k = 20   k = 40       k = 20   k = 40
32        0.454    0.427        0.448    0.417
64        0.482    0.466        0.432    0.424
Fig. 1. Average Precision for each city, with MAP at the extreme right. Methods: BoAW (exponential χ² kernel), supervectors (linear kernel), W_l (h_l, product kernel fusion), and W_l + GMM (v_l, average kernel fusion).

achieve a high AP. On the other hand,
Sydney (SY) is much harder to geotag compared to the other cities. Once again, our proposed method outperforms the BoAW and supervector based methods for all cities except Berlin (B).
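For reference, the AP and MAP metrics used above can be computed as follows (a standard sketch, not the authors' evaluation code; function names are ours):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one city: mean of precision@rank over the relevant recordings."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # rank by score, descending
    rel = np.asarray(labels)[order]
    hits = np.cumsum(rel)                                 # relevant found up to each rank
    ranks = np.flatnonzero(rel == 1) + 1                  # 1-based ranks of relevant items
    return float((hits[rel == 1] / ranks).mean()) if rel.sum() else 0.0

def mean_average_precision(per_city_aps):
    """MAP: mean of the per-city APs."""
    return float(np.mean(per_city_aps))
```

For instance, scores [0.9, 0.8, 0.7] with labels [1, 0, 1] rank the relevant items at positions 1 and 3, giving AP = (1/1 + 2/3)/2.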
4. DISCUSSIONS AND CONCLUSIONS
We presented methods for geotagging in multimedia using its audio content. We proposed that the semantic content of audio, captured in terms of the different sound events which occur in our environment, can be used for location identification purposes. It is expected that the larger the number of sound classes in E, the more distinguishing elements we can obtain, and the better it is for geotagging. Hence, it is desirable that any framework working under this idea be scalable in the number of sounds in E. In our proposed framework the basis matrices M_l are learned independently of each other, so this step can be easily parallelized. Similarly, the composition weight matrices W_li can be computed in parallel for each E_l, and so can the features h_li (or v_li) and the kernel matrices. Hence, our proposed framework is completely scalable in terms of the number of sound events in the set E. If required, one can also easily add a new sound class to an existing system. Moreover, our proposed framework can be applied on top of any acoustic feature.

Even with sound events from the urban sound taxonomy only, we obtained reasonably good performance. Our proposed framework outperformed state of the art supervector and bag of audio words based methods by a significant margin. Currently, we used simple kernel fusion methods to combine the event specific kernels. One can potentially use established methods such as multiple kernel learning at this step, which might lead to further improvement in the results. One can also look into other methods for obtaining the basis matrices for sound events. A more comprehensive analysis on a larger dataset with a larger number of cities can throw more light on the effectiveness of the proposed method. However, this work does give sufficient evidence of the promise of audio content based geotagging in multimedia.
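The per-class parallelism noted above is straightforward to realize, since each class's factorization touches only its own data. A small sketch follows; the per-class learner here is a placeholder (an SVD basis stands in for the semi-NMF of Section 2.2), and all names are ours.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def learn_basis(X, k):
    """Stand-in for per-class basis learning (semi-NMF in the paper);
    an SVD-derived basis is used here purely as a placeholder."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def learn_all_bases(class_features, k=20, workers=4):
    """Each M_l depends only on its own class data, so the L factorizations
    can run concurrently (threads here; processes would work equally well)."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(learn_basis, X, k) for X in class_features]
        return [f.result() for f in futures]
```

Adding a new sound class then simply means running `learn_basis` on its data and appending the result, without retraining the existing bases.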
REFERENCES

[1] Jiebo Luo, Dhiraj Joshi, Jie Yu, and Andrew Gallagher, "Geotagging in multimedia and computer vision: a survey," Multimedia Tools and Applications, vol. 51, no. 1, pp. 187–211, 2011.

[2] Jie Bao, Yu Zheng, and Mohamed F. Mokbel, "Location-based and preference-aware recommendation using sparse geo-social networking data," in Proceedings of the 20th International Conference on Advances in Geographic Information Systems. ACM, 2012, pp. 199–208.

[3] Jie Bao, Yu Zheng, David Wilkie, and Mohamed Mokbel, "Recommendations in location-based social networks: a survey," GeoInformatica, vol. 19, no. 3, pp. 525–565, 2015.

[4] Abdul Majid, Ling Chen, Gencai Chen, Hamid Turab Mirza, Ibrar Hussain, and John Woodward, "A context-aware personalized travel recommendation system based on geotagged social media data mining," International Journal of Geographical Information Science, vol. 27, no. 4, pp. 662–684, 2013.

[5] MediaEval, 2015.

[6] J. Choi, B. Thomee, G. Friedland, L. Cao, K. Ni, D. Borth, B. Elizalde, L. Gottlieb, C. Carrano, R. Pearce, et al., "The placing task: A large-scale geo-estimation challenge for social-media videos and images," in Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia. ACM, 2014, pp. 27–31.

[7] Michele Trevisiol, Hervé Jégou, Jonathan Delhumeau, and Guillaume Gravier, "Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach," in Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval. ACM, 2013, pp. 1–8.

[8] Yi-Cheng Song, Yong-Dong Zhang, Juan Cao, Tian Xia, Wu Liu, and Jin-Tao Li, "Web video geolocation by geotagged social resources," Multimedia, IEEE Transactions on, vol. 14, no. 2, pp. 456–470, 2012.

[9] Pascal Kelm, Sebastian Schmiedeke, Jaeyoung Choi, Gerald Friedland, Venkatesan Nallampatti Ekambaram, Kannan Ramchandran, and Thomas Sikora, "A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation," in Proceedings of the 2nd ACM International Workshop on Geotagging and Its Applications in Multimedia. ACM, 2013, pp. 7–12.

[10] Jaeyoung Choi, Howard Lei, Venkatesan Ekambaram, Pascal Kelm, Luke Gottlieb, Thomas Sikora, Kannan Ramchandran, and Gerald Friedland, "Human vs machine: establishing a human baseline for multimodal location estimation," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 867–876.

[11] Howard Lei, Jaeyoung Choi, and Gerald Friedland, "Multimodal city-verification on flickr videos using acoustic and textual features," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 2273–2276.

[12] Xavier Sevillano, Xavier Valero, and Francesc Alías, "Audio and video cues for geo-tagging online videos in the absence of metadata," in Content-Based Multimedia Indexing (CBMI), 2012 10th International Workshop on. IEEE, 2012, pp. 1–6.

[13] A. L. Brown, Jian Kang, and Truls Gjestland, "Towards standardization in soundscape preference assessment," Applied Acoustics, vol. 72, no. 6, pp. 387–392, 2011.

[14] S. R. Payne, W. J. Davies, and M. D. Adams, "Research into the practical and policy applications of soundscape concepts and techniques in urban areas," 2009.

[15] Chris Ding, Tao Li, and Michael I. Jordan, "Convex and semi-nonnegative matrix factorizations," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 1, pp. 45–55, 2010.

[16] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.

[17] Jianguo Zhang, Marcin Marszałek, Svetlana Lazebnik, and Cordelia Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," International Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007.

[18] Liangliang Cao, Shih-Fu Chang, Noel Codella, Courtenay Cotton, Dan Ellis, Leiguang Gong, Matthew Hill, Gang Hua, John Kender, Michele Merler, et al., "IBM Research and Columbia University TRECVID-2011 multimedia event detection (MED) system," in NIST TRECVID Workshop, 2011.

[19] William M. Campbell, Douglas E. Sturim, and Douglas A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, 2006.

[20] R. Murray Schafer, The Soundscape: Our Sonic Environment and the Tuning of the World, Inner Traditions/Bear & Co, 1993.

[21] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.