Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering,
AI Institute, Shanghai Jiao Tong University, Shanghai, China
{wsntxxn, richman, mengyuewu, kai.yu}@sjtu.edu.cn

Mengyue Wu and Kai Yu are the corresponding authors.
https://github.com/wsntxxn/TextToAudioGrounding

ABSTRACT
Automated Audio Captioning is a cross-modal task that generates natural language descriptions to summarize the sound events in audio clips. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in Audiocaps, along with the location (timestamps) of each present sound event. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which interactively considers the relationship between audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.

Index Terms — text-to-audio grounding, sound event detection, dataset, deep learning
1. INTRODUCTION
Using natural language to summarize audio content, commonly referred to as Automated Audio Captioning (AAC), has attracted much attention in recent studies [1, 2, 3, 4]. Compared with other audio processing tasks like Acoustic Scene Classification (ASC) and Sound Event Detection (SED), which aim to categorize audio into specific scene or event labels, AAC allows the model to describe audio content in natural language, a much less restricted text form. AAC can thus be seen as a less structured summarization of sound events. However, the correspondence between sound event detection and natural language description is rarely investigated. To achieve human-like audio perception, a model should be able to generate an audio caption and to understand natural language grounded in acoustic content, i.e., to ground (detect) each sound event mentioned in a given audio caption in the corresponding segments of that audio. Explicit grounding of sound event phrases in the corresponding audio is key to audio-oriented language understanding. Moreover, it would be beneficial for generating captions with more accurate event illustrations and for localized AAC evaluation methods.

Although such an audio grounding task (text-to-audio grounding, TAG) is relatively novel in audio understanding and audio-text cross-modal research, it is related to the following problems.

Visual Grounding
A similar task to TAG is object grounding in Computer Vision (CV) using images or videos. Flickr30k Entities [5] is the first public dataset for image grounding, and image object grounding has become a research hotspot since its release [6, 7, 8]. Recently, a plethora of work has focused on new datasets and approaches for video object grounding [9, 10, 11]. Like audio-text grounding, visual grounding requires a model to predict bounding boxes (2D coordinates) in an image or video frame for each object described in the caption.
Sound Event Detection (SED)
SED aims to classify and localize particular sound events in an audio clip. With the growing influence of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [12], research interest in SED has soared recently. TAG can be viewed as text-query-based SED, focusing on localizing the sound events described by queries. Due to the intrinsic correlation between SED and TAG, we borrow common approaches and evaluation metrics from SED as a benchmark for TAG.
Fig. 1. An example from TextToAudioGrounding. For an audio clip (log-mel spectrogram shown; caption: "A man is speaking while birds are chirping in the background"), on- and off-set timestamps for each sound event phrase are provided. In this example, both "a man speaking" (red) and "birds chirping" (blue) point to multiple segments (represented by rectangles in the figure).

An audio grounding task inevitably consists of two parts. First is the extraction of sound event phrases from the natural language caption; e.g., "people speak" and "dogs bark" can be obtained from the caption "people speak while dogs bark". The second stage is concerned with traditional SED: detecting a sound event's presence along with its onset and offset timestamps in the given audio clip. The prerequisite is a dataset that simultaneously provides audio, captions, and the segmentation of the sound events grounded from the caption. To the best of our knowledge, no existing datasets or tasks focus on text-to-audio grounding.

We contribute the AudioGrounding dataset (Section 2) in this paper, providing a corresponding series of audio - caption - sound event phrase - sound event timestamp segmentation to enable more interactive cross-modal research between audio processing and natural language understanding. An illustration from AudioGrounding is shown in Figure 1. With this dataset, we consider TAG, which localizes the corresponding sound events in an audio clip from a given language description. A baseline approach for the new TAG task is also proposed (Section 3). Section 4 details the experimental results and analyses of the TAG task, with conclusions provided in Section 5.
2. THE AUDIO GROUNDING DATASET
Our AudioGrounding dataset entails audio clips with one caption per audio in the training set and five captions per audio in the validation and test sets. We provide caption-oriented sound event tagging for each audio, along with each sound event's segmentation timestamps. The audio sources are rooted in AudioSet [13] and the captions are sourced from Audiocaps [14].
AudioSet is a large-scale manually-annotated sound event dataset. Each audio clip has a duration of up to ten seconds and contains at least one sound event label. AudioSet is built on an ontology of 527 event classes, encompassing most everyday sounds.
Audiocaps [14] is by far the largest AAC dataset, consisting of audio clips (≈ 127 hours) collected from AudioSet. One human-annotated caption is provided for the training set, while five captions are provided for the validation and test sets, respectively. Since the entire Audiocaps dataset is a subset of AudioSet, sound event labels can be obtained for each audio clip in Audiocaps.

It should be noted that though AudioSet provides sound tags and Audiocaps consists of descriptive captions, there is no direct link between these two annotations. As we would like to enhance the diversity of the sound events included, we selectively choose audio clips with more than four sound tags, resulting in a subset of audio clips sourced from Audiocaps. For successful text-to-audio grounding, each audio clip should have not only a caption description ("A man is speaking while birds are chirping in the background"), but also the corresponding sound event phrases retrieved from the caption ("a man is speaking", "birds are chirping") and the on- and off-sets of these sound events.
Our annotation process is decoupled into two stages: (1) sound event phrases are extracted automatically from captions; (2) we invite annotators to merge extracted phrases that correspond to the same sound event and to provide the duration segmentation of each sound event.
A. Extracting Sound Event Phrases from Captions
As mentioned above, the sound event labels provided in AudioSet have no correspondence with the descriptive captions in Audiocaps. Therefore, we first extract sound event phrases from captions using NLTK [15]. A phrase refers to a contiguous chunk of words in a caption. Following standard chunking methods, we extract noun phrases (NP) and combinations of NP and verb phrases (NP + VP). As sound descriptions usually stem from objects that sound (e.g., a cat) and verbs that create the sound (e.g., meow), NP and NP + VP phrases can roughly summarize all possible sound events.
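To make the chunking concrete, a minimal sketch with NLTK is shown below. The chunk grammar is our own illustration (the paper does not specify the exact rule set beyond NP and NP + VP), and it assumes NLTK's tokenizer and POS tagger models are available.

```python
# Illustrative NP / NP + VP chunking with NLTK; the grammar is a sketch,
# not necessarily the exact rule set used to build the dataset.
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # e.g. "a man", "birds", "a small crowd"
  VP: {<VB.*>+}             # e.g. "is speaking", "are chirping"
  EVENT: {<NP><VP>}         # NP + VP, e.g. "a man is speaking"
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_event_phrases(caption: str) -> list:
    """Return NP and NP + VP chunks as candidate sound event phrases."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    tree = chunker.parse(tagged)
    # Nested NPs inside EVENT chunks are also returned; such repetitive
    # candidates are merged or discarded by annotators in stage (2).
    return [
        " ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() in ("EVENT", "NP"))
    ]

print(extract_event_phrases(
    "a man is speaking while birds are chirping in the background"))
# e.g. ['a man is speaking', 'a man', 'birds are chirping', 'birds',
#       'the background']
```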
B. Phrase Merging and Segmentation
Manual phrase merging is necessary since there might be repetitive and unwanted information in the extracted phrases. For example, the caption in Figure 2 is chunked into three phrases: "people", "a small crowd are speaking", and "a dog barks". However, "people" and "a small crowd are speaking" refer to the same sound event. Based on the extracted phrases, annotators are required to label an audio clip in a two-step process:
1. Merge phrases describing the same sound event into a single set and identify the number of sound events mentioned in the audio;
2. Segment each sound event with the on- and off-set timestamps.
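The outcome of this two-step annotation can be pictured as a record like the following sketch. The layout and field names are hypothetical (the released dataset's exact schema may differ) and only illustrate the audio - caption - phrase - timestamp correspondence:

```python
# Hypothetical annotation record; field names are illustrative only and
# may differ from the released dataset's schema. Times are in seconds.
sample = {
    "audio_id": "Yexample01.wav",  # made-up id of the source AudioSet clip
    "caption": "people in a small crowd are speaking and a dog barks",
    "events": [
        {   # phrases merged by annotators: they describe one sound event
            "phrases": ["people", "a small crowd are speaking"],
            "segments": [[0.0, 2.1], [4.3, 7.8]],  # on-/off-set pairs
        },
        {
            "phrases": ["a dog barks"],
            "segments": [[2.5, 3.0]],
        },
    ],
}
```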
Our annotation results in a new audio-text grounding dataset: AudioGrounding. It contains corresponding sound event phrases for the captions from Audiocaps. The merged sound events are quite diversified, with the most frequent sound event ("a man speaks") accounting for no more than 2% of the dataset. The duration distribution of the annotated sound event segments is shown in Figure 3. Most segments last for less than 2 s, and the corresponding event phrases consist of several such short segments in a single audio clip, like speech, dog barking, and cat meowing. However, a considerable proportion of events (e.g., wind, train) is present in the whole clip, lasting for almost 10 s. We split the dataset according to the Audiocaps setting, assigning each sample to the same subset (train/val/test) as in Audiocaps. Detailed statistics are provided in Table 1.

Fig. 2. The proposed baseline model structure for TAG (example caption: "people in a small crowd are speaking and a dog barks"). A CRNN encoder outputs a sequence of audio embeddings {e_{A,t}}_{t=1}^{T} from the LMS input F ∈ R^{T×D}. The phrase query (containing N words) is encoded into e_P by taking the mean of all word embeddings {e_{P,n}}_{n=1}^{N} in the query. Prediction of the sound events' on- and off-sets is based on the similarity between {e_{A,t}}_{t=1}^{T} and e_P.

Fig. 3. Duration distribution of the annotated sound events mentioned in phrases within the proposed AudioGrounding dataset.
Table 1. Statistics of the AudioGrounding dataset.
3. TEXT-TO-AUDIO GROUNDING
Since the primary motivation regards grounding sound events from phrases in audio captions, we use two separate encoders for the audio and the phrase query, respectively. The input audio feature F is encoded into an embedding sequence {e_{A,t}}_{t=1}^{T}, while the query encoder outputs a phrase embedding e_P from the phrase query P, which consists of N words. Our baseline model architecture is illustrated in Figure 2. We apply the exponent of the negative L2 distance as the similarity metric and the binary cross-entropy (BCE) loss as the training criterion, following previous work in cross-modal audio/text retrieval [16]. The similarity score between the audio embedding e_{A,t} and the phrase embedding e_P is calculated as:

s_t = \mathrm{sim}(e_{A,t}, e_P) = \exp(-\|e_{A,t} - e_P\|)   (1)

During training, the BCE loss of an audio-phrase pair is calculated as the mean of the frame-wise BCE between s_t at each frame t and the label y_t:

\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{T} \sum_{t=1}^{T} \left[ y_t \cdot \log(s_t) + (1 - y_t) \cdot \log(1 - s_t) \right]   (2)

where y_t ∈ {0, 1} is a strongly labeled indicator for each frame t. During evaluation, {s_t}_{t=1}^{T} is transformed into {ŷ_t}_{t=1}^{T}, ŷ_t ∈ {0, 1}, by a threshold φ, representing the presence (ŷ_t = 1, s_t > φ) or absence (ŷ_t = 0, s_t ≤ φ) of a phrase.
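A minimal sketch of this evaluation-time decoding is given below, assuming a 20 ms frame hop (the feature shift used in Section 4); merging contiguous positive frames into (onset, offset) pairs is our straightforward reading of the procedure, not code from the paper.

```python
import numpy as np

def decode_segments(scores: np.ndarray, threshold: float, hop: float = 0.02):
    """Binarize frame scores s_t and merge contiguous positive frames
    into (onset, offset) pairs in seconds."""
    active = scores > threshold                # y_hat_t = 1 where s_t > phi
    segments, onset = [], None
    for t, is_active in enumerate(active):
        if is_active and onset is None:
            onset = t * hop                    # event starts
        elif not is_active and onset is not None:
            segments.append((onset, t * hop))  # event ends
            onset = None
    if onset is not None:                      # event lasts until clip end
        segments.append((onset, len(active) * hop))
    return segments

# e.g. decode_segments(np.array([0.1, 0.8, 0.9, 0.2, 0.7]), threshold=0.5)
# -> [(0.02, 0.06), (0.08, 0.1)]
```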
Audio Encoder. We adopt a convolutional recurrent neural network (CRNN) [17] as the audio encoder. The detailed CRNN architecture can be found in [18]. It consists of five convolution blocks (with padded 3 × 3 convolutions) followed by a BiGRU, and outputs the embedding sequence {e_{A,t}}_{t=1}^{T} with e_{A,t} ∈ R^{256}.
Phrase Encoder. For the phrase encoder, we only focus on extracting a representation for the phrase and leave out all other words in the caption. The word embedding size is also set to 256 to match e_{A,t}. The mean of the word embeddings within a phrase is used as the representation:

e_P = \frac{1}{N} \sum_{n=1}^{N} e_{P,n}   (3)
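The computation of Eqs. (1)-(3) can be sketched in a few lines of PyTorch; this assumes pre-computed encoder outputs and is a sketch for clarity, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def tag_loss(audio_emb: torch.Tensor,  # {e_A,t}: (T, 256) CRNN outputs
             word_emb: torch.Tensor,   # {e_P,n}: (N, 256) word embeddings
             labels: torch.Tensor):    # y_t in {0, 1}: (T,) strong labels
    """Eqs. (1)-(3): mean phrase embedding, frame similarity, BCE loss."""
    phrase_emb = word_emb.mean(dim=0)                   # Eq. (3): e_P
    dist = torch.norm(audio_emb - phrase_emb, dim=-1)   # ||e_A,t - e_P||
    sim = torch.exp(-dist)                              # Eq. (1): s_t in (0, 1]
    return F.binary_cross_entropy(sim, labels.float())  # Eq. (2): mean over T

# toy example: 501 frames, a 3-word phrase query
loss = tag_loss(torch.randn(501, 256), torch.randn(3, 256),
                torch.randint(0, 2, (501,)))
```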
4. EXPERIMENTS

4.1. Experimental setup

The standard log-mel spectrogram (LMS) is used as the audio feature since it is commonly utilized in SED. We extract a 64-dimensional LMS feature with a 40 ms window size and a 20 ms window shift for each audio, resulting in F ∈ R^{T×64}. The model is trained for at most 100 epochs using the Adam optimization algorithm with an initial learning rate of 0.001. The learning rate is reduced if the loss on the validation set does not improve for five epochs. An early stopping strategy with a patience of ten epochs is adopted in the training process.
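One plausible implementation of this feature extraction uses librosa, sketched below; the sampling rate is an assumption, as it is not stated in the paper.

```python
import librosa
import numpy as np

def extract_lms(wav_path: str, sr: int = 32000) -> np.ndarray:
    """64-dim log-mel spectrogram with a 40 ms window and 20 ms shift."""
    y, _ = librosa.load(wav_path, sr=sr)   # sr is an assumed sampling rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=64,
        n_fft=int(0.04 * sr),              # 40 ms window
        hop_length=int(0.02 * sr),         # 20 ms shift
    )
    return librosa.power_to_db(mel).T      # F in R^{T x 64}
```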
Since TAG shares a similar target with SED, commonly used SED metrics are adopted for TAG evaluation. Specifically, we incorporate two metrics: event-based metrics [19] and the newly proposed polyphonic sound detection score (PSDS) [20].

• Event-based metrics (precision, recall, F1) attach importance to the smoothness of the predicted segments, penalizing disjoint predictions. For the event-F1 scores, we set the t-collar value to 100 ms (due to the large number of short events, see Figure 3) and allow a tolerance of 20% discrepancy between the reference and prediction durations.

• PSDS is more robust to labelling subjectivity (e.g., whether to create one or two ground truths for two very close dog barks) and does not depend on operating points (e.g., thresholds). The default PSDS parameters are used [20]: ρ_DTC = ρ_GTC = 0.5, ρ_CTTC = 0.3, α_CT = α_ST = 0.0, e_max = 100.

Models achieving high scores in both event-based metrics and PSDS are expected to predict smooth segments while being robust to different operating points.
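As a usage sketch, the event-based scores can be computed with the sed_eval toolkit [19]; the event lists below are made-up examples, and the keyword settings mirror the t-collar and duration tolerance above.

```python
# Usage sketch of sed_eval's event-based metrics; the event lists are
# made-up examples, not dataset annotations.
import sed_eval

reference = [
    {"filename": "clip.wav", "event_label": "a man speaks",
     "onset": 0.0, "offset": 1.8},
]
estimated = [
    {"filename": "clip.wav", "event_label": "a man speaks",
     "onset": 0.1, "offset": 1.7},
]

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=["a man speaks"],
    t_collar=0.1,              # 100 ms onset/offset collar
    percentage_of_length=0.2,  # 20% duration tolerance
)
metrics.evaluate(reference_event_list=reference,
                 estimated_event_list=estimated)
print(metrics.results_overall_metrics()["f_measure"])
```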
Table 2. Baseline TAG performance on the AudioGrounding dataset. P, R, and F1 represent the event-based precision, recall, and F1-score.

Model    | F1    | P     | R     | PSDS
Random   | 0.04  | 0.02  | 1.56  | 0.00
Baseline | 28.30 | 28.60 | 27.90 | 14.70
We present the baseline TAG performance in Table 2. The random guessing approach assigns a random probability between 0 and 1 to each frame, resulting in a 0.04% event-F1 and a 0.00% PSDS, indicating the difficulty of this task. In contrast, our proposed baseline model achieves a 28.3% event-F1 and a 14.7% PSDS, verifying its capability in audio and text understanding. Despite the significant improvement over the random approach, we find that the baseline model tends to output high probabilities for salient parts of an audio clip, regardless of the phrase input. An example is shown in Figure 4. The output probabilities of both phrase inputs appear to be similar in their temporal distribution. For the phrase query "young female speaking", the model assigns a high presence probability to segments where either cats or female speech appear (e.g., the last two seconds). This means the model only learns prominent audio patterns but neglects the information from the phrase query. To verify this, we change the phrase query of each audio to a random phrase selected from all phrase queries of that audio. After this modification, the event-F1 is still 19.6%, indicating the insensitivity of our model to the phrase input. Further research should be conducted on text understanding and the fusion of these two modalities.

Fig. 4. An example result of a TAG prediction on the AudioGrounding dataset (caption: "A cat meowing and young female speaking"). The vertical axis of the bottom figure denotes the output probability of a sound event according to the phrase query.
5. CONCLUSION
In this paper, we propose the Text-to-Audio Grounding task to further facilitate cross-modal learning between audio and natural language. This paper contributes the AudioGrounding dataset, which establishes the correspondence between sound event phrases and the captions provided in Audiocaps [14] and provides the timestamps of each present sound event. A baseline approach that combines natural language and audio processing yields an event-F1 of 28.3% and a PSDS of 14.7%. In future work, we would like to explore better projections of audio and phrase embeddings as well as deeper interaction between these two modalities.
6. ACKNOWLEDGEMENT
This work has been supported by National Natural Science Foundation of China (No. 61901265), Shanghai Pujiang Program (No. 19PJ1406300), and Startup Fund for Youngman Research at SJTU (No. 19X100040009). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.

7. REFERENCES

[1] K. Drossos, S. Adavanne, and T. Virtanen, "Automated audio captioning with recurrent neural networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 374-378.
[2] M. Wu, H. Dinkel, and K. Yu, "Audio caption: Listen and tell," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 830-834.
[3] Y. Koizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, "The NTT DCASE2020 challenge task 6 system: Automated audio captioning with keywords and sentence length estimation," DCASE2020 Challenge, Tech. Rep., June 2020.
[4] Y. Wu, K. Chen, Z. Wang, X. Zhang, F. Nian, S. Li, and X. Shao, "Audio captioning based on transformer and pre-training for 2020 DCASE audio captioning challenge," DCASE2020 Challenge, Tech. Rep., June 2020.
[5] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2641-2649.
[6] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, "Grounding of textual phrases in images by reconstruction," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 817-834.
[7] R. A. Yeh, M. N. Do, and A. G. Schwing, "Unsupervised textual grounding: Linking words to image concepts," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6125-6134.
[8] H. Akbari, S. Karaman, S. Bhargava, B. Chen, C. Vondrick, and S.-F. Chang, "Multi-level multimodal common semantic space for image-phrase grounding," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12476-12486.
[9] L. Zhou, N. Louis, and J. J. Corso, "Weakly-supervised video object grounding from text by loss weighting and object interaction," in British Machine Vision Conference (BMVC), 2018, pp. 1-12.
[10] L. Chen, M. Zhai, J. He, and G. Mori, "Object grounding via iterative context reasoning," in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 1407-1415.
[11] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, "Grounded video description," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6578-6587.
[12] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), November 2018, pp. 19-23.
[13] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776-780.
[14] C. D. Kim, B. Kim, H. Lee, and G. Kim, "AudioCaps: Generating captions for audios in the wild," in Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 119-132.
[15] E. Loper and S. Bird, "NLTK: The natural language toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002, pp. 63-70.
[16] B. Elizalde, S. Zarar, and B. Raj, "Cross modal audio search and retrieval with joint embeddings based on text and audio," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 4095-4099.
[17] H. Dinkel, M. Wu, and K. Yu, "Towards duration robust weakly supervised sound event detection," IEEE Trans. Audio, Speech, Language Process., 2021.
[18] X. Xu, H. Dinkel, M. Wu, and K. Yu, "A CRNN-GRU based reinforcement learning approach to audio captioning," in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Tokyo, Japan, November 2020, pp. 225-229.
[19] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[20] Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulović, "A framework for the robust evaluation of sound event detection," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.