Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering,
AI Institute, Shanghai Jiao Tong University, Shanghai, China
{wsntxxn, richman, mengyuewu, kai.yu}@sjtu.edu.cn

Mengyue Wu and Kai Yu are the corresponding authors.
https://github.com/wsntxxn/TextToAudioGrounding

ABSTRACT
Automated Audio Captioning is a cross-modal task that generates natural language descriptions to summarize the sound events in audio clips. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in Audiocaps, along with the location (timestamps) of each present sound event. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which interactively considers the relationship between audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.

Index Terms — text-to-audio grounding, sound event detection, dataset, deep learning
1. INTRODUCTION
Using natural language to summarize audio content, commonly referred to as Automated Audio Captioning (AAC), has attracted much attention in recent studies [1, 2, 3, 4]. Compared with other audio processing tasks like Acoustic Scene Classification (ASC) and Sound Event Detection (SED), which aim to categorize audio into specific scene or event labels, AAC allows the model to describe audio content in natural language, a much less restricted text form. AAC can thus be seen as a less structured summarization of sound events. However, the correspondence between sound event detection and natural language description is rarely investigated. To achieve human-like audio perception, a model should be able to generate an audio caption and to understand natural language grounded in acoustic content, i.e., to ground (detect) each sound event mentioned in a given audio caption in the corresponding segments of that audio. Explicit grounding of sound event phrases in the corresponding audio is key to audio-oriented language understanding. Moreover, it would be beneficial for generating captions with more accurate event illustrations and for localized AAC evaluation methods.

Although such an audio grounding task (text-to-audio grounding, TAG) is relatively novel in audio understanding and audio-text cross-modal research, it is related to the following problems.

Visual Grounding
A similar task to TAG is object grounding in Computer Vision (CV) using images or videos. Flickr30k Entities [5] is the first public dataset for image grounding, and image object grounding has become a research hotspot since its release [6, 7, 8]. Recently, a plethora of work has focused on new datasets and approaches for video object grounding [9, 10, 11]. Like audio-text grounding, visual grounding requires a model to predict bounding boxes (2D coordinates) in an image or video frame for each object described in the caption.
Sound Event Detection (SED)
SED aims to classify and localize particular sound events in an audio clip. With the growing influence of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge [12], research interest in SED has soared recently. TAG can be viewed as text-query-based SED, focusing on localizing the sound events described by queries. Due to the intrinsic correlation between SED and TAG, we borrow common approaches and evaluation metrics from SED as a benchmark for TAG.
Fig. 1. An example from TextToAudioGrounding. For an audio clip (log-mel spectrogram shown; caption: "A man is speaking while birds are chirping in the background"), on- and off-set timestamps for each sound event phrase are provided. In this example, both "a man speaking" (red) and "birds chirping" (blue) point to multiple segments (represented by rectangles in the figure).

An audio grounding task inevitably consists of two parts. First is the extraction of sound event phrases from the natural language caption; e.g., "people speak" and "dogs bark" can be obtained from the caption "people speak while dogs bark". The second stage is concerned with traditional SED: detecting a sound event's presence along with its onset and offset timestamps in the given audio clip. The prerequisite is a dataset that simultaneously provides audio, captions, and the segmentation of the sound events grounded from the caption. To the best of our knowledge, no existing datasets or tasks focus on text-to-audio grounding.

We contribute the AudioGrounding dataset (Section 2) in this paper, providing a corresponding series of audio - caption - sound event phrase - sound event timestamp segmentation to enable more interactive cross-modal research between audio processing and natural language understanding. An illustration from AudioGrounding is shown in Figure 1. With this dataset, we consider TAG, which localizes the corresponding sound events in an audio clip from a given language description. A baseline approach for the new TAG task is also proposed (Section 3). Section 4 details the experimental results and analyses of the TAG task, with conclusions provided in Section 5.
2. THE AUDIO GROUNDING DATASET
Our AudioGrounding dataset entails audio clips with one caption per audio in the training set and five captions per audio in the validation and test sets. We provide caption-oriented sound event tagging for each audio, along with each sound event's segmentation timestamps. The audio sources are rooted in AudioSet [13] and the captions are sourced from Audiocaps [14].
AudioSet is a large-scale manually-annotated sound event dataset. Each audio clip has a duration of up to ten seconds and contains at least one sound event label. AudioSet is built on an ontology of 527 event classes, encompassing most everyday sounds.
Audiocaps [14] is by far the largest AAC dataset, consisting of audio clips (≈ 127 hours) collected from AudioSet. One human-annotated caption is provided for the training set, while five captions are provided for the validation and test sets, respectively. Since the entire Audiocaps dataset is a subset of AudioSet, sound event labels can be obtained for each audio clip in Audiocaps.

It should be noted that though AudioSet provides sound tags and Audiocaps consists of descriptive captions, there is no direct link between these two annotations. As we would like to enhance the diversity of the sound events included, we selectively choose audio clips with more than four sound tags, resulting in a subset of audio clips sourced from Audiocaps. For successful text-to-audio grounding, each audio clip should have not only a caption description ("A man is speaking while birds are chirping in the background"), but also the corresponding sound event phrases retrieved from the caption ("a man is speaking", "birds are chirping") and the on- and off-sets of these sound events.
Our annotation process is decoupled into two stages: (1) sound event phrases are extracted automatically from captions; (2) we invite annotators to merge extracted phrases that correspond to the same sound event and to provide the duration segmentation of each sound event.
A. Extracting Sound Event Phrases from Captions
As mentioned above, the sound event labels provided in AudioSet have no correspondence with the descriptive captions in Audiocaps. Therefore, we first extract sound event phrases from captions using NLTK [15]. A phrase refers to a contiguous chunk of words in a caption. Following standard chunking methods, we extract noun phrases (NP) and combinations of NP and verb phrases (NP + VP). As sound descriptions usually stem from objects that sound (e.g., a cat) and verbs that create the sound (e.g., meow), NP and NP + VP phrases can roughly summarize all possible sound events.
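To make the chunking concrete, a minimal sketch with NLTK is shown below. The chunk grammar is our own illustration (the paper does not specify the exact rule set beyond NP and NP + VP), and it assumes NLTK's tokenizer and POS tagger models are available.

```python
# Illustrative NP / NP + VP chunking with NLTK; the grammar is a sketch,
# not necessarily the exact rule set used to build the dataset.
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # e.g. "a man", "birds", "a small crowd"
  VP: {<VB.*>+}             # e.g. "is speaking", "are chirping"
  EVENT: {<NP><VP>}         # NP + VP, e.g. "a man is speaking"
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_event_phrases(caption: str) -> list:
    """Return NP and NP + VP chunks as candidate sound event phrases."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    tree = chunker.parse(tagged)
    # Nested NPs inside EVENT chunks are also returned; such repetitive
    # candidates are merged or discarded by annotators in stage (2).
    return [
        " ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() in ("EVENT", "NP"))
    ]

print(extract_event_phrases(
    "a man is speaking while birds are chirping in the background"))
# e.g. ['a man is speaking', 'a man', 'birds are chirping', 'birds',
#       'the background']
```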
B. Phrase Merging and Segmentation
Manual phrase merging is necessary since there might be repetitive and unwanted information in the extracted phrases. For example, the caption in Figure 2 is chunked into three phrases: "people", "a small crowd are speaking", and "a dog barks". However, "people" and "a small crowd are speaking" refer to the same sound event. Based on the extracted phrases, annotators are required to label an audio clip in a two-step process:
1. Merge phrases describing the same sound event into a single set and identify the number of sound events mentioned in the audio;
2. Segment each sound event with the on- and off-set timestamps.
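The outcome of this two-step annotation can be pictured as a record like the following sketch. The layout and field names are hypothetical (the released dataset's exact schema may differ) and only illustrate the audio - caption - phrase - timestamp correspondence:

```python
# Hypothetical annotation record; field names are illustrative only and
# may differ from the released dataset's schema. Times are in seconds.
sample = {
    "audio_id": "Yexample01.wav",  # made-up id of the source AudioSet clip
    "caption": "people in a small crowd are speaking and a dog barks",
    "events": [
        {   # phrases merged by annotators: they describe one sound event
            "phrases": ["people", "a small crowd are speaking"],
            "segments": [[0.0, 2.1], [4.3, 7.8]],  # on-/off-set pairs
        },
        {
            "phrases": ["a dog barks"],
            "segments": [[2.5, 3.0]],
        },
    ],
}
```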
Our annotation results in a new audio-text grounding dataset: AudioGrounding. It contains corresponding sound event phrases for the captions from Audiocaps. The merged sound events are quite diversified, with the most frequent sound event ("a man speaks") accounting for no more than 2% of the dataset. The duration distribution of the annotated sound event segments is shown in Figure 3. Most segments last for less than 2 s, and the corresponding event phrases consist of several such short segments in a single audio clip, like speech, dog barking, and cat meowing. However, a considerable proportion of events (e.g., wind, train) is present in the whole clip, lasting for almost 10 s. We split the dataset according to the Audiocaps setting, assigning each sample to the same subset (train/val/test) as in Audiocaps. Detailed statistics are provided in Table 1.

Fig. 2. The proposed baseline model structure for TAG (example caption: "people in a small crowd are speaking and a dog barks"). A CRNN encoder outputs a sequence of audio embeddings {e_{A,t}}_{t=1}^{T} from the LMS input F ∈ R^{T×D}. The phrase query (containing N words) is encoded into e_P by taking the mean of all word embeddings {e_{P,n}}_{n=1}^{N} in the query. Prediction of the sound events' on- and off-sets is based on the similarity between {e_{A,t}}_{t=1}^{T} and e_P.

Fig. 3. Duration distribution of the annotated sound events mentioned in phrases within the proposed AudioGrounding dataset.
Table 1. Statistics of the AudioGrounding dataset.
3. TEXT-TO-AUDIO GROUNDING
Since the primary motivation regards grounding sound events from phrases in audio captions, we use two separate encoders for the audio and the phrase query, respectively. The input audio feature F is encoded into an embedding sequence {e_{A,t}}_{t=1}^{T}, while the query encoder outputs a phrase embedding e_P from the phrase query P, which consists of N words. Our baseline model architecture is illustrated in Figure 2. We apply the exponent of the negative L2 distance as the similarity metric and the binary cross-entropy (BCE) loss as the training criterion, following previous work in cross-modal audio/text retrieval [16]. The similarity score between the audio embedding e_{A,t} and the phrase embedding e_P is calculated as:

s_t = \mathrm{sim}(e_{A,t}, e_P) = \exp(-\|e_{A,t} - e_P\|)   (1)

During training, the BCE loss of an audio-phrase pair is calculated as the mean of the frame-wise BCE between s_t at each frame t and the label y_t:

\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{T} \sum_{t=1}^{T} \left[ y_t \cdot \log(s_t) + (1 - y_t) \cdot \log(1 - s_t) \right]   (2)

where y_t ∈ {0, 1} is a strongly labeled indicator for each frame t. During evaluation, {s_t}_{t=1}^{T} is transformed into {ŷ_t}_{t=1}^{T}, ŷ_t ∈ {0, 1}, by a threshold φ, representing the presence (ŷ_t = 1, s_t > φ) or absence (ŷ_t = 0, s_t ≤ φ) of a phrase.
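A minimal sketch of this evaluation-time decoding is given below, assuming a 20 ms frame hop (the feature shift used in Section 4); merging contiguous positive frames into (onset, offset) pairs is our straightforward reading of the procedure, not code from the paper.

```python
import numpy as np

def decode_segments(scores: np.ndarray, threshold: float, hop: float = 0.02):
    """Binarize frame scores s_t and merge contiguous positive frames
    into (onset, offset) pairs in seconds."""
    active = scores > threshold                # y_hat_t = 1 where s_t > phi
    segments, onset = [], None
    for t, is_active in enumerate(active):
        if is_active and onset is None:
            onset = t * hop                    # event starts
        elif not is_active and onset is not None:
            segments.append((onset, t * hop))  # event ends
            onset = None
    if onset is not None:                      # event lasts until clip end
        segments.append((onset, len(active) * hop))
    return segments

# e.g. decode_segments(np.array([0.1, 0.8, 0.9, 0.2, 0.7]), threshold=0.5)
# -> [(0.02, 0.06), (0.08, 0.1)]
```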
Audio Encoder. We adopt a convolutional recurrent neural network (CRNN) [17] as the audio encoder. The detailed CRNN architecture can be found in [18]. It consists of five convolution blocks (with padded 3 × 3 convolutions) followed by a BiGRU, and outputs the embedding sequence {e_{A,t}}_{t=1}^{T} with e_{A,t} ∈ R^{256}.
Phrase Encoder. For the phrase encoder, we only focus on extracting a representation for the phrase and leave out all other words in the caption. The word embedding size is also set to 256 to match e_{A,t}. The mean of the word embeddings within a phrase is used as the representation:

e_P = \frac{1}{N} \sum_{n=1}^{N} e_{P,n}   (3)
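The computation of Eqs. (1)-(3) can be sketched in a few lines of PyTorch; this assumes pre-computed encoder outputs and is a sketch for clarity, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def tag_loss(audio_emb: torch.Tensor,  # {e_A,t}: (T, 256) CRNN outputs
             word_emb: torch.Tensor,   # {e_P,n}: (N, 256) word embeddings
             labels: torch.Tensor):    # y_t in {0, 1}: (T,) strong labels
    """Eqs. (1)-(3): mean phrase embedding, frame similarity, BCE loss."""
    phrase_emb = word_emb.mean(dim=0)                   # Eq. (3): e_P
    dist = torch.norm(audio_emb - phrase_emb, dim=-1)   # ||e_A,t - e_P||
    sim = torch.exp(-dist)                              # Eq. (1): s_t in (0, 1]
    return F.binary_cross_entropy(sim, labels.float())  # Eq. (2): mean over T

# toy example: 501 frames, a 3-word phrase query
loss = tag_loss(torch.randn(501, 256), torch.randn(3, 256),
                torch.randint(0, 2, (501,)))
```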
4. EXPERIMENTS

4.1. Experimental setup

The standard log-mel spectrogram (LMS) is used as the audio feature since it is commonly utilized in SED. We extract a 64-dimensional LMS feature with a 40 ms window size and a 20 ms window shift for each audio, resulting in F ∈ R^{T×64}. The model is trained for at most 100 epochs using the Adam optimization algorithm with an initial learning rate of 0.001. The learning rate is reduced if the loss on the validation set does not improve for five epochs. An early stopping strategy with a patience of ten epochs is adopted in the training process.
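One plausible implementation of this feature extraction uses librosa, sketched below; the sampling rate is an assumption, as it is not stated in the paper.

```python
import librosa
import numpy as np

def extract_lms(wav_path: str, sr: int = 32000) -> np.ndarray:
    """64-dim log-mel spectrogram with a 40 ms window and 20 ms shift."""
    y, _ = librosa.load(wav_path, sr=sr)   # sr is an assumed sampling rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=64,
        n_fft=int(0.04 * sr),              # 40 ms window
        hop_length=int(0.02 * sr),         # 20 ms shift
    )
    return librosa.power_to_db(mel).T      # F in R^{T x 64}
```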
Since TAG shares a similar target with SED, commonly used SED metrics are adopted for TAG evaluation. Specifically, we incorporate two metrics: event-based metrics [19] and the newly proposed polyphonic sound detection score (PSDS) [20].

• Event-based metrics (precision, recall, F1) attach importance to the smoothness of the predicted segments, penalizing disjoint predictions. For the event-F1 scores, we set the t-collar value to 100 ms (due to the large number of short events, see Figure 3) and allow a tolerance of 20% discrepancy between the reference and prediction durations.

• PSDS is more robust to labelling subjectivity (e.g., whether to create one or two ground truths for two very close dog barks) and does not depend on operating points (e.g., thresholds). The default PSDS parameters are used [20]: ρ_DTC = ρ_GTC = 0.5, ρ_CTTC = 0.3, α_CT = α_ST = 0.0, e_max = 100.

Models achieving high scores in both event-based metrics and PSDS are expected to predict smooth segments while being robust to different operating points.
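As a usage sketch, the event-based scores can be computed with the sed_eval toolkit [19]; the event lists below are made-up examples, and the keyword settings mirror the t-collar and duration tolerance above.

```python
# Usage sketch of sed_eval's event-based metrics; the event lists are
# made-up examples, not dataset annotations.
import sed_eval

reference = [
    {"filename": "clip.wav", "event_label": "a man speaks",
     "onset": 0.0, "offset": 1.8},
]
estimated = [
    {"filename": "clip.wav", "event_label": "a man speaks",
     "onset": 0.1, "offset": 1.7},
]

metrics = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=["a man speaks"],
    t_collar=0.1,              # 100 ms onset/offset collar
    percentage_of_length=0.2,  # 20% duration tolerance
)
metrics.evaluate(reference_event_list=reference,
                 estimated_event_list=estimated)
print(metrics.results_overall_metrics()["f_measure"])
```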
Table 2. Baseline TAG performance on the AudioGrounding dataset. P, R, and F1 represent the event-based precision, recall, and F1-score.

Model    | F1    | P     | R     | PSDS
Random   | 0.04  | 0.02  | 1.56  | 0.00
Baseline | 28.30 | 28.60 | 27.90 | 14.70
We present the baseline TAG performance in Table 2. The random guessing approach assigns a random probability between 0 and 1 to each frame, resulting in a 0.04% event-F1 and a 0.00% PSDS, indicating the difficulty of this task. In contrast, our proposed baseline model achieves a 28.3% event-F1 and a 14.7% PSDS, verifying its capability in audio and text understanding. Despite the significant improvement over the random approach, we find that the baseline model tends to output high probabilities for salient parts of an audio clip, regardless of the phrase input. An example is shown in Figure 4. The output probabilities of both phrase inputs appear to be similar in their temporal distribution. For the phrase query "young female speaking", the model assigns a high presence probability to segments where either cats or female speech appear (e.g., the last two seconds). This means the model only learns prominent audio patterns but neglects the information from the phrase query. To verify this, we change the phrase query of each audio to a random phrase selected from all phrase queries of that audio. After this modification, the event-F1 is still 19.6%, indicating the insensitivity of our model to the phrase input. Further research should be conducted on text understanding and the fusion of these two modalities.

Fig. 4. An example result of a TAG prediction on the AudioGrounding dataset (caption: "A cat meowing and young female speaking"). The vertical axis of the bottom figure denotes the output probability of a sound event according to the phrase query.
5. CONCLUSION
In this paper, we propose the Text-to-Audio Grounding task to further facilitate cross-modal learning between audio and natural language. This paper contributes the AudioGrounding dataset, which establishes the correspondence between sound event phrases and the captions provided in Audiocaps [14] and provides the timestamps of each present sound event. A baseline approach that combines natural language and audio processing yields an event-F1 of 28.3% and a PSDS of 14.7%. In future work, we would like to explore better projections of audio and phrase embeddings as well as deeper interaction between these two modalities.
6. ACKNOWLEDGEMENT
This work has been supported by National Natural Science Foundation of China (No. 61901265), Shanghai Pujiang Program (No. 19PJ1406300), and Startup Fund for Youngman Research at SJTU (No. 19X100040009). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.

7. REFERENCES

[1] K. Drossos, S. Adavanne, and T. Virtanen, "Automated audio captioning with recurrent neural networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 374-378.
[2] M. Wu, H. Dinkel, and K. Yu, "Audio caption: Listen and tell," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 830-834.
[3] Y. Koizumi, D. Takeuchi, Y. Ohishi, N. Harada, and K. Kashino, "The NTT DCASE2020 challenge task 6 system: Automated audio captioning with keywords and sentence length estimation," DCASE2020 Challenge, Tech. Rep., June 2020.
[4] Y. Wu, K. Chen, Z. Wang, X. Zhang, F. Nian, S. Li, and X. Shao, "Audio captioning based on transformer and pre-training for 2020 DCASE audio captioning challenge," DCASE2020 Challenge, Tech. Rep., June 2020.
[5] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, "Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2641-2649.
[6] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, "Grounding of textual phrases in images by reconstruction," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 817-834.
[7] R. A. Yeh, M. N. Do, and A. G. Schwing, "Unsupervised textual grounding: Linking words to image concepts," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6125-6134.
[8] H. Akbari, S. Karaman, S. Bhargava, B. Chen, C. Vondrick, and S.-F. Chang, "Multi-level multimodal common semantic space for image-phrase grounding," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12476-12486.
[9] L. Zhou, N. Louis, and J. J. Corso, "Weakly-supervised video object grounding from text by loss weighting and object interaction," in British Machine Vision Conference (BMVC), 2018, pp. 1-12.
[10] L. Chen, M. Zhai, J. He, and G. Mori, "Object grounding via iterative context reasoning," in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 1407-1415.
[11] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, "Grounded video description," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6578-6587.
[12] R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), November 2018, pp. 19-23.
[13] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776-780.
[14] C. D. Kim, B. Kim, H. Lee, and G. Kim, "AudioCaps: Generating captions for audios in the wild," in Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 119-132.
[15] E. Loper and S. Bird, "NLTK: The natural language toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002, pp. 63-70.
[16] B. Elizalde, S. Zarar, and B. Raj, "Cross modal audio search and retrieval with joint embeddings based on text and audio," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 4095-4099.
[17] H. Dinkel, M. Wu, and K. Yu, "Towards duration robust weakly supervised sound event detection," IEEE Trans. Audio, Speech, Language Process., 2021.
[18] X. Xu, H. Dinkel, M. Wu, and K. Yu, "A CRNN-GRU based reinforcement learning approach to audio captioning," in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Tokyo, Japan, November 2020, pp. 225-229.
[19] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[20] Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulović, "A framework for the robust evaluation of sound event detection," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.