Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu
MoE Key Lab of Artificial Intelligence, SpeechLab, Department of Computer Science and Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China
{wsntxxn, richman, mengyuewu, kai.yu}@sjtu.edu.cn, [email protected]
Mengyue Wu and Kai Yu are the corresponding authors.

ABSTRACT
Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like the acoustic scene. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, which expects the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analyzing the hierarchical audio topics that are commonly covered. We then explore a transfer learning scheme to access local and global information. Two source tasks are identified to represent local and global information, respectively: Audio Tagging (AT) and Acoustic Scene Classification (ASC). Experiments are conducted on the AAC benchmark datasets Clotho and Audiocaps, amounting to a vast increase in all eight metrics with topic transfer learning. Further, it is discovered that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.
Index Terms — Audio captioning, transfer learning, audio processing, audio tagging
1. INTRODUCTION
Automated audio captioning (AAC) is a cross-modal task bridging audio signal processing and natural language processing (NLP) [1, 2]. The introduction of AAC to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 challenge sparked interest from researchers [3, 4, 5, 6]. AAC is particularly interesting yet challenging because audio captions describe multitudinous auditory elements. Compared with visual perception, where objects are defined by their shape, color, size, and spatial position relative to other objects, auditory perception concerns sound events and
their corresponding physical properties, the temporal information of these sound events and their relationships with other events, and high-level, knowledge-rich auditory understanding. For instance, a typical caption from the DCASE benchmark dataset Clotho [7], "people talking in a small and empty room", describes the sound event "people talking" and its global scene "in a room", where high-level auditory knowledge is processed to infer that the room is small and empty, a visual description.

It should be noted that the current mainstream training paradigm for AAC is the end-to-end encoder-decoder framework, where the captions are provided as the only supervision signal for the audio content. In this framework, an audio encoder first extracts an abstract embedding from an input audio clip; the text decoder then predicts the caption according to the audio embedding. Encoding all the above-mentioned multifaceted information from an audio clip without explicit supervision increases the difficulty of AAC encoder training. Therefore, a hierarchical structure of the abstract audio topics commonly described in audio captions is crucial.

Stemming from auditory perception and combined with the captions provided in the currently available AAC datasets, we propose the following audio topic model for AAC (an illustrative encoding is sketched at the end of this section):

1. Local audio topics: (a) Sound events, which can be described by the sounding object entity ("a male"), the verbs that make the sound ("talk"), and the physical properties of the sound ("loud").

2. Global audio topics: (a) Acoustic scenes, such as an exact scene location description ("downtown") or an abstract description ("in the distance"). (b) High-level abstraction, including content inference ("at a conference") and affect expression ("annoyingly").

We explore a transfer learning method to address local and global information based on such a topic model. Two source tasks are identified to represent local and global information: Audio Tagging (AT) and Acoustic Scene Classification (ASC). ASC is an environmental sound recognition task, attempting to classify global audio representations into predefined scene categories, e.g., park or shopping mall. On the other hand, AT aims at identifying the specific sound events present in an audio recording. We propose pretraining the audio encoder on ASC and AT tasks and then transferring the parameters to the AAC encoder, as described in Section 2. Since AAC concerns both abstract representations and the sound events' temporal relationships, different pretraining backbone networks are explored. Experiments in Section 3 are conducted on the benchmark AAC dataset Clotho and on Audiocaps, the largest AAC dataset by far. A consistent performance gain is obtained over the eight language similarity metrics on both datasets, as presented in Section 4.
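As a purely illustrative reading of this hierarchy (the labels below are our own annotation of the example caption, not dataset annotations or part of the proposed method), the topic model can be thought of as a nested structure:

```python
# Illustrative only: how the proposed audio-topic hierarchy could decompose the
# example Clotho caption. Field names are our own, not from any dataset.
caption = "people talking in a small and empty room"
topics = {
    "local": {
        "sound_events": [
            {"entity": "people", "verb": "talking", "property": None},
        ],
    },
    "global": {
        "acoustic_scene": {"exact": "room", "abstract": None},
        "high_level_abstraction": {
            "content_inference": "the room is small and empty",
            "affect": None,
        },
    },
}
```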
Fig. 1. Our proposed transfer learning for automated audio captioning. In the first stage, a tagging system is pretrained by ASC or AT. Then the embedding extractor part of the pretrained tagging system is used to initialize the audio encoder. In the second stage, the audio encoder is adapted to the target task (AAC), and the entire captioning system is trained end-to-end.
2. TRANSFER LEARNING FOR AAC
We propose a transfer learning approach for a more effective AAC audio encoder. For the following definitions, assume that $X \in \mathbb{R}^{T \times D}$ is an input feature with $T$ frames and $D$ Mel filters. The supervised pretraining tasks used in this paper are modeled as $\mathcal{F}: X \mapsto y$, where $y$ differs for each respective task. For our research, we experiment with two different source tasks: audio tagging (AT), $\mathcal{F}_{\text{tag}}: X \mapsto y_{\text{tag}}$, $y_{\text{tag}} \in \{0, 1\}^{E}$, and acoustic scene classification (ASC), $\mathcal{F}_{\text{asc}}: X \mapsto y_{\text{asc}}$, $y_{\text{asc}} \in \{1, \dots, E\}$.

The process of our proposed transfer learning for AAC is illustrated in Figure 1. The backbone encoder architecture comprises an embedding extractor, followed by a temporal pooling layer and an output layer. The embedding extractor consists of several convolution blocks that extract mid-level embedding features from the audio input. After pretraining, the parameters of the AT / ASC system are transferred to the AAC audio encoder. We experiment with a CNN and a CRNN pretraining encoder network on both the AT and ASC tasks. We intend to explore whether abstract embeddings (CNN) or temporal information (CRNN) have a more significant impact on AAC performance.
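To make the two-stage procedure concrete, a minimal sketch is given below (PyTorch is assumed; `TaggingModel`, `transfer_encoder` and all variable names are illustrative, not the authors' released code). It shows a pretraining model built from an embedding extractor, temporal pooling and an output layer, and how the extractor parameters could initialize the captioning encoder:

```python
# Minimal sketch of the two-stage transfer scheme; interfaces are assumptions.
import torch
import torch.nn as nn

class TaggingModel(nn.Module):
    """Pretraining model: embedding extractor + temporal pooling + output layer."""
    def __init__(self, embedding_extractor, embed_dim, num_classes):
        super().__init__()
        self.embedding_extractor = embedding_extractor  # e.g. a CNN10 / CRNN5 trunk
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, lms):                      # lms: (batch, T, D) log Mel spectrogram
        emb = self.embedding_extractor(lms)      # (batch, T*, embed_dim)
        clip_emb = emb.mean(dim=1)               # temporal (mean) pooling
        return self.output_layer(clip_emb)       # clip-level logits

def transfer_encoder(pretrained: TaggingModel, caption_encoder: nn.Module):
    """Initialize the AAC audio encoder from the pretrained embedding extractor.

    Assumes the captioning encoder shares parameter names with the extractor;
    strict=False skips any layer that exists in only one of the two models.
    """
    caption_encoder.load_state_dict(
        pretrained.embedding_extractor.state_dict(), strict=False)
    return caption_encoder
```

The same extractor trunk serves both source tasks; only the output dimensionality and loss differ (multi-label sigmoid / binary cross-entropy for AT, single-label softmax / cross-entropy for ASC).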
AAC Model Architecture
Regarding the audio captioning framework, we adopt a temporal attentional encoder-decoder architecture for AAC. An overview of our system is presented in Figure 2. It consists of an audio encoder, which extracts a feature embedding sequence from a Log Mel Spectrogram (LMS) input, and a text decoder, which outputs the caption. The whole system is trained with the standard cross-entropy loss between the predicted caption and the ground truth annotation.
Audio encoder
As mentioned previously, two model architectures are adopted as the audio encoder for comparison purposes: a 10-layer CNN (CNN10) and a 5-layer CRNN (CRNN5). CNN10 shows superior performance for time-invariant audio tagging [8], while CRNN5 is reported to be effective in duration-robust sound event detection [9]. For either architecture, the audio encoder reads an LMS input with $T$ frames and outputs a sequence of embeddings $\{e_t\}_{t=1}^{T^*}$ ($T^*$ may not equal $T$ due to temporal subsampling). The detailed structures of CNN10 and CRNN5 can be found in [8] and [9].
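As a rough illustration (not the exact CNN10 / CRNN5 configurations of [8, 9], whose block counts, channel widths and pooling factors differ), the sketch below shows how a convolutional trunk subsamples the $T$ input frames into $T^*$ embedding steps, and how a CRNN-style variant simply adds a recurrent layer on top of the same trunk:

```python
# Illustrative encoder sketch; layer sizes are assumptions, not the paper's exact models.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2))        # halves both time and frequency

    def forward(self, x):
        return self.net(x)

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=64, embed_dim=256, use_rnn=False):
        super().__init__()
        self.blocks = nn.Sequential(
            ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, embed_dim))
        feat_dim = embed_dim * (n_mels // 8)    # channels * remaining Mel bins
        # CRNN-style variant: recurrent layer for temporal modelling
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True) if use_rnn else None
        self.proj = None if use_rnn else nn.Linear(feat_dim, embed_dim)

    def forward(self, lms):                     # lms: (batch, T, n_mels)
        x = self.blocks(lms.unsqueeze(1))       # (batch, C, T* = T // 8, n_mels // 8)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, T*, C * n_mels // 8)
        if self.rnn is not None:
            x, _ = self.rnn(x)                  # CRNN: temporal relationships
            return x
        return self.proj(x)                     # CNN: frame-wise abstract embeddings
```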
Text decoder
We adopt a standard shallow single-layer unidirectional GRU as the text decoder. At decoding timestep $n$, the hidden state $h_n$ is updated depending on the previous hidden state $h_{n-1}$, the current input word $w_n$ and a context vector $c_n$:

$$h_n = \mathrm{GRU}\big([c_n; \mathrm{WE}(w_n)], h_{n-1}\big)$$

where WE denotes the word embedding layer. A standard temporal attention mechanism [10] is utilized to obtain $c_n$. The attention weights $\{\alpha_{n,t}\}_{t=1}^{T^*}$ are calculated by aligning $h_{n-1}$ with the timesteps of the audio embeddings:

$$\alpha_{n,t} = \frac{\exp\big(\mathrm{score}(h_{n-1}, e_t)\big)}{\sum_{t'=1}^{T^*} \exp\big(\mathrm{score}(h_{n-1}, e_{t'})\big)} \quad (1)$$

$$c_n = \sum_{t=1}^{T^*} \alpha_{n,t} e_t \quad (2)$$

We use the concat scoring function [11] for the alignment. After a fully connected output layer and a softmax function, the output word probabilities are obtained. The word with the highest probability is selected as the output. This process is repeated until an end-of-sentence (EOS) token is reached.
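A minimal sketch of one decoding step, assuming PyTorch and that the encoder embedding dimension equals the decoder hidden size (module and variable names are illustrative, not from the authors' code), might look as follows:

```python
# Sketch of Eq. (1)-(2) and the GRU update; sizes and names are assumptions.
import torch
import torch.nn as nn

class AttnGRUDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)           # WE
        self.gru = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)
        self.score = nn.Sequential(                                   # concat scoring [11]
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, w_n, h_prev, enc):           # enc: (batch, T*, hidden_dim)
        # alpha_{n,t} = softmax_t( score(h_{n-1}, e_t) )              -- Eq. (1)
        scores = self.score(torch.cat(
            [h_prev.unsqueeze(1).expand_as(enc), enc], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)                         # (batch, T*)
        # c_n = sum_t alpha_{n,t} e_t                                 -- Eq. (2)
        c_n = (alpha.unsqueeze(-1) * enc).sum(dim=1)                  # (batch, hidden_dim)
        # h_n = GRU([c_n ; WE(w_n)], h_{n-1})
        h_n = self.gru(torch.cat([c_n, self.word_emb(w_n)], dim=-1), h_prev)
        return self.out(h_n), h_n                                     # word logits, new state
```

At inference, `step` would be called repeatedly with the previously emitted word (or, under beam search, the current hypotheses) until the EOS token is produced.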
Fig. 2. The illustration of the encoder-decoder AAC system. The audio encoder outputs a sequence of feature embeddings $e_1, e_2, \dots, e_{T^*}$ from a Log Mel Spectrogram input with $T$ frames. The attentional GRU text decoder updates its hidden state $h_n$ based on the previous hidden state $h_{n-1}$, the current word $w_n$ and a context vector $c_n$, which is a weighted combination of the embedding sequence.
3. EXPERIMENTAL SETUP

3.1. Datasets
Regarding the pretraining data sources, we use AudioSet for AT and DCASE for ASC. For the target AAC task, we conduct our experiments on the Clotho dataset [7] and Audiocaps [12].
AudioSet [13] is a large-scale, manually-annotated sound event dataset. Each audio clip has a duration of up to ten seconds and contains at least one sound event label. AudioSet uses an ontology of 527 event classes, encompassing most everyday sounds. Two training sets are provided: a balanced one (60 hours, 20k+ clips) and an unbalanced one (5000 hours, 1.85 million clips).
DCASE
We incorporate the development sets from the ASC task (Task 1A) of the DCASE2019 and DCASE2020 challenges. DCASE2019 contains 26 hours of audio recorded in urban scene environments, and DCASE2020 comprises 39 hours, totaling 65 hours of data. To avoid a bias due to different data sizes, we randomly select 60 hours of data from these two datasets, matching the balanced AudioSet size.
Clotho [7] is a newly published AAC benchmark dataset used for the DCASE2020 Task 6 challenge. There are 2893 audio clips (18 hours) in the development set and 1043 clips (7 hours) in the evaluation set, ranging evenly from 15 to 30 seconds in duration. Each audio clip has five corresponding caption annotations.
Audiocaps [12] is by far the largest AAC dataset, consisting of 46k audio clips (≈127 hours) collected from the AudioSet dataset. One human-annotated caption is provided for each clip in the training set, while five captions are provided per clip in the validation and test sets.
3.2. Training Setup

Standard 64-dimensional LMS features are extracted every 20 ms with a Hann window size of 40 ms. For both pretraining and AAC fine-tuning, 90% of the development set is split off as the training subset, and the remaining 10% is used for cross-validation. The batch sizes are set to 64 and 32 for AT / ASC pretraining and AAC fine-tuning, respectively, each stage with its own initial learning rate. During pretraining, early stopping is utilized, and the model with the best performance on the validation set is chosen for encoder initialization. Training is uniformly done using Adam optimization [14]. The standard caption metrics (BLEU@1-4 [15], ROUGE [16], METEOR [17], CIDEr [18], and SPICE [19]) are used for evaluation. Beam search with a beam size of 3 is adopted during evaluation to enhance performance.
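For concreteness, a front-end matching the numbers above (64 Mel bands, 40 ms Hann window, 20 ms hop) could be computed as in the sketch below; the use of librosa and the sampling rate are our assumptions, not details specified in this work:

```python
# Sketch of the LMS front-end; 22.05 kHz and librosa are assumptions.
import librosa
import numpy as np

def extract_lms(wav_path, sr=22050, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.040 * sr),        # 40 ms Hann window
        win_length=int(0.040 * sr),
        hop_length=int(0.020 * sr),   # 20 ms frame shift
        window="hann",
        n_mels=n_mels)
    return np.log(mel + 1e-10).T      # (T, 64) log Mel spectrogram
```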
4. RESULTS AND DISCUSSION
Table 1 presents our results on the Clotho and Audiocaps datasets, with and without encoder pretraining, respectively elaborating local (AT) and global (ASC) information. The majority of the pretraining approaches, except for CRNN5 pretraining on the ASC task, enhance performance compared with training from scratch. This indicates that a captioning model embeds different levels of information and that encoder pretraining on relevant tasks can help the model attend to further details. We further compare our approach against the best published results: an encoder pretraining method using keyword prediction (KWP) [4] achieves the best single-model performance on the Clotho evaluation set, while a multi-scale encoder with pretrained AudioSet VGGish features [12] is the state-of-the-art (SOTA) on Audiocaps with only audio inputs. Our CNN10 encoder pretrained on AT (unbalanced) achieves the best results on both Clotho and Audiocaps, outperforming previous work.

Table 1. Performance on the Clotho and Audiocaps evaluation sets for different pretraining settings. KWP, ASC and AT denote keyword prediction, acoustic scene classification and audio tagging, respectively. KWP achieves the best single-model performance on the Clotho evaluation set, while VGG features is the best Audiocaps model. B_N represents the N-gram BLEU score.

Data       Encoder   Pretrain                  B_1    B_2    B_3    B_4    ROUGE_L  CIDEr  METEOR  SPICE
Clotho     CNN10     KWP [4]                   53.4   34.3   23.0   15.1   35.6     34.6   16.0    10.8
           CNN10     from scratch              47.5   27.5   17.6   11.1   31.9     21.3   13.6    8.1
                     ASC                       50.7   30.9   20.1   12.7   33.9     27.5   14.9    9.2
                     AT (balanced)             53.1   33.1   21.6   13.9   35.5     31.9   16.0    10.4
                     AT (unbalanced)
           CRNN5     from scratch              51.0   31.4   20.6   13.5   34.4     28.5   15.0    9.3
                     ASC                       49.5   30.1   19.3   12.0   33.4     25.7   14.5    9.1
                     AT (balanced)             52.3   32.6   21.4   13.8   34.9     29.9   15.7    10.2
                     AT (unbalanced)           52.8   33.1   21.7   13.8   35.2     30.1   15.6    10.1
Audiocaps  Multiscale VGG features [12]        61.4   44.6   31.7   21.9   45.0     59.3   20.3    14.4
           CNN10     from scratch              62.1   44.0   30.3   20.5   44.2     52.1   20.6    14.1
                     ASC                       62.7   44.2   30.5   20.7   43.9     54.1   20.5    14.5
                     AT (balanced)             63.8   46.1   32.3   22.0   45.1     57.8   21.5    15.3
                     AT (unbalanced)
           CRNN5     from scratch              61.9   44.5   31.1   21.0   44.6     54.5   20.8    14.6
                     ASC                       60.5   43.2   30.2   21.0   43.3     51.5   19.8    14.1
                     AT (balanced)             62.9   45.4   32.1   22.6   45.0     60.2   20.7    14.9
                     AT (unbalanced)           64.1   46.6   33.2   22.8   46.0     60.5   21.5    15.9
Local vs. Global
We deliberately choose the AT and ASC tasks to represent local and global audio topics in AAC, corresponding to the two pretraining tasks' characteristics: AT provides detailed sound event information, while ASC aims to characterize the environment. Results on both Clotho and Audiocaps indicate that local audio topics are comparatively more crucial to a captioning model than global information: AAC with AT pretraining always outperforms ASC pretraining. In particular, AT pretraining on the unbalanced set consistently yields the best performance, amounting to SOTA performance on both datasets, regardless of the evaluation metric. Even when the AT (balanced) and ASC datasets contain approximately the same amount of data (≈60 h), AAC performance improves significantly when pretraining on AT.
Abstract embedding vs. Temporal information
In addition to local vs. global information, we explore whether abstract embeddings (CNN) or temporal information (CRNN) have a more significant impact. When trained from scratch, CRNN5 outperforms CNN10 on all metrics; with pretraining, in contrast, CRNN5 brings little improvement to AAC performance. This indicates that AAC prefers CNN10, which can better recognize the presence of audio events while focusing less on the temporal relationships between different events. Performance improves steadily for all CNN10 models when training with more data (ASC, balanced, unbalanced). Transferring knowledge learned via large-dataset pretraining (AT unbalanced vs. balanced) improves downstream AAC performance significantly, as in other work [8, 20].
5. CONCLUSION
This work investigates concepts commonly described in audio captions, referred to as an audio topic model. Based on this, a transfer learning scheme is proposed to address local and global information. We compare two pretraining tasks (ASC and AT) and two audio encoder architectures (CNN10 and CRNN5) to investigate which abstract topic and architecture are crucial to AAC. The results show that transferring knowledge from either topic leads to a vast performance gain, resulting in SOTA performance on both Clotho and Audiocaps. It is observed that local information (AT) and abstract embeddings (CNN10) are more critical to AAC. For future work, we would like to explore methods like multi-task training to better address the different topics within a caption. Topic fusion could also shift from coarse to fine scale, e.g., separately modeling different traits of sound events, relationships, exact and abstract acoustic scenes, along with the high-level knowledge-infused abstraction.
6. ACKNOWLEDGEMENT
This work has been supported by the National Natural Science Foundation of China (No. 61901265), the Shanghai Pujiang Program (No. 19PJ1406300), and the Startup Fund for Youngman Research at SJTU (No. 19X100040009). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.

7. REFERENCES

[1] K. Drossos, S. Adavanne, and T. Virtanen, "Automated audio captioning with recurrent neural networks," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 374–378.
[2] M. Wu, H. Dinkel, and K. Yu, "Audio caption: Listen and tell," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 830–834.
[3] Y. Koizumi, R. Masumura, K. Nishida, M. Yasuda, and S. Saito, "A transformer-based audio captioning model with keyword estimation," in Proc. ISCA Interspeech, 2020, pp. 1977–1981.
[4] Y. Wu, K. Chen, Z. Wang, X. Zhang, F. Nian, S. Li, and X. Shao, "Audio captioning based on transformer and pre-trained CNN," in Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2020, pp. 21–25.
[5] X. Xu, H. Dinkel, M. Wu, and K. Yu, "A CRNN-GRU based reinforcement learning approach to audio captioning," in Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Tokyo, Japan, November 2020, pp. 225–229.
[6] E. Çakır, K. Drossos, and T. Virtanen, "Multi-task regularization based on infrequent classes for audio captioning," in Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), Tokyo, Japan, November 2020, pp. 6–10.
[7] K. Drossos, S. Lipping, and T. Virtanen, "Clotho: an audio captioning dataset," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
[8] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE Trans. Audio, Speech, Language Process., vol. 28, pp. 2880–2894, 2020.
[9] H. Dinkel and K. Yu, "Duration robust weakly supervised sound event detection," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 311–315.
[10] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in The International Conference on Learning Representations (ICLR), 2015.
[11] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 1412–1421.
[12] C. D. Kim, B. Kim, H. Lee, and G. Kim, "AudioCaps: Generating captions for audios in the wild," in Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019, pp. 119–132.
[13] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in The International Conference on Learning Representations (ICLR), 2015.
[15] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2002, pp. 311–318.
[16] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[17] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[18] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.
[19] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 382–398.
[20] S. Mun, S. Shon, W. Kim, D. K. Han, and H. Ko, "Deep neural network based learning and transferring mid-level audio features for acoustic scene classification," in