Onoma-to-wave: Environmental sound synthesis from onomatopoeic words
Yuki Okamoto (Ritsumeikan University, Japan), Keisuke Imoto (Doshisha University, Japan), Shinnosuke Takamichi (The University of Tokyo, Japan), Ryosuke Yamanishi (Kansai University, Japan), Takahiro Fukumori (Ritsumeikan University, Japan), Yoichi Yamashita (Ritsumeikan University, Japan)
Abstract—In this paper, we propose a new framework for environmental sound synthesis using onomatopoeic words and sound event labels. The conventional method of environmental sound synthesis, in which only sound event labels are used, cannot finely control the time-frequency structural features of synthesized sounds, such as sound duration, timbre, and pitch. There are various ways of expressing environmental sound other than sound event labels, such as the use of onomatopoeic words. An onomatopoeic word, which is a character sequence that phonetically imitates a sound, has been shown to be effective for describing the phonetic features of sounds. We believe that environmental sound synthesis using onomatopoeic words will enable us to control the fine time-frequency structural features of synthesized sounds. In this paper, we thus propose environmental sound synthesis from onomatopoeic words on the basis of a sequence-to-sequence framework. We also propose a method of environmental sound synthesis using onomatopoeic words and sound event labels to control both the fine time-frequency structure and the frequency property of synthesized sounds. Our subjective experiments show that the proposed method achieves the same level of sound quality as the conventional method using WaveNet. Moreover, our methods are better than the conventional method in terms of the expressiveness of synthesized sounds with respect to onomatopoeic words.
Index Terms—Environmental sound synthesis, sound event, onomatopoeic words, sequence-to-sequence model
I. INTRODUCTION
In recent years, some methods of environmental sound synthesis using deep learning approaches have been developed [1]–[3]. Environmental sound synthesis has great potential for many applications, such as supporting movie and game production [2], [4] and data augmentation for sound event detection and scene classification [5], [6]. As one of these methods, environmental sound synthesis using sound event labels as input to the system has been proposed [1]. This method enables the generation of environmental sounds expressing sound events. However, environmental sounds have many features, such as sound duration, pitch, and timbre, and the use of only sound event labels does not enable fine control of the time-frequency structural features of synthesized sounds.

As another way of expressing sound events, we can consider onomatopoeic words. An onomatopoeic word is a character sequence that phonetically imitates a sound. According to Lemaitre and Rocchesso [7] and Sundaram and Narayanan [8], onomatopoeic words are effective for expressing the features of audio samples. For example, when Japanese speakers express the sound of a whistle using onomatopoeic words, they can differentiate sounds with different durations and pitches by the length of the phoneme sequence, such as "py u" and "p i i i." Thus, we believe that using onomatopoeic words as the input of the system will enable us to control the fine time-frequency structural features of synthesized sounds, such as sound duration, timbre, and pitch.

To generate environmental sounds from onomatopoeic words, Kawai has proposed software called KanaWave [9]. KanaWave generates environmental sounds by concatenating sounds corresponding to the input onomatopoeic word. Therefore, the sounds generated by KanaWave do not have sufficient naturalness and diversity. To utilize environmental sounds in media content, such as in animation and movie production, an environmental sound synthesis method that can generate synthesized sounds with high naturalness and large diversity is required.

In this paper, we propose environmental sound synthesis from onomatopoeic words on the basis of the sequence-to-sequence conversion framework (seq2seq framework) [10]. The seq2seq framework is often used for sequence-to-sequence conversions, such as those in speech synthesis and neural machine translation, and has shown high performance in many studies [11], [12]. We also propose a method of environmental sound synthesis using onomatopoeic words together with the sound event labels used in the conventional method. We consider that the use of onomatopoeic words and sound event labels enables us to control both sound events and the time-frequency structure.

The remainder of this paper is structured as follows. In Sec. II, we describe the proposed methods of environmental sound synthesis from onomatopoeic words. In Sec. III, subjective experiments carried out to evaluate the performance of environmental sound synthesis from onomatopoeic words are reported. Finally, we summarize and conclude this paper in Sec. IV.

Fig. 1. Overview of environmental sound synthesis using onomatopoeia

II. PROPOSED METHOD
Fig. 1 shows the framework of environmental sound synthesis from onomatopoeic words. This approach consists of a model training block and a sound synthesis block. In the model training block, acoustic feature sequence o and phoneme sequence l are extracted from environmental sounds and onomatopoeic words, respectively. The acoustic model parameter λ is estimated using the extracted features o and l as follows:

$\hat{\lambda} = \arg\max_{\lambda} P(o \mid l, \lambda)$.  (1)

We propose two training methods: one with the help of sound event labels and the other without. We detail the model training methods in Secs. II-A and II-B. In the sound synthesis block, phoneme sequence l is converted from an input onomatopoeic word. Acoustic feature sequence o is estimated from the phoneme sequence of the onomatopoeic word, l, and the acoustic model $\hat{\lambda}$ as follows:

$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$.  (2)

Finally, we reconstruct an environmental sound waveform from the estimated acoustic feature sequence $\hat{o}$ using the Griffin–Lim algorithm [13].
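As a concrete illustration of the two signal-processing steps in Fig. 1, the sketch below extracts a log-amplitude spectrogram for model training and reconstructs a waveform with the Griffin–Lim algorithm for synthesis. This is our minimal sketch, not the authors' released code; the STFT parameters follow Table I, and the function names and the use of librosa are assumptions.

```python
# Minimal sketch (not the authors' code) of the feature extraction and
# Griffin-Lim reconstruction steps in Fig. 1. STFT parameters follow Table I;
# everything else (function names, librosa usage) is an assumption.
import numpy as np
import librosa

SR = 16000      # sampling rate [Hz]
N_FFT = 2048    # window length: 0.128 s at 16 kHz
HOP = 512       # window shift: 0.032 s at 16 kHz

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Environmental sound -> log-amplitude spectrogram o (model training block)."""
    magnitude = np.abs(librosa.stft(waveform, n_fft=N_FFT, hop_length=HOP))
    return np.log(magnitude + 1e-8)

def reconstruct_waveform(log_magnitude: np.ndarray) -> np.ndarray:
    """Estimated log-amplitude spectrogram o_hat -> waveform (sound synthesis block)."""
    magnitude = np.exp(log_magnitude)
    return librosa.griffinlim(magnitude, n_iter=60, n_fft=N_FFT, hop_length=HOP)
```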
A. Environmental Sound Synthesis Using Onomatopoeic Words
Fig. 2 shows an overview of model training using onomatopoeic words. To synthesize environmental sounds from onomatopoeic words, we employ the seq2seq framework, which comprises an encoder and a decoder. Our method uses a one-layered bidirectional long short-term memory (BiLSTM) network as the encoder and a two-layered long short-term memory (LSTM) network as the decoder. As shown in Fig. 2, a phoneme sequence of the onomatopoeic word, l = {l_1, ..., l_T}, is input to the encoder. The encoder extracts feature vectors ν = [ν^f, ν^b] from the input sequence l, where superscripts f and b indicate the forward and backward networks, respectively. In a unidirectional LSTM, the features at the beginning of a long sequence tend to be lost. Therefore, by using a BiLSTM as the encoder, we can expect to extract a feature vector ν that captures the entire onomatopoeic word from both the past and future directions. The decoder estimates the acoustic feature sequence o = {o_1, ..., o_{T'}} from the feature vectors ν extracted by the encoder as follows:

$p(o_1, \ldots, o_{T'} \mid l_1, \ldots, l_T) = \prod_{t=1}^{T'} p(o_t \mid \nu, o_1, \ldots, o_{t-1})$.  (3)

By using a two-layered LSTM as the decoder, we can expect to estimate acoustic features by considering the forward- and backward-direction features of the onomatopoeic word extracted by the encoder. The L1 norm between the estimated acoustic feature sequence o and the target at each time step is used as the loss function.
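To make the architecture concrete, the following PyTorch sketch shows one possible realization of the encoder–decoder described above (one-layer BiLSTM encoder, two-layer LSTM decoder, L1 loss, and the teacher forcing rate listed in Table I). It is an illustrative sketch under these assumptions, not the authors' implementation; the phoneme embedding, the bridge layer, and all names are hypothetical.

```python
# Illustrative PyTorch sketch of the seq2seq model in Sec. II-A (our assumption,
# not the authors' code): one-layer BiLSTM encoder, two-layer LSTM decoder.
import torch
import torch.nn as nn

class OnomaToWave(nn.Module):
    def __init__(self, n_phonemes: int, spec_dim: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        # Encoder: one-layer bidirectional LSTM over the phoneme sequence l.
        self.encoder = nn.LSTM(hidden, hidden, num_layers=1,
                               batch_first=True, bidirectional=True)
        # Maps nu = [nu_f, nu_b] to the decoder's initial hidden state.
        self.bridge = nn.Linear(2 * hidden, hidden)
        # Decoder: two-layer LSTM predicting the acoustic features autoregressively.
        self.decoder = nn.LSTM(spec_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, spec_dim)

    def forward(self, phonemes, targets, teacher_forcing: float = 0.6):
        _, (h, _) = self.encoder(self.embed(phonemes))
        nu = torch.cat([h[0], h[1]], dim=-1)                  # [nu_f, nu_b]
        h0 = torch.tanh(self.bridge(nu)).unsqueeze(0).repeat(2, 1, 1)
        state = (h0, torch.zeros_like(h0))
        frame = torch.zeros_like(targets[:, :1])              # initial "previous" frame
        outputs = []
        for t in range(targets.size(1)):
            out, state = self.decoder(frame, state)
            frame = self.proj(out)                            # predicted o_t
            outputs.append(frame)
            if torch.rand(1).item() < teacher_forcing:
                frame = targets[:, t:t + 1]                   # feed ground-truth o_t
        return torch.cat(outputs, dim=1)

# Training minimizes the L1 norm between predicted and target features:
# loss = nn.L1Loss()(model(phonemes, targets), targets)
```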
Fig. 2. Overview of model training using onomatopoeic words (BiLSTM encoder and LSTM decoder)

Fig. 3. Environmental sound synthesis from onomatopoeic words and sound event labels
B. Environmental Sound Synthesis Using Onomatopoeic Words and Sound Event Labels
The method of environmental sound synthesis using only onomatopoeic words is expected to enable the control of the time-frequency structural features of synthesized sounds, such as sound duration. However, for example, the onomatopoeic word "p a N" could be considered to fit multiple sound events, such as the sound of a gun being shot or a balloon bursting. Therefore, we cannot control the frequency property associated with the type of sound using only onomatopoeic words; we need additional features, such as the type of sound event, to control it. To control the sound events, we use sound event labels in addition to onomatopoeic words. We consider that using both onomatopoeic words and sound event labels will enable us to control both the time-frequency structure and the type of sound event of synthesized sounds.

Fig. 3 shows an overview of model training using onomatopoeic words and sound event labels. This method uses the seq2seq framework comprising a one-layered BiLSTM encoder and a two-layered LSTM decoder. Seq2seq-based intersequence conversion may involve conditioning the decoder to control its output features [14]. In this method, the sound event label c, represented as a one-hot vector, and the extracted feature vectors ν are concatenated and given as the initial state of the decoder. The decoder estimates the acoustic feature sequence o = {o_1, ..., o_{T'}} from the feature vectors ν extracted by the encoder and the sound event label c as follows:

$p(o_1, \ldots, o_{T'} \mid l_1, \ldots, l_T) = \prod_{t=1}^{T'} p(o_t \mid \nu, o_1, \ldots, o_{t-1}, c)$.  (4)

The L1 norm between the estimated acoustic feature sequence o and the target at each time step is used as the loss function.
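One possible way to realize this conditioning is sketched below: the one-hot event label c is concatenated with the encoder features ν and passed through a fully connected layer to form the decoder's initial state. The 10-dimensional label size follows Table I; the single linear bridge and all names are our assumptions, not the authors' exact implementation.

```python
# Sketch of event-label conditioning (Sec. II-B): concatenate the one-hot
# label c with the encoder features nu before initializing the decoder.
# Dimensions follow Table I; the exact layout is our assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_events, hidden = 10, 512
bridge = nn.Linear(2 * hidden + n_events, hidden)

def init_decoder_state(nu: torch.Tensor, event_id: torch.Tensor):
    """nu: (B, 2*hidden) encoder features; event_id: (B,) integer event labels."""
    c = F.one_hot(event_id, num_classes=n_events).float()   # one-hot label c
    h0 = torch.tanh(bridge(torch.cat([nu, c], dim=-1)))     # concat + fully connected
    h0 = h0.unsqueeze(0).repeat(2, 1, 1)                    # both decoder layers
    return h0, torch.zeros_like(h0)
```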
TABLE I
EXPERIMENTAL CONDITIONS

Sound length:              1–2 s
Sampling rate:             16,000 Hz
Waveform encoding:         16-bit linear PCM
Acoustic feature:          log-amplitude spectrogram
Window length for FFT:     0.128 s (2,048 samples)
Window shift for FFT:      0.032 s (512 samples)
Encoder LSTM layers:       1
Decoder LSTM layers:       2
LSTM cells:                512, 512, 512
Batch size:                5
Event label dimensions:    10
Teacher forcing rate:      0.6
Optimizer:                 RAdam [15]

III. EXPERIMENTS
To use sounds synthesized from onomatopoeic words as background sounds or sound effects in movies or games, it is important that the synthesized sounds be of high quality and that the onomatopoeic words well express the sound we want to synthesize. From this viewpoint, we conducted three types of subjective test. For the synthesized sounds, we conducted (I) an evaluation of the relevance of natural and synthesized sounds to given onomatopoeic words, (II) an evaluation of sound quality, and (III) a verification of the control of synthesized sound using a sound event label.
A. Experimental Conditions
For the evaluation, we used 10 types of sound event (bell ringing, alarm clock, manual coffee grinder, cup clinking, drum, maracas, electric shaver, tearing paper, trash box banging, and whistle) contained in the Real World Computing Partnership-Sound Scene Database (RWCP-SSD) [16]. We used a total of 1,000 samples (100 samples × 10 sound events), in which 95 samples of each sound event were used for model training and the others were used for the subjective test. For the onomatopoeic words corresponding to each sound sample, we used the RWCP-SSD-Onomatopoeia dataset [17], which consists of onomatopoeic words collected from Japanese speakers. We used 15 onomatopoeic words per audio sample for model training, for a total of 14,250 onomatopoeic words (15 onomatopoeic words × 950 training samples).
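For clarity, the sample counts described above work out as follows (a bookkeeping sketch only, with no model-specific assumptions):

```python
# Bookkeeping for the dataset split described above (illustrative only).
samples_per_event, n_events = 100, 10
train_per_event, onoma_per_sample = 95, 15

total_samples = samples_per_event * n_events              # 1,000 samples
train_samples = train_per_event * n_events                # 950 training samples
test_samples = total_samples - train_samples              # 50 samples for the test
train_onomatopoeia = train_samples * onoma_per_sample     # 14,250 onomatopoeic words
```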
TABLE II
NUMBER OF SYNTHESIZED SOUNDS USED FOR SUBJECTIVE TEST
Following the evaluation perspective described at the beginning of Sec. III, we conducted the following three experiments:

• Experiment I: evaluation of the relevance of natural and synthesized sounds to given onomatopoeic words
  To evaluate the relevance of natural and synthesized sounds to given onomatopoeic words, we asked two questions to listeners. We presented an original or synthesized sound and onomatopoeic words to listeners. The listeners graded each subjective evaluation metric as follows.
  – Acceptance level of synthesized sounds for onomatopoeic words
    The listener grades the acceptance level of synthesized and natural sounds for onomatopoeic words on a scale of 1 (highly unacceptable) to 5 (highly acceptable).
  – Expressiveness of synthesized sounds for onomatopoeic words
    The listener grades the expressive level of synthesized and natural sounds for onomatopoeic words on a scale of 1 (very unexpressive) to 5 (very expressive).

• Experiment II: evaluation of sound quality
  To evaluate the quality of synthesized sounds, we asked two questions to listeners. After listening to a natural or synthesized sound presented randomly, the listener graded each subjective evaluation metric as follows.
  – Overall impression of environmental sounds
    This evaluation metric is graded from 1 (very bad as an environmental sound) to 5 (very excellent as an environmental sound).
  – Naturalness of environmental sounds
    This evaluation metric is graded from 1 (very unnatural as an environmental sound) to 5 (very natural as an environmental sound).

• Experiment III: verification of the control of synthesized sound by sound event label
  After listening to a sound synthesized by our methods presented randomly, the listener selected the sound event label that best represented the sound.

We conducted each experiment using a crowdsourcing platform. Table II shows the numbers of audio samples and listeners in each experiment. To compare the synthesis methods, we also evaluated the sounds synthesized by the conventional method using WaveNet [1] and the sounds synthesized by KanaWave [9], the conventional method of generating environmental sounds from onomatopoeic words. The conventional environmental sound synthesis method using WaveNet utilizes sound event labels as input to the system to generate sounds. The conventional environmental sound synthesis from onomatopoeic words using KanaWave utilizes only onomatopoeic words as input to the system to generate sounds.

Fig. 4. Acceptance score of natural and synthesized sounds (KanaWave, seq2seq, seq2seq + event labels, and natural sounds; * p < 0.001)

Fig. 5. Expressiveness score of natural and synthesized sounds (KanaWave, seq2seq, seq2seq + event labels, and natural sounds; * p < 0.001)

Fig. 6. Spectrograms of environmental sounds synthesized using only onomatopoeic words (inputs: /p i/, /py u i/, /p i: q:/)
B. Experimental Results and Discussion
Experiment I: The average acceptance and expressiveness scores of synthesized and natural sounds with respect to onomatopoeic words and their standard deviations are shown in Figs. 4 and 5. From these results, we find that our proposed methods can generate environmental sounds that represent onomatopoeic words better than those generated by the conventional method using KanaWave. Fig. 6 shows spectrograms of sounds synthesized by our method using only onomatopoeic words. As shown in Fig. 6, the proposed method can control the duration of the synthesized sound in accordance with the input onomatopoeic word. Thus, onomatopoeic words are useful for controlling the time-frequency structure of the synthesized sounds.

Fig. 7 shows the spectrograms of sounds synthesized by KanaWave and by the proposed method using onomatopoeic words and sound event labels. In Fig. 7, each synthesized sound is generated from the phoneme sequence of the onomatopoeic word "b i i i i i i" input to the system. For the proposed method using onomatopoeic words and sound event labels, we used the sound event labels whistle, electric shaver, and tearing paper. As shown in Fig. 7, KanaWave can generate only one type of sound from the same onomatopoeic word; therefore, the sounds synthesized by KanaWave lack diversity. On the other hand, the proposed method using onomatopoeic words and sound event labels can generate various sounds from the same onomatopoeic word by changing the input sound event label.

Fig. 7. Spectrograms of environmental sounds synthesized by KanaWave and the proposed method using onomatopoeic words and sound event labels (panels: whistle, shaver, tearing paper)

Experiment II: The average MOS scores for the overall impression and naturalness of synthesized and natural sounds, and their standard deviations, are shown in Figs. 8 and 9. The results indicate that natural sounds are rated higher than sounds synthesized by the proposed methods. On the other hand, these results show that sounds synthesized by the proposed methods are better than those synthesized by KanaWave, in which onomatopoeic words are input to the system. The experimental results also show that sounds synthesized by our methods had a sound quality similar to that of sounds synthesized by WaveNet.

Fig. 8. MOS score for overall impression of natural and synthesized sounds (KanaWave, seq2seq, WaveNet, seq2seq + event labels, and natural sounds; * p < 0.001)

Fig. 9. MOS score for naturalness of natural and synthesized sounds (KanaWave, seq2seq, WaveNet, seq2seq + event labels, and natural sounds; * p < 0.001)
Thus, we have made environmental sound synthesis from onomatopoeic words possible without degrading the sound quality compared with the conventional methods.
Experiment III: Part of the distribution of sound event labels given to the sounds synthesized from each onomatopoeic word is shown in Figs. 10 and 11. The sound synthesized by our method using only onomatopoeic words tends to be given only one sound event label, whereas the sound synthesized by our method using onomatopoeic words and sound event labels tends to be given multiple sound event labels. For each method, the entropy of the distribution of the given sound event labels was calculated: 1.70 bit for the method using only onomatopoeic words and 1.82 bit for the method using onomatopoeic words and sound event labels. In this experiment, the maximum value of the entropy is log2(10) ≈ 3.32 bit because we choose from 10 types of sound event label for each synthesized sound. A comparison of the entropies of the two methods shows that the entropy when using onomatopoeic words and sound event labels is higher than that when using only onomatopoeic words. This result shows that using onomatopoeic words and sound event labels can represent multiple sound events for the same onomatopoeic word.

Fig. 10. Number of responses of sound event labels to each sound synthesized by our method using onomatopoeic words (input onomatopoeias: /b i b i b i b i b i/, /c h i: q/, /d u: N/, /sh a r i sh a r i/)
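The entropy values reported above are Shannon entropies (in bits) of the label-response distributions; a small sketch of how such a value could be computed is shown below. The counts used here are placeholders, not the actual listener responses.

```python
# Shannon entropy (bits) of a distribution of selected sound event labels.
import numpy as np

def label_entropy(counts) -> float:
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# With 10 equally likely labels the entropy reaches its maximum, log2(10) ≈ 3.32 bit.
print(label_entropy([1] * 10))
```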
Fig. 12 shows the spectrograms of natural and synthesized sounds. In Fig. 12, each synthesized sound is generated from the phoneme sequence of the onomatopoeic word "b i: i q" input to the system. For the proposed method using onomatopoeic words and sound event labels, we used the sound event labels whistle, electric shaver, and tearing paper. As shown in Fig. 12, using only onomatopoeic words as the input generates sounds with similar features even if the initial value is changed many times. On the other hand, using both onomatopoeic words and sound event labels, it is possible to generate sounds that capture each sound event's features depending on the input sound event label. These results show that using sound event labels can control the sound events of sounds synthesized from onomatopoeic words.

IV. CONCLUSION
In this paper, we proposed environmental sound synthesis from onomatopoeic words. Subjective tests show that the proposed method achieves the same level of quality as WaveNet. We showed that the proposed methods can generate sounds representing onomatopoeic words without degrading the synthesis quality compared with the conventional methods. By using sound event labels in addition to onomatopoeic words, we are also able to control not only the time-frequency structure of the synthesized sounds but also the type of sound event. In the future, we will generate environmental sounds from onomatopoeic words using more types of sound event.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number JP19K20304 and ROIS NII Open Collaborative Research 2020 Grant Number 20S0401.

Fig. 11. Number of responses of sound event labels to each sound synthesized by our method using onomatopoeic words and sound event labels (input onomatopoeias: /d u: N/, /sh a r i sh a r i/, /b i b i b i b i b i/, /c h i: q/)
Fig. 12. Spectrograms of natural and synthesized environmental sounds (natural sound, sound synthesized by seq2seq, and sound synthesized by seq2seq with event labels; panels: whistle, shaver, tearing paper)

REFERENCES

[1] Y. Okamoto, K. Imoto, T. Komatsu, S. Takamichi, T. Yagyu, R. Yamanishi, and Y. Yamashita, "Overview of tasks and investigation of subjective evaluation methods in environmental sound synthesis and conversion," arXiv preprint arXiv:1908.10055, 2019.
[2] Q. Kong, Y. Xu, T. Iqbal, Y. Cao, W. Wang, and M. D. Plumbley, "Acoustic scene generation with conditional SampleRNN," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 925–929, 2019.
[3] J.-Y. Liu, Y.-H. Chen, Y.-C. Yeh, and Y.-H. Yang, "Unconditional audio generation with generative adversarial networks and cycle regularization," arXiv preprint arXiv:2005.08526, 2020.
[4] K. Wang, H. Cheng, and S. Liu, "Efficient sound synthesis for natural scenes," Proc. IEEE Virtual Reality (VR), pp. 303–304, 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 344–348, 2017.
[6] F. Gontier, M. Lagrange, C. Lavandier, and J. F. Petiot, "Privacy aware acoustic scene synthesis using deep spectral feature inversion," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 886–890, 2020.
[7] G. Lemaitre and D. Rocchesso, "On the effectiveness of vocal imitations and verbal descriptions of sounds," The Journal of the Acoustical Society of America, vol. 135, no. 2, pp. 862–873, Feb. 2014.
[8] S. Sundaram and S. Narayanan, "Vector-based representation and clustering of audio using onomatopoeia words," Proc. American Association for Artificial Intelligence (AAAI) Symposium Series.
[9] Kawai, "KanaWave" (software).
[10] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," arXiv preprint arXiv:1409.3215, 2014.
[11] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[12] S. Ikawa and K. Kashino, "Generating sound words from audio signals of acoustic events with sequence-to-sequence model," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 346–350, 2018.
[13] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[14] S. Ikawa and K. Kashino, "Neural audio captioning based on conditional sequence-to-sequence model," Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 99–103, 2019.
[15] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, "On the variance of the adaptive learning rate and beyond," Proc. International Conference on Learning Representations (ICLR), pp. 1–13, 2020.
[16] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition," Proc. Language Resources and Evaluation Conference (LREC), pp. 965–968, 2000.
[17] Y. Okamoto, K. Imoto, S. Takamichi, R. Yamanishi, T. Fukumori, and Y. Yamashita, "RWCP-SSD-Onomatopoeia: Onomatopoeic words dataset for environmental sound synthesis."