Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion
Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu, Shinnosuke Takamichi, Takumi Yagyu, Ryosuke Yamanishi, Yoichi Yamashita
Ritsumeikan University, Japan; LINE Corporation, Japan; The University of Tokyo, Japan
ABSTRACT
Synthesizing and converting environmental sounds have the potential for many applications, such as supporting movie and game production and data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addressed environmental sound synthesis and conversion with statistical generative models; thus, this research area is not yet well organized. In this paper, we review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis using sound event labels, in which we focus on the current performance of statistical environmental sound synthesis and investigate how we should conduct subjective experiments on environmental sound synthesis.
Index Terms — Environmental sound synthesis, environmental sound conversion, sound event synthesis, sound scene synthesis, subjective evaluation, WaveNet
1. INTRODUCTION
Sound synthesis and conversion are techniques for generating a natural sound using a statistical model that associates input information with the generated sound. Sound synthesis and conversion methods aimed at generating speech or music have been widely developed [1, 2, 3]. Recently, some researchers have also developed methods for environmental sound synthesis and conversion, which can be applied to support movie and game production [4], the generation of content for virtual reality (VR) [5], and data augmentation for sound event detection and scene classification [6]. Many studies on environmental sound synthesis and conversion have taken a physical modeling or concatenative approach [7, 8, 6]. On the other hand, there have been fewer studies on environmental sound synthesis and conversion based on statistical generative models such as deep learning approaches. To the best of our knowledge, there is no literature giving an overview of the problem definitions and evaluation methods for environmental sound synthesis and conversion. Moreover, there have been no investigations of subjective evaluation methods for environmental sound synthesis and conversion.

In this paper, we therefore review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis based on WaveNet [9], which successfully synthesizes human voices, to discuss the current performance of statistical environmental sound synthesis. Moreover, we investigate subjective evaluation methods of environmental sound synthesis.
Figure 1: Problem definition of sound scene synthesis (input: an acoustic scene label, e.g., chatting, cooking, eating, reading a newspaper, vacuuming; output: synthesized sound)
Figure 2: Problem definition of environmental sound synthesis using sound event labels (inputs: a sound event label, e.g., cupboard, cutlery, dishes, drawer, car, or sound event labels with time stamps; outputs: a synthesized sound containing a single sound event, or one containing multiple, possibly overlapping, sound events)

The remainder of this paper is structured as follows. In Sec. 2, we review problem definitions of environmental sound synthesis and conversion, their applications, and evaluation methods. In Sec. 3, subjective experiments carried out to evaluate the performance of sound event synthesis using a WaveNet-based method are reported. Finally, we summarize and conclude this paper in Sec. 4.
2. PROBLEM DEFINITIONS OF ENVIRONMENTAL SOUND SYNTHESIS AND CONVERSION
In this section, we review applications, problem definitions, and evaluation methods of environmental sound synthesis and conversion, specifically environmental sound synthesis using event or scene labels (Sec. 2.1), environmental sound synthesis using onomatopoeic words (Sec. 2.2), environmental sound conversion (Sec. 2.3), and environmental sound synthesis/conversion using multimedia (Sec. 2.4).

Figure 3: Problem definition of environmental sound synthesis using onomatopoeic words (input: an onomatopoeic word; output: synthesized sound)
2.1. Environmental Sound Synthesis Using Sound Event or Scene Labels

When providing movies or games with background sounds or sound effects, we need to listen to many sounds in a large sound database and select the most suitable one for the scene or sound event, which is a time-consuming part of movie or game production. To address this issue, a statistical method for synthesizing an environmental sound that well represents a sound event or scene, utilizing the sound event or scene label as an input, has been proposed [10]. Figures 1 and 2 illustrate the processes of environmental sound synthesis using sound event or scene labels as the inputs of the systems; we call these research tasks sound event synthesis (SES) and sound scene synthesis (SSS), respectively.

Another issue is that the construction of an environmental sound dataset is very time-consuming compared with the construction of a speech or music dataset [11]. In recent studies, environmental sound analysis based on deep neural networks has required a large number of sounds to achieve a reasonable performance. To overcome this shortage of environmental sound datasets, SES and SSS can be applied for data augmentation in environmental sound analysis.

To generate environmental sounds by a statistical approach, Kong et al. [10] have proposed a method of environmental sound synthesis utilizing a conditional SampleRNN [12] with sound scene labels represented as one-hot vectors (a minimal sketch of such label conditioning is given at the end of this subsection).

A method of evaluating synthesized environmental sounds is an important subject in this research area. When we apply SES or SSS to data augmentation for sound event detection or acoustic scene classification, it is reasonable to evaluate the SES or SSS methods via the event detection or scene classification performance obtained with the augmented data. On the other hand, in the case of utilizing the sound synthesized by SES or SSS itself, it has not been investigated in detail how the synthesis method should be evaluated. In this paper, we focus on the subjective evaluation method for environmental sound synthesis in Sec. 3.

The subjective evaluation of sounds is, however, very time-consuming; thus, it is desirable to test methods for environmental sound synthesis and conversion with an objective evaluation of synthesized sounds. There are some methods of objective evaluation, such as the perceptual evaluation of speech quality (PESQ) [13], perceptual objective listening quality analysis (POLQA) [14], and perceptual evaluation of audio quality (PEAQ) [15], which are used for the evaluation of speech quality in telecommunications.
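As a minimal illustration of this kind of label conditioning, the sketch below adds a learned embedding of a class label to every time step of an autoregressive sample predictor. It is a generic toy model written under our own assumptions (the class count, layer sizes, and names such as LabelConditionedAR are illustrative), not the architecture of [10]:

# Minimal sketch: an autoregressive next-sample predictor conditioned on a
# sound event/scene label. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class LabelConditionedAR(nn.Module):
    def __init__(self, n_classes, n_quant=256, hidden=128):
        super().__init__()
        self.sample_emb = nn.Embedding(n_quant, hidden)   # quantized samples
        self.label_emb = nn.Embedding(n_classes, hidden)  # one global label
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)             # next-sample logits

    def forward(self, samples, label):
        # samples: (batch, time) ints in [0, n_quant); label: (batch,) ints
        x = self.sample_emb(samples)
        # broadcast the label embedding over time and add it at every step
        x = x + self.label_emb(label).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.out(h)  # (batch, time, n_quant) logits

model = LabelConditionedAR(n_classes=10)
logits = model(torch.randint(0, 256, (2, 100)), torch.tensor([3, 7]))

Adding (or concatenating) a label embedding at every time step corresponds to what the WaveNet literature calls global conditioning; the label steers generation toward the target event or scene.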
Figure 4: Problem definition of environmental sound conversion (input: voice or environmental sound; output: converted sound)

Table 1: Experimental conditions
  Sound length        1–2 s
  Sampling rate       16,000 Hz
  Waveform encoding   16-bit linear PCM (real sounds);
                      8-bit µ-law (synthesized sounds)
  Filter size         2
  Learning rate       0.001
  Batch size          5
  Receptive field     64 ms

2.2. Environmental Sound Synthesis Using Onomatopoeic Words

The SES and SSS discussed in Sec. 2.1 control synthesized environmental sounds only through the sound event or scene labels; thus, they cannot control any characteristics of the synthesized sounds other than the type of sound event or scene. For instance, when synthesizing the sound of a car horn, it cannot be determined in advance whether SES will synthesize a horn sound with a continuous high tone (e.g., peeeeeeeeee) or one with an intermittent low tone (e.g., beep beep beep). To control synthesized environmental sounds more finely, we can apply environmental sound synthesis using onomatopoeic words as the input of the system, as shown in Fig. 3. For SES using onomatopoeic words, Ikawa et al. [16] have proposed a method that converts onomatopoeic words to waveforms of environmental sounds using an encoder–decoder model.
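The following is a rough sketch of such an encoder–decoder: a character-level encoder summarizes the onomatopoeic word, and an autoregressive decoder emits quantized waveform samples. All names and sizes (Onoma2Wave, a 30-character vocabulary, etc.) are illustrative assumptions of ours, not details of [16]:

# Minimal sketch: onomatopoeic word (character sequence) -> quantized
# waveform samples via a GRU encoder-decoder. Names/sizes are assumptions.
import torch
import torch.nn as nn

class Onoma2Wave(nn.Module):
    def __init__(self, n_chars, n_quant=256, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.sample_emb = nn.Embedding(n_quant, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)

    def forward(self, chars, prev_samples):
        # chars: (batch, word_len); prev_samples: (batch, time)
        _, h = self.encoder(self.char_emb(chars))  # summary of the word
        y, _ = self.decoder(self.sample_emb(prev_samples), h)
        return self.out(y)  # logits of the next quantized sample per step

model = Onoma2Wave(n_chars=30)
logits = model(torch.randint(0, 30, (2, 8)), torch.randint(0, 256, (2, 100)))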
Figure 5: Confusion matrix of classification accuracy for original audio samples in terms of recall (predicted label vs. actual label over the 10 sound event classes)
2.3. Environmental Sound Conversion

Sound event synthesis using onomatopoeic words is a flexible way of synthesizing environmental sounds; however, it is still difficult to control the generated environmental sounds as intended. One way to address this problem is to synthesize environmental sounds not from sound event labels or onomatopoeic words but from an environmental sound or voice given as the input of the system, as shown in Fig. 4. We call this kind of task sound event conversion (SEC) or sound scene conversion (SSC). When we have some background sounds or sound effects but they are not suitable for the movie or game, environmental sound conversion can also be applied to obtain desirable sounds. For instance, when we have the horn sound of car X and a video including car Y, we can convert the horn sound of car X to that of car Y using SEC without re-recording the horn sound of car Y.

To convert environmental sounds to other audio signals, Grinstein et al. [17] and Mital [18] have applied a neural style transfer-based method [19], which enables the "style" and "content" of an audio signal to be independently manipulated and copied to another audio signal (a rough sketch of this idea is given at the end of this section).

2.4. Environmental Sound Synthesis and Conversion Using Multimedia

Some researchers have addressed environmental sound synthesis and conversion using multimedia information, such as images, as an input. For instance, Zhou et al. have proposed a method for synthesizing environmental sounds from images that is based on SampleRNN [5].
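To make the "style"/"content" separation in Sec. 2.3 concrete, here is a rough sketch of one optimization step of spectrogram-domain style transfer in the spirit of [19], using a random convolutional feature extractor; this is our own simplification, not the method of [17] or [18]:

# Rough sketch: one gradient step that pulls a spectrogram toward the
# "content" features of one sound and the Gram-matrix "style" statistics
# of another. Feature extractor and sizes are illustrative assumptions.
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # stand-in feature extractor

def gram(f):
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # channel co-occurrence stats

def style_transfer_step(x, content_spec, style_spec, lr=0.01):
    # x, content_spec, style_spec: (1, 1, freq, time) magnitude spectrograms
    x = x.detach().requires_grad_(True)
    loss = (torch.mean((conv(x) - conv(content_spec)) ** 2)              # content
            + torch.mean((gram(conv(x)) - gram(conv(style_spec))) ** 2))  # style
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad  # simple gradient descent on the spectrogram itself
    return x.detach()

spec = torch.rand(1, 1, 128, 64)
out = style_transfer_step(spec, torch.rand(1, 1, 128, 64), torch.rand(1, 1, 128, 64))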
3. INVESTIGATION OF SUBJECTIVE EVALUATION METHOD

3.1. Experimental Conditions
In this section, by evaluating SES using sound event labels based on the conditional WaveNet [9], we discuss the current performance of environmental sound synthesis and how we should conduct a subjective test to evaluate a method under development.
Figure 6: Confusion matrix of classification accuracy for synthesized audio samples in terms of recall (predicted label vs. actual label over the 10 sound event classes)
Figure 7: Spectrograms of real and synthesized environmental sounds (sound events: drum, shaver, tearing paper; axes: frequency (kHz) vs. time (s))

For the evaluation, we considered 10 different sound events (manual coffee grinder, cup clinking, alarm clock ringing, whistle, maracas, drum, electric shaver, trash box banging, tearing paper, bell ringing) contained in the RWCP-SSD (Real World Computing Partnership-Sound Scene Database) [20]. We used a total of 1,000 samples (100 samples × 10 sound events), in which 95 samples of each sound event were used for model training and the others were used for the subjective test. Table 1 shows the experimental conditions and parameters used for WaveNet. Samples of sounds synthesized by WaveNet are available at [21].
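As a side note on the 8-bit µ-law encoding listed in Table 1, the following is a minimal sketch of the standard µ-law companding and quantization commonly used with WaveNet-style models (cf. [9]); this is the textbook formula, shown for illustration, not code from the paper's implementation:

# Minimal sketch: 8-bit mu-law companding of a waveform in [-1, 1].
import numpy as np

MU = 255.0  # 256 quantization levels

def mulaw_encode(x):
    # float waveform in [-1, 1] -> integer codes in [0, 255]
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mulaw_decode(codes):
    # integer codes in [0, 255] -> float waveform in [-1, 1]
    y = 2 * (codes.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s at 16 kHz
recon = mulaw_decode(mulaw_encode(wave))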
Many works on speech and music synthesis have used subjective tests to evaluate the quality of synthesized sounds. For example, in speech synthesis, speech intelligibility and naturalness are often used as evaluation metrics. On the other hand, there have been no works in which methods of subjective tests for environmental sound synthesis and conversion were investigated in detail; thus, we here discuss how we should conduct subjective tests for environmental sound synthesis. For synthesized sounds, it is important that (I) they are distinguishable from other types of environmental sound, (II) they are not distinguishable from real sounds, and (III) they have naturalness as high as that of real environmental sounds. On the basis of these considerations, we conducted the following experiments:

• Experiment I: evaluation of intelligibility of synthesized sounds. After listening to a synthesized sound, the listener selected a sound event label that best represented the sound. As a comparison, the listener also similarly evaluated real environmental sounds.

• Experiment II: evaluation of distinguishability of real and synthesized sounds. We conducted a preference AB test. After listening to a pair of real and synthesized sounds in random order, the listener selected the one that sounded more real.
• Experiment III: evaluation of naturalness of synthesized sounds. We conducted a five-scale mean opinion score (MOS) test. After listening to a real or synthesized sound presented in random order, the listener scored the naturalness from 1 (very unnatural as an environmental sound) to 5 (very natural as an environmental sound).
Experiments were conducted with 24 listeners (13 males and 11 females) in a quiet environment at Ritsumeikan University. Table 2 lists the number of samples used in each experiment. In the experiments, a Roland QUAD-CAPTURE UA-55 audio interface and SONY MDR-CD900ST headphones were used.

3.2. Experimental Results

Experiment I: the classification results of real and synthesized sounds in terms of recall are shown in Figs. 5 and 6, respectively. The averaged F-scores for real and synthesized sounds were 86.22% and 76.30%, respectively. From these results, synthesized drum sounds are classified with a performance similar to that for real sounds, whereas the synthesized sounds of a cup clinking and an electric shaver tended to be misclassified more often than the real sounds. Figure 7 shows spectrograms of real and synthesized sounds; the synthesized sound of an electric shaver does not have the fine spectral structure that the real sound has.
Thus, the difference between the spectrograms of the sounds of an electric shaver and tearing paper is likely to be unclear, and this leads to the misclassification. From the results of experiment I, we consider that this subjective test is particularly helpful for evaluating whether a method can reproduce distinguishable sounds even when they have similar characteristics.
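For reference, the per-class recall confusion matrices of Figs. 5 and 6 and the averaged F-score can be derived from listener answers as sketched below; the responses shown are hypothetical, not our experimental data:

# Minimal sketch: recall confusion matrix and macro-averaged F-score from
# (presented event, answered event) pairs. Responses here are hypothetical.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

actual = ["drum", "cup", "shaver", "shaver", "cup", "drum"]    # presented
answered = ["drum", "cup", "shaver", "drum", "cup", "drum"]    # listener answers
labels = sorted(set(actual))

cm = confusion_matrix(actual, answered, labels=labels).astype(np.float64)
recall_cm = cm / cm.sum(axis=1, keepdims=True)  # row-normalize: recall per class
macro_f = 100.0 * f1_score(actual, answered, labels=labels, average="macro")
print(recall_cm)
print(f"averaged F-score: {macro_f:.2f}%")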
Figure 8: Recognition rate of real sounds (per sound event label and on average, in %)

Experiment II: listeners identified real sounds with an average accuracy of 82.71%, as shown in Fig. 8. From this result, sounds synthesized by WaveNet do not have sufficiently high quality to be indistinguishable from real sounds. This indicates that the evaluation of the distinguishability of real and synthesized sounds can be used to compare conventional methods with the more sophisticated methods of environmental sound synthesis that will be developed.
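Scoring Experiment II is straightforward: the accuracy is the fraction of AB trials in which the listener picked the real sound, with 50% corresponding to chance. A minimal sketch with hypothetical responses:

# Minimal sketch: accuracy of the preference AB test (hypothetical data).
import numpy as np

picked_real = np.array([1, 1, 0, 1, 1, 1, 0, 1])  # 1 = real sound chosen
accuracy = 100.0 * picked_real.mean()
print(f"identification accuracy: {accuracy:.2f}%")  # 50% would be chance level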
Figure 9: MOS scores for the naturalness of original and synthesized sounds, with 95% confidence intervals (per sound event label and averaged)

Experiment III: the average MOS scores for the naturalness of synthesized and real sounds and their 95% confidence intervals are shown in Fig. 9. The results indicate that the synthesized sounds of the coffee grinder, clock, and maracas had naturalness scores similar to those of real sounds. On the other hand, for the sounds of the electric shaver and the trash box banging, there are large differences in the MOS scores between the synthesized and real sounds. We consider that this is because SES using WaveNet cannot reproduce the fine structure of the spectrum (e.g., that of the electric shaver in Fig. 7). Moreover, Figs. 5, 6, and 9 show that the listeners classified both the real and synthesized whistle sounds with reasonable performance, whereas there are large differences in the MOS scores between the synthesized and real whistle sounds. This means that the evaluation of intelligibility alone is not sufficient for evaluating the quality of synthesized sounds.

Thus, we propose that methods of environmental sound synthesis should be evaluated not only by testing the intelligibility of synthesized sounds but also by testing the distinguishability of real and synthesized sounds and/or the naturalness of synthesized sounds.
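For reference, the MOS and the 95% confidence interval reported in Fig. 9 can be computed as in the sketch below (the ratings are hypothetical; a normal-approximation interval is used, while a t-quantile would be slightly more accurate for small samples):

# Minimal sketch: MOS and 95% confidence interval from 1-5 ratings.
import numpy as np

scores = np.array([4, 5, 3, 4, 4, 5, 2, 4, 3, 4], dtype=float)  # hypothetical
mos = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
ci95 = 1.96 * sem
print(f"MOS = {mos:.2f} +/- {ci95:.2f} (95% CI)")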
4. CONCLUSION
In this paper, we presented the problem definitions of sound event synthesis, sound scene synthesis, and sound event and scene conversion. We then discussed the current performance of sound event synthesis and subjective evaluation methods of environmental sound synthesis. The evaluation experiments indicate that sounds synthesized by WaveNet do not yet have sufficiently high quality to be indistinguishable from real sounds. Moreover, on the basis of our experimental results, we consider that methods of environmental sound synthesis should be evaluated by testing not only intelligibility but also distinguishability and/or naturalness.
5. REFERENCES

[1] H. Zen, K. Tokuda, and A. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

[2] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.

[3] J. P. Briot, G. Hadjeres, and F. Pachet, "Deep learning techniques for music generation – a survey," arXiv preprint arXiv:1709.01620, 2017.

[4] D. B. Lloyd, N. Raghuvanshi, and N. K. Govindaraju, "Sound synthesis for impact sounds in video games," Proc. Symposium on Interactive 3D Graphics and Games (ACM), pp. 55–61, 2011.

[5] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg, "Visual to sound: Generating natural sound for videos in the wild," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3550–3558, 2018.

[6] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 344–348, 2017.

[7] D. Schwarz, "State of the art in sound texture synthesis," Proc. Digital Audio Effects (DAFx), pp. 221–232, 2011.

[8] G. Bernardes, L. Aly, and M. E. Davies, "Seed: Resynthesizing environmental sounds from examples," Proc. Sound and Music Computing Conference, pp. 55–62, 2016.

[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[10] Q. Kong, Y. Xu, T. Iqbal, Y. Cao, W. Wang, and M. D. Plumbley, "Acoustic scene generation with conditional SampleRNN," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 925–929, 2019.

[11] K. Imoto, "Introduction to acoustic event and scene analysis," Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.

[12] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," Proc. International Conference on Learning Representations (ICLR), pp. 1–11, 2017.

[13] "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," ITU-T Recommendation P.862, 2001.

[14] "Perceptual objective listening quality assessment," ITU-T Recommendation P.863, 2011.

[15] "Method for objective measurements of perceived audio quality," ITU-R Recommendation BS.1387-1, 2001.

[16] S. Ikawa and K. Kashino, "Generating sound words from audio signals of acoustic events with sequence-to-sequence model," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 346–350, 2018.

[17] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 586–590, 2018.

[18] P. K. Mital, "Time domain neural audio style transfer," arXiv preprint arXiv:1711.11160, 2017.

[19] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423, 2016.

[20] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,"