Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion
Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu, Shinnosuke Takamichi, Takumi Yagyu, Ryosuke Yamanishi, Yoichi Yamashita
Ritsumeikan University, Japan; LINE Corporation, Japan; The University of Tokyo, Japan
ABSTRACT
Synthesizing and converting environmental sounds have the potential for many applications, such as supporting movie and game production and data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addressed environmental sound synthesis and conversion with statistical generative models; thus, this research area is not yet well organized. In this paper, we review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis using sound event labels, in which we focus on the current performance of statistical environmental sound synthesis and investigate how we should conduct subjective experiments on environmental sound synthesis.
Index Terms — Environmental sound synthesis, environmental sound conversion, sound event synthesis, sound scene synthesis, subjective evaluation, WaveNet
1. INTRODUCTION
Sound synthesis and conversion are techniques for generating a natural sound using a statistical model that associates input information with the generated sound. Sound synthesis and conversion methods aimed at generating speech or music have been widely developed [1, 2, 3]. Recently, some researchers have also developed methods for environmental sound synthesis and conversion, which can be applied to support movie and game production [4], the generation of content for virtual reality (VR) [5], and data augmentation for sound event detection and scene classification [6]. Many studies on environmental sound synthesis and conversion have taken a physical modeling or concatenative approach [7, 8, 6]. On the other hand, there have been fewer studies on environmental sound synthesis and conversion based on statistical generative models such as deep learning approaches. To the best of our knowledge, there is no literature giving an overview of the problem definitions and evaluation methods for environmental sound synthesis and conversion. Moreover, there have been no investigations of subjective evaluation methods for environmental sound synthesis and conversion.

In this paper, we therefore review problem definitions, applications, and evaluation methods of environmental sound synthesis and conversion. We then report on environmental sound synthesis based on WaveNet [9], which successfully synthesizes human voices, to discuss the current performance of statistical environmental sound synthesis. Moreover, we investigate subjective evaluation methods of environmental sound synthesis.
Figure 1: Problem definition of sound scene synthesis (input: an acoustic scene label, e.g., chatting, cooking, eating, reading a newspaper, vacuuming; output: synthesized sound)
Figure 2: Problem definition of environmental sound synthesis using sound event labels (inputs: a sound event label, e.g., cupboard, cutlery, dishes, drawer, car, or sound event labels with time stamps; outputs: a synthesized sound containing a single sound event, or one containing multiple, possibly overlapping, sound events)

The remainder of this paper is structured as follows. In Sec. 2, we review problem definitions of environmental sound synthesis and conversion, their applications, and evaluation methods. In Sec. 3, subjective experiments carried out to evaluate the performance of sound event synthesis using a WaveNet-based method are reported. Finally, we summarize and conclude this paper in Sec. 4.
2. PROBLEM DEFINITIONS OF ENVIRONMENTAL SOUND SYNTHESIS AND CONVERSION
In this section, we review applications, problem definitions, and evaluation methods of environmental sound synthesis and conversion, specifically environmental sound synthesis using event or scene labels (Sec. 2.1), environmental sound synthesis using onomatopoeic words (Sec. 2.2), environmental sound conversion (Sec. 2.3), and environmental sound synthesis/conversion using multimedia (Sec. 2.4).

Figure 3: Problem definition of environmental sound synthesis using onomatopoeic words (input: an onomatopoeic word; output: synthesized sound)
2.1. Environmental Sound Synthesis Using Sound Event or Scene Labels

When providing movies or games with background sounds or sound effects, we need to listen to many sounds in a large sound database and select the most suitable one for the scene or sound event, which is a time-consuming part of movie or game production. To address this issue, a statistical method for synthesizing an environmental sound that well represents a sound event or scene, utilizing the sound event or scene label as an input, has been proposed [10]. Figures 1 and 2 illustrate the processes of environmental sound synthesis using sound event or scene labels as the inputs of the systems; we call these research tasks sound event synthesis (SES) and sound scene synthesis (SSS), respectively.

Another issue is that the construction of an environmental sound dataset is very time-consuming compared with the construction of a speech or music dataset [11]. In recent studies, environmental sound analysis based on deep neural networks has required a large number of sounds to achieve a reasonable performance. To overcome this shortage of environmental sound datasets, SES and SSS can be applied for data augmentation in environmental sound analysis.

To generate environmental sounds by a statistical approach, Kong et al. [10] have proposed a method of environmental sound synthesis utilizing a conditional SampleRNN [12] with sound scene labels represented as one-hot vectors (a minimal sketch of such label conditioning is given at the end of this subsection).

A method of evaluating synthesized environmental sounds is an important subject in this research area. When we apply SES or SSS to data augmentation for sound event detection or acoustic scene classification, it is reasonable to evaluate the SES or SSS methods via the event detection or scene classification performance obtained with the augmented data. On the other hand, in the case of utilizing the sound synthesized by SES or SSS itself, it has not been investigated in detail how the synthesis method should be evaluated. In this paper, we focus on the subjective evaluation method for environmental sound synthesis in Sec. 3.

The subjective evaluation of sounds is, however, very time-consuming; thus, it is desirable to test methods for environmental sound synthesis and conversion with an objective evaluation of synthesized sounds. There are some methods of objective evaluation, such as the perceptual evaluation of speech quality (PESQ) [13], perceptual objective listening quality analysis (POLQA) [14], and perceptual evaluation of audio quality (PEAQ) [15], which are used for the evaluation of speech quality in telecommunications.
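As a minimal illustration of this kind of label conditioning, the sketch below adds a learned embedding of a class label to every time step of an autoregressive sample predictor. It is a generic toy model written under our own assumptions (the class count, layer sizes, and names such as LabelConditionedAR are illustrative), not the architecture of [10]:

# Minimal sketch: an autoregressive next-sample predictor conditioned on a
# sound event/scene label. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class LabelConditionedAR(nn.Module):
    def __init__(self, n_classes, n_quant=256, hidden=128):
        super().__init__()
        self.sample_emb = nn.Embedding(n_quant, hidden)   # quantized samples
        self.label_emb = nn.Embedding(n_classes, hidden)  # one global label
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)             # next-sample logits

    def forward(self, samples, label):
        # samples: (batch, time) ints in [0, n_quant); label: (batch,) ints
        x = self.sample_emb(samples)
        # broadcast the label embedding over time and add it at every step
        x = x + self.label_emb(label).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.out(h)  # (batch, time, n_quant) logits

model = LabelConditionedAR(n_classes=10)
logits = model(torch.randint(0, 256, (2, 100)), torch.tensor([3, 7]))

Adding (or concatenating) a label embedding at every time step corresponds to what the WaveNet literature calls global conditioning; the label steers generation toward the target event or scene.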
Figure 4: Problem definition of environmental sound conversion (input: voice or environmental sound; output: converted sound)

Table 1: Experimental conditions
  Sound length        1–2 s
  Sampling rate       16,000 Hz
  Waveform encoding   16-bit linear PCM (real sounds);
                      8-bit µ-law (synthesized sounds)
  Filter size         2
  Learning rate       0.001
  Batch size          5
  Receptive field     64 ms

2.2. Environmental Sound Synthesis Using Onomatopoeic Words

The SES and SSS discussed in Sec. 2.1 control synthesized environmental sounds only through the sound event or scene labels; thus, they cannot control any characteristics of the synthesized sounds other than the type of sound event or scene. For instance, when synthesizing the sound of a car horn, it cannot be determined in advance whether SES will synthesize a horn sound with a continuous high tone (e.g., peeeeeeeeee) or one with an intermittent low tone (e.g., beep beep beep). To control synthesized environmental sounds more finely, we can apply environmental sound synthesis using onomatopoeic words as the input of the system, as shown in Fig. 3. For SES using onomatopoeic words, Ikawa et al. [16] have proposed a method that converts onomatopoeic words to waveforms of environmental sounds using an encoder–decoder model.
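The following is a rough sketch of such an encoder–decoder: a character-level encoder summarizes the onomatopoeic word, and an autoregressive decoder emits quantized waveform samples. All names and sizes (Onoma2Wave, a 30-character vocabulary, etc.) are illustrative assumptions of ours, not details of [16]:

# Minimal sketch: onomatopoeic word (character sequence) -> quantized
# waveform samples via a GRU encoder-decoder. Names/sizes are assumptions.
import torch
import torch.nn as nn

class Onoma2Wave(nn.Module):
    def __init__(self, n_chars, n_quant=256, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.sample_emb = nn.Embedding(n_quant, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)

    def forward(self, chars, prev_samples):
        # chars: (batch, word_len); prev_samples: (batch, time)
        _, h = self.encoder(self.char_emb(chars))  # summary of the word
        y, _ = self.decoder(self.sample_emb(prev_samples), h)
        return self.out(y)  # logits of the next quantized sample per step

model = Onoma2Wave(n_chars=30)
logits = model(torch.randint(0, 30, (2, 8)), torch.randint(0, 256, (2, 100)))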
Figure 5: Confusion matrix of classification accuracy for original audio samples in terms of recall (predicted label vs. actual label over the 10 sound event classes)
2.3. Environmental Sound Conversion

Sound event synthesis using onomatopoeic words is a flexible way of synthesizing environmental sounds; however, it is still difficult to control the generated environmental sounds as intended. One way to address this problem is to synthesize environmental sounds not from sound event labels or onomatopoeic words but from an environmental sound or voice given as the input of the system, as shown in Fig. 4. We call this kind of task sound event conversion (SEC) or sound scene conversion (SSC). When we have some background sounds or sound effects but they are not suitable for the movie or game, environmental sound conversion can also be applied to obtain desirable sounds. For instance, when we have the horn sound of car X and a video including car Y, we can convert the horn sound of car X to that of car Y using SEC without re-recording the horn sound of car Y.

To convert environmental sounds to other audio signals, Grinstein et al. [17] and Mital [18] have applied a neural style transfer-based method [19], which enables the "style" and "content" of an audio signal to be independently manipulated and copied to another audio signal (a rough sketch of this idea is given at the end of this section).

2.4. Environmental Sound Synthesis and Conversion Using Multimedia

Some researchers have addressed environmental sound synthesis and conversion using multimedia information, such as images, as an input. For instance, Zhou et al. have proposed a method for synthesizing environmental sounds from images that is based on SampleRNN [5].
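To make the "style"/"content" separation in Sec. 2.3 concrete, here is a rough sketch of one optimization step of spectrogram-domain style transfer in the spirit of [19], using a random convolutional feature extractor; this is our own simplification, not the method of [17] or [18]:

# Rough sketch: one gradient step that pulls a spectrogram toward the
# "content" features of one sound and the Gram-matrix "style" statistics
# of another. Feature extractor and sizes are illustrative assumptions.
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # stand-in feature extractor

def gram(f):
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # channel co-occurrence stats

def style_transfer_step(x, content_spec, style_spec, lr=0.01):
    # x, content_spec, style_spec: (1, 1, freq, time) magnitude spectrograms
    x = x.detach().requires_grad_(True)
    loss = (torch.mean((conv(x) - conv(content_spec)) ** 2)              # content
            + torch.mean((gram(conv(x)) - gram(conv(style_spec))) ** 2))  # style
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad  # simple gradient descent on the spectrogram itself
    return x.detach()

spec = torch.rand(1, 1, 128, 64)
out = style_transfer_step(spec, torch.rand(1, 1, 128, 64), torch.rand(1, 1, 128, 64))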
3. INVESTIGATION OF SUBJECTIVE EVALUATION METHOD

3.1. Experimental Conditions
In this section, by evaluating SES using sound event labels based on the conditional WaveNet [9], we discuss the current performance of environmental sound synthesis and how we should conduct a subjective test to evaluate a method under development.
Figure 6: Confusion matrix of classification accuracy for synthesized audio samples in terms of recall (predicted label vs. actual label over the 10 sound event classes)
Figure 7: Spectrograms of real and synthesized environmental sounds (sound events: drum, shaver, tearing paper; axes: frequency (kHz) vs. time (s))

For the evaluation, we considered 10 different sound events (manual coffee grinder, cup clinking, alarm clock ringing, whistle, maracas, drum, electric shaver, trash box banging, tearing paper, bell ringing) contained in the RWCP-SSD (Real World Computing Partnership-Sound Scene Database) [20]. We used a total of 1,000 samples (100 samples × 10 sound events), in which 95 samples of each sound event were used for model training and the others were used for the subjective test. Table 1 shows the experimental conditions and parameters used for WaveNet. Samples of sounds synthesized by WaveNet are available at [21].
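As a side note on the 8-bit µ-law encoding listed in Table 1, the following is a minimal sketch of the standard µ-law companding and quantization commonly used with WaveNet-style models (cf. [9]); this is the textbook formula, shown for illustration, not code from the paper's implementation:

# Minimal sketch: 8-bit mu-law companding of a waveform in [-1, 1].
import numpy as np

MU = 255.0  # 256 quantization levels

def mulaw_encode(x):
    # float waveform in [-1, 1] -> integer codes in [0, 255]
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mulaw_decode(codes):
    # integer codes in [0, 255] -> float waveform in [-1, 1]
    y = 2 * (codes.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s at 16 kHz
recon = mulaw_decode(mulaw_encode(wave))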
Many works on speech and music synthesis have used subjective tests to evaluate the quality of synthesized sounds. For example, in speech synthesis, speech intelligibility and naturalness are often used as evaluation metrics. On the other hand, there have been no works in which methods of subjective tests for environmental sound synthesis and conversion were investigated in detail; thus, we here discuss how we should conduct subjective tests for environmental sound synthesis. For synthesized sounds, it is important that (I) they are distinguishable from other types of environmental sound, (II) they are not distinguishable from real sounds, and (III) they have naturalness as high as that of real environmental sounds. On the basis of these considerations, we conducted the following experiments:

• Experiment I: evaluation of intelligibility of synthesized sounds. After listening to a synthesized sound, the listener selected a sound event label that best represented the sound. As a comparison, the listener also similarly evaluated real environmental sounds.

• Experiment II: evaluation of distinguishability of real and synthesized sounds. We conducted a preference AB test. After listening to a pair of real and synthesized sounds in random order, the listener selected the one that sounded more real.
• Experiment III: evaluation of naturalness of synthesized sounds. We conducted a five-scale mean opinion score (MOS) test. After listening to a real or synthesized sound presented in random order, the listener scored the naturalness from 1 (very unnatural as an environmental sound) to 5 (very natural as an environmental sound).
Experiments were conducted with 24 listeners (13 males and 11 females) in a quiet environment at Ritsumeikan University. Table 2 lists the number of samples used in each experiment. In the experiments, a Roland QUAD-CAPTURE UA-55 audio interface and SONY MDR-CD900ST headphones were used.

3.2. Experimental Results

Experiment I: the classification results of real and synthesized sounds in terms of recall are shown in Figs. 5 and 6, respectively. The averaged F-scores for real and synthesized sounds were 86.22% and 76.30%, respectively. From these results, synthesized drum sounds are classified with a performance similar to that for real sounds, whereas the synthesized sounds of a cup clinking and an electric shaver tended to be misclassified more often than the real sounds. Figure 7 shows spectrograms of real and synthesized sounds; the synthesized sound of an electric shaver does not have the fine spectral structure that the real sound has.
Thus, the difference between the spectrograms of the sounds of an electric shaver and tearing paper is likely to be unclear, and this leads to the misclassification. From the results of experiment I, we consider that this subjective test is particularly helpful for evaluating whether a method can reproduce distinguishable sounds even when they have similar characteristics.
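For reference, the per-class recall confusion matrices of Figs. 5 and 6 and the averaged F-score can be derived from listener answers as sketched below; the responses shown are hypothetical, not our experimental data:

# Minimal sketch: recall confusion matrix and macro-averaged F-score from
# (presented event, answered event) pairs. Responses here are hypothetical.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

actual = ["drum", "cup", "shaver", "shaver", "cup", "drum"]    # presented
answered = ["drum", "cup", "shaver", "drum", "cup", "drum"]    # listener answers
labels = sorted(set(actual))

cm = confusion_matrix(actual, answered, labels=labels).astype(np.float64)
recall_cm = cm / cm.sum(axis=1, keepdims=True)  # row-normalize: recall per class
macro_f = 100.0 * f1_score(actual, answered, labels=labels, average="macro")
print(recall_cm)
print(f"averaged F-score: {macro_f:.2f}%")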
Figure 8: Recognition rate of real sounds (per sound event label and on average, in %)

Experiment II: listeners identified real sounds with an average accuracy of 82.71%, as shown in Fig. 8. From this result, sounds synthesized by WaveNet do not have sufficiently high quality to be indistinguishable from real sounds. This indicates that the evaluation of the distinguishability of real and synthesized sounds can be used to compare conventional methods with the more sophisticated methods of environmental sound synthesis that will be developed.
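Scoring Experiment II is straightforward: the accuracy is the fraction of AB trials in which the listener picked the real sound, with 50% corresponding to chance. A minimal sketch with hypothetical responses:

# Minimal sketch: accuracy of the preference AB test (hypothetical data).
import numpy as np

picked_real = np.array([1, 1, 0, 1, 1, 1, 0, 1])  # 1 = real sound chosen
accuracy = 100.0 * picked_real.mean()
print(f"identification accuracy: {accuracy:.2f}%")  # 50% would be chance level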
Figure 9: MOS scores for the naturalness of original and synthesized sounds, with 95% confidence intervals (per sound event label and averaged)

Experiment III: the average MOS scores for the naturalness of synthesized and real sounds and their 95% confidence intervals are shown in Fig. 9. The results indicate that the synthesized sounds of the coffee grinder, clock, and maracas had naturalness scores similar to those of real sounds. On the other hand, for the sounds of the electric shaver and the trash box banging, there are large differences in the MOS scores between the synthesized and real sounds. We consider that this is because SES using WaveNet cannot reproduce the fine structure of the spectrum (e.g., that of the electric shaver in Fig. 7). Moreover, Figs. 5, 6, and 9 show that the listeners classified both the real and synthesized whistle sounds with reasonable performance, whereas there are large differences in the MOS scores between the synthesized and real whistle sounds. This means that the evaluation of intelligibility alone is not sufficient for evaluating the quality of synthesized sounds.

Thus, we propose that methods of environmental sound synthesis should be evaluated not only by testing the intelligibility of synthesized sounds but also by testing the distinguishability of real and synthesized sounds and/or the naturalness of synthesized sounds.
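For reference, the MOS and the 95% confidence interval reported in Fig. 9 can be computed as in the sketch below (the ratings are hypothetical; a normal-approximation interval is used, while a t-quantile would be slightly more accurate for small samples):

# Minimal sketch: MOS and 95% confidence interval from 1-5 ratings.
import numpy as np

scores = np.array([4, 5, 3, 4, 4, 5, 2, 4, 3, 4], dtype=float)  # hypothetical
mos = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
ci95 = 1.96 * sem
print(f"MOS = {mos:.2f} +/- {ci95:.2f} (95% CI)")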
4. CONCLUSION
In this paper, we presented the problem definitions of sound event synthesis, sound scene synthesis, and sound event and scene conversion. We then discussed the current performance of sound event synthesis and subjective evaluation methods of environmental sound synthesis. The evaluation experiments indicate that sounds synthesized by WaveNet do not yet have sufficiently high quality to be indistinguishable from real sounds. Moreover, on the basis of our experimental results, we consider that methods of environmental sound synthesis should be evaluated by testing not only intelligibility but also distinguishability and/or naturalness.
5. REFERENCES

[1] H. Zen, K. Tokuda, and A. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

[2] S. H. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.

[3] J. P. Briot, G. Hadjeres, and F. Pachet, "Deep learning techniques for music generation – a survey," arXiv preprint arXiv:1709.01620, 2017.

[4] D. B. Lloyd, N. Raghuvanshi, and N. K. Govindaraju, "Sound synthesis for impact sounds in video games," Proc. Symposium on Interactive 3D Graphics and Games (ACM), pp. 55–61, 2011.

[5] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg, "Visual to sound: Generating natural sound for videos in the wild," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3550–3558, 2018.

[6] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 344–348, 2017.

[7] D. Schwarz, "State of the art in sound texture synthesis," Proc. Digital Audio Effects (DAFx), pp. 221–232, 2011.

[8] G. Bernardes, L. Aly, and M. E. Davies, "Seed: Resynthesizing environmental sounds from examples," Proc. Sound and Music Computing Conference, pp. 55–62, 2016.

[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[10] Q. Kong, Y. Xu, T. Iqbal, Y. Cao, W. Wang, and M. D. Plumbley, "Acoustic scene generation with conditional SampleRNN," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 925–929, 2019.

[11] K. Imoto, "Introduction to acoustic event and scene analysis," Acoustical Science and Technology, vol. 39, no. 3, pp. 182–188, 2018.

[12] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," Proc. International Conference on Learning Representations (ICLR), pp. 1–11, 2017.

[13] "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," ITU-T Recommendation P.862, 2001.

[14] "Perceptual objective listening quality assessment," ITU-T Recommendation P.863, 2011.

[15] "Method for objective measurements of perceived audio quality," ITU-R Recommendation BS.1387-1, 2001.

[16] S. Ikawa and K. Kashino, "Generating sound words from audio signals of acoustic events with sequence-to-sequence model," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 346–350, 2018.

[17] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 586–590, 2018.

[18] P. K. Mital, "Time domain neural audio style transfer," arXiv preprint arXiv:1711.11160, 2017.

[19] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423, 2016.

[20] S. Nakamura, K. Hiyane, F. Asano, and T. Endo, "Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,"